A Mathematical/Psychometric Framework for Comparing the Perceptual Response to Different Analysis-synthesis Techniques: Ground Rules For A Synthesis Bake-off¹

Gregory H. Wakefield
Department of Electrical Engineering and Computer Science, University of Michigan

Abstract

"Does it sound like the original?" is the question often asked of a synthesis algorithm. Answers vary, despite the best of psychophysical method, because each method approaches the question in a different way. We review two general approaches for perceptual assessment and then introduce an approach that is new to sound quality quantification. Each of these techniques is discussed with respect to the special issues of sound synthesis.

1. INTRODUCTION

Consider the following gedanken experiment. A music synthesis "demon" plays two sounds and you must determine whether these are the same sound repeated twice or different sounds. The sounds from which the demon draws are either acoustic recordings of a violinist playing the same note or synthesized versions of such recordings. You have never heard the violin or its synthesis before. Nevertheless, when presented with repeated samples of either the acoustic or synthesized signals, you always correctly identify the samples as repeated. Furthermore, when presented with the acoustic recording and its synthesized version, you also correctly identify the pair of observations as different 100% of the time. Surprised by this result, you look at the original wavefiles to discover that the demon is either of the "malicious" or "dyslexic" kind (after all, you never know about these demons, either in physics or in psychophysics). On those trials in which acoustic recordings were compared with their synthesized versions, the demon had permuted the labels of the synthesized versions so that none of them were properly compared with their acoustic counterparts. To make matters worse, the demon flatly denies his mistake and refuses to fix the labelling.
Playing the devil's advocate, he suggests that you participate in one more discrimination test of his choosing using the set of stimuli currently on hand.

What discrimination test does the demon run? The only accurate information is whether the signal belongs to the set of original acoustic recordings, O, or to the set of synthesized versions, or reproductions, of such recordings, R. Accordingly, the demon changes the discrimination task so that on any given trial you hear two samples drawn from the same set (either both from O or both from R) or from different sets (one from O and the other from R). Once the results are analyzed, the demon reveals that you responded "different," given that the samples were drawn from the same set, 70% of the time, but that you also responded "different," given that the samples were drawn from different sets, 70% of the time. You conclude that you are no better at discriminating among samples drawn from different sets than from the same set. But what can you say about the quality of the synthesis algorithm used in the experiment?

The present paper addresses the question of how to measure the perceptual quality of synthesized versions of musical instruments. Two standard approaches to perceptual assessment will be discussed along with the "sample-discrimination task," a psychophysical procedure that was introduced within the past decade to investigate other aspects of auditory perception. We will discuss each of these with respect to the goals of instrument synthesis.

¹ The author would like to thank Maureen Mellody for comments on earlier drafts of this manuscript. This work was supported, in part, by funds provided by the MusEn Project at the University of Michigan.

2. SUBJECTIVE MEASURES OF QUALITY

2.1. Psychophysical measures of discrimination

The most conservative measure of quality is whether or not the reproduction is discriminable from the original.
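To make the demon's 70%/70% result from the Introduction concrete, the following minimal sketch converts the two response rates into a sensitivity index. It assumes the simple equal-variance Gaussian yes/no index, d' = z(hit rate) - z(false-alarm rate), purely for illustration; the paper does not specify a decision model for the same/different task, and the function name `d_prime` is hypothetical.

```python
from statistics import NormalDist


def d_prime(p_diff_given_different: float, p_diff_given_same: float) -> float:
    """Equal-variance Gaussian sensitivity index (an assumed model).

    p_diff_given_different: rate of "different" responses on different-set trials.
    p_diff_given_same:      rate of "different" responses on same-set trials.
    Both rates must lie strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(p_diff_given_different) - z(p_diff_given_same)


# The demon's result: "different" 70% of the time in both conditions.
demon_result = d_prime(0.70, 0.70)

# A hypothetical listener who does separate the two sets.
better_listener = d_prime(0.90, 0.70)
```

When the two rates are equal the index is zero whatever their common value, so the 70% figure reflects only the listener's bias toward answering "different."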
Two stimuli are judged indiscriminable with respect to the threshold $\eta$ if, for two stimuli $o \in O$ and $r \in R$, the probability of discriminating is bounded by

$$P_{\mathrm{disc}}(o, r) < \eta \tag{1}$$

Eqn. (1) can be interpreted with respect to a perceptual mapping function $\Psi$ as bounding the distance between $o$ and $r$ in the perceptual space

$$d(\Psi(o), \Psi(r)) < \epsilon(\eta, \Psi(o)) \tag{2}$$

where the distance bound $\epsilon$ depends on the discrimination threshold and, in general, on the location in space. Under this interpretation, any further measure, such as "quality", will also be similarly bounded. Thus, if the sensory responses do not differ statistically between the two waveforms, then any subsequent evaluative (or scaled) measure of quality must be the same.

The power of the discrimination measure is also its shortcoming, as listeners should generally utilize any reliable difference between the signals in making their

decisions. For example, in evaluating the quality of synthesis, one often encounters some form of recording artifact among the original signals, such as microphone noise or reverberation, which is not modeled in the reproduction process. In such cases, the signals will be readily discriminated even when the reproduced signal is identical to the noise-free signal in the recording.

There are at least two methods for reducing the sensitivity to irrelevant cues in the stimulus. The first, instructing the listener to ignore the unwanted cue, introduces an unmeasurable degree of bias in the measurement. Under such instruction, the discrimination score depends not only on the intrinsic discriminability between the intended stimuli (the quantity one wants to measure) but also on the ability of the listener to understand and follow the instructions. Attempts to factor out "good" from "bad" listeners are fraught with measurement problems. The second approach introduces the irrelevant cues to both the original and reproduced signal, under the assumption that these irrelevant cues do not interfere with the ability to discriminate the relevant cues. For example, when measuring discrimination thresholds for piano synthesis, we have added a low-level background noise to both the original and reproduced signals to reduce the microphone noise cue from the original signal (Guevara 1997). The introduction of reverberation to synthesized sounds may also be rationalized as a means of equalizing such an irrelevant cue across the original and reproduced sounds.

A second shortcoming of discrimination measures is the training required to reliably measure sensory capacity. We generally adhere to statistical models of sensory discrimination and the body of psychophysical procedures that have been developed for measuring the parameters of such models (Green and Swets 1966), which assume well-trained listeners.
Through feedback and training, listeners can learn the cues appropriate for the discrimination task. Without such training, one risks drawing erroneous conclusions. For example, advocates of categorical perception have cited the steepness of identification and psychophysical discrimination functions in the voice-onset time (VOT) continuum as evidence for coarse sensory quantizing of VOT. However, when naive listeners were given feedback and training, the evidence for such sensory quantization disappeared (Carney 1977). Training and feedback can also reveal the rather exceptional limits of one's sensory system. Consider the finding that sensory discrimination thresholds for frequency continue to improve after months of training (Turner and Nelson 1982)!

2.2. Supra threshold measures: Scaling psychophysics

Stevens introduced a body of techniques for directly measuring sensory magnitude, under the assumption that human observers can accurately report the strength of a percept along a particular dimension of interest (Bolanowski and Gescheider 1991). Alternatively, Braida and Durlach have advanced methods for building a perceptual scale from measures of discrimination (Durlach and Braida 1969). In conjunction, psychoacousticians have often debated the set of primitive sensory features, or dimensions, of a stimulus. When these are a matter of hypothesis, multidimensional scaling (Borg and Groenen 1997) and clustering techniques (Jain and Dubes 1988) have been used to characterize the perceptual representation, e.g., (Shepard 1963; Soli, Arabie et al. 1986).

With the exception of methods based on measures of discrimination, all scaling procedures are subjective in the sense that there is no way to independently score the correctness of the listener's response. Statistical procedures are often used to characterize whether the listener is making consistent use of the scale, which reduces the probability of contaminating a study with noisy data.
However, there is no independent method for ensuring that the listener uses either the intended perceptual variable (e.g., loudness, pitch, roughness, naturalness) or the scale as intended by the experimenter. Despite these caveats, carefully designed and executed scaling studies have led to a number of acoustic standards (e.g., sones, mel).

Thurstonian scaling reconstructs the perceptual distance along the "dimension" of interest, e.g., synthesis quality, based on pair-wise comparisons among elements in the stimulus set (Torgerson 1958). Because the listener is only required to make a directional judgment concerning which member of the pair is "larger" along the dimension of interest, Thurstonian scaling is robust to many of the problems listeners have in assigning numbers to subjective magnitude. It is, however, $O(N^2)$ with respect to the number of judgements required to reconstruct the orientation of N stimuli along the dimension of interest. Nevertheless, we, like many others, have used it to study sound quality (Otto and Wakefield 1992; Mellody, Wakefield et al. 2000).

Considerable attention has been focused on the related problem of speech quality in communication systems (Quackenbush, Barnwell et al. 1988). With respect to scaling, in addition to the method of paired comparisons, direct and indirect methods are distinguished. Both methods use some form of magnitude estimation, such as 5- to 11-point numerical scales, to measure strength along a particular perceptual dimension. Because of this, both methods are susceptible to problems listeners might have in properly using the numeric scales. This limitation is offset by the fact that the measurement is $O(N)$ and by more recent statistical developments in the modeling of ordinal data (Johnson and Albert 1999). Direct methods assess the dimension of interest, such as quality.
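The paired-comparison approach described above can be sketched as follows. This is a minimal illustration of Thurstone's Case V solution (equal, uncorrelated discriminal dispersions), which is one common variant and not necessarily the one used in the studies cited; the preference matrix, the clipping value, and the function names are hypothetical. A full design over N stimuli requires N(N-1)/2 distinct pairs, which is the source of the quadratic growth in judgements.

```python
from statistics import NormalDist


def n_paired_comparisons(n_stimuli: int) -> int:
    """Number of distinct pairs in a full paired-comparison design."""
    return n_stimuli * (n_stimuli - 1) // 2


def thurstone_case_v(pref, clip=0.01):
    """Case V scale values from a matrix of choice proportions.

    pref[i][j] is the proportion of trials on which stimulus i was judged
    "larger" than stimulus j along the dimension of interest.
    Proportions are clipped to [clip, 1 - clip] so the z-transform stays finite.
    """
    z = NormalDist().inv_cdf
    n = len(pref)
    scale = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i == j:
                continue  # the diagonal contributes z(0.5) = 0
            p = min(max(pref[i][j], clip), 1.0 - clip)
            total += z(p)
        scale.append(total / n)  # row mean of z-scores
    return scale


# Hypothetical proportions for three stimuli (pref[j][i] = 1 - pref[i][j]).
pref = [
    [0.50, 0.70, 0.90],
    [0.30, 0.50, 0.75],
    [0.10, 0.25, 0.50],
]
scale = thurstone_case_v(pref)
pairs_for_ten = n_paired_comparisons(10)
```

The scale values are defined only up to an additive constant and a unit; here they are centered on zero because the proportions are complementary.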
Like the method of paired comparisons, however, results using a direct method only indicate information about the projection of the stimulus along the dimension or curve of interest. Indirect methods assess multiple perceptual dimensions, each of which is thought to influence the aggregate measure. In the case of synthesis

quality, for example, indirect methods ask listeners to scale quality of the reproduced transient, steady-state, and decay segments of the waveform, or quality of roughness, brightness, etc.

2.3. Signals vs. ensembles

Underlying both of the preceding approaches is the assumption that the signals, o or r, rather than their ensembles, O or R, are the objects over which synthesis quality is assessed. This may be valid if there is little perceptual variation in the ensemble. For example, recordings of a single bell always struck with the same force at the same location are likely to be perceptually similar, so that an accurate reproduction for one sample is likely to imply accurate reproductions of all samples. However, even in this case, there are limitations with both procedures.

A presumed consequence of the mathematical framework used in scaling psychophysics is that the measure of a particular signal o does not depend on which other signals are present in the experiment. Evidence suggests that this assumption is not always true. Listeners will respond differently according to variations in the context created by the set of stimuli. These so-called anchoring effects, among others, have been the object of psychometric study. In short, such context effects can limit the ability to generalize from the set of signals to the ensemble.

Threshold psychophysics also faces certain limitations with respect to generalization. The simplest form of an ideal receiver, according to signal detection theory (SDT), should ignore all irrelevant components of the signal and focus only on sources of information that maximize detection or discrimination performance. Stimulus context can only affect performance through changes in the parameters of the likelihood ratio. Three decades of experimental research, however, have demonstrated the sometimes profound effect stimulus context can have on performance. Watson et al.
coined the term informational masking to refer to such degradations in performance (Watson, Wroton et al. 1975). For example, the frequency difference limen for a pure tone depends on whether the tone is presented alone or in a sequence of tones, each of which varies randomly in frequency. Such distractor tones can substantially increase discrimination thresholds, although the simplest form of ideal receiver factors them out.

Are our measures of synthesis quality obtained from scaling or threshold psychophysical procedures contaminated by such context effects? Assuming that there are few, if any, musical instruments for which a single sample adequately describes the instrument's ensemble of sounds, we suggest that the musical instrument creates a stimulus context such that detailed measurements of a small number of samples may not be representative of performance across the ensemble. In other words, when measuring synthesis quality, we are really attempting to measure how well the ensemble of original sounds O is discriminated from the ensemble of reproduced sounds R. Given the potentially large variations in stimulus context induced by O, measurements of quality using isolated members of O may be unduly conservative.

3. SOURCE-ENSEMBLE PSYCHOPHYSICS

The argument drawn above reduces to the observational claim often made in the literature: if it sounds like a violin, then it might as well be a violin, despite the fact that under controlled listening experiments, differences between the original and the reproduced signal are always heard. In this case, the violin ensemble creates a stimulus context such that, when exposed to the range of variations within O, the variational differences between any sample of O and its mapping to R by the synthesis algorithm are masked.
Under "partial" informational masking, we may even hear differences between a sample of O and its reproduction, but we would attribute these differences to different samples of O as much as to differences induced under synthesis. In other words, we have arrived at an understanding of the wisdom demonstrated by the dyslexic sound-synthesis demon at the beginning of the paper.

The stimulus-sampling procedure was originally proposed by Sorkin et al. (Sorkin, Robinson et al. 1987) as a method for quantifying information presented in auditory or visual displays. Lutfi extended the application of Sorkin's method, using concepts from information and detection theory, to better model and experimentally manipulate informational masking in psychophysical discrimination experiments (Lutfi 1992). More recently, the stimulus-sampling paradigm has been applied to study informational masking of "every-day" sounds (Oh and Lutfi 1999).

The basic structure of the stimulus-sampling procedure applies the psychophysical discrimination procedure to the ensemble rather than to an instantiation of the ensemble. For example, the listener is presented with samples from standard and comparison ensembles and must determine which interval contained a sample from the comparison ensemble. When there is no variation in the ensemble, the task reduces to that described in Section 2.1. However, when substantial variations exist, one can measure the ability to discriminate between the statistical distributions of samples within each ensemble, which is very much in line with the goal of sound synthesis.

4. SOURCE-ENSEMBLE PSYCHOPHYSICS AND PERCEPTUAL CODING

Space limits the details we can provide about the application of the stimulus-sampling paradigm to the study of synthesis quality. Instead, we conclude with some general remarks.

The classes of problems to which stimulus sampling has been applied are fairly small with respect to the probability distributions that describe the ensembles. To

date, statistical sampling theory has not adequately addressed the number of trials necessary to accurately estimate a discriminability index across two ensembles. When applied to measures of signal discrimination, such theory points to the need for relatively large numbers of observations. Extrapolating to measures of ensemble discrimination suggests an exponential growth in the number of trials. The existence of informational masking in the present case is likely to attenuate this growth.

When examined structurally, the stimulus-sampling paradigm and synthetic reproduction of musical instruments share many mathematical properties with the problem of designing lossy compression systems to optimize a perceptual fidelity criterion (Kahrs and Brandenburg 1998). The stimulus-sampling paradigm is the most general of the three in that there is no presumed mapping from the standard to the comparison ensembles. When these ensembles are linked through sound synthesis, the paradigm measures the perceptual fidelity of the mapping. An important outcome in this regard is the existence of a number of techniques in the coding literature to speed up Monte Carlo simulation of a compression algorithm. In applying stimulus sampling to measurements of sound-synthesis quality, such techniques as importance sampling may be quite useful in improving the statistical power of the test with far fewer trials.

A second outcome from the association of these three domains draws upon techniques developed within the stimulus-sampling paradigm. Berg showed that the trial-by-trial responses of listeners in a complex discrimination task could be used to better understand the perceptual salience of various stimulus dimensions (Berg 1989). Lutfi used this analytic technique to identify which parameters in a physical model of a resonant system are most important perceptually (Lutfi and Oh 1997), the so-called "psychomechanics" of sound sources (McAdams 2000).
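The idea of recovering perceptual weights from trial-by-trial responses can be illustrated with a small simulation. The sketch below is not Berg's procedure as published. It assumes a hypothetical listener who forms a weighted sum of feature differences between two intervals, adds internal noise, and picks the interval with the larger sum. Relative weights are then estimated by correlating each feature's random trial-by-trial perturbation with the response. All parameter values, the Gaussian ensembles, and the function names are illustrative assumptions.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate(n_trials, true_weights, shift, spread, internal_sd, seed=0):
    """Two-interval sample-discrimination task with a simulated listener.

    Standard ensemble: each feature ~ N(0, spread).
    Comparison ensemble: feature k ~ N(shift[k], spread).
    Returns (proportion correct, estimated relative weights).
    """
    rng = random.Random(seed)
    k = len(true_weights)
    perturbations = [[] for _ in range(k)]
    responses = []
    n_correct = 0
    for _ in range(n_trials):
        comparison_second = rng.random() < 0.5
        standard = [rng.gauss(0.0, spread) for _ in range(k)]
        comparison = [rng.gauss(shift[i], spread) for i in range(k)]
        if comparison_second:
            first, second, sign = standard, comparison, 1.0
        else:
            first, second, sign = comparison, standard, -1.0
        diff = [b - a for a, b in zip(first, second)]
        # Listener's decision variable: weighted feature differences plus noise.
        decision = sum(w * d for w, d in zip(true_weights, diff))
        decision += rng.gauss(0.0, internal_sd)
        chose_second = decision > 0.0
        responses.append(1.0 if chose_second else 0.0)
        n_correct += int(chose_second == comparison_second)
        for i in range(k):
            # Remove the expected difference so only random variation remains.
            perturbations[i].append(diff[i] - sign * shift[i])
    raw = [pearson(perturbations[i], responses) for i in range(k)]
    total = sum(abs(r) for r in raw)
    return n_correct / n_trials, [r / total for r in raw]


proportion_correct, weights = simulate(
    n_trials=4000,
    true_weights=[1.0, 0.5, 0.0],  # listener ignores the third feature
    shift=[0.5, 0.5, 0.5],
    spread=1.0,
    internal_sd=0.5,
)
```

In this sketch the third feature differs between the ensembles but carries no weight for the listener, so its estimated weight should fall near zero. That is the kind of outcome that would indicate which synthesis parameters matter perceptually.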
By extension, such reasoning should prove useful in fine-tuning a synthesis algorithm.

In conclusion, the stimulus-sampling paradigm addresses a number of the shortcomings of traditional psychophysical procedures with respect to the measurement of synthesis quality. As with threshold procedures, feedback and training can orient the listener to the meaningful cues. Typically, these cues involve the variational dependencies of supra threshold features of the ensemble. Therefore, the information provided by the measurements can be treated in many of the same ways as data collected from supra threshold scaling techniques. The conference presentation will demonstrate the application of these techniques to a sound-synthesis problem.

5. REFERENCES

Berg, B. G. 1989. "Analysis of weights in multiple observation tasks." Journal of the Acoustical Society of America 86: 1743-1746.

Bolanowski, S. J. J. and G. A. Gescheider, Eds. 1991. Ratio Scaling of Psychological Magnitude: In Honor of the Memory of S. S. Stevens. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Borg, I. and P. Groenen 1997. Modern Multidimensional Scaling. New York: Springer.

Carney, A. E. 1977. "Noncategorical perception of stop consonants differing in VOT." Journal of the Acoustical Society of America 62(4): 961-970.

Durlach, N. I. and L. D. Braida 1969. "Intensity perception. I. Preliminary theory of intensity resolution." Journal of the Acoustical Society of America 46(2): 372-383.

Green, D. M. and J. A. Swets 1966. Signal detection theory and psychophysics. New York: John Wiley & Sons.

Guevara, R. C. L. 1997. Modal Distribution Analysis and Sum of Sinusoids Synthesis of Piano Tones. Electrical Engineering and Computer Science. Ann Arbor, University of Michigan.

Jain, A. K. and R. C. Dubes 1988. Algorithms for Clustering Data. Englewood Cliffs, New Jersey: Prentice Hall, Inc.

Johnson, V. E. and J. H. Albert 1999. Ordinal data modeling. New York: Springer.

Kahrs, M. and K. Brandenburg, Eds. 1998. Applications of Digital Signal Processing to Audio and Acoustics. Kluwer Academic Publishers.

Lutfi, R. A. 1992. "Informational processing of complex sound. III: Interference." Journal of the Acoustical Society of America 91(6): 3391-3401.

Lutfi, R. A. and E. L. Oh 1997. "Auditory discrimination of material changes in a struck-clamped bar." Journal of the Acoustical Society of America 102(6): 3647-3656.

McAdams, S. 2000. "The psychomechanics of real and simulated sound sources." Journal of the Acoustical Society of America 107(5): 2876.

Mellody, M., G. H. Wakefield, et al. 2000. "Modal Distribution Analysis and Synthesis of a Soprano's Sung Vowels." Journal of Voice, submitted for publication.

Oh, E. L. and R. A. Lutfi 1999. "Informational masking by everyday sounds." Journal of the Acoustical Society of America 106(6): 3521-3528.

Otto, N. and G. H. Wakefield 1992. The Design of automotive acoustic environments: using subjective methods to improve sound quality. 36th Annual Meeting of the Human Factors Society.

Quackenbush, S. R., T. P. I. Barnwell, et al. 1988. Objective Measures of Speech Quality. Englewood Cliffs: Prentice Hall.

Shepard, R. 1963. "Circularity in judgements of relative pitch." Journal of the Acoustical Society of America 36: 2345-2353.

Soli, S. D., P. Arabie, et al. 1986. "Discrete representation of perceptual structure underlying consonant confusions." Journal of the Acoustical Society of America 79(3): 826-837.

Sorkin, R. D., D. E. Robinson, et al. 1987. A detection-theoretic method for the analysis of visual and auditory displays. 31st Annual Meeting of the Human Factors Society.

Torgerson, W. S. 1958. Theory and methods of scaling. New York: Wiley.

Turner, C. W. and D. A. Nelson 1982. "Frequency discrimination in regions of normal and impaired sensitivity." Journal of Speech & Hearing Research 25(1): 34-41.

Watson, C. S., H. W. Wroton, et al. 1975. "Factors in the discrimination of tonal patterns. I: Component frequency, temporal position, and silent intervals." Journal of the Acoustical Society of America 57: 1175-1185.