A Search for Best Error Metrics to Predict Discrimination of Original and Spectrally Altered Musical Instrument Sounds

Andrew Horner¹, James Beauchamp², and Richard So³

¹Department of Computer Science, Hong Kong University of Science & Technology, horer@cs.ust.hk
²School of Music & Dept of Electrical and Computer Eng., Univ. of Illinois at Urbana-Champaign, jwbeauch@uiuc.edu
³Department of Industrial Engineering & Eng. Management, Hong Kong Univ. of Science & Technology, rhyso@ust.hk

Abstract

The correspondence of various error metrics to human discrimination data was investigated. Eight musical instrument sounds were resynthesized with various amounts of random spectrum alteration, with spectral errors ranging from 1 to 50%. Listeners were asked to discriminate the randomly altered sounds from reference sounds resynthesized from the original data. Then, various spectral difference formulas designed to predict the perceptual difference between two similar sounds were tested by calculating the correspondence between the discrimination data and the spectral differences. Averaged over the 8 instruments, 91% correspondence was achieved using relative spectral error. The correspondence was approximately 82% for decibel-amplitude differences. Other error metrics, such as those based on critical-band grouping of components, worked well but did not give any improvement over the method based on harmonic amplitudes. Spectral differences using a small number of representative frames emphasizing attack and decay transients yielded results slightly better than using all frames.

1 Introduction

How best to predict the perceptual difference between two musical instrument sounds, such as between an acoustic original and a synthetic sound, is a longstanding problem in computer music. Listening tests are ideal for measuring such differences, but they are not always possible or practical.
Therefore, a numerical error metric that accurately predicts average listener discrimination between similar sounds would be highly desirable.

The metrics in this paper are based on the time-varying amplitudes of the harmonics of sustained musical instrument sounds. In all of the sounds tested, time-varying frequencies are replaced by fixed harmonic frequencies in order to focus on perceptual differences based on harmonic amplitudes.

Many error formulas can be devised based on measured harmonic amplitudes. Amplitudes can be measured on the basis of harmonics or critical bands. Linear harmonic amplitudes can be normalized by rms amplitude, or decibel amplitudes can be used. Time-averaged or peak error can be used. All time frames can be considered, or only certain frames.

Spectral differences are especially important in applications such as spectral modeling and data reduction. For example, the optimization of parameters for frequency modulation synthesis needs an error metric to measure how closely the synthetic result matches the original (Horner et al., 1993a). An error metric is also useful for measuring the synthesis error resulting from piecewise linear approximations to amplitude-vs.-time functions in additive synthesis (Horner et al., 1993b).

Plomp (1970) considered the correspondence between an error metric and discrimination data in his early work on timbre differences. Plomp's metric was a decibel spectral error using 1/3-octave bands. Investigating static spectra of musical instruments and vowels, Plomp found that spectral differences correlated quite well (80-85%) with listeners' judgments of timbral dissimilarity. Plomp concluded that timbre differences can be predicted well from spectral differences.
The authors of this paper recently measured the discrimination of eight resynthesized sustained musical instruments from corresponding sounds whose spectra were randomly altered by various amounts (Horner and Beauchamp, 2003). The current paper extends the work by determining how well various error metrics match human discrimination. Our goal is to determine which error metrics provide the best correspondence and answer several related questions:

Does the correspondence vary from instrument to instrument?
Does one error metric stand out as the best, or are there several equally effective methods?
Should absolute spectral differences, squared (Euclidean) differences, or absolute differences raised to some other power be used?
Which is preferable, decibel or linear amplitudes? Harmonic amplitudes or amplitudes averaged over critical bands?
Should we use all analysis frames obtained from a sound's short-time spectrum analysis or only a subset of them in calculating spectral error?

Proceedings ICMC 2004

Below, we present the random spectrum alteration procedure, followed by a comparison of error metrics for calculating spectral differences. Then we outline the discrimination experiment. The discrimination results are shown along with their correspondence with the various error metrics, which are discussed in detail. Finally, conclusions are drawn about which error metrics are best for applications such as music resynthesis.

2 Random Spectrum Alteration

Eight sustained (non-percussive) musical instrument sounds were selected as prototype signals for stimulus preparation. Except for the horn and bassoon, these reference sounds were also used in McAdams et al. (1999), and all of them were used in Horner and Beauchamp (2003). They were first subjected to spectrum analysis using a computer-based phase vocoder method. Random spectrum alteration was performed on the analysis data, after which the sounds were generated by the additive synthesis method.

Random alteration was done by multiplying each harmonic amplitude by a random scalar $r_k$:

$$A'_k(t) = r_k A_k(t), \qquad (1)$$

where harmonics in the same critical band share the same random scalar. This is tantamount to a linear stationary process. The goal of this random spectral alteration is to perturb each harmonic amplitude without changing the spectral centroid or loudness.

By uniformly picking $r_k$ in the range $[1 - 2\varepsilon, 1 + 2\varepsilon]$, the error is expected to be approximately $\varepsilon$ (the mean of $|1 - r_k|$ over this range is $\varepsilon$), though the actual error $\varepsilon'$ will deviate slightly from $\varepsilon$. For this study, we generate 50 tones for each instrument, where the error $\varepsilon$ ranges from 1% to 50% in increments of 1%. So, for 50% error, $r_k$ will be picked in the range [0, 2].

Preserving the spectral centroid after random spectrum alteration has been applied provides an important group of related, yet different, timbres. To preserve the spectral centroid, we iteratively tilt the altered spectra to achieve the desired centroid using Newton's method.
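The alteration and centroid-matching steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the per-critical-band sharing of scalars is omitted, and the tilt is assumed to take the form $k^p$ (the text does not specify the tilt function), with Newton's method solving for the exponent $p$. The centroid is computed on the time-averaged spectrum in units of harmonic number.

```python
import numpy as np

def spectral_centroid(avg_amps):
    """Amplitude-weighted mean harmonic number of an average spectrum."""
    k = np.arange(1, len(avg_amps) + 1)
    return np.sum(k * avg_amps) / np.sum(avg_amps)

def random_alteration(amps, eps, rng):
    """Scale each harmonic of amps (N frames x K harmonics) by r_k,
    drawn uniformly from [1 - 2*eps, 1 + 2*eps]; r_k is constant over time."""
    r = rng.uniform(1.0 - 2.0 * eps, 1.0 + 2.0 * eps, size=amps.shape[1])
    return amps * r, r

def tilt_to_centroid(amps, target, tol=1e-9, max_iter=50):
    """Assumed tilt: multiply harmonic k by k**p, with p found by Newton's
    method so that the centroid of the average spectrum equals target."""
    k = np.arange(1, amps.shape[1] + 1, dtype=float)
    ln_k = np.log(k)
    avg = amps.mean(axis=0)
    p = 0.0
    for _ in range(max_iter):
        w = avg * k ** p
        c = np.sum(k * w) / np.sum(w)
        if abs(c - target) < tol:
            break
        # d(centroid)/dp is the weighted covariance of k and ln(k)
        dc = np.sum(k * ln_k * w) / np.sum(w) - c * np.sum(ln_k * w) / np.sum(w)
        p -= (c - target) / dc
    return amps * k ** p, p
```

After the tilt, the effective scalar for harmonic k is $r_k k^p$, which is one way to read "modifying $r_k$ until the centroids match". Loudness equalization would follow as a separate step and is not sketched here because it depends on an external loudness model.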
For loudness equalization, an amplitude multiplier was determined such that the altered sound had a loudness of 87.4 phons. An iterative procedure adjusted the amplitude multiplier, starting from a value of 1.0, until the resulting loudness was within 0.1 phons of 87.4, as measured by Moore and Glasberg's loudness program (Moore, Glasberg, and Baer 1997).

The random spectrum alteration algorithm therefore consists of the following steps:

(1) Pick initial values for $r_k$ such that $1 - 2\varepsilon < r_k < 1 + 2\varepsilon$.
(2) Apply random alteration: $A'_k(t) = r_k A_k(t)$.
(3) Centroid equalization:
    a. Calculate the average spectra of the original and altered sounds.
    b. Calculate the spectral centroids of the average spectra from step 3a.
    c. Iteratively tilt the average altered spectrum using Newton's method by modifying $r_k$ until the centroids match.
(4) Iterative loudness equalization using Moore and Glasberg's LOUDEAS program.
(5) End

3 Error Metrics

There are several different approaches for computing spectral differences, including harmonic amplitude differences, critical-band amplitude differences, spectral errors, envelope errors, amplitude weighting, frequency weighting, and whether to use a subset of the frames or all of them in the calculations. We outline these various error metrics in this section.

3.1 Harmonic-Amplitude Error Metrics

Probably the simplest error metric is the average simple distance measure based on linear harmonic amplitudes treated as vectors, which we call linear-amplitude spectral error:

$$\varepsilon_{lase} = \frac{1}{N}\sum_{n=1}^{N}\left(\sum_{k=1}^{K}\left|A_k(t_n)-A'_k(t_n)\right|^{a}\right)^{1/a} \qquad (2)$$

where n = analysis frame number, $t_n$ = time in seconds of analysis frame n, N = number of analysis frames, k = harmonic number, K = number of harmonics, $A_k(t_n)$ is the amplitude of the first signal's kth harmonic at time $t_n$, $A'_k(t_n)$ is the amplitude of the second signal's kth harmonic at time $t_n$, and a is an arbitrary exponent applied to each difference.
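As an illustration, the linear-amplitude spectral error can be computed in a few lines. This sketch assumes Eq. 2 has the form of a per-frame a-norm of the amplitude difference vector, averaged over the N frames (so that a=2 gives the Euclidean distance and a=1 the absolute difference); the function name and the array layout (frames in rows, harmonics in columns) are choices made here, not taken from the paper.

```python
import numpy as np

def linear_amplitude_spectral_error(A, A_alt, a=1.0):
    """Average over frames of the a-norm distance between harmonic
    amplitude vectors. A and A_alt have shape (N frames, K harmonics)."""
    A = np.asarray(A, dtype=float)
    A_alt = np.asarray(A_alt, dtype=float)
    # Sum |difference|^a over harmonics, then take the 1/a root per frame
    per_frame = np.sum(np.abs(A - A_alt) ** a, axis=1) ** (1.0 / a)
    return per_frame.mean()

# Two frames, two harmonics: the first frame differs by the vector (3, 4)
A = np.array([[3.0, 4.0], [0.0, 0.0]])
B = np.zeros_like(A)
print(linear_amplitude_spectral_error(A, B, a=2.0))  # 2.5
print(linear_amplitude_spectral_error(A, B, a=1.0))  # 3.5
```

Because nothing is normalized, frames and harmonics with large amplitudes dominate the result.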
Normally for the metric calculations we take N=20, where 10 points equally spaced in time are taken from the "attack" portion of the sound (defined as the time from onset up to the point of maximum rms amplitude of the original signal) and the rest are equally spaced in time over the remainder of the sound. This gives equal emphasis to the all-important attack and the steady-state/decay portions of the sound.

Note that the highest-amplitude harmonics of the highest-amplitude time frames make the strongest contributions in linear spectral error. This emphasizes the sustained part of most sounds, since it is usually the loudest.

Statistical procedures, such as principal components analysis, have been used to reduce additive synthesis data to wavetable synthesis data by minimizing the error in Eq. 2 (Horner et al., 1993b; Sandell and Martens, 1995). Also, a least-squares approach has been used to determine optimum amplitude-vs.-time envelopes for

frequency modulation and wavetable synthesis by minimizing Eq. 2 (Horner et al., 1993a&b). Both applications minimize the Euclidean spectral distance, where a=2. Double frequency modulation matching work has also used linear spectral error with a=1 (Tan et al., 1994).

Alternatively, one might argue that decibel differences are a better measure of how humans hear. Decibel-amplitude spectral error can be formulated as:

$$\varepsilon_{dase} = \frac{1}{N}\sum_{n=1}^{N}\left(\sum_{k=1}^{K}\left|L_k(t_n)-L'_k(t_n)\right|^{a}\right)^{1/a} \qquad (3)$$

where $L_k(t_n) = 20\log(A_k(t_n))$ and $L'_k(t_n) = 20\log(A'_k(t_n))$.

Both Eqs. 2 and 3 emphasize spectral snapshots with higher amplitudes, which de-emphasizes the perceptually important attack and decay. Instead, we can weight each time frame equally, using a relative-amplitude spectral error:

$$\varepsilon_{rase} = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{\sum_{k=1}^{K}\left|A_k(t_n)-A'_k(t_n)\right|^{a}}{\sum_{k=1}^{K}A_k(t_n)^{a}}\right)^{1/a} \qquad (4)$$

The relative-amplitude spectral error varies from 0 to 1, since Eq. 4 normalizes the error for each time frame by the amplitude of the first sound. We refer to this as relative-amplitude spectral error with simple normalization. Previous wavetable and frequency modulation matching work by the authors and others has used Eq. 4 with a=2 (Horner et al., 1993a&b; Lim and Tan, 1999; Horner, 2001) and sometimes with a=1.

It is also possible to use both sounds in the normalization. We call the resulting error measure relative-amplitude spectral error with dual normalization:

[Eq. 5: formula not legible in the source text]

McAdams et al. (1999) used a variation on Eq. 5 with an alternative normalization method, which we will refer to as relative-amplitude spectral error with maximum normalization:

$$\varepsilon_{rase(max)} = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{\sum_{k=1}^{K}\left|A_k(t_n)-A'_k(t_n)\right|^{a}}{\sum_{k=1}^{K}\max\left(A_k(t_n),A'_k(t_n)\right)^{a}}\right)^{1/a} \qquad (6)$$

Rather than averaging the harmonic differences, one could consider only the worst difference at each time frame to emphasize the occurrence of a single large spectral difference. The following is the maximum relative-amplitude spectral error:

[Eq. 7: formula not legible in the source text]

It is also possible to take the root after the summation in relative spectral error (Eq. 4). The following is the rms relative-amplitude spectral error:

$$\varepsilon_{rmsrase} = \left(\frac{1}{N}\sum_{n=1}^{N}\frac{\sum_{k=1}^{K}\left|A_k(t_n)-A'_k(t_n)\right|^{a}}{\sum_{k=1}^{K}A_k(t_n)^{a}}\right)^{1/a} \qquad (8)$$

3.2 Critical-Band-Amplitude Error Metrics

Eqs. 2-8 define error metrics based on amplitudes of individual harmonics. They can be rewritten to depend on the combined amplitudes of critical bands in order to better represent the human auditory system.

3.3 Envelope Error Metric

The error metrics defined above equally weight each spectral frame. Alternatively, we can weight each harmonic equally with a relative harmonic-amplitude envelope error given by

$$\varepsilon_{rhe} = \frac{1}{K}\sum_{k=1}^{K}\left(\frac{\sum_{n=1}^{N}\left|A_k(t_n)-A'_k(t_n)\right|^{a}}{\sum_{n=1}^{N}A_k(t_n)^{a}}\right)^{1/a} \qquad (9)$$

For random spectrum alteration, since $A'_k = r_k A_k$, the factor $|1 - r_k|^{a}$ comes out of the sum over n and the remaining sums cancel, so Eq. 9 can be simplified as:

$$\varepsilon_{rhe} = \frac{1}{K}\sum_{k=1}^{K}\left|1-r_k\right| \qquad (10)$$

which is independent of a (it would depend on a only if different $r_k$ were picked for each time frame).

3.4 Frequency-Weighted Error Metric

If a frequency-dependent term $k^b$ is introduced as an amplitude weight in the relative-amplitude spectral error metric (Eq. 4 with a=1), low or high frequencies can be emphasized. The frequency-weighted relative harmonic-amplitude error then becomes

$$\varepsilon_{fw} = \frac{1}{N}\sum_{n=1}^{N}\frac{\sum_{k=1}^{K}k^{b}\left|A_k(t_n)-A'_k(t_n)\right|}{\sum_{k=1}^{K}k^{b}A_k(t_n)} \qquad (11)$$

Eq. 11 emphasizes high frequencies for b>0, low frequencies for b<0, and is a special case of Eq. 4 for b=0.

3.5 Error Metrics using All Available Frames

Another question is whether to use all analysis frames obtained from a sound's short-time spectrum analysis or only a subset of them in calculating spectral error. The advantage of using a subset is that it is computationally cheaper and can provide more emphasis on perceptually important time regions such as the sound's attack and decay. Several previous studies (e.g., Horner et al., 1993a&b; Horner, 2001) have utilized spectral error metrics based on 10 equally spaced spectral frames taken from the attack (defined as the time period before the peak rms amplitude point) and 10 equally spaced frames taken from the rest of the sound. However, it has never been empirically shown whether using all available spectral frames or only a few carefully chosen representative spectral frames provides a better correspondence to perceptual difference.

4 Experimental Method

Twenty subjects were selected for the listening test. They were undergraduate students at the Hong Kong University of Science and Technology, ranging in age from 18 to 23 years, who reported no hearing problems.

The eight sustained instruments used belong to the air column (air reed, single reed, lip reed, double reed) and bowed string families. They were the bassoon, clarinet, flute, horn, oboe, saxophone, trumpet, and violin.

Each sound was analyzed and resynthesized using the reference analysis data with no frequency variations and no inharmonicity, since the original sustained sounds had relatively small frequency deviations. This was done to prevent listeners from detecting cues stemming from frequency deviations, which might be amplified by some of the random spectrum alterations.

A randomly altered sound was generated for each error level from 1% to 50% in increments of 1%, yielding a total of 50 modified sounds for each instrument. The randomly altered sounds were also generated using strictly fixed harmonics. To compare random alterations across instruments, the same initial set of random scalars $r_k$ was used on all eight instruments, though the scalars were slightly modified by centroid tilt-correction.

A two-alternative forced-choice (2AFC) discrimination paradigm was used. The listener heard two pairs of sounds and chose which pair was "different". Each trial structure was one of AA-AB, AB-AA, BB-BA, or BA-BB, where A represents the reference sound and B one of the 50 randomly altered sounds. Horner and Beauchamp (2003) give complete details of the experiment.

5 Results

5.1 Discrimination Scores

Discrimination scores were computed for each random alteration across the four trial structures for each subject.
Scores averaged over the 20 subjects for the 50 random alterations on all eight instruments are shown in Figure 1. For errors up to 10%, almost all scores are in the range 40-60%, around the "indistinguishability level" of 50% that corresponds to random guessing. The range is wider and more variable for intermediate errors between 15% and 25%, where the scores cover nearly the full range from 50% to 100%. Intermediate errors correspond to "somewhat distinguishable" cases. For errors above 30%, most scores are above 90% and are "very distinguishable".

[Figure 1. Mean subject discrimination scores for randomly altered sounds vs. error level for all eight instruments (the solid line shows the 4th-order polynomial trend).]

An underlying S-curve is clearly visible in the superimposed 4th-order polynomial regression curve that best fits the data in the least-squares sense. Most points are close to the trend. A few error levels appear as outliers across instruments, which is to be expected since the same initial random scalars were used for all eight instruments. The most obvious examples are the outlying 27% and 36% errors.

The use of a 4th-order S-shaped polynomial regression fit agrees with our hypothetical understanding of the subjective responses of the listeners. When the percentage error is close to zero (say 0-10%), listeners are forced to choose randomly, and the discrimination results will be close to 50%. As the percentage error increases from 10% to 30%, listeners find it easier to discriminate between the two sounds, and the discrimination results increase with the percentage error. When the percentage error increases beyond about 30%, the rate of increase falls off as the scores approach an asymptotic level of 100%.
In other words, it was hypothesized that a plot of discrimination results as a function of percentage error would have an S-shape with two points of inflection: one at around 10% error and the other at around 30% error.

An analysis of variance (ANOVA) with Student-Newman-Keuls (SNK) analyses was conducted to investigate the main effects of percentage error on the discrimination data. The analyses identified three statistically separated groups of mean discrimination data: (i) mean data up to 58%; (ii) mean data from 64% to 87%; and (iii) mean data from 91% to 98%. The first group corresponds exactly to data collected from experimental conditions with percentage errors of 0 to 12%, while the other two groups approximately correspond to data from 13 to 31% and from 32 to 50%. The results of the ANOVA are consistent with the observations of Fig. 1.

5.2 Correspondence of Error Metrics and Discrimination Scores

Each of the error metrics of Section 3 was calculated to determine its correspondence with the discrimination data. Regression analysis provides a measure of how much

variance each error metric accounts for in the discrimination data. The coefficient of determination or squared multiple correlation coefficient R² (Pedhazur, 1982) measures how well the data points fit the regression curve, and thus the correspondence between the discrimination scores and error metric. R² is calculated on the 400 data points corresponding to the 50 error levels for each of the 8 instruments and is defined as:

$$R^2 = \frac{\sum_{i=1}^{400}\left(\hat{d}_i-\bar{d}\right)^2}{\sum_{i=1}^{400}\left(d_i-\bar{d}\right)^2} \qquad (12)$$

where $d_i$ is the ith discrimination score, $\hat{d}_i$ is the regression curve approximation of the ith discrimination score, and $\bar{d}$ is the mean discrimination score. For example, if R²=0, the error metric explains none (i.e., 0%) of the variation in the discrimination data. On the other hand, R²=1 means that all the data points lie on the regression curve, and all (i.e., 100%) of the variation in the discrimination scores is explained by the error metric. With R²=0.9, the error metric accounts for 90% of the variance in the discrimination data.

Linear-Amplitude Error Metric Results. Fig. 2 shows R² plotted against various values of a for the linear-amplitude spectral error metric of Eq. 2. The best a-values account for about 85-89% of the variance when 0.4<a<1.5. The absolute spectral difference (with a=1) is clearly much better than the Euclidean spectral distance (with a=2), which only accounts for 70% of the variance. For a=3, the correspondence is only 40%. The best R² correspondence is 89% at a=0.54.

Fig. 3 shows R² plotted vs. a for the decibel-amplitude spectral error metric of Eq. 3. The maximum correspondence is 82% at a=1.34, much worse than that of linear spectral error. For a=2, only about 77% of the variance is accounted for, which is not as good as a=1 but is more forgiving than linear-amplitude spectral error at this value of a.

Fig. 4 shows R² plotted vs. a for the relative-amplitude spectral error metric of Eq. 4. The maximum correspondence is 91% (at a=0.64). The curve is relatively flat, with correspondences of 87% at a=2 and 85% at a=0.3 and a=3. Thus, it matters little whether a=0.5, a=1, or a=2; the results are good in all cases. This robustness is an advantage over linear and dB spectral error.

The correspondence curve for the relative-amplitude spectral error with dual normalization (Eq. 5) is almost identical to Fig. 4, and the maximum correspondence is 91% at a=0.68. Likewise, the correspondence for the relative-amplitude spectral error with maximum amplitude normalization (Eq. 6) is nearly identical to Fig. 4, with a maximum correspondence about 0.5% better than the other two (91% at a=0.70). It is perhaps a bit surprising that it is slightly superior to the metric of Eq. 4, but the Eq. 6 metric is a bit more difficult to interpret mathematically.

Fig. 5 shows R² vs. a for the maximum relative-amplitude spectral error metric of Eq. 7. Low a-values give a poor correspondence, though the metric is reasonably good at higher values. The maximum correspondence is 83% at a=1.90. The correspondence for the rms relative-amplitude spectral error metric (Eq. 8) is nearly identical to Fig. 4, with a maximum correspondence of 91% at a=0.64.

[Figure 2. R² versus a for linear-amplitude spectral error.]

[Figure 3. R² versus a for decibel-amplitude spectral error.]

[Figure 4. R² versus a for relative-amplitude spectral error.]

[Figure 5. R² versus a for maximum relative-amplitude spectral error.]
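The R² correspondence used throughout this section can be sketched as a polynomial regression of discrimination scores on metric values. The code below is a minimal illustration, not the authors' analysis script: the function name is hypothetical, and it uses the form 1 - SS_res/SS_tot, which equals the ratio of explained to total variation for a least-squares polynomial fit that includes a constant term.

```python
import numpy as np

def r_squared_polyfit(metric_values, scores, order=4):
    """Fit a polynomial of the given order to (metric value, score) pairs
    by least squares and return the coefficient of determination."""
    x = np.asarray(metric_values, dtype=float)
    d = np.asarray(scores, dtype=float)
    coeffs = np.polyfit(x, d, order)       # least-squares polynomial fit
    d_hat = np.polyval(coeffs, x)          # regression curve at each point
    ss_res = np.sum((d - d_hat) ** 2)      # unexplained variation
    ss_tot = np.sum((d - d.mean()) ** 2)   # total variation about the mean
    return 1.0 - ss_res / ss_tot
```

Each point on the R²-versus-a curves would then come from evaluating one error metric at a given exponent a on all 400 sound pairs and passing the resulting values, together with the mean discrimination scores, to a function of this kind.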

Critical Band Error Metrics. Next we consider critical-band rather than harmonic spectral-amplitude differences. Fig. 6 shows R² versus a for the linear-amplitude critical-band error metric. The maximum correspondence is 87% at a=0.20. It is similar to the linear-amplitude harmonic error in Fig. 2, but the peak is slightly lower. On the other hand, it yields good correspondence for a-values as low as 0.15.

The correspondence for the decibel-amplitude critical-band error metric was significantly better than that of the decibel-amplitude harmonic error metric of Fig. 3, with a maximum of 89% at a=0.66. However, it was noticeably worse than the relative-amplitude metrics, such as the one shown in Fig. 4, in terms of robustness with respect to variation of a.

The correspondence for the relative-amplitude critical-band error metric plotted in Fig. 7 is excellent for all a-values in the range given and shows significant improvement over harmonic error for a<0.5. However, it is somewhat surprising that the critical-band measure is slightly worse than the relative-amplitude harmonic error metric (see Fig. 4) for most a-values. Relative-amplitude critical-band error metrics with dual and maximum normalization are almost identical to Fig. 7, as is the rms relative-amplitude critical-band error metric.

The maximum relative-amplitude critical-band error correspondence curve rises faster than the comparable harmonic curve (see Fig. 5) for a<1, but the two curves are nearly identical for a>1.

[Figure 6. R² versus a for linear-amplitude critical-band error.]

[Figure 7. R² versus a for relative-amplitude critical-band error.]

Envelope Error Metric. The relative-amplitude envelope error metric (given in Eqs. 9-10) is time-invariant, so the correspondence does not depend on a, and a flat curve results at 79%. Relative-amplitude envelope error is equivalent to the "actual error" $\varepsilon'$ discussed in Section 2, which therefore has a correspondence of 79%. The correspondence for the "expected error" $\varepsilon$ is 81%. Therefore, the expected and actual errors are in close agreement, though other metrics give better correspondences.

Frequency-Weighted Relative-Amplitude Error. Fig. 8 shows R² vs. b for the frequency-dependent relative spectral error of Eq. 11, where a is set to 1.0. Both positive and negative b-values are included. The maximum correspondence is 91% for b=0.54, which corresponds to weighting high harmonics slightly more than lower ones. Note, however, that the correspondence is close to 90% at b=0, corresponding to the usual relative-amplitude spectral error. Also, optimizing both a and b yields little improvement. Overall, therefore, frequency weighting improves the correspondence by less than 1%.

[Figure 8. R² versus b for frequency-dependent relative-amplitude spectral error.]

5.3 Sensitivity Analyses

Effects of Using All Frames in Calculations. All of the metrics discussed in previous sections have assumed N=20, with 10 frames taken from the attack and 10 frames taken from the rest of the sound. The correspondence of the relative-amplitude spectral error (Eq. 4) was also calculated with $N = N_{max}$, where $N_{max}$ is the total number of frames (approximately 1250 for these sounds) available in the analysis. The result was nearly identical to that of Fig. 4 (where the number of frames was N=20), though using all frames was slightly inferior. Using 20 representative spectra, with 10 in the attack and 10 over the rest of the sound, appears to yield marginally better correspondence than using all of the frames (and is much faster to compute).

We also found that using fewer than 20 frames does just about as well. Because random spectrum alteration is time-invariant, measuring the spectral difference at a few points gives a very good idea of the spectral difference at all points. Of course, this result may be different for time-varying alterations.

Effects of Instruments. Fig. 13 illustrates the R² values for the relative-amplitude error metric on the 8 instruments

individually (bassoon, clarinet, flute, horn, oboe, saxophone, trumpet, and violin). Inspection of Fig. 13 indicates that all instruments follow a similar trend. Wilcoxon signed-rank tests conducted to compare the R² values, for all values of a, from different instruments indicated the following results: (i) R² was significantly higher for flute and oboe (p<0.001), (ii) this was followed by R² for saxophone, clarinet, and trumpet, (iii) R² values from trumpet, bassoon, and violin were not significantly different from each other (p>0.2), and (iv) R² values associated with horn were the lowest (p<0.001). While the statistical tests indicate a consistent and reliable ranking of the R² values, the impact of the ranking on the absolute values of R² is not large except for the violin and horn, the two extreme cases. The above analyses were repeated for the rms relative-amplitude spectral error metric and similar results were obtained.

[Figure 13. R² versus a for relative-amplitude spectral error on each of the 8 instruments.]

Effects of Changing the 4th-Order Regression Fit to 3rd and 5th Orders. The reason for using a 4th-order regression fitting curve was documented in Section 5.1. In this section, the effects of replacing the 4th-order regression curve with a 3rd- and a 5th-order fit are examined. There were only very slight observable differences among the three curves for relative-amplitude spectral error. Further examination of the R² data indicated that the maximum differences occurred at the peak R² values, but the differences were less than 1%. The above analyses were repeated for the rms relative-amplitude spectral error and similar results were obtained.

6 Discussion

Table 1 summarizes the maximum R² values for each of the error metrics described in Sections 3 and 5, and the range of the parameter a over which R² is within 5% of the maximum R² value.
Relative-amplitude spectral error explains over 90% of the variation in the discrimination data. This metric is very robust, with good results for absolute differences, Euclidean differences, and differences raised to other powers. All forms of normalization performed well, as did rms relative-amplitude spectral error.

Error Metric                                                         Max R²   a at lower bound   a at max R²   a at upper bound
Linear-amplitude spectral error                                      0.887    0.36               0.54          1.42
Decibel-amplitude spectral error                                     0.824    0.20               1.34          1.90
Relative-amplitude spectral error with simple normalization          0.908    0.30               0.64          2.24
Relative-amplitude spectral error with dual normalization            0.906    0.30               0.68          2.20
Relative-amplitude spectral error with maximum normalization         0.914    0.32               0.70          1.96
Maximum relative-amplitude spectral error                            0.834    0.94               1.90          3.00+
RMS relative-amplitude spectral error                                0.908    0.30               0.64          2.36
Linear-amplitude critical band error                                 0.867    0.14               0.20          1.52
Decibel-amplitude critical band error                                0.895    0.22               0.66          1.86
Relative-amplitude critical band error with simple normalization     0.903    0.06               0.54          2.52
Relative-amplitude critical band error with dual normalization       0.903    0.06               0.60          2.40
Relative-amplitude critical band error with maximum normalization    0.908    0.10               0.60          2.16
Maximum relative-amplitude critical band error                       0.836    0.86               1.84          3.00+
RMS relative-amplitude critical band error                           0.903    0.06               0.52          2.66
Relative-amplitude envelope error                                    0.791    0.02-              (all)         3.00+
Frequency-dependent relative-amplitude spectral error                0.907    0.70               0.54          1.58
Relative-amplitude spectral error using all analysis frames          0.907    0.30               0.68          2.24

Table 1. The maximum R² for various error metrics, and the lower and upper bound of the parameter a over which R² is within 5% of the maximum R² value.

The best results for linear-amplitude spectral error were about the same, but the metric was less robust than relative-amplitude spectral error in terms of sensitivity to the power of the spectral difference.
This perhaps accounts for why some researchers have noticed artifacts resulting from principal components analysis-based methods for data reduction of additive synthesis data (Horner et al., 1993b), since principal components analysis optimizes linear spectral error with a=2.

The biggest surprise was that critical-band-based errors were no better and sometimes worse than harmonic errors for most a-values. Still, though slightly inferior in absolute terms, the critical-band errors were among the least sensitive to changes in the power a of all the metrics studied, with excellent performance for all powers between 0.02 and 3.

Decibel-amplitude spectral error, maximum relative-amplitude spectral error, and relative-amplitude envelope error gave the worst results of the metrics tested, with correspondences of only about 79-83% (Table 1). However, decibel-amplitude differences with critical bands did significantly better.

Another surprise was that employing a frequency dependency in the metric did not improve the average correspondence. We would have expected that emphasizing the lower harmonics might help, since they are usually more

prominent, but this was not the case. Amplitude-based weighting seems to work best without frequency weighting.

Spectral differences using 20 representative frames, rather than all frames, resulted in slightly better correspondence. This allows more emphasis on the perceptually important attack and decay. It is also about two orders of magnitude more efficient.

7 Conclusions

We have compared the correspondence of discrimination data and various error metrics. All the error metrics had at least a reasonable correspondence (70% or more), and several had excellent correspondences over 90%. Overall, the best metrics were relative-amplitude spectral error and rms relative-amplitude spectral error, since they had strong correspondence and robustness with respect to the power a.

Surprisingly, critical-band errors were not significantly better than harmonic errors and sometimes performed slightly worse. Frequency weighting did not provide an improvement. However, correspondence using a small number of well-selected frames in the relative-amplitude spectral metric yielded a slight improvement over taking all frames, verifying the effectiveness of this useful shortcut. Absolute spectral differences (a=1) outperformed Euclidean spectral differences (a=2) on nearly every metric, with a=0.6 about optimal.

The strong correspondence of metrics such as relative-amplitude spectral error allows music synthesists to use them with confidence, since they account for over 90% of the variance in discrimination data. Relative-amplitude spectral error is an excellent tool for optimizing parameters for applications such as frequency modulation and wavetable synthesis. Relative-amplitude spectral error also provides a means for estimating how much data reduction is possible with additive synthesis data without perceptually altering musical sounds.
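For readers who want to try the recommended metric, the following is a minimal sketch of relative-amplitude spectral error with simple normalization, together with the 20-frame selection shortcut (10 frames up to the rms peak, 10 over the remainder). It assumes Eq. 4 takes the per-frame form (sum of |difference|^a divided by the sum of reference amplitudes^a), raised to 1/a and averaged over frames. The function names, the array layout (frames in rows, harmonics in columns), and the rounding used to pick frame indices are choices made here for illustration.

```python
import numpy as np

def select_frames(A, n_attack=10, n_rest=10):
    """Indices of n_attack frames equally spaced from onset to the rms peak
    plus n_rest frames equally spaced over the rest of the sound.
    May return fewer indices if the attack is very short (duplicates merge)."""
    rms = np.sqrt(np.sum(A ** 2, axis=1))   # proportional to rms amplitude
    peak = int(np.argmax(rms))
    attack = np.linspace(0, peak, n_attack)
    rest = np.linspace(peak, A.shape[0] - 1, n_rest + 1)[1:]
    return np.unique(np.round(np.concatenate([attack, rest])).astype(int))

def relative_amplitude_spectral_error(A, A_alt, a=1.0):
    """Per-frame normalized a-norm error, averaged over frames.
    Assumes every frame of the reference A has nonzero amplitude."""
    num = np.sum(np.abs(A - A_alt) ** a, axis=1)
    den = np.sum(A ** a, axis=1)
    return np.mean((num / den) ** (1.0 / a))

# Example: scaling every harmonic by 1.1 gives an error of 0.1 for any a
env = np.concatenate([np.linspace(0.01, 1.0, 51), np.linspace(0.99, 0.2, 149)])
A = np.outer(env, 1.0 / np.arange(1, 9))
idx = select_frames(A)
err = relative_amplitude_spectral_error(A[idx], 1.1 * A[idx], a=0.6)
```

Because each frame is normalized by its own reference amplitude, quiet attack and decay frames count as much as loud sustained frames, which is the property that distinguishes this metric from linear-amplitude spectral error.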
Moreover, the strong correspondences of the relative-amplitude and rms relative-amplitude spectral error metrics to the empirically collected discrimination data were robust to the use of 3rd-, 4th-, and 5th-order regression fits. The strong correspondences were also maintained when regression analyses were conducted on each of the 8 individual instruments.

Some questions for further study are:
1) Is it possible to devise an error metric with a discrimination correspondence of more than 91%?
2) Can some linear combination of the error metrics produce a "super metric"?
3) Can other measures of the spectra, such as spectral irregularity, improve the correspondence?
4) What is the best method for selecting frames for the spectral difference computation, and how many frames must be taken from the attack, sustain, and decay of sounds to give adequate results?
5) Although the use of critical band amplitudes rather than harmonic amplitudes did not yield any error metric improvement, is there some other way to combine frequency components that does?

8 Acknowledgments

This work was supported by the Hong Kong Research Grants Council's Projects HKUST6194/02E and HKUST6167/03E. We would like to thank Simon Cheuk-Wai Wun for his excellent listening test.

References

Horner, A., Beauchamp, J., and Haken, L. (1993a). "Genetic Algorithms and Their Application to FM Matching Synthesis." Computer Music Journal 17(4), 17-29.
Horner, A., Beauchamp, J., and Haken, L. (1993b). "Methods for Multiple Wavetable Synthesis of Musical Instrument Tones." Journal of the Audio Engineering Society 41(5), 336-356.
Horner, A. (2001). "A Simplified Wavetable Matching Method Using Combinatorial Basis Spectra Selection." Journal of the Audio Engineering Society 49(11), 1060-1066.
Horner, A. and Beauchamp, J. (2003). "Discrimination of Sustained Musical Instrument Sounds Resynthesized with Randomly Altered Spectra." In Proceedings of the International Computer Music Conference, pp. 87-90. San Francisco: International Computer Music Association.
Lim, S.M. and Tan, B.T.G. (1999). "Performance of the Genetic Annealing Algorithm in DFM Synthesis of Dynamic Musical Sound Samples." Journal of the Audio Engineering Society 47(5), 339-354.
McAdams, S., Beauchamp, J. W., and Meneguzzi, S. (1999). "Discrimination of musical instrument sounds resynthesized with simplified spectrotemporal parameters." Journal of the Acoustical Society of America 105(2), 882-897.
Moore, B. C. J., Glasberg, B. R., and Baer, T. (1997). "A model for the prediction of thresholds, loudness and partial loudness." Journal of the Audio Engineering Society 45(4), 224-240.
Pedhazur, E. J. (1982). Multiple Regression in Behavioral Research, Holt, Rinehart and Winston, Chapter 3.
Plomp, R. (1970). "Timbre as a multidimensional attribute of complex tones." In Frequency Analysis and Periodicity Detection in Hearing, R. Plomp and G. F. Smoorenburg, eds. (Sijthoff, Leiden, The Netherlands), 405-408.
Sandell, G. J. and Martens, W. L. (1995). "Perceptual Evaluation of Principal Component-Based Synthesis of Musical Timbres." Journal of the Audio Engineering Society 43(12), 481-496.
Tan, B.T.G., Gan, S.L., Lim, S.M., and Tang, S.H. (1994). "Real-Time Implementation of Double Frequency Modulation (DFM) Synthesis." Journal of the Audio Engineering Society 42(11), 918-926.

Proceedings ICMC 2004