Timbral, Perceptual, and Statistical Attributes for Synthesized Sound

James McDermott (1), Niall J.L. Griffith (1), and Michael O'Neill (2)
(1) Dept. of Computer Science and Information Systems, University of Limerick
(2) Natural Computing Research and Applications Group, University College Dublin
jamesmichaelmcdermott@gmail.com, niall.griffith@ul.ie, m.oneill@ucd.ie

Abstract

A set of 40 timbral, perceptual, and statistical sound attributes is described and studied, with reference to machine learning applications and statistical experiments using a software synthesizer. The attributes include trajectory, vibrato, and statistical subsets, and subsets defined in the time, Fourier-transform, and partial domains. High correlations between some attributes are confirmed: this has application to the future choice of attributes for machine learning applications. The synthesizer's achievable attribute ranges give an indication of its relative flexibility and strengths, and the method described has application to synthesizer design.

1 Introduction

The aim of this research is to define, describe, and study a set of sound attributes. Since a large number of attributes have been used in recent timbral and machine-learning research, it is necessary to compare their usefulness in applications and their relative importance. In particular, we wish to test the hypothesis that some attributes are redundant in the presence of others. Secondly, we wish to determine the attribute ranges achievable by a particular synthesizer, and to compare them to attribute values exhibited by other instrument sounds. The attributes' statistical distributions are calculated in order to give an idea of the overall character of the synthesizer.

1.1 Motivation

Sound analysis applications, such as machine learning algorithms, require some form of re-representation of input sound signals. Digital audio is too low-level, in the sense that individual samples have little meaning in isolation, to be easily analysed without some form of pre-processing. Several pre-processing methods allow us to re-represent digital audio as windowed spectral data, either exactly (as with the Discrete Fourier or Wavelet transforms), or with some loss of information (as with Mel-Frequency Cepstral Coefficients). In either case the result is a vector of vectors, each vector representing spectral information for a small time-section of the input sound. Another possibility is a representation according to a re-synthesis model, such as the SMS decomposition into harmonics plus noise (Serra 1997).

1.2 Timbre

Another method is to define and measure a set of timbral attributes of the input sounds, where each attribute is intended not to quantify the energy in some spectral band in a single window, but to express some overall quality of the sound. Well-known examples are mean energy, mean spectral centroid, and attack time. One advantage of this approach is that many of these measures are known to be perceptually significant, as demonstrated by Grey (1975) and others. In applications involving direct human interaction with the measures, their high-level meaning is an asset. This is the method investigated in the current research.

1.3 Machine learning applications

Sound attributes have been used with some success in various areas of machine learning.
Perhaps the most common application is that of sound classification: automatic labelling of libraries of sound samples according to whether they contain speech or music, for example, or the genre of music they contain, or the instrument being played. Examples include the work of Martin and Kim (1998) and Tzanetakis and Cook (2002); Herrera-Boyer, Peeters, and Dubnov (2003) give a summary of work in this area. Different researchers have used different subsets of the available attributes, though there is general agreement on several of the core attributes. Another application is the Evolutionary Algorithm (EA): research by Horner, Beauchamp, and Haken (1993) and others

has used synthesizers driven by EAs in attempting to match the spectra of target sounds. Our previous work (McDermott, Griffith, and O'Neill 2005) compared the use of sound attribute comparisons in driving evolution with the more standard method using Fourier-transform comparisons.

1.4 Organisation of the rest of this paper

The work reported here builds on previous work by taking sound attributes discovered (section 1.2) and used (section 1.3) by other researchers, and applying them to a new domain of sounds: the outputs of a particular synthesizer. This problem domain differs from that of much previous work in that a synthesizer user is generally not interested in classification into discrete categories: the possible outputs are seen as part of a continuum. It is therefore necessary to establish the properties of the sound attributes in this context. In the following sections, we list and define the attributes; we report the distributions and ranges achieved by the synthesizer; we compare these to attribute values achieved by orchestral instruments; we look for correlations both between synthesizer parameters and attributes, and among the attributes; and we discuss the relative applicability of various attributes to machine learning applications.

2 Attributes

2.1 Definitions and terminology

The extraction of many of the attributes depends on extracting a few preliminary signals. Given an input sound signal x(t) of length T samples (with x(t) \in [-1, 1] \forall t), we define 2x-overlapped windows of length L, using the Hann window. X_i is the Fourier transform of the ith window. We define a centroid signal, C:

    C_i = \frac{\sum_{k=0}^{L/2} f(k)\,X_i(k)}{\sum_{k=0}^{L/2} X_i(k)}    (1)

We determine a list of partials for each window by finding the peaks in that window's Fourier transform, and then use the Two-Way Mismatch procedure (Maher and Beauchamp 1994) to find a pitch value for each window, giving a pitch signal P. Finally, we define windows, again of length L, but without overlap and with a flat windowing function. The RMS energy signal R is defined over these windows as follows:

    R_i = \sqrt{\frac{\sum_{t=0}^{L-1} x(t + iL)^2}{L}}    (2)

and similarly, a Zero-Crossing Rate signal Z is defined over the same windows:

    Z_i = \frac{\#\{j : x(j + iL) < 0,\ x(j + 1 + iL) > 0,\ j < L\}}{L}    (3)

The P and C signals thus contain 2T/L values for a sound of T samples, while the R and Z signals contain T/L.

2.2 Classification into subsets

The attributes do not break down hierarchically into subsets: for example, Pitch Vibrato Depth is both a partial-domain and a periodic attribute. We have chosen to classify the attributes into 6 overlapping groups: time domain, Fourier-transform domain, partial domain, trajectory, periodic, and statistical.

2.2.1 Time-domain attributes

* Attack Time (att)
* RMS Energy (rms): mean(R)
* Zero Crossing Rate (zcr; Burred and Lerch 2004)
* Crest Factor (crest; Eronen and Klapuri 2000):

    \text{crest} = \frac{\max_{t=0}^{T} x(t)}{\text{mean}(R)}    (4)

* Fast Modulation (fastm): the root-mean-square of the high-pass filtered R signal

2.2.2 Fourier-transform domain attributes

* Spectral Centroid (cen): mean(C)
* Spectral Spread (sprd)
* Spectral Flatness (flat; Herrera-Boyer, Peeters, and Dubnov 2003)
* Flux (flx; Burred and Lerch 2004): a measure of how much each Fourier-transformed window differs from that of the previous window:

    \text{flx}_w = \sum_{k} \left(X_w(k) - X_{w-1}(k)\right)^2    (5)

    \text{flx} = \sum_{w=1}^{2T/L} \text{flx}_w    (6)

* Presence (pres)
* Rolloff (roff)
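To make the preceding definitions concrete, the following NumPy sketch computes the preliminary C, R and Z signals of section 2.1 and some of the simpler attributes of sections 2.2.1-2.2.2. It is a minimal illustration, not the paper's released implementation: the window length and sample-rate defaults, the function names, and the dictionary interface are our assumptions, and the pitch signal P (which requires a two-way mismatch detector) is omitted.

```python
import numpy as np

def preliminary_signals(x, L=1024, sr=44100):
    """The C, R and Z signals of section 2.1. L and sr defaults are
    assumptions; the pitch signal P is omitted."""
    x = np.asarray(x, dtype=float)
    hop = L // 2                                   # 2x-overlapped windows
    hann = np.hanning(L)
    freqs = np.arange(L // 2 + 1) * sr / L         # f(k): bin frequencies
    spectra, C = [], []
    for start in range(0, len(x) - L + 1, hop):
        X = np.abs(np.fft.rfft(x[start:start + L] * hann))
        spectra.append(X)
        C.append(np.sum(freqs * X) / max(np.sum(X), 1e-12))   # eq. (1)
    # non-overlapping, flat-windowed frames for R and Z
    n = len(x) // L
    frames = x[:n * L].reshape(n, L)
    R = np.sqrt(np.mean(frames ** 2, axis=1))                 # eq. (2)
    Z = np.sum((frames[:, :-1] < 0) & (frames[:, 1:] > 0), axis=1) / L  # (3)
    return np.array(spectra), np.array(C), R, Z

def simple_attributes(x, L=1024, sr=44100):
    """Mean-based attributes, crest factor and flux from the signals above."""
    x = np.asarray(x, dtype=float)
    spectra, C, R, Z = preliminary_signals(x, L, sr)
    flx = float(np.sum((spectra[1:] - spectra[:-1]) ** 2))    # eqs. (5), (6)
    return {"rms": float(R.mean()), "cen": float(C.mean()),
            "zcr": float(Z.mean()),
            "crest": float(x.max() / max(R.mean(), 1e-12)),   # eq. (4)
            "flx": flx}
```

For example, simple_attributes(np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)) returns these values for a one-second 440Hz sine tone.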

2.2.3 Partial-domain attributes

* Pitch (pit): the median (to avoid the effects of outlying, anomalous pitch values) of the P signal, as calculated by two-way mismatch
* Pitch Error (twm.err): the average of the errors returned by two-way mismatch
* Tristimulus 1 (tri1; see Jensen (1999) for these attributes)
* Tristimulus 2 (tri2)
* Tristimulus 3 (tri3)
* Odd Ratio (odd)
* Even Ratio (evn)
* Irregularity (irr): a measure of how much the strengths of partials vary (a code sketch of this and several of the following measures appears after section 2.2.6):

    \text{irr}_i = \frac{\sum_{k=1}^{K} (a_k - a_{k+1})^2}{\sum_{k=1}^{K} a_k^2}    (7)

    \text{irr} = \text{mean}(\text{irr}_i)    (8)

(where a_k are the amplitudes of the K partials calculated for the ith window)

* Inharmonicity (inh): a measure of the amount and effect of any detuning of partials from the expected multiples of the fundamental:

    \text{inh}_i = 1 - \frac{\sum_{k=1}^{K} a_k \left(1 - \frac{|p_k - k P_i|}{k P_i}\right)}{\sum_{k=1}^{K} a_k}    (9)

    \text{inh} = \text{mean}(\text{inh}_i)    (10)

(where p_k are the frequencies of the ith window's partials, and P_i is that window's pitch)

2.2.4 Trajectory attributes

* Energy Temporal Centroid (tcn.rms): a measure of when, during the sound, the sound energy is most concentrated (Herrera, Yeterian, and Gouyon 2002):

    \text{tcn.rms} = \frac{\sum_{t=0}^{T/L} t\,R_t}{(T/L) \sum_{t=0}^{T/L} R_t}    (11)

* Energy Temporal Peakedness (tpk.rms): a measure of the degree to which the energy peaks or dips toward the middle of the sound:

    \text{tpk.rms} = \frac{\sum_{t=0}^{T/L} \left(1 - \frac{2\,|t - (T/L)/2|}{T/L}\right) R_t}{\sum_{t=0}^{T/L} R_t}    (12)

* Energy Heuristic Strength (hs.rms): "the peak height of the feature divided by the average value surrounding the peak" (Martin and Kim 1998):

    \text{hs.rms} = \frac{\max(R)}{\text{mean}(\{R_i : R_i \neq \max(R)\})}    (13)

* Energy Delta Ratio (dr.rms): "the ratio of the feature value during the transition from onset to steady state (~100 msec) to the feature value after the transition period" (Martin and Kim 1998)
* Centroid Temporal Centroid (tcn.cen): this and the following three attributes are defined over the C signal analogously to the previous four
* Centroid Temporal Peakedness (tpk.cen)
* Centroid Heuristic Strength (hs.cen)
* Centroid Delta Ratio (dr.cen)

2.2.5 Periodic attributes

* Energy Vibrato Depth (vdpth.rms): energy vibrato is also known as tremolo
* Energy Vibrato Rate (vrate.rms)
* Centroid Vibrato Depth (vdpth.cen)
* Centroid Vibrato Rate (vrate.cen)
* Pitch Vibrato Depth (vdpth.pit)
* Pitch Vibrato Rate (vrate.pit)

2.2.6 Statistical attributes

High and Low Feature-Value Ratios of the Energy, Centroid, and ZCR signals (Lu, Zhang, and Jiang 2002):

* Energy HFVR (hfvr.rms)
* Energy LFVR (lfvr.rms)
* Centroid HFVR (hfvr.cen)
* Centroid LFVR (lfvr.cen)
* ZCR HFVR (hfvr.zcr)
* ZCR LFVR (lfvr.zcr)
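Referring back to the Irregularity, Inharmonicity and trajectory definitions above, the sketch below transcribes equations (7), (9), (11) and (12) into NumPy. Since those equations are reconstructed here, the exact forms (particularly the inharmonicity weighting) should be read as assumptions, and the boundary-case defaults discussed below are omitted.

```python
import numpy as np

def irregularity_i(a):
    """Equation (7) for one window; a: amplitudes of the K partials.
    Adjacent differences run over k = 1..K-1 here (a_{K+1} undefined)."""
    a = np.asarray(a, dtype=float)
    return np.sum((a[:-1] - a[1:]) ** 2) / max(np.sum(a ** 2), 1e-12)

def inharmonicity_i(p, a, pitch):
    """Equation (9) for one window, as reconstructed above: amplitude-
    weighted deviation of partial frequencies p_k from multiples k*pitch."""
    p = np.asarray(p, dtype=float)
    a = np.asarray(a, dtype=float)
    k = np.arange(1, len(p) + 1)
    fit = 1.0 - np.abs(p - k * pitch) / (k * pitch)
    return 1.0 - np.sum(a * fit) / max(np.sum(a), 1e-12)

def temporal_centroid(R):
    """Equation (11). The paper's default of 0.5 for near-silent sounds
    is omitted here."""
    R = np.asarray(R, dtype=float)
    t = np.arange(len(R))
    return np.sum(t * R) / max(len(R) * np.sum(R), 1e-12)

def temporal_peakedness(R):
    """Equation (12): near 1 when energy peaks mid-sound, lower at edges."""
    R = np.asarray(R, dtype=float)
    N = len(R)
    w = 1.0 - 2.0 * np.abs(np.arange(N) - N / 2.0) / N
    return np.sum(w * R) / max(np.sum(R), 1e-12)
```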

Several attributes would have undefined values in boundary cases, such as where total energy is zero. Reasonable default values have been chosen in these cases: for example, the Energy Temporal Centroid for a silent or near-silent sound is defined to be 0.5. In other cases, attributes have been clamped between reasonable boundary values (for example, Pitch is clamped between 20Hz and 10,000Hz): this is necessary for the calculation of attribute difference values, discussed next.

2.3 Attribute differences

We can analyse the difference between two sounds by comparing their respective attribute values. For each attribute, a difference function can be defined, which depends on the attribute's theoretical upper and lower bounds, and on whether the attribute is supposed to have a logarithmic or a linear quality:

    d_i(s_1, s_2) = \left|f_i(v_i(s_1)) - f_i(v_i(s_2))\right|    (14)

where f_i(v) \in [0, 1] is a scaling function:

    f_i(v) = \frac{v - lb_i}{ub_i - lb_i}    (15)

for linear-domain attributes, and

    f_i(v) = \frac{1}{k} \log_2\left(1 + \frac{v - lb_i}{ub_i - lb_i} \left(2^k - 1\right)\right)    (16)

for log-domain attributes, where s_1 and s_2 are the two sounds, v_i(s) is the ith attribute value extracted from a sound s, k is an arbitrary constant here assigned the value 5 (as used by some synthesizers for this purpose), and ub_i and lb_i are the theoretical upper- and lower-bounds, respectively, for the ith attribute. Note that d_i \in [0, 1] \forall i. We can make an overall comparison between two sounds by combining the individual differences:

    d(s_1, s_2) = \frac{\sum_{i=1}^{n} w_i\,d_i(s_1, s_2)}{n}    (17)

where the weights w_i are taken to be equal to 1 if we require simple averaging, rather than weighting.
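The difference calculation is easily expressed in code. In the sketch below, the base-2 form of the log-domain scaling is our reading of equation (16), and the dictionary-based interface is purely illustrative.

```python
import math

def scale(v, lb, ub, log_domain=False, k=5.0):
    """Equations (15) and (16): map an attribute value into [0, 1]."""
    lin = (v - lb) / (ub - lb)
    if not log_domain:
        return lin
    return math.log2(1.0 + lin * (2.0 ** k - 1.0)) / k

def distance(v1, v2, bounds, weights=None):
    """Equations (14) and (17): weighted mean of per-attribute differences.
    v1, v2: dicts of attribute values for the two sounds;
    bounds: dict of name -> (lb, ub, log_domain)."""
    if weights is None:
        weights = {name: 1.0 for name in bounds}   # simple averaging
    total = 0.0
    for name, (lb, ub, logd) in bounds.items():
        d_i = abs(scale(v1[name], lb, ub, logd) - scale(v2[name], lb, ub, logd))
        total += weights[name] * d_i
    return total / len(bounds)

# e.g. distance({"pit": 440.0}, {"pit": 880.0}, {"pit": (20.0, 10000.0, True)})
```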
2.4 Perceptual Significance

Perhaps the best-known perceptual sound attributes are pitch, loudness (correlated with RMS Energy), brightness or sharpness (correlated with Spectral Centroid), and Attack Time. Pitch and loudness are not properly classed as "timbral", by definition: but we include their correlates (Pitch and RMS Energy) here because they are perceptually significant in discriminating between sounds, and because changes in their values over time often do produce a timbral effect (e.g. Attack Time is calculated based on changes in RMS Energy over time). Other attributes attempt to quantify the "trajectories" taken by primary attributes: for example, RMS Energy Temporal Centroid quantifies the amount by which RMS Energy rises or falls over the course of a sound. Since differences in RMS Energy and Spectral Centroid are perceptible, their changes over time can also be perceived: therefore the trajectory measures, such as Temporal Centroid, Temporal Peakedness, and Delta Ratio, and the periodic measures Vibrato Rate and Depth, when applied either to a signal of RMS Energy values or to a signal of Spectral Centroid values, are perceptually significant too.

The distinction between harmonic and non-harmonic sounds is also known to be perceptually significant, and this distinction can be quantified using attributes such as Harmonicity, Two-Way Mismatch Pitch-detection Error, and Spectral Flatness. Roughness is another attribute known to be perceptually significant (Terhardt 1974), though it is difficult to define in terms of digital signal processing (DSP) (Daniel and Weber 1997). In the current work, we use a simply-defined measure, Fast Modulation, as a substitute: according to Zwicker and Fastl (1990), fast RMS Energy modulation (in the range 15Hz-300Hz) is the major factor in determining roughness.

Several partial-domain measures, such as the Tristimulus measures and the Odd and Even Ratios, are known to be perceptually significant. A good example is the well-known "hollowness" of the clarinet, resulting from a relative absence of the even partials. The remaining attributes in our study, such as the High Feature-Value Ratio measures, have not been shown to have perceptual significance.

3 Results

3.1 Synthesizer

The synthesizer used is a slightly restricted version of the Xsynth synthesizer (Bolton 2005), an analogue-modular style subtractive synth written in C, featuring two oscillators, two assignable envelopes, one assignable low-frequency oscillator, and a six-mode filter. The full version of the synthesizer has 32 input parameters; however, to avoid an instability in the filter, and to prevent very large changes in pitch, a few of the parameter ranges have been restricted, and in two cases closed off altogether. The restricted version, and all software used in this research (in C, C++ and Python), is available for download at www.skynet.ie/~jmmcd. All sounds used in this study were 1.5s long, of which 0.5s was the "release tail", recorded after the note-off signal.

3.2 Achievable Ranges for the Xsynth Synthesizer

A Genetic Algorithm (GA; Goldberg 1989) was used to map the minimum and maximum achievable values for each attribute. A GA is a search technique suited to poorly-understood or oddly-shaped search spaces, which works according to the metaphor of natural selection: any candidate solution (here, a set of input parameters) can be regarded as an individual's genome, and mapped to a phenotype (here, a clip of digital audio). A fitness value is assigned to the phenotype according to the goal of the search: the fitness value determines the likelihood that an individual will be selected for reproduction, which happens by crossover among the individual genes (here, the individual input parameters); the other operator is mutation, which can cause a change in one or more genes. The overall effect is that a population of individuals can converge towards the goal of the search.

The GA used here was a steady-state GA run for 200 generations with 200 individuals in the population. Each individual genome consisted of 32 floating-point values, one per synthesizer parameter. The synthesizer, in mapping from the parameters to digital audio, performed the genotype-phenotype mapping. The objective function value of an individual was defined to be the value of the attribute under investigation: the GA was set to reward high (respectively low) objective values when maximising (respectively minimising) the attribute values. The replacement probability was 0.5, one-point crossover had a probability of 0.5, and Gaussian mutation had a probability of 0.1. This amounts to a fairly typical floating-point GA.

In table 1, for each attribute, we show a theoretical lower- and upper-bound, followed by the minimum and maximum values achieved by the GA-driven Xsynth over the course of 7 runs. For several of the attributes, the synthesizer achieves the full theoretical range. At least one value, the maximum achieved value for pitch (pit), is in error: the sound in question features high, out-of-tune partials, and appears to be near 6kHz, rather than the 9kHz reported by the two-way mismatch procedure. The maximum achieved value for Irregularity (irr) is also anomalous: Jensen (1999) says that irregularity should be below 2 by definition. This may be caused by differing methods of labelling partials with reference to the fundamental.

These results and this method are of potential use in the field of synthesizer design. For example, a synthesizer intended to accurately mimic a particular instrument can be automatically tested to determine whether it can achieve the attribute values achieved by a sample of that instrument's output.
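A sketch of such a GA loop follows. The population size, generation count and operator probabilities are taken from the text; the selection scheme (binary tournament) and the Gaussian mutation step size are our assumptions, and the evaluate callback stands in for rendering a patch through Xsynth and extracting the attribute under investigation.

```python
import random

def attribute_range_ga(evaluate, n_params=32, pop_size=200, generations=200,
                       maximise=True, p_cross=0.5, p_mut=0.1, p_replace=0.5):
    """Steady-state GA sketch following the description above.
    `evaluate` maps a parameter vector in [0,1]^n_params to the attribute
    value under study."""
    sign = 1.0 if maximise else -1.0
    pop = [[random.random() for _ in range(n_params)] for _ in range(pop_size)]
    fits = [sign * evaluate(g) for g in pop]

    def tournament():
        # binary tournament selection (an assumed selection scheme)
        i, j = random.sample(range(pop_size), 2)
        return i if fits[i] >= fits[j] else j

    for _ in range(generations):
        for _ in range(int(p_replace * pop_size)):   # children per generation
            p1, p2 = tournament(), tournament()
            child = pop[p1][:]
            if random.random() < p_cross:            # one-point crossover
                cut = random.randrange(1, n_params)
                child = pop[p1][:cut] + pop[p2][cut:]
            if random.random() < p_mut:              # Gaussian mutation
                g = random.randrange(n_params)
                child[g] = min(1.0, max(0.0, child[g] + random.gauss(0.0, 0.1)))
            worst = min(range(pop_size), key=fits.__getitem__)
            pop[worst] = child                       # steady-state replacement
            fits[worst] = sign * evaluate(child)

    best = max(range(pop_size), key=fits.__getitem__)
    return pop[best], sign * fits[best]
```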
key         lb      ub        min     max
att         0.0     1.0       0.0     1.0
rms         0.0     1.0       0.0     0.659
zcr         0.0     22050.0   0.0     20542.8
crest       0.0     1.0       0.0     1.0
cen         0.0     512.0     2.848   161.587
sprd        0.0     1.0       0.0     0.525
flat        0.0     1.0       0.0     1.0
flx         0.0     1.0       0.0     1.0
pres        0.0     1.0       0.0     0.5
roff        0.0     1.0       0.0     0.887
fastm       0.0     1.0       0.0     1.0
vdpth.rms   0.0     1.0       0.0     1.0
vrate.rms   0.0     20.0      1.0     15.0
vdpth.cen   0.0     1.0       0.0     1.0
vrate.cen   0.0     20.0      1.0     15.0
tcn.rms     0.0     1.0       0.0     0.847
tcn.cen     0.0     1.0       0.388   0.773
tpk.rms     0.0     1.0       0.0     0.776
tpk.cen     0.0     1.0       0.425   0.731
hfvr.rms    0.0     1.0       0.0     0.611
lfvr.rms    0.0     1.0       0.0     0.989
hfvr.cen    0.0     1.0       0.0     0.656
lfvr.cen    0.0     1.0       0.0     0.922
hfvr.zcr    0.0     1.0       0.0     0.667
lfvr.zcr    0.0     1.0       0.0     0.989
hs.rms      1.0     10.0      1.0     10.0
dr.rms      0.1     10.0      0.1     10.0
hs.cen      1.0     10.0      1.0     7.848
dr.cen      0.1     10.0      0.1     10.0
pit         20.0    10000.0   21.553  9999.61
twm.err     0.0     40.0      0.0     40.0
vdpth.pit   0.0     1.0       0.0     1.0
vrate.pit   0.0     20.0      1.0     15.0
inh         0.0     1.0       0.0     0.926
irr         0.0     10.0      0.0     2.158
tri1        0.0     1.0       0.0     1.0
tri2        0.0     1.0       0.0     0.995
tri3        0.0     1.0       0.0     1.0
odd         0.0     1.0       0.0     0.99
evn         0.0     1.0       0.0     0.995

Table 1: Lower- and upper-bounds and minimum and maximum achieved ranges for all attributes for the Xsynth synthesizer.

As an example, these results were compared with the values extracted for samples of violin and flute from the MUMS sample library (Opolko and Wapnick 1987). In all cases the values for the violin and flute lie within the ranges achievable by the Xsynth synthesizer. This result is necessary but not

sufficient for a realistic imitation of the instruments by the synthesizer (and in fact the synthesizer is probably not capable of producing very realistic imitations).

3.3 Statistical Results

The distributions of the attributes for Xsynth sounds were studied by generating 100,000 random Xsynth patches, transforming each to a clip of digital audio, and calculating its attributes. Again, these results are of potential use in synthesizer design. For example, a synthesizer designer aiming for a synthesizer with a "warm" character (correlated with low centroid values) will attempt to ensure not only that low-centroid sounds are achievable, but also that the distribution of centroid is left-skewed, so that synthesizer users are more likely to find sounds of this type.

[Figure 1: Distributions for Attack, Spectral Centroid, Pitch Vibrato Depth and Pitch Vibrato Rate]

For example, Fig. 1 gives clues to the character of the Xsynth synthesizer: the vast majority of sounds have a very short attack; most sounds are not too bright; and a fair proportion of sounds have significant Pitch Vibrato.

3.4 Correlations

Two correlation graphs are given at the end of this paper. Fig. 2 shows correlations between attribute values and Xsynth parameters for the 100,000-point dataset mentioned above. In general, correlations between attributes and parameters are very low: this confirms that the problem of controlling a synthesizer is a genuinely hard one, since there are few consistent relations between parameter changes and corresponding attribute changes. For example, note the large white area in the middle of the graph: this indicates that many of the envelope generator parameters (prefixed with "EG") are totally uncorrelated with any attribute. This occurs because each envelope can be assigned, affecting the movement of different internal synthesizer parameters, depending on the value of the envelope assignment parameters.

Fig. 2 also shows correlations among the attributes themselves. There is a two-fold redundancy in this graph. The black diagonal line shows that each attribute is totally correlated with itself, as expected. Other large correlations include Rolloff (roff) with Centroid (cen), Energy Delta Ratio (dr.rms) with Crest Factor (crest), Odd Harmonic Ratio (odd) with Tristimulus 3 (tri3), and Even Harmonic Ratio (evn) with Tristimulus 2 (tri2). These correlations are suggested by the attribute definitions. It is likely that a set of attributes including just one of each correlated pair will not omit too much useful information. This is important, because the performance of applications such as Neural Networks can be degraded by working with spaces of too high a dimensionality. The fact that so few attributes can be considered redundant confirms that the timbral space is of high dimensionality.

In general, there is greater correlation within the attribute subsets than between them. This confirms that the various subsets measure different types of information, and suggests that the best way to reduce the total number of attributes is by omitting a few from each subset, rather than by omitting any subset entirely. None of the subsets shows much more or less correlation than the others.
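The sampling and correlation study described in sections 3.3 and 3.4 reduces to a short loop; the sketch below illustrates it. The render and extract callbacks are assumed interfaces (patch to audio, audio to attribute vector), not functions from the paper's released code.

```python
import numpy as np

def attribute_study(render, extract, n=100_000, n_params=32, seed=0):
    """Random-sampling study as in sections 3.3 and 3.4."""
    rng = np.random.default_rng(seed)
    params = rng.random((n, n_params))               # random patches
    attrs = np.array([extract(render(p)) for p in params])
    # full Pearson correlation matrix over [parameters | attributes] columns
    full = np.corrcoef(np.hstack([params, attrs]), rowvar=False)
    param_attr = full[:n_params, n_params:]          # cf. Fig. 2, upper graph
    attr_attr = full[n_params:, n_params:]           # cf. Fig. 2, lower graph
    return attrs, param_attr, attr_attr
```

Histograms such as those in Fig. 1 can then be produced with np.histogram over the columns of the returned attribute matrix.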
Weaknesses

Most attributes discussed here are easy to implement with DSP algorithms; however a few, in particular those based on Pitch, and the Vibrato measures, are difficult to define or measure.

We have used the two-way mismatch pitch-detection method, which performs quite well, and has the advantage of giving an error estimate to accompany its results; however it is probably not as good as (e.g.) the YIN method (de Cheveigne and Kawahara 2002). No pitch-detection method is perfect: according to Sood and Krishnamurthy (2004), the best pitch detectors rarely achieve even 97% accuracy (this figure is quoted for human speech, not synthesised sound).

Similarly, the vibrato measures are based on a simplified version of the "minima-maxima detection" method described in Rossignol et al. (1998); a sketch of this simplified approach follows below. A more sophisticated method, or a fuller implementation of this method, might improve the detection of vibrato features. The Fast Modulation attribute is an inadequate substitute for roughness: more sophisticated models are given by Zwicker and Fastl (1990) and Daniel and Weber (1997).
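As promised above, here is a minimal sketch of a minima-maxima vibrato estimate over the pitch signal P (the same idea applies to the R and C signals for energy and centroid vibrato). It illustrates the general approach only; it is not a reconstruction of the paper's implementation or of Rossignol et al.'s full method.

```python
import numpy as np

def vibrato_rate_depth(P, frame_rate):
    """Estimate vibrato rate (Hz) and depth from local extrema of P.
    frame_rate: number of P values per second."""
    P = np.asarray(P, dtype=float)
    inner = P[1:-1]
    maxima = np.where((inner > P[:-2]) & (inner > P[2:]))[0] + 1
    minima = np.where((inner < P[:-2]) & (inner < P[2:]))[0] + 1
    if len(maxima) < 2 or len(minima) < 1:
        return 0.0, 0.0                              # no oscillation detected
    rate = frame_rate / np.mean(np.diff(maxima))     # cycles per second
    depth = float(np.mean(P[maxima]) - np.mean(P[minima]))  # peak-to-trough
    return float(rate), depth
```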

The set of 40 attributes discussed here represents the majority of the attributes used in recent research; however other attributes, such as some of those used by Keller and Berger (2001), could usefully be added.

Conclusions and Future Work

Our results confirm that certain pairs of attributes are quite strongly correlated: this will allow us to omit certain attributes in future Evolutionary Algorithm applications. However this is true of only a small number of attributes. The methods of calculating a synthesizer's attribute distributions and achievable ranges provide an automatic way of comparing the abilities of different synthesizers. Because the Xsynth synthesizer's achievable attribute ranges encompass the values required by instruments such as the flute and violin, the question arises as to whether realistic imitations of these instruments are possible, and (more realistically) if not, why not. Two possibilities have to be considered: the first is that the precise attribute values of (say) the violin are not all achievable at the same time; the second is that the attributes as described are simply not sensitive enough to differentiate between sounds which are distinguishable to the human ear.

Acknowledgements

Co-author James McDermott gratefully acknowledges the guidance of his co-authors and supervisors, and is supported by IRCSET grant no. RS/2003/68.
References

Bolton, S. (2005). XSynth-DSSI. http://dssi.sourceforge.net/, last viewed 2 March 2006.

Burred, J. J. and A. Lerch (2004). Hierarchical automatic audio signal classification. Journal of the Audio Engineering Society 52(7/8), 724-739.

Daniel, P. and R. Weber (1997). Psychoacoustical roughness: Implementation of an optimized model. Acta Acustica 83, 113-123.

de Cheveigne, A. and H. Kawahara (2002). YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America 111(4), 1917-1930.

Eronen, A. and A. Klapuri (2000). Musical instrument recognition using cepstral coefficients and temporal features. In Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 753-756. IEEE.

Goldberg, D. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley.

Grey, J. M. (1975). An Exploration of Musical Timbre. Ph.D. thesis, CCRMA, Dept. of Music, Stanford University.

Herrera, P., A. Yeterian, and F. Gouyon (2002). Automatic classification of drum sounds: a comparison of feature selection methods and classification techniques. In International Conference on Music and Artificial Intelligence.

Herrera-Boyer, P., G. Peeters, and S. Dubnov (2003). Automatic classification of musical instrument sounds. Journal of New Music Research 32(1), 3-21.

Horner, A., J. Beauchamp, and L. Haken (1993). Machine tongues XVI: Genetic algorithms and their application to FM matching synthesis. Computer Music Journal 17(4), 17-29.

Jensen, K. (1999). Timbre Models of Musical Sounds. Ph.D. thesis, Dept. of Computer Science, University of Copenhagen.

Keller, D. and J. Berger (2001). Everyday sounds: synthesis parameters and perceptual correlates. In Proceedings of the VII Brazilian Symposium on Computer Music.

Lu, L., H.-J. Zhang, and H. Jiang (2002). Content analysis for audio classification and segmentation. IEEE Transactions on Speech and Audio Processing 10(7), 504-516.

Maher, R. C. and J. W. Beauchamp (1994). Fundamental frequency estimation of musical signals using a two-way mismatch procedure. Journal of the Acoustical Society of America 95(4), 2254-2263.

Martin, K. D. and Y. E. Kim (1998). Musical instrument identification: A pattern-recognition approach. In Proceedings of the 136th Meeting of the Acoustical Society of America. Acoustical Society of America.

McDermott, J., N. J. Griffith, and M. O'Neill (2005). Toward user-directed evolution of sound synthesis parameters. In F. R. et al. (Ed.), EvoWorkshops 2005, Berlin. Springer-Verlag.

Opolko, F. and J. Wapnick (1987). McGill University Master Samples. Compact disc. http://www.music.mcgill.ca/resources/mums/html/, last viewed 7 July 2006.

Rossignol, S., P. Depalle, J. Soumagne, X. Rodet, and J.-L. Collette (1998). Vibrato: Detection, estimation, extraction, modification. In Proceedings of the DAFx98 Conference.

Serra, X. (1997). Musical sound modeling with sinusoids plus noise. In C. Roads, S. Pope, A. Picialli, and G. De Poli (Eds.), Musical Sound Processing. Swets and Zeitlinger.

Sood, S. and A. Krishnamurthy (2004). A robust on-the-fly pitch (OTFP) estimation algorithm. In Proceedings of the 12th Annual ACM International Conference on Multimedia.

Terhardt, E. (1974). On the perception of periodic sound fluctuations (roughness). Acustica 30, 201-213.

Tzanetakis, G. and P. Cook (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5), 293-302.

Zwicker, E. and H. Fastl (1990). Psychoacoustics: Facts and Models, Chapter 11, pp. 257-264. Springer.

[Figure 2: Attribute-parameter and attribute-attribute correlations for Xsynth; darker shades indicate greater correlations. The parameter axis comprises the 32 Xsynth parameters: OSC1-Pitch, OSC1-Waveform, OSC1-Pulse-Width, OSC2-Pitch, OSC2-Waveform, OSC2-Pulse-Width, Oscillator-Sync, Oscillator-Balance, LFO-Frequency, LFO-Waveform, LFO-Osc-Pitch-Mod, LFO-VCF-Cutoff-Mod, EG1-Attack-Rate, EG1-Decay-Rate, EG1-Sustain-Level, EG1-Release-Rate, EG1-Velocity-Sens, EG1-Osc-Pitch-Mod, EG1-VCF-Cutoff-Mod, EG2-Attack-Rate, EG2-Decay-Rate, EG2-Sustain-Level, EG2-Release-Rate, EG2-Velocity-Sens, EG2-Osc-Pitch-Mod, EG2-VCF-Cutoff-Mod, VCF-Cutoff, VCF-Resonance, VCF-Mode, Glide-Rate, Volume, and Tuning; the attribute axis comprises the 40 attributes of Table 1.]