Binary Decision Tree Classification of Musical Sounds

Kristoffer Jensen and Jens Arnspang
Department of Computer Science, University of Copenhagen
Universitetsparken 1, 2100 Copenhagen Ø, Denmark

Abstract

This paper presents a novel method of classifying musical sounds. Earlier work has shown that a subset of the timbre attributes of musical sounds suffices to classify musical sounds correctly into instrument families. This work focuses on the interpretation of the timbre attributes: which timbre attributes are useful for the classification of the sounds? These attributes are found by creating binary trees whose split questions are based on timbre attributes of musical sounds, including envelope, spectral envelope and noise parameters. The timbre attributes are calculated from the additive parameters for a large collection of musical instruments, generally over the full pitch range of each instrument. Experiments with recorded sounds from a variety of instruments have shown that the binary decision tree created with average entropy generally separates the instruments in a few levels. Most sounds from each instrument are collected in separate terminal nodes. The tree created with the timbre data can be used to classify new, unknown sounds, and helps in understanding which timbre attributes are pertinent in the identification of musical sounds.

1 Introduction

This paper presents a novel method of classification of musical sounds, using binary trees created with average entropy. The binary decision tree has been used in the speech recognition community for many years, in language models [1] and phonetic baseform determination [2]. It uses average entropy to decide on the best split in each node. This greedy method generally creates well-balanced trees.
The method uses a small number of automatically estimated timbre attributes [7], [8], including spectral envelope, envelope and noise parameters, but excluding pitch, amplitude and length, from a large number of sounds from different musical instruments. Listening tests have shown that these timbre attributes are sufficient to recreate a sound with good sound quality. They have furthermore been used to correctly classify 150 sounds from five different instruments using a log-likelihood function for normally distributed data [7]. Earlier work [7] used analysis of the eigenvectors of a principal component analysis, and the addition/removal of timbre attributes in maximum-likelihood classification, to determine the importance of the timbre attributes in the identification of musical sounds. The method presented in this paper gives more information about the importance of the timbre attributes in the identification of instruments. The tree created using these attributes can be used to classify new sounds, that is, instrument recognition. Other methods of classification of musical sounds can be found in, for instance, [4], [7], [12], [13]. However, the main aim of this paper is to study the attributes that are used to create the tree, thereby gaining knowledge about the importance of the timbre attributes when distinguishing between instruments. The question is, therefore, which timbre attributes best separate the sounds of, for instance, the piano from those of the flute.

2 Input Data

The trees are created using a few timbre attributes estimated from a large number of sounds. The estimation of the timbre attributes is presented in [7], [8]. The attributes are derived from the additive parameters [10], and they can be divided into spectral envelope attributes, envelope attributes, and noise attributes.
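As an illustration, the spectral-envelope part of such timbre attributes can be computed directly from the harmonic partial amplitudes a_k. The following is a minimal sketch, not the authors' code; the function names and the plain-list interface are assumptions, while the formulas follow the definitions cited in the text.

```python
# Spectral-envelope attributes from partial amplitudes a = [a_1, a_2, ...].
# Illustrative sketch; not the implementation used in the paper.

def brightness(a):
    """Amplitude-weighted mean partial index: sum(k * a_k) / sum(a_k)."""
    return sum((k + 1) * ak for k, ak in enumerate(a)) / sum(a)

def tristimulus1(a):
    """Relative strength of the fundamental: a_1 / sum(a_k)."""
    return a[0] / sum(a)

def tristimulus2(a):
    """Relative strength of partials 2-4: (a_2 + a_3 + a_4) / sum(a_k)."""
    return sum(a[1:4]) / sum(a)

def odd(a):
    """Relative strength of the odd partials: sum(a_{2k-1}) / sum(a_k)."""
    return sum(a[0::2]) / sum(a)

def irregularity(a):
    """Spectral irregularity: sum((a_k - a_{k+1})^2) / sum(a_k^2)."""
    return (sum((a[k] - a[k + 1]) ** 2 for k in range(len(a) - 1))
            / sum(ak ** 2 for ak in a))
```

Each function reduces a spectrum of partial amplitudes to a single scalar, which is what makes these attributes usable as split variables in the trees below.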
The spectral envelope attributes are calculated from the maximum amplitudes a_k of the harmonic partials. They are the brightness [3] (Σ_k k·a_k / Σ_k a_k), the tristimulus1 (a_1 / Σ_k a_k) and tristimulus2 ((a_2 + a_3 + a_4) / Σ_k a_k) [11], the odd value (Σ_k a_{2k-1} / Σ_k a_k), and the irregularity (Σ_k (a_k - a_{k+1})² / Σ_k a_k²) [6]. The envelope attributes are the duration, curve form and relative amplitude of the attack, sustain and release segments; each is calculated as the virtual fundamental value, i.e. the value at the fundamental of an exponential curve fitted to the values of the individual partials. The noise attributes are calculated in the same way; they are the standard deviation, filter coefficient and correlation of the shimmer (amplitude irregularity) and the jitter (frequency irregularity) [9], [7], [8]. Furthermore, the inharmonicity [5] is used.

3 Tree Construction

The tree is constructed by asking a large number of questions and, for each question, splitting the data (the

sounds) into two groups, calculating the goodness of split, and choosing the question that renders the best goodness of split. This is a greedy approach, but it renders fairly well-balanced trees with an appropriate goodness-of-split function. The main difficulties in this method are choosing the goodness-of-split function, the question set and the stop criteria.

3.1 Goodness of Split

The goodness of split used in this paper is the average entropy,

H̄(Y) = Σ_{l=1}^{2} p(l) H_l(Y)   (1)

where p(l) is the probability of leaf l, and H_l(Y) is the entropy of leaf l,

H_l(Y) = -Σ_j p(c_j | l) log2(p(c_j | l))   (2)

The lower the entropy, the better; the entropy is zero when a leaf contains elements of one class only.

3.2 Question Set

There remains the choice of questions to ask. Automatic methods for choosing questions exist, but manual methods generally render good results as well. Furthermore, single, simple or complex questions, such as the pylonic tree questions [1], can be asked. This paper uses an automatic method of choosing questions: all possible and necessary questions are asked, that is, for each timbre attribute the values are sorted, and all questions in between the values are asked. A simple question structure has been chosen for reasons of computational speed. In this structure, the answer is yes if timbre attribute k is larger than a given value v_1, and timbre attribute l is larger (or smaller) than a given value v_2,

TA_k > v_1  AND  TA_l > (<) v_2   (3)

For instance, the question could be "attack time larger than 60 msec AND brightness less than 6." The question chosen is the one with the smallest average entropy. If the constructed tree is to be used to classify unknown sounds, care must be taken to avoid too specialized a tree, which only fits the data used to create it. Methods to avoid this include not building too deep a tree, smoothing the tree, or testing all potential splits against some held-out data [1].
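The goodness-of-split computation and the exhaustive threshold search described in sections 3.1 and 3.2 can be sketched as follows. This is illustrative Python, not the authors' code; for brevity it uses single-attribute questions, whereas the compound question of Eq. (3) would add a second attribute/threshold pair.

```python
import math
from collections import Counter

def leaf_entropy(labels):
    """H_l(Y) = -sum_j p(c_j|l) * log2 p(c_j|l), Eq. (2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def average_entropy(left, right):
    """Probability-weighted entropy of the two children, Eq. (1)."""
    n = len(left) + len(right)
    return (len(left) / n) * leaf_entropy(left) \
         + (len(right) / n) * leaf_entropy(right)

def best_split(values, labels):
    """Try every threshold midway between consecutive sorted attribute
    values and return (threshold, average entropy) of the best split."""
    best = (None, float("inf"))
    svals = sorted(set(values))
    for lo, hi in zip(svals, svals[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x > t]
        right = [y for x, y in zip(values, labels) if x <= t]
        h = average_entropy(left, right)
        if h < best[1]:
            best = (t, h)
    return best
```

A split that isolates one class on each side reaches average entropy zero, which is exactly the paper's stop criterion for a leaf.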
The question finally chosen is thus the one minimizing the average entropy,

q̂ = arg min_{i,k} H̄(Y | q_{i,k})   (4)

3.3 Stop Criteria

There exist many methods for choosing when to stop growing a tree. These include looking at the decrease in entropy, the value of the entropy, the number of levels or leaves, or the number of elements in the leaf. The method used in this paper is to stop growing the tree when the entropy is zero, which is identical to saying that only one group (instrument) is left in the leaf.

4 Tree Analysis

Once a tree has been created, there are several potential uses for it. The obvious one is to use the tree to classify unknown sounds; instrument recognition could be used in, for instance, automatic scoring. Other uses include the analysis of the tree: which question splits the data? Are instruments first split into instrument families? Furthermore, the tree can be used to improve the analysis, by finding out what is wrong with the misplaced data. The main focus in this paper is to understand which timbre attributes split the sounds into instrument classes. The binary tree constructed using average entropy is well adapted for this use, for, as Bahl et al. [1] say, "...seeking questions which minimize entropy is just another way of saying that questions are sought which are maximally informative about the event being predicted...". In the following, the tree is shown only down to where the main part of an instrument is isolated. The questions, or occurrences of each instrument, are shown above (right aligned) or below (left aligned) each node. The abbreviations of the timbre attributes used are explained top left, and the order and total occurrences of the instruments are shown bottom left.

4.1 Experiment 1: 5 instruments: piano, violin, clarinet, flute and soprano

In this experiment, the binary tree has been created with the timbre attributes of five instruments: piano, violin, clarinet, flute and soprano. There are 30 sounds of each instrument.

Figure 1. Binary tree of instruments; 150 sounds from five instruments.
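Once such a tree exists, classifying an unknown sound is a simple traversal from the root, answering one compound question of the form of Eq. (3) per node. The sketch below is illustrative, not the paper's data structure; the attribute names and thresholds loosely follow the discussion of Figure 1, and the node layout is an assumption.

```python
# Illustrative tree traversal. A node is a dict holding a compound
# question ("TA_k > v1 AND TA_l >/< v2"); a leaf is an instrument label.
# Attribute names, thresholds and labels are made up for the example.

def classify(node, attrs):
    """Follow yes/no branches until a leaf (a string) is reached."""
    while isinstance(node, dict):
        k, v1, l, op, v2 = node["question"]
        yes = attrs[k] > v1 and (attrs[l] > v2 if op == ">" else attrs[l] < v2)
        node = node["yes"] if yes else node["no"]
    return node

# Hypothetical two-level tree in the spirit of Figure 1:
tree = {
    "question": ("sustain_jitter_std", 3.19e-3, "sustain_jitter_fc", "<", -0.75),
    "yes": "flute-or-soprano",        # strong low-frequency jitter (vibrato)
    "no": {
        "question": ("sustain_shimmer_std", 0.15, "attack_curve", "<", 0.97),
        "yes": "violin",              # irregular amplitude, logarithmic attack
        "no": "piano-or-clarinet",
    },
}
```

The depth of the walk is the number of questions needed, which is why a well-balanced tree classifies in few levels.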

The resulting tree is shown in figure 1. It is clear, by looking at the occurrences of the instruments in the leaves, that most sounds are classified correctly in the fewest possible leaves. The first node in the tree divides the instruments into flute and soprano on the left (true) side, with the question sustain jitter frequency coefficient less than -0.75 AND sustain jitter standard deviation greater than 3.19e-3. The jitter is the irregularity of the frequency tracks, and the frequency coefficient is a measure of how low-frequency the jitter is; therefore the left node includes sounds with strong low-frequency jitter, such as, for instance, vibrato. The flute is separated from the soprano sounds with the question attack curve form less than 1.25 AND sustain jitter std greater than 0.02. The attack curve form is a measure of how exponential the attack is. Therefore, the left node contains sounds with logarithmic, linear or slightly exponential attack and strong vibrato. In the right side of the tree, the violin sounds are separated from the piano and clarinet sounds with the question attack curve form less than 0.97 AND sustain shimmer std greater than 0.15. The shimmer is the irregularity of the amplitude tracks. The left node contains sounds with logarithmic attack and rather irregular amplitude, for instance tremolo. Finally, most of the piano sounds are separated from the clarinet sounds with the question sustain percent less than 0.59 AND inharmonicity greater than 5.9e-6. The sustain percent is the relative amplitude at the end of the sustain; a low percentage indicates a decaying sound. As could be expected, the left node (the piano sounds) contains decaying sounds with strong inharmonicity. In conclusion, the questions used to separate the sounds from the five instruments seem to express fundamental features of the sounds, and not expression particularities of the performers.
It is possible, still, that the presence of vibrato in the soprano sounds helped separate these sounds. Jitter correlation is, a priori, the best separator for vibrato sounds.

4.2 Experiment 2: piano, three intensities (p, mf and f)

In this experiment, three intensities of a MIDI-controlled acoustic piano are separated. There are 72 sounds of each intensity, here for convenience called piano (MIDI velocity 40), mezzo forte (72) and forte (104). The resulting tree is shown in figure 2. The most surprising thing here is that the mezzo forte sounds are separated first from the piano and forte sounds. The split question is release percent less than 0.06 AND jitter correlation greater than 0.35. The left node (most of the mezzo forte sounds) has a small amplitude at the end of the release (which could correspond to low background noise, or complete dampening) and rather correlated frequency tracks. The piano sounds are separated from the forte sounds mainly by the question brightness less than 12.23 AND release jitter std greater than 0.01. The piano sounds have, as could be expected, less brightness, and also more jitter in the release segment.

Figure 2. Binary tree of three different intensities of the piano.

4.3 Experiment 3: 7 instruments

In this experiment, a larger collection of instruments has been used to create the tree, using only single questions in each node. 7 instruments, totaling almost 1500 sounds, have been recorded, analyzed, and used for the construction of the tree shown in figure 3.

Figure 3. Large tree (piano, cello, clarinet, flute, soprano, violin, viola; 216, 139, 93, 130, 555, 267, 98 sounds).

Some interesting facts may be seen from the tree. Soprano (541/555) sounds have large jitter correlation and low sustain jitter filter coefficient. This indicates vibrato, or slow glissando, on these sounds. Piano (157/216) sounds have low jitter correlation, fast attack shimmer, large attack jitter and logarithmic end curve form. Violin (237/267) sounds and cello (54/139) sounds have low jitter correlation, fast attack shimmer and low attack jitter. The violin here has less than 48 msec attack time, whereas the cello has longer

attack time. The remaining cello sounds have large instead of low attack jitter and an exponential end curve. The viola sounds, the third member of this instrument family, have low jitter correlation, slow attack shimmer and low begin amplitude. The flute (83/130) has high begin amplitude, but otherwise the same values.

4.4 Experiment 4: 5 instruments, divided into two groups each

In this experiment, the same sounds as in experiment 1 have been used, but this time every second sound of each group has been placed in a new group. The resulting tree can be seen in figure 4.

Figure 4. Five instruments separated in two groups each (15 sounds per group).

As can be seen, it is very similar to the tree created using the 5 groups only. The conclusion must therefore be that the timbre attributes, and therefore the sounds, are very similar for each instrument, and that this method can indeed be used to classify musical sounds, certainly when the sounds are recorded at the same time.

5 Conclusions

This paper has presented the classification of musical sounds into instrument families using binary trees. The automatic tree construction, using average entropy, has created well-balanced, efficient trees. Several experiments, separating instruments and intensities, have been performed using this method. Some of the results of the analysis of the trees are that similar instruments (from the same instrument family, for instance) are generally classified as close, and that sounds from the same recording are classified as almost identical. Some conclusions are that the piano has high inharmonicity and low amplitude at the end of the sustain, and that the flute and soprano have rather much frequency irregularity, stemming from, for instance, vibrato. A benchmark test, including larger sets of instruments and sounds, is left (or right) for further application-oriented research.

References

[1] L. R. Bahl, P. F. Brown, P. V. de Souza, R. L. Mercer, A tree-based statistical language model for natural language speech recognition. IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-37, No. 7, July 1989.
[2] L. R. Bahl, S. Das, P. V. de Souza, M. Epstein, R. L. Mercer, B. Merialdo, D. Nahamoo, M. A. Picheny, J. Powell, Automatic phonetic baseform determination. Proc. ICASSP, 1991.
[3] J. Beauchamp, Synthesis by spectral amplitude and "brightness" matching of analyzed musical instrument tones. J. Audio Eng. Soc., Vol. 30, No. 6, 1982.
[4] S. Dubnov, N. Tishby, D. Cohen, Investigation of frequency jitter effect on higher order moments of musical sounds with application to synthesis and classification. Proc. of the Int. Comp. Music Conf., 1996.
[5] H. Fletcher, Normal vibrating modes of a stiff piano string. J. Acoust. Soc. Am., Vol. 36, No. 1, 1964.
[6] K. Jensen, Spectral envelope modeling. Proc. DSAGM, pp. 91-97, Dept. of Computer Science, University of Copenhagen, 1998.
[7] K. Jensen, Timbre Models of Musical Sounds. Ph.D. dissertation, Department of Computer Science, University of Copenhagen, 1999.
[8] K. Jensen, High Level Attributes: A timbre model of musical sounds. Computer Music Journal, submitted 1999.
[9] F. Klingholz, The measurement of the signal-to-noise ratio (SNR) in continuous speech. Speech Communication 6, 1987.
[10] R. J. McAulay, T. F. Quatieri, Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. on Acoustics, Speech and Signal Proc., Vol. ASSP-34, No. 4, August 1986.
[11] H. F. Pollard, E. V. Jansson, A tristimulus method for the specification of musical timbre. Acustica, Vol. 51, 1982.
[12] J. Sporring, A. Møller, P. Hjaresen, Automatic recognition of musical instruments. Proc. of the Nordic Acoustic Meeting, 1994.
[13] K. D. Martin, Toward automatic sound source recognition: Identifying musical instruments. NATO Computational Hearing Advanced Study Institute, Il Ciocco, Italy, July 1-12, 1998.

ICMC Proceedings 1999