# Timbre Recognition with Combined Stationary and Temporal Features


Shlomo Dubnov (Institute of Computer Science, The Hebrew University of Jerusalem, Israel) and Xavier Rodet (Analysis/Synthesis Team, IRCAM, Paris 75004, France)

## Abstract

In this paper we consider the problem of modeling the spectro-temporal behaviour of musical sounds, with applications to musical instrument recognition. Using instantaneous sound features, such as cepstral envelopes and cepstral derivatives, the temporal evolution of the sound is transcribed into a new representation as a sequence of spectral features. Applying information-theoretic sequence matching methods, the sound dynamics can be modeled and compared. Testing this method on recordings of several solo musical pieces for different instruments, excellent matching results were obtained.

## 1 Introduction

Most studies of instrumental sounds deal with the timbre of isolated sounds recorded in "laboratory" conditions, thus having practically fixed pitch and limited dynamics. When considering sounds recorded in natural conditions, e.g. the recording of a musical performance, one must consider the spectro-temporal behaviour of the sound, such as variations in sound parameters due to instrument dynamics and expressive inflections. These new questions are idiomatic to the specific instrument (involving subtle sound properties such as noise or the "texture" of a sound), and may even be related to the style of playing of a specific performer. As such, these questions are usually not considered to be in the "timbre" domain. In our work we have so far considered statistical models of the variations that occur during the sustained portion of a sound. Various aspects, such as phase coupling and its relation to Higher Order Statistical (HOS) analysis, were investigated in application to timbre recognition. It has been shown that this additional information is important for sound characterization, specifically regarding its "textural" aspects [3].
The purpose of the current work is to extend this research towards modeling of temporal behavior in sound and music. We examine the utility of a certain type of learning technique based on "universal", information-theoretic methods for sequence characterization and classification. In a companion paper, similar models were used for synthesis [4]. In order to understand the problems in comparing sounds, one must note that sound behavior spans different temporal scales. These include short-term correlations related to timbral properties (such as formants), correlations due to the pitch period, slower modulations such as vibrato, expressive inflections, and transitions between different notes. Thus a sequence that might seem stationary on one time scale departs from stationarity and ergodicity on another. This situation poses a problem for assessing the right probability function for the sequence of samples. Moreover, for purposes of classification, similarity measures between dynamic sounds are usually based on specific models, such as Markov models of a certain order. Here we suggest a different, original approach that represents sound as a sequence of instantaneous features and then models feature dynamics using information-theoretic methods for sequence modeling and comparison. The raw sound signal is converted into a symbolic representation by clustering the features into several representative values, thus effectively "quantizing" the features. Using Vector Quantization (VQ) methods [8], we create a sequence of discrete feature labels, i.e. a quantized feature sequence over a finite alphabet. Next we apply the notion of entropy to compare the sequences in terms of similarity between their statistical sources. This method provides a powerful scheme for comparison and generation of sequences without explicit knowledge of their statistical model.
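The entropy estimate underlying this scheme can be computed directly from the quantized symbol sequence. A minimal Python sketch of Lempel-Ziv phrase counting (the function name is ours, not from the paper):

```python
import math

def lz78_entropy_estimate(seq):
    """Estimate the entropy rate (bits/symbol) of a symbol sequence
    from a Lempel-Ziv incremental parse: split the sequence into
    distinct phrases and return c * log2(c) / n, which converges to
    the entropy rate for long stationary ergodic sequences."""
    phrases = set()
    current = ""
    for symbol in seq:
        current += symbol
        if current not in phrases:
            phrases.add(current)  # a "comma" is placed after each new phrase
            current = ""
    c = len(phrases) + (1 if current else 0)  # count a trailing partial phrase
    n = len(seq)
    return c * math.log2(c) / n if n and c > 1 else 0.0
```

A periodic sequence parses into few phrases and yields a low estimate, while a sequence with richer statistics yields a higher one, so the estimate can be used to compare sources without an explicit model.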
Using the Lempel-Ziv (LZ) [10] data compression algorithm, a very simple parsing of a sequence is performed by putting a comma at every new substring found in the original string. With $c(n)$ the number of commas (distinct sub-phrases) in a sequence of length $n$, Ziv and Lempel have shown that the quantity $c(n) \log_2 c(n) / n$ converges asymptotically to the entropy as $n$ grows. By growing a parsing tree, an estimate of the conditional entropy of a symbol given a subsequence is obtained. A clustering algorithm suggested by El-Yaniv, Fine and Tishby [5] provides an improvement over the Ziv-Merhav [11] "universal" sequence classification method. Moreover, their mutual source algorithm is used to create new sequences to represent an

"average" for a set of sources. This allows for a distributional clustering procedure that iteratively finds similar statistical sources and replaces pairs of sequences by their mutual source representation. It is important to note that Markovian models have been suggested as powerful tools for music analysis and composition. One of the main difficulties in using the Markov approach is the need to explicitly specify the model and estimate its parameters. This problem is alleviated in our "agnostic" approach. Moreover, this sequence matching method is "universal" in the sense of not assuming any a priori knowledge of the source statistics, while having asymptotic performance as good as any Markov or finite-state model. To summarize our method, the investigation of the temporal structure of sound and music was done along the following lines:

1. The short-time temporal evolution of the sound is described by specific features such as the cepstral envelope and the cepstral derivative.
2. For the long-term behavior of the signal we applied information-theoretic tools for classification of the feature sequences.
3. The agnostic clustering method was also applied to a database of musical pieces by different composers from different periods.

## 2 Feature Extraction

One of the problems in modeling sound is the multitude of features that are relevant to our perception. These include sound intensity, the overall spectral envelope that defines the color of the sound, and longer correlations that correspond to pitch and repetitive properties of the signal. A convenient representation of the different spectral features is by means of cepstral coefficients [9], [1].
For a power spectrum $S(\omega)$, the cepstral coefficients represent the Fourier series expansion of $\log S(\omega)$:

$$\log S(\omega, t) = \sum_{n=-\infty}^{\infty} c_n(t) e^{-in\omega} \qquad (1)$$

Note that the first coefficient corresponds to the overall energy of the signal, the next few low-order coefficients correspond to the overall spectral shape (interpreted as the filter transfer function in a source-filter model), and the remaining coefficients contain a set of decaying, regularly spaced peaks due to pitch periodicities and the finer details of the power spectrum. For the short-time description of the sound we use a set of spectral envelopes, very much like in speech, which allows for up to 90% data reduction in the sound representation. In order to capture the transitional spectral characteristics, we used the cepstral derivative as an additional feature. Spectral variation in time is represented by

$$\frac{\partial \log |S(\omega, t)|}{\partial t} = \sum_{n=-\infty}^{\infty} \frac{\partial c_n(t)}{\partial t} e^{-in\omega} \qquad (2)$$

Since the cepstral coefficient series $c_n(t)$ does not have an analytic form, the derivative $\partial c_n(t)/\partial t$ can be approximated by a finite difference. A first-order difference between neighboring cepstra was used as an estimate of the derivative at a given time instant. In order to capture the information present in the higher cepstral coefficients as well, additional parameters were considered. One must note that the higher cepstral coefficients correspond to the excitation signal (also called the residual). Variations in the fundamental frequency are important for characterization of the transitional spectra and should be considered in correlation with the cepstral derivative. Also the Higher Order Statistical (HOS) parameters, which describe the residual signal in terms of its non-linear properties, seem to be promising features for sound classification. These features are related to phase coupling in harmonic signals and texture properties in noise signals, and were shown to be characteristic of different instrumental families in stationary conditions.
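As a minimal sketch of these two features (assuming NumPy; the frame length, the number of coefficients p, and the eps floor are illustrative choices, not values from the paper):

```python
import numpy as np

def cepstral_coefficients(frame, p=20, eps=1e-10):
    """First p real-cepstrum coefficients of one windowed frame:
    the inverse FFT of the log magnitude spectrum.  c_0 reflects the
    overall energy; the low-order terms describe the envelope."""
    log_spectrum = np.log(np.abs(np.fft.rfft(frame)) + eps)  # eps avoids log(0)
    return np.fft.irfft(log_spectrum)[:p]

def delta_cepstra(cepstra):
    """First-order difference between neighboring cepstral frames,
    used as a finite-difference estimate of the cepstral derivative."""
    return np.diff(cepstra, axis=0)
```

Stacking per-frame cepstra row-wise gives the feature sequence; `delta_cepstra` then yields one difference vector per frame transition.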
In order to be able to consider HOS features in a dynamic situation, the HOS values must be properly normalized in terms of their dependence on pitch and signal energy. We plan to consider these parameters in future work.

## 3 Vector Quantization

As mentioned above, we need a finite alphabet for the Markovian model. This requires quantizing the features, preferably so as to represent our sounds in terms of 8-bit vectors. This gives 5-6 bits for the spectral envelopes (32-64 typical spectra) and 2-3 bits for the other, more specific features.

The standard Linde-Buzo-Gray (LBG) algorithm [8] was used for vector quantization. The distance measure used for comparison between cepstral coefficients and between cepstral derivatives was the Euclidean measure. Using $p$ cepstral coefficients without the gain term $c_0$, the comparison of the spectral shapes of two cepstrally smoothed log spectra gives

$$d_{CEP} = \sum_{n=1}^{p} (c_n - c_n')^2 \qquad (3)$$

and similarly for the transitional cepstra

$$d_{\Delta CEP} = \sum_{n=1}^{p} (\Delta c_n - \Delta c_n')^2 \qquad (4)$$

The VQ algorithm requires a proper initialization, i.e. an initial set of sound envelopes from which the quantization codebook is iteratively designed. We took an evenly spaced selection of initialization points along the time axis of the training signal. At each iteration the new codebook reduces the error, i.e. improves the signal-to-quantization-noise ratio (SQNR), until a certain threshold is reached and the iterations are terminated. As an example, Figure 1 presents the results of a VQ performed with 4 envelope types and $p = 20$ cepstral coefficients. The sound analyzed was a 3 second excerpt from a Brahms clarinet sonata.

Figure 1: The segmentation of sound into 4 typical envelope types, as done by applying the LBG VQ method to vectors containing the 10 first cepstral coefficients. The bottom graph represents the error between the actual envelope and the quantized value.

Figure 2 presents the spectral envelopes that correspond to some of the codebook vectors designed by LBG. As can be seen from the figure, the quantization corresponds roughly to a segmentation of the sound into areas where the spectral envelope differs. A big error occurs at transitions or sound attacks (starting points of new notes), which thus require a special representation.
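A simplified LBG-style codebook design with the evenly spaced initialization described above can be sketched as follows (assuming NumPy; the stopping threshold and iteration cap are illustrative):

```python
import numpy as np

def lbg_vq(features, k=4, max_iter=50, tol=1e-6):
    """Design a k-entry codebook for the feature vectors (one row per
    frame) with an LBG/k-means style iteration.  Initial codewords are
    taken at evenly spaced frames; distance is the Euclidean measure."""
    n = len(features)
    codebook = features[np.linspace(0, n - 1, k).astype(int)].copy()
    prev_err = np.inf
    for _ in range(max_iter):
        # assign every frame to its nearest codeword
        dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        err = dists[np.arange(n), labels].mean()
        # move each codeword to the centroid of its cluster
        for j in range(k):
            if (labels == j).any():
                codebook[j] = features[labels == j].mean(axis=0)
        if prev_err - err < tol:  # stop when the error no longer improves
            break
        prev_err = err
    return codebook, labels
```

The returned `labels` array is the quantized feature sequence over a finite alphabet of `k` symbols, which is exactly the representation the later sections operate on.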
## 4 Sound Clustering Based on Spectral Envelopes

In order to see the effect of spectral envelope VQ, a small database of 18 sounds was analyzed into segments of 12 ms with 6 ms overlap between segments (the original sounds were 1 second in duration, thus resulting in approximately 166 segments) and then clustered by a VQ with 25 codes (allowed clusters).

Figure 2: Spectral envelopes that correspond to the codebook vectors.

The results of the classification are shown in figure 3, in terms of the cluster each sound sample belongs to. As one can see, most of the sounds were classified correctly (i.e. a single cluster center for all samples from the same sound). The errors occur mostly at the beginning or end portions of the sound, which is reasonable. The exceptions are the violin and viola, which have several samples misclassified. It is also interesting to observe that the oboe vibrato and non-vibrato samples are consistently confused between two clusters, which is again understandable.

## 5 Transitional Spectral Characterization

As seen in section 3, the transition portions of a signal cannot be represented by cepstral envelopes alone, due to the big error in this portion of the sound (see figure 2). To deal with this problem we added a cepstral difference feature, calculated as a first-order difference between neighboring cepstra. In order to evaluate the number of cepstral coefficients required for the transitions, an L2 norm was considered for the cepstral coefficients between 0-1 msec and 1-6 msec. The reason for this choice was to compare the cepstral part due to the envelope only with the part that includes the pitch. The cepstral difference norm can be obtained for cepstral coefficients between cepstral times $[t_1, t_2]$. Figure 4 shows the two cepstral derivative norms for a clarinet example. Note that there are two segments that have a big spectral envelope change but no pitch change, and vice versa, two different pitches that were played with the same "color".
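The band-limited cepstral difference norm can be sketched as follows (a minimal sketch assuming NumPy, and assuming the cepstrum was computed at sampling rate `sr`, so that index n corresponds to quefrency n/sr seconds; the names are ours):

```python
import numpy as np

def cepstral_diff_norm(delta_c, sr, t1, t2):
    """L2 norm of a cepstral difference vector restricted to the
    quefrency band [t1, t2] seconds.  The 0-1 msec band captures
    envelope change; the 1-6 msec band mostly pitch-related change."""
    n1, n2 = int(t1 * sr), int(t2 * sr)
    return float(np.linalg.norm(delta_c[n1:n2]))
```

Computing the two norms per frame and plotting them against time yields curves like those of figure 4, separating envelope change from pitch change.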
## 6 The Temporal Model and "Agnostic" Clustering

The quantized features are combined into a cross alphabet, so as to represent the long-term spectral dynamics by means of sequences of feature symbols. If we now assume that these sequences are derived from a stochastic source, we can compare them by calculating the cross-entropy between their sources.

Figure 3: Clustering of 18 different instrumental sounds (166 samples each) into 25 clusters. Most of the samples of each sound were clustered into a single cluster.

Figure 4: Cepstral derivative norm, calculated for coefficients between 0-1 msec and 1-6 msec (the msec are in terms of cepstral time).

Calculation of the distortion between probabilities $P(Y)$ and $Q(Y)$ is achieved by means of the Kullback-Leibler (KL) divergence [7]:

$$D[P \,\|\, Q] = \sum_{y} P(y) \log \frac{P(y)}{Q(y)} \qquad (5)$$

Given a typical sample $Y = \{y_0, \ldots, y_N\}$ drawn from probability $P$, one might use the KL distance to evaluate the probability that the same sample appears under $Q$. It can be shown that for typical sequences (or if $N$ is big enough)

$$Q(Y) \propto e^{-N D[P \| Q]} \qquad (6)$$

In other words, close probabilities generate the same high-probability (typical) sequences. The cross-entropy has several serious drawbacks, such as asymmetry and, even worse, being unbounded and unreliable for low-probability events in $Q$. A better approach is to use a symmetric and bounded version of the cross-entropy known as the Jensen-Shannon measure [6]. This measure was generalized to Markov sources by El-Yaniv, Fine and Tishby [5] and was used for our classification tests. The reader interested in more details about the distance measure and its associated clustering method is referred to the paper by Dubnov, El-Yaniv and Assayag [2] appearing in these proceedings.
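A minimal sketch of the two measures over discrete distributions (assuming NumPy; the eps floor guards the unbounded KL term against zero probabilities in Q; note the paper actually uses the Markov-source generalization of [5], not this basic form):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D[P || Q] of eq. (5), in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def jensen_shannon(p, q):
    """Symmetric, bounded alternative to KL: the mean KL divergence
    of P and Q to their average distribution."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike the KL divergence, the Jensen-Shannon measure is symmetric in its arguments and bounded by log 2, which makes it a safer distance for clustering.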

## 7 Sound Matching Results

In order to demonstrate how the quantization routine functions in conjunction with the agnostic clustering method, we coded and then clustered a total of 20 music excerpts of 2.75 sec each, taken from 5 music pieces (4 excerpts from each piece). Accidentally, the first two pieces were identical, thus giving 4 distinct instruments and two groups of identical sounds. The different pieces were a clarinet sonata, a piece for bassoon, a piece for flute, and a piece for violin. A description of the analysis steps is given below:

* The pieces were analyzed in terms of cepstral coefficient (cc) and cepstral derivative (dc) coefficients.
* The cc and dc arrays of vectors were quantized using the LBG method. The clustering routine was run iteratively until the SQNR stopped improving.
* The original cc and dc were quantized in terms of the final codebooks obtained in the earlier step.
* The agnostic clustering routine was applied to the sounds, coded as ASCII files. This routine performs greedy matching, combining at each step the two closest sequences and creating a clustering tree.
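The greedy matching step can be illustrated with a toy stand-in (hypothetical names; it scores sequence pairs with a zeroth-order Jensen-Shannon divergence over symbol frequencies rather than the Markov-source measure of [5], and merges by concatenation rather than by building a true mutual source):

```python
import math
from collections import Counter
from itertools import combinations

def symbol_distribution(seq, alphabet):
    """Empirical symbol frequencies of a sequence over a fixed alphabet."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n for a in alphabet]

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def greedy_cluster(sequences):
    """Greedy agglomerative clustering: repeatedly merge the two
    closest sequences and record the member indices of each merge,
    yielding a clustering tree bottom-up."""
    alphabet = sorted(set("".join(sequences)))
    items = [([i], s) for i, s in enumerate(sequences)]
    merges = []
    while len(items) > 1:
        i, j = min(combinations(range(len(items)), 2),
                   key=lambda ij: js_divergence(
                       symbol_distribution(items[ij[0]][1], alphabet),
                       symbol_distribution(items[ij[1]][1], alphabet)))
        merged = (items[i][0] + items[j][0], items[i][1] + items[j][1])
        merges.append(merged[0])
        items = [it for k, it in enumerate(items) if k not in (i, j)] + [merged]
    return merges
```

On excerpts coded with mostly disjoint symbol sets, as in the example below, even this zeroth-order stand-in groups the excerpts of each piece before merging across pieces.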
For example, the first excerpt of the bassoon piece was coded as:

    6EiHLXI8888s 423334455555255552822528285585555555255787825587222868888727555585
    522277782875668775544558676667587822827766762225822255852528855276758558822277
    258887776225578678676766667676BBBBBBBBHBH6EEEBHB6?6??BEEBBBB6BBBEE6HHE?HEE iii ii i @ 4444 i @44A4A4GG:@4ý4:::@@@444:A:4: A i@A4:88ii i3i3iB999Ei3331455552

and the last clarinet segment was coded as:

    LahZZhShSShSLL hlSSZaZhhSZahZhjLLShZZZZNNIPAZ"@h@hSDDDLLLLjaahjLSShhahS ihNNN
    Z'Ziii"NHTTEE~LLLLHH]xZ'TZ'N'NZTZNNZi'Z'NHifBEEZZ~LLLHHHNiiBZ"'ZZ'NiNNZTBThH
    HT]TTTTTiii'Z'NNNNN'NH'fBBEZZhLL6HHT]]]]]oxBfKBBZh769T]fffTTIZTHNofBBBEZZLLShh
    hLLLHHHHTTT]offxxoxBoBoxB]]]o]xx]xBox]]xxxx]]]BBxBxBxxB]]]oooo]xx]x]]o]]oxB]]]

One can see that the two ASCII representations of the sound files are mostly coded with different symbols, which makes the distinction easy even without considering the temporal differences of the sequences. Nonetheless, there are still several common symbols, such as ý, i, @, 6, H, L, B, E. The agnostic clustering provided a correct result, as demonstrated in figure 5.

Figure 5: Clustering tree of 20 music excerpts from 4 different instruments. Segments {00-03} and {04-07} were identical. Each four successive segments {08-11}, {12-15} and {16-19} belong to a single instrument.

## Acknowledgments

The authors are most grateful to Ran El-Yaniv for providing us with the clustering routines and for the important remarks and fruitful discussions of this work.

## References

[1] J. Deller Jr., J. Proakis and J. Hansen, "Discrete-Time Processing of Speech Signals", Prentice Hall, 1987.
[2] S. Dubnov, R. El-Yaniv and G. Assayag, "Universal Classification Applied to Musical Sequences", in this volume (ICMC 98).
[3] S. Dubnov and X. Rodet, "Analysis of Sound Aperiodicities", ICMC, Thessaloniki, 1997.
[4] S. Dubnov and X. Rodet, "Study of Spectro-Temporal Parameters in Musical Performance for Expressive Instrument Synthesis", to appear in Proceedings of the IEEE Conference on Systems, Man and Cybernetics, California, 1998.
[5] R. El-Yaniv, S. Fine and N. Tishby, "Agnostic Classification of Markovian Sequences", to appear in NIPS, 1997.
[6] J. Lin, "Divergence Measures Based on the Shannon Entropy", IEEE Trans. Information Theory, 37(1), 145-151, 1991.
[7] S. Kullback, Information Theory and Statistics, John Wiley and Sons, New York, 1959.
[8] Y. Linde, A. Buzo and R. Gray, "An Algorithm for Vector Quantizer Design", IEEE Trans. Commun., vol. COM-28, pp. 84-95, 1980.
[9] L. Rabiner and B. Juang, "Fundamentals of Speech Recognition", Prentice Hall, 1993.
[10] J. Ziv and A. Lempel, "Compression of Individual Sequences via Variable-Rate Coding", IEEE Trans. Information Theory, 24, 1978.
[11] J. Ziv and N. Merhav, "A Measure of Relative Entropy Between Individual Sequences with Applications to Universal Classification", IEEE Trans. Information Theory, vol. 39, no. 4, 1993.