Clustering of Musical Sounds using Polyspectral Distance Measures

Shlomo Dubnov and Naftali Tishby
Institute for Computer Science and Center for Neural Computation
Hebrew University, Jerusalem 91904, Israel

ICMC Proceedings 1995, pp. 460-466

ABSTRACT: This paper describes a hierarchical clustering of musical signals based on information derived from spectral and bispectral acoustic distortion measures. The clustering reveals the ultrametric structure that exists in the set of sounds, with a clear interpretation of the distances between the sounds as the statistical divergence between the sound models. Spectral, bispectral and combined clustering results are presented.

1 Introduction

The ability to assess similarity between musical timbres is basic both to the understanding of our musical perception and to compositional practice. The question was originally addressed by psychoacousticians, who devised perceptually relevant timbre spaces based on human listening experiments. These results, reported in a series of classic papers (Grey), have become the standard representation of the timbre space of musical instruments. We propose to probe this space with mathematical tools that measure similarity directly from the musical signal. Acoustic distortion measures, although well known in the speech literature (Gray Markel), have received little attention in music applications. One reason is that the underlying linear model used for these functions cannot capture the wealth of sound found in musical signals. Here we discuss distance measures for acoustic signals based on spectral and higher order statistical information (Nikias Raghuveer) drawn from the steady state portion of the signal. Special attention is given to musical signals whose timbral properties are difficult to model.
The following issues are addressed in this paper: (1) the notion of higher order spectra, (2) the acoustic distance measure and its generalization to bispectral information, (3) a method for clustering stochastic models, and finally (4) an application to the classification of musical instruments.

2 Higher Order Statistics

To explain our motivation for studying higher order correlations, we briefly recapitulate some of the reasons for using the ordinary double correlation. The common assumption is that our ears perform spectral analysis of the incoming signal. Naturally, not all of the signal information is retained in our ears, and the simplest assumption is that the phase is ignored. It is well known that the squared amplitude of the Fourier spectrum (the power spectrum) is the Fourier transform of the signal's autocorrelation. Under this assumption, the autocorrelation in the time domain is the basic information extracted from the acoustic signal by our auditory system; in the frequency domain it represents the signal's spectral envelope. We now intend to extend the scope of the acoustic analysis by including triple and possibly higher correlations of the signal, whose frequency-domain counterparts are known as polyspectra.

The kth-order correlation $h_k(i_1, \ldots, i_{k-1})$ of a signal $\{h(i)\}_{i=0}^{N}$ is defined as

$$h_k(i_1, \ldots, i_{k-1}) = \sum_{i=0}^{N} h(i)\, h(i+i_1) \cdots h(i+i_{k-1}) \qquad (1)$$

In the frequency domain it corresponds to the kth-order spectrum,

$$H_k(\omega_1, \ldots, \omega_{k-1}) = \sum_{i_1, \ldots, i_{k-1}=-N}^{N} h_k(i_1, \ldots, i_{k-1})\, e^{-j(\omega_1 i_1 + \cdots + \omega_{k-1} i_{k-1})} \qquad (2)$$

$$= H(\omega_1) \cdots H(\omega_{k-1})\, H(-\omega_1 - \cdots - \omega_{k-1}).$$

3 Acoustic Distortion Measures

Acoustic distortion measures are widely used in the speech literature as a means of signal comparison, and are closely related to the feature representation of the signal. One of the most common representations of speech signals is linear predictive coding (LPC), which provides a reasonable representation for speech analysis and synthesis. We believe, however, that this model is insufficient for the acoustic representation of musical signals, and we suggest a simple extension to it that still avoids the numerically expensive schemes encountered in other common musical signal representations (such as phase vocoding).

The LPC representation can be considered an acoustical model that treats the signal as a random Gaussian excitation passing through a linear resonant chamber. A Gaussian random excitation is completely characterized by its first two moments and thus contains no additional information in its higher order spectra. In a previous work (Dubnov et al.) we suggested an acoustic model that treats the signal as a random non-Gaussian excitation passing through a linear resonant chamber. Using this model, an acoustic distortion measure based on higher order statistics of the signal was derived. It was also shown that this measure is a generalization of the Itakura-Saito distortion measure that includes bispectral information (footnote 1). Thus, we propose a model of a linear filter driven by non-Gaussian white noise, which may better capture aspects of musical sound related to the timbre or 'sound color' of the musical instrument.
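As an illustration of the source-filter model described above, the following minimal sketch (not the estimation procedure of the paper) drives the same one-pole filter with Gaussian and with non-Gaussian white noise and compares a simple segment-averaged bispectrum estimate at one frequency pair. The filter coefficient, the exponential noise source, the segment length and the names `one_pole_filter`, `dft_bin` and `bispectrum_estimate` are all assumptions chosen for the example. For a real signal, $H(-\omega_1 - \omega_2)$ is the complex conjugate of $H(\omega_1 + \omega_2)$, which is the form the estimate uses.

```python
import cmath
import math
import random

def one_pole_filter(excitation, a=0.8):
    """Pass the excitation through y[n] = a*y[n-1] + e[n] (a toy 'resonant chamber')."""
    out, prev = [], 0.0
    for e in excitation:
        prev = a * prev + e
        out.append(prev)
    return out

def dft_bin(x, k):
    """Single DFT coefficient X(k) of the sequence x."""
    n = len(x)
    return sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(x))

def bispectrum_estimate(signal, k1, k2, seg_len=64):
    """Average X(k1) X(k2) conj(X(k1+k2)) / seg_len over non-overlapping segments."""
    acc, count = 0j, 0
    for start in range(0, len(signal) - seg_len + 1, seg_len):
        seg = signal[start:start + seg_len]
        mean = sum(seg) / seg_len
        seg = [v - mean for v in seg]  # remove the segment mean
        acc += dft_bin(seg, k1) * dft_bin(seg, k2) * dft_bin(seg, k1 + k2).conjugate()
        count += 1
    return acc / (count * seg_len)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 64 * 1000
gaussian_noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
# Centered exponential noise: zero mean, unit variance, non-zero third moment.
skewed_noise = [rng.expovariate(1.0) - 1.0 for _ in range(n_samples)]

b_gauss = bispectrum_estimate(one_pole_filter(gaussian_noise), 2, 3)
b_skewed = bispectrum_estimate(one_pole_filter(skewed_noise), 2, 3)
print("Gaussian excitation:     |B| ~", round(abs(b_gauss), 2))
print("non-Gaussian excitation: |B| ~", round(abs(b_skewed), 2))
```

With the Gaussian excitation the estimate should fluctuate around zero and shrink as more segments are averaged, whereas the skewed excitation should leave a clearly non-zero value. This is the kind of additional information that the non-Gaussian model is meant to capture.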
With a well-defined statistical model of the signal, the problem of measuring distances between two signals becomes the question of evaluating the statistical similarity between the model distributions. The statistical similarity is measured by the Kullback-Leibler (KL) divergence (Kullback) between the distributions, and provides us with a distortion function

$$D(P \| P') = -N \left[ (\lambda_0 - \lambda_0') + \left( \lambda_2 \mu_2 - \lambda_2' \mu_2' \int \frac{d\omega}{2\pi} \frac{S_y(\omega)}{S_{y'}(\omega)} \right) + \left( \lambda_3 \mu_3 - \lambda_3' \mu_3' \int \frac{d\omega_1\, d\omega_2}{(2\pi)^2} \frac{B_y(\omega_1, \omega_2)}{B_{y'}(\omega_1, \omega_2)} \right) \right] \qquad (3)$$

where $S_y(\omega)$ and $B_y(\omega_1, \omega_2)$ stand for the spectrum and bispectrum of the signal, and the $\{\mu_i, \lambda_i\}$'s are derived from the moments of the non-Gaussian input noise.

Footnote 1: The Itakura-Saito distortion is a statistical Kullback-Leibler (KL) divergence between signals represented by their LPC models.

4 The clustering method

An interesting application of our new distortion measure is an acoustic taxonomy of various musical instrument sounds. Musical sounds are hard to cluster because it is difficult to model sounds as points in some low dimensional space. Although we limit ourselves to the stationary portion of the signal, the above similarity measure yields a rather interesting grouping of musical instruments, as shown below.

In our approach, acoustic signals are treated as samples drawn from a stochastic source; that is, music signals are represented by their corresponding statistical models. Each model is chosen to be the one most likely, within its parameter class, to have produced the observed waveform. It is important to note that our relevant observables are the spectrum and the bispectrum of the signal, but in order to evaluate all the parameters of the distortion function, the $\{\mu_i, \lambda_i\}$'s must be estimated as well. Although this can be done in principle, and an iterative solution procedure was described in another work (Dubnov Tishby), here we prefer to avoid the complete estimation of these parameters by using the following argument. The distortion function is characterized by two integrals over ratios of the spectra and bispectra of the two signals involved. These are the only two quantities that are derived directly from the observed signals. We expect these two observables to contain all the relevant acoustic information, and thus we replace the original distortion function, Eq. 3, by the pair $(Sr_{y,y'}, Br_{y,y'})$, where

$$Sr_{y,y'} = \int \frac{d\omega}{2\pi} \frac{S_y(\omega)}{S_{y'}(\omega)} \quad \text{and} \quad Br_{y,y'} = \int \frac{d\omega_1\, d\omega_2}{(2\pi)^2} \frac{B_y(\omega_1, \omega_2)}{B_{y'}(\omega_1, \omega_2)} \qquad (4)$$

are the spectral and bispectral integral ratio expressions, respectively. The idea is to represent a sound $y$ by the collection of all pairs $(Sr_{y,y'}, Br_{y,y'})$, calculated with respect to all signals $y'$ in our data set. This "trace" vector provides a signature of the particular signal $y$, characterizing it by the values of the spectral and bispectral integral ratios with respect to all the other signals.
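The construction of the signature vector can be sketched numerically. The example below is a simplified illustration, assuming that each sound is already summarized by a spectrum and a bispectrum sampled on a uniform frequency grid, so that the normalized integrals reduce to averages of pointwise ratios. The toy values, the use of real positive bispectrum magnitudes, and the names `integral_ratio` and `signature` are assumptions made for the example and are not taken from the paper.

```python
def integral_ratio(values_y, values_ref):
    """Approximate the normalized integral of values_y / values_ref on a uniform grid.

    With uniformly spaced samples over one period, the integral against
    d(omega)/(2*pi), or its two-dimensional analogue, becomes the mean of
    the pointwise ratio.
    """
    if len(values_y) != len(values_ref):
        raise ValueError("both sounds must be sampled on the same grid")
    return sum(a / b for a, b in zip(values_y, values_ref)) / len(values_y)

def signature(name, spectra, bispectra):
    """Trace vector of sound `name`: (Sr, Br) pairs against every sound in the set."""
    vector = []
    for ref in sorted(spectra):
        vector.append(integral_ratio(spectra[name], spectra[ref]))      # spectral ratio
        vector.append(integral_ratio(bispectra[name], bispectra[ref]))  # bispectral ratio
    return vector

# Hypothetical toy data: 4-point spectra and flattened 2x2 bispectrum magnitudes.
spectra = {
    "flute": [4.0, 2.0, 1.0, 0.5],
    "clarinet": [4.0, 1.0, 2.0, 0.5],
    "violin": [3.0, 3.0, 2.0, 2.0],
}
bispectra = {
    "flute": [1.0, 0.5, 0.5, 0.25],
    "clarinet": [1.0, 0.25, 0.25, 0.5],
    "violin": [2.0, 1.0, 1.0, 1.0],
}

signatures = {name: signature(name, spectra, bispectra) for name in spectra}
for name, vec in signatures.items():
    print(name, [round(v, 3) for v in vec])
```

Each signature has two entries per reference sound, and the entries computed against the sound itself equal 1 by construction.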
These signatures are expected to be similar for sounds with similar characteristics, and they serve as the basis for an acoustic classification, which in turn determines the underlying structure in the musical domain. By this transformation we have turned our data into a collection of points in a high dimensional space, the dimensionality being equal to the length of our signature vectors (footnote 2).

Using this vector-signature representation, we explore our signal collection with a deterministic annealing hierarchical clustering approach (Rose et al.). As in other fuzzy clustering methods, each point (a signature vector in our case) is associated in probability with all the clusters. The "annealed" system consists of a set of probability distributions for associating points with clusters. The association probabilities are controlled by an inverse-temperature-like parameter, $\beta$. As $\beta$ gets larger, the temperature is lowered and the associations become less fuzzy. At the starting point, $\beta = 0$ and each point is equally associated with all clusters. As $\beta$ is raised, the cluster centroids bifurcate through a sequence of phase transitions, and we obtain an ultrametric structure of associations controlled by the temperature parameter.

5 Clustering Results

As described above, a hierarchical clustering was performed iteratively by splitting the models (cluster centers) while raising $\beta$. This creates a sequence of partitions that give refined models at each stage, with clusters containing progressively smaller groups of signature vectors, which are the representatives of the signals themselves. To test the spectral and bispectral components of our data in detail, we performed three separate clusterings: (1) a clustering with the spectral integral ratios vector only, (2) a clustering with the bispectral integral ratios vector only, and (3) a combined clustering with vectors consisting of the complete set of $(Sr_{y,y'}, Br_{y,y'})$ pairs.
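The annealing procedure described in Section 4 and used for these clusterings can be sketched as follows. This is a minimal illustration in the spirit of deterministic annealing, not the routines used for the reported experiments: the squared Euclidean distance, the Gibbs form of the association probabilities, the fixed $\beta$ schedule, the small perturbation that lets coincident centroids separate, the toy points and all function names are assumptions made for the example.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def associations(points, centroids, beta):
    """Probability of each cluster given each point, proportional to exp(-beta * d^2)."""
    probs = []
    for p in points:
        d = [sq_dist(p, c) for c in centroids]
        d_min = min(d)  # shift by the minimum so at least one weight equals 1
        w = [math.exp(-beta * (di - d_min)) for di in d]
        total = sum(w)
        probs.append([wi / total for wi in w])
    return probs

def update_centroids(points, probs, n_clusters):
    """Each centroid becomes the association-weighted mean of all points."""
    dim = len(points[0])
    centroids = []
    for k in range(n_clusters):
        mass = sum(pr[k] for pr in probs)
        centroids.append([sum(pr[k] * p[j] for pr, p in zip(probs, points)) / mass
                          for j in range(dim)])
    return centroids

def anneal(points, n_clusters=2, betas=(0.0, 0.01, 0.1, 1.0, 10.0),
           sweeps=50, jitter=1e-3):
    """Raise beta step by step; centroids start together and split as beta grows."""
    dim = len(points[0])
    mean = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    centroids = [list(mean) for _ in range(n_clusters)]
    history = []
    for beta in betas:
        # Assumed symmetry-breaking step: nudge centroid k by a tiny k-dependent offset.
        centroids = [[c + jitter * (k - (n_clusters - 1) / 2) for c in cen]
                     for k, cen in enumerate(centroids)]
        for _ in range(sweeps):
            probs = associations(points, centroids, beta)
            centroids = update_centroids(points, probs, n_clusters)
        history.append((beta, centroids))
    return history

# Hypothetical 2-D "signature vectors" forming two well-separated groups.
points = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
history = anneal(points)
for beta, cents in history:
    print(beta, [[round(v, 2) for v in c] for c in cents])

final_centroids = history[-1][1]
final_probs = associations(points, final_centroids, 10.0)
labels = [max(range(2), key=lambda k: pr[k]) for pr in final_probs]
print(labels)
```

At $\beta = 0$ both centroids sit at the mean of the data; for this toy set they separate once $\beta$ is large enough, and the associations then harden. Recording the value of $\beta$ at which each split occurs, and recursing to build a full tree, is left out of this sketch.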
Our data set was a collection of 31 sampled instrument sounds taken from the Proteus sample player module. The results of the spectral, bispectral and combined clusterings are shown in Figures 1, 2 and 3, respectively.

Footnote 2: The signature vector could be calculated with respect to a small representative subset of our signal set. We chose to evaluate the complete distance matrix, i.e. each signal is represented by its distances from all other signals, and the space dimension is thus equal to the size of the data set.

Regarding the spectral and bispectral clusterings, we first note that many pairs of similar sounds, such as the Flutes, the Clarinets and the Trumpets, are grouped together by both clustering methods. It is also interesting to note that the Violin and Viola are related (to different extents) in both methods, while the Cello is distant from both of them in the two clustering trees. Notice also that the Bassoon is close to the Contra Bassoon in the spectral tree, while being very distant from it in the bispectral tree. Similarly, but in the other direction, the Alto Oboe and English Horn are close in the bispectral tree and distant in the spectral tree.

Qualitative analysis of these results suggests some possible interpretations of the behavior of the spectral and bispectral clusterings. A rather "trivial" observation is that the spectral tree "grows" in the direction of spectral richness: the Flutes are close to the root, while the deeper branches belong to spectrally richer string and wind instruments. The bispectral tree places at its deep nodes instruments that are normally identified as "solo" instruments, while the sounds closer to the root belong to lower register instruments of a more "supportive" character. It may be said that the splitting in the hierarchical clustering corresponds to some "penetrating" quality of the sounds, with the Flutes and the Violin at the extreme. The combined spectral/bispectral results are difficult to interpret, although traces of the spectral and bispectral trees are apparent in them.

Acknowledgments

We would like to thank Prof. Dalia Cohen for fruitful musical discussions, and Mr. Yona Golan for discussions of clustering and for providing us with the routines.

Bibliography

(Dubnov Tishby) S. Dubnov, N. Tishby, Spectral Estimation using Higher Order Statistics, Proceedings of the 12th International Conference on Pattern Recognition, Jerusalem, Israel, 1994.
(Dubnov et al.) S. Dubnov, N. Tishby, D. Cohen, Hearing Beyond the Spectrum, Journal of New Music Research, in press.
(Gray Markel) A.H. Gray, J.D. Markel, Distance Measures for Speech Processing, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 24, No. 5, October 1976.
(Grey) J.M. Grey, An Exploration of Musical Timbre, Ph.D. dissertation, Stanford University, CCRMA Report No. STAN-M-2, Stanford, CA, 1975.
(Nikias Raghuveer) C.L. Nikias, M.R. Raghuveer, Bispectrum Estimation: A Digital Signal Processing Framework, Proceedings of the IEEE, Vol. 75, No. 7, July 1987.
(Kullback) S. Kullback, Information Theory and Statistics, New York, Dover, 1968.
(Rose et al.) K. Rose, E. Gurewitz, G.C. Fox, A Deterministic Annealing Approach to Clustering, Tech. Rep. C3P-857, California Institute of Technology, 1990.

Figure 1: Spectral clustering tree. The numbers on the nodes are the splitting values of $\beta$.

Figure 2: Bispectral clustering tree.

Figure 3: A combined spectral and bispectral clustering tree.