in Section 6 show that the proposed metric allows us to consider strong variations such as vibrato or tremolo, but also micro-modulations, to cluster partials of the same entity.

2. REVIEW OF EXISTING DISSIMILARITY METRICS

To compare the metrics described in the remainder of this article, we consider a set of partials extracted from various musical sounds: a saxophone tone with vibrato (entity C1), a modulated singing voice (entity C2), a piano tone (entity C3) and a triangle tone (entity C4). For each sound, the five partials with the highest amplitude are selected. Five other partials, erroneously extracted from a white noise signal, are added to the evaluation set (entity C0). All the partials are truncated to the same duration (1 second). This testing set is plotted in Figure 1, where the partials are sub-indexed by the number of the entity they belong to.

Except for the metric d_v proposed by Virtanen in [6], where the amplitude and frequency parameters of the partials are used as-is, the mean is subtracted before any computation and the amplitudes are normalized. During the experiments, it appeared that the correlated evolution of the frequency parameter was the most relevant cue. It will therefore be used for the comparison of the different metrics, although the correlated evolution of the amplitudes can also be considered for the clustering of partials, as will be shown in Section 6.

Some widely known metrics are now described. The Euclidean distance d_e between two vectors is defined as:

d_e(F1, F2) = sqrt( sum_{i=1}^{N} (F1(i) - F2(i))^2 )    (1)

where F1 and F2 are frequency vectors of size N.

Let us consider a harmonic tone modulated by a vibrato of given depth and rate. All the harmonics are modulated at the same rate, but their respective depths depend on their harmonic rank. It is then important to consider a dissimilarity metric which is scale-invariant. The Euclidean distance is therefore not suitable for our purpose.
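As a quick illustration (a sketch of ours, not from the paper), the following Python fragment shows why the Euclidean distance of Eq. (1) is not scale-invariant: two harmonics carrying the same vibrato shape, but with deviation depths differing by their harmonic rank, appear far apart.

```python
# Sketch: the Euclidean distance of Eq. (1) is not scale-invariant, so two
# harmonics sharing the same vibrato shape but with frequency deviations of
# different depths appear dissimilar.
import math

def d_euclidean(f1, f2):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))

# A toy vibrato: both harmonics are modulated at the same rate, but the
# depth of the second is twice that of the first (values are mean-subtracted
# frequency deviations, as done in the paper).
n = 64
h1 = [10.0 * math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
h2 = [20.0 * math.sin(2 * math.pi * 5 * i / n) for i in range(n)]

print(d_euclidean(h1, h2) > 0)  # True: pure depth scaling inflates the distance
```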
Alternatively, Cooke uses a distance [9] equivalent to the cosine dissimilarity d_c, defined as:

d_c(F1, F2) = 1 - c(F1, F2) / sqrt( c(F1, F1) c(F2, F2) )    (2)

c(F1, F2) = sum_{i=1}^{N} (F1(i) - F1~)(F2(i) - F2~)    (3)

where F1 and F2 are frequency vectors of size N and X~ denotes the mean of X. Thanks to the normalization, this dissimilarity is scale-invariant.

Figure 1. Frequency evolution of five partials extracted from a white noise signal (entity C0), five partials of a saxophone tone with vibrato (entity C1), five partials of a singing voice (entity C2), five partials of a piano tone (entity C3) and five others from a triangle tone (entity C4). These frequency evolutions are arbitrarily distributed over the ordinate and are indexed by growing index and sub-indexed by entity number.

If one considers the evolution in time of the frequencies of a partial as a signal, one can decompose this signal into two parts. One part evolves slowly and continuously with time and therefore complies with the requirements of the sinusoidal model. The other part is observation noise, due to estimation errors on the frequencies of the partial or to background noise. Only the first part should be considered when identifying similarities between the parameters of partials. As a consequence, a relevant dissimilarity for our purposes should be scale-invariant only for the part of the signal that complies with the sinusoidal model. This issue will be studied in the next section to propose an improved dissimilarity metric.

Virtanen proposed in [6] to use the mean-squared error between the frequency vectors first scaled by their average values:

d_v(F1, F2) = (1/N) sum_{i=1}^{N} ( F1(i)/F1~ - F2(i)/F2~ )^2    (4)
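A minimal sketch (ours, not from the paper) of the cosine dissimilarity of Eqs. (2)-(3), checking the scale-invariance the text claims: two vectors with the same modulation shape at different depths are at dissimilarity near 0, while an uncorrelated shape is near 1.

```python
# Sketch of the cosine dissimilarity: 1 minus the normalized correlation of
# the mean-subtracted vectors. Scale-invariant by construction.
import math

def d_cosine(f1, f2):
    m1 = sum(f1) / len(f1)
    m2 = sum(f2) / len(f2)
    a = [x - m1 for x in f1]
    b = [x - m2 for x in f2]
    c = lambda u, v: sum(x * y for x, y in zip(u, v))
    return 1.0 - c(a, b) / math.sqrt(c(a, a) * c(b, b))

n = 64
# Same vibrato shape, depth doubled and a different mean frequency:
h1 = [440 + 5 * math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
h2 = [880 + 10 * math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
# An uncorrelated modulation shape (quadrature phase):
h3 = [660 + 5 * math.cos(2 * math.pi * 5 * i / n) for i in range(n)]

print(abs(d_cosine(h1, h2)) < 1e-9)  # True: invariant to depth and mean
```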

3. PROPOSED DISSIMILARITY METRIC

Let F1 be the frequency vector of partial 1. According to the AutoRegressive (AR) model [11], the sample F1(n) can be approximated as a linear combination of past samples:

F1(n) = sum_{i=1}^{k} K1(i) F1(n - i) + E1(n)    (5)

where E1(n) is the prediction error. The coefficients K1(i) model the predictable part of the signal, and it can be shown that these coefficients are scale-invariant. On the contrary, the non-predictable part E1(n) is not scale-invariant. We have shown in [12] that AR modeling of the frequency and amplitude parameters is relevant for improving the tracking of partials. In this article, we show that AR modeling is a good candidate for the design of a robust dissimilarity metric.

For each frequency vector F1, we compute a vector K1 of 4 AR coefficients with the Burg method [13, 14]. Since the direct comparison of the AR coefficients computed from the two vectors F1 and F2 is generally not relevant, the spectra of these coefficients may be compared instead. The Itakura distortion measure [15], issued from the speech recognition community, can therefore be considered:

d_AR(F1, F2) = log integral_{-pi}^{pi} ( |K1(omega)|^2 / |K2(omega)|^2 ) domega/2pi    (6)

where

K1(omega) = 1 + sum_{i=1}^{k} K1(i) e^{-j i omega}    (7)

Another approach may be considered. The amount of error made when modeling the vector F1 with the coefficients computed from vector F2 indicates the proximity of these two vectors. Let us introduce a new notation E1^2, the crossed prediction error, defined as the residual signal of the filtering of the vector F1 with K2:

E1^2(n) = F1(n) - sum_{i=1}^{k} K2(i) F1(n - i)    (8)

The principle of the dissimilarity d_G is to combine the two anti-symmetrical dissimilarities |E1^2| and |E2^1| to obtain a symmetrical one:

d_G(F1, F2) = |E1^2| + |E2^1|    (9)

Given two vectors F1 and F2 to be compared, the coefficients K1 and K2 are computed to minimize the power of the respective prediction errors E1 and E2.
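The crossed-prediction-error idea of Eqs. (5), (8) and (9) can be sketched as follows. This is an illustrative sketch of ours, not the authors' implementation: the paper fits the k = 4 AR coefficients with the Burg method, while this sketch fits them by ordinary least squares on the prediction equations as a simple stand-in.

```python
import math
import random

def ar_fit(f, k=4):
    """Fit AR(k) coefficients for F(n) ~ sum_i a[i] * F(n-1-i).
    Least-squares stand-in for the Burg method used in the paper."""
    rows = [[f[n - i] for i in range(1, k + 1)] for n in range(k, len(f))]
    y = [f[n] for n in range(k, len(f))]
    R = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * t for row, t in zip(rows, y)) for i in range(k)]
    return solve(R, b)

def solve(A, b):
    """Gaussian elimination with partial pivoting, for small dense systems."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[p] = M[p], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def pred_error_power(f, a):
    """Mean power of the residual of predicting f with coefficients a (Eq. 8)."""
    k = len(a)
    e = [f[n] - sum(a[i] * f[n - 1 - i] for i in range(k)) for n in range(k, len(f))]
    return sum(x * x for x in e) / len(e)

def d_g(f1, f2):
    """Symmetrical crossed-prediction-error dissimilarity in the spirit of Eq. 9."""
    return pred_error_power(f1, ar_fit(f2)) + pred_error_power(f2, ar_fit(f1))

# Two harmonics carrying the same vibrato shape at different depths, plus a
# noise-only track; a little jitter keeps the least-squares system well
# conditioned.
random.seed(1)
n = 200
h1 = [5 * math.sin(2 * math.pi * 4 * t / n) + random.gauss(0, 0.05) for t in range(n)]
h2 = [10 * math.sin(2 * math.pi * 4 * t / n) + random.gauss(0, 0.05) for t in range(n)]
noise = [random.gauss(0, 5) for _ in range(n)]

print(d_g(h1, h2) < d_g(h1, noise))  # True: shared modulation beats noise
```

Because the AR coefficients capture only the predictable part of the evolution, the two sinusoidal tracks predict each other almost perfectly despite their different depths, whereas the noise track does not.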
If the two vectors F1 and F2 are similar, the powers of the crossed prediction errors E1^2 and E2^1 will be as weak as those of E1 and E2. We can then consider another dissimilarity d_G' defined as the ratio between the sum of the crossed prediction errors and the sum of the direct prediction errors:

d_G'(F1, F2) = ( |E1^2| + |E2^1| ) / ( |E1| + |E2| )    (10)

These dissimilarity metrics based on AR modeling will now be compared to the ones presented in Section 2, using two criteria described in the next section.

4. COMPARISON OF DISSIMILARITY METRICS

A relevant dissimilarity between two elements (the partials) is one which is low for elements of the same class (acoustical entity) and high for elements that do not belong to the same class. The intra-class dissimilarity should then be minimal and the inter-class dissimilarity as high as possible. Let U be the set of elements, of cardinal #U, and Ci the class of index i among Nc different classes. An estimation of the relevance of a given dissimilarity d(x, y) for a given class is:

intra(Ci) = 1/(ni(ni - 1)) sum_{j=1}^{ni} sum_{k=1, k!=j}^{ni} d(ci(j), ci(k))    (11)

inter(Ci) = 1/(ni(#U - ni)) sum_{j=1}^{ni} sum_{l=1}^{#U-ni} d(ci(j), ci~(l))    (12)

Qd(Ci) = inter(Ci) / intra(Ci)    (13)

where ni is the number of elements of Ci and ci~ = U - Ci. The overall quality Qd is then defined as:

Qd(U) = (1/Nc) sum_{i=1}^{Nc} inter(Ci) / intra(Ci)    (14)

For example, let A = {(1.1, 5.1), (1.0, 5.2), (1.0, 5.3)} and B = {(1.0, 1.1), (1.1, 0.9)} be two classes of points e = (xi, yi) in a two-dimensional space. One dissimilarity considers the abscissa, dx(ei, ej) = |xi - xj|, and the other considers the ordinate, dy(ei, ej) = |yi - yj|. If we study the data, the most relevant dissimilarity is dy, which is verified by the quality measure Qd: Qdx(U) = 0.75 < Qdy(U) = 32.

The criterion defined in Equation 13 is first used to evaluate the capability of the metrics proposed in the last two sections to discriminate the partials of a given class from the others. Next, the criterion defined in Equation 14 is used to evaluate each metric globally. The results, summarized in Table 1, will be further detailed in the remainder of the section.
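The toy example above can be checked numerically. The sketch below is ours and the exact pair-averaging convention of Eqs. (11)-(14) is a reconstruction, so the absolute values need not match the quoted 0.75 and 32; the ranking Q_dy > Q_dx that identifies dy as the relevant dissimilarity is reproduced nonetheless.

```python
# Sketch of the quality criterion: mean inter-class over mean intra-class
# dissimilarity, averaged over classes (reconstructed normalization).

def mean_intra(cls, d):
    m = len(cls)
    return sum(d(cls[a], cls[b]) for a in range(m) for b in range(m) if a != b) / (m * (m - 1))

def mean_inter(cls, rest, d):
    return sum(d(a, b) for a in cls for b in rest) / (len(cls) * len(rest))

def quality(classes, d):
    q = 0.0
    for i, cls in enumerate(classes):
        rest = [e for j, c in enumerate(classes) if j != i for e in c]
        q += mean_inter(cls, rest, d) / mean_intra(cls, d)
    return q / len(classes)

A = [(1.1, 5.1), (1.0, 5.2), (1.0, 5.3)]
B = [(1.0, 1.1), (1.1, 0.9)]
dx = lambda e1, e2: abs(e1[0] - e2[0])
dy = lambda e1, e2: abs(e1[1] - e2[1])

print(quality([A, B], dy) > quality([A, B], dx))  # True: dy separates the classes
```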
It can however be noticed that this criterion is highly dependent on the scale of the studied dissimilarity metric. We therefore also consider another criterion, noted xi, which is more independent of the scale of the evaluated dissimilarity metric than the previous one. Given a set of elements X, xi(X) is defined as the ratio of couples (a, b) such that b is the closest element to a and a and b belong to the same class. Given a function "cl" defined as:

cl: X -> N, a -> i, where i is the index of the class of a,

we get:

xi(X) = #{ (a, b) | d(a, b) = min_{c in X} d(a, c) and cl(a) = cl(b) } / #X    (15)

where X can be either a class Ci or the set of elements U, and #X denotes the cardinal of X.

As shown in the first column of Tables 1 and 2, the dissimilarity d_e obtains bad marks for the saxophone tone and

the modulated voice, because this dissimilarity is not scale-invariant. The dissimilarity d_c gets better results in the case of modulations, as shown by the second column of Tables 1 and 2. The dissimilarity d_v shows disparate results: some entities are easily discriminated, like the saxophone tone C1 and the triangle tone C4; on the contrary, the marks are not as satisfying for other entities, like the piano tone and the "noisy" partials. The dissimilarity d_AR gets very good marks, as can be noticed in the fourth column of Tables 1 and 2. The dissimilarity d_G offers good performance in the case of predictable evolutions of the frequency.

Qd   | d_e     d_c     d_v     d_AR   d_G     d_G'
C0   | 3.9     4.3     0       57.3   8.9     16.8
C1   | 10.5    806.6   47634   37.4   97.3    46.9
C2   | 3.1     1586.6  37.4    33.6   23.1    29.5
C3   | 147.9   4.8     57.1    22.8   251.8   81.3
C4   | 49.5    43.9    83866   23.3   72.1    22.6
U    | 5.5     10.6    5.1     27.8   21.2    36.2

Table 1. Quality estimation according to the criterion Qd defined in Equation 13 for the first five lines and in Equation 14 for the last line. The evaluated dissimilarities are: d_e the Euclidean distance, d_c the cosine dissimilarity, d_v the metric of Virtanen, d_AR the Itakura distortion measure, d_G the crossed prediction error dissimilarity and d_G' the normalized crossed prediction error dissimilarity. The frequency vectors used for the experiment are plotted in Figure 1. The metrics based on AR modeling obtain the best overall results.

xi   | d_e   d_c    d_v   d_AR   d_G   d_G'
C0   | 0.6   0.6    0.6   1      0     1
C1   | 0     1      1     1      0.8   0.8
C2   | 0     1      1     1      0.2   1
C3   | 1     0.6    0.4   1      1     1
C4   | 0     1      1     1      1     1
U    | 0.3   0.84   0.8   1      0.6   0.96

Table 2. Quality estimation according to the criterion xi defined in Equation 15, for the same dissimilarities as in Table 1. The frequency vectors used for the experiment are plotted in Figure 1. The d_c and d_v metrics obtain comparable results, while d_AR and d_G' obtain the best overall results.
On the contrary, the correlations between the partials of the voice tone or the noisy partials are not as clear, see the fifth column of Table 2. The dissimilarity d_G' offers more homogeneous results for the classes of partials we want to handle, see the last column of Table 1. This homogeneity is crucial for the clustering method described in the following section.

5. AGGLOMERATIVE HIERARCHICAL CLUSTERING

As stated before, dissimilarity-based classification involves calculating a dissimilarity metric between pairwise combinations of elements and grouping together those for which the dissimilarity metric is small, according to a given clustering method. The method employed by Cooke [9] selects a seed element from the data set and computes the distance between this seed and all other elements in the data set, in order to cluster the elements whose distance from the seed is below a given threshold. If any elements remain in the data set after this search, a new seed is selected from the remaining elements and the grouping procedure is repeated. If no element is found in the data set that is within the threshold of the seed, the seed is considered to belong to a singleton group. The entire process is repeated until no elements remain in the data set.

The method of Virtanen [6] is a slight variation of this approach, where the initial seeds are not individual tracks but rather small groups of tracks that have been formed by matching onset times. The distance between the seed group and each of the remaining tracks in the data set is then defined as the average distance between each of the tracks in the seed and the track under consideration. Virtanen adopted this method as a means of reducing computational complexity, compared to an exhaustive minimisation of the distance between tracks within each group.

These two approaches rely on a good initialization, and would fail if the threshold or the seed are not relevant for the considered data set.
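The threshold-based seed grouping attributed to Cooke can be sketched as follows (an illustrative sketch of ours; the toy data and dissimilarity are not from the paper). Changing the threshold changes the grouping, which illustrates the initialization sensitivity criticized above.

```python
# Sketch of seed-and-threshold grouping: pick a seed, collect every element
# within `threshold` of it, and repeat on the remainder. A seed with no
# neighbour within the threshold stays a singleton group.

def seed_clustering(elements, d, threshold):
    remaining = list(elements)
    groups = []
    while remaining:
        seed = remaining.pop(0)
        group = [seed]
        still = []
        for e in remaining:
            (group if d(seed, e) <= threshold else still).append(e)
        remaining = still
        groups.append(group)
    return groups

points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
groups = seed_clustering(points, lambda a, b: abs(a - b), threshold=0.5)
print(groups)  # [[0.0, 0.1, 0.2], [5.0, 5.1], [9.0]]
```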
Alternatively, we propose to cluster partials by means of the agglomerative hierarchical clustering (AHC) method, which requires no initialization. An agglomerative hierarchical clustering procedure produces a series of partitions of the data: (Pn, Pn-1, ..., P1). The first partition Pn consists of n singletons, and the last partition P1 consists of a single class containing all the elements. At each stage, the method joins together the two classes which are most similar according to the chosen dissimilarity metric. At the first stage, of course, this amounts to joining together the two elements that are closest together, since at the initial stage each class has one element.

Hierarchical clustering may be represented by a two-dimensional diagram known as a "dendrogram", which illustrates the fusions made at each successive stage of clustering, see Figure 2(b), where the length of the vertical bar that links two classes is calculated according to the distance between the two joined classes. The classes are then found by "cutting" the dendrogram at levels where the difference between the distance of this level and that of the previous level is above a given threshold.

Differences between methods arise because of the different ways of defining the dissimilarity between classes. For computational efficiency, the element properties (the amplitude or frequency vectors of the partials) should not be considered when computing the distance between the union of two joined classes Ci and Cj and each remaining class Ck, for all k not in {i, j}.
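The AHC loop just described can be sketched as follows (an illustrative sketch of ours, not the authors' implementation). In line with the efficiency remark above, merging works purely on the class-to-class distance table; the update rule shown here is the simple single-linkage choice, one of the variants discussed next.

```python
# Sketch of AHC: merge the two closest classes until one remains, recording
# the merge history (the information a dendrogram displays). The distance to
# a merged class is derived from existing class distances only, never from
# the element attributes.

def ahc(elements, d):
    """Return the merge history as (class_a, class_b, distance) triples."""
    classes = {i: [e] for i, e in enumerate(elements)}
    dist = {(i, j): d(elements[i], elements[j])
            for i in classes for j in classes if i < j}
    history = []
    next_id = len(elements)
    while len(classes) > 1:
        (i, j), dij = min(dist.items(), key=lambda kv: kv[1])
        history.append((i, j, dij))
        classes[next_id] = classes.pop(i) + classes.pop(j)
        for k in classes:
            if k == next_id:
                continue
            # Single-linkage update: distance to the merged class is the
            # smaller of the distances to its two components.
            dist[(k, next_id)] = min(dist[min(i, k), max(i, k)],
                                     dist[min(j, k), max(j, k)])
        dist = {(a, b): v for (a, b), v in dist.items()
                if i not in (a, b) and j not in (a, b)}
        next_id += 1
    return history

history = ahc([0.0, 0.2, 5.0], lambda a, b: abs(a - b))
print(history)  # the first merge joins the two closest elements
```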

A first method, known as the minimal linkage method, consists in choosing the smaller of the distances d(Ci, Ck) and d(Cj, Ck). The distance between two classes is then the distance between the two nearest elements of these classes. Another method, proposed by Ward [16], minimises the intra-class inertia during the aggregation process. Given a partition of K classes, the intra-class inertia measures the class homogeneity:

I = sum_{k=1}^{K} sum_{i=1}^{nk} d(Ck(i), Gk)    (16)

where Ck is a class with nk elements and Gk its barycenter. The distance between the union of two classes Ci and Cj and another class Ck is computed as follows:

d(Ci U Cj, Ck) = 1/(ni + nj + nk) [ (ni + nk) d(Ci, Ck) + (nj + nk) d(Cj, Ck) - nk d(Ci, Cj) ]    (17)

where ni is the number of elements of the class Ci. This method is designed for the analysis of scalar data vectors and gives good results for our applications.

As an example, let the data set be 12 points whose attributes are their coordinates, as shown in Figure 3(a). Figures 3(b) and 3(c) are respectively the dendrograms computed using the minimal linkage method and the Ward method, with the Euclidean distance as the dissimilarity metric. The first exhibits a "chain" effect not suitable for our purpose. The second leads to a more balanced dendrogram, easing the identification of classes.

6. EXPERIMENTS

Several experiments were conducted to evaluate the relevance of the dissimilarity metrics described in this article when the AHC algorithm is used to cluster partials. Only the hierarchies obtained with the normalized crossed prediction error dissimilarity d_G', defined in Equation 10, are reported here, since it gave the most relevant results. The hierarchies were computed with the Ward method.

We first consider the data set plotted in Figure 1. If we consider the similarity between the frequency vectors of the partials, the resulting hierarchy is almost perfect, see Figure 4.
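Ward's class-distance update of Eq. (17) can be exercised in isolation (an illustrative sketch of ours, with made-up values): once the three pairwise class distances and class sizes are known, the distance from a merged class to any other class follows without revisiting the element attributes.

```python
# Numerical sketch of Ward's update (Eq. 17), Lance-Williams form.

def ward_update(d_ik, d_jk, d_ij, ni, nj, nk):
    """Distance from the union Ci U Cj to Ck."""
    return ((ni + nk) * d_ik + (nj + nk) * d_jk - nk * d_ij) / (ni + nj + nk)

# Three singleton classes holding the 1-D points 0.0, 0.1 and 5.0, with
# squared-Euclidean distances between them:
d12, d13, d23 = 0.1 ** 2, 5.0 ** 2, 4.9 ** 2
d_merged_to_3 = ward_update(d13, d23, d12, ni=1, nj=1, nk=1)
print(round(d_merged_to_3, 4))
```

The update keeps the merged class {0.0, 0.1} far from the outlying point 5.0, as expected, using only the three previously computed distances.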
The partials of the musical tones are correctly clustered, and the noisy ones are inserted into the hierarchy at a high level, easing their elimination by the use of the cutting threshold (set to 0.1 here). If we consider the similarity between the amplitude vectors of the partials, the hierarchy is not as satisfying, see Figure 5, because partials from entities 1 (saxophone tone with vibrato), 2 (voice tone) and 4 (piano tone) are mixed.

The testing material used above covers a wide range of modulations. We now focus on the micro-modulations of the parameters of some piano tones to cluster partials. The partials of three piano tones from the IOWA database, with different pitches and similar intensity, are used as testing material. Five partials per tone are considered. The hierarchies are computed with the same algorithm.

Figure 2. On top are plotted six points on a plane with arbitrary axes, to be clustered by the AHC method. At the bottom is plotted the dendrogram representing the hierarchy obtained using the Euclidean distance as the dissimilarity metric between points. We can clearly distinguish two classes, one composed of points with abscissa close to 0 and the other composed of points with abscissa close to 1.

Even if the resulting hierarchies are not perfect, see Figures 6 and 7, some correlations are clear, especially for the tone with the highest pitch (cluster 3). This shows that the d_G' dissimilarity is able to discriminate between micro-modulations and observation noise, even for steady-pitch tones of musical instruments like the piano.

7. CONCLUSION

We have shown in this article that the long-term sinusoidal model allows us to consider a generic cue for the clustering of partials: the common variation cue. The autoregressive modeling of the evolutions of the parameters of the partials appears to be relevant for the design of a robust dissimilarity metric exploiting this cue.
The experiments showed that, thanks to the dissimilarity proposed in this article, not only large modulations such as vibrato or tremolo but also micro-modulations are relevant for clustering partials into acoustical entities. The analysis of these modulations may be of interest for the description of acoustical entities, with applications to instrument recognition. This topic should be explored in the near future.

8. REFERENCES

[1] Mathieu Lagrange, Sylvain Marchand, and Jean-Bernard Rault, "Improving the Tracking of Partials

for the Sinusoidal Modeling of Polyphonic Sounds," in IEEE ICASSP, March 2005, vol. 4, pp. 241-244.

[2] Albert S. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound, The MIT Press, 1990.

[3] Stephen Grossberg, Pitch Based Streaming in Auditory Perception, MIT Press, Cambridge, MA, 1996.

[4] Paulo Fernandez and Javier Casajus-Quiros, "Multi-Pitch Estimation for Polyphonic Musical Signals," in IEEE ICASSP, April 1998, pp. 3565-3568.

[5] Anssi Klapuri, "Separation of Harmonic Sounds Using Linear Models for the Overtone Series," in IEEE ICASSP, 2002.

[6] Tuomas Virtanen and Anssi Klapuri, "Separation of Harmonic Sound Sources Using Sinusoidal Modeling," in IEEE ICASSP, April 2000, vol. 2, pp. 765-768.

[7] Julie Rosier and Yves Grenier, "Unsupervised Classification Techniques for Multipitch Estimation," in 116th Convention of the Audio Engineering Society, AES, May 2004.

[8] Stephen McAdams, "Segregation of Concurrent Sounds: Effects of Frequency Modulation Coherence," JASA, vol. 86, no. 6, pp. 2148-2159, 1989.

[9] Martin Cooke, Modelling Auditory Processing and Organization, Cambridge University Press, New York, 1993.

[10] S. C. Johnson, "Hierarchical Clustering Schemes," Psychometrika, vol. 32, pp. 241-254, 1967.

[11] Steven M. Kay, Modern Spectral Estimation, chapter Autoregressive Spectral Estimation: Methods, pp. 228-231, Signal Processing Series, Prentice Hall, 1988.

[12] Mathieu Lagrange, Sylvain Marchand, Martin Raspaud, and Jean-Bernard Rault, "Enhanced Partial Tracking Using Linear Prediction," in Proc. DAFx, Queen Mary, University of London, September 2003, pp. 141-146.

[13] John P. Burg, Maximum Entropy Spectral Analysis, Ph.D. thesis, Stanford University, 1975.

[14] Florian Keiler, Daniel Arfib, and Udo Zölzer, "Efficient Linear Prediction for Digital Audio Effects," in Proc. DAFx, Università degli Studi di Verona and COST, December 2000.
[15] Fumitada Itakura, "Minimum Prediction Residual Principle Applied to Speech Recognition," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 23, no. 1, pp. 67-72, 1975.

[16] Joe H. Ward, "Hierarchical Grouping to Optimize an Objective Function," Journal of the American Statistical Association, vol. 58, pp. 236-244, 1963.

Figure 3. 12 points on a plane with arbitrary axes are clustered by the AHC method (a). Dendrograms of the hierarchies obtained with the minimal linkage method (b) and the Ward method (c). In both hierarchies, the two-dimensional Euclidean distance is used as the dissimilarity metric between the elements to be classified. The first hierarchy exhibits a "chain" effect not suitable for our purpose. The second one is more balanced, easing the identification of classes.

Figure 4. Dendrogram obtained when clustering the partials of the testing set plotted in Figure 1 according to the dissimilarity of their frequency vectors, computed with the d_G' metric. The partials are indexed by growing index and sub-indexed by entity number. The cuts, represented with dots, allow us to identify classes that cluster partials from the same entity. Using the frequency variation cue, all partials are correctly clustered. Additional cuts split the cluster of partials erroneously extracted from noise (entity C0). This is not a major disadvantage, since these partials do not explicitly belong to an acoustical entity.

Figure 5. Dendrogram obtained when clustering the partials of the testing set plotted in Figure 1 according to the dissimilarity of their amplitude vectors, computed with the d_G' metric. The hierarchy is not perfect, because partials from entities 1 (saxophone tone with vibrato), 2 (voice tone) and 4 (piano tone) are mixed.

Figure 6. Dendrogram obtained according to the frequency vectors of 15 partials of 3 piano tones with different pitches. The partials are indexed by growing index and sub-indexed by entity number. The two higher entities (sub-indexed 2 and 3) are well identified in the hierarchy, whereas the lower one (sub-indexed 1) is split. Even if the piano has an almost steady pitch, some correlations between the frequency vectors of the same tone can be exploited.

Figure 7. Dendrogram obtained according to the amplitude vectors of three piano tones (of five partials each) with different pitches. Only the highest tone (sub-indexed 3) is clearly identified in the hierarchy. The correlation of the amplitudes of the partials appears to be a less relevant cue for the clustering of partials than the one that considers the frequencies.