Page  00000475 Data Association Techniques for a Robust Partial Tracker of Music Signals Hamid Satar-Boroujeni*, Bahram Shafai*, and Patrick J. Wolfet *Department of Electrical and Computer Engineering, Northeastern University {hsattar, shafai} tDepartment of Electrical Engineering and Computer Science, Harvard University Abstract In this paper we present a robust Kalman filter for tracking of partials in music signals and introduce novel quality factors for improved data association strategy and rejection rules. The robust Kalman tracker is based on a regularized least-squares approach that is designed to minimize the worst-possible regularized residual norm over the class of admissible uncertainties at each iteration. This method provides improved tracking capabilities, compared with the conventional Kalman filter, which was proposed before. We also propose novel data association techniques and rejection rules based on quality factors for individual components of a track and also for individual tracks. 1 Introduction Partial tracking has been widely used in different areas of music signal analysis, where prominent features of these signals, such as pitch and frequency-amplitude of harmonics are extracted (Depalle, Garcia et al. 1993), (Sterian 1999), and (Lagrange, Marchand et al. 2004). We proposed a partial tracking technique before (Satar-Boroujeni and Shafai ICMC2005), which was based on the conventional Kalman filter. There, we estimated the parameters of evolution models for our Kalman tracker through a statistical analysis of a large database of musical sounds and by averaging over varying estimates. This inaccuracy in model parameters, which is unavoidable when dealing with real world models, degraded the performance of our tracker in certain situations. A feasible solution to this problem can be the use of a robust Kalman tracker which deals with model parameters as bounded uncertainties. This can be especially rewarding since we do not need to tediously estimate these parameters for different situations where they can never be accurate enough, and, on the other hand, our robust tracker can perform a significantly better job in critical situations (e. g., the closely-spaced partial in a polyphonic scenario). Another important issue in tracking is the data association strategy. Data association can be viewed as a tool for adjusting the performance of partial tracker based on our specific needs. Regardless of how efficient our tracking algorithm is, a loosely designed data association algorithm can degrade the performance of the whole system. On the other hand, a careful consideration of this issue can result in considerable improvements in the final outcome. In our system, we try to collect all possible evidences pertaining to the quality of the estimated data and incorporate them in the pass-and-fail rules. Parts of these evidences are collected before performing the actual tracking and in the process of peak detection. We will begin by defining a quality factor for spectral peaks that are collected to be grouped into partial trajectories. We then describe the state-space modeling of the system in Section 3, where we introduce our uncertainty model, which localizes all the unknown parameters into bounded perturbations. In Section 4, we provide the formulation of our robust Kalman tracker and describe the process of partial tracking. In Section 5, we concentrate on data association issue and formulate our track quality factor and define some rejection rules. At the end, we provide some results on robust tracking and show the effectiveness of rejection rules. 2 Peak Quality Factor The process of partial tracking starts with estimation of a time-frequency image of the music signal followed by peak detection, where the salient spectral peaks that can be considered as elements of a partial trajectory are collected. Considering the nature of music signals, we proposed a novel approach for detection of peaks before, details of which can be found in (Satar-Boroujeni and Shafai EUSIPCO2005). There, we designed a two-step algorithm, and in the first step we used a mathematical framework to collect all the peaks that fit into the very definition of a peak as a local maximum. These were referred to as raw peaks. In the next step we used statistical properties of a relative number of data points surrounding each raw peak to examine the concreteness of the detected peak and rule out any incompetent maxima. The rejection rule in the second step is based on the local tallness of a peak. We accept a raw spectral peak at 475

Page  00000476 discrete time frame t and frequency bin f as a genuine peak pertaining to a real partial track if p (t,f ) p(tf)]+ d op(t ) (1) where p(t,f) is the power of the peak (amplitude in logarithmic scale), mWp(t,f) and a[p(t,)] refer to the mean and standard deviation of the elements surrounding p(t,f) respectively, and d is a tuning parameter. In this peak detection strategy, we use carefully designed and tuned thresholds for identifying a genuine peak, but after a peak is identified, there is no consideration of how well this peak has passed the qualification tests. Based on this observation, we propose to associate each peak with a quality factor that can represent the level of credibility that we are considering for each peak. This can be especially rewarding since we can define a membership level for spectral peaks instead of dividing them with "passed" and "failed" labels. This quality factor can be carried over to the partial tracking process for defining similar factors for partial tracks. These factors can come very handy in identifying the meaningful partial tracks. If we consider a qualified peak p(t,f), the proposed Peak Quality Factor (PQF) for this peak, i.e., PQF(p) is defined as PQF(p) = LT (p) + OT (p) (2) Where LT(p) is the local tallness measure of this peak and is defined as the amplitude distance between peak p and the local tallness threshold defined in (1). It can be written as LT (p) = p(t,f ) - (m[,,)] + d o,,) (3) Also OT(p) is the overall tallness of peak p which is defined as OT (p) = p (t,f ) - Trend (t,f ) (4) Where Trend(t/) is the overall trend of the spectrum, which is calculated by finding the best linear fit to the spectrum at time frame t and evaluating it at frequency f 3 System Modeling The whole idea of music signal modeling is based on the model of time-varying sinusoidal components plus noise (Serra 1997), which is commonly used in the area of music signal analysis and synthesis. This model is formulated as N s Y(t) = A (t)cos (0 (t))+e(t) (5) n=1 with S(t) = (r)d r + (0) (6) Here, A,(t) is the instantaneous amplitude, On(t) is the instantaneous phase of the nth harmonic, wno(t) is the instantaneous frequency in radians per second, and e(t) is the additive noise that can be modeled as a stationary autoregressive process. In this class of modeling, sinusoidal components can be directly interpreted as the harmonics of musical notes. However, parameters of this model, which are frequency, amplitude and time duration of sinusoidal components, are estimated by connecting spectral peaks from successive time frames of the temporal signal. These peaks are connected to each other using tracking and data association techniques, and for working with Kalman filter as our tracker we must build a model for evolution of these peaks in time. Kalman filter takes the noisy observations and based on a model for evolution of certain states, finds the optimal estimate of the process behavior. Here, the noise corrupted observations are the identified peaks and system model is a state-space model for evolution of frequency and power. This model can be shown as x (k +1) = Ax (k) + Bv (k) y (k) = Cx (k)+w (k) where x(k)=[f(k) p(k) n,(k).. nm (k)]T v(k)= [u,(k)... um(k)]T (7) (8) y (k) = [f (k) p(k)] Here, f(k) and p(k) are frequency and power for a detected peak, respectively. v(k) and w(k) are process noise and observation noise, and ni(k), i=l,...,m are states for as many shaping filters for which the uncorrelated noise processes ui(k), i=l,...,m are white. The A matrix is the transition matrix, B describes coupling of the process noise v(k) into the system states, and C is the observation matrix. In this model, v(k) and w(k) are zero-mean and jointly uncorrelated Gaussian processes with covariance matrices Q and R, respectively. 3.1 Conventional Models For specifying matrices and number of states needed for our models, prior information about the power and frequency partials is needed. This can help us to specify the model by a piecewise-linear fit to p(t)=20log An(t) and f(t)= on(t)/27T. We introduced two sets of conventional models for different classes of musical instruments before (SatarBoroujeni and Shafai ICMC2005). For the first model, which is more suitable for classes of instruments with nearly constant frequency and power partials in their steady states, e.g., woodwind and brass instruments, the state-space matrices can be written as 10 1 0 0 0 0 1 0 1 0 0 A =, B = (9) 0 0 a( 0 b, 0 0 0 0 a2_ 0 b2 476

Page  00000477 and 0 0 I, R01 (10) 0 1 0 0], IR (10) Here a2/fand a2f are the variances of the observation noise process and Q is assumed to be identity for simplicity. We refer to this model as the Continued Energy Injection (CEI) model. The second model, which better describes the classes of instruments with linearly decaying power partials (e.g., piano and guitar), has one more state and its state matrices are 0 1 0 0 0 0 1 0 1 1 0 0 0 0 1 0 0 a1 0 0 0 0 0 1, 0 a2_ 0 1 0 Q B 0 0 0 b, 0 0 0 0 0 b 07 o2:[ j (11) 0[1 I, R This model is called the Discontinued Energy Injection (DEI) model. Before using these models we needed to estimate the new parameters introduced in both models and this is done be performing a statistical analysis on a large database of recorded music sounds from individual instruments playing single notes. The estimated parameters are shown in figure 1. x (k + 1)= (A + A)x(k)+Bv(k) (12) y (k) = Cx (k) +w (k) In this model, all the parameters in A and B matrices in (9) and (11), i.e., al, a2, bl, b2, are captured in 5A. Also, {x(0), v(k), w(k)} are uncorrelated zero-mean random variables with covariance matrices -I, Q and R respectively. The perturbation of A is modeled as SA DAE (13) for some known matrices {D, E} and for an arbitrary A, IIA 1<1. Considering the known boundaries of our parameters that can be derived from parameters estimated for our conventional models in figure 1, we can transform our model in (9) into an appropriate model represented by (12). Intending to move uncertainties in the input matrix into the transition matrix, we write 1 0 b, 0 0 0 o 1 0 b 0 0 A+DAE=, B= 00 a,0 b, 0 O 0 0 a 2 b2 where b1 - bb1 b, = bb, and bl and b2 come from (9). For isolating bounded uncertainties, which appear in figure 1, into A, we can write b"b = [2,20], b, - 5 b, = [0.4,4] = 2.2 +1.8 x [-1,1] = 2.2 + 1.86, b22 = [1.5, 4.5], b 2 3 b = [0.5,1.5] = 1+ 0.5 x [-1,1] 1 + 0.562 a, = [0.36, 0.5] = 0.43 + 0.07 x [-1, 1] = 0.43 + 0.0763 a, = [0.43, 0.61] = 0.52 + 0.09 x [-1,1] = 0.52 + 0.0964 which results in 1 0 2.2 0 0 0 1.8 0 0 1 0 1 0 0 0 0.5 A=,E0 0 0.43 0 0 0 0.07 0 0 0 0 0.52 0 0 0 0.09 0 0 S00 0 [' 0 o o 3 0 1 0 0 (14) 0 5 D = I4, A = diag (Si ), i = 1,..., 4, -1 _i _ 1 0.4i- 1 ""- - \' i 1000 200o 000 4a0oo 0 0 6o0 0o 7oo 800 o Frlquncy (Hz) 4.Bi /"CI "" 'J...... ,, 0.3I5 0.41 r.4, - - - - -- -- -- -- 2f r L,..-:-!!::::: 2 111111.......................................... o 2000 4<000 oo 8000 De Frequency (Hz) DEI:I 2c' I- 'S 5 r /,. T-,.0i -/ 0 20o0 4 00 600 0 oo00 000 2000 3030 4000 6000 6000 7000 8000 Frequency (Hz) Frequency (Hz4 Figure 1: parameters for CDI (dashed) and DEI (solid) 3.2 Model with Uncertainties The robust Kalman filter proposed here requires a model in which all the parameters are represented as bounded uncertainties with known boundaries. The state-space description of this model can be written in the following form 477

Page  00000478 4 Robust Kalman Tracker As mentioned earlier, in practical applications, where parameters of the evolution model are not guaranteed to be accurate enough, the performance of Kalman filter can be poor. However, if the parameters of the system are bounded in small boundaries, we can make use of robust Kalman filter, which provides significant improvements in tracking of partials for different classes of musical instruments. The class of robust Kalman filter that best suits our requirements is called the Regularized Least Squares Kalman Tracker (Sayed 2001). Compared with the standard Kalman filter, which minimizes the regularized residual norm at each iteration, this class of robust filters is designed to minimize the worst-possible regularized residual norm over the class of admissible uncertainties at each iteration. The recursive formulation for our robust tracker can be written as M(k) =P(k | k -1)CT[R~+CP(k | k -1)CT] P(k k)= kIM(k|-1) + M (k) [y (k) - i) (k ( k -1)] P(k| k)= [I - M (k)C ] P(k I k - 1) (15) b =y(k +1)-CAx^(k I k -1) F=C[A B] Q, = (P (k I k - 1) @ Ql) (20) W = R H =CD Here, the notation (a ~ b) denotes a block diagonal matrix with entries a and b. The minimization of (17) will always yield a unique solution for A, since G(,) will always have a global minimum in the interval [II HTWHII,cx) (Sayed 2001) 5 Data Association 5.1 Tracking Procedure Our robust tracker is initiated with peak data from the first frame, with the initial values (in the forth order model) ^(1 |0)=[f(0) p(O) 0 0]T -1 (21) O] P(1|0)(= 11+CTR1C) and Q=H=IJ~ x (k +I1| k) = Ai(k I k) P(k +~1 k)=.AP(k | k)AT~+BQBT where R =R -AJ1CDDTCT PT( )-1 P (k I k - 1) = (P (k I k - 1) + E TE 2 0 0 (T2 (16) p (22) A = A (I - AP(k | k - 1)E TE) Here, A is a nonnegative scalar parameter that can be determined from optimization A=-arg minm G(A) (17) 2~|HTWH where the function G(u) is defined as G (A)-z T((A)Qf z (A)~+A{ Ez (A) + Ei (k | k -1)12( (18)( and W (A) AW +WH(AI -HTWH)-HTW Q (A) Qf - AE TE (k | k - 1) ~TW(19) Z (/I) [Q (A) + F T (A)F x [FT (A)b - AE T.E (k I k - 1)] The relation between new parameters in (17)-(19) and those in (15) and (16) is as follows where and 2 2 are the variances of observation noise processes and take values close to one. At the next step we calculate A by minimizing G(u) over the interval [I HTWHII,x) and this process is repeated for every iteration. After finding A we use modifications of (16) and Kalman tracker of (15) to estimate noise-free values for power and frequency in the following frame. If the following frame contains a peak that is close enough to the estimated peak, that peak is added to the track and is used to update the tracker. This process is continued through successive frames until there is no peak close enough to the last estimated peak. Here, this track is terminated or considered as "dead" and a new track is initiated in the next frame. The process starts with all peaks in the first frame and also with all peaks from other frames that have not been used in any previous tracks. 478

Page  00000479 5.2 Distance Function and Gating A peak is close enough to our estimated peak if it falls into the acceptance gate of the track. We use the distance function or Mahalanobis distance to define the closeness of peaks to the estimated values as follows (Blackman 1986) d2(k)=eT(k)[CP(k k)CT +~]le(k) (23) Here, e(k) = y (k) - C (k k) is the error between current observation and the predicted values, andCP(kI k)CT +R is the covariance matrix of this error. A peak falls into the acceptance gate of an estimated peak if the value of its distance function is less that the gate value. If more than one peak is in the acceptance gate, the one with less distance is selected. Based on our experience, if we set a universal value for our acceptance gate, the tracking result will be poor. The nature of our frequency tracks is suggests an adaptive acceptance gate with different values at different frequencies. As mentioned earlier, we are dealing with pseudo-stationary signals. Frequencies of our partials vary with time but these variations are magnified when we move from lower harmonics to the higher harmonics. So, if we consider the same value for our acceptance gate in all frequencies, we have the risk of missing tracks in higher frequencies or loosely accepting false partial tracks in lower frequencies. To cope with these variations we set the gate value as a function of frequency, which is g (f) = 10 + 0.01f (24) In fact, we increase the chance of continuing a track where peaks are sparser and less likely to join a track with lower variations. 5.3 Track Quality Factor The distance function of (23) can also provide valuable information about the quality of a track. A partial with consistently high values for d2(k) implies that the updated observations are very noisy and the associated track should be weighted less than those tracks with smaller values for d2(k). This information, along with PQF values for peaks that have joined a track, can provide a measure for how meaningful a track is. This can provide us with another level of information for rejection of fake partial tracks. With these considerations, we define our Track Quality Factor (TQF) as 5.4 Rejection Rules After all the partial tracks are collected, we need rejection rules to extract all the erroneous tracks which are resulted from randomly successive sets of spurious peaks, or noise in the signal itself. Using TQF and other properties of observed partial tracks, we arrive at the following rejection rules: Track Length. This is the most effective rejection rule since most of the erroneous tracks are discontinued after few recursions. Track Quality Factor. If a track is formed by randomly successive set of false peaks which fall inside the edges of acceptance gate, the resulting TQF will be comparably smaller than that of meaningful tracks, even when this false track is lengthy enough to leap from the track length threshold. Track Smoothness. We expect the frequency of a genuine track to be smooth to some degree. A track is defined to be smooth enough is -a Sx 100 l s fm (26) where of is the standard deviation of the frequency track, fm is the average frequency of the track and s is the smoothness factor. It should be noted that track smoothness varies for different instruments. For example, saxophone has very smooth frequency tracks, while those of flute carry noticeable ripples. Also there is no guarantee that the smoothness rule works fines if a signal contains vibrato or glissando. However, if these effects are not present in our signal, with carefully tuned smoothness measure we can still reject some erroneous partial tracks that happen to pass both previous rejection tests. 6 Results As we mentioned earlier, the robust Kalman tracker is proposed as an improvement to our previous tracker which was based on the conventional Kalman filter. These improvements are present in figure 2 and figure 3. In figure 2 our robust estimates, i.e. the dashed line, track the frequency partial more closely than the conventional estimates. Since we used estimated and averaged values for parameters of the evolution models in our conventional tracker, the deviation of estimates can be randomly high and divertive. On the other hand, in the robust tracker our estimates, which are calculated based on the worst case scenario, follow the partial track in a more reliable manner. This improvement can be further investigated in figure 3, where we have two closely spaced partial tracks belonging to two different notes in a polyphonic setting. In such settings, where we can have more than one musical note at a time, it is always possible that harmonics of different notes get very close to each other. In this situation, estimates with higher deviations can TQF(f)=1 g (+PQF(k,f ) 2N k=1 g(f ) (25) where PQF(k,f )is the PQF normalized with the largest PQF over all the detected peaks. Since the distance value for all accepted peaks is less than the gate and the normalized PQF is always less than one, TQF always takes values less than one and tracks with higher TQF receive more credibility. 479

Page  00000480 follow the wrong trajectory as can be observed in the lower part of figure 3, which shows the result of our conventional Kalman tracker. In the conventional tracker, this is due to the more weight given to noise power in higher frequencies (see the right side of figure 1) for coping with magnified frequency variations in higher frequencies. On the contrary, the tracking properties of our robust tracker are not influenced by these variations, as observed in the upper part of figure 3. The tracks of PQF for these two erroneous partial tracks also have distinctively smaller values as seen in figure 5. 700 600 500 400 300 200 1 0.8 0.6 0.4 0.2 0 TQF 20 40 60 80 100 120 Time (x50 msec) Robust Tracker 1340 Figure 4: Frequency tracks along with their TQF 1330 -1320 -LL1310 -1300 - 1340 - 5 10 15 PQF Tracks 1.2: 5 10 15 Time (x50 msec) Conventional Tracker 20 25 30 -1330- - ~1320 > I LL 1310 1300 - 0 5 10 15 20 25 30 Time (x50 msec) Figure 2: Performance of the two trackers. Upper: Robust tracker with estimated track (dashed) and observed track (solid). Lower: conventional tracker with estimated track (dashed) and observed track (solid) Robust Tracker 8050 8000 7950 ~ 7900 - 6W7850- C--D- - Cýr - - 0 7800 77501 0 2 4 6 8 10 12 14 16 18 20 Time (x50 msec) Conventional Tracker 8050 8000! 7950 ~7900 /0 ~ 780U 0 0 0 0 Cr 0 0 7800 7750 0 2 4 6 8 10 12 14 16 18 20 Time (x50 msec) Figure 3: Tracking in the presence of closely spaced partials: Robust tracker (upper) and conventional tracker (lower); with observed track for higher harmonic (solid), its estimates (asterisk), observed track for lower harmonic (dashed) and its estimates (circle). The effectiveness of our data association techniques are also investigated in figure 4 and figure 5. These figures show frequency tracks of an A3 note played on the saxophone along with its TQF and tracks of PQF. As we can see in figure 4, the two tracks at frequencies of 130 and 310 Hz have smaller TQFs as compared to the other three tracks. 0. 6. 044 -02 0 20 40 6 0 80 100 120 Time (x50 msec) Figure 5: Tracks of power quality factors for example of figure 4 For tuning our tracking system, we determined the most effective values for our rejection factors by examining a large number of musical sounds. Based on these observations we accept a partial track as a genuine one if * It is longer than 10 time frames, * Its TQF is larger than 0.4, and * Its smoothness measure is less than 0.7. The statistical data for effectiveness of our rejection rules is provided in table 1. We gathered this data by analyzing a larger database of music notes taken from an online resource of musical instrument samples (Fritts 1997), and counting the number of real and false partial tracks rejected through each rejection rule. The two first rows of table 1 contain the percentages of false and real tracks rejected by each individual rule and the third row contains the contribution of each rule to rejection of all detected false tracks. Length TQF Smoothness False %99 %92 %85 Real 01 %8 %15 Total false %45 %41 %14 rejected TIo c Table 1: Rej.ection accuracy 480

Page  00000481 As it can be seen, rejection based on track length is the most effective, but when this fails to reject the false tracks, TQF and smoothness nile can provide exceptional performance and have noticeable contribution to the overall rejection process. 7 Conclusion We provided a robust Kalman f~ilter for tracking of partials in music signals and introduced some novel data association strategies and rejection rules. These rules along with the robust estimation capabilities of our tracker provide an accurate tracking for sounds from different classes of musical instruments. References Blackmnan, S. S. 1986. Multiple-target tracking with radar applications. Dedhnam, MA, Artech House. Depalle, P., G. Garcia, et al. 1993. Tracking of partials for additive sound synthesis using hidden Markov models. In Proceedings of the 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '93). Fritts, L. 1997. Musical Instrument Samlples (M~IS). The University of Iowa Electronic Music Studios. htt~p:/there-ninm~-usic~u~~~iow edu/T' I$ htrnl. Lagrange, M., S. Marchand, et al. 2004. Using linear prediction to enhance the tracking of partials. In Proceedings of the 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04). Satar-Boroujeni, H. and B. Shafai. 2005. Peak Extraction and Partial Tracking of Music Signals using Kialman Filtering. In Proceedings of the 2005 International Comzputer M~usic Conference. Barcelona, Spain. Satar-Boroujeni, H. and B. Shafai. 2005. Peak Tracking and Partial Formation of Music Signals. In Proceedings of the 13th European Signal Processing Conference. Antalya, Turkey. Sayed, A. H. 2001. A framework for state-space estimation with uncertain models. In Automatic Control, IEEE Transactions on, vol. 46, pp. 998-1013. Serra, X. 1997. Musical Sound Modelling with Sinusoids plus Noise. In Musical Signal Processing, pp. 9 1-122. Studies on New Music Research. Lisse [Netherlands]; Exton, PA: Swets & Zeitlinger. Sterian, A. 1999. "Model-Based Segmentation of Time-Frequency Images for Musical Transcription," PhD dissertation. University of Michigan. 481