# Analysis of Sound for Additive Synthesis: Tracking of Partials Using Hidden Markov Models

Skip other details (including permanent urls, DOI, citation information)This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License. Please contact mpub-help@umich.edu to use this work in a way not covered by the license. :

For more information, read Michigan Publishing's access and usage policy.

Page 94 ï~~Analysis of Sound for Additive Synthesis: Tracking of Partials Using Hidden Markov Models Ph. Depalle, G. Garcfa and X. Rodet IRCAM, 31 rue Saint Merri, 75004 Paris, France Tel: (33.1) 44.78.48.45, Fax: (33.1) 42.77.29.47 e-mail: phd, garcia, rod @ircam.fr Abstract We present a sinusoidal partial tracking method, based on Hidden Markov Models (HMM), for additive synthesis of sound. It allows to track time-varying partials as well as constant ones, and has no harmonicity constraints at all. In addition, it is able to easily track crossing partials. Several interesting examples will be shown, as well as a complete description of the HMM-tracker. 1. Introduction We present here a HMM-based tracker of sinusoidal partials, for additive synthesis of sound [Garcia 92]. Additive synthesis is one of the most powerful synthesis techniques. It is based on a model which represents a sound signal as a sum of sinusoids with time-varying frequency and amplitude [Rodet 92]. These parameters are generally supplied to the synthesizer as breakpoint time-functions. Such functions are interpolated at the sampling rate to obtain instantaneous parameter values. When working with natural sounds, these breakpoint time-functions can be estimated from the original signal by an automatic analysis. This is basically performed in three steps: short-term spectral estimation, spectral peak detection, and spectral peak continuation. In IRCAM's additive synthesis environment, the first step is implemented by a sliding window FFT, and the second one by extraction of local maxima followed by a Lagrange interpolation. A more detailed description of this is presented in [Depalle 93]. The third step of the analysis procedure is by far the more critical one. Existing approaches [Mc Aulay 86], [Serra 89] work well enough for some categories of sounds (harmonic, voiced, and slow time-varying sounds), but fail in presence of multiple harmonic structures, inharmonics, crossing partials, voiced/unvoiced transitions, and strong frequency variations. Our idea has been to cope with all these problems by globally optimizing the set of detected trajectories. More explicitely, all trajectories in a given time interval will be detected simultaneously, after evaluation of continuity criteria in the whole set of peak frames. This differs from existing methods, which generally consider that tracking has been done up to the previous frame, and examine each peak of the current frame to eventually associate it to an existing trajectory. The tracking problem has been formulated in terms of a hidden Markov model. Thus, we find the optimal set of trajectories by looking for the highest probability state sequence, by means of the Viterbi algorithm [Rabiner 86]. This leads to very good tracking performances, as it will be shown in section 5. In addition, the method can be applied without any modification to formant tracking. In sections 2 and 3, we describe the HMM-tracker, in section 4 we discuss the principles of the algorithm and in section 5 we present some examples. 2. Method description The model of trajectory we use in our 11MM tracker is intuitively defined as a time-sequence of peaks which satisfies continuity constraints on parameter slopes. Consequently, our method tends to identify trajectories whose amplitude and frequency slopes evolve smoothly in time. The use of parameter slopes rather than parameter values enables to track time-varying partials as easily as constant ones, and solves the problem of detecting crossing trajectories. To formalise our HMM tracker, let us introduce some definitions: " A trajectory is labelled by an index i > 0. " The time series of peak frames is referenced by subscript k. At time k, there are hk peaks. The frequency and amplitude of peak j in frame k are denoted fk(J) and ak (j) respectively. " A peak which belongs to a trajectory is called an useful peak; otherwise it is called a spurious peak. " An index Ik (j), 1:5 j <hk, is associated to each peak of frame k. Indexes of useful peaks are the label of the trajectory to which they belong, and indexes of spurious peaks are null. " Ik is the vector of indexes at time k. " The transition probability from a state x at time k --1to a state y at time k, is denoted by ak(X,Y). Then, the elements of the HMM are defined as follows: at time k, the state Sk is the ordered pair of vectors ([k-1, Ik), and the observation Ok is the ordered pair of scalars (hk, hk-l). 2A.3 94 ICMC Proceedings 1993

Page 95 ï~~Defining state and observation with ordered pairs is based on the fact that we want to retrieve the right set of trajectories as the globally optimal succession of elementary trajectories. The more elementary trajectory is a straight line which links two points. In this way, slopes appear as the relevant parameters to compute the transition probabilities between states. Some important remarks need to be done here: Peak parameters are not considered as observations, but define the transition probabilities; they are parameters of the model. With the above definition of the observation, we retain only the combinatorial aspect of the peak matching problem. As a consequence, for a given state there is only one possible observation. On the other hand, a given observation can be produced by several states, and this is why states remain hidden. In other words, the observation probabilities for a given state are equal to unity for one particular observation, and null for all others. Consequently, only transition probabilities need to be computed. For example, all possible states for the observation (3, 2) are: - States with no trajectory: 0 00 00 - States with one trajectory: Let us consider a transition from a state x at time k -1 to a state y at time k. It is clear that frames k, k -1and k - 2 are concerned. For each peak j of frame k, we compute an individual score 0r (j) in the following manner:.. - If state y defines peak j as useful, we look for peaks t an d r of frames k-l1 a n d k-2 respectively, such that Ik(j) = Ik_,i(t) = lk,_,2(r) Then, a gaussian fonction is used to evaluate the continuity of frequency and amplitude slopes, for the trajectory defined by these three peaks: 1 (j) = exp{ ( 2 of a, where of,,a are input parameters and Afrk =fk(J)-fk-,_(t); Aak = ak(j)- ak_,l(t). -If state y defines peak j as spurious, we look for peaks t and r of frames k-1 and k- 2 respectively, such that (Afk -Afk_1) is minimum. The discontinuity of the trajectory defined by these three peaks is evaluated as: (A)tT(1 { -Af f)2 (g a 2 {1-(1-t)exP { -(a A2 where,(0<.t-1) is an input parameter. Figure 3 illustrates the computation of individual scores 0kr(j) for a given state transition, concerning useful and spurious peaks. S0 0 0 0 0 oa 0'4 0 0 O 00O *0 - States with two trajectories: Fig. 1: All possible states for the observation (3,2). 3 Evaluation of state transition probabilities According to the intuitive model of trajectory proposed below, transition probabilities must favour state transitions which provide: f the best continuity of parameter slopes between useful peaks, f the largest discontinuity of parameter slopes between spurious peaks. Figure 2 shows an example of this: the first transition must be more probable than the second one, because it defines more smooth trajectories. Fig. 3: Evaluation of individual scores 6k (j). I 2 i 1 1 Fig. 2: Two possible state transitions. Then, a global score is computed as ICMC Proceedings 1993 95 2A.3

Page 96 ï~~j=ht g (Xy) = fJOk j) j=l Finally, as transition probabilities from a given state at time k- 1 to all possible states at time k must sum to unity, we obtain the transition probability from state x to state y as ak( xy)g(x,y) g /g x, z) ZEQk 4. Algorithm The computational complexity of the algorithm depends strongly on the number of combinations of indexes in the vectors Ik, which determines the number of states. It also depends on the number of frames on which trajectories have to be optimised. In order to limit this complexity (otherwise the method is not practicable) we use the Viterbi algorithm on a window of T frames length which slides frame by frame, and we introduce some constraints on the index combinations. For instance, births and deaths of trajectories are disallowed in the states, forcing the state sequences to have a constant number of trajectories. To take into account births and deaths on the whole set of frames, we use a higher level procedure. Births are detected by looking for partials in current window which did not exist in the preceding one; deaths are found by looking for partials in the preceding window which disappear in the current one. Consequently, the length T fixes the shortest duration of a partial. Other constraints can be eventually introduced at the state level, such as: limiting frequency or amplitude slopes, limiting the number of line crossings, selecting a reduced set of possible number of trajectories, etc. The detection of births and deaths is schematised in figure 4. ooO o 0 0.0 0 0 i 0 0-.a---- Birth of,=. trajectory 3 't -O...O is detected Deaths of 0 0 trajectores 2and 3 0 0 are detected,O----o Result: 0 Fig. 4: Birth and death detection. 5. Experimental results The following figures show some experimental results obtained by applying our method to simulated and real daa. -.. Â~..:.... %........-:.. Crossing of partials: three sinusoids embedded in white noise (S/N =0 dB).Z.._....., Tracking of partials of a voice "glissando" PecsÂ~ eihroi on Tbada 2A.3 96 ICMC Proceedings 1993

Page 97 ï~~_,'. a.. ': ' -:Â~... ' -.\ I.o... I-:2 -Crossing of partials: speech signal (/SASAS/) corrupted by a chirp sinusoid K=, -'....... --...,....-- 6. Conclusion The previous examples show that a purely combinatorial HMM-based tracker is quite performant for additive analysis/synthesis in a musical context. The presented method has two key principles: f The use of parameter slopes rather than parameter values, to evaluate the continuity of trajectories. This allows to track variable frequency and variable amplitude partials, and solves the problem of frequency line crossing. * The global optimization, in frequency and time, of the set of detected partials, which is a consequence of the HMM formulation of the method. This ensures a certain coherence in tracking multiple partial structures, and provides a very good voicing discrimination: the algorithm stops tracking during unvoiced regions since it can't find any continous trajectory. Our HMM-tracker has been implemented in a modular way so that constraints can be chosen before running it. We are currently working on the reduction of the computational cost of the algorithm by refining the set of constraints. We are also looking for a more efficient type of algorithm, to take advantage of the factorised structure of transition probabilities. In particular, we are examining the possibility of substituting the Viterbi algorithm to eliminate computation redundancy. References [Garcia 92] Guillermo Garcfa. Analyse des signaux sonores en termes de partiels et de bruit. Extraction automatique des trajets frequentiels par des modules de Markov caches. Mbmoire de DEA en automatique et traitement de signal. Orsay, July 1992. [Mc Aulay 86] Robert McAulay, Thomas Quatieri. Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. on Acoust., Speech, and Signal Proc., vol. ASSP-34, Aug.1986. [Depalle 93] Philippe Depalle, Guillermo Garcia, Xavier Rodet. Tracking of partials for additive sound synthesis using hidden Markov models. IEEE ICASSP-93, Minneapolis, Minnesota, Apr. 1992. [Rabiner 86] Lawrence Rabiner, Biing-Hwang Juang. An introduction to Hidden Markov Models. IEEE ASSP Magazine, Jan. 1986. [Rodet 92] Xavier Rodet, Philippe Depalle. A new additive synthesis method using inverse Fourier transform and spectral envelopes. Proc. of ICMC, San Jose, California, Oct. 1992. (Serra 89] Xavier Serra. A system for sound analysis/transformation/synthesis based on a deterministic plus stochastic decomposition. Philosophy Dissertation, Stanford University, Oct. 1989... of f.. o s.:.1...... Tracking of formants: diphone /10/ ICMC Proceedings 1993 97 2A.3