Score-Performance Matching using HMMs

Pedro Cano, Alex Loscos, Jordi Bonada
Audiovisual Institute, Pompeu Fabra University
Rambla 31, 08002 Barcelona, Spain
{ pcano, aloscos, jboni }@iua.upf.es
http://www.iua.upf.es

Abstract

In this paper we describe an implementation of score-performance matching, capable of score following, based on a stochastic approach using Hidden Markov Models.

1. Introduction

Linking notes in a musical performance to the corresponding notes in a score is called score-performance matching. Proper matching algorithms are crucial for automatic accompaniment systems, or for systems in which the computer has to find out where the musician is with respect to a known score in order to make an appropriate musical response. For example, audio effects can be applied to the performer's sound depending on his or her location in the score. In the context of musical performance research, matching algorithms are valuable for measuring differences between the performance and the score, making it possible to extract expressive timing patterns, calculate tempo, pitch deviation patterns, or any other performance characteristics. One category of algorithms focuses on real-time matching; these are often called score following. Early work in score matching was carried out by Dannenberg [1] and Vercoe [2], both primarily interested in building real-time matchers. Dannenberg describes an algorithm based solely on pitch information, intended to robustly follow a monophonic instrument. These algorithms were later extended to deal with polyphonic music and multiple instruments [3]. Most algorithms primarily match pitch, possibly in combination with timing information. The method presented here focuses on monophonic music and can be seen as a continuation of Puckette's work [4], moving from knowledge-based to stochastic models.

2. Stochastic modeling

The input features needed to make a decision when performing a match, namely fundamental frequency, note durations, etc., are inherently uncertain. The uncertainty arises from the fact that instrument players or singers do not perform an ideal realization of what is written in the score, and from the fact that fundamental frequency algorithms are not absolutely reliable. Other algorithms can perform reasonably well with specific instruments such as the flute, but they fail when the problem stated above is especially troublesome. This occurs very clearly with the singing voice, in which the output does not resemble at all a sequence of discrete tempered pitches attained at well-defined times. Stochastic modeling is a flexible, general method for situations like the one described above. It consists of employing a specific probabilistic model for the uncertainty or incompleteness of the information. A music performance is a nonstationary process, since its statistical parameters vary over time. Hidden Markov Models (HMMs) are well studied and widely used for the statistical modeling of nonstationary stochastic processes such as speech, and our work has been to apply them to our application.

3. Hidden Markov Models

An HMM is most easily understood as a generator of vector sequences. It is a finite state machine which changes state once every time unit, and each time t that a state j is entered, an acoustic vector y_t is generated with probability density b_j(y_t). Furthermore, the transition from state i to state j is also probabilistic and governed by the discrete probability a_ij.
In the figure below we show an example of this process, where the model moves through the state sequence X = 1,1,2,2,2,2,2,3,3,3 in order to generate the 10 observation vectors of model k.

[Figure 1. Markov process example: three output densities b_1(y), b_2(y), b_3(y), the state transition probabilities, and the alignment of states to observations over time for the concatenated models k-1, k, k+1.]
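To make the generator view concrete, the following minimal Python sketch samples a state sequence and an observation sequence from a toy three-state left-to-right model. The transition matrix and the one-dimensional Gaussian output densities are invented for illustration; they are not the models used in the paper.

```python
import numpy as np

# Hypothetical 3-state left-to-right model (attack, steady state, release).
# Transition probabilities a[i][j] and per-state Gaussian output densities
# are illustrative values only.
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
means = np.array([-0.5, 0.0, 0.5])   # per-state output mean (1-D for brevity)
stdev = np.array([0.2, 0.1, 0.2])    # per-state output standard deviation

def generate(model_len=10, rng=np.random.default_rng(0)):
    """Generate a state sequence X and an observation sequence Y from the HMM."""
    states, obs = [], []
    j = 0                                          # start in the first state
    for _ in range(model_len):
        states.append(j + 1)                       # 1-based, as in the paper
        obs.append(rng.normal(means[j], stdev[j]))  # y_t drawn from b_j(y)
        j = rng.choice(3, p=A[j])                  # next state drawn from a_ij
    return states, obs

X, Y = generate()
print(X)   # a left-to-right path such as 1,1,2,2,2,2,2,3,3,3
```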

The joint probability of a vector sequence Y and a state sequence X given some model M is calculated simply as the product of the transition probabilities and the output probabilities. The joint probability of an acoustic vector sequence Y and some state sequence X = x(1), x(2), x(3), ..., x(T) is:

P(Y, X | M) = a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(y_t) \, a_{x(t)x(t+1)}

In practice, of course, only the observation sequence Y is known and the underlying state sequence X is hidden. This is why it is called a Hidden Markov Model. In our problem, the model M is the sequence of notes specified in the score script, Y is the feature sequence extracted from the audio signal, and X is the state sequence, which is actually what we aim for in this paper.

4. Note models based on HMMs

In this section we present the features and the characteristics of the HMMs used to model the notes.

4.1 Front-End Parameterization

The function of the front-end parameterization stage is to divide the input signal into blocks and to extract relevant features from each block. The six features that will be used for the observation sequence are:

Energy,
Energy(n) = 20 \log_{10} \sum_{k} |X(k)|

Delta Energy,
\Delta Energy(n) = Energy(n+1) - Energy(n-1)

Zero Crossing,
Z(n) = \frac{1}{N} \sum_{m=n-N+1}^{n} \frac{| \mathrm{sgn}\{x(m)\} - \mathrm{sgn}\{x(m-1)\} |}{2} \, w(n-m)

Delta Fundamental Frequency,
\Delta F_0(n) = \log \frac{F_0(n+1)}{F_0(n-1)} if F_0(n-1) \cdot F_0(n+1) \neq 0, and \gamma elsewhere

Fundamental Frequency, and Fundamental Frequency Error, which is a measure of the "goodness" of the estimated F_0.

Obtaining fundamental frequencies is a popular subject of study. The particular algorithm we use is an adaptation of the Two-Way Mismatch procedure [5]. In this procedure, the estimated F_0 is chosen so as to minimize discrepancies between measured partial frequencies and harmonic frequencies generated by trial values of F_0. For each trial F_0, the mismatches between the generated harmonics and the measured partial frequencies are averaged over a fixed subset of the available partials. The predicted-to-measured error is defined as:

Err_{p \to m} = \sum_{n=1}^{N} \left[ \Delta f_n (f_n)^{-p} + \frac{a_n}{A_{max}} \left( q \, \Delta f_n (f_n)^{-p} - r \right) \right]

where \Delta f_n is the difference between a predicted peak and its closest measured peak, f_n and a_n are the frequency and magnitude of the predicted peaks, and A_{max} is the maximum peak magnitude. The measured-to-predicted error is defined as:

Err_{m \to p} = \sum_{k=1}^{K} \left[ \Delta f_k (f_k)^{-p} + \frac{a_k}{A_{max}} \left( q \, \Delta f_k (f_k)^{-p} - r \right) \right]

where \Delta f_k is the difference between a measured peak and its closest predicted peak, f_k and a_k are the frequency and magnitude of the measured peaks, and A_{max} is the maximum peak magnitude. The total error is:

Err_{total} = Err_{p \to m} / N + \rho \, Err_{m \to p} / K

This pitch estimation method gives a temporal evolution of F_0. This envelope is then used to calculate some of the system's input features.
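As an illustration of the Two-Way Mismatch measure, the sketch below evaluates Err_total for a set of trial F_0 values against a list of measured peaks and picks the minimizer. It is a simplified reading of the formulas above, not the SMS implementation cited in [5]; the parameter values p, q, r, rho, the in-band harmonic pruning and the amplitude weighting are assumptions made for the example.

```python
import numpy as np

def twm_error(f0, peak_freqs, peak_mags, n_harmonics=10,
              p=0.5, q=1.4, r=0.5, rho=0.33):
    """Two-Way Mismatch error for one trial f0 (simplified sketch).

    peak_freqs / peak_mags: measured partial frequencies (Hz) and magnitudes.
    The parameter values are illustrative, not taken from the paper.
    """
    peak_freqs = np.asarray(peak_freqs, dtype=float)
    a = np.asarray(peak_mags, dtype=float) / np.max(peak_mags)     # a_n / A_max
    harmonics = f0 * np.arange(1, n_harmonics + 1)                 # predicted peaks
    harmonics = harmonics[harmonics <= peak_freqs.max() + f0 / 2]  # keep them in band

    # Predicted-to-measured: each predicted harmonic vs. its closest measured peak.
    dist = np.abs(harmonics[:, None] - peak_freqs[None, :])
    df_n, nearest = dist.min(axis=1), dist.argmin(axis=1)
    w_n = df_n * harmonics ** (-p)
    err_pm = np.sum(w_n + a[nearest] * (q * w_n - r))

    # Measured-to-predicted: each measured peak vs. its closest predicted harmonic.
    df_k = np.abs(peak_freqs[:, None] - harmonics[None, :]).min(axis=1)
    w_k = df_k * peak_freqs ** (-p)
    err_mp = np.sum(w_k + a * (q * w_k - r))

    return err_pm / len(harmonics) + rho * err_mp / len(peak_freqs)

# The trial f0 minimizing the error is taken as the frame's estimate.
peaks_f, peaks_a = [220.0, 441.0, 660.5, 881.0], [1.0, 0.7, 0.4, 0.2]
candidates = np.arange(80.0, 800.0, 1.0)
errors = [twm_error(c, peaks_f, peaks_a) for c in candidates]
print(candidates[int(np.argmin(errors))])   # expected to pick ~220 Hz here
```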
4.2 Note Model Architecture

We have three left-to-right HMM models: a note model, a no-note model and a silence model. We model notes with 3 states so as to mimic the note behavior of attack, steady state and release. Silence is modeled with only 1 state, since there is no temporal structure to exploit. The no-note model, with 3 states, aims to account for all the unpitched sounds that appear in the performance. These can include noisy spurious sounds or, in the case of the singing voice, aspirations, fricatives, plosives, etc.

The choice of output probability function is crucial, since it must model the intrinsic variability of the score realizations. Some HMM systems use discrete output probability functions in conjunction with a vector quantizer. Each incoming feature vector is replaced by the index of the closest vector in a precomputed codebook, and the output probability functions are just look-up tables containing the probabilities of each possible VQ index. This approach is computationally very efficient, but the quantization introduces noise, which limits the precision that can be obtained.
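For contrast with the continuous densities adopted below, this small sketch spells out the discrete, vector-quantized output probability just described: the incoming feature vector is replaced by the index of its nearest codebook entry, and the state's output probability is a table look-up. The codebook and table values are invented for illustration.

```python
import numpy as np

# Hypothetical precomputed codebook of M feature-space centroids (M x d)
codebook = np.array([[0.0, 0.1], [1.0, 0.9], [0.5, 0.5], [0.2, 0.8]])

# Per-state look-up table: probability of each VQ index in each state (S x M)
vq_prob = np.array([[0.70, 0.05, 0.20, 0.05],    # state 1 (e.g. attack)
                    [0.10, 0.10, 0.70, 0.10],    # state 2 (steady state)
                    [0.05, 0.60, 0.10, 0.25]])   # state 3 (release)

def discrete_output_prob(y, state):
    """b_state(y) under the VQ scheme: quantize y, then look up its index."""
    idx = int(np.argmin(np.linalg.norm(codebook - y, axis=1)))  # nearest centroid
    return vq_prob[state, idx]

print(discrete_output_prob(np.array([0.45, 0.55]), state=1))   # -> 0.70
```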

Hence, it seems a better choice to use parametric continuous density output distributions, which model the feature vectors directly, for instance with a multivariate Gaussian mixture:

b_j(y_t) = \sum_{m=1}^{M} c_{jm} \, N(y_t; \mu_{jm}, \Sigma_{jm})

where c_{jm} is the weight of mixture component m in state j and N(\cdot; \mu, \Sigma) denotes a multivariate Gaussian of mean \mu and covariance \Sigma. This method results in a system with too many parameters to train. Thus, in this work, we use the same large Gaussian codebook for all the states and only estimate different mixture weights for each state. We build a codebook with vectors that are composed of all the above features except F_0, which will be treated differently. To construct a codebook l, the N_l corresponding observations y_l are clustered into M_l subsets, M_l being the number of mixture components of the codebook. To do this clustering we use the LBG algorithm. The mean of the mth mixture is then

\mu_{lm} = \frac{1}{N_{lm}} \sum_{y \in m} y

and the mixture coefficient c_{lm}, for the case where N_{lm} vectors are assigned to the mth mixture of the codebook, results in

c_{lm} = \frac{N_{lm}}{N_l}

The output probability of the observation F_0 is defined with two discrete symbols: one symbol when F_0 is not defined and another one when F_0 has been found. In the states of the note models, the discrete symbol is decomposed into a continuous density function whenever F_0 has been found, as shown in Figure 2.

[Figure 2: F_0 observation probability function (note model vs. noise and silence models, with F_0 = 0 and F_0 > 0 branches).]

The continuous density function is modeled as a mixture of Gaussians whose input is:

FoDev = \log(F_0) - \log(F_0^{note})

where F_0^{note} is the fundamental frequency associated with the note model. The total output probability is the product of the five-feature vector's probability and the fundamental frequency's probability.
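The combination of the five-feature probability with the F_0 term can be sketched as follows. This is only an illustration of the scheme in Figure 2 under stated assumptions: the discrete voiced/unvoiced probability and the Gaussian mixture over FoDev use invented parameters, and the note's nominal frequency is passed in explicitly.

```python
import numpy as np

def gmm_pdf(x, weights, means, stds):
    """1-D Gaussian mixture density (illustrative parameters)."""
    weights, means, stds = map(np.asarray, (weights, means, stds))
    return float(np.sum(weights * np.exp(-0.5 * ((x - means) / stds) ** 2)
                        / (stds * np.sqrt(2 * np.pi))))

def f0_output_prob(f0_detected, f0_note,
                   p_voiced=0.9,   # discrete symbol: probability of "F0 found"
                   gmm=dict(weights=[0.7, 0.3], means=[0.0, 0.05], stds=[0.03, 0.1])):
    """Output probability of the F0 observation in a note-model state (sketch)."""
    if f0_detected is None or f0_detected <= 0:
        return 1.0 - p_voiced                       # symbol "F0 not defined"
    fo_dev = np.log(f0_detected) - np.log(f0_note)  # deviation from the score note
    return p_voiced * gmm_pdf(fo_dev, **gmm)        # "F0 found" expanded to a density

def total_output_prob(p_five_feature, f0_detected, f0_note):
    """Total output probability: five-feature term times the F0 term."""
    return p_five_feature * f0_output_prob(f0_detected, f0_note)

print(total_output_prob(0.02, f0_detected=442.0, f0_note=440.0))
```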
5. Model Training

For the training of the models we need a labeled training set of musical phrases, where each sentence consists of the music waveform and its transcription into notes. According to the transcription, we concatenate the HMMs of the musical units to build an extended finite-state network (FSN). It is important to realize that, even though most models are notes (apart from silences and unvoiced sounds), each HMM note model differs from the others, because each keeps information about its associated fundamental frequency and duration. This information is needed to calculate some input features. Once a composite sentence FSN is created for each sentence in the training set, the training problem becomes one of estimating the unit model parameters that maximize the likelihood of the models for all the given training data. The maximum likelihood parameters can be solved for using either the forward-backward procedure or the segmental k-means training algorithm. For our system we have chosen the Segmental K-Means algorithm [6] and we have implemented it as follows:

1. Initialization: Linearly segment each training utterance into units and HMM states.

2. Estimation: The transition probabilities are estimated by merely counting the number of times each transition is used and dividing by the number of times the source state of that transition is used (see the sketch after this list). This requires maintaining counters to track each transition and each output symbol during training. The mixture weights for the five-feature vector probability function are estimated for each state i as:

c_{im} = \frac{N_{im}}{N_i}

For F_0, the output probability function is estimated as:

b_i(v_k) = \frac{\text{no. of times in state } i \text{ with observed symbol } v_k}{\text{no. of times in state } i}

For the discrete symbol that expands into a continuous density function, we estimate this function the same way as the five-feature one, but now only the states belonging to note models share the codebook. This codebook has to be re-estimated at every iteration.

3. Segmentation: The updated set of unit models (based on the estimation of step 2) is used to resegment each training utterance into units and states (via Viterbi decoding).

4. Iteration: Steps 2 and 3 are iterated until convergence.
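The counting in step 2 can be made concrete with a short sketch: given state-aligned training sequences, the transition probabilities are just normalized transition counts. The aligned sequences here are toy data, not taken from the paper.

```python
import numpy as np

def estimate_transitions(state_sequences, n_states):
    """Estimate a_ij by counting transitions and normalizing by source-state usage."""
    counts = np.zeros((n_states, n_states))
    for seq in state_sequences:
        for i, j in zip(seq[:-1], seq[1:]):
            counts[i, j] += 1
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1            # avoid division by zero for unused states
    return counts / totals

# Toy aligned state sequences (0 = attack, 1 = steady state, 2 = release)
alignments = [[0, 0, 1, 1, 1, 1, 2, 2],
              [0, 1, 1, 1, 2, 2, 2]]
print(estimate_transitions(alignments, n_states=3))
```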

Since we wanted the system to be accurate in the definition of borders, we manually supervised the resulting segmentation and adjusted some note borders. By doing so, we obtained a growing parallel database in which the notes are labeled and segmented with fixed borders. These segmented sentences are then used for training in a different way than the unsegmented ones. Instead of linearly segmenting each training utterance into units and HMM states, for the note-segmented utterances we only linearly segment the HMM states inside the unit borders. When we iterate, only these inner HMM state borders are allowed to be reallocated; the note borders are left fixed. Doing this resulted in a more accurate alignment.

6. Viterbi decoding

For alignment, the trained models are concatenated and run against the feature vectors extracted from the sound using Viterbi decoding. As a result, the most probable path through the models is found, giving the points in time for every transition from one model to the following one. The Viterbi algorithm is an efficient algorithm for finding the state sequence that most likely produced the observations. Let \phi_j(t) represent the maximum likelihood of observing the vectors y_1 to y_t and being in state j at time t. This partial likelihood can be computed using the following recursion:

\phi_j(t) = \max_i \{ \phi_i(t-1) \, a_{ij} \} \, b_j(y_t)

where \phi_1(1) = 1 and \phi_j(1) = a_{1j} b_j(y_1) for 1 < j \le N. The maximum likelihood P^*(Y|M) is then given by

\phi_N(T) = \max_i \{ \phi_i(T) \, a_{iN} \}

By keeping track of the state i giving the maximum value in the above recursion formula, it is possible, at the end of the input sequence, to retrieve the states visited by the best path, thus obtaining the time alignment of input frames with model states.

It is possible to modify the algorithm to work on-line. To do so, the backtracking is adapted to determine the best path at each frame iteration instead of waiting until the end of the utterance. This adjustment implies a loss of robustness; some hints on how to overcome it can be found in [7].

An implementation of explicit note duration modeling obtained by modifying the Viterbi algorithm [6] allows us to include the note duration information. The proposed modified Viterbi algorithm keeps track of the duration D(t) of each note n at time t and introduces a duration penalty P for making a transition from state i at time t to state j at time t+1, given by:

P = 0, if \Delta D(t) < 0 and i = j
P = l(\Delta D(t+1)) - l(\Delta D(t)), if \Delta D(t) \ge 0 and i = j
P = l(\Delta D(t)), if \Delta D(t) < 0 and i \ne j
P = l(0), if \Delta D(t) \ge 0 and i \ne j

where \Delta D(t) = D(t) - D_n is the difference between the duration of the model and the note duration D_n specified in the score, l(u) = \log p(u), and p(\cdot) is the probability density of \Delta D, modeled by Gaussian mixture densities.
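As an illustration of the decoding stage, the following sketch runs a log-domain Viterbi pass over concatenated model states and backtracks to recover the frame-to-state alignment. It is a generic implementation of the recursion above, without the duration penalty or the on-line modification; the toy transition matrix and per-frame output probabilities are invented.

```python
import numpy as np

def viterbi(log_a, log_b):
    """Generic log-domain Viterbi decoding.

    log_a: (N, N) log transition probabilities between concatenated model states.
    log_b: (T, N) log output probabilities of each frame's feature vector per state.
    Returns the best state sequence, i.e. the time alignment of frames to states.
    """
    T, N = log_b.shape
    phi = np.full((T, N), -np.inf)          # phi[t, j]: best partial log-likelihood
    back = np.zeros((T, N), dtype=int)      # backpointers for path recovery
    phi[0] = log_b[0] + log_a[0]            # entry from the first state
    phi[0, 0] = log_b[0, 0]
    for t in range(1, T):
        scores = phi[t - 1][:, None] + log_a        # phi_i(t-1) + log a_ij
        back[t] = np.argmax(scores, axis=0)
        phi[t] = np.max(scores, axis=0) + log_b[t]  # ... + log b_j(y_t)
    # Backtrack from the best final state to obtain the alignment.
    path = [int(np.argmax(phi[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 3 concatenated states, 6 frames.
log_a = np.log(np.array([[0.6, 0.4, 0.0],
                         [0.0, 0.7, 0.3],
                         [0.0, 0.0, 1.0]]) + 1e-12)
log_b = np.log(np.array([[0.8, 0.1, 0.1],
                         [0.7, 0.2, 0.1],
                         [0.2, 0.7, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.1, 0.3, 0.6],
                         [0.1, 0.1, 0.8]]))
print(viterbi(log_a, log_b))   # -> [0, 0, 1, 1, 2, 2]
```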
7. Conclusions

The instrument we have mainly worked with is the singing voice. This choice is due not only to the fact that, of all instruments, the largest training database available to us is for the singing voice, but also to our belief that this instrument is the most critical one: once the score-matching problem is solved for the singing voice, it will be solved for any other harmonic instrument. There is still experimentation to be done to tune the different parameters, but the results are promising. Improvements can be achieved by considering context: we can train different note models depending on the notes that precede and follow them. In the case of the singing voice, using the lyrics [7] can add robustness to the alignment. This kind of information can also be used to improve accuracy.

References

[1] R. Dannenberg. "An On-line Algorithm for Real-Time Accompaniment". Proceedings of the ICMC, 1984.
[2] B. Vercoe. "The Synthetic Performer in the Context of Live Musical Performance". Proceedings of the ICMC, 1984.
[3] P. Desain, H. Honing and H. Heijink. "Robust Score-Performance Matching: Taking Advantage of Structural Information". Proceedings of the ICMC, 1997.
[4] M. Puckette. "Score Following Using the Sung Voice". Proceedings of the ICMC, 1995.
[5] P. Cano. "Fundamental Frequency Estimation in the SMS Analysis". DAFX Proceedings, 1998.
[6] L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall, 1993.
[7] A. Loscos, P. Cano, J. Bonada. "Low-Delay Singing Voice Alignment to Text". Proceedings of the ICMC, 1999.