Page  00000001 On tempo tracking: Tempogram Representation and Kalman filtering Ali Taylan Cemgil*; Bert Kappen*; Peter Desaint; Henkjan Honingt *SNN, Dept. of Medical Physics and Biophysics, University of Nijmegen, The Netherlands tMusic, Mind and Machine, University of Nijmegen,The Netherlands email: {taylan, bert} {desain, honing} Abstract We formulate tempo tracking in a Bayesian framework where a tempo tracker is modeled as a stochastic dynamical system. The tempo is modeled as a hidden state variable of the system and is estimated from a MIDI performance by Kalman filtering and smoothing. We also introduce the Tempogram representation, a wavelet-like multiscale expansion of a real performance, on which the Kalman filter operates. 1 Introduction An important and interesting subtask in automatic music transcription is tempo tracking: how to follow the tempo in a performance that contains expressive timing and tempo variations. When these tempo fluctuations are correctly identified it becomes much easier to separate the continuous expressive timing from the discrete note categories (i.e. quantization). The sense of tempo seems to be carried by the beats and thus tempo tracking is related to the study of beat induction, the perception of beats or pulse while listening to music (see Desain and Honing (1994)). However, it is still unclear what precisely constitutes tempo and how it relates to the perception of rhythmical structure. There is a significant body of research on the psychological and computational modeling aspects of tempo tracking. Early work by Michon (1967) describes a systematic study on the modeling of human behavior in tracking tempo fluctuations in artificially constructed stimuli. Longuet-Higgins (1976) proposes a musical parser that produces a metrical interpretation of performed music while tracking tempo changes. Knowledge about meter helps the tempo tracker to quantize a performance. Desain and Honing (1991) describe a connectionist model of quantization. Here as well, a tempo tracker helps to arrive at a correct rhythmical interpretation of a performance. Both models, however, have not been systematically tested. Still, quantizers can play a important role in addressing the difficult problem of what is a correct tempo interpretation by defining it as the one which results in a simpler quantization. Large and Jones (1999) describe an empirical study on tempo tracking, interpreting the observed human behavior in terms of an oscillator model. Another class of models makes use of prior knowledge in the form of an annotated score (Dannenberg, 1984; Vercoe, 1984). They match the known score to incoming performance data. More recently attempts are made to deal directly with the audio signal (Goto and Muraoka, 1998; Scheirer, 1998) without using any prior knowledge. How ever, these models assume constant tempo (albeit timing fluctuations may be present), so are in fact not tempo trackers but beat trackers. Although successful for music with a steady beat, they report problems with syncopated data. All tempo track models assume an initial tempo (or beat length) to be known to start up the tempo tracking process (e.g., Longuet-Higgins (1976); Large and Jones (1999). There is few research addressing how to arrive at a reasonable first estimate. Longuet-Higgins and Lee (1982) propose a model based on score data, Scheirer (1998) one for audio data. A complete model should incorporate both aspects. In this paper we formulate tempo tracking in a statistical framework where a tempo tracker is modeled as a stochastic dynamical system. The tempo is modeled as a hidden state variable of the system and is estimated by Kalman filtering. 2 Dynamical Systems and the Kalman Filter Mathematically, a dynamical system is characterized by a set of state variables and a set of state transition equations that describe how state variables evolve with time. For example, a perfect metronome can be described as a dynamical system with two state variables: a phase f and a period A. Given the values of state variables at j - 1'th step as j-_1 and Aj1, the next beat occurs at j = fj_1 + Aj_1. The period is constant so Aj = Aj_. By using vector notation and by letting sj = [fj, A]T we write the state transition model as S1 1) sj - 0 1 s) s_1 -- Asj_-l (1) When the initial state so = [TO, Ao]T is given, the system is fully specified. Such a deterministic model is not realistic for natural music performance and can not be used for tracking the

Page  00000002 tempo in presence of tempo fluctuations and expressive timing deviations. Tempo fluctuations may be modeled by introducing a noise term that "corrupts" the state vector -..*.............. tp:......... v~~:.......:~'~:...::.... -.:; ~- -... --v - - - T.................. -........ t sj = Asj_l + (2) A t t where e is a Gaussian random vector with mean 0 and diagonal covariance matrix Q, i.e. c A/'(0, Q). In addition, expressive timing deviations can be modeled by introducing a noise term rj = 73 + v = Csj + v (3) where v ~ '(0, R). Here, Tj is the observed "noisy" beats. In this formulation, tempo tracking corresponds to estimation of sj given observations upto j'th step. We note that we do not observe the (noisy) beat Tj directly but induce it from events in music. This will be the topic of the next section. Equations 2 and 3 define a linear dynamical system, because all noises are assumed to be Gaussian and all relationships between variables are linear. Hence, all state vectors sj have Gaussian distributions. A Gaussian distribution is fully characterized by its mean and covariance matrix and in the context of linear dynamical systems, these quantities can be estimated very efficiently by a Kalman filter(Kalman, 1960). The operation of the filter is illustrated in Figure 1. The basic model can be extended in Figure 2: Tempogram Calculation. The continuous signal x(t) is obtained from the onset list by convolution with a Gaussian function. Below, three different basis functions 4, are shown. All are localized at the same r and different A. The tempogram at (T, A) is calculated by taking the inner product of x(t) and 4(t; T, A). Due to the sparse nature of the basis functions, the inner product operation can be implemented very efficiently. the random walk behavior since they introduce inertia to the system. The linearity constraint on the Kalman filter can also be relaxed. Indeed, in tempo tracking such a extension is necessary to ensure that the period A is always positive. Therefore we define the state transition model in a warped space defined by the mapping w = log2 A. This warping also ensures the perceptually more plausible assumption that tempo changes are relative rather than absolute. For example, under this warping, a deceleration from A -+ 2A has the same likelihood as an acceleration from A -- A/2. 3 Tempogram Representation In this section, we propose a method to extract the noisy estimate Tj from the performance. We demonstrate how a phase 7 and period A can be inferred locally, i.e. from an short segment of an onset list t = [ti]. The Bayesian formulation of this problem is 0:i: (a);0 c )l (d) 000 (e) p(T, A t) oa p(t 7, A)p(7, A) (4) Figure 1: Operation of the Kalman Filter and Smoother. The horizontal axis represents the time and the vertical axis represent the period of the tracker. The system is given by Q = 0.011 and R = 0.02. p-jlj-1 and Pjlj-1 denote the mean (center) and covariance (ellipses) of the hidden state sj given observations Ti... Tj -1. (a) The algorithm starts with the initial state estimate (Pllo, Pilo) at T = 0 and period A = 1. in presence of no evidence, (b) [The beat is observed to be at -r, The state is updated to (Li 11, Pi 1) according to the new evidence. Note that the uncertainty "shrinks", (c) On the basis of current state a new prediction (p211, P2|1) is made, (d) Steps are repeated until all evidence is processed to obtain filtered estimates (pj I j, Pj I), J = 1... N. In this case N = 3. (e) Filtered estimates are updated by backtracking to obtain smoothed estimates (p|i N, Pil N) (Kalman smoothing). several directions. First, the state space can be extended to include additional variables. Additional variables reduce The likelihood term p(t T, A) is interpreted as the probability of the performance given the tempo track. p(T, A) is the prior probability of 7 and A given by the Kalman filter. It is reasonable to assume that the likelihood p(t |, A) is high when onsets [ti] in the performance coincide with the beats of the tempo track. To construct a likelihood distribution having this property we propose a similarity measure between the performance and a local constant tempo track. First we define a a continuous time signal x(t) E= G(t - ti) where we take G(t) exp(-t2/2o ), a Gaussian function with variance r.2 We represent a local tempo track as a pulse train b(t; T, A) = Z am6(t - 7 - mA) where 6(t - to) is a translated Dirac delta function, which represents an impulse located at to. The coefficients am are positive constants such that Zm am= 1 (See Figure 2). If a causal analysis is desired, am can be set to zero

Page  00000003 for m > 0. When am is a sequence decaying to zero exponentially, i.e. am = ao7, one has the infinite impulse response (IIR) comb filters employed by Scheirer (1998). We define the tempogram of x(t) at (T, A) as the inner product " --t ' ' Tg(7T, A) = / dt x(t)4(t; T, A) (5) v........v.......... v..... The tempogram representation can be interpreted as the response of a comb filter bank and is analogous to a multiscale representation (e.g. the wavelet transform), where 7 and A correspond to transition and scaling parameters (Rioul and Vetterli, 1991). In Figure 3 we show a tempogram obtained from a simple onset sequence. We define the likelihood as p(t|T, A) oc exp(Tgx(T, A)). The tempogram gives a local estimate of likely (7, A) values. 4 Evaluation Many tempo trackers described in the introduction are often tested with ad hoc examples. However, to validate tempo tracking models, more systematic data and rigorous testing is necessary. A tempo tracker can be evaluated by systematically modulating the tempo of the data, for instance by applying instantaneous or gradual tempo changes and comparing the models responses to human behavior (Michon, 1967). Another approach is to evaluate tempo trackers on a systematically collected set of natural data, monitoring piano performances in which the use of expressive tempo change is free. This type of data has, next to being ecologically valid, the advantage of reflecting the type of data one expects automated music transcription systems to deal with. The latter approach was adopted in this study. For the experiment six pianists were invited to play arrangements of two Beatles songs, Michelle and Yesterday. Both pieces have a relatively simple rhythmic structure with ample opportunity to add expressiveness by fluctuating the tempo. The subjects consisted of one professional jazz player (PJ), four professional classical performers (PC) and one amateur classical pianist (AC). Each arrangement had to be played in three tempo conditions, three repetitions per tempo condition. The tempo conditions were normal, slow and fast tempo (all in a musically realistic range and all according to the judgement of the performer). We present here the results for these six subjects (6 subjects x 3 tempi x 3 repetitions x 2 pieces - 2 performances = 106 performances). The final data set will contain four pianists for each category (PJ, PC and AC). The performances were recorded on a Yamaha Disklavier Pro MIDI grand piano using Opcode Vision. To be able to derive tempo measurements related to the musical structure (e.g., beat, bar) the performances were matched with the MIDI scores using the structure matcher of Heijink et al. (2000) available in POCO (Honing, 1990). Tempo measurements were extracted for the notes that coincide with the beat (quarter note) level and the bar (whole note). In other words, we extract the (noisy) Tj from the performance guided by the score. Figure 3: A simple rhythm and its Tempogram. x and y axes correspond to 7 and A respectively. The bottom figure shows the onset sequence (triangles). Assuming flat priors on 7 and A, the curve along the A axis is the marginal p(A|t) oc fdrexp(Tgx(r,A)). We note that p(A|t) has peaks at A, which correspond to quarter, eight and sixteenth note level as well as dotted quarter and half note levels of the original notation. This distribution can be used to estimate a reasonable initial state. 4.1 Training the Kalman Filter There are free parameters in the model, namely A, Q, C and R. In principle, all of these parameters can be estimated from data. Here, however, we restrict ourselves to the estimation of A and Q and set C and R to appropriate values. We divided the data set in a training set and a test set. We compute wj = log2 (Tj+ - Tj) from the extracted tempotrack [Tj] and learn a linear dynamics in the w space by an EM algorithm (Ghahramani and Hinton, 1996). To find the appropriate filter order (Dimensionality of s) we trained Kalman filters of orders from 1 to 6. We observed that a filter of order roughly between 1 and 4 is sufficient both in bar and beat levels. In any case, there is no large difference between models of different order. 4.2 Evaluation of tempo tracking performance We evaluated the accuracy of the tempo tracking performance of the complete model with a Kalman filter of order one and a non-causal comb-filter tempogram with ax = 0.04 and ao = 0.4. In the tracking experiments, we have initialized the filter to a reasonable estimate at beat level. For each performance in the data set, we obtain smoothed estimates of the beat #j. We compare 'j to the assumed true tempotrack Tj as follows: for all j we check whether our beat estimate j is contained in the time window Tj ~ 0.075 sec. For each performance we calculate the percentage of correct beats. Figure 4 shows the results for the whole data set. Note that this is a quite "pessimistic" measure; if the tracker misses just one beat in the beginning but otherwise tracks the beat correctly, the correct beat percentage score would still be very low. Many of the poor quality tempo tracks are due to problems of this nature. Naturally, the performance of the tracker depends on the amount of tempo variations introduced by the performer. For example, the tempo tracker fails consistently for subject PC2 who tends to use quite some tempo variation (Table 1). The performance is not very different among tempo

Page  00000004 conditions but somewhat better for normal tempo (Table 2). Condition: fast normal slow % Correct 82.7 88.1 80.5 0 C) 0 10 20 30 40 50 60 70 80 90 10 0 Figure 4: Histogram of correct beat percentage. 80 performances (of a total of 106) are tracked with an accuracy between %90-100. 5 Discussion and Future Research In this paper, we have formulated a tempo tracking model in a Bayesian framework that incorporates a dynamical system and a measurement model. We employed a Kalman filter based dynamical system and a Tempogram based measurement model. In our view, many of the existing methods can be viewed as particular choices of a dynamical model and a measurement model. Bayesian formulation has several advantages: First, uncertainties can be integrated into the system in a natural way and desired quantities can be inferred in a consistent way. Moreover, prior knowledge (such as smoothness constraints in the state transition model and the particular choice of measurement model) are explicit and can be changed when needed. For example, the same state transition model can be used for both audio and MIDI; only the measurement model needs to be elaborated. For MIDI data, the Tempogram can also be replaced by a rhythm quantizer (Cemgil et al., 2000). Another advantage is that, for a large class of related models efficient inference and learning algorithms are well understood (Ghahramani and Hinton, 1996). This is appealing since we can train tempo trackers with different properties automatically from data. Online (filtering) or offline (smoothing) formulation is also possible. Online processing is necessary for real time applications such as automatic accompaniment and offline processing is desirable for transcription applications. The evaluation of the model on a systematicly collected set of natural data shows a high overall correctness. The next step will be an analysis of the local tempo behavior of the model (e.g., to test for its robustness once an error occurred) and characterize it in more qualitative terms (making use of the different musical conditions present in the full data set). Subject: PJ AC PC1 PC2 PC3 PC4 Yesterday 95.8 68.0 92.6 62.7 97.6 83.7 Michelle 96.6 98.5 98.5 44.9 75.5 93.5 Table 1: Correct beat percentage for subjects and pieces. Table 2: Correct beat percentage for tempo conditions. Acknowledgments: This research is supported by the Technology Foundation STW, applied science division of NWO and the technology programme of the Dutch Ministry of Economic Affairs. We would like to thank Ric Ashley and Paul Trilsbeek for their contribution in the design and running of the experiment and we gratefully acknowledge the pianists from Northwestern University for their excellent performances. References Cemgil, A. T., Desain, P., and Kappen, H. Summer 2000. "Rhythm quantization for transcription". Computer Music Journal, 24:2:60-76. Dannenberg, R.B. 1984. "An on-line algorithm for real-time accompaniment". In Proceedings of ICMC, San Francisco. pages 193-198. Desain, P. and Honing, H. 1991. "Quantization of musical time: a connectionist approach". In Todd, P. M. and Loy, D. G., editors, Music and Connectionism., pages 150-167. MIT Press., Cambridge, Mass. Desain, P. and Honing, H. 1994. "A brief introduction to beat induction". In Proceedings of ICMC, San Francisco. Ghahramani, Zoubin and Hinton, Goeffrey E. "Parameter estimation for linear dynamical systems. (crg-tr-96-2)". Technical report, University of Totronto. Dept. of Computer Science., 1996. Goto, M. and Muraoka, Y. 1998. "Music understanding at the beat level: Real-time beat tracking for audio signals". In Rosenthal, David F. and Okuno, Hiroshi G., editors, Computational Auditory Scene Analysis. Heijink, H., Desain, P., and Honing, H. 2000. "Make me a match: An evaluation of different approaches to score-performance matching". Computer Music Journal, 24(1):43-56. Honing, H. 1990. "Poco: An environment for analysing, modifying, and generating expression in music.". In Proceedings of ICMC, San Francisco. pages 364-368. Kalman, R. E. 1960. "A new approach to linear filtering and prediction problems". Transaction of the ASME-Journal of Basic Engineering, pages 35-45. Large, E. W. and Jones, M. R. 1999. "The dynamics of attending: How we track time-varying events". Psychological Review, 106:119-159. Longuet-Higgins, H. C. and Lee, C.S. 1982. "Perception of musical rhythms". Perception. Longuet-Higgins, H.C. 1976. "The perception of melodies". Nature, 263:646-653. Michon, J.A. 1967. "Timing in temporal tracking". In Soesterberg: RVO TNO. Rioul, Oliver and Vetterli, Martin. 1991. "Wavelets and signal processing". IEEE Signal Processing Magazine, October: 14 -38. Scheirer, E. D. 1998. "Tempo and beat analysis of acoustic musical signals". Journal ofAcoustical Society ofAmerica, 103:1: 588-601. Vercoe, B. 1984. "The synthetic performer in the context of live performance". In Proceedings ofICMC, San Francisco. pages 199-200.