# Polyphonic Pitch Identification and Bayesian Inference

Skip other details (including permanent urls, DOI, citation information)This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License. Please contact mpub-help@umich.edu to use this work in a way not covered by the license. :

For more information, read Michigan Publishing's access and usage policy.

Page 00000001 Polyphonic Pitch Identification and Bayesian Inference Ali Taylan Cemgil IAS, University of Amsterdam, The Netherlands email: cemgil @science.uva.nl Abstract In this article, we review the essential ideas of Bayesian polyphonic pitch identification with simple examples. We describe a signal model and a simple greedy algorithm to infer the pitch classes in a sound mixture with unknown number of notes. An important advantage of our approach is that it provides a clear framework in which both high level prior information on music structure can be coupled with low level acoustic-physical knowledge in a principled manner to perform the analysis. 1 Introduction One of the hard problems in musical scene analysis is polyphonic pitch tracking. that is, the identification of individual pitch classes from a (usually additive) mixture. This problem in various forms pops up in a vast majority of computer music applications, ranging from interactive music performance systems, music information retrieval (Music-IR) to musicological analysis of performances. In its most unconstrained form, i.e., when operating on an arbitrary polyphonic acoustical input possibly containing an unknown number of different instruments, pitch tracking remains still a challenge. Pitch identification in polyphony has attracted significant research effort in the past - see (Scheirer 2000) for a detailed review of early work. In speech processing, the related task of tracking the pitch of a single speaker is a fundamental problem and methods proposed in the literature are well studied(Hess 1983). However, most current pitch detection algorithms are based largely on heuristics (e.g., picking high energy peaks of a spectrogram, correlogram, auditory filter bank, etc.) and their formulation usually lacks an explicit objective function or signal model. It is often difficult to theoretically justify the merits and shortcomings of such algorithms, and compare them objectively to alternatives or extend them to polyphonic scenarios. Pitch tracking is inherently related to the detection and estimation of sinusoidals. The estimation and tracking of single or multiple sinusoidals is a fundamental problem in many branches of applied sciences, so it is less surprising that the topic has also been deeply investigated in statistics, (e.g. see (Quinn and Hannan 2001)). However, ideas from statistics seem to be not widely applied in the context of musical sound analysis, with only a few exceptions (Irizarry 2002; Saul, Lee, Isbell, and LeCun 2002; Parra and Jain 2001). Sterian (Sterian) described a system that viewed transcription as a model driven segmentation of a time-frequency image. Walmsley (Walmsley) treats transcription and source separation in a full Bayesian framework. He employs a frame based generalized linear model (a sinusoidal model) and proposes inference by reversible-jump Markov Chain Monte Carlo (MCMC) algorithm. The main advantage of the model is that it makes no strong assumptions about the signal generation mechanism, and views the number of sources as well as the number of harmonics as unknown model parameters. Davy and Godsill (Davy and Godsill) address some of the shortcomings of his model and allow changing amplitudes and frequency deviations. The reported results are encouraging, although the method is computationally very expensive. In this article, we describe the essential ideas of Bayesian polyphonic pitch identification with simple examples. For a more detailed and extended description of our approach, the interested reader is referred to (Cemgil, Kappen, and Barber 2004). 2 A Toy Example Consider a "one sample source identification" problem. Suppose there are two numbers (=sources) si and s2 which we wish to identify, but we only observe their sum (=superposition, mixture) y. Given only this much of information, of course this problem does not have a unique solution since there are infinitely many number pairs with s + S2 = y. Suppose now we are given additional information that the sources can be independently present or absent in the mixture. Say if the j'th source (j = 1, 2) is present in the mixture, it has a Gaussian distribution with mean pj and variance Ps, which we denote by AN(pj, Ps). If source j is absent, say it will be zero mean which we denote by Ar(0, Pm). To get a compact notation, let us denote the condition that source is present or absent with a discrete binary variable rj E {"sound", "mute"} r p(rj) j =1,2 sj rj [rj = sound]NA(pj, Ps) + [rj = mute]AN(0, Pm) y S= Sj Here, and elsewhere in the article, the notation [x = text] has value equal to 1 when variable x is in state text, and is zero otherwise. Now, given the model we can infer the Proceedings ICMC 2004

Page 00000002 Si M (a) Left: A toy model for source separation and identification. Square and oval shaped nodes denote discrete and continuous stochastic variables respectively. Diamond-shaped node is observed. Right: Equivalent Model using "plates" where M = 2. The sources sj correspond to individual note events. The indicator variables rj denote whether a particular sound is present in the observed mixture y. Of course, in reality y and each of rj and sj will be time series. Provided we know how the individual sources are generated, source separation and source identification (= transcription) correspond to inference of posterior quantities p(sj y) and p(rj y) respectively. In the following section, we will define a signal model p(y s)p(s r). 3 Model It is well known that musical instruments tend to create oscillations with modes that are roughly related by integer ratios, albeit with strong damping effects and transient attack characteristics (Fletcher and Rossing 1998; Serra and Smith 1991; Rodet 1998). It is common to model such signals as the sum of a periodic component and a transient non-periodic component. The sinusoidal model (McAulay and Quatieri 1986) is often a good approximation that provides a compact representation for the periodic component. Our signal model is also in the same spirit, but we will define it in state space form, because this provides a natural way to derive online algorithms. First we consider how to generate a damped sinusoid yt through time, with angular frequency wJ. Consider a Gaussian process where typical realizations Y1:T are damped "noisy" sinusoidals with angular frequency uw: 00:0 0: 0.4 -0 3 5 8 / 7 1 / (b) The conditional p(ylrl, r2). The "mute" and "sound" states are denoted by o and * respectively. Here, /i = 3, A2 = 5 and Pm < Ps. The bottom figure shows the winning configuration as a function of y, i.e. arg maxr1,r2 p(rl, r2 y). We assume a flat prior, p(rj = "mute") = p(rj = "sound") = 0.5. posterior distribution over the ri and r2 using the Bayes rule, which is P(r1, Tr2 = Y - p(y r1, r2)p(r1)p(r2) Here, P(y rT,r2) = J ds2ds2p(y S1,S2)P(S\r}1)p(S2 r2) Conditioned on r, this quantity can be found analytically. For example, when rT = T2 = "sound", p(y rT, r2) JV/(P + P2, 2Ps). For any given y, we can also calculate the most likely configuration by (rl, I2)* = argmax p(ri, r2 y) r1,r2 The model and a numeric example is shown in Figure 2. Although quite simple, this toy example exhibits some of the basic ideas in our polyphonic pitch tracking model. st ~ V(ptB(w)st-j,Q) yt J V(Cst,R) so ~ Ar(0o,S) (1) (2) (3) We use AN/(, E) to denote a multivariate Gaussian distribution with mean p and covariance E. Here B(w) = ( i)-j cis(w) ) is a Givens rotation matrix that rotates two dimensional vector st by uw degrees counterclockwise. C is a projection matrix defined as C = [1, 0]. The phase and amplitude characteristics of yt are determined by the initial condition so drawn from a prior with covariance S. The damping factor 0 < pt < 1 specifies the rate at which st contracts to 0. See Figure 1 for an example. The transition noise variance Q is used to model deviations from an entirely deterministic linear model. The observation noise variance R models background noise. In reality, musical instruments (with a definite pitch) have several modes of oscillation that are roughly located at integer multiples of the fundamental frequency w. We can model such signals by a bank of oscillators giving a block diagonal transition matrix At = A(w, Pt) defined as Pt(1B(w) 0 0 0... O p0 B(2w) Sp(H) B(Hw) (4) where H denotes the number of harmonics, assumed to be known. A(w, pt) is the transition matrix at time t and en Proceedings ICMC 2004

Page 00000003 Figure 1: A damped oscillator in state space form. Left: At each time step, the state vector s rotates by uw and its length becomes shorter. Right: The actual waveform is a one dimensional projection from the two dimensional state vector. The stochastic model assumes that there are two independent additive noise components that corrupt the state vector s and the sample y, so the resulting waveform yl:T is a damped sinusoid with both phase and amplitude noise. codes the physical properties of the sound generator as a first order Markov Process. The state of the sound generator is represented by st, a 2H dimensional vector that is obtained by concatenation of all the oscillator states in (1). 3.1 From Piano-Roll to Audio signal A piano-roll is a collection of indicator variables rj,t, where j = 1... M runs over sound generators (i.e. notes or "keys" of a piano) and t =1... T runs over time. Each sound generator has a unique fundamental frequency wy associated with it. For example, we can choose wj such that we cover all notes of the tempered chromatic scale in a certain frequency range. This choice is arbitrary and for a finer pitch analysis a denser grid with smaller intervals between adjacent notes can be used. Each indicator is binary, with values "sound" or "mute". The essential idea is that, if previously muted, rj,t_- = "mute" an onset for the sound generator j occurs if rj, = "sound". The generator continues to sound (with a characteristic damping decay) until it is again set to "mute", when the generated signal decays to zero amplitude (much) faster. The piano-roll, being a collection of indicators rl:M,I:T, can be viewed as a binary sequence, e.g. see Figure 2. Each row of the piano-roll rj,i:T controls onsets and offsets of the underlying sound generator by changing the damping coefficient. We characterise the damping factor for each note j = 1,..., M by two decay coefficients psound and pmute such that 1 > psound > Pmute > 0. The piano-roll rj,i:T controls the signal process by changing the damping coefficient of the transition matrix. We denote the transition matrix as Amte - A(cwj, Pmute); similarly for A.ound 3.1.1 Piano-Roll: Onsets At each new onset, i.e. when (rj,t-_ = mute) - (rj,t sound), the old state st-1 is "forgotten" and a new state vector is drawn from a Gaussian prior distribution #A(0, S). This models the energy injected into a sound generator at an onset (this happens, for example, when a guitar string is plucked). The amount of energy injected is proportional to the determinant of S and the covariance structure of S describes how this total energy is distributed among the harmonics. The covariance matrix S thus cap -------------------------------------- E------- ---------------------------------------------------------------- -------- ------------------ ---------------------------------------------- I..---------.----------------------------------------------------------- - - - -, - I-- - Figure 2: Piano-roll. The vertical axis corresponds to the sound generator index j and the horizontal axis corresponds to time index t. Black and white pixels correspond to "sound" and "mute" respectively. The piano-roll can be viewed as a binary sequence that controls an underlying signal process. Each row of the piano-roll rTj,lT controls a sound generator. As in a piano, the fundamental frequency is a function of the generator index j. The actual observed signal yl:T is a superposition of the outputs of all generators. tures some of the timbre characteristics of the sound. The transition and observation equations are given by isonsetj,t = (rj,t-1 = mute A rj,t = sound) Aj, = [rj,t = mute]Ante+ [rj,t = sound]A und Sj,t ' [ isonsetj,,t]j(Aj,tst_, Q) +[isonsetj,t]NA(0, S) yj, t N(Csj,tR) In the above, C is a 1 x 2H projection matrix C = [1, 0, 1, 0,...,1, 0] with zero entries on the even components. Hence yj,t has a mean being the sum of the damped harmonic oscillators. R models the variance of the noise in the output of each sound generator. Finally, the observed audio signal is the superposition of the outputs of all sound generators, Yt = Y yj,t (5) This generative model can be described qualitatively by the graphical model in Figure 3. The important message is that this model has conceptually the same structure as the toy model described in section 2. 4 Simulation Results To infer the most likely piano-roll we need to compute arg maxrl:m, T p(ri:M,1:T|Yl:T) as in the toy example. Although the exact calculation of this quantity is intractable, we are able to derive a simple approximate inference algorithm which we describe in a chord identification problem. Proceedings ICMC 2004

Page 00000004 4.1 Vertical Problem: Chord identification Chord identification is the simplest polyphonic transcription task. Here we assume that a given audio signal Y1:T is generated by a piano-roll constant piano roll where there are no onsets or offsets for t > 1. The task is to find the MAP configuration T*:m argmaxp(yl:T, ri:M) Figure 3: Graphical Model. The rectang "plates", M replications of the nodes insic j = 1,..., M represents the sound genera ables through time. VI I!1 Q{ Each configuration corresponds to a chord. The two extreme cases are "silence" and "cacophony" that correspond to configurations with all rj mute or sound respectively. le box denotes The size of the search space in this case 2", which is proJe. Each plate, hibitive for direct computation..tor (note) vari- A simple approximation is based on greedy search: we start iterative improvement from an initial configuration (silence, or randomly drawn from the prior). At each iteration i, we evaluate the probability of all neighbouring configurations of that can be reached by adding or removing single notes. If all neighbours have smaller probability, the algorithm terminates, having found a local maximum. Otherwise, we pick the neighbour with the highest probability and iterate until convergence. We illustrate the algorithm on a signal sampled from the generative model, see Figure 4. We observe that for many examples this procedure is able to identify the correct chord. Using multiple restarts from different initial configurations will improve m og p(Yi:T,rl:M) the quality of the solution at the expense of computational -1220638254 cost. -665073975 * -311983860 * -162334351 4.2 Piano-roll estimation * -43419569 * -1633593 The piano-roll estimation problem can be viewed as an * -14336 * 5766 extension of chord identification in that we also detect on-. -5210 sets and offsets for each note within the analysis frame. A * -4664 -.---- - -- - _-__1 - 1 iteration r, 1 o 2 0 3 o 4 o 5 o 6 o 7 0 8 o 9 10 o True o 0 0 0 0 0 0 * 0 0 0 0 0 * 0 0 0 * 0 0 0 * 0 0 0 * 0 0 0 0 0 0 l acitcarp approach is to analyze the signal in sufficiently aoao * -4664 Short time windows and assume that for each note, at most one change-point can occur within the window. The results gure 5. Figure 4: We have first drawn a random piano-roll configuration (a random chord) r1:M. Given ri:M, we generate a signal of length 400 samples with a sampling frequency F, = 4000 from p(y:T nrl:M). We assume 24 notes (2 octaves). The synthesized signal from the generative model and its discrete time Fourier transform modulus are shown above. The true chord configuration and the associated log probability is at the bottom of the table. For the iterative algorithm, the initial configuration in this example was silence. At this point we compute the probability for each single note configurations (all one flip neighbours of silence). The first note that is added is actually not present in the chord. Until iteration 9, all iterations add extra notes. Iteration 9 and 10 turn out to be removing the extra notes and iterations converge to the true chord. The intermediate configurations visited by the algorithm are shown in the table below. Here, sound and mute states are represented by *'s and o's. 5 Discussion and Conclusion We have presented a model driven approach where transcription is viewed as a Bayesian inference problem. In this respect, at least, our approach parallels the previous work of (Walmsley; Davy and Godsill; Raphael). We believe, however, that our formulation, based on a switching state space model, has several advantages. We can remove the assumption of a frame based model and this enables us to analyse music online and to sample precision. Practical approximations to an eventually intractable exact posterior can be carried out frame-by-frame, such as by using a fixed time-lag smoother. This, however, is merely a computational issue (albeit an important one). We may also discard samples to reduce computational burden, and account for this correctly in our model. Proceedings ICMC 2004

Page 00000005 0 500 1000 1500 2000 2500 Figure 5: Polyphonic transcription of a short segment from a recording of a bass guitar. (Top) The signal, original sampling rate of 22050 Hz is downsampled with a factor of D = 5. (Middle) Spectrogram (Short time Fourier transform modulus) of the downsampled signal. Horizontal and vertical axes correspond to time and frequency, respectively. Grey level denotes the energy in a logarithmic scale. The low frequency notes are not well resolved due to short window length. Taking a longer analysis window would increase the frequency resolution but smear out onsets and offsets. (Bottom) Estimated piano-roll. The model used M = 30 sound generators where fundamental frequencies were placed on a chromatic scale that spanned the 2.5 octave interval between the low A (second open string on a bass) and a high D (highest note on the forth string). Model parameters are estimated by a expectationmaximization algorithm recorded from the same instrument. The analysis is carried out using a window length of W = 450 samples, without overlap between analysis frames (i.e. L = W). The greedy procedure was able to identify the correct pitch classes and their onsets to sample precision. For this example, the results were qualitatively similar for different window lengths W around 300 - 500 and downsampling factors D up to 8. An additional advantage of our formulation is that we can still deliver a pitch estimate even when the fundamental and lower harmonics of the frequency band are missing. This is related to so called virtual pitch perception (Terhardt 1974): we tend to associate notes with a pitch class depending on the relationship between harmonics rather than the frequency of the fundamental component itself. One of the advantages of our generative model based approach is that we can in principle infer a chord given any subset of data. For example, we can simply downsample Y1:T (without any preprocessing) by an integer factor of D and view the discarded samples simply as missing values. Of course, when D is large, i.e. when we throw away many samples, due to aliasing, higher harmonics will overlap with harmonics in the lower frequency band which will cause a more diffuse posterior on the piano-roll, eventually degrading performance. Although our approach has many desirable features (automatically deducing number of correct notes, high temporal resolution e.t.c.), one of the main disadvantage of our method is computational cost associated with updating large covariance matrices in Kalman filtering. It would be very desirable to investigate approximation schemas that employ fast transformations such as the FFT to accelerate computations. When transcribing music, human experts rely heavily on prior knowledge about the musical structure - harmony, tempo or expression. Such structure can be captured by training probabilistic generative models on a corpus of compositions and performances by collecting statistics over selected features (e.g. (Raphael and Stoddard 2003)). One of the important advantages of our approach is that such prior knowledge about the musical structure can be formulated as an informative prior on a piano-roll; thus can be integrated in signal analysis in a consistent manner. We believe that investigation of this direction is important in designing robust and practical music transcription systems. We have not yet tested our model for more general scenarios, such as music fragments containing percussive instruments or bell sounds with inharmonic spectra. Our simple periodic signal model would be clearly inadequate for such a scenario. On the other hand, we stress the fact that the framework presented here is not only limited to the analysis of signals with harmonic spectra, and in principle applicable to any family of signals that can be represented by a switching state space model. This is already a large class since many real-world acoustic processes can be approximated well with piecewise linear regimes. We can also formulate a joint estimation schema for unknown parameters and integrate them out (e.g. see (Davy and Godsill)). However, this is currently a hard and computationally expensive task. If efficient and accurate approximate integration methods can be developed, our model will be applicable to mixtures of many different types of acoustical signals and may be useful in more general auditory scene analysis problems. Proceedings ICMC 2004

Page 00000006 References Cemgil, A. T., H. J. Kappen, and D. Barber (2004). A generative model for music transcription. Submitted to IEEE Transactions on Speech and Audio Processing. Davy, M. and S. J. Godsill (2003). Bayesian harmonic models for musical signal analysis. In Bayesian Statistics 7. Fletcher, N. H. and T. Rossing (1998). The Physics of Musical Instruments. Springer. Hess, W. J. (1983). Pitch Determination of Speech Signal. New York: Springer. Irizarry, R. A. (2002). Weighted estimation of harmonic components in a musical sound signal. Journal of Time Series Analysis 23. McAulay, R. J. and T. E Quatieri (1986). Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal Processing 34(4), 744-754. Parra, L. and U. Jain (2001). Approximate Kalman filtering for the harmonic plus noise model. In Proc. of EEE WASPAA, New Paltz. Quinn, B. G. and E. J. Hannan (2001). The Estimation and Tracking of Frequency. Cambridge University Press. Raphael, C. (2002). Automatic transcription of piano music. In Proc. ISMIR. Raphael, C. and J. Stoddard (2003). Harmonic analysis with probabilistic graphical models. In Proc. ISMIR. Rodet, X. (1998). Musical sound signals analysis/synthesis: Sinusoidal + residual and elementary waveform models. Applied Signal Processing. Saul, K. L., D. D. Lee, C. L. Isbell, and Y. LeCun (2002). Real time voice processing with audiovisual feedback: toward autonomous agents with perfect pitch. In Neural Information Processing Systems, NIPS*2002, Vancouver. Scheirer, E. D. (2000). Music-Listening Systems. Ph. D. thesis, Massachusetts Institute of Technology. Serra, X. and J. 0. Smith (1991). Spectral modeling synthesis: A sound analysis/synthesis system based on deterministic plus stochastic decomposition. Computer Music Journal 14(4), 12-24. Sterian, A. (1999). Model-Based Segmentation of TimeFrequency Images for Musical Transcription. Ph. D. thesis, University of Michigan, Ann Arbor. Terhardt, E. (1974). Pitch, consonance and harmony. Journal of the Acoustical Society of America 55(5), 1061-1069. Walmsley, P. J. (2000). Signal Separation of Musical Instruments. Ph. D. thesis, University of Cambridge. Proceedings ICMC 2004