SONG IDENTIFICATION THROUGH HMM-BASED MODELING OF THE MAIN MELODY

Nicola Orio and Cristiano Zen
University of Padova, Department of Information Engineering

ABSTRACT

This paper describes a methodology for the identification of pop and rock songs based on the statistical modeling of the leading voice. The identification is based on the use of hidden Markov models (HMMs), which are automatically built from digital music scores. States of the HMMs are labeled by the notes of the leading voice, and the transition and observation probabilities are directly computed from the information in the score. The methodology has been experimentally evaluated on a collection of pop and rock songs, with encouraging results.

1. INTRODUCTION

Automatic music identification has different applications that range from digital rights management (DRM) to automatic metadata extraction and music retrieval. In the particular case of pop and rock music, the most relevant feature of a song is its main melody, because a song may be performed with a different accompaniment, a different orchestration, at a different tempo, and even with chord substitutions, and nevertheless be perceived as that song.

A common approach to music identification is to extract, directly from a recording in digital format, its audio fingerprint, which is a unique set of features that allows for the identification of digital copies even in the presence of noise, distortion, and compression. It can be seen as a content-based signature that summarizes an audio recording. A comprehensive tutorial about audio fingerprinting techniques and applications can be found in [1]. Audio fingerprinting systems normally identify each particular recording of a given song, assuming that users are interested in a specific recording. There are cases, however, where the identification of a song may also be carried out without linking the process to a particular performance. For instance, the identification of broadcast live concerts may not benefit from the fingerprints of other performances, because most of the acoustic parameters may be completely different. Moreover, users may be interested in discovering whether there are covers of a song they like.

Approaches to the identification of songs usually exploit information on the harmonic content. The basic assumption is that each song has a chord progression that acts as its unique signature. For instance, in [5] the identification is carried out using the chroma representation of the content, and performances are recognized using the dynamic time warping algorithm. Chroma representation is also used in [3]. Another approach to song identification, reported in [2], is based on the automatic computation of the chords of a song, and the identification is based on the relative frequency of each chord within a song. Other approaches to music identification have been applied to Western art music. For instance, in [6] the identification is based on a chroma representation of recordings of orchestral works, where each sequence of features is warped (compressed or expanded) by a set of given percentages in order to cope with different tempi. The work reported in [7] is based on hidden Markov models (HMMs): models are built from a set of polyphonic scores, and identification is carried out by computing the probability that each model generated the performance.

This paper extends the methodology reported in [7] to the automatic identification of songs taken from the pop and rock repertoires.
Also in this case, identification is based on HMMs and is carried out independently of the particular performance, by comparing the audio recording directly with a simplified version of the score. The main assumption behind this work is that, at least for pop and rock music, the main melody of a polyphonic recording can be used as the only information to carry out the identification. To this end, the accompaniment is considered as additional noise, and the goal is to provide a description of the main melody that is robust to that noise. The approach can be considered as a generalization of audio fingerprinting, because the relevant features used for identification are not linked to a particular performance of a music work. The application scenario is the automatic labeling of songs through a match with pre-labeled songs that are already part of a music collection, aimed both at the development of DRM systems for live music and at the retrieval of relevant metadata to be provided to final users.

2. SONG MODELING

The idea underlying the proposed approach is that a performance can be considered as the realization of a process that converts a music score into a sequence of audio features. In particular, the only information available is the main melody, as sung by the vocalist. The process is clearly non-deterministic, because different performances correspond to the same score, depending on a number of parameters which are not known: harmony, orchestration, playing techniques, room reverberation, and so on.

Nevertheless, given the score of a melody, it is possible to make some simple assumptions about the sequence of acoustic parameters corresponding to its performances and to model them as a stochastic process. In particular, HMMs have proved to be a powerful statistical tool for modeling sequences in different application domains [8].

HMMs are stochastic finite-state automata in which, at each transition, the new state emits a random vector with a given probability density function. An HMM λ is completely defined by: (1) a set of N states Q = {q_1, ..., q_N}, in which the initial and final states are identified; (2) a probability distribution for state transitions, that is, the probability of going from state q_i to state q_j in a single step; (3) a probability distribution for observations, that is, the probability of observing a given feature r when in state q_j. The process that generates the audio parameters extracted from a performance can be modeled with an HMM provided that: states are labeled with the notes of the melody, transitions model the temporal evolution of the score, and observations are related to audio features that help distinguish different notes.

2.1. Melody Modeling

The first step regards the creation of a set of HMM states which are labeled by notes. To this end, polyphonic scores need to be parsed and preprocessed in order to extract the main melody. In the case of polyphonic MIDI files, the task is carried out in two steps: (1) identification of the channels containing the main melody, and (2) extraction of a monophonic score in the case these channels are still polyphonic. The technique for the extraction of the main melody from a polyphonic score is based on the automatic classification of the MIDI channels into a number of groups - melody, solo instrument, accompanying instrument, bass, drums - according to the approach presented in [4], while the extraction of the main melody is carried out according to the technique proposed in [9]. Given that the aim of this paper is to describe the identification process, the details of the melody extraction step are left to the references.

An example of score parsing is shown in Figure 1, which depicts a monophonic score and its representation as a sequence of states of an HMM. As can be seen, each note corresponds to a group of states (two in the figure), where self- and forward-transitions allow the note duration to be modeled. Moreover, different note durations correspond to different self-transition probabilities. In fact, it can be shown that if a note with expected duration d is labeled by a sequence of n states with the same self-transition probability p, the probability that a path remains within these states for exactly d steps is a negative binomial,

    P(d) = \binom{d-1}{n-1} \, p^{d-n} (1-p)^{n}    (1)

from which P(d) can be maximized at the expected event duration by choosing a value of n and computing p accordingly. In the case that each note is labeled with only one state, P(d) becomes a negative exponential, and thus no maximization can be carried out. In this latter case, it has been chosen to set all the transition probabilities to 0.5.

The modeling of the complete score can be carried out by simply connecting two subsequent events with a transition, as shown in the right part of Figure 1. In this case, the only assumption made on the performance is that the vocalist will sing the notes in the same order in which they are written in the score. The overall topology of the HMM is strictly left-to-right, with self-transitions used to model event durations and forward transitions used to model the evolution in time of the score.
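To make the duration model concrete, the following Python sketch builds the transition matrix of such a left-to-right HMM from the expected note durations (in analysis frames). It is an illustration rather than the authors' implementation: the function name, the absorbing final state, and the specific rule for choosing p, namely placing the mode of the negative binomial P(d) of Equation (1) at the expected duration, are our own assumptions.

    import numpy as np

    def melody_transition_matrix(durations, states_per_note=2):
        """Build the left-to-right transition matrix of the melody HMM.

        durations: expected duration of each note, in analysis frames.
        """
        n = states_per_note
        num_states = n * len(durations)
        A = np.zeros((num_states, num_states))
        for i, d in enumerate(durations):
            # Choose p so that the negative binomial P(d) of Equation (1) peaks
            # at the expected duration d; with a single state per note no maximum
            # exists, so p = 0.5 is used, as in the paper.
            p = 0.5 if n == 1 or d <= n else (d - n) / (d - 1)
            for k in range(n):
                s = i * n + k
                A[s, s] = p                   # self-transition: remain on the note
                if s + 1 < num_states:
                    A[s, s + 1] = 1.0 - p     # forward transition: next state
        A[-1, -1] = 1.0                       # make the final state absorbing
        return A

    # Example: three notes expected to last 20, 10, and 40 frames.
    A = melody_transition_matrix([20, 10, 40], states_per_note=2)

With two states per note, for example, a note expected to last 20 frames receives a self-transition probability of 18/19, so that P(d) reaches its maximum at d = 20.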
2.2. Observations

The choice of the features to be used as observations is crucial, because they should allow us to identify a melody even in the presence of an accompaniment, which is assumed to be unknown. Moreover, the features need to be robust to distortions of the recording, timbre variations of musical instruments, room acoustics, and possible digital compression. Because states are labeled by notes, it is possible to compute a probability density function of the features that are relevant for each note. A simple and general assumption is that the Fourier transform of the observations should have a number of peaks corresponding to the expected harmonics of the note [7]. For each note with pitch f sounding at a given score frame, h harmonic peaks at frequencies k·f, 1 ≤ k ≤ h, are generated. The effect of the value of h is presented in Section 4. The peaks take the form of rectangular spectral bands with an equal amplitude of 1 in an otherwise zero spectrum. Each band has a bandwidth of one half-tone to accommodate slight tuning differences and vibrato. Figure 2 exemplifies the approach, showing the FFT of a polyphonic signal with the corresponding set of four spectral bands. As can be seen, part of the signal - i.e., the accompaniment - falls outside the bands, while the main melody is not perfectly in tune. The observations of state n for frame r are then computed as

    o(r, n) = \sum_i F_n(i) \, S_r(i)    (2)

where F_n are the spectral bands for state n and S_r is the Fourier transform of the signal frame r. The probability for state n to emit observation o is modeled with an exponential probability density function,

    P(o \mid n) = \frac{e^{-o/\mu}}{\mu}    (3)

where μ has been experimentally determined.

Another relevant feature is the overall energy of each frame, which allows us to cope with the parts of the score where there is only the accompaniment. The overall energy may vary dramatically depending on the audio quality, the signal-to-noise ratio, and the sound level of the recording. For this reason, the audio level of the recordings is normalized. If one assumes that the leading voice has an energy which is higher than that of the accompaniment, it is likely that the overall energy will change when the voice is not singing, and this information is used to weight the relevance of the features related to the harmonic content.
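The computation of the spectral bands and of the observation of Equation (2) can be sketched as follows. The half-tone band is implemented as plus or minus a quarter tone around each harmonic, the FFT size and sampling rate are those of Section 4, and μ = 0.3 is the value used in the experiments; the function names and the scaling of the spectrum (the paper only states that recording levels are normalized) are our own assumptions.

    import numpy as np

    def band_mask(pitch_hz, h, sr=44100, n_fft=2048):
        """Rectangular bands of one half-tone around the first h harmonics of a note."""
        freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
        mask = np.zeros_like(freqs)
        for k in range(1, h + 1):
            f = k * pitch_hz
            lo, hi = f * 2.0 ** (-1.0 / 24), f * 2.0 ** (1.0 / 24)  # half-tone band
            mask[(freqs >= lo) & (freqs <= hi)] = 1.0
        return mask

    def observation(spectrum_mag, mask):
        """o(r, n) of Equation (2): spectral content falling inside the note's bands.

        spectrum_mag is assumed to be normalized so that o is commensurate with mu.
        """
        return float(np.sum(mask * spectrum_mag))

    def emission_prob(o, mu=0.3):
        """P(o | n) of Equation (3), with mu = 0.3 as in the experiments of Section 4."""
        return np.exp(-o / mu) / mu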

Figure 1. Alternative representations of a simple melody: the music score and the corresponding HMM.

Figure 2. Frequency plot of a polyphonic signal (solid) and the spectral bands (dashed).

3. SONG IDENTIFICATION

Recognition, which in this case becomes an identification problem, is the most common application of HMMs. The identification task can be stated as follows: given an unknown audio recording, described by a sequence of audio features R = {r(1), ..., r(T)}, and a set of competing models λ_i, find the model that most likely generates R. This can be expressed as a maximization problem,

    \lambda^* = \arg\max_i P(R \mid \lambda_i)    (4)

where the conditional probability is computed over all the possible state sequences, or paths, across the HMMs. In fact, the definition does not impose a particular evolution of model λ_i, that is, the path across its N_i states that corresponds to the generation of R. The common approach to solving the identification problem is through the forward probabilities, which allow us to compute P(R | λ_i) using a dynamic programming approach and thus in O(T N_i^2) time, where T is the duration of the audio sequence and N_i is the number of states of λ_i. It has to be noted that, given the HMM topology shown in Figure 1, each state of the HMM may perform only two transitions, a self-transition and a forward transition, reducing the complexity to O(T N_i), thus linear in the number of states N_i. With the current implementation, a complete identification task is also linear in the number of competing models D, with an overall complexity of O(D T N). Future work will regard the reduction of the set of competing models using music information retrieval techniques. For instance, the number of music scores used for the identification task can be reduced by choosing the first J, with J ≪ D, retrieved using an efficient query-by-example paradigm.
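A minimal sketch of this identification step is given below, assuming log-domain arithmetic, a path that starts in the first state, and a final score obtained by summing over all states (a short excerpt need not reach the final state of the model); these choices, like the function names, are ours and are not prescribed by the paper.

    import numpy as np

    def log_forward(self_p, emit_logprob, obs):
        """log P(obs | model) for a left-to-right HMM with only self and forward
        transitions, computed in O(T * N) time.

        self_p: self-transition probability of each state (length N).
        emit_logprob(o, j): log emission probability of observation o in state j.
        obs: sequence of T observations.
        """
        N, T = len(self_p), len(obs)
        log_a = np.full(N, -np.inf)
        log_a[0] = emit_logprob(obs[0], 0)        # the path starts in the first state
        for t in range(1, T):
            prev = log_a.copy()
            for j in range(N):
                stay = prev[j] + np.log(self_p[j])
                move = prev[j - 1] + np.log(1.0 - self_p[j - 1]) if j > 0 else -np.inf
                log_a[j] = np.logaddexp(stay, move) + emit_logprob(obs[t], j)
        return np.logaddexp.reduce(log_a)         # sum over all possible ending states

    def identify(models, obs):
        """models: list of (self_p, emit_logprob) pairs, one HMM per candidate score."""
        return int(np.argmax([log_forward(sp, el, obs) for sp, el in models]))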
4. EXPERIMENTAL RESULTS

A set of experiments has been carried out to evaluate the methodology with real acoustic data from original recordings, taken from the personal collections of the authors. The focus of our approach is the identification of songs where the leading voice, which corresponds to the main melody, is prominent with respect to the accompaniment. This is almost always the case for the pop and rock repertoires, and thus the experimental collection has been created using songs from these genres. The approach can be extended also to other genres, such as opera or folk music, where the main melody is clearly distinguishable over the accompaniment.

The scores used to build the HMMs have been downloaded from the Web. Thus, their quality and their adherence to the original compositions varied, because no control is possible on the editing process. The songs to be recognized were 16 short excerpts, usually the first notes of the main melody, with durations ranging from 9 to 11 seconds. The songs represent a variety of combinations of the leading voice and the accompaniment, including: voice accompanied by a single instrument (e.g., "Let It Be" - The Beatles and "Love Me Tender" - Elvis Presley), by typical rock orchestrations (e.g., the central part of "Stairway to Heaven" - Led Zeppelin and "Crazy Little Thing Called Love" - Queen), and by electronic instruments (e.g., "Another One Bites the Dust" - Queen). The audio files were all mono recordings, with a sampling rate of 44.1 kHz, divided into frames of 2048 samples with an overlap of 1024 samples, giving a new observation every 23.2 ms.

The models used for identification have been computed using 212 scores in MIDI format, including the ones corresponding to the performances to be recognized. The additional scores have been chosen within the same repertoire as the songs to be recognized. All the MIDI files have been preprocessed in order to automatically extract the main melody.
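For completeness, the analysis front end with the frame parameters stated above (2048-sample frames, 1024-sample hop at 44.1 kHz, i.e., one observation roughly every 23.2 ms) could look as follows; the Hann window and the use of magnitude spectra are assumptions, as the paper does not specify them.

    import numpy as np

    def frame_spectra(signal, n_fft=2048, hop=1024):
        """Split a mono signal into overlapping frames and return their magnitude spectra,
        one row per observation."""
        window = np.hanning(n_fft)
        n_frames = 1 + (len(signal) - n_fft) // hop
        spectra = np.empty((n_frames, n_fft // 2 + 1))
        for t in range(n_frames):
            frame = signal[t * hop : t * hop + n_fft] * window
            spectra[t] = np.abs(np.fft.rfft(frame))
            # The overall frame energy mentioned in Section 2.2 can be obtained
            # from the unwindowed frame as np.sum(frame ** 2), if needed.
        return spectra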

With the aim of investigating the effectiveness of the identification approach alone, the 16 scores corresponding to the audio excerpts have been manually inspected, in order to check that the main melodies were extracted correctly.

Each of the 16 audio excerpts has been considered as an unknown sequence to be identified. Table 1 reports the identification rates at different thresholds, that is, the percentage of cases in which the correct score was ranked as the most similar one (= 1), and in which it was ranked within the first three (≤ 3), five (≤ 5), and ten (≤ 10) scores.

    N   F    = 1     ≤ 3     ≤ 5     ≤ 10    MRR
    1   4    6.25   25.00   56.25    87.50   24.4
    2   4   62.50   81.25   93.75   100.00   73.2
    4   4   62.50   81.25   93.75   100.00   76.0
    2   2   50.00   56.25   75.00    87.50   58.6
    2   4   62.50   81.25   93.75   100.00   73.2
    2   6   37.50   68.75   81.25   100.00   57.9

    Table 1. Effect of the number of states (N) and of the number of spectral bands (F) on the identification rate and on the Mean Reciprocal Rank.

Results are encouraging: the configuration with two states for each note and four spectral bands allowed us to correctly identify 10 out of 16 songs. Another important result is that, with this same configuration, 100% of the songs have been ranked within the first 10 scores, meaning that the approach can be applied to the supervised labeling of unknown songs. Another measure reported in Table 1, which is very common in information retrieval, is the Mean Reciprocal Rank (MRR), which takes into account the rank at which the correct document is retrieved, averaged over all queries. MRR showed that the best results are given by four states and four bands, with an MRR of 76.0%, a slight increase over the 73.2% obtained with only two states.

The results reported in Table 1 show that modeling each note with a single state gives poor performance, meaning that the modeling of note durations plays an important role. The increase from two to four states does not significantly improve performance, while having a noticeable effect on the computational complexity, as mentioned in Section 3. As regards the number of spectral bands used to model the main melody, four bands seem to be a good compromise between the number of relevant harmonics of the main voice and the amount of noise introduced by the fact that the background accompaniment is likely to have a similar spectral content.

All the results have been obtained by manually setting μ = 0.3 in Equation (3). Even if the theory of HMMs allows us to train the emission probabilities, we chose to set this value experimentally, through a set of tests, to prevent the parameter overfitting that is normally associated with a reduced set of examples. Future work will regard the application of training to a larger dataset.

5. CONCLUSIONS

A methodology for the automatic identification of songs has been presented, based on the modeling of the main melody with an HMM. Tests with a small collection of songs gave an identification rate of 62.5%, which is an encouraging result considering that identification has been carried out without taking into account information about the accompaniment or the chord progression. We are in the process of extending the test collection, including a larger sample of the pop and rock repertoires.

6. REFERENCES

[1] P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A review of audio fingerprinting. Journal of VLSI Signal Processing, 41:271-284, 2005.
[2] E. Gómez and P. Herrera. The song remains the same: Identifying versions of the same piece using tonal descriptors. In Proc. of the International Conference on Music Information Retrieval, pages 180-185, 2006.

[3] D.P.W. Ellis and G.E. Poliner. Identifying 'cover songs' with chroma features and dynamic programming beat tracking. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages IV-1429-IV-1432, 2007.

[4] Y. Hijikata, K. Iwahama, K. Takegawa, and S. Nishida. Content-based music filtering system with editable user profile. In Proc. of the ACM Symposium on Applied Computing, pages 1050-1057, 2006.

[5] N. Hu, R.B. Dannenberg, and G. Tzanetakis. Polyphonic audio matching and alignment for music retrieval. In Proc. of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 185-188, 2003.

[6] M. Müller, F. Kurth, and M. Clausen. Audio matching via chroma-based statistical features. In Proc. of the International Conference on Music Information Retrieval, pages 288-295, 2005.

[7] N. Orio. Automatic recognition of audio recordings. In Proc. of the Italian Research Conference on Digital Library Management Systems, pages 15-20, 2006.

[8] L.R. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs, NJ, 1993.

[9] A. Uitdenbogerd and J. Zobel. Manipulation of music for melody matching. In Proc. of the ACM Conference on Multimedia, pages 235-240, 1998.