THE RHYTHM TRANSFORM: TOWARDS A GENERIC RHYTHM DESCRIPTION

Enric Guaus, Perfecto Herrera
Music Technology Group, Institut Universitari de l'Audiovisual
Universitat Pompeu Fabra
{enric.guaus,perfecto.herrera}

ABSTRACT

In the past few years, automatic genre classification has become one of the most active topics in the Music Information Retrieval (MIR) field. Musical genre is one of the most valuable pieces of metadata for managing huge music databases, and many successful efforts have been made to compute it automatically, essentially from timbral and rhythmic descriptions of the audio. From our point of view, some of these descriptors have been designed under rather rigid constraints, which makes it difficult to generalize the algorithms to more flexible applications. In this paper, we present a method for rhythmic description which, used in combination with timbre descriptors, can improve the efficiency and flexibility of genre classifiers.

1. INTRODUCTION

Musical genre is one of the most valuable pieces of metadata for managing huge music databases: broadcast radio stations and CD stores base their organization on musical genres. As a consequence, the Music Information Retrieval community has developed many different algorithms for automatic genre classification [1][2]. The problem arises when genre has to be defined: neither musicians nor musicologists agree on a common definition. For simplicity, genre can be defined as a class of music with a set of common properties that, in the perception of the average listener, distinguish music in that category from other songs [3]. Unfortunately, no more precise definition is possible.

From a technical point of view, most of these systems share a basic structure with three parts: feature extraction, pattern recognition and conditioning. Feature extraction is the most important part of the process.
Depending on the extracted features, the system will classify genres according to that specific description. In general, timbre-related descriptors (MFCCs and their derivatives) and rhythm-related descriptors (onsets, time signature and beat) are used; melodic and harmonic features should also be included in more sophisticated forthcoming systems. The second step is usually based on pattern recognition techniques: several general-purpose machine-learning and heuristic-based techniques can be adapted to this task, Hidden Markov Models and Neural Networks being the usual "tools of the trade". Finally, some conditioning of the output data needs to be done.

As mentioned above, some systems use rhythmic description for genre classification. It is commonly accepted that rhythm is an important musical feature when describing genre; important studies on the rhythmic properties of music can be found in [4][5], and beat tracking and meter detection are also very active topics in the MIR community. Problems appear, however, when material with weak rhythmic structure has to be recognized, e.g. classical music or speech. The goal of this paper is to propose and discuss a flexible rhythmic description and its inclusion in a genre classification system in order to improve the system's robustness. Speech data has been included in the database, forcing the system to deal with a high variety of rhythms, while Dance and Classical music provide a high variety of timbres.

The paper is organized as follows: in Section 2, the proposed Rhythm Transform is explained and its use in a real environment is shown. In Section 3, the classification scheme and the databases are introduced; in Section 4, the experiments and results are discussed; and, finally, conclusions and future work can be found in Section 5.

2. FEATURE EXTRACTION

2.1. The Rhythm Transform

As mentioned in the previous section, a number of different algorithms for rhythm feature extraction have been developed (e.g. the Beat Histogram proposed by Tzanetakis in [1] or the Beat Spectrum proposed by Foote in [6]). Although some of them provide a general representation of the rhythmicity of the input signal, manual tuning is usually required. Other successful studies on rhythm are based on the periodicities at specific BPM values (e.g. the beat tracking systems proposed by Scheirer in [7] and Goto in [8]), but they do not provide rhythmic information beyond the BPM scale. Here, we use the proposed Rhythm Transform for computing the rhythmic

properties of the input signal [9]. The main goal of the Rhythm Transform is to provide a general rhythmic representation in the same sense that the Fourier Transform provides a general timbral representation. The basic idea of this transformation is inspired by human behavior when listening to music: we do not care about exact BPM values or about the different amplitudes of two peaks in a bar. Humans tend to conceptualize all this information and extract some high-level musical information (fast/slow, impulsive/continuous, etc.). Furthermore, faster periodicities (>208 BPM) provide some extra musical information (swing ratio, musical density, etc.). From the transformed data (that is, data in the Rhythm Domain), rhythm-related descriptors can be computed (e.g. Rhythm Centroid, Roll-off, etc.).

From a mathematical point of view, the Rhythm Transform is computed as follows:

Data preparation: The input data x(t) is sampled at fs = 22050 Hz. The frame length is l = 300 ms and the hop size is h = 25 ms (both values are selected according to perceptual criteria). A Hamming window is applied and the FFT is computed for each frame.

Frequency decomposition: A 1/3rd-octave filter bank is used for the frequency decomposition. About 30 different sub-bands would be needed to cover the whole audible frequency range, but only 12 sub-bands are finally implemented due to computational constraints.

Energy: The frame energy is computed for each of the sub-bands.

Derivative: The derivative of the frame-to-frame energy evolution of each sub-band is extracted.

Periodogram: The periodogram of each sub-band is obtained using the following expression:

    P(k) = |V[k]|^2 / (L·U)    (1)

where V[k] is the N-point DFT of the windowed input signal x[n]w[n] and L and U are scale factors.

[Figure 1: Overview of the Rhythm Transform method]
[Figure 2 plot: 3/4, "Take this waltz" - Leonard Cohen; x-axis: bins of the FFT]
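As an illustration, the energy, derivative and periodogram steps above can be sketched in Python. The parameter values follow the text, but the equal-width band split below is only a crude stand-in for the paper's 1/3-octave filter bank, and reading L and U as segment length and window-power normalisation (as in Welch's method) is our assumption.

```python
import numpy as np

def band_energy_derivatives(x, fs=22050, frame=0.3, hop=0.025, n_bands=12):
    """Per-band frame energies of x, then their frame-to-frame derivative.

    Frame length 300 ms and hop 25 ms follow the paper's perceptual choices.
    """
    flen, hlen = int(frame * fs), int(hop * fs)
    win = np.hamming(flen)
    frames = [x[i:i + flen] * win for i in range(0, len(x) - flen, hlen)]
    spectra = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Crude stand-in for the 1/3-octave filter bank: 12 equal-width bands.
    bands = np.array_split(spectra, n_bands, axis=1)
    energies = np.stack([b.sum(axis=1) for b in bands])  # (bands, frames)
    return np.diff(energies, axis=1)  # frame-to-frame energy derivative

def periodogram(v):
    """Eq. (1): P(k) = |V[k]|^2 / (L*U), with U a window-power factor."""
    L = len(v)
    w = np.hamming(L)
    U = np.sum(w ** 2) / L  # assumed normalisation, as in Welch's method
    V = np.fft.rfft(v * w)
    return np.abs(V) ** 2 / (L * U)
```

Applying `periodogram` to each row of the derivative matrix and adding the results would complete the transform, as described in the Weighted Sum step.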
The frame length used to compute the periodogram is L' = 6 s (equivalent to 4 beats of a 4/4 bar at 40 BPM). The hop size is h' = 25 ms; this value has been selected as an approximation of the shortest energy variation that humans can consider as belonging to separate events.

Weighted sum: Finally, the periodograms obtained for the different sub-bands are added together. This addition can be implemented as a simple sum or with some kind of weighting that emphasizes specific sub-bands (e.g. low-frequency bands for pop or rock music).

An overview of the global method is shown in Fig. 1. As the periodogram is computed via the DFT, data in the Rhythm Domain can be assumed to have properties similar to those of the spectrum: each bin of data in the Rhythm Domain can be associated with a specific BPM value. This representation therefore captures all the existing periodicities of the input signal. Fig. 2 shows a typical example of data in the Rhythm Domain.

[Figure 2: Example of data in the Rhythm Domain]

Finally, it is important to note that the Rhythm Transform has no inverse transform, due to the quadratic operations in its definition.

2.2. Extracted Features

As discussed in Section 1, a proper genre classification algorithm requires more than timbre-related features. As a first approach to genre description, both timbral and rhythmic descriptors are used:

2.2.1. Timbre-related descriptors

Zero Crossing provides a measure of the weighted average of the spectral density distribution.

Spectral Centroid is a measure of the balancing point of the spectral power distribution.

Spectral Flatness is a measure of the complexity of the spectral shape.

Roll-off point is the frequency bin below which 85% of the spectral energy is contained. It provides a measure of the spectral bandwidth.

MFCC is the Discrete Cosine Transform of the logarithm of the mel-frequency spectrum of the input signal.

Delta Coefficients are computed for some of the proposed descriptors in order to evaluate their time variation.

2.2.2. Rhythm-related descriptors

As described above, data in the Rhythm Domain can be conceptually associated with a power rhythmic density, in the same sense that the FFT provides a representation of the power spectral density. Thus, different descriptors can be computed over this rhythmic representation, in the same way as is done over the spectral transformation with the FFT. In our case, it is useful to compute the following rhythmic descriptors:

Rhythmic Flatness: The Spectral Flatness of the power rhythmic density function provides a measure of the "beatedness" of the input signal, i.e. a high value for Dance music and a low value for so-called music for relaxation.

Rhythmic Centroid: The Spectral Centroid of the power rhythmic density function provides a measure of the balance between high and low BPM values, i.e. a high value for Trance music. Note that this measure may be independent of the BPM value itself.

Roll-off point: The Roll-off point of the power rhythmic density function also provides a measure of the beatedness of the input signal: songs with strong beats may have faster periodicities, which increase the Roll-off value.

MFCC: The MFCCs provide a rough representation of the distribution of all the periodicities (that is, BPMs) over the Rhythm Domain. The idea of computing MFCCs here is inspired by several observations: the BPM scale is a log scale (as is the mel scale in the MFCC computation), and humans perceive the global rhythmic distribution much as they perceive timbre through MFCCs. Many other representations could be defined for the same purpose, but the MFCC is chosen for simplicity.
Delta Coefficients: They add the micro-temporal evolution of the MFCC values, providing rough information on the evolution of the existing periodicities.

3. SYSTEM DESCRIPTION

Our genre classification system is implemented using a Gaussian Mixture Model approach. One mixture for each defined genre is trained using the HTK software with the training datasets described below; 16 Gaussian components are used per mixture.

Style            # train songs   # test songs
Classical             50              5
Dance                 50              5
Hip-Hop               49              5
Jazz                  49              5
Pop                   51              5
Rhythm'n'Blues        50              5
Rock                  50              5
Speech                50              5

Table 1. Number of songs in our database

 %   Cl  Da  Hi  Ja  Po  Rh  Ro  Sp
Cl   25  15  12  17   9   5   7  10
Da    2  21  23   6  17  16   8   7
Hi    5  11  40   6   8  12   6  12
Ja    6   9  24  16   9  18   7  11
Po    3  17  27   7  16  18   6   6
Rh   10  13  31   8  10  12   6  10
Ro    3   9  28  10  12  14  14  10
Sp    9   6  23   6   5   6   2  43

Table 2. Similarity matrix, timbral descriptors. Rows: ground truth; columns: percentage of frames classified into each genre.

The set of descriptors for all the experiments is computed with a frame rate f_r = 10 Hz (that is, a hop size of 100 ms). All analyses are made using fs = 22050 Hz. The defined taxonomies have been manually selected for musicological reasons (see [10] for a more detailed discussion of taxonomies), and the audio files in each of the genres have been carefully selected to obtain a sample with an appropriate degree of variance and representativeness. The database is based on CD or high-quality MP3 audio files; more detailed information can be found in Table 1.

4. EXPERIMENTS & RESULTS

After the construction of the Gaussian Mixture classifier, three different experiments have been carried out: in the first one, only timbral features have been used; in the second one, only rhythmic features; and, finally, both timbral and rhythmic features together. Results of these classifications are shown in Tables 2, 3 and 4, respectively¹.
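The classification scheme above can be sketched as follows. This is only an illustrative sketch: the paper trains its mixtures with the HTK software, whereas scikit-learn's GaussianMixture is used here as a stand-in, and the feature matrices (one row per analysis frame) are placeholders for the timbral and rhythmic descriptors of Section 2.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_genre_models(features_by_genre, n_components=16, seed=0):
    """Fit one GMM per genre (16 components, as in the paper).

    features_by_genre maps a genre name to a (n_frames, n_features) matrix.
    """
    models = {}
    for genre, feats in features_by_genre.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=seed)
        models[genre] = gmm.fit(feats)
    return models

def classify_song(models, frames):
    """Accumulate per-frame log-likelihood over a song, pick the best genre."""
    scores = {g: m.score_samples(frames).sum() for g, m in models.items()}
    return max(scores, key=scores.get)
```

Summing per-frame log-likelihoods in `classify_song` corresponds to the song-level "maximum accumulated probability" decision discussed in Section 4; scoring each frame independently instead yields the frame-based accuracies.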
All the frames of all the test audio files have been classified in this experiment, and the results can be summarized as follows: frame-based accuracy for timbral classification: 23.38%; frame-based accuracy for rhythmic classification: 24.90%; frame-based accuracy for combined timbral and rhythmic classification: 32.75%.

¹ Cl: Classical, Da: Dance, Hi: Hip-Hop, Ja: Jazz, Po: Pop, Rh: Rhythm'n'Blues, Ro: Rock, Sp: Speech.

 %   Cl  Da  Hi  Ja  Po  Rh  Ro  Sp
Cl   12  12  10  16  10  19  11  10
Da    3  67   3   5   4   5  10   3
Hi    5  12  25   8  18  12  10  10
Ja    9  18  12  12  13  18  10   8
Po    7  14  12  11  13  18  15  10
Rh    4  11  19   9   9  26   9  13
Ro   11  17   7  11  13   8  27   6
Sp    5  11  22  12   9  15   9  17

Table 3. Similarity matrix, rhythmic descriptors

 %   Cl  Da  Hi  Ja  Po  Rh  Ro  Sp
Cl   27  27   3  17   8   5   7   6
Da    1  81   5   3   3   3   3   1
Hi    1  23  28   5  10  14   9  11
Ja    4  26  10  19   9  16  10   6
Po    3  28  12   8  14  19  12   4
Rh    5  31  12   8  11  19   6   8
Ro    2  25   9  12  15   8  26   3
Sp    2  20   9   5   6   7   3  48

Table 4. Similarity matrix, combined timbral and rhythmic descriptors

If, instead, the maximum accumulated probability over each song is used for genre classification, the accuracy is improved to the following values, as shown in Table 5: accuracy for timbral classification: 37%; accuracy for rhythmic classification: 47.5%; accuracy for combined timbral and rhythmic classification: 55%.

In both cases, the inclusion of rhythmic information improves the efficiency of the system. Our classifier is still far from the state-of-the-art ones (more than 70% in the Genre Classification Contest, ISMIR 2004), possibly for the following reasons: we have implemented only a basic classifier, and we have deliberately included some rhythmically and timbrally conflicting cases (e.g. speech) in order to study the behavior of the Rhythm Transform in more general environments. Both considerations decrease the performance of the classifier. Despite that, the accuracy is improved for the Classical and Speech genres (from 25% to 27% and from 43% to 48%, respectively), as well as for Dance music (from 21% to 81%). Not all the results are so encouraging, however, as seen for Pop music: confusions between Pop and Dance music can introduce noise into the system.

5. CONCLUSIONS & FUTURE WORK

A generic rhythmic description of audio has been presented in this paper and tested in a simple genre classification system, one of its most important applications.
Results are encouraging: with the inclusion of the rhythmic description of audio, the efficiency is increased across genres, independently of their rhythmicity (from Dance to Speech, from Classical to Hip-Hop). This description of rhythm has proved to be robust, versatile and compact enough to deal with all kinds of audio signals.

                 Timbre  Rhythm  Both
Classical           5       0      3
Dance               2       5      5
Hip-Hop             5       4      4
Jazz                0       0      2
Pop                 0       1      0
Rhythm'n'Blues      0       4      1
Rock                0       3      2
Speech              3       2      5

Table 5. Comparison of the number of correctly classified songs (out of 5) for the three different configurations

6. ACKNOWLEDGMENTS

This research has been partially funded by the EU-FP6-IST-507142 SIMAC project and the EU-FP6-eContent HARMOS project.

7. REFERENCES

[1] Tzanetakis, G.; Essl, G.; Cook, P. "Automatic Musical Genre Classification of Audio Signals", Proc. ISMIR, 2001.

[2] Aucouturier, J.-J.; Pachet, F. "Musical Genre: A Survey", Journal of New Music Research 32(1), 2003.

[3] Kosina, K. "Music Genre Recognition", Diplomarbeit, Hagenberg, 2002.

[4] Gouyon, F. "Towards Automatic Rhythm Description of Musical Audio Signals: Representations, Computational Models and Applications", PhD pre-thesis work, Pompeu Fabra University, 2003.

[5] Dixon, S. "A Beat Tracking System for Audio Signals", Proc. Conference on Mathematical and Computational Methods in Music, 1999.

[6] Foote, J.; Uchihashi, S. "The Beat Spectrum: A New Approach to Rhythm Analysis", Proc. International Conf. on Multimedia and Expo, 2001.

[7] Scheirer, E. "Tempo and Beat Analysis of Acoustical Musical Signals", J. Acoust. Soc. Am. 103(1), Jan. 1998.

[8] Goto, M.; Muraoka, Y. "A Beat Tracking System for Acoustic Signals of Music", Proc. ACM Multimedia, 1994.

[9] Guaus, E. "New Approaches for Rhythmic Description of Audio Signals", DEA Dissertation, Pompeu Fabra University, 2004.

[10] Pachet, F.; Cazaly, D. "A Taxonomy of Musical Genres", Technical Report, April 2000.