AN EFFICIENT OFF-LINE BEAT TRACKING METHOD FOR MUSIC WITH STEADY TEMPO

Bee Suan Ong & Sebastian Streich
Center for Advanced Sound Technologies, Yamaha Corporation
203 Matsunokijima, Iwata, Shizuoka, 438-0192, Japan
{beesuan, sstreich}@beat.yamaha.co.jp

ABSTRACT

This paper presents an efficient off-line approach for the automatic estimation of tempo and of the exact temporal locations of beats in rhythmic music signals with steady tempo. Unlike previous approaches, our method first finds an appropriate location within the analyzed signal at which to start beat detection. Consequently, the initial beat estimates reach a high level of reliability. The method then infers the beat locations in the rest of the signal by using the global tempo estimate. The processing involves three stages: rhythm feature extraction, tempo estimation, and beat induction. The performance of the proposed method is evaluated on a database of 160 music excerpts from 13 different genres. The global tempo recognition rate reaches 92%, and the average p-score for beat tracking is 0.7. The results indicate that, despite its simplicity, the proposed approach is successful in extracting rhythmic descriptions from music audio signals and is suitable for the targeted applications.

1. INTRODUCTION

Rhythmic descriptions are among the most basic and important elements used by human listeners to characterize music. The ability to capture such a musical representation benefits the manipulation and retrieval of music audio data in a variety of applications. One example is a mobile music player that automatically selects songs with a suitable tempo for jogging or working out. Another example is the beat-synchronous playback of tracks in an easy-to-use mixing application for dance music.

Over the last few years, the automatic extraction of rhythmic descriptions (particularly beat and tempo information) directly from music signals has received a lot of attention. The current state of the art can be observed in the results of several recent MIREX¹ events. According to [7], most approaches use either the signal energy along time [6] or the detected onsets [2] to derive the beat periodicity of any type of music signal.

¹ http://www.music-ir.org/mirex/2008/index.php/Main_Page

[Figure 1. Block diagram of the proposed method: the music input signal passes through rhythm feature extraction, tempo estimation, and beat induction (beat candidate filtering followed by beat grid determination), yielding the tempo and the beat locations.]

Collins [1] points out that style-specific approaches towards beat tracking, very much like human listeners, can take advantage of conventions and characteristic properties of the material. By doing so, they can outperform their generalized equivalents by considerable margins within their domain. On the other hand, a method with an extreme degree of specialization also puts strong restrictions on potential practical applications. Our method follows an intermediate way by requiring the tempo to be constant throughout the track, limited to the range of 60 to 200 beats per minute (bpm), and the beat to be regular within a certain tolerance range. While this approach is doomed to fail for classical music and is likely to encounter problems in categories like Jazz or Latin, it is reasonable for the great majority of modern Pop, Rock, and Dance music. We further chose to exploit the advantages of an off-line or non-causal system for the benefit of a higher accuracy in the beat estimation. Our method is conceptually very similar to the one described by Ellis in [4].
However, we rely on a simpler front-end and apply several simple selection criteria for the beat induction instead of the proposed dynamic programming approach.

The rest of the paper is organized as follows: Section 2 explains our method for estimating tempo and beat locations. Section 3 presents and discusses the evaluation results. Finally, Section 4 gives our conclusions and future research plans.

2. METHOD

Our proposed approach involves three main processing stages, as shown in Figure 1. The first stage consists of rhythm feature extraction, which in our case is an onset detection function based on the differences between spectral frames. Using this feature, we then establish an estimate of the track's tempo in stage two. In stage three, we combine this tempo estimate with the onset detection function to find a set of distinct beat locations first, and from these we infer the remaining beat locations in the signal. The following sub-sections explain the process in more detail.
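To make the overall flow more tangible, the following Python sketch shows how the three stages could be chained. The function names, the signature, and the bpm conversion are illustrative assumptions on our part (the original system was implemented in MATLAB); the individual helpers are sketched in the corresponding sub-sections below.

```python
# Illustrative top-level pipeline (hypothetical helper names; see Sections 2.1-2.3).
def track_beats(audio, sr=44100, hop=200):
    """Return an estimated tempo in bpm and the beat locations (in onset-function frames)."""
    onset_fn = onset_detection_function(audio)            # stage 1: rhythm feature extraction
    onset_fn = enhance(onset_fn)                          # moving-average enhancement (Section 2.1)
    beat_period = estimate_beat_period(onset_fn)          # stage 2: tempo estimation (Section 2.2)
    peaks = find_local_maxima(onset_fn)                   # beat location candidates
    beats = induce_beats(onset_fn, beat_period, peaks)    # stage 3: beat induction (Section 2.3)
    tempo_bpm = 60.0 * sr / (beat_period * hop)           # one frame corresponds to hop/sr seconds
    return tempo_bpm, beats
```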

2.1. Feature Extraction

The feature extraction starts with a standard spectrogram computation, obtained by applying a short-time Fourier transform (STFT) to the audio signal. This gives us

    X_k(m) = \sum_{n=0}^{N-1} w(n) x(mh + n) e^{-j 2\pi k n / N},   k = 0, ..., N-1,    (1)

where w(n) is the N-point Hanning window, k is the frequency bin index, and h is the hop size. With an audio sampling rate of 44.1 kHz, we set N = 2048 and h = 200 samples.

As mentioned, we base the onset detection function on the spectral differences between adjacent frames. We sum up the absolute values of the complex-valued differences between a frame, X(m), and its predecessor, X(m-1), over all relevant frequency bins:

    O(m) = \sum_{k=1}^{N/2} |X_k(m) - X_k(m-1)|,    (2)

where N is again the window length. Since we use magnitude and phase here, theory would suggest compensating for the phase differences of the basis functions that result from the advancement by h samples between frames. However, when comparing difference functions computed with phase adjustment, with pure magnitude values, and with equation 2, the third option obtained the best results for our tempo and beat estimations.

We further enhance the onset detection function by subtracting a moving average from it. The moving average is computed for each sample of the detection function O(m) using a centered window of 17 samples; parts of the window that fall outside the defined range of O(m) make no contribution. Different from [2], the purpose of this process in our case is not to discard the least significant peaks but to emphasize bumps that stand out from the local trend of O(m).
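As a rough illustration of the feature extraction described above, the NumPy sketch below computes the spectral-difference onset detection function of equations 1 and 2 and the moving-average enhancement. It is a minimal reading of the text under the stated parameters (N = 2048, h = 200 at 44.1 kHz, a 17-sample window), not the authors' implementation, and the handling of the window borders is one possible interpretation.

```python
import numpy as np

def onset_detection_function(x, n_fft=2048, hop=200):
    """Spectral-difference onset function, a sketch of equations (1) and (2)."""
    w = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([w * x[m * hop : m * hop + n_fft] for m in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)          # X_k(m), bins k = 0..N/2
    diff = np.abs(np.diff(spec, axis=0))        # |X_k(m) - X_k(m-1)|
    return diff[:, 1:].sum(axis=1)              # sum over bins k = 1..N/2

def enhance(o, win=17):
    """Subtract a centered 17-sample moving average; out-of-range samples contribute nothing."""
    half = win // 2
    avg = np.empty_like(o)
    for m in range(len(o)):
        lo, hi = max(0, m - half), min(len(o), m + half + 1)
        avg[m] = o[lo:hi].mean()                # one reading of "no contribution": average over in-range samples only
    return o - avg
```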
2.2. Tempo Estimation

With the enhanced onset detection function O, we estimate the tempo, or beat periodicity, of the signal through analysis of the normalized auto-correlation function

    A(n) = \frac{1}{L - n} \sum_{i=n}^{L} O(i) O(i - n),    (3)

with n being the auto-correlation lag and L being the number of samples for which O is defined. Here we take advantage of the fact that periodic patterns in a signal cause peaks in its auto-correlation function at the lags corresponding to the period intervals. As mentioned earlier, we assume the tempo of each track to be constant, so we can simply evaluate equation 3 on the entire onset detection function without the need for partitioning.

We then pick all local maxima in A as beat periodicity candidates. To select the most reliable of the different candidates, we also consider peaks appearing near multiples of the beat periodicity lag. We do this by defining the confidence measure

    S_c = \sum_{i=1}^{4} A_p(i \cdot l_c),    (4)

where l_c is the lag of beat periodicity candidate c and A_p(n) is the value of the local maximum in A that is nearest to n within a range of ±10 lags. If no local maximum lies within that range, A_p(n) is zero. Figure 2 illustrates the way S_c is computed for a beat candidate at lag 116.

[Figure 2. Peak selection process within an auto-correlation function plot. Circled peaks mark the local maxima that contribute to the confidence measure.]

As a final step, we select the lag of the candidate with the highest confidence value as the estimated beat period T_b of the music. To save computations, we can restrict the values of n when evaluating equation 3 to the range that is relevant for our processing. We expect the tempo to lie within the range of 60 to 200 bpm. Hence, the basic beat periods will emerge as peaks in A between the lags 220 and 66, respectively. Considering as well the requirements of our confidence measure, it is therefore sufficient to set 56 < n < 890.
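The tempo estimation stage can be sketched as follows. The peak-picking helper and the tie handling are our own simplifications, while the lag range, the ±10-lag tolerance, and the four-multiple confidence measure follow the description of equations 3 and 4 above.

```python
import numpy as np

def find_local_maxima(a):
    """Indices of simple local maxima (illustrative helper)."""
    return np.where((a[1:-1] > a[:-2]) & (a[1:-1] > a[2:]))[0] + 1

def estimate_beat_period(o, min_lag=56, max_lag=890):
    """Sketch of equations (3) and (4).

    With h = 200 at 44.1 kHz one onset-function sample is ~4.5 ms, so 60-200 bpm
    corresponds to basic beat periods of roughly 220 down to 66 lags; the confidence
    measure looks up to about 4 * 220 + 10 = 890 lags, which motivates the 56..890 range.
    """
    L = len(o)
    lags = np.arange(min_lag, max_lag + 1)
    A = np.array([np.dot(o[n:], o[:L - n]) / (L - n) for n in lags])   # eq. (3)

    peak_idx = find_local_maxima(A)                 # beat periodicity candidates
    best_lag, best_conf = None, -np.inf
    for p in peak_idx:
        lag_c = lags[p]
        if lag_c < 66 or lag_c > 220:               # basic period must lie in the 66..220 lag range
            continue
        conf = 0.0
        for i in range(1, 5):                       # eq. (4): peaks near i * lag_c
            target = i * lag_c
            near = peak_idx[np.abs(lags[peak_idx] - target) <= 10]
            if len(near):
                conf += A[near[np.argmin(np.abs(lags[near] - target))]]
        if conf > best_conf:
            best_lag, best_conf = lag_c, conf
    return best_lag
```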

2.3. Beat Induction

For the beat induction we reuse the onset detection function O from stage one together with the estimated beat period, T_b, from stage two. The first step consists of picking all local maxima in the onset detection function and storing their locations in the vector M. Each one of these maxima is considered a possible candidate for a beat location.

2.3.1. Beat Candidate Filtering

The goal of this step is to identify the most solid ones among the beat candidates. For this purpose we scan the elements in vector M for sequences of beat candidates that are spaced close to T_b samples apart from each other. As in the tempo estimation process, we again allow a tolerance of ±10 samples while giving preference to candidates that are closest to the exact interval of T_b samples from their predecessor. Here, instead of defining the entire location grid solely from the very first beat candidate of a sequence, we keep updating it along the way: the last selected element in the process becomes the new reference for the one to follow. If no element is found within the tolerance range, the scan continues at the next interval of T_b samples. The scanning is done in one direction, from the beginning to the end of the signal. During the process we keep track of the number and locations of the selected followers for each candidate.

In our first filtering operation, we apply a local continuity criterion to remove unreliable candidates. We therefore define F_32(c) as the number of followers that were found within the next 32 beat periods after each candidate c. The threshold for the elimination of candidates is adaptive and is calculated as

    Thres(F, \delta) = max(F) - \delta.    (5)

After all beat candidates with F_32(c) < Thres(F_32, 4) are removed, we apply a second filtering operation that considers the overall success of each remaining beat candidate in establishing a beat grid on the track. We therefore define F_end(c) as the entire number of followers in the sequence of a candidate c. This way of counting is purposely biased to favor beat candidates that appear early in the signal. Establishing again an adaptive threshold according to equation 5, we remove all candidates with F_end(c) < Thres(F_end, 2).

2.3.2. Beat Grid Determination

Finally, we select the most prominent among the remaining beat candidates as a reference for inferring the beat grid on the track. We therefore initiate a second scan for beat sequences, as described in the first paragraph of the previous section. Here, we only consider those beat candidates that survived the two filtering operations, and we reduce the tolerance for the beat intervals to ±5 samples. At this point, let us recall that (i) each beat candidate and its followers correspond to a local maximum in the onset detection function O, and (ii) the function O reaches higher values the more abrupt and strong the changes in the music signal are (e.g., at drum hits or sudden note onsets). For the type of music we are targeting, the most prominent beat candidate sequence is expected to coincide very frequently with such events. As the final selection criterion, we therefore compute the average level of O at all beat locations of each beat candidate's sequence. Beat intervals in the sequence that contain no local maximum within the tolerance range contribute a value of zero to the average. The candidate with the highest average score is selected as the reference for the beat grid.

With the selection of the most prominent beat candidate sequence, we have already established all beat locations from the initial candidate's location to the end of the track. Now we only need to infer the beats that occur before the beginning of the sequence. This is done by another scanning process starting from the first beat of the selected sequence and moving towards the beginning of the signal. Again, we allow a tolerance of ±5 samples from T_b for the beat intervals.
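A condensed sketch of the beat induction stage is given below. The scanning and follower counting follow Sections 2.3.1 and 2.3.2 as we read them, but several details (how the adaptive threshold of equation 5 is applied to the remaining candidates, the exact normalization of the average level, and the backward scan) are simplified or only indicated, so this should be read as an approximation rather than the published algorithm.

```python
import numpy as np

def scan_sequence(candidates, start, Tb, tol):
    """Collect followers spaced ~Tb frames apart, updating the reference as we go."""
    seq, ref = [start], start
    last = candidates[-1]
    while ref + Tb <= last + tol:
        target = ref + Tb
        near = candidates[np.abs(candidates - target) <= tol]
        if len(near):
            ref = near[np.argmin(np.abs(near - target))]   # prefer the element closest to the exact interval
            seq.append(ref)
        else:
            ref = target                                    # nothing found: continue at the next Tb interval
    return np.array(seq)

def induce_beats(o, Tb, onset_peaks):
    """Sketch of beat candidate filtering (2.3.1) and beat grid determination (2.3.2)."""
    M = np.asarray(onset_peaks)

    # First scan (tolerance +-10 frames): count followers for every candidate.
    seqs = [scan_sequence(M, c, Tb, tol=10) for c in M]
    F32 = np.array([np.sum(s <= c + 32 * Tb) - 1 for s, c in zip(seqs, M)])
    Fend = np.array([len(s) - 1 for s in seqs])

    keep = F32 >= F32.max() - 4                   # remove F32(c) < Thres(F32, 4), eq. (5)
    keep &= Fend >= Fend[keep].max() - 2          # then Fend(c) < Thres(Fend, 2) on the survivors
    survivors = M[keep]

    # Second scan (tolerance +-5 frames): rate each survivor by the average onset-function
    # level over its beat grid; intervals without a matching maximum count as zero.
    def average_level(c):
        s = scan_sequence(survivors, c, Tb, tol=5)
        n_intervals = int((M[-1] - c) // Tb) + 1
        return o[s].sum() / max(n_intervals, 1)

    best = survivors[np.argmax([average_level(c) for c in survivors])]
    beats = scan_sequence(survivors, best, Tb, tol=5)
    # Beats before beats[0] would be inferred by a mirrored backward scan
    # with the same +-5 frame tolerance; omitted here for brevity.
    return beats
```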
3. EVALUATION

Data sets: We assessed the performance of the proposed approach on two different databases. In order to have a direct comparison with other approaches, we used the freely available "Songs data set" from the ISMIR 2004 tempo induction contest [5], which consists of 465 excerpts from music tracks with manually annotated tempo information. This data set contains only a minority of excerpts from our target music genres. Additionally, we created a second database comprising 160 excerpts, each 30 seconds in length. The proportion of genres in this selection is more focused on Pop, Rock, and Dance music, as can be seen from Table 2. Excerpts were sampled at 44.1 kHz with 16 bits in mono. The ground truth was obtained by having two professional musicians manually annotate the beat locations inside a DAW application. For this purpose the data was split into two halves, so each track was annotated only once. One of the authors cross-checked the annotation results for obvious errors.

Evaluation Measures: For comparing our results for tempo estimation with those reported in [5], we compute the two metrics exactly as suggested there. Metric one considers only tempos within a 4% error margin around the annotated tempo as correct. Metric two also accepts third, half, double, and triple tempos as correct. For the evaluation of the beat detection on our in-house database, we adopt the metric established for MIREX 2006. The p-score between 0 and 1 is obtained by calculating the cross-correlation of the impulse train created by our proposed method with the ground-truth beat vectors [3]. With regard to the error margin, we reduce the tolerance from 20% to 10%, since we expect our annotations to be more precise. Additionally, for the purpose of comparing the performance across different music genres, we use the standard F-measure [3] with equal weights for recall and precision. The F-measure is commonly used in the evaluation of MIR tasks.

Results: For the tempo estimation on the "Songs data set", our method reaches 42.2% and 70.5% of correct results according to the two metrics. While staying clearly behind the much more sophisticated top method (58.5% and 91.2%), our approach copes surprisingly well with this data and ranks among the top three of the eleven evaluated algorithms reported in [5]. As for our in-house database, the obtained average tempo recognition rates are significantly higher, with scores of 68.1% and 92.0%, respectively.
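As an illustration of a tolerance-based beat-tracking evaluation, the sketch below matches detected beats to annotations within ±10% of the annotated inter-beat interval and computes precision, recall, and the F-measure. It is a simplified stand-in of our own, not the exact MIREX 2006 p-score computation or the evaluation code of [3].

```python
import numpy as np

def beat_f_measure(detected, annotated, rel_tol=0.10):
    """F-measure for beat tracking with a +-10% inter-beat-interval tolerance (illustrative)."""
    detected = np.asarray(detected, dtype=float)
    annotated = np.asarray(annotated, dtype=float)
    if len(detected) == 0 or len(annotated) < 2:
        return 0.0
    tol = rel_tol * np.median(np.diff(annotated))    # tolerance window in seconds
    used = np.zeros(len(detected), dtype=bool)
    hits = 0
    for t in annotated:
        dist = np.abs(detected - t)
        dist[used] = np.inf                          # each detection may match only one annotation
        j = int(np.argmin(dist))
        if dist[j] <= tol:
            used[j] = True
            hits += 1
    precision = hits / len(detected)
    recall = hits / len(annotated)
    return 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
```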

For the beat tracking performance, our proposal achieves an average p-score of 0.7 with a standard error of 0.03 on our database. In order to have a direct comparison to an existing approach, we also evaluate the performance of the beat detection method proposed by Ellis in [4] on the same test database. Both compared methods were implemented in MATLAB; the source code for Ellis's method was taken from [4]. Table 1 lists the global statistics for the performance evaluation of our method (OS) and of Ellis's proposal. We see that our method obtains a slightly lower performance (by about 4%). Regarding computational efficiency, however, we observed that our algorithm needed only half of the computation time used by Ellis's method. This is particularly remarkable considering that the latter is already a very efficient algorithm and was the second fastest among all submissions to MIREX 2006 despite running in MATLAB. Thus, at the expense of a small deterioration in accuracy, our proposed method can be highly useful when dealing with large audio databases or when computational power is limited (e.g., in mobile devices).

The large discrepancy between the median and the mean values in Table 1 could be a sign that both methods perform very well for the majority of the songs in the test database; the overall average is likely spoiled by a few very bad results for song excerpts with tempo changes from the classical or ethnic genres. The evaluation results by genre listed in Table 2 show that both methods indeed perform drastically differently inside and outside their target domain. We find high performance in music categories such as "Jazz Acid", "R&B", "Pop", and "Dance", with F-measures over 80%, whereas for the categories "Classical" and "New Age/Ethnic" performance is very poor. Ellis's method appears to generalize slightly better here, as its performance does not drop as extremely as that of our algorithm.

    Method   Mean   Std. Deviation   Std. Error   Median
    OS       0.70   0.32             0.03         0.95
    Ellis    0.74   0.30             0.02         1.00

Table 1: Global beat detection performance statistics (p-scores).

    Genre             Excerpts   F-measure OS (%)   F-measure Ellis (%)
    R&B                   9           86.3               88.0
    Pop                  22           86.3               88.7
    Rock                 15           75.8               82.3
    Jazz Acid            10           90.6               91.3
    Metal Punk           10           68.1               68.6
    Hip hop              10           77.2               86.1
    Reggae               10           54.3               55.8
    Funk & Blues         11           74.8               86.1
    Latin                10           58.5               54.7
    Dance                30           80.1               83.6
    J-pop                 3           73.2               73.8
    Classical            10           27.5               44.7
    New Age/Ethnic       10           44.6               50.2

Table 2: Beat detection performance on various music genres.

4. SUMMARY

We presented a very simple and efficient approach to estimating tempo and beat locations in rhythmic music. Our evaluation results show that the method works with high reliability despite its very low complexity. It is therefore attractive for practical applications, although it requires the tempo to be fairly constant in order to work well. A mixing environment for electronic dance music tracks would be one example where the restriction to constant tempo is not likely to cause any problem. For the items of the classical music category, the failures of our algorithm are mainly due to unclear onsets or to the occurrence of tempo changes (e.g., rubato, accelerando, etc.). Regarding the former, our future work will focus on improving the computation of the onset detection function for a better response to non-percussive instruments. The latter would require a more substantial modification of our approach and is basically out of scope.
5. ACKNOWLEDGEMENTS

The authors would like to thank Maarten de Boer and Paul Brossier for their support in the evaluation tasks. We also thank Fabien Gouyon for making the "Songs data set" available.

6. REFERENCES

[1] Collins, N. "Towards a Style-Specific Basis for Computational Beat Tracking", Proc. of ICMPC, Bologna, Italy, 2006.
[2] Davies, M. P., and Plumbley, M. D. "Context-Dependent Beat Tracking of Musical Audio", IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1009-1020, 2007.
[3] Dixon, S. "Evaluation of the Audio Beat Tracking System BeatRoot", JNMR, vol. 36, no. 1, pp. 39-50, 2007.
[4] Ellis, D. "Beat Tracking by Dynamic Programming", JNMR, Special Issue on Beat and Tempo Extraction, vol. 36, no. 1, 2007.
[5] Gouyon, F., et al. "An Experimental Comparison of Audio Tempo Induction Algorithms", IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 5, 2006.
[6] Klapuri, A. P., et al. "Analysis of the Meter of Acoustic Musical Signals", IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 1, 2006.
[7] Peeters, G. "Time Variable Tempo Detection and Beat Marking", Proc. of ICMC, Barcelona, Spain, 2005.