Preprocessing for the Automated Transcription of Polyphonic Music: Linking Wavelet Theory and Auditory Filtering

Rolf Wöhrmann and Ludger Solbach
AB Verteilte Systeme, TU Hamburg-Harburg, 21071 Hamburg, Germany

Abstract: In this paper, a fast method for the calculation of a linear time-frequency distribution based on the gammatone filter auditory model is introduced as a preprocessing step for the automated transcription of music and auditory source separation. Examples show that this method has promising potential for the analysis of music pieces with a limited spectral overlap of the different signal components.

1 Wavelet Analysis Using Auditory Filters

In the past few years, wavelet transforms have become an important tool for signal processing; see (Rioul and Vetterli, 1991) for an overview. An important property of both the continuous wavelet transform (CWT) and the short-time Fourier transform (STFT) is their linearity, which makes them more suitable for the analysis of multicomponent signals than quadratic time-frequency distributions (TFDs), which suffer from cross-term artefacts. The CWT is given by

$$W_g(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} s(\tau)\, g^{*}\!\left(\frac{\tau - b}{a}\right) d\tau, \quad a > 0, \tag{1}$$

where $s(\tau)$ denotes the analyzed signal, $a$ the scaling factor, $b$ the time shift and $^{*}$ complex conjugation. To be admissible as a mother wavelet, a time function $g(t)$ must satisfy

$$\int_{-\infty}^{\infty} |g(t)|^2\, dt < \infty \quad \text{and} \quad G(0) = 0, \tag{2}$$

where $G(f)$ is the Fourier transform of $g(t)$. Functions satisfying these conditions look like short waves, which is the reason for naming them wavelets.

The main difference between the STFT and the CWT is that the STFT analysis window remains unaltered, whereas the CWT window changes its scale according to the scaling factor $a$. The quasi-logarithmic organization of musical scales and of the frequency resolution in the human cochlea makes the CWT a more appropriate TFD for acoustic signals than the STFT. Since eq.1 cannot be evaluated for every pair $(a, b)$, it has to be modified by picking certain fixed values for $a$ and $b$, yielding a discrete approximation of the CWT.
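The discrete approximation mentioned above can be illustrated with a minimal sketch. It assumes that eq.1 has the usual form with a $1/\sqrt{a}$ normalization and approximates the integral by a Riemann sum over the signal samples. The function names, the Gaussian-windowed complex tone used as mother wavelet, and the test signal are hypothetical stand-ins chosen for illustration; they are not the gammatone wavelet or the fast IIR method of this paper.

```python
import cmath
import math


def morlet_like(t, f0=1.0, sigma=1.0):
    """Stand-in mother wavelet (an assumption, not the paper's gammatone):
    a Gaussian-windowed complex tone centred at f0 for scale a = 1."""
    return math.exp(-0.5 * (t / sigma) ** 2) * cmath.exp(2j * math.pi * f0 * t)


def cwt_point(signal, fs, a, b, mother):
    """Riemann-sum approximation of eq.1 at scale a > 0 and shift b (seconds)."""
    acc = 0j
    for i, s in enumerate(signal):
        tau = i / fs  # time of sample i in seconds
        acc += s * mother((tau - b) / a).conjugate()
    return acc / (fs * math.sqrt(a))


def log_scales(a_max, per_octave, octaves):
    """Fixed scale values spaced geometrically, per_octave values per factor of two."""
    return [a_max * 2.0 ** (-k / per_octave) for k in range(per_octave * octaves)]


# Usage: a 440 Hz tone analysed at the matching scale and one octave above.
fs = 8000.0
tone = [math.cos(2.0 * math.pi * 440.0 * i / fs) for i in range(400)]
matched = abs(cwt_point(tone, fs, 1.0 / 440.0, 0.025, morlet_like))
octave_up = abs(cwt_point(tone, fs, 1.0 / 880.0, 0.025, morlet_like))
print(matched, octave_up)
```

With this stand-in wavelet, scale $a$ corresponds to an analysis frequency of $1/a$ Hz, so the modulus at the matched scale should clearly exceed the one an octave higher. The direct sum costs one pass over the signal per $(a, b)$ pair, which is the computational burden that motivates recursive filter realizations.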
Furthermore, the wavelets used throughout our work are close to being analytic (progressive), that is, they satisfy $\forall f < 0: G(f) \approx 0$. Thus, given the complex-valued filter outputs, the instantaneous frequencies can be estimated from the phases and the signal envelopes from the moduli, provided the signal components have a negligible overlap.

Due to the high computational burden of the quasi-continuous CWT, infinite impulse response (IIR) realizations like our gammatone approach are desirable. As stated by Patterson et al. (1992), the gammatone filter can be a good approximation of the filtering in the human cochlea if the parameters are properly adjusted. See fig.1 for an example of a gammatone impulse response. A quasi-analytic version of the gammatone filter is of the form

$$g_{n,\lambda}(t) = k \cdot \varepsilon(t) \cdot t^{n-1} e^{(-\lambda + j 2\pi f_0)\, t} \tag{3}$$

with a Laplace transform of

$$G_{n,\lambda}(s) = \frac{k\,(n-1)!}{\left(s + (\lambda - j 2\pi f_0)\right)^{n}}, \tag{4}$$

where $\varepsilon(t)$ is the unit step function, $n$ the filter order, $\lambda > 0$ the damping factor, $k$ a normalization constant and $f_0$ the center frequency of the filter. Closed expressions for the coefficients of an IIR discrete-time analytic gammatone filter approximation have been given in (Solbach et al., 1995). For energy normalization we set $k = \left((2\lambda)^{1-2n}\, \Gamma(2n-1)\right)^{-0.5}$, where $\Gamma(x)$ denotes the Gamma function. Fig.2 visualizes an example of a bank of normalized gammatone filters.

Figure 1: real part of a gammatone impulse response, $n = 3$, $f_0 = 1000$ Hz, $\Delta f = 50$ Hz (horizontal axis: seconds)

Figure 2: bank of gammatone filters of 3rd order: frequency response magnitudes (horizontal axis: frequency in Hz)

A function cannot be arbitrarily concentrated in both time and frequency. The lower limit of the time-frequency window size is given by Heisenberg's uncertainty principle. The minimum RMS window area of $1/(4\pi)$ is attained by the Gaussian function and its shifted variants in the time-frequency plane (Papoulis, 1987). In (Solbach et al., 1995) it has been shown that for the gammatone filter we have

$$(\Delta f\, \Delta t)(n) = \frac{1}{2\pi} \sqrt{\frac{n\,(2n-1)}{4n-6}} \tag{5}$$

for the window area in the time-frequency plane, leading to the conclusion that $n = 3$ is the appropriate choice for minimum window size. Compared to the Gaussian window, the resolution of the gammatone wavelet is considerably worse. In terms of wavelet theory, however, neither the Gaussian nor the gammatone wavelet is admissible, because eq.2 is not satisfied. Practically, this can be neglected if the steepness of the filters is sufficiently high.

For further processing of the CWT output, a method for the detection of sound onsets has been introduced in (Solbach et al., 1995).

ICMC Proceedings 1995
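The energy normalization and the window-area expression can be checked numerically with a short sketch. It assumes that eq.3 reads $g(t) = k\,\varepsilon(t)\,t^{n-1} e^{(-\lambda + j2\pi f_0)t}$, that $k = ((2\lambda)^{1-2n}\,\Gamma(2n-1))^{-0.5}$, and that eq.5 reads $(\Delta f\,\Delta t)(n) = \frac{1}{2\pi}\sqrt{n(2n-1)/(4n-6)}$. The function names and the damping value $\lambda = 500\ \mathrm{s}^{-1}$ are hypothetical choices for illustration and are not taken from the paper.

```python
import cmath
import math


def gammatone_norm(n, lam):
    """Energy normalization k = ((2*lam)**(1 - 2n) * Gamma(2n - 1)) ** -0.5."""
    return ((2.0 * lam) ** (1 - 2 * n) * math.gamma(2 * n - 1)) ** -0.5


def gammatone(t, n, lam, f0):
    """Quasi-analytic gammatone impulse response in the assumed form of eq.3."""
    if t < 0.0:
        return 0j  # unit step: the response is zero before t = 0
    k = gammatone_norm(n, lam)
    return k * t ** (n - 1) * cmath.exp((-lam + 2j * math.pi * f0) * t)


def energy(n, lam, f0, fs=48000.0, duration=0.2):
    """Riemann-sum estimate of the integral of |g(t)|^2 over [0, duration)."""
    steps = int(duration * fs)
    return sum(abs(gammatone(i / fs, n, lam, f0)) ** 2 for i in range(steps)) / fs


def window_area(n):
    """Time-frequency window area in the assumed form of eq.5, for order n >= 2."""
    return math.sqrt(n * (2 * n - 1) / (4 * n - 6)) / (2.0 * math.pi)


# Usage: the energy should be close to one, and the smallest area falls at n = 3.
print(energy(3, 500.0, 1000.0))
for order in range(2, 7):
    print(order, window_area(order))
print("Gaussian lower bound:", 1.0 / (4.0 * math.pi))
```

Under these assumptions the area for $n = 3$ is roughly 0.25, compared with the Gaussian bound $1/(4\pi) \approx 0.08$, which is consistent with the remark that the gammatone resolution is considerably worse than that of the Gaussian window.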
The onset detection method is based on a matched filtering procedure that exploits a characteristic phase pattern, which can be observed in the time-scale plane at the time of a sound onset.

2 Examples

The first example in fig.3 shows an analysis of the first four measures of Johann Sebastian Bach's Goldberg Variations played by Glenn Gould. The upper display was created by plotting bars proportional to the moduli of each band, centered around the instantaneous frequencies estimated from the phases. The lower display shows the output of the above-mentioned sound onset detection algorithm. We expect to solve the problem of interference, strongly evident in the range between 12 s and 17 s, by removing identified steady components prior to the application of the algorithm. The analysis was performed over 3 octaves descending from 987.77 Hz using 12 equidistant filters per octave with $\Delta f / f_0 = 0.01$. To enhance the graphical display, only filter output values with moduli above a certain threshold were plotted. In fig.4, harmonics of the base frequencies were suppressed to a great extent by a simple phase locking detection algorithm.

Figure 3: analytic gammatone filtering and onset detection, Goldberg Variations example (upper panel: frequency in Hz, log in 400.0 cent steps, versus time in seconds; lower panel: singularity versus time in seconds)

Figure 4: analytic gammatone filtering with suppressed harmonics, Goldberg Variations example (frequency in Hz, log in 400.0 cent steps, versus time in seconds)

The second example, shown in fig.5, is an excerpt of a Mbira improvisation played by Dumisani Mbaso from South Africa. It shows the complex polyphonic and polyrhythmic structure of African Mbira music. The analysis was performed over 3 octaves descending from 1018.0 Hz using 9 nonequidistant filters per octave with $\Delta f / f_0 = 0.01$.

Figure 5: Mbira improvisation by Dumisani Mbaso, recorded by Manfred Stahnke (frequency in Hz, log in 400.0 cent steps, versus time in seconds)

3 Conclusion

Examples showing the potential of our approach for the analysis of polyphonic music have been given. In order to alleviate the restriction of limited spectral overlap, the development of an adaptive lateral inhibition algorithm is envisaged.

References

[Kronland-Martinet et al., 1987] R. Kronland-Martinet, J. Morlet, and A. Grossmann. Analysis of Sound Patterns through Wavelet Transforms. International Journal of Pattern Recognition and Artificial Intelligence, 1:273-302, 1987.

[Papoulis, 1987] A. Papoulis. Signal Analysis. McGraw-Hill, 1987.

[Patterson et al., 1992] R.D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. Allerhand. Complex Sounds and Auditory Images. In Y. Cazals, L. Demany, and K. Horner, editors, Auditory Physiology and Perception, Advances in Biosciences, pages 429-443. Pergamon Press, 1992.

[Rioul and Vetterli, 1991] O. Rioul and M. Vetterli. Wavelets and Signal Processing. IEEE SP Magazine, pages 14-38, October 1991.

[Solbach et al., 1995] L. Solbach, R. Wöhrmann, and J. Kliewer. The complex-valued continuous wavelet transform as a preprocessor for auditory scene analysis. In Working Notes of the Workshop on Computational Auditory Scene Analysis at the International Joint Conference on Artificial Intelligence, Montreal, Canada (in print), August 1995.