Sound Onset Localization and Partial Tracking in Gaussian White Noise

Ludger Solbach and Rolf Wöhrmann
Arbeitsbereich Verteilte Systeme, Technische Informatik VI
TU Hamburg-Harburg, 21071 Hamburg, Germany
{ludger, rolf}@tick.ti6.tu-harburg.de

ICMC Proceedings 1996

Abstract

In auditory scene analysis the detection of common partial onsets is an important cue for event formation. Our wavelet-based approach to onset localization suffers neither from smearing effects caused by windowing in the time domain nor from consistency problems caused by multiband level thresholding. Furthermore, the method benefits from the removal of partials by adaptive signal cancelling. The results presented indicate that our approach is a promising basis for multicomponent acoustic signal separation.

1 System Architecture

The fundamental signals considered in our approach are time-limited sinusoids disturbed by Gaussian white noise. Our model for these signals consists of two basic components, an onset part and a steady part. These parts correspond to two types of agents used in the system, the master module and the partial tracker module (PTM), shown in fig. 1 and fig. 2, respectively. There is exactly one master module in the whole system, while the number of PTMs changes dynamically. The master's task is to create or delete PTMs, triggered by the detection of an onset or an offset, respectively. The task of the PTMs is to track the partials originating from the events. The tracked partial signals $s_i(k)$ are fed back to the master and subtracted from the overall input signal $s(k)$.

All filters used in our system are analytic constant-Q energy-normalized gammatone filters (Solbach et al., 1996). We chose the gammatone filter because of its physiological justification and computational efficiency.

The global architecture of the system resembles the one described in (Nakatani et al., 1995) in many respects. The realization of the system components, however, differs considerably.

1.1 The Master Module

The master module depicted in fig. 1 consists of a wavelet transformation module (WTM), an event detector and a partial tracker module controller.

Figure 1: master module

Figure 2: i-th partial tracker module (block labels: residual, time variant filter, 1st order AR model, adaption rule)

The WTM is basically a bank of logarithmically spaced analytic constant-Q gammatone filters (Solbach et al., 1996). Its purpose is to preprocess the signal for the event detector and to provide suitable initial states for the tracking filters of the PTMs. As soon as the event detector detects an onset, the PTM controller initializes the proper PTMs with one-to-one copies of the wavelet filters residing closest to the frequencies where the partials are expected. Because the event detector operates on the residual between the overall input signal and the tracked partials, it finds subsequent events in an almost undisturbed context.

In the current realization the master depends on prior knowledge of a partial's frequency localization at its onset and on knowledge of an event's status as an onset or offset. This information could be gained from musical notation, as done by Scheirer (1996) for expressive music performance analysis. While the use of prior knowledge simplifies the current state of the implementation, the entry point will be kept as a higher-level knowledge feed after this dependency has been removed.
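The residual feedback of the master module can be illustrated with a minimal Python sketch. It assumes real-valued signals and a perfectly tracked partial, which the paper does not claim; the function name `residual`, the sampling rate and the test frequencies are hypothetical and chosen for illustration only.

```python
import math

def residual(s, tracked):
    """Subtract all tracked partial signals s_i(k) from the input s(k), sample by sample."""
    return [x - sum(p[k] for p in tracked) for k, x in enumerate(s)]

fs = 22050.0  # sampling rate in Hz (assumed for this illustration)
n = 256
partial_a = [math.sin(2 * math.pi * 500.0 * k / fs) for k in range(n)]
partial_b = [0.5 * math.sin(2 * math.pi * 700.0 * k / fs) for k in range(n)]
s = [a + b for a, b in zip(partial_a, partial_b)]

# Once partial_a is tracked (here: known exactly), only partial_b remains
# in the residual that is passed on to the event detector.
r = residual(s, [partial_a])
max_err = max(abs(x - b) for x, b in zip(r, partial_b))
```

In a real system the tracked partials are only estimates, so the residual would also contain the tracking error in addition to the untracked partials and the noise.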

Event Detection

Our approach to event detection is based on the fact that homogeneous signals cause a characteristic phase pattern in the time-frequency plane (Solbach et al., 1996). The homogeneity condition is approximately fulfilled in the neighbourhood of sound onsets. To reduce the computing load for pattern detection, a single line of constant phase in the phase pattern, close to the amplitude maximum of the impulse response, is selected. Let $p(k)$ denote the sample of the selected phase line in the frequency band with index $k \in \{0, \ldots, N-1\}$ if a singularity occurs at $n = 0$. With $w_k(n)$ being the wavelet coefficient in band $k$ at sample $n$, consider the expression

$$ Y(n) = \left| \sum_{k=0}^{N-1} w_k(n + p(k)) \right| \qquad (1) $$

Selecting the $p(k)$ close to the maximum of the filter onset responses causes a prominent peak in $Y(n)$, since the phases sum up constructively along the phase line. In the presence of noise $Y(n)$ will exhibit spurious peaks which we wish to eliminate. In the following we present an algorithm for event localization in Gaussian white noise.

As our wavelet filters are energy normalized, we have for each scale of the wavelet transform, due to the Wiener-Lee theorem,

$$ E\{|w_k|^2\} = \sigma^2, \qquad (2) $$

where $\sigma^2$ is the variance of the white noise. If the noise is not only white but also Gaussian, and if the real and imaginary filter responses are uncorrelated (this condition holds for analytic filters), the modulus exhibits a Rayleigh distribution for which the expectation value is given by (Kammeyer, 1992)

$$ E\{|w_k|\} = \frac{\sqrt{\pi}}{2}\,\sigma. \qquad (3) $$

Thus for Gaussian white noise the expectation value of the modulus is the same constant for each scale. We make use of this fact for noise floor estimation by fitting a straight line given by

$$ g(k) = a + b \cdot k \qquad (4) $$

to the wavelet transform modulus along the selected phase line via least squares linear regression. As the ideal case $b = 0$ will hardly ever be observed, a tolerance must be granted for $b$ to deviate from this ideal. For the variance of the modulus $\sigma_{|w|}^2$ we have with eqs. 2 and 3

$$ \sigma_{|w|}^2 = E\{(|w_k| - E\{|w_k|\})^2\} = E\{|w_k|^2\} - E^2\{|w_k|\} = \left(\frac{4}{\pi} - 1\right) E^2\{|w_k|\} \approx \left(\frac{4}{\pi} - 1\right) m_1^2, \qquad (5) $$

where

$$ m_1 = \frac{1}{N} \sum_{k=0}^{N-1} |w_k(n + p(k))| \approx E\{|w_k|\}. \qquad (6) $$

Thus if

$$ |b| < r \cdot m_1 \qquad (7) $$

with some constant $r > 0$, we consider the signal as Gaussian white noise. In case this condition holds, $m_1$ is used as threshold for event detection. With threshold = 0 as initial value, the complete algorithm for event detection is as follows:

1. set $Z(n) := Y(n)$
2. if $Y(n) <$ threshold, set $Z(n) = 0$
3. estimate $b$ in eq. 4
4. if eq. 7 holds, set threshold $= m_1$ and $Z(n) = 0$
5. there is an event at sample $n_0$ if the following conditions hold:
   • $Z(n_0)$ is a maximum (maxima resulting from setting $Z(n + 1) = 0$ are ignored),
   • there was no event in $[n_0 - v, n_0)$.

The constant $v > 0$ should be chosen such that spurious peaks in the neighbourhood of an onset are ignored.

A problem with the above algorithm arises from the fact that $b$ would be close to zero not only for white noise but for all spectra that are symmetric with respect to the midmost frequency band. This problem could be solved by calculating further line fits for shifted band indices and requiring that eq. 7 be fulfilled for each of them.

1.2 The Partial Tracker Module

For each onset a PTM is created in which the adaptive gammatone tracking filter is instantiated as a one-to-one copy of the corresponding wavelet filter. The term "one-to-one" is to be taken literally. To avoid unnecessary tracking delay, not only the coefficients are copied but also the current state variables. Both energy normalization and the constant-Q property are maintained with each update of the filter coefficients.

Prediction and Frequency Estimation

A first order complex-valued least squares AR model is used to predict the next sample and the frequency localization of the tracked partial. With the estimates $\hat{\varphi}_0$ and $\hat{\varphi}_1$ of the zeroth and first autocorrelation coefficients, the pole of the AR model is at $z_0 = \hat{\varphi}_1 / \hat{\varphi}_0$. The predicted next sample $\hat{s}(k + 1)$ is

$$ \hat{s}(k + 1) = a_c\, z_0\, s(k), \qquad (8) $$

where $a_c$ is the inverse of the transfer function modulus of the gammatone filter at $f_0$, given by eq. 9 (of its right-hand side only the factor $k\,(n - 1)!$ is legible in the source). The frequency estimate is

$$ f(k) = \frac{f_s}{2\pi} \arg(z_0), \qquad (10) $$

where $f_s$ is the sampling rate. The window size employed for the estimation of the autocorrelation coefficients is continuously adapted to the time window size $\Delta t$ on a samplewise basis.

The amplitude of a complex-valued signal contaminated by white Gaussian noise is Rice distributed (Kammeyer, 1992). Under the influence of noise the estimated amplitude is biased towards higher values. In the extreme case of zero amplitude the bias is given by eq. 3, while for growing amplitudes it decreases continuously to zero. The Rice distribution is difficult to handle because it involves the modified Bessel function. The current implementation does not account for the amplitude bias introduced by the noise.

The Adaption Rule

As in (Wang, 1994) the center frequency $f_0(k)$ of the tracking filter is adapted by the gradient descent rule

$$ f_0(k + 1) = f_0(k) + g \cdot [f(k - \delta) - f_0(k - \delta)], \qquad (11) $$

where $f(k)$ is the estimated signal frequency, $g$ is the tracking gain and $\delta$ is the group delay of the tracking filter plus that of the frequency estimator, in samples. It can be shown that the adaption given by eq. 11 is stable if (Wang, 1994)

$$ 0 < g < 2 \sin\left(\frac{\pi}{2\,(1 + 2\delta)}\right). \qquad (12) $$

With $\delta = 0$ in eq. 12 we arrive at $0 < g < 2$. If the group delay is increased, the upper bound for $g$ is diminished. For $\delta \to \infty$ stable tracking is impossible. Evidently minimal group delay of the tracking filter is desirable for maximum adaption gain. This is why we choose $n = 3$ for the order of the gammatone filter, which in (Solbach et al., 1996) was shown to yield minimum group delay.

As the parameters are adapted samplewise, it is preferable not to evaluate a transcendental function for the calculation of $g$. As we have $\frac{2}{\pi} x < \sin(x)$ for $0 < x < \frac{\pi}{2}$, inserting $x = \frac{\pi}{2(1 + 2\delta)}$ shows that the bound in eq. 12 is not smaller than $\frac{2}{1 + 2\delta}$, so it is safe to set

$$ g = \frac{2}{1 + 2\delta}. \qquad (13) $$

2 Results

The parameters employed in our examples are Q = 0.1, q = 1 and a time constant of 20 ms (its symbol is illegible in the source). Twelve wavelet filters per octave are used. In Gaussian white noise the onset and offset of a time-limited sine of 1 kHz could be localized with an error of less than 0.4 ms, down to a signal-to-noise ratio (SNR) as low as 1 dB. The sampling rate was $f_s$ = 44.1 kHz.

In the more complex example shown in fig. 3 and fig. 4, a sound composed of 4 partials in 0 dB Gaussian white noise was analyzed. The following table shows the partial parameters:

start    end      freq.    SNR       onset   offset
0.05 s   0.30 s   500 Hz   13.7 dB   -13     +3
0.10 s   0.25 s   600 Hz   13.7 dB   -9      -6
0.15 s   0.20 s   400 Hz   3.2 dB    -6      -2
                  700 Hz   3.2 dB

The last two columns show the deviation of the estimated time from the actual event instant in samples ($f_s$ = 22.05 kHz). In the upper part of fig. 3 the residual sonogram is displayed, in the lower part we see the corresponding $Z(t)$, and in fig. 4 the estimated partial frequencies.

3 Conclusions and Future Work

An architecture for sound onset localization and partial tracking in Gaussian white noise was presented. The results given are a snapshot of our research in acoustic multicomponent signal separation. They indicate that the onset localization and partial tracking properties of our method are promising in terms of robustness against the superposition of noise and concurrent partials.

Much work remains to be done. The following items are part of our agenda for forthcoming work:

• The pre-knowledge requirement for the frequency starting point should be removed.
• The system should be able to automatically discern onsets from offsets.
• The flexibility of the noise model should be improved.
• If the partials originating from one onset are harmonic, the proper partial trackers should be tied together by coupled adaption. This will improve robustness against the disturbance of single partials.
• A partial tracker should be able to terminate itself if the partial fades gradually without offset.
• There should be an account for the noise-induced bias of the amplitude estimation.

References

[Kammeyer, 1992] K.D. Kammeyer. Nachrichtenübertragung. Teubner, Stuttgart, 1992.

[Nakatani et al., 1995] T. Nakatani, H.G. Okuno, and T. Kawabata. Residue-Driven Architecture for Computational Auditory Scene Analysis. In C.S. Mellish, editor, 14th International Joint Conference on Artificial Intelligence, pages 165-172. Morgan Kaufmann Publishers, 1995.

[Scheirer, 1996] E.D. Scheirer. Using musical knowledge to extract expressive performance information from audio recordings. In H. Okuno and D. Rosenthal, editors, Readings in Computational Auditory Scene Analysis. Erlbaum, 1996. In press.

[Solbach et al., 1996] L. Solbach, R. Wöhrmann, and J. Kliewer. The complex-valued continuous wavelet transform as a preprocessor for auditory scene analysis. In H. Okuno and D. Rosenthal, editors, Readings in Computational Auditory Scene Analysis. Erlbaum, 1996. In press, preprint available via ftp://ftp.ti6.tu-harburg.de/pub/paper.

[Wang, 1994] A. Wang. Instantaneous and Frequency Warped Signal Processing Techniques for Auditory Source Separation. PhD thesis, Stanford University, 1994.
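As a supplementary illustration of the noise test used for event detection (the Rayleigh expectation of the modulus, the least squares line fit over the band index, and the slope criterion $|b| < r \cdot m_1$), the following Python sketch may be helpful. It is a sketch under stated assumptions, not the authors' implementation: the wavelet moduli are replaced by synthetic complex Gaussian samples, and the function names, the number of bands (72), the tolerance r = 0.02 and the value of sigma are hypothetical.

```python
import math
import random

def fit_line(values):
    """Least squares fit of g(k) = a + b*k to values[k], k = 0..N-1."""
    n = len(values)
    k_mean = (n - 1) / 2.0
    v_mean = sum(values) / n
    sxx = sum((k - k_mean) ** 2 for k in range(n))
    sxy = sum((k - k_mean) * (v - v_mean) for k, v in enumerate(values))
    b = sxy / sxx
    a = v_mean - b * k_mean
    return a, b

def looks_like_white_noise(moduli, r):
    """Slope test |b| < r * m1; returns (decision, m1, b)."""
    m1 = sum(moduli) / len(moduli)  # mean modulus along the phase line
    _, b = fit_line(moduli)
    return abs(b) < r * m1, m1, b

def noise_moduli(num, sigma, rng):
    """Moduli of complex Gaussian samples with E{|w|^2} = sigma^2."""
    s = sigma / math.sqrt(2.0)  # std of the real and of the imaginary part
    return [abs(complex(rng.gauss(0.0, s), rng.gauss(0.0, s))) for _ in range(num)]

sigma = 2.0  # assumed noise standard deviation

# Empirical check of the Rayleigh expectation sqrt(pi)/2 * sigma.
samples = noise_moduli(200000, sigma, random.Random(1996))
empirical_mean = sum(samples) / len(samples)
rayleigh_mean = math.sqrt(math.pi) / 2.0 * sigma

# Slope test on 72 synthetic "bands" of pure noise and on a strongly tilted spectrum.
noise_decision, m1, b = looks_like_white_noise(noise_moduli(72, sigma, random.Random(7)), 0.02)
ramp_decision = looks_like_white_noise([0.1 * k for k in range(72)], 0.02)[0]
```

Note that the slope is measured per band index, so a suitable tolerance r depends on the number of bands; the value used here was picked only to separate the two synthetic cases.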

Figure 3: residual sonogram (frequency in Hz, in 600.0 cent steps, versus time in seconds) and Z(t) (versus time in seconds)

Figure 4: partial traces (horizontal axis: time [s])
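The frequency estimation and adaption steps of the partial tracker can be sketched in Python as follows. The sketch estimates the frequency from the angle of the first order AR pole, $z_0 = \hat{\varphi}_1 / \hat{\varphi}_0$, and then iterates the adaption rule $f_0(k+1) = f_0(k) + g\,[f(k-\delta) - f_0(k-\delta)]$ with the gain $g = 2/(1+2\delta)$. Several simplifications are assumptions of this sketch and not part of the paper: the gammatone tracking filter is omitted, the signal is a noise-free complex exponential, the autocorrelation window is fixed, the frequency estimate is held constant during adaption, and all function names and numeric values are hypothetical.

```python
import cmath
import math

def ar1_frequency(x, fs):
    """Frequency estimate fs/(2*pi) * arg(z0) from the first order AR pole z0 = phi1/phi0."""
    phi0 = sum(abs(v) ** 2 for v in x[:-1])                          # lag-0 autocorrelation estimate
    phi1 = sum(x[k + 1] * x[k].conjugate() for k in range(len(x) - 1))  # lag-1 estimate
    z0 = phi1 / phi0
    return fs / (2.0 * math.pi) * cmath.phase(z0)

def safe_gain(delta):
    """Tracking gain g = 2 / (1 + 2*delta), which avoids evaluating a sine."""
    return 2.0 / (1.0 + 2.0 * delta)

def gain_bound(delta):
    """Upper stability bound 2*sin(pi / (2*(1 + 2*delta))) for the tracking gain."""
    return 2.0 * math.sin(math.pi / (2.0 * (1.0 + 2.0 * delta)))

def adapt(f_est, f0_start, delta, steps):
    """Iterate f0(k+1) = f0(k) + g*[f(k-delta) - f0(k-delta)] for a constant estimate f_est."""
    g = safe_gain(delta)
    f0 = [f0_start] * (delta + 1)  # history so that f0(k - delta) is defined from the start
    for _ in range(steps):
        f0.append(f0[-1] + g * (f_est - f0[-1 - delta]))
    return f0

fs = 22050.0  # sampling rate in Hz
x = [cmath.exp(2j * math.pi * 500.0 * k / fs) for k in range(64)]  # 500 Hz test partial
f_est = ar1_frequency(x, fs)
trace = adapt(f_est, f0_start=450.0, delta=3, steps=200)
```

With a group delay of 3 samples the gain is 2/7, which lies below the stability bound, so the center frequency in `trace` settles at the estimated frequency after an initial transient.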