Cancellation of Unwanted Audio to Support Interactive Computer Music

Jonghyun Lee†, Roger B. Dannenberg‡, and Joohwan Chun†

†Scientific Computing Laboratory, Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, 373-1 Kusong-Dong, Yusong-Gu, Taejeon, 305-701, South Korea
jhlee@sclab.kaist.ac.kr, chun@sclab.kaist.ac.kr

‡School of Computer Science, Carnegie Mellon University
dannenberg@cs.cmu.edu

Abstract

A real-time unwanted-audio cancellation system is developed. The system enhances recorded sound by canceling unwanted loudspeaker sounds picked up during the recording. After cancellation, the resulting sound gives an improved estimate of the live performer's sound. The cancellation works by estimating the unwanted audio signal and subtracting it from the recorded signal. The canceller is composed of a delay block and two adaptive digital filters. Our work extends conventional echo-cancellation methods to address problems we encountered in music applications. We describe a real-time implementation in Aura and present experimental results in which the proposed canceller enhances the performance of a real-time pitch detector. The cancellation ratio is measured and limitations of the system are discussed.

1 Introduction

In interactive computer music performances, a live musician often uses a microphone to capture an acoustic performance, but the microphone also picks up sounds from the computer. To enhance the quality of the recorded sound and to reduce the capture of other sounds, we usually use a good directional microphone and place it as close as possible to the player, but some computer sound is still captured. This unwanted sound can interfere with the signal analysis and signal processing typically used by interactive music systems. Signal processing systems can use various methods to estimate unwanted signals and subtract them from the recorded sound.
One approach synthesizes an inverted signal to cancel the unwanted signal acoustically (Kuo and Morgan 1999). In this paper, we describe an application of cancellation to enhance recording quality. We have created a real-time implementation of the system using moderate computational power. An important difference between this and previous work is that our goal is to enhance interactive music performance systems. The often-independent nature of computer-generated "unwanted" sounds and the sparseness of musical spectra can make it very difficult to estimate the system characteristics. This has led us to develop new techniques, which we describe below. For this work, we assume interactive music performance systems with a single microphone and a single loudspeaker, although extension to more channels should be possible. The microphone captures the sound of one or more instruments. Simultaneously, sounds generated by a computer are played through a loudspeaker. In these situations, the microphone often picks up unwanted sound from the loudspeaker. This unwanted sound may degrade the computer analysis of the acoustic instrument. We want to enhance the recorded signal quality by canceling the unwanted sound. Note that the computer-generated sound is available in digital form, so the real problem is to estimate how this sound is transformed on the path to the loudspeaker, through the room, to the microphone, and back to the computer. If we can estimate this entire channel, then we can digitally simulate the effects of the channel on the computer-generated signal and subtract the result from the signal obtained from the microphone. At first glance, one might guess that the computer-generated signal can simply be delayed, attenuated, and subtracted from the microphone signal to accomplish our goal. Unfortunately, the frequency-dependent behavior of the loudspeaker and microphone has a large effect on the signal.
Thus, the system must estimate the overall response of the entire signal path, or channel, to achieve cancellation. To make matters even worse, the channel is not fixed. As the performer moves, the channel characteristics change. For example, the performer might move into the direct path from loudspeaker to microphone, or the performer might accidentally move the microphone.

Our cancellation system is implemented as a signal processing component of Aura (Dannenberg and Brandt 1996). Using pre-recorded sound files to simulate both performers and computer-generated signals, we can experiment with different configurations and listen to the results. One important application of this work is to isolate the wanted audio signal for analysis. We experimented with the cancellation system as a front-end to a pitch estimation module and show that pitch estimation is enhanced by canceling unwanted sounds.

Proceedings ICMC 2004

2 Related Work

These are not entirely new problems, and adaptive systems already exist for active noise cancellation (Kuo and Morgan 1999) and echo cancellation (Haykin 1969). However, in an active noise canceller, the noise is continuous and spectrally stable, and in an echo canceller, the echo signal is fairly deterministic. For example, in telephony, the echo comes from relatively stable electrical circuits rather than changing acoustic environments. In telephony, we believe the channel varies more slowly than in a live music performance. Also, telephony uses a training signal to analyze the circuit before talking begins. In our application, the wanted signal can interfere with our estimation of the unwanted signal, and we do not know when the wanted signal starts or stops. Also, when the unwanted signal is absent or weak, it is impossible to estimate the channel or to estimate the canceling signal. Therefore, we develop a method to evaluate when the adaptation leads to improvement. When there is no improvement, the channel estimate is not updated. In the field of telephony, this is called the "double-talk problem" (Benesty et al. 2000; Kuo and Pan 1993). Usually, only one speaker talks at a time, and the double-talk situation is often detected by comparing the incoming signal (including echo) to a threshold. Adaptation is stopped when double talk is detected. With live music, the "double-talk" situation is the norm rather than the exception.

3 Cancellation System

Figure 1 offers an overview of our cancellation system. x(t) is an accompaniment signal played from the loudspeaker. A musician is playing near a microphone, producing the waveform y(t). The sound of the accompaniment and the target instrument are recorded together, so the recorded waveform is the sum of the two sounds. Because of the effect of the acoustic channel, the recorded sound is not exactly identical to x(t) and y(t), so we define the recorded sound as z(t) = x'(t) + y'(t), where x'(t) and y'(t) are distorted versions of x(t) and y(t), respectively. We want to get only the target sound by canceling the sound of the loudspeaker. The cancellation system inputs are the sound sent to the loudspeaker and the sound received from the microphone. The output is the cancelled sound, which is written as y"(t).

Figure 1: Overview of the sound canceller system.

The recorded sound x'(t) is not a simple time-delayed copy of x(t) because of many effects, including sound reflection and diffraction in the acoustic environment. Other effects are the transfer characteristics of the loudspeaker, microphone, amplifier, and recording equipment, including quantization errors in the A/D and D/A converters, the nonlinearity of the amplifier, and the frequency characteristics of the loudspeaker and microphone. All these effects are referred to collectively as the acoustic channel. When the player or microphone moves, the channel varies. Therefore the recorded sound, which is now a discrete signal indexed by n, is given mathematically as

z(n) = x'(n) + y'(n)    (1)
     = h(n) * x(n) + y'(n) + n_g(n) + n_q(n) + n_n(n)    (2)

where n_g(n) is background noise, which we assume to be white Gaussian; n_q(n) is quantization noise due to A/D conversion; and n_n(n) is the sum of unknown noises due to nonlinearities. n is the time index and * is the convolution operator.
The probability density function (PDF) of n_q(n) is given as

p(n_q) = 1/Q for -Q/2 ≤ n_q ≤ Q/2, and 0 elsewhere    (3)

where Q is the quantization step size determined by the number of bits w: Q = 2^-(w-1).

If we know the channel h(n) exactly, we can estimate the sound x'(n). When the target sound y'(n) does not exist, the problem is similar to the system identification problem shown in Figure 2. The delay block in Figure 2 compensates for the delay of the D/A system, the acoustic delay, and the capture delay of the A/D system. If the digital filter W(z) is long enough, the delay block is not required, but a long filter is undesirable because it requires additional computation and memory. We use Phase Transform (PHAT) delay estimation to estimate this time delay.
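To make the PHAT delay estimator concrete, here is a minimal NumPy sketch of PHAT-weighted cross-correlation. The function name, the zero-padding length, and the restriction to non-negative lags are our choices for illustration, not details taken from the paper:

```python
import numpy as np

def gcc_phat_delay(d, x_prime):
    """Estimate how many samples x_prime lags behind the reference d
    using PHAT-weighted generalized cross-correlation."""
    n = len(d) + len(x_prime)               # zero-pad to avoid circular wrap-around
    D = np.fft.rfft(d, n=n)                 # real FFT of the reference
    X = np.fft.rfft(x_prime, n=n)           # real FFT of the observation
    cross = X * np.conj(D)                  # cross-spectrum
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep only the phase
    r = np.fft.irfft(cross, n=n)            # generalized cross-correlation
    return int(np.argmax(r[: n // 2]))      # peak location = delay (non-negative lags)
```

Because the PHAT weighting discards magnitude and keeps only phase, the correlation peak stays sharp even when the source spectrum is far from white, which is why it suits musical signals better than plain cross-correlation.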

Figure 2: System identification block diagram.

PHAT (Knapp and Carter 1976; Ianniello 1982; Carter 1987) requires two real fast Fourier transforms (FFTs) and one complex inverse FFT:

τ_PHAT = arg max_τ R_x'd(τ)    (4)

R_x'd(τ) = Σ_ω [X'(ω)D*(ω) / |X'(ω)D*(ω)|] e^(jωτ)    (5)
         = FFT⁻¹[X'(ω)D*(ω) / |X'(ω)D*(ω)|]    (6)

where X'(ω) and D(ω) are the FFTs of x'(n) and d(n), respectively, and d(n) is the delayed version of x(n).

After estimating the delay, we must estimate the digital filter. The objective of the adaptive filter is to minimize a residual error signal e(n). We want e(n) = 0 after the adaptive filter W(z) converges. The digital filter W(z) can be estimated using the LMS or RLS algorithm (Haykin 1969). In our system, we use the LMS adaptation algorithm because the computation is feasible in real time. The residual signal is expressed as

e(n) = x'(n) - w^T(n) d(n)    (7)

where n is the time index, and w(n) = [w_1(n) w_2(n) ... w_L(n)]^T and d(n) = [d(n) d(n-1) ... d(n-L+1)]^T are the coefficient and signal vectors, respectively; T denotes transposition and L is the filter order. The filter W(z) must be of sufficient order to accurately model the impulse response of the acoustic channel. Assuming a mean square cost function ξ(n) = E[e²(n)], the adaptive filter minimizes the instantaneous squared error ξ̂(n) = e²(n). Using the steepest descent algorithm, we update the coefficient vector in the direction of the negative gradient with step size μ:

w(n+1) = w(n) - (μ/2) ∇ξ̂(n)    (8)

where ∇ξ̂(n) is an instantaneous estimate of the mean squared error (MSE) gradient at time n and is expressed as

∇ξ̂(n) = ∇e²(n) = 2 [∇e(n)] e(n)    (9)
       = -2 d(n) e(n)    (10)

Therefore we have the LMS adaptation

w(n+1) = w(n) + μ d(n) e(n)    (11)

The performance of the cancellation system can be determined by a frequency-domain analysis of the residual error signal e(n). The auto power spectrum of e(n) is (Kuo and Morgan 1999)

S_ee(ω) = [1 - C_dx'(ω)] S_x'x'(ω)    (12)

where C_dx'(ω) is the magnitude-squared coherence function between d(n) and x'(n), and S_x'x'(ω) is the auto power spectrum of x'(n). The magnitude-squared coherence function is defined as

C_dx'(ω) = |S_dx'(ω)|² / [S_dd(ω) S_x'x'(ω)]    (13)

If x'(n) and d(n) are perfectly correlated, as when x'(n) = h(n) * d(n), then C_dx'(ω) is 1 and the power of the error S_ee(ω) is 0. This equation indicates that the performance of the canceller depends on the coherence, which is a measure of the noise and the relative linearity of the two processes d(n) and x'(n). To check the influence of the noise terms, we can replace x'(n) with x'(n) = x̄(n) + n(n), where x̄(n) = h(n) * x(n) and n(n) = n_g(n) + n_q(n) + n_n(n). Assuming perfect knowledge of h(n), C_dx'(ω) is given by

C_dx'(ω) = |S_dx'(ω)|² / [S_dd(ω) S_x'x'(ω)]    (14)
         = |S_dx̄(ω) + S_dn(ω)|² / [S_dd(ω) (S_nn(ω) + S_x̄x̄(ω))]    (15)
         = S_dd(ω) S_x̄x̄(ω) / [S_dd(ω) S_x̄x̄(ω) + S_nn(ω) S_dd(ω)]    (16)
         = 1 / [1 + S_nn(ω)/S_x̄x̄(ω)]    (17)

and the power spectrum of the error signal is given by

S_ee(ω) = [1 - C_dx'(ω)] S_zz(ω)    (18)
S_zz(ω) = S_nn(ω) + S_x̄x̄(ω)    (19)

According to these equations, S_ee(ω) reduces to S_nn(ω); for example, if S_x̄x̄(ω) = 10 × S_nn(ω), the theoretical limit of the canceller is a factor of 11, or about 10.4 dB.

After implementing the system as described thus far, we found that the filter adaptation sometimes performs poorly. The adaptation algorithm assumes that the target sound does not exist and that the unwanted sound is white Gaussian. Since these assumptions are not ordinarily true, the adaptation does not always converge to a good estimate of the channel. For example, when the computer-generated sound is very quiet or silent, it is very unlikely that any changes to the filter will make improvements. Therefore, we developed an extended adaptive method, shown by the block diagram in Figure 3.
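Before moving to the extended method, the basic LMS loop of Eqs. (7)-(11) can be sketched in a few lines of NumPy. This is an illustration of the standard update with our own function name and toy parameters; the real system runs at audio rate inside Aura with a much longer filter:

```python
import numpy as np

def lms_canceller(d, x_obs, order=32, mu=0.005):
    """LMS adaptive filter sketch.

    d      -- delay-compensated reference (the known unwanted source)
    x_obs  -- observed unwanted signal x'(n) picked up by the microphone
    Returns the residual e(n) and the final weight vector."""
    w = np.zeros(order)
    e = np.zeros(len(x_obs))
    for n in range(order - 1, len(x_obs)):
        d_vec = d[n - order + 1 : n + 1][::-1]   # [d(n), d(n-1), ..., d(n-L+1)]
        e[n] = x_obs[n] - w @ d_vec              # Eq. (7): residual error
        w = w + mu * d_vec * e[n]                # Eq. (11): w(n+1) = w(n) + mu d(n) e(n)
    return e, w
```

As a rule of thumb, the step size must satisfy roughly mu < 2 / (L × input power) for stable convergence; larger mu adapts faster but with more gradient noise.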

Figure 3: Block diagram of the alternative canceller.

In this block diagram the cancellation system has two digital filters: an adaptive filter and a fixed digital filter. The coefficients of the adaptive filter are updated rapidly using incoming samples, whereas the coefficients of the fixed digital filter are updated (or not) only at decision points. Between these decision points, we accumulate the error power of each filter and decide which filter has the better performance. At the end of the decision window (at the decision point), if the adaptive filter performs better, we copy its coefficients to the fixed filter.

4 Implementation

Our ultimate goal is to develop a real-time cancellation system for use in interactive performance. Toward this goal, we implemented the cancellation system as an Aura component and configured a test system using Aura. Aura is a software environment for real-time audio processing; it includes various audio and video signal processing blocks (Dannenberg and Brandt 1996). Using Aura, we implemented the block diagram shown in Figure 4. Due to limitations in computation power, and also to the difficulty of estimating the high-frequency behavior of the channel, we run the cancellation system at 1/4 of the 44,100 Hz sample rate (11,025 Hz) used elsewhere in the system. We put two downsampling blocks and one upsampling block at the ports of the cancellation system. The number of taps (weights) in the canceller is 500, modeling 45.5 ms of the channel's impulse response. The CPU load is 11% using a 2.4 GHz Pentium 4 (running Red Hat Linux). The computation for the cancellation algorithm grows as O(n²), where n is the number of filter taps: if the sampling frequency is doubled and the time duration of the digital filter is kept the same, the number of taps doubles and the number of samples to process per second also doubles, so 4 times the computation power is required. Notice that latency is independent of CPU load and filter delay.
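The decision-window logic of the alternative canceller above can be sketched as follows. The class name, window length, and the choice to return the fixed filter's output are our assumptions for illustration; the paper does not specify these details:

```python
import numpy as np

class DualFilterController:
    """Sketch of the adaptive/fixed filter pair with decision windows
    (hypothetical structure; parameters are not from the paper)."""

    def __init__(self, order=16, mu=0.01, window=500):
        self.w_adapt = np.zeros(order)   # rapidly updated adaptive filter
        self.w_fixed = np.zeros(order)   # filter actually used for cancellation
        self.mu = mu
        self.window = window             # samples per decision window
        self.err_adapt = 0.0             # accumulated error power, adaptive
        self.err_fixed = 0.0             # accumulated error power, fixed
        self.count = 0

    def process(self, d_vec, z_n):
        """d_vec = [d(n), ..., d(n-L+1)]; z_n = microphone sample.
        Returns the fixed filter's residual, used as the cancelled output."""
        e_a = z_n - self.w_adapt @ d_vec
        e_f = z_n - self.w_fixed @ d_vec
        self.w_adapt += self.mu * d_vec * e_a      # LMS update, adaptive only
        self.err_adapt += e_a * e_a
        self.err_fixed += e_f * e_f
        self.count += 1
        if self.count == self.window:              # decision point
            if self.err_adapt < self.err_fixed:    # adaptive did better:
                self.w_fixed[:] = self.w_adapt     # promote its coefficients
            self.err_adapt = self.err_fixed = 0.0
            self.count = 0
        return e_f
```

The key property is that the coefficients actually used for cancellation only change when a full window of evidence shows the adaptive filter doing better, so a burst of bad adaptation (e.g., during silence in the unwanted signal) never degrades the output.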
In fact, since the computer "knows" the source of the unwanted sound before it even reaches the D/A converters, the cancellation system can estimate the unwanted sound long before it reaches the microphone. The unwanted sound estimate can then be subtracted from incoming samples as soon as they arrive. The only additional delay is due to the downsampling and upsampling filters (see Figure 4). Of course, the computer audio system adds some buffering and therefore latency, but this system latency is not increased by the cancellation processing.

Figure 4: Canceller installed in the Aura system.

For testing, we recorded a stereo waveform in which one channel plays the role of the "computer generated" sound and the other is the "live performer" sound. We play this over two loudspeakers, and place a microphone near the "live performer" loudspeaker to simulate a live performance. This makes the performance repeatable, allowing more controlled experiments. The unwanted sound and the live performance sound are captured and sent to the canceller, which computes the output y"(n). The recorded sound and y"(n) are stored in a sound file in real time. To provide one objective measurement of system performance, we connected a pitch estimation module to the output of the cancellation system. (Even though technically incorrect, we use "pitch" rather than "fundamental frequency" because the term is shorter and we feel the meaning is clear in this context.) Our hypothesis is that by

rejecting unwanted sounds from the recording process, we can improve the performance of feature detection such as pitch estimation. The particular pitch estimator uses a very simple time-domain algorithm that relies upon the general shape of trumpet waveforms, which have one pronounced peak per period. The pitch estimation algorithm generates reports only when consistent consecutive periods are detected. Since the algorithm rarely makes mistakes, a good measure of quality is to count the total number of reported pitch estimates. We perform pitch estimation directly on the recorded sound and also on the output of the cancellation system.

5 Experimental Results

The first interesting result is the acoustic channel characteristic, but we cannot measure this directly. We only know the weights of the digital filter, and we estimate the acoustic channel from the weights. The filter coefficients are highly dependent upon the properties of the source, the frequency response of the loudspeaker and microphone, and other nonlinear effects. Even in the same acoustic channel the weights depend upon the sound sources, and the weights approximate the convolution of the microphone, speaker, and channel responses. An example of the weights, w(n), is shown in Figure 5. The top part plots all weights, and the bottom shows only the first 9 ms of the weights. Although it is hard to determine precisely, the duration of the channel appears to be about 35 ms. Assuming that this is an accurate estimate of the impulse response of the channel, it is clear that a simple delay with attenuation would not be a good model of the channel.

Figure 5: An example of the filter weights.

We show two types of evaluation results: cancellation performance and pitch estimation performance. Due to the "double-talk" problem mentioned earlier, we do not consider the conventional adaptive filter approach to be suitable for our musical examples, so all of our results are from the complete system (Figure 3) combining an adaptive and a fixed filter.

To evaluate the cancellation performance, we set y(n) = y'(n) = 0 because we do not have any method to obtain the waveform y'(n) exactly. Ideally, y"(n) should then also be zero, but since cancellation is imperfect, y"(n) will be non-zero. We define the cancellation ratio as

CR = E{x'²(n)} / E{y"²(n)}    (20)

Intuitively, CR is the amount by which the unwanted signal is suppressed (higher is better). The results are shown in Table 1. Music A is smooth pop music and Music B is hip-hop music with strong percussion sounds. In these tests, CR varies from about 9 to 18 dB. Two versions of Music A were tried, one at a sample rate of 11.025 kHz and one at 44.1 kHz. The CR is better for the 11.025 kHz version because the canceller operates at 11.025 kHz, and therefore high frequencies in the 44.1 kHz version are not cancelled. We show an example of x'(n) and the canceller output y"(n) in Figure 6.

                       E{x'²(n)}      E{y"²(n)}      CR (dB)
Music A (44.1 kHz)     2.13 × 10⁻³    8.46 × 10⁻⁵    14.01
Music A (11.025 kHz)   2.02 × 10⁻³    2.81 × 10⁻⁵    18.57
Music B (11.025 kHz)   3.71 × 10⁻⁴    4.27 × 10⁻⁵    9.39

Table 1: Cancellation performance.

Next, we show test results using target sounds. In Figure 7, we show the recorded waveform, the cancelled waveform, and the target sound y'(n). Here, the "unwanted sound" is a computer-generated music accompaniment containing synthesized piano, bass, and drums playing Gershwin's jazz standard "Summertime." The "wanted" sound is a recording of an acoustic trumpet. As explained earlier, these signals exist as the left and right channels of a stereo recording. Headphones were used in the original recording process to obtain nearly perfect isolation of the trumpet sound. The channels are played simultaneously (see Figure 4) to test the canceller, simulating a live performance of computer and trumpet. To obtain the "true" value of y'(n) (the trumpet), we can simply run the test again while muting the unwanted channel. Notice that you can still see (and hear) accompaniment sounds in the cancelled sound, but they are much attenuated. For the first 7 seconds of Figure 7 (middle), the system was estimating the delay; after that time the cancellation algorithm is running.

Listening to the cancelled version of the sound reveals substantial artifacts as the filter coefficients change. As it stands now, this system would not be suitable for reducing cross-talk in a recording for human listening; however, machine listening (or feature extraction) is another interesting possibility.

Figure 6: Waveforms without wanted sound (recorded sound and cancelled sound).

Figure 7: Waveforms with wanted sound (original sound, cancelled sound, and trumpet sound).

This possibility also allows for a more objective evaluation. One useful feature is pitch, and we use performance on pitch estimation to measure the effect of cancellation. The pitch estimation module uses a very simple algorithm that looks for well-defined pitch periods based on equally-spaced threshold crossings. When a pitch period is detected, the pitch and time are logged to a file. The pitch detection performance with and without the canceller is shown in Figure 8. The top graph is obtained from the waveform recorded with no unwanted sound, and it represents performance with perfect cancellation. The other two graphs represent pitch estimation with unwanted sounds added, with and without cancellation. We use the top graph (no unwanted sound) as a reference to classify points in the other graphs as correct (near a reference point) or incorrect (differing in time by more than 50 ms and/or in pitch by more than 1 Hz from any reference point).

Figure 8: Pitch detection results (top: no unwanted sound; middle: with canceller; bottom: without canceller).

Table 2 shows the number of detected pitches in all three conditions. There are almost 80% more correct pitch estimates using the cancellation system, although the number of incorrect estimates rose from almost none to about 2.5 percent. The canceller apparently removes enough interference to allow the pitch estimation algorithm to detect periodicity at many more places. Note that more sophisticated fundamental frequency estimators might not be so sensitive to interference, so results with other algorithms might vary quite a bit. We believe this test suggests that the reduction of unwanted sounds might improve audio feature detection. Demonstrating this with a real feature detector in a real performance is left to future work.

                               number of        number of
                               correct points   wrong points
Instrument only                4829             —
Instrument with canceller      1536             39
Instrument without canceller   859              1

Table 2: Pitch detection performance.

6 Conclusions

In this paper, a real-time unwanted-audio cancellation system is described. The system can be used to enhance the quality of the recorded sound and to improve pitch estimation of a soloist in the presence of computer-generated sound.

The proposed system estimates the unwanted audio sounds and subtracts them from the recorded sound. Although we know the source waveform of the unwanted audio, the unwanted audio component in the recorded sound is delayed and distorted according to the acoustic channel. We combine a pure delay with a digital filter to compensate for the effect of the acoustic channel, and we describe methods for estimating the delay and adapting the filter weights. To increase robustness and performance, we developed an alternative adaptation method that avoids making changes that would decrease performance. We implemented the unwanted-sound cancellation system in Aura, a real-time platform for interactive computer music. To demonstrate the cancellation system's performance, we show the recorded waveforms and the cancellation ratio (CR). CR depends on the error, especially quantization error. To further evaluate the system, we assembled a pitch estimation application and showed that pitch estimates can be improved when cancellation is applied. (We will play sound examples at the conference.)

In conclusion, unwanted sound cancellation can be applied in real time to improve the performance of an interactive computer music system. We learned that classical echo cancellation techniques alone are not suitable for this application because of the sparse nature of musical spectra combined with the possibility that the computer-generated sound can contain silence. Our final system combines three components: a channel delay estimator, an adaptive filter, and a controller that ensures that adaptation actually improves performance. All of this runs in real time in software on a single-CPU personal computer. The time-varying nature of the adaptive filter produces artifacts that might be considered more objectionable than the original unwanted sound, so the system should not be used to "clean up" recordings for human listeners. However, the reduction of unwanted noise may be very helpful for various sound analysis tasks.
We demonstrated how unwanted sound cancellation can improve the performance of a simple pitch estimation system.

7 Acknowledgements

This work was performed mainly at Carnegie Mellon University by the first author, supported by the Brain Korea 21 Project, School of Information Technology, KAIST, in 2004 and by the Ministry of Science and Technology, managed by MICROS and KOSEF (R01-2003-000-10829-0). Additional support (for the second author) came from the National Science Foundation, grant IIS-0085945, and from the Computer Science Department at Carnegie Mellon. This work was inspired by discussions at the Connecticut College Symposium on Art and Technology in 2003 and with Allen Heidorn.

References

Benesty, J., D. R. Morgan, and J. H. Cho (2000). A new class of double-talk detectors based on cross-correlation. IEEE Transactions on Speech and Audio Processing 8, 168-172.

Carter, G. C. (1987). Coherence and time delay estimation. Proceedings of the IEEE 75, 236-255.

Dannenberg, R. B. and E. Brandt (1996). A flexible real-time software synthesis system. In Proceedings of the International Computer Music Conference, pp. 270-273. International Computer Music Association.

Haykin, S. (1969). Adaptive Filter Theory (3rd ed.). Prentice-Hall.

Ianniello, J. P. (1982). Time-delay estimation via cross-correlation in the presence of large estimation errors. IEEE Transactions on Acoustics, Speech and Signal Processing 30, 998-1003.

Knapp, C. H. and G. C. Carter (1976). The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech and Signal Processing 24(3), 320-327.

Kuo, S. M. and D. R. Morgan (1999). Active noise control: a tutorial review. Proceedings of the IEEE 87(6), 943-975.

Kuo, S. M. and Z. Pan (1993). Distributed acoustic echo cancellation system with double-talk detector. Journal of the Acoustical Society of America 94(9), 3057-3060.