Page  288 ï~~A TWO-STAGE AUTOMATIC ADAPTIVE PROCESS TO REMOVE NOISE FROM AN AUDIO SIGNAL Jonathan Berger Ronald R. Coifnan The Center for Studies in Music Technology Department of Mathematics Yale University Yale University New Haven, CT 06520 New Haven, CT 06520 jberger@alice.music.yale.edu coifmanjules.math.yale.edu Maxim J. Goldberg Physical Sciences Department York College of PA York, PA 17405 mgoldberg@yorkcol.edu ABSTRACT In this paper we present a two-stage algorithm for removing noise from music. The first stage is a denoising procedure described before in (Berger,J., Coifman,R., and Goldberg,M. 1994-a). The second stage, our recent work, uses two signals: the original source, and the model of the noise obtained after the first stage. The spectra of the two signals are expanded into two dyadic trees of smooth local sine bases, and the best basis for the source signal is extracted using a relative entropy function. The result is more pleasing musically than using the first stage alone; in particular, almost no numerical artifact noises remain. 1. INTRODUCTION This paper extends our previous work on denoising the same recording, which was presented at ICMC '94 (Berger,J., Coifnan,R., and Goldberg,M. 1994-a) and which is fully described in (Berger,J., Coifnan,R., and Goldberg,M. 1994-b). Our earlier algorithm, which comprises the first stage of the two-stage process described below, is based on the ideas of R. Coifman and V. Wickerhauser on entropy-based algorithms (Coifinan,R. and Wickerhauser,V.), and of R. Coifman and F. Majid on the denoising process (Coifman,R. and Majid,F.). We have adopted some of the ideas in (Saito,N.), see also (Saito,N. and Coifman,R.). The denoising experiments described have been run with software utilizing the Adapted Waveform Analysis Library, (Wickerhauser,V.). The paper is organized as follows. We first describe our two-stage denoising process. We next describe the process applied to a particular sample and discuss the quality of the resulting cleaned signal. After a discussion of future research possibilities, we conclude with a summary. 2. THE TWO-STAGE ALGORITHM The two-stage algorithm works as follows. Stage one consists of a procedure in which Shannon entropy is used to select the best basis from a dyadic tree of local sine bases constructed on a windowed signal's spectrum. The largest coefficients give rise to the coherent signal, and the remaining coefficients are deemed noise. At this point, we preserve the noise portion for use in stage two of the algorithm. 288 8IC MC P R OC EE D I N G S 1995

Page  289 ï~~In the second stage, we use the (approximate) model of noise obtained from Stage I to reprocess the original signal in parallel with the noise signal to permit the use of a more refined cost function. The resulting clean signal has less noise, and is more musically appealing, than that obtained by using Stage 1 alone. The process is detailed below. 2.1. First Stage A signal is first segmented into windows. Each window is then transformed by a local sine transform to obtain a window of frequencies; the latter is then expanded into the local sine bases of Coifinan and Meyer, see (Auscher,P., Weiss,G., and Wickerhauser,V.). The local sine basis elements are windowed sine expansions with the added benefit of being formed smoothly through a process of projection and parity folding, rather than the traditional localization by characteristic functions. The frequency window's expansion on its whole length forms the root node of a dyadic tree. The children of any node in that tree are formed by splitting the segment corresponding to their parent in half (in the smooth way described above), and then doing a sine expansion on each half, with the expansions from the two halves being associated to the two children nodes, respectively. The resulting dyadic tree contains many bases for the (frequency) window: any selection of nodes in the tree such that their "projection" covers the window without overlapping gives an orthonormal basis. It is thus necessary to select a basis, which is done with the view of making the denoising task easier. We use Shannon entropy as a cost function, which is highest if the coefficients in a tree node have the same magnitude, and is lowest if only one coefficients is non-zero. Thus, Shannon entropy gives a measure of how spread out a vector is in the basis used to represent it. The cost of each node in the local sine tree is compared to the sum of the costs of its children: the node is chosen if its cost is less than the sum, the children are chosen otherwise. The tree is traversed recursively to choose nodes that will comprise a full basis for the (frequency) window to which the tree corresponds. The resulting basis is the one that has most concisely expressed the window's frequencies using local sine bases and thus should give the most efficient representation of the musical content of the window. It is reasonable to hope that choosing the largest coefficients in the constructed basis will result in the separation of noise from music: the musical content should have prominent localized frequency coefficients, while the noise will have coefficients more or less equally distributed and of lesser magnitude. The choice of how many coefficients to keep is done by ordering the coefficients of the chosen basis by magnitude, and successively computing the entropy of the remaining ones as the leading coefficients are removed one by one. Once the entropy of the tail has exceeded an empirically determined threshold, the tail is then re-expanded back to the time side and is deemed a noisy signal, while the leading coefficients, upon reexpansion, form the clean, or coherent, signal. The process can be improved by averaging in both the time and frequency domains to remove numerical artifacts arising from window segmentation and the artificial dyadic segmentation of frequency space. 2.2. Second Stage The second stage is the essence of our current paper. After stage 1, we obtain a "clean" and a "noisy" signal. Regrettably, the clean signal has lost some of the voice's brilliance, while giving a good, life-like quality to the middle register. Moreover, there are annoying artifacts that are residuals of incomplete noise removal, which are very noticeable since they stand out in an otherwise clean signal. The second part of our algorithms attempts to improve on the first pass. We wish to model the noise more carefully, not just as something which cannot be efficiently represented, but taking into account that noise will vary in subtle ways from one audio signal to another, and will depend on the type of instruments played, the recording equipment used, and many other factors. We propose to use the rough model of the noise obtained in the first stage to obtain a more precise model of the noise present in the signal. I C M C P ROCEE D I N G S 199528 289

Page  290 ï~~We proceed as follows. We start once again with a window of the original signal and re-build the tree of local sine bases, on the frequency side. We simultaneously expand the corresponding window of our model noise signal in another tree. To each node of the noise tree we associate a weight, the average energy of that node. We then go to our tree of the original signal and choose a best basis from it, but no longer using Shannon entropy. We now use relative entropy (also called cross entropy, or Kullback-Leibler distance) to compare the sum of the costs of the children to the cost of their parent node. For a vector {xi}, with associated weight w, we let yi = xi and seek to minimize EYlog(w/yi). If w is constant, we get the usual Shannon entropy. The relative entropy assigns a high cost when w is large compared to the yi's, i.e., when the signal to noise ratio is low in our setup. This discrimination is done node by node, i.e. varying from frequency region to frequency region, thus allowing for subtle effects to be considered. The second stage penalizes those nodes where the first stage noise model shows there is more noise than signal, and that thus should be considered unreliable. Even if the first stage noise model has defects, we are only using the node by node average energy of the noise, and an approximation should be sufficient to drive the second stage. Finally, in the optimal basis chosen, those coefficients are kept for the clean signal which exceed by a user-specified multiple the square root of the noise energy in the corresponding node. 3. RESULTS BASED UPON A SAMPLE RUN We applied the two-stage denoising algorithm to a piano vocal arrangement of Cavaradossi's aria E lucevan le stelle from the third act of Puccini's Tosca, recorded by Enrico Caruso with an unknown pianist in 1904 for The Victor Talking Machine Company. Our experiments have shown that the second pass of noise removal does indeed give rise to a cleaner signal than we have been able to obtain with our previous methods. Most noticeably, the annoying whistles present after the first stage, no matter how many averaged iterations are used, have almost completely disappeared after the second stage. The relative entropy's preference for those nodes in which the noise has not overpowered the music does indeed result in less noise being audible. The method seems to produce a more "natural" sounding result than the previous one: there is no noise left which stands out due to its artificiality-- the few crackles that remain do not draw attention to themselves as being the product of signal processing. Moreover, the reconstruction of the voice, which was quite good even after the first stage, retains its quality and perhaps even gains a certain richness after the second stage. Some drawbacks remain, however. In certain places, the vocal color is a bit more "pinched" than in the original, suggesting a loss of formant emphasis. Additionally, both the piano and the voice appear to be too clean in some respects. The noise components of the instruments (the attack portion of the piano as well as consonants and fricatives of the human voice) are discarded along with noise. These drawbacks leave an occasionally antiseptic quality in the cleaned version. 4. FUTURE WORK There are several promising avenues for future research. One is to use the (local) complex Fourier transform, and keep the real part at the end, rather than using local sine or local cosine bases. This approach should avoid the problem of using many sine terms to represent a cosine, and vice versa, and should allow both sines and cosines to be equally well represented, which is desirable for musical signals. Another improvement would be to try to extract the part of the music that still remains in the noise. A possible approach is to use the basis that was chosen as optimal for expanding the original signal (in the second stage) and to expand the noise obtained after the second stage in that basis. Since this basis was built to represent music concisely, the musical components mn the noise should be prominent in the basis, and may be possible to extract. 290 I C M C P R OCEE D I N GS 1995

Page  291 ï~~5. SUMMARY The audibly apparent fact that the second stage of the denoising algorithm described above has removed the annoying numerical artifact noises which remained after the first stage, is a very encouraging sign that the process we are using is well-suited for discarding noise in a signal. It seems that the first stage alone, while doing a good job of removing noise, suffered from being too generic and treating all noise as being the same in all signals by using a general definition of noise (as something having high entropy). The second stage allows for "customizing" the definition of noise: the rough model of noise from the first stage is used to build a better model of noise which depends on the first model, and thus varies from signal to signal. The advantage of the two stage process is further demonstrated by the improvement in the musical quality of the signal. 6. AVAILABILITY OF EXAMPLES A segment of the original recording used in our work, examples of the results at each juncture of the denoising process, and examples of other attempts at denoising the same segment can be downloaded from the Center for Studies in Music Technology's World Wide Web site: http://www.yale.music/research/denoise/caruso/ 7. REFERENCES (Auscher,P., Weiss,G., and Wickerhauser,V.)P.Auscher, G. Weiss, and M.V. Wickerhauser, Local Sine and Cosine Bases of Cowfan and Meyer and the Construction of Smooth Wavelets, in Wavelets: A Tutorial in Theory and Applications (C. Chui ed.), Academic Press, Inc., 1992. (Berger,J., Coifman,R., and Goldberg,M. 1994-a) J. Berger, R. Coifinan, and M. Goldberg, A Method of Denoising and Reconstructing Audio Signals, in Proceedings of ICMC 94, pp. 344-347. (Berger,J., Coifman,R., and Goldberg,M. 1994-b) J. Berger, R. Coifman, and M. Goldberg, Removing Noise from Music Using Local Trigonometric Bases and Wavelet Packets, J. Audio Eng. Soc., Vol. 42, No. 10, Oct. 1994, pp. 808-818. (Coifman,R. and Wickerhauser,V.) R. Coifinan and M.V. Wickerhauser, Entropy-Based Algorithms for Best Basis Selection, IEEE Trans. Inf. Theory, vol. 38, March 1992, pp. 713-718. (Coifman,R., and Majid,F.) R. Coifman and F. Majid, Adapted Waveform Analysis and Denoising, in Progress in Wavelet Analysis and Applications, Proc. Int. Conf. "Wavelets and Applications", Y. Meyer and S. Roques, Eds., Editions Frontieres, Toulouse, France, 1993, pp. 63-76. (Rao,K. and Yip,P.) K. Rao, P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications, Academic Press, Inc.., Boston, 1990. (Saito,N.) N. Saito, Local Feature Extraction and its Applications using a Library of Bases, Ph.D. Dissertation, Yale University, December 1994. (Saito,N. and Coifman,R.) N. Saito and R. Coifman, Local Discriminant Bases, in Mathematical Imaging: Wavelet Applications in Signal and Image Processing, A.F. Laine and M.A. Unser, eds., July 1994, Proc. SPIE 2303. (Wickerhauser,V.) M. V. Wickerhauser, Adapted Waveform Analysis Library (software package for wavelet packet and local trigonometric decompositions), Wickerhauser Consulting, 1991-1992. ICMC PROCEEDI N GS 199529 291