Optimal Filtering of an Instrument Sound in a Mixed Recording Given Approximate Pitch Prior

Adiel Ben-Shalom and Shlomo Dubnov
Center for Research in Computing and the Arts (CRCA), UCSD
{abenshalom,sdubnov}@ucsd.edu

Abstract

In this paper we present a method for optimal filtering of an instrument in a mixed recording. The filter is based on prior knowledge of the approximate pitch played by the instrument and on the assumption that the instrument is well described within the harmonic model framework. Pitch priors can be obtained using score alignment methods. Using the pitch priors, we present an algorithm that can be used to filter a single instrument or a voiced singer, subtract an instrument, or balance the volume of several instruments.

1 Introduction

This paper describes a method for filtering an instrument out of a mixed recording. The problem has many applications. For example, one could re-mix a multi-track recording from the stereo mix alone, without access to the original tracks. Other future applications could be automatic karaoke and automatic music-minus-one.

Without any prior knowledge, this problem is considered very hard. An algorithm that filters a single instrument out of a mixed recording should be able to analyze a complex musical scene precisely. Many difficult problems would have to be solved, such as polyphonic pitch detection and instrument recognition (Cook and J 1994; Virtanen and Klapuri 2002).

We avoid this difficulty by using a score alignment algorithm (Shalev-Shwartz, Dubnov, Friedman, and Singer 2002). We assume that the score is given along with the real recording. This is not a strong assumption, since almost all popular and classical music can be found in MIDI format. The score alignment algorithm takes as input the real recording and the score information (MIDI) and outputs the precise alignment over time between them.
This method simplifies the analysis of the complex sound. Once an alignment is achieved, we use the exact time-pitch information to design an optimal filter for the different notes played by the different instruments. The filter that we describe is based on the harmonic model, and it is assumed that the instrument that we wish to filter can be modeled with the harmonic model.

The next section explains the details of the filtering algorithm. We also show how this algorithm can be used to filter a singing voice (2.3), subtract an instrument from a mixed recording (2.4), and balance the volume between several instruments (2.5).

The algorithm has several drawbacks. First, it is limited to instruments that are well modeled within the harmonic model framework; it cannot process percussion-like instruments. We also need special processing for the attack part of the sound as well as for the unvoiced parts of a singer. Another drawback is that it cannot separate sources playing in unison (i.e., if two instruments play the same pitch, the algorithm cannot filter or separate them).

2 Harmonic Filter Theory

2.1 Harmonic Model and LCMV Filtering

The optimal filtering process is based on two concepts. The first is the harmonic model assumption (Serra 1989; Rodet 1997; Roads, Pope, and Piccially 1997) for the instrument pitch. The second is the formulation of the filtering problem as a Linear Constraint Minimum Variance (LCMV) filtering problem. The LCMV approach is very natural in this setup, since the mixing instructions can be viewed as the specification of a linear constraint on the resulting signal. In the LCMV approach, the filter design is based on the assumption of a particular shape of the signal of interest (usually a single sinusoid). In our case, we assume that the signal of interest is described by a harmonic model.
A harmonic model means that the signal contains a periodic pitched component, consisting of a fundamental along with its partials, plus some noise that we model as white Gaussian noise (WGN). Formally,

$$x(n) = \sum_{k=1}^{K} A_k \cos\left(2\pi k f_0 n + \phi_k(n)\right) + w(n) \tag{1}$$

where $A_k$ and $\phi_k$ are the instantaneous amplitude and phase of the $k$-th sinusoidal component, $K$ is the number of partials, and $n = 0, \ldots, N-1$, where $N$ is the number of samples.

The second concept that we adopt is Linear Constraint Minimum Variance (LCMV) estimation. LCMV theory was developed as an optimal beamforming method (Veen and Buckley 1997), which is basically a method for spatial filtering of a signal in the presence of other signals originating from different locations. Linear constraints guarantee that the desired signal is passed correctly, and the output signal variance is minimized to ensure minimal contributions from interfering sources. The same principles can be used for temporal (FIR) filtering of a signal in the presence of interfering signals.

LCMV filtering has two important properties. The first is that it preserves the desired signal without any distortion, a property that is not achieved with other filtering methods. The second is that it minimizes the contribution of the interfering signals to the output signal. The second property is also referred to as sidelobe cancellation.

2.2 Filtering a Single Instrument

We begin with the first scenario, where we wish to extract a single instrument from a musical recording whose score information is given. From the alignment process we get the segmentation information of this instrument in the recording, which includes the pitch that the instrument plays at each time. A polyphonic instrument may play several pitches at the same moment. Since the harmonic filter relies on exact pitch estimation, we incorporate a pitch refinement process for each frame (see below). Because the signal content changes over time, the filter must be dynamic: its parameters must change according to the content of the signal. This constraint leads us to adopt the overlap-add (OLA) method in the filtering process. We analyze the signal in overlapping frames and modify the filter parameters in each frame according to the exact pitch value in that frame.
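The frame-by-frame processing described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `overlap_add`, the 50% overlap, and the periodic Hann synthesis window are all choices made here so that overlapping windows sum to one, and `process_frame` is a hypothetical placeholder for whatever per-frame filter is designed from the pitch of that frame.

```python
import math

def overlap_add(signal, frame_len, process_frame):
    """Process `signal` in 50%-overlapping frames and overlap-add the results.

    `process_frame(frame, index)` is a hypothetical per-frame filter; in the
    setting of the paper it would be redesigned for the pitch of each frame.
    """
    hop = frame_len // 2
    # Periodic Hann window: copies shifted by frame_len / 2 sum to exactly 1.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    output = [0.0] * len(signal)
    index = 0
    for start in range(0, len(signal) - frame_len + 1, hop):
        # The frame is passed on unweighted (rectangular analysis window).
        frame = signal[start:start + frame_len]
        filtered = process_frame(frame, index)
        # The window is applied only when the filtered frame is added back.
        for n in range(frame_len):
            output[start + n] += window[n] * filtered[n]
        index += 1
    return output

# Usage: with an identity "filter", samples covered by two frames are rebuilt.
signal = [math.sin(0.05 * n) for n in range(512)]
rebuilt = overlap_add(signal, 64, lambda frame, index: frame)
```

With the identity placeholder, every sample covered by two frames is reconstructed, while the first and last half-frames are tapered by the window. A real system would need to handle those edges and choose the frame length to suit the lowest expected pitch.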
For the sake of simplicity, we describe the filtering algorithm for a single window $x$, in which a single actual pitch is known and equals $f_0$. Given $f_0$, we model the samples in the window $x$ by a harmonic model with additive WGN as in Equation (1). Using a simple trigonometric identity, we rewrite Equation (1) as:

$$x(n) = \sum_{k=1}^{K} \left[ a_k \cos(2\pi f_0 k n) + b_k \sin(2\pi f_0 k n) \right] + w(n) \tag{2}$$

The unknown parameters in this model are the amplitudes $a_k$ and $b_k$ of the sinusoids. We shall see that in order to filter the signal, we do not need to estimate the amplitude parameters explicitly. Let us denote

$$A_c = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ \cos(2\pi f_0) & \cos(2\pi 2 f_0) & \cdots & \cos(2\pi K f_0) \\ \vdots & \vdots & & \vdots \\ \cos(2\pi f_0 (N-1)) & \cos(2\pi 2 f_0 (N-1)) & \cdots & \cos(2\pi K f_0 (N-1)) \end{pmatrix} \tag{3}$$

$$A_s = \begin{pmatrix} 0 & 0 & \cdots & 0 \\ \sin(2\pi f_0) & \sin(2\pi 2 f_0) & \cdots & \sin(2\pi K f_0) \\ \vdots & \vdots & & \vdots \\ \sin(2\pi f_0 (N-1)) & \sin(2\pi 2 f_0 (N-1)) & \cdots & \sin(2\pi K f_0 (N-1)) \end{pmatrix} \tag{4}$$

The matrices $A_c$ and $A_s$ contain the sinusoidal components that make up the harmonic signal. We denote by $H$ the matrix formed by concatenating $A_c$ and $A_s$:

$$H = \begin{pmatrix} A_c & A_s \end{pmatrix} \tag{5}$$

We can now represent the samples in each window $x$ in terms of a linear model:

$$x = H\theta + w \tag{6}$$

where $\theta = [a_1, \ldots, a_K, b_1, \ldots, b_K]^T$ is the unknown amplitude vector and $w$ is WGN. LCMV filtering then results in:

$$y = H (H^T H)^{-1} H^T x \tag{7}$$

The rows of the matrix $H (H^T H)^{-1} H^T$ can be interpreted as FIR filter coefficients. The frequency response of the filter (see Fig. 1) shows that the filter passes the fundamental frequency $f_0$ along with its partials at 0 dB, which means that these frequencies appear in the output signal with the same magnitude as in the input signal. All other frequencies are suppressed by a factor that corresponds to the relative power between the main lobe and the sidelobes of the filter window.
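The projection in Equation (7) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions rather than the authors' code: the function names `harmonic_basis`, `solve`, and `harmonic_filter` are hypothetical, $f_0$ is taken in cycles per sample, and the amplitudes are obtained by solving the normal equations $H^T H \theta = H^T x$ with a small Gaussian elimination routine instead of forming the inverse explicitly.

```python
import math

def harmonic_basis(f0, num_partials, n_samples):
    """Rows of H = (A_c A_s): K cosine columns followed by K sine columns."""
    rows = []
    for n in range(n_samples):
        cos_part = [math.cos(2 * math.pi * k * f0 * n)
                    for k in range(1, num_partials + 1)]
        sin_part = [math.sin(2 * math.pi * k * f0 * n)
                    for k in range(1, num_partials + 1)]
        rows.append(cos_part + sin_part)
    return rows

def solve(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * m
    for i in reversed(range(m)):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, m))
        z[i] = (aug[i][m] - s) / aug[i][i]
    return z

def harmonic_filter(x, f0, num_partials):
    """Project x onto the column space of H: y = H (H^T H)^{-1} H^T x."""
    h = harmonic_basis(f0, num_partials, len(x))
    cols = len(h[0])
    gram = [[sum(row[i] * row[j] for row in h) for j in range(cols)]
            for i in range(cols)]                      # H^T H
    rhs = [sum(row[i] * xn for row, xn in zip(h, x))
           for i in range(cols)]                       # H^T x
    theta = solve(gram, rhs)                           # least-squares amplitudes
    return [sum(hi * ti for hi, ti in zip(row, theta)) for row in h]

# Usage: a two-partial tone at f0 = 0.01 cycles/sample plus an interfering sinusoid.
N, f0 = 400, 0.01
target = [math.cos(2 * math.pi * f0 * n + 0.3)
          + 0.5 * math.cos(2 * math.pi * 2 * f0 * n + 1.1) for n in range(N)]
interferer = [0.8 * math.cos(2 * math.pi * 0.0373 * n) for n in range(N)]
mixture = [t + i for t, i in zip(target, interferer)]
estimate = harmonic_filter(mixture, f0, 2)
```

In this toy setting a signal that exactly follows the harmonic model passes through unchanged, and the off-harmonic sinusoid is strongly attenuated. Solving the normal equations directly is the simplest choice for a sketch; for long windows or closely spaced pitches a QR-based least-squares solver would probably be more robust numerically.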
Figure 1: Frequency response of the harmonic filter. [Plot omitted; horizontal axis: normalized frequency (×π rad/sample).]

Footnote 1: In our application, the suppression of non-harmonic frequencies was more than 10 dB. The reason for this number is that the short-time analysis process can be viewed as multiplying the signal by a rectangular window. The frequency response of a rectangular window resembles a sinc function, in which the first sidelobe lies 13 dB below the main lobe. Better suppression of the non-harmonic components can be achieved using other types of windows.

WGN might not be the correct model for the remainder of the spectrum, since it contains other instruments, which might be harmonic or percussive and which exhibit high correlations between successive data samples (footnote 2). However, since in this scenario the algorithm has information only about the one instrument that we wish to filter, the WGN assumption is the least committing one. Although incomplete, this model has several advantages, such as cancellation of the sidelobes by the pseudoinverse and little phase distortion due to its FIR structure, and it gives sufficiently good results.

Since the filtering algorithm requires knowledge of the exact pitch within each frame, we use an exhaustive search over neighboring values of $f_0$ and choose the corrected pitch that gives minimal energy in the residual (see Eq. 8). Since the model is non-linear in the pitch, there is no closed-form estimate, so we use an exhaustive search. Other pitch refinement methods might be used instead to save computation.

2.3 Vocal Filtering

Filtering a singing voice is quite similar to filtering harmonic instruments. However, due to several unique characteristics of the human voice, we need to modify the algorithm slightly. The harmonic filter can be applied only to the voiced parts of the singing. The unvoiced parts are noise-like, and the harmonic model is inadequate in that case. In addition, for the unvoiced parts the algorithm still has to separate the singer from the accompaniment, which is not a trivial task.

For voiced/unvoiced separation we use a heuristic which states that the unvoiced parts contain more high-frequency energy than the voiced parts. This is closely related to the work by Masri and Bateman on modeling transients (Masri and Bateman 1996). We process the voiced components using the algorithms described in sections (2.2) and (2.5). For the unvoiced parts we use a simple model in which we assume that the high frequencies in the spectrum of the unvoiced component belong to the singer and the low-frequency components belong to the accompaniment.
Thus, we high-pass filter the unvoiced part, and the resulting signal is associated with the singer.

2.4 Subtracting an Instrument or Voice from a Recording

In many situations, especially when the score information is incomplete, it is useful to keep the accompaniment, whose score is unavailable to us, while suppressing the instrument or singer whose score we do have. For example, in karaoke, the recording contains the soloist along with the accompaniment, and the process removes the soloist part while keeping the accompaniment. To solve this task, we design the filter in the same way as described in section (2.2), but we modify Equation (7) to:

$$y = x - H (H^T H)^{-1} H^T x \tag{8}$$

In other words, we subtract the estimate of the soloist from the original signal. This gives an estimate of the accompaniment by projecting the recording onto the subspace that is orthogonal (complementary) to the signal space that describes the soloist. All other details remain the same as in the soloist filtering scenario.

Footnote 2: WGN assumes that the remaining part of the signal contains samples that are uncorrelated and Gaussian distributed.

2.5 Adjusting the Balance between Several Instruments

The second scenario that we treat is balancing several instruments in a musical piece. In this scenario, we have at our disposal score information about several instruments that are playing together. The input now also contains the amount (measured in dB) of boosting or suppression for each instrument. The filter design in this case differs from the filter design described above in two major points. First, we have to extend the harmonic model from a single instrument to a group of instruments. Second, we must constrain the filter to have different magnitude responses that match the boosting/suppression requests for the different instruments. The extension of the harmonic model to a group of different instruments is straightforward.
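The boosting or suppression amounts are given in dB, so they have to be converted into linear amplitude factors before they can scale sinusoidal components. The sketch below shows one way to do that conversion. The function name `gain_diagonal` and the column layout (each instrument owning $2K$ consecutive basis columns, cosines first, then sines) are assumptions made for illustration and are not taken from the paper.

```python
def gain_diagonal(gains_db, num_partials):
    """Return one linear gain per sinusoidal basis column.

    Assumes (hypothetically) that each instrument contributes
    2 * num_partials consecutive columns: its cosine columns, then its
    sine columns, all sharing that instrument's gain.
    """
    diagonal = []
    for gain_db in gains_db:
        linear = 10.0 ** (gain_db / 20.0)  # amplitude ratio for a dB change
        diagonal.extend([linear] * (2 * num_partials))
    return diagonal

# Usage: boost the first instrument by 6 dB and cut the second by 12 dB.
diag = gain_diagonal([6.0, -12.0], num_partials=3)
```

A gain of 0 dB maps to a factor of exactly 1, which leaves that instrument's components untouched. A per-partial gain (for example, to reshape timbre) would need a longer list, but the paper only describes per-instrument requests.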
Assume that we have $P$ instruments, and let us denote by $f_0^{(1)}, f_0^{(2)}, \ldots, f_0^{(P)}$ the fundamental frequencies of these instruments. The extension of Equation (1), in the form of Equation (2), is then:

$$x(n) = \sum_{p=1}^{P} \sum_{k=1}^{K} \left[ a_k^{(p)} \cos\left(2\pi f_0^{(p)} k n\right) + b_k^{(p)} \sin\left(2\pi f_0^{(p)} k n\right) \right] + w(n) \tag{9}$$

To represent the signal in a linear model, we again use the sinusoid matrices $A_c$ and $A_s$. We now define the matrix $H$ as the concatenation of all pairs $A_c^{(p)}$ and $A_s^{(p)}$:

$$H = \begin{pmatrix} A_c^{(1)} & A_s^{(1)} & \cdots & A_c^{(P)} & A_s^{(P)} \end{pmatrix} \tag{10}$$

The constraint on the magnitude response of the filter is formulated as a gain matrix $G$. $G$ is a diagonal matrix that contains the magnitude response values for each pitch (fundamental and its partials). The LCMV filtering process is then defined by

$$y = H G (H^T H)^{-1} H^T x \tag{11}$$

3 Results and Future Work

All examples are available at http://www.cs.huji.ac.il/~chopinNMix/filter.html.

The first example is an instrumental recording of a Mozart violin sonata, which contains two instruments: violin and piano. Using the virtual mixer, we tried to extract the violin part. We assumed that we know the score information only for the violin part, and used the algorithm described in section (2.2). As can be heard, the piano part in the modified signal is almost completely inaudible. The filter suppressed the piano part by 13 dB. For this example, this is almost true isolation of the violin part. Figure 2 depicts this filtering process. One might notice some pops and clicks at the beginning of the notes, which result from applying the filter with an incorrect pitch. This happens due to inaccurate alignment results. We intend to keep working on the alignment process to see whether it can give accurate output even at note boundaries.

The second recording contains a vocal part. We took a pop song, "Summertime" sung by Ella Fitzgerald. As in the instrumental piece, we assumed that we know the score information of the singing. We then used the virtual mixer to modify the singer. As can be heard, the filtered voice contains both the voiced and unvoiced parts of the singing, which correspond to the different processing algorithms used by the system. We then extracted the accompaniment part using the algorithm described in section (2.4). With both the voice and the accompaniment available, we virtually mixed the two parts with a different balance between the soloist and the accompaniment.

The last example that we tested was a vocal recording with a strong percussion accompaniment. Since our filter cannot handle percussion instruments due to its harmonic model assumption, we wanted to test how well it copes with percussion instruments when they appear in the accompaniment. The recording that we chose was the Beatles song "Help", from which we chose a solo part with strong percussion accompaniment. We had the score information of the singer, and using this information we extracted the singer's voice and the accompaniment.
Then we re-mixed them with a different balance between the singer and the accompaniment. The result was quite good: the re-mixed sound preserves the strong percussion in the accompaniment.

Figure 2: Filtering a single instrument. The top figure is a recording of piano and violin. Using the harmonic filter we extract the violin part (bottom figure). [Plots omitted.]

4 Acknowledgments

We would like to thank Shai Shalev-Shwartz and Michael Werman from the Hebrew University for helpful discussions. This work was partially funded by the Semantic HIFI IST project (FP6 project no. 507913).

References

Cook, M. P. and B. G. J (1994). Separating simultaneous sound sources: issues, challenges, and models, Speech Recognition and Speech Synthesis. John Wiley and Sons.

Kay, S. M. (1993). Fundamentals of Statistical Signal Processing. Prentice Hall.

Masri, P. and A. Bateman (1996). Improved modelling of attack transients in music analysis-resynthesis. In Proc. of International Computer Music Conference (ICMC96), pp. 100-103. The International Computer Music Association (ICMA).

Roads, C., S. Pope, and A. Piccially (Eds.) (1997, June). Musical Signal Processing. Swets & Zeitlinger.

Rodet, R. (1997). Musical sound signal analysis/synthesis: Sinusoidal+residual and elementary waveform models. In IEEE Time-Frequency and Time-Scale Workshop.

Serra, X. (1989, Oct.). A System for sound analysis/transformation/synthesis based on deterministic plus stochastic decomposition. Ph.D. thesis, Stanford University.

Shalev-Shwartz, S., S. Dubnov, N. Friedman, and Y. Singer (2002). Robust temporal and spectral modeling for query by melody. In SIGIR.

Veen, V. V. and M. Buckley (1997). Beamforming Techniques for Spatial Filtering. In Digital Signal Processing Handbook, Chapter 61. CRC Press.

Virtanen, T. and A. Klapuri (2002). Separation of harmonic sounds using linear models for the overtone series. In Proc. ICASSP 2002.

Proceedings ICMC 2004