A model for selective segregation of a target instrument sound from the mixed sound of various instruments

Masashi Unoki, Masaaki Kubo, and Masato Akagi
School of Information Science, Japan Advanced Institute of Science and Technology
1-1 Asahidai, Tatsunokuchi, Nomi, Ishikawa, 923-1292 JAPAN
Email: {unoki,kubomasa,akagi}@jaist.ac.jp

Abstract

This paper proposes a selective sound segregation model for separating a target musical instrument sound from the mixed sound of various musical instruments. The model consists of two blocks: a model that segregates two acoustic sources based on auditory scene analysis as bottom-up processing, and selective processing based on knowledge sources as top-down processing. Two simulations were carried out to evaluate the proposed model. The results showed that the model could selectively segregate not only the target instrument sound but also the target performance sound from the mixed sound of various instruments. The model can therefore also serve as a computational model of the mechanisms of the human selective hearing system.

1. Introduction

Let us consider the problem of selective sound segregation (Fig. 1). Here, the sounds of three musical performances, independently played on flute, piano, and violin, are mixed together. When we try to listen to a target sound (e.g., the piano) within the mixture, we can easily segregate it if we know what the target is and have previously listened to it. This type of situation arises from what is called the "cocktail party effect" [3] and is an important issue not only for automatic music description systems but also for various other types of signal processing, such as hearing aid systems and robust speech recognition. In practice, though, it is difficult to construct a computational model that can process signals in this way, because the signals overlap in a concurrent time-frequency region and the problem is an ill-posed inverse problem. We therefore need reasonable constraints to solve it.

Recently, sound segregation models based on computational auditory scene analysis (CASA) have been proposed to solve the above problem by using Bregman's regularities [2]. In the case of musical sound in particular, CASA is called "music scene analysis" [6], and models have been proposed for extracting significant information (musical sequences, rhythm, etc.) regarding a target sound from a mixed sound and for understanding the target [4, 5, 6]. The underlying concept of these models is to computationally model the ability of the auditory system as a function of active scene analysis [2]. There are two main types of segregation model, based on either bottom-up (e.g., [7]) or top-down (e.g., [4, 6]) processing.

[Figure 1: The selective sound segregation problem.]

To realize a selective sound segregation model as shown in Fig. 1, we have to resolve two issues: (1) how to precisely select the target sound within a real environment, and (2) how to completely separate the target sound from the mixed sound, in which overlapping components exist in a concurrent time-frequency region. However, since top-down and bottom-up processes address only issue (1) or issue (2), respectively, neither alone can realize a selective sound segregation model.
This paper proposes a model for selectively segregating a target instrument sound from a mixture of various sounds by combining top-down and bottom-up processing.

2. Selective sound segregation model

The proposed selective segregation model is shown in Fig. 2. The model is based on two types of processing: top-down processing to select the position of the target sound within the mixed sound (resolving issue (1); the dashed line in Fig. 2), and bottom-up processing to separate the target from the other sounds in the concurrent time-frequency region (resolving issue (2); the dotted line in Fig. 2). The bottom-up processing is the same as the method proposed in [7], but it has been modified so that it can be combined with the top-down processing.

2.1. Model concept and definition

In this model, the original signals (f_1(t), f_2(t), f_3(t), and so on) are not known, nor is it known how many different sounds there are. The only model inputs are the observed mixed signal f(t) and a knowledge key, such as the symbol for the target instrument name (here, the target is f_1(t)). To deal with top-down information, we assume that the exact target sound can exist anywhere in the mixed sound and that knowledge about the target sound can be represented through its acoustical features. The key thus enables the model to obtain information regarding the acoustical features of the target sound from the knowledge sources.
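To make this interface concrete, the following minimal sketch shows only the inputs and output described above; the class and function names are hypothetical and are not taken from the paper's implementation.

```python
# Hypothetical top-level interface: the model observes only the mixture f(t)
# plus a symbolic key identifying the target; the original sources are unknown.
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class KnowledgeKey:
    instrument: str                      # e.g. "piano": symbol of the target sound
    notes: Optional[List[str]] = None    # optional note names, with no timing information

def selective_segregation(f_mixed: np.ndarray, fs: float, key: KnowledgeKey) -> np.ndarray:
    """Return an estimate of the target signal f1_hat(t).

    Top-down processing uses `key` to fetch a template and locate the target in
    the time-frequency plane; bottom-up processing separates the target from the
    residual mixture in the concurrent region (constraints of Table 1).
    """
    raise NotImplementedError("placeholder: the blocks are sketched in the following sections")
```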

Table 1: Constraints corresponding to Bregman's regularities.

  Regularity (Bregman, 1993)       Constraint (Unoki, 1999)
  (i)   common onset/offset        |T_S - T_k,on| < ΔT_S,  |T_E - T_k,off| < ΔT_E
  (ii)  gradualness of change
        (slowness)                 dA_k(t)/dt = C_k,R(t),  dθ_1k(t)/dt = D_k,R(t),  dF_0(t)/dt = E_0,R(t)
        (smoothness)               ∫[A_k^(R+1)(t)]^2 dt → min,  ∫[θ_1k^(R+1)(t)]^2 dt → min
  (iii) harmonicity                n × F_0(t),  n = 1, 2, ..., N_F0
  (iv)  common AM                  A_k(t)/‖A_k(t)‖ ≈ A_l(t)/‖A_l(t)‖,  k ≠ l

[Figure 2: Selective sound segregation model.]
[Figure 3: Typical template for the target instrument.]

This model concept is based on the problem of segregating two acoustic sources. This fundamental problem is defined as follows [7]. Only the mixed signal f(t), where f(t) = f_1(t) + f_2(t), can be observed; f(t) is decomposed into its frequency components by a K-channel filterbank. The output of the k-th channel, X_k(t), is represented by

  X_k(t) = S_k(t) \exp\bigl(j\omega_k t + j\phi_k(t)\bigr),    (1)

where S_k(t) and φ_k(t) are the instantaneous amplitude and phase, respectively. If the outputs of the k-th channel corresponding to f_1(t) and f_2(t) are assumed to be A_k(t) exp(jω_k t + jθ_1k(t)) and B_k(t) exp(jω_k t + jθ_2k(t)), then the instantaneous amplitudes A_k(t) and B_k(t) can be determined as

  A_k(t) = S_k(t) \sin\bigl(\theta_{2k}(t) - \phi_k(t)\bigr) / \sin\theta_k(t),    (2)
  B_k(t) = S_k(t) \sin\bigl(\phi_k(t) - \theta_{1k}(t)\bigr) / \sin\theta_k(t),    (3)

where θ_k(t) = θ_2k(t) - θ_1k(t), θ_k(t) ≠ nπ, n ∈ Z, and ω_k is the center frequency of the k-th channel. However, A_k(t), B_k(t), θ_1k(t), and θ_2k(t) cannot be uniquely determined without some constraints, as is easily seen from the above equations. The problem is therefore an ill-posed inverse problem. To solve it, we previously proposed a basic model that uses constraints related to four of Bregman's regularities [7], as shown in Table 1.

2.2. Model implementation

The basic problem given above concerns two-sound segregation. In this paper, the problem is therefore set so that f_1(t) is the target sound selected by top-down processing and f_2(t) is the remaining mixed sound (i.e., the sum of all the other sounds, f_2(t) + f_3(t) + ... + f_N(t)). The problem is then solved using the solution based on auditory scene analysis [7]. The proposed model is implemented in six blocks: the filterbank, F-note estimation, template generation, event detection, separation, and grouping blocks (Fig. 2).

The filterbank decomposes the observed signal f(t) into complex spectra X_k(t). It is designed as a constant narrow-bandwidth filterbank with K = 500 channels, a 20-Hz bandwidth, FIR-type bandpass filters, and a 20-kHz sampling frequency. S_k(t) and φ_k(t) are determined by using the Hilbert transform of X_k(t) [7].
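A minimal sketch of this analysis front end is given below, assuming SciPy's FIR design and Hilbert transform. The channel count, bandwidth, and sampling rate follow the values above, but the function name and filter length are our own choices.

```python
import numpy as np
from scipy.signal import firwin, hilbert, lfilter

def analyze(f_mixed, fs=20000.0, n_channels=500, bandwidth=20.0, numtaps=401):
    """Decompose f(t) into instantaneous amplitudes S_k(t) and phases phi_k(t), as in Eq. (1).

    Each channel is a constant-bandwidth FIR bandpass filter; the analytic signal
    (Hilbert transform) of its output yields the envelope and phase.
    """
    t = np.arange(len(f_mixed)) / fs
    S, phi, centers = [], [], []
    for k in range(n_channels):
        wk = (k + 1) * bandwidth                        # center frequency [Hz]
        lo, hi = wk - bandwidth / 2.0, wk + bandwidth / 2.0
        if lo <= 0.0 or hi >= fs / 2.0:                 # skip channels outside the valid band
            continue
        b = firwin(numtaps, [lo, hi], pass_zero=False, fs=fs)
        x_k = lfilter(b, [1.0], f_mixed)                # band-limited channel output
        z_k = hilbert(x_k)                              # analytic signal
        S.append(np.abs(z_k))                           # S_k(t)
        phi.append(np.unwrap(np.angle(z_k)) - 2.0 * np.pi * wk * t)  # phi_k(t), carrier removed
        centers.append(wk)
    return np.array(S), np.array(phi), np.array(centers)
```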
The F-note estimation block determines candidates for the fundamental frequency of the musical instrument sound by picking peaks in the autocorrelation function computed across frequency at each time frame of the S_k(t)s. Histograms for the candidates are then accumulated along the time axis in the time-frequency region. The candidates with the highest histogram values are passed to the event-detection block, where the final estimated F-note, F_0(t), is determined. In this paper, F_0(t) changes stepwise, and the temporal derivative of F_0(t) is zero within each segment. As a result, this paper assumes that E_0,R(t) = 0 in Table 1 (ii) for each segment. Most of the segments correspond to the duration of a single F-note in the target instrument sound.

The template generator produces an acoustical template from the knowledge sources, depending on the symbol of the target sound. The generated template is composed of the shape of the instantaneous amplitude in the time-frequency region, based on the fundamental frequency, duration, and general acoustical features of the musical instrument sound. The shapes of the standard templates for flute, piano, and violin are shown in Fig. 3. In this paper, the templates were obtained from the averaged instantaneous amplitude of the target under various conditions (normalized duration, normalized harmonicity, etc.). This could be extended by analyzing all of the sounds, as was done in [6].
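As a rough sketch, a template of this kind could be formed by averaging the harmonic instantaneous amplitudes of isolated target recordings after normalizing their duration and level. The function name, frame count, and normalization choices below are our own assumptions, not the paper's exact procedure.

```python
import numpy as np

def make_template(amplitude_examples, n_frames=100):
    """Average instantaneous-amplitude shape A_TMP,k(t) for one target instrument.

    amplitude_examples: list of arrays shaped (n_harmonics, n_samples), the harmonic
    envelopes of isolated recordings of the target (same number of harmonics each).
    """
    normalized = []
    for A in amplitude_examples:
        n_harm, n_samp = A.shape
        src = np.linspace(0.0, 1.0, n_samp)
        dst = np.linspace(0.0, 1.0, n_frames)
        # normalize duration: resample each harmonic envelope to a fixed length
        A_res = np.vstack([np.interp(dst, src, A[h]) for h in range(n_harm)])
        # normalize level so loud and soft examples contribute equally
        A_res /= (np.max(A_res) + 1e-12)
        normalized.append(A_res)
    return np.mean(normalized, axis=0)   # template of shape (n_harmonics, n_frames)
```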

The event detection block uses the template of the target to determine the concurrent time-frequency region of the target sound. In this block, the F-note is selected from the F-note candidates F_0(t) by searching for the candidate whose extracted amplitude, based on the harmonicity of that candidate, best matches the template in terms of correlation. This corresponds to constraint (iii). The estimated event of the target is then obtained from the candidate with the highest correlation. The onset and offset of the target instrument sound, T_k,on and T_k,off, are determined from the estimated instantaneous amplitude based on the harmonics of the selected fundamental frequency. This corresponds to constraint (i).

The separation block determines A_k(t), B_k(t), θ_1k(t), and θ_2k(t) from S_k(t) and φ_k(t) using constraints (ii) and (iv) in the determined concurrent time-frequency region. Constraint (ii) is implemented such that C_k,R(t) and D_k,R(t) are linear (R = 1) polynomials, in order to reduce the computational cost of estimating them. Under this assumption, A_k(t) and θ_1k(t), which are allowed to change with time within the region, are constrained to second-order polynomials (A_k(t) = ∫C_k,1(t) dt + C_k,0 and θ_1k(t) = ∫D_k,1(t) dt + D_k,0). Then, by substituting dA_k(t)/dt = C_k,R(t) into Eq. (2), we end up with a linear differential equation in the input phase difference θ_k(t) = θ_2k(t) - θ_1k(t). Solving this equation gives the general solution

  \theta_k(t) = \arctan\left( \frac{S_k(t)\,\sin(\phi_k(t) - \theta_{1k}(t))}{S_k(t)\,\cos(\phi_k(t) - \theta_{1k}(t)) + C_k(t)} \right),    (4)

where C_k(t) = ∫C_k,R(t) dt + C_k,0 = -A_k(t) [7].

In each segment T_h - T_h-1 of the instrument duration, which can be determined from E_0,R(t) = 0, A_k(t), B_k(t), θ_1k(t), and θ_2k(t) are determined through the following steps. First, the estimated regions C_k,0(t) - P_k(t) ≤ C_k,1(t) ≤ C_k,0(t) + P_k(t) and D_k,0(t) - Q_k(t) ≤ D_k,1(t) ≤ D_k,0(t) + Q_k(t) are determined by using a Kalman filter, where C_k,0(t) and D_k,0(t) are the estimated values and P_k(t) and Q_k(t) are the estimation errors. Next, the candidates for C_k,1(t) at any D_k,1(t) are selected by using spline interpolation within the estimated error region. Then, C_k,1(t) is determined by

  \hat{C}_{k,1} = \arg\max_{C_{k,0} - P_k \le C_{k,1} \le C_{k,0} + P_k} \frac{\langle \hat{A}_k, A_{\mathrm{TMP},k} \rangle}{\|\hat{A}_k\|\,\|A_{\mathrm{TMP},k}\|},    (5)

where Â_k(t) is obtained through spline interpolation and A_TMP,k(t) is a template such as the one shown in Fig. 3. Finally, D_k,1(t) is determined by

  \hat{D}_{k,1} = \arg\max_{D_{k,0} - Q_k \le D_{k,1} \le D_{k,0} + Q_k} \frac{\langle \hat{A}_k, A_{\mathrm{TMP},k} \rangle}{\|\hat{A}_k\|\,\|A_{\mathrm{TMP},k}\|}.    (6)

The difference between our proposed model and the previous model [7] is that we use the template A_TMP,k(t) instead of the averaged A_k(t). These equations allow us to determine a unique solution from among the candidates. Since θ_1k(t) and θ_k(t) are determined from D_k,1(t) and C_k,1(t), we can determine A_k(t), B_k(t), and θ_2k(t) from Eq. (2), Eq. (3), and θ_2k(t) = θ_k(t) + θ_1k(t), respectively.

The grouping block merges the instantaneous amplitudes A_k(t) and phases θ_1k(t) in the concurrent time-frequency region of the target using constraints (i) and (iii) from Table 1, and then reconstructs them into the segregated signal f̂_1(t).
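For concreteness, the sketch below transcribes the separation arithmetic of Eqs. (2)-(3) and a simple resynthesis for the grouping block. The estimation of θ_1k(t) and θ_k(t) via the Kalman filter, spline interpolation, and the template search of Eqs. (5)-(6) is omitted, and the function names are our own.

```python
import numpy as np

def separate_channel(S_k, phi_k, theta_1k, theta_k, eps=1e-8):
    """Eqs. (2)-(3): recover A_k(t) and B_k(t) from the channel output S_k(t), phi_k(t)
    and the estimated phases theta_1k(t) and theta_k(t) = theta_2k(t) - theta_1k(t)."""
    theta_2k = theta_k + theta_1k
    denom = np.sin(theta_k)
    denom = np.where(np.abs(denom) < eps, eps, denom)   # numerical guard: theta_k != n*pi
    A_k = S_k * np.sin(theta_2k - phi_k) / denom        # Eq. (2): target component
    B_k = S_k * np.sin(phi_k - theta_1k) / denom        # Eq. (3): residual mixture
    return A_k, B_k

def regroup(A, theta_1, centers, fs=20000.0):
    """Grouping block (simplified): resynthesize f1_hat(t) by summing the segregated
    amplitude/phase pairs of all channels on their carrier frequencies."""
    t = np.arange(A.shape[1]) / fs
    f1_hat = np.zeros(A.shape[1])
    for A_k, th_1k, wk in zip(A, theta_1, centers):
        f1_hat += A_k * np.cos(2.0 * np.pi * wk * t + th_1k)
    return f1_hat
```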
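For reference, a direct transcription of the two measures in Eqs. (7) and (8) for discrete-time signals is given below (the function names are ours).

```python
import numpy as np

def snr_db(f1, f1_hat):
    """Eq. (7): segregation SNR between the original target f1(t) and its estimate."""
    return 10.0 * np.log10(np.sum(f1 ** 2) / np.sum((f1 - f1_hat) ** 2))

def precision_db(A, A_hat):
    """Eq. (8): precision over all K channels; A and A_hat are shaped (K, n_samples)
    and hold the true and segregated instantaneous amplitudes A_k(t)."""
    return 10.0 * np.log10(np.sum(A ** 2) / np.sum((A - A_hat) ** 2))
```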

[Figure 4: Segregation accuracy when segregating a piano sound from a mixed sound: (a) SNR and (b) precision.]

The results of the first simulation for piano (G3) are shown in Fig. 4, where f(t) was the target piano (G3) sound mixed with flute (A4), violin (C4), and horn (Eb2). For example, when the SNR of the mixed signal was 0 dB, the model improved the SNR by about 12 dB relative to f(t), and improved the SNR by about 2 dB and the precision by about 5 dB, as segregation accuracy, relative to the top-down processing. This comparison shows the importance of separating each component from the overlapped components within each channel. These results show that the proposed model can selectively segregate the target, using the key of the target sound, with high accuracy.

For the other target sounds (flute, horn, violin), the results were similar to those shown in Fig. 4. When the SNR of the mixed signal was 0 dB, the SNRs for the flute, horn, and violin sounds improved by about 16.7 dB, 7.3 dB, and 13.6 dB, respectively, relative to f(t), and the SNR improved by about 2.0 dB, 3.6 dB, and 0.9 dB and the precision by about 9.9 dB, 9.3 dB, and 0.3 dB, as segregation accuracy, relative to the top-down processing.

Next, to demonstrate that the proposed model can be applied to a realistic problem in which a target performance must be segregated from a mixed sound, as shown in Fig. 1 (a typical situation resulting from the cocktail party effect), we carried out the following simulation. The original signals were as follows: the target f_1(t) was a piano performance of "chu-rippu" (six notes: CDECDE), f_2(t) was a flute performance of "kirakiraboshi" (seven notes: CCGGAAG), f_3(t) was a violin performance of "choucho" (six notes: GEEFEE), and f_4(t) was white noise. Except for f_4(t), these were musical sounds taken from Japanese songs. The inputs were the mixed signal f(t) = f_1(t) + f_2(t) + f_3(t) + f_4(t) and the keys, namely the symbol (piano) and the notes (CDECDE, without any timing information) of the target. The task was to selectively segregate the target sound (the piano performance of "chu-rippu") from f(t).

[Figure 5: Overview of signal processing for the proposed model.]

Figure 5 shows an example of the signal processing of the proposed model for this task. In this figure, panels A and B show the original signals and the mixed signal f(t) at an SNR of 0 dB, respectively. The instantaneous amplitudes S_k(t) and phases φ_k(t) (panel C) are decomposed from f(t) using the filterbank, and the F-note candidates (panel D) are then extracted from the S_k(t)s. The template of the target sound (panel E) is generated from the knowledge sources using the keys. The segregated amplitudes A_k(t) (panel F) and phases θ_1k(t) are obtained from S_k(t) and φ_k(t) using the constraints and the template, and the selectively segregated signal f̂_1(t) is then reconstructed by the grouping block.

In this simulation, the proposed model improved the SNR by about 10.6 dB relative to f(t). Moreover, the accuracy of the segregated target sound was improved by about 1.5 dB in terms of SNR and by about 2 dB in terms of precision, compared with the top-down processing. In this simulation, it was difficult to selectively segregate the target sound from the mixed sound using bottom-up processing without some prior information. We have thus shown that our proposed model can be used to selectively segregate the sound of a target musical instrument performance from a mixture of various sounds, such as one resulting from the cocktail party effect.
4. Conclusions

In this paper, we have proposed a selective sound segregation model that combines top-down and bottom-up processing. We carried out two segregation simulations to evaluate the proposed model: one in which a target sound was segregated from a mixture of four instrument sounds, and one in which a target musical performance was segregated from mixed musical performances. The results of the first simulation showed that the model can selectively segregate a target instrument sound from a mixture of various sounds with high accuracy. They also showed that combining top-down and bottom-up processing is useful for selective sound segregation. The results of the second simulation showed that the proposed model can be applied to a more realistic sound segregation problem, such as the sort of situation that results from the cocktail party effect. These advantages make the model applicable as preprocessing for a music scene analysis system and for replaying a target sound. The model, therefore, can also be adapted to computationally model the mechanisms of the human selective hearing system. In future work, we hope to establish a means of constructing a standard template for any instrument sound (e.g., optimization between the template and a real sound) and to adapt the model to various musical performance sounds.

5. Acknowledgment

This work was supported by a Grant-in-Aid for Science Research from the Ministry of Education (No. 14780267).

6. References

[1] Bregman, A. S., "Auditory Scene Analysis: Hearing in complex environments," in Thinking in Sound, pp. 10-36, Oxford University Press, New York, 1993.
[2] Cooke, M. and Ellis, D. P. W., "The auditory organization of speech and other sources in listeners and computational models," Speech Communication, vol. 35, no. 3, pp. 141-177, Oct. 2001.
[3] Cherry, E. C., "Some experiments on the recognition of speech, with one and with two ears," J. Acoust. Soc. Am., vol. 25, pp. 975-979, 1953.
[4] Ellis, D. P. W., "Prediction-driven computational auditory scene analysis," Ph.D. thesis, MIT Media Lab, 1996.
[5] Goto, M., "F0 estimation of melody and bass lines in musical audio signals," IEICE Trans. D-II, vol. J84-D-II, no. 1, pp. 12-22, Jan. 2001.
[6] Kinoshita, T., Sakai, S., and Tanaka, S., "Musical source identification based on frequency component features," IEICE Trans. D-II, vol. J83-D-II, no. 4, pp. 1073-1081, April 2000.
[7] Unoki, M. and Akagi, M., "Signal extraction from noisy signal based on auditory scene analysis," Speech Communication, vol. 27, no. 3, pp. 261-279, April 1999.