Blind Decomposition of Concurrent Sounds

Mamoru UEDA    Shuji HASHIMOTO
Department of Applied Physics, School of Science and Engineering, WASEDA UNIVERSITY
3-4-1 Okubo, Shinjuku-ku, Tokyo 169, JAPAN
E-mail: shujivax@cfi.waseda.ac.jp

Abstract

This paper describes a signal-processing approach to segregating concurrent sounds without using information that is peculiar to each sound source. Source-specific information (such as sound models or the localization of the sound sources) can be useful for solving the sound source separation problem, but such information is not always necessary for human performance in sound segregation tasks. It is therefore interesting to consider the sound source separation problem under a blind condition. In this paper we regard sound source separation as a blind decomposition problem: the estimation of constituents from observation of their sum alone. Although the blind decomposition problem cannot be solved completely in general, it can be solved under some adequate and practical conditions that are common to all sounds. We describe the necessary conditions and a simple, effective algorithm for the blind decomposition of monaural concurrent sounds in the case where two sound sources are present simultaneously. An experimental system based on this algorithm was constructed and tested. Experiments with both test data and actual musical-instrument data confirmed the validity of the algorithm.

1. Introduction

When hearing sounds generated by many sources at the same time, human listeners can separate them and extract the individual sound signals. For example, even when we listen to someone speaking in a noisy situation, such as a party where many people are chattering here and there, we can extract the meaning of what the speaker says and keep on conversing. This is a popular example of human sound segregation, known as "the cocktail party problem" [Mitchell, Ross, and Yates, 1971]. Moreover, the human auditory system can segregate sounds in more demanding cases:

Case (1): when we hear sounds monaurally.
Case (2): when we hear sounds that we have never heard before.
Case (3): when the sounds are not only harmonic but also non-harmonic.

As an example of Case (1), when we listen to a monaural radio with a single loudspeaker, we separate the sounds and enjoy the ensemble music, or extract the announcer's talk while hearing the background music. This indicates that we can separate and identify concurrent monaural sounds. As an example of Case (2), even when someone we have never met before is speaking with others, we can extract what the speaker says. This shows that we can segregate sounds that we have never heard before, and suggests that we may separate sounds without models or templates for every sound. Regarding Case (3), a signal consisting of a sum of sine waves whose frequencies are integer multiples of the lowest frequency (the so-called fundamental) is said to be harmonic [Nehorai and Porat, 1986]. Vowels and most pitched musical sounds, such as those of the piano, trumpet, and clarinet, are harmonic. Voiceless consonants and unpitched musical sounds, such as cymbals and some kinds of percussion, are non-harmonic.
Even if a mixed sound contains non-harmonic components, we can identify them: not only harmonic sounds but also non-harmonic sounds can be picked out of a mixture, which illustrates Case (3). Information such as the location or the model of each sound source is peculiar to that sound source.

Cases (1), (2), and (3) above show that such characteristics are not always necessary for a human to segregate sounds. We call this the blind condition: the separation must succeed whether or not cues particular to each sound source, such as localization, source models, or harmonicity, are available.

The sound source separation problem has been studied over the past years in various areas, and several studies indicate that source-specific cues are helpful to a sound source separation system. Approaches that estimate the localization of sound sources with two or more microphones [Mitchell, Ross, and Yates, 1971] [Fujinaga, Alphonce, Pennycook, Diener, 1992] have shown effective results. Approaches using comb filters to segregate harmonic sounds or vowels [Nehorai and Porat, 1986], approaches using sound models [Kashino and Tanaka, 1993], and others have also been presented. These approaches rely on conditions peculiar to each sound source, so they do not operate under the blind condition; they may fail to separate sounds that do not meet their conditions, even though human listeners segregate such sounds easily.

To consider the sound source separation problem under the blind condition, we propose the idea of blind decomposition. We believe this study is basic research toward a sound source separation system capable of human-level performance in sound segregation tasks. Algorithms based on blind decomposition are of interest not only for processing musical sounds but also for man-machine interfaces, speech recognition systems, robotics, and so on. We propose a simple and practical algorithm to segregate a monaural input that consists of two sounds.

2. Blind Decomposition Problem

The blind decomposition problem is a kind of inverse problem: the estimation of constituents by observing only their sum. We define it as follows. Let S be the sum of real numbers C_i:

    S = C_1 + C_2 + \cdots + C_M = \sum_{i=1}^{M} C_i    (Eq. 2-1)

where S and the C_i are real numbers. Estimating the C_i by decomposing S, without any information on the C_i, is the blind decomposition problem. If the constituents C_i are completely unknown, the problem cannot be solved: it has many solutions, so no unique solution can be obtained. For example, 10 may be the sum of 1 and 9, or the sum of 2.5 and 7.5, and so on. This kind of problem is an ill-conditioned inverse problem.

Sound source separation can likewise be seen as an inverse problem. Just by hearing, we can recognize what has happened in our environment, for example that an ambulance has passed a motorcycle, or that something stiff like a glass has fallen to the floor and broken. This observation shows that the human auditory system solves the sound source separation problem, which is a kind of inverse problem. Since we segregate sounds under the blind condition, it is interesting to treat sound segregation as a signal-processing problem. To this end we model the sound signals as follows. The amplitude of the mixed sound at time t is the sum of the amplitudes of the constituent sounds at time t:

    S(t) = C_1(t) + C_2(t) + \cdots + C_M(t) = \sum_{i=1}^{M} C_i(t)    (Eq. 2-2)

where S(t) represents the amplitude of the mixed sound at time t and C_i(t) represents the amplitude of sound source i at time t. The sound source separation problem is to decompose S(t) into the individual C_i(t), where the C_i(t) are unknown.
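To make the mixing model concrete, here is a minimal Python sketch of Equation 2-2 (the tones, amplitudes, and variable names are illustrative choices, not from the paper):

```python
import numpy as np

# Additive mixing model of Eq. 2-2: only the sum S(t) is observable.
fs = 20000                          # sampling rate in Hz (as in Section 5)
t = np.arange(2048) / fs            # one 2048-point analysis frame

c1 = 0.6 * np.sin(2 * np.pi * 380 * t)            # source C1: a 380 Hz tone
c2 = 0.4 * np.sign(np.sin(2 * np.pi * 420 * t))   # source C2: a 420 Hz square wave

s = c1 + c2   # S(t) = C1(t) + C2(t); the task is to recover c1 and c2 from s alone
```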
If we use information that is peculiar to each sound source, the condition is not blind. It may nevertheless be possible to solve the blind decomposition problem by using information that is common to all the original sources. In particular, when the original sources are sound sources, there may be useful properties that are common to all sounds. Here, we define the blind condition as not using information peculiar to each sound source; using information that is common to all sounds places no limit on which sounds can be segregated. What kind of information on sounds can be used to solve the blind decomposition problem?

The power spectrum of the mixed sound can be written as the sum of the power spectra of each sound source plus the cross-spectrum of f(t) and g(t). Although the cross-correlation of two signals need not be small in general, it can be assumed to be quite small for sound signals: because their waveforms are complicated and differ from each other, the cross-spectrum is negligible enough that the power spectrum of the mixed sound is the sum of the power spectra of the sound sources.

We made a simple experiment to confirm Condition (1). Two sound signals f(t) and g(t) were recorded separately. The sum of their power spectra S_f(ω) and S_g(ω) was compared with the power spectrum H(ω)H*(ω) computed from h(t) = f(t) + g(t). The latter contains the cross-spectrum; the former does not. If the two are equal, the cross-spectrum is negligible. We performed this experiment using actual musical instruments, and the two power spectra were almost always the same.

Regarding Condition (2), the condition may not hold over a long time interval, but we consider it to hold over a short time. Concerning Condition (3), there may be cases where it does not apply, but we expect such cases to be rare among the sounds of speech or musical instruments.
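The following minimal sketch mirrors that check, with synthetic stand-ins for the instrument recordings (which we do not have). It compares the power spectrum of a sum against the sum of the power spectra; the difference is the cross-spectrum term 2 Re{F(ω)G*(ω)}:

```python
import numpy as np

# Check of Condition (1): is the cross-spectrum small enough that the
# power spectrum of the mixture equals the sum of the sources' power
# spectra? f and g are stand-ins for two separately recorded sounds.
fs, n = 20000, 2048
t = np.arange(n) / fs
w = np.hanning(n)

f = w * np.sin(2 * np.pi * 380 * t)          # stand-in for sound source F
g = w * np.sin(2 * np.pi * 420 * t)          # stand-in for sound source G

F, G = np.fft.rfft(f), np.fft.rfft(g)
H = np.fft.rfft(f + g)

sum_of_powers = np.abs(F) ** 2 + np.abs(G) ** 2
power_of_sum = np.abs(H) ** 2                # includes the cross-spectrum term
ratio = np.abs(power_of_sum - sum_of_powers).sum() / sum_of_powers.sum()
print(f"relative size of cross-term: {ratio:.4f}")  # small when spectra barely overlap
```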
4. Algorithm

In this section, we describe the algorithm for separating the spectrum of a mixed sound that satisfies the above-mentioned necessary conditions. The aim of the algorithm is to obtain the power spectra of the sound sources, S_f(ω) and S_g(ω), and their rates of variation a(t) and b(t), from S_h(t, ω) in Equation 3-2. We assume that S_f(ω) and S_g(ω) stand for the spectra of the sound sources F and G at time t_0, and that a and b stand for the variation of each spectrum from time t_0 to t_1. The short-time spectra of the mixed sound at times t_0 and t_1 are then:

    S_h(t_0, \omega_n) = S_f(\omega_n) + S_g(\omega_n)    (Eq. 4-1)
    S_h(t_1, \omega_n) = a \cdot S_f(\omega_n) + b \cdot S_g(\omega_n)    (Eq. 4-2)

with a ≠ b, a > 0, b > 0, where ω_n denotes the n-th discrete frequency. While S_f(ω_n), S_g(ω_n), a, and b are all unknown, we cannot obtain the solution from Equations 4-1 and 4-2 alone; if we had the values of a and b, we could solve them as simultaneous equations. To find a and b, we propose the following function K(p, ω_n), where the real number p (p > 0) is a parameter:

    K(p, \omega_n) = S_h(t_1, \omega_n) - p \cdot S_h(t_0, \omega_n)    (Eq. 4-3)

By Equations 4-1 and 4-2, this can be rewritten as:

    K(p, \omega_n) = (a - p) \cdot S_f(\omega_n) + (b - p) \cdot S_g(\omega_n)    (Eq. 4-4)

We can find a and b by varying p while observing the value of K(p, ω_n) at each frequency ω_n. In the following discussion we assume a < b. The power spectra S_f(ω_n) and S_g(ω_n) are never below zero at any frequency. Exploiting this, we increase p gradually from 0. While p is below a, both (a - p) and (b - p) are positive, so K(p, ω_n) is positive at every frequency. Just when p exceeds a, (a - p) becomes negative while (b - p) is still positive; since there exists a frequency ω_0 where S_f(ω_0) > 0 and S_g(ω_0) = 0, the value of K(p, ω_0) becomes negative there. In other words, a is the minimum p that makes K(p, ω_n) negative at some frequency. While p lies between a and b, K(p, ω_n) is positive at some frequencies and negative at others. Once p exceeds b, K(p, ω_n) is negative at every frequency where the mixture has power, so b is the minimum p that makes K(p, ω_n) negative everywhere. In this way we obtain the values of a and b, and can then solve the simultaneous Equations 4-1 and 4-2 for the power spectrum of each sound source:

    S_f(\omega_n) = \frac{S_h(t_1, \omega_n) - b \cdot S_h(t_0, \omega_n)}{a - b}    (Eq. 4-5)

    S_g(\omega_n) = \frac{S_h(t_1, \omega_n) - a \cdot S_h(t_0, \omega_n)}{b - a}    (Eq. 4-6)
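The p-sweep and the final solve fit in a few lines. The sketch below is one rendering of the procedure (the paper gives no code): taking the extreme ratios S_h(t_1, ω_n)/S_h(t_0, ω_n) is equivalent to the sweep described above, since K(p, ω_n) first turns negative at p = a and turns negative everywhere at p = b.

```python
import numpy as np

def blind_decompose(sh0, sh1, eps=1e-12):
    """Separate two sources from the mixture power spectra at t0 and t1.

    sh0, sh1 : nonnegative arrays, S_h(t0, w_n) and S_h(t1, w_n).
    Returns (sf, sg, a, b) per Equations 4-5 and 4-6. Sketch of the
    p-sweep: K(p, w) = sh1 - p*sh0 first dips negative once p exceeds
    a = min ratio, and is non-positive everywhere once p >= b = max
    ratio; Condition (3) guarantees each source owns some frequency bin.
    """
    valid = sh0 > eps                   # ignore bins with no mixture power
    ratio = sh1[valid] / sh0[valid]
    a, b = ratio.min(), ratio.max()     # rates of variation, a < b
    if np.isclose(a, b):
        raise ValueError("spectra changed at the same rate; cannot separate")
    sf = (sh1 - b * sh0) / (a - b)      # Eq. 4-5
    sg = (sh1 - a * sh0) / (b - a)      # Eq. 4-6
    return sf, sg, a, b
```

Note that the labels F and G are assigned by the rate of variation (a < b); which physical source each estimate corresponds to cannot be decided under the blind condition.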

5. Experimental results

We constructed an experimental system aimed at segregating two sounds. First, in order to investigate the algorithm, we tested the system using model data that meet the necessary conditions. Second, we tried to segregate actual sounds generated by musical instruments.

The outline of the experimental system is illustrated in Figure 1. The monaural audio signal is digitized at 12-bit resolution with a sampling rate of 20 kHz. Frequency analysis is performed with the FFT algorithm over every 2048 points. The system was built on a personal computer (NEC PC-98).

[Fig. 1: Configuration of the system]

5.2 Results using the test data

We ran experiments using test data that met the necessary conditions in order to evaluate the algorithm. A pair of model spectra of virtual sound sources, S_f(ω_n) and S_g(ω_n), were generated in the computer. Setting the time-change values a and b at random, we calculated the test data S_h(t_0, ω_n) and S_h(t_1, ω_n) according to Equations 4-1 and 4-2, fed them to the system, and compared the results with the original test data.
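As a hypothetical rerun of this test, assuming the blind_decompose sketch above is in scope (the sparse spectra and random rates a, b below are our own choices, built to satisfy Condition (3)):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1024

# Two synthetic "virtual source" power spectra: each has bins the other lacks.
sf_true = np.zeros(n)
sf_true[10:400:13] = rng.uniform(0.5, 1.0, size=30)   # sparse spectrum for F
sg_true = np.zeros(n)
sg_true[17:500:17] = rng.uniform(0.5, 1.0, size=29)   # sparse spectrum for G

a_true, b_true = sorted(rng.uniform(0.2, 2.0, size=2))
sh0 = sf_true + sg_true                      # Eq. 4-1: mixture at t0
sh1 = a_true * sf_true + b_true * sg_true    # Eq. 4-2: mixture at t1

sf_est, sg_est, a_est, b_est = blind_decompose(sh0, sh1)
print(np.abs(sf_est - sf_true).mean())       # ~0, matching the reported errors
```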

[Fig. 2: Experiment and results using the model data; the test spectra have different fundamental frequencies.]

In Figure 2, S_f(ω_n) and S_g(ω_n) are shown in (a) and (b); they are the model spectra of the virtual sounds. (c) is S_h(t_0, ω_n), the sum of S_f(ω_n) and S_g(ω_n), and (d) is S_h(t_1, ω_n), the sum of the same data multiplied by the time-change values a and b. We fed these to the system; the outputs S_f*(ω_n) and S_g*(ω_n) are shown in (e) and (f) of Figure 2. We define the estimation error of the separated spectrum, here for sound source F, as:

    E_f = \frac{1}{N} \sum_{n=1}^{N} \left| S_f^{*}(\omega_n) - S_f(\omega_n) \right|    (Eq. 5-1)

The errors E_f and E_g were almost always zero (under 0.001). This shows that the algorithm separates sounds under the blind condition when the sounds meet the necessary conditions.
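Assuming Equation 5-1 denotes the mean absolute deviation between the separated and the original spectrum (our reading, consistent with the reported values), the metric is one line:

```python
import numpy as np

def estimation_error(s_est, s_true):
    # Our reading of Eq. 5-1: mean absolute deviation between the
    # separated spectrum S* and the original spectrum S over N bins.
    return np.mean(np.abs(np.asarray(s_est) - np.asarray(s_true)))
```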

Next, we used virtual models whose spectra were continuous in frequency. In Figure 3, (a) and (b) are the models of the non-harmonic spectra, and (c) and (d) represent S_h(t_0, ω_n) and S_h(t_1, ω_n). We fed them into the system and obtained the outputs shown in (e) and (f) of Figure 3. The estimation errors E_f and E_g were under 0.001. These experimental results verify that the algorithm separates both harmonic and non-harmonic sounds, as long as the necessary conditions hold.

[Fig. 3: Experiment and results using the model data; the test spectra have the same fundamental frequencies.]

Subsequently, we tried to separate test data whose model spectra were taken from actual sounds instead of virtual ones. In Figure 4, (a) is the spectrum of a French horn note (f_0 = 380 Hz) and (b) is the spectrum of a violin note (f_0 = 420 Hz). (c) is S_h(t_0, ω_n), generated in the computer by summing (a) and (b), and (d) is S_h(t_1, ω_n), the sum of the same data multiplied by the time-change values a and b. We fed these into the separation system and obtained the results shown in (e) and (f) of Figure 4. The estimation errors E_f of S_f*(ω_n) and E_g of S_g*(ω_n) were under 0.01.

[Fig. 4: Experiment and results using test data calculated on the computer from actual sound spectra.]

We also tried to separate test data made from the model spectra of actual sounds whose fundamental frequencies were the same. In Figure 5, (a) is the spectrum of a French horn note (f_0 = 380 Hz) and (b) is the spectrum of a violin note (f_0 = 380 Hz). (c) and (d) are S_h(t_0, ω_n) and S_h(t_1, ω_n), generated according to Equations 4-1 and 4-2. We fed them into the system and obtained the outputs S_f*(ω_n) and S_g*(ω_n), shown in (e) and (f) of Figure 5. The errors E_f and E_g were under 0.05.

[Fig. 5: Experiment and results using test data calculated from actual sound spectra with the same fundamental frequency.]

Although it is difficult to separate sounds that have the same fundamental frequency using comb filters, the algorithm proposed in this paper can segregate them. These two experimental results also suggest that actual musical sounds meet Condition (3) (there exists a frequency where one of the two spectra is zero) in many cases. These observations show that the algorithm separates sounds when they meet the necessary conditions.

5.3 Results using the actual data

We confirmed that the algorithm is effective in separating sounds that meet the necessary conditions.

In this subsection, we describe the experimental results using actual sounds. We recorded the sounds of two musical instruments with one microphone, transformed them into short-time power spectra, and fed these to the system. In Figure 6, (a) and (b) are the spectra of the concurrent sounds, S_h(t_0, ω_n) and S_h(t_1, ω_n), at times t_0 and t_1. The sound sources were a horn (f_0 = 380 Hz) and a violin (f_0 = 420 Hz). (c) and (d) are the outputs of the system, S_f*(ω_n) and S_g*(ω_n). (e) and (f) are the original spectra of the individual sound sources, S_f(ω_n) and S_g(ω_n); they were not used by the separation system and are shown only to confirm the results. Figure 6 shows that the system separated an actual sound consisting of two musical instruments: the results exhibit the features of each sound source. The error E_f of (c) and E_g of (d) were under 0.2.

[Fig. 6: Experiment and results using actual sounds. The sound sources are a horn (f_0 = 380 Hz) and a violin (f_0 = 420 Hz); the source spectra are shown for confirmation.]

Among the results using actual data, there were poorly separated cases in which the estimation error exceeded 0.4; Figure 7 shows one example. Although we sometimes obtained poorly separated results, in most cases we could separate actual sounds, even those with the same fundamental frequencies.

[Fig. 7: Results using actual data; an example of a poorly separated case.]

We also ran the experiment using actual sounds whose fundamental frequencies were the same. In Figure 8, (c) and (d) are the results of the separation system for actual sounds consisting of a horn (f_0 = 380 Hz) and a violin (f_0 = 380 Hz); (e) and (f) are the spectra of the individual sound sources, shown for confirmation. The estimation errors E_f and E_g in Figure 8 were under 0.3.

[Fig. 8: Experiment and results using actual sounds. The sound sources are a horn (f_0 = 380 Hz) and a violin (f_0 = 380 Hz).]

Additionally, we tried to separate an actual sound consisting of two human voices containing the Japanese vowels "a" and "i". The system separated them adequately to some degree: although the separation was not perfect, the result clearly enhanced one voice of the pair.

6. Conclusions

We proposed an algorithm for segregating two concurrent sounds under the blind condition, given three necessary conditions. A system based on the proposed algorithm separates test data perfectly when the sounds meet the necessary conditions. The results using actual sounds show that the system can be applied to the segregation of real sounds, although in this case its reliability remains a problem. We consider the reasons for the poor reliability to be as follows:

A) In actual sounds, Condition (2) (that the power spectrum changes at the same rate at all frequencies) may not always hold.
B) The actual data contain noise.

It may be possible to make the data meet the necessary conditions by adjusting the length of the time window and by averaging the signal more thoroughly (one possible averaging scheme is sketched below). In future work, we intend to extend the proposed algorithm to real-time decomposition of concurrent sounds. We also plan to apply this algorithm to speech recognition systems. The algorithm should also be effective for sound measurement and instrumentation in various fields.
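One concrete reading of improvement B, offered as an assumption rather than the authors' method: average several overlapping short-time power spectra (Welch-style) before separation, so that noise and residual cross-terms are smoothed out.

```python
import numpy as np

def averaged_power_spectrum(x, start=0, n_fft=2048, hop=512, n_frames=8):
    """Welch-style average of the power spectra of n_frames overlapping
    windowed frames of signal x, starting at sample index `start`.
    A sketch of improvement (B); the paper does not specify the procedure.
    Assumes x has at least start + (n_frames - 1) * hop + n_fft samples."""
    w = np.hanning(n_fft)
    acc = np.zeros(n_fft // 2 + 1)
    for k in range(n_frames):
        frame = x[start + k * hop : start + k * hop + n_fft]
        acc += np.abs(np.fft.rfft(w * frame)) ** 2
    return acc / n_frames
```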
References

[Lea and Summerfield, 1992] Andrew P. Lea and Quentin Summerfield, "Monaural Segregation of Competing Voices", ATR Human Information Processing Research Laboratories, 1992.
[Stubbs and Summerfield, 1990] Richard J. Stubbs and Quentin Summerfield, "Algorithms for separating the speech of interfering talkers: Evaluations with voiced sentences, and normal-hearing and hearing-impaired listeners", J. Acoust. Soc. Am. 87 (1), January 1990.
[Kashino and Tanaka, 1993] Kunio Kashino and Hidehiko Tanaka, "A Sound Source Separation System with the Ability of Automatic Tone Modeling", Proceedings of ICMC 1993, pp. 248-255.
[Nehorai and Porat, 1986] Arye Nehorai and Boaz Porat, "Adaptive Comb Filtering for Harmonic Signal Enhancement", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 34, No. 5, 1986.
[Ghitza, 1994] Oded Ghitza, "Auditory Models and Human Performance in Tasks Related to Speech Coding and Speech Recognition", IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 1, Part 2, January 1994.
[McAdams, 1989] Stephen McAdams, "Segregation of concurrent sounds. I: Effects of frequency modulation coherence", J. Acoust. Soc. Am. 86 (6), December 1989.
[Assmann and Summerfield, 1990] Peter F. Assmann and Quentin Summerfield, "Modeling the perception of concurrent vowels: Vowels with different fundamental frequencies", J. Acoust. Soc. Am. 88 (2), August 1990.
[Moore, 1983] B. C. J. Moore, "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns", J. Acoust. Soc. Am., 1983.
[Mitchell, Ross, and Yates, 1971] O. M. E. Mitchell, C. A. Ross, and G. H. Yates, "Signal processing for a cocktail party effect", J. Acoust. Soc. Am., Vol. 50, No. 2, 1971.
[Fujinaga, Alphonce, Pennycook, Diener, 1992] Ichiro Fujinaga, Bo Alphonce, Bruce Pennycook, and Kharim Hogan, "Optical music recognition", Proceedings of ICMC 1992, pp. 117-120.
[Ueda, Hashimoto, and Ohteru, 1993] Mamoru Ueda, Shuji Hashimoto, and Sadamu Ohteru, "On Sound Signal Separation using Short-Term Spectrum", Proceedings of the 46th National Conference of the Information Processing Society of Japan, 1993.