ADAPTIVE KARAOKE SYSTEM
- Human Singing Accompaniment Based on Speech Recognition -

Wataru INOUE, Shuji HASHIMOTO, Sadamu OHTERU
Department of Applied Physics, School of Science and Engineering, WASEDA UNIVERSITY
3-4-1, Okubo, Shinjuku-ku, Tokyo 169, JAPAN
E-mail: shujivax@cfi.waseda.ac.jp

ABSTRACT

This paper is concerned with an adaptive automated accompaniment and modulation system for human singing based on speech recognition. One purpose of our system is to produce an adaptive accompaniment that follows the human singing in real time, so that singers can change the singing tempo according to their own emotions. The other purpose is to control the pitch of the singing voice in real time: when one sings out of tune, the system monitors the pitch of the singing and adjusts it to the correct one.

1 Introduction

In music performance systems, real-time processing is a very important theme, as music is an art on the time axis. We have therefore been engaged in real-time man-machine interfaces for musical performance [Morita et al., 1991] [Sato et al., 1991] [Harada et al., 1992]. Research on automated accompaniment usually involves interesting investigations in which real-time processing is required, and much has been reported on this matter [Dannenberg, 1984] [Dannenberg and Mont-Reynaud, 1987] [Naoi et al., 1989] [Wake et al., 1992] [Horiuchi et al., 1992]. Almost all of these systems concentrate on the automated accompaniment itself, that is, on generating the accompaniment at the adequate timing with a suitable arrangement. However, the human part is played on a keyboard to spare the system from physical sound-signal processing such as pitch extraction. On the other hand, there are few reports on automated accompaniment systems which allow acoustic sound inputs [Vercoe, 1984] [Vercoe and Puckette, 1985] [Katayose et al., 1993], although sound processing in the field of acoustics has been widely reported [Yamaguchi and Ando, 1977] [Niihara et al., 1986] [Kuhn, 1990]. The Pitch-to-MIDI equipment that has become commercially available is insufficient, especially for the vocal sound. As the human voice is more complicated than instrument sounds, abundant problems have to be overcome.

Recently, an accompaniment system called "KARAOKE", a sort of MMO (Music Minus One) developed in Japan, has become popular throughout the world. However, in KARAOKE, singers must sing at the tempo of the recorded accompaniment; they cannot sing at their favorite tempo.

Figure 1: Flow diagram of the system.

We have already introduced "A Computer Music System for Human Singing" at ICMC 1993 [Inoue et al., 1993]. One purpose of our system is to produce an adaptive accompaniment that follows the human singing in real time, so that singers can change the singing tempo according to their own emotions. The other purpose is to control the pitch of the singing voice in real time: when one sings out of tune, the system monitors the singing pitch and adjusts it to the correct one. The previous system, however, had a problem: when one sings out of tune, the system cannot obtain information about the measure of the melody score currently being sung (and therefore cannot predict the singing tempo), because it obtains the tempo information by comparing the singing pitch with the stored melody data.

In the system proposed here, we introduce speech recognition. The system can then identify the measure being sung (that is, the singing tempo) by comparing the results of speech recognition with the lyrics of the song, and can also determine the difference between the singing pitch and the pitch of the melody score from the results of pitch extraction. Consequently, even if one sings out of tune, the system can adaptively generate the accompaniment and adjust the singing pitch to the correct one at the same time.

2 System Configuration

Figure 1 shows a flow diagram of the system. The system hardware consists mainly of a personal computer (NEC PC9801RA) to control the entire system, a digital signal processor (DSP) unit (Mitec Corp. MSP77230 with the NEC DSP chip µPD77230) for speech recognition, a Pitch-to-MIDI Converter (Roland CP-40), a digital sound processor (YAMAHA SPX90 II) to compensate the singing voice, and a MIDI instrument (E-mu Systems, Inc. Proteus) to generate the accompaniment. The scores of the melody and accompaniment and the lyrics of the song are maintained as a knowledge base in the system software.

The system first analyzes the singing voice and then recognizes the vowels (the five Japanese vowels /a/, /i/, /u/, /e/ and /o/) with the help of the DSP unit in real time. The personal computer monitors the vowels of the singing voice by comparing them with the lyrics of the song, and obtains information about the measure of the melody score currently being sung. From this information, the system estimates the singing tempo. The tempo of the accompaniment is adjusted to the detected singing tempo by a linear prediction method using the man-machine interaction model. Thus, the system adaptively generates the accompaniment on the MIDI instrument according to the established tempo. In order to automatically compensate the singing voice in real time, the singing pitch is detected with the help of the Pitch-to-MIDI Converter as a MIDI note number. As the system is tracking the lyrics of the song being sung, it is possible to detect the difference between the singing pitch and the melody score. The singing voice is corrected to the pitch of the melody score by the digital sound processor according to the detected difference.

3 Singing Voice Analysis

3.1 Speech Recognition

The system recognizes only the vowels, excluding consonants, because the singing voice must be analyzed in real time for the automated accompaniment. The recognition in this system is performed under the conditions of continuous speech, speaker dependency, and phoneme recognition. The system prepares reference patterns for each singer to perform speaker-dependent recognition. The feature of the patterns is extracted as a spectrum envelope by the cepstrum method with the help of the DSP unit in real time, as shown in Figure 2. The singing voice is quantized to 12-bit digital data at a sampling rate of 8 kHz with a 64-millisecond Hamming window (512 = 2N points). Let g(n) (1 <= n <= 2N) stand for the sampled data of the singing voice cut out with the Hamming window.
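A minimal sketch of this framing step is shown below; numpy stands in for the DSP unit, and the constant and function names are ours rather than the paper's:

```python
import numpy as np

FS = 8000      # sampling rate [Hz]
FRAME = 512    # 64 ms window = 512 samples = 2N points (N = 256)

def window_frame(voice, start):
    """Cut one analysis frame g(n), n = 1..2N, from a 1-D sample array
    and apply the Hamming window."""
    return voice[start:start + FRAME] * np.hamming(FRAME)
```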

Figure 2: Cepstrum method (sampling frequency 8 kHz, 64 ms time window, 512-point FFT, log spectrum G(j), IFFT, Hamming lifter, spectrum envelope; the unknown pattern B is matched against the reference patterns A).

The system obtains the magnitude E of g(n) as follows:

    E = \sum_{n=1}^{2N} g(n)^2    (1)

The singing voice is sorted into 'vowel', 'non-vowel' and 'no sound' by threshold values of the magnitude E. In the case of a 'vowel', the system obtains an N-point logarithmic power spectrum G(j) (1 <= j <= N) from g(n) by the Fast Fourier Transform (FFT), and obtains the cepstrum C(q) from G(j) by the Inverse Fast Fourier Transform (IFFT). The spectrum envelope P(j) is then calculated by applying the FFT to the low-quefrency part of C(q), which is cut out with the Hamming lifter. P(j) is normalized as follows:

    P'(j) = \frac{P(j)}{\sum_{k=1}^{N} P(k)}    (j = 1, 2, ..., N)    (2)

The extracted feature P(j) is expressed as the vector P = (P(1), P(2), ..., P(N)), and the prime { ' } indicates patterns normalized by Equation (2). The distances D_i between the reference pattern of each vowel, A'_i, and the unknown pattern B' are calculated as follows (i = 0: /a/, i = 1: /i/, i = 2: /u/, i = 3: /e/ and i = 4: /o/; that is, A'_0 is the reference pattern of /a/, D_0 is the distance between A'_0 and B', and so on):

    D_i = \sum_{j=1}^{N} \left( A'_i(j) - B'(j) \right)^2    (3)

The vowel i for which D_i is minimal is taken as the result of the recognition, R (= i). Therefore, the final result of the recognition is as follows:

- vowel: R = 0, 1, ..., 4
- non-vowel: R = 5
- no sound: R = 6

The system currently performs this recognition process approximately every 100 milliseconds, obtaining the results one after another. The results of the recognition are sent to the personal computer, which monitors the progress of the singing.
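A rough numpy sketch of this recognition step is given below. The lifter length, the two energy thresholds, and the Euclidean normalization inside `spectrum_envelope` are illustrative assumptions standing in for the details implemented on the DSP unit:

```python
import numpy as np

N = 256                              # half of the 512-point frame
VOWELS = ('a', 'i', 'u', 'e', 'o')   # R = 0..4; R = 5: non-vowel, R = 6: no sound
LIFTER = 30                          # length of the low-quefrency cut (illustrative value)

def spectrum_envelope(g):
    """Cepstrum-smoothed spectrum envelope of one windowed frame g."""
    G = np.log(np.abs(np.fft.fft(g))[:N] ** 2 + 1e-12)  # N-point log power spectrum G(j)
    c = np.fft.ifft(G).real                             # cepstrum C(q)
    c[LIFTER:] = 0.0                                    # keep the low-quefrency part only
                                                        # (a plain cut instead of a Hamming lifter)
    P = np.fft.fft(c).real[:N]                          # smoothed envelope P(j)
    return P / np.linalg.norm(P)                        # normalization (Equation (2); a Euclidean
                                                        # norm is used here for numerical safety)

def recognize(g, references, e_sound=0.01, e_vowel=0.1):
    """Return R: 0..4 for /a/, /i/, /u/, /e/, /o/; 5 for non-vowel; 6 for no sound."""
    E = np.sum(g ** 2)                                  # magnitude, Equation (1)
    if E < e_sound:                                     # thresholds are illustrative values
        return 6
    if E < e_vowel:
        return 5
    B = spectrum_envelope(g)
    D = [np.sum((A - B) ** 2) for A in references]      # distances, Equation (3)
    return int(np.argmin(D))                            # vowel with the minimum distance
```

Here `references` would hold the five normalized envelopes obtained from the singer's registered reference utterances, one per vowel in the order /a/, /i/, /u/, /e/, /o/.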

3.2 Pitch Extraction

The singing voice is transformed into a MIDI signal ('MIDI note number', 'velocity', 'note on or off', etc.) by the Pitch-to-MIDI Converter (Roland CP-40) in real time. The singing pitch is quantized to the nearest halftone and sent to the personal computer as the MIDI signal. To improve the precision of the pitch extraction, the singing voice is filtered through a lowpass filter which eliminates the small peaks due to harmonics, and the filtered singing voice is fed to the Pitch-to-MIDI Converter. The cutoff frequency of the lowpass filter is fixed at 500 Hz for the male voice.

4 System Controller

4.1 Score Data

The score data of the melody and accompaniment consist of the MIDI note number, the length of the note, the MIDI velocity, a flag for the chord, and the MIDI channel. The data for one event is thus five bytes in total, short enough to leave sufficient time for processing. Table 1 shows an example of the score data. In the case of the melody score, the score data also contain the lyrics in addition to the above-mentioned data.

Table 1: An example of the score data.

    Note  Len.  Vel.  Flag  Ch.
    60     8    64    0     1
    60     4    64    0     1
    64     8    64    1     1
    67     8    64    3     2    } harmony
    64     4    64    1     1
    67     4    64    3     2    } harmony
    69     8    64    0     1
    69     4    64    0     1
    60    12    64    1     1
    64    12    64    2     2    } harmony
    67    12    64    3     3

4.2 Matching

The system obtains information about the measure of the melody score being sung by matching the new results of speech recognition with the lyrics of the song. Errors in the recognition results are eliminated as follows: the number of appearances F_i of each result (i = 0, 1, ..., 6) in the past (k + 1) results is calculated as

    F_i = \sum_{j=-k}^{0} W_j \, E(i, j)    (i = 0, 1, ..., 6)    (4)

where

    E(i, j) = 0 (i != R_j),  1 (i = R_j),

and R_j is the past (-j)-th result (0 <= R_j <= 6; R_0 denotes the newest result). The system chooses the result which appears most frequently in the past (k + 1) results as the final result at that time (that is, the i for which F_i is maximal). F_i is calculated using the weights W_j, so F_i is influenced more by the newer results among the past (k + 1) than by the older ones.

When this final result is consistent with the lyrics of the song (that is, with the vowel of the lyric syllable to be sung next), the system judges that the vowel is being sung and calculates the tempo of the singing. If the extracted singing tempo is quicker than the quickest tempo fixed by the system, the detection is judged to be wrong, and the system once more matches the new results of speech recognition against the same lyric. If the final result remains inconsistent with the lyrics for more than a preset number of times, the system generates the accompaniment at the slowest tempo fixed by the system, in order not to stop the performance.
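A small sketch of this weighted vote is shown below; the history length K and the weights W_j are illustrative values, and the expected vowel is assumed to be given as the integer code (0-4) of the next lyric syllable:

```python
from collections import deque

K = 4                                    # use the past K + 1 = 5 results (illustrative)
WEIGHTS = (1.0, 0.8, 0.6, 0.4, 0.2)      # W_0 (newest) ... W_-K (oldest), assumed values

history = deque(maxlen=K + 1)            # recent recognition results, newest appended last

def smoothed_result(new_result):
    """Equation (4): weighted vote over the past K + 1 recognition results R_j."""
    history.append(new_result)
    scores = [0.0] * 7                   # F_i for i = 0..6
    for age, r in enumerate(reversed(history)):   # age 0 corresponds to R_0, the newest result
        scores[r] += WEIGHTS[age]
    return max(range(7), key=scores.__getitem__)  # the i for which F_i is maximal

def matches_next_lyric(new_result, expected_vowel):
    """True when the smoothed result agrees with the vowel to be sung next."""
    return smoothed_result(new_result) == expected_vowel
```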

4.3 Automated Accompaniment

According to the extracted tempo of the singing, the system predicts the next singing tempo and schedules the timing of the accompaniment. The tempo of the accompaniment is adjusted to the predicted tempo of the singing by a linear prediction method using the man-machine interaction model [Sawada et al., 1992]. This man-machine interaction model is described as follows. Music is assumed to consist of unit notes. In the performance, let X(i) and Y(i) be the timing of the i-th part of the melody and the accompaniment score, let x(i) and y(i) express the tempo of the singing and of the accompaniment, let Δ(i) express the phase difference between the melody and the accompaniment at the synchronous point i, and let γ_x and γ_y express the influences of the phase difference. Then X(i+1) and Y(i+1) are expressed as follows:

    X(i+1) = X(i) + x(i) - \gamma_x \Delta(i)    (5)
    Y(i+1) = Y(i) + y(i) + \gamma_y \Delta(i)    (6)

where

    \Delta(i) = X(i) - Y(i).    (7)

Let F_x(i), x̂(i), ŷ(i), x_0(i), α and β be the singer's favorite tempo, the singing tempo predicted by the system, the accompaniment tempo as predicted by the singer, the default accompaniment tempo, and the coefficients of the mutual interaction, respectively. Then x(i) and y(i) are expressed as follows:

    x(i) = (1 - \alpha) F_x(i) + \alpha \hat{y}(i),    (8)
    y(i) = (1 - \beta) x_0(i) + \beta \hat{x}(i).    (9)

The tempo of the accompaniment is established by changing y(i). To obtain y(i), the system first calculates x(i) by Equation (8) and predicts x̂(i+1) by the linear prediction formula

    \hat{x}(i+1) = \sum_{j=1}^{m} c(j) \, x(i+1-j),    (10)

where c(j) is a prediction coefficient. The system currently uses the past two data points (that is, m = 2 in Equation (10)). The system next obtains y(i+1) by Equation (9) and Y(i+1) by Equation (6), and generates the accompaniment according to the timing Y(i+1). If the difference between x_0(i-1) and x(i-1) is small, β in Equation (9) is made smaller, that is, the accompaniment is generated according to the default tempo x_0(i). If the difference is large, β is made larger, that is, the tempo of the accompaniment is adaptively adjusted to the singing tempo.
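A compact sketch of one step of this model is given below. The coefficient values (ALPHA, BETA, GAMMA_Y and the two-tap predictor coefficients C) are illustrative assumptions, not the values used in the actual system:

```python
ALPHA, BETA = 0.5, 0.5    # mutual-interaction coefficients of Equations (8) and (9) (assumed)
GAMMA_Y = 0.3             # influence of the phase difference in Equation (6) (assumed)
C = (1.5, -0.5)           # linear-prediction coefficients c(1), c(2) for m = 2 (assumed)

def singer_tempo(favorite_tempo, y_hat):
    """Equation (8): the singer's tempo as a mixture of the favorite tempo and the
    accompaniment tempo the singer perceives."""
    return (1.0 - ALPHA) * favorite_tempo + ALPHA * y_hat

def accompaniment_step(X, Y, y, x_history, x0):
    """One synchronous point of the man-machine interaction model.

    X, Y      -- timing of the current melody and accompaniment events
    y         -- current accompaniment tempo y(i)
    x_history -- (x(i), x(i-1)): the two most recent singing tempos
    x0        -- default accompaniment tempo from the score
    Returns the timing Y(i+1) of the next accompaniment event and the tempo y(i+1).
    """
    delta = X - Y                                        # Equation (7)
    Y_next = Y + y + GAMMA_Y * delta                     # Equation (6)
    x_pred = sum(c * x for c, x in zip(C, x_history))    # Equation (10) with m = 2
    y_next = (1.0 - BETA) * x0 + BETA * x_pred           # Equation (9)
    return Y_next, y_next
```

In the paper, β is itself adapted: it is made smaller while the singer stays close to the default tempo and larger when the deviation grows, which this sketch does not model.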

4.4 Automated Pitch Modulation

The system monitors the difference between the melody score and the MIDI note number extracted by the Pitch-to-MIDI Converter. According to the pitch difference, the system adjusts the singing pitch to the correct one (that is, the pitch of the melody score) using the pitch change program of the digital sound processor (YAMAHA SPX90 II) in real time. The quantity of the pitch shift is the difference between the extracted MIDI note number and the parameter "BASE KEY" set in the digital sound processor. The digital sound processor changes the pitch in halftone steps within a range of ± one octave.

The tone color of the singing voice is noticeably distorted by the pitch shift of the digital sound processor. We analyzed the spectrum of the singing voice changed by the digital sound processor; the spectrum has spurious peaks around the fundamental and harmonic frequencies, which is considered to cause the tone color distortion. To restore the original tone color, we therefore used an equalizing filter on the output of the digital sound processor.

5 Experiments and Results

5.1 The Operation of the System

The system first makes the reference patterns of each vowel for each singer. If these reference patterns have been saved as files on a floppy disk or a hard disk, the system can obtain them by reading the pre-registered files instead of making new reference patterns. After making (or reading) the reference patterns, the system stands by for the start of the accompaniment. Then, as soon as some sound is input, the system starts the introduction of the music, adaptively generates the accompaniment and automatically compensates the singing voice.

5.2 Results

The results of the speech recognition are shown in Figure 3. The errors in the recognition results are eliminated by the method described in Section 4.2.

Figure 3: The results of speech recognition (raw and smoothed recognition results aligned with the lyrics "(k)i (r)a (k)i (r)a (h)i (k)a (r)u" over time; -: no input, *: consonant).

Figure 4 shows a typical result of the automated accompaniment. In this figure, the horizontal axis represents the progress of the singing, and the vertical axis the tempo of the singing and of the accompaniment. In the experiments, the phase difference between the singing tempo and the tempo of the accompaniment was small, and the system could satisfactorily adjust to the tempo of the human singing. In the pitch modulation, the system could adjust the singing pitch to the pitch of the melody score.

Figure 4: The results of the automated accompaniment.

6 Conclusions

We introduced speech recognition into the human singing accompaniment system to make the system more adaptive. Even when one sings out of tune, the system can monitor the progress of the singing by using the lyrics of the song; that is, it can adaptively generate the accompaniment to follow the human singing and can also compensate the singing voice. Although the experimental results were promising, we found that the response of the system was often insufficient because of the time delay of the speech recognition. We think that this problem can be solved by decreasing the number of FFT points. To obtain information about the singing, the previous system used the results of the pitch extraction, while the present system uses the results of the speech recognition.

The future system will become more robust and responsive by using both the pitch extraction and the speech recognition effectively, to make the system free from mistakes in the lyrics of the song and from environmental noise such as a call thrown in to enliven the song.

References

[Morita et al., 1991] Morita, H., Ohteru, S. and Hashimoto, S., "A Computer Music Performer that Follows a Human Conductor," Computer, Vol. 24, No. 7, IEEE, 1991.

[Sato et al., 1991] Sato, A., Hashimoto, S. and Ohteru, S., "Singing and Playing in Musical Virtual Space," ICMC Proceedings, 1991.

[Harada et al., 1992] Harada, T., Sato, A., Hashimoto, S. and Ohteru, S., "Real Time Control of 3D Sound Space by Gesture," ICMC Proceedings, 1992.

[Dannenberg, 1984] Dannenberg, R. B., "An On-Line Algorithm for Real-Time Accompaniment," ICMC Proceedings, 1984.

[Dannenberg and Mont-Reynaud, 1987] Dannenberg, R. B. and Mont-Reynaud, B., "Following an Improvisation in Real-Time," ICMC Proceedings, 1987.

[Naoi et al., 1989] Naoi, K., Ohteru, S. and Hashimoto, S., "Automatic Accompaniment Using Real Time Assigning Note Value," Convention Record of the Acoustical Society of Japan, Spring 1989 (in Japanese).

[Wake et al., 1992] Wake, S., Kato, H., Saiwaki, N. and Inokuchi, S., "The Session System Reacting to the Sentiment of Player," Japan Music and Computer Science Society, Proceedings of the Summer Symposium, 1992 (in Japanese).

[Horiuchi et al., 1992] Horiuchi, Y., Fujii, A. and Tanaka, H., "A Computer Accompaniment System Considering Independence of Accompanist," Japan Music and Computer Science Society, Proceedings of the Summer Symposium, 1992 (in Japanese).

[Vercoe, 1984] Vercoe, B., "The Synthetic Performer in the Context of Live Performance," ICMC Proceedings, 1984.

[Vercoe and Puckette, 1985] Vercoe, B. and Puckette, M., "Synthetic Rehearsal: Training the Synthetic Performer," ICMC Proceedings, 1985.

[Katayose et al., 1993] Katayose, H., Kanamori, T., Kamei, K., Nagashima, Y., Sato, K., Inokuchi, S. and Simura, S., "Virtual Performer," ICMC Proceedings, 1993.

[Yamaguchi and Ando, 1977] Yamaguchi, K. and Ando, S., "Application of Short-Time Spectral Analysis to Natural Musical Instrument Tones," Journal of the Acoustical Society of Japan, Vol. 33, No. 6, 1977 (in Japanese).

[Niihara et al., 1986] Niihara, T., Imai, M. and Inokuchi, S., "Transcription of Sung Song," Proceedings of the International Conference on Acoustics, Speech and Signal Processing, IEEE, 1986.

[Kuhn, 1990] Kuhn, W. B., "A Real-Time Pitch Recognition Algorithm for Music Applications," Computer Music Journal, Vol. 14, No. 3, 1990.

[Inoue et al., 1993] Inoue, W., Hashimoto, S. and Ohteru, S., "A Computer Music System for Human Singing," ICMC Proceedings, 1993.

[Sawada et al., 1992] Sawada, H., Isogai, M., Hashimoto, S. and Ohteru, S., "Man-Machine Interaction in Musical Accompaniment System," Convention Record of the Information Processing Society of Japan, Spring 1992 (in Japanese).