Applying STRAIGHT toward Music Systems
- Accurate F0 estimation and application for data-driven synthesis -

Haruhiro Katayose (1,3) and Hideki Kawahara (1,2)
(1) Faculty of Systems Engineering, Wakayama University, Sakaedani, Wakayama, 640-8510 JAPAN
(2) C.R.E.S.T.
(3) L.I.S.T.
{katayose,kawahara}

Abstract

This paper presents our ongoing project toward the development of a shakuhachi synthesizer using STRAIGHT, which is based on a VOCODER architecture. We give an overview of STRAIGHT and of its F0 estimation function. We then briefly describe the design of Cyber-Shakuhachi II, which is composed of gesture sensors and a sound synthesis module and is intended to be used as a VOCODER-like instrument or a data-driven synthesizer.

1. Introduction

Computer music that reflects ethnic musical sound in computer-generated sound is attractive. We (the first author) have been particularly interested in interactive computer music featuring the shakuhachi. Gesture sensors for the shakuhachi (called Cyber-Shakuhachi) have been developed, and several pieces featuring Cyber-Shakuhachi have been composed as an L.I.S.T. project [Katayose 1994][Katayose 1997] (figure 1).

Compared with Western musical instruments such as the flute, the shakuhachi is not standardized. This results in sound nuances that differ between instruments and players. This fact, together with the variety of gestures involved in performance technique, has made it difficult to develop shakuhachi synthesizers. We started by using the gestures of shakuhachi performance to control commercial effectors, samplers, and mixers in interactive computer music pieces. We succeeded in this first objective, but Cyber-Shakuhachi is not a synthesizer in the strict sense, because it does not itself possess a sound synthesis module. Our next goal is to develop Cyber-Shakuhachi II as a shakuhachi synthesizer.
The design requirements are that the system can resynthesize the shakuhachi sound under the control of the captured gestures, that the system works as a kind of effector, and that the system offers a function for assigning the parameters of the gesture sensors to sound parameters. In particular, a high-quality pitch change function has long been desired as a substitute for changing between pipes of different lengths.

Figure 1: Playing the Cyber-Shakuhachi

To realize these objectives, we employ STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrogram) [Kawahara 1997], a high-quality VOCODER originally designed for speech analysis, modification, and synthesis. This paper first gives an overview of STRAIGHT and of its characteristic feature, precise F0 estimation. It then outlines the design of Cyber-Shakuhachi II, which applies STRAIGHT in its sound synthesis module.

2. STRAIGHT

STRAIGHT consists of three key procedures that together achieve high-quality sound manipulation (shown in figure 2).

The first procedure extracts a smoothed time-frequency representation that is free from interference due to the source periodicity. It uses pitch-adaptive time-frequency analysis combined with a surface reconstruction method in the time-frequency region. A smoothing operation using a cardinal B-spline is employed to eliminate the periodicity interference in the time-frequency domain, which is unavoidable when using the short-term Fourier transform.

The second procedure extracts F0 and other source-related information with high reliability and precision. It extracts the speech F0 as the instantaneous frequency of the fundamental component of complex sounds like voiced speech, by using a new concept called "fundamentalness". "Fundamentalness" is defined as the negative logarithm of the total amount of AM (amplitude modulation) and FM (frequency modulation) magnitudes of a wavelet transform using an auditory-like analyzing wavelet.

The third procedure designs the excitation source for resynthesis using group delay manipulations, which enables artificial "naturalness" to be added to the synthetic speech. It takes advantage of the fact that humans are very sensitive to specific group delay attributes.

Figure 2: Outline of STRAIGHT

ICMC Proceedings 1999

3. F0 estimation [Kawahara 1998]

A new algorithm based on the instantaneous frequency has been developed to provide the source information that guides the STRAIGHT procedure. The proposed method extracts the fundamental frequency as the instantaneous frequency of the fundamental component of a complex sound. This may sound strange, because selecting the fundamental component seems to require prior knowledge of the fundamental frequency to be extracted. A new measure called 'fundamentalness' provides a built-in mechanism for selecting the fundamental component without referring to F0 information.

'Fundamentalness' is designed to take its maximum value when a filter output consists only of the fundamental component. This is made possible by using a bank of asymmetric constant-Q band-pass filters that have a gradual slope at the lower cut-off and a steeper slope at the higher cut-off. In such a bank, a channel centered on a higher harmonic also passes lower harmonics through its gradual lower slope, so its output is modulated, while the output of the channel centered on the fundamental is not. Defining 'fundamentalness' to be proportional to the negative logarithm of the total amount of FM and AM of a filter output therefore gives the desired behavior.

An analyzing wavelet $w_{AG}(t;\eta)$ made from a complex Gabor filter $w_g(t;\eta)$ having a slightly finer resolution in frequency (i.e., $\eta > 1$) can form such a filter bank. The input signal $s(t)$ can be divided into a set of filtered complex signals $B(t;\tau_c)$.
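The selection principle described above can be illustrated with a minimal Python sketch. It is not the STRAIGHT implementation: it assumes plain symmetric complex Gabor filters instead of the asymmetric analyzing wavelet, an unweighted time average instead of a weighted integration, and a simplified index that sums scale-normalized AM and FM energies. The function names (`harmonic_tone`, `gabor_channel`, `fundamentalness`, `estimate_f0`), the channel spacing, and the test tone are all hypothetical choices made for illustration.

```python
import cmath
import math


def harmonic_tone(f0, fs, n_samples, n_harmonics=4):
    """Test signal: harmonics h = 1..n_harmonics of f0 with amplitude 1/h."""
    return [sum(math.cos(2 * math.pi * h * f0 * n / fs) / h
                for h in range(1, n_harmonics + 1))
            for n in range(n_samples)]


def gabor_channel(signal, fs, fc, span=4.0):
    """Output of a complex Gabor band-pass filter centred at fc (Hz).

    Assumption: the Gaussian envelope has a standard deviation of one period
    of fc (constant Q). Only samples where the whole filter fits inside the
    signal are returned, which avoids edge effects.
    """
    sigma = fs / fc                      # envelope standard deviation in samples
    half = int(span * sigma)
    taps = [math.exp(-0.5 * (k / sigma) ** 2)
            * cmath.exp(2j * math.pi * fc * k / fs)
            for k in range(-half, half + 1)]
    return [sum(signal[n - k] * taps[k + half] for k in range(-half, half + 1))
            for n in range(half, len(signal) - half)]


def fundamentalness(b, fs, fc):
    """Return (simplified index, mean instantaneous frequency in Hz)."""
    tau = 1.0 / fc                       # characteristic period of the channel
    mag = [abs(z) for z in b]
    # Instantaneous angular frequency from the phase step between samples.
    omega = [cmath.phase(b[n + 1] * b[n].conjugate()) * fs
             for n in range(len(b) - 1)]
    power = sum(m * m for m in mag) / len(mag)
    am = sum(((mag[n + 1] - mag[n]) * fs) ** 2
             for n in range(len(mag) - 1)) / (len(mag) - 1)
    fm = sum(((omega[n + 1] - omega[n]) * fs) ** 2
             for n in range(len(omega) - 1)) / (len(omega) - 1)
    # Scale-normalised total modulation; the small constant avoids log(0).
    total = am / power * tau ** 2 + fm * tau ** 4
    mean_hz = sum(omega) / len(omega) / (2 * math.pi)
    return -math.log(total + 1e-30), mean_hz


def estimate_f0(signal, fs, centres):
    """Pick the channel with the largest index and read F0 from its output."""
    results = [fundamentalness(gabor_channel(signal, fs, fc), fs, fc) + (fc,)
               for fc in centres]
    _, f0_hz, best_fc = max(results, key=lambda r: r[0])
    return f0_hz, best_fc


fs = 4000
centres = [150.0 * 2 ** (k / 4) for k in range(10)]   # 150 Hz up to about 714 Hz
f0, channel = estimate_f0(harmonic_tone(200.0, fs, 800), fs, centres)
print(round(f0, 1), round(channel, 1))
```

No prior F0 value is passed to `estimate_f0`: channels whose pass band contains more than one harmonic show beating in both magnitude and phase, so their index is low, and F0 is read from the least modulated channel.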
The filter outputs and the analyzing wavelet are defined as

$$B(t;\tau_c) = |\tau_c|^{-1/2} \int_{-\infty}^{\infty} s(u)\, w_{AG}\!\left(\frac{u-t}{\tau_c};\eta\right) du$$

$$w_{AG}(t;\eta) = w_g(t - 1/4;\eta) - w_g(t + 1/4;\eta)$$

$$w_g(t;\eta) = e^{-\pi (t/\eta)^2}\, e^{-j 2\pi t}$$

The characteristic period $\tau_c$ of the analyzing wavelet is used to represent the corresponding filter channel. The 'fundamentalness' index $M_c(t;\tau_c)$ is calculated for each channel ($\tau_c$) based on its output. The definition of the index has been slightly modified from the previous report, because F0 trajectories of speech signals normally contain moving components that carry prosodic information. Removing the contribution of the monotonic F0 movement reduces the artifacts in the 'fundamentalness' evaluation caused by prosodic components.

$$M_c(t;\tau_c) = -\log\left[\int_\Omega w \left(\frac{d|B(u)|}{du} - \mu_{AM}\right)^2 du\right] - \log\left[\int_\Omega w \left(\frac{d^2 \arg(B(u))}{du^2} - \mu_{FM}\right)^2 du\right] + \log\left[\int_\Omega w\, |B(u)|^2\, du\right] + 2\log\tau_c$$

$$\mu_{AM} = \int_\Omega w(u-t;\tau_c)\, \frac{d|B(u)|}{du}\, du, \qquad \mu_{FM} = \int_\Omega w(u-t;\tau_c)\, \frac{d^2 \arg(B(u))}{du^2}\, du$$

$$w(t;\tau_c) = e^{(\ldots)} \quad \text{[exponent illegible in the source]}$$

where the integration interval $\Omega = (t-T, t+T)$ is selected to cover the range in which the weighting factor $w(u-t;\tau_c)$ (abbreviated as $w$ in the first three terms of $M_c(t;\tau_c)$) is effectively non-zero. The index $M_c(t;\tau_c)$ is thus normalized in terms of scale. Extracting F0 simply means finding the maximum of $M_c(t;\tau_c)$ with respect to $\tau_c$ and calculating the average (or, more specifically, interpolated) instantaneous frequency using the outputs of the channels neighboring that $\tau_c$. For the instantaneous frequency calculation, a modification suited to discrete-time systems is introduced.

4. Design of Cyber-Shakuhachi II

The important points in the design of Cyber-Shakuhachi II are the configuration of the gesture sensors and the sound synthesis. Figure 3 shows the design overview of Cyber-Shakuhachi II.

The sensor data used for sound manipulation should be rich but not redundant. Cyber-Shakuhachi II basically uses breath, shakuhachi posture, and fingering data. In addition, head movement or sound data can be utilized. Unnecessary data are gated by the control signal, and the selected data are transferred to the sound-generating computer.

Figure 3: Cyber-Shakuhachi II with a system diagram for interactive art

The most standard sound synthesis method realized by STRAIGHT is to use it as a VOCODER. Convolving a sound profile analyzed beforehand with an excitation source whose pitch and power are taken from the shakuhachi sound produces a unison synthesized sound. The other gesture data can be used to control other parameters: F0 shift, stretches along the frequency and temporal axes, or the four parameters regarding group delay.

Another sound synthesis method is to construct a data-driven model for synthesis, as illustrated by Schoner et al. [Schoner 1998].
Data-driven modeling is suited to realizing a synthesizer for a system whose input and output data are observable but whose physical model is difficult to identify. An interesting use of data-driven modeling is to map gesture data to different sound sources. It can be a new tool for interactive art.

5. Ongoing activities

STRAIGHT is an unusual VOCODER in a sense. Its high-quality sound synthesis is achieved at the cost of calculation time and of computational resources that grow larger than the original sound data. At present, a real-time version is being developed as one of the main streams of the STRAIGHT project. One of the aims of Cyber-Shakuhachi II is to demonstrate possible uses of STRAIGHT, whose real-time version is scheduled within three years. We have already verified that the F0 estimation procedure described in section 3 runs on commercially available PCs. We are also modifying the F0 estimation method so that it is better suited to musical sounds, whose harmonic structure is relatively stable.

We are searching for suitable STRAIGHT parameters, and their ranges, that are expected to correlate with gestures or that may produce interesting computer music sounds. We do not intend to build all possible parameter mappings into the mapping function of the data-driven synthesis; giving the system too much data may degrade real-time performance. We are preparing to apply a Kalman filter to the mapping between the gesture data and the sound parameters, expecting its prediction feature to compensate for the sensing delay.
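The idea of using Kalman prediction to offset sensing delay can be sketched as follows. This is only an illustration under stated assumptions, not the system's design: it assumes a single scalar sensor stream, a constant-velocity state model, process noise added only on the diagonal of the covariance, and hand-picked noise values. The function name `kalman_predict_ahead` and all parameter values are hypothetical.

```python
def kalman_predict_ahead(measurements, lead_steps, dt=1.0, q=1e-4, r=0.1):
    """Track a scalar sensor value with a constant-velocity Kalman filter.

    For every measurement, return the value extrapolated lead_steps samples
    into the future (a hypothetical way to compensate for sensing delay).
    """
    pos, vel = 0.0, 0.0                        # state estimate
    p00, p01, p10, p11 = 1e3, 0.0, 0.0, 1e3    # covariance (large = unsure)
    predictions = []
    for z in measurements:
        # Time update with F = [[1, dt], [0, 1]]; q is added on the diagonal
        # only, which is a simplification of the usual process noise model.
        pos = pos + dt * vel
        n00 = p00 + dt * (p01 + p10) + dt * dt * p11 + q
        n01 = p01 + dt * p11
        n10 = p10 + dt * p11
        n11 = p11 + q
        # Measurement update with H = [1, 0] and measurement noise r.
        s = n00 + r
        k0, k1 = n00 / s, n10 / s
        innovation = z - pos
        pos += k0 * innovation
        vel += k1 * innovation
        p00, p01 = (1 - k0) * n00, (1 - k0) * n01
        p10, p11 = n10 - k1 * n00, n11 - k1 * n01
        # Extrapolate the filtered state to cover the assumed delay.
        predictions.append(pos + vel * lead_steps * dt)
    return predictions


# A breath-pressure-like ramp: 2.0 at step 0, rising by 0.5 per step.
ramp = [2.0 + 0.5 * n for n in range(200)]
ahead = kalman_predict_ahead(ramp, lead_steps=3)
```

On a steadily rising input such as `ramp`, the extrapolated value approaches the value the sensor will report `lead_steps` samples later, which is the behavior wanted when the sound parameters must not lag behind the gesture.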

Figure 4: STRAIGHT user interface

Figure 4 shows the user interface and some of the processing of STRAIGHT. As the figure shows, STRAIGHT provides parameters for F0 shift, stretches along the frequency and temporal axes, and four parameters regarding group delay. As examples of sound manipulation with STRAIGHT, the original shakuhachi sound, the resynthesized sound, and some sound variations manipulated with the group delay parameters are available from the following URL.

6. Summary

This paper has presented an attempt to develop a shakuhachi synthesizer using STRAIGHT, a high-quality VOCODER. We presented an overview of STRAIGHT and of its precise F0 estimation method, which is regarded as a key procedure in sound technology. We also illustrated the design of Cyber-Shakuhachi II, which is composed of gesture sensors and a sound synthesis module and is intended to be used as a VOCODER-like instrument or a data-driven synthesizer.

Acknowledgment

The authors would like to thank Satosi Shimura for his shakuhachi performance, and Tsutomu Kanamori for his research support.

References

[Katayose 1994] H. Katayose, T. Kanamori, S. Simura and Seiji Inokuchi: Demonstration of Gesture Sensors for the Shakuhachi, Proc. ICMC, pp.196-199 (1994)

[Katayose 1997] H. Katayose, H. Shirakabe, T. Kanamori and S. Inokuchi: A Toolkit for Interactive Digital Art, Proc. ICMC, pp.476-478 (1997)

[Kawahara 1997] Hideki Kawahara: Speech Representation and Transformation using Adaptive Interpolation of Weighted Spectrum: VOCODER Revisited, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol.2, pp.1303-1306 (1997)

[Kawahara 1998] H. Kawahara, I. Masuda-Katsuse, and Alain de Cheveigne: Restructuring speech representations using a pitch adaptive time-frequency-based F0 extraction: Possible role of a repetitive structure in sounds, Speech Communication 27, pp.187-207 (1999)

[Schoner 1998] B. Schoner, C. Cooper, C. Douglas and N. Gershenfeld: Data-driven Modeling and Synthesis of Acoustical Instrument, Proc. ICMC, pp.66-73 (1998)