TIMBRE INTERPOLATION OF SOUNDS USING A SINUSOIDAL MODEL

Naotoshi Osaka
NTT Basic Research Laboratories, 3-1 Morinosato-Wakamiya, Atsugi-shi, Kanagawa, 243-01 Japan
osaka@siva.brl.ntt.jp Tel. 0462-40-3577 Fax: 0462-40-4721

ABSTRACT: This paper describes a model for timbre interpolation of sounds using a sinusoidal model. Two stationary sounds are taken as the sounds to be interpolated. Partials are classified into four categories: harmonics, auxiliary harmonics, inharmonics, and residuals. Corresponding partials within the same class are interpolated. The unified algorithm for finding the correspondence is described in detail.

1 Introduction

Timbre interpolation is a problem in perceptual space: given two natural sounds with different timbres, the technique synthesizes a sound whose timbre interpolates between them. It is known as "sound morphing" and is expected to become one of the attractive techniques in sound design for music composition as well as visual art. Cross-fading is a technique close to sound morphing, and timbre control techniques appear to be becoming widespread. Recently, frameworks for timbre control using signal models have been under study. Tellman et al. [Tellman 1994] introduce a model for interpolating among sounds with various musical expression, such as vibrato, using a sinusoidal model named "Lemur". Moreover, a real-time demonstration system is running at CNMAT at UC Berkeley.

To define the problem technically, we set up some restrictions for timbre interpolation, as follows:
1. Timbre should be controlled in a perceptually continuous manner.
2. If the two input sounds have attributes in common, the interpolated sounds should also have them.
3. The interpolated sound should not have distortion or artifacts if the original sounds do not.
Here are two examples of No. 2. If the two sounds are acoustically naturally generated, the interpolated sound should also sound natural.
If human-voice-to-human-voice interpolation is dealt with, the generated voice should sound like a human voice. Another item, common to Nos. 2 and 3, is that if each original sound forms a single auditory stream, the interpolated sound should also be a single stream. Cross-fading does not satisfy this restriction.

2 Timbre interpolation model

A block diagram of the model is depicted in Figure 1. As one of the inputs, an interpolation constant a (0 < a < 1) is given, which defines how far the synthesized sound's timbre is from, or how close it is to, each of the original sounds. The two sounds are analysed using a sinusoidal model. The M & Q algorithm [McAulay 1986] was used to estimate the frequency-domain trajectory of each harmonic. FFT-1 [Rodet 1992] is known as another parameter estimation algorithm. The analysis conditions were as follows: 24 kHz sampling, 10 ms fixed frame length with a Hamming window, and a fixed 4096-point STFT. At this stage, the sinusoidal parameters are acquired. Then the temporal correspondence between the two sounds is found. This is done by ordinary Dynamic Time Warping (DTW). At present, signal magnitude is used as the distance measure.

ICMC Proceedings 1995
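The time-correspondence step can be sketched as follows. This is a minimal textbook DTW over per-frame magnitude sequences, not the author's implementation; the paper states only that signal magnitude is used as the distance measure, so the absolute-difference local cost and the three-way step pattern here are illustrative assumptions.

```python
import numpy as np

def dtw_time_correspondence(mag_a, mag_b):
    """Align two per-frame magnitude sequences with dynamic time warping.

    mag_a, mag_b: 1-D sequences of frame magnitudes (e.g. one value
    per 10 ms analysis frame). Returns the warping path as a list of
    (i, j) frame-index pairs.
    """
    I, J = len(mag_a), len(mag_b)
    # Local distance: |m_a - m_b| (an illustrative choice).
    cost = np.abs(np.subtract.outer(mag_a, mag_b))
    acc = np.full((I, J), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = cost[i, j] + prev
    # Backtrack from the final frame pair to (0, 0).
    path, (i, j) = [(I - 1, J - 1)], (I - 1, J - 1)
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((acc[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((acc[i, j - 1], (i, j - 1)))
        _, (i, j) = min(candidates)
        path.append((i, j))
    return path[::-1]
```

The resulting path gives, for every frame of sound A, the frame(s) of sound B it corresponds to, which is what the later partial-by-partial interpolation is carried out under.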

2.1 Classification of partials

Partials are classified into four categories:
1. harmonics (quasi-harmonics)
2. auxiliary harmonics
3. inharmonics
4. residuals

Harmonics (hereafter including quasi-harmonics) generally have large magnitude and long duration, and their instantaneous frequencies are almost integer multiples of the fundamental frequency. When no signal of sufficient magnitude and duration is found at an appropriate frequency, a null harmonic is assumed. Auxiliary harmonics are defined as partials of middle level and duration whose frequency is either an integer or a half-integer multiple of the fundamental frequency. These are observed at the beginning and the end of the sound. Another such case is harmonics that are not loud enough: since magnitude is time-variant, harmonics are observed in some frames but not in others, so a discontinuous harmonic trajectory, drawn as a dashed line, is seen. Inharmonics contribute a great deal to timbre. The rest of the partials are defined as residuals. The number of residual partials is much greater than in the other three classes. The residuals are noise, and the level and duration of each residual partial are relatively small compared with the other three classes; thus it is not perceptually meaningful to study each such sinusoid individually.

Correspondence of sufficiently loud partials between the two sounds is found. Correspondence of partials and their classification are not processed in series; rather, partials are classified in the process of searching for corresponding partials.

Fig. 1 Block diagram of timbre interpolation model (inputs A and B; sinusoidal analysis by the M & Q algorithm; time correspondence; harmonics/inharmonics and auxiliary-harmonic classification; frequency-zone definition; sinusoid-parameter and stochastic-parameter interpolators; sinusoidal synthesis of the interpolated sound)
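The classification rules above can be sketched per partial as follows. The paper gives no numeric thresholds, so every threshold here (loud_mag, loud_dur, mid_mag, mid_dur, tol) is an illustrative placeholder, and the function name is hypothetical:

```python
def classify_partial(freq, mag, dur, f0,
                     loud_mag=0.05, loud_dur=0.1,
                     mid_mag=0.01, mid_dur=0.05, tol=0.03):
    """Assign one partial to a category following Sec. 2.1.

    freq: mean instantaneous frequency (Hz); mag: peak magnitude;
    dur: duration (s); f0: fundamental frequency (Hz).
    All threshold values are illustrative, not taken from the paper.
    """
    ratio = freq / f0
    # Near an integer multiple of f0?
    near_integer = abs(ratio - round(ratio)) < tol
    # Near an integer or half-integer multiple of f0?
    near_half_integer = abs(2 * ratio - round(2 * ratio)) < tol
    if near_integer and mag >= loud_mag and dur >= loud_dur:
        return "harmonic"          # loud, long, integer multiple of f0
    if near_half_integer and mag >= mid_mag and dur >= mid_dur:
        return "auxiliary harmonic"  # middle level/duration, (half-)integer multiple
    if mag >= loud_mag and dur >= loud_dur:
        return "inharmonic"        # loud and long but off the harmonic grid
    return "residual"              # everything else (noise-like)
```

In the paper the classification is interleaved with the correspondence search rather than done as an independent pre-pass, so this stand-alone function only illustrates the category definitions themselves.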
As a result, loud corresponding harmonic and inharmonic partials are found. The rest are judged to be residuals, which consist only of quiet partials.

2.2 Interpolation

For all classes but the residuals, corresponding partials are searched for in the frequency domain within the same class. Once found, they are interpolated under the temporal correspondence. If there is no corresponding partial, zero-level harmonics are hypothesized, and their instantaneous frequencies are estimated from the nearest observed harmonics. This is necessary to define the harmonic zones. Harmonic-to-harmonic correspondence also defines zone-to-zone correspondence. The interpolation is mostly a weighted average of the two sets of sinusoidal parameters, except for phase: a generated sinusoid is hypothesized to be a chirp signal, and its phase is calculated from the interpolated instantaneous frequency.

For interpolation of the residuals, stochastic parameter interpolation is used. Since there are so many partials, partial-to-partial interpolation is not done. Instead, stochastic parameters of the residuals in the same frequency zone are interpolated. For each frequency zone, histograms of instantaneous frequency and magnitude, starting time, and duration are calculated. A normal distribution is hypothesized for the mean instantaneous magnitude, an exponential distribution for the duration, and a uniform distribution for the starting position.

3 Correspondence search algorithm using dynamic programming

In morphing, there are several cases in the model where the correspondence of features between the two sounds must be found. These are all solved with a unified algorithm, although the distance definitions differ according to the specific problem. The cases where correspondence must be found are:
1. Frame-to-frame peak matching in the sinusoidal model
2. Time correspondence
3. Harmonic-to-harmonic correspondence
4. Single-partial-to-single-partial interpolation
5. Resonance-to-resonance correspondence
These are denoted in Figure 1. No. 5 is not necessary for stationary sounds.

The problem is to find the optimal combination of resembling vector pairs from two groups that may also contain non-corresponding vectors. The numbers of members in the two groups are usually different. Let x_i (i = 0, ..., I-1) and y_j (j = 0, ..., J-1) represent the vectors of the two groups, respectively; I and J are the numbers of vectors, and generally I != J. More precisely, the problem is, for each of the I vectors, to find a counterpart, if possible, from the y group that is sufficiently close to it, such that the combination as a whole is optimal. The problem can be solved by the criterion of minimizing the total distance over all admissible pairs, using dynamic programming. The overall evaluation is expressed as follows:

  T_{n_{I-1}}(I-1) = min over {w_i(k)} of  sum_{i=0}^{I-1} C(x_i, y_{w_i(k)})     (1)

where w_i(k) is a window function that reduces the search range in the y group for x_i and designates a specific member of the y group:

  w_i(0) = 0,   i = 0, ..., I-1                                                  (2)
  w_i(k) in {j | j = 0, ..., J-1},   1 <= k <= n_i                               (3)

Fig. 2 Two cases of correspondence (case 1: d_null < 3b - 2a; case 2: d_null > 3b - 2a)

T_k(i) is the cumulative cost up to the search of x_i against y_{w_i(k)}. The cost function C is expressed in terms of the distance D(x_i, y_{w_i(k)}) for corresponding vectors and of d_null, the cost applied when no counterpart is found. To solve this, doubly recursive equations are given.

Initial condition:

  T_0(0) = d_null
  T_k(0) = min( D(x_0, y_{w_0(k)}), T_{k-1}(0) ),   k = 1, ..., n_0             (6)

Recursive equations: for i = 1, ..., I-1,

  T_0(i) = T_{n_{i-1}}(i-1) + d_null
  T_k(i) = min( T_{k-1}(i),  D(x_i, y_{w_i(k)}) + T_{k*}(i-1) ),   k = 1, ..., n_i   (7)
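The doubly recursive search can be sketched as follows. This is not the author's implementation: it assumes the full search window w_i(k) = k - 1 (every y is a candidate for every x), under which the back-pointer k* of Eq. (8) reduces to k - 1, and the function name, the caller-supplied distance dist, and the null cost d_null are the only parameters. Per the recursion, an unmatched x costs d_null, an unmatched y costs nothing, and matched y indices must increase monotonically.

```python
def match_groups(X, Y, dist, d_null):
    """Optimal monotone correspondence between two vector groups (Sec. 3).

    Each x may match one y (cost dist(x, y)) or stay unmatched (cost d_null).
    Sketch only: uses the full window w_i(k) = k - 1, so k* = k - 1.
    Returns (total_cost, pairs), pairs being (i, j) index tuples.
    """
    I, J = len(X), len(Y)
    INF = float("inf")
    # T[i][k]: cumulative cost after deciding x_0..x_i, with y candidates
    # up to index k-1 examined (k = 0 means x_i left unmatched so far).
    T = [[INF] * (J + 1) for _ in range(I)]
    choice = [[None] * (J + 1) for _ in range(I)]  # matched y index, if any
    # Initial condition, Eq. (6).
    T[0][0] = d_null
    for k in range(1, J + 1):
        d = dist(X[0], Y[k - 1])
        if d < T[0][k - 1]:
            T[0][k], choice[0][k] = d, k - 1
        else:
            T[0][k] = T[0][k - 1]
    # Recursion, Eq. (7), with T_{k*}(i-1) = T[i-1][k-1] for the full window.
    for i in range(1, I):
        T[i][0] = T[i - 1][J] + d_null
        for k in range(1, J + 1):
            skip = T[i][k - 1]                       # y_{k-1} not used for x_i
            take = dist(X[i], Y[k - 1]) + T[i - 1][k - 1]  # match x_i to y_{k-1}
            if take < skip:
                T[i][k], choice[i][k] = take, k - 1
            else:
                T[i][k] = skip
    # Backtrack to recover the matched pairs.
    pairs, i, k = [], I - 1, J
    while i >= 0:
        if k == 0:
            i, k = i - 1, J          # x_i was left unmatched
        elif choice[i][k] is not None:
            pairs.append((i, choice[i][k]))
            i, k = i - 1, k - 1
        else:
            k -= 1
    return T[I - 1][J], pairs[::-1]
```

As in the paper, the same routine serves every correspondence problem except the time correspondence; only the dist function changes.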

  k* = n_{i-1}                     if w_i(k) - 1 > w_{i-1}(n_{i-1})
  k* = w_{i-1}^{-1}(w_i(k) - 1)    if w_{i-1}(1) <= w_i(k) - 1 <= w_{i-1}(n_{i-1})   (8)
  k* = 0                           otherwise

Here T_{k*}(i-1) denotes the cumulative cost over the previously examined members, excluding any combination in which a member of y that is examined at the present step or later has already been used, and w_{i-1}^{-1}(j) is the inverse function of w_{i-1}(k). The result differs for different values of d_null; Figure 2 depicts a simple example of different results for different values of d_null. All the correspondence searches except the time correspondence use the algorithm introduced here. The only difference between them is the specific definition of the distance.

4 Application to actual sounds

The morphed sounds treated here are limited to stationary instrumental sounds and vowels. Sounds with musical expression, such as vibrato, are not considered. For these materials, naturalness of the sound is of great importance, since there are no other musical features. Two cases, flute to clarinet and female vowel to female vowel, were tested. Interpolated sounds of good natural quality were obtained.

5 Conclusion

The general problem of defining timbre interpolation was discussed first. Then the flow of a timbre interpolation model was described. Next, a correspondence search algorithm between the members of two groups, whose numbers of members generally differ, was presented using dynamic programming. This brings a unified perspective to the correspondence searches of the timbre interpolation problem. Simple stationary sounds were tried and satisfactory results were obtained. At this stage, the model does not consider any data reduction; it should be possible to eliminate some residuals by taking auditory phenomena, such as the masking effect, into consideration. This is one topic for further study. Another is to incorporate more complicated sounds.

6 Acknowledgement

The author wishes to thank Dr. Ken'ichiro Ishii, Director of the Information Science Research Laboratory, for his encouragement during the research.
References

[Tellman 1994] Edwin Tellman, Lippold Haken, and Bryan Holloway, "Timbre morphing using the Lemur representation," ICMC94 Proceedings, pp. 329-330, 1994.
[Rodet 1992] X. Rodet and Ph. Depalle, "A new additive synthesis method using inverse Fourier transform and spectral envelopes," ICMC92 Proceedings, pp. 410-411, Oct. 1992.
[McAulay 1986] R. J. McAulay and T. F. Quatieri, "Speech Analysis/Synthesis Based on a Sinusoidal Representation," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-34, no. 4, Aug. 1986.