TOWARDS NON-LINEAR RESYNTHESIS OF INSTRUMENTAL SOUNDS

R. Kronland-Martinet*, Ph. Guillemain*+
* CNRS-Laboratoire de Mécanique et d'Acoustique, 31 ch. J. Aiguier, 13402 Marseille Cedex 09, France
+ DIGILOG, 21 rue Frédéric-Joliot, Pôle d'activités des Milles, 13852 Aix-En-Provence Cedex 3, France

I- Introduction.

Additive synthesis provides a good description of sounds, giving information on both the time and the frequency behavior of each elementary component. Using time-frequency representations of signals, we have developed accurate algorithms for the automatic extraction of such parameters; these algorithms are summarized below. Analysis and resynthesis are then possible using additive synthesis, opening wide possibilities in the field of sound transformation. In contrast with the information provided by the short-time Fourier transform, our method minimizes the number of components. Nevertheless, additive synthesis remains expensive because of the amount of data involved, in contrast with global synthesis techniques such as frequency modulation or waveshaping. Our aim is to propose an approach connecting these two classes of models, making a link between additive synthesis parameters and non-linear ones. This is a challenge from a mathematical point of view, and one can show that there is in general either no solution or no unique one. To get around this difficulty, we suggest grouping the components into sub-classes such that each element in a group "looks like" the other elements of the group. The similarity is based on psychoacoustic criteria, taking into account, for example, the shape of the amplitude modulation law rather than the global energy of the components. One can then simplify the additive synthesis model by limiting the total number of envelopes to the number of classes. It is then easier to connect each group to a non-linear model.

II- Estimation of additive synthesis parameters through time-frequency representations.
In order to estimate parameters corresponding to an additive synthesis model from the analysis of a real sound, we assume that the signal can be written as follows:

s(t) = Σ_{k=1}^{N} A_k(t) cos(ω_k t + φ_k(t))

Each element of the sum, A_k(t) cos(ω_k t + φ_k(t)), is called a spectral line, and is determined by the knowledge of both the phase and the amplitude modulation laws. Actually, we shall suppose that the phase is given by a linear component perturbed by a slowly varying and bounded term φ_k(t). These assumptions are not too restrictive, and such a model is well adapted to the synthesis of musical sounds, taking into account the perturbations encountered during the attack of the sound as well as vibrato effects. The problem then concerns the estimation of the frequencies ω_k of the components involved and the functions of time A_k(t) and φ_k(t). For that purpose, we used linear time-frequency representations of signals such as Gabor and wavelet decompositions. These representations can be seen as the result of filtering the signal by a continuous set of linear band-pass filters. An important feature is that the impulse response of each filter is complex-valued, the real and imaginary parts being generally in phase quadrature. One can then consider the amplitude and the phase of each filtered version of the signal. In the case of signals with components spaced in frequency, around the frequency of a given component the amplitude of the representation is related to its amplitude modulation law, while the phase derivative is related to its frequency modulation law. Figure 1 displays the Gabor transform of a sum of monochromatic signals. The horizontal axis represents time, and the vertical axis represents frequency. The modulus is displayed above the phase.

2A.2 86 ICMC Proceedings 1993

For this specific choice of filters, the modulus representation shows horizontal black stripes corresponding to each component. Near the maxima, the phase gives an estimation of the phase of each component. In each stripe, the mean value of the phase derivative of the transform (called the local frequency) equals the frequency of the component [Guillemain].

Figure 1

One has to notice that it is not always possible to separate the components on the representation. We have developed algorithms allowing an automatic separation and an estimation of each component. These algorithms consist of disentangling the components in order to estimate their real contribution; they are described for example in [Guillemain]. The idea is the following: since one uses linear representations, the knowledge of the frequencies of all the components contained in the signal (obtained by looking at the phase behavior) allows one to design a bank of special filters separating each contribution. Actually, the impulse response of these filters is given by a sum of wavelets. Figure 2 shows the Fourier transform of a set of such filters, with zeros corresponding to all the frequencies of the signal except the one to be extracted. Figure 3 represents the spectrum of a synthetic trumpet sound, and figure 4 shows the envelope of the 14th harmonic used for the synthesis. Figures 5 and 6 respectively show the restriction of a time-frequency representation around the frequency of the 14th component with a small and a large analysis time window (phase vocoder). Both estimations are biased and give rise to an oscillating or a smoothed envelope; the resulting sound is then definitely different from the original one. Figure 7 shows the output of the filter outlined in figure 2; the estimation is then better (to be compared with figure 4).
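The separating filter bank can be sketched numerically: the impulse response is a weighted sum of Gabor atoms, with the weights chosen so that the transfer function equals one at the partial to be extracted and zero at every other partial. This is a minimal illustration of the idea rather than the authors' implementation; the function name and all parameter values are arbitrary.

```python
import numpy as np

def extraction_filter(freqs, target, fs, sigma, half):
    """Impulse response built as a weighted sum of Gabor atoms whose
    transfer function is 1 at freqs[target] and 0 at the other partials."""
    t = np.arange(-half, half + 1) / fs
    atoms = [np.exp(-t**2 / (2 * sigma**2)) * np.exp(2j * np.pi * f * t)
             for f in freqs]
    # G[j, i] = transfer function of atom i evaluated at partial frequency f_j
    G = np.array([[a @ np.exp(-2j * np.pi * f * t) for a in atoms]
                  for f in freqs])
    rhs = np.zeros(len(freqs), dtype=complex)
    rhs[target] = 1.0
    w = np.linalg.solve(G, rhs)
    return sum(wi * ai for wi, ai in zip(w, atoms))

# Two partials only 25 Hz apart: extract the (constant) envelope of the first.
fs = 8000
t = np.arange(2 * fs) / fs
x = 1.0 * np.cos(2 * np.pi * 440 * t) + 0.7 * np.cos(2 * np.pi * 465 * t)
h = extraction_filter([440.0, 465.0], 0, fs, sigma=0.02, half=960)
env = np.abs(np.convolve(x, h, mode="same"))  # ~0.5, i.e. half the amplitude
```

Because the filter is complex-valued (analytic), the modulus of its output is directly the half-amplitude envelope of the extracted partial, with the neighboring partial cancelled by the imposed zero.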

Figure 5

III- Grouping the envelopes - additive packet synthesis.

Additive resynthesis enables us to reach a wide palette of timbres. Nevertheless, the total number of parameters to handle is generally large. For example, in most musical sounds, the number of elementary components increases when the pitch decreases; a hundred components is a typical value. Taking into account physical properties of musical instruments, such as formant structure, one is tempted to group the components into sub-classes such that all the components within a group look similar. We will say that two components look alike if they produce the same psychoacoustic effect, which means that one has to compare the time dependency of the energy of the components. A direct consequence is that for each group one can derive a "master envelope" describing, up to a normalization coefficient, the energy dependency of each component inside the group. For simplicity, let us assume that the signal s(t) corresponds to the following model:

s(t) = Σ_{k=1}^{N} A_k(t) cos(ω_k t)

Our aim is to decompose s(t) such that:

s(t) ≈ Σ_{l=1}^{L} B_l(t) Σ_k α_lk cos(ω_k t)

Where the unknowns are:
- L: the number of groups (L < N).
- B_l(t): the master envelope of group l.
- α_lk: the normalization coefficient of the kth component inside group l.

In order to estimate these parameters, one has to compare comparable things. Actually, the envelopes derived from a spectral line estimation are not directly comparable, for several reasons. The first is the huge difference between the global energy of the low- and high-frequency spectral lines. The second is the noise contribution that appears in the high-frequency spectral lines. The third is our logarithmic perception of sound. Consequently, before comparing the envelopes, one has to smooth them in order to minimize the noise contribution. One must then rescale these smoothed versions.
Finally, one has to consider the logarithm of these smoothed and renormalized versions of the original envelopes.
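The three preprocessing steps (smoothing, rescaling, logarithm) can be sketched as follows; the moving-average window and the floor value are illustrative assumptions, not the authors' choices.

```python
import numpy as np

def comparable_envelope(env, win=201, eps=1e-6):
    """Make a spectral-line envelope comparable to the others:
    smooth (moving average), rescale to unit maximum, take the logarithm.
    `win` and `eps` are arbitrary illustrative values."""
    kernel = np.ones(win) / win
    smoothed = np.convolve(env, kernel, mode="same")
    rescaled = smoothed / smoothed.max()
    return np.log(rescaled + eps)

# Two envelopes with the same shape but very different energies
t = np.linspace(0, 1, 2000)
e1 = np.exp(-3 * t) * (1 - np.exp(-50 * t))
e2 = 0.01 * e1                       # 40 dB weaker, same shape
c1, c2 = comparable_envelope(e1), comparable_envelope(e2)
# After rescaling and log, the two curves coincide
```

The rescaling makes the comparison insensitive to the global energy of each line, which is exactly the first obstacle mentioned above.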

At this step, one has to choose a method that performs the packet decomposition. From a mathematical point of view, this is an eigenvalue problem, where the number of master envelopes is given by the size of the largest invertible matrix whose elements are M'_ij = A_i(t_j). From a statistical point of view, principal component analysis is the natural tool for such a problem. Instead of using these methods, we preferred a home-made one, in order to introduce both physical considerations and a priori knowledge. Let us call {A'_k(t)} the set of comparable envelopes, and let us denote by M the matrix whose elements M_ij are given by:

M_ij = (1/T) ∫_0^T |A'_i(t) - A'_j(t)| dt, where T is the duration of the sound.

From a physical point of view, similar spectral lines should be located in the same spectral area. Consequently, M_ij should be close to 0 when i is close to j, and should be large when i is far from j. Figure 8 represents the 32 renormalized envelopes A_k(t) of a trumpet sound. Figure 9 represents the matrix M associated with this set of envelopes. The dark areas correspond to envelopes that look similar. These regions are essentially located along the diagonal of the matrix, which is in accordance with the physical assumption made above.

Figure 8 and Figure 9

A two-dimensional filtering of this picture along its diagonal reveals the bumps in a one-dimensional curve. The size of the filter lets us take into account matrix terms that are more or less far from the diagonal. By extracting the local maxima of this one-dimensional curve, one gets the number of master envelopes. Moreover, one assumes that each master envelope is one of the envelopes A_k(t); consequently the positions k0 of the local maxima of the one-dimensional curve give the master envelopes B_l(t) such that B_l(t) = A_k0(t).
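A minimal numeric sketch of this grouping step follows. The near-diagonal averaging window and the peak-picking rule are assumptions standing in for the authors' two-dimensional filter; on a synthetic example, the curve bumps exactly at the boundary between two groups of similar envelopes.

```python
import numpy as np

def master_envelope_indices(envs, diag_width=1):
    """envs: array of shape (N, T), the comparable envelopes A'_k(t).
    Returns the dissimilarity matrix M, its near-diagonal average (the
    one-dimensional curve), and the local maxima of that curve."""
    N = envs.shape[0]
    # M[i, j] = mean over time of |A'_i(t) - A'_j(t)|
    M = np.abs(envs[:, None, :] - envs[None, :, :]).mean(axis=2)
    curve = np.array([M[max(0, i - diag_width):i + diag_width + 1,
                        max(0, i - diag_width):i + diag_width + 1].mean()
                      for i in range(N)])
    peaks = [i for i in range(1, N - 1)
             if curve[i] >= curve[i - 1] and curve[i] > curve[i + 1]]
    return M, curve, peaks

# Two groups of four identical envelopes each: a rising and a falling ramp
t = np.linspace(0, 1, 100)
envs = np.array([t] * 4 + [1 - t] * 4)
M, curve, peaks = master_envelope_indices(envs)
# M is block-structured; the curve bumps at the boundary between the groups
```

The block structure of M along the diagonal is what the filtering exploits: inside a homogeneous group the near-diagonal average stays small, and it rises where the window straddles two dissimilar groups.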
The decomposition of each envelope A_k(t) on the family {B_l(t)} is then obtained by a classical least squares method. For each k, 1 ≤ k ≤ N:

A_k(t) ≈ Σ_{l=1}^{L} α_lk B_l(t)

The system: for all p, 1 ≤ p ≤ L, Σ_{l=1}^{L} α_lk <B_l(t), B_p(t)> = <A_k(t), B_p(t)> gives the coefficients α_lk.

Figure 10 displays two of the six master envelopes B_l of the trumpet as functions of time, together with their associated coefficients α_lk as functions of the harmonic rank.
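The normal equations above can be sketched directly with a Gram-matrix solve (illustrative data; the function name is ours):

```python
import numpy as np

def decompose_on_masters(A_k, B):
    """Solve the normal equations sum_l alpha_lk <B_l, B_p> = <A_k, B_p>.
    A_k: envelope sampled on T points; B: master envelopes, shape (L, T)."""
    G = B @ B.T          # Gram matrix <B_l, B_p>
    rhs = B @ A_k        # scalar products <A_k, B_p>
    return np.linalg.solve(G, rhs)

t = np.linspace(0, 1, 500)
B = np.array([np.exp(-2 * t), t * np.exp(-5 * t)])   # two master envelopes
A_k = 2.0 * B[0] + 0.5 * B[1]                        # a line in their span
alpha = decompose_on_masters(A_k, B)                 # -> [2.0, 0.5]
```

Since L is small (six for the trumpet example), the L x L Gram system is cheap to solve for each of the N envelopes.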

Figure 10

IV- From additive packet synthesis to non-linear synthesis

IV.1- The idea

The spectral line packets can generally be considered as band-limited bumps of harmonic components. The idea consists of approximating each packet by a non-linear synthesis method such as waveshaping or frequency modulation. Obviously, the decomposition of A_k(t) on the family {B_l(t)} is not exact, and one has:

Ã_k(t) = Σ_{l=1}^{L} α_lk B_l(t)

Let us define the quantity: E_k(t) = A_k(t) / Ã_k(t)

Then: A_k(t) = Σ_{l=1}^{L} α_lk E_k(t) B_l(t) and: Σ_{k=1}^{N} A_k(t) cos(ω_k t) = Σ_{l=1}^{L} B_l(t) Σ_{k=1}^{N} α_lk E_k(t) cos(ω_k t)

Consequently, one has evolving spectral line packets α_lk E_k(t), and the idea is to take these evolutions into account by time-varying the waveshaping or frequency modulation indexes.

IV.2- Resynthesis of packets through waveshaping

We will now address the problem of re-synthesizing a spectral line packet with the help of a non-linear method such as the waveshaping technique. The original signal will then be reconstructed by adding up the spectral line packets. For that purpose, we will first assume that each packet is composed of a set of harmonic components containing the fundamental component and a few harmonics. This is an important assumption with regard to digital implementations; high-order components are very sensitive, since their amplitudes generally vary quickly with respect to the waveshaping synthesis parameters. Actually, this assumption leads us to consider frequency-shifted versions of the original spectral line packets; in the end, each spectral line packet is returned to its original position by a single side band shift (fig. 11).

Figure 11: Non-linear synthesis of a shifted packet, followed by a frequency shift.
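The identity of section IV.1, that reinserting the correction factors E_k(t) makes the packet decomposition exact, can be checked numerically (arbitrary toy data, names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, L = 400, 6, 2
t = np.linspace(0, 1, T)
B = np.array([np.exp(-t), np.exp(-4 * t)])           # master envelopes B_l(t)
alpha = rng.random((L, N))                           # coefficients alpha_lk
A = (alpha.T @ B) * (1 + 0.1 * rng.random((N, T)))   # true envelopes A_k(t)
A_tilde = alpha.T @ B                                # approximation ~A_k(t)
E = A / A_tilde                                      # correction factors E_k(t)
omega = 2 * np.pi * np.arange(1, N + 1) * 5
orig = sum(A[k] * np.cos(omega[k] * t) for k in range(N))
packets = sum(B[l] * sum(alpha[l, k] * E[k] * np.cos(omega[k] * t)
              for k in range(N)) for l in range(L))
# orig and packets coincide: the grouping loses nothing once E_k is kept
```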

The waveshaping technique has been extensively described, for example in [Arfib]; we briefly recall the principle. The basic formula is given by the composition of two functions:
- a non-linear function F(x), namely the waveshaping function,
- a time-varying function e(t) = I(t) cos(ω_0 t), where I(t) is called (by analogy with the FM technique) the index function.

The generated signal is then s(t) = (F o e)(t) = F[e(t)] = F[I(t) cos(ω_0 t)].

Using the decomposition of F(x) in terms of powers of x, F(x) = Σ_{k=0}^{∞} a_k x^k, one can derive the formula:

s(t) = Σ_{k=0}^{∞} a_k I^k(t) cos^k(ω_0 t)

which shows two important properties of the method:
- The fundamental frequency is given by ω_0.
- If the decomposition of the waveshaping function F(x) contains only K terms, then the resulting signal will not contain more than K harmonic components.

For a fixed index I, one can explicitly describe the spectrum of the generated sound by using the decomposition of cos^k(ω_0 t) in terms of cos(n ω_0 t) [Suen]. One obtains:

s(t) = Σ_{k=0}^{∞} h_k(I) cos(k ω_0 t) with h_k(I) = Σ_{j=0}^{∞} 2^{1-(k+2j)} C_j^{k+2j} I^{k+2j} a_{k+2j} (for k ≥ 1; the constant term k = 0 carries an extra factor 1/2)

where C_j^i = i! / ((i-j)! j!) are the binomial coefficients. By letting I vary with time, the generated local spectrum changes, allowing the technique to synthesize evolving signals. Nevertheless, one has to notice that the sign of h_k(I) is not always positive; this is a severe problem, since the ultimate aim is to add up numerous spectral line packets in order to reconstruct the original signal. This is why one will have to look for a range of the index over which the phase of each component does not change. The waveshaping technique does not give rise to a complete representation of finite-energy signals, and consequently not all sounds can be described by such a technique. We are looking for a method allowing the reconstruction of a given signal under acoustically relevant criteria rather than physical ones.
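The harmonic amplitudes h_k(I) can be checked against a sampled waveshaped cosine (a sketch; the coefficient values are arbitrary):

```python
import numpy as np
from math import comb

def waveshape_spectrum(a, I):
    """Harmonic amplitudes h_k(I) of s(t) = F(I cos(w0 t)), F(x) = sum a_k x^k.
    The cos(k w0 t) weight of cos^n is 2^(1-n) C(n, (n-k)/2) for k >= 1
    and 2^(-n) C(n, n/2) for k = 0."""
    K = len(a)
    h = np.zeros(K)
    for k in range(K):
        for n in range(k, K, 2):          # n = k + 2j
            j = (n - k) // 2
            scale = 2.0 ** (1 - n) if k > 0 else 2.0 ** (-n)
            h[k] += scale * comb(n, j) * a[n] * I ** n
    return h

a = [0.2, 1.0, 0.5, 0.3]                  # F(x) = 0.2 + x + 0.5x^2 + 0.3x^3
h = waveshape_spectrum(a, I=0.8)
```

Projecting F(0.8 cos θ) on cos(kθ) over one period returns the same amplitudes, which confirms the formula term by term.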
Beauchamp and Arfib [Beauchamp] proposed to base the estimation of non-linear synthesis parameters upon the centroid of the spectrum (a parameter related to the brightness of the sound). The use of such a criterion leads to two problems:
- how to choose the function F(x)?
- is there, for a given function F(x), a unique correspondence between the value of the index and the centroid position?

IV.2-1 How to choose the waveshaping function?

Let us suppose that the index I equals 1 and decompose the function F(x) in the Chebyshev polynomial basis:

F(x) = Σ_{k=0}^{∞} β_k T_k(x), where T_k(x) is the kth Chebyshev polynomial of the first kind.

Using one of the properties of the Chebyshev polynomials, T_k(cos θ) = cos(kθ), one can write:

s(t) = Σ_{k=0}^{∞} β_k T_k(cos(ω_0 t)) = Σ_{k=0}^{∞} β_k cos(k ω_0 t)

showing that the decomposition coefficients of the waveshaping function F(x) in the Chebyshev polynomial basis, leading to a generated signal spectrum {β_k} with an index equal to 1, are given by the β_k themselves. The behavior of the curve representing the spectrum centroid position versus the index will be discussed below. Nevertheless, one is tempted to define the waveshaping function so that it allows the synthesis of the modulus of the long-time Fourier spectrum of the signal for a mean value of the index (typically 0.5). In this case, one could expect that changing the value of the index yields "more or less bright" spectra around the mean one. The problem is then to define the function F(x) so that it generates a given signal spectrum {Φ_k} for an arbitrary value I_0 of the index. In other words, one has to calculate the signal spectrum {h_k} obtained with an arbitrary value of the index I (I = 1 will give the coefficients in the Chebyshev basis). The calculus gives:

h_k(I) = Σ_{i=0,j=0}^{∞} C_j^{k+2j} C_i^{k+2j+i} (I/I_0)^{k+2j} (-1)^i ((k+2j+2i)/(k+2j+i)) Φ_{k+2j+2i}

This formula allows:
- calculating the waveshaping function that generates a signal spectrum {Φ_k} for a value I_0 of the index (with I = 1),
- calculating the generated signal spectrum for any arbitrary value of the index,
- consequently, constructing the variations of the spectrum centroid versus the index.

IV.2-2 Index versus spectrum centroid position

For a given function F(x), one has to estimate the time-varying function I(t) such that the centroid of the local spectrum of the generated packet corresponds to the one associated with a real packet. Moreover, in order to be able to sum up all the packets and reconstruct the sound, one has to ensure that the sign of the spectral components does not change with respect to the index.
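The I = 1 property (Chebyshev coefficients equal harmonic amplitudes) is easy to verify with NumPy's Chebyshev module; the spectrum values below are arbitrary.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

# Target spectrum beta_k used directly as Chebyshev coefficients of F
beta = np.array([0.0, 1.0, 0.6, 0.3, 0.1])
theta = 2 * np.pi * np.arange(512) / 512
s = cheb.chebval(np.cos(theta), beta)      # s(t) = F(cos(w0 t)), index I = 1
# Project the waveshaped signal back on cos(k theta)
recovered = [s.mean()] + [2 * np.mean(s * np.cos(k * theta))
                          for k in range(1, len(beta))]
# recovered equals beta: the Chebyshev coefficients are the harmonic amplitudes
```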
For that purpose, one has to consider the function C(I) giving the position of the centroid of the synthetic packet spectrum {h_k(I)} versus the index I:

C(I) = Σ_{k=0}^{∞} k |h_k(I)| / Σ_{k=0}^{∞} |h_k(I)|

together with the set of local spectra {h_k(I)}. The estimation of I(t) is straightforward if the curve C(I) is monotonic and if the amplitudes of the components are positive; it then consists in associating with each value of C(t) the corresponding value of I. Unfortunately, the curve C(I) is generally non-monotonic and local maxima arise, leading to an ambiguity in the correspondence C --> I. Arfib proposed alternating the signs of the components to reduce this effect, but this trick cannot be used in our case because of the great importance of the sign of the components. The only way is to restrict the range of variation of the index I to a domain over which C(I) is monotonic and the sign of the spectral components does not change. We are still in the process of explicitly calculating this range, but we have already verified that the domain {I such that I > I_0} satisfies our requirements. Figure 12 shows the corresponding curve C(I), with 0 < I < 1, for a waveshaping function generating the spectrum {h_k}_{k=0..9} = {0,9,8,7,6,5,4,3,2,1} with an index equal to 0.5.

Figure 12
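A sketch of the inversion C(t) --> I(t) on a range where C is monotonic follows. The toy spectrum h_k(I) = I^k / k! is an assumption chosen only so that brightness grows with the index; any spectrum family with a monotonic centroid would do.

```python
import numpy as np
from math import factorial

def centroid(h):
    """C = sum_k k|h_k| / sum_k |h_k|."""
    h = np.abs(h)
    return (np.arange(len(h)) * h).sum() / h.sum()

def toy_spectrum(I, K=8):
    # Assumed spectrum family: higher index -> brighter spectrum
    return np.array([I ** k / factorial(k) for k in range(K)])

Is = np.linspace(0.1, 2.0, 50)
CI = np.array([centroid(toy_spectrum(I)) for I in Is])

def index_for_centroid(target):
    # CI is monotonic here, so the correspondence C -> I can be inverted
    return np.interp(target, CI, Is)
```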

IV.2-3 Correcting the amplitude and shifting the packets

When the waveshaping index changes, the energy of the generated signal changes too, and one has to introduce a corrective amplitude term to preserve the original energy. In order to return the packet to its original position, one has to shift it in frequency. This can be achieved by single side band modulation. This technique requires the knowledge of the Hilbert transform of the signal. This transform can be calculated directly, but for efficiency one would rather calculate it through another waveshaping function. For example, one can show that for I = 1 the corresponding waveshaping function is given by a sum of Chebyshev polynomials of the second kind. The mathematics for an index different from 1 are still being worked out.

IV.3- Resynthesis through FM

For numerical efficiency, one would like to describe a spectral line packet in terms of a single FM oscillator; that is to say:

B_l(t) Σ_{k=1}^{N} α_lk E_k(t) cos(ω_k t) ≈ C_l(t) B_l(t) cos(ω_c t + I_l(t) sin(ω_m t))

where the quantities to estimate are:
- C_l(t): a corrective amplitude term,
- ω_c: the carrier frequency,
- I_l(t): the modulation index,
- ω_m: the modulating frequency.

Let us assume that the signal is harmonic: ω_k = k ω_0. The modulating frequency can then be chosen as the fundamental ω_0. From a psychoacoustic point of view, each spectral line packet is essentially characterized by its energy, its brightness related to its centroid, and its richness related to its spectral width. Consequently, for each packet, the carrier frequency will be chosen as the harmonic closest to the centroid of the packet; the modulation index will be chosen so as to preserve the richness of the packet, and the corrective term will be chosen so as to preserve its energy. Let us now show how to estimate the modulation index mathematically: the idea is to consider scalar products between a packet and FM spectra.
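This scalar-product estimation can be sketched numerically with dimensionless frequencies (unit modulating frequency, arbitrary carrier and index values): sweeping the candidate index and keeping the one that maximizes the modulus of the scalar product recovers the index of a "measured" FM packet.

```python
import numpy as np

def fm_signal(carrier, index, t):
    """Complex FM oscillator with unit modulating frequency."""
    return np.exp(1j * (carrier * t + index * np.sin(t)))

t = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
packet = fm_signal(5.0, 1.7, t)              # "measured" packet, index 1.7
candidates = np.linspace(0.0, 4.0, 81)
scores = [np.abs(np.vdot(fm_signal(5.0, x, t), packet)) / len(t)
          for x in candidates]
best = candidates[int(np.argmax(scores))]    # scalar product peaks at x = 1.7
```

With equal carriers, the normalized score equals |J_0(I_0 - x)|, which is maximal (value 1) exactly at x = I_0, as the identity below shows.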
As an example, let us assume that a packet is itself an FM spectrum of index I_0. Then the scalar product is given by:

(1/2π) ∫_{-π}^{π} e^{i(nt + I_0 sin(t))} e^{-i(mt + x sin(t))} dt = J_{m-n}(I_0 - x)

where J_n(x) is the nth Bessel function of the first kind. Let us remember that the maximum of all these functions J_n(x) is J_0(0) = 1. Consequently, the maximum of the scalar product between two FM spectra with the same modulating frequency is obtained when the values of the indexes and the carrier frequencies are equal. Then, the maximum of the scalar product between a packet and FM spectra gives the modulation index leading to the best FM approximation of the packet.

References

- Ph. GUILLEMAIN, R. KRONLAND-MARTINET (1992). "Additive resynthesis of sounds using continuous time-frequency analysis techniques", International Computer Music Conference Proceedings, San Jose, Calif., 14-18 October 1992, pp 10-13.
- D. ARFIB (1979). "Digital Synthesis of Complex Spectra by Means of Multiplication of Non-Linear Distorted Sine Waves", Journal of the Audio Engineering Society, Vol. 27, pp 757-768.
- C.Y. SUEN (1970). "Derivation of Harmonic Equations in Non-Linear Circuits", Journal of the Audio Engineering Society, Vol. 19(6), pp 675-676.
- J.W. BEAUCHAMP (1975). "Analysis and Synthesis of Cornet Tones Using Nonlinear Interharmonic Relationships", Journal of the Audio Engineering Society, Vol. 23, Dec 1975, pp 778-795.