SINOLA: A New Analysis/Synthesis Method using Spectrum Peak Shape Distortion, Phase and Reassigned Spectrum

Geoffroy Peeters, Xavier Rodet
Ircam - Centre Georges-Pompidou, Analysis/Synthesis Team, 1, pl. Igor Stravinsky, 75004 Paris, France
Geoffroy.Peeters@ircam.fr, Xavier.Rodet@ircam.fr

Abstract

In this paper we present a new Analysis/Synthesis method named SINOLA, which benefits from both the sinusoidal additive model and the OLA/PSOLA method, and which allows adequate processing according to the inherent local characteristics of the signal. All the parameters of the models are derived at the same time from spectrum analysis. We propose an analytical formulation of a Complex Short-Time Spectrum Distortion measure, which allows the retrieval of precise sinusoidal parameters as well as their slopes. A new partial tracking method is proposed which benefits from this information. Reassigned Spectrum is used in both time and frequency in order to characterize the signal and to position the PSOLA markers.

Introduction

Sinusoidal additive Analysis/Synthesis (A/S) is extremely accurate for signals which can be considered as a sum of sinusoids with stationary parameters in a window of 3 to 4 fundamental periods. On the other hand, Time-Domain Overlap-Add (TD-OLA) and TD-Pitch-Synchronous OLA (TD-PSOLA, which is important for periodic, i.e. harmonic, sounds) are well adapted for non-stationary or non-sinusoidal components and require shorter windows. We present a new A/S method, named SINOLA, which benefits from both the sinusoidal additive A/S and the OLA/PSOLA method.

1 The SINOLA model

In SINOLA, the sinusoidal additive model is used for the stationary sinusoidal components, while the OLA method is used to process attacks, transients, non-periodic or nearly periodic pulses and random components.
* SIN: Sinusoidal additive A/S model [6]

$$ s(t) = \sum_l A_l(t) \cdot \sin\left( \phi_l(0) + \int_0^t \omega_l(\tau)\, d\tau \right) $$

where $A_l(t)$, $\omega_l(t)$ and $\phi_l(0)$ are the amplitude, frequency and initial phase of the $l$th frequency component of the signal. Usually $A_l(t)$ and $\omega_l(t)$ are supposed to be low-pass signals and are therefore considered constant during a short analysis frame. At the synthesis stage, these parameters are interpolated between adjacent frames in order to avoid signal discontinuities. In section 2.3 we show how parameter variations can be included and evaluated in the analysis stage.

* OLA: TD-OLA/TD-PSOLA method [3]

As opposed to sinusoidal additive A/S, OLA and PSOLA do not use a model. This can be viewed as a drawback since possibilities for sound modification are limited. But it can also be viewed as an advantage since the whole signal frame is taken into account, not only the stationary sinusoidal part. The OLA method consists in decomposing the signal into overlapping frames, while PSOLA constrains these frames to be positioned in a pitch-synchronous way at the analysis and at the synthesis stage. A general formulation is:

$$ s_i(t) = s(t) \cdot h_{2L_i}(t - t_i) $$
$$ s_i(t) \rightarrow \tilde{s}_i(t) $$
$$ \tilde{s}(t) = \sum_i \tilde{s}_i\left( t - (t_j - t_i) \right) $$

where
- $s_i(t)$ is the $i$th frame obtained by windowing the signal with a function $h_{2L_i}(t)$ defined on a duration $2L_i$ and centered around time $t_i$,
- $\tilde{s}_i(t)$ is the modified $i$th frame,
- $\tilde{s}(t)$ is the synthesis signal constructed by overlap-adding the successive frames positioned at the $t_j$ ($t_j$ being the synthesis position of the frame analyzed at $t_i$).

In the case of PSOLA the $t_i$ are positioned in a pitch-synchronous way, $L_i$ is equal to the local fundamental period and the positions of the $t_j$ determine the fundamental periods of the synthesis signal. The OLA/PSOLA method is depicted in Table 1 for each type of signal.

Table 1: OLA - PSOLA method for different types of signals

Type: transient
  Method: OLA
  Analysis positions $t_i$: $t_i$ = transient positions
  Synthesis positions $t_j$: $t_j$ = transient positions
  Modified frame $\tilde{s}_i(t)$: $s_i(t)$

Type: random
  Method: OLA
  Analysis positions $t_i$: $t_{i+1} - t_i = T_0(t)$ (footnote 1) + random component
  Synthesis positions $t_j$: $t_{j+1} - t_j = T_0(t)$
  Modified frame $\tilde{s}_i(t)$: alternate time reversing + morphing between $s_i(t)$ and $s_{i+1}(t)$

Type: random (superimposed to a periodic part)
  Method: OLA-PSOLA
  Analysis positions $t_i$: $t_i$ = $t_i$ of periodic part + random component
  Synthesis positions $t_j$: $t_j$ = $t_j$ of periodic part
  Modified frame $\tilde{s}_i(t)$: alternate time reversing + morphing between $s_i(t)$ and $s_{i+1}(t)$

Type: periodic
  Method: PSOLA
  Analysis positions $t_i$: $t_{i+1} - t_i$ = original signal pitch (see 2.4)
  Synthesis positions $t_j$: $t_{j+1} - t_j$ = synthesis signal pitch
  Modified frame $\tilde{s}_i(t)$: morphing between $s_i(t)$ and $s_{i+1}(t)$

Footnote 1: $T_0(t)$ means "average fundamental period of neighboring periodic regions".

2 Parameter Estimation

Three types of information are needed for SINOLA and retrieved simultaneously using the Short Time Fourier Transform (STFT) of the signal:
1. a time-frequency characterization of the signal for its decomposition into transients, sinusoidal and non-sinusoidal components (see 2.1, 2.2, 2.3.2),
2. the time-varying frequency, amplitude and phase of the sinusoidal components (see 2.3),
3. the pitch-synchronous markers in the case of PSOLA (see 2.4).

ICMC Proceedings 1999

2.1 Transient detection

Transients are detected using a cross-entropy measurement derived from the Kullback-Leibler distance [2]:

$$ KL(t_i) = \sum_k F\left( \frac{A(t_{i+1}, \omega_k)}{A(t_i, \omega_k)} \right) \quad (1) $$

where $F(x) = x - \log(x) - 1$, and $A(t, \omega_k)$ is the amplitude of the STFT at time $t$ and frequency $\omega_k$.

2.2 Sinusoidal versus Non-sinusoidal (S/NS) signal characterization

The S/NS signal characterization consists in measuring how well a part of the time/frequency plane can be represented by a sinusoidal model. It is therefore strongly dependent on the assumptions defining the sinusoidal model: local stationarity or non-stationarity of the sinusoidal parameters. Numerous methods have been proposed for S/NS characterization (see [8] for a review) but most of them use this stationarity assumption. In [7] we have proposed a method, called the "Phase Derived Sinusoidality Measure" (PDSM), which measures the sinusoidality coefficient without a stationary frequency assumption. PDSM was based on the following considerations:
* for the main-lobe of a sinusoidal component, the frequency derived from the complex spectrum and the frequency derived from the evolution of the corresponding phase spectrum are the same,
* when parameter stationarity is not assumed, we cannot derive a sinusoidality measure from an instantaneous measurement only, but through the continuity of the parameters along time.

Therefore PDSM compares a temporal model of the evolution of measured frequencies and a temporal model of the corresponding phase derivative. But the measurements used in [7] to create the models were biased (see 2.3.1), because they were taken from a stationary model. In section 2.2.1, we show how the bias in frequency can be avoided by bypassing the use of a model. In section 2.3, we propose a new model which takes into account modulation of amplitude and frequency.

2.2.1 PDSM using frequency "reassignment"

"Reassignment" [1] has been proposed to improve time-frequency representations. In usual time/frequency representations, the values obtained when decomposing the signal on the time/frequency atoms are assigned to the geometrical center of the cells (center of the analysis window and bins of the FFT). In [1] it is proposed to assign each value to the center of gravity of the cell's energy. Frequency reassignment can be written [1] (using band-pass convention):

$$ \omega_r(t, \omega) = \ldots = \frac{\partial \phi^{BP}(t, \omega)}{\partial t} $$
$$ \omega_r(t, \omega_k) = \omega_k - \Im\left\{ \frac{STFT^{BP}_{h'}(t, \omega_k)}{STFT^{BP}_{h}(t, \omega_k)} \right\} \quad (2) $$

where $\phi^{BP}$ is the phase of the band-pass STFT and the subscript of the STFT denotes the analysis window used ($h$, or its time derivative $h'$). The second formulation of $\omega_r(t, \omega)$ is the instantaneous frequency definition which is often used in order to obtain precise frequencies from a Discrete Fourier Transform. The third formulation expresses the correction to apply to the discrete frequency $\omega_k$ in order to obtain the exact frequency. The distance given by PDSM can be shown to be similar to this correction, but using (2) we do not face the frequency bias cited above. The third formulation also provides a low cost method to compute the instantaneous frequency and to measure the sinusoidality.

2.3 Complex Short-Time Spectrum Distortion measure

In classical A/S methods, parameters are often estimated from short-time spectra. The signal is usually assumed to be stationary on the analysis window and, thus, the

spectrum is assumed to have peaks at the frequencies of the sinusoidal components. Unfortunately, the signal is rarely stationary on the analysis window: amplitude and frequency modulation of signal components distort the shape of the assumed spectral peaks, therefore inducing incorrect parameter estimation. Previous studies have shown the importance of spectrum distortion induced by these variations and have proposed partial solutions (neural network [5], signal normalization [7]) or an analytical formulation [4]. We propose here a complete parameter estimation method taking into account amplitude and frequency modulation.

The signal model is a sum of sinusoids with linear variation of amplitude ($a_l + \mu_l t$) and of frequency ($\omega_l + 2\Delta_l t$). $\phi_{0,l}$ is the initial phase and $l$ is the peak index. For $t$ in the $i$th frame centered on $t_i$ (we note $\tau_i = t - t_i$):

$$ s(t) = \sum_l (a_{l,i} + \mu_{l,i}\tau_i) \cos(\phi_{0,l,i} + \omega_{l,i}\tau_i + \Delta_{l,i}\tau_i^2) \quad (3) $$

The Short Time Complex Spectrum is estimated using a truncated gaussian window $g_{\mu,\sigma,L}(t)$ where $\mu$ and $\sigma$ are the mean and standard deviation of the gaussian function and $L$ is the size of the truncation ($L$ must be greater than $9\sigma$ in order to reduce the truncation effect). The Distortion is measured by fitting a second order polynomial around each log-amplitude spectrum peak ($a_{\log}\omega^2 + b_{\log}\omega + c_{\log}$) and around each corresponding unwrapped phase spectrum region ($a_\phi\omega^2 + b_\phi\omega + c_\phi$). For a specific peak index, parameters are given by:

[Equation (4): closed-form expressions of the model parameters in terms of the fitted polynomial coefficients and of $\sigma$]

where $D_l = 1 + 4\Delta_l^2\sigma^4$.

2.3.1 Bias of usual sinusoidal estimators

From (3) and (4) it is easy to show that
* the frequency of the maximum of the log-amplitude spectrum (noted $\omega_{max}$ and usually considered as the frequency position of the sinusoidal component) is in fact at $\omega_l + \frac{\mu_l}{a_l} 2\Delta_l\sigma^2$. Therefore usual frequency estimators have a bias proportional to the amplitude modulation, to the frequency modulation and to the length of the analysis window,
* a similar bias is found for the log-amplitude of the spectrum at $\omega_{max}$, which is equal to $\log(a_l) - \frac{1}{4}\log(D_l) + \ldots$ instead of $\log(a_l)$,
* a similar bias is found for the phase of the spectrum at $\omega_{max}$, which is equal to $\phi_{0,l} + \frac{1}{2}\operatorname{atan}(2\Delta_l\sigma^2) - \ldots$ instead of $\phi_{0,l}$.

2.3.2 Sinusoidality measure and Partial Tracking with time-varying parameters

Extending the sinusoidal model with linear variations renders S/NS estimation more difficult. As suggested in 2.2, information about sinusoidality can only be given by the continuity of the model parameters along time. This can be evaluated by a partial tracking method. Usual partial tracking methods consider three successive frames in order to construct a track. Since the time derivatives of parameters are part of our model, it suffices to consider only two frames together. For each couple of peaks $(m, n)$ (see Figure 1), a track-score $\vartheta(m, n)$ is computed. In a frequency band, the couple that leads to the maximum score (if this score is above a certain threshold) is chosen. If the maximum score is below the threshold, there is a birth, a death or no track in this band.

$$ \vartheta(m, n) = \exp\left( \ldots\, c_1(m, n)\, \ldots\, c_2(m, n)\, \ldots \right) \quad (5) $$

where $c_1(m, n)$ and $c_2(m, n)$ are the maximum curvature (see footnote 2) of the 3rd order polynomials with the following boundary conditions (see Figure 1): for frequency $\{\omega_{m,i}; \Delta_{m,i}; \omega_{n,i+1}; \Delta_{n,i+1}\}$, for amplitude $\{a_{m,i}; \mu_{m,i}; a_{n,i+1}; \mu_{n,i+1}\}$. $\sigma_1$ and $\sigma_2$ are model parameters. Results obtained with (5) are shown in Figure 2.

Figure 1: Curvature computing for couples of peaks (m,n) and (m,n+1)

2.4 PSOLA markers positioning

PSOLA markers (noted $t_i$) have to be placed in a pitch synchronous way, i.e. the distance between two markers must be equal to the local fundamental period. Moreover, because of the windowing applied in the PSOLA method, the markers must be close to the local maxima of signal energy.
In speech processing, Glottal Closure Instants (GCI) detection methods are used in order to place PSOLA markers [9]. These GCI occur pitch-synchronously and are close to the local maxima of energy. For musical signals, GCI methods are not relevant.

Footnote 2: second order derivative.

Figure 2: Partial Tracking method: frequency and frequency slope estimations (thin dashed lines), partial births (thick dashed lines), partials (thick lines), signal: female singing voice, window size: 14 ms, analysis step: 7 ms

This is why other methods, which use phase spectrum information, have been proposed. But then, we cannot guarantee that markers will be close to local maxima of energy. In order to fulfill both periodicity and energy conditions we propose here a new method based on group delay. The method uses a weighted sum of frequency component group delays. The weighting is made according to component amplitudes. Let us define:

$$ f(t) = t + \frac{\sum_{\omega_k} G_d(t, \omega_k) A(t, \omega_k)}{\sum_{\omega_k} A(t, \omega_k)} \quad (6) $$

where $G_d(t, \omega_k)$ is the group delay of frequency $\omega_k$ for a window centered at time $t$. $G_d(t, \omega_k)$ can be computed in an efficient way using time "reassignment" [1] which can be written (using band-pass convention):

$$ t_r(t, \omega) = \ldots = t - \frac{\partial \phi^{BP}(t, \omega)}{\partial \omega} $$
$$ t_r(t, \omega_k) = t + \Re\left\{ \frac{STFT^{BP}_{th}(t, \omega_k)}{STFT^{BP}_{h}(t, \omega_k)} \right\} \quad (7) $$

where the subscript of the STFT denotes the analysis window used ($h$, or the time-weighted window $t \cdot h(t)$), and where we recognize, in the second formulation, the group delay definition. This relates the new PSOLA marker positioning method to time reassignment. The third formulation gives a method for computing the group delay at low cost. Marker positions are then given by the local maxima of the inverse of the derivative of $f(t)$ (special care has to be taken considering that $f(t)$ is not injective):

$$ t_i = \arg \operatorname{local\,max}_t \left( \frac{1}{f'(t)} \right) \quad (8) $$

Because of the windowing applied before computing $G_d$, a confidence measure of $f(t)$ must be computed for each $t$. It is given by an amplitude weighted standard deviation (in $\omega_k$) of the $G_d(t, \omega_k)$. Large std values mean small confidence while small std values mean large confidence. Results obtained with this new method are shown in Figure 3.
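The group-delay based positioning described above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`weighted_group_delay`, `marker_function`, `pick_markers`) are hypothetical, a Hann window is assumed, the amplitude weights are assumed to be normalised by their sum, and times are expressed in samples. The per-bin group delay is taken as the real part of the ratio of two FFTs, one computed with the window h and one with the time-weighted window t·h(t). For the marker picking, the sketch substitutes a histogram of the f(t) values for an explicit evaluation of 1/f'(t): where many analysis frames map to the same instant, f'(t) is close to zero, and summing over frames is one possible way of handling the fact that f(t) is not injective.

```python
import numpy as np


def weighted_group_delay(frame, h):
    """Amplitude-weighted mean and standard deviation of the per-bin group
    delay, in samples relative to the window centre."""
    n = np.arange(len(h)) - (len(h) - 1) / 2.0   # time axis centred on the window
    X_h = np.fft.rfft(frame * h)                 # spectrum with window h
    X_th = np.fft.rfft(frame * n * h)            # spectrum with window t*h(t)
    amp = np.abs(X_h)
    if amp.max() < 1e-12:                        # silent frame: no estimate
        return np.nan, np.nan
    ok = amp > 1e-9 * amp.max()                  # skip bins with an ill-conditioned ratio
    gd = np.real(X_th[ok] / X_h[ok])             # group delay of each remaining bin
    w = amp[ok] / amp[ok].sum()                  # normalised amplitude weights (assumption)
    mean = float(np.sum(w * gd))
    std = float(np.sqrt(np.sum(w * (gd - mean) ** 2)))
    return mean, std


def marker_function(x, win_len, hop=1):
    """f(t) and its confidence (weighted std) for every analysis frame."""
    h = np.hanning(win_len)                      # Hann window: an assumption of this sketch
    centre = (win_len - 1) / 2.0
    starts = np.arange(0, len(x) - win_len + 1, hop)
    f = np.full(len(starts), np.nan)
    conf = np.full(len(starts), np.nan)
    for i, s in enumerate(starts):
        mean, std = weighted_group_delay(x[s:s + win_len], h)
        f[i] = s + centre + mean                 # window centre plus weighted group delay
        conf[i] = std                            # large std means low confidence
    return starts + centre, f, conf


def pick_markers(f, n_samples, min_count=3):
    """Local maxima of the density of f(t) values, used here in place of an
    explicit 1/f'(t): many frames mapping to one instant means f' is near 0."""
    valid = f[np.isfinite(f)]
    idx = np.clip(np.round(valid).astype(int), 0, n_samples - 1)
    density = np.bincount(idx, minlength=n_samples)
    left = np.r_[0, density[:-1]]
    right = np.r_[density[1:], 0]
    is_peak = (density >= min_count) & (density > left) & (density >= right)
    return np.flatnonzero(is_peak)


# Toy usage: an impulse train with a 100-sample period.
x = np.zeros(300)
x[[50, 150, 250]] = 1.0
t, f, conf = marker_function(x, win_len=64)
markers = pick_markers(f, len(x))
```

In this toy case the window is deliberately shorter than the period, so each frame contains at most one pulse and `markers` holds the three pulse positions with a near-zero `conf`. On real signals, with a window of about two periods as in Figure 3, f(t) varies smoothly and the threshold `min_count` and the histogram resolution would need tuning.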
Figure 3: PSOLA markers positioning: signal (top), confidence measure (middle), inverse of the derivative of f(t) (bottom), signal: male speech voice, window size: 20 ms, analysis step: 1 ms

Conclusion

SINOLA derives from spectrum analysis all the information necessary for high quality sound processing such as time warping, pitch shifting, spectrum dilatation and so on. Because of its dual processing (SIN + OLA), it preserves the inherent local characteristics of the signal (sinusoidal, random-noise, attacks-transients) and allows easy and natural modifications of the signal. Examples of the sound quality obtained with this method will be given during the presentation of this paper.

References

[1] F. Auger and P. Flandrin. Improving the Readability of Time-Frequency and Time-Scale Representations by the Reassignment Method. IEEE Trans. Signal Processing, 1995.
[2] M. Basseville. Distance Measures for Signal Processing and Pattern Recognition. Signal Processing, 1989.
[3] F. Charpentier and M. Stella. Diphone Synthesis Using an Overlap-Add Technique for Speech Waveforms Concatenation. In ICASSP, 1986.
[4] J. Marques and L. Almeida. A Background for Sinusoid Based Representation of Voiced Speech. In ICASSP, 1986.
[5] P. Masri. Computer Modeling of Sound for Transformation and Synthesis of Musical Signals. PhD thesis, University of Bristol, 1996.
[6] R. McAulay and T. Quatieri. Speech Analysis/Synthesis based on a Sinusoidal Representation. IEEE Trans. Acoust. Speech Signal Process., 1986.
[7] G. Peeters and X. Rodet. Sinusoidal versus Non-Sinusoidal Signal Characterisation. In COST-G6 DAFX (Workshop on Digital Audio Effects), Barcelona, November 1998.
[8] G. Richard and C. d'Alessandro. Analysis/Synthesis and Modification of the Speech Aperiodic Component. Speech Communication, 1996.
[9] H. Strube. Determination of the Instant of Glottal Closures from the Speech Wave. J. Acoust. Soc. Am., 1974.