Page  00000001 Sinusoidal and Residual Decomposition and Residual Modeling of Musical Tones Using the QUASAR Signal Model Yinong Ding DSP Research & Development Center Texas Instruments, Inc., MS 8374 Dallas, TX 75265-5474, USA ding@ti.com Abstract A novel approach to the analysis and synthesis of musical tones based on a Quadratic polynomial-phase Sinusoids And Residual (QUASAR) signal model is presented. Taking advantage of the fact that musical signals are usually analyzed off-line, the proposed approach estimates the sinusoidal parameters using a global waveform fitting criterion and uses LPC-based waveform coding techniques for synthesizing residuals. An example of piano note is included to demonstrate the advantages of the proposed approach in terms of memory use (data reduction), computational complexity, and synthesis quality. 1 Introduction In recent years, music and speech synthesis using a sinusoidal model has been an active and interesting research topic. In their Spectral Transformation System (STS or M&Q) [1], McAulay and Quatieri of MIT Lincoln Laboratory proposed to use a sinusoidal model for speech analysis and synthesis. In 1990, Serra and Smith developed a Spectral Modeling System (SMS) [2]. SMS achieved several advancements over M&Q for music applications. One significant contribution of SMS is the proposal for a sinusoidal and stochastic decomposition of musical tones and the use of a noise-driven source filter model for synthesis of the stochastic component. In both M&Q and SMS, sinusoids are modeled (parameterized) by their amplitude, frequency and phase. In order to account for the time-varying nature of the signal being analyzed, analysis is performed on a frame-byframe basis. At the synthesis time, sinusoidal parameter estimates obtained at frame boundaries are interpolated to obtain their values at each sample point. In [1], McAulay and Quatieri also proposed to use a cubic Xiaoshu Qian Department of Electrical Engineering University of Rhode Island Kingston, RI 02881, USA qian@ele.uri.edu phase interpolation algorithm to synthesize sinusoids. In our study of M&Q and SMS, several problems related to the signal model and the analysis and synthesis algorithms are identified. First, the parameter estimates obtained by a frame-based approach are "time-localized", not being able to take full advantage of the fact that most musical tones are processed off-line. Second, an analysis by synthesis approach is not employed. Therefore the synthesis interpolation errors can not be taken into account while model parameters are estimated. In addition, the cubic phase interpolation algorithm has an inherent artifact of frequency track "oscillation" in some scenarios and is computationally expensive [3, 4]. Third, a noise-driven source filter model for synthesis of stochastic components generally does not work well due to the loss of phase coherence or alignment between the sinusoidal and the stochastic part [3]. In this paper, we introduce a Quadratic polynomialphase Sinusoids And Residual (QUASAR) signal model for music analysis and synthesis. The sinusoidal parameters can be estimated using an analysis-by-synthesis approach and a global waveform fitting criterion. We propose to use LPC-based waveform coding techniques for synthesizing the residual obtained by subtracting the sinusoidal part from the original. 2 Signal Models Various natural signals, such as music and speech, as well as many signals used in communications, radar, sonar, and other man-made signals, can be represented by a sum of sinusoids. A commonly used sinusoidal model is given in (1), where a signal is modeled as a sum of sinusoids with time varying amplitudes, Am(t), frequencies, wm (t) and phases, Om (t). s(t) = E = Am(t)sin(Om(t)), (1) Om(t)= Jt aWm()do + rlm + (wm(t)).

Page  00000002 In the above signal model, M is the number of sinusoids, which could also be a function of time in some applications. The time-varying phase,Om(t) contains the timevarying frequency contribution, a fixed phase offset, rIm, which accounts for the fact that the sine waves will generally not be in phase, and a time-varying frequency dependent term, 4(wm(t)). In the speech production model, D(wm(t)) is determined by the phase function of the vocal tract. The signal model in (1) is also the one used in M&Q and SMS. To extract the time-varying amplitudes, frequencies and phases of the component sine waves, a conventional approach is to analyze the signal on a frameby-frame basis. At the synthesis stage, the parameters estimated at frame boundaries are interpolated to obtain their values at each sample point. A frame-based approach suffers from two major limitations. First, it estimates signal parameters only using the "local" data, unable to "look-ahead" and "look-back". Therefore the parameter estimates are "time-localized". Secondly, a frame-based approach has to use some kind of window to truncate the data. The window effect thus limits the estimation resolution. Further more, the interpolation for synthesis usually produces errors, especially if the signal parameters do not vary slowly with respect to the analysis frame rate, the frame rate parameter estimates can not be reliably interpolated using lower-order evolution models such as linear interpolation for amplitudes. For music synthesis applications, musical notes are usually pre-recorded. This allows the estimation of model parameters to be performed in a globally optimized sense, and this also allows an analysis-bysynthesis approach to be employed so that the interpolation errors can be taken into account while the model parameters are estimated. We show how these can be implemented in Section 3. 2.1 Cubic phase interpolation In signal model (1), sinusoids are parameterized by time-varying amplitudes, frequencies, and phases. The parameter values at frame boundaries are estimated at the analysis time and are interpolated at the synthesis stage. For many years, researchers have used a cubic phase interpolation algorithm to synthesize sinusoids. The idea of applying a cubic polynomial to interpolate the phase between frame boundaries was independently proposed by McAulay and Quatieri in their Spectral Transformation System [1] and by Almeida and Silva in their harmonic sine wave synthesizer [5]. The cubic phase interpolation algorithm is now reviewed below. Assume the frequency and phase estimates of the mth frequency track at two ends of the nth frame are {n on and n+1 n+li1 ^ m, w } and, }, ^M) m1 (2) then a cubic phase interpolation function can be postulated as follows, S= + + c 2 n 3 0'(T)-= a' +b + +d +CT, (3) where = t - ti, and t, is the starting time of the nth frame. In order to guarantee the phase function values match the measured (estimated) frequencies and phases at the frame boundaries, the cubic polynomial coefficients a", b", c" and d" are obtained by solving a set of linear equations, and the results are shown below [1], (the track subscript, m, has been omitted for clarity.) an = On b" = cf, (4) and c" and d" satisfy the following matrix equation d (P) T2h(P 1 in1 - w- "T + 2wPl 2 1 ^ 1 + n [3 - -2 wT - The integer P in the above equation is given by P = (8" + " +~T- 81 +e, 27( 2 (5) (6) where T (and throughout of the paper) is the frame length, and e is the smallest number that makes P an integer. It is clear that Ic| < Tr. From (5) and (6), we can obtain n _ a;7n+l -, 3E c 2T + T2 d" - -.T (7) This cubic phase interpolation algorithm seems to have gained a wide acceptance in the speech and music signal processing community. However, in our experiments with a variety of musical instrument sounds, it is noticed that the interpolated frequency tracks resulted from the cubic interpolation scheme tend to exhibit small "oscillations" which is especially conspicuous when the difference of the frequency measurements at frame boundaries is small. This phenomenon of frequency track oscillation is illustrated using synthetic data in Figure l(a), where the frequency measurements at the frame boundaries are constant, fo = 2000 Hz, and the phase measurement at the nth frame is (1 + 0.012en)27rfonT + Tro (module 27r). The 0.012en

Page  00000003 is the perturbation term used to model phase measurement errors. en is independently and identically distributed Gaussian random variable with zero mean and unit variance. rlo is a random initial phase, uniformly distributed over [-r, 7r]. Also shown in Figure l(b) is the corresponding phase track (solid line). Both frequency and phase measurements are denoted by the circle symbols. Frequecy tracks B 2 B Bo B B2 t_2 t I to tl t2 t3 frame 0 frame 1 frame 2 frame 3 Figure 2: Illustration of the quadratic B-spline functions AO I1 - - - - - - - - - - - A, A2 A3 A4 I I I I t2 frame O0 t3 t4 frame 3 frame 1 frame 2 Figure 3: Illustration of the tions linear B-spline func taneous phase) at the frame boundaries are given by (a) w'(0) w'(T) Sn+l_n 6e T 2 *,n+l _,n 6c T T2" (8) Phase tracks Data frame number We can see that when the difference of the frequency measurements at frame boundaries is small, the first term in (8) is smaller in magnitude than the second one. The w'(0) and w'(T) have opposite signs, forcing the frequency track changes its slope at frame boundaries, and thus displaying the phenomenon of "oscillation". This phenomenon, though hardly noticeable in many applications, is definitely not desired, and nor is there any physical basis to justify its existence. 2.2 Least squares quadratic phase interpolation Due to inevitable estimation errors, the frequency and phase measurements shown in (2) will not be consistent in general. Imposing a cubic phase model and forcing the interpolated parameters to match the errorcorrupted estimates will end up with creating artifacts. In addition, using cubic phase model increases computational complexity of a synthesis system. Motivated by producing smoother frequency tracks and reducing computational complexity, we model the instantaneous phase of a sinusoidal component as a quadratic polynomial. At the nth frame, the phase function is repre (b) Figure 1: Frequency and phase tracks obtained using the cubic (solid line) and the least squares quadratic (dashed line) phase interpolation algorithms (a) Frequency tracks; (b) Phase tracks. The phenomenon of the frequency track "oscillation" shown in Figure l(a) is actually predictable. From the interpolation formulas (3), (4), (5), (7). the frequency derivatives (the second order derivatives of the instan

Page  00000004 sented by 0n(r) = an + bnT + Cn2. Assume we have N frames. There will be 3N unkn( (coefficients) to be determined from the frequency phase measurements. However, due to the frequl and phase continuity constraints at frame boundar 0T(T) = 8" (0) wh(T) = wnl(0), n= 0, 1, ", N - 2, the total number of unknowns is reduced to 3N - 2( 1) = N + 2. Note that the phase function 0(t) piecewise quadratic polynomial, it can also be expre by a linear combination of N + 2 quadratic basis si functions (B-splines) [6]. That is N-l 0(t)= "B,(t), n=-2 where B,(t) is the nth quadratic B-spline functio illustrated in Figure 2. We determine the N + 2 cc cients, a, n = -2, -1, 0,..., N - 1 by minimizing following weighted squared error (9) owns and ency ies: (10) 2.3 The QUASAR signal model Based on above discussions, we propose to model a musical signal as a sum of quadratic polynomial-phase sinusoids and residual, i.e., s(t) = Em1 Am(t)sin(0m(t)) + r(t), 0m (t) = a,(t) +bm(t)t+ c,(t)t2. (13) In this model, r(t) is the residual of s(t). Since r(t) N _ usually sounds very noisy, it is also called the stochastic is a part or noise of s(t). r(t) contains important musical ssed instrument information which can not be well reprepline sented by sinusoids such as the breath noise in a flute or saxophone, and the bowing friction noise of a violin. Without this information, the synthesized musical tones tend to sound artificial. Introducing a residual part into (11) the signal model avoids using a very high number of sinusoids to model the noise-like part of the signal, and n as thus makes the model more parsimonious. )effi- It should be pointed out that the model for the sinuSthe soidal part in (13) is different from that in (1). Though many functions, including the phase function in (1) can be generally represented with a polynomial based on the Taylor expansion theory, the model in (13) specifically assumes that the sinusoidal phase is a second order poly(12) nomial, and the coefficients of the polynomial are the n be model parameters to be (directly) estimated from the mis- data. In the next section, we propose a novel approach *ame to estimating quadratic phase sinusoidal parameters for musical application. We obtain these parameter estio [3] mates so that the given signal (data) fits the model in ents, a least square sense. It is well known that the least min- squares estimator is equivalent to an maximum likelihood estimator if the noise is white Gaussian. N A [0(t)-0"]2+ (1 n=O N A) 3 [w(ts) n=o fnl]2 The weighting factor A in the above equation cai interpreted as the importance index for the "phase match" relative to the "frequency mis-match" at fr boundaries or as the confidential level to the accu of the phase measurements. Readers are referred t for details on how to find the combination coefficiP a", n = -2, -1, 0, --, N- 1 when the above error: imization criterion is employed. Applying this least squares quadratic phase ii polation algorithm with A = to the same synti data as for the example in Figure 1, we can get a n smoother frequency track with less computations. resulting frequency and phase tracks, denoted by da lines, are plotted in Figure l(a) and (b), respective The difference between the least squares quad: phase interpolation and the "magnitude only" re struction [1] is obvious, though the instantaneous p function of the latter is also a second order pol mial. In the proposed quadratic phase interpolatio: gorithm, both frequency and phase measurements taken into account and are made "consistent" in a square sense, while in the magnitude-only reconst tion, the phase measurements are totally ignored the phase is simply the integral of the instantan frequency. nterhetic nuch The shed ly. ratic;conhase ynon alare least Lrucand eous 3 Estimation of Model Parameters In the previous section, we discussed the rationales to use a quadratic polynomial phase sinusoidal model for the sinusoidal part of musical signals. In this section, we discuss how to estimate the model parameters from music data. Using polynomial phase sinusoidal model for signal processing is not new. It has applications in radar, sonar, and communications as well. A generic approach to finding the estimates of the phase coefficients and amplitudes of polynomial phase components is to derive a maximum likelihood estimation algorithm. Due to its computational burden, the maximum likelihood estimator is not practical. The computationally efficient alternatives have to be exploited.

Page  00000005 In [7], a suboptimal algorithm based on a new transformation - the discrete polynomial-phase transformation (DPT) was proposed. For our application, considering the fact that musical notes are usually pre-recorded and analyzed off-line, we can seek a more suitable solution to the estimation problem. 3.1 Estimation of Sinusoidal Parameters As pointed out earlier, most existing analysis algorithms do not take an analysis-by-synthesis approach. Therefore any errors resulted from the interpolation process for synthesis can not be recovered. This problem needs to be addressed when deriving our estimation algorithm. In order to do this, and by noting that each sinusoidal component has a nominal frequency, denoted by Wi, whose contribution to the instantaneous phase is Wmt, we first rewrite the signal model (13) as M s(t) = Am(t) sin(w~mt + /m(t)) + r(t), m=l (14) where qm,(t) is the phase deviation track of the mth sinusoidal component. It should be emphasized that cwm, m = 1,..., M are by no means constrained to be harmonically or quasi-harmonically related. We then express the time-varying amplitude and phase deviation tracks as linear combinations of linear and quadratic spline functions, respectively, i.e., N Am(t)= - A",n(t), (15) n=O N--1 m (t)= C Bn, (t). (16) n--2 In the above equations, A (t) and B,(t) are the nth linear and quadratic B-spline function, respectively, as illustrated in Figure 3 and 2, where n is the data frame index. From Equations (14) - (16), we see that the number of sinusoidal parameters has reduced from three to two per sinusoid per frame compared with any existing sinusoidal model-based music synthesis algorithms or systems. This represents more than 33% of savings in memory use and is a significant advantage for using the proposed signal model. The objective of the estimation algorithm is to obtain estimates of A" and a" based on the following error minimization criterion: NL--1 (A,ac) =arg minm (s(k)- (k))2, (17) A, k=o where s(k) -= A ( A sa(k) sin (Wmk + Nj BC~ (k), m=l n=0 n=-2 (18) k is the sample index, L is the number of samples in a frame of length T, and the sampling rate is assumed to be 1 for convenience. From Equation 17, we see that the sinusoidal parameters are estimated based on a waveform (data) fitting criterion as opposed to the energy fitting one used by M&Q and SMS. We also see that the error to be minimized is the energy of the residual for a given number of sinusoids. It is worth noting that the synthesis of sinusoids is implicitly performed while sinusoidal parameters are estimated. Because the synthesis process has been built into the analysis, any interpolation errors due to the use of the linear and quadratic evolution models (15) and (16) for amplitude and phase, respectively, will be included in the residual. The completed estimation procedure for sinusoidal parameter is provided in [3, 4], and is diagrammed in Figure 4. 3.2 Modeling of the Residual As stated in Section 2, the residual plays an important role in synthesis of musical tones. Without the residual, a synthesized musical tone tends to sound artificial. In addition, it is desirable to change and/or modify the residual for music synthesis. An effective modeling and analysis/synthesis of the residual is the key to achieve this objective. A. Obtaining the Residual From (13), we see that the residual can be obtained by subtracting the sinusoidal part from the original signal. It is well known that the synthesis of sinusoids dominates the computation of a sinusoidal model based music synthesizer, and the calculation of the instantaneous phase is the major portion for generating sinusoids. It is observed in [3] that the use of a quadratic phase model allows the instantaneous phase to be calculated recursively using two additions instead of two additions and two multiplications, and the recursive algorithm can not be applied when a cubic phase model is used due to distortions caused by accumulated numerical errors. This is not only because that a higher order polynomial has more numerical problems but also because that the coefficient of the cubic term is usually very small for musical tones, and thus is extremely

Page  00000006 sensitive to the numerical errors. Due to these factors, the computational complexity for obtaining sinusoidal phase can be reduced by more than 50% [3]. Once the sinusoidal part is reconstructed, it is straightforward to get the residual by subtracting the sinusoidal part from the original as shown in Figure 4. B. Synthesis of the Residual When proposing to decompose a musical tone into a deterministic and stochastic part, Serra and Smith also proposed to use a noise-driven source filter model for the analysis and synthesis of stochastic components [2]. The filtering can be implemented in time domain or in frequency domain. Correspondingly, the magnitude of the filter frequency response, or the magnitude spectrum is made an approximation to the spectrum envelope of the stochastic signal. Subsequent research on the problem, though varying in some aspects such as the proposed use of the noise perception model [8], followed the basic model with a pure random noise as the filter excitation. An advantage of using random noise as excitation is that it eliminates the need of memory for storing the residual. However, it is our experience (and of many others [8]) that this noise-driven source filter model generally does not work well due to the loss of phase coherence between the sinusoidal and the stochastic part. As a result, the sum of the sinusoidal and stochastic part often sounds like two separate signals. In order to maintain the phase coherence (waveform coherence) and still achieve data reduction, waveform coding-based coding techniques can be used for the residual analysis and synthesis. Among various waveform coding algorithms, the Multi-pulse Excitation Linear Prediction (MPLP) for low-bit rate speech coding [9] is of particular interest. MPLP makes no prior assumption about the nature of its input, known as the excitation signal. In MPLP, the excitation signal consists of a sequence of pulses for all speech classes - including voiced and unvoiced. The coding algorithm makes no attempt to make its excitation periodic or non-periodic. On the other hand, we found that, the residual, when properly obtained, does not contain such musical information as pitch, loudness, and other "intelligently important" information. What it carries is the information about the characteristics of the instrument, providing the "gesture" of musical sounds. The nature of the residual along with the features of MPLP make this coding algorithm a good candidate for the modeling of musical signal residuals. Another similar, but much simpler coding algorithm is the Regular Pulse Excitation (RPE) algorithm [10]. A description of the MPLP algorithm in matrix arithmetic is given in [3], where a procedure for finding the locations and amplitudes of a given number of pulses in a given time interval different from that used in [9] is provided. The algorithm block diagram of MPLP is depicted in Figure 5, where A(z) is a qth order time-varying filter which can be determined by using linear prediction (LP) techniques. A(z/y) is often called the error-shaping filter because it changes the spectrum shape of the error signal to be minimized. The vector b and p contain the amplitudes and locations of Q samples in an L sample long data frame, respectively. Normally, Q < L. Figure 4: Analysis block diagram of the proposed analysis/synthesis algorithm 4 Sound Example In this section, we present an example of analysis and synthesis of musical tones. In this example, the signal is a piano note (G3, mezzo forte), and the optional parameter optimization step shown in Figure 4 was not performed. For sinusoidal parameter estimation, the analysis frame length T is set to about 3 milliseconds. Figure 6 shows the periodograms of the original note, the reconstructed sinusoidal part and the residual, respectively. From this figure, we see that the number of sinusoids is 8. The nominal frequencies of these 8 sinusoids are marked with + signs on the periodogram of the original note. Shown in Figure 7 are the signal waveforms of the original note, the reconstructed sinusoidal part and the residual, respectively. In order to show the details of the waveforms, only 0.1 second long of sam

Page  00000007 5 Conclusions r(k) u(k) 0- A(z) e(k) Excitation v + Error lb I pulse- 1Ero. bv generation A(z /y) IMinimizatio (a) {bqp} Excitation v(k) r(k) 0- - pulse 1 generation A(z) (b) Figure 5: Block diagram of MPLP. (a) Analyzer, (b) Synthesizer. We have proposed a new signal model for music analysis and synthesis in this paper. From our extensive experiments with a variety of musical instrument sounds including piano, trumpet, violin, flute, bassoon, we found that with the unique quadratic polynomialphase sinusoids and residual signal model, along with the analysis and synthesis algorithms introduced in this paper, a very good sinusoidal and stochastic decomposition can be achieved. Compared with the M&Q and SMS algorithms, the proposed approach reduces memory use for storing sound parameters by more than 33% and reduces computations for calculating sinusoidal phase by about 50%. We are currently working on the implementation of the proposed algorithm on a Texas Instruments' fixed-point DSP, and will report our results in the future. ples at the beginning of the note is plotted. For the piano residual, we use the MPLP algorithm introduced in Section 3.2. The residual analysis frame length is about 6 milliseconds long, twice as long as the frame length used for sinusoidal parameter estimation. A 12th order LPC filter coefficients are obtained on each frame with a rectangular window and a 50% overlap. The "shaping coefficient" 7 is chosen to be 0.8, and the sample to pulse ratio, L/Q, is 4. We plotted the attack portion of the piano residual and its reconstruction from the "sparse" pulses in Figure 8. It is easy to see that the phase coherence between the sinusoidal part and the reconstructed residual is well preserved. We also experimented with other musical instrument sounds including flute, violin, bassoon and trumpet. We found that for many musical instruments, their residuals are much more stable than the piano residual. Therefore a simpler coding algorithm called RPE can be used instead. In all residuals of respective musical tones, no tonal quality of sounding effect and no difference between originals and re-synthesized residuals were perceived. These, together with their periodograms and waveforms, indicate that the proposed analysis algorithm performs a good sinusoidal and stochastic decomposition and the selected coding algorithms are well suited for analyzing and synthesizing residuals of musical tones. 0 0.5 1.5 Hz x 104 Figure 6: Periodograms of the piano note G3, mezzo forte

Page  00000008 Piano, G3, mezzo forte, MPLP x 104 1.5 0.5 0 -0.5 t5 (2 rr 0.03 0.04 0.05 0.06 0.07 Second 0.08 0.09 0.1 500 1000 1500 Sample 2000 2500 3000 Figure 7: Waveforms of the piano note G3, mezzo forte References [1] R. J. McAulay and T. F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, pp. 744-754, August 1986. [2] X. Serra and J. O. Smith, "Spectral modeling system: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition," Computer Music Journal, vol. 14, no. 4, 1990. [3] Y. Ding and X. Qian, "Processing of musical tones using a combined quadratic polynomial-phase sinusoids and residual (QUASAR) signal model," submitted to the Journal of the Audio Engineering Society, Dec. 1996. [4] X. Qian, Track frequency components in speech and music signals. PhD thesis, University of Rhode Island, Kingston, RI, 1997. [5] L. B. Almeida and F. M. Silva, "Variable-frequency synthesis: An improved harmonic coding scheme," in Proc. of the International Conference on Acoustics, Speech and Signal Processing, (San Diego, CA), p. 27.6.1, March 1984. Figure 8: Original and re-synthesized residuals of a piano note [6] C. de Boor, A practical guide to spline. SpringerVerlag New York, Inc., 1978. [7] S. Peleg and B. Friedlander, "Multicomponent signal analysis using polynomial-phase transform," IEEE Transactions on Aerospace and Electronic Systems, vol. 32, pp. 378 - 386, January 1996. [8] M. Goodwin, "Residual modeling in music analysis-synthesis," in Proc. of the International Conference on Acoustics, Speech and Signal Processing, (Atlanta, GA), pp. 1005- 1008, May 1996. [9] B. S. Atal and J. R. Ramde, "A new model of LPC excitation for producing natural-sounding speech at low bit rates," in Proc. of the International Conference on Acoustics, Speech and Signal Processing, (Paris, France), pp. 614-617, April 1982. [10] P. Kroon, E. F. Deprettere, and R. J. Sluyter, "Regular-pulse excitation - A novel approach to effective and efficient multipulse coding of speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-34, October 1986.