Transient Modeling Synthesis: a flexible analysis/synthesis tool for transient signals

Tony S. Verma (1), Scott N. Levine (2), Teresa H.Y. Meng (1)
(1) Department of Electrical Engineering, Computer Systems Laboratory, Stanford University
(2) Center for Computer Research in Music and Acoustics (CCRMA), Stanford University

Abstract

We propose a flexible analysis/synthesis model for transient signals that effectively extends the Spectral Modeling Synthesis (SMS) parameterization of signals from sinusoids+noise to sinusoids+transients+noise. The explicit handling of transients provides a more realistic and robust signal model. The model presented is a parametric model for transients that allows for a wide range of signal transformations. In addition to modeling, a transient detection scheme is also presented.

1 Introduction

Transient Modeling Synthesis (TMS) is a flexible analysis/synthesis tool for transient signals. TMS is the frequency domain dual to sinusoidal modeling. Sinusoidal modeling and Spectral Modeling Synthesis (SMS) [1, 2] have enjoyed a rich history in both speech and audio. SMS is a flexible signal model that consists of sines and noise. While SMS provides a representation for sinusoidal signals that allows a wide range of transformations, TMS provides a representation for transient signals that allows a wide range of transformations. TMS combined with SMS effectively extends the sines+noise model to sines+transients+noise. The explicit handling of transients provides a more robust signal model and is essential for synthesizing realistic attacks of many instruments.

Although there has been work on explicit handling of transients within the SMS framework [3, 4], these methods are not flexible in their representation of transients. TMS not only allows explicit handling of transients, but allows manipulation of the model parameters and thus maintains the spirit of SMS as a flexible signal representation.
The first section of the paper describes the framework of TMS. The second describes a transient detection scheme that allows TMS to work more effectively. The final section gives an analysis/synthesis example.

2 The TMS Framework

An explicit transient model is motivated because transients do not fit well into the SMS framework. SMS is a parametric modeling tool that consists of two parts: sinusoidal modeling and noise modeling. The analysis portion of sinusoidal modeling finds well developed sinusoids by tracking spectral peaks over time. It finds the sinusoidal components in a signal by using short-time Fourier analysis and tracking meaningful peaks from frame to frame. During synthesis, these meaningful peaks, which consist of the parameter triplet {magnitude, frequency, phase}, control a bank of oscillators (additive synthesis) or can be used in an inverse Fourier Transform/overlap add scheme for signal reconstruction.

SMS furthers its decomposition by considering a residual signal. This residual signal is the difference between the original signal and the synthesized sinusoidal signal. The residual consists of components that are not well modeled by sinusoids. These components are transients and noise [3, 5]. In the SMS framework, the transient+noise residual is modeled as slowly varying filtered white noise. Transients in the residual do not fit within this model. This is a serious drawback when considering instruments with sharp attacks because transients modeled as noise become smeared in time and the attack is lost.

As suggested by others [3, 6], transients need to be considered separately from noise. Others have done this by removing transient areas from the residual, performing noise analysis, then adding the transients back into the signal. Although this method works, it has a few drawbacks. First, it lacks flexibility in representing transients. Representing transients as PCM samples is far from the flexible representation goal of SMS.
Secondly, many instruments have an underlying noise, the breathiness of a flute for example, that is neither sinusoidal nor transient. When removing transients in the fashion stated, both transients and noise are removed. It is desirable to model transients separately but leave noise to the noise model. These needs motivated TMS.

Figure 1: (a) Analysis block diagram. (b) Synthesis block diagram.

The system block diagram for the combination of TMS and SMS is given in figure 1. We use TMS on the first residual, r1, which consists of noise and transients. TMS optionally first detects where transients occur. It then fits a parametric model to the transients. The transients are synthesized and subtracted from the first residual, r1, to create a second residual, r2. This second residual consists primarily of slowly varying white noise. Thus attacks of instruments are well preserved and the underlying noise of an instrument is left for the noise model.

Since sinusoidal modeling tracks well developed sines, it cannot track transients, which are time-limited, pulse-like signals. However, pulse-like signals in one domain can be periodic in another domain. The basic idea underlying TMS is the duality between time and frequency. TMS is the frequency domain dual to sinusoidal modeling. While the analysis portion of sinusoidal modeling finds sinusoids by tracking the well developed spectral peaks of a time domain signal, TMS finds transients by tracking the well developed spectral peaks of a frequency domain signal. That is, we first map segments of the time domain signal into the frequency domain. This causes transients in the time domain to become periodic in the frequency domain. We then perform sinusoidal modeling on this frequency domain signal.
The well developed spectral peaks found from analyzing the frequency domain signal represent well developed transients in the original time domain signal. The block length of the time to frequency domain mapping must be sufficient to make transients compact entities within the block. A block size of about one second is sufficient. The mapping from the time domain to frequency domain is chosen so that transients in the time domain become sinusoidal in the frequency domain. The Discrete Cosine Transform (DCT) provides such a mapping. It is defined as:

\[
C(k) = \alpha(k) \sum_{n=0}^{N-1} x(n) \cos\left[\frac{(2n+1)k\pi}{2N}\right] \quad \text{for } n, k \in \{0, 1, \ldots, N-1\},
\qquad
\alpha(k) =
\begin{cases}
\sqrt{1/N} & \text{for } k = 0 \\
\sqrt{2/N} & \text{for } k \neq 0
\end{cases}
\]
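As a sanity check on this definition, the following minimal Python sketch transcribes it directly. It assumes the standard orthonormal DCT-II scaling, with α(0) = sqrt(1/N) and α(k) = sqrt(2/N) otherwise. The function names `alpha`, `dct` and `idct` and the test vector are illustrative and are not taken from the paper.

```python
import math

def alpha(k, N):
    # Scaling factor of the orthonormal DCT-II (assumed form of alpha(k)).
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def dct(x):
    # C(k) = alpha(k) * sum_n x(n) * cos((2n + 1) * k * pi / (2N))
    N = len(x)
    return [
        alpha(k, N) * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                          for n in range(N))
        for k in range(N)
    ]

def idct(C):
    # Inverse of the orthonormal DCT-II: the transpose of the same cosine matrix.
    N = len(C)
    return [
        sum(alpha(k, N) * C[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
            for k in range(N))
        for n in range(N)
    ]

# Arbitrary short test block (illustrative values only).
x = [0.0, 1.0, -0.5, 0.25, 0.0, 0.0, 2.0, -1.0]
C = dct(x)
x_rec = idct(C)

# With this scaling the transform is orthonormal, so energy is preserved.
energy_time = sum(v * v for v in x)
energy_dct = sum(v * v for v in C)
```

With this scaling the forward and inverse transforms share one cosine matrix, so a block can be moved into the DCT domain and back without loss. This direct O(N^2) form is only practical for short blocks; a fast DCT routine would be used for one-second blocks of audio.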

Therefore if x(n) is a Kronecker delta, then C(k) is a cosine whose frequency depends on the time location of the impulse. Roughly speaking, an impulse that occurs toward the beginning of the frame results in a DCT domain signal that is a relatively low frequency cosine. If the impulse occurs toward the end of the frame, then the DCT of the signal is a relatively high frequency cosine. For signals commonly encountered, attacks are rarely a simple Kronecker delta, but are usually groups of closely spaced samples. This results in a DCT that consists of many closely spaced sinusoids or a sinusoid that varies slowly over the domain of definition of the DCT. This type of periodic signal is exactly what sinusoidal modeling works well on. Thus we obtain a parametric model of transients by performing sinusoidal modeling on the DCT of blocks of the time domain signal.

In order to perform sinusoidal modeling on the DCT of a section of a signal, we must take overlapping discrete Fourier transforms (DFTs) of the DCT domain signal. This combination of operations, DCT then DFT, brings the signal back into a time-like domain. Although this may seem redundant, these operations rotate the signal (unitary transforms simply rotate vector spaces) in such a way as to make transients readily apparent. When performing short-time Fourier analysis in the DCT domain, the window size for the spectral analysis must be shorter than the DCT size, but adequate zero-padding or other frequency interpolation methods, such as parabolic interpolation [3], must be used to avoid quantization of time.

Each spectral peak found in the DCT domain is a triplet of information, as in the case of sinusoidal modeling, of the form {magnitude, frequency, phase}. This frequency parameter, however, is actually time domain information. The frequency here corresponds to where a transient occurs. This dictates how much frequency resolution (actually time resolution) is required when performing TMS.
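The correspondence between DCT-domain frequency and transient location can be illustrated with a small sketch. Assuming the orthonormal DCT-II, an impulse at sample n0 in a block of length N gives C(k) proportional to cos(π(2·n0 + 1)k / (2N)), a cosine of (2·n0 + 1)/(4N) cycles per DCT bin, so a spectral peak at frequency f maps back to n0 = 2·N·f − 0.5. The code below uses a single Hann-windowed, zero-padded DFT over the whole DCT block as a simplification of the short-time analysis described above; the helper names and the parameters `pad_factor`, `N` and `n0` are hypothetical.

```python
import cmath
import math

def dct(x):
    # Orthonormal DCT-II (assumed definition).
    N = len(x)
    out = []
    for k in range(N):
        a = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(a * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                           for n in range(N)))
    return out

def peak_frequency(c, pad_factor=8):
    # Frequency (cycles per DCT bin) of the largest peak of the
    # Hann-windowed DFT of c, zero-padded to pad_factor * len(c) points.
    N = len(c)
    P = pad_factor * N
    w = [0.5 - 0.5 * math.cos(2 * math.pi * k / (N - 1)) for k in range(N)]
    best_bin, best_mag = 0, -1.0
    for m in range(1, P // 2 + 1):
        acc = sum(w[k] * c[k] * cmath.exp(-2j * math.pi * m * k / P)
                  for k in range(N))
        if abs(acc) > best_mag:
            best_bin, best_mag = m, abs(acc)
    return best_bin / P

def locate_impulse(x):
    # Map the DCT-domain peak frequency back to a time index in the block.
    N = len(x)
    f = peak_frequency(dct(x))
    return 2 * N * f - 0.5

N = 64
n0 = 20
x = [0.0] * N
x[n0] = 1.0                      # Kronecker delta at n0
f_peak = peak_frequency(dct(x))
n0_est = locate_impulse(x)

x_late = [0.0] * N
x_late[50] = 1.0                 # a later impulse gives a higher frequency
f_late = peak_frequency(dct(x_late))
```

In this sketch adjacent sample positions differ by only 1/(2N) cycles per bin, so the zero-padded DFT must be longer than 2N points before neighboring onsets fall into distinct bins.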
Since frequency corresponds to a time location, we must guarantee that the amount of frequency resolution is greater than the number of time samples used in computing the block DCT in order to avoid quantization of transients.

3 Transient Detection

If we know where possible transients occur, we can restrict TMS to model transients only in those areas. This is done by keeping only those frequencies in the TMS domain that occur in possible transient areas. This allows TMS to run more efficiently because peaks that are clearly not in transient areas are not given to the spectral tracking algorithm. This step is optional because TMS run without restrictions, but with proper control parameters, can itself reliably find onsets. We describe here, however, a simple transient detection scheme based on energy in the synthesized sinusoids, denoted s, and the first order residual, r1. Qualitatively, possible transients occur when the sinusoidal model breaks down and the energy in the first order residual increases rapidly. To quantitatively measure this, we look at energy in s and r1 over the entire DCT block and over a smaller short-term sliding window. Let

\[
E_x = \sum_{n=0}^{N-1} |x(n)|^2
\]

denote the energy of a signal x over the DCT block, where N is the length of the DCT. Then $E_s$ and $E_{r_1}$ denote the energy in s and r1, respectively, over the DCT block length. Now define the energy in the sliding window as:

\[
e_x(k) = \sum_{n=k-L/2}^{k+L/2} |x(n)|^2 \quad \text{for } k = a, 2a, \ldots \qquad (1)
\]

where L is the length of the sliding window, a is the hop size, and x is properly defined, e.g. zero padded, outside of the region n = 0, 1, ..., N-1 so that equation 1 makes sense at the edges of the DCT block. A possible transient occurs when the ratio of normalized short-term energy of r1 and s is larger than some threshold. Specifically, when

\[
\frac{e_{r_1}(k)/E_{r_1}}{e_s(k)/E_s} > \text{THRESHOLD} \qquad (2)
\]

possible transient areas are noted and frequencies in the TMS domain outside of these areas can be discarded.
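The detection rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the window is taken as L/2 samples either side of k, samples outside the block are treated as zero, and the synthetic signals, window length, hop size and threshold are invented for the example. The comparison is written in cross-multiplied form, which matches the ratio test whenever the energies are positive and avoids division by zero in silent regions.

```python
import math

def block_energy(x):
    # E_x: energy of x over the whole DCT block.
    return sum(v * v for v in x)

def window_energy(x, k, L):
    # e_x(k): energy in a window of L/2 samples either side of k.
    # Samples outside the block count as zero (zero padding).
    half = L // 2
    lo = max(0, k - half)
    hi = min(len(x) - 1, k + half)
    return sum(x[n] * x[n] for n in range(lo, hi + 1))

def detect_transients(s, r1, L, hop, threshold):
    # Flag window centers k = hop, 2*hop, ... where
    # (e_r1(k) / E_r1) / (e_s(k) / E_s) > threshold.
    E_s, E_r1 = block_energy(s), block_energy(r1)
    flagged = []
    for k in range(hop, len(s), hop):
        e_s = window_energy(s, k, L)
        e_r1 = window_energy(r1, k, L)
        if e_r1 * E_s > threshold * e_s * E_r1:  # cross-multiplied ratio test
            flagged.append(k)
    return flagged

# Synthetic example (all values are illustrative assumptions).
N = 1024
s = [math.sin(2 * math.pi * 0.05 * n) for n in range(N)]   # steady sinusoid
r1 = [0.01 * math.sin(1.7 * n) for n in range(N)]          # low-level residual
for n in range(400, 416):
    r1[n] += 0.5 * (-1) ** n                               # short burst

flagged = detect_transients(s, r1, L=64, hop=32, threshold=2.0)
```

Here only the windows centered at samples 384 and 416, both of which contain the burst at samples 400 to 415, exceed the threshold. Peaks in the TMS domain that map to times outside such flagged regions could then be discarded before spectral tracking.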
4 Examples

As an example, we show the sines+transients+noise analysis on a xylophone hit, the results of which are shown in figure 2. The xylophone, although inharmonic, has a perceived pitch which is modeled well by the sine portion of the representation. Figure 2(a) is a plot of the original signal sampled at 44.1 kHz, while figure 2(b) shows the synthesized sinusoids. Figure 2(c) is the first residual, r1, which shows the sharp attack of the sound as well as some underlying noise. The attack, as modeled by TMS, is shown in figure 2(d). Figure 2(e) shows the second residual, r2, which is the part of the original signal that is not well modeled by sines or transients. This is slowly varying noise. If the first residual signal were passed to the noise model without TMS, the attack would be smeared and the characteristic 'knock' of the xylophone would be lost. The summation of the sines+transients+noise portions yields a signal that is perceptually indistinguishable from the original.

Figure 2: (a) Original xylophone. (b) Synthesized sinusoids. (c) First residual containing transients+noise. (d) Synthesized transients. (e) Second residual containing noise.

5 Conclusions

There are many benefits to using TMS. First, synthesized attacks of instruments are well preserved while maintaining a flexible model of these attacks. Combining TMS with SMS allows modeling of a wide range of sounds while allowing the synthesized versions to be perceptually identical to the original. In addition, because noise models assume residual signals which consist of slowly varying noise, using TMS to remove transients allows the models to work more effectively. Finally, because TMS has the same flexibility as SMS, a large number of transformations are possible on the analyzed signal. In addition, these transformations will be more robust because we have an explicit parameterization for sines, transients and noise. For example, when time stretching a signal, it is desirable for transients to move to their proper onset locations but remain localized, while the harmonics and noise stretch. By using TMS combined with SMS, these types of transformations are possible. Many other signal transformations are a subject of current research.

References

[1] Xavier Serra and Julius O. Smith, "Spectral modeling synthesis: a sound analysis/synthesis system based on a deterministic plus stochastic decomposition", Computer Music Journal, vol. 14, no. 4, pp. 14-24, Winter 1990.
[2] Robert J. McAulay and Thomas F. Quatieri, "Speech analysis/synthesis based on a sinusoidal speech model", IEEE Transactions on Acoustics, Speech and Signal Processing, pp. 744-754, 1986.
[3] Xavier Serra, A System For Sound Analysis/Transformation/Synthesis Based on a Deterministic Plus Stochastic Decomposition, PhD thesis, Stanford University, 1989.
[4] K. N. Hamdy, M. Ali, and A. H. Tewfik, "Low bit rate high quality audio coding with combined harmonic and wavelet representations", Proceedings of ICASSP-96, vol. 2, pp. 1045-1048, May 1996.
[5] E. Bryan George and Mark J. T. Smith, "Analysis-by-synthesis/overlap-add sinusoidal modeling applied to the analysis and synthesis of musical tones", J. Audio Engineering Society, vol. 40, no. 6, pp. 497-515, June 1992.
[6] Michael Goodwin, "Residual modeling in music analysis/synthesis", Proceedings of ICASSP-96, vol. 2, pp. 1005-1008, May 1996.