Page  00000001 Transients modelling by pruned wavelet trees Laurent DAUDET Department of Electronic Engineering Queen Mary, University of London Mile End Road, London El 4NS Abstract - This paper presents an original way to model sound transients, based on stuctures of wavelet coefficients in the time-scale domain. The method allows both the extraction of transients from complex signals and to their parametrization in a compact way. This representation can be relevant in numerous applications, such as audio compression, sound effects, and feature detection (e.g. rythmic pattern detection). 1 Introduction The notion of musical transients, although broadly used by musicians and audio engineers, lacks a clear definition from a signal processing point of view. Transients are often understood as "attacks", or note onsets; and many articles discuss work related to the detection of such abrupt changes[MB96, Kla99], mainly for segmentation purposes. Recently [Rod97], much attention has also been put on transient modelling in the context of analysis/synthesis. Indeed, the broadly used modelling of audio signal by sinusoidal models [MQ98, Ser89] fails to represent sharp, well-localized signals. The residual of such models cannot be well modelized by a stochastic (locally) stationary signal [Goo96] as it includes strong singularities, which can be clearly perceived by the human ear. Therefore, current schemes [VM98, VM99] often include some kind of modelling for transient features. With a somewhat different approach, here we are interested in a simultaneous detection and modelling of audio transients. Indeed, this is part of a more general project [DauOO] of modelling audio signals as a sum of three "layers": a tonal part, transients and a residual that can generally be modelized as a shaped Gaussian noise (note that this is not a segmentation algorithm, as these three components are defined at any time). Starting from the timescale structure of the residual from sinusoidal models, we derive a characterization of transients as structures across scales in the transform (here wavelets) domain. Transients are here associated with "branches" of wavelet coefficients. The branches are selected in two steps: first a singularity detection selects a set of admissible "full branches" (ie. from the root of the tree to the smallest scales) branches; and then these branches are pruned out in a top-down approach according to the value of individual coefficients. Our article is structured as follows: after a brief review of definitions and notations (sec. 2), we will describe the heuristics leading to our definition of transients (sec. 3). The associated transient detection / estimation pro cedure is then performed in two steps (sec. 4): admissible branches are first selected, and later pruned by a topdown quantization scheme. Results are discussed (sec. 5) in terms of the stationarity of the remaining residual, and we conclude our paper by an overview on ongoing research (sec. 6). 2 Dyadic wavelet definitions and transform: notations (b) FIG. 1: Two trees of dyadic wavelet coefficients. (a) Full tree (b) Incomplete connected tree By its ability to zoom down across scales, the wavelet transform is a natural tool for the analysis and representation of singularities [MH92]. Its discrete version, called the Dyadic Wavelet Transform (DWT), can be seen as a sampling of the time-scale plane of wavelet coefficients on the so-called dyadic grid, at scales 23, j = 1, 2,... J, and,

Page  00000002 at a given scale j, at times 2jk. If ~b is a (discrete) wavelet, we define Q/'p~t) - 2i/2&b(21t - k). The wavelet coefficients of a signal f e L2(L) are the scalar products: d~k(f4'J),j=1, 2,... J Wavelet coefficients are naturally organized as dyadic trees (fig. 1 (a)): each coefficient dJk at scale j, j 2... J has two children d>1l,2k and dftl1,2k~1 at scale j - 1. In this paper, we will also consider incomplete dyadic trees (fig. 1 (b)): these are full trees that have been pruned from the top. Note that all branches are connected, which means that for every node of the tree its parent node also belongs to the tree. The extremity of branches are called leaves, and the coefficient at the largest scale is called the root. 3 Audio transients and time-scale represent at ions In order to get a somewhat intuitive understanding of our approach, let us first examine (fig. 2) a time-scale representation (by means of a dyadic wavelet transform) of a musical signal whose tonal part has been extracted by some standard technique such as the sinusoidal model or the smoothed local cosine transform. As we want to detect events with the best possible time localization, we use Daubechies wavelets with a small number of vanishing moments (1 or 2). scales. Indeed, given a threshold ij, when a node is found significant (ie. its absolute value is larger than 17) there is a high probability that its ancestors are also significant. This reminds us of tree-structured image coding techniques, such as the EZW coder [Sha93]. We are now able to give a precise definition of transients, that will be used for a simultaneous detection and modelling: a transient is described by a structure o~f large dyadic wavelet coefficients, assembled as an incomplete connected tree. 4 W~avelet trees construction We now have to design a construction technique for the (incomplete) trees that are to be considered as representing transients. This is be done in two steps: an admissibility stage, where full branches with a high correlation of large coefficients are selected; a pruning stage, where insignificant leaves are pruned out, in a top-bottom approach. 4.1 Branches selection We first select the set of full branches (ie. from the root to the smallest scale) where large coefficients are located. This is done by studying the following "modulus of regularity": (j,k)CB where 23 represents the set of coefficients on a given full b ranch. The parameter s allow us to change the features to be detected. As we don't want to emphasize large or small scales, a good choice for s is s 0. It also can be noted that the coefficients IdJ,k can be summed with other power values. The top plot of fig. 3 shows the variations of K~ for the full branches corresponding to 4096 samples of the previous signal. A simple thresholding (with threshold value 6) is then applied to select the full tree (as seen in the middle plot of fig. 3). In this short example containing a strong transient, the full tree covers about 18% of the total number of DWT~ coefficients. 4.2 Tree pruning We now proceed to the second stage of our tree estimation, which consists of pruning the selected full branches. This is done by a top-down approach, ie. from the top scale to the root of the tree. At each scale, we prune out the leaves that are smaller (in absolute value) than a given threshold 17. Note that only leaves can be pruned, so that the resulting tree will always still be connected. The bottom plot of fig. 3 shows the result of pruning the tree drawn on the middle plot. Here, we only keep about 7%o of the total number of DWT coefficients (ie. there is coefficients x 10 S I' 'I ' I'III 1111 i i ' 1''1 II, 2 iiI I ' I 'I II ' 2 C' I' I I I. I I3 I FIG. 2: Set of DWT coefficients. (Top) Original signal: 0.74s of jazz music whose tonal part has been removed. (Bottom) Value of DWT coefficients at scales 2 to 212. On the bottom plot of fig. 2, one can easily see that the position of large coefficients are highly correlated across

Page  00000003 FIG. 3: (Top) Variations of kappa for 4096 samples of data, extracted from original signal (top plot of Fig. 2) (Middle) Corresponding full selected branches (Bottom) Corresponding pruned tree. a 3:1 reduction in the number of retained coefficients here as compared to the case of full branches). In a data compression context, a suitable value for the threshold 17 can be taken in the same order of magnitude as the quantization step used to represent the DWT coefficients with a finite precision. 5 Results Results for the same portion of sound can be seen on fig.4. The top plot is the original signal. The middle plot represents the transients reconstructed after the pruned tree previously constructed (fig. 3 bottom). One can see that the features with a precise time localization are mostly fetched in the transient part. The rythmic patterns appear very clearly, as well as other "texture" features that cannot be well modelled by the tonal anaysis. The bottom plot is the residual, ie. the original signal minus the transient part. This signal now exhibits a much more stationary behavior (on time scales of 1024-samples windows), and thus can be modelled with greater accuracy as a (slowly-varying) filtered white noise. Similar results have been plotted on a much larger time scale (fig. 5), and one can see that although the greatest part of transients have been detected, there remains some in the residual. This is due to the fact that the threshold 6 that determines the critical value for the K function is constant, therefore low-amplitude transients are not detected. As we are concerned about audio compression performances, let us approximate the amount of data needed to encode the transient information. For the whole se rq FIG. 4: (Top) Original signal (same dataset as Fig. 3) (Middle) Extracted transients (Bottom) Residual quence, about 2% of the DWT coefficients are retained for the transients, and their value can be quantized on 10 bits without audible distortion. One of the great advantages of using tree structures is the low cost of encoding the tree (ie. the addresses of the retained coefficients): 2 bits for each node, O bits for leaves. Therefore, the transient information can be represented using only.24 bits / sample on average. This little extra bit of data leads to a great improvement in the perceived sound quality if we later model the residual as a (slowly-varying) filtered white noise. 6 Conclusion and perspectives Is this paper we have presented a new characterization of "transients" that appear in musical signals, which are here represented as pruned trees of wavelet coefficients containing large coefficients. This method allows us to extract such transients (that appear as residual of a sinusoidal modelling of audio), and to model them with a very small amount of data (typically less than.25 bit / sample). Furthermore, this method can be used for other applications, such as time stretching, transient enhancements, or used as input data for indexation methods. Future direction for research include finding a right set of (possibly time-varying) thresholding parameters 6 and p, which greatly influence the shape and size of the selected trees. We have also worked on a stochastic estimation of the trees [DMT01] using Hidden Markov Models, and comparisons with the deterministic method presented here are in progress.

Page  00000004 I il I u I Ijm i I m I i III 1_1_ u ma.m!mIII mI I JL W i 0 1 m _mII I I M _a IN K! m! imn Al _n m!__ m I iN I FIG. 5: Transient separation for about 6 s of sound (Top) Original signal (Middle) Transients (Bottom) Residual Acknowledgements This research has been supported by a Marie Curie Fellowship of the European Community programme Human Potential under contract number HPMF-CT-2000-00917. References [Dau00] L. Daudet. Representations structurelles de signaux audiophoniques. Methodes hybrides pour des applications a la compression. PhD thesis, Universite de Provence, 2000. [DMT01] L. Daudet, S. Molla, and B. Torr6sani. Transient detection and encoding using wavelet coefficient trees. In Proc. 18th Symposium GRETSI'01 on Signal and Image Processing, Toulouse, 2001. [Goo96] M. Goodwin. Residual modeling in music analysis-synthesis. In Proc. IEEE ICASSP, pages 1005-1008, Atlanta, GA, 1996. [Kla99] A. Klapuri. Sound onset detection by applying psychoacoustic knowledge. In Proc. of the International Conference on Acoustics, Speech and Signal Processing, Phoenix, 1999. [MB96] P. Masri and A. Bateman. Improved modelling of attack transients in music analysisresynthesis. In Proc. of the International Computer Music Conference, 1996. [MH92] S. Mallat and W. L. Hwang. Singularity detection and processing with wavelets. IEEE Trans. Inf. Th, 38:617-643, 1992. [MQ98] R. McAulay and T. Quatieri. Computationally efficient sine-wave synthesis and its application to sinusoidal transform coding. In Proc. Int. Conf. Acoustics, Speech and Signal Processing, 1998. [Rod97] X. Rodet. Musical sound signal analysis/synthesis: Sinusoidal+residual and elementary waveform models. In Proceedings of the IEEE Time-Frequency and Time-Scale Workshop 97, Coventry, UK, 1997. [Ser89] X. Serra. A System for Sound Analysis/Transformation/Synthesis based on a Deterministic plus Stochastic Decomposition. PhD thesis, Stanford University, 1989. [Sha93] J. M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Trans. Signal Processing, 41(12):3445-3462, 1993. [VM98] T. Verma and T. Meng. An analysis/synthesis tool for transient signals that allows a flexible sines+transients+noise model for audio. In Proc. Int. Conf. Acoustics, Speech and Signal Processing, 1998. [VM99] T. Verma and T. Meng. Sinusoidal modeling using frame-based perceptually weighted matching pursuits. In Proc. Int. Conf. Acoustics, Speech and Signal Processing, 1999.