ANALYSIS-SYNTHESIS MODEL FOR TRANSIENT IMPACT SOUNDS BY STATIONARY WAVELET TRANSFORM AND SINGULAR VALUE DECOMPOSITION

Wasim Ahmad*, Hüseyin Hacıhabiboğlu†, Ahmet M. Kondoz*

*Centre for Communication Systems Research (CCSR), University of Surrey, Guildford, UK
†Centre for Digital Signal Processing Research (CDSPR), King's College London, London, UK

ABSTRACT

We encounter a wide range of sounds every day, which are harmonic, transient, or a mixture of both. A number of models have been developed in recent years to synthesise harmonic sounds, but these models do not perform effectively on transient sounds because of their sharp attack and decay. This paper presents a new technique for analysis-based synthesis of transient sounds by adding weighted orthogonal bases of frequency bands. The proposed technique consists of four stages. First, the sound samples taken from a group of sounds are decomposed into frequency bands using the stationary wavelet transform (SWT). A set of orthonormal bases for each frequency band is then computed using singular value decomposition (SVD). The model parameters for all sounds present in the group are derived from the orthogonal bases and their weights. Finally, the required sound is synthesised by adding the weighted orthogonal bases for each frequency band and then taking the inverse stationary wavelet transform (ISWT). The proposed technique provides a generic approach to synthesising a diverse set of transient sounds with a single analysis-synthesis model.

1. INTRODUCTION

Our environment is full of diverse types of sounds such as cracking, hitting, collision, bumping, breaking, dripping, and car engine sounds. Most of these are transient impact sounds, with energy concentrated in a short time window. Such transient impact sounds are generally produced when two or more objects interact with each other. The resulting sounds depend on the physical properties of the vibrating objects, such as shape, size, elasticity, and material, and on the type of interaction, such as position, force, and velocity. Therefore, upon hearing an impact sound, listeners can identify the general properties of the objects that generated it. Listening to everyday sounds is not just listening to these sounds per se; rather, it is an experience of hearing events in the world [1].

During the last two decades a number of synthesis algorithms have been developed and applied, mostly to synthesise musical instrument sounds [2]. Some of these synthesis algorithms were adapted to synthesise everyday impact sounds, but the results were not effective or satisfactory [3], as the perceptual dimensions and attributes of everyday listening are different from those of musical listening [1]. Therefore, new synthesis algorithms have been introduced in recent years to synthesise everyday impact sounds.

1.1. Related work

Physical models were initially introduced by the computer music community to simulate existing musical instruments, but the technique was subsequently used for sound rendering in interactive applications. Physical models represent the relationship between a sound and the physical characteristics of the structure that produces it. Van den Doel et al. [4] presented a framework to synthesise contact sounds using modal synthesis, which describes the acoustic properties of the vibrating objects. Van den Doel [5] also developed a physically-based sound synthesis model for liquid sounds.
Van den Doel showed that more complex sounds like streams, pouring water, rivers, and rain can also be generated from a single water-bubble model and its stochastic extensions. Rath et al. [6] used a modified form of modal synthesis to synthesise a number of impact sounds, and applied the proposed model to generate bouncing, breaking, rolling, and sliding sounds. Cook [7] introduced two algorithms, PhISAM (Physically Informed Spectral Additive Modelling) and PhISEM (Physically Informed Stochastic Event Modelling), to analyse and synthesise musical and everyday sounds. PhISAM is based on modal synthesis and is suitable for objects which are struck or plucked. PhISEM is based on pseudo-random overlapping and adding of small grains of sound. Cook used the PhISEM algorithm to analyse and synthesise footsteps, police whistle, gravel shaking, gravel crunching, and sandpaper sounds [8, 7, 3]. Peltola et al. [9] presented two physics-based synthesis algorithms and control methods to synthesise hand-clapping sounds; these methods can synthesise both a single hand-clap and the applause of a group of clappers.

Spectral models represent and imitate the properties of sound signals and parameterise sounds as perceived by the listener. Aramaki et al. [10] proposed an analysis-synthesis model based on a time-varying subtractive synthesis process which acts on a noisy input signal. This model was aimed at reproducing perceptual effects corresponding to impacted materials, without simulating physical properties. Serra [11] developed a method called spectral modeling synthesis (SMS), which can be used to synthesise the sound produced by musical instruments or by any physical system. The algorithm is based on modeling sounds as stable sinusoids (the deterministic component) plus noise (the stochastic component). The SMS model restricts the deterministic part to sinusoidal components with piecewise-linear amplitude and frequency variations. This limits the generality of the model, and transient or noisy sounds cannot be accurately modeled by the technique. Verma et al. [12] presented an algorithm, called transient modeling synthesis (TMS), to analyse and synthesise the transient part of input signals. They also proposed that TMS can be combined with spectral modeling synthesis (SMS) to model a wide range of instrumental sounds. Misra et al. [13] introduced an analysis-synthesis system with a user interface for the analysis, transformation, and synthesis of natural sounds. It used deterministic, transient, and stochastic models to analyse and represent the input sound: deterministic components were extracted using sinusoidal analysis [11], and the transient part was detected and modeled using transient modeling [12].

1.2. Motivation

The diversity of everyday sounds and sources is such that it is not feasible to generate all, or even a group of, everyday impact sounds using a single existing analysis-synthesis model. Therefore, effective and flexible analysis-synthesis models are needed for environmental transient impact sounds, which can be used to design sounds for computer music, computer games, and virtual reality (VR) applications.

Physical models are very efficient and accurate in simulating simple sounds where a simple vibrating structure is involved and a mathematical model of the interaction can be deduced. For complex structures whose vibrating mechanism cannot easily be described, physical models can become infeasible. Furthermore, the analysis process in a physical model is specific to a particular sound or musical instrument and is not applicable to any other sound source. From a perceptual point of view, the refinement of physical models is not always successful because the physical mechanisms of many environmental sounds are still not completely understood [7].

Spectral models have a broader scope than physical models because they parameterise and construct the spectrum as received at the ear. Therefore, their refinement and repurposing are easier than for physical models. On the other hand, spectral models (e.g. additive synthesis and spectral modeling synthesis (SMS)) perform very well on harmonic, stationary, and noise-free sounds but fail to model and synthesise transient, nonstationary, and noisy sounds accurately [11]. Therefore, generic synthesis models with better analysis and parameterisation techniques are required that can be applied to many, if not all, impact sounds.

In this paper, an analysis-synthesis algorithm is presented which has the potential to synthesise common transient sounds. The analysis process in the proposed model uses the stationary wavelet transform (SWT).
The SWT analyses both the frequency and time behaviour of the input signal and has the ability to analyse transient, nonstationary, and noisy sounds. The parameterisation process uses singular value decomposition (SVD), which parameterises and represents the data by identifying the similarities and differences present in it. Furthermore, the presented synthesis model is generic, robust, and can synthesise a single sound or a group of transient sounds.

1.3. Overview

A detailed introduction to the stationary wavelet transform and singular value decomposition is presented in Sections 2 and 3, respectively. In Section 4, the proposed analysis-synthesis model is presented and the different blocks of the model are explained in detail. In Section 5, a set of example sounds and their synthesis are presented; the need for an expressive synthesis model and its importance in the natural sound synthesis context are also discussed. In Section 6, the achievements and results of the paper are summarised.

2. STATIONARY WAVELET TRANSFORM

The wavelet transform (WT) decomposes an input sound signal into wavelet coefficients through a series of filtering operations. Similar to the short-time Fourier transform (STFT), where a window is translated to analyse the input signal, the WT uses a wavelet function which is not only translated but also scaled (expanded or dilated) in time. A wavelet \psi(t) is a function in L^2(\mathbb{R}) with zero average, \int \psi(t)\,dt = 0. The wavelet \psi(t) is also called the mother wavelet because it generates a family of wavelets by scaling \psi(t) by a and translating it by \tau:

\psi_{\tau,a}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-\tau}{a}\right) \qquad (1)

Usually, the mother wavelet is dilated by powers of two and translated by integers. The wavelet transform of a sound signal s(t) at time \tau and scale a can be represented as:

W_s(\tau,a) = \langle s(t), \psi_{\tau,a}(t) \rangle \qquad (2)
            = \int_{-\infty}^{\infty} s(t)\,\frac{1}{\sqrt{a}}\,\psi^{*}\!\left(\frac{t-\tau}{a}\right) dt \qquad (3)
            = (s * \bar{\psi}_a)(\tau) \qquad (4)

where (*) denotes convolution and

\bar{\psi}_a(t) = \frac{1}{\sqrt{a}}\,\psi^{*}\!\left(\frac{-t}{a}\right) \qquad (5)

The convolution in Eq. (4) computes the wavelet transform of the input signal with dilated band-pass filters. The wavelet transform splits the input signal s(t) into two sets of coefficients at each level: approximation coefficients CA_L and detail coefficients CD_L, where the subscript denotes the level of decomposition. The WT involves a decimation step at each level, which makes it a shift-variant transform: the wavelet transform of a signal and that of a time-shifted version of the same signal are significantly different. This lack of shift-invariance is a well-known disadvantage of the classical WT. There are real-valued extensions of the standard wavelet transform [14, 15, 16] which are shift-invariant and have been used for different purposes.

To overcome the shift-variance problem, the stationary wavelet transform (SWT) is used to decompose the input signal into frequency bands. The SWT provides a redundant, shift-invariant projection of the signal onto an orthonormal wavelet basis. Nason et al. [14] proposed a simple algorithm for the SWT that is adopted in this paper. In the classical WT, the standard high-pass and low-pass filters are applied to the input signal, producing two sequences at each level; these sequences are decimated by a factor of 2 at each level, while the filters are left unchanged. In the shift-invariant algorithm, no decimation is applied to the high-pass and low-pass filtered sequences, so the output signals have the same length as the original undecimated signal. Instead, both filters are modified at each level by interleaving their coefficients with zeros.
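As an illustration of the undecimated analysis just described, the following minimal sketch decomposes a signal with the SWT and verifies perfect reconstruction. It assumes the PyWavelets package (`pywt`); the wavelet name, signal, and decomposition level are illustrative choices, not prescribed by the paper.

```python
# Minimal SWT round-trip, assuming PyWavelets (pywt) is installed.
import numpy as np
import pywt

s = np.random.randn(1024)   # toy input; pywt.swt needs len(s) divisible by 2**level
level = 3

# Undecimated decomposition: a list [(cA_3, cD_3), (cA_2, cD_2), (cA_1, cD_1)],
# deepest level first; every band has the same length as the input signal.
coeffs = pywt.swt(s, 'db4', level=level)
assert all(cA.shape == s.shape and cD.shape == s.shape for cA, cD in coeffs)

# The inverse transform recovers the signal (up to numerical precision).
s_rec = pywt.iswt(coeffs, 'db4')
assert np.allclose(s, s_rec)
```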
3. SINGULAR VALUE DECOMPOSITION

Singular value decomposition (SVD) is a well-known statistical technique that has been widely used in signal processing and statistics [17, 18]. SVD factorises a rectangular real or complex matrix into three simpler matrices: two orthonormal matrices and a diagonal matrix. The SVD of a real-valued m x n rectangular matrix Y (where m <= n) can be written as:

Y = U \Lambda V^{T} \qquad (6)

where U = [u_1, u_2, \ldots, u_m] is an m x m orthonormal matrix (i.e. U U^{T} = I), \Lambda is an m x n matrix equal to [\mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_m\} \; 0], and V = [v_1, v_2, \ldots, v_n] is an n x n orthonormal matrix (i.e. V V^{T} = I). The diagonal elements \lambda_i are called the singular values of Y and, by convention, are ordered \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_m \geq 0. The corresponding column vectors u_i and v_i are arranged accordingly, which means the matrices U and V are ordered from most variation to least. Consequently, the first row of V^{T} has the highest variance and is called the principal component. The matrix U is the left singular matrix, V is the right singular matrix, and \Lambda is the singular value matrix.

4. SOUND SYNTHESIS MODEL

In this section, a sound synthesis model is presented which simulates impact sounds. The proposed synthesis model addresses two major issues encountered in existing models: the lack of generality and the inability to analyse transient sounds. The presented synthesis model has the capability to analyse and synthesise all kinds of everyday impact sounds or groups of impact sounds. A block diagram depicting the building blocks of the proposed sound synthesis model is shown in Fig. 1. The synthesis model is made up of three stages: analysis, parameterisation, and synthesis. The analysis part takes the input sound signals and extracts the most salient features of the sound. The parameterisation block finds the relationships between these features and represents them in a compact form. In the synthesis block, the user controls the variables and the sound parameters that are used to generate the required sound signal. The performance of the synthesis block and the quality of the synthesised sound measure the robustness of the selected features and their parameterisation. In fact, these blocks are interlinked, and the synthesis quality depends on all of them.

Figure 1. Sound synthesis model.

4.1. Analysis process

Parameterisation of the input sounds is the final representation of the sound features at the end of the analysis process. As the success of synthesis depends on the quality of these methods, the analysis part forms the core of the analysis-synthesis system. Furthermore, the analysis process is responsible for extracting the sound features that are most significant and directly related to the input sound characteristics. The proposed analysis process uses a time-scale decomposition of the input signal, the stationary wavelet transform (SWT), which makes the analysis process generic and robust. The stationary wavelet transform has the ability to perform local analysis, i.e. to analyse a localised area of a larger signal. The SWT decomposes the signal into different frequency components and then represents each component at a different scale.

The SWT extracts sound features that are salient, localised, and directly related to the input sound signal, and can therefore capture and represent the time-varying spectral properties of transient and nonstationary signals.

4.1.1. Pre-processing

In Figure 1, the input sound signals \{s_i : i = 1, \ldots, m\} form one group of input sounds. A sound s_i can be a single sound event or a sequence of the same event repeated n times; for example, it can be one clapping event or a sequence of several claps. If s_i is a sequence of the same sound event, then each sound event is segmented and its onset and offset points are labelled. The length of each segment in one group of sounds \{s_i : i = 1, \ldots, m\} should be the same. There is no constraint on the number of sound events required from each type of sound s_i: an equal or a different number of sound events can be selected from each type of sound, but the lengths of all the segments should be equal. During segmentation, all the sound segments are peak-aligned, i.e. they have the same length and the highest peak occurs at the same position. This increases the similarity between the extracted sound features and improves the parameterisation process. All the segmented sound events are put into matrix form, where each row contains the samples of one segmented sound event. The input sound is represented using the following input matrix:

X = \begin{bmatrix} s_1 \\ s_2 \\ \vdots \\ s_m \end{bmatrix} = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1n} \\ s_{21} & s_{22} & \cdots & s_{2n} \\ \vdots & \vdots & & \vdots \\ s_{m1} & s_{m2} & \cdots & s_{mn} \end{bmatrix} \qquad (7)

where m represents the number of sound samples and n the length of each sound sample.

4.1.2. Sound feature extraction using the stationary wavelet transform

To extract the wavelet coefficients, the stationary wavelet transform is applied to each row of the input matrix X, which decomposes it into the first-level approximation coefficient matrix CA_1 and the first-level detail coefficient matrix CD_1. These matrices are the same size as X, and their rows contain the wavelet coefficients of the corresponding rows of X. The first-level approximation coefficient matrix CA_1 is further decomposed using the SWT into the second-level approximation coefficient matrix CA_2 and the second-level detail coefficient matrix CD_2. This decomposition process continues up to the Lth level, as shown in Fig. 2. At the end of the Lth-level decomposition, the set of coefficient matrices \{CD_1, \ldots, CD_L, CA_L\} constitutes the sound features, containing all of the time-varying frequency information of the input sound matrix X. The matrix CA_L holds the low-frequency approximation coefficients, and the matrices \{CD_1, \ldots, CD_L\} hold the high-frequency detail coefficients of the input sound matrix. The approximation coefficients capture the slowly changing characteristics of the signal, while the detail coefficients capture its rapidly changing structure.

The choice of wavelet and decomposition level depends on the application area and the synthesis model. The sound features change from signal to signal and from application to application, so the best choice of wavelet is the one which extracts the salient sound features most successfully. The decomposition level determines how much of the time-varying frequency information of the input signal is revealed; this information is used to analyse the time-frequency patterns of each band and plays a key role in the parameterisation process. Some sound groups expose these patterns at lower decomposition levels, while others require further decomposition.

Figure 2. Lth-stage analysis tree of the SWT.
The extracted wavelet coefficient matrices \{CD_1, \ldots, CD_L, CA_L\} represent the time-frequency pattern of the input sound matrix X: the ith row of X (one sound event) is decomposed into L + 1 frequency bands, which are placed in the ith rows of the feature matrices \{CD_1, \ldots, CD_L, CA_L\}. The sound events present in the input matrix X can be different transient sounds, and the differences and similarities between these sounds are reflected in the feature matrices. The correlation and variation present in these sound features are parameterised by singular value decomposition in the next section.
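A compact sketch of the analysis stage described in Sections 4.1.1 and 4.1.2 is given below: equal-length, peak-aligned events are stacked into the matrix X of Eq. (7), and each row is decomposed with the SWT to build the feature matrices \{CD_1, \ldots, CD_L, CA_L\}. The helper names (`peak_align`, `swt_features`) are illustrative choices of ours, and the sketch again assumes PyWavelets.

```python
# Sketch of the analysis stage (Secs. 4.1.1-4.1.2); helper names are illustrative.
import numpy as np
import pywt

def peak_align(events, length):
    """Place each 1-D event in a row of length `length` (Eq. (7)), zero-padded,
    with its highest absolute peak shifted to a common position (the centre)."""
    X = np.zeros((len(events), length))
    centre = length // 2
    for i, e in enumerate(events):
        idx = np.arange(len(e)) + centre - int(np.argmax(np.abs(e)))
        keep = (idx >= 0) & (idx < length)       # clip samples falling outside
        X[i, idx[keep]] = e[keep]
    return X

def swt_features(X, wavelet='db4', level=5):
    """Return [CD_1, ..., CD_L, CA_L]: each an (m x n) matrix whose i-th row is
    one SWT band of the i-th sound event (n must be divisible by 2**level)."""
    m, n = X.shape
    CD = [np.zeros((m, n)) for _ in range(level)]
    CA = np.zeros((m, n))
    for i in range(m):
        coeffs = pywt.swt(X[i], wavelet, level=level)  # [(cA_L, cD_L), ..., (cA_1, cD_1)]
        for j, (_, cD_band) in enumerate(coeffs):
            CD[level - 1 - j][i] = cD_band             # coeffs[j] holds level L - j
        CA[i] = coeffs[0][0]                           # deepest approximation cA_L
    return CD + [CA]
```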

4.2. Parameterisation process

The sound features extracted during the analysis process contain all the characteristics and information of the input sounds. The parameterisation process represents these sound features in a very compact way, such that their similarities, differences, and interconnections with the input sound signals are preserved.

4.2.1. Sound parameters using singular value decomposition

The purpose of the singular value decomposition is to reduce the dimensionality of the dataset, identify existing patterns in the data, and highlight the similarities and differences. SVD identifies and orders the dimensions along which the input data exhibit the greatest variation. Once it is identified where the most variation lies, it is possible to find the best approximation of the original data points using fewer dimensions. The coefficient matrices \{CD_1, \ldots, CD_L, CA_L\} represent the time-frequency pattern, and within each matrix some of the rows are strongly correlated. The SVD can be used to factorise each coefficient matrix and then find the best way to represent it. Let us take any one feature matrix C \in \{CD_1, \ldots, CD_L, CA_L\} and factorise it using SVD. Expanding Eq. (6) using matrices and vectors, we obtain:

\begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{bmatrix} = \begin{bmatrix} u_1 & u_2 & \cdots & u_m \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & & \ddots & & & & \vdots \\ 0 & 0 & \cdots & \lambda_m & 0 & \cdots & 0 \end{bmatrix} \begin{bmatrix} v_1^{T} \\ v_2^{T} \\ \vdots \\ v_n^{T} \end{bmatrix} \qquad (8)

  = \begin{bmatrix} u_1 & u_2 & \cdots & u_m \end{bmatrix} \begin{bmatrix} \lambda_1 v_1^{T} \\ \lambda_2 v_2^{T} \\ \vdots \\ \lambda_m v_m^{T} \end{bmatrix} \qquad (9)

  = \begin{bmatrix} u_1 & u_2 & \cdots & u_m \end{bmatrix} \begin{bmatrix} \varphi_1 \\ \varphi_2 \\ \vdots \\ \varphi_m \end{bmatrix} = U\Phi \qquad (10)

where \{u_i : i = 1, \ldots, m\} are the column vectors of U, \{\varphi_j = \lambda_j v_j^{T} : j = 1, \ldots, m\} are the row vectors of \Phi = \Lambda V^{T}, and c_i is the ith coefficient vector. Eq. (10) can be expanded as:

\begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{bmatrix} = \begin{bmatrix} u_{11} & u_{21} & \cdots & u_{m1} \\ u_{12} & u_{22} & \cdots & u_{m2} \\ \vdots & \vdots & & \vdots \\ u_{1m} & u_{2m} & \cdots & u_{mm} \end{bmatrix} \begin{bmatrix} \varphi_1 \\ \varphi_2 \\ \vdots \\ \varphi_m \end{bmatrix} \qquad (11)

The rows of the matrix V^{T} are orthonormal and form a linearly independent basis which spans the input matrix C. Therefore, the matrix product \Lambda V^{T} produces a matrix \Phi whose rows are linearly independent and form an orthogonal basis spanning the input matrix C. Equation (11) shows that each row vector of the coefficient matrix can be written as a linear combination of the orthogonal bases \{\varphi_j : j = 1, \ldots, m\} and weight vectors \{u_i : i = 1, \ldots, m\}. Therefore, Eq. (10) can be written as:

c_i = \sum_{j=1}^{r} u_{ji}\,\varphi_j \qquad (12)

Equation (12) reveals that any coefficient vector can be perfectly reconstructed using the orthogonal basis matrix \Phi and the corresponding weight matrix U obtained using SVD: r = m gives perfect reconstruction of C, while r < m gives an approximation of C. Therefore, a set of orthogonal bases \Phi and weights U is calculated to parameterise each coefficient matrix, and the pair \{U, \Phi\} is the parametric representation of that coefficient matrix.

Since the singular values \lambda_i are in decreasing order and the rows of V^{T} are ordered from most variation to least in Eqs. (6) and (10), the rows of the orthogonal basis matrix \Phi are also ordered from most variation to least. This means that the first few basis vectors of \Phi contain most of the information about the sound event set. Hence, to approximate the input sound coefficient matrix, variation below a particular threshold can simply be ignored, greatly reducing the amount of data.
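The factorisation of Eqs. (6)-(12) can be written down directly with a standard SVD routine. The sketch below is a minimal illustration using NumPy; the function name and the random stand-in matrix are our own.

```python
# Sketch of the parameterisation step (Eqs. (6)-(12)).
import numpy as np

def svd_parameters(C):
    """Factorise an (m x n) coefficient matrix C (m <= n) into weights U and an
    orthogonal basis Phi = Lambda V^T, so that C = U Phi (Eq. (10))."""
    U, lam, Vt = np.linalg.svd(C, full_matrices=False)  # U: (m, m), Vt: (m, n)
    Phi = lam[:, None] * Vt   # rows phi_j = lambda_j v_j^T, most variation first
    return U, Phi

C = np.random.randn(6, 8192)             # stand-in for one feature matrix
U, Phi = svd_parameters(C)
assert np.allclose(C, U @ Phi)           # r = m: perfect reconstruction
c1_approx = U[0, :4] @ Phi[:4]           # Eq. (12) with r = 4 for row c_1
print(np.linalg.norm(C[0] - c1_approx))  # approximation error
```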
4.3. Synthesis process

The synthesis process is mainly controlled by input control variables from the user. The control variables are employed to select the best sound parameters for the synthesis process. In the presented synthesis model, any sound signal from the set of input sounds \{s_i : i = 1, \ldots, m\} can be synthesised by taking the inverse stationary wavelet transform (ISWT) of the set of approximation and detail coefficients, which are linear combinations of the orthogonal bases \Phi and their weights U (the sound parameters). For example, the synthesis of a sound signal s_k \in \{s_i : i = 1, \ldots, m\} using all the orthogonal bases (perfect reconstruction, with r = m) is given by Eq. (13), where ISWT stands for the inverse stationary wavelet transform:

s_k = \mathrm{ISWT}\!\left( CD_1^{k}, \ldots, CD_L^{k}, CA_L^{k} \right), \quad CD_l^{k} = \sum_{j=1}^{r} u_{(j,k)}^{CD_l}\,\varphi_j^{CD_l}, \quad CA_L^{k} = \sum_{j=1}^{r} u_{(j,k)}^{CA_L}\,\varphi_j^{CA_L}, \quad r = m \qquad (13)

The approximate synthesis of the sound signal s_k using the first two orthogonal bases (r = 2) is described by Eq. (14):

s_k = \mathrm{ISWT}\!\left( CD_1^{k}, \ldots, CD_L^{k}, CA_L^{k} \right), \quad CD_l^{k} = u_{(1,k)}^{CD_l}\,\varphi_1^{CD_l} + u_{(2,k)}^{CD_l}\,\varphi_2^{CD_l}, \quad CA_L^{k} = u_{(1,k)}^{CA_L}\,\varphi_1^{CA_L} + u_{(2,k)}^{CA_L}\,\varphi_2^{CA_L} \qquad (14)

The quality of the synthesised sound s_k is directly related to the number of bases used during the synthesis process, but the improvement is steepest for the first few bases, as they contain most of the information in the feature matrices (which are ordered from most variation to least). Therefore, the change in synthesis quality becomes imperceptible beyond a particular value of r (a threshold value), which can be found heuristically or by using subjective listening tests. During the synthesis process, the same or a different value of r can be used for each frequency band \{CD_1, \ldots, CD_L, CA_L\}.
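Combining the two previous sketches, Eqs. (13)-(14) amount to rebuilding each band from the first r basis vectors and inverting the transform. The sketch below reuses the hypothetical `svd_parameters` and band ordering introduced above; one practical detail is that `pywt.iswt` reconstructs from the deepest approximation band only, so the unused shallower approximation slots can be filled with zeros.

```python
# Sketch of the synthesis stage (Eqs. (13)-(14)); builds on the sketches above.
import numpy as np
import pywt

def synthesise(params, k, r, wavelet='db4'):
    """params: list of (U, Phi) pairs for [CD_1, ..., CD_L, CA_L].
    Returns sound event k rebuilt from the first r basis vectors per band."""
    bands = [U[k, :r] @ Phi[:r] for U, Phi in params]   # Eq. (13), truncated at r
    CD, CA_L = bands[:-1], bands[-1]
    L = len(CD)
    # pywt expects [(cA_L, cD_L), ..., (cA_1, cD_1)]; only the deepest cA is used.
    coeffs = [(CA_L if lev == L else np.zeros_like(CA_L), CD[lev - 1])
              for lev in range(L, 0, -1)]
    return pywt.iswt(coeffs, wavelet)

# e.g. params = [svd_parameters(M) for M in swt_features(X)]
# s0_perfect = synthesise(params, k=0, r=X.shape[0])    # r = m: perfect
# s0_approx  = synthesise(params, k=0, r=2)             # Eq. (14): r = 2
```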

5. SYNTHESIS EXAMPLES

The proposed synthesis model was applied to a group of common impact sounds. The group contains six different impact sounds, recorded in an acoustic booth (T60 < 100 ms) at CCSR, University of Surrey: tennis ball, taped tennis ball (a tennis ball covered with PVC insulation tape), football, and basketball bounces on a laminate floor, together with male and female clapping sounds.

5.1. Synthesis results

The sound database was recorded at a sampling rate of 44.1 kHz. A sequence of sound events was recorded for each sound source, and one sound event s_k from each source was taken for analysis. The highest peaks of all the sound signals are aligned, and their onset and offset points are labelled. The length of each sound signal present in one group should be the same; if a sound sample is not long enough, zeros are padded at the end of the signal. The length of each sound sample used here is 8192 samples. The segmented, peak-aligned, and energy-normalised sound samples are plotted in Figs. 3 and 4. These sound samples are put into a matrix X = [s_1, \ldots, s_6]^{T}, where each row represents a sound sample and the number of columns is the length of each sample.

Figure 3. The input tennis ball, taped tennis ball, football, and basketball bouncing sound samples.

Figure 4. The input male and female clap sound samples.

An extensive simulation was performed to select the best wavelet basis and decomposition level for the input sound matrix X. It was found that 'db4' at decomposition level five outperformed all the others, and it was also observed that the synthesis quality does not change when the decomposition level is increased further. Therefore, the SWT using 'db4' wavelets was applied to the sound matrix X, decomposing the signals up to the fifth level. This transform decomposed the sound matrix X into six coefficient matrices \{CD_1, \ldots, CD_5, CA_5\}, where the ith row of each of these matrices is a coefficient vector belonging to the ith row of the sound matrix X. The coefficient matrices were then parameterised using SVD, yielding a set of orthogonal bases \{\varphi_i : i = 1, \ldots, 6\} and the corresponding weights \{u_{ji} : j = 1, \ldots, 6;\; i = 1, \ldots, 6\} for each coefficient matrix.

During the synthesis process, any sound signal s_i \in X can be synthesised either perfectly, using Eq. (13), or approximately, using Eq. (14). For example, the tennis ball sound was synthesised from the full set of orthogonal bases, r = m = 6 (perfect reconstruction), using Eq. (13); the original and synthesised signals are shown in Fig. 5. Furthermore, the basketball and female clap sounds were synthesised both from the full set of orthogonal bases, r = m = 6 (perfect reconstruction), and from subsets of the orthogonal bases with r = 5 and r = 4 (approximation). The original and synthesised signals are plotted in Figs. 6 and 7, respectively.

Figure 5. Original and synthesised tennis ball sound using the full set of bases (perfect reconstruction).

Figure 6. Original and synthesised basketball sound using the full set (r = 6) and subsets of bases (r = 5, r = 4).

Figure 7. Original and synthesised female clap sound using the full set (r = 6) and subsets of bases (r = 5, r = 4).
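For concreteness, the experiment reported above maps onto the earlier sketches roughly as follows; all helper functions are the illustrative ones defined in Section 4, and `events` stands for the six recorded, energy-normalised sound events.

```python
# End-to-end sketch with the settings reported in this section.
X = peak_align(events, length=8192)              # 8192 is divisible by 2**5
feats = swt_features(X, wavelet='db4', level=5)  # {CD_1, ..., CD_5, CA_5}
params = [svd_parameters(M) for M in feats]
tennis_full = synthesise(params, k=0, r=6)       # full basis: perfect reconstruction
basket_r4 = synthesise(params, k=3, r=4)         # subset of basis: approximation
```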

5.2. Expressive synthesis using interpolation of weights

When an everyday sound source generates two sound events consecutively, the two may be similar, but they are not identical. For example, when a person claps twice in the same way with the same applied force, or a ball is dropped twice from the same height, the resulting sounds are similar but not identical. The presented sound synthesis model reproduces the input sound accurately: it generates a single sound event for every sound, and exactly the same sound event is generated if the synthesis of that particular sound is repeated n times. However, this is not natural, and a listener can easily perceive that the same sound has been repeated n times. To generate more natural, expressive sounds, an expressive synthesis process is presented here.

The expressive synthesis process modifies the synthesis process proposed in Section 4. The calculated sound parameters \{U, \Phi\} are used in Eq. (13) to generate any sound signal s_k \in \{s_i : i = 1, \ldots, m\}. Every time this equation is evaluated for the sound signal s_k, it uses the same set of weights to combine the orthogonal bases and generate one sound event of s_k. In the proposed expressive synthesis process, these weights are modified by adding a small random vector \epsilon, chosen such that the overall time-varying spectrum of the target sound does not change. The value of \epsilon is generated randomly on a hypersphere of radius R centred at the weight vector of the generated sound. A different \epsilon is generated for each frequency band; the size of \epsilon is the same as that of the weight vector, and the radius R is controlled by the user. Thus, Eq. (13) is modified for the expressive synthesis process as given in Eq. (15):

s_k = \mathrm{ISWT}\!\left( CD_1^{k}, \ldots, CD_L^{k}, CA_L^{k} \right), \quad CD_l^{k} = \sum_{j=1}^{r} \left( u_{(j,k)}^{CD_l} + \epsilon_{(l,j)} \right) \varphi_j^{CD_l}, \quad CA_L^{k} = \sum_{j=1}^{r} \left( u_{(j,k)}^{CA_L} + \epsilon_{(L+1,j)} \right) \varphi_j^{CA_L} \qquad (15)

When any sound signal s_k \in \{s_i : i = 1, \ldots, m\} is synthesised using Eq. (15), a different set of \epsilon is generated each time; this updates the weights and produces expressive sounds which are similar but not identical.
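A sketch of Eq. (15) follows. The paper specifies only that \epsilon lies on a hypersphere of radius R around the weight vector; drawing an isotropic Gaussian direction and scaling it onto the sphere is one plausible way to do this, and is our own choice, as are the function names.

```python
# Sketch of expressive synthesis (Eq. (15)); builds on the earlier sketches.
import numpy as np
import pywt

def eps_on_hypersphere(dim, radius, rng):
    v = rng.standard_normal(dim)            # isotropic random direction...
    return radius * v / np.linalg.norm(v)   # ...scaled onto the sphere surface

def synthesise_expressive(params, k, radius, wavelet='db4', rng=None):
    """Perturb sound k's weight vector independently in every band, then ISWT."""
    rng = rng if rng is not None else np.random.default_rng()
    bands = []
    for U, Phi in params:                    # one epsilon per frequency band
        eps = eps_on_hypersphere(U.shape[1], radius, rng)
        bands.append((U[k] + eps) @ Phi)     # Eq. (15) with r = m
    CD, CA_L = bands[:-1], bands[-1]
    L = len(CD)
    coeffs = [(CA_L if lev == L else np.zeros_like(CA_L), CD[lev - 1])
              for lev in range(L, 0, -1)]
    return pywt.iswt(coeffs, wavelet)

# Each call yields a similar but non-identical event; radius controls variation:
# clap_variant = synthesise_expressive(params, k=4, radius=0.05)
```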
The expressive sound synthesis model of Eq. (15) was used to synthesise the taped tennis ball and football sounds. The originals and three expressive sound samples of each are plotted in Figs. 8 and 9. It may be observed that the sound events synthesised using the expressive model are not identical to the originals; listeners gave the same feedback when these samples were played back. The sound signals are perceptually similar.

Figure 8. Original and three samples of synthesised sound of the taped tennis ball using expressive synthesis.

Figure 9. Original and three samples of synthesised sound of the football using expressive synthesis.

6. CONCLUSIONS

A large proportion of existing musical synthesis models are not suitable for everyday impact sounds, as their analysis and parameterisation processes are specific to harmonic sounds. We addressed both the analysis and parameterisation processes and proposed a model which is more general and which can synthesise common transient sounds effectively. We used the SWT in the analysis process, which captures the time-varying frequency content of nonstationary and transient signals better than the STFT. SVD is used to parameterise the sound features, representing each sound feature matrix as a linear combination of orthogonal bases. An expressive synthesis process is also presented, which generates more natural sounds.

The proposed model was evaluated using a set of six everyday transient sounds. The simulation results showed that perfect reconstruction of the input sound is possible using the full set of orthogonal bases. The synthesis results showed that the proposed model can generate many everyday sounds accurately, and the sounds synthesised using a subset of the orthogonal bases are also very similar to the originals and perceptually convincing. An expressive synthesis model was presented which generates sounds that are more natural and convincing. This enhances the realism of the generated sounds and has potential applications in computer music, computer games, and virtual reality.

In future work, we want to investigate the expressive synthesis model further and analyse the distribution of weight vectors in real-life sound events and their possible statistical or mathematical modeling. The quality of the synthesis models and of the synthesised sounds will be evaluated using subjective tests.

7. ACKNOWLEDGMENTS

This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) under Research Grant GR/572320/01, Portfolio Partnership Award in Integrated Electronics.

8. REFERENCES

[1] W. W. Gaver, "What in the world do we hear? An ecological approach to auditory event perception," Ecological Psychology, vol. 5, no. 1, pp. 1-29, 1993.

[2] C. Roads, The Computer Music Tutorial. The MIT Press, Cambridge, Massachusetts, USA, 1995.

[3] P. R. Cook, "Toward physically informed parametric synthesis of sound effects," invited keynote address, in Proc. IEEE WASPAA, New Paltz, New York, pp. 1-5, October 1999.

[4] K. van den Doel, P. G. Kry, and D. K. Pai, "FoleyAutomatic: Physically-based sound effects for interactive simulation and animation," in Proc. ACM SIGGRAPH 2001, Los Angeles, CA, pp. 537-544, August 2001.

[5] K. van den Doel, "Physically-based models for liquid sounds," ACM Transactions on Applied Perception, vol. 2, no. 4, pp. 534-546, October 2005.

[6] M. Rath, F. Avanzini, and D. Rocchesso, "Physically based real-time modeling of contact sounds," in Proc. ICMC, Göteborg, 2002.

[7] P. R. Cook, "Physically informed sonic modeling (PhISM): Synthesis of percussive sounds," Computer Music Journal, vol. 21, no. 3, pp. 38-49, 1997.

[8] P. R. Cook, "Modeling Bill's gait: Analysis and parametric synthesis of walking sounds," in Proc. AES Int. Conf., Helsinki, Finland, pp. 73-78, June 2002.

[9] L. Peltola, C. Erkut, P. R. Cook, and V. Välimäki, "Synthesis of hand clapping sounds," IEEE Trans. on ASLP, vol. 15, no. 3, pp. 1021-1029, March 2007.

[10] M. Aramaki and R. Kronland-Martinet, "Analysis-synthesis of impact sounds by real-time dynamic filtering," IEEE Trans. on ASLP, vol. 14, no. 2, pp. 695-705, March 2006.

[11] X. Serra, "Musical sound modeling with sinusoids plus noise," in G. D. Poli, A. Picialli, S. T. Pope, and C. Roads, editors, Musical Signal Processing. Swets & Zeitlinger Publishers, 1997.

[12] T. S. Verma, S. N. Levine, and T. H. Meng, "Transient modeling synthesis: A flexible analysis/synthesis tool for transient signals," in Proc. ICMC, Thessaloniki, Greece, pp. 164-167, September 1997.

[13] A. Misra, P. R. Cook, and G. Wang, "Musical tapestries: Re-composing natural sounds," in Proc. ICMC, New Orleans, USA, November 2006.

[14] G. P. Nason and B. W. Silverman, "The stationary wavelet transform and some statistical applications," Lecture Notes in Statistics, vol. 103, pp. 281-299, 1995.
[15] J.-C. Pesquet, H. Krim, and H. Carfantan, "Time-invariant orthonormal wavelet representations," IEEE Transactions on Signal Processing, vol. 44, no. 8, pp. 1964-1970, August 1996.

[16] R. R. Coifman and D. L. Donoho, "Translation-invariant de-noising," Lecture Notes in Statistics, vol. 103, pp. 125-150, 1995.

[17] L. L. Scharf and D. W. Tufts, "Rank reduction for modeling stationary signals," IEEE Trans. on ASSP, vol. 35, no. 3, pp. 350-355, March 1987.

[18] S. Lee and M. H. Hayes, "Properties of the singular value decomposition for efficient data clustering," IEEE Signal Processing Letters, vol. 11, no. 11, pp. 862-866, November 2004.