Data Compression of Sinusoidal Modeling Parameters Based on Psychoacoustic Masking

Guillermo Garcia, Juan Pampin
Center for Computer Research in Music and Acoustics (CCRMA), Stanford University
{guille, juan}@ccrma.stanford.edu, http://www-ccrma.stanford.edu/

Abstract

We present a method for data compression of sinusoidal parameters for additive synthesis of sound. The partial parameters extracted by automatic analysis of a sound usually require large amounts of storage space, and are typically several times larger than the analyzed PCM files. Since not all partials present in the analysis data have the same perceptual weight, substantial data reduction can be achieved if both temporal redundancies and masking effects are taken into account. Our current research project includes the development of a software codec for sinusoidal modeling parameter files.

1 Introduction

The sinusoidal model parameters, i.e. the frequency, amplitude, and phase time-functions of each partial, are generally extracted from the signal by automatic analysis methods. The partial parameters issued from the analysis usually require large amounts of storage space, and are typically several times larger than the analyzed PCM files. This leads to problems of storage and transmission of such files, especially if we consider Internet streaming. Furthermore, large parameter files overload the additive synthesizer and limit its use in real time. Since not all partials present in the analysis data have the same perceptual weight, substantial data reduction can be achieved if both temporal redundancies and masking effects are taken into account. Data reduction is achieved in several stages:

1) Segments of partial trajectories which are consistently below the psychoacoustic masking curve are discarded.
2) Short trajectories are discarded if their average signal-to-mask ratio (SMR) is below a given threshold.
3) The frame rate is reduced during stationary regions.
4) The time resolution of parameter trajectories is reduced according to their perceptual weight, mean frequency, and frequency and amplitude smoothness.
5) The remaining trajectories are quantized according to their perceptual significance.

[Figure 1: Signal-to-Mask Ratio evaluation using a simple masking model. The plot shows amplitude (dB) versus frequency: the masker level, the left and right masking slopes, and the resulting SMR of a masked partial.]

2 Trajectory Pruning

2.1 Masking Model

A simple masking model is used to evaluate the signal-to-mask ratio (SMR) of each partial [1]. This model consists of: 1) the difference between the level of the masker and the masked threshold (called delta, typically -10 dB); 2) the masking curve slope towards lower frequencies, or left slope (typically around 27 dB/Bark); and 3) the masking curve slope towards higher frequencies, or right slope (typically around 15 dB/Bark).
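To make the model of Section 2.1 concrete, the sketch below evaluates the masked threshold and resulting SMR of every partial in one frame. It is illustrative rather than the authors' implementation: the function names, the use of Zwicker's analytic Hz-to-Bark approximation, and the exclusion of self-masking are our own assumptions.

```python
import numpy as np

def hz_to_bark(f_hz):
    # Zwicker's analytic approximation of the Hz-to-Bark mapping
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def smr_simple_model(freqs_hz, amps_db, delta=-10.0,
                     left_slope=27.0, right_slope=15.0):
    """Signal-to-mask ratio (dB) of each partial in one frame.

    Each partial acts as a masker whose masked threshold peaks `delta` dB
    below its own level and falls off linearly on the Bark scale:
    `left_slope` dB/Bark towards lower frequencies, `right_slope` dB/Bark
    towards higher frequencies.
    """
    z = hz_to_bark(freqs_hz)
    amps = np.asarray(amps_db, dtype=float)
    # dz[i, j] = Bark distance from masker i to masked partial j
    dz = z[None, :] - z[:, None]
    slope = np.where(dz < 0.0, left_slope, right_slope)
    thresh = amps[:, None] + delta - slope * np.abs(dz)
    np.fill_diagonal(thresh, -np.inf)   # assumption: a partial does not mask itself
    mask_curve = thresh.max(axis=0)     # overall masked threshold at each partial
    return amps - mask_curve

# Example: a strong partial at 1 kHz masks a much weaker neighbor at 1.1 kHz
print(smr_simple_model(np.array([1000.0, 1100.0]), np.array([80.0, 40.0])))
```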

2.2 Masking Curve Evaluation

Using a sliding window of W frames, the partial trajectories of frequency and amplitude are averaged over time within the window. Frequencies are converted to the Bark scale and amplitudes are expressed in decibels. We compute the masking curve of this "average" frame at each position of the sliding window, which is typically three to ten frames long. The SMR value of each partial in the "average" frame is stored in a file.

2.3 Pruning Partials

An SMR threshold is specified, and partials whose SMR is below the threshold are discarded.

3 Discarding Short Trajectories

Trajectories shorter than a specified minimum length are discarded if their average SMR is below a certain threshold [2]. This threshold is computed as a linear function of the length of the trajectory: the shorter the trajectory, the larger its average SMR must be for the trajectory to be kept.

4 Frame Rate Reduction

In regions where the sound is stationary, partial parameters evolve smoothly and their amplitude and frequency trajectories can be represented at a lower frame rate, allowing for significant data compression. On the other hand, the original frame rate must be kept during transients in order to preserve the fast evolution of partial parameters.

We propose two different strategies to reduce the frame rate during stationary parts. Both reduce the frame rate proportionally to a measure of parameter smoothness that we define below. The first strategy is to perform frame decimation by simply discarding frames which are considered redundant, i.e. which contain little new information, since the relatively high smoothness of the partial trajectories allows those frames to be recalculated by interpolation between their neighbors. This strategy is somewhat drastic, but has low complexity. The second strategy is to resample the frame sequence at a variable frame rate, which allows the frame rate to vary smoothly, as opposed to frame decimation, which causes frame rate jumps. The frame resampling is done by linear interpolation of the partial parameters (amplitudes in decibels, frequencies in octaves).

As we have said, a measure of parameter smoothness is necessary to control the variable frame rate. This measure is based on the error introduced if the partial were regenerated by interpolation. Furthermore, we weight this error by the SMR in order to account for the perceptual relevance of the original (i.e. non-interpolated) parameters. Therefore, if $a_p(k)$ is the amplitude (in decibels) and $f_p(k)$ the frequency (in octaves) of partial $p$ at frame $k$, $t(k)$ is the time tag of frame $k$, and $n(k)$ is the number of partials in frame $k$, the relevance of partial $p$ at frame $k$ is defined as

$$r_p(k) = \mathrm{SMR}\left( |a'_p(k) - a_p(k)| + \alpha\, |f'_p(k) - f_p(k)| \right)$$

where

$$a'_p(k) = a_p(k-1) + \tau \left[ a_p(k+1) - a_p(k-1) \right]$$
$$f'_p(k) = f_p(k-1) + \tau \left[ f_p(k+1) - f_p(k-1) \right]$$
$$\tau = \frac{t(k) - t(k-1)}{t(k+1) - t(k-1)}$$

are the amplitude and frequency at time $t(k)$ interpolated between frames $k-1$ and $k+1$, SMR is the signal-to-mask ratio, and $\alpha$ is a coefficient chosen to equalize the weights of amplitude and frequency. SMR is expressed in decibels and limited to a minimum value of zero, since if the partial is masked its relevance is null. The relevance of a frame is obtained by averaging the relevance of all partials in the frame:

$$R(k) = \frac{1}{n(k)} \sum_{p=1}^{n(k)} r_p(k)$$

If needed, the curve $R(k)$ can be low-pass filtered to smooth out discontinuities. $R(k)$ is used to calculate the new frame rate for the decimation or interpolation algorithm, rounding it to a sub-multiple of the original frame rate in the case of the decimation algorithm.
We calculate the new frame rate $F$ as a linear function of $R(k)$:

$$F = F_{\min} + \frac{R(k)}{R_{\max}} \left( F_{\max} - F_{\min} \right)$$
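The following sketch puts Section 4 together: the interpolation-error relevance $r_p(k)$, the frame relevance $R(k)$, and the linear mapping to a new frame rate. The zero clamp on negative SMR values and the rounding to a sub-multiple of the original rate in decimation mode follow the text, but the function names, array layout (dense per-frame arrays for each partial), and defaults are our assumptions.

```python
import numpy as np

def partial_relevance(a_db, f_oct, t, smr_db, alpha=1.0):
    """r_p(k) for the interior frames k = 1 .. K-2 of one partial.

    a_db, f_oct: amplitude (dB) and frequency (octave) trajectories,
    t: frame time tags, smr_db: the partial's SMR per frame,
    alpha: coefficient equalizing amplitude and frequency weights.
    """
    tau = (t[1:-1] - t[:-2]) / (t[2:] - t[:-2])
    a_i = a_db[:-2] + tau * (a_db[2:] - a_db[:-2])     # a'_p(k), linear interpolation
    f_i = f_oct[:-2] + tau * (f_oct[2:] - f_oct[:-2])  # f'_p(k)
    smr = np.maximum(smr_db[1:-1], 0.0)                # masked partial => zero relevance
    return smr * (np.abs(a_i - a_db[1:-1]) + alpha * np.abs(f_i - f_oct[1:-1]))

def frame_relevance(per_partial_r):
    """R(k): mean of r_p(k) over the partials at each frame.
    Assumes, for simplicity, that all partials span the same frames."""
    return np.mean(np.stack(per_partial_r), axis=0)

def new_frame_rate(R_k, R_max, F_min, F_max, F_orig=None):
    """Linear map of R(k) onto [F_min, F_max]; in decimation mode the
    result is rounded to a sub-multiple F_orig / D of the original rate."""
    F = F_min + (min(R_k, R_max) / R_max) * (F_max - F_min)
    if F_orig is not None:
        F = F_orig / max(1, round(F_orig / F))
    return F
```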

[Figure 2: Decimation of Frequency and Amplitude Trajectories.]

5 Trajectory Decimation

The step described above takes advantage of frame redundancy to reduce the frame rate, which obviously remains common to all partial trajectories. This has its limitations, since in many cases some partial parameters of a frame will be redundant while others will be quite relevant; reducing the frame rate will then cause audible artifacts due to the interpolation error of the relevant partials. Moreover, in some cases either the amplitude or the frequency trajectory of a partial may be relevant while the other is smooth and can be regenerated by interpolation without audible error. To take maximum advantage of this fact, we have developed an algorithm that decimates amplitude and frequency trajectories independently, i.e. different partials are decimated by independent factors, and moreover, for a given partial, the amplitude trajectory is decimated by a different factor than the frequency trajectory, as shown in Figure 2. Additionally, we wish to exploit the fact that trajectories of low mean frequency can be sampled at a lower frame rate than high mean-frequency trajectories [2].

Our algorithm works in two steps: first, the decimation factors for each amplitude and frequency trajectory at each frame are evaluated; second, the decimation is performed. Let us define the amplitude and frequency relevance of partial $p$ at frame $k$ respectively as

$$r_p^a(k) = \mathrm{SMR}\, \alpha\, |a'_p(k) - a_p(k)|$$
$$r_p^f(k) = \mathrm{SMR}\, |f'_p(k) - f_p(k)|$$

where $a'_p(k)$ and $f'_p(k)$ are as defined in the previous section. Let us also define $\bar{r}_p^a$, $\bar{r}_p^f$, and $\bar{f}_p$ as the averages of $r_p^a(k)$, $r_p^f(k)$, and the partial frequency $f_p(k)$ over a window of length $W$ around frame $k$:

$$\bar{r}_p^a = \frac{1}{W} \sum_{\kappa = k - W/2}^{k + W/2 - 1} r_p^a(\kappa), \qquad
\bar{r}_p^f = \frac{1}{W} \sum_{\kappa = k - W/2}^{k + W/2 - 1} r_p^f(\kappa), \qquad
\bar{f}_p = \frac{1}{W} \sum_{\kappa = k - W/2}^{k + W/2 - 1} f_p(\kappa)$$

One decimation factor is calculated from each of these mean values; call them $D_{\bar{r}^a}$, $D_{\bar{r}^f}$, and $D_{\bar{f}}$ respectively. $D_{\bar{r}^a}$ is calculated as

$$D_{\bar{r}^a} = \begin{cases} D_a^{\max} - (D_a^{\max} - 1)\, \dfrac{\bar{r}_p^a}{P_a} & \text{if } \bar{r}_p^a < P_a \\[4pt] 1 & \text{if } \bar{r}_p^a \geq P_a \end{cases}$$

where $D_a^{\max}$ is the specified maximum decimation factor for amplitude trajectories, and $P_a$ is a reference value. $D_{\bar{r}^f}$ and $D_{\bar{f}}$ are calculated in a similar way. Finally, the amplitude and frequency decimation factors for partial $p$ at frame $k$ are calculated as weighted sums of these terms:

$$D_p^a = \beta\, D_{\bar{r}^a} + (1 - \beta)\, D_{\bar{f}}$$
$$D_p^f = \gamma\, D_{\bar{r}^f} + (1 - \gamma)\, D_{\bar{f}}$$
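A minimal sketch of this two-step factor computation follows. The helper names, the default blending weights $\beta$ and $\gamma$, and the application of the same linear law to all three mean values (the text spells it out only for $D_{\bar{r}^a}$) are our own assumptions.

```python
def decimation_factor(mean_value, p_ref, d_max):
    """D = d_max - (d_max - 1) * mean / p_ref below p_ref, else 1.

    A low mean relevance (or low mean frequency) yields strong decimation,
    falling linearly to no decimation (D = 1) at the reference value p_ref.
    """
    if mean_value >= p_ref:
        return 1.0
    return d_max - (d_max - 1.0) * mean_value / p_ref

def partial_decimation_factors(r_a_mean, r_f_mean, f_mean,
                               p_a, p_f, p_freq,
                               d_a_max, d_f_max, d_freq_max,
                               beta=0.5, gamma=0.5):
    """Amplitude and frequency decimation factors (D_p^a, D_p^f) for one
    partial at one frame, blending the relevance-based terms with the
    mean-frequency-based term as in Section 5."""
    d_ra = decimation_factor(r_a_mean, p_a, d_a_max)
    d_rf = decimation_factor(r_f_mean, p_f, d_f_max)
    d_fp = decimation_factor(f_mean, p_freq, d_freq_max)
    d_p_a = beta * d_ra + (1.0 - beta) * d_fp
    d_p_f = gamma * d_rf + (1.0 - gamma) * d_fp
    return d_p_a, d_p_f
```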

[Figure 3: Computation of the Partial Decimation Factors.]

6 Parameter Quantization

Frequencies and amplitudes are quantized using floating point; phases are uniformly quantized. Bit allocation is based upon psychoacoustic masking. Scales have a fixed number of bits, and the mantissa length $R_m$ is calculated as

$$R_m = \bar{R} + \frac{\mathrm{SMR}}{K}$$

where $\bar{R}$ is the average number of mantissa bits and $K$ is the number of decibels per additional bit. $\bar{R}$ is larger for frequencies than for amplitudes, because the ear is more sensitive to frequency modulation than to amplitude modulation.

7 Bitstream Format

Compressed data requires additional information to be stored as part of the bitstream in order to specify the mantissa/scale sizes and the number of partials present in the frame. A compressed frame is composed of data blocks corresponding to the four ways partials are coded: 1) index, amplitude, frequency, and phase; 2) index, amplitude; 3) index, frequency, and phase; 4) index only. Missing parameters are recalculated by interpolation in the decoder. A block is divided into sub-blocks, each one containing data of the same mantissa size.

8 Results and Conclusions

We have evaluated our software codec on monophonic musical instruments, polyphonic orchestral music, singing voice, and speech. Our goal has been to determine what data reduction factors can be achieved while maintaining very high quality with respect to the sound synthesized from uncompressed partial parameters. We have achieved the following compression ratios with respect to the partial parameter file: speech 1:28, clarinet tone 1:90, orchestral music 1:16, singing voice 1:30. These results show that high compression factors can be achieved while maintaining high quality for a wide variety of audio signals.

Our current research project includes the development of a software codec for sinusoidal modeling parameter files. In the future, this software could be integrated as a compression tool in the context of the SDIF (Sound Description Interchange Format) standard currently under development [3].

References

[1] Zwicker, E., and H. Fastl. 1990. Psychoacoustics: Facts and Models. Springer, Berlin, Heidelberg.
[2] Levine, S. 1998. Audio Representations for Data Compression and Compressed Domain Processing. Ph.D. Dissertation, Stanford University.
[3] Wright, M., et al. 1998. "New Applications of the Sound Description Interchange Format." Proceedings of the ICMC 1998.
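As an illustration of the bit-allocation rule of Section 6, here is a minimal sketch; the clamping range and the treatment of negative SMR as zero are not specified in the paper, and all names and defaults are ours.

```python
def mantissa_bits(smr_db, r_mean, k_db_per_bit, r_min=0, r_max=15):
    """Mantissa length R_m = R_mean + SMR / K (Section 6).

    r_mean       : average number of mantissa bits (larger for frequency
                   than for amplitude, since the ear is more sensitive to FM)
    k_db_per_bit : decibels of SMR per additional bit
    r_min, r_max : illustrative clamping range (assumption, not in the paper)
    """
    bits = round(r_mean + max(smr_db, 0.0) / k_db_per_bit)
    return min(max(bits, r_min), r_max)

# E.g. with r_mean = 4 and K = 6 dB/bit, a partial with an SMR of 18 dB
# receives 4 + 3 = 7 mantissa bits.
print(mantissa_bits(18.0, 4, 6.0))
```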