On the Perception of Transients: Applying Psychophysical Constraints to Improve Audio Analysis and Synthesis

Gregory H. Wakefield (1), Laurie M. Heller (2), Laurel H. Carney (3), Maureen Mellody (4)

(1) Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor MI 48109
(2) Naval Submarine Medical Research Laboratory, Groton CT 06359
(3) Hearing Research Center and Dept. of Biomedical Engineering, Boston University, Boston MA 02215
(4) Applied Physics Program, University of Michigan

Abstract

In a psychophysical experiment, a wideband, 4-ms noise is compared with spectrally smoothed versions of the noise. To isolate the effect of the magnitude spectrum, the phase spectrum is controlled by assigning the same random phase spectrum to both the original and smoothed signals. Depending on the choice of phase spectrum, discrimination thresholds for spectral shape vary from a maximum spectral deviation of 2-3 dB to as much as 25 dB. The exact source of this effect remains an open question. However, it suggests that not all transients require the same signal fidelity. Given a proper characterization of the perceptual response, it may be possible to reduce the size of the codebook required to maintain acceptable perceptual results.

1. INTRODUCTION

Recent advances in audio representation decompose the signal into a parallel combination of sinusoidal, transient, and noise components and then apply specialized techniques to analyze and synthesize each of these components (Levine and Smith 1999). With respect to time-varying sinusoids and wideband noise, knowledge about auditory perception has been used to optimize the representation. In contrast, relatively little is known about the perception of transients, so the processing of transients is governed primarily by standard signal transform techniques.
This paper presents new psychophysical results on the perception of transients and suggests how these results may be used to improve the analysis and synthesis of transients in the audio signal.

Limits on temporal resolution in auditory perception typically range from 25 ms, for the judgment of temporal order, to a lower limit of 1-2 ms for the discrimination of monaural phase (Eddins and Green 1995). For example, Ronken (1970) measured discrimination thresholds for a pair of unequal-amplitude clicks and their time-reversed form as a function of the temporal separation between the clicks and the relative amplitude of the clicks. He demonstrated that the threshold amplitude was relatively constant for temporal separations greater than 2 ms, but that the threshold increased as the temporal separation was decreased from 2 to 1 ms. Patterson and Green (1970) studied the discrimination of Huffman sequences, which, for a given duration, have identical power spectra. They found that discrimination between different Huffman sequences was near chance for durations less than 2 ms.

The signals used in both the Ronken and the Patterson and Green experiments are fairly simple in construction. Both are spectrally flat, except for a small degree of spectral rippling associated with the temporal separation between the clicks in Ronken's experiment. Other studies have used more spectrally concentrated signals to study the importance of phase. Wier and Green (1975), for example, showed that listeners were able to discriminate 90% of the time between a 1-ms 1000-Hz tone followed by a 1-ms 2000-Hz tone and the same signal played backwards in time. Other studies have measured the sensitivity to group delay using signals of longer duration. For example, in unpublished research, Wakefield and Viemeister found that the onset-disparity threshold between relatively long 1-kHz and 4-kHz tones was between 3 and 5 ms.
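The click-pair and time-reversal experiments above rely on signals that differ only in their phase spectra. As a minimal sketch (the sample positions, amplitudes, and helper names below are illustrative assumptions, not the stimuli of Ronken's study), the following checks numerically that a pair of unequal-amplitude clicks and its time-reversed form share the same magnitude spectrum:

```python
import cmath

def dft_magnitudes(x):
    """Return |X[k]| for the length-N DFT of a real sequence x."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
            for k in range(n)]

def click_pair(n, gap, a1, a2):
    """Two clicks of amplitudes a1 and a2, `gap` samples apart (hypothetical helper)."""
    x = [0.0] * n
    x[0] = a1
    x[gap] = a2
    return x

# Illustrative values only: a 2-ms gap at 44.1 kHz is about 88 samples.
pair = click_pair(n=128, gap=88, a1=1.0, a2=0.5)
reversed_pair = pair[::-1]          # same clicks, played backwards in time

mag_fwd = dft_magnitudes(pair)
mag_rev = dft_magnitudes(reversed_pair)
max_diff = max(abs(a - b) for a, b in zip(mag_fwd, mag_rev))
print(f"largest magnitude-spectrum difference: {max_diff:.2e}")
```

Because the two waveforms differ while their magnitude spectra agree to rounding error, any discrimination between them must rest on phase (temporal) cues.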
Transients are problematic for a variety of audio compression techniques. Levine and Smith (1999) introduced a hybrid form of audio compression that switches between different sub-coders according to the state of the source. When a transient is detected, the coder switches to a different sub-coder, isolating the effects of the transient on neighboring segments of the signal. The authors argue that, with proper phase matching across changes in sub-coders, perceptually acceptable performance can be achieved in many cases. Levine and Smith propose a transient sub-coder that applies the MDCT to 256-point windows of a waveform sampled at 44.1 kHz, with an overlap ratio of 50%. At this sampling rate, the duration of the window is 5.8 ms with an update interval of 2.9 ms. While these durations are close to the lower psychophysical bounds on temporal acuity, they are still larger than these bounds for isolated transients. This raises the question of how to design the codebook of the transient sub-coder so that it more effectively exploits the limits of auditory sensitivity to waveform shape in this very short duration regime.

2. PSYCHOPHYSICAL DISCRIMINATION OF TRANSIENTS: THE IMPORTANCE OF PHASE

The goal of perceptual coders is to tile the quantization space in such a way that signals within any given tile are perceptually indiscriminable. The coder then transmits the tile location, which, at the receiving end, is synthesized into a quantized signal with no loss of fidelity. While much is known about tiling for longer-duration musical signals, there remain a number of questions concerning the space of signals on the order of 5 ms or less. The following summarizes the main points of a psychophysical study we have conducted on the perception of transients.

2.1. Experimental Method and Stimuli

Listeners were asked to discriminate between a 4-ms signal (standard) and a "smoothed" version of the signal (comparison) in which the magnitude spectrum was altered while leaving the phase spectrum the same. A smoothing spline A(f) was used to approximate the log magnitude spectrum of the standard, where A(f) is chosen to minimize the regularizing functional

$$ p \sum_{k=1}^{N} \left( 10\log_{10} S(f_k) - A(f_k) \right)^2 + (1 - p) \int_{0}^{F_{\max}} \left( \frac{d^2 A(f)}{df^2} \right)^2 df $$

over samples {f_k}. The solution can be adjusted from the best-fitting straight line (p = 0) to an exact match to the data (p = 1) by varying the smoothing parameter p. The bandwidth of the standard and comparison was 100 Hz to 10 kHz. Signals were generated digitally at a 44.1 kHz sampling rate. For the standard, the level (in decibels) of each spectral component was uniformly distributed i.i.d. over a range of 40 dB.
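The role of the smoothing parameter p can be illustrated with a discrete analogue of the functional above. The sketch below is an assumption-laden stand-in, not the spline used in the study: the integral of the squared second derivative is replaced by a sum of squared second differences, the helper names `smooth_levels` and `solve` are hypothetical, and the choice of 40 components centered on 0 dB is illustrative only.

```python
import random

def solve(m, b):
    """Gaussian elimination with partial pivoting on a small dense system."""
    n = len(b)
    a = [row[:] + [b[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (a[r][n] - sum(a[r][c] * x[c] for c in range(r + 1, n))) / a[r][r]
    return x

def smooth_levels(levels_db, p):
    """Minimize p*sum((y - a)^2) + (1 - p)*sum((second difference of a)^2).

    Discrete analogue of the smoothing-spline criterion (an assumption).
    Use a small positive p rather than p = 0, which makes the system singular.
    """
    n = len(levels_db)
    # Normal equations: (p*I + (1 - p)*D^T D) a = p*y, D = second-difference operator.
    m = [[p if i == j else 0.0 for j in range(n)] for i in range(n)]
    stencil = (1.0, -2.0, 1.0)
    for r in range(n - 2):
        for i in range(3):
            for j in range(3):
                m[r + i][r + j] += (1.0 - p) * stencil[i] * stencil[j]
    return solve(m, [p * y for y in levels_db])

rng = random.Random(0)
# Standard: component levels drawn i.i.d. uniformly over a 40-dB range.
standard_db = [rng.uniform(-20.0, 20.0) for _ in range(40)]
exact = smooth_levels(standard_db, p=1.0)            # reproduces the data
nearly_linear = smooth_levels(standard_db, p=1e-6)   # close to a straight line
max_dev_db = max(abs(s - c) for s, c in zip(standard_db, nearly_linear))
print(f"maximum spectral deviation at p = 1e-6: {max_dev_db:.1f} dB")
```

The maximum deviation between the standard and the smoothed levels is the kind of quantity used here to describe thresholds in decibels.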
The phase spectrum of the standard was uniformly distributed i.i.d. over 0 to 2π. The same phase spectrum was used for the comparison, while the magnitude spectrum was varied using a spectral smoothing function (see the Matlab function csaps for an implementation of the smoothing spline). Within a given block of trials, the phase and magnitude spectra remained constant. Signals were presented monaurally to the right ear over headphones to listeners in a sound-proof booth.

A flanking-cue two-interval forced-choice adaptive psychophysical procedure was used to measure discrimination threshold. Each trial consisted of four observation intervals. The first and fourth intervals contained the standard. On any given trial, the second (or third) interval contained the standard while the other contained the comparison, with probability 0.5. The listener's task was to select the interval containing the comparison. The degree of smoothing, p, was adjusted from trial to trial according to a "two-down, one-up" Levitt tracker. Reversal points were averaged to determine the degree of smoothing that yields a psychophysical threshold of 70.7% correct for each run of the test. Thresholds are reported as the average of the levels obtained in five such runs.

2.2. Results

Thresholds for a number of different magnitude and phase spectra were measured in three highly trained listeners. While we observed individual differences across the three listeners in overall sensitivity, all three showed evidence that the phase spectrum of the transient can significantly affect the ability to detect smoothing in the standard's magnitude spectrum. Typical results are shown in the two panels of Figure 1 for Subject 3. In each panel, the heavy and thin lines denote the magnitude spectra of the standard and comparison at threshold, respectively.
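The adaptive rule described in Section 2.1 can be sketched as follows. This is a minimal simulation under stated assumptions, not the experimental software: the simulated listener, its logistic psychometric function, the 2-dB step, the starting level, and the number of reversals are all hypothetical choices made for illustration.

```python
import math
import random

def two_down_one_up(respond, start_db, step_db, n_reversals):
    """Run one two-down, one-up track and return the mean of its reversal levels.

    `respond(level_db)` returns True for a correct response. The tracked level
    is the smoothing parameter p in dB; raising it toward 0 dB means less
    smoothing, which makes the comparison harder to detect.
    """
    level = start_db
    correct_in_a_row = 0
    last_direction = 0            # +1 = task got harder, -1 = task got easier
    reversals = []
    while len(reversals) < n_reversals:
        if respond(level):
            correct_in_a_row += 1
            if correct_in_a_row < 2:
                continue          # need two consecutive correct responses to move
            direction = +1        # two correct: less smoothing (harder)
        else:
            direction = -1        # one error: more smoothing (easier)
        correct_in_a_row = 0
        if last_direction and direction != last_direction:
            reversals.append(level)
        last_direction = direction
        level = min(0.0, level + direction * step_db)   # p cannot exceed 1 (0 dB)
    return sum(reversals) / len(reversals)

def make_listener(threshold_db, slope_db, rng):
    """Hypothetical listener: P(correct) falls from 1 toward chance (0.5) as p rises."""
    def respond(level_db):
        p_correct = 0.5 + 0.5 / (1.0 + math.exp((level_db - threshold_db) / slope_db))
        return rng.random() < p_correct
    return respond

rng = random.Random(1)
listener = make_listener(threshold_db=-60.0, slope_db=3.0, rng=rng)
# Threshold reported as the average over five runs, as in the procedure above.
runs = [two_down_one_up(listener, start_db=-80.0, step_db=2.0, n_reversals=12)
        for _ in range(5)]
threshold_estimate = sum(runs) / len(runs)
print(f"estimated threshold: {threshold_estimate:.1f} dB")
```

The track settles where two consecutive correct responses are as likely as not, i.e. where P(correct) squared equals 0.5, which is the 70.7% point quoted above.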
The top panel is an "easy" phase condition, i.e., a smoothing parameter of -57 dB; the standard and comparison spectra differ by no more than 9.7 dB anywhere in the frequency range, with an average squared difference of 2.8 dB. The bottom panel is a "hard" phase condition, i.e., a smoothing parameter of -78 dB. In this case, the comparison's magnitude spectrum at threshold is much flatter than the standard, particularly in the 3-8 kHz region. In comparison with the "easy" phase condition, the largest deviation between the two spectra is 19.6 dB and the average squared difference is 6.4 dB. We observed similar phase effects for a variety of different random magnitude spectra. We also observed that phase conditions that were difficult for one subject were difficult for all subjects.

Figure 1: The magnitude spectrum of the standard is shown in the light solid lines of the two panels. The heavier lines show the magnitude spectrum of the comparison stimulus at discrimination threshold for two different phase spectra. The top panel illustrates an "easy" discrimination condition, where the spectral differences between the standard and comparison are very small at threshold. The bottom panel illustrates a "hard" discrimination condition. In this case, the large physical differences between the spectra between 3 and 8 kHz are just discriminable.

Figure 2 shows a distribution of threshold values of the smoothing parameter, p, as measured in dB. In this case, single-trial measures were taken across a range of random phase spectra. No attempt was made to draw the same phase spectra for each of the three subjects. As shown in the figure, individual differences exist between the three subjects with respect to the overall range of the phase effect. S1 is generally the least affected by the phase spectrum, while S3 is the most affected of the three subjects. Specifically, under certain phase conditions, S3 required nearly straight-line smoothing of the spectrum before discriminating between the original and smoothed versions. In his earlier work, Ronken (1970) also noted large intersubject differences in sensitivity to signals that differ in their phase spectra.

Figure 2: Histograms of thresholds for different random phase spectra are shown in the three panels for S1 (top), S2 (middle), and S3 (bottom). Thresholds are reported in values of the smoothing parameter p (dB). The histograms reveal different sensitivities of the three subjects to the effects of phase on spectral discrimination. The data suggest a skew towards larger values of p, but they do not indicate that large phase effects are uncommon events.

The results show that the effect of phase on spectral discrimination, while skewed to larger values of p, is not simply a matter of a few outliers in the population of phase spectra. Even for p-values between -60 and -70 dB, the spectral peaks within the 3-8 kHz region are attenuated, at threshold, by 10-20 dB. Clearly, in these conditions, there is no need to spend quantization bits encoding such spectral variations. On the other hand, thresholds below -55 dB are generally resolving spectral variations on the order of 2-4 dB.

It is interesting to compare these thresholds for 4-ms "samples" of the spectrum with thresholds for much longer, steady-state signals. Once the signals are sufficiently long in duration, no phase effect is observed and all three listeners yield thresholds around -50 dB.

The phase effects are generally larger than what the (sparse) literature on the auditory perception of transients would suggest. Examination of the temporal and spectral properties of the signals, however, does not reveal the underlying source of such an effect. Peak factor, for example, does not appear to be important, nor does the concentration of energy in different bands of the spectrum. To obtain a better understanding of the possible causes for the results observed, we examined the response of a computational model of the human auditory system.

3. COMPUTATIONAL AUDITORY MODEL

The auditory nerve (AN) model used here is a simplified version of a model that was developed to study nonlinear properties of physiological AN responses (Carney 1993). The model used in this study consists of three main stages: 1) a linear filter bank for the tuning in the auditory periphery, 2) a model for the transduction and low-pass filtering of signals by sensory cells in the inner ear, and 3) a neural adaptation model for the synapse between the sensory cells and AN fibers (Heinz, Carney et al. 1999). The linear filterbank consists of 4th-order gammatone filters with bandwidths matched to psychophysical tuning curves for human listeners (Glasberg and Moore 1990). The sensory cells are modeled by an asymmetric saturating nonlinearity implemented using an arctangent, with an asymmetry of 3:1, followed by a 7th-order low-pass filter with a 3-dB cutoff at 2500 Hz. The low-pass filter limits the synchrony of the AN model responses; the rolloff in synchrony with increasing frequency matches that reported for physiological AN responses (Johnson 1980). The final stage of the model is a neural adaptation model, which is a time-varying implementation (Carney 1993) of the three-stage diffusion model proposed by Westerman and Smith (1988). Preliminary analysis of the output of the AN model indicates that differences in the phase spectra of two signals are preserved.
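The first stage of the model can be sketched as follows. This is an illustration only, not the model code used in the study: the equivalent rectangular bandwidth (ERB) formula is the one given by Glasberg and Moore (1990), while the bandwidth scaling factor of 1.019, the peak normalization, and the function names are assumptions made here.

```python
import math

def erb_hz(fc_hz):
    """Equivalent rectangular bandwidth of the auditory filter (Glasberg and Moore 1990)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def gammatone_ir(fc_hz, fs_hz, dur_s, order=4):
    """Impulse response t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t), peak-normalized."""
    b = 1.019 * erb_hz(fc_hz)      # common bandwidth scaling; an assumption here
    n_samples = int(round(dur_s * fs_hz))
    ir = []
    for i in range(n_samples):
        t = i / fs_hz
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        ir.append(envelope * math.cos(2.0 * math.pi * fc_hz * t))
    peak = max(abs(v) for v in ir)
    return [v / peak for v in ir]

def envelope_peak_time(fc_hz, order=4):
    """Time at which the envelope t^(n-1) * exp(-2*pi*b*t) reaches its maximum."""
    b = 1.019 * erb_hz(fc_hz)
    return (order - 1) / (2.0 * math.pi * b)

ir_1k = gammatone_ir(fc_hz=1000.0, fs_hz=44100.0, dur_s=0.025)
for fc in (200.0, 1000.0, 4000.0):
    print(f"fc = {fc:6.0f} Hz  ERB = {erb_hz(fc):6.1f} Hz  "
          f"envelope peak at {1000.0 * envelope_peak_time(fc):.2f} ms")
```

Because the bandwidth grows with center frequency, the envelope of a low-frequency channel peaks later than that of a high-frequency channel, so a filterbank of this kind delays the response to a transient at low frequencies regardless of the stimulus phase.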
The two panels of Figure 3 show the output from the neural adaptation model in response to one of the 4-ms signals studied psychophysically. The top panel presents a gray-scaled image of the output amplitude of the neural adaptation model with time along the abscissa and center frequency of the filterbank along the ordinate. In general, the lagging of the response to the transient at the lower frequencies is an attribute of the AN filterbank, rather than a property of the particular phase spectrum used in this example. Different average intensities across time correspond to the variations in spectral magnitude of the random transient. The bottom panel presents a gray-scaled image of the differential between the output amplitudes of two signals with the same magnitude spectrum but different phase spectra. The rippling observed in regions of this differential image is due to the differences in phase spectra across the two transients.

Figure 3: The output of the neural-adaptation stage of the AN model is shown in the top panel for the standard stimulus in the "easy" phase condition. The image encodes the amplitude of the output as a function of the center frequency of the auditory filters and time. The bottom panel shows the differential between the outputs for the standards used in the "hard" and "easy" phase conditions.

The sensitivity of the AN model to the phase spectrum of 4-ms wideband transients suggests that further characterization of the output, or differential, surfaces will help us better understand the source of the phase effects observed psychophysically. Similar sensitivity is not as readily observed for the spectrogram. We are currently applying surface-representation techniques used in wavelets and time-frequency distributions to summarize the dependence of surface features on the signal's phase spectrum.

4. DISCUSSION AND CONCLUSIONS

As the duration of the transient increases, phase plays a much weaker role in the discrimination of spectral magnitude (Moore 1995). We observed similar results in the present case when the 4-ms transients were concatenated to form 264-ms signals. Phase effects were not seen in the discrimination of spectral smoothing among the concatenated signals; thresholds were similar to the best thresholds observed for single transients. A much more surprising observation was that a delay in the concatenation of two transients by as much as 250 ms also eliminated much of the phase effect. Perceptual studies are underway to better understand this result.

The data suggest that not all transients require the same perceptual tiling.
Instead, for a number of phase conditions, there exists a wide range of spectral variations that are indiscriminable from the original. Properly identifying the phase conditions for which spectral resolution is poor can have a significant impact on the number of codewords used to represent a class of transients. By setting the codebook design according to measures of spectral resolution from long-term (phase-independent) experiments, one is likely to cover all discriminable cases for the 4-ms regime of transients, but one is also likely to use far too many codewords for a majority of the encoded signals. We are currently investigating the impact of these psychophysical results on the codebook design. Without regard to phasing across frames and other time-varying issues, proper adjustment of the size of the perceptual tiles can reduce the codebook size by as much as a factor of 5-10.

5. ACKNOWLEDGMENTS

This research was supported by grants from the Office of Naval Research (MURI Z883402 and N0001499WR30037).

6. REFERENCES

Carney, L. H. 1993. "A model for the responses of low-frequency auditory nerve fibers in cat." J. Acoust. Soc. Am. 93: 401-417.

Eddins, D. A. and D. M. Green 1995. Temporal integration and temporal resolution. Hearing. B. C. J. Moore. New York, Academic Press.

Glasberg, B. R. and B. C. J. Moore 1990. "Derivation of auditory filter shapes from notched-noise data." Hearing Research 47: 103-138.

Heinz, M. G., L. H. Carney, et al. 1999. "Performance limits for frequency and intensity discrimination based on a computational auditory-nerve model." Conference Abstracts of the Association for Research in Otolaryngology, St. Petersburg Beach.

Johnson, D. H. 1980. "The relationship between spike rate and synchrony in responses of auditory-nerve fibers to single tones." J. Acoust. Soc. Am. 68: 1115-1122.

Levine, S. and J. O. Smith 1999. "A Switched Parametric & Transform Audio Coder." IEEE International Conference on Acoustics, Speech and Signal Processing: ICASSP'99, Phoenix, Arizona.

Moore, B. C. J. 1995. Frequency Analysis and Masking. Hearing. B. C. J. Moore. New York, Academic Press.

Patterson, J. H. and D. M. Green 1970. "Discrimination of transient signals having identical energy spectra." J. Acoust. Soc. Am. 48: 894-905.

Ronken, D. 1970. "Monaural detection of a phase difference between clicks." J. Acoust. Soc. Am. 47(4 (Part 2)): 1091-1099.

Westerman, L. A. and R. L. Smith 1988. "A diffusion model of the transient response of the cochlear inner hair cell synapse." J. Acoust. Soc. Am. 83: 2266-2276.

Wier, C. C. and D. M. Green 1975. "Temporal acuity as a function of frequency difference." J. Acoust. Soc. Am. 57: 1512-1515.