Timescale modifications and wavelet representations

Daniel PW Ellis
MIT Media Lab Music & Cognition Group
Cambridge, MA, 02139

ABSTRACT

The phase vocoder has been a very useful tool for computer musicians. None the less, its results are sometimes surprising; there seem to be important differences between how we perceive sounds and this representation. Our experience in implementing the phase vocoder algorithm for Csound gave rise to a new interpretation of timescale modification which clarified the role of the time window, and why it presents difficulties. Rather than uniformly stretching a conventional magnitude spectrogram, we view the time window as a horizontal division in a uniformly-projected constant-Q spectrogram (one that becomes more narrow at lower frequency rather than maintaining constant width). In this domain, features below the time window are simply being shifted down in frequency (implicitly extending their time support), whereas only the features above the window are being stretched. This suggests a new algorithm for time-scale modification using a wavelet representation rather than the short-time Fourier transform (STFT). We might expect that the increased time resolution of the wavelet transform at high frequency would provide a method to avoid the 'blurring' often associated with time-scale expansion. In fact, it presents serious difficulties for maintaining the pitch, since we are now stretching a wideband representation which resolves individual cycles.

1. INTRODUCTION

One fruitful source of interesting sound textures for computer music is real sounds, somehow processed or presented to affect listeners in a new way. Often, it is desirable to change the duration or pitch of such sampled sounds. The phase vocoder is a popular tool to change these parameters independently.
In general it is reasonably straightforward to lower the pitch and extend the duration of a given sample by 'slowing down' the playback, or, equivalently, resampling to a greater length at a constant playback rate. Unfortunately, it is much harder to escape this intimate relationship between duration and pitch, for instance to transpose without affecting duration, or to slow down a voice without sending it subsonic.

This difficulty may be a surprise the first time it is encountered, since the problem 'feels' well defined. Say we want to modify a given recording of speech to sound as it would have done if we had instructed the speaker to talk very slowly. Since we can so trivially imagine what the result should sound like, it seems not unreasonable to expect our computers to generate this desired result. Sadly, as in many other situations, something that is simple to define in perceptual terms turns out to be rather subtle and complex when expressed in terms of conventional signal processing.

In this paper we present an interpretation of time-scale modification as achieved by the phase vocoder by looking at the 'scaleogram' analyses of the dilated sounds. This constant-Q representation has been helpful to us in understanding the significance of the time window in the phase vocoder, and suggests some novel approaches to time scale modification, as well as some new problems. We will describe our investigations in these directions.

2. THE PHASE VOCODER

The phase vocoder was initially conceived as a domain in which the redundancy of speech signals could be exploited to improve bandwidth efficiency for telephony (Flanagan & Golden 65). The approach remained computationally prohibitive for the 'average' user (i.e. those outside large research groups!) until about 1980. The potential musical applications were described by Moorer in 1978 (Moorer 78) and a rigorous mathematical basis for speech modification by this method was presented by Portnoff in 1981 (Portnoff 81).
Mark Dolson's pvoc implementation as part of the CARL library provided a widely available phase vocoder to the computer music community, and his tutorial (Dolson 86) was extremely useful (at least to the current author) in spreading understanding of this technique. The integration of a similar algorithm into Barry Vercoe's Csound has further widened availability (Vercoe 90). The simultaneous increase in the power of personal computers now makes it feasible to run the phase vocoder on common computers such as the Macintosh, for which a Csound port is available, as well as the more specific SoundHack (Erbe 92).
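
As a concrete illustration of the coupling between duration and pitch noted in the introduction, the following minimal sketch 'slows down' a pure tone by resampling it to a greater length and then measures the result. The helper names (`slow_down`, `peak_frequency`), the 440 Hz test tone, the 8 kHz sample rate and the use of linear interpolation are all assumptions made for this example; none of them come from the pvoc or Csound implementations.

```python
import numpy as np

def slow_down(x, factor):
    """Resample x to `factor` times its length by linear interpolation.

    Played back at the original sample rate, the result lasts `factor`
    times longer. Linear interpolation is an illustrative choice only.
    """
    n_out = int(round(len(x) * factor))
    # Fractional positions in the original signal read by each output sample.
    src = np.arange(n_out) / factor
    return np.interp(src, np.arange(len(x)), x)

def peak_frequency(x, sr):
    """Frequency in Hz of the largest-magnitude bin of a windowed FFT."""
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    return np.argmax(spectrum) * sr / len(x)

sr = 8000                               # hypothetical sample rate in Hz
t = np.arange(sr) / sr                  # one second of time stamps
tone = np.sin(2 * np.pi * 440.0 * t)    # hypothetical 440 Hz test tone
slowed = slow_down(tone, 2.0)           # twice as long at the same rate

print("duration ratio:", len(slowed) / len(tone))
print("original peak (Hz):", peak_frequency(tone, sr))
print("slowed peak (Hz):", peak_frequency(slowed, sr))
```

Doubling the length by resampling alone halves the measured frequency, which is exactly the dependence that timescale modification has to break.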

The basic idea behind timescale modification (TSM) via the phase vocoder is based on the following observation: When looking at a narrowband spectrogram of a pseudoperiodic signal such as voiced speech, we typically see many near-horizontal lines corresponding to the relatively stable harmonics of the pitch cycle (as in figure 1, which shows an analysis of the single word "spoil"). If the sound were to evolve more slowly while retaining the same pitch, we would expect the narrowband spectrogram to look more or less the same in terms of the position and shape of these stripes, except the entire time dimension would be stretched to reflect the longer duration of the sound (as in figure 3, which is the same sound as figure 1 doubled in length with the phase vocoder).

Now we have something approaching a quantitative definition of how to perform time-scale modification: stretch the narrowband spectrogram representation of a sound, essentially a simple geometric operation, then do whatever is necessary to re-render this into a sound. In fact, it is a little bit tricky to do this last part, since the spectrogram is only telling about half the story (it is showing the STFT magnitude but not its phase, although the two are related). None the less, reasonable algorithms can be derived and this is the basis of the phase vocoder.

The phase vocoder also provides pitch modification: once TSM is achieved, pitch shifting is a simple matter of resampling the resulting signal (changing the speed of playback) to change the pitch. The TSM is then calculated to give the desired duration after the speed change is applied to affect the pitch.

The time window is the most important parameter for STFT analysis; it also defines the filterbank bandwidth. This distinguishes the narrowband spectrogram of figure 1, where individual harmonics appear as horizontal features, from the wideband spectrogram, where individual filters in the filterbank are too broad to isolate individual harmonics, but are fast enough to respond to variations in energy within each pitch cycle, resulting in vertical features as shown in figure 2.

The choice of time window is tricky because it is relative; the general rule of thumb is to use a time window two to three times the period of the lowest pitch expected. If the time window is too short, the individual pitch bursts will be moved apart in time, and pitch will not be preserved (although formant position will). If the time window is too long, the temporal details will be audibly 'blurred' along time - visible in figure 1.

[figure 1 - Narrowband spectrogram of "spoil"]
[figure 2 - Wideband spectrogram of "spoil"]
[figure 3 - NB spectrogram of "spoil" 1.5 times expanded]

3. CONSTANT-Q ANALYSIS

It is puzzling to be constantly adjusting the time window for particular results; why should this parameter assume such importance? This question highlights the fact that the fixed-bandwidth filterbank of a conventional spectrogram is a poor model of the human auditory system; the filtering performed by the cochlea is far better modeled by a filterbank where the bandwidth of individual elements increases with frequency - a

'constant-Q' filterbank (CQFB). Such filterbanks are generally slower than the fixed bandwidth versions, but the increased perceptual relevance outweighs their disadvantages.

[figure 4 - Scaleogram of "spoil" extended 2x with pvoc]

Figures 4 and 5 examine the results of phase vocoder time expansion with two different time windows as displayed in a spectrogram derived from a CQFB, also known as a 'scaleogram'. Note that the vertical frequency axis is now logarithmic. Note also how the analysis varies from narrowband behavior at low frequencies, resolving the first few harmonics of the vowel, to wideband-like analysis at high frequency, resolving the individual pitch pulses. A pleasing property of the CQFB is that the point at which harmonics begin to fuse into pitch pulses occurs at the same harmonic number regardless of pitch; there is nothing to adjust depending on the expected pitch range.

Figure 4 shows a conventional factor-of-two time dilation, but for figure 5 the phase vocoder time window was set deliberately narrow so that the individual pitch pulses were in fact stretched apart, and the voicing pitch dropped by the expansion factor. We can clearly see the longer period of pitch pulses in the high frequency of figure 5. This is also reflected in the position of the lowest harmonic, which finishes at about 100 Hz in figure 4, but an octave lower at 50 Hz in figure 5. This is in contrast to the formant ridges on either side of 2kHz, which have the same frequency in both images.
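
The pitch-independence of the point where harmonics fuse into pitch pulses can be checked with a little arithmetic. The sketch below assumes a deliberately crude criterion, namely that harmonic k of a fundamental f0 is resolved when the harmonic spacing (f0) exceeds the bandwidth of the filter centered on it. For a constant-Q filter that bandwidth is k*f0/Q, so the condition reduces to k < Q whatever the pitch; for a fixed-bandwidth filter it reduces to f0 > B, which depends on the pitch alone. The function names, the value Q = 8 and the 150 Hz fixed bandwidth are hypothetical choices for illustration, not parameters taken from the analyses shown in the figures.

```python
def last_resolved_harmonic_cq(f0, q, max_harmonic=100):
    """Highest harmonic number resolved by a constant-Q filterbank.

    Assumed criterion: harmonic k is resolved while the bandwidth of the
    filter centered at k*f0, which is (k*f0)/q, is below the spacing f0.
    """
    last = 0
    for k in range(1, max_harmonic + 1):
        bandwidth = k * f0 / q
        if bandwidth < f0:
            last = k
        else:
            break
    return last

def harmonics_resolved_fixed(f0, bandwidth_hz):
    """Same criterion for a fixed-bandwidth (STFT-like) filterbank.

    Every filter has the same bandwidth, so either all harmonics are
    resolved (narrowband behavior) or none are (wideband behavior).
    """
    return f0 > bandwidth_hz

Q = 8.0               # hypothetical quality factor
FIXED_BW_HZ = 150.0   # hypothetical fixed filter bandwidth

for f0 in (50.0, 100.0, 200.0):
    print(f0,
          last_resolved_harmonic_cq(f0, Q),
          harmonics_resolved_fixed(f0, FIXED_BW_HZ))
```

Under this criterion the constant-Q crossover sits at the same harmonic number for all three pitches, whereas the fixed-bandwidth analysis flips between wideband and narrowband behavior as the pitch crosses the filter bandwidth, which is why the STFT time window needs adjusting to the expected pitch range.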
The interpretation we put on these images is this: If we compare the fundamental harmonic between these two images, the primary effect is that it has been shifted vertically downwards, whereas the detail in the higher frequency regions has not been shifted. Compared to a similar analysis of the original sound (which looks like a condensed version of figure 4) the phase vocoder time expansion has lowered the frequency of certain features, but the range of features so affected is determined by the time window. Specifically, features identified as harmonic peaks in the STFT analysis are extended but not moved, but features, stable or otherwise, characterized by energy at frequencies below those visible in a given STFT analysis, are moved in frequency according to the time dilation.

[figure 5 - Scaleogram of "spoil" extended with short window]

The scaleogram is a good representation for thinking about these issues because it has no 'lowest frequency'; the filters become more and more narrow at low frequency, such that an arbitrary number can be fit into the finite bandwidth. The lower the frequency, the longer the time support of the corresponding filter.

We therefore imagine the effect of the phase vocoder as a horizontal division of the sound spectrum as represented on the scaleogram, with features above this division maintaining their 'height' or frequency, and features below the division being moved down in frequency (for time expansion). Both regions are expanded in the horizontal direction. The location of this transition height is determined by the phase vocoder time window.

If we imagine a simple geometric shift applied to this low frequency region, we might expect a 'hole' to appear between the two regimes. This does not occur, exposing a weakness of our explanation. Instead, in our example, extra harmonics of the newly-lowered fundamental appear to fill the gap.

4. UNIFORM PROJECTION

We have noted above that at lower frequencies the bandwidth of individual elements of the CQFB reduces, and hence their time support increases. This appears on our scaleograms as increasingly smooth variation along the time axis at low frequencies, as filter impulse response duration becomes very long. In figure 6, however, each horizontal stripe has been scaled by a factor so that the time windows would all appear the same width, and the smoothness is constant over the whole image. The big problem with this kind of display is that the horizontal spacing corresponding to a fixed time increment

varies with frequency, so vertical alignment is only possible for a single moment in a given display.

[figure 6 - 'Uniformly projected' version of figure 4]

The reason for introducing this bizarre display is to make one final insight into the nature of TSM when examined with a constant-Q analysis. If we now consider the process of 'shifting down' the low frequency energy as described above, we find that there is no need to adjust its width at all; since the real time corresponding to a certain geometric width increases as we move down in the plane, the simple translation also effects the required increase in time extent. This is saying that to time-expand a critically-sampled time series at a given frequency in the CQFB by a factor of two, we simply reassign the series to a filter an octave lower. Since the critical sampling period of this lower filter will be twice as long, the time expansion is implicitly achieved.

5. CONSTANT-Q BASED TIME EXPANSION

All this analysis of phase vocoder TSM begs the question: could we achieve better TSM using the CQFB representation of the sound in the first place? The previous section has described how we could deal with the low frequency components, but there is no simple answer to what should be done with the high frequency components. Comparing figures 4 and 5 shows the different kinds of results that we would need to achieve for different effects.

A considerable strength of the traditional phase vocoder compared with, say, time-domain TSM methods, is that it is relatively model-free; there is no part of the algorithm that tries to measure the period of the signal or classify its spectral peaks as harmonics. The phase vocoder gives useful results for a wide range of sounds, rather than failing dismally for sounds outside a narrow family. The strength of models is that by putting a strong interpretation on the role of each parameter, more complex operations on the data are possible.
For instance, modeling a sound as a sequence of speech phonemes would allow us to apply 'natural' speed modification rules to slow down the vowels more than the consonants. Certain models have enabled spectacularly successful TSM on appropriate sounds, e.g. (Quatieri 85), (Serra 89).

We are currently investigating CQFB-based TSM using the categorical representation, as introduced in (Ellis 91). When energy in the high frequencies has been identified, it can potentially be replicated, as is the case in figure 4, or stretched, as would be required to produce the effect shown in figure 5. Although these approaches do not feel as satisfying as the simple operation of the phase vocoder, they may none the less provide interesting results.

6. CONCLUSION

We have tried to explain some slightly nebulous insights into the process of timescale modification gained through experience with constant-Q filterbank analysis. In particular, this gives a new interpretation of the time window as the demarcation between features perceived at a particular frequency 'place' which must be preserved, and features actually heard to evolve over time, whose frequency components, such as they are, must be shifted down to effect timescale expansion. Although this has suggested some new approaches to timescale modification based on constant-Q representations, we have yet to overcome new difficulties that appear in this domain.

REFERENCES

M Dolson, "The phase vocoder: a tutorial", Computer Music Journal, 10(4), Winter 1986
DPW Ellis, BL Vercoe, "A wavelet-based model of sound for auditory signal separation", Proc. ICMC, October 1991
Tom Erbe, SoundHack 0.60 documentation, May 1992
J A Moorer, "The use of the phase vocoder in computer music applications", J. Audio Eng. Society, 24(9), 1978
Michael R Portnoff, "Time-scale modification of speech based on short-time Fourier analysis", IEEE tr. ASSP, vol ASSP-29(3), June 1981
Thomas E Quatieri, R J McAulay, "Speech transformations based on a sinusoidal model", Proc. ICASSP, March 1985
Xavier Serra, "A system for sound analysis/transformation/synthesis based on a deterministic plus stochastic decomposition", PhD thesis, CCRMA, Stanford University
BL Vercoe, DPW Ellis, "Real-time Csound: software synthesis with control", Proc. ICMC, September 1990