On timbre stamps and other frequency-domain filters

Miller Puckette
University of California, San Diego
msp@ucsd.edu

Abstract

FFT-based filters find wide use in both live and 'tape' electronic music. Here an attempt is made to develop straightforward guidelines for choosing parameters such as window size and overlap in order to obtain desired time and frequency resolution and minimize artifacts. As an application, filters derived from other sound sources ("FFT vocoders" or "timbre stamps") are discussed in detail.

1 Introduction

On the frequent occasions when one reaches for one or another sort of filter, one can choose either a "classical" time-domain filter or an FFT-based one. Typically, time-domain filters can achieve very sharp frequency definition and low latency, and are often cheaper to implement than FFT-based ones. But FFT-based filters have other graces, such as explicit phase control and greater ease of varying the filter characteristics in time. Furthermore, certain applications lend themselves naturally to FFT filtering: for example, frequency-band-variable spatialization (Torchia and Lippe 2003) or delay (Kim-Boyle 2004), or "vocoders" or "timbre stamps" in which the spectrum of one sound is used to derive a filter for another (Settel and Lippe 1998; Puckette 2007).

Here we will consider some issues that arise in FFT-based filtering, particularly for timbre stamping. The following section sets a framework and defines the parameters used. Next, we consider whether, and when, FFT filtering really works correctly with arbitrary time-varying filter gains. In Sections 5 and 6 we turn our attention to the timbre stamping algorithm. Several possible variations are developed. All of them boil down to computations of various time-varying FFT channel gains, thus fitting into the framework developed and analyzed in Sections 2 through 4.
2 Setup

Using variable names and conventions as in (Puckette 2007), the filters under discussion take an input signal $X[n]$, possibly complex-valued, and compute short-time spectra

$$S[m,k] = \mathrm{FT}\{w_a[n]\,X[n+mH]\}[k] = \sum_{n=0}^{N-1} e^{-2\pi i n k/N}\, w_a[n]\, X[n+mH] \tag{1}$$

where $N$ is the window size, $H$ is a hop size, $i^2 = -1$, $k$ is the frequency in bins, $m$ is the frame number, and $w_a[n]$ is the analysis windowing function. In general the spectra $S[m,k]$ are complex-valued.

We will use "linear-phase" filters in which we multiply the spectra by real-valued gains $g[m,k]$, which may depend both on frequency $k$ and frame number $m$. (For a non-time-varying filter there is no $m$ dependence, so that we may write the gain as $g[k]$.) The output is then computed by windowing and overlap-adding the inverse Fourier transform:

$$Y[n] = \sum_{m} w_s[n-mH]\,\left(\mathrm{FT}^{-1}\{g[m,k]\,S[m,k]\}\right)[n-mH] \tag{2}$$

Here $\mathrm{FT}^{-1}$ denotes the inverse of the discrete Fourier transform $\mathrm{FT}$ and $w_s[n]$ is a resynthesis windowing function. It is reasonable to ask that the signal $X[n]$ be correctly reconstructed when the filter gains are all 1, which implies:

$$\sum_{m} w_a[n-mH]\,w_s[n-mH] = 1 \tag{3}$$

for all $n$ (tacitly putting $w_a = w_s = 0$ outside the window).

A possible choice for the analysis and resynthesis window function is the Hann window:

$$h[n] = \frac{1}{2}\left(1 - \cos(2\pi n/N)\right) \tag{4}$$

We will also consider the "squeezed" version:

$$h_p[n] = \begin{cases} h\!\left(\left(1 - \frac{1}{p}\right)\frac{N}{2} + \frac{n}{p}\right) & 0 < \left(1 - \frac{1}{p}\right)\frac{N}{2} + \frac{n}{p} < N \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

where $0 < p \le 1$ ($p = 1$ gives the unsqueezed window). This is just the Hann window function rescaled to occupy a segment of length $pN$ in the middle of the window $0, \ldots, N-1$, and zero-padded elsewhere. The Fourier transform is, to a close approximation for large $N$,

$$H_p[k] = \mathrm{FT}\{h_p[n]\}[k] \approx e^{-i\pi k}\left\{\frac{pN}{2}\,\mathrm{sinc}(\pi p k) + \frac{pN}{4}\,\mathrm{sinc}(\pi(pk+1)) + \frac{pN}{4}\,\mathrm{sinc}(\pi(pk-1))\right\} \tag{6}$$

where $\mathrm{sinc}(k) = \sin(k)/k$ and $\mathrm{sinc}(0) = 1$. (The phase term comes in because the window function is centered at $n = N/2$ but the Fourier analysis phase is zero at $n = 0$.) The "main lobe" of $H_p$ extends over $-2/p < k < 2/p$ for a total bandwidth of $4/p$.

The filter is completely specified by the gains $g$, the window size $N$, the analysis and resynthesis window functions (including "squeeze factors" $p$), and the overlap, defined as $N/H$.

3 Simplest low-pass filter

Our analysis will loosely follow that given in (Allen 1977) and appendix B of (Laroche and Dolson 1999). Assuming the filter is time-invariant (i.e., the gain $g$ does not depend on the frame number $m$), we can predict its behavior from that of the low-pass filter with $g[0] = 1$ and $g[k] = 0$ for $k \ne 0$. This is possible because, first, the filter's output is a linear function of $g$, and second, passing a signal through a filter admitting only the $k$th bin gives the same results as for the zeroth bin, except with a frequency shift.

Suppose we introduce the sinusoid $X[n] = Z^n$ with angular frequency $\omega_0$ (so that $Z = e^{i\omega_0}$), whose frequency in bins is $k_0 = N\omega_0/(2\pi)$. The Fourier transform at DC is

$$S[m,0] = Z^{N/2+mH}\, e^{-i\pi k_0}\, W_a(-k_0) \tag{7}$$

where $W_a$ denotes the Fourier transform of the analysis window $w_a$. Here the first phase term is that of the incoming signal at the middle of the $m$th analysis window; the rest is the Fourier-transformed window function evaluated at the point $-k_0$.

We now take the inverse FT and overlap-add using the resynthesis window function. Viewed in the time domain, this convolves the resynthesis window function with the signal:

$$\ldots,\; Z^{N/2},\; \underbrace{0, \ldots, 0}_{H-1\ \text{times}},\; Z^{N/2+H},\; \underbrace{0, \ldots, 0}_{H-1\ \text{times}},\; Z^{N/2+2H},\; \ldots \tag{8}$$

This pulse train contains the bin frequencies:

$$\ldots,\; k_0 - N/H,\; k_0,\; k_0 + N/H,\; \ldots \tag{9}$$

and the result of convolving it with the resynthesis windowing function is to apply a low-pass filter with transfer function $W_s$, the Fourier transform of the resynthesis window.
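The claim that the pulse train (8) contains exactly the bin frequencies (9) can be checked numerically. The following is a minimal sketch, not taken from the paper: the function name `pulse_train_bins` and the values N = 64, H = 16, k_0 = 3 are illustrative assumptions, and an integer k_0 is chosen so that a length-N DFT resolves the components exactly.

```python
import cmath

def pulse_train_bins(N, H, k0, tol=1e-9):
    """Return the DFT bins (0..N-1) in which the pulse train of (8) has energy."""
    # Z = exp(i * omega_0), with omega_0 = 2 * pi * k0 / N.
    Z = cmath.exp(2j * cmath.pi * k0 / N)
    # One value Z^(N/2 + mH) every H samples, zeros in between, over N samples.
    train = [Z ** (N // 2 + n) if n % H == 0 else 0j for n in range(N)]
    bins = []
    for k in range(N):
        # Direct DFT sum at bin k (slow but dependency-free).
        P = sum(x * cmath.exp(-2j * cmath.pi * n * k / N)
                for n, x in enumerate(train))
        if abs(P) > tol:
            bins.append(k)
    return bins

N, H, k0 = 64, 16, 3          # overlap N/H = 4 (illustrative values)
bins = pulse_train_bins(N, H, k0)
print(bins)                   # bins congruent to k0 modulo N/H
```

With these values the nonzero bins are spaced N/H = 4 apart, starting from k_0 = 3; bins above N/2 correspond to the negative-side images such as k_0 - N/H.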
For the "correct" result we should filter out all but the $k_0$ term; this works provided $N/H > |k_0| + c_s$, where $c_s$ is the cutoff frequency of the resynthesis windowing function. Given that we do not wish to have to control the range of frequencies in the incoming signal, the only factor that controls the magnitude of $k_0$ is the bandwidth of the analysis window function (call it $c_a$). Then the condition for not aliasing is:

$$\frac{N}{H} \ge c_a + c_s \tag{10}$$

If we are using Hann windows with squeeze factors $p_a$ and $p_s$, we get:

$$\frac{N}{H} \ge \frac{2}{p_a} + \frac{2}{p_s} \tag{11}$$

For an overlap of four, we barely get away with it at $p_a = p_s = 1$; no squeezing is allowed.

The signal is attenuated at the analysis stage by $W_a(k_0)$, and again at the resynthesis stage by $W_s(k_0)$, so the frequency response is the product of the magnitudes of the two. If we use Hann windows with no squeezing, we get 12 dB reduction at $k_0 = 1$ and 2.84 dB at $k_0 = 0.5$, so the bandwidth can reasonably be stated as one bin. But if, for example, we wish to place a filter at a center frequency of $k = 0.5$, we have to superpose filters at $k = 0$ and $k = 1$. The gain then only falls off 1.9 dB one half bin off peak and 5.7 dB one bin off (at $k = 1.5$, e.g.). If desired, the uniformity of bandwidth can be improved by zero-padding the Fourier transforms, effectively doubling $N$ and using squeeze factors of 0.5 so that the filter may be expressed at a resolution of 1/2 bin instead of 1.

4 Time-varying filters

The low-pass filter (from which we may understand the behavior of any other filter) may be made time-varying by specifying that the gain $g[m,k]$ be zero except when $k = 0$, but varying with $m$. Since the filter output is a linear function of the gain $g$, it suffices to know the behavior of a sinusoidally varying filter:

$$g[m,k] = \begin{cases} e^{2\pi i\, m H k_f / N} & k = 0 \\ 0 & \text{otherwise} \end{cases} \tag{12}$$

This oscillates at the frequency $k_f$ in bins. (The factor $mH$ appears because the $m$th frame starts at sample $mH$.) Everything goes as before and out come the frequencies:

$$\ldots,\; k_0 + k_f - N/H,\; k_0 + k_f,\; k_0 + k_f + N/H,\; \ldots \tag{13}$$

For the result not to alias, we must limit the frequency $k_f$ so that

$$|k_f| \le \frac{N}{H} - c_a - c_s \tag{14}$$

Using Hann windows with an overlap of four does not allow any time variation bandwidth at all, the Convolution Brothers' well-known practices notwithstanding.
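As a rough numerical check of the figures quoted above, the sketch below evaluates the response of an unsqueezed Hann window at fractional bin offsets by a direct DFT sum, doubles the dB value to account for the analysis and resynthesis stages, and computes the bandwidth left over for gain variation, read here as the overlap minus the two window bandwidths of 2/p bins each. The function names, the choice N = 1024, and the assumption that both stages use the same window are illustrative, not taken from the paper.

```python
import cmath
import math

def hann(N):
    """Unsqueezed Hann window of equation (4)."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / N)) for n in range(N)]

def window_gain_db(window, k):
    """Gain in dB of one windowing stage at k bins off center, relative to k = 0."""
    N = len(window)
    # Direct DFT sum, so that k may be fractional.
    W = sum(w * cmath.exp(-2j * math.pi * n * k / N)
            for n, w in enumerate(window))
    return 20.0 * math.log10(abs(W) / sum(window))

def filter_gain_db(k, N=1024):
    """Analysis times resynthesis response; with identical windows the dB values add."""
    return 2.0 * window_gain_db(hann(N), k)

def max_gain_bandwidth(overlap, p_a=1.0, p_s=1.0):
    """Bandwidth (in bins) left for k_f: overlap minus 2/p_a minus 2/p_s."""
    return overlap - 2.0 / p_a - 2.0 / p_s

print(round(filter_gain_db(1.0), 2))   # about -12 dB one bin off center
print(round(filter_gain_db(0.5), 2))   # about -2.85 dB half a bin off center
print(max_gain_bandwidth(4), max_gain_bandwidth(8))
```

Under these assumptions an overlap of four leaves zero bins for time variation of the gains, while an overlap of eight leaves four.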

As before, the transfer function is the product of the two window functions, but the resynthesis window function acts at the aliased frequency; the frequency response is equal to $W_a(k_0) \cdot W_s(k_0 + k_f)$. If we wish, therefore, for the frequency response to behave "properly", that is, as a function of $k_0$ alone, we should squeeze the resynthesis window so that its larger bandwidth makes the frequency response less dependent on $k_f$. This can be done only at the expense of raising the minimum attainable bandwidth.

5 The timbre stamp

Figure 1 shows an overall block diagram for the timbre stamp. The three operations at left are the analysis/resynthesis chain of Section 2, with the input now renamed "FILTER INPUT" to distinguish it from a new, second input that alters the filter. The filter input passes first through a windowed short-time Fourier transform (WSTFT), whose outputs are complex-valued. These are multiplied by a real-valued gain (i.e., their magnitudes are changed but their phases maintained). Then the output is computed using a windowed short-time inverse Fourier transform (WSTIFT).

[Figure 1: FFT vocoder (timbre stamp) block diagram, with inputs labeled FILTER INPUT and CONTROL INPUT.]

The gain is a function of the magnitudes of two spectra: that of the original input and that of a second, "control" input. In the simplest procedure we would simply compute the ratio of the control amplitude to the original amplitude (individually for each bin) so that the gain multiplication replaces the original amplitude with the new one; but there are many possible refinements, as discussed below.

In light of the previous discussion of the allowable bandwidth of the filter coefficients, we can now make preliminary bounds on overlap and squeeze factors. We'll continue to assume squeezed Hann windows so that the windowing bandwidth is $2/p$ bins. If we consider the gain computation as being approximated by a polynomial function of the two spectra (the complex amplitudes and their conjugates, say, so that the square magnitude is of degree two), then terms of degree $n$ will yield at most frequencies of $2n/p$, where $p$ is the minimum (i.e., worst case) of the squeeze factors of the two analysis windows. To control terms up to degree $d$, we must choose an overlap factor $N/H$ of at least

$$\frac{2d}{p} + \frac{2}{p_a} + \frac{2}{p_s} \tag{15}$$

or, for unsqueezed windows, $2d + 4$. An overlap of eight will cover us up to quadratic terms.

6 Computing suitable gain functions

Figure 2 shows a block diagram for computing an appropriate gain for the timbre stamp, including several possible variations that are useful at times. The main idea is simply to divide the two spectra bin by bin, returning the quotient in linear amplitude units.

The two inputs are assumed to be in units of power (squared amplitude). The operations labeled "convolve" and "squelch", and the division, may be carried out in those units. The next operation ("depth" control) is best carried out in so-called Sones (Rossing, Moore, and Wheeler 2002, p. 108), which we here approximate as the square root of amplitude (fourth root of power). Finally, if needed, a low-pass filter may be added to control foldover; it should be applied to the gain expressed in linear amplitude units.

The first, "convolve" operation in effect averages neighboring power measurements in order to prevent peaks arising from the filter input from falling between neighboring, relevant peaks in the control signal (Penrose 2001). This may also help in averaging out interference patterns between peaks of the incoming signals.

At frequency bands in which the filter input (the signal being filtered) has a very low level, it might give unfortunate results to divide by its power spectrum. For this reason it is usually wise to put some sort of limit on the gain that will be applied when filtering it. There are two places in the chain where this might be done. The most logical-sounding spot is after computing the gain as a quotient of the two power spectra.
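The steps described so far (averaging neighboring power measurements, dividing bin by bin, and limiting the resulting gain) can be sketched as follows. This is only an illustration under stated assumptions: the function names `smooth` and `stamp_gain`, the three-point kernel, the zero treatment of out-of-range bins, the handling of exactly zero filter power, and the default limit of 10 are all hypothetical choices, not taken from the paper.

```python
import math

def smooth(power, kernel):
    """Average neighboring bins with a short, odd-length symmetric kernel.

    Bins beyond either end of the spectrum are treated as zero.
    """
    half = len(kernel) // 2
    out = []
    for k in range(len(power)):
        acc = 0.0
        for j, c in enumerate(kernel):
            idx = k + j - half
            if 0 <= idx < len(power):
                acc += c * power[idx]
        out.append(acc)
    return out

def stamp_gain(filter_power, control_power, max_gain=10.0,
               kernel=(0.25, 0.5, 0.25)):
    """Per-bin gain in linear amplitude units from two power spectra."""
    fp = smooth(filter_power, kernel)
    cp = smooth(control_power, kernel)
    gains = []
    for f, c in zip(fp, cp):
        if f > 0.0:
            g = math.sqrt(c / f)      # power ratio -> amplitude ratio
        else:
            # Assumed convention for a silent filter bin.
            g = max_gain if c > 0.0 else 0.0
        gains.append(min(g, max_gain))  # limit the quotient
    return gains

# Usage: flat spectra give a flat gain; a nearly silent filter bin is limited.
print(stamp_gain([1.0] * 4, [4.0] * 4, kernel=(1.0,)))
print(stamp_gain([1e-12, 1.0], [1.0, 1.0], max_gain=10.0, kernel=(1.0,)))
```

Passing `kernel=(1.0,)` disables the averaging, which makes the effect of the limit easy to see in isolation.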
The limit applied to the quotient appears as "max gain" in the block diagram. Gains greater than a fixed threshold are simply limited to that threshold.

An alternative viewpoint is to regard the filter as having two stages, the first in which the filter input is "whitened" by dividing by its own amplitude (so that the resulting spectrum has equal energy at all frequencies), and then applying the spectrum of the control signal as a further stage. It often yields good results to limit the gain of the "whitening" stage instead of limiting the quotient. This control appears as "squelch" in the block diagram. Squelching effectively sets a minimum strength below which the filter input is considered silent, by limiting it from below before dividing by it. It is often useful to set squelch to decrease as a function of frequency. (All these controls may vary with time and/or frequency as desired.)

Another possible control is the "depth" of the effect. If we consider the identity filter (with unit gain) as one extreme, and fully applying the timbre stamp as the other extreme, then a continuum of mixtures is available between the two. Crossfading between the two is best done in units of Sones. One can even choose "depth" values outside the range from zero to one to generate deeper than 100% filtering, or to filter the original input "away" from the timbre of the control input.

It is possible to morph one sound into another using two timbre stamps applied in opposing directions, with one "depth" ramped from 0 to 1 and the other from 1 to 0. One then crossfades from the first timbre stamp to the second one over a suitably chosen sub-interval of the ramping period.

Finally, either to control foldover or as an effect in its own right, one can low-pass filter the filter gains. This can be brought about naturally by increasing the analysis window size (or making the squeeze factor of the control input analysis greater than that of the filter input), but this also would have the effect of narrowing the analysis bandwidth. If a higher bandwidth is desired, one can return to a smaller window size and, in compensation, low-pass filter the filter gains. This is an alternative to the strategy of convolving a suitable kernel into the power spectra at the top of the diagram; each has its own advantages and drawbacks.

[Figure 2: Computing the per-bin filter gain. Labels recoverable from the diagram: FILTER POWER SPECTRUM, CONTROL POWER SPECTRUM, (convert to sones), depth, (convert to linear), max gain, OUT.]

References

Allen, J. B. (1977). Short term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 25(3), 235-238.

Kim-Boyle, D. (2004). Spectral delays with frequency domain processing.

Laroche, J. and M. Dolson (1999). New phase-vocoder techniques for real-time pitch shifting. Journal of the Audio Engineering Society 47(11), 928-936.

Penrose, C. (2001). Frequency shaping of audio signals. In Proceedings of the International Computer Music Conference, Ann Arbor, pp. 334-337. International Computer Music Association.

Puckette, M. S. (2007). The Theory and Technique of Electronic Music. Singapore: World Scientific Press.

Rossing, T. D., F. R. Moore, and P. A. Wheeler (2002). The Science of Sound (Third ed.). San Francisco: Addison Wesley.

Settel, J. Z. and A. C. Lippe (1998). Real-time frequency-domain digital signal processing on the desktop. In Proceedings of the International Computer Music Conference, Ann Arbor, pp. 142-149. International Computer Music Association.

Torchia, R. H. and A. C. Lippe (2003). Techniques for multichannel real-time spatial distribution using frequency-domain processing. In Proceedings of the International Computer Music Conference, Ann Arbor, pp. 41-44. International Computer Music Association.