Extending Musical Mixing: Adaptive Composite Signal Processing

Christopher Penrose
Keio University SFC, Faculty of Environmental Information
5322 Endo, Fujisawa, Kanagawa 252-0816, Japan
penrose@sfc.keio.ac.jp

1 Overview

It can be useful for an experimenting artist to study the art and science of unfamiliar media. For example, the field of image processing has revealed many useful combinatorial tools for graphic artists. By comparing these image manipulation tools with the available tools for sound exploration, and particularly the combinatorial processes they use, one can easily discover new metaphors and possibilities for sound manipulation.

We are all numbingly familiar with the standard, or even "classical", tools that have emerged for digital music making. But let's reconsider the obvious: what is happening when a standard software sound mixer combines two sounds? When two sounds are placed at overlapping regions of time in a typical mixing application, the standard combinatorial process is transparency: the two sounds and their amplitudes are simply added together, causing the two signals, and their constituent frequencies, to coexist in time. However, when two images are placed in overlapping regions of space in a typical image processing application, the standard combinatorial process is opaque addition: at points of intersection the newcomer image replaces the image magnitudes of the oldcomer. These two processes are very different and reveal contrasting assumptions about sound and image as artistic media. Yet looking a little closer at image processing applications, we find that other combinatorial processes are available for the aggregation of images, such as transparent compositing and a host of other compositing tools. With typical commercial audio applications, similar tools are not yet common, nor even understood as possible or desirable.
I have sought to explore this void, as the audio medium and our perception of it are not as divergent from the visual medium as one may think. Opaque combination and other visual metaphors can contribute to the creation of powerful musical signal processing tools.

ICMC Proceedings 1999

2 Analysis Framework

The tools that I wish to discuss are all implemented using an overlapped short-time Fourier transform (STFT). Perhaps you are painfully familiar with this equation [1] and some of its ramifications; if not, feel free to consult the bibliography. For $k = 0, 1, \ldots, N-1$, where $\omega = 2\pi k/N$:

$$F(k) = \sum_{n=0}^{N-1} W(n)\, I(n)\, e^{-j\omega n} \tag{1}$$

where $W(n)$ is the analysis window and $I(n)$ is the input signal. We will also use the following form:

$$F(f)(k) = F(k) \tag{2}$$

where $f$ is the spectral frame index of the current spectral frame. This provides a nomenclature for describing processes that utilize spectral frames beyond the current analysis frame.

We perform a windowed short-time Fourier transform, as described by equation (1) above, upon overlapped segments of our input signal $I(n)$. After representing our input segment in the frequency domain as a series of complex values, we convert these values into polar form:

$$M(k) = F_{mag}(k) = \sqrt{F_{real}(k)^2 + F_{imag}(k)^2} \tag{3}$$

$$P(k) = F_{phase}(k) = \arctan \frac{F_{imag}(k)}{F_{real}(k)} \tag{4}$$

We utilize a four-quadrant arctangent function (the Unix math library atan2f()), where the range of $P(k)$ is $[-\pi, \pi]$. Utilizing polar form allows us to almost (but not quite, as will be discussed in a future paper) isolate signal phase relationships from their magnitudes. Further, the phase unwrapping technique of the phase vocoder is not necessary for these particular processors.

Each of the processes presented in this paper will be discussed at this stage of the STFT analysis/resynthesis process, where our input signals are decomposed into a polar spectral representation. The processes that we will discuss normally analyze two separate time-domain input signals, $I_1(n)$ and $I_2(n)$, and generate a time-domain output signal $O(n)$ through an inversion of the analysis process. See Allen [1] and Moore [6] for details.

3 ether

ether is an extremely simple compositing processor, both in concept and in implementation. Two signals are combined adaptively: ether compares each analyzed frequency band of both signals and creates a new signal composed only of the components with the greater magnitude of the two. However, this behavior can be inverted, so that only the smaller magnitude of the two frequency bands is resynthesized. The effect often is to merge the sources of perceptibly disparate signals, sounds which normally sound distinct when heard together, into a single sound event. It is conceivable that some composers may actually find such an effect salient and useful.

For each frequency band $k$, with the comparison made on the band magnitudes:

$$F_{output}(k) = \begin{cases} F_1(k), & F_1(k) > F_2(k) \\ F_2(k), & F_1(k) \le F_2(k) \end{cases} \tag{5}$$

or the inverse:

$$F_{output}(k) = \begin{cases} F_1(k), & F_1(k) \le F_2(k) \\ F_2(k), & F_1(k) > F_2(k) \end{cases} \tag{6}$$
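The polar conversion of equations (3) and (4) and the band selection of equations (5) and (6) can be sketched for a single pair of spectral frames. This is only a minimal illustration, not the PVNation implementation: the function names `to_polar` and `ether_frame`, the `keep_largest` flag, and the toy frame values are all assumptions introduced here, and the windowed, overlapped analysis and resynthesis around this step are omitted.

```python
import cmath
import math


def to_polar(frame):
    """Equations (3) and (4): magnitude and four-quadrant phase per band."""
    mags = [math.hypot(c.real, c.imag) for c in frame]
    phases = [math.atan2(c.imag, c.real) for c in frame]
    return mags, phases


def ether_frame(frame1, frame2, keep_largest=True):
    """Apply equation (5), or (6) when keep_largest is False, to one frame.

    frame1 and frame2 are equal-length lists of complex spectral values.
    (Hypothetical helper; the name and signature are not from the paper.)
    """
    if len(frame1) != len(frame2):
        raise ValueError("frames must have the same number of bands")
    m1, p1 = to_polar(frame1)
    m2, p2 = to_polar(frame2)
    out = []
    for k in range(len(frame1)):
        # Equation (5) keeps F1 when M1 > M2; equation (6) when M1 <= M2.
        take_first = (m1[k] > m2[k]) if keep_largest else (m1[k] <= m2[k])
        mag, phase = (m1[k], p1[k]) if take_first else (m2[k], p2[k])
        # Magnitude and phase are taken from the same input band.
        out.append(cmath.rect(mag, phase))
    return out


# Two toy three-band frames (made-up values, not real analysis data).
f1 = [3 + 4j, 1 + 0j, 0 + 2j]
f2 = [1 + 1j, 0 - 6j, 0 + 2j]
loud = ether_frame(f1, f2)                       # greatest magnitudes
quiet = ether_frame(f1, f2, keep_largest=False)  # smallest magnitudes
```

Working on the polar pair rather than on the complex value directly keeps magnitude and phase separate, so that a variant which takes only one of the two from the other input would be a small change to the selection line.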
$F_{output}(k)$ is derived from both $F_1(k)$ and $F_2(k)$ using the dynamic threshold relations (5) and (6). Both phases and amplitudes from $F_2(k)$ are placed in $F_{output}(k)$ when the threshold conditionals dictate, though ether also allows the user to swap only $M_2(k)$ or $P_2(k)$ in isolation, with subtly different effects. ether has been extended into three different signal processors: astral, lcd (lowest common denominator) and shapee.

3.1 astral

astral is essentially identical in design to ether; however, astral selects the frequency band with the greatest (or lowest) magnitude from a pool of $c$ signals, $I_1(n), \ldots, I_c(n)$, rather than from only two as ether does.

3.2 lcd

lcd is a logical extension of astral, utilizing a more general form of band selection. For a given set of frequency domain inputs, $F_1(k), \ldots, F_c(k)$, the magnitude extrema, minimum and maximum, are determined over the values $M_x(k)$, where $x$ is in the range $[1, \ldots, c]$. For each $F_{output}(k)$, the magnitude and phase are selected from the frequency domain input, $F_1(k), \ldots, F_c(k)$, whose magnitude is closest in value to the target magnitude $M_{target}(k)$:

$$M_{target}(k) = M_{min}(k) + \big(M_{max}(k) - M_{min}(k)\big) \cdot \mathit{index} \tag{7}$$

where $M_{min}(k)$ and $M_{max}(k)$ are the magnitude extrema for the inputs $M_1(k), \ldots, M_c(k)$ and index is a value within the range $[0, 1]$. Setting index = 0.5 will select the magnitude and phase of the input whose magnitude is closest to the midpoint between the minimum and maximum magnitudes of the frequency domain inputs. The name, lowest common denominator, was chosen by Jane Dowe before its implementation, for its sociological rather than its mathematical metaphor, as the latter obviously is not specifically relevant to the process.

3.3 shapee

Though shapee will probably be demonstrated at the ICMC, its scope is too vast for the confines of this paper. A future paper will describe its foundation and implementation.

4 burrow

burrow is another simple compositing processor, one that masks an input source signal with another. burrow removes frequency band energy from the source signal where frequency band energy in the filtering signal is greater than a given threshold. burrow again utilizes an overlapped spectral analysis of two frequency domain signals, $F_1(k)$ and $F_2(k)$, to create a new signal, $F_{output}(k)$. For each frequency band in each analyzed spectral frame of $F_2(k)$, if the filter signal energy crosses a threshold, $t$, the corresponding band of $F_1(k)$ is multiplied by a scalar $s$. As an option, the filter threshold can be related to the RMS (root-mean-square) energy of $I_1$. The scalar $s$ has normally been set to $0 < s < 1$ to achieve masking effects; however, it can be set to any value.

$$F_{output}(k) = \begin{cases} F_1(k)\, s, & F_2(k) > t \\ F_1(k), & F_2(k) \le t \end{cases} \tag{8}$$

A further simple extension to burrow allows $F_2(k)$ to be averaged over several frames:

$$A(k) = \frac{F_f(k) + F_{f+1}(k) + \cdots + F_{f+x}(k)}{x} \tag{9}$$

where $f$ is the current frame of the STFT analysis, $F_f = F_2(f)(k)$, and $x$ is the frame count of the averaging. Substituting $A(k)$ for $F_2(k)$ in equation (8) imparts smoother filtering upon the source signal.

5 patina

patina is a spectral compositing processor that attempts a form of feature extraction. The intent of the patina processor is to impart some of the short-time dynamic spectral characteristics of one signal upon another without changing the large-scale spectral character of this destination signal. Its name is a useful analogy: the processor places a patina, or surface texture, upon a sound. Delta magnitude values from successive frames are accumulated from the destination and filter signals:

$$\Delta M_1(k) = M_1(f+1)(k) - M_1(f)(k) \tag{10}$$

$$\Delta M_2(k) = M_2(f+1)(k) - M_2(f)(k) \tag{11}$$

We define $t$ as a user-definable threshold, and $a(x)$ is an attenuation function that reduces the magnitude of $\Delta M_2(k)$ with respect to $\Delta M_1(k)$ before it is added to $M_{output}(k)$:

$$M_{output}(k) = \begin{cases} M_1(k) + \Delta M_2(k), & \dfrac{\Delta M_2(k)}{\Delta M_1(k)} < t \\[2ex] M_1(k) + \Delta M_2(k)\, a(x), & \dfrac{\Delta M_2(k)}{\Delta M_1(k)} \ge t \end{cases} \tag{12}$$

patina can also be implemented in this manner:

$$M_{output}(k) = \begin{cases} M_1(k) + \Delta M_2(k), & \dfrac{\Delta M_2(k)}{M_1(k)} < t \\[2ex] M_1(k) + \Delta M_2(k)\, a(x), & \dfrac{\Delta M_2(k)}{M_1(k)} \ge t \end{cases} \tag{13}$$

where the decision to apply the attenuation function, $a(x)$, is based upon the magnitude $M_1(k)$ rather than $\Delta M_1(k)$.

6 morphine

morphine is a signal processor that performs an adaptive temporal transition, or a "morph", between two input sounds. morphine's predecessors, twerp, turpitude, and turpentine, were initially released to the internet in the summer of 1992. This process exchanges spectral energy from the departure signal $F_1(k)$ to the arrival signal $F_2(k)$. The exchange is based upon the spectral differences between the two signals, where smaller spectral differences are exchanged earlier in the transition than larger differences. With the classical "cross-fade" technique, two signals, one departing and one arriving, are usually perceived as two separate sound events. morphine often succeeds in merging these
two signals into one perceptible event.

First, we set our output $F_{output}(k)$ equal to our departure signal $F_1(k)$ for all $k$:

$$F_{output}(k) = F_1(k) \tag{14}$$

Then, for each $F_1(k)$ and $F_2(k)$, a delta magnitude value, $\Delta M(x)$, is calculated for all $k$:

$$\Delta M(x) = M_2(k) - M_1(k) \tag{15}$$

Each element of the collection, $\Delta M(0), \ldots, \Delta M(N/2)$, is first grouped with the corresponding index $k$, and the collection is then sorted by magnitude, lowest to highest. The value index controls the extent of our transition or morph. index, having a range of $[0, \ldots, N/2]$, is the number of values in $F_{output}$ that are replaced by values from $F_2$. We iterate over $\Delta M(x)$ index times, where $x$ traverses the range $[0, \ldots, \mathit{index}]$ and references the index value $k_i$ that is grouped with $\Delta M(x)$. For each $k_i$ in $\Delta M(x)$:

$$F_{output}(k_i) = F_2(k_i) \tag{16}$$

7 Future Directions

Recently I have developed a new technique for improving the functionality of ether and patina; I name the technique "frequency shaping", as it fits one signal to the harmonic structure of another. It is a useful addition to the spectral compositing suite, and it has inspired another very useful processor, shapee, whose theory and implementation will be detailed in a future paper. At the time of this writing, however, ether has already been fitted with this technique and shapee is operational, so they will be available for demonstration at the ICMC.

Currently these signal processing tools are available as UNIX command-line tools as part of a larger software distribution entitled "PVNation". The current preferred platforms for PVNation are Openstep 4.x and MacOS X Server. Versions for GNU/Linux and Solaris are also available. However, it should not be difficult to use most PVNation software on any flavor of UNIX. PVNation can be obtained at the following address:

http://www.sfc.keio.ac.jp/~penrose/software/

8 References

1. Jont B. Allen, "Short Term Spectral Analysis, Synthesis, and Modification by Discrete Fourier Transform." IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-25(3), pp. 235-238 (1977)
2. Kenneth R. Castleman, Digital Image Processing. Prentice-Hall (1996)
3. R. E. Crochiere, "A Weighted Overlap-Add Method of Short-Time Fourier Analysis/Synthesis." IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28(1), pp. 99-102 (1980)
4. James D. Foley and Andries van Dam, Fundamentals of Interactive Computer Graphics. Addison-Wesley (1982)
5. R. W. Hamming, Digital Filters. Prentice-Hall (1989)
6. F. Richard Moore, Elements of Computer Music. Prentice-Hall (1988)
7. James A. Moorer, "The Use of the Phase Vocoder in Computer Music Applications." Journal of the Audio Engineering Society 24(9), pp. 717-727 (1978)
8. Christopher Penrose, "Practical Signal Processing: Filtering, Interpolating, and Enriching Digital Signals with the Handy Phase Vocoder Algorithm." Proceedings of the International Computer Music Conference and Festival at Delphi, Greece, Center for Contemporary Music Research, Athens, Greece (1992)
9. Christopher Penrose, "Musical Spectra: Fourier Signal Processing and Composition." Keio SFC Review No. 4: Data Science, ISBN4-7664-0737-7, pp. 130-137 (1999)
10. Michael R. Portnoff, "Time-Frequency Representation of Digital Signals and Systems Based on Short-Time Fourier Analysis." IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28(1), pp. 55-69 (1980)
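As a supplementary illustration, the morph of Section 6 (equations 14 to 16) can be sketched for a single pair of spectral frames. This is a rough sketch under stated assumptions rather than the morphine source: the function name `morph_frame` and the toy frame values are hypothetical, the deltas are sorted by absolute value (one reading of "sorted by magnitude"), ties are broken by band number, and exactly `index` bands are replaced.

```python
def morph_frame(frame1, frame2, index):
    """Replace the `index` bands of frame1 that differ least from frame2.

    frame1 is the departure frame and frame2 the arrival frame, both
    equal-length lists of complex spectral values.
    (Hypothetical helper; the name and signature are not from the paper.)
    """
    if len(frame1) != len(frame2):
        raise ValueError("frames must have the same number of bands")
    if not 0 <= index <= len(frame1):
        raise ValueError("index must lie between 0 and the band count")

    # Equation (14): start from the departure signal.
    output = list(frame1)

    # Equation (15): delta magnitude per band, grouped with its band index.
    # Assumption: the absolute difference is what gets sorted.
    deltas = [(abs(abs(frame2[k]) - abs(frame1[k])), k)
              for k in range(len(frame1))]
    deltas.sort()  # lowest to highest; ties fall back to band order

    # Equation (16): exchange the bands with the smallest differences first.
    for _, k in deltas[:index]:
        output[k] = frame2[k]
    return output


# Toy four-band frames (made-up values, not real analysis data).
f1 = [1 + 0j, 4 + 0j, 2 + 0j, 0 + 1j]
f2 = [0 + 5j, 0 + 4.5j, 0 + 2j, 3 + 0j]
halfway = morph_frame(f1, f2, 2)  # bands 2 and 1 now come from f2
```

Sweeping `index` from 0 up to the band count over successive frames moves the output from the departure signal to the arrival signal, with the most similar bands exchanged earliest.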