Manipulation and Resynthesis with Natural Grains

Reynald Hoskinson, Dinesh Pai
Department of Computer Science, University of British Columbia
reynald@cs.ubc.ca, pai@cs.ubc.ca

Abstract

This paper describes a novel method to extract the component parts of an audio signal and play them back in a continuous stream of indeterminate length. A wavelet-based speech segmentation algorithm is used to split the input signal into syllable-like audio segments. For each segment, a table of similarities between it and all the other segments is constructed. The segments are then output in a continuous stream, with each successor chosen from among the segments that best follow from the current one. In this way, we can construct an infinite number of variations on the original signal with a minimum of interaction.

1 Introduction

Natural sounds are an infinite source of material for anyone who composes with audio. Although the source is infinite, there are many instances where one sample has to be used repeatedly. Composers use samples as motifs that reappear again and again over the course of a piece. Acoustic installations often stretch pre-recorded source material over the entire life of the exhibit. Simple repetition is not effective for long, so we often create variations of a sample by manipulating one or more of its properties.

There is a long tradition in the electroacoustic music community of splitting audio samples into portions and manipulating them to create new sounds. Curtis Roads (Roads 1978) and Barry Truax (Truax 1994) pioneered granular synthesis, in which small grains are combined to form complicated sounds. Grains can be constructed from scratch, or obtained by splitting an audio sample into small segments. More recently, Bar-Joseph et al. (Bar-Joseph, Dubnov, El-Yaniv, Lischinski, and Werman 1999) proposed a variant of granular synthesis using wavelets, in which the separation and recombination of grains is done in the time-frequency representation of an audio sample. Similar work is also being done on images to produce variations of tiles or textures (Wei and Levoy 2000; Schodl, Szeliski, Salesin, and Essa 2000).

When what is desired is simply a variation that still bears a strong resemblance to the original, the audio techniques above have critical problems. Granular synthesis is a technique for creating new sounds, not recognizable variations of the original, except in a very abstract sense. A long audio sample is not even required; it suffices to specify the shape of the grain and its envelope. When an audio sample is used, a grain is an arbitrary slice chosen independently of the sound's inherent structure. To better preserve the original structure of the sound, Bar-Joseph et al. use a comparison step in which wavelet coefficients representing parts of the sample are swapped only when they are similar. The authors employ "statistical learning", which produces a different sound that is statistically similar to the original. In their algorithm, only the local neighbours in the multiresolution tree are taken into account when calculating similarity, and the swapping is very fine-grained. This means that large-scale changes over time are not captured; on almost any signal, this results in a "chattering" effect.

To address the problems with the above methods, we have developed an algorithm for segmenting sound samples that focuses on determining natural transition points.
These are the points in the sample where events of some kind start or end. The sound in between these transition points is considered atomic and not broken up any further. We call these segments "natural grains." Once we have the segments, creating new sounds becomes a problem of how best to string them together. We do this by constructing a Markov chain in which each state corresponds to a segment. The transition probabilities out of a state are estimated from the smoothness of the transition between its segment and each of the other segments. The segments are thus output in a continuous stream, with the next segment chosen at random from among those segments which best follow from the current one. In this way, we can construct an infinite number of variations on the original signal with a minimum amount of user input.

2 Segmentation Algorithm

To properly divide the sample into natural grains, we need to understand a little of the physiological perception of sonic events. What do we perceive as a unified event, and what are the clues that our brains pick up to distinguish where one event ends and the next begins?

2.1 Background

Casey (1998) defines three components of sound structure:

1. Fourier persistence: the cochlear mechanics of the ear are sensitive to changes on the order of 50 ms and shorter. Such micro-temporal change is represented as an approximately static quality in log-frequency space for rates of change in air pressure greater than 20 Hz (1/0.050 s), the frequency perception threshold of the cochlear mechanism.

2. Short-time change: changes occurring at rates of less than 20 Hz, and continuous as a function of the underlying Fourier-time components, are perceived as short-time change in an otherwise static frequency spectrum. In our segmentation algorithm, we would like to preserve these changes in their integral form, splitting them up as little as possible.

3. High-level change: larger changes which are not perceived as small and continuous from the perspective of short-time perception. These are the kinds of changes we would like to split between, to allow for more combinations of high-level changes in our resulting output streams.

For example, when a glass is smashed, there is a Fourier persistence that is characteristic of glass sounds (its spectral features), a short-time change that reflects the impact and damping of each individual particle, and a high-level change structure that reflects the scattering of particles in the time-frequency plane.

To distinguish the desired high-level structures from the other two types, we have to look at the rate of change of the frequencies in the sample, and also the relative magnitude of the difference. Changes occurring at rates less than 20 Hz, but which are small and continuous in terms of the underlying components, should be considered part of an integral sound structure that should not be segmented. The application uses samples recorded at 44.1 kHz, so our grains should be larger than 2205 samples. However, we are not only interested in determining whether there has been a change of event; we are also interested in pinpointing the best place to segment this event from the next. Therefore, we make our windows 1024 samples wide, with an overlap of 256, but also impose the rule that no two consecutive windows can both be segment points. If such a situation does arise, we take the better of the two.

2.2 Algorithm

Our segmentation method is inspired by work by Alani and Deriche (1999) on segmenting speech into phonemes. A wavelet transform is performed on each 1024-sample window. Wavelets are used because their coefficients have the desirable property of accentuating points in the original signal where there are frequency changes. The energies of each of the first six levels of detail (difference) coefficients are calculated for each window. In this case, energy is a metric which quantifies the match between the wavelet basis and a frequency band of the signal. The bands are organized in octaves, with each higher-frequency octave spanning twice the frequency range of the previous one. These metrics are then normalized so that we focus on the differences in the relative strength of the bands between windows.
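As a concrete illustration of this step, the following is a minimal Python sketch (the system described later in this paper is implemented in Java). It computes the normalized detail-band energies for each analysis window; the window size and overlap follow the text above, while the choice of wavelet family (Daubechies-2) and the exact normalization are assumptions, since the paper does not specify them.

    # Illustrative sketch only, not the authors' implementation: per-window
    # normalized wavelet band energies. Window size (1024) and overlap (256)
    # follow the paper; the wavelet family and normalization are assumptions.
    import numpy as np
    import pywt

    def band_energies(signal, window=1024, overlap=256, wavelet="db2", levels=6):
        """Return one row of normalized detail-band energies per analysis window."""
        hop = window - overlap
        rows = []
        for start in range(0, len(signal) - window + 1, hop):
            frame = signal[start:start + window]
            # wavedec returns [approximation, detail_level_6, ..., detail_level_1];
            # we keep only the six detail (difference) bands.
            details = pywt.wavedec(frame, wavelet, level=levels)[1:]
            e = np.array([np.sum(c ** 2) for c in details])      # energy per octave band
            rows.append(e / (np.sum(e) + 1e-12))                  # normalize within the window
        return np.array(rows)                                      # shape: (num_windows, levels)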
The next step is to segment the signal based on the differences in energy metrics between consecutive windows. A Euclidean distance function computed over four frames is used. As an example, the strength of the transition between windows 2 and 3 is calculated as

D(f_2, f_3) = \frac{1}{4} \sum_{i=1}^{2} \sum_{j=3}^{4} \sum_{k=1}^{6} (X_{i,k} - X_{j,k})^2

where X_{i,k} is the normalized energy of band k in window i.

Figure 1: Distance measure for the transition between frames 2 and 3

Alani and Deriche (1999) use this algorithm to isolate phonemes, which are then fed into a separate speech-recognition system. Isolating phonemes, however, is quite a different task from trying to isolate grains of a sound that are as independent from each other as possible. Human speech phonemes usually meld into each other, making isolation an even more difficult task, one that can really only be achieved using context-sensitive information. Taking this into account, the authors pick the points where the distance measure is highest, reasoning that this is where the speaker is moving from one phoneme to another.

We are not trying to "understand" speech, however, and isolating phonemes is not critical to our application. Rather, we would like to segment at the granularity of a syllable, where transitions are much more pronounced and the boundaries more amenable to shuffling. We would like to confine ourselves as much as possible to high-level sound changes. So instead of looking for the points with the greatest difference between frames, we look for those with the least difference. These points are more likely to lie in the troughs of the signal between events, rather than in the middle or in the attack portion of an event.
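Continuing the sketch above, the boundary-selection step might look as follows. The four-frame distance mirrors the formula of Figure 1, and the rule of keeping only the better of two consecutive candidate windows follows the text; the use of simple local minima of the distance curve, and any tie-breaking details, are assumptions.

    # Continuation of the earlier sketch: compute the four-frame distance at each
    # window boundary and pick natural-grain boundaries at its local minima.
    # Local-minimum selection and tie-breaking are assumptions, not from the paper.
    import numpy as np

    def transition_points(E):
        """E is the (num_windows x 6) energy matrix from band_energies()."""
        bounds = range(1, len(E) - 2)          # boundaries with two windows on each side
        d = np.array([np.mean([np.sum((l - r) ** 2)
                               for l in E[b - 1:b + 1]      # frames b-1, b
                               for r in E[b + 1:b + 3]])    # frames b+1, b+2
                      for b in bounds])
        points = []
        for i in range(1, len(d) - 1):
            if d[i] <= d[i - 1] and d[i] <= d[i + 1]:       # a trough between events
                b = i + 1                                    # boundary between windows b and b+1
                if points and b - points[-1] == 1:           # no two consecutive windows...
                    if d[i] < d[points[-1] - 1]:
                        points[-1] = b                       # ...keep the better of the two
                else:
                    points.append(b)
        return points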

Figure 2: Segmented Waveform and Interface

Figure 3: Input Waveform

3 Grading the Transitions

The segments derived from the above approach are bounded by the locations in the signal where it changes least abruptly, and the degree of change is given by the result of the above distance measure. Our final aim is to re-create randomized versions of the signal that retain as many of the original characteristics as possible. The next task, then, is to determine which of the segments flow most naturally from any given segment.

To enumerate the most natural transitions, the segments are compared against each other and graded on their similarity. This is done in much the same way as the original segmentation described above. To grade the transition between segments A and B, the last two windows of A and the first two windows of B are fed into the four-frame Euclidean distance measure. The lower the result, the smoother the transition between the two segments.

4 Resynthesis

The similarity metric between two segments allows us to construct the transition probabilities between segments as follows. Since it is the lower scores which denote smoother transitions, we take their inverse to orient the weights in their favour. Let P_{ij} = 1/D(i, j) indicate the likelihood that segment i is followed by segment j. We convert this to a probability p_{ij} by normalizing as follows:

p_{ij} = \frac{P_{ij} + C}{\sum_{k=1}^{n} P_{ik} + nC}

where n is the number of segments. C denotes the constant noise we add to the distribution to give transitions with smaller similarities more of a chance to be selected. This value can be changed interactively to alter the randomness of the output signal.

Once we have the transition probabilities, resynthesis is as simple as choosing the segment to be played next by random sampling from the distribution p_{ij}. In this way, the smoother the transition between the current segment and a candidate, the higher the probability that the candidate is chosen as the successor. A high noise value C flattens this preference, but never eliminates it.
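A rough sketch of this resynthesis step (again illustrative Python, not the authors' Java implementation) could be written as follows; the exclusion of self-transitions and the choice of the initial segment are assumptions not specified in the paper.

    # Illustrative sketch of resynthesis: normalize inverse distances into
    # transition probabilities (with noise constant C) and random-walk over
    # the segments.
    import numpy as np

    def transition_matrix(D, C=0.0):
        """D[i, j] = four-frame distance between the end of segment i and the start of j."""
        P = 1.0 / (D + 1e-12)                  # lower distance -> higher likelihood
        np.fill_diagonal(P, 0.0)               # assumption: do not repeat the same segment
        n = P.shape[0]
        return (P + C) / (P.sum(axis=1, keepdims=True) + n * C)

    def resynthesize(segments, D, num_grains, C=0.0, seed=None):
        """Concatenate num_grains segments chosen by the Markov chain."""
        rng = np.random.default_rng(seed)
        p = transition_matrix(D, C)
        i = rng.integers(len(segments))         # assumption: random starting segment
        out = []
        for _ in range(num_grains):
            out.append(segments[i])
            i = rng.choice(len(segments), p=p[i])   # sample the successor segment
        return np.concatenate(out)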
Figure 4: Portion of Output Waveform

5 Performance

Our resynthesis system is implemented in Java, and features a graphical interface to facilitate real-time interaction. By manipulating sliders, users can change the degree of granularity of the segments, and also the degree of randomness in the segment selection process. All of this can be done in real time with no signal interruptions.

To facilitate the construction of larger-scale sound ecologies, a higher-level interface is also available. For each sample, there are controls for pan and gain. Because not every sound should be played continuously, there is also a trigger variable, which controls how often the sound should be played; for instance, a value of 10 means the sound is activated every 10 seconds. A separate duration variable controls how long each of these higher-level segments lasts. Trigger and duration both have associated values for controlling random variability: each variable and its associated random element are used as the mean and standard deviation of a normally distributed random number generator. The pan and gain envelopes affect each of these higher-level segments individually.
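A minimal sketch of how such jittered scheduling could be realized is shown below; the control names and default values are illustrative, not those of the actual interface.

    # Illustrative sketch of the higher-level scheduling described above: trigger
    # interval and duration are drawn from normal distributions parameterized by
    # a mean and a random-variability (standard deviation) control per sample.
    import numpy as np

    def schedule_events(total_time, trigger=10.0, trigger_var=2.0,
                        duration=5.0, duration_var=1.0, seed=None):
        """Return (start_time, length) pairs for one sample over total_time seconds."""
        rng = np.random.default_rng(seed)
        events, t = [], 0.0
        while t < total_time:
            length = max(0.0, rng.normal(duration, duration_var))
            events.append((t, length))
            t += max(0.1, rng.normal(trigger, trigger_var))   # time until the next activation
        return events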

Figure 5: Interface for Controlling Multiple Samples

To test the system, we used samples taken from the Vancouver Soundscape (Truax 1999), among other environmental recordings. This provided us with a wealth of real-world samples, both pitched and unpitched, which were ideal for this algorithm.

The program can also be used for purposes other than sound extension. Interesting effects can be achieved by setting the granularity very fine, in which case we obtain a sparse form of granular synthesis. Intriguing combinations and textures can also be generated by using a sample in which many heterogeneous sound sources are present. The algorithm mixes them in unexpected and interesting ways, often in short bursts that are connected seamlessly with other short bursts from elsewhere in the signal to create new macroscopic structures.

6 Future Work

We are currently experimenting with several extensions to the dynamic scrambling algorithm:

* We would like not only to extract components in the time domain, but also to split the signal into multiple parts per time chunk. Independent Component Analysis (ICA) (Casey 1998) could be a method of separating multiple signals from one source. For instance, in a sample containing both traffic noise and birds singing, we could extract the bird sound from the background and resynthesize the two separately.

* Modifying the underlying signal through manipulation of the time-frequency domain can produce desirable and predictable changes in the perceived objects involved in the sound production. For instance, Miner and Caudell (1997) have developed methods for altering the wavelet coefficients in the representation of a rain sample to change the perceived surface the rain is falling on, the perceived size of the drops, or their density. Since our algorithm already maintains a version of the signal in the wavelet domain, this would be relatively efficient to implement.

* Rhythmic structure is currently not preserved, because the current algorithm examines only the local neighbours for compatibility. A higher-order Markov chain could be used to impart more temporal structure to the resulting sound.

This algorithm and its proposed extensions all aim to give the user an intuitive interface to sound design. We believe it is not enough to provide novel physical or graphical interfaces to sound synthesis engines; what is often needed, and rarely found in practice, are interfaces that reflect the underlying physical properties of a sound. When dealing with samples from the real world, as we have done, this involves developing methods for signal understanding, and manipulation techniques that preserve important aural properties.

References

Alani, A. and M. Deriche (1999). A novel approach to speech segmentation using the wavelet transform. In Fifth International Symposium on Signal Processing and Its Applications.

Bar-Joseph, Z., S. Dubnov, R. El-Yaniv, D. Lischinski, and M. Werman (1999). Granular synthesis of sound textures using statistical learning. In Proceedings of the International Computer Music Conference.

Casey, M. A. (1998). Auditory Group Theory with Applications to Statistical Basis Methods for Structured Audio. Ph.D. thesis, Massachusetts Institute of Technology Media Laboratory.

Miner, N. E. and T. P. Caudell (1997). Using wavelets to synthesize stochastic-based sounds for immersive virtual environments. In Proceedings of the International Conference on Auditory Display.

Roads, C. (1978). Automated granular synthesis of sound. Computer Music Journal 2(2), 61-62.
Schodl, A., R. Szeliski, D. H. Salesin, and I. Essa (2000). Video textures. In K. Akeley (Ed.), Proceedings of SIGGRAPH. ACM Press / ACM SIGGRAPH / Addison Wesley Longman.

Truax, B. (1994). Discovering inner complexity: Time shifting and transposition with a real-time granulation technique. Computer Music Journal 18(2), 38-48.

Truax, B. (1999). Handbook for Acoustic Ecology. ARC Publications, 1978; CD-ROM version, Cambridge Street Publishing, 1999.

Wei, L.-Y. and M. Levoy (2000). Fast texture synthesis using tree-structured vector quantization. In K. Akeley (Ed.), Proceedings of SIGGRAPH, pp. 479-488. ACM Press / ACM SIGGRAPH / Addison Wesley Longman.