FAST SOUND TEXTURE SYNTHESIS USING OVERLAP-ADD

Martin Fröjd
Chalmers University of Technology
Department of Computer Science and Engineering

Andrew Horner
Hong Kong University of Science and Technology
Department of Computer Science and Engineering

ABSTRACT

Sound texture synthesis aims to extend the length of sound recordings without introducing audible artifacts, a concept inspired by graphic texture synthesis. This paper describes an overlap-add method for sound texture synthesis that performs significantly better than earlier, more complex methods. The results also suggest that larger scales of granularity than those previously used are more effective. The approach allows real-time synthesis of sound textures, which is useful in applications such as computer games and sound editors.

1. INTRODUCTION

In computer graphics, a texture is a pattern or image that can be repeated (or stretched) to cover a larger area. However, humans can often easily distinguish repeating patterns. To overcome this problem, graphic texture synthesis aims to create large images that do not seem repetitive - in other words, images that look natural to the human eye. Sound textures share many of these properties, but also differ in important ways; in particular, a sound texture is a one-dimensional structure, whereas graphic textures are two-dimensional.

The basic idea of sound texture synthesis is to produce sound of any given length that exhibits the same qualities as the source sound. For example, given a 10-second recording of blowing wind, sound texture synthesis can produce several minutes of sound with the same natural wind-like qualities. Unlike other techniques such as time-stretching and re-scaling, sound texture synthesis should extend the sound without changing its character. Sound textures have a number of useful applications, including sound editing, interactive environments, computer games, and compression.

Applications such as sound tracks for films and advertisements need background environmental sounds of a specified length. If the source sound is too long, it can easily be truncated or faded out, but it is not as easily extended. Sound texture synthesis offers a sound editing tool for solving such problems, but the quality of the resulting sound must be very good, without any distracting artifacts.

Interactive environments, such as computer games, are another ideal application for sound texture synthesis. In an interactive environment, the length needed for a sound cannot be known in advance; in a computer game, for example, the user might stay in the same environment for several minutes. The standard solution is to loop the sound, and computer game sound designers typically prepare sounds for seamless looping. However, sound textures offer a potentially more dynamic solution, avoiding simple repetition, which can become boring and even annoying. For these real-time applications, the sound texture quality has to be good because artifacts cannot be hidden through post-processing.

There is no established formal definition of a sound texture, and previous work has used different and sometimes conflicting definitions [6]. In this study we do not attempt to strictly define sound textures. Most previous attempts at defining them are very similar, although with different emphases. Examples of sound textures include sounds produced by the natural environment, fire, machinery, and large groups of animals [1,6]. Our approach does not require a strict definition of sound textures - it is general enough to work under varying definitions. Nevertheless, our approach requires sound textures to exhibit some self-similarity.
It also assumes that human, animal, and machinery sounds consist of many "voices" rather than a few (e.g., a whole nursery of crying babies, not just a single crying baby).

1.1. Previous work

Several researchers have investigated sound texture synthesis. Dubnov et al. used a granular synthesis approach based on wavelet tree learning, recombining small grains to synthesize sound textures. Their work was derived from 2D texture synthesis, a technique from computer graphics. Their method had problems coping with longer sound events and rhythmic sounds. Moreover, the generated sound textures had some "stutter," probably due to the small time scale used for their grains or possibly inadequate interpolation [2].

Lu et al. later used a slightly larger time scale, analyzing short frames using a mel-frequency cepstral coefficient (MFCC) approach borrowed from speech recognition. Their analysis split the sound into sub-clips, which could then be rearranged to change and extend the sound, producing a sound texture. They achieved some impressive results, especially when restoring recordings with silent sections [4].

The work most closely related to our study is that of Parker and Behm. They called their method "tiling and
stitching," which was inspired by quilted 2D textures. They tiled source blocks together to form textures, using 15% overlaps for block transitions. They also introduced a technique called chaos mosaic to iteratively randomize initially ordered sequences of blocks. However, their sound examples exhibited clearly noticeable seams between the blocks [5].

Zhu and Wyse used time- and frequency-domain linear predictive coding to synthesize sound textures, producing textures "quite similar" to the originals [7]. For further details on previous work, see Strobl et al.'s thorough survey of sound texture modeling [6]. All of this previous work has relied on relatively sophisticated analysis stages, often based on methods borrowed from other fields such as computer graphics and speech processing.

2. AN OVERLAP-ADD METHOD

This section describes a new method for sound texture synthesis based on an overlap-add approach. The method assumes that the source clip exhibits self-similarity [4]; by overlap-adding excerpts of the sound clip to itself, we create a different and longer texture with the characteristics of the original.

The original clip is used as a stochastic source from which blocks are extracted at random locations. The extracted blocks are then overlap-added to the synthesized texture, and this process is repeated until the desired length is reached. The overlap-add procedure crossfades pairs of blocks so that as one block fades out, the next fades in. When a block completely fades out, another immediately replaces it and begins fading in. The blocks are overlapped by 50% of their duration, so that two blocks are always crossfading (except for the first and last blocks). For the crossfade, simple linear envelopes will work, but to achieve a smoother effect we use a sine envelope, which keeps the power of the synthesized texture relatively constant. Other envelopes, such as Gaussian or Hann windows, are also possible.
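The core synthesis loop described above can be sketched as follows. This is a minimal illustration, assuming mono floating-point audio and NumPy; the function and parameter names are our own, and the first/last-block copying and the refinements of Sec. 2.1 are omitted.

```python
import numpy as np

def synthesize_texture(source, out_len, block_len, rng=None):
    """Overlap-add randomly chosen blocks of `source` into a texture
    of `out_len` samples, using 50% overlaps and a sine envelope."""
    rng = np.random.default_rng(rng)
    hop = block_len // 2                      # 50% overlap
    # Sine envelope: fades in over the first half of the block and out
    # over the second, keeping the summed power relatively constant.
    env = np.sin(np.pi * np.arange(block_len) / block_len)
    out = np.zeros(out_len + block_len)       # headroom for the final block
    pos = 0
    while pos < out_len:
        start = rng.integers(0, len(source) - block_len)
        out[pos:pos + block_len] += env * source[start:start + block_len]
        pos += hop                            # next block starts mid-fade
    return out[:out_len]
```

With a 10-second source at 44.1 kHz and 1-second blocks, every point in the output (away from the ends) mixes exactly two crossfading blocks, as described above.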
2.1. Improvements

While the basic overlap-add algorithm performed reasonably well in our preliminary tests, we made some modifications to make the synthesized textures more dynamic. With fixed block lengths, the seams between blocks appear at regular intervals, which can produce easily recognized periodic artifacts. To avoid this problem and achieve more dynamic results, we randomized the individual block lengths around the desired average block length. This avoids a regular pattern of seams in the synthesized texture.

However, when creating long synthesized textures from short sources, noticeable repetition can still be a problem even with randomized block lengths. To lessen this problem, filtering and other effects, such as slight time expansion or compression, can be applied to each block to make the blocks seem slightly different. For example, these effects can make a bird call sound like several birds of the same type calling from different locations with slightly different voices [4]. Effects can generally make the sound more dynamic: sounds will sometimes be close by, sometimes far away, sometimes moving, sometimes stationary. Our current implementation includes the simplest of these effects, multiplying each block by a random amplitude scalar so that some blocks are more prominent than others.

Moreover, many sounds have particular transients at their beginning and end, such as a characteristic fade-in or fade-out. To capture this, we simply copy the first and last blocks from the source, so that synthesized textures always begin and end just like their sources. In our current implementation, the start and end blocks are set to one second each. Finally, to avoid echo when two very similar blocks are placed near each other in the texture, we only pick blocks that are at least a prescribed minimum distance from the previous block.
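These refinements amount to constraints on how blocks are chosen, which can be sketched separately from the mixing loop. The ±50% length jitter, the gain range, and all names below are our assumptions for illustration, not values from the paper.

```python
import numpy as np

def pick_blocks(src_len, avg_len, n_blocks, min_dist, rng=None):
    """Choose (start, length, gain) triples for the overlap-add blocks."""
    rng = np.random.default_rng(rng)
    picks, prev_start = [], None
    for _ in range(n_blocks):
        # Randomize each block length around the desired average.
        length = int(avg_len * rng.uniform(0.5, 1.5))
        while True:
            start = int(rng.integers(0, src_len - length))
            # Reject starts too close to the previous pick (avoids echo
            # between near-identical neighboring blocks).
            if prev_start is None or abs(start - prev_start) >= min_dist:
                break
        gain = rng.uniform(0.5, 1.0)  # some blocks more prominent than others
        picks.append((start, length, gain))
        prev_start = start
    return picks
```

Randomizing the lengths breaks up the regular spacing of seams, while the minimum-distance rejection keeps successive blocks from sampling nearly the same region of the source.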
2.2. Limitations

In our tests we found that the overlap-add approach worked best for composite sources, that is, sources consisting of several smaller, concurrent sound events. A source consisting of only one long sound event, for example, would be poorly synthesized by our method. In a sense, overlap-add assumes that sound events are small enough to fit inside (or almost inside) a single block, similar to the assumption of Zhu and Wyse [7]. Even when the source is not composite, the overlap-added result will be: a recording of a single baby crying, for example, would produce a texture in which more than one baby appears to be crying. This is a consequence of the crossfading in the overlap regions.

Further, our overlap-add approach is not well suited to music or speech. Music usually has a clear underlying rhythm, while speech follows complex causal models of language; overlap-add cannot easily retain these structures, and such sounds are better synthesized by other means.

Finally, as in most other sound texture synthesis methods, longer synthesized textures require longer source clips to avoid sounding repetitive. In our testing we mainly used source clips ranging from 5 to 15 seconds, and found that source clips lasting 10 seconds or more are needed to synthesize natural-sounding textures of a minute or more.

2.3. Comparison with previous work

Most previous researchers have used very short time scales to capture sound events. Dubnov et al. based
their work on granular synthesis using very small grains, Lu et al. used 32 ms frames for their MFCC analysis, and Wyse and Athineos both used millisecond lengths in their work [1,2,4,7]. In our overlap-add approach, longer time scales generally perform better; short time scales often distort longer sound events because they cannot preserve the higher-level characteristics of the sound.

Parker and Behm used longer blocks in an approach similar to our overlap-add method [5]. The main difference between their approach and ours is that they used relatively short overlaps between blocks. This can cause artifacts such as clicks and pops, and the sound examples on their web site show such artifacts at the seams between blocks. Their method also requires user intervention to hand-tailor the block lengths for each sound. By contrast, our method is more robust and requires no user intervention.

Another way in which our approach differs substantially from all previous work is that it requires no analysis stage. Instead, it exploits the inherent self-similarity of the source, assuming that many combined random samplings will produce a sound similar to the source. This simplicity makes the algorithm very efficient. By way of comparison, Dubnov et al.'s wavelet tree learning method uses an O(n) transformation to and from the wavelet domain, but the synthesis stage of recombining the grains is, in a naive implementation, O(2^n) and very memory intensive. They recommend an optimized implementation, but even this is much slower than overlap-add [2]. Lu et al.'s approach is faster than Dubnov's, but still includes a complex analysis stage for matrix construction and MFCC calculation; their synthesis stage is more or less straightforward, depending on the amount of post-processing required [4].

3. EVALUATION

To evaluate overlap-add texture synthesis, we conducted a listening test.
The 22 subjects who participated in the test were students at the Hong Kong University of Science and Technology, ranging in age from 19 to 27 years, and reported no hearing problems. All listeners had at least 5 years of practice on a musical instrument. The subjects were paid for their participation. The Java command-line program that controlled the experiment was custom-written by the authors. Subjects were seated in a "quiet room" with a 40 dB SPL background noise level (mostly due to computers and air conditioning). The stimuli, which were stored on hard disk in 8- and 16-bit integer format, were converted to analog by SoundBlaster X-Fi Xtreme Audio soundcards and presented through Sony MDR-7506 headphones. The soundcard DAC uses 24 bits, with a maximum sampling rate of 96 kHz and a 109 dB S/N ratio. The sounds were actually played at sampling rates between 8.0 kHz and 22.05 kHz, depending on the source sound file.

The test included eight source clips that were used for synthesis and comparison. Six of the clips were used by earlier studies [2,4], allowing us to compare methods. Test subjects first heard all the original source clips. Then, 30-second synthesized textures were presented in random order. There were four versions of each overlap-add texture, with block lengths of 0.25, 0.5, 1, and 2 seconds. Subjects rated each texture for its naturalness (see Table 1 for the rating scheme); a high score of seven meant that the texture sounded as natural as the original clip. Additionally, subjects rated four trials of each texture, where each trial was generated using a different random seed. Subjects also rated textures produced by Dubnov and Lu where available (four textures produced by Dubnov and two by Lu, as shown in Table 2). In total, each subject listened to eight source clips, six textures produced by Dubnov/Lu, and 128 textures produced by our overlap-add method.
Score   Meaning
1       Very unnatural
2       Mostly unnatural
3       Somewhat unnatural
4       Somewhat natural
5       Mostly natural
6       Very natural
7       As natural as the original

Table 1. Rating scheme for the listening test

4. RESULTS

The data produced by the listening test allowed us to determine both the optimal average block length for overlap-add and the overall performance of the methods. The listening test data was analyzed for extreme outliers, but none were found. The average rating was calculated for each source sound and block length, along with the average ratings of Dubnov/Lu's textures where applicable (see Table 2).

Source       Dubnov/Lu   Overlap-add score for block length:
clip         score       0.25 s   0.5 s   1.0 s   2.0 s
aviary       N/A         3.70     3.96    4.42    4.44
baby         4.09        2.21     3.19    3.96    4.62
racing       1.59        2.12     3.48    4.65    5.36
rain         3.95        3.86     3.85    3.89    3.90
seagulls     N/A         4.06     4.51    4.47    4.73
shore        2.27        4.59     5.47    5.88    5.78
stream       3.63        4.45     4.06    4.05    4.17
trafficjam   3.45        3.17     3.87    4.63    5.00
All          3.16        3.52     4.05    4.50    4.75

Table 2. Average scores for different configurations
Table 2 shows that longer block lengths generally give better results. In our own informal tests, even longer block lengths of 4 seconds also gave good results. Longer block lengths, however, carry a greater risk of repetition, and there are obvious problems when the block length approaches the length of the source sound, since there are then fewer ways to segment the source. Two-second blocks seem the best balance of these factors, producing an average score of 4.75, corresponding to a rating of "mostly natural." By comparison, the average score for Dubnov/Lu taken together was 3.16, corresponding to a rating of "somewhat unnatural." (The two textures of Lu received an average score of 3.79, compared to 2.85 for the four textures of Dubnov.) We assume that the sound textures provided by Dubnov and Lu were those they judged best across various parameter settings.

Overlap-add gave a remarkable improvement on the shore and racing car examples. Dubnov's synthesized textures of these examples scored the lowest of the six source clips, while for overlap-add they scored the highest. This is probably because overlap-add is especially well suited to sounds that come and go, like waves washing on the shore and racing cars driving by. Overlap-add with 2-second block lengths produced better textures than the methods of Dubnov and Lu for almost all source clips. Lu's rain texture was better, but only by a small margin; the rain clip was only 3 seconds long, probably not enough for any of the evaluated methods to produce good sound textures.

Paired-sample t-tests [3] showed that the average score for overlap-add is significantly higher than that for the six synthesized textures of Dubnov and Lu taken together, whether paired by source clip (t = 2.46, p < 0.001) or by listener (t = 8.53, p < 0.001).
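For reference, the paired-sample t statistic used in such tests can be computed directly from paired ratings. The scores below are hypothetical placeholders for illustration, not the study's data.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """Paired-sample t statistic: t = mean(d) / (stdev(d) / sqrt(n)),
    where d are the pairwise differences and df = n - 1."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# Hypothetical per-clip average ratings (NOT the study's data),
# paired by source clip; a positive t favors the first method.
overlap_add = [4.4, 4.6, 5.4, 3.9, 4.7, 5.8, 4.2, 5.0]
other       = [3.7, 4.1, 1.6, 4.0, 4.1, 2.3, 3.6, 3.5]
t = paired_t(overlap_add, other)
```

Pairing by clip (or by listener) removes the variance between clips (or listeners) from the comparison, which is why a paired test is appropriate here.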
5. CONCLUSIONS

Our listening test results show that overlap-add texture synthesis produces mostly natural-sounding results when block lengths are one second or more, and that overlap-add performs significantly better than the methods of Dubnov and Lu. The results also indicate that longer blocks perform better than the shorter blocks used by previous researchers.

Overlap-add gives consistently natural-sounding results over many different types of input sounds with varying character and length. It is especially well suited to sounds that come and go, like waves washing on the shore. Additionally, overlap-add is very efficient, with a time complexity of O(n), since unlike other methods it does not rely on an analysis step. Overlap-add is capable of real-time performance in computer games and other interactive environments, a clear advantage over previous methods.

An area for possible future investigation is the use of more filters and effects to help minimize obvious repetition. Low-pass filters, optimized envelopes, and pitch-shaping are all potentially useful at the cost of slightly more computation. A segmentation technique could also be used to optimize the individual block lengths [4], though the expected gain would be relatively small. Finally, to make this approach suitable for professional use, stereo channel support could be added to the algorithm.

The authors would like to thank Jenny Lim and Lee Chung for conducting the listening test. This study was supported in part by the Hong Kong Research Grants Council's Projects HKUST6135/05E and HKUST6138/06E.

6. REFERENCES

[1] Athineos, M. and Ellis, D. P. W., "Sound texture modeling with linear prediction both in time and frequency domains," Proc. of the Int. Conf. on Acoustics, Speech and Signal Processing, 2003.

[2] Dubnov, S., Bar-Joseph, Z., El-Yaniv, R., Lischinski, D. and Werman, M., "Synthesizing Sound Textures through Wavelet Tree Learning," IEEE Computer Graphics and Applications, special issue on Virtual Worlds, Real Sounds, pp. 38-48, July/August 2002.

[3] Hogg, R. V. and Ledolter, J., Applied Statistics for Engineers and Physical Scientists, Maxwell and Macmillan, 1992.

[4] Lu, L., Wenyin, L. and Zhang, H. J., "Audio Textures: Theory and Applications," IEEE Trans. on Speech and Audio Processing, vol. 12, no. 2, pp. 156-167, March 2004.

[5] Parker, J. R. and Behm, B., "Creating audio textures by example: tiling and stitching," Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP '04), vol. 4, pp. iv-317-iv-320, May 2004.

[6] Strobl, G., Eckel, G. and Rocchesso, D., "Sound Texture Modeling: A Survey," Proc. of the Sound and Music Computing Conference (SMC'06), Marseille, France, 2006.

[7] Zhu, X. and Wyse, L., "Sound Texture Modeling and Time-Frequency LPC," Proc. of the 7th Int. Conference on Digital Audio Effects (DAFx'04), pp. 354-349, 2004.