Analysis and Resynthesis of Percussion Sounds: Two Methods Compared
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
James W. Beauchamp
School of Music and Department of Electrical & Computer Engineering
University of Illinois at Urbana-Champaign
Urbana, IL 61801 USA
e-mail: j-beauch@uiuc.edu

Abstract - Two methods of time-variant spectral analysis have been applied to the synthesis of three different percussion sounds: chime, timpani, and cymbal. A phase vocoder method, which determines modal frequencies by phase calculation, has proved superior in resolving frequencies to a spectral peak tracking method, where frequencies are determined by spectral peak interpolation. Ordinary time-variant additive synthesis is used to synthesize stretched versions of the sounds. Syntheses of the chime sound about the same with either analysis method, but syntheses of the timpani and cymbal are markedly more natural when based on the phase vocoder analysis.

1. Introduction

Percussion sounds are particularly challenging for analysis/synthesis because of their combination of rapid amplitude transients, inharmonic and often closely-spaced partials, and possibly rapid pitch changes. We have been using two different short-time Fourier analysis/synthesis methods to process percussion sounds: 1) a fixed filter bank phase vocoder program with finely adjustable window size and base frequency, and 2) an "MQ" spectral peak tracking program with coarsely adjustable window size, based on algorithms by McAulay and Quatieri [1986] and Smith and Serra [1987]. The methods are depicted in Figure 1. The essential difference between the two algorithms is that with 1) frequencies are determined from a phase calculation on each harmonic bin of the FFT analyzer, whereas with 2) frequencies are interpolated from the positions of spectral peaks, and tracks are formed by joining peaks of similar frequency from one frame to the next.
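The two steps that distinguish the peak tracking method (interpolating a peak position, then joining peaks from frame to frame) can be sketched as follows. This is only a minimal illustration under stated assumptions: the paper does not say which interpolation or matching rule its MQ program uses, so parabolic interpolation over three dB magnitudes and a greedy nearest-frequency match are stand-ins, and the names `parabolic_peak`, `continue_tracks`, and `max_jump_hz` are hypothetical.

```python
def parabolic_peak(mags_db, k):
    """Refine a local maximum at bin k by fitting a parabola through
    the three magnitudes (in dB) at bins k-1, k, k+1.
    Returns (fractional bin position, interpolated peak height)."""
    a, b, c = mags_db[k - 1], mags_db[k], mags_db[k + 1]
    denom = a - 2.0 * b + c
    offset = 0.0 if denom == 0 else 0.5 * (a - c) / denom
    height = b - 0.25 * (a - c) * offset
    return k + offset, height


def continue_tracks(prev_freqs, new_freqs, max_jump_hz):
    """Greedy frame-to-frame matching (an assumed, simplified rule).
    Each previous track takes the nearest unused new peak if it lies
    within max_jump_hz; otherwise the track ends (None).
    Returns (links, births): links are (prev_index, new_index or None),
    births are indices of new peaks that start new tracks."""
    unused = set(range(len(new_freqs)))
    links = []
    for i, f in enumerate(prev_freqs):
        best = min(unused, key=lambda j: abs(new_freqs[j] - f), default=None)
        if best is not None and abs(new_freqs[best] - f) <= max_jump_hz:
            links.append((i, best))
            unused.discard(best)
        else:
            links.append((i, None))  # no acceptable continuation: track ends
    return links, sorted(unused)
```

When two tracks compete for the same peak, as in `continue_tracks([100.0, 104.0], [102.0], 5.0)`, one of them is forced to end. This is the kind of situation in which a tracker of this sort can fragment partials when modes are dense.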
There is no perfect algorithm for peak tracking, and the one we have been using is prone to "errors" [Maher, 1989]. However, the attractiveness of the MQ approach is that it makes no assumptions about the spacing of the frequencies in the signal, e.g., whether they are harmonic or inharmonic.

Fig. 1. Implementation of two time-variant spectrum analysis methods: a) phase vocoder with FFT window matched to a prescribed pitch period ($1/f_a$); b) MQ spectral peak tracking algorithm.

In this paper, we consider three very different percussion sounds using both analysis methods: a) chime, b) timpani, and c) cymbal. These sounds can be described briefly as follows:

The chime consists of widely-spaced partials (modes) spaced according to the pattern $f_m = f_0 (2m + 1)^2$, where $f_0$ depends on various physical properties of the chime. Modes $f_1$ and $f_2$ are weak, whereas partials $f_3, f_4, \ldots, f_9$ are quite strong, with decay times varying from 0.5 to 2 sec. The "strike tone" of the chime is attributed to $f_4/2$, whereas the "hum tone" is $f_2$.

The timpani sound initially contains a very rich set of closely-spaced partials, but most of these damp out quickly, leaving a nearly harmonic set of partials which gives the timpani its pitch. Its overall decay time is short, about 1.5 sec.

The decay of the cymbal is somewhat longer (approximately 2 sec.). In contrast to the timpani, the cymbal remains rich in densely spaced partials throughout its length. Indeed, the cymbal contains literally hundreds if not thousands of frequency components and is considered to be indefinite in pitch.

Our principal goal is to accurately represent percussion sounds in terms of their individual frequency components (modes). However, when modal frequencies are too dense, it becomes very difficult, if not impossible, to separate them and still retain the rapid transient nature of the signal. Consider that when two
closely-spaced sine waves are combined, they can be interpreted as a single sinusoid with time-varying amplitude and phase:

$$A_1\cos(\omega_1 t) + A_2\cos(\omega_2 t) = \sqrt{A_1^2 + A_2^2 + 2A_1A_2\cos\big((\omega_1-\omega_2)t\big)}\;\cos\!\left(\frac{\omega_1+\omega_2}{2}\,t + \tan^{-1}\!\left[\frac{A_1-A_2}{A_1+A_2}\tan\!\left(\frac{\omega_1-\omega_2}{2}\,t\right)\right]\right)$$

These amplitude beats, occurring at a rate of $(\omega_1 - \omega_2)$, can be considered legitimate temporal activity, and whether temporal changes are caused by a single mode's undulating time envelope or by two or more close modes beating together is quite ambiguous. From a perceptual point of view, a good test of whether a modal structure has been captured by spectral analysis is to stretch the sounds by a considerable factor. If a sound retains its original natural quality after stretching, we can consider that the modes have been isolated satisfactorily in a practical sense.

2. Analysis/Synthesis Results

For an MQ analysis of a percussion sound it is only necessary to give the lowest frequency of analysis. This not only determines the basic frequency resolution but also sets the time window size and thus the effective time resolution. Figure 2 shows track analyses for the three percussion instruments, and Figure 3 shows the corresponding three-dimensional amplitude vs. time/frequency graphs for the same cases. While the widely-spaced partials of the chime are clearly distinguished, the tracks of the timpani and cymbal seem very chaotic. For these two sounds, the tracking algorithm seems incapable of forming coherent partials, even though numerous correct frequencies have been detected. This would not necessarily be a problem for further processing if track segments could somehow be joined together to form perceptually correct partials. Synthesis should accomplish this, and indeed if the sounds are resynthesized at their original durations, they all sound quite reasonable.
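The two-sinusoid identity from the Introduction can be checked numerically. The sketch below assumes the identity has the form $A_1\cos\omega_1 t + A_2\cos\omega_2 t = A(t)\cos(\tfrac{\omega_1+\omega_2}{2}t + \phi(t))$, with $A(t)^2 = A_1^2 + A_2^2 + 2A_1A_2\cos((\omega_1-\omega_2)t)$ and $\tan\phi(t) = \tfrac{A_1-A_2}{A_1+A_2}\tan(\tfrac{\omega_1-\omega_2}{2}t)$. It uses `atan2` as a quadrant-safe version of the arctangent. The function names and the test values (440 Hz and 443 Hz) are illustrative choices, not values from the paper.

```python
import math


def two_cosines(a1, a2, w1, w2, t):
    """Left-hand side: the sum of two closely spaced partials."""
    return a1 * math.cos(w1 * t) + a2 * math.cos(w2 * t)


def single_sinusoid(a1, a2, w1, w2, t):
    """Right-hand side: one sinusoid at the mean frequency with a
    time-varying amplitude (the beat envelope) and phase."""
    half_diff = 0.5 * (w1 - w2) * t
    power = a1 * a1 + a2 * a2 + 2.0 * a1 * a2 * math.cos(2.0 * half_diff)
    amp = math.sqrt(max(0.0, power))  # guard against rounding below zero
    phase = math.atan2((a1 - a2) * math.sin(half_diff),
                       (a1 + a2) * math.cos(half_diff))
    return amp * math.cos(0.5 * (w1 + w2) * t + phase)


def max_mismatch(a1, a2, f1_hz, f2_hz, duration_s=1.0, steps=2000):
    """Largest absolute difference between the two forms on a time grid."""
    w1, w2 = 2.0 * math.pi * f1_hz, 2.0 * math.pi * f2_hz
    return max(
        abs(two_cosines(a1, a2, w1, w2, n * duration_s / steps)
            - single_sinusoid(a1, a2, w1, w2, n * duration_s / steps))
        for n in range(steps + 1)
    )
```

For example, `max_mismatch(1.0, 0.6, 440.0, 443.0)` should be at the level of floating-point rounding. In this case the envelope swings between $A_1 + A_2$ and $|A_1 - A_2|$ three times per second, which an analyzer with insufficient frequency resolution would report as a single undulating partial.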
However, when the sounds are resynthesized with a stretch factor of 10, the quality of the chime is preserved very well, but the stretched timpani and cymbal have a peculiar "swirling" quality, indicating that the tracks are moving in a way not perceptually representative of the originals.

Fig. 2. MQ spectral peak frequency vs. time tracks for a) F#4 chime, b) F3 timpani, and c) cymbal.

Fig. 3. Three-dimensional spectrum graphs (amplitude vs. frequency vs. time) for a) chime, b) timpani, and c) cymbal obtained from MQ peak tracking analysis data.

Our phase vocoder program was written to optimize analysis of sounds with harmonically-related partials using an FFT-based filter bank approach. The original assumption was that the first harmonic would emanate from the first filter, the second from the second, and so on. However, for the case of the chime, the modes do not line up neatly with a normal harmonic series. Even though the series $f_0 (2m+1)^2$ is integrally divisible by $f_0$, $f_0$ is so low (9.1 Hz for F#4) that its use would compromise the resulting time resolution ($= 1/f_0$). Fortunately, we have found that $f_a = 0.33 f_{pitch}$ (122 Hz) for F#4 gives good mode separation. Figure 4 shows a) a perspective view of the time-variant analysis and b) a graph of the mode frequencies detected using this value of $f_a$. Clearly, the phase vocoder and the MQ peak tracking methods give very comparable results for the chime, as they do for various harmonic sounds we have tested (e.g., the oboe).

Fig. 4. Phase vocoder analysis of the chime sound: a) three-dimensional spectrum (amplitude vs. frequency vs. time); b) frequencies detected from FFT filter outputs combined on a single graph, with grey scale indicating amplitude.

For the timpani and cymbal, we have been forced to break our rule about isolating modes in distinct filter outputs, since it is extremely difficult to do so with an FFT-based method. Instead, we choose a low analysis frequency (20 Hz), which represents a reasonable compromise between our needs for adequate resolution in both time and frequency. The timpani's 3D spectrum and modes detected by this method are shown in Figure 5. We see that the initial complex spectral structure of the timpani is quickly reduced to a set of nearly harmonic partials, which are particularly strong in the frequency region from 400 to 600 Hz. As reported in the acoustics literature [e.g., Rossing, 1990], these are the well-known membrane modes $m1$, $m = 1, \ldots, 7$, where $m$ is the number of radial nodal lines on the drum surface and the 1 represents the one nodal circle, which occurs at the timpani's rim. The timpani's pitch, in this case F3, is attributed to the $m = 1$ mode.

On the other hand, although there are certain modes that stand out, the cymbal's sound remains very dense in modal structure throughout its duration, especially for frequencies above 700 Hz. Significant energy exists out to 10,000 Hz and beyond. Because there are so many irregularly spaced modes, it lacks any definite pitch. An examination of the amplitudes of the individual modes indicates that they undulate up and down at asynchronous rates, producing a very complex time-varying spectrum.
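The phase calculation that the phase vocoder applies to each FFT bin can be illustrated with a single bin. The sketch below is not the program used in the paper. It is a minimal stand-in that evaluates one Hann-windowed DFT bin at two frame positions and converts the phase advance into a frequency. The sample rate (8000 Hz), window length (400 samples, giving the 20 Hz bin spacing mentioned in the text), hop size, and all function names are assumptions chosen for illustration.

```python
import cmath
import math


def make_tone(freq_hz, fs, length):
    """Test signal: a unit-amplitude cosine."""
    return [math.cos(2.0 * math.pi * freq_hz * i / fs) for i in range(length)]


def bin_value(x, start, n, k):
    """Hann-windowed DFT of x[start:start+n], evaluated at bin k only."""
    acc = 0j
    for i in range(n):
        w = 0.5 - 0.5 * math.cos(2.0 * math.pi * i / n)
        acc += w * x[start + i] * cmath.exp(-2j * math.pi * k * i / n)
    return acc


def phase_vocoder_freq(x, fs, n, k, hop):
    """Estimate the frequency present in bin k from the phase advance
    between two frames that start hop samples apart."""
    p0 = cmath.phase(bin_value(x, 0, n, k))
    p1 = cmath.phase(bin_value(x, hop, n, k))
    expected = 2.0 * math.pi * k * hop / n  # advance of the bin centre
    # Wrap the deviation from the expected advance into [-pi, pi).
    dev = (p1 - p0 - expected + math.pi) % (2.0 * math.pi) - math.pi
    return (k / n + dev / (2.0 * math.pi * hop)) * fs
```

With a 443.7 Hz tone, `phase_vocoder_freq(make_tone(443.7, 8000.0, 600), 8000.0, 400, 22, 100)` should land very close to 443.7 Hz even though bin 22 is centred on 440 Hz. The estimate is only meaningful when a single component dominates the bin, which is the condition the choice of base frequency is meant to approximate.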
Both the timpani and cymbal sounds have been resynthesized and stretched by a factor of 10 based on the phase vocoder data. The resulting sounds were found to be quite natural and were devoid of the "swirling" character plaguing comparable synthetic sounds produced from the MQ analysis.

Fig. 5. Phase vocoder analysis of the timpani sound: a) three-dimensional spectrum (amplitude vs. frequency vs. time); b) frequencies detected from FFT filter outputs combined on a single graph, with grey scale indicating amplitude.

3. Summary and Discussion

We conclude that the phase vocoder method is as good as, and often superior to, our MQ spectral peak tracking method for analysis of a wide variety of percussion sounds. This result was not expected, because one of the virtues of the MQ method is its independence with respect to particular frequency ratios of the spectral components. However, when components are very close together and amplitudes are not stable, it is easy for peak tracking algorithms to go awry. On the other hand, the trick to using the phase vocoder for these types of sounds is to use a low enough base frequency (or equivalently, a large enough FFT window) so that only one modal frequency occurs in each FFT bin. Even if this rule is not strictly obeyed, a 20 Hz base frequency results in a good compromise between time and frequency resolution, and effectively only one frequency is produced by each bin, albeit with some time modulation.

4. Acknowledgments

The analysis graphics shown in this article have been generated by the SNDAN analysis/synthesis package [Beauchamp, 1993]. Development of SNDAN for this project has been supported by the Research Board at the University of Illinois at Urbana-Champaign. For information about downloading SNDAN, contact the author of this paper. The author is grateful to Timothy I. Madden for his recent assistance in graphics programming and Robert C. Maher for writing the phase vocoder and MQ analysis/synthesis programs used in this study.

References

[Beauchamp, 1993] J. W. Beauchamp. Unix workstation software for analysis, graphics, modification, and synthesis of musical sounds. Audio Engineering Society, Preprint No. 3479, 1993.

[Maher, 1989] R. C. Maher. An approach for the separation of voices in composite musical signals. Ph.D. thesis, University of Illinois at Urbana-Champaign, 1989.

[McAulay and Quatieri, 1986] R. J. McAulay and T. F. Quatieri. Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-34, No. 4: pp. 744-754, 1986.

[Rossing, 1990] T. D. Rossing. The Science of Sound, Addison-Wesley, pp. 267-270, 1990.

[Smith and Serra, 1987] J. O. Smith and X. Serra. PARSHL: An analysis/synthesis program for non-harmonic sounds based on a sinusoidal representation. Proc. 1987 Int. Computer Music Conf., Int. Computer Music Assn., pp. 290-297, 1987.

ICMC Proceedings 1999