Proceedings of the International Computer Music Conference (ICMC 2009), Montreal, Canada, August 16-21, 2009

Sound Source  K    Sample Length  Error (AC/DC)  η (AC/DC)  Error (Young MC)  η (Young MC)
Noise         N/A  N/A            0.6395         N/A        0.6265            N/A
Ramones       100  116 ms         0.3844         0.004569   0.4039            0.001979
Ramones       200  116 ms         0.3787         0.002386   0.4100            0.003376
AC/DC         100  116 ms         0.3455         0.005579   0.3821            0.003689
AC/DC         200  116 ms         0.3349         0.002382   0.3841            0.001939
MC Hammer     100  116 ms         0.3838         0.005017   0.3753            0.003875
MC Hammer     200  116 ms         0.3740         0.002993   0.3732            0.002499
TIMIT         100  464 ms         0.5898         0.004742   0.6102            0.003262
TIMIT         200  464 ms         0.5275         0.002097   0.6110            0.001796

Table 1. Errors obtained by our approach when trying to match songs by AC/DC and Young MC using various sets of sound sources, and the learned values of the hyperparameter η. In all cases our method outperforms a baseline of white noise. Note that lower errors do not necessarily translate to a more aesthetically interesting result.

differences between the normalized spectrograms of the target sound and the resynthesized sound. Let Z be the normalized spectrogram of the resynthesized sound. Our error metric is

err = 0.5 \sum_{w=1}^{W} \sum_{b=1}^{B} |Z_{wb} - Y_{wb}|    (14)

which ranges between 0.0 (perfect agreement between the spectrograms) and 1.0 (no overlap between the spectrograms). Table 1 presents the errors obtained by our approach when trying to match 23.2-second clips (1000 512-sample windows at 22.05 kHz) from the songs "Dirty Deeds Done Dirt Cheap" by AC/DC and "Bust a Move" by Young MC, using samples selected at random from the songs "Dirty Deeds Done Dirt Cheap," "Blitzkrieg Bop" by the Ramones, and "U Can't Touch This" by MC Hammer. We also used words spoken by various speakers from the TIMIT corpus of recorded speech as source samples. Samples from similar songs tend to produce lower errors, whereas the model had trouble reproducing music using spoken words.
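The error metric of Eq. (14) can be sketched in a few lines of NumPy. This is a minimal illustration, assuming each spectrogram is normalized to sum to one (consistent with the stated 0.0-1.0 range); the function name `mosaic_error` and the bin count are our own, not from the paper.

```python
import numpy as np

def mosaic_error(Y, Z):
    """Eq. (14): half the summed absolute difference between the
    normalized target spectrogram Y and the normalized resynthesis
    spectrogram Z. With both spectrograms summing to 1, the result
    lies in [0, 1]: 0.0 for identical spectrograms, 1.0 when their
    supports do not overlap at all."""
    return 0.5 * np.abs(Z - Y).sum()

# 1000 windows of 512 samples at 22.05 kHz gives the clip length
# used in the experiments: 1000 * 512 / 22050 ≈ 23.2 s.
W, B = 1000, 257  # frames x frequency bins (B = 257 is illustrative)
rng = np.random.default_rng(0)
Y = rng.random((W, B)); Y /= Y.sum()  # normalized target spectrogram
Z = rng.random((W, B)); Z /= Z.sum()  # normalized resynthesis
print(mosaic_error(Y, Y))  # perfect agreement -> 0.0
print(mosaic_error(Y, Z))  # partial overlap -> between 0.0 and 1.0
```

This is the total-variation distance between the two normalized spectrograms, which is why the extremes 0.0 and 1.0 correspond to perfect agreement and zero overlap.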
The speech samples produce a quantitatively weaker match to the target audio, but the "automatic a cappella" effect of trying to reproduce songs using speech proved aesthetically interesting. All output sounds are available at the URL given above.

5. DISCUSSION

We presented a new audio mosaicing approach that attempts to match the spectrogram of a target sound by combining a vocabulary of shorter sounds at different time offsets and amplitudes. We introduced the SIMM model and showed how to use it to find a set of time offsets and amplitudes that will result in an output sound that matches the target sound. Our probabilistic approach is extensible, and we expect future refinements will yield further interesting results.

6. ACKNOWLEDGMENTS

David M. Blei is supported by ONR 175-6343, NSF CAREER 0745520, and grants from Google and Microsoft.