Proceedings of the International Computer Music Conference (ICMC 2009), Montreal, Canada
August 16-21, 2009
Sound Source   K     Sample Length   AC/DC Error   AC/DC η    Young MC Error   Young MC η
Noise          N/A   N/A             0.6395        N/A        0.6265           N/A
Ramones        100   116 ms          0.3844        0.004569   0.4039           0.001979
Ramones        200   116 ms          0.3787        0.002386   0.4100           0.003376
AC/DC          100   116 ms          0.3455        0.005579   0.3821           0.003689
AC/DC          200   116 ms          0.3349        0.002382   0.3841           0.001939
MC Hammer      100   116 ms          0.3838        0.005017   0.3753           0.003875
MC Hammer      200   116 ms          0.3740        0.002993   0.3732           0.002499
TIMIT          100   464 ms          0.5898        0.004742   0.6102           0.003262
TIMIT          200   464 ms          0.5275        0.002097   0.6110           0.001796
Table 1. Errors obtained by our approach when trying to match songs by AC/DC and Young MC using various sets of sound sources, and the learned values of the hyperparameter η. In all cases our method outperforms a baseline of white noise. Note that lower errors do not necessarily translate to a more aesthetically interesting result.
differences between the normalized spectrograms of the target sound and the resynthesized sound. Let Z be the normalized spectrogram of the resynthesized sound. Our error metric is

err = \frac{1}{2} \sum_{b=1}^{B} \sum_{w=1}^{W} |Z_{wb} - Y_{wb}|    (14)

which ranges between 0.0 (perfect agreement between the spectrograms) and 1.0 (no overlap between the spectrograms).
Table 1 presents the errors obtained by our approach
when trying to match 23.2 second clips (1000 512-sample
windows at 22.05 kHz) from the songs "Dirty Deeds Done
Dirt Cheap" by AC/DC and "Bust a Move" by Young MC,
using samples selected at random from the songs "Dirty
Deeds Done Dirt Cheap," "Blitzkrieg Bop" by the Ramones,
and "U Can't Touch This" by MC Hammer. We also used
words spoken by various speakers from the TIMIT corpus
of recorded speech as source samples. Samples from similar
songs tend to produce lower errors, whereas the model had
trouble reproducing music using spoken words. The speech
samples produce a quantitatively weaker match to the target
audio, but the "automatic a cappella" effect of trying to reproduce songs using speech proved aesthetically interesting.
All output sounds are available at the URL given above.
5. DISCUSSION
We presented a new audio mosaicing approach that attempts
to match the spectrogram of a target sound by combining a
vocabulary of shorter sounds at different time offsets and
amplitudes. We introduced the SIMM model and showed
how to use it to find a set of time offsets and amplitudes that
will result in an output sound that matches the target sound.
Our probabilistic approach is extensible, and we expect future refinements will yield further interesting results.
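Once the inference procedure has chosen which source sounds to place where, the output itself is a simple overlap-add mixture. The following sketch renders such a mixture; it is an illustration of the synthesis step only, not the SIMM inference, and the function and variable names are hypothetical:

```python
import numpy as np

def render(sources, events, length):
    """Mix short source sounds into an output buffer by overlap-add.

    sources: list of 1-D numpy arrays (the sample vocabulary).
    events:  list of (source_index, offset_in_samples, amplitude)
             triples, e.g. as chosen by an inference procedure.
    length:  length of the output signal in samples.
    """
    out = np.zeros(length)
    for idx, offset, amp in events:
        s = sources[idx]
        end = min(offset + len(s), length)  # clip sounds at the buffer edge
        out[offset:end] += amp * s[: end - offset]
    return out

# Toy usage: place two short "clicks" at different offsets and gains.
clicks = [np.ones(4), -np.ones(2)]
y = render(clicks, [(0, 0, 0.5), (1, 3, 1.0)], length=8)
```

Because events may overlap in time, the additive `+=` is what lets multiple source sounds sum into the same region of the output, mirroring the combination of sounds at different time offsets and amplitudes described above.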
6. ACKNOWLEDGMENTS
David M. Blei is supported by ONR 175-6343, NSF CAREER 0745520, and grants from Google and Microsoft.