SOUNDSPOTTER / REMIX-TV: FAST APPROXIMATE MATCHING FOR AUDIO AND VIDEO PERFORMANCE

Michael Casey
Media Futures Lab, Goldsmiths, University of London

ABSTRACT

SoundSpotter is an open-source software system for real-time matching of an audio input stream to a database of continuous audio or video. Among its novel features are real-time control over audio segmentation, feature selection and match radius. The system uses audio input to control selection of output from a database using approximate matching. The low-latency methods employed allow a feedback loop between the performer and SoundSpotter, so it can be used as a live electronic musical instrument. We employ approximate matching using nearest-neighbor search with variable-length sequences of features. The feature space is controlled by range selection over the feature-vector dimensions, determining which are left out of similarity calculations. The current implementation is capable of real-time matching over tens of hours of audio or video on current laptop hardware. Finally, we describe a SoundSpotter-driven video editing application called Remix-TV.

1. INTRODUCTION

Audio matching is a query-by-example task whose goal is to return close perceptual matches to a given query from a database. The task is complex and can be approached in a number of ways. In this paper, we use methods that preserve the temporal ordering of acoustic information in the query segment when making a match. The system uses sequences of audio features, called audio shingles, that are segmented by periodic windows, onsets or beats.

SoundSpotter was designed as a method to play a large database of audio using acoustic control. The mapping from input to output depends on the audio properties of segments stored in the database, the audio properties of the input, the extracted audio features and the matching algorithm.
SoundSpotter is an open-source project implemented using the FLEXT C++ framework for compatibility with the Max/MSP and PD environments; a stand-alone version is also available [17]. Being incorporated into such environments enables integration with a host of widely used music analysis and synthesis tools such as onset detectors, beat trackers, phase vocoders, cross-synthesis and video tools. We describe prior work in the area of audio matching in Section 2 and SoundSpotter's implementation in Section 3, and present sound-driven algorithmic approaches to video editing, using SoundSpotter's Remix-TV capability, in Section 4.

Mick Grierson
Music Department, Goldsmiths, University of London

2. PREVIOUS WORK

A number of previous systems employ techniques of music information retrieval in music performance. Among these, the most relevant to our work are Musaicing [15], MATConcat [12], Mosievius [8], BeatBox [7], ScrambledHackz [16] and Caterpillar [11]. Our work extends these systems in several important respects. First, SoundSpotter uses sequence matching to match temporal patterns in the multi-dimensional audio features. Secondly, variable-length segmentation is employed, so the sequence size can be changed on the fly during a performance. Thirdly, real-time feature selection changes the feature space for matching without requiring re-extraction of features for the source database. In addition, SoundSpotter is timbre, time and pitch sensitive, with complete control given to the user in real time. Furthermore, real-time control over match distance allows a continuous range from similarity to dissimilarity in the match. Beyond this, matching without replacement is implemented by efficient table lookup. Finally, a live spot function performs matching in an accumulating real-time source buffer.

There is a wide spectrum of previous work on audio similarity tasks, from very specific fingerprinting work [5][14] to genre recognition [13][10].
SoundSpotter falls between these two extremes, both of which are well represented in the literature. We want to find audio segments that are similar to the query segment but that are not exact matches. Our approximate audio-matching task needs both new features (we do not expect the very specific features used in fingerprinting to work) and a new matching criterion, because we want a good perceptual fit to the query that considers temporal, timbral and pitch-based features.

Audio shingling is a technique for similarity matching that concatenates audio feature vectors into a sequence of vectors and matches the entire sequence. The number of vectors used varies between 10 and 100 in the literature, with feature frame rates on the order of 100 ms. In previous work we explain how audio shingles can be used to identify remixed music by matching specific audio content embedded in a new context [1][2]. A similar technique was employed in [9] for approximate matching of passages from classical music works, identifying the same work performed by different artists and even by different ensembles, such as piano reductions of orchestral works.
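As a minimal sketch of the shingling idea (an illustration of the technique, not code from SoundSpotter), each shingle is simply the concatenation of w consecutive per-frame feature vectors:

```python
import numpy as np

def make_shingles(features, w):
    """Stack each run of w consecutive feature frames into one long vector.

    features: array of shape (n_frames, d), one feature vector per frame.
    Returns an array of shape (n_frames - w + 1, w * d) of shingles.
    """
    n, d = features.shape
    return np.stack([features[k:k + w].reshape(w * d)
                     for k in range(n - w + 1)])

# Tiny illustration: 5 frames of 2-d features, shingles of 3 frames each.
feats = np.arange(10.0).reshape(5, 2)
shingles = make_shingles(feats, 3)
print(shingles.shape)   # (3, 6)
```

Matching then operates on these long vectors, so temporal ordering within each w-frame window is preserved by construction.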

In the current work we combine pitch and timbral features in our similarity measure by employing cepstral-coefficient features, which have been widely used for music information retrieval tasks [13] but not in conjunction with the shingling method. Our cepstral coefficients are organized on a strictly logarithmic frequency scale, so they are log-frequency cepstral coefficients (LFCC).

3. THE SOUNDSPOTTER SYSTEM

The system was designed to meet a set of requirements for real-time audio matching. We consider the primary real-time constraint to be low latency: given an input segment, the time to compute the match should not exceed 20 ms, so efficient algorithms and implementations are required, especially for large source databases. Additional requirements are that the audio matching is temporally sensitive and that it allows real-time control over both the match criteria and the match distance. As such, the system implements real-time controls for sequence length (possibly under algorithmic control, such as by onset detection), on-the-fly feature selection, adaptive radius-based searching, database range selection, a variable-length queue system for matching without replacement, and matching against a real-time accumulating input source. Figure 1 shows the SoundSpotter interface in the PureData environment.

[Figure 1: PureData interface to SoundSpotter.]

Audio Features

The main requirement for audio sequence matching is that the feature is equipped with a simple norm, such as L1 or L2, thus forming a metric space. For example, this is true of spectrum-derived features such as LFCCs and chromagrams, but it is not true of commonly used statistical features such as GMMs [10]. SoundSpotter's features are log-frequency cepstral coefficients (LFCC), which are like standard MFCCs but use the full range of coefficients instead of the first 12 to 20 usually retained. Figure 2 shows that there are 89 elements in the full LFCC feature vector. The features are obtained using a 1/16th-octave constant-Q transform with a 16384-point FFT for audio sampled at 44.1 kHz. The logarithmic spectral frequencies are in the range 62.5 Hz to 8 kHz; the power spectrum is computed, the base-10 logarithm is taken, and a Discrete Cosine Transform (DCT) is applied. The LFCC features are then sequenced into a series of vectors according to the segmentation method. Feature extraction for the database of audio sources can either be pre-computed or extracted on-line as an accumulating resource during a performance. The time to extract pre-computed features is many hundreds of times faster than real time, the complexity being determined primarily by the FFT. Generally, live audio is the control, and the audio source database can be either the accumulated input (livespot) or a WAV file.

[Figure 2: System components and data flow. Log-frequency cepstral coefficients are extracted for the source and live input. Features are segmented into shingles and matching occurs over the shingle sequences. Output is the closest matching source audio.]

Segmentation

Segmentation is a data-dependent stage of the system that divides the database, and the live input stream, into chunks that are useful for the given context. The simplest type of segment in SoundSpotter is a variable-length sequence that is triggered at periodic intervals. If the sequence-length parameter is constant then the segments are fixed-length; if it is varied, the segments are non-uniform. Here, matching is unconstrained by boundaries, and exhaustive searching is necessary to find the best alignment between the input segment and the database.

Inter-onset-interval (IOI) segments trigger matching at event boundaries caused by onsets. A variety of onset detection methods can be employed, such as PureData's built-in bonk~ object. SoundSpotter then triggers matching at onset boundaries; the length of the match sequence is the last inter-onset interval. Here the matching is aligned to onsets in the audio database, which has the advantage of reducing the number of sequence comparisons because onsets sparsely populate the source database.

The frame indexes of onsets are stored in a lookup table that is separate from the audio features. This allows fast retrieval by traversing only the list of known onsets in the match algorithm.

A third type of segment is the beat, or tactus, interval, determined by a common metrical pulse that tracks tempo in the audio input stream. For this type of matching the audio database must also be tempo-tracked, with beat locations stored in a lookup table. Here, retrieval is constrained by tempo as well as beats, and is thus more efficient than unconstrained matching by variable-length sequences.

Sequence Matching

As in our previous work [1][2], we use a Euclidean distance metric on the feature space for matching. For features that are unit norm, the Euclidean distance for audio shingles is proportional to the convolution between the input vector sequence and the sequences in the database; this leads to an efficient implementation. Figure 3 shows the convolution algorithm for input shingle V, source database features D, shingle sequence length w, database length n, low feature a, feature range b, and feature dimensionality d. Figure 4 illustrates computation of the distances between the input shingle and the database shingles. The index of the maximum value of the convolution is the index of the closest matching segment in the source database.

    shingleMatch(V, D, w, n, a, b, d) -> M
    for (k = 0; k < n - w; k++) {
        n1 = n2 = 0;              // norm accumulators
        M[k] = 0;
        for (j = 0; j < w; j++)
            for (i = a; i < a + b; i++) {
                v1 = V[j*d + i];
                v2 = D[(j+k)*d + i];
                M[k] += v1 * v2;  // convolution
                n1 += v1 * v1;
                n2 += v2 * v2;
            }
        M[k] /= sqrt(n1 * n2);    // normalize
    }   // distance is now 1 - M[k]

Figure 3: Pseudo-code for audio matching by convolution of an input shingle with the source database features.

Real-Time Control

The real-time controls include feature selection using an element range for vector features. The low element, a, and the element range, b, control the span of feature dimensions accessed by the matching algorithm.
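The matching loop of Figure 3 can be sketched in Python with NumPy (an illustration of the technique, not the SoundSpotter source; the feature-range selection over a and b is omitted for brevity, and the match-radius control described under Real-Time Control is included as a `radius` argument):

```python
import numpy as np

def shingle_match(V, D, w, radius=0.0):
    """Match a query shingle against every length-w window of a database.

    V: query features, shape (w, d); D: database features, shape (n, d).
    For each start index k, the normalized inner product between the
    flattened query and the flattened window D[k:k+w] is computed, as in
    Figure 3; distance is 1 - similarity. The start index whose distance
    is closest to `radius` is returned, so radius=0.0 selects the
    nearest match.
    """
    v = V.reshape(-1)
    n = D.shape[0]
    dists = np.empty(n - w + 1)
    for k in range(n - w + 1):
        u = D[k:k + w].reshape(-1)
        sim = np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u))
        dists[k] = 1.0 - sim
    return int(np.argmin(np.abs(dists - radius)))

# A query copied out of the database should match its own location.
rng = np.random.default_rng(0)
D = rng.standard_normal((50, 4))   # 50 frames of 4-d features
V = D[17:20]                       # a 3-frame query shingle
print(shingle_match(V, D, w=3))    # 17
```

Because the similarity is normalized, the distance always lies in a fixed range, which is what makes a user-controlled target radius meaningful across different inputs.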
By selecting ranges of coefficients, the matching can be made sensitive to different audio properties such as timbre or pitch. In a typical audio application, the first 12 cepstral coefficients are used to characterize only wide-band, or timbral, spectral information. In SoundSpotter, the remaining 77 coefficients are used to match increasingly narrow-band, or higher-frequency, components of the spectrum. Range selection over the LFCC feature allows matching to be steered toward timbral features, pitch features or both. The matching algorithm is able to work on time-varying feature selection because the selected range of features defines a complete metric sub-space for the matching algorithm.

Segmentation is controlled in real time by a parameter w, the length of the sequence to match. w can be controlled either directly by the user or by a segmenter that sets w at each onset point to the last inter-onset interval. The previous w feature vectors of the input stream are then used for matching.

Real-time control over the search radius sets the target Euclidean distance for matching. For example, when the search radius is zero, the segment closest to zero distance is selected for output. If the radius is set to 0.4, then the segment closest to a distance of 0.4 from the input shingle is selected. The distance range is between 0 (most similar) and 1 (most dissimilar), obtained by normalizing features to unit norm in the shingle-match function.

[Figure 4: Audio shingling. The input sequence is advanced over the database and a multi-dimensional convolution operation gives the similarity values.]

4. SOUND-DRIVEN ALGORITHMIC APPROACHES TO VIDEO EDITING: REMIX-TV

Audio analysis and segmentation can be used as an effective method for algorithmic video editing.
Representational video material depicting real-world events often contains large amounts of synchronous audio material. Events captured on screen are almost always accompanied by direct or related sound events, either as a result of or in response to activity that can be seen [3]. Audio material within a video stream can therefore be used effectively as region markers for visual events in a large number of cases. Audio analysis is thus an acceptable starting point for the development of algorithmic video-editing strategies; in addition, it can be used as a trigger for video manipulation. Visual material can be re-sequenced using a number of specific strategies, for example probabilistic methods or real-time edit decisions. Significantly, SoundSpotter allows visual material to be re-sequenced through the process of audio matching in live performance.
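As a minimal illustration of the probabilistic re-sequencing strategy mentioned above (a hypothetical sketch, not code from either system), audio-derived regions can be drawn at random without immediate repetition:

```python
import random

def resequence(regions, length, seed=None):
    """Probabilistically re-sequence segmented regions.

    regions: list of (start_ms, end_ms) pairs from audio segmentation.
    Returns a playback order of `length` regions, drawn at random but
    never repeating the same region twice in a row - one simple
    re-sequencing strategy among many possible ones.
    """
    rng = random.Random(seed)
    order, prev = [], None
    for _ in range(length):
        choice = rng.choice([r for r in regions if r != prev])
        order.append(choice)
        prev = choice
    return order

regions = [(0, 400), (400, 900), (900, 1300)]
print(resequence(regions, 5, seed=1))
```

Real-time edit decisions or audio matching can replace the random draw while keeping the same region representation.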

Previous Work

Previous work utilising similar approaches includes Weapons of Mass Deconstruction [4], a software suite for the analysis and real-time editing of audiovisual material. In this system, a low-pass filter is used for event-onset detection. The process can operate in either real time (using RAM) or non-real time (using the hard disk). In non-real-time mode, video files are prepared first: audio and video streams are rendered as separate files with identical file names, and when a video file is loaded its audio stream is analysed. Real-time mode allows for the analysis and re-sequencing of audiovisual material with a specified delay that is limited by available RAM and equal to the overall length of the material in the region list. This list is continually updated through the use of a dual alternating buffer system: while one set of data is being used for editing, the next is being analysed.

Region start points are stored in a buffer in millisecond units, and regions are assumed to end one millisecond before the next region starts. The audio region start point is used to calculate the video start frame. For a video file with a frame rate of 25 frames per second, region start points specified in milliseconds are divided by 40 (1000/25), giving the corresponding frame number whilst allowing audio regions to retain their timing accuracy. This maintains perfect audio-video synchronisation even in cases where extreme audio and video manipulation is performed.

Remix TV

SoundSpotter's Remix-TV functionality improves on previous systems in two key areas. First, SoundSpotter's audio analysis and segmentation approach is significantly more advanced, allowing re-sequencing to be driven by audio matching. Sonic events in a video stream can be located and restructured in real time, with connected visual material synchronised automatically.
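The millisecond-to-frame mapping described above can be sketched as follows (a minimal illustration; the function name is hypothetical):

```python
def ms_to_frame(start_ms, fps=25):
    """Map an audio region start point in milliseconds to a video frame.

    At 25 fps each frame lasts 1000 / 25 = 40 ms, so the frame number is
    the start time divided by 40, truncated to the frame boundary, while
    the audio region keeps its millisecond accuracy.
    """
    return int(start_ms // (1000 / fps))

print(ms_to_frame(1000))   # 25: one second in
print(ms_to_frame(1015))   # 25: same frame, audio offset retained
```

Truncating only the video index, while leaving the audio start point in milliseconds, is what preserves synchronisation under heavy manipulation.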
Secondly, SoundSpotter's 'livespot' mode dispenses with the need for a real-time dual alternating buffer system whilst enhancing real-time functionality. On computing an audio match, SoundSpotter frame values are converted to video frame numbers by scaling the output to the frame rate (25 frames per second in most circumstances). SoundSpotter outputs synchronisation points related to its hop size; to maintain synchronisation, video frames are matched to the nearest millisecond. The region size is calculated relative to the window size, so that video playback halts on completion of a loop equal to the audio segment length.

This system can be used for remixing continuous visual material via audio matching. Performers can present specific sections of video material by matching known audio sequences that correspond with specific moments, or present sequences spontaneously. In this way, video and film material can be re-edited in an improvised fashion, or to a structured score.

An effective use of SoundSpotter's Remix-TV functionality is real-time non-linear re-editing of narratives. Musical motifs are often used to underscore specific characters, events and themes in narrative cinema [6]. SoundSpotter allows for the accurate recall of scenes in cinema and television through real-time live matching of musical material specific to particular films. The entire corpus of Hitchcock's Vertigo (1958), complete with Bernard Herrmann's score, can be analysed. If a performer then plays the love theme from Vertigo into SoundSpotter, there is a high chance (given the correct settings) that the audience will see an image of James Stewart kissing Kim Novak. In this way, the entire musical narrative structure of a cinematic work can be navigated by a performer, providing a non-linear approach to cinematic viewing based on musical, thematic, character-based or motif search.
As SoundSpotter efficiently encodes elements of texture in addition to pitch sequences, specific scenes featuring both sound effects and music can be triggered by using a similar sound as a search query. Scenes can be analysed by SoundSpotter and used to trigger each other. In this way, similarities in film form can be explored in live performance via feature extraction and audio matching.

REFERENCES

[1] Casey, M. and Slaney, M. "Song Intersection by Approximate Nearest Neighbor Search", Proc. ISMIR, 2006.
[2] Casey, M. and Slaney, M. "The Importance of Sequences in Music Similarity", Proc. ICASSP, 2006.
[3] Chion, M. Audio-Vision: Sound on Screen, Columbia University Press, 1994.
[4] Grierson, M. Audiovisual Composition, Thesis, University of Kent, 2005.
[5] Herre, J., Allamanche, E., Hellmuth, O. and Kastner, T. "Robust Identification/Fingerprinting of Audio Signals Using Spectral Flatness Features", JASA, Vol. 111, 2002.
[6] Kalinak, K. Settling the Score: Music and the Classical Hollywood Film, University of Wisconsin Press, 1992.
[7] Kapur, A., Benning, M. and Tzanetakis, G. "Query-by-Beat-Boxing: Music Retrieval for the DJ", Proc. ISMIR, Barcelona, 2004.
[8] Lazier, A. and Cook, P. "Mosievius: Feature Driven Interactive Audio Mosaicing", Proc. DAFx-03, 2003.
[9] Muller, M., Kurth, F. and Clausen, M. "Audio Matching via Chroma-Based Statistical Features", Proc. ISMIR, London, 2005.
[10] Pampalk, E., Flexer, A. and Widmer, G. "Improvements of Audio-Based Music Similarity and Genre Classification", Proc. ISMIR, pp. 628-633, 2005.
[11] Schwarz, D. "A System for Data-Driven Concatenative Sound Synthesis", Proc. DAFx-00, Verona, Italy, 2000.
[12] Sturm, B. "MATConcat: An Application for Exploring Concatenative Sound Synthesis Using MATLAB", Proc. ICMC, Miami, 2004.
[13] Tzanetakis, G. and Cook, P. "Musical Genre Classification of Audio Signals", IEEE Transactions on Speech and Audio Processing, 10(5):293-302, 2002.
[14] Wang, A. and Smith, J.
"System and Methods for Recognizing Sound and Music Signals in High Noise and Distortion", United States Patent 6990453, 2006.
[15] Zils, A. and Pachet, F. "Musical Mosaicing", Proc. DAFx-01, Limerick, Ireland, 2001.
[16]
[17]