A Timbre Analysis And Classification Toolkit For Pure Data

Brent, William


Figure 4. Sending training snapshots and continuous overlapping cepstral analyses to timbreID.

[...] active attribute range to use only the 2nd through 40th coefficients in similarity calculations. [...] with two instances of Pd's list splitting object keeps timbreID's feature database more compact. The choice of cepstral coefficient range 2 through 40 is somewhat arbitrary, but it is very easy to experiment with different ranges by changing the arguments of the two list split objects.

In order to train the system on 3 vowels, about 5 snapshots must be captured during training examples of each sung vowel. In order to distinguish background noise, 5 additional snapshots should be taken while the vocalist is silent. Next, the "cluster" message is sent with an argument of 4, which automatically groups similar analyses so that the first vowel is represented by cluster 0, the second vowel by cluster 1, and so on. The cluster associated with background noise will end up as cluster 3. It is not necessary to ensure that each vowel receives the same number of analyses. If there were 7 training examples for the first vowel and only 5 for the others, the clustering algorithm should still group the analyses correctly. Clustering results can be verified by sending the "clusterlist" message, which sends a list of any particular cluster's members out of timbreID's fourth outlet.

To switch from training to classification, cepstrum~'s pre-processed output must be connected to timbreID's second inlet. The actual example patch contains a few routing objects to avoid manual re-patching, but they are omitted here for clarity. Activating the metro in Figure 4 enables continuous overlapping analysis. If finer time resolution is desired for even faster response, the metro's rate can be set to a shorter duration. Here, the rate is set to half the duration of the analysis window size in milliseconds, which corresponds to an overlap of 2.
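
The overlap arithmetic and the coefficient trimming described above can be sketched outside of Pd. The following Python is only an illustration, not the patch itself: the 1024-sample window and 44.1 kHz sampling rate are assumed values, both function names are hypothetical, and counting coefficients from 1 is an assumed reading of "2nd through 40th".

```python
def metro_period_ms(window_size, sample_rate, overlap):
    """Return the metro period (ms) that yields `overlap` analyses per window."""
    window_ms = 1000.0 * window_size / sample_rate
    return window_ms / overlap


def select_coefficients(cepstrum, first=2, last=40):
    """Keep the `first` through `last` coefficients, counted from 1 (assumed)."""
    return list(cepstrum[first - 1:last])


# Assumed settings: 1024-sample window at 44100 Hz, overlap of 2.
period = metro_period_ms(1024, 44100, 2)

# A stand-in 512-point cepstrum; real values would come from cepstrum~.
trimmed = select_coefficients(range(512))
```

Under these assumed settings the metro period comes to roughly 11.6 ms, and each stored feature holds 39 values rather than the full cepstrum. Raising `overlap` to 4 would halve the period again, at the cost of twice as many analyses per second.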
As each analysis is passed from cepstrum~ to timbreID, a nearest match is identified and its associated cluster index is sent out timbreID's first outlet. The example patch animates vowel classifications as they occur.

4.2. Target-based Concatenative Synthesis

New challenges arise when a constant stream of input features must be compared against a large database in real time. The vowel recognition example only requires a feature database containing about 20 instances. To obtain interesting results from target-based concatenative synthesis, the database must be much larger, with thousands rather than dozens of instances. This type of synthesis can be achieved using the systems mentioned in section 1, and is practiced live by the artist sCrAmBlEd?HaCkZ! using his own software design [5]. The technique is to analyze short, overlapping frames of an input signal, find the most similar sounding audio frame in a pre-analyzed corpus of unrelated audio, and output a stream of the best-matching frames at the same rate and overlap as the input.

The example included with timbreID provides an audio corpus consisting of 5 minutes of bowed string instrument samples. As an audio signal comes in, an attempt at reconstructing the signal using grains from the bowed string corpus is output in real time. Audio examples demonstrating the results can be accessed at www.williambrent.com.

In these types of applications, timbreID's third inlet can be used to search large feature databases. Classification requests sent to the third inlet are restricted by a few additional parameters. For instance, the search for a nearest match can be carried out on a specified subset of the database by setting the "searchcenter" and "neighborhood" parameters.
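
The idea of a restricted nearest-match search can be sketched as follows. This is a rough illustration under stated assumptions, not timbreID's implementation: the function names are hypothetical, Euclidean distance is assumed as the similarity measure, and the neighborhood is assumed to be a window of instances centered on the search center and clipped at the ends of the database.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_match(database, query, search_center=None, neighborhood=None):
    """Return the index of the database instance closest to `query`.

    If both `search_center` and `neighborhood` are given, only a window of
    about `neighborhood` instances around `search_center` is searched
    (an assumed interpretation of the two parameters).
    """
    lo, hi = 0, len(database)
    if search_center is not None and neighborhood is not None:
        half = neighborhood // 2
        lo = max(0, search_center - half)
        hi = min(len(database), search_center + half + 1)
    return min(range(lo, hi), key=lambda i: euclidean(database[i], query))


# Toy one-dimensional "features" standing in for BFCC vectors.
db = [[float(i)] for i in range(100)]
full = nearest_match(db, [90.0])
restricted = nearest_match(db, [90.0], search_center=20, neighborhood=10)
```

In this toy database the unrestricted search returns instance 90, while the restricted search can only return the closest instance inside its window (index 25). The restricted call also evaluates 11 distances instead of 100, which is where the savings on a large database would come from.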
The concatenative synthesis example provides options for different grain sizes and analysis rates, but with default settings, the process of computing a BFCC feature for the input signal, comparing it with 2500 instances in the feature database, and playing back the best-matching grain occurs at a rate of 43 times per second. On a 2.91 GHz Intel Core 2 Duo machine running Fedora 11 with 4 GB of RAM, the processor load is about 17%. By lowering the neighborhood setting, this load can be reduced. However, reducing processor load is not the only reason that restricted searches are useful. A performer may also wish to control the region of the audio corpus from which to synthesize. A third parameter, "reorient", causes searchcenter to be continually updated to the current best match during active synthesis. With matches occurring 43 times per second, the search range adapts very quickly to changes in the input signal, finding an optimal region of sequential grains from which to draw.
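
The effect of "reorient" on a stream of matches can be illustrated with a small sketch. It assumes, hypothetically, scalar features, absolute difference as the distance, and a search window centered on the search center; the function name `match_stream` is invented for this example and is not part of timbreID.

```python
def match_stream(database, frames, search_center, neighborhood, reorient=True):
    """Return the best-matching database index for each input frame."""
    matches = []
    half = neighborhood // 2
    for frame in frames:
        # Restrict the search to a window around the current search center.
        lo = max(0, search_center - half)
        hi = min(len(database), search_center + half + 1)
        best = min(range(lo, hi), key=lambda i: abs(database[i] - frame))
        matches.append(best)
        if reorient:
            # Follow the input by re-centering on the latest best match.
            search_center = best
    return matches


# Toy corpus of 2500 scalar "features" and a slowly drifting input.
db = [float(i) for i in range(2500)]
frames = [1010.0, 1020.0, 1030.0, 1040.0]
tracking = match_stream(db, frames, search_center=1000, neighborhood=20)
fixed = match_stream(db, frames, search_center=1000, neighborhood=20,
                     reorient=False)
```

With re-centering enabled, the window follows the drifting input and every frame finds its exact counterpart. With it disabled, the matches stay pinned at the edge of the original window (index 1010), even though each frame still costs only 21 comparisons rather than 2500.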
