Figure 4. Sending training snapshots and continuous overlapping cepstral analyses to timbreID.
with two instances of Pd's list splitting object keeps timbreID's feature database more compact. The choice of cepstral coefficient range 2 through 40 is somewhat arbitrary,
but it is very easy to experiment with different ranges by
changing the arguments of the two list split objects.
In order to train the system on 3 vowels, about 5 snapshots must be captured during training examples of each
sung vowel. In order to distinguish background noise, 5
additional snapshots should be taken while the vocalist is
silent. Next, the "cluster" message is sent with an argument
of 4, which automatically groups similar analyses so that the
first vowel is represented by cluster 0, the second vowel by
cluster 1, and so on. The cluster associated with background
noise will end up as cluster 3. It is not necessary to ensure
that each vowel receives the same number of analyses. If
there were 7 training examples for the first vowel and only
5 for the others, the clustering algorithm should still group
the analyses correctly. Clustering results can be verified by
sending the "clusterlist" message, which sends a list of any
particular cluster's members out of timbreID's fourth outlet.
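The grouping step can be illustrated with a minimal sketch. The text does not say which clustering algorithm timbreID uses, so average-linkage agglomerative clustering is assumed here purely for illustration. The two-dimensional feature vectors, the `snapshots` helper, and the rule of numbering clusters by their earliest member are likewise hypothetical stand-ins, not timbreID's actual behaviour.

```python
import math

def cluster(instances, num_clusters):
    """Merge the closest pair of clusters (average linkage) until num_clusters remain.

    Returns lists of instance indices, ordered by each cluster's earliest member.
    """
    clusters = [[i] for i in range(len(instances))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                avg = sum(math.dist(instances[i], instances[j])
                          for i, j in pairs) / len(pairs)
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # b > a, so index a is unaffected
    return sorted(clusters, key=lambda members: members[0])

def snapshots(center, count):
    """Hypothetical training snapshots: small deterministic offsets around a center."""
    return [(center[0] + 0.1 * k, center[1] - 0.1 * k) for k in range(count)]

# Unequal numbers of training examples per class, as in the text.
training = (snapshots((0.0, 0.0), 7)        # first vowel, 7 examples
            + snapshots((10.0, 0.0), 5)     # second vowel
            + snapshots((0.0, 10.0), 5)     # third vowel
            + snapshots((10.0, 10.0), 5))   # silence / background noise

clusters = cluster(training, 4)
for index, members in enumerate(clusters):
    print(index, members)
```

With well-separated classes, the uneven group sizes do not affect the result: each group of snapshots ends up in its own cluster.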
To switch from training to classification, cepstrum~'s
pre-processed output must be connected to timbreID's second inlet. The actual example patch contains a few routing
objects to avoid manual re-patching, but they are omitted
here for clarity. Activating the metro in Figure 4 enables
continuous overlapping analysis. If finer time resolution is
desired for even faster response, the metro's rate can be set
to a shorter duration. Here, the rate is set to half the duration of the analysis window size in milliseconds, which
corresponds to an overlap of 2. As each analysis is passed
from cepstrum~ to timbreID, a nearest match is identified
and its associated cluster index is sent out timbreID's first
outlet. The example patch animates vowel classifications as
they occur.
... active attribute range to use only the 2nd through 40th coefficients in similarity calculations.
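The nearest-match step can be sketched as follows. Euclidean distance is an assumption, since the text only says that a nearest match is identified. The function name `nearest_cluster`, the `cluster_of` lookup table, and the toy two-dimensional features are hypothetical.

```python
import math

def nearest_cluster(feature, database, cluster_of):
    """Return the cluster index of the database instance closest to `feature`."""
    best_index = min(range(len(database)),
                     key=lambda i: math.dist(feature, database[i]))
    return cluster_of[best_index]

# Toy database: five trained instances and the cluster each one belongs to.
database = [(0.0, 0.0), (0.2, 0.1), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
cluster_of = [0, 0, 1, 2, 3]

# Each incoming analysis is classified independently as it arrives.
incoming = [(0.3, -0.2), (9.5, 0.4), (9.0, 9.0)]
labels = [nearest_cluster(f, database, cluster_of) for f in incoming]
print(labels)
```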
4.2. Target-based Concatenative Synthesis
Some new challenges arise when comparing a constant stream of input features against a large database in
real time. The vowel recognition example only requires a
feature database containing about 20 instances. To obtain
interesting results from target-based concatenative synthesis, the database must be much larger, with thousands rather
than dozens of instances. This type of synthesis can be
achieved using the systems mentioned in section 1, and is
practiced live by the artist sCrAmBlEd?HaCkZ! using his
own software design [5]. The technique is to analyze short,
overlapping frames of an input signal, find the most similar
sounding audio frame in a pre-analyzed corpus of unrelated
audio, and output a stream of the best-matching frames at
the same rate and overlap as the input.
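The technique can be sketched offline in a few lines. This is an illustration under stated assumptions, not the timbreID patch: the `feature` function uses RMS level and zero-crossing rate as a cheap placeholder for a real timbre feature such as BFCCs, the overlap factor is fixed at 2, grains are Hann-windowed before overlap-add, and all names and test signals are hypothetical.

```python
import math

def frames(signal, size, hop):
    """Split a signal into overlapping frames of `size` samples, `hop` apart."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def feature(frame):
    """Placeholder feature (RMS level, zero-crossing rate); stands in for BFCCs."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return (rms, crossings / len(frame))

def resynthesize(target, corpus, size=64):
    """Rebuild `target` from the best-matching corpus frames, overlap factor 2."""
    hop = size // 2
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / size) for n in range(size)]
    corpus_frames = frames(corpus, size, hop)          # pre-analyzed corpus
    corpus_features = [feature(f) for f in corpus_frames]
    output = [0.0] * len(target)
    for k, frame in enumerate(frames(target, size, hop)):
        wanted = feature(frame)
        best = min(range(len(corpus_frames)),
                   key=lambda i: math.dist(wanted, corpus_features[i]))
        for n, sample in enumerate(corpus_frames[best]):
            output[k * hop + n] += window[n] * sample  # overlap-add the grain
    return output

def tone(freq, length, amp=1.0, rate=8000):
    """Hypothetical test signal: a sine tone of `length` samples."""
    return [amp * math.sin(2 * math.pi * freq * n / rate) for n in range(length)]

corpus = tone(200, 512) + tone(800, 512) + tone(2000, 512, amp=0.3)
target = tone(800, 256)
result = resynthesize(target, corpus)
print(len(result))
```

The output has the same length, frame rate, and overlap as the input, but every sample in it comes from the corpus.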
The example included with timbreID provides an audio
corpus consisting of 5 minutes of bowed string instrument
samples. As an audio signal comes in, an attempt at reconstructing the signal using grains from the bowed string corpus is output in real time. Audio examples demonstrating
the results can be accessed at www.williambrent.com.
In these types of applications, timbreID's third inlet can
be used in order to search large feature databases. Classification requests sent to the third inlet are restricted by a few
additional parameters. For instance, the search for a nearest match can be carried out on a specified subset of the
database by setting the "searchcenter" and "neighborhood"
parameters.
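A restricted search of this kind might look like the sketch below. The exact semantics of the two parameters are not given in the text. Here "neighborhood" is assumed to be a number of instances centred on "searchcenter" and clipped at the ends of the database. The function name and the one-dimensional toy features are hypothetical.

```python
import math

def restricted_match(feature, database, searchcenter, neighborhood):
    """Nearest match among roughly `neighborhood` instances centred on `searchcenter`."""
    half = neighborhood // 2
    start = max(0, searchcenter - half)
    stop = min(len(database), searchcenter + half + 1)
    # Only (stop - start) distances are computed instead of len(database).
    return min(range(start, stop), key=lambda i: math.dist(feature, database[i]))

# Toy database in which the same feature values recur throughout the corpus.
database = [(float(i % 10),) for i in range(100)]

anywhere = restricted_match((3.0,), database, 0, 2 * len(database))
near_50 = restricted_match((3.0,), database, 50, 20)
print(anywhere, near_50)
```

The same query returns a different instance depending on where the search is centred, and the restricted search inspects far fewer instances.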
The concatenative synthesis example provides options
for different grain sizes and analysis rates, but with default
settings, the process of computing a BFCC feature for the
input signal, comparing it with 2500 instances in the feature
database, and playing back the best-matching grain occurs
at a rate of 43 times per second. Using a 2.91 GHz Intel Core
2 Duo machine running Fedora 11 with 4 GB of RAM, the
processor load is about 17%. By lowering the neighborhood
setting, this load can be reduced. However, reducing processor load is not the only reason that restricted searches are
useful. A performer may also wish to control the region
of the audio corpus from which to synthesize.
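The rate of 43 analyses per second is consistent with a metro interval of half the analysis window, under two assumptions that the text does not state: a sample rate of 44100 Hz and a window of 2048 samples. The arithmetic, with those assumed values, is:

```python
SAMPLE_RATE = 44100    # Hz (assumed, not stated in the text)
WINDOW_SIZE = 2048     # samples (assumed, not stated in the text)
OVERLAP = 2            # metro interval is half the window duration
DATABASE_SIZE = 2500   # instances, from the text

window_ms = 1000.0 * WINDOW_SIZE / SAMPLE_RATE   # duration of one analysis window
metro_ms = window_ms / OVERLAP                   # interval between analyses
rate = 1000.0 / metro_ms                         # analyses per second
comparisons = DATABASE_SIZE * rate               # distance calculations per second

print(round(window_ms, 2), round(metro_ms, 2), round(rate, 2), round(comparisons))
```

Under these assumptions an unrestricted search performs on the order of 100,000 distance calculations per second. The neighborhood setting lowers the processor load by reducing that number.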
A third parameter, "reorient", causes searchcenter to be
continually updated to the current best match during active
synthesis. With matches occurring 43 times per second, the
search range adapts very quickly to changes in the input
signal, finding an optimal region of sequential grains from
which to draw.
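The effect of reorienting can be sketched by re-centring a restricted search on each new best match. As before, the neighborhood is assumed to be a count of instances centred on searchcenter. The `track` function and the toy database, in which the feature value rises steadily along the corpus, are hypothetical.

```python
import math

def track(features, database, searchcenter, neighborhood, reorient=True):
    """Match each incoming feature within a window that can follow the best match."""
    matches = []
    half = neighborhood // 2
    for feature in features:
        start = max(0, searchcenter - half)
        stop = min(len(database), searchcenter + half + 1)
        best = min(range(start, stop),
                   key=lambda i: math.dist(feature, database[i]))
        matches.append(best)
        if reorient:
            searchcenter = best  # re-centre the search on the current best match
    return matches

database = [(float(i),) for i in range(100)]
gliding_input = [(12.0,), (18.0,), (25.0,), (31.0,)]

followed = track(gliding_input, database, searchcenter=10, neighborhood=10)
fixed = track(gliding_input, database, searchcenter=10, neighborhood=10,
              reorient=False)
print(followed, fixed)
```

With reorienting enabled the search window drifts after the input. With it disabled the matches stay pinned to the edge of the original window.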