USING A PERCEPTUALLY BASED TIMBRE METRIC FOR PARAMETER
CONTROL ESTIMATION IN PHYSICAL MODELING SYNTHESIS
Hiroko Terasawa, Jonathan Berger, Julius O. Smith
Center for Computer Research in Music and Acoustics (CCRMA)
Department of Music, Stanford University
ABSTRACT
Manual adjustment of control parameters for physical modeling synthesis is time-intensive and often reduces to arbitrary, haphazard parameter tweaking. An efficient approach to automatic parameter estimation, the goal of this study, could eliminate much of the hit-or-miss nature of parameter tuning by finding optimal control parameters for physical modeling synthesis. The method is based on psychoacoustically motivated timbre distance estimations between a recorded reference sound and a set of corresponding synthesized sounds. The timbre comparisons are based upon the sample mean and standard deviation of Mel-Frequency Cepstral Coefficients (MFCCs) computed over several steady-state time frames from the reference and synthesized sounds. This framework serves as a preliminary model of the auditory feedback loop in music instrument performance.
1. INTRODUCTION
Music instrument performance is a complex sensorimotor behavior. Through training with auditory timbre feedback, the production task becomes finely controllable and seemingly automatic. Attainment of expertise can be described as the stage at which timbre is conceptualized and an integrated set of control parameters, some with a remarkable degree of subtlety, is set with seemingly effortless thought. For the composer and orchestrator, expertise involves a similar conceptualization of the desired timbre and knowledge of the notational cues needed to suggest appropriate production to the performer.
In the domain of digital music synthesis, the absence of efficient integration between conceptualized timbre and parameter controls impedes effective and efficient timbre manipulation. In order to create a better (i.e., efficient, intuitive, and interactive) auditory feedback loop, a composer/performer of a physical model would benefit from a perceptually informed algorithm that estimates timbre control parameters.
A primary goal of physical model (PM) synthesis is to
create sound that convincingly and compellingly approximates the sound of a humanly performed instrument [1, 2].
Coupling a model of the performer's technique with the auditory feedback loop that shapes the musical sound is an essential step toward achieving this goal.
Some work toward this goal has been done, including studies that embed expressive nuance in MIDI scores using the KTH rules [3] and studies on parameter estimation for PM synthesis. Diana Young and Stefania Serafin investigated the playability of a violin physical model controlled by bow force and bow position [4]. Caroline Traube et al. estimated the plucking position of a guitar using the spectral centroid for timbre evaluation [5]. The analysis-synthesis research group at IRCAM has worked on the estimation of trumpet control parameters by diverse approaches: inverting the trumpet physical model [6], vector quantization on cepstral coefficients and their derivatives [7], and minimization of two perceptual similarity criteria as a function of the control parameters [8]; our approach closely resembles the last of these. In spite of the difficulty of controlling a non-linear system with delayed feedback, these attempts have been quite successful. Guillemain et al. investigated the distribution of clarinet timbre as a function of control parameters using classical timbre descriptors [9].
Our goal is to create a control system for PM synthesis that produces 1) the desired pitch, loudness, and timbre at 2) the desired time, so that the system can integrate with existing expressive-nuance rendering systems in which pitch, loudness, and time are the typical control parameters.
In addition to the primary PM synthesis inputs of pitch and time, it is important to address loudness and timbre. Mel-Frequency Cepstral Coefficients (MFCCs) provide a perceptually valid metric that captures both loudness and timbre [10].
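As a rough illustration of the MFCC analysis chain (not the authors' implementation), the coefficients can be computed from scratch with NumPy and SciPy: frame the signal, take the power spectrum, apply a mel-spaced triangular filterbank, log-compress, and take a DCT. The frame size, hop, and filter counts below are arbitrary illustrative choices, and a synthetic sine tone stands in for a recorded clarinet.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frames(signal, sr, n_fft=1024, hop=512, n_filters=26, n_coeffs=13):
    """One MFCC vector per Hann-windowed frame of `signal`."""
    fb = mel_filterbank(n_filters, n_fft, sr)
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + n_fft] * window)) ** 2
        mel_energy = np.maximum(fb @ spectrum, 1e-10)   # floor to avoid log(0)
        frames.append(dct(np.log(mel_energy), norm='ortho')[:n_coeffs])
    return np.array(frames)

# Example: MFCCs of a 1-second synthetic 440 Hz tone
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
M = mfcc_frames(tone, sr)
print(M.shape)  # → (30, 13): 30 frames, 13 coefficients each
```

Library implementations (e.g., `librosa.feature.mfcc`) wrap the same chain with more refined defaults; the sketch above only shows the structure of the computation.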
Our ultimate aim is to provide a means of coupling a conceptualization of a desired timbre to the production parameters of physical models of acoustic instruments. There are two primary motivations for our work:
1. to provide improved performance and practicality
in music composition and performance of physical
models,
2. as a preliminary step toward a model of the auditory
feedback loop in music instrument performance.
Our research employs the STK clarinet PM [11] and focuses on breath pressure and breath turbulence as variable control parameters. Recordings of acoustic clarinet tones are used as reference tones to imitate. In order to evaluate the perceptual timbre difference, the synthesized sounds and a reference sound are compared in terms of the mean and standard deviation of their MFCCs.