USING A PERCEPTUALLY BASED TIMBRE METRIC FOR PARAMETER
CONTROL ESTIMATION IN PHYSICAL MODELING SYNTHESIS
Hiroko Terasawa, Jonathan Berger, Julius O. Smith
Center for Computer Research in Music and Acoustics (CCRMA)
Department of Music, Stanford University
ABSTRACT
Manual adjustment of control parameters for physical modeling synthesis is time-intensive and often reduces to arbitrary, haphazard parameter tweaking. An efficient approach to automatic parameter estimation, the goal of this study, could eliminate much of the hit-or-miss nature of parameter tuning by finding optimal control parameters for physical modeling synthesis. The method is based on psychoacoustically motivated timbre distance estimations between a recorded reference sound and a set of corresponding synthesized sounds. The timbre comparisons are based upon the sample mean and standard deviation of Mel-Frequency Cepstral Coefficients (MFCCs) computed over several steady-state time frames from the reference and synthesized sounds. This framework serves as a preliminary model of the auditory feedback loop in music instrument performance.
1. INTRODUCTION
Music instrument performance is a complex sensorimotor behavior. Through training with auditory timbre feedback, the production task becomes finely controllable and seemingly automatic. Attainment of expertise can be described as the stage at which timbre is conceptualized and an integrated set of control parameters, some with a remarkable degree of subtlety, is set with seemingly effortless thought. For the composer and orchestrator, expertise involves a similar conceptualization of the desired timbre and knowledge of the notational cues needed to suggest appropriate production to the performer.
In the domain of digital music synthesis, the absence of efficient integration between conceptualized timbre and parameter controls impedes effective and efficient timbre manipulation. In order to create a better (i.e., efficient, intuitive, and interactive) auditory feedback loop, a composer/performer of a physical model would benefit from a perceptually informed algorithm that estimates timbre control parameters.
A primary goal of physical model (PM) synthesis is to
create sound that convincingly and compellingly approximates the sound of a humanly performed instrument [1, 2].
Coupling a model of the performer's technique with the auditory feedback loop that shapes the musical sound is an essential step toward achieving this goal.
Some work toward this goal has been done, including studies that embed expressive nuance in MIDI scores using the KTH rules [3] and studies on parameter estimation for PM synthesis. Diana Young and Stefania Serafin investigated the playability of a violin physical model controlled by bow force and bow position [4]. Caroline Traube et al. estimated the plucking position of a guitar using the spectral centroid for timbre evaluation [5]. The analysis-synthesis research group at IRCAM has worked on the estimation of trumpet control parameters by diverse approaches: inverting the trumpet physical model [6], vector quantization on cepstral coefficients and their derivatives [7], and minimization of two perceptual similarity criteria as a function of the control parameters [8]; our approach closely resembles the last of these. In spite of the difficulty of controlling a non-linear system with delayed feedback, these attempts have been quite successful. Guillemain et al. investigated the distribution of clarinet timbre as a function of control parameters using classical timbre descriptors [9].
Our goal is to create a control system for PM synthesis that produces 1) the desired pitch, loudness, and timbre at 2) the desired time, so that the system can integrate with existing expressive-nuance rendering systems in which pitch, loudness, and time are the typical control parameters.
In addition to the primary PM synthesis inputs of pitch and time, it is important to address loudness and timbre. Mel-Frequency Cepstral Coefficients (MFCCs) provide a perceptually valid metric that captures both loudness and timbre [10].
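As a rough illustration of the MFCC analysis chain (not the authors' implementation), the coefficients can be computed from scratch with NumPy and SciPy: frame the signal, take the power spectrum, apply a mel-spaced triangular filterbank, log-compress, and take a DCT. The frame size, hop, and filter counts below are arbitrary illustrative choices, and a synthetic sine tone stands in for a recorded clarinet.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frames(signal, sr, n_fft=1024, hop=512, n_filters=26, n_coeffs=13):
    """One MFCC vector per Hann-windowed frame of `signal`."""
    fb = mel_filterbank(n_filters, n_fft, sr)
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + n_fft] * window)) ** 2
        mel_energy = np.maximum(fb @ spectrum, 1e-10)   # floor to avoid log(0)
        frames.append(dct(np.log(mel_energy), norm='ortho')[:n_coeffs])
    return np.array(frames)

# Example: MFCCs of a 1-second synthetic 440 Hz tone
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
M = mfcc_frames(tone, sr)
print(M.shape)  # → (30, 13): 30 frames, 13 coefficients each
```

Library implementations (e.g., `librosa.feature.mfcc`) wrap the same chain with more refined defaults; the sketch above only shows the structure of the computation.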
Our ultimate aim is to provide a means of coupling a conceptualization of a desired timbre to the production parameters of physical models of acoustic instruments. There are two primary motivations for our work:
1. to provide improved performance and practicality
in music composition and performance of physical
models,
2. as a preliminary step toward a model of the auditory
feedback loop in music instrument performance.
Our research employs the STK clarinet PM [11] and focuses on breath pressure and breath turbulence as variable control parameters. Recordings of acoustic clarinet tones are used as reference tones to imitate. In order to evaluate the perceptual timbre difference, the synthesized sounds and a reference sound are compared in terms of the mean and standard deviation of their MFCCs.