COMBINING PHYSICAL MODELING AND ADDITIVE SYNTHESIS AS A MAPPING STRATEGY FOR REALTIME CONTROL

Philippe Guillemain
LMA-CNRS 1, 13 ch. Joseph Aiguier, 13402 Marseille Cedex 20, France

Vincent Verfaille
SPCL 2 & IDMIL 3, CIRMMT 4, McGill University, Montreal, Canada

ABSTRACT

This paper presents a mapping strategy for the control of a signal-based digital synthesis model of the clarinet. This strategy combines a synthesis model based on a physical model with a signal model based on additive synthesis. From the output of the physical model, specific sound descriptors are extracted and used to control an additive synthesis model based on the analysis of natural sounds. This approach benefits both from the control quality of physical modeling and from the sound quality provided by the additive resynthesis of natural sounds. However, some improvements are still required to better interface and tune the two models, particularly concerning transients.

1. INTRODUCTION

The control of self-sustained musical instruments such as woodwinds is a challenging problem. The nonlinear nature of their functioning implies that subtle variations of the very small number of continuous parameters controlled by the performer may have important effects on the perceived timbre. Moreover, gesture capture devices have been developed for some instruments such as the piano, whereas collecting performance data from a wind instrument player is a very difficult task.

Synthesis models based on the physics of the instruments have demonstrated their potential in terms of naturalness of control for more than fifteen years now. This is largely because such models analytically link synthesis parameters with physical continuous controls of the performer, such as blowing pressure or lip pressure on the reed. However, they somewhat lack naturalness in timbre: they allow synthesis of the sound of a clarinet but not the sound of this clarinet (which results from the unique combination of a specific player and a specific instrument).

On the other hand, signal models based on the analysis of natural sounds (such as additive analysis/resynthesis) offer the production of timbres perceptually identical to natural sounds. However, these models lack ease of control, which can be achieved only at the price of building and indexing additive databases (representing as many real-life situations as possible) and of designing specific mapping strategies. Another drawback is that the database indexing is done with respect to 'guessed' controls obtained from the database itself using interpolation mapping strategies, and not with respect to the physical controls, which are not available. For instance, both Escher [1] and Ssynth [2] use an additive database of sustained notes structured according to 'abstract parameters': instrument (related to timbre), performed pitch (related to mean fundamental frequency) and performed dynamics (related to mean sound level and to note brightness).

This paper presents an approach that intends to combine the advantages of both physical models and signal models by considering them as mapping layers of a synthesizer.

1 Laboratoire de Mécanique et d'Acoustique, Centre National de la Recherche Scientifique.
2 Sound Processing and Control Laboratory.
3 Input Device for Musical Interaction Laboratory.
4 Centre for Interdisciplinary Research in Music Media and Technology.
It consists first in replacing the physical model's sound output by a realtime sound descriptor extraction, thus providing an objective link between the physical controls and perceptually relevant features. Secondly, indexing the additive synthesis database with the 'guessed' controls is replaced by an indexing based on the sound descriptors related to pitch, loudness and timbre. Sound is finally generated by additive synthesis. Ideally tuned, such a physically-controlled additive resynthesis should result in a sound that is perceptually identical to a natural sound obeying the physical controls.

We now present and compare both synthesis models (sec. 2). After developing the synthesis by physically-controlled signal models and the selected sound descriptor set (sec. 3), we present our first results and discuss various limitations of the approach and of its actual implementation (sec. 4). We finally draw our first conclusions and propose some solutions for further investigations (sec. 5).

2. PHYSICAL MODELS AND SIGNAL MODELS

Physical models take the viewpoint of the performer and of the control. On the other hand, signal models take the listener's viewpoint. In this section, we summarize both approaches in the case of the clarinet.

2.1. Simplified physical model of the clarinet

The physical model used for the extraction of timbre descriptors is made of three coupled parts. The first part is linear and represents the bore of the instrument. The second part expresses the reed channel opening, linked to the reed displacement considered as a pressure-driven mass-spring oscillator. The third part couples the previous ones in a nonlinear way. In what follows, we use dimensionless variables for the pressure, flow, and reed displacement, according to [3].

2.1.1. Bore model

The bore is considered as a perfect cylinder of length L. Under classical hypotheses, the input impedance Z_e linking the acoustic pressure P_e and flow U_e in the mouthpiece is classically written in the Fourier domain as:

    Z_e(\omega) = P_e(\omega) / U_e(\omega) = i \tan(k(\omega) L)    (1)

The wavenumber k(\omega) corresponds to the classical viscothermal losses approximation: the bore radius is large with respect to the boundary layer thicknesses. Note that for any flow signal, the acoustic pressure mostly contains odd harmonics, as the input impedance corresponds to that of a quarter-wave resonator. At high frequencies, the increase of the losses taken into account in the wavenumber induces non-zero impedance values at even harmonic frequencies. Hence, if the flow contains high-frequency even harmonics, these will also appear in the pressure.

2.1.2. Reed model

The classical single-mode reed model we use describes the reed displacement x(t) with respect to its equilibrium point when it is submitted to an acoustic pressure p_e(t):

    (1/\omega_r^2) \, d^2x(t)/dt^2 + (q_r/\omega_r) \, dx(t)/dt + x(t) = p_e(t)    (2)

where \omega_r = 2\pi f_r and q_r are respectively the circular frequency and the quality factor of the reed resonance. The reed displacement behaves like the pressure below the reed resonance frequency, as a pressure amplifier around the reed resonance frequency, and as a low-pass filter at higher frequencies.

2.1.3. Nonlinear characteristics

The classical nonlinear characteristic used here is based on the steady Bernoulli equation. It links the acoustic flow (i.e. the product between the opening of the reed channel and the acoustic velocity) to the pressure difference between the bore and the player's mouth. The reed channel opening S(t) is expressed from the reed displacement by:

    S(t) = \Theta(1 - \gamma + x(t)) \cdot (1 - \gamma + x(t))

where \Theta denotes the Heaviside function, whose role is to keep the opening of the reed channel positive by cancelling it when 1 - \gamma + x(t) < 0.

The parameter \zeta characterizes the whole embouchure and takes into account the lip position and the section ratio between the mouthpiece opening and the resonator. It is proportional to the square root of the reed position H at equilibrium and inversely proportional to the reed resonance frequency. Common values of \zeta for the clarinet are between 0.2 and 0.6. The parameter \gamma is the ratio between the pressure p_m inside the player's mouth (assumed to be slowly varying over a sound period) and the static beating-reed pressure. In a lossless bore and with a massless reed model, \gamma evolves from 1/3, which is the oscillation threshold, to 1, which is the extinction threshold. \gamma = 1/2 corresponds to the position at which the reed starts beating against the lay.

Since the reed displacement corresponds to a linear filtering of the acoustic pressure, the reed opening mostly contains odd harmonics. Nevertheless, the \Theta function introduces a singularity in S(t) for playing conditions (given by \zeta and \gamma) yielding a complete closing of the reed channel (dynamic beating-reed case).
This leads to a sudden rise of even harmonics in S(t) (saturating nonlinearity) and to the generation of high frequencies. The acoustic flow is finally given by:

    u_e(t) = \zeta \, S(t) \, sign(\gamma - p_e(t)) \, \sqrt{|\gamma - p_e(t)|}    (3)

This nonlinear relation between pressure and opening of the reed channel explains why the flow spectrum contains all the harmonics.

2.1.4. Coupling of the reed and the resonator

Combining the impedance relation, the reed displacement and the nonlinear characteristics, the acoustic pressure, acoustic flow and reed displacement in the mouthpiece are solutions of the coupled equations (1), (2) and (3). The digital transcription of these equations and the computation scheme that explicitly solves this coupled system follow the method described in [3], implemented for realtime synthesis as a Max/MSP external written in C. The computation cost per sample of this simplified physical model of the clarinet is: three multiplications/additions for the reed model, four multiplications/additions for the resonator model, and six additions, seven multiplications and one square root for the explicit solution of the nonlinear system. This cost is negligible with respect to the cost of the additive synthesis.

2.1.5. External pressure

The external pressure is computed from the pressure and the flow in the mouthpiece by the relation:

    p_ext(t) = (1/2) (p_e(t) + u_e(t))

since P_ext(\omega) = (1/2) \exp(-i k(\omega) L) (P_e(\omega) + U_e(\omega)) and the propagation term \exp(-i k(\omega) L) can be ignored from a perceptual point of view. This relation corresponds to the simplest approximation of a monopolar radiation. This expression shows that in the clarinet spectrum, the odd harmonics are generated from both the flow and the pressure, while the even harmonics mostly come from the flow. An interpretation would be that the ratio between odd and even harmonics is a signature of the nonlinearity 'strength'.
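To make this self-sustained loop concrete, here is a deliberately crude sketch of such a model. It is not the explicit scheme of [3]: the bore is a lossy delay line rather than the impedance of Eq. (1), the reed is taken as massless (so that x(t) is close to p_e(t), valid well below the reed resonance), the nonlinear system is solved by a generic root finder, and all parameter values are illustrative.

```python
# Minimal, illustrative clarinet-like loop: lossy delay-line bore,
# massless reed, nonlinear characteristics of Eq. (3) solved per sample.
import numpy as np
from scipy.optimize import brentq

fs = 44100          # sampling rate (Hz)
D = 120             # round-trip delay of the bore in samples (sets the pitch)
a = 0.95            # round-trip loss factor (crude stand-in for k(w) losses)
zeta = 0.35         # embouchure parameter
gamma = 0.45        # dimensionless blowing pressure (above the 1/3 threshold)

def flow(p):
    """Eq. (3) with a massless reed, i.e. x(t) ~ p_e(t)."""
    opening = max(0.0, 1.0 - gamma + p)          # Heaviside keeps S(t) >= 0
    return zeta * opening * np.sign(gamma - p) * np.sqrt(abs(gamma - p))

outgoing = np.zeros(D)                           # ring buffer for the wave p+
p_ext = np.zeros(fs // 2)                        # half a second of sound
for n in range(len(p_ext)):
    p_minus = -a * outgoing[n % D]               # quarter-wave reflection
    # Travelling-wave identities give p_e = u_e + 2 p_minus; solve for p_e.
    p_e = brentq(lambda p: p - 2.0 * p_minus - flow(p), -5.0, 5.0)
    outgoing[n % D] = p_e - p_minus              # new outgoing wave sample p+
    p_ext[n] = 0.5 * (p_e + flow(p_e))           # monopolar radiation (sec. 2.1.5)
```

With \gamma above the oscillation threshold, this loop settles into the square-like pressure regime typical of the clarinet; the pitch follows D and the losses follow a, and p_ext can be written to a sound file for listening.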

2.2. Additive synthesis model in Ssynth

2.2.1. Control viewpoint and synthesis engine in Ssynth

Ssynth is a realtime additive synthesizer with advanced and flexible control functionalities that can synthesize polyphonic sounds and handles OSC control messages [2]. It is a further development of Escher [1], a system developed for studying gestural control in the interpolation of digital musical instrument playing. Considering the control viewpoint of synthesis in terms of design and implementation, Ssynth generates high-quality sounds from an additive database of instrumental sounds, providing coherent control and both interpolation and extrapolation of the musical playing of digital instruments. Ssynth implements third-order 1 polynomial phase models [4], with scalar and vectorized implementations (using AltiVec on PowerPC and SSE2 on Intel Macs). The additive synthesis is implemented in C, and can be compiled as a standalone program or as a Pd object using the Pd scheduler to output audio. Ssynth also provides various spectral envelope models and conversion methods between them (formants, cepstrum, LPC coefficients), together with gestural control of the spectral envelope.

1 Though first and second-order models can be used to reduce the computational cost, the third-order model is more suited to this case as it provides a better sound quality.

2.2.2. Modular mapping

Modularity is an important feature of digital musical instrument design. In order to reduce the great number of additive synthesis control parameters to a smaller set, Ssynth is designed with two parameter conversion layers. The first mapping layer converts the control parameters provided by the gestural transducer into 'abstract' parameters (namely instrument, performed pitch and dynamics). This imitative mapping is implemented as a set of Pd patches and models the physical coupling between breath pressure, lip pressure and/or fingering in the way they naturally affect sound level, brightness and pitch modulations. The second mapping layer lies within the additive synthesizer. It navigates the additive database to find a parameter set that best reproduces a frame of sound for given abstract parameters. The additive database is organized as a 3-dimensional mesh according to each note's performed pitch (7 values), dynamics (3 values, related to sound level and spectral centroid) and instrument (4 up to now). The dynamics parameter controls both intensity and brightness; it would however be interesting to control the spectral envelope more directly.

2.2.3. Morphing, interpolation, extrapolation

The sound database contains additive analyses and spectral envelope models of wind, wood and brass instrument tones (clarinet and oboe as in Escher, plus saxophone and trumpet) from the McGill University master samples database (MUMS) [5]. Morphing according to abstract parameters is provided in two steps. First, interpolated/extrapolated additive frames are computed as a weighting of pitch-shifted additive frames taken from a set of neighbor notes (in fundamental frequency and dynamics) of the same instrument, as illustrated by the sketch below. This is obtained by generalizing the weighting of shared harmonic partials from 2 sets [6] to L sets of frequency/amplitude trajectories (see [2] for more details).
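As a toy illustration of this first step, the sketch below linearly weights pitch-shifted frames of shared harmonic partials from neighbor notes of one instrument. The frame layout and the handling of non-shared partials are simplifications of the actual scheme of [2, 6].

```python
# Illustrative weighting of pitch-shifted additive frames from neighbor
# notes of one instrument (shared partials only; cf. [2, 6] for the full rules).
import numpy as np

def morph_frames(frames, weights, f0_target):
    """frames: list of (amps, freqs) arrays over harmonics; weights sum to 1."""
    H = min(len(amps) for amps, _ in frames)     # keep partials shared by all sets
    out_amps = np.zeros(H)
    out_freqs = np.zeros(H)
    for (amps, freqs), w in zip(frames, weights):
        shift = f0_target / freqs[0]             # pitch-shift ratio to the target f0
        out_amps += w * amps[:H]
        out_freqs += w * freqs[:H] * shift
    return out_amps, out_freqs

# Interpolate halfway between two neighbor notes (220 Hz and 247 Hz):
h = np.arange(1, 11)
note_a = (1.0 / h, 220.0 * h)
note_b = (0.8 / h, 247.0 * h)
amps, freqs = morph_frames([note_a, note_b], [0.5, 0.5], f0_target=233.0)
```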
The second step consists in weighting the interpolated/extrapolated frames obtained for several instruments. The same equations as those used in [2] apply. Note that the two steps are useful for a better understanding of morphing, as well as for facilitating the weighting of shared/non-shared partials. Morphing attacks requires a time-warping of the additive data to provide a better timbre quality for the morphed sounds. Concerning extrapolation from the database, instead of using the four surrounding neighbors, we use the four nearest neighbors (for each instrument). Partial weighting then follows the same rules as for interpolation, except that the weights are computed by linear extrapolation from the four neighbors (instead of linear interpolation).

2.3. Physical models versus signal models

At least four criteria should be taken into account to compare physical synthesis models and signal synthesis models in the general case (cf. Tab. 1).

    criterion         physical model     signal model
    control ability   identical          arbitrary / imitative
    mapping           imposed            free / imitative
    interpolation     param-dependent    arbitrary / imitative
    sound quality     similar            perceptually identical

Table 1. Comparing physical and signal models in terms of control, mapping, interpolation and sound quality.

First, the control ability refers to the control quality of the synthesis model: ease, naturalness and predictability of control. Ideally, the gestural transducer provides identical physical control in terms of parameters, range and behavior of parameters, haptics, etc. (not discussed in this paper). Physical models then have the same control abilities as acoustical instruments, since the instrument's physical elements and properties can be manipulated with physically meaningful parameters. Conversely, the control ability of additive synthesis is arbitrary, as any gestural parameter can be mapped to any synthesis parameter. The second criterion is the mapping of gestural transducer parameters to synthesis algorithm parameters. Without going into details about terminology [7], we note that the mapping is imposed in physical models by the physics equations that

are modelled, whereas it is freely designed for additive synthesis. However, imitative mappings provide a similar control ability by taking into account some of the controls' physical behaviour, for instance in terms of coupling [1]. The third criterion for comparing both synthesis techniques is the provided interpolation of parameters. Physical models allow for interpolating parameters that reflect the instrument's physical structure (e.g. length, shape and materials). This has the potential to change pitch and timbre. On the other hand, signal models provide interpolation of perceptual parameters in arbitrary ways, which can hardly be controlled using physical models. Once again, it makes sense to use imitative interpolation of parameters, to provide the possibility of similar results with both synthesis techniques, and also for the sake of understandability. Sound quality is the last and most obvious criterion. A physical model will deliver a very realistic sound; however, delivering a sound perceptually identical to a given clarinet sound would require physical model inversion 2 and is still an open problem. On the other hand, a signal model based on the analysis of natural sounds will deliver a perceptually identical sound, provided that it is adapted to the analyzed sound (which is the case of the additive model for the clarinet).

2 Model inversion requires in-depth knowledge of the instrument.

Figure 1. Diagram of the physically-controlled synthesis, interfacing a physical model and an additive synthesizer: wind controller -> (L1) -> physical model -> (L2) feature extraction -> (L3a) additive morpher, fed by the additive analysis database -> (L3b) additive synthesis -> sound.

3. SYNTHESIS BY PHYSICALLY-CONTROLLED SIGNAL MODELS

Built on physical and signal models, a new synthesis technique called 'physically-controlled synthesis' is described and explained as a mapping strategy.

3.1. Rationale

We consider a performer whose gestures are captured using an imitative gestural transducer, for instance a WX5 wind controller in the case of clarinet sound synthesis. The idea of physically-controlled synthesis is to combine the strengths of both synthesis methods (refer to Table 1) in terms of sound and control quality. The sound quality of physically-controlled synthesis can benefit from the perceptually identical resynthesis of a particular clarinet. Its control quality can benefit from the same control ability as the real acoustical instrument (minus the controller quality), together with the freedom and/or inventiveness of the mapping design: imitative as in physical and signal models, or arbitrary as in signal models. Such a synthesizer would also allow for arbitrary interpolation/extrapolation of the instrument(s), for instance by allowing the modification of signal parameters (from a perceptual viewpoint, or arbitrarily) and/or of physically relevant parameters (like the instrument size, shape, etc.).

3.2. Mapping in physically-controlled synthesis

Combining the two synthesis techniques in this way can also be considered as a mapping strategy, designed to derive signal model controls from the gestural control and used to refine the control ability of additive synthesis. Indeed, a physically-controlled synthesizer combines the synthesis models as follows: the physical model is interfaced to the additive synthesizer Ssynth using sound descriptors. As depicted in Fig. 1, we identified 3 layers to map gestural parameters to the effective sound synthesis.
The first layer (L1) maps gestural parameters to physically-informed synthesis parameters, more specifically to the physical model input. Mapping issues for layer (L1) are not developed here, since this layer is shared by any synthesizer and such mappings were already discussed for the physical model in [3] and for the additive synthesis in [1, 2]. The second layer (L2) is the first layer of the physically-informed synthesis and corresponds to the sound descriptor extraction (see sec. 3.3) from the output sound of the physical model synthesis in the present example. The third layer (L3) encapsulates the two sub-layers of the additive synthesizer Ssynth (see sec. 2.2.2). The first sub-layer (L3a) derives morphed additive frames from a descriptor-driven navigation in the additive database. The second sub-layer (L3b) is the sound synthesis engine, which generates the digital sound samples from the morphed additive parameters (see sec. 2.2.1).

The key point of physically-controlled synthesis resides in interfacing the physical model with the additive synthesizer. This interface is provided by sound descriptors related to perceptual features of sound (pitch, loudness, timbre). As we wish to improve the control ability of sound synthesis, realtime gestural control must be provided, requiring realtime extraction of sound descriptors from the physical model. Though analytical formulations of the physical model in terms of additive synthesis exist and could lead to sound descriptor extraction from the model itself [8], these formulae are too restrictive from a performer's viewpoint (non-beating-reed case, cf. sec. 2.1.3). For that reason, the descriptors are extracted from the sound rather than from the physical model. As sound descriptor extraction is notoriously a critical task that can often be affected by errors, specific strategies were developed (see sec. 3.4).

3.3. Using knowledge about clarinet timbre to select a sound descriptor set

In the particular case of physically-controlled synthesis of clarinet sounds, the knowledge of acoustics used in the physical model design helps to specify a good set of sound

descriptors to start with. We now intuitively derive a set of primary timbre descriptors from section 2.1.

First, the self-oscillating functioning induces: a perfect harmonicity of the sound in the permanent regime; a playing frequency that differs from the frequency of the first impedance peak (see sec. 2.1.1) and depends on the values of the control parameters \zeta and \gamma [9]; as well as a loudness and brightness depending on these two parameters. For small losses in the bore, i.e. at low frequencies below the reed resonance frequency, the mouthpiece pressure contains mostly odd harmonics, while the mouthpiece flow contains both even and odd harmonics. Hence the balance between odd and even harmonics in the external pressure is mostly driven by the control parameter \zeta. The mouthpiece flow being proportional to the time-varying reed channel opening (which is a low-pass resonant filtering of the mouthpiece pressure), the mouthpiece flow exhibits a formant with features linked to those of the reed resonance. Above the beating-reed threshold, for \gamma \approx 0.5, the presence of an additional saturating nonlinearity (corresponding to the shock of the reed on the lay) introduces a sudden rise of even harmonics in the time-varying reed channel opening, and a decay of the harmonic amplitudes of the flow in 1/\omega, generating additional high-frequency content. Most of these behaviors have been observed both on synthesized sounds and on natural sounds played in laboratory conditions by an artificial mouth [10].

Using this knowledge about clarinet timbre, we derive a minimal control set of sound descriptors:

* fundamental frequency f_0(k) in [0, F_s/2],
* ratio of even/odd harmonics power r_{e/o}(k) in [0, 1],
* intensity I(k) in [0, 1],
* spectral centroid SC(k) in [0, F_s/2],

with k the frame number, x(n) the current frame sound samples with k - N/2 <= n <= k + N/2 - 1, and F_s (resp. F_d) the sampling rate of the sound (resp. of the sound descriptors). A more complete set would also contain a spectral envelope representation (for instance using antiformants) and the asymptotic frequency decay of the harmonics, computed separately on odd and even harmonics. Indeed, it has been shown that two different sets of control parameters can lead to the same value of the spectral centroid (see [11] for synthetic sounds and [10] for natural sounds).

3.4. Sound descriptors extraction

Sound descriptors can either be extracted from the sound signal or from the additive representation of the sound, i.e. the amplitudes and frequencies of partials versus time. Since this representation is the one used by the additive model, the additive database can be expanded with the computed interface sound descriptors. Right now, Ssynth only allows for being controlled according to pitch, dynamics (pp, mf and ff) and instrument. A modification of dynamics implies both a change in sound level and in brightness (as the spectrum gets enriched in higher harmonics for higher dynamics), so the sound descriptors used for control are the fundamental frequency and something related to the spectral centroid. Therefore, we had to add both the spectral centroid and the even/odd harmonics power ratio in order to allow four 1-to-1 mappings in parallel. From the additive representation, the sound descriptors are computed as:

    r_{e/o}(k) = \sum_l a_{2l}(k)^2 / \sum_l a_{2l+1}(k)^2

    I(k) = \sqrt{\sum_{h=1}^{H(k)} a_h(k)^2}

    HC(k) = \sum_{h=1}^{H(k)} a_h(k) f_h(k) / \sum_{h=1}^{H(k)} a_h(k)

with H(k) the number of partials and {a_h(k), f_h(k)} the amplitudes and frequencies of the partials.
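A direct transcription of these additive-domain descriptors, assuming one frame of harmonic partials stored as plain arrays (the array layout is ours, not Ssynth's):

```python
# Additive-domain descriptors for one frame of H harmonic partials.
import numpy as np

def additive_descriptors(amps, freqs):
    """amps, freqs: arrays indexed by harmonic rank h = 1..H."""
    odd = amps[0::2]                              # ranks 1, 3, 5, ...
    even = amps[1::2]                             # ranks 2, 4, 6, ...
    r_eo = np.sum(even ** 2) / np.sum(odd ** 2)   # even/odd harmonics power ratio
    I = np.sqrt(np.sum(amps ** 2))                # frame intensity
    HC = np.sum(amps * freqs) / np.sum(amps)      # harmonic centroid (Hz)
    return r_eo, I, HC

h = np.arange(1, 21)
amps = (h % 2) / h + 0.01                         # crude clarinet-like: mostly odd
print(additive_descriptors(amps, 233.0 * h))
```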
As previously mentioned, sound descriptors are directly extracted from the output signal of the physical model, using the usual windowing technique (1024 samples, 10 ms hop size, Hanning window). The autocorrelation function (ACF) of the signal, AC, is used to extract the even/odd harmonics power ratio. The main peak of the ACF (at lag 0) is the energy of the signal on the given analysis window. The second peak (at the lag index p_1) is the energy of all the harmonics, and the peak located at p_2 \approx p_1/2 is the energy of the even harmonics. The sound descriptors are then extracted as:

    r_{e/o}(k) = AC(p_2) / (AC(p_1) - AC(p_2))    (4)

    I(k) = \sqrt{(1/N) \sum_{i=k-N/2}^{k+N/2-1} x(i)^2} = \sqrt{AC(0)/N}    (5)

    SC(k) = (F_s/N) \sum_{l=0}^{N/2-1} l |X(l, k)| / \sum_{l=0}^{N/2-1} |X(l, k)|

with X(l, k) the signal's short-term Fourier transform.
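The following sketch mirrors this signal-domain extraction. For clarity, the period lag is derived directly from a supplied f_0 hint (the paper guides peak picking with a zero-crossing detector, described next), and the even/odd split is recovered from the unbiased ACF values at the period and half-period lags, since AC(T) ~ E_even + E_odd and AC(T/2) ~ E_even - E_odd for a harmonic signal; this is a slight variant of the peak-based formulation of Eq. (4).

```python
# Signal-domain descriptor extraction from one analysis frame (sketch).
import numpy as np

def signal_descriptors(x, fs, f0_hint):
    N = len(x)
    # Unbiased autocorrelation (rectangular window) keeps harmonic energies flat.
    ac = np.correlate(x, x, mode="full")[N - 1:] / np.arange(N, 0, -1)
    p1 = int(round(fs / f0_hint))                 # period lag: all-harmonics energy
    p2 = p1 // 2                                  # half-period lag
    e_even = 0.5 * (ac[p1] + ac[p2])              # AC(T) ~ Ee + Eo, AC(T/2) ~ Ee - Eo
    e_odd = 0.5 * (ac[p1] - ac[p2])
    r_eo = e_even / e_odd                         # even/odd power ratio, cf. Eq. (4)
    I = np.sqrt(ac[0])                            # RMS intensity, cf. Eq. (5)
    X = np.abs(np.fft.rfft(x * np.hanning(N)))
    SC = np.sum(np.arange(len(X)) * X) / np.sum(X) * fs / N   # centroid (Hz)
    return r_eo, I, SC

fs = 44100
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * 233 * t) + 0.1 * np.sin(2 * np.pi * 466 * t)
print(signal_descriptors(x, fs, 233.0))           # small r_eo, as expected
```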

In any application using sound descriptors, the computation of these descriptors is a critical task that has to be addressed, as it can often be affected by errors and therefore result in an erratic behaviour of the descriptor-driven computations. To avoid such situations, specific strategies were developed. First, the peak picking is helped by a realtime detection of the fundamental frequency. It is performed with a zero-crossing method on a band-pass filtered version of the physical model output, centered around the fundamental frequency given by the known MIDI pitch value. Another concern is the strange behavior that both the harmonic centroid (idem for the spectral centroid) and the even/odd balance may exhibit for low-level signals: the harmonic and spectral centroids will reach the Nyquist frequency, whereas the even/odd ratio could be undefined (division by 0). To avoid such situations, Beauchamp [12] proposed to add a constant, noted c_0, to the denominator of the fraction used to compute the centroid 3, as follows:

    HC(k, c_0) = \sum_{h=1}^{H(k)} a_h(k) f_h(k) / (c_0 + \sum_{h=1}^{H(k)} a_h(k))

    SC(k, c_0) = (F_s/N) \sum_{l=0}^{N/2-1} l |X(l, k)| / (c_0 + \sum_{l=0}^{N/2-1} |X(l, k)|)

The problem with such a formulation is that it induces a bias in the centroid, not only for low-level signals but also for portions of interest of the signal. We therefore preferred to use a truncation according to a similar constant c_0, as follows:

    HC(k, c_0) = \sum_{h=1}^{H(k)} a_h(k) f_h(k) / \max(c_0, \sum_{h=1}^{H(k)} a_h(k))    (6)

    SC(k, c_0) = (F_s/N) \sum_{l=0}^{N/2-1} l |X(l, k)| / \max(c_0, \sum_{l=0}^{N/2-1} |X(l, k)|)

By doing so, the harmonic and spectral centroids are only biased for low levels. Moreover, the same bias is applied when computing the sound descriptors on the physical model output and on the additive database, which means that it may not change the neighbor note search in the additive database. Therefore, we consider these modifications of the computation method for the sound descriptors as a mapping strategy that mainly reduces the variation range of the descriptors for low levels.

3 One may also consider adding a constant to the even/odd ratio, to avoid undefined values.
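A quick numerical contrast between the two centroid formulations, on a loud and a very quiet synthetic frame (c_0 = 1 as in Fig. 3; the partial amplitudes are made up):

```python
# Additive-constant [12] vs. truncated (Eq. (6)) harmonic centroid.
import numpy as np

def harmonic_centroid(amps, freqs, c0=1.0, mode="truncate"):
    num = np.sum(amps * freqs)
    den = np.sum(amps)
    if mode == "truncate":
        return num / max(c0, den)    # Eq. (6): exact whenever den >= c0
    return num / (c0 + den)          # [12]: biased at every level

h = np.arange(1, 11)
freqs = 233.0 * h
for amps in (1.0 / h, 1e-4 / h):                  # loud frame, quiet frame
    print(harmonic_centroid(amps, freqs, mode="truncate"),
          harmonic_centroid(amps, freqs, mode="add"))
```

The truncated version returns the exact centroid for the loud frame and pins the quiet frame near 0, whereas the additive constant also biases the loud frame.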
4. DISCUSSION

4.1. First results

We ran a first series of tests, from single notes to series of notes and musical sentences 4. They were played on a WX5 (see Fig. 2 for an example of raw normalized control data for an excerpt of 'Pierre et le loup'). The physical model output sound was recorded and the control data were stored as OSC messages together with their timing. Since there are many mapping and control issues to deal with before being able to use the synthesizer in realtime, and for the sake of experiment reproducibility, we first stored those data in files. Then, the sound descriptors were extracted and converted to OSC messages that were sent with proper timing to Ssynth.

Figure 2. Control data for 'Pierre et le loup' played on a WX5: normalized air pressure (top), normalized lip pressure (middle) and MIDI note (bottom).

Figure 3. Sound descriptors for 'Pierre et le loup' synthesized using the physical model: sound intensity computed by RMS from Eq. (5) (top), spectral centroid from Eq. (6) with c_0 = 1 (middle), even/odd harmonic balance from Eq. (4) (bottom).

We observed that the sound quality obtained in Ssynth was not as good as with the 'traditional' imitative mapping. Indeed, the imitative mapping controls pitch and dynamics (and thus, indirectly, brightness) according to their mean values along the note, and locally compensates fundamental frequency and sound level. Ssynth acts as an additive sampler that smoothly interpolates between four additive frames (producing 10 ms of sound). It also preserves the frame-by-frame time unfolding, thereby preserving the variability of the partials' amplitudes and frequencies. However, the variation with time of the sound descriptors is specific to each recorded sound. Using those descriptors on a frame-by-frame basis to navigate in the additive database may then create too fast time variations (see Fig. 4), resulting in morphing weights that vary too fast.

4 See http://www.music.mcgill.ca/musictech/spcl/pocs-consonnes for more detailed information.
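For reference, replaying stored descriptor frames with their original timing can be as simple as the following sketch (using the python-osc package; the OSC address and frame layout are ours, not Ssynth's actual protocol):

```python
# Hypothetical timed replay of stored descriptor frames as OSC messages.
import time
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 9000)       # assumed Ssynth host/port

# Each frame: (time in s, f0 in Hz, intensity, even/odd ratio, centroid in Hz)
frames = [(0.00, 233.0, 0.30, 0.05, 900.0),
          (0.01, 233.2, 0.32, 0.06, 940.0)]

t0 = time.time()
for t, f0, intensity, r_eo, sc in frames:
    time.sleep(max(0.0, t0 + t - time.time()))    # preserve original timing
    client.send_message("/ssynth/frame", [f0, intensity, r_eo, sc])
```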

4.2. Limitations

As said in sec. 2.1, the physical model is not a complete model of the clarinet. It therefore simplifies some aspects of the physics of the instrument: the dynamic behaviour of any particular instrument is not precisely captured in the mapping. This means that the additive synthesis, built from additive data taken from a real clarinet, will tend to resemble this physical model and will never really be perceptually identical to that clarinet. It can only be so in the sense of the 'instantaneous spectral timbre'.

Figure 4. Example of navigation in the additive database (pitch axis, D3 to A3): smooth trajectory in the ideal case with slow variations of the sound descriptor values, and erratic trajectory due to excessively fast variations of the sound descriptor values.

Second, the advantages of additive synthesis over physical modeling (e.g. reproducibility of recorded sounds, spectral control) mainly concern sustained or slowly varying sound frames. It is less adequate for reproducing and controlling impulsive or fast transients. In this respect, the mapping of fast transients tends to be the weakest part of physically-controlled synthesis. This means that several improvements will have to be made to support the idea that this new technique can provide perceptually identical re-synthesis.

4.3. Improvements and future work

The method presented here is, at this time, applicable in the permanent regime or in slowly varying situations. Several ideas may later be investigated, dealing with: improving the analysis quality of attacks with more adapted spectral analysis; using better physical models; increasing the additive database size; refining the strategy for searching neighbor notes or frames in the database; and combining the acoustical information from the physical model and from the additive synthesizer. We now detail the last four ideas.

First, we could use a less simplified physical model, to provide more perceptually relevant clarinet sound descriptors. For instance, enhanced models of the beating-reed phenomenon could be used. The roles played by the network of toneholes could also be taken into account: a closed network adds periodic antiformants to the external spectrum, while an open network adds a strong cut-off frequency.

Second, we could increase the database information by several means. For instance, having more attack types analyzed (normal, staccato) would better represent the additive frames of possible clarinet attacks. Also, refining the database granularity with an additive frame every 5 ms instead of 10 ms would be more realistic for the transients. Another example resides in adding analyses from other performers or other clarinets, using other databases such as RWC [13], MIS [14] or SOL [15]. The processing of transients will require additional timbre descriptors, such as the attack time. More generally, non-stationary situations may require taking into account not only the instantaneous values of the timbre descriptors, but also their time derivatives, either within an additive frame or between two frames. Such sound descriptors of the attack may then help to find an additive frame better suited to a given situation, but would at the same time require a more complex strategy for searching neighbor frames in the database. Another idea would consist in increasing the number of notes with different pitches in the database (for each dynamics), in order to better represent both the whole frequency range of the clarinet and its register changes, and the whole variation range and type of the sound descriptors.
A combination of those three ideas will offer a better representation of the natural behavior of HC and r_{e/o} depending on f_0 and the dynamics, yielding more adequate data to be interpolated under the control of the physical model. Alternately, we could look for a better understanding of how the sound descriptors of a given frame are affected when pitch-shifting this frame. This would lead to rules about how to infer the sound descriptors from the notes used for interpolation in the database.

The third idea would be to use other strategies for searching neighbor frames according to f_0, HC and r_{e/o}; a sketch of such a search is given below. One could for example get rid of the time unfolding of the additive frames: Ssynth uses the (f_0, dyn) map (organized as a 2D mesh with clear quadrants), representing notes by means of intended f_0 and dynamics. It is therefore simpler to synthesize while preserving the time unfolding, as the (f_0, dyn) mesh is regularly sampled. However, both the (f_0, HC) and (f_0, r_{e/o}) meshes vary quite a lot during a database note. They then vary a lot from one frame to another, and their mean values on a note would not adequately represent the range of their variation. It could then make sense to remove the time axis and to search among all possible neighbor frames. Another mapping strategy would be to use a neighbor search in the (f_0, HC, r_{e/o}) 3D map instead of a 2D map neighbor search on either (f_0, HC) or (f_0, r_{e/o}). This last solution was chosen in a first step for the sake of simplicity, because Ssynth is based on (f_0, dyn); various modifications are required for the synthesizer to support a 3D neighbor search. One could also use different strategies for searching neighbor frames/notes depending on what is synthesized, attack or sustain, possibly providing more realistic attack transients.
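The sketch referred to above: a brute-force nearest-neighbor search in a (f_0, HC, r_{e/o}) map, with per-axis normalization so that the three descriptors contribute comparably (the normalization constants and k are illustrative choices):

```python
# Brute-force k-nearest-neighbor search in a (f0, HC, r_eo) descriptor map.
import numpy as np

def nearest_frames(db, query, k=4, scale=(100.0, 500.0, 0.1)):
    """db: (n, 3) array of [f0, HC, r_eo] per database frame; query: target."""
    d = (db - np.asarray(query)) / np.asarray(scale)
    dist = np.sqrt(np.sum(d ** 2, axis=1))
    idx = np.argsort(dist)[:k]
    return idx, dist[idx]

db = np.array([[220.0,  900.0, 0.05],
               [233.0,  950.0, 0.06],
               [247.0, 1100.0, 0.10],
               [262.0, 1300.0, 0.12]])
idx, dist = nearest_frames(db, [233.0, 1000.0, 0.07], k=2)
weights = 1.0 / (dist + 1e-9)
weights /= np.sum(weights)       # inverse-distance weights for the morphing
```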

Fourth, one could also consider merging the acoustic outputs from both the physical model and the additive synthesis model, instead of discarding the former, and see if this results in a more effective framework. For example, the sound output of the physical model could be kept in transient situations and the signal model used for slowly varying situations. This however first requires satisfactory morphing strategies and a better understanding of the physically-controlled synthesis itself, as the more the two synthesizer outputs differ, the more difficult it is to morph them.

5. CONCLUSIONS

In this paper, we introduced physically-controlled synthesis, a way to combine two synthesizers in order to keep the advantages of both: a physical model for its control quality, and an additive resynthesizer for its sound quality. The fundamentals of the two synthesizers were presented. We then explained how to interface them in order to obtain sound synthesis. A first key point resides in the choice of the sound descriptor set that interfaces the two synthesizers, which can then be seen as two layers of a mapping strategy. During our first tests, it appeared that simply interfacing the two synthesizers with four 1-to-1 mappings of fundamental frequency, sound intensity, even/odd harmonics power ratio and spectral centroid is not enough. More refined mapping strategies are required. Some ideas to investigate in the near future consist in better taking transients into account, for instance by using different strategies depending on the note portion, and in developing some synthesis-by-rule, where Ssynth's morphing strategies will need to be adapted in order to get a better match between the outputs of the physical model and of the additive synthesizer. For example, fundamental frequency and sound intensity could still be controlled on a frame-by-frame basis, whereas the timbre parameters (even/odd balance and spectral centroid) should be used with constraints on the regularity of their frame-to-frame variation (using sound feature derivatives, or removing the time-unfolding dependency when searching neighbor frames).

6. ACKNOWLEDGEMENTS

We thank the reviewers for their remarks and suggestions. This research is supported by grants from the 'Consonnes' project (ANR, French Agence Nationale de la Recherche), the FQRNT (Fonds Québécois de la Recherche sur la Nature et les Technologies), the FQRSC (Fonds Québécois de la Recherche sur la Société et la Culture), the NSERC (Natural Sciences and Engineering Research Council of Canada), Québec's Ministry of Economic Development (PSIIRI grant), and the CIRMMT (Centre for Interdisciplinary Research on Music Media and Technology).

7. REFERENCES

[1] M. Wanderley, N. Schnell, and J. B. Rovan, "Escher - modeling and performing composed instruments in real-time," in Proc. of the IEEE Int. Conf. on Systems, Man and Cybernetics (SMC'98), San Diego, 1998, pp. 1080-4.

[2] V. Verfaille, J. Boissinot, P. Depalle, and M. M. Wanderley, "Ssynth: a real time additive synthesizer with flexible control," in Proc. of the Int. Computer Music Conf. (ICMC'06), New Orleans, 2006.

[3] P. Guillemain, J. Kergomard, and T. Voinier, "Real-time synthesis of clarinet-like instruments using digital impedance models," J. Acoust. Soc. Am., vol. 118, no. 1, pp. 483-494, 2005.

[4] R. J. McAulay and T. F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 34, no. 4, pp. 744-54, 1986.

[5] F. Opolko and J. Wapnick, "McGill University Master Samples," Montreal, QC, Canada: McGill University, 1987.

[6] E. Tellman, L. Haken, and B. Holloway, "Timbre morphing of sounds with unequal numbers of features," J. Audio Eng. Soc., vol. 43, no. 9, pp. 678-89, 1995.

[7] M. M. Wanderley, Ed., Organised Sound, special issue on mapping, vol. 7, no. 2. Cambridge University Press, 2002.

[8] J. Kergomard, S. Ollivier, and J. Gilbert, "Calculation of the spectrum of self-sustained oscillators using a variable truncation method: Application to cylindrical reed instruments," Acustica, vol. 86, pp. 685-703, 2000.
[9] T. A. Wilson and G. S. Beavers, "Operating modes of the clarinet," J. Acoust. Soc. Am., vol. 56, pp. 653-658, 1974.

[10] R. T. Helland, "Synthesis models as a tool for timbre studies," Master's thesis, Norwegian University of Science and Technology, Department of Electronics and Communications, 2004.

[11] P. Guillemain, R. T. Helland, R. Kronland-Martinet, and S. Ystad, "The clarinet timbre as an attribute of expressiveness," Lecture Notes in Computer Science, vol. 3310, pp. 246-259, 2004.

[12] J. W. Beauchamp, "Synthesis by spectral amplitude and "brightness" matching of analyzed musical instrument tones," J. Audio Eng. Soc., vol. 30, no. 6, pp. 396-406, 1982.

[13] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC music database: Music genre database and musical instrument sound database," in Proc. 4th Int. Conf. on Music Information Retrieval (ISMIR 2003), 2003, pp. 229-230.

[14] IOWA: Musical Instrument Samples Database, "http://theremin.music.uiowa.edu/MIS.html," 2005.

[15] IRCAM: Studio On Line, "http://sol.ircam.fr," 2000.