PROCEEDINGS ICMC98

Data-driven Modeling and Synthesis of Acoustical Instruments

Bernd Schoner, Chuck Cooper, Chris Douglas, Neil Gershenfeld
Physics and Media Group, MIT Media Laboratory
20 Ames Street, Cambridge, MA 02139

Abstract

We present a framework for the analysis and synthesis of acoustical instruments based on data-driven probabilistic inference modeling. Audio time series and boundary conditions of a played instrument are recorded, and the non-linear mapping from the control data into the audio space is inferred using the general inference framework of Cluster-Weighted Modeling. The resulting model is used for real-time synthesis of audio sequences from new input data.

1 Introduction

Most of today's musical synthesis is based on either sampling acoustical instruments [Massie, 1998] or detailed first-principles physical modeling [Smith, 1992]. The sampling approach typically results in high sound quality, but has no notion of the instrument as a dynamic system with variable control. The physical modeling approach retains this dynamic control but results in intractably large models when all the physical degrees of freedom are considered. In addition, the determination of parameters is difficult, and there is no systematic way to capture the differences between instruments of the same class, such as two master violins.

We present a new synthesis method, inferring the physical behavior of an instrument from observation of its global performance, that is conceptually intermediate between the traditional techniques. Dynamical systems theory shows that we can reconstruct a state space of a physical system that is homeomorphic to the actual state space using the input and output observables of the system along with their time lags (embedding space) [Takens, 1981, Casdagli, 1992]. The reconstructed space captures the dynamics of the entire system, and its dimensionality may be chosen in such a way that it corresponds to the number of effective rather than actual degrees of freedom of the system. In the case of dissipative systems, such as musical instruments, this effective dimensionality can be considerably lower than the physical one. We combine this result with adequate signal processing and sensing technology to build a modeling and synthesis framework that introduces new capabilities and control flexibility into accepted and mature synthesis techniques.

Using the violin as our test instrument, we developed unobtrusive sensors that track the position of the bow relative to the violin, the pressure of the forefinger on the bow, and the position of the finger on the fingerboard. In a training session, we record control input data from these sensors along with the violin's audio output. These signals serve as training data for the inference engine, Cluster-Weighted Modeling, which learns the non-linear relationship between the control inputs and the target audio output.

We have developed Cluster-Weighted Modeling (CWM) as a general probabilistic inference engine that naturally extends beyond linear inference and signal processing practice into a powerful non-linear framework that handles non-Gaussianity, non-stationarity, and discontinuity in the data. CWM retains the functionality of conventional neural networks, while addressing many of their limitations. Once it has converged, the model can predict audio data based on new control data. A violinist plays the interface device (which could be a silent violin), and the sensors now drive the computer model to produce the sound of the original violin.

2 Data Collection and Hardware

Sensors capture the gestural input of the violin player. We measure the violin bow position with a capacitive coupling technique that detects displacement current in a resistive antenna [Paradiso and Gershenfeld, 1997] and infer a bow velocity estimate by differentiation and Kalman filtering. We determine the distance between bow and bridge of the violin by the same technique. Bow pressure is inferred from the force exerted by the player's finger on a force-sensitive resistor mounted on the bow. A microphone placed near the violin detects the acoustic output.

The finger-position sensor consists of a thin strip of stainless-steel ribbon attached to the fingerboard. An alternating current at 5 kHz passes through the violin string and is divided according to the distance between the contact point and the two ends of the ribbon. Synchronous detection measures the difference between the two currents, from which we infer the position of the player's finger. We use the sum of the two currents to determine whether the string is in contact with the fingerboard.

During data recording sessions, we simultaneously record sensor and audio data along with an initial synchronization impulse to ensure proper time alignment of the signals (see Fig. 1).
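The text states that bow velocity is obtained from the position signal by differentiation and Kalman filtering but does not give the filter design. The following is only a minimal sketch of one common way to do this, assuming a constant-velocity state model; the function name `kalman_velocity`, the noise parameters `q` and `r`, and the 1 kHz sampling interval are illustrative assumptions, not values from the paper.

```python
import numpy as np

def kalman_velocity(positions, dt, q=1.0, r=1e-4):
    """Estimate velocity from noisy position samples.

    Assumed (not from the paper): constant-velocity model with state
    [position, velocity], acceleration variance q, measurement variance r.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])      # state transition
    H = np.array([[1.0, 0.0]])                 # only position is measured
    # Process noise of a piecewise-constant white-acceleration model
    Q = q * np.array([[dt**4 / 4, dt**3 / 2],
                      [dt**3 / 2, dt**2]])
    R = np.array([[r]])
    x = np.array([positions[0], 0.0])          # initial state: zero velocity
    P = np.eye(2)                              # initial state uncertainty
    velocities = np.empty(len(positions))
    for k, z in enumerate(positions):
        # Predict the next state and its covariance
        x = F @ x
        P = F @ P @ F.T + Q
        # Correct the prediction with the new position measurement
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.array([z]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        velocities[k] = x[1]
    return velocities

# Synthetic check: constant bow velocity of 0.5 units/s, sampled at 1 kHz
dt = 1e-3
t = np.arange(2000) * dt
rng = np.random.default_rng(0)
noisy_pos = 0.5 * t + rng.normal(0.0, 1e-3, t.size)
v_est = kalman_velocity(noisy_pos, dt)
```

Compared with plain finite differences, the filter trades a short settling time for a much less noisy velocity estimate; `q` and `r` control that trade-off.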
For performance, the violinist plays the same instrument, with the strings covered by a shield that prevents bow-string contact. The sensor signals are again fed into the computer, and now a real-time program predicts sound parameters and synthesizes audio output corresponding to the player's performance data.

3 Modeling Sequence

The model building and prediction sequence is broken into the following steps: (1) transform the output representation from time series audio samples to spectral data, (2) prepare the effective state space from input and output data series and reduce the dimensionality of this space using a principal component analysis, (3) build an input-output prediction model using the cluster-weighted algorithm, (4) use the model to predict output spectral data based on new input data, and (5) synthesize the audio stream from the predicted spectral sequence.

3.1 Data Analysis and Representation

Our first attempts at using embedding synthesis to model driven input/output systems were entirely time-domain-based. While this approach is close to the techniques suggested by dynamic systems theory [Casdagli, 1992], it suffered from instability due to the difference in characteristic time scales between inputs and outputs. The time-domain approach models a particular realization of a process, not the process itself, and it retains perceptually irrelevant features such as phase information and unpredictable information such as noise. The spectral representation we now use overcomes these limitations [McAulay and Quatieri, 1985, Serra and Smith, 1990]. We decompose the measured audio signal into spectral frames in such a way that each audio frame corresponds to a measured set of input variables $x$. At each frame time step we compute the discrete [...]

[Figure 1. Panels: Bow/Bridge Dist., Bow Press., Finger Pos., Bow Veloc., Bow Pos., Audio.]
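Step (2) of the modeling sequence, building a lagged state space from the control signals and reducing it with a principal component analysis, can be illustrated with a short NumPy sketch. The function names, the lag values, and the number of retained components are illustrative assumptions; the lag of 10 frames only echoes the axis label Bow-Speed(nT-10) of Figure 2, and random numbers stand in for real sensor frames.

```python
import numpy as np

def lag_embed(series, lags):
    """Stack time-lagged copies of a multichannel series.

    series: array of shape (T, D); lags: non-negative integer delays.
    Output row k corresponds to time n = k + max(lags) and holds
    series[n - lag] for every lag, side by side.
    """
    series = np.asarray(series, dtype=float)
    start = max(lags)
    cols = [series[start - lag : len(series) - lag] for lag in lags]
    return np.hstack(cols)

def pca_reduce(X, n_components):
    """Project the rows of X onto their leading principal components."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:n_components].T
    return (X - mean) @ basis, mean, basis

# Stand-in for two control channels (e.g. bow velocity, finger position)
rng = np.random.default_rng(0)
controls = rng.normal(size=(500, 2))
embedded = lag_embed(controls, lags=(0, 10))        # shape (490, 4)
reduced, mean, basis = pca_reduce(embedded, n_components=3)
```

The returned `mean` and `basis` would have to be reused unchanged at synthesis time, so that new control data are projected into the same reduced space the model was trained in.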

[Figure 2: Data and clusters in the joint input/output space. a) Vertical view of the input space (axes: Bow-Speed(nT), Bow-Speed(nT-10)). Clusters are represented by their centers and their domains of influence (variances). b) Clusters in 3D, with rectangles representing the planes of the linear functions.]

[...] serves as a predictor that generates an estimated output, given a new input vector $x$. From the predicted spectral vector $y$ we reconstruct a time domain waveform by linearly interpolating frequencies and amplitudes of the independent partials between frames. The sinusoidal components corresponding to the partials are summed into a single audio signal.

3.2 Cluster-Weighted Modeling

Cluster-Weighted Modeling (CWM) is an input/output inference framework based on probability density estimation of a joint set of feature (control) and target (spectral/audio) data. It is derived from hierarchical mixture-of-experts type architectures [Jordan and Jacobs, 1994] and can be interpreted as a flexible and transparent technique for approximating an arbitrary function. Clusters automatically go "where the data is" and approximate subsets of the data space according to a smooth domain of influence (Fig. 2). Globally, the influence of the different clusters is weighted by Gaussian basis terms, while locally, each cluster represents a simple model such as a linear regression function. Thus, previous results from linear systems theory, linear time series analysis and traditional musical synthesis are applicable within the broader context of a globally non-linear model.

Using the experimental measurements we obtain a set of training data $\{y_n, x_n\}_{n=1}^{N}$, where $x_n$ refers to the feature input vector at time $n$ and $y_n$ refers to the target vector containing spectral data at time $n$.
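Because the target vectors hold per-frame spectral data for the partials, a predicted sequence has to be turned back into audio by the resynthesis step described at the end of Section 3.1. A minimal additive-synthesis sketch of that step follows. The frame rate, sample rate, and function name are assumptions for illustration, and the initial phases are simply set by integrating the interpolated frequency from the first sample, which the text does not specify.

```python
import numpy as np

def resynthesize(freqs, amps, frame_rate, sample_rate):
    """Additive synthesis from per-frame partial frequencies (Hz) and amplitudes.

    freqs, amps: arrays of shape (n_frames, n_partials).
    """
    freqs = np.asarray(freqs, dtype=float)
    amps = np.asarray(amps, dtype=float)
    n_frames, n_partials = freqs.shape
    frame_times = np.arange(n_frames) / frame_rate
    n_samples = int(round(frame_times[-1] * sample_rate)) + 1
    t = np.arange(n_samples) / sample_rate
    audio = np.zeros(n_samples)
    for k in range(n_partials):
        # Linear interpolation of frequency and amplitude between frames
        f = np.interp(t, frame_times, freqs[:, k])
        a = np.interp(t, frame_times, amps[:, k])
        # Phase is the running integral of the instantaneous frequency
        phase = 2.0 * np.pi * np.cumsum(f) / sample_rate
        audio += a * np.sin(phase)
    return audio

# Three steady harmonics of 440 Hz over one second (101 frames at 100 Hz)
frame_rate, sample_rate = 100, 8000
harmonics = np.array([1.0, 2.0, 3.0])
freqs = np.tile(440.0 * harmonics, (101, 1))
amps = np.tile(np.array([1.0, 0.5, 0.25]), (101, 1))
audio = resynthesize(freqs, amps, frame_rate, sample_rate)
```

Accumulating phase from the interpolated frequency keeps each partial continuous across frame boundaries, which avoids clicks when pitch changes, for example during vibrato.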
We infer the joint probability density of feature and target vector $p(y, x)$, from which we derive conditional quantities such as the expected value of $y$ given $x$, $\langle y \mid x \rangle$, and the expected covariance matrix of $y$ given $x$, $\langle C_y \mid x \rangle$. The value $\langle y \mid x \rangle$ serves as prediction of the target value $y$, and $\langle C_y \mid x \rangle$ serves as its error estimate [Gershenfeld et al., 1997].

The joint density $p(y, x)$ is expanded in clusters labeled $c_m$, each of which contains an input domain of influence, a local model, and an output distribution:

$$p(y, x) = \sum_{m=1}^{M} p(y, x, c_m) = \sum_{m=1}^{M} p(y \mid x, c_m)\, p(x \mid c_m)\, p(c_m) \tag{1}$$

The probability functions $p(y \mid x, c_m)$ and $p(x \mid c_m)$ are taken to be of Gaussian form so that $p(x \mid c_m) = N(\mu_m, C_m)$ and $p(y \mid x, c_m) = N(f(x, \beta_m), C_{y,m})$, where $N(\mu, C)$ stands for the multi-dimensional Gaussian distribution with mean vector $\mu$ and covariance matrix $C$.

The function $f(x, \beta_m)$ with unknown parameters $\beta_m$ in general needs to be a linear coefficient model [Gershenfeld, 1998], for example a polynomial model. The complexity of the local model is traded off against the complexity of the global architecture. In the case of polynomial expansion, there are two extreme forms that illustrate this trade-off. We may use locally constant models in connection with a large number of clusters, in which case the predictive power comes from the number of Gaussian kernels only. Alternatively we may decide to use a high-order polynomial model and a single kernel, in which case the model reduces to a global polynomial model. In this particular application we choose the linear model

$$f(x, \beta_m) = \beta_{m,0} + \sum_{d=1}^{D} \beta_{m,d}\, x_d \tag{2}$$

where $d$ refers to an input dimension and $D$ to the total number of input dimensions.

Given the density estimate, we infer a conditional forecast,

$$\langle y \mid x \rangle = \frac{\sum_{m=1}^{M} f(x, \beta_m)\, p(x \mid c_m)\, p(c_m)}{\sum_{m=1}^{M} p(x \mid c_m)\, p(c_m)} \tag{3}$$

as well as a conditional error forecast,

$$\langle C_y \mid x \rangle = \frac{\sum_{m=1}^{M} \left[ C_{y,m} + f(x, \beta_m)\, f(x, \beta_m)^T \right] p(x \mid c_m)\, p(c_m)}{\sum_{m=1}^{M} p(x \mid c_m)\, p(c_m)} - \langle y \mid x \rangle \langle y \mid x \rangle^T \tag{4}$$

The choice of the number of clusters $M$ controls under- versus over-fitting. The model should have enough clusters to model the predictable data, but should not become so complex that it tries to predict the noise and non-generalizable features. The optimal $M$ can be determined by cross validation with respect to an abstract error measure such as the square error or with respect to perceptual performance.

We find the model parameters using a variant of the Expectation-Maximization (EM) algorithm. Conventional EM updates are used to estimate the unconditioned cluster probabilities $p(c_m)$, cluster locations $\mu_m$, and covariances $C_m$. We then use pseudo-inverses of the cluster-weighted covariance matrices to update the local model parameters $\beta_m$.
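Given a set of fitted parameters, the conditional forecast of Eq. (3) and the error forecast of Eq. (4) reduce to a weighted average over the clusters. The sketch below shows this for local linear models. The function names and the parameter layout (offset stored in row 0 of each coefficient matrix) are assumptions made for the example, and the two-cluster toy parameters are hand-picked rather than fitted.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(mean, cov) evaluated at x."""
    d = x - mean
    norm = np.sqrt((2.0 * np.pi) ** mean.size * np.linalg.det(cov))
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm

def cwm_predict(x, priors, means, covs, betas, out_covs):
    """Conditional forecast <y|x> (Eq. 3) and error forecast <C_y|x> (Eq. 4).

    priors[m]   : p(c_m)
    means[m]    : mu_m, shape (D,)
    covs[m]     : C_m, shape (D, D)
    betas[m]    : local linear coefficients, shape (D + 1, K), offset in row 0
    out_covs[m] : C_{y,m}, shape (K, K)
    """
    x = np.asarray(x, dtype=float)
    x1 = np.concatenate(([1.0], x))                  # constant term plus inputs
    w = np.array([p * gaussian_pdf(x, mu, C)
                  for p, mu, C in zip(priors, means, covs)])
    w = w / w.sum()                                  # normalized cluster weights
    local = np.array([x1 @ B for B in betas])        # f(x, beta_m), shape (M, K)
    y_hat = w @ local
    second = sum(wm * (Cy + np.outer(fm, fm))
                 for wm, Cy, fm in zip(w, out_covs, local))
    return y_hat, second - np.outer(y_hat, y_hat)

# Two hand-picked clusters approximating y = |x| with local lines -x and +x
priors = [0.5, 0.5]
means = [np.array([-1.0]), np.array([1.0])]
covs = [0.25 * np.eye(1), 0.25 * np.eye(1)]
betas = [np.array([[0.0], [-1.0]]), np.array([[0.0], [1.0]])]
out_covs = [0.01 * np.eye(1), 0.01 * np.eye(1)]
y_left, cov_left = cwm_predict([-1.0], priors, means, covs, betas, out_covs)
y_mid, cov_mid = cwm_predict([0.0], priors, means, covs, betas, out_covs)
```

Far away from every cluster all weights underflow toward zero and the normalization becomes ill-conditioned, so a more robust version would evaluate the weights in the log domain.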
The EM algorithm finds the model parameters by iterating between an expectation step and a maximization step [Dempster et al., 1977, Jordan and Jacobs, 1994].

E-step: Given a starting set of parameters, we find the probability of a cluster given the data:

$$p(c_m \mid y, x) = \frac{p(y, x \mid c_m)\, p(c_m)}{p(y, x)} = \frac{p(y, x \mid c_m)\, p(c_m)}{\sum_{l=1}^{M} p(y, x \mid c_l)\, p(c_l)} \tag{5}$$

where the sum over clusters in the denominator lets clusters interact and specialize in data they best explain.

M-step: Now we assume the current data distribution correct, and maximize the likelihood function by recomputing the cluster parameters. The new estimate for the unconditioned cluster probabilities becomes:

$$p(c_m) = \int p(c_m \mid y, x)\, p(y, x)\, dy\, dx \approx \frac{1}{N} \sum_{n=1}^{N} p(c_m \mid y_n, x_n) \tag{6}$$
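One full EM iteration for local linear models, covering the E-step of Eq. (5), the cluster-probability update of Eq. (6), and the remaining M-step updates of Eqs. (7) to (10), might be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the parameter layout, the small ridge term `reg` added to the covariances, and the toy data $y = |x|$ are all choices made for the example.

```python
import numpy as np

def gaussian_rows(Z, mean, cov):
    """Density N(mean, cov) evaluated at every row of Z."""
    d = Z - mean
    norm = np.sqrt((2.0 * np.pi) ** mean.size * np.linalg.det(cov))
    maha = np.einsum("ni,ij,nj->n", d, np.linalg.inv(cov), d)
    return np.exp(-0.5 * maha) / norm

def em_step(X, Y, priors, means, covs, betas, out_covs, reg=1e-6):
    """One EM iteration; returns new parameters and the log-likelihood
    of the data under the parameters that were passed in."""
    N, D = X.shape
    K = Y.shape[1]
    M = len(priors)
    X1 = np.hstack([np.ones((N, 1)), X])       # basis terms [1, x_1, ..., x_D]

    # E-step (Eq. 5): posterior p(c_m | y_n, x_n) for every data point
    joint = np.column_stack([
        gaussian_rows(Y - X1 @ betas[m], np.zeros(K), out_covs[m])
        * gaussian_rows(X, means[m], covs[m]) * priors[m]
        for m in range(M)])
    evidence = joint.sum(axis=1)
    post = joint / evidence[:, None]

    # M-step
    new_priors = post.mean(axis=0)              # Eq. 6
    new_means, new_covs, new_betas, new_out = [], [], [], []
    for m in range(M):
        w = post[:, m] / post[:, m].sum()       # weights of <.>_m (Eq. 7)
        mu = w @ X                              # Eq. 8, cluster mean
        d = X - mu
        new_means.append(mu)
        new_covs.append((w[:, None] * d).T @ d + reg * np.eye(D))
        B = (w[:, None] * X1).T @ X1            # B[i, j] = <f_i f_j>_m
        A = (w[:, None] * X1).T @ Y             # A[j, i] = <f_j y_i>_m
        beta = np.linalg.pinv(B) @ A            # Eq. 9 via pseudo-inverse
        res = Y - X1 @ beta
        new_betas.append(beta)
        new_out.append((w[:, None] * res).T @ res + reg * np.eye(K))  # Eq. 10
    params = (new_priors, new_means, new_covs, new_betas, new_out)
    return params, np.log(evidence).sum()

# Toy data: y = |x| plus noise, fitted with two clusters
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(400, 1))
Y = np.abs(X) + rng.normal(0.0, 0.05, size=X.shape)
params = (np.array([0.5, 0.5]),
          [np.array([-1.0]), np.array([1.0])],
          [np.eye(1), np.eye(1)],
          [np.zeros((2, 1)), np.zeros((2, 1))],
          [np.eye(1), np.eye(1)])
log_liks = []
for _ in range(30):
    params, ll = em_step(X, Y, *params)
    log_liks.append(ll)
```

Tracking `log_liks` gives the stopping criterion described in the text: iteration ends once the data likelihood no longer increases.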

The cluster-weighted expectation of any function $\theta(x)$ is defined as

$$\langle \theta(x) \rangle_m \equiv \int \theta(x)\, p(x \mid c_m)\, dx \approx \frac{\sum_{n=1}^{N} \theta(x_n)\, p(c_m \mid y_n, x_n)}{\sum_{n=1}^{N} p(c_m \mid y_n, x_n)} \tag{7}$$

This lets us update the cluster means and the cluster-weighted covariance matrices:

$$\mu_m = \langle x \rangle_m, \qquad [C_m]_{ij} = \langle (x_i - \mu_{m,i})(x_j - \mu_{m,j}) \rangle_m \tag{8}$$

The derivation of the maximum likelihood solution for the model parameters yields

$$\beta_m = B_m^{-1} \cdot A_m \tag{9}$$

with $[B_m]_{ij} = \langle f_i(x, \beta_m)\, f_j(x, \beta_m) \rangle_m$ and $[A_m]_{ij} = \langle y_i\, f_j(x, \beta_m) \rangle_m$, where $f_i$ denotes the $i$-th basis term of the local model (for the linear model of Eq. 2, the constant and the input components $x_d$). Finally the output covariance matrices associated with each model are estimated,

$$C_{y,m} = \langle [y - f(x, \beta_m)] \cdot [y - f(x, \beta_m)]^T \rangle_m \tag{10}$$

We iterate between the E- and the M-step until the overall likelihood of the data, as defined by the product of all data likelihoods (Eq. 1), does not increase further.

4 Experimental results

We collected approximately 30 minutes of input/output violin data, both single notes and scales with various bow strokes (Fig. 1). We then built models based on subsets of the data, and used them for off-line and on-line synthesis.

The model performs very well on small subsets of the overall training data. It is particularly robust at representing pitch and amplitude fluctuations caused by vibrato. Figure 3 illustrates that CWM can reproduce the spectral characteristics of a segment of violin sound. The sound examples demonstrate the results of audio resynthesis both in- and out-of-sample (for training, only the first half of each original sound file was used).

We were able to build a real-time system running on Windows NT using a Pentium II 300 MHz. The system responds well to control changes, but considerable latency is caused by the operating system, data acquisition board and sound card.

Large scale models that cover the entire musical space are not yet satisfying. Note transitions are particularly difficult to model, because there is very little data representing them.
While the inference approach interpolates well between observed control data, it is not designed to extrapolate into unseen control domains. Therefore control input outside the range covered by the training data results in unpredictable output during performance.

5 Conclusions and Future Work

We have shown how the general inference framework CWM can be used in a sensing and signal-processing system for musical synthesis. The approach is particularly appropriate for musical instruments that rely on continuous human control, such as the violin, since CWM naturally relates input and output time series.

[Figure 3. Panels: BowVel, FingPos, Audio, Predicted Harmonic Frequencies, Predicted Harmonic Amplitudes.]

[...] synthesis techniques to generate a model that provides both control flexibility and high fidelity to the original. More work will be needed to progress to a high quality digital instrument, but these preliminary results indicate the promise of "physics sampling".

6 Acknowledgments

The authors would like to thank Romy Shioda, Sandy Choi and Teresa Marrin for helping in the data collection, [...] for writing parts of the code used in this work, and Joe Paradiso for [...] the violin bow. This work was made possible by the Media Lab's Things That Think consortium.

References

[Casdagli, 1992] Casdagli, M. (1992). A dynamical systems approach to modeling input-output systems. In Casdagli, M. and Eubank, S., editors, Nonlinear Modeling and Forecasting, Santa Fe Institute Studies in the Sciences of Complexity, pages 265-281, Redwood City. Addison-Wesley.

[Dempster et al., 1977] Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B, 39:1-38.

[Gershenfeld, 1998] Gershenfeld, N. (1998). The Nature of Mathematical Modeling. Cambridge University Press, New York.

[Gershenfeld et al., 1997] Gershenfeld, N., Schoner, B., and Metois, E. (1997). Cluster-weighted modeling for time series prediction and characterization. Submitted.

[Jordan and Jacobs, 1994] Jordan, M. and Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214.

[Massie, 1998] Massie, D. C. (1998). Wavetable sampling synthesis. In Kahrs, M. and Brandenburg, K., editors, Applications of Digital Signal Processing to Audio and Acoustics, pages 311-341. Kluwer Academic Publishers.

[McAulay and Quatieri, 1985] McAulay, R. and Quatieri, T. (1985). Speech analysis/synthesis based on a sinusoidal representation. Technical Report 693, Massachusetts Institute of Technology / Lincoln Laboratory, Cambridge, MA.

[McAulay and Quatieri, 1986] McAulay, R. and Quatieri, T. (1986). Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-34 No. 4:744-754.

[Paradiso and Gershenfeld, 1997] Paradiso, J. A. and Gershenfeld, N. (1997). Musical applications of electric field sensing. Computer Music Journal, 21(2):69-89.

[Serra and Smith, 1990] Serra, X. and Smith, J. O. (1990). Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal, 14(4):12-24.

[Smith, 1992] Smith, J. O. (1992). Physical modeling using digital waveguides. Computer Music Journal, 16(4).

[Takens, 1981] Takens, F. (1981). Detecting strange attractors in turbulence. In Rand, D. and Young, L., editors, Dynamical Systems and Turbulence, volume 898 of Lecture Notes in Mathematics, pages 366-381, New York. Springer-Verlag.