Time-varying estimation of parameters in rule systems for music performance

Patrick Zanon, Giovanni De Poli, Alessandro Dorigatti
CSC - DEI, University of Padova, Italy
{patrick,depoli}@dei.unipd.it, excawind@inwind.it

Abstract

Most performance rule systems compute deviations by using a set of weighted rules. In this paper we describe a method of parameter (or weight) estimation, so that the rule system can generate a performance that is as similar as possible to a human interpretation of a given score. The method allows both a global and a time-varying estimation on different time scales.

1 Introduction

The analysis of deviations in music performance has led to the formulation of models that describe their structure, with the aim of automatically synthesizing what the player unconsciously adds to the notation of the score. Several rule systems, with different characteristics and various degrees of flexibility, have been proposed, but all of them aim to cover as wide a range of "expressive" variations as possible. These rules are developed mostly by analysis by synthesis or by machine learning algorithms [5]. The most important is the KTH rule system [2]. Different rules can be weighted by the so-called k parameters, which allows them to model performances more closely and to adapt the rules to different situations. Moreover, tuning the weights can be used to model emotional expression [1]. Weighting parameters are normally estimated by trial and error with an analysis-by-synthesis approach. Friberg [3] attempted automatic estimation by iteratively minimizing a distance measure between performances, varying the parameter values in order to approach a given performance. However, the performance style or strategy can change over the course of the piece. It is therefore interesting to have a time-varying estimation of the parameters.
In this work we use a suitable pre-Hilbert space¹ to represent the scores, which allows an optimal estimation of the parameters on different temporal scales and makes it possible to follow their variation in time.

¹ A pre-Hilbert space is a linear space equipped with an inner product. A complete pre-Hilbert space is a Hilbert space.

2 Estimation Methodology

We observed that the models considered start from the nominal performance (a literal or mechanical interpretation of the score) and introduce, in an additive way, duration and volume variations on some notes according to the rules, each weighted by a characteristic multiplicative coefficient $k_j$. The idea is to estimate the parameters so that the interpretation generated by the rule system is as similar as possible to a given one (e.g. a human performance), which will be called the sample performance. This suggests the way to proceed: by representing the performances with suitable vectors and formalizing their distance with a particular formulation of the Euclidean norm, we gain access to the results of the theory of pre-Hilbert spaces, and in particular to the projection theorem, which gives the best approximation in the least-squares sense.

This can be done with some definitions. Every performance of $n$ notes is represented as a vector $p$ lying in a $(3n-1)$-dimensional space $P$, whose elements are: $n$ sound intensities (sound levels $L_i$), $n$ durations ($D_i$), and $n-1$ time intervals between notes (inter-onset intervals $I_i = O_{i+1} - O_i$, where $O_i$ is the onset time of the $i$-th note). The vector space $P$, as it stands, does not make evident the variations introduced by the performer or by the rules of the expressiveness model. Therefore, when it is necessary to refer to the deviations from the nominal performance $p_N$, we use the symbol $\Delta p = p - p_N$.
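As a minimal sketch of this representation, the following Python fragment stacks the sound levels, durations and inter-onset intervals of a performance into one $(3n-1)$-dimensional vector and computes $\Delta p$. The function names `performance_vector` and `deviations`, and the three-note toy data, are illustrative assumptions and do not come from the paper.

```python
def performance_vector(levels, durations, onsets):
    """Stack n sound levels, n durations and n-1 inter-onset intervals."""
    n = len(levels)
    if len(durations) != n or len(onsets) != n:
        raise ValueError("levels, durations and onsets must have equal length")
    # I_i = O_{i+1} - O_i for consecutive notes
    iois = [onsets[i + 1] - onsets[i] for i in range(n - 1)]
    return list(levels) + list(durations) + iois


def deviations(p, p_nominal):
    """Delta p = p - p_N, component by component."""
    return [a - b for a, b in zip(p, p_nominal)]


# Hypothetical toy data: three notes (levels in dB, times in seconds).
p_nominal = performance_vector([80.0, 80.0, 80.0],
                               [0.5, 0.5, 1.0],
                               [0.0, 0.5, 1.0])
p_human = performance_vector([82.0, 79.0, 84.0],
                             [0.45, 0.55, 1.2],
                             [0.0, 0.52, 1.1])
delta_p = deviations(p_human, p_nominal)
```

With three notes the vectors have $3 \cdot 3 - 1 = 8$ components: three levels, three durations and two inter-onset intervals.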
With this notation, each rule of the model is represented by a $(3n-1)$-dimensional vector $r_j$, whose components are the deviations that would be applied to $p_N$ using a unit weight. Thus the model can generate the following performances:

$$p_g = p_N + \sum_{j=1}^{m} k_j r_j \qquad (1)$$

where $m$ is the number of applied rules. All the deviations $\Delta p_g = p_g - p_N$ lie in a vector subspace $\mathcal{R}$ of dimension $m$ whose basis is $R = [r_1 \dots r_m]$.

The Euclidean norm should take into account the characteristics of the human ear, so several psychoacoustic principles were used. On the basis of these studies, we chose to measure variations of intensity in dB and variations of duration as a percentage of the nominal duration or inter-onset interval. At this point a criterion must be selected for how loudness variations and time variations are to be considered together. Many possibilities were tested, but the best results were obtained by weighting these different kinds of variations so as to balance the effects of the just noticeable differences for the human ear (we used 0.5 dB on average for sound levels and approximately 5% for durations). Thus the norm used in our space is:

$$\|\Delta p\|^2 = \sum_{i=1}^{n} \alpha_L^2\, \Delta L_i^2 + \sum_{i=1}^{n} \alpha_D^2\, \frac{\Delta D_i^2}{D_{N,i}^2} + \sum_{i=1}^{n-1} \alpha_I^2\, \frac{\Delta I_i^2}{I_{N,i}^2} \qquad (2)$$

where $\alpha_L = 0.1$, $\alpha_D = 1$ and $\alpha_I = 1$ are the weights.

In order to find the performance $\hat{p}_g$ that is as similar as possible to a sample one, the best approximation theorem is used: the solution is the orthogonal projection of the vector representing the variations of the given performance onto $\mathcal{R}$, and the multiplicative coefficients are the components of the projection with respect to the basis $R$. Moreover, it is possible to measure how closely the expressiveness model can approach the human performance by comparing the norm of the human deviations with the norm of the part that the synthetic performance leaves unexplained, defining the efficiency as

$$\xi = 1 - \frac{\|p - \hat{p}_g\|}{\|p - p_N\|} \qquad (3)$$

Having defined a pre-Hilbert space, some other features, such as the angles between rules and between performances, can be obtained. This makes it possible to verify whether rules or performances are geometrically related. Related rules are almost parallel, and analysis of the angle between them allows the expressiveness model to be simplified. In fact, rules are often triggered by the same underlying principle and cooperate in introducing deviations. Care should be taken in the estimation procedure: if some rules are approximately parallel, their parameters can be estimated with very high values, because one rule tends to cancel the effect of the other. The method then tries to explain the deviations using the small difference between the two rules (over-fitting problem).

With this methodology it is possible to estimate the optimal coefficients for approximating a given performance according to any additive model of expressiveness. We present our results with reference to the KTH rule system. Minimization of the norm (2) can also be used to estimate non-additive parametric rule systems (e.g. neural networks), but in that case we cannot exploit geometric properties of the performance space such as angles.
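The projection, the efficiency and the angle between rules described in this section can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the weights are taken to multiply each normalized deviation before squaring, the coefficients are obtained from the normal equations of the weighted inner product, and all function names and the two-note toy rules are hypothetical rather than part of the KTH system.

```python
import math


def norm_scales(nominal_durations, nominal_iois, a_l=0.1, a_d=1.0, a_i=1.0):
    """Per-component scale factors of the weighted norm (assumed form)."""
    n = len(nominal_durations)
    return ([a_l] * n
            + [a_d / d for d in nominal_durations]
            + [a_i / i for i in nominal_iois])


def inner(a, b, scales):
    """Weighted inner product that induces the norm."""
    return sum((s * x) * (s * y) for s, x, y in zip(scales, a, b))


def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    m = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("rules are (nearly) linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            f = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= f * aug[col][c]
    k = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(aug[r][c] * k[c] for c in range(r + 1, m))
        k[r] = (aug[r][m] - s) / aug[r][r]
    return k


def estimate(delta_p, rules, scales):
    """Project delta_p onto span(rules); return (k, efficiency)."""
    gram = [[inner(ri, rj, scales) for rj in rules] for ri in rules]
    rhs = [inner(delta_p, ri, scales) for ri in rules]
    k = solve(gram, rhs)
    fitted = [sum(kj * r[i] for kj, r in zip(k, rules))
              for i in range(len(delta_p))]
    residual = [d - f for d, f in zip(delta_p, fitted)]
    total = math.sqrt(inner(delta_p, delta_p, scales))
    eff = 1.0 - math.sqrt(inner(residual, residual, scales)) / total
    return k, eff


def angle_deg(a, b, scales):
    """Angle between two vectors under the weighted inner product."""
    c = inner(a, b, scales) / math.sqrt(inner(a, a, scales) * inner(b, b, scales))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))


# Hypothetical two-note example: components are [L1, L2, D1, D2, I1].
scales = norm_scales([0.5, 0.5], [0.5])
r1 = [1.0, 1.0, 0.0, 0.0, 0.0]
r2 = [1.0, 0.0, 0.05, 0.0, 0.05]
delta_p = [2.0 * a - 1.0 * b for a, b in zip(r1, r2)]
k, efficiency = estimate(delta_p, [r1, r2], scales)
```

Because `delta_p` was built as $2 r_1 - r_2$, the sketch recovers $k \approx (2, -1)$ with an efficiency close to 1. When two rules are almost parallel the Gram matrix becomes ill-conditioned, which is the numerical counterpart of the over-fitting problem described above.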
2.1 Tempo and amplitude scales estimation

The nominal performance is played at the nominal tempo specified in the score, but the musician can choose another value or change it according to his expressive strategy; similar considerations apply to the intensity. On the other hand, we want the performance deviations $\Delta p$ used as input to our estimation procedure to be truly expressive deviations, and not an artifact of a bias caused by a wrong estimation of tempo and intensity. It would be possible to normalize tempo and loudness to the total performance duration and the average intensity, but the presence of short rests separating the phrases, and other phrasing effects, sometimes make this normalization unreliable on the local scale. We therefore extended our estimation procedure to obtain an optimal estimation of the tempo and intensity scales. To this purpose, two special rules for tempo and amplitude estimation were added, represented by two $(3n-1)$-dimensional vectors:

$$r_t = [\,0 \dots 0 \;\; D_1 \dots D_n \;\; I_1 \dots I_{n-1}\,]^T, \qquad r_L = [\,1 \dots 1 \;\; 0 \dots 0 \;\; 0 \dots 0\,]^T \qquad (4)$$

where in the rule $r_L$ the first $n$ components are set to 1 and the others to 0. The first rule, $r_t$, introduces variations only in the timing components: if its weighting parameter $k_t = 1$, then all the durations of the synthetic performance double. The second rule, $r_L$, introduces deviations only in the loudness components: if its weighting parameter $k_L = 1$, then the intensities increase by 1 dB. It is important to underline that, after the estimation, $k_t$ and $k_L$ indicate how much slower and louder the human performance is than the nominal performance in the least-squares sense. For example, take a nominal performance played at 100 bpm and 80 dB; if the human performance to be analyzed was played at a constant 60 bpm and a constant 85 dB, then every duration is multiplied by 100/60 = 5/3, and estimating the parameters using only the rules $r_t$ and $r_L$ gives $k_t = 2/3$ and $k_L = 5$.
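The numerical example above can be checked with a short Python sketch. Since $r_t$ and $r_L$ have no non-zero component in common, each coefficient can be computed separately as a ratio of inner products. The function name, the four-note data and the assumption that the weights enter the norm squared are illustrative choices, not taken from the paper.

```python
def tempo_and_level_scales(levels, durations, iois,
                           nom_levels, nom_durations, nom_iois,
                           a_d=1.0, a_i=1.0):
    """Estimate (k_t, k_L) using only the rules r_t and r_L."""
    # Relative timing deviations; the rule r_t itself has relative value 1
    # on every timing component, so k_t is their weighted mean.
    rel_d = [(d - dn) / dn for d, dn in zip(durations, nom_durations)]
    rel_i = [(i - i_n) / i_n for i, i_n in zip(iois, nom_iois)]
    k_t = ((a_d ** 2 * sum(rel_d) + a_i ** 2 * sum(rel_i))
           / (a_d ** 2 * len(rel_d) + a_i ** 2 * len(rel_i)))
    # r_L is 1 on every level component, so k_L is the mean level deviation
    # (the level weight cancels in the ratio).
    k_l = sum(l - ln for l, ln in zip(levels, nom_levels)) / len(levels)
    return k_t, k_l


# Four quarter notes: 0.6 s each at 100 bpm (nominal), 1.0 s each at 60 bpm.
n = 4
k_t, k_l = tempo_and_level_scales(
    levels=[85.0] * n, durations=[1.0] * n, iois=[1.0] * (n - 1),
    nom_levels=[80.0] * n, nom_durations=[0.6] * n, nom_iois=[0.6] * (n - 1),
)
```

The sketch returns $k_t \approx 0.667$ and $k_L = 5.0$, in agreement with the values stated in the text.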
It can be noticed that in general rules have a side effect: they introduce deviations that change the average tempo and loudness. Our method compensates for this side effect of the rules and obtains a better fit of the human performance. Thus, when all the rules are used, different values of $k_t$ and $k_L$ may be obtained. This situation can be expressed in terms of non-orthogonality between $r_t$ (or $r_L$) and some other rules. Notice that $r_t$ and $r_L$ are orthogonal to each other because their non-zero components do not overlap.

2.2 Sliding Window

The methodology was extended to allow a time-varying estimation of parameters using a user-definable moving window, called the sliding window. In this way the estimation has a local meaning, and it is possible to see whether the parameters change over time. Because of this local behavior, attention should be paid to which rules are included in the estimation: many of them have a global meaning (e.g. the Phrase Arch rule) and therefore have to be excluded from the sliding window estimation in order to avoid artifacts. Another issue is the over-fitting problem that may emerge from the trend of the k parameters: sometimes several rules introduce, for a particular score, the same kind of deviations, and for this reason only one of them was included in the estimation.

3 Results

The method has been applied to different musical pieces, each interpreted several times by a pianist. As an example we present the results obtained with different interpretations by a professional pianist of the "Sonatina in G major" by Ludwig van Beethoven on a Yamaha Disklavier Coda. The first 24 measures were played without repeats, for a total of 150 notes, divided into three main phrases: from the 1st to the 48th note, from the 49th to the 102nd note, and from the 103rd to the 150th note. The structure of the piece is ABA.
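The sliding window estimation of Section 2.2 can be illustrated with a small Python sketch for the simplest case, in which only the loudness rule $r_L$ is estimated: the local $k_L$ is then the mean level deviation of the notes inside each window. The helper names, the hop size and the eight-note data are hypothetical.

```python
def sliding_windows(n_notes, size, hop=1):
    """Yield (start, stop) note-index ranges of a window moved along the piece."""
    for start in range(0, n_notes - size + 1, hop):
        yield start, start + size


def local_level_scale(level_deviations, size, hop=1):
    """Local k_L per window: projection of the level deviations onto r_L."""
    return [sum(level_deviations[a:b]) / size
            for a, b in sliding_windows(len(level_deviations), size, hop)]


# Hypothetical level deviations in dB: a soft half followed by a loud half.
delta_levels = [-4.0, -4.0, -4.0, -4.0, 2.0, 2.0, 2.0, 2.0]
k_l_track = local_level_scale(delta_levels, size=4)
```

For this toy input the estimated track is `[-4.0, -2.5, -1.0, 0.5, 2.0]`: the abrupt change between the two halves is spread over the length of the window, so the window size trades time resolution against the stability of the estimate.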
For each stimulus the pianist was asked to play many performances, and then he chose the most effective one. A first set of performances was played in a natural way: the pianist was asked to play without any expressive intention, interpreting the score as similarly as possible each time. The estimated values (using all the KTH rules together with $r_t$ and $r_L$) were strictly related to the performance style, and the set of coefficients was very stable (figure 1a). A second group of performances was then used, in which the pianist was asked to play with different expressive intentions: in this case some parameters showed considerable variations (figure 1b).

Figure 1: Estimation of parameters on the whole piece for (a) natural performances and (b) different expressive intentions (Anger, Disgust, Fear, Happiness, Indifferent, Sadness).

A second set of results was obtained by estimating the parameters using only the $r_t$ and $r_L$ rules. This choice makes it possible to estimate the mean tempo and the mean loudness. From now on, the results are discussed with reference to the "Anger" performance; a similar discussion applies to all the other performances. The estimation was done over the whole piece and over each phrase (table 1). In general the piece was played faster ($k_t < 0$) and softer ($k_L < 0$) than the nominal performance. In particular, the two instances of phrase A were played faster and louder on average than phrase B, and from the efficiency $\xi$ we can see that in the A phrases the pianist introduced more deviations that cannot be explained using only the $r_t$ and $r_L$ rules than in B.

Notes     1-150    1-48   49-102  103-150
k_t       -0.26   -0.28    -0.23    -0.28
k_L       -3.78   -3.83    -3.90    -3.58
ξ          0.54    0.52     0.74     0.47

Table 1: Estimated parameters using only the $r_t$ and $r_L$ rules, over the whole piece and over the three phrases, for the "Anger" performance.

Subsequently the analysis was carried out using a sliding window 10 notes wide (1-2 bars on average). The results of the estimation are presented in figure 2. We can see that the central phrase was played at a lower tempo than the other two, as seen above; moreover, for each instance of phrase A two parabolic arcs are present. Similar considerations apply to the profile of the $k_L$ parameter: it indicates a softer interpretation of phrase B and, in general, a louder interpretation marking the boundaries between phrases; note the absence of loudness modulation in the final phrase. Finally, as predicted by the efficiency $\xi$ computed previously, we can see that in phrase B the musician introduced fewer deviations than in the A phrases.

Figure 2: Estimation of parameters $k_t$ (a) and $k_L$ (b) using a sliding window 10 notes wide for the "Anger" performance. The two vertical lines indicate the division into phrases.

Finally, a restricted set of rules was used with the sliding window procedure. In this case the window size was increased to 25 notes (3-4 bars on average) in order to avoid over-fitting problems. The rules used in this estimation were the following: $r_t$, $r_L$, Duration-Contrast-A0-D1, Duration-Contrast-A1-D0, Duration-Contrast-Art, Harmonic-Charge-A0-D1, Harmonic-Charge-A1-D0, High-Loud, Melodic-Charge-A0-D1, Melodic-Charge-A1-D0 and Punctuation-D0-D1 (Dur=0, Duroff=1), where A0-D1 means the duration part of the rule and A1-D0 means the amplitude part of the rule. The Punctuation-D1-D0 rule was excluded because of its high correlation with the Duration-Contrast-A0-D1 rule, which produced over-fitting and over-estimation of the respective k parameters.

Figure 3: Estimation of parameters using a sliding window 25 notes wide in the "Anger" performance.

The parameters shown in figure 3 are quite variable, and some of them seem to be related to the structure of the piece. Moreover, in the second phrase the parameters seem to be more stable than in the other two. The parameters estimated for the rules $r_t$ and $r_L$ tend to compensate for the bias introduced by the other rules. The other parameters are reported in table 2, which shows their mean and variance.

Rule name                    Mean   Variance
Duration-Contrast-A0-D1     -1.50       3.37
Duration-Contrast-A1-D0      0.64       0.13
Duration-Contrast-Art        0.39       0.06
Harmonic-Charge-A0-D1       -0.32       2.97
Harmonic-Charge-A1-D0        0.01       0.02
High-Loud                    0.15       0.01
Melodic-Charge-A0-D1         0.75       3.49
Melodic-Charge-A1-D0        -0.19       0.08
Punctuation-D0-D1            0.79       2.94

Table 2: The average k parameters and their variance for the "Anger" performance.

As an example we describe the parameter of the Duration-Contrast-A0-D1 rule: it has a high variance, and its value seems to be highly related to the structure of the piece (see figure 4). In the first phrase (1-48) most of the values are negative, and in particular in the first two sub-phrases (1-26) the values are about -4. In the third sub-phrase (27-48) there is a positive trend that reaches a value of about 1.
In the second phrase (49-102) the values are on average close to 0: in the first sub-phrase (49-61) the value is close to 1, and in the second one (62-102) the values range from -1 to 1 but are on average around 0. Finally, in the third phrase (103-150) the values repeat the shape of the first phrase; in fact the structure of the piece is ABA. It should be noticed that the sliding window smooths the values between phrases, so that the last part of the second phrase and the first part of the third (90-110) show a negative trend that leads to a shape very similar to that of the first phrase. The same smoothing behavior is present between the first and the second phrase (40-50).

Figure 4: Measured deviations and estimation of the k parameter of the Duration-Contrast-A0-D1 rule in the "Anger" performance.

4 Conclusions

A method for the time-varying estimation of weighting parameters in additive rule systems for music performance was presented. It makes it possible to characterize the changing style of a performance. The definition of a Euclidean norm allows a geometrical interpretation of the approximation problem, giving more insight into the rule system. We attempted to include perceptual issues in defining our norm. Different definitions of the norm in the performance space can be tried with the same estimation approach. It should be noticed that the hypothesis of additive rules is very attractive from a user's point of view because of its explanatory power, but it is an oversimplification of the real performance strategies of the performer. Different factors interact in an intricate way [4]. Minimization of norms such as (2) can also be used to estimate non-additive parametric rule systems (e.g. neural networks), but in that case we cannot exploit geometric properties of the performance space such as angles.

References

[1] Bresin R. and Friberg A. 2000. "Emotional Coloring of Computer Controlled Music Performance". Computer Music Journal, 24(4): 44-62.
[2] Friberg A. 1991. "Generative Rules for Music Performance: a Formal Description of a Rule System". Computer Music Journal, 15(2): 56-71.
[3] Friberg A. 1995. "Matching the rule parameters of Phrase arch to performances of 'Träumerei': A preliminary study". Proc. KTH symposium on Grammars for music performance, 37-44.
[4] Palmer C. 1997. "Music Performance". Annual Review of Psychology, 48: 115-138.
[5] Widmer G. 2001. "Inductive Learning of General and Robust Local Expression Principles". Proc. International Computer Music Conference.