A MODEL TO ADD EXPRESSIVENESS TO AUTOMATIC MUSICAL PERFORMANCE

Canazza Sergio, De Poli Giovanni, Di Sanzo Gianni, Vidolin Alvise
Dipartimento di Elettronica e Informatica - Università di Padova
via Gradenigo 6a - 35100 Padova - Italy
canazza@dei.unipd.it; depoli@dei.unipd.it; vidolin@dei.unipd.it

ABSTRACT

The fact that music can be used as a means of expression and communication is widely acknowledged. Yet this is one of the least understood aspects of music, at least as far as scientific explanation goes. The performer introduces deviations from the nominal values specified in the score, and these deviations characterize his own performance. It is well known that several performances of the same score often differ significantly, depending on the performer's expressive intentions. A model of expressiveness has been developed on the basis of the results of acoustic and perceptual analyses. The model makes it possible to obtain different performances by modifying the acoustic parameters of a given neutral performance. The modification of the input performance is carried out by algorithms that use the hierarchical segmentation of the musical structure. Suitable envelope curves are applied, at every hierarchical level, to the principal acoustic parameters. Self-similarity across levels is the main criterion for the construction of the envelope curves. The modular structure of the system defines an open architecture, where the rendering steps can be realized with both synthesis and post-processing techniques. Different synthesis techniques, such as FM, physical models and wavetable, have been explored.

1. INTRODUCTION

Music is an important means of communication in which three actors participate: the composer, the performer and the listener. The composer instills into his works his own emotions, feelings and sensations, and the performer communicates them to the listeners. The performer draws on his own musical experience and culture in order to obtain from the score a performance that conveys the composer's intention. Different musicians, even when referring to the same score, can produce very different performances. The score carries information such as the rhythmic and melodic structure of a piece, but there is not yet a notation able to describe precisely the temporal and timbre characteristics of the sound. The conventional score is quite inadequate to describe the complexity of a musical performance in such a way that a computer might be able to perform it. Whenever only the information of the score (essentially note pitch and duration) is stored in a computer, the performance sounds mechanical and not very pleasant. The performer, in fact, introduces micro-deviations in the timing, dynamics and timbre of the performance, following a procedure that is correlated to his own experience and is common in instrumental practice. It is exactly because of this great variety in the performances of a piece that it is difficult to determine a general system of rules for execution (Kendall & Carterette 1990). An important step in this direction was made by Sundberg and co-workers (Friberg 1991). They determined a group of criteria which, once applied to a generic score, can lead to a musically correct performance. Furthermore, the performer operates on the microstructure of the musical piece not only to convey the structure of the text written by the composer, but also to communicate his own feeling or expressive intention.
Quite a lot of studies have been carried out in order to understand how far the performer's intentions are perceived by the listener, that is to say how far they share a common code (Seashore 1937; Sloboda 1983, 1985; Canazza et al. 1997a). Gabrielsson (1995, 1997) and Gabrielsson & Juslin (1996), in particular, studied the importance of emotions in the musical message. In this context, we tried to understand the way an expressive intention can be communicated to the listener, and we realized a model able to explain how the performance of a musical piece can be modified in such a way that it conveys a certain expressive intention. A group of sensorial adjectives was chosen (hard, soft, light, heavy, bright, dark) which should inspire a certain expressive idea in a musician. We observed that a musician, inspired by appropriate adjectives, produces different performances of the same piece. Perceptual analysis (Canazza et al. 1997a) proved that the audience can indeed perceive the kind of intention the performer wanted to convey. Acoustic analysis (Canazza et al. 1997b, 1997c) confirmed that there are micro-deviations in the note parameters. We outlined models connecting such deviations with the intended expression. Following the analysis-by-synthesis method, some musical syntheses were produced to verify and develop a model of musical expressiveness (Canazza et al. 1997c). This paper, starting from the results of the acoustic and perceptual analyses, presents the design of a model able to add expressiveness to automatic musical performance.

These studies on the modeling of musical performance are interesting not only from a scientific point of view, but also from a practical one, both in the field of automatic musical performance and, more generally, in multimedia systems. This research was supported by Telecom Italia under the research contract Cantieri Multimediali.

2. ARCHITECTURE OF THE MODEL

The research we have been carrying out (Canazza et al. 1997c) shows that the performance of a piece following a certain expressive intention can be described by observing which variations take place with respect to a neutral and a nominal performance of the same piece. By nominal performance we mean the mechanical performance of the score, where the metrical durations are accurately observed; by neutral performance we mean a literal human performance of the score without any expressive intention or stylistic choice. The model presented is able to realize an expressive intention by transforming a neutral performance, with reference both to the score and to the acoustic signal itself. It must be underlined that our approach adopts, in the musical language, hierarchical structures similar to those of spoken language (words, phrases). Once these structures are recognized, it is possible to modify a parameter of a group of notes (for example metronome or intensity) following a certain curve. Such a curve describes the characteristics of the musical gesture over the group of notes. It is therefore convenient to describe, appropriately codified, the information about the neutral performance and the nominal performance (i.e. the score), the variations to be applied to the expressive traits of the single note (timing, intensity, timbre), and the subdivision of the piece into expressive units (words, semi-phrases, phrases) characterized by curves that modify one or more parameters of the notes that constitute them. To this aim, we propose a new representation of the score, where the fundamental components and parameters of a musical piece are highlighted. Moreover, it provides a number of controls on expressive parameters that allow the model to operate on the piece. In the following we shall refer to this new score as the metascore.

Figure 1: Architecture of the model (inputs: expressive intention, nominal and neutral performances, hierarchical structure; blocks: Musical Expressiveness Model, Instrument Family Driver, Synthesis Method Driver; outputs: MIDI, synthesis, post-processing).

The metascore is a file where the information about both a nominal performance and a neutral performance is codified. The parameters of the neutral performance are expressed as deviations from the nominal performance. The performances are read from a MIDI file and transcribed into the metascore. From the parameters of the MIDI protocol, the model computes new parameters that describe the attack, the intensity, the spectral characteristics and other physical attributes of each note. Thus, the basic parameters of each note are immediately accessible. Each of these parameters is expressed in a perceptual scale, in which one unit represents the difference between two perceptual levels (e.g. the difference between f and ff for loudness). These parameters are independent of the particular musical instrument that will play the score. The importance of the notes in the piece can be specified by parameters such as the accent and the elasticity.
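As an illustration of this encoding, the following sketch shows how one such parameter (loudness) might be stored as a perceptual-scale deviation of the neutral performance from the nominal one. The linear mapping from MIDI velocity to the perceptual scale and all function names are assumptions introduced here for illustration; the paper does not specify them.

```python
# Minimal sketch (not from the paper): loudness stored as a perceptual-scale
# deviation of the neutral performance from the nominal one.  One unit
# separates adjacent dynamics markings, so f -> ff corresponds to +1.
DYNAMIC_LEVELS = ["ppp", "pp", "p", "mp", "mf", "f", "ff", "fff"]

def perceptual_loudness(velocity: int) -> float:
    """Map a MIDI velocity (1-127) onto the dynamics scale.

    The linear mapping is an assumption; the paper does not state how
    MIDI parameters are converted to perceptual units.
    """
    return velocity / 127 * (len(DYNAMIC_LEVELS) - 1)

def loudness_deviation(neutral_velocity: int, nominal_velocity: int) -> float:
    """Neutral-performance loudness as a deviation from the nominal value,
    which is how parameters are stored in the metascore."""
    return perceptual_loudness(neutral_velocity) - perceptual_loudness(nominal_velocity)

# Example: a neutral note played slightly louder than notated.
dev = loudness_deviation(neutral_velocity=96, nominal_velocity=80)   # ~ +0.88
```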
Elasticity, in particular, takes into account the fact that expressive deviations are not made to the same extent on each note, because the score sets technical and structural limits that prevent the musical phrase from being distorted (De Poli et al. 1998). The metascore also requires, as input, a description of the hierarchical structure of the piece (fig. 1). As already mentioned, the notes of a piece should not be seen as units independent of one another: each piece has its own structure, in which the notes gather to form the words and phrases of a musical discourse. For instance, it is possible to set the metronome and intensity profiles on groups of notes following the division into words, semi-phrases and phrases. At the phrase level it is possible to specify an adjective inspiring the performance. With reference to the adjectives, and considering other factors such as elasticity, the model computes the deviations from the neutral performance needed to render the expressive synthesis.

The output of this first computation is still independent of the instrument that will play the score. It is then necessary to adapt the modified metascore to a particular family of instruments and then to a specific synthesis method (fig. 1).

3. THE HIERARCHICAL STRUCTURE

Notes are hierarchically grouped in the metascore. In this way it is possible to underline, in the piece, the structures equivalent to the word or the phrase in spoken language. On these structures it is natural to vary some acoustic quantities following the profiles described by suitable envelope curves. It is thus possible to introduce a crescendo or an accelerando in a symbolic way, without working note by note. The metascore describes four levels: note (N), word (W), semi-phrase (S) and phrase (P). The atomic event is the note, with which the acoustic parameters (auxiliary parameters) of the nominal performance are associated. The other events are described by the same variables. At the phrase level a further field defines the expressive intention the phrase is to be played with. The parameters that can be associated with the different events are:

* time (T): starting moment of the event, in ticks.
* duration (D): note duration in ticks, or number of sub-events forming the event.
* channel (C): number of the channel or track the event refers to.
* elasticity (E): degree of expressive elasticity of the event. It indicates to what extent it is possible to act on the parameters of that group of notes (intensity, metronome, timbre characteristics) to reach a certain expressive intention without distorting the piece itself.
* dynamics (D): intensity curve that describes the profile of the intensity deviations in the event.
* metronome (M): metronome curve that describes the profile of the metronome deviations in the event.
* expression (X): symbol of the adjective to be applied to the phrase, plus the intention degree (phrase level only).
* attack time (a): duration of the attack, expressed in a perceptual scale.
* legato (l): legato-staccato degree, expressed in a perceptual scale.
* intensity (i): perceptual loudness.
* brightness (b): perceptual measure of the high-frequency spectral components.
* vibrato (v): rate and extent of the vibrato, expressed in a perceptual scale.
* portamento (p): glissando speed degree, in a perceptual scale.

Figure 2 summarizes the meaning of the parameters for the note and word events. It should be noticed that the articulation (legato) between the notes (dur_N/IOI_N) was defined together with the corresponding parameters for the hierarchically higher events (dur_W/IOI_W, dur_S/IOI_S, dur_P/IOI_P). This proved to be particularly significant for expressive control. In this representation, the information about the nominal performance (symbolic level) and the deviations of the neutral performance from it are both maintained. The parameters are represented by symbolic and numeric constants and by curves. The curves to be applied to a given quantity (fig. 3) are described in the metascore text file by five values: t, p, g0, g1, g2, where t is the type of curve (e.g. -1 cusp, 0 triangle, 1 parabola), p is the position of the curve vertex, and g0, g1, g2 indicate the variations, in perceptual scale, at the beginning of the curve, at the vertex and at the end. The curve starts at the beginning (note on) of the first note of the group and ends at the note off of the last note of the group. Figure 3 summarizes the meaning of the various quantities.
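A possible way to encode such a hierarchy is sketched below; the tree representation and all field names are assumptions made for illustration, since the paper lists the parameters but not a concrete data layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Event:
    """A metascore event at one of the four levels N, W, S or P.

    Hypothetical encoding: a note carries its own auxiliary parameters,
    while a higher-level event groups sub-events and may carry dynamics
    and metronome curves plus, at phrase level, an expressive adjective.
    """
    level: str                        # "N", "W", "S" or "P"
    time: int                         # T: onset, in ticks
    duration: int                     # D: ticks (note) or number of sub-events
    channel: int = 1                  # C
    elasticity: float = 1.0           # E
    dynamics: Optional[str] = None    # intensity curve, e.g. "D0,2.25,-0.4,+0.6,-0.3"
    metronome: Optional[str] = None   # tempo curve, same five-value format
    expression: Optional[str] = None  # X: adjective + degree (phrase level only)
    note_params: dict = field(default_factory=dict)   # a, l, i, b, v, p
    children: List["Event"] = field(default_factory=list)

def leaf_notes(event: Event) -> List[Event]:
    """Flatten an event tree back into its note-level (N) leaves."""
    if event.level == "N":
        return [event]
    return [n for child in event.children for n in leaf_notes(child)]
```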
As an example of the curve encoding, D0,2.25,-0.4,+0.6,-0.3 describes the profile of the sound intensity through a triangular-shaped curve with the vertex placed a quarter of the way into the second note. The curve begins with a reduction of intensity, with respect to the value of the neutral performance, of 0.4 perceptual degrees; the deviation increases linearly up to +0.6 at the vertex and returns to -0.3 at the end of the group.
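A minimal sketch of how such a curve specification could be parsed and sampled is given below. It assumes that the first field combines the target parameter ('D' dynamics, 'M' metronome) with the curve type, that p is expressed in notes from the start of the group, and that the triangular profile is piecewise linear; none of these details are stated explicitly in the paper.

```python
def parse_curve(spec: str) -> dict:
    """Parse a metascore curve such as 'D0,2.25,-0.4,+0.6,-0.3'."""
    head, p, g0, g1, g2 = spec.split(",")
    return {"param": head[0],             # 'D' dynamics or 'M' metronome (assumed)
            "type": int(head[1:]),        # -1 cusp, 0 triangle, 1 parabola
            "p": float(p),                # position of the vertex
            "g": (float(g0), float(g1), float(g2))}

def triangle_value(curve: dict, x: float, n_notes: int) -> float:
    """Perceptual deviation at position x (0..n_notes) for a triangular curve."""
    g0, g1, g2 = curve["g"]
    p = curve["p"]
    if x <= p:                            # rising segment: g0 -> g1
        return g0 + (g1 - g0) * (x / p if p > 0 else 1.0)
    span = n_notes - p                    # falling segment: g1 -> g2
    return g1 + (g2 - g1) * ((x - p) / span if span > 0 else 1.0)

# The example from the text, sampled at the centre of each note of a 4-note word.
curve = parse_curve("D0,2.25,-0.4,+0.6,-0.3")
word_deviations = [triangle_value(curve, i + 0.5, 4) for i in range(4)]
```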

Figure 2: Meaning of the auxiliary parameters for the note and word events (note-on and note-off times on a time axis in seconds).

Figure 3: Typologies of curves applicable to a group of notes.

4. INTENSITY AND TEMPO CURVES MODEL

The model applies suitable envelopes to the parameters (or their derivatives) associated with the different events in order to achieve the desired expressive result. Here are a few examples of these curves. The variations applied to the abstract parameters are kept constant over the whole duration of each note. Thanks to the dynamics curves it is possible to control the intensity profile of an event. Figure 3 graphically explains the meaning of the parameters defined above. Table 1 reports, for each adjective, an exemplifying set of curves used in this work (in this case a triangular envelope was always used). The symbol n indicates the number of notes forming the word. The speed of the performance (tempo) changes when the expressive intention varies. The variations of the metronome speed to be applied in the synthesis were calculated following the acoustic analyses of (Canazza et al. 1997b, 1997c; Battel & Fimbianti 1997). As usual, the neutral performance was the reference point for calculating the tempo deviations. Such deviations are applied, through the tempo curve, to the chosen event. Different tempo curves can be applied at each hierarchical level. In this way it is possible to perform different deviations of both the local and the global tempo. For instance, local accelerando and ritardando can be obtained by applying tempo curves at the phrase level. Table 2 reports the percentage deviations of the global tempo for each expressive intention.

Adjective   Curve parameters
Neutral     g0 = g1 = g2 = 0
Natural     g0 = g2 = -0.1, g1 = 0, p = n/2 + 1
Bright      g0 = -0.2, g1 = 0, g2 = -0.3, p = 1.2
Dark        g0 = g1 = 0, g2 = -0.2, p = n/2 + 1
Hard        g0 = g1 = 0, g2 = -0.2, p = n/3 + 1
Soft        g0 = g2 = -0.1, g1 = 0, p = n/2 + 1
Light       p = n/3 + 1
Heavy       g0 = g2 = -0.2, g1 = 0, p = 1.2

Table 1: Examples of intensity curves for the word, according to the adjective that rules the expressive intention (a triangular envelope is always used; n is the number of notes forming the word). Similar curves are used at the other hierarchical levels.

Adjective   Deviation
Neutral     0
Natural     0
Bright      +6%
Dark        +2%
Hard        -1%
Soft        -7%
Light       +7%
Heavy       -9%

Table 2: Deviations of the metronome to be applied according to the expressive adjective.
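The data in Tables 1 and 2 can be translated directly into a lookup, as in the sketch below. The way the model combines the global tempo deviation with the local curves is not detailed in the paper, so the scaled_tempo helper is only an illustrative assumption, and the intensity-curve parameters for Light are omitted.

```python
# Global metronome deviations (Table 2), as percentages of the neutral tempo.
TEMPO_DEVIATION = {
    "neutral": 0, "natural": 0, "bright": +6, "dark": +2,
    "hard": -1, "soft": -7, "light": +7, "heavy": -9,
}

# Word-level triangular intensity curves (Table 1): (p, g0, g1, g2),
# where n is the number of notes forming the word.
def word_intensity_curve(adjective: str, n: int):
    table = {
        "neutral": (0.0,        0.0,  0.0,  0.0),
        "natural": (n / 2 + 1, -0.1,  0.0, -0.1),
        "bright":  (1.2,       -0.2,  0.0, -0.3),
        "dark":    (n / 2 + 1,  0.0,  0.0, -0.2),
        "hard":    (n / 3 + 1,  0.0,  0.0, -0.2),
        "soft":    (n / 2 + 1, -0.1,  0.0, -0.1),
        "heavy":   (1.2,       -0.2,  0.0, -0.2),
    }
    return table[adjective]

def scaled_tempo(neutral_bpm: float, adjective: str) -> float:
    """Apply the global metronome deviation of Table 2 (illustrative only)."""
    return neutral_bpm * (1 + TEMPO_DEVIATION[adjective] / 100)

# Example: a 'bright' rendering of a 5-note word at a neutral tempo of 117 bpm.
p, g0, g1, g2 = word_intensity_curve("bright", n=5)
bpm = scaled_tempo(117, "bright")    # 124.02
```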

5. CONTROL OF THE SYNTHESIS PROCESS

We have presented the approach we followed and the information contained in the model. Such information must be translated into sound. The inputs of the model are the nominal and the neutral performances. The model, taking into account the expressive intention, calculates the acoustic parameters needed to render the sound (fig. 1). The output of the model does not depend on the particular instrument. The next stage (Instrument Family Driver) processes this information in order to particularize it to a specific instrument family. Each family is characterized by its own kind of controls (vibrato, portamento, etc.). Based on the chosen instrument family, the model decides which parameters must be controlled and computes the appropriate curves. Finally, the last block (Synthesis Method Driver) defines the controls for the synthesis method (see the sketch below). At present the system provides interfaces for three different kinds of output performance: MIDI, synthesis and post-processing. The rendering method is chosen on the basis of timbre quality and controllability. We shall now present some significant examples of synthesis obtained with the model described here. For the piano (and for struck-string instruments in general), for instance, we used wavetable synthesis through the MIDI protocol, as the expressive controls of the piano are few. On the contrary, for instruments that allow a more articulated control, such as the violin, other strategies can be chosen. For instance, by using FM synthesis we obtained a well-controlled sound, even though its timbre quality was quite poor. Other approaches were also explored; one of them exploits post-processing to manipulate the original sound (i.e. the neutral performance) through time-frequency techniques (Drioli & Di Federico 1998). This approach proved to be necessary whenever very high quality results are required on critical instruments such as the singing voice or the violin. Time-frequency techniques such as sinusoidal modeling (McAulay & Quatieri 1986) or sinusoidal plus residual, SPR (Serra 1997), have proved very effective for timing- or pitch-related processing of the signal. Moreover, by separating the residual part of the signal (noise) from the harmonic component, these techniques allow the two parts to be processed independently. Time-frequency techniques have been successfully used by (Serra 1997; Drioli & Di Federico 1998) in order to obtain expressive performances.

6. RESULTS

The model was applied to different musical repertoires. Although it was developed mainly for western classical music, the model showed a general validity in its architecture, even if it needs some tuning of the parameters. Expressive syntheses of pieces belonging to different musical genres (European classical, European ethnic, Afro-American) verified the generality of the rules used in the model. As an example of a piece codified and performed through the model, we present W. A. Mozart's Clarinet Concerto in A major, K. 622. Figure 5 shows a musical phrase of the melody (bars 57-60) and its segmentation. We show here the amplitude envelopes of some of the clarinet performances obtained (through post-processing) with the controls given by the model. Figure 6a shows the neutral performance; figures 6b and 6c show the hard and soft performances obtained using time-frequency techniques to carry out the transformations calculated by the model.
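Returning to the rendering chain of Section 5, the sketch below illustrates how the instrument-independent output of the model might be particularized by an Instrument Family Driver and turned into rendering commands by a Synthesis Method Driver. All class names, methods and the velocity mapping are hypothetical; the paper describes the blocks but not their programming interface.

```python
class InstrumentFamilyDriver:
    """Keeps only the perceptual controls that a given family can realize."""
    CONTROLS: tuple = ()

    def particularize(self, note_devs: dict) -> dict:
        return {k: v for k, v in note_devs.items() if k in self.CONTROLS}

class PianoFamilyDriver(InstrumentFamilyDriver):
    # The piano offers few expressive controls (no vibrato or portamento).
    CONTROLS = ("intensity", "legato", "attack_time")

class StringFamilyDriver(InstrumentFamilyDriver):
    CONTROLS = ("intensity", "legato", "attack_time",
                "brightness", "vibrato", "portamento")

class MidiWavetableDriver:
    """Synthesis-method driver producing MIDI events for a wavetable player."""
    def render(self, notes):
        events = []
        for note in notes:
            intensity = note["controls"].get("intensity", 0.0)
            # Illustrative mapping from perceptual deviation to MIDI velocity.
            velocity = max(1, min(127, round(64 * 2 ** intensity)))
            events.append({"pitch": note["pitch"], "velocity": velocity,
                           "start": note["time"], "duration": note["duration"]})
        return events
```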
Table 3 lists the MIDI performances (intended for the piano), codified in Adagio format, of the nominal performance (i.e. the score), of the input (i.e. the neutral performance) and of the six expressive performances (i.e. expressive intentions) calculated by the model.

Figure 5: Bars 57-60 of Mozart's Clarinet Concerto in A major, K. 622, with hierarchical segmentation into phrase, semi-phrases and words.

Figure 6: Amplitude envelopes of the clarinet performances: (a) original neutral performance; (b) hard performance and (c) soft performance, obtained from the neutral performance through post-processing with time-frequency techniques.

Table 3: MIDI performances (codified in Adagio, tempo 117) of the input (nominal and neutral) and of the output for the six expressive intentions (Bright, Dark, Light, Heavy, Soft, Hard).

8. CONCLUSIONS

The performance of a piece cannot be reduced to a simple translation of the conventional score symbols into sounds, merely following the metrical durations and the note pitches. The musician puts his own sensitivity and experience into communicating emotions and feelings to the listener, using the expressive resources of his instrument. Studies on musical expressiveness (Canazza et al. 1997a, 1997b, 1997c) clarified which choices are made during the performance in order to convey a certain expressive intention. A new coding of the score (the metascore), suited to the automatic performance of a musical piece, was studied. The model was provided with a series of controls working on the single note. Characteristics such as intensity or note attack are described intuitively by means of perceptual scales. Moreover, special attention was given to the importance of working on groups of notes, hierarchically ordered, that are significant for the performance of the piece. The musical discourse is made of phrases, semi-phrases and words, on which the profile of a given quantity is defined. To this purpose a symbolic notation was introduced in order to describe the sound intensity profile and the performance timing (metronome) on phrases, semi-phrases and words. The metascore thus obtained does not depend on the instrument. The model processes this metascore in order to particularize it to a specific instrument family. Although it was developed mainly for western classical music, the model showed a general validity in its architecture, even if it needs some tuning of the parameters.

BIBLIOGRAPHY

Battel, G. U., & Fimbianti, R. (1997). "Analysis of expressive intentions in pianistic performances". In Proceedings of the International Kansei Workshop 1997 (pp. 128-133). Genova: Associazione di Informatica Musicale Italiana.
Canazza, S., De Poli, G., & Vidolin, A. (1997a). "Perceptual analysis of the musical expressive intention in a clarinet performance". In M. Leman (ed.), Music, Gestalt, and Computing. Studies in Cognitive and Systematic Musicology (pp. 441-450). Berlin, Heidelberg: Springer-Verlag.
Canazza, S., De Poli, G., Rinaldin, S., & Vidolin, A. (1997b). "Sonological analysis of clarinet expressivity". In M. Leman (ed.), Music, Gestalt, and Computing. Studies in Cognitive and Systematic Musicology (pp. 431-440). Berlin, Heidelberg: Springer-Verlag.
Canazza, S., De Poli, G., Roda', A., & Vidolin, A. (1997c). "Analysis and synthesis of expressive intentions in musical performance". In Proceedings of the International Computer Music Conference 1997 (pp. 113-120). Thessaloniki: International Computer Music Association.
De Poli, G., Roda', A., & Vidolin, A. (1998). "Note-by-note analysis of the influence of expressive intentions and musical structure in violin performance". Journal of New Music Research, in press.
Drioli, C., & Di Federico, R. (1998). "Toward an integrated sound analysis and processing framework for expressiveness rendering". In Proceedings of the International Computer Music Conference 1998.
Friberg, A. (1991). "Generative rules for musical performance: a formal description of a rule system". Computer Music Journal, 15(2), 56-71.
Gabrielsson, A. (1995). "Expressive intention and performance". In R. Steinberg (ed.), Music and the Mind Machine (pp. 37-47). Berlin, Heidelberg, New York: Springer-Verlag.
Gabrielsson, A. (1997). "Music performance". In D. Deutsch (ed.), The Psychology of Music (2nd ed.). New York: Academic Press.
Gabrielsson, A., & Juslin, P. N. (1996). "Emotional expression in music performance: between the performer's intention and the listener's experience". Psychology of Music, 24, 68-91.
Kendall, R. A., & Carterette, E. C. (1990). "The communication of musical expression". Music Perception, 8, 129-164.
McAulay, R. J., & Quatieri, T. F. (1986). "Speech analysis/synthesis based on a sinusoidal representation". IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(4), 744-754.
Seashore, C. E. (1937). "Objective analysis of musical performance". University of Iowa Studies in the Psychology of Music, Volume IV. Iowa City: University of Iowa.
Serra, X. (1997). "Musical sound modeling with sinusoids plus noise". In C. Roads, S. T. Pope, A. Piccialli, & G. De Poli (eds.), Musical Signal Processing (pp. 91-122). Swets & Zeitlinger.
Sloboda, J. A. (1983). "The communication of musical metre in piano performance". Quarterly Journal of Experimental Psychology, 35A, 337-396.