REAL TIME ANALYSIS OF EXPRESSIVE CONTENTS IN PIANO PERFORMANCES

Sergio Canazza, Giovanni De Poli, Antonio Rodà, Giulio Soleni, Patrick Zanon
{canazza,depoli,patrick}@dei.unipd.it, ar@csc.unipd.it
University of Padova, Center of Computational Sonology (CSC)
Dept. of Electronics and Informatics - Via Gradenigo 6/a - 35100 Padova, Italy

Abstract

A musical performance introduces deviations from the nominal values specified in the score; music reproduced without such variations is usually perceived as mechanical. Most investigations explore how the musical structure influences the performance, while only a few studies address how the musician's expressive intentions are reflected in it. The purpose of this work is to develop an automatic method for analyzing the expressive content of a musical performance. Starting from this method, a software tool that allows real-time analysis of the performer's expressive intention is developed. The work is based on the Perceptual Parametric Space model, previously developed at CSC from perceptual and acoustical analyses: a two-dimensional space where each point is associated with a set of acoustic parameters that can be correlated with the expressive characteristics of a musical piece.

1 Introduction

In the western tradition, music is usually conveyed by means of a symbolic description, namely a score. The conventional score, however, is quite inadequate to describe the complexity of a musical performance in such a way that a computer could play it. The performer, in fact, introduces micro-deviations in the timing, the dynamics and the timbre of the performance, according to his own experience and to common instrumental practice. Furthermore, the performer operates on the microstructure of the musical piece not only to convey the structure of the text written by the composer, but also to communicate his own feeling or expressive intention. It is well known that spoken language can be enriched with different meanings depending on the variations introduced by the speaker. Similarly, the musician adds expressiveness to the musical message by varying, during the performance, the timing, dynamics and timbre of the musical events with respect to the written score.

The main purpose of this work is to design a system for the analysis of the expressive meaning of some quantities related to each sound event. The system measures, in real time, the deviations that occur in their values; in this way it is possible to estimate a set of parameters that can be used to determine the expressive intention characterizing the performance at the very moment it is played.

Up to now, a lot of effort has been devoted to developing a model able to explain the process that guides a musician in translating a particular expressive intention into the music he performs. Some important studies have provided models for the synthesis of expressive intentions, but few steps have been taken towards the automatic analysis of expressiveness. One fundamental approach to the synthesis of expressive music derives models described by a collection of rules, obtained using an analysis-by-synthesis method. The most important is the KTH rule system [3]. Each rule (which tries to explain quantitatively a particular psychoacoustic principle) is applied in order to modify a mechanical performance and obtain a more human-like one.
This is an attempt to introduce, in a systematic and controlled way, the same deviations a musician would insert in the piece he performs. Other approaches that lead to similar collections of rules make use of AI algorithms [4][6]. The main streams of research nowadays employ machine learning algorithms, case-based reasoning systems, inductive learning, neural network techniques and fuzzy logic. Starting from the results of acoustic and perceptual analyses, Canazza and Rodà developed a model able to add expressiveness to automatic musical performances [1][2].

The model is able to synthesize a performance conveying an expressive intention by transforming a neutral one (i.e. a literal human performance of the score, without any expressive intention or stylistic choice), working both with reference to the score and to the acoustic signal itself. It must be underlined that this approach involves the adoption of hierarchical structures in the musical context, similar to those of spoken language (syllables, words and phrases). Once these structures are recognized, it is possible to modify the parameters (i.e. metronome or intensity) of a group of notes according to a suitable curve. Such a curve describes the properties of the musical gesture the group of notes is expressing. The input of the expressiveness model is composed of a description of a neutral musical performance, the nominal score of the performance, and an interface that allows the user to control the expressive intention he wants to convey. The expressiveness model acts on the symbolic level, computing the deviations of all the musical parameters involved in the transformation.

The model is based on a "Perceptual Parametric Space" (PPS) obtained from data analysis performed on perceptual tests. The two axes of the space are closely correlated with physical acoustic quantities: respectively, the kinetics of the music (bright vs. dark), e.g. Tempo and Legato, and the energy of the sound (soft vs. hard), e.g. Loudness (see fig. 1).

Fig. 1: Perceptual Parametric Space (PPS) obtained from data analysis performed on perceptual tests. The first factor (75.2%) is correlated with the kinetics of the music; the second factor is correlated with the energy of the sound.

This model aims to simulate the micro-deviations of the acoustic quantities introduced by musicians, which are needed to transform a deadpan performance (a mechanical interpretation of the score) into a more human-like one. These deviations are represented by means of a set of measurable coefficients, named K-coefficients, that can be related to the various acoustic quantities (Tempo, Legato, Loudness, etc.). Once a generic point in the PPS has been selected, it is possible to calculate one set of K-coefficients for each acoustic quantity.

This work develops a technique for the real-time analysis of expressiveness, starting from the results obtained so far and adopting the PPS model as an instrument of analysis. The system that has been developed aims to extract (or rather, estimate) from an expressive performance, played in real time, the different sets of K-coefficients related to three acoustic quantities: Tempo, Legato and Loudness. By comparing the estimated sets with the sets of K-coefficients associated with each point of a well-defined Perceptual Parametric Space, it is possible to detect the expressive intention that characterizes the music performed at a particular moment. In the next sections, the algorithm used to implement this system will be described. EyesWeb [5], the software environment used to build the various modules of the system, will also be briefly presented. Finally, the main results that led to the validation of the system will be described.
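To make the role of the PPS in the analysis concrete, the sketch below shows one possible in-memory representation: a table of labelled points in the two-dimensional space (kinetics, energy), each associated with a pair of K-coefficients per acoustic quantity. The intention labels are those used in the validation experiment of Section 3; the class layout, coordinates and coefficient values are illustrative assumptions, not the data of the actual PPS.

```python
# Illustrative sketch of a Perceptual Parametric Space (PPS) table.
# Intention labels come from the paper; every numeric value is a placeholder.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class PPSPoint:
    kinetics: float                   # first axis: bright vs. dark
    energy: float                     # second axis: soft vs. hard
    k_tempo: Tuple[float, float]      # pair of K-coefficients for Tempo
    k_legato: Tuple[float, float]     # pair of K-coefficients for Legato
    k_loudness: Tuple[float, float]   # pair of K-coefficients for Loudness

# Hypothetical PPS: one labelled point per expressive intention.
PPS = {
    "hard":  PPSPoint(+0.8, +0.9, (1.1, 0.9), (0.8, 1.2), (1.3, 1.0)),
    "soft":  PPSPoint(-0.6, -0.8, (0.9, 1.1), (1.2, 0.8), (0.7, 1.0)),
    "heavy": PPSPoint(-0.7, +0.7, (0.8, 1.3), (1.1, 1.0), (1.2, 1.1)),
    "light": PPSPoint(+0.7, -0.6, (1.2, 0.8), (0.7, 1.1), (0.8, 0.9)),
}
```

A table of this kind is what the recognizer described below compares its estimated K-coefficients against.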

2 The system

Once the three main expressive acoustic quantities (Tempo, Legato and Loudness) have been selected, the problem is: how can they be related to expressive intentions? More precisely, how can each single sound event (say, each single note) be related to a precise set of values of the K-coefficients and then to the appropriate expressive intention? In order to answer the first question, it is important to clarify some terms that will be used in the following paragraphs to specify these expressive acoustic quantities. These definitions are strictly related to the MIDI protocol; in fact, the system has been developed to work with Standard MIDI Files and real-time streams of MIDI events.

When we use the word "Loudness", we refer to the intensity of the sound that characterizes a single note. Normally, a sound event produced by a real instrument can be described by an envelope diagram representing the evolution of the sound over the time it is uttered. Using a MIDI instrument, however, each note can be characterized by a single value representing its loudness, that is to say its Key Velocity.

When we use the word "Tempo", we refer to the metrical length of time of each quarter note. In order to define it, it is essential to have a reference measure that leads to a unit of length for metrical time (the tick, or pulse). This reference can be obtained by fixing the relationship between the physical time length (in ms) and the metrical time length for a note with a known metrical definition. This value is known as Pulses Per Quarter Note (PPQN) and establishes the number of pulses, or ticks, that correspond to the length of a note with a metrical definition of a quarter. PPQN can be deduced only if the score of the musical piece to be analyzed is known; it is for this reason that the system must also take into consideration a MIDI file that codifies a mechanical performance (score) of the piece.

The last quantity that must be defined is the "Legato" between two consecutive notes. It can be represented as the ratio between the metrical length of a note and the metrical length until the next note begins (the Inter Onset Interval, IOI).
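As a concrete illustration of these definitions, the sketch below computes Key Velocity, a local Tempo and Legato for each note from a list of matched score/performance note events. The data layout and function names are hypothetical, and deriving the local Tempo from the ratio of the physical to the metrical inter-onset interval is an assumption consistent with the definitions above, not the actual ComputeLaws block.

```python
# Sketch of per-note Key Velocity, local Tempo and Legato from matched
# score/performance note events. Data layout and the derivation of the local
# Tempo from the physical vs. metrical inter-onset interval are assumptions.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Note:
    onset_ms: float     # physical onset time of the performed note
    duration_ms: float  # physical duration (key-down time)
    velocity: int       # MIDI Key Velocity (the Loudness of the note)
    onset_ticks: int    # metrical onset of the same note in the score file
    ppqn: int = 480     # Pulses Per Quarter Note declared by the score file

def expressive_quantities(notes: List[Note]) -> List[Tuple[int, float, float]]:
    """Return (key_velocity, tempo_ms_per_quarter, legato) per note.

    The last note is skipped because Tempo and Legato need the next onset."""
    out = []
    for cur, nxt in zip(notes, notes[1:]):
        ioi_ms = nxt.onset_ms - cur.onset_ms                    # physical IOI
        ioi_quarters = (nxt.onset_ticks - cur.onset_ticks) / cur.ppqn
        tempo = ioi_ms / ioi_quarters                           # ms per quarter note
        # Under a locally steady tempo, duration/IOI in physical time equals
        # the ratio of metrical lengths used in the Legato definition.
        legato = cur.duration_ms / ioi_ms
        out.append((cur.velocity, tempo, legato))
    return out
```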
The system was developed using the software environment EyesWeb, a product of the Music and Informatics Lab of the University of Genoa [5]. It is a graphical environment designed for the development of projects based on the analysis and processing of multimedia data streams and on the creation of audio/video interactive applications. The graphical interface is quite similar to that of other software environments such as PD or Max, in which the user builds up an application by connecting blocks from a library into a patch. In our work we developed a number of new blocks that implement dedicated functions and then connected them into a patch for the real-time analysis of a performance.

The analysis process carried out by the system is based on the comparison between the values of Key Velocity, Tempo and Legato of each note of a pre-recorded neutral performance of the musical piece under consideration and the values of the same quantities in an expressive performance of the same piece, played in real time. The result of this comparison is the computation of the various sets of K-coefficients related to the three expressive acoustic quantities. Each set is considered as a set of rules inferred from the two musical performances according to the PPS model; these coefficients indicate how the three expressive acoustic quantities of each note of the neutral performance should be modified in order to obtain something that sounds like the expressive one. Finally, the system uses these sets of K-coefficients to decide, note by note, which expressive intention characterizes the piece played in real time.

The analysis system can thus be divided into three parts: the Neutral Analyzer, the Real Time Analyzer and the Expressiveness Recognizer, as depicted in fig. 2. It is immediately apparent that some operations are carried out in parallel and others in sequence.

Fig. 2: The three parts composing the patch that realizes the system.

The Neutral Analyzer sub-patch is the set of modules in the upper left rectangle of fig. 2. It is dedicated to the computation of the expressive profiles related to the neutral performance: two data streams reach the module called ComputeLaws (2 inputs, 1 output), the first one related to the score and the second one related to the neutral performance. In this module, the three quantities Tempo, Legato and Key Velocity are computed for each note of the neutral performance. This stream of computed data then reaches the last module of the sub-patch, called ComputeAverage (1 input, 2 outputs), where the incoming values are processed. This module outputs two data streams that we call expressive profiles: the first one is the succession of the average values of the expressive quantities received in input (the average is computed over a mobile, user-definable time window); the second one is the succession of the deviations from that average.
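The following minimal sketch mimics what the ComputeAverage stage produces for a single quantity: a moving average over a user-definable window and the deviation of each incoming value from that average. The windowing policy (the last few values, current one included) and the function name are assumptions for illustration, not the EyesWeb implementation.

```python
# Sketch of the two "expressive profiles" for one quantity (e.g. Tempo):
# a moving average over a user-definable window and the deviation from it.
# The windowing policy (last `window` values, current one included) is an
# assumption, not necessarily the one used by the ComputeAverage block.

from collections import deque

def expressive_profiles(values, window=8):
    """Yield an (average, deviation) pair for each incoming value."""
    buf = deque(maxlen=window)
    for v in values:
        buf.append(v)
        avg = sum(buf) / len(buf)
        yield avg, v - avg

# Example with illustrative local-tempo values (ms per quarter note).
tempos = [500, 510, 495, 520, 530, 525, 515, 505]
for avg, dev in expressive_profiles(tempos, window=4):
    print(f"avg={avg:7.2f}  dev={dev:+7.2f}")
```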

The Real Time Analyzer sub-patch is the set of modules that lies below the Neutral Analyzer in fig. 2. This block performs a task similar to that of the previous sub-patch: here, however, the expressive quantities Tempo, Legato and Key Velocity are computed for each sound event of the expressive performance, which is received as a real-time stream of MIDI events.

The final sub-patch, on the right of the picture, is the Expressiveness Recognizer. It performs two separate tasks: the computation of the K-coefficients related to the three expressive quantities, and the recognition of the expressive characteristics of the performance played in real time. The first task is split into three parallel lines of computation: the K-coefficients relative to each expressive quantity are computed independently of one another, so each of the three modules in the second column of the sub-patch is dedicated to the estimation of the K-coefficients of one expressive acoustic quantity. Consider one of these quantities (say, Tempo): the module receives at its first input the values of the average profile of the neutral performance for that quantity, at its second input the values of the deviation profile for the same quantity, and at its third input the values of the profile of the real-time expressive performance. Each of these three modules implements a multiple linear regression: by putting the values at inputs 1 and 2 into a mathematical relationship, the module derives an estimate of the values received at the third input. This is done by forming a linear combination of the values coming from the first and second inputs, where the coefficients of the relationship (the K-coefficients) are estimated according to an optimum criterion (the Residual Sum of Squares) over a user-definable window. In this way, for each triple of inputs received, each module returns a pair of K-coefficients as output.

The three pairs of K-coefficients, one for each expressive acoustic quantity, are then received as input by the ExpRecognizer module (3 inputs, 1 output). This module is designed to work with the data of a configurable Perceptual Parametric Space, codified in a text file that is loaded and decoded beforehand. In this way, the module can compute the distance of each set of six K-coefficients received in input from the sets of K-coefficients of each expressive intention defined in the PPS under consideration. The module then chooses the expressive intention whose set of K-coefficients is closest to the one received in input, giving an estimate of the expressive character of the performance played at that precise moment.
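The sketch below illustrates both steps under stated assumptions: each pair of K-coefficients is obtained by a least-squares fit (Residual Sum of Squares criterion) of the expressive profile as a linear combination of the neutral average and deviation profiles over one analysis window, and the intention is then chosen as the PPS entry whose six reference K-coefficients are closest to the six estimated ones. The use of a plain Euclidean distance, all numeric values and the function names are illustrative assumptions, not the EyesWeb modules themselves.

```python
# Sketch of the Expressiveness Recognizer: (1) estimate one pair of
# K-coefficients per quantity by least squares over an analysis window, so
# that expressive[n] ~ k1 * neutral_avg[n] + k2 * neutral_dev[n] (Residual
# Sum of Squares criterion); (2) pick the intention whose six reference
# K-coefficients are closest to the six estimated ones. The Euclidean
# distance and all numeric values are illustrative assumptions.

import numpy as np

def estimate_k_pair(neutral_avg, neutral_dev, expressive):
    """Least-squares estimate of (k1, k2) over one analysis window."""
    A = np.column_stack([neutral_avg, neutral_dev])
    k, *_ = np.linalg.lstsq(A, np.asarray(expressive, dtype=float), rcond=None)
    return k                                   # array([k1, k2])

def recognize(k_tempo, k_legato, k_velocity, pps):
    """Return the PPS intention closest to the six estimated K-coefficients."""
    estimated = np.concatenate([k_tempo, k_legato, k_velocity])
    return min(pps, key=lambda name: np.linalg.norm(estimated - pps[name]))

# Hypothetical PPS reference sets: six K-coefficients per intention.
pps = {
    "hard":  np.array([1.1, 0.9, 0.8, 1.2, 1.3, 1.0]),
    "soft":  np.array([0.9, 1.1, 1.2, 0.8, 0.7, 1.0]),
    "heavy": np.array([0.8, 1.3, 1.1, 1.0, 1.2, 1.1]),
    "light": np.array([1.2, 0.8, 0.7, 1.1, 0.8, 0.9]),
}

# Toy example for one window of four notes (placeholder profile values).
k_t = estimate_k_pair([500, 505, 498, 502], [-5, 0, -7, 2], [520, 540, 515, 535])
k_l = estimate_k_pair([0.90, 0.85, 0.95, 0.90], [0.02, -0.03, 0.05, 0.00],
                      [0.70, 0.65, 0.75, 0.70])
k_v = estimate_k_pair([64, 66, 62, 65], [-1, 1, -3, 0], [90, 96, 88, 94])
print(recognize(k_t, k_l, k_v, pps))
```

In the real system this estimation and matching is repeated as the analysis window slides, so the recognized intention can change note by note.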
3 Validation

The model has been tested with the help of four professional pianists. They were asked to play the same musical piece according to five different expressive intentions, including a first neutral one (without expression). The musical piece chosen for the test was an excerpt from W. A. Mozart's sonata K. 545. The piece was first codified into a score MIDI file. Then each of the four musicians performed and recorded into another MIDI file his neutral version of the piece (according to his personal idea of neutral). Finally, each of the four pianists performed the piece according to four different expressive intentions: hard, soft, heavy and light.

As mentioned above, the system associates each single sound event with a particular expressive character. Fig. 3 and fig. 4 show two examples. They were obtained from the analysis of the heavy (pesante) and light (leggero) performances of two of the musicians involved, and they show the percentage of notes that the system recognized as played with each of the different expressive intentions. As can be seen, in both cases the system was able to recognize a large percentage of sound events as played according to the expressive content of the performance. In some cases the system showed a certain degree of misclassification; this proved to be due to an incorrect tuning of the values of the parameters defined for the PPS under consideration. By tuning the values of the K-coefficients of the points of the PPS used as a reference, the degree of recognition of the various performances always improves.

Fig. 3: Recognition graph for the heavy (pesante) performance.

Fig. 4: Recognition graph for the light (leggero) performance.

4 Conclusions

The automatic analysis of musical expressiveness carried out in this work is based on previous studies that resulted in the definition of a model for the recognition and classification of different expressive intentions: the Perceptual Parametric Space [1][2]. By means of this model, a system was developed to analyze, automatically and in real time, the expressive content of a piano performance. The software was validated with the help of professional pianists and showed good recognition of the different expressive intentions of the performers.

Acknowledgments

This work has been partially funded by the EU Project IST 20410 MEGA (Multisensory Expressive Gesture Applications).

References

[1] Canazza S., Rodà A. (1999). "Adding Expressiveness in Musical Performance in Real Time". Proceedings of the AISB'99 Symposium on Musical Creativity, pp. 134-139, Edinburgh, April 1999.
[2] Canazza S., De Poli G., Drioli C., Rodà A., Vidolin A. (2000). "Audio morphing different expressive intentions for Multimedia Systems". IEEE Multimedia, 7(3), July-September, pp. 79-83.
[3] Friberg A. (1991). "Generative Rules for Music Performance: A Formal Description of a Rule System". Computer Music Journal, 15(2), pp. 56-71.
[4] Arcos J. L., López de Mántaras R., Serra X. (1998). "SaxEx: A Case-Based Reasoning System for Generating Expressive Musical Performance". Journal of New Music Research, 27(3), pp. 194-210.
[5] Camurri A., Coletta P., Peri M., Ricchetti M., Ricci A., Trocca R., Volpe G. (2000). "A real-time platform for interactive performance". Proceedings of ICMC 2000, Berlin, pp. 374-379.
[6] Widmer G. (2001). "Inductive Learning of General and Robust Local Expression Principles". Proceedings of the International Computer Music Conference (ICMC 2001), La Habana, Cuba.