HOME CONDUCTING - CONTROL THE OVERALL MUSICAL EXPRESSION WITH GESTURES

Anders Friberg
KTH - Royal Institute of Technology
Speech, Music and Hearing
Stockholm, Sweden
andersf@speech.kth.se

ABSTRACT

In previous computer systems for "conducting" a score, the control is usually limited to tempo and overall dynamics. We suggest a home conducting system that allows indirect control of the expressive musical details at the note level. In this system, the expressive content of human gestures is mapped onto semantic expressive descriptions. These descriptions are then mapped to performance rule parameters using a real-time version of the KTH rule system for music performance. The resulting system is intuitive and easy to use also for people lacking formal musical education, making it a tool for the listener rather than for the professional performer.

1. INTRODUCTION

Several systems for conducting an electronic score have been suggested in the past. The Radio Baton, developed by Max Mathews, was one of the first such systems and is still used both for conducting a score and as a general controller [1]. A recent system controlling both audio and video is the Personal Orchestra and its further development You're the Conductor [2]. These systems include a gesture recognition system connected to a player algorithm that either generates the music directly (MIDI to synthesizer) or transforms existing audio recordings. The parameters that can be controlled are instantaneous tempo and dynamics.

In a traditional setting, the conductor expresses overall aspects of the performance with gestures, and the musicians interpret these gestures and fill in the musical details. Thus, by merely controlling overall tempo and dynamics, many of the finer details will inevitably be static and out of control. One example is articulation. Articulation is important for setting the gestural and motional quality of the performance but cannot be applied on an average basis: the amount of articulation (staccato) is set on a note-by-note basis depending on melodic line and grouping [10]. This makes it too difficult for a conductor to control directly.

How can the finer details of an electronic score be controlled more flexibly? The KTH rule system implemented in Director Musices contains a set of rules for rendering music with different expressions. These rules cover many performance aspects such as phrasing, articulation, microtiming variations, rhythmic patterns, and intonation. One obvious approach to improving the performance of an electronic score would then be to pre-process the score with the KTH rule system before conducting it. This has been tested with good results using the Radio Baton system [3]. The best results were obtained using only rules affecting the micro-qualities at the note level, thus leaving all overall tempo and dynamics variations to the conductor. However, with the recent addition of the real-time version pDM, the rule system is for the first time available for direct gestural control [4]. Thus, we have a system that allows finer control of the performance details than was previously available. Another feature of the pDM implementation is that it provides mappings from high-level semantic descriptions, such as emotions, to the parameters of the rule system.
Tools and models for analysing gestures in terms of semantic descriptions of expression have recently been developed within the EU project MEGA [5]. Thus, by connecting such a gesture analyzer to pDM we obtain a complete system for controlling the overall expressive features of a score.

Recognition of emotional expression in music has been shown to be an easy task for most listeners, including children from about 6 years of age, even without any musical training, see e.g. [6]. Therefore, by using simple high-level emotion descriptions such as happy, sad, and angry, the system has the potential to be intuitive and easily understood by most users, including children. Thus, we envision a system that is used by listeners in their homes rather than by performers on stage. Our main design goals have been a system that is (1) easy and fun to use for novices as well as experts, and (2) realized on standard equipment using modest computer power. In the following we describe the system in more detail, starting with the synthesis engine, followed by the gesture analysis, and finally the complete home conductor system.

2. MUSIC RENDERING

2.1. KTH Rule System for Music Performance

The KTH rule system is the result of a long-term research project initiated by Johan Sundberg, see e.g. [7, 8, 9, 10]. It contains about 30 rules that transform a score into a musical performance. The basic rule set covers many performance aspects used by musicians, such as different types of phrasing, accents, timing patterns and intonation. In addition, high-level expressive descriptions such as emotions have been modelled by combining sets of rules and rule parameters [11].

Director Musices (DM) is the main implementation of the rule system; it is a stand-alone Lisp program available for most platforms [12]. The Lisp environment has greatly facilitated the development of the rules but is not suitable for real-time control.

2.2. pDM - Real Time Rule Control

In order to achieve real-time control of the rules we use a two-step approach. The rules are first applied to the score in DM, producing an enhanced score containing all the possible rule-induced variations of the performance parameters. This new score is then played by an application written in pd (pure-data) [13]. There are several advantages to this approach. First of all, it was not necessary to reimplement the rule system - a long and tedious process, since each rule needs to be verified on several musical examples. It also avoids splitting the rule system into two different implementations to support. The first prototype following this approach was made by Canazza et al. [14] using the EyesWeb platform [15]. The current pd implementation, called pDM, is described in more detail in [4].

2.2.1. Rule application

In the first step, the rules are applied to the score. Most of the relevant rules defined in DM are applied with their default quantities. Each rule is applied to the original score and normalized with respect to overall length and dynamics. The deviations in the performance parameters tempo (DT), sound level (DSL), and articulation (DART) are collected for each note. The original score together with all the individual rule deviations is stored in a custom score file format, see Figure 1.

    0   DT   -0.05 -0.03 -0.06 0 0.01 0.04 0.08 0.08 -0.03 ...;
    0   DSL  -0.5 -0.4 -0.4 0 0.1 0.3 0.6 -1.2 -0.4 -1.1 0.1;
    0   DART -47 0 0 0 -120;
    0   NOTE 55 10 481;
    481 DT   -0.04 0.01 0.01 0 0.01 0.04 0.08 0.08 ... 0;
    0   DSL  -0.5 0 0.4 0 0.1 0.3 0.6 -0.7 0 -1.1 -0.2;
    0   DART 0 0 0 0 -40;
    0   NOTE 57 10 160;

Figure 1. An example of the score format used. Each line is a command: the first item is the delta time, the second item is the command, and the remaining items are data. The NOTE command specifies the nominal values and the preceding commands (DT, DSL, DART) specify the rule deviations in the three performance parameters.

2.2.2. pDM player

In the second step, the produced score is loaded into pDM, which is essentially an extended sequencer. Since all rule knowledge is kept in DM, the structure of pDM is quite simple; it is written in pd-extended v. 0.37 using only the provided libraries. The sequencer is centered around the qlist object in pd. qlist is a text-based sequencer allowing any data provided in an input file to be executed in time order. During playback, each of the three basic score parameters tempo (T_nom), sound level (SL_nom) and duration (DUR_nom) is modified using a weighting factor k_i for each rule:

    T = T_nom · (1 + Σ_i k_i ΔT_i)       (1)
    SL = SL_nom + Σ_i k_i ΔSL_i          (2)
    DUR = DUR_nom + Σ_i k_i ΔART_i       (3)

where T, SL and DUR are the resulting tempo, sound level and duration; i is the rule number, k_i is the weighting factor for the corresponding rule, and ΔT_i, ΔSL_i and ΔART_i are the rule deviations given in the pDM score file.

According to the formulas above, the effect of several simultaneous rules acting on the same note is additive. This might lead to a note receiving too large or contradictory deviations. In practice this is not a big problem, and some rule interaction effects are already compensated for in the DM rule application. For example, the Duration contrast rule (shortening of relatively short notes) is not applied where the Double duration rule would be applied and lengthen relatively short notes.
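As an illustration of the two-step approach, the following Python sketch shows how a score in the Figure 1 format could be read and how equations (1)-(3) combine the nominal values with the weighted rule deviations. It is a minimal sketch, not the DM/pDM code: the function names and the assumed data layout are ours; only the command names and the formulas come from the text above.

```python
# Minimal sketch (not the actual DM/pDM code): reading the Figure 1 score format
# and applying equations (1)-(3). The command names DT, DSL, DART and NOTE follow
# the figure; the function names and data layout are assumptions.

def parse_pdm_score(text):
    """Parse lines of the form '<delta> <COMMAND> <data ...>;' into event dicts."""
    events = []
    for line in text.strip().splitlines():
        parts = line.strip().rstrip(';').split()
        if not parts:
            continue
        events.append({'delta': int(parts[0]),
                       'cmd': parts[1],
                       'data': [float(x) for x in parts[2:]]})
    return events

def apply_rules(nominals, deviations, k):
    """Equations (1)-(3) for one note.

    nominals   -- (T_nom, SL_nom, DUR_nom)
    deviations -- dict with one deviation per rule: {'DT': [...], 'DSL': [...], 'DART': [...]}
    k          -- rule weights k_i, one per rule
    """
    t_nom, sl_nom, dur_nom = nominals
    T = t_nom * (1 + sum(ki * d for ki, d in zip(k, deviations['DT'])))    # eq (1)
    SL = sl_nom + sum(ki * d for ki, d in zip(k, deviations['DSL']))       # eq (2)
    DUR = dur_nom + sum(ki * d for ki, d in zip(k, deviations['DART']))    # eq (3)
    return T, SL, DUR
```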
[Figure 2: screenshot of the pDM rule control window, with one slider per rule (Phrase arch, Phrase ritardando, Final ritard, Punctuation, High loud, Melodic charge, Harmonic charge, Duration contrast, Inegales, Double duration, Repetition articulation, Score legato/staccato articulation, Overall articulation) and overall scaling sliders for tempo and sound level.]

Figure 2. pDM window for controlling individual rule parameters. The sliders to the left control the overall amount of each rule (the k_i values).

These performance parameters are computed just before a note is played. In this way, there is no perceived delay from the real-time input, since all control changes take effect at the next played note. Figure 2 shows the window for individual rule control in which the rule weights can be manipulated.
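The just-in-time computation can be pictured with a small player-loop sketch: the rule weights k_i are read immediately before each note is scheduled, so any slider or mapper change is heard at the next note. This is an assumed structure for illustration, not the pd/qlist implementation; it reuses apply_rules from the sketch above, and send_note is a placeholder for the sound output.

```python
# Minimal sketch of the playback idea: sample the current rule weights k just
# before each note, so control changes take effect at the next played note.
import time

k = [0.0] * 20  # rule weights k_i; may be changed at any time, e.g. from the sliders

def send_note(sound_level, duration_ms):
    pass  # placeholder for the MIDI/synthesizer output

def play(score):
    """score: list of (ioi_ms, nominals, deviations) tuples (assumed layout)."""
    for ioi_ms, nominals, deviations in score:
        T, SL, DUR = apply_rules(nominals, deviations, list(k))  # sample k just before the note
        send_note(SL, DUR)
        time.sleep(ioi_ms / 1000.0 / T)  # the tempo factor scales the time to the next note
```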

2.2.3. pDM Expression mappers

pDM contains a set of mappers that translate high-level expression descriptions into rule parameters. We have mainly used emotion descriptions (happy, sad, angry, tender), but other descriptions such as hard, light, heavy and soft have also been implemented. The emotion descriptions have the advantage that there is substantial research describing the relation between emotions and musical parameters [17, 11]. Also, these basic emotions are easily understood by laymen. Typically, these kinds of mappers have to be adapted to the intended application, as well as to whether the controller is another computer algorithm or a gesture interface. Usually there is a need for interpolation between the descriptions. One option implemented in pDM is to use a 2D plane in which each corner is specified in terms of a set of rule weightings corresponding to a certain description. When moving in the plane, the rule weightings are interpolated in a semi-linear fashion. This 2D interface can easily be controlled directly with the mouse. In this way, the well-known Activity-Valence space can be realized by assigning happy, sad, angry, and tender to the four corners of the space. An installation in which the user can change the emotional expression of the music while it is playing is currently part of the exhibition "Se Hjärnan", touring Sweden for two years.

3. GESTURE RECOGNITION

A large number of different gesture controllers have been developed for musical purposes [15]. We have tried to keep our setup as simple as possible, both technically and practically for the user. Therefore we use a small video camera (webcam) as the input device, analysed by a robust and simple motion detection algorithm.

3.1. Gesture cue extraction

The video signal is analyzed with the tools for gesture recognition developed at the University of Genova, included in the software platform EyesWeb [16, 18]. The first step is to compute the difference signal between video frames. This is a simple and convenient way of removing all background (static) information in the picture. Thus, there is no need to worry about special lighting, clothes or background content. A number of advanced tools are available, for example for tracking the individual limbs of a human figure. Again, for simplicity, we have used a more limited set of tools. The analyzed gesture parameters (cues) have been the overall quantity of motion (QoM), the x and y position of the overall motion, and the size and velocity of horizontal and vertical gestures.

3.2. Mapping gesture to expression

The simplest use of the analyzed gesture cues would be to connect them directly to the rendering engine. Alternatively, one can map the cues to semantic expressive descriptions. Such a mapping was modeled using fuzzy techniques for predicting emotional expression. It uses fuzzy set functions divided into three regions to classify each cue as high, medium or low. An emotion prediction is then obtained by computing an average over a selection of the fuzzy set outputs. The output for each emotion is a continuously varying value between 0 and 1, depending on how many of the cues are in the right region. The fuzzy mapper can be constructed directly from qualitative data found in previous experiments. Thus, no training of the system is needed, as is commonly required in data-driven approaches (e.g. neural networks or Bayesian classification). A detailed description is given in [19].
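The fuzzy mapping can be sketched as follows: each cue is classified as low, medium or high with overlapping membership functions, and an emotion value between 0 and 1 is obtained by averaging a selection of those memberships. The cue names, breakpoints and cue-to-emotion selections below are illustrative assumptions only; the actual model is the one described in [19].

```python
# Minimal sketch of the fuzzy mapping idea (illustrative assumptions, not the
# model in [19]): classify each cue as low/medium/high and average selected
# memberships per emotion, giving values between 0 and 1.

def memberships(x, lo, hi):
    """Fuzzy low/medium/high memberships of x on a scale from lo to hi."""
    t = min(max((x - lo) / (hi - lo), 0.0), 1.0)   # normalize to 0..1
    low = max(1.0 - 2.0 * t, 0.0)
    high = max(2.0 * t - 1.0, 0.0)
    medium = 1.0 - low - high
    return {'low': low, 'medium': medium, 'high': high}

def map_to_emotions(qom, speed):
    """Map two gesture cues to happy/sad/angry values in 0..1 (illustrative only)."""
    m_qom = memberships(qom, 0.0, 1.0)     # overall quantity of motion
    m_spd = memberships(speed, 0.0, 1.0)   # gesture speed
    return {
        'happy': (m_qom['high'] + m_spd['medium']) / 2,
        'sad':   (m_qom['low']  + m_spd['low'])    / 2,
        'angry': (m_qom['high'] + m_spd['high'])   / 2,
    }
```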
4. HOME CONDUCTOR SYSTEM

By connecting the tools described above we obtain the complete home conductor system, as outlined in Figure 3.

[Figure 3: block diagram - Expressive gestures -> Gesture cue extraction -> Motion data -> Mapper -> High-level expression -> Mapper -> Rule parameters -> pDM (reading the score) -> Tone instructions -> MIDI synthesizer -> Sound.]

Figure 3. Overall schematic view of a home conductor system.
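One plausible way to glue the pieces of Figure 3 together in code is sketched below: gesture cues are mapped to emotion values with the fuzzy mapper sketch above, and these values are then used to blend per-emotion rule-weight vectors into a single k vector for the player. The blending scheme, the cue dictionary and the weight tables are our assumptions, not the pDM mappers.

```python
# Minimal sketch of the chain in Figure 3, composed from the earlier illustrative
# sketches; the blending scheme and the per-emotion weight tables are assumptions.

def conduct_step(cues, k_for_emotion):
    """Gesture cues -> emotion values -> one blended rule-weight vector for the player."""
    emotions = map_to_emotions(cues['qom'], cues['speed'])   # fuzzy mapper sketch above
    total = sum(emotions.values()) or 1.0
    n_rules = len(next(iter(k_for_emotion.values())))
    # Weight each emotion's rule-weight vector by its predicted strength.
    return [sum(e * k_for_emotion[name][i] for name, e in emotions.items()) / total
            for i in range(n_rules)]

# Example use, with hypothetical weight tables of 20 rules per emotion:
# k = conduct_step({'qom': 0.8, 'speed': 0.6},
#                  {'happy': [1.0] * 20, 'sad': [0.2] * 20, 'angry': [1.5] * 20})
```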

The selection of cues and mappings has so far only been described in broad terms. The exact selection depends on the desired application and user ability. Three levels of interaction can be envisioned:

Level 1 (listener level). The musical expression is controlled in terms of basic emotions (happy, sad, angry). This creates an intuitive and simple musical feedback, comprehensible without any particular musical knowledge.

Level 2 (simple conductor level). Basic overall musical features are controlled using, for example, the energy-kinematics space previously found relevant for describing musical expression [20].

Level 3 (advanced conductor level). The overall expressive musical features or emotional expressions of levels 1 and 2 are combined with the explicit control of each beat, similar to the Radio Baton system.

Using several interaction levels makes the system suitable for novices and children as well as expert users. Contrary to traditional instruments, this system may "sound good" even for a beginner using a lower interaction level. It can also challenge the user to practice in order to master the higher levels, similar to the challenge provided by computer games.

A few complete prototypes have been assembled and tested for levels 1 and 2. The one we have used the most is very simple but effective in terms of the gesture interface. It uses two cues from the video analysis: (1) the overall quantity of motion (QoM), computed as the total number of visible pixels in the difference image, and (2) the vertical position, computed as the center of gravity of the visible pixels in the difference image. These two cues are mapped directly to the Activity-Valence space above, with QoM connected to Activity (high QoM - high Activity) and vertical position connected to Valence (high position - positive Valence). This interface has been demonstrated several times with very positive responses. However, formal testing in the form of usability studies is planned as future work.
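The cue extraction and mapping of this prototype can be sketched directly from the description above: count the pixels that change between frames (QoM), take their vertical center of gravity, and map these two values to Activity and Valence. The threshold, normalization and image layout below are assumptions.

```python
# Minimal sketch of the level-2 prototype mapping: QoM and the vertical center of
# gravity of the difference image are mapped to Activity and Valence. The
# threshold, normalization and image layout (row 0 at the top) are assumptions.

def activity_valence(prev_frame, frame, threshold=30):
    """frames: 2D lists of greyscale values, same size, row 0 at the top."""
    height, width = len(frame), len(frame[0])
    visible = [(x, y)
               for y in range(height) for x in range(width)
               if abs(frame[y][x] - prev_frame[y][x]) > threshold]  # difference image
    if not visible:
        return 0.0, 0.0
    qom = len(visible) / (width * height)              # quantity of motion, 0..1
    y_cog = sum(y for _, y in visible) / len(visible)  # vertical center of gravity
    activity = qom                                     # high QoM -> high Activity
    valence = 1.0 - y_cog / (height - 1)               # high position -> positive Valence
    return activity, valence
```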
The implementation of pDM and the gesture analysis is done using free software. The required computing power is rather modest and the whole system runs on an ordinary PC with about a 1 GHz clock rate.

This work was partially supported by the Swedish Foundation for International Cooperation in Research and Higher Education (STINT).

5. REFERENCES

[1] Mathews, M. V. "The Conductor Program and the Mechanical Baton." In M. Mathews & J. Pierce (Eds.), Current Directions in Computer Music Research. Cambridge, Mass.: The MIT Press, (pp. 263-282), 1989.

[2] Lee, E., Nakra, T. M., and Borchers, J. "You're the Conductor: A realistic interactive conducting system for children." Proceedings of NIME 2004, (pp. 68-73), 2004.

[3] Mathews, M. V., Friberg, A., Bennett, G., Sapp, C. and Sundberg, J. "A marriage of the Director Musices program and the conductor program." In R. Bresin (Ed.), Proceedings of the Stockholm Music Acoustics Conference 2003, Vol. I, (pp. 13-16), 2003.

[4] Friberg, A. "pDM: an expressive sequencer with real-time control of the KTH music performance rules." Manuscript submitted for publication, 2005.

[5] Camurri, A., De Poli, G., Friberg, A., Leman, M., and Volpe, G. "The MEGA project: analysis and synthesis of multisensory expressive gesture in performing art applications." Journal of New Music Research, in press.

[6] Peretz, I. "Listen to the brain: a biological perspective on musical emotions." In P. N. Juslin & J. A. Sloboda (Eds.), Music and Emotion: Theory and Research, Oxford: Oxford University Press, (pp. 105-134), 2001.

[7] Sundberg, J., Askenfelt, A. and Fryden, L. "Musical performance: A synthesis-by-rule approach." Computer Music Journal, 7:37-43, 1983.

[8] Sundberg, J. "How can music be expressive?" Speech Communication, 13:239-253, 1993.

[9] Friberg, A. "Generative Rules for Music Performance: A Formal Description of a Rule System." Computer Music Journal, 15(2):56-71, 1991.

[10] Friberg, A. and Battel, G. U. "Structural Communication." In R. Parncutt & G. E. McPherson (Eds.), The Science and Psychology of Music Performance: Creative Strategies for Teaching and Learning. (pp. 199-218) New York: Oxford University Press, 2002.

[11] Bresin, R. and Friberg, A. "Emotional Coloring of Computer-Controlled Music Performances." Computer Music Journal, 24(4):44-63, 2000.

[12] Friberg, A., Colombo, V., Fryden, L. and Sundberg, J. "Generating Musical Performances with Director Musices." Computer Music Journal, 24(3):23-29, 2000.

[13] Puckette, M. "Pure Data." Proceedings of the 1996 International Computer Music Conference, (pp. 269-272), 1996.

[14] Canazza, S., Friberg, A., Rodà, A., and Zanon, P. "Expressive Director: a system for the real-time control of music performance synthesis." In R. Bresin (Ed.), Proceedings of the Stockholm Music Acoustics Conference 2003, Vol. II, (pp. 521-524), 2003.

[15] Wanderley, M. and Battier, M. (Eds.) Trends in Gestural Control of Music. IRCAM - Centre Pompidou, 2000.

[16] Camurri, A., Hashimoto, S., Ricchetti, M., Trocca, R., Suzuki, K., Volpe, G. "EyesWeb - Toward Gesture and Affect Recognition in Interactive Dance and Music Systems." Computer Music Journal, 24(1):57-69, 2000.

[17] Juslin, P. N. and Sloboda, J. A. (Eds.) Music and Emotion: Theory and Research, Oxford: Oxford University Press, 2001.

[18] Camurri, A., Mazzarino, B., Volpe, G. "Analysis of Expressive Gesture: The EyesWeb Expressive Gesture Processing Library." In A. Camurri, G. Volpe (Eds.), Gesture-based Communication in Human-Computer Interaction, LNAI 2915, Springer Verlag, 2004.

[19] Friberg, A. "A fuzzy analyzer of emotional expression in music performance and body motion." In Proceedings of Music and Music Science, Stockholm, 2005.

[20] Canazza, S., De Poli, G., Rodà, A. and Vidolin, A. "An abstract control space for communication of sensory expressive intentions in music performance." Journal of New Music Research, 32(3):281-294, 2003.