Recognition, Analysis and Performance with Expressive Conducting Gestures

Paul Kolesnik and Marcelo Wanderley
Sound Processing and Control Lab, Music Technology, McGill University
pkoles@music.mcgill.ca

Abstract

Although a number of conducting gesture analysis and following systems have been developed over the years, most projects have either concentrated primarily on tracking tempo- and amplitude-indicating gestures, or implemented individual mapping techniques for expressive gestures that varied from project to project. There is a clear need for a uniform process that can be applied to the analysis of both indicative and expressive gestures. The proposed system provides a set of tools with extensive functionality for identification, classification and performance with conducting gestures. The gesture recognition procedure is designed on the basis of the Hidden Markov Model (HMM) process, and a set of HMM tools was developed for the Max/MSP software. Training and recognition procedures are applied to both right-hand beat- and amplitude-indicative gestures and left-hand expressive gestures. Continuous recognition of right-hand gestures is incorporated into a real-time gesture analysis and performance system in the Eyesweb and Max/MSP/Jitter environments.

1 Introduction

Conducting can be described as a way of controlling the performance of multiple instruments through one's physical gestures, without direct contact with the instruments themselves. In a conductor-musician interactive environment, the visual information perceived by the musicians conveys directions for the musical gestures that are created by the conductor's expressive physical gestures.

The first successful attempts to analyze conducting gestures with the help of a computer were made as early as 1980 with A Microcomputer-based Conducting System (Buxton et al. 1980), which built on previous research in music synthesis carried out by Moore, Matthews and collaborators in the Groove project and The Conductor Program (Matthews 1976). Following these works, a number of conducting recognition systems have been developed, listed here in chronological order: (Haflich and Burns 1983), (Keane and Gross 1989), (Matthews 1991), (Bertini and Carosi 1992), (Lee, Garnett, and Wessel 1992), (Tobey and Fujinaga 1996), (Marrin and Paradiso 1997), (Usa and Mochida 1998), (Ilmomen and Takala 1999), (Marrin 2000), (Segen, Mujumder, and Gluckman 2000), (Garnett et al. 2001), (Borchers, Samminger, and Muhlhauser 2002), (Murphy, Andersen, and Jensen 2003). These systems experimented with a number of different approaches to beat tracking and expressive gesture analysis that advanced the field of conducting gesture recognition. The two major advancements over the years were a move from MIDI-based to audio-based prerecorded musical score following for system performance (Borchers, Samminger, and Muhlhauser 2002), and a transfer from 2-dimensional to 3-dimensional positional analysis of gestures (Tobey and Fujinaga 1996).

Whereas the traditional school of orchestral conducting technique has developed a well-defined grammar of basic conducting gestures that can be used to set the required vocabulary for a recognition system (Rudolph 1994), the design of identification and recognition procedures for expressive gestures has remained one of the main issues in the field of computer-based conducting gesture recognition.
A temporal segmentation technique of the kind used for beat- and amplitude-indicating gestures cannot be applied to expressive gestures, since most expressive gestures do not contain clearly defined temporal transition points indicated by their positional boundaries and vary greatly in form.

2 System Overview

The proposed system was developed as a set of tools that can be applied to a range of tasks related to identification, classification and performance with conducting gestures. Gesture image tracking was done with inexpensive Logitech USB cameras that followed a colour glove worn by the user.
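In the system itself, the colour-blob tracking was performed by an Eyesweb patch, described below. Purely as an illustration of the underlying idea, the following sketch extracts the 2-D centroid of a glove-coloured blob from a webcam frame using Python and OpenCV, which the original system does not use; the HSV colour bounds, camera index and names are illustrative assumptions.

    # Illustrative sketch only: the real system used Eyesweb blob colour tracking.
    # This shows the same idea: follow a coloured glove and report its 2-D centroid.
    import cv2
    import numpy as np

    LOWER_HSV = np.array([100, 120, 80])    # assumed bounds for a blue glove
    UPPER_HSV = np.array([130, 255, 255])

    def glove_position(frame):
        """Return the (x, y) centroid of the glove-coloured blob, or None."""
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, LOWER_HSV, UPPER_HSV)
        m = cv2.moments(mask)
        if m["m00"] == 0:                   # no glove-coloured pixels in this frame
            return None
        return (m["m10"] / m["m00"], m["m01"] / m["m00"])

    cap = cv2.VideoCapture(0)               # front-view camera
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        pos = glove_position(frame)
        if pos is not None:
            print("glove at", pos)          # the real system sends such values to Max/MSP over OSC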

However, the system was also built to be compatible with higher-precision 6-DOF positional trackers for use in further research. Image acquisition and processing were handled by the Eyesweb freeware (Camurri et al. 2000) using blob colour tracking techniques. An Eyesweb patch extracted the positional coordinates of the colour glove and sent their values to the Max/MSP software over an OSC network connection.

The gesture analysis and performance systems were designed for conducting audio and video scores of an orchestral performance in real time. The primary purpose of the gesture analysis was to extract beat amplitude and beat transition points from the conducting gestures, based on the maxima and minima of their absolute positional values. The gesture performance system was responsible for mapping the identified beat transition points and beat amplitude values to modifications in the playback speed and volume of the audio score being conducted by the user.

Figure 1: Schematic representation of the gesture analysis, recognition and performance system.

To compute the adjustments between the user's beat indications and the audio playback speed, an audio stretching algorithm was used (Borchers, Samminger, and Muhlhauser 2002), (Murphy, Andersen, and Jensen 2003). Stretching/compression of the audio score was implemented with phase vocoder techniques. As an optional feature, a prerecorded video score of a McGill Symphony Orchestra performance was used simultaneously with the corresponding audio score. Video score tempo modification was done with Jitter environment objects.

2.1 HMM Package

Recognition of isolated and continuous gestures in the system was based on the Hidden Markov Model procedure. This statistical observation-sequence analysis process, widely known for its use in speech recognition (Deller, Hansen, and Proakis 2000), has also been used in score following (Orio and Dechelle 2001) and sign language gesture recognition (Vogler and Metaxas 1999) systems, and has been applied to right-hand beat conducting recognition in the MultiModal Conducting Simulator (Usa and Mochida 1998).

Since none of the existing external HMM objects written for the Max environment provided the functionality required for the system, an external HMM object was implemented in Max as the initial step of the project. The object was written as a representation of a discrete HMM and implements its three principal features: learning, finding an optimal sequence of states, and recognition. The sizes of the state and label vectors, as well as the type of HMM (left-to-right or ergodic), are passed as arguments to the object; left-to-right, 20-label, 5-state HMMs were used in all of the experiments described in this paper. Observation sequence recognition was implemented with the forward-backward algorithm, calculation of the optimal sequence of model states used the Viterbi algorithm, and model training was done through the Baum-Welch reestimation procedure. Logarithmic scaling techniques were used to avoid the computational range errors that can occur for longer observation streams due to the recursive nature of these processes. (A detailed overview of general HMM techniques can be found in (Rabiner and Huang 1986); the scaling procedure and other practical issues are described in (Huang, Ariki, and Jack 1990) and (Lien 1998).) The HMM object also provides features for storing, viewing, importing/exporting and editing HMM models. Two other external objects, written in support of the main HMM object, were responsible for calculating positional orientation and for vector quantization of the data stream.
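The HMM external itself is a Max/MSP object; as a language-neutral illustration of the evaluation step it performs during recognition, the following sketch computes the log-likelihood of a label sequence under a discrete HMM with the forward recursion. It works directly in the log domain rather than with the scaled-coefficient formulation mentioned above, and all parameter names are illustrative rather than taken from the actual external.

    # Sketch of discrete-HMM likelihood evaluation (forward algorithm) in log space.
    import numpy as np
    from scipy.special import logsumexp

    def log_forward(obs, log_pi, log_A, log_B):
        """Return log P(obs | model) for a discrete HMM.
        obs: sequence of label indices (e.g. 0..19 for a 20-label codebook);
        log_pi: (N,) initial state log-probabilities;
        log_A: (N, N) transition log-probabilities;
        log_B: (N, M) emission log-probabilities for the M codebook labels."""
        log_alpha = log_pi + log_B[:, obs[0]]          # initialise with the first label
        for o in obs[1:]:
            # summing in the log domain plays the role of the scaling techniques
            # described above: it avoids underflow on long observation streams
            log_alpha = logsumexp(log_alpha[:, None] + log_A, axis=0) + log_B[:, o]
        return logsumexp(log_alpha)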
2.2 Symbol Recognition

Symbol recognition was the initial system developed with the external HMM object in order to test its performance. Absolute 2-D positional coordinates extracted from the movement of an input source (mouse or Wacom tablet) were used to calculate orientation values along the horizontal axis. The resulting data stream was then passed to the vector quantization external object, which mapped the observation stream to the specifications of the codebook used by the HMM models.

Each of the HMM objects implemented in the system represented one isolated symbol to be recognized. At the learning stage, the HMM objects were individually trained with a number of symbol examples. At the recognition stage, an observation stream representing a symbol was passed to all of the HMM objects, and the one producing the highest probability was taken as the recognized symbol. At this initial stage, English alphabet symbols were successfully used for training and recognition by the system.
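As a sketch of the classification rule just described (not the Max/MSP implementation itself), the following fragment quantizes a 2-D trajectory into orientation labels and picks the symbol whose HMM yields the highest likelihood, reusing log_forward from the previous sketch. The 20-bin codebook size matches the 20-label models used in the experiments; the uniform angle codebook and all names are our assumptions.

    # Sketch: vector-quantize a trajectory into labels, then classify by maximum likelihood.
    import numpy as np

    def orientation_labels(xy, n_labels=20):
        """Map a trajectory of (x, y) points to direction-of-motion labels 0..n_labels-1."""
        d = np.diff(np.asarray(xy, dtype=float), axis=0)
        angles = np.arctan2(d[:, 1], d[:, 0])              # orientation of each step
        return ((angles + np.pi) / (2 * np.pi) * n_labels).astype(int) % n_labels

    def classify(xy, models):
        """models: dict mapping symbol name -> (log_pi, log_A, log_B) of a trained HMM."""
        obs = orientation_labels(xy)
        scores = {name: log_forward(obs, *m) for name, m in models.items()}
        return max(scores, key=scores.get)                 # symbol with the highest log-likelihood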

Furthermore, the HMM recognition procedure performed equally well with examples of words: in both letter and word recognition, the recognition rate was over 92.5%.

3 System Results

As the initial step of the gesture recognition experiment, the procedure used by the symbol recognition system was replicated using a single webcam to capture 2-D positional user input. The resulting recognition rates were similar to those obtained in the previous experiment with a mouse/tablet.

The system was then extended to accommodate the 3-D nature of expressive conducting gestures. A second USB camera, placed in profile view of the user (right or left, corresponding to the tracked hand), was used to capture additional positional information. Two separate HMM channels, front and profile, were responsible for gesture recognition. Each channel contained an equal number of HMM objects, corresponding to the number of gestures the system was intended to recognize. At the recognition stage, the probabilities of the corresponding pairs of objects were combined to obtain the final probability that determined the choice of identified gesture.

Figure 2: Front and profile camera views of the user training the recognition system with a left-hand crescendo-cutoff expressive gesture.

All of the conducting gestures used for the positional recordings were performed by a doctoral conducting student at the Music Faculty of McGill University. Five left-hand isolated expressive gestures were selected to be recognized (crescendo-cutoff, diminuendo-cutoff, tenuto gesture, marcato indication and wrist staccato gesture). For the right-hand beat-indicating gestures, two beat patterns were chosen: a four-beat legato pattern and a four-beat staccato pattern. A separate HMM object was used to represent each beat gesture of the two patterns. There were 20 training sets and 10 testing sets for each gesture, and the system performed with a 97.2% recognition rate.

The same right-hand beat-indicating gestures that were used for isolated gesture recognition were applied in the training stage of a continuous recognition process. Temporal segmentation of beats, performed during the continuous gesture identification process, used the data received from the gesture analysis section of the system (see Figure 1). The system was able to correctly identify the conducting gestures during real-time performance with a 94.6% recognition rate.

4 Future Work

One of the future goals of the project is to design a gesture recognition process that can be implemented in a continuous conducting movement environment in combination with the developed gesture analysis and performance system. Whereas the issue of temporal segmentation of a continuous gesture observation stream can easily be solved for right-hand beat-indicating gestures through the use of information extracted by another process (such as tracking the positional maxima and minima of the gestures), there is no simple way of using a similar technique for expressive gestures, since there is no clear, uniform indication of positional transitions between them. A solution to this problem lies in the capability of the HMM process to automatically segment an entire observation stream into isolated gesture states. This technique involves training HMM models separately with isolated gestures and then chaining the trained models together into a single network of states (Vogler and Metaxas 1999). The Viterbi algorithm can then be used on the entire observation stream, so that the temporal segmentation problem is reduced to computing the most probable path through the network (a sketch of this idea is given at the end of this section).

Upon completion of the continuous gesture recognition process, the eventual goal of the work will be to develop a classification library of conductors' gestures for computer conducting gesture recognition systems. This part of the project will address the need for a uniform set of conducting gesture definitions in terms of their positional information and their mappings to the music score. The proposed library will be based on the existing well-developed grammar of traditional conducting technique, and will be introduced as a standardized set of gesture definitions to be used for future research in the field of conducting gesture recognition. Positional 3-D recording of the library gestures will be done with the Vicon Motion Capture and Polhemus Liberty systems soon to be available at the McGill Music Technology Labs.
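The following fragment is a rough sketch of the segmentation idea outlined above, under the same assumptions as the earlier fragments: the trained per-gesture models are chained into a single network of states, the Viterbi algorithm is run over the whole observation stream, and gesture boundaries are read off the most probable state path. The inter-gesture transition weight and all names are illustrative and are not part of the described system.

    # Sketch: chain trained per-gesture HMMs into one network and segment a
    # continuous observation stream with the Viterbi algorithm.
    import numpy as np

    def viterbi_segment(obs, models, log_switch=np.log(0.01)):
        """models: list of (log_pi, log_A, log_B); returns a gesture index per frame."""
        sizes = [m[1].shape[0] for m in models]
        N, M = sum(sizes), models[0][2].shape[1]
        owner = np.repeat(np.arange(len(models)), sizes)   # network state -> gesture index
        # Build a block-diagonal transition matrix with weak links between gestures.
        log_A = np.full((N, N), -np.inf)
        log_pi = np.full(N, -np.inf)
        log_B = np.zeros((N, M))
        s = 0
        for g, (pi_g, A_g, B_g) in enumerate(models):
            n = sizes[g]
            log_A[s:s + n, s:s + n] = A_g
            log_pi[s:s + n] = pi_g + np.log(1.0 / len(models))
            log_B[s:s + n] = B_g
            for t in range(len(models)):                   # last state of this gesture may
                start = sum(sizes[:t])                     # jump to the start of any gesture
                log_A[s + n - 1, start] = np.logaddexp(log_A[s + n - 1, start], log_switch)
            s += n
        # Standard Viterbi recursion with backtracking over the combined network.
        T = len(obs)
        delta = log_pi + log_B[:, obs[0]]
        back = np.zeros((T, N), dtype=int)
        for t in range(1, T):
            cand = delta[:, None] + log_A                  # cand[i, j]: score of ending in j via i
            back[t] = np.argmax(cand, axis=0)
            delta = cand[back[t], np.arange(N)] + log_B[:, obs[t]]
        path = np.zeros(T, dtype=int)
        path[-1] = int(np.argmax(delta))
        for t in range(T - 1, 0, -1):
            path[t - 1] = back[t, path[t]]
        return owner[path]                                 # gesture label for every frame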
5 Conclusion

The main achievement of this work is the development of an HMM-based procedure that can be applied to the analysis and classification of expressive conducting gestures. In particular, HMM training and recognition processes were applied to the analysis of both right-hand beat-indicating gestures and left-hand expressive articulation gestures.
This brings an improvement over existing systems: whereas right-hand movements had been analyzed with HMM (Usa and Mochida 1998) and artificial neural network (Ilmomen and Takala 1999), (Garnett et al. 2001) techniques in the past, there had been no previous research applying high-level recognition and classification techniques to left-hand expressive gestures.

The designed HMM package is intended for use as a general tool in the Max/MSP environment and is available for free distribution (http://www.music.mcgill.ca/~pkoles/download.html). It can be applied not only to positional data classification, but also to any other process that involves pattern recognition, such as speech recognition, timbre recognition or score following. The resulting set of analysis, HMM-based recognition and performance tools will be directed towards future research on a standardized classification of conducting gestures.

References

Bertini, G. and P. Carosi (1992). Light baton: A system for conducting computer music performance. In Proceedings of the International Computer Music Conference, pp. 73-76. International Computer Music Association.

Borchers, J., W. Samminger, and M. Muhlhauser (2002). Engineering a realistic real-time conducting system for the audio/video rendering of a real orchestra. In Proceedings of the 4th International Symposium on Multimedia Software Engineering, pp. 352-362.

Buxton, W., W. Reeves, G. Fedorkov, K. C. Smith, and R. Baecker (1980). A microcomputer-based conducting system. Computer Music Journal 4(1), 8-21.

Camurri, A., P. Coletta, M. Peri, M. Ricchetti, A. Ricci, R. Trocca, and G. Volpe (2000). A real-time platform for interactive performance. In Proceedings of the International Computer Music Conference. International Computer Music Association.

Deller, J. R., J. H. Hansen, and J. G. Proakis (2000). Discrete-Time Processing of Speech Signals. New York: IEEE Press.

Garnett, G. E., M. Jonnalagadda, I. Elezovic, T. Johnson, and K. Small (2001). Technological advances for conducting a virtual ensemble. In Proceedings of the International Computer Music Conference, pp. 167-169. International Computer Music Association.

Haflich, F. and M. Burns (1983). Following a conductor: The engineering of an input device. In Proceedings of the International Computer Music Conference. International Computer Music Association.

Huang, X. D., Y. Ariki, and M. Jack (1990). Hidden Markov Models for Speech Recognition. New York: Columbia University Press.

Ilmomen, T. and T. Takala (1999). Conductor following with artificial neural networks. In Proceedings of the International Computer Music Conference, pp. 367-370. International Computer Music Association.

Keane, D. and P. Gross (1989). The MIDI baton. In Proceedings of the International Computer Music Conference, pp. 151-154. International Computer Music Association.

Lee, M., G. Garnett, and D. Wessel (1992). An adaptive conductor follower. In Proceedings of the International Computer Music Conference, pp. 454-455. International Computer Music Association.

Lien, J. J. (1998). Automatic Recognition of Facial Expressions Using Hidden Markov Models and Estimation of Expression Intensity. Ph.D. thesis, The Robotics Institute, Carnegie Mellon University.

Marrin, T. (2000). Inside the Conductor's Jacket: Analysis, Interpretation and Musical Synthesis of Expressive Gesture. Ph.D. thesis, Massachusetts Institute of Technology.

Marrin, T. and J. Paradiso (1997). The digital baton: A versatile performance instrument. In Proceedings of the International Computer Music Conference, pp. 313-316. International Computer Music Association.

Matthews, M. V. (1976). The conductor program. In Proceedings of the International Computer Music Conference, Cambridge, Massachusetts.

Matthews, M. V. (1991). The radio baton and the conductor program, or: Pitch, the most important and least expressive part of music. Computer Music Journal 15(4), 37-46.

Murphy, D., T. H. Andersen, and K. Jensen (2003). Conducting audio files via computer vision. In Proceedings of the 2003 International Gesture Workshop, Genoa, Italy.

Orio, N. and F. Dechelle (2001). Score following using spectral analysis and hidden Markov models. In Proceedings of the International Computer Music Conference, pp. 125-129. International Computer Music Association.

Rabiner, L. R. and B. H. Huang (1986). An introduction to hidden Markov models. IEEE Acoustics, Speech and Signal Processing Magazine 3(1), 4-16.

Rudolph, M. (1994). The Grammar of Conducting: A Comprehensive Guide to Baton Technique and Interpretation. Toronto: Maxwell Macmillan Canada.

Segen, J., A. Mujumder, and J. Gluckman (2000). Virtual dance and music conducted by a human conductor. Eurographics 19(3).

Tobey, F. and I. Fujinaga (1996). Extraction of conducting gestures in 3D space. In Proceedings of the International Computer Music Conference, pp. 305-307. International Computer Music Association.

Usa, S. and Y. Mochida (1998). A conducting recognition system on the model of musicians' process. Journal of the Acoustical Society of Japan 19(4), 275-287.

Vogler, C. and D. Metaxas (1999). Parallel hidden Markov models for American Sign Language recognition. In Proceedings of the International Conference on Computer Vision, pp. 116-122.