A Distributed Real-time System for Interactive Music Mapping with Multiple Inputs

Kia Ng¹,² and Jon Scott¹
Interdisciplinary Centre for Scientific Research in Music (ICSRiM)
¹School of Computing  ²School of Music
University of Leeds, Leeds LS2 9JT, UK
web: www.kcng.org/mvm  email: mvm@icsrim.org.uk

Abstract

This paper presents developments in the on-going Music via Motion (MvM) research project. The MvM framework involves tracking meaningful movements in a scene and translating the movement into another domain. In this paper, we describe a distributed implementation of MvM, along with issues raised and applications of the system. Implementations of the system are detailed, together with planned and possible extensions, within the Trans-Domain Mapping framework described in Ng (2000, 2002).

1 Introduction

The use of electronics and computing to enhance existing musical instruments, or to create new virtual instruments, has been enabled in recent years by improvements in technological performance. Applications of these ideas have ranged from choreographing dance performances through to conducting 'virtual orchestras'. A comprehensive survey of existing projects is presented in Camurri et al. (2000), and many related works can be found in NIME (2002) and Engineering and Music (2001). This paper presents a distributed implementation of MvM, along with issues raised and applications of the system.

1.1 Main Modules

The architecture of the MvM system adopts a modular design, based around five main sections:

1. Data acquisition: interfacing with video capture hardware.
2. Motion tracking: identifying and tracking areas of interesting movement.
3. Mapping: translating tracked motion into musical events with a flexible and user-configurable interface.
4. Output: generating audio and visual events.
5. User interface: integrating the above components with a graphical interface, allowing flexible runtime configuration.

1.2 Original MvM software

The original MvM software, described in Ng et al. (2000), is an integrated Windows application. Input is restricted to a single camera, and the configuration possibilities, although extensive, do not allow complete control over how the data is mapped to the audio output. Extending the system requires rewriting of the C++ code, although the modular design of the code facilitates straightforward enhancement. This setup does, however, make the system more intuitive and portable. A new version of MvM has been developed which sacrifices some of this simplicity in favour of a more flexible, modular and distributed system.

2 Distributed MvM

The new MvM system is designed and implemented in an entirely modular fashion, with communication between modules taking place across network sockets. There are three main modules in the system: input, mapping, and output. The current implementation ties together multiple inputs through a single mapping module, which can control a number of output modules. Input stages may themselves be modular, so, for example, a tracking module could send its data to a gesture recognition module which would in turn communicate with the mapping component.

An advantage of this distributed approach is that computer architectures can be mixed in a way which takes advantage of the varying strengths of different platforms. It also allows appropriate resources to be allocated to the work being performed. For example, a processing-intensive tracking module could be run on a dedicated powerful machine, in order to enable a higher frame-rate throughput.
Currently, modules are being developed for Linux, Win32 and IRIX systems. An example configuration using multiple networked machines is illustrated in Figure 1.
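As an illustration of how heterogeneous input modules might present their data to the mapping module over a socket, the following C++ sketch packs a common parameter record into a byte buffer. The structure, field names and layout are assumptions made for this sketch; the actual MvM wire format is not specified here.

// Illustration only: a hypothetical common parameter record that any input
// module (tracker, flex sensor, face tracker) could stream to the mapping
// module over a socket.
#include <cstdint>
#include <cstring>
#include <vector>

struct ParamMessage {      // hypothetical layout; host byte order assumed
    uint32_t sourceId;     // which input module produced the value
    uint32_t paramId;      // e.g. 0 = x position, 1 = y position, 2 = bend angle
    float    value;        // raw, unmapped parameter value
};

// Pack a message into a flat byte buffer ready for transmission over a socket.
std::vector<uint8_t> pack(const ParamMessage& m) {
    std::vector<uint8_t> buf(sizeof(ParamMessage));
    std::memcpy(buf.data(), &m, sizeof(ParamMessage));
    return buf;
}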

Figure 1: An example configuration of a distributed MvM.

2.1 Modules

Each module works as a standalone process, interacting with other modules via network sockets. They may run as separate processes on the same machine, or in an entirely distributed manner. Input components acquire data from a source, perform any required processing, and transmit their results to the mapping module. Output components receive data from the mapping module and communicate with output devices to realise the musical or visual event.

2.2 Camera-Based Object Tracking

The original MvM system used frame differencing to track moving objects. This approach is useful for scenes of constant motion, and is highly resistant to changes in the background of the scene, but does not allow for more detailed tracking of objects. An adaptive background subtraction module has been developed to provide an additional tracking approach. The implementation takes a very basic approach, selected for speed of processing rather than accuracy. Successive frames are added to a mean frame (calculated on a pixel-by-pixel basis), and the standard deviation of each pixel in the frame is computed. After a predefined number of initialisation frames (typically 100), a binary foreground mask can be calculated by differencing subsequent frames with the mean frame. Pixels lying more than a certain number of standard deviations (typically 2.5) from the mean are considered to lie in the foreground, the rest in the background. In order to cope with slow changes in the background, due, for example, to lighting conditions, pixels which are labelled as background are used to update the mean frame and recompute the standard deviations.

Distinct unconnected regions in the foreground image are then considered to represent objects of interest. A small amount of filtering is performed to remove objects of a small size (which can be considered to be noise). The bounding box containing the region of interest is then transmitted across a network socket to the mapping module. If further processing is required, this data, along with the image data, can be sent to another processing module (see Figure 1).

Besides sending these raw parameters directly to the mapping module, additional image analysis modules could be introduced to interpret the raw data. An example would be to identify areas of high curvature on the object's border, allowing interesting subregions to be segmented and tracked independently. For example, the body of a dancer with hands by their side would be tracked as a single object, but as soon as an arm is extended, it would be identified as a separate subregion of interest and would create a further set of mappable parameters.
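A minimal sketch of the adaptive background model described above is given here for a single-channel image held as a flat array. The 2.5-standard-deviation threshold and the initialisation over a predefined number of frames follow the description above; the exponential update rate for background pixels is an assumed value, and, for brevity, the per-pixel variance is not re-estimated after initialisation as it is in the full system.

// Sketch of the adaptive background model: per-pixel running mean and
// variance, with pixels more than `threshold` standard deviations from the
// mean labelled as foreground, and background pixels slowly tracking the mean.
#include <cmath>
#include <cstddef>
#include <vector>

class BackgroundModel {
public:
    BackgroundModel(std::size_t nPixels, float threshold = 2.5f, float alpha = 0.01f)
        : mean_(nPixels, 0.0f), var_(nPixels, 0.0f),
          threshold_(threshold), alpha_(alpha), framesSeen_(0) {}

    // Call for each of the initialisation frames (typically ~100) to build the
    // per-pixel running mean and variance (Welford's incremental update).
    void initialise(const std::vector<float>& frame) {
        ++framesSeen_;
        for (std::size_t i = 0; i < frame.size(); ++i) {
            float delta = frame[i] - mean_[i];
            mean_[i] += delta / framesSeen_;
            var_[i]  += delta * (frame[i] - mean_[i]);
        }
    }

    // After initialisation: label foreground pixels and adapt the background.
    std::vector<bool> segment(const std::vector<float>& frame) {
        std::vector<bool> foreground(frame.size(), false);
        for (std::size_t i = 0; i < frame.size(); ++i) {
            float sd = std::sqrt(var_[i] / framesSeen_) + 1e-6f;
            if (std::fabs(frame[i] - mean_[i]) > threshold_ * sd) {
                foreground[i] = true;                           // over 2.5 sd: foreground
            } else {
                // Background pixels slowly track lighting changes.
                mean_[i] = (1.0f - alpha_) * mean_[i] + alpha_ * frame[i];
            }
        }
        return foreground;
    }

private:
    std::vector<float> mean_, var_;
    float threshold_, alpha_;
    std::size_t framesSeen_;
};

Connected regions of foreground pixels would then be grouped, filtered by size, and their bounding boxes forwarded to the mapping module, as described above.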
2.3 Sensor Input

In addition to camera input, physical sensors can be used to provide raw data as input to the mapping module. One existing example of this comes from flex sensors, mounted in a pair of drum brushes, which communicate with a PC through a microcontroller and serial port interface (see Figure 2). The angle at which the brush is bent becomes a simple parameter which can be streamed to the mapping module and mapped in various ways. Potential uses for this include:

* Mapping the angle to a musical note. In combination with a trigger to turn the note on and off (which could be taken from the drum itself, via a microphone), this would allow the performer to play melodic parts at the same time as purely percussive drum parts.
* Triggering sample playback. This would allow the performer to play 'in the air' (virtual drum: drumming without a drum set).
* Effects control, providing control over parameters of effects units which are themselves used to process the sound of the drum, for example the cut-off frequency of a resonant filter.
* Control of a synthesis engine, for example altering envelope profiles, controlling the grain size of a granular synthesis engine, altering harmonic content and influencing the timbre of the generated sound.

The distributed MvM system allows this input to be used in combination with any other set of input modules, for example adding a camera trained on the player. Building up components in this way allows the creation of completely customised performance tools to suit the situation in which they will be used.
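As a sketch of the sensor-input side, the following assumes the microcontroller delivers the flex reading as a single byte (0-255) over the serial link; the byte range, the 0-90 degree bend range and the names are illustrative assumptions, and serial-port handling is omitted.

// Illustration only: decode an assumed single-byte flex reading into a bend
// angle, paired with an optional strike trigger (e.g. from a drum microphone).
// The resulting values would be streamed to the mapping module like any other
// input parameter.
#include <cstdint>

struct BrushReading {
    float angleDegrees;   // approximate bend angle of the brush
    bool  strike;         // optional note on/off trigger
};

BrushReading decodeFlex(uint8_t rawByte, bool strikeDetected) {
    BrushReading r;
    r.angleDegrees = (rawByte / 255.0f) * 90.0f;  // assumed 0-90 degree range
    r.strike = strikeDetected;
    return r;
}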

Figure 2: (a) Flex sensor with interface box and (b) drum brush.

2.4 Face Tracking

The main drive behind the distributed system has been the possibility of integrating MvM with existing tracking systems being developed within the University of Leeds Vision Group (www.comp.leeds.ac.uk/vision). One of the first systems integrated in this manner is a real-time facial expression tracker (Devin and Hogg 2001). This system tracks the shape and position of facial features, producing sets of coordinates approximating the outlines of the eyes, eyebrows, nose and mouth. A module has been developed for MvM which packages these coordinates and transmits them to the mapping module as another input stream of the MvM system. This allows a performer to control musical events, or influence the mood of a performance, merely through changes in facial expression. A musical interface such as this would provide alternative routes to musical creativity for many people, including those for whom conventional instruments are not an option due to physical constraints.

2.5 Output Modules

The current output modules provide access to the usual MIDI playback functions, as well as customisable output of control change messages. Work in hand includes a sample playback module and a synthesis module using a real-time version of CSound. Of these, the synthesis module should provide a particularly high number of controllable parameters, giving the potential for the system to become a highly expressive performance tool.

3 Network Protocol Issues

Due to the real-time nature of the system, a network protocol which can support the particular requirements of real-time communication is required. Ideally the protocol would allow negotiation of parameters such as latency, bandwidth, and jitter (variation in delay, which is particularly important for preserving any rhythmic information in the transmission). Realistically, at the time of writing, an IP-based solution allows integration with many more existing networks, and use of the Internet without any adaptation. In this case, a stream-based transport protocol such as TCP is not suitable, since any delay experienced will propagate throughout the entire transmission.

Our design adopts a simple streaming protocol based on the UDP transport protocol. Each packet contains a sequence number. When a new packet is received, its sequence number is stored. If a packet with an older sequence number is subsequently received, it will be dropped. This ensures that events are not mapped out of order when communicating over an unreliable or congested network.

Currently, no provision is made for preserving the timing of the original input; the system assumes that any variation in timing introduced by the network is minimal. An extension to the protocol could rectify this situation. However, early experiments with the timing-free protocol have shown rhythmically interesting variations in musical output when different networks are used for communication: the network itself imposes its own characteristics on the music produced. In a timing-aware version of the protocol, an option will be provided to exploit this variation, or to rectify it.
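The packet layout beyond the sequence number is not described here, so the following receiver-side sketch is illustrative only: it keeps the most recent sequence number seen and drops any datagram that arrives late or duplicated. The header structure and the wrap-around handling are assumptions.

// Sketch of the receiver side of the UDP-based streaming protocol: datagrams
// carrying a sequence number older than the newest one already seen are
// dropped rather than delivered out of order to the mapping module.
#include <cstdint>

struct PacketHeader {
    uint32_t sequence;   // monotonically increasing per-stream counter
};

class StreamReceiver {
public:
    // Returns true if the packet should be passed on to the mapping module.
    bool accept(const PacketHeader& h) {
        // Signed difference copes with 32-bit wrap-around of the counter.
        int32_t ahead = static_cast<int32_t>(h.sequence - lastSequence_);
        if (seenAny_ && ahead <= 0) {
            return false;            // stale or duplicate: drop it
        }
        lastSequence_ = h.sequence;
        seenAny_ = true;
        return true;
    }

private:
    uint32_t lastSequence_ = 0;
    bool     seenAny_ = false;
};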
4 Rôle of the Mapping Component

It is the responsibility of the mapping module to assign musical meaning to the input data. A very simple example would be to map the X coordinate of an object to a note on a scale (a one-to-one direct mapping). This can be extended to mapping schemes of arbitrary complexity, utilising inputs from multiple sources. For example, a performer being tracked in one scene could control the velocities of notes being played by a performer in an entirely different scene.

The mapping can be conceptually split into two separate sections: mapping functions and routing. The mapping functions allow arbitrary transformations of the input data, such as scaling or logarithmic mappings, while the routing component connects transformed inputs to the desired musical output module. The mapping functions are constrained by the characteristics of the output domain. For example, the input from a flex sensor would be scaled to lie between 0 and 127 in order to map to a MIDI note.

Clearly, a flexible user interface is required to enable the creation of such mappings in an intuitive and interactive manner. The current system uses a dialog box interface. Inputs, mapping functions and outputs are represented as elements in a list of connectible components. The user is then free to connect the components as they see fit. Mapping functions provided include scaling of note ranges, transpositions and a dynamic key-filter for mapping to musical scales.

Background literature on the development of various mapping models can be found in Hunt (1999), Hunt and Kirk (2000), Marrin-Nakra (2000), Wanderley (2001a, 2001b), Wanderley and Battier (2000) and Organised Sound (2002). Complex mapping strategies for expert instrument interaction are discussed in Hunt, Wanderley, and Kirk (2000).
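The following sketch illustrates the two halves described above using hypothetical function names: a mapping function that rescales a raw input onto the 0-127 MIDI range, and a simple key-filter that snaps the result onto a major scale. The input range, the fixed C major scale table and the chained routing are assumptions for illustration rather than the MvM mapping interface.

// Illustration only: a linear scaling function and a simple key filter, the
// kinds of mapping functions described above, chained into one routing.
#include <algorithm>
#include <array>
#include <cmath>

// Linearly rescale a raw input value from [inMin, inMax] onto the MIDI range 0-127.
int scaleToMidi(float value, float inMin, float inMax) {
    float t = (value - inMin) / (inMax - inMin);
    t = std::clamp(t, 0.0f, 1.0f);
    return static_cast<int>(std::lround(t * 127.0f));
}

// Snap a MIDI note onto the nearest pitch of a C major scale (a dynamic
// key-filter would allow the scale to be changed at runtime).
int keyFilterMajor(int note) {
    static const std::array<int, 7> major = {0, 2, 4, 5, 7, 9, 11};
    int octave = note / 12, pitch = note % 12;
    int best = major[0], bestDist = 12;
    for (int p : major) {
        int d = p > pitch ? p - pitch : pitch - p;
        if (d < bestDist) { bestDist = d; best = p; }
    }
    return octave * 12 + best;
}

// Example routing: an object's x coordinate (frame width of 640 pixels assumed)
// mapped to a note in C major.
int mapXToNote(float x) { return keyFilterMajor(scaleToMidi(x, 0.0f, 639.0f)); }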

5 Collaborative Performance

The distributed system design and dynamic configuration detailed above open up a wide range of possibilities for the exploration of geographically distributed collaboration. Several different performers may collaborate on the same piece across a wide area network, such as the Internet. The main hindrance to this possibility is network latency. The 'round-trip time' for an event encompasses:

1. Time from site A to site B
2. Reaction time for the performer(s) at site B
3. Time from site B to site A

With a high-latency network, this time would be too large for a meaningful interaction to take place. Studies have shown that a transmission delay of more than 250 ms makes teleconferencing applications unusable (Jeffay and Stone 1994). For interactive performance, this figure would clearly be much lower, due to the need for two-way interaction whilst ensuring timing synchronicity between the performers. One-way interaction, where one performer is unaware of the actions of the other, but the second performer receives their performance and can play over it, is feasible with a high-latency network. The only extra requirement is a mechanism, preferably built into the network protocol as described above, to preserve the timing between events, ensuring rhythmic consistency after transmission.

6 Conclusion

The distributed MvM system is a further step towards creating a fully interactive augmented performance environment. It has applications ranging from creating customised instruments through to integrating choreography with composition. Future directions for the project include automatic detection of interesting features within a scene: for example, tracking a performer with a wide-angle camera and directing the attention of mobile input devices (such as pan-and-tilt cameras with zooming capability) to focus on certain features or gestures of the performer in more detail, without any user intervention.

Further information on the MvM project, including photos from various performances and video clips of an Augmented Drum demo, can be found at the project website: www.kcng.org/mvm

References

Camurri, A., S. Hashimoto, M. Ricchetti, A. Ricci, K. Suzuki, R. Trocca, and G. Volpe. 2000. "EyesWeb: Toward gesture and affect recognition in interactive dance and music systems." Computer Music Journal 24(1): 57-69.

Devin, V. E. and D. C. Hogg. 2001. "Reactive memories: an interactive talking-head." Technical Report 9, School of Computing, University of Leeds.

Engineering and Music. 2001. Proceedings of the International Workshop on Human Supervision and Control in Engineering and Music. University of Kassel, Germany.

Hunt, A. 1999. Radical user interfaces for real-time musical control. Ph.D. thesis, University of York, UK.

Hunt, A. and R. Kirk. 2000. "Mapping strategies for musical performance." In Trends in Gestural Control of Music. Ircam - Centre Pompidou.

Hunt, A., M. M. Wanderley, and R. Kirk. 2000. "Towards a model for instrumental mapping in expert musical interaction." In Proceedings of ICMC 2000, Germany, pp. 209-212.
Jeffay, K. and D. L. Stone. 1994. "Adaptive, best-effort delivery of live audio and video across packet-switched networks (video abstract)." In Proceedings of the Second ACM International Conference on Multimedia, San Francisco, CA, pp. 487-488.

Marrin-Nakra, T. 2000. "Inside the conductor's jacket: analysis, interpretation and musical synthesis of expressive gesture." Ph.D. thesis, MIT Media Lab.

Ng, K. 2000. "Music via Motion." In Proceedings of XIII CIM 2000 - Colloquium on Musical Informatics, Italy.

Ng, K., S. Popat, B. Ong, E. Stefani, K. Popat, and D. Cooper. 2000. "Trans-domain mapping: A real-time interactive system for motion acquisition and musical mapping." In Proceedings of ICMC 2000, Germany.

Ng, K. 2002. "Sensing and Mapping for Interactive Performers." Organised Sound: An International Journal of Music and Technology 7(2). Cambridge University Press.

NIME. 2002. Proceedings of the Conference on New Interfaces for Musical Expression (NIME). Dublin, Ireland.

Organised Sound. 2002. Organised Sound: An International Journal of Music and Technology 7(2), special issue on Mapping Strategies for Real-time Computer Music, Marcelo M. Wanderley (guest editor). Cambridge University Press.

Wanderley, M. 2001a. "Gestural control of music." In Proceedings of the International Workshop on Human Supervision and Control in Engineering and Music, Kassel, Germany.

Wanderley, M. 2001b. "Performer-Instrument Interaction: Applications to Gestural Control of Sound Synthesis." Ph.D. thesis, University of Paris 6, France.

Wanderley, M. and M. Battier (Eds.). 2000. Trends in Gestural Control of Music. Ircam - Centre Pompidou.