Technological Advances for Conducting a Virtual Ensemble

Guy E. Garnett, Mangesh Jonnalagadda, Ivan Elezovic, Timothy Johnson, Kevin Small
University of Illinois, Urbana-Champaign
garnett, jonnalag, elezovic, tejohnso, ksmall:

Abstract

This paper describes recent advances in our Interactive Virtual Ensemble project. The project aims to simulate the response of a human performing ensemble to more-or-less standard conducting gestures. Over the past year we have added several new components to the system, primarily in two areas: tracking/recognition and sound synthesis. The system now uses a wireless MotionStar tracker and a distributed communication model. We are also using a hybrid beat detection and classification system that incorporates some neural net processing for both beat prediction and classification. The sound synthesis component uses dynamic control of an analysis-based additive synthesis model.

1 Introduction

The notion of tracking conducting or conductor-like gestures has been of interest to the computer music community for a number of years. Work by Max Mathews on the Radio Drum (see Boie et al., 1989) showed the feasibility of such a project, as well as some of the problems. In particular, finding tracking technology that interferes only minimally with the conductor's movements has been problematic. The Radio Drum required a conductor to hold an unwieldy baton above a chessboard-like antenna. Other systems present their own difficulties. One important system (Marrin Nakra, 2000) relies on relatively cheap medical technology, EKGs, to detect muscle tension and extrapolate from that to perceived motion. This too is problematic, especially in that not all features of the EKG signal represent features in the perceivable hand/arm motion of the conductor.
Previous work by one of us (Brecht and Garnett, 1995; Garnett and Malvar-Ruiz, 1999) used a Buchla Lightning infrared position tracker. This had the merit of freeing the hand motion, thanks to the fairly small baton and its wireless infrared transmitter. It also allowed for greater flexibility of receiver positioning, since no wires were needed to connect the baton transmitter to the receiver; the receiver can even be positioned as a member of the "orchestra" some 10 feet away from the conductor. However, the Lightning tracked only two dimensions per baton: the x and y locations within a vertical plane orthogonal to the conductor/receiver axis. At a useful distance, the spatial resolution also tended to become fairly coarse. Since this system only gives information about the location of the point of the baton (it can use up to two batons), it is useless for determining hand orientation. A conductor's left hand, in particular, uses orientation to convey such things as dynamics and cueing. For these, orientation must be determined. A further drawback of the Lightning is its reliance on MIDI to communicate with a computer.

2 Tracking Hardware

To alleviate some of these, and other, difficulties, we have now added high-dimensional controllers to the project, similar to those used in the E-Violin project (Garnett and Goudeseune, 1999). We use the MotionStar Wireless system developed by Ascension Technologies. Our system currently consists of three magnetic sensors, each of which gives three dimensions of positional information and three more of orientation, for a total of six dimensions of control per sensor: x, y, z, pitch, yaw, roll. It is also relatively easy to attach new sensors as necessary, up to a total of 20. Within a 10 foot radius of the transmitter, the sensors yield a positional accuracy of 1/10 inch and an orientation accuracy of one degree.
The system uses a transmitting antenna controlled by pulsed DC to create a magnetic field that is picked up by the sensors located on the user's hands and head. (The pulsed DC is an improvement on the older MotionStar system, which used an AC transmitter. The AC magnetic field was more susceptible to interference from any metallic objects in the performance environment.) The signal produced in the sensors by the DC magnetic field is then processed to determine the position and orientation of each sensor with respect to the transmitter. The system can generate data for each dimension of each sensor at a rate of 100 Hz. Another advantage of this new system is that it is wireless: each sensor plugs into a small, belt-wearable device that transmits the data wirelessly to a receiver connected to the processing PC. This allows us much

greater freedom of movement, which we expect to be particularly useful in performance situations.

2.1 Greater Discrimination

With these sensors we are now able to detect not only individual conducting beats as previously (see Garnett and Malvar-Ruiz, 1999), but also more subtle positioning of either hand and the orientation of the head. For instance, we can fasten a sensor to the left hand and detect not simply horizontal and vertical location as previously, but also whether the hand is facing palm downward, as might be the case for a gesture indicating quiet, or palm upward, as might be the case for a gesture encouraging greater loudness. We can also attach a sensor to the conductor's head and make at least some crude determinations about where she is looking. This latter information will be used to facilitate a system of cueing whereby simple gaze detection, by tracking head orientation, enables a conductor to focus attention on a particular instrument in the virtual ensemble. We believe this kind of control is crucial to our goal of developing an accurate simulation of the response of a human ensemble to the gestures of a conductor.

3 Distributed Processing

Another advantage of our new tracking system is that it communicates with the rest of the world not by MIDI, as many previous systems did, but by Ethernet. This allows for great flexibility, use of existing networking infrastructure, and higher data rates. The additional control dimensions of our tracker result in an increase in the amount of data we need to process. Even just the raw position information comes in at 57.6 Kbits/sec. We have therefore switched to a network-based model: one computer (currently an SGI Onyx) processes the incoming data from the sensors and communicates with a Macintosh using CNMAT's OpenSoundControl protocols (Wright and Freed, 1997). We have developed a simple mechanism for allowing a Max/MSP application to request controller data from the Onyx.
This data can be preprocessed on the Onyx, and the Max application layer can request all or any subset of the data streams, thus keeping the communication bandwidth to the minimum required for a given need. Furthermore, the application can request different data streams to be transferred at different rates. For example, we can request head motion data at a relatively slow rate compared to right-hand data, since the former is usually less time-critical.

4 Beat tracking

Our application calls for great flexibility in the kinds of gestures we wish to recognize. We need to recognize different beat qualities such as tempo, legato, marcato, and others. We also need to recognize these different beat qualities as rendered by different performers. Another difficulty we face is that conductors in a live performance vary their beat gestures depending on details of the particular performance; they might even use a different gesture to achieve the same effect depending on the context in which it occurs.

The approach we are using is related to that of Usa and Mochida (1998). We hope initially to be able to confirm their results with our new gesture tracking capability. We will then go on to expand the repertoire of recognizable patterns and simulate more varied and appropriate musical responses to each gesture type. Our initial work with the trackers makes us optimistic that the system will hold up under difficult performance constraints.

A preliminary trial was performed using just two different beat types. These data were segmented using simple beat detection algorithms from (Garnett and Malvar-Ruiz, 1999). Once they were segmented, we fed them into a neural net configured as in Figure 1. It represents a backpropagation neural network with n inputs (configured as a tap-delay line), m hidden nodes, and o output nodes, where $o = \lceil \log_2 x \rceil$ and x is the number of beat categories presented.
Since an independent algorithm segments the beats, the neural network input nodes are reset between each beat of tracking data. The output response is therefore independent for each beat of the data set presented. To make appropriate decisions in a real-time setting, training occurs offline and the network performs as a feed-forward network when distinguishing between different types of beats during runtime. As the value of n is increased, more data is captured in the tap-delay line and the accuracy of the neural network response increases, but prediction is completed later in time. As the value of m is increased, the network becomes more descriptive and can make higher order distinctions. On a preliminary set of tracking data, we were able to accurately distinguish between two beat categories before the beat was even halfway completed with an accuracy rate of 85%. For this specific network, which was completely connected, n = 10, m = 5, and o = 1, as seen in Figure 1.
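
The network structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class and function names are hypothetical, the weights are random placeholders standing in for offline backpropagation training, and it assumes one scalar feature per tap, which the paper does not specify. It shows only the output-node count, the per-beat reset of the tap-delay line, and the feed-forward pass used at runtime.

```python
import math
import random


def num_output_nodes(num_categories):
    """o = ceil(log2(x)): binary-coded outputs for x beat categories."""
    return max(1, math.ceil(math.log2(num_categories)))


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class TapDelayClassifier:
    """Fully connected n-m-o network fed by a tap-delay line (hypothetical sketch)."""

    def __init__(self, n, m, num_categories, seed=0):
        rng = random.Random(seed)
        self.n = n
        self.o = num_output_nodes(num_categories)
        # Placeholder weights; a real system would load weights trained offline.
        # The extra column in each row is the bias weight.
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(n + 1)] for _ in range(m)]
        self.w_out = [[rng.uniform(-1, 1) for _ in range(m + 1)] for _ in range(self.o)]
        self.reset()

    def reset(self):
        """Clear the delay line; call this at each beat boundary from the segmenter."""
        self.taps = [0.0] * self.n

    def push(self, sample):
        """Shift one new sample into the delay line and run the feed-forward pass."""
        self.taps = [sample] + self.taps[:-1]
        inputs = self.taps + [1.0]
        hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in self.w_hidden]
        hidden.append(1.0)
        return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in self.w_out]

    def classify(self, sample):
        """Threshold the outputs and read them as a binary category index."""
        bits = [1 if a >= 0.5 else 0 for a in self.push(sample)]
        return int("".join(str(b) for b in bits), 2)


# Configuration reported for the preliminary trial: n = 10, m = 5, two categories.
net = TapDelayClassifier(n=10, m=5, num_categories=2)
```

Because the outputs are binary-coded, two categories need only one output node, which matches the reported o = 1. Increasing n lengthens the delay line, so a decision based on a full line arrives later within the beat.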

Figure 1. Tapped-delay Neural Net

5 Synthesis

The second area of improvement is the synthesis domain: we have added a new capacity to control synthesis using an analysis/resynthesis paradigm instead of MIDI. In addition to a new twist on the sum-of-sinusoids resynthesis, which will be described elsewhere, we have implemented a unique feature specifically designed for our application: real-time, conductor-directed control of the tempo of the synthesis. One of the problems associated with changing the tempo of the resynthesized music, and with slowing it down in particular, is that the change affects transient and non-transient behavior equally. When an instrumentalist slows down in tempo, their note-to-note transitions do not occur at a substantially altered rate. Therefore, if we simply slow down the whole signal, transient and non-transient alike, the transients get spread out in time, creating artifacts unacceptable for our application.

To avoid this situation, in our new implementation we mark transients during the analysis stage. During the synthesis stage we maintain the original time scaling during the transients and return to the modified scaling after the transient is over. We use a simple transient detection algorithm to determine when the signal will not be well modeled by our resynthesis, and switch at that point to a simple resynthesis by inverse FFT at the original rate. After the transients have settled down, we return to the time scaling and to the additive synthesis model.

5.1 Implementation

In keeping with our general desire to work with open standards, we are using CNMAT's Sound Description Interchange Format (see Wright et al., 1999) for our analysis data. We are currently turning our synthesis code into a Max/MSP external object that will read the SDIF data and generate the samples. The MotionStar tracker runs on a PC and sends its data to an SGI, which runs the neural network processing to perform the gesture recognition.
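
As an aside on the transient handling described in Section 5, the scheduling idea can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name, the per-frame boolean transient flags, and the single `stretch` factor are all hypothetical. It covers only the choice of read position per output frame; the switch between additive synthesis and inverse-FFT resynthesis is omitted.

```python
def read_positions(transient_flags, stretch):
    """Return one analysis-frame read position per output frame.

    transient_flags[i] is True if analysis frame i was marked as a transient.
    stretch > 1 slows the music down; stretch < 1 speeds it up.
    Inside a transient the read position advances at the original rate
    (one analysis frame per output frame); elsewhere it advances by
    1 / stretch, so only the non-transient material is time-scaled.
    """
    if stretch <= 0:
        raise ValueError("stretch must be positive")
    positions = []
    pos = 0.0
    while pos < len(transient_flags):
        positions.append(pos)
        in_transient = transient_flags[int(pos)]
        pos += 1.0 if in_transient else 1.0 / stretch
    return positions


# Ten analysis frames with a two-frame transient in the middle, played at half tempo.
flags = [False] * 4 + [True] * 2 + [False] * 4
schedule = read_positions(flags, stretch=2.0)
```

In this example the eight steady-state frames are each read twice (at integer and half-integer positions), while the two transient frames are read once each, so the transient keeps its original duration.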
The SGI then sends the processed control data to another Max/MSP external that drives the synthesis object.

6 Futures

The eventual goal of the work is twofold: one, to develop a system suitable for conductor training, and two, to develop a conductor-based performance instrument. In order to achieve these goals, a number of problems will have to be addressed. First, how many different gesture types do we need? And how many can we identify? We are currently recording data from professional conductors to analyze and determine how many gesture categories we will need. We will then try to train our classifier to recognize and distinguish as many of these as we can. Second, there are still more difficulties in determining what can loosely be referred to as "musical" responses to conducting gestures. We know we can speed up and slow down in response to detected beats, but we do not yet have precise measurements of exactly how good musicians speed up and slow down in relation to a conductor. What we need is to record both a conductor's gestures and a real ensemble's response, and to compare and correlate these signals to try to quantify their relationship.

References

Boie, B., M. Mathews, and A. Schloss. 1989. "The Radio Drum as a Synthesizer Controller," Proc. ICMC, Columbus, Ohio, pp. 42-45.

Brecht, B., and G. Garnett. 1995. "Conductor Follower," Proc. ICMC, Banff, Canada, pp. 185-186.

Garnett, G., and F. Malvar-Ruiz. 1999. "Virtual Conducting Practice Environment," Proc. ICMC, Beijing.

Garnett, G., and C. Goudeseune. 1999. "Performance Factors in Control of High-Dimensional Spaces," Proc. ICMC, Beijing.

Morita, H., S. Hashimoto, and S. Ohteru. 1991. "A Computer Music System that Follows a Human Conductor," Computer 24(7), July, pp. 44-53.

Marrin Nakra, Teresa. 2000. "Inside the Conductor's Jacket: Analysis, Interpretation and Musical Synthesis of Expressive Gesture," PhD diss., MIT.
See htt://

Usa, S., and Y. Mochida. 1998. "A conducting recognition system on the model of musicians' process," Journal of the Acoustical Society of Japan 19(4).

Wright, M. 1999. "SDIF Specification," http://cnmat.CNMAT.Berkeley.EDU/ISDIF/Sec.html.

Wright, M., R. Dudas, S. Khoury, R. Wang, and D. Zicarelli. 1999. "Supporting the Sound Description Interchange Format in the Max/MSP Environment," Proc. ICMC, Beijing.

Wright, M., and A. Freed. 1997. "Open SoundControl Protocol," Proc. ICMC, Thessaloniki.