Communication of Musical Gesture using the AES/EBU Digital Audio Standard

Adrian Freed and David Wessel
CNMAT, 1750 Arch Street, Berkeley, CA 94709
(510) 643 9990, { adrian,wessel

1. Abstract
We have adapted the AES/EBU digital audio standard to the coding and transmission of transduced gestures. We discuss the advantages of the AES/EBU standard over MIDI and other candidate methods and describe alternative mappings of gestural data to the audio streams of the AES/EBU protocol. We conclude with a description of a reactive glove system and a continuous position-sensing keyboard controller using AES/EBU communications.

2. Musical Gesture Communications Requirements

2.1. Introduction
The following score card rates various standards for data communications according to requirements for gestural controllers (Roads 1996) for live musical performance:

[Score card: the column headings (the standards compared) and the positions of the check marks are missing from this copy. The requirements rated are: gesture samples/second > 100k; synchronous clock; connector insertions > 100; complexity < 1000 gates; unoriented connector; locking connector; robust cable.]

2.2. Repeatability
One requirement for successful musical expression is that the sonic response to a gesture must be predictable. This implies repeatable data for the same gesture and predictable delivery to the synthesis system. These needs are best met with digital communications, which avoid the corrupting effects of inductive pickup, radio frequency interference (RFI), ground loops, connector contact noise, impedance mismatch, and "microphonic" cables. Precision and accuracy are optimized by converting to digital form as close as possible to the transducer and using as few shielded conductors as possible.

2.3. Throughput
Multimodal, multidimensional (Hand 1997) gesture arrays can result in effective data rates of more than 100,000 gesture samples/second. This clearly exceeds the performance available from MIDI.

2.4. Reliability
Reliability during the rigors of live musical performance is critical.
This constraint immediately eliminates many possible communications technologies. USB and IEEE 1394 do not provide for data transmission without repeaters over the distances required in most performance venues. In addition, USB and IEEE 1394 connectors are designed for a limited number of insertions and are oriented, making it hard to connect them in the dark or under the difficult lighting situations typical of stage performance. Also, multiwire cables are inherently less reliable than thicker cables with a few wires. Optical cables are fragile and those used in S/PDIF and ADAT are short, i.e., 6 m. We particularly favor the robustness inherent in coaxial cables that may be used in AES-3 (AES 1985) links. Reliability is also critical in audio, video and lighting device control; virtual reality; and medical monitoring applications, so these applications may also benefit from adapting the AES/EBU digital audio standard.

2.5. Latency
Controlled latency is essential for music synthesis applications that respond reactively to gestures. An advantage of using a digital audio standard for gestural transmission is that the AES/EBU and S/PDIF interface cards available for most personal computers and workstations are optimized for reliable, controlled-latency reception of audio. We have achieved our latency goal of 10 ± 1 ms with the AES-3 input of the SGI O2 and Octane. Although latencies are much longer on Macintosh 8.0 and Windows 98 systems, we expect competitive pressures to improve results in future operating system and hardware products.

It is interesting to contrast these results with data acquisition cards, an obvious alternative for gesture transduction. The driver software for these boards is often a source of debilitating latency and jitter, because the primary focus in their construction is reliable transfer of data to disk for later, non-real-time analysis.

Ethernet is an interesting alternative to explore because it is so widely available. Although timely delivery of packets is a consideration in modern networking (Kim and Chien 1996), there is no widely adopted, transport-independent, application-layer protocol for using such networks in reactive systems applications. Of particular concern is the observation that the overhead of packet preparation, assembly and disassembly now dominates transit latency (Rodrigues 1997).
We have achieved good latency and reliability results with our Open Sound Control Protocol (Wright and Freed 1997; Wright 1998) over 10BaseT and fast Ethernet. We use this protocol primarily for interprocess communication between synthesizer control clients and sound servers. The main problem with Ethernet for gesture transduction is the cost and complexity of the hardware and software needed to support the protocol stack.

2.6. Isochrony
An important feature of this work is the decision to communicate continuous measurements of gestures to the synthesizing device. This provides great flexibility, lacking in MIDI, to experiment with different gestural interpretation computations (Lee 1991; Wessel 1991; Lee, Freed et al. 1992; Lee and Wessel 1992). Gestural signals can operate directly on synthesis parameters or be analyzed and parsed into events. This sampled-signal view of gestures requires a stable local clock at the source and an accurate clock recovery scheme at the destination. These requirements are easily satisfied by AES/EBU links because they use biphase signaling and are built to satisfy stringent clock jitter specifications.

2.7. Cost
Although parts cost is a factor, the exploratory nature of our work leads us to minimize development cost. Evaluation boards from Crystal ( and AKM ( contain a clock, A/D converter and AES/EBU transmitter with coaxial and optical outputs. Many gestural transduction applications require significant signal pre-processing for transducer calibration, linearization, smoothing and noise reduction. This is easily provided using development boards for DSP chips. Recent DSP chips integrate an AES/EBU interface, e.g., the Motorola DSP56011.

3. Gesture Formatting
AES/EBU and S/PDIF transmit 2 channels of 24-bit samples at frame rates between 32 kHz and 48 kHz. Two additional bits, the user and channel status bits, are also sent with each sample.
Since few drivers give access to the user and channel status information, we encode gestures within the 24-bit sample data. Although not necessarily optimal for any given application, this mapping covers many musical applications:

Left:   bits 0-23    Original Left Channel Audio Sample
Right:  bit 0        F
        bits 1-7     A
        bits 8-15    BLO
        bits 16-23   BHI
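As a minimal sketch of the right-channel mapping in this section, the Python below packs F, A, BLO, and BHI into a 24-bit word and unpacks them again. The function names are hypothetical, and two details are assumptions that the text does not settle: bit 0 of the diagram is treated as the least significant bit, and BHI is treated as the high byte when BLO and BHI are combined into one 16-bit value.

```python
"""Sketch: pack and unpack the right-channel gesture word (F, A, BLO, BHI)."""

FRAMES_PER_BLOCK = 64  # the F bit is set every 64 sample frames


def pack_gesture_word(frame_index, a, blo, bhi):
    """Build one 24-bit right-channel word for the given sample frame.

    Assumption: bit 0 in the diagram is the least significant bit.
    """
    if not 0 <= a < (1 << 7):
        raise ValueError("A is a 7-bit field")
    if not 0 <= blo < (1 << 8) or not 0 <= bhi < (1 << 8):
        raise ValueError("BLO and BHI are 8-bit fields")
    f = 1 if frame_index % FRAMES_PER_BLOCK == 0 else 0
    return f | (a << 1) | (blo << 8) | (bhi << 16)


def unpack_gesture_word(word):
    """Split a 24-bit word into the tuple (f, a, blo, bhi)."""
    f = word & 0x1
    a = (word >> 1) & 0x7F
    blo = (word >> 8) & 0xFF
    bhi = (word >> 16) & 0xFF
    return f, a, blo, bhi


def combine_high_resolution(blo, bhi):
    """Join BLO and BHI into one 16-bit value.

    Assumption: BHI is the high byte and BLO the low byte.
    """
    return (bhi << 8) | blo
```

Under these assumptions, `pack_gesture_word(0, 0x55, 0x34, 0x12)` gives `0x1234AB`, and a receiver can find the start of each 64-frame block by looking for words whose lowest bit is set.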

The "F" bit is set every 64 sample frames. "A" is a 7-bit field for low-resolution devices. "BLO" and "BHI" are 8 bits each and can be combined into a single 16-bit high-resolution field. The original left channel is reserved for audio values to support the common situation in which the performer's gestures are combined with an instrument or vocal source.

4. AES/EBU Gesture Acquisition System

[Figure: block diagram of the gesture acquisition system. Recoverable labels: Sensor Array A, Sensor Array B, Converter, Clock, L/R Clock, AES/EBU Transmitter, Coax Out.]

Our prototype gesture acquisition system is built around an audio A/D converter evaluation board. Serial data from the converter to the AES/EBU transmitter is intercepted by a multiplexer. Left channel bits from the converter are passed through. Right channel bits are derived from a latch and shift register that stores and serializes results from the gesture A/D converters. A 6-bit counter (2^6 = 64 states, matching the 64-frame period of the "F" bit) provides the frame count bit and the multiplex control for the gesture A/D converters.

5. Example Applications

5.1. Expressive Keyboard
Most electronic keyboards sense key position at only the top and bottom of each key's travel. This is sufficient to establish key up and down velocity to be encoded and transmitted with MIDI. Unfortunately, this cheap sensing strategy does not adequately capture the nuance available on acoustic instruments. Musicians use fine control of key position to control the timbre of sounds from mechanical tracker organs, harpsichords and pianos. We are experimenting with different sensor technologies to measure continuous key position, including ones based on an interrupted light beam, a reflected light beam, and a bending resistive strip. Data from the pedal and stops, the lower manual, and the upper manual are mapped to A, BLO, and BHI respectively, yielding continuous position estimates approximately every millisecond.

5.2. Reactive Glove
In an initial experiment carried out in collaboration with Butch Rovan (Rovan, Wanderley et al.
1997), we mounted a force-sensing resistor (FSR) on the tip of a finger. With a simple conditioning circuit we obtained a 0 to 5 volt signal that we sampled at an audio rate with a converter that did not eliminate, as most audio conversion systems do, the DC and very low frequency components. We then used this audio signal interpretation of the gesture data to control various synthesis parameters like the envelope and modulation index of an FM patch. These initial experiments were carried out on an ISPW card running FTS (Lindemann, Dechelle et al. 1991) from IRCAM, which has a very low and stable latency (< 4 ms). Striking gestures like those of a hand drummer were effective in producing expressive synthetic sounds.

The results showed great potential for musical expressivity and led to the construction of a series of lightweight, flexible, custom-fitted gloves with FSRs mounted on the tips of the fingers and thumbs. This FSR glove technology was combined with an additional three dimensions of spatial location technology to accurately locate the index finger of each hand. Three dimensions of index fingertip location and five FSRs per hand produced 16 analog signals, each sampled at 3 kHz and multiplexed into the single 48 kHz channel (16 × 3 kHz = 48 kHz). This combination of FSR and spatial location sensing provides a flexible poly-point continuous controller capable of considerable musical expressiveness but requiring, as with most instruments, a lot of practice.

6. Future Work
We are exploring ways to add bidirectionality, higher bandwidth and power delivery to the recognized strengths of the AES/EBU digital audio standard.

[Table: rows Bidirectional, Power, and Audio+Gesture, with one surviving label, IEEE1284. The remaining column headings and the positions of the check marks are missing from this copy.]

7. Acknowledgement
We gratefully acknowledge the support of Gibson Guitar and the Edmund O'Neill foundation.

8. References

AES (1985). "AES recommended practice for digital audio engineering-serial transmission format for linearly represented digital audio data." Journal of the Audio Engineering Society 33(12): 975-84.

Hand, C. (1997). "A survey of 3D interaction techniques." Computer Graphics Forum 16(5): 269-81.

Kim, J. H. and A. A. Chien (1996). Rotating Combined Queueing (RCQ): bandwidth and latency guarantees in low-cost, high-performance networks. ISCA '96: The 23rd Annual International Conference on Computer Architecture, Philadelphia, PA, USA.

Lee, M., A. Freed, and D. Wessel (1991). Real-Time Neural Network Processing of Gestural and Acoustic Signals. Proceedings of the 17th International Computer Music Conference, Montreal, Computer Music Association.

Lee, M., A. Freed, et al. (1992). Neural networks for simultaneous classification and parameter estimation in musical instrument control. Adaptive and Learning Systems, Orlando, FL, USA.

Lee, M. A. and D. Wessel (1992). Connectionist models for real-time control of synthesis and compositional algorithms. Proceedings of the International Computer Music Conference, Computer Music Association.

Lindemann, E., F. Dechelle, et al. (1991). "The architecture of the IRCAM musical workstation." Computer Music Journal 15(3): 41-9.

Roads, C. (1996). Musical Input Devices. The Computer Music Tutorial. Cambridge, MIT Press: 619-658.

Rodrigues, S. (1997).
High-Performance Local-Area Communication With Fast Sockets. Usenix 1997.

Rovan, J. B., M. M. Wanderley, et al. (1997). Instrumental Gestural Mapping Strategies as Expressivity Determinants in Computer Music Performance. KANSEI - The Technology of Emotion.

Wessel, D. (1991). Improvisation with highly interactive real-time performance systems. Proceedings of the International Computer Music Conference, Montreal, Computer Music Association.

Wright, M. (1998). Implementation and Performance Issues with OpenSound Control. International Computer Music Conference, Ann Arbor, Michigan, CMA.

Wright, M. and A. Freed (1997). Open Sound Control: A New Protocol for Communicating with Sound Synthesizers. International Computer Music Conference, Thessaloniki, Greece, ICMA.