Compositional Imperatives for Implementing an Audio Alignment Program in MAX/MSP

Ozgur Izmirli, Rob Seward, Noel Zahler
Center for Arts and Technology, Connecticut College

Abstract

Believing, as we do, that music technology should be driven by compositional need, we have created a program in the MAX/MSP environment that deals with compositional practices that have only recently begun to be explored: tracking live audio signals, repetitive pitches, trills and fast sequences. We have implemented such a program for the purpose of spatializing the solo clarinet part in Noel Zahler's Concerto for clarinet, chamber orchestra and interactive computer.

1. Introduction

Work in performance alignment and score following dates back to the groundbreaking work of Dannenberg (Dannenberg 1988, 1991, 1998) and Vercoe (Vercoe 1984, 1985). Following on their heels came significant additions by Puckette et al. (1992, 1998) and Baird et al. (1989a, 1989b, 1991, 1993), and most recently by Orio (2001a, 2001b). We try to provide here what we believe are some significant new ways of dealing with the problem of human/machine performance, based on a compositional model.

Zahler's composition posed several problems in alignment, or score following (Orio 2001a), not adequately dealt with in previous research. The goal was to automate the triggering of spatialization for a solo clarinet part, and to carry out the DSP for that spatialization, in a single environment that is portable and easily implemented in a variety of performance venues. The system had to deal with numerous successive repetitions of a single pitch, even when that pitch was at times "fuzzy" because of the use of "color fingerings." Other problems included the need for spatialization to be carried out during rapid passages and trills, sometimes extracting a single event from these complex environments (see the excerpts in Figures 1.a and 1.b). Automation was necessary because it was clear that a human operator would have great difficulty carrying out such a task, if in some instances it would be possible at all.

Figure 1.a. Excerpt 1. The rectangular symbol above the staff indicates where the sound will emanate in the concert hall.

Figure 1.b. Excerpt 2.

2. The Environment: MAX/MSP

For all of the reasons stated above, MAX/MSP presented itself as the best programming environment for our purpose. Using the Max Software Development Kit we wrote our new objects in C (a skeletal example is sketched below) and drew on other objects provided by members of the Max community. The problem was divided into tasks that, in turn, provided an architecture for the entire program. Roughly, those tasks fell into the categories of signal recognition and processing, score following, and spatialization (Figure 2).

Figure 2. Overview of the program: signal recognition feeds the score follower, which drives spatialization.
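The paper does not reproduce the source of these objects; purely as an illustrative sketch, a minimal Max external in C has the following shape. It uses the entry points of the current Max SDK (ext_main, class_new, object_alloc), which postdate the SDK in use at the time of this work, and the object name "follower" and its fields are placeholders of ours, not the objects used in the piece.

    /* Minimal Max external sketch (placeholder names; current Max SDK
       conventions, not the 2002-era API used for the original objects). */
    #include "ext.h"        /* core Max API */
    #include "ext_obex.h"   /* object registration */

    typedef struct _follower {
        t_object ob;        /* mandatory Max object header */
        long     nevents;   /* example state: events received so far */
    } t_follower;

    static t_class *follower_class;

    static void *follower_new(void)
    {
        t_follower *x = (t_follower *)object_alloc(follower_class);
        x->nevents = 0;
        return x;
    }

    static void follower_int(t_follower *x, long midinote)
    {
        /* a real object would pass the note on to the matcher here */
        x->nevents++;
        post("follower: note %ld (event %ld)", midinote, x->nevents);
    }

    void ext_main(void *r)
    {
        t_class *c = class_new("follower", (method)follower_new,
                               (method)NULL, sizeof(t_follower), 0L, 0);
        class_addmethod(c, (method)follower_int, "int", A_LONG, 0);
        class_register(CLASS_BOX, c);
        follower_class = c;
    }

An object of this shape would receive MIDI-note integers from the analysis front-end and hand them to the matching stage.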

Signal recognition is carried out with the assistance of Miller Puckette's fiddle object (Puckette and Apel 1998). We found that, with a small degree of effort, fiddle can be tuned to provide fairly reliable signal recognition for our purposes. We had to modify the translation of MIDI data to achieve the best format for our purpose, but this was a relatively simple adjustment. Score following was quite another matter.

3. Score Following

The score follower takes its input from a front-end that converts a monophonic acoustical input to fundamental frequency and amplitude information. We are currently using the MSP fiddle object for this module. The performance of the soloist is captured by a microphone close to the instrument to minimize crosstalk. The output of the front-end is converted to a sequence of musical events, each represented by MIDI note, velocity and time. This representation supports a monophonic melody with possible silences but without overlapping notes.

The history of musical events obtained from the performance is kept in a context-dependent window called the performance window. The size of the window is limited both by a maximum number of musical events, Nmax, and by a maximum duration, Dmax. This gives the system an adaptive locality that depends on musical context. For example, at a certain time the window may hold fewer than Nmax musical events because its duration is already limited at Dmax; this corresponds to a context with many relatively long notes, and limiting the window to the maximum allowable duration keeps the score follower agile. In another scenario the window is instead limited by the number of notes rather than by duration, which corresponds to a context with many short notes. (A sketch of this windowing logic follows.)
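As a concrete illustration of this rule (our own sketch, not the program's code; the event layout and the NMAX and DMAX values are placeholders), the window can be maintained by trimming from the oldest end whenever either limit is exceeded:

    /* Adaptive performance window: sketch in C.
       Events are MIDI notes with velocity and onset time in seconds. */
    #include <string.h>

    #define NMAX 16         /* placeholder event-count limit Nmax    */
    #define DMAX 4.0        /* placeholder duration limit Dmax (sec) */

    typedef struct {
        int    note;        /* MIDI note number */
        int    velocity;    /* MIDI velocity    */
        double onset;       /* onset time (sec) */
    } event_t;

    typedef struct {
        event_t events[NMAX];
        int     count;
    } perf_window_t;

    static void drop_oldest(perf_window_t *w)
    {
        memmove(w->events, w->events + 1, (w->count - 1) * sizeof(event_t));
        w->count--;
    }

    /* Append a new event, then enforce both limits: never more than
       NMAX events, never spanning more than DMAX seconds. */
    static void window_add(perf_window_t *w, event_t e)
    {
        if (w->count == NMAX)
            drop_oldest(w);
        w->events[w->count++] = e;
        while (w->count > 1 && e.onset - w->events[0].onset > DMAX)
            drop_oldest(w);
    }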
The score follower relies on two mechanisms. The first is pattern matching: in light of incoming information from the performance, a match is sought with a scaled version of the score. This process aims at finding the locally optimal alignment of the recent musical events to the score. The second is an inertial mechanism, used to pace the score follower when the incoming performance data are unreliable or insufficient, as when wrong notes have been played, notes have been missed, or there are very long sustained notes.

The score follower continually searches for the best match between the sequence of events from the performance and the sequence of events in the score. The search is performed periodically (typically every 20 msec) without waiting for new musical events. As a result of this search, a new candidate position in the score is determined. If the reliability of the match is above a certain threshold, the follower corrects itself by skipping to that position. This capability prevents the follower from entering a deadlock in the following situations: waiting for a note that has been missed by the performer, waiting for a note misdetected by the front-end, and pointing to an incorrect position in the score because of discrepancies in estimated tempi during sustained notes.

Figure 3. Block diagram of the score follower.

The pattern matcher employs a two-dimensional search to find the best possible match between the performance data and the score data. A scaling factor is determined as the result of a local search that translates and scales the performance window in time. This search is performed in the vicinity of the currently estimated position in the score (the score pointer). The result of the search is the estimated position in the score, which constitutes the output of the score follower. The scaling factor is also used for the tempo prediction described below.

To determine the scaling factor, the pattern matcher calculates a match for each time-scaling and translation pair. The calculation for each pair is carried out by first finding the best association of musical events between the two windows (score and performance); the match is then calculated as the normalized sum of differences in the onset times of the associated notes. (A sketch of this computation follows.) The pattern matcher acts as a mapper between score time, which is given in score-time units, and the real-time input. One of the inherent problems in score following is the alignment of the real-time events to the score events, which have a different and more generic type of representation, while taking into account possible errors and aberrations in performance.
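To make the match computation concrete, the following sketch (ours; the nearest-neighbor pairing is a simple stand-in for the best-association step described above) scores a single translation and scaling pair. The pattern matcher would evaluate this cost over a grid of such pairs and keep the minimum. It reuses the event_t layout from the previous sketch.

    /* Match cost for one (offset, scale) pair: lower is better. */
    #include <math.h>
    #include <float.h>

    static double match_cost(const event_t *perf, int np,
                             const event_t *score, int ns,
                             double offset, double scale)
    {
        double total = 0.0;
        int    matched = 0;

        for (int i = 0; i < np; i++) {
            /* map the performance onset into score time */
            double t    = offset + scale * perf[i].onset;
            double best = DBL_MAX;

            /* nearest score event carrying the same MIDI note */
            for (int j = 0; j < ns; j++)
                if (score[j].note == perf[i].note &&
                    fabs(score[j].onset - t) < best)
                    best = fabs(score[j].onset - t);

            if (best < DBL_MAX) {
                total += best;
                matched++;
            }
        }
        /* normalized sum of onset-time differences */
        return matched ? total / matched : DBL_MAX;
    }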

The reason for implementing a proprietary score-time unit instead of the more conventional beat unit is to incorporate relative accelerandi and ritardandi into the score representation. The score in the score follower contains a relative representation of an ideal performance, reflecting the information in the original score. We define a score-time tempo, which is the tempo in terms of score-time units; it specifies the pace at which the score pointer moves relative to the score. The score-time tracker uses an exponentially weighted adaptation rule to keep a running estimate of the score-time tempo.

A reliability figure is computed, based on the degree of match for each individual note, for the time-scaling and translation pair that yields the best match. Reliability is given by the proximity of the onsets and the duration compatibility of corresponding musical events in the scaled performance window and the score. A weighted average of onset-time and duration matches is used for notes that have the same MIDI note value. If the overall reliability is above a certain threshold, the current score-time tempo is calculated from the reliable musical events in the performance window, and the running score-time tempo is updated with an exponentially weighted moving-average rule. Otherwise, the score-time tempo is left unaltered. The sequence of computations is given in Figure 4; a concrete rendering of its update step follows the figure.

    initialize SCT to reflect the estimated starting tempo in the performance;
    while (events left in the score) {
        perform local search by translating and scaling the performance
            window contents to find SF and the position of SPcurr;
        calculate R using SF and SPcurr;
        if (R > T) {
            calculate SCTcurr from reliable musical events;
            update SCT:  SCT = x * SCT + (1 - x) * SCTcurr;
            if (SPcurr >= SP)
                SP = SPcurr;
        }
        else
            pace SP according to SCT;
    }

Figure 4. The sequence of computations for the score follower. The abbreviations are as follows: SCT: score-time tempo; SP: score pointer; SF: scaling factor; R: reliability; T: reliability threshold; x: weight reflecting dependence on history. The curr subscript denotes the value computed in that specific iteration.
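One concrete reading of Figure 4, written as the per-iteration update run every 20 msec, is sketched below. This is our interpretation, not the program's code: variable names follow the figure's abbreviations, and the pacing rule SP += SCT * dt is our rendering of "pace SP according to SCT".

    /* Per-iteration update of the score follower (sketch). */
    typedef struct {
        double sp;     /* SP: score pointer, in score-time units    */
        double sct;    /* SCT: score-time tempo (units per second)  */
        double x;      /* history weight of the EWMA, 0 <= x <= 1   */
        double thresh; /* T: reliability threshold                  */
    } follower_t;

    static void follower_tick(follower_t *f, double dt,
                              double sp_curr, double sct_curr,
                              double reliability)
    {
        if (reliability > f->thresh) {
            /* exponentially weighted moving-average tempo update */
            f->sct = f->x * f->sct + (1.0 - f->x) * sct_curr;
            if (sp_curr >= f->sp)
                f->sp = sp_curr;          /* skip to the new match */
            else
                f->sp += f->sct * dt;     /* never move backwards  */
        } else {
            /* unreliable input: inertial pacing at the last tempo */
            f->sp += f->sct * dt;
        }
    }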
4. Spatialization

The spatialization demands made by the Concerto call for "on the fly" localization of sound to one or more of six individually addressed electroacoustic speakers. The configuration of the performance space is shown in Figure 5. The spatialization information is synchronized with the score by coding it as a separate MIDI message; at present some 63 different locations are specified. This information is used directly by the spatialization module. The timing information is provided by the output of the score follower, which is also fed into the spatialization module. Individual events or groups of events may be extracted from fast-moving passages and spatialized at different locations. The spatialization module performs interpolation for gradual changes in spatialization and provides the amplification levels for all six output channels; a sketch of such an interpolation follows the figure.

Figure 5. Configuration of the performance space.
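As a sketch of the interpolation stage (again ours, with placeholder names; the module's actual gain law is not specified in the paper), a gradual move between two of the coded locations can be realized by cross-fading six-channel gain vectors and applying them to the input signal:

    /* Six-channel gain interpolation and application (sketch). */
    #define NCHAN 6

    /* t runs from 0 (old location) to 1 (new location) */
    static void interp_gains(const double from[NCHAN],
                             const double to[NCHAN],
                             double t, double out[NCHAN])
    {
        for (int c = 0; c < NCHAN; c++)
            out[c] = (1.0 - t) * from[c] + t * to[c];
    }

    /* scale the mono clarinet signal into the six speaker feeds */
    static void spatialize(const float *in, float **outs, int nframes,
                           const double gains[NCHAN])
    {
        for (int c = 0; c < NCHAN; c++)
            for (int i = 0; i < nframes; i++)
                outs[c][i] = (float)(gains[c] * in[i]);
    }

A constant-power law (for example, weighting the two gain vectors by the cosine and sine of t * pi/2) is a common refinement over the linear cross-fade shown here.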

5. Evaluation of Results

The system has been evaluated in two different configurations. The first employs audio excerpts taken from previously recorded rehearsals. The use of recorded examples helped to economize on the development time of the system, and the excerpts also proved useful in assessing system performance. In this case the microphone was replaced with the computer's internal audio playback, leaving all remaining components intact. To ascertain the level of coupling between the performer and the score follower, a MIDI stream is generated and played back using a perceptually distant timbre; by simultaneously listening to the input and the output of the score follower, errors in following can be identified effortlessly. It should also be noted that spatialization is more forgiving of timing errors than, for example, an accompaniment system is of onset errors. As far as the listening tests are concerned, the system is robust and performed well.

The second method of evaluation has been to generate the audio input with a synthesizer. The MIDI sequence was formed by manually transcribing the timings of a real performance, and random timing and duration errors were then introduced to determine how robust the system really is (a sketch of this error injection follows). We acknowledge, however, that this method cannot serve as the only testbed, because errors generated in this way are not necessarily human-like. Nevertheless, the method proved useful in identifying some unforeseen limits of the system.
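The kind of perturbation used in the second method might look as follows (our illustration; the distributions and ranges actually used are not reported in the paper):

    /* Inject random timing and duration errors into a transcribed
       MIDI sequence before synthesis (sketch). */
    #include <stdlib.h>

    typedef struct {
        int    note;      /* MIDI note number */
        double onset;     /* sec */
        double duration;  /* sec */
    } test_event_t;

    /* uniform random value in [-range, +range] */
    static double jitter(double range)
    {
        return range * (2.0 * rand() / (double)RAND_MAX - 1.0);
    }

    static void perturb(test_event_t *seq, int n,
                        double onset_range, double dur_range)
    {
        for (int i = 0; i < n; i++) {
            seq[i].onset    += jitter(onset_range);
            seq[i].duration += jitter(dur_range);
            if (seq[i].duration < 0.0)
                seq[i].duration = 0.0;
        }
    }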
"An Automatic Accompanist Based on Hidden Markov Models." AI*IA 2001, LNAI 2175, pp. 64-69. Orio, Nicola and D. Schwarz. 2001. "Alignment of Monophonic and Polyphonic Music to a Score." Proceedings of the 2001 ICMC, San Francisco: International Computer Music Association. Puckette, M., and C. Lippe. 1992." Score Following in Practice." Proceedings of the 1992 ICMC. San Francisco: International Computer Music Association, pp. 182-185. Puckette, M. and Apel, T. 1998. "Real-time audio analysis tools for Pd and MSP". Proceedings, ICMC,. San Francisco: International Computer Music Association, pp. 109-112. Vantomme, Jason D. 1995. "Score Following by Temporal Pattern." Computer Music Journal 19:3: pp. 50-59. Vercoe, B. 1984. "The Synthetic Performer in the Context of Live Performance." Proceedings of the 1984 ICMC. San Francisco: International Computer Music Association, pp. 199-200. Vercoe, B. and M. Puckette. "Synthetic Rehearsal: Training the Synthetic Performer." Proceedings of the 1985 ICMC, San Francisco: International Computer Music Association, pp. 275 -278 Zahler, N. 1991. "Questions About Interactive Computer Performance as a Resource for Music Composition and Performance of Musicality." Proceedings of the 1991 ICMC. San Francisco: International Computer Music Association, pp. 336-39. 269