Page  00000192 A REAL-TIME GENETIC ALGORITHM IN HUMANROBOT MUSICAL IMPROVISATION Gil Weinberg Georgia Tech Music Department Mark Godfrey Georgia Tech Music Department Alex Rae Georgia Tech Music Department John Rhoads Georgia Tech Music Department ABSTRACT The paper describes an interactive musical system that utilizes a genetic algorithm in an effort to create inspiring collaborations between human musicians and an improvisatory robotic xylophone player. The robot is designed to respond to human input in an acoustic and visual manner, evolving a human-generated phrase population based on a similarity driven fitness function in real time. The robot listens to MIDI and audio input from human players and generates melodic responses that are informed by the analyzed input as well as by internalized knowledge of contextually relevant material. The paper describes the motivation for the project, the hardware and software design, two performances that were conducted with the system, and a number of directions for future work. 1. INTRODUCTION AND RELATED WORK Real-time collaboration between human and robotic musicians can capitalize on the combination of their unique strengths to produce new and compelling music. In order to create intuitive and inspiring human-robot collaborations, we developed a robot that can analyze music based on computational models of human percepts and use genetic algorithms to create musical responses that are not likely to be generated by humans. The twoarmed xylophone playing robot is designed to "listen like a human and improvise like a machine", bringing together aspects of machine musicianship with the capacity to produce acoustic musical responses in a physical and visual manner. Current research directions in musical robotics focus on sound production and rarely address perceptual aspects of musicianship, such as listening, analysis, improvisation, or group interaction. Such automated musical devices include both Robotic Musical Instruments - mechanical constructions that can be played by live musicians or triggered by pre-recorded sequences - and Anthropomorphic Musical Robots - humanoid robots that attempt to imitate the action of human musicians (see a historical review of the field in (Kapur 2005). Only a few attempts have been made to develop perceptual robots that are controlled by neural networks or other autonomous methods (Baginsky 2004). The goal of our project is to embed machine musicianship and interactive algorithmic composition techniques in such mechanical devices. Some successful examples for such interactive musical systems are Cypher (Rowe 1992), Voyager (Lewis 2000), and the Continuator (Pachet 2002). These systems analyze musical input and generate algorithmic responses by 192 generating and controlling a variety of parameters such as melody, harmony, rhythm, timbre, and orchestration. These interactive systems, however, remain in the software domain and are not designed to generate acoustic sound. As part of our effort to develop a musically discerning robot, we have explored models of melodic similarity using dynamic time warping. Notable related work in this field is the work by Smith at al. (1998), which utilized a dynamic-programming approach to retrieve similar tunes from a folk song database. Utilizing a similar method, our robot is designed to respond with inspiring and surprising musical outcomes using a novel approach for improvisatory genetic algorithms. Related work in this area includes GenJam (Biles 1994), an interactive computer system that improvises over a set of jazz tunes using genetic algorithms. GenJam's initial phrase population is generated stochastically, with some musical constraints. Its fitness function is based on human aesthetics, where in every generation the user determines which phrases remain in the population. Other musical systems that utilize human-based fitness functions have been developed by Maroni (Maroni et al 2000), who uses a real-time fitness criteria interface, and Tokui (Tokui et. al 2000) who uses human feedback to train a neural network-based fitness function. The Talking Drum project (Brown 1999), on the other hand, uses a computational fitness function based on the difference between a given member of the population and a target pattern. In an effort to create more musically relevant responses, our system is based on a humangenerated phrase population and a similarity-based fitness function, as described in detail below. 2. THE ROBOTIC PERCUSSIONIST In previous work, we have developed an interactive robotic percussionist named Haile (Weinberg, Driscoll 2006). The robot was designed to respond to human drummers by recognizing low-level musical features such as note onset, pitch, and amplitude as well as higher-level percepts such as rhythmic stability and similarity. Mechanically, Haile controls two robotic arms; the right arm is designed to play fast notes, while the left arm is designed to produce larger and more visible motions, which can create louder sounds in comparison to the right arm. Unlike robotic drumming systems that allow hits at only a few discrete locations, Haile's arms can move continuously across the striking surface, which allows for pitch generation using a mallet instrument instead of a drum. For this project, Haile was adapted to play a one-octave xylophone. The different

Page  00000193 mechanisms driving each arm led to a unique timbral outcome, as notes played by the solenoid-based arm sound different than those played by the linear-motor driven arm. Since the range of the arms covers only one octave, Haile's responses are filtered by pitch class. Figure 1. Haile's two robotic arms cover a range of one octave (middle G to treble G.) The left arm plays 5 notes while the right arm plays 7 notes. 3. THE GENETIC ALGORITHM Our goal in designing the interactive genetic algorithm was to allow the robot to respond to human input in a manner that is both relevant and novel. The algorithmic response is based on the observed input as well as on internalized knowledge of contextually relevant material. The algorithm fragments MIDI and audio input into short phrases. It then attempts to find a "fit" response by evolving a pre-stored, human-generated population of phrases using a variety of mutation and crossover functions over a variable number of generations. At each generation, the evolved phrases are evaluated by a fitness function that measures similarity to the input phrase, and the least fit phrases in the database are replaced by members of the next generation. A unique aspect in this design is the reliance on a pre-recorded human-generated phrase set that evolves over a limited number of generations. This allows musical elements from the original phrases to mix with elements of the real-time input to create unique, hybrid, and at times unpredictable, responses for each given input melody. By running the algorithm in real-time, the responses are generated in a musically appropriate timeframe. 3.1. The Base population Approximately forty melodic excerpts of variable lengths and styles were used as an initial population for the genetic algorithm (GA). They were recorded by a jazz pianist improvising in a similar musical context to that in which the robot was intended to perform. Having a distinctly "human" flavor, these phrases provided the GA with a rich initial pool of rhythmic and melodic "genes" from which to build its own melodies. This is notably different from most standard approaches, in which the starting population is generated stochastically. 3.2. The fitness function A similarity measure between the observed input and the melodic content of each generation of the GA was used as a fitness function. The goal was not to converge to an "ideal" response by maximizing the fitness metric (which could have led to an exact imitation of the input melody), but rather to use it as a guide for the algorithmically generated melodies. By varying the number of generations and the type and frequency of mutations, certain characteristics of both the observed melody and some subset of the base population could be preserved in the output. Dynamic Time Warping (DTW) was used to calculate the similarity measure between the observed and generated melodies. A well-known technique originally used in speech recognition applications, DTW provides a method for analyzing the similarity, either through time shifting or stretching, of two given segments whose internal timing may vary. While its use in pattern recognition and classification has largely been supplanted by newer techniques such as Hidden Markov Models, DTW was particularly well suited to the needs of this project, specifically the task of comparing two given melodies of potentially unequal lengths without referencing an underlying model. We used a method similar to the one proposed by Smith (1998), deviating from the time-frame-based model to represent melodies as a sequence of feature vectors, each corresponding to a note. Our dissimilarity measure, much like Smith's "edit distance", assigns a cost to deletion and insertion of notes, as well as to the local distance between the features of corresponding pairs. The smallest distance over all possible temporal alignments is then chosen, and the inverse (the "similarity" of the melodies) is used as the fitness value. The local distances are computed using a weighted sum of four differences: absolute pitch, pitch class, log-duration, and melodic attraction. The individual weights were configurable, each with a distinctive effect upon the musical quality of the output. For example, higher weights on the log-duration difference led to more precise rhythmic matching, while the pitch-based differences led to outputs that more closely mirrored the melodic contour of the input. Melodic attraction between pitches was calculated based on the Generative Theory of Tonal Music model (Lerdahl, Jackendoff, 1996). The relative balance between the local distances and the temporal deviation cost had a pronounced effect upon the output - a lower cost for note insertion/deletion led to a highly variant output. Through substantial experimentation, we arrived at a handful of effective configurations. The computational demands of a real-time context required significant optimization of the DTW, despite the relatively small length of the melodies (typically between two and thirty notes). For the search through possible time alignments, we implemented a standard path constraint in which consecutive insertions or deletions are not allowed. This cut computation time by approximately one half but prohibited comparison of melodies whose lengths differ by more than a factor of two. These situations were treated as special cases and were assigned an appropriately low fitness value. Additionally, since the computation time was proportional to the length of the melody squared, a decision was made to break longer input melodies into 193

Page  00000194 smaller segments to increase the efficiency and remove the possibility of an audible time lag. 3.3. Mutation and Crossover With each generation, a configurable percentage of the phrase population was chosen for mating. This "parent" selection was made stochastically according to a probability distribution calculated from each phrase's fitness value, so that more fit phrases were more likely to breed. The mating functions ranged from simple mathematical operations to more sophisticated musical functions. For instance, a single crossover function was implemented by randomly defining a common dividing point on two parent phrases and concatenating the first section from one parent with the second section from the other to create the child phrase. This mating function, while common in genetic algorithms, does not use structural information of the data and often leads to nonmusical intermediate populations of phrases. We also implemented musical mating functions that were designed to lead to musically relevant outcomes without requiring that the population converge to a maximized fitness value. An example of such a function is the pitch-rhythm crossover, in which the pitches of one parent are imposed on the rhythm of the other parent. Because the parent phrases are often of different lengths, the new melody follows the pitch contour of the first parent, and its pitches are linearly interpolated to fit the rhythm of the second parent. mutation functions and two crossover functions available for use with the algorithm, any combination of which was allowed through a configurable interface. 4. INTERACTION DESIGN In order for Haile to improvise in a live setting, we developed a number of human-machine interaction schemes driven by capturing, analyzing, transforming, and generating musical material in real-time. Much like a human musician, Haile decides when to play, for how long, when to stop, and what notes to play in any given musical context. Our goal was to expand on the simple call-and-response format, creating autonomous behaviour in which the robot interacts with humans by responding, interrupting, ignoring, or introducing new material that aims to be surprising and inspiring. 4.1. Input The system receives and analyzes both MIDI and audio information. Input from a digital piano is collected using MIDI while the MSP object pitch( is used for pitch detection of melodic audio from acoustic instruments. 4.2. Simple Interactions In an effort to establish Haile's listening abilities in live performance settings, simple interaction schemes were developed that do not use the genetic algorithm. One such scheme is direct repetition of human input, in which Haile duplicates any note that is received from MIDI input. In another interaction scheme, the robot records and plays back complete phrases of musical material. A simple chord sequence causes Haile to start listening to the human performer, and a repetition of that chord causes it to play back the recorded melody. Rather than repeating the melody exactly as played, Haile utilizes a mechanism that stochastically adds notes to the melody, similarly to the density mutation function described above. 4.3. Genetic algorithm driven improvisation The main interaction scheme used with the genetic algorithm was an adaptive call-and-response mechanism. The mean and variance of the inter-onset times in the input is used to calculate an appropriate delay time; then if no input is detected over this period, Haile will generate and play a response phrase. In other interaction schemes, developed in an effort to enrich the simple calland-response interaction, Haile is programmed to introduce musical material from a database of previous genetically modified phrases, interrupt human musicians with responses while they are playing, ignore human input, and imitate melodies to create canons. In the initial phase of the project, a human operator was responsible for some real-time playback decisions such as determining the interaction scheme used by Haile. In addition, the human operator of the system triggered events choosing among the available playback modes, decided between MIDI and audio input at any given time, and selected the different types of mutation functions for the genetic algorithm. In order to facilitate Parent A Parent B i ---- --------------........ Child 1 Child 2 Figure 2. Mating of two prototypical phrases using the pitch-rhythm crossover function. Child 1 has the pitch contour of parent A and rhythm pattern of parent B while Child 2 has the rhythm of parent A and the pitch contour of parent B. Additionally, an adjustable percentage of each generation is mutated according to a set of functions that range in musical complexity. For instance, the simple random mutation function adds or subtracts random numbers of semitones to the pitches within a phrase and random lengths of time to the durations of the notes. While this mutation seems to add a necessary amount of randomness that allows a population to converge toward the reference melody over many generations, it degrades the musicality of the intermediate populations. Other functions were implemented that would stochastically mutate a melodic phrase in a musical fashion, so that the outcome is recognizably derivative of the original. The density mutation function, for example, alters the density of a phrase by adding or removing notes, so that the resulting phrase follows the original pitch contour with a different number of notes. Other simple musical mutations include inversion, retrograde, and transposition operations. In the end, we had seven 194

Page  00000195 a more autonomous interaction, we decided to develop an algorithm that would choose between these higherlevel playback decisions based on the evolving context of the music, thus allowing Haile to react to musicians in a performance setting without the need for human control. Haile's autonomous module involves switching between four different playback modes: call-andresponse (described above), independent playback, canon mode, and solo mode. During independent playback mode, Haile introduces a previously generated melody from the genetic algorithm after waiting a certain period of time. Canon mode employs a similar delay, but instead Haile repeats the input from a human musician. If no input is detected for a certain length of time, Haile enters solo mode, where it continues to play genetically generated melodies until a human player interrupts the robotic solo. Independently of its playback mode, Haile decides between inputs (MIDI or audio) and changes the various parameters of the genetic algorithm (mutation and crossover types, number of generations, amount of mutation, etc.) over time. In the end, the human performers do not know which of them is driving Haile's improvisation or exactly how Haile will respond. We feel this represents a workable model of the structure of interactions that can be seen in human-to-human musical improvisation. 5. PERFORMANCES Two compositions were written for the system and performed in two separate concerts. In the first piece, titled "Svobod," a piano and a saxophone player freely improvised with a semi-autonomous robot (see video excerpts - The second piece, titled "iltur for Haile," involved a more defined and tonal musical structure utilizing genetically driven as well as non-genetically driven interaction schemes, as the robot performed autonomously with a jazz quartet (see video excerpts elements in the implementation of the project include using a human-generated phrase population, running the genetic algorithm in real-time, and utilizing a limited number of evolutionary generations in an effort to create hybrid musical results, all realized by a musical robot that responds in an acoustic and visual manner. Informed by these performances, we are currently exploring a number of future development directions such as extending the musical register and acoustic richness of the robot, experimenting with different genetic algorithm designs to improve the quality of musical responses, and conducting user studies to evaluate humans' response to the algorithmic output and the interaction schemes. 6. REFERENCES [1] Baginsky, N.A. "The Three Sirens: A Self-Learning Robotic Rock Band", Accessed May 2007. [2] Biles, J. A. "GenJam: a genetic algorithm for generation of jazz solos", Proceedings of the International Computer Music Conference, Aarhus, Denmark, 1994. [3] Brown, Chris, "Talking Drum: A Local Area Network Music Installation." Leonardo Music Journal, vol. 9, pp. 23-28. San Francisco, USA, 1999. [4] Kapur, A. "A History of Robotic Musical Instruments", Proceedings of International Computer Music Conference, Barcelona, Spain,21 -28, 2005. [5] Lerdahl, F. and Jackendoff, R. A Generative Theory of Tonal Music. MIT Press, Cambridge, MA, 1983 [6] Lewis, G. "Too Many Notes: Computers, Complexity and Culture in Voyager", Leonardo Music Journal, 2000, Vol. 10, 33-39, [7] Moroni A., J Manzolli, F Zuben, R Gudwin, "An Interactive Evolutionary System for Algorithmic Music Composition." Leonardo Music Journal, 2000. 10, p.49-55, [8] Pachet, F. "The Continuator: Musical Interaction With Style" Journal of New Music Research, 2003. 32(3): p. 333-341. [9] Rowe, R. 2004. Machine Musicianship. Cambridge, MA MIT Press. [10] Smith, L., R. McNab, and I. Witten. "Sequencebased melodic comparison: A dynamicprogramming approach". In Melodic Comparison: Concepts, Procedures, and Applications, Computing in Musicology 11, 101-128, 1998. [11] Tokui, N. and Iba, H., "Music Composition with Interactive Evolutionary Computation." Proceedings of the 3rd International Conference on Generative Art, Milan, Italy, 2000. [12] Weinberg, G. Driscoll D. "Toward Robotic Musicianship", Computer Music Journal 30:4,28 -45, 2007 Figure 3. Human players interact with Haile as it improvises based on input from saxophone and piano in "iltur for Haile". 1. SUMMARY AND FUTURE WORK We developed an interactive musical system that utilizes a genetic algorithm in an effort to create unique musical collaborations between humans and machines. Novel 195