Intelligent rhythm tracking

David Rosenthal
MIT Media Lab, Music and Cognition Group, Cambridge, Massachusetts, 02139

ABSTRACT

This paper describes a computer program, called Machine Rhythm, which models the processes by which humans perceive rhythm in music. This problem, in various guises, has been investigated by (Allen, 1990), (Bamberger, 1980), (Schloss, 1985), (Desain, 1989), (Lee, 1985), (Longuet-Higgins, 1984), (Mont-Reynaud, 1985), (Rosenthal, 1990), (Vercoe, 1985), and others. Input to the program is a polyphonic MIDI stream. The output of the program is a complete description of the rhythmic structure of the piece: the meter of the piece, the length of the upbeat, and the rhythmic role played by each note.

Machine Rhythm embodies a cognitive model of rhythm perception, and as such represents a tool for investigating intriguing questions about how rhythm perception works in humans. How long does it take a listener to arrive at a rhythmic interpretation? How committed is the listener to an interpretation once it has been constructed? How different are our answers to questions like these when we consider music in other cultures? Machine Rhythm represents a framework in which such questions can be more precisely formulated, and answers can be hypothesized and tested. The program also has applications in the world of music technology, such as automatic transcription, automatic synchronization of audio and/or video tracks, and intelligent participation by computers in live performances.

2. BASIC CONCEPTS

When people listen to a rhythmic piece of music, such as a march or a dance, they can indicate their perception of the piece's rhythm by tapping their hand or foot. For example, when listening to "La Marseillaise," one taps one's finger as indicated by the arches in the diagram shown in figure 1. Musicians would refer to this as "tapping the quarter-note level." One could also tap the half-note level (figure 2), or the whole-note, or measure, level (figure 3). In other words, the rhythmic structure that listeners build in their minds consists of a hierarchy of levels, which we can represent with the kinds of pictures shown in figure 4. In the remainder of this paper, we will call these structures hypotheses. The individual arches that compose levels are called beats; the width of an arch, corresponding to the length in time of the beat, is called the beat's interval. Certain formal constraints apply to hypotheses; for example, the arches cannot cross, as they do in figure 5. It is usually the case that adjacent arches are subdivided by the same number of smaller arches; when this property holds, we call the hypothesis isomeric. The most common non-isomerism in music is a change from subdivision by two to subdivision by three; the musical term for this situation is triplet, illustrated in figure 6.

3. PREPROCESSING

Input to the program is a sequence of MIDI bytes representing a musical performance. Before the program can begin to try to find the rhythm, some preprocessing operations reorganize the MIDI data into a form more suitable for rhythm finding. First, the MIDI bytes are organized into events, which correspond to musical notes. Second, the notes are organized into summary events, these being events which are considered perceptually simultaneous.

4. OVERVIEW OF THE PROGRAM

We are now ready to begin the rhythm-finding proper.
Three main modules comprise the program: startup, where we form a set of rhythmic hypotheses at the beginning of the piece; extension, where the remainder of the piece is interpreted according to a given hypothesis (which may divide, forming new hypotheses); and ranking, where hypotheses are ranked according to the likelihood that they represent the interpretation humans would make. Ranking serves two purposes: first, the top-ranked hypothesis represents the program's best guess as to what the rhythm of the piece is; second, it allows the program to prune away the less likely hypotheses. This is important because the number of hypotheses increases exponentially with the number of notes processed.
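The data structures implied by section 2 can be made concrete with a small sketch. The names below are illustrative assumptions (the paper gives no code): a hypothesis is a stack of levels, each level a sequence of beats with an onset and an interval, and the isomeric property can be checked by comparing how many lower-level beats fall under each upper-level beat.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Beat:
        start: float      # onset time of the beat, in seconds
        interval: float   # width of the arch: the beat's duration

    @dataclass
    class Level:
        beats: List[Beat] = field(default_factory=list)

    @dataclass
    class Hypothesis:
        levels: List[Level] = field(default_factory=list)  # lowest (tactus) level first
        score: float = 0.0                                  # filled in by the ranking module

    def subdivision_counts(lower: Level, upper: Level) -> List[int]:
        """How many lower-level beats begin inside each upper-level beat."""
        return [sum(b.start <= lb.start < b.start + b.interval for lb in lower.beats)
                for b in upper.beats]

    def is_isomeric(hypothesis: Hypothesis) -> bool:
        """True if, at every pair of adjacent levels, all beats are subdivided alike."""
        for lower, upper in zip(hypothesis.levels, hypothesis.levels[1:]):
            if len(set(subdivision_counts(lower, upper))) > 1:
                return False
        return True

In this sketch, startup would produce a set of Hypothesis objects for the opening of the piece, extension would append beats to their levels, and ranking would fill in the score fields used for pruning.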

[Figures 1-6: finger-trace arches plotted against time over the stream of note events, illustrating the tapping levels, several hypotheses built on one tactus level, crossing arches, and triplets.]

5. STARTUP

The aim at this stage is to construct every plausible hypothesis for the first few seconds of the piece, and then to rank the hypotheses according to the likelihood that they are "correct," i.e., preferred by human listeners. The first step in this process is to find the tactus, which typically corresponds to an eighth-note or a quarter-note. This is accomplished by a process that forms histograms of note inter-onset intervals; the actual algorithm employed resembles that developed by Gold and Rabiner for pitch tracking (Gold, 1969). Having found the tactus, we sequentially search for a set of notes which comprises the tactus level. We then build every plausible hypothesis with the tactus level as its lowest level. Three different hypotheses built on the same level are shown in figure 4. At this stage, if all has gone well, we have a collection of hypotheses about the rhythm of the first few seconds of the piece, one of which represents "the" rhythm of the piece, that is, the interpretation that a human listener would construct.
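The paper does not spell out the tactus-finding algorithm (it resembles Gold and Rabiner's pitch tracker), but the inter-onset-interval histogram idea can be sketched very roughly as follows; the bin width, the search range, and all names are assumptions made purely for illustration, not the program's actual procedure.

    from collections import Counter

    def estimate_tactus(onsets, bin_ms=25, min_ms=200, max_ms=1000):
        """Crude tactus estimate: histogram the intervals between nearby note
        onsets (given in seconds) and return the most common interval.
        A stand-in for the Gold/Rabiner-style process the paper refers to."""
        onsets = sorted(onsets)
        histogram = Counter()
        for i, t in enumerate(onsets):
            for u in onsets[i + 1:]:
                gap_ms = (u - t) * 1000.0
                if gap_ms > max_ms:
                    break
                if gap_ms >= min_ms:
                    histogram[round(gap_ms / bin_ms)] += 1
        if not histogram:
            return None
        best_bin, _ = histogram.most_common(1)[0]
        return best_bin * bin_ms / 1000.0

    # Example: an idealized quarter-note pulse at 120 bpm
    # estimate_tactus([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])  -> approximately 0.5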

6. HYPOTHESIS EXTENSION

This module extends a hypothesis to include the yet-unprocessed notes of a performance (see figure 7). Each beat examines a time window surrounding the point in time at which it expects the next beat (see figure 8). Inside this window there is some number (possibly zero) of new events. If there are no events, we construct a ghost event. This represents a point in the performance where one would tap one's finger but no note sounds. There are two such points in the segment from "La Marseillaise" shown in figure 1, indicated by dotted vertical lines in the event stream. In the general case that there is more than one event, we have to choose the one which best continues the beat. This is the task of a module called choose, which is described (briefly) in the next section. It may happen that choose cannot determine a single event which best continues the level, in which case the hypothesis containing the level bifurcates to produce two new hypotheses. In general, this process takes place simultaneously at all of the levels that compose a hypothesis; most of the work of the extension module is simply to keep these processes coordinated.

[Figure 7: extension of a hypothesis over new events. Figure 8: the current event and the window surrounding the expected next event.]

7. CHOOSING AND RANKING

Choose and rank are modules whose responsibility it is to decide which of a number of objects - either individual notes or complete hypotheses - is preferred. They are discussed together here because they both use similar strategies to make their decisions. Those strategies might best be termed management of multiple sources of evidence (Minsky, 1986), and may be summarized as follows. We note that human rhythm perception depends on a number of different cues. In certain pieces we are mainly cued by the timing of the notes; the relations between longer and shorter notes tell us what the rhythm is. In other pieces the notes may all be of the same length, so that the timing tells us nothing, but a repeating pattern in the melody indicates the rhythm. The density of chords may provide a cue, as may the locations of changes in the harmony. We also note that for any one of these rhythm-finding methods, there are musical situations in which it will not work - that is, it will fail to indicate the rhythmic hypothesis that a human listener would in fact choose.

To model this situation, we construct a number of programmatic experts, each of which specializes in one of these rhythm-choosing methods. There is, for example, a timing expert, which prefers hypotheses in which adjacent beats are very close to the same length. The timing expert is usually right, but it can also fail, especially when there is an expressive deviation in the performance from metronomic time. Other experts look for repeating motivic patterns in the melody or accompaniment, or prefer interpretations in which there are fewer syncopations. In the current version of the program, six such experts are used in the rank module and three in the choose module. Each of these experts ranks the objects under consideration according to its specialty. It is then necessary to integrate their results. In the rare situation that the experts all agree on which interpretation is best, the choice is obvious.
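Before turning to how disagreements among the experts are resolved, the per-beat extension step of section 6 can be sketched as follows, reusing the Beat and Level classes from the earlier sketch and assuming each summary event carries an onset time. The window size, the toy choose, and all names are illustrative assumptions, not the program's actual code.

    def extend_level(level, events, window=0.12):
        """Extend one level by one beat. Returns a list of possible
        continuations, so the caller can bifurcate the hypothesis when
        more than one candidate remains."""
        last = level.beats[-1]
        expected = last.start + last.interval            # where the next beat is expected
        candidates = [e for e in events if abs(e.onset - expected) <= window]
        if not candidates:
            # No note sounds where one would tap: insert a ghost event.
            return [level_with(level, Beat(start=expected, interval=last.interval))]
        chosen = choose(candidates, expected)            # the choose module of section 7
        return [level_with(level, Beat(start=e.onset, interval=e.onset - last.start))
                for e in chosen]

    def level_with(level, beat):
        """Copy of a level with one more beat appended."""
        return Level(beats=level.beats + [beat])

    def choose(candidates, expected):
        """Toy stand-in for choose: keep the candidate(s) nearest the expected
        time; a tie yields several candidates and hence a bifurcation."""
        best = min(abs(e.onset - expected) for e in candidates)
        return [e for e in candidates if abs(e.onset - expected) == best]

In the program proper this happens at every level of a hypothesis at once, with the extension module keeping the levels coordinated.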

In general, the experts disagree, and we use various strategies for deciding which are correct:

1. Some experts are good at making some distinctions but not others. The timing expert, for example, is good at telling us which of several 4/4 interpretations is correct, but useless for choosing between 3/4 and 4/4.

2. Some experts work well in conjunction with other experts. When the timing and anti-syncopation experts agree, we can be fairly sure that they are both right.

3. Experts differ not only in their overall reliability but also in their ability to avoid false positives. The melodic-pattern expert often produces no result at all, but when it does have a preference, it is relatively reliable.

4. When we can't do anything else, we simply vote. Each expert has been assigned a weight - essentially, the number of votes it can cast for its preference.

The overall design of the program includes an established protocol for adding new experts and improving the reliability of existing ones.
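As a rough illustration of the weighted-voting fallback in item 4, each expert can be modeled as a function that either nominates a preferred hypothesis or abstains, and casts a fixed number of votes for its nomination. The expert names, weights, and interfaces below are assumptions for illustration (reusing the Hypothesis sketch from earlier), not the program's actual implementation.

    def rank_hypotheses(hypotheses, experts):
        """Weighted vote: each (expert, weight) pair may nominate one
        hypothesis (or None to abstain) and casts `weight` votes for it.
        Hypotheses are returned best-first."""
        votes = {id(h): 0.0 for h in hypotheses}
        for expert, weight in experts:
            preferred = expert(hypotheses)
            if preferred is not None:                    # experts may abstain (cf. item 3)
                votes[id(preferred)] += weight
        return sorted(hypotheses, key=lambda h: votes[id(h)], reverse=True)

    def timing_expert(hypotheses):
        """Prefers the hypothesis whose tactus beats are most nearly equal in length."""
        def unevenness(h):
            intervals = [b.interval for b in h.levels[0].beats]
            return max(intervals) - min(intervals)
        return min(hypotheses, key=unevenness)

    def pattern_expert(hypotheses):
        """Stand-in for an expert (like the melodic-pattern expert) that often abstains."""
        return None

    experts = [(timing_expert, 2.0), (pattern_expert, 1.0)]
    # best = rank_hypotheses(candidate_hypotheses, experts)[0]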
8. CONCLUSIONS

The program described in this paper is currently working and is being tested on a variety of performances and musical styles. I have also used the program to perform tasks that constitute the technical motivation for building an automatic rhythm finder, such as automatically synchronizing the primo and secondo parts of four-hands piano pieces. It appears that the basic strategies of the program - in particular the management of a variety of programmatic specialists - are well chosen. These strategies enable the program to deal with surprisingly daunting situations: sloppily played passages, syncopations, expressive timing variations, triplets, and so on. The program constitutes a valuable tool for understanding an important aspect of musical cognition. I also believe that the work described here can inform general theories of perception, perhaps even including non-auditory perception. There are many situations where our senses interpret and use evidence from a variety of sources. Machine Rhythm provides an example of how that situation can be successfully modeled.

9. REFERENCES

P. Allen and R. Dannenberg, "Tracking Musical Beats in Real Time," Proceedings of the 1990 International Computer Music Conference, pp. 140-143, Computer Music Association, Glasgow, 1990.

J. Bamberger, "Cognitive Structuring in the Apprehension of Simple Rhythms," Archives de Psychologie, vol. 48, pp. 171-199, 1980.

P. Desain and H. Honing, "Quantization of Musical Time: A Connectionist Approach," Computer Music Journal, vol. 13, pp. 56-66, 1989.

B. Gold and L. Rabiner, "Parallel Processing Techniques for Estimating Pitch Periods of Speech in the Time Domain," J. Acoust. Soc. Am., vol. 46, p. 442, 1969.

C. Lee, "The Rhythmic Interpretation of Simple Musical Sequences: Towards a Perceptual Model," in Musical Structure and Cognition, Howell, Cross & West (eds.), pp. 53-69, Academic Press, London, 1985.

H. Longuet-Higgins and C. Lee, "The Rhythmic Interpretation of Monophonic Music," Music Perception, vol. 1, pp. 424-441, 1984.

M. Minsky, The Society of Mind, Simon and Schuster, New York, 1986.

B. Mont-Reynaud and M. Goldstein, "On Finding Rhythmic Patterns in Musical Lines," Proceedings of the 1985 International Computer Music Conference, pp. 391-397, Computer Music Association, Burnaby, 1985.

D. Rosenthal, "Computer Emulation of Human Rhythm Perception," MIT Media Lab Tech. Report, 1990.

A. Schloss, "On the Automatic Transcription of Percussive Music: From Acoustic Signal to High-Level Analysis," Ph.D. thesis, CCRMA, Stanford University, 1985.

B. Vercoe, "The Synthetic Performer in the Context of Live Performance," Proceedings of the 1985 International Computer Music Conference, pp. 25-31, Computer Music Association, Burnaby, 1985.