Page  232 ï~~Real-time Recognition of Melodic Fragments Using the Dynamic Timewarp Algorithm Dale R. Stammen Bruce Pennycook Faculty of Music, McGill University Faculty of Music, McGill University stammen@music.mcgill.ca brp@music.mcgill.ca Abstract In this paper, we describe an adaptation of the dynamic timewarp algorithm (DTW) used to recognize melodic fragments in real-time. The DTW is a widely used technique for time alignment and comparison of speech and image patterns. We have modified the DTW to allow the comparison of the pitch and rhythmic contours of an unknown melodic fragment with a collection of previously recognized fragments. The DTW can compare melodic fragments of different sizes. The timewarping characteristic of the DTW easily accommodates the expressive timing inherent in human performance, thereby eliminating any need for rhythmic quantization. The algorithm is currently implemented in a Max external object on a Macintosh computer and as a T-Max object [Lea and Pennycook, 1991] for the INMOS T-805 transputers. Good results have been obtained in the recognition of melodic fragments from music ranging from Bach fugues to bebop jazz. 1. Introduction Recent research into melodic perception suggests that melodic contour is one of the most salient features used to remember and recognize melodic sequences. Melodic contour may be defined as the set of directional relationships between successive notes of a melody. Melodic contour is an abstraction from melody that can be remembered independently of pitches [Bartlett and Dowling, 1980]. Experiments have shown that the contours of brief, novel atonal melodies can be retrieved from short-term memory even when the sequence of exact intervals cannot. In addition to being preserved in short-term memory, melodic contours also seem to be retrievable from long term memory independent of interval sizes [Dowling, 1978]. Melodic contour also contributes strongly to the recognition of transposed melodies [Dowling and Fujitani, 1971]. However, contour alone can be used to identify a melody only if listeners have some knowledge of the set of tunes from which to select their answers [Massaro, 1988]. In view of this evidence, a computer program designed for melodic recognition should be able to analyze melodic contours. Further consideration of melodic recognition theories suggests that melodic recognition may be viewed as a pattern recognition task. The contour of an unknown melodic fragment represents a pattern of rising and falling intervals which in turn characterizes the general shape of the fragment. Melodic contour recognition therefore involves the comparison of an unidentified melodic fragment with previously recognized fragments. Our measurement of similarity between any two melodic fragments considers the differences between their melodic contours as well as their rhythmic content. However, this comparison becomes more complicated when one considers the variability of melodic fragments, especially the differences in their lengths. To overcome this difficulty, we have chosen the dynamic time warping algorithm as a method to compare melodic fragments of different lengths. 2. Applications of Dynamic Programming to Music The technique of dynamic time warping is based on the dynamic programming path finding algorithm. The theory of dynamic programming (DP) was introduced by Bellman[1957] to solve mathematical problems arising from multidecision processes. Elastic pattern matching involving the comparison of sequences of different lengths can also be modeled using DP. This technique is commonly referred to as dynamic time warping (DTW). Dannenberg[1985] describes one of the first applications of DP to music. His score-follower algorithm utilized DP for real-time comparison of performed pitches with a score stored on a computer. The algorithm handled performance deviations from the score by allowing for insertions, deletions, and replacements. Insertions occurred when the performer added one or more notes to the score. Deletions occurred when notes were left out of the performance. Replacements resulted when a performer substituted another note for one in the score. Through the use of DP, the algorithm was able to match inaccurate performances with a fixed score. 18.2 232 ICMC Proceedings 1993

Page  233 ï~~Mongeau and Sankoff[1990] used DP to compare the overall similarity between two musical scores. Their algorithm not only accommodated musical insertions, deletions, and replacements but also consolidation and fragmentation of the musical material. Consolidation involves the replacement of several elements by a single one, while fragmentation is the replacement of one element by several. Their implementation of DP searched for similarities in melodic lines despite gross differences in key, mode, or tempo. However, their use of interval weights and tonic relationships limited their analyses to Western tonal music. Their system was intended to compare entire musical works or large sections of works and is therefore unsuitable for a real-time implementation. MIDI SegmenterEvaluation - Templates - Figure 1: System Implementation of the DTW 3. Implementation of the Dynamic Timewarp Algorithm We will briefly describe our implementation of the DTW algorithm. For a more complete introduction to the DTW, the reader is referred to [Silverman and Morgan,1990] and [Sankoff and Kruskal,1983]. Our melodic recognition system is modeled after a discrete word recognition system (DWR) first proposed by Itakura[1975] and is shown in Figure 1. In this system, a monophonic MIDI stream is segmented into candidate musical fragments by a Segmenter process [Pennycook et al., 1993]. Each fragment produced by the Segmenter consists of 4-16 MIDI notes. The notes in a fragment are converted to an interval set to remove pitch differences caused by transposition. For example, fragment 1 shown in Figure 2 is represented by the interval set (0, - I, +1, -5, +11}. The rhythm of this fragment is converted into a set of duration ratios, which we define as the ratio of a note's duration divided by the previous note's duration. The rhythm of the melodic fragment in Figure 2 is represented by the duration ratio set (0, 1, 2, 1, 1). However, in a live performance of this fragment, the duration ratio set would not consist of exact integers and as an example may contain the values (0,.9, 2.1,.97, 1.1). The DTW is well suited to accommodate these local rhythmic variations, thereby removing the need for rhythmic quantization. The duration ratio representation of rhythm also allows for easy comparison of rhythmic contours that are related by augmentation or diminution. Each melodic fragment is therefore transformed into a two-dimensional feature vector that combines the pitch interval set and duration ratio set. Other feature sets such as velocity can be added to the vector to create an n-dimensional representation of the musical fragment. This set of feature vectors becomes the candidate template representing the unknown fragment. Fragment 1 Fragment 2 Fragment 3 J.S. Bach Pitch Interval: 0-1+1-5 +1 0-1 +1 +2-7 0-1 +1+2-7+2 +1 0-2 -2 Duration Ratio: 0 1 2 1 1 0 1 2 1 1 0 1 2 1.5 1 4 0 1 2 Figure 2: Feature Set Representation of Musical Fragments In order to recognize the unknown fragment, the candidate's feature template is compared to a database of reference templates. The DTW compares the unknown candidate with each template in the database and assigns a distance value that indicates the degree of similarity between the candidate and reference templates. Our recognition system matches the candidate with the reference template that results in the lowest distance measure. When several close matches occur, an Evaluator is used to select the best match. If the best match distance is higher than a pre-defined threshold, the candidate's template is considered to be unique and can be added to the database of reference templates. The DTW template matching process is illustrated in Figure 3. The horizontal axis represents the [1..m] elements of the unknown candidate C, where m is the number of feature vectors contained in the candidate. The vertical axis represents the feature vectors [1..n] of a reference template R, n being the number of feature vectors in R. Each grid ICMC Proceedings 1993 233 1B.2

Page  234 ï~~intersection point (ij) represents a possible match between elements Ci and Rj. A local distance measure d(ij) is computed at each grid intersection. This distance measure is a function of the two feature vectors Ci andRj and describes the dissimilarity of these two vectors. We have chosen to use the simple Euclidean distance measure. Column i-1 Column i nj1(ij) 1 m 2)(i-1-) I Candidate C 1 Candidate C m Figure 3:The DTW Array Figure 4: Local Path Constraints Figure 5: Global Path Constraints After all of the local distances have been calculated, the DTW attempts to trace a path from endpoint(1,1) to endpoint(m,n) that results in the lowest accumulated distance. As shown in Figure 3, any monotonic path from endpoint (1,1) to endpoint (m,n) represents a possible mapping or warp of the candidate template onto the reference template. The accuracy of such a mapping can be measured by summing all of the distances d(ij) along the path. Fora given candidate/ reference pair, the goal is to find the monotonic path through the array that minimizes the accumulated distance between the endpoints. Instead of tracing every possible path through the DTW array, we add local and global path restraints to limit the total number of possible paths. A local restraint used in computing the optimal path through the DTW array is shown in Figure 4. This restraint limits the number of predecessors at a given grid intersection. As indicated in Figure 4, the only possible predecessor to the point (ij) is one of (i-1j), (i - 1j-1), or (i - 1j-2). The accumulated distance at any point (ij) on the grid becomes: d(i,j) = d(i,j) + min(d(i-l,j),d(i -1,j-1),d(i -1,j-2)} This local path restraint is known as the Itakura local restraint. The Itakura restraint limits the possible path through the array to between a slope of 0.5 and 2.0. Other local path restraints are possible and several examples are listed in [Silverman and Morgan, 1990]. The DTW algorithm has a complexity of inn, where m and n are the number of feature vectors in the candidate and reference templates. It is therefore desirable to further reduce the amount of distance and path calculations required to determine the degree of similarity between a candidate and reference template. Global path restraints such as the one shown in Figure 5, require that the optimal path be within a certain region of the DTW array. Paths that run outside of this region are rejected. The use of a parallelogram search space [Rabiner et al., 1978] reduces the complexity of the DTW to approximately nm/3. The combination of the local and global path restraints allow the DTW to compare an unknown candidate with templates that vary from 0.5 to 2.0 times its length. To compare a candidate with K templates, the complexity becomes K(nm/3). Real-time performance is therefore dependent on the size of K. The DTW compares a candidate template with every reference template contained in the template database. These reference templates may be loaded into the database in a number of ways. Templates may be pre-loaded by playing the desired musical fragments into the Segmenter. The Segmenter converts each fragment into a feature vector template and stores them into the template database. Templates may be loaded from a data file containing templates recognized during previous sessions. Templates may also be created by the DTW Evaluator. With this method, melodic fragments that are not recognized by the DTW are entered into the template database. It is therefore possible to start with.an empty database and have the system construct a template database that contains templates representing the unique musical fragments contained in a single musical work. A graphical editor allows for the display and editing of the recognized fragments and also the correction of incorrectly segmented or recognized fragments. The flexibility of the template database allows our system to be used in a variety of melodic recognition tasks. The system can be configured to respond to only one feature such as pitch or rhythm. Likewise, new features may be defined and added to the DTW recognition process. 6. Conclusions We have found the DTW to be an effective algorithm for the recognition of melodic fragments. The contour-matching capabilities of the DTW allow for recognition of similar melodic fragments by accommodating the effects of insertion, 16.2 234 ICMC Proceedings 1993

Page  235 ï~~deletion,replacement, consolidation and fragmentation of elements. We have currently implemented the DTW melodic recognition system as a Max external object, a T-Max object for implementation on INMOS T-805 transputers, and as a Macintosh program called Time Warp. We believe that the DTW may also be used as a melodic recognition process for music analysis of printed scores. For example, the output of an optical music recognition system [Fujinaga, 1992] may be used as input to a DTW recognition system. This would enable users to search large musical databases for similarities of certain musical features. This would also allow DTW reference template databases to be constructed directly from printed musical scores. Our future research will examine level-building DTW algorithms [Silverman and Morgan, 1990] used in continuous speech recognition to create higher level hierarchies from recognized fragments. As the performance of our recognition system is dependent upon the accuracy of the Segmenter, continuous speech recognition techniques may allow for continuous recognition of fragments and remove the need for a segmentation process. Acknowledgments We are grateful to the Social Sciences and Humanities Research Council of Canada for their support of this research. References [Bartlett and Dowling, 1980] James C. Bartlett and Jay W. Dowling. Recognition of Transposed Melodies: A KeyDistance Effect in Developmental Perspective. Journal of Experimental Psychology: Human Perception and Performance, 6(3), pp.501-515, 1980. [Bellman, 1957] R.E. Bellman. Dynamic Programming, Princeton University Press, Princeton, 1957. [Dannenberg, 1984] Roger B. Dannenberg. An On-Line Algorithm for Real-Time Accompaniment. Proceedings of the 1984 International Computer Music Conference, pp.193-198, 1985. [Dowling, 1978] W. Jay Dowling. Scale and Contour: Two Components of a Theory of Memory for Melodies. Psychological Review, (85)4, pp.341-354, 1978. [Dowling and Fujitani, 1971] WJ. Dowling and Diane Fujitani. Contour, Interval, and Pitch Recognition in Memory for Melodies. The Journal of the Acoustical Society of America, 49(2), pp.524-531, 1971. [Fujinaga et al., 199211. Fujinaga, B. Alphonce, and Bruce Pennycook. Interactive Optical Music Recognition. Proceedings of the 1992 International Computer Music Conference, pp.117-120, 1991. [Itakura, 1975] Fumitada Itakura. Minimum Prediction Residual Principle Applied to Speech Recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-24, pp.67-72, 1975. [Massaro et al., 1980] Dominic W. Massaro, Howard J. Kallman, and Janet L. Kelly. The Role of Tone Height, MelodicContour, and Tone Chroma in Melody Recognition. Journal of Experimental Psychology: Human Learning and Memory, 6(1), pp.77-90, 1980. [Mongeau and Sankoff, 1990] Marcel Mongeau and David Sankoff. Comparison of Musical Sequences. Computers and the Humanities, 24, pp.161-175, 1990. [Pennycook et al., 1993] Bruce Pennycook, Dale Stammen, and Debbie Reynolds. Toward a Computer Model of a Jazz Improviser. Proceedings of the 1993 International Computer Music Conference, 1993. [Pennycook and Lea, 1991] Bruce Pennycook and Chris Lea. T-MAX A Parallel Processing Development System for MAX. Proceedings of the 1991 International Computer Music Conference, pp.229-233, 1991. [Rabiner et al., 1978] L. R. Rabiner, A. E. Rosenberg, and S. E. Levinson. Considerations in Dynamic Time Warping Algorithms for Discrete Word Recognition. IEEE Transactions on Acoustics, Speech, Signal Processing, ASSP-25, pp.575-582, 1978. [Sankoff and Kruskal, 1983] David Sankoff and Joseph B. Kruskal (Ed). Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison Wesley, Reading, Massachusetts, 1983. [Silverman and Morgan, 1990] Harvey F. Silverman and David P. Morgan. The Application of Dynamic Programming to Connected Speech Recognition. IEEE ASSP Magazine, pp.7-24, 1990. ICMC Proceedings 1993 235 lB.2