Structural Recognition of Music by Pattern Matching

Yuzuru Hiraga
University of Library and Information Science
1-2 Kasuga, Tsukuba, 305 JAPAN
hiraga@ulis.ac.jp

Abstract

This paper presents a pattern-directed model of melodic structure recognition and its pilot implementation. The intention is to show the significance of pattern information in structural recognition. The basic framework is described, and some example results are discussed.

1 Introduction

This paper describes a pattern-directed model of melodic structure recognition and its pilot implementation, following previous work [1][2]. The program generates structured pattern descriptions of a given melody based on pattern induction and matching.

Patterns in our context are syntactic entities that reflect intrinsic features of individual pieces. They are syntactic in that their integrity and meaning reside in the relationships among their components, and in that the base components are treated as formal symbols with no semantic connotation. The point of emphasis is that pattern information provides significant grounds for analyzing musical structure, without explicit reliance on the reference frames of key and meter, which associate semantic function with the notes. This allows for robust processing, and extends the applicability of the model to non-Western-style music.

Pattern in music has received considerable attention in the past. Narmour presents, in effect, an extensive theory of syntactic patterns in his (bottom-up, parametric) Implication-Realization model [3]. Within the computational approach, recent works include those by Pennycook and colleagues [4][5], and Rowe [6]. The present scheme differs from them in that it is oriented towards structural matching and structure derivation, and in that pattern matching is intermixed with other processing.

2 Overview

Input to the program is a symbolically encoded melody sequence. Primary attributes are pitch (in semitone units) and note-length (in quantized form, though extensible to actual durations). The processing involves the following operations:

* initial segmentation
* pattern matching
* pattern induction & grouping
* reduction & hierarchical structuring

The program works incrementally on the input data, with the above operations working in parallel. Initial segmentation looks for primitive sequences, which are fragments that construct patterns and provide group boundary candidates. Pattern matching matches incoming sequences against existing fragments. The matching scheme is structural in that exact match is attempted over attribute sequences of the input. Pattern induction gathers fragments to form a group closure and generates its structural description. Reduction looks for structural notes, which can be used as components of higher-level sequences. The output of the program is the segmentation of the piece into groups, their structural description based on primitive sequences, and a description of the relationships among matched groups or fragments.
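As a rough illustration only (not the paper's actual data format), the symbolic input described above might be encoded in Common Lisp as follows; the NOTE structure, its fields, and the MIDI-style pitch numbers are assumptions made for this sketch.

    (defstruct note
      pitch    ; pitch in semitone units (a MIDI-style number is assumed here)
      length)  ; quantized note-value, e.g. 1/4 for a quarter note

    ;; Opening notes of "Are You Sleeping?" (C D E C) as input data.
    (defparameter *are-you-sleeping-opening*
      (list (make-note :pitch 60 :length 1/4)
            (make-note :pitch 62 :length 1/4)
            (make-note :pitch 64 :length 1/4)
            (make-note :pitch 60 :length 1/4)))

The four operations listed above would then be updated incrementally as such a list of notes is consumed one note at a time.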

3 Primitive Sequences

A primitive sequence is a sequence that is defined over a single primitive relation. In the pitch domain, relations are defined in terms of intervals. Intervals (= differences between adjacent pitches) are taken relative to "scales", so the values will vary according to the underlying scale. Instances of scales include the chromatic scale, diatonic scales, and diatonic triads. For example, the sequence "CDEFG" has the interval profile {*, +2, +2, +1, +2} over the chromatic scale, and {*, +1, +1, +1, +1} over the C-major diatonic scale ('*' indicates the reference position). Likewise, the sequence "EGC↑" (↑/↓ indicate octave higher/lower pitches) is {+2, +3, *} over the C-major scale and {+1, +1, *} over the C-major triad (with reference pitch C↑).

The intervals can be further decomposed into direction (up(+), down(-), same(0)) and magnitude (absolute value of the interval); and the latter further into magnitude type (same(0), adjacent (= 1: a), remote (> 1: r)). This allows for more abstract (and coarse) profiles. For example, the interval profile {*, +1, -2} can be abstracted as {*, +a, -r}, and further as {*, +, -} or {*, a, r}. Note that the intervals for diatonic scales and triads can be obtained without knowledge of the key. For example, diatonic tones are either 1 or 2 semitones apart, and a semitone interval is not succeeded by another (different criteria apply for, e.g., pentatonic scales).

A pitch sequence is primitive if the elements of its profile are identical. Thus, primitivity depends on the profile level. For the current purpose, the direction/magnitude level profiles over diatonic scales and triads are sufficient as base profiles. As an example, consider the opening sequence of the nursery rhyme Are You Sleeping?: "CDECC...". This contains three primitive sequences: CDE(+a), EC(-r), and CC(0). Note that boundary notes necessarily overlap, because they serve both as the end of the preceding sequence and the beginning of the following sequence.

In the time domain, the base profile is the note-value sequence itself. This is abstracted as a duration ratio sequence (in log2). For example, a note-value sequence [1/4, 1/8, 1/2] has the log-ratio profile {*, -1, +2}, with the initial quarter note as the reference length. The signs of the log-ratio provide a further abstraction into longer(+), shorter(-), and same(0) lengths (a tolerance range can be set to broaden the range of same-length notes). The counterpart of primitive sequences is simply periodic repetition of the same length. This, however, is given low precedence during segmentation, and is utilized more in pattern induction.
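To make the profile notions concrete, here is a minimal Common Lisp sketch (an illustration, not the paper's implementation) that computes interval profiles over semitone pitch numbers, abstracts them to direction/magnitude type, tests primitivity, and computes the log2 duration-ratio profile. Treating steps of one or two semitones as "adjacent" follows the key-free criterion for diatonic steps mentioned above.

    ;; Pitches are plain semitone numbers (e.g. MIDI); note-values are rationals.
    (defun interval-profile (pitches)
      "Differences between adjacent pitches; '* marks the reference position."
      (cons '* (mapcar #'- (rest pitches) pitches)))

    (defun abstract-interval (iv)
      "Abstract an interval into (direction . magnitude-type): direction is +, - or 0;
    magnitude-type is 0 (same), A (a 1-2 semitone step, i.e. scale-adjacent without
    knowing the key) or R (remote)."
      (cons (cond ((plusp iv) '+) ((minusp iv) '-) (t 0))
            (case (abs iv) ((0) 0) ((1 2) 'a) (t 'r))))

    (defun primitive-p (pitches)
      "A pitch sequence is primitive if all abstracted profile elements are identical."
      (let ((prof (mapcar #'abstract-interval (rest (interval-profile pitches)))))
        (every (lambda (x) (equal x (first prof))) prof)))

    (defun log-ratio-profile (lengths)
      "Rounded log2 ratios between adjacent note-values; '* marks the reference."
      (cons '* (mapcar (lambda (b a) (round (log (/ b a) 2))) (rest lengths) lengths)))

    ;; (interval-profile '(60 62 64 65 67))  => (* 2 2 1 2)   ; "CDEFG", chromatic
    ;; (primitive-p '(60 62 64))             => T             ; "CDE" (+a)
    ;; (primitive-p '(60 62 64 60))          => NIL           ; "CDEC" mixes +a and -r
    ;; (log-ratio-profile '(1/4 1/8 1/2))    => (* -1 2)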
4 Pattern Matching

Matching of two sequences is performed over their profiles (for both pitch and time sequences). A match is established when two profiles (at some level) are identical. The match is strongest when the two sequences are identical, and gets weaker as the matched profiles become more abstract. The weakest case is when only the rhythm patterns match. Match strength also depends on the length of the match, the "complexity" of the sequence, and the proximity of the sequences. A match is stronger when it is longer, involves a larger number of primitive sequences, and/or occurs consecutively. Repetition of the same pitch and/or note-value is treated as a weaker match.

Because pattern matching directs grouping and other processes, matching is invoked whenever conceivable (currently, at the beginning of each primitive sequence). In order to facilitate the matching process, the current implementation employs an array indexed by (up to) three consecutive elements for each profile type. If the current three notes match an entry in this array, the main matching procedure is invoked.

Matching may be invoked for sequences whose boundaries are not yet certain, so matching between the source and the target continues until:

* the match fails,
* the tail of the source reaches the head of the target (the non-overlapping condition), or
* the end of the source is reached, in case it is a pre-established pattern (= group).

Groups must fulfill the non-overlapping condition (as stated in [7]). When competing matches overlap and cause a conflict, the stronger match is chosen as preferable.

Matching can be partial, in that while the whole patterns do not match, their sub-patterns match to a substantial degree. A typical case is when two phrases start identically, with one ending on the dominant and the other on the tonic. This enables limited treatment of inexact matches. Matching can also be made over reduced sequences, where only the structural notes are picked out (and extended in length). This allows for another form of inexact match. The current program makes only a limited attempt at such matching (e.g. joining repeated notes of the same pitch into a single note).
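The level-wise matching idea can be sketched as follows (again an illustration, reusing the profile functions from the Section 3 sketch; the level names and the MATCH-LEVEL function are assumptions, not the paper's API). The function returns the strongest level at which two sequences' profiles coincide.

    (defun direction-profile (pitches)
      "Direction-only abstraction of the interval profile."
      (mapcar #'car (mapcar #'abstract-interval (rest (interval-profile pitches)))))

    (defun match-level (pitches1 lengths1 pitches2 lengths2)
      "Return the strongest profile level at which the two sequences match, or NIL."
      (cond ((equal (rest (interval-profile pitches1))
                    (rest (interval-profile pitches2)))
             :exact-intervals)                    ; strongest: identical intervals
            ((equal (mapcar #'abstract-interval (rest (interval-profile pitches1)))
                    (mapcar #'abstract-interval (rest (interval-profile pitches2))))
             :direction-and-magnitude)
            ((equal (direction-profile pitches1) (direction-profile pitches2))
             :direction-only)
            ((equal (rest (log-ratio-profile lengths1))
                    (rest (log-ratio-profile lengths2)))
             :rhythm-only)                        ; weakest: only rhythm matches
            (t nil)))

    ;; (match-level '(60 62 64 60) '(1/4 1/4 1/4 1/4)
    ;;              '(60 62 64 60) '(1/4 1/4 1/4 1/4))  => :EXACT-INTERVALS
    ;; (match-level '(60 62 64 60) '(1/4 1/4 1/4 1/4)
    ;;              '(64 65 67 64) '(1/4 1/4 1/4 1/4))  => :DIRECTION-AND-MAGNITUDE

The strength modifiers discussed above (match length, complexity, proximity) and the three-element index that triggers the main procedure are omitted from this sketch.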

5 Induction and Reduction

Pattern induction gathers fragments to form groups, which correspond to the ordinary notion of motives, phrases, etc. Group segmentation is firmly established when two consecutive sequences are matched. In this case, the group boundary is set between the two, and the latter is taken to be a repetition of the former. Likewise, if a match is made with a previously known group, then the sequence is recognized as a recapitulation. Once a group is established, its features (e.g. component sequences, rhythm pattern, and total length at the bottom end) are used to facilitate and preempt further processing. For example, if a group is known to be 4 beats long, then grouping into 4-beat units comes to be preferred. This amounts, in effect, to the recognition of meter. But unlike an established meter, it is not taken as normative and can be overridden by other conditions.

When there are no clear matches (especially at the beginning stage), the situation gets more difficult. The program is set to provide some segmentation within a given range (corresponding to approximately two measures' length), and tries to find the best breaking point within that range, based on primitive sequences. An example of such processing is when a longer note is surrounded by shorter notes. In this case, either (or both) of the preceding/succeeding sequences are seen to be subordinate to the long note, and thus belong to the same group.

A group consists of a number of primitive sequences, and its structural account is given in terms of them. Roughly speaking, this is done by selecting structural notes from the primitive sequences and forming higher-level sequences from them. Structural notes are selected as the first, last, or longest note, in such a way that the higher-level sequence is simple (preferably, primitive). Particular patterns are treated individually. For example, a return motion (mordent) is reduced to a single note, so "GAGF" becomes "G-F". Another example is when a pattern is repeated 3 times, followed by a different one. In this case, the patterns are binary-grouped, so that aaab is treated as aa+ab, which amounts to recognition of an AA' pattern.
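As a small illustration of these two reduction rules (a sketch only; the (pitch . length) pair representation and the function names are assumptions, not the paper's code):

    (defun reduce-return-motion (notes)
      "Collapse a three-note return motion X-Y-X into a single structural note X
    whose length spans all three notes; other notes pass through unchanged.
    NOTES is a list of (pitch . length) pairs."
      (cond ((null notes) '())
            ((and (>= (length notes) 3)
                  (= (car (first notes)) (car (third notes)))
                  (/= (car (first notes)) (car (second notes))))
             (cons (cons (car (first notes))
                         (+ (cdr (first notes)) (cdr (second notes)) (cdr (third notes))))
                   (reduce-return-motion (nthcdr 3 notes))))
            (t (cons (first notes) (reduce-return-motion (rest notes))))))

    (defun binary-group (patterns)
      "Group sub-patterns two at a time, so that aaab is read as aa+ab (an AA' form)."
      (loop for tail on patterns by #'cddr
            collect (list (first tail) (second tail))))

    ;; "GAGF" as eighth notes reduces to "G-F":
    ;; (reduce-return-motion '((67 . 1/8) (69 . 1/8) (67 . 1/8) (65 . 1/8)))
    ;;   => ((67 . 3/8) (65 . 1/8))
    ;; (binary-group '(a a a b))  => ((A A) (A B))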
6 Examples and Discussion

As a simple example, let us consider Are You Sleeping? (Frère Jacques). After the first five notes of the opening "CDECCDEC", the program has three primitive sequences, "CDE", "EC", and "CC", and is undecided whether to cut the sequence at "E-C", "C-C", or, less preferably, "D-E". This is resolved once the match for "CDEC" is found. The program proceeds likewise with the rest, and finds the piece to have the following sequences, each repeated:

A. CDEC
B. EFG-
C. {GAGF}EC
D. CG↓C-

(- indicates half notes, { } eighth notes.)

The program also generates the description that:

* B and D have weak resemblance (having the same rhythm pattern).
* A and B have weak resemblance (in the ascending triplet).
* A and D have weak resemblance (being a return motion, one inverted).
* A and C are related (sharing "EC" at the end, and because the return motion of eighth notes in C reduces to "GFE", the ascending/descending triplets provide a weak match).

In general, pieces that are well structured with repeated use of motives are suitable for the program to come up with a detailed description (e.g. Long Long Ago). Style forms like AA'BA' can, in effect, be induced from the descriptions. On the other hand, if the piece contains less structural resemblance, then the description becomes less detailed and the derived results less determinate. However, in the domain of nursery rhymes and folk songs, most songs are seen to pursue pattern resemblance to at least some extent, which can be detected by the program. An interesting example is the Bohemian folk song in Figure 1.

The program can match the two phrases in measures 1-4 and 5-8, and in particular measures 1 and 5. This is because "FAC↑" and "GAB♭" are matched as ascending adjacent sequences (over the triad and the scale, respectively). Another point is that in measures 3,4 (and 7,8), "{AB♭}C↑" is repeated, and this binary grouping is preferred over the ternary grouping of measures 1,2 (but see below). This conflict, in effect, amounts to noticing structural syncopation.

Since the basic program relies only on pitch and note length, it can easily be "deluded" into drawing wrong conclusions. A typical example is Haydn: Symphony 92 (Figure 2). The correct notation in the score is (a) (the upper transcription) in 6/8 meter. The program correctly matches the two sequences, but comes up with a description that amounts to (b) (although the possibility of (a) is potentially retained). This, however, is understandable, because the human ear hears it the same way when it is played flat by a computer. That is, the sequence is essentially ambiguous. In practice, such ambiguity is overridden by articulation and accentuation (disregarding the effect of accompaniment parts in an orchestrated performance). The program has additional features for including such performance information. But since this has to be hand-encoded in the input (as opposed to being automatically derived from acoustic data), its use seems artificial.

[Figure 1: Beginning of a Bohemian folk song (English title unknown).]

[Figure 2: Transcriptions of Haydn: Symphony 92 (1st mvmt.): (a) correct, (b) incorrect.]

In the case of Figure 1, there is an alternative interpretation of a 2/4 meter with a quarter-note upbeat (the "{FA}" at the beginning). Although this interpretation seems musically ridiculous, the situation is subtle. The program prefers the intended interpretation because the descending sequence continues to A at the end of measure 2, and because of the repetition of "{AB♭}C↑" mentioned above. However, the decision gets reversed if the A in measure 2 is changed to C↑, because now the repetition of "C↑{AB♭}" becomes prominent. Similar ambiguities of meter exist in many of the works by Bach (e.g. Brandenburg Concerto 2).

The current program makes no extensive use of tonality (although a simple key-finding feature is included). This obviously limits the capability of the program. For example, the same results would be produced for a mistuned or atonal sequence, as long as the interval profiles are retained. This may lead to absurdity, but it also may not be a disadvantage, as human listeners may exhibit similar behavior (see the introductory comments in [5]). On the other hand, human listeners seem to rapidly establish the tonal context and make extensive use of it, drawing conclusions which the computer cannot draw at that point, or at all. Exploiting such discrepancies will help us understand the limitations of the program's capabilities, which are indicative of what features are missing and how they interact.

7 Conclusion

This paper presented a program that analyzes melody sequences based on syntactic patterns and their matching. Syntactic patterns are seen to be effective for the recognition of simple melodies. Although the program's capabilities are limited in many ways, it is believed that the framework will provide a good testbed for more extensive studies of music cognition and analysis.

The current program is implemented in Common Lisp on a SUN workstation and works on a symbolic interface. It is planned to be ported to C on PCs with visual and audio interfaces.

References

[1] Hiraga, Y. 1993. "A Computational Model of Music Cognition Based on Interacting Primitive Agents", Proceedings of ICMC 1993, pp. 292-295.
[2] Hiraga, Y. 1996. "A Cognitive Model of Pattern Matching in Music", Proceedings of ICMC 1996, pp. 248-250.
[3] Narmour, E. 1990. The Analysis and Cognition of Basic Melodic Structures, Chicago: Univ. of Chicago Press.
[4] Pennycook, B., Stammen, D.R., and Reynolds, D. 1993. "Toward a Computer Model of a Jazz Improviser", Proceedings of ICMC 1993, pp. 228-231.
[5] Stammen, D.R. and Pennycook, B. 1993. "Real-time Recognition of Melodic Fragments Using the Dynamic Timewarp Algorithm", Proceedings of ICMC 1993, pp. 232-235.
[6] Rowe, R. 1993. Interactive Music Systems, Cambridge, MA: MIT Press.
[7] Lerdahl, F. and Jackendoff, R. 1983. A Generative Theory of Tonal Music, Cambridge, MA: MIT Press.