Page  297 ï~~A Fuzzy Syntactic Approach to Recognising HandWritten Music Greg Watkins Computer Science Department Rhodes University, P.O. Box 94, Grahamstown, 6140 E-Mail: Abstract This paper introduces the problem of Optical Music Recognition and presents our approach to it Since this approach relies on fuzzy graph grammars, the underlying concepts of standard graph grammars, fuzzy sets and fuzzy graphs are also briefly explained. 1 Introduction With the advent of cheap computing power, computers have found widespread uses in the field of music. These range from music performance (normally using the MIDI protocol) to music publishing and form analysis. The result is that many tedious tasks which previously had to be performed by hand are now automated. An example is the transposition of a music score. However, current methods for entering music scores into a computer system are still cumbersome. Often the pitch and duration of each note are typed into a text file. Newer graphical user interfaces do not alleviate the problem much: although they are typically easier to use, they are still very slow. The aim of optical music recognition is to overcome this problem by automatically converting sheet music into a form which allows the music information to be manipulated by a computer. This will speed up the input process and drastically reduce the tedium associated with it Optical music recognition is by no means a new research field - the first attempts were made by D Prerau at MIT some twenty-five years ago [Prerau, 1975; Carter, 1988] - and it is perhaps surprising that the first commercial OMR products are only beginning to appear now. One of the reasons for the difficulty of OMR is the fact that each publisher of musical scores uses his own font and printing style. There are also a vast number of different music symbols, compared to the fairly limited number of symbols making up a textual alphabet. This is largely due to the existence of complex symbols such as beamed groups of quavers or semi-quavers. Another complication is the presence of stave lines which connect most of the symbols to each other. This contrasts to printed text, where the symbols (letters) are normally separated from each other by white space. 2 Outline of our approach The initial input is obtained by scanning the music sheet at a resolution of 300 dots per inch. The first processing phase removes the stave lines from the scanned image. This step is complicated by the fact that the symbols which are superimposed on the stave lines must be altered as little as possible, while the rest of the stave lines must be removed completely. Each stave line is removed by tracking along it from left to right and examining it column by column. If it appears that there is a symbol overlapping with the line at a particular point (i.e. the pixels above and below the line are set), the pixels making up that column of the line are left, otherwise they are erased (see Figure 1). x lx Top of line n NIX x x x x x x x il X x 'Bottom All blocks except those marked will be deleted of line Figure 1 Stave-line removal ICMC Proceedings 1994 297 Music Notation

Page  298 ï~~After the stave lines have been removed, the image consists of a series of disconnected music symbols. The OMR program must recognise these symbols as accurately as possible. This is, of course, the core of the problem and has been tackled in a variety of ways by the various OMR systems that have been developed. Methods used include nearest-neighbour classification [Fujinaga et al., 1992], neural networks [Martin and Bellisant, 1991], template matching and grammars [Fahmy and Blostein, 1991]. The systems developed up until now were designed to recognise fairly well-printed music, and none is able to cope with hand-written music. In contrast, one of our major aims is to design a system which can recognise hand-written as well as printed music. The difficulty in recognising hand-written music stems from the fact that it does.not, in general, follow many of the strict rules observed in printed music (e.g. note stems might not be vertical and note heads might not actually touch the stems). This implies that any system developed to deal with hand-written music must be based on more general principles than systems concerned only with printed music. We have taken a linguistic (syntactical) approach to the problem of recognising the music symbols. In this approach primitive elements (tokens) are extracted from the image and the list of tokens is then parsed according to a suitable grammar which reconstructs the musically meaningful symbols (see Figure 2 for a description of the primitives used and their attributes). This enables global knowledge to be used to solve ambiguities at a local level. This example also demonstrates the fact that crucial information is contained in the two-dimensional spacial relationships between music symbols. This means that music cannot be represented by a onedimensional (string) grammar as this would result in a loss of information. This problem occurs in a number of different pattern recognition applications and has lead to the development of various twodimensional grammars. We propose to modify the Attributed Programmed Graph Grammar model introduced by [Bunke, 1982]. The idea of using attributed programmed graph grammars for music recognition is due to [Fahmy and Blostein, 1991]. A major difference between our approaches is the fact that some of the primitives chosen by them are fairly complex (e.g. accidentals). We feel that the simpler primitives we have chosen should lead to a more robust system as they place much less burden on the low-level primitive extraction layer. 3 Attributed Programmed Graph Grammars Attributed programmed graph grammars are a twofold extension of conventional graph grammars. Firstly, a facility for controlling (programming) the application order of productions is provided, and secondly, the underlying graphs are augmented by attributes. A survey of these grammars is presented in [Fahmy, 1992]. The productions are applied to transform an initial input graph into a final output graph. In our case, the nodes in the input graph will, represent the primitives that have been extracted from the image, and the nodes in the output graph will represent the music symbols that have been recognised through the application of the productions. An attributed graph is a generalization of a standard graph, where each node and edge can have attributes associated with it. A subgraphs of an attributed graph is defined as follows: Definit"ion: An unattrlbuted (but labelled) graph (u-graph) g' is a subgraph of a u-graph g (g' g) if all nodes and all edges in g' also occur in g. Additionally, corresponding nodes and edges in g and g' must have identical labels. Given two graphs g and g', with g' g, let g-g' denote the graph which remains after removing g' from g. The edges between the subgraph g' and the Figure 2 The primitive objects and their attributes For example, the meaning of a blob in the image is dependent not only on its own attributes, but also on the relative position, orientation and length of any nearby lines. Music Notation 298 1CMC Proceedings 1994

Page  299 ï~~host graph g-g' are called the embedding of g' in g.. Each production in an attributed graph grammar maps a subgraph into another subgraph. Productions consist of four parts: " a description of the unattnibuted graphs g, and, where g, is the subgraph that will be replaced by the subgraph g,; " an embedding transformation which maps the embedding of g, in g onto the embedding of & in g' (g and g' represent the whole graph before and after the production is applied respectively); " an attribute transfer function, which maps the attributes of nodes and edges in the old subgraph onto attributes in the new subgraph; " an applicability predicate, which specifies conditions that have to be satisfied before the production may be applied. 1 1' 2 2' line line Embedding Transformation - ((1,1'), (2,2')) Attribute Transfer Function - (All (1') - All (1); All (2')- All (2)) Applicability Predicate - (Connected (1,Top(2)) OR Connected (1,Bottom(2))) AND NOT Edge (1,2) Figure 3 Sample production The production in Figure 3 creates an edge between a node representing a note-head and the node representing its stem. The diagram is a pictoral description of g1 and g (circles and lines represent nodes and edges respectively). This production creates an edge between two nodes representing a blob and a line (and labelled, accordingly, blob and tue). The embedding transformation specifies that all edges that were connected to node 1 will now be connected to node 1', and all edges that were connected to node 2 will now connect to node 2'. The attribute transfer function will give nodes 1' and 2' the attributes that were associated with nodes 1 and 2 respectively. Lastly, the applicability predicate given specifies that the blob must be connected to either end of the line and that there must not already be an edge between the two nodes (the functions Connected Top and Bottom are assumed to already exist). Unfortunately, any scanned image will contain a certain amount of noise: even in the best printed scores, noise is unavoidably introduced by the stave-line removal process. For example, it is not always clear whether a blob is hollow or solid. To solve this problem we associate certainty functions with token identification and with the productions in the grammar. For example, instead of being either hollow or solid, a blob will have a "continuous" solidity factor ranging from 0 (hollow) to I (solid). These considerations lead to the notion of fuzzin the graph grammar just described. The main advantages to be gained from fuzzifying the grammar can be illustrated by considering Figure 4. 3 Figure 4 A difficult situation It is obvious that a decision has to be made at some point as to whether the first note-head is hollow or solid. If this is done at the primitive extraction level only local information can be utilised, limiting the level of intelligence which can be applied. If, on the other hand, the choice is delayed until global information is available, a more intelligent decision can be made. Assume that the number of beats in a bar of the music contained in Figure 4 is known (say 3, i.e. ICMC Proceedings 1994 299 Music Notation

Page  300 ï~~3/4 time). Once it is known that the only other symbol in the bar is a crotchet, it becomes obvious that the first note must be a minim, Le. the blob is hollow. The use of a fuzzy grammar achieves exactly this goaL Instead of making final decisions about objects at an early stage, when a symbol is ambiguous we will merely calculate the certainty with which it belongs to different classes. The certainty of the different (and competing) interpretations will propagate through the parsing until a stage is reached where it is possible to take an. informed decision. 4 A Fuzzy Attributed Programmed Graph Grammar In order to explain the fuzzy attributed programmed graph grammar that we are developing, it is necessary to introduce a few basic concepts of fuzzy set theory [Kandel, 1982]. Definition Let X - (x} denote a space of objects. A fuzzy set A in X is a set of ordered palrs A =((x,LA(x))}, X E X where l.tA(X) is the grade of membership of x in A. l.A(x) is a number in the interval [0,1], with the grades 0 and 1 representing, respectively, nonmembership and full membership. Note that if ILA(X) only takes on the values of either 0 or 1, the fuzzy set collapses to a normal crisp set. Consider the problem of blob solidity again. We wish to partition the set of all blobs into two subsets: one for hollow blobs and one for solid blobs. Since there is no crisp dividing line between hollow and solid blobs, there is no natural way of partitioning the blobs into crisp subsets. However, they can quite naturally be partitioned into the fuzzy sets Solid and Hoilow by defining Its(x) and Its(x) to be the degree to which any blob x belongs to each of these subsets. The crisp string grammr has been extended to a fuzz string grammar by associating a membership (certainty) value with each production. A fuzzy language over an alphabet A is a fuzzy subset of A*, whereas a crisp language over A is a crisp subset of A* [Pal, 1987]. In other words, instead of each string being either in a fuzzy language or not, it belongs to the language to a degree specified by a real value in [0,1]. The membership of a string derived through a given chain of productions is determined by the certainty values of the productions in that chain, for example by taking their minimum value. If a string is generated by more than one production chain, its membership in the language is determined from the certainty values of all the chains, typically by taking the maximum. The graph grammar formalism described above can be modified to cater for fuzziness. If the applicability predicate associated with each production can return a value between 0 and 1, this value can be interpreted as the certainty (membership) value of the production. The grammar thus produced is the natural extension of fuzzy string grammars to two dimensions. More formally, an attributed programmed graph grammar is defined as follows: Definition: An attributed programmed graph grammar is a 7-tuple G=(V,WA,B,P,S,C) where " V and W are alphabets for labelling the nodes and edges respectively; " A and B are finite sets of attributes for nodes and edges respectively; " P is a finite set of productions; " S is a set of initial graphs; " C is a control diagram over P. A production is a 5-tuple P=(G,G,,T,7i,F) where ' 01 and G, are unattributed graphs, the leftand right-hand side, respectively; " T is the embedding transformation; " it: P ->(truefalse) is the applicability predicate (i' is the set of all attributed graphs); " F is the attribute transfer function. The fuzzy version of the grammar is obtained by replacing IT in the above with S.t:->[0,1], where [0,1] is the range of real values from 0 to 1, extremes included Figure 5 contains a few sample productions in this gra~mmar. If a bar in the music score being examined contains ambiguity, more than one output graph will be produced by the fuzzy graph grammar. Each different possibility for that bar of music will be represented by a separate graph. In general, the Music Notation 300 ICMC Proceedings 1994

Page  301 ï~~final graph with the highest certainty factor is the one most likely to be correct. It is immediately apparent that there is a danger of the parse tree expanding very rapidly, especially for a noisy input image. This makes it practically impossible to traverse the entire parse tree, because of both the time and space this would require. Fortunately the need for a full traversal can be obviated by a suitable heuristic that allows a bestfirst traversal For example, if MINIMUM is used for the fuzzy AND operator, the paths found at any one time can be kept ordered on their current certainty values. The tree can then be searched by simply expanding the tip of the path with the highest certainty value. 5 Current Progress A prototype implementing the system described in this paper has been developed, and initial testing has begun. Productions have been written to describe a simple class of music scores containing one part per stave. The implemented system can be viewed as a series of units, where each unit takes its input from the previous one, and outputs the data to the next one. The three main stages are, of course, stave-line removal, primitive extraction and parsing. Primitive extraction is broken down into three further stages: first the image is thinned to simplify subsequent line detection, then the lines are detected and removed, and lastly all remaining connected regions in the image are classified as blobs. This produces a raw description of the primitives (e.g. a line is stored as a list of the coordinates of all its constituent pixels). This raw data is transformed into the list of attributes associated with each primitive object. The resulting list of primitives and attributes can then be parsed to yield the graph representing the music coded in the original image. Because it is important that the productions in the graph grammar be modified easily (to experiment and to cope with a larger subset of music symbols), we decided to define a language that allows one to express them as naturally as possible. (The only slight deviation from this 'naturality' is the request of defining the subgraphs represented graphicaly in Figure 3 as a list of text statements, which obviously simplifies the job of parsing the production definition language.) The language is implicitly defined through a list of YACC productions. Production 1 # map a blob to a notehead GL: Nodes - {(1,blob)} Edges = {} GR: Nodes - {(1,notehead)} Edges -={} Embed= {(1,1)} Attrib: {_I.centre =_.centre; _1.solid - _1.solid;} Apply = {match (headsize, 1.size)} Production 2 # map a line to a notehead GL: Nodes - {(1,line)} Edges = {} GR: Nodes = {(1,notehead)} Edges - {} Embed - {(1,1)} Attrib: {_1.centre = midpoint(_1); 1.solid 1;1 Apply - {(match (headlength, length(1)) + match (headthickness,1.thickness) + match (headslope, slope(1))/3} Production 3 # map 2 lines to a note with a broken head GL: Nodes - {(1,line), (2,line)} Edges - {} GR: Nodes - {(1,note)} Edges -{} Embed- {(1,1), (2,1)} Attrib: {_1.pitch - Pitch(midpoint(_1)); if (length (_2)>2) 1. duration -32; else 1. duration -64; } Apply - { (very close (findsl(top(1) ), top(1)) + very close (findsl(bot(1)), bot(1) ) + MAX (very close (findsl(bot(2)), bot(2)) + close (bot(1),bot(2))), (very close (findsl(top(2) ), top(2)) + close (top(1),top(2))))/4} Figure 5 Sample productions ICMC Proceedings 1994 301 Music Notation

Page  302 ï~~A few productions written in this text form are shown in Figure 5. As it can be seen from this example, the definition of the subgraphs and the embedding transformation are relatively easy to process. Since the attribute transfer function and the applicability predicate are pseudo-C functions, they are directly translated into C functions and compiled with a standard C compiler. References [Bunke, 1982] H. Bunke. Attributed Programmed Graph Grammars and Their Application to Schematic Diagram Interpretation. IEEE Transactions on Patterns Analysis and Machine Intelligence, V.4, N.6, pp. 574-582, 1982. [Carter et al., 1988] N.P. Carter, R.A. Bacon, and T. Messenger. The Acquisition, Representation and Reconstruction of Printed Music by Computer. A Review. Computers and the Humanities, V.22, N.2, pp. 117-136, 1988 [Fahmy and Blostein, 1991] H. Fahmy and D. Blostein. A Graph Grammar for High-level Recognition of Music Notation. Proceedings of First International Conference on Document Analysis, V.1, Saint-Malo, France, pp. 70-79, 1991 [Fahmy and Blostein, 1992] H. Fahmy and D. Blostein. A Survey of Graph Grammars: Theory and Applications. Proceedings of 11th International Conference on Pattern Recognition, V.2, pp. 294-298, 1992. [Fujinaga etal. 1992] L Pujinaga, B. Alphonce, and B. Pennycook. Interactive Optical Music Recognition. Proceedings of the 1992 International Computer Music Conference, pp. 117-120, 1992. [Kandel, 1982] A. Kandel. Fuzzy Techniques in Pattern Recognition, Wiley Interscience, New York, 1982. [Martin and Beilisant, 1991] P. Martin and C. Bellisant. Low-Level Analysis of Music Drawing Images. Proceedings of the First International Conference on Document Analysis, V.1, pp. 417-425, Saint-Malo, France, 1991. [Pal and Majumder, 1987] S. Pal and D. Majumde. Fuzzy mathematical approach to pattern recognition, Wiley Eastern Limited, Calcutta, India, 1987. [Prerau, 1975] D.S. Prerau. DO-RE-MI:A Program that Recognises Music Notation. Computers and the Humanities, V.9, pp. 25-29, 1975. Music Notation 302 ICMC Proceedings 1994