Page  00000511 A Music Information Retrieval system for structural queries Alberto Pinto Dipartimento di Informatica e Comunicazione Universith degli Studi di Milano, 20135 Milano, Italy pinto @ Abstract Existing solutions for searching for appropriate music content rely on methods like string edit distances or functional approaches (as for the symbolic level) or simple features extraction methods (as for the audio level) that cannot provide structural similarity informations. Our recent theoretic results in similarity measures for music information retrieval rely on graph theory and provide new tools for extract characterizing structural quantities from musical data (both symbolic and audio) to be implemented within an integrated query environment. The improvements regard the use of the graph model of musical data at different levels, the XMLformatfor the representation of musical work and the new user interfaces for the navigation of musical source material. Such a system takes advantage of organizing, managing and utilizing information of a heterogeneous set of music source material through the XML multilayered structure. 1 Musical metrics Everyday experience tells us that the most outstanding search engines deal with extraneous metadata like author's names, dates of publication, etc. Music Information Retrieval (MIR) research produced numerous methods and tools, but the design of complete, usable working systems is still in its infancy. The urgent need for such systems is strongly felt, in particular by cultural heritage institutions that possess large music holdings. Up to date we adopted different musical similarity metrics that could be applied to music contents. We implemented this kind of approach in a software module which has been providing an effective environment for querying music scores in the Musical Archive Information System (MAIS) of the italian theatre "Teatro Alla Scala", providing an integrated support for both standard and content-based queries. 1.1 The functional approach In this context musical data (such as melodies, rhythms, spectra, etc.) are finite sequences and therefore distance concepts are often inherited by metrics defined for functions (functional metrics). Given two continuous real valued functions f (t) eg (t) defined on [m, n], there are different kind of possible metrics: d(f,g) = sup{ f(t) - g(t) }I (sup) (1) and d(f, g) jm f(t) g(t) dt (amplitude) (2) Figure 1: Comparison of melodic fragments using the functional model Generalizing the concept, it is possible to make an average 511

Page  00000512 of the higher order derivatives, like in the Sobolev metric: d(f, g) = (f(t)i - g(t)i)2 i=o (3) 1. build a representative graph for every musical data 2. work with graphs instead of musical data In this way, it is possible to recognize a greater number of relevant structural similarities and we can reduce the number of objects wich should be compared. Moreover, musical objects with the same representative graph can be identified. where i is the order of derivative. An analogous metric is the L1-version of the previous metric, normalized respect to the number of derivatives: S, fV(ti - g(ti] d(f, g)=n, (4) then, weighting difference functions, we obtain the metric: Z a(i) L--i i=0 (5) where i is the order of difference function on M and N. Starting from i = 1 implicates the exclusion of the elements of M and N, so for example, in melodic sequences tranposition have zero distance. a(i) represents a weight function indexed by the order of difference function n < L - 1, where L is the length of M and N. The weight function establishes the importance of each derivative order. This point of view may represents a problem for musicologists interested to structural similarity in music because functional metrics induce a lot of musically disappointing results due to the lack of an intrinsic structural coding. In fact, how may be possible to find similarities among the melodic sequences of Figure 2? I I I I I I I Figure 3: Graph representation of a melodic sequence. The graph-construction can be applied to melodic sequences and also to audio files. This section illustrates the structural analysis method in two of the possible layers: symbolic (melodic incipits) and audio (overall structural analysis), because of the similar process of segmentation they undergo. 1.2.1 Symbolic level * Segment a music fragments into notes a note sequence * build up a representative graph for each music fragment, whose nodes are the pitch values and whose arrows are the intervals; * Compute the graph similarity function 6. 1.2.2 Audio level * Segment an audio file into a note sequence through the pitch representation, which uses autocorrelation to estimate the main frequency component of each frame; * build up a representative graph for each music fragment, whose nodes are the pitch values and whose arrows are the intervals; * Compute the graph similarity function 6. Figure 2: Identically structured themes 1.2 The graph-theoretic approach In order to manage the structure of musical data we took advantage of the typical mathematical structure which describes relations among objects: graphs. In figure 3 the graphtheoretic approach is sketched for a simple 6-notes melodic sequence. Its essence is the following: 512

Page  00000513 1.3 Similarity function This is the notion of similarity function between graphs which provides an estimation of the "distance" between two musical objects. Let M and M' be two music objects with representative graphsG = G(M) and H = H(M). Given r e N, the r-order similarity function is: Music Scores Audio Files: I:!PCM audio codes Fragment... PCM audio codes | ^ 1 M Music Scores o(M, M') = max Sa, i==1 G\Hi\Hi-l G (6) where Ho = 0, ao are positive coefficients which depend upon the weights assigned to the different trail lengths in the power graphs H2 and (f varies among all the possible isometries of the vertex sets. So to summarize, the first step is the classification of themes by their representative graphs. The next step is mining graphs in a database isomorphic to the given one, and this can be done through graph invariants in order to reduce the computational cost of this operation. The most useful invariants are: Query report Figure 4: General architecture 1. Graph order and size 2. Graph complexity 3. Graph sequences 2 The overall architecture The most innovative features of our architecture is that it provides tools for efficient storage of structural musical data and for performing content-based queries on such data. The overall architecture of the Musical Data Management module is illustrated in Figure 4. The module consists of two main environments: the Musical Storage Environment and the Musical Query Environment. The musical storage environment has the purpose of representing musical information in the database, to make query by content efficient. The musical query environment provides methods to perform query by content on music scores, starting from a score or an audio fragment given as input. The matching between the input and the scores stored into DB is performed in several steps, graphically illustrated in Figure 4. The input can be either an audio file or a score fragment, played by the user on a keyboard or sung or whistled into a microphone connected to the computer. * From the audio files, note-like attributes are extracted by converting the input into a sequence of note-numbers, i.e. the concatenation of pitch and duration of each input note. Such step is performed by the Symbolic Music Code Extractor module (Figure 4). The conversion uses a pitch-tracking algorithm based on conventional frequency-domain analysis. If the input is entered from a keyboard or it is a score fragment, conversion is not necessary and the sequence can be directly built. * The Structural Feature Extractor converts acoustic input first into an audio feature sequence and then into its related graph representation. * The graph similarity function 6 is computed. Hence, comparing the input with a given score has been re-conducted to graph matching. 3 An XML multilayer environment for music MX is an XML environment where the structural metadata can be stored and whose multilayered structure allows the integration of audio and symbolic features of music. Music information is represented according to a multi-layered structure and to the concept of space-time construct. In fact, music information can be structured by a layer subdivision model, as shown in Fig. 5. 513

Page  00000514 The main advantage of MX is the richness of its descriptive features, which are based on other commonly accepted encodings aimed at more specific descriptions. The multi-layered music information structure is kept together by the concept of spine. Spine is a structure that relates time and spatial information. Through such a mapping, it is possible to fix a point in a layer instance (e.g. Notational) and investigate the corresponding point in another one (e.g. Performance or Audio). General I-usl LoEa Symb olic Music Information LgicrlStuu Notational Perfornranc e j \t~ Audio MIDI net. The structural level of information was developed to contain the explicit description of musical objects and their causal relationships, both from musicological and compositional points of view. That is to say musical objects can be described as transformations of previously described musical objects. Those objects are usually the results of segmentation processes made by different musicologists together with their own different musical points of view, or also by an automatic score segmenters. In this framework it is evident that the introduction in this layer of MIR-oriented metadata, for example attributes of objects as Theme for music indexes, could be extremely useful. Here we have shown the introduction of melodic invariant quantities. The same could be done for similar invariants related to the graphs derived from audio. 4 Conclusions and future work The system we described allows to organize, manage and utilize information of a heterogeneous set of music source material. This work is a development of the previous information system developed in the context of project for "Teatro Alla Scala" of Milan. The improvements regard the use of the graph model of musical data at different levels, the XML format for the representation of musical work and the new user interfaces for the navigation of musical source material. Such a system takes advantage of organizing, managing and utilizing information of a heterogeneous set of music source material through the XML multilayered structure. Further work must be done on the preprocessing phase such as symbolic feature extraction from complex audio. XML music format representation is still under development in the context of standardization process of MAXWG. The approach to structural music information retrieval by graphs can be developed towards the enrichment of the invariant class. Figure 5: (a) Music information layers and (b) relations among them The Structural layer contains explicit descriptions of music objects together with their causal relationships, from both the compositional and musicological point of view, i.e. how music objects can be described as transformation of previously described music objects. An example of a particular structural object is Theme which represent exactly the concept of musical theme of a the musical piece (or part of it) under consideration. Theme objects may be whether the output of an automatic segmentation process or the result of a musicological analysis. A content-based information retrieval method must take advantage of the presence of new metadata specifically oriented to retrieval in order to improve processes and algorithms. The goal is the integration of audio and symbolic contents into the same framework through a search engine for music files that can handle standard XML music metadata (such as author, title, album, etc.) as well as new metadata provided by MX such as structural informations, analysis, harmony, theme, melody, rhythm, etc. and structural audio features extracted from audio formats like PCM, MP3 and AAC. As for the audio format, we're going to support both queries on PCM material and queries in the compressed domain of MP3, which is the most common audio format on the Inter 514