On SOM based time-series analysis and motivation of generic methods of musical pattern recognition

Kalev Tiits
Centre for Music and Technology, Sibelius Academy, Box 86, 00251 Helsinki, Finland
email: kalev.tiits@siba.fi

Abstract

For several years, my aim has been to make effective use of the learning system known as the self-organizing map (Kohonen 1989, 2001), often called SOM, for local boundary detection in time-series data in a musical context. Interest in the matter has produced a treatise on the subject (Tiits 2002a), as well as an experimental piece of software for the UNIX environment, which searches for local boundaries in a given time-ordered data set. This paper describes the outline of this endeavour, and is intended to give context to another paper, which is oriented towards more specific detail (Tiits 2002b).

1 Generic methods for music segmentation - why?

The motivation for using plain mathematical or computational formalisms to segment a musical melody is perhaps not quite obvious - after all, musical form is a rather specific topic, considered to require musical expertise. Would it therefore not make much more sense to use knowledge of music theory as a basis for music analysis, instead of doing away with it and using general data processing techniques?

The choice of generic methodology as the subject matter has been motivated by a search for methods applicable to many different musical idioms. Building an analysis model upon music theory oriented information has the following problem: it tends to bind the procedure to favor certain kinds of structures, but may not be effective in determining whether there are arbitrary kinds of structures to be found in the data at all. This is particularly true of idiomatic or genre specific knowledge. Moreover, most concepts of music theory seem to be idiomatic or genre specific.
Musical idiom oriented knowledge may help a recognition system perform well in a particular situation, but it will also reduce the generality of the solution. In search of generality, some scholars have built models of analysis based on more general Gestalt principles or other measures grounded in music psychology. A recent example of such a project is the Local Boundary Detection Model (LBDM) of Cambouropoulos (2001). Such research projects, though related and interesting in their own right, are not targeted at quite the same objective as the one at hand.

The present objective is to outline an alternative way of local boundary detection using more purely formal means. These means connect with unsupervised learning machine architectures. The project at hand is foremost an experiment on a formalism, which is a derivative of one rather simple operating principle. It describes a general time series processing method, or a signal processing method, which can be applied to music, but might as well be applied to any extra-musical information flow containing time structures with redundancy.

Generic pattern recognition methods have the obvious downside of not making use of all the available information in the case of music-originating data. On the other hand, excluding music theory oriented presuppositions minimizes the role of genre in the results of analysis. This is where the role and relevance of abstract methodology lies. It approaches a mode of listening familiar to a large majority of music audiences - that of a 'generic unprofessional listener'. This is an added bonus of formal time series methods as compared to more specific and music-oriented considerations.
Indeed, paradoxical though it seems, it would be conceivable to have an abstract system for melodic analysis that is applicable even to non-musical data sets, the only requirements being a time order of data elements and the existence of more or less consistent time patterns in the data.

2 Distinguishing features

The project discussed in this paper aims to bring about a computing method suitable for local boundary detection in musical melody, with as few assumptions as possible concerning the style, genre or grammar of the particular piece. In this case the goal was pursued using - instead of a Gestalt rule set - an unsupervised learning system, which seeks to adapt to a time-ordered data set retrieved from the melody being analyzed.

The data set is subjected to simple crosstable comparison. For carrying out the comparison, the data set is cut into short feature vectors. In crosstabling, a feature vector at a particular time location of the data set is compared against each of the other feature vectors, corresponding to all other possible time locations. The learning mechanism is embedded in the similarity metric of feature vectors. Briefly put, the similarity measure is derived from the reaction of a self-organizing map (SOM) to a specific feature, in the situation where the teaching data for the map has been the time series being analysed. Thus, the grounds of the segmentation are derived from the knowledge retrieved from the very data being segmented, through the behavior of the self-organizing map.

The self-organizing map used here has required slight changes to the original architecture as conceived by Kohonen. It seemed likely from the start that this would be the case, since the SOM architecture was obviously not planned for analyzing free time sequences in the first place. The reaction of the map in the present application is computed using the afterimage phenomenon, which was discovered while studying the behavior of the map. It is called afterimage because of its resemblance to the well-known retinal afterimage, which results from staring at a pattern for a lengthy period. Prolonged staring results in a neural phenomenon where a visual image remains for a while after the stimulus that caused it has already disappeared.
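The crosstable comparison described in this section can be sketched roughly as follows. This is only an illustration under stated assumptions: the windows overlap with a step of one element, the pitch values are invented, and plain Euclidean distance stands in for the SOM-derived similarity measure, which is not specified here. The names `feature_vectors`, `euclidean` and `crosstable` are hypothetical and are not taken from SOMAT.

```python
import math

def feature_vectors(series, dim):
    """Cut a time-ordered series into overlapping windows of length dim.

    Assumption: one window per time location, step of one element.
    """
    return [tuple(series[i:i + dim]) for i in range(len(series) - dim + 1)]

def euclidean(a, b):
    """Stand-in distance; the paper derives its measure from a SOM instead."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def crosstable(vectors, distance=euclidean):
    """Compare the vector at every time location with every other one."""
    return [[distance(u, v) for v in vectors] for u in vectors]

# Invented MIDI-style pitch numbers: a three-note figure, repeated, then new material.
pitches = [60, 62, 64, 60, 62, 64, 67, 65]
vectors = feature_vectors(pitches, dim=3)
table = crosstable(vectors)
```

In this toy input the windows at locations 0 and 3 are identical, so the corresponding cell of the table is zero; any other distance function with the same two-argument signature could be passed in place of `euclidean`.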
The behaviour, which I have called SOM afterimage, is the extent of excitation caused by a pattern not on the excitation focus tuned for this particular pattern, but on some other node of the SOM, which has been tuned to react to some other feature. An interested reader can find a more thorough explanation of it in the thesis of the present author (Tiits 2002a). Through crosstable comparison of pattern distances computed using afterimage strengths, the process produces a quantity describing the tendency of each feature vector to differ from the preceding material. I have called this quantity transience. Some other authors have used the term segregation for a rather similar concept (Tenney and Polansky 1987).

3 Empirical part: the test runs

The afterimage phenomenon in SOM was an idea that could only be tested with software developed specifically for this purpose. Consequently, a significant part of the project was to develop an application of SOM that would incorporate it. The software was written in the C language for the UNIX environment, and was named SOMAT. There are several parameters controlling the process in SOMAT, such as the size of the network (number of nodes). Other important parameters are the number of teaching cycles used in the adaptation part of the process, and the number of components in the feature vectors, corresponding to the number of dimensions in the vector space used for distance measurement.

Several pieces for solo flute were chosen as test material. They included parts of Paul Hindemith's Acht Stücke für Flöte allein, Debussy's Syrinx, and a part of Kaija Saariaho's three-part piece Canvas. This type of music was used because of the simplicity of monodic lines. The material was fed into SOMAT as preprocessed MIDI data. This is naturally a reduction in itself, since a portion of the information is disregarded. Still, it was felt that a sufficient amount of the original information would remain to make the tests worthwhile and interesting.
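To make the roles of the parameters named above (network size, teaching cycles, number of feature vector components) more tangible, here is a minimal sketch of a SOM with an afterimage-like distance and a transience curve. It is not the SOMAT implementation, which was written in C and whose afterimage computation is defined in the thesis rather than in this paper. The training loop is the standard SOM update rule, `afterimage_distance` is only one possible reading of "excitation on a node tuned to some other feature", and `transience` is assumed here to be the mean distance to all preceding vectors. All function names and the pitch data are hypothetical.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_node(weights, v):
    """Grid position of the node whose weights lie closest to v (its focus)."""
    return min(weights, key=lambda node: sq_dist(weights[node], v))

def train_som(vectors, rows, cols, cycles, seed=0):
    """Standard SOM training; rows*cols nodes, 'cycles' passes over the data."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    lo, hi = min(map(min, vectors)), max(map(max, vectors))
    weights = {(r, c): [rng.uniform(lo, hi) for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for cycle in range(cycles):
        frac = 1.0 - cycle / cycles                  # decays from 1 towards 0
        rate = 0.5 * frac                            # learning rate
        radius = max(rows, cols) / 2.0 * frac + 0.5  # neighbourhood width
        for v in vectors:
            focus = best_node(weights, v)
            for node, w in weights.items():
                grid_d2 = (node[0] - focus[0]) ** 2 + (node[1] - focus[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                for i in range(dim):
                    w[i] += rate * h * (v[i] - w[i])
    return weights

def afterimage_distance(weights, a, b):
    """Assumed measure: how far each pattern lies from the node tuned to the other.

    Note that this need not be zero for identical patterns.
    """
    d_ab = math.sqrt(sq_dist(a, weights[best_node(weights, b)]))
    d_ba = math.sqrt(sq_dist(b, weights[best_node(weights, a)]))
    return 0.5 * (d_ab + d_ba)

def transience(vectors, distance):
    """Assumed definition: mean distance of each vector to all preceding ones."""
    out = [0.0]
    for t in range(1, len(vectors)):
        out.append(sum(distance(vectors[t], vectors[s]) for s in range(t)) / t)
    return out

# Invented pitch data: a repeated figure followed by contrasting material.
pitches = [60, 62, 64, 60, 62, 64, 72, 71, 69, 67]
vectors = [tuple(pitches[i:i + 3]) for i in range(len(pitches) - 2)]
som = train_som(vectors, rows=4, cols=4, cycles=40)
curve = transience(vectors, lambda a, b: afterimage_distance(som, a, b))
```

Because `transience` takes the distance as an argument, the same curve can be computed with any other measure, which is convenient for comparing the SOM-based measure against a plain baseline.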
4 Results and discussion

The general impression of the performance of the SOMAT system is that it performs remarkably well considering the narrow slot of data given to it. For instance, the fact that an increase in the dimensionality of the vector space enhanced the performance, just as it should have, supports the feeling that the thinking that went into building the system was sound. In settings such as the one described in this paper, SOMAT probably cannot outperform rule-based systems in every way. But it is to its capability of adaptation that the examples of Acht Stücke, Syrinx and Canvas can be said to give strong, though not completely unambiguous, positive support.

The example of Canvas also brought up the inevitable limitations of the system. Though not strictly a 12-tone piece, it is still based on a serialist style of construction. Thus it was characteristically the most symbolic, in the semiotic sense, of the chosen melodies, and definitely the least thematic, and as such remains partly out of scope for SOMAT.

To what extent the system is able to handle different variation techniques is not determined by these experiments. The best way to reach any judgement on such matters would probably be the construction of artificial sequence examples instead of real musical material. Then it would be possible to isolate the effects of multiple influences on recognition and concentrate on one variation technique at a time. On the other hand, due to the holistic nature of the system, the performance in a real musical situation might in any case differ from such artificially constructed tasks. Also, it was felt that the point which makes this study belong to musicology instead of computer science is the very use of real music-originating data, and the choice of material which may be expected to challenge the performance of the method on musical grounds.

A glance at some of the results of the computer runs follows, using one of the works mentioned. Initially, a map size of 18 x 18 was used, equaling 324 nodes in total. Both three- and six-dimensional vector spaces were experimented with. These tests were originally run on NeXT workstations. The computing power was thus limited, and there was no practical possibility of examining the behavior of larger networks or higher-dimensional spaces. In the model, each feature vector component represents a 1/32 note, or demisemiquaver. The top curve in figure 1 represents the original melodic contour, and the ones below it the output of SOMAT: a transience value at the corresponding part of the time series. Rather surprisingly for someone familiar with SOM implementations, somewhat intelligent responses were achieved with as few as 30 to 50 teaching cycles. This is solely due to the use of the afterimage mechanism, since a standard SOM is reported to require orders of magnitude more teaching.

Figure 1: SOMAT input and output in graphical form for Paul Hindemith's Acht Stücke nr. 6 with SOM size 324 nodes. a) Original signal (melodic contour). b) Test run with 3-component feature vectors. c) Test run with 6-component feature vectors.

Later some additional runs were made using a significantly more powerful computer system, allowing the mechanism to be studied with more teaching cycles and larger SOM networks. The input signal for these test runs was taken from Hindemith's Acht Stücke nr. 6. Figure 2 presents the resulting signal for a network of 2500 nodes. Yet more test runs were made with the 2500-node network using a larger number of teaching cycles and space dimensions. The outputs of these runs are displayed in figure 3.

Figure 2: SOMAT input and output signals for Acht Stücke nr. 6 with a larger SOM. a) Input signal. b) Output signal of a network with 2500 nodes.

A general observation about the behavior of the larger network is that it seems to produce more peaks with the small amount of training that was given. This is in accord with common sense, for it is easy to suppose that a larger number of processing elements is apt to produce more noise and to require more teaching cycles to achieve a state of organization equivalent to that of a smaller network.

Figure 3: SOMAT input and output signals for Acht Stücke nr. 6 with a larger SOM, more teaching and higher dimensionality of the feature vector space. a) Input signal. b) Output signal of a network with 2500 nodes and 300 teaching cycles. c) Output signal of a network with 2500 nodes, 300 teaching cycles and 10-dimensional vector space.

Evaluation of these last examples confirms the notion that putting more teaching cycles into the larger network will produce a more organized state. But it also indicates that a larger network does not actually do any good, all other parameters including the input signal remaining the same, unless more teaching cycles are used in training the system. The output signal depicted in figure 3 c also suggests that using longer feature vectors gives more precision to the process, since the patterns represented are more specialized.
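One way to picture why longer feature vectors represent more specialized patterns is to count how many distinct windows a melody yields at different vector lengths. The sketch below assumes a melody given as (pitch, duration) pairs with durations counted in 1/32 notes, matching the resolution of the feature vector components stated earlier. The note list is invented and the helper names `sample_melody` and `windows` are hypothetical; the vector lengths 3, 6 and 10 are those used in the test runs.

```python
def sample_melody(notes):
    """Expand (pitch, duration) pairs into one pitch value per 1/32 note.

    Assumption: durations are whole numbers of 1/32 notes.
    """
    series = []
    for pitch, duration in notes:
        series.extend([pitch] * duration)
    return series

def windows(series, dim):
    """Overlapping feature vectors of dim components, step of one element."""
    return [tuple(series[i:i + dim]) for i in range(len(series) - dim + 1)]

# Invented melody: a short figure stated twice.
notes = [(60, 4), (62, 4), (64, 8), (60, 4), (62, 4), (64, 8)]
series = sample_melody(notes)

# Number of distinct patterns seen at each feature vector length.
distinct = {dim: len(set(windows(series, dim))) for dim in (3, 6, 10)}
```

With this toy input the 10-component windows fall into more distinct patterns than the 3-component ones, so each pattern covers fewer time locations. This is offered only as one reading of "more specialized", not as the author's own analysis.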

It would be informative to study these examples with the score. The test run of figure 3 c locates the major division of form at the beginning of bar 7. If this result is taken as an approximation that includes the last two semiquavers of the previous bar, the solution would be musically credible, though it is not located as nicely at the golden section as the return to motif a at the beginning of bar 9. The latter would be the point where I would personally be inclined to place a major division, given the task of segmentation. From the musical point of view, the division at bar 7 (or actually at the end of bar 6) is somewhat plausible in any case.

One impression resulting from the test runs is that the performance of the system cannot be fully estimated except in the light of actual musical examples. Tests on artificial data can be instructive in testing some special cases of the behavior of the system, whereas more general observations tend to be more valid when the data used is taken from real world cases. Making such observations requires musical competence.

The study of SOM afterimage based pattern recognition is not yet exhaustive and will continue. A more thorough testing program is needed, also incorporating distance measuring engines other than SOM afterimage, to provide an appropriate benchmark.

References

Kohonen, Teuvo (1989). Self-Organization and Associative Memory. Third edition. Berlin, BRD: Springer-Verlag.

Kohonen, Teuvo (2001). Self-Organizing Maps. Berlin, BRD: Springer-Verlag.

Tenney, James and Larry Polansky (1987). Temporal Gestalt Perception in Music. Journal of Music Theory 24:2 (205-241).

Tiits, Kalev (2002a). On Quantitative Aspects of Musical Meaning. A model of emergent signification in time-ordered data sequence (Academic dissertation). Helsinki, Finland: University of Helsinki. Forthcoming.

Tiits, Kalev (2002b). A proposal for vector similarity metric in pattern distance measurement with musical time series. Preprint.