ICMC Proceedings 1995, pp. 412-419

Rhythmic Pattern Processing using a Self-Organising Neural Network

Simon C. Roberts and Michael Greenhough
Department of Physics and Astronomy, University of Wales College of Cardiff
PO Box 913, Cardiff, CF2 3YB, Wales, UK
Telephone: (+44) 1222 874458 Fax: (+44) 1222 874056
e-mail: S.Roberts@astro.cf.ac.uk

ABSTRACT: The real-time processing of a continuous sequence of event onsets by a self-organising artificial neural network is discussed. The network encodes inter-onset interval patterns in a manner consistent with many principles of human rhythm perception. Important rhythmic patterns are identified, which makes the network useful for determining a rhythm's grouping structure. The network fundamentals are described, followed by network simulation results.

1 Introduction

Human listeners spontaneously organise a continuous sequence of events [Han89]. This paper discusses an attempt to model this behaviour using a self-organising artificial neural network known as SONNET 1, which was developed by Nigrin [Nig93]. The network's task was to encode patterns of inter-onset intervals (IOIs) from a continuous sequence of event onsets, in a manner which reflected aspects of human rhythm perception. In this paper, an IOI is defined as the time interval between consecutive onsets.

SONNET 1's capacity for discovering patterns of time intervals has undergone little investigation. The network has previously been used for learning melodies [Pag93, Pag94], but these consisted only of very simple rhythmic structures. SONNET 1 has also been applied to the detection of rhythmic repetition [RG94], but that implementation gave little consideration to the modelling of human rhythm perception. It is hoped that the present implementation of SONNET 1 will advance the computational modelling of human rhythm perception by assisting in the determination of a rhythm's grouping structure.
2 SONNET 1 overview

SONNET 1 is a self-organising artificial neural network which uses unsupervised learning. Its architecture consists of two highly interconnected fields of cells: F1 and F2 (shown in Figure 1). Input items are applied to F1, which acts as a short-term memory (STM), storing the most recent items. Each F1 cell has an activity which is fed forward to each F2 cell via bottom-up links. The F2 cells compete to represent the F1 patterns. Excitatory weights on the bottom-up links to each F2 cell store the long-term memory (LTM) representation for the cell's associated pattern. Learning is the process of adapting these weights so that they accurately encode the appropriate pattern. The bottom-up input to an F2 cell increases as this encoding improves, causing the cell to gain a considerable activation. This in turn enables more substantial LTM weight changes.

Figure 1: SONNET 1 architecture.

An F2 cell is said to be committed to a pattern after it has been fully encoded by the corresponding LTM weights. A committed F2 cell can chunk out its pattern from the STM by resetting the activities of the relevant F1 cells [Pag93]. The F2 field self-organises to form a masking field, which allows an F2 cell that represents a long pattern to mask out a cell which represents a shorter pattern.

SONNET 1 was implemented with each F1 cell representing an event onset, i.e. a note attack, drum beat, etc. Any F1 cell could represent a particular onset, so multiple interacting links connected each F1-F2 cell pair to make the network insensitive to the spatial dimension of the F1 pattern. These link interactions were modified to allow embedded patterns to be recognised. Further details of SONNET 1 technicalities are given in [Nig93, Pag93, Pag94].

3 Time-driven SONNET 1 versus event-driven SONNET 1

Previous work investigated the detection of rhythmic repetition using an event-driven SONNET 1 network [RG94]. Each F1 cell represented a particular IOI, and the network state was only updated during a fixed time period after the presentation of each IOI. The current paper discusses the application of a time-driven SONNET 1 network to the processing of rhythmic patterns. In this implementation, each F1 cell represented an event onset, so that IOIs were stored by the relative pattern of F1 activities (as explained in Section 4). In addition, all of the cell activities and connection weights were modified continuously, hence the term "time-driven".

The event-driven network (EDN) had a number of disadvantages when compared to the time-driven network (TDN) for processing IOI sequences. Many of these were due to the fact that the EDN had a discrete time-representation whereas the TDN had a continuous time-representation. The EDN required more pre-processing, as an input IOI needed to be quantised to an interval that was represented at the F1 field, and then converted to a place in the field to cause the appropriate F1 cell to activate. This representation also demanded that the input IOIs were known prior to processing.
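The EDN pre-processing step described above can be illustrated with a minimal sketch. It is not taken from [RG94]: the grid of represented intervals, the function names and the nearest-neighbour rule are all assumptions made purely for illustration.

```python
# Hypothetical set of intervals (in seconds) represented at the F1 field
# of an event-driven network; the values are illustrative only.
GRID = [0.25, 0.5, 0.75, 1.0]

def quantise_ioi(ioi, grid=GRID):
    """Return the index of the represented interval closest to `ioi`."""
    return min(range(len(grid)), key=lambda i: abs(grid[i] - ioi))

def edn_input(iois, grid=GRID):
    """Map each IOI to the (assumed) F1 cell index that would be activated."""
    return [quantise_ioi(ioi, grid) for ioi in iois]

# A nominal 0.5 s IOI, a slightly lengthened one and a strongly lengthened one.
print(edn_input([0.5, 0.52, 0.64]))
```

In this sketch the IOIs of 0.5 s and 0.52 s fall into the same cell, whereas 0.64 s is closer to 0.75 s and therefore selects a different cell, so the timing detail of the input is discarded before the network sees it.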
The most important drawback for the EDN was that the quantisation prevented it from being exposed to expressive timing. Music psychologists state that expressive timing communicates an intended musical structure, so it is essential that SONNET 1 can process expressive timing. The TDN has this ability. Further, for the EDN, the occurrence of a relatively large timing deviation could cause the activation of a different F1 cell from the one which was intended. This would prevent the EDN from recognising a familiar pattern. The TDN's ability to recognise familiar patterns which include timing deviations depends on the vigilance parameter, which controls the coarseness of SONNET 1's classifications. Also, to allow a variety of sequences to be processed, the EDN required more F1 cells than were necessary for the TDN. A larger F1 field hinders real-time operation because computation time increases with network size.

For the EDN, the encoding of IOI patterns depended only on the frequency of pattern occurrence and context. The TDN was able to exploit timing information more thoroughly to bias its encodings. It was more likely to encode patterns which followed relatively long IOIs and patterns which were composed of relatively short IOIs. Both of these properties are consistent with human rhythm perception [Fra82].

4 STM storage of IOI patterns

The time-driven SONNET 1 network was implemented with each F1 cell representing an event onset, such that any one of the inactive F1 cells fired when an onset was presented. After firing, an F1 cell's activity continually increased so that earlier onsets were associated with greater activities. Each F1 cell tripled its activity during a time period equal to a tactus-span, i.e. the time between beats on an intermediate metrical level which is perceived as being the most prominent level [LJ83]. (The tactus-span for each input sequence was determined from experiments where listeners tapped out the underlying beat to each sequence.)
IOIs were thus represented by the relative activity levels which corresponded to consecutive onsets, so that an IOI with a duration of a tactus-span was represented by activity levels which differed by a factor of 3. Smaller factors represented shorter IOIs and larger factors represented longer IOIs. The activity of an F1 cell was reset to zero after it had exceeded a pre-determined threshold.
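The relation between activity ratios and IOIs can be made concrete with a short sketch. It assumes that the F1 activity grows exponentially after firing, which is one simple reading of "tripled its activity during a tactus-span"; the actual SONNET 1 dynamics are given in [Nig93]. The initial activity of 1.0 and all function names are hypothetical.

```python
import math

GROWTH = 3.0  # growth factor per tactus-span, as stated in the text

def activity(age, tactus, initial=1.0):
    """Activity of an F1 cell `age` seconds after firing.
    Exponential growth and `initial` are assumptions of this sketch."""
    return initial * GROWTH ** (age / tactus)

def ioi_from_ratio(ratio, tactus):
    """IOI (seconds) encoded by the activity ratio of two consecutive onsets."""
    return tactus * math.log(ratio, GROWTH)

def reset_threshold(depth_in_tactus, initial=1.0):
    """Threshold at which a cell has been active for `depth_in_tactus` spans."""
    return initial * GROWTH ** depth_in_tactus

# Two onsets separated by 1.0 s, observed 0.5 s after the second one,
# with a tactus-span of 0.5 s.
tactus = 0.5
ratio = activity(1.5, tactus) / activity(0.5, tactus)
print(ratio, ioi_from_ratio(ratio, tactus))
```

Under this assumption the ratio between two cells stays constant as both activities grow, so the stored IOI does not change while the pattern ages; only the reset threshold limits how long it is retained.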

This threshold therefore determined the STM depth. In the SONNET 1 simulations (discussed in Section 6), the threshold was set so that an F1 cell remained active for 6 tactus-spans after firing. This produced an STM depth of about 3 s when the tactus-span corresponded to the preferred tempo (approximately 0.5 s), which agrees with the "practical limit" of human short-term storage [Fra82].

SONNET 1's STM storage was equivalent to a window which continually slid along the input sequence. The F1 representation for a waltz sequence (alternating half-note and quarter-note IOIs) is illustrated in Figure 2. In this example, the tactus-span is assumed to equal the duration of a quarter-note and, for simplicity, the reset threshold gives an STM depth of 3 tactus-spans. Note that although the figure shows the F1 state at discrete instants in time, the F1 activities change continuously. Table 1 shows the onset pattern and corresponding IOI pattern represented by the F1 activity at each presentation stage displayed in Figure 2.

Figure 2: Sliding window F1 activity representation for a waltz sequence, shown at 4 stages during event presentation. Each stage advancement corresponds to an elapse of time equal to a tactus-span (quarter-note). The ith F1 cell is labelled si. After firing, an F1 cell triples its activation during a tactus-span. A cell is reset after it has been active for 3 tactus-spans.

Stage   Onsets held in STM        Corresponding IOI pattern
1       [entry not recoverable]   half-note, quarter-note
2       [entry not recoverable]   quarter-note
3       [entry not recoverable]   quarter-note, half-note
4       [entry not recoverable]   half-note, quarter-note

Table 1: Sliding window STM storage of a waltz sequence. Successive rows show the input pattern held in the STM after successive tactus-span (quarter-note) time-shifts. Each input pattern is displayed in the form of onsets and IOIs.
For the onset patterns, each character position represents a tactus-span; "1" represents an onset occurrence whereas "-" signifies the absence of an onset.

Referring to the F1 state at stage 1, the activities of s1 and s2 differ by a factor of 9, thus representing a half-note, and the activity of s2 is 3 times greater than that of s3, thus representing a quarter-note. The oldest IOI is represented by the ratio of the two greatest activities, so the half-note precedes the quarter-note as shown in Table 1. The cell s1 is reset immediately after stage 1 and the activities of the other two cells continue to grow, so only the quarter-note is stored at F1. A new onset is presented at stage 3, causing s1 to fire. The F1 field involves the same relative activities at stage 4 as at stage 1, so the IOI patterns at these times are identical.

Humans perceive a relatively long IOI as a pause which separates contiguous rhythmic patterns [Fra82]. Consequently, the first onset of a "perceptual group" is often immediately preceded by a long IOI. This is also true for SONNET 1's encodings, owing to the sliding-window STM and the fact that F2 cells were reset whenever an F1 reset occurred (which prevented incorrect learning). So for the waltz sequence, the F2 cells would have more time to accumulate substantial activations after the onset starting the half-note was reset than after the reset of the onset starting the quarter-note. As a result, more substantial LTM weight changes would occur in response to a pattern commencing with a quarter-note, causing the encoding of such a pattern. Therefore, the network would be more likely to encode the IOI pattern shown at stage 3 in Table 1 than that shown at stages 1 and 4. This is consistent with human rhythm perception.

5 Processing rhythmic patterns with SONNET 1

This section briefly describes network design considerations and modifications which are specific to the processing of rhythmic patterns.

(i) Tempo sensitivity
The cell activities and connection weights were continuously modified so that the network's operation was dependent on absolute time. This was beneficial because human rhythm perception is dependent on tempo. Nevertheless, the F1 dynamics were dependent on the tactus-span for each input sequence, and thus needed to be controlled by a beat-tracking signal. This was necessary to ensure that a familiar pattern would be recognised at different tempi (within limits). However, no beat-tracking system was utilised for the network simulations (described in Section 6); instead, the F1 dynamics were contrived so that an F1 activity increased by the required factor of 3 during a time period equal to a tactus-span. (A neural "foot-tapping" system has been devised to supply a beat-tracking signal, but has yet to be implemented with SONNET 1.) Note that the overall network behaviour remained sensitive to tempo because the F2 dynamics were dependent on absolute time.

(ii) Maximum encoded IOI pattern length
The maximum encoded pattern length was restricted to be in accordance with human rhythm perception.
Fraisse [Fra82] states that listeners perceive groups constructed from 2 to 6 "sounds", and Handel [Han89] gives the maximum group length as 4 "elements". Therefore, the maximum pattern length was set to 5 onsets, so that an F2 cell could represent a pattern of up to 4 IOIs. This limitation was enforced by restricting the number of bottom-up links that connected each F1-F2 cell pair.

(iii) Grouping by temporal proximity
The excitatory weights on the bottom-up links were initialised to small random values. The range over which the weights were randomised affected the initial network performance. A narrow range gave patterns involving relatively short IOIs an initial advantage, thus biasing the network to perform grouping by temporal proximity. Therefore, a narrow range was used in the network simulations (discussed in Section 6).

(iv) Limiting the influence of long IOIs
Section 4 described how SONNET 1's operation caused relatively long IOIs to act as pauses, separating IOI patterns. Generally, SONNET 1 learned more easily from sequences which included pauses, and longer pauses allowed greater LTM weight changes. Although humans use pauses to form rhythmic groupings, the influence of a pause does not continuously increase with its duration, because IOIs greater than 1.5-2 s destroy the sense of cohesion between successive onsets [Fra82, Han89]. To comply with this, a modification was introduced to prevent excessive network state adaptations when very long IOIs were presented. In addition, this modification prevented excessive initial learning in response to the start of a sequence, and facilitated the determination of a reasonable learning rate for processing a variety of sequences.

(v) Chunking-out constraint
SONNET 1 encoded patterns of IOIs by responding to event onsets. When a pattern was classified, the appropriate onsets were chunked out of the STM. However, the last onset of an encoded pattern was not chunked out, to avoid loss of information about the following IOI. This constraint was required because an onset identified both the end of the previous IOI and the start of the next.

6 SONNET 1 Simulations

SONNET 1 was implemented to process 10 sequences which were derived from a variety of rhythms. These rhythms differed in length, complexity, tempo, time signature and IOI content (i.e. some rhythms only included relatively short IOIs whereas others included long IOIs). A network consisting of 37 F1 cells and 20 F2 cells was used for each sequence, with the same network parameters. This number of F1 cells was chosen so that an isochronous sequence with IOIs equal to a sixth of a tactus-span could be stored in the STM. This was considered to be the most rapid onset presentation which would need to be processed. A low learning rate was used to allow recurring patterns to be identified.

The simulations were run in pseudo real-time. Ten cycles of each sequence were presented in a continuous manner, such that the first IOI of a cycle followed directly on from the last IOI of the preceding cycle. The F2 cells could respond to the initial IOIs of the first cycle for 6 tactus-spans before the F2 activations were reset. This was because it took this time for the F1 cell which represented the first onset of a sequence to be reset, causing the F2 cells to be reset (as explained in Section 4). In addition, F2 cells with small bottom-up excitatory weights could only respond to the oldest onsets stored in the STM. As these weights were initialised to small values, all of the F2 cells initially responded to the oldest onsets. So the initial IOI pattern for each sequence enjoyed substantial learning and the network thus exhibited a primacy effect, which is consistent with human perception [Fra82].
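The figure of 37 F1 cells can be checked with a small piece of arithmetic. The sketch below is one plausible reading rather than the authors' stated derivation: it assumes that an STM window of 6 tactus-spans filled with IOIs of one sixth of a tactus-span holds 36 intervals, and that one more onset is needed to close the last interval. The function names are hypothetical.

```python
def f1_cells_needed(depth_in_tactus, iois_per_tactus):
    """Onsets held at once for an isochronous sequence: intervals plus one
    (assumes the onset at the window boundary is still active)."""
    return depth_in_tactus * iois_per_tactus + 1

def stm_depth_seconds(depth_in_tactus, tactus):
    """STM depth in seconds for a given tactus-span."""
    return depth_in_tactus * tactus

print(f1_cells_needed(6, 6))      # 37, the F1 field size used in the simulations
print(stm_depth_seconds(6, 0.5))  # 3.0 s at the preferred tempo quoted in Section 4
```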
In addition, the network favoured patterns which followed relatively long IOIs and those which incorporated relatively short IOIs (as explained in Sections 4 and 5 respectively). Frequently occurring patterns were also more likely to be encoded, which agrees with the fact that pattern repetition influences human rhythm perception [Han89].

The network's response to a sequence based on a rhythm from Mancini's "Pink Panther" demonstrates the network's general learning behaviour. One cycle of the IOI sequence is shown in Figure 3. The tempo of the sequence was such that a dotted quarter-note equalled 0.7 seconds. This duration was judged to be the tactus-span from the tapping-experiment results (mentioned in Section 4).

Figure 3: One cycle of the Pink Panther IOI sequence.

Initially, the F2 cells had extremely low bottom-up inputs until the first 5 onsets were presented. Note that this was the maximum number of onsets that an F2 cell could respond to (as stated in Section 5). The cells then began to learn the pattern of the first 4 IOIs, until the first onset was reset. As the first 3 IOIs were relatively short (less than a tactus-span), the cells gained a fair bottom-up input and began to represent a pattern starting with these IOIs, although the representation was very poor at this stage. The F2 cell with the initial bottom-up weights which were the most parallel to this pattern achieved the greatest activation.

The F2 cells then had low bottom-up inputs and activities whilst they responded to the relatively long IOIs present in the second bar of the sequence. The eighth-note at the end of the second bar began the same 3-IOI pattern which started the sequence. As the second occurrence of this pattern was preceded by a long IOI, the F2 cells had a substantial time to respond to it after the onset which started the long IOI was reset. This occurrence was followed by a quarter-note, so the cells responded to the IOI pattern labelled A in Table 2. As all of the IOIs in pattern A were relatively short, the cells received considerable bottom-up inputs. Some cells had obtained a fair representation for this pattern by the time the starting onset for the eighth-note at the end of bar 2 was reset. After the cells had responded to the IOIs in bar 3, two further occurrences of pattern A had passed, so the network had formed a good representation for it. When bar 3 was repeated (in the first sequence cycle) an F2 cell had almost fully encoded this pattern.

Label   IOI pattern                  Cycle   Duration (s)
A       [notation not recoverable]   2       1.40
B       [notation not recoverable]   2       0.93
C       [notation not recoverable]   2       0.93
D       [notation not recoverable]   4       1.40
E       [notation not recoverable]   7       1.63

Table 2: Patterns encoded from the Pink Panther sequence shown in the order in which they were first chunked out. The sequence cycle from which each pattern was first chunked out and the pattern durations are also given.

A run of very short IOIs was generally rapidly encoded. Such a run occurred in bar 4 of the Pink Panther sequence (4 eighth-notes), and in this instance it was preceded by a long IOI. Consequently, an F2 cell formed a stable representation for this eighth-note run from a single occurrence. The eighth-note run is referred to as pattern B in Table 2.

The 3-IOI pattern which occurred at the beginning of the sequence was repeated at the end of bar 5. This pattern was followed by various IOIs during the presentation of a cycle, so the network responded to it in different contexts. Also, this pattern was often preceded by relatively long IOIs. Consequently, the 3-IOI pattern became partially encoded by the network from the first cycle. This pattern is given the label C in Table 2. Generally, patterns composed of fewer than 4 IOIs were only encoded if they occurred in multiple contexts or if they included relatively long IOIs.
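As a quick arithmetic check on the durations listed in Table 2, the sketch below converts note values to seconds using the stated tempo (dotted quarter-note = 0.7 s, so an eighth-note is one third of that). Only the composition of pattern B (a run of 4 eighth-notes) is taken from the text; the helper name and the choice of eighth-note units are assumptions of this sketch.

```python
from fractions import Fraction

DOTTED_QUARTER_S = Fraction(7, 10)   # 0.7 s, as stated in the text
EIGHTH_S = DOTTED_QUARTER_S / 3      # a dotted quarter-note spans three eighth-notes

def pattern_duration(iois_in_eighths):
    """Total duration in seconds of a pattern whose IOIs are given in eighth-note units."""
    return float(sum(iois_in_eighths) * EIGHTH_S)

# Pattern B: four eighth-note IOIs.
print(round(pattern_duration([1, 1, 1, 1]), 2))  # 0.93, matching Table 2
```

On the same reckoning, the other tabulated durations of 1.40 s and 1.63 s correspond to totals of 6 and 7 eighth-note units respectively.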
Human listeners tend to perceive patterns of equal duration [Han89], so patterns including long IOIs will include fewer IOIs in total.

When an F2 cell had partially encoded a pattern, it possessed substantial bottom-up weights and was allowed to respond to patterns embedded within the STM (up until then it could only respond to the oldest onsets). Consequently, a cell with substantial weights had an extended response time to its pattern, causing the pattern to rapidly become fully encoded. As F2 cells had partially encoded the patterns A, B and C from the first cycle, they became committed to these patterns on the next cycle. Subsequent presentation of the pattern A, B or C caused the appropriate committed F2 cell to chunk it out of the STM. Table 2 shows the encoded patterns for the Pink Panther sequence in the order in which they were first chunked out.

Chunking-out caused F1 cells which represented successive onsets to be simultaneously reset. This resulted in extended learning periods for IOIs which followed chunked-out patterns, as it took longer for the activation of the F1 cell associated with the oldest onset to reach the reset threshold. So generally, the earliest encoded patterns influence subsequent processing, which is in agreement with human rhythm perception as listeners perceive unfamiliar events in relation to previously perceived patterns [Fra82]. The patterns D and E for the Pink Panther sequence (shown in Table 2) were encoded because of the influence of the earliest encoded patterns. Pattern D followed C twice per cycle and E followed A towards the end of each cycle. Note that D was encoded earlier than E because it was more common.

Chunking patterns out of the STM caused the input sequence to be segmented. Figure 4 displays the segmentation of the Pink Panther sequence on the 10th cycle. The IOIs shown in brackets were not encoded by the network and thus could not be chunked out. Most of these IOIs were relatively long, ranging from 1.17 s to 2.1 s. It was undesirable for these long IOIs to be encoded because human listeners fail to group IOIs greater than about 1.5-2.0 s [Fra82, Han89]. Figure 4 shows that the network identified the repetition of bars 1, 2 and 3 within the cycle, and also the repetition within bar 3.

Figure 4: Segmentation of the Pink Panther sequence on the 10th cycle. The IOIs shown in brackets were not chunked out of the STM.

Examining the Pink Panther sequence in Figure 3 reveals that a quarter-note IOI was always preceded by an eighth-note. Had SONNET 1 grouped these 2 IOIs together, more lower-level structure would have been discovered, and pattern D would have been encoded with its 2 IOIs in reverse order (i.e. starting with the eighth-note). This would have been beneficial as humans tend to perceive patterns terminating in long IOIs [Fra82]. However, the 2-IOI pattern could not be encoded because it only occurred in a single context, as a quarter-note was also always followed by an eighth-note. This caused the 3 IOIs to be grouped together forming pattern C.

Two other sequences which were presented to the network were based on rhythms from Ravel's "Bolero" and Holst's "Mars", which had tactus-spans corresponding to an eighth-note (0.5 s) and a quarter-note (0.6 s) respectively. Both of these sequences only consisted of short IOIs, resulting in short time periods between F1 resets. Consequently, the F2 cells could not easily accumulate large activations and thus many cycle presentations were required before patterns became encoded. This agrees with human perception as listeners require more time to discover patterns from fast sequences [Fra82]. For these two sequences, 30 cycles were presented instead of 10. The segmentations of each sequence on the 30th cycle are displayed in Figure 5.

Figure 5: Segmentation on the 30th cycle of the (a) Bolero and (b) Mars sequences.
IOIs shown in brackets were not chunked out of the STM.

The Bolero sequence began from the first eighth-note shown in Figure 5a, so strictly the first sixteenth-note triplet belongs to the preceding cycle. The figure depicts the sequence in this way because the triplet was encoded together with the following eighth-note. This was the only pattern which was encoded, and was first chunked out from the 8th cycle. This encoding was desirable because it represents a common pattern, commencing with proximate onsets and terminating with a relatively long IOI, thus respectively agreeing with the run and gap principles of human perception [Fra82]. Note that the primacy effect supported the encoding of the triplet with the preceding eighth-note (as the sequence started with an eighth-note), and that such a pattern occurred as frequently as the one which was encoded. However, the encoding of such a pattern would be undesirable because it conflicts with the run and gap principles. The actual encoded pattern would always be favoured by SONNET 1 due to its inherent bias for encoding successive short IOIs and patterns following relatively long IOIs.

The first pattern to be encoded from the Mars sequence was the grouping of the eighth-note triplet with the following quarter-note (shown in Figure 5b), which was first chunked out from the 7th cycle. The sequence started with this pattern so its encoding was supported by the primacy effect, as well as temporal proximity and the fact that it followed a relatively long IOI. After this pattern was chunked out, the network had an extended response time to the following IOIs, which resulted in the encoding of the second pattern. This pattern was first chunked out from the 11th cycle.

Lerdahl and Jackendoff [LJ83] have specified rules which define a well-formed and preferred grouping structure. Many of these rules relate to a grouping hierarchy and to musical surface events other than onsets (e.g. slurs, pitch-jumps and changes in dynamics and articulation), making an analysis of SONNET 1's results with regard to the rules inappropriate. Nevertheless, SONNET 1's general performance does agree with various aspects of the rules: only contiguous IOIs constituted groups, smaller groups were less preferred (due to masking), greater IOIs identified group boundaries, repeated IOI patterns received identical grouping (parallelism) and groups with similar lengths were formed. The last point is supported by the Pink Panther results (Table 2) as the patterns have similar durations, and also by the fact that the Mars sequence was fairly evenly subdivided (as shown in Figure 5b).

7 Concluding Remarks

The ability of a self-organising artificial neural network, known as SONNET 1, to process a continuous sequence of event onsets has been shown. The network's operation is consistent with many aspects of human rhythm perception. In general, SONNET 1 favours the encoding of patterns which occur at the start of a sequence (primacy effect), follow relatively long IOIs, occur frequently, commence with relatively short IOIs (run principle) and/or follow a previously encoded pattern. In addition, short patterns (i.e. fewer than 4 IOIs) are encoded if they occur in multiple contexts or include relatively long IOIs.

Some difficulties need to be overcome. Firstly, the optimisation of the network parameters to obtain a reasonable learning speed for a wide variety of sequences is an arduous task.
An attention signal which is dependent on the input sequence's tactus-span could be employed to modulate the learning rate. This would allow faster learning for rapid sequences without causing excessive learning for sequences including long IOIs. Secondly, some circumstances lead to the encoding of overlapping patterns, which results in inappropriate sequence segmentation. Further work will investigate the use of expectancies to prevent this behaviour. The processing of expressive timing and the investigation of a hierarchy of SONNET 1 architectures are also subjects for future work.

References

[Fra82] P. Fraisse. Rhythm and tempo. In D. Deutsch, editor, The Psychology of Music. Academic Press, New York, 1982.
[Han89] S. Handel. Listening: An Introduction to the Perception of Auditory Events, chapter 11. MIT Press, Cambridge, Massachusetts, 1989.
[LJ83] F. Lerdahl and R. Jackendoff. A Generative Theory of Tonal Music. MIT Press, Cambridge, Massachusetts, 1983.
[Nig93] A. Nigrin. Neural Networks for Pattern Recognition. MIT Press, Cambridge, Massachusetts, 1993.
[Pag93] M. P. A. Page. Modelling Aspects of Music Perception Using Self-Organizing Neural Networks. PhD thesis, University of Wales, 1993.
[Pag94] M. P. A. Page. Modelling the perception of musical sequences with self-organizing neural networks. Connection Science, 6(2-3):223-246, 1994.
[RG94] S. Roberts and M. Greenhough. The detection of rhythmic repetition using a self-organising neural network. In ICMC Proceedings, pages 125-128. ICMA, 1994.