Towards a Universal Algorithmic System for Composition of Music and Audio-Visual Works

Jeremy L. Leach, School of Mathematical Sciences, Bath University, Bath, BA2 7AY, England. E-mail: jll@maths.bath.ac.uk

Abstract

This paper presents a universal composing system capable not only of automatically generating likeable music, but also of generating animated visual sequences synchronised with the music. The system is based on general theories of human perception with respect to the temporal domain.

1 Introduction

This century has seen many attempts to analyse and formalise aspects of musical structure (Bent & Drabkin 1987). In some cases this work has led to systems capable of automatically assigning structural analyses to pieces of music (Jackendoff & Lerdahl 1983). More ambitiously, others have attempted to produce formal systems which can automatically generate music (Todd 1991). Most of the latter have had only limited success. The reason for this, I would suggest, is that they appeal to no fundamental understanding of the function of music. Although by no means a trivial task, I believe that the development of such an understanding, even if incomplete, is of the utmost importance. Without it, we cannot even define objectively what the goal of algorithmic composition should be. In this paper, therefore, I attempt to explain the function of music, which, I believe, is related to the way the brain has evolved in order to process sensory information.

It has often been said that music mimics the way natural processes change with respect to time (Peitgen & Saupe 1988, Leach & Fitch 1995). However, even if this is true, it must still be asked why music brings us so much pleasure. We know that the human brain has evolved to process sensory information derived from the kind of changes which occur in nature. By making relationships between events and then using them as templates to predict the future, we give ourselves a survival advantage.
This has been commented on by other authors (Baddeley 1993). I would suggest that the reinforcement of these relationship templates could be linked to the sensation of pleasure. This would explain why repetition is so prevalent in music. Secondly, it seems plausible that continually repeating a relationship would result in a diminishing reinforcement effect. This in turn would be linked with the relationship being perceived as less and less pleasurable. In general this means that for a piece of music to be pleasurable it must make a compromise between presenting many relationships and ensuring that these relationships are both perceivable and reinforced a number of times.

[Figure 1: Edges / transitions perceived in groups of regions.]

2 Visual & Auditory Perception

In order to make relationships between entities, it is necessary to perceive each entity as being distinct in some way from the others. Now, in the physical world, matter is more often than not grouped together in some form. Our visual perception has evolved to make use of this, and in general, objects are identified by detecting edges or discontinuities around otherwise homogeneous regions in the visual field. This was demonstrated by Hubel & Wiesel (1962), who discovered neurons in the cat's brain which are activated at edges or discontinuities. Furthermore, it is also known that edges can be perceived at a higher "hierarchical" level and serve to separate groups of discrete regions (see figure 1). This type of separation was observed by the Gestalt theorists.

Leach 320 ICMC Proceedings 1996

[Figure 2: Joseph Haydn, Sonata in G Major Hob. XVI:G, first movement, bars 1-4 and 8-11.]

Both of the above types of perception can be seen to be evident in music. For example, the musical term "note" is useful precisely because it refers to an auditory stimulus, more often than not with a reasonably constant amplitude, and almost invariably with a constant pitch. Its edges with respect to amplitude are the attack and decay. If, however, there is little change in amplitude between notes, then a rapid change in frequency serves as an edge / transition. It is indeed this fact that allows us to represent music as a discrete sequence of symbols (i.e. the musical score). This is significant when you consider that performed music is in actuality a continuous phenomenon with respect to time, and there is no physical reason why it should fall so readily into groups.

Once again, but this time on a higher level, we find that auditory perception mimics the visual. In a study into pitch groupings (Deutsch 1982) it was found that memory for the pitches of notes was strongest if they occurred "close" to the beginning or end of a melodic sequence. Hence the "edges" of a typical melody are perceptually more important than the notes in the middle. In general, the relationship template that is memorised for a particular group of regions consists of the relative differences between those regions. Again, if we continue this hierarchically, then the relative differences between groups of regions form a relationship template for a group of groups. In this general hierarchical case, though, we have the opportunity to reinforce the parts of the templates that are the same in each group, while creating a new hierarchically superior template for all the parts of the templates that differ from group to group.
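This template idea can be illustrated with a minimal sketch (my own construction, not the paper's notation; the two motifs are hypothetical and pitches are MIDI note numbers):

```python
# A group's "relationship template" is taken here to be the sequence of
# relative differences (intervals) between its regions.
def template(pitches):
    return [b - a for a, b in zip(pitches, pitches[1:])]

group1 = [60, 62, 64, 60]   # hypothetical motif: same opening...
group2 = [60, 62, 64, 67]   # ...but a different ending

t1, t2 = template(group1), template(group2)

# The portions of the two templates that agree are reinforced...
shared = [x for x, y in zip(t1, t2) if x == y]
# ...while the differing portions form a hierarchically superior template
# describing how the groups differ from one another.
superior = [(x, y) for x, y in zip(t1, t2) if x != y]

print(shared)    # [2, 2]
print(superior)  # [(-4, 3)]
```

The differing final interval is thus attributed to the group-level context ("being the second sequence") rather than contradicting the shared template.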
A musical example of this is given in figure 2, where we recognise the beginning of the second sequence and so expect the ending to be the same. However, it is different, and so this difference is attributed to the temporal context of "being the second sequence". Figure 3 gives another example of the same phenomenon. These are both essentially "A B A C" forms. This behaviour is best described with the phrase "hierarchical context sensitivity", since the behaviour of a small element is determined by the context of the hierarchically superior group to which it belongs. Elements which are not sensitive in this way behave identically regardless of the context of any larger group and are therefore reinforced.

[Figure 3: C.P.E. Bach, Sinfonie in D Major, H.663, first movement, bars 1-4, Second Violin.]

3 Edges in Nature and Human-built Forms

Edges in general form a transition from one region to another in the visual field. In the case where these regions physically interact, the transition between the two is rarely instantaneous. It is usually characterised by a complex organisation of relatively smaller regions. For example, between land and sea, the transition often consists of dunes, beach and surf. Similarly, in biology the cell wall provides a complex interface between the inside and outside of the cell. This seems a hallmark of organisation in nature, and our perceptual systems have evolved to expect it. In most cases the transition provides a testimony to the incompatibility of the regions: the large and complex edge is needed to allow a stable transition from one state to another. Human art and design mimics this in many ways. Ornate picture frames provide a transition from the world of the painting to the completely different rules and behaviour of the surrounding environment (most probably a painted wall). Door-frames act similarly.
In music too, transitions from any stable aspect of rhythm, melody or theme can usually themselves be broken down into much smaller regions of limited but stable behaviour.

4 Homogeneous Regions and Transitions between them: Their Relative Complexity

Since we have established that transitions are more complex than regions, it makes sense to classify "musical" forms in terms of relative complexity. For example, a piece of music is more complex than the silence that precedes and follows it. Therefore a musical work should preferably be defined as "a transition from one silent region to another", rather than the other way round. Another example is the musical note, which should also be defined as "a transition from one silent region to another", since the timbral information in a tone is clearly more complex than silence. However, at the same time, the attack of a tone is more complex than the relatively steady state of the tone that follows it, and thus in this case the attack should be defined as "a transition from a region of silence to a region of non-zero amplitude (the relatively stable tone)". Thus paradoxically, the tone plays

the role of a transition in one situation, and a region in another. These two can co-exist, though, and are hierarchically related. The transition that is the tone between two regions of silence becomes, at a smaller temporal scale, a region of relative simplicity between the transitions of attack and decay. Figure 4 illustrates this.

[Figure 4: In (a) the main tone is the transition, yet seen on a smaller time scale in (b) the main tone becomes a stable region when compared to the attack and decay transitions which flank it.]

Thus any model of music should allow the decomposition of a transition into many smaller regions, themselves separated by transitions.

5 The Essentials of the Algorithm

The model I propose here combines the characteristics of all the previously discussed phenomena to result in a system capable of accounting for the majority of organisation in music. The generation of the music takes place within a single hierarchical data structure. At any point in the generation, this structure takes the form of several hierarchically arranged layers of elements. Each element is one of two possible types: a transition or a region. Each layer consists of a sequence of alternating element types. Thus, ideally, any musical sequence can be represented by an instance of such a layered structure. All compositions are generated from the same initial data structure. This consists of three elements: region, transition and region. These correspond, respectively, to the silence before the piece starts, the piece itself, and the silence after the piece ends. The algorithm then proceeds through successive iterations until the generation is complete.
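The initial data structure can be sketched concretely as follows (a minimal sketch; the class and function names are my own, not from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    """One element of a layer: either a homogeneous region or a transition."""
    kind: str                                     # "region" or "transition"
    children: list = field(default_factory=list)  # its expansion in the next layer down

def initial_structure():
    """The common starting point of every generation: the silence before
    the piece, the piece itself (a transition), and the silence after it."""
    return [Element("region"), Element("transition"), Element("region")]

layer1 = initial_structure()
print([e.kind for e in layer1])   # ['region', 'transition', 'region']
```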
At each iteration a new layer is created underneath the bottommost layer in the generation. This layer is created by expanding each transition from the previous layer into a sequence of regions, each separated by a transition. Hence at any stage in the process, the bottommost layer represents the most detailed temporal description of the generation. In addition, at any iteration, each region may subdivide into a number of smaller identical regions, separated from each other by a transition.

[Figure 5: A schematic of a typical subdivision process. Horizontal regions represent homogeneous regions, vertical bars represent transitions. Time runs from left to right. The initial generation is that of layer 1. The width of any region is indicative of its perceptual complexity.]

A typical example might be that of figure 5, where after the first iteration the initial layer has a second layer added to it. This second layer represents the subdivision of the piece of music into a three-"movement" piece. After the last iteration we see that in layer 5 there are many regions of short duration. We might expect the most complex in this example to represent notes (i.e. those regions in the figure with the thickest horizontal lines). If the iteration continued, then the most complex regions might be the attacks and decays of notes.

However, to allow sensitivity to hierarchical contexts to affect the instance of the generation produced, and indeed to allow context sensitivity in general, we need to provide the model with more information. In fact, one of the most important aspects of the perception of a piece of music is that the listener usually has an intuitive sense of where the piece is in terms of beginning, middle or end.
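One plausible reading of this iteration step can be sketched in Python. The flanking convention, in which an expanded transition begins and ends with a smaller transition so that every layer stays strictly alternating, is my assumption, as are all names:

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    kind: str                                     # "region" or "transition"
    children: list = field(default_factory=list)

def expand_transition(t, n):
    """Expand a transition into n regions, separated (and flanked) by
    smaller transitions, recording them as the transition's children."""
    t.children = [Element("transition")]
    for _ in range(n):
        t.children += [Element("region"), Element("transition")]
    return t.children

def next_layer(layer, n=2):
    """Build the new bottommost layer: transitions expand; regions here
    simply persist (they could equally subdivide into identical copies)."""
    new = []
    for e in layer:
        if e.kind == "transition":
            new += expand_transition(e, n)
        else:
            new.append(e)
    return new

# Layer 1 -> layer 2 of figure 5: the piece splits into three "movements".
layer1 = [Element("region"), Element("transition"), Element("region")]
layer2 = next_layer(layer1, n=3)
kinds = [e.kind for e in layer2]
print(kinds)   # nine elements, strictly alternating region / transition
```

Iterating `next_layer` then yields progressively more detailed temporal descriptions, exactly as in the schematic of figure 5.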
The way we handle this is by representing each transition by a complete cycle (through an angle of 2π). In this way, when the transition is subdivided in the subsequent layer, the resulting n transitions each represent a transition through a different portion of the cycle (figure 6). Thus, to create an "A B A C" structure as in figures 2 and 3, all that is needed is a three-layer structure, where the top transition is subdivided into two transitions (second layer), and each of these is itself subdivided into two (third layer). The second transition of each pair then needs to be sensitive to the context of its parent transition (i.e. either beginning the cycle or ending it), whereas the first transition need not (hence the first transition of each pair develops identically). Thus this system handles both general context sensitivity and hierarchical context sensitivity.

[Figure 6: In this diagram we have a similar example scenario, except that now each region has an angle associated with it and each transition has a magnitude associated with it, determined by the portion of the complete cycle that it traverses. Solid lines represent angles 0 to π; dashed, angles π to 2π.]

6 Extension to Audio-Visual Generation

The algorithm was then extended to handle abstract visual information much as pitch, dynamics and timbre are organised in music. Thus the resulting algorithm produces a temporally indexed hierarchical structure which contains all the surface information for both audio and visual streams. This provides a more immersive sensory experience, and can be considered a complete audio-visual composition system. The rationale behind this extended implementation rests with the knowledge that music can create pleasure purely by virtue of its internal structure. The fact that this structure is often abstract suggested that it would be worth investigating the effect of analogous structures on the other senses. The visual field offered a good opportunity, since many perceptual phenomena are shared by the two senses. The basic surface properties used to exhibit visual information include shape, colour, size and location of three-dimensional forms.
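Before turning to the surface mapping, the "A B A C" construction of section 5 can be made concrete with a short sketch (the function names and the equal-arc subdivision are my own assumptions, not the paper's specification):

```python
import math

def split_arc(arc, n=2):
    """Subdivide a transition's arc of the full 2*pi cycle into n equal parts."""
    lo, hi = arc
    step = (hi - lo) / n
    return [(lo + i * step, lo + (i + 1) * step) for i in range(n)]

# Layer 1: the whole piece is one transition through the complete cycle.
top = (0.0, 2 * math.pi)
# Layer 2: two transitions, each through half the cycle;
# layer 3: four transitions, each through a quarter of it.
layer3 = [sub for half in split_arc(top) for sub in split_arc(half)]

def label(i, arc):
    """The first transition of each pair is context-insensitive, so it
    develops identically ('A'); the second is sensitive to whether its
    parent half begins the cycle ('B') or ends it ('C')."""
    if i % 2 == 0:
        return "A"
    return "B" if arc[1] <= math.pi else "C"

form = "".join(label(i, arc) for i, arc in enumerate(layer3))
print(form)   # -> ABAC
```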
These surface properties are handled in a way analogous to auditory surface properties, in that relationships are expressed through variation of their values with respect to time. The two streams are only considered separable at the time of rendering, where the audio is handled by custom software and the rendering of the visual structure is achieved through the use of the Open Inventor toolkit on the SGI platform.

7 Conclusion

The model discussed here provides the basis for a general algorithmic composition system which can generate typical musical forms by appealing to basic paradigms of visual and auditory perception. This allows the generations to display arbitrarily complex organisations of abstract information which embody the kinds of change that our perceptual and memory systems are receptive to by virtue of the brain's evolution. The system's overall function is to create a hierarchical structure which, when observed (in its most general sense) temporally, presents a maximum number of novel relationship templates whilst ensuring that each of them is reinforced. The musical forms that have been generated by the system to date have been subjectively judged by many to be pleasing, and the accompanying visual system seems to convey well the abstract qualities which are traditionally associated with music alone.

References

[Baddeley, 1993] Baddeley, A., Working Memory and Conscious Awareness, in Theories of Memory, edited by A. Collins et al., Lawrence Erlbaum Associates Ltd., 1993.
[Bent and Drabkin, 1987] Bent, I. and Drabkin, W., Analysis, Macmillan Press, 1987.
[Deutsch, 1982] Deutsch, D., The Psychology of Music, Academic Press, Inc., 1982.
[Hubel and Wiesel, 1962] Hubel, D.H. and Wiesel, T.N., Receptive Fields, Binocular Interaction and Functional Architecture in the Cat's Visual Cortex, Journal of Physiology, 1962, Vol. 160, pp. 106-154.
[Jackendoff and Lerdahl, 1983] Jackendoff, R. and Lerdahl, F., A Generative Theory of Tonal Music, Cambridge, Massachusetts, USA, MIT Press, 1983.
[Leach and Fitch, 1995] Leach, J.L. and Fitch, J.P., Nature and Music, Computer Music Journal, MIT, 19(2), 1995, pp. 23-33.
[Peitgen and Saupe, 1988] Peitgen, H. and Saupe, D., The Science of Fractal Images, Springer-Verlag, 1988.
[Todd, 1991] Todd, P., Music and Connectionism, MIT Press, 1991.