Understanding and Learning Musical Expression

Gerhard Widmer
Department of Medical Cybernetics and Artificial Intelligence, University of Vienna, and Austrian Research Institute for Artificial Intelligence, Vienna, Austria
gerhard@ai.univie.ac.at

Abstract

The paper presents an implemented computer program that learns, from examples of actual performances, to play new pieces in an expressive way, i.e., to apply musically sensible dynamics and tempo variations. Preliminary experimental results indicate that such a complex musical skill can indeed be learned by a machine to a certain extent. However, the purpose of the project goes beyond the immediate goal of producing a musically competent program: the learning program is used as a vehicle for testing some general assumptions concerning the musical and cognitive basis of expressive performance. Starting from the hypothesis that expression is intimately tied to the structure in music as it is perceived by performers and listeners, we construct an explicit model of structural music listening and show that this model strongly influences the learnability of sensible expression rules. Insofar as parts of the model are based on two well-known theories of tonal music, the results also provide empirical support for the relevance of these theories.

1 Introduction and Motivation

This paper describes a computer program that learns expressive performance of written music. Expressive performance means variations in tempo, dynamics, etc. that are not given in the notated score. While there is a considerable number of dimensions to musical expression (rubato, dynamics, articulation, vibrato, various timbral effects, etc.), the project at its current stage limits itself to the dimensions of rubato and dynamics. However, the objective is not just to write a computer program that produces the desired output (i.e., exhibits reasonable competence).
Rather, we wish to understand what the musical basis is that enables people to produce sensible interpretations, and allows students to learn this skill. In other words, we want to understand what knowledge underlies this form of competence. Thus, the central concern in this study is with the musical and psychological plausibility of the knowledge that we are going to encode in the program. The work described here is in the tradition of a whole series of research projects that all investigated the importance of basic knowledge for learning and performing various musical tasks. For instance, in (Widmer, 1992a), we described a computer program that learned to harmonize given melodies. That program was based on a plausible model of what may be considered common musical knowledge (or listening habits, rather, as most of the knowledge ordinary listeners have about music is passive and unconscious, learned from exposure). In all these projects, we consistently adhered to the following three-step research methodology: (1) hypothesize what common musical knowledge might be relevant to the task and might, at least in part, 'explain' the phenomenon; (2) construct an explicit model of this knowledge, at an appropriate level of abstraction; and (3) test the adequacy of this model empirically by incorporating it into a learning program that uses the knowledge to learn to perform the task being studied. The impact of a given body of knowledge on the learnability of a task is an important indicator of the relevance of the knowledge and the adequacy of the chosen representation. Empirical investigations such as the one presented here are thus a viable research methodology for what Otto Laske has called Cognitive Musicology (Laske, 1988). In the case of expressive performance, the motivation and starting point for the project was the following Hypothesis: One role of expressive performance is the communication of musical structure, from the performer to the listener. 
Expression serves to emphasize structure. Consequently, we believe that expression becomes partly explainable once we know the relevant structural dimensions, how they are perceived by listeners, and how they may be made explicit by expressive variation along various dimensions. 6B.2 268 ICMC Proceedings 1993

In order to test this hypothesis, we are going to present a formal model that explicitly encodes the knowledge necessary to perceive certain types of structure in music, and that also embodies plausible knowledge about the relationship between structure and expression. We will then test whether a computer program, if equipped with this knowledge, is able to learn some important principles of expressive interpretation from example performances (and to learn these more effectively than if it did not have the knowledge model). As the model is based on two well-known theories of tonal music - Lerdahl & Jackendoff's (1983) Generative Theory of Tonal Music and Narmour's (1977) Implication-Realization Model - the experimental results will also provide some insight into the relative adequacy and importance of various aspects of these theories for expression.

2 A Qualitative Model of Relevant Musical Knowledge

This section presents the structured musical model that we constructed to test the hypothesis. It is meant to capture some of the knowledge that is relevant to understanding and 'rationalizing' observed expression patterns in actual performances. The model consists of two major parts (see Fig. 1). One part models the structural understanding of a piece of music by a listener (the "musical intuitions of the experienced listener", as Lerdahl & Jackendoff would call it). The second part of the model then embodies our specific hypotheses concerning the relations between perceived structure and expression. The intended purpose of the model is to make explicit possible relations between the musical surface of the notated piece and plausible ways of interpreting and playing it. For the moment, the model (and the learning program) is limited to melodies only; it cannot yet handle free polyphony.

2.1 The Model of Structural Hearing

The lower part of the model (Fig.
1) attempts to describe the structure of pieces of music as perceived by human listeners or performers. Essentially, this part of the model is a set of programs that take a 'raw', notated piece (a melody) and compute a structural interpretation for it. Each note is described in terms of the roles it plays in the structural interpretation. The current model identifies four dimensions of structural understanding: metrical and grouping structure, time-span reduction, and a level denoted 'process structure'. The metrical and grouping structure components are based on Lerdahl & Jackendoff's (1983) Generative Theory of Tonal Music. The metrical structure component establishes the metrical interpretation of the piece at all levels. The metrical strength of an individual note is then defined as the number of metrical levels in which the note is classified as a strong beat. The grouping structure component partitions the musical surface into contiguous segments (phrases etc.) that may be heard as units. Individual notes are then described with respect to the roles they play in such phrases. The time-span reduction component, again based on Lerdahl & Jackendoff's theory, establishes the relative structural importance of events (notes) in the piece in a recursive importance hierarchy. From the time-span reduction tree, the system can then compute rough qualitative measures of the relative importance of individual notes. The structural level denoted "process structure" in Fig.1 essentially identifies certain goal-directed and perceptually salient surface patterns. It is based loosely on some notions from Narmour's (1977) Implication-Realization Model. This part of the model detects and marks common melodic and rhythmic surface patterns like ascending or descending lines ("linear continuations"), arpeggio-like figures ("triadic continuations"), melodic gap fills, harmonic departures and returns, and common rhythmic figures (e.g., "rhythmic gap fills").
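To make the pattern vocabulary concrete, two of these surface patterns - linear continuations and melodic gap fills - could be detected along the following lines. This is an illustrative Python sketch only: the actual model is implemented as Prolog programs, and the step/leap thresholds and function names below are our own assumptions, not the paper's.

```python
def linear_continuations(pitches, min_len=3):
    """Find stepwise runs in one direction ("linear melodic continuations").

    pitches: list of MIDI pitch numbers.
    Returns (start, end) index pairs of runs of at least min_len notes
    whose successive intervals are steps (1-2 semitones) in one direction.
    """
    runs, start = [], 0
    for i in range(1, len(pitches)):
        step = pitches[i] - pitches[i - 1]
        is_step = 1 <= abs(step) <= 2
        if i - 1 > start:
            # the run already has an interval: direction must be preserved
            prev = pitches[i - 1] - pitches[i - 2]
            extends = is_step and step * prev > 0
        else:
            extends = is_step
        if not extends:
            if i - start >= min_len:
                runs.append((start, i - 1))
            # a step in the wrong direction starts a new run at i-1;
            # a leap excludes the interval, so the new run starts at i
            start = i - 1 if is_step else i
    if len(pitches) - start >= min_len:
        runs.append((start, len(pitches) - 1))
    return runs


def gap_fills(pitches, min_leap=5):
    """Flag candidate melodic gap fills: a leap (>= min_leap semitones)
    immediately followed by a stepwise run in the opposite direction,
    i.e. a line that moves back into the gap opened by the leap."""
    fills = []
    runs = {s: e for s, e in linear_continuations(pitches)}
    for i in range(1, len(pitches)):
        leap = pitches[i] - pitches[i - 1]
        if abs(leap) >= min_leap and i in runs:
            run_dir = pitches[i + 1] - pitches[i]
            if leap * run_dir < 0:  # fill runs counter to the leap
                fills.append((i - 1, runs[i]))
    return fills
```

On the gap-fill figure discussed in the text (a downward leap D5-G4 filled by the ascending line G-A-B-C-D, i.e. pitches [74, 67, 69, 71, 72, 74]), the sketch reports the ascending line as one linear continuation and the leap plus the line as one candidate gap fill.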
Shown in Fig.1 are a melodic gap fill (the leap D-G consequently filled by the ascending line G-A-B-C-D) and the ascending line as a unit in its own right (a "linear melodic continuation"). The intuition behind this is that such surface patterns are very typical in tonal music, that they are easily perceived by listeners, and that they have a direct influence on local stress or emphasis patterns applied by performers.

2.2 A Qualitative Model of the Relation between Structure and Expression

The upper part of the model in Fig.1 expresses some intuitions as to how structural and surface features may relate to expression dimensions. The arrows denote influences; they express how certain features of a musical situation may give rise to more abstract effects, which in turn influence a performer's decisions concerning how to play a piece. For instance (follow the leftmost branch in the upper half of Fig.1), the relative importance of a note certainly has an influence on how it will be played; we may plausibly expect performers to emphasize important notes in some way. In assessing the general importance of a note, we can distinguish its salience at the surface level, which again depends, among other things, on the metrical strength (derived from metrical structure), the duration (derived from the surface description)

and other factors, and the structural importance of the note, which depends on its prominence in the time-span reduction and possibly on whether it plays some role in a cadence, etc. These features are derived from the time-span reduction component. Similarly, the roles that a note plays in the phrase structure and in the process structure are relevant. These are reduced to lower-level features in a similar way.

[Fig. 1: Structure of the qualitative model. The "raw" notated piece is related, via structural roles (metrical strength, importance, role in phrase and process structure, implicativity/closure, type, duration and direction of process), to the "expression" E in {rubato, dynamics, (articulation, ...)} applied to a note.]

The crucial question is how these arrows are represented and interpreted in the model. Theoretical notions like 'importance' cannot be precisely quantified; also, our intuitions concerning the relationship between such structural notions and actual expression patterns are very vague. Accordingly, the arrows (influences) in the model are represented only as statements of qualitative dependency:

A statement q+(A, B) denotes a qualitative proportionality between parameters A and B. Basically, it says that there is a directed dependency between the values of A and B: the higher the value of A, the higher will be the value of B. In other words, A contributes positively to B. An example from our model is the relation

Page  271 ï~~q+ (metricaistrength(Note,MS), surface salience(Note,SS)). "The metrical strength MS of a Note contributes positively to the surface salience SS of the Note - the stronger metrically, the more salient, all other things being equal. " " A statement q- ( A, B) denotes an inverse qualitative proportionality between parameters A and B. The interpretation is analogous to q+: A contributes negatively to B. An example: q- (degreeof.closure( Note, C), accelerando( Note, A)). "The higher the degree of closure C of a Note (in the sense of Narmour), the lower can we expect the degree of accelerando A to be that is applied by a performer." " A statement dep (A, B) denotes an unspecific, undirected dependency between A and B. Basically, it says that A somehow depends on B, but we do not know exactly how. An example: dep( degreeofimplicativity( Note, I), directionof.process( P, Dir)): - inprocess( Note, P). 1) "The degree of implicativiy I of a Note (in the sense of Narmour) that appears in a 'process' P may depend, among other things, on the direction Dir of that process (if it does have a specified direction). " All the influences indicated by arrows in the model are formalized as such dependencies. This goes also for the influences on the target category 'expression'. For instance, the intuition that important events will usually be emphasized by a performer is expressed in the dependency q+ (importance( Note, I), crescendo( Note,X)). "The higher the importance I of a Note, the more crescendo (X) is likely to be applied to the note." In summary, what the entire model does is to relate the surface of a piece of music, through several abstraction steps (structural interpretation and abstraction in terms of effects) to our target phenomenon - musical expression decisions. As 1) The dependency statements, like all other parts of the model, are represented in the programming language Prolog. Capitalized names denote variables. 
Predicates after the ':-' operator represent conditions that must be true for the predicate before the it stands, the model is too abstract and imprecise to support expression decisions; it cannot be used by the system to decide how to play a new piece. However, it can be used to explain, a posteriori, some aspects of a particular expressive performance, with the help of methods of plausible reasoning (Collins & Michalski, 1989; Widmer, 1993a). In this way, it can also be used for effective learning. 3 Using the Model for Effective Learning In principle, learning general rules from specific cases is simple: the system could simply compare all the situations where a crescendo, say, occurred, compare these with all situations where no crescendo occurred, and formulate general rules on the basis of the observed similarities and differences. However, the potential number of such empirical generalizations is huge, and without any additional knowledge, the system has no way of distinguishing between plausible and implausible generalizations. For instance, if a sizeable number of crescendo situations happened to involve the note G, and if no decrescendo situation happened to involve a G, the system might legitimately create the generalization "Apply a crescendo whenever you are playing a G", which, of course, is absolute nonsense. It is only general knowledge about the problem that can make the learning process more intelligent. In our case, the available general domain knowledge is the musical model. However, as this model is incomplete and highly imprecise (due to the qualitative dependencies), a new learning method had to be developed that could effectively use abstract background knowledge to guide the learning process. The details of this algorithm are not important here. It has been described in the machine learning literature (Widmer, 1993b). 
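The flavor of this bias can be illustrated with a small sketch. This is purely illustrative Python (the real system is implemented in Prolog and uses the plausible-reasoning machinery of Collins & Michalski); the dictionary below encodes a few of the model's q+/q- statements, and the names and chaining rule are our own simplification: a candidate generalization "higher A implies more of target T" is considered plausible only if the model chains from A to T with a matching overall sign.

```python
# A qualitative statement is stored as (source, target) -> sign,
# where +1 stands for q+ and -1 for q-.  (Attribute names invented.)
MODEL = {
    ("metrical_strength", "surface_salience"): +1,   # q+
    ("surface_salience", "importance"): +1,          # q+
    ("importance", "crescendo"): +1,                 # q+
    ("degree_of_closure", "accelerando"): -1,        # q-
}


def chain_sign(src, dst, model=MODEL, seen=None):
    """Return the product of signs along some dependency path src -> dst,
    or None if the model contains no such path."""
    if src == dst:
        return +1
    seen = seen or set()
    for (a, b), sign in model.items():
        if a == src and b not in seen:
            rest = chain_sign(b, dst, model, seen | {a})
            if rest is not None:
                return sign * rest
    return None


def plausible(attribute, target, direction):
    """Is 'higher attribute -> higher target' (direction=+1, or -1 for the
    inverse) consistent with the qualitative model?"""
    s = chain_sign(attribute, target)
    return s is not None and s == direction
```

Under this sketch, "stronger metrical position implies more crescendo" is supported (it chains through surface salience and importance with positive sign), while the spurious "note G implies crescendo" generalization finds no path in the model and would be dispreferred.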
Basically, given a set of training examples of the concept to be learned, the algorithm searches for symbolic generalizations that consistently discriminate between positive and negative examples of the concept (in our case, e.g., between crescendo and decrescendo situations). The result is a set of general symbolic rules that characterize various sub-types of the target concept (e.g., various types of situations that call for a crescendo). For each of these symbolic rules, it also learns a numeric interpolation table, based on the training examples that are covered by the rule. When classifying a new example (deciding how to play a new musical passage), the system applies all its learned rules and uses the interpolation tables associated with the matching rules to derive a specific value for ICMC Proceedings 1993 271 68.2

the situation (e.g., the precise amount of crescendo to be applied). What is important is that the learning algorithm uses the underlying model to guide its generalization process. It will prefer generalizations that are consistent with its knowledge over generalizations that contradict it. It is in this way that the system's learning behavior will allow us to judge the adequacy of the underlying musical model. To be more precise about the learning scenario: input to the learning program are the notated scores of melodies along with actual recordings (via MIDI) of these melodies as played by some musician. Every note in such a training piece is regarded as a training example; notes played louder than average are examples of crescendo, notes below the average loudness are examples of diminuendo. Similarly, in the rubato dimension, the system tracks the average local tempo, and notes played longer than notated (with respect to the current tempo) are considered examples of ritardando, etc. The output of the learning system is a set of explicit symbolic rules (plus numeric interpolation tables) that allow the system to decide which variations to apply to new melodies.

4 Experimental Results

This section briefly presents the results of some experiments with the system. In the first experiment, training and test pieces were three simple minuets from Bach's Notenbüchlein für Anna Magdalena Bach (see Fig.2).

[Fig. 2: Beginnings of three little minuets by J.S. Bach]

All three pieces consist of two parts. The second part of each piece was used for training: they were played on an electronic piano by the author, and recorded through MIDI. After learning, the system was made to play the first parts of the three pieces (which it had never seen before). Each note in the training pieces was one example. That gives a total of 212 examples; they were used twice, once for learning rules for dynamics, once for rubato.
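The example-labelling step just described can be sketched as follows. This is a hedged illustration: the paper does not give the exact encoding, so the use of MIDI velocity as a proxy for loudness and the function names are our assumptions.

```python
def label_dynamics(velocities):
    """Label each note by comparing its MIDI velocity (taken here as a
    proxy for loudness) to the piece's average: notes louder than average
    become crescendo examples, softer notes diminuendo examples."""
    avg = sum(velocities) / len(velocities)
    return ["crescendo" if v > avg else "diminuendo" for v in velocities]


def label_rubato(notated_durs, played_durs, local_tempo=1.0):
    """Label each note by comparing its played duration to its notated
    duration scaled by the current local tempo: notes held longer than
    notated become ritardando examples, shorter notes accelerando."""
    labels = []
    for nominal, actual in zip(notated_durs, played_durs):
        ratio = actual / (nominal * local_tempo)
        labels.append("ritardando" if ratio > 1.0 else "accelerando")
    return labels
```

Each note thus yields one training example per expression dimension, which matches the count above: 212 notes, used once for dynamics and once for rubato.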
From this, the system learned a total of 58 rules, of which 14 deal with crescendo, 15 with diminuendo, 13 with ritardando, and 16 with accelerando. There are a variety of ways in which the system and the experimental results can and should be evaluated. Let us briefly discuss some of these:

The listening test: The most obvious test is to listen to the performances that the system produces after learning. In this paper, we can only simulate this graphically, so let us take a look at the Bach experiment: Fig.3a plots the dynamics of one of the training pieces (the second part of the first minuet in G major) as played by the author, and Fig.3b shows the dynamics that the system applied to one of the test pieces (the first part of the same minuet) after learning (note again that the system had not seen this piece before). The figures plot the relative loudness with which the individual notes were played. A level of 1.0 would be neutral (average loudness), values above 1.0 represent crescendo (increased loudness), values below 1.0 diminuendo.

[Fig. 3a: Part of a training piece as played by teacher (dynamics); relative loudness vs. score time in beats]

[Fig. 3b: Part of a test piece as played after learning (dynamics); relative loudness vs. score time in beats]

The reader will appreciate that the system's performance (Fig.3b) makes musical sense. For example, there is a very clear pattern of stresses on the strong beats (beginnings of measures), a pronounced crescendo tendency in ascending melodic lines, and a typical decrescendo pattern in bars with three quarter notes. Also, a comparison of Figs.3a and 3b exhibits strong similarities between the teacher's style and the way the system played after learning. The system indeed seems to have extracted the relevant regularities from the teacher's performances. Equally positive results were obtained in the tempo (rubato) dimension. For lack of space, we cannot show the corresponding plots here, but generally the system again learned to vary the tempo quite effectively. More details appear in (Widmer, 1993c).

Relative relevance of structural dimensions: It is also instructive to look at the rules learned by the system. This may give us some insight into the adequacy of the underlying music theories. For instance, an analysis of the rules learned in the Bach experiment reveals that the most important indicators or determining factors for this level of expression seem to be rhythmic and metrical features like note duration, metrical strength, etc. Also, the Narmour-type surface patterns (especially gap fills of various sorts) seem to be highly relevant, as evidenced by the high number of learned rules which refer to them. On the other hand, few rules refer to the underlying grouping structure. This is somewhat surprising, as many authors have stressed group boundaries and phrase structure as the determining factors for rubato and dynamics (e.g., Todd, 1989). In summary, looking at the learned rules seems to confirm that both the Lerdahl & Jackendoff and the Narmour theories are relevant to the issue of expression and do provide an adequate structural vocabulary (see also Widmer, 1992b). A more detailed discussion of the learned rules appears in (Widmer, 1993c).

[Fig. 4: Same test piece as played after learning without model (dynamics); relative loudness vs. score time in beats]

Importance of the musical knowledge: In order to appreciate the importance of the musical knowledge, it is instructive to run the learning program on the same training data, but without the underlying model. In this situation, the system does not know anything about musical structure and can only reason about surface features of the pieces (intervals, durations, etc.), and generalize empirically. When this experiment was performed, the results were considerably worse. Fig. 4 shows the system's performance on the same test piece after learning without the model. Comparing this result to Fig. 3b, we notice a clear deterioration. Some of the system's variations do make sense, but some run completely counter to our musical intuition (e.g., the stresses on the last notes in measures 1, 3, and 6). Similar results were obtained with the other test pieces. In this experiment exactly the same training data were used as in the previous case; the only difference was the absence of the musical model. From this we conclude that the knowledge contained in our model is not only sensible and relevant (it leads to useful and partly comprehensible learning results), but actually necessary, which again confirms our initial hypothesis.

Generality of the approach: In another set of experiments, the system was tested with jazz standards from the swing and bebop eras. The goal was to see if the computer could develop a notion of 'swing' (not as a compositional or improvisational style, but as a way to play jazz melodies). Swing is a very elusive notion, as any jazz player will readily confirm.
However, two characteristics (which are difficult to learn for classically trained musicians) are a strong tendency to stress the metrically weak events (dynamics) and an extreme distortion of the relative duration of notes, especially in sequences of eighth notes. Preliminary results are surprisingly good: given only very few pieces to learn from (standards like Thelonious Monk's Straight, No Chaser or Ellington's Satin Doll), the system exhibited a very distinct swing 'feel'. Given the vast difference between swing/bebop and playing Bach minuets, these results testify to the generality and robustness of our approach. We are also planning to experiment with music from the romantic period, where expressive gestures tend to be 'larger' and more extended in time. Our expectation is that in this type of music grouping and phrase structure will play a more important role for sensible expression. An analysis of the learned rules will enable us to verify this.

5 Discussion and Related Work

In recent years, a number of researchers have attempted to build models of expressive performance. They were guided by different goals, started from different assumptions, and used widely different modelling and description techniques. Sundberg and colleagues (1983; 1991) manually crafted a set of rules that they hypothesized would produce acceptable (or at least better) computer performances. These rules are of an extremely local nature. They do not make reference to higher-level structural aspects of the music. Interestingly, however, our system re-discovered some variations of their rules in the process of

learning (see Widmer, 1993c). This is certainly more than a nice coincidence. A system that actually learns interpretation rules from performances has been described by Katayose et al. (1990). The system uses statistical techniques (autocorrelation) to extract surface patterns from the music; these patterns are then stored, along with the way they were played. This system seems to learn quite effectively. However, it is geared directly towards performance, with little regard for the underlying knowledge aspects; thus it is of little use for cognitive musicology. Other authors have developed quantitative models of various aspects of expressive performance, like expressive timing (Todd, 1989), that conform quite well with measurements in actual performances by musicians. Again, however, the knowledge dimension has been largely ignored.

Of course, as it stands, our model has more limitations than can be discussed here. Apart from being restricted to single melodic lines, it (and the learning method) is also limited to one level of expression. It is well known that expressive variation occurs at several levels at once (Sloboda, 1985), and one of our immediate goals for further research is to extend the system to learn both local and larger-scale expressive gestures. To that end, we will have to untie the concept of expression from the level of individual notes. Also, some of the warnings expressed in (Desain & Honing, 1991) and (Desain, 1992) apply directly to the current model: we will need to introduce more global context (e.g., knowledge about the global tempo), and generally the issue of real-time behavior will have to be addressed - at the moment, the 'listening part' of the model is not restricted to processing a piece strictly from left to right. Nonetheless, it seems that the general approach is fruitful, in terms of both theoretical and practical results.
If it can be extended towards more complex types of music and towards real-time operation, the results may well be of practical use to real computer music systems.

Acknowledgments

This research was sponsored in part by the Austrian Fonds zur Förderung der Wissenschaftlichen Forschung (FWF) under grant P8756-TEC. Financial support for the Austrian Research Institute for Artificial Intelligence is provided by the Austrian Federal Ministry for Science and Research.

References

Collins, A. and Michalski, R.S. (1989). The Logic of Plausible Reasoning: A Core Theory. Cognitive Science 13(1), 1-49.
Desain, P. and Honing, H. (1991). Tempo Curves Considered Harmful. In Proceedings of ICMC-91, Montreal.
Desain, P. (1992). Can Computer Music Benefit from Cognitive Models of Rhythm Perception? In Proceedings of ICMC-92, San Jose, CA.
Friberg, A., Frydén, L., Bodin, L., and Sundberg, J. (1991). Performance Rules for Computer-Controlled Contemporary Keyboard Music. Computer Music Journal 15(2), 49-55.
Katayose, H., Fukuoka, T., Takami, K., and Inokuchi, S. (1990). Expression Extraction in Virtuoso Music Performances. In Proceedings of the 10th International Conference on Pattern Recognition, Atlantic City, NJ, 780-784.
Laske, O. (1988). Introduction to Cognitive Musicology. Computer Music Journal 12(1), 43-57.
Lerdahl, F. and Jackendoff, R. (1983). A Generative Theory of Tonal Music. Cambridge, MA: The MIT Press.
Narmour, E. (1977). Beyond Schenkerism. Chicago, IL: University of Chicago Press.
Sloboda, J. (1985). The Musical Mind: The Cognitive Psychology of Music. Oxford: Clarendon Press.
Sundberg, J., Askenfelt, A. and Frydén, L. (1983). Musical Performance: A Synthesis-by-Rule Approach. Computer Music Journal 7(1), 37-43.
Todd, N. (1989). Towards a Cognitive Theory of Expression: The Performance and Expression of Rubato. In S. McAdams and I. Deliège (eds.), Music and the Cognitive Sciences. London: Harwood Academic Publishers.
Widmer, G. (1992a).
Qualitative Perception Modeling and Intelligent Musical Learning. Computer Music Journal 16(2), 51-68.
Widmer, G. (1992b). The Importance of Musicologically Meaningful Vocabularies for Learning. In Proceedings of ICMC-92, San Jose, CA.
Widmer, G. (1993a). Learning with a Qualitative Domain Theory by Means of Plausible Explanations. In R.S. Michalski and G. Tecuci (eds.), Machine Learning: An Artificial Intelligence Approach, vol. IV. Los Altos, CA: Morgan Kaufmann.
Widmer, G. (1993b). Plausible Explanations and Instance-Based Learning in Mixed Symbolic/Numeric Domains. In Proceedings of the Second International Workshop on Multistrategy Learning (MSL-93), Harper's Ferry, WV.
Widmer, G. (1993c). Modeling the Rational Basis of Musical Perception. Submitted. Available as Report TR-93-20, Austrian Research Institute for Artificial Intelligence, Vienna.