Notation and 3D Animation of Dance Movement

Royce Neagle, Kia Ng, Roy A. Ruddle
Informatics Research Institute, School of Computing and Interdisciplinary Centre for Scientific Research in Music (ICSRiM), University of Leeds, Leeds LS2 9JT, United Kingdom

Abstract

Notational systems are used in almost all fields of study, especially for the communication and expression of ideas. This paper presents a machine-readable format for the Benesh Movement Notation and illustrates a prototype 3D dance animation system for the proposed representation. The prototype creates an animated figure that is built on an articulated structure, with inferred domain rules.

1 Introduction

Dance notations are used to archive choreographies and allow their subsequent resurrection by dancers and choreologists. The symbolic representation of a dance notation is similar to that of music, with an underlying mathematical content that lends itself well to machine representation. Many notation systems have been attempted, including Labanotation (Brown and Parker 1984) and Benesh notation (Benesh and Benesh 1993; Wilks 1980). Labanotation is a complete movement analysis. Benesh Movement Notation was designed to represent classical ballet with its inferred rules such as turnout, rounded arms and straight legs. In this paper, we focus on the machine representation and simulation of the Benesh Movement Notation, defining body shape and position in relation to classical dance. As with any other symbolic notation system (such as common music notation), there are abstract differences between the representation and the artistic interpretation (Lansdown 2000; Herbison-Evans 2001) of the notation as a performance. Both Labanotation and Benesh notation give a rich vocabulary for describing human movement.
They model structural and positional aspects of the performer at specific times and provide information on the timing and speed of movement, as well as qualitative aspects (i.e. how one moves). This paper explores the use of computer graphics to aid visualisation and animation of a movement notation. It focuses on the shape aspects and shows how basic notational data can be visualised by animating a virtual figure in three dimensions so that it "performs" a dance sequence. Throughout this paper we concentrate on the Benesh notation. The following sections describe the mapping of the notation to a file format.

[Figure 1: stave with the tempo marking Andantino]
Figure 1. Example of Benesh notation annotating tendu, glissade, relevé and pirouette.

2 Machine Representation

Benesh notation is written from left to right on a five-line stave. The limb and body movements are plotted in-stave. Instructions for direction, location and translation are written below the stave. Rhythm and phrasing information is notated above the stave. Figure 1 illustrates a simple example with a time signature (2) and tempo (Andantino) at the beginning of the stave, and 4 rhythmical symbols above the stave. The direction in which the dancer is facing and the movement pathway are symbolised by the directional symbols ( \ ) below the stave. All other glyphs in the stave represent the position the dancer is to achieve on specific time counts. This paper discusses the in-stave information for recording various positions used in classical ballet.

Stave 5: Top of head
Stave 4: Top of shoulders
Stave 3: Waist
Stave 2: Knees
Stave 1: Floor
Figure 2. Five-line stave, which forms a matrix for the human position.

For the Benesh notation, the human body is divided into the feet, knees, waist, shoulders and top of the head, and this conveniently maps onto a five-line stave (see Figure 2). As a person's arm span is approximately equal to their height, a figure with arms extended occupies a square on the stave.
To record a position or pose, the Benesh notation notes the exact locations occupied by the four extremities, the hands and feet. In addition, the position of a bend,

such as the knee or elbow, may also be required. Figure 3 demonstrates three different positional notations (see Figure 3(a)) and the corresponding human body locations (see Figure 3(b)).

[Figure 3: panels (a) and (b)]
Figure 3. (a) Sample notation (b) Mapping a figure to the sample notation.

2.1 3D Benesh Notation

Each in-stave extremity notation is defined in relation to the cardinal planes (see Figure 4). Different glyphs represent positions either in front of, level with, or behind the cardinal coronal plane (see Table 1).

[Table 1: columns Front, Level, Behind; rows Basic Glyph, Opposite Side; glyph shapes lost in extraction]
Table 1. Glyphs to define extremity position in relation to cardinal coronal plane.

[Figure 4: labels Superior, Inferior, Anterior, Posterior, Cardinal Transverse Plane, Cardinal Sagittal Plane, Cardinal Coronal Plane]
Figure 4. Human body in anatomical position with cardinal planes.

It is not necessary to state how far in front or behind, because that is dictated by the biomechanical characteristics of the performer. Adding a diagonal line to the glyph specifies that the extremity is on the opposite side of the sagittal plane. The distance of the glyph from the centre of the frame specifies how far the extremity is from the centre of the body. Other glyphs indicate bends of the knee or elbow. Examples of classical positions that require the glyphs described above are shown in Figure 5.

[Figure 5: panels (a), (b) and (c)]
Figure 5. (a) and (b): Use of head and body on the cardinal coronal plane; (c) Position of body with left leg behind the cardinal coronal plane.

2.2 Machine Readable Format

This section details the development of a machine-readable file format for Benesh notation. For convenience and readability, the format is an ASCII text file that can be edited by hand. Each frame has a starting and an ending token, FSTART and FEND, which allow the complete dance to be stored on a single line if necessary.
The end token is also used to trigger the validation of the data. The four extremities are tokenised with two alpha characters. For example, LH represents left hand and RF represents right foot. The token is followed by a set of four numbers that define the column position, the associated stave line number, the proximity in relation to the stave line (e.g. above, below) and the glyph type.

Appendage Token: Specifies which extremity is to be parsed. This includes hands, feet, knees and elbows.
Column Position: Distance with respect to the sagittal plane. Numerical specification starts from arm length, away from the centre (maximum distance without moving the body), along the coronal plane.
Stave Position: Numbered bottom up, where the lowest number represents the bottom stave line, i.e. the floor. The highest number refers to the top stave line, at the top of the head (see Figure 2).
Stave Relationship: The proximity of the glyph to its associated stave line, for example above shoulder height or below knee height. It also extends the glyph position; for example, floor positional glyphs can specify whether the foot is flat, on demi-pointe or on full pointe.
Glyphs: The position in relation to the coronal plane, specifying whether the extremity is in front of, level with, or behind the coronal plane.

As an example, Figure 2 is encoded as:

FSTART LH1412 RH1412 LF3112 RF3112 FEND

For readability, the format uses optional white space. The prototype can parse the data file with or without white space with no significant performance impact.

Lines drawn in the centre of the frame represent body and head positions. The head is notated between the fourth and fifth stave lines, the body is drawn between the third and fourth stave lines, and the pelvis is drawn between the second and third. The glyphs represent isolated rotation and bends of the spine. Translations include inclinations, turns and bends of the waist, ribcage and head. Each translation can be simplified to three states for each movement.
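The extremity encoding described above can be sketched in a few lines of Python. This is only an illustration, not the prototype's parser: the names `Extremity` and `parse_frames` are hypothetical, and the sketch assumes that each of the four numeric fields is a single digit, as in the example `LH1412`. Body and head tokens, which use a different field layout, are outside its scope.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Extremity:
    token: str     # two-letter appendage token, e.g. "LH" for left hand
    column: int    # distance with respect to the sagittal plane
    stave: int     # stave line, 1 = floor ... 5 = top of head
    relation: int  # proximity to the stave line (above, below, ...)
    glyph: int     # in front of, level with, or behind the coronal plane


# Assumption: two capital letters followed by four single-digit fields.
_FIELD = re.compile(r"([A-Z]{2})(\d)(\d)(\d)(\d)")


def parse_frames(text: str) -> list[list[Extremity]]:
    """Split the text into FSTART ... FEND frames and decode each token."""
    frames = []
    for body in re.findall(r"FSTART(.*?)FEND", text, flags=re.S):
        compact = re.sub(r"\s+", "", body)  # white space is optional
        frame, pos = [], 0
        while pos < len(compact):
            m = _FIELD.match(compact, pos)
            if m is None:
                raise ValueError(f"unrecognised token at {compact[pos:]!r}")
            frame.append(Extremity(m.group(1), *map(int, m.groups()[1:])))
            pos = m.end()
        frames.append(frame)
    return frames
```

Calling `parse_frames("FSTART LH1412 RH1412 LF3112 RF3112 FEND")` returns one frame with four `Extremity` records; the same string with all white space removed parses identically, which mirrors the optional white space of the format.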
For rotations and side bends, the states are left, straight or right. For movement along the sagittal plane, the states are forward, straight or back. It is also possible to have a combination of twists and bends. The tokens are specified with an alpha field; for example, the head is tokenised as HD. This is followed by three numerical fields representing movement along the sagittal plane, movement along the coronal plane, and rotation around the longitudinal (vertical) axis. Figure 5(a) shows an

example of rotations and bends without the extremities, which is encoded as:

FSTART HD002 BD001 FEND

3 3D Animation

With this limited set of glyphs, it is feasible to create a figure moving on the spot (without translation). The current prototype uses the classical first position and obeys classical ballet rules such as implied straight legs unless specified otherwise and 90° turnout of the hip socket. The Benesh notation provides snapshots between motions, and choreologists interpret the movement between each set position. A keyframe in an animation is a frame that explicitly defines one or more parameter values. Additional frames between two keyframes are computed by interpolating the values of the keyframes. The notation snapshots provide the parameters that set the keyframe positions for animating a virtual dancer. This section describes how the encoded data file is interpreted by the animation prototype to create each keyframe position. Screen snapshots of the prototype in action are illustrated in Figure 6. The prototype contains built-in default values for all joints in an anatomical position (see Figure 4) with two exceptions: the legs are turned out, and the arms are rounded to follow the traditional form of classical ballet.

3.1 Virtual Dancer Design

The prototype virtual figure is built on an articulated structure of rigid objects connected to each other by joints. Ratner's (1998) folder metaphor (see Figure 7) is adapted to provide an effective approach in which data are inherited from one level to the next following a hierarchical structure. Any translation in a folder affects the position of its sub-folders. No variables are needed to keep track of the directions and positioning of the joint parameters, as each parameter is linked to a joint in a parent folder. The pelvis is the root node set in the body folder, and movement from here affects all other joint nodes.
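The inheritance in the folder metaphor can be illustrated with a small planar sketch. It is not the prototype's data structure: the class `Joint`, the function `world_positions`, the segment lengths and the 2D simplification are all assumptions made for illustration. Each joint stores only a local angle, and its position is obtained by accumulating the angles of its ancestors.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Joint:
    name: str
    length: float = 0.0  # length of the segment from the parent (hypothetical units)
    angle: float = 0.0   # local rotation of that segment, radians, relative to the parent
    children: list["Joint"] = field(default_factory=list)

    def add(self, child: "Joint") -> "Joint":
        self.children.append(child)
        return child


def world_positions(joint, origin=(0.0, 0.0), parent_angle=0.0, out=None):
    """Walk the tree; every joint inherits the accumulated rotation of its ancestors.

    Angle 0 points straight up (+y); positive angles lean towards +x.
    """
    if out is None:
        out = {}
    total = parent_angle + joint.angle
    x = origin[0] + joint.length * math.sin(total)
    y = origin[1] + joint.length * math.cos(total)
    out[joint.name] = (x, y)
    for child in joint.children:
        world_positions(child, (x, y), total, out)
    return out


# Pelvis is the root; the chest hangs off it and the head hangs off the chest.
pelvis = Joint("pelvis")
chest = pelvis.add(Joint("chest", length=0.30))
head = chest.add(Joint("head", length=0.25))

chest.angle = math.radians(90)  # bend at the waist
positions = world_positions(pelvis)
```

Only the chest angle is changed, yet the head position in `positions` moves with it, which is the behaviour the metaphor is meant to provide: no separate bookkeeping is needed for the descendants of a rotated joint.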
For example, the upper body folder contains a subfolder called chest, and within the chest folder are the hierarchies for the head and arms. Rotating or bending the waist therefore moves the chest, head and arms. End effector positions (hands and feet) are an accumulation of joint parameters. Mapping the end positions to the Benesh notation requires both forward and inverse kinematic methods under different circumstances. Forward kinematics is applied when all joints are known, and inverse kinematics is used when bend-joint parameters are inferred.

[Figure 7: folder tree. Body contains Upper Body and Lower Body; Upper Body contains Chest; Chest contains Neck and Head, Left Arm and Right Arm; Lower Body contains Left Leg and Right Leg]
Figure 7. Body hierarchy folder metaphor.

[Figure 6: panels (a) and (b)]
Figure 6. Snapshots of prototype with stick figure: (a) feet in first position, arms à la seconde (b) leg held à la seconde with directional normals.

The prototype displays the notation below a 3D window, which renders a stick figure in 3D space using forward and inverse kinematics (see Figure 6). Each notation frame sets the changes of position, and the prototype updates the respective joint parameters with the new values. The software design for the virtual figure is based on Credo Lifeforms (Credo Interactive Inc. 2000), with adaptations made for the classical ballet assumptions. In our prototype, the body is set with only a lower abdomen and chest region, while Credo models have three body sections above the pelvis. The design of the notation makes this acceptable, as only one stave space is set aside for upper body movement above the pelvis and below the shoulders. A bow, a forward bend with a rounded back, is represented with notation of a forward movement of the thoracic vertebrae (upper back) and the cervical vertebrae (neck).
For movement where the back needs to be straight, the notation specifies movement

in the space between the second and third stave lines to define the parameters for the amount of rotation in the pelvis.

3.2 Applying Kinematics

Forward kinematics is an effective means of finding the position of the body for each frame where all joint angles are known. This can be demonstrated by extending the folder metaphor in Figure 7. Raising the foot to the same height as the knee requires a rotation of the hip. This rotation accumulates down the folder to include the knee and ankle joints. As ballet assumes straight legs, the hip node is the only joint whose parameter needs setting. However, if the knee bend is also notated, we descend the folder metaphor and set the angle of the hip to put the knee in the correct position. After that, we set the angle of the knee joint to put the foot opposite the hip. This is an accumulation effect in which the ankle position is the hip rotation plus the knee rotation.

Inverse kinematics is used when bends are inferred, for example when the foot is placed on the knee (retiré). The inverse kinematic algorithm (Watt 2000) calculates both the hip and knee joint rotational values so that the figure is rendered correctly with the feet placed in the defined position. The same procedure applies to the shoulders, elbows and hands. This works in the reverse direction of the folder metaphor.

3.3 Local Coordinate Geometry

The simplicity of the notation means that, while rendering is in 3D, calculations are made on a 2D plane. Positions of the limbs are either in front of, level with, or behind the coronal plane. Returning to the square around the dancer (as shown in Figure 2), the placement of the glyph within the square provides the distance of the extremity from the centre of the body. The glyph type (i.e. front, level, behind) provides the direction of the extremity. Together, the two parameters provide the vector for the extremities.
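To make the hip-and-knee case from Section 3.2 concrete, the following is a minimal planar sketch of two-joint inverse kinematics. It uses the standard analytic law-of-cosines solution, which is not necessarily the algorithm from Watt (2000) that the prototype uses. The function names, the segment lengths and the restriction to a 2D plane are assumptions for illustration.

```python
import math


def two_link_ik(x, y, thigh, shin):
    """Return (hip, knee) angles in radians that place the foot at (x, y).

    The hip is at the origin, angles are measured from the +x axis, and the
    knee angle is relative to the thigh. One of the two mirror solutions
    is returned.
    """
    d2 = x * x + y * y
    d = math.sqrt(d2)
    if d > thigh + shin + 1e-9 or d < abs(thigh - shin) - 1e-9:
        raise ValueError("target is out of reach")
    # Law of cosines gives the knee bend; clamp against rounding error.
    c = (d2 - thigh * thigh - shin * shin) / (2 * thigh * shin)
    knee = math.acos(max(-1.0, min(1.0, c)))
    hip = math.atan2(y, x) - math.atan2(shin * math.sin(knee),
                                        thigh + shin * math.cos(knee))
    return hip, knee


def forward(hip, knee, thigh, shin):
    """Forward kinematics: accumulate hip and knee rotations down the chain."""
    kx, ky = thigh * math.cos(hip), thigh * math.sin(hip)
    return kx + shin * math.cos(hip + knee), ky + shin * math.sin(hip + knee)


# Hypothetical segment lengths and a foot target close to the supporting knee.
hip, knee = two_link_ik(0.30, -0.20, thigh=0.45, shin=0.45)
foot = forward(hip, knee, 0.45, 0.45)
```

Feeding the solved angles back through `forward` reproduces the requested foot position, which shows the two directions of the folder metaphor: forward kinematics accumulates angles outward from the hip, while inverse kinematics works back from the extremity to the joint angles.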
The default direction for lifting a limb is forward; lifting to the side therefore requires a rotation of 90°. The height of the lift is relative to the stave line; for example, if the glyph for the hand is notated on the fourth stave line, the arm is rotated until the hand is at shoulder height. The final degree of freedom is defined by the rotation within the joint socket, which results in rotation of the palms. This method of movement allows us to position the limbs within the defined rules of classical ballet with a minimum of rotations, for example by rotating the limb around its own axis to give the classical turnout positions. Hinge joints, such as the knee, use two degrees of freedom and do not require the first rotation to set the direction of the plane. The order of rotations for achieving the extremity positions is important. Accumulated rotations move the local coordinate system, so rotations applied in the wrong order create very interesting but impossible positions. A true anatomical representation of the joints, in which the ankle is defined as a gliding joint, is not necessary for this prototype, so the mechanical structure of the joints was simplified. Gliding joints were treated as hinge joints because the rendering is not detailed enough to show every sinew in the body movement, and this is a good approximation for rendering dance movement.

4 Conclusion

This prototype is effective in showing technically correct dance positions. Figure 6 demonstrated leg movements with grands battements (a classical dance movement). Further enhancement of the rendering includes interpolation between keyframes, which could simulate movement of a virtual figure so that it resembles dancing rather than movement from one dance position to another. Work in hand includes the study of motor motion to enhance the system to simulate syntactic expression and style of movement. This paper described a machine-readable format for dance notation with keyframe rendering.
The prototype parsed the data using built-in domain knowledge and conventions. This includes default position settings and the use of classical ballet rules such as turnout and straight legs. The full development of a software animation system to represent and simulate dance would be beneficial to historians, choreographers and choreologists, and for educational purposes. A complete dance visualisation tool would allow choreographies from the past, present and future to be analysed, recreated or interpreted for the advancement and preservation of the art form.

References

Benesh, R. and J. Benesh. 1983. Reading Dance: The Birth of Choreology (First ed.). McGraw-Hill Book Company Ltd.
Brown, A. K. and M. Parker. 1984. Dance Notation for Beginners. Dance Books Ltd.
Credo Interactive Inc. 2000. Credo Interactive.
Herbison-Evans, D. 2001. Dance and the Computer: A Potential for Graphic Synergy. White paper (Updated).
Lansdown, J. 2000. Computer-generated choreography revisited. White paper.
Ratner, P. 1998. 3D Human Modeling and Animation. John Wiley and Sons, Inc.
Watt, A. 2000. 3D Computer Graphics (Third ed.). Addison-Wesley Publishers Ltd.
Wilks, J. 1980. Benesh Movement Notation Beginners' Manual: 1. Still Life (Second ed.). The Benesh Institute of Choreology, Ltd.

ENP2.0: A Music Notation Program Implemented in Common Lisp and OpenGL

Mika Kuuskankare and Mikael Laurson
Center for Music and Technology, Sibelius Academy, P.O.Box 86, 00251 Helsinki, Finland

Abstract

In this paper, we present an open and portable music notation program, ENP2.0. It is based on the PatchWork user-library ENP. ENP2.0 is now rewritten using Common Lisp and the OpenGL API. The improvements in the user interface and overall design of ENP2.0 are discussed. The most important new features are covered with musical examples. We also point out the differences between ENP (user-library in PatchWork) and ENP2.0 (music notation program).

1 Background

ENP (Kuuskankare and Laurson 2001) has been used in several projects for the last three years. ENP is a user-library in PatchWork (Laurson 1996). Its aim was to enhance the existing notational capabilities of the PatchWork music editors. ENP was written in Lisp using QuickDraw, the graphics layer of the Macintosh, for rendering. Thus, ENP was tightly bound to the Macintosh. Recent developments in the operating system market (the introduction of OS X, Linux, etc.) made it obvious that this was not the most fruitful approach. ENP2.0 is a music notation program written in Common Lisp and OpenGL (Woo et al. 1999). ENP2.0 can be used both as a stand-alone application and as a score editor within a new visual language called PWGL. PWGL, in turn, is based on PatchWork (PWGL will be documented elsewhere). ENP2.0 benefits greatly from the concepts and experience that originate from the development of ENP. It is, however, completely rewritten. In principle, ENP2.0 can run on any platform supporting Common Lisp and OpenGL.

2 The Programming Environment

ENP2.0 currently runs on a Macintosh using Macintosh Common Lisp (MCL) and the OpenGL API. There are native MCL implementations in both MacOS 9 and MacOS X and in Linux.
OpenGL is a fast and portable graphics library with implementations in all of the aforementioned operating systems. In the following subsections, we outline some of the more important technologies behind the new and improved system.

2.1 OpenGL

The advantages of OpenGL include double-buffering, anti-aliasing and hardware acceleration. These features provide an inexpensive way of creating smooth and professional-looking graphics. One of the most important features of OpenGL, though, is its ability to interact with arbitrarily complex objects. This feature, called 'picking' in OpenGL, is provided with almost no additional coding. Yet it is a very powerful tool when programming graphical user interfaces. In our case this means that any musical object of any complexity can be edited with the mouse. The 'picking' mechanism in ENP2.0 is implemented so that every graphical object itself acts as a direct pointer to the musical data it represents.

2.2 Persistent Object Storage

In ENP the save and restore methods had to be implemented for each object. This made the system difficult to extend and maintain. It also made it more susceptible to human error. The automatic and modular object source generation feature in ENP2.0, in turn, enables a fast and dynamic

development process. The object source generator is recursive, so that embedded structures are handled correctly. Source generators for the most common data types are provided, but the user can modify and extend their behavior.

2.3 Other New Features

ENP has been used in various projects for several years. During this time, a considerable amount of user data has been generated. A special converter was designed so that any score created in PatchWork can be imported into ENP2.0. The user can extend the capabilities of ENP2.0 with special plug-ins. Unlike traditional scripting languages, this approach allows full access to the musical structures. In ENP2.0 we implement a new way to edit preferences. The preferences are categorized into several groups such as stems, beams and ties. Each group is displayed on a separate page accompanied by an appropriate musical excerpt. When the user changes any of the given parameters, the effect is immediately shown in the notation. Thus, the user can concentrate on adjusting the musical output instead of adjusting abstract values.

3 ENP2.0 Overview

ENP2.0 supports both mensural and non-mensural notation. Mensural notation is used when working with traditionally notated metric music. Non-mensural notation can be used when writing instrumental music that is improvisatory or when the durations of events are relative rather than absolute. The two notation styles can be freely mixed. However, exact time synchronization between parts written in different styles is not yet provided by ENP2.0. Next we give two examples of notation created with ENP2.0. Figure 1 shows a passage written for piano using mensural notation. Internally it uses four independent voices that are drawn on a typical keyboard staff.

Figure 1. Claude Debussy: La Cathédrale engloutie.

Figure 2, in turn, gives an example of non-mensural notation.
This example includes some special notational attributes like Score-BPFs and special noteheads.

Figure 2. Example of non-mensural notation.

4 New Graphical User Interface

In ENP2.0 the user interface is intended to be as straightforward as possible. There are two edit modes: a general edit mode and a rhythm edit mode. The user operates mostly in the general edit mode. In this mode the user can, for example, enter and edit pitch information, articulations and noteheads. The rhythm edit mode is provided for inputting and editing rhythmic structures. The user can easily switch between these two modes. Next we investigate the editing features of ENP2.0 and discuss some of the problems that arise when dealing with multiple selection.

4.1 Selecting

There are two primary methods of selecting objects in ENP2.0. First, the user can click on any object in order to select it. Second, the user can use sweep selection to select a large number of objects at once. For any operation that follows the selection, the first method is unambiguous: the operation affects only the selected object. The second case, however, is not as straightforward. When sweep selecting, the user may select objects of different kinds. To direct the desired action to the appropriate type of object, there has to be a way to filter a selection.

4.2 Filtering

In ENP2.0, there are two different ways to tell which objects are active when performing operations on a selection: (1) The user can apply a global filter. Currently, the user can filter notes, chords, beats and measures from a group of objects. The filtering does not collapse the actual selection. At any time, the user

may apply another filter or revert to the initial selection. This method is primarily used in conjunction with keyboard shortcuts. (2) By clicking on any of the already selected objects, the user can apply a local filter. The subsequent operation affects only the objects that are of the same kind. This method is used with contextual menus and with mouse-driven operations.

4.3 Editing with the Mouse

Every musical object in the score reacts to a set of operations, among others dragging, zooming and panning. The outcome of the operation depends on the type of the object. For example, notes and chords can be dragged to be transposed, the page can be panned to navigate to a different location on the page, and so on. Let us investigate the case where the user must edit a group of objects with the mouse. In Figure 3 the user has sweep selected a number of different kinds of objects (all the objects shown in the example). To transpose the notes within the selection, the user must drag any of the selected note objects. To reposition the selected articulations, the user must drag any of the articulations, and so on.

Figure 3. Group of selected musical objects.

The mechanism described above is discriminating enough to distinguish, for example, between the staccatos and the accents in Figure 3. Dragging the accents does not affect the positioning of the staccatos and vice versa.

4.4 Contextual Menus

Most of the operations in ENP2.0 are handled with contextual menus. Both the score and every musical object in the score own a special context-sensitive menu. This allows the user to perform object-specific operations quickly. It also serves as a reference to the possible operations and parameters for a given musical object.

4.5 Rhythm Editing

The rhythmic representation of ENP2.0 is based on the hierarchical PW-beat (Laurson 1996; Kuuskankare and Laurson 2001). However, in ENP2.0, the beat structures are modified and extended to add new functionality.
These extensions include, for example, special grace-beats and closer integration of the beat structures with the musical representation (for example beaming). ENP2.0 also allows a new way to input rhythms. There are two basic operations for modifying the beat structures: (1) the user can divide an existing beat into an arbitrary number of sub-beats; (2) the user can change the proportional duration of any beat. Typically, the numeric keys are used to indicate the corresponding numeric values. When the user types a number, the proportional durations of the selected beats are changed. When the user holds down the Shift key and types a number, the selected beats are divided into the corresponding number of sub-beats. In the rhythm edit mode, it is necessary to draw some additional information along with the standard notation. As can be seen in Figure 4, the beat hierarchy is revealed by drawing a thick line with a number in front of it to identify each of the levels. The line serves as a visual indication of the extent of the beat. The number, in turn, indicates the proportional duration of each of the beats.

Figure 4. Beat hierarchies in the rhythm edit mode.

In Figure 5 we give a brief example of how the rhythm input works in ENP2.0. Let us consider a situation where the user wants to input a quarter-note triplet with the last two note values divided into two equal portions. We assume that the user starts with a quarter note. First, the user selects the appropriate beat (the line drawn above the quarter note) and presses Shift-3 (a). This results in the quarter-note triplet shown in (b). Next the user selects the last two notes (c) and presses Shift-2. The resulting rhythm is shown in (d).
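The two beat operations described above can be sketched with a small recursive structure. This is an illustrative Python model, not ENP2.0's Common Lisp implementation: the class `Beat` and its methods are hypothetical names, and durations are expressed as fractions of the top-level beat rather than as note values.

```python
from dataclasses import dataclass, field
from fractions import Fraction


@dataclass
class Beat:
    proportion: int = 1  # proportional duration among sibling beats
    children: list["Beat"] = field(default_factory=list)

    def divide(self, n: int) -> None:
        """Shift + digit in the text: split this beat into n equal sub-beats."""
        self.children = [Beat() for _ in range(n)]

    def durations(self, total: Fraction) -> list[Fraction]:
        """Return leaf durations, sharing `total` among children by proportion."""
        if not self.children:
            return [total]
        weight = sum(child.proportion for child in self.children)
        result = []
        for child in self.children:
            result.extend(child.durations(total * child.proportion / weight))
        return result


# Divide a beat in three, then divide the last two sub-beats in two.
beat = Beat()
beat.divide(3)
for sub in beat.children[1:]:
    sub.divide(2)
leaves = beat.durations(Fraction(1))
```

Here `leaves` holds one third followed by four sixths of the original beat. Setting `proportion` on a sub-beat models the other operation (typing a digit without Shift), since the siblings then share the parent's duration in the new ratio.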

[Figure 5: panels (a) to (d)]
Figure 5. The input of a complex rhythm.

Besides the operations discussed above, there are a number of other operations that, due to space limitations, cannot be addressed in this article.

5 Improved Expressions

Like ENP, ENP2.0 provides a comprehensive collection of expressions such as articulations and dynamics (Kuuskankare and Laurson 2001; Laurson et al. 2001). Furthermore, it includes a set of expressions with improved functionality. For example, expressions can have a local editable state. In this state, the expressions are drawn differently from their normal appearance. For instance, Score-BPFs can be edited directly in the score without the need to open an external editor window. In ENP2.0, expressions are now also instrument sensitive. Next we investigate the Score-BPF and look at the instrument sensitive expressions.

5.1 Score-BPFs

Score-BPFs are specialized groups. They are multipurpose graphical objects that can represent breakpoint functions as a part of a musical texture. Score-BPFs can be edited directly in the score (Figure 6).

Figure 6. A Score-BPF displayed in an editable state in the score.

The user can select, move, add and delete points. The view can be resized vertically (the horizontal size is affected by the positioning of the musical objects and the general spacing of the score). Furthermore, the Score-BPF has a specialized contextual menu when drawn in the editable state. From there the user can, for example, add multiple breakpoint functions, adjust the colors or the grid, and perform various transformations.

5.2 Instrument Sensitive Expressions

Under some circumstances, a particular expression may have a different graphical representation depending on the instrument it is written for.
For example, a string number expression attached to a bowed string part is usually written differently from one attached to, say, a guitar part (Figure 7). This is handled automatically in ENP2.0.

[Figure 7: staff labelled Violin]
Figure 7. Instrument sensitive expression.

Instrument sensitive expressions are also dynamic. When the user changes the instrument of a part, all the instrument sensitive expressions in it adjust their appearance accordingly.

6 Acknowledgments

This work has been supported by the Academy of Finland in the project "Sounding Score - Modeling of Musical Instruments, Virtual Musical Instruments and their Control".

References

Laurson M. 1996. "PATCHWORK: A Visual Programming Language and some Musical Applications." Doctoral dissertation, Studia Musica No.6, Sibelius Academy.
Laurson M., C. Erkut, V. Välimäki, and M. Kuuskankare. 2001. "Methods for Modeling Realistic Playing in Acoustic Guitar Synthesis." Computer Music Journal 25(3).
Kuuskankare M. and M. Laurson. 2001. "ENP, Musical Notation Library based on Common Lisp and CLOS." In Proc. of ICMC'01, Havana, Cuba.
Woo M., J. Neider, T. Davis, and D. Shreiner. 1999. "OpenGL Programming Guide". Addison Wesley, 3rd edition, Massachusetts, USA.

Neural Networks for Note Onset Detection in Piano Music

Matija Marolt, Alenka Kavcic, Marko Privosnik
Faculty of Computer and Information Science, University of Ljubljana

Abstract

This paper presents a brief overview of our research into the use of connectionist systems for transcription of polyphonic piano music and concentrates on the issue of onset detection in musical signals. We propose a technique for detecting onsets in a piano performance, based on a combination of a bank of auditory filters, a network of integrate-and-fire neurons and a multilayer perceptron. Such a structure has certain advantages over the more commonly used peak-picking methods, and we present its performance on several synthesized and real piano recordings. Results show that our approach represents a viable alternative to existing onset detection algorithms.

1 Introduction

Transcription of polyphonic music (polyphonic pitch recognition) is the process of converting an acoustical waveform into a parametric representation, in which notes, their pitches, starting times and durations are extracted from the waveform. Transcription is a difficult cognitive task and is not inherent in human perception of music. It is also a very difficult problem for current computer systems. Separating notes from a mixture of other sounds, which may include notes played by the same or different instruments or simply background noise, requires robust algorithms whose performance degrades gracefully as noise increases. In recent years, several transcription systems have been developed. Some of them are targeted at transcription of music played on specific instruments (Rossi 1998; Sterian 1999; Dixon 2000), while others are more general transcription systems (Klapuri 1997). Onset detection is an integral part of these transcription systems, as it helps to determine the exact onset times of notes in the transcribed piece.
Some authors use implicit onset detection schemes (Rossi 1998; Sterian 1999), while others, including us, chose to implement a separate onset detection algorithm to improve the accuracy of onset times. We present our approach to onset detection in this paper.

2 Piano Music Transcription

Our transcription system, called SONIC, is a system for transcription of polyphonic piano music. It takes an acoustical waveform of a piano recording (44.1 kHz sampling rate, 16 bit resolution) as its input and produces a MIDI file containing the transcription as its result. Notes, their starting times, durations and loudness are extracted from the signal in this process. Apart from requiring that the piano be the only instrument in the transcribed signal, the system imposes no limitations on the type of signal being transcribed, such as minimal note length, maximal polyphony or style of the transcribed music. The structure of SONIC is similar to that of most other transcription systems. The main difference from existing approaches is that we use neural networks to perform tasks such as partial tracking and note formation. These parts of the system have already been presented elsewhere (Marolt 2000; Marolt 2001) and will not be discussed in this paper. We dedicate the next two sections to the onset detection subsystem implemented within SONIC, and present its performance on several synthesized and real piano recordings.

3 Onset Detection

3.1 Overview

Note onsets play an important role in the perception of music. Studies have shown that onsets play a pivotal role in the perception of timbre, as it is much more difficult to recognize the timbre of tones with removed onsets (Martin 1999). Onsets also make it easier to detect new information in music; we can detect tones with pronounced onsets well before we can determine their pitch. In a music transcription system, an onset detection algorithm is needed to correctly determine the starting times of notes in the transcribed signal.
Several authors use implicit onset detection schemes in their systems and set the onset time of a note to the time at which the note is first found. At first, we used a similar solution, but later abandoned it, as it did not produce accurate results, especially for notes in lower octaves, where delays of several tens of milliseconds were very common. Such timing deviations led to unpleasant effects when listening to resynthesized transcriptions, and also made performance evaluation of the entire system very difficult, as one had to take such deviations into consideration. We have therefore decided to add a separate onset detection algorithm to our system.

Detection of onsets in a monophonic signal is not a difficult problem, especially if onsets are prominent, as is the case with piano tones. Onsets in a monophonic piano signal can be calculated with high accuracy by simply locating peaks in the amplitude envelope of the input signal. In polyphonic music, such an approach fails, because the amplitude envelope of the entire signal reveals little of what is going on in individual frequency regions of the signal, where note onsets and offsets may occur.

Much research on onset detection has been carried out in the field of beat and rhythm tracking. Unfortunately, these algorithms are usually not accurate enough to be used in transcription systems, as they only discover very prominent onsets in a signal. Better approaches were used in some transcription systems (Klapuri 1999; Scheirer 1995). There, the signal is first split into several frequency bands. A relative difference function is then calculated on the amplitude envelope of each frequency band, and peaks above a certain threshold are taken as onset candidates. Peaks across all bands are then merged together, their new amplitudes are calculated, and all new peaks that fall below a second threshold are removed. The remaining peaks are considered to represent onsets in the signal. Such approaches tend to be very sensitive to the choice of threshold values: if the thresholds are set too low, many spurious onsets are detected; if they are set too high, many onsets are missed. We have therefore chosen to take a somewhat different approach to onset detection.

3.2 Onset Detection in SONIC

Our onset detector is based on a model for segmentation of speech signals proposed by Smith (Smith 1996). The model is founded on psychoacoustic findings and is based on a network of integrate-and-fire neurons that detects possible onsets in the input signal.
We extended Smith's model with a multilayer perceptron neural network to improve the reliability of onset detection. The first phase of the model splits the signal into several frequency bands with a bank of auditory filters. These are bandpass IIR filters whose parameters were calculated from psychoacoustic findings (Patterson and Holdsworth 1990). We use these filters to split the signal into 22 overlapping frequency bands, each covering half an octave. The signal in each of the 22 resulting frequency bands is full-wave rectified and processed with a difference filter that calculates the difference between two amplitude envelopes: one calculated with a first-order IIR smoothing filter with a time constant between 6 and 20 ms (depending on the center frequency of the band), and the other by smoothing the signal with a longer time constant (20-40 ms). The output of the difference filter is positive when the signal rises and negative otherwise.

Figure 1 shows the output of the difference filter on an excerpt taken from Glenn Gould's interpretation of Bach's Two-part Invention No. 8 (Sony 6622). The upper left part of the figure shows the acoustical waveform of the entire signal; vertical lines show note onsets. The right part of the figure shows two amplitude envelopes, calculated in the frequency band that covers frequencies between the pitches of notes Gb4 and B4. The envelopes were calculated with different amounts of smoothing (time constants of 6 ms and 20 ms), and the difference is clearly visible. The output of the difference filter is shown in the lower left part of the figure. One can see that peaks of the filter output correspond to onsets of notes that fall within the Gb4-B4 frequency band. The last note (D4) falls outside of this frequency range, so its peak is not very prominent.
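The difference filter for a single band can be sketched as follows. This is a minimal illustration rather than the SONIC implementation: the one-pole coefficient `alpha = exp(-1 / (tau * fs))`, the function names, the 8 kHz sample rate and the toy tone burst are all assumptions made for the example; only the rectification step and the 6 ms and 20 ms time constants come from the text.

```python
import math


def smooth(samples, time_constant_s, sample_rate_hz):
    """First-order IIR (one-pole) low-pass smoothing of a list of samples."""
    # Assumed coefficient form: alpha = exp(-1 / (tau * fs)).
    alpha = math.exp(-1.0 / (time_constant_s * sample_rate_hz))
    smoothed, state = [], 0.0
    for x in samples:
        state = alpha * state + (1.0 - alpha) * x
        smoothed.append(state)
    return smoothed


def difference_filter(band_signal, sample_rate_hz,
                      fast_tc_s=0.006, slow_tc_s=0.020):
    """Difference between a fast and a slow amplitude envelope of one band."""
    rectified = [abs(x) for x in band_signal]  # full-wave rectification
    fast = smooth(rectified, fast_tc_s, sample_rate_hz)
    slow = smooth(rectified, slow_tc_s, sample_rate_hz)
    return [f - s for f, s in zip(fast, slow)]


# Toy band signal (hypothetical): 0.1 s of silence, a 0.1 s burst of a
# 440 Hz tone, then 0.1 s of silence.
SAMPLE_RATE_HZ = 8000
burst = [math.sin(2.0 * math.pi * 440.0 * n / SAMPLE_RATE_HZ)
         for n in range(800)]
toy_signal = [0.0] * 800 + burst + [0.0] * 800
output = difference_filter(toy_signal, SAMPLE_RATE_HZ)

# The output rises above zero after the burst starts and drops below zero
# once the burst has ended and the fast envelope decays ahead of the slow one.
print(max(output[800:1600]) > 0.0, min(output[1600:]) < 0.0)
```

In this sketch the fast envelope reacts to the start of the burst before the slow one does, which is why the difference is positive while the signal rises.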
The main task of the onset detector is to determine which peaks in the outputs of the difference filters correspond to note onsets and which are the result of various noises or beating in the signal. Our onset detector performs this task with a combination of a network of integrate-and-fire neurons and a multilayer perceptron neural network. Outputs of all 22 difference filters are first fed into a fully connected network of integrate-and-fire neurons. Each integrate-and-fire neuron $i$ in the network changes its activity $A_i$ (initially set to 0) according to:

\[ \frac{dA_i}{dt} = O_i(t) - \gamma A_i \]

where $O_i(t)$ represents the output of the $i$-th difference filter, and $\gamma$ the leakiness of integration. When $A_i$ reaches a threshold, the neuron fires (emits an output pulse), and its activity $A_i$ is reset to 0. After firing, there is a period of insensitivity to input, called the refractory period (50 ms in our model). Firings of neurons provide indications of amplitude growth in frequency channels. Each neuron is connected to all other neurons in the network with excitatory connections, so the firing of a neuron raises the activities of all other neurons in the network and thereby accelerates their firing, if imminent.

Figure 1: Acoustical waveform, two amplitude envelopes and the output of the difference filter O(t)

Onset discovery with a network of integrate-and-fire neurons provides two main advantages over classical peak-picking algorithms. Network connections cluster neuron firings, which may otherwise be dispersed in time, while at the same time the refractory period prevents neurons from generating a series of impulses at each onset. Connections also improve the detection of weak onsets, as they encourage firings of neurons that are close to the firing threshold, but would not fire without additional help.

The network of integrate-and-fire neurons outputs a series of impulses indicating the presence of onsets in the signal. Not all impulses represent onsets, since various noises and beating can cause amplitude oscillations in the signal. We use a multilayer perceptron (MLP) neural network to decide which impulses represent onsets. Inputs of the MLP consist of the activities $A_i$ of the integrate-and-fire neurons and several other parameters, such as amplitudes of individual frequency bands. The MLP has only one output, which indicates whether an onset has occurred in the input signal. The MLP was trained to recognize note onsets on a set of synthesized piano pieces and tested on a mixture of synthesized and real piano recordings.
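The behaviour of the integrate-and-fire stage described above (leaky integration, threshold, reset, refractory period and excitatory coupling) can be illustrated with a small simulation. This is a sketch under stated assumptions, not the SONIC code: the forward-Euler update, the function name and the values of `leak`, `threshold` and `coupling` are hypothetical, and only the 50 ms refractory period is taken from the text.

```python
def run_integrate_and_fire(filter_outputs, dt, leak=10.0, threshold=1.0,
                           coupling=0.1, refractory_s=0.050):
    """Simulate a fully connected network of leaky integrate-and-fire neurons.

    filter_outputs[k][i] is the difference-filter output O_i at time step k.
    Returns a list of (time_in_seconds, neuron_index) firing events.
    """
    n_neurons = len(filter_outputs[0])
    refractory_steps = round(refractory_s / dt)
    activity = [0.0] * n_neurons
    ready_at = [0] * n_neurons  # first step at which a neuron accepts input
    events = []
    for step, inputs in enumerate(filter_outputs):
        # Forward-Euler step of dA_i/dt = O_i(t) - leak * A_i, skipped for
        # neurons that are still in their refractory period.
        for i in range(n_neurons):
            if step >= ready_at[i]:
                activity[i] += dt * (inputs[i] - leak * activity[i])
        # Neurons at or above threshold fire. Each firing excites all other
        # non-refractory neurons, which may push them over the threshold
        # within the same time step.
        pending = [i for i in range(n_neurons)
                   if step >= ready_at[i] and activity[i] >= threshold]
        while pending:
            j = pending.pop()
            events.append((step * dt, j))
            activity[j] = 0.0
            ready_at[j] = step + refractory_steps
            for i in range(n_neurons):
                if i != j and step >= ready_at[i]:
                    activity[i] += coupling
                    if activity[i] >= threshold and i not in pending:
                        pending.append(i)
    return events


DT = 0.001  # 1 ms time step (assumed)
# Neuron 0 receives a strong input; neuron 1 receives a weak input whose
# activity settles just below the threshold when it is left on its own.
inputs = [[200.0, 9.5]] * 500
with_coupling = run_integrate_and_fire(inputs, DT, coupling=0.1)
without_coupling = run_integrate_and_fire(inputs, DT, coupling=0.0)
print(sum(1 for _, i in with_coupling if i == 1),
      sum(1 for _, i in without_coupling if i == 1))
```

In this toy setup the weakly driven neuron never fires on its own, but with coupling enabled it fires together with the strongly driven neuron, which mirrors the clustering of firings and the help given to weak onsets described in the text.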
The performance of the entire onset detection system is presented in the next section.

4 Performance Evaluation

We tested our algorithm on a set of synthesized and real piano pieces. On average, the system correctly found around 98% of all onsets, together with 2% of spurious onsets (onsets not present in the input signal). We present results on three synthesized and three real piano recordings in table 1.

piano piece   no. of onsets   missed onsets   spurious onsets
1             4793            51 = 1.1%       3 = 0.1%
2             1305            37 = 2.8%       3 = 0.2%
3             963             10 = 1.0%       2 = 0.2%
4             786             25 = 3.1%       13 = 1.6%
5             206             13 = 6.3%       6 = 2.9%
6             556             0               8 = 1.4%

Table 1: Performance statistics on three synthesized and three real piano recordings

The synthesized pieces are: (1) J.S. Bach, Partita no. 4, BWV828 (Fazioli piano), (2) P.I. Tchaikovsky, Miniature Overture from The Nutcracker (Bösendorfer piano), (3) S. Joplin and S. Hayden, Kismet Rag (Steinway D40 piano). Real recordings are: (4) J.S. Bach, English suite no. 5 (BWV810), 1st movement, performer Murray Perahia (Sony Classical SK 60277), (5) F. Chopin, Nocturne Op. 9/2, performer Artur Rubinstein (RCA 60822), (6) S. Joplin, The Entertainer, performer unknown (MCA 11836).

Results on synthesized recordings are generally better than those on real recordings. A large number of missed notes are notes played in very fast passages or in ornamentations such as trills and fast arpeggios (most missed notes in Bach's Partita (1)). The main cause of such misses is the refractory period of integrate-and-fire neurons, which prevents them from firing, and thus from detecting onsets, when notes follow one another very quickly. The system also often misses quietly played notes, which are masked by other louder notes or chords occurring shortly before or after the missed onset.

Figure 2: Detection of a spurious onset
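The comparison between synthesized and real recordings can be checked by pooling the counts from table 1. The short script below is only an arithmetic illustration; the data layout and the function name `pooled_rates` are choices made for this example, and the spurious rate is expressed relative to the number of onsets, as in the table.

```python
# (piece, kind, number of onsets, missed onsets, spurious onsets),
# copied from table 1.
TABLE_1 = [
    (1, "synthesized", 4793, 51, 3),
    (2, "synthesized", 1305, 37, 3),
    (3, "synthesized", 963, 10, 2),
    (4, "real", 786, 25, 13),
    (5, "real", 206, 13, 6),
    (6, "real", 556, 0, 8),
]


def pooled_rates(rows, kind):
    """Return (missed %, spurious %) pooled over all pieces of one kind."""
    selected = [row for row in rows if row[1] == kind]
    onsets = sum(row[2] for row in selected)
    missed = sum(row[3] for row in selected)
    spurious = sum(row[4] for row in selected)
    return 100.0 * missed / onsets, 100.0 * spurious / onsets


for kind in ("synthesized", "real"):
    missed_pct, spurious_pct = pooled_rates(TABLE_1, kind)
    print(f"{kind}: {missed_pct:.2f}% missed, {spurious_pct:.2f}% spurious")
```

Pooled this way, the three synthesized pieces give roughly 1.4% missed and 0.1% spurious onsets, against roughly 2.5% missed and 1.7% spurious onsets for the three real recordings.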

Poorer onset detection accuracy on real recordings is a consequence of several factors. Recordings contain reverberation and more noise, while the sound of real pianos includes beating and sympathetic resonance. Furthermore, performances of piano pieces are much more expressive: they contain a wider dynamic range, more arpeggios and pedaling. All of these factors make onset detection more difficult. Still, the overall performance is satisfactory. The causes of missed notes are similar to the ones we mentioned when looking at synthesized recordings; the wider dynamic range of performances is the main factor that contributes to a larger percentage of missed notes in real recordings. A good example of this is Chopin's Nocturne (5th row in table 1), where a distinctive melody is played over very quiet, sometimes barely audible left hand chords, which are often missed. The larger percentage of spurious notes in real recordings is a result of more noise and piano imperfections, such as beating and unpredictable partial behavior. An example of a spurious note detection is given in figure 2. The figure shows amplitude envelopes of four frequency bands calculated on a one second excerpt of Bach's English suite (4). Vertical lines represent onsets found by the system. Six onsets were correctly found (OK), together with one spurious onset (SP). The spurious onset occurred because of a large increase in the amplitude envelope of the 740 Hz frequency band (for which there is no obvious explanation).

5 Conclusion

In this paper, we presented our approach to detection of note onsets in a polyphonic piano performance. The approach is based on a connectionist paradigm and employs a bank of auditory filters and a network of integrate-and-fire neurons, coupled with a multilayer perceptron neural network. By using a connectionist approach to onset detection, we tried to avoid the threshold problems that occur with standard "peak picking" algorithms.
We presented performance statistics of our system on several synthesized and real piano recordings. Overall, we are satisfied with the results obtained; the presented onset detector brought a substantial improvement in the overall performance of our transcription system. The algorithm shows that connectionist approaches represent a good alternative in building onset detection systems and should be studied further.

References

Klapuri, A. 1997. Automatic Transcription of Music. M.Sc. Thesis, Tampere University of Technology, Finland.
Rossi, L. 1998. Identification de Sons Polyphoniques de Piano. Ph.D. Thesis, L'Université de Corse, France.
Sterian, A.D. 1999. Model-based Segmentation of Time-Frequency Images for Musical Transcription. Ph.D. Thesis, University of Michigan.
Dixon, S. 2000. "On the computer recognition of solo piano music." Proceedings of Australasian Computer Music Conference. Brisbane, Australia.
Marolt, M. 2000. "Adaptive oscillator networks for partial tracking and piano music transcription." Proceedings of the 2000 International Computer Music Conference, Berlin, Germany.
Marolt, M. 2001. "SONIC: transcription of polyphonic piano music with neural networks." Workshop on Current Research Directions in Computer Music, Barcelona, Spain.
Martin, K.D. 1999. Sound-Source Recognition: A Theory and Computational Model. Ph.D. Thesis, MIT, USA.
Klapuri, A. 1999. "Sound Onset Detection by Applying Psychoacoustic Knowledge." Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing.
Scheirer, E.D. 1995. Extracting expressive performance information from recorded music. M.Sc. Thesis, MIT, USA.
Smith, L.S. 1996. "Onset-based Sound Segmentation." Advances in Neural Information Processing Systems 8. Touretzky, Mozer and Hasselmo (eds.). Cambridge, MA: MIT Press.
Patterson, R.D., and J. Holdsworth. 1990. "A functional model of neural activity patterns and auditory images." Advances in speech, hearing and auditory images. W.A. Ainsworth (ed.). London: JAI Press.