INTRA-NOTE FEATURES PREDICTION MODEL FOR JAZZ SAXOPHONE PERFORMANCE

Rafael Ramirez, Amaury Hazan, Esteban Maestre
Music Technology Group - IUA, Pompeu Fabra University
Ocata 1, 08003 Barcelona, Spain
{rafael, ahazan, emaestre}

ABSTRACT

Expressive performance is an important issue in music which has been studied from different perspectives. In this paper we describe an approach to investigating musical expressive performance based on inductive machine learning. In particular, we focus on the study of the variations in intra-note features (e.g. attack) that a saxophone interpreter introduces in order to expressively perform a Jazz standard. The study of these features is intended to build on our current system, which predicts expressive deviations on note duration, note onset and note energy.

1. INTRODUCTION

Modeling expressive music performance is one of the most challenging aspects of computer music. The focus of this paper is the study of how skilled musicians (saxophone Jazz players in particular) express and communicate their view of the musical and emotional content of musical pieces by introducing deviations and changes in various parameters. In the past, we have studied expressive deviations on note duration, note onset and note energy [11, 10]. We have used this study as the basis of an inductive content-based transformation system for performing expressive transformations on musical phrases. In this paper we focus on the study of the intra-note features (i.e. attack, sustain, release) that the interpreter introduces in order to expressively perform a piece. In particular, we apply machine learning techniques to induce a predictive model for the type of attack, sustain and release of a note according to the context in which the note appears.
The rest of the paper is organized as follows: Section 2 briefly describes how we extract a melodic description from audio recordings. Section 3 describes the approach we have followed for the induction of the predictive model of the mentioned intra-note features. Section 4 reports on some related work, and finally Section 5 presents some conclusions and indicates some areas of future research.

Figure 1. Overview of the description system

2. AUDIO DESCRIPTION

In this section, we summarize how the audio description is extracted from a set of monophonic recordings. We compute descriptors related to three different temporal scopes: some related to an analysis frame, some to a note segment, and others to an intra-note segment. All the descriptors are stored in an XML document. Figure 1 represents a schematic view of our description scheme, including the different levels of abstraction. Roughly, the procedure for audio description is as follows: first, the audio signal is divided into analysis frames, and a set of low-level descriptors is computed for each analysis frame. Then, we perform note segmentation using the low-level descriptor values and the fundamental frequency. Using the note boundaries and the low-level descriptors, we carry out energy-based intra-note segmentation, followed by a characterization of the amplitude envelope of each intra-note segment.

2.1. Low-level descriptors computation

The main low-level descriptors used to characterize expressive performance are instantaneous energy and fundamental frequency. Energy is computed in the spectral domain, using the values of the amplitude spectrum. For the estimation of the instantaneous fundamental frequency we use a harmonic matching model, the Two-Way Mismatch procedure (TWM) [8]. After a first test of this implementation, some improvements to the original algorithm were implemented, as reported in [6].

2.2. Note segmentation

Note segmentation is performed using a set of frame descriptors: the energy computed in different frequency bands and the fundamental frequency. Energy onsets are first detected following a band-wise algorithm that uses some psycho-acoustical knowledge [7]. In a second step, fundamental frequency transitions are also detected. Finally, both results are merged to find the note boundaries.

2.3. Note descriptor computation

We compute note descriptors using the note boundaries and the low-level descriptor values. The low-level descriptors associated with a note segment are computed by averaging the frame values within that segment. Pitch histograms are used to compute the note pitch and the fundamental frequency that represents each note segment, as in [9].

2.4. Intra-note description

We perform intra-note segmentation based on the energy envelope contour. Once the note boundaries are found, the energy envelopes of the notes are extracted from the recordings and divided into three segments, namely attack, sustain and release. Figure 2 shows the representation of these segments for a melody fragment. The procedure for intra-note energy-based description is outlined as follows: we consider each audio note segment as a differentiable function over time. We compute the zero-crossings of its third derivative [Jenssen] in order to select the segment characteristic points, i.e. maximum curvature points. Then, we join these points by straight lines, forming a set of consecutive linear segments. We compute the slopes and durations (i.e.
the projection on the time axis) of the linear fragments, and finally define an attack, a sustain and a release section. We characterize the complete envelope of a note by the following (self-explanatory) six descriptors: Attack Relative Duration (ARD), Attack Regression Slope (ARS), Attack Relative End Value (AREV), Sustain Regression Slope (SS), Release Relative Duration (RRD), and Release Regression Slope (RRS). We obtained a correlation coefficient of 0.83 for this approximation on our data set of 4360 notes.

3. INTRA-NOTE PREDICTIVE MODEL

In this section, we describe our inductive approach for learning the intra-note predictive model from performances of Jazz standards by a skilled saxophone player. Our aim is to find an intra-note-level model which predicts how a particular note in a particular context should be played (i.e. predicts its envelope). This, we believe, will improve our current model, which predicts note-level deviations on note duration, note onset and note energy.

Figure 2. Example of linear envelope approximation of a sequence of saxophone notes

As we have pointed out in the past, we are aware of the fact that not all the expressive transformations performed by a musician can be predicted at a local note level. Musicians perform music considering a number of abstract structures (e.g. musical phrases), which makes expressive performance a multi-level phenomenon. In this context, our ultimate aim is to obtain an integrated model of expressive performance which combines note-level knowledge with structure-level knowledge. Thus, the work presented in this paper may be seen as a starting step towards this ultimate aim.

Training data.
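Each training example pairs a note's symbolic context with, as its regression target, the six envelope descriptors introduced in Section 2.4. As a minimal sketch (not the authors' code: the attack-end and release-start breakpoint indices are assumed to have already been selected by the third-derivative procedure above, and all names are illustrative), the target vector could be computed from an energy envelope like this:

```python
# Sketch: computing the six-descriptor target vector of a training
# example from a note's energy envelope. Breakpoint indices are assumed
# given; names are illustrative.

def linear_fit(xs, ys):
    """Least-squares regression slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    den = sum((x - mx) ** 2 for x in xs)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / den if den else 0.0

def envelope_descriptors(env, attack_end, release_start):
    """ARD, ARS, AREV, SS, RRD, RRS for a peak-normalized envelope."""
    n = len(env)
    t = [i / (n - 1) for i in range(n)]      # time normalized to [0, 1]
    peak = max(env) or 1.0
    e = [v / peak for v in env]              # amplitude normalized to [0, 1]
    atk = slice(0, attack_end + 1)
    sus = slice(attack_end, release_start + 1)
    rel = slice(release_start, n)
    return {
        "ARD": t[attack_end],                # attack relative duration
        "ARS": linear_fit(t[atk], e[atk]),   # attack regression slope
        "AREV": e[attack_end],               # attack relative end value
        "SS": linear_fit(t[sus], e[sus]),    # sustain regression slope
        "RRD": 1.0 - t[release_start],       # release relative duration
        "RRS": linear_fit(t[rel], e[rel]),   # release regression slope
    }

# A ramp-plateau-ramp envelope: steep attack, flat sustain, steep release.
target = envelope_descriptors([0.0, 0.5, 1.0, 1.0, 1.0, 0.5, 0.0],
                              attack_end=2, release_start=4)
```

The dictionary values form the vector that the multivariate regression tree of Section 3 is trained to predict.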
The training data used in our experimental investigations are monophonic recordings of four Jazz standards (Body and Soul, Once I Loved, Like Someone in Love and Up Jumped Spring) performed by a professional musician at 11 different tempos around the nominal tempo (apart from the tempo requirements, the musician was not given any particular instructions on how to perform the pieces). A set of melodic features is extracted from the recordings and stored in a structured format. The performances are then compared with their corresponding scores in order to automatically compute the transformations performed.

Descriptors. In this paper, we are concerned with intra-note (in particular, the note's envelope) expressive transformations. Given a note's musical context in the score, we are interested in inducing a model for predicting aspects of the note's envelope. The musical context of each note is defined in a structured way using first order logic predicates. The predicate melody/10 specifies information both about the note itself and the local context in which it appears. Information about intrinsic properties of the note includes the note duration and the note's metrical position, while information about its context includes the durations of the previous and following notes, and the extension and direction of the intervals between the note and both the previous and the subsequent note. The predicate narmour/4

specifies information about the Narmour groups a particular note belongs to. The temporal aspect of the music is encoded via the predicate succ/4. For instance, succ(A, B, C, D) indicates that the note in position D in the excerpt indexed by the tuple (A, B) follows note C. The use of first order logic for specifying the musical context of each note is much more convenient than using traditional attribute-value (propositional) representations: encoding both the notion of successor notes and Narmour group membership would be very difficult in a propositional representation. In order to mine the structured data we used Tilde's top-down decision tree induction algorithm [2]. Tilde can be considered a first order logic extension of the C4.5 decision tree algorithm: instead of testing attribute values at the nodes of the tree, Tilde tests logical predicates. This provides the advantages of both propositional decision trees (i.e. efficiency and pruning techniques) and the use of first order logic (i.e. increased expressiveness). The increased expressiveness of first order logic not only provides a more elegant and efficient specification of the musical context of a note, but also yields a more accurate predictive model (more on this later). Tilde can also be used to build multivariate regression trees, i.e. trees able to predict vectors. In our case the predicted vectors are the amplitude envelope linear approximation descriptors, which can be seen as a first step towards a more complete (e.g. pitch and centroid envelope) and more refined (e.g. using spline regression) intra-note description. Table 1 compares the correlation coefficients obtained by 10-fold cross-validation of a propositional regression tree model and our first order logic regression tree model for each descriptor in the amplitude envelope prediction task.
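To give a flavor of the first order encoding, the succ/4 facts for an excerpt could be generated as below. This is a hedged sketch: the note identifiers (n0, n1, ...) and the concrete fact syntax fed to Tilde are assumptions for illustration, not the paper's actual data format.

```python
# Sketch: emitting Prolog-style succ/4 facts for a Tilde-like learner.
# succ(A, B, C, D) states that the note at position D in the excerpt
# indexed by the tuple (A, B) follows note C.

def succ_facts(piece, tempo, num_notes):
    """One succ/4 fact per consecutive pair of notes in the excerpt."""
    return [
        f"succ({piece}, {tempo}, n{i}, {i + 1})."
        for i in range(num_notes - 1)
    ]

facts = succ_facts("body_and_soul", 120, 4)
# facts[0] is "succ(body_and_soul, 120, n0, 1)."
```

Chaining such facts lets the learner inspect an arbitrary temporal window around a note, which is precisely what a fixed-width propositional representation cannot express compactly.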
On the one hand, the first order logic model takes into account a wider musical context of the note, by considering an arbitrary temporal window around it (via the succ/4 predicate). On the other hand, the propositional model only considers the note's local context, i.e. the temporal information is restricted to the duration and pitch of the previous and following notes. In Table 1, ARD refers to Attack Relative Duration, ARS to Attack Regression Slope, AREV to Attack Relative End Value, SS to Sustain Regression Slope, RRD to Release Relative Duration, and RRS to Release Regression Slope. The pruned tree models have an average size of 69 nodes for the propositional model and 123 nodes for the first order logic model. We also obtained correlation coefficients of 0.71 and 0.58 for the complete amplitude envelope prediction task by performing a 10-fold cross-validation of the first order logic model and the propositional model, respectively. Table 2 shows the average absolute error (AAE) and the root mean squared error (RMSE) for each descriptor in the amplitude envelope prediction task.

Morphological attribute   C.C. (Prop)   C.C. (FOL)
ARD                       0.19          0.27
ARS                       0.39          0.51
AREV                      0.51          0.65
SS                        0.22          0.24
RRD                       0.24          0.31
RRS                       0.31          0.40

Table 1. Comparison of Pearson correlation coefficients obtained by 10-fold cross-validation for the propositional and first order logic models.

Morphological attribute   AAE    RMSE
ARD                       0.11   0.19
ARS                       0.05   0.08
AREV                      0.65   0.91
SS                        0.01   0.01
RRD                       0.20   0.26
RRS                       0.05   0.11

Table 2. Errors for each descriptor in the amplitude envelope prediction task by the first order logic model, obtained by 10-fold cross-validation.

4. RELATED WORK

Previous research in building performance models in a somewhat structured musical context has covered a broad spectrum of music domains. Widmer et al. [12] have focused on the task of discovering general rules of expressive classical piano performance, as well as generating them from real performance data via inductive machine learning.
The performance data used for the study are MIDI and audio recordings of piano sonatas by W.A. Mozart performed by a skilled pianist. In addition to these data, the music score, along with a hand-made hierarchical phrase structure description, was also coded. The resulting substantial data set consists of information about the nominal note onsets, durations, metrical information and annotations. However, given that they are interested in classical piano performances, they do not study intra-note expressive variations. Here, we are interested in saxophone recordings of Jazz standards, and thus we have studied deviations in local duration, onset and energy, as well as ornamentations [11]. In [10], we generate saxophone expressive performances by predicting local note deviations. In [4], Dannenberg et al. study trumpet envelopes by computing amplitude envelope descriptors on a total of 125 contours (i.e. three-note sequences varying in interval size, direction, and articulation). Statistical analysis techniques led to finding significant groupings of the envelopes by interval or direction types, or by more specific intentions (e.g. staccato, legato). This work was later extended into a system that combines instrument and performance models [3]. The authors do not take into account duration or onset deviations, or ornamentations. Dubnov et al. [5] have followed a similar line: the behavior of the sound as it occurs in the course of actual performances of several solo works is analyzed, in order to build a model able to reproduce aspects of the sound texture

originated by the expressive inflections of the performer. Correlations between pitch, energy and spectral envelope variations are studied, and both the phase coherence between pitch and energy and the decomposition into a periodic component and a noise component are investigated. To our knowledge, these models are devised after a preliminary statistical analysis rather than being induced from training data. A possible reason may be the difficulty of parameterizing continuous data from real-world acoustic recordings and feeding a machine learning component with the parameterized data. A first step towards continuous signal parameterization is taken in [1], where pitch-continuous vocal signals, namely Indian gamakas, are parameterized with Bezier splines. An approximation of the amplitude, pitch and centroid curves is obtained, and the reduced data can be used to render similar signals. Nevertheless, the proposed representation lacks a higher-level context (e.g. attack, sustain) that could be used to analyze and synthesize specific parts of the audio signal in a more accurate way.

5. CONCLUSION

This paper describes an inductive logic programming approach for learning expressive performance transformations at the intra-note level. In particular, we have focused on the study of the variations in intra-note features that a saxophone interpreter introduces in order to expressively perform Jazz standards. We have compared the induced first order logic model with a propositional model and concluded that the increased expressiveness of first order logic not only provides a more elegant and efficient specification of the musical context of a note, but also yields a more accurate predictive model than the one obtained with propositional machine learning techniques.

Future work: This paper presents work in progress, so there is future work in several directions. We plan to explore different envelope approximations, e.g.
pitch and centroid envelopes or splines, for characterizing the notes' envelopes. Another short-term goal is to incorporate the intra-note model described in this paper into our current system, which predicts expressive deviations on note duration, note onset and note energy. This will give us a better validation of the intra-note model. We also plan to increase the amount of training data, as well as to experiment with different information encoded in it. Increasing the training data, extending the information in it and combining it with background musical knowledge will certainly generate a more complete model.

Acknowledgments: This work is supported by the Spanish TIC project ProMusic (TIC 2003-07776-C02-01). We would like to thank Emilia Gomez and Maarten Grachten for pre-processing and providing the data.

6. REFERENCES

[1] B. Battey. Bezier spline modeling of pitch-continuous melodic expression and ornamentation. Computer Music Journal, 28:4, 2004.

[2] H. Blockeel, L. De Raedt, and J. Ramon. Top-down induction of clustering trees. In J. Shavlik, editor, Proceedings of the 15th International Conference on Machine Learning, pages 53-63, Madison, Wisconsin, USA, 1998. Morgan Kaufmann.

[3] R.B. Dannenberg and I. Derenyi. Combining instrument and performance models for high quality music synthesis. Journal of New Music Research, 27:3, 1998.

[4] R.B. Dannenberg, H. Pellerin, and I. Derenyi. A study of trumpet envelopes. In Proceedings of the International Computer Music Conference, San Francisco: International Computer Music Association, 1998.

[5] S. Dubnov and X. Rodet. Study of spectro-temporal parameters in musical performance, with applications for expressive instrument synthesis. In 1998 IEEE International Conference on Systems, Man and Cybernetics, San Diego, USA, November 1998.

[6] E. Gomez. Melodic Description of Audio Signals for Music Content Processing. PhD thesis, Pompeu Fabra University, 2002.

[7] A. Klapuri.
Sound onset detection by applying psychoacoustic knowledge. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1999.

[8] R.C. Maher and J.W. Beauchamp. Fundamental frequency estimation of musical signals using a two-way mismatch procedure. Journal of the Acoustical Society of America, 95, 1994.

[9] R.J. McNab, L.A. Smith, and I.H. Witten. Signal processing for melody transcription. In SIG working paper, volume 95-22, 1996.

[10] R. Ramirez and A. Hazan. Modeling expressive music performance in jazz. In Proceedings of the 18th Florida Artificial Intelligence Research Society Conference (FLAIRS 2005), Clearwater Beach, FL, 2005.

[11] R. Ramirez, A. Hazan, E. Gomez, and E. Maestre. Understanding expressive transformations in saxophone jazz performances using inductive machine learning. In Proceedings of Sound and Music Computing '04, Paris, France, 2004.

[12] G. Widmer and A. Tobudic. Playing Mozart by analogy: Learning multi-level timing and dynamics strategies. Journal of New Music Research, 2003.