EXPRESSIVE MODIFICATIONS OF MUSICAL AUDIO RECORDINGS: PRELIMINARY RESULTS

Marco Fabiani & Anders Friberg
Royal Institute of Technology (KTH), Stockholm (Sweden)
Speech, Music and Hearing (TMH), Music Acoustics Group

ABSTRACT

A system is described that aims to modify the performance of a musical recording (classical music) by changing the basic performance parameters tempo, sound level and tone duration. The input audio file is aligned with the corresponding score, which also contains extra information defining rule-based modifications of these parameters. The signal is decomposed using analysis-synthesis techniques to separate and modify each tone independently. The user can control the performance by changing the quantities of the performance rules or by directly modifying the parameter values. A prototype Matlab implementation of the system performs expressive tempo and articulation modifications of monophonic and simple polyphonic audio recordings.

1. INTRODUCTION

A music performance represents the interpretation that a musician (or a computer, in our case) gives to a score. To obtain different performances, the musician often follows principles related to structural features of the score (e.g. musical phrases). The KTH rule system for musical performance [5] models such principles in a quantitative way in order to play back a nominal score expressively using a sequencer and a synthesizer. This approach often produces rather unnatural results when compared to human performances, mostly because of the quality of the synthesizer. We therefore propose an alternative approach: directly modify a recorded human performance. This modification can be rule-based or controlled in other ways. Our aim is a system that can be used both for the analysis of music performance and as a tool to modify a performance in a controlled way, for example in live interactive systems similar to [4].

We will concentrate on the musical parameters tempo, dynamics (sound level) and articulation (tone duration), which have been found to be crucial for performance expression [8]. To modify these parameters, a few problems need to be solved. First of all, the rule system requires the division of the musical piece into phrases. This is usually done by hand, since automatic methods are not yet reliable enough. A second problem, since we aim at modifying each note independently, is the separation of each tone, or at least each chord, from the mixture (especially in the polyphonic case). Finally, the modifications should be accurate and possibly avoid artifacts. To solve the first two problems, we propose to combine the use of score files aligned with the audio file. The third problem is approached using analysis-synthesis techniques.

2. SYSTEM OVERVIEW

The system can be divided into three main parts, as shown in Figure 1.

Figure 1. Schematic representation of the system.

In the first part (a), the audio file is analyzed and aligned with the score. This is done prior to the performance, and the information is stored for later use. The audio analysis separates the harmonic part of the signal from the noise and transients.
In the second part (b, modification and synthesis), tempo, tone duration and sound level are modified according to the output values of the control block (c). These values are defined, for example, by the rule system or in other ways (movements detected by cameras, sensors or other controllers).

3. SCORE ALIGNMENT

In order to use the information provided in the score, it needs to be aligned with the audio file. This is done by defining a number of corresponding fixed points in the two files, which in our system are tone onsets. Tone onsets are extracted automatically, with the possibility to correct errors by hand, since an accurate alignment is crucial for the analysis. Accurate automatic onset detection is still an open problem. Many algorithms have been proposed (a summary is given in [1]) and presented at MIREX/ISMIR (http://www.ismir.net/). While they perform relatively well with impulsive attacks, they have problems detecting slow attacks. Currently, a simple algorithm based on an edge detection filter is used (http://cnx.org/content/ml417O/1.1/). First, the absolute value of the signal is computed. This signal is then convolved with a filter that is the first derivative of a Gaussian pulse, which outputs positive spikes for positive edges and negative spikes for negative edges. Every positive spike above a certain threshold is counted as an onset.
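As an illustration, this detector can be sketched in a few lines of Python/NumPy. This is a minimal sketch, not the prototype code; the filter width, threshold and minimum spike distance are assumed values that would need tuning for each recording.

```python
import numpy as np
from scipy.signal import fftconvolve

def detect_onsets(x, sr, sigma_ms=10.0, rel_threshold=0.2, min_gap_ms=50.0):
    """Return onset times (in seconds) for the signal x sampled at sr Hz."""
    env = np.abs(x)                                  # rectified signal
    sigma = sigma_ms * 1e-3 * sr                     # edge filter width in samples
    t = np.arange(-4.0 * sigma, 4.0 * sigma)
    kernel = -t / sigma**2 * np.exp(-t**2 / (2.0 * sigma**2))  # first derivative of a Gaussian
    edges = fftconvolve(env, kernel, mode="same")    # positive spikes at rising edges
    thr = rel_threshold * edges.max()
    min_gap = int(min_gap_ms * 1e-3 * sr)
    onsets, last = [], -min_gap
    for n in range(1, len(edges) - 1):
        is_spike = edges[n] > thr and edges[n] >= edges[n - 1] and edges[n] > edges[n + 1]
        if is_spike and n - last >= min_gap:         # keep only well-separated spikes
            onsets.append(n)
            last = n
    return np.asarray(onsets) / sr
```

In the prototype, the onsets found in this way can then be corrected by hand before the alignment is stored.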

A possible problem with alignment based on tone onsets is the occurrence of non-simultaneous onsets in the audio signal that are simultaneous in the score. This could possibly be solved by using beat positions instead of tone onsets as alignment points, although beat extraction is more difficult than onset detection. Other approaches to score alignment that do not rely on onsets are possible, for example algorithms based on dynamic time warping [2].

4. AUDIO SIGNAL ANALYSIS

In order to be modified, a tone needs to be separated from the rest of the signal. In a polyphonic audio recording, every tone is mixed with other tones, possibly overlapping. A tone produced by an acoustic instrument is usually harmonic and contains a large number of partials. Even if two tones do not have the same pitch, some of their partials may overlap. In order to apply tone-based modifications, each partial should be associated with its corresponding note in the score.

We are currently using the analysis-synthesis technique proposed in [3] to detect, modify and synthesize single tones in the audio signal. The audio signal is transformed into the time-frequency domain, and partial tracks are extracted. The objective is to associate each peak (partial) in the spectrogram with its corresponding note in the score. This means using the spectrogram as a piano-roll representation, similar to that of MIDI sequencers, but extended to the partials. Usually, approaches to automatic partial tracking are based on heuristic rules and do not use any a priori information (e.g. [12]). Our system, however, already knows the tone onsets and the notes from the score. For each time window in the spectrogram, we compute the frequency of the tone that is expected to be present, and of its partials, and look for the corresponding peaks. A simple model is also integrated to take into account the inharmonicity of the piano in the computation of partial frequencies. A problem that remains to be solved is the case where two simultaneous tones have overlapping partials. A possible solution is to assign part of the energy of the peak to one tone and part to the other, as proposed for example by Klapuri in [9].
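The score-guided peak assignment can be illustrated with a small sketch. The following fragment computes the expected partial frequencies of a note, including a simple stiff-string inharmonicity term f_k = k * f0 * sqrt(1 + B * k^2), and picks the closest measured peak for each partial. The inharmonicity coefficient, the number of partials and the tolerance are assumed example values, and the peak lists stand in for the output of the time-frequency analysis.

```python
import numpy as np

def expected_partials(f0, n_partials=20, B=4e-4):
    """Expected partial frequencies f_k = k*f0*sqrt(1 + B*k^2) of a piano tone."""
    k = np.arange(1, n_partials + 1)
    return k * f0 * np.sqrt(1.0 + B * k**2)

def assign_peaks(peak_freqs, peak_mags, f0, tol_cents=30.0):
    """Map measured spectral peaks (one analysis frame) to the partials of a note."""
    peak_freqs = np.asarray(peak_freqs, dtype=float)
    assigned = {}
    if peak_freqs.size == 0:
        return assigned
    for k, fk in enumerate(expected_partials(f0), start=1):
        cents = 1200.0 * np.abs(np.log2(peak_freqs / fk))  # distance to the expected frequency
        j = int(np.argmin(cents))
        if cents[j] < tol_cents:                           # accept only nearby peaks
            assigned[k] = (peak_freqs[j], peak_mags[j])
    return assigned
```

A sketch like this simply gives an overlapping partial to whichever note claims it; splitting the peak energy between simultaneous notes, as in [9], requires an additional step.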
5. MODIFICATIONS AND SYNTHESIS

Previous attempts at expressive tempo modifications can be found, for example, in [6, 7]. Our aim is to go beyond tempo and modify each tone independently, in order to also change the articulation (note length relative to the Inter Onset Interval) and the sound level.

Sound level and tone duration modifications are performed on the spectrogram by modifying, adding and removing peaks according to the mapping obtained during the analysis. To obtain a realistic sound level modification, both the amplitude of the tone and its timbre need to be changed (louder tones usually have a brighter sound [11]). This can be simulated by adding or removing entire partial tracks (see [3]) as well as by changing the spectral shape of the tone in the time-frequency domain. Another solution is the use of an appropriate filter (e.g. a shelving filter with variable slope) in the time domain. If we want to make the timbre brighter, however, the required higher partials might not be present in the original recording. The filter solution will then only raise the noise level, while the spectrogram approach allows synthesizing higher harmonics that are not present. It is also important to note that, in order to change the sound level according to the rule values, we need to know the original sound level of each note in the recording.

Modifications of tone duration (articulation) are performed by shortening or extending the partial tracks associated with the tone, in the same way that the timbre is modified (e.g. by removing or adding peaks in the spectrogram). For time-scale modifications we implemented a simple approach in our prototype system: to make the tempo faster, frames are dropped from the spectrogram, while to make it slower, some frames are used twice. This is done only in the sustain part of the tone, in order to preserve tone attacks and avoid attack duplication.

Once the modification of the spectrogram is complete, the new performance is synthesized by inverting the analysis operation. Particular attention has to be paid to the phase response when modifying the signal in the time-frequency domain. Most of the artifacts (for example "phasiness", also known as "loss of presence") are caused by a non-consistent phase response, meaning that no real signal exactly corresponds to that particular combination of magnitude and phase frequency response. To solve the problem, we try to compute the inverse transform using only the spectrogram, by iteratively minimizing an error function (see [13]). An alternative method would be to correct the original phase response.
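The frame-based time-scale step described above can be sketched as follows. This is a minimal illustration under assumed conventions (spectrogram stored as a bins x frames array, attack length known from the analysis), not the prototype implementation.

```python
import numpy as np

def stretch_sustain(spec, attack_frames, stretch):
    """Change the duration of a tone by dropping or repeating sustain frames.

    spec:          STFT matrix of the tone, shape (bins, frames)
    attack_frames: number of leading frames left untouched, to preserve the attack
    stretch:       > 1 lengthens the tone (frames repeated), < 1 shortens it
    """
    attack = spec[:, :attack_frames]
    sustain = spec[:, attack_frames:]
    if sustain.shape[1] == 0:
        return spec
    n_out = max(1, int(round(sustain.shape[1] * stretch)))
    # Nearest-neighbour mapping from output frames to input frames:
    # slowing down repeats some frames, speeding up drops some frames.
    idx = np.round(np.linspace(0, sustain.shape[1] - 1, n_out)).astype(int)
    return np.concatenate([attack, sustain[:, idx]], axis=1)
```

Repeated or dropped frames also break phase continuity between neighbouring frames, which is one more reason why the phase has to be regenerated at the synthesis stage, as discussed above.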

6. PERFORMANCE CONTROL

The control part is where the user interacts with the system in order to obtain the desired performance. The control over the parameters can be done directly by the user, or can be partly automated by using the KTH rule system for music performance [5].

pDM [4] is a system for both rule-based and directly controlled performance modifications of symbolic music files, based on the KTH rule system, which consists of 19 rules: 14 rules influence tempo, 11 influence sound level and 5 influence articulation. Each rule produces a default parameter change for each note based on context-dependent information such as rhythm, phrasing, pitch, and expressive signs in the score. pDM uses a score file in which these default values are stored together with the notes. The final values for tempo, sound level and articulation are obtained by computing a weighted sum of the default values given by the rules. The weights are controlled by the user, for example with sliders, or taken from predefined default sets for standard performances (e.g. happy, angry, sad, ...).

Our system works in a similar fashion, but uses audio recordings instead of symbolic data and a synthesizer. The audio file is aligned with the pDM score in order to use the rule values. The same functionalities present in pDM are implemented, so that a pDM score file can be used for applying each rule individually to each tone in the audio recording.
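The weighted-sum mapping from rule weights to per-note deviations can be illustrated with a short sketch. The data layout below (field names, one deviation per parameter and rule) is an assumption made for the example and is not the actual pDM score format.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Note:
    onset: float                         # nominal onset time (s)
    duration: float                      # nominal duration (s)
    sound_level: float                   # nominal sound level (dB)
    # Default deviations per rule, e.g. {"phrase_arch": {"dt": 0.02, "dsl": 1.0, "dart": 0.0}}
    rule_deltas: Dict[str, Dict[str, float]] = field(default_factory=dict)

def apply_rules(note: Note, weights: Dict[str, float]):
    """Weighted sum of the default rule values for one note.

    weights maps each rule name to its user-controlled quantity
    (1.0 = default amount, 0.0 = rule switched off).
    Returns (tempo deviation, sound-level deviation, articulation deviation).
    """
    dt = sum(k * note.rule_deltas.get(r, {}).get("dt", 0.0) for r, k in weights.items())
    dsl = sum(k * note.rule_deltas.get(r, {}).get("dsl", 0.0) for r, k in weights.items())
    dart = sum(k * note.rule_deltas.get(r, {}).get("dart", 0.0) for r, k in weights.items())
    return dt, dsl, dart
```

A performance preset such as "happy" or "sad" then simply corresponds to a stored set of weights.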
7. PRELIMINARY RESULTS

In order to test the timing accuracy of the system, an expressive performance was transformed into a dead-pan performance and the resulting onset accuracy was measured. The audio input was a piano rendering of the first 11 bars of Chopin's Etude op.10 no.3. It was a recording of a grand piano (Bösendorfer CEUS), controlled via MIDI. The performance contained large tempo variations and originated from an average of 22 real performances used in the RENCON contest 2006 (http://www.ofai.at/simon.dixon/... rencon/results.html). A recorded dead-pan version was also available (denoted Deadpan), which was used in an informal listening test in a comparison with the generated dead-pan version (denoted Flattened).

Figure 2. Inter Beat Interval for three versions of Chopin's Etude op.10 no.3, bars 1-11: expressive (Average), original deadpan (Deadpan) and deadpan output by the system (Flattened).

In Figure 2, the resulting Inter Beat Intervals for the original performance, the recorded dead-pan version and the generated dead-pan version are shown. Note that the IBI (Inter Beat Interval) values are approximate because of the relative uncertainty of the onset detection: this may explain the small mean error (std = 9.0 ms) for the Deadpan performance. It is evident that the Flattened version is very close to the Deadpan version, regarding both overall tempo (mean IBI m = 299.9 ms) and error (std = 9.7 ms), as opposed to the original performance (m = 472.9 ms and std = 92.6 ms). The informal listening test corroborated the measured results.

Some preliminary tests were done for the articulation modification. The lengths of the tones were shortened (e.g. to obtain a more staccato performance) by canceling the magnitude peaks of the tone's partials in the time-frequency representation. The length of each note was computed using rule values from the pDM score. The final result of this modification strongly depends on the quality of the analysis/separation operation. Artifacts were audible, caused by the fact that tones overlap both in frequency and in time: when canceling a part of a tone we might cancel parts of other tones, as well as modify the residual part (e.g. noise and transients). Furthermore, as mentioned in Section 5, artifacts were also caused by the non-consistent phase response.

The last parameter to modify is the sound level. The measurement of the sound level of each note is not trivial for a polyphonic recording. Many factors have to be taken into consideration, such as masking effects and note overlapping. We tried to compute the sound level of each single tone by summing the energy of the partials extracted during the analysis, but the results were not very satisfactory, especially for piano recordings.

Another problem that we tried to solve is the phase incoherence. Since modifications of the spectrogram are done at different levels and at different times, we decided to discard the original phase response and use only the magnitude response. We thus implemented the algorithm proposed in [13]. We are currently also trying to use the original phase response for the unmodified parts and a generated one for the modified parts.
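Reference [13] describes a real-time, look-ahead variant of iterative spectrogram inversion. As a rough illustration of the underlying idea only (not of the look-ahead algorithm itself), a classic Griffin-Lim-style loop alternates between imposing the target magnitudes and re-estimating the phase from the resynthesized signal; the window parameters and iteration count below are assumed values.

```python
import numpy as np
from scipy.signal import stft, istft

def invert_magnitude(target_mag, fs, nperseg=2048, noverlap=1536, n_iter=50):
    """Estimate a time signal whose STFT magnitude approximates target_mag.

    target_mag must be a magnitude STFT computed with the same nperseg/noverlap.
    """
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(target_mag.shape))   # random initial phase
    for _ in range(n_iter):
        # Resynthesize with the current phase estimate...
        _, x = istft(target_mag * phase, fs=fs, nperseg=nperseg, noverlap=noverlap)
        # ...and keep only the phase of the re-analyzed signal.
        _, _, spec = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
        new_phase = np.exp(1j * np.angle(spec))
        n = min(new_phase.shape[1], phase.shape[1])
        phase[:, :n] = new_phase[:, :n]
    _, x = istft(target_mag * phase, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return x
```

The method in [13] instead builds the phase frame by frame with a short look-ahead buffer, which makes real-time operation possible.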

Finally, all parts were integrated into a prototype system controlled through a simple GUI (Figure 3). The user can: load audio and score files; automatically extract tone onsets and correct errors by hand; transform and analyze the signal using different window sizes and overlap ratios; define rule values using sliders or choose performance presets; change the overall tempo; and play back the original recording and the generated performance.

Figure 3. GUI of the prototype system developed in Matlab.

8. DISCUSSION AND FUTURE WORK

A system prototype was presented that can be used to modify the expression of a recorded musical performance. Possible applications of this system could be to generate realistic and precisely controllable performances of the same piece to be used for listening experiments (e.g. in psychology and neuroscience). We can also envisage an orchestra conducting system or other interactive applications. Provided that some of the current problems can be solved, the system should sound more natural than other performance systems based on MIDI sequencers and synthesizers [4]. It should also allow the user to use any kind of recording, which is not possible with other conducting systems based on audio recordings [10], as well as allow modifications of articulation and sound level on a per-note basis.

Many open problems need to be solved to obtain audio performances that are free from artifacts. The analysis-synthesis process needs to be improved in order to obtain better tone separation. An open issue is the choice of the most suitable time-stretching approach. We also need to find a solution to remove the phase artifacts introduced in the synthesis stage. Finally, after testing the various parts of the system in Matlab, we would like to implement a real-time version. This is not infeasible, given that the analysis part of the computations can be done in advance and only the modification/synthesis part would run online.

9. REFERENCES

[1] J. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. Sandler. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing, 13(5):1035-1047, 2005.

[2] S. Dixon and G. Widmer. MATCH: A music alignment tool chest. In Proceedings of the 6th International Symposium on Music Information Retrieval (ISMIR05), London (UK), 2005.

[3] A. J. Ferreira and D. Sinha. Accurate spectral replacement. In Proceedings of the 118th Convention of the Audio Engineering Society, Barcelona (Spain), May 2005.

[4] A. Friberg. Home conducting: Control the overall musical expression with gestures. In Proceedings of the 2005 International Computer Music Conference (ICMC05), pages 479-482, Barcelona (Spain), September 2005.

[5] A. Friberg, R. Bresin, and J. Sundberg. Overview of the KTH rule system for musical performance. Advances in Cognitive Psychology, Special Issue on Music Performance, 2(2-3):145-161, 2006.

[6] F. Gouyon, L. Fabig, and J. Bonada. Rhythmic expressiveness transformations of audio recordings: Swing modifications. In Proceedings of the International Conference on Digital Audio Effects (DAFX03), London (UK), 2003.

[7] J. Janer, J. Bonada, and S. Jordà. Groovator - an implementation of real-time rhythm transformations. In Proceedings of the 121st Convention of the Audio Engineering Society, San Francisco, CA (USA), 2006.

[8] P. N. Juslin. Cue utilization in communication of emotion in music performance: Relating performance to perception. Journal of Experimental Psychology: Human Perception and Performance, 26:1797-1813, 2000.

[9] A. P. Klapuri. Multipitch estimation and sound separation by the spectral smoothness principle. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2001), pages 3381-3384, Salt Lake City, UT (USA), 2001.

[10] E. Lee, H. Kiel, S. Dedenbach, I. Gruell, T. Karrer, M. Wolf, and J. Borchers. iSymphony: An adaptive interactive orchestral conducting system for conducting digital audio and video streams. In Extended Abstracts of the CHI 2006 Conference on Human Factors in Computing Systems, Montreal (Canada), 2006.

[11] D. A. Luce. Dynamic spectrum changes of orchestral instruments. Journal of the Audio Engineering Society, 23(7):565-568, 1975.
[12] R. J. McAulay and T. F. Quatieri. Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech and Signal Processing, 34(4):744-754, August 1986.

[13] X. Zhu, G. T. Beauregard, and L. Wyse. Real-time iterative spectrum inversion with look-ahead. In Proceedings of the 2006 IEEE International Conference on Multimedia and Expo (ICME 2006), Toronto, Canada, July 2006.