TOWARD AN INTEGRATED SOUND ANALYSIS AND PROCESSING FRAMEWORK FOR EXPRESSIVENESS RENDERING

Carlo Drioli, Riccardo Di Federico
Centro di Sonologia Computazionale
Dipartimento di Elettronica e Informatica - Università di Padova
via S. Francesco 11 - 35121 Padova - Italy

ABSTRACT

A framework for the automatic expressive modification of recorded musical performances is presented. Different levels of sound and performance analysis are integrated, so that the same performance can be characterized from different points of view (acoustic parameters and musical structures). Based on these analyses it is then possible to provide a score description of a recorded performance, which can be used as input to an expressiveness model. The final audio processing and resynthesis stages provide the tools for applying the sound modifications computed by the expressiveness model.

1. INTRODUCTION

The work presented in this paper is an attempt to bring together existing and new sound analysis and processing techniques in a single system for the expressive modification of recorded musical performances. The field of expressive processing has recently gained some attention, and audio processing techniques based on the sinusoidal modeling of sound have proved suitable for this kind of application (Arcos et al., 1997; Serra, 1997). The main issue, however, is that audio effects such as time stretching and pitch shifting must be controlled according to the sound articulation (vibrato, portamento, attacks, voiced/unvoiced discrimination for singing, etc.). The aim of the system is to integrate recognition, expressive manipulation and high-quality resynthesis of sound in a single tool, in which automation can be introduced when desired. The research focuses on a wide class of musical signals, namely monophonic and quasi-harmonic sounds such as wind instruments, solo string instruments and singing.

The paper is organized as follows.
After a brief presentation of the general structure of the system, the analysis and processing stages are described. The analysis step involves signal modeling (sinusoidal analysis) and a higher-level analysis related to musical structures and performance; here, a multi-level segmentation of the signal is proposed, which gives a complete representation of musical events, from the note level up to high-level musical attributes such as staccato/legato or vibrato. This allows the definition of a joint description of sound and performance. The analysis results are then passed to the expressiveness model (Canazza et al., 1998), which computes modifications on a symbolic representation of the music. In the last part of the paper some post-processing algorithms are introduced.

2. ANALYSIS - RESYNTHESIS FRAMEWORK

The general structure of the system is outlined in Fig. 1. The input signal is first processed by a sinusoidal analysis. Besides a parametric signal representation, this analysis provides other features, such as pitch and energy, which are used by the subsequent high-level performance analysis block. At this stage a complete description of the signal, in terms of both sound model and performance, is available for expressive transformation. The expressiveness model computes the required deviations of the performance parameters and controls the rendering block, which performs the necessary sound modifications.

[Figure 1: Outline of the system. Blocks: input sound, sinusoidal analysis, high-level performance analysis, score representation, expressiveness model, expressive transformation, resynthesis, output. Stage groups: analysis, rendering.]
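
The data flow of Fig. 1 can be summarized with a minimal sketch. It only shows the order in which the four blocks exchange data; the names `Analysis` and `process`, the field names and the callable signatures are hypothetical and are not taken from the system described here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Analysis:
    """Output of the sinusoidal analysis stage (hypothetical field names)."""
    partials: object          # parametric sinusoidal representation
    pitch: Sequence[float]    # per-frame pitch estimate
    energy: Sequence[float]   # per-frame energy estimate


def process(sound,
            sinusoidal_analysis: Callable,
            performance_analysis: Callable,
            expressiveness_model: Callable,
            rendering: Callable):
    """Chain the four blocks of Fig. 1 in order."""
    analysis = sinusoidal_analysis(sound)
    # The high-level analysis builds the symbolic description from pitch and energy.
    description = performance_analysis(analysis.pitch, analysis.energy)
    # The model sees only the symbolic description and returns parameter deviations.
    deviations = expressiveness_model(description)
    # Rendering applies the deviations to the sinusoidal representation.
    return rendering(analysis.partials, deviations)
```

Passing the blocks as callables keeps the sketch independent of any particular analysis or model implementation.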

3. SINUSOIDAL ANALYSIS

Over the last decade, sinusoidal modeling of audio signals has proved to be one of the most flexible tools for sound analysis and resynthesis. Introduced by McAulay and Quatieri for speech coding (McAulay, Quatieri, 1986), sinusoidal modeling was later improved for musical purposes by Smith and Serra (Serra, 1997), who extended the model to inharmonic sounds and introduced the residual, that is, the random part of the signal. In the sinusoidal plus residual model, the input signal is approximated as a sum of sinusoids (partials), which retain most of the signal energy, plus an error term, the residual, which models the noisy part of the sound:

$$x(t) = \sum_{i=1}^{N} A_i(t) \cos[\theta_i(t)] + r(t), \qquad \theta_i(t) = \int_0^t \omega_i(\tau)\, d\tau$$

The estimation of $A_i$, $\theta_i$ and $\omega_i$ usually involves a Short Time Fourier Transform (STFT), a peak picking step and a continuation (or matching) algorithm, which identifies the time evolution of the partials from frame to frame. In this scheme, partials can appear or 'die', depending on their amplitude and frequency variations.

By limiting our target to quasi-harmonic signals, we simplified the partial tracking step. In our scheme a partial is assumed to be the spectral maximum nearest to the position predicted by the pitch (Di Federico, Borin, 1998). Clearly, in this case the pitch must be computed before the partials are located, so great care was put into formulating an efficient method that reliably decides on the position of the fundamental frequency. The method measures the mean frequency distance between peaks in the spectrum. The pitch is then taken as the position of the peak whose frequency is nearest to this estimate. In spite of its simplicity, this procedure turns out to be very effective in modeling quasi-harmonic sounds.

4. HIGH LEVEL PERFORMANCE ANALYSIS

The performance modification section (expressiveness model) requires a symbolic description of the signal.
The extraction of this musical information can be considered at different levels. We refer here to a basic structure composed of at least two layers, organized as follows:

- Layer I: Note-level segmentation. For this level a score-like representation of the signal is constructed by means of a score matching procedure, which identifies the position, duration and pitch of the notes in the performance.
- Layer II: High-level musical attribute segmentation. By high-level musical attributes we mean here pitch-related events (such as vibrato or portamento), timing-related events (such as rallentando or syncopation) and amplitude-related events (such as tremolo or accents).

Each event in the second layer can be represented by means of a parametric model. The procedures for the identification and parameterization of these events are based on the sinusoidal model representation and on acoustic features. Timbre attributes should be considered in the second layer as well; however, due to the complexity of the topic, timbre characterization is deferred to future research.

The importance of a high-level description of the performance has recently been pointed out in (Serra et al., 1997), where a detailed spectral description format is proposed for encoding all possible musical events. In order to produce a high-level description automatically, part of the present work focuses on the automatic recognition of events in layer I and layer II. The two-layer segmentation proved to be satisfactory for an accurate description of the performance of a generic solo instrument. The singing voice has been considered as well but, due to the variety of its peculiar features (voiced/unvoiced sounds, timbre variability, articulation), an extended description of the signal, accounting for phonetics, is under study.

Score-performance matching and onset detection. We based the note-level segmentation (layer I) on a score-performance matching procedure.
This operation links note events in a musical performance to the corresponding events in the score (Desain et al., 1997). Most score matching algorithms are primarily based on pitch information. We assume that the performer recorded the performance while reading a score and listening to a musical accompaniment (or with the aid of a metronome). With these constraints, score matching can reasonably be performed by cross-correlating the real pitch with the nominal pitch given in the score. After a first global alignment, the cross-correlation window is progressively reduced, thus refining the local tempo alignment. A considerable improvement of this approach comes from 'cleaning' the pitch profile of the influence of higher-level attributes such as vibrato or unwritten rests (micropauses). Fig. 2a shows a comparison between the real and the modified pitch. The two detection procedures, for vibrato and for micropauses, are described in the paragraphs that follow.
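
The global alignment step of the score matching described above can be illustrated with a minimal sketch. It assumes that the performed pitch and the nominal pitch from the score are both available as per-frame profiles at the same frame rate (here in semitone units), and it searches for the frame lag that maximizes the normalized cross-correlation of the overlapping parts. The function name `best_lag`, the `max_lag` parameter and the choice of a normalized correlation are assumptions made for illustration, not details given in the text.

```python
import numpy as np


def best_lag(performed, nominal, max_lag):
    """Return the lag (in frames) such that performed[n] best matches nominal[n - lag]."""
    p = np.asarray(performed, dtype=float)
    q = np.asarray(nominal, dtype=float)
    best, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        # Select the overlapping parts of the two profiles for this lag.
        if lag >= 0:
            a, b = p[lag:], q[:len(q) - lag]
        else:
            a, b = p[:len(p) + lag], q[-lag:]
        n = min(len(a), len(b))
        if n == 0:
            continue
        a = a[:n] - a[:n].mean()
        b = b[:n] - b[:n].mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        # Normalized correlation; a flat segment carries no alignment information.
        score = float(a @ b) / denom if denom > 0 else 0.0
        if score > best_score:
            best, best_score = lag, score
    return best
```

The progressive refinement mentioned in the text could be approximated by calling the same function again on shorter excerpts around each nominal note position, with a smaller `max_lag`.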

Vibrato recognition. The vibrato frequency usually lies in a very narrow range, from 4.5 to 7 Hz, and its waveform is roughly sinusoidal. Since our task is basically to detect and track a sinusoid, a natural way to do it is to analyze the pitch profile through an STFT. The mean value of the pitch is usually much greater than the vibrato extent; in order to reduce the influence of these very low frequency components on the spectral estimation in the vibrato range, the pitch profile is pre-processed by a DC-killer. A further improvement can be obtained by averaging the STFT over the last few frames. Given the above conditions on frequency range and waveform, we expect to detect vibrato whenever a peak is found between 4 and 7 Hz and its energy is high enough compared with the global energy (the corresponding threshold is determined empirically).

Micropauses. Performed notes are often separated even though there is no rest between them in the score. The procedure for micropause recognition exploits the temporal masking effect (also used in MPEG/Audio coding), by which a soft sound preceded by a very loud sound is not heard. This effect can be seen as a limit on the decay rate of the signal energy: if the decay is too fast the ear is unable to follow it, and the corresponding sound section is perceived as silence. Fig. 2b depicts a comparison between an example of a real RMS profile and the corresponding time-masking profile.

The cross-correlation between the modified pitch and the nominal pitch gives a first estimate of the deviation of the real note onsets from their nominal positions. An observation window with Gaussian shape is then used as a weighting function for the evaluation of pitch deviations (which are mainly due to note transitions or articulations) in the neighborhood of the onset estimate (Fig. 3).
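
The vibrato recognition rule given earlier in this section can be sketched for a single analysis window of the pitch profile. The sketch assumes a per-frame pitch profile with a known frame rate; it subtracts the mean as a simple stand-in for the DC-killer, and it declares vibrato when the strongest spectral peak lies in the 4-7 Hz band and the band holds a large enough share of the total energy. The function name `detect_vibrato`, the Hann window, the zero-padding factor and the default threshold of 0.5 are assumptions chosen for illustration; the text only states that the threshold is set empirically.

```python
import numpy as np


def detect_vibrato(pitch, frame_rate, band=(4.0, 7.0), threshold=0.5):
    """Return (detected, peak_frequency_hz) for one window of the pitch profile."""
    x = np.asarray(pitch, dtype=float)
    x = x - x.mean()                      # mean removal, standing in for the DC-killer
    x = x * np.hanning(len(x))            # taper to limit spectral leakage
    n_fft = 4 * len(x)                    # zero padding for a finer frequency grid
    spectrum = np.abs(np.fft.rfft(x, n=n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / frame_rate)
    total = spectrum.sum()
    if total == 0.0:                      # perfectly flat pitch: nothing to detect
        return False, 0.0
    peak_freq = float(freqs[int(np.argmax(spectrum))])
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    ratio = spectrum[in_band].sum() / total
    detected = band[0] <= peak_freq <= band[1] and ratio >= threshold
    return bool(detected), peak_freq
```

Running the function on successive overlapping windows, and averaging the spectra of the last few windows, would approximate the STFT-based tracking described in the text.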
The center and the width of each Gaussian observation window are computed so as to cover the region where the pitch is outside a 60-80 cent tolerance threshold around the nominal pitch, which is likely to be the region of the note transition. The onset time is determined by finding the maximum or minimum of the weighted derivative of the pitch, depending on which kind of pitch transition is expected (high -> low, low -> high, stable, micropause-note, etc.). This procedure proved to be robust for well-performed musical phrases, and it recognizes most of the events when the performer introduces small tempo and pitch deviations, as well as note duration changes.

At this point of the analysis, the first-layer description is complete and the second layer contains the high-level musical attributes that were detected before the time-alignment procedure. Other high-level attributes, such as time-related ones (rallentando, accelerando), can then be detected. In most cases a parametric description of the attributes is adopted, for example in terms of geometric or trigonometric functions, in order to allow easier modifications.

[Figure 2: Pitch (a) and energy (b) analysis. A fragment of a violin performance is shown. The sequence of two legato notes followed by a vibrato is apparent. Panel (a): measured pitch and modified pitch versus frame number, with vibrato indication. Panel (b): real RMS profile and masking RMS profile versus frame number, with a micropause marked.]

[Figure 3: Note onset detection after local alignment of nominal pitch and modified pitch. The algorithms for positioning the Gaussian observation windows and for determining the onset times are based on out-of-threshold pitch events and on previously detected offset times. Curves: modified pitch, nominal pitch, pitch tolerance threshold and pitch derivative versus frame number, with the onset selection marked.]

5. PROCESSING AND RESYNTHESIS

Once the (symbolic) performance modifications have been calculated by the expressiveness model, a post-processing block, based on the initial sinusoidal analysis, provides the necessary sound modifications. Timing, pitch and dynamics are the essential signal features for expressive performance. The traditional time stretching implementation for the sinusoidal model is basically a resampling of the analysis points in time, so as to change the speed of the sound articulation while keeping its frequency and amplitude structure. The approach to pitch shifting usually involves a frequency scaling of the partials followed by an amplitude adjustment in order to keep the original formant structure. Although these approaches lead to simple modification algorithms, the sound quality is often unsatisfactory, due to the loss of waveform coherence caused by phase distortion. This problem is addressed in (Di Federico, 1998), where a waveform-preserving approach is presented.

Besides these basic effects, higher-level features were considered. An important case is vibrato processing. Vibrato can be represented with good approximation by a time-varying sinusoid

$$v(t) = V_0(t) + V(t) \cos\left(2\pi \int_0^t f_v(\tau)\, d\tau + \varphi_0\right), \qquad t \in [0, T_v]$$

where $V_0(t)$ is the underlying slowly varying pitch, $V(t)$ is the vibrato extent, that is the peak to peak pitch deviation, $f_v(t)$ is the vibrato rate, $\varphi_0$ is the initial phase and $T_v$ is the vibrato duration. The modification of vibrato can be decomposed into pitch flattening (vibrato elimination) and vibrato synthesis, the latter obtained by modifying the vibrato parameters. The flattening is performed on all partials; for this purpose a stopband filter with cutoff frequencies of 4.5 and 10 Hz has been designed. To avoid the effect of transients, the selected sections are extracted from the pitch profile and filtered separately. The pitch smoothing procedure can also be used effectively for the amplitude modulations (tremolo) that always occur with vibrato.

6. CONCLUSIONS

In this work we have presented a framework for the automatic analysis and processing of monophonic musical signals. The aim of this research is to develop an environment in which recorded audio performances can be processed like symbolic scores by an expressiveness model.
To approach this problem we tried to combine different kinds of performance analysis (acoustic analysis of the signal and a higher-level analysis of musical events, such as notes and high-level musical attributes) with a flexible and effective post-processing stage, capable of implementing high-level sound modifications, such as vibrato alteration, as well as the more traditional time-varying time stretching and pitch shifting. Some original approaches to musical event recognition and sound processing have been proposed. Besides expressive processing, the presented framework lends itself to other 'intelligent' score-related applications, such as automatic timing and pitch correction.

Most of our future research will focus on singing. This is by far the most complicated musical signal to deal with, so further extensions are needed, especially for phonetic characterization and specific audio effects.

ACKNOWLEDGEMENTS

This work has been supported by Telecom Italia S.p.A. under the research contract "Cantieri Multimediali".

REFERENCES

J. L. Arcos et al., 1997, "SaxEx: a case-based reasoning system for generating expressive musical performances," Proceedings of ICMC97, Thessaloniki, Greece, pp. 329-336.

S. Canazza et al., 1998, "A model to add expressiveness to automatic musical performance," to be published in Proceedings of ICMC98, Ann Arbor, Michigan.

P. Desain et al., 1997, "Robust Score-Performance Matching: Taking Advantage of Structural Information," Proceedings of ICMC97, Thessaloniki, Greece, pp. 337-340.

R. Di Federico, 1998, "Waveform Preserving Time Stretching and Pitch Shifting for Sinusoidal Models of Sound," Proceedings of COST-G6 Workshop on Digital Audio Effects (DAFX98), Barcelona (in press).

R. Di Federico, G. Borin, 1998, "An improved pitch synchronous sinusoidal analysis-synthesis method for voice and quasi harmonic sounds," Proceedings of XII Colloquium on Musical Informatics, Gorizia (Italy) (in press).

R. J. McAulay, T. F. Quatieri, 1986, "Speech Analysis/Synthesis based on a sinusoidal representation," IEEE Trans. ASSP, vol. 34, no. 4, August, pp. 744-754.

X. Serra, 1997, "Musical Sound Modeling with sinusoids plus noise," in Musical Signal Processing, ed. by C. Roads, S. T. Pope, A. Piccialli and G. De Poli, Swets and Zeitlinger Publ., pp. 91-122.

X. Serra et al., 1997, "Integrating complementary spectral models in the design of a musical synthesizer," Proceedings of ICMC97, Thessaloniki, Greece, pp. 152-159.