WAVELET ANALYSIS OF RHYTHM IN EXPRESSIVE MUSICAL PERFORMANCE

Neil P. McAngus Todd
Department of Music, City University, Northampton Square, London EC1V 0HB, UK

In this paper a new analysis method is described in which a compact structural description can be recovered from the musical signal. Rather than carry out the multi-scale decomposition on the sound signal itself, this method decomposes the sound energy flux. This enables the analysis of rhythmic phenomena which have frequencies several orders of magnitude below pitch and are not represented in the sound signal. The structures resemble the grouping/metrical structures of the theory of Lerdahl and Jackendoff in the sense that they are composed of two complementary components. However, the structures which result from this analysis are of a more primitive, low-level nature, more akin to the primal sketch in the theory of vision as proposed by Marr.

Research supported by a grant from the MRC, No. G9018013

1 Introduction

Theories of rhythm in both speech and music agree that the rhythmic structure of an utterance or musical passage can be represented by two complementary components - a phrasal type structure and a metrical type structure. In the case of speech, according to Selkirk (1984), these are the prosodic constituent structure and the metrical grid. In the case of music, according to Lerdahl and Jackendoff (1983), these are grouping structure and metrical structure. Whilst there are some differences in the way these two components are defined in the speech and music domains, they are broadly similar, particularly in the hierarchical nature of the structures, which can be represented as a tree.

Attempts to model the perceptual process by which listeners acquire such structures in music have focussed largely on the metrical component. Such models are mostly cast in the form of a symbolic algorithm (see Lee (1991) for an extensive review).
Other models have taken a more bottom-up approach (see Vercoe (1984) for example). However, whilst these models have successfully captured some aspects of rhythmic perception, almost without exception they run into difficulties when confronted with an expressive performance whose principal component is tempo variation. Thus for these models musical expression is a form of "noise". This is of course at odds with the reality of the listening experience, which is that not only is expression NOT a form of noise but it IS (a) essential to a valid performance and (b) rich in structural information. The sort of information communicated is consistent with theories of rhythm, i.e. phrasal type expression, such as marking a boundary (Todd, 1992a), and metrical type expression, such as making strong beats longer, louder or more legato (Sloboda, 1983).

In this paper a parsing mechanism is proposed whose function is the interpretation of musical expression (Todd, 1992b, 1993, manuscript). Rather than start from a linguistic analogy, an analogy with the theory of low-level visual processing forms the basis for the design of this mechanism. The basic idea is that one first carries out a multi-scale decomposition of the sound energy flux (the analog of image intensity); various structural features of the rhythm can then be recovered by looking for the coincidence of the zero-crossings of derivatives of the smoothed energy flux.

6B.1 264 ICMC Proceedings 1993

2 A Causal Analog of Visual Processing

2.1 Theory of Edge Detection

In the theory of vision as proposed by Marr (1982) the first stage in the detection of edges involves blurring the image I by convolution with a Gaussian low-pass filter G. A 2-D Gaussian function is given by

$$G(r, \sigma) = \frac{1}{2\pi\sigma^2} \exp\left[-r^2/2\sigma^2\right] \tag{1}$$

where $r$ is radius and $\sigma$ is a space-constant. Since it is not possible to capture all information about intensity changes in a single convolution operation, this blurring or smoothing process is carried out over a range of scales $\sigma_k$, where $k$ refers to channel number. The next stage in edge detection involves differentiating the smoothed versions of the image and looking for either peaks in the first derivative or zero-crossings of the second derivative, which in the 2-D case is given by the Laplacian operator $\nabla^2$. Thus effectively the image is convolved with

$$\nabla^2 G(r, \sigma_k) = -\frac{1}{\pi\sigma_k^4}\left(1 - \frac{r^2}{2\sigma_k^2}\right)\exp\left[-r^2/2\sigma_k^2\right] \tag{2}$$

which have a characteristic "mexican hat" shape. The "mexican hat" filter is an example of a wavelet (Kronland-Martinet, Morlet, & Grossman, 1987).

2.2 Rhythm Perception

In order to apply the above ideas to rhythm perception we might suppose that the substitution of time $t$ for radius and time-constant $\tau$ for space-constant, yielding a temporal Gaussian

$$G(t, \tau) = \exp\left[-t^2/2\tau^2\right], \tag{3}$$

would provide the basis for the decomposition of the signal energy $|x(t)|^2$. That is, we might recover information by first convolving the signal energy with a range of temporal Gaussians, to obtain smoothed versions of the energy $S_G(t, \tau_k) = G(t, \tau_k) * |x(t)|^2$, then look for the coincidence of zero-crossings of the derivatives. However, since it is not possible to physically realize an ideal Gaussian response as a causal filter, we use a polynomial approximation (Dishal, 1959). The analog design is then discretized to a cascaded IIR structure using a bilinear transformation (Proakis and Manolakis, 1992).
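
The smoothing stage described above can be illustrated with a small sketch. It does not reproduce the polynomial Gaussian approximation of Dishal (1959). As a stand-in assumption, it cascades identical first-order low-pass sections, each discretized with the bilinear transformation; the combined impulse response tends towards a Gaussian shape as the number of sections grows. The function names, the default order, and the choice of tau / sqrt(order) for the per-section time-constant are all illustrative assumptions, not part of the original system.

```python
import math


def one_pole_lowpass(x, tau, fs):
    """Causal low-pass 1 / (1 + s*tau), discretized by the bilinear transform.

    Substituting s = 2*fs*(1 - z^-1)/(1 + z^-1) gives, with k = 2*fs*tau:
        y[n] = (x[n] + x[n-1] - (1 - k) * y[n-1]) / (1 + k)
    """
    k = 2.0 * fs * tau
    y = []
    x_prev = 0.0
    y_prev = 0.0
    for x_n in x:
        y_n = (x_n + x_prev - (1.0 - k) * y_prev) / (1.0 + k)
        y.append(y_n)
        x_prev, y_prev = x_n, y_n
    return y


def causal_gaussian_smooth(energy, tau, fs, order=4):
    """Approximate Gaussian smoothing with a cascade of identical sections.

    Assumption: each section gets time-constant tau / sqrt(order), so that
    the spread (standard deviation) of the cascade's impulse response is
    roughly tau.
    """
    section_tau = tau / math.sqrt(order)
    y = list(energy)
    for _ in range(order):
        y = one_pole_lowpass(y, section_tau, fs)
    return y


def log_spaced_time_constants(tau_min, tau_max, per_octave):
    """Time-constants spaced logarithmically, per_octave channels per octave."""
    n = int(round(per_octave * math.log2(tau_max / tau_min)))
    return [tau_min * 2.0 ** (i / per_octave) for i in range(n + 1)]


def multiscale_smooth(energy, taus, fs):
    """One smoothed copy of the energy signal per channel k (time-constant taus[k])."""
    return [causal_gaussian_smooth(energy, tau, fs) for tau in taus]
```

Each section has unit gain at zero frequency, so a constant input passes through the cascade unchanged once the transient has died away. Because every section is causal, the smoothed output lags the input by an amount that grows with the time-constant.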
As well as the usual advantages over the FIR approach, the IIR design also avoids the need for down/up sampling (Kronland-Martinet, Morlet, and Grossman, 1987).

2.3 Three Kinds of Zero-Crossing Structures

We are interested in three kinds of low-level structures which are required to form a rhythmic interpretation:

(a) onsets, the analog of visual edges, which are obtained by detecting negative going zero-crossings of the second derivative, i.e.

$$\frac{d^2}{dt^2} S_G(t, \tau_k) = 0 \quad\text{and}\quad \frac{d}{dt} S_G(t, \tau_k) > 0; \tag{4}$$

(b) group boundaries, which are obtained by detecting positive going zero-crossings of the first derivative, i.e.

$$\frac{d}{dt} S_G(t, \tau_k) = 0 \quad\text{and}\quad \frac{d^2}{dt^2} S_G(t, \tau_k) > 0; \tag{5}$$

(c) stress structure, which describes the hierarchy of importance of events within the segmentation boundaries and is obtained by detection of negative going zero-crossings of the first derivative, i.e.

$$\frac{d}{dt} S_G(t, \tau_k) = 0 \quad\text{and}\quad \frac{d^2}{dt^2} S_G(t, \tau_k) < 0. \tag{6}$$

In each case the loci of zero-crossings form 3-D structures in an energy-frequency-time space. The projection in the frequency-time plane of the stress structure gives a fractal-like tree representation of the rhythm which can be shown to have a strong relationship to perceived hierarchy (Todd and Lee, manuscript).
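
The three detection rules of Section 2.3 can be sketched for a single channel as follows. This is a minimal illustration, assuming the smoothed energy is available as a list of samples and that derivatives are approximated by causal backward differences. The function names and the dictionary keys are hypothetical, and the tracking of zero-crossings across channels is not shown.

```python
def derivative(y, fs):
    """Backward-difference derivative; the first value is repeated as padding."""
    d = [(y[n] - y[n - 1]) * fs for n in range(1, len(y))]
    return [d[0]] + d if d else []


def zero_crossings(d, direction):
    """Indices n where d changes sign between samples n-1 and n.

    direction > 0 selects positive going crossings (negative to non-negative),
    direction < 0 selects negative going crossings (positive to non-positive).
    """
    hits = []
    for n in range(1, len(d)):
        if direction > 0 and d[n - 1] < 0.0 <= d[n]:
            hits.append(n)
        elif direction < 0 and d[n - 1] > 0.0 >= d[n]:
            hits.append(n)
    return hits


def rhythm_features(smoothed, fs):
    """Classify zero-crossings of one smoothed channel S_G(t, tau_k)."""
    d1 = derivative(smoothed, fs)
    d2 = derivative(d1, fs)
    return {
        # (a) onsets: second derivative crosses zero going negative
        #     while the first derivative is positive (rising energy)
        "onsets": [n for n in zero_crossings(d2, -1) if d1[n] > 0.0],
        # (b) group boundaries: first derivative crosses zero going positive
        #     (local minima of the smoothed energy)
        "boundaries": zero_crossings(d1, +1),
        # (c) stress: first derivative crosses zero going negative
        #     (local maxima of the smoothed energy)
        "stress": zero_crossings(d1, -1),
    }
```

Running `rhythm_features` on every channel of a filter-bank and plotting the stress indices against channel number would give a picture in the spirit of the frequency-time projection described above. The backward difference reports each crossing at the first sample after the sign change.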

3 An Analysis

In order to demonstrate the power of this approach an analysis of a performance of the Chopin Prelude Op.28, No.7 is shown. We focus in particular on the recovery of phrasal information. The Prelude Op.28, No.7 consists of two 8-bar sections A1 and A2, each of which contains four 2-bar phrases p1, p2, p3, p4.

3.1 Energy Flux

The input to the system is an analog signal x(t) which is full-wave rectified, |x(t)|, then integrated with a user-selectable cut-off frequency. Since rhythmic phenomena typically have a range of frequencies from 10 Hz down to 0.01 Hz, this cut-off is also usually low, about 50Hz. The output of the analog front-end is then sampled, forming the input to a digital filter-bank. Since the frequency content of the input signal is low we can sample at a low rate, typically about 100Hz, thus saving massively on computation time and storage space. Figure 1 shows such a sample for a performance of the Chopin Prelude.

[Figure 1 plot: horizontal axis time (0-50).]

Figure 1. A performance of the Chopin Prelude Op.28, No.7. The signal is full-wave rectified, low-pass filtered then sampled at 100Hz. The filter was a 2nd-order Gaussian approximation with a time-constant of about 100ms.

3.2 Stress structure

In the software component of the system the user can select the type of Gaussian approximation, e.g. Taylor, Bessel, etc., the order of the filters and their range and spacing. Three analyses are carried out in parallel, i.e. onsets, peaks and group boundaries, and can be displayed as zero-crossing structures together or separately. Alternatively, the energy density spectrum can be shown either as a static display or a dynamic pattern in real-time. Figure 2 shows the stress structure obtained from the sample shown in figure 1. We can see that this structure accurately reflects the hierarchical organization of the phrases in the Prelude and gives a measure of the relative importance of the phrases within the structure.
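
The front-end of Section 3.1 is analog in the system described. Purely as an illustration, a digital stand-in can be sketched, assuming the sound is already available as a list of samples at rate `fs_in`. The first-order leaky integrator, the 100 ms default time-constant (borrowed from the caption of Figure 1, where the filter is 2nd-order), and the function and parameter names are assumptions of this sketch.

```python
import math


def energy_flux(x, fs_in, tau_s=0.1, fs_out=100.0):
    """Full-wave rectify, smooth with a leaky integrator, then resample.

    x      : input samples (e.g. audio) at rate fs_in
    tau_s  : integrator time-constant in seconds (assumed value)
    fs_out : low output rate; every round(fs_in / fs_out)-th value is kept
    """
    alpha = 1.0 - math.exp(-1.0 / (fs_in * tau_s))  # per-sample smoothing weight
    step = int(round(fs_in / fs_out))
    state = 0.0
    out = []
    for n, sample in enumerate(x):
        # full-wave rectification |x| followed by one-pole smoothing
        state += alpha * (abs(sample) - state)
        if n % step == 0:
            out.append(state)
    return out
```

For a steady sine tone the output settles near the mean of the rectified waveform, and changes in loudness over time appear as slow variations in the output. A sequence produced this way could serve as input to a multi-scale filter-bank operating at the low sampling rate.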
Of course another performer might have chosen a different interpretation, in which case this would be reflected in a different stress structure. Thus the analysis is also sensitive to expressive interpretation.

4 Summary

Models of rhythm perception have generally treated expression as noise. In this paper we have shown how it is possible to construct a mechanism, which involves the multi-scale decomposition of the sound energy, for the recovery of structural information from an expressive performance. In particular we have demonstrated the recovery of phrasal information from a performance of the Chopin Prelude Op.28, No.7. As well as being a useful tool for the analysis of expressive musical performance, the system also acts as a model of rhythm perception.

[Figure 2 plot: horizontal axis time (0-50), vertical axis 0-40; branch labels include A2, p1-p4 and starred phrase labels p1*, p3*, p4*.]

Figure 2. The stress structure of the performance of the Chopin Prelude Op.28, No.7. The filters selected were fourth-order Taylor series Gaussian approximations logarithmically spaced a 12th octave apart with time-constants ranging from 2-40s.

References

Dishal, M. (1959) Gaussian response filter design. Electrical Communications. 36(1), 3-26.
Kronland-Martinet, R., Morlet, J. & Grossman, A. (1987) Analysis of sound patterns through wavelet transform. Int. J. Pattern Recogn. Artificial Intell. 1(2), 273-302.
Lee, C.S. (1991) The perception of metrical structure: Experimental evidence and a model. In P. Howell, I. Cross & R. West (Eds), Representing Musical Structure. Academic Press, London.
Lerdahl, F. & Jackendoff, R. (1983) A generative theory of tonal music. MIT Press, Camb. MA.
Marr, D. (1982) Vision. Freeman, NY.
Proakis, J.G. & Manolakis, D.G. (1992) Digital signal processing. Principles, algorithms and applications. Macmillan, NY.
Selkirk, E.O. (1984) Phonology and syntax. The relationship between sound and structure. MIT Press, Camb. MA.
Sloboda, J. (1983) The communication of musical meter in piano performance. Q. J. Exp. Psychol. 35, 377-396.
Todd, N.P. McAngus (1992a) The dynamics of dynamics: a model of musical expression. J. Acoust. Soc. Am. 91(6), 3540-3550.
Todd, N.P. McAngus (1992b) Hierarchical event detection. J. Acoust. Soc. Am. 92(4)B, 2380.
Todd, N.P. McAngus (1993) Segmentation and stress in the rhythmic structure of music and speech. J. Acoust. Soc. Am. 93(4)B, 2363.
Todd, N.P. McAngus (manuscript submitted for publication) The auditory primal sketch: A multi-scale model of rhythm perception.
Todd, N.P. McAngus & Lee, C.S. (manuscript submitted for publication) A multi-scale account of interval produced accents.
Vercoe, B. (1984) The synthetic performer in the context of live performers. Proc. Int. Comput. Mus. Conf. 199-200.