Score Following of Orchestral Music Using Acoustic Pressure Peak-Tracking and Linear Stretch Matching

Takefumi Miura(1), Ayumu Akabane(1), Makoto Sato(1), Takao Tsuda(2) and Seiki Inoue(2)
(1) Precision and Intelligence Laboratory, Tokyo Institute of Technology
(2) Science & Technical Research Laboratories, Japan Broadcasting Corporation

Abstract

In this research, we present a new real-time score following system for orchestral music. A score of orchestral music includes portions that can be followed with the existing method using short-time spectral analysis and Dynamic Programming (DP) matching. It also includes portions where the fundamental tones and overtones of many instruments overlap in a complicated way, so that the existing method cannot be used. In these latter portions, however, we can observe distinct acoustic pressure peaks. We therefore propose a new score following system that uses these pressure peaks, to be used in combination with a system based on the existing method. First, we originate a new method for presuming pressure peaks from an orchestral score. Then, in order to follow the score, we match these presumed peaks with the corresponding peaks extracted from the performance sound. When DP is used in the matching process, the time values of notes often deviate from the correct values because of the nonlinear characteristic of DP. Finally, we propose a new linear stretch matching method to avoid these deviations of time values. We implement a system using this new method and compare its results with those obtained by using DP.

1 Introduction

In recent years, several algorithms and techniques have been developed in the field of score following (Grubb and Dannenberg 1997), (Orio et al. 2003) and in related fields such as score alignment (Müller, Kurth, and Röder 2003) and beat tracking (Goto 2001). As a result, it has become possible to follow a score automatically in monophonic music and, step by step, in polyphonic music. On the other hand, automatic score-based camerawork planning has begun to be proposed for orchestra shooting (Shiba et al. 2003). We therefore think it is time to take a first step toward fully automatic TV-shooting of an orchestral performance program, in which score following techniques are to be adopted. However, the existing score following systems using short-time spectral analysis and DP matching (or Hidden Markov Models) can follow instruments with low polyphony, such as a violin, but have so far been unable to track highly polyphonic instruments (Orio et al. 2003). These methods analyse the frequency spectra of the performance sound and use them to detect the pitches of notes in the corresponding score. In orchestral performances, many instruments often sound at different pitches simultaneously, and it is very difficult to detect pitches at these moments. However, we can observe distinct pressure peaks (corresponding to note onsets) in these portions. We therefore propose a score following system that uses these peaks instead of pitch detection, to be used in combination with the existing pitch-detection systems.

2 The pre-matching processes

In our research, we use standard recordings (CDs of orchestral performances) as our sound sources and extract acoustic pressure peaks pa(t) from them. In this process, we first rectify the performance sound and then low-pass filter it. By this process, we obtain the required pressure peaks (Fig. 1).

2.1 Presumed pressure peaks from a score

To compare the performance sound with the corresponding score, we originate a method for presuming pressure peaks from the score. Fig. 2 represents this method. First, we convert an orchestral score into a standard MIDI file. In this process, we make each part of the orchestral score correspond to a MIDI channel, and the dynamics of each note on the score correspond to the Velocity byte of the Note-On status. Next, we define a relative weight that is proportional to the acoustic pressure from each instrument. We obtain the Total Velocity by multiplying the Velocity bytes with the relative weights and then adding the results over all channels.
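The Total Velocity computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the event representation, the weight values (taken from the paper's Flute = 2, Trumpet = 6 example), and the threshold value are assumptions.

```python
from collections import defaultdict

# Hypothetical relative weights, following the paper's Figure 2 example.
WEIGHTS = {"flute": 2, "trumpet": 6}

def presumed_peak_positions(note_ons, weights, threshold):
    """note_ons: iterable of (score_time, instrument, velocity) Note-On events.

    Returns the score times T whose Total Velocity (Velocity byte x relative
    weight, summed over all channels) reaches the threshold; a unit pressure
    peak ps(T) would be put at each returned position.
    """
    total = defaultdict(int)
    for T, inst, vel in note_ons:
        total[T] += vel * weights[inst]  # Velocity x relative weight, added over channels
    return sorted(T for T, v in total.items() if v >= threshold)
```

For instance, a flute and a trumpet both playing at Velocity 64 contribute 64 x 2 + 64 x 6 = 512, which passes a threshold of 400, while a lone flute note (128) does not.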

[Figure 1: Process of extracting the acoustic pressure peaks — orchestral performance sound, rectified sound, low-pass filtering, acoustic pressure peaks (time axes in seconds)]

We specify the positions on the score-time T (i.e. the time axis of the status times in the standard MIDI file) where large peaks are expected, by applying a peak-finding technique with a threshold to this Total Velocity. We then put a unit pressure peak at each of these positions. After these processes, we obtain the desired presumed pressure peaks ps(T).

[Figure 2: Process of getting the presumed pressure peaks — e.g. Flute (relative weight 2) at Velocity 64 contributes 128, Trumpet (relative weight 6) at Velocity 64 contributes 384; the Velocity x weight of the other instruments is added as well, and the resulting Total Velocity (e.g. 512) is compared against the threshold]

2.2 Definition of the matching segments

To define the matching segments on the score-time T, we set two parameters for the n-th segment: the length of the segment, L = Tend(n) − Tbegin(n), and the advance of the segment, LA = Tbegin(n) − Tbegin(n−1), as shown in Fig. 3. The parameters (L, LA) are kept constant throughout all segments.

[Figure 3: Definition of the matching segments — Tbegin(1), Tend(1) delimit segment 1; Tbegin(2) starts segment 2; LA is the advance between segments]

We divide the segment into N intervals by introducing a resolution ΔT of the score-time; hence the length of the segment becomes L = NΔT. By dividing T into widths of ΔT, we obtain the discrete times Ti = iΔT + Tbegin(n) (i = 0, ..., N) on the n-th segment and ps(Ti) as the values of the presumed pressure peaks. This ps(Ti) constitutes an (N + 1)-dimensional vector.

3 Linear stretch matching method

The score alignment of a performance is, in general, non-linear. To realize this non-linearity, DP matching is frequently used. In our case of score following, however, we match the score with the performance sound on relatively short segments, and on such short segments the matching can be regarded as linear. We can therefore use a linear matching method instead of DP (Müller, Kurth, and Röder 2003), because in DP matching a partial stretch (i.e. the performance tempo changes drastically) occurs frequently even on short segments. This
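The segment definition of Section 2.2 can be sketched as follows: a fixed length L = NΔT, a fixed advance LA, and an (N + 1)-dimensional template sampled on each segment. The unit-peak representation of ps, and starting the first segment at Tbegin(0) = 0, are assumptions made for illustration.

```python
import numpy as np

def segment_templates(peak_positions, L_A, dT, N, n_segments):
    """Return, for each segment n, the begin time Tbegin(n) = n * L_A and the
    (N+1)-vector ps(Ti) with Ti = i*dT + Tbegin(n), i = 0..N (unit peaks)."""
    # Snap presumed-peak positions to the dT grid of the score-time.
    grid_peaks = set(round(p / dT) for p in peak_positions)
    templates = []
    for n in range(n_segments):
        T_begin = n * L_A
        base = round(T_begin / dT)
        ps = np.array([1.0 if base + i in grid_peaks else 0.0
                       for i in range(N + 1)])
        templates.append((T_begin, ps))
    return templates
```

Note that consecutive segments overlap whenever LA < L, which is what lets the matching results of one segment seed the next.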

partial stretch occurs under the influence of neighboring large peaks (Fig. 4). We think that this partial stretch can be avoided by using a linear stretch matching. So we originate a linear stretch matching method based on template matching. We regard the presumed pressure peak vector ps(Ti) as the standard pattern (the template). We specify a segment of length l = tend − tbegin + Δl on the current performance time, where tend and tbegin are the estimated current times corresponding to Tend and Tbegin, respectively, before the matching, and Δl is a margin on the current time. Half of this margin is put on each side of l so that tend and tbegin are surely included in l for any matching result. As input patterns of the template matching, we transform pa(t) on the segment l into 2M + 1 patterns using different performance tempos V(m) (difference of score-time / difference of performance time), as:

V(m) = V × (maximum stretch rate)^(m/M)    (1)

where V is the accumulated mean tempo and m is a stretch parameter, an integer from −M to M. The correspondence between the performance time ti and the score-time Ti is given by:

ti = (Ti − Tbase) / V(m) + tbase(m, k)    (2)

where Tbase is defined as the begin time Tbegin(n) of the n-th segment and tbase(m, k) is the current time corresponding to Tbase. The optimal value of this parameter tbase(m, k) is given as the matching result. The matching distance is defined as:

D(m, k) = Σ_{i=0}^{N} { pa(ti) − a·ps(Ti) }²    (3)

where pa(ti) is the acoustic pressure peak value over the segment l. The dimension N + 1 of the pressure vector ps(Ti) is identified with the dimension of the characteristic space in the template matching, and a is a normalization factor between pa(ti) and ps(Ti) on the n-th segment. Substituting Equation 2 into Equation 3, we obtain Equation 4:

D(m, k) = Σ_{i=0}^{N} { pa( (Ti − Tbase) / V(m) + tbase(m, k) ) − a·ps(Ti) }²    (4)

The template matching is executed by shifting the template ps(Ti) (length L) over l × V(m). The shifting step k of the matching is an integer from 0 to kmax(m), where kmax(m) is the largest integer not exceeding { l × V(m) − L } / ΔT. We choose the length Δl of the margin long enough to make kmax(m) considerably large for any stretch parameter m.

[Figure 4: Example of mismatching by partial stretch of nonlinear stretch matching. (a) Mismatching by DP: the partial stretch occurs under the influence of the next peak, and the time value of 2:1:1 is not kept. (b) Correct matching by the linear stretch method: the time value of 2:1:1 is kept.]
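The search described by Equations 1–4 can be sketched as a brute-force minimization over the 2M + 1 tempos and the shift steps. This is an illustrative sketch, not the authors' implementation: the grid size M, the maximum stretch rate, and the representation of pa as an interpolated callable are assumptions.

```python
import numpy as np

def linear_stretch_match(pa, ps, dT, V_mean, t_begin, seg_len,
                         M=4, max_stretch=1.3, a=1.0):
    """Minimize D(m, k) over the 2M+1 tempos V(m) and the shift steps k.

    pa: callable returning acoustic pressure peak values at performance times.
    ps: (N+1,) template of presumed peaks at Ti = Tbase + i*dT.
    Returns (D_min, V(m_opt), t_base_opt).
    """
    N = len(ps) - 1
    L = N * dT                                    # template length in score-time
    best = (np.inf, None, None)
    for m in range(-M, M + 1):
        V = V_mean * max_stretch ** (m / M)       # Eq. 1: tempo grid
        k_max = int((seg_len * V - L) / dT)       # largest shift inside the segment
        for k in range(max(k_max, 0) + 1):
            t_base = t_begin + k * dT / V         # candidate current time for Tbase
            ti = np.arange(N + 1) * dT / V + t_base   # Eq. 2, with Ti - Tbase = i*dT
            D = np.sum((pa(ti) - a * ps) ** 2)    # Eq. 3/4: matching distance
            if D < best[0]:
                best = (D, V, t_base)
    return best
```

The double minimization, first over k for each m and then over m, corresponds to choosing kopt(m) and mopt; because V(m) is held constant inside each candidate alignment, no partial stretch can occur within a segment.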

The matching result is decided as follows. First, for each stretch parameter m, the optimal value kopt(m) of the shifting step k is decided by kopt(m) = argmin_k D(m, k). Next, the optimal value mopt of the stretch parameter m is decided by mopt = argmin_m D(m, kopt(m)). Then we obtain the optimal values V(mopt) and tbase(mopt, kopt(mopt)) as the matching result.

4 Real-time score following system

In our system, we need to recognize the performing position on the score instantaneously, at any instant. However, the matching process needs a performed segment around the current time, so we cannot recognize the performing position instantaneously from the matching results directly. Therefore, we propose a system in which the recognition is executed instantaneously by the subsystem given in Section 4.1, whose inner states are renewed with the finite delay needed by a matching process. The time needed for the matching process is too short to affect the real-time character of our system.

4.1 Subsystem determining the performing position on the score

When any current time tcur of the performance is input into this subsystem, the corresponding position Tcur on the score is output instantaneously. This subsystem uses a linear estimation:

Tcur = V × (tcur − tbase) + Tbase    (5)

where tcur is the current time (at the performance) input into this subsystem, Tcur is the output corresponding position on the score, and V, tbase, Tbase are the inner states of this subsystem, as given in Section 3. These inner states are obtained as the results of the matching finished just before the currently running one.

4.2 Subsystem of the matching process

The following three data are input into this subsystem:

* The acoustic pressure peaks pa(t) described in Section 2 (Fig. 1)
* The presumed pressure peaks ps(T) described in Section 2.1 (Fig. 2)
* The matching segments defined in Section 2.2 (Fig. 3)

The next inner states are then output from it. This subsystem uses the linear stretch matching method described in Section 3, and we obtain the values V(mopt), Tbase and tbase(mopt, kopt(mopt)) as the matching results. We then update the inner states of the subsystem described in Section 4.1 by using these three values. The two subsystems described in this and the previous section are combined to form a real-time score following system.

Conclusion

In this research, we proposed a real-time score following system for orchestral music. In order to make an orchestral score comparable with the corresponding performance sound, we first extracted acoustic pressure peaks from the performance sound and then devised a new method for presuming pressure peaks from the score. We showed that our linear stretch matching method is suitable for matching these two kinds of pressure peaks. Based on these methods, we implemented a real-time score following system for orchestral music. Our system is to be used only in those portions of orchestral music where a system using the existing pitch-detection method cannot be applied. We are currently evaluating this system; as far as our results so far show (e.g. Fig. 4), it is robust enough to replace DP.

Acknowledgment

This research is sponsored by the assistance grant of Hoso Bunka Foundation, Inc. (2004/2005 Grant Cycle).

References

Goto, M. (2001). An audio-based real-time beat tracking system for music with or without drum-sounds. Journal of New Music Research 30(2), 159-171.

Grubb, L. and R. B. Dannenberg (1997). A stochastic method of tracking a vocal performer. In Proceedings of the International Computer Music Conference (ICMC), pp. 301-308.

Müller, M., F. Kurth, and T. Röder (2003). Towards an efficient algorithm for automatic score-to-audio synchronization. In Proceedings of the International Conference on Music Information Retrieval (ISMIR).

Orio, N., S. Lemouton, D. Schwarz, and N. Schnell (2003). Score following: State of the art and new developments. In Proceedings of the International Conference on New Interfaces for Musical Expression (NIME), Montreal, Canada.

Shiba, S., A. Inoue, J. Hiraishi, H. Takaku, H. Shigeno, and K. Okada (2003). Score-based camerawork planning for orchestra shooting (in Japanese). In Proceedings of the Symposium on Multimedia, Distributed, Cooperative and Mobile Systems (DICOMO), Volume 2003, pp. 821-824.