Page  00000383

Spectral Modification and MIDI Control for Improved Quality of Violin Sound Synthesis

P. Zolfaghari, H. Banno, H. Ban, K. Igarashi, K. Takeda, F. Itakura
Itakura Laboratory, Graduate School of Engineering, Nagoya University
email: ac. jp

Abstract

In this paper, some of the expressive and quality aspects of the violin are studied. Using the fundamental frequency and a distortion measure, simple MIDI control is implemented to improve the expressive components in synthesising the violin sound. The player's rendition of a piece on a Stradivarius and a synthesised version of the same piece are used as input to the system. In a second study, the silent violin and the same Stradivarius are spectrally compared and, based on an inverse filtering scheme, the sound of the silent violin is improved to be perceptually closer to that of the Stradivarius. Subjective evaluations are reported, with encouraging results.

1 Introduction

In the violin, the source of vibrational energy consists of one or more stretched strings, which are set into self-sustained oscillation by a bow drawn across the string. In this paper, a player refers to an individual who controls the vibration of the violin strings using the bow and fingers. Compared to other instruments, violins have a wide dynamic and playability range. A violin is often chosen by the player based on the quality of its sound and its ease of playability. Researchers have long been working on determining such qualities of violins based on discrete methods (Caldersmith 1985). Our aim in this study is not to indicate which violins are preferred by players but to emulate the player's expressiveness over a piece as well as the spectral properties of certain violins.

Electronic music is often perceived to lack the expressiveness that a player can induce by performing the same piece.
There are a number of factors which contribute to this, including the gestures of the performer and the spectral properties of the instrument being played. Acoustic instruments transduce the movements of a performer into sound. In a natural sound we observe a great amount of spectro-temporal variation, which is difficult to capture and model. There are various temporal scales, such as the short-term correlation related to timbral properties (such as formants), correlation due to the pitch period, slower modulations such as vibrato, expressive inflections, and transitions between different notes (Dubnov and Rodet 1998).

In this ongoing study, the spectral characteristics of a synthesised violin and a silent violin (YAMAHA SV-100K) are compared to those of a Stradivarius violin played by a professional violinist. Based on these observations, the player's expressive style is studied and, using an inverse filtering scheme, the silent violin sound is enhanced to be perceptually closer to that of the Stradivarius.

2 Recording & Technical Aspects

A professional violinist was recorded in an anechoic room. In total, eight microphones and a binaural dummy head were used in this recording session. Two violins were used in the recording: a Stradivarius (1726) and a silent violin (YAMAHA SV-100K). The sampling frequency was 44.1 kHz. In this study, only the sound recorded by the microphone at a distance of approximately 1 m from the player was used.

3 Timing & Expressive Aspects

Players express their rendition of a piece through timing and technical variations such as tremolo and pitch bending. In this section, the timing and some technical aspects of the player's performance are compared to those of the same piece of music synthesised using a synthesiser (YAMAHA MOTIF6). A block diagram of the method used for this modification is shown in Figure 1.

Figure 1: Block diagram for Expressive Control System. (Blocks: Musical Score, Player, Synthesiser, F0, DTW, Itakura-Saito Measure, MIDI Manipulation, New Signal.)

The player and the synthesiser are presented with the same musical score. Each signal is passed through the fundamental frequency extraction scheme briefly described in Section 3.1. After dynamic time warping and the application of the Itakura-Saito distance measure (Itakura and Saito 1970) to the played signal and the synthesised signal, the MIDI data is modified and a new signal is synthesised which has expressive properties closer to those of the player's signal.

3.1 Fundamental Frequency - F0

The instantaneous-frequency-based F0 extraction method used in this paper was designed to produce continuous, high-resolution F0 trajectories suitable for high-quality speech modification (Kawahara et al. 1999). The method assumes that the signal has the following nearly harmonic structure:

\[ x(t) = \sum_{k} a_k(t) \cos\left( \int_{t_0}^{t} k \bigl( \omega(\tau) + \omega_k(\tau) \bigr)\, d\tau + \phi_k \right) \tag{1} \]

where $a_k(t)$ represents a slowly changing instantaneous amplitude and $\omega_k(\tau)$ represents a slowly changing perturbation of the $k$-th harmonic component. In this representation, F0 is the instantaneous frequency of the fundamental component, where $k = 1$. The method also uses the instantaneous frequencies of other harmonic components to refine the F0 estimates. For further details of this algorithm, see (Kawahara et al. 1999). Figure 2 shows the extracted F0 for a section of Schubert's violin Sonata Gmin 4th. Stochastic regions, where there is no fundamental, are indicated by the absence of a pitch contour.

Figure 2: Fundamental frequency of the Stradivarius (player) and synthesised signals with a frame shift of 10 ms. (Legend: Player, MIDI with MOTIF; horizontal axis: Time (msec).)

3.2 Dynamic Time Warping

Since the synthesised signal and the player's signal may not be time-aligned because of the player's expressive modifications, dynamic time warping (DTW) is used to find the matching frames of the two signals. Analysis is performed with a Hamming window of length 20 ms and a frame shift of 10 ms. Warping is performed using the fundamental frequency and the cepstral distortion with a cepstral order of 10; the optimum path is found using the cepstral distortion. The mean F0 error in DTW is computed after matching frames along the optimum path. Over 546 frames, the mean F0 error (normalised by the number of frames) was 1.28 Hz.

3.3 Itakura-Saito Measure

The log spectral difference is the basis of a number of distortion measures. The distortion measure originally proposed by Itakura and Saito in their formulation of linear prediction as an approximate maximum likelihood estimation is given in (Itakura and Saito 1970). After obtaining the correspondence between frames, the Itakura-Saito distance is used to obtain a measure of the difference between the player's signal ($f$) and the synthesised signal ($g$) as

\[ d(f, g) = \frac{\sigma^{(f)}}{\sigma^{(g)}} \sum_{j=-p}^{p} \rho^{(g)}(j)\, r^{(f)}(j) + \log \frac{\sigma^{(g)}}{\sigma^{(f)}} - 1 \tag{2} \]

where

\[ \rho(j) = \sum_{i=0}^{p-|j|} a_i\, a_{i+|j|}, \qquad a_0 = 1, \tag{3} \]

and $a_i$ are the linear prediction (LP) coefficients, $\sigma^{(\cdot)}$ is the spectral power, $r^{(\cdot)}$ is the autocorrelation normalised by the LP residual power, $i = 1, \ldots, p$, $j = -p, \ldots, p$, and $p$ is the LP order. In these experiments, $p$ was chosen to be 40. Minimising the Itakura-Saito distortion over the predictor coefficients $a_i$ is equivalent to minimising the residual energy. Figure 3(c) shows the Itakura-Saito measure between the played (Figure 3(a)) and the synthesised (Figure 3(b)) versions of a section of Schubert's violin Sonata containing transient and staccato regions with differing intensities due to the smooth note synthesis in the synthesiser. Compared to other distortion measures, the Itakura-Saito measure gave the best results in the violin case.

Figure 3: A section of Schubert's violin Sonata played on a Stradivarius (a), played through a synthesiser (b), the Itakura-Saito measure (c), and played through the synthesiser after modification (d). (Horizontal axis: Time (msec).)

3.4 MIDI Data Editing

The MIDI data were modified based on the DTW and Itakura-Saito distance results. The resynthesised sound is better aligned with the player's expressiveness over the piece.
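
As a rough illustration of the two analysis steps on which this editing relies, the sketch below aligns two frame sequences with a basic DTW (Section 3.2) and evaluates the Itakura-Saito distortion between two power spectra (Section 3.3). It is a minimal sketch under stated assumptions, not the implementation used in this study: the function names `dtw_path` and `itakura_saito` are hypothetical, the local cost is the absolute F0 difference only (the study also uses cepstral distortion), the step pattern is assumed, and the distortion is computed directly from power spectra rather than from LP coefficients. The F0 values are made up for the example.

```python
import math


def dtw_path(a, b, cost):
    """Align sequences a and b; return (accumulated cost, list of (i, j) pairs).

    Assumption: symmetric step pattern (diagonal, vertical, horizontal) with
    no slope constraint; the paper does not state which pattern was used.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost(a[i - 1], b[j - 1]) + best_prev
    # Backtrack from the last frame pair to recover the optimum path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return acc[n][m], path


def itakura_saito(p_ref, p_test):
    """Mean Itakura-Saito distortion between two positive power spectra."""
    total = 0.0
    for pr, pt in zip(p_ref, p_test):
        ratio = pr / pt
        total += ratio - math.log(ratio) - 1.0
    return total / len(p_ref)


# Illustrative F0 contours in Hz, one value per 10 ms frame (made-up values).
player_f0 = [440.0, 440.0, 442.0, 494.0, 494.0, 523.0]
synth_f0 = [440.0, 441.0, 494.0, 523.0, 523.0]
total, path = dtw_path(player_f0, synth_f0, lambda x, y: abs(x - y))
mean_f0_error = total / len(path)  # F0 error normalised by matched frame pairs
```

The distortion is zero for identical spectra and is not symmetric in its arguments, so the order in which the played and synthesised frames are passed matters.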
Using the Itakura-Saito measure above a certain threshold, the volume (velocity) of the notes was controlled by simple functions, giving a better representation of some expressive aspects of the player, such as in transient and staccato regions. Figure 3(d) shows the resulting resynthesised waveform.

3.5 Subjective Evaluation Test

Since an evaluation task over a long piece of music is difficult, two sections of Schubert's violin Sonata were used as stimuli. Subjective evaluation was carried out with a forced-choice ABX scheme using the player's rendition of the piece (X), the original synthesised signal (A), and the edited synthesised signal (B). Nine subjects took part. The question asked was, "Is X perceptually closer to A or B in terms of note shape and timing?". For the first section, which contained long notes and few staccato regions, a 67% preference was obtained for the edited signal (B). For the second section, containing more transient and staccato regions, subjects gave an 89% preference for the edited signal.

4 Spectral Aspects

In this section, we develop an inverse-filter-based method for generating a Stradivarius sound from a silent violin. The same player plays both instruments, so there are few expressive differences to consider. The section first describes the long-term spectral difference between the instruments. This is then used to enhance the silent violin sound to be perceptually closer to that of the Stradivarius.

4.1 Melody Case

We compared the long-term spectrum of the silent violin with that of the Stradivarius. The analysis conditions for computing the long-term spectra are shown in Table 1. Figure 4 illustrates the long-term spectra of the silent violin and the Stradivarius for the melody case. The long-term spectra are smoothed by cepstral truncation up to the thirtieth order. These spectra are computed from the signals of a melody passage with a duration of around thirty seconds. It is evident that the spectral power of the Stradivarius is greater than that of the silent violin in the frequency region from 5 kHz to 15 kHz. This means that the harmonic intensity in this region is higher in the Stradivarius.

Table 1: Analysis Conditions

Sampling Frequency    44.1 kHz
Analysis Window       Hamming
Frame Length          23 ms
Frame Shift           11 ms
FFT Points            4096 points
Cepstrum Order        30

4.2 Single Tone Case

To investigate the differences between the long-term spectra of the violins in more detail, we also observed the long-term spectra of single tones with various fundamental frequencies. Several short segments with relatively constant F0 were extracted from the recorded sound. The durations of the segments ranged from approximately one to three seconds. The long-term spectrum was then computed for each extracted segment. The analysis conditions are the same as in Table 1. Each line in the plots of Figure 5 shows the long-term spectrum of the Stradivarius or the silent violin for a different F0. As can be seen, the spectrum of the Stradivarius differs greatly from note to note. The long-term spectrum of the silent violin barely changes in the lower frequency region (below 5 kHz) and simply decreases in the mid-frequency region, while that of the Stradivarius is more complicated and has stronger power than that of the silent violin. This, together with the formant structure, is one of the main factors in the rich sound obtained from this Stradivarius. In these figures, it is also observed that the power of both spectra in the higher frequency region (above 5 kHz) increases as F0 increases.

Figure 5: The long-term spectra of the Stradivarius (top) and the silent violin (bottom) for different F0s. (Legend: 196 Hz, 220 Hz, 380 Hz, 440 Hz, 586 Hz, 660 Hz, 784 Hz, 880 Hz; horizontal axis: Frequency [kHz].)
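
The long-term spectra compared in Sections 4.1 and 4.2 can be illustrated with a small sketch: frames are Hamming-windowed, their power spectra are averaged, and the log spectrum is smoothed by keeping only the low-order cepstral coefficients. This is a minimal sketch under assumptions, not the analysis code of this study: the function names are hypothetical, a short pure-Python radix-2 FFT stands in for an optimised FFT library, and the toy example uses much smaller frame and FFT sizes than Table 1 so that it runs quickly.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def ifft(spec):
    """Inverse FFT computed with the conjugate trick."""
    n = len(spec)
    conj = fft([complex(s).conjugate() for s in spec])
    return [v.conjugate() / n for v in conj]


def long_term_spectrum_db(x, frame_len, shift, n_fft, order):
    """Cepstrally smoothed long-term power spectrum in dB, bins 0..n_fft/2."""
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    power = [0.0] * n_fft
    n_frames = 0
    for start in range(0, len(x) - frame_len + 1, shift):
        frame = [x[start + i] * window[i] for i in range(frame_len)]
        frame += [0.0] * (n_fft - frame_len)  # zero-pad to the FFT size
        for k, v in enumerate(fft(frame)):
            power[k] += abs(v) ** 2
        n_frames += 1
    if n_frames == 0:
        raise ValueError("signal is shorter than one frame")
    # A small floor avoids log(0) in empty bins (an assumption of this sketch).
    log_spec = [math.log(p / n_frames + 1e-12) for p in power]
    cepstrum = [c.real for c in ifft(log_spec)]
    # Cepstral truncation: keep quefrencies 0..order and their mirror images.
    liftered = [c if (q <= order or q >= n_fft - order) else 0.0
                for q, c in enumerate(cepstrum)]
    smooth = [v.real for v in fft(liftered)]
    return [10.0 * s / math.log(10.0) for s in smooth[: n_fft // 2 + 1]]


# Toy example: a sinusoid centred on bin 8 of a 64-point transform.
signal = [math.sin(2.0 * math.pi * 8.0 * n / 64.0) for n in range(256)]
spectrum_db = long_term_spectrum_db(signal, frame_len=32, shift=16,
                                    n_fft=64, order=10)
```

With the conditions of Table 1, the corresponding call would use a frame of about 23 ms, a shift of about 11 ms, 4096 FFT points and a cepstrum order of 30, for which an optimised FFT routine would be the practical choice.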

Figure 4: The long-term spectra of the Stradivarius (solid line) and the silent violin (dotted line) in the melody case. (Horizontal axis: Frequency [kHz].)

4.3 Enhancement Method

Based on the above observations of the long-term spectra, we introduce a method to enhance the quality of the silent violin based on an inverse-filtering scheme. The procedure employed is as follows:

1. Frame-by-frame analysis is performed over the signal $x(n)$. The windowed signal $x_i(n)$ of the $i$-th frame is obtained using a rectangular window of duration $L$.
2. Apply the DFT to the windowed signal $x_i(n)$.

3. Apply the inverse filter to the obtained spectrum $X_i(k)$, given by

\[ Y_i(k) = X_i(k)\, \frac{|S_s(k)|}{|S_e(k)|} \tag{4} \]

where $Y_i$ is the output spectrum, and $|S_s|$ and $|S_e|$ are the amplitude spectra of the Stradivarius and the silent violin, respectively. The spectrum $Y_i$ generated by this process is spectrally closer to that of the Stradivarius.
4. Apply the inverse DFT to the generated spectrum $Y_i$.
5. Overlap-add the generated signals.

Based on the observation of the long-term spectra, the fidelity of the generated sound should increase when the inverse filter is computed according to the F0 of the current frame. However, to realise this in the melody case, F0 extraction for the silent violin and interpolation of the inverse filters are required. This has not yet been implemented.

Figure 6 shows the spectra of the silent violin (input), the Stradivarius (target) and the generated signal, which is created from the silent violin using the long-term spectrum of the Stradivarius. The spectrum of the generated signal is similar to that of the Stradivarius.

Figure 6: Spectra of the silent violin (input), the Stradivarius (target), and the generated signal by the proposed method, respectively.

4.4 Subjective Evaluation

Pairs of the Stradivarius and the generated signal were presented to listeners in random order. The stimuli consisted of two types of melody and eight types of single tone with different F0s. Listeners judged the degree of similarity between the two signals in each pair using five categories ranging from "The stimuli are exactly the same" (5) to "The stimuli are completely different" (1). Seven listeners with normal hearing participated in this experiment. Each pair was judged four times in total by each listener. The results obtained are shown in Table 2.

Table 2: Results of the subjective test.

Melody         Silent Violin                  2.3
               Generated Sound                3.5
Single Tone    Silent Violin                  2.3
               Generated Sound                (value missing)
               Generated Sound (F0 Depend)    3.7

It is found that the similarity scores of the generated sound are more than one point higher than those of the silent violin. In the single tone case, the similarity score of the F0-dependent generated sound is not much higher than that of the F0-independent generated sound, because the stimuli are quite short (approximately one to three seconds) and the decision in this case is much more difficult than in the melody case for many listeners.

5 Conclusions

In this ongoing study, we demonstrated that, using simple MIDI control, a violin sound synthesised by a synthesiser can be perceptually improved in representing the expressive aspects of the player. There are various MIDI control devices on the market, but our study aims to examine why players make such expressive tones and structures to create their own style. Automation of the MIDI control system devised here is an essential part of this study. Furthermore, we have shown that with simple inverse filtering, it is possible to bring the silent violin sound closer to that of a Stradivarius. A combination of these two strategies, as well as further studies on expressive violin synthesis and F0-based synthesis, is left for future work.

Acknowledgments

We would like to express our gratitude to Ms. Hideko Udagawa for performing the pieces used in this research.

References

Caldersmith, G. (1985). The violin quality debate: Subjective and objective parameters. The Catgut Acoustical Society 43, 6-12.
Dubnov, S. and X. Rodet (1998). Study of spectro-temporal parameters in musical performance, with applications for expressive instrument synthesis. In Proceedings IEEE International Conference on Systems, Man, and Cybernetics.
Itakura, F. and S. Saito (1970). A statistical method for estimation of speech spectral density and formant frequencies. Electronics and Communications in Japan 53A, 36-43.
Kawahara, H., H. Katayose, A. de Cheveigné, and R. Patterson (1999). Fixed points analysis of frequency to instantaneous frequency mapping for accurate estimation of f0 and periodicity. In Proceedings of EUROSPEECH'99, Volume 6, pp. 2781-2784.
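
As an illustration of the enhancement procedure of Section 4.3 (steps 1 to 5), the sketch below filters a signal frame by frame with the gain $|S_s(k)| / |S_e(k)|$ and overlap-adds the result. It is a minimal sketch under assumptions, not the implementation used in this study: the function name `enhance`, the frame length, the hop size and the averaging of overlapping frames are all hypothetical choices, the amplitude spectra in the example are made up, and a short pure-Python radix-2 FFT stands in for an optimised DFT routine.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def ifft(spec):
    """Inverse FFT computed with the conjugate trick."""
    n = len(spec)
    conj = fft([complex(s).conjugate() for s in spec])
    return [v.conjugate() / n for v in conj]


def enhance(x, target_amp, source_amp, frame_len, hop, eps=1e-9):
    """Rectangular framing, DFT, spectral gain, inverse DFT, overlap-add.

    target_amp and source_amp play the roles of |S_s(k)| and |S_e(k)| sampled
    on the frame_len-point DFT grid (frame_len must be a power of two here).
    """
    gain = [t / (s + eps) for t, s in zip(target_amp, source_amp)]
    y = [0.0] * len(x)
    count = [0] * len(x)
    for start in range(0, len(x) - frame_len + 1, hop):
        spec = fft(x[start:start + frame_len])                # steps 1 and 2
        filtered = ifft([v * g for v, g in zip(spec, gain)])  # steps 3 and 4
        for i, v in enumerate(filtered):                      # step 5
            y[start + i] += v.real
            count[start + i] += 1
    # Assumption: overlapping rectangular frames are averaged, so that a unit
    # gain returns the input; samples not covered by a full frame stay at 0.
    return [v / c if c else 0.0 for v, c in zip(y, count)]


# Toy example with made-up amplitude spectra: the upper bins are doubled
# (about 6 dB). The target is symmetric about bin frame_len / 2, as the
# amplitude spectrum of a real signal must be.
frame_len = 8
source = [1.0] * frame_len
target = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 1.0, 1.0]
x = [float((7 * n) % 5) - 2.0 for n in range(16)]
y = enhance(x, target, source, frame_len, hop=4)
```

Multiplying each frame's DFT by a gain corresponds to circular filtering within the frame, so a practical version would likely need longer frames and some care at frame boundaries.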