DSP BASED RELIABLE PITCH-TO-MIDI CONVERTER BY HARMONIC MATCHING

Pablo Fernández-Cid and Francisco J. Casajús-Quirós
ETSI Telecomunicación-UPM, Ciudad Universitaria, 28040 Madrid, SPAIN
E-mail:

ABSTRACT

A device, known as a pitch-to-MIDI converter, capable of reliably estimating the fundamental frequency of sound signals is described. The information is transformed into musical data and sent via standard MIDI (Musical Instrument Digital Interface) to any other MIDI-controlled device. A harmonic matching algorithm is used for pitch estimation, and accuracy is better than 1/1200 octave. It has been implemented in real time on a single AT&T DSP16C processor with practically no additional hardware.

1. INTRODUCTION

The problem of estimating the fundamental frequency of acoustic signals like speech and music is an old one. It is very important in speech applications like low bit-rate coding and speech synthesis. This paper will however focus on the application of pitch estimation to music processing. An automatic pitch estimator, capable of extracting the fundamental frequency from singing voices or acoustic musical instruments, would have many applications. Among these: systems for computer-assisted ear training or singing teaching, analysis of microtonal non-Western music, automatic sound-to-musical-score transcription, etc. Another application is a pitch-to-MIDI converter, which translates the melody being played on an acoustic instrument into MIDI messages that can in turn be directed to any MIDI-controlled device. This application interests mainly performers of non-keyboard instruments, who could gain access to the possibilities of MIDI sound sources. It also opens a new approach to on-stage computer-player interaction: for example, the computer can compare a stored score of the piece with the real-time performance of the musician, playing along with him.
Pitch-to-MIDI converters are not new [CMJ, 1990], but pitch estimation is a difficult operation, and previously available commercial models tend to show a slow, unreliable response, which has kept musicians reluctant to adopt them. In this paper we present a pitch-to-MIDI converter whose main features are described in what follows. Pitch estimation is carried out by means of a new harmonic matching algorithm (some related work can be found in [Doval, 1993] and [Fernández-Cid, 1992]). This unaided algorithm is by itself capable of correctly estimating the fundamental frequency of practically any musical sound. There is also a user-selectable post-processing stage which attempts to map the fundamental frequency (a physical property of the waveform) into pitch (a perceptual variable). The post-processor is a fuzzy neural network. It deals with such events as time-overlapping notes, transients and very brief (impulse-like) notes, where the fundamental frequency is not well defined.

2. FUNCTIONAL DESCRIPTION

A simplified block diagram of the whole system appears in figure 1. A sound signal is sampled at a rate of 8000 samples/sec. That signal must have been previously low-pass filtered so as to ensure that its bandwidth is less than 4000 Hz. The sound samples are then divided into frames (about 30 ms long) for further processing. Each of these frames is the input to the parameter estimation block, which estimates both the fundamental frequency and the root-mean-square (rms) value of the signal segment under analysis.

ICMC Proceedings 1994, Audio Signal Processing
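The framing and rms stages just described can be sketched as follows. This is a minimal illustration in Python, not the DSP16C implementation; the 8 kHz rate, ~30 ms frames, and ~50% overlap come from the text, while the function names are our own:

```python
import numpy as np

def frame_signal(x, fs=8000, frame_ms=30, hop_ms=15):
    """Split a signal into overlapping frames (~50% overlap, as in the text)."""
    frame_len = int(fs * frame_ms / 1000)   # 240 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def frame_rms(frames):
    """Root-mean-square value of each frame: the intensity contour used
    later for note-boundary detection."""
    return np.sqrt(np.mean(frames.astype(float) ** 2, axis=1))
```

For a one-second 440 Hz sine at unit amplitude, `frame_rms` returns values close to 1/sqrt(2) for every frame, as expected for a sinusoid.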

An additional block operates on pitch and volume information in order to determine note boundaries.

Figure 1: Block diagram of the overall system

The stream of data about pitch, volume and note boundaries constitutes a decomposition of the sound waveform into musical notes. A final MIDI encoder block carries out the encoding of this information in a standard digital form which can be interpreted by any MIDI-compatible device.

3. PARAMETER ESTIMATION

According to the block diagram shown in figure 1, the system must estimate three parameters from the sound waveform: fundamental frequency, rms value and note boundaries. In what follows only the main features of the estimation process will be described. A comprehensive study of the methods we have used can be found in [Casajús-Quirós, 1994].

After sampling, the sound signal is divided into frames for further analysis. We have chosen 30 msec frames as they provide a good compromise between time and frequency resolution. The frame rate is asynchronous and depends on the processor performance. A frame is analyzed as soon as the previous analysis is finished. In practice, for a processor with 80 MHz clock, frame overlap is about 50%.

The energy of the signal inside the frame is used to estimate the current intensity of the sound source. This intensity contour is the main clue for selecting the note boundaries (although the pitch contour must also be taken into account: a sudden jump in pitch at a constant volume marks the beginning of a new note).

The estimation of the fundamental frequency is the core of the process. The sound of a musical instrument playing a given note is a (quasi) periodic signal. The attack noises and evolving timbre that characterize the onset of notes on acoustic instruments make it very difficult to discover the periodicity in the waveform until these initial phenomena have disappeared (this is the main reason for the large delay of many pitch-to-MIDI converters). This is not the case when the spectrum of the sound is used for the pitch extraction.

The spectrum of a periodic wave is built of a set of evenly spaced peaks (as a reflection of the fact that the instrument is vibrating in a set of harmonic modes). Those peaks are still evident in the presence of noise, and a timbre evolution can change the amplitude of the peaks but not their positions. The distance between these harmonic peaks is the fundamental frequency (the pitch). However, a simple peak distance measure is not enough: in acoustic sounds it is often the case that not all the harmonics are present (sometimes even the first, the fundamental, is missing), and the spectral shape (the relative amplitudes of the different modes) shows great variety (not only according to the instrument being considered, but from frame to frame on a given instrument). We have often found 90% of the energy concentrated in a single harmonic, while the remaining 10% is spread over various other harmonics.

Our algorithm (which falls within the framework of what are called 'harmonic matching' techniques) searches for the frequency that leads to the best representation of the original spectrum. By generating spectra with peaks positioned f0 Hz apart (f0, 2·f0, 3·f0, ...) and adjusting the height of each peak so as to obtain the best possible imitation of the original spectrum, different candidate frequencies can be compared. The value of f0 that best matches the original spectrum is the estimated fundamental frequency for the frame. By construction of these synthetic spectra, the matching of every single peak is guaranteed, and the absence of some peaks (even the one that corresponds
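The harmonic-matching search can be sketched as follows. This is a simplified illustration, not the authors' algorithm: we score each candidate f0 by the weighted spectral magnitude at its predicted harmonic positions, and the mild weight decay (a choice of ours) discourages sub-octave candidates, which hit the true peaks only at their even harmonic numbers. The paper's exact matching criterion is given in the cited references:

```python
import numpy as np

def estimate_f0(x, fs=8000, f_lo=70.0, f_hi=2600.0, n_harm=8):
    """Harmonic-matching f0 estimate over the 70-2600 Hz range the paper
    covers: score each candidate f0 by the spectral magnitude found at
    its predicted harmonic positions f0, 2*f0, 3*f0, ..."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)), n=4096))
    freqs = np.fft.rfftfreq(4096, 1.0 / fs)
    best_f0, best_score = f_lo, -1.0
    for f0 in np.arange(f_lo, f_hi, 1.0):   # coarse 1 Hz grid for illustration
        harmonics = f0 * np.arange(1, n_harm + 1)
        harmonics = harmonics[harmonics < fs / 2]
        # decaying weights: a sub-octave candidate f0/2 reaches the true
        # peaks only at even harmonic numbers, where its weights are smaller
        weights = 0.8 ** np.arange(1, len(harmonics) + 1)
        score = np.sum(weights * np.interp(harmonics, freqs, spec))
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0
```

Because every harmonic position is scored independently, a signal with a missing fundamental (energy only in harmonics 2 to 5, say) is still resolved to the correct f0, illustrating the robustness property claimed in the text.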

to the fundamental) does not cause any problem. The procedure is robust, source-independent, and can be made as accurate as desired. Those interested in an in-depth description of the algorithm can find it in [Fernández-Cid, 1993] and [Casajús-Quirós, 1994].

Remember that the sound signal was assumed to be periodic. Often this is not the case, for several reasons. One of these is that the frame under analysis may encompass a transition between two notes or the onset of one of them. In both cases the frame is not truly periodic, either because it contains two unrelated periodic signals (thus implying in some sense a g.c.d. periodicity) or because it also contains background noise. Moreover, even if the analysis frame has been taken on the stationary part of a note, it will contain a considerable amount of noise for many common instruments that generate noise along with periodic waveforms. Also, and especially for human voices and string instruments, the fundamental frequency can vary across the duration of the frame. In all those cases we are far from the ideal one. They can be treated by including improved procedures that take into account the possibility of simultaneous periodicities and extract the frequency modulation component from the waveform under analysis. Intelligent post-processing and a pitch tracker can also improve performance at the cost of increased processing delay (that is why post-processing should be optional). Details can be found in [Casajús-Quirós, 1994].

4. MIDI ENCODER

Once the pitch and volume information has been extracted from the analysis of the input audio signal, it must be encoded into messages built according to the MIDI specification. As soon as the note detection block senses the beginning of a new note, the MIDI encoder sends a NOTE ON message to turn on the closest note available in the equal-tempered scale, immediately followed by a PITCH BEND to finally match the estimated frequency.
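The NOTE ON plus PITCH BEND encoding can be sketched as follows. This is an illustration of the standard MIDI byte layout, not the converter's firmware; the +/-2 semitone bend range is the common receiver default, assumed here rather than stated in the paper:

```python
import math

def freq_to_midi_messages(freq, channel=0, velocity=100, bend_range=2.0):
    """Map a frequency to the nearest equal-tempered note plus a 14-bit
    pitch-bend offset (bend_range in semitones is an assumed receiver
    setting; +/-2 is the common MIDI default)."""
    semitones = 69 + 12 * math.log2(freq / 440.0)   # MIDI note number scale
    note = int(round(semitones))
    bend = int(round(8192 + (semitones - note) / bend_range * 8192))
    bend = max(0, min(16383, bend))                 # clamp to 14-bit range
    note_on = bytes([0x90 | channel, note, velocity])
    pitch_bend = bytes([0xE0 | channel, bend & 0x7F, bend >> 7])  # LSB, MSB
    return note_on, pitch_bend
```

For exactly 440 Hz this yields note 69 (A4) with a centered bend of 8192; a slightly sharp input such as 445 Hz keeps note 69 and raises the bend value above center.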
As long as the note lasts, dynamic changes in the pitch (for example as a result of a vibrato or a glissando) are translated into further PITCH BENDs. The intensity fluctuations are also dynamically followed, and both VOLUME CONTROLLERs and AFTERTOUCHs (in order to be able to dynamically change the timbre characteristics together with the intensity) are sent. The pitch and loudness estimates are translated into MIDI messages about 40 times per second, allowing a smooth recreation of fast vibratos and tremolos.

5. PERFORMANCE

Both signal acquisition and its analysis need time to be performed, so an inherent delay is present in any pitch extractor. The literature on pitch extractors (especially musical ones) is usually not very realistic when it comes to performance measurement. Most of the time, selected short steady-tone portions of sound are employed for error statistics, forgetting the effects of articulations, transitions between notes, pitch ornaments, etc. Also, delay measures are given for the pitch extractor alone, while error percentages are calculated after some kind of pitch tracking or post-processing, which always introduces additional delay. We will try to give more significant figures. The algorithm we employ keeps the final delay below 30-35 milliseconds (with no post-processing). The error rate is low enough without tracking for the pitch-to-MIDI converter to be useful, as musicians who have tested the prototype have confirmed. Also, errors tend to concentrate around note transitions, making it easier for the musicians themselves or the post-processing to act against them. Close miking is a must, as reverberation breaks up the one-note-at-a-time condition during transitions. The delay figure together with the asynchronously overlapping frames (no processor rest) yields a graceful transcription of pitch-modulation musical ornaments: the estimates are refreshed about 40 times each second.
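The stream of PITCH BENDs that tracks a vibrato, refreshed about 40 times per second as described, might be generated like this. All parameter values (vibrato rate, depth, bend range) are illustrative, not taken from the paper:

```python
import math

def vibrato_bend_stream(depth_semitones=0.5, rate_hz=6.0, duration_s=1.0,
                        updates_per_s=40, bend_range=2.0):
    """14-bit pitch-bend values for a sinusoidal vibrato, sampled at the
    ~40 estimates/second refresh rate reported in the text."""
    values = []
    for i in range(int(duration_s * updates_per_s)):
        t = i / updates_per_s
        offset = depth_semitones * math.sin(2 * math.pi * rate_hz * t)
        # 8192 is the bend center; full scale maps to +/-bend_range semitones
        values.append(int(round(8192 + offset / bend_range * 8192)))
    return values
```

At 40 updates per second, a 6 Hz vibrato gets more than six bend values per cycle, which is why the text describes fast vibratos as smoothly recreated.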
Interestingly enough, the delay is consistent from frame to frame, so it can be corrected in non-real-time applications. In previous pitch-to-MIDI converters, a much longer delay was found for low notes (even as large as one second), as many of them had to wait for a stationary time-domain wave pattern. Intense rapid vibratos or trills, as well as glissandos, are correctly translated to MIDI. These had been difficult tasks for previously available pitch-to-MIDI converters, as many of them needed to average the characteristics of the sound over a large time portion. We cover the range from 70 to 2600 Hz, where most soloist instruments fall (remember we are considering

only the fundamental frequency, not the spectrum width), and found no problems with pitch displacements as fast as one octave in 300 milliseconds. The pitch of each frame is found with better than 1 Hz precision. By no means do we favour the tempered scale in the pitch extraction, so microtonality and alternate tuning scales, as well as ornamental pitch artifacts, are addressed as well as the tempered scale is.

The following table presents the number of errors detected in the analysis of entire musical phrases lasting several seconds (not just sustained portions of single notes). Sources were recordings (with hall-type reverb), so better results are to be expected in more realistic situations.

Instrument   time (sec)   no. of notes   total errors   gross errors   final errors
flute        5.5          7              0              0              0
oboe         8.5          7              7              3              4
clarinet     6.4          7              0              0              0
clarinet     6.2          12             1              1              0
violin       11.3         10             14             10             4
viola        11.0         7              0              0              0
viola        9.2          10             2              0              2
cello        13.8         10             4              3              1
soprano      10.1         13             4              4              0
alto         11.4         13             1              1              0
tenor        11.1         13             6              6              0

(musical phrases taken from the EBU SQAM CD)

With other instruments, and (mainly) when excited directly from instruments (not selections from recordings), results were significantly improved. Also, many errors were properly corrected when post-processing techniques were added, as they tend to concentrate on note transitions and tend to be octave-lowered or octave-raised versions of the real pitch, or the g.c.d. of the two notes present in the transition.

Spectra with 90% of the energy in one harmonic are well resolved, and the inside-frame (no memory) pitch detection makes the algorithm suitable for rapidly changing spectral envelopes, such as those that appear on many musical instruments, particularly at the onset of notes. This allows the player more freedom to play with a variety of techniques. Other pitch-to-MIDI converters impose serious restrictions on the playing style, to avoid the production of a strongly resonant sound or a lack of energy in the fundamental and first harmonics.

As a final illustration, figure 2 shows the parameter output of the system for a sung melody. No post-processing is included. After removing the third-harmonic spikes (no delay involved), which can be used as transition marks, the MIDI output has been connected to a commercial score editor.

Figure 2: Estimated pitch and volume for singing voice (no post-processing)

6. REFERENCES

[CMJ, 1990] Several authors, papers on pitch detection, Computer Music Journal, vol. 14, no. 3, Fall 1990.
[Doval, 1993] Doval, B. et al., "Maximum Likelihood Harmonic Matching Fundamental Frequency Estimation and Tracking Using HMMs", Procs. of ICASSP '93, 1993.
[Fernández-Cid, 1992] Fernández-Cid, P., "Fundamental Frequency to MIDI Converter", Telecom. Eng. Thesis, ETSI Telecomunicación-UPM, 1992.
[Fernández-Cid, 1993] Fernández-Cid, P. and Casajús-Quirós, F.J., "Single-Chip, Real-Time, Harmonic Matching Pitch-to-MIDI Converter", Procs. of ICSPAT '93, pp. 284-292, 1993.
[Casajús-Quirós, 1994] Casajús-Quirós, F.J. and Fernández-Cid, P., "Real-Time Loose Harmonic Matching Fundamental Frequency Estimation for Musical Signals", Procs. of ICASSP '94, 1994.

solve the blind decomposition of concurrent sounds? We propose the necessary conditions for the blind decomposition problem in the next section.

3. Necessary conditions for sound segregation in the blind decomposition problem

The information that is common to each sound source is the cue for the blind decomposition of concurrent sounds. What kind of information about the sound is observed in the human auditory system? The human auditory system transforms the sound signals into signals which are associated with frequencies [Ghitza, 1994] and recognizes the power in each frequency and its variation with time. It can be said that sound source separation is equivalent to separation of the power spectrum. To simplify the problem, we will discuss this problem in the restricted case in which the mixed sound consists of two sound sources and contains no background noise. To solve the problem, we propose the following necessary conditions on the short-time power spectrum of each sound and its variation with time:

Condition (1): The spectrum of the concurrent sounds is represented by the sum of the spectra of the two sound sources.

Condition (2): The short-time power spectra of the two sound sources keep their shapes while their amplitudes change with time, over an adequately short term. This condition is expressed as follows:

  S_f(t,ω) = a(t)·S_f(ω)   (Eq. 3-1)

where S_f(t,ω) is the short-time power spectrum of the sound source F at time t, a(t) is the rate of variation, which is constant for all frequencies, and S_f(ω) stands for the form of the power spectrum characteristic of sound source F.

Condition (3): There exists a frequency where one of the power spectra is zero and the other is not zero.

Under these necessary conditions, the observed short-time spectrum S_h(t,ω) can be written as follows:

  S_h(t,ω) = a(t)·S_f(ω) + b(t)·S_g(ω)   (Eq. 3-2)

  ∃ω_0 such that {S_f(ω_0) ≠ 0 and S_g(ω_0) = 0} or {S_g(ω_0) ≠ 0 and S_f(ω_0) = 0}   (Eq. 3-3)

where S_f(ω) and S_g(ω) are the power spectra of the sound sources F and G, a(t) and b(t) are the change rates of the power of the sound sources at time t, and ω_0 is the frequency that meets Condition (3). It should be noted that we do not need to know the frequency ω_0 in advance. S_f(ω), S_g(ω), a(t), b(t), and ω_0 are unknown in advance, and we can only observe S_h(t,ω). The task we must complete is to estimate S_f(ω) and S_g(ω) by observing only S_h(t,ω) under Conditions (1), (2), and (3).

Before we construct a method for sound segregation on the basis of these conditions, we should investigate whether these conditions hold true for all sound signals. To test Condition (1), we propose the following model of the mixed sound, in which the mixed sound is represented by the sum of two sound signals:

  h(t) = f(t) + g(t)   (Eq. 3-4)

where h(t) represents the amplitude of the mixed sound at time t, and f(t) and g(t) represent the amplitudes of the sound sources F and G. Here, H(ω), F(ω), and G(ω) stand for the Fourier transforms of h(t), f(t), and g(t), respectively. The power spectrum of the mixed sound S_h(ω) is represented as follows:

  S_h(ω) = |H(ω)|^2 = H(ω)·H*(ω) = (F(ω) + G(ω))·(F*(ω) + G*(ω))   (Eq. 3-5)

The operation * takes the complex conjugate. Equation 3-5 can be written as follows:

  S_h(ω) = S_f(ω) + S_g(ω) + F(ω)·G*(ω) + G(ω)·F*(ω)   (Eq. 3-6)

In Equation 3-6, the spectrum of the mixed sound is
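The identity in Eqs. 3-4 to 3-6 can be checked numerically. The following sketch uses random stand-in waveforms for the two sources and confirms that the power spectrum of the mixed sound equals the two source power spectra plus the interference terms:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(512)   # stand-in waveform for source F
g = rng.standard_normal(512)   # stand-in waveform for source G

# Fourier transforms of f(t), g(t), and the mix h(t) = f(t) + g(t)
F, G, H = np.fft.rfft(f), np.fft.rfft(g), np.fft.rfft(f + g)

S_h = np.abs(H) ** 2
S_f = np.abs(F) ** 2
S_g = np.abs(G) ** 2
# interference terms of Eq. 3-6: F(w)G*(w) + G(w)F*(w), which is real-valued
cross = (F * np.conj(G) + G * np.conj(F)).real

# Eq. 3-6: S_h(w) = S_f(w) + S_g(w) + F(w)G*(w) + G(w)F*(w)
assert np.allclose(S_h, S_f + S_g + cross)
```

The check makes concrete why Condition (1) only holds approximately for raw spectra: the cross terms vanish only when the sources do not overlap in frequency (or average out over time), which is exactly the situation Condition (3) exploits.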