A MIDI Control and Performance System for Brass Instruments

Perry Cook (prc@ccrma.stanford.edu), Stanford CCRMA
Dexter Morrill (dexter@cs.colgate.edu), Colgate University
Julius O. Smith (jos@ccrma.stanford.edu), Stanford CCRMA

ICMC Proceedings 1993

Abstract

Pitch and frequency detection schemes which do not take into account the unique spectral and acoustical properties of a particular instrument family usually generate errors of three types: 1) octave and harmonic detection errors, 2) half-step errors which jitter rapidly about the true estimate, and 3) latency of detection. A pitch detection and live MIDI control system has been constructed for the trumpet which significantly reduces detection errors and latency. By limiting the search range to the natural playing range of the trumpet, sampling rate and computation can be optimized, reducing latency in the pitch estimates. By measuring and utilizing valve position in the pitch detection algorithm, the half-step jitter problem is completely eliminated, and latency can be further reduced. Schemes for reducing harmonic detection errors will be presented. The trumpet system is quantitatively compared to two popularly available pitch-to-MIDI systems. Performance features are integrated in the trumpet MIDI control system, such as MIDI file and sound file playback controlled by triggers mounted on the instrument itself.

1 Introduction: Pitch/Period Detection

Pitch detection is of interest whenever a single quasi-periodic sound source is to be studied or modeled, specifically in speech and music [Hess, 1983]. Pitch detection algorithms can be divided into methods which operate in the time domain, the frequency domain, or both. One group of pitch detection methods relies on the detection of some set of features in the time domain. Other time-domain methods use autocorrelation functions or difference norms to detect similarity between the waveform and a time-lagged version of itself.
Such methods are essentially period detectors rather than true pitch detectors, since pitch is a perceptual measure. However, many signals such as those encountered in music are highly periodic, and thus systems which detect frequency or periodicity are often called pitch detectors.

Another family of methods operates in the frequency domain, locating sinusoidal peaks in the frequency transform of the input signal. Other methods use combinations of time-domain and frequency-domain techniques to detect pitch. Frequency-domain methods call for the signal to be frequency-transformed; the frequency-domain representation is then inspected for the first harmonic, the greatest common divisor of all harmonics, or other such indications of the period. Windowing of the signal is recommended to smooth the effects at frame edges, and a minimum number of periods of the signal must be analyzed to enable accurate location of harmonic peaks. Various linear pre-processing steps can be used to make the process of locating frequency-domain features easier, such as performing linear prediction on the signal and using the residual signal for pitch detection. Performing non-linear operations such as peak limiting also simplifies the location of harmonics.

In a time-domain feature detection method the signal is usually pre-processed to accentuate some time-domain feature, then the time between occurrences of that feature is calculated as the period of the signal. A typical time-domain feature detector might lowpass filter the signal, then detect peaks or zero crossings. Since only the time between occurrences of a particular feature is used as the period estimate, feature detection schemes usually do not use all of the data available. Selection of a different feature yields a different set of pitch estimates [Deem et al., 1989]. Since estimates of the period are often defined at the instant when features are detected, the frequency samples yielded are non-uniform in time.
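As a rough illustration of the time-domain feature approach just described, the following Python sketch lowpass filters a signal, locates rising zero crossings with linear interpolation between samples, and reports one period estimate per pair of adjacent crossings. The one-pole filter, its coefficient `alpha`, and all function names are assumptions made for this example; they are not taken from any of the systems discussed in this paper.

```python
import math


def lowpass(signal, alpha=0.2):
    """One-pole lowpass: y[n] = y[n-1] + alpha * (x[n] - y[n-1]).

    alpha is an assumed smoothing coefficient, chosen only for illustration.
    """
    out, y = [], 0.0
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out


def rising_zero_crossings(signal):
    """Fractional sample indices of negative-to-nonnegative crossings."""
    times = []
    for n in range(1, len(signal)):
        a, b = signal[n - 1], signal[n]
        if a < 0.0 <= b:
            # Linear interpolation between samples n-1 and n.
            times.append((n - 1) + (-a) / (b - a))
    return times


def period_estimates(signal, sample_rate):
    """Return (time_sec, period_sec) pairs, one per pair of adjacent crossings."""
    t = rising_zero_crossings(lowpass(signal))
    return [
        (t[i] / sample_rate, (t[i] - t[i - 1]) / sample_rate)
        for i in range(1, len(t))
    ]
```

Each estimate is time-stamped at the crossing that produced it, so the estimates arrive at irregular instants that depend on the signal itself.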
To avoid the problem of non-uniform time sampling, a window of fixed size is moved through the signal, and a number of detected periods within each window are averaged to obtain the period estimate. For reliable and smooth estimation, the window must be at least a few periods long. Often the signal must be interpolated between samples in order to locate the feature occurrence time as accurately as needed.

2 A New Period Estimation Algorithm

A method of pitch detection which uses the phase delay of a periodic predictor to form the pitch estimate is briefly presented in this section. For more detail on the algorithm, implementation, and applications, see [Cook, 1991]. This pitch detector accurately tracks a quasi-periodic signal, and will be called the Periodic Predictor Pitch Tracker (PPPT). The PPPT provides a method of automatically and adaptively determining the optimum continuous-time period, and also provides an estimate of the reliability of the period estimate. The PPPT system as initially described is not a complete pitch detector, in that it relies on some other scheme for an initial estimate of the period. Once the detector locks onto the correct period, the method provides accurate estimates of the instantaneous period using all samples of the input signal, provides an estimate of the periodicity of the signal, and provides controls which affect the dynamics and accuracy of the pitch detector.

Given a quasi-periodic signal x(n), and an integer estimate P of the initial period, periodic prediction is implemented by:

$$\hat{x}(k) = \sum_{i=-M}^{M} x(k-P+i)\, c(i) \tag{1}$$

where M is some appropriately chosen small number and c(i) are the predictor coefficients. Backward prediction is implemented by replacing P with -P in Equation 1. Figure 1 shows a block diagram of an FIR periodic predictor.

Figure 1: Linear FIR period predictor

The phase (relative to the Pth delayed sample) of the FIR filter implemented by the predictor coefficients is computed by:

$$\theta = \tan^{-1}\left[\frac{\sum_{i=1}^{M}\bigl(c(i)-c(-i)\bigr)\sin(\omega i)}{c(0)+\sum_{i=1}^{M}\bigl(c(i)+c(-i)\bigr)\cos(\omega i)}\right] \tag{2}$$

The frequency $\omega$ is the frequency at which the phase delay of the filter is calculated. The relation between the pitch estimate and $\omega$ is:

$$\omega = 2\pi T_s / T_0 \tag{3}$$

where $T_s$ is the sampling period in seconds and $T_0$ is the period estimate. For further computational savings, sine, cosine, and $\tan^{-1}$ values can be calculated by interpolated table lookup. The phase delay of the filter is computed by:

$$\text{Phase Delay} = \theta / \omega \tag{4}$$

By adding the computed time delay to the time delay of the P-length delay line, the net time delay of the predictor is computed. This total delay is then used to compute a period and frequency estimate:

$$\text{Period} = T_0 = \text{Phase Delay} + P / \text{Sampling Rate} \tag{5}$$

$$\text{Frequency} = F_0 = 1 / T_0 \tag{6}$$

2.1 FIR Periodic Prediction Algorithms

There are many known methods of implementing the adaptive FIR predictor used in the PPPT, among them the Covariance Least Squares (CLS), Recursive Least Squares Adaptive (RLS), and Least Mean Squares Adaptive (LMS) algorithms. All of these methods minimize the Mean Square Error (MSE):

$$\mathrm{MSE} = \frac{1}{N}\sum_{k} \varepsilon_k^2 \tag{7}$$

where the instantaneous error $\varepsilon_k$ is defined as the difference between the signal sample and the predicted sample at time k:

$$\varepsilon_k = x(k) - \hat{x}(k) \tag{8}$$

The Least Mean Squares (LMS) adaptive algorithm [Widrow and Stearns, 1985] is a gradient steepest descent algorithm using the instantaneous error to estimate the gradient of the error surface. It is preferred by the authors because of its performance and efficiency of implementation. Each predictor coefficient is adjusted at each time sample by an amount proportional to the instantaneous error and the signal value which is associated with the coefficient being adjusted. The LMS update equation is:

$$C_{k+1} = C_k + 2\mu X_k \varepsilon_k \tag{9}$$

where $C_k$ is the vector of predictor coefficients, and $X_k$ is the corresponding delayed input sample vector. The adaptation constant $2\mu$ comes from the Newton's method derivation (the 2 comes from taking the multidimensional derivative of the error function), and controls the dynamics (and stability) of adaptation. Stability is ensured if:

$$\mu < \left[(2M+1)\,\overline{x^2}\right]^{-1} \tag{10}$$

where $\overline{x^2}$ is the signal power. The adaptation parameter can be adapted dynamically, yielding the Normalized LMS algorithm:

$$C_{k+1} = C_k + \frac{\alpha X_k \varepsilon_k}{(2M+1)\,\overline{x^2}} \tag{11}$$

where the signal power is computed over at least the last 2M+1 samples, and $\alpha$ is any positive number less than 1.

2.2 Adapting the Delay Parameter P

The integer period estimate P is variable, and on-line adaptation of the delay-line length raises new issues of filter dynamics in the LMS and RLS systems. Ideally, the filter should experience no transients because of the adaptive modification of P. Various methods have been developed for adapting P, and are described in detail in [Cook, 1991].

2.3 The AMDF

The PPPT can be viewed as a refinement of the Average Magnitude Difference Function (AMDF) detectors. Methods of this type have also been called comb-filter methods. The AMDF measures the difference between the waveform and a lagged version of itself. The generalized AMDF is:

$$\mathrm{AMDF}(P) = \sum_{i=q}^{q+N-1} \left| x(i) - x(i+P) \right|^{m} \tag{12}$$

The quantity m is set to 1 for average magnitude difference, and other values for other related methods.
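The generalized AMDF of Equation 12 translates almost directly into code. The Python sketch below evaluates the function at each integer lag in a bounded search range and returns the lag with the deepest null. The function names, the 256-sample default window, and the choice of search bounds are illustrative assumptions rather than details of the authors' implementation.

```python
import math


def generalized_amdf(x, lag, start, length, m=1):
    """Sum of |x(i) - x(i + lag)|**m for i = start .. start + length - 1.

    `start` plays the role of q and `length` the role of N in Equation 12.
    """
    return sum(abs(x[i] - x[i + lag]) ** m for i in range(start, start + length))


def amdf_period(x, min_lag, max_lag, start=0, length=256, m=1):
    """Integer lag in [min_lag, max_lag] with the smallest AMDF value.

    The caller must supply at least start + length + max_lag samples.
    The lag bounds are assumed to come from the instrument's playing range.
    """
    lags = range(min_lag, max_lag + 1)
    return min(lags, key=lambda p: generalized_amdf(x, p, start, length, m))
```

Bounding the lags to a plausible playing range keeps the search cheap, but a plain minimum can still land on an integer multiple of the true period whenever such a multiple falls inside the range, which is one way octave errors arise.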
The zero lag (P = 0) position of the AMDF is identically zero, and the next significant null is a likely estimate of the period. Other nulls will occur at integer multiples of the period. If the error signal from the optimally adjusted PPPT were rectified and integrated, the output would be close to that of the optimum lag bin of the AMDF. The difference is that the PPPT can adjust to fractional sample periods, can even adjust to signals which are not purely harmonic (periodic), and does not require the block processing that the AMDF imposes.

3 A System for Brass Instrument MIDI Control

A MIDI (Musical Instrument Digital Interface) performance control system has been constructed which integrates a valve-guided AMDF fundamental pitch estimator for coarse estimates, one PPPT adaptive period predictor for fine estimates, and other controls for musical performance. Figure 2 shows the system control screen, which runs on a NeXT computer.

Figure 2: NeXT control window for Trumpet MIDI controller program

The trumpet audio is sensed using a mouthpiece-mounted microphone, routed to the CODEC port on the NeXT, and sampled at 8012 Hz. The valve states are sensed using two optical switches per valve, resulting in four-position accuracy for each valve. Two additional switches mounted on the trumpet can be actuated by the player's finger and thumb, providing programmable control of performance functions such as MIDI and sound file playback, and synthesizer controls such as program change and sustain. The eight valve and control switches are encoded into a single serial byte and routed to the NeXT machine via one of the serial ports running at 19,200 baud. There is one internal synthesizer voice provided (an FM trumpet), synthesized on the NeXT DSP 56001 digital signal processing chip, primarily for tuning the control instrument and pitch detectors. A standard MIDI interface connects to the other NeXT serial port, and can be used to connect the system to any MIDI-compatible synthesizer.

Modes are provided to incorporate valve information into the AMDF fundamental pitch estimate, by limiting the AMDF calculations to 'legal' harmonics based on the current valve state. A single PPPT adaptive periodic predictor is set to the integer period yielded by the AMDF detector, and the coefficients of the PPPT are used to calculate fine estimates of fundamental pitch. Pitch bend can be used to continuously update the MIDI synthesizer based on the fine pitch estimate of the PPPT.

The system was tested and compared to two popular 'pitch to MIDI' devices available in the music market.
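The coarse-to-fine scheme of this section can be sketched end to end in Python. The code below restricts the AMDF search to lags that are 'legal' for a valve state, adapts a normalized LMS periodic predictor at the chosen integer lag, and converts the phase response of the taps into a fractional period. Several details are assumptions made only for illustration: the open-horn fundamental of 116.54 Hz (a B-flat trumpet), the semitone offset of each valve, the harmonic range 2 to 10, the reduction of each valve to simply up or down, the tap count, and the adaptation constant. The sign of the fractional correction follows from the tap indexing x(k - P + i) used in the code; with the opposite indexing the correction changes sign.

```python
import cmath
import math

SAMPLE_RATE = 8012.0         # Hz, as stated in the text
OPEN_FUNDAMENTAL = 116.54    # Hz, ASSUMED open-horn fundamental (B-flat trumpet)
VALVE_SEMITONES = (2, 1, 3)  # ASSUMED lowering, in semitones, of valves 1, 2, 3


def legal_lags(valves_down, harmonics=range(2, 11)):
    """Integer lags (samples) of the harmonics available for a valve state."""
    drop = sum(s for s, down in zip(VALVE_SEMITONES, valves_down) if down)
    fundamental = OPEN_FUNDAMENTAL * 2.0 ** (-drop / 12.0)
    return sorted({round(SAMPLE_RATE / (h * fundamental)) for h in harmonics})


def amdf(x, lag, length):
    """Average magnitude difference between x and x advanced by `lag`."""
    return sum(abs(x[i] - x[i + lag]) for i in range(length)) / length


def coarse_lag(x, valves_down, length=256):
    """Coarse estimate: the legal lag with the deepest AMDF null."""
    return min(legal_lags(valves_down), key=lambda p: amdf(x, p, length))


def adapt_predictor(x, P, M=2, alpha=0.5):
    """Normalized LMS adaptation of xhat(k) = sum_i c(i) * x(k - P + i)."""
    c = [0.0] * (2 * M + 1)   # c[j] stores c(i) for i = j - M
    c[M] = 1.0                # start as a pure P-sample delay
    for k in range(P + M, len(x)):
        window = x[k - P - M : k - P + M + 1]
        err = x[k] - sum(ci * xi for ci, xi in zip(c, window))
        # Window energy equals (2M+1) times the mean power over the window.
        energy = sum(xi * xi for xi in window) + 1e-12
        c = [ci + alpha * err * xi / energy for ci, xi in zip(c, window)]
    return c


def fine_period(c, P, iterations=3):
    """Period in samples: P corrected by the phase delay of the taps."""
    M = len(c) // 2
    period = float(P)
    for _ in range(iterations):
        omega = 2.0 * math.pi / period   # radians per sample
        g = sum(c[i + M] * cmath.exp(1j * omega * i) for i in range(-M, M + 1))
        period = P - cmath.phase(g) / omega
    return period
```

For a steady 220 Hz test tone with valve two down, `coarse_lag` should select a lag of 36 samples, and `fine_period(adapt_predictor(x, 36), 36)` should approach 8012/220, about 36.4 samples. Dividing `SAMPLE_RATE` by the refined period gives the frequency estimate.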
The two pitch-to-MIDI detectors will be called OTHER-A (circa 1989, list price about $900) and OTHER-B (circa 1991, list price about $300). Our device will be called OURS (list price about $7000, because of custom-machined hardware for the trumpet, and the host NeXT machine). The three systems were tested for latency, defined as the total time delay from first trumpet sound to first synthesizer sound. All three devices were presented with the same audio and controlled the same synthesizer voice. The test passage consisted of a chromatic scale up and down, separately articulated (tongued). The systems were also tested for accuracy, defined by the number of erroneous MIDI messages sent during the articulated chromatic test passage, and during another chromatic passage which was performed slurred (notes connected with no space between and no rearticulation). Each passage was presented twice, and the best performance of each system was used to calculate accuracy and average latency. Figures 3 and 4 show the results of the three detectors on the two chromatic test passages. The graphs of the three device outputs are offset vertically by 10 for plotting clarity. Errors are circled.

Figure 3: Detection results for tongued chromatic scale.

Figure 4: Detection results for slurred chromatic scale.

OTHER-A exhibited an average latency of 114.3 ms with a standard deviation of 66 ms, zero errors in the articulated chromatic passage, and five errors in the non-articulated chromatic passage. OTHER-B exhibited an average latency of 60.4 ms with a standard deviation of 21.7 ms, seven errors in the articulated chromatic passage, and six errors in the non-articulated chromatic passage. OURS exhibited an average latency of 30.1 ms with a standard deviation of 8.1 ms, three errors in the articulated chromatic passage, and five errors in the non-articulated chromatic passage.

These results point out a general tradeoff between accuracy and latency, where 'waiting around' pays off in increased confidence in the estimate. The OTHER-A device exhibited the highest accuracy, but at the price of objectionably high latency. Most of the errors in OURS were caused by the valves moving slightly before the pitch changes, or by intermediate valve positions between two stable positions (in moving from all valves up to all valves down, for example, two valves may arrive earlier than the third), thus causing a spurious message to be sent. It is conjectured that performance can be improved by delaying the valve information slightly and making more use of fractional valve positions to determine valve trajectories.

One further test was performed which integrates both aspects of latency and accuracy, based on the system's ability to accurately track a valve trill which increases in speed. As the trill speed increases, at some point each device should fail based on its inability to reliably detect pitch and send an output message. Of course, with valves active, the theoretical limit for a system using valve information to track a valve trill is 1/2 the valve sampling rate (1000 Hz in the OURS case). Figure 5 shows the results of a half-step trill (all valves up alternating with valve two down). The OTHER-A device fails at a trill rate of about 7 Hz (consistent with the 114.3 ms average latency). The OTHER-B device began to produce errors at around 6 Hz, but did not fail to track changes in pitch at the maximum trill rate of 8 Hz. The OURS system produced no errors and tracked the trill up to the maximum rate of 8 Hz.

Figure 5: Detection results for accelerating valve trill.

4 Composing Music for Trumpet Performers Using Pitch Detectors

As a single-voice, sustaining instrument with a full potential for a fast, somewhat percussive attack, the trumpet requires a reasonably fast and accurate pitch detector in computer music systems. The trumpet often plays long notes because of its rich melodic potential, and the amplitude contour may vary considerably. Both the steady-state amplitude variation and small variations in pitch, including a periodic vibrato, make it essential for the system to constantly "follow" the sound.
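Following small pitch variations on a held note amounts to converting the fine frequency estimate into a pitch-bend offset from the note currently sounding. A minimal Python sketch of that conversion follows. It assumes equal temperament with A4 = 440 Hz and a synthesizer bend range of ±2 semitones, and the function names are hypothetical.

```python
import math


def frequency_to_midi(freq_hz):
    """Fractional MIDI note number (A4 = 440 Hz = note 69)."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)


def pitch_bend_message(freq_hz, held_note, channel=0, bend_range=2.0):
    """Three-byte MIDI pitch-bend message retuning `held_note` toward `freq_hz`.

    bend_range is the ASSUMED synthesizer bend range in semitones; offsets
    beyond it are clamped. channel is expected to be in 0..15.
    """
    offset = frequency_to_midi(freq_hz) - held_note   # semitones from held note
    offset = max(-bend_range, min(bend_range, offset))
    value = int(round(8192 + offset / bend_range * 8192))
    value = max(0, min(16383, value))                 # 14-bit range, 8192 = centre
    # Status byte, then the 7 least significant bits, then the 7 most significant.
    return bytes([0xE0 | channel, value & 0x7F, value >> 7])
```

Sending such a message for each new fine estimate lets the synthesizer follow vibrato without issuing new note-on messages.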
Players can feel the response of the computer music system and find it difficult to play naturally if the pitch detection/MIDI delay is greater than 0.035 seconds. For the performer, the most annoying problems are (1) missed pitches, especially in fast passages, (2) MIDI NOTE ON delays of more than 0.035 seconds, (3) false MIDI NOTE ON messages caused by valve noise or small breath impulses, and (4) a lack of pitch bend (small pitch variations) and aftertouch (amplitude variations), causing the synthesis to 'untrack' from the brass sound source on long notes.

From the standpoint of composition, the matter of pitch detection accuracy is quite simple. If the computer music system is not very accurate (i.e., near 100%), then the composer must find a way to alleviate the problem for the performer. In conventional music, especially most music found in the western music repertoire, pitch accuracy is perhaps of the highest order of importance, and performers can sometimes quickly correct mistakes. Fast scale passages are particularly well known and easily followed by listeners, because of their common usage and high degree of predictability. Composers can compensate for slow pitch detection in computer music systems by not allowing the listener to know what the 'right' pitch is, or by keeping the listener's expectation for particular pitches low. In situations where the pitch is missed by the player, some tolerance is demanded by the composition. Improvisation is ideal, allowing the player some freedom in the choice and rhythmic placement of notes. It is best to avoid conventional arpeggios or scales, and an obvious melodic doubling of parts played by the music system. To some extent, the performer using these systems may enter the domain of accompaniment, where a rich texture or mass of sound results.

Most players want to expand the limits of their own playing techniques, not to reduce the role of their playing, or to have a machine do most of the work for them.
For this reason, there has been an odd circumstance where composers and performers sometimes have very different goals: the player reaching for a fantastic new technique and the composer attempting to find a new sound and musical expression, leaving the player with new, but unrewarding tasks. The concept of our design is to leave the natural trumpet alone and offer the player an extension to his or her instrument, reaching beyond to the world of synthesizers and processors with hand and embouchure control that is already learned. It is unrealistic to think that this extension will be without some cost or disadvantages, or that it will not require the player to learn some new techniques. Wind players are very sensitive to tiny changes in mouthpiece size, for example, so it is reasonable to expect that such a formidable expansion in the domain of sound will require adjustments and some time for exploitation.

References

[Cook, 1991] Perry R. Cook, Identification and Control of Parameters in an Articulatory Vocal Tract Model, With Applications to the Synthesis of Singing, Ph.D. dissertation, Dept. of Electrical Engineering, Stanford University, 1991.

[Deem et al., 1989] J. F. Deem, W. H. Manning, J. V. Knack and J. S. Matesich, The Automatic Extraction of Pitch Perturbation Using Microcomputers: Some Methodological Considerations, Journal of Speech and Hearing Research, vol. 32, pp. 689-697, 1989.

[Hess, 1983] Wolfgang Hess, Pitch Determination of Speech Signals, Berlin: Springer-Verlag, 1983.

[Widrow and Stearns, 1985] Bernard Widrow and S. D. Stearns, Adaptive Signal Processing, New Jersey: Prentice-Hall, 1985.