Real-time audio analysis tools for Pd and MSP

Miller S. Puckette, UCSD (msp@ucsd.edu)
Theodore Apel, CRCA, UCSD (tapel@ucsd.edu)
David D. Zicarelli, Cycling74 (www.cycling74.com)

Abstract

Two "objects," which run under Max/MSP or Pd, do different kinds of real-time analysis of musical sounds. Fiddle is a monophonic or polyphonic maximum-likelihood pitch detector similar to Rabiner's, which can also be used to obtain a raw list of a signal's sinusoidal components. Bonk does a bounded-Q analysis of an incoming sound to detect onsets of percussion instruments in a way which outperforms the standard envelope following technique. The outputs of both objects appear as Max-style control messages.

1 Tools for real-time audio analysis

The new real-time patchable software synthesizers have finally brought audio signal processing out of the ivory tower and into the homes of working computer musicians. Now audio can be placed at the center of real-time computer music production, and MIDI, which for a decade was the backbone of the electronic music studio, can be relegated to its appropriate role as a low-bandwidth I/O solution for keyboards and other input devices.

Many other sources of control "input" can be imagined than are provided by MIDI devices. This paper, for example, explores two possibilities for deriving a control stream from an incoming audio stream. First, the sound might contain quasi-sinusoidal "partials" and we might wish to know their frequencies and amplitudes. In the case that the audio stream comes from a monophonic or polyphonic pitched instrument, we would like to be able to determine the pitch(es) and loudness(es) of the components. It's clear that we'll never have a perfect pitch detector, but the fiddle object described here does fairly well in some cases.

For the many sounds which don't lend themselves to sinusoidal decomposition, we can still get useful information from the overall spectral envelope.
For instance, rapid changes in the spectral envelope turn out to be a much more reliable indicator of percussive attacks than are changes in the overall power reported by a classical envelope follower. The bonk object applies a bounded-Q filterbank to an incoming sound and can either output the raw analysis or detect onsets, which can then be compared to a collection of known spectral templates in order to guess which of several possible kinds of attack has occurred.

The fiddle and bonk objects are low tech; the algorithms would be easy to re-code in another language or for environments other than the ones considered here. Our main concern is to get predictable and acceptable behavior using easy-to-understand techniques which won't place an unacceptable computational load on a late-model computer.

Some effort was taken to make fiddle and bonk available on a variety of platforms. They run under Max/MSP (Macintosh) and Pd (Wintel, SGI, Linux); fiddle also runs under FTS (available on several platforms). Both are distributed with source code; see http://manl04nfs.ucsd.edu/~mpuckett/ for details.

2 Analysis of discrete spectra

Two problems are of interest here: getting the frequencies and amplitudes of the constituent partials of a sound, and then guessing the pitch. Our program follows the ideas of [Noll 69] and [Rabiner 78]. Whereas the earlier "pitch" object reported in [Puckette 95] departs substantially from the earlier approaches, the algorithm used here adheres more closely to them.

First we wish to get a list of peaks with their frequencies and amplitudes. The incoming signal is broken into segments of N samples, with N a power of two typically between 256 and 2048. A new analysis is made every N/2 samples. For each analysis the N samples are zero-padded to 2N samples and a rectangular-window DFT is taken. An interesting trick cuts the computation time roughly in half for this setup; see the source code to see how this is done. If we let X[k] denote the zero-padded DFT, we can do a three-point convolution in the frequency domain to get the Hanning-windowed DFT:

$$X_H[k] = \frac{X[k]}{2} - \frac{X[k+2] + X[k-2]}{4}$$

Any of the usual criteria can be applied to identify peaks in this spectrum. We then go back to the nonwindowed spectrum to find the peak frequency using the phase vocoder with hop 1:

$$\omega = \frac{\pi}{N}\left(k + \mathrm{re}\left[\frac{X[k-2] - X[k+2]}{2X[k] - X[k-2] - X[k+2]}\right]\right)$$

This is a special case of a more general formula derived in [Puckette 98]. The amplitude estimate is simply the windowed peak strength at the strongest bin, which because of the zero-padding won't differ by more than about 1 dB from the true peak strength. The phase could be obtained in the same way but we won't bother with that here.

2.1 Guessing fundamental frequencies

Fundamental frequencies are guessed using a scheme somewhat suggestive of the maximum-likelihood estimator. Our "likelihood function" is a non-negative function $\mathcal{L}(f)$ where f is frequency. The presence of peaks at or near multiples of f increases $\mathcal{L}(f)$ in a way which depends on the peak's amplitude and frequency as shown:

$$\mathcal{L}(f) = \sum_{i=0}^{k} a_i t_i n_i$$

where k is the number of peaks in the spectrum, $a_i$ is a factor depending on the amplitude of the ith peak, $t_i$ depends on how closely the ith peak is tuned to a multiple of f, and $n_i$ depends on whether the peak is closest to a low or a high multiple of f. The exact choice of how these factors should depend on f and the peak's frequency and amplitude is a subject of constant tinkering.
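As a numerical check on the windowing step of Section 2, the following sketch compares the three-point frequency-domain convolution with direct Hanning windowing in the time domain. It is not taken from the fiddle source: the function names, the test signal, and the small window size are illustrative assumptions, and the window is assumed to be $0.5 - 0.5\cos(2\pi n/N)$ over the N samples.

```python
import cmath
import math


def dft(samples, size):
    """Direct DFT of `samples` zero-padded to `size` points (O(n^2), small n only)."""
    return [
        sum(s * cmath.exp(-2j * math.pi * k * t / size) for t, s in enumerate(samples))
        for k in range(size)
    ]


def hann_by_convolution(spectrum):
    """Apply X_H[k] = X[k]/2 - (X[k+2] + X[k-2])/4 with circular bin indexing."""
    size = len(spectrum)
    return [
        spectrum[k] / 2 - (spectrum[(k + 2) % size] + spectrum[(k - 2) % size]) / 4
        for k in range(size)
    ]


def hann_by_windowing(samples):
    """Multiply by a Hanning window in the time domain, then take the 2N-point DFT."""
    n = len(samples)
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * t / n)) for t, s in enumerate(samples)
    ]
    return dft(windowed, 2 * n)


# Hypothetical test signal: two sinusoids that do not fall on bin centers.
N = 16
signal = [
    math.sin(2 * math.pi * 3.3 * t / N) + 0.5 * math.cos(2 * math.pi * 5.1 * t / N)
    for t in range(N)
]

X = dft(signal, 2 * N)                 # rectangular window, zero-padded to 2N
XH_conv = hann_by_convolution(X)       # frequency-domain route
XH_time = hann_by_windowing(signal)    # time-domain route

max_error = max(abs(a - b) for a, b in zip(XH_conv, XH_time))
print("largest difference between the two routes:", max_error)
```

The bin offset is 2 rather than 1 because zero-padding to 2N points halves the bin spacing, so the cosine of the Hanning window shifts the spectrum by two bins of the padded DFT.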
For monophonic pitch estimation, we simply output the value of f whose "likelihood" is highest. For polyphonic pitch estimation, we successively take the values of f of greatest likelihood which are neither multiples nor submultiples of a previous one. In all cases, an additional criterion is used to make a pitched/nonpitched decision since $\mathcal{L}(f)$ will always have a maximum, even when no pitch is present. Our criterion is that there either be at least four peaks present or else that the fundamental be present and the total power of the contributing peaks be at least a hundredth of the signal power.

2.2 Object design

The fiddle object has a signal input and a varying number of control outputs depending on its creation arguments:

fiddle [npoints] [npitches] [npeaks-analyzed] [npeaks-output]

where npoints gives the (power of two) number of points in each analysis window, npitches gives the number of separate pitches to report (one by default), npeaks-analyzed gives the maximum number of peaks to consider in determining pitch (default 20) and npeaks-output gives the number of peaks which are to be output raw. Setting npitches to zero suppresses pitch estimation (and saves computation time). The outlets, from left to right, are:

* a floating-point pitch which is output when a new, stable note is found
* a bang which is output conditionally on "attacks", whether or not a pitch is found
* from 0 to 3 lists, each giving the pitch and loudness of a pitch track
* the continuous signal power in dB
* a list, which iteratively sends triples giving each peak's index, frequency, and amplitude.

The following messages print or set the fudge mix:

print: print out the parameters controlled by the following messages.

amp-range (low) (high): set the (low) and (high) amplitude thresholds in dB. Note-on detection requires that the signal exceed (high); if a pitch track's strength goes below (low) it is dropped.

reattack (time) (dB): Pitch tracks whose strength increases by more than (dB) within (time) msec output a new "note" message.

vibrato (time) (half-tones): warn fiddle that the instrument is capable of vibrato. New notes will not be reported until (time) msec have passed with the pitch remaining within (half-tones) of a center pitch; the center pitch is then reported. If the instantaneous pitch differs by more than (half-tones) from the reported pitch, the search begins for a new note.

npartial (n): The jth partial is weighted as n/(n + j) in the likelihood formula.

The following messages are also defined:

uzi (onoff): Turn "uzi" mode on or off; by default it's on. Turn it off if you want to poll for pitch tracks or sinusoidal components yourself; otherwise they come out on every analysis period.

bang: poll them.

debug: turn on debugging.

The computation load of fiddle varies depending on the input signal. In an informal test, running fiddle on a sawtooth plus white noise used 21 percent of the available CPU time on a 300 MHz Pentium 2 machine running NT.

3 Bounded-Q Analysis

The bonk object was written for dealing with sound sources for which sinusoidal decomposition breaks down; the first application has been to drums and percussion. The design emphasizes speed; the hardwired analysis window size is 256 samples, or 5.8 msec at 44.1 kHz; the hop size can be as low as 64 samples.

The first stage of analysis in bonk is a downsampling FIR filterbank of the sort described in [Brown 92]. Letting N be the window size, M = N/2, and x[n], n = 0, ..., N - 1 the input signal, $\omega$ a center frequency and $\delta$ a filter bandwidth, we can compute the estimated signal power at $\omega$ with bandwidth $\delta$ as:

$$P(\omega, \delta) = \left| \sum_{m=-T}^{T-1} \exp(i\omega m)\, [\ldots] \right|^2$$

where [...] and $T = \pi/\delta$. The minimum bandwidth we can achieve is thus $2\pi/N$. The particular choice of frequency/bandwidth combinations in bonk was two filters per octave except where prohibited by the bandwidth limit:

$$(\omega, \delta) = (2\pi/N,\, 2\pi/N),\ (4\pi/N,\, 2\pi/N),\ (6\pi/N,\, 2\pi/N),\ (6\sqrt{2}\pi/N,\, 2\sqrt{2}\pi/N),\ (12\pi/N,\, 4\pi/N),\ (12\sqrt{2}\pi/N,\, 4\sqrt{2}\pi/N),\ (24\pi/N,\, 8\pi/N)$$

and so on up to the Nyquist, giving a total of 11 filters for N = 256.

3.1 Detecting attacks

The most satisfying application of this analysis is in detecting percussive attacks. The most popular way of doing this is to use an envelope follower and look for rapid rises in follower output; but any kind of ringing can set off trains of unwanted attacks or, conversely, can mask true attacks. The analysis used by bonk can often detect new attacks which appear as sharp relative changes in the spectrum without any accompanying large change in the overall power; conversely, ringing instruments don't often give rapidly changing spectra and hence don't attract bonk's attention.

We define a growth function as follows. For each channel we maintain a mask m which represents the current power in the channel. To accomplish this, after each analysis we look at the current power p. If p > m we replace m by p; otherwise if m hasn't been updated for more than masktime analyses (5 by default), the mask decays by multiplication by maskdecay (0.8). Since the default analysis interval is 3 msec, the default mask time is about 15 msec; much shorter than this and your kettle drum will set off an attack every half period. The growth in each channel is the dimensionless quantity

$$g = \max(0,\, p/m - 1)$$

which is 1 if the current power is twice the mask, for instance.

Next we add up the growth estimates for all eleven channels. If this total exceeds hithresh (default 12), we'll report an attack. However, we don't actually report the attack until the spectrum stops growing, i.e., the growth must decrease to a value below lothresh (default 6). This is done so that the true loudness and spectrum of the new event can be reported.

3.2 Matching spectral templates

It is also possible to ask bonk to test any new attack against a menu of pre-recorded attacks in order to guess which of several possible instruments was responsible for the new attack. To do this, first we store spectral templates for each of the instruments. Thereafter, any new attack is compared with the stored ones and the closest match is reported. The underlying assumption, that there is actually some repeatability in the spectral envelopes of attacks of percussive instruments, certainly doesn't hold true in the real world, but it is interesting to learn which sorts of instruments bonk can identify in this way and which it can't.

To describe how template matching works we first add indices to the variables for power and mask; $p_i$, $m_i$ are the power and mask for the ith channel for i = 1, ..., 11. Now suppose $s_i$, $t_i$ are the spectra of two pre-recorded attacks normalized so that |S| = |T| = 1 as real 11-dimensional vectors. The simplest test would be to ask which of P · S, P · T is greater. However, some information might be missing in P because of masking, so a more appropriate measure of agreement between P and S, for example, is to weight each component by its "clarity", which we measure by the growth $g_i$; so the value of the fit between P and S is thus

$$\Big(\sum_i g_i s_i p_i\Big)\; [\ldots]$$

The template which agrees best with P in this sense is the output reported.

3.3 The bonk object design

Like fiddle, bonk has one inlet which takes an audio signal and messages to alter its settings and to learn, store and recall templates:

thresh (lothresh) (hithresh): Set the attack thresholds.

mask (masktime) (maskdecay): Set the mask parameters.

debounce (debounce-time): Set the minimum time in msec between two attacks.

print (more): Print the values of the parameters and (if more) the current analysis vector.

learn (flag): turn "learn" mode on or off.

write (filename): write learned templates to a text file.

read (filename): read templates from a text file.

bang: output current spectrum as a list.

Two outputs are provided; the rightmost reports the entire eleven-element spectrum on attacks and on bang; the other gives the number of the best matching template for each attack and the overall loudness of the attack.
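To make the roles of the thresh and mask parameters concrete, here is a minimal sketch of the attack detector of Section 3.1, driven by a sequence of per-channel power frames. It is an illustration under stated assumptions, not the bonk source: the function name and frame representation are hypothetical, the mask is assumed to be seeded from the first frame, growth is assumed to be measured against the mask before the mask is updated, and a tiny floor on the mask is an added guard against division by zero.

```python
def detect_attacks(frames, hithresh=12.0, lothresh=6.0, masktime=5, maskdecay=0.8):
    """Return the indices of frames at which an attack would be reported.

    `frames` is a sequence of lists, one power value per filter channel.
    """
    attacks = []
    mask = None      # per-channel mask m
    age = None       # analyses since each channel's mask was last replaced
    pending = False  # True once total growth has exceeded hithresh

    for index, powers in enumerate(frames):
        if mask is None:
            # Assumption: seed the mask from the first frame instead of zero.
            mask = list(powers)
            age = [0] * len(powers)
            continue

        total_growth = 0.0
        for ch, p in enumerate(powers):
            # Growth g = max(0, p/m - 1), measured against the old mask.
            m = max(mask[ch], 1e-12)  # assumed floor, avoids dividing by zero
            total_growth += max(0.0, p / m - 1.0)

            if p > mask[ch]:
                mask[ch] = p          # replace the mask by the new power
                age[ch] = 0
            else:
                age[ch] += 1
                if age[ch] > masktime:
                    mask[ch] *= maskdecay  # stale mask decays

        if total_growth > hithresh:
            pending = True            # attack seen, spectrum still growing
        elif pending and total_growth < lothresh:
            attacks.append(index)     # growth has settled: report now
            pending = False

    return attacks


# Hypothetical input: eleven channels at steady power, then a fourfold jump.
quiet = [[1.0] * 11 for _ in range(10)]
loud = [[4.0] * 11 for _ in range(11)]
print(detect_attacks(quiet + loud))
```

Deferring the report until the total growth falls below lothresh means the attack is attributed to the first frame after the spectrum has stopped growing, which is the frame whose spectrum and loudness describe the new event.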
If you just want to be able to poll the spectrum, set the threshold to an impossible value so bonk won't volunteer output on its own.

4 Acknowledgement

This work was generously supported by the Intel Corporation. The techniques described here have evolved over many years of collaboration with researchers and artists, especially Mark Danks, A. Couturier Lippe, Philippe Manoury, Joel Settel, Vibeke Sorensen and Rand Steiger. The bonk and fiddle objects described here were developed as part of Puckette, Sorensen and Steiger's Lemma 1, shown at ICMC97, Thessaloniki, Greece, played by George Lewis and Steven Schick, and at UCSD by Vanessa Tomlinson and Michael Dessen.

References

[Brown 92] Brown, J.C., and Puckette, M.S., 1992. "An Efficient Algorithm for the Calculation of a Constant Q Transform." J. Acoust. Soc. Am. 92, 2698-2701.

[Noll 69] Noll, A. M., 1969. "Pitch determination of human speech by the harmonic product spectrum, the harmonic sum spectrum, and a maximum likelihood estimate." Proc. Symp. Computer Proc. in Comm., pp. 779-798.

[Puckette 96] Puckette, M., 1996. "Pure Data: another integrated computer music environment." Proc. the Second Intercollege Computer Music Concerts, Tachikawa, pp. 37-41. Reprinted as ftp://crca-ftp.ucsd.edu:/pub/msp/pd-kcm.ps

[Puckette 98] Puckette, M., and Brown, J., 1998. "Accuracy of frequency estimates from the phase vocoder." IEEE Transactions on Speech and Audio Processing 6/2, pp. 166-176.

[Rabiner 78] Rabiner, L.R., and Schafer, R.W., 1978. Digital Processing of Speech Signals. Englewood Cliffs, N.J.: Prentice-Hall.