Real-time Pitched/Unpitched Separation of Monophonic Timbre Components

Joseph A. Sarlo
Department of Music, CRCA, Cal-IT2, University of California, San Diego

Abstract

Described here is a new technique for the real-time separation of a monophonic sound source timbre into its pitched and unpitched components (e.g., violin string vibration vs. bow noise). This technique differs from preexisting noise extraction and reduction techniques in that it does not rely on a priori knowledge of the signal and does not consider the duration or stability of a component in determining whether that component is pitched. Instead, it analyzes sinusoidal components based on their harmonic relationship to an estimated fundamental frequency. This has several benefits, including independence of time in the determination of unpitched components and the ability to discern pitched components from unpitched components in situations where the unpitched components may be steady in the time-evolving frequency sense.

1 Introduction

We often consider the timbre of a sound to be a composite phenomenon, made up of several distinguishable components. As such, it is often desirable to separate a given sound source into some set of constituent components. The components can be categorized in various ways, often based on perceptual differences. One such perceptual category is that of pitched versus unpitched. We can, for example, consider the sound produced by a violin to be the composite of the tones produced by the string vibration (pitched) and the noise produced by the bow (unpitched). We describe here a method for separating a monophonic timbre into these two distinct components.

Many noise reduction techniques for audio applications have been developed. Some rely on the ability to estimate the noise by sampling the signal during "noise-only" portions (Boll 1979).
This is clearly an inappropriate technique for extracting the unpitched portion of a monophonic musical instrument timbre, since the pitched and unpitched portions are nearly always simultaneous. Others define noise to be the components of a sound that are rapidly changing (Hirsch 1993) and rely on this idea in their determination of which portions of a sound are noise. This, however, requires time to elapse between analyses in order to determine whether a component is steady or not. Also, what is often perceived as unpitched in a timbre is not dependent on steadiness or duration, but on the harmonic relationship that the noise component has with the perceived pitch. For example, in the case where a single piano note is played with the sustain pedal depressed, the nonsympathetic vibrations of the piano strings not struck by the key hammer will not change rapidly over time, yet can be considered unpitched in reference to the perceived pitch.

The technique presented here uses a sinusoidal analysis of the sound source and an estimated fundamental frequency to separate the unpitched portion of a monophonic timbre from the pitched portion. Thus, it has the ability to discern unpitched timbre components from pitched timbre components in cases where the unpitched components are steady in the time-evolving frequency sense. The methods of analysis, separation, and re-synthesis will be presented, followed by a brief discussion of possible applications.

2 Analysis, Separation, Re-synthesis

The technique presented here essentially consists of performing both a sinusoidal deconstruction and a pitch estimation of the source sound. The individual sinusoidal components are then considered to be pitched or unpitched based on their harmonic relationship to the estimated pitch. The amplitudes of the sinusoidal components are then adjusted accordingly and a time-domain signal is re-synthesized.
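
As a rough illustration of the separation step outlined above, the following Python sketch classifies a handful of component frequencies against an estimated pitch, using the harmonic-distance measure defined in Section 2.2, and assigns a gain to each. The function names, the example frequencies, and the tolerance of 0.1 are illustrative assumptions rather than values from this paper, and the analysis and re-synthesis stages are omitted.

```python
import math


def fractional_partial_index(f, fundamental):
    """Ratio of a component frequency to the pitch, folded so it is >= 1."""
    if f <= 0 or fundamental <= 0:
        raise ValueError("frequencies must be positive")
    return f / fundamental if f >= fundamental else fundamental / f


def harmonic_distance(f, fundamental):
    """Distance from the fractional partial index to the nearest integer."""
    p = fractional_partial_index(f, fundamental)
    return abs(math.floor(p + 0.5) - p)


def component_gains(freqs, fundamental, eps=0.1, gamma=0.0):
    """Gain per component: gamma if pitched (distance <= eps), else 1."""
    return [
        gamma if harmonic_distance(f, fundamental) <= eps else 1.0
        for f in freqs
    ]


if __name__ == "__main__":
    # Hypothetical component frequencies (Hz) for a tone with a 220 Hz pitch.
    freqs = [110.0, 220.0, 441.0, 575.0, 660.0, 935.0]
    print(component_gains(freqs, fundamental=220.0))
```

With a gain factor of 0 for pitched components, only the components at 575 Hz and 935 Hz keep a gain of 1 in this example, since they are the only ones that lie far from a harmonic or sub-harmonic of 220 Hz.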

2.1 Analysis

The analysis is accomplished via the phase vocoder with single-sample hop proposed by Puckette and Brown (1998), in which the frequencies of the sinusoidal components are estimated from a rectangular-windowed discrete Fourier transform (DFT). This technique is used for its computational speed, its frequency estimation accuracy, and its ability to estimate frequency given a single block of audio data.

Proceedings ICMC 2004
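
To make the idea of analyzing a single block of audio concrete, here is a minimal Python sketch that computes a rectangular-windowed DFT of one block and reports the centre frequency of its strongest bin. It is only a crude stand-in: it does not implement the Puckette and Brown frequency estimator, it indexes bins from 0 to N-1, and the function names, block size, sample rate, and test tone are assumptions chosen for illustration.

```python
import cmath
import math


def block_dft(x, start, n_block):
    """Rectangular-windowed DFT of the block x[start : start + n_block]."""
    return [
        sum(
            x[start + n] * cmath.exp(-2j * math.pi * k * n / n_block)
            for n in range(n_block)
        )
        for k in range(n_block)
    ]


def peak_bin_frequency(spectrum, sample_rate):
    """Centre frequency (Hz) of the strongest bin between DC and Nyquist."""
    n_block = len(spectrum)
    k_peak = max(range(1, n_block // 2), key=lambda k: abs(spectrum[k]))
    return k_peak * sample_rate / n_block


if __name__ == "__main__":
    sample_rate = 8000.0
    n_block = 256
    # Hypothetical test tone at 500 Hz, which falls exactly on bin 16 here.
    x = [math.sin(2 * math.pi * 500.0 * n / sample_rate) for n in range(n_block)]
    print(peak_bin_frequency(block_dft(x, 0, n_block), sample_rate))
```

The bin-centre frequency is only accurate to within half a bin width, which is why a finer per-bin frequency estimate matters when components are later compared against harmonics of the pitch.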

Consider the time series input signal $x[n]$ sampled at a rate of $s$ Hz. Analysis is done on sample blocks of length $N$, where $N$ is a power of two, typically between 512 and 4096. We have that

$$X_\tau[k] = \sum_{n=0}^{N-1} x\!\left[n + \frac{\tau N}{h}\right] e^{-2\pi j k n / N}$$

for $\tau = (0, 1, 2, \ldots)$ and $k = \left(-\frac{N}{2}, \ldots, \frac{N}{2} - 1\right)$.

Here, $\tau$ is the sample block index, $k$ is the DFT bin index, and $h$ is a power-of-two hop factor between 2 and $N$, typically 4 or 8. We note that $h$ is only needed for re-synthesis purposes and is not a necessary part of the analysis. We can then estimate the frequency of the $k$th sinusoidal component of the $\tau$th sample block as

$$f_{\tau,k} = s\left(\frac{k}{N} - \operatorname{Im}\!\left[\frac{X_\tau[k+1] - X_\tau[k-1]}{N\left(2X_\tau[k] - X_\tau[k+1] - X_\tau[k-1]\right)}\right]\right)$$

We also estimate the fundamental frequency of the $\tau$th sample block using some pitch estimation technique, such as finding the $f_{\tau,k}$ for which $|X_\tau[k]|$ is maximum with respect to $k$ for a fixed $\tau$, or using a more sophisticated technique (Puckette, Apel, and Zicarelli 1998). In practice, the method of pitch estimation used is dependent on the sinusoidal structure of the input sound source. In any case, we define $F_\tau$ to be the estimated pitch, or fundamental frequency, for the $\tau$th sample block.

2.2 Separation

To determine if a component is pitched or unpitched, we compare each sinusoidal component of frequency $f_{\tau,k}$ to the estimated fundamental frequency $F_\tau$. We define the fractional partial index of the $k$th sinusoidal component of the $\tau$th sample block to be

$$p_{\tau,k} = \begin{cases} \dfrac{f_{\tau,k}}{F_\tau} & \text{for } f_{\tau,k} \ge F_\tau \\[1ex] \dfrac{F_\tau}{f_{\tau,k}} & \text{for } f_{\tau,k} < F_\tau \end{cases}$$

We then calculate the distance of the fractional partial index to the nearest whole partial index as

$$d_{\tau,k} = \left|\,\lfloor p_{\tau,k} + 0.5 \rfloor - p_{\tau,k}\,\right|$$

When $f_{\tau,k}$ is a perfect harmonic or sub-harmonic (integer multiple or factor) of $F_\tau$, we have that $d_{\tau,k} = 0$. Otherwise we have that $0 < d_{\tau,k} \le 0.5$. We could, therefore, consider a sinusoidal component of frequency $f_{\tau,k}$ to be unpitched when $d_{\tau,k} > 0$ and pitched otherwise. In practice, however, inaccuracies in analysis and other physical phenomena make it desirable to define some error bound $\varepsilon$, where $0 < \varepsilon < 0.5$. We decide that the $k$th sinusoidal component of the $\tau$th sample block is unpitched when $d_{\tau,k} > \varepsilon$ and pitched when $d_{\tau,k} \le \varepsilon$. In practice, useful values of $\varepsilon$ are dependent on the input sound source.

2.3 Re-synthesis

Using the above method to determine if a sinusoidal component is pitched or unpitched, we alter the amplitude of specific components and then apply phase vocoder analysis/re-synthesis. Specifically, let $\gamma$ be some gain factor, and let $g_{\tau,k}$ be the gain applied to the $k$th sinusoidal component of the $\tau$th sample block. We can apply this gain to all of the pitched components by choosing $g_{\tau,k}$ as

$$g_{\tau,k} = \begin{cases} \gamma & \text{for } d_{\tau,k} \le \varepsilon \\ 1 & \text{for } d_{\tau,k} > \varepsilon \end{cases}$$

Finally, to return to the time domain, we perform a phase vocoder analysis/re-synthesis utilizing $g_{\tau,k}$ as

$$X'_\tau[k] = g_{\tau,k} \sum_{n=0}^{N-1} w[n]\, x\!\left[n + \frac{\tau N}{h}\right] e^{-2\pi j k n / N}$$

$$s_\tau[n] = \begin{cases} \displaystyle\sum_{k} X'_\tau[k]\, e^{2\pi j k n / N} & \text{for } 0 \le n < N \\ 0 & \text{for } n < 0,\ n \ge N \end{cases}$$

$$x'[n] = \sum_{\tau} s_\tau\!\left[n - \frac{\tau N}{h}\right]$$

Here, $w[n]$ is a suitable window function, usually a Hanning window. By setting $\gamma = 0$ and choosing some suitable value for $\varepsilon$, we obtain $x'[n]$ that is a representation of only the unpitched components of $x[n]$. We can obtain a representation of the pitched components of $x[n]$ by a similar definition of $g_{\tau,k}$. We can also choose $\gamma \neq 0$ to obtain
a signal that is some combination of pitched and unpitched components, or alter $\gamma$ for various values of $\tau$ to obtain a time-evolving combination signal.

3 Applications

Several applications of the technique discussed are evident. For example, for many timbres the unpitched components are much more prominent during the attack portion of the amplitude envelope than during the remainder of the envelope. One could, therefore, utilize the technique presented here to aid in onset detection. From an electro-acoustic compositional standpoint, applications abound, including phasing and spatialization effects, improvements to techniques such as time-stretching and pitch-shifting, and possibilities for convincing timbral morphing.

References

Boll, S. F. (1979). "Suppression of Acoustic Noise in Speech using Spectral Subtraction." IEEE Transactions on Acoustics, Speech, and Signal Processing 27, pp. 113-120.

Hirsch, H. G. (1993). "Estimation of noise spectrum, and its application to SNR-estimation and speech enhancement." ICSI Technical Report TR-93-012, Intl. Comp. Science Institute, Berkeley, CA.

Puckette, M. S., T. Apel, and D. Zicarelli. (1998). "Real-Time Audio Analysis Tools for Pd and MSP." In Proceedings of the International Computer Music Conference, pp. 109-112. Ann Arbor: International Computer Music Association.

Puckette, M. S., and J. Brown. (1998). "Accuracy of Frequency Estimates From the Phase Vocoder." IEEE Transactions on Speech and Audio Processing 6/2, pp. 166-176.