Efficient Pitch Detection Techniques for Interactive Music

de la Cuadra, Patricio; Master, Aaron; Sapp, Craig


[Figure 3: Overview of Maximum Likelihood]

ML gives erratic answers when the input signal is halfway between two cases in the reference matrix; therefore, ML works well if the input source is in a fixed tuning. Keyboard and woodwind instruments are more appropriate for ML than strings or voice, since the latter can easily produce nondiscrete pitches, particularly in vibrato. Specifying a limited range of possible pitches can improve the efficiency of ML by lowering the number of test signals in the reference matrix. If the true pitch is outside of the test range by an octave, then ML will detect the correct pitch class in the adjacent reference range. At two octaves outside the ML range, the detection becomes less accurate in determining the pitch class. ML is less tolerant of noise and weak signals than the HPS method.

2.3 Cepstrum-Biased HPS

Possibly the most popular pitch tracking method in speech analysis is the cepstral analysis technique. First, the Cepstrum is calculated by taking the DFT of the log of the magnitude spectrum of a speech frame. Then, the Cepstrum is inspected for a peak in a limited range, corresponding to the period of the signal, as well as for a second or third peak an equal distance (one period) away from the first or second peak.

An improved algorithm may be created by combining the Cepstrum with the HPS function (Master 2000). This technique has been used to initialize guesses in a polyphonic pitch detection system. Separately, the Cepstrum and HPS achieve modest success for this task. Combined, however, they yield fairly reliable results, at least for the two-pitch speech case. Since these functions exist in different domains, the first step in combining them is to convert the Cepstrum so that it is indexed by frequency.
To do this, Cepstrum values at $\mathrm{Index}_{cep} = k$ are written to the Frequency Indexed Cepstrum (FIC) at indices $\mathrm{Index}_{FIC} = \lfloor N/k \rfloor$, where N is the number of points in the DFT and the floor function returns the greatest integer less than or equal to its argument. The new index represents the frequency bin corresponding to the time value in the Cepstrum. Thus, a peak that indicates a period at index k in the Cepstrum now indicates a pitch at the corresponding frequency bin. The FIC function and the HPS are then multiplied to create a new function, the Cepstrum-Biased HPS (CBHPS). In the CBHPS, spurious peaks in the pitch-doubling-robust FIC tend to be canceled out by corresponding low values in the pitch-halving-robust HPS function, and vice versa. As a result, the peaks seen in the new function have generally been very reliable for the two-voice speech case. In that case, peaks were chosen as pitch indicators if their frequencies were not multiples of a lower-frequency peak and if their magnitude was at least 15% of the largest peak's magnitude. CBHPS is well suited to multi-pitch detection, since it robustly handles noise and pitch errors.

2.4 Weighted Autocorrelation Function

One popular time domain technique is to pick peaks in the autocorrelation function, or ACF. The ACF is defined by the equation:

$$\phi(\tau) = \frac{1}{N} \sum_{n=0}^{N-1} x(n)\, x(n+\tau)$$

and measures the extent to which a signal correlates with a version of itself offset in time by $\tau$. Because a periodic signal will correlate strongly with itself when offset by the fundamental period, we can expect to find a peak in the ACF at the lag corresponding to one period.

An alternative to the ACF is the Average Magnitude Difference Function (AMDF). The AMDF looks not at the product of a signal with a time-offset version of itself, but rather at the difference. Thus, the AMDF tends to have valleys where the ACF has peaks.
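As a rough illustration of ACF peak picking, the Python sketch below evaluates the autocorrelation sum directly and reports the frequency whose lag gives the largest value. The function names, the test signal, the search range, and the choice of N as the frame length minus the largest lag are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def acf(x, max_lag):
    """Autocorrelation (1/N) * sum_{n=0}^{N-1} x(n) x(n+tau) for tau = 0..max_lag.

    Assumption: N = len(x) - max_lag, so every lag sums the same number of terms.
    """
    x = np.asarray(x, dtype=float)
    n = len(x) - max_lag
    if n <= 0:
        raise ValueError("frame must be longer than max_lag")
    return np.array([np.dot(x[:n], x[tau:tau + n]) / n
                     for tau in range(max_lag + 1)])

def acf_pitch(x, sample_rate, f_min, f_max):
    """Estimate F0 in Hz from the largest ACF value within a lag search range."""
    lag_min = max(1, int(sample_rate / f_max))   # shortest period considered
    lag_max = int(np.ceil(sample_rate / f_min))  # longest period considered
    phi = acf(x, lag_max)
    best_lag = lag_min + int(np.argmax(phi[lag_min:lag_max + 1]))
    return sample_rate / best_lag

# Hypothetical test frame: a 200 Hz tone with one harmonic, sampled at 8 kHz.
sample_rate = 8000
t = np.arange(2048) / sample_rate
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)

print(acf_pitch(frame, sample_rate, f_min=120, f_max=500))
```

Because the ACF of a periodic signal also peaks at integer multiples of the period, the search range in this example is chosen so that it contains only one such multiple; a practical detector would need some rule for choosing among several comparable peaks.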
Calculation of the AMDF is less computationally expensive than the ACF because it requires no multiplication operations (Niesler and Robinson). The equation for the AMDF is given as:

$$\psi(\tau) = \frac{1}{N} \sum_{n=0}^{N-1} \left| x(n) - x(n+\tau) \right|$$

As noted by Kobayashi and Shimamura (1995), these two functions have independent statistics, and may be combined to produce a more noise-robust estimate of pitch, especially in cases where gross pitch error (an error of more than 10 Hz) is possible. For purposes of single F0 detection, this method was shown by those authors to be substantially more effective in noisy environments than either the ACF or AMDF alone, or the popular cepstral technique described above. The function described by these authors is:

$$f(\tau) = \frac{\phi(\tau)}{\psi(\tau) + k}$$

where optimal results were achieved with k = 1 regardless of SNR.
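The combination can be sketched in Python as follows: the ACF (called phi here) is divided by the AMDF (psi) plus the constant k, and the lag with the largest quotient is taken as the period. This is a minimal sketch under stated assumptions rather than the authors' implementation; the function names, the noisy test signal, the search range, and the choice of N as the frame length minus the largest lag are all hypothetical.

```python
import numpy as np

def acf(x, max_lag):
    """phi(tau): mean of x(n) * x(n+tau) over N = len(x) - max_lag samples."""
    n = len(x) - max_lag
    return np.array([np.dot(x[:n], x[tau:tau + n]) / n
                     for tau in range(max_lag + 1)])

def amdf(x, max_lag):
    """psi(tau): mean of |x(n) - x(n+tau)| over the same N samples."""
    n = len(x) - max_lag
    return np.array([np.mean(np.abs(x[:n] - x[tau:tau + n]))
                     for tau in range(max_lag + 1)])

def weighted_acf(x, max_lag, k=1.0):
    """f(tau) = phi(tau) / (psi(tau) + k); AMDF valleys boost ACF peaks."""
    x = np.asarray(x, dtype=float)
    if len(x) <= max_lag:
        raise ValueError("frame must be longer than max_lag")
    return acf(x, max_lag) / (amdf(x, max_lag) + k)

def weighted_acf_pitch(x, sample_rate, f_min, f_max, k=1.0):
    """Estimate F0 in Hz from the largest f(tau) within a lag search range."""
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = int(np.ceil(sample_rate / f_min))
    f = weighted_acf(x, lag_max, k)
    best_lag = lag_min + int(np.argmax(f[lag_min:lag_max + 1]))
    return sample_rate / best_lag

# Hypothetical test frames: a 200 Hz tone with one harmonic, clean and noisy.
sample_rate = 8000
t = np.arange(4096) / sample_rate
clean = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
rng = np.random.default_rng(0)                 # fixed seed for repeatability
noisy = clean + 0.3 * rng.standard_normal(len(t))

print(weighted_acf_pitch(clean, sample_rate, f_min=120, f_max=500))
print(weighted_acf_pitch(noisy, sample_rate, f_min=120, f_max=500))
```

The constant k keeps the denominator away from zero where the AMDF has a deep valley, which is exactly where a clean periodic signal would otherwise cause a division by a value near zero.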
