Figure 3: Overview of Maximum Likelihood
gives erratic answers when the input signal is halfway between two cases in the reference matrix; therefore, ML works
well if the input source is in a fixed tuning. Keyboard and
woodwind instruments are more appropriate for ML than strings
or voice, since the latter can easily produce non-discrete pitches, particularly in vibrato.
Specifying a limited range of possible pitches can improve the efficiency of ML by lowering the number of test
signals in the reference matrix. If the true pitch is outside of
the test range by an octave, then ML will detect the correct
pitch class in the adjacent reference range. At two octaves
outside the ML range, the detection becomes less accurate in
determining the pitch class. ML is less tolerant of noise and
weak signals than the HPS method.
2.3 Cepstrum-Biased HPS
Possibly the most popular pitch tracking method in speech
analysis is the cepstral analysis technique. First, the Cepstrum is calculated by taking the DFT of the log of the magnitude spectrum of a speech frame. Then, the Cepstrum is
inspected for a peak in a limited range, corresponding to the
period of the signal, as well as for a second or third peak an
equal distance (period) away from the first or second peak.
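The cepstral peak search described above can be sketched as follows. This is a minimal illustration, not the implementation used by the authors: the function names, the `eps` guard against log(0), and the naive O(N^2) DFT (chosen only to keep the sketch dependency-free; an FFT would be used in practice) are all assumptions.

```python
import cmath
import math

def dft(values):
    """Naive O(N^2) DFT, used here only to avoid external dependencies."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * i / n) for i, v in enumerate(values))
        for k in range(n)
    ]

def cepstrum(frame, eps=1e-12):
    """Magnitude of the DFT of the log magnitude spectrum of one frame.

    eps is a hypothetical guard that keeps log() finite for empty bins.
    """
    log_magnitude = [math.log(abs(c) + eps) for c in dft(frame)]
    return [abs(c) for c in dft(log_magnitude)]

def cepstral_period(frame, min_lag, max_lag):
    """Lag (in samples) of the largest cepstral value within [min_lag, max_lag]."""
    ceps = cepstrum(frame)
    return max(range(min_lag, max_lag + 1), key=lambda lag: ceps[lag])

# Usage: a pulse train with a 32-sample period, searched over lags 20..50.
frame = [1.0 if i % 32 == 0 else 0.0 for i in range(256)]
lag = cepstral_period(frame, 20, 50)
```

Restricting the search to `[min_lag, max_lag]` corresponds to the "limited range" mentioned in the text; the check for second and third peaks at equal spacing is left out of this sketch.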
An improved algorithm may be created by combining the
Cepstrum with the HPS function (Master 2000). This technique has been used to initialize guesses in a polyphonic pitch
detection system. Separately, the Cepstrum and HPS achieve
modest success for this task. Combined, however, they yield fairly
reliable results, at least for the two-pitch speech case.
Since these functions exist in different domains, the first
step in combining them is to convert the Cepstrum to be
frequency domain indexed. To do this, Cepstrum values at
Index_cep = k are written to the Frequency Indexed Cepstrum
(or FIC) at indices Index_FIC = floor(N/k), where N is
the number of points in the DFT and "floor" is a function
returning the greatest integer less than or equal to its argument. The
new index represents the frequency bin value corresponding
to the time value in the Cepstrum. Thus, a peak tending to
indicate a period at value k in the Cepstrum now tends to
indicate a pitch at the corresponding frequency bin, floor(N/k).
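The index remapping can be illustrated with a short sketch. The text does not say what to do when several lags map to the same bin or when a bin receives no value, so keeping the largest value and leaving untouched bins at zero are assumptions made here, as is the function name.

```python
def frequency_indexed_cepstrum(ceps, n_dft):
    """Remap cepstrum values from lag index k to frequency bin floor(N / k)."""
    fic = [0.0] * (n_dft // 2 + 1)           # bins 0 .. N/2
    # Only the first half of the lags is used, since the cepstrum
    # magnitude of a real frame is symmetric.
    max_lag = min(len(ceps) - 1, n_dft // 2)
    for k in range(1, max_lag + 1):
        bin_index = n_dft // k               # floor(N / k)
        if bin_index < len(fic):
            # Assumption: on collisions keep the largest cepstral value.
            fic[bin_index] = max(fic[bin_index], ceps[k])
    return fic

# Usage: a cepstral peak at lag 32 with N = 256 lands in bin floor(256 / 32).
ceps = [0.0] * 256
ceps[32] = 5.0
ceps[64] = 3.0
fic = frequency_indexed_cepstrum(ceps, 256)
```

Because floor(N/k) changes slowly for large k, several neighboring lags share one low-frequency bin, while high-frequency bins are sparsely filled.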
The FIC function and the HPS are then multiplied to create a new function, the Cepstrum-Biased HPS (CBHPS). In
the CBHPS, spurious peaks in the pitch-doubling-robust FIC
tend to be canceled out by corresponding low values in the
pitch-halving-robust HPS function, and vice versa. Thus, the
peaks seen in the new function have generally been very reliable for the two-voice speech case. In that case, peaks were
chosen as pitch indicators if their frequencies were not multiples of a lower-frequency peak, and if their magnitude was
at least 15% of the largest peak's magnitude. CBHPS is well suited
to multi-pitch detection, since it robustly handles noise and pitch errors.
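The multiplication and the peak-selection rule might be sketched as below. The HPS values are taken as given input. The `tolerance` parameter (how close a bin must be to an integer multiple to count as one) and the choice to compare only against already accepted lower peaks are assumptions; the text states only the two selection criteria.

```python
def cbhps(fic, hps):
    """Element-wise product of the FIC and HPS functions."""
    return [c * h for c, h in zip(fic, hps)]

def is_near_multiple(index, base, tolerance):
    """True if index lies within `tolerance` bins of 2*base, 3*base, ..."""
    ratio = round(index / base)
    return ratio >= 2 and abs(index - ratio * base) <= tolerance

def select_pitch_bins(values, rel_threshold=0.15, tolerance=1):
    """Pick local maxima that reach 15% of the largest peak and are not
    (near-)multiples of a lower-frequency accepted peak."""
    peaks = [
        i for i in range(1, len(values) - 1)
        if values[i] > values[i - 1] and values[i] >= values[i + 1]
    ]
    if not peaks:
        return []
    floor_level = rel_threshold * max(values[i] for i in peaks)
    chosen = []
    for i in peaks:                           # ascending frequency order
        if values[i] < floor_level:
            continue
        if any(is_near_multiple(i, p, tolerance) for p in chosen):
            continue
        chosen.append(i)
    return chosen

# Usage: peaks at bins 10 and 13, a harmonic at 20, and a weak peak at 33.
values = [0.0] * 64
values[10], values[13], values[20], values[33] = 1.0, 0.5, 0.8, 0.1
pitch_bins = select_pitch_bins(values)
```

In this example the peak at bin 20 is rejected as a multiple of bin 10, and the peak at bin 33 falls below the 15% threshold.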
2.4 Weighted Autocorrelation Function
One popular time domain technique is to pick peaks in the
autocorrelation function, or ACF. The ACF is computed from
the equation:
$$\phi(\tau) = \frac{1}{N} \sum_{n=0}^{N-1} x(n)\, x(n+\tau)$$
and measures the extent to which a signal correlates with a
time-offset ($\tau$) version of itself. Because a periodic signal will
correlate strongly with itself when offset by the fundamental
period, we can expect to find a peak in the ACF at the value
corresponding to a period.
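A direct sketch of ACF peak picking is given below. Samples beyond the end of the frame are treated as zero, so the sum runs only over the overlapping part; that boundary choice, the lag search range, and the function names are assumptions.

```python
import math

def acf(x, max_lag):
    """Autocorrelation for lags 0..max_lag, normalized by the frame length N."""
    n = len(x)
    return [
        sum(x[i] * x[i + lag] for i in range(n - lag)) / n
        for lag in range(max_lag + 1)
    ]

def acf_period(x, min_lag, max_lag):
    """Lag of the largest ACF value within [min_lag, max_lag]."""
    values = acf(x, max_lag)
    return max(range(min_lag, max_lag + 1), key=lambda lag: values[lag])

# Usage: a sinusoid with a 25-sample period, searched over lags 10..40.
signal = [math.sin(2 * math.pi * n / 25) for n in range(200)]
period = acf_period(signal, 10, 40)
```

The lower bound on the lag keeps the search away from the trivial maximum at lag 0.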
An alternative to the ACF is the Average Magnitude Difference Function (AMDF). The AMDF looks not at the product of a
signal with a time offset version of itself, but rather at the
difference. Thus, the AMDF tends to have valleys where the
ACF has peaks. Calculation of the AMDF is less computationally expensive than the ACF due to the lack of multiplication operations (Niesler and Robinson ).
The equation for the AMDF is given as:
$$\psi(\tau) = \frac{1}{N} \sum_{n=0}^{N-1} \left| x(n) - x(n+\tau) \right|$$
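The AMDF can be sketched in the same way, this time searching for a valley rather than a peak. Summing only over the overlapping samples of the frame is an assumption, as are the function names.

```python
import math

def amdf(x, max_lag):
    """Average magnitude difference for lags 0..max_lag, normalized by N."""
    n = len(x)
    return [
        sum(abs(x[i] - x[i + lag]) for i in range(n - lag)) / n
        for lag in range(max_lag + 1)
    ]

def amdf_period(x, min_lag, max_lag):
    """Lag of the deepest AMDF valley within [min_lag, max_lag]."""
    values = amdf(x, max_lag)
    return min(range(min_lag, max_lag + 1), key=lambda lag: values[lag])

# Usage: a sinusoid with a 25-sample period, searched over lags 10..40.
signal = [math.sin(2 * math.pi * n / 25) for n in range(200)]
period = amdf_period(signal, 10, 40)
```

Only subtractions and absolute values appear in the inner loop, which is the source of the lower computational cost noted above.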
As noted by Kobayashi and Shimamura (1995), these two
functions have independent statistics, and may be combined
to produce a more noise-robust estimate of pitch, especially
in cases where gross pitch error (more than 10 Hz error) is
possible. For purposes of single F0 detection, this method
was shown by those authors to be substantially more effective
in noisy environments than either the ACF or AMDF alone, or
the popular cepstral technique described above. The function
described by these authors is:
$$f(\tau) = \frac{\phi(\tau)}{\psi(\tau) + k}$$
where $\phi(\tau)$ is the ACF, $\psi(\tau)$ is the AMDF, and optimal results were achieved with k = 1 regardless of
SNR.
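A minimal sketch of the weighted function, dividing the ACF by the AMDF plus k at each lag, is shown below. The boundary handling (summing over overlapping samples only), the lag search range, and the function names are assumptions rather than details taken from Kobayashi and Shimamura.

```python
import math

def weighted_acf(x, max_lag, k=1.0):
    """ACF divided by (AMDF + k) for lags 0..max_lag."""
    n = len(x)
    result = []
    for lag in range(max_lag + 1):
        overlap = range(n - lag)
        phi = sum(x[i] * x[i + lag] for i in overlap) / n        # ACF term
        psi = sum(abs(x[i] - x[i + lag]) for i in overlap) / n   # AMDF term
        result.append(phi / (psi + k))
    return result

def weighted_acf_period(x, min_lag, max_lag, k=1.0):
    """Lag of the largest weighted-ACF value within [min_lag, max_lag]."""
    values = weighted_acf(x, max_lag, k)
    return max(range(min_lag, max_lag + 1), key=lambda lag: values[lag])

# Usage: a sinusoid with a 25-sample period, searched over lags 10..40.
signal = [math.sin(2 * math.pi * n / 25) for n in range(200)]
period = weighted_acf_period(signal, 10, 40)
```

At the true period the ACF peaks while the AMDF approaches zero, so the ratio is sharpened there; the constant k keeps the denominator away from zero.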