A Genetic Search Technique for Polyphonic Pitch Detection

Guillermo Garcia
Creative Advanced Technology Center, Scotts Valley, California.
Center for Computer Research in Music and Acoustics (CCRMA), Stanford University.
guille@atc.creative.com / guille@ccrma.stanford.edu

Abstract

We present a method for estimating multiple, simultaneous fundamental frequencies in a polyphonic signal spectrum. The method takes advantage of the power of genetic algorithms to explore a very large search space with numerous local optima, and searches for a globally optimal combination of fundamental frequencies that best explains the polyphonic signal. No a-priori assumption about the number of fundamental frequencies is made, and no upper bound on this number needs to be imposed. Simulations show very good performance on signals with up to six fundamental frequencies, and performance remains acceptable when detecting up to eight fundamental frequencies.

1 Introduction

Several approaches have been proposed for polyphonic pitch detection based upon different optimality criteria (Chafe & Jaffe 1986, Martin 1996, Macleod 1998, Walmsley, Godsill & Rayner 1999, De Cheveigné & Kawahara 1999, Klapuri 1998 & 2001, Goto 2001). This paper focuses on the search aspect of the optimization problem for polyphonic pitch detection. We present and evaluate a new search technique based on a genetic algorithm.

When no a-priori assumption is made about the number of simultaneous fundamental frequencies (f0s) present in a signal, the search space can be extremely large. It consists of all possible combinations of 1, 2, ..., N simultaneous f0s, with N unknown. For instance, for an f0 range from 50 Hz to 500 Hz and an f0 resolution of 1 Hz, the space of up to four simultaneous fundamental frequencies has about 4.0552 × 10^10 points. Additionally, the search space contains numerous local optima. These characteristics suggest the evaluation of a genetic search approach, by virtue of the inherent parallelism of genetic algorithms (Holland 1975, Goldberg 1989, Koza 1992) and their resulting efficiency on non-linear, global optimization problems in extremely large search spaces.

2 Algorithm Description

Our algorithm works by evolving a population of binary strings over successive generations, according to the principle of survival of the fittest. Each string codes a set of simultaneous f0 values. A crossover operator allows for intelligent selection of new points in order to efficiently explore the large search space. A mutation operator provides insurance against the accidental loss of good solutions. Any optimality criterion from the art of pitch estimation (e.g. Wise, Caprio & Parks 1976, Doval 1994, or references mentioned in the Introduction) can be used as an objective function to measure the "fitness" of a given string. For computational reasons, and since the focus of this paper is the search aspect of the optimization problem, we have chosen a simple criterion that we describe in section 2.8.

2.1 String Structure

Our genetic algorithm uses variable-length chromosome strings and a binary alphabet (Goldberg 1989). A chromosome string is structured as a concatenation of N substrings of L bits each. Each substring codes one f0 value using binary fixed-point representation. Since no assumption is made about the number of f0s in the signal, the number of substrings N is variable and unbounded (it will be automatically limited by the dynamics of natural selection). On the other hand, since the set of f0 candidates (i.e.
f0 range and resolution) is specified as an input parameter, the substring length L is fixed.

Figure 1. GA string structure with L = 4 bits: a chromosome is the concatenation of N f0 codes of L bits each.

The substring length is calculated from the range of f0 values that need to be coded, Δf0, and a specified f0 resolution df0, as the minimum integer L for which

2^L > Δf0 / df0
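As an illustration of this fixed-point coding, the following Python sketch shows one possible way to compute L and to convert between f0 values and L-bit substrings. The function names and the linear quantization over the candidate range are our own assumptions for illustration, not taken from the paper.

```python
import math

def substring_length(f0_range_hz: float, f0_resolution_hz: float) -> int:
    """Smallest integer L with 2**L > (f0 range / f0 resolution)."""
    n_candidates = f0_range_hz / f0_resolution_hz
    return math.floor(math.log2(n_candidates)) + 1

def encode_f0(f0, f0_min, df0, L):
    """Quantize an f0 value (Hz) to an L-bit substring (list of 0/1, MSB first)."""
    index = int(round((f0 - f0_min) / df0))
    return [(index >> b) & 1 for b in reversed(range(L))]

def decode_f0(bits, f0_min, df0):
    """Map an L-bit substring back to an f0 value in Hz."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return f0_min + index * df0

# Example: range 50-500 Hz, 1 Hz resolution -> L = 9 bits
L = substring_length(450.0, 1.0)
bits = encode_f0(173.0, 50.0, 1.0, L)
assert abs(decode_f0(bits, 50.0, 1.0) - 173.0) < 0.5
```

A whole chromosome is then simply the concatenation of several such L-bit substrings, one per f0 value it codes.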

2.2 Input Parameters

The input parameters for our genetic algorithm are:
1) f0 resolution df0 and range Δf0.
2) Population size M.
3) Maximum number of generations to evolve, G.
4) Crossover probability Pc per couple of string mates.
5) Probability of mutation Pm per string bit.
6) Maximum number of f0 codes (i.e. substrings) in the initial population.
7) Spectral bandwidth for fitness evaluation.

2.3 Initial Population

Strings in the initial population are assigned a random number of substrings, ranging between one and the maximum specified number. The f0 values coded by each substring are also generated randomly within the specified f0 range.

2.4 Stopping Criteria and Result Designation

The run is stopped when the algorithm reaches the specified maximum number of generations G. The string with the largest fitness in the last generation is designated as the optimal solution.

2.5 Reproduction Operator

Reproduction of strings according to natural selection is performed using the roulette-wheel procedure (Goldberg 1989). Each string present in the current population is assigned a roulette-wheel slot sized in proportion to its fitness f(s), where the index s identifies the string. The wheel is implemented as an array of partial cumulative fitnesses fc(s):

fc(s) = Σ_{j=1}^{s} f(j)

To select a string for reproduction into the new generation, we draw a uniformly distributed random number r between zero and the total cumulative fitness fc(M), and choose the minimum string index s for which fc(s) > r. In our method, this selection procedure is enriched with a feature called "elitism" (Goldberg 1989): after a new generation has been evolved, if the best string in the new generation is not as fit as the best one from the previous generation, the latter is reinserted in place of the worst string in the new generation. This strategy helps prevent accidental loss of highly fit strings.

2.6 Crossover Operator

Recombination by crossover is performed to generate new points in the search space by recombining information available in the current string population. Our crossover operator has two levels of recombination:
1) For the given set of f0 values existing in the current population, generate strings that contain new combinations of these f0 values. Innovation here happens at the f0-combination level.
2) Generate new values of f0 by recombining those existing in the current population. Innovation here happens at the f0-value level.

When evolving a generation, the reproduction procedure described in the previous section is used to select two string mates at a time. After selection of two mates, we draw a random number r such that 0 ≤ r ≤ 1 and compare it to the crossover probability Pc. If r < Pc, crossover is performed on the mates. We use a single-point crossover operation designed as follows. Suppose that the two strings selected for crossover contain N1 and N2 substrings, respectively; each substring codes one f0 value with L bits. We first draw two random integers n1 and n2 such that 0 ≤ n1 < N1 and 0 ≤ n2 < N2. These numbers determine which substring, in each of the selected strings, will contain the crossover point. Next, we draw a third random integer r, where 0 ≤ r < L. The crossover points (i.e. the bit positions) in each string are then determined as:

c1 = n1·L + r
c2 = n2·L + r

This crossover operator ensures that the length of the offspring strings is an integer multiple of L.
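To make the selection and crossover mechanics of sections 2.5 and 2.6 concrete, here is a minimal Python sketch, assuming chromosomes are stored as lists of bits whose length is a multiple of L. The function names and data layout are illustrative assumptions rather than the paper's implementation.

```python
import random

def roulette_select(population, fitness):
    """Roulette-wheel selection: pick a string with probability
    proportional to its fitness (section 2.5)."""
    total = sum(fitness)
    r = random.uniform(0.0, total)
    cumulative = 0.0
    for string, f in zip(population, fitness):
        cumulative += f
        if cumulative >= r:
            return string
    return population[-1]

def apply_elitism(new_pop, new_fit, best_string, best_fitness):
    """Re-insert the previous generation's best string in place of the
    worst one if it has been lost (the 'elitism' feature of section 2.5)."""
    if max(new_fit) < best_fitness:
        worst = new_fit.index(min(new_fit))
        new_pop[worst] = best_string[:]
        new_fit[worst] = best_fitness
    return new_pop, new_fit

def crossover(parent1, parent2, L, pc):
    """Single-point crossover whose cut point falls at the same bit
    offset r inside a randomly chosen substring of each parent, so
    offspring lengths stay integer multiples of L (section 2.6)."""
    if random.random() >= pc:
        return parent1[:], parent2[:]
    n1 = random.randrange(len(parent1) // L)   # substring index in parent 1
    n2 = random.randrange(len(parent2) // L)   # substring index in parent 2
    r = random.randrange(L)                    # common bit offset
    c1, c2 = n1 * L + r, n2 * L + r            # crossover points
    child1 = parent1[:c1] + parent2[c2:]
    child2 = parent2[:c2] + parent1[c1:]
    return child1, child2
```

Because both cut points share the same offset r within a substring, each child's length is (n1 + N2 − n2)·L or (n2 + N1 − n1)·L, i.e. always a whole number of f0 codes.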
It should be clear that by forcing r = 0 the crossover operation generates innovation at the f0-combination level only. In other words, the crossover point would always fall at the borders between substrings, and therefore no new f0 values would be generated.

2.7 Mutation Operator

After crossover, mutation may be applied to one bit of the offspring chromosomes, according to a probability of mutation specified as an input parameter. The mutation operator works as follows. For each string reproduced into the new generation, the probability P that the string does not undergo mutation is

P = (1 − Pm)^(N·L)

where Pm is the probability of mutation per bit and N·L is the string length. We then draw a random number r such that 0 ≤ r ≤ 1. If r > P, then the string undergoes mutation: a bit is randomly selected and set to its complement.
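A minimal sketch of this per-string mutation step follows, under the same bit-list representation assumed above; the helper name and exact control flow are our own.

```python
import random

def mutate(string, pm):
    """Flip at most one randomly chosen bit of the string. The string
    escapes mutation with probability P = (1 - pm) ** len(string),
    where pm is the per-bit mutation probability (section 2.7)."""
    p_no_mutation = (1.0 - pm) ** len(string)
    if random.random() > p_no_mutation:
        bit = random.randrange(len(string))
        string = string[:]               # keep the parent intact
        string[bit] = 1 - string[bit]    # set the selected bit to its complement
    return string
```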

2.8 Fitness Evaluation

Our measure of fitness f(s) for a given string s is based upon a correlation between the input spectrum and a comb spectrum (Martin 1981). Let us call f0_j the fundamental frequency value coded by substring j of string s. We compute a partial fitness value fp(s, j) as the correlation between the input magnitude spectrum |X(ω)| and a reference comb spectrum with exponentially decreasing amplitudes e^(−αh):

fp(s, j) = Σ_h |X(2π·h·f0_j)| · e^(−αh)

where h is the harmonic index and α is specified as an input parameter. In practice the Discrete Fourier Transform (DFT) is used, and the harmonic frequencies 2π·h·f0_j are rounded to the nearest DFT bin. The comb spectrum is limited to a given bandwidth, which is specified as an input parameter. The exponentially decaying comb envelope (Martin 1981) is an effective strategy to reduce period ambiguity (i.e. a signal of period T also has periods kT with k integer), which causes submultiples of the actual f0 to be detected.

After the partial fitness fp(s, j) has been computed for a substring j, the input DFT bins that were used in the correlation sum are zeroed for the remaining partial-fitness evaluations of the string, that is, for the correlation sums corresponding to substrings j+1, ..., Ns, where Ns is the number of substrings in string s. This is equivalent to constraining each spectral bin to belong to only one harmonic series. Strings that contain correct f0 values along with spurious multiples or submultiples are penalized by this strategy. All DFT bins are reset to their original values when evaluation of the next string (s+1) starts.

For each string s, a "raw fitness" value f_raw is then calculated as the sum of partial fitnesses over all its substrings j:

f_raw(s) = Σ_{j=1}^{Ns} fp(s, j)

The string fitness f(s) is then computed from the raw fitness as

f(s) = f_raw(s) − Ns · f̄p

where f̄p is the mean partial fitness over the whole population,

f̄p = Σ_{s,j} fp(s, j) / Σ_s Ns

and Ns is the number of f0 codes (i.e. substrings) in the string. This is a way to penalize strings with too many f0 codes, since it is equivalent to subtracting the average partial fitness from each partial fitness: f0 codes with partial fitness smaller than average contribute negatively and penalize the global fitness of the string. Also, strings with any f0 value outside the allowed range are assigned null fitness.

Finally, a fitness correction step is necessary to prevent premature convergence of the genetic algorithm. Premature convergence is caused by large differences in fitness between good and poor strings (Goldberg 1989). When good suboptimal strings dominate the natural selection process, population diversity and thus innovation can be dramatically reduced, preventing the generation of new search points and forcing convergence towards suboptimal solutions. We solve this problem by imposing a fitness floor value F_min = F_max / β, where F_max is the maximum fitness in the current generation and β is a positive input constant. Strings for which f(s) < F_min have their fitness reset to f(s) = F_min. This simple limiting strategy effectively prevents premature convergence.
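The sketch below illustrates the partial-fitness correlation, the bin-zeroing constraint, and the raw-fitness sum described in this section, assuming a precomputed magnitude spectrum on a DFT grid. Parameter names, the bin-rounding rule, and the overall structure are assumptions for illustration; the population-level penalty f(s) = f_raw(s) − Ns·f̄p and the fitness floor would be applied afterwards, once every string's raw fitness is known.

```python
import numpy as np

def partial_fitness(mag_spectrum, f0, alpha, bandwidth_hz, sample_rate, n_fft):
    """Correlate the magnitude spectrum with an exponentially decaying
    comb at harmonics of f0 (section 2.8). Bins that contribute are
    zeroed in place so they cannot be claimed by another f0 code of the
    same string."""
    fitness = 0.0
    h = 1
    while h * f0 <= bandwidth_hz:
        bin_idx = int(round(h * f0 * n_fft / sample_rate))   # nearest DFT bin
        if bin_idx >= len(mag_spectrum):
            break
        fitness += mag_spectrum[bin_idx] * np.exp(-alpha * h)
        mag_spectrum[bin_idx] = 0.0   # each bin belongs to one harmonic series only
        h += 1
    return fitness

def raw_fitness(mag_spectrum, f0_list, alpha, bandwidth_hz, sample_rate, n_fft):
    """Sum of partial fitnesses over the f0 codes of one string. A copy of
    the spectrum is used so the next string sees the original bins."""
    spectrum = mag_spectrum.copy()
    return sum(partial_fitness(spectrum, f0, alpha, bandwidth_hz,
                               sample_rate, n_fft)
               for f0 in f0_list)
```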
3 Simulations and Results

We have evaluated our method in simulations with synthetic test signals, consisting of a mixture of N harmonic (square-wave) signals plus Gaussian white noise at an SNR of 9 dB, with N variable and unknown to the genetic algorithm. Each harmonic signal had a fundamental frequency picked randomly within the allowed f0 range, which was 20 Hz to 519 Hz with an f0 resolution of 1 Hz. In these experiments the challenge for the genetic algorithm is twofold: it must find both the correct number of fundamental frequencies present in the signal and their correct values, in an extremely large search space with an unknown number of dimensions N.

First, we used a monophonic input signal and performed 30 runs of the algorithm using G = 20 generations and a population size of M = 100 strings. The algorithm consistently found the correct answer, that is, the best individual had only one fundamental frequency, which had the correct value.

Next, we used an input signal with two f0 values and a population size M = 200. The algorithm found the perfect answer in 9 out of 10 runs. The wrong answer actually contained the two right f0 values, plus one spurious value.

We then used an input signal with 4 f0 values and performed two runs over G = 15 generations with a population size of M = 500 strings. In one run the algorithm found an individual with all four correct f0 values, plus one spurious value. In the other run, it found a perfect answer.

For an input signal with 5 f0 values, three runs with M = 500 and G = 15 consistently yielded solutions with the five correct f0 values plus two spurious ones. Recall that, if only solutions of up to 5 f0 values were considered, the search space would already contain about 1.8086 × 10^13 points.

We then performed a series of one-run experiments for inputs containing respectively six, seven, eight, ten, and fifteen f0 values, with M = 500 and G = 15. For 6 f0 values, the algorithm found a solution with 6 correct values plus 2 spurious ones. With an input containing 7 f0 values, it found a solution containing 6 correct values and 2 false ones. With 8 f0 values, a solution with 7 correct values and 2 false ones was found.

With an input mixture of 10 f0 values, the solution yielded five correct values plus two false ones (thus only 7 f0s). When the population size was increased to M = 1000, eight correct f0 values were obtained, plus one false one. Finally, with an input of 15 f0 values, M = 1000 and G = 15, the algorithm found 6 correct f0 values and two false ones.

An important point to recall here is that, in the recombination process by crossover, the algorithm does not limit the number of f0 codes (substrings) in each string. The strings are allowed to contain from zero to an unlimited number of f0s, so there is a potential risk of unlimited string growth. However, string growth is moderated by the fitness penalization strategy described in section 2.8, which penalizes individuals containing more f0 codes than necessary. As the experimental results show, this penalization strategy worked well, and the optimal strings found by the algorithm contained a number of f0 values very close, or equal, to the real number of f0 values in the input signal.

4 Conclusions

The results from the experiments described above suggest that genetic search can be a useful tool in the context of polyphonic pitch detection. When estimating up to six simultaneous fundamental frequencies, the algorithm found the complete set of correct f0 values, plus possibly one or two spurious f0 values. For input mixtures of seven and eight simultaneous fundamental frequencies, the algorithm missed just one of them and found two spurious values. With input mixtures of 10 and 15 values, the performance of the algorithm was seriously degraded, finding only about 50% of the correct f0 values. This might be because the population size (M = 1000) was too small for such large search spaces. The computational cost of the current implementation (in Matlab) was an issue when using larger populations.

Further research includes: 1) developing a C++ implementation of the algorithm to perform more extensive evaluation, including tests on real-world signals using larger populations; 2) improving the convergence properties of the algorithm by using more sophisticated genetic operators and fitness correction strategies; 3) implementing an f0-tracking algorithm to post-process the results of individual frames, eliminating spurious detections; and 4) integrating more robust (although in general more expensive) optimality criteria for fitness evaluation.

References

Holland, 1975. "Adaptation in Natural and Artificial Systems". The University of Michigan Press.
Goldberg, 1989. "Genetic Algorithms in Search, Optimization, and Machine Learning". Addison-Wesley.
Koza, 1992. "Genetic Programming". The MIT Press.
Wise, Caprio & Parks, 1976. "Maximum Likelihood Pitch Estimation". IEEE Trans. ASSP, Vol. 24.
Martin, 1981. "Détection de f0 par intercorrélation avec une fonction peigne". JEP, Vol. 12.
Chafe & Jaffe, 1986. "Source Separation and Note Identification in Polyphonic Music". In Proc. IEEE ICASSP, Vol. 5.
Doval, 1994. "Estimation de la Fréquence Fondamentale des Signaux Sonores". Ph.D. Dissertation, Université Paris VI.
Martin, 1996. "A Blackboard System for Automatic Transcription of Simple Polyphonic Music". Tech. Rep. No. 385, MIT Media Lab Perceptual Computing Section.
Macleod, 1998. "Fast Nearly-ML Estimation of the Parameters of Real or Complex Single Tones or Resolved Multiple Tones". IEEE Trans. ASSP, Vol. 46.
Klapuri, 1998. "Number Theoretical Means of Resolving a Mixture of Several Harmonic Sounds". In Proc. EUSIPCO.
Walmsley, Godsill & Rayner, 1999, "Polyphonic Pitch Tracking Using Joint Bayesian Estimation of Multiple Frame Parameters", in Proc. IEEE WASPAA, New Paltz, New York. De Cheveign6 & Kawahara, 1999, "Multiple Period Estimation and Pitch Perception Model", Speech Communication 27, 175 -185. Klapuri, 2001, "Multipitch Estimation and Sound Separation by the Spectral Smoothness Principle", in Proc. IEEE ICASSP. Goto, 2001, "A Predominant-FO Estimation Method for CD Recordings: MAP Estimation Using EM Algorithm for Adaptive Tone Models", in Proc. IEEE ICASSP.