# Neural Network Pitch Tracking Over the Pitch Continuum


I. J. Taylor and M. Greenhough
Department of Physics and Astronomy, University of Wales College of Cardiff, PO Box 913, Cardiff, CF2 3YB, Wales, UK
Telephone: (+44) 1222 874000 Ext 6964; Fax: (+44) 1222 874056; email: ian@astro.cf.ac.uk

*ICMC Proceedings 1995, pages 432-435*

## Abstract

This paper describes an ANN pitch-determining system which can accurately determine pitch from a wide variety of musical instruments across the pitch continuum. The system incorporates an ARTMAP supervised, self-organising ANN consisting of two "SART" networks. The network is initially trained on "semitone-bin distributions" (created by a neural-like input mapping from an FFT to semitone bins) for notes in concert pitch. At this stage it is able to classify pitch to within a semitone. Further spectral analysis then allows an estimate of the pitch with a typical accuracy of about 10 cents.

## 1. Introduction

Pitch-tracking systems are much sought after in the musical world: for automated score-printing of unnotated live or recorded music, for ethnomusicological studies, and for computer interaction with human musicians in live performance. Ideally, a pitch-tracking system which accurately simulates the way humans perceive pitch is needed. To this end, most researchers have incorporated into their systems ideas relating to certain pitch-perception theories (e.g. [TSS82] and [DWS82]). Quick and easy computational algorithms which perform in a similar way to humans have, however, proved elusive. Artificial Neural Networks (ANNs) offer an attractive alternative approach, improving on the current pitch-extracting schemes on which certain pitch-perception theories and pitch-determination algorithms are based, e.g. summing the subharmonics ([Ter74] and [Her88]) or matching an ideal harmonic template against the input spectrum to find the closest fit [Gol73].
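To make the subharmonic-summation idea concrete, the sketch below scores candidate fundamentals by summing attenuated spectral amplitude found near their harmonics. It is a minimal illustration only: the function name, tolerance, harmonic count and weighting `h = 0.84` are assumptions, not the implementation of [Ter74] or [Her88].

```python
import numpy as np

def subharmonic_summation(freqs, amps, candidates, h=0.84, n_harm=8, tol=0.03):
    """Score each candidate fundamental by summing the (attenuated) amplitude
    of the spectral peak found near each of its first n_harm harmonics, and
    return the best-scoring candidate. Parameters are illustrative."""
    scores = []
    for f0 in candidates:
        s = 0.0
        for n in range(1, n_harm + 1):
            target = n * f0
            i = np.argmin(np.abs(freqs - target))     # nearest spectral peak
            if abs(freqs[i] - target) < tol * target:  # close enough to count
                s += (h ** (n - 1)) * amps[i]
        scores.append(s)
    return candidates[int(np.argmax(scores))]

# Toy harmonic spectrum with a missing fundamental at 220 Hz:
freqs = np.array([440.0, 660.0, 880.0, 1100.0])
amps  = np.array([1.0, 0.8, 0.6, 0.4])
cands = np.array([110.0, 220.0, 440.0])
print(subharmonic_summation(freqs, amps, cands))  # -> 220.0
```

Note that the missing-fundamental case is handled naturally, since the 220 Hz candidate accumulates evidence from all four components while 440 Hz matches only two of them.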
Although such methods work well in general, there are instruments which produce a much depleted or inharmonic set of spectral components. Such spectral patterns may well confuse systems which use simple comparisons to determine pitch. Although these algorithms can be extended to cater for a greater variety of instruments, doing so involves further pitch analysis and re-coding of computer implementations. Our method instead trains an ARTMAP ANN [CGR91a] with a wide variety of spectrally different patterns, so that the information relevant to the pitch-determining process can be extracted. This approach has been shown to produce a more robust pitch-determining system capable of handling many more spectrally diverse cases. For example, [TG94a] showed that subharmonic summation made 1.3 times more mistakes for absolute pitch, and 4.5 times more for pitch chroma, than an ARTMAP network consisting of a SART network as the ARTa module and an ART 1 network as the ARTb module.

This paper describes an ANN pitch-determining system called SPeCS (SMARTMAP Pitch Classifying System) which can accurately determine pitch from a wide variety of musical instruments across the pitch continuum. SPeCS incorporates two unsupervised self-organising ANNs, called SART, within the supervised ARTMAP infrastructure to produce SMARTMAP (SART Modified ARTMAP).

## 2. SPeCS Outline

SPeCS is outlined in figure 1. Briefly, the CD-quality signal is low-pass filtered, re-sampled, adapted, fast Fourier transformed, interpolated and transformed into a semitone distribution before it is presented to the SMARTMAP network. After learning, the neural network can classify pitch to within a semitone; further examination of the spectrum then reveals the pitch more precisely. The next six subsections describe each of these operations in detail.
### (i) Sampling, Low-Pass Filtering and Re-Sampling

The sounds were sampled at CD quality (44.1 kHz) and stored in sound files on a Sun SPARC 5 workstation. Although such a high sampling rate is not needed for determining pitch (only frequencies up to 4-5 kHz are required), we chose this rate for three reasons. Firstly, our long-term intention is to make the system as general as possible, so that it can work on a variety of computer platforms and ADC interfaces, some of which may have limited sampling-rate choices. Secondly, choosing a lower sampling rate would produce aliasing if there were no external low-pass filtering. Lastly, we intend to build up a large database of musical sounds for analysis, and high fidelity could therefore be important in some cases, e.g. the analysis of timbre.

The eventual goal is for this system to perform real-time pitch classification, so at least 20 classifications per second are needed to handle, for example, fast trills. Since a recursive time-domain low-pass filter was used, it was convenient to process a continuous time series, and so 2048 samples (the nearest power of 2 to 44100/20) were processed at a time. This allows about 21.5 classifications per second.

Figure 1: Outline of SPeCS. The input sound is sent through several stages of pre-processing before being presented to the SMARTMAP neural network.

A 7th-order time-domain Butterworth filter with a cut-off frequency of 4 kHz was used (designed with Tony Fisher's interactive Filter Design Program at York, UK, available through the World Wide Web). This gave a sufficiently good roll-off that, after re-sampling to 11025 Hz (by taking every 4th sample), a reasonably clean signal of 512 samples containing frequencies up to 5.5 kHz remained.

### (ii) Adaptive Line Enhancing (ALE)

This technique [SD88] uses an LMS algorithm to update a set of coefficients (or weights) connected to previous points in the time-domain signal; the effect is to reduce the noise in the signal. For example, figure 2a shows the Fourier spectrum of a sine wave saturated by Gaussian noise. The frequency of 2713.184 Hz was chosen to be exactly divisible by the frequency resolution of the spectrum (21.533 Hz); this minimises the effect of the frequency spilling over onto adjacent frequency points. Figure 2b shows the same signal after processing by the ALE algorithm. ALE is an optional part of SPeCS, used only when the signal-to-noise ratio is poor.

Figure 2: (a) Fourier spectrum of a 2713 Hz tone with Gaussian noise (of twice the signal amplitude) added, and (b) the same after processing with the ALE algorithm.

### (iii) Spectral Processing

A 512-point fast Fourier transform was then applied to the signal. Each bin in the Fourier spectrum therefore had a frequency resolution of 21.533 Hz.
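The ALE stage of (ii) can be realised as a delay-and-predict adaptive filter. The sketch below uses a normalised-LMS update; the tap count, delay and step size are illustrative assumptions, not the parameters of [SD88]:

```python
import numpy as np

def ale(x, n_taps=32, delay=1, mu=0.2):
    """Adaptive line enhancer: an FIR filter, adapted by normalised LMS,
    predicts x[n] from samples delayed by `delay`. Sinusoidal content is
    predictable across the delay and survives in the output y; broadband
    noise is not, so y is a noise-reduced version of the signal."""
    w = np.zeros(n_taps)
    y = np.zeros_like(x)
    for n in range(n_taps + delay, len(x)):
        u = x[n - delay - n_taps + 1 : n - delay + 1][::-1]  # delayed taps
        y[n] = w @ u
        e = x[n] - y[n]                    # prediction error
        w += mu * e * u / (u @ u + 1e-8)   # NLMS weight update
    return y

rng = np.random.default_rng(0)
t = np.arange(4096) / 11025
clean = np.sin(2 * np.pi * 1000 * t)
noisy = clean + 2.0 * rng.standard_normal(t.size)  # noise at twice signal amplitude
enhanced = ale(noisy)
```

Because white noise is unpredictable one sample ahead while a tone is not, the predictor output keeps the tone and sheds much of the noise, which is the effect illustrated in figure 2.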
A process of interpolation then pin-points the peak frequencies more accurately. Several interpolation tables involving amplitude ratios and relative shifts were set up, catering for the whole frequency range. The ratios are calculated from the relative amplitudes of the spectral points neighbouring the peak, e.g.

r = (S_{j+1} - S_{j-1}) / (S_j - S_{j+1})    (1)

where S_j is the peak amplitude. Peaks are identified as those whose amplitude is significantly greater than the mean of 11 randomly chosen spectral points. Testing the method with 4000 sine waves of random frequency and phase indicated a typical accuracy of 0.4 Hz.
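The paper's ratio tables are not reproduced here, but the same refinement can be sketched with standard parabolic interpolation, which fits a parabola through the peak bin and its two neighbours and takes the vertex. This is an illustrative substitute for the authors' table-based method, not a reconstruction of it:

```python
import numpy as np

def interpolated_peak(spectrum, j, bin_hz=11025 / 512):
    """Refine the frequency of a spectral peak at bin j by fitting a parabola
    through the peak bin and its two neighbours; the vertex offset `delta`
    lies in (-0.5, 0.5) bins."""
    s_m, s_0, s_p = spectrum[j - 1], spectrum[j], spectrum[j + 1]
    delta = 0.5 * (s_m - s_p) / (s_m - 2 * s_0 + s_p)
    return (j + delta) * bin_hz

fs, n = 11025, 512
f_true = 2713.0
t = np.arange(n) / fs
x = np.sin(2 * np.pi * f_true * t) * np.hanning(n)  # window reduces leakage
spec = np.abs(np.fft.rfft(x))
j = int(np.argmax(spec))
print(interpolated_peak(spec, j))  # close to 2713 Hz
```

Without the interpolation step the estimate could be off by up to half a bin (about 10.8 Hz); with it, the error drops well below the bin spacing.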

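The semitone bins used below need centre frequencies. With concert pitch (A4 = 440 Hz) and equal temperament, the standard relations are F_s = 440 * 2^((s - 69) / 12) for semitone s (MIDI numbering is an assumption of this sketch, not stated in the paper) and 1200 * log2(f / F_s) for the deviation in cents:

```python
import numpy as np

A4_HZ, A4_MIDI = 440.0, 69  # concert pitch, MIDI note number of A4

def semitone_index(f):
    """Nearest equal-tempered semitone (MIDI number) for frequency f."""
    return int(np.round(12 * np.log2(f / A4_HZ))) + A4_MIDI

def semitone_freq(s):
    """Centre frequency of semitone s (MIDI number)."""
    return A4_HZ * 2 ** ((s - A4_MIDI) / 12)

def cents_off(f, s):
    """Deviation of frequency f from semitone s, in cents."""
    return 1200 * np.log2(f / semitone_freq(s))

print(semitone_index(130.8))  # C3 -> 48
print(semitone_freq(48))      # about 130.81 Hz
```

The cents measure is the scale on which the final 10-cent accuracy figure of section (vi) is stated.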
### (iv) Fourier-to-Semitone-Bin Mapping Scheme

Groups of frequency components lying within bandwidths of a semitone are mapped to individual "semitone bins" of a representative intensity. Thus the Fourier spectrum is transformed into a distribution of intensity over the semitones of the chromatic scale. The mapping scheme uses a chef-hat function which identifies the frequency with the largest amplitude within a bandwidth of a semitone around the semitone's centre frequency:

A_s = A_f if ½(F_{s-1} + F_s) ≤ f < ½(F_s + F_{s+1}); A_s = 0 otherwise    (2)

where F_s is the frequency of semitone s, f is the actual frequency under consideration, A_f its amplitude, and A_s the amplitude at semitone bin s. Figure 3 shows how the spectrum of a C3 note played on a cello is transformed from an amplitude spectrum to an interpolated spectrum, and then to a semitone distribution.

Figure 3: (a) 256-point spectrum with a frequency resolution of 21.533 Hz, (b) the same with peak positions more accurately determined by interpolation, and (c) the resulting semitone distribution.

### (v) SMARTMAP

ART 2-A [CGR91b] is an algorithm which simulates the main properties of the ART 2 ANN. In fast-learn mode it accurately duplicates the behaviour of ART 2, and the authors also indicate schemes for simulating ART 2 in slow-learn mode (called intermediate learning in ART 2-A). ART 2-A runs two to three orders of magnitude faster than ART 2. SART [TG94b] is a modification of the ART 2-A algorithm which speeds up ART 2-A in the intermediate-learning case. By attaching to each output node a separate learning rate, which is decreased according to the amount of learning
the particular node has had, the training-presentation time is significantly reduced, and thus a speed-up of up to two orders of magnitude can be achieved. The SART system has essentially the same architectural structure as ART 2-A but differs at the finer architectural levels and in function. SART consists of a layer of M F1 nodes and a layer of N F2 nodes, fully connected by a set of bottom-up adaptive weights, together with an orienting subsystem which incorporates the reset mechanism (see figure 4a).

Recently, a parallel SART algorithm has been implemented on a Transtech Paramid distributed-memory parallel computer. The Paramid consists of 48 i860-XP nodes, each with a processing capability of 100 MFlops. Two parallel SART networks were then connected by a dynamic map-field network to produce parallel SMARTMAP; the whole network's architecture can be seen in figure 4b. These networks were implemented in essentially the same way as our parallel ART 2-A and ARTMAP networks [Tay95].

### (vi) Estimating the Exact Pitch

The amplitudes of the significant spectral peaks are held in an array, along with their frequencies (now known more precisely through interpolation). The ANN identifies the pitch to within a semitone. Reference is then made to the array to find a significant peak (the fundamental) within a quarter-tone of this pitch. Subsequent peak-hunts are conducted, roughly around the 2nd, 3rd, etc. harmonic frequencies, but guided by a running mean of subharmonics reflecting the best estimate so far of the fundamental frequency. This procedure can cope with missing or mistuned harmonics and determines the pitch with a typical accuracy of 10 cents.

## 3. Conclusions

The system was trained with 49 chromatic notes (C2 to C6) taken from 11 instruments chosen for their spectral variety, including piano, tubular bells, trumpet and banjo. The 539

Figure 4: (a) Topology of the SART ANN computational model. It consists of two input pre-processing fields F0 and F1 and an output field F2, fully connected by a set of bottom-up weights. (b) The ARTMAP architecture consists of two SART modules, called ARTa and ARTb, connected by a map-field module.

training patterns were learned in 20 seconds by the parallel SMARTMAP network using 3 processors, and indications are that the system will comfortably classify pitch in real time. On a SPARC 5 workstation it can cope with about 30 pitches a second.

## References

[CGR91a] Gail A. Carpenter, S. Grossberg, and J. H. Reynolds. ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, 4:565-588, 1991.

[CGR91b] Gail A. Carpenter, S. Grossberg, and D. B. Rosen. ART 2-A: An adaptive resonance algorithm for rapid category learning and recognition. Neural Networks, 4:493-504, 1991.

[DWS82] H. Duifhuis, L. F. Willems, and R. J. Sluyter. Measurement of pitch in speech: An implementation of Goldstein's theory of pitch perception. Journal of the Acoustical Society of America, 71(6):1568-1580, 1982.

[Gol73] J. L. Goldstein. An optimum processor for the central formation of pitch of complex tones. Journal of the Acoustical Society of America, 54(6):1496-1516, 1973.

[Her88] D. J. Hermes. Measurement of pitch by subharmonic summation. Journal of the Acoustical Society of America, 83(1):257-264, 1988.

[SD88] Samuel D. Stearns and Ruth A. David. Signal Processing Algorithms, chapter 12. Prentice-Hall, Englewood Cliffs, NJ, 1988.

[Tay95] Ian Taylor. Parallel supervised ARTMAP learning of analog multi-dimensional maps using self-adaptive ART 2-A neural networks. In preparation, 1995.

[Ter74] E. Terhardt.
Pitch, consonance, and harmony. Journal of the Acoustical Society of America, 55:1061-1069, 1974.

[TG94a] Ian Taylor and Mike Greenhough. Evaluation of artificial-neural-network types for the determination of pitch. Proceedings of the ICMC, pages 114-120, 1994.

[TG94b] Ian Taylor and Mike Greenhough. SART: A modified ART 2-A algorithm with rapid intermediate learning capabilities. Proceedings of the IEEE International Conference on Computational Intelligence, 2:606-611, 1994.

[TSS82] E. Terhardt, G. Stoll, and M. Seewann. Algorithm for extraction of pitch and pitch salience from complex tonal signals. Journal of the Acoustical Society of America, 71(3):679-688, 1982.