Page  428 ï~~A NEURAL NETWORK FOR TRIAD CLASSIFICATION Tim Shuttleworth and Roland Wilson Department of Computer Science, University of Warwick, Coventry, CV4 7AL, England email: Tim.Shuttleworth@dcs.warwick.ac.uk Roland.Wilson@dcs.warwick.ac.uk ABSTRACT: This paper describes a neural network for the classification of triads. The network extends the architecture of (Laden, B. and Keefe, D.), and uses spectral data derived from a windowed Fourier transform of the audio signal. Among the properties of the network investigated were the number of hidden units required to encode the learned triad patterns and the generalization capability of the network when presented with previously unseen triads. The paper describes the structure and training of the network and discusses its potential application in the context of a general approach to automatic transcription being developed at the University of Warwick. Introduction The work presented is part of a project to develop a system for the automatic transcription of audio signals representing polyphonic musical works into a sequence of note values and onset times. The term polyphonic simply means in this context that more than one note may occur at a given point in time, either due to several instruments playing, or one instrument playing several notes. Initially attention is focused on works where relatively few instruments are involved (trios, quartets and so forth), but the system has no prior knowledge of how many voices to expect. Designing such a system one is immediately presented with a number of problems. First, the input audio signal will typically contain a very large number of samples. Furthermore the data represent the signal purely in the time domain, making it hard to extract the note pitches. Therefore the input signal must be transformed by a preprocessing stage which extracts only the most salient signal features in a form which facilitates transcription. For a system of the type considered here what is required are the frequencies and onset times of the harmonic partials of the notes in the input. Unfortunately, the Uncertainty Principle (Mortensen, R.) places limitations on the accuracy with which one can measure the position of a given feature in the time and frequency domains simultaneously. A further complication arises from the fact that, even with accurate estimates of the location of the partials in the time-frequency plane, a given set of partials could be accounted for by many different note configurations (particularly notes played in unisons and octaves). The magnitudes of the partials are of limited use in resolving this ambiguity because it is generally difficult to predict their values for a single instrument. Hence one cannot easily tell if a given partial is a harmonic of one or several notes. One solution of this problem is to limit the types of musical signal the system is presented with, but generally one needs to apply higher level knowledge about the structure of the music in order to resolve such ambiguities. Such high level information may be extracted from a prior analysis of the score, or may simply be in the form of general knowledge of musical structure (i.e. principles of harmony, melody and timing). However it is not immediately clear how one may usefully combine this higher level knowledge with the lower level signal representations. One strategy which has received some attention recently is the use of neural networks for higher level processing of musical structures. These networks have the advantage that they potentially allow for the integration of distributed low level signal representations with more abstract higher level musical representations. Additionally they can be made to exhibit robust behaviour and graceful degradation in the presence of noisy or degraded input signals. A Triad Classification Network In order to investigate the possibilities of using neural networks for transcription, a network was created for performing the task of classifying input triads. Such a network must combine information from the signal representation at the input with the higher level knowledge of what constitutes a triad, which implies knowledge not simply of the presence or absence of particular notes, but rather of their relative positions. The network used in the present work is illustrated in Figure 1. It is based on one described in (Laden, B. and Keefe, D.). The output classification is coded using a single unit for each of the possible classes of triad. An output value close to 1.0 indicates that the network has classified the input as the corresponding triad type. Each input unit connects to each of the hidden units, which connect in turn to the output units. Various numbers of hidden units were tried, here results for 3, 8 and 25 are presented. The connection between 428 I2 CMC PROCEEDINGS 1995

Page  429 ï~~Augmented Major Minor Diminished 0.9 1/(+exp(-x)) 0.8 0.7 0.6 -0.5.. o 00.4 / 0.3 F 0.2 0.1 r -6 -4 -2 0 2 4 6 Input C3 C#3 D3 D#3 E3 F3 F#3 G3 G#3 A3 A#3 B3 Figure 1: The Triad Recognition Network Architecture Figure 2: The Activation Function each pair of units has an associated weight, wij. The output value of each unit, oi, is formed by computing the total weighted input to that unit, N Ui - = Wi,1jj j=0 and passing the resulting activation value through a biased sigmoidal activation function: 0i 1 2 - 1 + exp(-u" + 0i The parameter 8i is a threshold value for each unit, and may be considered as a weighted input to each unit from a unit with a constant output of 1.0. Figure 2 shows the output as a function of total input for a threshold value of 0.0. Note that the output values of the units are limited by this function to the range 0.0 to 1.0. An appropriate choice of weights may lead the network to correctly classify triadic inputs, but suitable values for wij are not available a priori. Therefore a training algorithm was employed which would iteratively determine an appropriate set of weight values. Initially the standard back-propagation scheme was employed (Rumelhart, D. and McClelland J.), but after some preliminary experimentation a switch was made to the Quickprop algorithm described in (Fahlman, S.), which was found to produce similar results, but converged more quickly to a stable weight set. The details of both of these algorithms have been widely published, and it is therefore unnecessary to repeat them here. Suffice it to say that both algorithms work by presenting examples of each type of triad at the inputs and observing the resultant output values. The weights are then adjusted from output to input in order to reduce the error between the observed output and the desired output. Results The network was trained with 100 examples of triads (computed as in Appendix A), forming an equal mix of each triad type and tested with a further 100 different triads, again equally mixed. The learning algorithm was applied after the presentation of each training pattern in turn. After presentation of all training patterns (one epoch) the current error rate (defined as the number of output units which gave an incorrect output divided by the total number of possible correct output units for the epoch) was noted. Then all the the test patterns were presented in turn without the the learning algorithm being applied, and the error rate for the test patterns noted. Figure 3(a) shows the error rate for the training sets for networks with 3, 9 and 25 hidden units. Figure 3(b) shows the error rates for the test sets for the same networks. Discussion of Results The results shown in Figure 3 compare quite favourably with those of Laden and Keefe. For their closest equivalent network, they achieved a final error rate of 47%, which is very slightly worse than those shown Figure 3(b). A number of variations on the basic network architecture were considered, in an effort to improve the system's performance. The input representation was changed for one which was linear in frequency (giving many more input units, but allowing for a distinction between close but not coincident harmonics of different notes). Interestingly, despite the increased number of input units, this lead to a IC M C PROCEEDINGS 199542 429

Page  430 ï~~1.! 0... hidde unit... 0.9..hiddenu nit.. 25 hidd en unit.L...... 2 hidd en unit...................m....................... 3. 8hidd nnt:- -- 079 id~ ~o 0.7 4........." O 0.6 4 O.6 0. - 0.3---- 1- ---- ---- '..... _.............................................................. 5000 10000 15000 20000 25000 30000 35000 40000 0 5000 10000 15000 25000 30000 36000 40000 Epoch Epoch (a) (b) Figure 3: Error rate as a function of training epoch for (a) the training set and (b) the test set. similar performance level to the previous pitch class approach. A tone chroma representation was also used, where all notes of the same name (all C's for example) were represented by a single unit, whose activation was the sum of the activations for the the separate units. Again, this resulted in a similar error rate as for the pitch class representation, but this representation was (as one might expect) better at classifying inverted triad inputs. Next, the output representation was altered so that the output encoded the intervals of the triad directly. The output now had two units, one for the root to mediant interval and one for the mediant to dominant interval. An output value of 0.0 indicated a minor interval, an output value of 1.0 indicated a major interval. Thus a major triad would be encoded at the output as (1, 0). Surprisingly, such an output coding scheme did not produce significantly improved results. It seems that the architecture is limited to a correct classification rate for unseen triads of no better than about 50% to 70%. Whilst this is far better than what one might expect through pure chance, it is still not sufficiently reliable for use as a major part of a music transcription system. Conclusions Clearly the gradient descent methods of which back-prop and Quickprop are examples are not powerful enough to correctly learn the intermediate representations required for the successful classification of triads. Because the definition of triads is based upon intervals, rather than absolute pitch values, it may be argued that triad classification requires knowledge of higher order statistics of the input data, which are not readily learned by gradient descent algorithms. Therefore attention was turned to a network for performing the simpler task of recognizing individual notes in the input. For an individual input note this requires knowledge of a fixed set of other input notes, arranged in a fixed pattern. The particular network architecture chosen was based upon the Hopfield paradigm, and is described in (Shuttleworth, T. and Wilson, R.). The note recognition network was found to be much more successful at identifying notes (which is after all a primary requirement for a transcription system), and also offers the possibility of being extended to form a complete transcription system. This system, whose components are illustrated in Figure 4, is currently under construction. Acknowledgements Funding for this work was provided by the United Kingdom Engineering and Physical Sciences Research Council and Central Research Laboratories Ltd. (A THORN EMI company). Appendix A Preprocessing Individual instrument samples were taken from the "McGill University Master Samples" Compact Disc series (Opolko, F. and Wapnick J.). Samples were made with 16 bit linear resolution at a frequency of 11,025 Hz. Instruments sampled were: Trumpet (F#3-D#6), Tuba (C2-G4), Xylophone (F-C8), Loud Piano (AO-C8), Soft Piano (AG-C), Double Bass (Cl-E4), Violin (G3-C7), Contrabassoon (A#0-F3), Flute (C4-C7) and Oboe (A#3-F6) A single level of a Multiresolution Fourier Transform (equivalent to a Short-Time Fourier Transform (Wilson, R. et aL) was computed for the first 1.5 seconds of the signal. The window function applied was 430 0ICMC PROCEEDINGS 1995

Page  431 ï~~Figure 4: The Envisaged Transcription System Architecture a relaxed form of finite prolate spheroidal sequence (FPSS), which is known to optimally concentrate energy about the window centre in both time and frequency (Wilson, R. and Spann, M.). The window duration was 8192 samples (0.74 seconds) with 50% temporal overlap between blocks and (due to the relaxation of the FPSS) 50% overlap in frequency. Since the phase information is not used in this work, only the magnitude of the Fourier coefficients was retained. The majority of the instruments sampled have an initial broad-band onset followed by a (relatively) steady-state. Due to the long window size used, the onset energy is largely contained within the first time bin, and so the steady-state section was found by extracting the Fourier coefficients of the second time bin. The Fourier transform gives a linear frequency scale, but the input representation is based upon the fundamental frequencies of the notes, which increase along a log-frequency scale. The input units of the network are labelled with the names of notes, and their activation values represent a Gaussian windowed average of the energy in a frequency range half a semitone above and below the fundamental frequency of the note label. For example, A4 (fundamental frequency 440.0 Hz.) represents the weighted average spectral energy in the range 427.5 Hz. to 452.9 Hz. (Note, the window is Gaussian on a log-frequency scale). References Fahlman, S. E.,"An Empirical Study of Learning Speed in Back-Propagation Networks," CMU Computer Science Report, CMU-CS-88-162 (September 1988). Laden, B. and Keefe, D. H.,"The Representation of Pitch in a Neural Net Model of Chord Classification" in Music and Connectionism, The MIT Press, Cambridge, Mass. (1991). Mortensen, R. E., Random Signals and Systems, John Wiley & Sons (1987). Opolko, F. and Wapnick, J., McGill University Master Samples, McGill University (1987). Rumelhart, D. E. and McClelland, J. L. (eds.), Parallel Distributed Processing-Explorations in the Microstructure of Cognition, Volume 1, The MIT Press (1986). Shuttleworth, T. and Wilson, R. G.,"The Recognition of Musical Structures using Neural Networks," to be presented at the International Joint Conference on Artificial Intelligence (August 1995). Wilson, R. G., Calway, A. D., and Pearson, E. R. S.,"A Generalized Wavelet Transform for Fourier Analysis: the Multiresolution Fourier Transform and its Application to Image and Audio Analysis,", I.E.E.E. Transactions on Information Theory, Volume 38, Number 2 (March 1992). Wilson, R. G. and Spann, M.,"Finite Prolate Spheroidal Sequences and their Applications II: Image Feature Description and Segmentation," I.E.E.E. Transactions on Pattern Analysis and Machine Intelligence, Volume 10, Number 2 (March 1988). ICMC PROCEEDINGS 199543 431