ORGANISATION OF SOUNDS WITH NEURAL NETS

Bernhard Feiten, Roland Frank
Elektronisches Studio, H5 1, Inst. f. Kommunikationswissenschaften, TU Berlin
Strasse des 17. Juni 135, D-1000 Berlin 12
T: +49 30 314 25699, feiten@mvax.kgw.tu-berlin.de

Tamas Ungvary
Royal Institute of Technology, Dept. of Speech Communication and Music Acoustics
Box 70014, S-10044 Stockholm
T: +46 8 790 78 78, ungvary@kacor.kth.se

Sound Information Technology (SIT) Research Group

Abstract: A sound data base requires retrieval indices that are based on auditory perception. The attributes of auditory sensation are essential parameters for the organisation of sounds. Analogous to the mechanism of sound recognition, a self-organizing system for sounds is proposed. The feasibility of a self-organizing neural net for sounds is shown by mapping 102 different sounds onto a Kohonen feature map. Retrieval by verbal attributes requires a corresponding sound analysis. A neural net for mapping sounds to sonic oppositions as verbal attributes is described.

I Introduction

To organize sounds, knowledge of auditory perception must be applied. The attributes of auditory perception can be grouped into categories:

1. Attributes that correspond to transformation properties that are inherent or learned in prelingual development, as in the recognition of sound. This implies the competence to decompose sound and to differentiate between loudness, pitch, sharpness, roughness, compactness, etc. These attributes usually do not come to mind explicitly, but they form an intermediate stage. Knowledge about this decomposition process is still very limited.

2. Attributes that correlate with similarities of sounds in speech, music and environmental noise, e.g. similar to the vowels a, o, u, or sounds like flute, rush, rustle, murmur, thunder, etc.

3. Attributes that are based mainly on emotion, e.g. pleasant, hard, soft, violent, dirty, calm, dark.

Up to now, only partial solutions exist for handling sounds with respect to their sonic features. Today's technology allows us to build large data bases of sounds in various physical representations, but in general the retrieval of sound data is reduced to a linear search through an unsorted heap. For intelligent retrieval it is necessary to extend the physical description of the sound data by a symbolic description, for instance by the attributes mentioned above. This Sound and Property Data Base (SPDB) is the foundation for a composer's expert system [Feiten 1990]. The implementation is optimally based on an Integrated Project Support Environment (IPSE), which allows knowledge for representing sound models to be combined with an intelligent, percept-based retrieval of sounds from a repository [Eaglestone 1991]. An expert shell system will be used to compile the knowledge about sound perception and to develop a representational language with a vocabulary covering a large part of the research area. On the other hand it is necessary to analyse the knowledge involved in the perception of sound. This can be seen as an extension of the physical analysis: finding a transformation from the signal to attributes like those described above. Neural nets are well suited to realize transformations between different representations of signals even if the transformation rule is unknown.

Two different approaches are presented. A self-organizing net is used to map sounds onto a two-dimensional map in such a way that similar sounds occupy neighbouring places.
The result is a topographic map of indices to the sounds which can be used in a data base environment as a graphical retrieval index. The user then moves with the cursor through a two-dimensional field of sounds, sorted according to their similarity. In the second approach, a net is trained by supervised learning to transform the input signal into a verbal description in terms of the attributes mentioned above. Pairs of these attributes are used to build semantic scales, on which an inquiry to a sound data base can be started (a minimal sketch of these two access paths follows).
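As an illustration, here is a minimal sketch of such a retrieval front end, assuming a trained map of sound indices and per-sound positions on the semantic scales. The class and helper names, the map layout and all data are hypothetical placeholders; the sketch only mirrors the two access paths described above (cursor navigation on the topographic map, and range queries on semantic scales).

```python
import numpy as np

class SoundRetrieval:
    """Hypothetical retrieval front end for the Sound and Property Data Base (SPDB)."""

    def __init__(self, sound_map, attributes):
        self.sound_map = sound_map      # 2-D array: map cell (x, y) -> sound index
        self.attributes = attributes    # dict: sound index -> {scale name: value in [-1, 1]}

    def at_cursor(self, x, y, radius=1):
        """Return the sound under the cursor and its neighbours on the map."""
        xs = range(max(0, x - radius), min(self.sound_map.shape[0], x + radius + 1))
        ys = range(max(0, y - radius), min(self.sound_map.shape[1], y + radius + 1))
        return sorted({int(self.sound_map[i, j]) for i in xs for j in ys})

    def by_scales(self, **ranges):
        """Range query on semantic scales, e.g. by_scales(grave_acute=(-1.0, -0.3))."""
        hits = []
        for idx, attr in self.attributes.items():
            if all(lo <= attr.get(scale, 0.0) <= hi for scale, (lo, hi) in ranges.items()):
                hits.append(idx)
        return hits

# Example with placeholder data: a 10 x 10 map of 102 sounds and one random scale.
rng = np.random.default_rng(0)
retrieval = SoundRetrieval(
    sound_map=rng.integers(1, 103, size=(10, 10)),
    attributes={i: {"grave_acute": rng.uniform(-1, 1)} for i in range(1, 103)},
)
print(retrieval.at_cursor(4, 4))
print(retrieval.by_scales(grave_acute=(-1.0, -0.5)))
```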

II Organisation of Timbre

Timbre is multidimensional. Besides the many degrees of freedom of the spectrum of steady-state sounds, the temporal development influences timbre essentially. One can distinguish between global temporal effects like attack, decay and modulation, and microstructured effects of the individual partial tones. To analyse this multidimensional source, mainly two methods are applied.

With multidimensional scaling, for example, a dissimilarity matrix can be estimated, from which a timbre space can be calculated. For stationary sounds it was shown that a spectral space derived from the Euclidean distance between the sound spectra correlates quite well with the calculated timbre space [Plomp 1976]. A three-dimensional solution for the timbre space of 18 different sounds was found to be much better in terms of agreement with the clustering solution as well as with respect to its interpretability [Grey 1975].

Instead of multidimensional scaling, the organisation of sounds can be realized by a self-organizing neural net [Kohonen 1984]. It models fundamental system principles of the brain. A spatial ordering process observed in the brain seems to be important for an effective representation of information. The various maps formed in self-organisation are able to describe topological relations of input signals. An experiment mapping 102 sounds onto a two-dimensional map with the well-known algorithm proposed by Kohonen is described in Part III.

For a verbal approach the method of the semantic differential is suitable. Von Bismarck studied a large number of semantic differentials like hard-soft, sharp-dull, etc. [Bismarck 1972, Plomp 1976]. The factor analysis revealed two main factors, strongly correlated with properties of the physical structure, e.g. sharpness. But complex tones with comparable sharpness cannot be discriminated adequately in verbal terms. This reflects our difficulty in saying what a particular tone sounds like and emphasizes the need for a well-defined representational language for sound. Cogan defined 13 specific oppositions of sonic properties and used them to analyse music. Because he regards the oppositions contextually and relatively, this might cause problems when applying them to isolated sounds [Cogan 1984]. In the absence of a better-defined vocabulary for sound perception, the sonic oppositions after Cogan are used here. An attempt to map sounds to these oppositions automatically by applying a time-delayed neural net is described in Part IV.

The preprocessing of the sounds for the net input is realized by modelling the ear. The frequency-to-place transformation on the basilar membrane is an essential step in the auditory system and leads to the model of a bandpass filterbank. For reasons of efficiency, this transformation is mostly realized by an FFT filterbank. The short-time spectrum is then transformed to the Bark scale: the FFT bands of constant bandwidth are summed into the "critical bands". Taking spectral and temporal masking effects into account leads to the specific loudness, which finally results in a smoothed frequency-loudness-time curve (a sketch of this preprocessing chain is given below).
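A minimal sketch of this preprocessing chain, assuming a plain FFT filterbank, Zwicker's Bark-scale approximation, a power-law compression and a first-order temporal smoothing as a crude stand-in for the masking and specific-loudness model. The frame sizes, band count and smoothing constant are illustrative assumptions, not the values used by the authors.

```python
import numpy as np

def bark(f_hz):
    """Zwicker's approximation of the Bark scale (critical-band rate)."""
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

def sonogram(signal, sr, n_fft=512, hop=128, n_bands=24, smooth=0.7):
    """FFT filterbank -> critical-band energies -> temporally smoothed loudness curve."""
    window = np.hanning(n_fft)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    band_of_bin = np.clip(bark(freqs).astype(int), 0, n_bands - 1)

    frames = []
    state = np.zeros(n_bands)
    for start in range(0, len(signal) - n_fft, hop):
        spectrum = np.abs(np.fft.rfft(window * signal[start:start + n_fft])) ** 2
        # Sum the constant-bandwidth FFT bins into critical bands (Bark scale).
        bands = np.bincount(band_of_bin, weights=spectrum, minlength=n_bands)
        # Crude specific-loudness / temporal-masking stand-in:
        # power-law compression followed by first-order smoothing over time.
        loudness = bands ** 0.23
        state = smooth * state + (1.0 - smooth) * loudness
        frames.append(state.copy())
    return np.array(frames)        # shape: (time frames, critical bands)

# Example: 300 ms of white noise at 16 kHz, the sound length used in Section III.
sr = 16000
x = np.random.default_rng(1).normal(size=int(0.3 * sr))
print(sonogram(x, sr).shape)
```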
III Self-Organizing Sound Map

In perception there is a tendency to compress information by forming reduced representations of the most relevant facts, without loss of knowledge about their relationships. When the information is mapped onto a two-dimensional map, the most prominent features are grouped along these two directions. Similar sounds then occupy neighbouring places; the result is a topographic map of the sounds. The model described here arranges the neurons in a layer. The input is connected to the neurons by weights. The organisation of the sounds is realized during a learning phase by adjusting the synaptic weights. The self-organizing mechanism is based on feedback from neurons in the neighbourhood.

An input signal distributed over multiple input fibres can be seen as an input vector v = (v_1, v_2, ..., v_L). Every input l is connected to every neuron r via the weight w_{rl}. The feedback from neighbouring neurons r' is given by weighting their outputs f_{r'} with the weights g_{rr'}. The mathematical formulation yields equation (1) for the excitation f_r:

(1)   f_r = σ( Σ_l w_{rl} v_l + Σ_{r'} g_{rr'} f_{r'} − Θ )

where σ(.) is a nonlinear threshold function and Θ is the threshold. Positive values of g_{rr'} are excitatory, negative values inhibitory. With g_{rr'} positive for neighbouring neurons and negative for all other neurons, we get an effect of lateral inhibition which results in clearly defined excitation maxima.

Kohonen simplified equation (1) by estimating the excitation maximum only from the weighted input, in the form of the minimal Euclidean distance between the input and the weight vectors (Eq. 2). The adaptation of the weights is realized by the learning rule described by Hebb (Eq. 3):

(2)   f_{r_max} = min_r || w_r − v ||

(3)   w_r^{new} = w_r (1 − ε h_{r,r_max}) + v ε h_{r,r_max}

Here ε characterises the size of an adaptation step and h_{r,r_max} is a function describing the values of the weights g_{rr'}; it is simply defined as a two-dimensional Gaussian. The size of the term ε h_{r,r_max} is a measure of the plasticity h of the net, which can be varied during the learning process between 0 < h < 1. In our experiment the step size and the radius are controlled by exponentially decaying functions during the learning phase. The plasticity is given by equation (4):

(4)   h = ε h_{r,r_max} = h_0 e^{−a_0 t / t_max} · exp( −(r − r_max)² / (2 (b_0 e^{−c_0 t / t_max})²) )

where h_0 is the start step size and b_0 the start radius. The final plasticity is determined by the damping factor a_0 in relation to the number t_max of learning cycles of the training phase; c_0 determines the final radius. The learning algorithm consists of cyclic evaluation of equations (2), (4) and (3) with different input vectors [Ritter 1990] (a minimal sketch of this training loop follows).
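A minimal sketch of the training loop defined by equations (2)-(4). The default parameter values mirror those quoted for the experiments below (start plasticity 0.9, final plasticity 0.01, start radius 4, final radius 2, 10 x 10 neurons); the random "sonograms" and the short cycle count in the example are placeholders for demonstration only.

```python
import numpy as np

def train_som(inputs, x=10, y=10, cycles=50000,
              h0=0.9, hT=0.01, b0=4.0, bT=2.0, seed=0):
    """Kohonen feature map: equations (2)-(4) evaluated cyclically with random inputs."""
    rng = np.random.default_rng(seed)
    dim = inputs.shape[1]
    weights = rng.uniform(0.0, 1.0, size=(x * y, dim))      # w_r, random initial state
    grid = np.stack(np.meshgrid(np.arange(x), np.arange(y), indexing="ij"),
                    axis=-1).reshape(-1, 2)                  # neuron positions r on the map

    # Exponential decay constants so that the step size runs h0 -> hT and the radius b0 -> bT.
    a0 = np.log(h0 / hT)
    c0 = np.log(b0 / bT)

    for t in range(cycles):
        v = inputs[rng.integers(len(inputs))]
        # Eq. (2): excitation maximum = minimal Euclidean distance |w_r - v|.
        r_max = np.argmin(np.linalg.norm(weights - v, axis=1))
        # Eq. (4): plasticity = decaying step size times a shrinking Gaussian neighbourhood.
        frac = t / cycles
        step = h0 * np.exp(-a0 * frac)
        radius = b0 * np.exp(-c0 * frac)
        d2 = np.sum((grid - grid[r_max]) ** 2, axis=1)
        h = step * np.exp(-d2 / (2.0 * radius ** 2))
        # Eq. (3): Hebbian adaptation pulls the neighbourhood towards the input.
        weights += h[:, None] * (v - weights)
    return weights, grid

# Example with random 2500-dimensional "sonograms" standing in for the 102 sounds.
sounds = np.random.default_rng(1).random((102, 2500))
w, g = train_som(sounds, cycles=2000)      # short run for demonstration only
print(w.shape)
```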

To keep the design of this experiment simple, only sounds of 300 ms length were considered. The preprocessing was done as described in Section II. We chose 50 points on the Bark scale and 50 points on the time scale, and the resulting sonogram was used as input vector for the Kohonen feature map (see Pic. 1). With an input dimension of N = 2500 and a map size of, e.g., x = 10 and y = 10, 250 000 synaptic weights are required. This shows one of the main problems with this algorithm.

Pic. 1: Design of the Sound Map experiment (sounds s_i, sonogram preprocessing, Kohonen sound map).

For the design of the experiment it was suitable to use synthetic random sounds. The random sound synthesis algorithm allows certain qualities to be set as parameters. In this way 102 sounds of 6 different categories were produced. The sounds are identified by numbers: 1-17 tonal, percussive, harmonic; 18-34 tonal, steady, harmonic; 35-51 tonal, percussive, inharmonic; 52-68 tonal, steady, inharmonic; 69-85 noisy, percussive; 86-102 noisy, steady. Though the categories had different features, the sounds were still quite similar. For all sounds the same pitch and loudness were used. (Usually a normalisation of loudness is required when the Euclidean distance is used.)

A variation of the parameters yields the following results. During every training phase an organisation of the sounds on the map in agreement with the sound categories appears very soon; initially the map is in a random state. Pic. 2 shows such a map. The training phase is not yet finished: the first category is still divided into three pieces. In the final stage every category always occupied only one area. For this picture the best-matching sound was estimated for each place (see the sketch below). Multiple places are occupied by the same sounds, while other sounds do not appear at all, e.g. No. 4, 16, ... The number of different matched sounds depends on the chosen parameters. Pic. 3 shows that an increased number of learning cycles has no effect; obviously there exists a maximum useful number of learning cycles depending on the number of neurons in the net. Changing the final radius in combination with the final plasticity increases the number of different matched sounds significantly (Pic. 4). The number of neurons of course influences the number of matched sounds (Pic. 5); it has to be significantly higher than the number of sounds to be represented in the net. The attempt with 400 neurons clearly showed the limitation with respect to memory and computation power.

Pic. 2: Sound Feature Map (grid of best-matching sound numbers per place); Parameters: Number of Neurons 400 (x=20, y=20), Start Plasticity 0.9, Final Plasticity 0.001, Start Radius 4, Final Radius 1, total number of learning cycles 50000, already done 400000.
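A sketch of how the "number of different matched sounds" reported in Pics. 2-5 can be computed from a trained map: for every map place, find the sound with the minimal Euclidean distance to the weight vector and count how many distinct sounds appear at all. The function and the placeholder data are assumptions, not the authors' code.

```python
import numpy as np

def matched_sounds(weights, sounds):
    """For every map place r, find the sound with minimal Euclidean distance
    |w_r - s| and count how many distinct sounds appear on the map at all."""
    best = np.array([np.argmin(np.linalg.norm(sounds - w_r, axis=1))
                     for w_r in weights]) + 1       # sound numbers start at 1
    return best, len(np.unique(best))

# Placeholder data: a 10 x 10 map (100 weight vectors) and 102 random "sonograms".
rng = np.random.default_rng(2)
weights = rng.random((100, 2500))
sounds = rng.random((102, 2500))
best_per_place, n_matched = matched_sounds(weights, sounds)
print(best_per_place.reshape(10, 10))
print(n_matched, "different sounds matched out of", len(sounds))
```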

It makes no sense to use the design of this experiment for a data base with, for example, 10 000 sounds. We are working on a further development which divides frequency and time between different nets. With the same resolution used here, the feature map for spectra would have an input dimension of only 50. The temporal course of a sound is then given by the sequence of spectra, which builds traces on this net; these traces can be learned by another net. That reduces the need for memory and computation power significantly, and the number of addressable sounds can be increased.

Pic. 3: Number of different matched sounds (total 102) for 10000, 20000, 50000, 100000 and 200000 learning cycles. Parameters: Number of Neurons 100 (x=10, y=10), Start Plasticity 0.9, Final Plasticity 0.01, Start Radius 4, Final Radius 2.

Pic. 4: Number of different matched sounds (total 102) for final plasticities 0.001, 0.01 and 0.1 and different final radii. Parameters: Number of Learning Cycles 50000, others as in Pic. 3.

Pic. 5: Number of different matched sounds (total 102) for 100, 225 and 400 neurons.

IV Time Delayed Neural Net for Sonic Oppositions

A modular time-delay neural network was trained to recognize a subclass of the sonic oppositions described by Cogan [Cogan 1984]. Currently the oppositions grave/acute, centered/extreme, sparse/rich and beatless/beating are used. These oppositions were trained within 4 isolated modules. Each module is a time-delayed network as described in [Waibel 1989], with the following differences: the input layer receives 32 spectral coefficients generated as described in Section II, the first hidden layer has 16 units and feeds the 8 units of the second hidden layer. The increased number of units was needed to obtain a mapping of the audio spectrum covering the range from 16 Hz to 16 kHz. Finally, the output layer has 2 units. An output activation of {1,0} refers to the negative pole, {0,1} to the positive pole, {1,1} to mixed and {0,0} to neutral. The outputs of the modules are then easily combined into a feature vector describing the current sound. The learning was done with the backpropagation algorithm [Rumelhart 1987], with an additional dynamic momentum (impulse) factor to handle local minima in the error space (a simplified sketch of one module follows).

The net was trained with a set of 100 synthetically generated sounds with a duration of 300 ms. The sounds were produced to cover the four combinations of each opposition pair with equal frequency. After 100 learning cycles the net classified the training sounds with an average recognition rate of 98%. To test the net, a set of 20 natural and synthetic sounds was evaluated by hand and compared to the net results. On average 71% of the net results agreed with the results evaluated by hand; for the opposition beatless/beating only 50% were correct. This shows the difficulty of defining sonic features precisely. The results can only be seen as preliminary, because the repertoire of training patterns might be too small and the net topology is a matter of further research. The mapping of sounds to verbal attributes suffers from the uncertainty in the definition of the attributes.

References:

Bismarck, G. v. (1972). Extraktion und Messung von Merkmalen der Klangfarbenwahrnehmung stationärer Schalle. Dissertation, TU München.
Cogan, R. (1984). New Images of Musical Sound. Harvard University Press.
Eaglestone, B.; Verschoor, A. (1991). An Intelligent Music Repository. Proc. ICMC 1991, Montreal.
Feiten, B.; Ungvary, T. (1990). Sound Data Base Using Spectral Analysis Reduction and an Additive Synthesis Model. Proc. ICMC 1990, Glasgow.
Grey, J. M. (1975). An Exploration of Musical Timbre Using Computer-Based Techniques for Analysis, Synthesis and Perceptual Scaling. Dissertation, Stanford University.
Kohonen, T. (1984). Self-Organization and Associative Memory. Springer-Verlag, Berlin.
Leman, M. (1990). Emergent Properties of Tonality Functions by Self-Organisation. Interface, Vol. 19, pp. 85-106.
Plomp, R. (1976). Aspects of Tone Sensation. Academic Press.
Ritter, H.; Martinetz, T.; Schulten, K. (1990). Neuronale Netze. Addison-Wesley.
Rumelhart, D. E.; McClelland, J. L. (1987). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press / Bradford Books.
Waibel, A., et al. (1989). Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 37, No. 3, March.
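A minimal sketch, under simplifying assumptions, of one such module. The layer sizes (32, 16, 8, 2) and the two-unit output code follow the description above, but the time-delay structure is reduced here to stacking a few consecutive spectral frames, and the dynamic impulse factor is replaced by a plain momentum term; the class name, training data and parameter values are placeholders, not the authors' implementation.

```python
import numpy as np

class OppositionModule:
    """Simplified module for one sonic opposition (e.g. grave/acute):
    a small feed-forward net over time-delayed spectral frames,
    trained by backpropagation with a momentum term."""

    def __init__(self, n_coeff=32, delays=3, seed=0, lr=0.1, momentum=0.5):
        rng = np.random.default_rng(seed)
        sizes = [n_coeff * delays, 16, 8, 2]
        self.W = [rng.normal(0, 0.1, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros(b) for b in sizes[1:]]
        self.vW = [np.zeros_like(w) for w in self.W]    # momentum buffers
        self.vb = [np.zeros_like(b) for b in self.b]
        self.lr, self.momentum, self.delays = lr, momentum, delays

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def forward(self, frames):
        """frames: (delays, n_coeff) block of consecutive spectral vectors."""
        a = [frames.reshape(-1)]
        for W, b in zip(self.W, self.b):
            a.append(self._sigmoid(a[-1] @ W + b))
        return a

    def train_step(self, frames, target):
        """One backpropagation step with momentum; target is e.g. [0, 1]."""
        a = self.forward(frames)
        delta = (a[-1] - np.asarray(target)) * a[-1] * (1 - a[-1])
        for i in reversed(range(len(self.W))):
            gW, gb = np.outer(a[i], delta), delta
            # Propagate the error through the pre-update weights first.
            delta = (delta @ self.W[i].T) * a[i] * (1 - a[i])
            self.vW[i] = self.momentum * self.vW[i] - self.lr * gW
            self.vb[i] = self.momentum * self.vb[i] - self.lr * gb
            self.W[i] += self.vW[i]
            self.b[i] += self.vb[i]
        return a[-1]

    def classify(self, frames):
        """Map the two output units to the {1,0}/{0,1}/{1,1}/{0,0} code."""
        out = (self.forward(frames)[-1] > 0.5).astype(int)
        return {(1, 0): "negative", (0, 1): "positive",
                (1, 1): "mixed", (0, 0): "neutral"}[tuple(out)]

# One module per opposition; random placeholder training data with target "positive".
rng = np.random.default_rng(3)
module = OppositionModule()
for _ in range(200):
    module.train_step(rng.random((3, 32)), [0, 1])
print(module.classify(rng.random((3, 32))))
```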