A Neural Network Model for Sound Localization in Binaural Fusion

Steve Berkley
Graduate Program in Electro-Acoustic Music
Dartmouth College, Hanover, N.H.
ICMC Proceedings 1993

ABSTRACT

Binaural fusion involves taking two related signals from the neural interactions of both ears and fusing them into one coherent image. Rayleigh's duplex theory (1907) described how interaural time differences (ITD, or phase differences) between the ears provide cues for low-frequency sounds, and interaural intensity differences (IID) provide cues at higher frequencies. Similar to the visual system's separation of localization from identification, the auditory system likely separates these processes for efficient processing. Supervised learning and back-propagation are appropriate models for an auditory localization network because of the visual "target." Characteristics of the trained network are examined with respect to similar psycho-physiological behavioral responses for normal and abnormal sound localization phenomena in humans.

1 Sound Localization

Recently, progress has been made towards understanding the mechanisms by which the nervous system extracts information for sound localization. Sound localization has been studied by behaviorists, psychoacousticians, neurologists, psychologists, cognitive scientists, and audiologists, with numerous studies and results. Johannes Müller (1838) postulated that a comparison between the signals arriving at the two ears forms a basis for sound localization [Moore, 1991]. The comparison of the sound image received between the two ears is called binaural hearing. The pinna, head, and torso of the listener all play important roles in providing directional cues to the auditory system. Binaural fusion involves taking two similar signals from the neural interactions of both ears and fusing the signals into one coherent image.
Although completely dissimilar signals are not fused, the auditory system achieves binaural fusion as long as the signals presented to the two ears are similar in some way. Frequencies below 1.5 kHz are important to auditory binaural fusion. Low-frequency envelopes of complex signals (macrostructures) are used for fusion even when the finer details of the signals (microstructures) differ. A fused image is perceived when high- and low-frequency components of a speech signal are directed towards different ears. Neither ear receives enough information to identify speech, yet an identifiable, fused image is perceived [Gelfand, 1990]. Rayleigh's duplex theory (1907) described how interaural time differences (ITD, or phase differences) between the ears provide cues for low-frequency sounds and interaural intensity differences (IID) provide cues at higher frequencies [Gelfand, 1990]. This is because low frequencies bend around the head as a result of their large wavelengths (ITDs are useful for wavelengths larger than the distance the signal must travel from the near ear to the far ear), while high frequencies have wavelengths smaller than the diameter of the head [Gelfand, 1990]. IID accounts for the difference in perceived loudness between the two ears. IID and ITD data from both ears collectively contribute to a complex process of time-intensity tradeoffs for lateralization. (Jeffress, Hafter, et al. often found that their subjects reported two lateralized images instead of just one. The different images are "associated with different trading ratios": one image depends on ITDs and frequencies below 1.5 kHz, yet is unaffected by IID [Gelfand, 1990].) Recent studies demonstrate that directional hearing for high-frequency complex sounds is also affected by ITDs. Stevens and Newman (1936) researched which cues are used to localize sounds.
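The ITD cue of the duplex theory can be made concrete with the classic Woodworth spherical-head approximation; this is not part of the paper's model, only a minimal sketch, and the head radius below is an assumed average adult value.

```python
import math

def woodworth_itd(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
    """Woodworth's spherical-head approximation of the interaural time
    difference: ITD = r * (theta + sin(theta)) / c, where theta is the
    source azimuth in radians (0 = straight ahead, 90 = directly to one side).
    The 8.75 cm radius is an assumed average adult head size."""
    theta = math.radians(azimuth_deg)
    return head_radius_m * (theta + math.sin(theta)) / speed_of_sound

# A source directly to one side gives the maximum ITD, roughly 0.66 ms --
# on the order of the head-width travel times discussed above.
print(round(woodworth_itd(90) * 1000, 3), "ms")
```

A source straight ahead yields an ITD of zero, which is why the ITD cue alone cannot distinguish front from back.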
Localizations were accurate below 1 kHz and above 4 kHz, with the greatest errors between about 2 kHz and 4 kHz, and noise-like sounds were better localized than tones (possibly because of the IID in the high-frequency noise). This indicates that the localization cues are most likely ambiguous around midrange frequencies. These findings were also verified in experiments by Sandel and others [Gelfand, 1990]. Spectral differences produced by the pinnae at high frequencies have been shown in recent experiments by Musicant and Butler to account for front-back localizations [Gelfand, 1990]. The smallest angle that a listener can discriminate is called the minimum audible angle, or MAA. The MAA has been shown to be smallest (best) for frequencies below 1.5 kHz and above 2 kHz. Generally, the MAA is 1-2 degrees for sound sources directly in front of the head. However, sounds around the side of the head give rise to the "cone of confusion," an area where binaural localization cues lose salience and head movements provide significant localization cues, because the movements keep changing the position of the cone [Gelfand, 1990]. Other binaural phenomena such as the masking level difference (MLD) and the precedence effect represent current thinking in auditory neuroscience. The masking level difference is the binaural phenomenon in which a tone is perceptually separated from identical noise presented to both ears, spatially segregating the noise and the tone. Different noise presented to the two ears does not allow the pure tone to emerge [Dowling, et al., 1986]. According to the precedence effect, we localize based on the direction of the

direct sound, not the reflections of the sound in the listening space. Other research has involved the ability of subjects to localize with only one functioning ear. Physicians may administer a battery of auditory tests to determine the site of a brain lesion. The auditory brain stem response (ABR) is used to test for brain stem disorders. A lowered MLD may also indicate a brain stem disorder [Musiek, et al., 1988]. These findings and further study promise to illuminate physiological and psychoacoustic structures involved in binaural hearing. The auditory nervous system is bilaterally symmetrical: corresponding nuclei and pathways for each ear are found on both sides of the midsagittal plane from the medulla all the way to the cortex. At certain anatomical levels, there are neurons that comprise pathways crossing the midline to make connections between corresponding nuclei. The neurons that make up the central pathways and structures of the central auditory system show great variety in their response patterns. This variety suggests that the central acoustic system plays an active role in processing information brought to it by the cochlear nerve. Thus, the projection system does not simply transmit information coded by the cochlear nerve to the auditory cortex. Auditory information is passed from the cochlear nucleus (CN) to the superior olivary complex (SOC). It is here in the superior olivary complex that neuron bundles have been located that are sensitive to interaural intensity differences and interaural time differences. Information is then passed from the superior olivary complex to the inferior colliculus (IC). There is emerging evidence that binaural information is coupled with spectral cues from the outer ear in several midbrain regions to produce topographic representations of auditory space [Moore, 1991]. The external nucleus of the inferior colliculus in the barn owl contains a topographic representation of auditory space [Konishi, 1988].
Direct monaural pathways from the CN to the contralateral IC may also play a role in directional hearing. Changes in the binaural connections of the auditory brain stem occur in neonates with a unilateral hearing loss. Additionally, by stimulating the normal ear, the physiology of the neurons in the inferior colliculus may be calibrated to the auditory space map in the superior colliculus. Adult hearing losses do not suggest similar adaptations [Moore, 1991]. Similar adaptations of the sound localization mechanisms are suggested by the systematic recalibration of localization cues required as head size increases: as humans and animals develop, the head grows and the space between the ears broadens.

2 Learning Localization Cues

The mechanisms for sound localization exhibit an impressive ability to adapt; these mechanisms must be "soft" on some level. The innate direction of visual attention towards a sound source must initiate the feedback essential to learning the spatial location of sounds. Muir & Field (1979) showed how sound localization directs "visual attention" towards a sound object; infants "orient" to a sound [Muir]. This suggests an innate visual reflex to sound or its directional cues. The performance of infants in such tasks has been shown to vary with age according to developmental changes in sensory dominance when they are asked to perform cross-modal perception tasks [Lewkowicz, 1988]. During the first 18 months of development, horizontal- and vertical-plane localizations develop significantly. Morrongiello & Fenwick's experiment (1991) found that five-month-olds exhibit preferential looking toward a static aural-visual combination, while seven-month-olds exhibited preferential looking toward aural-visual combinations with an aural dimension (depth). Nine-month-olds exhibited preferential looking towards all aural-visual combinations with direction along a dimension.
These developmental trends suggest that such coordination between auditory and visual depth information is learned [Morrongiello, 1991]. Knudsen & Knudsen (1989) suggested an innate dominance of vision over audition. The vision of barn owls was displaced with eye prisms, resulting in a shifted perception of sound location. Continuous exposure to this altered environment caused a more significant long-term recalibration of the sound localization system. Differing localization information from the auditory and visual space maps caused the developing owls to use the visual space maps to "calibrate associations of auditory localization cues with locations in space in an attempt to bring into alignment the perceived locations of auditory and visual stimuli emanating from a common source" [Knudsen, et al., 1989]. Visual influence over sound localization occurred even when the visual information was inaccurate. These experiments and other visual-aural displacement phenomena suggest mechanisms for resolving aural-visual spatial discrepancies. In visual capture, a heard sound object, displaced from a seen visual object, is heard to emanate from the seen visual object; thus, an immediate override of the sound location occurs as a result of the discrepancy between the visual and aural spatial maps. A different kind of resolution occurs with actual recalibration of auditory cues, suggesting learning (as with the barn owls). The higher-level adjustments of actual spatiality involve dominance of vision, altering the sound location to match the visual location (as with visual capture). Alternatively, the replacement of previously learned aural-visual correspondences on a "more fundamental level" suggests learning aural spatial location through visual feedback (as suggested by the varied learning rates of older owls as opposed to younger, developing ones).

3 Neural Net Modeling

Neural nets have been used to study and model different aspects of behavior and cognition.
One type of neural net model is trained to map a set of input patterns to an output pattern based on target values given to the system a priori (supervised learning). Supervised learning and back-propagation neural networks are appropriate models for an auditory localization network because of the visual "target." A few such neural network models have

been implemented [Palmieri, et al., 1991], [Moseiff, 1991], and [Berkley, 1992]. The following is a discussion of the implementation of the Berkley model, concluding with a discussion of aural-visual spatiality discrepancy resolution. In my own uncontrolled experiments with kittens, I noticed the high prevalence of localization errors without visual stimulus, while older cats exhibit finely tuned localization systems with quick responses and accuracy. The variety of additional neural input from motor and other systems to the auditory system provides clues that may account for such "additional contexts" in learning, like the use of touch by blind people. These premises suggested to me that a supervised learning neural network trained with back-propagation would be an appropriate model for an auditory localization network because of the visual "target" involved. A sound source was recorded at different points on a semicircle in front of a styrofoam model of a head equipped with binaural microphones. The model head was not equipped with pinnae, since front-back localizations are beyond the scope of this project. Because of the precedence effect and the rarity of non-complex sounds, a typical complex sound was presented to the styrofoam head in a room with echo. Sounds were recorded onto a DAT on respective left and right channels. Fast Fourier Transforms (FFTs) yielded spectral information for the left and right channels of the recording. Next, the FFT data was processed in software, which produced network simulation input pattern files, mapping frequencies and intensities to the correct tonotopic auditory input units (coarse coded). ITDs for the waveforms were calculated manually by comparing attack peak locations. These input units do not represent cochlear hair cells responding to frequency stimuli, but rather the brainstem neurons responsive to ITD and IID.
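The manual attack-peak comparison used above can also be automated. A common alternative (not the method used in this work) is to take the lag of the peak of the interaural cross-correlation; the sketch below assumes a pure-delay relationship between the two channels and uses synthetic signals, not the recorded DAT material.

```python
import numpy as np

def estimate_itd(left, right, sample_rate):
    """Estimate the interaural time difference (in seconds) as the lag of
    the peak of the cross-correlation between the two ear signals.
    A positive value means the sound reached the left ear first."""
    corr = np.correlate(right, left, mode="full")
    lag = int(np.argmax(corr)) - (len(left) - 1)
    return lag / sample_rate

# Toy check: the right channel is the left channel delayed by 20 samples.
fs = 44100
n = np.arange(1024)
left = np.sin(2 * np.pi * 500 * n / fs) * np.exp(-n / 400.0)  # decaying tone burst
right = np.roll(left, 20)
right[:20] = 0.0  # zero the samples wrapped around by the roll
print(round(estimate_itd(left, right, fs) * 1000, 3), "ms")  # about 0.45 ms
```

Cross-correlation is attractive for the gradual-attack sounds (such as the low strings) where attack peaks are hard to pick out by hand, though it still assumes the two channels differ mainly by a delay.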
IID input units are laid out tonotopically in accordance with the spatial representation of frequency in the nuclei of the ascending auditory projections. Output units are coarse-coded and represent 45-degree units from 0-180 degrees, 90 being directly in front (see Fig. 1, 2). Training occurred with a variety of complex input patterns spanning the audible frequency range. Testing involved simulating a new input pattern and comparing output to the correct location of the sound source. Four sound types were recorded at angles 0, 45, 90, 135, and 180 degrees: a bell, a ride cymbal, a low string, and speech. ITD was estimated for the low string sounds because of the gradual attacks and complexity of the waveforms. Speech samples were reserved for testing the network after training and were not included in training files. The network simulation software was fed a configuration file of ten input patterns (five bell sounds and five ride cymbal sounds) and a default learning rate. The network learned to map the patterns in 477 epochs, arriving at an MSE of 0.00. Given a test pattern of a low string sound at 135 degrees, with no ITD unit activations, the network responded with a 180-degree reversal of location. Once this low string was included in the network training patterns with a typical ITD activation of approximately +0.75, the network was retrained, requiring 943 epochs to arrive at an acceptable MSE. This new training set was able to identify the newly included pattern. Symmetric right-left reversals continued under the following circumstances: 1) an original pattern given to the network without ITD; 2) a new pattern given to the network. The network required both ITD and IID to properly localize a sound without right-left symmetric errors. ITD, even for high-frequency sounds, was a necessary cue for localization.
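The simulation software itself is not reproduced here, but the general scheme — a supervised back-propagation network mapping ITD/IID inputs to coarse-coded direction units at 45-degree spacing — can be sketched in NumPy. The input features below are hypothetical idealized cosine-shaped ITD/IID values, not the recorded FFT data, and the layer sizes and learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five coarse-coded output units with preferred directions 0-180 degrees,
# 45 degrees apart; each responds in a Gaussian neighborhood of its center.
CENTERS = np.array([0.0, 45.0, 90.0, 135.0, 180.0])

def coarse_code(angle_deg, width=45.0):
    return np.exp(-((angle_deg - CENTERS) / width) ** 2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Idealized ITD/IID cues per direction (hypothetical stand-ins):
# maximal toward one ear, zero for a source straight ahead at 90 degrees.
angles = CENTERS
itd = np.cos(np.radians(angles))        # normalized ITD cue
iid = 0.5 * np.cos(np.radians(angles))  # normalized IID cue
X = np.column_stack([itd, iid])
Y = np.array([coarse_code(a) for a in angles])

# One hidden layer, trained with plain back-propagation on MSE.
W1 = rng.normal(0.0, 0.5, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.5, (8, 5)); b2 = np.zeros(5)
lr = 0.5
for _ in range(10000):
    H = sigmoid(X @ W1 + b1)          # forward pass
    O = sigmoid(H @ W2 + b2)
    err = O - Y
    dO = err * O * (1 - O)            # output deltas
    dH = (dO @ W2.T) * H * (1 - H)    # hidden deltas
    W2 -= lr * H.T @ dO; b2 -= lr * dO.sum(axis=0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)

def localize(x):
    """Winner-take-all readout of the coarse-coded output layer."""
    O = sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2)
    return CENTERS[int(np.argmax(O))]

mse = float(np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - Y) ** 2))
print("final MSE:", round(mse, 4))
print("localize(X[3]) ->", localize(X[3]))
```

Because the toy ITD/IID cues here are symmetric functions of azimuth, this sketch also illustrates the reversal problem reported above: removing one of the two cues leaves mirror-image directions indistinguishable.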
With this system, it is clear that only a small number of neural inputs (50-60) are required to localize a sound source, without front-back localizations, based on ITD and IID inputs from the superior olives. Above the sound localization neural network, a "decision mechanism" resolves discrepancies between the aural and visual space maps (see Fig. 3). Units labeled "Bias/Attn./Task" represent bias, attention, or task. For instance, if the task is to locate a sound source, the visual space map is overridden because of the high "task" value attributed to the aural space map. Similarly, if the eyes are closed and a sound is heard, the visual attention bias is lowered to allow the aural spatial map to dominate. Under "normal" circumstances, visual capture results from the network's innate bias of the visual space map over the aural space map.

4 Discussion

One important mystery in binaural hearing that relates to the above model is the role of higher auditory structures. What role do the thalamic and cortical structures play, given that the space maps are located in the midbrain? The primary auditory cortex (AI) is necessary for sound localization and plays a "sensory role." Neurons in AI are sensitive to IID and ITD. Since the primary auditory cortex is not needed for the detection of IID and ITD cues, what are the functional roles of IID-ITD-sensitive neurons in AI? Perhaps AI lesioning interrupts a necessary pathway to another cortical center. Experiments with brain-lesioned patients may reveal the functional differences between different binaural processing centers. Perhaps there are fine-grained binaural sound localization systems encompassing the area from the cochlear nucleus through the superior olivary complex into the inferior colliculi. A coarser system may run from the cochlear nucleus directly to the inferior colliculi, utilizing monaural pathways. Split-brain patients may show significant evidence for this hypothesis.
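The "decision mechanism" of Section 3 can be caricatured as a gain comparison between the two space maps. The gain values below are hypothetical placeholders for the bias/attention/task units; this is a toy sketch of the behavior described in the text, not the paper's implementation.

```python
def resolve_location(aural_deg, visual_deg, aural_gain, visual_gain):
    """Winner-take-all reading of the decision mechanism: the space map
    with the larger combined bias/attention/task gain supplies the
    perceived location. Gains are hypothetical illustrative values."""
    return aural_deg if aural_gain > visual_gain else visual_deg

# Innate visual bias -> visual capture: the sound is heard at the seen object.
print(resolve_location(aural_deg=60, visual_deg=90,
                       aural_gain=0.4, visual_gain=0.6))  # 90

# Eyes closed: visual attention drops and the aural map dominates.
print(resolve_location(aural_deg=60, visual_deg=90,
                       aural_gain=0.4, visual_gain=0.1))  # 60

# Task is explicitly "locate the sound": a high task gain on the aural map
# overrides vision even with eyes open.
print(resolve_location(aural_deg=60, visual_deg=90,
                       aural_gain=0.9, visual_gain=0.6))  # 60
```

A graded version could instead blend the two estimates in proportion to their gains; the winner-take-all form is chosen only because it matches the override behavior described for visual capture.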
Further developmental attributes need to be studied in the adaptation of the binaural hearing systems in ITD and MLD processing for younger, as opposed to older, subjects. Further data should be collected concerning the contribution of ITD to sound localization in monaural subjects, and of IID in hearing-impaired subjects, across a wide sample of ages. Finally, the physiological significance of a neural net model of binaural hearing may lie in validating the hypotheses it suggests regarding the processing of binaural cues. Given the simplistic assumptions of binaural hearing models, do the models suggest physiological data that can be gathered in experiments?

By "freezing" clusters of weights and units, do the networks exhibit behavior suggestive of hearing-impaired or brain-lesioned patients? These answers will only come with a collaboration between cognitive modeling and physiological and psychoacoustic experimentation.

[Figures 1-2: ITD input units and output degree units, coarse coded]

REFERENCES

[Berkley, 1992] Steve Berkley. "A Neural Network Model for Sound Localization in Binaural Fusion: Issues, Model, and Simulation." Graduate Paper, Dartmouth College, 1992.

[Dowling, et al., 1986] W. Jay Dowling and Dane L. Harwood. Music Cognition. p. 59. New York: Academic Press, Inc., 1986.

[Gelfand, 1990] Stanley A. Gelfand. Hearing: An Introduction to Psychological and Physiological Acoustics. pp. 419-423. New York: Marcel Dekker, Inc., 1990.

[Knudsen, et al., 1989] Eric I. Knudsen and Phyllis F. Knudsen. "Vision Calibrates Sound Localization in Developing Barn Owls." The Journal of Neuroscience. 9(9): September, 1989. pp. 3306-3313.

[Konishi, 1988] M. Konishi, T.T. Takahashi, H. Wagner, W.E. Sullivan, and C.E. Carr. "Neurophysiological and anatomical substrates of sound localization in the owl." Auditory Function: Neurobiological Bases of Hearing. Ed. Edelman, G.M.; Gall, W.E.; Cowan, W.M. New York: Wiley, 1988.

[Lewkowicz, 1988] D. Lewkowicz. "Sensory Dominance in infants. 1. Six-month-old infants' response to auditory-visual compounds"; "Sensory Dominance in infants. 2. Ten-month-old infants' response to auditory-visual compounds." Developmental Psychology. v24. pp. 155-182.

[Moore, 1991] David R. Moore. "Anatomy and Physiology of Binaural Hearing." Audiology. v30, 1991. p. 126.

[Morrongiello, 1991] Barbara A. Morrongiello and Kimberley D. Fenwick. "Infants' Coordination of Auditory and Visual Depth Information." Journal of Experimental Child Psychology. v52, 1991. pp. 277-296.

[Moseiff, 1991] A. Moseiff. "An Artificial Neural Network for Studying Binaural Sound Localization."
Proceedings from the 1991 IEEE Seventeenth Annual Northeast Bioengineering Conference, Hartford, CT. pp. 1-2.

[Muir] D. Muir and J. Field. "Newborn infants orient to sound." Child Development. v50. pp. 431-436.

[Musiek, et al., 1988] Frank E. Musiek, Karen M. Gollegly, Karen S. Kibbe, and Suzanne B. Verkest. "Current Concepts on the use of ABR and Auditory Psychophysical Tests in the Evaluation of Brain Stem Lesions." The American Journal of Otology. v9, supplement. December, 1988. pp. 25-35.

[Palmieri, et al., 1991] F. Palmieri et al. "Learning Binaural Sound Localization through a Neural Network." Proceedings from the 1991 IEEE Seventeenth Annual Northeast Bioengineering Conference, Hartford, CT. pp. 13-14.

[Figure 3: decision mechanism for the aural and visual space maps]