Evaluating spatial sound with binaural auditory model

Ville Pulkki
Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology
email: Ville.Pulkki@hut.fi

Abstract

Directional qualities of spatial sound systems are studied with a binaural model of human direction perception. The evaluated spatial sound systems are pair- and triplet-wise panning and first- and second-order Ambisonics.

1 Introduction

Spatial sound systems aim to convey to the listener not only the sound itself but also the perceived directions of the sound sources. For perfect reproduction quality, the directions of virtual sources should be reproduced as targeted. These attributes have typically been measured with listening tests, which are expensive and time-consuming to conduct. Another way to evaluate spatial characteristics is to use an auditory model of spatial hearing. Studies on directional hearing have produced quite extensive and reliable models, which are tuned with results from psychoacoustic and neurophysiological tests. In this study the quality of virtual sources generated with different multi-loudspeaker spatial systems is analyzed with a binaural auditory model.

2 Spatial Hearing

Spatial and directional hearing have been studied intensively (Blauert 1997). The duplex theory of sound localization states that the two main cues of sound source localization are the interaural time difference (ITD) and the interaural level difference (ILD), which are caused respectively by the difference in wave propagation time to the two ears (primarily below 1.5 kHz) and by the shadowing effect of the head (primarily above 1.5 kHz) (Blauert 1997). The auditory system decodes the cues in a frequency-dependent manner. The cues resolve the cone of confusion in which the sound source lies. A cone of confusion is shown in Fig. 1. Direction perception within a cone of confusion is refined using other effects, such as spectral cues and head rotation (Blauert 1997).
The azimuth-elevation spherical coordinate system is commonly used, but it is inconvenient here because a cone of confusion cannot be presented with it easily. An alternative system has been proposed in (Duda 1997). In it, the cone of confusion in which a sound source lies is specified by the angle $\theta_{cc}$ between the median plane and the cone. The direction within the cone is defined by the angle $\phi_{cc}$, as depicted in Fig. 1.

Figure 1: Cone of confusion. Spherical coordinate system that is convenient in direction hearing.

3 Amplitude panning

In amplitude panning the same sound signal is applied to a number of loudspeakers that are equidistant from the listener. The loudspeakers may have different gains. Many spatial sound systems for multiple loudspeakers apply this principle. Pair- and triplet-wise panning and first- and second-order Ambisonics are reviewed here.

3.1 Pair- and triplet-wise panning

Pair-wise amplitude panning methods (Chowning 1971) can be used for horizontal loudspeaker systems. The sound signal is applied to the two loudspeakers between which the panning direction lies. If a virtual source is panned coincident with a loudspeaker, only that particular loudspeaker emanates the sound signal.

A three-dimensional loudspeaker setup denotes here a setup in which the loudspeakers are not all in the same plane as the listener. Typically this means that some elevated and/or lowered loudspeakers are added to a horizontal loudspeaker setup. Triplet-wise panning can be used in such setups (Pulkki 1997). In it, a sound signal is applied at one time to three loudspeakers that form a triangle from the listener's viewpoint. If more than three loudspeakers are available, the setup is divided into triangles, one of which is used in the panning of a single virtual source at one time. The number of active loudspeakers is then one, two, or three at a time.

Vector base amplitude panning (VBAP), proposed in (Pulkki 1997), is a method to formulate pair- or triplet-wise panning in arbitrary loudspeaker layouts. In VBAP a loudspeaker triplet is formulated with vectors. The unit-length vectors $\mathbf{l}_m$, $\mathbf{l}_n$ and $\mathbf{l}_k$ point from the listening position to the loudspeakers. The intended direction of the virtual source (the panning direction) is presented with the unit-length vector $\mathbf{p}$. Vector $\mathbf{p}$ is expressed as a linear weighted sum of the loudspeaker vectors

$$\mathbf{p} = g_m \mathbf{l}_m + g_n \mathbf{l}_n + g_k \mathbf{l}_k. \qquad (1)$$

Here $g_m$, $g_n$, and $g_k$ are called the gain factors of the respective loudspeakers. The gain factors can be solved as

$$\mathbf{g}^{T} = \mathbf{p}^{T} \mathbf{L}_{mnk}^{-1}, \qquad (2)$$

where $\mathbf{g} = [g_m \; g_n \; g_k]^{T}$ and $\mathbf{L}_{mnk} = [\mathbf{l}_m \; \mathbf{l}_n \; \mathbf{l}_k]^{T}$. The calculated factors are used in amplitude panning as gain factors of the signals applied to the respective loudspeakers after suitable normalization, e.g. $\|\mathbf{g}\| = 1$. VBAP can also be used for pair-wise panning, in which case it is equivalent to an existing panning law, the tangent law (Bennett, Barker, and Edeko 1985).

3.2 Ambisonics

Ambisonics is basically a recording technique (Gerzon 1972). However, it can also be simulated in order to synthesize spatial audio. In this case it is an amplitude panning method in which a sound signal is applied to all loudspeakers of a horizontal or 3-D setup with gain factors

$$g_i = N^{-1}\,(1 + 2\cos\alpha_i), \qquad (3)$$

where $g_i$ is the gain of the $i$th loudspeaker, $N$ is the number of loudspeakers, and $\alpha_i$ is the angle between the $i$th loudspeaker and the panning direction. The sound signal therefore emanates from all loudspeakers. Second-order Ambisonics applies the sound with gain factors

$$g_i = N^{-1}\,(1 + 2\cos\alpha_i + 2\cos 2\alpha_i) \qquad (4)$$

(Monro 2000).
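The gain laws of Section 3 can be sketched in a few lines of code. The following is a minimal illustration, not the implementation used in this study: the function names, the azimuth/elevation convention of `unit_vector`, and the use of Cramer's rule to solve Eq. (1) are all assumptions made for the example. Only the triplet-wise case of VBAP and the horizontal case of Eqs. (3) and (4) are shown.

```python
import math

def unit_vector(azimuth_deg, elevation_deg=0.0):
    """Cartesian unit vector for a direction in degrees (assumed convention)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))

def _det3(a, b, c):
    """Determinant of the 3x3 matrix whose rows are a, b and c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))

def vbap_triplet_gains(p, l_m, l_n, l_k):
    """Solve p = g_m*l_m + g_n*l_n + g_k*l_k (Eq. 1), then scale to ||g|| = 1."""
    det = _det3(l_m, l_n, l_k)
    if abs(det) < 1e-12:
        raise ValueError("loudspeaker vectors do not span 3-D space")
    # Cramer's rule: replace one loudspeaker vector at a time with p.
    g = (_det3(p, l_n, l_k) / det,
         _det3(l_m, p, l_k) / det,
         _det3(l_m, l_n, p) / det)
    norm = math.sqrt(sum(x * x for x in g))
    return tuple(x / norm for x in g)

def ambisonics_gains(pan_deg, speaker_deg, order=1):
    """Horizontal gains from Eq. 3 (order=1) or Eq. 4 (order=2)."""
    n = len(speaker_deg)
    gains = []
    for spk in speaker_deg:
        alpha = math.radians(spk - pan_deg)  # angle to the panning direction
        g = 1.0 + 2.0 * math.cos(alpha)
        if order == 2:
            g += 2.0 * math.cos(2.0 * alpha)
        gains.append(g / n)
    return gains
```

For example, `ambisonics_gains(0.0, [30, 90, 150, 210, 270, 330], order=2)` evaluates Eq. (4) for a hexagonal setup, and `vbap_triplet_gains(unit_vector(10, 15), unit_vector(30), unit_vector(-30), unit_vector(0, 45))` returns normalized gains for a hypothetical triplet. A negative VBAP gain indicates that the panning direction lies outside the chosen triplet.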
In second-order Ambisonics the sound is still applied to all of the loudspeakers, but the gains have markedly lower absolute values on the side opposite to the panning direction.

4 Evaluating spatial sound with binaural auditory model

A model of spatial hearing, adapted from the literature, has been shown to predict accurately the major effects in amplitude panning (Pulkki 2001). The ear canal signals are simulated with measured head-related transfer functions (HRTFs); six individual sets were used. Simulations were performed symmetrically to left and right directions. The middle ear is modeled with a filter that approximates a response function derived from the minimum audible pressure curve. The frequency resolution of the cochlea is simulated with a gammatone filter bank with 42 frequency bands. Hair cells are modeled by half-wave rectification and low-pass filtering. ITD is calculated with interaural cross-correlation, and ILD as a loudness level difference between the ears in corresponding frequency bands. The cue values are translated with a database search to the $\theta_{cc}$ angles that they suggest, and the final values are called the ITD angle (ITDA) and the ILD angle (ILDA). Twelve values of ITDA and ILDA (six HRTF sets, each simulated to the left and to the right) are thus obtained at each frequency band; their mean values and standard deviations are plotted in the figures.

The evaluation system is tested by analyzing real sources in different directions. In the ideal case the cue angles, shown in Fig. 2, should be constant with frequency. It can be seen that ITDA corresponds well to the direction of the real source, with minor deviations at large real source angles. The ILDA values behave consistently for directions below 50°. For angles above 50°, ILDA generally deviates from the real source direction and is roughly correct only at frequencies higher than 4 kHz. The large deviations are caused by the nonmonotonic behavior of ILD with source direction (Blauert 1997).
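The cue extraction and database search described in this section can be illustrated with a simplified sketch for a single frequency band. It is not the model used in this study: the HRTF, middle-ear, gammatone and hair-cell stages are omitted, ILD is approximated by an energy ratio in dB instead of a loudness level difference, and the cue database is replaced by a stand-in table computed from Woodworth's spherical-head ITD formula. All function names and default parameters are assumptions made for the example.

```python
import math

def itd_from_cross_correlation(left, right, fs, max_itd_s=1e-3):
    """Lag (in seconds) that maximizes the interaural cross-correlation.

    A positive value means that the right-ear signal lags the left-ear signal.
    """
    max_lag = int(round(max_itd_s * fs))
    n = min(len(left), len(right))
    best_lag, best_val = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        # Correlate left[i] with right[i + lag] over the valid overlap.
        start, stop = max(0, -lag), min(n, n - lag)
        val = sum(left[i] * right[i + lag] for i in range(start, stop))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag / fs

def ild_db(left, right):
    """Energy-based level difference in dB; positive when the left ear is louder."""
    e_left = sum(x * x for x in left)
    e_right = sum(x * x for x in right)
    return 10.0 * math.log10(e_left / e_right)

def spherical_head_itd_table(step_deg=5, head_radius_m=0.0875, c=343.0):
    """Stand-in cue database (NOT measured data): angle in degrees -> ITD in s."""
    table = {}
    for angle in range(-90, 91, step_deg):
        theta = math.radians(angle)
        table[angle] = head_radius_m / c * (theta + math.sin(theta))
    return table

def cue_to_angle(cue, table):
    """Database search: the angle whose reference cue is nearest to the given cue."""
    return min(table, key=lambda angle: abs(table[angle] - cue))
```

Applying `cue_to_angle` to an estimated ITD gives a value analogous to ITDA for that band; an ILD table would be searched in the same way to obtain a value analogous to ILDA. In the model itself the reference cues come from the measured HRTF sets, so the stand-in table above only illustrates the lookup step.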
The real-source test thus suggests that ITDA can be used generally in spatial sound analysis, whereas in ILDA analysis it should be taken into account that ILD does not have large values between 700 Hz and 4 kHz.

4.1 Evaluating Ambisonics

The localization cues for first-order Ambisonics were calculated with a hexagonal setup (loudspeaker azimuths 30°, 90°, ..., 330°). The cues for the hexagonal setup are shown in Fig. 2. The ITDA values at low frequencies are fairly stable, but they deviate markedly from the target value, especially at large panning angles. ITDA is unstable at high frequencies. ILDA is also generally unstable and deviates markedly from the panning angle, although the ILDA values for larger panning angles generally have greater magnitudes. There is a large bump between 400 Hz and 3 kHz in the ILDA values. It was noted above that large ILD values do not exist in this frequency region. Such values may lead to near- or inside-head localization. The simulation result with a quadraphonic setup was similar and is not shown here. This simulation predicts that first-order Ambisonics with different loudspeaker configurations fails to produce stable virtual sources in lateral directions.

Second-order Ambisonics was simulated with the hexagonal setup. It produced results relatively similar to those of first-order Ambisonics, as can be seen in Fig. 2. However, a prominent improvement is that the low-frequency ITDA values coincide well with the panning angle values. This suggests that lateral virtual sources can be produced with second-order Ambisonics, at least with this hexagonal setup.

Figure 2: Auditory cues (ITDA and ILDA) for real or virtual sources in different directions. 6 individual HRTF sets. Whiskers denote 25% of standard deviation. [Panels: Real sources; Ambisonics 6LS; 2nd-order Ambisonics 6LS; Periphonic 2nd-order Ambisonics 12LS; Triplet-wise panning. Horizontal axes: ITDA [degrees] and ILDA [degrees], from -30 to 90. Vertical axis ticks run from 0.2 to 18.2.]

When more loudspeakers are available for Ambisonics, they all produce the same sound signal. To explore the effect of this, the previous simulation with second-order Ambisonics was repeated with two additional loudspeakers at directions 60° and 120°. In the simulated cues (not shown here) it was found that the low-frequency ITDA values are more unstable and that there is more deviation between individuals. The values also deviate markedly from the intended values. This suggests that adding loudspeakers beyond a certain number degrades virtual source quality with Ambisonics.

Periphonic second-order Ambisonics was simulated with loudspeakers in two hexagonal grids at elevations -15° and 45°. The simulated virtual sources were in the horizontal plane. The resulting cues for three intended directions are shown in Fig. 2. The cues are highly distorted and there are very large individual differences. Only ITDA behaves consistently, and only below roughly 800 Hz. It is evident that the accuracy of $\theta_{cc}$ direction reproduction is degraded compared with second-order Ambisonics with the hexagonal setup, shown in Fig. 2. The perceived $\phi_{cc}$ cannot be simulated, since there are no valid models available (Pulkki 2001).

4.2 Evaluating triplet- and pair-wise panning

The perceived $\theta_{cc}$ of triplet-wise panned virtual sources was analyzed with the binaural model. The worst quality with triplet-wise panning naturally occurs when the virtual source is panned near the centroid of a triplet. This case was simulated with a triplet that consisted of two loudspeakers at -15° elevation and one at 30° elevation. The angle between each two loudspeakers of the triplet was approximately 60° from the listener's point of view. The loudspeaker triplet was placed so that its centroid was near the panning direction. The results are shown in Fig. 2. The cue angles are fairly stable and correspond at least roughly to the panning direction.
The large bias in the 90° case is due to the fact that the loudspeakers are approximately 30° away from the 90° direction. This corresponds to the worst-case condition; the bias would be smaller if the loudspeakers were nearer to the 90° direction.

The perceived $\phi_{cc}$ direction of amplitude-panned virtual sources has been studied in (Pulkki 2001), where it is shown that $\phi_{cc}$ is perceived individually. However, the perceived $\phi_{cc}$ direction often varies between the $\phi_{cc}$ directions of the loudspeakers in the triplet that is used in panning. When the loudspeaker array is fairly dense, a relatively good accuracy for $\phi_{cc}$ perception can thus be achieved with triplet-wise panning.

Pair-wise panning with VBAP was simulated with a hexagonal grid. The results are similar to those of triplet-wise panning and are not shown here for lack of space. In the results ITDA is stable at low frequencies. With the panning direction 15°, the low-frequency ITDA cues match the panning direction. With other panning directions there is some bias from the targeted direction, although less than with triplets. High-frequency ITD and ILD cues are fairly stable with frequency and coincide roughly with the panning direction. The effect of increasing the number of loudspeakers was simulated in pair-wise panning by adding a loudspeaker at 60°. The cues (not shown) were more stable and corresponded better to the panning directions.

5 Conclusions

In this paper the directional qualities of virtual sources were simulated with a binaural auditory model. First-order Ambisonics seems to be incapable of producing stable lateral virtual sources even when there are loudspeakers in lateral directions. The produced cues are also relatively individual and unstable. Second-order Ambisonics produces accurate cues at low frequencies with a hexagonal loudspeaker setup, but high-frequency cues are unstable. When the number of loudspeakers is increased, the directional quality is degraded.
Pair- and triplet-wise panning produce relatively stable cues over a large frequency range, although there might be some individual behaviour. The directional quality improves if the number of loudspeakers is increased.

6 Acknowledgment

The work of Mr. Pulkki has been supported by the Graduate School in Electronics, Telecommunications and Automation (GETA) of the Academy of Finland and by Tekniikan edistämissäätiö.

References

Bennett, J. C., K. Barker, and F. O. Edeko (1985, May). A new approach to the assessment of stereophonic sound system performance. J. Audio Eng. Soc. 33(5), 314-321.

Blauert, J. (1997). Spatial Hearing, Revised edition. Cambridge, MA, USA: The MIT Press.

Chowning, J. (1971). The simulation of moving sound sources. J. Audio Eng. Soc. 19(1), 2-6.

Duda, R. O. (1997). Elevation dependence of the interaural transfer function. In R. H. Gilkey and T. R. Anderson (Eds.), Binaural and Spatial Hearing in Real and Virtual Environments, pp. 49-75. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Gerzon, M. A. (1972). Periphony: With-height sound reproduction. J. Audio Eng. Soc. 21(1), 2-10.

Monro, G. (2000). In-phase corrections for ambisonics. In Proc. Int. Computer Music Conf., Berlin, Germany, pp. 292-295.

Pulkki, V. (1997, June). Virtual source positioning using vector base amplitude panning. J. Audio Eng. Soc. 45(6), 456-466.

Pulkki, V. (2001). Localization of amplitude-panned virtual sources II: three-dimensional panning. Accepted to J. Audio Eng. Soc.