Comparison of Sound Spatialization Techniques in MPEG-4 Scene Description

Riitta Vaananen1 (riitta.vaananen@), Jyri Huopaniemi2, Ville Pulkki1

1 Helsinki University of Technology, Laboratory of Acoustics and Audio Signal Processing, P.O. Box 3000, FIN-02015 HUT
2 Nokia Research Center, Speech and Audio Systems Laboratory, P.O. Box 407, FIN-00045 Nokia Group

Abstract

This paper gives an overview of the recently updated spatial sound features of the MPEG-4 international standard. The scene description language of MPEG-4 Systems can be used for sound composition and for creating immersive audiovisual scenes in which the acoustic response is parametrically defined and can be designed to correspond to the visual scene. This interface enables detailed modeling of sound propagation in an acoustic environment, but it can also be used to create dynamic, spatial sound effects for audio-only applications. In this paper we present an overview of the audio part of the scene description and discuss the different technologies that may be used to reproduce the spatialized sound.

1. INTRODUCTION

Virtual acoustics is here understood as computer-modeled and rendered virtual sound environments. In such systems, physical sound sources are represented by virtual sound objects that can be positioned in 3-D space. The sound streams associated with these source objects are processed according to parametrically defined acoustic properties of the virtual environment. These parameters can be based on a physical model of the space (e.g., the geometry of a room [1]), on a set of statistical values carrying perceptually relevant information about the space (e.g., room acoustic parameters [12]), or on a combination of these. As a result, the sound is processed so that the user of the system perceives it as if it were heard in the corresponding real or virtual space, taking into account the relative positions of the sound sources and the listener.

The concept of an audiovisual scene description language is characteristic of specifications used for parametric representation of computer models of virtual spaces or scenes. These application programming interfaces (APIs) allow non-expert programmers to build hierarchically structured virtual scenes, relying on an intuitive, parametric, and object-based presentation of the primitive scene objects (often called nodes). Such scene description languages include, e.g., the Virtual Reality Modeling Language (VRML) [2], the Java 3D API [3], and the BInary Format for Scenes (BIFS) defined in MPEG-4 [4].

Virtual sound environment modeling in scene description APIs has traditionally played a minor role compared to visual VR rendering. An improvement is offered by the MPEG-4 standard, which specifies a parametrization of advanced sound source and acoustic environment modeling in which the effects of the modeled physical environment are taken into account in a detailed and flexible fashion, allowing the design of virtual environments with sound source and propagation models of different complexities. These functionalities are enabled by Advanced AudioBIFS, a part of the BIFS specification in the first amendment of the MPEG-4 Systems standard [5], which reached Final Draft International Standard (FDIS) status in late 1999. A more detailed description of these 3-D audio functionalities is given, e.g., in [6, 7].

In this article we first overview the concepts related to coding and presentation of sound in MPEG-4.
Then we present the ideas of parametrizing and rendering interactive audiovisual scenes in BIFS, covering both the physical, model-based parametrization and the perceptual modeling of acoustic scenes. Finally, different sound reproduction techniques and the related audio DSP methods for spatially presenting sound sources in 3-D virtual environments are discussed.

2. SOUND CONCEPTS OF THE MPEG-4 STANDARD

MPEG-4 is a standard for efficient compression of audio [8] and visual data, but unlike its predecessors MPEG-1 and MPEG-2, it defines new concepts for coding and presentation of the content. One of its main purposes is to enable interactive multimedia presentations that may be composed of several streaming data objects, each of which may be coded with a different compression method. Another new concept is the coding of synthetic data, i.e., computer-generated sound or visual information produced according to parametric descriptions. The Audio part of MPEG-4 contains two tools for producing synthetic sound, namely Structured Audio and the Text-To-Speech Interface [9].

Figure 1 illustrates the composition of sound in an MPEG-4 terminal. The BIFS scene description language in MPEG-4 is used to present the audio and visual data streams that may be outputs of different MPEG-4 decoders. This presentation includes not only the playback of the different streams, but also their spatio-temporal positioning in the local coordinate system and time base of the terminal. As such, BIFS enables positioning of decoded audio and visual data in 2-D or 3-D scenes. The general concepts of BIFS have been clarified, e.g., in [11], and a detailed discussion of the audio part of BIFS, namely AudioBIFS, which is used for compositing the sound tracks that are played back, is provided in [10]. In addition to the composition of data streams in time and space, BIFS nodes can be used to build up the visual geometry objects composing a virtual world, to define interactivity and dynamic behaviour for the scene objects, and to associate different objects together to form, e.g., audiovisual sound sources, or geometry objects with video streams projected on them.
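As a rough conceptual illustration of the composition described above (not the normative AudioBIFS node semantics or bitstream syntax), the sketch below mixes two decoded streams with given gains and attaches the result to a hypothetical sound-source object that carries a position in the 3-D scene.

import numpy as np

def compose(streams, gains):
    """Mix equal-length decoded audio streams with per-stream gains."""
    mix = np.zeros_like(np.asarray(streams[0], dtype=float))
    for s, g in zip(streams, gains):
        mix += g * np.asarray(s, dtype=float)
    return mix

# Hypothetical source object: the composed signal plus its position in the
# local coordinate system of the terminal (metres, scene coordinates).
decoded_a = np.zeros(48000)   # stand-ins for decoder outputs
decoded_b = np.zeros(48000)
source = {"signal": compose([decoded_a, decoded_b], [0.7, 0.3]),
          "position": (1.0, 0.0, -2.0)}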

Figure 1: Composition of different audio streams at the MPEG-4 terminal. The streams are mixed in a pre-defined manner and may be presented spatially in the audiovisual scene. The source objects may also be associated with visual objects in the scene, or controlled interactively by the user. The figure shows streaming audio data decoded by natural audio, synthetic audio, and speech decoders, composed in AudioBIFS (e.g., mixing, effects processing), and attached to sound source nodes (with a relation to the visual scene) for spatialized sound output.

3. INTERACTIVE AUDIOVISUAL SCENE RENDERING IN MPEG-4

In BIFS the binary stream syntax is normative (i.e., specified in the standard), but the way the scene is actually rendered is not normatively defined. Neither the way geometry objects are shown on the display of the terminal, nor the way sound sources should be processed so that the user perceives their intended positions, is prescribed. This is a practical choice, since different techniques can be used to produce the sound and the visual output; therefore only the bitstream syntax, and not the output method (for display or sound reproduction), is defined in the standard. Thus, although various parameters are defined for setting up scenes with advanced acoustic properties, only subjective evaluation can verify whether the complete rendering of the sound scene is compliant with the standard. On the other hand, this gives flexible possibilities to utilize any spatialization method and to take advantage of the latest technologies for presenting spatial sound scenes.

However, there are aspects of Advanced AudioBIFS spatial sound processing that are normatively defined. These are parameters that unambiguously define certain features of the virtual "scene response" (as opposed to a virtual room response, a term restricted to the impulse response of enclosed, reverberating spaces) to sound. They include, e.g., the reflectivity and obstruction filtering caused by acoustically responding objects in the scene, the rendering of distance-dependent delay, attenuation, and air absorption, and the time and frequency characteristics of a statistically parametrized room reverberation. These properties can be objectively measured from the impulse response of the sound compositor. However, in dynamic conditions, for example when the source or the listener moves, only subjective evaluation can be used to verify that the sound rendering meets the specification, or that the sound scene is coherent with the possible visual scene.

Altogether, the spatial sound processing in BIFS involves various DSP filter components when all or many of the Advanced AudioBIFS features are utilized at the same time. Care has to be taken when designing how the dynamic filter structures are updated in order to avoid audible artefacts.
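As an illustration of the normatively defined distance-dependent effects listed above (attenuation, propagation delay, and air absorption), the following sketch applies them to a mono signal. The inverse-distance gain law and the one-pole low-pass coefficient are illustrative assumptions made for this sketch, not values taken from the standard.

import numpy as np

def distance_effects(x, distance_m, fs=48000, c=343.0):
    """Distance-dependent gain, propagation delay, and a crude air-absorption
    low-pass for a mono signal (illustrative sketch only)."""
    gain = 1.0 / max(distance_m, 1.0)            # assumed 1/r law, referenced to 1 m
    delay = int(round(distance_m / c * fs))      # propagation delay in whole samples
    y = np.concatenate([np.zeros(delay), gain * np.asarray(x, dtype=float)])
    # One-pole low-pass y[n] = (1 - a) x[n] + a y[n-1]; more smoothing with distance
    # (the coefficient law is an assumption, standing in for an air-absorption filter).
    a = min(0.9, 0.1 + 0.01 * distance_m)
    out = np.empty_like(y)
    prev = 0.0
    for n, v in enumerate(y):
        prev = (1.0 - a) * v + a * prev
        out[n] = prev
    return out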
In Figure 2, the spatial sound modeling components present in Advanced AudioBIFS are listed. It can be seen that some of the filtering effects are monophonic, and can thus be realized in the same way regardless of the sound reproduction method (headphones or loudspeakers). The effects that are marked as spatialized, on the other hand, are those that depend on the reproduction setup, which must be taken into account when designing the DSP blocks used to implement them. The multichannel effects (spatialization of the 3-D incident angle of the direct sound and reflections, and implementation of incoherent reverberation) are those whose design is completely left implementation specific. The minimum requirement for spatializing a 3-D sound position in the MPEG-4 standard is that the sound should be amplitude panned between two channels, so that a left-right distinction of the incident angle of the sound at the listener is achieved.

Figure 2: Spatial sound effects are here considered as monophonic, i.e., those that can be implemented with one reproduction channel, and as spatialized, i.e., those that require multichannel processing in order to obtain the effect. Monophonic effects: source directivity; distance-dependent attenuation, delay, and air absorption; material reflectivity; object obstruction; reverb (rendering of monophonic room acoustic parameters). Spatialized (2-D or 3-D) effects: listener model (angle of incidence of the direct sound and reflections); reverb (incoherence between channels to obtain spaciousness).

4. MODELING APPROACHES FOR ENVIRONMENTAL SOUND EFFECTS IN MPEG-4

Two approaches are included in BIFS for taking into account the effect of a 3-D environment on sounds. One takes advantage of the geometry components in the scene, which produce reflections of sound, or obstruction when they appear on the path between the source and the listener. In this way an acoustic effect that is coherent with the visual scene can be added to the sound sources. This scheme is referred to as the physical approach. The other, the perceptual approach, instead relies on a set of parameters associated with each sound source (rather than on the visual, or "physical", environment used in the physical approach). These parameters describe the time- and frequency-domain statistical properties of a room reverberation effect in a perceptually relevant manner [12]. With both approaches a room acoustic effect is obtained, but because they provide different functionalities and have different processing requirements depending on the scene setup, it should be carefully considered which one to use. A more detailed overview of these approaches is given in [7], where it can be seen that there are many resemblances in the DSP blocks used to implement them.

With both the physical and the perceptual approach it is possible to produce a set of sound reflections followed by an exponentially decaying late reverberation response. Whereas the perceptual approach provides an efficient way of producing perceptually good-quality reverberation, only the physical approach (though in some situations computationally more costly) can capture the existence and positions of acoustically reflecting or obstructing visual surfaces in a scene, yielding an immersive experience in an audiovisual virtual world. The implementation of the late reverberation can be the same in both approaches, but in the physical approach it is bound to a geometrical region in the scene, whereas in the perceptual approach it is associated with the source. Applications suitable for each of these approaches have been discussed, e.g., in [7].
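Both approaches end in a statistically characterized, exponentially decaying late reverberation. A minimal sketch of one building block of such a reverberator is given below: a single feedback comb filter whose loop gain is derived from a target reverberation time. Practical reverberators use several such loops plus allpass sections and per-channel decorrelation; the structure and parameter values here are illustrative, not the DSP structure prescribed by the standard.

import numpy as np

def comb_reverb(x, loop_delay_s=0.037, t60_s=1.5, fs=48000):
    """Feedback comb filter y[n] = x[n] + g * y[n - d], with g chosen so that the
    loop decays by 60 dB in t60 seconds: g = 10 ** (-3 * loop_delay_s / t60_s)."""
    d = int(round(loop_delay_s * fs))
    g = 10.0 ** (-3.0 * loop_delay_s / t60_s)
    y = np.zeros(len(x) + int(t60_s * fs))
    buf = np.zeros(d)                     # circular buffer holding the last d outputs
    idx = 0
    for n in range(len(y)):
        inp = x[n] if n < len(x) else 0.0
        out = inp + g * buf[idx]          # buf[idx] currently holds y[n - d]
        buf[idx] = out
        idx = (idx + 1) % d
        y[n] = out
    return y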

5. 3-D SPATIALIZATION OF SOUND IN BIFS

Next we overview techniques that can be used to reproduce the sound scene parametrically defined in BIFS. These rendering methods can be grouped in different ways. One alternative is to look at the number of reproduction channels: monophonic, two-channel, and multichannel. Another is to discuss headphone and loudspeaker reproduction separately, due to their inherently different nature. Finally, it is also possible to divide the methods according to the type of spatialization: panned or head-related (utilizing human spatial hearing cues). The processing requirements of the different methods are discussed in the framework of virtual and interactive sound environment systems. Critical issues are the overall complexity of the spatial processing algorithms; how the directions of the individually modeled direct sound and early reflections are spatially reproduced (and how the possible diffuse late reverberation is fed to the different channels), taking into account any dynamic changes in a 3-D virtual scene; and the overall setup used for viewing the scene. The last point is strongly related to the modes of interaction used, since different input devices allow different degrees of freedom for the user to move (e.g., a head-tracker vs. a mouse or a joystick as a navigation device), and this in turn determines the required size of the sweet spot of the spatialization system. Below, the reproduction methods for spatialized audio are presented according to the division given in Fig. 3. The main difference between headphone and loudspeaker reproduction is the lack of natural crosstalk when headphones are used. This leads to differences in the spatial perception of sound and therefore calls for different spatialization methods.

5.1. Headphone reproduction

Amplitude and time-delay panning

The simplest method of spatializing sound for headphone listening is amplitude panning. One of the main drawbacks of headphone reproduction is inside-the-head localization, which occurs when only simple spatial cues are applied to the signal. Panning then shifts the sound between the ears, but externalization is not achieved. Combined amplitude and time-delay panning is a rough approximation of spatial hearing based on the interaural level and time differences (ILD and ITD, respectively), but externalization is only achieved at the extreme sides of the panned image (a sketch of such processing is given at the end of this subsection).

Head-related processing

The only way to create natural 3-D sound for headphone listening is to replicate the human spatial hearing cues. This binaural reproduction is done with the aid of modeled or measured head-related transfer functions (HRTFs). The individual nature of HRTFs, however, may cause localization errors, for example in-head localization and front-back confusions [13]. The processing requirements for early reflections can be greatly reduced compared to those of the direct sound. The incoherence required of the late reverberation in headphone listening is easily achieved using various reverberator designs. Methods for filter design and spatialization in headphone listening are described in more detail in [14, 1].
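A minimal sketch of the combined amplitude and time-delay panning mentioned above follows: the interaural time difference is approximated by a whole-sample delay from a Woodworth-style spherical-head formula, and the level difference by a constant-power gain law. Both formulas and the head radius are simplifying assumptions; the result is lateralization only, with no externalization.

import numpy as np

def headphone_pan(x, azimuth_deg, fs=48000, head_radius=0.0875, c=343.0):
    """Crude ITD/ILD lateralization of a mono signal for headphone playback."""
    az = np.radians(np.clip(azimuth_deg, -90.0, 90.0))     # 0 = front, +90 = right
    # Woodworth-style ITD approximation for a rigid spherical head (assumption).
    d = int(round(head_radius / c * (abs(az) + np.sin(abs(az))) * fs))
    # Constant-power ILD approximation (assumption): pan position in [0, 1].
    pan = 0.5 * (az / (np.pi / 2.0) + 1.0)
    g_left, g_right = np.sqrt(1.0 - pan), np.sqrt(pan)
    x = np.asarray(x, dtype=float)
    lead = np.concatenate([x, np.zeros(d)])                # ear closer to the source
    lag = np.concatenate([np.zeros(d), x])                 # ear farther from the source
    if az >= 0.0:                                          # source on the right
        return np.stack([g_left * lag, g_right * lead])    # rows: [left, right]
    return np.stack([g_left * lead, g_right * lag])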
5.2. Loudspeaker reproduction

Amplitude panning

In amplitude panning, a sound signal is applied to several loudspeakers with different amplitudes. In the simplest case, the signal is applied to two loudspeakers in front of the listener, who perceives a virtual source in a direction that depends on the amplitudes of the loudspeaker signals [15]. Virtual sources can only be produced in directions between the two loudspeakers; with multi-loudspeaker systems, they can be produced over a larger set of directions. These methods are computationally inexpensive, and their complexity depends on the number of loudspeakers used. Each loudspeaker signal arrives at both ears of the listener, and the summed signals form the perception of a virtual source direction (referred to as summing localization) [15]. Drawbacks are the possible coloration caused by the summed sounds, and the fact that outside the best listening position the perceived position of the sound collapses to the nearest loudspeaker with nonzero amplitude, which is caused by the precedence effect [16].

Figure 3: Reproduction of spatialized sound using headphones or loudspeakers.

If all loudspeakers are positioned in the horizontal plane, the setup is considered two-dimensional; an example is the 5.1 loudspeaker setup. For such setups, pair-wise panning or Ambisonics is most often used. In pair-wise panning [17] the sound is panned to one loudspeaker pair at a time, and the configuration (number of loudspeakers and their relative angles) can be arbitrary. The sound is not prominently colored, and outside the best listening position the directional error is limited to the spacing of one loudspeaker pair. The directional error decreases when the number of loudspeakers is increased. Pair-wise panning is thus suited to situations in which good 2-D reproduction quality is needed over a large listening area (a sketch of the gain computation is given at the end of this subsection).

When the loudspeaker setup also includes elevated loudspeakers, it is considered three-dimensional. 3-D vector base amplitude panning (3-D VBAP) [18] can be used in such setups. As a direct generalization of pair-wise panning, 3-D VBAP performs the panning triplet-wise: a sound signal is applied to three loudspeakers that form a triangle from the listener's viewpoint. When more than three loudspeakers are present, the panning is still performed with one triplet at a time. The directional resolution of the system depends on the sizes of the triangles; the larger the number of loudspeakers, the better the directional accuracy over a large listening area. 3-D VBAP is thus suited to creating a spatial impression over a large listening area with arbitrary 3-D loudspeaker setups.

Pantophonic (2-D) and periphonic (3-D) Ambisonics [19] are microphoning techniques that allow reproduction of recorded sound over certain 2-D or 3-D loudspeaker configurations. Ambisonics can also be used for positioning virtual sources by simulating the microphoning technique [20]. However, in Ambisonics the same sound is applied to most or all of the loudspeakers, which generates some timbral artefacts and, due to the precedence effect, large directional errors outside the optimal listening position. Thus it can only be used effectively with one listener.
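As a sketch of the pair-wise panning principle referred to above, the following computes the gains of one loudspeaker pair by expressing the virtual source direction in the vector base formed by the two loudspeaker unit vectors, in the spirit of 2-D VBAP [18]. The example angles and the constant-power normalization are illustrative choices.

import numpy as np

def pairwise_gains(source_az_deg, spk1_az_deg, spk2_az_deg):
    """Solve p = g1*l1 + g2*l2 for the source and loudspeaker unit vectors in the
    horizontal plane, then normalize the gains for constant power. In a complete
    pair-wise panner, the pair yielding non-negative gains is selected."""
    def unit(a_deg):
        a = np.radians(a_deg)
        return np.array([np.cos(a), np.sin(a)])
    L = np.column_stack([unit(spk1_az_deg), unit(spk2_az_deg)])  # base matrix [l1 l2]
    g = np.linalg.solve(L, unit(source_az_deg))                  # unnormalized gains
    return g / np.linalg.norm(g)

# Example: a source at 10 degrees panned between loudspeakers at +30 and -30 degrees.
print(pairwise_gains(10.0, 30.0, -30.0))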

Head-related processing

When spatial hearing cues (i.e., HRTFs) are utilized for loudspeaker rendering of spatial sound, the concept of cross-talk cancellation is of major importance. Cross-talk occurs when signals travel from one loudspeaker to both ears (which is not the case in headphone reproduction). In order to replicate natural three-dimensional hearing cues, cross-talk canceling filtering has to be applied to the original binaural signal. At best, cross-talk canceled binaural reproduction delivers very convincing three-dimensional audio over two loudspeakers, but the effect requires symmetric listening conditions and placement of the listener (see [21] for details on cross-talk cancellation). Head-related processing for loudspeakers can be carried out in a pair-wise manner; designs for various loudspeaker layouts (including four-channel and larger) have been presented in [22]. Cross-talk canceled binaural rendering is suitable for applications where only a limited number of loudspeakers (in most cases two) is used, for example PCs, game consoles, and home theatre systems. Converting binaural headphone rendering to loudspeaker listening is straightforward, since it requires only a cross-talk canceler and possibly a correlation controller.

Wave field synthesis

Wave field synthesis (WFS) is a technique for virtual source positioning that aims at reproducing the original wave field accurately. Using WFS, an equal direction perception can be achieved over a large listening area [23]. Ideally, the surfaces of the listening space are covered with planes of loudspeakers that are fed with the source signal so as to reproduce the wave field of a virtual source. The loudspeaker spacing must be proportional to the wavelength of the reproduced sound (Δx < λ/2 to avoid spatial aliasing). The finite size of the loudspeaker arrays also restricts the quality of the reproduced field. Linear loudspeaker arrays have been studied for reproducing the sound field in the horizontal plane [24], and in the context of BIFS audio scene rendering in [25].
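The spatial-aliasing condition quoted above (Δx < λ/2) corresponds to an upper frequency limit f_alias = c / (2 Δx) for correct reproduction. A small worked example with an assumed loudspeaker spacing follows.

# Spatial aliasing limit of a WFS loudspeaker array: wavelengths shorter than
# twice the loudspeaker spacing cannot be reproduced correctly.
c = 343.0        # speed of sound in m/s
delta_x = 0.15   # assumed loudspeaker spacing in metres (illustrative value)
f_alias = c / (2.0 * delta_x)
print(f"aliasing-free reproduction up to about {f_alias:.0f} Hz")  # roughly 1140 Hz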
6. CONCLUSIONS

In this paper we overviewed the spatial sound presentation in MPEG-4 BIFS and discussed reproduction technologies and processing methods that can be used to present the spatial sound. In both the physical and the perceptual approach to sound scene design in BIFS, part of the environmental sound effects can only be obtained by spatial processing that depends on the sound reproduction setup and on the mode of interaction used for navigation in the MPEG-4 terminal. While the two approaches parametrize the overall response of a virtual acoustic scene differently, both can be implemented with similar DSP filter structures. The choice of reproduction method depends on the available resources, including the loudspeaker setup, the available processing power, and the requirements on the listening area and the interaction mode.

ACKNOWLEDGEMENTS

This work has been supported by the Academy of Finland through the Pythagoras Graduate School, the Graduate School of Electronics, Telecommunications and Automation (GETA), HPY:n Tutkimussäätiö, and Nokia Research Center.

REFERENCES

[1] L. Savioja, J. Huopaniemi, T. Lokki, and R. Vaananen. Creating Interactive Virtual Acoustic Environments. J. Audio Eng. Soc., 47(9), 1999.
[2] ISO/IEC 14772-1. The Virtual Reality Modeling Language (VRML97). Apr. 1997. url: http://
[3] Java 3D 1.2 API. 2000. url: http://
[4] R. Koenen. MPEG-4: Multimedia for Our Time. IEEE Spectrum, 36(2):26-33, Feb. 1999.
[5] W. Swaminathan, R. Vaananen, G. Fernando, D. Singer, W. Bellknap, and L. Young-Kwon. ISO/IEC 14496-1/FDAM1, Final Draft Amendment 1 of MPEG-4 Systems. Output doc. w3054 of the 50th MPEG meeting, Maui, Hawaii, Dec. 1999.
[6] R. Vaananen and J. Huopaniemi. Virtual Acoustics Rendering in MPEG-4 Multimedia Standard. In Proc. Int. Comp. Music Conf. (ICMC'99), pages 585-588, Beijing, Oct. 1999.
[7] R. Vaananen and J. Huopaniemi. Spatial Processing of Sounds in MPEG-4 Virtual Worlds. To be presented at the X European Signal Processing Conference (EUSIPCO), Tampere, Finland, Sep. 5-8, 2000.
[8] K. Brandenburg, O. Kunz, and A. Sugiyama. MPEG-4 Natural Audio Coding. Signal Processing: Image Communication, (15). Special Issue on MPEG-4.
[9] E. D. Scheirer, Y. Lee, and J.-W. Yang. Synthetic and SNHC Audio in MPEG-4. Signal Processing: Image Communication, (15), 2000. Special Issue on MPEG-4.
[10] E. D. Scheirer, R. Vaananen, and J. Huopaniemi. AudioBIFS: Describing Audio Scenes with the MPEG-4 Multimedia Standard. IEEE Trans. Multimedia, 1(3):237-250, 1999.
[11] J. Signes. MPEG-4 Binary Format for Scene Description. Signal Processing: Image Communication, (15). Special Issue on MPEG-4.
[12] J.-M. Jot. Real-time Spatial Processing of Sounds for Music, Multimedia and Interactive Human-Computer Interfaces. Multimedia Systems, 7:55-69, 1999.
[13] D. Begault. 3-D Sound for Virtual Reality and Multimedia. Academic Press, Cambridge, MA, USA, 1994.
[14] J. Huopaniemi, N. Zacharov, and M. Karjalainen. Objective and Subjective Evaluation of Head-Related Transfer Function Filter Design. J. Audio Eng. Soc., 47(4):218-239, 1999.
[15] J. Blauert. Spatial Hearing, Revised edition. The MIT Press, Cambridge, MA, USA, 1997.
[16] P. M. Zurek. The Precedence Effect. In W. A. Yost and G. Gourevitch, editors, Directional Hearing, pages 3-25. Springer-Verlag, 1987.
[17] J. Chowning. The Simulation of Moving Sound Sources. J. Audio Eng. Soc., 19(1):2-6, 1971.
[18] V. Pulkki. Virtual Source Positioning Using Vector Base Amplitude Panning. J. Audio Eng. Soc., 45(6):456-466, Jun. 1997.
[19] M. A. Gerzon. Periphony: With-Height Sound Reproduction. J. Audio Eng. Soc., 21(1):2-10, 1972.
[20] D. G. Malham and A. Myatt. 3-D Sound Spatialization using Ambisonic Techniques. Comp. Music J., 19(4):58-70, 1995.
[21] W. G. Gardner. 3-D Audio Using Loudspeakers. Kluwer Academic Publishers, Boston, 1998.
[22] J. L. Bauck and D. H. Cooper. Generalized Transaural Stereo and Applications. J. Audio Eng. Soc., 44(9):683-705, 1996.
[23] A. J. Berkhout, D. de Vries, and P. Vogel. Acoustic Control by Wave Field Synthesis. J. Acoust. Soc. Am., 93(5), May 1993.
[24] M. M. Boone, E. N. G. Verheijen, and P. F. van Tol. Spatial Sound-Field Reproduction by Wave-Field Synthesis. J. Audio Eng. Soc., 43(12):1003-1012, Dec. 1995.
[25] ThreeDSpace. The Integrated Systems Laboratory, EPFL. url: projects activities/3Dspace/3D web.htm