Analysis and Synthesis of Room Acoustics Using the MPEG-4 Scene Description

Riitta Väänänen
Helsinki University of Technology, Laboratory of Acoustics and Audio Signal Processing

Abstract

This paper describes how existing acoustic spaces are modeled with the aid of the MPEG-4 Scene Description tool. The Binary Format for Scenes (BIFS) of the MPEG-4 standard includes a set of functionalities that enable building up 3-D virtual worlds. In this framework sound sources can also be included in MPEG-4 scenes, and sound propagation in a 3-D space can be taken into account by associating acoustic features with the scene. These acoustic features can be used to increase naturalness and immersion in audiovisual virtual reality applications, and to add room acoustic post-processing effects to sound in audio-only applications. This paper studies the possibility of extracting acoustic parameters from measured room impulse response data for describing and rendering a virtual acoustic space in MPEG-4 BIFS. Two measured spaces described in BIFS are presented as examples, and the obtained virtual models are demonstrated with a real-time MPEG-4 decoder.

1 Introduction

MPEG-4 is a standard for compression and presentation of audio and visual data for interactive multimedia applications. In addition to defining the technologies for data coding (which was the main scope of MPEG-1 and MPEG-2), MPEG-4 includes a scene description language that is used for terminal-side composition of the decoded audio and visual data that form the final multimedia presentation (Koenen 1998). This scene description language is called the Binary Format for Scenes, or BIFS, and among other functionalities it includes a set of nodes that is a superset of the Virtual Reality Modeling Language (VRML) (ISO 1997). Thus MPEG-4 can also be used for building up scenes where audio and visual data objects form a 3-D virtual world (Signes 2000). However, the BIFS nodes that are used for audio composition (called the AudioBIFS nodes) are characteristic only to MPEG-4; they are meant for flexible compositing of decoded sound streams and for presenting sounds spatially in 2-D and 3-D scenes (Scheirer, Väänänen, and Huopaniemi 1999).

Enhanced spatial presentation of sounds in MPEG-4 virtual scenes is enabled by a set of nodes called the Advanced AudioBIFS, which can be used to add Virtual Acoustic Environment (VAE) modeling to scenes (Väänänen and Huopaniemi 1999). These features may include modeling of the sound source directivity, the effect of the medium where the sound travels, and the directional hearing of the listener. The propagation medium can be assigned properties that appear in natural reverberating spaces such as rooms or concert halls. At the rendering stage the relative positions of the sound sources and the listener, as well as the acoustic scene properties, are taken into account when processing the sound before the final playback. In dynamic and interactive scenes the user can usually navigate in the environment (or the sound sources may move), and in the rendering stage of virtual acoustics this means that the user perceives the varying acoustics according to the movements (Savioja, Huopaniemi, Lokki, and Väänänen 1999).

In the following we describe how MPEG-4 Advanced AudioBIFS parameters are extracted from measured room impulse response data.
Measurements were done in two different enclosed, reverberating spaces, where the response was measured for several source-receiver position setups. Room acoustic simulations were carried out in a reference software platform of MPEG-4, according to parameters extracted from the measured responses.

2 Describing Room Acoustics in BIFS

Advanced AudioBIFS allows two approaches for describing the room acoustic response in a virtual environment. One is called the physical approach, where the sound propagation and thus the acoustic response depend on the geometrical configuration of sound-reflecting (or sound-obstructing) walls. The other is named the perceptual approach, where control parameters are based on perceptually relevant characteristics of the room acoustic response. The former approach is useful for example in audiovisual applications where the visual objects (e.g., walls of a room) are also given acoustic properties such as sound reflectivity. Thus, when sound sources and a virtual listening point are placed in such a scene, the direct sound and the reflections automatically correspond to the visual scene. In the perceptual approach, however, the perceived sound response does not strongly relate to visual or any other objects in the scene but only depends on the relative positions of the virtual sources and the listener. Therefore it is typically used in applications where an acoustic effect is added to sound and the scene does not necessarily even contain visual data. This is the case, e.g., when post-processing music with 3-D room acoustic effects that are controlled with parameters that intuitively describe the perceived room effect (Jot 1999).

2.1 Physical and Perceptual Modeling of Sound Scenes

In MPEG-4 the above schemes are enabled with the Advanced AudioBIFS nodes DirectiveSound, AcousticScene, AcousticMaterial, and PerceptualParameters, described in (Väänänen, Huopaniemi, and Pulkki 2000). DirectiveSound is used as a spatial sound source model in both the physical and the perceptual approaches. AcousticMaterial and AcousticScene are used only in the physical approach. AcousticMaterial is used for giving frequency dependent sound reflectivity and obstruction properties to flat, polygonal surfaces that can be used to configure a room geometry. AcousticScene is used for defining late reverberation, characterized by the frequency dependent reverberation time and the delay and level of the late reverb. The reverberation can be added to the acoustic response created by the first (geometry-driven) reflections to obtain a more complete and natural impression of an enclosed, reverberating space.

In the perceptual approach, the PerceptualParameters node is used for giving a room effect to each sound source irrespective of the geometry of the surrounding scene. It contains a set of parameters that describe the frequency dependent energy content and relations of the direct sound, the early reflections, and the late part of the room acoustic response. The parameters that define the energy and the frequency contents of the early room effect (direct sound and early reflections) are called source presence, source warmth, and source brilliance. Late reverberance, heaviness, and liveness define the late reverberation time and its variation with frequency. Running reverberance is a parameter that corresponds to the early decay time in the early part of the response, envelopment is the ratio between the direct sound and the early reflections, and finally room presence defines the total energy of the late reverberation.
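To make the role of these controls concrete, the following Python sketch groups them into a single structure attached to one source. This is an illustrative data layout only, with hypothetical field names and default values; it does not reproduce the normative syntax of the PerceptualParameters node.

```python
from dataclasses import dataclass

@dataclass
class PerceptualRoomEffect:
    """Illustrative grouping of the perceptual room-effect controls
    described above; attached to one sound source and independent of
    the scene geometry. Names and defaults are hypothetical."""
    source_presence: float = 1.0       # energy of direct sound and early room effect
    source_warmth: float = 1.0         # low-frequency emphasis of the early energy
    source_brilliance: float = 1.0     # high-frequency emphasis of the early energy
    late_reverberance: float = 1.5     # mid-frequency late reverberation time (s)
    heaviness: float = 1.0             # variation of the late RT at low frequencies
    liveness: float = 1.0              # variation of the late RT at high frequencies
    running_reverberance: float = 0.4  # early decay time of the response (s)
    envelopment: float = 0.5           # ratio between direct sound and early reflections
    room_presence: float = 0.3         # total energy of the late reverberation
```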

3 Measurement of Room Acoustics

Figure 1 shows the different steps of modeling room acoustics according to measured data. During this study, impulse responses of two rooms of different size and shape were measured in order to build up corresponding virtual models according to physical and perceptual descriptions of the room acoustics. Both rooms have a simple geometry and fairly hard walls (with strong reflectivity), so that the first reflections were easy to identify in the measured response when comparing with the synthesized one. For subjective comparison between the two virtual acoustic models, and between the virtual models and the original measured setup, real-head recordings of musical sounds were also made with the same source-receiver positions as for the impulse response measurements.

In the small room, responses were measured with two different locations of the receiver with respect to the sound source. For both recording positions the source was oriented in four different directions to recognise the effect of the source directivity on the magnitude of the reflections in the early part of the response. In the large room (a corridor significantly longer in one direction compared to the two others) special attention was given to measuring the impulse responses at different source-receiver distances. This was done to obtain a noticeable effect in the source presence related parameters.

For the room acoustic simulations, the directivity of the sound source (in this case a loudspeaker) was included as a source property of the VAE modeling process in both the physical and the perceptual approach. The directivity of the loudspeaker was measured in anechoic conditions for eleven azimuth angles, starting from the front of the loudspeaker up to 180 degrees. First order IIR filters were designed to match this data in the frequency magnitude domain, in the least mean square error sense.
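As a sketch of this design step, a first order IIR filter can be fit to one azimuth's measured magnitude data with a generic least-squares optimizer. The function below is an assumption of how such a fit could be done in Python/SciPy; it is not the authors' actual tool chain.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.signal import freqz

def fit_first_order_iir(freqs_hz, target_mag, fs=48000.0):
    """Fit a first order IIR filter H(z) = (b0 + b1 z^-1)/(1 + a1 z^-1)
    to a measured magnitude response in the least-squares sense.
    freqs_hz / target_mag: measured directivity data for one azimuth."""
    w = 2.0 * np.pi * np.asarray(freqs_hz, float) / fs  # rad/sample
    target_mag = np.asarray(target_mag, float)

    def residual(p):
        b0, b1, a1 = p
        _, h = freqz([b0, b1], [1.0, a1], worN=w)
        return np.abs(h) - target_mag

    # Constrain |a1| < 1 to keep the fitted filter stable.
    res = least_squares(residual, x0=[1.0, 0.0, 0.0],
                        bounds=([-2, -2, -0.99], [2, 2, 0.99]))
    return res.x  # b0, b1, a1

# Hypothetical data for one azimuth angle, for illustration only:
# b0, b1, a1 = fit_first_order_iir([125, 500, 2000, 8000], [1.0, 0.9, 0.6, 0.3])
```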
For the listening experience, and for subjectively comparing the recorded sound samples with the ones processed with the virtual acoustic responses, a binaural rendering setup was used. This means that in the case of both the physical and the perceptual approach, the virtual listener model was achieved by HRTF filtering and headphone playback.

4 Analysis of the Room Acoustic Data

The analysis of the monophonic impulse responses in this study included extracting time and frequency domain parameters that were further used to control the room acoustic modeling process in both the physical and the perceptual approaches. In both approaches the late reverberation is characterized by a reverberation time that depends on frequency, and by the total energy of the late reverberation. In the case of the perceptual approach, the energies of the early parts of the responses (namely, the direct sound, directional early reflections, and diffuse early reflections) were analyzed in three frequency bands. In the framework of BIFS these parameters can be written as properties (or fields) of the nodes that form the room acoustic part of an audio or audiovisual scene.
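The paper does not spell out the reverberation time estimator, so the sketch below assumes a standard choice: Schroeder backward integration of a band-filtered impulse response, followed by a line fit to the decay curve that is extrapolated to a 60 dB decay.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def rt60_in_band(ir, fs, f_lo, f_hi):
    """Estimate the reverberation time in one frequency band from a
    measured impulse response: band-filter, form the Schroeder energy
    decay curve, and fit a line to its -5 dB .. -35 dB portion."""
    sos = butter(4, [f_lo, f_hi], btype='bandpass', fs=fs, output='sos')
    h = sosfilt(sos, np.asarray(ir, float))
    edc = np.cumsum(h[::-1] ** 2)[::-1]            # Schroeder backward integral
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)  # normalized decay curve, dB
    t = np.arange(len(h)) / fs
    mask = (edc_db <= -5.0) & (edc_db >= -35.0)     # usable part of the decay
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)  # decay rate, dB/s
    return -60.0 / slope                            # time to decay by 60 dB

# Repeating this per octave band yields the frequency dependent
# reverberation time used in both modeling approaches.
```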

In the case of the physical approach, the early reflections are derived from the geometrical model (encoded with the aid of the AcousticMaterial nodes that give acoustic properties to polygons); thus the measured impulse responses do not offer enough information for this type of modeling.

5 Encoding and Modeling of Room Acoustics in BIFS

In the perceptual approach to room acoustics modeling, the frequency dependent energy relations analyzed from the responses can be transformed into attributes of the PerceptualParameters node (associated with a sound source object). In the physical approach, only the parameters controlling the late reverberation characteristics can be obtained from the measured responses. This is because the late reverberation is considered diffuse and irrespective of the detailed geometrical configuration of the room. The late reverberation in the physical approach is thus meant for completing the room response given by the geometrical early reflections.

5.1 Simulating the Early Part of the Response

In the case of the physical approach to room acoustics rendering, the geometries of both measured spaces were represented as polygon models in the BIFS scene description. Each polygon was given an acoustic reflectivity (with the help of the AcousticMaterial node) so that the first reflections off the walls could be computed with the image source method (Allen and Berkley 1979). The material filters were designed to fit measured reflectivity data in the least square error sense, and the filter coefficients (of first order IIR digital filters) were associated with each reflective surface. The image source method gives the delay and direction of each geometrically modelled reflection. The delayed version of the sound is thus filtered by the given reflectivity filter at the rendering stage. The information about the direction of the reflection is used to spatialize the sound, i.e., to take into account the user orientation with respect to the source of sound.

In the perceptual approach, no information about the time and spatial distribution of the early reflections is encoded as a part of the scene description content. They are merely described by their energy and frequency contents, and even the number of the reflections (affecting the quality of the modeled response) cannot be defined. As this approach is based on the Spatializer system (Jot and Warusfel 1995), it is recommended in MPEG-4 that the early reflections are panned on both sides of the direct sound as seen from the listener's location. In this study the reflections in the perceptual modeling were randomly distributed approximately within the same time interval after the direct sound as in the physical approach, to make the two modeling schemes comparable. A common filter for all the early reflections in this approach was designed to fit the frequency dependent energy that resulted from the analysis of this part of the response, encoded in the PerceptualParameters node in three frequency bands.
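For an intuition of what the geometric step of the physical approach yields, the sketch below computes the six first order image sources of an axis-aligned rectangular ("shoebox") room, returning the delay and arrival direction of each reflection; at rendering time each pair drives one delayed, reflectivity-filtered, and spatialized copy of the source signal. Actual BIFS geometries are general polygon models, so the shoebox form is a simplifying assumption for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def first_order_images(src, lst, room_dims):
    """First order image sources for an axis-aligned shoebox room with
    corners at the origin and at room_dims. Returns, per reflection,
    the propagation delay (s) and the arrival direction (unit vector
    pointing from the listener toward the image source)."""
    src = np.asarray(src, float)
    lst = np.asarray(lst, float)
    reflections = []
    for axis in range(3):
        for wall in (0.0, room_dims[axis]):
            img = src.copy()
            img[axis] = 2.0 * wall - img[axis]   # mirror source across the wall
            dist = np.linalg.norm(img - lst)     # image-to-listener path length
            reflections.append((dist / SPEED_OF_SOUND,   # delay of the reflection
                                (img - lst) / dist))      # direction of arrival
    return reflections
```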
5.2 Late Part of the Response

The late part of the room reverberation in the virtual model was obtained in the following way. In the physical approach, the reverberation time was computed from the measured responses in octave bands. A DSP reverberator with an exponentially decaying response was controlled with this frequency dependent reverberation time information, which was included in an AcousticScene node definition. The delay (of the first output) of the reverberator was set so that the late reverb starts approximately after the physically computed early reflections. Thus this delay depends on the physical dimensions of the room, and in this study it was given different values in the different rooms. The gain of the reverberator output was set so that the total energy of the late reverb in the virtual model matched the energy of the measured response after a time that corresponds to the end of the early reflections in the response.

In the perceptual approach, the late reverberation time is similarly computed in different frequency bands as in the physical approach, but the control information is given to PerceptualParameters by mapping this data to the late reverberance, heaviness, and liveness parameters.
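The mapping from a target reverberation time to an exponentially decaying recursive structure is standard: a comb filter with a loop delay of D samples circulates the signal fs*T60/D times in T60 seconds, so a feedback gain g satisfying 20*log10(g)*fs*T60/D = -60 produces the desired decay. The sketch below applies this relation; the actual reverberator structure in the MPEG-4 reference software is not detailed in the paper, so this single-comb form is an assumption.

```python
import numpy as np

def comb_feedback_gain(loop_delay_samples, t60_s, fs=48000.0):
    """Feedback gain giving a 60 dB exponential decay in t60_s seconds
    for a recursive comb filter with the given loop delay. Using a
    different gain per octave band realizes the frequency dependent
    reverberation time written into the AcousticScene node."""
    return 10.0 ** (-3.0 * loop_delay_samples / (fs * t60_s))

def comb_reverb(x, loop_delay_samples, g):
    """One exponentially decaying comb section: y[n] = x[n] + g*y[n-D]."""
    y = np.asarray(x, float).copy()
    for n in range(loop_delay_samples, len(y)):
        y[n] += g * y[n - loop_delay_samples]
    return y
```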

6 Conclusions

The virtual acoustic responses were compared with the recorded ones by calculating the same room acoustic parameters from both. Obtaining nearly the same parameters in the perceptual approach is quite a straightforward task, as the time-frequency distribution of the room effect relies on statistical parameters. This applies to the late reverberation part of the response in the physical approach, too. However, the early part of the response in the latter approach differs from the measured response, as the geometrical simulations, generating specular reflections independent of the reflection angle, offer a very simplified version of the early reflections. Also, according to the requirements in the MPEG-4 standard (14496 1999), no more than the first order reflections were generated in the simulation software (the MPEG-4 Systems reference software).

7 Future Work

Figure 2 briefly presents the basic approaches for carrying out an evaluation of room acoustic modeling. The future work related to this study includes carrying out a subjective evaluation of the responses encoded in MPEG-4 format and synthesized in an MPEG-4 decoder (see (?) for a framework for evaluating a geometrically modeled room in a dynamic situation).

Figure 1: The measurement - analysis - synthesis procedure applied to room acoustics modeling. (Block diagram: measuring monophonic room impulse responses at several source-receiver positions; analysis of the frequency dependent energy of the early and diffuse parts of the response, and reverberation time analysis of the late response yielding frequency dependent RT60 data; synthesis/modeling, either perceptual modeling generating a response matching the perceptual criteria, or physical modeling simulating sound propagation in the medium.)

Figure 2: Two approaches for evaluation of modeled room acoustics. (Objective: extracting the same parameters from the artificial responses as in the analysis phase. Subjective: listening tests and dynamic simulations, i.e., navigation in the virtual environment.)

8 Acknowledgements

The author thanks Mr Tapio Lokki for providing the surface material reflectivity filter and loudspeaker measurement data used in the simulations carried out during this work.

References

14496, I. (1999). International Standard (IS) 14496: Information technology - Coding of audiovisual objects (MPEG-4).

Allen, J. B. and D. A. Berkley (1979, April). Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America 65(4), 943-950.

IS 14772-1 (1997, April). ISO/IEC JTC/SC24 IS 14772-1: The Virtual Reality Modeling Language (VRML97). Information technology - Computer graphics and image processing - The Virtual Reality Modeling Language (VRML) - Part 1: Functional specification and UTF-8 encoding.

Jot, J.-M. (1999). Real-time spatial processing of sounds for music, multimedia and interactive human-computer interfaces. Multimedia Systems 7, 55-69.

Jot, J.-M. and O. Warusfel (1995). A real-time spatial sound processor for music and virtual reality applications. In Proceedings of the International Computer Music Conference.

Koenen, R. (1998, October). Overview of the MPEG-4 Standard. Output Document no. N2459 of the 45th MPEG meeting, Atlantic City. url: mpeg-4.htm.

Savioja, L., J. Huopaniemi, T. Lokki, and R. Väänänen (1999). Creating interactive virtual acoustic environments. Journal of the Audio Engineering Society 47(9).

Scheirer, E. D., R. Väänänen, and J. Huopaniemi (1999, September). AudioBIFS: Describing audio scenes with the MPEG-4 multimedia standard. IEEE Transactions on Multimedia 1(3), 237-250.

Signes, J. (2000). MPEG-4 binary format for scene description. Signal Processing: Image Communication 15. Special issue on MPEG-4.

Väänänen, R. and J. Huopaniemi (1999, October). Virtual acoustics rendering in MPEG-4 multimedia standard. In Proceedings of the International Computer Music Conference ICMC'99, Beijing, pp. 585-588.

Väänänen, R., J. Huopaniemi, and V. Pulkki (2000, August). Comparison of sound spatialization techniques in MPEG-4 scene description. In Proceedings of the International Computer Music Conference, Berlin, pp. xx-xx.