Snd3D; a 3D sound system for VR and interactive applications

Peter Lundén
The Interactive Institute
The Studio of Emotional and Intellectual Interfaces
Box 24 081, S-104 50 Stockholm, Sweden
Email: peter.lunden(

ABSTRACT

Snd3D is a software-based real-time 3D sound system built on ambisonics (Gerzon, 1980) and reproduction over loudspeakers, aimed at VR and interactive applications. The system performs real-time simulation of direction, distance and movement as well as of the acoustic environment. It is implemented in PD (Puckette, 1996), a graphical programming environment for audio and MIDI processing. One of the novelties of Snd3D is its ability to represent and simulate environmental sounds.

BACKGROUND

We believe that the aural side of VR is underdeveloped compared to the visual side. People very often underestimate the importance of the aural domain in achieving a strong immersive experience. One explanation of this may be found in how we interpret our mental "image" of the reality surrounding us. We tend to think that the information used to build this "image" comes from the visual domain even when it has come from other senses.

Several 3D sound systems have been developed, but none of the systems that I know of has approached the problem of simulating environmental sounds and outdoor acoustics. The lack of such capabilities makes it difficult to use those systems in many types of applications, such as the modelling of townscapes.

DESIGN GOALS

The vision for Snd3D is to develop a complete aural VR system capable of placing many sources anywhere on the full three-dimensional sphere as well as simulating the acoustics of the modelled virtual space. We did not want Snd3D to depend on any particular system or hardware. The pace of hardware development is very high, and expensive special-purpose hardware becomes obsolete in a very short time.
That, combined with the fact that standard computers are now beginning to be powerful enough to handle a full-featured 3D sound system, led us to the conclusion that it was best to develop a software-based system.

We give priority to a loudspeaker-based system over a headphone-based system for two reasons. Firstly, we would like the system to be usable in situations with more than just a few listeners. A headphone-based system needs to track the position and orientation of each individual listener, and a large part of the computation has to be done separately for every listener, so such a system is both computationally very expensive and dependent on extensive tracking equipment. Secondly, we judge the possibility of communication among the audience to be very important and did not want to spoil it, which a headphone-based system would do. What we may lose is the higher perceptual accuracy that a headphone-based system is supposed to give compared to a loudspeaker-based system, but this has not been scientifically verified and remains to be proved.

Ambisonics is a surround-sound system for reproduction over loudspeakers. It was chosen as the basic 3D sound technology because, unlike most other loudspeaker-based systems, it is able to place sounds over a 360-degree horizontal sound space as well as over the full three-dimensional sphere. It also has the advantage of a compact intermediate format, called B-format, that allows important manipulations to be performed, such as rotation of soundfields. Ambisonics also opens up an interesting possibility to represent environmental sounds, which will be discussed later in this paper.

Vector base amplitude panning (Pulkki, 1997) was also considered. It has many nice qualities, but we did not choose it because we thought the possibility in ambisonics to represent environmental sounds was too appealing to lose.
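
As an illustration of how a source direction maps onto the B-format, the following Python sketch encodes one mono sample using the conventional first-order equations, with W scaled by 1/sqrt(2). The function name and the angle convention (azimuth counter-clockwise from straight ahead, elevation upwards, both in radians) are assumptions made for this example; they are not taken from the Snd3D implementation, which is built from PD objects.

```python
import math


def encode_b_format(sample, azimuth, elevation):
    """Encode one mono sample into first-order B-format (W, X, Y, Z).

    Assumed convention: X points forward, Y to the left, Z upwards,
    and W carries the omnidirectional part attenuated by 1/sqrt(2).
    """
    cos_el = math.cos(elevation)
    w = sample / math.sqrt(2.0)                    # omnidirectional component
    x = sample * math.cos(azimuth) * cos_el        # front/back component
    y = sample * math.sin(azimuth) * cos_el        # left/right component
    z = sample * math.sin(elevation)               # up/down component
    return (w, x, y, z)


# A source straight ahead of the listener excites only W and X.
front = encode_b_format(1.0, 0.0, 0.0)
# A source directly to the left excites W and Y.
left = encode_b_format(1.0, math.pi / 2.0, 0.0)
```

Because the encoding is linear, several encoded sources can be mixed simply by summing their W, X, Y and Z channels.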
PD, which is a graphical programming environment for audio and MIDI processing, was chosen as the platform for the implementation of Snd3D mainly because it is open source and available on several platforms. We found that PD allows us to build prototypes and test ideas very quickly.

PORTING

Snd3D currently runs on SGI IRIX and on Linux. The system was originally developed on an SGI Octane running IRIX 6.5. My experience of using this platform for this purpose is very good, and I would like to point out two things that are particularly good. The first is that the Octane has 8-channel digital audio I/O as standard, which is very well integrated in the system. The audio sub-system is very easy to program using the native audio library. The second is the real-time scheduling priorities available in the OS, which make it possible to get glitch-free sound even when the system is quite heavily loaded. I really learned to appreciate these features of IRIX when I was porting Snd3D to Linux.

The worst problem in porting Snd3D to Linux was the lack of support for multi-channel soundcards. Currently there is no working standard supporting multi-channel soundcards in Linux. The closest you can get is the Advanced Linux Sound Architecture (ALSA), but its multi-channel support is not fully standardised yet. I found ALSA to be quite messy to use. It has some cumbersome and unnecessary complexity with too many layers, something that has obviously been borrowed from Windows. The situation became even worse when I tried to get the soundcard running, an M-Audio Delta 1010 that is supposed to have an ALSA driver. After two days of work trying to get the card running, I found out that the driver was only capable of playing 10 channels at once, not

1, 2, 4 or any other number of channels, only 10.

Another problem with the Linux port is that it is sensitive to the workload of the system, and glitches often occur in the sound when the workload is high. What the causes are remains to be investigated, but it might have helped if Linux had some real-time support similar to that in IRIX.

REPRESENTATION OF ENVIRONMENTAL SOUNDS

An important but often neglected aspect of 3D sound systems is the ability to model and represent environmental sounds. Not much research has been done on how to represent and model such sounds in real-time 3D sound systems. What is needed is a system that can represent and simulate the environmental sounds at every position in the modelled virtual space. Imagine, for instance, that we want to model a townscape with traffic noise. When we walk around in this virtual town the traffic noise will change gradually from one place to another, and it will change radically when we round a corner. How can this be modelled? In most 3D sound systems this will be very difficult to achieve. To make things easier we need a special form of representation to handle this type of sound.

Environmental sounds can be divided into two parts. One part consists of sounds that have direction and can be assumed to come from a point source. The other part consists of semidiffuse or diffuse soundfields that lack, or almost lack, directional information. The first part can be modelled by traditional methods. The second part needs more attention.

In most 3D sound systems the sound sources are assumed to be point sources. Diffuse or semidiffuse soundfields cannot, by definition, be considered as coming from a point source. Nor can they be represented as a number of distinct point sources, as they are usually highly complex and it would require a large number of distinct point sources to achieve an acceptable result.
The large number of sound sources required would be too computationally expensive to be practically feasible. Think, for instance, of trying to model the traffic noise mentioned above. To do this we would need to simulate the traffic and the acoustics for a considerable area around the listener's current position in the virtual town. This would be too computationally intensive to do in real time. We need a representation that allows us to perform most of the calculations off-line.

The use of ambisonics offers an interesting possibility to represent semidiffuse soundfields. The ambisonic B-format signal (Gerzon, 1980) can be thought of as a representation of a soundfield at a single point. There is a slight problem with this: the soundfield must be rotated in order to align the orientation of the soundfield with that of the listener. Luckily the B-format signal has the nice property that it is computationally quite cheap to rotate the soundfield represented by the signal.

We propose a representation of semidiffuse soundfields that we call interpolated soundfields. An interpolated soundfield represents the semidiffuse soundfield at each position in a 3D space. This is accomplished by sampling the space on a grid where the soundfield at each sampling point is represented by a B-format signal. The soundfield at intermediate positions is calculated by interpolating between the B-format signals at the surrounding grid points. The soundfield samples can either be recorded with a soundfield microphone (Smith, 1979), in the case of modelling an existing space, or pre-calculated off-line from an acoustic model of the space.

Snd3D has an interpolation mechanism that allows the sampling grid to be irregular: the position of each sampling point can be defined separately. An interpolated soundfield is described in a text file with one line for each sampling point in space, containing its position, its geometry, and which sound file to use.
The geometry describes the range of space where the sample is to be considered and its interpolation properties. Two boxes surround each sample point, one inside the other. The outer box is the bound of the influence the sample point can have on its surroundings. The sample point only influences the result for positions inside this box. For positions outside the outer box the sound level of the soundfield associated with the sample point is always zero. Inside the inner box it is assumed that the only sample point that influences the result is the sample point owning the box, and the sound level is always one. Between the inner and the outer box the sound level is determined by an interpolation function, which is designed to conserve the energy so that the overall sound level does not fluctuate during interpolation. The sounds of all sample points that have an influence on a position are mixed together. Ideally the sample points are positioned so that the outer box of one sample point touches the inner box of another.

SYSTEM ARCHITECTURE

Snd3D is built as a number of modules. The system architecture is designed to make the system flexible and to facilitate experimentation and rapid prototyping. Each module consists of a number of PD-objects. The major modules are described below.

Kernel

The kernel module handles the flow of signals as well as of control information between the other modules. All the other modules are inside this module.

External communication

The external communication module handles all communication with the outside world. Snd3D can be considered as a client/server system. The server handles all sound computation and plays the result over its audio I/O. The client is typically a VR application that hosts the simulation and sends commands to the sound server to play the required sounds. A client connects to a server and controls it by passing messages to it using an API. Communication takes place over either TCP or UDP.
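
The message exchange between a client and the sound server could look roughly like the following Python sketch. It assumes that the server accepts semicolon-terminated text messages of the kind PD's netreceive object understands; the paper does not specify the wire format, so the message layout, the selector name "play", and the host, port and file name in the usage note are all hypothetical.

```python
import socket


def format_message(selector, *args):
    """Build one text message: space-separated atoms ending in ';' and newline.

    This layout is an assumption; the real Snd3D protocol is not documented here.
    """
    atoms = [selector] + [str(a) for a in args]
    return (" ".join(atoms) + ";\n").encode("ascii")


def send_message(host, port, selector, *args):
    """Send one message to the sound server as a single UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_message(selector, *args), (host, port))
```

A client might then call `send_message("localhost", 3000, "play", 1, "traffic.wav")` to ask a hypothetical server to start a sound. UDP keeps latency low at the cost of possible message loss; a TCP variant would use a connected stream socket instead.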
Currently two APIs are implemented, one for C and one for Java. The Java API has been used to control Snd3D from VRML2.

Source

The source module handles different types of sound sources. A sound source and all its properties are represented by one or more PD-objects. As there is no

dynamic creation of source objects in Snd3D, there must be as many source objects as there are simultaneous sound sources. Sound source objects are allocated using a voice allocation algorithm (described later). There are two major types of sound sources: point sources, which represent sound sources that can acoustically be considered as point sources, and ambient sources, which represent ambient sounds and semidiffuse soundfields using the interpolated soundfield technique described above. Ambient sources are implemented using two objects, one for the definition and one for active sources.

The point source object represents a sound source, its position and other geometric information related to the source. The sound can come either from a file or from a direct input such as a microphone. The position of the source is computed relative to the listener. Distance is simulated by the intensity and the delay of the sound. By default the intensity is inversely proportional to the square of the distance. This is what happens in the physical world, but arguments for other relations can be found in the literature (Begault, 1994). Other functions can easily be defined to set up the dependency between distance and intensity. A finite delay line simulates the delay and allows the delay time to be changed dynamically. This automatically simulates Doppler shift without any additional computation. Direction is simulated by encoding the signal into the ambisonic B-format. After the encoding the signal is sent out on the B-bus.

The ambient source definition object describes an interpolated soundfield sample. There is one object for each sample point. A source definition is triggered when a listener enters the field of influence of the soundfield sample. When a source definition is triggered it requests an ambient source activation object.
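
The field of influence corresponds to the outer box of the sample point, and the level inside it follows the inner/outer box geometry described earlier. The following Python sketch shows one way such a level could be computed. The paper states only that the interpolation function conserves energy, so the equal-power cosine curve used here is an assumption, as are the function name and the representation of each box by its half-extents along the three axes (with the outer box strictly larger than the inner one).

```python
import math


def ambient_gain(listener, centre, inner_half, outer_half):
    """Return the level of one soundfield sample at the listener position.

    Boxes are axis-aligned and centred on the sample point. The gain is
    1 inside the inner box, 0 outside the outer box, and follows an
    equal-power cosine curve in between (an assumed interpolation law).
    """
    t = 0.0  # 0 on the inner box surface, 1 on the outer box surface
    for p, c, inner, outer in zip(listener, centre, inner_half, outer_half):
        d = abs(p - c)
        if d >= outer:
            return 0.0  # outside the field of influence: sample not active
        if d > inner:
            t = max(t, (d - inner) / (outer - inner))
    return math.cos(t * math.pi / 2.0)


centre = (0.0, 0.0, 0.0)
inner = (1.0, 1.0, 1.0)
outer = (2.0, 2.0, 2.0)
inside = ambient_gain((0.5, 0.0, 0.0), centre, inner, outer)   # full level
halfway = ambient_gain((1.5, 0.0, 0.0), centre, inner, outer)  # in transition
outside = ambient_gain((3.0, 0.0, 0.0), centre, inner, outer)  # silent
```

A gain greater than zero is equivalent to the listener being inside the field of influence, so the same test could drive the triggering of a source definition. If a neighbouring sample point used the complementary sine curve over the same transition zone, the squared gains would sum to one and the overall energy would stay constant.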
The ambient source activation object is released when the listener leaves the field of influence of the soundfield sample.

The ambient source activation object is used to reproduce an interpolated soundfield. Each object reproduces the soundfield associated with a single active sampling point. There must be as many objects as there are simultaneously active sampling points. Like point source objects, this type of object also uses voice allocation to associate active sampling points with ambient source objects.

Voice allocation

There are two different instances of the voice allocation object, one for point sources and one for ambient sources. A round-robin algorithm is used, in which the objects are allocated in turn, one at a time, as new sources are started. Objects are released when the sound of the associated source has ended or been stopped. When all objects are allocated no more sources can be allocated until an object is released.

Room

The room module handles the simulation of the acoustic environment. The early reflections have to be simulated separately for each sound source. They are simulated by a time-varying delay line and use the same physical delay line as the distance simulation in the source module. The late reverberation is computed using a feedback delay network (Rocchesso and Smith, 1996).

Listener

The listener module represents the listener, its position and orientation. In the current system there can only be one listener. The listener module handles the mixing of all audible sources, the decoding of the B-format signal and the distribution to the loudspeaker system. The listener has a number of submodules.

The B-bus is a global resource in the system. Every source sends its output signal to this bus, and the bus is responsible for mixing the signals.

The rotation module takes the output of the B-bus and rotates the resulting soundfield according to the orientation of the listener.
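
The following Python sketch illustrates why rotating a first-order B-format signal is cheap: W is unaffected, and (X, Y, Z) is rotated like an ordinary 3D vector. The orientation is given here as a unit quaternion. The function names, the (w, x, y, z) ordering of the quaternion and the axis convention (X forward, Y left, Z up) are assumptions made for this example and do not describe the actual Snd3D PD objects.

```python
import math


def cross(a, b):
    """Cross product of two 3D vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate_vector(q, v):
    """Rotate vector v by the unit quaternion q = (w, x, y, z)."""
    qw, qx, qy, qz = q
    u = (qx, qy, qz)
    t = tuple(2.0 * c for c in cross(u, v))
    ut = cross(u, t)
    return tuple(v[i] + qw * t[i] + ut[i] for i in range(3))


def rotate_b_format(frame, listener_q):
    """Rotate one B-format frame (W, X, Y, Z) into the listener's frame.

    The soundfield is turned by the inverse of the listener orientation;
    for a unit quaternion the inverse is the conjugate.
    """
    w, x, y, z = frame
    qw, qx, qy, qz = listener_q
    inverse = (qw, -qx, -qy, -qz)
    rx, ry, rz = rotate_vector(inverse, (x, y, z))
    return (w, rx, ry, rz)


def yaw_quaternion(angle):
    """Unit quaternion for a counter-clockwise turn about the vertical axis."""
    return (math.cos(angle / 2.0), 0.0, 0.0, math.sin(angle / 2.0))


# The listener turns 90 degrees to the left; a source that was straight
# ahead should now be heard from the right (negative Y).
turned = rotate_b_format((0.7, 1.0, 0.0, 0.0), yaw_quaternion(math.pi / 2.0))
```

The cost is a fixed handful of multiplications and additions per sample frame, independent of the number of sources mixed onto the bus.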
The rotation is done using quaternions to achieve smooth interpolation and to facilitate a simple representation of the rotation (Kuipers, 1998).

The decoding module handles the decoding of the B-format signal. There are a number of different modules for different loudspeaker configurations. Currently there are modules for 4 or 8 loudspeakers in the horizontal plane and for 8 loudspeakers in a cube formation.

The distribution and correction module handles the distribution to the loudspeakers. It can also compensate for the listening room and the loudspeaker response.

RELATIONS TO OTHER PROJECTS

Snd3D is developed in co-operation with the department of parallel computing at the Royal Institute of Technology (KTH/PDC) in Stockholm as a part of the sound system in their "VR-cube", which is a 6-wall CAVE.

Snd3D is part of a larger umbrella project, "Sergels Torg, IT design for the city". Snd3D is used in this project to build the sound environment of a virtual townscape application modelling an existing square and the proposals for the reconstruction of that square. The "Sergels Torg" project is a good pilot project for Snd3D, as it needs modelling of point sources as well as of environmental sounds. Modelling the environmental sounds of the existing square and of the proposals requires different techniques. The model of the existing square is based on material recorded with a soundfield microphone. For the proposals the environmental sounds, or at least their acoustics, must be synthesised, as it is not possible to record them. The environmental sounds and their acoustics can be pre-calculated off-line and later used in the real-time simulation, which will greatly improve the speed of the system.

CONCLUSIONS AND FUTURE PLANS

Interpolated soundfields are an interesting method for modelling environmental sounds. We have planned to evaluate the

use of the interpolated soundfield technique in different situations and to investigate the use of interpolated soundfields to store pre-calculations of room acoustic simulations.

The simulation of acoustics is not fully implemented in the current system. There is no analysis of the 3D geometry yet. Implementation of this part is planned as soon as possible.

A future version of Snd3D that also includes reproduction over headphones using head-related transfer functions (HRTF) is planned.

ACKNOWLEDGEMENTS

I would like to express my appreciation and thanks to Kare Hjelm at Stockholm Audio, who lent me his soundfield microphone and thereby made this work possible.

REFERENCES

ALSA. "Advanced Linux Sound Architecture".

Begault, D. R. 1994. "3-D Sound for Virtual Reality and Multimedia". AP Professional.

Gerzon, M. A. 1980. "Practical Periphony: The Reproduction of Full-Sphere Sound". Draft of lecture presented at Audio Engineering Society 65th Convention. Preprint 1571.

Kuipers, J. B. 1998. "Quaternions and Rotation Sequences". Princeton University Press.

Puckette, M. 1996. "Pure Data". Proceedings, International Computer Music Conference. San Francisco: International Computer Music Association, pp. 269-272.

Pulkki, V. 1997. "Virtual sound source positioning using vector base amplitude panning". J. Audio Eng. Soc., 45(6):456-466.

Rocchesso, D. and Smith III, J. O. 1996. "Circulant and Elliptic Feedback Delay Networks for Artificial Reverberation". In IEEE Transactions on Speech and Audio Processing, vol. 5, no. 1, pp. 51-60.

Smith, H. J. 1979. "Ambisonics - the Calrec soundfield microphone". In Studio Sound, Oct. 1979, pp. 42-43.