A New Approach to HRTF Audio Spatialization

Matija Marolt
Faculty of Computer and Information Science
University of Ljubljana, Trzaska 25, SI-1000 Ljubljana, Slovenia
matija.marolt@fri.uni-lj.si
http://www.fri.uni-lj.si/~matic

Abstract

The article presents a new audio spatialization algorithm for spatialization methods based on the head-related transfer function (HRTF). The algorithm uses a mathematical model of the HRTF to make the spatialization process simple and fast. The article describes this algorithm in detail and compares it to the traditional block convolution filtering approach.

1. Introduction

Audio spatialization can be achieved in a variety of ways. The most commonly used method is based on the head-related transfer function (HRTF), which represents the changes in sound as it travels from its source towards the listener's middle ear. The core of HRTF spatialization usually consists of a block convolution algorithm that filters the incoming audio stream with the HRTF and thereby calculates the spatialized result. This approach has two drawbacks:

* its computational demands are very high, even when very fast block convolution algorithms are used;
* an immediate response to changes in sound source position is not possible, because the audio is processed in blocks.

I propose a different algorithm for HRTF audio spatialization which addresses these problems. In comparison to block convolution, the proposed approach proves to be faster and provides an immediate response to changes in sound source position. On the other hand, it also has its limitations:

* the sounds that will be spatialized must be known in advance;
* the amount of storage space the algorithm needs is much larger than the amount needed by the block convolution filtering approach.

The first limitation depends very much on the type of application we are dealing with.
In a simulation of a virtual environment, for example, we usually know in advance which sounds will be attached to each object in a scene; this limitation is therefore satisfied. We cannot eliminate the second limitation entirely, but we can alleviate it by using a computationally undemanding compression algorithm to reduce the amount of storage space needed.

2. The Spatialization Algorithm

The algorithm is based on the spatial feature extraction and regularization (SFER) model of the head-related transfer function, as introduced by [Chen et al., 1995]. The SFER model is a functional approximation of the HRTF: a mathematical model that represents the HRTF as a function of frequency and direction. The modelled HRTF is expressed as a weighted sum of M eigentransfer functions (EFs; these depend only on frequency) and spatial characteristic functions (SCFs; these are functions only of spatial location). It can be written as:

    h_m(θ_j, φ_j) = Σ_{i=1}^{M} q_i w_i(θ_j, φ_j) + q_0,    (1)

where h_m(θ_j, φ_j) represents the HRTF model ((θ_j, φ_j) is the direction of the virtual sound source), q_i, i = 0..M, the eigentransfer functions and w_i(θ_j, φ_j) the spatial characteristic functions. M determines the accuracy of the model: a larger M gives a more accurate model of the HRTF (and vice versa). If we use this model (instead of the HRTF) in the spatialization algorithm, we get the following expression (s represents the incoming audio stream, o the resulting spatialized audio stream, and ⊛ denotes convolution):

    o = Σ_{i=1}^{M} (q_i ⊛ s) w_i(θ_j, φ_j) + q_0 ⊛ s.    (2)

Because the eigentransfer functions q_i depend only on frequency, we can calculate the convolutions of these EFs and the audio stream s in advance (prior to spatialization) and store the resulting data:

    p_i = q_i ⊛ s,  i = 0..M.    (3)

ICMC Proceedings 1996 365 Marolt
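The linearity of convolution is what makes this precomputation possible: filtering the stream with the modelled HRTF gives the same result as the weighted sum of the precomputed convolutions. A minimal NumPy sketch of equations (1)-(3), using random placeholder data (the sizes and variable names here are illustrative assumptions, not taken from the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: M eigentransfer functions of length L,
# an audio stream s of length N.
M, L, N = 6, 64, 1024
q = rng.standard_normal((M + 1, L))   # q[0] is the constant term q_0, q[1..M] the EFs
s = rng.standard_normal(N)

# Equation (3): precompute p_i = q_i (*) s for i = 0..M, before spatialization.
p = np.stack([np.convolve(q[i], s) for i in range(M + 1)])

# Equation (1): the modelled HRTF for one direction, given SCF weights w_i.
w = rng.standard_normal(M)            # placeholder values of w_i(theta_j, phi_j)
h_model = q[1:].T @ w + q[0]

# Equation (2): filtering s with the modelled HRTF equals the weighted
# sum of the precomputed convolutions, by linearity of convolution.
o_direct = np.convolve(h_model, s)
o_fast = p[1:].T @ w + p[0]
assert np.allclose(o_direct, o_fast)
```

The final assertion checks the identity that moves the expensive convolutions out of the real-time path: only the weighted sum remains to be evaluated during spatialization.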

The spatialization algorithm can then simply be written as:

    o = Σ_{i=1}^{M} p_i w_i(θ_j, φ_j) + p_0.    (4)

The advantages and drawbacks of this method follow immediately from the above expressions. The spatialization algorithm (equation 4) is very simple and therefore fast. Its speed depends on the parameter M, which determines the accuracy of the HRTF model: a larger M means a more accurate model and therefore slower spatialization; a smaller M means a less accurate model and faster spatialization. In general, even with a very accurate model, the method still proves to be faster than the traditional block convolution approach. The second advantage of this approach is that the incoming audio stream is not spatialized in blocks of audio data, as is the case with block convolution algorithms; the response to changes in the sound source position can therefore be almost immediate.

The drawbacks are also obvious. If we want to calculate the convolutions of the eigentransfer functions and the audio stream in advance (equation 3), we must also know in advance which sounds (audio streams) we are going to spatialize. Whether this condition can be fulfilled depends entirely on the application that uses spatial audio. The second drawback is the amount of storage space the algorithm needs for the precalculated data, which is much larger than the amount needed by the block convolution algorithm. Instead of storing only the sounds that are going to be spatialized, we have to store all the precalculated data, which may take up to (M+1) times the space needed to store each individual sound, since (M+1) convolutions have to be precalculated and stored for each sound. This amount can be reduced by using a computationally undemanding compression algorithm, but it still remains large.
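At run time, equation (4) reduces each output sample to a weighted sum of precomputed samples, so the direction-dependent weights can change between any two samples. A sketch of this per-sample evaluation (again with illustrative placeholder data and names, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes: M eigentransfer functions, precomputed streams p_i.
M, N = 6, 500
p = rng.standard_normal((M + 1, N))   # p[0] corresponds to p_0 in equation (4)

def spatialize_sample(n, w):
    """Equation (4) for one output sample: o[n] = sum_i p_i[n] w_i + p_0[n].

    Because each sample is a plain weighted sum, the weight vector w (and
    hence the virtual source direction) can change between any two samples;
    no block boundary has to be reached first."""
    return p[1:, n] @ w + p[0, n]

# The SCF weight vector w may differ at every sample without extra cost.
out = np.array([spatialize_sample(n, rng.standard_normal(M)) for n in range(N)])
assert out.shape == (N,)
```

Each output sample costs only about M+1 multiply-adds, which is why the method's cost grows linearly with M and does not depend on the length of the underlying HRTF filters.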
3. The Comparison of Algorithms

Several tests were performed to estimate the algorithm's speed in comparison to the traditional block convolution approach. All the tests were made on a Silicon Graphics Indigo workstation with a 100 MHz MIPS R4000 processor, running a scaled-down version of SAS (Spatial Audio Server; see [Marolt, 1995]), a program for spatial audio simulation developed at our Faculty. The HRTF data used in these tests were dummy-head microphone recordings measured at MIT. They are represented by 44.1 kHz FIR filters measured at 710 locations in space around the listener (see also [Gardner and Martin, 1994]).

In the first part of the tests, the speed of the traditional block convolution filtering approach was measured. An overlap-add block convolution algorithm based on the fast Hartley transform was used for spatialization. The algorithm's speed was estimated by spatializing an approximately 4.3-second-long 44.1 kHz audio stream. To eliminate the influence of the system's load on the measurements, only the raw CPU time used for the calculations was taken as a result. Because extensive precalculations take place when using the proposed algorithm, some precalculations were also made when testing the block convolution algorithm. Block convolution is usually performed in three steps:

* calculate the FFT of the input block;
* multiply the transformed block with the FFT-ed filter (convolution in the frequency domain);
* calculate the inverse FFT and the necessary additions (overlap-add algorithm).

The first step (FFT) was calculated in advance and the results were stored. The spatialization algorithm therefore consisted only of the last two steps, and only these two were timed. The tests were performed with HRTFs of different lengths: 64, 128 and 256 points. The CPU times (in seconds) used for spatialization of the audio stream with all three HRTF lengths are given in table 1.
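The three steps of block convolution can be sketched as follows. This is a generic overlap-add implementation using NumPy's real FFT rather than the fast Hartley transform the tests used; the block size, filter length and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: a 64-point filter h, an input stream s, block size B.
L, B = 64, 64
fft_len = B + L - 1            # length of one block's linear convolution
h = rng.standard_normal(L)
s = rng.standard_normal(4 * B)
H = np.fft.rfft(h, fft_len)    # FFT of the filter, computed once

# Step 1 (precomputed in the paper's test): FFTs of all input blocks.
blocks = s.reshape(-1, B)
S = np.fft.rfft(blocks, fft_len, axis=1)

# Steps 2-3 (the timed part): multiply in the frequency domain,
# inverse-transform, and overlap-add the block tails.
out = np.zeros(len(s) + L - 1)
for k, Sk in enumerate(S):
    y = np.fft.irfft(Sk * H, fft_len)
    out[k * B : k * B + fft_len] += y

assert np.allclose(out, np.convolve(s, h))
```

With the forward FFTs of the input blocks stored in advance, only the spectral multiplication, the inverse transform and the overlap-add additions remain in the timed path, mirroring the test setup described above.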
    HRTF length    64     128    256
    time (s)       1.75   1.81   2.14

Table 1: the speed of the block convolution algorithm

In the second part of the tests, the speed of the proposed method was measured. Fifteen models of the HRTF were first calculated: models calculated from HRTFs of different lengths (64, 128 and 256 points) and with different numbers of eigentransfer functions (M = 6, 8, 10, 12 and 14). The percentage mean square errors of these models (compared to the original HRTF measurements) were also calculated as:

    MSE = (100 / P) Σ_{j=1}^{P} ( ||h_j - h_{m,j}||² / ||h_j||² ),

where P represents the number of HRTF measurements at different locations in space (in our case 710), h_j the original HRTF measurements and h_{m,j} the HRTF model. The MSEs can be found in table 2.

    M      6      8      10     12     14
    64     6.90   4.10   2.86   1.87   1.29
    128    7.25   4.44   3.12   2.15   1.56
    256    7.68   4.90   3.60   2.62   2.04

Table 2: MSEs of the HRTF models (in %)
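One way to compute such a percentage mean square error over the P measurement directions is the standard energy-normalised form sketched below; the exact formula used in the tests and the placeholder data here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: P measured HRTFs h and their models h_m.
P, L = 710, 64
h = rng.standard_normal((P, L))
h_m = h + 0.1 * rng.standard_normal((P, L))   # a model close to the data

def percent_mse(h, h_m):
    """Mean over the P directions of the squared model error,
    normalised by the measurement energy, expressed in percent."""
    num = np.sum((h - h_m) ** 2, axis=1)
    den = np.sum(h ** 2, axis=1)
    return 100.0 * np.mean(num / den)

e = percent_mse(h, h_m)
assert 0.0 < e < 5.0   # a small perturbation yields a small percentage error
```

Normalising by the measurement energy makes errors comparable across HRTFs of different lengths and levels, which is consistent with the single-digit percentages reported in table 2.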

After the models had been calculated, the necessary precalculations were made for each model, as described in the previous section. Finally, the speed of the spatialization algorithm was measured for each model. The audio stream used for the measurements was the same as in the first part of the tests. The results can be found in table 3.

    M         6      8      10     12     14
    time (s)  0.51   0.67   0.83   0.98   1.13

Table 3: the speed of the proposed algorithm

Table 3 displays the CPU times (in seconds) used for spatialization with the proposed algorithm for all the HRTF models. We see that the algorithm's speed can be expressed as a linear function of M, which can also be observed in equation 4. The length of the HRTF model used in the calculations has no effect on the algorithm's speed; the speed is the same whether we use a 256-point or a 64-point HRTF model. On the other hand, the HRTF's length has a slight effect on the model's accuracy, as can be seen in table 2. The comparison of the two algorithms is given in table 4.

    M        6      8      10     12     14
    64       3.43   2.61   2.11   1.78   1.54
    128      3.54   2.70   2.18   1.85   1.60
    256      4.20   3.19   2.58   2.18   1.89
    space    1.75   2.25   2.75   3.25   3.75

Table 4: the comparison of the two algorithms

The first three rows of the table represent the relative speed indexes of the two algorithms; i.e., 4.20 in the third row of the first column means that the proposed algorithm with an HRTF model built from 6 EFs was 4.20 times faster than the block convolution algorithm with an HRTF length of 256 points. When comparing the two algorithms, we must also take into consideration the amount of space both algorithms need to store the precalculated data. The relative increases in the amount of space needed to store the data can be found in the last row of table 4. For example, 1.75 in the first column means that the proposed algorithm with an HRTF model built from 6 EFs needs 1.75 times more storage space than the block convolution algorithm (this number is the same for all HRTF lengths).
The amount of space used is again a linear function of M and increases with the accuracy of the model. This amount does not depend on the HRTF's size. Looking at these results, we find that the algorithm is not very useful if we want to use a very accurate HRTF model for spatialization: the speed increase is small, and the amount of space needed to store the precalculated data quickly becomes excessive. On the other hand, the algorithm becomes more and more efficient as a smaller number of eigentransfer functions is used for the HRTF representation. In this case it can be three or more times faster than block convolution algorithms, and it also does not use much more space. Such a speed increase can be very important when we want to spatialize multiple audio streams simultaneously in real time.

But how will the use of a less accurate HRTF model affect the listener's ability to correctly estimate the virtual sound source position? According to Wightman and Kistler [1992], the HRTF can be adequately approximated by using only five eigentransfer functions. The localisation results obtained with such a model are almost equal to those obtained with the measured HRTF; the only difference is a slight increase in front-back and up-down localisation confusions. The use of HRTF models approximated with fewer than five EFs results in a large deterioration of front-back and up-down localisation ability, while right-left localisation remains very good even when only one EF is used.

4. Summary

The presented algorithm for HRTF audio spatialization proves to be a solid improvement over the more commonly used block convolution approach. It gives good results when used with an HRTF model calculated from a small number of eigentransfer functions. It is very useful for real-time audio spatialization of multiple concurrent audio streams in applications where the audio streams are known in advance.
In this case its speed allows a much larger number of audio streams to be spatialized concurrently, which proves to be more important than the slightly larger amount of storage space the method needs.

References

[Chen et al., 1995] Chen, J., Van Veen, B.D., Hecox, K.E. (1995). "A spatial feature extraction and regularization model for the head-related transfer function," J. Acoust. Soc. Am. 97, 439-452.

[Gardner and Martin, 1994] Gardner, B., Martin, K. (1994). "HRTF measurements of a KEMAR dummy-head microphone," MIT Media Lab Perceptual Computing Technical Report No. 280.

[Marolt, 1995] Marolt, M. (1995). "Spatial audio synthesis," graduate thesis, Faculty of Electrical and Computer Engineering (in Slovenian).

[Wightman and Kistler, 1992] Wightman, F. L., and Kistler, D. J. (1992). "A model of head-related transfer functions based on principal component analysis and minimum-phase reconstruction," J. Acoust. Soc. Am. 91, 1637-1647.