On the Perceptual Optimization of Synthetic Acoustical Systems¹

Paul R. Runkle and Gregory H. Wakefield
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor MI 48109
paulr@eecs.umich.edu ghw@eecs.umich.edu

Abstract

Individual differences in binaural cues play a significant role in the effectiveness of virtual auditory displays: an individual's ability to localize a synthesized source over headphones is impaired when the head-related transfer functions (HRTFs) used are not their own. In this paper, we present a method for tuning HRTFs such that they more closely approximate a listener's HRTFs. The HRTFs are modeled using a low-dimensional rational form whose parameters are varied based on subjective responses provided by the listener. Optimization issues of relevance to this problem include minimizing the number of responses required for convergence, and imposing useful constraints on the parameter space.

1 Introduction

1.1 Cues for Sound Localization

The generation of realistic synthetic acoustical environments over headphones, such as in virtual auditory displays, relies on the reproduction of auditory cues that humans exploit for determining the spatial position of external acoustic sources. Among these cues are the gross binaural features associated with overall interaural level differences (ILDs) and overall interaural time-of-arrival differences (ITDs). In addition, the spectral features of the sound field at the tympanic membrane are dependent on spatial position. All of these cues are embodied in the head-related transfer function (HRTF). Although other cues may be used to resolve ambiguities in spatial position, such as dynamic cues associated with head movement, reverberation, and visual association, this research is primarily concerned with the acoustic information provided by the HRTFs.
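As a rough illustration of the gross binaural cues named above, the sketch below estimates an overall ILD and ITD from a pair of left- and right-ear impulse responses. It is not the procedure used in this work: the function names, the energy-ratio definition of the ILD, the cross-correlation estimate of the ITD, the toy impulse responses, and the 44.1 kHz sampling rate are all assumptions made for the example.

```python
import math

def ild_db(left, right):
    """Overall ILD in dB (left re right), taken as the ratio of total energies."""
    energy_left = sum(x * x for x in left)
    energy_right = sum(x * x for x in right)
    return 10.0 * math.log10(energy_left / energy_right)

def itd_samples(left, right, max_lag):
    """Lag (in samples) that maximizes the cross-correlation of the two responses.

    A positive lag means the right-ear response arrives later than the left.
    """
    best_lag, best_value = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        value = sum(
            left[i] * right[i + lag]
            for i in range(len(left))
            if 0 <= i + lag < len(right)
        )
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag

# Hypothetical toy impulse responses (not measured HRTFs): the right-ear
# response is a delayed, attenuated copy of the left-ear response.
FS_HZ = 44100.0  # assumed sampling rate
left_ir = [0.0] * 32
right_ir = [0.0] * 32
left_ir[5] = 1.0
right_ir[8] = 0.5

ild = ild_db(left_ir, right_ir)
itd_seconds = itd_samples(left_ir, right_ir, max_lag=16) / FS_HZ
print(f"ILD = {ild:.2f} dB, ITD = {itd_seconds * 1e6:.0f} microseconds")
```

In this toy case the right-ear response is both weaker and later, as would be expected for a source on the listener's left. Measured HRTFs would also carry the direction-dependent spectral detail that these two numbers ignore.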
To include the cues associated with the HRTF in the synthesis procedure, HRTFs are measured at several spatial positions in an anechoic environment using microphones placed near the tympanic membrane of a human subject (a head and torso simulator is often substituted for the human). To generate the percept of a source at a specific location, the source is convolved with the measured HRTF from the desired location and presented over headphones to the listener. It has been shown that subjects are significantly better at localizing sources that were generated through their own HRTFs than through those measured from another subject (Wenzel, 1993). The goal of this research is to investigate the tuning of a given set of HRTFs to better match those of the listener by first approximating the measured HRTF with a low-order model, and then perturbing the model parameters based on subjective evaluations of the virtual acoustic presentations.

¹ Work supported by a grant from the NIDCD of NIH (R01 DC00706-01A1).

1.2 Representation of HRTFs

For any optimization problem, reducing the dimensionality of the search space generally leads to more rapid convergence to a solution. A low-order representation of the HRTFs also reduces the computational burden of synthesizing virtual acoustic sources. To realize such a lower-dimensional representation, we approximate the HRTFs as rational transfer functions. The approximations are computed using a gradient search with a log-least squares error criterion developed by Blommer and Wakefield (1994). A further reduction in dimensionality of the directional component of the HRTF may be achieved by representing the HRTF as a cascade of a common and a directional transfer function, as proposed by Middlebrooks and Green (1990). The common transfer function (CTF) is defined as the arithmetic mean of the HRTF from all spatial positions, and is therefore independent of source location.
The directional transfer function (DTF) is defined as the residual between the measured HRTF and the CTF, and embodies only the directional information in the HRTF.

2 Subjective Tuning of HRTFs

By varying the parameters of the DTF model, it may be possible to compensate for individual differences in HRTFs, yielding more accurate source localization. This can be accomplished by combining subjective preferences with an iterative search through the parameter space to generate model parameters that more accurately represent the user's DTF. The problem may be solved by employing variations on standard optimization methods: the state of the system, W, is modified over a number of observations according to some cost function, C. Ideally, the process will approach a solution W* that satisfies

W* = argmin_W C(W, P)

Here, the cost function depends on some perceptual parameters, P, which are not directly observable. In this case, P would be related to the perceived spatial location of the source, and is therefore dependent on the stimulus.

Since the cost function, C, is not known analytically, and a human is required to participate in the optimization process, two fundamental constraints are imposed on the design of the iterative search. First, the number of observations that a user is required to evaluate should be limited to a few hundred. Second, the type of response is limited to ranking two or more acoustic presentations according to their relative proximity to the desired percept.

3 Experimental Results

The initial experiments were designed to determine whether the combination of subject response and algorithm methodology could modify the model parameters to produce better DTFs. We wanted to make the task as easy as possible for the subject, while providing sufficiently many degrees of freedom to the algorithm to generate good prospective solutions. These goals conflict, since increasing the dimensionality of a search makes the search more difficult.

In the experiment, the subject was required to move a source located directly in front of them to a position on the horizontal plane 10 degrees in azimuth to the right of front and center. The source for every presentation was a 200 ms, 0.2-10 kHz broadband noise filtered by the relevant HRTF. For each iteration, the subject was presented with two choices, and was asked to select which stimulus sounded spatially closest to the reference. For this task, a reference was provided at the target position to facilitate performance. Because of differences between the measured HRTFs and the subject's own HRTFs, the initial and target positions may not be perceived at the desired locations. First, however, it was important to determine if a subject could consistently move the sources to a different location.
If so, then a reference-free experiment can be conducted.

3.1 Unconstrained Tuning

In the unconstrained experiment, the parameterization of the HRTF included the fixed CTF and a six-zero, six-pole (denoted (6,6)) approximation of the DTF, which yielded an overall error of less than 2 dB from the measured HRTFs in the frequency range of 0.2-10 kHz. Also included were the ILD and ITD parameters. Thus, for two channels, there were a total of 26 parameters to be varied. In simulations employing a log-least squares error criterion in magnitude, solutions to the problem were achieved in about 200 iterations. However, the human subjects reported that the source locations for prospective solutions tended to shift dramatically in azimuth from the target position. Because of the task difficulty, repeatable convergence was not achieved. This aberrant shift in the perceived source azimuth is explained by the effect of moving a pole or zero in the left channel without performing a compensating action in the right channel: the result is a change in the ILD, which produces the observed change in azimuth. Because of this difficulty, constraints on the search space were investigated as a way to control the perceived location of the sources.

3.2 Constrained Search

The constrained experiment involved fixing the ILD and ITD of the prospective solutions, keeping the source close to the target in azimuth. The task therefore became one of adjusting only the local spectral features governed by the (6,6) systems. It was hoped that by perturbing only the spectrum, the solution would sound spectrally and spatially similar to the target, while also making the response decision easier for the subject. The results of the constrained experiment indicate that the subjects were unable to adjust the test stimuli to match the spatial position of the target.
In particular, the sources would often vary in diffuseness, or split into multiple sources. This phenomenon was found to be caused by local changes in phase introduced by pole or zero movement. Especially at low frequencies, changes of as little as 0.15 radians in the interaural phase difference increased the diffuseness of the source.

One potential solution to the phase problem is to use sources that are high-pass filtered at 2 kHz or above, where small local phase differences are not likely to change the source percept. Additionally, the use of an allpass phase equalizer is being explored to cancel out phase changes due to pole and zero movements. Finally, constraints that couple parameters across channels may help to eliminate confusing changes in binaural cues.

References

[Wenzel, 1993] Wenzel, E. M., Arruda, M., Kistler, D. J., Wightman, F. L., Localization using nonindividualized head-related transfer functions. J. Acoust. Soc. Am., Vol. 94, 1993, pp. 111-123.

[Middlebrooks, 1990] Middlebrooks, J. C., Green, D. M., Directional dependence of interaural envelope delays. J. Acoust. Soc. Am., Vol. 87, 1990, pp. 2149-2162.

[Blommer, 1994] Blommer, M. A., Wakefield, G. H., On the design of pole-zero approximations using a logarithmic error measure. IEEE Trans. on Signal Proc., Vol. 42, 1994, pp. 3245-3248.

Runkle & Wakefield, ICMC Proceedings 1996
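The sensitivity to pole movement described in Sections 3.1 and 3.2 can be illustrated with a deliberately simplified sketch. It assumes a single conjugate pole pair per channel rather than the (6,6) models used in the experiments, and the pole radius (0.98), resonance frequencies (600 and 650 Hz), probe frequency (500 Hz), and sampling rate (44.1 kHz) are hypothetical values chosen only for the example. Shifting the resonance in the left channel alone, with no compensating change in the right channel, alters both the interaural level difference and the interaural phase difference at the probe frequency.

```python
import cmath
import math

def pole_pair_response(radius, resonance_hz, freq_hz, fs_hz):
    """Response at freq_hz of H(z) = 1 / ((1 - p z^-1)(1 - conj(p) z^-1)),
    where p = radius * exp(j * 2*pi * resonance_hz / fs_hz)."""
    pole = cmath.rect(radius, 2.0 * math.pi * resonance_hz / fs_hz)
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs_hz)
    return 1.0 / ((1.0 - pole * z_inv) * (1.0 - pole.conjugate() * z_inv))

def interaural_phase_difference(h_left, h_right):
    """Phase of the left response relative to the right, in radians."""
    return cmath.phase(h_left / h_right)

# All values below are assumptions for illustration, not taken from the paper.
FS_HZ = 44100.0
PROBE_HZ = 500.0
RADIUS = 0.98

# Both channels start with the same resonance; only the left one is moved.
h_right = pole_pair_response(RADIUS, 600.0, PROBE_HZ, FS_HZ)
h_left_before = pole_pair_response(RADIUS, 600.0, PROBE_HZ, FS_HZ)
h_left_after = pole_pair_response(RADIUS, 650.0, PROBE_HZ, FS_HZ)

ipd_before = interaural_phase_difference(h_left_before, h_right)
ipd_after = interaural_phase_difference(h_left_after, h_right)
ild_after_db = 20.0 * math.log10(abs(h_left_after) / abs(h_right))

print(f"IPD before: {ipd_before:.3f} rad, after: {ipd_after:.3f} rad")
print(f"ILD after the left-channel pole shift: {ild_after_db:.2f} dB")
```

Comparing the printed phase difference with the 0.15 radian figure reported above gives a sense of how small a one-channel pole movement needs to be before it becomes audible at low frequencies, which is the motivation for coupling parameters across channels.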