On the importance of phase information in additive analysis/synthesis of binaural sounds

Tue Haste Andersen, Kristoffer Jensen
Department of Computer Science, University of Copenhagen
email: {haste,krist}

Abstract

This article presents a number of psycho-acoustic experiments assessing the importance of phase information for the spatial qualities of synthesized sounds from voiced instruments. Binaurally recorded sounds are synthesized using additive analysis/synthesis with and without phase information. The phase information is used in the synthesis to preserve the characteristics of the waveform. The experiments show the importance of phase when dealing with localization of voiced instruments: the phase information is shown to be crucial for determining the direction of the sound. Furthermore, it is observed that the test persons' ability to perceive the spatial qualities of the sounds does not depend on fundamental frequency when synthesis is done with phase information, and indications are given that the attack part of the sound has limited influence on spatial perception.

1 Introduction

Additive analysis/synthesis has successfully been applied to a number of applications over the past years. However, not much work has been done on additive synthesis of binaural sounds that preserves spatial information. The phase information is important to the localization of sounds, not only for the perceptually correct reproduction of the analyzed sounds, but also for the ability to control the synthesis. If spatial perception is to be a control parameter in the synthesis, as is the case for e.g. virtual reality applications, better knowledge is necessary to deal with phase in an efficient way. The goal of this work is to determine the importance of phase information for the spatial perception of voiced instrument sounds, and thereby provide a better understanding of how to synthesize instrument sounds with spatial qualities.
2 Localization

Studies of how phase affects the timbre of monaural sounds can be found in the literature. Plomp and Steeneken (1969) studied various relatively simple waveforms composed of the same spectral information, and Patterson (1987) constructed a model of phase perception based on studies in neurology. In contrast to the relatively few experiments on monaural sounds and timbre, extensive studies of localization and lateralization of sound can be found in the literature. These experiments are typically conducted using simple sound sources, e.g. sinusoids or filtered noise. From these experiments it is well known that for sounds below 1000 Hz the most important parameter for determining the direction of a sound is the interaural time difference (ITD), while above 2000 Hz the interaural level difference (ILD) is the most important. Change in the spectral envelope of the signal is also important, although the individual roles of these cues are not well understood (Wightman, Kistler, and Perkins 1987). For correct reproduction of binaurally recorded sounds over headphones, it is necessary to take the head-related transfer function (HRTF) into account, as well as the transfer function of the recording and playback equipment (Wightman, Kistler, and Perkins 1987).

3 Additive analysis/synthesis

In relation to additive synthesis, previous work has investigated how to preserve the waveform of the analyzed sounds. This has been done successfully using cubic interpolation of estimated phase and frequency values (McAulay and Quatieri 1986). The system, which is also used here, consists of an analysis part, where amplitude, frequency and phase are extracted as time-varying parameters using the Short Time Fourier Transform (STFT). The sound is then synthesized by

    s(t) = Σ_{n=0}^{k} a_n(t) cos[θ_n(t)]    (1)

where k is the number of partials, a_n(t) is the time-varying amplitude, and θ_n(t) is the time-varying phase. For binaural analysis/synthesis, there are of course two sets of parameters.
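As a minimal sketch of the oscillator-bank resynthesis in equation (1) (not code from the paper; the function name and array layout are illustrative, and the amplitude and phase tracks are assumed to be already interpolated to the audio rate):

```python
import numpy as np

def additive_synth(amps, phases):
    """Oscillator-bank resynthesis per equation (1):
    s(t) = sum_n a_n(t) * cos(theta_n(t)).

    amps, phases: float arrays of shape (k, num_samples) holding the
    time-varying amplitude a_n(t) and phase theta_n(t) of each partial,
    already interpolated to the audio sampling rate.
    """
    amps = np.asarray(amps, dtype=float)
    phases = np.asarray(phases, dtype=float)
    # Sum the k partials sample by sample.
    return np.sum(amps * np.cos(phases), axis=0)

# Toy example: two steady partials (440 Hz and 880 Hz) at 44.1 kHz.
sr = 44100
t = np.arange(sr) / sr
phases = np.vstack([2 * np.pi * 440 * t, 2 * np.pi * 880 * t])
amps = np.vstack([0.6 * np.ones_like(t), 0.3 * np.ones_like(t)])
s = additive_synth(amps, phases)
```

For the binaural case, the same routine would simply be run once per channel on the left- and right-ear parameter sets.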
The different synthesis methods used in the experiments are:

1. Using amplitude, frequency and phase

2. Using only amplitude and frequency, leaving out the phase information
3. Using (1) with the spectral envelope normalized in both channels

For synthesis with phase information, θ_n(t) is calculated using cubic interpolation between frequency and phase values, as described in (McAulay and Quatieri 1986). For synthesis without phase information, θ_n(t) is simply obtained by integration of the measured frequency values over time:

    θ_n(t) = ∫_0^t ω_n(τ) dτ    (2)

θ_n(t), a_n(t) and ω_n(t) are estimated once for every period of the fundamental of the sounds. At this rate the ITD is generally not present in a_n(t) or ω_n(t). The sound quality of the additive analysis/synthesis is very good. The degradation of a large variety of musical sounds was previously found to be between imperceptible and perceptible, but not annoying (Jensen 1999). For all the resynthesized sounds, each method retains the perceptually important features, such as the attack part, very well in monaural analysis/synthesis, even though there is still a slight difference in the waveforms between the original and the resynthesized sounds with phase. The resynthesis with phase retains the non-instrumental part of the sound, such as noise, better than the method without phase information. When performing binaural analysis/synthesis, the localization information is less present in synthesis without phase than with phase.

4 Experiments and results

To examine the importance of phase information and other localization cues in the sound, two psycho-acoustic experiments were conducted. The first was made as a preliminary study, whereas the second was carefully designed and verified using appropriate statistical tests.
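Before turning to the experiments, the integral in equation (2) can be sketched discretely as a cumulative sum of the angular frequency (an illustrative approximation, assuming audio-rate frequency tracks in Hz; not the paper's implementation):

```python
import numpy as np

def phase_from_frequency(freqs_hz, sr=44100):
    """theta_n(t) = integral from 0 to t of omega_n(tau) dtau, eq. (2),
    approximated by a cumulative sum at the sampling rate.

    freqs_hz: array of shape (k, num_samples) of instantaneous partial
    frequencies in Hz, interpolated to the audio rate. Every partial
    implicitly starts at zero phase, so any interaural phase difference
    present in the recording is discarded.
    """
    omega = 2.0 * np.pi * np.asarray(freqs_hz, dtype=float)
    return np.cumsum(omega, axis=-1) / sr

# One partial held at a constant 440 Hz for one second accumulates
# 440 full cycles (2*pi*440 radians) of phase.
theta = phase_from_frequency(np.full((1, 44100), 440.0))
```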
The variables under control in these experiments are: reproduction method, sound source distance, instrument, angle of incidence and fundamental frequency. The reproduction method is either the original recorded sound, or a synthesized sound as described in section 3. In all the experiments an interval scale with eight different incidence angles was used, as shown in figure 1. The parameters were chosen so that a reasonable diversity of sounds was available in the experiment, and to test whether these experiments reproduce the results concluded in previous studies of simple sound stimuli found in the literature.

Figure 1: Setup during recording of sound sources. The artificial head and torso is placed in the middle while recordings are done at eight different angles of incidence.

4.1 Preliminary experiment

A preliminary experiment with seven subjects was first conducted. For the sound recordings an artificial head and torso was used, with DPA microphones attached to the inside of each ear. Sounds of a guitar were recorded at eight different angles, equally spaced around the head, at a distance of 0.5 meters. The recordings were done in an office room using a DAT recorder. Two things were tested:

1. Absolute judgment: The subject listens to all sounds, both originals and resynthesized, in random order. For each sound the subject is asked to determine the direction of the sound.
2. Relative judgment: The subject listens to two sounds synthesized using the previously mentioned methods 1 and 2, played back one after the other in random order. The subject is asked for which of the two it is easiest to determine the direction and distance of the sound.

The results from the first experiment, where the subjects are asked to determine the absolute direction of the sounds, show the same average error for original sounds and sounds synthesized with phase. For sounds synthesized without phase, the error is substantially higher.
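The error scores above can be sketched from the eight-position scale of figure 1. The weighting follows the paper's description (front/back mismatch weighted 1, left/right mismatch weighted 2, undetermined answers scored 4), but the exact decomposition of the eight angles into front/back and left/right components below is a hypothetical reconstruction:

```python
# Positions indexed 0-7 counter-clockwise in 45-degree steps:
# 0 = front, 2 = left (90 deg), 4 = back, 6 = right (270 deg).
FRONT_BACK = {0: 1, 1: 1, 7: 1, 3: -1, 4: -1, 5: -1}   # side positions map to 0
LEFT_RIGHT = {1: 1, 2: 1, 3: 1, 5: -1, 6: -1, 7: -1}   # front/back axis maps to 0

def localization_error(true_pos, answer):
    """Weighted localization error; `answer` is None when undetermined."""
    if answer is None:
        return 4
    fb = abs(FRONT_BACK.get(true_pos, 0) - FRONT_BACK.get(answer, 0))
    lr = abs(LEFT_RIGHT.get(true_pos, 0) - LEFT_RIGHT.get(answer, 0))
    return 1 * fb + 2 * lr
```

Under this reading, a pure front/back confusion (position 0 judged as 4) costs 2, while a pure left/right confusion (2 judged as 6) costs 4, the same as an undetermined answer.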
The results clearly show that phase information is important when dealing with spatial information in binaurally recorded sounds. The notion of error is based on the difference between the sound source position of the recording and the answer the subject gave, using the scale in figure 1. A number of corrections are made. The error is the sum of the front/back error (weighted 1) and the left/right error (weighted 2). An undetermined answer gives an error of 4. Figure 2 shows the results from the second test with relative judgment. In 60% of the cases it is easier to determine the direction of the sounds when synthesized with phase information. For 33% of the sounds there is no difference, and for

7% it is easier to determine the direction when no phase information is used. Furthermore, it is observed that sounds coming directly from the front or the back of the listener are more often judged to be undetermined regarding the spatial qualities. This test gives an idea about the importance of phase information, but it is not clear what exactly is tested: because sounds synthesized with and without phase information in some cases have small variations in timbre, they can be difficult to compare.

Figure 2: Answers from the preliminary experiment with 7 subjects, listening to 48 sounds each. The figure shows for which of the synthesis methods (1 or 2) the sounds are easiest to locate in space.

4.2 Absolute judgment

In this experiment absolute judgment is used as in the previous experiment, but with a larger setup. All sounds are recorded in an anechoic chamber to minimize reflections from the environment. Two instruments are used: singing voice and guitar. Each instrument is recorded at eight different angles equally spaced around an artificial head and torso, at two different distances, 0.5 and 2.0 meters. In all, 160 sounds are recorded. The experimental setup includes 11 subjects listening to a set of the original recorded sounds and the corresponding synthesized sounds, played back over headphones (Beyer Dynamic DT 990). The singing voice is synthesized using methods 1-3 and the guitar using methods 1 and 2. All results are analyzed using ANOVA tests with multiple within-subject variables. Post-hoc tests were done using a Bonferroni test at the 1% significance level.

Reproduction method. A significant influence of reproduction method is found, with p < .001.
Synthesis not including phase is found to be significantly different from the originals and from synthesis including phase. Figure 3 shows the mean error for the different synthesis types, for both guitar and singing voice.

Figure 3: Mean error for the different types of reproduction, converted to degrees by multiplying by 45°.

The singing voice was furthermore tested using synthesis method 3: with phase information, but with the spectral envelope normalized in both channels. The difference between this synthesis method and synthesis with phase (method 1) is shown not to be significant. No significant difference in performance is found between the original sounds and sounds synthesized with phase, which is surprising. Although the synthesized signal is close to the original in the stable part of the sound, transients are sometimes slightly modified, due to poor time resolution in the STFT analysis. Correctly reproduced transient signals, for instance the attack, should be easier to localize. These results indicate that this is not the case, although a more thorough examination is needed.

Distance to sound source. No performance difference between the two recording distances is observed. Because a small difference in mean sound level between the recorded sounds was present, no experiments were done regarding the test persons' ability to perceive distance.

Instrument type. Figure 3 furthermore shows a difference between instrument types, with p < .04. On average the guitar received a higher error than the voice. This finding can be explained by the fact that the guitar is a more diffuse sound source than the voice, which is better localized.

Angle of incidence. Angle of incidence is significant with p < .001. Figure 4 shows the mean error for all the subjects, as a function of angle of incidence and synthesis type. It is clear that it is difficult to judge sounds located at the front and back of the head.
This corresponds to previous research on localization where no correction for the filtering by the subject's pinnae and the headphones is done (Wightman, Kistler, and Perkins 1987). The spectral envelope normalization only gives a degradation for sounds coming directly from the left or right

Page  00000004 - Original. -0-- Without phase 60 -*- With phase 55 S50 ~ 45 40,,, U.. 81.5 145.3 Angle of incidence (degrees) 326.9 Fundamental frequency (Hz) 659.2 Figure 4: The mean error as a function of angle of incidence. 0~ corresponds to a sound coming directly from the front of the listener. of the listeners head (900 or 2700), whereas removing the phase gives a degradation in all angles of incidence, except the front/back (00 or 1800). Fundamental frequency. The experiments with guitar sounds have been done with recordings of tones with fundamental frequency in the range from 80 to 660 Hz. Figure 5 shows the average error of these different tones for the different reproduction methods. The observed difference is interesting because it shows that the localization ability improves as a function of fundamental frequency for the sounds synthesized without phase information. The interaction of fundamental frequency and reproduction method is significant with p <.03. It is difficult to conclude any general tendency from only four different fundamental frequencies, but it is interesting because it indicates that ILD or the spectral envelope is used as a cue at these frequencies when no interaural phase difference is present. The decrease of error for the sounds resynthesized without phase may be explained by the fact that the ITD is more influent in the amplitude envelopes for the very high pitched sounds. 5 Conclusion A number of psycho-acoustic experiments involving the importance of phase in sound localization have been conducted. The results show that phase is an important parameter when performing additive analysis/synthesis of binaural recordings. If the phase is left out, the ability to perceive spatial qualities of the sounds is substantially degraded. For sou Figure 5: The mean error as a function of fundamental frequency of guitar tones for different reproduction methods. 
nds in the examined frequency range the results indicate that there is no relation between fundamental frequency and the ability to perceive spatial qualities when phase information is used in synthesis. For the guitar sounds synthesized without phase, the error is substantially higher for sounds with low fundamental frequency, compared to sounds with high fundamental frequency. The phase is important in all incident positions, except front/back, whereas the spectral envelope is mainly influential in the lateral positions. Furthermore, indications are found that the attack part of the sound may not be particularly important for localization, as long as the interchannel time difference is preserved. References Jensen, K. (1999). Timbre Models of Musical Sounds. Ph. D. thesis, Department of Computer Science, University of Copenhagen, Report No. 99/7. McAulay, R. J. and T. F. Quatieri (1986, August). Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal processing ASSP-34(4), 744-754. Patterson, R. D. (1987, November). A pulse ribbon model of monaural phase perception. Journal of the Acoustical Society of America 82(5), 1560-1586. Plomp, R. and H. J. M. Steeneken (1969). Effect of phase on the timbre of complex tones. Journal of the Acoustical Society of America 46, 409-421. Wightman, L., D. J. Kistler, and M. E. Perkins (1987). A new approach to the study of human sound localization. In W. A. Yost and G. Gourevitch (Eds.), Directional Hearing, pp. 26 -48. Springer-Verlag.