The Synthesis of Sung Vowels in Female Opera and Belt Qualities

Dr Michelle Evans* and Dr David M Howard
Department of Electronics, University of York, Heslington, York YO1 5DD, England
*Michelle Evans' present address is 39 Rose Street, York YO3 7JE, England.

Abstract

Considerable interest exists in the singing quality known as 'belt', which is now quite commonly used in the theatre. This vocal quality is heard in much popular and ethnic music today, and is commonly encountered on the stages of London's West End and New York's Broadway. This paper describes the perceptual salience of synthesised adult female belt and opera voice qualities based on acoustic and voice source analyses of natural sung tokens.

1. Introduction

With the aid of modern voice analysis and synthesis systems it is now becoming possible to categorise different vocal qualities in singing in terms of their production, acoustic output, and perception. This will enable singing students to choose from a selection of singing quality models and, with the aid of objective visual feedback systems (e.g. Howard & Rossiter, 1994), aim for an optimum vocal production that produces the chosen vocal quality efficiently and healthily.

The aim of this study was to use spectral analysis and closed quotient analysis (e.g. Abberton et al., 1989) to define the salient acoustic and voice source features of female opera quality and female belt quality, and then to reproduce these qualities using the analysis results as input parameters to drive a voice synthesiser. A perceptual test was carried out to confirm the robustness of the models. A full description of this work is given in Evans (1995).

2. Recording and analysis

Multi-channel recordings of the acoustic pressure waveform and the averaged output waveform from a Rothenberg two-channel electroglottograph (Rothenberg, 1992) were made for a number of adult female singers appearing on or training for the London West End theatre and opera stage, singing in both opera and belt qualities. The recordings were auditioned to establish which vowels were actually performed using belt quality, which resulted in a number of tokens being discounted.

Vowels extracted from the data of three singers were used in this study. The vowel /3:/ (as in her) at pitch G4 sung by the opera singer AG and by the belter VP, plus the vowel /a:/ (as in far) at pitch E5 sung in both opera and belt qualities by singer MC, were subjected to average spectral analyses. The spectral results for the four separate vowel-pitch tokens are shown in the left-hand column of figure 1. These four tokens were subject to copy synthesis using the Klatt synthesiser. The results are shown in the right-hand column of figure 1.

3. Analysis and synthesis

The spectral analyses reveal several spectral differences between the two qualities. The opera quality tokens are characterised by dominant fundamental components and a region of lower energy at around 3-4.5 kHz separated by an energy trough. The belt tokens are characterised by 2nd harmonic dominance and a high energy plateau extending up to over 5 kHz. Vibrato was found on both opera tokens and on one belt token (MCE5/a:/), where it appears with a reduced amplitude.

Figure 1 shows spectra for the natural sung tokens (left column) and versions of these tokens synthesised by means of the Klatt speech synthesiser (Klatt, 1980). It can be seen that reasonably good spectral copies can be achieved by means of the Klatt synthesiser, although most of the higher frequency components above 5 kHz are missing due to its bandwidth limitations.
It should be noted that the relative formant positions of the natural sung data were not used to set the input formant parameters for the synthesised versions. This is because the Klatt design is limited to speech synthesis, where synthesis of the high-frequency, high-energy regions found in these singing qualities is not required. The closed quotient values for the natural data were converted to open quotient values for use as a voice source control parameter for the speech synthesiser. Vibrato was absent on the natural belt token VPG4/3:/. So that this would not provide too obvious a cue for the synthesised belt versions, vibrato was set to half the amplitude level for the opera tokens and included on all the synthesised tokens.

Evans & Howard 216 ICMC Proceedings 1996
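The closed-to-open quotient conversion mentioned above can be sketched as follows. This is a minimal illustration, assuming the closed quotient is expressed as the percentage of each vocal fold cycle for which the folds are in contact, so that the open quotient is simply its complement. The function name and the example value are hypothetical and are not taken from the measurements in this study.

```python
def closed_to_open_quotient(cq_percent: float) -> float:
    """Return the open quotient (%) given the closed quotient (%).

    Assumes the two quotients partition one vocal fold cycle,
    so they sum to 100 %.
    """
    if not 0.0 <= cq_percent <= 100.0:
        raise ValueError("closed quotient must lie between 0 and 100 percent")
    return 100.0 - cq_percent


# Hypothetical closed quotient, for illustration only (not measured data).
example_cq = 40.0
example_oq = closed_to_open_quotient(example_cq)
print(f"CQ = {example_cq:.1f} %  ->  OQ = {example_oq:.1f} %")
```

How the resulting value is scaled for a particular synthesiser's voice source parameter depends on that synthesiser and is not specified here.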

[Figure 1 panels: columns NATURAL VOWELS and SYNTHESIZED VOWELS; rows Opera G4/3:/, Belt G4/3:/, Opera E5/a:/, Belt E5/a:/; axes in dB and Hz.]

Figure 1: Average spectra of the synthesised sung vowels (right column) derived from the natural sung vowels (left column). All the above vowel sounds were used in the perceptual test.

4. Perceptual test

A perceptual test was carried out to compare identification of natural and synthetic belt and opera tokens. There were eight stimuli (four natural sung vowels and four synthetic vowels), each repeated 10 times in a randomised order. Ten normally hearing listeners took part, some with musical experience and some without, and they were asked to identify whether each stimulus was sung in opera or belt voice.

Figure 2 shows the results in relation to the previous musical experience of the listeners. Correct identification of the stimuli was much higher than chance level, and the synthetic stimuli were correctly identified less often than the natural stimuli. The group of listeners with no musical experience (PG, MT, ME) has the lowest correct identification for the synthetic stimuli, but their identification of the natural stimuli is similar to that of the more musical listeners.
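The test design described above (eight stimuli, ten repetitions, two response categories) can be sketched as below. This is an illustrative sketch only: the label strings, the fixed random seed, and the use of a one-sided binomial tail to judge "higher than chance" are assumptions, and the paper does not state how the randomisation or the comparison with chance was actually carried out. With two response options, chance is taken to be 50 % correct.

```python
import random
from math import comb

# The four vowel-pitch tokens named in the paper; label strings are illustrative.
TOKENS = ["AG G4 /3:/ opera", "VP G4 /3:/ belt",
          "MC E5 /a:/ opera", "MC E5 /a:/ belt"]
STIMULI = [(source, token)
           for source in ("natural", "synthetic")
           for token in TOKENS]


def make_trial_list(repeats: int = 10, seed: int = 0) -> list:
    """Return every stimulus repeated `repeats` times in shuffled order."""
    trials = STIMULI * repeats
    random.Random(seed).shuffle(trials)  # seed is an assumption, for repeatability
    return trials


def chance_tail_probability(correct: int, trials: int) -> float:
    """P(X >= correct) for X ~ Binomial(trials, 0.5), i.e. pure guessing."""
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


trials = make_trial_list()
print(len(trials), "trials per listener")
# Hypothetical score, not a result from the paper.
print(chance_tail_probability(60, len(trials)))
```

A small tail probability for a listener's score indicates that the score is unlikely to have arisen from guessing between the two categories.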

[Figure 2 chart: for each subject, % real sung vowels correct and % synthesized sung vowels correct (axis 0-100), and no. years musical experience (axis 0-20).]

Figure 2: Perceptual test results arranged according to the musical experience of each listener.

5. Discussion and conclusions

It was expected that the synthesised versions would be correctly identified less often than the natural tokens, since much information likely to contribute to the naturalness of sung tokens, such as jitter and shimmer, was absent. This could partly account for the lower level of correct identification of synthesised stimuli for listeners with no previous musical experience. It could be that non-musical listeners are less able to differentiate non-natural sounds than listeners with musical experience, because a higher degree of perceptual acuity might be expected in musical listeners. This may be especially true for the synthesised sounds, whose voice source is modelled as a constantly repeating waveshape. In other words, musical experience may possibly be equated with a higher degree of aural awareness and analytic listening skills.

However, the overall high level of correct identification does show that the perceptual differences in the parameters used to define the two qualities are large enough to distinguish between them, even though the parameters changed were limited.

This experiment suggests that a basic perceptual difference between female opera and belt qualities lies in the relative strength of spectral energy in: (i) the fundamental, and (ii) the 3-5 kHz region. For notes in the female singing range, the harmonics that appear in the 3-5 kHz region tend to be those above the 5th to 7th.
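The harmonic numbers falling in the 3-5 kHz region can be checked with a short calculation for the two pitches used in this study. This sketch assumes equal temperament with A4 = 440 Hz and treats the region as exactly 3000-5000 Hz; the function names are hypothetical, and the fundamental frequencies actually sung were not necessarily these nominal values.

```python
import math


def note_frequency(semitones_from_a4: int, a4_hz: float = 440.0) -> float:
    """Equal-tempered frequency a given number of semitones from A4."""
    return a4_hz * 2.0 ** (semitones_from_a4 / 12.0)


def harmonics_in_band(f0_hz: float, low_hz: float = 3000.0,
                      high_hz: float = 5000.0) -> list:
    """Harmonic numbers n for which n * f0 lies within [low_hz, high_hz]."""
    first = math.ceil(low_hz / f0_hz)
    last = math.floor(high_hz / f0_hz)
    return list(range(first, last + 1))


G4_HZ = note_frequency(-2)  # about 392 Hz
E5_HZ = note_frequency(7)   # about 659 Hz

print("G4:", harmonics_in_band(G4_HZ))  # [8, 9, 10, 11, 12]
print("E5:", harmonics_in_band(E5_HZ))  # [5, 6, 7]
```

Under these assumptions the region holds harmonics 8 to 12 at G4 and harmonics 5 to 7 at E5.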
The high harmonic numbers in this region are important as an acoustic cue to timbre, since these harmonics are: (a) not individually resolved by the hearing system, and (b) those that make a sound 'rough' or 'cutting' perceptually (e.g. Howard and Angus, 1996).

It should be noted that a high degree of spectral matching was achieved in all cases, although the match was less good overall for belt quality simply because its natural frequency range extends well above that of the Klatt synthesiser itself.

6. References

Abberton ERM, Howard DM, and Fourcin AJ. (1989). Laryngographic assessment of normal voice: A tutorial, Clinical Linguistics and Phonetics, 3, (3), 281-296.

Evans M. (1995). Vocal Qualities in Female Singing, University of York: unpublished D.Phil. Thesis.

Howard DM, and Rossiter D. (1994). Real-time visual displays for use in singing training: An overview, Proceedings of the Stockholm Music Acoustics Conference: SMAC-93, Publication no. 79 of the Royal Swedish Academy of Music, 191-196.

Howard DM, and Angus JAS. (1996). Music technology: An introduction to acoustics and psychoacoustics, London: Focal Press.

Klatt DH. (1980). Software for a cascade/parallel formant synthesiser, Journal of the Acoustical Society of America, 67, 971-995.

Rothenberg M. (1992). A multichannel electroglottograph, Journal of Voice, 6, 36-43.