A Parallel-Formant Speech Synthesizer in Max/MSP

Michael Kexin Ma, Sidney Fels*, and Robert Pritchard†
*Dept. of Electrical and Computer Engineering, University of British Columbia, ssfels@ece.ubc.ca
†School of Music, University of British Columbia, bob@interchange.ubc.ca

Abstract

Gesturally Realized Audio, Speech, and Song Performance (GRASSP) is a software implementation of a real-time parallel-formant speech synthesizer in Max/MSP. The synthesizer is based on the JSRU (Joint Speech Research Unit) parallel-formant speech synthesizer. The resulting synthesizer is controlled by a Cyberglove, a custom glove, a foot pedal, and a Polhemus tracker.

1 Introduction

This project refines and extends previous work found in Fels and Hinton's Glove-TalkII. Its abilities are expanded, and the resulting system is referred to as the Gesturally Realized Audio, Speech, and Song Performance environment (GRASSP). Our motivation is the desire for a highly flexible speech synthesizer that can be used in improvisatory and composed performances, but which can also control other performance attributes such as sound diffusion, video processing, and kinetic sculpture.

2 The GRASSP environment

GRASSP includes a physical interface and Max/MSP bpatchers grouped into a Learner, a Mapper, and a Synthesizer. The physical interface comprises a right-hand Cyberglove with an attached Polhemus sensor, a left-hand custom glove, and a foot pedal.

2.1 Data Acquisition

The Learner contains the Dictionary and the Recorder bpatchers. The Dictionary displays the right-hand configurations for fifteen consonants, and the Recorder stores the eighteen data points generated by the Cyberglove when the user imitates those positions. Vowels are controlled by the horizontal location of the Polhemus sensor while the hand is in an open position. As the performer moves her wrist to target locations in the horizontal plane, the Recorder is used to store the coordinates for each of the eleven vowels. Once all of the required data for each phoneme has been recorded, it is stored in an Accent file.

2.2 Performance

In performance the Mapper relates the current hand positions and configurations to the stored data. Normalized radial basis functions calculate which phoneme data set in the Accent is closest to the performer's current hand configuration. The Mapper, containing a mixture-of-experts architecture with a vowel, a consonant, and a manager expert, calculates the probability of the system producing a consonant and adjusts the mix between the vowel and consonant experts' outputs (a numerical sketch of this mapping appears after Section 2.3). The foot pedal controls the overall volume of the resulting sound.

2.3 Sound sources and processing

The GRASSP system uses models of glottal waveforms as its primary sound source. It can also use sampled waveforms or live sound input, providing a wide variety of timbral resources for performers. The audio output from GRASSP can be processed using the UBC Toolbox, a set of thirty-five Max/MSP audio and video processors that can be controlled by GRASSP's physical interface, enhancing and expanding the performance abilities of the system.
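The phoneme lookup and expert blending of Section 2.2 can be illustrated numerically. The following is a minimal Python sketch, not the Max/MSP implementation: the Gaussian basis functions, the toy Accent data, and the names nrbf_weights and manager_mix are illustrative assumptions.

```python
import numpy as np

def nrbf_weights(hand, centers, width=0.3):
    """Normalized RBF activation of each stored phoneme for one hand frame."""
    d2 = np.sum((centers - hand) ** 2, axis=1)   # squared distance to each phoneme's centre
    act = np.exp(-d2 / (2.0 * width ** 2))       # Gaussian basis function per phoneme
    return act / np.sum(act)                     # normalize so activations sum to 1

def manager_mix(p_consonant, consonant_out, vowel_out):
    """Manager expert gates between the consonant and vowel experts' outputs."""
    return p_consonant * consonant_out + (1.0 - p_consonant) * vowel_out

# Toy Accent: one 18-dimensional Cyberglove frame stored per consonant.
rng = np.random.default_rng(0)
consonant_centers = rng.uniform(0.0, 1.0, size=(15, 18))

# A hand frame near stored consonant 3, as if the performer formed that shape.
hand = consonant_centers[3] + rng.normal(0.0, 0.02, size=18)

w = nrbf_weights(hand, consonant_centers)
print("closest consonant:", int(np.argmax(w)))   # -> 3
```

Because the activations are normalized, they can be read directly as blending weights, so the synthesizer's control parameters vary smoothly as the hand moves between stored phoneme shapes.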
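The parallel-formant principle behind the Synthesizer can be sketched in the same way: a single excitation drives several two-pole resonators in parallel, and their individually weighted outputs are summed. In the sketch below, the impulse-train source and the formant values for an /a/-like vowel are illustrative assumptions, not the JSRU control tables or GRASSP's glottal models.

```python
import numpy as np
from scipy.signal import lfilter

FS = 16000  # sample rate (Hz)

def resonator(x, freq, bw):
    """Two-pole bandpass resonator at centre frequency `freq` (Hz) with bandwidth `bw` (Hz)."""
    r = np.exp(-np.pi * bw / FS)
    theta = 2.0 * np.pi * freq / FS
    a = [1.0, -2.0 * r * np.cos(theta), r * r]
    b = [1.0 - r]  # rough gain normalization
    return lfilter(b, a, x)

def glottal_source(f0, dur):
    """Crude glottal excitation: an impulse train at the pitch period."""
    src = np.zeros(int(FS * dur))
    src[:: int(FS / f0)] = 1.0
    return src

# Approximate formants for /a/: (frequency Hz, bandwidth Hz, amplitude).
formants = [(700, 90, 1.0), (1220, 110, 0.5), (2600, 160, 0.25)]

src = glottal_source(f0=110, dur=0.5)
out = sum(amp * resonator(src, f, bw) for f, bw, amp in formants)  # parallel branches summed
out /= np.max(np.abs(out))  # scale to +/-1 for playback
```

Driving the per-formant frequencies and amplitudes from the Mapper's blended phoneme parameters, frame by frame, yields the continuous gesture-to-speech control that GRASSP realizes in Max/MSP.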
References

J. M. Rye and J. N. Holmes, "A Versatile Software Parallel-Formant Speech Synthesizer", JSRU Research Report No. 1016, 1982.

J. N. Holmes, "Requirements for Speech Synthesis in the Frequency Range 3-4 kHz", Proc. FASE Symposium on Acoustics and Speech, Venice, pp. 169-172, 1981.

S. S. Fels and G. E. Hinton, "Glove-TalkII: A neural network interface which maps gestures to parallel formant speech synthesizer controls", IEEE Transactions on Neural Networks, Vol. 9, No. 1, pp. 205-212, 1998.

E. Lewis and M. A. A. Tatham, "A Generic Front-End for Text-to-Speech Synthesis Systems", Proc. 3rd European Conference on Speech Communication and Technology, Vol. 2, pp. 913-916, 1993.

R. Pritchard and S. S. Fels, "GRASSP: Gesturally-Realized Audio, Speech and Song Performance", Proc. 6th International Conference on New Interfaces for Musical Expression, Paris, pp. 272-276, 2006.