LARYNXOPHONE: USING VOICE AS A WIND CONTROLLER

Alex Loscos, Oscar Celma
Universitat Pompeu Fabra
Barcelona, Spain
{alex.loscos, oscar.celma}@iua.

ABSTRACT

In the context of music composition and production using MIDI sequencers, wind instrument tracks are built on the synthesis of music scores that have been written using either MIDI keyboards or mouse clicks. Such a modus operandi clearly handicaps the musician when it comes to shaping the resulting audio with the desired expression. This paper presents a straightforward method to create convincing wind instrument audio tracks that avoids intermediate MIDI layers and eases expression control. The method stands on the musician's ability to mimic, by singing or humming, the desired wind instrument performance. From this vocal performance, a set of voice features is extracted and used to drive a real-time cross-synthesis between samples of a wind instrument database and the musician's voice signal.

1. INTRODUCTION

When writing a song with a sequencer, adding convincing wind instrument tracks is easier said than done if the composer cannot use a wind controller, either because he lacks one or because he lacks the skills to use it. When this occurs, the composer either records the score using a MIDI keyboard or writes it by clicking the mouse. Clearly, neither of these solutions gives intuitive and meaningful control over the final synthesis result, and they usually require tedious trial-and-error tuning. The hypothesis on which this work is based is that voice can successfully specify not only the main characteristics of a melody but also expressivity nuances such as attack sharpness / smoothness or dynamic envelope. For this reason, we propose a voice-driven wind synthesizer that takes advantage of the assumed correspondence between the expressivity of these two instruments.

2.
VOICE-TO-MIDI / MIDI-TO-WIND VS. VOICE-WIND CROSS-SYNTHESIS

If we think about controlling a tone generator, a sampler, or a synthesizer, MIDI instantly comes to mind. In the case we are trying to solve here, this would imply using a voice-to-MIDI conversion to turn voice expression attributes into MIDI control, and then using these to drive the synthesis. However, an alternative solution exists in which voice directly drives the wind instrument synthesis using real-time cross-synthesis techniques. Both approaches are presented and studied in the following sections.

2.1. Voice to MIDI converters

There already exist several hardware and software commercial voice-to-MIDI converters.

2.1.1. Software and hardware solutions

Regarding voice-to-MIDI software, among the diverse solutions that offer real-time conversion we can find MidiVoicer and Digital Ear. Regarding hardware converters, we remark two products: Vocal to MIDI and MidiVox. Obviously, it is nonsense to solve the problem of not having a wind controller by replacing it with a hardware voice-to-MIDI requirement.

Figure 1: MidiVox

2.1.2. Real-time note onset detection in singing voice

Although some of these converters might be useful and give reasonable results in some cases, they generally lack robustness. The problem comes from real-time note onset detection in the singing voice. This critical process is in charge of deciding, at each analysis frame time, whether the current voice data belongs to a new note or has to be considered part of the note that was already being performed. This decision has to be taken with no frame delay and with no prior knowledge of the final melody (not even key and/or scale). The considerable complexity of this problem makes it nearly impossible to prevent the converter from outputting false notes.

2.1.3.
Mapping voice attributes to MIDI controls

Regardless of the modest suitability of using a voice-to-MIDI converter, it is important, and probably a key factor, to decide how to map voice attributes to MIDI messages.

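As an illustration of one possible mapping (a minimal, hypothetical sketch, not the converters' actual implementation), pitch can feed the key number plus a pitch bend offset, and energy the key velocity plus aftertouch, with a naive onset decision of the kind discussed in 2.1.2; all function and field names below are assumptions:

```python
import math

def hz_to_midi(f0):
    """Convert a pitch in Hz to a fractional MIDI key number (A4 = 69)."""
    return 69.0 + 12.0 * math.log2(f0 / 440.0)

def map_frame(f0_hz, energy, note_state, onset_threshold=0.6):
    """Map one analysis frame to a MIDI-like message.

    note_state holds the key number and energy captured at the last
    detected onset; a new note is (naively) declared when the pitch
    deviates more than onset_threshold semitones from it.
    """
    key = hz_to_midi(f0_hz)
    if note_state is None or abs(key - note_state["key"]) > onset_threshold:
        # Onset frame: quantized key number and velocity from current values.
        note_state = {"key": round(key), "energy": energy}
        msg = {"type": "note_on",
               "key": note_state["key"],
               "velocity": min(127, int(energy * 127))}
    else:
        # Continuation frame: pitch bend / aftertouch from the difference
        # between current values and the onset frame values.
        msg = {"type": "modulation",
               "pitch_bend": key - note_state["key"],
               "aftertouch": energy - note_state["energy"]}
    return msg, note_state
```

Such a per-frame decision, taken with no look-ahead, is exactly where the false-note problem described above arises.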
Basically, driving a synthesis with voice utterances requires a mapping between pitch- and dynamics-related voice attributes and MIDI messages: note on / note off (key number and key velocity), polyphonic key pressure (aftertouch), and pitch bend change. Of course, these voice attributes do not span the whole vocal expressivity space, nor can these MIDI messages reproduce all possible wind instrument expressive nuances; however, when adequately mapped they allow a useful basic control. The first solution is to map the analysis pitch envelope to the key number and its associated modulation along the note, the pitch bend; and to map the Excitation plus Residual (EpR) Voice Model excitation parameters [1] (excitation gain, excitation slope, and excitation depth, as shown in figure 2) to the key velocity and its associated modulation, the aftertouch.

Figure 2: EpR model voice excitation representation.

This process has to run on the fly, which means that once a frame has been detected as the onset of a note, the converter takes the current pitch and dynamic values (possibly averaged over a short past history) as the mean over the note. Thus, all following frames that are considered part of that note define aftertouch and pitch bend messages from the difference between their current values and the onset frame values.

2.1.4. Synthesizers

Some synthesizers have appeared quite recently with which very realistic wind instrument sounds can be achieved. We should mention among these the Vienna Symphonic Library (VSL) and Synful. Although both are sample-based synthesizers, the VSL approach is to have a huge amount of samples with no sample transformation other than concatenation, while the Synful approach is based on fewer samples plus the possibility of transforming them using a sinusoidal plus residual model of the sound and cutting-and-splicing techniques.

2.2.
Cross-synthesizing voice with wind instruments

We define cross-synthesis as the technique by which elements of one or more sounds combine to create a new one with hybrid properties. The cross-synthesis between wind instrument samples and voice utterances is a shortcut technique that avoids intermediate MIDI conversions. Taking advantage of this technique's capabilities, we can extend the voice control beyond pitch, dynamics, and their associated modulations, and set up continuous control over, for example, the sharpness of the attack or the instrument timbre modulations.

3. THE LARYNXOPHONE

A prototype with limited features, which we call the Larynxophone, has been implemented out of trumpet samples.

3.1. The prototype

Similarly to [2], the Larynxophone processes can be decomposed into non-real-time processes, which are the ones that take place as a prelude, before the application runs, and real-time processes, which are the ones that occur while the user is performing. The analysis used for both the trumpet samples and the voice signal captured by the microphone is frame-based, and uses spectral domain techniques that stand on the rigid phase-locked vocoder [3].

3.1.1. Non-real-time processes: instrument database creation

The non-real-time processes focus on the wind instrument database creation. That is, on recording real performances, editing and cutting them into notes, analyzing and labelling them, and storing them as an instrument database. In the current implementation, the database contains only three trumpet notes at A3, A#5, and C5. For each sample the database contains a binary file in which the necessary data resulting from the analysis is stored.

3.1.2. Real-time processes: voice analysis, transformations and synthesis

The real-time processes start with a frame-based spectral analysis of the input voice signal, out of which the system extracts a set of voice features.
This voice feature vector decides which sample has to be fetched from the database, and controls the cross-synthesis between the instrument sample and the voice signal. Because of the nature of the prototype database, the criterion for deciding which trumpet sample to pick at each frame time is "take the nearest sample in pitch", which is a particularization of the more general criterion "take the sample with the most similar expression". From that sample, the trumpet frame is chosen sequentially, taking loops into account. The frame is transformed to fit the user's energy and tuning note specification, for which energy correction and transposition with spectral shape preservation are applied, with techniques similar to those described in [4]. Finally, the synthesis is in charge of concatenating the synthesis frames by inverse frequency transformation and the necessary windowing and overlap-add processes.
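The "nearest sample in pitch" selection and the per-frame correction factors can be sketched as follows (a hypothetical illustration: the database layout, dictionary keys, and pitch values are assumptions, not the prototype's actual data format):

```python
import math

def nearest_sample(database, voice_pitch_hz):
    """'Take the nearest sample in pitch': pick the database note whose
    pitch is closest to the voice pitch, compared on a log (semitone) scale."""
    return min(database,
               key=lambda s: abs(math.log2(s["pitch_hz"] / voice_pitch_hz)))

def frame_corrections(sample, voice_pitch_hz, voice_energy):
    """Per-frame transposition ratio and gain that make the trumpet frame
    fit the user's tuning and energy before synthesis."""
    transposition = voice_pitch_hz / sample["pitch_hz"]  # spectral shift factor
    gain = voice_energy / max(sample["energy"], 1e-12)   # energy correction
    return transposition, gain

# Hypothetical three-note database, mirroring the current prototype's
# A3, A#5 and C5 trumpet notes.
TRUMPET_DB = [{"pitch_hz": 220.00, "energy": 1.0},   # A3
              {"pitch_hz": 932.33, "energy": 1.0},   # A#5
              {"pitch_hz": 523.25, "energy": 1.0}]   # C5
```

With a richer database, the same lookup would generalize to "take the sample with the most similar expression" by comparing more features than pitch alone.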

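The concluding overlap-add stage can be sketched as below (assuming the transformed half-spectra are already available; frame length, hop size, and window choice are hypothetical):

```python
import numpy as np

def overlap_add(spectral_frames, hop, window):
    """Concatenate synthesis frames: inverse-FFT each (half-)spectrum,
    apply the synthesis window, and overlap-add at the hop size."""
    frame_len = len(window)
    out = np.zeros(hop * (len(spectral_frames) - 1) + frame_len)
    for i, spectrum in enumerate(spectral_frames):
        frame = np.fft.irfft(spectrum, n=frame_len)   # back to time domain
        out[i * hop: i * hop + frame_len] += frame * window
    return out
```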
Figure 3: Larynxophone block diagram

3.1.3. Additional transformations

The prototype incorporates some additional transformations. One of them is a one-octave up / down transposition. When turned on, it is applied to the voice analysis data before using it to fetch the trumpet frame. Another additional transformation is pitch quantification, implemented as in [5] with optional specification of key and chord. There is also an 'Extract Unvoiced' switch button that mutes the unvoiced parts of the input. This allows the musician to use unvoiced allophones in the performance. Finally, the prototype includes a morph feature by which the resulting synthesis timbre can be defined as a balanced interpolation of both voice and trumpet timbre. Timbres are defined by means of the harmonic peaks' magnitude data. The morph slider in the interface defines the morph interpolation value.

3.2. Improvements

The current state of the Larynxophone is far from a complete prototype. Its preliminary essence leaves room for improvement in each and every process involved. Some short-term improvements are considered in this section. The current instrument database calls for many more samples, ideally several different dynamics (from pianissimo to fortissimo) for each of the notes the instrument can perform. The more samples the database contains, the less extreme the transformations we will have to apply. The only voice attributes used as cross-synthesis controls are energy and pitch. Energy should be replaced by the EpR voice excitation parameters, and more complex controls such as the ones mentioned in 2.2 could be achieved by using the voice excitation gain first derivative or spectral tilt voice attributes.
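A rough sketch of such a derivative-based control follows; it uses frame RMS as a stand-in for the EpR excitation gain (which in the actual model comes from the EpR fit [1]), so all names and the RMS shortcut are assumptions:

```python
import math

def rms(frame):
    """Root-mean-square of one time-domain analysis frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def gain_db(frames):
    """Per-frame gain in dB, here derived from RMS energy as a stand-in
    for the EpR excitation gain parameter."""
    return [20.0 * math.log10(max(rms(f), 1e-12)) for f in frames]

def gain_derivative(gains_db):
    """First derivative of the gain curve (dB per frame). Large positive
    values at a note start could drive an 'attack sharpness' control."""
    return [b - a for a, b in zip(gains_db, gains_db[1:])]
```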
In synthesis, whenever the instrument sample is shorter than the note performed by the user, the system keeps looping it from its attack to its release. It is a must to have a 'loop region' attribute computed and stored in the database for each of the samples. In the current prototype, the problem of deciding which sample to pick from the database is analogous to that of the real-time note onset detection presented in 2.1.2, since the system uses one sample per note. A solution to this problem would be to decide at each frame which is the best sample to use, and whenever a switch is decided, a smooth transition from the current database sample to the new one would be applied using the spectral concatenation explained in [1].

4. CONCLUSIONS

There is a long way to go to reach a musically meaningful voice to wind instrument synthesizer. The Larynxophone is just a first step. The requirements on the user, in terms of ability to sing, are not strict, and even less so when using the pitch quantification transformation with the corresponding key and chord specified. The approach used in the synthesizers presented in 2.1.4 is to include, in the instrument database, complete musical phrases that represent as many different ways of performing as possible. The current prototype might be far from these synthesizers' sound quality, but the Larynxophone approach does not require any expression database and, even more important, does not require the musician to fit any idea into a discrete expression set.

5. ACKNOWLEDGEMENTS

This research has been partially funded by the EU-FP6-IST-507913 project SemanticHIFI.

6. REFERENCES

[1] Bonada, J., Loscos, A. "Sample-based singing voice synthesizer by spectral concatenation", Proceedings of the Stockholm Music Acoustics Conference, Stockholm, Sweden, 2003.

[2] Cano, P., Loscos, A., Bonada, J., de Boer, M., Serra, X.
"Voice Morphing System for Impersonating in Karaoke Applications", Proceedings of the International Computer Music Conference, Berlin, Germany, 2000.

[3] Puckette, M. S. "Phase-locked vocoder", Proceedings of the IEEE Conference on Applications of Signal Processing to Audio and Acoustics, Mohonk, USA, 1995.

[4] Laroche, J., Dolson, M. "New phase-vocoder techniques for pitch-shifting, harmonizing and other exotic effects", Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1999.

[5] Zölzer, U. DAFX - Digital Audio Effects. John Wiley & Sons, 2002.