CONSTRUCTION OF AN ELECTRONIC TIMBRE DICTIONARY FOR ENVIRONMENTAL SOUNDS BY TIMBRE SYMBOL

Yohei Kobayashi, Naotoshi Osaka
Tokyo Denki University, School of Science and Technology for Future Life
2-2 Kanda-Nishiki-cho, Chiyoda-ku, Tokyo 101-8457, Japan
yohei@srl.im.dendai.ac.jp, osaka@im.dendai.ac.jp

ABSTRACT

This paper introduces an electronic timbre dictionary, a search and synthesis system for user-desired timbres. Timbre is represented by timbre symbols, which are newly defined here, and the method for designing such symbols is discussed. The timbre dictionary is designed to be open to network users, so that the symbols can be defined and edited by many users. The number of timbre symbols will therefore grow; however, they are expected to converge and become stable, as entries in Wikipedia have. The system is designed to ease manipulation of timbre symbols and the linked sound data using a newly defined graphical user interface (GUI) interaction mode: snaked-path listening. This paper reports the total system configuration in detail. The system is intended for use in automatic sound media content creation and in electronic music score-making and performance.

1. INTRODUCTION

We have been studying a search and synthesis system for environmental sounds for application to multimedia content creation. We call the system an electronic timbre dictionary. In order for users to represent a sound image directly, a timbre symbol is newly introduced: it represents the timbre of a short-duration sound stream by a single symbol, or a longer timbre by a sequence of such symbols. Timbre symbols are used as the input to the system.

In relation to timbre search, various studies have been done on retrieving user-desired multimedia content from large video and music databases. [1] and [2] are well-known audio browsing tools. Trials have been made to design sophisticated aural and visual displays that map sound files of interest, in order to assist sound analysis, synthesis, and retrieval. SoundComplete [3], which attaches annotations to whole pieces of music or to arbitrary portions of them, enables those pieces to be searched and new music to be created. These systems use MusicXML [4], an XML-formatted music notation system. Freesound [5] is a well-known search system for sound in general. A search is done by the name of a sound, an instrument, or a file. If timbre is added as an annotation when a sound is registered, a search using timbre is also possible; however, timbre is not designed systematically and is dealt with only as one annotation among others. Technology for searching for environmental sounds directly from user-desired timbres is not known to exist.

The electronic timbre dictionary we propose here is aimed at a more elaborate search for a sound, based on a systematic timbre structure using timbre symbols. The difference between this system and the already known systems is that it is not intended as a general-purpose sound browser but is dedicated to timbre symbol manipulation; sound files are limited to short or isolated sounds, and whole music pieces are not dealt with.

In order to represent a timbre, we have been proposing timbre symbols [6]. In reference [6], the necessity of a timbre theory is argued for, and the requirements for such a theory are discussed. We stated that timbre symbols are indispensable; the design and structure of such symbols are discussed in section 2.
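As a rough illustration of this input idea, the following minimal Python sketch shows a single timbre symbol and a sequence of symbols handed to the dictionary as a query. The notation is hypothetical, not the authors' actual syntax; the symbol names are borrowed from the water-sound example given later in Table 2.

    # Hypothetical notation, for illustration only: one symbol names a short
    # perceptual unit of sound; a sequence of symbols describes a longer timbre.
    single_symbol = "p#"                    # one water-drop micro-timbre (see Table 2)
    symbol_sequence = ["p#", "ch#", "ch#"]  # a drop followed by a short stream
    query = " ".join(symbol_sequence)       # the query string handed to the dictionary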
In order to develop timbre symbols and make them more sophisticated, we intend to release the dictionary on the internet so that any user can share it. The timbre dictionary is not only referred to, but also modified, edited, and revised by many network users, for both timbre symbols and sound data. As a result, similarly to Wikipedia, it is expected to converge, become stable, and reach high quality. In section 3 and the succeeding sections, the system design and configuration are described.

2. DESIGN OF TIMBRE SYMBOLS

Let us classify known sound representation models and synthesis systems from the viewpoint of the components of which they are composed. We define those components as timbre components, although not all of them are perceived as timbre. Table 1 classifies timbre-component-based representations, which try to resynthesize the original sound as closely as possible, in terms of their timbre components. The classification is ordered by the granularity of the timbre components, and includes additive synthesis. In No. 1, a sinusoidal model [7], sinusoids are used as the timbre component with the smallest granularity; the S/N ratio is the measure of the fitness of the resynthesized wave. In No. 2, SMS [8], noise is added as a further timbre component.

Noise is treated as a stochastic process, which is justified because, microscopically, noise is perceived in the same way, and exact reconstruction of the waveform is not the goal of the system. These two components may not be appropriate to call timbre components, since their granularity is too small for them to be perceived individually. No. 3 is one of the target levels of our study, in which units shorter than speech vowels are treated as timbre components; we call them micro-timbres. No. 5 is familiar to all: onomatopoeia, whose duration is about that of a speech phoneme. In No. 4, musical instrument sound is represented in terms of phonemes, while in No. 6, speech is expressed in terms of instrumental sounds. In reference [10], a trial has been conducted to represent environmental sounds, such as birds singing, in terms of MIDI sound sources.

Table 1. Classification of timbre notation of sounds in terms of timbre components (SD: spectral distortion)

No.  Timbre component     Granularity  Measure                      Analysis
1    Sinusoids            Small        S/N ratio                    Algorithm [7]
2    Sinusoids + noise    Small        SD                           Algorithm [8]
3    Micro-timbre         Medium       -                            -
4    Phoneme              Large        SD                           Algorithm [9] / Manual
5    Onomatopoeia         Large        -                            -
6    Instrumental sound   Large        SD / Perceptual distortion   Algorithm [9] / Manual
7    MIDI signal          Large        SD                           Algorithm [10]

Table 2. An example of timbre symbols for water sounds

Timbre component   Macro-timbre   Onomatopoeia (Japanese)
Drop               p#             picha, pisha, pita, pichi, pwan
Stream             ch#            choro, joro
Shower, Stir       sh#            shoer, jar, cop, gobo, puk, dok

If timbre components of large granularity are adopted to represent a sound, distortion increases even in an analysis/synthesis framework; such representations are nevertheless useful for artistic pieces that express deformed sounds. In this paper, timbre components are restricted to perceptual units of timbre as short as tens of milliseconds, which covers Nos. 3 to 5 of the table. Timbre symbols are defined to represent all of these components: micro-timbre, phoneme, and onomatopoeia. We also define macro-timbres as sequences of a few timbre symbols, which represent micro-timbre sequences or groups of them. Table 2 gives an example of timbre symbols for water sounds; 'p#' represents water-drop sounds in general. Figure 1 gives an example of micro-timbres used as lyrics. Once these symbols are defined for a sound, they can be used in writing a musical score, which will lead to new directions in the scoring of electroacoustic music. The system is not suited to invariant textural sounds such as various noises, but is particularly useful for rapidly changing timbres such as the stirring of water and birdsong, as well as speech.

These symbols are initially provided in the system, and users can add their own new definitions. For sounds which interest many users, the symbols are expected to converge among speakers of the same language; this optimistic expectation comes from the success of Wikipedia. Conversely, when multiple sounds are registered for one symbol, the system can browse all of them. This framework provides us with a practically unlimited variety of timbres. Timbres that we feel could benefit from meaningful symbols, beyond their use as sound sources for artistic music, include the sounds heard through an internal-medicine stethoscope, the struck sound of a watermelon used to check its sweetness, and car engine sounds used to check for trouble.
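To make the dictionary structure concrete, the following Python sketch organizes the water-sound entries of Table 2 as records that link a macro-timbre symbol to its onomatopoeia and to the registered sound files. This is an illustrative sketch only, not the authors' actual database schema; the field names and the look-up helper are assumptions.

    # A minimal sketch (not the system's actual schema) of dictionary entries
    # modeled on Table 2: each macro-timbre symbol links Japanese onomatopoeia
    # and the sound files registered by users.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class DictionaryEntry:
        component: str                  # how the sound is generated, e.g. "Drop"
        macro_timbre: str               # symbol, e.g. "p#"
        onomatopoeia: List[str]         # Japanese onomatopoeia for this timbre
        sound_files: List[str] = field(default_factory=list)  # registered waves

    water_sounds: Dict[str, DictionaryEntry] = {
        "p#":  DictionaryEntry("Drop",   "p#",  ["picha", "pisha", "pita", "pichi", "pwan"]),
        "ch#": DictionaryEntry("Stream", "ch#", ["choro", "joro"]),
    }

    def look_up(symbol: str) -> List[str]:
        """Return all sound files linked to a symbol (several sounds may share one symbol)."""
        entry = water_sounds.get(symbol)
        return entry.sound_files if entry else []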
3. IMPLEMENTATION OF ELECTRONIC TIMBRE DICTIONARY

Since timbre symbols are not familiar to users, the system needs to assist users in manipulating them. This section describes the total system configuration in detail, including assistive technologies for the timbre symbol interface, namely several types of graphical user interface interaction modes: the hierarchical type, the vector type, and snaked-path listening.

3.1. Outline of the system

We implemented an electronic timbre dictionary. The system functions and their relations are shown in Figure 2. The timbre symbol database and the wave database are shared by all users via the network, and network users can search for and synthesize sounds from timbre symbols. In a web browser on the client system, a user can access both the timbre symbol database and the wave database through text input and through two methods of interaction with the GUI.

Figure 1. Example of a micro-timbre and its sequence.
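As a sketch of the client-server exchange outlined above, the following Python snippet sends a sequence of timbre symbols to the shared server and receives the matching entries. The URL, endpoint name, and JSON message format are assumptions made purely for illustration; the paper only states that Apache and PHP handle requests from the browser, and does not specify the protocol.

    # Hypothetical client request; endpoint and message format are assumptions.
    import json
    import urllib.request

    def look_up_timbre(symbols, server="http://timbre-dictionary.example/lookup.php"):
        """Send a sequence of timbre symbols to the server and return matching entries."""
        payload = json.dumps({"symbols": symbols}).encode("utf-8")
        request = urllib.request.Request(
            server, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))

    # Example: ask the dictionary for a water drop followed by a stream.
    # matches = look_up_timbre(["p#", "ch#"])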

Figure 2. Block diagram of the system functions.

The system has three modes:

1. Sound data and timbre symbol registration mode
2. Timbre symbol editing mode
3. Sound looking-up mode: search/synthesis

Mode 1 is the initial mode: sound data are sent over the network, and timbre symbols are defined, together with other annotations, and stored in the database. Figure 2 depicts a registration window. Mode 2 is the editing mode; it supposes that sound data have already been stored and timbre symbols have already been defined. Symbols and their data can then be edited by other users to improve the quality of the symbols and of the database. Mode 3 is the most important one and the primary purpose of the system; the other two functions are assistive to this looking-up function. In this system, "looking up" a symbol in the dictionary means not only performing a sound search, but also synthesis if an appropriate sound does not exist.

In mode 3, there are two types of input: interactive input and batch input. Interactive input gives the server timbre symbols one by one, using the two methods provided in the GUI; this is described in detail in the next section. In batch input, a timbre symbol script is input to the server. The script is passed directly to the search engine, which accesses the timbre symbol database and finally retrieves the sound data linked to each timbre symbol.

3.2. Configuration of the software system

In order for users to manipulate the system via a web browser, Flash is used for the interface, built with Adobe Flex Builder 2 [11]. The web server runs on Windows; Apache controls access from web browsers on the client systems. To decrease the load on clients, the scripting language PHP is used on the server side to process requested messages. The synthesis control component is written in C#, and the synthesis engine is provided as a dynamic-link library (DLL).

Figure 3. Hierarchical-type mode.

4. CHARACTERISTIC FUNCTIONS FOR THE SYSTEM

This section describes the characteristic functions unique to this system. First, there are two newly defined GUI interaction modes, the hierarchical type and snaked-path listening, introduced specifically so that users can reach the target timbre as quickly as possible. Then, the sound synthesis functions are discussed; they are a collection of known algorithms and can evolve so that the system can incorporate a greater variety of timbres.

4.1. Hierarchical-type and vector-type modes

In order to search timbre symbols efficiently and quickly, they are placed in a hierarchical structure according to how the sound is generated, rather than how it is perceived. Timbre-symbol search is done within user-defined hierarchies; Fig. 3 depicts the hierarchical-type mode. The vector type is used to show the timbres within one category of a layer. It uses many perception-related physical feature dimensions, such as pitch, MFCC (mel-frequency cepstral coefficients), sound duration, spectral centroid, and so on. Timbres are placed in a multi-dimensional vector space according to these feature vectors, and Principal Component Analysis (PCA) is adopted to reduce the feature vectors to three dimensions for display. One advantage of the system is that both the hierarchical structure and the vector space are taken into consideration in order to find the placement of a timbre.
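The following Python sketch illustrates the vector-type placement under stated assumptions: the actual system is implemented in C# and PHP, and its exact feature set and toolchain are not specified in the paper, so the features computed below and the use of NumPy and scikit-learn are purely illustrative.

    # Illustrative only: compute a few perceptual features per short sound and
    # reduce them to 3-D coordinates with PCA for placing timbre icons.
    import numpy as np
    from sklearn.decomposition import PCA

    def features(wave: np.ndarray, sr: int) -> np.ndarray:
        """Duration, level, zero-crossing rate, spectral centroid and spread of one sound."""
        duration = len(wave) / sr
        rms = np.sqrt(np.mean(wave ** 2))
        zcr = np.mean(np.abs(np.diff(np.sign(wave)))) / 2.0
        spectrum = np.abs(np.fft.rfft(wave))
        freqs = np.fft.rfftfreq(len(wave), d=1.0 / sr)
        centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)
        spread = np.sqrt(np.sum(((freqs - centroid) ** 2) * spectrum) / (np.sum(spectrum) + 1e-12))
        return np.array([duration, rms, zcr, centroid, spread])

    def place_in_3d(waves, sr: int) -> np.ndarray:
        """Map each sound's feature vector to 3-D coordinates for the icon display."""
        X = np.stack([features(w, sr) for w in waves])
        return PCA(n_components=3).fit_transform(X)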

Figure 4. Snaked-path listening mode.

4.2. Snaked-path listening mode

In order for users to actively search for the timbres they want, we implemented a function which enables successive playback of sounds by tracing a path through the timbre icons shown in the modes introduced above. We call it snaked-path listening. Fig. 4 depicts the listening view: it shows the path traced by the mouse cursor, and the sounds linked to the timbre symbols on that path are played back in sequence. The touched icons are also gathered and reassembled into new timbres, shown at the bottom of the figure.

4.3. Sound synthesis functions

The system is also equipped with sound synthesis functions. These functions are convenient in the following cases:

1. there is no sound corresponding to the timbre symbol the user has input;
2. a wave found in the wave database is to be compared with a wave synthesized from the given timbre symbols;
3. a sound data sequence has been defined by snaked-path listening.

In case 2, the user can choose between the sounds being compared. In case 3, the path defines a sequence of sounds, and the synthesis functions can produce a wave file of those sounds played in sequence. At present, a limited number of synthesis methods are provided: the Synthesis ToolKit (STK) [12], a physical-model-based synthesis engine for the sho [13][14], a traditional Japanese instrument, and Karplus-Strong-model-based wavetable synthesis [15].
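As an illustration of the last of these methods, the following is a minimal Python sketch of Karplus-Strong plucked-string synthesis [15]. It is a textbook rendering of the algorithm, not the system's C# synthesis engine; the sample rate and the averaging coefficient are arbitrary illustrative choices.

    # Minimal Karplus-Strong sketch: a recirculating noise burst filtered by a
    # two-point average, which produces the characteristic plucked-string decay.
    import numpy as np

    def karplus_strong(frequency: float, duration: float, sr: int = 44100) -> np.ndarray:
        """Synthesize a plucked-string tone of the given frequency and duration."""
        period = int(sr / frequency)                         # delay-line length in samples
        delay_line = np.random.uniform(-1.0, 1.0, period)    # initial noise burst
        out = np.empty(int(sr * duration))
        for n in range(len(out)):
            out[n] = delay_line[n % period]
            # average adjacent samples in the delay line (low-pass feedback)
            delay_line[n % period] = 0.5 * (delay_line[n % period] + delay_line[(n + 1) % period])
        return out

    # Example: a 440 Hz pluck lasting one second.
    # tone = karplus_strong(440.0, 1.0)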
5. CONCLUSION

An electronic timbre symbol dictionary has been researched. It consists of both search and synthesis systems for user-desired timbres, starting from timbre symbols for environmental sounds. In this study, some considerations for timbre symbols have been given, and a networked server-client system has been reported. We introduced two new types of GUI interaction modes, a hierarchical type and snaked-path listening, which make using the system more comfortable. Our first version is in Japanese; however, we will endeavor to provide an English version, and we hope to release it to the public in the near future.

6. REFERENCES

[1] Tzanetakis, G. and Cook, P., "3D graphics tools for sound collections," Proc. of the COST G-6 Conf. on Digital Audio Effects (DAFx-00), Verona, Italy, Dec. 2000.
[2] Brazil, E., Fernström, M., Tzanetakis, G., and Cook, P., "Enhancing sonic browsing using audio information retrieval," Proc. of the Int. Conf. on Auditory Display (ICAD 2002), Kyoto, Japan, 2002.
[3] Hirata, K. and Matsuda, S., "A current status report on music entertainment software SoundComplete," IPSJ SIG Technical Report, 2003-MUS-52, 2003 (in Japanese).
[4] Good, M., "MusicXML in practice: Issues in translation and analysis," Proc. of the International Conference on Musical Applications Using XML, 2002.
[5] Music Technology Group, "The Freesound Project," http://freesound.iua.upf.edu/
[6] Osaka, N., "Toward construction of a timbre theory for music composition," Proc. of the International Computer Music Conference (ICMC), Miami, 2004.
[7] McAulay, R. and Quatieri, T., "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-34, no. 4, Aug. 1986.
[8] Serra, X. and Smith, J., "Spectral Modeling Synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition," Computer Music Journal, 14(4): 12-24, 1990.
[9] Segnini, R., "Score phonetization and speech derived composition," Proc. of the International Symposium on Musical Acoustics (ISMA), 2-S2-13, Nara, Japan, 2004.
[10] Modegi, T., "Development of MIDI encoder tool 'Auto-F'," Proc. of the Int. Conf. on Auditory Display (ICAD 2002), pp. 158-163, Kyoto, Japan, 2002.
[11] Adobe, "Flex Builder 2," http://www.adobe.com/jp/products/flex/
[12] Cook, P., "Introduction to physical modeling," in Audio Anecdotes: Tools, Tips, and Techniques for Digital Audio, A K Peters Ltd, pp. 179-198, 2003.
[13] Hikichi, T., Osaka, N., and Itakura, F., "Time-domain simulation of sound production of the sho," Journal of the Acoustical Society of America, vol. 113, no. 2, pp. 1092-1101, 2003.
[14] Hikichi, T., Osaka, N., and Itakura, F., "Sho-So-In," Journal of New Music Research, vol. 33, no. 4, pp. 355-365, 2004.
[15] Karplus, K. and Strong, A., "Digital synthesis of plucked-string and drum timbres," in The Music Machine, pp. 467-479, 1989.