Sound Base: Phonetic Searching in Sound Archives

Kees de Koning
Centrum voor Kennistechnologie
Lange Viestraat 2b
3511 BK Utrecht, NL
phone: +31 30 332366
fax: +31 30 328195
kees@hku.uucp
ckoning@praxis.cs.ruu.nl

Shaun Oates
SITRG, Dept. of Computing
University of Bradford
Bradford, BD7 1DP, UK
phone: +44 274 733466 3949
fax: +44 274 383920
shaun@uk.ac.bradford.computing

Abstract

Sound Base [1] is a project allowing users to manipulate an archive of timbres using a combination of different sound representations. In this paper we examine one aspect of Sound Base: the retrieval of timbres using phonetic entries. To do this we use established techniques of audio-to-phonetic transcription, text-to-phonetic transcription, metric theory and best-match search strategies.

1. Introduction - a broad statement of the problem

Computers have proved to be the most versatile of musical instruments, allowing the musician access to a range of timbres which exceeds that of any instrument previously built. With this new freedom come attendant problems: how does a musician control the expressive power of computer musical instruments?

Previous work on timbral definition and manipulation systems [2] has identified three domains of sound representation: perceptual, acoustic and synthetic. A timbral definition in the perceptual domain refers to the perceptual qualities (or percepts) of the sound. Producing practical sound manipulation systems based on percept control is problematic because timbre itself is a multidimensional attribute [3], and there is no generally accepted definition of what these dimensions represent.

The acoustic domain refers to the body of techniques for measuring physical properties of sound (principally Fourier transform techniques). Systems which allow the manipulation of acoustic attributes typically force the user to work with a large body of information.
The synthetic domain refers to the process of defining sound through the manipulation of the parameters of a synthesis algorithm. Many systems of sound synthesis have been developed, but manipulation of sounds in the synthetic domain is still more akin to programming than to typical musical activities.

1.1. The Use of a Sound Database

One method of supplying the musician with a large range of sophisticated timbres is to provide an environment in which the musician can both design sounds and easily archive them for later reuse. A possible architecture for such a system is described in [4]. In order to be of use to the musician the archive must contain a large number of timbres, and the user will therefore be confronted with the problem of locating a desired sound within the archive. This problem is further highlighted if the musician is not the sole provider of information to the database, as would be the case for group projects or the use of prepackaged commercial material. In such cases, mechanisms which allow users to index entries with a personalised description of timbres are not applicable.

We stated in the previous section that there are three main methods of describing sounds; we have also shown that none of these methods offers a general-purpose intuitive language for specifying musical timbres. However, there is a mechanism which is widely used by human beings to describe sounds in proximate terms: onomatopoeia, the formation of names by imitating the subject sound (eg crash, bang, cuckoo, etc.). In this paper we investigate the possibility of navigating a database of timbres using phonetic descriptions and appropriate search mechanisms.

2. Phonetic Searching of a Sound Database

The advantages of using phonetics as a means of sound specification are that phonetics are intuitive and widely used, and that there is a large body of work available on the mechanics of phonetic manipulation.

ICMC 433
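The idea can be made concrete with a small sketch: each timbre in the archive is indexed by an onomatopoeic key stored as an ordered sequence of phoneme symbols. The ARPAbet-like symbols and the archive entries below are invented for illustration; they are not taken from the Sound Base system.

```python
# Sketch: timbres indexed by onomatopoeic phonetic keys.
# Phoneme symbols (ARPAbet-like) and entries are illustrative only.

archive = {
    "cymbal_hit": ["K", "R", "AE", "SH"],  # "crash"
    "bass_drum":  ["B", "AE", "NG"],       # "bang"
    "bird_call":  ["K", "UH", "K", "UW"],  # "cuckoo"
}

def exact_lookup(query):
    """Return the names of timbres whose phonetic key matches exactly."""
    return [name for name, key in archive.items() if key == query]

print(exact_lookup(["B", "AE", "NG"]))  # -> ['bass_drum']
```

Exact matching of this kind is far too brittle for phonetic keys, since equally valid transcriptions of the same sound differ; the sections below develop the distance-based comparison that replaces it.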

2.1. Phonetic Representation

A phoneme is not a single sound; rather, it is an abstraction which covers a set of sounds, called allophones. The criteria for grouping a set of allophones under a phoneme are that they are felt to be the same by the speaker, that they cannot be used for distinguishing between words, and that they differ in ways which are predictable from the context [5]. Another way of viewing phonemes is as the set of units required for transcribing utterances in a particular language unambiguously. The set of phonemes needed varies between languages and between dialects within languages. Phonemes provide a method of transcribing a number of different timbres because human beings have considerable flexibility in the production of sound structures.

2.2. Segmenting Audio Signals into Phonetic Keys

The earliest machines for speech recognition were involved in the automatic transcription of speech into phoneme-like symbols (the so-called phonetic typewriters). The earliest approach to this involved template matching: the frequency spectrum of the incoming signal was matched against a set of standard spectrum patterns, one for each phoneme. Another approach, inspired by the principle of distinctive features, was also proposed; it consists of making several binary classifications, each based on a principle most appropriate for that classification.

In the phonetic typewriter [6], speech input is first transformed into successive samples of the frequency domain; these are then processed on the simple criterion that the energy in a particular region of the frequency spectrum be above a certain threshold value. The result is a matrix which indicates the presence of energy at a particular frequency and time in the input sample. This amplitude-frequency-time matrix is then compared with the contents of several phoneme memories, one for each phoneme in the encoding system.
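The comparison step can be sketched as follows. The 0/1 energy matrices, the summed-disagreement distance and the nearest-template decision rule are simplifying assumptions made here for illustration, not a description of the circuitry of [6], whose phoneme memories are analogue devices.

```python
# Sketch of template matching over amplitude-frequency-time matrices.
# Each matrix is a list of rows (frequency bands) of 0/1 energy flags
# over successive time frames.

def matrix_distance(a, b):
    """Summed cell-by-cell disagreement between two equally sized matrices."""
    return sum(abs(x - y) for row_a, row_b in zip(a, b)
                          for x, y in zip(row_a, row_b))

def classify(segment, phoneme_memories):
    """Return the phoneme whose stored pattern is closest to the segment."""
    return min(phoneme_memories,
               key=lambda p: matrix_distance(segment, phoneme_memories[p]))

# Hypothetical 3-band x 4-frame patterns:
memories = {
    "s": [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]],  # sustained high-band noise
    "a": [[0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]],  # low/mid formant energy
}
segment = [[1, 1, 1, 0], [0, 0, 0, 0], [0, 1, 0, 0]]
print(classify(segment, memories))  # -> s
```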
This system was quite successful in matching vowel sounds and some consonants, particularly if the spectral energy distribution is distinctive, but it cannot cope with some sounds that have similar spectral energy distributions but different durations and onsets (eg s and t).

An alternative system is the binary selection system [7], which operates by passing the sound through a series of tests designed to discriminate between phoneme types. The first test usually determines whether the phoneme is voiced or unvoiced. This is done by detecting the presence of a fundamental in the signal. The next stage of testing discriminates between voiced-turbulent / voiced-non-turbulent and unvoiced-fricative / unvoiced-stop phonemes. These stages continue until it is possible to discriminate between all phoneme types. This system is more effective than the phonetic typewriter because it can extract a significant number of extra features.

The important feature of phoneme transcription is that it is merely an interpretation of the data, and as such several differing interpretations of the data may have near equal validity. This will influence the strategy for comparing the interpretations of both the input query and the database of sounds.

[Figure 1. System Architecture: a user interface feeds text-to-phoneme and audio-to-phoneme converters; a comparison/search module matches the resulting query against the archive descriptions, returning positions in the database and distance feedback.]

2.3. Architecture of the Database

There are three elements to the problem of searching a sound database by phonetic query: constructing a user interface for the creation of queries, deriving phonetic interpretations, and retrieving the

closest match from the available timbres. A general architecture for processing phonetic queries is shown in Figure 1. Within this architecture there are a number of options available to the system designer which will affect the complexity of the various modules.

2.4. Constructing a User Interface for the Creation of Queries

The design of the user interface will have repercussions on the implementation of other system modules, because the complexity of the search mechanism is highly dependent on the number of degrees of freedom available in the user interface. In this paper we differentiate between closed and open interfaces; examples of each are outlined in later sections. Inevitably there is a trade-off between computational simplicity and expressivity. We have identified three types of user interface: iconic, a closed interface; and textual and speech, both of which are open interfaces.

2.4.1. Iconic Phoneme Construction Interface

In the iconic interface the user is provided with a subset of the phonemes occurring in the database. A query is constructed by selecting the composition and order of a phonetic sequence. The rationale of this interface is that there is a relatively small vocabulary of phonemes from which the user can choose to construct a representation of the sound she is looking for. This interface can be constructed as a screen with, for example, forty buttons which form a "virtual" phonetic keyboard that can be used to enter the desired timbre. One of the problems of this approach is the definition of a vocabulary of available phonemes which is powerful enough to describe all sounds, but also small enough for the user to survive in it.

2.4.2. Speech to Phoneme Conversion

In the speech interface the user actually speaks the sound she requires. This is the most intuitive of the approaches, as it exploits the natural ability to mimic sounds.
Another advantage of this system is that it is possible to construct this interface using the audio-to-phoneme converter described in section 2.2, which will also be used for generating the phonetic index for the database sounds.

2.4.3. Text to Phoneme Conversion

The textual user interface allows the musician to assemble a sequence of letters which, she feels, most closely approximates the sound she wants. This approach allows the user complete freedom in specifying timbres, without the limitations of either her experience in constructing phoneme sequences or her vocal tract. However, the process of converting a text description to a phonetic representation is non-trivial [8]. To understand this one must consider the series of translations involved. First the user forms a requirement in her mind; she must then interpret this mental picture of the timbre into a sequence of text which represents the concept. The system must then interpret this text into an appropriate phonetic sequence. This introduces an extra stage into the process, when the user formulates her textual description of the sound. This stage is highly culturally dependent. For example, consider the sound of a dog barking: English speakers will transcribe the sound as "woof", whereas Dutch speakers will write "woef" and Danish speakers "wøf"; also consider the sound of a cuckoo, which a Dutch speaker will describe as "koekoek". As with converting audio to phonemes, there will be a number of equally viable interpretations.

2.5. Matching a Query to Available Keys

Phonetic searching can be viewed as an extension of the class 2 search given by Burkhard [9]. This search returns a set of keys which are, in some sense, "closest" to the query key. The "distance" between two keys x and y is given by a predefined metric d(x,y). For example, in searching a film database, a query for actors aged 30 may return a set of actors aged between 28 and 32.
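For phonetic keys, d(x,y) could be instantiated, for example, as an edit distance whose substitution costs are drawn from a table of perceptual phoneme distances. The sketch below makes this concrete; the cost table, the unit insertion/deletion costs and the example keys are invented for illustration, not taken from the Sound Base system.

```python
# Sketch: class 2 (best-match) search over phonetic keys, using a
# weighted edit distance as the metric d. Substitution costs between
# distinct phonemes would come from a perceptual table; the values
# here are hypothetical.

PHONE_COST = {("p", "b"): 0.3, ("s", "sh"): 0.4}  # symmetric pairs

def sub_cost(p, q):
    """Cost of substituting phoneme p for q (0 if identical, 1 if unrelated)."""
    if p == q:
        return 0.0
    return PHONE_COST.get((p, q), PHONE_COST.get((q, p), 1.0))

def d(x, y):
    """Weighted edit distance between two phoneme sequences."""
    m, n = len(x), len(y)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = float(i)
    for j in range(1, n + 1):
        dist[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(dist[i - 1][j] + 1.0,  # deletion
                             dist[i][j - 1] + 1.0,  # insertion
                             dist[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]))
    return dist[m][n]

def best_matches(query, keys, k=5):
    """Class 2 search: return the k keys closest to the query under d."""
    return sorted(keys, key=lambda key: d(query, key))[:k]

keys = [["b", "a", "ng"], ["k", "r", "a", "sh"], ["p", "a", "ng"]]
print(best_matches(["b", "a", "ng"], keys, k=2))
# -> [['b', 'a', 'ng'], ['p', 'a', 'ng']]
```

The sort-based search scans every key; the clique-based structuring discussed in section 2.5.2 exists precisely to avoid such an exhaustive scan.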
The criterion for what constitutes a close match depends on either the size of the answer set (eg return only the 5 closest matches) or values within a specified tolerance (eg within 2 years of the query). This method is the most popular technique for vague queries, and is further developed in [10]. The system for searching the database will have two key components: the metric used to calculate the distance between phonetic keys, and the strategy used to navigate the search space.

2.5.1. Distance Comparison

The distance comparison may apply at three different levels: individual phonemes, phoneme sequences, and sets of phoneme sequences. Because of the vagaries of phonetic transcription, similar timbres may have acquired different phonetic representations; therefore the comparison of elements should be semantic rather than symbolic. Comparison can be achieved by a metric which measures the perceptual distance between

phonemes, using a tabular metric as described in [11]. With the ability to calculate the distance between individual phonemes, we can then calculate the distance between phoneme sequences. This is a complex operation, as the algorithm will need to take account of the similarity of the phonemes present, the absence of phonemes, and the importance of individual phonemes to the perception of timbres. Finally, we have stated in previous sections that the two methods of deriving phoneme keys, audio-to-phonetic and text-to-phonetic transcription, can each generate a number of equally valid phonetic sequences. This means that a number of phonetic sequences must be compared for each item.

2.5.2. Search Mechanisms

There has been extensive research in the area of structuring databases for optimal search strategies. A possible approach is first described by Burkhard [9], and involves structuring the database using graph theory to arrange related items into cliques.

3. Summary and Further Research

This paper considers aspects of a system which allows users access to a wide range of sophisticated timbres. We outline a mechanism for searching a large timbral database in an intuitive manner, using phonetic queries. This mechanism was reduced to the following elements: audio-to-phonetic transcription, text-to-phonetic transcription, user interface construction, phonetic similarity metrics and search algorithms. Of these problems, only user interface construction and phonetic similarity metrics have not been extensively investigated; it is therefore necessary that more work be done in these areas.

4. Acknowledgements

We gratefully acknowledge the cooperation of the following: Dr. Barry Eaglestone; Roel Vertegaal; METASOUND, Amsterdam; and the SIT research group.

5. References

[1] R.P.H. Vertegaal and C. de Koning, "Haalbaarheidsonderzoek Sound Base Project", internal report, Centrum voor Kennistechnologie, Utrecht, and METASOUND, Amsterdam, 1991.
[2] W. Buxton, S. Hull and A. Fournier, "On Contextually Adaptive Timbres", pp. 414-416 in Proceedings of the International Computer Music Conference, Computer Music Association, 1982.
[3] R. Plomp, "Timbre as a Multidimensional Attribute of Complex Tones", pp. 397-410 in Frequency Analysis and Periodicity Detection in Hearing, ed. R. Plomp and G.F. Smoorenburg, A.W. Sijthoff, Leiden, 1970.
[4] B.M. Eaglestone and A.P.C. Verschoor, "An Intelligent Music Repository", Proceedings of the International Computer Music Conference, Computer Music Association, Montreal, 1991.
[5] P. Ladefoged, "The Phonetic Basis for Computer Speech Processing", pp. 3-27 in Computer Speech Processing, ed. F. Fallside and W.A. Woods, Prentice-Hall International, UK, 1985.
[6] H.F. Olson and H. Belar, "Phonetic Typewriter", Journal of the Acoustical Society of America, Vol. 28(6), pp. 1072-1081, 1956.
[7] J. Wiren and H.L. Stubbs, "Electronic Binary Selection System for Classification of Phonemes", Journal of the Acoustical Society of America, Vol. 28(6), pp. 1082-1091, 1956.
[8] M. Stella, "Speech Synthesis", pp. 442-446 in Computer Speech Processing, ed. F. Fallside and W.A. Woods, Prentice-Hall International, UK, 1985.
[9] W.A. Burkhard, "Some Approaches to Best-Match File Searching", Communications of the ACM, Vol. 16(4), pp. 230-236, 1973.
[10] D. Shasha and T.-L. Wang, "New Techniques for Best-Match Retrieval", ACM Transactions on Information Systems, Vol. 8(2), pp. 140-158, 1990.
[11] A. Motro, "VAGUE: A User Interface to Relational Databases that Permits Vague Queries", ACM Transactions on Office Information Systems, Vol. 6(3), pp. 187-214, 1988.