Multimedia Environment for Sound Database System

Hai Qi, Taichi Muramatsu and Shuji Hashimoto
Department of Applied Physics, Waseda University
{qihai, taichi, shuji}@shalab.phys.waseda.ac.jp

Abstract

We propose a new type of sound database system with a multi-modal interface for sound data retrieval and modification. Sounds, images and gestures are used as keys for data retrieval. Sound data are characterized in the time-frequency space, and 896 feature parameters are stored with the original AIFF wave data in the database file. The preference of the individual user is acquired through repeated data searches to make the system adaptive. The user can modify the retrieved sound using the multimedia interface of the system to obtain a satisfactory sound.

1. Introduction

When a composer, a sound designer or a sound director works on an artistic creation, he often wants to find a suitable sound for the work. However, he sometimes cannot describe the sound in words even when the sound exists in his mind, which makes it difficult to use a conventional keyword-based database retrieval system. Furthermore, when the objective sound is not included in the database, the user needs to modify the obtained sound to reach the objective one using a sound modification function of the system.

Recently, some sound database systems have been reported. Sound editing and mixing functions are integrated with the database application in some of the reported works [Blum et al, 1995][Keislar et al, 1995]. Other researchers have proposed an analogous database system for sound synthesis [Vertegaal & Bonis, 1994]. Neural networks provide a different way to find a mapping between acoustical attributes and perceptual properties, which can be used for the automatic indexing of a sound database [Feiten & Gunzel, 1994]. The authors have proposed a content-based sound database retrieved by sound [Hashimoto et al, 1996][Muramatsu et al, 1996].

This paper proposes a new type of adaptive sound database system with a multi-modal interface for data retrieval and data modification. The user can input sounds, images and gestures to operate the system and obtain the objective sound after repeated data searches and sound modifications.

2. System Overview

This multi-modal database system uses images, sounds and gestures to search the database. The system overview is shown in Figure 1. We use a microphone, a CCD camera, a drawing tool and a DataGlove with a position sensor as input devices of the system.

Figure 1 System Overview

When the system stores the original sound wave data in the AIFF format, the sound feature parameters are extracted and stored in a parameter table file as the index of the stored sound. To use the system, the user gives an image, a sound or a gesture to the system. The feature parameters of the input data of any medium are associated with sound parameters. These parameters are used as the search key. As the search key has the same form as the indexes in the parameter table file, the system can compare them easily.

By calculating the normalized distances between the search key and the sound data indexes, the system displays from one to five of the most similar sounds. The user then selects one of them and uses it as a new key sound to search again, or employs the data processing function to modify the selected sound until it becomes a satisfactory sound.

3. Sound Indexing

The sounds are digitized into 16-bit data sequences at a sampling rate of 44.1 kHz and stored in the database in AIFF format. As the length of each sound data is 2^16 (= 65536) samples, the duration of a sound data is about 1.5 seconds. The sound characterization is performed by using a short-term spectra sequence to express the temporal-frequency features of the sound. The length of one short-term frame is 1024 samples (23 ms), and every frame overlaps by 512 samples (11.5 ms). Therefore we have 127 frames in one sound data. In each frame the following seven kinds of feature parameters are extracted, where i denotes the frame number.

X_i1: the fundamental frequency.
X_i2: the periodicity factor, which is the ratio of the power of the harmonic frequency components to the total power.
X_i3: the power of the lowest band, up to 1.5 kHz.
X_i4: the power of the second lowest band, up to 3 kHz.
X_i5: the power of the third band, up to 6 kHz.
X_i6: the power of the fourth band, up to 12 kHz.
X_i7: the power of the highest band, above 12 kHz.

For each of the above features, we calculate an average value \bar{X}_j, and then take the difference x_{ij} between the average and the value of each frame as follows,

\bar{X}_j = \frac{1}{127} \sum_{i=1}^{127} X_{ij}   (1)

x_{ij} = X_{ij} - \bar{X}_j   (2)

where j (= 1, ..., 7) denotes the feature number. Thus, we have 896 = 7 x 127 + 7 feature parameters for each sound, including the seven averages. Therefore, the sound data retrieval is done in the 896-dimensional feature space. The system can display the spectral profile as a pictorial description of the sound, which allows the user to "see" the sound.

4. Characterization of Input Key

The key for sound data retrieval is given in a variety of forms: sound, image, and gesture. To associate the input key with sounds in the database, we have to characterize these input keys and map them onto the sound feature space.

In the case that the key is given as a sound, the characterization can be done in the same manner as the sound data indexing.

In the case that the key is an image, we have two ways of characterization. One is to regard the image as a spectrum, where the horizontal axis, the vertical axis and the image intensity represent the frequency, the time and the power, respectively. The other is to find seven lines in the image and regard them as the temporal evolutions of the seven features. In this case, the user has to assign each line to a different feature, and each line must represent a single-valued function of time. At present we use seven colors of lines to distinguish the features. For extracting parameters, each line is divided into 127 equal parts in the X-direction, and the average Y-direction level of each part is regarded as the parameter value. After calculating the average level and the difference between the average and the level of each part, we get 896 parameters for the image. An example of a line image is shown in Figure 2.

Figure 2 Example of a line image

For a gesture input, we acquire the movement trajectory of the right hand position, the direction of the hand, and the bends of the five fingers, which are regarded as the fundamental frequency, the periodicity factor and the powers of the five frequency bands.
Then we divide the sequence of the hand motion into 127 equal short segments. By doing the same calculations as for the line image, we get 896 parameters from the gesture as a key for sound data retrieval. Some special gestures, used as commands for system operation, are also recognized from the same gesture data by a pattern matching method. The command gestures are registered in advance according to the user's preference.
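For illustration, the following minimal sketch (in Python with NumPy, which is not part of the original SGI implementation) builds the 896-parameter index of Section 3 according to equations (1) and (2). The pitch and periodicity estimators and the function names are assumptions made for the sketch; the paper does not specify the exact estimators used.

```python
import numpy as np

FRAME = 1024          # samples per short-term frame (about 23 ms at 44.1 kHz)
HOP = 512             # hop between frames (about 11.5 ms)
N_FRAMES = 127        # frames in one 65536-sample sound
BAND_EDGES_HZ = [0, 1500, 3000, 6000, 12000, 22050]   # the five power bands

def frame_features(frame, sr=44100):
    """Seven per-frame features X_i1..X_i7 (pitch and periodicity are stand-ins)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)

    # X_i1: fundamental frequency, crudely taken as the strongest non-DC peak.
    f0 = freqs[np.argmax(spectrum[1:]) + 1]

    # X_i2: periodicity factor = power near harmonics of f0 divided by total power.
    total = spectrum.sum() + 1e-12
    near_harmonic = np.zeros(len(spectrum), dtype=bool)
    for k in range(1, int(freqs[-1] // max(f0, 1.0)) + 1):
        near_harmonic |= np.abs(freqs - k * f0) < sr / len(frame)
    periodicity = spectrum[near_harmonic].sum() / total

    # X_i3..X_i7: power in the five frequency bands.
    bands = [spectrum[(freqs >= lo) & (freqs < hi)].sum()
             for lo, hi in zip(BAND_EDGES_HZ[:-1], BAND_EDGES_HZ[1:])]
    return [f0, periodicity] + bands

def sound_index(samples):
    """Build the 896-dimensional index: 7 averages plus 7 x 127 deviations (eqs. 1-2)."""
    X = np.array([frame_features(samples[i * HOP : i * HOP + FRAME])
                  for i in range(N_FRAMES)])     # shape (127, 7)
    X_bar = X.mean(axis=0)                       # eq. (1): average of each feature
    x_dev = X - X_bar                            # eq. (2): per-frame deviations
    return X_bar, x_dev                          # 7 + 889 = 896 parameters
```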

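A line-image key can be mapped onto the same parameter form in a similarly simple way. The sketch below follows the procedure of Section 4 for a line image: each of the seven colored lines is split into 127 equal parts along the X axis, and the average Y level of each part becomes a parameter value. The function name and input format are hypothetical; it assumes each line has already been traced across the full width of the image window.

```python
import numpy as np

def line_image_key(lines_y, n_parts=127):
    """Characterize a seven-line image key (hypothetical sketch).

    lines_y: seven arrays, one per colored line, each giving the Y level of
    that line sampled along the X axis of the image window.  Each line stands
    for one feature's temporal evolution, so the result has the same
    (7 averages, 7 x 127 deviations) form as a sound index."""
    X = np.empty((n_parts, len(lines_y)))
    for j, y in enumerate(lines_y):
        # Divide the line into 127 equal parts in the X direction and take the
        # average Y level of each part as the parameter value.
        parts = np.array_split(np.asarray(y, dtype=float), n_parts)
        X[:, j] = [p.mean() for p in parts]
    X_bar = X.mean(axis=0)      # seven averages
    x_dev = X - X_bar           # 7 x 127 deviations: 896 parameters in total
    return X_bar, x_dev
```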
5. Sound Data Retrieval

The system compares the key parameters with the sound index parameters by calculating the distance H_t as follows,

H_t = \sum_{j=1}^{7} \left[ a_j (\bar{X}_j - \bar{X}_j^t)^2 + b_j \sum_{i=1}^{127} (x_{ij} - x_{ij}^t)^2 \right]   (3)

where \bar{X}_j and x_{ij} are the key parameters, \bar{X}_j^t and x_{ij}^t are the index parameters of the t-th stored sound, and a_j and b_j are weights for the features; large a_j and b_j values mean that feature j is significant in the data search. These weights are renewed after the user's selection according to the following equations, to adapt the definition of the distance to the user's preference.

a_j^{s+1} = a_j^s + \frac{1}{1 + |\bar{X}_j - \bar{X}_j^S|}   (4)

b_j^{s+1} = b_j^s + \frac{1}{1 + \frac{1}{127} \sum_{i=1}^{127} |x_{ij} - x_{ij}^S|}   (5)

a_j \leftarrow \frac{a_j}{a_j + b_j}   (6)

b_j \leftarrow \frac{b_j}{a_j + b_j}   (7)

where the superscript s denotes the retrieval step and \bar{X}_j^S and x_{ij}^S are the parameters of the sound selected by the user. As a result, the distance between the input key and the selected sound becomes shorter in the next retrieval, as shown in Figure 3.

Figure 3 Distance in the 896-dimensional feature space
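As a rough illustration of the retrieval loop, the sketch below evaluates the distance of equation (3) and renews the weights as in equations (4)-(7) for indexes in the form produced above. The function names, and the exact forms reconstructed here for equations (3)-(7), are assumptions rather than the authors' implementation.

```python
import numpy as np

def distance(key, sound, a, b):
    """Weighted distance H_t between the key and one stored sound (cf. eq. 3)."""
    (Xb_k, xd_k), (Xb_t, xd_t) = key, sound
    return float(np.sum(a * (Xb_k - Xb_t) ** 2)
                 + np.sum(b * np.sum((xd_k - xd_t) ** 2, axis=0)))

def renew_weights(key, selected, a, b):
    """Renew the feature weights after the user's selection (cf. eqs. 4-7).

    Features on which the selected sound was already close to the key get a
    larger weight; the weights are then normalized so that a_j + b_j = 1."""
    (Xb_k, xd_k), (Xb_s, xd_s) = key, selected
    a = a + 1.0 / (1.0 + np.abs(Xb_k - Xb_s))                    # eq. (4)
    b = b + 1.0 / (1.0 + np.mean(np.abs(xd_k - xd_s), axis=0))   # eq. (5)
    total = a + b
    return a / total, b / total                                  # eqs. (6), (7)

def retrieve(key, index_table, a, b, n_best=5):
    """Indexes of the one to five stored sounds most similar to the key."""
    order = sorted(range(len(index_table)),
                   key=lambda t: distance(key, index_table[t], a, b))
    return order[:n_best]
```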
6. Sound Modification and Man-Machine Interface

Because the expected data may not exist in the database but only in the user's mind, even the most similar data in the database may not fit the user's idea. The proposed system therefore provides several functions for sound modification, such as a pitch shifter to modify the fundamental frequency, sound mixing to create a new sound, and a variety of filters. Using these functions, the user can make a suitable sound even when that sound does not exist in the database.

Apart from these, the system can play from one to five sounds simultaneously by distributing the power of each sound to the stereo output at different ratios, so that each sound is heard at a different location. Therefore, the user can easily compare the retrieved sounds, discriminate between them and choose the most satisfactory one.

So that the user can operate the system easily, the system provides a user-friendly interface. With this interface, the user does not need a keyboard or a mouse; he only needs to make special gestures such as "OK", "1", "2", "3", "4", "5", "up", "down", "right" and "left" to select functions from the system menus. Furthermore, the position sensor on the DataGlove can be used as a positioning device to indicate the user's selection.

7. Experimental Results

The proposed system was implemented on an SGI workstation, and it takes about 15 seconds to extract the parameters from a sound. Figure 4 shows a typical example of sound data retrieval with a sound key. Figure 5 shows the sound spectra retrieved by using an image key. Figure 6 shows the results obtained with a line image key. Although the results obtained with a gesture key are not shown here, a gesture defines a key representation similar to that of a line image. The number of sound data stored in the database is around 400, which is not sufficient at present. We, however, obtained promising experimental results, because the sound modification function can create satisfactory sounds for users and thus compensate for the low hit rate of the system.

8. Conclusions

A new type of sound database system with a multimodal interface and a data modification ability was proposed. A duration of 1.5 seconds may not be sufficient for general uses of a sound database. We are now improving the system to extend the data length by introducing a data compression technique and a more efficient sound characterization algorithm. The proposed system is designed not only for professionals in art but also for people working in a variety of sound creation and entertainment fields. The authors consider that multimedia database systems will be used in various situations. Furthermore, the proposed system can be used as a kind of musical instrument which allows artists to draw music out of their minds.

Figure 4 Experimental result by a sound key

Figure 5 Experimental result by an image key

Figure 6 Experimental result by a line image key

References

[Blum et al, 1995] T. Blum, D. Keislar, J. Wheaton, and E. Wold, "Audio Database with Content-Based Retrieval", Proc. IJCAI 1995 Workshop on Intelligent Multimedia Information Retrieval, 1995.
[Keislar et al, 1995] D. Keislar, T. Blum, J. Wheaton, and E. Wold, "Audio Analysis for Content-Based Retrieval", Proc. ICMC 1995, pp. 199-202, 1995.
[Vertegaal & Bonis, 1994] R. Vertegaal and E. Bonis, "ISEE: An Intuitive Sound Editing Environment", Computer Music Journal, Vol. 18, No. 2, pp. 21-29, 1994.
[Feiten & Gunzel, 1994] B. Feiten and S. Gunzel, "Automatic Indexing of a Sound Database Using Self-organizing Neural Nets", Computer Music Journal, Vol. 18, No. 3, pp. 53-65, 1994.
[Hashimoto et al, 1996] S. Hashimoto, H. Qi, and D. Chang, "Sound Database Retrieved by Sound", Proc. ICMC 1996.
[Muramatsu et al, 1996] T. Muramatsu, H. Qi, and S. Hashimoto, "Sound Database System for Multimodal Data Retrieval - Data Retrieval by Sound -", Proc. International Conference on Computer Music & Music Science, 1996.