Music Creation from Moving Image and Environmental Sound

Shogo Takahashi*, Kenji Suzuki*, Hideyuki Sawada**, Shuji Hashimoto*
* Dept. of Applied Physics, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
** Dept. of Intelligent Mechanical Systems Engineering, Faculty of Engineering, Kagawa University
{shogo, kenji, shuji},

Abstract

A new system that creates music from audio-visual environmental data is proposed. The system consists of an image analyzer, a sound analyzer, and a music generator. These audio-visual components work together to generate music and sound. The whole system runs in the Opcode MAX environment on a Macintosh G3 without any other special equipment. The proposed system allows users not only to create sound and music with their motion and voice but also to receive audio-visual feedback in real time. In the on-site demonstration we will show performances of the system and, as an example, a sound installation that reacts to its environment.

1. Introduction

The purpose of this work is to realize an interactive environment for musical performance in the on-site demonstration. In this paper, we describe an advanced interactive system that can create music in real time by automatically extracting structural features from moving images and environmental sound and associating them with musical features. Many researchers have reported automatic music creation [1][2], including creation from moving images [3]. We have also developed several systems for musical performance [4]. Generating background music from video or moving images is the basis of this study [5]. In particular, we developed a sound analyzer capable of capturing auditory information, which allows the system to create a wide variety of music. It should be noted that all components of the sound analyzer and synthesizer are constructed with MAX objects.
Users can therefore easily modify the operations and their relations by changing the connections among these objects. The system consists of two objects and a patcher: the Image Analyzer, the Sound Analyzer, and the Music Generator. First, moving images are captured from a CCD camera at a rate of 10 frames per second. The image analyzer calculates color information and the spatial and temporal features of the image. At the same time, environmental sound is captured from the microphone attached to the system. The sound analyzer then calculates auditory information from the sound. Finally, the music generator tunes the chord progression, rhythm, and tempo by associating the image and sound features with musical features, and creates the music.

2. System Overview

The whole system runs in the Opcode MAX environment on a Macintosh G3 without any other special equipment. Figure 1 shows an overview of the developed system. In this section, we describe the image analyzer, the sound analyzer, and the music generator. The interactive music is created through these three MAX objects in real time.

2-1. Image Analyzer Object

The object called Image Analyzer extracts temporal and spatial features from moving images [6]. The input of the object is the moving image from the video camera, while the output consists of the following image features:

Color information
(i) RGB (Red, Green, Blue) components
(ii) Hue, saturation, and lightness components

Spatial and temporal features
(i) Edge density
(ii) Scene-change value (binary data)

In the present work, the size of the captured frames is 320x240 pixels. Each frame is divided into MxN areas. The features of the moving image are calculated both for the whole frame and for each area, to obtain global and local features. Eq. (1.1) gives the values of the RGB components obtained from the image data at each frame.

- 240 - ICMC Proceedings 1999
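As an illustration of the per-area color computation, the following Python sketch divides a frame into an M x N grid and computes the mean R, G, B values per area and over the whole frame. This is only a sketch: the grid size, array layout, and function name are our assumptions (the paper leaves M and N open), and it uses means where Eq. (1.1) accumulates summations.

```python
import numpy as np

M, N = 4, 3  # assumed grid: the paper does not fix the M x N division

def block_color_features(frame):
    """Per-area and whole-frame mean R, G, B values.

    frame: (H, W, 3) uint8 array, e.g. a 240x320 captured image.
    Returns (local, global) features; local is indexed [row, col, channel].
    """
    h, w, _ = frame.shape
    bh, bw = h // N, w // M              # area height and width in pixels
    local = np.empty((N, M, 3))
    for j in range(N):                   # rows of areas
        for i in range(M):               # columns of areas
            area = frame[j * bh:(j + 1) * bh, i * bw:(i + 1) * bw]
            local[j, i] = area.reshape(-1, 3).mean(axis=0)
    global_feat = frame.reshape(-1, 3).mean(axis=0)
    return local, global_feat
```

The hue, saturation, and lightness features would be computed the same way after an RGB-to-HSL conversion of the frame.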

Figure 1. System Overview

$$L_{k,\mathrm{all}}(t) = \sum_{i=1}^{M} \sum_{j=1}^{N} L_k(i, j, t) \qquad (k = R, G, B, H, S, L) \qquad (1.1)$$

where $L$ denotes the summation of the brightness of each component over area $(i, j)$, and $I_k$ represents the brightness of each pixel. The calculations for hue, saturation, and lightness are performed in the same manner. Here $m$ and $n$ represent the width and height of each divided area, respectively. Then, using the above color information, average values of the edge density are also extracted as the spatial features (Eq. (1.2)):

$$E_k(i, j, t) = \sum_{x_i=1}^{m-1} \sum_{y_j=1}^{n-1} \bigl| -I_k(x_i-1, y_j-1, t) - I_k(x_i+1, y_j-1, t) - I_k(x_i-1, y_j+1, t) - I_k(x_i+1, y_j+1, t) + 4 I_k(x_i, y_j, t) \bigr|$$

$$E_{k,\mathrm{all}}(t) = \sum_{i=1}^{M} \sum_{j=1}^{N} E_k(i, j, t) \qquad (k = R, G, B) \qquad (1.2)$$

$$\text{QuickMask matrix:}\quad \begin{pmatrix} -1 & 0 & -1 \\ 0 & 4 & 0 \\ -1 & 0 & -1 \end{pmatrix}$$

where $E$ denotes the summation of the edge density. On the other hand, in order to obtain temporal features, the image analyzer stores a basic frame as the background image at the beginning of the detection of the moving image. By comparison with the background image, a scene change can be detected by calculating the temporal difference at every newest frame (Eq. (1.3)):

$$D_k(i, j) = \bigl| I_{k,b}(x, y) - I_{k,p}(x, y) \bigr|, \qquad D_{\mathrm{all}} = \sum_{i=1}^{M} \sum_{j=1}^{N} D_k(i, j) \qquad (k = R, G, B) \qquad (1.3)$$

$$D_{\mathrm{all}} > \theta: \text{scene change}, \qquad D_{\mathrm{all}} \le \theta: \text{no scene change}$$

where $D$ denotes the difference of brightness, and $I_{k,b}$ and $I_{k,p}$ represent the brightness of the background and present images, respectively. Figure 2 shows the extraction of a scene change: the x-axis represents time $t$, the y-axis represents $D_{\mathrm{all}}$, and a scene change is detected when $D_{\mathrm{all}}$ exceeds the threshold.

Figure 2. Extraction of scene changing

2-2. Sound Analyzer Object

The object called Sound Analyzer extracts sound features and auditory information from the environmental sound. The input of the object is the sound wave from a microphone connected to the standard Macintosh MIC-in. It outputs the following sound features:
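A minimal Python sketch of Eqs. (1.2) and (1.3) follows: edge density via the QuickMask-style kernel, and scene-change detection against the stored background frame. The threshold value and the function names are illustrative assumptions, not the paper's.

```python
import numpy as np

# The QuickMask kernel from Eq. (1.2).
QUICKMASK = np.array([[-1, 0, -1],
                      [ 0, 4,  0],
                      [-1, 0, -1]])

def edge_density(gray):
    """Sum of |QuickMask response| over one channel (Eq. (1.2))."""
    g = gray.astype(float)
    resp = (4 * g[1:-1, 1:-1]           #  4 * centre pixel
            - g[:-2, :-2] - g[:-2, 2:]  # minus the two upper corners
            - g[2:, :-2] - g[2:, 2:])   # minus the two lower corners
    return np.abs(resp).sum()

def scene_changed(background, frame, theta=1e6):
    """Eq. (1.3): total brightness difference exceeds the threshold."""
    d_all = np.abs(frame.astype(float) - background.astype(float)).sum()
    return d_all > theta
```

The same computation would be repeated per channel (R, G, B) and per area, and the threshold tuned to the camera and scene.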

Figure 3. System Architecture of the Music Creation System

Sound features
(i) Velocity
(ii) Fundamental frequency

Auditory information
(iii) Environmental state

The velocity of the sound is calculated as in Eq. (2.1), where $N$ denotes the number of samples per frame and $A$ represents the maximum velocity.

$$\mathrm{Velocity} = 10 \log_{10}\!\left( \frac{A}{N} \sum_{n=1}^{N} x^2(t + n\Delta t) \right) \qquad (2.1)$$

The fundamental frequency is calculated with the cepstrum method, using the harmonic structure of the sound. In order to avoid sampling errors, the frequency of the $k$-th peak is divided by $k$, as described in Eq. (2.2).

$$f_0 = \frac{f_k}{k} \qquad (2.2)$$

In addition, the system can recognize the state of the environment from the auditory information. We determined experimental thresholds on the sound velocity and the spectrum density in order to distinguish among auditory modes such as noisy, silent, and singing.

2-3. Music Generator Patcher

Finally, the patcher called Music Generator creates the melody, chords, and rhythm. This object associates the image and sound features extracted by the two analyzers above with the global musical features (such as the chord set, rhythm, tempo, timbre, and volume). This patcher consists of the following internal patchers:

- Backing chord progression generator
- Drum pattern generator
- Melody generator

The correspondence of the image/sound features with the musical features can be pre-determined or interactively defined by the user. The Music Generator accepts data not only from the two analyzers above, but also from external signals such as tuning parameters on the MAX patch. An internal data set of chord patterns, such as backing scales and chord progressions, is built into the music generator. Although the basic rules of composition are adapted in advance based on musical theory, the user can add new rules for appropriate algorithmic composition. The music generator then compiles all the determined musical features into a MIDI sequence. The modified features include timbre, volume, tempo, rhythm, chord progression, and melody line.

In the present work, the melody notes are created according to changes of the color information, while the velocity of each note is determined by the edge density. The backing chord progression and the drum pattern are renewed every 8 bars of generated music, or at the instant of a scene change. The backing chord is selected from
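The two sound features of Eqs. (2.1) and (2.2) can be sketched in Python as follows. This is a sketch under assumptions: the normalization constant, frame length, and function names are ours, and we implement a plain FFT-based cepstrum pitch estimate rather than the paper's exact MAX implementation.

```python
import numpy as np

def velocity_db(x, A=1.0):
    """Eq. (2.1): frame energy in dB, scaled by the assumed constant A."""
    N = len(x)
    return 10 * np.log10((A / N) * np.sum(x ** 2))

def fundamental_freq(x, fs):
    """Cepstrum-style pitch estimate: peak quefrency -> frequency."""
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    log_spec = np.log(spectrum + 1e-12)
    cepstrum = np.abs(np.fft.irfft(log_spec))
    # Search a plausible quefrency band (about 50-500 Hz of pitch).
    lo, hi = int(fs / 500), int(fs / 50)
    q = lo + np.argmax(cepstrum[lo:hi])
    return fs / q
```

The harmonic structure of a voiced sound produces a periodic ripple in the log spectrum, so the cepstrum peaks at the quefrency corresponding to the fundamental period, which is the property Eq. (2.2) exploits.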

five prepared typical patterns. In addition, the base note of the backing chord can be changed over 7 steps of key notes, defined by the pitch obtained from the auditory information. As for the drum pattern, 72 sets with 6 different tempi are available. According to changes in the temporal features of the image, the rhythm of the created music changes over time. The instruments are also changed at the instant of a scene change.

3. Interactive Music Effects

The system starts to generate music when an object appears in front of the camera or when the microphone captures environmental sound. In this section, we describe several kinds of interactive effects between the user and the system.

Scene Changing Effect: An intense change in the temporal features causes the renewal of all musical features by updating the composition.

Volume Effect: According to changes in the velocity information obtained by the sound analyzer, the total volume of the generated music/sound is modified. So that the system can follow a human's vocal performance, the volume change is applied with a short time delay.

Stereo Sound Effect: When the object moves in the right/center/left area of the captured image, the created music is output with a stereo effect on the audio speakers according to the location of the moving object. In addition, each of the three areas has a different pattern of music creation. The system therefore allows the user to express his intention as if he had three kinds of instruments.

Harmony Effect: When a user starts to sing in front of the microphone, a harmony effect occurs. After a short delay, a single note is produced so that the system harmonizes with the captured voice.

Feedback Effect: The system can be driven not only by the user's intention but also by the music and sound it generates itself. Therefore, once the system starts to interact with a human, an interactive feedback effect arises.
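The stereo sound effect can be sketched as a simple mapping from the object's horizontal position to a MIDI pan value and to one of the three creation regions. The three-region split and frame width come from the text; the function names and the linear pan law are our own illustrative assumptions.

```python
def pan_from_position(x, frame_width=320):
    """Map the object's x-position to MIDI pan (0=left, 64=centre, 127=right)."""
    return int(round(127 * x / (frame_width - 1)))

def region(x, frame_width=320):
    """Classify the object into the left/center/right creation regions."""
    third = frame_width / 3
    if x < third:
        return "left"
    elif x < 2 * third:
        return "center"
    return "right"
```

In the actual system the region additionally selects which of the three music-creation patterns is active, so the user effectively plays three instruments by moving across the frame.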
In other words, the user and the system can perform a sort of cross performance, as if the initiative were continuously passed between them (Figure 4).

Figure 4. Interactive feedback effects

4. Conclusion and Future Work

We have introduced a new system that creates music in real time from environmental visual and auditory information obtained by the attached video camera and microphone. The experimental results demonstrated the variety of the generated music and the interactivity of the system. Moreover, since all objects for interactive music generation are constructed in Opcode MAX, users can easily modify the relationship between music generation and the input components. With regard to multimodal applications of the proposed system, the authors have developed a system with an autonomous mobile robot [7]. Further work will provide a richer set of musical rules that relate to the audio-visual input more effectively. In order to reflect users' preferences in music generation, modeling the relationship between the user's input and the system's output with a neural network is another interesting topic.

References

[1] L. Hiller and L. Isaacson: "Experimental Music", McGraw-Hill Book Company, Inc., (1959)
[2] M. V. Mathews and F. R. Moore: "GROOVE - a program to compose, store, and edit functions of time", Communications of the ACM, Vol. 13, No. 12, (1970)
[3] J. Nakamura, T. Kaku, K. Hyun, T. Noma, and S. Yoshida: "Automatic Background Music Generation based on Actors' Mood and Motions", The Journal of Visualization and Computer Animation, Vol. 5, pp. 247-264, (1994)
[4] Hideyuki Sawada, Naoyuki Onoe, and Shuji Hashimoto: "Sounds in Hands - A Sound Modifier Using Datagloves and Twiddle Interface -", in Proc. of International Computer Music Conference '97, pp. 309-312, (1997)
[5] Naoyuki Onoe, Dingding Chang, and Shuji Hashimoto: "Background Music Generation Based on Scene Analysis", in Proc. of International Computer Music Conference '96, pp. 361-362, (1996)
[6] Y.
Gong, C. Hook-Chuan, and G. Xiaoyi: "Image Indexing and Retrieval Based on Color Histograms", Multimedia Modeling, World Scientific, pp. 115-126, (1995)
[7] Kenji Suzuki, Takeshi Ohashi, and Shuji Hashimoto: "Interactive Multimodal Mobile Robot for Musical Performance", in Proc. of International Computer Music Conference '99, (1999)