Adaptive Timbre Control Using Gesture
Pitoyo Hartono, Kazumi Asano, Wataru Inoue, Shuji Hashimoto
Dept. of Applied Physics, School of Science and Engineering, Waseda University

Abstract

From the beginning of human history, music has developed as a form of emotional expression. The medium for this is the musical instrument, which transforms body movement into sound. The relationship between the action and the generated sound is determined by the physical structure of the instrument in the case of traditional music, or by software written by a composer in the case of modern music. In this study, using a neural network, we have tried to develop a "new form" of musical instrument that users without musical skills can easily use to translate their body movements into the desired sound.

1 Introduction

Due to its complexity, mastering a musical instrument demands much effort and time. In the case of conventional musical instruments, we have to know the physical structure and the rules governing the instrument's characteristics before we can use it as a tool of expression. In recently developed electronic musical systems [1][2][3][4][5][6], on the other hand, the flexibility of expression has grown enormously, and some of them seem promising for professional composers. However, the structures of these instruments are so complicated that it is almost impossible for users to master them in a short period without special knowledge. In both cases, skill and experience are critical. We believe that, as part of our ordinary sense, there exists some logical relation between our body expressions and sounds that enables us to manifest our musical feelings without using a musical instrument.
In recent years, our laboratory has been developing sound systems that control the musical sound space through hand gesticulation, using a dataglove and a computer baton system; however, the relationships between the gesture parameters and the sound parameters are fixed, so the user can only play in a limited way [7][8][9]. To make a new kind of musical instrument that can be played without such an initial limitation, in this study we have tried to develop an interface that can be controlled by gestures. It can not only be used easily by users without special musical knowledge, but it can also be tuned flexibly to realize the desired relationship between gesture and sound. The flexibility of the interface is critical because the sense of the relationship between gesture and sound differs from one person to another. For example, one user may like to use a finger snap as a gesture to start a sound, while another user may prefer a different gesture to express the same thing. With such an interface, we will be able to bring a new kind of "expression tool", usable by everybody, into reality.

ICMC Proceedings 1994, Interactive Performance

Basically, a musical tone is characterized by its fundamental frequency, amplitude, and timbre. The timbre in particular is related to many factors, such as the spectrum, overtone structure, decay time, sustain level, release time, envelope shape, and so on. In this study, we use white noise or a saw-tooth wave as the sound source and control the envelope parameters and the post-filter characteristics to modify the generated sound by gesture. As gesture parameters, we measure hand position, movement, and velocity, obtained in real time from a Dataglove and a three-dimensional magnetic sensor. The association between the gesture parameters and the sound parameters is made by a five-layered neural network. A neural network, which has recently been used for many kinds of pattern classification, is said to simulate the function of the human brain: the artificial network can be compared to a biological one, and the connection weights in the network to synaptic connections. Unlike the usual computer algorithm, instead of giving a specific procedure for solving a problem, with a neural network we only need to provide problem-answer sets and train the network to produce the desired answer when a given problem is input. In our study, the network is first taught to generate some typical timbres according to primitive gesticulations which may be common to most people. Next, the individual user can teach the network, while using the system, to make the gesture-sound association satisfactory for him/her. For this individual tuning, we have introduced a new neural network learning algorithm that repeats pattern generation and weight correction. The proposed system is an intelligent musical instrument that can adaptively change the relationship between gesture and sound. Moreover, the on-line tuning will allow a chance-operated performance.
The final goal of our study is to provide a system that stimulates human creativity through action-generated music, which will be used not only for musical art but also for musical therapy.

[Fig. 1. Block Diagram of the System: the dataglove and its control device supply the gesture to the neural network, which drives the sound source to produce the sound output.]
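The data flow of Fig. 1 can be sketched as a minimal pipeline. All function names, the toy mapping, and the parameter layout below are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of the Fig. 1 pipeline: sensor readings become a gesture
# vector, a network maps it to sound parameters, and those drive the sound
# source. Every name and value here is a hypothetical stand-in.
def read_gesture():
    # stand-in for the dataglove + magnetic sensor readout
    return [0.2, -0.1, 1.0, 0.0, 1.0, 1.0, 0.0]   # vx, vy, palm (2), fingers (3)

def network(gesture):
    # stand-in for the five-layer neural network of Section 3
    vx, vy = gesture[0], gesture[1]
    rise = release = min(1.0, abs(vx) + abs(vy))   # toy mapping only
    return [rise, release, 0.5, 0.5]   # rise, release, LPF gain, HPF gain

def send_to_sound_source(params):
    # stand-in for the MIDI sound source + post-filter control
    rise, release, lpf, hpf = params
    return f"env(rise={rise:.2f}, release={release:.2f}) filt(lp={lpf}, hp={hpf})"

print(send_to_sound_source(network(read_gesture())))
```

The point of the architecture is that only the middle stage, the network, is learned; the sensing and sound-generation stages stay fixed.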

2 System Configuration

The system consists of a personal computer, a dataglove and its control device, a three-dimensional magnetic sensor, a MIDI sound source, and a sound effector, as shown in Figure 1. The magnetic sensor detects the three-dimensional position of the user's right hand and the velocity of its movement, while the dataglove provides the flexing degree of the five fingers; both are used as gesture parameters. Instead of providing a fixed relation linking gesture parameters and sound parameters, we put a neural network between them so that each user can choose the most desirable relation, unique to him/her.

3 Neural Network

3.1 The Structure of the Neural Network and the Learning Process

We used a five-layered neural network with mixed linear and nonlinear neurons. The structure of the neural network is shown in Figure 2.

[Fig. 2. Five-Layered Neural Network, mapping gesture parameters to sound parameters.]

[Fig. 3. Two-Dimensional Emotional Space, with axes R1 and R2.]

The first and third layers consist of linear neurons, while the other layers consist of nonlinear neurons, indicated by white and black circles in Figure 2 respectively.

By restricting the number of linear neurons in the third layer to two, we map the gesture into a two-dimensional space which we call the "emotion space". We can investigate the topological relationships among a variety of timbres in this emotion space. The behaviour of our network is written as follows:

    X_{j_L} = \sum_{i_{L-1}=1}^{M_{L-1}} W_{i_{L-1} j_L} Y_{i_{L-1}}    (1)

    Y_{j_L} = f(X_{j_L} + \theta_{j_L})    (2)

    L = 1, 2, ..., N;  j_L = 1, 2, ..., M_L;  i_{L-1} = 1, 2, ..., M_{L-1}

where M_L is the number of neurons in the L-th layer, X_{j_L} and Y_{j_L} are the input and output of neuron j in the L-th layer respectively, W_{i_{L-1} j_L} is the connection weight between neuron i in the (L-1)-th layer and neuron j in the L-th layer, and \theta_{j_L} is the threshold value of neuron j in the L-th layer (L = 0 represents the input layer). The output function f(x) for the nonlinear neurons is

    f(x) = \frac{1}{1 + e^{-x}}    (3)

and for the linear neurons

    f(x) = x    (4)

We do not put a limitation on the method of neural network learning, but for its simplicity we choose the back-propagation method [10][11], in which the difference between the network's actual outputs and the desired outputs is used as an index for weight correction. The squared error E_d between the actual network outputs Y_{j,output} and the desired outputs Y_{jd} is

    E_d = \frac{1}{2} \sum_j (Y_{jd} - Y_{j,output})^2    (5)

Then the connection weights are corrected to reduce this squared error:

    W_{i_L j}(t+1) = W_{i_L j}(t) - \eta \frac{\partial E_d}{\partial W_{i_L j}}    (6)

In equation (6), W_{i_L j}(t+1) is the connection weight between the i-th neuron in the L-th layer and the j-th neuron in the next layer at time t+1, W_{i_L j}(t) is the value of the same weight at the previous time t, and \eta is a small constant that indicates the learning rate.

In the proposed system, the network is first trained to generate some typical timbres according to primitive gesticulations which may be common to most people, using the back-propagation method described above.
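The forward pass of Eqs. (1)-(4) and the gradient update of Eqs. (5)-(6) can be sketched as follows. The layer sizes (seven gesture inputs, a two-unit linear "emotion space" third layer, four sound outputs), the learning rate, and the training pair are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [7, 8, 2, 8, 4]   # gesture in, hidden, emotion space, hidden, sound out
acts = ['sigmoid', 'linear', 'sigmoid', 'sigmoid']   # third layer is linear

W = [rng.normal(0, 0.5, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]   # threshold values theta

def f(x, kind):
    # output functions of Eqs. (3) and (4)
    return 1.0 / (1.0 + np.exp(-x)) if kind == 'sigmoid' else x

def forward(y):
    # layer-by-layer propagation of Eqs. (1)-(2); returns all activations
    ys = [y]
    for Wl, bl, kind in zip(W, b, acts):
        y = f(y @ Wl + bl, kind)
        ys.append(y)
    return ys

def backprop_step(x, target, eta=0.5):
    """One gradient-descent update of Eq. (6) on the squared error Eq. (5)."""
    ys = forward(x)
    delta = (ys[-1] - target) * ys[-1] * (1 - ys[-1])   # output layer is sigmoid
    for l in range(len(W) - 1, -1, -1):
        gW, gb = np.outer(ys[l], delta), delta
        if l > 0:   # propagate the error one layer back before updating
            deriv = ys[l] * (1 - ys[l]) if acts[l - 1] == 'sigmoid' else np.ones_like(ys[l])
            delta = (delta @ W[l].T) * deriv
        W[l] -= eta * gW
        b[l] -= eta * gb
    return 0.5 * np.sum((forward(x)[-1] - target) ** 2)

gesture = rng.random(7)                  # e.g. velocities + palm/finger flags
sound = np.array([0.0, 0.0, 0.3, 1.0])   # the "solid" column of Table 2
errs = [backprop_step(gesture, sound) for _ in range(200)]
emotion = forward(gesture)[2]            # 2-D emotion-space coordinates
print(errs[-1] < errs[0], emotion.shape)
```

After training, the third-layer activations give the coordinates of each trained timbre in the emotion space.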

3.2 Active Learning

In recent years we have studied a new kind of learning algorithm for neural networks which we call "active learning" [12][13][14], in which the neural network actively self-produces effective sample patterns while advancing the learning process. We use a modified active learning for the personal tuning of the network in the second phase of learning. With the proposed algorithm, the network does not passively depend on given problem (pattern)-answer sets, but actively produces a variety of patterns, while the user successively gives answers to the generated patterns. All we have to do is provide the network with an initial pattern; the network will then be trained to generate the user's favourite pattern according to the answers. That is, in the simplest version of the second learning phase, for a specific gesture input X_{j_0}, if the answer is "yes" then the network corrects its weights according to equation (6) toward the desired output Y_{jd} given as

    Y_{jd} = (1 + \alpha) Y_j + R    (7)

where \alpha is the renewal parameter. If the answer is "no" then the desired output is set as follows, to pull away from the present pattern (see Figure 4):

    Y_{jd} = (1 - \alpha) Y_j + R    (8)

R in equations (7) and (8) is a small random perturbation that prevents the pattern modification from ceasing and generates a variety of patterns in the desired direction of the pattern space, providing a chance to find a more desirable output. The amplitude of R increases when the answer "no" is repeated, and decreases when the answer "yes" is repeated.

[Figure 4. Random Walk in the Emotion Space.]

4 Experiments

First we trained the neural network to link three sets of gesture-sound relations, labelled "solid sound", "heavy sound" and "light sound" respectively.
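The yes/no target update of Section 3.2 (Eqs. 7 and 8) can be sketched as below. The renewal parameter, the R-amplitude schedule, and the clipping to [0, 1] are illustrative assumptions:

```python
import numpy as np

# Sketch of the second-phase "active learning" target update: the network's
# current output is nudged further the same way on "yes" (Eq. 7) or pulled
# away on "no" (Eq. 8), plus a random perturbation R whose amplitude adapts
# to repeated answers. Alpha and the 0.8/1.25 factors are assumed values.
rng = np.random.default_rng(1)

def next_target(y, answer, alpha=0.1, r_amp=0.05):
    R = r_amp * rng.standard_normal(y.shape)
    if answer == 'yes':
        y_d = (1 + alpha) * y + R   # Eq. (7)
        r_amp *= 0.8                # shrink R on repeated "yes"
    else:
        y_d = (1 - alpha) * y + R   # Eq. (8)
        r_amp *= 1.25               # grow R on repeated "no"
    return np.clip(y_d, 0.0, 1.0), r_amp

y, r_amp = np.array([0.3, 0.6, 0.5, 0.8]), 0.05   # current sound parameters
for answer in ['no', 'no', 'yes']:                # simulated user judgements
    target, r_amp = next_target(y, answer, r_amp=r_amp)
    # ...here the network would be retrained toward `target` via Eq. (6)...
    y = target
print(y, r_amp)
```

Iterating this loop is exactly the random walk through the pattern space that the conclusions describe: each answer moves the proposed sound, and R keeps the search from stalling.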
At present the gesture parameters consist of:

* the hand movement velocities v_x and v_y,
* two parameters indicating the condition of the user's palm (whether it is turned upward or downward),
* three parameters indicating the conditions of the user's fingers (whether they are flexed or not),

while the sound parameters consist of:

* the rise and release periods of the sound envelope,
* the gains of the low-pass and high-pass post filters.

An example of the relationship between the adjectives that describe the nature of the sounds and the gestures is shown in Table 1, while Table 2 shows the relationship between the adjectives and the sound parameters.

[Fig. 5. The sound envelope: rise period, maximum output, release period over time.]

[Fig. 6. Post Filter Characteristic: gain versus frequency for the low-pass filter.]

Table 1. Adjective-gesture relation

              solid                 light               heavy
  palm        downward              upward              downward
  fingers     clenched              clenched            stretching
  movement    shaking powerfully    moving vertically   moving vertically
              horizontally

Table 2. Adjective-sound relation

                  solid   light   heavy
  rise            0.0     0.6     0.6
  release         0.0     0.6     0.6
  low band gain   0.3     1.0     1.0
  high band gain  1.0     0.3     1.0

The adjective-gesture and gesture-sound relations do not need to be the same as those shown in Tables 1 and 2; each user can choose the most suitable adjective-gesture and gesture-sound parameters and train the neural network to recognize the

linkage between those parameters. There is no limitation on the number of gesture-sound sets, but in this study we chose the three most understandable adjectives, linked to the most suitable sounds in our sense, as the initial training data for the neural network. After the first learning phase, the neural network produced three points in the emotion space, each of which represents one adjective used in training. The distribution of the points in the emotion space is shown in Fig. 7.

[Fig. 7. Distribution inside the emotion space: the points for "light", "heavy" and "solid" over axes R1 and R2.]

The advantage of using a neural network is that the system can produce not only the "patterns" (in this case, gesture-sound sets) that were used in the training, but also patterns that lie between the trained sets. For example, the system can produce a "middle sound" between a light sound and a heavy sound if we can produce the appropriate intermediate gesture, although this is hard to do because we do not know exactly how to gesticulate that way. To test the ability of the network to produce intermediate sounds, we can also drive a cursor in the two-dimensional emotion space using a mouse. For example, if we move from the point that represents a heavy sound toward the point that represents a light one, the system will produce a sound that gradually changes from the heavy sound to the light sound. We can move freely inside the emotion space to obtain unique sounds that cannot be expressed with simple adjectives. With the proposed system, a more varied performance becomes possible, because in the second learning phase each user will be able to tune the system with the active learning algorithm and express their musical feelings freely, without limitation.
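The "middle sound" idea can be sketched by interpolating between two points in the 2-D emotion space and feeding each point through the decoder half of the network (the layers after the third). The weights and the emotion-space coordinates below are random stand-ins; in the real system they would come from training:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical decoder half: 2-D emotion space -> 8 hidden -> 4 sound params.
W3, W4 = rng.normal(0, 0.5, (2, 8)), rng.normal(0, 0.5, (8, 4))

def decode(p):
    # emotion-space point -> (rise, release, LPF gain, HPF gain)
    h = 1.0 / (1.0 + np.exp(-(p @ W3)))
    return 1.0 / (1.0 + np.exp(-(h @ W4)))

heavy, light = np.array([0.8, -0.4]), np.array([-0.6, 0.7])   # assumed coords
for t in np.linspace(0, 1, 5):   # drag the cursor from "heavy" to "light"
    p = (1 - t) * heavy + t * light
    print(round(t, 2), np.round(decode(p), 3))
```

Because the decoder is continuous, sound parameters vary smoothly along the cursor path, which is what produces the gradual heavy-to-light transition.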
5 Conclusions

In this study we proposed a new kind of musical instrument that, unlike conventional musical instruments, can be mastered in a short period by any user, without requiring any musical knowledge. Moreover, the users themselves can decide the behaviour of the instrument, which is difficult with ordinary musical instruments. With active learning, in the second learning phase the neural network produces a specific sound when gesture parameters are given, and the user then decides whether that sound suits the given gesture. If it is suitable, the network will next produce a sound further in the direction of the "good sound"; if not, the network will pull the sound away from the "bad sound". In other words, the sound makes a random walk inside the two-dimensional emotion space, searching for the most suitable sound. The procedure is repeated until the user is completely satisfied with the gesture-sound relationship. By checking the gesture-sound relationship in real time in this way, we can train the network to define a personalized musical instrument. Furthermore, we also plan to include not only hand gestures but also movements of other body parts, such as head movements or facial expressions, so that the freedom

of self-expression will be much greater. If the whole system can be downsized, then beyond its use as a musical instrument we would also like to apply our system as an aid for people who cannot speak, since the system can be taught to recognize "self-made" sign gestures. Traditional sign language itself is sometimes very hard to learn, because its rules of gesture do not always fit the user's sense. With the proposed system, users can choose the gestures that are most suitable for themselves, so that they can express their minds more freely and easily.

References

[1] D. Keane, P. Gross: The MIDI Baton, Proc. ICMC, pp. 49-55 (1988)
[2] D. Rubine, P. McAvinney: Programmable Finger-Tracking Instrument Controllers, Computer Music Journal, Vol. 14, No. 1 (1990)
[3] B. Boie et al.: The Radio Drum as a Synthesizer Controller, Proc. ICMC, pp. 42-45 (1989)
[4] Rokeby: Body Language, ACM SIGGRAPH Art Show, Atlanta (1988)
[5] Chabot: Performance with Electronics, Proc. ICMC, pp. 65-68 (1989)
[6] H. Katayose et al.: Virtual Performer, Proc. ICMC (1993)
[7] T. Harada et al.: Real Time Control of 3D Sound Space by Gesture, Proc. ICMC, pp. 85-88 (1992)
[8] Morita et al.: Computer Music System that Follows a Human Conductor, IEEE Computer, pp. 45-53 (1991)
[9] Sato et al.: Singing and Playing in Musical Virtual Space, Proc. ICMC, pp. 289-292 (1991)
[10] D.E. Rumelhart, J.L. McClelland and the PDP Research Group: Parallel Distributed Processing, Vols. 1 & 2, The MIT Press (1986)
[11] P. Peretto: An Introduction to the Modelling of Neural Networks, Cambridge University Press
[12] S.Y. Kung: Digital Neural Networks, Prentice Hall
[13] S. Hashimoto: Pattern Generation Using Neural Network, Convention Record, D-217, IEICE Japan (1990)
[14] M. Ishii, S. Hashimoto: Automatic Pattern Generation Using Neural Network, Convention Record, D-14, IEICE Japan (1991)
[15] P. Hartono, S. Hashimoto: Active Learning Algorithm of Neural Network, Proc. IJCNN'93, pp. 2548-2551