Virtual Performer

*Haruhiro Katayose, *Tsutomu Kanamori, *Katsuyuki Kamei, *Yoichi Nagashima, *Kosuke Sato, *Seiji Inokuchi, **Satosi Simura

*Laboratories of Image Information Science and Technology, Toyonaka, Osaka, 565, JAPAN
**Osaka University of Arts, Higashiyama, Konan-cho, Minamikawachi, Osaka, 585, JAPAN

Abstract

This paper gives an overview of the Virtual Performer, a system/environment which realizes non-verbal communication in the field of live art. The Virtual Performer consists of gesture sensors based on multiple sensors, modules which analyze the obtained gestures and generate action plans, and presentation facilities. An adaptive Karaoke system and a session model based on multiple agents are described as media partners of the Virtual Performer. The configuration for a Shakuhachi piece is shown as an example of the composing environment of the Virtual Performer. Real-time control of a CG guitarist is also described.

1. Introduction

Computer science has developed greatly and now contributes to a wide range of areas, from engineering to daily life. Computer technology has certainly produced indispensable tools, but it is often said that there remains room for improving user interfaces. Computer science deals mainly with verbal information; this is one of the reasons for its great progress, but at the same time it limits the user interface. Recently, there has been great technological interest in how to deal with non-verbal information. To achieve non-verbal communication between man and machine, we have to study and develop sensors to obtain outside information, a recognition system to understand the obtained information, a planning module based on the recognized environment, and a facility for presentation. Virtual Reality is a realized technology of non-verbal communication [Zelter 1991]. Many applications of Virtual Reality technology can be seen in, for example, the SIGGRAPH proceedings. The system proposed here has the goal of realizing "KANSEI" communication in real time. KANSEI denotes sense, feeling, emotion, or sensitivity [Katayose 1990]. The essence of KANSEI communication is the communication of intention and emotion using patterns with presence. This paper describes the Virtual Performer, which achieves KANSEI communication in the art field. Improvisation and interactive composing are hot themes in computer music, and interactive media art using sensors of continuous sound has received much attention recently [Chadabe 1983] [Machover et al. 1989] [Chung et al. 1991].

2. Virtual Performer

We have been developing the Virtual Performer, a system composed of gesture sensors, a module that analyzes and responds to the obtained information, and a presentation facility. This section gives an outline of the system. Live performance is a form of art in which tension is at its highest, because the same performance cannot be repeated. In live performances by human performers, each performer knows the scenario beforehand. He decides the timing of the events written in the scenario based on information communicated in real time. Furthermore, human performers who are familiar with each other can correctly guess what their partner will do next by observing the acoustic and motion gestures their partners give. This ability rests on the complementary use of multiple senses, on knowledge processing that responds correctly to recognized information, and on a facility for presentation. The purpose of the Virtual Performer proposed here is to simulate such processing and to provide an interactive composing environment. Figure 1 shows the outline of the Virtual Performer: gesture sensors, a module that analyzes and responds to the obtained information, and a presentation facility.
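As a rough sketch of this three-stage organization, the toy pipeline below passes sensor events through an analysis/planning stage to a presentation stage. All names, thresholds, and the averaging rule are our own illustration, not part of the system, which exchanges MIDI data between dedicated modules.

```python
# Hypothetical sketch of the Virtual Performer's three-stage pipeline:
# gesture sensors -> analysis/planning -> presentation.

from dataclasses import dataclass

@dataclass
class GestureEvent:
    source: str      # e.g. "acoustic", "motion"
    value: float     # normalized sensor reading, 0.0 .. 1.0

def analyze(events):
    """Fuse sensor readings into a single intensity estimate."""
    if not events:
        return 0.0
    return sum(e.value for e in events) / len(events)

def plan(intensity, scenario):
    """Pick the next scenario event once intensity crosses its cue threshold."""
    for cue_threshold, action in scenario:
        if intensity >= cue_threshold:
            return action
    return "wait"

def present(action):
    return f"presenting: {action}"

scenario = [(0.8, "climax"), (0.4, "develop"), (0.0, "intro")]
events = [GestureEvent("acoustic", 0.5), GestureEvent("motion", 0.7)]
print(present(plan(analyze(events), scenario)))  # -> presenting: develop
```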
The first part is the sensing module. A sensor-fusion technique is used in the construction of the gesture sensors.

ICMC Proceedings 1993

The sensing module consists of sensors for motion gestures and for acoustic gestures. For motion, different kinds of sensors are used complementarily in order to acquire high-resolution data over a wide range. For acoustic data, chords can be detected in real time.

The second part is a module that analyzes and responds to the obtained information: the performance controller. This module decides the performance data by analyzing the recognized gestures. Uncertainty in the data acquired by the sensors is resolved in this module by considering the output of all the sensors together. Higher musical primitives, such as chord names, phrase boundaries, and structural accents, are also analyzed in real time. These musical primitives, as well as the low-level signals from the sensors, are used as control parameters of the performance.

The third part is the CG and sound generator. In addition to artistic CG, a human physical model that generates animation synchronized to the performance has been constructed; at present, a hand model that plays the guitar has been made as a prototype. For sound generation, MIDI instruments and digital sound synthesis using a D/A board are available.

We have two goals in using the Virtual Performer. One is a role as the composer's environment, and the other is as a partner system for general use. The former aims at composing total media art: the Virtual Performer environment enables us to compose media art that exceeds the limits of human real-time control. One of the authors, Satosi Simura, is composing interactive music accompanied by a background video; the technological outline of this piece is described in the section "Virtual Musician as Composing/Performing Environment." The media partner systems, a different interest from the former artistic one, aim to present a friendly human interface, that is, KANSEI communication, to the general public.
An adaptive Karaoke system and a session system will be shown in Section 5 as media partners.

3. Handling Gestures

It is relatively easy for humans to understand gestures, so some people might think it is quite easy for machines to recognize them. However, gestures differ every time, especially in live performance, and this makes the adjustment of thresholds difficult. Furthermore, human beings can perceive data with high resolution over a wide range: a person can focus on whatever interests him. In order to realize such human-like sensing by machine, the sensing section of the Virtual Performer uses multiple sensors. Performers' rough movements over a wide area of the stage are measured with an image-processing sensor, which detects the light of light-emitting diodes attached to the performer's body. An infrared filter attached to the camera allows this sensor to be used under ordinary lighting, and the field width and resolution can be changed by changing lenses. Precise data are measured with sensors based on supersonic transmitters and gyroscopes.

As for acoustic sensors, the following two kinds have been developed. One is a sensor for musical instruments whose sound has a fast attack. This sensor first calculates frequencies using plural DFTs coded on transputers, and it can identify guitar chords in real time. The other is a sensor for the continuous sound of the Shakuhachi, which is described in Section 6.

[Figure 1. Virtual Performer: an infrared image sensor and a microphone feeding an acoustic sensor, producing MIDI-1 (sound data), MIDI-2 (rough motion data), and MIDI-3 (detail motion data).]

The sensor-fusion technique is used in order to cope with ambiguity. For example, some vibrato techniques used in Shakuhachi playing are identified by combining the results of the acoustic sensors and the gyroscope-based sensors. MAX [Puckette 1988] is used to define how the sensor fusion is executed, and it is very useful for this purpose: it is quite easy, for example, to make a patcher that recognizes chords from note-on events. MAX has difficulty, however, in recognizing structurally defined objects in the time domain; there is a possibility of stack overflow unless the activation time of objects can be controlled. In order to reduce the burden on MAX, we developed a preprocessor, a real-time production system in which the activation time of each working-memory element can be described.

MIDI is adopted for data communication and control in the Virtual Performer, because a lot of commercial software is available for defining data sets for interactive composing. The output of each sensor complies with the MIDI format, and each sensor can be used individually if necessary.

4. CG Generation

Visual media play an important role in the sense of presence defined in the concept of Virtual Reality. In our project, visual media are used in two ways: one is to present real-time CG art, and the other focuses more on the amusement aspect. Realistic modeling and animation are not the central problem in the artistic approach, and many researchers have studied how to model and animate realistic CG with a technological interest. This section describes an approach to generating CG of a player of a musical instrument. This CG is used as the virtual guitarist of the session system described below.
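As an illustration of the kind of note-on chord recognition mentioned in Section 3, the hypothetical sketch below keeps note-ons in a working memory with a limited activation time, in the spirit of the production-system preprocessor, and matches the surviving pitch-class set against chord templates. The template set, window length, and all names are invented.

```python
# Hypothetical sketch of chord recognition from note-on events. Note-ons
# older than the activation window are dropped from working memory before
# the pitch-class set is matched against chord templates.

CHORD_TEMPLATES = {          # pitch-class intervals above the root
    "maj":  {0, 4, 7},
    "min":  {0, 3, 7},
    "dom7": {0, 4, 7, 10},
}
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F",
              "F#", "G", "G#", "A", "A#", "B"]

def identify_chord(note_ons, now, window=0.2):
    """note_ons: list of (midi_note, onset_time). Returns e.g. 'C maj'."""
    live = [n for n, t in note_ons if now - t <= window]   # activation time
    pcs = {n % 12 for n in live}
    for root in range(12):
        shifted = {(p - root) % 12 for p in pcs}
        for name, template in CHORD_TEMPLATES.items():
            if shifted == template:
                return f"{NOTE_NAMES[root]} {name}"
    return None

# C-E-G played together; a stale note outside the window is ignored:
notes = [(60, 1.00), (64, 1.01), (67, 1.02), (59, 0.50)]
print(identify_chord(notes, now=1.05))  # -> C maj
```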
At present, we have been developing CG of guitar playing that moves in synchrony with the music. The graphics are synthesized from a real image of a human performer and computer graphics. Representing natural finger movement is important for realism, but finger movement in a real performance is difficult to measure, and there are numerous patterns of finger movement in playing a musical instrument. This paper describes a method that generates the motion of guitar playing from static finger-position data measured in advance.

4.1 Model

Limiting the discussion to the playing of chords, the important points in modeling the left hand are (1) generating the form of the hand when a chord name is given, and (2) generating the movement when a chord progression is given. There is previous work that decides the form of a hand by maximizing an estimating function of the joint angles after specifying the fingertip positions. Our method also considers the relation between the fingers, so the positions of free fingers can be defined.

(1) Estimating function

Let a_{ij} be the angle of the j-th joint of the i-th finger, and let \bar{a}_{ij} be that angle when the finger is relaxed. The estimating function is defined as follows:

  E_f = \sum_{i,j} [ \alpha_{ij} (a_{ij} - \bar{a}_{ij})^2 + \beta_{ij} (a_{ij} - a_{i,j-1})^2 + \gamma_{ij} (a_{ij} - a_{i-1,j})^2 ]

The first term draws each joint toward its relaxed angle, the second operates so that the angles of adjacent joints become similar, and the third relates the corresponding joints of adjacent fingers.

The following term is used in order to adjust the positions of the fingertips, where p_i is the fingertip position of the i-th finger, P_i is the desired position, and h_i is a weight that is set to 0 for free fingers:

  E_p = \sum_i h_i (p_i - P_i)^2

p_i is calculated from a_{ij} and the segment lengths. The total estimating function is

  E = E_f + E_p

(2) Generating motions

Our method uses spline functions to decide the movement from one chord form to another.
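As a toy illustration of minimizing such an estimating function, the sketch below relaxes the joint angles of a single finger by gradient descent on E_f. The weights, relaxed angles, and starting posture are invented, and the fingertip term E_p is omitted for brevity.

```python
# Toy gradient-descent sketch of the joint-angle estimating function E_f
# for one finger. alpha pulls each joint toward its relaxed angle; beta
# makes adjacent joint angles similar.

def energy(a, a_rest, alpha=1.0, beta=0.5):
    e = sum(alpha * (a[j] - a_rest[j]) ** 2 for j in range(len(a)))
    e += sum(beta * (a[j] - a[j - 1]) ** 2 for j in range(1, len(a)))
    return e

def gradient(a, a_rest, alpha=1.0, beta=0.5):
    g = [2 * alpha * (a[j] - a_rest[j]) for j in range(len(a))]
    for j in range(1, len(a)):
        d = 2 * beta * (a[j] - a[j - 1])
        g[j] += d
        g[j - 1] -= d
    return g

def relax(a, a_rest, rate=0.1, steps=200):
    a = list(a)
    for _ in range(steps):
        g = gradient(a, a_rest)
        a = [a[j] - rate * g[j] for j in range(len(a))]
    return a

a_rest = [0.2, 0.5, 0.4]          # relaxed angles for one finger's joints
a0 = [1.0, 0.0, 1.0]              # an awkward starting posture
a1 = relax(a0, a_rest)
assert energy(a1, a_rest) < energy(a0, a_rest)
```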
E_f described above is used in order to represent the relaxation of the fingers. With A = {a_{ij}}, and with V_0 and V_1 the velocities at t = 0 and t = 1 respectively, we obtain the cubic Hermite form

  A(t) = (t^3 t^2 t 1) M (A_0 A_1 V_0 V_1)^T,  M = ( 2 -2 1 1 ; -3 3 -2 -1 ; 0 0 1 0 ; 1 0 0 0 )

Here V_0 and V_1 are given by

  V_0 = -\kappa_0 { \partial E_f(A_0) / \partial a_{ij} }
  V_1 = -\kappa_1 { \partial E_f(A_1) / \partial a_{ij} } + \rho (A_1 - A_0)

where the first term operates as the relaxation and the second term operates as the velocity control. Figure 2 shows the processing flow of the motion generation of the CG hand, and Figure 4 shows an example of hand movement using the model above. The fingers are represented using a cylinder model, and a total of 800 polygons is used to construct the hand image.

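The spline-based movement between two chord forms can be illustrated with scalar cubic Hermite interpolation. The angle values and the zero endpoint velocities below are invented; the system interpolates the whole angle matrix rather than a single scalar.

```python
# Sketch of cubic Hermite interpolation between two chord forms A0 and A1
# with endpoint velocities V0 and V1, reduced to one joint angle.

def hermite(t, a0, a1, v0, v1):
    """A(t) for t in [0, 1], matching A(0)=a0, A(1)=a1, A'(0)=v0, A'(1)=v1."""
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * a0 + h10 * v0 + h01 * a1 + h11 * v1

# One joint angle moving from 0.2 rad to 0.8 rad, leaving and arriving
# with zero velocity (illustrative values):
assert hermite(0.0, 0.2, 0.8, 0.0, 0.0) == 0.2
assert hermite(1.0, 0.2, 0.8, 0.0, 0.0) == 0.8
print(round(hermite(0.5, 0.2, 0.8, 0.0, 0.0), 3))  # -> 0.5
```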
Figure 2. Motion generation of a hand.  Figure 3. Model of a hand.

Motion generation and drawing take 40 milliseconds.

5. Media Partners

This section describes the media partners, which aim to show general users examples of user-friendly interactive art systems built under the Virtual Performer environment.

5.1 Adaptive Karaoke system

Automatic accompaniment has been an important theme in computer music since Dannenberg and Vercoe proposed their systems in 1984 [Dannenberg 1984] [Vercoe 1984]. We have developed an adaptive Karaoke system that follows human singing, as an application of automatic accompaniment [Takeuchi et al. 1993]. Karaoke is a sort of music-minus-one and is very popular in Japan; it is thought to be one of the most suitable vehicles for showing computer music technology to the public. Most Karaoke music, however, keeps a steady tempo, so our system is, to put it more precisely, like an accompanying pianist for chanson. There is another study of adaptive Karaoke systems in Japan [Inoue et al. 1992]. The aims of that study and ours are almost the same: 1) tempo following, 2) countermeasures for the singer's mistakes, 3) identification of the music from a sung phrase, and 4) pitch following. How to assign the pitch of the human voice, which is unstable even for good singers, has not been fully discussed in their study, however, and this section focuses on our approach to this difficulty.

The difficulties in pitch assignment are the form of the vocal envelope, musical expression in singing, and real-time frequency analysis. As for musical expression, portamento and vibrato, which are not written in the score, are frequently added in ordinary singing, and these expressions differ from singer to singer. Another problem, which can be regarded as a kind of portamento, is that the pitch around the attack is not stable even in a good singer's voice. The slow attack of the vocal envelope makes it difficult to match the timing of the voice and the accompaniment.
The problem in real-time frequency analysis is mis-assignment of pitch due to harmonics. One solution is to control the cut-off frequency of the filter placed before the A/D converter, yet it is still difficult to extract the right frequency for every voice.

[Figure 4. An example of motion generation of a hand (frames at successive values of t, e.g. t = 0.4 and t = 1.0).]

The system overview is shown in Figure 5. First, the voice is input to a Macintosh computer through a DC canceler, an amplifier, a 1 kHz low-pass filter, and an A/D converter. Pitch and power are extracted every 40 milliseconds by the pitch and power detector; the interval is adjusted so that the window contains at least two periods of the lowest expected frequency.

[Figure 5. Overview of the Adaptive Karaoke System.]

Pitch detection is based on counting zero-crossing intervals. First, the peak value of the input signal and its timing are detected by observing half the window length. The peak value is used to detect the attack, and the timing is used as the starting point of an integration of the signal. Zero-crossing interval counting is then applied to the integrated signal; the integration is effective in reducing errors due to ripples around the zero level.

As described above, it is very difficult to detect an accurate frequency at the attack, and accompaniment that waits until an accurate, confident pitch has been detected comes too late. In our system, the attack timing obtained from the power component is used for the primary scheduling of the accompaniment data. The pitch is then calculated over the note length referred from the score data; a weighted average pitch over a certain length cancels the effect of portamento and vibrato. This approach can be seen as confirmation of the pitch and length of notes against templates. If this confirmation fails repeatedly, the system calls functions designed to cope with singers' mistakes, using the previous frequency data stored in a ring buffer.

It is difficult to detect the attack of the voice, especially in Japanese songs, which are composed of continuous vowel syllables. To cope with this problem, the clear attack candidates that may appear at the beginnings of phrases, and the notes that start with a plosive sound, are marked (Figure 6).

[Figure 6. Score data for matching, before and after marking; entries such as 0 (m 0 E3 "na" 1 0), 1 (m 0 D3 "mi" 1 0), ... with the marked attack candidates retained.]

These marked points are used in finding where the singer is singing, as well as serving as the primary matching points.
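The zero-crossing pitch estimation described above can be sketched as follows. The moving average here stands in for the paper's integration step (both act as a low-pass that suppresses ripple around zero), and the sampling rate and test tone are invented.

```python
# Sketch of pitch estimation by zero-crossing interval counting on a
# smoothed signal (moving average as a stand-in for integration).

import math

def smooth(signal, width=5):
    half = width // 2
    return [sum(signal[max(0, i - half):i + half + 1]) /
            len(signal[max(0, i - half):i + half + 1])
            for i in range(len(signal))]

def zero_crossing_pitch(signal, sample_rate):
    crossings = [i for i in range(1, len(signal))
                 if signal[i - 1] < 0.0 <= signal[i]]   # upward crossings
    if len(crossings) < 2:
        return None
    # average interval between upward crossings = one period
    span = crossings[-1] - crossings[0]
    period = span / (len(crossings) - 1) / sample_rate
    return 1.0 / period

rate = 8000
window = [math.sin(2 * math.pi * 220 * i / rate) for i in range(320)]  # 40 ms
estimate = zero_crossing_pitch(smooth(window), rate)
assert abs(estimate - 220.0) < 5.0
```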
This method, introduced to gain stability, is the reason the system cannot follow rapid tempo changes; we assume, however, that such changes are rare in Karaoke music. The system dynamically controls the attack-judgment threshold, lowering it around the places where an attack may appear. As for the score data, a performance played by a human musician is used. The data are quantized at every bar start, not at every beat, so that they keep their agogics; the human dynamics are also kept and used in the accompaniment. To express the "interval" between phrases, an "interval" sign can be written on an attack candidate, and the system waits for input at such marked points.

5.2 JASPER and an agent session model

There are many outstanding studies on interactive performance. MAX has been the standard tool since it was proposed; the fundamental aim of software for interactive performance is controlling and sequencing music. Another interest in interactive performance systems is how to understand input signals as music. Cypher is one of the most complete systems that can analyze music with an agent model [Rowe 1992]. We have developed a jam-session partner called JASPER, which aims at responding to a player's intention [Wake 1991]. This system responds to a player's intention by changing its performance based on a tension parameter, which is calculated from the number of notes, their velocities, and their registers. The system succeeded in showing an example of KANSEI communication to some extent, but problems remained in the higher-level analysis of music and in how to plant personalities into the system. The rest of this section describes a session model for Jazz, based on multiple agents, designed to cope with these problems.
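JASPER's tension parameter might be sketched as below. The normalization constants and the equal weighting of note count, velocity, and register are our own guesses, not the system's actual formula.

```python
# Illustrative sketch of a tension parameter computed from the note
# count, average velocity, and register of a short input window.

def tension(notes):
    """notes: list of (midi_pitch, velocity). Returns tension in [0, 1]."""
    if not notes:
        return 0.0
    density = min(len(notes) / 16.0, 1.0)            # 16+ notes = busy
    velocity = sum(v for _, v in notes) / len(notes) / 127.0
    register = sum(p for p, _ in notes) / len(notes) / 127.0
    return (density + velocity + register) / 3.0

calm = [(48, 40), (52, 45)]                    # few, soft, low notes
hot  = [(84, 120)] * 12 + [(96, 127)] * 4      # many, loud, high notes
assert tension(hot) > tension(calm)
```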

The spirit of a Jazz session is improvisation. The performance depends on the circumstances and, at the same time, on the feeling of the players; this is why the same performance can never be repeated. Human players communicate with each other through music, understanding the intention contained in each other's playing, which is a type of non-verbal communication. The target here is to make a computer jazz player that can substitute for a human player.

[Figure 6. KANSEI communication using non-verbal (pattern) information.]

Our system is constructed with a multi-agent model, as shown in Figure 7. This makes it possible to plant personalities in each computer musician, so that participants can enjoy a varied Jazz session. Each computer player has a listening part and a playing part. The listening part is divided into logical analysis and subjective analysis. Logical analysis covers the recognition of musical primitives: chords, melody, bass, drums, and their upper-level concepts such as chord progression and rhythm. One analysis at this level is key-change detection based on detecting dominant motions. The recognized musical primitives are first mapped into a common musical meaning. Next, the musical-excitement-value, which represents the agent's individual appraisal of the music, is calculated from the musical primitives and the common musical meaning, multiplied by parameters written in a personality profile; these parameters represent what the agent is interested in, and how strongly. In the playing part, solos and accompaniment are selected from a registered database and reshaped, based on the musical-excitement-value and positive/passive parameters, using knowledge of music performance. The system normally obeys the performer but sometimes insists on its own personality, based on the positive/passive parameters. This is one way the system responds to a player's intention.
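The mapping from shared musical primitives to an agent-specific musical-excitement-value via a personality profile can be sketched as a weighted sum. The element names and weights below are invented for illustration.

```python
# Sketch of turning shared musical primitives into an agent-specific
# musical-excitement-value: each recognized element is multiplied by
# that agent's interest weight from its personality profile.

def excitement(primitives, profile):
    """primitives: {element: strength 0..1}; profile: {element: weight}."""
    return sum(strength * profile.get(element, 0.0)
               for element, strength in primitives.items())

primitives = {"dominant_motion": 0.9, "syncopation": 0.4, "key_change": 0.0}
harmony_lover = {"dominant_motion": 1.0, "key_change": 1.0, "syncopation": 0.2}
rhythm_lover  = {"syncopation": 1.0, "dominant_motion": 0.1}

assert excitement(primitives, harmony_lover) > excitement(primitives, rhythm_lover)
```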
[Figure 7. A session model based on multi-agents: each agent has a Listening Part (local music perception of chord name, melody, notes, and rhythm; higher-level music recognition of chord progression, tension, and the Common Emotion and Tension Factors) and a Playing Part (personality profile, scheduler, and knowledge domains covering style, register, circumstance, and performance level).]

The logical listening-analysis part consists of agents activated by musical primitives, plus a generalization module. The musical primitives are chords, bass, drums, and so on. They are extracted cooperatively, using a knowledge base about the fundamental harmony of Jazz and a player's practical know-how. The generalization module is used to extract the "common musical meaning," called the "Common Emotion Factor" and the "Common Tension Factor." These two attributes are transformed into the subjective musical-excitement-value according to the personality profiles.

The agent for chords decides the chord type, for example an open or close voicing, or an incomplete chord lacking the root, third, or fifth. The agent for bass has two functions, root-note decision and beat tracking: it weighs the possibilities that a note is the root, the third, the fifth, the seventh, or a scale note, and it cooperates with the agent for drums. The network of the bass and chord agents recognizes the overall beat of the music. The functions of the drum agent are beat tracking and the recognition of rhythm variations such as swing (2-beat and 4-beat), rock, bossa nova, 16-beat, latin, and samba. In playing swing it is the cymbals, not the snare or the bass drum, that keep the tempo, and bossa nova has syncopations on cowbell or rimshot; the agent takes notice of such characteristics.

The generalization module analyzes more conceptual musical primitives using the output of the three agents. It recognizes chord progressions and functional chords such as tonic, subdominant, and dominant. We assume these musical primitives give us a common musical image: for example, a dominant motion makes us sense an ending, and the motion IIm7-IIb7-Imaj makes us feel blue. The generalization module traces the common emotion factor, which indicates the degree of emotion that the music itself contains over a few bars.
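The functional-chord labelling and dominant-motion detection performed by the generalization module can be sketched as follows. The reduction to root movement within a single known major key is our simplification; the real analysis uses the full output of the chord agents.

```python
# Sketch of functional-chord labelling and dominant-motion (V -> I)
# detection from chord roots in a known major key.

FUNCTIONS = {0: "tonic", 5: "subdominant", 7: "dominant"}  # degrees in semitones

def chord_function(root, key):
    return FUNCTIONS.get((root - key) % 12, "other")

def has_dominant_motion(roots, key):
    """True if a dominant chord resolves directly to the tonic (V -> I)."""
    labels = [chord_function(r, key) for r in roots]
    return any(a == "dominant" and b == "tonic"
               for a, b in zip(labels, labels[1:]))

# C major: F -> G -> C contains the dominant motion G -> C
assert has_dominant_motion([5, 7, 0], key=0)
assert not has_dominant_motion([0, 5, 9], key=0)
```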
In harmonic theory, a chord progression has been thought to restrict the melody to a limited set of scale notes. If we adhere to the theory strictly, the melody may become uninteresting and scale-colored. It is natural to think that a good soloist considers the theory while at the same time remaining conscious of the key in his mind. Consequently, scales can be classified into a few groups, which we call the common tension factor. The generalization module checks every mark of the common emotion factor and the common tension factor, and this processing is repeated whenever the module receives new data from the group of agents.

Users can enjoy a variety of session play with the Virtual Musician, accompanied by a CG musician, by setting the parameters of the personality profile. At present only a guitarist CG is available; we are planning to make other CG music players as media partners (see Section 4).

6. Virtual Musician as Composing/Performing Environment

This section describes the composing/performing environment of the Virtual Musician. Figure 8 shows the environment constructed for the computer music piece featuring Shakuhachi, which will be performed at the post-event of this ICMC, the LIST/IAKTA workshop.

[Figure 8. System configuration for computer music featuring Shakuhachi: acoustic sensors detect tremolo, octave jump, vibrato, and portamento, and output pitch and power at 20 Hz.]

The music piece is called "Chikukan no Uchu" (Cosmology of the bamboo pipe), composed by one of the authors, Satosi Simura. It is based on the famous words of the priest Sokuchi-Zenshi, which have been thought to be the conceptual basis of Shakuhachi performance. A Shakuhachi player has to comply with traditional manners, in addition to the actions for making sound, so that the sound carries its religious meaning. These manners are also used to control the parameters of the samplers and effectors. In the Virtual Performer environment it is possible to control CGs using PC1; as the visual presentation of this piece, however, a background video (BGV) was chosen for artistic reasons.

The environment is composed of sensor modules, a PC that analyzes the output of the sensors, a PC that controls the synthesizers and DSPs, and sound equipment. The connection for the sensor fusion is realized in a dual way rather than by direct connection; this is a kind of fail-safe mechanism. Even if one of the systems should fail, there is a mechanism by which an operator can give a cue to the system so that the performance continues.

The sensors used in this configuration are almost the same as those described in Section 3, with an acoustic sensor specially equipped for the Shakuhachi. The Shakuhachi has such a slow attack that it is difficult to extract accurate attack timing. However, the Shakuhachi has characteristic sound types, each with its own name, that are promising triggers for controlling music. The aim of the acoustic sensor is to identify the sound type. The fundamental tasks of this acoustic sensor are finding the pitch (as described for the adaptive Karaoke system), calculating the variation of power and pitch, and template matching of the power and pitch contours. It is extremely difficult to identify the sound type without mistakes; ambiguity in the sensor output, such as the detailed identification of vibrato types, is resolved by cooperation with the other sensors.
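The template matching of pitch and power contours for sound-type identification might be sketched like this. The templates, contour length, and distance threshold are invented, and the real sensor additionally fuses gyroscope data to resolve ambiguous cases.

```python
# Sketch of sound-type identification by template matching on pitch and
# power contours, in the spirit of the Shakuhachi acoustic sensor.

def contour_distance(observed, template):
    """Mean squared distance between two equal-length contours."""
    return sum((o - t) ** 2 for o, t in zip(observed, template)) / len(template)

def classify_sound_type(pitch, power, templates, threshold=0.05):
    best_name, best_dist = None, float("inf")
    for name, (t_pitch, t_power) in templates.items():
        d = contour_distance(pitch, t_pitch) + contour_distance(power, t_power)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist < threshold else "unknown"

# Toy 5-frame templates (normalized pitch deviation, normalized power):
templates = {
    "vibrato":   ([0.0, 0.2, 0.0, -0.2, 0.0], [0.8, 0.8, 0.8, 0.8, 0.8]),
    "crescendo": ([0.0, 0.0, 0.0, 0.0, 0.0], [0.2, 0.4, 0.6, 0.8, 1.0]),
}
observed_pitch = [0.0, 0.18, 0.02, -0.21, 0.01]
observed_power = [0.78, 0.82, 0.8, 0.79, 0.8]
print(classify_sound_type(observed_pitch, observed_power, templates))  # -> vibrato
```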
This piece and the technology described here will be performed at the LIST/IAKTA workshop, the post-event of this ICMC. Those who are interested in our activities are welcome.

7. Conclusion

This paper has given an overview of the Virtual Performer, a system/environment which realizes non-verbal communication in the field of live art. Users can experience a new image of the arts as performers or as audience. Specifically, an adaptive Karaoke system and a session model were introduced as media partners of the Virtual Performer, and the configuration for the Shakuhachi piece was shown as a usage of its composing environment. There are many possible uses of the Virtual Performer. One recent goal in the artistic area is to attempt a performance including dancers. Another usage is a conducting environment, which is easy to construct because the basic tools are already equipped in the Virtual Performer. At the same time, difficulties remain to be resolved. This paper has shown one example of KANSEI communication; understanding the player's intention, however, is an even more challenging and complicated task, and we would like to consider this problem using various interdisciplinary approaches.

References

[Chadabe 1983] J. Chadabe. Interactive Composing. Proc. ICMC, pp. 298-306, 1983.
[Chung et al. 1991] Joe Chung et al. A development environment for string hyperinstruments. Proc. ICMC, pp. 150-152, 1991.
[Zelter 1991] D. Zeltzer. Autonomy, Interaction and Presence. MIT Media Lab, MA, 1991.
[Dannenberg 1984] Roger Dannenberg. An On-Line Algorithm for Real-Time Accompaniment. Proc. ICMC, pp. 193-198, 1984.
[Inoue et al. 1992] Wataru Inoue et al. Automated Accompaniment System for Singing. Proc. of Summer Symposium '92, pp. 79-82, JMACS (in Japanese), 1992.
[Kanamori et al. 1993] T. Kanamori, H. Katayose, S. Shimura, and S. Inokuchi. Gesture Sensor in Virtual Performer. Proc. ICMC, 1993.
[Katayose 1990] H. Katayose and Seiji Inokuchi. Kansei Music System. Computer Music Journal, 13(4), pp. 72-77, 1990.
[Machover et al. 1989] Tod Machover and Joe Chung. Hyperinstruments: musically intelligent and interactive performance and creativity systems. Proc. ICMC, pp. 186-190, 1989.
[Rowe 1992] Robert Rowe. Machine Listening and Composing with Cypher. Computer Music Journal, 16(1), 1992.
[Puckette 1988] Miller Puckette. The Patcher. Proc. ICMC, pp. 420-429, 1988.
[Takeuchi et al. 1993] N. Takeuchi, H. Katayose and S. Inokuchi. Virtual Performer: Adaptive KARAOKE System. Proc. 46th Annual Conference, IPSJ, vol. 2, pp. 147-148 (in Japanese), 1993.
[Wake 1991] Sanae Wake et al. The Session System Reacting to the Sentiment of the Player. Proc. of Summer Symposium '92, pp. 43-46, JMACS (in Japanese), 1992.
[Vercoe 1984] Barry Vercoe. The Synthetic Performer in the Context of Live Performance. Proc. ICMC, pp. 199-200, 1984.