TOWARDS AN AUTOMATED MUSIC TEACHING SYSTEM: AUTOMATIC RECOGNITION OF MUSICAL MELODIES USING THE WF-4R
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License. Please contact email@example.com to use this work in a way not covered by the license.
Jorge Solis*1*4, Keisuke Chida*1, Kei Suefuji*1, Chiaki Arino*1, Atsuo Takanishi*1
*1 Waseda University, Department of Mechanical Engineering, Tokyo, Japan
*4 JSPS Research Fellow, Japan Society for the Promotion of Science, Japan

ABSTRACT
In this paper, we propose the development of an autonomous system, based on our anthropomorphic flutist robot, that can interact naturally with humans. To do so, the robot requires not only the ability to reproduce a score, but also the ability to extract, through its sensors, the symbolic descriptions defined by humans, which are then used to decide the next action of the robot. This year, we present the improvement of the mechanical system and the addition of some cognitive functions (i.e. music recognition) implemented on the new version of the flutist robot, the WF-4R (Waseda Flutist No.4 Refined). Both the mechanical and the cognitive features added to the robot will help to implement the proposed automatic transfer skill system. An experimental setup is described in order to test the implemented recognition system.

1. INTRODUCTION
Research on humanoid robots represents both one of the greatest potentials and one of the greatest challenges in the fields of autonomous agents and intelligent robotic control. This challenge stands as the natural evolution of advanced robotics, but it also represents the ancient dream of humans to replicate themselves. Thus, humanoid robots respond to the need for useful machines that can co-exist with humans and, furthermore, represent an attempt to imitate nature. In a world where man is the standard for almost all interactions, humanoid robots have a very large potential for acting in environments created for human beings, thanks to their human-like mobility and dexterity.

At Waseda University, the idea of replicating the human organs involved in flute playing has been pursued in order to enable communication with humans at an emotional level. As an example of such communication, last year we introduced the idea of using the anthropomorphic flutist robot WF-4 as an assistant tool for transferring flutist skills to beginners. As shown in Fig. 1, an experiment was carried out in which the robot was used to help improve learners' performances. In that experiment, the robot was able to improve the learners' performances by providing verbal and graphical feedback to the students (Figure 2).

Even though the results of the experiment showed that the flutist robot was useful for improving beginners' performances, the robot still requires further improvements to render a more natural interaction during the learning process. Thus, we should take into account both the shape and mobility of the human body and the versatility of the human senses to enable the robot to act as a human does.

Therefore, in this paper, the improvement of the mechanical design and of the system for interaction with humans is presented. Firstly, the mechanical design of the arm mechanism, which improves the accuracy of the flute position, is described. Then, the new transfer skill system architecture, which is based on the addition of some basic cognitive functions, is introduced. Finally, the implementation and evaluation of a melody recognition system based on Hidden Markov Models (HMM) is described in detail.

Figure 1. The WF-4 and a professional flutist player transferring basic skills to a beginner flutist player.

2. THE ANTHROPOMORPHIC FLUTIST ROBOT
This year, we have developed the new version of the anthropomorphic flutist robot, the WF-4R (Waseda Flutist No.4 Refined), which has a total of 38 DOFs in order to reproduce the organs involved in human flute playing (see Figure 2): lips (5 DOFs), head (4 DOFs), fingers (12 DOFs), tongue (1 DOF), lung (1 DOF), vibrato (1 DOF) and finally two arms (each with 7 DOFs). The addition of the arm mechanism enhanced the accuracy of the flute positioning (maximum position error of 0.14 mm), which is an important issue for implementing an automatic transfer skill system that is able to perform musical scores with the same sound quality.

Figure 2. The anthropomorphic flutist robot WF-4R.

3. AUTOMATED TRANSFER SKILL SYSTEM
Last year, we developed an automated transfer skill system using the WF-4 to teach beginner flutists how to improve their sound quality. In that work, the previous version of the flutist robot, the WF-4, was used as a teaching tool to improve the sound quality of beginner flutist players. An experimental setup was designed to compare the added value of using the flutist robot for teaching beginner students against the conventional way of teaching. The results demonstrated that the performance of the pupils was better when the robot was used.

However, the interaction between the flutist robot and the learner was still restricted in some ways. The robot's operator pre-programmed the sequence of the experiment, so that learners could not decide by themselves which melody they wished to play. Therefore, it becomes necessary to add further cognitive functionalities to the robot so that it can interact with humans at the same logical perceptual level.

Our goal is to propose a new generation of assisted music teaching tools that includes the capability of detecting the presence of a musical partner using the vision system, and of identifying and evaluating the musical performance of a student using the auditory system, in order to criticize and make suggestions to the learner for improving his/her performance. In Fig. 3, the proposed architecture of a general automated transfer skill system (GTSS) is presented. Based on the idea that humans acquire 80% of their knowledge through the sense of vision and 11% through the sense of hearing, the proposed automated transfer skill system includes visual and auditory sensory systems. Due to the complexity of the system, as a first approach, the implementation of a music recognition system using the auditory system is presented. The melody recognition system is intended to render a more natural interaction between the robot and the student.

In our previous work, the learning process was carried out as follows:
1) 1st Step: the student plays a score while the robot records and evaluates his/her performance (verbal and visual feedback is provided).
2) 2nd Step: the robot plays the same score while the student listens to the performance.
3) 3rd Step: the student and the robot play the score together.
4) Repeat all the steps.

Figure 3. Proposed automated transfer skill system. (Diagram labels: Human's Skills, Feedback, Demonstrates Skill.)

Even though we have demonstrated the improvement of learners using these steps, the score played by the robot to demonstrate the correct way of playing was pre-programmed. Therefore, we aim for the robot to be able to determine which score was played by the learner, without pre-programming the score to be played by the robot and without requiring further information from the learner. This information is then transmitted, by means of TCP/IP, to the PC controller of the system in order to replicate the detected score. In order to verify the effectiveness of the recognition system, an experimental setup was designed with beginner flutists.

4. THE RECOGNITION SYSTEM
The recognition system should solve two main tasks: modeling the human skill and identifying ongoing actions. Modeling human skills is a difficult task due to the stochastic nature of human performance, as no one can repeat a task in exactly the same way. Within this stochastic process, human actions are considered as the measurable stochastic process, and the knowledge or strategy behind them as the underlying stochastic process. In the particular case of music, Music Information Retrieval (MIR) has been presented as a tool to perform so-called Query-By-Humming, for which at least three different algorithms have been investigated: note-interval matching using a Dynamic Programming search, melodic-contour matching using a Dynamic Time Warping search, and Hidden Markov Model matching.
Even though the Dynamic Programming and Dynamic Time Warping algorithms have been shown to perform better for musical recognition than the Hidden Markov Models, the first two algorithms present the following constraints: they do not generalize; both are player dependent; they need example(s) of each melody from each subject; and they become computationally expensive for large amounts of music data.
Because we require a general recognition system that is player independent, that can also be used to compare and evaluate learners' performances (recognition output), and that is easy to train, we have implemented Hidden Markov Models (HMM). Specifically, we have used the discrete version of the HMM, as it does not require a large amount of training data. An ergodic HMM with eight states has been used. A compact notation for the complete parameter set of a discrete HMM is given in (1), where $A$ is the state transition probability matrix, $B$ the observation symbol probability distribution and $\pi$ the initial state distribution. Learning (training) in an HMM is achieved by adjusting the model parameters to maximize the output probability (2) that the model matches a given observation sequence $O$.

$\lambda = (A, B, \pi)$ (1)

$P(O \mid \lambda)$ (2)

In order to train and identify human actions, we need to convert, in our case, the raw audio data into a sequence of symbols, as shown in Fig. 4. The quantization was performed using Vector Quantization. The training and identification procedures of the HMM were implemented using the classical analysis procedures: probability evaluation (forward and backward methods), parameter estimation (Baum-Welch method), and optimal state sequence estimation (Viterbi algorithm). Furthermore, the numerical underflow that occurs when long observation sequences are given to the HMM was solved using a scaling factor.

Figure 4. Steps required before training and identifying the incoming data using the HMM recognition system. (Diagram labels: Raw audio, Pre-processing, Feature Extraction, Quantization, Symbol Sequence, HMM Recognition.)

4.1. Feature Extraction
The feature extraction process on an utterance may be viewed as a mapping of the utterance into a multidimensional parameter space. The utterances of the audio data generate a set of points in that space. In this case, we have used Mel-Frequency Cepstrum Coefficients (MFCC) for extracting sound features.
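As a rough illustration of this mapping, the sketch below splits a signal into overlapping Hamming-windowed frames and computes a log-energy value per frame, so that each frame becomes one point in the feature space. The function names (`hamming`, `frame_signal`, `log_energy`) are hypothetical, the 2048-point frame, 50% overlap and 44.1 kHz rate are assumed from the settings reported in this section, and the MFCC computation itself is omitted. It is a minimal sketch under those assumptions, not the implementation used on the WF-4R.

```python
import math

def hamming(n):
    """Return an n-point Hamming window (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(samples, frame_len=2048, overlap=0.5):
    """Split samples into Hamming-windowed frames with a fractional overlap."""
    hop = max(1, int(frame_len * (1.0 - overlap)))  # 1024 samples for 50% overlap
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

def log_energy(frame, floor=1e-12):
    """Log of the frame energy; the floor avoids log(0) on silent frames."""
    return math.log(max(sum(s * s for s in frame), floor))

# Usage: one second of a 440 Hz test tone sampled at 44.1 kHz (assumed values).
fs = 44100
tone = [math.sin(2.0 * math.pi * 440.0 * t / fs) for t in range(fs)]
frames = frame_signal(tone)
energies = [log_energy(f) for f in frames]
print(len(frames), "frames of", len(frames[0]), "samples")
```

In a full feature extractor, the cepstral coefficients of each windowed frame would be computed and the log-energy value appended to them to form the feature vector.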
The MFCCs have mainly been applied to speech recognition due to their ability to represent the speech amplitude spectrum in a compact form. Each step in the process of creating MFCC features is motivated by perceptual or computational considerations [11]. In our case, to generate the mel-cepstrum vectors, the audio data were segmented using Hamming windows of 2048 points, with 50% overlap, at a sampling frequency of 44.1 kHz. Then, each frame was transformed into a 31-coefficient MFCC vector. Additionally, we have appended to the feature vector an energy coefficient, which is computed from the log of the signal energy.

4.2. Quantization
The specific preprocessing algorithm we chose is vector quantization (VQ) based on the algorithm of Linde, Buzo and Gray (LBG), which is an extension of the K-means algorithm. VQ techniques have been used widely and successfully to solve quantization and data compression problems. In an HMM-based approach, we need to quantize each multi-dimensional feature vector (in our case, a 33-dimensional feature vector) into a symbol from a finite set before giving the data to the HMM. The purpose of designing an M-level vector quantizer (called a codebook of size M) is to partition all k-dimensional training feature vectors into M clusters (in our case M = 32) and to associate each cluster $c_i$, whose centroid is the k-dimensional vector $c^i$, with a quantized value named codeword (symbol) $o_i$, where the overall distortion is calculated as in (3). In order to produce the codebook, we recorded some samples of music. Once the final codebook is obtained, it is used to quantize each training and testing feature vector into a symbol value (codeword) to be used by the HMM recognition.

$D = \sum_{i=0}^{M} \sum_{n=1}^{K} \left\| x_n - c^i \right\|$ (3)

5. EXPERIMENTAL SETUP & RESULTS
In order to test the recognition system, we asked five beginner flutists, a professional flutist player and the WF-4R (Figure 10) to perform each of the melodies shown in Fig. 5 three times. Each performance was recorded at a sampling rate of 44.1 kHz. As a result, we obtained 105 audio file samples, some of which were used for training each melody model (14 samples per model) and the others for testing the recognition performance (7 samples of each melody). Fig. 6 shows the recognition ratio obtained after evaluating each sample (training and testing data) with the HMMs. For the training data, an overall recognition rate of around 94% was obtained. For the testing data, the overall recognition rate was around 86%. This recognition average was obtained even though, based on the information recorded from the flutist professor, most of the students have several common problems: slow tonguing, breath quantity and control, speed and tempo.
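The evaluation of a recorded sample against the melody models can be sketched as follows. The sketch quantizes feature vectors with a given codebook, scores the resulting symbol sequence under each melody's discrete HMM with a scaled forward pass, and returns the best-scoring melody. Choosing the model with the highest likelihood is an assumption about how the recognition decision is made, the function names (`quantize`, `log_likelihood`, `recognize`) are hypothetical, and the toy models use 2 states and 2 codewords rather than the 8 states and 32 codewords reported above. It is not the code run on the WF-4R.

```python
import math

def quantize(vectors, codebook):
    """Map each feature vector to the index of its nearest codeword."""
    symbols = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        symbols.append(dists.index(min(dists)))
    return symbols

def log_likelihood(obs, A, B, pi):
    """Scaled forward pass: log P(O | lambda) for a discrete HMM, obs non-empty."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Induction step: propagate through A, then weight by the emission.
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        log_p += math.log(scale)            # log P(O) is the sum of log scales
    return log_p

def recognize(obs, models):
    """Return the name of the model with the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))

# Toy data (all values made up): 2 codewords, two 2-state ergodic models.
codebook = [[0.0, 0.0], [1.0, 1.0]]
A = [[0.5, 0.5], [0.5, 0.5]]
pi = [0.5, 0.5]
models = {
    "melody_a": (A, [[0.9, 0.1], [0.8, 0.2]], pi),  # mostly emits symbol 0
    "melody_b": (A, [[0.1, 0.9], [0.2, 0.8]], pi),  # mostly emits symbol 1
}
features = [[0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [0.0, 0.1]]
symbols = quantize(features, codebook)
print(symbols, recognize(symbols, models))
```

Accumulating the logarithms of the per-step scale factors keeps the scores finite and comparable across models for long observation sequences, which is the role of the scaling factor mentioned in Section 4.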
Figure 5. Melodies used for the experiment.

From these results, we may conclude two things. Firstly, the implemented HMM can be useful not only for speech recognition but also for music recognition, both of which are fundamental features for improving the interaction between the human and the robot. Secondly, this recognition system may also be useful for evaluating the robot's performance compared with that of the human professor.

Figure 6. Recognition average from recorded data (training data and testing data, melodies 1 to 5).

6. CONCLUSIONS
The new anthropomorphic flutist robot WF-4R has been introduced. Furthermore, an automated transfer skill system using the flutist robot has been presented and described, together with the implementation of a melody recognition system. The experiments carried out demonstrated the effectiveness of both systems.

7. ACKNOWLEDGMENTS
A part of this research was done at the Humanoid Robotics Institute (HRI), Waseda University. This research was supported (in part) by a Grant-in-Aid for the WABOT-HOUSE Project by Gifu Prefecture. A part of this research has also been supported by the Japan Society for the Promotion of Science (JSPS). The authors would like to thank Ms. Akiko Sato for her valuable help in testing the music recognition system.

8. REFERENCES
[1] Dannenberg, R.B., Birmingham, W.P., Tzanetakis, G., Meek, C., Hu, N., Pardo, B. "The MUSART testbed for Query-By-Humming evaluation," in Proc. of the International Symposium on Music Information Retrieval, 2001.
[2] Linde, Y., Buzo, A., Gray, R. "An Algorithm for Vector Quantizer Design," IEEE Transactions on Communications, Vol. COM-28, No. 1, pp. 84-95, January 1980.
[3] Logan, B. "Mel Frequency Cepstral Coefficients for Music Modeling," in Proc. of the International Symposium on Music Information Retrieval, 2000.
[4] Mazzoni, D., Dannenberg, R.B. "Melody matching directly from audio," 2nd Annual International Symposium on Music Information Retrieval, Bloomington: Indiana University, pp. 17-18, 2001.
[5] Meek, C., Birmingham, W.P. "Johnny can't sing: a comprehensive error model for sung music queries," 3rd International Conference on Music Information Retrieval (ISMIR), pp. 124-132, 2002.
[6] Nechyba, M.C. EEL6935: Machine Learning in Robotics II (Lecture notes). University of Florida, Department of Electrical and Computer Engineering, Florida, 2000.
[7] Pardo, B., Birmingham, W.P. "Encoding timing information for musical query matching," 3rd International Conference on Music Information Retrieval (ISMIR), pp. 267-268, 2002.
[8] Penn, G. What we can't do yet: problems with pattern matching. CSC 2518 Lecture, University of Toronto, 2005.
[9] Rakesh, D. A tutorial on Hidden Markov Models. Technical Report No. SPANN-96.1, 1996.
[10] Solis, J., Bergamasco, M., Isoda, S., Chida, K., Takanishi, A. "Learning to play the flute with an anthropomorphic flutist robot," Proceedings of the International Computer Music Conference, pp. 635-640, 2004.
[11] Zwicker, E., Fastl, H. Psychoacoustics: Facts and Models, Springer-Verlag, Berlin, 1990.