Adaptive High-level Classification of Vocal Gestures Within a Networked Sound Instrument

Jason Freeman*, C. Ramakrishnan†, Kristjan Varnik†, Max Neuhaus, Phil Burk, and David Birchfield‡
*Department of Music, Columbia University
†Akademie Schloss Solitude
‡Arts, Media, and Engineering Program, Arizona State University
auracle@auracle.org; http://www.auracle.org

Abstract

We have implemented a high-level vocal gesture classifier which uses Adaptive Principal Component EXtraction (APEX), a neural network implementation of Principal Components Analysis (PCA), to reduce a multi-dimensional space of statistical features to a three-dimensional high-level feature space. The classifier is used within Auracle, a real-time, collaborative, Internet-based instrument: high-level features, along with envelope data extracted from low-level voice analysis, are used to drive sound synthesis. The APEX neural network is initially trained with a sample database of vocal gestures, and it continues to adapt (both within single user sessions and over longer time periods) in response to user interaction. Both training and evolution are accomplished without supervision, so that the nature of the classifications is determined more by how users interact with the system than by the preconceptions of its designers.

1 Introduction

Auracle, a project conceived by Max Neuhaus, is a networked sound instrument designed for a lay public without formal musical training. Users interact with each other in real time over the Internet, playing synthesized instruments together in a group "jam." The instruments are entirely controlled by users' voices, taking advantage of the sophisticated vocal control which people naturally develop in learning to speak.

To facilitate real-time Internet-based collaboration without sacrificing audio quality or overloading a central server, the system does not transmit audio signals; instead it transmits control information gathered from the multi-level analysis of vocal input. This analysis data is mapped onto control parameters for synthesizers, which run in sync on each client machine. The client software is implemented as a Java applet incorporating the JSyn plugin (Burk 1998), and real-time collaboration is handled by a server running TransJam (Burk 2000).

Our low-level analysis processes 40 ms frames of incoming audio to extract voicedness/unvoicedness, the fundamental frequency (for voiced frames), the first two formant frequencies and bandwidths, and the root mean square (RMS) amplitude. The mid-level analysis groups low-level analysis frames into gestures by creating boundaries at silences longer than a threshold. (This simple technique produces suitable results for us.) It then simplifies the gesture into breakpoint envelopes and computes numerous statistical features to describe the gesture. The high-level analysis reduces these mid-level features to three high-level features which describe the gesture. The low-level envelopes and high-level features are in turn mapped onto synthesis control parameters. The entire system is transparent enough that users are able to relate their vocal gestures to the output sound.

Figure 1: The overall architecture of Auracle.

Low-level envelopes and high-level features for each gesture are transmitted to the server once the gesture and its analysis are finished, and this data is not mapped onto synthesis parameters until it is received back from the server.
This approach minimizes differences amongst clients in sound output, reduces network traffic, and creates a delay between voice input and synthesized response which facilitates a conversational style of interaction.
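To make the segmentation step concrete, the following is a minimal Java sketch of boundary detection at silences longer than a threshold; the frame representation, threshold values, and class names are illustrative assumptions rather than the actual Auracle code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical low-level analysis frame (40 ms of audio), reduced to the
// fields needed for segmentation.
class AnalysisFrame {
    double rms;        // root mean square amplitude
    boolean voiced;    // voicedness flag
    double f0, f1, f2; // fundamental and first two formant frequencies

    AnalysisFrame(double rms, boolean voiced, double f0, double f1, double f2) {
        this.rms = rms; this.voiced = voiced;
        this.f0 = f0; this.f1 = f1; this.f2 = f2;
    }
}

class GestureSegmenter {
    static final double FRAME_MS = 40.0;        // low-level frame length
    static final double SILENCE_RMS = 0.01;     // assumed silence threshold on RMS
    static final double MIN_SILENCE_MS = 400.0; // assumed minimum silence between gestures

    /** Splits a frame sequence into gestures at silences longer than MIN_SILENCE_MS. */
    static List<List<AnalysisFrame>> segment(List<AnalysisFrame> frames) {
        List<List<AnalysisFrame>> gestures = new ArrayList<>();
        List<AnalysisFrame> current = new ArrayList<>();
        int silentRun = 0; // consecutive silent frames seen so far

        for (AnalysisFrame frame : frames) {
            boolean silent = frame.rms < SILENCE_RMS;
            silentRun = silent ? silentRun + 1 : 0;
            current.add(frame);

            // Close the gesture once the silence has lasted long enough.
            if (silentRun * FRAME_MS >= MIN_SILENCE_MS) {
                List<AnalysisFrame> gesture = current.subList(0, current.size() - silentRun);
                if (!gesture.isEmpty()) gestures.add(new ArrayList<>(gesture));
                current = new ArrayList<>();
                silentRun = 0;
            }
        }
        if (!current.isEmpty()) gestures.add(current);
        return gestures;
    }
}
```

Each resulting gesture would then be simplified into breakpoint envelopes and summarized by the statistical features described in the next section.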

The high-level features seek to classify vocal gestures in a low-dimensional, continuous space which describes a wide range of user activities. Since one of the goals of Auracle was to design an instrument which evolves over time based on how it is used, the high-level classification responds to how users interact with the system: it becomes more sensitive to aspects of gestures which users vary more.

The overall architecture of Auracle is discussed in more detail in Ramakrishnan et al. (2004). This paper focuses on the implementation of the high-level classifier. A large number of statistical features are calculated to describe user gestures. These mid-level features are fed into an Adaptive Principal Component EXtraction (APEX) neural network (Diamantaras and Kung 1996; Kung, Diamantaras, and Taur 1994), which implements Principal Components Analysis (PCA) to perform a real-time, adaptive mapping of the large mid-level feature space onto a small high-level feature space.

2 Mid-Level Feature Vectors

The input to the high-level classifier is a series of mid-level feature vectors, each of which describes a single vocal gesture with a set of numerical features. Each mid-level feature is a different statistical description of the low-level analysis data.

The choice of features is based largely on studies of vocal signal analysis for emotion classification by Banse and Scherer (1996), Yacoub et al. (2003), and Cowie et al. (2001). While our classifier is not focused solely on emotion, we found this research a useful starting point. Studies of timbre, most of which extend Grey's (1977) multidimensional scaling studies, were also informative, but their focus on steady instrumental tones was less directly applicable to the variety of vocal gestures expected from Auracle users. While many emotion classification studies try to separate linguistically determined features from emotionally determined features (Cowie et al. 2001), this is not necessary in Auracle. Our system responds to features of user input whether they are linguistically determined, emotionally determined, or consciously manipulated by users to control the instrument in specific ways.

Each feature vector contains 43 features: the mean, minimum, maximum, and standard deviation of f0, f1, f2, and RMS amplitude, as well as of their derivatives; the mean, minimum, maximum, and standard deviation of the durations of individual silent and nonsilent segments within the gesture; and the ratios of silent to nonsilent frames, of voiced to unvoiced frames, and of mean silent to mean nonsilent segment duration.
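To make the structure of this vector concrete, the following Java sketch assembles the statistics and ratios described above; the helper names, method signatures, and the assumption that segment durations and frame counts arrive precomputed are illustrative rather than the actual Auracle implementation.

```java
import java.util.Arrays;

// Illustrative mid-level feature extraction: four statistics over each signal and
// its derivative, segment-duration statistics, and three ratios (43 values total).
class MidLevelFeatures {

    /** Returns {mean, min, max, standard deviation} of x. */
    static double[] stats(double[] x) {
        double mean = Arrays.stream(x).average().orElse(0.0);
        double min = Arrays.stream(x).min().orElse(0.0);
        double max = Arrays.stream(x).max().orElse(0.0);
        double var = Arrays.stream(x).map(v -> (v - mean) * (v - mean)).average().orElse(0.0);
        return new double[] { mean, min, max, Math.sqrt(var) };
    }

    /** Frame-to-frame differences, used here as a simple derivative estimate. */
    static double[] diff(double[] x) {
        double[] d = new double[Math.max(x.length - 1, 1)];
        for (int i = 1; i < x.length; i++) d[i - 1] = x[i] - x[i - 1];
        return d;
    }

    /**
     * Assembles the 43-element feature vector. The per-frame signals, segment
     * durations, and frame counts are assumed to come from the earlier analysis stages.
     */
    static double[] featureVector(double[] f0, double[] f1, double[] f2, double[] rms,
                                  double[] silentDur, double[] nonsilentDur,
                                  int silentFrames, int nonsilentFrames,
                                  int voicedFrames, int unvoicedFrames) {
        double[] v = new double[43];
        int i = 0;
        for (double[] signal : new double[][] { f0, f1, f2, rms }) {
            for (double s : stats(signal)) v[i++] = s;       // 16 values
            for (double s : stats(diff(signal))) v[i++] = s; // 16 values
        }
        for (double s : stats(silentDur)) v[i++] = s;        // 4 values
        for (double s : stats(nonsilentDur)) v[i++] = s;     // 4 values
        v[i++] = (double) silentFrames / Math.max(nonsilentFrames, 1);
        v[i++] = (double) voicedFrames / Math.max(unvoicedFrames, 1);
        v[i++] = stats(silentDur)[0] / Math.max(stats(nonsilentDur)[0], 1e-9);
        return v;
    }
}
```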
3 Principal Components Analysis

It is theoretically possible to directly transmit the 43-element mid-level feature vector for each gesture across the network and to map these mid-level features onto synthesis control parameters, but we found it impractical to directly address this amount of data. We did not wish to choose a subset of features on which to focus, nor did we wish to manually create functions to compute each high-level feature from the entire mid-level feature vector: both of these approaches would have forced our own biases onto the design of the environment, and in doing so would have contradicted the goals of the project and driven users to interact according to our own preconceptions. Instead, we created a database of sample vocal gestures, computed the mid-level feature vectors for each of these gestures, and performed PCA with varimax rotation (Kaiser 1958). We were encouraged to pursue this technique by studies such as Lee, Narayanan, and Pieraccini's (2001), in which PCA of statistical features played a critical role within an emotion classification system.

  F2 Derivative Max          -0.797
  F2 Std. Dev.               -0.776
  F1 Min                     -0.759
  F2 Derivative Min           0.737
  F2 Derivative Std. Dev.     0.730
  F1 Derivative Min          -0.647
  F0 Min                      0.629
  F0 Mean                     0.574
  F0 Max                      0.776
  F0 Derivative Max           0.731
  RMS Max                     0.728
  F0 Derivative Std. Dev.     0.718
  F0 Std. Dev.                0.715
  RMS Derivative Min         -0.702
  RMS Derivative Max          0.673
  RMS Std. Dev.               0.662
  F0 Derivative Min          -0.652
  RMS Derivative Std. Dev.    0.589
  RMS Mean                    0.510, 0.531
  F1 Max                      0.528

Figure 2: Rotated component loadings from the offline PCA of the sample database. Rows are mid-level features; columns are high-level features. Only loadings with absolute value above 0.5 are included in the table.

The database included 230 gestures from 10 participants. Half were male and half female. They came from 7 countries and spoke 6 different native languages. Subjects were given no direction as to what kinds of vocal gestures to create, so the database contains speaking, singing, and many different types of vocal noises.

While the inclusion of multiple gestures from each participant in the database could bias the analysis, we were impressed by the variety of vocal sounds each participant produced, and we favored using a larger database of gestures over a smaller database with only a single gesture from each subject.

The PCA of this database projects the 43-dimensional feature space onto a 3-dimensional space which preserves 47.25% of the total variance of the original feature space. The rotated component loading matrix (Figure 2) suggests general interpretations of each component: the first component roughly describes aspects of the silent and nonsilent segments of the gesture; the second component roughly describes aspects of f0, f1, and f2; and the third component roughly describes aspects of amplitude and f0.

4 Adaptive PCA Implementation

The offline PCA of sample data described in the previous section encouraged us to use PCA for Auracle's high-level classifier. But a static implementation based on this analysis would not allow the system to adapt. The classifier must perform both short-term adaptation, changing over the course of a single user session to focus on the mid-level features varied most by that user, and long-term adaptation, in which the classifier's initial state for each session slowly changes to concentrate on the mid-level features most varied by the entire Auracle user base.

An adaptive classifier does sacrifice a degree of transparency in its classifications: it is more difficult for users to relate their vocal gestures to sound output when the high-level feature classifications, and thus the mappings, are constantly changing. And it is impossible to interpret the meaning of high-level features during the design of mapping procedures, since their semantics change with adaptation. For us, though, transparency in this component of Auracle is less important than adaptability: we map amplitude and frequency envelopes more directly onto related synthesis control parameters and use the high-level features to alter other qualities of the synthesized sound. For example, in a physical model of a string which is plucked repeatedly and very quickly, we map the amplitude and f0 envelopes from the vocal input directly onto amplitude and f0 envelopes for the string. The f1 and f2 envelopes from the vocal input control bandpass filtering of the physical model's output. High-level features, in contrast, control the decay time of the final pluck, the detuning of two strings used in tandem, and the rate at which the string is plucked.

4.1 Classical PCA

The principal components of a set of feature vectors are the eigenvectors of the covariance matrix of the set. The eigenvectors with the highest eigenvalues explain the greatest amount of variance in the original feature space. To distribute the explained variance more evenly amongst a subset of principal components, a rotation procedure such as Kaiser's varimax method (1958) must then be applied.

This implementation of PCA is ill-suited to Auracle. It has polynomial-time complexity (proportional to the number of mid-level feature vectors and features); in order to implement it in our Java-based real-time environment, the number of mid-level features would have to be reduced. Furthermore, this implementation is designed for a fixed data set, not a dynamic, constantly expanding set as in Auracle. Small changes in the data set can lead to large changes in the results when a different solution suddenly becomes optimal; this undermines the system's transparency.

4.2 Neural Network PCA

We instead implement PCA with the Adaptive Principal Component EXtraction (APEX) model (Diamantaras and Kung 1996; Kung, Diamantaras, and Taur 1994), which improves upon earlier neural networks proposed by Oja (1982), Sanger (1989), Rubner and Tavan (1989), and others. APEX efficiently implements an adaptive version of PCA as a feed-forward Hebbian network (with modifications to maintain stability) and a lateral, asymmetrical anti-Hebbian network. The Hebbian portion of the network discovers the principal components, while the anti-Hebbian portion rotates those components. The learning rate of the algorithm is automatically varied in proportion to the magnitude of the outputs and a "forgetting" factor which controls the algorithm's memory of past inputs (Kung, Diamantaras, and Taur 1994). Unlike many other neural networks, APEX is easy to monitor as it adapts: each feed-forward weight represents the importance of a particular mid-level feature in the computation of a particular high-level feature.
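As a rough illustration of this scheme, the sketch below implements a simplified APEX-style network in Java, with a fixed learning rate in place of the automatically varied rate described above; the formulation of the Hebbian and anti-Hebbian updates follows a common presentation of APEX and is an approximation for illustration, not the exact Auracle code.

```java
// Simplified APEX-style network: n inputs (mid-level features), m outputs
// (high-level features). Feed-forward weights W are trained with a normalized
// Hebbian rule; lateral weights C (strictly lower-triangular) are trained with
// an anti-Hebbian rule that decorrelates the outputs.
class ApexNetwork {
    final int n, m;
    final double[][] w; // feed-forward weights, m x n
    final double[][] c; // lateral weights, m x m (only j < i used)
    double beta;        // learning rate (automatically adapted in Auracle)

    ApexNetwork(int inputs, int outputs, double learningRate) {
        n = inputs; m = outputs; beta = learningRate;
        w = new double[m][n];
        c = new double[m][m];
        java.util.Random rand = new java.util.Random(0);
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) w[i][j] = 0.01 * rand.nextGaussian();
    }

    /** Computes the m outputs for one (normalized) mid-level feature vector. */
    double[] forward(double[] x) {
        double[] y = new double[m];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) y[i] += w[i][j] * x[j];
            for (int j = 0; j < i; j++) y[i] += c[i][j] * y[j]; // lateral connections
        }
        return y;
    }

    /** One unsupervised training step on a single input vector. */
    double[] train(double[] x) {
        double[] y = forward(x);
        for (int i = 0; i < m; i++) {
            double yi2 = y[i] * y[i];
            // Hebbian update with a normalizing decay term.
            for (int j = 0; j < n; j++)
                w[i][j] += beta * (y[i] * x[j] - yi2 * w[i][j]);
            // Anti-Hebbian update of the lateral weights; they decay toward zero
            // as the outputs become decorrelated.
            for (int j = 0; j < i; j++)
                c[i][j] -= beta * (y[i] * y[j] + yi2 * c[i][j]);
        }
        return y;
    }
}
```

In Auracle, such a network would have 43 inputs (the mid-level features) and 3 outputs (the high-level features).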
Integration into Auracle. To integrate APEX into Auracle, we first trained the neural network with the feature vectors from the sample database, iterating through several hundred epochs until the average delta of the lateral weights fell below a threshold. (Since training is unsupervised, this delta is a good way to test for convergence.) This initializes Auracle's server-side state. When a user launches Auracle, his/her neural network is initialized with weights downloaded from the server. The forgetting rate of the user's network is set so that it quickly adapts to that user's gestures. When a user logs out of Auracle, his/her network weights are transmitted back to the server, which merges them with its current weight matrix:

w_ij^(k+1) = (1 - β) w_ij^(k) + β w_ij^(u)

where i and j are the indices of a particular weight within the weight matrix, the superscript k denotes the current value stored on the server, u denotes the current value stored on the user's machine, and β is the server's learning rate.
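Applied element-wise to the feed-forward weight matrix, this merge rule is straightforward; the sketch below (with hypothetical class and method names) illustrates it.

```java
// Illustrative server-side merge of a departing client's feed-forward weights
// into the server's weight matrix: w_server = (1 - beta) * w_server + beta * w_client.
// A small beta produces the slow, gradual long-term evolution described below.
class WeightMerger {
    private final double beta; // server-side learning rate

    WeightMerger(double beta) {
        this.beta = beta;
    }

    /** Blends the client's weights into the server's matrix in place. */
    void merge(double[][] serverWeights, double[][] clientWeights) {
        for (int i = 0; i < serverWeights.length; i++) {
            for (int j = 0; j < serverWeights[i].length; j++) {
                serverWeights[i][j] = (1.0 - beta) * serverWeights[i][j]
                                      + beta * clientWeights[i][j];
            }
        }
    }
}
```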

While the forgetting factor on individual clients is set to facilitate quick adaptation, the learning rate on the server is set to facilitate slow, gradual evolution. Because the APEX neural network is linear, we are able to use this simple linear learning rule on the server to directly modify the neural network's weights and to produce a slower rate of adaptation on the server than on each client.

Data Scaling. We normalize features before they are input into the network, using a running z-score scaling. We then recenter, rescale, and constrain the normalized values to fall within the 0-to-1 range, with that range representing two units of standard deviation; this is sufficient to describe nearly all input. Since our mapping components expect high-level feature values between 0 and 1, we must also scale the neural network's output values. We first rescale output values based on the theoretical minimum and maximum values for each component, but in practice this limits most outputs to a small portion of the 0-to-1 range. So we further rescale the output values by expanding and contracting segments of the 0-to-1 range, based on how many values have recently fallen within each segment. The final output describes the most active segments in highest resolution.
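A minimal sketch of the input-side scaling, assuming an exponentially weighted running estimate of mean and variance and a clamped mapping of the z-score into the 0-to-1 range, might look like the following; the adaptive rescaling of the output values is omitted.

```java
// Illustrative running z-score normalization of one mid-level feature, mapped
// into 0-to-1 so that the range represents two units of standard deviation
// around the running mean. The decay constant and clamping are assumptions.
class RunningScaler {
    private double mean = 0.0;
    private double var = 1.0;
    private final double decay; // weight given to each new observation
    private boolean initialized = false;

    RunningScaler(double decay) {
        this.decay = decay;
    }

    /** Updates the running statistics and returns the value scaled into [0, 1]. */
    double scale(double x) {
        if (!initialized) {
            mean = x;
            initialized = true;
        } else {
            // Exponentially weighted running estimates of mean and variance.
            double delta = x - mean;
            mean += decay * delta;
            var = (1.0 - decay) * (var + decay * delta * delta);
        }
        double std = Math.sqrt(Math.max(var, 1e-12));
        double z = (x - mean) / std;                 // running z-score
        double scaled = 0.5 + z / 2.0;               // two std dev span mapped onto 0..1
        return Math.min(1.0, Math.max(0.0, scaled)); // clamp outliers
    }
}
```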
5 Conclusion

The use of the APEX neural network achieves our goals for the classification of vocal gestures into high-level features. It reduces a multi-dimensional space into a three-dimensional space, preserving much of the variance of the original. It provides an unsupervised mechanism for initialization, short-term adaptation, and long-term adaptation, making classifications based more on user interaction than on our own preconceptions.

This paper has deliberately avoided stating exact values for parameters such as the server-side learning rate and the client-side forgetting factor. We continue to adjust these values in order to strike the best balance between transparency and adaptability.

There is a conceptual flaw in the server-side learning method. It assumes that the weight matrix on the server is identical at logon and logoff, but in a real-time multi-user environment this is often not true. While this problem is mitigated by the server's slow adaptation rate compared to the client's fast adaptation rate, we hope to find a more elegant solution to this issue. We also plan to experiment with additional mid-level statistical features and with non-linear classification techniques to see what effect they have on the system's transparency and adaptability. By doing so, we hope to improve the effectiveness of our classifier within Auracle and ultimately to create a more engaging interactive environment.

6 Acknowledgments

Thanks to Aria Adli for his advice with the initial offline PCA and for his feedback on an initial draft of this paper. The Auracle project is a production of Akademie Schloss Solitude with financial support from the Landesstiftung Baden-Württemberg. We express our gratitude for their generous support.

References

Banse, R., and K. R. Scherer. (1996). "Acoustic Profiles in Vocal Emotion Expression." Journal of Personality and Social Psychology, 70 (3), 614-636.
Burk, P. (1998). "JSyn - A Real-time Synthesis API for Java." Proceedings of the International Computer Music Conference, pp. 252-255. Ann Arbor, MI: International Computer Music Association.
Burk, P. (2000). "Jammin' on the Web - A New Client/Server Architecture for Multi-User Musical Performance." Proceedings of the International Computer Music Conference, pp. 117-120. Berlin, Germany: International Computer Music Association.
Cowie, R., E. Douglas-Cowie, N. Tsapatsoulis, et al. (2001). "Emotion Recognition in Human-Computer Interaction." IEEE Signal Processing Magazine, January 2001, 32-80.
Diamantaras, K. I., and S. Y. Kung. (1996). Principal Component Neural Networks. New York: John Wiley and Sons, Inc.
Grey, J. M. (1977). "Multidimensional perceptual scaling of musical timbres." Journal of the Acoustical Society of America, 61 (5), 1270-1277.
Kaiser, H. F. (1958). "The varimax criterion for analytic rotation in factor analysis." Psychometrika, 23, 187-200.
Kung, S. Y., K. I. Diamantaras, and J. S. Taur. (1994). "Adaptive Principal Component EXtraction (APEX) and Applications." IEEE Transactions on Signal Processing, 42 (5), 1202-1217.
Lee, C. M., S. Narayanan, and R. Pieraccini. (2001). "Recognition of negative emotions from the speech signal." Proceedings of the IEEE Conference on Speech Recognition and Understanding, pp. 240-243. Los Angeles, CA: IEEE.
Oja, E. (1982). "A Simplified Neuron Model as a Principal Component Analyzer." Journal of Mathematical Biology, 15, 267-273.
Ramakrishnan, C., J. Freeman, K. Varnik, et al. (2004). "The Architecture of Auracle: A Real-Time, Distributed, Collaborative Instrument." Proceedings of the 2004 Conference on New Interfaces for Musical Expression, pp. 100-103.
Rubner, J., and P. Tavan. (1989). "A Self-Organizing Network for Principal-Components Analysis." Europhysics Letters, 10 (7), 693-698.
Sanger, T. D. (1989). "An Optimality Principle for Unsupervised Learning." In D. S. Touretzky, ed., Advances in Neural Information Processing Systems. San Mateo, CA: Morgan Kaufmann.
Yacoub, S., S. Simske, X. Lin, and J. Burns. (2003). "Recognition of Emotions in Interactive Voice Response Systems." HPL-2003-136, available at http://www.hpl.hp.com/techreports/2003/HPL-2003-136.html.