Distal Learning of Musical Instrument Control Parameters

Michael A. Casey
Perceptual Computing Group
MIT Media Laboratory
Cambridge, MA 02139

Abstract

This paper describes a parameter estimation method for sound-generating environments based on distal supervised learning techniques using multi-layer neural networks. The paradigm we have chosen for investigation is that of learning to control a musical instrument in order to produce an intended sound. We present the general framework of distal learning from the contemporary control theory literature and show that musical instrument control is a distal learning problem. Examples of the application of distal learning to the control of various sound synthesis environments are discussed. We also consider representational issues for signal-based learning in neural networks.

1 Introduction

When performers play musical instruments they realize a mapping from an internal representation of sound intentions to a set of motor actions that, when applied to the instrument, create the intended sounds. Clearly this mapping is learned over a significant amount of time during which the performer practices many musical passages and many forms of articulation appropriate to their instrument. After much training the musician is able to realize novel intentions without having to practice every possible situation in advance. Such is the case in improvisation, for example, where the performer draws on previously learned skills to create new musical outcomes.

In this paper we are primarily concerned with the issue of timbral control of a musical instrument and with modeling the learning process that allows a performer to produce a sound outcome from a sound intention. Space dictates that we limit the discussion to the modeling of static control environments. However, the methods presented here can also be applied to dynamic control environments.
We first introduce the distal learning problem and show that it is appropriate for learning to control a musical instrument.

2 Distal Learning

The distal learning problem is illustrated in Figure 1. The learner controls a set of distal variables via a set of proximal variables. The proximal variables are inputs to an environment that produces a distal outcome. Musical performers directly control action parameters such as bow pressure, bowing speed and finger positions. These control parameters pass through the musical instrument, which is a complex dynamical system, and the resulting sound is a transformation of these inputs. Thus the performer has only indirect control over the sound outcome.

The learner holds an internal representation of the sound that they want to produce, i.e. a sound intention, and it is the difference between this and the sound outcome that is used to drive the learning of the control parameters to the instrument. We refer to the sound output as $y$ and the sound intention as $y^*$. Thus the error term for learning can simply be stated as:

$$E = (y^* - y) \qquad (1)$$

The musical instrument is referred to as a plant or physical environment, for which the learner has to find an inverse model that maps intentions to actions.¹ In our case the environment takes a set of musical control parameters as input and produces sound as output. The learner's task is to take a sound intention as input and produce a set of musical control parameters as output. This is the inverse of the physical environment's transfer function, thus the term inverse model. There are two strategies for learning the inverse model: direct inverse modeling and inverse modeling using a forward model.

Figure 1: Distal Learning System

¹ These are control theory terms from standard texts. See [Astrom and Wittenmark, 1984], for example, for more details.

ICMC Proceedings 1993, 1B.4, p. 240

2.1 Direct Inverse Modeling

Direct inverse modeling is an example of a classic supervised learning problem. The learner is presented with a set of output/input relationships from which to learn the mapping from intentions to actions using, for example, back propagation. This type of learning is associative and relies on the interpolation of learned values for novel situations. Although it is intuitively reasonable, direct inverse modeling poses a number of problems as a strategy for distal learning:

Problems with Direct Inverse Modeling
* Non-Convex Mappings
* Not Goal Directed
* Not Distal Learning

One-to-many mappings occur when the learner has to map a single point in the problem space to many possible points in the solution space. This situation occurs often in musical problems, where there are many sets of parametric values that give rise to a perceptually similar outcome, see [Lee and Wessel, 1992]. Thus the mapping is one-to-many for learning the inverse model, where a single intention can be characterized by many possible parametric values.

Figure 2: Non-Convexity Problem. The point on the left should be mapped inside the non-convex region on the right, but the multiple solutions are averaged to the cross in the center of this region.
The non-convexity problem arises when the learning algorithm has the effect of averaging the multiple solutions of the one-to-many mapping together to form a single solution point. If the target solution space is non-convex, this can result in a situation where the globally optimal solution for the learning algorithm lies outside the problem's solution space. Thus direct inverse modeling can produce actions that are not solutions at all. See Figure 2.

Another major problem with direct inverse modeling is that it is not goal directed. There is no strong relationship between the intention and the set of actions that the learner is associating. Therefore there is no guarantee that the learned mappings will lie in a relevant part of the general problem space when it comes to performance.

The final problem with direct inverse modeling is that it does not solve the distal learning problem. A major advantage of distal learning is that it weakens the role of the teacher during training. The paradigm allows the learner to relate errors in performance outcome to errors in action parameters in such a way that only the difference between the intention and the outcome need be used. Direct inverse modeling makes a much stronger assumption about the teacher: explicit intention/action pairs are presented, and errors are calculated with respect to the actions without regard for the performance outcome.

For these reasons we propose that the direct inverse modeling strategy is not appropriate to our problem, even though it seems an intuitively correct solution.
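The non-convexity problem can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not part of the systems described in this paper: the toy `environment` (outcome equals the square of the action), the intention value and the two candidate actions. It relies only on the fact that a least-squares learner shown several target actions for the same intention converges to their mean.

```python
def environment(action):
    """Hypothetical toy plant: two different actions give the same outcome."""
    return action * action


def least_squares_prediction(target_actions):
    """The single value that minimizes summed squared error is the mean."""
    return sum(target_actions) / len(target_actions)


intention = 4.0
solutions = [-2.0, 2.0]  # both actions realize the intention

# Each candidate on its own is a valid solution.
assert all(environment(a) == intention for a in solutions)

# Direct inverse modeling associates the intention with both actions,
# so a least-squares fit settles on their average.
averaged_action = least_squares_prediction(solutions)
outcome = environment(averaged_action)
distal_error = intention - outcome

print(averaged_action, outcome, distal_error)
```

The solution set {-2, 2} is not convex, so the averaged action lies outside it: the action error seen by the learner is as small as it can be, while the outcome misses the intention entirely.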

2.2 Forward Models

The use of a forward model to facilitate the learning of an inverse model is an important concept from motor control and motor learning theory, see [Jordan, 1990, Jordan and Rumelhart, 1992]. A forward model is a learned internal representation of the physical environment. The learning of the forward model is similar to the direct inverse modeling strategy but it serves a completely different purpose. Once learned, the forward model's weights are fixed so that the inverse model can be trained using the distal errors, which are propagated back through the forward model to the inverse model.

The forward model produces a predicted outcome, $\hat{y}$, for a given set of actions. This predicted outcome could be used directly to train the inverse model using the predicted performance error:

$$E = (y^* - \hat{y}) \qquad (2)$$

However, if the forward model is not a perfect model of the environment, the learning of the inverse model will be corrupted. Therefore a composite learning system is implemented in which the forward model is placed in parallel with the physical environment and the predicted performance error, Equation 2, is replaced with the values of the performance error, Equation 1. The performance errors are propagated back through the forward model to the inverse model until near-asymptotic performance is reached. See Figure 3.

Figure 3: Composite Learning System

3 Implementation

The implementation of the system in Figure 3 comprises two neural networks placed in series. The composite system is effectively a four-layer network,² see Figure 4. We used back propagation for the learning algorithm, but the method is generalizable to any neural net learning algorithm.

Figure 4: Neural Net Implementation

² We refer to a layer as a matrix of weights and the input/output vectors associated with that matrix.
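The composite scheme of Section 2.2 can be sketched in a few lines. This is a deliberately reduced illustration, not the networks used in this work. It assumes a hypothetical linear, scalar `environment`, so that the forward and inverse models each shrink to one weight and one bias; the learning rates, epoch counts and all names are illustrative choices. What it preserves is the structure: the error comes from the actual outcome (Equation 1), while the derivative used to turn it into an action error comes from the frozen forward model.

```python
def environment(action):
    """Hypothetical plant standing in for the instrument (unknown to the learner)."""
    return 2.0 * action + 0.5


actions = [i / 10.0 for i in range(-10, 11)]      # proximal training actions
intentions = [environment(a) for a in actions]    # reachable sound intentions

# Stage 1: learn the forward model y_hat = f1 * action + f0
# from observed (action, outcome) pairs.
f1, f0 = 0.0, 0.0
for _ in range(200):
    for a in actions:
        prediction_error = environment(a) - (f1 * a + f0)
        f1 += 0.1 * prediction_error * a
        f0 += 0.1 * prediction_error

# Stage 2: freeze the forward model and train the inverse model
# action = w1 * intention + w0 on the performance error (y* - y).
w1, w0 = 0.0, 0.0
for _ in range(200):
    for y_star in intentions:
        action = w1 * y_star + w0
        y = environment(action)          # actual outcome from the plant
        distal_error = y_star - y        # Equation 1
        # Propagate back through the fixed forward model:
        # d(y_hat) / d(action) = f1 for this linear model.
        action_error = distal_error * f1
        w1 += 0.02 * action_error * y_star
        w0 += 0.02 * action_error


def inverse_model(y_star):
    """Learned mapping from a sound intention to an action."""
    return w1 * y_star + w0


novel_intention = 1.3                    # not among the training intentions
outcome = environment(inverse_model(novel_intention))
print(round(f1, 6), round(f0, 6), round(w1, 6), round(w0, 6), round(outcome, 6))
```

Note that the inverse model never sees a target action: it is adjusted only through the difference between intention and outcome, which is the weaker teacher that distal learning allows.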
A number of distal models were trained, implementing parameter estimation for FM synthesis, additive synthesis and various physical models of instruments and voice. The inputs to the neural net were sound intentions represented by the magnitude spectra of the sounds, modified using the method described in Section 3.1, and the outputs were vectors of parameters to the synthesis environment.

3.1 Coarse Coded Spectra

Problems using sound spectra as inputs to neural nets arise because of the narrowness of the spectral peaks in musical sounds. If the spectra are used without further filtering, performance errors may be high even if the sound outcomes are similar to the sound intentions. The effect is that small qualitative changes in sound spectra give rise to large quantitative changes in distal errors. To avoid this problem we used a coarse coding method for sound spectra to facilitate more accurate similarity judgements between sound intentions and sound outcomes, see Figure 5. Coarse coding is essentially a convolution of the magnitude spectrum of a sound with a Gaussian function:

$$g[k] = e^{-\frac{(k - \mu)^2}{2\sigma^2}} \qquad (3)$$

where $g$ is the Gaussian function in vector form, $k$ is the frequency index, $\mu$ is the mean and $\sigma^2$ is the variance.
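The effect of coarse coding on the distal error can be seen in a small sketch. The kernel width, the spectrum length, the single-peak test spectra and all function names are assumptions chosen for illustration. The Gaussian is centered on zero so that the smoothing does not shift the peaks.

```python
import math


def gaussian_kernel(sigma, half_width):
    """Sampled zero-mean Gaussian with standard deviation sigma (assumed values)."""
    return [math.exp(-(k * k) / (2.0 * sigma * sigma))
            for k in range(-half_width, half_width + 1)]


def coarse_code(spectrum, kernel):
    """Convolve a magnitude spectrum with the kernel, keeping the original length."""
    half = len(kernel) // 2
    coded = []
    for k in range(len(spectrum)):
        total = 0.0
        for j, g in enumerate(kernel):
            n = k - (j - half)           # input bin reached through kernel tap j
            if 0 <= n < len(spectrum):
                total += spectrum[n] * g
        coded.append(total)
    return coded


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def single_peak(bin_index, size=64):
    """Idealized spectrum with one narrow partial."""
    spectrum = [0.0] * size
    spectrum[bin_index] = 1.0
    return spectrum


intention = single_peak(20)
near_miss = single_peak(21)              # one bin away from the intention
far_miss = single_peak(40)               # twenty bins away

kernel = gaussian_kernel(sigma=3.0, half_width=9)
raw_near = squared_error(intention, near_miss)
raw_far = squared_error(intention, far_miss)
coded_near = squared_error(coarse_code(intention, kernel),
                           coarse_code(near_miss, kernel))
coded_far = squared_error(coarse_code(intention, kernel),
                          coarse_code(far_miss, kernel))

print(raw_near, raw_far, coded_near < coded_far)
```

On the raw spectra the near miss and the far miss give exactly the same error, so the error carries no information about which direction is an improvement. After coarse coding the near miss scores better than the far miss, which is the graded similarity measure the learning algorithm needs.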

Figure 5: Coarse Coding of Sound Spectrum. The upper figure shows the magnitude spectrum of three components; the lower figure shows the coarse coded version using a Gaussian filter.

The convolution is expressed in the frequency domain as:

$$\tilde{y}[k] = \sum_{n=-\infty}^{\infty} y[n]\, g[k - n]$$

where $\tilde{y}$ is the coarse coded spectrum, $y$ is the magnitude spectrum of the synthesized signal, and $g$ is the Gaussian function of Equation 3. Without such a representation there is no way for the learning algorithm to detect when it is adjusting control parameters in a direction that will minimize the distal errors over the training data. All the sound inputs to the neural network and the sound outputs from the synthesis environment were represented using this strategy.

4 Results

The distal learning models that we have implemented have shown significant success in estimating control parameters for a number of different sound-generating environments. The results of these prototypical models suggest that we can attempt more complex learning tasks. It is our intention to expand the project by including support for dynamic synthesis environments and a facility for the selection of multiple models to parameterize many types of sound. Figure 6 shows the distal errors for the training of the inverse model for FM synthesis using 3 parameters: carrier frequency, modulator frequency and modulation index.

Figure 6: Distal Errors for FM Synthesis. Plot of distal errors against trials at near-asymptotic performance after 250 epochs of 16 training patterns.

5 Summary

In this paper we have presented the problem of modeling how a performer learns to control a musical instrument given an internalized target representation and a distal outcome. We have discussed some important concepts from control theory, such as direct inverse modeling and forward models, and we have shown how they apply to the musical control learning problem.
The implementation of such a system using multi-layer neural networks was also presented, along with issues pertaining to the representation of sound for neural network learning problems. We believe that there are many potential applications for distal learning techniques in the computer music field and expect that issues of control will become prominent in computer music applications.

References

[Astrom and Wittenmark, 1984] Astrom, K.J. & Wittenmark, B.W. Computer Controlled Systems. Englewood Cliffs, NJ: Prentice Hall.

[Jordan, 1990] Jordan, M.I. "Motor learning and the degrees of freedom problem." In M. Jeannerod (Ed.), Attention and Performance, XIII. Hillsdale, NJ: Erlbaum.

[Jordan and Rumelhart, 1992] Jordan, M.I. & Rumelhart, D.E. "Forward models: Supervised learning with a distal teacher." Cognitive Science (in press).

[Lee and Wessel, 1992] Lee, M. & Wessel, D. "Connectionist Models for Real Time Control of Synthesis and Compositional Algorithms." Proceedings of the International Computer Music Conference 1992. International Computer Music Association.