A NEW GESTURAL CONTROL PARADIGM FOR MUSICAL EXPRESSION: REAL-TIME CONDUCTING ANALYSIS VIA TEMPORAL EXPECTANCY MODELS

Dilip Swaminathan, Harvey Thornburg, Todd Ingalls, Jodi James, Stjepan Rajko, and Kathleya Afanador
Arts, Media and Engineering, Arizona State University
email: dilip@asu.edu

ABSTRACT

Most event sequences in everyday human movement exhibit temporal structure: for instance, footsteps in walking, the striking of balls in a tennis match, the movements of a dancer set to rhythmic music, and the gestures of an orchestra conductor. These events generate prior expectancies regarding the occurrence of future events. Moreover, these expectancies play a critical role in conveying expressive qualities and communicative intent through the movement; thus they are of considerable interest in expressive musical control contexts. To this end, we introduce a novel gestural control paradigm for musical expression based on temporal expectancies induced by human movement via a general Bayesian framework called the temporal expectancy network. We realize this paradigm in the form of a conducting analysis tool which infers beat, tempo, and articulation jointly with temporal expectancies regarding beat (ictus and preparation instances) from conducting gesture. Our system operates on data obtained from a marker-based motion capture system, but can be easily adapted to more affordable technologies combining video cameras and inertial sensors. Using our analysis framework, we observe a significant effect of varying expressive qualities of the gesture (e.g., staccato vs. legato articulation) on the patterns of temporal expectancies generated, which at least partially confirms the role of temporal expectancies in musical expression.

1. INTRODUCTION

1.1. Temporal expectancy and human movement

In everyday human movement, be it walking footsteps, the striking of tennis balls during a match, the movements of a dancer set to rhythmic music, or the gestures of an orchestra conductor, the occurrences of future events are highly informed by the occurrence times of past events. We consider this the defining property of temporal structure. From this definition, event sequences lacking temporal structure must be Poisson processes [21], which have independent, exponentially distributed inter-event times. With a Poisson process, any event's occurrence is "maximally surprising", as it does not depend on the elapsed duration since previous events.¹ Poisson processes have a rich history in electroacoustic music; for instance, they are used extensively by Xenakis [27] to specify the formal structure of sparse textures of sound. Sparse textures are built up from rare events [27], which from the listener's standpoint are conceptually identical to events which are unexpected due to lack of temporal structure: if an event is considered "rare" by a listener, the structurally similar past events are no longer in that listener's memory. On the other hand, when one observes temporally structured movement, the fact that the occurrence of the next event is highly informed by the occurrence times of past events naturally induces an expectancy, a strong feeling of anticipation that this event is about to occur. We hypothesize, moreover, that an important process by which a performer builds up tension is the sustaining or gradual heightening of this anticipation over prolonged periods.

¹ The standard, homogeneous Poisson process, which is what is usually meant, has the additional consideration that event times are identically distributed; i.e., the event process evolves at a constant rate [21].
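As a brief illustration of the "maximally surprising" property (a standard result, stated here in our own notation): for a homogeneous Poisson process with rate λ, the inter-event time T is exponentially distributed and memoryless, so the time already waited tells the observer nothing about when the next event will occur:

```latex
% Memorylessness of exponential inter-event times (rate \lambda):
P(T > s + u \mid T > s) \;=\; \frac{e^{-\lambda (s+u)}}{e^{-\lambda s}} \;=\; e^{-\lambda u} \;=\; P(T > u).
```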
Consider a hypothetical movie scene, where a killer stalks his victim in a forest and the victim is alternately running and hiding behind trees, trying desperately to survive, yet the inevitable is soon to come. Our anticipation is sustained by ominous, swelling music, and rapidly shifting camera views from increasingly odd angles. In this example, the filmmaker runs the entire gamut of multi-modal feedback and also relies heavily on long-term narrative structures. However, in many situations such tension can be built up through bodily gesture alone.

The idea of expectancy has received much attention in the music cognition literature [15, 18, 2, 8]. A central theme in these efforts is that the subsequent realization or circumvention of expectancies involving melodic or harmonic structures is a fundamental component of the listener's affective response [15, 18]. Temporal expectancies are also well studied in music, perhaps not so extensively in terms of affective response, but as a means to the perception of rhythm [20, 5, 14, 10, 28]. For instance, the pioneering work of Povel and Essens [20] reveals that in the perception of temporal patterns, humans generate a hierarchical clock whose durations are governed by the pattern's distribution of accents. Since the state of the clock indicates when it is about to reset, this model provides at least an implicit encoding of temporal expectancy.

Desain [5] and Zanto et al. [28], among others, extend this work, as they reveal specific neurological mechanisms behind the generation of temporal expectancies. Furthermore, Jones et al. [10] and McAuley [14] have observed direct effects of different expectancy patterns on listeners' perception of duration and meter.

While the music cognition literature is rich in theories regarding temporal expectancy and its role in music perception, relatively little attention has been paid to parallel structures in movement perception. Hagendoorn [7] has extensively researched neurological mechanisms underlying movement-based expectancies, hypothesizing that the elicitation, realization, and circumvention of these expectancies each play similar roles in dance perception as they do in music perception. Unfortunately, this work does not deal with temporal expectancies. Another omission shared between this work and the music cognition literature, which would be of vital interest in gestural control interfaces, is the lack of emphasis on how the performer perceives their own actions vis-a-vis temporal expectancy. For instance, what role might expectancy play in movement or music improvisation? We hypothesize that just as music composers or improvisers are innately aware of their potential to craft emotion in their "audience", so too are those who move. Hence, the role of expectancy in gestural control for musical expression must be situated within the communicative intent underlying expressive movement.

Other aspects of communicative intent in expressive movement have been well studied; for instance, the Laban Movement Analysis (LMA) system [6] concerning Body, Effort, Shape, and Space has been familiar to the dance community for over 80 years. However, the full-body focus of the LMA system, coupled with the present lack of computational frameworks conformable to low-cost sensing technologies, has prevented extensive application of LMA to gestural control of interactive music systems. Because temporal expectancy is essentially orthogonal to those aspects of communicative intent considered by LMA, the two should be regarded as complementary rather than competing paradigms: systems based around temporal expectancy can only augment the expressive possibilities of systems based around LMA, and vice versa.

As such, we are developing a new gestural control paradigm for musical expression based on the induction of temporal expectancies from human movement. Our efforts will enable musical expression to be more tightly coupled to the performer's communicative intent. A more complex interplay between performer and instrument may also arise, as the instrument can be programmed to induce sympathetic expectancies in the performer, to subvert the performer's intentions, or to cultivate situations where complex temporal structures emerge from the hybrid nature of the interaction. Our initial realization of this paradigm concerns conducting gesture, due to the rich set of associations which couple expressive gesture to expressive musical form. Conducting gesture conveys both indicative and expressive attributes of the music, for instance timing (beat, tempo, and meter), dynamics (crescendo/diminuendo, accents), articulation (legato/staccato/marcato/tenuto), and phrasing [12, 25]. As we will show, much of this information is largely conveyed via temporal expectancies.
To this end, we have developed a robust method for inferring beat, tempo, and articulation jointly with temporal expectancies concerning ictus (beat) and preparation (sub-beat) positions² from a single spatial control representing the location of the conductor's baton. Computational conducting gesture analysis has proven quite challenging. The vast majority of current systems are rule-based [25, 16, 12] and assume standard spatial patterns which correspond to various meters (3/4, 4/4, etc.). However, experienced conductors often evolve highly personal styles which do not always follow these patterns [13]. Nevertheless, both orchestra and audience seem to have little trouble inducing timing, articulation, dynamics, phrasing, and other expressive qualities from the gesture, even if they are only peripherally aware of the conductor's motion [25]. This is partly due to reinforcement from the music and foreknowledge of the score; however, few would argue that the conductor's role in managing this process by conveying critical information via bodily gesture is anything less than central. Hence, a major speculative hypothesis of this paper is that conductors convey musical attributes not directly through the standard patterns, but by inducing temporal expectancies in viewers through trends in features such as magnitude velocity and direction which are a) invariant to specific spatial patterns and b) intelligible through peripheral awareness. Preliminary results (Sec. 3) yield substantial evidence confirming this hypothesis, especially regarding the conveyance of articulation through the temporal expectancy of the preparation event.

To this end, we propose a fully integrated Bayesian framework for the joint induction of a) temporal expectancies regarding ictus and preparation, and b) fundamental musical attributes, namely beat, tempo, and articulation, from expressive as well as indicative conducting gestures. This framework, in addition to computing temporal expectancies, incorporates them as a source of prior knowledge to aid the induction of musical attributes. We use a feature set consisting only of the magnitude velocity and direction of the conductor's baton relative to shoulder position. We note that dynamics in these features are invariant to the presence of specific spatial patterns. Features are currently extracted using a marker-based motion capture system from Motion Analysis Corporation. However, they may also be readily computed using low-cost inertial sensors, since the latter are well adapted to tracking changes in position.

Before giving details of our approach, we define what temporal expectancy is in the context of a Bayesian framework. Our definition must agree with that given for temporal structure; namely, that temporal structure is evident if the tendencies for new events to occur depend on the occurrence times of previous events. Several of the present authors [17, 24] have defined the Bayesian posterior temporal expectancy as the posterior probability that a new event will occur in the next time instant, given all features observed up to and including the current time.³ Via Bayes' rule, the posterior expectancy incorporates these observations along with prior knowledge from temporal structure.

² Expectancies regarding sub-beat positions turn out to be highly informative regarding articulation; see Sec. 3.
³ A similar model seems to have been proposed contemporaneously in the computer vision literature for modeling honeybee trajectories; cf. [19].
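To make the two notions concrete, a brief notational sketch (our symbols here, not necessarily those of [17, 24]): write e_{t+1} for the event that a new event occurs at time t+1, y_{1:t} for the features observed through time t, and o_{1:k} for the occurrence times of the k events observed so far. Then, roughly,

```latex
% Posterior vs. prior temporal expectancy (sketch; notation ours).
\begin{align*}
  \text{posterior expectancy at } t:&\quad P\bigl(e_{t+1} \mid y_{1:t}\bigr),\\
  \text{prior expectancy at } t:&\quad P\bigl(e_{t+1} \mid o_{1:k}\bigr),
\end{align*}
```

with Bayes' rule linking the two through the likelihood of the observed features; the prior expectancy, defined next, is a function over all configurations of past occurrence times.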

To encode this prior knowledge, [24] defines a prior temporal expectancy, which is the conditional probability that a new event will occur in the next time instant, given all past event occurrence times. The prior expectancy is not a single number, but an entire distribution, a function defined over all possibilities of past event occurrences and occurrence times, as opposed to the posterior expectancy given a specific observation sequence. Hence, our use of the Bayesian posterior temporal expectancy is entirely consistent with the aforementioned definition of temporal structure: any temporally structured event sequence will generate an informative (non-uniform) prior temporal expectancy and thus influence the posterior temporal expectancy, which models the viewer's belief that a new event is about to occur.

The rest of the paper is organized as follows. Sec. 2.1 discusses how event occurrences (ictus; preparation) influence the dynamics of observed magnitude velocity and direction trajectories based on a novel construct, the hypotrochoidal model. The hypotrochoidal model conforms to observed tendencies related to the standard spatial forms, for instance modeling the cusp-like behavior associated with the ictus. Secs. 2.2 and 2.3 make use of informal maximum entropy techniques in order to model, in a probabilistic sense, the largest typical set of spatial trajectories which conform to the observed tendencies. Sec. 2.4 describes the induction of temporal expectancy, and Sec. 2.5 integrates all previously described dynamic models into a single probabilistic model which fuses features across magnitude velocity and direction to infer beat, tempo, and articulation. Sec. 3 shows results and highlights a case study exploring the effect of staccato and legato articulation on all temporal expectancies considered.

2. PROPOSED METHOD

2.1. Hypotrochoidal model

We now consider how beat positions manifest as tendencies in either magnitude velocity or direction. While experienced conductors may eschew the standard spatial forms, we hypothesize that these tendencies are at some level rooted in the spatial forms; i.e., beats manifest as "cusp-like" behavior. To model such tendencies in magnitude velocity and direction which generate these cusps, we first a) derive constraints from a general class of idealized forms called a hypotrochoid, and then b) apply maximum entropy criteria to construct dynamic probabilistic models given these constraints [9]. This maximum entropy approach allows us to model the largest possible typical set of actual conducting gestures which conform to the given constraints. A hypotrochoid is a spatial curve with parametric equations x(t) = (a − b) sin(t) − h sin(((a − b)/b) t) and y(t) = (a − b) cos(t) + h cos(((a − b)/b) t) [26]. Fig. 1 shows a graph for a = 1, b = 1/3, and h ∈ {0.833, 1.0, 1.2}. The ratio b/a determines the number of segments, while h controls nuances of the cusps: when h < 1 cusps become smoother, and when h > 1 they develop loops. Since loop behavior is more natural in the context of conducting gestures, we target values of h slightly above 1.

Figure 1. Synthesized hypotrochoids.

Now we consider what trends in magnitude velocity and motion direction features are implied by the hypotrochoidal model. These features are computed as follows.
Magnitude velocity: V(t) = sqrt( ẋ(t)² + ẏ(t)² ); Motion direction: θ(t) = tan⁻¹( ẏ(t) / ẋ(t) ).   (1)

Fig. 2 displays graphs of these features using a = 1, b = 1/3, and h = 1.2.

Figure 2. Magnitude velocity and motion direction trajectories for synthesized hypotrochoid.

To analyze the conducting gestures, we focus on the 3D motion of the baton, as monitored by the motion capture system. We handle the natural variations in orientation and scale by normalizing the motion to a standard planar kinesphere, according to the algorithm of Fig. 3.
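For illustration, a minimal numerical sketch of (1) applied to a synthesized hypotrochoid (parameter values as in Fig. 2; the code below is a sketch, not the exact implementation):

```python
import numpy as np

def hypotrochoid(a=1.0, b=1.0/3.0, h=1.2, n=300):
    """Synthesize one period of the hypotrochoid of Sec. 2.1 (illustrative values)."""
    t = np.linspace(0.0, 2.0 * np.pi, n)
    x = (a - b) * np.sin(t) - h * np.sin((a - b) / b * t)
    y = (a - b) * np.cos(t) + h * np.cos((a - b) / b * t)
    return x, y

def velocity_direction_features(x, y):
    """Magnitude velocity V(t) and motion direction theta(t) as in Eq. (1),
    using per-frame differences as approximate derivatives."""
    dx = np.gradient(x)            # approximate x-dot
    dy = np.gradient(y)            # approximate y-dot
    V = np.hypot(dx, dy)           # magnitude velocity
    theta = np.arctan2(dy, dx)     # motion direction (full-quadrant arctangent)
    return V, theta

x, y = hypotrochoid()
V, theta = velocity_direction_features(x, y)   # cf. Fig. 2
```

On real data the same computation is applied to the normalized baton trajectory, after the smoothing described in Sec. 3.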

2.2. Modeling trends in magnitude velocity

From Fig. 2, we observe that the magnitude velocity of the wrist motion starts at a minimum (close to zero) at the cusp (ictus) and increases to a maximum at the midpoint (preparation). Correspondingly, the magnitude velocity derivative remains positive at the ictus, becomes zero at the preparation, and becomes negative before the next ictus. For the purposes of segmentation and of developing dynamic models, we define the stroke as having two parts: the first part describes the motion from the ictus to the preparation; the second, from the preparation to the next ictus.

Figure 3. Normalization of conducting gesture.

For each part we define two modes, Mt, each indicating either the onset or continuation of the corresponding part. Thus Mt ∈ {'O1', 'C1', 'O2', 'C2'}: Mt = 'O1', onset of the first part; Mt = 'C1', continuation of the first part; Mt = 'O2', onset of the second part; Mt = 'C2', continuation of the second part. These modes are diagrammed in Fig. 4.

Figure 4. Single stroke of magnitude velocity with various Mt segments.

With the modes thus defined, we can summarize the observed tendencies in Table 1.

Mode Mt | Observed tendencies in Δt
'O1'    | Δt > 0, Δt ≉ Δt−1
'C1'    | Δt > 0, Δt ≈ Δt−1
'O2'    | Δt < 0, Δt ≉ Δt−1
'C2'    | Δt < 0, Δt ≈ Δt−1

Table 1. Magnitude velocity modes and corresponding tendencies observed in Δt.

The influence of magnitude velocity modes on the noisy magnitude velocity observations is modeled as a switching state space model (SSM), for which Fig. 5 shows the directed acyclic graph (DAG).⁴

⁴ A DAG [23] is a graphical representation of a factorization of a joint probability distribution into conditional distributions. If a DAG consists of nodes X1:N, the corresponding factorization is P(X1:N) = Π_i P(Xi | Pa{Xi}), where Pa{Xi} are the parents of Xi. For instance, in Fig. 5, P(M1:t, Δt−1, Δt, Vt−1, Vt, Yv,t) = P(M1:t−1) P(Mt | M1:t−1) P(Δt | Δt−1, Mt) P(Vt | Vt−1, Δt) P(Yv,t | Vt).

Figure 5. Magnitude velocity: single time slice DAG. Observed variables are shaded; hidden variables are unshaded.

In Fig. 5, Yv,t denotes the observed magnitude velocity, Vt the inherent magnitude velocity, and Δt the "first-order difference" of Vt. That is, P(Vt | Vt−1, Δt) concentrates deterministically on Vt = Vt−1 + Δt. P(Δt | Δt−1, Mt) is developed by encoding the tendencies in Table 1 using Jaynes' principle of maximum entropy [9]. Let us first consider continuation modes. From Table 1 we have Δt > 0 for Mt = 'C1' and Δt < 0 for Mt = 'C2'. Furthermore, we expect some continuity of Δt, i.e., Δt ≈ Δt−1, which can be controlled by E|Δt − Δt−1|² ≤ σ²_Δ. Putting these constraints together and using the methods in [4], we can solve for the maximum entropy dependence in closed form:

Δt ~ N+(Δt−1, σ²_Δ), Mt = 'C1';
Δt ~ N−(Δt−1, σ²_Δ), Mt = 'C2',   (2)

where N+ and N−, respectively, are Gaussian distributions truncated to be positive and negative, sharing the mean Δt−1 and variance σ²_Δ. At the ictus and preparation, we do not constrain Δt ≈ Δt−1; instead we allow for sudden changes in dynamics, weakly constraining Δt ≈ Δ(1) (for Mt = 'O1') and Δt ≈ −Δ(2) (for Mt = 'O2'), where Δ(1), Δ(2) > 0 are nominal values; i.e.,

Δt ~ N+(Δ(1), σ²_Δ,1), Mt = 'O1';
Δt ~ N−(−Δ(2), σ²_Δ,2), Mt = 'O2'.   (3)

Finally, via P(Yv,t | Vt) we model the observed velocity as the inherent velocity plus zero-mean Gaussian noise: Yv,t ~ N(Vt, σ²_Y,v).
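For concreteness, a minimal sampling sketch of the mode-conditional dynamics (2)-(3); the numerical values of Δ(1), Δ(2), and the variances below are illustrative placeholders, not values from our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_truncated_normal(mean, std, positive=True):
    """Rejection-sample N+ (positive support) or N- (negative support)."""
    while True:
        x = rng.normal(mean, std)
        if (x > 0) == positive:
            return x

def sample_delta(delta_prev, mode,
                 delta1=0.05, delta2=0.05, sigma=0.01, sigma1=0.02, sigma2=0.02):
    """Draw Delta_t given Delta_{t-1} and M_t per Eqs. (2)-(3); values illustrative."""
    if mode == 'C1':   # continuation, first part: positive, near Delta_{t-1}
        return sample_truncated_normal(delta_prev, sigma, positive=True)
    if mode == 'C2':   # continuation, second part: negative, near Delta_{t-1}
        return sample_truncated_normal(delta_prev, sigma, positive=False)
    if mode == 'O1':   # ictus onset: positive, near nominal +Delta^(1)
        return sample_truncated_normal(delta1, sigma1, positive=True)
    if mode == 'O2':   # preparation onset: negative, near nominal -Delta^(2)
        return sample_truncated_normal(-delta2, sigma2, positive=False)
    raise ValueError(mode)

# The inherent velocity then evolves deterministically, V_t = V_{t-1} + Delta_t,
# and the observation is Y_{v,t} = V_t plus zero-mean Gaussian noise.
```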

2.3. Modeling trends in motion direction

From the hypotrochoidal model, we also observe the ictus through a rapid (not necessarily abrupt) change in the direction of motion; during the rest of the stroke the direction changes much more slowly. We call the region of rapid change just succeeding the ictus the transient region, and assume that the latter ceases before the preparation. As shown in Fig. 6, we model Mt ∈ {'O', 'T', 'C'} and specify the corresponding SSM using the DAG shown in Fig. 7. Here Mt = 'O' corresponds to the onset of the ictus, Mt = 'T' to the remainder of the transient region, and Mt = 'C' to the continuation, or remainder of the stroke. We do not attempt to observe preparation through motion direction data.

Figure 6. Single stroke of motion direction with all Mt segments.

Figure 7. Motion direction: single time slice DAG.

We let θt model the inherent direction of motion, ωt the inherent derivative of θt, and Yθ,t the observed direction. Similarly to magnitude velocity, the inherent motion direction is driven by its derivative; i.e., P(θt | θt−1, ωt) concentrates deterministically on θt = θt−1 + ωt. This derivative is large during onset and transient regions, and otherwise small. Similar maximum entropy arguments as those used for the magnitude velocity model apply here as well; hence we have for P(ωt | ωt−1, Mt):

ωt ~ N(0, σ²_ω,T), Mt ∈ {'O', 'T'};
ωt ~ N(0, σ²_ω,C), Mt = 'C',   (4)

where σ_ω,T > σ_ω,C. Finally, we model Yθ,t as θt plus zero-mean Gaussian noise: Yθ,t ~ N(θt, σ²_Y,θ).

2.4. Temporal expectancy model

Let us now turn to modeling temporal structure via P(Mt | M1:t−1). In most musical circumstances it is safe to assume, at least locally, that the temporal structure of beat onsets is quasi-periodic [3], with a tempo period that changes slowly over time. In [17, 24], a general method is proposed for modeling prior temporal expectancies for quasi-periodic event sequences which includes the following additional state variables: tempo period Tt, elapsed duration since previous onset τt, and onset incidence Mt ∈ {'O', 'C'}. That is, for ordinary quasi-periodic event sequences, the dependence P(Mt | M1:t−1) can be modeled as first-order Markov in the augmented state: P(Tt, τt, Mt | Tt−1, τt−1, Mt−1). Presently, we extend this approach to model the more complex temporal structures found in conducting strokes.

The DAG of the temporal expectancy model for conducting gestures is shown in Fig. 8. Here Tt encodes the tempo period, similar to the model proposed in [17, 24], whereas other variables encode information specific to conducting gestures. For instance, at ∈ {'L', 'S'} denotes the type of articulation expressed at time t, namely legato or staccato. Anticipating the fusion of magnitude velocity and direction features (Sec. 2.5), we consider Mt as the union of all previously described modes; i.e., Mt ∈ {'O1', 'T1', 'C1', 'O2', 'C2'} as defined in Table 3. In order to compute the elapsed durations since the most recent ictus, τictus,t, and since the most recent preparation, τprep,t, we propose the use of dual timer variables τ1,t and τ2,t; hence τt = {τ1,t, τ2,t}. The joint distribution for the SSM in Fig. 10 factors as per the DAG.

Figure 8. Single time slice DAG of temporal expectancy model for conducting gesture.

Now we specify the individual distributions implied by the DAG (Fig. 8). As we expect instantaneous tempo deviations to be proportional to the current tempo period, P(Tt | Tt−1) follows log Tt ~ N(log Tt−1, σ²_T), after [3]. The timers τ1,t and τ2,t evolve deterministically according to the second and third columns of Table 2; i.e., both P(τ1,t | τ1,t−1, Mt) and P(τ2,t | τ2,t−1, Mt) concentrate deterministically on these possibilities. P(at | at−1) encodes the assumption that articulation changes infrequently across time; i.e., P(at ≠ at−1 | at−1) = α, where α ≪ 1. Finally, P(Mt | Mt−1, τ1,t, τ2,t, Tt) is used to encode the prior temporal expectancy, which we now discuss.

Mt    | τ1,t        | τ2,t        | τictus,t     | τprep,t
'O1'  | 1/2         | τ2,t−1      | τ1,t         | τ1,t + τ2,t
'T1'  | τ1,t−1 + 1  | τ2,t−1      | τ1,t         | τ1,t + τ2,t
'C1'  | τ1,t−1 + 1  | τ2,t−1      | τ1,t         | τ1,t + τ2,t
'O2'  | τ1,t−1      | 1/2         | τ1,t + τ2,t  | τ2,t
'C2'  | τ1,t−1      | τ2,t−1 + 1  | τ1,t + τ2,t  | τ2,t

Table 2. Behavior of timer variables and computation of elapsed durations under different modes.
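The timer updates of Table 2 can be restated compactly as follows (a sketch in our notation, in frame units; not the exact implementation):

```python
def update_timers(tau1_prev, tau2_prev, mode):
    """Deterministic timer updates of Table 2; returns
    (tau1, tau2, tau_ictus, tau_prep)."""
    if mode == 'O1':                 # ictus onset: reset first timer
        tau1, tau2 = 0.5, tau2_prev
    elif mode in ('T1', 'C1'):       # first part of the stroke continues
        tau1, tau2 = tau1_prev + 1.0, tau2_prev
    elif mode == 'O2':               # preparation onset: reset second timer
        tau1, tau2 = tau1_prev, 0.5
    elif mode == 'C2':               # second part of the stroke continues
        tau1, tau2 = tau1_prev, tau2_prev + 1.0
    else:
        raise ValueError(mode)

    if mode in ('O1', 'T1', 'C1'):   # within first part: ictus just passed
        tau_ictus, tau_prep = tau1, tau1 + tau2
    else:                            # within second part: preparation just passed
        tau_ictus, tau_prep = tau1 + tau2, tau2
    return tau1, tau2, tau_ictus, tau_prep
```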
Since all expectancies considered (ictus; preparation) depend only on the elapsed duration since the previous ictus (τictus,t, given by the fourth column of Table 2), we may encode the prior temporal expectancy via P(Mt | Mt−1). A state transition diagram for P(Mt | Mt−1), with transition probabilities expressed as functions of at, τictus,t, and Tt, is shown in Fig. 9. Here there are essentially three expectancies to consider: pictus, the expectancy for the next ictus; pprep, the expectancy for the preparation; and pC1, the expectancy for the end of the transient region. The ictus expectancy pictus is induced by the quasi-periodic structure of the beat pattern, which depends on the tempo but not on the articulation.

We can model the inherent elapsed duration between consecutive ictus onsets using a random variable L, with log L ~ N(log Tt, σ²_L). The probability that a new ictus will occur at time t, given that time τictus,t−1 has elapsed since the previous ictus, is the same as the probability that L ≤ τictus,t−1 + 1, given that L > τictus,t−1. Hence pictus = Haz(τictus,t−1), where

Haz(τ) = [ ∫_τ^{τ+1} P(L | Tt) dL ] / [ 1 − ∫_0^τ P(L | Tt) dL ]   (5)

denotes the hazard rate [21].

Figure 9. Fusion: Mode transitions.

Similarly, we expect that the preparation expectancy, pprep, depends on both tempo and articulation. Nominally the preparation should occur halfway between ictus onsets; however, under staccato articulation, depending on the orientation of the conductor's hand, the preparation can occur much earlier or later than the midpoint. We model the corresponding elapsed-duration variable as

log L ~ N(log(Tt/2), σ²_τ,L), at = 'L';
log L ~ N(log(Tt/2), σ²_τ,S), at = 'S',   (6)

where σ_τ,S > σ_τ,L expresses the much greater deviations attributed to staccato articulation (at = 'S'). Then pprep = Haz(τictus,t−1), computed via (5) using the distribution in (6). Finally, considering the length of the transient region, we do not explicitly model its dependence on tempo or articulation because we expect it to be very short. Ideally, under h = 1 for the hypotrochoidal model (Sec. 2.1), the transient region should have zero length; however, under more practical conditions we expect this region to persist for one or two frames (at a nominal rate of 20 fps). We specify pC1 = 1/2 to model an expected duration of two frames.
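A minimal numerical sketch of the expectancies pictus, pprep, and pC1, assuming the log-normal duration models above and the hazard rate (5); the variance values below are illustrative placeholders, not values from our experiments:

```python
from scipy.stats import lognorm

def hazard(tau, median, sigma):
    """Haz(tau) of Eq. (5) for a log-normal duration L with
    log L ~ N(log(median), sigma^2): P(tau < L <= tau + 1 | L > tau)."""
    F = lambda x: lognorm.cdf(x, s=sigma, scale=median)
    denom = 1.0 - F(tau)
    return (F(tau + 1.0) - F(tau)) / denom if denom > 0.0 else 1.0

def p_ictus(tau_ictus_prev, tempo_period, sigma_L=0.1):
    """Ictus expectancy, Eq. (5): depends on tempo only."""
    return hazard(tau_ictus_prev, tempo_period, sigma_L)

def p_prep(tau_ictus_prev, tempo_period, articulation):
    """Preparation expectancy, Eqs. (5)-(6): wider deviations for staccato."""
    sigma = 0.4 if articulation == 'S' else 0.1   # sigma_S > sigma_L (illustrative)
    return hazard(tau_ictus_prev, tempo_period / 2.0, sigma)

p_C1 = 0.5   # end-of-transient expectancy: expected length of about two frames
```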
2.5. Fusion of magnitude velocity and motion direction features with temporal expectancy model

To jointly estimate beat, tempo, and articulation, as well as to infer posterior temporal expectancies regarding ictus and preparation from observed magnitude velocity and motion direction features, we fuse the aforementioned probabilistic models for inherent magnitude velocity (Sec. 2.2, Fig. 5) and motion direction (Sec. 2.3, Fig. 7) trends with the prior temporal expectancy model developed above (Fig. 8). Fig. 10 shows a single time slice of the resultant DAG.⁵

⁵ Note that we must remap the mode definitions for the magnitude velocity and motion direction modes for the inherent feature trajectory models. This mapping is given by substituting the first column of Table 3 in place of the second and third columns for velocity and direction, respectively.

Mt (fusion) | Mt (mag. velocity) | Mt (motion direction)
'O1'        | 'O1'               | 'O'
'T1'        | 'C1'               | 'T'
'C1'        | 'C1'               | 'C'
'O2'        | 'O2'               | 'C'
'C2'        | 'C2'               | 'C'

Table 3. Correspondence between fusion modes and magnitude velocity, motion direction modes.

Figure 10. Single time slice DAG showing fusion of magnitude velocity, motion direction and temporal expectancy models.

All quantities of interest are estimated and derived as follows.

* Mode (Mt) and articulation (at), since they are discrete, are estimated by maximizing the filtered posteriors P(Mt | Yv,1:t, Yθ,1:t) and P(at | Yv,1:t, Yθ,1:t), respectively. Doing so gives, at any instant, the minimum-error estimate given present and past feature observations [11]. Ictus locations are determined as those times t where the estimated mode M̂t equals 'O1'. Similarly, preparation onsets are determined where M̂t = 'O2'.
* Tempo, since it is continuous-valued, is estimated as the mean of the filtered posterior P(Tt | Yv,1:t, Yθ,1:t), which yields the minimum mean-square error estimator [11].
* The posterior temporal expectancy for the ictus, as defined in Sec. 1.1, is computed via P(Mt+1 = 'O1' | Yv,1:t, Yθ,1:t). Similarly, the preparation expectancy is given by P(Mt+1 = 'O2' | Yv,1:t, Yθ,1:t).
* All posteriors are computed using a standard sequential importance resampling particle filter [1].

The overall preprocessing, feature extraction, and inference steps have time complexity which is linear in the number of frames and can easily be implemented in real time.
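For concreteness, a highly simplified bootstrap sequential importance resampling step in the spirit of [1]; here `transition`, `likelihood`, and `mode_prob` are placeholders standing in for the fused model of Fig. 10, not the exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sir_step(particles, weights, y_t, transition, likelihood):
    """One bootstrap SIR update: propagate each particle through the state
    transition, reweight by the observation likelihood, then resample."""
    particles = [transition(p) for p in particles]            # x_t ~ p(x_t | x_{t-1})
    weights = np.array([w * likelihood(y_t, p)
                        for w, p in zip(weights, particles)])
    weights = weights / weights.sum()                          # normalize

    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = [particles[i] for i in idx]                    # resample
    weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights

def posterior_mode_expectancy(particles, weights, mode_prob, mode='O1'):
    """Posterior temporal expectancy P(M_{t+1} = mode | y_{1:t}): the weighted
    average, over particles, of the one-step mode transition probability."""
    return float(sum(w * mode_prob(p, mode) for w, p in zip(weights, particles)))
```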

3. RESULTS AND DISCUSSION

We have tested our method on extensive real-world data from performances by a novice conductor, using a very simple marker set: left and right shoulders, right elbow, right wrist, and the baton tip. The raw marker data is first normalized using the algorithm described in Fig. 3, and then magnitude velocity and motion direction features are extracted via (1). The marker data is often noisy due to missing markers and occlusions, and this noise is amplified by the approximate derivatives required to compute these features. Hence, we apply third-order Savitzky-Golay smoothing [22] to both feature sets before presenting this data to our algorithm.

Fig. 11 shows results for the real-time estimation of tempo, articulation, and beat positions for a short segment (270 frames at 100 fps; 2.7 seconds) using a metronome running at 90 bpm. Despite the short interval, the tempo and articulation estimates clearly converge to the correct hypotheses within 1.5 seconds and 0.3 seconds, respectively. The beat (ictus) segmentation given in the lower half of the figure also makes intuitive sense, as segments are closely allied with cusp minima of the magnitude velocity curve and points of rapid direction change. Fig. 12 shows similar results for the legato case.

Figure 11. Inference results on conducting data expressing staccato articulation.

Figure 12. Inference results on conducting data expressing legato articulation.

In Figs. 13 and 14 the posterior temporal expectancies are compared for these cases. We see that there is no appreciable difference regarding the ictus expectancy; however, there is a significant difference regarding the preparation expectancy. With staccato articulation, the preparation expectancy develops earlier and builds up over a longer period compared with legato. As discussed in Sec. 1.1, a prolonged increase of temporal expectancy is a key component in the build-up of tension; hence, our intuitive sense that gestures associated with staccato articulation are communicated more strongly is confirmed. Furthermore, as differences in articulation exhibit such dramatic effects on expectancy variations while exhibiting rather slight effects on the induction of indicative musical attributes such as beat and tempo, we conclude that it is primarily through temporal expectancy that the very palpable difference in musical expressivity between articulation styles is communicated.

4. CONCLUSION

We have introduced a new gestural control paradigm for musical expression by developing a computational framework to jointly infer basic musical parameters (beat, tempo, articulation) and induce temporal expectancies from the baton motion of a conductor. Via temporal expectancy, our paradigm focuses on the expressive and communicative intents underlying this motion. In fact, our preliminary results confirm an initial speculative hypothesis that musically expressive elements of the conducting gesture are communicated specifically through temporal expectancies. While the initial realization of our paradigm is rooted in conducting gesture, virtually all of the computational tools can be generalized beyond conducting; for instance, our dynamic models of motion features do not assume the standard spatial forms commonly associated with conducting (Secs. 2.2, 2.3), and the prior temporal expectancy framework discussed in Sec. 2.4 can be attached to entirely different types of features or controls.
We have chosen to focus on "conducting-like" gesture because we believe the associations between gesture and musical expression are quite richly established through this framework. For future work, we first plan to run more extensive tests using a broader population of conductors, experienced as well as novice; from the experienced conductors in particular we can obtain qualitative feedback on how well the musical expressivity of the gestures is captured. We also intend to generalize our framework to the induction of higher-level temporal patterns (for instance meter and related accentual patterns; cf. [20]), such that a greater range of complexities and nuances of temporal expectancies associated with musical rhythm may inform the gestural control of musical expression.

5. ACKNOWLEDGEMENTS

We gratefully acknowledge that this material is based upon work supported by the National Science Foundation CISE Infrastructure and IGERT grants Nos. 0403428 and 0504647.

Figure 13. Predictive posterior expectancy curves of O1, O2 segments along with event occurrences for staccato articulation.

Figure 14. Predictive posterior expectancy curves of O1, O2 segments along with event occurrences for legato articulation.

6. REFERENCES

[1] S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. Tutorial on particle filters for on-line nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 2001.
[2] J. Berger and D. Gang. A neural network model of metric perception and cognition in the audition of functional tonal music. In International Computer Music Conference, 1997.
[3] A. Cemgil, H. J. Kappen, P. Desain, and H. Honing. On tempo tracking: Tempogram representation and Kalman filtering. In Proceedings of the 2000 International Computer Music Conference, pages 352-355, 2000.
[4] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley and Sons, 1999.
[5] P. Desain. What rhythm do I have in mind? Detection of imagined temporal patterns from single trial ERP. In Proceedings of the International Conference on Music Perception and Cognition (ICMPC), 2004.
[6] P. Hackney. Making Connections: Total Body Integration Through Bartenieff Fundamentals. Routledge, 2000.
[7] I. G. Hagendoorn. Some speculative hypotheses about the nature and perception of dance and choreography. Journal of Consciousness Studies, pages 79-110, 2004.
[8] D. Huron. Sweet Anticipation: Music and the Psychology of Expectation. The MIT Press, 2006.
[9] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
[10] M. R. Jones and J. D. McAuley. Time judgments in global temporal contexts. Perception and Psychophysics, pages 398-417, 2005.
[11] S. M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice-Hall, Upper Saddle River, NJ, USA, 1993.
[12] P. Kolesnik and M. Wanderley. Recognition, analysis and performance with expressive conducting gestures. In International Computer Music Conference, 2004.
[13] E. Lee, I. Grull, H. Kiel, and J. Borchers. conga: A framework for adaptive conducting gesture analysis. In International Conference on New Interfaces for Musical Expression, 2006.
[14] J. D. McAuley. The effect of tempo and musical experience on perceived beat. Australian Journal of Psychology, pages 176-187, 1999.
[15] L. B. Meyer. Emotion and Meaning in Music. University of Chicago Press, 1961.
[16] D. Murphy, T. Andersen, and K. Jensen. Conducting audio files via computer vision. In A. Camurri and G. Volpe, editors, Gesture Workshop 2003, volume 2915 of Lecture Notes in Computer Science, pages 529-540. Springer-Verlag, Heidelberg, Germany, 2003.
[17] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California, Berkeley, 2002.
[18] E. Narmour. The Analysis and Cognition of Basic Melodic Structures: The Implication-Realization Model. University of Chicago Press, 1990.
[19] S. M. Oh, J. M. Rehg, and F. Dellaert. Parameterized duration modeling for switching linear dynamic systems. In International Conference on Computer Vision and Pattern Recognition, 2006.
[20] D. J. Povel and P. Essens. Perception of temporal patterns. Music Perception, pages 411-440, 1985.
[21] S. Ross. Stochastic Processes. Wiley Interscience, Yorktown Heights, NY, 1995.
[22] A. Savitzky and M. J. E. Golay. Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry, pages 1627-1639, 1964.
[23] H. Thornburg. Detection and Modeling of Transient Audio Signals with Prior Information. PhD thesis, Stanford University, 2005.
[24] H. Thornburg, D. Swaminathan, T. Ingalls, and R. Leistikow. Joint segmentation and temporal structure inference for partially-observed event sequences. In International Workshop on Multimedia Signal Processing, 2006.
[25] S. Usa and Y. Mochida. A multi-modal conducting simulator. In International Computer Music Conference, 1978.
[26] E. W. Weisstein. Hypotrochoid. MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/hypotrochoid.html.
[27] I. Xenakis. Formalized Music. Pendragon Press, Stuyvesant, New York, 1992.
[28] T. P. Zanto, J. S. Snyder, and E. W. Large. Neural correlates of rhythmic expectancy. Advances in Cognitive Psychology, pages 221-231, 2006.