Page 00000073

NON-LINEAR SCALING TECHNIQUES FOR UNCOVERING THE PERCEPTUAL DIMENSIONS OF TIMBRE

John Ashley Burgoyne and Stephen McAdams
Centre for Interdisciplinary Research in Music and Media Technology
Schulich School of Music of McGill University
555 Sherbrooke Street West, Montréal, Québec, Canada H3A 1E3
{ashley, smc}

ABSTRACT

Seeking to identify the constituent parts of the multidimensional auditory attribute that musicians know as timbre, music psychologists have made extensive use of multidimensional scaling (MDS), a statistical technique for visualising the geometric spaces implied by perceived dissimilarity. MDS is also well known in the machine learning community, where it is used as a basic technique for dimensionality reduction. We adapt a popular non-linear variant of MDS in machine learning, Isomap, for use in analysing psychological data and re-analyse an earlier experiment on human perception of timbre. Isomap is designed to be better than linear MDS at maintaining the local relationships of each data point with its neighbours, and our results show that it can produce a more musically intuitive timbre space without sacrificing correlation with those aspects of timbre perception that have already been discovered. Future work should explore the new timbral dimensions uncovered by our algorithm.

1. INTRODUCTION

As any computer musician knows, timbre is one of the most important compositional parameters, and yet it remains one of the most under-theorised. Part of the reason for the relative lack of theory may be that, unlike pitch, timbre is a multi-dimensional auditory attribute, and it is difficult to draw general conclusions about timbre as a whole without first identifying its constituent parts. Nonetheless, there have been a number of attempts to uncover the underlying dimensionality of timbre space over the past few decades, most based on perceptual experiments with synthesised and recorded tones.
Early experiments with synthetic tones identified spectral centroid and attack time as primary components of timbre, in addition, at times, to a third dimension that was more difficult to interpret and dependent on the stimulus set [4, 5]. Later studies with more sophisticated models came to similar conclusions, and suggested that the third component might be a measure of irregularity in the spectral envelope [6, 7]. A recent confirmatory study has verified these interpretations [2]. All of these studies are based on a statistical technique known as multidimensional scaling (MDS) [9]. The basic idea of MDS is to take the set of proximities between all members of some set of data points, e.g., sample timbres, and to model them as distances in a Euclidean space of as few dimensions as possible. In the context of timbre, these proximities are usually taken from psychological experiments in which human subjects have rated their perception of the (dis)similarity between timbre pairs. The trouble with MDS in this context is that its classical form was designed to interpret a single set of dissimilarities among items, not the average over all subjects of an experiment. The first robust solution to this problem was the INDSCAL algorithm [3], which models a special weight on each dimension for each subject in the experiment in order to better fit model distances to the set of empirical dissimilarities. The more sophisticated CLASCAL algorithm reduces the number of parameters in INDSCAL by modelling weights not for individual subjects but for a smaller number of aggregate subject groups, called latent classes [11]. CLASCAL and its variants are the standard techniques for analysing timbre spaces today. There is another problem with these techniques, however: being linear, they consider all distances estimated by the human subjects to be equally reliable and of equal relative scale. 
Although some relatively straightforward extensions to MDS can treat the latter problem, e.g., CONSCAL [12], the former requires more aggressive modifications. One such modification, known as Isomap, replaces large distances in the original distance matrices with so-called geodesic distances along a hypothetical manifold [8]. Previous papers at this conference have demonstrated that Isomap and its relatives can uncover meaningful musical relationships that traditional linear MDS will always miss [1].

This paper combines the CLASCAL and Isomap models to re-analyse the data from a major study of timbre [7]. Section 2 provides a more detailed explanation of these algorithms and best practices for interpreting their results. Section 3 presents the results of our new scaling and compares them to the original study. Section 4 concludes with suggestions for future applications of non-linear scaling to the study of musical timbre.
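As a point of reference for the models reviewed in Section 2, the classical metric form of MDS [9] can be sketched in a few lines. This is only a minimal illustration, assuming NumPy and a single symmetric dissimilarity matrix; the function name `classical_mds` and the toy data are hypothetical and are not taken from the study or from the CLASCAL software.

```python
import numpy as np


def classical_mds(dissim, n_dims):
    """Embed items in n_dims dimensions from a symmetric dissimilarity matrix."""
    d = np.asarray(dissim, dtype=float)
    n = d.shape[0]
    # Double-centre the squared dissimilarities to obtain an inner-product matrix.
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (d ** 2) @ centring
    # eigh returns eigenvalues in ascending order; keep the n_dims largest.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    # Clip small negative eigenvalues, which arise when the data are not Euclidean.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Toy data (hypothetical): six random points in a plane and their true distances.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(dist, 2)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

For exactly Euclidean input, as in the toy example, the embedding reproduces the original distances up to rotation and reflection; perceptual dissimilarities will generally be fit only approximately.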

2. CLASCAL AND ISOMAP

2.1. CLASCAL

Traditional MDS was designed to handle a single set of pairwise proximities only. A number of models have been presented to adapt MDS for multiple-subject experiments, of which the most important for studying timbre has been CLASCAL [11]. The CLASCAL model seeks to minimise the approximation error in the following equation:

$$d_{ijk} \approx \left[ \sum_{r=1}^{R} w_{c(i),r} \left( x_{jr} - x_{kr} \right)^{2} \right]^{1/2} \qquad (1)$$

where $d_{ijk}$ is the dissimilarity rating that subject $i$ assigned to stimulus pair $(j, k)$, $R$ is the number of dimensions in the output set, $w_{c(i),r}$ is a special weight for the so-called latent class $c(i)$ to which CLASCAL has assigned subject $i$, and $x_{jr}$ and $x_{kr}$ are the $r$-th components of the output vectors for stimuli $j$ and $k$. Latent classes are meant to represent groups of subjects who pursue similar rating strategies. The number of latent classes used is a compromise between over-parametrisation, e.g., the INDSCAL model, which assigns each subject to its own class, and over-generalisation, e.g., ignoring differences between subjects by taking the average over all dissimilarity matrices. A Monte-Carlo likelihood-ratio technique is used to determine the optimal number of classes.

Another problem with traditional MDS is that it assumes all of the variance in a data set can be explained by dimensions common to all stimuli. This assumption does not hold for timbres: many include instrument-specific components such as the sound of the returning hopper in a harpsichord. A more sophisticated version of CLASCAL separates these components, known as specificities, using the following model:

$$d_{ijk} \approx \left[ \sum_{r=1}^{R} w_{c(i),r} \left( x_{jr} - x_{kr} \right)^{2} + v_{c(i)} \left( s_{j} + s_{k} \right) \right]^{1/2} \qquad (2)$$

where $s_j$ and $s_k$ are the specificities for stimuli $j$ and $k$ and $v_{c(i)}$ represents the weight subjects in class $c(i)$ give to specificities when distinguishing timbres [7, 10].

2.2. Isomap

Isomap arose as a solution to the problem of dimensionality reduction for data sets like the famous 'Swiss roll' pictured in Figure 1 [8].
Looking at the plot, it is obvious to a human that the data are arranged on a two-dimensional plane that has been coiled and presented in three dimensions. This fact is not obvious to traditional MDS, which strives to preserve every pairwise distance in the set, including those between the ends of the roll and the inner or outer loops. The ingenious solution in Isomap is to throw away all pairwise distances in the set except those at the local level, i.e., those in a small region immediately surrounding each point in the data set. These regions can be selected as a fixed number $k$ of the nearest neighbours to each point in the data set or as those points that fall within a sphere of fixed radius $\epsilon$ around each point. The other distances are then recomputed using an all-pairs shortest-path algorithm, yielding an approximation of the so-called geodesic distances, or distances in the lower-dimensional form. After these approximate distances are computed, traditional MDS is applied.

Figure 1: The 'Swiss roll' data set. (a) Embedded in 3-D. (b) Unrolled in 2-D. On the left, the data are presented in their original form. On the right, the data are presented as they should be unrolled for human interpretation. Traditional MDS can never arrive at this solution, however, because it seeks to preserve the distances between the ends of the roll and the inner/outer loops.

At first glance, the Swiss roll appears to be a fundamentally different problem from that of estimating timbre spaces. There is little reason to believe that human subjects would wilfully twist their ratings of the similarities between timbre pairs into more dimensions than are already present. The larger message of Isomap, however, is that unless a space is perfectly linear, large distances in a scaling model can mask important structures in the data.
It seems prudent to check for such structures in psychological data, and because Isomap is based on classical MDS, unlike a number of other non-linear scaling techniques, it lends itself naturally to combination with CLASCAL. Each subject's dissimilarity matrix is processed according to the Isomap algorithm up to the final MDS step. After this pre-processing is complete, the new dissimilarity matrices are fed to CLASCAL as usual.

3. EXPERIMENTS AND RESULTS

We did not perform a new perceptual study for this paper, but rather re-examined data from McAdams et al.'s 1995 study of 88 subjects [7]. We chose to examine the judgements of the professional musicians in the study only, 24 in all, in order to simplify our analysis. The timbres used in the study overlapped considerably with those used in [6], which were a set of recorded, FM-synthesised timbres designed to mimic traditional musical instruments. In addition to 12 timbres from this set, McAdams et al., following the legacy of [5], also synthesised 6 hybrid timbres, e.g., the oboleste, a combination of the perceptual features of oboe and celesta sounds. Each subject had an opportunity to rate the dissimilarity between all 153 pairs of these 18 timbres.
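The Isomap pre-processing applied to each subject's dissimilarity matrix (Section 2.2) might be sketched as follows. This is a minimal illustration rather than the code used for the analysis: it assumes the fixed-$k$ neighbourhood rule, a symmetric input matrix, and the Floyd-Warshall algorithm as the all-pairs shortest-path step; the function name `geodesic_dissimilarities` and the four-item toy matrix are hypothetical.

```python
import math


def geodesic_dissimilarities(dissim, k):
    """Keep each item's k smallest dissimilarities and rebuild the rest as path lengths.

    Assumes `dissim` is a symmetric square matrix (list of lists) with a zero diagonal.
    """
    n = len(dissim)
    graph = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        # The k nearest neighbours of item i, excluding i itself.
        others = sorted((j for j in range(n) if j != i), key=lambda j: dissim[i][j])
        for j in others[:k]:
            # Keep the edge if either item counts the other among its neighbours.
            graph[i][j] = graph[j][i] = dissim[i][j]
    # Floyd-Warshall: shortest path between every pair through the neighbourhood graph.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                via = graph[i][m] + graph[m][j]
                if via < graph[i][j]:
                    graph[i][j] = via
    return graph  # entries stay infinite if the neighbourhood graph is disconnected


# Toy matrix (hypothetical): four items along a curve, so that the rated
# dissimilarity between the end points (2.2) understates the path length.
ratings = [
    [0.0, 1.0, 1.8, 2.2],
    [1.0, 0.0, 1.0, 1.8],
    [1.8, 1.0, 0.0, 1.0],
    [2.2, 1.8, 1.0, 0.0],
]
geo = geodesic_dissimilarities(ratings, k=1)
```

In the toy example the local ratings of 1.0 are kept and the larger ones are replaced by sums of local steps. For the data re-analysed here, each subject's matrix would cover 18 timbres, i.e., 18 × 17 / 2 = 153 pairs, and the transformed matrices would then be passed to CLASCAL in place of the raw ratings.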

            Isomap (non-linear)          Linear
             1      2      3       1      2      3      4
Rise       -0.87  -0.45  -0.72   -0.94   0.44   0.30   0.26
S.C.        0.03  -0.79   0.07   -0.00  -0.14   0.09   0.91
Flux       -0.15   0.18   0.22    0.02   0.10   0.49  -0.15
Isomap 1                          0.92  -0.35  -0.53  -0.15
Isomap 2                          0.42   0.18  -0.21  -0.88
Isomap 3                          0.82  -0.67   0.24  -0.16

Table 1: Correlation coefficients of significant dimensions in the linear and Isomap models against each other and log rise (attack) time, spectral centroid, and spectral flux. Values in bold are significant at p = 0.005; values in italic are significant at p = 0.1.

3.1. Dimensionality and class membership

As the authors of CLASCAL recommend, we used the standard Bayesian information criterion (BIC) to establish the dimensionality of the MDS spaces and the special Monte-Carlo technique mentioned above to determine the number of latent subject classes. We found that the traditional linear model produces a four-dimensional space when including specificities, which, surprisingly, is of higher dimensionality than the space uncovered in [7] for a larger set of subjects. The reason for this difference is likely that only three latent classes are necessary to get the best fit for our data as opposed to the five necessary for the larger set. The Isomap model was able to reduce the space back to three dimensions with the same number of latent classes, although the class membership differs.

The lower right-hand corner of Table 1 presents the correlation coefficients between the dimensions of the linear and Isomap models. The leading dimensions are very strongly correlated (r = 0.92). The second dimension of the linear space, however, correlates best with the trailing dimension of the non-linear space, and the trailing dimension of the linear space correlates strongly with the second dimension of the non-linear space.
Because the dimensions are ordered according to their perceptual importance, this last result is somewhat surprising: although [7] employs the same linear model as ours, our non-linear model yields a better match to their results. In a further wrinkle, the leading dimension of the linear model correlates better with the trailing dimension of the non-linear model than the second does.

The table also presents correlations with McAdams et al.'s acoustic correlates of the dimensions of timbre space: (log) rise, or attack, time, the spectral centroid, and spectral flux, i.e., a measure of change in the spectral shape throughout the duration of the stimulus. The first two dimensions of the non-linear space correlate with attack time and spectral centroid; in keeping with the relationship between the dimensions of the linear and non-linear models, it is the leading and trailing dimensions of the linear model that exhibit similar correlations. Consistent with the findings of [7], spectral flux correlates only weakly with the model dimensions. The problematic trailing dimension of the non-linear space again exhibits an unexpected correlation: it correlates with log rise time at a level of significance (p = 0.005) comparable to that of the leading dimension.

Instrument     Abbreviation       Lin.   N.-L.
French horn    hrn                0.83   2.75
Trumpet        tpt                0.47   1.56
Trombone       trn                2.01   1.80
Harp           hrp                1.29   0.87
Trumpar        tpr = tpt + gtr    1.93   1.73
Oboleste       ols = obo + cel    1.65   2.46
Vibraphone     vbs                1.74   3.43
Striano        sno = str + pno    1.99   1.78
Harpsichord    hcd                2.56   3.55
English horn   can                1.23   2.85
Bassoon        bsn                1.50   0.77
Clarinet       cnt                2.10   4.22
Vibrone        vbn = vbs + trn    2.12   2.36
Obochord       obc = obo + hcd    0.00   3.53
Guitar         gtr                1.53   2.47
Strings        stg                1.35   1.76
Piano          pno                2.20   2.71
Guitarnet      gtn                1.26   2.98

Table 2: Square roots of the specificity values for the linear and non-linear models. Names of hybrid timbres are printed in italic.
These coefficients confirm the commonly held result that attack time is a dominating component of the perception of timbre and that spectral centroid is also a significant component. Like previous studies, however, our results leave room for at least one further component that could explain the trailing dimension of our non-linear space or the second and third dimensions of our linear one. Spectral flux, the explanatory power of which has already been called into question [2], fares notably poorly.

3.2. Outliers and specificities

A somewhat surprising result emerged when examining our data for outliers. Although no outliers were detected in the untransformed data, an analysis of the latent class assignments revealed one outlier after the transformation. Fundamentally, the Isomap transformation emphasises the effect of fine-grained distinctions between those timbres a subject perceives as fairly similar and discounts the effect of coarse-grained distinctions between timbres a subject perceives as largely dissimilar. Thus, this outlying subject uses strategies comparable to other subjects' when making coarse distinctions but a unique strategy to make fine distinctions. It bears further study to determine exactly how this strategy differs from the others.

Although the outlier situation and class membership differ between the two models overall, they appear to have one latent class in common, i.e., with the same membership. Looking at the weights $w_{c(i),r}$ and $v_{c(i)}$, one can see that the defining characteristic of this common group is an emphasis on specificities when making timbre similarity judgements. Given this result, one would expect that the specificity values would be similar between the two models, but this is not the case. Table 2 presents these values for both models. One can see that specificity values
are higher for the non-linear model in general and that the correlation is poor (r = 0.23).

[Figure 2 plots: panel (a) Linear MDS and panel (b) Isomap, each showing the 18 timbres by abbreviation; labelled axis: Spectral centroid.]

Figure 2: Two timbre spaces based on [7], one generated with linear MDS and the other generated with Isomap on averaged subject spaces. The unlabelled axes correspond to dimensions 2 and 3 in the respective CLASCAL spaces.

3.3. Timbre spaces

The principal advantage of the non-linear model is that it can preserve spatial structure with fewer dimensions. Figure 2 presents the complete non-linear space and dimensions 1, 2, and 4 of the linear space (so as to preserve the dimensions that correlate with the non-linear model). The third axes are left blank because it is difficult to interpret the meaning of these dimensions. As one would expect from the strong correlations in Table 1, the spatial groups of the models are similar, although three-dimensional rotation illustrates that the points in the non-linear space are clustered more tightly.

4. SUMMARY AND FUTURE WORK

When applied to experiments on human timbre perception, Isomap appears to retain the most desirable aspects of the global structure of linear models while tightening the local structure of the resulting timbre space. Our results raise interesting questions about the differing strategies subjects use when distinguishing highly similar vs. highly dissimilar timbres and suggest that there remains at least one dimension of human timbre perception that the community has yet to learn how to interpret.

5. REFERENCES

[1] J. A. Burgoyne and L. K. Saul, "Visualization of low-dimensional structure in tonal pitch space," in Proc. Int. Comp. Mus. Conf., 2005, pp. 243-46.
[2] A. Caclin, S.
McAdams, B. K. Smith, and S. Winsberg, "Acoustic correlates of timbre space dimensions: A confirmatory study using synthetic tones," J. Acoust. Soc. Am., vol. 118, no. 1, pp. 471-82, 2005.
[3] J. D. Carroll and J.-J. Chang, "Analysis of individual differences in multidimensional scaling via an n-way generalization of 'Eckart-Young' decomposition," Psychometrika, vol. 35, no. 3, pp. 283-319, 1970.
[4] J. M. Grey, "Multidimensional perceptual scaling of musical timbre," J. Acoust. Soc. Am., vol. 61, pp. 1270-77, 1977.
[5] J. M. Grey and J. W. Gordon, "Perceptual effects of spectral modifications on musical timbres," J. Acoust. Soc. Am., vol. 63, no. 5, pp. 1493-1500, 1978.
[6] C. L. Krumhansl, "Why is musical timbre so hard to understand?" in Structure and Perception of Electroacoustic Sound and Music, ser. Excerpta Medica, S. Nielzen and O. Olsson, Eds. Amsterdam: Elsevier, 1989, no. 846.
[7] S. McAdams, S. Winsberg, S. Donnadieu, G. De Soete, and J. Krimphoff, "Perceptual scaling of synthesized musical timbres: Common dimensions, specificities, and latent subject classes," Psych. Res., vol. 58, pp. 177-92, 1995.
[8] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, pp. 2319-23, 22 December 2000.
[9] W. S. Torgerson, Theory and Methods of Scaling. New York: Wiley, 1958.
[10] S. Winsberg and J. D. Carroll, "A quasi-nonmetric method for multidimensional scaling via an extended Euclidean model," Psychometrika, vol. 54, no. 2, pp. 217-29, 1989.
[11] S. Winsberg and G. De Soete, "A latent class approach to fitting the weighted Euclidean model, CLASCAL," Psychometrika, vol. 58, no. 2, pp. 315-30, 1993.
[12] S. Winsberg and G. De Soete, "Multidimensional scaling with constrained dimensions: CONSCAL," Br. J. Math. Stat. Psychol., vol. 50, pp. 55-72, 1997.