Signal Analysis of the Singing Voice: Low-Order Representations of Singer Identity

Maureen Mellody (1), Gregory H. Wakefield, Ph.D. (2)
(1) Applied Physics Program, University of Michigan
(2) Department of Electrical Engineering and Computer Science, University of Michigan

Abstract

Signals from 12 soprano and mezzo-soprano singers are analyzed with the modal distribution, a high-resolution time-frequency analysis method, to obtain measures of instantaneous amplitude and frequency of the signals' partial components. These estimates are used to construct singer-specific transfer functions for each vowel and pitch analyzed. An AR approximation to a residual representation of these transfer functions is determined. This low-order residual approximation is used in a hierarchical clustering experiment. Using this representation, functions from each singer for a given vowel tend to cluster into a single class before classes collapse across singers. This result suggests that the low-order residual approximation captures singer-specific features in the sung signals.

1. INTRODUCTION

Human beings are remarkably adept at identifying a person's speaking voice with minimal information, and this same perceptual acuity applies when listening to the singing voice. The singing voice is unique in the listener's ability to easily distinguish among performers. What does the human auditory system extract from an audio signal to complete this identification task? In the present paper, we seek to identify, model, and synthesize singer-specific features of signals recorded from sopranos and mezzo-sopranos. The cues that a listener uses to identify a singer should be present in a signal-based analysis of musical passages generated by that singer. Accordingly, we are interested in identifying such cues to parameterize a singer's voice and to quantify individual differences in production across singers.
The signal-based analysis demonstrated in this paper is an improvement over existing physiology-based studies of the singing voice, because physiology-based techniques are invasive to the singer. This invasiveness can make the task disagreeable to the singer and, more importantly, may subtly alter the way in which he or she produces sound. In addition, a signal-based approach can give information about the musical performance that could not be extracted from a purely physical characterization. These expressive gestures are likely to be contributing factors to a singer's characteristic vocal quality.

The foundation of the present study is the ability to analyze and re-synthesize a given musical passage without loss of perceptual fidelity, that is, without degrading vocal identity or quality. This capability allows us to modify the extracted parameters to alter vibrato or formant structure and observe the perceptual effects. In this way we can determine which signal features are distinguishing characteristics of a particular singer, as a function of both pitch and vowel. In this work, we introduce the feature set obtained from our signal-based analysis and show results from some simple clustering methods applied to these feature vectors.

2. METHODS

2.1. Signal Acquisition

The voices of twelve sopranos and mezzo-sopranos (students or graduates of a voice MFA degree program) were digitally recorded in a recital hall. A clip-on microphone, attached at sternum level, was used to record 44.1 kHz, 16-bit, mono signals. A wide range of vocal material was recorded, from which only a small portion was used in this study. Specifically, the singers were asked to sing each of the Italian vowels on a three-note ascending-descending vocalise in full vibrato, holding each note for three seconds. Holding the vowel constant, the singers repeated the pattern every half-step in a two-octave range, e.g. G3 to E-flat5 for the mezzo-sopranos and B3 to G5 for the sopranos.

2.2. Signal Analysis Methods

A (high-order) transfer function for each combination of pitch, vowel, and singer is identified via a time-frequency analysis of the signal. The basic analysis steps are as follows:

1. Analyze the passage with the modal distribution method (Pielemeier and Wakefield, 1996), a high-resolution time-frequency technique, to extract the time-varying amplitude and frequency for the partials of the notes in the passage;

2. Create a (high-order) composite vocal tract transfer function for each combination of starting pitch, vowel, and singer (Mellody and Wakefield, 2000). This transfer function assumes a constant-amplitude input with time-varying frequency as obtained from the modal distribution estimates;
3. Determine a common transfer function for each vowel by averaging (in decibels) the transfer functions obtained for each singer-vowel-pitch combination with pitches below F4, which should lie below the first passaggio;
4. Obtain a residual transfer function for each combination of singer-vowel-pitch. Each composite transfer function is projected onto the common transfer function (both in decibels), and the component orthogonal to this projection is considered the residual.

The modal distribution time-frequency method has been shown to yield errors in frequency and amplitude estimation of only 0.1 cents and -60 dB, respectively (Pielemeier and Wakefield, 1994), a considerable improvement over Fourier techniques. These estimates, as well as the high-order transfer functions obtained, have been used to successfully synthesize sounds (Mellody and Wakefield, 2000).

2.3. Low-Order Modeling of the Residual Transfer Function

From the high-order transfer functions, low-order, perceptually robust approximations are constructed to synthesize new instances of sung vowels which bear the identity of the singer. To develop a low-order representation of a particular singer's vocal quality for a given vowel, the residuals are parameterized using autoregressive (AR) analysis. An exhaustive search is performed for each residual over model orders ranging from 10 to 100, in increments of 10, to determine the most appropriate model order.

2.4. Clustering Methods

A nonparametric, hierarchical clustering analysis was used to group the feature vectors determined from each singer-vowel-pitch residual transfer function (Jain and Dubes, 1988).
At each level, L, in the clustering, the number of clusters is (N-L), where N is the total number of feature vectors. The Euclidean distance is computed between all pairs of feature vectors that belong to different clusters. The minimum of these distances is found, and the two clusters to which the corresponding feature vectors belong are merged to form the next clustering level.

3. RESULTS

3.1. Residual Transfer Functions

The use of a source-filter model to parameterize a given singer relies upon the static filter properties of each singer's output. That is, if singer identity is, in part, reflected in the low-order model of the residual transfer function, we expect relatively small variations among the transfer functions obtained from one singer when compared to the variations across singers.

[Figure 1: Residual transfer functions, shown in dB units as a function of frequency (horizontal axis: Freq, Hz, 0 to 10000). The top panel shows residual transfer functions determined from the scales beginning on E4 (solid) and F4 (dotted) for singer S01 performing the vowel /i/. The lower panel shows residual transfer functions determined from two scales beginning on E4. The solid line is estimated from singer S01 (same as upper panel) while the dotted line is estimated from singer S06.]

An example of the amount of variability observed in our transfer functions is shown in Figure 1. The top panel of Figure 1 shows two residual transfer functions from singer S01, obtained from two phrases sung a half-step apart (on E4 and F4). The lower panel shows two residual functions on the same pitch (E4), but from two different singers. The squared norm of the difference between the two functions in the top panel is less than one tenth of that for the two functions in the lower panel.
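The residuals compared here come from steps 3 and 4 of Section 2.2. The following is a minimal sketch of those two steps and of the squared-norm comparison, not the authors' implementation. It assumes that the projection uses the ordinary Euclidean inner product over frequency samples, which the text does not state; the function names and the 4-point toy curves are hypothetical.

```python
def dot(a, b):
    """Euclidean inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def common_transfer_function(curves_db):
    """Point-by-point average of several curves that are already in dB."""
    count = len(curves_db)
    return [sum(values) / count for values in zip(*curves_db)]


def residual(composite_db, common_db):
    """Part of composite_db that is orthogonal to common_db.

    Assumption: the projection is taken with the plain dot product.
    """
    scale = dot(composite_db, common_db) / dot(common_db, common_db)
    return [h - scale * c for h, c in zip(composite_db, common_db)]


def squared_norm_of_difference(a, b):
    """Squared Euclidean norm of (a - b)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical toy data: three 4-point "transfer functions" in dB,
# keyed by (singer, starting pitch). Real curves have 11,025 points.
composites_db = {
    ("S01", "E4"): [10.0, 20.0, 5.0, 0.0],
    ("S01", "F4"): [12.0, 18.0, 7.0, 2.0],
    ("S06", "E4"): [8.0, 22.0, 3.0, 1.0],
}

common_db = common_transfer_function(list(composites_db.values()))
residuals = {key: residual(curve, common_db)
             for key, curve in composites_db.items()}

# Same singer, pitches a half-step apart, versus same pitch, two singers.
within = squared_norm_of_difference(residuals[("S01", "E4")],
                                    residuals[("S01", "F4")])
across = squared_norm_of_difference(residuals[("S01", "E4")],
                                    residuals[("S06", "E4")])
print("within-singer:", within, "across-singer:", across)
```

The toy numbers are only there to exercise the functions and say nothing about the measured ratio reported for Figure 1.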
Small within-singer variation relative to large across-singer variation is typical of all twelve singers analyzed. The within- and across-singer comparison is also illustrated in Figure 2 for the vowel /i/. This figure shows average difference values between singers, where the values are determined by computing the average squared norm of the difference between each pair of functions, in dB units, for those functions that begin at or below F4. In general, the within-singer values (the diagonal of the matrix) are much smaller than the across-singer (off-diagonal) values. It should also be noted that the first seven singers in the figure (the lower-left corner) classify themselves as sopranos, while the remaining five (the upper-right corner) classify themselves as mezzo-sopranos.

[Figure 2: Average of the squared norm of the difference (in dB) between residual transfer functions for the vowel /i/. Only residual transfer functions with starting pitches at or below F4 were used in the averaging. The diagonal elements are the within-singer differences, while the off-diagonal elements show the across-singer differences. The first seven singers are sopranos and the remaining five are mezzo-sopranos. Axis label: Singer.]

3.2. Residual Parameterization

A feature space of dimension 11,025 is prohibitively large for clustering analysis or pattern classification; to reduce this feature space, we explored approximations to the residual transfer functions. Ultimately, the feature vector selected was the vector of coefficients obtained from the AR analysis. A model order of 50 was chosen to approximate the 11,025-point residual transfer function, as a compromise between the dB error of the fit relative to the original function and the computational cost of a large model order. Figure 3 shows the residual transfer function for singer S01 on the vowel /i/ beginning on E4 (solid line), the same residual shown in Figure 1. The dotted line shows the best-fit function for an order-50 AR approximation to this function. Figure 4 shows the vectors of AR coefficients for the two residual transfer functions plotted in the lower panel of Figure 1. These two vectors are estimated from two different singers, but from the same vowel and pitch.
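The text does not say how the AR coefficients are estimated from a residual given in dB. The sketch below shows one common route, offered as an assumption rather than as the authors' procedure: convert the dB curve to power, obtain autocorrelation lags by a cosine sum over the frequency samples, and solve the Yule-Walker equations with the Levinson-Durbin recursion. The frequency grid, the dB-to-power convention, and all function names are hypothetical.

```python
import math


def autocorrelation_from_db(curve_db, max_lag):
    """Autocorrelation lags 0..max_lag implied by a dB-valued spectrum.

    Assumptions: curve_db holds K samples at the midpoint frequencies
    pi * (m + 0.5) / K (evenly spaced from 0 to Nyquist), and dB means
    10 * log10(power).
    """
    power = [10.0 ** (level / 10.0) for level in curve_db]
    count = len(power)
    return [
        sum(p * math.cos(lag * math.pi * (m + 0.5) / count)
            for m, p in enumerate(power)) / count
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for the given order.

    Returns ([1, a1, ..., ap], error power) for the all-pole model
    error / |1 + a1 e^{-jw} + ... + ap e^{-jpw}|^2.
    """
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        reflection = -acc / error
        previous = a[:]
        for j in range(1, i):
            a[j] = previous[j] + reflection * previous[i - j]
        a[i] = reflection
        error *= 1.0 - reflection ** 2
    return a, error


def ar_features(curve_db, order):
    """Feature vector: the AR coefficients a1..ap (leading 1 dropped)."""
    r = autocorrelation_from_db(curve_db, order)
    coefficients, _ = levinson_durbin(r, order)
    return coefficients[1:]


# Self-check on a synthetic all-pole spectrum with known coefficients
# (hypothetical values; the paper's residuals have 11,025 points and
# are fit with order 50).
true_a = [1.0, -0.9, 0.5]
count = 512
spectrum_db = []
for m in range(count):
    w = math.pi * (m + 0.5) / count
    re = sum(c * math.cos(k * w) for k, c in enumerate(true_a))
    im = -sum(c * math.sin(k * w) for k, c in enumerate(true_a))
    spectrum_db.append(-10.0 * math.log10(re * re + im * im))

features = ar_features(spectrum_db, order=2)
print(features)
```

Under these assumptions, the feature vector used for clustering would correspond to `ar_features(residual_db, 50)` for each singer-vowel-pitch residual.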
[Figure 4: AR coefficients for a model order of 50, estimated for the residual transfer functions on the starting pitch E4 for the vowel /i/. The solid line is the coefficient vector for singer S01 and the dotted line is the coefficient vector for singer S06. Horizontal axis: coefficient number.]

3.3. Clustering Analysis

The dendrograms for the clustering results for these residual transfer functions are too large and intricate for display. Instead, Figure 5 shows the average number of different singers in each cluster, as a function of clustering level. Each panel in the figure shows the results for a different vowel. If the clustering were "perfect," then the plots would show a value of one for each clustering level until only 12 clusters remained, at which point the singer clusters would have to merge and there would be more than one singer per cluster. In each panel, the number of singers per cluster is equal to one for a large number of the clustering levels, with a standard deviation of zero or very close to it. Once the clustering level, L, becomes large (i.e., the number of clusters, (N-L), becomes small), different singers collapse into the same cluster.

This information can potentially be quite important. If we had perfect clustering, we would have no information about relative distances between the singer clusters. However, since the algorithm places multiple singers within clusters, we can use this information to determine the correlation between these clustering results and the perceptual response to singer identity: do listeners also think those two singers are similar?

The standard deviation of the number of singers per cluster is quite large for the larger values of clustering level. This indicates that certain clusters are single-singer, but that other clusters contain a large number of singers.
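The merging rule of Section 2.4 (join the two clusters containing the closest pair of feature vectors) is single-linkage clustering. The sketch below applies that rule and computes the quantity plotted in Figure 5: the mean and standard deviation of the number of distinct singers per cluster at each level L. It is a naive illustration, not the authors' code; the 2-D toy vectors, the labels, and the use of the population standard deviation are assumptions.

```python
import math
from statistics import mean, pstdev


def single_link_levels(vectors):
    """Yield the clusters (lists of indices) at levels L = 0, 1, ..., N-1.

    At level L there are N - L clusters. Each step merges the two clusters
    that contain the closest pair of vectors (Euclidean distance).
    """
    clusters = [[i] for i in range(len(vectors))]
    yield [c[:] for c in clusters]
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = min(math.dist(vectors[i], vectors[j])
                        for i in clusters[p] for j in clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
        yield [c[:] for c in clusters]


def singers_per_cluster(clusters, labels):
    """Mean and population std. dev. of distinct singers per cluster."""
    counts = [len({labels[i] for i in c}) for c in clusters]
    return mean(counts), pstdev(counts)


# Hypothetical toy data: two singers, three 2-D feature vectors each.
# The paper's feature vectors are the 50 AR coefficients.
vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = ["S01", "S01", "S01", "S06", "S06", "S06"]

for level, clusters in enumerate(single_link_levels(vectors)):
    avg, spread = singers_per_cluster(clusters, labels)
    print(level, len(clusters), avg, spread)
```

With well-separated singers, as in this toy set, the mean stays at one until the number of clusters drops below the number of singers, which is the "perfect" case described above.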
The large standard deviation at high clustering levels could be due to outliers in the feature vector set; clusters from two different singers may be more similar to each other than these outliers are to their own singer group. An example of a hierarchical clustering result is shown in Figure 6 for the vowel /a/ for low pitches, where the dendrogram has been cut at the clustering level L=61. In the figure, each cluster is represented by a horizontal bar, broken down into the different singer(s) present in this cluster. The singer groups are labeled on the figure; an "S" corresponds to a soprano and an "M" to a mezzo-soprano. The L=61 level corresponds to 12 clusters; with "perfect" clustering, this would be the level at which each singer mapped to a single, distinct cluster. In this example, however, three of the clusters contain feature vectors from multiple singers, and three other clusters have only a single member. This could indicate similarity between the singers who are clustered together. Additionally, it may indicate that the vectors that reside in the single-member clusters are outliers and not consistent with the normal production of that particular singer.

[Figure 3: Residual transfer functions for the vowel /i/ and note E4 for singer S01. The solid line is the original residual estimate, and the dotted line is an AR approximation, using a model order of 50. Horizontal axis: Freq, Hz, 0 to 10000.]

[Figure 5: Average number of singers per cluster as a function of clustering level, for the five vowels analyzed, one panel per vowel. The feature vectors used were the 50 autoregressive analysis coefficients (model order 50) for residual transfer functions obtained from starting pitches below F4. There are a total of twelve singers. Horizontal axis: Clustering Level, L.]

4. DISCUSSION AND CONCLUSIONS

The remarkable degree of consistency exhibited in a singer's estimated residual transfer functions (particularly within the pitch region below the first passaggio) suggests that these functions are characteristic of a singer and can be used for identification and classification purposes. A condensed representation of that residual, namely the vector of coefficients of the autoregressive model, appears to be a useful feature parameterization in which each singer's pattern acts as a fingerprint of that singer.

The present work has focused on the static resonance structures of the singing voice. Clearly, dynamic properties of the glottal source, vibrato in particular, may also contain singer-specific features. A formal perceptual experiment is necessary to determine the importance of the vibrato frequency features relative to the static filter features.
The addition of the vibrato information into the feature vector could dramatically improve the clustering performance, as anecdotal evidence suggests the vibrato features are strong perceptual cues for identity.

[Figure 6: Hierarchical clustering results for the final 12 clusters (corresponding to a cluster level of L = 61) for the vowel /a/. The twelve clusters are shown as horizontal bars, with the number of feature vectors from each singer shown and labeled. The "S" prefix indicates the singer was a soprano, and the "M" a mezzo-soprano. Horizontal axis: Number of feature vectors.]

The relatively compact representation proposed in Section 2.3 can be used to synthesize vowels that replicate a desired vocal color. However, although hierarchical clustering can separate singers using their feature vectors, this does not indicate that a listener uses the same features for perceptual identification and classification. The feature vector representation needs to be validated perceptually to correlate this representation with the auditory system response.

Finally, it should be noted that only isolated vowel exercises were analyzed. Singer-specific features in consonant production, context, and prosody must also be included to fully capture and understand how a singer makes her own unique sound.

5. REFERENCES

Jain, A.K. and Dubes, R.C. (1988). Algorithms for Clustering Data. Englewood Cliffs, New Jersey: Prentice Hall, Inc., Ch. 3.

Mellody, M. and Wakefield, G.H. (2000). "Modal distribution analysis and synthesis of a soprano's sung vowels," submitted to J of Voice.

Pielemeier, W.J. and Wakefield, G.H. (1996). "A high-resolution time-frequency representation for musical instrument signals," J Acoust Soc Am, 99(4): 2382-2396.

Pielemeier, W.J. and Wakefield, G.H. (1994). "Multi-component power and frequency estimation for a discrete TFD," Proc. of IEEE Symp. on Time-Freq. and Time-Scale Anal., Philadelphia, PA, 620-623.