ON TIMBRE BASED PERCEPTUAL FEATURE FOR SINGER IDENTIFICATION

Swe Zin Kalayar Khine, Tin Lay Nwe, Haizhou Li
Institute for Infocomm Research
21 Heng Mui Keng Terrace, Singapore 119613

ABSTRACT

Timbre can be defined as the feature of an auditory stimulus that allows us to distinguish sounds which have the same pitch and loudness. In this paper, we explore a timbre based perceptual feature for singer identification. We start with a vocal detection process to extract the vocal segments from the songs. The cepstral coefficients, which reflect timbre characteristics, are then computed from the vocal segments. The cepstral coefficients of timbre are formulated by combining harmonic information with the dynamic characteristics of the sound, such as vibrato and the attack-decay envelope. Bandpass filters spread according to the octave frequency scale are used to extract the vibrato and harmonic information of sounds. The experiments are conducted on a database of 84 popular songs. The results show that the proposed timbre based perceptual feature is robust and effective. We achieve an average error rate of 12.2% in segment level singer identification.

1. INTRODUCTION

The rapid evolution of digital multimedia, computer and Internet technologies has enabled huge multimedia databases. With these continually growing databases, automatic music information retrieval (MIR) has become increasingly important. Singer identification (SingerID) is one of the important tasks in MIR: it is the process of identifying the singer of a song. In general, a SingerID process comprises three major steps. The first step is detecting singing segments (vocals) in a song. Vocals can be either pure singing voice or a mixture of singing voice with background instrumentals; segments without singing voice are nonvocal. The second step is singer feature computation, in which features are extracted from the vocal segments. The last step is formulating a singer classifier using the feature parameters.
In this paper, we propose new solutions for the second step, singer feature computation. Earlier studies in SingerID use features such as Mel Frequency Cepstral Coefficients (MFCC) [6]. Recently, studies have started looking into perceptually motivated features which are able to appreciate the aesthetic characteristics of the singing voice for music content processing and analysis. For example, vibrato motivated acoustic features are used to identify singers in [1, 11]. Besides vibrato, harmonic structure is also a useful feature for SingerID. In fact, the harmonics of a soprano singer's voice are widely spaced in the spectrum in contrast to those of a bass singer's voice [8]. Hence, the harmonic spectrum is useful to differentiate between low and high pitch singers.

One of the basic elements of music is timbre, or tone color. Timbre is the quality of sound which allows human ears to distinguish among different types of sounds [15]. Cleveland [3] states that an individual singer has a characteristic timbre that is a function of the laryngeal source and vocal tract resonances; timbre is assumed to be invariant for an individual singer. On the other hand, Erickson [6] states that the traditional concept of an invariant timbre associated with a singer is inaccurate, and that vocal timbre must be conceptualized in terms of transformations in perceived quality that occur across an individual singer's range and/or registers. In general, these studies suggest that timbre is either invariant for an individual singer, or that there is a particular range of timbre quality associated with an individual singer. In this paper, we study the use of timbre based features in the SingerID task. Poli [9] measured timbre quality from the spectral envelope of MFCC features to identify singers. In [17], timbre is characterized by the harmonic lines of the harmonic sound.
In this paper, we propose determining timbre from the harmonic content of a sound and its dynamic characteristics, such as vibrato and the attack-decay envelope [15]. The rest of this paper is organized as follows. In section 2, we present the method for vocal detection. In section 3, we study perceptually motivated acoustic features and their characteristics. In section 4, we describe the popular song database, the experiment setup and the results. Finally, we conclude our study in section 5.

2. VOCAL DETECTION

We extract vocal segments from songs. Subband based Log Frequency Power Coefficients (LFPC) [10] are used as acoustic features. We train hidden Markov models (HMMs) as vocal and nonvocal acoustic models to detect vocal segments. Vocal detection errors can affect SingerID performance. To reduce the vocal detection error, we formulate vocal detection as a hypothesis test [12] based on a confidence score. In this way, we only retain vocal segments for which the acoustic models have high confidence: vocal segments whose confidence measure is higher than a predetermined threshold are accepted for the SingerID experiment.

3. ACOUSTIC FEATURES

We next study several perceptually motivated features, namely harmonic, vibrato and timbre features, to characterize song segments. We propose to use subband filters on the octave frequency scale in formulating these acoustic features.
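The paper does not spell out the exact form of the confidence score. As an illustration only, the sketch below assumes a per-frame log-likelihood ratio between the vocal and nonvocal HMMs, thresholded to retain high-confidence segments; the function and field names are hypothetical, not the authors' implementation:

```python
def confidence_score(loglik_vocal, loglik_nonvocal, n_frames):
    """Per-frame log-likelihood ratio: how much better the vocal HMM
    explains the segment than the nonvocal HMM (larger = more confident)."""
    return (loglik_vocal - loglik_nonvocal) / n_frames

def retain_vocal_segments(segments, threshold):
    """Keep only segments whose confidence clears the predetermined
    threshold; low-confidence segments are discarded before SingerID."""
    return [s for s in segments
            if confidence_score(s["ll_vocal"], s["ll_nonvocal"],
                                s["frames"]) >= threshold]
```

Under this scheme, raising the threshold trades vocal-segment recall for purity of the retained set.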

3.1. Harmonic

Sopranos have a higher fundamental frequency than bass singers. Hence, the harmonics of a soprano's voice are widely spaced, in contrast to those of bass singing. The upper panels of Figure 1(a) and (b) show examples of the harmonic spectrum of soprano and bass singing respectively. To capture this information, we implement harmonic filters with the centre frequencies located at each of the musical notes, as in the middle panels of Figure 1(a) and (b). The list of the frequencies of the musical notes can be found in [7]. The subband filters span up to 8 octaves (16 kHz). Each octave has 12 subbands, giving 96 subbands in total. The output of the subband filtering is given in the lower panels of Figure 1. For the soprano, the lower panel of Figure 1(a) shows widely spaced peaks, whereas the peaks are narrowly spaced in the lower panel of Figure 1(b) for the bass singer.

Figure 1. Harmonics and harmonic filtering. (a) Soprano singer (b) Bass singer.

3.2. Vibrato

Vibrato is a periodic, rather sinusoidal, modulation of the pitch and amplitude of a musical tone [14]. Vocal vibrato can be seen as a function of the style of singing associated with a particular singer [11]. Vibrato is characterized by two parameters: the extent (or excursion) and the rate (the inverse of the vibrato period), as illustrated in Figure 2(a). Female singers tend to have a slightly faster mean vibrato rate than male singers [4]. Vibrato excursions occurring at the tone D6 for three different singers are shown in Figures 2(a), (b) and (c). In Figure 2(c), the vibrato excursion up and down from the note is balanced. However, unbalanced vibrato excursions can be seen in Figures 2(a) and (b). According to [2], such irregular vibrato excursions are very common in most tones. In [5], vibrato extent is categorized into two different types, 'wobble' and 'bleat'. Wobble has a wider pitch fluctuation and a slower vibrato rate, as in Singer-A. Bleat has a narrower pitch fluctuation and a faster rate, as in Singer-C. Hence, the following information is integrated into the acoustic feature: 1) the regularity or irregularity of the vibrato excursion, 2) the two vibrato types, 'wobble' and 'bleat', and 3) the vibrato rate.

Figure 2. Vibrato waveforms of 3 singers (Singer-A, Singer-B, Singer-C) at note D6, 1174.6 Hz.

As a result of the modulation of pitch, the frequencies of all the overtones vary regularly and in sync with the pitch frequency modulation [13]. Therefore, we implement subband filters with the center frequencies located at each of the musical notes to characterize the vibrato. The list of the frequencies of the musical notes can be found in [7]. Since the singing voice contains high frequency harmonics [16], our subband filters span up to 8 octaves (16 kHz). The filters are implemented with a bandwidth of ±1.5 semitones around each note, since the vibrato extent can increase to more than ±1 semitone when a singer raises his or her vocal loudness [13]. We employ cascaded subband filters (referred to as the vibrato filter) [11] to capture vibrato information from the acoustic signal.

We illustrate the vibrato filter [11] and the subband outputs in Figures 3 and 4. The filter has two cascaded layers of subbands. The first layer has overlapped trapezoidal filters. The second layer has 5 non-overlapped rectangular filters of equal bandwidth for each trapezoidal subband. The trapezoidal filters are tapered between ±0.5 semitone and ±1.5 semitones. With the output from the vibrato filter, we are able to track the local maxima to derive the vibrato extent [11]. The vibrato fluctuations are observed by tracking the local maxima in the instantaneous amplitude output of the subbands in the second layer, as shown in the lower panel of Figure 4.

Figure 3. A bank of cascaded subband filters.

Figure 4. Vibrato fluctuations and vibrato filtering observed at the note G#5, 830.6 Hz: (a) vibrato fluctuates left, (b) no fluctuation, (c) vibrato fluctuates right. The upper panel shows the spectrum partial, the middle panel the frequency response of the vibrato filter, and the lower panel the instantaneous amplitude output of the vibrato filter.
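To make the octave-scale filterbank layout concrete, the following sketch computes per-note log energies from one frame's power spectrum. This is a simplified illustration, not the authors' implementation: it uses rectangular ±1.5-semitone bands rather than the cascaded trapezoidal filters, and it assumes 12-tone equal-tempered note spacing with a starting note of C2 (≈ 65.4 Hz) so that 96 subbands cover roughly 16 kHz:

```python
import math

def note_frequencies(f0=65.406, octaves=8):
    """Center frequencies of the 96 subbands: 12 equal-tempered
    semitones per octave over 8 octaves, starting from C2 (~65.4 Hz)."""
    return [f0 * 2 ** (k / 12.0) for k in range(12 * octaves)]

def subband_log_energies(power_spectrum, sample_rate, bandwidth_semitones=1.5):
    """Sum spectral power inside +/- `bandwidth_semitones` of each note
    and return the per-band log energies. `power_spectrum` holds bins
    from 0 Hz to Nyquist, so bin k maps to k * (sample_rate/2) / n Hz."""
    n = len(power_spectrum)
    hz_per_bin = (sample_rate / 2.0) / n
    energies = []
    for fc in note_frequencies():
        lo = fc * 2 ** (-bandwidth_semitones / 12.0)
        hi = fc * 2 ** (bandwidth_semitones / 12.0)
        e = sum(p for k, p in enumerate(power_spectrum)
                if lo <= k * hz_per_bin < hi)
        energies.append(math.log(e + 1e-10))  # floor avoids log(0)
    return energies
```

Because band edges grow geometrically with frequency, low-frequency bands are only a few Hz wide while the top bands span hundreds of Hz, which is what lets widely and narrowly spaced harmonics produce distinct band-energy patterns.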

Local maxima indicate the position of the vibrato. The distance between the center frequency of the corresponding filter and the local maximum gives the vibrato extent. The tapered and overlapped trapezoidal filters in the first layer allow the vibrato fluctuations of adjacent notes, observed at the output of the second-layer subbands, to be 'continuous'. The vibrato filter thus captures irregularities or regularities in the vibrato excursion as well as the 'wobble' and 'bleat' vibrato types.

3.3. Timbre

Sounds may be generally characterized by pitch, loudness and quality. For sounds that have the same pitch and loudness, the sound quality, or timbre, describes the characteristics which allow human ears to distinguish among them. Timbre is a general term for the distinguishable characteristics of a tone. It is mainly determined by the harmonic content of a sound and the dynamic characteristics of the sound, such as vibrato and the attack-decay envelope [15]. The attack-decay processes of two different singers are shown in Figure 5(a) and (b). Singer-1's voice takes more time to develop to its peak than that of Singer-2, and the decay process of Singer-1 is more gradual than that of Singer-2.

Figure 5. Attack-decay envelopes of (a) Singer-1 and (b) Singer-2.

Studies found that it takes a duration of about 60 ms to recognize the timbre of a tone; if a tone is shorter than 4 ms, it is perceived as an atonal click [15].

3.4. Cepstral Coefficient Computation

We first divide the music signal into frames of 20 ms with 13 ms overlap and apply a Hamming window to each frame to minimize signal discontinuities at the frame boundaries. Each audio frame is passed through the harmonic filters for harmonic content analysis to derive the log energy of each band.
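The attack portion of the envelope can be quantified in several ways. As one hedged illustration (not the paper's method), the sketch below derives a short-time RMS amplitude envelope and reports the time taken to first reach 90% of its peak, a rough proxy for attack duration:

```python
def rms_envelope(samples, frame=256, hop=128):
    """Short-time RMS amplitude envelope of a mono signal."""
    env = []
    for start in range(0, max(1, len(samples) - frame + 1), hop):
        chunk = samples[start:start + frame]
        env.append((sum(x * x for x in chunk) / len(chunk)) ** 0.5)
    return env

def attack_time_ms(env, hop, sample_rate, frac=0.9):
    """Time from the first frame until the envelope first reaches
    `frac` of its peak -- a crude attack-duration estimate."""
    peak = max(env)
    for i, e in enumerate(env):
        if e >= frac * peak:
            return 1000.0 * i * hop / sample_rate
    return 0.0
```

A gradual-attack voice like Singer-1's would yield a larger attack time than a percussive onset like Singer-2's under this measure.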
Finally, we compute a total of 13 Octave Frequency Cepstral Coefficients (OFCChar) from the log energies using the Discrete Cosine Transform. We then replace the harmonic filters with the vibrato filters to compute the OFCCvib coefficients. To account for timbre characteristics, the output log energies of the vibrato filter are augmented by those of the harmonic and Mel-scale filters, from which we compute 13 TimBre Cepstral Coefficients (TBCC). We augment the feature coefficients with time derivatives, or delta parameters, computed from the two neighbouring frames to capture temporal information. For example, the delta parameters take care of the vibrato rate and the attack-decay envelope in OFCCvib and TBCC respectively.

4. EXPERIMENTS AND DISCUSSION

We compile a database of 84 popular songs from the albums of 12 solo singers. Examples of album titles are 'My Own Best Enemy' (singer: Richard Marx) and 'Like a Virgin' (singer: Madonna). A total of 7 songs are extracted from each album: four songs of each singer are allocated to TrainDB and the remaining 3 songs to TestDB. Vocal and nonvocal segments of each song are manually annotated to provide the ground truth. We define the error rate (ER) as the number of errors divided by the total number of test trials.

We first segment a song into 1-second fixed-length segments. Then, each segment is classified as vocal or nonvocal using the method described in section 2. The average vocal/nonvocal classification error rate of the vocal detection system is 8.7%. Using the vocal segments from the vocal detector, SingerID experiments are then conducted. We use continuous density HMMs with four states and two Gaussian mixtures per state for all HMM models in our experiments. Using the TrainDB, we train a singer model, λs, for each of the 12 singers. To identify the singer of a vocal segment O, we estimate the likelihood score of O being generated by each of the 12 singer models. The model with the highest likelihood suggests the singer identity.
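The cepstral computation described above can be sketched as follows. This is an illustrative simplification: a plain (unnormalized) DCT-II over the band log energies, truncated to 13 coefficients, and a simple symmetric difference over the two neighbouring frames for the delta parameters:

```python
import math

def cepstral_coefficients(log_energies, n_coeffs=13):
    """DCT-II of the per-band log energies, truncated to the first
    n_coeffs coefficients (the static OFCC/TBCC features)."""
    n = len(log_energies)
    return [sum(e * math.cos(math.pi * k * (m + 0.5) / n)
                for m, e in enumerate(log_energies))
            for k in range(n_coeffs)]

def delta_features(frames):
    """Delta parameters from the two neighbouring frames:
    d[t] = (c[t+1] - c[t-1]) / 2, with the edge frames clamped."""
    deltas = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, len(frames) - 1)]
        deltas.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return deltas
```

Truncating to the first 13 DCT coefficients keeps the smooth spectral-envelope shape and discards fine band-to-band detail, which is why these coefficients reflect timbre rather than pitch.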
We conduct experiments to compare the SingerID performance of five feature types, namely TBCC, OFCCvib, OFCChar, MFCC and LPCC. The window size is 20 ms and the frame shift is 7 ms in all tests.

Feature   TBCC   OFCCvib   OFCChar   MFCC   LPCC
ER (%)    12.2   12.8      13.6      12.9   21.9

Table 1. Error rate (ER%) of SingerID on TestDB.

Table 1 shows that the TBCC feature, with an average error rate of 12.2%, outperforms all other features. As mentioned in section 3.3, a duration of about 60 ms is necessary to recognize the timbre of a tone. Hence, the error rate of 12.2% may not reflect the optimal performance of TBCC, since the window size is only 20 ms. We therefore conduct further experiments with the TBCC feature at different window sizes, with a fixed frame shift of 7 ms in all tests. The results in Table 2 show that performance peaks at a window size of 63 ms, giving the best error rate of 11.9%. At this setting, the timbre based feature captures the singer characteristics well, with error reductions of 0.9% and 1.7% absolute (7% and 12.5% relative) over the OFCCvib (reported earlier in [11]) and OFCChar features respectively. We believe that a window size of around 60 ms is suitable for extracting timbre characteristics from a music signal.

Window size (ms)   59     60     61     63     64     65
ER (%)             12.1   15.3   15     11.9   15.4   13.6

Table 2. Error rate (ER%) on TestDB with different window sizes of TBCC.
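For clarity, the segment-level decision rule and the error-rate definition used above can be sketched as below; the dictionary-based scoring interface is a hypothetical stand-in for the HMM likelihood computation:

```python
def identify_singer(likelihoods):
    """Return the singer whose model assigns the vocal segment the
    highest likelihood score (the argmax decision rule)."""
    return max(likelihoods, key=likelihoods.get)

def error_rate(predictions, ground_truth):
    """ER = number of misclassified test trials / total test trials."""
    errors = sum(1 for p, t in zip(predictions, ground_truth) if p != t)
    return errors / len(predictions)
```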

Both vocal and instrumental sounds have musical characteristics such as harmonics, vibrato and timbre. This raises the question of whether the three features TBCC, OFCCvib and OFCChar capture these musical characteristics from the vocal or from the instrumental sound. To look into this, we conduct SingerID experiments using manually annotated nonvocal segments. The SingerID performance using (a) vocal segments and (b) nonvocal segments is presented in Table 3.

Feature    TBCC   OFCCvib   OFCChar
Case (a)   12.2   12.8      13.6
Case (b)   60.1   66.7      60.6

Table 3. Average error rates using (a) vocal segments and (b) nonvocal segments.

Unsurprisingly, in the presence of vocal timbre, vibrato and harmonics, the three features work best, as in Case (a). In the absence of vocal timbre, vibrato and harmonics, as in Case (b), the error rate increases. This is because the singing voice usually stands out from the background musical accompaniment [13], and the three features capture musical characteristics from the vocal rather than from the background instruments.

5. CONCLUSIONS

We have presented an approach for singer identification of popular songs. The proposed approach explores perceptually motivated timbre characteristics for SingerID. The main contributions of this paper are summarized as follows: 1) we propose several perceptually motivated features using harmonic, vibrato and timbre information to represent a singer's characteristics; 2) with these features, we found a strong correlation between singer characteristics and system performance. We conclude that perceptually motivated features, especially timbre features, are effective in improving system performance.

6. REFERENCES

[1] Bartsch, M.A., and Wakefield, G.H. "Singing Voice Identification Using Spectral Envelope Estimation," IEEE Transactions on Speech and Audio Processing, vol. 12, pp. 100-109, March 2004.

[2] Bretos, J., and Sundberg, J.
"Measurements of Vibrato Parameters in Long Sustained Crescendo Notes as Sung by Ten Sopranos," Journal of Voice, vol. 17, pp. 343-352, 2003.

[3] Cleveland, T.F. "Acoustic Properties of Voice Timbre Types and Their Influence on Voice Classification," Journal of the Acoustical Society of America, vol. 61, pp. 1622-1629, June 1977.

[4] Dejonckere, P.H., Hirano, M., and Sundberg, J. Vibrato, San Diego: Singular Publishing, 1995, ch. 2.

[5] Dromey, C., Carter, N., and Hopkin, A. "Vibrato Rate Adjustment," Journal of Voice, vol. 17, pp. 168-178, 2003.

[6] Erickson, M., Perry, S., and Handel, S. "Discrimination Functions: Can They Be Used to Classify Singing Voices?" Journal of Voice, vol. 15, pp. 492-502, December 2001.

[7] Everest, F.A. Master Handbook of Acoustics, New York: McGraw-Hill Professional, 2000.

[8] Joliveau, E., Smith, J., and Wolfe, J. "Vocal Tract Resonances in Singing: The Soprano Voice," Journal of the Acoustical Society of America, vol. 116, pp. 2434-2439, 2004.

[9] Poli, G.D., and Prandoni, P. "Sonological Models for Timbre Characterization," Journal of New Music Research, vol. 26, pp. 170-197.

[10] Nwe, T.L., Foo, S.W., and De Silva, L.C. "Stress Classification Using Subband Based Features," IEICE Transactions on Information and Systems, Special Issue on Speech Information Processing, vol. E86-D, no. 3, pp. 565-573, March 2003.

[11] Nwe, T.L., and Li, H. "Exploring Vibrato-Motivated Acoustic Features for Singer Identification," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 2, 2007.

[12] Sukkar, R.A., and Lee, C.H. "Vocabulary Independent Discriminative Utterance Verification for Nonkeyword Rejection in Subword Based Speech Recognition," IEEE Transactions on Speech and Audio Processing, vol. 4, pp. 420-429, November 1996.

[13] Sundberg, J. The Science of the Singing Voice, Northern Illinois University Press, 1987.

[14] Timmers, R., and Desain, P. "Vibrato: Questions and Answers from Musicians and Science," Proc. Int.
Conf. on Music Perception and Cognition, England, 2000.

[15] Winckel, F. Music, Sound and Sensation, New York: Dover, 1967.

[16] Zhang, T. "System and Method for Automatic Singer Identification," in Proceedings IEEE International Conference on Multimedia and Expo, Baltimore, MD, 2003.

[17] Zhang, T., and Kuo, C.C.J. Content-Based Audio Classification and Retrieval for Audiovisual Data Parsing, Kluwer Academic, USA, 2001.