Musical Analysis using a Real-Time Model of Peripheral Hearing

Tim Brookes, Andy Tyrrell, David Howard
Dept of Electronics, University of York, York. YO1 5DD. England.

Abstract

The sound spectrograph is one of the most widely used tools for the analysis of sound. In this paper the authors suggest that a superior analysis can be attained using a spectrographic display driven by the output of a real-time model of the human peripheral hearing system. Comparison of conventional and 'auditory' spectrographic analyses of a number of musical sounds reveals that many acoustic features are shown more clearly and with greater 'perceptual salience' on an auditory spectrogram than on a conventional one.

1. Introduction

One of the most widely used computational techniques in the analysis of sound is the fast Fourier transform [Dodge & Jerse, 1985] and it is common for a visual representation of the results of a Fourier analysis to be displayed in the form of a sound spectrogram [Howard, 1991]. This shows the change, with respect to time, of the distribution of acoustic energy across a number of frequency bands.

In this paper the authors suggest that a spectrographic analysis superior to that achievable with Fourier transforms can be attained using a model of the human peripheral hearing system [Brookes et al, 1994]. Such a model has been designed around the GammaTone [Darling, 1991] filter (a filter having an impulse response close to that of the human basilar membrane) and has been implemented to run in real-time on an array of second-generation transputers [INMOS, 1993]. The output of the model has been chosen to have a similar appearance to that of a conventional spectrograph, facilitating easy comparison between the two systems.
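As an illustration of the GammaTone filter mentioned above, the following minimal sketch samples a gammatone impulse response in plain Python. It is not the authors' transputer implementation. The filter order of 4, the equivalent rectangular bandwidth (ERB) formula and the 1.019 bandwidth factor are assumptions taken from the general hearing-modelling literature rather than from this paper, and the function names, sample rate and duration are hypothetical choices made for the example.

```python
import math


def erb_hz(fc):
    """Equivalent rectangular bandwidth in Hz at centre frequency fc.

    Assumption: the widely used Glasberg & Moore fit, not a formula
    stated in the paper.
    """
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)


def gammatone_impulse_response(fc, fs=16000, duration=0.05, order=4):
    """Peak-normalised samples of t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t).

    fs, duration and order are illustrative defaults (assumptions).
    """
    b = 1.019 * erb_hz(fc)  # bandwidth parameter, widens as fc rises
    n_samples = round(duration * fs)
    samples = []
    for k in range(n_samples):
        t = k / fs
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        samples.append(envelope * math.cos(2.0 * math.pi * fc * t))
    peak = max(abs(s) for s in samples)
    return [s / peak for s in samples]


if __name__ == "__main__":
    for fc in (250.0, 1000.0, 4000.0):
        ir = gammatone_impulse_response(fc)
        print(f"fc = {fc:6.0f} Hz  ERB = {erb_hz(fc):6.1f} Hz  samples = {len(ir)}")
```

Under these assumptions the bandwidth parameter grows with centre frequency, so the envelope of a high-frequency channel decays faster than that of a low-frequency channel.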
Conventional and 'auditory' spectrographic analyses performed on a number of musical sounds are presented, and features visible on the 'auditory' spectrograms which are either less clear or even invisible on the conventional ones are identified. The perceptual relevance of the auditory display is examined and possible applications for the system are suggested.

2. Auditory Spectrograph

There are essentially three differences between the conventional spectrograph and the 'auditory' version:

• Firstly, the bandwidth of the analysing filter used in conventional spectrography is fixed, typically at either 45Hz or 300Hz (originally chosen to be either narrow or wide with respect to the typical fundamental frequency of male speech). The bandwidth of the GammaTone filter, however, is approximately proportional to its centre frequency.

• Secondly, the frequency scale upon which a conventional spectrogram is plotted is linear, whereas the scale used in this version is an auditory scale [Terhardt, 1992], based on the spread of frequency along the basilar membrane in the inner ear, which tends toward linearity at lower frequencies and toward a logarithmic scale at higher frequencies.

• Finally, before the inner ear can perform its frequency analysis the incoming sound must pass through the outer and middle ear. These two stages alter the sound's frequency spectrum, enhancing frequencies between about 1kHz and 5kHz. The auditory spectrograph includes a pre-emphasis section which approximates this response.

3. Musical Analysis

The consequences of the differences outlined above will now be investigated, using musical sounds as examples. Each illustration in this section shows the results of three spectrographic analyses: a conventional wide band (top of diagram), a conventional narrow band (middle) and an auditory (bottom) spectrogram.

3.1. Synthetic Sound

The consequences of the first two differences are illustrated in figure 1, which shows spectrograms of a square wave signal, the frequency of which has been swept exponentially upward from 60Hz to 850Hz over one second.

Figure 1 - Chirp square wave displayed on wide band (top), narrow band (middle) and auditory (bottom) spectrograms. [Image not reproduced; each panel plots frequency against time (s).]

Due to its broad analysing filter bandwidth, the wide band spectrogram has poor frequency resolution, but its time resolution is good and vertical striations, synchronised to each cycle of the fundamental, are often visible. Conversely, the narrow band spectrogram has poor time resolution but its frequency resolution is much better than that of the wide band analysis. The variable bandwidth filters used to generate the auditory spectrogram cause it to exhibit good frequency resolution in the lower part of the display and good time resolution higher up. Thus the first six or so harmonics of the sound are always discernible regardless of the frequency of the fundamental. This makes the auditory spectrogram more convenient, since it combines the information from the two conventional analyses, and more perceptually realistic, since the ear is known to possess good low-frequency frequency resolution and good high-frequency temporal resolution. The image produced by the auditory spectrograph is therefore nearer (than that produced by a conventional spectrograph) to the representation of sound which the brain receives from the ear.

The two conventional spectrograms in figure 1 both show the harmonics rising at an increasing rate and moving progressively further apart. The auditory spectrogram, however, shows the harmonics rising linearly and remaining parallel.
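The geometry of the swept square wave can be checked numerically. The sketch below is only an illustration: it uses the 60Hz to 850Hz, one second exponential sweep from the text, but it stands in a purely logarithmic (semitone) axis for the auditory scale, which is a simplifying assumption, since the scale used in the paper is described as tending toward logarithmic only at higher frequencies. All function names are hypothetical.

```python
import math

F_START = 60.0   # Hz, start of the sweep (from the text)
F_END = 850.0    # Hz, end of the sweep (from the text)
DURATION = 1.0   # s, sweep length (from the text)


def sweep_frequency(t):
    """Fundamental frequency in Hz of the exponential sweep at time t seconds."""
    return F_START * (F_END / F_START) ** (t / DURATION)


def semitones(f, ref=F_START):
    """Position of frequency f on a logarithmic axis, in semitones above ref."""
    return 12.0 * math.log2(f / ref)


def harmonic_gap(t, n, log_axis):
    """Distance between harmonics n and n+1 at time t.

    Returns semitones when log_axis is True, otherwise Hz.
    """
    f0 = sweep_frequency(t)
    low, high = n * f0, (n + 1) * f0
    if log_axis:
        return semitones(high) - semitones(low)
    return high - low


if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0):
        print(
            f"t = {t:.1f} s  f0 = {sweep_frequency(t):7.1f} Hz  "
            f"gap(1,2) = {harmonic_gap(t, 1, False):7.1f} Hz "
            f"= {harmonic_gap(t, 1, True):5.2f} semitones"
        )
```

On the linear axis the gap between adjacent harmonics widens as the fundamental rises, whereas on the logarithmic axis it stays constant and the fundamental climbs by the same number of semitones in each half of the sweep.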
Thinking about this musically, the auditory spectrogram is more perceptually realistic:

• The 'auditory distance' between two notes spaced a fixed musical interval apart (as is the case with harmonics) will sound the same regardless of where on the musical scale those notes are played, so it makes sense for harmonics to appear as parallel lines on a visual display.

• A tone with exponentially increasing frequency (e.g. an ascending chromatic scale played with glissando at a steady tempo) has a pitch which rises at a constant number of semitones per second. It will sound as if its pitch increases at a constant rate, so a straight line representation seems much more reasonable than one with ever increasing upward curvature.

3.2. Percussion

Figure 2 shows spectrographic analyses of a snare drum being struck once. The two principal acoustic features of the sound are both much more clearly visible in the auditory spectrogram. The resonant drum sound at approximately 250Hz is given more vertical space due to the auditory frequency scale, and the triangular shape of this feature (caused by the resonance becoming gradually more focused) is all but invisible on both of the conventional analyses. This phenomenon was also evident on spectrograms produced from the sounds of tablas.

Additionally, the sound of the wires of the snare itself (2kHz to 4kHz) is considerably enhanced in the auditory spectrogram. This is due to the pre-emphasis added to the signal by the outer and middle ear representation.

Figure 2 - Snare Drum, displayed on wide band (top), narrow band (middle) and auditory (bottom) spectrograms. [Image not reproduced; each panel plots frequency against time (s).]

3.3. Brass and Wind

The trumpet analyses in figure 3 show the benefits of all three differences between conventional and auditory spectrography. Whilst the wide and narrow band spectrograms show good temporal and good harmonic detail respectively, the auditory spectrogram combines the two due to its variable filter bandwidths. The auditory frequency scale results in more visual weight being given to the fundamental (rather than an equal weighting for each harmonic as in the conventional analyses), which seems perceptually reasonable. The pre-emphasis applied to the signal prior to auditory analysis results in the sound at the start of the note, before the air inside the trumpet has started resonating properly, being clearly visible. This cannot be seen at all in the conventional analyses.

These same characteristics can also be seen on analyses of saxophone and flute sounds. Furthermore, the differences between these three instruments show up much more clearly on the auditory spectrographic analyses than on the conventional ones. This is partly due to the fact that temporal differences and harmonic differences are visible together and partly due to more visual weight being attached to more perceptually relevant features.

Figure 3 - Trumpet playing C4, displayed on wide band (top), narrow band (middle) and auditory (bottom) spectrograms. [Image not reproduced; each panel plots frequency against time (s).]

3.4. Singing

Singing is perhaps the most obvious musical candidate for spectrographic analysis since the spectrograph was originally developed to study the voice [Potter et al, 1966]. The spectrograms in figure 4 show analyses of a female operatic voice singing the vowel /a/ at G4. A number of features are particularly interesting.

Figure 4 - The vowel /a/ sung in an operatic style by a female and pitched at G4, displayed on wide band (top), narrow band (middle) and auditory (bottom) spectrograms. [Image not reproduced; each panel plots frequency against time (s).]

The vibrato is not easily discernible on the wide band analysis. The narrow band analysis shows it more clearly but gives the impression that the upper harmonics have more vibrato than the fundamental. The auditory spectrogram shows the vibrato both clearly and realistically.

The frequency scale allows the auditory spectrogram to show harmonics and formants where the wide and narrow spectrograms will only show one or the other. As a result of this it has been found that the extents to which the 'singer's formant' [Rossing, 1990] and vibrato are used in a particular vocal style can both be seen more clearly on auditory spectrograms.

The combination of good time and good frequency resolution means that the cyclic amplitude envelope of the third formant (at about 3kHz) is visible on the auditory spectrogram but not on the conventional analyses. This formant has the appearance of a string of triangles synchronised to the opening and closing of the vocal folds and could therefore contain valuable information, missing from the conventional analyses, relating to the 'Closed Quotient' (CQ) of the glottal waveform. The CQ, a measure of the proportion of the glottal waveform over which the vocal folds are closed, is one of the key differences between different styles of singing [Evans & Howard, 1994].

4. Applications

The principal applications envisaged for the system relate to speech analysis [Howard et al, 1995]. One musical application currently under investigation, however, is the provision of real-time visual feedback in singing lessons.

The ability to see all frequency and time related features in a single analysis could make the analysis system particularly useful as a front end to auditory streaming applications. Additionally, any musical system involving Fourier analysis of sound (with or without subsequent visual display) may benefit from use of an analysis technique based on a model of human hearing instead.

5. Conclusions & Future Work

The differences between conventional spectrographic techniques and a new auditory spectrograph have been described. The implications of these differences for the spectrograms produced have been outlined and comparisons have been made using a variety of musical sounds as test data.
The results of these comparisons indicate that the auditory spectrographic analysis is superior to the conventional ones in terms of perceptual realism, clarity of acoustic features, quantity of visible detail and convenience (since one display replaces two).

A future modification to the system could provide an output showing a continuously evolving two-dimensional (energy vs. frequency) long-term spectrum, since analyses of musical instrument sounds are more commonly displayed in this way.

An enhancement planned in the immediate future is the addition of a model of acoustic to neural transduction [Meddis, 1986]. It is hoped that this will have a significant effect on the temporal characteristics of the system.

Acknowledgements

The authors would like to thank Michelle Evans for providing the recordings of singing which were analysed. The work presented in this paper is supported by UK-EPSRC research grant number GR/J42267.

References

[Brookes et al, 1994] T.S.Brookes, D.M.Howard, A.M.Tyrrell. A T9000 Simulation of the Human Peripheral Hearing System. IEE Digest no.1994/208, pp4.1-4.4

[Darling, 1991] A.M.Darling. Properties and Implementation of the GammaTone Filter: A Tutorial. Speech, Hearing & Language, Work in Progress, University College London, Dept of Phonetics & Linguistics, vol.5, pp43-61

[Dodge & Jerse, 1985] C.Dodge & T.A.Jerse. Computer Music. Schirmer Books.

[Evans & Howard, 1994] M.Evans, D.M.Howard. A Study of Two Vocal Qualities Exhibited by Professional Female Singers. Proceedings of the Institute of Acoustics, vol.16, part 5, pp139-46

[Howard, 1991] D.M.Howard. Speech Measurements. Concise Encyclopaedia of Biological & Biomedical Measurement Systems, P.A.Payne (ed), pp370-6

[Howard et al, 1995] D.M.Howard, A.Hirson, T.S.Brookes, A.M.Tyrrell. Spectrography of Disputed Speech Samples by Peripheral Human Hearing Modelling. Forensic Linguistics, vol.2, no.1, pp28-38

[INMOS, 1993] The T9000 Transputer Hardware Reference Manual.

[Meddis, 1986] R.Meddis. Simulation of Mechanical to Neural Transduction in the Auditory Receptor. Journal of the Acoustical Society of America, 79(3), pp702-11

[Potter et al, 1966] R.K.Potter, G.A.Kopp, H.G.Kopp. Visible Speech. Dover Publications Inc. (first published 1947)

[Rossing, 1990] T.D.Rossing. The Science of Sound (Chapter 17: Singing). Addison-Wesley

[Terhardt, 1992] E.Terhardt. The SPINC Function for Scaling of Frequency in Auditory Models. Acustica, vol.77, 1992, pp40-2

Brookes et al., ICMC Proceedings 1996