SeeMusic: A Tool for Music Visualization

Bernard Mont-Reynaud*
Senior Scientist, Studer Editech
studer@applelink.com

* Most of this work was done while at CCRMA, Stanford University.

Abstract

The paper presents processing techniques and user interface design ideas for the display of musical sound: a harmonic cursor, multi-resolution filter banks, and a polyphonic pitch detection method based on image convolution. Applications range from simple to advanced.

1. Introduction

Sound reaches the ear, not the eye. So why should one try to visualize audio or music? A general answer is that it enhances our perception. Sounds are patterns in time, and images are patterns in space, so we can give each of them complementary roles in communication and user interface design. The ear receives data by unfolding it in time; speeding up playback distorts the percept appreciably. The eye can scan data faster, because it is not constrained to a sequential scan. Aimed at a music score, the eye "jumps" to the clarinet entrance in bar 37; locating the same event in an audio file can take much longer, depending on how much trial and error is used in the search. To speed up indexing through sound, digital audio workstations provide high-speed playback or crude audio visualization via amplitude envelopes, but they could use better visualization.

Indexing is one application; audio visualization also enhances perception in more subtle ways, via synesthetic effects. If we hear a sound while seeing an associated image, the eye helps the ear pick out details, and conversely. If the visual display is well suited to the task, one quickly learns to associate features of the display with auditory percepts. For example, this is how Victor Zue at M.I.T. has taught many people to read speech spectrograms.

Can one do similar things for musical audio? How does one provide visual access to the significant musical features present in the sound? Does the resolution of the representation match that of the ear? Can it support the indexing of sound events? Does it support onset detection, pitch extraction, and other identification tasks? Can it handle multiple sources? Music transcription of polyphony? Also, if synesthetic effects enhance the listening experience, can this be used for ear training? For aesthetic experiences? If key musical features are visually prominent, can the interface allow composers to use the system, visually manipulating musical relationships? How about retrieval by content for multimedia applications? Or more mundane applications, such as sound indexing in studio tools?

All these questions are relevant to the SeeMusic system. Our goal was to build a multi-purpose visualization tool to serve the needs of auditory perception research and of the development of new applications in music and audio.

The key contributions to date are covered in the following sections; the final section discusses the status of the system and plans for future work.

2. A Multi-Resolution Filter Bank

Most tools used for music or speech research depend on Fourier analysis to extract the frequency content of audio data. Yet hearing research shows major discrepancies between the results of the Fourier transform and psychoacoustic or physiological evidence. Best known is the fact that the ear's frequency sensitivity is not linear in Hz. In the middle and upper parts of the audible spectrum, the ear's resolution is best expressed as a constant number of semitones; this matches the sensitivity of constant-Q filters (such as wavelet filters) whose frequency bins are equally spaced on a log(F) axis. At the bottom of the audible spectrum, a linear (Hz) scale is an adequate approximation, and there is a transition region between the two. Many systems use the Bark scale or another experimentally derived representation of the ear's critical bandwidths. In our view, closely matching the frequency scale to experimental ear sensitivity data is not essential. For music applications, the advantages of a constant displacement per semitone far outweigh the disadvantages. SeeMusic uses a log scale for frequency.

There are deeper issues. It is not at all obvious that the front-end transformation performed by the ear is well modeled by the magnitude output of a single linear filter bank, i.e., an energy spectrogram. Such output is subject to a bandwidth-time tradeoff (often called the uncertainty principle) which limits the product of frequency resolution and time resolution for the spectrogram from any single filter bank. There is evidence that the ear does better than this theory predicts, i.e., it beats the uncertainty principle. This, as well as experimental results from physiology, has prompted researchers towards nonlinear methods, time-domain methods that make greater use of phase information, and alternative frequency-time approaches such as those derived from the Wigner transform; these include models such as channel autocorrelation, and others too numerous to mention.

Our own research led to a multi-rate, multi-resolution approach. To "beat" the uncertainty principle, we integrate data from multiple filter banks. The general idea of a multi-resolution filter design is as follows. Simply put, the uncertainty principle states that an image that is sharp in time must be blurry in frequency, and conversely. Any single filter bank has its own systematic blurring, but the theorem does not apply to non-standard filters obtained by combining the resolutions of several filter banks. We use several views with varying blur shapes and combine them with nonlinear operators. Using this idea, we construct an image sharper than any of the original images; the final blur is smaller than the theory predicts for any single filter bank.
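To make the combination step concrete, here is a minimal sketch in Python. It is an illustration, not the SeeMusic implementation: two STFT window lengths stand in for the system's multi-rate filter banks, and a pointwise minimum stands in for its nonlinear combining operators; the sample rate, window lengths, hop size, and test signal are all assumptions made for the example.

import numpy as np

def magnitude_spectrogram(x, win_len, hop):
    """STFT magnitude with a Hann window; rows are frequency bins, columns are frames."""
    win = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[k*hop : k*hop + win_len] * win for k in range(n_frames)], axis=1)
    # Normalize by the window sum so views with different window lengths are comparable.
    return np.abs(np.fft.rfft(frames, axis=0)) / win.sum()

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
# Test signal: two close sinusoids plus a brief click, so that one view resolves
# the frequencies while the other resolves the click in time.
x = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 466 * t)
x[4000:4008] += 5.0

hop = 64
S_long = magnitude_spectrogram(x, 1024, hop)    # sharp in frequency, blurred in time
S_short = magnitude_spectrogram(x, 128, hop)    # sharp in time, blurred in frequency

n = min(S_long.shape[1], S_short.shape[1])
f_long = np.fft.rfftfreq(1024, 1 / fs)
f_short = np.fft.rfftfreq(128, 1 / fs)

# Resample the short-window view onto the long-window frequency grid, then keep
# energy only where BOTH views show it (a simple nonlinear combination).
S_short_up = np.vstack([np.interp(f_long, f_short, S_short[:, k]) for k in range(n)]).T
combined = np.minimum(S_long[:, :n], S_short_up)
print(combined.shape)

In the combined image the two tones remain separated in frequency, as in the long-window view, while the click remains localized in time, as in the short-window view; this is the kind of sharpening the multi-resolution design aims for.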
3. On Image Convolution for Sound

Let us introduce an image processing operator known as convolution. The result of convolving two images is called their product. Let us try to visualize this using black and white images, assuming familiarity with MacPaint or another paint program in which the user can choose brush shapes. The figure shows two example products: the product of a horizontal line and a vertical line is a rectangle, and the product of a pattern of three thick dots and a slim slanted line is three thick slanted lines.

[Figure: Pattern * Pattern = Product; the two example products described above.]

To capture this intuitively, imagine painting the product by using one pattern as a brush and the other pattern as a tracing guide, or brush trajectory: drawing a horizontal line with a vertical brush, we paint a rectangle; tracing the vertical line with a horizontal brush, we get the same rectangle.
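The rectangle example can be reproduced directly as a discrete two-dimensional convolution. The short Python sketch below is illustrative only; the 7x7 image size and the use of scipy's convolve2d are choices made for the example, not part of the original system.

import numpy as np
from scipy.signal import convolve2d

vertical_line = np.zeros((7, 7), dtype=int)
vertical_line[:, 3] = 1             # the "brush"

horizontal_line = np.zeros((7, 7), dtype=int)
horizontal_line[3, :] = 1           # the "trajectory"

product = convolve2d(vertical_line, horizontal_line)
# Every pixel of the trajectory deposits a copy of the brush, so the support of
# the product is a solid 7x7 rectangle (the values count the overlaps).
print((product > 0).astype(int))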

Then, try to visualize the fat slanted lines as the convolution product of fat dots and a slanted line.

Returning to sound representation, we will work with two images:
- a spectral image (or spectrogram), showing acoustic energy on a frequency-time display;
- a virtual pitch image, on the same axes but representing pitch trajectories.

In order to appreciate this, consider the visual effect of additive synthesis. In the figure below, the input to additive synthesis consists of pitch trajectories: three notes at fixed, ascending pitches, and one note with a continuously varying pitch. Additive synthesis expands these pitch trajectories by generating harmonic partials for the given fundamentals; this is done by constructing the musical ratios 1:2, 1:3, 1:4 and so on, according to the harmonic series. The resulting spectrogram is shown in part B. Now observe that, because we use a log scale in frequency, each musical ratio n:m causes a fixed vertical displacement. In other words, it is as if the harmonics could be generated by a special brush! To get from A to B, we trace image A using a "harmonic comb" as a brush.

The amazing result is that additive synthesis amounts to image convolution. Of course, this conclusion ignores much of the detail of a real-life synthesizer, but it captures the essence of a mapping that is at the heart of the SeeMusic system.

The next idea is that a convolution can be inverted. Ideally, the convolution that took A into B can transform B back into something much like A; this is shown as C. If we obtained B from A by linear convolution, inverse linear convolution would give an image C identical to A. To analyze real sound data, including polyphonic data, we transform a sound spectrogram into a pitch image by inverse convolution with a harmonic brush. The spectrogram in B, obtained from our front-end filter bank, is a log(F) spectrogram. De-convolution with a harmonic comb is the starting point of a simple yet powerful approach to polyphonic pitch detection. In fact, we do not use linear convolution, which would create a scattering of subharmonics in the virtual pitch domain; the image processing is done instead with mathematical morphology operators, nonlinear operators closely related to linear convolution.

[Figure: A. Pitch, B. Spectrogram, C. Virtual Pitch; synthesis by convolution takes A to B, and pitch detection by inverse convolution takes B to C.]

There is a good reason to call the image created in C "virtual pitch". Several researchers (notably Terhardt) have described polyphonic pitch detection methods which operate by generating pitch trajectories in a virtual pitch domain. Close examination of these methods reveals a process essentially equivalent to the inverse convolution just described, in the same sense that additive synthesis can be seen as convolution. But in the SeeMusic system, we take the inverse convolution literally, as a basis for actual computation. The technique is very fast in practice, thanks to an efficient implementation of the image morphology operators.
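As an illustration of this pipeline, and not of the actual SeeMusic code, the following Python sketch builds a toy log(F) spectrogram of two simultaneous harmonic tones and recovers their fundamentals by taking, at each candidate pitch, the minimum energy found across a harmonic comb of vertical offsets (an erosion-like morphology operator). The bins-per-octave value, the eight-partial comb, the 1/h amplitude rolloff, and the choice of minimum are all assumptions made for the example; the paper does not spell out the exact operators used in the system.

import numpy as np

bins_per_octave = 48
n_bins, n_frames = 6 * bins_per_octave, 100     # log-frequency axis, time axis
f_min = 55.0                                    # frequency of bin 0, in Hz (assumed)

def bin_of(freq):
    """Nearest log-frequency bin for a frequency in Hz."""
    return int(round(bins_per_octave * np.log2(freq / f_min)))

# Toy log(F) spectrogram: two harmonic tones a fifth apart, eight partials each,
# every partial about one bin wide.
spec = np.zeros((n_bins, n_frames))
for f0 in (110.0, 165.0):
    for h in range(1, 9):
        b = bin_of(h * f0)
        spec[max(b - 1, 0) : b + 2, :] += 1.0 / h   # weaker upper partials

# Harmonic comb: on a log axis, partial h sits log2(h) octaves above the
# fundamental, a fixed vertical offset independent of the pitch.
comb_offsets = [bin_of(h * f_min) for h in range(1, 9)]

# "Inverse convolution" as erosion: a pitch bin scores the minimum energy over
# all comb positions, so every partial must be present for the pitch to survive.
pitch = np.full_like(spec, np.inf)
for off in comb_offsets:
    shifted = np.zeros_like(spec)
    shifted[: n_bins - off, :] = spec[off:, :]
    pitch = np.minimum(pitch, shifted)

detected = np.nonzero(pitch[:, 0])[0]
# Prints candidate fundamentals clustered around 110 Hz and 165 Hz; octave and
# subharmonic errors are suppressed because the minimum requires all partials.
print(np.round([f_min * 2 ** (b / bins_per_octave) for b in detected], 1))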

4. An Intuitive, Powerful User Interface

The original goal for the SeeMusic system was to provide an environment for research in which complex automatic operations (such as polyphonic pitch detection, source segregation and music transcription) could coexist with manual editing operations. This would provide a powerful and flexible semi-automated system for music transcription and pattern recognition research and applications. Although this remains a valid long-term goal, it has become clear that the visualization techniques and the integrated interface concepts can be successfully applied to much simpler problems.

The useful concepts include the dual representation (spectral image and pitch image) with user-controlled convolution relations between the two. Generally speaking, this provides a correspondence map between sound data, viewed as log-frequency spectrograms, and MIDI data or musical score data, viewed on the same coordinate system. The harmonic comb described above as the brush for image convolution is actually a cursor that is used over the image. There are many variants and functions associated with such a cursor. This allows the direct manipulation of musical relationships such as those between the partials of a complex sound, or the musical intervals between notes. These intervals are sensed directly within the spectral image or the pitch image, or across from one to the other. It is difficult to describe verbally all the uses of these concepts. In use, the harmonic cursor becomes a natural tool for researchers, composers, musicians, and even for the "naive" users we all wish to be treated as.

5. The SeeMusic System: Present and Future

The SeeMusic system presently runs on a Mac II or later Macintosh with 8-bit color video; signal processing is done on a DSP board from Spectral Innovations, running an AT&T DSP32 chip. The system is implemented in C using the Think C development system and AT&T's signal processing library. The user interface was designed and implemented by the author; the DSP code is by Emmanuel Gresset. Except for more recent developments by the author, the work was done at Stanford's CCRMA with support from the NSF and SDF foundations.

The signal processing technique is designed for realtime operation, but the single-DSP implementation cannot keep up with the realtime computational requirements. The system would be much more effective with sufficient DSP power to follow the audio on the fly. The most straightforward conversion of the system would be to a multi-DSP configuration involving four or more DSP 3210s; this chip, jointly developed by Apple and AT&T, is a truly 32-bit-clean DSP that would allow direct access to the frame buffer and be very suitable for realtime graphics. We are contemplating a reimplementation of the system on a more modern multi-DSP board from Spectral Innovations, the NuMedia.

The SeeMusic user interface tools can be applied without any changes to other front-end transforms that have a log frequency scale, such as wavelet transforms or cochlear models with this scaling. This suggests a more general implementation of SeeMusic, as an open system in which all or part of the processing could become user-supplied procedures.
The user interface concepts can also be adapted to other frequency scales. To do so, the harmonic cursor has to change shape as it moves; some of the elegance and speed of the underlying image processing would be traded away for generality, with the help of a faster processor.
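To make the contrast concrete, the following Python sketch computes the vertical offsets of the harmonic cursor tines under both kinds of frequency axes. It is illustrative only; the display geometry (pixels per octave, pixels per Hz) and the eight-partial cursor are assumptions made for the example.

import numpy as np

harmonics = np.arange(1, 9)     # partials 1..8 of the note under the cursor

def tine_offsets_log(pixels_per_octave=48):
    """Vertical tine offsets on a log(F) axis: independent of the fundamental,
    so the cursor is a rigid comb that simply translates as it moves."""
    return pixels_per_octave * np.log2(harmonics)

def tine_offsets_linear(f0_hz, pixels_per_hz=0.1):
    """Vertical tine offsets on a linear (Hz) axis: they stretch with the
    fundamental, so the cursor must be reshaped as it moves."""
    return pixels_per_hz * (harmonics - 1) * f0_hz

print(np.round(tine_offsets_log(), 1))       # the same for any note
print(np.round(tine_offsets_linear(110.0)))  # spacing depends on the pitch
print(np.round(tine_offsets_linear(220.0)))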