A System for the Separation of Simultaneous Musical Audio Signals

Thomas Stainsby
La Trobe University, Department of Music
Bundoora, Victoria, 3083, Australia
email: stainsby@klang.latrobe.edu.au

Spectralyse is a NeXTStep application which has been developed to investigate the separation, or demixing, of simultaneous musical sound sources in monaural audio signals. The output of this non-real-time system consists of individual digital soundfiles, one representing each instrument present in the original mixed source signal. The application works by constructing a sinusoidal model of the significant spectral components of the input signal, using a quasi-logarithmic frequency representation. Following this, the behaviour of different partials is compared, and those exhibiting similar behaviour according to certain criteria are grouped together, on the assumption that they have arisen from a common sound source. Once so grouped, an additive resynthesis of these partials is effected, with the resultant soundfiles representing each individual sound source. At this stage, the system offers reasonable preliminary success for the separation of two-note synthetic test examples, though the problem of sequentially integrating successive notes from the one instrument, and assigning them to the same source, has yet to be tackled. The potential exists to expand the system in the future to analyse stereo soundfiles, thus allowing the comparison of the spatial locations of partials as an additional grouping criterion.

1 Introduction

The modelling of higher auditory function, particularly the ability to isolate and form auditory images of different simultaneous sound sources, has received a good deal of attention in recent literature. This general area of investigation has been termed Computational Auditory Scene Analysis. Work which has investigated the application of such techniques to polyphonic musical signals includes that of Guy Brown and Martin Cooke (1994), Dan Ellis (1992), Robert Maher (1989), David Mellinger (1991) and Avery Wang (1994), among others. This paper reports on the design and current status of Spectralyse, a NeXTStep application being developed by the author to achieve the isolation and individual representation of different musical instruments in monaural audio signals, a process also known as signal separation.

2 System Design

The schematic diagram below, figure 1, summarises the basic procedure employed to achieve the separation. We begin with a mixed soundfile containing multiple sound sources, and perform an analysis which describes the sound as a set of frequency components. The frequency partials thus described are then compared to each other, with ones exhibiting similar behaviour over time being grouped together. The types of behaviour compared are described later in the text. Once partials have been grouped together, an additive synthesis is performed using the partials' data, creating a sound which represents the sum of similar partials, which, it is hoped, all derive from the same input sound.

Figure 1: Overall separation procedure

The programming environment is NeXTStep, an object-oriented environment which provides a powerful and attractive user interface, as well as an integrated UNIX development environment with a range of convenient sound editing and analysis tools.
The programming language employed is Objective-C, whose object-oriented nature is well suited to the task of organising and manipulating the large volumes of hierarchical data encountered in computational auditory modelling. The soundfiles to be analysed are recorded in a mono 16-bit linear format with a 44100 Hz sampling rate, so that the full audible frequency range of the source material is preserved. In running Spectralyse, the user is actively involved in visually examining data and setting parameters at various stages of the process; the separation is therefore not a fully automated procedure and is open to a large amount of user optimisation.

A range of analysis and data display methods is provided by the application, offering a choice between fixed-bandwidth and Bounded-Q FFT analysis, and fixed-bandwidth and Bounded-Q MQ analysis. Each type of analysis employs its own display and data storage objects and methods. For most separation investigations, a Bounded-Q MQ analysis is favoured, although Spectralyse still allows the user to examine the sound data using other methods. The separation techniques employ data from the MQAnalysis object format, which is the same data representation object for both fixed-bandwidth and Bounded-Q calculated data.
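The implementation itself is written in Objective-C under NeXTStep, but the input format described above is simple to illustrate. The following is a minimal Python sketch (not the paper's code) for reading a mono 16-bit linear PCM soundfile into a normalised floating-point array; the WAV container and the filename are assumptions made for the example, since the original system works with NeXT soundfiles.

    import wave
    import numpy as np

    def load_mono_16bit(path):
        """Read a mono, 16-bit linear PCM soundfile into floats in [-1, 1]."""
        with wave.open(path, "rb") as wf:
            assert wf.getnchannels() == 1 and wf.getsampwidth() == 2
            sample_rate = wf.getframerate()            # expected: 44100 Hz
            raw = wf.readframes(wf.getnframes())
        samples = np.frombuffer(raw, dtype=np.int16).astype(np.float64) / 32768.0
        return samples, sample_rate

    # Hypothetical usage:
    # x, fs = load_mono_16bit("mixture.wav")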

Spectralyse offers FFT analysis of sound data. While the FFT is a very common and well-established frequency analysis tool, it is not the only form in which a sound can be represented in terms of its frequency content. One prominent characteristic of the FFT is that its output is a linearly spaced bank of frequency channels, each a harmonic multiple of the fundamental analysis frequency. This differs from the type of frequency transform carried out by the inner ear, in which the frequency spectrum is measured more linearly for the first 1000 Hz, and more logarithmically beyond this range. Thus a frequency transform system which yielded a logarithmic or quasi-logarithmic frequency representation could be relevant and applicable to our analysis problem.

In a Constant-Q frequency analysis, the signal is analysed by a bank of bandpass filters, spaced so that each one has the same relative frequency bandwidth or, in filter terminology, the same Q coefficient. This Q coefficient represents the ratio of the centre frequency of the filter to the bandwidth it encompasses. A series of such filters with constant fractional overlap between them would result in a logarithmically spaced filter bank, in turn yielding a logarithmic frequency representation. Spectralyse offers a Bounded-Q analysis which employs a series of FFT-style data representations, each with fixed channel bandwidths, with the spacing between channels changing at every octave, following a series of different-resolution FFTs. Thus, a BQFT does not yield a true, constant logarithmic spacing of frequency channels across the whole audio range, yet it does offer a quasi-logarithmic frequency representation similar to that provided by the inner ear. Figure 2 illustrates how a frequency spectrum would be represented by a BQFT analysis.

Figure 2: Frequency spectrum with varying frequency channel width as produced by a BQFT

Bounded-Q analysis also has two other noteworthy advantages over linearly spaced FFT analysis. Firstly, by representing fewer frequency channels in total, the Bounded-Q transform offers a certain computational efficiency, as it is able to be implemented with a short analysis window. Secondly, as the Bounded-Q transform would yield improved linear frequency resolution at the low end of the frequency spectrum, it could prove much more capable of determining the fundamental frequencies of the various instrumental voices involved when they are only a small interval apart. Chris Chafe et al. (1985) have noted this in their discussion of the advantages of a BQFT: the BQFT allows faster analysis times in higher octaves to track quicker-moving note information, while still giving us fine frequency resolution in the lower octaves.

The actual BQFT algorithm employed here is shown in figure 3. It is simple in concept: an FFT with a short window is performed at the original sampling rate, giving fine temporal resolution, and broad yet sufficient frequency resolution for the upper frequencies. Only the top half of the frequency spectrum data is written to file and kept. Next, the whole soundfile is lowpass filtered and downsampled by a factor of 2 to half the original sampling rate, at which point another FFT of the same sample length is performed on the sound. This time, however, as the sampling rate has been halved, this length of window now represents twice the length of time. The temporal resolution has been halved, but the frequency resolution, in terms of the bandwidth per frequency channel, has been doubled. The top half of this frequency spectrum is then written out to file and kept, while the now downsampled soundfile is returned for a further lowpass filtering and downsampling. This process can be repeated for as many octaves as required. The full spectrum of the bottom-octave FFT is written to file, as there will be no further downsampling of the sound beyond this point. The BQFT is thus a multirate system, with a given number of samples representing different-length time segments at different octaves. This has implications later in our analysis process when we form partial tracks: timing information needs to be expressed in absolute time measured in milliseconds, as the sample count cannot be used as an absolute system clock.

Figure 3: BQFT Algorithm
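As a minimal sketch of the multirate procedure just described (and not the program's actual Objective-C code), the loop below performs a fixed-size FFT at each octave, keeps the top half of every spectrum except the last, and then lowpass-filters and downsamples the signal by two before repeating. The window size, hop size and the use of scipy's decimate routine are illustrative assumptions.

    import numpy as np
    from scipy.signal import decimate

    def bqft(x, fs, n_octaves=5, window_size=256, hop=128):
        """Bounded-Q FFT: fixed-size FFTs at successively halved sample rates.

        Returns one (sample_rate, magnitude_frames) pair per octave, from the
        top octave down.  Only the top half of each spectrum is kept, except
        in the bottom octave, where the full spectrum is retained.
        """
        octaves = []
        win = np.hamming(window_size)
        for octave in range(n_octaves):
            frames = []
            for start in range(0, len(x) - window_size, hop):
                spectrum = np.abs(np.fft.rfft(win * x[start:start + window_size]))
                if octave < n_octaves - 1:
                    spectrum = spectrum[window_size // 4:]   # keep the top half only
                frames.append(spectrum)
            octaves.append((fs, np.array(frames)))
            if octave < n_octaves - 1:
                x = decimate(x, 2)     # anti-alias lowpass filter, then downsample by 2
                fs = fs / 2
        return octaves

With a 256-point window and a 44100 Hz input this gives a channel spacing of roughly 172 Hz in the top octave and roughly 10.8 Hz after four rounds of downsampling, consistent with the figures quoted in the next paragraph.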
In this implementation, a BQFT with an analysis window of 256 points was chosen, which is then applied to the soundfile at five different octaves. This yields a frequency channel width of 172.265 Hz in the top octave, with an analysis period of 5.8 ms, while the bottom octave offers a channel width of 10.766 Hz with an analysis period of 92.8 ms. These values can be seen to be readily applicable to a wide range of musical signals.

Having produced a frequency analysis of the input signal, with either a linear FFT or Bounded-Q FFT transform, the next stage of the separation procedure is to construct a model of the sound closer to the form in which our auditory system represents it: as a set of discrete, time-varying frequency components or partials. To do this, we need to identify the exact frequency and amplitude of a spectral component at any given time, and then accurately infer the behaviour of that component for times in between calculations of these exact values. Once this has been achieved, we then need to be able to organise the data in such a way that each spectral component can be addressed and manipulated independently, in order for collections of partials to be assembled. From such collections, we could in turn ascertain the nature of the various individual sound sources present in the original composite signal.

An applicable transformation here is that of sinusoidal modelling using MQ analysis, named after McAulay and Quatieri (1986). In such analysis, each peak in the frequency spectrum is assumed to represent the behaviour of an underlying sinusoidal partial at that point in time. When detecting peaks, a threshold level needs to be specified to make sure that a peak in the frequency spectrum actually represents a true sinusoidal component, and not just a ripple in the noise floor. If significant peaks in adjacent frames are close enough to each other in frequency and time, we can assume that they have been produced by the same sinusoidal component, and we should also be able to determine the behaviour of that component between analysis frames by a simple interpolation function.
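A hedged sketch of these two steps, peak detection against a threshold and the matching of peaks between adjacent frames, is given below; the threshold, tolerance and greedy nearest-frequency matching are illustrative choices rather than the paper's actual algorithm.

    import numpy as np

    def pick_peaks(mag, bin_hz, threshold):
        """Return (frequency, amplitude) pairs for local maxima above a threshold.

        `mag` is one frame of magnitude spectrum and `bin_hz` the channel spacing.
        The raw bin centre is used here; exact values would be refined by the
        parabolic interpolation described in the next section.
        """
        peaks = []
        for k in range(1, len(mag) - 1):
            if mag[k] > threshold and mag[k] >= mag[k - 1] and mag[k] >= mag[k + 1]:
                peaks.append((k * bin_hz, mag[k]))
        return peaks

    def match_frames(prev_peaks, cur_peaks, max_hz_jump):
        """Link peaks in adjacent frames that are sufficiently close in frequency.

        Greedy nearest-frequency matching; a full tracker would enforce a
        one-to-one assignment between frames.
        """
        links = []
        for i, (f_prev, _) in enumerate(prev_peaks):
            if not cur_peaks:
                break
            j = min(range(len(cur_peaks)), key=lambda n: abs(cur_peaks[n][0] - f_prev))
            if abs(cur_peaks[j][0] - f_prev) <= max_hz_jump:
                links.append((i, j))   # peak i of the previous frame continues as peak j
        return links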

The output of the initial FFT or BQFT frequency analysis only yields information pertaining to predetermined, fixed-frequency analysis channels. Smith and Serra (1987) describe a method to determine more exact frequency and amplitude values by means of parabolic interpolation in the frequency and amplitude domains. A unique parabola will fit the three points specified by a frequency analysis channel with a local-maximum amplitude and the two channels immediately above and below this channel. The maximum of this parabola is taken to be the exact amplitude and frequency of the underlying sinusoidal spectral component. Whilst fitting a parabolic curve to a local region within such a spectrum is not entirely accurate, it yields sufficiently accurate results for our purposes, especially when using a Hamming window for the initial FFT analysis, as noted by Ellis (1992).

Having determined an accurate frequency and amplitude value for each sinusoidal component in each frame, the next task is to link together the peaks from adjacent frames which represent the same spectral component. In our algorithm, peaks are matched from frame to frame if they are considered to be satisfactorily continuous in frequency. The resulting trajectories of linked peaks are termed partials or tracks. To avoid the abrupt cutting in and out of tracks when they are first or last detected in a spectral frame, which of necessity can occur only at the centre times of each analysis window, a fade-in or fade-out is calculated from the nearest two frequency and amplitude points, extrapolating to a zero-amplitude point of the appropriate frequency at some time before or beyond the analysis frame time. This has great significance for partial tracks in the BQFT, given that tracks in different analysis octaves will have been calculated at different analysis intervals. Partials which were initiated at the same time, yet detected in different octaves, should still extrapolate to the same time point, thus preserving their synchrony. This also helps define partials whose trajectories cross analysis octaves, as a frequency and amplitude fade-out in one octave should overlap with a corresponding fade-in in another octave.

Whilst MQ analysis can create a very accurate and robust representation of a sound, it can be very data intensive, particularly if the sound source is rich in spectral content, as every significant frequency component is represented. As a way of reducing the amount of data required, this system performs a data reduction on each track by further analysing its frequency and amplitude trajectories and approximating them with 'breakpoint' style envelope functions. A breakpoint envelope describes a curve as a set of straight line segments between an arbitrary number of arbitrarily valued x and y coordinates. The big advantage of this for our system is that it greatly reduces the amount of data to be sorted through and compared when matching up similar partial track behaviour at the signal separation stage. A breakpoint is defined to occur when the gradient changes by a significant fraction, as long as the magnitude of the new value has varied by a degree sufficiently above what one might expect to be minor fluctuations or 'ripple' on a given track. An example of breakpoint determination for a given peak contour is shown in figure 4.

Figure 4: Track data and breakpoint envelope formation
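Two of the steps just described can be sketched as follows: the three-point parabolic refinement of a spectral peak after Smith and Serra (1987), and a gradient-change rule for reducing a track to a breakpoint envelope. The thresholds and track representation are illustrative assumptions, not the values used in Spectralyse.

    import numpy as np

    def parabolic_peak(mag_db, k):
        """Refine a peak at bin k by fitting a parabola through bins k-1, k, k+1
        of the (log-)magnitude spectrum, after Smith and Serra (1987)."""
        a, b, c = mag_db[k - 1], mag_db[k], mag_db[k + 1]
        p = 0.5 * (a - c) / (a - 2 * b + c)      # fractional bin offset in (-0.5, 0.5)
        return k + p, b - 0.25 * (a - c) * p     # refined bin position and peak level

    def breakpoint_envelope(times, values, grad_change=0.5, min_step=0.01):
        """Reduce a sampled trajectory to breakpoints by gradient-change detection.

        A breakpoint is recorded when the local gradient changes by more than
        `grad_change` (relative to the previous gradient) and the value has moved
        by more than `min_step` since the last breakpoint, so that small ripple
        on the track is ignored.
        """
        bp_t, bp_v = [times[0]], [values[0]]
        prev_grad = None
        for i in range(1, len(times) - 1):
            grad = (values[i + 1] - values[i]) / (times[i + 1] - times[i])
            if prev_grad is not None:
                rel_change = abs(grad - prev_grad) / (abs(prev_grad) + 1e-12)
                if rel_change > grad_change and abs(values[i] - bp_v[-1]) > min_step:
                    bp_t.append(times[i])
                    bp_v.append(values[i])
            prev_grad = grad
        bp_t.append(times[-1])
        bp_v.append(values[-1])
        return np.array(bp_t), np.array(bp_v)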
Partials are grouped together and judged to belong to the same sound source in accordance with the frequency component grouping principles discussed by Albert Bregman (1989) and identified by Dan Ellis (1992), amongst others, in line with general theories regarding auditory scene analysis. Figure 5 shows some of the criteria that can be used to group together partials which exhibit similar behaviour, and which are hence likely to be coming from the same sound source. The criteria illustrated here include common onsets and end times, common harmonic frequency relationships, and common rates of frequency modulation.

Figure 5: Grouping criteria between partials

The partial track structures in this implementation are grouped with respect to their harmonicity, onset times and end times, according to initial values set in the header data structure for each track. Common rates of amplitude and frequency modulation are assessed in relation to the behaviour of the breakpoint envelopes, as shown in figure 6. Modulations for different partials are judged to be similar when they occur synchronously with similar gradient changes.

Figure 6: Matching of modulation curve by breakpoint synchrony

At the end of the analysis and separation processes, audio signal separation requires the construction of a representation of a source sound by adding together all the partials determined to belong to it, and using them for synthesis. MQ analysis is an appropriate and convenient data representation to this end, as the amplitude and frequency trajectories of each track can be supplied directly to a sinusoidal additive synthesis engine.
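To make the grouping criteria above concrete, here is a small, hedged sketch that clusters partial tracks by common onset time and harmonic frequency relationship. The dictionary track format, the tolerances and the lowest-partial-as-fundamental strategy are illustrative assumptions, simpler than the header-based grouping Spectralyse performs.

    def group_partials(tracks, onset_tol=0.03, harmonic_tol=0.03):
        """Group partial tracks sharing an onset and a near-integer frequency ratio.

        Each track is a dict with 'onset' (seconds) and 'freq' (mean frequency, Hz).
        The lowest remaining partial is taken as a candidate fundamental, and other
        partials join its group if their onset is synchronous and their frequency
        is close to an integer multiple of that fundamental.
        """
        groups = []
        remaining = sorted(tracks, key=lambda t: t["freq"])
        while remaining:
            root = remaining.pop(0)
            group, rest = [root], []
            for t in remaining:
                same_onset = abs(t["onset"] - root["onset"]) <= onset_tol
                ratio = t["freq"] / root["freq"]
                harmonic = abs(ratio - round(ratio)) <= harmonic_tol * round(ratio)
                (group if same_onset and harmonic else rest).append(t)
            groups.append(group)
            remaining = rest
        return groups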

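The final resynthesis stage can likewise be sketched as an oscillator bank that linearly interpolates each grouped track's breakpoint envelopes and integrates the instantaneous frequency to obtain the oscillator phase. The track representation is again an illustrative assumption rather than the MQAnalysis object format used by Spectralyse.

    import numpy as np

    def synthesize_group(tracks, duration, fs=44100):
        """Additive resynthesis of one separated source from its partial tracks.

        Each track supplies breakpoint envelopes: 'times', 'freqs' and 'amps'.
        Amplitude is held at zero outside the track's time span, so partials
        fade in and out where their envelopes end.
        """
        n = int(duration * fs)
        t = np.arange(n) / fs
        out = np.zeros(n)
        for trk in tracks:
            freq = np.interp(t, trk["times"], trk["freqs"])
            amp = np.interp(t, trk["times"], trk["amps"], left=0.0, right=0.0)
            phase = 2.0 * np.pi * np.cumsum(freq) / fs
            out += amp * np.sin(phase)
        return out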
Figure 7 provides an outline of the general operation of the Spectralyse program, while figure 8 shows one of the main user interface windows.

Figure 7: Operation of the analysis and separation procedure as implemented in Spectralyse (soundfile, BQFT analysis, MQ analysis, match peaks, form tracks, calculate fades in and out, form breakpoint envelopes)

Figure 8: The Analysis Parameters window from Spectralyse

3 Conclusion

Spectralyse provides a useful tool for analysing mixed musical sounds and investigating the separation and grouping of frequency partials according to criteria known to be influential in auditory scene analysis, as carried out in the auditory system. In its current implementation, the program offers computational elegance in its use of a BQFT and its reduction of sinusoidal partial information to significant breakpoint envelopes. The real potential of the software lies in its being an environment in which to develop more sophisticated partial grouping algorithms, especially strategies which deal with the grouping of note information over more distantly spaced events. It is also hoped to extend the system to the analysis of stereo soundfiles, thus allowing the comparison of spatial behaviour as an additional grouping criterion. These applications stand as the subjects of future research by the author.

Acknowledgements

Thanks go to David Hirst, Jim Sosnin, Jeff Pressing and Chris Dick for their supervision and advice during this research, and to La Trobe University for its funding through the provision of a LUPA scholarship.

References

Bregman, Albert S., 1989, Auditory Scene Analysis: The Perceptual Organization of Sound, MIT Press, Cambridge, Massachusetts and London, England.

Brown, Guy J., and Martin Cooke, 1994, "Perceptual grouping of musical sounds: A computational model", Journal of New Music Research, Vol. 23, pp. 107-132.

Chafe, C., Jaffe, D., Kashima, K., Mont-Reynaud, B., and Smith, J., 1985, "Techniques for note identification in polyphonic music", Proceedings of the 1985 International Computer Music Conference, Computer Music Association, San Francisco, California, pp. 399-405.

Ellis, Daniel P., 1992, A Perceptual Representation of Audio, M.Sc. dissertation, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.

Maher, Robert C., 1989, An Approach for the Separation of Voices in Composite Musical Signals, Ph.D. dissertation, University of Illinois, Urbana.

McAulay, R. J., and Quatieri, T. F., 1986, "Speech analysis/synthesis based on a sinusoidal representation", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-22, no. 5, pp. 330-338.

Mellinger, David K., 1991, Event Formation and Separation in Musical Sound, Ph.D. dissertation, Stanford University.

Smith, Julius O., and Serra, Xavier, 1987, "PARSHL: An analysis/synthesis program for non-harmonic sounds based on a sinusoidal representation", Proceedings of the 1987 International Computer Music Conference, Computer Music Association, San Francisco, California, pp. 290-297.

Wang, Avery, 1994, Instantaneous and Frequency-Warped Signal Processing Techniques for Auditory Source Separation, Ph.D. dissertation, Department of Electrical Engineering, Stanford University, Stanford, California.