ICMC PROCEEDINGS 1995

AUDIO ANALYSIS FOR CONTENT-BASED RETRIEVAL

Douglas Keislar, Thom Blum, James Wheaton, and Erling Wold
Muscle Fish
2550 Ninth St., Suite 207 B, Berkeley, CA 94710, USA
tel: 1-510-486-0141
email:
Web page:

ABSTRACT: Musicians and sound designers could greatly benefit from content-based retrieval, whereby recorded sounds are selected from a database on the basis of desired acoustical or subjective features. This paper describes an audio analysis engine, currently under development, that is designed to help automate the process of sound classification, retrieval, and selection using an audio database.

Introduction

Existing sound-effects databases and "librarians" typically permit the user to associate a limited number of textual keywords and/or descriptions with each sound. While words can sometimes serve as sufficient keys for the retrieval of data, there are many cases where one would like to query the source more directly. "Content-addressable databases" and "content-based retrieval" are the terms commonly used to describe this kind of information storage and retrieval system. Such systems permit searches for features, keys, or triggers extracted directly from the data (as opposed to being one step removed from the data, as in a keyword-only description of that data). While such searching is no longer a problem for text-only databases, it is only beginning to be addressed in multimedia document management, particularly for audio.

In this paper, we introduce an approach to analyzing, cataloguing, and retrieving sounds from an audio database. We suggest ways in which sounds may be retrieved from repositories by using any one or a combination of objective (acoustic) metrics, by specifying subjective perceptual features, or even by selecting or entering a reference sound and asking the database to retrieve all sounds that are similar (or dissimilar) to it. We explain here the signal-analysis techniques that we are developing for use in a sound database engine on a UNIX platform. Not discussed here is the client application with a graphical user interface for retrieving sounds (Blum et al. 1995).

Basis of the analysis technique

Content-based retrieval of audio can mean a variety of things. At the simplest level of implementation (but the least simple level of usage), one could retrieve a sound by specifying the exact numbers in an excerpt of the sound's sampled data. At the next higher level of abstraction, the retrieval would match any sound containing the given excerpt, regardless of the data's sample rate, quantization, compression, etc. At the next level, the query might involve frequency-domain information or other acoustic attributes that can be directly measured. Finally, at the most difficult level of implementation but potentially the most user-friendly level, the query could include perceptual (subjective) properties of the sound. (At an even higher level, one might eventually like the system to actually identify the sound sources, but research with this goal is in its infancy (McAdams 1993).)

It is the third and fourth levels, acoustic and subjective properties, with which we are most concerned. (The implementation of the first two levels is conceptually straightforward and need not be discussed here.) Some of the perceptual properties of a sound, such as pitch, loudness, and brightness, correspond closely to measurable attributes of the audio signal, making it logical to provide fields for these properties in the audio database record. However, other subjective properties (for instance, "scratchiness") are more indirectly related to easily measured acoustical attributes of the sound. Some of these properties may even have different meanings for different users. (The phenomenon of synaesthesia is an extreme case of subjectivity: a user might call certain sounds "blue" and others "red.") To support subjective properties, the database record format is user-extensible.

To be able to use different perceptual criteria to retrieve a sound, we first measure a variety of acoustical attributes of each sound. This set of N attributes is represented as an N-vector. In text databases, the resolution of queries typically requires matching and comparing strings. In an audio database, we would like to match and compare the sort of subjective properties described above (such as "scratchiness"). For example, we would like to ask for all the sounds that are similar to a given sound, or that have more or less of a given property. To guarantee that this is possible, the space of N-vectors should satisfy the following constraints for each subjective property to be used in retrieval:

1. Sounds which differ in the subjective property should map to different regions of the N-space. If this were not satisfied, the database could not distinguish between sounds with different values for this property.

2. If the user ranks a set of sounds in increasing amounts of the given subjective property, these sounds should map approximately to a smooth path in some M-dimensional projection of the N-space, where M < N. Since we use a linear model, we have the additional restriction that the smooth path should be approximately linear.

Note: In the case where the subjective property is binary (that is, where a sound either has the property or does not), the linear constraint is not so important.

Since we cannot know the complete list of subjective properties that users may wish to specify, it is impossible to guarantee that our choice of acoustical attributes will meet these constraints.
However, we can ensure that these constraints are met for many useful subjective properties.

Acoustical attributes

The following aspects of sound are analyzed:

* Pitch. Pitch is estimated by taking a series of short-time Fourier spectra. For each of these frames, the frequencies and amplitudes of the peaks are measured and an approximate weighted greatest-common-divisor algorithm is used to predict the pitch (expressed as log frequency). The pitch algorithm also returns a pitch confidence value, which can be used as a measure of "how pitched" the sound is.
* Harmonicity. This parameter distinguishes between harmonic spectra (e.g., vowels and most musical sounds), inharmonic spectra (e.g., metallic sounds), and noise (spectra that vary randomly in frequency and time).
* Loudness. Loudness is approximated by the signal's RMS level in decibels, which is calculated by taking a series of windowed frames of the sound and computing the square root of the sum of the squares of the windowed sample values. (This method does not currently account for the frequency response of the human ear; we might later add equalization by applying the Fletcher-Munson contours.)
* Brightness. Brightness is computed as the centroid of the short-time Fourier spectra.
* Spectral peaks. The system smooths the short-time Fourier spectra and looks for broad peaks, which are parameterized by logarithmic values of frequency, magnitude, and width.

All the above aspects of sound can vary over time. The trajectory in time is computed during analysis, but for efficiency it is not stored as such in the database. Instead, for each trajectory, several parameters are computed and stored, including:

* Average.
* Maximum and minimum.
* Variance.
* Autocorrelation. This is a measure of the smoothness of the trajectory. It can distinguish, for example, between a pitch glissando and a wildly varying pitch, which a simple variance measure cannot.
* Parameters relating to the shape of the smoothed trajectory: critical points, number of inflections, attack and decay time (of the loudness trajectory).

In addition, the duration of the sound is stored. The N-vector of measured attributes thus consists of duration plus the parameters just mentioned (average, maximum, minimum, variance, autocorrelation, and shape parameters) for each of the aspects of sound given above (pitch, harmonicity, loudness, brightness, and spectral peaks).

Training the system and retrieving data

As mentioned above, some subjective properties relate directly to the measured attributes. However, we need a method to teach the system about new subjective properties, especially those whose meaning can vary between users.

To train the system, the user picks a relatively small subset of sounds which show varying amounts of the property in question. The user sorts the sounds according to their perceived ranking, assigns an approximate numerical value of the property to each sound, and submits them to the system. When training by example, the more examples the user supplies, the better the system's understanding will be. It is also best if the user submits examples which cover a wide range of values for the property.

For each sound s[j], j = 0 to M-1, the system computes the N-vector a[j], if it is not already computed. (M is the number of sounds to be analyzed, and N the number of acoustical parameters.) To find the relationship between the subjective property p[j] of each sound and the measured attributes a[j], we use a standard linear regression model with parameter vector b. That is:

    p[j] = b^T a[j] + e[j]

where e is the error in the model. (The superscript T denotes transposition.) Given p[j] and a[j], the b parameters can be estimated using least squares, which gives the minimum-variance linear unbiased estimate of b when the errors e[j] are zero-mean, uncorrelated, and of equal variance.
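
As a rough illustration of the regression step, the following Python sketch estimates b by solving the normal equations (A^T A) b = A^T p, where the rows of A are the attribute vectors a[j]. This is only one standard way to compute a least-squares fit and is not necessarily how the engine described here does it. The function names (solve_linear_system, fit_property, predict_property, mean_squared_error) and the two-element attribute vectors are hypothetical, and the training numbers are made up for the example.

```python
from typing import List, Sequence


def solve_linear_system(A: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Work on an augmented copy [A | y] so the inputs are left untouched.
    aug = [[float(v) for v in A[i]] + [float(y[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: training sounds do not determine b")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_property(attributes: Sequence[Sequence[float]], ratings: Sequence[float]) -> List[float]:
    """Estimate b in p[j] = b^T a[j] + e[j] from the normal equations."""
    if len(attributes) != len(ratings):
        raise ValueError("need exactly one rating p[j] per attribute vector a[j]")
    n = len(attributes[0])
    ata = [[sum(a[r] * a[c] for a in attributes) for c in range(n)] for r in range(n)]
    atp = [sum(a[r] * p for a, p in zip(attributes, ratings)) for r in range(n)]
    return solve_linear_system(ata, atp)


def predict_property(b: Sequence[float], a: Sequence[float]) -> float:
    """Apply the fitted mapping b^T a to one attribute vector."""
    return sum(bi * ai for bi, ai in zip(b, a))


def mean_squared_error(b, attributes, ratings) -> float:
    """Mean of e[j]^2 over the training sounds."""
    errors = [p - predict_property(b, a) for a, p in zip(attributes, ratings)]
    return sum(e * e for e in errors) / len(errors)


if __name__ == "__main__":
    # Made-up training set: four sounds, two measured attributes each,
    # and the user's numerical rating of the property for each sound.
    train_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
    train_p = [2.0, -1.0, 1.0, 3.0]
    b = fit_property(train_a, train_p)
    print("b =", b)
    print("mean squared error =", mean_squared_error(b, train_a, train_p))
```

The model as written has no constant term; appending a fixed 1.0 to every attribute vector would be one way to allow an offset, though that is an assumption of this sketch rather than something stated in the text.
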
Note that the elements of b can be reported to the user to indicate which elements of a are most significant to the subjective property under consideration. The algorithm computes the variance of e, which can be examined to give the user an indication of how well the model fits the data. At some threshold, the system could indicate that the training failed: that is, the user needs to supply more training sounds, or the currently measured attributes do not meet the constraints (listed above under "Basis of the analysis technique") for the specified subjective property. In the latter case, the property might be an uncomputable property (e.g., "how much I like the sound"), or there might be further measurements which could be added to the repertoire to increase the usefulness of the database.

Once the mapping between measured attributes and the subjective property is understood by the system, the mapping is applied to each sound in the system (or each of the sounds currently of interest to the user). The name and value of the property are included in the database record for that sound. Once this is accomplished, it is straightforward to select sounds from the database with queries relating to the property. For example:

* Query by value. Retrieve all the sounds with values of property p0 greater than 0.9 and property p1 less than 0.2.
* Query by example. Retrieve all the sounds similar to this sound with respect to property p0. "Similar to" means "within some delta of." Retrieve all the sounds with less p1 than this sound.
* Organization/Browsing. Sort the current sounds by property p0. Group the current sounds by properties p0 and p1.

Binary or discrete properties

In the case of binary or discrete properties, the property is used as a classifier. In the binary case, a sound either has this property or does not, and in the more general discrete case, the sound falls into one of several categories. To train the system, the user selects examples of sounds which have the property or do not (in the binary case) or which illustrate the different categories (in the general discrete case). For each sound, the a vectors are computed if they have not already been computed. The mean vector μ and the covariance matrix R for the a vectors in each category are then calculated. When a new sound needs to be classified, a likelihood value is calculated for each category using the multivariate normal distribution with that category's μ and R. If the likelihood for a category is above a threshold, the sound is determined to be in that category. If there are several mutually exclusive categories, the sound is placed in the category with the highest likelihood.

Miscellaneous Comments

The above discussion handles the case where each sound is a single gestalt. Longer recordings need to be segmented before using the retrieval features above. Segmentation is accomplished by applying the acoustic analyses above to the signal and looking for transitions (sudden changes in the measured attributes).

Because the sound analysis procedures are time-consuming, various schemes exist for telling the system when it should take the time to analyze a sound that has been added to the database. These include: immediately, upon first reference to the sound, when idle, and at a scheduled time.

In some cases, the user might know which measured attributes are relevant to the subjective property in question. The analysis of a subjective property can be made faster and more reliable in these cases by allowing the user to constrain the analysis to a subset of all the measured attributes. One can also imagine making the analysis engine extensible, by allowing a user to "plug in" new analysis algorithms.

Content-based retrieval is not mutually exclusive with text-based retrieval.
The client application can use textual keywords and categories in addition to the acoustical and subjective properties known to the audio engine. Sufficiently flexible browsing and searching utilities can combine these criteria in different ways (Blum et al. 1995).

Neural nets provide a different way to find a mapping between acoustical attributes a and perceptual properties p (Feiten and Günzel 1994). The advantage of neural nets is that they can discern non-linear mappings. The disadvantage is that it is difficult to see what is going on "inside" the net. Thus, it is harder to estimate how good a model the system has discovered (how well it understands), or to see which measured attributes correlate most strongly with p.

There are other applications for an audio analysis engine of the sort described. Sound editing and mixing can be integrated with the database application (Blum et al. 1995). Other researchers have described an analogous database for sound synthesis (Vertegaal and Bonis 1994). Finally, non-audio data can be "sonified" (converted to audio) for a wide variety of scientific and medical applications, in which content-based retrieval would probably be useful.

References

Blum, T., D. Keislar, J. Wheaton, and E. Wold. 1995. "Audio Databases with Content-Based Retrieval." Proceedings of the 1995 International Joint Conference on Artificial Intelligence (IJCAI) Workshop on Intelligent Multimedia Information Retrieval.

Feiten, B., and S. Günzel. 1994. "Automatic Indexing of a Sound Database Using Self-organizing Neural Nets." Computer Music Journal 18(3): 53-65.

McAdams, S. 1993. "Recognition of Sound Sources and Events." In Thinking in Sound: The Cognitive Psychology of Human Audition, edited by S. McAdams and E. Bigand. Oxford: Clarendon Press.

Vertegaal, R., and E. Bonis. 1994. "ISEE: An Intuitive Sound Editing Environment." Computer Music Journal 18(2): 21-29.