Spectral Anticipations

Shlomo Dubnov
CRCA / Music, UCSD
sdubnov@ucsd.edu

Abstract

The question of noise vs. structure is basic to many musical applications, as well as audio processing and sound understanding. In this paper we present a new measure for the extent of randomness present in a signal that we call Information Rate. This measure is formulated in terms of the reduction of uncertainty that an information processing system achieves when it forms anticipations about future values of a process based on its past. In the case of a simple source-filter signal model it is shown that this measure is equivalent to the well-known Spectral Flatness Measure that is commonly used in audio processing. The information rate is extended to multivariate processes, such as geometric signal, spectral or feature vector representations. The measure is tested on various synthetic, natural and musical signals. It is shown that it is capable of detecting structure in cases where standard methods based on the signal spectrum seem unable to operate.

1 Introduction

Noise has been playing an important role in music as well as in engineering. When dealing with 'noise-like' acoustic signals, the canonical engineering definition is usually adopted: a signal with an equal distribution of energy across all frequencies. This definition, suggesting that noise is a rich or complex signal, might obscure other significant properties of noise, such as being a signal with no temporal structure, or a signal that cannot be predicted. In this paper we consider the problem of noise and structure from an information theoretic point of view. Our starting point is the well-known measure of signal whiteness called the Spectral Flatness Measure (SFM). Using simple mathematical equivalences we will show that a function of the SFM actually measures the rate at which information is carried over by a signal in time, which we shall call Information Rate (IR). IR represents the amount by which the information in a random process grows with every additional observation of the process obtained over time. This is also the difference in uncertainty or entropy of the signal with or without prediction. Thus, noise carries zero IR, since prediction does not allow any reduction of information (the same number of bits is required for coding the next sample with or without prediction). The more structure a signal has, the higher its IR.

The significance of using IR as a measure of signal structure is twofold: First, it seems to correct some misconceptions about the nature of noise versus structure, allowing a new, information processing view of the question. A second important contribution is the possibility to generalize or extend the concepts of noise and structure to more complicated situations of multivariate processes and complex signals. Using "geometrical" (subspace) methods, the signal is represented in terms of expansion coefficients in some space. We show that the problem of determining signal structure becomes a problem of determining the "noisiness" of the expansion coefficients.

2 Spectral Flatness and Information Rate

The Spectral Flatness Measure (SFM) [Jayant] is a well-known method for evaluating the distance of a process from white noise. It is also widely used as a measure of the "compressibility" of a process, as well as an important feature for sound classification [Allamanche].
Given a signal with power spectrum S(ω), SFM is defined as

$$\mathrm{SFM} = \frac{\exp\left(\frac{1}{2\pi}\int \ln S(\omega)\, d\omega\right)}{\frac{1}{2\pi}\int S(\omega)\, d\omega} \qquad (1)$$

and is positive and less than or equal to one. SFM equals one for a white noise signal.

Information Rate (IR) is defined as the difference between the information contained in the variables $x_1, x_2, \ldots, x_n$ and $x_1, x_2, \ldots, x_{n-1}$, i.e. the additional amount of information that is added when one more sample of the process is observed:
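As a concrete illustration (not part of the original paper), the following is a minimal numpy/scipy sketch that estimates SFM per equation (1) from a Welch power spectral density; the test signals, the segment length and the helper name `spectral_flatness` are illustrative assumptions.

```python
import numpy as np
from scipy.signal import welch

def spectral_flatness(x, nperseg=256):
    """SFM per equation (1): ratio of the geometric to the arithmetic
    mean of the power spectrum. Near 1 for white noise, near 0 for
    strongly structured (predictable) signals."""
    _, S = welch(x, nperseg=nperseg)
    S = np.maximum(S, 1e-12)          # guard against log(0)
    return np.exp(np.mean(np.log(S))) / np.mean(S)

rng = np.random.default_rng(0)
noise = rng.standard_normal(16384)
tone = np.sin(2 * np.pi * 0.05 * np.arange(16384)) + 0.01 * rng.standard_normal(16384)
print(spectral_flatness(noise))       # close to 1: no temporal structure
print(spectral_flatness(tone))        # close to 0: highly predictable
```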

$$\rho(x_1, \ldots, x_n) = I(x_1, \ldots, x_n) - I(x_1, \ldots, x_{n-1}) \qquad (2)$$

It can be shown that for large n, IR equals the difference between the marginal entropy and the entropy rate of the signal x(t),

$$\rho(x) = \lim_{n \to \infty} \rho(x_1, \ldots, x_n) = H(x) - H_r(x) \qquad (3)$$

As will be shown in the Appendix, using expressions for the entropy and entropy rate of a Gaussian process, one arrives at the following relation:

$$\mathrm{SFM}(x) = \exp(-2\rho(x)) \qquad (4)$$

Equivalently, one can express IR as a function of SFM:

$$\rho(x) = -\frac{1}{2} \log(\mathrm{SFM}(x)) \qquad (5)$$

2.1 IR as transmission over a "time-channel"

The concept of mutual information is commonly used for the theoretical characterization of the amount of information transmitted through a communication channel. Since a channel introduces errors during the transmission process, some information between the input and output signals is lost. Mutual information I(x, y) measures the reduction of the uncertainty or entropy H(x) of the source signal that a received signal y achieves, as defined in equation 6:

$$I(x, y) = H(x) - H(x \mid y) \qquad (6)$$

Figure 1: Music anticipation as transmission over a noisy "time channel" (source x → channel → receiver y).

For the purpose of the current discussion we will consider a particular type of channel, whose input is the history of the signal up to the current point in time and whose output is the next sample. The transmission process over a noisy channel now has the interpretation of prediction/anticipation in time. The variable y is the history available at the receiver, and we would like to know how much information is carried by the history towards the future sample x. One can interpret IR as the capacity of a "time-channel", or the amount of information that a signal carries into the future, as shown in Figure 1. The number of bits needed to describe the next event, once an anticipation or prediction based on the past is established, quantitatively measures this capacity.

Relativity of Noise. Let us consider a situation where two systems are trying to form expectations about the same data. One system has a correct model, which allows making good predictions about the next symbol. In such a case the amount of information that is required in order to describe the next sample is proportional to the size of the remaining error. If the original uncertainty about the signal was large but the uncertainty remaining after the prediction is small, we say that the system managed to capture the signal structure and achieve its information-processing task. If a second system does not have the capability of making correct predictions, the number of bits needed to code the next symbol remains almost equal to the number of bits required for coding the signal "as is", i.e. without any prediction. In such a case the discrepancy between the two coding lengths before and after prediction is zero, and no information reduction was achieved by the second system.

Music Listening System. Let us consider a musical situation: the listener makes predictions and forms expectations, while the music source presents or generates new samples. Structure is something that can be predicted or anticipated; noise is an unpredictable flow of data. In the case of white noise, no information is carried by the signal into the future. Mathematically, this is expressed by the fact that the entropy rate and the marginal entropy are equal.

Signal Characterization. The above discussion suggests that the IR value depends on the nature of the information processing system, as well as on the type of signal. Only in the case of an ideal system that has access to the complete signal statistics does IR become a characterization of the signal alone, independent of the information processing system. Nevertheless, comparative analysis of sounds is still possible within one practical system.

3 Geometrical Signal Representation and Information Rate

In the case of an n-dimensional multivariate process, we introduce a matrix notation for a sequence of column vectors $X_1 X_2 \ldots$. These vectors could be signal samples divided into frames, columns of a Short Time Fourier Transform (STFT), STFT magnitudes (spectrogram) or some other sound features ordered in time. Considering a basis given by the columns of a matrix A, the data vectors can be represented as a sequence of expansion coefficients s in the basis A:

$$[\,X_1\; X_2\; \cdots\,] = A \begin{bmatrix} s_1(1) & s_1(2) & \cdots \\ s_2(1) & s_2(2) & \cdots \\ \vdots & \vdots & \\ s_n(1) & s_n(2) & \cdots \end{bmatrix} \qquad (7)$$
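To make (7) concrete, here is a small sketch (illustrative, not from the paper) that frames data into the columns of X, takes an orthonormal basis A from the SVD, and reads off decorrelated expansion coefficients:

```python
import numpy as np

# Equation (7) in miniature: columns of X are feature vectors over time;
# with an orthonormal basis A (here the left singular vectors), the
# coefficient matrix S satisfies X = A @ S exactly.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 200))          # 64-dim vectors over 200 frames
A, _, _ = np.linalg.svd(X, full_matrices=False)
S = A.T @ X                                 # expansion coefficients s_i(t)
print(np.allclose(X, A @ S))                # True: perfect reconstruction
G = S @ S.T                                 # Gram matrix of coefficient rows
print(np.allclose(G, np.diag(np.diag(G))))  # True: rows are orthogonal
```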

One of the main questions in geometric representation is how to find the "correct" basis vectors. Different criteria exist in the literature, such as minimizing the error between the original signal and its reconstruction using a limited number of components (so-called low rank representation), independence of the expansion coefficients, and more. As we will see in the following, we require that the expansion coefficients be independent. In practice we will use a decorrelation procedure that achieves only approximate independence.

We generalize the IR definition to the multivariate case as

$$\rho_L(X_1, X_2, \ldots, X_L) = I(X_1, X_2, \ldots, X_L) - \{I(X_1, X_2, \ldots, X_{L-1}) + I(X_L)\} \qquad (8)$$

The new definition considers the difference in information over L consecutive signal frames versus the sum of the information in the first L-1 frames and the information in the last frame $X_L$. We shall sometimes call this vector IR (VIR). In geometric representation, using expressions for the entropy of a linearly transformed random vector, and assuming independence of the expansion coefficients, it can be shown [Dubnov] that IR may be calculated as a sum of the IRs of the individual expansion coefficients,

$$\rho_L(X_1, X_2, \ldots, X_L) = \sum_{i=1}^{n} \rho\big(s_i(1), \ldots, s_i(L)\big), \qquad (9)$$

the equality holding when the components $s_i(n)$, $i = 1, \ldots, n$ are independent. The vector IR generalization shows that in the case of complex signals the overall structure can be estimated from the sum of the IRs of the individual components. We can identify structural components as the ones that have predictable trajectories in the new space as a function of time. Components that have no structure will have low IR in their expansion coefficients.

3.1 Audio Spectral Basis

When vector IR analysis is applied to spectral vectors, we call the result the "spectral anticipation structure" of the audio signal. There are several possible methods for transforming signals into spectral representations. The most common method is the short time Fourier transform (STFT), which represents the signal as a sequence of Fourier transform magnitude values derived over short signal frames. In the context of IR analysis, the transform coefficients are sequences of spectral amplitudes at different frequency bins. In order to assure independence between the frequency coefficients, a decorrelation procedure needs to be applied to the different frequency bins, using for instance the Karhunen-Loeve transform (KLT) [Hayes].

Another common representation of the spectral contents of audio signals is by means of cepstral coefficients [Oppenheim]. One of the great advantages of the cepstrum is its ability to capture different details of the signal spectrum in one representation. For instance, the energy of the signal corresponds to the first cepstral coefficient. Low cepstral coefficients capture the shape of the spectral envelope, representing the smooth or gross spectral details. Detailed spectral variations, such as spectral peaks due to pitch (the actual notes played), or other long signal correlations, appear in the higher cepstral coefficients. Using a subset of the cepstral coefficients thus allows easy control over the type of spectral information we would like to consider for the IR analysis.
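As an illustration of the cepstral option, here is a sketch computing per-frame real cepstra and keeping only the first coefficients; the helper name `cepstral_frames`, the frame length and the number of retained coefficients are all hypothetical choices, not values from the paper.

```python
import numpy as np

def cepstral_frames(x, frame_len=512, hop=256, n_keep=30):
    """Real cepstrum of each frame: inverse FFT of the log magnitude
    spectrum. Low coefficients describe the smooth spectral envelope;
    higher ones carry fine structure such as pitch-related peaks."""
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * np.hanning(frame_len)
        mag = np.abs(np.fft.rfft(frame))
        ceps = np.fft.irfft(np.log(np.maximum(mag, 1e-12)))
        frames.append(ceps[:n_keep])
    return np.array(frames).T        # one row per cepstral coefficient
```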
4 Analysis of natural sounds

In order to evaluate the method of spectral IR analysis we consider an example of a "noise-like" sound, i.e. a sound that has dense spectral content throughout all frequencies. Figure 2 shows the spectrogram of a cheering crowd sound. This sound contains a dense mixture of applause, cheering and other noises. Some of the hand clapping can be seen as vertical lines, while the vocal exclamations appear as several narrow spectral peaks varying in time.

Figure 2: Cheering crowd spectrogram. See text for details.

Performing vector IR analysis of the spectrogram matrix consisted of the following steps (a sketch implementing them follows below):

1. Performing decorrelation of the rows of the spectrogram matrix using the KLT. This can be done efficiently using the Singular Value Decomposition (SVD).

2. Estimating the scalar IR for each of the coefficients (decorrelated rows) by estimating their SFM and applying the transformation from SFM to IR.

3. Obtaining the vector IR value by summing the scalar IRs.
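A minimal sketch of these three steps, under assumptions: the helper names `scalar_ir` and `vector_ir` are illustrative, and the Welch segment length is an arbitrary default (`scalar_ir` implements equation (5)).

```python
import numpy as np
from scipy.signal import welch

def scalar_ir(x, nperseg=64):
    """Equation (5): rho = -0.5 * ln(SFM), with SFM from a Welch PSD."""
    _, S = welch(x, nperseg=min(nperseg, len(x)))
    S = np.maximum(S, 1e-12)
    return 0.5 * (np.log(np.mean(S)) - np.mean(np.log(S)))

def vector_ir(V, nperseg=64):
    """Steps 1-3: SVD-decorrelate the rows of the feature matrix V
    (features x time), estimate the scalar IR of each decorrelated
    row, and sum them as in equation (9)."""
    V = V - V.mean(axis=1, keepdims=True)
    _, s, Vt = np.linalg.svd(V, full_matrices=False)   # step 1
    coeffs = s[:, None] * Vt         # decorrelated coefficient rows
    return float(sum(scalar_ir(c, nperseg) for c in coeffs))  # steps 2-3
```

Applied to a magnitude spectrogram matrix, `vector_ir` would yield the vector IR of the sound; applied directly to the signal samples, `scalar_ir` gives the scalar IR used for comparison below.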

The analysis in step 2 shows that different components contain different structure. The IR values for the different components are presented in Figure 3, which shows the result of scalar IR analysis of a spectrogram matrix using an FFT of size 256 with 50 percent overlap. Scalar IR was obtained from SFM, using the Welch spectral estimation method with 64 spectral bins.

Figure 3: IR of the different components after SVD analysis of the cheering crowd spectrogram.

Additionally, a synthetic signal with an approximately similar spectral envelope was constructed by passing white noise through a filter whose spectral shape resembled the power spectral density of the original signal. (Filter parameters were estimated using the Linear Prediction method with 8 filter coefficients.) The resulting IR values for the synthetic signal appear in Figure 4. (The values in the IR graphs are sorted in descending order of the corresponding SVD singular values. One can note that the IR values do not appear in descending order and thus capture something different than signal variance.) One can note that the IR values for the components of the synthetic signal are an order of magnitude smaller than those of the original signal.

Figure 4: Vector IR, using the SVD basis, of a noise signal that has the same spectral envelope as the cheering crowd. The values of the components are an order of magnitude smaller than those of the original signal.

Comparative scalar and vector IR analyses were applied to the original and synthetic signals. The resulting IR values are as follows:

Original signal: vector IR 13.62, scalar IR 1.927
Noise signal with equivalent average spectral envelope: vector IR 2.58, scalar IR 1.65

This example shows that vector IR detects a significant amount of structure in the original sound. Naturally, the synthetic signal misses both the temporal variation and part of the spectral detail of the original. Applying scalar IR (to account for the spectral detail) finds only slightly more structure in the original signal than in the noise signal.

Sound Effects: characterization of complexity using IR. The next application of IR analysis is characterization of the complexity of sound effects or sound textures. One should note that IR analysis does not directly apply to classification or discrimination between sound effects, since different sounds of the same type can have very different levels of IR complexity or anticipation structure. For instance, sounds of a cheering crowd can have diverse IR levels, from a smooth mass of noise to widely varying cheering and applause sounds. In Figure 5 we present the distribution of energy and IR values for several classes of sound effects. These sounds include Cheering, Fire, Car Crash, Glass Break and a collection of different Phone Rings. The graph shows the distribution of the sounds in terms of average energy and vector IR. The IR analysis was done using the cepstral method with parameter settings similar to the ones above (30 cepstral coefficients, frames of 512 samples with 50 percent overlap, Welch power spectral estimation with 64 FFT bins).
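A sketch of how such an energy/IR scatter could be produced, reusing the hypothetical `cepstral_frames` and `vector_ir` helpers from the earlier sketches (the parameter values mirror the ones quoted above; the function name and input convention are assumptions):

```python
import numpy as np

def characterize(signals, frame_len=512, hop=256, n_keep=30):
    """Return (mean energy, vector IR) per sound for a Figure-5-style
    scatter; `signals` is a list of 1-D sample arrays."""
    points = []
    for x in signals:
        energy = float(np.mean(x ** 2))
        ir = vector_ir(cepstral_frames(x, frame_len, hop, n_keep))
        points.append((energy, ir))
    return points
```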

Figure 5: Distribution of energy and IR values for several classes of natural sound effects.

One can see in Figure 5 that the Fire sounds have low IR (they are closest to pure noise) and little variation in energy. Cheering and Car Crash have widely varying IR, with Cheering being more limited in terms of energy spread. Glass Break falls in the higher range of the Car Crash and Cheering IR. This seems reasonable, since the Glass Break sounds had a relatively more predictable structure in comparison, for instance, to Car Crash, which contained widely varying sounds such as car engines, squeaking tires and crashing sounds. The Cheering examples also varied widely, from sparse applause to very dense stadium crowd cheering. It is interesting also to observe that the Phone Rings have the most structure, as might be intuitively expected.

5 Short Time IR Analysis and its Application to Music

The method of IR analysis can further be used in a time varying fashion by applying it over short times. This allows analysis of longer, time varying signals. The method described below uses sequences of spectral descriptors derived from short signal frames. Sequences of these short frames are grouped into macro frames that contain several spectral vectors. Each macro frame is subjected to IR analysis and results in a single IR value. Figure 6 shows the details of the vector IR algorithm. Note that some analysis parameters could be set differently, depending upon the type of music in question (for instance, orchestral music seems to require larger analysis frames).

Figure 6: Scheme of the IR algorithm applied to musical signals. Feature extraction derives timbral/textural descriptors (cepstrum analysis of 20-600 ms frames); factorization by SVD of the cepstral data yields timbral prototypes and their time evolution s_1(t), s_2(t), ..., s_N(t); the SFM of each coefficient sequence then gives its IR.
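A sketch of this short-time procedure, again reusing the hypothetical `cepstral_frames` and `vector_ir` helpers; the timing defaults follow the parameters quoted for the Schumann analysis below, and the function name is an assumption:

```python
import numpy as np

def ir_over_time(x, sr, frame_s=0.020, macro_s=3.0, overlap=0.75, n_keep=30):
    """IR evolution: vector IR of each macro frame of cepstral features."""
    frame_len = int(sr * frame_s)
    hop = frame_len // 2
    feats = cepstral_frames(x, frame_len=frame_len, hop=hop, n_keep=n_keep)
    per_macro = int(macro_s * sr / hop)   # feature frames per macro frame
    step = max(1, int(per_macro * (1 - overlap)))
    return np.array([vector_ir(feats[:, t:t + per_macro])
                     for t in range(0, feats.shape[1] - per_macro + 1, step)])
```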

When the IR graph is plotted against time, one obtains a graph of the IR evolution through the course of a musical signal. This graph may be used for analysis of complex audio signals or even entire musical pieces. In the following we present the results of IR analysis of an excerpt of the Schumann Piano Quartet in E-flat Major. The music contains three sections of sustained string chords that end with a piano phrase, and a faster final Tutti section.

Figure 7: Scalar (dashed) and vector (solid) IR analysis results of the Schumann excerpt overlaid on top of the signal spectrogram.

Figure 7 shows the scalar and vector IR analysis results of the Schumann excerpt overlaid on top of the signal spectrogram. The solid line shows the vector IR and the dashed line corresponds to the scalar IR. The scalar IR values were smoothed using a sliding averaging window 1.5 seconds long. For plotting purposes the actual IR values were scaled so as to fit the display on top of the spectrogram. The vector IR analysis was done using 30 cepstral features. The short frames used for cepstral analysis were 20 milliseconds long, and the macro frame for IR analysis was three seconds long with 75 percent overlap between successive macro frames. These results were compared to a scalar IR measure that was estimated directly from the signal.

Evaluation of the graphs shows that the scalar IR tends to assign more structure to the less dense transition areas (piano phrases), where the spectrum has a more prominent harmonic structure. Moreover, scalar IR shows little structure in the last Tutti section, which is the most dense and complex. One can see that the vector IR corresponds more naturally to the texture complexity of this musical example.

6 Discussion and Conclusion

This paper dealt with noise and structure in audio and musical signals. Traditionally, structure is associated with signal complexity in terms of the multiplicity of different frequency components. This approach is misleading, since it labels noise, which contains all frequencies, as a complex signal. A more correct view of the complexity of a signal considers the amount of structure in its model, or more precisely the amount of structure that a model explains versus the unexplained part of the signal left over due to its inherent stochasticity. In order to deal with this problem, we first described the SFM as a correct way to consider structure versus noise when looking "at the signal" itself, i.e. considering the sound samples as the variables. Next we addressed the question of structure looking at sound vectors, spectra or features. We used a new measure for randomness that was developed for the case of multivariate or vector processes. We applied this new measure to a sequence of spectral features, thus denoting it "spectral anticipation".

One of the main contributions of this paper is the conceptual formalization of the question of structure as the extent to which the uncertainty (entropy) of a variable is reduced by prediction of its future values based on its past. We called this measure Information Rate (IR). Additionally, we presented a simple computational method for estimating IR in the signal and feature vector cases. It was shown that in the multivariate or vector case, IR can be estimated from the marginal IRs of the constituent components, if these are independent. The vector IR approach allows generalizing and extending the concepts of noise and structure to complex signals. For instance, when several signals are mixed, the resulting mix might look noise-like in its spectrum even when its individual constituents have structure. By decomposing the signal into its separate components, a more correct characterization of the signal complexity may be estimated. Even though we are not at this point fully capable of decomposing a signal into its true constituents, we suggested using the geometric representation as an approximation. Using a variety of "geometrical" signal analysis methods, one can derive a representation of the signal in terms of expansion coefficients in some space, with the basis functions considered as "signal prototypes".
This representation can be applied to the signal itself, to signal spectral amplitudes or to sequences of signal feature vectors. The problem of determining structure then becomes a problem of summing the IRs of the expansion coefficients in the representation space. In this paper the IR algorithm was applied to several natural sounds and a musical recording. We hope that the framework for complexity / anticipation analysis outlined in this paper can serve as a basis for developing further methods for musical understanding and sound analysis, with application to domains as diverse as Spectro-Morphology, Music Cognition and Music Information Retrieval.

References

Allamanche, E., J. Herre, O. Hellmuth, B. Fröba, T. Kastner, and M. Cremer (2001), "Content-based Identification of Audio Material Using MPEG-7 Low Level Description", in Proceedings of the International Symposium on Music Information Retrieval (ISMIR), Indiana University.

Cover, T. M. and J. A. Thomas (1991), Elements of Information Theory, John Wiley & Sons, New York.

Dubnov, S. (2003), "Non-Gaussian Source-Filter and Independent Components Generalizations of Spectral Flatness Measure", in Proceedings of the International Conference on Independent Components Analysis (ICA2003), Nara, Japan.

Hayes, M. (1996), Statistical Digital Signal Processing and Modeling, Wiley.

Jayant, N. S. and P. Noll (1984), Digital Coding of Waveforms, Prentice-Hall.

Oppenheim, A. V. and R. W. Schafer (1989), Discrete Time Signal Processing, Prentice Hall, New Jersey.

Appendix A: The relation between SFM and IR

In order to assess the amount of structure present in a signal in terms of its information content, we observe the following relations between the signal spectrum and its entropy. The entropy of a "white" Gaussian random variable is given by

$$H(x) = \frac{1}{2}\ln\left(\frac{1}{2\pi}\int S(\omega)\, d\omega\right) + \frac{1}{2}\ln 2\pi e, \qquad (A10)$$

while the entropy rate of a Gaussian process (the so-called Kolmogorov-Sinai entropy) is given by

$$H_r(x) = \frac{1}{4\pi}\int \ln S(\omega)\, d\omega + \frac{1}{2}\ln 2\pi e. \qquad (A11)$$

IR is defined as the difference between the marginal entropy and the entropy rate of the signal x(t), $\rho = H(x) - H_r(x)$. Inserting the expressions for the entropy and entropy rate, one arrives at the following relation:

$$\mathrm{SFM}(x) = \exp(-2\rho(x)) = \frac{\exp\left(\frac{1}{2\pi}\int \ln S(\omega)\, d\omega\right)}{\frac{1}{2\pi}\int S(\omega)\, d\omega} \qquad (A12)$$

Equivalently, one can see that IR equals minus one half of the logarithm of SFM.

Appendix B: SVD

In our work, decorrelation is achieved by applying the Singular Value Decomposition (SVD) [Hayes] to a matrix obtained from an initial short-time cepstral analysis of a musical signal. In SVD, a rectangular matrix is decomposed into the product of three other matrices. A singular value decomposition of an (m x n) matrix X of cepstral vectors is a factorization of the form

$$X = U S V^T \qquad (A13)$$

where U is an (m x m) orthogonal matrix, S is an (m x n) diagonal matrix containing the singular values, and V is an (n x n) orthogonal matrix. We constrain the SVD to a factorization where the singular values in S are in decreasing order. The left matrix describes the original column entities as vectors of derived orthogonal factor values, which in the geometric interpretation are the basis vectors. The right matrix describes the original row entities, which in our case correspond to the expansion coefficients in this basis. The middle matrix contains scaling values or variances, such that when the three components are matrix-multiplied, the original matrix is reconstructed. Any matrix can be so decomposed exactly, using no more factors than the smallest dimension of the original matrix. It must be noted that SVD makes the rows of V^T orthogonal, which in the statistical interpretation amounts to a lack of correlation. Assuming Gaussian statistics, we may treat the rows as independent, so that they contribute to the overall IR as a simple sum of the individual IR contributions.

Appendix C: Vector IR as a sum of independent component scalar IRs

Given a linear transformation X = AS between blocks of the original data (a signal frame or feature vector X) and its expansion coefficients S, the entropy relation between the data and the coefficients is

$$H(X) = H(S) + \log |\det(A)|. \qquad (A14)$$

For a sequence of data vectors we evaluate the conditional IR as the difference between the entropy of the last block and its entropy given the past vectors (this is a conditional entropy, which becomes an entropy rate in the limit of an infinite past). Using the standard definition of multi-information for signal samples $x_1, \ldots, x_{Ln}$,

$$I(x_1, x_2, \ldots, x_{Ln}) = \sum_{i=1}^{Ln} H(x_i) - H(x_1, \ldots, x_{Ln}), \qquad (A15)$$

one can show that the vector IR is

$$\rho_L(X_1, \ldots, X_L) = I(X_1, \ldots, X_L) - \{I(X_1, \ldots, X_{L-1}) + I(X_L)\} = H(X_L) - H(X_L \mid X_1, \ldots, X_{L-1}). \qquad (A16)$$

This shows that the vector IR can be evaluated as the difference between the entropy of the last block and the conditional entropy of that block given its past. Using the entropy transform relation (A14), one can equivalently express the vector IR as a difference in entropy and conditional entropy of the transform coefficients:

$$\rho_L(X_1, \ldots, X_L) = H(S_L) - H(S_L \mid S_1, \ldots, S_{L-1}). \qquad (A17)$$

If the coefficients are independent, we arrive at the relations

$$H(S_L) = \sum_{i=1}^{n} H(s_i(L)), \qquad H(S_L \mid S_1, \ldots, S_{L-1}) = \sum_{i=1}^{n} H\big(s_i(L) \mid s_i(1), \ldots, s_i(L-1)\big), \qquad (A18)$$

$$\rho_L(X_1, X_2, \ldots, X_L) = \sum_{i=1}^{n} \rho\big(s_i(1), \ldots, s_i(L)\big). \qquad (A19)$$
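As a numeric sanity check of relations (4)/(A12) (not part of the original paper; the AR(1) example and all parameter values are illustrative), one can compare a Welch-based SFM estimate against the closed-form value for a first-order autoregressive process:

```python
import numpy as np
from scipy.signal import lfilter, welch

# AR(1) process x_t = a*x_{t-1} + e_t with unit-variance innovations:
# marginal variance is 1/(1-a^2) and innovation variance is 1, so
# rho = 0.5*ln(1/(1-a^2)) and SFM = exp(-2*rho) = 1 - a^2.
a = 0.8
rng = np.random.default_rng(1)
e = rng.standard_normal(1 << 18)
x = lfilter([1.0], [1.0, -a], e)   # synthesize the AR(1) process

_, S = welch(x, nperseg=1024)
sfm = np.exp(np.mean(np.log(S))) / np.mean(S)
print(sfm, 1 - a * a)              # both should be close to 0.36
print(-0.5 * np.log(sfm), 0.5 * np.log(1 / (1 - a * a)))  # IR estimates agree
```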