# FEATURE MODULATION SYNTHESIS (FMS)



Tae Hong Park, Jonathan Biguenet, Zhiye Li, Conner Richardson, Travis Scharr
Tulane University Music Department, New Orleans, LA 70118

## ABSTRACT

This paper introduces FMS (Feature Modulation Synthesis), a synthesis-by-analysis sound synthesis system. Unlike many traditional synthesis methods such as AM, FM, additive/subtractive synthesis, waveshaping, wavetable synthesis, and physical modeling, which generate sound via sonic parameters that were in essence developed to mimic or model existing target acoustic instruments, FMS takes the approach of extracting salient timbral features directly from a sound object, the end goal being the resynthesis of an altered sound object that reflects the modulated timbral features. The FMS system comprises analysis of a sound object and salient feature extraction; modulation of one or more features; and synthesis of the new sound object from the modulated features. The essential idea behind FMS is thus to synthesize a sound object by modulating and affecting only the desired timbral dimensions while ideally leaving all other timbral aspects intact, giving very fine control over specific sonic attributes.

## 1. INTRODUCTION

The idea for FMS originated during 2003-2004 research in Music Information Retrieval and feature extraction for automatic timbre recognition studies [3], where it was outlined as a "future work" project that has evolved into the present one. The motivation for this research can be traced to the somewhat limited number of synthesis/resynthesis tools for composers who wish to deal directly with perceptually relevant dimensions of sound and to control and manipulate such dimensions by altering salient features.
Classic synthesis algorithms commonly generate sound from scratch, from sine tones, or by sculpting noisy signals via control parameters that are often abstract and seemingly bear little relationship to perceptually relevant nomenclature. FMS tries to alleviate some of this disconnect between abstract control parameters and perceptually relevant ones by facilitating timbral control via modulation of salient features. The architecture follows a synthesis-by-analysis model with salient feature extraction, feature modulation, and synthesis modules, as shown in figure 1.

Figure 1. Basic FMS architecture, where s is the input sound object, s' the synthesized result, and v and v' the extracted and modulated feature vectors respectively.

The fundamental approach to sound synthesis in FMS is to analyze a sound object for salient features, modulate a select number of these feature vectors, and synthesize a new, altered sound object while keeping all other timbral features untouched. Although modulating one specific timbral dimension ideally should not affect any other features, in practice some artifacts do occur and are perhaps inevitable to a certain degree. This synthesis method is not completely unique in the sense that FM synthesis, granular synthesis, or subtractive synthesis is, and it includes general ideas that touch upon sound morphing and spectral shaping research [2, 7, 8].

## 2. FEATURE EXTRACTION

The current set of features implemented for FMS includes amplitude envelope detection using RMS (Root Mean Square) and windowing, spectral envelope, harmonic expansion/compression [3], inharmonicity, spectral spread, spectral flux, spectral centroid, and spectral jitter/shimmer. The more general spectral envelope feature is a three-dimensional envelope representation of a signal in the frequency domain, with frequency, magnitude, and time axes.
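The analyze-modulate-synthesize loop of figure 1 can be sketched in a few lines. Everything here (the function name, frame sizes, overlap-add scheme, and the choice of RMS as the example feature) is an illustrative assumption, not the FMS implementation itself:

```python
import numpy as np

def fms_pipeline(s, modulate, frame_size=1024, hop=512):
    """Sketch of the FMS analyze-modulate-synthesize loop.

    `modulate` maps an extracted feature vector v to a modulated
    vector v'; all names here are illustrative, not from FMS."""
    starts = range(0, len(s) - frame_size + 1, hop)
    # 1. Analysis: short-time frames of the input sound object
    frames = [s[i:i + frame_size] for i in starts]
    # 2. Salient feature extraction (here: RMS amplitude per frame)
    v = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    # 3. Feature modulation
    v_prime = modulate(v)
    # 4. Synthesis: rescale each frame so its RMS follows v',
    #    then overlap-add and normalize by the frame count per sample
    out = np.zeros(len(s))
    norm = np.zeros(len(s))
    for idx, i in enumerate(starts):
        gain = v_prime[idx] / max(v[idx], 1e-12)
        out[i:i + frame_size] += frames[idx] * gain
        norm[i:i + frame_size] += 1.0
    return out / np.maximum(norm, 1.0)
```

Halving the feature vector, for instance, should halve the output amplitude while leaving the waveform shape otherwise untouched.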
The frequency-domain analysis is achieved using the Short-Time Fourier Transform (STFT). Harmonic analysis is conducted via a combination of harmonic product spectrum [1], autocorrelation [4], mode analysis, and other custom algorithms.

## 3. FEATURE MODULATION AND SYNTHESIS

Feature modulation in FMS is divided into time-domain and frequency-domain processes, most of which occur in the latter using the STFT. Two main approaches, common to each modulation module, are used for frequency-domain manipulation: convolution-based modification of the spectrum, and direct manipulation of each frequency bin using an interpolated spectral envelope as reference.

### 3.1. Amplitude Envelope

The amplitude envelope feature is modulated by computing the product of the time-domain signal s and the envelope shaper v_shaper, where the RMS amplitude envelope of s is used as a reference, as shown in figure 2. The envelope shaper v_shaper is formed by modulating a copy of the original amplitude envelope of s via addition of nodes and computing the difference between the modulated envelope and the original envelope Env_s, as shown in eqs. 1 and 2. The synthesized output is then computed via eq. 3.
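As a concrete illustration, two of the listed features, the RMS amplitude envelope and the spectral centroid, can be sketched as follows (standard textbook definitions are assumed; the frame size, Hann windowing, and function names are illustrative and not taken from the FMS implementation):

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency of one analysis frame
    (a standard spectral centroid definition, assumed here)."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.sum(freqs * mag) / max(np.sum(mag), 1e-12)

def rms_envelope(s, frame_size=1024, hop=512):
    """RMS amplitude envelope: one RMS value per hop."""
    return np.array([np.sqrt(np.mean(s[i:i + frame_size] ** 2))
                     for i in range(0, len(s) - frame_size + 1, hop)])
```

For a pure sine tone the centroid sits near the tone's frequency and the RMS envelope near 1/√2 of its peak amplitude, which makes these two features convenient sanity checks for the analysis front end.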

Figure 2. Amplitude envelope modulation (envelope shaper nodes added over the original amplitude profile)

$$\mathrm{Env}_{diff}[n] = \mathrm{Env}_{mod}[n] - \mathrm{Env}_{s}[n] + 1 \qquad (1)$$

$$v_{shaper}[n] = \mathrm{Env}_{diff}[n]^{\,\mathrm{sign}(\mathrm{Env}_{diff}[n])} \qquad (2)$$

$$s'[n] = s[n] \cdot v_{shaper}[n] \qquad (3)$$

An additional feature is the "envelope flattener", which flattens out the amplitude envelope (eq. 4):

$$s''[n] = \frac{s[n]}{\mathrm{Env}_{s}[n]} \cdot v_{shaper}[n] \qquad (4)$$

The flattened envelope can be likened to a blank sheet of "amplitude envelope paper", so to speak, allowing a new envelope to be defined on a sound object whose envelope is flat.

### 3.2. Spectral Centroid

The approach to modulating the spectral centroid was inspired by the mechanics of a see-saw: more energy towards the lower frequencies results in the see-saw tilting towards DC, while more energy in the high-frequency area results in tilting towards the Nyquist frequency. The see-saw is thus initially balanced at half the Nyquist frequency and tilts towards either DC or Nyquist depending on the desired centroid location during modulation. Similar to eq. 3, a multiplication of the see-saw and the spectrum occurs in the frequency domain, resulting in convolution in the time domain, as seen in eq. 5:

$$X'[k] = X[k] \cdot v_{seesaw}[k] \qquad (5)$$

The tilting of the see-saw effectively moves the centroid towards either DC or Nyquist. Additionally, to achieve more "tilt" the see-saw is "bent", causing the line to take a polynomial shape, as seen in eq. 6 and figure 3, where the degree of bending of the convolver v_seesaw is determined by the polynomial order Q, scalars a_q, line u, frequency bin index k, and DFT length N:

$$v_{seesaw}[k] = \sum_{q=0}^{Q} a_{q}\, u^{q}[k] \qquad (6)$$

Figure 3. Spectral biasing through the see-saw

The desired centroid is reached by finding the appropriate see-saw in an iterative manner using the method of gradient descent [5]. The gradient descent takes the usual form shown in eq. 8, with error E (eq. 9) between the actual and desired centroid, and step size η:

$$u[n] = u[n-1] - \eta \frac{\partial E}{\partial u} \qquad (8)$$

$$E = SC[X'] - SC_{desired} \qquad (9)$$

### 3.3. Spectral Spread

Spectral spread modulation is concerned with the width and placement of upper and lower frequency bounds in the frequency domain, and with a threshold value relative to the maximum magnitude. The current implementation of spectral spread modulation can be applied to a single frame or to multiple frames using two methods: increasing or decreasing the spread equidistantly towards the DC and Nyquist limits, or individually setting the upper and/or lower frequency bounds. Each STFT frame is divided into three regions defined by the analyzed upper (f_u) and lower (f_l) frequency bounds, as seen in figures 4 and 5. The spectral spread region undergoes either up-sampling or down-sampling to expand or compress the spectral spread width while keeping the general shape of the spectral envelope unchanged, as further depicted in figures 4 and 5. This is followed by non-linear or linear cross-fading (d[k], eq. 10) of the original flanking frequency regions with the up-sampled or down-sampled envelope to help retain the overall shape of the original envelope:

$$0.0 \le d[k] \le 1.0 \qquad (10)$$

The cross-fading between the original envelope and the modulated envelope is defined by eqs. 11 and 12 and detailed in figures 4 and 5. For the flanking regions the following equation is used:

$$\mathrm{Env}_{flanking}[k] = \begin{cases} (1 - d_{DC}[k]) \cdot \mathrm{Env}_{mod}[k] + d_{DC}[k] \cdot \mathrm{Env}_{orig}[k], & 0 \le k < K_l \\ (1 - d_{f_s/2}[k]) \cdot \mathrm{Env}_{mod}[k] + d_{f_s/2}[k] \cdot \mathrm{Env}_{orig}[k], & K_u < k \le N/2 \end{cases} \qquad (11)$$

whereas for the spread region defined by K_l and K_u, eq. 12 is used:

$$\mathrm{Env}_{spread}[k] = \begin{cases} \mathrm{Env}_{mod}[k], & \text{spread increase} \\ \mathrm{Env}_{orig}[k], & \text{spread decrease} \end{cases} \quad \text{for } K_l \le k \le K_u \qquad (12)$$

Figure 4. Spread expansion (up-sampled spread region cross-faded with the original envelope)

Figure 5. Spread compression (down-sampled spread region cross-faded with the original envelope)

The combination of Env_flanking and Env_spread results in the modulated envelope shown in eq. 13, where k is the frequency bin number from DC to Nyquist:

$$\mathrm{Env}_{new}[k] = \mathrm{Env}_{spread}[k] + \mathrm{Env}_{flanking}[k] \qquad (13)$$

The remaining equations determine Env_conv, whose convolution product yields X'[k], which is then synthesized to the time-domain signal s'[n] after adjustment of the spectral energy to conserve the original energy level:

$$r[k] = \mathrm{Env}_{new}[k] - \mathrm{Env}_{orig}[k] \qquad (14)$$

$$\mathrm{Env}_{conv}[k] = \frac{r[k] + X[k]}{X[k]} \qquad (15)$$

$$X'[k] = X[k] \cdot \mathrm{Env}_{conv}[k] \qquad (16)$$

### 3.4. Harmonic Expansion/Compression

Harmonic expansion/compression [4] is a feature that describes the deviation of actual harmonics from their ideal locations, as depicted in figure 6.

Figure 6. Harmonic expansion/compression (ideal harmonic locations versus actual harmonics)

Equation 17 shows the modulation algorithm, where f_m denotes the extracted mth harmonic versus its ideal counterpart m·f_0 (f_0 being the fundamental frequency), and f'_m the modulated harmonic. Harmonic expansion and compression thus consist of harmonic analysis and extraction followed by altering a frame's expansion or compression profile with a non-linear function g[m]. A non-linear g[m] example is shown in eq. 18, where α denotes the expansion or compression characteristic; compression occurs when α is negative and expansion when it is positive:

$$g[m] = \alpha\, m^{2}\, \frac{f_0}{N} \qquad (18)$$

Harmonic regions are defined by upper and lower frequency bounds determined by a percentage P_u of each harmonic's magnitude. Harmonic regions that are shifted to a new region are cross-faded with the surrounding region, and the "hole" left behind is attenuated and cross-faded using P_u as reference, as seen in figure 7.

Figure 7. Harmonic shifting
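The spread-region re-sampling of section 3.3 (figures 4 and 5) can be sketched with plain linear interpolation. The function name, the use of `np.interp`, and the linear cross-fade of the lower flank are assumptions of this sketch, not the FMS code:

```python
import numpy as np

def modulate_spread(env, Kl, Ku, Kl_new, Ku_new):
    """Re-sample the spectral envelope between bins Kl..Ku onto the
    new bounds Kl_new..Ku_new, preserving its shape, then linearly
    cross-fade the lower flanking region (a simplified d[k])."""
    env = np.asarray(env, dtype=float)
    out = env.copy()
    # Re-sample the spread region onto its new width
    old = env[Kl:Ku + 1]
    new_len = Ku_new - Kl_new + 1
    resampled = np.interp(np.linspace(0, len(old) - 1, new_len),
                          np.arange(len(old)), old)
    out[Kl_new:Ku_new + 1] = resampled
    # Linear cross-fade toward the original envelope near DC:
    # d -> 1 at DC (keep original), d -> 0 at the new lower bound
    for k in range(Kl_new):
        d = 1.0 - k / max(Kl_new, 1)
        out[k] = d * env[k] + (1.0 - d) * out[k]
    return out
```

Expanding a peak this way keeps the envelope's endpoint values and peak height while widening its footprint, which is the behavior depicted in figure 4.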

### 3.5. Inharmonicity

The inharmonicity of each harmonic is modulated using eq. 19:

$$f'_{m} = m f_{0} + (f_{m} - m f_{0})\, e^{\beta} \qquad (19)$$

where f'_m is the modulated mth harmonic, f_0 is the fundamental, and β the inharmonicity coefficient. When β = 0 the degree of inharmonicity is unchanged, whereas positive β values result in increased inharmonicity and negative values in decreased inharmonicity. Inharmonicity uses the same procedure for shifting the harmonic regions as harmonic expansion/compression modulation.

### 3.6. Spectral Shimmer and Jitter

The first step in shimmer modulation requires tracking of harmonics over each STFT frame X_r, followed by modulation of each harmonic via a function g[l] (e.g. Gaussian noise) to "spread" the magnitude, as shown in eq. 20, where l is the index for g[·], the scalar λ determines its magnitude, and k_l is the harmonic bin index:

$$X'_{r}[k_{l}] = X_{r}[k_{l}]\,(1 + \lambda\, g[l]) \qquad (20)$$

Figure 8. Amplification of shimmer

As seen in figure 8, shimmer amplification is realized via "magnitude interpolation" over the region defined by the upper (k_u) and lower (k_l) frequency bounds. The bounds are dynamically determined as a function of the fundamental, as shown in eq. 21, where L is a constant defining the neighboring area of each harmonic:

$$k_{u,l} = k_{m} \pm \mathrm{round}\!\left(\frac{f_{0}}{L}\right) \qquad (21)$$

Similarly, for harmonic attenuation the algorithm finds the boundaries k_u and k_l, which is followed by shifting, interpolation, and cross-fading, as depicted in figure 9.

Figure 9. Attenuation of shimmer

Spectral jitter is modulated in a similar fashion to shimmer, as shown in figure 10, the main difference being the method of shaping the altered spectral envelope around each harmonic. As seen in figure 10, each half of the shifted envelope around the modulated harmonic is shaped via either down-sampling or up-sampling to preserve the general shape of the spectral envelope bounded by the upper and lower frequency limits.

Figure 10. Spectral jitter modulation

### 3.7. Spectral Flux

The current implementation of flux utilizes comparisons of successive STFT frames, yielding quantifiable changes in frequency content over time needed to measure spectral flux. This is done for all successive frames, and the values produced from each pair are averaged to obtain an overall spectral flux measure for the input signal. This is only one method of analyzing flux; we have also found that measuring the feature in subbands gives an alternative method of flux analysis.

The spectral flux is modulated by either increasing or decreasing the Euclidean distance, for each bin, between successive frames. An increase represents a movement of the current frame's frequency components away from the previous frame's corresponding frequency bins, and vice versa for a decrease in flux, as depicted in figure 11.

Figure 11. Spectral flux increase (left) and decrease (right)
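The per-harmonic scaling of eq. (20) can be sketched directly. The function name, the explicit list of tracked harmonic bins, and the seeded Gaussian generator are assumptions of this sketch; the real system derives the harmonic regions from eq. (21) and cross-fades them:

```python
import numpy as np

def modulate_shimmer(mag, harmonic_bins, lam, rng=None):
    """Eq. (20) sketch: scale each tracked harmonic bin by
    (1 + lam * g), with g drawn from Gaussian noise; all other
    bins are left untouched."""
    if rng is None:
        rng = np.random.default_rng(0)  # seeded for reproducibility
    out = np.asarray(mag, dtype=float).copy()
    g = rng.standard_normal(len(harmonic_bins))
    out[harmonic_bins] *= 1.0 + lam * g
    return out
```

With λ = 0 the spectrum passes through unchanged; a nonzero λ perturbs only the listed harmonic bins, which is the "modulate one dimension, leave the rest intact" behavior FMS aims for.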

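The flux measure of section 3.7 (averaged frame-to-frame change) can be sketched as a mean Euclidean distance between successive magnitude spectra; the function name and the absence of any normalization are assumptions of this sketch:

```python
import numpy as np

def spectral_flux(frames):
    """Average Euclidean distance between the magnitude spectra of
    successive STFT frames, as an overall flux measure."""
    mags = [np.abs(np.fft.rfft(f)) for f in frames]
    dists = [np.linalg.norm(b - a) for a, b in zip(mags[:-1], mags[1:])]
    return float(np.mean(dists))
```

A perfectly static signal yields zero flux, while any frame-to-frame spectral change yields a positive value, matching the increase/decrease picture of figure 11.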
Modulation of the spectral flux for each frame follows the same procedure used in spectral spread for re-shaping each STFT frame.

### 3.8. Spectral Envelope

After computing the envelope for each STFT frame, followed by interpolation to match the STFT frame size, the interpolated frames are combined to render a three-dimensional spectral representation. The three-dimensional spectral envelope is modulated by first selecting an existing point on the 3-D representation and changing its magnitude value. Once the magnitude value is altered, interpolation of the surrounding regions is computed in three dimensions.

Figure 12. 3-D modulation of the spectral envelope (altered peak over the zero-magnitude plane)

The "range," or "decay rate," of the interpolation in each dimension is individually controlled, as seen in figure 12. Modulation is applied through the modulation envelope Env_{r,mod}[k] of eq. 22, which is initially set to a plane with zero magnitude (no modulation), as seen in figure 12. The "envelope shaper" Env_{r,shaper}[k] for spectral frame r is then used to modulate the original sound object s, rendering a new sound object characterized by X'_r[k]:

$$X'_{r}[k] = X_{r}[k] \cdot \mathrm{Env}_{r,shaper}[k] \qquad (22)$$

## 4. PRELIMINARY RESULTS AND CONCLUSION

Various sounds were used in testing the algorithms, including electric basses, flutes, percussion samples, guitar riffs, and popular music samples. FMS was also used in compositions presented at electroacoustic music concerts: Zhiye Li composed a tape piece and a live piece for electric bass and computer that altered harmonic expansion and contraction characteristics. Most of the FMS algorithms performed as we had anticipated, although the degree of success depended on the musical signals and, not surprisingly, on the DFT parameters.
For example, spectral spread modulation presented an issue when input signals exhibited a large amount of low-frequency energy, allowing the spread to expand only by a limited amount: forcing a wider spread requires the main energy concentration to extend further towards the Nyquist frequency than towards DC, and this "imbalance" caused additional distortion. In one spectral centroid modulation example we imposed the centroid profile of a flute onto an electric bass guitar, with some interesting musical results: the overall timbre became brighter as expected but at the same time gained high-frequency noise, changing the resulting timbre. Also, simple changes to the amplitude envelope in synthesizing a "flattened envelope" version of an original bass drum surprisingly exposed potential musical applications, as the flattening brought to the foreground what would normally be masked, ultimately altering the identity of the sound object drastically.

By no means is this study the end; rather, it is the beginning of FMS, and we anticipate much improvement to the algorithms by tweaking the harmonic tracker, making it more robust to various types of signals, and adding a segmentation/onset analysis module. Our plan is to grow the set of features on top of the current basic yet important ones, including LPC for noise content analysis via residual analysis. Sound examples can be found at tmt.tulane.edu.

## 5. REFERENCES

[1] Cuadra, P., Master, A., Sapp, C. "Efficient Pitch Detection Techniques for Interactive Music", Proceedings of the 2001 ICMC, Havana, 2001.

[2] Hourdin, C., Charbonneau, G., Moussa, T. "A Sound-Synthesis Technique Based on Multidimensional Scaling of Spectra", Computer Music Journal, 21(2):40-55, 1997.

[3] Park, T. H. "Salient Feature Extraction of Musical Instrument Signals", M.A. Dissertation, Dartmouth College, 2000.

[4] Park, T. H. "Towards Automatic Musical Instrument Recognition", Ph.D. Dissertation, Princeton University, Music Department, 2004.
[5] Rabiner, L. R., Schafer, R. W. Digital Processing of Speech Signals, Prentice-Hall Signal Processing Series, 1978.

[6] Richardson, C. "A Toolbox for Cross-Modulation of Sound Objects Using Feature Modulation Synthesis", M.A. Dissertation, Tulane University, Music Department, 2007.

[7] Verfaille, V., Zölzer, U., Arfib, D. "Adaptive Digital Audio Effects (A-DAFx): A New Class of Sound Transformations", IEEE Transactions on Audio, Speech, and Language Processing, 2006.

[8] Wessel, D. L. "Timbre Space as a Musical Control Structure", Computer Music Journal, 3(2), 1979.