Page  00000347 Scalable Wavetable Matching for Real-Time Polyphonic Synthesis Simon Wun and Andrew Homer Mixed Media Modeling Lab, Institute for Infocomm Research cwwun@i2r.a-star.edu.sg Department of Computer Science, Hong Kong University of Science and Technology homer@cs.ust.hk Abstract Recent work on wavetable synthesis of musical instrument tones assumes that a constant number ofwavetables are used throughout the duration of a tone. A scalable approach would dynamically allocate wavetables to simultaneous voices to allow more polyphony and improve the sound quality. Iterative wavetable matching finds the basis spectra one by one over several iterations, and offers real-time control of the number of wavetables, supporting scalability. This paper describes and evaluates several iterative methods. Matching results for a range of instruments show that, on average, iterative local search can find matches with errors within 0.5% of nearoptimal non-iterative solutions. Iterative local search only rarely gets stuck on bad local optima. Output Introduction Wavetable synthesis reproduces a musical instrument tone of a definite pitch by adding several enveloped static waveforms (see Fig. 1) (Serra, Rubine, and Dannenberg 1990; Homer, Beauchamp, and Haken 1993). During synthesis, the wavetable samples are multiplied by an amplitude envelope, and the signals from different enveloped wavetables are added to produce the matched tone. Wavetable synthesis uses sums of harmonic sinusoids with constant frequencies and amplitudes as the waveforms. It stores one period of each waveform and is therefore memory-efficient. It is especially popular today in mobile phones where memory is very limited (e.g., 4K for synthesis parameters and ringtone "scores"). Memory efficiency will probably continue to be critical in mobile phones in the near future since bigger memory chips and faster CPUs require more power than current battery technologies can provide in a small package. While the cost of PC memory and microprocessors continually decline, the savings always mean increased polyphony, more complex audio effects, and therefore more creative possibilities with fixed resources. Previous work on wavetable synthesis (Serra, Rubine, and Dannenberg 1990; Homer, Beauchamp, and Haken 1993) assumes a synthesizer architecture where computing resources (e.g., oscillators and envelope generators) are statically allocated to several voices. A synthesizer of this kind produces a note using the same resources throughout its duration. It Figure 1: Multiple wavetable synthesis block diagram. has fixed polyphony (that is, the maximum number of simultaneous voices) and fixed quality of each sound. When too much polyphony is requested, the synthesizer drops notes by "note stealing"-stopping some old notes, not playing some new notes, or both. On the other hand, when fewer notes than the maximum are played, the computing resources reserved for the unused voices idle. Ideally, during real-time synthesis, if computation power runs out due to too much polyphony, a synthesizer decreases the computing resources for some voices to avoid dropping notes. If fewer than the maximum number of notes are played, it could put the excess computing resources to synthesize the active voices with improved quality. Such scalability lies in a resource allocation strategy that dynamically exploits the computing resources. Scalable synthesizers can adjust the computing resources used for producing a note midway through its duration. Specifically, scalable wavetable synthesis requires, besides the dynamic resource allocation strategy, real-time control of the number of wavetables for each note. Wavetable matching tries to find the basis spectra (that is, the spectra of the wavetables) and amplitude envelopes that can best approximate the original tone. Early approaches to wavetable matching were non-iterative: they tried to find the complete set of basis spectra all at once, taking into account their interactions (Homer, Beauchamp, and Haken 1993; Wun and Homer 2005a). Changing the number of wavetables requires rematching, which is too time-consuming if done on the fly. Prior matching and storing extra sets of parameters is not memory-efficient. Recent work has shown that iterative matching only sacrifices an insignificant degree of matching accuracy for computation- and memory-efficient realtime control of the number of wavetables (Wun and Homer 2005b). Iterative matching finds one basis spectrum at a time, 347

Page  00000348 only ensuring that it works well with the basis spectra already found, without regard for future consequences. Iterative basis spectra are cumulative in the sense that a two-wavetable match reuses the basis spectrum found by the previous onewavetable match, a three-wavetable match reuses the basis spectra found by the two-wavetable match, and so on. Thus we can match once for a maximum number of wavetables, and have on hand basis spectra that work well for numbers of wavetables up to the maximum. Wavetable matching using principal components analysis (PCA) happens to find cumulative basis spectra too. PCA is an alternative to iterative matching, though less effective because it usually poorly matches the low-amplitude parts of a tone (e.g., the attack and decay) due to its inherent statistical bias. The authors have successfully used a local search method to non-iteratively generate artificial basis spectra using numerical optimization techniques (Wun and Homer 2005a). The main idea is to start with a rough guess, and refine it through local search. A simple strategy for selecting an initial set of basis spectra as the rough guess picks the frames equally-spaced through the duration of the tone, including the end frames at the beginning and end. The improvement heuristic is based on the hillclimbing idea: we repeatedly transform the current solution into a nearby and better solution that then becomes the "next current" solution until improvement is not found with any transformation. bill bl,NFRM '. bNHAR,NFRM AW = B (1) where NHAR = number of harmonics to match NTAB = number of wavetables NFRM = number of frames in the target spectrum ak,J = amplitude of the kth harmonic of the jth basis spectrum yn = amplitude envelope value for the jth wavetable and the nth target frame bk,n = amplitude of the kth harmonic of the nth frame of the target spectrum. Normally, there are far more harmonics than wavetables (e.g., NHAR = 30 and NTAB = 3). In that case the match, AW = B* is the best least-squares approximate to the target, B. One metric to measure the difference between the original and matched tones is the relative spectral error as follows: 1 NFRM NFRM S n=1 NHAR (bk n- b* ZNHAR b2k k 1 k~n (2) Iterative Matching The non-iterative approach decides the complete set of basis spectra all at once, whereas the iterative approach finds them one-by-one. This reduces the growth in the basis spectra search space size (with the number of wavetables) from exponential to linear, making iterative methods more efficient. Below we describe a framework from which iterative methods develop and three different iterative methods which are evaluated and compared in later sections. First, we give an overview of the wavetable matching process. To begin with, a phase vocoder analysis of the original tone transforms the sound into a time-varying spectrum (Beauchamp 1993). The target spectrum is typically limited to only some of the original frames to save computation in the later matching phase. Using 10 evenly spaced frames from the attack (that is, before the peak rms amplitude is reached) and 10 from the rest of the original tone keeps most of the perceptually important spectral information. Next, assuming we have found the basis spectra, the amplitude envelopes can be determined by solving the following system of linear equation: A value of c = 0.1 represents a 10% relative spectral error. Recent work has suggested the relative spectral error is more than 90% accurate (Homer, Beauchamp, and So 2006). The iterative approach is a greedy algorithm. It finds the basis spectra one-by-one over several iterations. At each iteration, we find the best single basis spectrum to add to those already found on previous iterations. Hence, the procedure is "greedy" in the sense that it constructs, in phases, the solution from parts appeared good at each iteration. The number of iterations is equal to the required number of basis spectra. The algorithm for the iterative approach is as follows: ITERATIVE-FRAMEWORK(NTAB) 1 foundBasisSpectra +- NIL 2 while size [foundBasisSpectra] < NTAB 3 do newBS +- BLACK-Box(foundBasisSpectra) 4 LIST--INSERT(foundBasisSpectra, newBS) 5 return foundBasisSpectra foundBasisSpectra keeps track of the partial solution (in a list) during the search, and finally stores a complete set of NTAB basis spectra. The subroutine LIST-INSERT inserts newBS into foundBasisSpectra. The subroutine BLACKBox returns a new basis spectrum. The following subsections give further details. 2.1 Iterative Enumeration This simple black-box algorithm finds the new basis spectrum by selecting the best from a candidate pool which initially consists of all the original spectral snapshots (that is, a,1 a2,1 al,NTAB w1,1 ''' w2,1l W1,NFRM... aNHAR,NTAB ''' WNTAB,NFRM 348

Page  00000349 the frames in the target spectrum and the remaining original frames). Iterative enumeration tests every candidate spectral snapshot combined with the previously-found basis spectra. Also, the residual spectra are added to the candidate pool after each iteration, an idea borrowed from iterative combinatorial basis spectra selection (Ng and Homer 2002). (A residual spectrum is the difference between the original and synthesized spectrum after each iteration.) 2.2 Iterative Local Search Iterative local search randomly picks a spectral snapshot from the original spectrum, and from there starts a hillclimbing search. The previously-found basis spectra remain unchanged, with only the new spectral snapshot modified and improved. (By contrast, non-iterative local search modifies all the basis spectra.) Finally, the improved spectral snapshot is returned. 0.7 0.6 " 0.5 S0.4 S0.3 1 2 3 4 5 No. of wavetables Figure 2: Matching results for a G2 pipa tone. 2.3 Seeded Iterative Local Search Seeded iterative local search is a variant on iterative local search. It starts the search from the best original spectral snapshot in place of a randomly picked one. Therefore, an additional enumeration step is necessary at each iteration. Results Instrument Trumpet Trombone Tuba Clarinet Oboe Bassoon Chinese dizi Chinese sheng Violin Cello Chinese erhu Guitar Chinese pipa Chinese zheng Piano Glockenspiel Tenor voice Pitches A3, F4 G2, Ab4 Cl, Eb2, Gb3 Eb3, Eb4, G5 D4, D5, D6 Db2, Bb3, C5 A3, G5 A4, A5 G3, Eb5, B6 C2, Eb3, B4 D4, G#5 A2 G2 Gb2, Db4, Gb5 Al, C3, Eb4, Gb5 G5, G6 G3 3.1 Iterative Enumeration versus Iterative Local Search This subsection compares the performance of the two basic iterative methods, iterative enumeration and iterative local search. We use non-iterative local search results as a benchmark for the iterative matches. Non-iterative local search finds near-optimal solutions almost all the time in instrument matching (Wun and Homer 2005a). It gives a practical lower bound on the relative spectral error. Fig. 2 shows the matching results for a pipa tone. The pipa is a Chinese plucked string instrument similar to the guitar. Iterative local search is consistently better than iterative enumeration by about 2%. Moreover, iterative local search is nearly as good as non-iterative local search. Their error differences are less than 0.5%. With one wavetable in particular, iterative and non-iterative local search find different matches with the same error. Table 1 lists the 40 total tones we tested from 17 instruments. Fig. 3 shows the average results over all of them. Iterative enumeration is consistently worse than the other methods by about 2%. Also, adding residual spectra to the candidate pool improved the iterative enumeration results. Slightly more than 15% of the selected spectral snapshots were from the residual spectra. Iterative local search outperforms iterative enumeration overall. Furthermore, it is nearly as good as non-iterative local search (within 0.5%), which is a little surprising. With one wavetable iterative and non-iterative local search always Table 1: 40 tones tested, from 17 instruments. converge in the sense that they find matches of approximately the same error level. With two or more wavetables, iterative local search only rarely gets stuck on bad local optima, of which the worst in our test set happens to be about 5% worse than the non-iterative result. A practical solution to this problem is to repeat the search from two or three random starting points, and keep the best result. 3.2 Seeded Iterative Local Search Seeded iterative local search starts from a better initial solution compared with iterative local search (unless iterative local search luckily picks the same starting point). Is it guaranteed a better final result, then? For our test set, seeding resulted in sometimes better, sometimes worse solutions without any clear trend. On average, seeded iterative local 349

Page  00000350 0 0.15 Qa U) - 0.1 previously-found basis spectra. As more basis spectra are found, the linear search space represents a smaller portion of the exponential solution space. Despite the increasingly restricted search space, iterative local search finds nearly as good solutions as non-iterative local search, and the error differences between them are fairly constant with 2-5 wavetables. These results confirm there are many near-optimal solutions in the basis spectra solution space, as the previous non-iterative local search results suggested (Wun and Homer 2005a). The "shallowness" of the solution space implies our simple approach of iterative local search-finding locally optimal basis spectra one by one is ideal. Moreover, global optimization techniques such as the genetic algorithm seem irrelevant. A straightforward decomposition or other algebraic techniques potentially solve the problem, but we are conservative in this regard. 1 2 3 4 5 No. of wavetables Figure 3: Average matching results for all test tones. Conclusion search slightly outperformed iterative local search. Using a dependent samples t-test with a 0.05 level of significance, we verified that the average relative spectral error for seeded iterative local search was significantly smaller than that for iterative local search when there were 3-5 wavetables. Though statistically significant, the differences were very small, only in the 0.1% range. We found this level of error differences to be negligible in an informal listening test. 4 Basis Spectra Orthogonalization In general, each amplitude envelope not only depends on its wavetable basis spectrum, but the other wavetable basis spectra as well. Removing wavetables causes the remaining amplitude envelopes to change. The amplitude envelopes are not reusable, except for orthogonal basis spectra. Normally, the basis spectra are not orthogonal, which poses no problem if the number of wavetables is constant. To allow the number of wavetables to dynamically change, we can orthogonalize the basis spectra by the GramSchmidt process (Johnson, Riess, and Arnold 2002). Viewing the basis spectra as vectors, the Gram-Schmidt process iteratively orthogonalizes each basis spectrum by replacing it with the difference between it and its projection onto the previously orthogonalized basis spectra. The orthogonalization does not affect the spectral space that the basis spectra span or the errors of matches. We have described a framework for several iterative wavetable matching methods including iterative enumeration, iterative local search, and seeded iterative local search. Iterative methods find the basis spectra one-by-one over several iterations, whereas non-iterative methods decide the complete set of basis spectra all at once. The search space size of iterative methods only grows linearly with number of wavetables, while that of non-iterative methods grows exponentially. As a result, iterative methods run faster than their non-iterative counterparts. However, we have not compared the computation efficiency of the iterative and the non-iterative methods. We evaluated and compared the three iterative methods, using near-optimal non-iterative local search results as benchmarks. Iterative enumeration was consistently worse than the other methods by about 2%. To put it intuitively, the other methods roughly obtained the enumeration results using one wavetable fewer than does iterative enumeration, and 2% could differentiate an "excellent" match from a "good" one. Surprisingly both iterative local search and seeded iterative local search were nearly as good as non-iterative local search (within 0.5%, which was perceptually negligible). They only rarely got stuck on poor local optima. Iterative matching allows scalable wavetable synthesis. Iterative local search finds matches that are only slightly worse than those tailor-made by non-iterative local search using the same number of wavetables. It offers an excellent compromise between scalability and matching accuracy. Wavetable synthesis with iterative basis spectra plays extra notes while keeping the old ones by "stealing" oscillatorsstopping some active oscillators and reallocating them to the new notes. When playing monophonically, it can use all the oscillators to achieve superb sound quality. Future work can address issues concerning the dynamic oscillator allocation strategy such as what makes an effective and efficient strategy, and how it takes into account high-level musical information. Discussion As mentioned in Section 2, the iterative approach reduces the growth in the basis spectra search space from exponential to linear. The solution space, however, grows exponentially. Consequently, part of the solution space is ignored. The final iteration of a five-wavetable iterative local search effectively ignores every solution that does not contain the four 350

Page  00000351 7 Acknowledgements This work was supported in part by the Hong Kong Research Grant Council's Projects HKUST6167/03E and HKUST6135/05E. References Beauchamp, J. (1993). Unix workstation software for analysis, graphics, modification, and synthesis of musical sounds. In J Audio Eng. Soc. (Abstracts), Volume 41, pp. 387. preprint 3479. Homer, A., J. Beauchamp, and L. Haken (1993, May). Methods for multiple wavetable synthesis of musical instrument tones. J Audio Eng. Soc. 41(5), 336-356. Homer, A., J. Beauchamp, and R. So (2006, March). A search for best error metrics to predict discrimination of original and spectrally altered musical instrument sounds. J Audio Eng. Soc. 54(3), 140-156. Johnson, L. W., R. D. Riess, and J. T. Arnold (2002). Introduction to linear algebra (Fifth ed.)., pp. 219-222. Boston: AddisonWesley. Ng, A. and A. Homer (2002, December). Iterative combinatorial basis spectra in wavetable matching. J. Audio Eng. Soc. 50(12), 1054-1063. Serra, M. H., D. Rubine, and R. Dannenberg (1990, March). Analysis and synthesis of tones by spectral interpolation. J. Audio Eng. Soc. 38(3), 111-128. Wun, S. and A. Homer (2005a, April). A comparison between local search and genetic algorithm methods for wavetable matching. J. Audio Eng. Soc. 53(4), 314-325. Wun, S. and A. Homer (2005b, September). Evaluation of iterative methods for wavetable matching. J. Audio Eng. Soc. 53(9), 826-835. 351