Comparison of different audio restoration methods based on frequency and time domains with applications on electronic music repertoire

Sergio Canazza, Giovanni De Poli, Gian Antonio Mian, Alessandro Scarpa
Dept. of Electronics and Informatics, University of Padova - V. Gradenigo 6/a - 35100 Pd - Italy
{canazza, depoli, mian, skarpa}@dei.unipd.it

Abstract

In recent years the evolution of digital technologies for the coding of audio signals has led to new perspectives in the preservation of, and access to, audio archives, even though new models for digital audio archives have not yet been developed. The most important international archives today make use of digital databases and of an increasing number of digital carriers (DAT, CD-A, DVD-A). At the same time, in the field of musical philology, restored editions of electroacoustic works have never been released. Such editions would raise scientific problems, never addressed so far, about methodologies and algorithms. We aim at applying restoration algorithms to recordings of electronic music; this repertoire has some peculiar problems: it requires a philological analysis of the piece, as many of the synthesized sounds have acoustic properties in common with noise, both impulsive and white.

I. INTRODUCTION

This work presents the results of an experiment aimed at evaluating the quality of different audio restoration algorithms, based on both time-domain and frequency-domain methods. A look at the most interesting ideas developed from the eighties onwards reveals that the methodologies and techniques adopted up to now for the compensation and removal of signal deteriorations have been conceived, above all, for sound sources of an acoustic nature: voices, musical instruments, live or recorded artistic performances, etc. Contributions in the field of the preservation and restoration of tapes of electronic music are, instead, very rare.
A substantial heritage of compositions for magnetic tape still lies almost forgotten and is waiting to be saved and transferred onto more permanent and long-lasting carriers. These tapes, which were mostly recorded in the first twenty years after World War II and which are held in the archives of many musical institutes and radio broadcasters, need to be safeguarded with specific preservation measures, without which the works they contain (which exist as recordings only) would be irreparably lost. Their documentary value lies not only in the intangible, purely artistic content materialized by the recording instruments of the time: it stands out from that of other musical recordings because of the specific knowledge of electronic-music technology that is an integral part of the composition process fixed on the magnetic tape. These are fundamental aspects to consider when making decisions about the document, its preservation and the process of transferring the information it contains. The peculiarity of the audio material present in electronic compositions on tape, the technology with which they were generated, and the cultural modifications produced by a new medium raise additional problems which cannot be reduced to the normal practice of audio restoration.

A set of electronic-music recordings was selected from the archive of the Center of Computational Sonology (CSC) of Padova University and from the works of Bruno Maderna, preserved in the "B. Maderna" archive of Bologna University and in the archive of the "Studio di Fonologia musicale" (RAI, Milano). To single out the best computational methodologies for audio restoration, an experiment was made, implementing a software tool (in DirectX plug-in form) which uses the methods presented in Section II.

In commercial audio-restoration software, adaptations of the spectrally-based techniques used in speech processing have been applied.
They are based upon spectral weighting, in which individual spectral components are weighted according to the expected noise and signal components. Such techniques can be viewed as finite block-size approximations to frequency-domain Wiener filtering. As a result of these approximations (necessary to follow the time-varying nature of the useful signal) undesirable distortions can occur, the most notable being known as "musical noise", in which statistical fluctuations in the frequency components of the noise lead to random tonal artifacts in the processed signal. Various techniques have been applied to mask or eliminate these distortions. In our implementation, different versions of Wiener filtering and the Ephraim and Malah suppression rule were introduced. Moreover, we developed some algorithms based on psychoacoustic models. This task requires transforming the audio signal from an "outer" to an "inner" representation, that is, a representation that takes into account how the human ear perceives sound. The combination of the psychoacoustic model and frequency-domain algorithms defines a promising restoration methodology.

Time-domain methods are also emerging in commercial software because of the better quality they produce. Their main advantage is the absence of musical noise. On the other hand, these methods are computationally much more intensive than frequency-domain ones.

The software developed (presented in Section III), which implements these algorithms, makes it possible to compare the different methodologies in an objective way, since the same software environment is used for every restoration. In Section II, a short overview of the most used audio restoration methods is presented. Section IV is devoted to a time-frequency analysis of the restored stimuli, in order to characterize the different algorithms used.

II. RESTORATION METHODS

A. Frequency-domain methods

In commercial audio-restoration software, adaptations of the spectrally-based techniques applied in speech processing are used. They are based upon spectral weighting [3], in which individual spectral components are weighted according to the expected noise and signal components. Such techniques can be viewed as finite block-size approximations to frequency-domain Wiener filtering. As a result of these approximations (necessary to follow the time-varying nature of the useful signal) undesirable distortions can occur, the most notable being known as "musical noise", in which statistical fluctuations in the frequency components of the noise lead to random tonal artifacts in the processed signal.
Various techniques have been applied to mask or eliminate these distortions. Among the techniques explored, the algorithm that gave the best results was the Ephraim and Malah Suppression Rule (EMSR) [5]-[6]. With this method, the "musical noise" artifact is eliminated without introducing distortion into the recorded signal, even if the noise is only poorly stationary, and without resorting to a crude overestimation of the average noise spectrum. The attenuation to be applied to the Short Time Fourier Transform coefficients can be expressed as a time- and frequency-dependent spectral gain $G(p, f)$, which depends on two parameters, $R_{post}$ and $R_{prio}$, evaluated at each frame $p$. $R_{post}$ (the "a posteriori" Signal-to-Noise Ratio) is a local estimate of the Signal-to-Noise Ratio (SNR) computed from the data in the current short-time frame. $R_{prio}$ (the "a priori" Signal-to-Noise Ratio) represents the information on the unknown spectrum magnitude gathered from previous frames.

Special consideration was given to the perceptually relevant characteristics of the signal. This requires transforming the audio signal from an "outer" to an "inner" representation, that is, a representation that takes into account how the human ear perceives sound. For this purpose, following the Beerends and Stemerdink model [2], an outer-to-inner ear transformation is applied to the short-time power spectral density of the audio signal.

1) Noise filtering

To cope with the non-stationarity of the audio signal, the Wiener filter is time-varying and is based upon short-time Fourier processing (see Fig. 1).

[Figure 1: block diagram with the chain S/P, STFT, G, inverse STFT, P/S]

Fig. 1: The model used for restoration: $x(n)$ is the "noise free" signal, $d(n)$ the noise, $y(n)$ the degraded signal, and $\hat{x}(n)$ the restored signal; S/P and P/S represent series-to-parallel and parallel-to-series converters; $G = \mathrm{diag}\{G(p, f)\}$.
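The processing chain of Fig. 1 can be illustrated with a minimal sketch. It is not the code of the CSC tool: the function names, the periodic Hann window with 50% overlap, and the Wiener-type rule $R / (1 + R)$ with a floored local SNR estimate are all assumptions chosen for brevity, and `noise_psd` is assumed to be a noise power spectrum measured with the same window.

```python
import numpy as np

def stft_attenuate(y, gain_fn, n_win=1024):
    """Window, transform, attenuate and overlap-add, following the chain of Fig. 1."""
    hop = n_win // 2
    # Periodic Hann window: copies shifted by n_win / 2 sum to one, so a
    # unit gain returns the input wherever two frames overlap.
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n_win) / n_win)
    restored = np.zeros(len(y))
    for start in range(0, len(y) - n_win + 1, hop):
        frame = window * y[start:start + n_win]          # S/P and windowing
        spectrum = np.fft.rfft(frame)                    # short-time spectrum of frame p
        gain = np.clip(gain_fn(spectrum), 0.0, 1.0)      # attenuation G(p, f)
        restored[start:start + n_win] += np.fft.irfft(gain * spectrum, n_win)  # inverse STFT and P/S
    return restored

def wiener_type_gain(noise_psd):
    """Return a gain rule R / (1 + R), with R a per-frame SNR estimate (assumed form)."""
    def gain_fn(spectrum):
        power = np.abs(spectrum) ** 2
        snr = np.maximum(power / noise_psd - 1.0, 0.0)   # local SNR, floored at zero
        return snr / (1.0 + snr)
    return gain_fn
```

For unit-variance white noise and the 1024-point window above, the expected noise power per bin is the window energy, 384, so `stft_attenuate(y, wiener_type_gain(384.0))` attenuates a noise-only input. Because each frame is treated independently, this simple rule is exactly the kind that produces "musical noise".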
Let $Y(p, f_k)$ denote the Short Time Fourier Transform (STFT) of $y(n)$, where $p$ is the analysis time index and $f_k = k F_s / N$ the frequency, with $F_s$ the sampling frequency, $N$ the window length and $k = 0, 1, \ldots, N - 1$. The method basically consists in applying a time-varying attenuation $G(p, f_k)$, with $0 \le G \le 1$, to the short-time spectrum of the noisy signal:

\[ \hat{X}(p, f_k) = G(p, f_k)\, Y(p, f_k) \]

to obtain the restored signal $\hat{x}(n)$. According to Ephraim and Malah [6], the gain $G(p, f)$, $f = f_k$, is calculated for each frame as:

\[ G(p, f) = \frac{\sqrt{\pi}}{2} \sqrt{\frac{1}{1 + R_{post}(f)} \cdot \frac{R_{prio}(f)}{1 + R_{prio}(f)}} \; M\!\left[ \left(1 + R_{post}(f)\right) \frac{R_{prio}(f)}{1 + R_{prio}(f)} \right] \]

where $M[\cdot]$ is the hypergeometric function. The values of $R_{post}(f)$ and $R_{prio}(f)$ correspond to the "a posteriori" and "a priori" estimates of the SNR at frequency $f$. The first is a local estimate of the SNR, that is, an evaluation made on the basis of the current frame only. The second estimates the SNR via a convex combination of the local SNR estimate, multiplied by $(1 - \alpha)$, with $0 < \alpha < 1$, and the estimate gathered from previous frames, multiplied by $\alpha$. The factor $\alpha$ has an important meaning: if $\alpha \approx 1$ the past frames contribute more, while if $\alpha \approx 0$ the current frame contributes more. When the signal is very noisy, $\alpha$ should be close to 1, because a strong attenuation is needed; on the contrary, for high SNR values, $\alpha$ should be close to 0. In [5], Cappé suggests the value $\alpha = 0.98$. A variant of the method, denoted in the sequel as W2, takes this fact into account: if $R_{post}(p, f) > 0$, $\alpha$ is set to 0.98; otherwise, if $R_{post}(p, f) < 0$ and $R_{post}(p - 1, f) < 0$, then $\alpha = 0$, while if $R_{post}(p, f) < 0$ and $R_{post}(p - 1, f) > 0$, then $G(p, f) = G(p - 1, f)$. This provision slightly improves the behavior of the algorithm in the attacks, i.e., in transitions from noise to signal.

2) Psychoacoustic model

To filter the noise in a perceptually meaningful way, it is necessary to transform the audio signal from an "outer" to an "inner" representation, i.e., into a representation that takes into account how sound waves are perceived by the auditory system. The model used is the Beerends and Stemerdink model [2], sketched in Fig. 2. The signal $x(n)$ is first windowed by the window $w(n)$ and transformed into the frequency domain. The short-time spectral power is then transformed from the Hertz ($f$) to the Bark ($z$) scale, bandlimited and spread both in time and frequency.

[Figure 2: block diagram including the time spreading and frequency spreading stages]

Fig. 2: The audio signal transformation from "outer" to "inner" representation.

As a result, the outer frequency-domain representation $Y(p, f) = X(p, f) + D(p, f)$, with $X$ and $D$ the signal and noise spectrum estimates, is transformed into the internal representation $Y(p, z) = X(p, z) + D(p, z)$, defined in the Bark domain, bandlimited and processed taking into account the spreading both in time and frequency. Finally, the $R_{prio}(p, z)$ and $R_{post}(p, z)$ terms are calculated according to the inner representation and the gain $G(p, z)$ is derived.

B. Time-domain methods

In a substantially different framework, it is possible to work entirely in the time domain. Here the Niedzwiecki and Cisowski procedure is considered. Details can be found in [8], and improvements in [1]. The signal is modeled as a time-varying autoregressive (AR) process of order $p$:

\[ s(t + 1) = \sum_{i=1}^{p} a_i(t)\, s(t - i + 1) + e(t) \]

driven by the Gaussian zero-mean white noise sequence $e(t)$ with variance $\sigma_e^2$. The time evolution of the time-varying coefficients $a_i(t)$ is modelled by the random walk model:

\[ a_i(t + 1) = a_i(t) + w_i(t), \quad i = 1, \ldots, p \]

with $w_i(t)$ zero-mean Gaussian white processes of variance $\sigma_w^2$, mutually uncorrelated, i.e., $E[w_i(t)\, w_j(t)] = 0$ for $i \ne j$, and independent of $e(t)$. It is important to note that this model can be extended with impulsive noise detection and reduction. However, in this paper we focus only on broadband noise. In [8] a procedure is developed that performs detection, tracking and restoration of the signal at the same time, formulated as a non-linear filtering problem and solved with Extended Kalman Filter (EKF) theory. Moreover, [1] describes some improvements, such as forward/backward filtering and variable memory tracking (VFF algorithm).

III. RESTORATION TOOL

To single out the best computational methodologies for audio restoration, an experiment was made, implementing a software tool (in DirectX plug-in form) which uses the methods presented in Section II (see Fig. 3).
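The role of $\alpha$ in the "a priori" SNR estimate of Section II.A, and the switching performed by the W2 variant, can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, the previous-frame term is taken as the restored power of frame $p - 1$ divided by the noise power (the form used in [5]), and the local estimate is floored at zero. The source does not say what W2 does when $R_{post}$ is exactly zero; here such bins keep the default $\alpha$.

```python
import numpy as np

def a_priori_snr(r_post, prev_restored_power, noise_psd, alpha=0.98):
    """Convex combination of the local SNR and the SNR carried over from frame p - 1."""
    local = np.maximum(r_post, 0.0)              # local estimate, floored at zero
    carried = prev_restored_power / noise_psd    # restored power of frame p - 1 over noise power
    return (1.0 - alpha) * local + alpha * carried

def w2_rule(r_post, r_post_prev, alpha_default=0.98):
    """Per-bin alpha and a mask of bins whose gain is copied from frame p - 1."""
    alpha = np.full(np.shape(r_post), alpha_default)
    alpha[(r_post < 0) & (r_post_prev < 0)] = 0.0    # below the noise level in both frames
    hold_gain = (r_post < 0) & (r_post_prev > 0)     # here G(p, f) = G(p - 1, f)
    return alpha, hold_gain
```

Both functions expect NumPy arrays with one entry per frequency bin. With $\alpha$ close to 1 the estimate changes slowly from frame to frame, which is what suppresses the isolated spectral peaks heard as musical noise; setting $\alpha = 0$ makes the estimate follow the current frame only.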
[Figure 3a: screenshot of the plug-in panel; legible controls include the filter selection (perceptual EMSR), window type (Hanning) and size (4096), window overlap, noise print overestimation per Bark band (bands 1 to 25) and the EMSR alpha value]

Fig. 3a: The user interface of CSC Restoration Tool. The restoration, the setting of the α value and the noise print modification are performed in real time.
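A per-band noise print overestimation on the Bark scale, of the kind exposed by the panel in Fig. 3a, might look like the sketch below. Everything in it is an assumption made for illustration: the Zwicker-Terhardt approximation is used for the Hertz-to-Bark mapping (the Beerends and Stemerdink model [2] may use a different one), the function names are hypothetical, and the gains are interpreted as power ratios in dB applied to a one-sided noise power spectrum.

```python
import numpy as np

def hz_to_bark(freq_hz):
    """Zwicker-Terhardt approximation of the Bark scale (an assumed mapping)."""
    f = np.asarray(freq_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def overestimate_noise_print(noise_psd, band_gains_db, sample_rate=44100.0):
    """Scale each bin of a one-sided noise power spectrum by the dB gain of its Bark band."""
    noise_psd = np.asarray(noise_psd, dtype=float)
    gains_db = np.asarray(band_gains_db, dtype=float)
    # Centre frequency of every bin, from 0 Hz up to the Nyquist frequency
    freqs = np.linspace(0.0, sample_rate / 2.0, len(noise_psd))
    # Index of the Bark band containing each bin, limited to the last available band
    band = np.minimum(np.floor(hz_to_bark(freqs)).astype(int), len(gains_db) - 1)
    return noise_psd * 10.0 ** (gains_db[band] / 10.0)   # dB to power ratio
```

At a sampling frequency of 44.1 kHz this mapping places the Nyquist frequency just below 25 Bark, which matches the 25 bands visible in the panel. Raising the gains of the upper bands corresponds to overestimating the noise print in the high-frequency region.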

[Figure 3b: screenshot of the CSC Extended Kalman Filter panel; legible controls include the AR model order, smoothing lag, stability check, forgetting constants, click detection threshold, noise variance and tracking decimation factor]

Fig. 3b: The Extended Kalman Filter tool.

Usually, different algorithms are implemented in different software packages, with different user interfaces, and their parameters are used in different (and undocumented) ways. Moreover, in commercial products the software house often does not indicate in detail which algorithm is implemented. Hence a comparison carried out with commercial products tests the software quality (i.e., the implementation quality), not the effectiveness of the algorithm. On the contrary, our software makes it possible to compare different algorithms in an objective way, since the same software environment is used to perform every restoration.

The filters implemented are: Wiener (standard), Ephraim and Malah Suppression Rule (EMSR), a number of EMSR variations (like W2, described in Section II), and algorithms based on the psychoacoustic model, which use different noise suppression rules (Wiener, EMSR, W2). In restoration based on EMSR (and its variations), the user can set the α value according to the particular signal considered. Moreover, the tool allows the estimated noise print to be modified (in real time) by means of an 'equalizer' in Bark scale.

For the Extended Kalman Filter, the entire procedure of [8] is implemented, with the possibility of performing dehiss and declick in a single pass. The threshold for click detection and the noise variance can be modified in real time. Furthermore, other improvements were added (as in [1]), especially forward/backward filtering and adaptive tracking of the music signal (VFF algorithm and decimation of tracking).

IV. VALIDATION

To validate the model, several recordings (with different SNR) were restored using the CSC Restoration Tool described above. As an example, let us consider the restoration of an excerpt taken from "Komposition für Oboe, Kammerensemble und Tonband" (1962) by B. Maderna (from the archive of Bologna University). Fig. 4 shows the spectrogram of the original recording (sampled at 44.1 kHz, 16 bit). Figs. 5, 6, 7 and 8 show the spectrograms of the same excerpt restored with, respectively, the Wiener filter (Fig. 5), the W2 algorithm (Fig. 6), the psychoacoustic model based on the Wiener filter (Fig. 7) and the EMSR algorithm (Fig. 8). The parameters used to control the different algorithms were subjectively tuned to obtain the best tradeoff between noise removal and music-signal preservation.

Fig. 4: Spectrogram of "Komposition für Oboe, Kammerensemble und Tonband" (1962) by B. Maderna, from 7'35" to 7'38".

Fig. 5: Spectrogram of "Komposition für Oboe, Kammerensemble und Tonband" (1962) by B. Maderna, from 7'35" to 7'38". Restored with Wiener filter.

Fig. 6: Spectrogram of "Komposition für Oboe, Kammerensemble und Tonband" (1962) by B. Maderna, from 7'35" to 7'38". Restored with EMSR algorithm.

Fig. 7: Spectrogram of "Komposition für Oboe, Kammerensemble und Tonband" (1962) by B. Maderna, from 7'35" to 7'38". Restored with psychoacoustic model (with Beerends and Stemerdink model).

Fig. 8: Spectrogram of "Komposition für Oboe, Kammerensemble und Tonband" (1962) by B. Maderna, from 7'35" to 7'38". Restored with psychoacoustic model (with Beerends and Stemerdink model, based on EMSR).

The comparison among these frequency-based methods outlines the following situation. The W2 algorithm proves to be the best one: with an appropriate setup of the α parameter, window size and window overlap, it achieves a strong noise reduction without causing audible distortion. The EMSR presents a slight 'musical noise' (annoying only with very noisy signals), but the α parameter must be set to obtain the best tradeoff between noise reduction and preservation of transients. The standard implementation of the Wiener filter has only educational value, given its analytical simplicity. The perceptual filters have a common characteristic: where SNR ≤ 0 the noise is correctly reduced; on the contrary, where SNR > 0, there is too much residual noise (the filters erroneously consider it masked). In particular, the perceptual filter based on EMSR shows a low-pass effect. This effect is probably due to the combined influence of the spreading (both in time and frequency) and of the "a priori" estimates of the SNR ($R_{prio}$). In the presence of a large amount of noise, 'musical noise' appears even with the W2 filter. In this case, an appropriate setup of the noise print overestimation (in particular, increasing the overestimation values in the high-frequency region) can reduce this artifact. The differences among the time-domain algorithms are very subtle and cannot be perceived graphically.
Therefore, it is necessary to listen to the examples on-line. By means of perceptual tests, we deduced that the best quality is obtained using the Extended Kalman Filter with VFF tracking and scalar combination of forward/backward passes for high SNR values (say 35 dB). For lower SNR values, W2 is more appropriate.

V. CONCLUSIONS

Audio materials are recorded on various carriers, on which the information degrades rapidly. For the preservation, restoration and handling of a huge and often badly preserved audio heritage, we need to develop methodologies that make it possible to classify the degradation of audio material, to define a restoration protocol on the basis of the kind of degradation, and to design methodologies for the preservation and handling of audio archives. We aim at applying restoration algorithms to recordings of electronic music; this repertoire has some peculiar problems: it requires a philological analysis of the piece, as many of the synthesized sounds have acoustic properties in common with noise, both impulsive and white.

In this paper, we have presented the results of an experiment aimed at evaluating the quality of different audio restoration algorithms, based on different Short Time Spectral Attenuation methods and on the Extended Kalman Filter. For this purpose, a software tool in DirectX plug-in form was implemented. The software, which implements several algorithms, made it possible to compare the different methodologies in an objective way.

A global analysis can be carried out. The perceptual filters do not seem suitable for noisy signals with low SNR, because they leave an annoying residual noise. The W2 filter proves to be the best one in the frequency domain: with an appropriate setup of its parameters, it achieves a strong noise reduction without causing audible distortion.
Generally, the best choice is EKF because of the absence of musical noise and the inaudible low-pass effect; on the other hand, this filter requires lengthy training to fine-tune its parameters.

VI. REFERENCES

[1] Scarpa A. (2001). "Un ambiente software per il restauro di documenti audio". Tesi di laurea, Dept. of Electronics and Informatics, Università di Padova.
[2] Beerends J. G. and Stemerdink J. A. (1992). "A Perceptual Audio Quality Measure Based on a Psychoacoustic Sound Representation". J. Audio Eng. Soc., 40(12), pp. 963-978.
[3] Boll S. F. and Oppenheim A. V. (1979). "Suppression of acoustic noise in speech using spectral subtraction". IEEE Trans. Acoustics, Speech and Signal Processing, ASSP-27(2), April 1979.
[4] Canazza S., De Poli G., Maesano S., Mian G. A. (1999). "On the performance of a noise reduction technique based on a psychoacoustic model for the restoration of old audio recordings". Proc. of Diderot Forum, Vienna, 2-4 December 1999, pp. 29-35.
[5] Cappé O. (1994). "Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor". IEEE Trans. Speech and Audio Processing, 2(2), pp. 345-349.
[6] Ephraim Y. and Malah D. (1984). "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator". IEEE Trans. Acoustics, Speech and Signal Processing, 32(6), pp. 1109-1121.
[7] Godsill S., Rayner P., Cappé O. (1998). "Digital audio restoration". In Applications of digital signal processing to audio and acoustics, Kahrs M. and Brandenburg K. (eds.). Kluwer Academic Publishers.
[8] Niedzwiecki M. and Cisowski K. (1996). "Adaptive scheme for elimination of broadband noise and impulsive disturbances from AR and ARMA signals". IEEE Trans. Signal Processing, 44(3), pp. 528-537.
[9] Paliwal K. K. and Basu A. (1987). "A speech enhancement method based on Kalman filtering". Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 177-180.
[10] Schueller D. (1991). "The ethics of preservation, restoration and reissues of historical sound". J. Audio Eng. Soc., 39(12), pp. 1014-1016.