AN ANALYSIS OF STARTUP AND DYNAMIC LATENCY IN PHASE VOCODER-BASED TIME-STRETCHING ALGORITHMS

Lee, Eric; Karrer, Thorsten; Borchers, Jan

Media Computing Group, RWTH Aachen University, 52056 Aachen, Germany
{eric, karrer, borchers}@cs.rwth-aachen.de

ABSTRACT

In recent years, the phase vocoder has become a popular method for time-stretching audio (altering its play rate without changing its pitch). Despite continuing improvements to the algorithm itself for enhanced audio quality, the latency introduced by the processing is less well understood. Such an understanding is crucial for accurate synchronization in the context of a larger interactive multimedia or computer music system. Our analysis shows that the phase vocoder has an effective startup latency of 2 (Ra - Rs), and a dynamic latency (in response to rate changes) of 2Rs, where Ra and Rs are the input and output hop factors used for time-stretching.

1. INTRODUCTION

Computers and processing capacity continue to advance at rates that exceed Moore's original prediction in 1965 [11]. Certain types of processing that were once a fantasy are now possible to perform in real-time. One example is using the phase vocoder to alter the play rate of an audio stream while preserving its original pitch (a process also known as time-stretching). Although the phase vocoder was originally developed in 1966 [3], real-time implementations became possible only recently [6].

Any non-trivial processing of signals will introduce some degree of latency. If this latency is small, it can usually be ignored without any significant impact on the system behavior, and many interactive media and computer music systems today assume exactly that. The most obvious artifact of improperly handling latency in a system is a loss of synchronization between, for example, the audio and video. Two recent trends in multimedia systems and computer technology, however, motivate a reexamination of processing latency for these systems.
Firstly, computers are increasingly being used for professional multimedia applications, replacing specialized and expensive hardware. A professional studio VTR (video tape recorder) capable of frame-accurate synchronization, for example, can cost upwards of ten thousand dollars. Television and film production studios are slowly migrating to digital production; Star Wars II, Attack of the Clones, for example, was the first major Hollywood film to be captured digitally, rather than on film [10]. More recently, even media companies such as Current TV, a news broadcaster in the United States, have moved away from tape to a completely digital and computer-based production pipeline [15]. This trend requires system designers to migrate to what Greenebaum [4] refers to as a "sample-accurate" mentality when dealing with latency, rather than the current "best-effort" one.

Secondly, with the increased availability of computing power, it is now possible to incorporate increasingly complex processing and still maintain real-time performance. Interactive conducting systems such as our Personal Orchestra family [9], for example, employ a multitude of processing stages to recognize gestures, stream compressed audio and video from disk, and time-stretch the audio, all in real-time. To make the comparison concrete, consider the audio resampler employed in an early version of Personal Orchestra, which requires a few tens of multiply-add operations per output audio sample. In contrast, PhaVoRIT, the phase vocoder-based algorithm employed in our latest system [6], also performs time-stretching in real-time but requires many orders of magnitude more processing per output audio sample.

An unfortunate side-effect of this increased complexity in processing is increased latency. We will divide our discussion of latency into two aspects: startup latency and dynamic latency.
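The per-sample cost of the resampler mentioned above can be illustrated with a minimal sketch of a windowed-sinc interpolator. This is not the Personal Orchestra implementation; the kernel length of 64 taps, the Hann window, and the names `hann_sinc` and `resample_point` are assumptions made for illustration. The sketch also omits the cutoff scaling that a production resampler would apply to avoid aliasing when the play rate is increased.

```python
import math

TAPS = 64          # assumed kernel length (illustrative choice)
HALF = TAPS // 2   # half of the kernel lies on each side of the read position


def hann_sinc(x: float) -> float:
    """Hann-windowed sinc evaluated at offset x, measured in input samples."""
    if abs(x) >= HALF:
        return 0.0
    window = 0.5 * (1.0 + math.cos(math.pi * x / HALF))
    sinc = 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)
    return window * sinc


def resample_point(signal, position):
    """Interpolate `signal` at a fractional `position`.

    Returns (value, multiply_adds) so the per-output-sample cost is visible.
    """
    base = math.floor(position)
    value = 0.0
    multiply_adds = 0
    # TAPS input samples contribute: HALF - 1 behind `base`, `base` itself,
    # and HALF ahead of it.
    for k in range(base - HALF + 1, base + HALF + 1):
        sample = signal[k] if 0 <= k < len(signal) else 0.0  # zero-pad edges
        value += sample * hann_sinc(position - k)
        multiply_adds += 1
    return value, multiply_adds


if __name__ == "__main__":
    tone = [math.sin(2 * math.pi * 0.01 * n) for n in range(256)]
    value, cost = resample_point(tone, 100.5)
    print(cost)  # 64 multiply-adds for this one output sample
```

Each output sample costs exactly `TAPS` multiply-adds regardless of the read position, which is the "few tens" order of magnitude quoted above. Note also that the loop reads `HALF` input samples ahead of the read position, so those samples must already be available before an output sample can be computed.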
Startup latency is introduced when the filter is initially fed with data; many filters require some "priming" before they can begin to produce output. A 64-point sinc kernel used for resampling an audio signal, for example, requires the first 32 samples of input data before it can produce the first output sample. If these samples are being streamed from a real-time data source, this introduces a 32-sample latency at startup (see Figure 1).

Dynamic latency occurs when filter parameters (for example, the resampling factor) are changed; if the filter cannot respond immediately to a parameter change, latency will be introduced. Resampling using a sinc kernel has, for example, zero dynamic latency: it is theoretically possible to switch immediately from a resampling factor of 0.5 to 2 from one output sample to the next. In contrast, a phase vocoder algorithm is limited to rate changes at specific block intervals defined