AN ANALYSIS OF STARTUP AND DYNAMIC LATENCY IN PHASE
VOCODER-BASED TIME-STRETCHING ALGORITHMS
Eric Lee, Thorsten Karrer and Jan Borchers
Media Computing Group
RWTH Aachen University
52056 Aachen, Germany
{eric, karrer, borchers}@cs.rwth-aachen.de
ABSTRACT
The phase vocoder has become a popular method for time-stretching audio (altering its play rate without changing
its pitch) in recent years. Despite continuing improvements to the algorithm itself for enhanced audio quality, the latency introduced by the processing is less well understood. Such an understanding is crucial for accurate
synchronization in the context of a larger interactive multimedia or computer music system. Our analysis shows
that the phase vocoder has an effective startup latency of
2 (Ra - Rs), and a dynamic latency (in response to rate
changes) of 2Rs, where Ra and Rs are the input and output hop factors used for time-stretching.
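As a rough illustration of these two expressions, the following sketch evaluates them for one hypothetical pair of hop sizes. It assumes that Ra and Rs are measured in samples and that the sample rate is 44.1 kHz; the function names and the example values are chosen for illustration only and are not taken from the analysis itself.

```python
# Illustrative sketch only: evaluates the two latency expressions quoted in
# the abstract. The function names, the example hop sizes, and the 44.1 kHz
# sample rate are assumptions made for this example.


def startup_latency(ra: int, rs: int) -> int:
    """Effective startup latency 2 (Ra - Rs), in samples (assumed unit)."""
    return 2 * (ra - rs)


def dynamic_latency(rs: int) -> int:
    """Dynamic latency after a rate change, 2 Rs, in samples (assumed unit)."""
    return 2 * rs


def samples_to_ms(samples: int, sample_rate: int = 44100) -> float:
    """Convert a latency in samples to milliseconds at the given sample rate."""
    return 1000.0 * samples / sample_rate


if __name__ == "__main__":
    ra, rs = 512, 256  # hypothetical input and output hop sizes
    for name, latency in (("startup", startup_latency(ra, rs)),
                          ("dynamic", dynamic_latency(rs))):
        print(f"{name}: {latency} samples, {samples_to_ms(latency):.2f} ms")
```

Expressing both latencies in samples first and converting to milliseconds only for display keeps the values exact, which matters when they are used to compensate for synchronization offsets elsewhere in a system.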
1. INTRODUCTION
Computers and processing capacity continue to advance
at rates that exceed Moore's original prediction in 1965
[11]. Certain types of processing that were once a fantasy
are now possible to perform in real-time. One example
is using the phase vocoder for altering the play rate of an
audio stream while preserving its original pitch (a process
also known as time-stretching) - while it was originally
developed in 1966 [3], it wasn't until recently that real-time implementations became possible [6].
Any non-trivial processing of signals will introduce
some degree of latency. If this latency is small, it can
usually be ignored without any significant impact on the
system behavior, and this is often assumed in interactive media and computer music systems today. The
most obvious artifact of improperly handling latency in a
system is a loss of synchronization between, for example,
the audio and video.
Two recent trends in multimedia systems and computer technology, however, motivate a re-examination of processing latency for these systems.
Firstly, computers are increasingly being used for professional multimedia applications, replacing specialized and expensive hardware. A professional studio VTR
(video tape recorder) capable of frame-accurate synchronization, for example, can cost upwards of ten thousand
dollars. Television and film production studios are slowly
migrating to digital production - Star Wars II, Attack of
the Clones, for example, was the first major Hollywood
film to be captured digitally, rather than on film [10].
More recently, even media companies, such as Current
TV, a news broadcaster in the United States, have moved
away from tape to a completely digital and computer-based production pipeline [15]. This trend requires system
designers to migrate to what Greenebaum [4] refers to as
a "sample-accurate" mentality when dealing with latency,
rather than the current "best-effort" one.
Secondly, with the increased availability of computing
power, it is now possible to incorporate increasingly complex processing and still maintain real-time performance.
Interactive conducting systems such as our Personal Orchestra family [9], for example, employ a multitude of
processing stages to recognize gestures, stream compressed audio and video from disk, and time-stretch the audio - all
in real time. As a more specific comparison, consider the audio resampler employed in an
early version of Personal Orchestra: such a resampler requires
a few tens of multiply-add operations per output audio
sample. In contrast, PhaVoRIT, a phase vocoder-based
algorithm employed in our latest system [6], performs
the time-stretching in real-time, but requires many orders
of magnitude more processing per output audio sample.
An unfortunate side-effect of this increased complexity in
processing is increased latency.
We will divide our discussion of latency into two aspects: startup latency and dynamic latency. Startup latency is introduced when the filter is initially fed with data
- many filters require some "priming" before they can begin to produce output. A 64-point sinc kernel used for
resampling an audio signal, for example, requires the first
32 samples of input data before it can produce the first
output sample. If these samples are being streamed from
a real-time data source, this introduces a 32-sample latency at startup (see Figure 1). Dynamic latency occurs
when filter parameters (for example, the resampling factor) are changed; if the filter cannot respond immediately
to a parameter change, latency will be introduced. Resampling using a sinc kernel has, for example, zero dynamic
latency - it is theoretically possible to immediately switch
from a resampling factor of 0.5 to 2 from one output sample to the next. In contrast, a phase vocoder algorithm is
limited to rate changes at specific block intervals defined