HIGH-PERFORMANCE AUDIO COMPUTING - A POSITION PAPER

Richard Dobson, Composer's Desktop Project, Frome, Somerset
John ffitch, Codemist Ltd, Combe Down, Bath
Russell Bradford, University of Bath, Computer Science

ABSTRACT

Computing power for desktops is set to increase significantly in the next few years, by moving towards concurrent processing based on machines with an increasing number of CPU cores. Despite an approaching ceiling on clock speeds, computation speeds measured in Teraflops are already being envisaged. We propose that this has great significance for audio processing, in that processes previously discounted as computationally prohibitive may now be realised in real time. There is a need to investigate these processes, to identify those which are best suited to parallel computational models, and which may offer useful new musical tools. We define such processes as examples of High-Performance Audio Computing, or HiPAC, in a direct analogy to current HPC activity. We present the primary aspects of HiPAC as we have formulated it, together with an example in the form of the Sliding Phase Vocoder.

1. INTRODUCTION

"This is compute power far beyond what even the most starry-eyed fortune-teller could have imagined! It will change the very nature of what audio is, and what audio engineers do, since it changes what is possible at a fundamental level." - James Moorer [14]

We propose HiPAC - High-Performance Audio Computing - as a new domain of study that explores the potential for new advanced processor architectures to transform the current landscape of audio synthesis, processing and music composition. Taking its name from the well-established domain of general High-Performance Computing (HPC), it addresses the emergence of new and forthcoming generations of highly parallel floating-point processors, and of large-scale multi-core platforms offering Teraflop-scale computation power. This offers the possibility of running, in real time, processes previously disregarded as too computationally expensive. A key aspect of HiPAC from the point of view of the computer musician is that within a relatively short timescale, we can expect to see these technologies in consumer-grade hardware.

The long-established Moore's law principle (that the number of transistors in a chip doubles every 18 months) has persisted as an indirect measure of processing power, so long as the emphasis has remained on increasing the processing power of a single CPU core. However, obtaining further speed from a single core (demanding, not least, increased clock speeds) incurs a significant increase in power requirements, expressed not only in terms of wattage but also in terms of heat generation, which in turn demands further power for its dissipation using air or other cooling. This high power consumption is not only directly costly, it also implies an increasingly unacceptable "carbon cost". The universally recognised solution is to increase the number of processing cores, so that more computations can be made concurrently.

This move towards parallel computation has profound implications for both users and programmers. In this respect, music and audio processing raises a number of interesting challenges and opportunities. On the one hand, a digital audio stream is by definition serial, naturally suggesting a serial (if fast) mode of computation; on the other, practical audio processes may involve a large number of identical and often independent tasks which can in principle run concurrently.
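To make this contrast concrete, the following minimal C sketch (our own illustration; the function names are hypothetical) sets a one-pole recursive filter, whose iterations are chained through the previous output, against a gain stage whose iterations are fully independent:

    #include <stddef.h>

    /* One-pole recursive filter: y[n] = x[n] + a*y[n-1]. Each output
       depends on the previous one, so the loop is inherently serial. */
    void onepole(const float *x, float *y, size_t n, float a)
    {
        float prev = 0.0f;
        for (size_t i = 0; i < n; i++) {
            prev = x[i] + a * prev;
            y[i] = prev;
        }
    }

    /* Gain stage: every iteration is independent of the others, so
       the work can be split freely across any number of processing
       elements. */
    void apply_gain(const float *x, float *y, size_t n, float g)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = g * x[i];
    }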
For many musicians, the availability of higher computing power has typically been measured in terms of the number of demanding processes (such as reverberation, or synthesiser voices) that can be performed in real time. While this remains an important measure, we are more interested in the possibility of exploring in real time processes that have hitherto been disregarded as too computationally demanding, but which the next generation of hardware will bring into the realm of real-time execution.

2. THE TECHNOLOGY - FROM SUPERCOMPUTERS TO THE DESKTOP

The term High-Performance Computing (HPC) supersedes the older "supercomputing", associated with machines designed in the form of a large collection of parallel processor cores. Classic early examples include the original Cray 1, built in 1976 for the Los Alamos National Laboratory [6] and specified to reach the then astonishing speed of 160 Mflops, with an almost equally astonishing main memory of 8 Mbytes. The machine solved a central issue for any cluster, that of inter-node communication, by ensuring that no connection was longer than 4 feet. The resulting horseshoe shape became, for the general public, almost the defining iconic image of a "supercomputer". Later systems such as the MasPar series and the Connection Machine featured a large number of small Processing Elements (PEs) arranged in a mesh or even hypercube structure, each PE connected to between 4 and 8 neighbours.

These machines have been termed, somewhat quaintly, "mini-supercomputers". A key aspect is their use of a Single-Instruction-Multiple-Data (SIMD) computational model. This is unsuited to computations distributed over physically disparate nodes; instead, computations on each PE are performed synchronously under a common clock. Thus PEs need to be implemented within one chip, or worst-case on a single motherboard or equivalent bus structure. Modern multi-DSP-chip audio systems such as Digidesign Pro Tools TDM reflect a similar approach, by distributing audio processing between a large number of mesh-connected DSP chips.

Bringing this highly abbreviated history up to date: contemporary expressions of the cluster approach include the widely popular Beowulf systems based on the use of networked commodity PCs (typically running Linux), and the Virginia Tech "System X" supercomputer [18] based on 1100 Apple OS X servers, currently rated at a sustained 12.25 Teraflops. Such systems can be described as using a coarse-grained parallel processing model, in which each processing node is a relatively autonomous processor, synchronised and co-ordinated by a master processing node.

In contrast, modern SIMD fine-grained parallel architectures are associated today firstly with the vector extensions to standard CPUs (e.g. AltiVec on the PPC, SSE on x86), whereby several arithmetic units are employed in parallel, and secondly with the graphics accelerator cards now essential to all consumer workstations (especially where high performance in games is required). The difference, clearly, is that these SIMD systems are typically monolithic, implemented in one chip. An example of this approach is the IBM Cell Broadband Engine processor [10] employed in the Sony PlayStation 3. This chip couples a conventional PowerPC master CPU with 8 parallel floating-point PEs.

In the attempt to maintain a generational performance increase, chip manufacturers are already supplying devices and motherboards supporting multiple CPU cores (currently with a maximum of four cores per chip, in the case of the processors from Intel and AMD), while investigating designs which significantly increase the number of cores. Intel, for example, have publicised a development chip featuring 80 cores, claiming performance up to 1 Teraflop, while indicating that a commercial release may still be 10 years away.

In the meantime, a highly significant market has arisen for SIMD-style accelerator systems designed to work in conjunction with a host computer, and targeted at the HPC community. One significant example is the Tesla accelerator series from nVidia [17]. The company have for many years provided an SDK for several of their GPU-based video cards, featuring their CUDA development environment. The Tesla series represents their first general-purpose accelerator product not specifically developed as a graphics accelerator, while based on the same technology. The Graphics Processing Unit (GPU) has already attracted the interest of a number of audio researchers (e.g. [16]), especially for the computation of 3D acoustic modelling tasks. A second example is the "Advance" floating-point accelerator card from the British startup ClearSpeed [5]. Each card features two of their CSX600 chips, each offering 96 PEs supporting double-precision floating-point computation, with very low power consumption of some 30W per card. Each card provides a sustained computation speed of 50 Gflops. Both the nVidia and ClearSpeed products provide automatic acceleration support for Matlab, enabling them to be rapidly integrated into an existing HPC cluster.
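As an illustration of the fine-grained vector model mentioned above (SSE on x86), the following minimal C sketch (our own, using the standard SSE intrinsics from <xmmintrin.h>) applies a gain to four samples per instruction:

    #include <stddef.h>
    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Scale an audio buffer in place, four samples per multiply
       instruction. Assumes n is a multiple of 4 and buf is
       16-byte aligned. */
    void gain_sse(float *buf, size_t n, float g)
    {
        __m128 vg = _mm_set1_ps(g);           /* broadcast gain into 4 lanes */
        for (size_t i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(buf + i);  /* load 4 samples */
            v = _mm_mul_ps(v, vg);            /* 4 multiplies at once */
            _mm_store_ps(buf + i, v);         /* store 4 samples */
        }
    }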
Following a similar development trajectory, FPGA devices have increased substantially in both size and speed in recent years, offering a powerful alternative to custom DSP chip design for both academic research [15] and industry. For example, the new "Crystal Core" system from Fairlight is based on a dynamically reconfigurable FPGA device [7]. It is also of interest to note the use of special "physics" co-processors in some game computers.

3. THE SOFTWARE

Mention should also be made of the now defunct Inmos Transputer, almost unique in being closely associated with a dedicated concurrent programming language, Occam [11]. This in turn was based on the formal language Communicating Sequential Processes (CSP) devised by Hoare [9], which is still highly influential in the field of concurrent and parallel computation. Amongst computer musicians, the Transputer is celebrated for its use in the first real-time parallel implementation of Csound [2], involving some 170 Transputers, though it is recounted that the problem of heat dissipation was never resolved. The Transputer, however, lives on in emulation, for example in the forms of an FPGA-based device [13] and of a software emulation of the instruction set supporting Occam [12].

It would seem self-evident that parallel and concurrent programming requires a language directly supporting the paradigm. Nevertheless, the ubiquity of the C and C++ languages, neither of which embodies any explicit support for parallelism, has led to a dependence on general threading libraries (such as POSIX pthreads for C, or the Boost C++ threading library), or on language extensions such as OpenMP, which may or may not make explicit use of platform-specific facilities (with the many well-known issues associated with multi-threaded programming in these languages). On the other hand, a variety of custom extensions to C has emerged, serving particular hardware. For example, the ClearSpeed hardware is supported by an extended C compiler with a custom keyword, poly, defining a variable stored (with unique values) on each of the 96 PEs and referenced as a single entity. Further extensions and library functions deal with the sometimes complex tasks of transferring data between PEs, and between poly and mono (conventional) memory.
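By way of comparison, the OpenMP approach mentioned above needs only a single directive to distribute an independent loop across the available cores. A minimal sketch (our own illustration, compiled with e.g. gcc -fopenmp; the function name is hypothetical):

    /* Apply an independent spectral envelope to each bin of an
       analysis frame. The OpenMP pragma asks the compiler to share
       the iterations among the available cores; the loop body is
       unchanged C. */
    void scale_bins(float *mags, const float *env, long nbins)
    {
        #pragma omp parallel for
        for (long k = 0; k < nbins; k++)
            mags[k] *= env[k];
    }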

With respect to FPGA programming, the company Celoxica provides the language Handel-C (following CSP principles), enabling a range of FPGA development systems to be programmed using a similarly extended C language supporting both coarse- and fine-grained concurrency by means of wait and par keywords [4].

This very small and selective snapshot of the fields of hardware and software support for concurrent processing leads us to a consideration of the particular challenges presented by audio. In this regard, we must recall the inherently serial (one-dimensional) nature of raw audio data, which would seem to impose significant constraints on the full exploitation of parallel processing. Many of the fundamental processes in which we are interested, such as recursive filters, are inherently data-dependent and (taking the example of a plain first-order IIR filter) therefore by definition un-parallelisable.

The relationship between sequential and parallel computation is summarised in Amdahl's law [1], which gives the available speedup as

    speedup = 1 / (S + P/N)

where S is the fraction of the computation that is serial, P = 1 - S is the fraction that is parallelisable, and N is the number of processors. The limiting value is therefore 1/S for an infinite number of processors. This law (which has been applied in such areas as business and project management as well as in computing) suggests some serious limits on how much speedup can be obtained from parallel processing. However, it has more recently been shown to be an overly conservative estimate [8], with respect to a hypercube-based system.
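As a worked illustration (the numbers are our own, chosen arbitrarily), the short program below evaluates the law for a 10% serial fraction: 8 processors yield a speedup of only about 4.7, 96 processors about 9.1, and no number of processors can exceed 1/S = 10.

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / (S + (1 - S)/N). */
    static double amdahl(double S, double N)
    {
        return 1.0 / (S + (1.0 - S) / N);
    }

    int main(void)
    {
        const double S = 0.1;                 /* 10% serial code */
        const double n[] = { 1, 8, 96, 1e9 }; /* 1e9 approximates N -> infinity */
        for (int i = 0; i < 4; i++)
            printf("N = %10.0f  speedup = %5.2f\n", n[i], amdahl(S, n[i]));
        return 0;
    }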
For audio processes an equivalent to Amdahl's law is as yet undefined, and would seem to be highly dependent on the characteristics of the architecture, as well as on the nature of the process itself. Clearly, a process comprised mostly of recursive computations will gain relatively little speedup (dependent entirely on the mapping of the number of recursive streams to the number of processors). On the other hand, many audio processes are inherently highly parallelisable, with few or no data dependencies, most obviously frame-based analysis techniques such as the DFT. Such algorithms can be expected to derive the maximum benefit from parallel computation using SIMD models. As already indicated in the examples of graphics modelling, audio processes based on physical models (e.g. finite element networks) similarly offer a very good fit to a parallel computational model.

One factor that must be borne in mind is that the clock speeds of such devices (rated in MHz rather than GHz), as with most DSP chips, are often significantly lower than those currently employed in desktop PCs. The overall computation speed of such processors depends almost entirely on parallel computation, such that serial computation needs to be kept to an absolute minimum.

Given that current technology as exemplified in the systems described above is seeking to reach and exceed 1 Teraflop speeds, we argue that it is now time to revisit audio processes hitherto disregarded on account of their computational demands. Even if they are time-consuming today, they will not be in only a few years, so we may as well start investigating them now. By the same argument, we expect hardware that is currently relatively expensive to fall to commodity prices and therefore to become accessible to everyone.

In particular, we advocate the study of no-compromise algorithms: rather than simplify an algorithm purely for reasons of speed, HiPAC considers such algorithms in as pure or "ideal" a form as possible, especially where that ideal form may lead to musically useful and novel behaviour. We can therefore summarise the primary defining characteristics of a HiPAC DSP process:

* use of highly parallel fine-grained architectures (e.g. following the SIMD model), though we do not exclude more "conventional" multi-core computation
* real-time performance or better
* low latency, implied by real-time operation
* ideal and "no-compromise" forms of algorithms
* new processes, and hence new effects and sounds, not simply "more of the same" - whether more reverbs or more voices.

Finally, we note that there are alternative forms of parallel and distributed computation that can be applied to audio, such as grids, web services and clusters. We do not consider these further in this paper.

4. A HIPAC CASE STUDY - THE SLIDING PHASE VOCODER

We have proposed the Sliding Phase Vocoder (SPV) as a canonical example of a HiPAC process [3]. Full details may be read in that paper; we focus here on relating the SPV to the HiPAC aspects defined above.

The conventional phase vocoder transforms audio into the frequency domain by means of a series of overlapping Fourier transforms (using the FFT algorithm). All practical implementations (and especially where real-time performance is required) advance analysis frames by some small fraction of the window size. For a given sample rate R, the analysis rate A is given by the hop (frame advance) D in samples, as A = R/D; at R = 44100 and D = 256, for example, A is only about 172 frames per second. We can describe this as the "hopping phase vocoder". It is well documented that increasing the overlap leads to improved sonic performance, but at a directly increasing computational cost. In the limit, as D → 1, the "ideal" phase vocoder advances frames by one sample, so that A = R. While recognised in the literature as offering sonic advantages, this sliding form has hitherto been avoided in practice as being computationally prohibitive.

Our implementation of the SPV is based on the Sliding DFT, in which the DFT frame is updated every sample by a simple complex rotation of the bin values, a process that is itself highly parallelisable. This has been found to reduce latency by as much as 75% compared to conventional pvoc. The computational demands are of course increased, simply by virtue of raising the analysis rate to the sample rate. However, most transformations applied to analysis frames are also parallelisable (often involving standard vector arithmetic operations).
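A minimal single-channel sketch of this update step follows (our own illustration, not the implementation described in [3]; the frame length N = 1024 is arbitrary, and the caller is assumed to maintain the N-sample delay line supplying x_old):

    #include <complex.h>
    #include <math.h>
    #include <stddef.h>

    #define N 1024  /* DFT frame length (illustrative) */

    /* Precompute the per-bin rotation factors exp(j*2*pi*k/N). */
    void sdft_init(double complex rot[N])
    {
        for (size_t k = 0; k < N; k++)
            rot[k] = cexp(I * 2.0 * M_PI * (double)k / (double)N);
    }

    /* Advance the whole DFT frame by one sample: add the difference
       between the incoming sample and the sample leaving the window,
       then rotate each bin. Every bin is independent of the others,
       so the loop maps directly onto SIMD-style parallel hardware. */
    void sdft_update(double complex bins[N], const double complex rot[N],
                     double x_new, double x_old)
    {
        const double delta = x_new - x_old;
        for (size_t k = 0; k < N; k++)
            bins[k] = (bins[k] + delta) * rot[k];
    }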

We have also shown that pitch shifting is a much simpler process compared to the hopping pvoc, since the frequency range of each bin covers the whole audio range (resynthesis is by oscillator bank). This enables, for example, audio-rate Frequency Modulation to be applied cleanly to an arbitrary input, a process we have termed Transformational FM (TFM) [3]. The single-sample update (permitting high modulation rates) is essential to the implementation of audio-rate FM, in both time-domain and frequency-domain forms; thus TFM is an example of a frequency-domain process that cannot be implemented using the hopping pvoc.

In terms of the HiPAC criteria listed above, we can see that they are all met by the SPV:

* highly parallelisable
* streamable in real time, given fast enough hardware
* lower latency compared to conventional pvoc
* an "ideal" or no-compromise version of pvoc
* enables a new class of transformation, TFM, not realisable using standard pvoc.

5. CONCLUSIONS

We have defined HiPAC as descriptive of new classes of computationally demanding audio processes implemented by means of the next generation of parallel processing platforms and tools. We have described the primary trends with respect both to desktop computers and to hardware accelerators. The latter are already offering close to Teraflop-class computing power. We expect the cost of these devices to drop significantly within the next decade, so that parallel computing will move from the domain of HPC to the home desktop.

We have presented the Sliding Phase Vocoder as a canonical example of a HiPAC process. Many other audio processes are known to be well suited to parallel implementation. We cite in particular processes based on physical models (whether of instruments or of acoustic spaces), which can be realised not only with optimised waveguides, but also in a more "ideal" form using finite element models. We suggest that by implementing these computationally intensive processes now, we prepare the ground for the immediate exploitation of the next generations of parallel computing hardware, which may be with us much sooner than we might have supposed only a year or two ago.

6. REFERENCES

[1] AMDAHL, G. M. Validity of the single processor approach to achieving large scale computing capabilities. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2000, pp. 79-81.

[2] BAILEY, N. J., PURVIS, A., MANNING, P. D., AND BOWLER, I. W. An Implementation of Csound on The Transputer. Applications of Transputers 1 (1989).

[3] BRADFORD, R., DOBSON, R., AND FFITCH, J. The Sliding Phase Vocoder. In Proceedings of the 2007 International Computer Music Conference (August 2007), vol. II, ICMA and Re:New, pp. 449-452. ISBN 0-9713192-5-1.

[4] CELOXICA. http://www.celoxica.com, accessed 2008.

[5] CLEARSPEED TECHNOLOGY. ClearSpeed Whitepaper: CSX Processor Architecture. http://www.clearspeed.com, Feb 2007.

[6] Cray History. http://www.cray.com/about_cray/history.html, 2008.

[7] FAIRLIGHT. http://www.fairlightau.com, 2008.

[8] GUSTAFSON, J. L. Reevaluating Amdahl's law. Commun. ACM 31, 5 (1988), 532-533.

[9] HOARE, C. A. R. Communicating sequential processes. Communications of the ACM 21, 8 (1978), 666-677.

[10] IBM. Cell Broadband Engine resource center. http://www.ibm.com/developerworks/power/cell, 2007.

[11] INMOS, Ed. The occam2 Programming Manual. Prentice-Hall, 1988.

[12] JACOBSEN, C. L., AND JADUD, M. C. The Transterpreter: A Transputer Interpreter. In Communicating Process Architectures 2004 (September 2004), I. R. East, D. Duce, M. Green, J. M. R. Martin, and P. H. Welch, Eds., vol. 62 of Concurrent Systems Engineering Series, IOS Press, Amsterdam, pp. 99-106.
[13] JAKSON, J. R16: A New Transputer Design for FPGAs. In Communicating Process Architectures 2005 - WoTUG-28 (2005), J. F. Broenink, H. W. Roebbers, J. P. Sunter, P. H. Welch, and D. C. Wood, Eds., vol. 63 of Concurrent Systems Engineering Series, IOS Press.

[14] MOORER, J. A. Audio in the New Millennium. J. Audio Eng. Soc. 48, 5 (May 2000), 490-498.

[15] PFAFF, M., MALZNER, D., SEIFERT, J., TRAXLER, J., WEBER, H., AND WIENDL, G. Implementing Digital Audio Effects Using a Hardware/Software Co-Design Approach. In Proc. of DAFx07 (Bordeaux, France, September 2007), pp. 125-132.

[16] RÖBER, N., KAMINSKI, U., AND MASUCH, M. Ray Acoustics Using Computer Graphics Technology. In Proc. of DAFx07 (Bordeaux, France, September 2007), pp. 117-124.

[17] NVIDIA. Tesla GPU Computing Solutions for HPC. http://www.nvidia.com/object/tesla_computing_solutions.html, 2008.

[18] VIRGINIA TECH. System X. http://www.arc.vt.edu/arc/SystemX/index.php, 2007.