Sample Rate Synchronization across an ATM Network

George Robertson (1), Derek McAuley (2)

(1) Department of Computer Science, Glasgow University, george@dcs.gla.ac.uk, http://www.dcs.gla.ac.uk/~george
(2) Department of Computer Science, Glasgow University, dm@dcs.gla.ac.uk, http://www.dcs.gla.ac.uk/~dm

Abstract

This paper looks in detail at the problems experienced in real-time transfer of audio due to sample rate variations between digital audio devices interconnected by means of a computer data network (e.g. Ethernet or ATM). The variation one might expect between the sampling rates of common computer audio hardware is outlined, based on experimental work. Finally, we present our solution: an architecture based around a 'Time Manager' which keeps track of each device's sampling rate and can inform applications of the remedial action necessary to maintain synchronization.

1 Introduction

The MIniMS project is a joint venture between the departments of Computer Science, Music and Electrical Engineering of the University of Glasgow. Our aim is to provide high-quality, real-time audio services to a distributed network of users. It is envisaged that these users will be musicians and recording engineers, who will demand performance from the system akin to that of a recording studio, even though elements of the system may be kilometers apart. It is necessary, therefore, to provide an infrastructure for sample rate synchronization between elements of the system.

The AES recently published their recommended practice for digital audio synchronization in studio operations [1]. Here, equipment is to be synchronized by one of two methods. The first is to lock the sampling rates of all devices to that of a central oscillator, the 'Digital Audio Reference Signal'. The second method is to lock to the embedded sample rate clock within a digital audio input signal, known as 'genlock'.

In the studio environment, during the direct digital transfer of audio between two devices, failure to ensure sample rate synchronization will lead to problems. Either the transfer will fail altogether, which is likely when the rates of the two devices differ grossly and the AES/EBU interface cannot 'make contact' between them, or the transfer will proceed but clicks will be heard at the difference frequency between the devices, known as 'sync slippage' [2].

In the case of digital audio over a computer network, synchronization failure will manifest itself as a variation of 'sync slippage'. However, the effect on sound quality may be much greater. Consider the process of transferring audio from a microphone at one computer workstation (the source) to a loudspeaker at another (the sink). Within the source workstation, the audio hardware will be constantly sampling the microphone signal at a frequency determined by an onboard crystal oscillator. The source will slice this stream into chunks (packets) for transmission over the data network. At the sink workstation the packets are received and the original audio stream is reformed. This is achieved by reading the audio packets into a buffer. Samples will be drained from this buffer, and output by the audio hardware, at a rate defined by another, local, crystal oscillator.

Both the sink and source oscillators will be running at nominally the same rate (e.g. 44.1kHz). However, some small difference between the two rates is inevitable. If the sink is draining the audio buffer faster than the source produces samples, there will come a point when the buffer is empty and there will be a break in the output until another packet is received. If the opposite is true, samples will build up in the buffer until there is too little space at the end of the queue to accommodate an entire audio packet.
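The two failure modes described above can be illustrated with a minimal simulation. This is only a sketch under stated assumptions: the function name, the packet size, the buffer capacity and the deliberately exaggerated rate mismatch are all hypothetical values chosen for illustration, and network delay and jitter are ignored.

```python
def simulate_sink_buffer(source_hz, sink_hz, packet_samples,
                         capacity, prefill, seconds):
    """Step a sink buffer forward one packet interval at a time.

    Returns a (status, elapsed_seconds) pair, where status is
    "underrun", "overflow" or "ok". All parameters are illustrative.
    """
    level = float(prefill)
    # Time the source needs to produce one packet of samples.
    packet_period = packet_samples / source_hz
    # Samples the sink plays out during that same interval.
    drained_per_packet = sink_hz * packet_period
    t = 0.0
    while t < seconds:
        # The sink drains the buffer while waiting for the next packet.
        level -= drained_per_packet
        t += packet_period
        if level < 0:
            return "underrun", t   # buffer emptied before the packet arrived
        if level + packet_samples > capacity:
            return "overflow", t   # no room left for a whole packet
        level += packet_samples    # packet arrives and is queued
    return "ok", t


# Hypothetical example: 10 ms packets at a nominal 44.1kHz.
print(simulate_sink_buffer(44100, 44110, 441, 8820, 4410, 1000))  # sink fast
print(simulate_sink_buffer(44100, 44090, 441, 8820, 4410, 1000))  # sink slow
```

With a fast sink the simulated buffer eventually empties; with a slow sink it eventually has no room for another packet. Only the time taken depends on the size of the mismatch.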

Of course, neither of the methods suggested by the AES can be used when system connectivity is provided by the likes of ATM or Ethernet, since any timing information implicit in an audio stream will be lost due to packetisation and transmission at the bit rate native to the network technology. Furthermore, only a very few manufacturers of computer multimedia equipment (e.g. Silicon Graphics) provide the facility of locking the sampling rate to an external oscillator.

Strictly speaking, therefore, it is wrong to talk of sample rate synchronization in this area. It is, at present, impossible to achieve: the rate at which samples are produced, or consumed, by the audio hardware is fixed. However, one can effect synchronization by altering the sample stream to make it consistent with the rate of the sink workstation. This may mean re-sampling the audio stream to match the sink rate, or the controlled insertion and deletion of samples.

In the future this may change. On a UNIX workstation the system call 'adjtime()' is used to speed up or slow down the system clock. It would be entirely possible to allow control over the audio sampling rate in a similar fashion.

2 Observed Rate Variations

It will be useful to note the sample rate variation one may encounter between any two pieces of computer audio hardware. To assess this, a group of machines was measured against a control machine (a Silicon Graphics Indy). The table below shows the results, with the sample rate variations given in PPM (parts per million). Six computer workstations were tested: 3 Silicon Graphics Indys, 2 Sun SPARCstation 20s (SS20) and a Pentium with a 16-bit 'Soundblaster' card. A positive value indicates that the sampling rate of the test machine was higher than that of the control machine, and a negative value that it was lower.

Machine Type    32kHz (PPM)    44.1kHz (PPM)
SGI Indy #1         -9.39         -13.76
SGI Indy #2         -1.13           7.10
SGI Indy #3          3.41          -1.32
SS20 #1            -54.69         -47.51
SS20 #2            -24.75         -49.57
Pentium            240.88          -9.00

The machines were tested for sample rate variation at each of the following rates: 8kHz, 16kHz, 32kHz, 11.025kHz, 22.05kHz and 44.1kHz. It was found that the PPM ratings for 8kHz, 16kHz and 32kHz were identical, provided that the three readings were taken in close succession. The same was true for 11.025kHz, 22.05kHz and 44.1kHz. This is because the audio hardware in each machine uses two crystals: one for 8kHz and harmonics, and one for 11.025kHz and harmonics.

These values represent the fractional difference between the rates of the control machine and the test machines. If, however, we assume that the sampling rate of the control machine is always exactly equal to the nominal sampling rate, then we can interpret the results in absolute terms. If we are transferring digital audio from the control machine to the Pentium tested, at a nominal rate of 32kHz, then we can give the rate of the Pentium as 32007.7Hz (240.88 PPM faster). Therefore, each second the 'Soundblaster' card on the Pentium consumes approximately 7.7 more samples than the control machine produced in that time. If one assumes that the audio output buffer in the Pentium originally contained 100ms worth of data (3200 samples) then, after about 7 minutes, the output buffer will be empty. A break in the sound output is inevitable. From this point on, the Pentium will consistently want about 7.7 more samples per second than are being placed in the output buffer, and it is likely that clicks will be heard between the arrival of each network packet.

These problems can be avoided if the audio stream is suitably altered. The table above gives the required correction. In the example given, the Pentium is running 240.88 PPM faster than the control machine.
Therefore, in transferring audio from the control machine to the Pentium we will have to create, or insert, on average 240.88 samples for every million produced by the control machine. Note that, at a nominal rate of 44.1kHz, the Pentium runs 9 PPM slower than the control machine. In this case we would have to remove, or delete, 9 samples for every million produced by the control machine.

3 Synchronization Architecture

The architecture we have developed to allow this form of synchronization is shown in Figure 1. Audio streams are conveyed (from source to sink) via an ATM network. The 'Time Manager' assesses differences in rate between devices. It is important to note the distinction between the audio and control streams and the differences in what is required of them.
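As a check on the worked example in Section 2, the arithmetic relating a PPM offset to the required insertions or deletions can be sketched as follows. The function name and the returned field names are hypothetical. The sketch also assumes, as the text does, that the source runs at exactly the nominal rate.

```python
def ppm_correction(nominal_hz, sink_ppm, slack_samples):
    """Relate the sink's PPM offset (relative to the source) to remedial action.

    sink_ppm > 0 means the sink consumes samples faster than the source
    produces them, so samples must be inserted; sink_ppm < 0 means samples
    must be deleted. slack_samples is the buffered data (or free space)
    available before the buffer empties (or fills) if nothing is done.
    """
    drift_per_second = nominal_hz * sink_ppm / 1e6
    if sink_ppm == 0:
        return {"action": "none", "drift_per_second": 0.0,
                "samples_between": float("inf"),
                "seconds_to_exhaust": float("inf")}
    return {
        "action": "insert" if sink_ppm > 0 else "delete",
        # Surplus (or shortfall) of samples wanted by the sink each second.
        "drift_per_second": drift_per_second,
        # Average number of source samples between single-sample corrections.
        "samples_between": 1e6 / abs(sink_ppm),
        # Time until the slack is used up if no correction is applied.
        "seconds_to_exhaust": slack_samples / abs(drift_per_second),
    }


# Values from the table: the Pentium at 32kHz and at 44.1kHz.
print(ppm_correction(32000, 240.88, 3200))
print(ppm_correction(44100, -9.00, 4410))
```

For the 32kHz case this gives a drift of about 7.7 samples per second and roughly 415 seconds (about 7 minutes) before a 3200-sample buffer empties, in line with the figures quoted in Section 2.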

[Figure 1: Synchronization Architecture. Audio sources and an audio sink are connected to the Time Manager by Timing Streams and Synchronization Control Streams.]

3.1 Audio Data Streams

ATM is the natural choice for this type of work. Audio over TCP/IP is often affected by retransmissions due to packets dropped by routers and bridges [3]. Furthermore, ATM can provide guarantees. When a connection is established, the ATM endpoint requests a specific Quality of Service (QoS) for it. At the moment, the exact details of this QoS negotiation vary from one ATM vendor to the next. However, the ATM Forum (the ATM industrial pressure group) are working on standardizing this. Their User Network Interface (UNI) standard [4] is what all the others will eventually converge on.

Despite these proprietary differences, there are some elements of the QoS of a connection that are fundamental. These are: the peak bandwidth, the mean bandwidth and some measure of how 'bursty' the data will be. For PCM digital audio the bit rate is constant, so one will request the same value for the peak bandwidth and the mean bandwidth, and assume that the data will have 'zero burstiness'. The connection will then either be accepted and guaranteed the required resources, or it will be refused.

3.2 Control Streams

The control streams convey the information necessary to effect synchronization. As will be seen, they are low-bandwidth connections and need not even be transmitted over ATM. Thus, the Time Manager need not be running on an ATM-capable machine. The control streams fall into two categories:

1. Timing Streams
2. Synchronization Control Streams

The 'Timing Streams' convey information that allows the Time Manager to calculate the sample production rate of a source. Here, a source is a single piece of audio hardware, such as a soundcard within a workstation.
Each source has a single sample production rate (even if the resulting audio stream is multicast to various sinks) and one timing stream will exist between each source and the Time Manager. Sample rate information will be conveyed in the following atomic pair of values:

* Sample count (samples produced/consumed)
* High resolution time (nanoseconds)

The time value must accurately reflect the instant at which the given sample count was valid. One must remember that the differences observed between rates can be very small (an approximate range of 1 - 240 PPM) and the time values must be of sufficiently high resolution to resolve these differences. Nominally the time will be given in nanoseconds, although the resolution of the underlying hardware will vary with make and model.

In our prototype, the sample count and time are both given as 64-bit values. We are using ATM for these connections and both values are packed into a single ATM cell (48-byte payload). This information need only be sent intermittently and, in our prototype, the time interval between data transmissions is of the order of tens of seconds.

The 'Synchronization Control Streams' provide duplex communication between the Time Manager and a sink. Multiple connections are possible, as one Synchronization Control Stream will exist for every incoming audio stream.
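The timing-stream record described above (a 64-bit sample count paired with a 64-bit nanosecond timestamp) might be packed and used along the following lines. This is a sketch only: the text does not give the prototype's wire layout, so the byte order, the zero padding to 48 bytes and all function names here are assumptions made for illustration.

```python
import struct

CELL_PAYLOAD_BYTES = 48          # payload of one ATM cell
# Assumed layout: two unsigned 64-bit integers in network byte order.
RECORD = struct.Struct("!QQ")


def pack_timing_record(sample_count, time_ns):
    """Pack one (sample count, time) pair, zero-padded to a cell payload."""
    body = RECORD.pack(sample_count, time_ns)
    return body + bytes(CELL_PAYLOAD_BYTES - len(body))


def unpack_timing_record(payload):
    """Recover the (sample_count, time_ns) pair from a cell payload."""
    return RECORD.unpack_from(payload)


def estimate_rate_hz(first, second):
    """Estimate a device's sample rate from two (count, time_ns) records."""
    (count0, time0), (count1, time1) = first, second
    return (count1 - count0) * 1e9 / (time1 - time0)


def ppm_difference(rate_hz, reference_hz):
    """Fractional difference between two rates, in parts per million."""
    return (rate_hz - reference_hz) / reference_hz * 1e6


# Hypothetical records taken 20 seconds apart on one device.
a = unpack_timing_record(pack_timing_record(0, 0))
b = unpack_timing_record(pack_timing_record(640154, 20_000_000_000))
rate = estimate_rate_hz(a, b)
print(rate, ppm_difference(rate, 32000.0))
```

The long interval between records is what makes small differences measurable: a timestamp error of 1 microsecond over a 20 second interval corresponds to only 0.05 PPM.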

To appreciate the workings of the Synchronization Control Stream one must understand what occurs within the sink. Figure 2 shows the processes required to receive, synchronize, and mix together two streams.

[Figure 2: Illustrating workings of 'Synchronization Control Streams'.]

The two audio streams come into the sink device at 'A' and 'B'. The streams are then passed to a 'Synchronization Object'; one of these will exist for every incoming audio stream. This object will perform rate matching and buffering of the audio. Each rate-matching element receives information from an incoming Synchronization Control Stream ('C' and 'D'). It is at this point that the sample stream will be altered to match the sample consumption rate of the audio device. The audio data is then buffered before it is mixed and sent to the audio device.

The outgoing Synchronization Control Streams are shown at 'E' and 'F'. These inform the Time Manager of the sink rate, using the same method as for the sources. Also, the fill-level of each buffer is reported to check that synchronization is taking place. In theory, the rate at which a buffer fills should equal the rate at which it is drained. However, the Time Manager will need to know if the instructions it is giving the rate-matching element are incorrect. This would be characterized by very low, or very high, buffer levels.

4 Timing

It is vitally important that time be consistent throughout the system. It is necessary, therefore, to ensure synchronization between the system clocks of each machine. Clock synchronization includes the synchronization of both clock frequency and setting. Much work has already been done in this area: the Network Time Protocol (NTP) was developed to synchronize the system clocks of machines connected to the Internet [5].
Tests of NTP over Wide Area Networks (WANs) indicate that clock frequency synchronization is likely to be in the range 10-100 PPM, the same order of magnitude as the problem we are trying to solve. For the system to work properly this would have to be more like 1 PPM, if not better. The next stage of our work is to look at the performance of NTP over our Local Area Network (LAN). If NTP fails to give us the required performance, it is possible to synchronize machines to radio or satellite time services.

5 Concluding Remarks

The key to the success of this architecture is the separation of the audio data and control streams. One could route the audio data streams through the Time Manager to provide the necessary timing information. This would, however, be neither a flexible nor a scalable approach. The Time Manager would be able to cope with only a very few of these high-bandwidth audio streams and would have to be running on an ATM-capable machine. The use of low-bandwidth control streams allows a single process to monitor the timing of the entire system.

Future work will include a study of various sample insertion/deletion algorithms and the effects these have on both objective and perceived sound quality.

References

[1] AES11-1991, "AES Recommended Practice for Digital Audio Engineering - Synchronization of Digital Audio Equipment in Studio Operations."
[2] Rumsey, F. J., "Digital Audio Synchronization," Studio Sound, March 1991, pp. 74-79.
[3] Beadle, H. W. P., "Experiments in Multipoint Multimedia Telecommunication," IEEE Multimedia, vol. 2, no. 2, Summer 1995, pp. 30-40.
[4] ATM Forum, "UNI V3.1 approved specification," http://www.atmforum.com/pub/approved-specs/afuni-0010.002.ps.
[5] Mills, D. L., "Network Time Protocol (Version 3) Specification, Implementation and Analysis," Network Working Group Report RFC-1305, University of Delaware, March 1992.