SOUND DIFFUSION USING HAND-HELD LIGHT-EMITTING PEN CONTROLLERS

Kenneth Brown, Michael Alcorn, Pedro Rebelo
Sonic Arts Research Centre
Queen's University, Belfast

ABSTRACT

This paper investigates alternative methods for multi-channel sound diffusion based on the mapping of physical gestures to spatial location. The approach addresses issues surrounding the inappropriateness of mixing consoles as an effective interface for sound diffusion in multi-channel loudspeaker environments, and instead proposes a system that allows the user to traverse all possible trajectories in a multi-loudspeaker environment.

1. INTRODUCTION

Many composers of electroacoustic music create stereo works which are intended to be relayed (or diffused) in a concert venue through loudspeakers strategically placed around the audience. The process of diffusion can add meaning to the experience by articulating the inherent gestural and structural features of the music.

The conventional method of diffusing sound over multiple-speaker systems uses a mixing console with faders routed to single speakers or groups of speakers. The spatial projection of the music is achieved through variations in the signal gain of the faders. The actions range from subtle adjustments of a single fader to complex manipulations of many faders. Diffusing in this way is hampered when a relatively large number of faders must be moved simultaneously, especially if they are non-adjacent. Whatever mapping is chosen from faders to speaker(s), some sonic trajectories will be harder to perform than others: for example, moving a sound image from one extreme position in a room to another whilst passing the signal through intermediary speakers. Neither the mixing console nor human physiology is well suited to this task. A discussion of these aspects of sound diffusion is presented by Jonty Harrison [1]; however, whilst mentioning alternative controlling methods, Harrison does not address them in detail.

This paper presents the development of a more intuitive, gesture-based live diffusion environment. A prototype system was conceived around a pair of hand-held light-emitting pen controllers which enable the exploration of a potentially more flexible and expressive interface for diffusion. The performer gestures with the pens in front of a camera-based system in which the movements of the two pens represent two channels of sound in a three-dimensional space. A computer enhances and analyses each camera frame to extract the position of each pen. It then maps these coordinates to a set of output channel gains, which in turn control the perceived location of each sound channel within the space. The described system is realised using Max/MSP/Jitter running on a Mac G5 computer with a connected iSight camera.

2. TRACKING

If the hands are to 'conduct' the sound image around the room, some mechanism is required for determining both hand positions simultaneously. Tracking systems for a pair of hands (possibly holding an object) can be split into occlusionless and occluding. This project explores the potential of occluding tracking systems only. Whilst one hand may occlude the other in a single-camera system, a software solution can be achieved that minimises the adverse effects. The methodology for extracting three-dimensional coordinates from a single (two-dimensional) camera picture is discussed later in this paper. Once the coordinates of each hand are determined, any disappearance of a hand can be compensated for, in software, by 'holding' the last known coordinate.
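As a minimal illustration of this 'hold the last known coordinate' behaviour, the sketch below wraps a per-frame detection result in Python. It is not the authors' Max/MSP/Jitter patch; the position type is an assumption chosen for clarity.

```python
# Hedged sketch (not the actual Jitter patch): hold the last known pen
# position whenever the tracker fails to detect the pen in a frame.
from typing import Optional, Tuple

Position = Tuple[float, float, float]  # assumed (x, y, distance-from-camera)

class HeldPosition:
    """Re-uses the last valid detection while the pen is occluded or hidden."""

    def __init__(self) -> None:
        self._last: Optional[Position] = None

    def update(self, detected: Optional[Position]) -> Optional[Position]:
        # None signals that the finder returned nothing for this frame.
        if detected is not None:
            self._last = detected
        return self._last
```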
3. PERFORMANCE SPACE

3.1. Camera Orientation

In order to pursue the expressive potential of "free" hand gestures, no unnecessary performance restrictions were imposed. With the camera facing the performer, crossing the hands while using a stereo source allows the performer to 'cross the channels'. In conjunction with the hand-held pens, which are described later, a front-facing camera allows the performer to exploit one of the most novel features of the system: twisting the wrists to control the 'spread' of the sound. With a frontal camera, this gesture can easily be tracked without affecting the detected hand position.

3.2. Delineating Frame

The use of both skeletal and enclosed frames at performer height was tested to help delineate the camera's viewpoint and provide light screening.
The area furthest from the camera was designed to be just larger than a maximum gesture size of approximately fifty centimetres, which, in conjunction with the camera's focal length, effectively determines the overall frame size and shape. After some experimentation the use of a frame was dismissed, with the proviso that ambient lighting in the performance room could be suitably controlled to allow for "free" hand gesture tracking.

Fig 1. Camera and Frame.

3.3. Shape of the Virtual Space

To avoid the need for complex compensation in software, it is left to the performer to compensate for the diverging camera angle by increasing the gesture size as gestures move axially further from the camera. In effect, the performer must imagine that the room volume is mapped into the smaller, diverging three-dimensional space in which the diffusion gestures are performed; not unlike the diverging spread of keys on instruments such as the saxophone.

3.4. Ambient Lighting

Concert performances usually occur in relatively low lighting conditions, particularly so for electroacoustic performances where the mixing console is in the centre of the auditorium and there is no stage performance for the audience to look at. With a view to accommodating varying lighting conditions, hand-held light-emitting devices were used. These provide their own light source and hence do not require stable lighting conditions for reliable tracking.

4. HAND-HELD LED PENS

To aid the computer recognition of each hand position, we chose to use modified pen-torches (Fig 2).

Fig 2. Modified Pen-Torches.

Each pen-torch represents the tracking point for one hand and helps to give a high-contrast image that eases computer tracking of the hand positions. The performance strategies proposed here build on the hand-held wands used in Don Buchla's Lightning MIDI controller [3].

4.1. Light Emitters

The prototypes use small wide-angle coloured light-emitting diodes (LEDs) as a light source, since they present a near point-source image and operate from a low voltage with a low current drain. Two LEDs were employed in each torch (as shown in Figure 3). Measuring the inter-LED spacing in the image provides an effective way of calculating the distance of the pen from the camera (a simple illustrative sketch of such a calculation is given below, following the introduction to Section 5). Other distance methods, such as measuring single-LED brightness, were considerably less reliable and were therefore rejected.

Fig 3. Close-up of LED assembly.

4.2. LED Colours

The LED colours were chosen to be uniform within each controller. This places a smaller processing burden on the prototype system, since only two LED pairs, rather than four separate points, need to be tracked. Identical colours within each pen still allow 'twist' detection, albeit with reduced resolution. Red and green LEDs (suitably matched in brightness) were chosen, since these colours are relatively easy for the tracking software to discriminate.

5. THE PROCESSING SYSTEM

The individual elements that constitute the image processing system are shown in Fig. 4. The system is implemented in the graphical programming language Max/MSP/Jitter, which provided the necessary tools for prototyping and developing the system architecture. The application can easily be ported to other real-time image systems such as Shu Matsuda's DIPS [2].

Fig 4. System Flowchart.
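As flagged in Section 4.1, the following sketch shows one plausible way of turning the apparent inter-LED spacing into a distance estimate. The paper does not give the exact calculation, so the standard pinhole-camera (similar-triangles) relation is assumed here, and the spacing and focal-length constants are illustrative rather than measured values from the prototype.

```python
# Hedged sketch: estimating pen distance from the apparent spacing of its two
# LEDs (Section 4.1). The pinhole relation and the constants below are
# assumptions for illustration, not values taken from the prototype.
LED_SPACING_MM = 20.0    # assumed physical distance between the two LEDs
FOCAL_LENGTH_PX = 700.0  # assumed camera focal length, expressed in pixels

def distance_from_led_spacing(apparent_spacing_px: float) -> float:
    """Similar triangles: apparent spacing shrinks in proportion to distance."""
    if apparent_spacing_px <= 0.0:
        raise ValueError("no LED spacing detected for this frame")
    return FOCAL_LENGTH_PX * LED_SPACING_MM / apparent_spacing_px  # millimetres
```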

Page  00000003 images of the LEDs are the only bright red and green objects in each frame. The blur object implemented in Jitter is relatively processor intensive, but useful for removing the effect of spurious bright red or green pixels that would otherwise lead to mis-tracking. 5.4. Channels A and B The enhanced frame is sent to two parallel channels denoted A and B corresponding to the two audio channels and their differently coloured pens. 5.5. Finder The finder scans every pixel in the current frame and returns the smallest rectangle that encloses all instances of pixels with the colour falling into its pre-configured RGB range; Fig. 5. shows two enlargements of such regions. The inner region of red and green LEDs after enhancement is yellow/white, so we track the distinguishable red/green regions. Fig 5. Tracked Regions (enlarged). Fig 4. System Flowchart. 5.1. Initial Setup - Room Configuration The physical room and speaker layout for which the system is used is defined in the processing system. A simple text file was used containing the room size and each speaker position relative to the bottom rear left corner. The measurements can be in any units since it is the relative positions of the speakers in the room that modulate the gesture mapping. The order each speaker as declared in this file defines the eventual output order. Changing the diffusion performance room involves simply changing the speaker configuration file, however if the lighting conditions are different, then manual checking of the "finders" may be required to achieve optimum performance. 5.2. Grab Grab is the process of acquiring consecutive video frames from the camera. The number of frames per second (f.p.s.) should be high enough to allow smooth operation. We chose fifteen f.p.s. as an operating figure. The pixel resolution is not particularly important, but should be high enough to allow for a reasonable difference in the number of pixels between LEDs with the pens at the distance extremes, whilst not slowing down processing speed. 5.3. Image Enhancement This performs optional blurring, contrast or brightness of the video input. The purpose is to make the tracking of the pen images more reliable by making sure that the In our system this is performed by the jit.findbounds object. This object takes a pair of minimum and maximum RGB triplets (see Table 1. for an example,) and simply returns the found rectangle top-left and bottom-right coordinates. I I I I I Min. Value Max. Value R G B 0.85 0.00 0.00 1.00 0.10 0.00 I I Table 1. Typical Red Finder Configuration (normalised). From this pair of coordinates it is trivial to extract the centre (which we take to be the position of the pen) and the diagonal length (from which the distance-fromcamera is calculated). The exact positions of the LED centres within the rectangle remain unknown, and to some extent the detected diagonal length is a function of the inter-LED spacing and brightness. However, it was not felt that more involved image processing would produce greatly improved results, though programming custom-designed finders could give scope for better spurious-pixel error detection and rejection. By ensuring the room lighting is relatively dim, and that no other bright red or green objects are within the camera's viewpoint, reliable tracking by the finders can be achieved.

5.6. Calculating the Speaker Gains

An algorithm is required to convert a three-dimensional position into a set of speaker gains.

5.6.1. MDAP

Multiple Direction Amplitude Panning (MDAP) [4] is a potential candidate, but MDAP assumes that the speakers are equidistant from the centre of the room. MDAP was developed by Ville Pulkki from his Vector Base Amplitude Panning (VBAP) methodology [5].

5.6.2. Distance Variable Power Algorithm

In order to accommodate non-equidistant speaker placement, a new algorithm was devised which calculates the contribution of each speaker based on a power of its closeness to the mapped hand position. The gains are then normalised to ensure a constant volume. The additional 'twist' performance control can be used to modify the power factor, effectively changing the spread of the sound. This algorithm can also be employed with volumetric speaker arrays.
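The paper does not give the exact formula for this algorithm, so the sketch below is one plausible reading of it: each speaker's gain is a power of its closeness to the mapped hand position, the exponent stands in for the 'twist' control, and the gains are normalised so that the overall level stays constant. The inverse-distance closeness measure and the equal-power normalisation are assumptions for illustration.

```python
# Hedged sketch of the distance-variable power algorithm (Section 5.6.2).
# The closeness measure, the power mapping, and the constant-power
# normalisation are assumptions; the paper does not give exact formulas.
from math import dist, sqrt
from typing import Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in the same units as the room file

def speaker_gains(source: Point,
                  speakers: Sequence[Point],
                  power: float = 4.0) -> list:
    """Gain per speaker as a power of its closeness, normalised to constant power.

    A larger 'power' concentrates the sound in the nearest speakers; a smaller
    one spreads it. In the prototype this exponent would be modulated by the
    'twist' of the pen.
    """
    eps = 1e-6  # avoid division by zero when the source sits on a speaker
    closeness = [1.0 / (dist(source, spk) + eps) for spk in speakers]
    raw = [c ** power for c in closeness]
    norm = sqrt(sum(g * g for g in raw))  # equal-power normalisation
    return [g / norm for g in raw]
```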
5.7. Smoothing

A limited amount of smoothing at some point in the processing chain is beneficial in minimising sound-position 'glitches' caused by occasional finder failures; however, too much smoothing increases the response time, so a trade-off is necessary.

5.8. Output

Two different output methods were evaluated.

5.8.1. MIDI Output

This method converts the gains for each channel into MIDI messages suitable for controlling MIDI-controllable faders, which distribute the audio signal to each speaker. The exact MIDI format depends on the software chosen. If a hard-disk recording system such as Digidesign's Pro Tools is used, it has to be configured so that each speaker has two controllable faders associated with it, one for each of the two input sound channels.

5.8.2. Direct DAC Output

Depending on the specific machine hardware and configuration, it may be possible to output sound directly to the computer's DACs. In this case the two source channels can be processed directly by the diffusion software: each is duplicated to the number of speakers, each speaker channel is scaled by its particular gain factor, and the pair of channels for each speaker is summed before output.

6. RESULTS OF TESTING

In preliminary tests with eight- and thirty-two-speaker systems, the diffusion software performed adequately once correctly configured and operated in suitably reduced lighting conditions. There was a small but noticeable latency between gesture and effect; nevertheless, a clear relationship between performative action and aural result was established. Some practice was needed to keep the controllers within the camera's view at all distances. However, deliberately removing a controller from view or switching it off temporarily produced gesture opportunities not initially envisaged, such as "jumping" the sound by switching the pens off and on while moving.

7. FURTHER WORK

Objective non-live testing needs to be devised and performed with suitable sets of input values, checking that both gain-array calculation algorithms produce desirable sets of gains. The cause of the perceived latency needs to be determined and minimised, if necessary by optimising the offending code sections or reducing the overall resolution. Further use of the device in concert situations would evaluate the suitability of this method for sound diffusion.

8. CONCLUSIONS

This project represents an exploration of alternative strategies for sound diffusion. Ultimately, each device or system suggests its own particular gestures and hence its own spatialisation language. The device presented here does not attempt to replace methods currently in use, but rather to suggest alternative spatialisation languages which are inherent to the design of the device itself. We have found that proposing an alternative to fader-based control prompts a re-evaluation of the musical and performative role of sound diffusion.

9. REFERENCES

[1] Harrison, J. "Diffusion: theories and practices, with particular reference to the BEAST system", eContact 2.4, 1999.

[2] Matsuda, S. and Rai, T. "DIPS: the real-time digital image processing objects for Max environment", in Proceedings of the International Computer Music Conference, 2000.

[3] Buchla, D. "Lightning II MIDI Controller".

[4] Pulkki, V. "Uniform Spreading of Amplitude Panned Virtual Sources", in Proceedings of the 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, October 17-20, 1999.

[5] Pulkki, V. "Virtual Sound Source Positioning Using Vector Base Amplitude Panning", Journal of the Audio Engineering Society, Vol. 45, No. 6, June 1997.