Embedding a sensory data retrieval system in a movement-sensitive space and a surround sound system
Insook Choi (email@example.com), Geoffrey Zheng, and Ken Chen
University of Illinois at Urbana-Champaign, 405 N Mathews, Urbana, IL 61801

Abstract

Voices in Ruins (ViR) is the most recent in a series of interactive audio media installations constructed for gallery settings (Brothers 2000), (Cohen 2000). These installations embed a sensory data retrieval system in a mediated space that measures unencumbered human movement. Movement is translated into control signals for sound synthesis, spatial simulation, and diffusion. Computer vision technology mediates this interface, gathering and evaluating changes in the physical space and communicating the data to synthesis engines, databases, and the composed environment. The artistic goal is to bring visitors' attention to their own acts of disposition towards a displayed work in an instrumented space, shifting them from roles as distant observers to roles as intimate observers.

1. Introduction

The thesis of this installation is to prototype a sensory information retrieval system in which the acquisition of information is an acquisition of an experience. A sensory information retrieval system is a composed system of networks in which the connectivity is based upon the analysis of a database in use. In Voices in Ruins we focused on historical speech documents, from which under-explored information may rise above the common practice of "information age" communications. In the common practice, written texts are the dominant form of information, the "word." A logo-centric focus on texts makes only a part of their information available and discards the rest.
Examples of overlooked or discarded information include vocalization techniques of speakers, such as the means by which a speaker distributes his or her vocal energy (longer or shorter vowels, harsh or round consonants, inflections, the unit of articulation, rhythm and pace); the timbres of voices and how the speakers address their audiences; the ambient space (outdoor or indoor speech); and the speakers' views toward the apparatus used in the recordings, and how they addressed that apparatus. This information, when available, offers in-depth access to the thought processes of the historical figures and the context in which they exercised their voices.

2. Semantic networks

The source materials (historical speeches) are analyzed to create a database of semantic units (SUs), and these units are organized by grouping analysis parameters to create semantic networks (SNs). The configuration capacity of an SN is in an abstract form (coding networks as directed graphs), to embody any rationale that is applied as a basis for the logic of analysis. Analysis is performed as part of the composition process, based upon observations of features in the source materials. These observations provide the foundation for the SU database structure shown in Figure 1, and provide criteria for traversal connectivity (links between nonadjacent SUs in the database). Not every traversal of the database must be contiguous. Each SU includes one or more links to other SUs, enabling traversal with variable depth of field. Analysis supports the organization of multiple SNs. The database is sorted according to observable features of an SU such as inflection, text, or duration. Sorted SUs are grouped to form an SN, a directed graph structure with an SU at each node and links to other SUs as edges.

Ordering on one attribute does not enforce similarity in other attributes, so the SNs contain variety as well as a common characteristic or theme. In the installation, acoustic resources are selected by traversing SNs. Visitors' movements are evaluated in real time to determine the SN position. The SU database and SU analysis are external to the interactivity of the installation; their logic is accessible to observers via the SNs.

Figure 1: Semantic Unit Database format. Each source recording (e.g. Roosevelt, JFK) is segmented by a text analysis method into semantic units in sequential order. Fields common to all semantic units: Class (inflection type or text group), Energy (relative audibility of attribute), Position (start time in source sound file), Duration, and SN Membership (edge list of links to other SUs).

The physical space of the installation registers the
dynamics of the observers' movements and the resulting auditory dynamics. The SNs are embedded in the physical space by indexing the observer's position in the installation to a node position in the SN. As observers traverse the installation they also traverse the SN, or multiple SNs; thus the SN embedding in the space is not static (see Data Embedding). The computer vision technology communicates the room dynamics to the synthesis instructions. Programmable statistical measures record and evaluate the movement over time, so an observer's interaction in the space can build a history (hysteresis). Levels of activation accumulate and dissipate over time, influencing the duration and structure of phrases and formal sections of the composition.

3. Sound Production

Sound synthesis and processing includes spectral models and analysis-resynthesis techniques, providing dynamic modeling of noise and residuals, and of the acoustic environment that hosts them. Historical speeches in sound files were analyzed in terms of signals and residuals using several tools, including SMS (Serra 1997). Elements and artifacts of recordings and radio transmissions comprise a composition vocabulary. Oceans of artifacts play as metaphor within the residuals, environmental backgrounds, ambiences, and traces of transmitted signals. Visitors can separate and re-composite speech signals from residuals as if practicing a form of archeology. We engineer the synthesis system to afford multiple layers of acoustics, responding to local or global activities in the room through techniques such as spatialization (source position/distance and acoustic reflection simulation) and de-correlation. These are activated by speech and sound synthesis data stored in the semantic networks. An SN is implemented using a state machine in VSS (Bargar 1997). The VSS SmActor stores instructions in a directed graph structure.
Each node in the graph stores SU attributes, timing data for transitions to other nodes, and tuning parameters for signal processing, sound synthesis, and the simulated spatial position of the node. An SU edge list can provide multiple links to other SUs; in the state machine, multiple edges from a single node are assigned probability weights summing to 1.0. When visitor movements traverse the SN, the SmActor determines the edge traversal according to weighted probabilistic selection. SN transitions include timing information related to the SU duration.

Figure 2 details the ViR system, consisting of three computers, a video subsystem including a video analyzer board, and an audio subsystem including DAC, mixer, power amplifier, speakers, and subwoofers. The video data is sent to a Windows PC for processing via serial cable. The analysis results are encapsulated in messages sent to VSS on a Linux PC via LAN. All compositional decisions and synthesis instructions are generated within VSS. Multi-channel VSS audio is sent to jMax on a second Linux PC via an ADAT optical interface, and the data controlling various parts of the jMax patch is sent between the VSS jMax actor and the UDPmessage object in jMax over the network.[1] Some VSS audio outputs are processed by Spatialisateur, the jMax spatial sound processing package that incorporates source, room, and listening environment models to enhance the perception of the virtual sound environment. The output of jMax is again sent over an ADAT optical interface to the audio subsystem, which ultimately delivers multi-channel audio for the installation.

Currently VSS is implemented on SGI Irix, PC Linux, and PC Win9x/2K/NT platforms. The choice of the PC/Linux platform for ViR is based upon cost-effectiveness and reliability. jMax has been ported to Linux. Using two Linux machines makes the system setup consistent and very robust. The installation ran for months and never broke down because of system failure.
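The weighted probabilistic traversal described above can be illustrated with a short sketch. This is not the VSS SmActor code; the node names, durations, and edge weights below are invented for the example.

```python
import random

# Illustrative SN node: a semantic unit with a duration (for transition
# timing) and weighted edges to other nodes; weights sum to 1.0.
class SNNode:
    def __init__(self, name, duration, edges):
        self.name = name          # semantic unit identifier
        self.duration = duration  # seconds; relates to SU duration
        self.edges = edges        # list of (target_name, weight)

def traverse(network, start, steps):
    """Follow weighted edges from `start`; return the visited node names."""
    path = [start]
    node = network[start]
    for _ in range(steps):
        r, acc = random.random(), 0.0
        for target, weight in node.edges:
            acc += weight
            if r < acc:           # weighted probabilistic selection
                node = network[target]
                break
        path.append(node.name)
    return path

network = {
    "SU_A1": SNNode("SU_A1", 2.5, [("SU_A2", 0.7), ("SU_B1", 0.3)]),
    "SU_A2": SNNode("SU_A2", 1.8, [("SU_B1", 1.0)]),
    "SU_B1": SNNode("SU_B1", 3.1, [("SU_A1", 1.0)]),
}
print(traverse(network, "SU_A1", 6))
```

Each run yields a different path through the same network, which is the point of the design: the composed logic (the edges and weights) stays fixed while the realized sequence varies with traversal.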
As of this writing, Linux OSS does not officially support any ADAT soundcard except the Sonorus STUDI/O (the OSS driver for the RME Digi32 card does not support ADAT mode, and the OSS driver for the RME Digi96 card is in beta). ALSA supports other ADAT-compatible soundcards but, unfortunately, not the STUDI/O, which is the only card that has two pairs of ADAT I/O, although the OSS driver does not support dual ADAT mode, so the maximum utilizable number of channels is ten: 8 ADAT and 2 S/PDIF. The OSS API for the STUDI/O card has many peculiarities that have to be dealt with in VSS core code. Various buffer sizes for the driver and the applications (VSS and jMax) had to be painstakingly tweaked to avoid buffer underrun/overrun problems.

Figure 2: Voices in Ruins system components. A Simple Video Analyzer sends grid and door/region data over a serial link to a Visual Basic data-processing module on the Windows PC, which sends control data to VSS on one Linux machine; VSS connects in turn to jMax on a second Linux machine.

[1] For further information on the connection of VSS and jMax, see Bargar et al., "Coney Island: Combining jMax, Spat and VSS..." in this volume.

The first version of the OSS driver could not
obtain the correct sampling rate value from the soundcard's internal oscillator, so we had to resort to an external clock source.

4. Space Engineering

The installation is designed to avoid visual art objects, seducing gallery visitors to sounds rather than visuals, yet facilitating the experience as a gallery experience. The installation presents acoustic tile and text panels on the walls, and carpet marked with a few chalk circles. Remarks by visitors that the installation is visually "striking" imply certain expectations of gallery goers, and can be attributed to the effect of the space layout and design based upon the acoustic criteria and their impact on the materialization of the room.

The choice of sensor is based upon space engineering criteria: (1) The sound experience should be interactively scheduled in conditional layers, to the extent that sounds are not merely movement-triggered events. This requires a certain level of perceptual mechanism for which vision technology provides an experimental stage. (2) The non-tactile space is interpreted as regions hosting luminance point-sets that indicate visitors' presence and their movements in the space. The behaviors of the resulting pixel aggregates are interpreted to initialize simple state machines in the composition.

The hardware of the video analyzing system includes a 3.6 mm FL lens and the SVA (Decade Engineering 1999), a video signal micro-processing board that uses a luminance threshold to convert the input image to binary (white/black) form. The SVA communicates with the PC host through a serial data link as shown in Figure 2. A Visual Basic (VB) application on the PC host controls the detection modes and the luminance threshold of the Video Analyzer, and displays the image. Another communication link is set up between this VB application and VSS via the network.
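The luminance thresholding performed by the SVA can be sketched as follows. This is an illustrative Python sketch, not the board's firmware; the frame values, threshold, and region layout are invented.

```python
# Hypothetical sketch of luminance thresholding and per-region pixel
# counting: pixels brighter than the threshold become 1 (white), the
# rest 0 (black), and active pixels are counted per detection region.
def binarize(frame, threshold):
    """Convert a grayscale frame (rows of 0-255 luminance) to 0/1."""
    return [[1 if px > threshold else 0 for px in row] for row in frame]

def count_region(binary, region):
    """Count active pixels inside a rectangular detection region."""
    top, left, bottom, right = region
    return sum(binary[y][x] for y in range(top, bottom)
                            for x in range(left, right))

frame = [[30, 200, 40],
         [220, 210, 35],
         [25, 20, 190]]
binary = binarize(frame, 128)
print(count_region(binary, (0, 0, 3, 3)))  # prints 4
```

The per-region counts are what the VB application forwards for presence detection, rather than the raw image.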
The resolution of the video image is 500 (H) x 240 (V) pixels, corresponding to a designated floor region in the installation space, and is updated at a preset frame rate. In our application, several detection regions are partitioned from the 2D image of the space, and the pixels within the individual detection regions are counted every frame. Pixel data may be used directly to judge the presence of observers on a frame-by-frame basis. However, the camera recognition system is probabilistic while human movement is contiguous across analysis frames. Accuracy is important when detecting observers' movements from one floor region to another. To make this possible, an equal number of registers is assigned to each of the detection regions. The numerical values stored in the registers increase when more pixels than a noise threshold are detected in the corresponding region, and decrease when fewer pixels than the threshold are detected. Since it takes time for the register variables to increase and decrease to their upper and lower boundaries, short memories are implemented so that movement can be captured and detected in a timely manner while avoiding false events. The criterion used to judge a change of region is that the registers in adjacent regions must have values greater than the noise threshold for a reliable duration.

Calibration

The threshold and other control parameters need to be carefully calibrated according to the lighting conditions in the gallery and the status of the video signal. The SVA is designed to work with robust signals and high-contrast images. However, special lighting and reflective wooden floors in galleries provide a noisy light environment. To compensate for the noise, we apply auto-adaptive threshold variables to the time-average of the input signals. This strategy largely improved the accuracy of the decision-making logic.
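The per-region register scheme described above, a short memory that rises with sustained activity and decays otherwise, can be sketched as follows. The parameter values are invented for illustration; the actual thresholds were calibrated per installation.

```python
# Illustrative per-region register with short memory: it rises while the
# region's pixel count exceeds the noise threshold and decays otherwise,
# so a single noisy frame does not register as a change of region.
NOISE_THRESHOLD = 20   # minimum pixel count treated as real presence
RISE, DECAY = 3, 1     # register increment / decrement per frame
R_MAX, R_ON = 30, 12   # saturation bound and "region active" level

def update(register, pixel_count):
    """Advance one region's register by one video frame."""
    if pixel_count > NOISE_THRESHOLD:
        return min(register + RISE, R_MAX)
    return max(register - DECAY, 0)

def region_active(register):
    """A region counts as occupied only after sustained activity."""
    return register >= R_ON

reg = 0
for pixels in [50, 50, 50, 50, 5, 50]:  # one noisy dropout frame
    reg = update(reg, pixels)
print(reg, region_active(reg))  # prints 14 True
```

Because the rise is faster than the decay, movement is captured promptly while momentary dropouts (the frame with 5 pixels above) barely dent the accumulated value.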
Since the system is designed to work continuously for months, this auto-adaptive strategy also effectively eliminated the long-term drift of the video signals. Good lighting control is the key to good performance. Masking the floor/wall reflections, the choice of light source, and the position and orientation of the camera as well as the lights are all important factors that need to be carefully designed and adjusted. Excessive latency was experienced in one implementation. This problem was later solved by optimizing the VB code and carefully adjusting the control parameters of the SVA board, including the frame rate and the FIFO buffer size.

Data Embedding in Partitioned Space

ViR presents a sensorial database. We distinguish this from a multimedia database by the observer's acts to encounter and move in space, and by the analysis-based retrieval and assembly of sensory information. One might say that in a multimedia database the stored information defines linearity, whereas in a sensorial database linearity is defined by the retrieval algorithm with respect to an analysis. In ViR, 13 historical recordings from 1910 to 1965 provide a finite number of references. Analyses were encouraged based upon the modalities performed by the historical speakers. For example, a collaborator, author and UIUC English professor Michael Berube, was given sound files as well as transcriptions, and his analysis was based upon rhetorical devices in the speech acts. The resulting semantic units organized in SNs provide the basis for a large-scale formal organization. Once the analyses and networks are constructed, further rules of traversal can be imposed. An SN can be thought of as "suspended above" the installation floor, casting a shadow "projected down." The installation floor is partitioned into several regions, and the SN shadow is defined as an index from an observer's floor position to an SN node.
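The "shadow" indexing from floor partitions to SN nodes can be sketched as a simple mapping. The partition names, node names, and rotation scheme below are hypothetical; they only illustrate how the same floor regions can be re-mapped onto a different neighborhood of the network.

```python
# Hypothetical shadow index: each floor partition maps to an SN node;
# "rotating" the shadow offsets the mapping so the same positions reach
# a different semantic neighborhood.
PARTITIONS = ["north", "center", "south", "entry"]

def shadow(nodes, rotation=0):
    """Index each floor partition to an SN node, offset by `rotation`."""
    return {p: nodes[(i + rotation) % len(nodes)]
            for i, p in enumerate(PARTITIONS)}

nodes = ["SU_1", "SU_2", "SU_3", "SU_4"]
print(shadow(nodes))               # initial embedding
print(shadow(nodes, rotation=1))   # shadow rotated by one node
```

Because the index, not the floor, is what changes, the embedding of the SN in the space need not be static.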
When the observer moves from one partition to the next, the SN is traversed to a new node. Traversing the SN can be thought of as rotating its shadow such that one moves through a semantic neighborhood which includes sound synthesis and residual tuning as well as speech acts. A certain "cloud form" affects the sound experience generated from observers' movements. The rotating shadow of the SN traversal is determined by a position-based, movement-activated state machine; the shadow also describes the observer's auditory perspective.

5. Temporality and Form

The timing of data retrieval and sound synthesis with respect to movement is an element in the composition. In general, the amount of movement corresponds to the amount of sound activation and the rate of database traversal. The practice of "triggering sounds" is avoided by a system that modulates polyphony among residuals, tuning sounds, and clean speech excerpts. Each sound generator responds to levels of movement activation as a gradient rather than a Boolean switch. The activation response characteristic of each synthesis algorithm is such that some retrievals are simultaneous and others are spread in time. Imagine different cloud forms moving in and out of a region as the wind blows. In this case the "wind" is stirred by observers' movements within and between space partitions.

Activation characteristics are determined by hysteresis in the sensor and in the sound reproduction models. The camera exhibits the analog characteristics of a light-gathering device, with a response curve of saturation and dissipation. Pixels do not switch on and off; they fade in and out, forming clusters in the 2D camera plane according to the degree of movement projected from the corresponding floor positions. This provides a control function corresponding to an observer's acceleration as well as position. Faster movements cover more space in a fixed time frame, therefore more pixels are activated. A motion increase translates to greater spatial saturation and greater sound transformation. When observers stand in one place and reduce their body movement, the region of pixel saturation decreases until it falls below the noise threshold, and the sounds calm down accordingly. By the same design, the installation with no observers is nearly silent; when one approaches, sounds are stirred up.
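The saturation and dissipation response described above can be approximated by a leaky integrator over the movement signal. This is a minimal sketch with invented coefficients, not the installation's synthesis code; it only shows how activation varies as a gradient rather than switching on and off.

```python
# Minimal sketch of a saturation/dissipation response: activation is a
# leaky integrator over movement, rising quickly with new activity and
# dissipating slowly toward silence when movement stops.
def activation_curve(movement, rise=0.3, fall=0.05):
    """Integrate a movement signal with fast rise and slow dissipation."""
    level, out = 0.0, []
    for m in movement:
        coeff = rise if m > level else fall
        level += coeff * (m - level)
        out.append(round(level, 3))
    return out

# A burst of movement followed by stillness: levels climb toward 1.0,
# then decay slowly instead of cutting off.
print(activation_curve([1.0, 1.0, 1.0, 0.0, 0.0, 0.0]))
```

Because `fall` is much smaller than `rise`, an observer who stops moving hears the sound calm down gradually, mirroring the camera's own saturation behavior.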
On the time scale of synchronous interactivity, the response curves of the sound synthesis algorithms are designed to exhibit energy saturation and dissipation analogous to the hysteresis of the sensors. The parameterization of this response can be varied for each floor region as well as for each node in the SN. For example, tuning sounds wake up immediately with new movement in a region, and the harmonicity of the tuning spectra varies from region to region, whereas the voices emerging from background residuals require several layers of activation. This variation is used to create the local formal structure of each section of the composition.

On a larger time scale, hysteresis principles are applied to create musical form using condition-dependent event threading. Movements are accumulated in data registers to achieve consequences at multiple time scales. The large-scale form of the composition is encoded as a state machine that responds to movement statistics measured in time windows. Formal sections are indicated by changing the active SN, which in turn determines the sensorial database and the activation range for sound synthesis. By nesting movement-sensitive measurement at several time scales, the composition unfolds modest or radical changes of auditory texture and semantic character in response to observers' sustained interactivity. This enables a responsive difference to increase over time in the presence of observers who are investigating the consequences of their movements, compared to those who are merely walking through the space. This design is intended to resist the decay of difference found in trigger-based presentations, which have been observed to normalize, rather than differentiate, the consequences of observers' attentiveness.

6. Conclusions and Future Projections

Condition-dependent event threading is a challenging aspect of scheduling in a system that responds to dynamics on parallel time scales.
Control gradients from floor regions, partition activation, and statistical measures are applied in parallel to modulate various sound sources. Changes initiated in any one of these measures may require transitions to be scheduled for all. Solutions to time-sensitive state changes with hysteresis are a research area where much investigation remains to be conducted in composition applied to interactive systems. Future focus is anticipated in these areas: (1) generally applicable strategies for transforming musical forms into graph structures, to facilitate nontrivial interactive scenarios and structured experience; (2) generally adaptable architectures of instructional resources within a system, for anticipating the range of possible interactions between observers and systems.

7. References

Bargar, R. "Authoring intelligent sound for synchronous human-computer interaction." In Kansei, The Technology of Emotion: Proceedings of the AIMI International Workshop, A. Camurri, ed. Genoa: Associazione di Informatica Musicale Italiana, October 3-4, 1997, pp. 177-188.

Brothers, L. Voices in Ruins: A Sound Installation. Dorsky Gallery, New York, 2000.

Cohen, M.D. "Insook Choi at Dorsky Gallery." Reviewny, June 1, 2000. http://www.reviewny.com/

Decade Engineering. SVA Documentation. http://www.decadenet.com/, 1999.

Serra, X., Bonada, J., Herrera, P., and Loureiro, R. "Integrating complementary spectral models in the design of a musical synthesizer." Proceedings of the International Computer Music Conference, 1997.