A system for recognizing shape, position and rotation of the hands

Leonello Tarabella, Massimo Magrini, Giuseppe Scapellato
Computer Music Lab of CNUCE/C.N.R.
via S. Maria 36, 56126 Pisa - Italy
Tel. +39-50-593276  Fax +39-50-904052

Abstract

After the experience of the Aerial Painting Hands system [1] developed in our Lab in Pisa, a new system based on real-time image-understanding techniques has been developed; it greatly enhances the expressive possibilities of a performer and makes his presence on stage more elegant. The performer moves his/her hands in the capture area of a video camera; the camera sends the signal to a video digitizer card plugged into a computer, and the computer processes the mapped figures of the performer's hands, recognizing the x-y position, shape (posture) and angle of rotation of the left and right hands. The data extracted from the image analysis at every frame are used for controlling real-time interactive computer music and computer graphics performances.

1. Introduction

Gesture, that is, movement of the hands and head and posture of the mouth and eyes, plays a very important role in human communication, whether seen as a parallel language for enriching the semantic content of speech or as an alternative way to communicate basic concepts and information between people of different cultures and mother tongues. The milestones of tools and paradigms in man-computer communication have been punched cards and character printers, then keyboards and character screens, and finally the mouse and graphic screens. Due to the steadily increasing power of computers and of electronic systems able to sense the presence, distance, position, temperature (and so on) of objects, a new field of investigation and implementation has opened up in the last few years: the recognition of human gesture.
In this direction, one of the most promising areas of application is computer vision, for two main reasons: first, vision is the natural way people recognise the gestures of other humans; and second, the required hardware is simple, economical and sufficiently standard: a video camera and a digitizer card. Whatever the technology used, the whole problem consists of two main steps and fields of research. In speech recognition we move from the acoustic signal to words, and from sequences of words to semantics. In gesture recognition, the first step deals with the recognition of figures (hands, face) in terms of shape and position in space, while the second step deals with dynamics (changes of shape in time, trajectories, detection of the starting and ending points of trajectories) and with how these give meaning to gesture [2,3,4]. Relevant works in progress in this research area are:
- "GestureComputer", by C. Maggioni and B. Kammerer at the Siemens AG Corporate Research and Development Department, for recognizing gestures of the hand and movements of the head in real time [5].
- "Orientation Histogram for Hand Recognition" by W. T. Freeman and M. Roth at Mitsubishi Electric Research Labs, for gesture recognition of the hand based on pattern recognition techniques [6].
- "Hand Posture Recognition" by D. Banarse at the Neural and Vision Research Lab, University of Wales, based on the neural network approach [7].
Even if these projects use different approaches and are bound for different targets, all of them focus their attention on a single hand and do not consider the further interesting information that comes from the combination and/or distinction of the left and right hands. Our work takes into consideration the presence of both hands and deals with the first stage of the problem as described above, the target being the extraction of as much information as possible from the gesture of a performer for controlling real-time interactive computer performances.

2. Technology and basic ideas

After the experience of the Aerial Painting Hands (APH) system, presented in concert (Wireless) and as a Studio Report at ICMC97 in Thessaloniki [1], a new system based on "image understanding" techniques has been developed. The hardware used in the APH system consists of a CCD video camera connected to a video digitizer card plugged into the computer; a performer dressed in black wears two ordinary cloth gloves with three spots of different colours and moves his hands, lit by a white spotlight, in the CCD-camera video area. In the new system the white spotlight has been replaced by a UV lamp placed very close to the performer, who wears two ordinary white cloth gloves which appear very bright in the darkness and can easily be isolated from the background environment.

The basic idea of the new approach to gesture recognition was derived from a paper by Prof. Vito Cappellini [8] based on the Fourier transform and developed for recognizing bolts and tools sliding on a conveyor belt, to be selected and picked up by a mechanical arm. In our system the digitized image coming from the camera is transformed into a binary matrix where ones represent those points whose luminance level is greater than a predefined threshold and zeros correspond to the black background. The next step consists, for each hand, of the computation of the barycenter and the construction of a "one-period-signal" using the distances from the barycenter of points along the contour, taken on radii at predefined angular steps (see Fig. 2). The barycenter, or center of mass, is computed as the weighted mean of the rows and columns of the matrix: the program scans the binary sub-matrices which represent the hands (Fig. 1 - Memory representation of hands) and, for each of them, performs the following computation:

    C_r = (Σ_i i · Row_i) / Total        C_c = (Σ_j j · Col_j) / Total

where Row_i is the number of pixels valued 1 in the i-th row, Col_j is the number of pixels valued 1 in the j-th column, and Total is the total number of pixels valued 1, that is, the number of pixels which represent one hand. C_r and C_c are the coordinates of the barycenter of that hand.

The one-period-signal is constructed using the distances between the barycenter and the contour of the shape of the hand. To find the second point of each segment (the first one always being the barycenter), the program searches along the line Y = mX + q for the most distant point of value 1, that is, the first pixel of value 1 encountered coming in from the frame border; m is the angular coefficient of the radius, which changes with step Δθ, corresponding to the virtual "sampling frequency" (Fig. 2 - Signal construction). Since in general the posture of the hand generates a non-convex figure (as in Fig. 2), the scanning algorithm just described produces signals corresponding to a "palmed" hand (like a duck's foot). However, as experimentally verified, this approximation does not affect the analysis results. More critical is the choice of the step Δθ which, when too large, gives aliased signals and, when too small, requires too much computation for the FFT. Good values have been found to be Δθ = 2π/32 and Δθ = 2π/64.

Fig. 3 - Examples of hand postures with the related one-period-signals

These figures show three different typical postures of the hands and their corresponding signals constructed as described above: it may be noticed that, in spite of the palmed-hand approximation, the signals are quite distinctive for the different shapes resulting from the relative positions of the fingers.

3. Recognition
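Recognition operates on the barycenter and one-period-signal just described. As a concrete illustration, that construction can be sketched in NumPy (a minimal sketch; the function names and the brute-force radial scan are ours, not the authors' implementation):

```python
import numpy as np

def barycenter(mask):
    """Center of mass of a binary hand image, as the weighted mean of
    row and column indices (Row_i and Col_j pixel counts)."""
    rows = mask.sum(axis=1)                  # Row_i: 1-pixels in row i
    cols = mask.sum(axis=0)                  # Col_j: 1-pixels in column j
    total = mask.sum()                       # Total: 1-pixels of the hand
    c_r = (np.arange(mask.shape[0]) * rows).sum() / total
    c_c = (np.arange(mask.shape[1]) * cols).sum() / total
    return c_r, c_c

def one_period_signal(mask, n_samples=32):
    """Distance from the barycenter to the outermost 1-pixel along
    n_samples radii, i.e. an angular step of 2*pi/n_samples."""
    c_r, c_c = barycenter(mask)
    h, w = mask.shape
    sig = np.zeros(n_samples)
    for k in range(n_samples):
        theta = 2 * np.pi * k / n_samples
        # walk outward along the radius, keeping the farthest 1-pixel
        # (equivalent to searching inward from the frame border)
        for r in range(1, max(h, w)):
            i = int(round(c_r + r * np.sin(theta)))
            j = int(round(c_c + r * np.cos(theta)))
            if 0 <= i < h and 0 <= j < w and mask[i, j]:
                sig[k] = r
    return sig
```

Because the farthest 1-pixel on each radius is kept, concavities between fingers are bridged, which reproduces the "palmed hand" approximation described above.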

The one-period-signal which represents the posture of the hand to be recognized is processed with the FFT algorithm: the resulting harmonic spectrum is used for recognizing the shape, while the phase spectrum is used for computing the angle of rotation. The following figures show four different shapes and the related harmonic spectra; the phase spectra are not reported here but are discussed later.

Fig. 4 - Four postures with harmonic spectra

The harmonic spectrum characterizes the shape of the hand very well and, furthermore, has the very important property of invariance with respect to both rotation and size (which changes with the distance of the hand from the camera). Recognition is then performed by an algorithm based on the idea of least distance between N-dimensional vectors: considering the harmonic spectrum resulting from the FFT as a vector H = (h_1, ..., h_n), and C_1, ..., C_m the harmonic-spectrum vectors related to precise shapes stored during a previous learning phase, recognizing a shape means computing the vector distance |H - C_i| between H and every C_i and selecting the C_i which gives the lowest value. Rotation is calculated from the phase spectrum: in this case only the first component is meaningful, since the higher components are simply multiples of the first one; besides, they are difficult to manage and, in the end, of no interest. Rotation is then computed as the algebraic difference between the current value of the first component of the phase spectrum and the first component of the phase spectrum of the shape selected by the least distance |H - C_i|. At this point it is important to note that during the learning phase it is worthwhile, for every shape, to keep the thumb straight and in a horizontal position in order to give a zero-degree rotation reference: rotation is then given as plus or minus degrees with respect to that reference.
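The least-distance matching and the phase-based rotation estimate just described can be sketched as follows (again NumPy, with hypothetical names; a sketch, not the original code). Rotating the hand cyclically shifts the one-period-signal, which leaves the magnitude spectrum unchanged but offsets the first phase component, and that is exactly the property exploited here:

```python
import numpy as np

def spectra(signal):
    """Harmonic (magnitude) and phase spectra of a one-period-signal."""
    spec = np.fft.rfft(signal)
    return np.abs(spec), np.angle(spec)

def classify(signal, templates):
    """Pick the stored shape whose harmonic spectrum is nearest to the
    current one (least vector distance |H - Ci|), then estimate the
    rotation from the difference of the first phase components."""
    h, phase = spectra(signal)
    best_name, best_dist, rotation = None, np.inf, 0.0
    for name, (t_h, t_phase) in templates.items():
        dist = np.linalg.norm(h - t_h)
        if dist < best_dist:
            best_name, best_dist = name, dist
            # first-harmonic phase difference tracks the rotation angle
            rotation = phase[1] - t_phase[1]
    return best_name, np.degrees(rotation)
```

In practice the phase difference wraps at ±180°, which would need handling; the paper's ±90° operating range keeps it away from the wrap-around.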
4. Comparison with the Neural Network approach

When we started to tackle the problem of recognizing the shape and rotation of the hands we considered, needless to say, the neural network approach. However, we found that it requires too much computation for a real-time application, especially when using ordinary personal computers. The learning phase, too, would be huge, since it requires considering the different hand postures at all sizes (corresponding to all distances from the camera) and, for each size, at all rotations. The FFT approach is much better for two reasons: first, during the learning phase each posture requires the storage of data concerning only one size (at a middle distance from the camera) and only one angular position; second, it is much faster at run time. Besides, even if of no great importance from the point of view of efficiency, the FFT is conceptually closer to the final goal of the whole system: the synthesis and processing of sound. So close, in fact, that the intermediate results, the harmonic and phase spectra, could be used for producing and/or transforming natural sounds (for example by convolution) in live interactive performances under the direct control of the hand movements.

5. Two-hands recognition

We claim that our work differs from the others mentioned at the beginning of the paper, apart from the methodology used, because it takes into consideration both hands: in fact, what we have described until now for only one hand is

valid for both and, more important, even when they are present simultaneously. The following figures show three situations with both hands in different postures and rotations and in different x-y positions; the results of the recognition process are reported under each situation.

RH: open  x=81   y=211   0°     RH: halt  x=114  y=137  -50°    RH: halt  x=73   y=202  +50°
LH: open  x=318  y=196   0°     LH: halt  x=298  y=135  -45°    LH: open  x=321  y=144  +0°

Fig. 5 - Three different configurations of both hands (x=0, y=0 at the upper left)

The recognition program performs the following tasks: at the beginning it scans the whole image vertically, starting from both sides, in order to find the left hand and the right hand; the hands are then framed into two separate rectangles and processed by the FFT algorithm. In the next frame (and in those which follow) the program scans each hand image in a rectangle slightly larger than the previous one, in order to save time; however, if the movement of one hand is too fast, the program may lose that hand: in this case it starts the vertical scanning again. At the end of the analysis of one frame the program yields, for each hand:
- a) the hand status, out of a set of possible states defined during the learning phase (for example: open, halt, closed, ...);
- b) the x-y position, given by the coordinates of the barycenter;
- c) the angular rotation, from -90° to +90° with respect to the horizontal position of the thumb.
In order to get the program to work correctly, some constraints must be taken into consideration: beyond the necessary off-line learning phase for defining the postures to recognize, the system does not allow the performer to turn the hands over (he has to keep the palms toward the camera, always!), since this leads to a double-right-hand or double-left-hand error. Also, the performer must wear white gloves lit by a UV lamp, move the hands slowly, and have a black background to enhance the hands.
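The initial two-hand localization step described above (scanning inward from both sides of the image and framing each hand in a rectangle) might look like this minimal sketch (NumPy; it assumes the two hands are separated by at least one empty column, and it omits the incremental tracking rectangles used on subsequent frames):

```python
import numpy as np

def find_hands(mask):
    """Locate the left and right hands in a binary frame by scanning
    columns from both edges; returns one bounding rectangle
    (top, bottom, left, right) per hand, or None if absent."""
    occupied = mask.any(axis=0)              # columns containing hand pixels
    cols = np.flatnonzero(occupied)
    if cols.size == 0:
        return None, None
    # left hand: grow from the leftmost occupied column until a gap
    l0 = l1 = cols[0]
    while l1 + 1 < mask.shape[1] and occupied[l1 + 1]:
        l1 += 1
    # right hand: grow from the rightmost occupied column until a gap
    r0 = r1 = cols[-1]
    while r0 - 1 >= 0 and occupied[r0 - 1]:
        r0 -= 1

    def bbox(c0, c1):
        rows = np.flatnonzero(mask[:, c0:c1 + 1].any(axis=1))
        return (rows[0], rows[-1], c0, c1)

    left = bbox(l0, l1)
    right = bbox(r0, r1) if r0 > l1 else None   # only one blob in frame
    return left, right
```

Each rectangle can then be fed to the barycenter/FFT analysis independently, which is what makes the simultaneous two-hand case no harder than the single-hand case.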
6. Conclusions

Considering that the allowed speed of movement depends on the power of the computer in use, some improvements of the system are currently under development. Using a "differential frame" technique, a pre-processing stage will filter out all the static areas of the scene and let through only those pixels belonging to the hands: this will make the white gloves, the UV lamp and the black background unnecessary. Our system was first developed on a Macintosh 8100/100AV computer; current efforts are addressed at using a standard QuickCam and a standard PCMCIA video-converter card, which will make it possible to run the system on laptop computers. This system has been designed and implemented for live interactive computer music/graphics performances; however, due to the low-cost standard hardware required by the upcoming version, it could also be considered a general-purpose interface for different areas of application such as games and education.

References
[1] Tarabella L., "Studio Report of the Computer Music Lab of CNUCE/C.N.R.", in ICMC'97 Proceedings, pp. 86-88
[2] Bordegoni M., Faconti G.P., "Architectural Models of Gesture Systems", in P. A. Harling and A. Edwards, editors, Proceedings of Gesture Workshop '96, pp. 61-74, Springer Verlag, 1996
[3] Rubine D., "Specifying gestures by example", in Computer Graphics, V.25(4), pp. 329-337, ACM Press, 1991
[4] Wexelblat A., "An approach to natural gesture in virtual environments", in ACM ToCHI, pp. 179-200, ACM Press, 1996
[5] Maggioni C., Kammerer B., "GestureComputer", http://www.bath.entres/MEDIA/SIEMENS/GestureComputerMainFrame.htm
[6] Freeman W.T., Roth M., "Orientation Histogram for Hand Recognition", http://atlantic.merl.com/pub/freeman/gesture.ps
[7] Banarse D., "Hand Posture Recognition", http://www.sees.bangor.ac.uk/~dbanarse/mscm-okhm
[8] Cappellini V., Elaborazione numerica delle immagini, Editore Boringhieri SpA, Torino, 1985