
IBM Research
User Interface Technologies

Research Spotlight (January 2003): Selected Papers

G. Potamianos, C. Neti, J. Luettin, and I. Matthews, ``Audio-visual automatic speech recognition: An overview,'' To appear in: Audio-Visual Speech Processing, E. Vatikiotis-Bateson, G. Bailly, and P. Perrier (Eds.), MIT Press, pp. 121-148, 2003. MIT Press book chapter on audio-visual speech recognition.

EDITORIAL:

Computing is becoming increasingly ubiquitous and pervasive, with multiple devices and modes of interaction. Traditional modes of human-computer interaction based on a keyboard and a mouse are being replaced by more natural modes such as speech, touch, and gesture. This ongoing transformation towards pervasive and ubiquitous computing induces the need for the next generation of human-computer interfaces (HCI) that are easy to use, transparent, and robust in a variety of environments. It is important not only to improve the recognition and understanding of the individual modes (speech, gesture, etc.) in a variety of environments, but also to develop technologies and architectures that choose the most appropriate mode or modes to understand the content and the context of the interaction. Human cognition is an excellent guide for developing next-generation HCI, since it depends on highly developed abilities to perceive, integrate, and interpret visual, auditory, and touch information. In spite of dramatically varying perceptual conditions, the human ability to organize and intelligently combine sensory data derived from multiple sensors, modulated by perceptual relevance and sensory confidence, is crucial for building a robust model of objects and events in our environment.

Although significant progress has been made in developing the individual modes of HCI based on audio (e.g., speech and speaker recognition) or visual input (e.g., face localization and person identification), only recently has there been increased interest in exploiting the human capacity to process joint audio and visual information to improve systems for recognizing human activity (e.g., speech recognition, speech activity, speaker change), intent (e.g., speech intent), and identity (e.g., speaker recognition) in real-world pervasive computing environments. Indeed, over the past twenty or so years, joint audio and video signal processing has become an active field of research, attracting considerable attention in the signal processing community as a means to improve the robustness and naturalness of human-computer interfaces in a variety of domains. To realize the full benefit of joint processing in any of these application areas, several technical challenges have to be addressed, most notably the appropriate integration of multiple modalities. This special issue highlights innovative research in joint audio and video signal processing and its application to a variety of HCI-critical areas, namely source localization, speech source separation in multi-speaker environments, and speech and speaker recognition.

In the first paper, Zotkin et al. address the problem of joint audio-visual source localization using particle filters. They present a technique for tracking people by integrating audio cues captured using microphone arrays and visual cues captured using camera arrays. The multi-modal tracking problem is formulated in terms of particle filters. The authors also show that the approach can be used for self-calibration, or to deal with situations where the sensors are moving, or where people are partially occluded.
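As a rough illustration of how a particle filter can fuse the two modalities, the sketch below tracks a single 1-D position from one audio and one visual measurement per frame. It is a toy under stated assumptions, not the authors' method: the random-walk motion model, the Gaussian measurement models, the noise levels, and all function names are hypothetical choices made for this example.

```python
import math
import random


def gaussian_likelihood(observed, predicted, sigma):
    """Unnormalized Gaussian likelihood of an observation given a prediction."""
    return math.exp(-0.5 * ((observed - predicted) / sigma) ** 2)


def particle_filter_step(particles, audio_obs, video_obs, rng,
                         motion_sigma=0.2, audio_sigma=0.5, video_sigma=0.1):
    """One predict/weight/resample step over 1-D position hypotheses.

    audio_obs or video_obs may be None (e.g. silence or occlusion); the
    missing modality then contributes no evidence for that frame.
    All sigma values are illustrative assumptions.
    """
    # Predict: diffuse each particle with a random-walk motion model.
    predicted = [p + rng.gauss(0.0, motion_sigma) for p in particles]

    # Weight: multiply the per-modality likelihoods (assumes audio and
    # visual measurement errors are independent given the position).
    weights = []
    for p in predicted:
        w = 1.0
        if audio_obs is not None:
            w *= gaussian_likelihood(audio_obs, p, audio_sigma)
        if video_obs is not None:
            w *= gaussian_likelihood(video_obs, p, video_sigma)
        weights.append(w)

    if sum(weights) == 0.0:
        # Every particle is implausible; keep the unweighted prediction.
        return predicted

    # Resample: draw particles in proportion to their weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def estimate(particles):
    """Mean of the particle set, used as the position estimate."""
    return sum(particles) / len(particles)


# Toy run: a stationary source near 2.0, seen precisely by the camera
# and less precisely by the microphone array.
rng = random.Random(0)
particles = [rng.uniform(-5.0, 5.0) for _ in range(500)]
for _ in range(10):
    particles = particle_filter_step(particles, audio_obs=2.2,
                                     video_obs=2.0, rng=rng)
fused_estimate = estimate(particles)
```

Passing `video_obs=None` mimics an occluded person: the filter keeps running on audio alone, which is one simple way to picture the robustness to partial occlusion mentioned above.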

The subsequent papers shift the focus to audio-visual speech. Of particular interest and importance to developing HCI applications is the fact that visual information from the speaker's lower face both supplements and complements the information provided by the traditional audio speech signal.

Sodoyer and his co-authors describe a novel approach to separating speech sources using the audio-visual coherence of the speech stimuli. In contrast to the classical blind source separation techniques, the authors explore the problem of extracting a source from an additive mixture of speech sources. Using the visual lip tracking information of the desired speech source to extract its spectral envelope, the authors show that this approach compares favorably with independent component analysis.

Jiang et al. provide the first rigorous application of techniques that correlate speech behavior to various visual aspects of speech. They use multi-linear techniques and measures of vocal tract articulator motion, face motion, and speech acoustics during production of sentences and nonsense syllables to characterize redundant properties of visible and audible speech behavior and the role of the vocal tract as a common source of phonetic information in both modalities. They show that the phonetic impact (e.g., of ``place of articulation'') on the degree of correspondence between audible and visible events is not uniform. The study also shows that speaker-specific differences in the strength of the cross-domain correlation in production do not necessarily match perceived differences in speaker intelligibility.

An important requirement for conducting research in joint audio-visual signal and speech processing is the availability of suitable audio-visual speech corpora. Patterson et al. describe an audio-visual speech database, CUAVE, that is compact in its organization and accessible to other researchers and, at the same time, comprehensive enough to be useful in addressing a variety of methodological and research questions, especially those having to do with automatic detection and measurement of facial features and motion under varying position and orientation. The database consists of more than 7000 utterances (a total of 36 male and female speakers produced digits in isolation and in connected strings). In the latter part of the paper, the authors demonstrate the challenge and utility of the CUAVE database by extracting lip contours from moving and still faces and estimating a speaker-independent baseline for visual-only speech recognition, using various visual speech feature extraction techniques.

To allow joint audio-visual speech processing, reliable extraction of visual speech features is required. The problem consists of face detection, mouth region localization, and possibly lip (or, in general, face) contour estimation, followed by extraction of visual features that are informative about the uttered speech. These topics constitute the subject of the next cluster of papers in this special issue.

Daubias and Deleglise consider the first aspect of the visual speech processing problem; namely, they segment the lower mouth area into three classes of interest, i.e., lip, inner mouth, and skin regions, by means of an artificial neural network (ANN) classifier (an appearance-based statistical model). The authors propose an automatic way of acquiring the labeled data needed to train the appearance model. Lip shape contours are estimated from aligned pairs of audio-visual sequences that contain the same phonetic context and whose alignment was calculated using dynamic time warping on the audio channel. The first sequence of each pair contains lips that are marked up with blue lipstick, and thus are easy to segment. The lip-shape model is built from the easily extracted contours of such sequences. The second sequence of each pair contains unmarked lips, whose segmentation is difficult, requiring both lip shape estimation and localization. The audio-based alignment comes to the rescue, as it provides the lip-shape model parameters, thus reducing the problem to that of contour localization alone. The proposed method is reported to give excellent segmentation results, and it saves the significant effort associated with hand labeling of data.
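The alignment step can be pictured with a minimal dynamic time warping sketch. It assumes scalar audio features per frame and uses toy string labels in place of real lip-shape model parameters; the function name, the absolute-difference distance, and the data are all hypothetical and are not taken from the paper.

```python
def dtw_path(a, b, dist=lambda x, y: abs(x - y)):
    """Align non-empty sequences a and b with classic DTW.

    Returns (total_cost, path), where path is a list of (i, j) index
    pairs running from (0, 0) to (len(a) - 1, len(b) - 1).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j - 1],  # advance both
                                    cost[i - 1][j],      # advance a only
                                    cost[i][j - 1])      # advance b only

    # Backtrack from the end to recover the frame-to-frame alignment.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(moves)
    path.reverse()
    return cost[n][m], path


# Toy audio features for the same phrase, spoken once with marked lips
# and once (slightly slower) with unmarked lips.
marked_audio = [0.0, 1.0, 2.0, 1.0, 0.0]
unmarked_audio = [0.0, 1.0, 1.0, 2.0, 1.0, 0.0]
# Stand-ins for lip-shape parameters extracted from the marked frames.
marked_shapes = ["closed", "mid", "open", "mid", "closed"]

total_cost, path = dtw_path(marked_audio, unmarked_audio)

# Carry each marked frame's shape over to the unmarked frame(s) it aligns to.
transferred = {}
for i, j in path:
    transferred[j] = marked_shapes[i]
```

Once every unmarked frame has a transferred shape, only the position of that shape in the image remains to be found, which is the reduction to contour localization described above.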

Aleksic and his co-authors explore novel visual speech features based on MPEG-4 facial animation parameters (FAPs). They describe an automatic method to extract the FAPs by combining active contour and template algorithms, and demonstrate that these visual speech features can be used in conjunction with audio to improve medium-vocabulary speech recognition in white Gaussian noise. Since MPEG-4 is a multimedia standard that deals with generic coding of audio-visual objects for multimedia applications, this work is an important step towards developing audio-visual speech recognition using standardized features.

Zhang et al. propose novel algorithms for both face segmentation and visual feature extraction. In particular, they employ Markov random fields for estimating the lip contours, and subsequently augment traditional lip-contour-based features with new visual features that capture tongue and teeth visibility. They demonstrate that the additional features help visual-only speech recognition. Finally, they report improved performance of both speech and speaker recognition on two standard audio-visual databases when the visual modality is introduced in addition to the traditional audio input.

The last paper of this cluster, by Gordan et al., concentrates on the issue of visual speech classification. Instead of employing the popular hidden Markov model (HMM) based recognizer, the authors propose a hybrid classification architecture that uses HMMs in conjunction with a parallel network of binary support vector machine (SVM) classifiers, the output of which provides posterior probabilities of the speech classes of interest (here, visemes). The SVMs operate on the pixel values of the mouth region of interest. Results are reported on a four-digit recognition task, and are compared to other visual feature extraction and classification methods.

Last but not least, one of the most exciting research topics in joint audio-visual speech processing is the integration (fusion) of the two speech-informative inputs. In this special issue, two papers concentrate on this topic.

Heckmann et al. use a hybrid HMM/ANN architecture for audio-visual speech recognition, and integrate the audio and visual classifiers at the likelihood (decision) level, in a ``state synchronous'' fashion. They first consider various schemes to weight the posterior likelihoods of the audio-only and visual-only ANNs with appropriate stream exponents (``weights''), and they conclude that their multiplicative combination, which respects their class-conditional independence, is superior. Subsequently, they concentrate on the adaptive estimation of the combination weights, based on the reliability of each stream of information, as captured by three criteria. Results are reported on a single-speaker, connected-digits database.
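A multiplicative combination with stream exponents can be sketched as follows. This is a minimal illustration, assuming a single audio weight in [0, 1] with the visual weight as its complement and a final renormalization; the function name, the weight constraint, and the toy posteriors are assumptions of this example rather than details from the paper.

```python
def fuse_posteriors(audio_post, video_post, audio_weight):
    """Combine per-class posteriors as P_a^w * P_v^(1 - w), renormalized.

    audio_post and video_post map the same class labels to positive
    posterior probabilities. The constraint that the two exponents sum
    to one is an assumption made for this sketch.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    video_weight = 1.0 - audio_weight
    scores = {c: (audio_post[c] ** audio_weight) * (video_post[c] ** video_weight)
              for c in audio_post}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Toy case: noisy audio mildly prefers "nine", the lips clearly say "one".
audio_post = {"one": 0.4, "nine": 0.6}
video_post = {"one": 0.9, "nine": 0.1}

equal_fusion = fuse_posteriors(audio_post, video_post, audio_weight=0.5)
audio_only = fuse_posteriors(audio_post, video_post, audio_weight=1.0)
best_equal = max(equal_fusion, key=equal_fusion.get)
best_audio = max(audio_only, key=audio_only.get)
```

With equal weights the confident visual stream overrides the uncertain audio stream, while an audio weight of 1.0 reduces to the audio-only decision. An adaptive scheme would move the weight between such extremes according to how reliable each stream currently appears.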

Nefian et al. take a different approach and concentrate on ``state-asynchronous'' architectures for audio-visual fusion by means of HMMs. They consider integration at both the feature level and the likelihood (decision) level, using a multitude of Bayesian network models, such as the product HMM, the factorial HMM, and the coupled HMM. They present iterative algorithms for obtaining maximum likelihood estimates of the model parameters, as well as their initial estimates. They also compare the complexity of the models, in terms of both number of parameters and recognition time. Finally, they report the models' performance on a speaker-independent audio-visual speech recognition task of isolated words.
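To give a feel for what state asynchrony means, the sketch below builds the composite state space of two small HMM chains and fills in transition probabilities for the simplest case, in which the audio and visual chains are assumed to evolve independently. This is only an illustration: the models in the paper are richer (a coupled HMM, for instance, lets each chain's transition depend on both chains' previous states), and the function name and transition matrices here are hypothetical.

```python
from itertools import product


def product_hmm_transitions(audio_trans, video_trans):
    """Transitions over composite (audio_state, video_state) pairs.

    Assumes the two chains move independently, so each composite
    transition probability is the product of the per-stream ones.
    """
    n_a, n_v = len(audio_trans), len(video_trans)
    states = list(product(range(n_a), range(n_v)))
    trans = {}
    for (i, j) in states:
        for (k, l) in states:
            trans[(i, j), (k, l)] = audio_trans[i][k] * video_trans[j][l]
    return states, trans


# Two-state left-to-right chains with made-up probabilities.
audio_trans = [[0.7, 0.3],
               [0.0, 1.0]]
video_trans = [[0.6, 0.4],
               [0.0, 1.0]]

states, trans = product_hmm_transitions(audio_trans, video_trans)

# Probability that the audio chain advances while the visual chain lags.
audio_leads = trans[(0, 0), (1, 0)]
```

Composite states such as (1, 0), where the audio chain has moved on but the visual chain has not, are exactly what a state-synchronous model forbids. The cost is a state space that grows as the product of the two chains' sizes, which is why model complexity becomes a point of comparison.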

As a final note, we would like to express our thanks to all the contributing authors and reviewers for this special issue. We believe that this is a very exciting area of research, and hope that these papers will further stimulate work on joint audio-visual signal and speech processing.

Editors: Chalapathy Neti, Gerasimos Potamianos, Juergen Luettin, Eric Vatikiotis-Bateson

 