IBM Research
User Interface Technologies

C. Neti, G. Potamianos, et al. wrote the editorial for the special issue "Joint Audio-Visual Speech Processing" of the EURASIP Journal on Applied Signal Processing, in press, November 2003.

Summary and Discussion:

In this chapter, we provide an overview of the basic techniques for automatic recognition of audio-visual speech proposed in the literature over the past twenty years. Two main issues govern the design of audio-visual automatic speech recognition (ASR) systems: first, the visual front end that captures visual speech information and, second, the integration (fusion) of audio and visual features into the automatic speech recognizer. Both are challenging problems, and significant research effort has been directed towards finding appropriate solutions.

We first discuss extracting visual features from video of the speaker's face. The process first requires detecting and tracking the face, the mouth region, and possibly the speaker's lip contours. A number of mostly statistical techniques suitable for the task are reviewed. Various visual features proposed in the literature are then presented. Some are based on the appearance of the mouth region and employ image transforms or other dimensionality reduction techniques borrowed from the pattern recognition literature to extract relevant speech information. Others capture the lip contour, and possibly face shape characteristics, by means of statistical or geometric models. Combinations of features from these two categories are also possible.
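As an illustration of the shape-based category, the sketch below derives three simple geometric measurements from a tracked outer lip contour. It is a minimal sketch under stated assumptions, not a method taken from the chapter: the function name `lip_geometry`, the choice of width, height, and area as features, and the representation of the contour as an ordered list of (x, y) pixel coordinates are all hypothetical.

```python
def lip_geometry(contour):
    """Return width, height, and enclosed area of a closed lip contour.

    `contour` is assumed (hypothetically) to be a list of (x, y) points
    ordered around the outer lip, as a lip tracker might provide.
    """
    if len(contour) < 3:
        raise ValueError("a closed contour needs at least three points")

    xs = [x for x, _ in contour]
    ys = [y for _, y in contour]
    width = max(xs) - min(xs)    # horizontal mouth opening
    height = max(ys) - min(ys)   # vertical mouth opening

    # Shoelace formula: pair each point with its successor, wrapping around.
    successors = contour[1:] + contour[:1]
    signed_twice_area = sum(
        x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(contour, successors)
    )
    area = 0.5 * abs(signed_twice_area)

    return {"width": width, "height": height, "area": area}


# Toy diamond-shaped contour standing in for a tracked lip outline.
toy_contour = [(0, 0), (2, 1), (4, 0), (2, -1)]
features = lip_geometry(toy_contour)
```

In practice such measurements would be computed for every video frame, giving one short feature vector per frame.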

Subsequently, we concentrate on the problem of audio-visual integration. Possible solutions differ in several respects: the classifier and classes used for automatic speech recognition; whether single-modality features or single-modality classification decisions are combined; and, in the latter case, the information level provided by each classifier, the temporal level of the integration, and the sequence of such decision combination. We concentrate on hidden Markov model (HMM) based recognition with sub-phonetic classes and, assuming time-synchronous audio and visual feature generation, review a number of feature and decision fusion techniques. Within the first category, we discuss simple feature concatenation, discriminant feature fusion, and a linear audio feature enhancement approach. For decision-based integration, we concentrate on linear log-likelihood combination of parallel, single-modality classifiers at various levels of integration: the state-synchronous multi-stream HMM for "early" fusion, the product HMM for "intermediate" fusion, and discriminative model combination for "late" integration. We also discuss training the resulting models.
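The linear log-likelihood combination mentioned above can be illustrated for a single time frame. The sketch below is a simplified illustration rather than the chapter's implementation: the class labels, the toy log-likelihood values, the function names, and the convention that the two stream weights are non-negative and sum to one are all assumptions made for the example.

```python
def fused_log_scores(audio_ll, visual_ll, audio_weight):
    """Linearly combine per-class log-likelihoods from the two streams.

    score(c) = w_a * log p(o_audio | c) + w_v * log p(o_visual | c),
    with w_v = 1 - w_a (an assumed, commonly used constraint).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if audio_ll.keys() != visual_ll.keys():
        raise ValueError("both streams must score the same classes")
    visual_weight = 1.0 - audio_weight
    return {
        c: audio_weight * audio_ll[c] + visual_weight * visual_ll[c]
        for c in audio_ll
    }


def best_class(scores):
    """Return the class with the highest combined score."""
    return max(scores, key=scores.get)


# Hypothetical frame: noisy audio slightly prefers "n", while the
# visual stream (closed lips) strongly prefers "m".
audio_ll = {"m": -4.0, "n": -3.5}
visual_ll = {"m": -1.0, "n": -6.0}

audio_only = best_class(fused_log_scores(audio_ll, visual_ll, 1.0))   # "n"
audio_visual = best_class(fused_log_scores(audio_ll, visual_ll, 0.7))  # "m"
```

Setting the audio weight to 1.0 reduces the rule to audio-only classification, which makes the effect of the visual stream easy to inspect. In a multi-stream HMM the same weighted sum would be applied to the state emission log-likelihoods at every frame.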

Developing and benchmarking feature extraction and fusion algorithms requires audio-visual data. Only a limited number of corpora suitable for research in audio-visual ASR have been collected and used in the literature. A brief overview of them is provided, followed by a description of the IBM ViaVoice database, which is suitable for speaker-independent audio-visual ASR in the large-vocabulary continuous speech recognition (LVCSR) domain. We then report a number of experimental results on this database, as well as on additional corpora recently collected at IBM. Some of these experiments were conducted during the summer 2000 workshop at the Johns Hopkins University and compared both visual feature extraction and audio-visual fusion methods for LVCSR. More recent experiments, as well as a case study of speaker adaptation techniques for audio-visual recognition of impaired speech, are also presented. These experiments show that a visual front end can be designed that successfully captures speaker-independent, large-vocabulary continuous speech information. Such a front end uses suitably post-processed discrete cosine transform coefficients of the detected mouth region of interest. Combining the resulting visual features with traditional acoustic ones yields significant improvements over audio-only recognition in both clean and, of course, degraded acoustic conditions, across small- and large-vocabulary tasks, and for both normal and impaired speech. Successful combination techniques are multi-stream HMM-based decision fusion and the simpler, but inferior, discriminant feature fusion (HiLDA) method.
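To make the discrete cosine transform front end concrete, the sketch below computes an orthonormal two-dimensional DCT-II of a small grayscale mouth patch and keeps a block of low-frequency coefficients as the feature vector. This is only an illustration under assumptions: the patch size, the rule of keeping the top-left `size` by `size` block, and the function names are hypothetical, and the post-processing mentioned above is not shown.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a sequence of numbers."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(
            values[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(image):
    """Separable 2-D DCT-II: transform every row, then every column."""
    row_transformed = [dct_1d(row) for row in image]
    col_transformed = [dct_1d(list(col)) for col in zip(*row_transformed)]
    # col_transformed holds transformed columns; transpose back to rows.
    return [list(row) for row in zip(*col_transformed)]


def low_frequency_features(image, size):
    """Keep the top-left size x size block of DCT coefficients.

    This selection rule is an assumption for the example; other
    selection schemes are possible.
    """
    coeffs = dct_2d(image)
    return [coeffs[u][v] for u in range(size) for v in range(size)]


# Hypothetical 4x4 grayscale mouth region of interest.
toy_roi = [
    [10.0, 12.0, 12.0, 10.0],
    [20.0, 60.0, 60.0, 20.0],
    [20.0, 60.0, 60.0, 20.0],
    [10.0, 12.0, 12.0, 10.0],
]
visual_features = low_frequency_features(toy_roi, 2)
```

A production system would more likely call an optimized DCT routine from a numerical library; the explicit loops are used here only to keep the example self-contained.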

The chapter clearly demonstrates that, over the past twenty years, much progress has been made in capturing visual speech information and integrating it into automatic speech recognition. However, the visual modality has yet to be adopted in mainstream ASR systems, because issues of both a practical and a research nature remain challenging. On the practical side, the high-quality visual data needed to extract visual speech information capable of enhancing ASR performance increases cost, storage, and computer processing requirements. In addition, the lack of common, large audio-visual corpora that address a wide variety of ASR tasks, conditions, and environments hinders the development of audio-visual systems suitable for use in particular applications.

On the research side, the key issues in the design of audio-visual ASR systems remain open and subject to further investigation. In visual front end design, for example, face detection, facial feature localization, and face shape tracking that are robust to speaker, pose, lighting, and environment variation constitute challenging problems. A comprehensive comparison between appearance-based and shape-based features for speaker-dependent vs. speaker-independent automatic speechreading is also unavailable. Joint shape and appearance three-dimensional face modeling, used for both tracking and visual feature extraction, has not been considered in the literature, although such an approach could possibly lead to the desired robustness and generality of the visual front end. In addition, when combining audio and visual information, a number of issues relevant to decision fusion require further study, such as the optimal level at which to integrate the audio and visual log-likelihoods, the optimal function for this integration, and the inclusion of suitable local estimates of the reliability of each modality in this function.

Further investigation of these issues is clearly warranted, and it is expected to lead to improved robustness and performance of audio-visual ASR. Progress in addressing some or all of these questions can also benefit other areas where joint audio and visual speech processing is suitable, such as speaker identification and verification, visual text-to-speech, speech event detection, video indexing and retrieval, speech enhancement, coding, signal separation, and speaker localization. Improvements in these areas will result in more robust and natural human-computer interaction.
