

Audio-Visual Speech Recognition


This project explores the use of visual information in speech recognition systems.

Although significant progress has been made in machine transcription of large-vocabulary continuous speech (LVCSR) over the last few years, the technology to date is most effective only under controlled conditions: low noise, speaker-dependent recognition, and read (as opposed to conversational) speech.

The potential for joint audio-visual speech recognition is well established on the basis of psychophysical experiments {Summerfeld79}. Canonical mouth shapes that accompany speech utterances have been categorized and are known as visual phonemes or "visemes" {Stork96}. Visemes provide information that complements the phonetic stream from the point of view of confusability. For example, "mi" and "ni", which are confusable acoustically, especially in noisy conditions, are easy to distinguish visually: in "mi" the lips close at onset, whereas in "ni" they do not. Similarly, the unvoiced fricatives "f" and "s", which are difficult to recognize acoustically, belong to two different viseme groups {Stork96}.
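The complementarity described above can be illustrated with a small lookup table. This is a minimal sketch, not the project's viseme inventory: the class names and the grouping in `VISEME_OF` are assumptions chosen for illustration, and published viseme sets differ in detail.

```python
# Illustrative phoneme-to-viseme grouping. The class names and membership
# are assumptions for this sketch; real viseme inventories vary.
VISEME_OF = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",      # lips close
    "f": "labiodental", "v": "labiodental",                 # lip touches teeth
    "t": "alveolar_stop", "d": "alveolar_stop", "n": "alveolar_stop",
    "s": "alveolar_fricative", "z": "alveolar_fricative",
}


def visually_distinct(phone_a: str, phone_b: str) -> bool:
    """Return True when the two phones fall in different viseme classes."""
    return VISEME_OF[phone_a] != VISEME_OF[phone_b]


# Acoustically confusable pairs that the visual channel can separate.
m_vs_n = visually_distinct("m", "n")
f_vs_s = visually_distinct("f", "s")
# A pair that looks the same on the lips, so vision adds little.
p_vs_b = visually_distinct("p", "b")
```

Under this assumed grouping, "m"/"n" and "f"/"s" land in different classes, while "p"/"b" share one, which is the sense in which the two streams complement each other.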

Efforts to use visual information for automatic speech recognition began only recently, with experiments on small-vocabulary letter or digit recognition tasks (see, e.g., {Bregler94,Potamianos98,Stiefelhagen97}). Most of these efforts have been limited to small vocabularies (e.g., commands, digits), and often to speaker-dependent training or isolated-word speech, where word boundaries are artificially well defined.

In this project, we are investigating the problem of combining visual cues with audio signals to improve automatic machine recognition of large-vocabulary continuous speech using sub-phonetic statistical models.
We are interested in demonstrating meaningful improvements for realistic tasks such as broadcast news transcription, large-vocabulary dictation, and speechreading for the hearing/speech impaired.

Key Research Areas:

  • Face detection and tracking; facial feature location
  • Facial feature representation for visual speech
  • Fusion of audio and visual representations of speech
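
As one illustration of the fusion item, the sketch below shows a generic late-integration scheme: each stream scores the candidate words separately, and the per-stream log-likelihoods are combined with a stream weight. It is a minimal example under stated assumptions, not the method used in this project. The function name `late_fusion`, the weight of 0.7, and all likelihood values are hypothetical.

```python
import math


def late_fusion(audio_loglik, visual_loglik, audio_weight=0.7):
    """Weighted sum of per-candidate log-likelihoods from two streams.

    audio_weight is a hypothetical stream weight in [0, 1]; the visual
    stream receives the remainder.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return {
        word: audio_weight * audio_loglik[word]
        + visual_weight * visual_loglik[word]
        for word in audio_loglik
    }


# Made-up scores: in noise the audio stream barely separates "mi" from "ni",
# while the visual stream strongly favors "mi" (lips close at onset).
audio = {"mi": math.log(0.48), "ni": math.log(0.52)}
visual = {"mi": math.log(0.90), "ni": math.log(0.10)}

audio_only_best = max(audio, key=audio.get)
fused = late_fusion(audio, visual, audio_weight=0.7)
fused_best = max(fused, key=fused.get)
```

With these invented numbers the audio stream alone slightly prefers "ni", while the fused score prefers "mi". In practice the stream weight would have to be tuned, for instance as a function of the acoustic noise level.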

Papers:

  • C. Neti, G. Potamianos, J. Luettin, I. Matthews, D. Vergyri, J. Sison, A. Mashari, and J. Zhou, "Audio-Visual Speech Recognition", Final Workshop 2000 Report, Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD (Oct. 12, 2000).
  • G. Potamianos, A. Verma, C. Neti, G. Iyengar. A cascade image transform for speaker independent automatic speechreading. International Conference on Multimedia and Expo, vol. II, pp. 1097-1100, New York, July-August 2000.
  • Ashish Verma, Tanveer Faruquie, C. Neti, Sankar Basu, Andrew Senior. Late Integration in Continuous Audio-Visual Speech Recognition, ASRU, Colorado, 1999.
  • S. Basu, C. Neti, N. Rajput, A. Senior, L. Subramaniam, A. Verma. Audio-Visual large-vocabulary continuous speech recognition in the broadcast news domain, IEEE Multimedia Signal Processing Conference (MMSP99), Denmark, Sept. 1999.
  • S. Basu, E. E. Jan, Mark Lucente and Chalapathy Neti. Beyond Audio-based speech recognition, 1998 NIST/DARPA Workshop on SmartSpaces, Gaithersburg, MD, 1998.
  • A. W. Senior. Face and Feature Finding for a Face Recognition System. Audio and Video based Biometric Person Authentication '99, Washington D.C., March 22-24, 1999.

