Audio Visual Speech Technologies

Audio Visual Speech Recognition

This project explores the use of visual information in speech recognition systems.

Although machine transcription of large-vocabulary continuous speech (LVCSR) has progressed significantly over the last few years, the technology is still most effective only under controlled conditions: low noise, speaker-dependent recognition, and read (as opposed to conversational) speech.

The potential for joint audio-visual speech recognition is well established in the literature on the basis of psychophysical experiments. Canonical mouth shapes that accompany speech utterances have been categorized, and are known as visual phonemes or "visemes". Visemes provide information that complements the phonetic stream from the point of view of confusability. For example, "m" and "n", which are confusable acoustically, especially in noisy conditions, are easy to distinguish visually: in "m" the lips close at onset, whereas in "n" they do not. The unvoiced fricatives "f" and "s", which are difficult to recognize acoustically, belong to two different viseme groups.
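
As a rough illustration of this complementarity, the sketch below groups a few consonants into viseme classes and checks whether two phonemes would look different on the lips. The grouping, the class names and the helper `visually_distinct` are assumptions made for this example only; they are not the viseme inventory used in the project.

```python
# Illustrative phoneme-to-viseme grouping (an assumption for this sketch,
# not the viseme set used in the project).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",     # lips close
    "f": "labiodental", "v": "labiodental",                # lip touches teeth
    "t": "alveolar", "d": "alveolar", "n": "alveolar",     # lips stay open
    "s": "alveolar_fricative", "z": "alveolar_fricative",
}


def visually_distinct(phoneme_a, phoneme_b):
    """Return True if the two phonemes fall into different viseme classes."""
    return PHONEME_TO_VISEME[phoneme_a] != PHONEME_TO_VISEME[phoneme_b]


if __name__ == "__main__":
    for pair in [("m", "n"), ("f", "s"), ("p", "b")]:
        print(pair, visually_distinct(*pair))
```

Under this grouping, "m"/"n" and "f"/"s" are visually distinct, while "p"/"b" share a viseme class and would still have to be separated acoustically.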

Efforts to use visual information for automatic speech recognition have begun only recently. Most of them have been limited to small-vocabulary tasks (e.g., letters, digits, commands) and often to speaker-dependent training or isolated-word speech, where word boundaries are artificially well defined.

In this project, we are investigating the problem of combining visual cues with audio signals for the purpose of improved automatic machine recognition of large-vocabulary, continuous speech using sub-phonetic statistical models. We are interested in demonstrating meaningful improvements for realistic tasks such as broadcast news transcription, large vocabulary dictation and speech reading for the hearing/speech impaired.

Key Research Areas:

  • Face detection and tracking; facial feature location
  • Facial feature representation for visual speech
  • Fusion of audio and visual representations of speech
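
One simple way to picture the fusion problem is a weighted combination of per-class log-likelihoods from the two streams. The sketch below assumes that setup; the function names, the weight value and the scores are invented for illustration and do not describe the project's actual fusion method or results.

```python
import math


def fuse_log_likelihoods(audio_ll, visual_ll, audio_weight):
    """Weighted sum of per-class log-likelihoods; the two weights sum to 1."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return {
        label: audio_weight * audio_ll[label] + visual_weight * visual_ll[label]
        for label in audio_ll
    }


def best_class(scores):
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Made-up scores: the noisy audio barely prefers "n",
    # while the closed lips in the video strongly suggest "m".
    audio_ll = {"m": math.log(0.48), "n": math.log(0.52)}
    visual_ll = {"m": math.log(0.90), "n": math.log(0.10)}

    fused = fuse_log_likelihoods(audio_ll, visual_ll, audio_weight=0.5)
    print("audio only:", best_class(audio_ll))
    print("fused:     ", best_class(fused))
```

With these toy numbers the audio stream alone picks "n", and the fused score picks "m". Setting `audio_weight` to 1.0 reduces the fused decision to the audio-only one, which is why choosing the weights according to how reliable each stream is matters in noisy conditions.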

Papers:

G. Potamianos, C. Neti, J. Luettin, and I. Matthews, Audio-Visual Automatic Speech Recognition: An Overview. In: Issues in Visual and Audio-Visual Speech Processing, G. Bailly, E. Vatikiotis-Bateson, and P. Perrier (Eds.), MIT Press (In Press), 2004.


G. Potamianos, C. Neti, G. Gravier, and A. Garg, Automatic Recognition of audio-visual speech: Recent progress and challenges, Proceedings of the IEEE, vol. 91, no. 9, Sep. 2003.


G. Potamianos, C. Neti, and S. Deligne, Joint audio-visual speech processing for recognition and enhancement, Proc. Work. Audio-Visual Speech Process., pp. 95-104, St. Jorioz, France, Sep. 2003.


J. Huang, G. Potamianos, and C. Neti, Improving audio-visual speech recognition with an infrared headset, Proc. Work. Audio-Visual Speech Process., pp. 175-178, St. Jorioz, France, Sep. 2003.


G. Potamianos and C. Neti, Audio-visual speech recognition in challenging environments, Proc. Eur. Conf. Speech Comm. Tech., pp. 1293-1296, Geneva, Sep. 2003.


J.H. Connell, N. Haas, E. Marcheret, C. Neti, G. Potamianos, and S. Velipasalar, A real-time prototype for small-vocabulary audio-visual ASR, Proc. Int. Conf. Multimedia Expo., vol. II, pp. 469-472, Baltimore, July 2003.


A. Garg, G. Potamianos, C. Neti, and T.S. Huang, Frame-dependent multi-stream reliability indicators for audio-visual speech recognition, Proc. Int. Conf. Acoust. Speech Signal Process., vol. I, pp. 24-27, Hong Kong, Apr. 2003.


G. Gravier, G. Potamianos, and C. Neti, Asynchrony modeling for audio-visual speech recognition, Proc. Human Language Technology Conference, San Diego, 2002.


G. Gravier, S. Axelrod, G. Potamianos, and C. Neti, Maximum entropy and MCE based HMM stream weight estimation for audio-visual ASR, Proc. Int. Conf. Acoust. Speech Signal Process., Orlando, 2002.


R. Goecke, G. Potamianos, and C. Neti, Noisy audio feature enhancement using audio-visual speech data, Proc. Int. Conf. Acoust. Speech Signal Process., Orlando, 2002.


G. Potamianos, C. Neti, G. Iyengar, and E. Helmuth, Large-vocabulary audio-visual speech recognition by machines and humans, Proc. Eurospeech, Aalborg, 2001.


G. Potamianos and C. Neti, Automatic speechreading of impaired speech, Proc. Work. Audio-Visual Speech Process., Scheelsminde, 2001.


G. Potamianos and C. Neti, Improved ROI and within frame discriminant features for lipreading, Proc. Int. Conf. Image Process., Thessaloniki, 2001.


C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, and D. Vergyri, Large-vocabulary audio-visual speech recognition: A summary of the Johns Hopkins Summer 2000 Workshop, Proc. IEEE Work. Multimedia Signal Process., Cannes, 2001.


G. Iyengar and C. Neti, Detection of faces under shadows and lighting variations, Cannes, 2001.


G. Iyengar, G. Potamianos, C. Neti, T. Faruquie, and A. Verma, Robust detection of visual ROI for automatic speechreading, Proc. IEEE Work. Multimedia Signal Process., Cannes, 2001.


I. Matthews, G. Potamianos, C. Neti, and J. Luettin, A comparison of model and transform-based visual features for audio-visual LVCSR, Proc. IEEE Int. Conf. Multimedia Expo., Tokyo, 2001.


G. Potamianos, C. Neti, G. Iyengar, A.W. Senior, and A. Verma, A cascade visual front end for speaker independent automatic speechreading, Int. J. Speech Technology, Vol. 4, pp. 193-208, 2001.


G. Potamianos, J. Luettin, C. Neti. Hierarchical discriminant features for audio-visual LVCSR, ICASSP, Salt Lake City, May 2001.


J. Luettin, G. Potamianos, C. Neti. Asynchronous stream modeling for large-vocabulary audio-visual speech recognition, ICASSP, Salt Lake City, May 2001.


H. Glotin, D. Vergyri, C. Neti, G. Potamianos, J. Luettin. Weighting schemes for audio-visual fusion in speech recognition, ICASSP, Salt Lake City, May 2001.


G. Potamianos, C. Neti. Stream confidence estimation for audio-visual speech recognition, ICSLP, vol III, pp. 746-749, Beijing, October 2000.


C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, and J. Zhou, Audio-Visual Speech Recognition, Final Workshop 2000 Report, Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD (Oct. 12, 2000).


G. Potamianos, A. Verma, C. Neti, G. Iyengar. A cascade image transform for speaker independent automatic speechreading. International Conference on Multimedia and Expo, vol. II, pp. 1097-1100, New York, July-August 2000.


Ashish Verma, Tanveer Faruquie, C. Neti, Sankar Basu, Andrew Senior. Late Integration in Continuous Audio-Visual Speech Recognition, ASRU, Colorado, 1999.


S. Basu, C. Neti, N. Rajput, A. Senior, L. Subramaniam, A. Verma. Audio-visual large-vocabulary continuous speech recognition in the broadcast news domain, IEEE Multimedia Signal Processing Conference (MMSP99), Denmark, Sept. 1999.


S. Basu, E. E. Jan, Mark Lucente and Chalapathy Neti. Beyond Audio-based speech recognition, 1998 NIST/DARPA Workshop on SmartSpaces, Gaithersburg, MD, 1998.


A.W. Senior. Face and Feature Finding for a Face Recognition System. Audio and Video based Biometric Person Authentication '99, Washington D.C., March 22-24, 1999.

Demo:

Transcription Experiment

This demo illustrates how people use visual information to better understand speech delivered in environments with high levels of background noise. Please contact us with your reactions to the experience of attempting to understand the spoken content.