

Audio-Visual Speech Recognition
This project explores the use of visual information
in speech recognition systems.
Although significant progress has been made in
machine transcription of large vocabulary continuous speech (LVCSR) over
the last few years, the technology to date is effective mainly under
controlled conditions such as low noise, speaker-dependent recognition,
and read (as opposed to conversational) speech.
The potential for joint audio-visual speech
recognition is well established on the basis of psychophysical experiments
{Summerfeld79}. Canonical mouth shapes that accompany speech utterances
have been categorized, and are known as visual phonemes or ``visemes''
{Stork96}. Visemes provide information that complements the phonetic
stream from the point of view of confusability. For example, ``mi'' and
``ni'', which are confusable acoustically, especially in noisy conditions,
are easy to distinguish visually: in ``mi'' the lips close at onset, whereas
in ``ni'' they do not. Likewise, the unvoiced fricatives ``f'' and ``s'', which
are difficult to distinguish acoustically, belong to two different viseme
groups {Stork96}.
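The complementarity described above can be illustrated with a small sketch. The viseme table here is hypothetical and chosen only for illustration; it is not the inventory of {Stork96} or of this project, and real viseme sets differ in size and grouping. The function names are likewise assumptions.

```python
# Hypothetical phoneme-to-viseme table, for illustration only.
# Real viseme inventories group phonemes differently.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",    # lips close
    "f": "labiodental", "v": "labiodental",               # lip touches teeth
    "t": "alveolar", "d": "alveolar", "n": "alveolar",    # lips stay open
    "s": "alveolar_fricative", "z": "alveolar_fricative",
}


def visually_distinct(phoneme_a, phoneme_b):
    """Return True when the two phonemes fall into different viseme groups."""
    return PHONEME_TO_VISEME[phoneme_a] != PHONEME_TO_VISEME[phoneme_b]


def resolve(acoustic_candidates, observed_viseme):
    """Keep only the acoustic candidates consistent with the observed viseme."""
    return [p for p in acoustic_candidates
            if PHONEME_TO_VISEME[p] == observed_viseme]


# "m" and "n" are acoustically confusable, but a lip closure rules out "n".
candidates = ["m", "n"]
print(resolve(candidates, "bilabial"))
print(visually_distinct("f", "s"))
```

In this toy setting the visual stream acts as a hard filter; a practical recognizer would instead combine soft scores from both streams.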
Efforts to use visual information for automatic
speech recognition have begun only recently (see e.g., {Bregler94,Potamianos98,Stiefelhagen97}).
However, most of these efforts have been limited to small-vocabulary tasks (e.g.,
commands, letters, digits) and often to speaker-dependent training or isolated-word
speech, where word boundaries are artificially well defined.
In this project, we are investigating the problem
of combining visual cues with audio signals for the purpose of improved
automatic machine recognition of large-vocabulary, continuous speech using
sub-phonetic statistical models.
We are interested in demonstrating meaningful
improvements for realistic tasks such as broadcast news transcription,
large-vocabulary dictation, and speechreading for the hearing/speech impaired.
Key Research Areas:
Face detection and tracking; facial feature location
Facial feature representation for visual speech
Fusion of audio and visual representations of speech
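As one illustration of the fusion area listed above, the following is a minimal sketch of late (decision-level) integration, in which each stream scores the candidate labels independently and the log-likelihoods are combined with a stream weight. It is a generic textbook scheme under stated assumptions, not necessarily the method used in this project; the function names, the weight of 0.7, and the likelihood values are all hypothetical.

```python
import math


def late_fusion(audio_loglik, visual_loglik, audio_weight=0.7):
    """Combine per-label log-likelihoods from two streams.

    fused[label] = w * audio[label] + (1 - w) * visual[label]
    The weight w is an assumed tuning parameter; in practice it would be
    chosen according to how reliable the audio stream is (e.g. noise level).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return {
        label: audio_weight * audio_loglik[label]
        + (1.0 - audio_weight) * visual_loglik[label]
        for label in audio_loglik
    }


def best_label(scores):
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


# Made-up likelihoods: in noise the audio stream slightly prefers "ni",
# while the visual stream (lip closure seen) strongly prefers "mi".
audio = {"mi": math.log(0.45), "ni": math.log(0.55)}
visual = {"mi": math.log(0.90), "ni": math.log(0.10)}

print(best_label(audio))                       # audio-only decision
print(best_label(late_fusion(audio, visual)))  # audio-visual decision
```

Setting `audio_weight` to 1.0 recovers the audio-only decision, which makes the weight a convenient knob for studying how much the visual stream contributes at different noise levels.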
Papers:
C. Neti, G. Potamianos, J. Luettin, I. Matthews, D. Vergyri, J. Sison, A. Mashari, and J. Zhou,
"Audio-Visual Speech Recognition", Final Workshop 2000 Report,
Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD (Oct. 12, 2000).
G. Potamianos, A. Verma, C. Neti, G. Iyengar.
A cascade image transform for speaker independent automatic speechreading.
International Conference on Multimedia and Expo, vol. II, pp. 1097-1100,
New York, July-August 2000.
Ashish Verma, Tanveer Faruquie, C. Neti, Sankar Basu, Andrew Senior. Late Integration in Continuous Audio-Visual
Speech Recognition, ASRU, Colorado, 1999.
S. Basu, C. Neti, N. Rajput, A. Senior,
L. Subramaniam, A. Verma. Audio-Visual large-vocabulary continuous speech
recognition in the broadcast news domain, IEEE Multimedia Signal Processing
Conference (MMSP99), Denmark, Sept. 1999.
S. Basu, E. E. Jan, Mark Lucente and Chalapathy Neti. Beyond Audio-based
speech recognition, 1998 NIST/DARPA Workshop on SmartSpaces, Gaithersburg,
MD, 1998.
A. W. Senior. Face and Feature Finding for a Face Recognition System.
Audio and Video based Biometric Person Authentication '99, Washington
D.C., March 22-24, 1999.