Summary and Discussion:

In this chapter, we provide an overview of the basic techniques for automatic
recognition of audio-visual speech proposed in the literature over
the past twenty years. Two main issues are relevant to the design
of audio-visual ASR systems: first, the visual front end that
captures visual speech information and, second, the integration
(fusion) of audio and visual features into the automatic speech
recognizer. Both are challenging problems, and significant
research effort has been directed towards finding appropriate solutions.

We first discuss extracting visual features from the video of the speaker's
face. The process first requires the detection and tracking of the
face, mouth region, and possibly the speaker's lip contours. A number
of mostly statistical techniques suitable for the task are reviewed.
Various visual features proposed in the literature are then presented.
Some are based on the mouth region appearance and employ image transforms
or other dimensionality reduction techniques borrowed from the pattern
recognition literature, in order to extract relevant speech information.
Others capture the lip contour and possibly face shape characteristics
by means of statistical or geometric models. Combinations of features
from these two categories are also possible.

Subsequently, we concentrate on the problem of audio-visual integration. Possible
solutions to it differ in various aspects, including the classifier
and classes used for automatic speech recognition, the combination
of single-modality features vs. single-modality classification decisions,
and, in the latter case, the information level provided by each classifier,
the temporal level of the integration, and the order in which such
decisions are combined. We concentrate on HMM-based recognition with
sub-phonetic classes and, assuming time-synchronous audio and
visual feature generation, review a number of feature and decision
fusion techniques. Within the first category, we discuss simple
feature concatenation, discriminant feature fusion, and a linear
audio feature enhancement approach. For decision-based integration,
we concentrate on linear log-likelihood combination of parallel,
single-modality classifiers at various levels of integration, considering
the state-synchronous multi-stream HMM for "early" fusion, the
product HMM for "intermediate" fusion, and discriminative model
combination for "late" integration, and we discuss training the
resulting models.
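
To make the distinction between feature fusion and decision fusion concrete, the following is a minimal sketch for a single frame and a single HMM state. It is an illustration under stated assumptions, not the implementation used in the systems reviewed here: the function names, the diagonal-covariance Gaussian emission model, and the constraint that the two stream weights sum to one are all assumptions made for the example.

```python
import math


def diag_gauss_loglik(x, mean, var):
    """Log-density of x under a diagonal-covariance Gaussian (assumed model)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def concat_features(audio_vec, visual_vec):
    """Feature fusion: concatenate time-synchronous audio and visual vectors."""
    return list(audio_vec) + list(visual_vec)


def multistream_loglik(loglik_audio, loglik_visual, audio_weight):
    """Decision fusion: linear combination of per-stream log-likelihoods.

    audio_weight is a hypothetical stream weight in [0, 1]; the visual
    stream receives 1 - audio_weight (an assumed sum-to-one constraint).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * loglik_audio + (1.0 - audio_weight) * loglik_visual


# Toy usage with made-up numbers: one audio frame and one visual frame.
audio_frame, visual_frame = [0.2, -0.1], [1.0]
ll_audio = diag_gauss_loglik(audio_frame, [0.0, 0.0], [1.0, 1.0])
ll_visual = diag_gauss_loglik(visual_frame, [0.5], [2.0])
fused_score = multistream_loglik(ll_audio, ll_visual, audio_weight=0.7)
fused_vector = concat_features(audio_frame, visual_frame)
```

With a weight of 1.0 the combined score reduces to the audio-only score, and lowering the weight shifts trust towards the visual stream, which is the intended behavior when the acoustic signal is degraded. Concatenation, by contrast, leaves the relative influence of the two modalities to the classifier trained on the joint vector.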

Developing and benchmarking feature extraction and fusion algorithms requires
available audio-visual data. A limited number of corpora suitable
for research in audio-visual ASR have been collected and used in
the literature. A brief overview of them is also provided, followed
by a description of the IBM ViaVoice database, suitable for speaker-independent
audio-visual ASR in the large-vocabulary, continuous speech domain.
Subsequently, a number of experimental results are reported using
this database, as well as additional corpora recently collected
at IBM. Some of these experiments were conducted during the summer
2000 workshop at the Johns Hopkins University and compared both
visual feature extraction and audio-visual fusion methods for
large-vocabulary continuous speech recognition (LVCSR).
More recent experiments, as well as a case study of speaker adaptation
techniques for audio-visual recognition of impaired speech, are also
presented. These experiments show that a visual front end can be
designed that successfully captures speaker-independent, large-vocabulary
continuous speech information. Such a visual front end uses discrete
cosine transform coefficients of the detected mouth region of interest,
suitably post-processed. Combining the resulting visual features
with traditional acoustic ones yields significant improvements
over audio-only recognition in both clean and, of course, degraded
acoustic conditions, across small- and large-vocabulary tasks, as
well as for both normal and impaired speech. Successful combination
techniques are the multi-stream HMM-based decision fusion approach
and the simpler, but inferior, discriminant feature fusion (HiLDA)
method.
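
As a rough illustration of an appearance-based front end of this kind, the sketch below computes a two-dimensional discrete cosine transform of a small grayscale mouth region and keeps a few low-frequency coefficients as the feature vector. It is only a sketch: the function names are hypothetical, keeping the top-left block of coefficients is an assumed selection rule, and the post-processing mentioned above is deliberately left out.

```python
import math


def dct2_1d(signal):
    """Orthonormal 1-D DCT-II of a sequence of numbers."""
    n = len(signal)
    out = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct2_2d(image):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct2_1d(row) for row in image]
    cols = [dct2_1d(col) for col in zip(*rows)]
    # cols[v][u] holds coefficient (u, v); transpose back to row-major order.
    return [list(r) for r in zip(*cols)]


def low_frequency_features(roi, block=2):
    """Keep the top-left block x block coefficients (assumed selection rule)."""
    coeffs = dct2_2d(roi)
    return [coeffs[u][v] for u in range(block) for v in range(block)]


# Toy usage: a made-up 4 x 4 region of pixel intensities.
mouth_roi = [
    [10, 12, 12, 10],
    [20, 40, 40, 20],
    [20, 40, 40, 20],
    [10, 12, 12, 10],
]
features = low_frequency_features(mouth_roi, block=2)
```

The pure-Python transform is written for clarity rather than speed; a practical front end would apply a fast DCT routine to a larger region extracted by the face and mouth tracker.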

The chapter demonstrates that, over the past twenty years, much
progress has been made in capturing and integrating visual
speech information into automatic speech recognition. However, the
visual modality has yet to be utilized in mainstream ASR systems,
because issues of both a practical and a research
nature remain challenging. On the practical side, the
high quality of captured visual data, which is necessary for extracting
visual speech information capable of enhancing ASR performance,
introduces increased cost, storage, and computer processing requirements.
In addition, the lack of common, large audio-visual corpora that
address a wide variety of ASR tasks, conditions, and environments
hinders development of audio-visual systems suitable for use in
particular applications.

On the research side, the key issues in the design of audio-visual ASR
systems remain open and subject to more investigation. In the visual
front end design, for example, face detection, facial feature localization,
and face shape tracking that are robust to speaker, pose, lighting, and
environment variation constitute challenging problems. A comprehensive
comparison between face appearance and shape based features for
speaker-dependent vs. speaker-independent automatic speechreading
is also unavailable. Joint shape and appearance three-dimensional
face modeling, used for both tracking and visual feature extraction,
has not been considered in the literature, although such an approach
could possibly lead to the desired robustness and generality of
the visual front end. In addition, when combining audio and visual
information, a number of issues relevant to decision fusion require
further study, such as the optimal level of integrating the audio
and visual log-likelihoods, the optimal function for this integration,
as well as the inclusion of suitable, local estimates of the reliability
of each modality into this function.

Further investigation of these issues is clearly warranted, and it is expected
to lead to improved robustness and performance of audio-visual ASR.
Progress in addressing some or all of these questions can also benefit
other areas where joint audio and visual speech processing is suitable,
such as speaker identification and verification, visual text-to-speech,
speech event detection, video indexing and retrieval, speech enhancement,
coding, signal separation, and speaker localization. Improvements
in these areas will result in more robust and natural human-computer
interaction.