G. Potamianos, C. Neti, J. Luettin, and I. Matthews, ``Audio-visual
automatic speech recognition: An overview,'' To appear in: Audio-Visual
Speech Processing, E. Vatikiotis-Bateson, G. Bailly, and P. Perrier
(Eds.), MIT Press, pp. 121-148, 2003.
MIT Press book chapter on "audio-visual speech recognition".
EDITORIAL:
Computing is becoming increasingly ubiquitous and pervasive, with multiple
devices and modes of interaction. Traditional modes of human-computer
interaction based on a keyboard and a mouse are being replaced by
more natural modes such as speech, touch, and gesture. This ongoing
transformation towards pervasive and ubiquitous computing creates
the need for the next generation of human-computer interfaces (HCI)
that are easy to use, transparent, and robust in a variety of environments.
It is important not only to improve the recognition and understanding
of the individual modes (speech, gesture, etc.) in a variety of
environments, but also to develop technologies and architectures
that choose the most appropriate mode or modes to understand the
content and the context of the interaction. Human cognition is an
excellent guide for developing next-generation HCI, since it depends
on highly developed abilities to perceive, integrate, and interpret
visual, auditory, and touch information. In spite of dramatically
varying perceptual conditions, the human ability to organize and
intelligently combine sensory data derived from multiple sensors,
modulated by perceptual relevance and sensory confidence, is crucial
for building a robust model of objects and events in our environment.

Although significant progress has been made in developing the individual
modes of HCI based on audio (e.g., speech and speaker recognition)
or visual input (e.g., face localization and person identification),
only recently has there been an increased interest in exploiting
the human capacity to process joint audio and visual information
to improve systems for recognizing human activity (e.g., speech
recognition, speech activity, speaker change, etc.), intent (e.g.,
speech intent), and identity (e.g., speaker recognition) in real-world
pervasive computing environments. Indeed, over the past twenty
or so years, joint audio and video signal processing has become
an active field of research, attracting considerable attention in
the signal processing community as a means to improve the robustness
and naturalness of human-computer interfaces in a variety of domains.
To realize the full benefit of joint processing in any of these
application areas, several technical challenges have to be addressed,
most notably the appropriate integration of multiple modalities.
This special issue highlights innovative research in joint audio
and video signal processing and its application to a variety of
HCI-critical areas, namely source localization, speech source separation
in multi-speaker environments, and speech and speaker recognition.

In the first paper, Zotkin et al. address the problem of joint audio-visual
source localization using particle filters. They present a technique
for tracking people by integrating audio cues captured using microphone
arrays and visual cues captured using camera arrays. The multi-modal
tracking problem is formulated in terms of particle filters. The
authors also show that the approach can be used for self-calibration,
or to deal with situations where the sensors are moving, or where
people are partially occluded.
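
As a rough illustration of how a particle filter can fuse audio and visual cues, the following minimal sketch tracks a one-dimensional position from two noisy measurements. It is not the method of Zotkin et al.: the random-walk motion model, the Gaussian likelihoods, the noise levels, and every function and parameter name here are assumptions chosen only to keep the example short. A missing video measurement (`None`) stands in for an occluded target, in which case only the audio cue is used.

```python
import math
import random


def gaussian_likelihood(z, x, sigma):
    """Unnormalised Gaussian likelihood of measurement z given state x."""
    return math.exp(-0.5 * ((z - x) / sigma) ** 2)


def particle_filter_step(particles, z_audio, z_video, rng,
                         motion_sigma=0.1, audio_sigma=0.5, video_sigma=0.1):
    """One predict / weight / resample cycle (all noise levels are assumed)."""
    # Predict: move every particle with an assumed random-walk motion model.
    predicted = [p + rng.gauss(0.0, motion_sigma) for p in particles]

    # Weight: multiply the likelihoods of the available cues.
    weights = []
    for p in predicted:
        w = gaussian_likelihood(z_audio, p, audio_sigma)
        if z_video is not None:  # video cue is skipped while occluded
            w *= gaussian_likelihood(z_video, p, video_sigma)
        weights.append(w)

    # Guard against numerical underflow of every weight.
    if sum(weights) == 0.0:
        weights = [1.0] * len(predicted)

    # Resample: draw particles in proportion to their weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


# Simulated scenario: a slowly drifting target, with the video cue
# unavailable for steps 10 to 14.
rng = random.Random(0)
particles = [rng.uniform(0.0, 10.0) for _ in range(500)]
true_pos = 3.0
for step in range(30):
    true_pos += 0.05
    z_audio = true_pos + rng.gauss(0.0, 0.5)
    z_video = None if 10 <= step < 15 else true_pos + rng.gauss(0.0, 0.1)
    particles = particle_filter_step(particles, z_audio, z_video, rng)

estimate = sum(particles) / len(particles)
print(f"true position {true_pos:.2f}, estimate {estimate:.2f}")
```

In this sketch the more precise video cue dominates the estimate whenever it is present, while the audio cue keeps the particle cloud from drifting during the simulated occlusion.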

The subsequent papers shift the focus to audio-visual speech. Of particular
interest and importance to developing HCI applications is the fact
that visual information from the speaker's lower face both supplements
and complements information provided by the traditional audio speech
signal.

Sodoyer and his co-authors describe a novel approach to separating speech
sources using the audio-visual coherence of the speech stimuli.
In contrast to the classical blind source separation techniques,
the authors explore the problem of extracting a source from an additive
mixture of speech sources. Using the visual lip tracking information
of the desired speech source to extract its spectral envelope, the
authors show that this approach compares favorably with independent
component analysis.

Jiang et al. provide the first rigorous application of techniques that
correlate speech behavior to various visual aspects of speech. They
use multi-linear techniques and measures of vocal tract articulator
motion, face motion, and speech acoustics during production of sentences
and nonsense syllables to characterize redundant properties of visible
and audible speech behavior and the role of the vocal tract as a
common source of phonetic information in both modalities. They show
that the phonetic impact (e.g., of ``place of articulation'') on
the degree of correspondence between audible and visible events
is not uniform. The study also shows that speaker-specific differences
in the strength of the cross-domain correlation in production do
not necessarily match perceived differences in speaker intelligibility.

An important requirement for conducting research in joint audio-visual signal
and speech processing is the availability of suitable audio-visual
speech corpora. Patterson et al. describe an audio-visual speech
database, CUAVE, that is compact in its organization and accessibility
to other researchers and, at the same time, comprehensive enough
to be useful in addressing a variety of methodological and research
questions, especially those having to do with automatic detection
and measurement of facial features and motion under varying position
and orientation. The database consists of more than 7000 utterances
(a total of 36 male and female speakers produced digits in isolation
and in connected strings). In the latter part of the paper, the
authors demonstrate the challenge and utility of the CUAVE database
by extracting lip contours from moving and still faces and estimating
a speaker-independent baseline for visual-only speech recognition,
using various visual speech feature extraction techniques.

To allow joint audio-visual speech processing, reliable extraction of visual
speech features is required. The problem consists of face detection,
mouth region localization, and possibly lip (or, in general, face)
contour estimation, followed by extraction of visual features that
are informative about the uttered speech. These topics constitute
the subject of the next cluster of papers in this special issue.

Daubias and Deleglise consider the first aspect of the visual speech processing
problem; namely, they segment the lower mouth area into three classes
of interest, i.e., lip, inner mouth, and skin regions, by means
of an artificial neural network (ANN) classifier (appearance-based
statistical model). The authors propose an automatic way of acquiring
the labeled data needed to train the appearance model. Lip shape
contours are estimated from aligned pairs of audio-visual sequences
that contain the same phonetic context and whose alignment was calculated
using dynamic time warping on the audio channel. The first sequence
of each pair contains lips that are marked up with blue lipstick and
are thus easy to segment. The lip-shape model is built from
the easily extracted contours of such sequences. The second sequence
of each pair contains unmarked lips, whose segmentation is
difficult, requiring both lip shape estimation and localization.
The audio-based alignment comes to the rescue, as it provides the
lip-shape model parameters, thus reducing the problem to that of
contour localization alone. The proposed method is reported to give
excellent segmentation results and to save the significant effort
associated with hand-labeling data.
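
The alignment step described above can be sketched with textbook dynamic time warping. The code below is a generic illustration, not the authors' implementation: the scalar "audio features", the absolute-difference distance, and the helper names `dtw_path` and `transfer_labels` are all assumptions. It aligns two feature sequences and then copies per-frame labels (standing in for lip-shape parameters) from the marked sequence to the unmarked one.

```python
def dtw_path(seq_a, seq_b, dist=lambda x, y: abs(x - y)):
    """Return (total_cost, path) of the optimal DTW alignment.

    path is a list of (index_in_a, index_in_b) pairs from start to end.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j]: best cumulative cost aligning seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both
                                 cost[i - 1][j],      # advance seq_a only
                                 cost[i][j - 1])      # advance seq_b only

    # Backtrack from the end to recover the alignment path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


def transfer_labels(path, labels_a, len_b):
    """Give each frame of sequence B the label of its aligned frame in A."""
    labels_b = [None] * len_b
    for idx_a, idx_b in path:
        labels_b[idx_b] = labels_a[idx_a]
    return labels_b


# Toy audio features: the second sequence is a time-stretched copy of the first.
marked_audio = [0.0, 1.0, 2.0, 1.0]
unmarked_audio = [0.0, 0.0, 1.0, 2.0, 2.0, 1.0]
total_cost, path = dtw_path(marked_audio, unmarked_audio)

# Hypothetical per-frame lip-shape labels from the marked sequence.
marked_shapes = ["s0", "s1", "s2", "s3"]
unmarked_shapes = transfer_labels(path, marked_shapes, len(unmarked_audio))
print(total_cost, unmarked_shapes)
```

Real audio features would be vectors per frame, so `dist` would be replaced by a vector distance; the recursion and backtracking stay the same.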

Aleksic and his co-authors explore novel visual speech features based on
MPEG-4 facial animation parameters (FAPs). They describe an automatic
method to extract the FAPs by combining active contour and template
algorithms and demonstrate that these visual speech features can
be used in conjunction with audio to improve medium-vocabulary speech
recognition in white Gaussian noise. Since MPEG-4 is a multimedia
standard that deals with generic coding of audio-visual objects
for multimedia applications, this work is an important step towards
developing audio-visual speech recognition using standardized features.

Zhang et al. propose novel algorithms for both face segmentation and visual
feature extraction. In particular, they employ Markov random fields
for estimating the lip contours, and subsequently augment traditional
lip-contour-based features with new visual features that capture
tongue and teeth visibility. They demonstrate that the additional
features help visual-only speech recognition. Finally, they report
improved performance of both speech and speaker recognition on two
standard audio-visual databases, by introducing the visual modality
in addition to the traditional audio input.

The last paper of this cluster, by Gordan et al., concentrates on the
issue of visual speech classification. Instead of employing the
popular hidden Markov model (HMM)-based recognizer, the authors
propose a hybrid classification architecture that uses HMMs in conjunction
with a parallel network of binary support vector machine (SVM) classifiers,
the output of which provides posterior probabilities of the speech
classes of interest (here, visemes). The SVMs operate on the pixel
values of the mouth region of interest. Results are reported on
a four-digit recognition task, and are compared to other visual
feature extraction and classification methods.

Last but not least, one of the most exciting research topics in joint
audio-visual speech processing is the subject of integration (fusion)
of the two speech-informative inputs. In this special issue, two
papers concentrate on this topic.

Heckmann et al. use a hybrid HMM/ANN architecture for audio-visual speech
recognition, and integrate the audio and visual classifiers at the
likelihood (decision) level, in a ``state synchronous'' fashion.
They first consider various schemes to weight the posterior likelihoods
of the audio and visual-only ANNs with appropriate stream exponents
(``weights''), and they conclude that their multiplicative combination,
which respects their class-conditional independence, is superior.
Subsequently, they concentrate on the adaptive estimation of the
combination weights, based on the reliability of each stream of
information, as captured by three criteria. Results are reported
on a single-speaker, connected digits database.
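
A multiplicative combination with stream exponents can be sketched as follows. This is a generic illustration rather than the scheme of Heckmann et al.: it assumes the two exponents sum to one (audio weight `lam`, visual weight `1 - lam`), and the class labels, probabilities, and function name are made up for the example. Under these assumptions the fused score for class c is proportional to P(c | audio)^lam * P(c | video)^(1 - lam).

```python
import math


def fuse_posteriors(audio_post, visual_post, lam):
    """Weighted multiplicative fusion of two per-class posterior dicts.

    lam is the audio stream exponent in [0, 1]; the visual stream
    receives 1 - lam (the sum-to-one constraint is an assumption).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    eps = 1e-12  # floor that avoids log(0)

    # Work in the log domain: exponents become multiplicative weights.
    log_scores = {
        cls: lam * math.log(max(audio_post[cls], eps))
        + (1.0 - lam) * math.log(max(visual_post[cls], eps))
        for cls in audio_post
    }

    # Renormalise so that the fused scores sum to one.
    peak = max(log_scores.values())
    total = sum(math.exp(s - peak) for s in log_scores.values())
    return {cls: math.exp(s - peak) / total for cls, s in log_scores.items()}


# Hypothetical posteriors for three digit classes in a noisy audio condition.
audio = {"one": 0.5, "two": 0.3, "three": 0.2}
visual = {"one": 0.2, "two": 0.7, "three": 0.1}

# A low audio weight expresses low confidence in the audio stream.
fused = fuse_posteriors(audio, visual, lam=0.3)
best = max(fused, key=fused.get)
print(best, fused)
```

Setting `lam` to 1.0 recovers the audio-only decision and 0.0 the visual-only one, which is why adapting the weight to the estimated reliability of each stream matters.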

Nefian et al. take a different approach and concentrate on ``state-asynchronous''
architectures for audio-visual fusion, by means of HMMs. They consider
integration at both the feature and the likelihood (decision)
levels, using a multitude of Bayesian network models, such as the
product HMM, the factorial HMM, and the coupled HMM. They present
iterative algorithms for obtaining maximum likelihood estimates
of the model parameters, as well as their initial estimates. They
also compare the complexity of the models, both in terms of number
of parameters and recognition time. Finally, they report their performance
on a speaker-independent audio-visual speech recognition task of
isolated words.

As a final note, we would like to express our thanks to all the contributing
authors and reviewers for this special issue. We believe that this
is a very exciting area of research, and hope that these papers
will further stimulate work on joint audio-visual signal and speech
processing.

Editors: Chalapathy Neti, Gerasimos Potamianos, Juergen Luettin, and
Eric Vatikiotis-Bateson