Human-computer interaction will achieve the naturalness
of human-to-human communication when today's impersonal computer interface
is replaced by natural input interfaces using speech and gesture, combined
with visual agents that deliver the information supplied by the computer.
A computer should not only understand a person's natural language but also
respond in the same natural way. Visual speech synthesis can also compensate
for missing auditory information for hearing-impaired users, and it has
applications in movie dubbing, virtual avatars, distance learning and
low-bandwidth conferencing.
Researchers have tried various approaches to convert
acoustic speech to visual speech, including mapping phonemes to visemes,
vector quantization, direct estimation techniques and hidden Markov models (HMMs).
It is still a challenge to design facial animation models that make it easy
to control facial expression, gesture and emotion. Researchers have taken two
different approaches. One is based on 3-D wireframe models with detailed
descriptions of the motion of facial muscles and of articulators such as the
teeth and tongue (Massaro98). The other relies on image-based techniques such as
key framing and morphing.
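The first conversion approach listed above, mapping phonemes to visemes, can be illustrated with a small lookup table. This is only a sketch: the phoneme symbols, the viseme class names and the helper phonemes_to_visemes are hypothetical and are not the viseme set used by the group. It shows the many-to-one nature of the mapping, where several phonemes share one mouth shape.

```python
# Hypothetical many-to-one phoneme-to-viseme table (illustrative only;
# the symbols and class names are assumptions, not the project's set).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "aa": "open", "ae": "open",
    "uw": "rounded", "ow": "rounded",
    "sil": "neutral",
}


def phonemes_to_visemes(phonemes, default="neutral"):
    """Map a phoneme sequence to visemes, merging consecutive repeats."""
    visemes = []
    for ph in phonemes:
        # Unknown phonemes fall back to the default mouth shape.
        v = PHONEME_TO_VISEME.get(ph, default)
        # Consecutive phonemes with the same mouth shape need no transition.
        if not visemes or visemes[-1] != v:
            visemes.append(v)
    return visemes
```

For example, the sequence sil, m, aa, p, b collapses to four visemes, because p and b share the same bilabial mouth shape.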
We are currently exploring the feasibility of
animating a face given an incoming audio stream and pictures of a speaker
articulating different visemes and showing different expressions. The set of
visemes and expressions is fixed in advance. First, the face poses are
aligned to correct for small rotational and translational discrepancies
between the images. The optical flow is then computed for every transition between
the images. For an incoming audio signal, the corresponding viseme is identified,
and the transition from the previous viseme to the next is rendered along the
precomputed, stored optical flow.
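The transition step described above can be sketched as follows. This is a toy illustration under stated assumptions, not the group's implementation: images are small 2-D lists of gray values, the flow fields are assumed to have been computed offline by some optical flow method and stored in a dictionary keyed by (previous viseme, next viseme), and the names warp_along_flow and transition_frames are hypothetical. The warp uses backward sampling with nearest-neighbour lookup, which is exact only for uniform flow and an approximation otherwise.

```python
def warp_along_flow(image, flow, t):
    """Warp `image` a fraction t (0..1) of the way along a flow field.

    image: 2-D list of gray values.
    flow:  2-D list of (dx, dy) displacements, same size as image.
    Each output pixel samples the source at (x - t*dx, y - t*dy), using the
    flow stored at the output location (an approximation for smooth flow).
    Samples that fall outside the image are clamped to the border.
    """
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx = min(max(int(round(x - t * dx)), 0), w - 1)
            sy = min(max(int(round(y - t * dy)), 0), h - 1)
            out[y][x] = image[sy][sx]
    return out


def transition_frames(images, flows, prev_viseme, next_viseme, steps):
    """Return `steps` frames moving from prev_viseme toward next_viseme.

    images: dict viseme -> image; flows: dict (prev, next) -> stored flow.
    """
    src = images[prev_viseme]
    flow = flows[(prev_viseme, next_viseme)]  # looked up, not recomputed
    return [warp_along_flow(src, flow, k / steps) for k in range(1, steps + 1)]
```

Because the flow for every viseme pair is looked up rather than recomputed, the per-frame cost at run time is only the warp itself. A fuller version would typically also blend the warped source with the target image to hide warping artifacts.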
Key component technologies:
Acoustic Speech to Visual Speech conversion
Facial Animation
Team:
Multimedia Speech Recognition and Synthesis Group at India Research Lab