IEEE International Conference on Multimedia & Expo
Tokyo, Japan
Electronic Proceedings
© 2001 IEEE


Audio Driven Facial Animation for Audio-Visual Reality

T. A. Faruquie, A. Kapoor*, R. Kate*, N. Rajput, L.V. Subramaniam
IBM India Research Lab
Indian Institute of Technology
Hauz Khas, New Delhi 110016, India
91-11-6861100
{ftanveer,rnitendr,lvsubram}@in.ibm.com
http://www.research.ibm.com/irl 

*on summer training from Indian Institute of Technology, Delhi.

Abstract

In this paper, we demonstrate a morphing-based, automated, audio-driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and expression. Given an incoming audio stream and still pictures of a face speaking different visemes, an animation sequence is constructed using optical flow between visemes. Rules based on coarticulation and the duration of a viseme are formulated to control continuity in terms of the shape and extent of lip opening. In addition, new viseme-expression combinations are synthesized so that animations with new facial expressions can be generated. Finally, various applications of this system are discussed in the context of creating audio-visual reality.




Introduction

Humans communicate verbally using words and sentences. Humans also communicate non-verbally using expressions, gestures and prosody. The design and implementation of computer systems that cover the whole range of human-to-human-like interaction by using faces and voices is one of the challenging objectives of Human Computer Interaction research. In this work we look at the conversion of speech into visual information to create audio-visual reality. Given an incoming audio stream and pictures of a face representing different visemes, which are distinguishable lip shapes [7, pp. 394-395], an animation sequence is constructed. We have taken 12 face images corresponding to the 12 visemes. These viseme images are aligned, and the optical flows for the transitions between these visemes are computed (12 x 11 = 132 in total) and stored. At run time, for an incoming audio stream, a phoneme to viseme mapping identifies the corresponding video frame, and the transition frames between visemes are generated using the optical flows.

Morphing-based animation has been considered in the past [5]. In this paper we seek to extend it to include animation with expression. For a richer scope of animation it is necessary to be able to animate the face with appropriate expressions. In [3] it is assumed that there exists a video database of the head to be synthesized, in which the subject appears at least once in the expression to be synthesized. In our system, given visemes with two or more different facial expressions, a method is presented that can generate the remaining visemes with these facial expressions. The seven basic expressions considered are neutral, surprise, fear, disgust, anger, happiness and sadness. In Section 2 we present the audio-driven facial animation system model. In Section 3 the method of generating the animation is discussed. In Section 4 we discuss some applications and present an evaluation of the system in the context of creating audio-visual reality. Finally, conclusions are presented in Section 5.


System Model

The audio-driven facial animation system consists of the extraction module, the synthesis module and the background processing module.

Extraction of Viseme+Expression combinations
Figure 1. Extraction Module

Figure 1 shows the extraction module. For an incoming stream of synchronized audio+video, we first recognize the phoneme, then map this phoneme to its corresponding viseme, and take the corresponding video frame to represent this viseme. The expression recognition unit can be either audio based [2][9] or video based [4]. A short sentence like "The sharp quick brown fox jumped over the lazy dog." captures all 12 visemes.

Module for Computing Optical Flows
Figure 2. Background Processing Module

In the background processing module, shown in Figure 2, the extracted images are first corrected for small pose differences. It is possible that not all visemes have been extracted in all expressions, so this module then generates the complete set of viseme+expression combinations (e_n, v_m), where n = 1, ..., 7 and m = 1, ..., 12. Finally, optical flows between different visemes within an expression and between the expressions are computed and stored.

The Synthesis Module
Figure 3. Synthesis Module

Figure 3 shows the synthesis module. From an incoming audio stream, timing information, phoneme transitions and expressions are extracted. The phonemes are then mapped to the corresponding visemes. This mapping is shown in Table 1. The timing information and phoneme transitions can also be extracted for a novel language whose speech recognition engine is not available [6]. The audio-based expression recognition unit gives the expression; in our case, however, the expression maps have been provided explicitly. Together, the viseme+expression combination determines the frame to be used from the database, the timing information tells how long this viseme+expression lasts, and the phoneme transitions in turn give the viseme transitions. These viseme transitions are brought about using precomputed optical flows.

Phoneme               Viseme Number
a, h                  Viseme 1
e, i                  Viseme 2
l                     Viseme 3
r                     Viseme 4
o, u, w               Viseme 5
p, b, m               Viseme 6
g, k, d, n, t, y      Viseme 7
f, v                  Viseme 8
h, j, s, z            Viseme 9
sh, ch                Viseme 10
th                    Viseme 11
silence               Viseme 12
Table 1. Phoneme to Viseme Mapping Rule


Audio Visual Animation

In this section we show how the animation system proposed in this paper achieves a realistic interaction using faces and voices. The voice of the speaker is left unaltered.

Normalization of Images

The system waits for the first occurrence of a viseme+expression combination and extracts all possible combinations from the audio+video footage. The images so obtained may not be aligned. If these images are used for animation, the resulting sequence will have disturbing and unintended head motions. We therefore need to align the images. We use a method similar to [1] to normalize the images. There are two components of motion between the images: 3D rigid body motion and non-rigid motion. The rigid component is due to head rotation, translation etc., and the non-rigid component is due to changes in expression and lip shape. The face can be approximated as a single plane viewed under a perspective projection [8]. As a result it is possible to describe the optical flows by the following eight-parameter model:

$$u(x, y) = a_0 + a_1 x + a_2 y + p_0 x^2 + p_1 x y \tag{1}$$

$$v(x, y) = a_3 + a_4 x + a_5 y + p_0 x y + p_1 y^2 \tag{2}$$

Since non-rigid motions of facial features are not captured well by this model, we can use it to extract the 3D rigid body component of motion and to align the images. To estimate the parameters we use the approach suggested by Tsai and Huang [10] with modifications. Tsai and Huang's method is based on a perspective displacement field model, which is different from the kind of model we are using. The method is basically a least-squares fit over the image gradients, and we use Singular Value Decomposition to calculate the above parameters.
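
To make the least-squares step concrete, the following is a minimal sketch and not the modified Tsai and Huang procedure itself. It assumes the standard brightness-constancy constraint Ix*u + Iy*v + It = 0, substitutes equations (1) and (2) for u and v, and solves the resulting linear system for the eight parameters with NumPy's `numpy.linalg.lstsq`, which is SVD based. The function names and the argument layout are hypothetical.

```python
import numpy as np


def planar_flow(params, x, y):
    """Evaluate the eight-parameter model of equations (1) and (2)."""
    a0, a1, a2, a3, a4, a5, p0, p1 = params
    u = a0 + a1 * x + a2 * y + p0 * x**2 + p1 * x * y
    v = a3 + a4 * x + a5 * y + p0 * x * y + p1 * y**2
    return u, v


def fit_planar_motion(Ix, Iy, It, x, y):
    """Least-squares estimate of (a0..a5, p0, p1) from image gradients.

    Ix, Iy, It are spatial and temporal image derivatives and x, y the
    pixel coordinates, all arrays of the same shape. Assumes brightness
    constancy: Ix*u + Iy*v + It = 0 at every pixel.
    """
    Ix, Iy, It, x, y = (np.asarray(a, dtype=float).ravel()
                        for a in (Ix, Iy, It, x, y))
    # One row per pixel; columns multiply a0, a1, a2, a3, a4, a5, p0, p1.
    A = np.stack([
        Ix, Ix * x, Ix * y,
        Iy, Iy * x, Iy * y,
        Ix * x**2 + Iy * x * y,
        Ix * x * y + Iy * y**2,
    ], axis=1)
    params, *_ = np.linalg.lstsq(A, -It, rcond=None)
    return params
```

In practice such a fit would be restricted to face pixels and probably made robust to the non-rigid regions around the mouth and eyes; neither refinement is shown here.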

Given facial images I1 and I2, we first estimate the 3D rigid body motion component from I2 to I1. Next, we warp image I2 using this model so that it is aligned with I1 while retaining the viseme shape and expression of I2. Some images may show slight facial deformation because of the planar model assumed for the face under perspective projection. Given a set of images, we can align them with respect to a single image and repeat the whole process iteratively.


Lip Synchronization with Audio

The timing information is extracted from the incoming audio stream using the speech recognition unit. The lip movement synchronization and the extent of the morph are governed by this timing information. Given two normalized viseme images, intermediate frames are generated using optical flow based morphing techniques similar to [5]. Suppose the viseme transition between v1 and v2 occurs in time T. To generate a frame at time 0 < t < T, we warp the images using the optical flows. We calculate the optical flows from v1 to v2 (say OF1) and from v2 to v1 (say OF2). The viseme v1 is warped along OF1 and the viseme v2 along OF2. The two images obtained are cross-dissolved in a weighted sense to obtain the final image, which is the generated frame.
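
The warp and cross-dissolve step can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in our system: flows are arrays of shape (H, W, 2) holding (dx, dy) per pixel, the warp is a nearest-neighbour backward warp, and the dissolve weights are linear in t/T. The function names are hypothetical.

```python
import numpy as np


def warp(image, flow, alpha):
    """Warp `image` a fraction `alpha` of the way along `flow`.

    Simplification: each output pixel samples the input at
    (x - alpha*dx, y - alpha*dy), rounded to the nearest pixel and
    clamped to the image border.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs - alpha * flow[..., 0]), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(ys - alpha * flow[..., 1]), 0, h - 1).astype(int)
    return image[src_y, src_x]


def morph_frame(v1, v2, of1, of2, t, T):
    """Generate the frame at time t of a v1 -> v2 transition lasting T."""
    alpha = t / T
    from_v1 = warp(v1, of1, alpha)        # v1 pushed towards v2
    from_v2 = warp(v2, of2, 1.0 - alpha)  # v2 pushed back towards v1
    # Weighted cross-dissolve of the two warped images.
    return (1.0 - alpha) * from_v1 + alpha * from_v2
```

With this weighting the generated frame coincides with v1 at t = 0 and with v2 at t = T, so consecutive transitions join without a visible jump.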

We restrict the extent of the morph depending upon the viseme and the duration of the viseme transition. Figure 4 shows the rules used by our system. Consider a viseme transition from va to vb of duration Tc. If Tc < Th, where Th is a heuristically set threshold, we generate the morph only up to the fraction Tc/Th. There is one exception. Consider the following transition, from viseme vb to vc, of duration Tn. If Tn > Th, then viseme vb needs to be emphasized and hence the morph to vb should be complete. In this case we extend the duration of the transition va - vb and reduce the duration of the transition vb - vc by Q, where Q = min(Th - Tc, Tn - Th). If the transition vb - vc is long enough, viseme vb is therefore morphed completely from va. Further, the visemes that represent p, b, m and f, v have to be morphed completely because they involve lip closure or near closure. So if a transition occurs to any of these visemes, the morph is completed irrespective of the duration.
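
The rules above can be summarised in a few lines of code. The sketch below is one reading of them and is offered only as an illustration: the function name `morph_extent`, its return convention, and the use of the viseme numbers of Table 1 (6 for p, b, m and 8 for f, v) are assumptions of this example. Durations are in milliseconds.

```python
# Visemes that involve lip closure or near closure (numbers from Table 1).
FULL_CLOSURE_VISEMES = {6, 8}


def morph_extent(tc, tn, target_viseme, th=100.0):
    """Return (extent, tc_adjusted, tn_adjusted) for the transition va -> vb.

    tc: duration of the current transition va -> vb
    tn: duration of the next transition vb -> vc
    target_viseme: Table 1 number of vb
    th: heuristic threshold Th
    extent is the fraction of the full morph that is generated (0..1).
    """
    if tc < th and tn > th:
        # Borrow time from the long next transition so vb is emphasized.
        q = min(th - tc, tn - th)
        tc, tn = tc + q, tn - q
    if target_viseme in FULL_CLOSURE_VISEMES or tc >= th:
        return 1.0, tc, tn
    return tc / th, tc, tn
```

For example, with Th = 100 ms a 60 ms transition followed by a 200 ms one borrows Q = 40 ms and is morphed completely, whereas a 60 ms transition followed by an 80 ms one is morphed only to 0.6 of its full extent unless it ends in viseme 6 or 8.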

Suppose vb was not completely morphed. Then, to generate the morph to viseme vc, we cannot use the optical flow between vb and vc computed from the images in our database; we need the optical flow between the generated (and incomplete) viseme vb and vc. Since optical flow computation is too costly to perform in real time, we use the transitivity between the optical flows va - vb and vb - vc to calculate an approximate optical flow, which is used to generate the morph. Our system uses a threshold Th = 100 ms at 30 fps, i.e. about three frames.
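
One plausible way to write this transitivity, given here only as a sketch and under the assumption that the morph from va to vb stopped at a fraction \alpha of the way, is the following. Let F_{ab} and F_{bc} denote the stored flows va - vb and vb - vc (symbols introduced for this illustration). A pixel at position p in va sits at p + \alpha F_{ab}(p) in the incomplete image, would have reached p + F_{ab}(p) in vb, and is carried from there to vc by F_{bc}. The approximate flow \tilde{F} from the incomplete viseme to vc is then

$$\tilde{F}\big(p + \alpha F_{ab}(p)\big) \approx (1 - \alpha)\, F_{ab}(p) + F_{bc}\big(p + F_{ab}(p)\big)$$

For \alpha = 1 this reduces to the stored flow F_{bc}, and for \alpha = 0 to the composition of the two stored flows.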

Figure 4. Audio Synchronization


Facial Expression Synthesis

In the background processing module we complete the set of viseme+expression combinations. The central problem we solve is the following: given visemes v1 and v2 with facial expression e1 and viseme v1 with facial expression e2, how do we generate viseme v2 with facial expression e2? That is, given (e1 , v1), (e1 , v2) and (e2 , v1), we want to generate (e2 , v2). We exploit the similarity that is found in transitions between visemes for every facial expression. Here an important task is to appropriately insert the new facial features of viseme v2 (not present in v1) and to delete the facial features not present in viseme v2 (but present in v1). We employ optical flow techniques to accomplish all these tasks.

We accomplish this as follows (see Figure 5 below).

Figure 5. New Viseme-Expression Pair Generation

Find the correspondence of pixels from (e1 , v1) to (e1 , v2) and call it flow1; likewise find the correspondence from (e1 , v1) to (e2 , v1) and call it flow2. Now assign the velocity of every pixel in (e1 , v1) given by flow1 to the corresponding pixel of (e2 , v1) (found according to flow2). Call the optical flow of (e2 , v1) thus obtained flownew. Generate (e2 , v2) from (e2 , v1) using flownew.
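
The transfer of flow1 onto (e2 , v1) can be sketched as below. This is an illustrative simplification with assumptions stated up front: flows are arrays of shape (H, W, 2) holding (dx, dy) per pixel, correspondences are rounded to the nearest pixel, and pixels of (e2 , v1) that receive no velocity are simply left at zero and reported in a mask. The function name `transfer_flow` is hypothetical.

```python
import numpy as np


def transfer_flow(flow1, flow2):
    """Carry flow1, defined on (e1, v1), onto (e2, v1) along flow2.

    Returns (flow_new, filled), where flow_new lives on the pixel grid
    of (e2, v1) and filled marks the pixels that received a velocity.
    """
    h, w, _ = flow1.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each pixel of (e1, v1) lands in (e2, v1) according to flow2.
    tx = np.rint(xs + flow2[..., 0]).astype(int)
    ty = np.rint(ys + flow2[..., 1]).astype(int)
    inside = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)

    flow_new = np.zeros_like(flow1)
    filled = np.zeros((h, w), dtype=bool)
    # Put the flow1 velocity of each source pixel on its corresponding pixel.
    flow_new[ty[inside], tx[inside]] = flow1[ys[inside], xs[inside]]
    filled[ty[inside], tx[inside]] = True
    return flow_new, filled
```

Unfilled pixels would need to be interpolated from their neighbours before flow_new is used to warp (e2 , v1); that step is omitted here.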

Figure 6. Introducing New Features

To introduce the new features that appear in viseme v2 (see Figure 6), detect the facial features that appear in (e1 , v2) but were not there in (e1 , v1) using flow1. The pixels in (e1 , v2) which do not correspond to any pixel in (e1 , v1) stand for the new features. Find the correspondence of pixels from (e1 , v2) to (e1 , v1) and call this flow3. Carry the pixels (new features) found using flow1 to (e2 , v2) in the same way as the nearby corresponding pixels in (e1 , v1) go to (e2 , v1) according to flow2. These nearby corresponding pixels in (e1 , v1) are determined by the correspondence given by flow3 on the nearby pixels in (e1 , v2).

Figure 7. Suppressing Disappearing Features

To suppress the facial features disappearing in viseme v2 (see Figure 7), detect the features that are present in (e1 , v1) but which disappear in (e1 , v2) using flow3. The pixels in (e1 , v1) which do not correspond to any pixel in (e1 , v2) stand for the disappearing features. Find where these pixels go in (e2 , v1) using flow2. While constructing the new image from (e2 , v1) suppress these pixels. This way these features won't appear in the new image. Figure 8 and Figure 9 are examples of new viseme+expression combinations generated from the existing ones.
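
Both the appearing and the disappearing features are defined as pixels without a correspondence under a flow, so a single helper can illustrate their detection. The sketch below assumes flows of shape (H, W, 2) holding (dx, dy) per pixel and rounds correspondences to the nearest pixel; the name `unmatched_pixels` is hypothetical, and a practical detector would probably also need to tolerate small holes caused by rounding.

```python
import numpy as np


def unmatched_pixels(flow):
    """Mask of target-image pixels that no source pixel maps onto.

    `flow` is defined on the source image; a target pixel is unmatched
    if no source pixel, displaced by its flow vector, lands on it.
    """
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.rint(xs + flow[..., 0]).astype(int)
    ty = np.rint(ys + flow[..., 1]).astype(int)
    inside = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    hit = np.zeros((h, w), dtype=bool)
    hit[ty[inside], tx[inside]] = True
    return ~hit
```

Applied to flow1, the mask marks the pixels of (e1 , v2) that are new features; applied to flow3, it marks the pixels of (e1 , v1) that disappear.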

Figure 8. Existing Images and the Constructed Image with New Features Appearing

Figure 9. Existing Images and the Constructed Image with Disappearing Features


System Evaluation

Various application scenarios motivate audio-driven facial animation. These include bandwidth reduction for video teleconferencing, movie dubbing, user-interface agents and avatars, and multimedia telephones for hard of hearing people. Simple experiments have shown the value of the visual channel in speech comprehension [7], the McGurk effect being one example. In many scenarios the listener may be in a crowded and noisy environment. Vision adds redundancy to the signal and provides evidence of those cues that would be irreversibly masked by noise or hearing impairments [7]. The system was tested with one hearing-impaired person on a number of different sentences. The audio was left clean in all cases. It was found that the addition of video improved speech understanding by at least 50%.

This system is valuable where video has to be generated. Examples of such scenarios include:

Visual e-mail: At the receiving end the email is "read out" by the sender. The receiver mailbox activates the correct person, to read out the mail, by matching the address.

Newscast: In many cases involving a field reporter, the audio is available but due to various reasons, the corresponding video is not available. Usually a photograph of the person is shown on the TV screen along with the audio. Using the system presented here, a video of the person speaking can be generated and shown along with the audio. Vision directs the listener's attention and sustains interest.

Entertainment: Making people say things they normally would not. For example popular actors are made to say different things and "interact" with people.

Many other uses of this system can be thought of. A talking face has the advantage of directing the listener's attention and sustaining interest. An audio-visual reality is created if the animated face is able to hold human attention and successfully engage the person in useful conversation or task. To obtain feedback on the quality of the animation, clips were made and shown to a number of people. The feedback was very positive and in many cases, unless specifically mentioned, the animated clip passed off as an original. However, when many synthesized expression visemes are used in the animation, noticeable artifacts at the teeth and lips start appearing.


Conclusions

An automated system for creating an additional channel for communication is presented. From audio and a few images of a person, a facial animation with lip sync and appropriate expressions is generated. The animation looks realistic and individual variability is preserved. It is also possible to generate new lip shapes in expressions previously not seen by the system. For the future it would be worthwhile to consider including other features like correct gaze following, controlled pose variation, eyebrow movement and eye blinking in the animation system.

Bibliography

[1]
M. J. Black and Y. Yacoob, "Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion," Proceedings of the Fifth International Conference on Computer Vision, pp. 374-381, 1995.
[2]
J. F. Cohn and G. S. Katz, "Bimodal expression of emotion by face and voice," Proceedings of the Sixth ACM International Multimedia Conference on Face and Gesture Recognition and Their Applications, ACM Press, pp. 41-44, 1998.
[3]
E. Cosatto and H. P. Graf, "Photo-realistic talking-heads from image samples," IEEE Transactions on Multimedia, Vol. 2, No. 3, pp. 152-163, September 2000.
[4]
A. A. Essa and A. P. Pentland, "Coding, analysis, interpretation and recognition of facial expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, pp. 757-763, 1997.
[5]
T. Ezzat and T. Poggio, "Miketalk: A talking facial display based on morphing visemes," Proceedings of IEEE Computer Animation, Philadelphia PA, USA, 8-10 June, 1998.
[6]
T. A. Faruquie, C. Neti, N. Rajput, L. V. Subramaniam and A. Verma, "Translingual visual speech synthesis," IEEE International Conference on Multimedia and Exposition, 30 July - 02 Aug, 2000.
[7]
D. W. Massaro, Perceiving talking faces: From speech perception to behavioural principles, MIT Press, 1998.
[8]
F. I. Parke and K. Waters, Computer facial animation, Wellesley MA: A K Peters, 1996.
[9]
V. C. Tartter and D. Braun, "Hearing smiles and frowns in normal and whisper register," Journal of the Acoustical Society of America, 96, pp. 2101-2107, 1998.
[10]
R. Y. Tsai and T. S. Huang, "Estimating three-dimensional motion parameters of a rigid planar patch," IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 29, No. 6, pp. 1147-1152, 1981.