|
Face recognition has recently attracted increasing attention
and is beginning to be applied in a variety of domains,
predominantly for security.
We have instead developed a face recognition system for
video indexing, with the joint purpose of labeling faces
in the video, and identifying speakers.
The face recognition
system can also supplement acoustic speaker identification
when the speaker's face is shown, allowing the speakers to be indexed
and the correct speaker-dependent model to be selected for speech transcription.
The first problem to be solved before attempting face recognition is to find the face in the image. Face finding makes face recognition independent of translation, scale and rotation, and provides good initial constraints on the location of facial features. The first stage of the process is color segmentation, which simply tests whether the proportion of skin-tone pixels in a region exceeds a threshold. Candidate regions are then scored with a Fisher Linear Discriminant (FLD), which is built by comparing a large number of face and non-face patches. Candidates are also scored on Distance From Face Space (DFFS), a measure of how much they look like one of a large number of face patches used in training. All candidate regions exceeding a combined threshold are considered to be faces, after applying constraints such as requiring that no two faces overlap.
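The two simplest scores in this stage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the skin-tone rule `simple_skin_rule` is a hypothetical crude color test, and DFFS is assumed here to be the usual eigenspace residual (the norm of what remains after projecting a patch onto a PCA basis learned from face patches). The function names, the PCA dimensionality and any thresholds are not taken from the system described above.

```python
import numpy as np

def simple_skin_rule(rgb):
    """Hypothetical crude skin test: red clearly dominates green and blue."""
    r, g, b = (rgb[..., i].astype(float) for i in range(3))
    return (r > 95) & (r > g) & (r > b) & (r - np.minimum(g, b) > 15)

def skin_fraction(rgb_patch, is_skin=simple_skin_rule):
    """Proportion of pixels in an (H, W, 3) patch classified as skin tone."""
    return float(np.mean(is_skin(rgb_patch)))

def fit_face_space(patches, n_components):
    """PCA on flattened training face patches (one patch per row)."""
    mean = patches.mean(axis=0)
    _, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, vt[:n_components]

def dffs(patch, mean, basis):
    """Distance from face space: residual norm after projection onto the basis."""
    centered = patch - mean
    coeffs = basis @ centered              # coordinates inside the face space
    residual = centered - basis.T @ coeffs  # component outside the face space
    return float(np.linalg.norm(residual))
```

A region would pass the first stage when `skin_fraction(region)` exceeds some chosen threshold; a low `dffs` value then indicates a patch that the face-patch subspace reconstructs well.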
Next, instead of searching for all the facial features directly in the face image, a few "high-level" features (eyes, nose, mouth) are first located, and then 26 "low-level" features (parts of the eyes, nose, mouth, eyebrows etc.) are located relative to the high-level feature locations. The approximate locations of the high-level features are known from statistics of mean and variance (relative to the nose position) gathered on a training database. The discriminant/DFFS templates are used to score each potential matching image patch for a given feature. Typically, an area spanning about 2 standard deviations around the mean location is searched. Within the search region, the location with the highest score is deemed to be the location of the feature (see above left).
All this is a prelude to the actual face recognition algorithm. For this work, a constellation of local patches has been used as the representation (see above right). We chose this local template approach, in contrast to global identity templates such as those used in Eigenface systems, because of its greater robustness to facial image changes caused by effects such as lighting, expression, or facial appearance change (glasses, beard, haircut etc.). Here, a simple Gabor jet model, similar to that used by Wiskott and von der Malsburg, describes the patches of the face corresponding to the 29 facial features found above. Each patch is represented by a feature vector of 40 complex elements, the responses of Gabor filters at 5 different scales and 8 different orientations, centered at the estimated feature location. Recognition can now be carried out frame-by-frame using a training set constructed from the jet coefficient statistics. For each face found in a sequence, its likelihood given each of the training set models is calculated, assuming the coefficients are Gaussian-distributed.
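As a rough illustration of the jet representation described above, the following Python sketch computes 40 complex Gabor responses (5 scales by 8 orientations) at one feature location. The kernel size, the wavelengths and the choice of envelope width are hypothetical values picked for the example, and no DC compensation is applied; none of these parameters come from the system itself.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    """Complex Gabor kernel: Gaussian envelope times a plane-wave carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)  # coordinate along the wave direction
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(1j * 2.0 * np.pi * xr / wavelength)
    return envelope * carrier

def gabor_jet(image, row, col, n_scales=5, n_orients=8, size=33):
    """Return n_scales * n_orients complex filter responses centered at (row, col)."""
    half = size // 2
    padded = np.pad(image.astype(float), half, mode="reflect")
    patch = padded[row:row + size, col:col + size]  # window centered on (row, col)
    jet = []
    for s in range(n_scales):
        wavelength = 4.0 * 2.0 ** (s / 2.0)  # hypothetical scale spacing
        sigma = wavelength                   # hypothetical envelope width
        for k in range(n_orients):
            theta = np.pi * k / n_orients
            kernel = gabor_kernel(size, wavelength, theta, sigma)
            jet.append(np.sum(patch * np.conj(kernel)))
    return np.array(jet)
```

Calling `gabor_jet` once per located feature would give 29 such vectors per face, one for each patch in the constellation.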
For a sequence of frames, the likelihoods are summed, and compared at the end of the sequence, taking the maximum likelihood training model as the correct answer. For speed of computation, diagonal covariance matrices are used. |
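The sequence-level decision can be sketched as below. This is a minimal example under assumptions: each frame is reduced to a real-valued feature vector (for instance jet magnitudes), each training model is a mean and a per-dimension variance, and the quantity accumulated over frames is the log-likelihood. The names `diag_gaussian_loglik` and `classify_sequence` are hypothetical.

```python
import numpy as np

def diag_gaussian_loglik(x, mean, var):
    """Log-likelihood of x under a Gaussian with diagonal covariance."""
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var))

def classify_sequence(frames, models):
    """Accumulate per-frame log-likelihoods and pick the best-scoring model.

    frames: iterable of 1-D feature vectors, one per frame.
    models: dict mapping a label to a (mean, variance) pair of 1-D arrays.
    """
    totals = {
        name: sum(diag_gaussian_loglik(f, mean, var) for f in frames)
        for name, (mean, var) in models.items()
    }
    best = max(totals, key=totals.get)
    return best, totals
```

With diagonal covariances each frame costs only a few vector operations per model, which is the speed advantage noted above; a full covariance matrix would require a matrix inverse and determinant for every model.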
| Contact: Andrew Senior | Last updated: 6/7/02 | ||