Audio Visual Speech Technologies

Audio Visual Speaker Identification

 

Humans identify speakers from a variety of personal attributes, including acoustic cues, visual appearance cues and behavioral characteristics (such as characteristic gestures and lip movements). In the past, machine implementations of person identification have focused on a single technique: audio cues alone (speaker recognition), visual cues alone (face identification, iris identification) or other biometrics. More recently, researchers have begun combining multiple modalities for person identification. Speaker identification is an important technology for a variety of applications, including security and, more recently, indexing for search and retrieval of digitized multimedia content (for instance, in the MPEG-7 standard). The accuracy of audio-based speaker recognition under acoustically degraded conditions (such as background noise) and channel mismatch (such as telephone speech) still needs further improvement, and achieving that improvement is a hard problem. We have begun to investigate combining audio-based processing with visual processing for speaker recognition, with the goal of improving accuracy under acoustically degraded conditions in the broadcast news domain.

Key component technologies:

  • Face detection, tracking and identification
  • Audio-based speaker identification
  • Fusion techniques
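To illustrate what the fusion component does, here is a minimal sketch of one common approach, weighted score-level (late) fusion. It is not presented as the method used in this project. It assumes each modality already produces one score per candidate speaker, where higher means a better match. Scores are z-normalized within each modality so that their scales are comparable, then combined with a weight. The function names, the `audio_weight` parameter, the speaker labels and the example scores are all hypothetical.

```python
import statistics
from typing import Dict

Scores = Dict[str, float]  # hypothetical: candidate speaker -> match score (higher is better)


def z_normalize(scores: Scores) -> Scores:
    """Rescale one modality's scores to zero mean and unit variance across candidates."""
    values = list(scores.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid division by zero if all scores are equal
    return {spk: (s - mean) / std for spk, s in scores.items()}


def fuse_scores(audio: Scores, visual: Scores, audio_weight: float = 0.5) -> Scores:
    """Weighted linear (late) fusion of normalized audio and visual scores."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must be in [0, 1]")
    a, v = z_normalize(audio), z_normalize(visual)
    common = a.keys() & v.keys()  # only fuse speakers scored by both modalities
    return {spk: audio_weight * a[spk] + (1.0 - audio_weight) * v[spk] for spk in common}


def identify(audio: Scores, visual: Scores, audio_weight: float = 0.5) -> str:
    """Return the candidate speaker with the highest fused score."""
    fused = fuse_scores(audio, visual, audio_weight)
    return max(fused, key=fused.get)


# Hypothetical example: in noisy audio, two candidates score almost equally on audio,
# while the face track clearly favors one of them.
audio_scores = {"anchor_a": -41.0, "anchor_b": -40.5, "guest": -45.0}  # e.g. log-likelihoods
visual_scores = {"anchor_a": 0.92, "anchor_b": 0.30, "guest": 0.10}    # e.g. face similarity

if __name__ == "__main__":
    print(identify(audio_scores, visual_scores, audio_weight=1.0))  # audio only
    print(identify(audio_scores, visual_scores, audio_weight=0.3))  # audio down-weighted
```

With audio alone the sketch picks `anchor_b`. When the audio weight is lowered, as one might do under background noise, the visual evidence shifts the decision to `anchor_a`. In practice the weight would have to be tuned, or estimated from the acoustic conditions, on held-out data.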

Papers:

Publication List