Audio Visual Speech Technologies

General Overview

Our ability to organize and intelligently combine sensory data from multiple sensors, weighted by perceptual relevance and sensory confidence, is crucial for building a robust model of objects and events in our environment in spite of dramatically varying perceptual conditions. Our group's goal is to exploit this human perceptual principle of sensory integration (currently, the joint use of audio and visual information) to improve the recognition of human activity (e.g. speech recognition, speech activity, speaker change), intent (e.g. speech intent) and identity (e.g. speaker recognition), particularly when the audio is degraded by noise and channel effects, and to improve the analysis and mining of multimedia content.
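To make the idea of confidence-weighted integration concrete, here is a minimal sketch of one common way to combine an audio stream and a visual stream: a weighted sum of per-class log-likelihoods, where the weight reflects how much the audio is trusted. This is an illustration only, not a description of our systems. The function names, the single `audio_weight` parameter, and the example probabilities are all assumptions made for this sketch.

```python
import math


def fuse_log_likelihoods(audio_ll, visual_ll, audio_weight):
    """Combine per-class log-likelihoods from two streams.

    audio_weight is a hypothetical confidence in the audio stream,
    between 0.0 and 1.0; the visual stream gets the remainder.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must be between 0.0 and 1.0")
    if audio_ll.keys() != visual_ll.keys():
        raise ValueError("both streams must score the same classes")
    visual_weight = 1.0 - audio_weight
    return {
        label: audio_weight * audio_ll[label] + visual_weight * visual_ll[label]
        for label in audio_ll
    }


def best_class(scores):
    """Return the label with the highest fused score."""
    return max(scores, key=scores.get)


# Made-up example: the audio stream favours "da", the visual stream favours "ba".
audio = {"ba": math.log(0.3), "da": math.log(0.7)}
visual = {"ba": math.log(0.8), "da": math.log(0.2)}

# Clean audio: trust the audio stream heavily.
clean_decision = best_class(fuse_log_likelihoods(audio, visual, audio_weight=0.9))

# Noisy audio: shift the weight towards the visual stream.
noisy_decision = best_class(fuse_log_likelihoods(audio, visual, audio_weight=0.3))
```

In this sketch the decision follows the audio stream when its weight is high and the visual stream when its weight is low. How the weight should be estimated from the actual acoustic conditions is left open here.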

Applications of this work include (but are not limited to) accurate transcription of human activity for improved human information interfaces, multimedia content mining and meeting transcription. The links in the left-side menu lead to more detailed information about the different areas we are exploring. The project is currently managed by Roberto Sicconi.

Members:

Collaborators:

Summer Interns:

  • Jintao Jiang (HLT 2003)
  • Ashutosh Garg (HLT 2002)
  • Roland Goecke (HLT 2001)
  • Atulya Velivelli (HLT 2001)

Past Contributors:

UPDATES: Final Workshop 2000 Report, JHU