Audio Visual Speech Technology Group
Data
Broadcast News Video data: This data is provided by the LDC. The audio part is a subset of the data used in the standard speech recognition evaluation conducted by the DARPA community (known as the HUB4 effort). The speech database consists of large vocabulary (approximately 60,000 words) continuous speech drawn from a variety of news broadcasts. The entire database includes television (e.g. CNN, CSPAN) as well as radio shows. We focus on segments whose video part consists primarily of "talking head" images (e.g. an anchor-person in a CNN newscast). The audio-video data, available from the LDC in analog SVHS format, is digitized in MPEG2 format (at a rate of 5 Mb/sec) using IBM-developed MPEG2 encoder cards on a Windows/NT PC. The audio and video streams are then de-multiplexed and decompressed. The resulting decompressed audio is sampled at a rate of 16 kHz and the video at a standard rate of 30 frames/second.

VVAV data: To obtain data in more controlled settings, we are also collecting "read" large vocabulary continuous visual speech. In this data collection, the subject is videotaped while reading ViaVoice training sentences displayed on a teleprompter placed directly above the high-resolution SVHS camera. The data is collected in acoustically quiet, controlled conditions, and the resolution of the lip region in the video image is much higher than in the LDC data described above, which makes video-based recognition a more tractable task. For a fair comparison with the LDC data, the video digitization parameters and the audio sampling frequency are kept the same. We label this data the "ViaVoice Audio-Visual" (VVAV) data.
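Both data sets share the same digitization parameters (16 kHz audio, 30 frames/second video, 5 Mb/sec MPEG2), so the relationship between audio samples and video frames can be worked out once. The following is a minimal sketch of that arithmetic, not the group's actual processing code. It assumes that the audio and video streams start at the same instant, that "Mb" means 10^6 bits, and that the 5 Mb/sec figure covers the whole multiplexed stream. The function names are hypothetical and chosen for illustration only.

```python
from fractions import Fraction

# Parameters taken from the text above.
AUDIO_RATE_HZ = 16_000          # audio sampling rate
VIDEO_FPS = 30                  # video frame rate
MPEG2_BITRATE_BPS = 5_000_000   # assumption: 5 Mb/sec = 5 * 10^6 bits/sec


def samples_per_frame() -> Fraction:
    """Exact number of audio samples spanned by one video frame."""
    return Fraction(AUDIO_RATE_HZ, VIDEO_FPS)


def frame_sample_range(frame_index: int) -> tuple[int, int]:
    """Half-open [start, end) range of audio samples for a video frame.

    Assumes both streams start at time zero. Integer arithmetic avoids
    drift, because 16000 / 30 is not a whole number of samples.
    """
    if frame_index < 0:
        raise ValueError("frame_index must be non-negative")
    start = (frame_index * AUDIO_RATE_HZ) // VIDEO_FPS
    end = ((frame_index + 1) * AUDIO_RATE_HZ) // VIDEO_FPS
    return start, end


def mpeg2_bytes(seconds: int) -> int:
    """Approximate size in bytes of the MPEG2 stream for a given duration."""
    return MPEG2_BITRATE_BPS * seconds // 8


if __name__ == "__main__":
    print("audio samples per video frame:", float(samples_per_frame()))
    for i in range(3):
        print("frame", i, "-> samples", frame_sample_range(i))
    print("bytes per hour of MPEG2 video:", mpeg2_bytes(3600))
```

Because one frame spans 1600/3 audio samples, consecutive frames alternate between 533 and 534 samples under this mapping, and every third frame boundary falls exactly on a sample. At the stated bit rate, one hour of digitized video occupies roughly 2.25 GB under the assumptions above.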