Audio Visual Speech Technologies

Data Collection

To allow experiments on continuous, large-vocabulary, speaker-independent audio-visual speech recognition, a suitable database was collected at the IBM Thomas J. Watson Research Center before the Johns Hopkins Summer 2000 Workshop. The database consists of full-face frontal video and audio of 290 subjects uttering ViaVoice (TM) training scripts, i.e., continuous read speech with mostly verbalized punctuation (dictation style) and a vocabulary of approximately 10,500 words. The video is 704 x 480 pixels, interlaced, captured at a rate of 30 Hz, and MPEG2 encoded at a relatively high compression ratio. The audio is captured at 16 kHz in a relatively clean office environment. The duration of the entire database is approximately 50 hours (24,325 utterances). Example database frames are depicted below.

In addition to the IBM ViaVoice (TM) audio-visual database, a much smaller broadcast news dataset has been collected at both the IBM Thomas J. Watson Research Center and the Johns Hopkins University, before and during the Johns Hopkins Summer 2000 Workshop. This database contains audio-visual sequences of frontal anchor speech, digitized from CNN and CSPAN broadcast news tapes kindly provided by the Linguistic Data Consortium (LDC). Its total duration is approximately 5 hours, and it has been collected with the intent to perform audio-visual adaptation experiments. This database is still under development. Example frames are depicted below.

Finally, data in other domains and tasks are currently being collected.
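
The figures quoted for the ViaVoice (TM) database imply a few derived quantities that help in sizing experiments. The short Python sketch below works them out from the numbers in the text. It assumes that the 50-hour duration is exact (the text gives it only as approximate) and that the 16 kHz figure refers to the audio sampling rate; the function name `derived_stats` is purely illustrative.

```python
# Figures quoted in the text for the IBM ViaVoice audio-visual database.
DURATION_HOURS = 50                # approximate total duration (treated as exact here)
NUM_UTTERANCES = 24_325
NUM_SUBJECTS = 290
FRAME_RATE_HZ = 30                 # video frame rate
AUDIO_RATE_HZ = 16_000             # assumption: 16 kHz is the audio sampling rate
FRAME_WIDTH, FRAME_HEIGHT = 704, 480


def derived_stats():
    """Return rough quantities implied by the database description."""
    total_seconds = DURATION_HOURS * 3600
    return {
        "total_video_frames": total_seconds * FRAME_RATE_HZ,
        "audio_samples_per_frame": AUDIO_RATE_HZ / FRAME_RATE_HZ,
        "mean_utterance_seconds": total_seconds / NUM_UTTERANCES,
        "mean_utterances_per_subject": NUM_UTTERANCES / NUM_SUBJECTS,
        "pixels_per_frame": FRAME_WIDTH * FRAME_HEIGHT,
    }


if __name__ == "__main__":
    for name, value in derived_stats().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the database holds roughly 5.4 million video frames, with an average utterance length of about 7.4 seconds and about 84 utterances per subject. These are back-of-the-envelope estimates, not measured statistics.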