TECHNICAL CHALLENGES

I. Automated Speech Recognition

 

The Automated Speech Recognition (ASR) challenge for this project is to develop technology that is robust and accurate for the languages, topics, and speaking styles found in the VHF Oral History Collection. This matters especially because essentially all of the archive's information is carried in the spoken audio. Given the difficulty and diversity of the data, advances in ASR technology made here are likely to generalize to related applications and to have a broad impact on the speech recognition field as a whole. While the ultimate goal of ASR is to produce readable transcriptions, our immediate goal is to produce transcriptions accurate enough to support metadata creation and retrieval. Our ASR research is therefore tightly integrated with our development of novel cataloging and information retrieval methods.

 

Despite considerable progress in ASR technology in recent years, significant research problems remain unsolved. Current techniques are sensitive to the acoustic and environmental properties of the recordings, to speaker variability, and to mismatches between training and usage conditions. These challenges exist to some degree in any speech application, but they are particularly severe for materials such as oral histories, which are often recorded under uncontrolled conditions and may contain a wide range of speaking styles. Addressing them will require fundamental improvements in ASR technology across the topics listed below.

 

RESEARCH TOPICS

·        Spontaneous and emotional speech

·        Whispered speech

·        Speech with background noise and frequent interruptions

·        Speech from elderly speakers

·        Switches between languages (for example, among Yiddish, German, Polish, and English)

·        Heavily accented speech

·        Speech containing words outside the recognizer lexicon, such as names, obscure locations, and little-known events

·        Disfluent speech

·        Novel metrics, other than word error rate (WER), for evaluating ASR systems

·        Language modeling

·        Acoustic segmentation by speaker

·        Pronunciation modeling to capture varied speaking styles

·        Confidence measures
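The topic list above calls for evaluation metrics beyond word error rate (WER). For reference, a minimal sketch of standard WER, computed from a word-level Levenshtein alignment, is given here; the function name `word_error_rate` and the simple whitespace tokenization are illustrative assumptions. It also makes the motivation for new metrics concrete: WER charges the same cost for a dropped function word as for a misrecognized place name, although the latter matters far more for cataloging and retrieval.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match if cost == 0)
            )
    return dist[len(ref)][len(hyp)] / len(ref)
```

For example, `word_error_rate("we left the village in nineteen forty two", "we left village in nineteen forty too")` returns 0.25: one deletion ("the") plus one substitution ("two" to "too") over eight reference words. Both errors count equally, even though only the second would affect a date-based search.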

 

II. Information Retrieval (Metadata Creation and Cataloging)

 

The VHF oral history archive poses new challenges for many well-known information retrieval techniques, such as named entity tagging, segment boundary determination, and classification using the VHF Thesaurus. Named entity tagging is a core language technology that supports segmentation, classification, cataloging, search, and browsing. The task involves identifying terms that belong to a set of categories (e.g., persons, organizations, locations, or temporal expressions) and labeling each with its category.

We will investigate the problem of dividing the text of testimonies (either manually transcribed or produced by an automatic speech recognizer) into short (a few minutes long), topically homogeneous segments to support classification and metadata creation. We plan to extend our previous research on topical segmentation of broadcast news stories for the DARPA/NIST-sponsored Topic Detection and Tracking (TDT) project.

Assigning thesaurus terms to segments is equivalent to associating each term with a cluster of interview segments and assigning segments to these clusters, which overlap heavily because a single segment may receive several terms. This process differs from both purely supervised text classification and unsupervised clustering. It is partially supervised, because an extensive set of cataloged interview segments already exists. However, it also has an unsupervised aspect: the NISO Z39.19 thesaurus standard recognizes that the thesaurus (and hence the number of categories) will naturally grow as more oral histories are processed.

We will evaluate many of these techniques with TREC-style retrieval experiments that will help us assess the value of these automated techniques.
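One widely used baseline for dividing text into topically homogeneous segments, as in the segmentation task described above, is lexical cohesion: adjacent windows of text that share little vocabulary suggest a topic shift. The sketch below is a simplified variant of that idea in the spirit of Hearst's TextTiling; it is not the TDT segmentation method referred to above. It assumes the testimony text has already been split into sentences (or pause-delimited ASR utterances, since recognizer output lacks punctuation), and the stop list, `window`, and `threshold` values are hypothetical illustrative choices, not tuned settings.

```python
import math
import re
from collections import Counter

# Illustrative stop list (an assumption); a real system would need fuller, language-specific lists.
STOPWORDS = {"a", "an", "and", "by", "had", "in", "of", "the", "to", "was", "we", "were"}


def bag_of_words(sentences):
    """Count content words across a window of sentences."""
    counts = Counter()
    for sentence in sentences:
        for word in re.findall(r"[a-z']+", sentence.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two word-count mappings (0.0 if either is empty)."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def segment(sentences, window=2, threshold=0.1):
    """Return gap indices g such that a topic boundary falls before sentences[g].

    window and threshold are hypothetical defaults, not tuned values.
    """
    gaps = range(1, len(sentences))
    # Lexical cohesion across each gap: compare up to `window` sentences on either side.
    sims = {
        g: cosine(bag_of_words(sentences[max(0, g - window):g]),
                  bag_of_words(sentences[g:g + window]))
        for g in gaps
    }
    boundaries = []
    for g in gaps:
        left = sims.get(g - 1, float("inf"))
        right = sims.get(g + 1, float("inf"))
        # Boundary = local minimum of cohesion (ties allowed) that also falls below the threshold.
        if sims[g] < threshold and sims[g] <= left and sims[g] <= right:
            boundaries.append(g)
    return boundaries
```

For example, given the toy sentences `["We lived in a small village near the river.", "The village had a market near the river.", "Soldiers arrived by train and took the men.", "The soldiers marched the men to the train."]`, `segment` returns `[2]`, placing one boundary before the third sentence. Real testimonies would need larger windows and a more principled cutoff, such as TextTiling's depth scores.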

 

RESEARCH TOPICS

·        Segmentation into relatively short partitions

·        Language-specific segmentation

·        Feature selection (using acoustic evidence such as detected speaker turns and uncued language changes)

·        Maximum entropy models

·        Query expansion

·        Document clustering
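The research topics above pair maximum entropy models with feature selection over acoustic evidence such as speaker turns and language changes. For a two-class decision such as "is there a segment boundary here?", a conditional maximum entropy model over such features reduces to logistic regression. The sketch below assumes candidate boundary points have already been extracted and described by hypothetical binary features (`speaker_turn`, `language_change`, `long_pause`); the training examples are invented for illustration, not taken from the archive, and the model is fit with plain batch gradient ascent, without the regularization or feature selection a real system would need.

```python
import math

# Hypothetical binary features describing a candidate boundary (e.g. a pause between utterances).
FEATURES = ["speaker_turn", "language_change", "long_pause"]

# Invented toy training data: (features present at the candidate, 1 if it is a true boundary).
EXAMPLES = [
    ({"speaker_turn": 1, "language_change": 1, "long_pause": 1}, 1),
    ({"speaker_turn": 1, "long_pause": 1}, 1),
    ({"language_change": 1, "long_pause": 1}, 1),
    ({"speaker_turn": 1}, 0),
    ({"language_change": 1}, 0),
    ({}, 0),
]


def predict(weights, bias, x):
    """P(boundary | x) under a binary maximum-entropy (logistic) model."""
    z = bias + sum(w * x.get(f, 0) for f, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(examples, features, epochs=1000, learning_rate=0.5):
    """Fit weights by batch gradient ascent on the average conditional log-likelihood."""
    weights = {f: 0.0 for f in features}
    bias = 0.0
    n = len(examples)
    for _ in range(epochs):
        grad = {f: 0.0 for f in features}
        grad_bias = 0.0
        for x, y in examples:
            error = y - predict(weights, bias, x)
            # Gradient = observed minus expected feature values, summed over the data.
            for f in features:
                grad[f] += error * x.get(f, 0)
            grad_bias += error
        for f in features:
            weights[f] += learning_rate * grad[f] / n
        bias += learning_rate * grad_bias / n
    return weights, bias
```

After `weights, bias = train_maxent(EXAMPLES, FEATURES)`, calling `predict(weights, bias, {"speaker_turn": 1, "long_pause": 1})` gives the estimated probability that a candidate with a speaker turn and a long pause is a boundary. Because the model is log-linear, each learned weight can be read directly as the evidence its feature contributes, which offers one simple way to inspect candidate features.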