I. Automated Speech Recognition
The Automated Speech Recognition (ASR) problem for this project is to develop
technologies that are robust and accurate for the languages, topics, and
speaking styles found in the VHF Oral History Collection. This is particularly
important because all of the information in this archive lies in the spoken
audio. Given the difficulty and diversity of the data, advances in ASR
technology made here should generalize to related applications and benefit the
speech recognition field more broadly. While the ultimate goal of ASR is to
produce readable transcriptions, our
immediate goal is to produce transcriptions that are accurate enough to support
metadata creation and retrieval. Our ASR research is tightly integrated into
our development of novel cataloging and information retrieval methodologies.
Despite the considerable progress made in ASR technology in recent years,
significant research problems remain unsolved. Current techniques are sensitive
to the acoustic and environmental properties of the data, to speaker
variability, and to mismatches between training and usage conditions. Such
challenges exist to some degree in any speech application, but they are
particularly severe with materials such as oral histories, which are often
collected under uncontrolled conditions and in which a wide range of speaking
styles may be present. Fundamental improvements in ASR technology to address
these challenges will span the topics highlighted below.
RESEARCH TOPICS
· Spontaneous and emotional speech
· Whispered speech
· Speech with background noise and frequent interruptions
· Speech from elders
· Switches between languages, for example, between Yiddish, German, Polish, and English
· Heavily accented speech
· Speech with words such as names, obscure locations, unknown events, etc. that are outside the recognizer lexicon
· Disfluent speech
· Novel metrics for evaluating ASR systems, beyond the word error rate (WER) metric
· Language Models
· Acoustic Segmentation of Speakers
· Pronunciation Modeling to capture varied speaking styles
· Confidence Measures
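For readers unfamiliar with the WER metric named in the list above, the following is a minimal sketch of how it is conventionally computed: the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The function name and the whitespace tokenization are assumptions made for illustration; they are not part of the project's evaluation tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because every word is weighted equally, an error on a name or place counts the same as an error on a function word, which is one reason a retrieval-oriented project might look for metrics beyond WER.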
II. Information Retrieval (Metadata Creation and Cataloguing)
The VHF oral history archive represents a new challenge for many well-known
techniques in information retrieval, such as named entity tagging, segment
boundary determination, and classification using the VHF Thesaurus. Named
entity tagging is a core language technology that supports segmentation,
classification, cataloging, search, and browsing. The task involves identifying
terms that belong to a number of categories (e.g., persons, organizations,
locations, or temporal expressions) and labeling them with their category. We
will investigate the problem of dividing the text form of testimonies (either
manually transcribed or produced as output of an automatic speech recognizer)
into short, topically homogeneous segments (each a few minutes long) to support
classification and the creation of metadata. We plan to extend our previous
research on topical segmentation of broadcast news stories for the
DARPA/NIST-sponsored Topic Detection and Tracking (TDT) project.
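To make the idea of topical segmentation concrete, here is a minimal sketch based on lexical cohesion between adjacent windows of sentences (in the spirit of TextTiling). It is not the method used in the TDT work mentioned above; the stopword list, window size, threshold, and all function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "the", "in", "by", "to", "was", "us", "our", "every", "of", "and"}


def bag_of_words(sentences):
    """Count lowercase content-word tokens across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


def cosine(a, b):
    """Cosine similarity between two Counters (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment_boundaries(sentences, window=2, threshold=0.1):
    """Return each index i where a new segment is judged to start at sentences[i].

    A gap is flagged when the `window` sentences before it and the `window`
    sentences after it share too little vocabulary.
    """
    boundaries = []
    for i in range(window, len(sentences) - window + 1):
        left = bag_of_words(sentences[i - window:i])
        right = bag_of_words(sentences[i:i + window])
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries
```

On ASR output rather than manual transcripts, recognition errors would blur these vocabulary overlaps, so the threshold would likely need tuning against already-segmented testimonies.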
Assignment of thesaurus terms to segments is equivalent to associating with
each term a cluster of interview segments, and assigning the segments to these
clusters (which are, of course, highly overlapping). This process of assigning
segments to clusters differs from both purely supervised text classification
and unsupervised clustering. It is partially supervised, because an extensive
set of cataloged interview segments already exists. However, it also contains
aspects of unsupervised classification: the NISO Z39.19 thesaurus standard
recognizes that the size of the thesaurus (and hence the number of categories)
will naturally increase as more oral histories are processed. We will test many
of these techniques in TREC-like retrieval experiments that will help us
evaluate their value.
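The view of thesaurus terms as overlapping clusters can be illustrated with a small sketch. It builds one word-count profile per term from already-cataloged segments (the supervised part), assigns a new segment to every term whose profile is similar enough (so clusters may overlap), and returns an empty list when nothing matches, which could flag the segment as a candidate for a new term (the unsupervised part). The similarity measure, the threshold, the sample term names, and all function names are assumptions for illustration, not the project's actual classifier.

```python
import math
import re
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words vector for a segment (lowercase word counts)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counters (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_term_profiles(cataloged):
    """Sum word counts per thesaurus term.

    `cataloged` is a list of (segment_text, [terms]) pairs; a segment with
    several terms contributes to several profiles.
    """
    profiles = defaultdict(Counter)
    for text, terms in cataloged:
        vec = vectorize(text)
        for term in terms:
            profiles[term].update(vec)
    return profiles


def assign_terms(segment_text, profiles, threshold=0.3):
    """Return all terms whose profile is similar enough; possibly none."""
    vec = vectorize(segment_text)
    return sorted(
        term for term, profile in profiles.items()
        if cosine(vec, profile) >= threshold
    )
```

A usage pattern would be to call `build_term_profiles` once on the cataloged segments and then `assign_terms` on each new segment, reviewing the segments that come back with no terms.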
RESEARCH TOPICS
· Segmentation into relatively short partitions
· Language-specific segmentation
· Feature Selection (using acoustic evidence such as detection of speaker turns and uncued changes of language)
· Maximum Entropy Models
· Query expansion
· Document Clustering