"A
Multi-Agent Approach to using Redundancy and Reinforcement in Question
Answering", John Prager, Jennifer Chu-Carroll, and Krzysztof
Czuba. To appear in New Directions in Question Answering, M. Maybury
(ed.), 2004.
Abstract:
We explore how to improve the performance of our Question
Answering system by using redundancy and reinforcement. We have
deployed in our system a variety of agents, each of which is tuned
to a different class of question types, but with considerable overlap.
One source of redundancy and reinforcement is from the multiple
agents: many questions give rise to two or more sets of candidate
answers, which can be merged to provide better performance than
any single agent. We note relative improvement of up to 16.3% using
the Mean Reciprocal Rank metric, and 11.9% using the Confidence
Weight Score metric. We also investigate new approaches we call
QA-by-Dossier and QA-by-Dossier-with-Constraints, in which additional
questions are asked to generate constraints on the answers to the
original question, thus reducing the confidence of many wrong answers
and reinforcing many good ones.
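As an illustration of the Mean Reciprocal Rank metric cited in the abstract above, here is a minimal sketch in Python. The function name, the input layout, and the absence of a rank cutoff are assumptions made for this example; the abstract does not describe the evaluation setup in that detail.

```python
def mean_reciprocal_rank(ranked_answers, gold):
    """Average of 1/rank of the first correct answer per question.

    ranked_answers: list of ranked candidate lists, one per question.
    gold: list of sets of acceptable answers, one per question.
    A question with no correct candidate contributes 0.
    (Hypothetical helper; no rank cutoff is applied.)
    """
    total = 0.0
    for candidates, correct in zip(ranked_answers, gold):
        for rank, answer in enumerate(candidates, start=1):
            if answer in correct:
                total += 1.0 / rank
                break
    return total / len(ranked_answers)


# Toy data: first question answered at rank 1, second at rank 3,
# third not answered at all.
ranked = [["a", "b"], ["x", "y", "z"], ["q"]]
gold = [{"a"}, {"z"}, {"none"}]
score = mean_reciprocal_rank(ranked, gold)  # (1 + 1/3 + 0) / 3
```

A "relative improvement" in this metric compares the score of the merged answer lists with the score of a single agent, divided by the single-agent score.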
"HowtogetaChineseName:
Segmentation and Combination Issues", Hongyan Jing, Radu Florian,
Xiaoqiang Luo, Tong Zhang, and Abraham Ittycheriah, EMNLP-2003.
Abstract:
When building a Chinese named entity recognition system,
one must deal with certain language-specific issues such as whether
the model should be based on characters or words. While there is
no unique answer to this question, we discuss in detail advantages
and disadvantages of each model, identify problems in segmentation
and suggest possible solutions, presenting our observations, analysis,
and experimental results. The second topic of this paper is classifier
combination. We present and describe four classifiers for Chinese
named entity recognition and describe various methods for combining
their outputs. The results demonstrate that classifier combination
is an effective technique for improving system performance: experiments
over a large annotated corpus of fine-grained entity types exhibit
a 10% relative reduction in F-measure error.
"In
Question Answering, Two Heads are Better Than One", Jennifer
Chu-Carroll, Krzysztof Czuba, John Prager, and Abraham Ittycheriah,
HLT/NAACL-2003.
Abstract:
Motivated by the success of ensemble methods in machine
learning and other areas of natural language processing, we developed
a multistrategy and multi-source approach to question answering
which is based on combining the results from different answering
agents searching for answers in multiple corpora. The answering
agents adopt fundamentally different strategies, one utilizing primarily
knowledge-based mechanisms and the other adopting statistical techniques.
We present our multi-level answer resolution algorithm that combines
results from the answering agents at the question, passage, and/or
answer levels. Experiments evaluating the effectiveness of our answer
resolution algorithm show a 35.0% relative improvement over our
baseline system in the number of questions correctly answered, and
a 32.8% improvement according to the average precision metric.
"Language
Model Based Arabic Word Segmentation", Young-Suk Lee, Kishore
Papineni, Salim Roukos, Ossama Emam, and Hany Hassan, ACL-2003.
Abstract:
We approximate Arabic's rich morphology by a model in which
a word consists of a sequence of morphemes in the pattern prefix*-stem-suffix*
(* denotes zero or more occurrences of a morpheme). Our method is
seeded by a small manually segmented Arabic corpus and uses it to
bootstrap an unsupervised algorithm to build the Arabic word segmenter
from a large unsegmented Arabic corpus. The algorithm uses a trigram
language model to determine the most probable morpheme sequence
for a given input. The language model is initially estimated from
a small manually segmented corpus of about 110,000 words. To improve
the segmentation accuracy, we use an unsupervised algorithm for
automatically acquiring new stems from a 155 million word unsegmented
corpus, and re-estimate the model parameters with the expanded vocabulary
and training corpus. The resulting Arabic word segmentation system
achieves around 97% exact match accuracy on a test corpus containing
28,449 word tokens. We believe this is state-of-the-art performance
and the algorithm can be used for many highly inflected languages
provided that one can create a small manually segmented corpus of
the language of interest.
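The prefix*-stem-suffix* pattern in the abstract above can be sketched as a candidate generator. This is only an illustration under stated assumptions: the affix tables are tiny hypothetical transliterated examples, not the paper's inventory, and the trigram language model that would choose among the candidates is not shown.

```python
# Hypothetical, transliterated affix tables for illustration only.
PREFIXES = {"w", "al", "b"}
SUFFIXES = {"h", "at", "hm"}


def segmentations(word, prefixes=PREFIXES, suffixes=SUFFIXES):
    """Enumerate every split of `word` into prefix* stem suffix*.

    Returns a sorted list of (prefixes, stem, suffixes) triples,
    where the stem is always non-empty.
    """
    def strip_prefixes(s):
        # Zero prefixes is always an option.
        yield [], s
        for p in prefixes:
            if s.startswith(p) and len(s) > len(p):
                for more, rest in strip_prefixes(s[len(p):]):
                    yield [p] + more, rest

    def strip_suffixes(s):
        # Zero suffixes is always an option.
        yield s, []
        for x in suffixes:
            if s.endswith(x) and len(s) > len(x):
                for rest, earlier in strip_suffixes(s[:-len(x)]):
                    yield rest, earlier + [x]

    results = set()
    for pre, remainder in strip_prefixes(word):
        for stem, suf in strip_suffixes(remainder):
            results.add((tuple(pre), stem, tuple(suf)))
    return sorted(results)


candidates = segmentations("walktabh")
```

With these toy tables, the candidates range from the unsegmented word to the split w + al + ktab + h. A segmenter of the kind the abstract describes would then score each candidate morpheme sequence with its language model and keep the most probable one.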
"Sentiment
analysis: capturing favorability using natural language processing",
Tetsuya Nasukawa and Jeonghee Yi, The Second International Conference
on Knowledge Capture (K-CAP 2003).
Abstract:
This paper illustrates a sentiment analysis approach to
extract sentiments associated with polarities of positive or negative
for specific subjects from a document, instead of classifying the
whole document into positive or negative. The essential issues in
sentiment analysis are to identify how sentiments are expressed
in texts and whether the expressions indicate positive (favorable)
or negative (unfavorable) opinions toward the subject. In order
to improve the accuracy of the sentiment analysis, it is important
to properly identify the semantic relationships between the sentiment
expressions and the subject. By applying semantic analysis with
a syntactic parser and sentiment lexicon, our prototype system achieved
high precision (75-95%, depending on the data) in finding sentiments
within Web pages and news articles.
"Towards
Ontologies On Demand", Youngja Park, Roy Byrd, and Branimir
Boguraev, Proceedings of Workshop on Semantic Web Technologies for
Scientific Search and Information Retrieval.
Abstract:
The Semantic Web aims at adding semantic knowledge into
the web of natural language hypertext, enabling deep-level information
search and information integration. However, building a knowledge
base and an ontology is so costly and time-consuming that it hampers
the progress of the Semantic Web activity. We present a method for
building ontologies on demand from scientific queries by applying
text mining technologies. The method induces ontological concepts
and relationships relevant to the query by analyzing search result
documents together with domain-specific knowledge sources available
on the Web. Users can use this partial ontology not only for ad-hoc
search refinement but also for extending an existing domain ontology.
The presented method can be used to produce, over several sessions,
a personalized ontology.
"tRuEcasIng,
Lucian Vlad Lita", Abe Ittycheria, Salim Roukos, and Nanda
Kambhatla, ACL-2003.
Abstract:
Truecasing is the process of restoring case information
to badly-cased or noncased text. This paper explores truecasing
issues and proposes a statistical, language modeling based truecaser
which achieves an accuracy of ~98% on news articles. Task based
evaluation shows a 26% F-measure improvement in named entity recognition
when using truecasing. In the context of automatic content extraction,
mention detection on automatic speech recognition text is also improved
by a factor of 8. Truecasing also enhances machine translation output
legibility and yields a BLEU score of 80.2%. This paper
argues for the use of truecasing as a valuable component in text
processing applications.
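To make the truecasing task concrete, here is a deliberately simple baseline in Python: restore each token to the surface form seen most often in cased training text. This is a unigram stand-in chosen for brevity, not the language-model-based truecaser the abstract describes, and the function names and toy training sentences are assumptions for this example.

```python
from collections import Counter, defaultdict


def train_truecaser(cased_sentences):
    """Map each lowercased token to its most frequent cased form."""
    counts = defaultdict(Counter)
    for sentence in cased_sentences:
        for token in sentence.split():
            counts[token.lower()][token] += 1
    return {lower: forms.most_common(1)[0][0]
            for lower, forms in counts.items()}


def truecase(text, model):
    """Recase whitespace-separated tokens; unseen tokens are left as-is."""
    return " ".join(model.get(tok.lower(), tok) for tok in text.split())


# Toy training data (hypothetical).
training = [
    "IBM hired new researchers",
    "researchers at IBM study language",
    "New York is large",
    "a new model",
]
model = train_truecaser(training)
restored = truecase("ibm researchers study NEW models", model)
```

A unigram table like this cannot tell "new" from "New" in "New York", since it ignores context; that gap is the motivation for using a language model over token sequences.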