IBM Research
Natural Language Processing

Selected Publications (Research Spotlight, February 2004)

"A Multi-Agent Approach to using Redundancy and Reinforcement in Question Answering", John Prager, Jennifer Chu-Carroll, and Krzysztof Czuba. To appear in New Directions in Question Answering, M. Maybury (ed.), 2004.

Abstract:
We explore how to improve the performance of our Question Answering system by using redundancy and reinforcement. We have deployed in our system a variety of agents, each of which is tuned to a different class of question types, but with considerable overlap. One source of redundancy and reinforcement is from the multiple agents: many questions give rise to two or more sets of candidate answers, which can be merged to provide better performance than any single agent. We note relative improvement of up to 16.3% using the Mean Reciprocal Rank metric, and 11.9% using the Confidence Weight Score metric. We also investigate new approaches we call QA-by-Dossier and QA-by-Dossier-with-Constraints, in which additional questions are asked to generate constraints on the answers to the original question, thus reducing the confidence of many wrong answers and reinforcing many good ones.
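
The Mean Reciprocal Rank figure quoted in this abstract can be made concrete with a small sketch. It assumes the commonly used definition of the metric (the average, over questions, of 1/rank of the first correct answer); the abstract does not spell out the evaluation setup, and the function names and toy data here are hypothetical.

```python
from typing import List, Sequence, Set


def reciprocal_rank(ranked: Sequence[str], correct: Set[str]) -> float:
    """Return 1/rank of the first correct answer, or 0.0 if none is correct."""
    for rank, answer in enumerate(ranked, start=1):
        if answer in correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Sequence[str]], gold: List[Set[str]]) -> float:
    """Average the reciprocal rank over all questions (assumed standard definition)."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, g) for r, g in zip(runs, gold)) / len(runs)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, as a fraction of the baseline."""
    return (improved - baseline) / baseline


# Toy data (hypothetical): three questions, ranked candidate answers per question.
toy_runs = [["a", "b"], ["x", "y"], ["p"]]
toy_gold = [{"a"}, {"y"}, {"q"}]
toy_mrr = mean_reciprocal_rank(toy_runs, toy_gold)
```

A "relative improvement" is measured against the baseline score, so an MRR that rises from 0.5 to 0.6 is a 20% relative improvement, not 10%.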


"HowtogetaChineseName: Segmentation and Combination Issues", Hongyan Jing, Radu Florian, Xiaoqiang Luo, Tong Zhang, and Abraham Ittycheriah, EMNLP-2003.

Abstract:
When building a Chinese named entity recognition system, one must deal with certain language-specific issues such as whether the model should be based on characters or words. While there is no unique answer to this question, we discuss in detail the advantages and disadvantages of each model, identify problems in segmentation and suggest possible solutions, presenting our observations, analysis, and experimental results. The second topic of this paper is classifier combination. We present four classifiers for Chinese named entity recognition and describe various methods for combining their outputs. The results demonstrate that classifier combination is an effective technique for improving system performance: experiments over a large annotated corpus of fine-grained entity types exhibit a 10% relative reduction in F-measure error.
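
The abstract does not say which combination methods were used. As an illustration only, the sketch below shows per-token majority voting, one common way to combine classifier outputs, together with the arithmetic behind a "relative reduction in F-measure error" (error taken as 1 - F). The labels, function names, and numbers are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine equal-length label sequences, one per classifier, token by token.

    Ties are broken in favour of the earliest classifier in the list.
    """
    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        combined.append(next(label for label in labels if counts[label] == top))
    return combined


def relative_error_reduction(f_baseline: float, f_combined: float) -> float:
    """Fraction by which the F-measure error (1 - F) shrinks."""
    baseline_error = 1.0 - f_baseline
    combined_error = 1.0 - f_combined
    return (baseline_error - combined_error) / baseline_error


# Toy outputs (hypothetical) of four classifiers over three tokens.
toy_predictions = [
    ["PER", "O", "LOC"],
    ["PER", "ORG", "LOC"],
    ["O", "ORG", "GPE"],
    ["PER", "O", "GPE"],
]
toy_combined = majority_vote(toy_predictions)
```

Under this reading, moving from F = 0.80 to F = 0.82 cuts the error from 0.20 to 0.18, which is a 10% relative reduction.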


 

"In Question Answering, Two Heads are Better Than One", Jennifer Chu-Carroll, Krzysztof Czuba, John Prager, and Abraham Ittycheriah, HLT/NAACL-2003.

Abstract:
Motivated by the success of ensemble methods in machine learning and other areas of natural language processing, we developed a multi-strategy and multi-source approach to question answering which is based on combining the results from different answering agents searching for answers in multiple corpora. The answering agents adopt fundamentally different strategies, one utilizing primarily knowledge-based mechanisms and the other adopting statistical techniques. We present our multi-level answer resolution algorithm that combines results from the answering agents at the question, passage, and/or answer levels. Experiments evaluating the effectiveness of our answer resolution algorithm show a 35.0% relative improvement over our baseline system in the number of questions correctly answered, and a 32.8% improvement according to the average precision metric.


"Language Model Based Arabic Word Segmentation", Young-Suk Lee, Kishore Papineni, Salim Roukos, Ossama Emam, and Hany Hassan, ACL-2003.

Abstract:
We approximate Arabic's rich morphology with a model in which a word consists of a sequence of morphemes in the pattern prefix*-stem-suffix* (* denotes zero or more occurrences of a morpheme). Our method is seeded by a small manually segmented Arabic corpus and uses it to bootstrap an unsupervised algorithm to build the Arabic word segmenter from a large unsegmented Arabic corpus. The algorithm uses a trigram language model to determine the most probable morpheme sequence for a given input. The language model is initially estimated from a small manually segmented corpus of about 110,000 words. To improve the segmentation accuracy, we use an unsupervised algorithm for automatically acquiring new stems from a 155 million word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus. The resulting Arabic word segmentation system achieves around 97% exact match accuracy on a test corpus containing 28,449 word tokens. We believe this is state-of-the-art performance, and that the algorithm can be used for many highly inflected languages provided that one can create a small manually segmented corpus of the language of interest.
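
The two ideas in this abstract, enumerating segmentations that fit prefix*-stem-suffix* and choosing the most probable one under a trigram model over morphemes, can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the prefix and suffix inventories, the log-probability table, and the fixed floor for unseen trigrams are all hypothetical stand-ins for quantities the paper estimates from corpora.

```python
from typing import Dict, List, Set, Tuple

Trigram = Tuple[str, str, str]


def candidate_segmentations(word: str, prefixes: Set[str], suffixes: Set[str]) -> List[Tuple[str, ...]]:
    """Enumerate splits of `word` matching prefix* stem suffix* with a non-empty stem."""
    def prefix_splits(s):
        yield [], s
        for p in prefixes:
            if s.startswith(p) and len(s) > len(p):
                for more, rest in prefix_splits(s[len(p):]):
                    yield [p] + more, rest

    def suffix_splits(s):
        yield s, []
        for x in suffixes:
            if s.endswith(x) and len(s) > len(x):
                for stem, more in suffix_splits(s[:-len(x)]):
                    yield stem, more + [x]

    found = set()
    for pre, rest in prefix_splits(word):
        for stem, suf in suffix_splits(rest):
            found.add(tuple(pre) + (stem,) + tuple(suf))
    return sorted(found)  # sorted for a deterministic order


def trigram_logprob(morphemes: Tuple[str, ...], logp: Dict[Trigram, float], floor: float = -10.0) -> float:
    """Sum trigram log-probabilities; unseen trigrams get a fixed floor (an assumption)."""
    padded = ("<s>", "<s>") + tuple(morphemes) + ("</s>",)
    return sum(logp.get(padded[i:i + 3], floor) for i in range(len(padded) - 2))


def best_segmentation(word, prefixes, suffixes, logp):
    """Pick the candidate with the highest trigram log-probability."""
    candidates = candidate_segmentations(word, prefixes, suffixes)
    return max(candidates, key=lambda c: trigram_logprob(c, logp))


# Hypothetical toy inventories and model, written in Latin placeholder letters.
PREFIXES = {"w"}
SUFFIXES = {"h"}
TOY_LOGP = {
    ("<s>", "<s>", "w"): -1.0,
    ("<s>", "w", "ktb"): -1.0,
    ("w", "ktb", "h"): -1.0,
    ("ktb", "h", "</s>"): -1.0,
}
toy_best = best_segmentation("wktbh", PREFIXES, SUFFIXES, TOY_LOGP)
```

With these toy values the input "wktbh" has four candidate segmentations, and the three-morpheme split w + ktb + h scores highest. A real system would need a properly smoothed model rather than a constant floor.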


 

"Sentiment analysis: capturing favorability using natural language processing", Tetsuya Nasukawa and Jeonghee Yi, The Second International Conference on Knowledge Capture (K-CAP 2003).

Abstract:
This paper illustrates a sentiment analysis approach to extract sentiments associated with polarities of positive or negative for specific subjects from a document, instead of classifying the whole document into positive or negative. The essential issues in sentiment analysis are to identify how sentiments are expressed in texts and whether the expressions indicate positive (favorable) or negative (unfavorable) opinions toward the subject. In order to improve the accuracy of the sentiment analysis, it is important to properly identify the semantic relationships between the sentiment expressions and the subject. By applying semantic analysis with a syntactic parser and sentiment lexicon, our prototype system achieved high precision (75-95%, depending on the data) in finding sentiments within Web pages and news articles.


 

"Towards Ontologies On Demand", Youngja Park, Roy Byrd, and Branimir Boguraev, Proceedings of Workshop on Semantic Web Technologies for Scientific Search and Information Retrieval.

Abstract:
The Semantic Web aims at adding semantic knowledge into the web of natural language hypertext, enabling deep-level information search and information integration. However, building a knowledge base and an ontology is so costly and time-consuming that it hampers the progress of the Semantic Web activity. We present a method for building ontologies on demand from scientific queries by applying text mining technologies. The method induces ontological concepts and relationships relevant to the query by analyzing search result documents together with domain-specific knowledge sources available on the Web. Users can use this partial ontology not only for ad-hoc search refinement but also for extending an existing domain ontology. The presented method can be used to produce, over several sessions, a personalized ontology.


"tRuEcasIng", Lucian Vlad Lita, Abe Ittycheriah, Salim Roukos, and Nanda Kambhatla, ACL-2003.

Abstract:
Truecasing is the process of restoring case information to badly-cased or noncased text. This paper explores truecasing issues and proposes a statistical, language modeling based truecaser which achieves an accuracy of ~98% on news articles. Task based evaluation shows a 26% F-measure improvement in named entity recognition when using truecasing. In the context of automatic content extraction, mention detection on automatic speech recognition text is also improved by a factor of 8. Truecasing also enhances machine translation output legibility and yields a BLEU score improvement of 80.2%. This paper argues for the use of truecasing as a valuable component in text processing applications.
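
To make the task concrete, here is a deliberately simple unigram baseline: each token is restored to the casing it most often had in correctly cased training text. This is not the language-modeling-based truecaser the paper proposes, which uses context; the function names and the training sentences are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def train_unigram_truecaser(sentences: Iterable[str]) -> Dict[str, str]:
    """Map each lowercased token to its most frequent surface form in the training text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split():
            counts[token.lower()][token] += 1
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}


def truecase(text: str, model: Dict[str, str]) -> str:
    """Restore casing token by token; unseen tokens are left lowercased."""
    return " ".join(model.get(tok.lower(), tok.lower()) for tok in text.split())


# Hypothetical, correctly cased training sentences.
toy_training = [
    "IBM Research is in New York",
    "the research at IBM is new",
    "New York is large",
    "research on speech at IBM",
]
toy_model = train_unigram_truecaser(toy_training)
toy_output = truecase("ibm research in new york", toy_model)
```

Because this baseline ignores context, it would also capitalize "new" in "the research is new". Ambiguities of that kind are what a context-sensitive model is meant to resolve.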


 