| IBM Research
has been actively pursuing research in Natural Language Processing
over the last few decades. Our mission is to offer speech and language
technologies that form the core of current and future products and
solutions for processing natural language. We apply linguistic,
knowledge-based, statistical, and hybrid methods to natural
language processing. Furthermore, our work addresses theoretical
issues of computational linguistics and encompasses a wide range
of application areas such as speech processing, machine translation,
question answering, interactive dialogue systems, text mining and
information extraction, natural language understanding and generation,
information retrieval, and automatic text summarization.
Some current projects include:
Blue Lego
The goal of this project is business analytics over structured
and unstructured data. By tightly integrating structured data with
annotated unstructured text in an OLAP-like data model, this approach
models the aggregated information extracted from unstructured text.
GlossOnt
A domain ontology building tool that applies text mining technology
to build domain-specific ontologies. GlossOnt focuses on one
domain concept at a time and dynamically acquires source documents
about the target concept, such as domain glossaries and search documents.
From these source documents it induces ontological concepts and
relations related to the target concept.
Language Translation
This project deals with natural language analysis and translation
by computer. Because the expansion of the Internet makes it increasingly
likely that users will encounter documents written in foreign languages,
we focus on the translation of documents accessible through the Internet.
PIQUANT
(Practical Intelligent QUestion ANswering Technology)
The focus of the PIQUANT project is to explore how best to integrate
and balance various technologies such as a knowledge base, NLP,
planning and traditional text-based IR in order to build an efficient,
modular, multi-agent Question Answering environment. The primary
goal of PIQUANT is to improve question answering performance by
leveraging knowledge-based, statistical, and linguistic approaches
to QA. We have developed a modular and extensible QA architecture
that facilitates the integration of independently produced knowledge
sources, provides a uniform interface for accessing knowledge from
these distinct sources, and enables the use of multiple answering
agents that may employ vastly different strategies for answering
questions.
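As a rough illustration of the architecture described above, the following Python sketch shows knowledge sources behind one uniform interface and several answering agents whose candidate answers are merged. All names here (`KnowledgeSource`, `AnsweringAgent`, `DictSource`, `LookupAgent`, `combine`) are hypothetical, and summing confidences is only one simple way to merge answers; it is not presented as PIQUANT's actual design.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Answer:
    text: str
    confidence: float


class KnowledgeSource(Protocol):
    """Uniform interface: every source answers lookups the same way."""

    def lookup(self, key: str) -> list[str]: ...


class AnsweringAgent(Protocol):
    """Each agent may use a different strategy but returns scored answers."""

    def answer(self, question: str) -> list[Answer]: ...


class DictSource:
    """Toy knowledge source backed by an in-memory dictionary (assumption)."""

    def __init__(self, facts: dict[str, list[str]]) -> None:
        self._facts = facts

    def lookup(self, key: str) -> list[str]:
        return self._facts.get(key.lower(), [])


class LookupAgent:
    """Toy agent: queries one source with the whole question string."""

    def __init__(self, source: KnowledgeSource, confidence: float) -> None:
        self._source = source
        self._confidence = confidence

    def answer(self, question: str) -> list[Answer]:
        return [Answer(t, self._confidence) for t in self._source.lookup(question)]


def combine(agents: list[AnsweringAgent], question: str) -> list[Answer]:
    """Merge answers from all agents; agreement between agents raises the score."""
    scores: dict[str, float] = defaultdict(float)
    for agent in agents:
        for candidate in agent.answer(question):
            scores[candidate.text] += candidate.confidence
    merged = [Answer(text, score) for text, score in scores.items()]
    return sorted(merged, key=lambda a: a.confidence, reverse=True)


# Two sources with different contents, wrapped by two agents.
knowledge_base = DictSource({"capital of france": ["Paris"]})
text_index = DictSource({"capital of france": ["Paris", "Lyon"]})
agents = [LookupAgent(knowledge_base, 0.6), LookupAgent(text_index, 0.3)]

ranked = combine(agents, "capital of France")
print([a.text for a in ranked])  # "Paris" ranks first: both agents proposed it
```

Because agents depend only on the `lookup` interface, a new knowledge source can be added without changing any agent, which is the kind of modularity the paragraph above describes.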
SAIL
Machine learning applied to information extraction shows promising
results. However, to learn accurate annotators, learning algorithms
require quite a lot of accurately labeled training data, which is
often unavailable. To address this problem in the area of named
entities (e.g. names of people, places, organizations, genes, proteins,
diseases, etc.), we have developed an interactive learning workbench
(SAIL) that substantially reduces the amount of manual labor and
level of expertise required to train annotators. Key to SAIL is
the use of our Robust Risk Minimization learning algorithm (Zhang
et al., 2002), a very efficient statistical algorithm for learning
linear classifiers that also provides in-class probability estimates.
SAIL uses in-class probability estimates for various forms of active
or semi-supervised learning, in which a developer need only evaluate,
and perhaps revise, a relatively small number of candidate annotations
to produce accurate results. SAIL provides a number of functions
that speed up the learning process, e.g., automatic acceptance or
rejection of annotations based on user-determined confidence levels.
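The automatic acceptance and rejection step can be sketched as follows. This is a minimal illustration that assumes each candidate annotation already carries an in-class probability estimate; the `Candidate` class, the `triage` function, and the threshold values are hypothetical and are not SAIL's actual interface.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    label: str
    probability: float  # in-class probability estimate from the classifier


def triage(candidates, accept_at=0.9, reject_at=0.1):
    """Split candidates into accepted, rejected, and needs-review lists.

    accept_at and reject_at stand in for the user-determined confidence levels.
    """
    if not 0.0 <= reject_at < accept_at <= 1.0:
        raise ValueError("require 0 <= reject_at < accept_at <= 1")
    accepted, rejected, review = [], [], []
    for candidate in candidates:
        if candidate.probability >= accept_at:
            accepted.append(candidate)
        elif candidate.probability <= reject_at:
            rejected.append(candidate)
        else:
            review.append(candidate)
    # Show the developer the most uncertain candidates first (closest to 0.5).
    review.sort(key=lambda c: abs(c.probability - 0.5))
    return accepted, rejected, review


candidates = [
    Candidate("IBM", "ORGANIZATION", 0.97),
    Candidate("Jordan", "PERSON", 0.55),
    Candidate("Apple", "ORGANIZATION", 0.30),
    Candidate("the", "PERSON", 0.02),
]
accepted, rejected, review = triage(candidates)
print([c.text for c in accepted])  # ['IBM']
print([c.text for c in rejected])  # ['the']
print([c.text for c in review])    # ['Jordan', 'Apple']
```

Only the two uncertain candidates reach the developer, which is how confidence thresholds can cut down the amount of manual evaluation.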
Statistical Machine Translation
In the Statistical Machine Translation project we are exploring
new algorithms to improve the quality of machine translation. These
algorithms exploit large parallel corpora, consisting of sentence
pairs that are translations of each other, to build a statistical
translation model between the two languages. We are building systems
for several language pairs, including Arabic-English, Chinese-English,
Hindi-English, and English-French. We are exploring systems that use
word-to-word, phrase-to-phrase, and parse-based translation techniques.
We have also developed the world's first statistical machine translation
product for Arabic-English, which runs on Windows and Linux and has
been deployed at customer locations.
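To make the word-to-word case concrete, here is a small Python sketch that estimates word translation probabilities from sentence pairs with expectation-maximization, in the style of the textbook IBM Model 1 (without a NULL word). The three-sentence French-English corpus and the function names are invented for illustration; the sketch does not claim to reproduce the algorithms used in the project.

```python
from collections import defaultdict


def train_word_translation(pairs, iterations=20):
    """Estimate t(target_word | source_word) from sentence pairs using EM."""
    target_vocab = {w for _, target in pairs for w in target}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # keyed by (target_word, source_word)
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (target, source) links
        total = defaultdict(float)  # expected counts per source word
        for source, target in pairs:
            for tw in target:
                # E-step: share each target word among the source words.
                norm = sum(t[(tw, sw)] for sw in source)
                for sw in source:
                    fraction = t[(tw, sw)] / norm
                    count[(tw, sw)] += fraction
                    total[sw] += fraction
        # M-step: renormalize expected counts into probabilities.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return dict(t)


def best_translation(t, source_word):
    """Return the most probable target word for source_word."""
    options = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(options, key=options.get)


# Toy parallel corpus (assumption): (source tokens, target tokens).
corpus = [
    (["la", "maison"], ["the", "house"]),
    (["la", "maison", "bleue"], ["the", "blue", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
t = train_word_translation(corpus)
for word in ["la", "maison", "bleue", "fleur"]:
    print(word, "->", best_translation(t, word))
```

No word alignments are given in the corpus; the model infers them because "la" co-occurs with "the" in every pair, which in turn frees "maison" to explain "house". Phrase-to-phrase and parse-based techniques build on richer units than the single words used here.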
Text mining
The aim of the text mining project is to research technologies to
discover useful knowledge from enormous collections of documents,
and to develop a system to provide this knowledge and to support
the user's decisions. Data mining technologies usually mine knowledge
from data with well-formed schemas, such as relational tables. Text
data, however, have no such schema, and information is described freely
in the documents. Therefore, we focus on Natural Language Processing (NLP)
technologies to extract such information. Using NLP technologies,
documents are transformed into a collection of concepts, described
using terms discovered in the text.
XTeKS (Extensible Text Knowledge Services)
A prototype text analysis system that richly annotates documents
in a topic domain, based on thesauri and other input linguistic
resources. Annotations include entity names and selected categories
of facts and relations. These annotations can support advanced text
search, document clustering, and text mining applications. A
version of XTeKS called "BioTeKS" is being used to explore the analysis
of biomedical text for problem solving in the Life Sciences.
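Thesaurus-driven annotation of entity names can be sketched in a few lines of Python. The sketch assumes the thesaurus is a plain mapping from terms to categories and that whole-word, case-insensitive matching is sufficient; the `annotate` function and the sample entries are hypothetical and far simpler than a real annotator such as XTeKS.

```python
import re


def annotate(text, thesaurus):
    """Return (start, end, surface, category) spans for thesaurus terms in text."""
    spans = []
    for term, category in thesaurus.items():
        # Match the term as a whole word, ignoring case.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), match.group(0), category))
    return sorted(spans)


# Hypothetical biomedical thesaurus entries.
thesaurus = {"BRCA1": "GENE", "breast cancer": "DISEASE"}
text = "Mutations in BRCA1 are linked to breast cancer."

annotations = annotate(text, thesaurus)
for start, end, surface, category in annotations:
    print(f"{start}-{end} {surface!r} [{category}]")
```

Keeping character offsets with each annotation lets downstream search, clustering, or mining components refer back to the exact place in the document.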
|
Real-time machine translation
|
| Selected Publications |
| "A Multi-Agent Approach to using Redundancy and Reinforcement
in Question Answering", John Prager, Jennifer Chu-Carroll,
and Krzysztof Czuba. To appear in New Directions in Question
Answering, M. Maybury (ed.), 2004.
"HowtogetaChineseName: Segmentation and Combination Issues",
Hongyan Jing, Radu Florian, Xiaoqiang Luo, Tong Zhang, and
Abraham Ittycheriah, EMNLP-2003.
"In Question Answering, Two Heads are Better Than One",
Jennifer Chu-Carroll, Krzysztof Czuba, John Prager, and Abraham
Ittycheriah, HLT/NAACL-2003.
"Language Model Based Arabic Word Segmentation", Young-Suk Lee,
Kishore Papineni, Salim Roukos, Ossama Emam, and Hany Hassan, ACL-2003.
"Sentiment analysis: capturing favorability using natural language
processing",
Tetsuya Nasukawa and Jeonghee Yi, The Second International
Conference on Knowledge Capture (K-CAP 2003).
"Towards Ontologies On Demand", Youngja Park, Roy Byrd, and
Branimir Boguraev, Proceedings of Workshop on Semantic Web
Technologies for Scientific Search and Information Retrieval.
"tRuEcasIng", Lucian Vlad Lita, Abe Ittycheriah, Salim Roukos,
and Nanda Kambhatla, ACL-2003.
|
| |
| Recent Accomplishments |
|
Salim Roukos, program committee co-chair of HLT/NAACL-2004
Kishore Papineni, editor-in-chief of ACM Transactions on Speech and Language Processing
Branimir Boguraev, co-editor of Natural Language Engineering journal
Christopher Welty, editorial board member of Journal of Web Semantics
Wlodek Zadrozny, editorial board member of International Journal of AI Tools
|
|