
Research Areas
Text Categorization
The text categorization problem is to assign
predefined categories to an incoming unlabeled
message or document containing text, based
on information extracted from a training
set of labeled messages or documents. Our early work (see Apte et al., 1994) first demonstrated the viability of using
symbolic rule induction for text categorization
and the utility of using "local feature
sets" (one feature set per category)
to significantly improve accuracy. Our more
recent research on statistical methods has
resulted in a new approach to linear classifiers
based on regularization methods from numerical
analysis.
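To make the idea of a regularized linear classifier concrete, here is a minimal sketch, assuming a ridge-style (squared loss plus L2 penalty) objective trained by plain gradient descent on bag-of-words counts. It is a generic illustration, not the method described in our publications; the toy documents, the function names, and the values of `lam`, `lr`, and `epochs` are all assumptions made for the example.

```python
from collections import Counter

def featurize(text, vocab):
    """Map a text to a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocab]

def train_ridge(X, y, lam=0.1, lr=0.05, epochs=500):
    """Minimize (1/2n) * sum((w.x + b - y)^2) + (lam/2) * ||w||^2."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0                     # the bias is not regularized
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 if the document is assigned to the category, else -1."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score > 0 else -1

# Toy training set for a single hypothetical category ("grain").
docs = [
    ("wheat corn harvest", 1),
    ("corn export wheat", 1),
    ("stock shares market", -1),
    ("shares bank stock", -1),
]
vocab = sorted({term for text, _ in docs for term in text.split()})
X = [featurize(text, vocab) for text, _ in docs]
y = [label for _, label in docs]
w, b = train_ridge(X, y)

print(predict(w, b, featurize("wheat harvest", vocab)))
print(predict(w, b, featurize("bank market", vocab)))
```

The penalty term `lam` keeps the weights small, which matters in text problems where the number of features is large compared with the number of labeled documents. One such classifier would be trained per category.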
Trainable Information Extraction Systems
In the field of natural language processing,
one important research area is the extraction
and formatting of information from unstructured
text. Although the most accurate systems
often involve language processing modules
that are hand-built, substantial progress
has been made in applying supervised machine
learning techniques to a number of processes
necessary for extracting information from
text. We are developing accurate and efficient
machine learning algorithms for commercially
viable information extraction systems including
interactive (bootstrapping) learning algorithms
and a new approach to symbolic pattern generalization
applicable to relational learning from text.
Parsing Systems
We have a long history of research on natural
language parsing, from shallow pattern recognition
to deep syntactic parsing. Check out our
recent work on indexing for typed feature structures and abstract machine technology for finite state
transduction.
Knowledge Representation and Inferencing for NLP
In this area, our work shows how ideas from lattice theory can
be applied to implement a knowledge
representation language for natural language
processing. On a different thread, we are also developing
an inference engine (LPC) supporting both Horn
clause reasoning and full first-order logic.
Its knowledge base is partitioned into individual
theories, and theories may be added or removed
from the KB as needed. LPC incorporates a
version of constraint logic, and supports
procedural attachment (specifically, the
association of functions and predicates with
Java methods).
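The notion of a knowledge base partitioned into theories that can be added or removed can be illustrated with a small sketch. It assumes propositional Horn clauses and naive forward chaining only; it does not reflect LPC's actual interface, and it covers neither full first-order logic, constraint logic, nor procedural attachment. The class name, method names, and example clauses are hypothetical.

```python
class KnowledgeBase:
    """A knowledge base split into named theories of Horn clauses.

    Each clause is a (head, body) pair: the head holds if every atom
    in the body holds. A fact is a clause with an empty body.
    """

    def __init__(self):
        self.theories = {}

    def add_theory(self, name, clauses):
        self.theories[name] = list(clauses)

    def remove_theory(self, name):
        self.theories.pop(name, None)

    def entails(self, goal):
        """Forward-chain over all currently loaded theories."""
        clauses = [c for theory in self.theories.values() for c in theory]
        known = set()
        changed = True
        while changed:
            changed = False
            for head, body in clauses:
                if head not in known and all(atom in known for atom in body):
                    known.add(head)
                    changed = True
        return goal in known

kb = KnowledgeBase()
kb.add_theory("facts", [("bird_tweety", [])])
kb.add_theory("rules", [("flies_tweety", ["bird_tweety"])])
print(kb.entails("flies_tweety"))   # both theories loaded

kb.remove_theory("rules")
print(kb.entails("flies_tweety"))   # the rule is no longer available
```

Because inference only sees the theories currently loaded, removing a theory withdraws every conclusion that depended on it without touching the rest of the knowledge base.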
Interactive Natural Language Systems
Our group has recently developed an Interactive
Natural Language Response System based on
the idea of Dialogue Categorization. This technology has been used to construct
several WWW Auto-Response Systems in both
business and technical domains.
Statistical Machine Learning
We are working on methods based on statistical
machine learning for text mining and natural
language processing. In this approach, we
treat a general text processing task as an
annotation problem. We want to predict some
unknown attributes (annotations) of text
based on observed attributes (a text token
sequence). The way to achieve this is by
using machine learning algorithms on some
human annotated text to construct a predictor
that can automatically annotate future text.
We are interested in the design of machine learning algorithms that
are both accurate and computationally efficient;
the application of machine learning to natural
language (how to extract useful linguistic information
as features for machine learning algorithms,
how to divide and cascade linguistic processes);
and the reduction of human effort for data annotation.
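The annotation view described above can be sketched as follows. This is a deliberately simple baseline, assuming a most-frequent-label predictor that backs off from the word itself to a coarse word-shape feature; it is not one of the algorithms we are developing. The label set (`ORG`, `PER`, `O`), the `shape` feature, and the training sentences are made up for illustration.

```python
from collections import Counter, defaultdict

def shape(token):
    """A coarse observed attribute of a token (hypothetical feature)."""
    if token[:1].isupper():
        return "Cap"
    if token.isdigit():
        return "Num"
    return "low"

def train(annotated_sentences):
    """Count how often each word and each shape carries each label."""
    by_word = defaultdict(Counter)
    by_shape = defaultdict(Counter)
    for sentence in annotated_sentences:
        for token, label in sentence:
            by_word[token.lower()][label] += 1
            by_shape[shape(token)][label] += 1
    return by_word, by_shape

def annotate(model, tokens):
    """Predict one label per token, backing off from word to shape."""
    by_word, by_shape = model
    labels = []
    for token in tokens:
        counts = by_word.get(token.lower()) or by_shape.get(shape(token))
        labels.append(counts.most_common(1)[0][0] if counts else "O")
    return labels

# Human-annotated text: each token is paired with its annotation.
training = [
    [("IBM", "ORG"), ("hired", "O"), ("Smith", "PER")],
    [("Kana", "ORG"), ("sells", "O"), ("software", "O")],
]
model = train(training)
print(annotate(model, ["Amacis", "sells", "email"]))
```

Even this baseline shows the three concerns listed above: the predictor is only as good as its features, richer linguistic features would be computed by earlier processing stages, and every training example costs human annotation effort.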
Real World Applications
Our most recent work on Text Categorization is available in product form as the IBM Text Analyzer, a WebSphere Business Component, and is embedded in the IBM Enterprise Information Integrator (aka IBM Enterprise Information Portal), a content management offering.
Off the shelf, the IBM Text Analyzer achieves
a precision/recall break-even point of 84% on the
Reuters-21578 dataset. The product includes a GUI supporting
iterative development of text categorizers
as well as a rule editor for manually editing
or writing rules. The July 2001 release offers,
as an alternative to the rule induction system,
a statistical linear classifier system based
on regularization methods from numerical
analysis. On Reuters-21578, the linear
classifier achieves a break-even point of
87% (see Zhang et al. 2001).
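For readers unfamiliar with the break-even point, the following sketch shows one common way to compute it for a single category, assuming the classifier outputs a score per document. If the top k documents are accepted, where k is the number of truly relevant documents, precision and recall are equal by construction, and that shared value is the break-even point. How the figures quoted above were averaged across categories is not specified here, so the code makes no claim to reproduce them; the scores and labels are invented.

```python
def break_even_point(scores, labels):
    """Precision (= recall) when accepting as many documents as there
    are true positives. Labels are 1 (relevant) or 0 (not relevant)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = sum(label for _, label in ranked[:n_pos])
    return true_positives / n_pos

# Invented classifier scores and gold labels for six documents.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(round(break_even_point(scores, labels), 3))
```

In this example two of the three top-ranked documents are relevant, so precision and recall both equal 2/3 at the break-even threshold.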
The IBM Text Analyzer is embedded in several
customer email handling solutions, including
Amacis Visibility™ and Kana Response™, the latter
jointly announced by IBM and Kana for the Asia-Pacific market.
Read the success story at HSBC.
Last update: December 12, 2001