

Research Areas


Text Categorization
The text categorization problem is to assign predefined categories to an incoming unlabeled message or document, based on information extracted from a training set of labeled messages or documents. Our early work (see Apte et al., 1994) first demonstrated the viability of symbolic rule induction for text categorization and the utility of "local feature sets" (one feature set per category) for significantly improving accuracy. Our more recent research on statistical methods has resulted in a new approach to linear classifiers based on regularization methods from numerical analysis.
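As a rough illustration of the two ideas above (a per-category feature set and a regularized linear classifier), here is a minimal sketch. It is not the algorithm from the cited papers: it assumes L2-regularized logistic regression trained by batch gradient descent as a stand-in, and the helper names (`local_features`, `train`, `predict`), the feature-scoring rule, and the toy documents are all hypothetical.

```python
import math
from collections import Counter

def local_features(docs, labels, k):
    """Pick k words for ONE category: those whose document frequency
    differs most between positive and negative training documents."""
    pos = Counter(w for d, y in zip(docs, labels) if y == 1 for w in set(d.split()))
    neg = Counter(w for d, y in zip(docs, labels) if y == 0 for w in set(d.split()))
    n_pos = max(1, sum(labels))
    n_neg = max(1, len(labels) - sum(labels))
    vocab = set(pos) | set(neg)
    score = {w: abs(pos[w] / n_pos - neg[w] / n_neg) for w in vocab}
    # Sort by descending score, breaking ties alphabetically.
    return sorted(vocab, key=lambda w: (-score[w], w))[:k]

def train(docs, labels, features, lam=0.01, lr=0.5, epochs=200):
    """L2-regularized logistic regression (an assumed stand-in) over
    binary word-presence features, fit by batch gradient descent."""
    w = {f: 0.0 for f in features}
    b = 0.0
    n = len(docs)
    rows = [set(d.split()) & set(features) for d in docs]
    for _ in range(epochs):
        grad_w = {f: lam * w[f] for f in features}  # gradient of the L2 penalty
        grad_b = 0.0
        for row, y in zip(rows, labels):
            z = b + sum(w[f] for f in row)
            p = 1.0 / (1.0 + math.exp(-z))
            for f in row:
                grad_w[f] += (p - y) / n
            grad_b += (p - y) / n
        for f in features:
            w[f] -= lr * grad_w[f]
        b -= lr * grad_b
    return w, b

def predict(doc, w, b):
    """Return 1 if the document is assigned to the category, else 0."""
    z = b + sum(w[f] for f in set(doc.split()) if f in w)
    return 1 if z > 0 else 0

# Toy data for a single hypothetical category ("grain").
docs = ["wheat corn harvest", "corn exports rise", "wheat prices fall",
        "bank raises rates", "oil prices rise", "bank profit falls"]
labels = [1, 1, 1, 0, 0, 0]
features = local_features(docs, labels, 3)
weights, bias = train(docs, labels, features)
```

In a multi-category setting, one such feature set and classifier would be built per category, so each category's classifier sees only the words that matter for it.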

Trainable Information Extraction Systems
In the field of natural language processing, one important research area is the extraction and formatting of information from unstructured text. Although the most accurate systems often involve language processing modules that are hand-built, substantial progress has been made in applying supervised machine learning techniques to a number of processes necessary for extracting information from text. We are developing accurate and efficient machine learning algorithms for commercially viable information extraction systems including interactive (bootstrapping) learning algorithms and a new approach to symbolic pattern generalization applicable to relational learning from text.

Parsing Systems
We have a long history of research on natural language parsing, from shallow pattern recognition to deep syntactic parsing. Check out our recent work on indexing for typed feature structures and abstract machine technology for finite state transduction.

Knowledge Representation and Inferencing for NLP
In this area, our work shows how ideas from lattice theory can be applied to implementing a knowledge representation language for natural language processing. On a different thread, we are also developing an inference engine (LPC) supporting both Horn clause reasoning and full first-order logic. Its knowledge base (KB) is partitioned into individual theories, which may be added to or removed from the KB as needed. LPC incorporates a version of constraint logic and supports procedural attachment (specifically, the association of functions and predicates with Java methods).
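To make the idea of a knowledge base partitioned into theories more concrete, here is a small sketch of forward chaining over propositional Horn clauses grouped into named theories. It does not reflect LPC's actual interface or implementation: the class and method names are hypothetical, and the sketch omits first-order logic, constraint logic, and procedural attachment entirely.

```python
class KnowledgeBase:
    """Toy KB: a set of named theories, each a list of Horn clauses.

    A clause is (body, head): if every atom in body holds, head holds.
    A fact is a clause with an empty body.
    """

    def __init__(self):
        self.theories = {}  # theory name -> list of (frozenset body, head)

    def add_theory(self, name, clauses):
        self.theories[name] = [(frozenset(body), head) for body, head in clauses]

    def remove_theory(self, name):
        self.theories.pop(name, None)

    def entails(self, goal):
        """Forward-chain over all currently loaded theories."""
        clauses = [c for theory in self.theories.values() for c in theory]
        known = set()
        changed = True
        while changed:
            changed = False
            for body, head in clauses:
                if head not in known and body <= known:
                    known.add(head)
                    changed = True
        return goal in known

kb = KnowledgeBase()
kb.add_theory("taxonomy", [(["dog"], "mammal"), (["mammal"], "animal")])
kb.add_theory("facts", [([], "dog")])
with_taxonomy = kb.entails("animal")     # derivable while both theories are loaded
kb.remove_theory("taxonomy")
without_taxonomy = kb.entails("animal")  # no longer derivable
```

Removing a theory changes what can be derived without touching the remaining theories, which is the practical benefit of partitioning.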

Interactive Natural Language Systems
Our group has recently developed an Interactive Natural Language Response System based on the idea of Dialogue Categorization. This technology has been used to construct several WWW Auto-Response Systems in both business and technical domains.

Statistical Machine Learning
We are working on statistical machine learning methods for text mining and natural language processing. In this approach, we treat a general text processing task as an annotation problem: we want to predict unknown attributes (annotations) of text from observed attributes (a text token sequence). To do this, we apply machine learning algorithms to human-annotated text to construct a predictor that can automatically annotate future text. We are interested in the design of machine learning algorithms that are both accurate and computationally efficient; the application of machine learning to natural language (how to extract useful linguistic information as features for machine learning algorithms, how to divide and cascade linguistic processes); and the reduction of human effort for data annotation.
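The annotation view can be sketched as follows: each token becomes a set of observed features, and a learner trained on human-annotated text predicts one annotation per token. This is only an illustration under stated assumptions: it uses a plain multiclass perceptron rather than any algorithm from our publications, and the feature templates, function names, labels, and example sentences are all hypothetical.

```python
from collections import defaultdict

def token_features(tokens, i):
    """Observed attributes for token i: the word and its immediate context."""
    word = tokens[i]
    return [
        "word=" + word.lower(),
        "is_capitalized=" + str(word[0].isupper()),
        "prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"),
        "next=" + (tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"),
    ]

def train_perceptron(annotated, epochs=5):
    """Learn (feature, label) weights from (tokens, tags) pairs."""
    weights = defaultdict(float)
    labels = sorted({label for _, tags in annotated for label in tags})
    for _ in range(epochs):
        for tokens, tags in annotated:
            for i, gold in enumerate(tags):
                feats = token_features(tokens, i)
                guess = max(labels, key=lambda l: sum(weights[(f, l)] for f in feats))
                if guess != gold:
                    # Standard perceptron update: reward gold, penalize guess.
                    for f in feats:
                        weights[(f, gold)] += 1.0
                        weights[(f, guess)] -= 1.0
    return weights, labels

def annotate(tokens, weights, labels):
    """Predict one annotation per token with the learned weights."""
    return [
        max(labels, key=lambda l: sum(weights.get((f, l), 0.0)
                                      for f in token_features(tokens, i)))
        for i in range(len(tokens))
    ]

# Tiny hand-annotated corpus (hypothetical labels: ORG, PER, O).
annotated = [
    ("IBM hired Alice".split(), ["ORG", "O", "PER"]),
    ("Alice joined IBM".split(), ["PER", "O", "ORG"]),
]
weights, labels = train_perceptron(annotated)
```

Richer linguistic features (for example, part-of-speech tags produced by an earlier stage) would enter through `token_features`, which is where the questions of feature extraction and cascading processes show up in practice.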


Real World Applications


Our most recent work on Text Categorization is available in product form as the IBM Text Analyzer, a WebSphere Business Component, and is embedded in the IBM Enterprise Information Integrator (aka IBM Enterprise Information Portal), a content management offering.

Off the shelf, the IBM Text Analyzer achieves a break-even point of 84% on the Reuters-21578 dataset. The product includes a GUI supporting iterative development of text categorizers as well as a rule editor for manually editing or writing rules. The July 2001 release offers, as an alternative to the rule induction system, a statistical linear classifier system based on regularization methods from numerical analysis. On Reuters-21578, the linear classifier achieves a break-even point of 87% (see Zhang et al. 2001).
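For readers unfamiliar with the metric, the break-even point is the value at which precision equals recall. The sketch below computes it for a single category by ranking documents by classifier score and cutting the ranking at R, the number of truly relevant documents, where precision and recall coincide by construction. The function name and the toy scores are hypothetical, and this is not necessarily the exact procedure behind the figures quoted above, which may involve averaging across categories or interpolation.

```python
def break_even_point(scores, labels):
    """Precision/recall break-even point for one category.

    scores: classifier scores, higher means more likely in the category.
    labels: 1 if the document truly belongs to the category, else 0.
    """
    n_relevant = sum(labels)
    if n_relevant == 0:
        raise ValueError("no positive examples for this category")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    # Among the top R documents, count the truly relevant ones.
    hits = sum(label for _, label in ranked[:n_relevant])
    # precision = hits / R and recall = hits / R, so they are equal here.
    return hits / n_relevant

# Toy example: 3 relevant documents, 2 of them ranked in the top 3.
bep = break_even_point([0.9, 0.8, 0.7, 0.4, 0.2], [1, 0, 1, 1, 0])
```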

The IBM Text Analyzer is embedded in several customer email handling solutions, including Amacis Visibility™ and Kana Response™, the latter jointly announced by IBM and Kana for the Asia-Pacific market.


Read the success story at HSBC.



Last Update : December 12, 2001
