

Data Abstraction Research Project

Technical Activities

This is a brief sampling of ongoing research and development activities. For a more comprehensive perspective, please check the group's publication page.

Systems and Solutions



Underwriting Profitability Analysis solution (UPA)

The UPA (Underwriting Profitability Analysis) application embodies a new approach to mining Property & Casualty (P&C) insurance policy and claims data for the purpose of constructing predictive models for insurance risks. UPA uses the ProbE (Probabilistic Estimation) predictive modeling class library to discover risk characterization rules by analyzing large and noisy insurance data sets. Each rule defines a distinct risk group and its level of risk. To satisfy regulatory constraints, the risk groups are mutually exclusive and exhaustive. The rules generated by ProbE are statistically rigorous, interpretable, and credible from an actuarial standpoint. The ProbE library itself is scalable, extensible, and embeddable. Our approach to modeling insurance risks, and the implementation of that approach, have been validated in an actual engagement with a P&C firm. The benefit assessment of the results suggests that this methodology provides significant value to the P&C insurance risk management process.
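The requirement that risk groups be mutually exclusive and exhaustive can be illustrated with a small sketch. This is not ProbE or UPA code: the rule names, policy fields (`driver_age`, `annual_miles`), and thresholds below are hypothetical and chosen only to show the property that every policy falls into exactly one group.

```python
# Hypothetical risk-group rules; names, fields, and thresholds are illustrative only.
RULES = [
    ("young_driver",
     lambda p: p["driver_age"] < 25),
    ("experienced_high_mileage",
     lambda p: p["driver_age"] >= 25 and p["annual_miles"] > 15000),
    ("experienced_low_mileage",
     lambda p: p["driver_age"] >= 25 and p["annual_miles"] <= 15000),
]


def assign_risk_group(policy):
    """Return the one rule that matches the policy.

    Raises ValueError if zero rules match (not exhaustive) or
    more than one rule matches (not mutually exclusive).
    """
    matches = [name for name, condition in RULES if condition(policy)]
    if len(matches) != 1:
        raise ValueError(f"rules overlap or leave a gap for {policy}: {matches}")
    return matches[0]


policies = [
    {"driver_age": 22, "annual_miles": 20000},
    {"driver_age": 40, "annual_miles": 18000},
    {"driver_age": 55, "annual_miles": 6000},
]
groups = [assign_risk_group(p) for p in policies]
```

Checking the match count on every record is a simple way to confirm, on a given data set, that a hand-edited rule set still partitions the policies.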

The UPA solution is currently available in the marketplace as a component of IBM's Decision Edge for Insurance warehouse and data mining solution suite, as well as for use in customer consulting engagements.


Probabilistic Estimation class library (ProbE)

The ProbE (Probabilistic Estimation) class library is a framework for data modeling geared to rule induction algorithms. It is embeddable, i.e., targeted for customized solution building, and can be packaged as a kernel with settings and results files. ProbE is also designed to be extensible, i.e., designed for seamless incorporation of diverse data models.

The ProbE class library is C++ based, with two clearly defined sets of APIs for extension and embedding. It is designed to exploit the IBM Intelligent Miner's data access API, and also designed with a view towards data-parallel implementations and system error-recovery support.

ProbE is available as a research prototype for select customer engagements.


Rule Abstraction for Modeling and Prediction (RAMP)

The Rule Abstraction for Modeling and Prediction (RAMP) system is a research prototype system that packages a collection of innovative algorithms that can be used in classification and regression modeling.

Overview

Generating accurate and robust models is crucial to the successful use and deployment of classifiers on a large scale. Rule induction, i.e., generating decision rule models from data, is often a preferred approach to classification modeling and prediction, due to the enhanced explanatory capability and interpretability of decision rules.

The RAMP system for rules abstraction and modeling is evolving with accuracy and robustness as primary goals. The system provides the following key capabilities:

  1. feature analysis and selection based upon the contextual merits technique
  2. optimal discretization of numerical features based upon dynamic programming
  3. generation of minimal DNF (Disjunctive Normal Form) rules based upon the R-MINI algorithm
  4. rule based regression
  5. rule pruning, weighting, and editing
  6. alternate rule application strategies
  7. accuracy evaluation of the model on test data
  8. hierarchical capability for case management, which helps end-users carry out multiple experiments on a data set and manage these experiments as a set of related cases

RAMP has been utilized in several large-scale real-life applications and some benchmark tasks, which demonstrate its robustness. A detailed description of this system is available in an IBM Research Division technical report: "RAMP: Rules Abstraction for Modeling and Prediction" by C. Apte, S.J. Hong, J. Lepre, S. Prasad, and B. Rosen, IBM RC-20271.
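To make the idea of discretization by dynamic programming (capability 2 above) concrete, here is a minimal sketch. The objective RAMP actually optimizes is not stated here, so this example assumes a simple one: minimize the number of training errors when each interval predicts its majority class. The function name and parameters are hypothetical, and the input is assumed to be non-empty.

```python
from collections import Counter


def optimal_discretization(values, labels, max_intervals):
    """Split a numeric feature into at most max_intervals intervals.

    Assumed objective: minimize training errors when each interval
    predicts its majority class. Returns (errors, sorted_cut_points).
    """
    # Group labels by distinct value so cuts fall only between distinct values.
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, Counter())[y] += 1
    distinct = sorted(by_value)
    n = len(distinct)

    # cost[i][j]: errors if distinct values i..j-1 form a single interval.
    cost = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        counts = Counter()
        for j in range(i + 1, n + 1):
            counts.update(by_value[distinct[j - 1]])
            cost[i][j] = sum(counts.values()) - max(counts.values())

    inf = float("inf")
    # best[k][j]: minimum errors covering the first j values with exactly k intervals.
    best = [[inf] * (n + 1) for _ in range(max_intervals + 1)]
    back = [[0] * (n + 1) for _ in range(max_intervals + 1)]
    best[0][0] = 0
    for k in range(1, max_intervals + 1):
        for j in range(1, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + cost[i][j]
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    back[k][j] = i

    # Prefer the fewest intervals among those reaching the minimum error.
    k_best = min(range(1, max_intervals + 1), key=lambda k: (best[k][n], k))

    # Walk the back-pointers; each cut is the midpoint between adjacent values.
    cuts, j, k = [], n, k_best
    while k > 1:
        i = back[k][j]
        cuts.append((distinct[i - 1] + distinct[i]) / 2)
        j, k = i, k - 1
    return best[k_best][n], sorted(cuts)


errors, cuts = optimal_discretization([1, 2, 3, 4, 5, 6], list("aabbaa"), 3)
```

The table has one entry per (interval count, prefix length) pair, so this sketch runs in time proportional to `max_intervals` times the square of the number of distinct values, and it is exact for the assumed objective.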


Current Research Activities

Optimal partitioning of nominal valued features for decision tree learning

To find the optimal branching of a nominal attribute at a node in an L-ary decision tree, one is often forced to search over all possible L-ary partitions for the one that yields the minimum impurity measure. For binary trees (L=2) with just two classes, a shortcut search is possible that is linear in n, the number of distinct values of the attribute. For the general case in which the number of classes, k, may be greater than two, Burshtein et al. have shown that the optimal partition satisfies a condition that involves the existence of L(L-1)/2 hyperplanes in the class probability space. We derive a property of the optimal partition for the Gini and entropy impurity measures in terms of the existence of L vectors in the dual of the class probability space, which implies the earlier condition. Unfortunately, these insights still do not offer a practical search method when n and k are large, even for binary trees. We therefore present a new heuristic search algorithm to find a nearly optimal partition. It is based on ordering the attribute's values according to their principal component scores in the class probability space, and is linear in n. We demonstrate the effectiveness of the new method through Monte Carlo simulation experiments and compare its performance against other heuristic methods. Details are available in the technical report "Optimal Partitioning of Nominal Attributes in Decision Trees" by D. Coppersmith, S.J. Hong, and J. Hosking, IBM Research Division Technical Report RC-21114, as well as in a forthcoming paper in the Journal of Data Mining and Knowledge Discovery, titled "Partitioning Nominal Attributes in Decision Trees".
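The ordering heuristic can be sketched for the binary (L=2) case as follows. This is an illustration under stated assumptions, not the algorithm as specified in the technical report: it assumes a count-weighted covariance of the class probability vectors, approximates the first principal direction by power iteration, and scores candidate splits with the Gini impurity. All function names are hypothetical. The input must contain at least two attribute values, each with at least one observation.

```python
import math
import random


def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts) if total else 0.0


def first_principal_direction(cov, iterations=200, seed=0):
    """Approximate the leading eigenvector of a symmetric PSD matrix (power iteration)."""
    rng = random.Random(seed)
    k = len(cov)
    w = [rng.uniform(-1.0, 1.0) for _ in range(k)]
    for _ in range(iterations):
        w_next = [sum(cov[a][b] * w[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w_next))
        if norm < 1e-12:  # degenerate: every value has the same class distribution
            break
        w = [x / norm for x in w_next]
    return w


def pc_ordered_binary_split(class_counts):
    """class_counts maps each nominal value to its list of per-class counts.

    Returns (left_values, right_values, weighted_gini) for the best of the
    n-1 cuts along the principal-component ordering of the values.
    """
    values = list(class_counts)
    k = len(next(iter(class_counts.values())))
    sizes = {v: sum(class_counts[v]) for v in values}
    total = sum(sizes.values())
    probs = {v: [c / sizes[v] for c in class_counts[v]] for v in values}

    # Count-weighted mean and covariance of the class probability vectors.
    mean = [sum(sizes[v] * probs[v][a] for v in values) / total for a in range(k)]
    cov = [[sum(sizes[v] * (probs[v][a] - mean[a]) * (probs[v][b] - mean[b])
                for v in values) / total
            for b in range(k)] for a in range(k)]

    # Order the values by their score along the first principal direction.
    w = first_principal_direction(cov)
    ordered = sorted(values, key=lambda v: sum(p * x for p, x in zip(probs[v], w)))

    # Scan the n-1 cuts, moving one value at a time from the right side to the left.
    left = [0] * k
    right = [sum(class_counts[v][a] for v in values) for a in range(k)]
    best = None
    for i, v in enumerate(ordered[:-1]):
        for a in range(k):
            left[a] += class_counts[v][a]
            right[a] -= class_counts[v][a]
        n_left = sum(left)
        impurity = (n_left * gini(left) + (total - n_left) * gini(right)) / total
        if best is None or impurity < best[0]:
            best = (impurity, i + 1)
    impurity, cut = best
    return ordered[:cut], ordered[cut:], impurity


counts = {"red": [9, 1, 0], "blue": [8, 2, 0], "green": [0, 1, 9], "gray": [1, 0, 9]}
left, right, impurity = pc_ordered_binary_split(counts)
```

Once the values are ordered, only n-1 candidate splits are evaluated, instead of every possible two-way grouping of the n values. With two classes, the principal direction is proportional to (1, -1), so the ordering reduces to sorting by the proportion of one class.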

Boosting text categorization with adaptive resampling based learning

An important goal of text mining is to automatically classify electronic documents. Programs examine samples of text, looking for recurring patterns that correspond to pre-specified topics. Benchmark data, such as the Reuters-21578 test collection, have been used by researchers to measure advances in automated text categorization. Conventional methods like decision trees have had competitive, but not the best, predictive performance. Using the Reuters collection, we show the following: (a) adaptive sampling techniques can be used to boost the performance of decision trees and (b) relatively small pooled local dictionaries are effective. Results on the Reuters benchmark are superior, surpassing all previously reported results. Preliminary results are available in "Text Mining with Decision Trees and Decision Rules" by C. Apte, F. Damerau, and S.M. Weiss, in Conference on Automated Learning and Discovery, Carnegie Mellon University, June 1998. Details will appear in a paper in IEEE Intelligent Systems, titled "Maximizing Text Mining Performance".

Predicting performance of adaptive resampling based learning

Decision tree induction is a prominent learning method, typically yielding quick results with competitive predictive performance. However, it is not unusual to find other automated learning methods that exceed the predictive performance of a decision tree on the same application. To achieve near-optimal classification results, resampling techniques can be employed to generate multiple decision-tree solutions. These decision trees are individually applied and their answers voted. The potential for exceptionally strong performance is counterbalanced by the substantial increase in computing time to induce many decision trees. We describe estimators of predictive performance for voted decision trees induced from bootstrap (bagged) or adaptive (boosted) resampling. The estimates are found by examining the performance of a single tree and its pruned subtrees over a single training set and a large test set. Using publicly available collections of data, we show that these estimates are usually quite accurate, with occasional weaker estimates. The great advantage of these estimates is that they reveal the predictive potential of voted decision trees prior to applying expensive computational procedures. This joint work with N. Indurkhya of the University of Sydney is available as an IBM technical report, "Estimating Performance Gains for Voted Decision Trees" by N. Indurkhya and S.M. Weiss, IBM Research Division Technical Report RC-21199, to appear in Intelligent Data Analysis (IDA).
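For readers unfamiliar with voted trees, the following sketch shows bootstrap resampling (bagging) with majority voting. It does not implement the performance estimators described above. To stay short it uses one-level trees (stumps) on a single numeric feature, and every name in it is hypothetical.

```python
import random
from collections import Counter


def majority(counter, default):
    return counter.most_common(1)[0][0] if counter else default


def train_stump(samples):
    """Fit a one-level tree on (x, label) pairs, minimizing training errors."""
    xs = sorted({x for x, _ in samples})
    overall = majority(Counter(y for _, y in samples), None)
    best = None
    # Candidate thresholds: one below all values, then midpoints between neighbors.
    for t in [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]:
        lo = Counter(y for x, y in samples if x <= t)
        hi = Counter(y for x, y in samples if x > t)
        lo_label, hi_label = majority(lo, overall), majority(hi, overall)
        errors = len(samples) - lo[lo_label] - hi[hi_label]
        if best is None or errors < best[0]:
            best = (errors, t, lo_label, hi_label)
    _, threshold, low_label, high_label = best
    return lambda x: low_label if x <= threshold else high_label


def bagged_stumps(samples, n_trees, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(samples) for _ in samples])
            for _ in range(n_trees)]


def vote(trees, x):
    """Apply every tree and return the most common answer."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]


data = [(x, "low") for x in range(10)] + [(x, "high") for x in range(10, 20)]
trees = bagged_stumps(data, n_trees=25)
```

Training cost grows linearly with the number of trees. That cost is why an estimate of the voted performance, obtained before inducing all the trees, is useful.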
