Skip to main content
IBM Research
  Seminars
DAR Pages

Home

Agenda

Activities

Publications

Seminars

People
Watson KDD Seminar Series

The Computer Science of Big Science Statistics

Prof. Andrew Moore

Carnegie-Mellon University

This talk is about the algorithmic challenges involved in allowing statisticians, probabilists, physicists and biologists to continue using the statistical modeling and probabilistic inference tools they've been happily applying to megabytes of data, when they start drawing in terabytes of data. I will try to give a roadmap to the various literatures and tools from computational geometry, numerical analysis, databases, data mining and artificial intelligence. The technical meat of the talk will highlight some new algorithms and data structures my group has been developing over the period 1994-2001 that fall into the class of "cached sufficient statistics." These are summary data structures that live between the statistical algorithm and the database, intercepting the kinds of operations that have the potential to eat up valuable time if they were answered by direct reading of the dataset. I will give some computer demonstrations showing various classes of accelerations broadly covering density estimation speedups (1000 fold), clustering speedups (1000-10000 fold), mixture model and Bayes Net discovery (10-100000 fold), anomaly detection (100-fold) and 2-, 3-, 4-point correlation function computation (100-fold up to about a trillion-fold). We will also show how this can be applied to other problems such as (i) detecting hits in very noisy proteomics screenings, (ii) within a DARPA-sponsored biological warfare detector system and (iii) for detecting very subtle structure in distributions of spatial results. If time permits we will also discuss the use of this technology in Active Learning scenarios, in which the ability to do these analyses extremely quickly in turn permits on-the fly planning of near-optimal experiments needed to minimize the amount of data needed to learn to solve some task.

Back to Top


 

Data Mining: Next Generation Problems

Prof. Johannes Gehrke

Cornell University

Data mining as a discipline has matured considerably, and especially strong progress has been made in the design of scalable algorithms that transform the oceans of bits in very large databases into interpretable patterns and predictive models. The current data mining research in my group concentrates on three technical challenges for the next generation of data mining tools: Interpretability of data mining models, high-speed main-memory data mining, and mining and monitoring data streams.

In this talk, I will describe some recent results on the correction of a problem in classification trees construction: Bias in split selection. An attribute selection criterion is unbiased if its choice of a split attribute is only determined by the strength of the dependency between the split attribute and the class label attribute; otherwise the criterion is biased. Biased split criteria do not return the "best" split attribute at a node of the classification tree, but rather a attribute that has properties unrelated to its dependence with the class label (for example, a large domain) that "confuse" the split criterion. This problem is ubiquitous in today's data mining tools; for example, the split selection method in IBM DB2 Intelligent Miner is biased. I will first show results from a thorough experimental and theoretical study that shows the nature of the bias, and then I will talk about our recent results that allow us to provably eliminate this bias. A careful experimental study shows that our method is computationally scalable, and has excellent behavior in practice.

Back to Top


Mining High-Speed Data Streams

Prof. Pedro Domingos

University of Washington

Many organizations today have more than very large databases; they have databases that grow without limit at a rate of several million records per day. Mining these continuous data streams brings unique opportunities, but also new challenges. In this talk I will describe VFDT, an anytime system that builds decision trees using constant memory and constant time per example. VFDT can incorporate tens of thousands of examples per second using off-the-shelf hardware. It uses Hoeffding bounds to guarantee that its output is asymptotically nearly identical to that of a conventional learner. We have studied VFDT's properties and demonstrated its utility through an extensive set of experiments on synthetic data. VFDT is currently being applied to mining the continuous stream of Web access data from the whole University of Washington main campus.

(Joint work with Geoff Hulten.)

Back to Top


A Detailed Probabilistic Analysis of EM

Dr. Sanjoy Dasgupta

AT&T Research

We prove strong performance guarantees for a simple variant of EM for Gaussian mixtures. Specifically, we establish fairly general conditions under which it will converge to the globally optimal mixture (rather than a local optimum). We place this in the context of previous theoretical and experimental work on the EM algorithm. Our analysis includes a lot of new intuition about the workings of EM on high-dimensional data. For instance, it provides a principled way of initializing the clusters and later pruning them.

Back to Top


 

Active Learning Using Adaptive Resampling

Dr. Vijay Iyengar

T.J. Watson Research Center, IBM Research

Classification modeling (a.k.a. supervised learning) is an extremely useful analytical technique for developing predictive and forecasting applications. The explosive growth in data warehousing and internet usage has made large amounts of data potentially available for developing classification models. For example, natural language text is widely available in many forms (e.g., electronic mail, news articles, reports, and web page contents). Categorization of data is a common activity which can be automated to a large extent using supervised learning methods. Examples of this include routing of electronic mail, satellite image classification, and character recognition. However, these tasks require labeled data sets of sufficiently high quality with adequate instances for training the predictive models. Much of the on-line data, particularly the unstructured variety (e.g., text), is unlabeled. Labeling is usually a expensive manual process done by domain experts. Active learning is an approach to solving this problem and works by identifying a subset of the data that needs to be labeled and uses this subset to generate classification models. We present an active learning method that uses adaptive resampling in a natural way to significantly reduce the size of the required labeled set and generates a classification model that achieves the high accuracies possible with current adaptive resampling methods.

Back to Top


 

Linear Prediction Methods Using Regularization

Dr. Tong Zhang

T.J. Watson Research Center, IBM Research

In this talk, I give an overview of some recent developments on linear prediction methods using regularization. In the first part, I will talk about algorithmic issues. I will describe a duality of a convex programming formulation of linear prediction, which can be used to design efficient numerical algorithms. In particular, I illustrate large margin Perceptron-type algorithms and large margin Winnow-type algorithms. Relationship with other methods such as maximum entropy and basis pursuit will be discussed. We will also compare the numerical aspects of the dual algorithms to the primal algorithms. In the context of duality, Kernel methods will be briefly mentioned. I will compare formulations that lead to sparse dual variables (SV machines) and formulations that lead to sparse primal variables (basis pursuit-like machines) as well as their algorithmic implications. In the second part, I will discuss some theoretical results concerning regularized linear prediction. I will argue why the convergence in probability of estimated parameter to the true parameter is important and show how to obtain such results.I will also relate this to outlier issues. For linearly separable classification, I show that the robustness of the data distribution leads to an exponential decay rate of misclassification error. This can be compared with a typical large margin bound that decays at a rate of inverse sample size.

Back to Top


 

Knowledge Management in Medicine: Challenges and Solutions

Dr. Jeremy Wyatt

University College, London

Doctors’ decisions determine three quarters of health care costs and depend critically on medical knowledge. However, medical knowledge is complex and doubles in amount every 20 years. Managing this knowledge poses many challenges and requires a range of techniques. This talk will discuss these challenges and sample the solutions adopted over the last forty years, including methods to refine knowledge (evidence-based medicine), disseminate knowledge (bibliographic databases and the internet) and apply knowledge to influence professional practice (decision support systems, telemedicine and other approaches).

Back to Top


 

Lightweight Rule Induction

Dr. Sholom Weiss

T.J. Watson Research Center, IBM Research Division

How a simple method for learning decision rules from data gets great results. A lightweight rule induction method is described that generates compact Disjunctive Normal Form (DNF) rules. Each class has an equal number of unweighted rules. A new example is classified by applying all rules and assigning the example to the class with the most satisfied rules. The induction method attempts to minimize the training error with no pruning. An overall design is specified by setting limits on the size and number of rules. During training, cases are adaptively weighted using a simple cumulative error method. The induction method is nearly linear in time relative to an increase in the number of induced rules or the number of cases. Experimental results on large benchmark data sets demonstrate that predictive performance can rival the best reported results in the literature.

Back to Top


 

Probabilistic Clustering of Dynamic Behavior

Prof. Padhraic Smyth

University of California, Irvine

Clustering has a long history in exploratory data analysis. Searching for homogenous data clusters is a fundamental component of any endeavor to discover structure from data. This talk will begin with a brief overview of a general model-based probabilistic clustering framework. This approach, relatively well known in applied statistics, focuses on the use of mixture models as an underlying generative model for observed data. It also provides a relatively objective and sound methodology for determining the model which is closest to truth.

A particularly useful aspect of the probabilistic approach is the ability to generalize from clustering in vector-spaces to clustering sequences, trajectories, and other ``dynamic" data that we commonly observe from individuals and/or systems. The main part of the talk will introduce a new and general framework for mixture-model clustering of such data. The probabilistic approach solves in an elegant and coherent manner the dual difficulties of (a) how to define distance metrics between observations (e.g., sequences) of different sizes, and (b) how to ``weight" different individuals for whom we have different amounts of data. Illustrative applications include unsupervised learning of gestures from video data, clustering of individuals based on Web browsing behavior, modeling of gene expression data, modeling of cyclone trajectories, clustering locusts based on observed motor behavior, and clustering of medical patients based on histograms of red blood cells. A subset of these applications will be discussed, time permitting. The talk will conclude with a brief discussion of some apparently useful connections between mixture modeling in this context and Bayesian hierarchical models.

Back to Top


Data Mining with DB2 UDB Miner or "Cut the Hype, Let's Get Down to Business"

Vince Thomas

IBM Business Intelligence Consulting & Services

DB2 UDBMiner was developed on a customer engagement to provide a data mining environment that was integrated into a 33TB DB2 data mart running on a 186-node SP. In particular, it was designed to address the challenges of making data mining part of the business process rather than a marketing exercise. The innovation of building the model into a User Defined Function revolutionizes the way we deploy data mining solutions.

Back to Top


Gene Shaving: a New Class of Clustering Methods for Expression Arrays

Prof. Trevor Hastie

Stanford University

Through the use of recently developed DNA microarrays, it is now possible to obtain accurate, quantitative measurements of the proportional representation of thousands of genes present in a biological sample. DNA arrays have now been used to monitor changes in gene expression during important biological processes (e.g. cellular replication and the response to changes in the environment), and to study variation in gene expression across collections of related samples (e.g. tumor samples from patients with cancer). A major statistical task is to understand the structure of the data from such studies, which often consist of measurements of thousands of genes in tens of conditions.

A variety of clustering techniques have been applied to such data, and proven to be useful tools for identifying biologically relevant groupings of genes and samples. In the paper, we describe a new method called GENE SHAVING which tries to identify subsets of genes with coherent expression patterns with large variation across conditions. The technique differs from hierarchical and other clustering methods that have appeared in the literature. The technique can be unsupervised (that is, treat the genes and samples as unlabeled), or partially or fully supervised by known properties of the samples. We illustrate the method on data from a large survey of gene expression in a diverse collection of human tumor cell lines.

* joint work with Robert Tibshirani, Michael Eisen, Patrick Brown, David Botstein, and others

Back to Top


2001: A Statistical Odyssey

Dr. Daryl Pregibon

AT&T Labs Research

This talk is an interim report on the 5 year plan launched in 1996 to provide a theoretical and computational foundation of Statistics for massive data sets. The plan coincided with the formation of AT&T Labs and the proposed research agenda of the InfoLab, which is both a physical laboratory and an interdisciplinary collection of information researchers in CS, mathematics, and Statistics. At the halfway point of this odyssey we can identify some success stories but more importantly it is an opportune time to re-calibrate the challenges and the milestones.

Back to Top


Mixture Density Estimation

Dr. Andrew Barron

Yale University

Gaussian mixtures for density estimation provide a natural counterpart to sigmoidal neural networks for function fitting and approximation. In both cases, it is possible to give simple expressions for the iterative improvement of performance as components of the network are introduced one at a time. In particular, for mixture density estimation we show that a k-component mixture estimated by maximum likelihood (or by an iterative likelihood improvement that we introduce) achieves log-likelihood within order 1/k of the log-likelihood achievable by any convex combination. Consequences for approximation and estimation using Kullback-Leibler risk are also given. A Minimum Description Length principle selects the optimal number of components k that minimizes the risk bound.

Back to Top


Knowledge Representation and Discovery in Performance Management

Dr. Joseph L. Hellerstein

T.J. Watson Research Center

IBM Research Division

Performance management deals with ensuring the quality and accessibility of electronic-based services. Examples of such services include: transaction processing, web access, and eBusiness interactions. Over the last several years, the performance management department at IBM Research has developed a number of projects that rely heavily on knowledge representation and machine learning. For example, we developed a measurement technology (ETE) that provides operations with a means to dynamically change the definition of end-user transactions, an important consideration to accommodate varied user-interaction styles during problem determination. This technology employs rule-based techniques for on-line pattern recognition. Subsequent work has involved learning end-user transactions from examples. Other work, such as our event mining project, has involved pattern discovery to characterize problem situations. Special considerations here include noisy patterns and partial periodicities. This talk will summarize key projects in the performance management department, identifying knowledge representation and discovery considerations.

Back to Top


Extracting Information from the World WIde Web

Prof. Tom Mitchell

Center for Automated Learning & Discovery

Carnegie-Mellon University

Consider the fact that although your computer workstation can now retrieve any of 600,000,000 pages on the World Wide Web, it unfortunately cannot understand their content. This is, of course, because web pages are written to be understandable to people, not computers.

The goal of our research is to automatically extract a very large database of facts that mirror the content of the Web, and that can be manipulated by computer. If we can achieve this goal, it will enable using the web as a gargantuan data base and knowledge base to support a rich variety of applications. Our approach is to use machine learning algorithms to train a system to automatically extract information from web hypertext. For example, in one set of experiments our system was trained to extract descriptions of faculty, students, research projects, and courses from web sites of computer science departments. It then used these learned extraction routines to build a database containing thousands of new entries by automatically browsing new university web sites. The system is currently running 24 hours per day, and over the past eight months has built a knowledge base containing over 100,000 assertions, with an accuracy of roughly 70%.

This talk will present the machine learning algorithms we have developed to date, along with experimental results suggesting these methods can be quite effective for information extraction in certain domains.In information extraction and in language understanding related tasks inference heavily depends on knowledge about the language and the world. I will present research on the use of Machine Learning in performing these kind of ``knowledge-intensive inferences''.

Back to Top


Learning and Managing Knowledge in Large Scale Natural Language Inferences

Prof. Dan Roth

Dept. of Computer Science and the Beckman Institute
University of Illinois - Urbana/Champaign

In information extraction and in language understanding related tasks inference heavily depends on knowledge about the language and the world. I will present research on the use of Machine Learning in performing these kind of ``knowledge-intensive inferences''.

First, I will present a learning theory account of the major statistical approaches to learning in natural language and explain why they work although, typically, the assumptions they are based on do not hold in the data.

Then, I will present the SNoW learning architecture and discuss how it is used to support large-scale inference problems. The emphasis is on how SNoW can tolerate data of high dimensionality, how additional knowledge - acquired as the system interacts with various information sources is incorporated, and on making inferences that depend on the outcome of several classifiers. The approach will be exemplified with experimental evidence from several language understanding related tasks.

Back to Top


Bootstrap Training for Neural Networks

Prof. Masato Koda

University of Tsukuba, Japan

We propose a method of training neural networks based on bootstrap samples. We also propose a method of evaluating bootstrap estimates for training errors. Then we review 4 learning algorithms, which include stochastic learning method formulated by the present author, from which we selected the conjugate gradient training method for its convergence speed. We set up an experiment meeting the following requirements:
1. Complete sample space is known,
2. Sample space is finite and small so that training is fast,
3. Its distribution function is unbalanced so that the consequence of resampling is easy to assess.
The result shows that the learning performance is enhanced by resampling.

Back to Top


Adaptive Automated Diagnosis

Dr. Magnus Stensmo

IBM Global Business Intelligence Solutions, Almaden

Diagnosis is an important but complicated task whose automation would be very useful. If automated successfully, scarce resources could be made less so and diagnosis could be made more accurate and efficient. We have used decision and probability theory to construct automated diagnosis systems directly from databases of cases. This approach alleviates the task of knowledge extraction from experts, who often are unaware of how they have come to a certain diagnosis. Probability models were constructed using mixture models that were efficiently learned by the Expectation-Maximization algorithm. Problems with missing data are thereby solved, both for missing data in the case database, and during actual diagnosis when missing data constitute observations not yet made. Decision theory was used to find the most informative next question to ask, observation to make, or test to do in order to minimize the total cost for the diagnosis. It was also used to decide when to stop requesting more information. To automatically find a good utility function for the decision theoretic model, temporal-difference reinforcement learning was used to learn it from the data set. The use of reinforcement learning reduced the number of tests necessary to achieve the same accuracy as if all tests had been performed which results in significant savings. It is demonstrated how a complete decision theory controlled sequential decision process can be completely learned from data. No "expert" knowledge is needed since only a database was used to construct the systems, and the system is completely domain independent.

Back to Top


Optimizing Association Rules

Dmitry Zelenko

University of Illinois, Urbana-Champaign
Dept. of Computer Science

We analyze several problems of optimizing association rules. The problems have important applications in data mining, allowing users to focus at interesting rules extracted from databases. We consider association rules of the form



where Ai's are categorical attributes of the underlying relation R, and C is any fixed condition defined over the attributes of the relation R.

We study several problems, in which we seek assignments to the variables vi's of a given rule so that the rule satisfies certain optimality constraints. We exhibit efficient algorithms for solving the optimized support and optimized confidence problems, the weighted support/confidence problem, and the shortest rule problem. We discuss time and space complexity of the designed algorithms and show how they can be improved by allowing for approximate solutions. We also delineate how we can use one of the developed algorithms to solve efficiently an interesting special case of minimum flux cut, a classical problem in network design.


Back to Top


Distributed Data Mining For Fraud And Intrusion Detection

Salvatore J. Stolfo

Columbia University - Department of Computer Science

In this talk, we will provide an overview of the JAM distributed data mining system and its application to fraud and intrusion detection. JAM provides a set of machine learning programs encapsulated as JAVA agents that can be dispatched to remote database sites. It also features a meta-learning system that integrates and combines multiple models gathered from and computed by the remote database sites.

Our work on credit card fraud, joint with the Financial Services Technology Consortium, provides insight into a number of technical problems that must be addressed by any distributed data mining system. We will describe these and how they manifest themselves in the credit card application.

The study of credit card fraud was not only useful to the banks. The results we achieved, and solutions we invented along the way provided guidance in studying the general problem of network intrusion detection. We applied our techniques to this problem, and participated in the DARPA/MIT intrusion detection evaluation program. The models generated by our data mining programs produced the best performance over a number of other knowledge engineered intrusion detection systems.

However, our work on the credit card fraud problem strongly indicates that the metrics used in evaluating intrusion detection systems is wrong. We will describe the issues involved, and define "cost-based" metrics for intrusion detection. If time permits, we shall provide a very brief overview of MarketNet, an experimental system under development at Columbia by Y. Yemini, that may provide the ultimate solution.

Back to Top


MailCat: An Intelligent Assistant for Organizing E-Mail

Richard Segal

IBM T.J. Watson Research Center

MailCat is an intelligent assistant for Lotus Notes that helps users organize their e-mail into folders. MailCat uses a text classifier to learn each user's mail-filing habits. MailCat uses the model it learns to predict the three folders in which the user is most likely to place each incoming message. The predictions are presented to the user as three shortcut buttons that allow the user to quickly file each message into one of the predicted folders. When one of MailCat's predictions is correct, the effort required to file a message is reduced to a single button click. MailCat uses an incremental version of the AIM classifier to make its predictions. An incremental classifier is used to allow MailCat to continuously update its model of the user. We have performed a number of static and dynamic experiments with MailCat as well some initial user studies. The results show that the shortcut buttons provided by MailCat are useful between 80% to 90% of the time and this performance level is maintained over time despite the constantly changing classification problem. We also find that the bootstrap period for the classifier is extremely short and that MailCat is useful from the moment it is first installed.

Back to Top


Applications of Data Mining to Personalized Product Recommendation

Rick Lawrence

IBM T.J. Watson Research Center

Recommender systems such as used by Amazon.com and other e-businesses have become an increasingly popular approach to web-based personalization. These systems most often rely on variants of collaborative filtering, in which the user rates a subset of items, and then receives additional recommendations based on analysis of the purchases of other customers with 'similar' tastes. In this talk, we describe a product recommendation system developed as part of an overall pervasive computing solution for the grocery chain Safeway UK. Our approach combines aspects of both collaborative filtering and content-based filtering. Customer peer groups are determined via data mining clustering, and this information is used to construct a candidate recommendation list based on popular items among each customer's peers. This list is further refined by a content-based projection algorithm which uses relationships between product groups computed using data mining associations. We have also explored the use of product profit margins in determining product recommendations. We describe results from an initial deployment at Safeway. Finally, we briefly mention the application of similar ideas to ongoing work in identifying stock market trends.

Back to Top


AI Challenges and Opportunities for the 21st Century

David Waltz

NEC Research Institute

AI has been successful in fielding applications that mine large databases for facts and patterns, that plan for complex logistics problems, that allow programs to accept speech inputs, that help physicians diagnose diseases and monitor patients, that play games (e.g. chess, bridge, poker) at human levels, etc. But there are still no computational examples of human-level performance on many human abilities (e.g. visual perception, deep natural language understanding, flexible reasoning). In fact, there is not even a consensus about the nature of the mechanisms that underlie such human abilities. But such abilities are important for future applications such as intelligent environments, instructable assistants, and text search systems that actually find meaningful answers to questions rather than merely returning a number of texts that contain the words of a query.

This talk will argue that any system we would consider truly intelligence will depend critically on the ability to separate the important from the unimportant. AI has for the most part focussed on logic and reasoning in artificial situations where only relevant variables and operators are specified, and has paid insufficient attention to processes of reducing the richness and disorganization of the real world to a form where logical reasoning can be applied. This talk will discuss the role of importance in intelligence, provide some positive examples of research that makes use of importance judgements, and offer suggestions for new mechanisms, architectures, applications, and research directions for AI.

Back to Top


SportsMiner: A sporting guide to better data mining

Inderpal Bhandari

Virtual Gold Inc.

Data mining involves the automatic identification of interesting patterns in large amounts of data. I will preview SportsMiner, a web site for sports professionals, amateurs, and fans who wish to mine data of major sports - e.g. PGA Golf, World Cup Soccer, Grand Slam tennis - to find hidden patterns that reveal what leads them (or their heroes) to victory or defeat. I will describe how the patent-pending technology that underlies the above demonstrations enables rapid, cost-effective development and deployment of Internet-ready applications that bring data-driven knowledge discovery to the masses. I will demonstrate such a data mining application for corporate banking and discuss the implications of the new technology for the database and data mining marketplace.

Back to Top


A Tree Projection Algorithm for Generating Associations

Ramesh Agarwal

Networked Data Systems, IBM Research

Generating all frequent itemsets (association patterns) in a transaction database is an important data mining problem. Current association algorithms are particularly inefficient at mining long patterns at low support. In this talk, we will describe a new approach to counting support for association. We represent associations as a lexicographic tree of frequent itemsets. We discuss different strategies in generation and traversal of the lexicographic tree such as breadth-first search, depth first search, or a combination of the two. These techniques provide different trade-offs in terms of the I/O, memory, and the computational time requirements. We use the hierarchical structure of the lexicographic tree to successively project transactions at each node of the lexicographic tree, and use a matrix counting technique on the projected transactions for counting support of candidate itemsets. Tree projection is a very powerful technique and it greatly reduces the computing time. Let us consider a node of the tree corresponding to a frequent itemset P having support S. Then only this fraction of the transactions will project to this node. When the frequent itemset P is extended, then all candidate extensions of P are counted only against the subset of transactions which project to node P. For example assume S is 1%, then counting time to extend node P is reduced by a factor of 100 so. We have tested our algorithm on both real and synthetic data. In those cases, where the A Priori algorithm becomes compute bound, our algorithms provides one to two orders of magnitude performance improvement. In this talk various implementation details and performance comparison results will be presented.

Back to Top


Multiple Comparisons in Induction Algorithms: Understanding and Correcting a Fundamental Statistical Error

David Jensen

Computer Science Department, University of Massachusetts, Amherst

An increasing number of modern systems incorporate algorithms that construct predictive models from data. These induction algorithms exhibit a surprising number of pathological behaviors, including overfitting, oversearching, and attribute selection errors. In this talk, I discuss a single statistical error responsible for all of these pathologies. At the core of this error is the multiple comparison procedure (MCP): Algorithms generate multiple items (e.g., classification rules), score each item based on a data sample, and select the item with the maximum score. In this talk, I outline proofs of the statistical properties of MCPs and show how failure to adjust for these properties leads to each pathology. I will also discuss approaches that can control pathological behavior, including Bonferroni adjustment, cross-validation, and randomization tests. I conclude by addressing the question "Why does such a simple statistical error cause so much mischief?"

Back to Top


Tutorial on Statistical Learning Theory

Edwin Pednault

Mathematical Sciences Department, IBM Research Division

Statistical learning theory, also known as Vapnik-Chervonenkis (VC) theory, addresses a key question that arises when constructing predictive models from data---how to decide whether a particular model is adequate or whether a different model would produce better predictions. Whereas classical statistics typically assumes that the form of the correct model is known and the objective is to estimate the model parameters, statistical learning theory presumes that the correct form is completely unknown and the goal is to identify the best possible model from a set of competing models. The models need not have the same mathematical form and none of them need be correct. The theory provides a sound statistical basis for assessing model adequacy under these circumstances, which are precisely the circumstances encountered in machine learning, pattern recognition, and exploratory data analysis. The speaker will present an informal chalk-talk tutorial on statistical learning theory as a prelude to Vladimir Vapnik's visit on January 11. The tutorial will focus on fundamental concepts and the intuitions behind the mathematical development of the theory.

Back to Top


An Evaluation of Data-Driven Quantitative Investment Technologies on Wall Street

Vasant Dhar

Stern School of Business, New York University

Financial markets are highly data intensive. They are also highly complex and multifaceted, with different psychologies that manifest themselves periodically. In this talk, I evaluate a number of models that are used by the investment community to price securities. I also discuss some of the concerns of this community towards bottom-up approaches to model building in general, and the potential of data mining approaches in addressing them.

Back to Top


Predicting the Trajectory of Failures in a Dynamical System

Ashok Srivastava

IBM Almaden

Many dynamical systems operate under relatively constrained and known conditions, where the dynamics and exogenous inputs into the system are well known and quantified. These systems can generally be characterized using established time series technologies. In many cases, however, standard methods may not be well suited for systems where a slow degradation in the system is present. This introduces a problem of multiple time scales: one at the scale of the observations, and the other at the scale of the degradation.

In this talk, we present a methodology to predict failure and classify the type of failure from a dynamical system in which interesting phenomenon exist on at least two time scales. The observations are on the order of milliseconds, and the development of the failure may be on the order of weeks. The problem is to predict the failure mode potentially far in the future given measurements on a very rapid time scale.

Our results are presented in the context of a benchmark data set (the "Hollins Data Set") from the rotorcraft industry, where the problem is to predict the failure of the rotorcraft transmission given a set of vibration measurements. We demonstrate that our methodology produces a high accuracy prediction using only a small number of features (less than 5), whereas the best competing methodologies use hundreds of features in an extremely complex classifier and produce similar classification rates.

This work was done in collaboration with Dr. Ed Huff of NASA Ames Research Center.

Back to Top


Data Mining for Direct Marketing -- Problems and Solutions

Charles Ling

University of Western Ontario

Direct marketing is a process of identifying likely buyers of certain products and promoting the products accordingly. It is increasingly used by banks, insurance companies, and the retail industry. Data mining can provide an effective tool for direct marketing. During data mining, several specific problems arise. For example, the class of distribution is extremely imbalanced (the response rate is about 1%), the predictive accuracy is no longer suitable for evaluating learning methods, and the number of examples can be too large. In this paper, we discuss methods of coping with these problems based on our experience on direct marketing projects using data mining.

Back to Top


Interactive Multidimensional Scaling

Andreas Buja

AT&T Labs - Research

We discuss interactive techniques for multidimensional scaling (MDS) and a system, named ``XGvis'', that implements these techniques.

MDS is a method for visualizing proximity data, that is, data where objects are characterized by dissimilarity values for all pairs of objects. MDS constructs maps of these objects in $\RealsÄk$ by interpreting the dissimilarities as distances. More precisely, objects are mapped to points in such a way that distances between points approximate the dissimilarities between corresponding objects.

MDS in its conventional batch implementations is prone to uncertainties with regard to 1)~local minima in the underlying optimization, 2)~sensitivity to the choice of the optimization criterion, 3)~artifacts in point configurations, and 4)~local inadequacy of the point configurations. These uncertainties will be addressed by the following interactive techniques: 1)~algorithm animation and interactive restarts for the optimization, 2)~interactive control over parameters that determine the criterion and its minimization, 3)~diagnostics for pinning down artifactual point configurations, and 4)~MDS on subsets of objects and subsets of pairs of objects.

MDS was originally developed for the social sciences, but lately there has been interest in using it for graph layout as well. While graph layout is usually done in two dimensions, in our implementation layouts can be constructed and viewed in arbitrarily high dimensions.

We will show applications to various areas: visualization of behavioral data about computer users, dimension reduction of marketing data, layout of telephone call graphs, and reconstruction of macro molecules in nano technology.

Joint work with Deborah Swayne, Michael Littman, Nate Dean

Back to Top


Clustering Methods for Collaborative Filtering

Lyle H. Ungar

Dept. of Computer and Information Science, University of Pennsylvania

Grouping people into clusters based on the items they have purchased allows accurate recommendations of new items for purchase: if you and I have liked many of the same movies, then I will probably enjoy other movies that you like. Recommending items based on similarity of interest (a.k.a. collaborative filtering) is attractive for many domains: books, CDs, movies, etc., but does not always work well. Because data are always sparse - any given person has seen only a small fraction of all movies - much more accurate predictions can be made by grouping people into clusters with similar movies and grouping movies into clusters which tend to be liked by the same people. Finding optimal clusters is tricky because the movie groups should be used to help determine the people groups and visa versa. We present a formal statistical model of collaborative filtering, and compare different algorithms for estimating the model parameters including variations of k-means clustering and Gibbs Sampling.

This is joint work with Dean Foster.

Back to Top


Military Applications of Knowledge Based Systems

Graham Bent

Data Sciences, IBM UK

The application of Intelligent Knowledge Based Systems (IKBS) or Artificial Intelligence (AI) to military applications has had something of a checkered history with much being promised but little being practically achieved. Whilst many of the technologies underpinning IKBS and AI are mature, a significant reason for a lack of progress in military applications (and in other domains) has been the difficulty in obtaining, in the first instance, the knowledge that is required to form the core of these systems. Recent work in the area of machine learning has lead to the development of techniques for automatically generating this knowledge from case histories and supplementing this knowledge with expert input.

IBM/Data Sciences has been involved in the development of these techniques over a number of years and are currently involved in investigating their application to military command and control, data fusion and classification problems. Many of the techniques are embodied in the Artificial Intelligence Discrimination Architecture (AIDA) which has been developed by IBM/ Data Sciences and is finding wide application in areas such as ballistic missile defense, electronic warfare, non-cooperative target recognition, naval command and control and information warfare.

The presentation will look at these emerging technologies and will include a number of computer assisted demonstrations that illustrate how these techniques can be applied in existing and future military applications.

Back to Top


Inductive Logic Programming and its Role in Data Mining

Ashwin Srinivasan

Oxford University Computing Laboratory, Oxford, UK

The existence and rapid growth of large-scale industrial data-bases have brought into focus the need for methods that enable the discovery of trends and predictive patterns in data, and communicating them in a manner designed to provoke insight. This has turned attention to ``machine learning'' techniques capable of extracting symbolic descriptions from data. At the cutting-edge of such techniques is Inductive Logic Programming (ILP).

Given a set of observations and background knowledge in the form of logic programs, an ILP system attempts to construct explanations for the observations. The explanations are in the same language as the observations and background knowledge, namely first-order logic programs. This contrasts with algorithms like decision-trees, and neural networks which employ simple propositional logic representations. This, along with the flexibility to include background knowledge -- which can even include other propositional algorithms -- allow a form of data analysis and decision-support that is, in principle, unmatched by first-generation methods. ILP systems have already been applied in academic settings with some success to a range of difficult problems. These include protein structure prediction, drug structure-activity prediction, finite element mesh design, satellite fault diagnosis, circuit design, automobile traffic-flow analysis, intelligent software agents for the internet, lexical analysis, and the analysis of Rachmaninoff's piano-forte performances. Some of the results have even been acknowledged by experts as having identified new ``knowledge''.

This talk will provide an introduction to ILP, and illustrate some aspects that should be of interest for data-mining research. These will include (a) the automatic construction of new attributes; (b) a natural solution to the ``multiple-instance'' problem; (c) combining statistical and logical predicates to achieve regression, clustering and classification within the ILP setting; and (d) a list of pros and cons for the use of ILP in data mining.

Back to Top


Uncertainty Sampling for Supervised Learning:

A Logarithmic Lunch?

David D. Lewis

AT&T Labs

Random sampling is often used to choose training data for language processing tasks such as text retrieval, email filtering, parsing, tagging, etc. We propose an alternative approach of labeling data, training the system, and finding examples for which the system is least certain of the correct answer. On a text categorization task this method, which we call uncertainty sampling, reduced by up to 500-fold the amount of training data needed to achieve a given level of categorization accuracy.

The computational learning theory results that inspired our own (heuristic) work suggest that, asymptotically, a labeled training set of size *logarithmic* in the amount of unlabeled training data can be used without sacrificing accuracy. For applications where unlabeled training data is cheap, this would be the next best thing to a free lunch. A great deal is unknown about these methods, and we will discuss avenues for research.

Back to Top


Analysis and Visualization of Data Mining Performance:

Comparison of Classifiers in Imprecise Environments

Foster Provost

Bell Atlantic (formerly NYNEX Science and Technology)

When mining data with inductive methods, one often experiments with a wide variety of model-building algorithms, using different algorithm parameters, varying output threshold values, and using different training regimens. Such experimentation yields a large number of classifiers to be evaluated and compared. Experience applying automated model-building algorithms to real-world problems has shown repeatedly that using *accuracy* to compare classifiers is not adequate because the underlying assumptions rarely hold; in order to compare performance it is necessary to know the conditions under which the classifiers will be used. Unfortunately, target class distributions and misclassification costs are often difficult or impossible to specify precisely.

We present a method for the comparison of classifier performance that is robust to imprecise class distributions and misclassification costs. The ROC convex hull method combines techniques from ROC analysis, decision analysis and computational geometry, and adapts them to the particulars of analyzing learned classifiers. The method is efficient and incremental, minimizes the management of classifier performance data, and allows for clear visual comparisons and sensitivity analyses. The ROC convex hull also facilitates building systems that will perform "optimally" under any target conditions.

Back to Top


The IBM/NASA Satellite Explorer System and its Data Mining Analytics

Vittorio Castelli and Lawrence Bergman

Image Information Systems, IBM Research, Watson

IBM/Nasa Satellite Explorer is a system designed to provide navigation, content-based query and data-mining capabilities for large satellite image archives. Our goal has been to provide rapid yet expressive search capabilities. In this talk we will discuss three foci of this project - image representation and search at multiple abstraction levels, content-based query specification and search, and development of algorithms for efficiently mining image archives.

We think of images as consisting of multiple layered abstraction levels. At the lowest layer is the raw pixels. This is the highest volume, and most compute-intensive to search. Certain kinds of search operators (such as template match) inherently require pixel access. At the next level of abstraction are image features. Features such as texture or local spectral histogram can be pre-extracted from image pixels and provide much faster search capabilities. Finally, very low-volume semantic extracts (such as lakes, forests, etc) provide extremely rapid search. The philosophy here is to work as high in the pyramid as possible, searching on data abstracts, but to provide for dropping to higher volume search algorithms when the abstracts do not provide the level of specificity required.

Content-based query is specified in terms of object types and image examples. Where pre-existing semantic categories capture a user's intent, they can be specified by name from a menu. When new object types are required, they can be processed using lower abstraction levels (such as texture, or raw pixels) and specified using sample image clips. A drag-and-drop java-based interface is used for specifying queries. Queries are posed as nested English sentences in a tightly constrained grammar and can include phrases such as "looks like XXXX" where XXXX can be one or more image samples.

Efficient mining of image archives is achieved by developing algorithms that operate on multi-resolution representations of the image. In a fashion analogous to the multiple abstraction pyramid described above, we can also store information at multiple resolutions at any given abstraction level. We have developed mining algorithms including classification and template matching that capitalize on a multi-resolution representation. Such algorithms can provide a 10-30 time speed-up with minimal loss of accuracy.

Back to Top


Discovering and Using Hidden Information in Financial Data

Andreas S. Weigend

Information Systems Department, Stern School of Business, NYU

While the first round of applications of modern machine learning techniques to real world problems focused primarily on prediction--the necessary first step for acceptance in applied communities--more recent approaches have emphasized the importance of understanding. Beside taking domain specific knowledge into account (such as the fact that adjacent time steps are adjacent in time), an important approach to understanding and explaining data is by introducing hidden variables or states, and trying to understand them. In this talk, I present an architecture I have developed that goes beyond the simple nonlinear regression models typically assumed by the physicists-in-finance community, (historically coming from dynamical systems and chaos). The architecture takes time seriously by allowing for hidden states: it models time dependency explicitly as a first-order Markov model with transitions between these hidden states. These transition probabilities can be functions of external variables, and the emission models are nonlinear feedforward neural networks. I will give examples of the characterization of the hidden states for several financial examples on different time scales (daily and high-frequency), as well characterize the performance of trading strategies in terms of profit and loss, Sharpe ratio etc. This is joint work with Shanming Shi at J P Morgan (New York) and University of Colorado (Boulder).

Back to Top


A Statistical Perspective on Data Mining

Jonathan R. M. Hosking

IBM Research/Yorktown

The fast-growing area of data mining can be regarded as a collection of methods for drawing inferences from data. The aims of data mining, and some of its methods, overlap with those of classical statistics. However, there are some philosophical and methodological differences. I examine these differences, and I describe three approaches to machine learning that have developed largely independently: classical statistics, Vapnik's statistical learning theory, and computational learning theory. Comparing these approaches, I conclude that statisticians and data miners can profit by studying each other's methods and using a judiciously chosen combination of them. This is joint work with Edwin Pednault and Madhu Sudan (IBM T. J. Watson Research Center).

Back to Top


Data Mining in the Advanced Scout Project

Inderpal Bhandari

IBM Research/Hawthorne

This presentation will present an overview of the data mining research work in the Advanced Scout project. The demonstration of Advanced Scout will illustrate (from an end-user's perspective) the analytical process beginning with the collection of raw data, the data-mining operations, and ending with decision support and strategy formulation. The underlying algorithm, the challenge of making data mining suitable for the lay-person, and the cognitive processes of pattern interpretation will be discussed. Beyond the team level, the use of Advanced Scout by the NBA, and by ESPN will also be illustrated. The latter part of the presentation will illustrate (from an organizational perspective) the generalizations learned about designing, deploying, supporting and iteratively developing a data mining solution and understanding the impact that data-mining can have beyond any specific domain.

Back to Top


Data Mining on the Web

Philip Yu

IBM Research/Hawthorne

The talk provides an overview of our recent work of exploring data mining on the Web. There are two different aspects. One is mining Web usage data to better understand surfer behavior and improve the organization of the Web presentation. We have developed a Web usage mining tool, SpeedTracer, to address this need. The basic features of SpeedTracer will be covered in this talk. The other aspect is to take advantage of the prevalence of the Web browser interface to do interactive data mining. We will discuss new approaches on interactive mining of association rules and sequence databases.

Back to Top

Revised October 13, 2003

HomeOrderContact IBMLegal