DAR Pages






|
Watson KDD Seminar Series
- For latest news on Watson KDD seminars, please visit the KDD
website at IBM Research.
- Prof. Andrew Moore,
Carnegie-Mellon University, April 18, 2001, 11:00am-12:15pm, 20-043,
The
Computer Science of Big Science Statistics
- Prof. Johannes Gehrke,
Cornell University, March 12, 2001, 11:00am-12:30pm, 20-043,
Data Mining: Next
Generation Problems
- Prof. Pedro Domingos,
University of Washington, October 12, 2000, 10:00am-11:30pm, H1S-E/F53,
Mining
High-Speed Data Streams
- Dr. Sanjoy Dasgupta,
AT&T Research, September 28, 2000, 11:00am-12:30pm, 20-001,
A Detailed Probabilistic
Analysis of EM
- Dr. Vijay Iyengar,
T.J. Watson Research Center, IBM Research, August 16, 2000, 11:00am-12:30pm,
20-043. Active
Learning Using Adaptive Resampling
- Dr. Tong Zhang,
T.J. Watson Research Center, IBM Research, July 14, 2000, 11:00am-12:30pm,
20-043. Linear
Prediction Methods Using Regularization
- Dr. Jeremy Wyatt,
University College, London, June 26, 2000, 3:00pm-4:30pm, Knowledge
Management in Medicine: Challenges and Solutions
- Dr. Sholom Weiss,
T.J. Watson Research Center, IBM Research, June 26, 2000, 11:30am-12:30pm,
20-043. Lightweight
Rule Induction
- Prof. Padhraic Smyth,
University of California, Irvine, April 28, 2000, 11:00am-12:30am,
20-043. Probabilistic
Clustering of Dynamic Behavior
- Mr. Vince Thomas,
IBM Business Intelligence Consulting & Services, April 3, 2000,
09:30am-11:00am, 26-014. Data
Mining with DB2 UDB Miner or "Cut the Hype, Let's Get
Down to Business"
- Prof. Trevor Hastie,
Stanford University, March 13, 2000, 11:00am-12:30pm, 20-043.
Gene Shaving: a
New Class of Clustering Methods for Expression Arrays
- Dr. Daryl Pregibon,
AT&T Labs Research, February 4, 2000, 11:00am-12:30pm, 20-043.
2001:
A Statistical Odyssey.
- Prof. Andrew Barron,
Yale University, January 24, 2000, 4:00pm-5:30pm, 40-200.
Mixture
Density Estimation
- Dr. Joseph L. Hellerstein,
T.J. Watson Research Center, IBM Research, January 7, 2000, 11:00am-12:30pm,
20-043. Knowledge
Representation and Discovery in Performance Management.
- Prof. Tom Mitchell,
Center for Automated Learning & Discovery, Carnegie Mellon University,
October 19, 1999, 9:30am-10:45am, 20-043. Extracting
Information from the World Wide Web.
- Prof. Dan Roth, Computer
Science Department, University of Illinois, Urbana-Champaign, September
24, 1999, 11am-12:30pm 20-043. Learning
and Managing Knowledge in Large Scale Natural Language Inferences.
- Prof. Masato Koda,
University of Tsukuba, Japan, August 13, 11am-12:30pm 20-043.
Bootstrap Training
for Neural Networks
- Dr. Magnus Stensmo,
IBM Global Business Intelligence Solutions, Almaden, August 10, 1999,
1pm-2pm 20-059. Adaptive
Automated Diagnosis
- Dmitry Zelenko, Computer
Science Department, University of Illinois, Urbana-Champaign, July
28, 1999, 11am-12:30pm 20-043. Optimizing
Association Rules
- Salvatore J. Stolfo,
Computer Science Department, Columbia University, May 11, 1999, 11am-12:30pm,
20-043. Distributed
Data Mining For Fraud And Intrusion Detection
- Richard Segal, Agents
and Emergent Phenomena, IBM T.J. Watson Research Center, April 30,
1999, 11am-12:30pm, 20-043. MailCat:
An Intelligent Assistant for Organizing E-Mail
- Rick Lawrence, IBM
T.J. Watson Research Center, April 16, 1999, 11am-12:30pm, 20-043.
Applications
of Data Mining to Personalized Product Recommendation
- David Waltz, NEC Research
Institute, April 6, 1999, 3:30pm-5:00pm, Aud-Rear. AI
Challenges and Opportunities for the 21st Century
- Inderpal Bhandari,
Virtual Gold Inc., March 26, 1999, SportsMiner:
A sporting guide to better data mining
- Ramesh Agarwal, Networked
Data Systems, IBM Research, March 5, 1999, A
Tree Projection Algorithm for Generating Associations
- David Jensen, University
of Massachusetts, Amherst, February 19, 1999, Multiple
Comparisons in Induction Algorithms: Understanding and Correcting
a Fundamental Statistical Error
- Edwin Pednault, Mathematical
Sciences Department, IBM Research, January 8, 1999, Tutorial
on Statistical Learning Theory
- Vasant Dhar, Stern
School of Business, New York University, November 20, 1998,
An Evaluation of Data-Driven
Quantitative Investment Technologies on Wall Street
- Ashok Srivastava,
IBM Almaden, October 20, 1998, Predicting
the Trajectory of Failures in a Dynamical System
- Charles Ling, University
of Western Ontario, August 31, 1998, Data
Mining for Direct Marketing -- Problems and Solutions
- Andreas Buja, AT&T
Labs - Research, March 20, 1998, Interactive
Multidimensional Scaling
- Lyle Ungar, Department
of Computer Science, University of Pennsylvania, February 12, 1998,
Clustering
Methods for Collaborative Filtering
- Graham Bent, Data
Sciences, IBM UK, December 12, 1997, Military
Applications of Knowledge Based Systems
- Ashwin Srinivasan,
Oxford University Computing Laboratory, Oxford, UK, November 10, 1997,
Inductive
Logic Programming and its Role in Data Mining
- David D. Lewis, AT&T
Labs, October 29, 1997, Uncertainty
Sampling for Supervised Learning: A Logarithmic Lunch?
- Foster Provost, Bell
Atlantic (formerly NYNEX Science and Technology), September 26, 1997,
Analysis
and Visualization of Data Mining Performance:Comparison of Classifiers
in Imprecise Environments
- Vittorio Castelli
and Lawrence Bergman, Image Information Systems, IBM Research, Watson,
August 8, 1997, The
IBM/NASA Satellite Explorer System and its Data Mining Analytics
- Andreas S. Weigend,
Information Systems Department, Stern School of Business, NYU, July
11, 1997, Discovering
and Using Hidden Information in Financial Data
- Jonathan R. M. Hosking,
IBM Research/Yorktown, June 13, 1997, A
statistical perspective on data mining
- Inderpal Bhandari,
IBM Research/Hawthorne, May 9, 1997, Data
Mining in the Advanced Scout Project
- Philip Yue, IBM Research/Hawthorne,
April 4, 1997, Data
Mining on the Web
The
Computer Science of Big Science Statistics
Prof. Andrew Moore
Carnegie-Mellon University
This talk is about the algorithmic
challenges involved in allowing statisticians, probabilists, physicists
and biologists to continue using the statistical modeling and probabilistic
inference tools they've been happily applying to megabytes of data,
when they start drawing in terabytes of data. I will try to give a roadmap
to the various literatures and tools from computational geometry, numerical
analysis, databases, data mining and artificial intelligence. The technical
meat of the talk will highlight some new algorithms and data structures
my group has been developing over the period 1994-2001 that fall into
the class of "cached sufficient statistics." These are summary data
structures that live between the statistical algorithm and the database,
intercepting the kinds of operations that have the potential to eat
up valuable time if they were answered by direct reading of the dataset.
I will give some computer demonstrations showing various classes of
accelerations broadly covering density estimation speedups (1000 fold),
clustering speedups (1000-10000 fold), mixture model and Bayes Net discovery
(10-100000 fold), anomaly detection (100-fold) and 2-, 3-, 4-point correlation
function computation (100-fold up to about a trillion-fold). We will
also show how this can be applied to other problems such as (i) detecting
hits in very noisy proteomics screenings, (ii) within a DARPA-sponsored
biological warfare detector system and (iii) for detecting very subtle
structure in distributions of spatial results. If time permits we will
also discuss the use of this technology in Active Learning scenarios,
in which the ability to do these analyses extremely quickly in turn
permits on-the fly planning of near-optimal experiments needed to minimize
the amount of data needed to learn to solve some task.
Back
to Top
Data
Mining: Next Generation Problems
Prof. Johannes Gehrke
Cornell University
Data mining as a discipline
has matured considerably, and especially strong progress has been made
in the design of scalable algorithms that transform the oceans of bits
in very large databases into interpretable patterns and predictive models.
The current data mining research in my group concentrates on three technical
challenges for the next generation of data mining tools: Interpretability
of data mining models, high-speed main-memory data mining, and mining
and monitoring data streams.
In this talk, I will describe
some recent results on the correction of a problem in classification
trees construction: Bias in split selection. An attribute selection
criterion is unbiased if its choice of a split attribute is only determined
by the strength of the dependency between the split attribute and the
class label attribute; otherwise the criterion is biased. Biased split
criteria do not return the "best" split attribute at a node of the classification
tree, but rather a attribute that has properties unrelated to its dependence
with the class label (for example, a large domain) that "confuse" the
split criterion. This problem is ubiquitous in today's data mining tools;
for example, the split selection method in IBM DB2 Intelligent Miner
is biased. I will first show results from a thorough experimental and
theoretical study that shows the nature of the bias, and then I will
talk about our recent results that allow us to provably eliminate this
bias. A careful experimental study shows that our method is computationally
scalable, and has excellent behavior in practice.
Back
to Top
Mining
High-Speed Data Streams
Prof. Pedro Domingos
University of Washington
Many organizations today
have more than very large databases; they have databases that grow without
limit at a rate of several million records per day. Mining these continuous
data streams brings unique opportunities, but also new challenges. In
this talk I will describe VFDT, an anytime system that builds decision
trees using constant memory and constant time per example. VFDT can
incorporate tens of thousands of examples per second using off-the-shelf
hardware. It uses Hoeffding bounds to guarantee that its output is asymptotically
nearly identical to that of a conventional learner. We have studied
VFDT's properties and demonstrated its utility through an extensive
set of experiments on synthetic data. VFDT is currently being applied
to mining the continuous stream of Web access data from the whole University
of Washington main campus.
(Joint work with Geoff Hulten.)
Back
to Top
A
Detailed Probabilistic Analysis of EM
Dr. Sanjoy Dasgupta
AT&T Research
We prove strong performance
guarantees for a simple variant of EM for Gaussian mixtures. Specifically,
we establish fairly general conditions under which it will converge
to the globally optimal mixture (rather than a local optimum). We place
this in the context of previous theoretical and experimental work on
the EM algorithm. Our analysis includes a lot of new intuition about
the workings of EM on high-dimensional data. For instance, it provides
a principled way of initializing the clusters and later pruning them.
Back
to Top
Active
Learning Using Adaptive Resampling
Dr. Vijay Iyengar
T.J. Watson Research Center, IBM Research
Classification modeling
(a.k.a. supervised learning) is an extremely useful analytical technique
for developing predictive and forecasting applications. The explosive
growth in data warehousing and internet usage has made large amounts
of data potentially available for developing classification models.
For example, natural language text is widely available in many forms
(e.g., electronic mail, news articles, reports, and web page contents).
Categorization of data is a common activity which can be automated to
a large extent using supervised learning methods. Examples of this include
routing of electronic mail, satellite image classification, and character
recognition. However, these tasks require labeled data sets of sufficiently
high quality with adequate instances for training the predictive models.
Much of the on-line data, particularly the unstructured variety (e.g.,
text), is unlabeled. Labeling is usually a expensive manual process
done by domain experts. Active learning is an approach to solving this
problem and works by identifying a subset of the data that needs to
be labeled and uses this subset to generate classification models. We
present an active learning method that uses adaptive resampling in a
natural way to significantly reduce the size of the required labeled
set and generates a classification model that achieves the high accuracies
possible with current adaptive resampling methods.
Back
to Top
Linear
Prediction Methods Using Regularization
Dr. Tong Zhang
T.J. Watson Research Center, IBM Research
In this talk, I give an
overview of some recent developments on linear prediction methods using
regularization. In the first part, I will talk about algorithmic issues.
I will describe a duality of a convex programming formulation of linear
prediction, which can be used to design efficient numerical algorithms.
In particular, I illustrate large margin Perceptron-type algorithms
and large margin Winnow-type algorithms. Relationship with other methods
such as maximum entropy and basis pursuit will be discussed. We will
also compare the numerical aspects of the dual algorithms to the primal
algorithms. In the context of duality, Kernel methods will be briefly
mentioned. I will compare formulations that lead to sparse dual variables
(SV machines) and formulations that lead to sparse primal variables
(basis pursuit-like machines) as well as their algorithmic implications.
In the second part, I will discuss some theoretical results concerning
regularized linear prediction. I will argue why the convergence in probability
of estimated parameter to the true parameter is important and show how
to obtain such results.I will also relate this to outlier issues. For
linearly separable classification, I show that the robustness of the
data distribution leads to an exponential decay rate of misclassification
error. This can be compared with a typical large margin bound that decays
at a rate of inverse sample size.
Back
to Top
Knowledge
Management in Medicine: Challenges and Solutions
Dr. Jeremy Wyatt
University College, London
Doctors’ decisions determine
three quarters of health care costs and depend critically on medical
knowledge. However, medical knowledge is complex and doubles in amount
every 20 years. Managing this knowledge poses many challenges and requires
a range of techniques. This talk will discuss these challenges and sample
the solutions adopted over the last forty years, including methods to
refine knowledge (evidence-based medicine), disseminate knowledge (bibliographic
databases and the internet) and apply knowledge to influence professional
practice (decision support systems, telemedicine and other approaches).
Back
to Top
Lightweight
Rule Induction
Dr. Sholom Weiss
T.J. Watson Research Center, IBM Research Division
How a simple method for learning
decision rules from data gets great results. A lightweight rule induction
method is described that generates compact Disjunctive Normal Form (DNF)
rules. Each class has an equal number of unweighted rules. A new example
is classified by applying all rules and assigning the example to the
class with the most satisfied rules. The induction method attempts to
minimize the training error with no pruning. An overall design is specified
by setting limits on the size and number of rules. During training,
cases are adaptively weighted using a simple cumulative error method.
The induction method is nearly linear in time relative to an increase
in the number of induced rules or the number of cases. Experimental
results on large benchmark data sets demonstrate that predictive performance
can rival the best reported results in the literature.
Back
to Top
Probabilistic
Clustering of Dynamic Behavior
Prof. Padhraic Smyth
University of California, Irvine
Clustering has a long history
in exploratory data analysis. Searching for homogenous data clusters
is a fundamental component of any endeavor to discover structure from
data. This talk will begin with a brief overview of a general model-based
probabilistic clustering framework. This approach, relatively well known
in applied statistics, focuses on the use of mixture models as an underlying
generative model for observed data. It also provides a relatively objective
and sound methodology for determining the model which is closest to
truth.
A particularly useful aspect
of the probabilistic approach is the ability to generalize from clustering
in vector-spaces to clustering sequences, trajectories, and other ``dynamic"
data that we commonly observe from individuals and/or systems. The main
part of the talk will introduce a new and general framework for mixture-model
clustering of such data. The probabilistic approach solves in an elegant
and coherent manner the dual difficulties of (a) how to define distance
metrics between observations (e.g., sequences) of different sizes, and
(b) how to ``weight" different individuals for whom we have different
amounts of data. Illustrative applications include unsupervised learning
of gestures from video data, clustering of individuals based on Web
browsing behavior, modeling of gene expression data, modeling of cyclone
trajectories, clustering locusts based on observed motor behavior, and
clustering of medical patients based on histograms of red blood cells.
A subset of these applications will be discussed, time permitting. The
talk will conclude with a brief discussion of some apparently useful
connections between mixture modeling in this context and Bayesian hierarchical
models.
Back
to Top
Data
Mining with DB2 UDB Miner or "Cut the Hype, Let's
Get Down to Business"
Vince Thomas
IBM Business Intelligence Consulting & Services
DB2 UDBMiner was developed
on a customer engagement to provide a data mining environment that was
integrated into a 33TB DB2 data mart running on a 186-node SP. In particular,
it was designed to address the challenges of making data mining part
of the business process rather than a marketing exercise. The innovation
of building the model into a User Defined Function revolutionizes the
way we deploy data mining solutions.
Back
to Top
Prof. Trevor Hastie
Stanford University
Through the use of recently
developed DNA microarrays, it is now possible to obtain accurate, quantitative
measurements of the proportional representation of thousands of genes
present in a biological sample. DNA arrays have now been used to monitor
changes in gene expression during important biological processes (e.g.
cellular replication and the response to changes in the environment),
and to study variation in gene expression across collections of related
samples (e.g. tumor samples from patients with cancer). A major statistical
task is to understand the structure of the data from such studies, which
often consist of measurements of thousands of genes in tens of conditions.
A variety of clustering techniques
have been applied to such data, and proven to be useful tools for identifying
biologically relevant groupings of genes and samples. In the paper,
we describe a new method called GENE SHAVING which tries to identify
subsets of genes with coherent expression patterns with large variation
across conditions. The technique differs from hierarchical and other
clustering methods that have appeared in the literature. The technique
can be unsupervised (that is, treat the genes and samples as unlabeled),
or partially or fully supervised by known properties of the samples.
We illustrate the method on data from a large survey of gene expression
in a diverse collection of human tumor cell lines.
* joint work with Robert
Tibshirani, Michael Eisen, Patrick Brown, David Botstein, and others
Back
to Top
Dr. Daryl Pregibon
AT&T Labs Research
This talk is an interim report
on the 5 year plan launched in 1996 to provide a theoretical and computational
foundation of Statistics for massive data sets. The plan coincided with
the formation of AT&T Labs and the proposed research agenda of the InfoLab,
which is both a physical laboratory and an interdisciplinary collection
of information researchers in CS, mathematics, and Statistics. At the
halfway point of this odyssey we can identify some success stories but
more importantly it is an opportune time to re-calibrate the challenges
and the milestones.
Back to Top
Dr. Andrew Barron
Yale University
Gaussian mixtures for density
estimation provide a natural counterpart to sigmoidal neural networks
for function fitting and approximation. In both cases, it is possible
to give simple expressions for the iterative improvement of performance
as components of the network are introduced one at a time. In particular,
for mixture density estimation we show that a k-component mixture estimated
by maximum likelihood (or by an iterative likelihood improvement that
we introduce) achieves log-likelihood within order 1/k of the log-likelihood
achievable by any convex combination. Consequences for approximation
and estimation using Kullback-Leibler risk are also given. A Minimum
Description Length principle selects the optimal number of components
k that minimizes the risk bound.
Back
to Top
Dr. Joseph L. Hellerstein
T.J. Watson Research Center
IBM Research Division
Performance
management deals with ensuring the quality and accessibility of electronic-based
services. Examples of such services include: transaction processing,
web access, and eBusiness interactions. Over the last several years,
the performance management department at IBM Research has developed
a number of projects that rely heavily on knowledge representation and
machine learning. For example, we developed a measurement technology
(ETE) that provides operations with a means to dynamically change the
definition of end-user transactions, an important consideration to accommodate
varied user-interaction styles during problem determination. This technology
employs rule-based techniques for on-line pattern recognition. Subsequent
work has involved learning end-user transactions from examples. Other
work, such as our event mining project, has involved pattern discovery
to characterize problem situations. Special considerations here include
noisy patterns and partial periodicities. This talk will summarize key
projects in the performance management department, identifying knowledge
representation and discovery considerations.
Back to Top
Prof. Tom Mitchell
Center for Automated Learning & Discovery
Carnegie-Mellon University
Consider the fact that although your
computer workstation can now retrieve any of 600,000,000 pages on the
World Wide Web, it unfortunately cannot understand their content. This
is, of course, because web pages are written to be understandable to
people, not computers.
The goal of our research is to automatically
extract a very large database of facts that mirror the content of the
Web, and that can be manipulated by computer. If we can achieve this
goal, it will enable using the web as a gargantuan data base and knowledge
base to support a rich variety of applications. Our approach is to use
machine learning algorithms to train a system to automatically extract
information from web hypertext. For example, in one set of experiments
our system was trained to extract descriptions of faculty, students,
research projects, and courses from web sites of computer science departments.
It then used these learned extraction routines to build a database containing
thousands of new entries by automatically browsing new university web
sites. The system is currently running 24 hours per day, and over the
past eight months has built a knowledge base containing over 100,000
assertions, with an accuracy of roughly 70%.
This talk will present the machine learning
algorithms we have developed to date, along with experimental results
suggesting these methods can be quite effective for information extraction
in certain domains.In information extraction and in language understanding
related tasks inference heavily depends on knowledge about the language
and the world. I will present research on the use of Machine Learning
in performing these kind of ``knowledge-intensive inferences''.
Back to Top
Prof. Dan Roth
Dept. of Computer Science and the Beckman Institute
University of Illinois - Urbana/Champaign
In information extraction and in language
understanding related tasks
inference heavily depends on knowledge
about the language and the
world. I will present research on the
use of Machine Learning in
performing these kind of ``knowledge-intensive
inferences''.
First, I will present a learning theory
account of the major
statistical approaches to learning
in natural language and explain why
they work although, typically, the
assumptions they are based on do
not hold in the data.
Then, I will present the SNoW learning
architecture and discuss how it
is used to support large-scale inference
problems. The emphasis is on
how SNoW can tolerate data of high
dimensionality, how additional
knowledge - acquired as the system
interacts with various information
sources is incorporated, and on making
inferences that depend on the
outcome of several classifiers. The
approach will be exemplified with
experimental evidence from several
language understanding related tasks.
Back to Top
Prof. Masato Koda
University of Tsukuba, Japan
We propose a method of training neural networks based on bootstrap samples. We also propose
a method of evaluating bootstrap estimates for training errors. Then we review
4 learning algorithms, which include stochastic learning
method formulated by the present author, from which we selected the conjugate
gradient training method for its convergence speed.
We set up an experiment meeting the following requirements:
1. Complete sample space is known,
2. Sample space is finite and small so that training is fast,
3. Its distribution function is unbalanced so that the consequence of resampling is easy to assess.
The result shows that the learning performance is enhanced by resampling.
Back to Top
Dr. Magnus Stensmo
IBM Global Business Intelligence Solutions, Almaden
Diagnosis is an important but complicated task whose automation would be very
useful. If automated successfully, scarce resources could be made less so and diagnosis could be made more accurate and efficient.
We have used decision and probability theory to construct automated diagnosis systems directly from databases of cases. This
approach alleviates the task of knowledge extraction from experts, who often are unaware of how they have come to a certain diagnosis.
Probability models were constructed using mixture models that were efficiently learned by the Expectation-Maximization
algorithm. Problems with missing data are thereby solved, both for missing data in the case database, and during actual
diagnosis when missing data constitute observations not yet made. Decision theory was used to find the most informative next question
to ask, observation to make, or test to do in order to minimize the total cost for the diagnosis. It was also used to decide
when to stop requesting more information. To automatically find a good utility function for the decision theoretic model, temporal-difference
reinforcement learning was used to learn it from the data set. The use of reinforcement learning reduced the number of tests necessary to achieve
the same accuracy as if all tests had been performed which results in significant savings. It is demonstrated how a complete decision
theory controlled sequential decision process can be completely learned from data. No "expert" knowledge is needed since only a database
was used to construct the systems, and the system is completely domain independent.
Back to Top
Dmitry Zelenko
University of Illinois, Urbana-Champaign
Dept. of Computer Science
We analyze several problems of optimizing
association rules. The problems have important applications in data mining, allowing users
to focus at interesting rules extracted from databases. We consider association
rules of the form
where Ai's are categorical attributes of the underlying
relation R, and C is any fixed condition defined over the
attributes of the relation R.
We study several problems, in which we seek assignments to the variables vi's
of a given rule so that the rule satisfies certain optimality constraints. We exhibit
efficient algorithms for solving the optimized support and optimized confidence problems, the weighted support/confidence problem,
and the shortest rule problem. We discuss time and space complexity of the designed
algorithms and show how they can be improved by allowing for approximate solutions.
We also delineate how we can use one of the developed algorithms to solve efficiently
an interesting special case of minimum flux cut, a classical problem in network design.
Back to Top
Salvatore J. Stolfo
Columbia University - Department of Computer Science
In this talk, we will provide an overview of the JAM distributed data
mining system and its application to fraud and intrusion detection.
JAM provides a set of machine learning programs encapsulated as JAVA
agents that can be dispatched to remote database sites. It also features
a meta-learning system that integrates and combines multiple models
gathered from and computed by the remote database sites.
Our work on credit card fraud, joint with the Financial Services Technology
Consortium, provides insight into a number of technical problems that
must be addressed by any distributed data mining system. We will describe
these and how they manifest themselves in the credit card application.
The study of credit card fraud was not only useful to the banks. The
results we achieved, and solutions we invented along the way
provided guidance in studying the general problem of network intrusion
detection. We applied our techniques to this problem, and participated
in the DARPA/MIT intrusion detection evaluation program. The models
generated by our data mining programs produced the best performance
over a number of other knowledge engineered intrusion detection systems.
However, our work on the credit card fraud problem strongly indicates
that the metrics used in evaluating intrusion detection systems is
wrong. We will describe the issues involved, and define "cost-based" metrics
for intrusion detection. If time permits, we shall provide a very
brief overview of MarketNet, an experimental system under development
at Columbia by Y. Yemini, that may provide the ultimate solution.
Back to Top
Richard Segal
IBM T.J. Watson Research Center
MailCat is an intelligent assistant for Lotus
Notes that helps users organize their e-mail into folders. MailCat
uses a text classifier to learn each user's mail-filing habits. MailCat
uses the model it learns to predict the three folders in which the
user is most likely to place each incoming message. The predictions are
presented to the user as three shortcut buttons that allow the user
to quickly file each message into one of the predicted folders. When one
of MailCat's predictions is correct, the effort required to file a message
is reduced to a single button click. MailCat uses an incremental
version of the AIM classifier to make its predictions. An incremental classifier
is used to allow MailCat to continuously update its model
of the user. We have performed a number of static and dynamic experiments
with MailCat as well some initial user studies. The results show that
the shortcut buttons provided by MailCat are useful between 80%
to 90% of the time and this performance level is maintained over time
despite the constantly changing classification problem. We also
find that the bootstrap period for the classifier is extremely short and
that MailCat is useful from the moment it is first installed.
Back to Top
Rick Lawrence
IBM T.J. Watson Research Center
Recommender systems such as used by Amazon.com
and other e-businesses have become an increasingly popular approach
to web-based personalization. These systems most often
rely on variants of collaborative filtering, in which the user
rates a subset of items, and then receives additional recommendations
based on analysis of the purchases of other customers with 'similar'
tastes. In this talk, we describe a product recommendation system
developed as part of an overall pervasive computing solution for the grocery
chain Safeway UK. Our approach combines aspects of both collaborative
filtering and content-based filtering. Customer peer groups
are determined via data mining clustering, and this information is
used to construct a candidate recommendation list based on popular items
among each customer's peers. This list is further refined by a content-based
projection algorithm which uses relationships between product
groups computed using data mining associations. We have also explored
the use of product profit margins in determining product recommendations.
We describe results from an initial deployment at Safeway. Finally,
we briefly mention the application of similar ideas to ongoing work
in identifying stock market trends.
Back to Top
David Waltz
NEC Research Institute
AI has been successful in fielding applications
that mine large databases for facts and patterns, that plan for complex
logistics problems, that allow programs to accept speech inputs, that
help physicians diagnose diseases and monitor patients, that play
games (e.g. chess, bridge, poker) at human levels, etc. But there are still
no computational examples of human-level performance on many human abilities
(e.g. visual perception, deep natural language understanding, flexible
reasoning). In fact, there is not even a consensus about the nature of
the mechanisms that underlie such human abilities. But such abilities are important
for future applications such as intelligent environments, instructable
assistants, and text search systems that actually find meaningful answers
to questions rather than merely returning a number of texts that contain
the words of a query.
This talk will argue that any system we would
consider truly intelligence will depend critically on the ability to
separate the important from the unimportant. AI has for the most part
focussed on logic and reasoning in artificial situations where only relevant
variables and operators are specified, and has paid insufficient attention
to processes of reducing the richness and disorganization of the real
world to a form where logical reasoning can be applied. This talk will
discuss the role of importance in intelligence, provide some positive examples
of research that makes use of importance judgements, and offer suggestions
for new mechanisms, architectures, applications, and research
directions for AI.
Back to Top
Inderpal Bhandari
Virtual Gold Inc.
Data mining involves the automatic identification
of interesting patterns in large amounts of data. I will preview
SportsMiner, a web site for sports professionals, amateurs, and fans
who wish to mine data of major sports - e.g. PGA Golf, World Cup Soccer,
Grand Slam tennis - to find hidden patterns that reveal what leads them
(or their heroes) to victory or defeat. I will describe how the patent-pending
technology that underlies the above demonstrations enables rapid, cost-effective
development and deployment of Internet-ready applications
that bring data-driven knowledge discovery to the masses. I will demonstrate
such a data mining application for corporate banking and discuss the implications
of the new technology for the database and data mining marketplace.
Back to Top
Ramesh Agarwal
Networked Data Systems, IBM Research
Generating all frequent itemsets (association
patterns) in a transaction database is an important data mining problem.
Current association algorithms are particularly inefficient at mining long
patterns at low support. In this talk, we will describe a new approach
to counting support for association. We represent associations
as a lexicographic tree of frequent itemsets. We discuss different
strategies in generation and traversal of the lexicographic tree such
as breadth-first search, depth first search, or a combination of the two.
These techniques provide different trade-offs in terms of the I/O,
memory, and the computational time requirements. We use the hierarchical
structure of the lexicographic tree to successively project transactions
at each node of the lexicographic tree, and use a matrix counting technique
on the projected transactions for counting support of candidate itemsets.
Tree projection is a very powerful technique
and it greatly reduces the computing time. Let us consider a node
of the tree corresponding to a frequent itemset P having support S.
Then only this fraction of the transactions will project to this node. When
the frequent itemset P is extended, then all candidate extensions of
P are counted only against the subset of transactions which project
to node P. For example assume S is 1%, then counting time to extend node
P is reduced by a factor of 100 so.
We have tested our algorithm on both real and synthetic data. In those
cases, where the A Priori algorithm becomes compute bound, our algorithms
provides one to two orders of magnitude performance improvement. In this talk various implementation details and performance comparison
results will be presented.
Back to Top
David Jensen
Computer Science Department, University of Massachusetts, Amherst
An increasing number of modern systems incorporate
algorithms that construct predictive models from data.
These induction algorithms exhibit a surprising number of pathological
behaviors, including overfitting, oversearching, and attribute
selection errors. In this talk, I discuss a single statistical error
responsible for all of these pathologies. At the core of this error is
the multiple comparison procedure (MCP): Algorithms generate multiple
items (e.g., classification rules), score each item based
on a data sample, and select the item with the maximum score. In this
talk, I outline proofs of the statistical properties of MCPs and show how
failure to adjust for these properties leads to each pathology. I will
also discuss approaches that can control pathological behavior, including
Bonferroni adjustment, cross-validation, and randomization tests.
I conclude by addressing the question "Why does such a simple statistical
error cause so much mischief?"
Back to Top
Edwin Pednault
Mathematical Sciences Department, IBM Research Division
Statistical learning theory, also known as
Vapnik-Chervonenkis (VC) theory, addresses
a key question that arises when constructing
predictive models from data---how to decide
whether a particular model is adequate or
whether a different model would produce better
predictions. Whereas classical statistics
typically assumes that the form of the correct
model is known and the objective is to estimate
the model parameters, statistical learning
theory presumes that the correct form is
completely unknown and the goal is to identify
the best possible model from a set of competing
models. The models need not have the same
mathematical form and none of them need be
correct. The theory provides a sound statistical
basis for assessing model adequacy under
these circumstances, which are precisely
the circumstances encountered in machine
learning, pattern recognition, and exploratory
data analysis. The speaker will present an
informal chalk-talk tutorial on statistical
learning theory as a prelude to Vladimir
Vapnik's visit on January 11. The tutorial
will focus on fundamental concepts and the
intuitions behind the mathematical development
of the theory.
Back to Top
Vasant Dhar
Stern School of Business, New York University
Financial markets are highly data intensive.
They are also highly complex and multifaceted,
with different psychologies that manifest
themselves periodically. In this talk, I
evaluate a number of models that are used
by the investment community to price securities.
I also discuss some of the concerns of this
community towards bottom-up approaches to
model building in general, and the potential
of data mining approaches in addressing them.
Back to Top
Ashok Srivastava
IBM Almaden
Many dynamical systems operate under relatively
constrained and known conditions, where the dynamics and exogenous
inputs into the system are well known and quantified. These systems
can generally be characterized using established time series technologies.
In many cases, however, standard methods may not be well suited for
systems where a slow degradation in the system is present. This
introduces a problem of multiple time scales: one at the scale of
the observations, and the other at the scale of the degradation.
In this talk, we present a methodology to
predict failure and classify the type of failure from a dynamical system in
which interesting phenomenon exist on at least two time scales. The observations
are on the order of milliseconds, and the development of the
failure may be on the order of weeks. The problem is to predict the failure
mode potentially far in the future given measurements on a very rapid
time scale.
Our results are presented in the context
of a benchmark data set (the "Hollins Data Set") from the rotorcraft industry,
where the problem is to predict the failure of the rotorcraft transmission
given a set of vibration measurements. We demonstrate that our methodology
produces a high accuracy prediction using only a small number of features
(less than 5), whereas the best competing methodologies use hundreds
of features in an extremely complex classifier and produce similar classification
rates.
This work was done in collaboration with Dr. Ed Huff of NASA Ames Research Center.
Back to Top
Charles Ling
University of Western Ontario
Direct marketing is a process of identifying
likely buyers of certain products and promoting
the products accordingly. It is increasingly
used by banks, insurance companies, and the
retail industry. Data mining can provide
an effective tool for direct marketing. During
data mining, several specific problems arise.
For example, the class of distribution is
extremely imbalanced (the response rate is
about 1%), the predictive accuracy is no
longer suitable for evaluating learning methods,
and the number of examples can be too large.
In this paper, we discuss methods of coping
with these problems based on our experience
on direct marketing projects using data mining.
Back to Top
Andreas Buja
AT&T Labs - Research
We discuss interactive techniques for multidimensional
scaling (MDS) and a system, named ``XGvis'', that implements
these techniques.
MDS is a method for visualizing proximity
data, that is, data where objects are characterized by dissimilarity
values for all pairs of objects. MDS constructs maps of these objects
in $\RealsÄk$ by interpreting the dissimilarities as distances.
More precisely, objects are mapped to points in such a way
that distances between points approximate the dissimilarities between
corresponding objects.
MDS in its conventional batch implementations
is prone to uncertainties with regard to 1)~local minima
in the underlying optimization, 2)~sensitivity to the choice
of the optimization criterion, 3)~artifacts in point configurations,
and 4)~local inadequacy of the point configurations. These uncertainties will be
addressed by the following interactive techniques:
1)~algorithm animation and interactive restarts for the
optimization, 2)~interactive control over parameters that
determine the criterion and its minimization, 3)~diagnostics for
pinning down artifactual point configurations, and 4)~MDS on subsets
of objects and subsets of pairs of objects.
MDS was originally developed for the social
sciences, but lately there has been interest in using it for graph layout
as well. While graph layout is usually done in two dimensions,
in our implementation layouts can be constructed and viewed in
arbitrarily high dimensions.
We will show applications to various areas:
visualization of behavioral data about computer users, dimension
reduction of marketing data, layout of telephone call graphs, and
reconstruction of macro molecules in nano technology.
Joint work with Deborah Swayne, Michael Littman,
Nate Dean
Back to Top
Lyle H. Ungar
Dept. of Computer and Information Science,
University of Pennsylvania
Grouping people into clusters based on the
items they have purchased allows accurate recommendations of new items
for purchase: if you and I have liked many of the same movies, then
I will probably enjoy other movies that you like. Recommending items
based on similarity of interest (a.k.a. collaborative filtering)
is attractive for many domains: books, CDs, movies, etc., but does
not always work well. Because data are always sparse - any
given person has seen only a small fraction of all movies - much more
accurate predictions can be made by grouping people into clusters with
similar movies and grouping movies into clusters which tend to be liked
by the same people. Finding optimal clusters is tricky because
the movie groups should be used to help determine the people groups
and visa versa. We present a formal statistical model of collaborative
filtering, and compare different algorithms for estimating the model
parameters including variations of k-means clustering and Gibbs
Sampling.
This is joint work with Dean Foster.
Back to Top
Graham Bent
Data Sciences, IBM UK
The application of Intelligent Knowledge
Based Systems (IKBS) or Artificial Intelligence (AI) to military
applications has had something of a checkered history with much being promised
but little being practically achieved. Whilst many of the
technologies underpinning IKBS and AI are mature, a significant reason for
a lack of progress in military applications (and in other domains)
has been the difficulty in obtaining, in the first instance, the knowledge
that is required to form the core of these systems. Recent work in
the area of machine learning has lead to the development of techniques for automatically generating
this knowledge from case histories and supplementing
this knowledge with expert input.
IBM/Data Sciences has been involved in the
development of these techniques over a number
of years and are currently involved in investigating
their application to military command and
control, data fusion and classification problems.
Many of the techniques are embodied in the
Artificial Intelligence Discrimination Architecture
(AIDA) which has been developed by IBM/ Data
Sciences and is finding wide application
in areas such as ballistic missile defense,
electronic warfare, non-cooperative target
recognition, naval command and control and
information warfare.
The presentation will look at these emerging
technologies and will include a number of computer assisted demonstrations
that illustrate how these techniques can be applied in existing
and future military applications.
Back to Top
Ashwin Srinivasan
Oxford University Computing Laboratory, Oxford,
UK
The existence and rapid growth of large-scale
industrial data-bases have brought into focus the need
for methods that enable the discovery of trends and predictive patterns
in data, and communicating them in a manner designed to provoke
insight. This has turned attention to ``machine learning''
techniques capable of extracting symbolic descriptions from data.
At the cutting-edge of such techniques is Inductive Logic Programming
(ILP).
Given a set of observations and background
knowledge in the form of logic programs, an ILP system attempts
to construct explanations for the observations. The explanations
are in the same language as the observations and background
knowledge, namely first-order logic programs. This contrasts
with algorithms like decision-trees, and neural networks which
employ simple propositional logic representations. This, along
with the flexibility to include background knowledge -- which
can even include other propositional algorithms -- allow a form
of data analysis and decision-support that is, in principle,
unmatched by first-generation methods. ILP systems have already
been applied in academic settings with some success to a
range of difficult problems. These include protein structure
prediction, drug structure-activity prediction, finite element
mesh design, satellite fault diagnosis, circuit design,
automobile traffic-flow analysis, intelligent software agents for
the internet, lexical analysis, and the analysis of Rachmaninoff's
piano-forte performances. Some of the results have even
been acknowledged by experts as having identified new ``knowledge''.
This talk will provide an introduction to
ILP, and illustrate some aspects that should be of interest for
data-mining research. These will include (a) the automatic construction
of new attributes; (b) a natural solution to the ``multiple-instance''
problem; (c) combining statistical and logical
predicates to achieve regression, clustering and classification
within the ILP setting; and (d) a list of pros and cons for the use
of ILP in data mining.
Back to Top
A Logarithmic Lunch?
David D. Lewis
AT&T Labs
Random sampling is often used to choose training
data for language processing tasks such as text retrieval,
email filtering, parsing, tagging, etc. We propose an alternative
approach of labeling data, training the system, and finding examples
for which the system is least certain of the correct answer.
On a text categorization task this method, which we call uncertainty sampling,
reduced by up to 500-fold the amount of training data needed
to achieve a given level of categorization accuracy.
The computational learning theory results that inspired our own (heuristic) work suggest that, asymptotically,
a labeled training set of size *logarithmic* in the amount of unlabeled
training data can be used without sacrificing accuracy. For applications
where unlabeled training data is cheap, this would be the
next best thing to a free lunch. A great deal is unknown about these
methods, and we will discuss avenues for research.
Back to Top
Comparison of Classifiers in Imprecise Environments
Foster Provost
Bell Atlantic (formerly NYNEX Science and Technology)
When mining data with inductive methods,
one often experiments with a wide variety of model-building algorithms,
using different algorithm parameters, varying output threshold values,
and using different training regimens. Such experimentation yields
a large number of classifiers to be evaluated and compared.
Experience applying automated model-building algorithms to real-world
problems has shown repeatedly that using *accuracy* to compare
classifiers is not adequate because the underlying assumptions
rarely hold; in order to compare performance it is necessary to know
the conditions under which the classifiers will be used. Unfortunately,
target class distributions and misclassification costs are often difficult
or impossible to specify precisely.
We present a method for the comparison of
classifier performance that is robust to imprecise class distributions
and misclassification costs. The ROC convex hull method combines
techniques from ROC analysis, decision analysis and computational
geometry, and adapts them to the particulars of analyzing learned
classifiers. The method is efficient and incremental, minimizes the
management of classifier performance data, and allows for clear visual
comparisons and sensitivity analyses. The ROC convex hull
also facilitates building systems that will perform "optimally" under
any target conditions.
Back to Top
Vittorio Castelli and Lawrence Bergman
Image Information Systems, IBM Research, Watson
IBM/Nasa Satellite Explorer is a system designed
to provide navigation, content-based query and data-mining capabilities
for large satellite image archives. Our goal has been to provide rapid
yet expressive search capabilities. In this talk we will discuss
three foci of this project - image representation and search at multiple
abstraction levels, content-based query specification and search,
and development of algorithms for efficiently mining image archives.
We think of images as consisting of multiple
layered abstraction levels. At the lowest layer is the raw pixels. This
is the highest volume, and most compute-intensive to search. Certain
kinds of search operators (such as template match) inherently require pixel
access. At the next level of abstraction are image features. Features
such as texture or local spectral histogram can be pre-extracted from image
pixels and provide much faster search capabilities. Finally, very low-volume
semantic extracts (such as lakes, forests, etc) provide extremely rapid
search. The philosophy here is to work as high in the pyramid as possible,
searching on data abstracts, but to provide for dropping to higher volume
search algorithms when the abstracts do not provide the level of specificity required.
Content-based query is specified in terms
of object types and image examples. Where
pre-existing semantic categories capture
a user's intent, they can be specified by
name from a menu. When new object types are
required, they can be processed using lower
abstraction levels (such as texture, or raw
pixels) and specified using sample image
clips. A drag-and-drop java-based interface
is used for specifying queries. Queries are
posed as nested English sentences in a tightly
constrained grammar and can include phrases
such as "looks like XXXX" where
XXXX can be one or more image samples.
Efficient mining of image archives is achieved
by developing algorithms that operate on multi-resolution representations
of the image. In a fashion analogous to the multiple abstraction pyramid
described above, we can also store information at multiple resolutions
at any given abstraction level. We have developed mining algorithms including
classification and template matching that capitalize on a multi-resolution
representation. Such algorithms can provide a 10-30 time speed-up
with minimal loss of accuracy.
Back to Top
Andreas S. Weigend
Information Systems Department, Stern School of Business, NYU
While the first round of applications of modern machine learning techniques
to real world problems focused primarily
on prediction--the necessary first step for acceptance in applied communities--more
recent approaches have emphasized the importance of understanding.
Beside taking domain specific knowledge into account (such as the fact
that adjacent time steps are adjacent in time), an important approach
to understanding and explaining data is by introducing hidden variables or
states, and trying to understand them.
In this talk, I present an architecture I have developed that goes beyond
the simple nonlinear regression models typically
assumed by the physicists-in-finance community, (historically
coming from dynamical systems and chaos). The architecture takes
time seriously by allowing for hidden states: it models time dependency
explicitly as a first-order Markov model with transitions between these hidden
states. These transition probabilities can be functions of external
variables, and the emission models are nonlinear feedforward neural networks.
I will give examples of the characterization of the hidden states for
several financial examples on different time scales (daily and
high-frequency), as well characterize the performance of trading strategies
in terms of profit and loss, Sharpe ratio etc.
This is joint work with Shanming Shi at J P Morgan (New York) and University of Colorado (Boulder).
Back to Top
Jonathan R. M. Hosking
IBM Research/Yorktown
The fast-growing area of data mining can be regarded as a collection
of methods for drawing inferences from data.
The aims of data mining, and some of its methods, overlap with those
of classical statistics. However, there are some philosophical and
methodological differences. I examine these differences, and I describe
three approaches to machine learning that have developed largely
independently: classical statistics, Vapnik's statistical learning
theory, and computational learning theory. Comparing these approaches,
I conclude that statisticians and data miners can profit
by studying each other's methods and using a judiciously chosen combination
of them.
This is joint work with Edwin Pednault and Madhu Sudan
(IBM T. J. Watson Research Center).
Back to Top
Inderpal Bhandari
IBM Research/Hawthorne
This presentation will present an overview of the data mining research work
in the Advanced Scout project. The demonstration
of Advanced Scout will illustrate (from an end-user's perspective)
the analytical process beginning with the collection of raw data,
the data-mining operations, and ending with decision support and strategy
formulation. The underlying algorithm, the challenge of making data mining
suitable for the lay-person, and the cognitive processes of pattern interpretation
will be discussed. Beyond the team level, the use of Advanced
Scout by the NBA, and by ESPN will also be illustrated. The latter part
of the presentation will illustrate (from an organizational perspective)
the generalizations learned about designing, deploying, supporting and
iteratively developing a data mining solution and understanding the impact
that data-mining can have beyond any specific domain.
Back to Top
Philip Yu
IBM Research/Hawthorne
The talk provides an overview of our recent
work of exploring data mining on the Web.
There are two different aspects. One is mining
Web usage data to better understand surfer
behavior and improve the organization of
the Web presentation. We have developed a
Web usage mining tool, SpeedTracer, to address
this need. The basic features of SpeedTracer
will be covered in this talk. The other aspect
is to take advantage of the prevalence of
the Web browser interface to do interactive
data mining. We will discuss new approaches
on interactive mining of association rules
and sequence databases.
Back to Top
Revised
October 13, 2003
|