Stream Data Mining
Stream mining algorithms are being developed
to discover useful knowledge from data as it streams in, a process
that is not as straightforward as it might seem. The data rate of
the stream is not constant, a condition called burstiness, and both
the patterns in the data stream and the resources available for
scheduling evolve continuously.
Recent emerging applications,
such as trade surveillance for securities fraud and money laundering,
network traffic monitoring, sensor network data analysis, Web click
stream mining, power consumption measurement, and dynamic tracing
of stock fluctuations, call for the study of a new kind of data. Called
stream data, it is a continuous, potentially infinite flow of
information, as opposed to a finite, statically stored data set. Besides
querying data streams, another important application is to mine
data streams for interesting patterns or anomalies as they happen.
For data stream applications,
the volume of data is usually too large to be stored on permanent
storage devices or to be scanned more than once. Both approximation
and the ability to adapt are key ingredients for executing queries
and performing mining tasks over rapid data streams. The following
figure shows a stream mining application example:

[Figure: A stream mining application example]
Our work addresses the need
to (1) capture the evolving nature of the data stream, and (2) handle
the bursty nature of the stream with limited resources.
To address the evolving data stream, a summary,
or condensation, approach is developed, and the condensed representation
is used to track changes over time. In clustering, for example,
the data points in the stream are condensed into a moderate number
of micro-clusters, and the evolution of the micro-clusters is tracked
instead of that of the unbounded number of individual data points. Another
issue is temporal granularity: as time advances, people are more
interested in recent events, so the application can devote more
resources to exploring recent data at finer granularities. For mining
association rules, similar
concepts can be applied to the tracking of frequent itemsets. For
classification, ensembles of classifiers are developed, with a
new classifier learned in each period. Instead of continuously revising
a single classification model, the classifiers learned from
sequential data chunks in the stream are combined. Maintaining only the
most up-to-date classifier is not necessarily the ideal choice, because
potentially valuable information may be wasted by discarding the results
of previously trained, less accurate classifiers. We show that, to
avoid overfitting and the problem of conflicting concepts, the
expiration of old data must rely on the data's distribution rather
than only on its arrival time. The ensemble approach offers this capability
by giving each classifier a weight based on its expected prediction
accuracy on the current test examples. We further investigate the
issue of change detection without true labels for the stream data.
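
The ensemble idea described above can be illustrated with a small sketch. It is not the project's implementation: the base learner (a nearest-centroid classifier), the class names, the even/odd split of each chunk, and the use of plain validation accuracy as the weight are all assumptions made for illustration. What it tries to show is that members are kept or dropped according to how well they fit the newest data, not according to their age.

```python
from collections import defaultdict


class CentroidClassifier:
    """Toy base learner (assumed for illustration): nearest class centroid."""

    def __init__(self, chunk):
        sums, counts = {}, defaultdict(int)
        for features, label in chunk:
            acc = sums.setdefault(label, [0.0] * len(features))
            for i, value in enumerate(features):
                acc[i] += value
            counts[label] += 1
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }

    def predict(self, features):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(features, centroid))
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


def accuracy(clf, chunk):
    """Fraction of (features, label) pairs in the chunk that clf gets right."""
    return sum(clf.predict(x) == y for x, y in chunk) / len(chunk)


class WeightedEnsemble:
    """Keeps the classifiers that score best on the newest chunk (hypothetical design)."""

    def __init__(self, max_members=5):
        self.max_members = max_members
        self.members = []  # list of (classifier, weight) pairs

    def update(self, chunk):
        # Assumption: even-indexed examples train the new member, odd-indexed
        # examples stand in for the "current test examples" used for weighting.
        train, valid = chunk[0::2], chunk[1::2]
        candidates = [clf for clf, _ in self.members] + [CentroidClassifier(train)]
        scored = [(clf, accuracy(clf, valid)) for clf in candidates]
        # Expiration depends on fit to the current data, not on arrival time.
        scored.sort(key=lambda member: member[1], reverse=True)
        self.members = scored[: self.max_members]

    def predict(self, features):
        votes = defaultdict(float)
        for clf, weight in self.members:
            votes[clf.predict(features)] += weight
        return max(votes, key=votes.get)
```

Calling `update` once per labeled chunk and `predict` in between is enough to exercise the sketch. When the concept in the stream flips, classifiers trained on the old concept score poorly on the new chunk and lose their influence on the vote, while an old classifier that still matches the current distribution keeps its weight.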
Because data streams can be read only once as they pass through,
two types of resources, memory space and computation power,
are particularly valuable in the streaming environment. An effective
algorithm for data streams is expected to be resource-aware,
which means that it adaptively allocates the resources
available in the streaming environment.
The overall goal is to maximize precision by making the best use
of available resources. Advanced scheduling and memory management
algorithms are being developed.
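
One simple way to picture resource awareness under bursty arrivals is load shedding: when more items arrive than the processing budget allows, only a random fraction is passed on to the mining algorithm. The sketch below is an illustration under stated assumptions, not the scheduling or memory management algorithms mentioned above. The class name, the per-window budget, and the exponentially smoothed arrival estimate are all hypothetical choices.

```python
import random


class AdaptiveLoadShedder:
    """Samples each window so that roughly `budget` items are kept (hypothetical design)."""

    def __init__(self, budget, smoothing=0.5, rng=None):
        self.budget = budget                    # items we can afford to process per window
        self.smoothing = smoothing              # weight given to the newest arrival count
        self.expected_arrivals = float(budget)  # running estimate of the arrival rate
        self.rng = rng or random.Random()

    @property
    def keep_probability(self):
        # Keep everything while the estimated rate fits the budget,
        # otherwise keep a proportional random fraction.
        return min(1.0, self.budget / max(self.expected_arrivals, 1.0))

    def process_window(self, items):
        """Return the sampled items, then update the arrival-rate estimate."""
        p = self.keep_probability
        kept = [item for item in items if self.rng.random() < p]
        self.expected_arrivals = (
            self.smoothing * len(items)
            + (1.0 - self.smoothing) * self.expected_arrivals
        )
        return kept
```

Because the estimate is updated only after a window has been processed, this sketch reacts to a burst with a lag of one window. A larger `smoothing` value makes it follow bursts more quickly at the cost of a noisier sampling rate.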
Another related issue is privacy. The data streams
may contain personal data that needs to be protected. Some anonymization
or depersonalization mechanism needs to be applied to the data as
it streams in, before the data can be provided to the mining algorithms.
A condensation-based approach is developed to create k-anonymity
on the fly. A record is said to satisfy k-anonymity when there
are at least k-1 other records in the data from which it cannot be
distinguished. The approach regenerates anonymized records
from the data stream to preserve privacy while maintaining the other
characteristics of the records, with little effect on the mining results.
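
The condensation idea can be sketched as follows: records are collected into groups of k, only summary statistics of each group are retained, and synthetic records are regenerated from those statistics. This is a deliberately simplified illustration, not the published method. It assumes purely numeric records, groups them by arrival order instead of by similarity, and draws each attribute independently from a Gaussian, which ignores correlations between attributes. The class name `StreamCondenser` is hypothetical.

```python
import random
import statistics


class StreamCondenser:
    """Buffers k numeric records, then emits k synthetic ones (simplified sketch)."""

    def __init__(self, k, rng=None):
        self.k = k
        self.buffer = []
        self.rng = rng or random.Random()

    def add(self, record):
        """Add one record; return k synthetic records when a group fills, else []."""
        self.buffer.append(record)
        if len(self.buffer) < self.k:
            return []
        columns = list(zip(*self.buffer))
        self.buffer = []  # the original records are discarded here
        means = [statistics.fmean(col) for col in columns]
        stdevs = [statistics.pstdev(col) for col in columns]
        # Assumption: independent per-attribute Gaussians stand in for the
        # group's distribution; only the group statistics shape the output.
        return [
            [self.rng.gauss(m, s) for m, s in zip(means, stdevs)]
            for _ in range(self.k)
        ]
```

Each synthetic record depends only on the statistics of a group of k originals, so it cannot be traced back to any single one of them, while the per-attribute means and spreads that the mining algorithm sees stay close to those of the original data.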
C. Aggarwal, "An
Intuitive Framework for Understanding Changes in Evolving Data Streams",
Proceedings of the International Conference on Data Engineering,
San Jose, CA, Feb. 2002.
C. Aggarwal, "A Framework for
Change Diagnosis of Data Streams", Proceedings of the ACM
SIGMOD Conference, San Diego, CA, June 2003.
C. Aggarwal, J. Han, J. Wang, and P.S. Yu, "A
Framework for Clustering Evolving Data Streams", Proceedings
of the International Conference on Very Large Data Bases, Berlin,
Germany, Sept. 2003.
C. Aggarwal, J. Han, J. Wang, and P.S. Yu, "On-demand
Classification of Evolving Data Streams", Proceedings of
the ACM SIGKDD Conference, Seattle, WA, Aug. 2004.
C. Aggarwal, J. Han, J. Wang, and P.S. Yu, "A
Framework for Projected Clustering of High Dimensional Streams",
Proceedings of the International Conference on Very Large Data Bases,
Sept. 2004.
C. Aggarwal and P.S. Yu, "A Condensation
Approach for Privacy Preserving Data Mining", Proceedings
of the 9th International Conference on Extending Database Technology,
Heraklion-Crete, Greece, March 2004.
W. Fan, H. Wang, and P.S. Yu, "Active
Mining of Data Streams", Proceedings of the SIAM International
Conference on Data Mining, April 2004.
W. Fan, H. Wang, and P.S. Yu, "Mining
Extremely Skewed Trading Anomalies", Proceedings of the
9th International Conference on Extending Database Technology, Heraklion-Crete,
Greece, March 2004.
Y. Law, H. Wang, and C. Zaniolo, "Query
Languages and Data Models for Database Sequences and Data Streams",
Proceedings of the International Conference on Very Large Data Bases,
Sept. 2004.
W-G. Teng, M-S. Chen, and P.S. Yu, "A
Regression-based Temporal Pattern Mining Scheme for Data Streams",
Proceedings of the International Conference on Very Large
Data Bases, Berlin, Germany, Sept. 2003.
W-G. Teng, M-S. Chen, and P.S. Yu, "Using
Wavelet-based Resource-aware Mining to Explore Temporal and Support
Count Granularities in Data Streams", Proceedings of the
SIAM International Conference on Data Mining, April 2004.
H. Wang, W. Fan, P.S. Yu, and J. Han, "Mining
Concept-Drifting Data Streams using Ensemble Classifiers",
Proceedings of the ACM SIGKDD Conference, Washington, D.C., Aug.
2003.
P.S. Yu, K.L. Wu, and S.K. Chen, "Monitoring
Continual Range Queries", in Advanced Web Technologies
and Applications, (Lecture Notes in Computer Science LNCS 3007),
ed. by J.X. Yu, X. Lin, H. Lu, Y. Zhang, Springer, pp. 1-12, 2004.
Copyright © (2003, 2004) by Association
for Computing Machinery, Inc. Permission to make digital or hard
copies of part or all of this work for personal or classroom use
is granted without fee provided that copies are not made or distributed
for profit or commercial advantage. To copy otherwise, to republish,
to post on servers, or to redistribute to lists, requires prior
specific permission and/or a fee.
Copyright © (2002) by IEEE. Permission
to make digital or hard copies of part or all of this work for personal
or classroom use is granted without fee provided that copies are
not made or distributed for profit. To copy otherwise, to republish,
to post on servers, or to redistribute to lists, requires prior
specific permission and/or a fee.
P.S. Yu received the Outstanding
Contributions Award from the IEEE Data Mining Conference (2003).
P.S. Yu delivered a keynote
speech on "Real-time Monitoring and Surveillance using
Data Stream Mining" at the IEEE International Conference on Data
Mining, Dec. 2003.
What is the most exciting potential
future use for the work you're doing?
I believe that stream mining
technology can dramatically change the way corporations
and governments or even individuals handle or process data.
What is the most interesting part
of your research?
Today, every organization faces the issue
of information or data overflow. Too much data is
received or generated every day, and very little use is made of
it. The most interesting area in the project has been
developing the technology to distill useful knowledge from
the vast amount of data as it streams in.
What inspired you to go into this
field?
The amount of data generated or accumulated is
going up at a very fast rate. Corporations need better technology
to make sense out of their data and turn the knowledge learned
from the data into a competitive advantage. Stream data
mining offers a new technology to help achieve this.
What is your favorite invention
of all time?
The Web.
Research Team
Charu Aggarwal
Gang Luo
Haixun Wang
Joel Leonard Wolf