IBM Research

Stream Data Mining

Stream mining algorithms are being developed to discover useful knowledge from data as it streams in -- a process that isn't as straightforward as it might seem. Scientists have to deal with the fact that the data rate of the stream isn't constant, leading to a condition called burstiness, and the patterns of the data stream and scheduling resources are continuously evolving.

Emerging applications such as trade surveillance for securities fraud and money laundering, network traffic monitoring, sensor network data analysis, Web click stream mining, power consumption measurement, and dynamic tracing of stock fluctuations call for the study of a new kind of data. Called stream data, it is a continuous, potentially infinite flow of information, as opposed to a finite, statically stored data set. Besides querying data streams, another important application is mining them for interesting patterns or anomalies as they happen.

For data stream applications, the volume of data is usually too large to be stored on permanent devices or to be scanned thoroughly more than once. Both approximation and the ability to adapt are key ingredients for executing queries and performing mining tasks over rapid data streams. The following figure shows a stream mining application example:

A stream mining application example
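
The single-scan, approximate style of processing described above can be illustrated with reservoir sampling, a standard one-pass technique. It is used here only as an example and is not claimed to be the method used in this project. The function name, the sample size, and the simulated stream are all assumptions.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Keep a uniform random sample of k items from a stream read exactly once."""
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i (0-based) enters the reservoir with probability k / (i + 1),
            # replacing a uniformly chosen slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Simulated stream: the values are never stored as a whole, only the sample is.
sample = reservoir_sample(range(100_000), k=100, rng=random.Random(0))
approximate_mean = sum(sample) / len(sample)
```

Memory use is bounded by k regardless of stream length, at the cost of answering queries (here, a mean) approximately rather than exactly.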

 

Our work addresses the need to (1) capture the evolving nature of the data stream, and (2) handle the bursty nature of the stream with limited resources.

To address the evolving data stream, we develop a summary, or condensation, approach and use the condensed representation to track changes over time. In clustering, for example, the data points in the stream are condensed into a moderate number of micro-clusters, and the evolution of these micro-clusters is tracked instead of the unbounded number of individual data points. Another issue is temporal granularity: as time advances, people are more interested in recent events, so the application can devote more resources to exploring recent data at finer granularities. For mining association rules, similar concepts can be applied to the tracking of frequent item sets.

For classification, we develop ensembles of classifiers in which a new classifier is learned in each period. Instead of continuously revising a single classification model, we combine the classifiers learned from sequential data chunks in the stream. Maintaining only the most up-to-date classifier is not necessarily the ideal choice, because discarding previously trained, less accurate classifiers may waste potentially valuable information. We show that, to avoid overfitting and the problem of conflicting concepts, the expiration of old data must rely on the data's distribution rather than only on arrival time. The ensemble approach offers this capability by giving each classifier a weight based on its expected prediction accuracy on the current test examples. We further investigate the issue of change detection without true labels for the stream data.
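
The accuracy-weighted ensemble described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published algorithm: the one-dimensional threshold classifiers, the use of plain accuracy on the most recent labeled chunk as the weight, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[float, int]          # (feature, label) in a toy 1-D setting
Classifier = Callable[[float], int]  # maps a feature value to a 0/1 label


def threshold_classifier(threshold: float) -> Classifier:
    """Hypothetical stand-in for a model learned from one data chunk."""
    return lambda x: int(x > threshold)


def accuracy(clf: Classifier, chunk: Sequence[Example]) -> float:
    """Fraction of examples in the chunk that the classifier labels correctly."""
    return sum(clf(x) == y for x, y in chunk) / len(chunk)


def weigh_ensemble(classifiers: Sequence[Classifier],
                   recent_chunk: Sequence[Example]) -> List[float]:
    """Weight each classifier by its accuracy on the most recent chunk."""
    return [accuracy(clf, recent_chunk) for clf in classifiers]


def predict(classifiers: Sequence[Classifier],
            weights: Sequence[float], x: float) -> int:
    """Weighted vote: each classifier adds its weight to the label it predicts."""
    votes: Dict[int, float] = {0: 0.0, 1: 0.0}
    for clf, weight in zip(classifiers, weights):
        votes[clf(x)] += weight
    return max(votes, key=votes.get)


# Three classifiers learned in earlier periods (thresholds are made up).
ensemble = [threshold_classifier(t) for t in (2.0, 5.0, 8.0)]
# The most recent labeled chunk follows the concept "label is 1 when x > 5".
recent_chunk = [(1.0, 0), (3.0, 0), (4.0, 0), (6.0, 1), (7.0, 1), (9.0, 1)]

weights = weigh_ensemble(ensemble, recent_chunk)
label = predict(ensemble, weights, 6.0)
```

Older classifiers are never deleted by age alone: a classifier trained long ago keeps a high weight for as long as it agrees with the current data, which is the sense in which expiration depends on distribution rather than arrival time.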

Because data streams can only be processed as they pass through, two types of resources, memory space and computation power, are particularly valuable in a streaming environment. An effective stream algorithm is expected to be resource-aware, that is, to allocate all resources in the streaming environment adaptively. The overall goal is to maximize precision by making the best use of available resources. Advanced scheduling and memory management algorithms are being developed.
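
One simple way to picture resource-aware processing of a bursty stream is load shedding by random sampling: when a time window delivers more records than the processing budget allows, only a fraction is kept. The sketch below is an assumed illustration and does not represent the scheduling or memory management algorithms being developed in this project. The per-window budget and the function names are hypothetical.

```python
import random
from typing import List


def sampling_rate(arrivals: int, budget: int) -> float:
    """Fraction of a window's records that fits within the processing budget."""
    if arrivals <= 0:
        return 1.0
    return min(1.0, budget / arrivals)


def shed_load(window: List[float], budget: int, rng: random.Random) -> List[float]:
    """Keep each record independently with probability sampling_rate(...).

    The expected number of kept records is min(len(window), budget).
    """
    rate = sampling_rate(len(window), budget)
    return [x for x in window if rng.random() < rate]


rng = random.Random(42)
budget = 100                               # records we can afford per window

quiet_window = [1.0] * 50                  # below budget: nothing is dropped
burst_window = [1.0] * 1000                # a burst: roughly 1 in 10 is kept

kept_quiet = shed_load(quiet_window, budget, rng)
kept_burst = shed_load(burst_window, budget, rng)

# Scale the sampled sum back up to estimate the true sum of the burst window.
estimated_sum = sum(kept_burst) / sampling_rate(len(burst_window), budget)
```

During quiet periods the stream is processed exactly; during bursts precision degrades gradually instead of the system falling behind.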

Another related issue is privacy. The data streams may contain personal data that need to be protected. Some anonymity or depersonalization mechanism needs to be applied to the data as it streams in, before the data can be provided to the mining algorithms. A condensation-based approach is developed to create k-anonymity on the fly. A record is said to satisfy k-anonymity when there are at least k other records in the data from which it cannot be distinguished. The approach regenerates the now-anonymous records from the data stream to preserve privacy while maintaining other characteristics of the record, with little effect on the mining results.
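
The idea of condensing records into groups and regenerating anonymous records from group statistics can be sketched as below. This is a heavily simplified illustration, with several assumptions: records are grouped by arrival order in groups of exactly k, each attribute is treated independently (so correlations between attributes are ignored), and synthetic values are drawn from a normal distribution. None of these choices is claimed to match the published approach, and all names are hypothetical.

```python
import random
import statistics
from typing import Iterable, Iterator, List, Sequence

Record = Sequence[float]


def regenerate(group: List[Record], rng: random.Random) -> Iterator[List[float]]:
    """Emit one synthetic record per original, using only group-level statistics."""
    columns = list(zip(*group))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    for _ in group:
        yield [rng.gauss(m, s) for m, s in zip(means, stdevs)]


def condense_stream(stream: Iterable[Record], k: int,
                    rng: random.Random) -> Iterator[List[float]]:
    """Buffer k records, then release k synthetic records in their place."""
    group: List[Record] = []
    for record in stream:
        group.append(record)
        if len(group) == k:
            yield from regenerate(group, rng)
            group = []
    # Records left in an incomplete group are withheld: releasing them
    # would expose a group smaller than k.


# Toy stream of (age, salary) records; the values are made up.
records = [(34.0, 52000.0), (36.0, 58000.0), (35.0, 61000.0),
           (51.0, 90000.0), (49.0, 87000.0), (53.0, 95000.0),
           (28.0, 40000.0)]

anonymized = list(condense_stream(records, k=3, rng=random.Random(0)))
```

Only the per-group means and standard deviations reach the mining algorithm's input, so aggregate patterns survive while individual records do not.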

Selected Publications

C. Aggarwal, "An Intuitive Framework for Understanding Changes in Evolving Data Streams", Proceedings of the International Conference on Data Engineering, San Jose, CA, Feb. 2002.

C. Aggarwal, "A Framework for Change Diagnosis of Data Streams", Proceedings of the ACM SIGMOD Conference, San Diego, CA, June 2003.

C. Aggarwal, J. Han, J. Wang, and P.S. Yu, "A Framework for Clustering Evolving Data Streams", Proceedings of the International Conference on Very Large Data Bases, Berlin, Germany, Sept. 2003.

C. Aggarwal, J. Han, J. Wang, and P.S. Yu, "On-demand Classification of Evolving Data Streams", Proceedings of the ACM SIGKDD Conference, Seattle, WA, Aug. 2004.

C. Aggarwal, J. Han, J. Wang, and P.S. Yu, "A Framework for Projected Clustering of High Dimensional Streams", Proceedings of the International Conference on Very Large Data Bases, Sept. 2004.

C. Aggarwal and P.S. Yu, "A Condensation Approach for Privacy Preserving Data Mining", Proceedings of the 9th International Conference on Extending Database Technology, Heraklion-Crete, Greece, March 2004.

W. Fan, H. Wang, and P.S. Yu, "Active Mining of Data Streams", Proceedings of the SIAM International Conference on Data Mining, April 2004.

W. Fan, H. Wang, and P.S. Yu, "Mining Extremely Skewed Trading Anomalies", Proceedings of the 9th International Conference on Extending Database Technology, Heraklion-Crete, Greece, March 2004.

Y. Law, H. Wang, and C. Zaniolo, "Query Languages and Data Models for Database Sequences and Data Streams", Proceedings of the International Conference on Very Large Data Bases, Sept. 2004.

W-G. Teng, M-S. Chen, and P.S. Yu, "A Regression-based Temporal Pattern Mining Scheme for Data Streams", Proceedings of the International Conference on Very Large Data Bases, Berlin, Germany, Sept. 2003.

W-G. Teng, M-S. Chen, and P.S. Yu, "Using Wavelet-based Resource-aware Mining to Explore Temporal and Support Count Granularities in Data Streams", Proceedings of the SIAM International Conference on Data Mining, April 2004.

H. Wang, W. Fan, P.S. Yu, and J. Han, "Mining Concept-Drifting Data Streams using Ensemble Classifiers", Proceedings of the ACM SIGKDD Conference, Washington, D.C., Aug. 2003.

P.S. Yu, K.L. Wu, and S.K. Chen, "Monitoring Continual Range Queries", in Advanced Web Technologies and Applications (Lecture Notes in Computer Science LNCS 3007), ed. by J.X. Yu, X. Lin, H. Lu, Y. Zhang, Springer, pp. 1-12, 2004.

Copyright © (2003, 2004) by Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

Copyright © (2002) by IEEE. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.


Awards & Recognition

P.S. Yu received the Outstanding Contributions Award from the IEEE Data Mining Conference (2003).

P.S. Yu delivered a keynote speech on "Real-time Monitoring and Surveillance using Data Stream Mining" at the IEEE International Conference on Data Mining, Dec. 2003.

 

Innovators Corner
Philip S. Yu
Researcher

What is the most exciting potential future use for the work you're doing?
I believe that stream mining technology can dramatically change the way corporations and governments or even individuals handle or process data.

What is the most interesting part of your research?
Today, every organization is facing the issue of information or data overflow. Too much data is received or generated every day, and very little use is made of it. The most interesting area in the project has been to develop the technology to distill useful knowledge from the vast amount of data as it streams in.

What inspired you to go into this field?
The amount of data generated or accumulated is going up at a very fast rate. Corporations need better technology to make sense out of their data and turn the knowledge learned from the data into a competitive advantage. Stream data mining offers a new technology to help achieve this.

What is your favorite invention of all time?
The Web.


Team Members
Charu Aggarwal
Wei Fan
Gang Luo
Haixun Wang
Joel Leonard Wolf
Kun-lung Wu
Philip S. Yu

Related Links
Discipline: Computer Science
Research Area: Knowledge Discovery & Data Mining
Research Site: Watson
