C. Aggarwal, J. Han,
J. Wang, and P.S. Yu, "A Framework for Projected Clustering
of High Dimensional Streams", Proceedings of the International
Conference on Very Large Data Bases, Sept. 2004.
Abstract
The data stream problem has been studied extensively in recent
years, because of the great ease in collection of stream data.
The nature of stream data makes it essential to use algorithms
which require only one pass over the data. Recently, single-scan,
stream analysis methods have been proposed in this context. However,
a lot of stream data is high-dimensional in nature. High-dimensional
data is inherently more complex in clustering, classification,
and similarity search. Recent research discusses methods for projected
clustering over high-dimensional data sets. These methods are, however,
difficult to generalize to data streams because of their complexity
and the large volume of the data streams. In this
paper, we propose a new, high-dimensional, projected data stream
clustering method, called HPStream. The method incorporates a
fading cluster structure, and the projection based clustering
methodology. It is incrementally updatable and is highly scalable
on both the number of dimensions and the size of the data streams,
and it achieves better clustering quality in comparison with the
previous stream clustering methods. Our performance study with
both real and synthetic data sets demonstrates the efficiency
and effectiveness of our proposed framework and implementation
methods.
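As a rough illustration of what an incrementally updatable "fading cluster structure" might look like, the sketch below keeps exponentially decayed per-dimension statistics for one cluster and picks the dimensions with the smallest decayed variance as its projection. It is a minimal sketch under stated assumptions, not the paper's implementation: the class name `FadingCluster`, the `decay_rate` parameter, the decay form 2^(-decay_rate * elapsed time), and the per-cluster variance ranking are all choices made here for illustration and are not given in the abstract.

```python
class FadingCluster:
    """Decayed per-dimension summary of the points assigned to one cluster.

    Hypothetical illustration only; names and decay form are assumptions.
    """

    def __init__(self, n_dims, decay_rate):
        self.decay_rate = decay_rate      # assumed decay parameter
        self.weight = 0.0                 # decayed count of points
        self.linear_sum = [0.0] * n_dims  # decayed sum of x_j
        self.square_sum = [0.0] * n_dims  # decayed sum of x_j ** 2
        self.last_update = 0.0

    def _fade(self, now):
        # Scale all statistics by 2^(-decay_rate * elapsed time).
        factor = 2.0 ** (-self.decay_rate * (now - self.last_update))
        self.weight *= factor
        self.linear_sum = [s * factor for s in self.linear_sum]
        self.square_sum = [s * factor for s in self.square_sum]
        self.last_update = now

    def add(self, point, now):
        # One-pass update: fade the old statistics, then absorb the new point.
        self._fade(now)
        self.weight += 1.0
        for j, x in enumerate(point):
            self.linear_sum[j] += x
            self.square_sum[j] += x * x

    def centroid(self):
        return [s / self.weight for s in self.linear_sum]

    def variances(self):
        # Decayed variance per dimension, clamped at zero against rounding.
        return [
            max(q / self.weight - (s / self.weight) ** 2, 0.0)
            for s, q in zip(self.linear_sum, self.square_sum)
        ]

    def projected_dims(self, n_keep):
        # Keep the n_keep dimensions along which the cluster is tightest.
        var = self.variances()
        tightest = sorted(range(len(var)), key=var.__getitem__)[:n_keep]
        return sorted(tightest)


cluster = FadingCluster(n_dims=3, decay_rate=1.0)
cluster.add([1.0, 10.0, 5.0], now=0.0)
cluster.add([1.0, 20.0, 5.5], now=1.0)
print(cluster.centroid(), cluster.projected_dims(2))
```

Because each update touches only the summary statistics, the cost per arriving point is linear in the number of dimensions and independent of how many points have been seen, which is the property a single-pass stream method needs.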
C. Aggarwal and P.S. Yu, "A Condensation
Approach for Privacy Preserving Data Mining", Proceedings of
the 9th International Conference on Extending Database Technology,
Heraklion-Crete, Greece, March 2004.
Abstract
In recent years, privacy preserving data mining has become an
important problem because of the large amount of personal data
which is tracked by many business applications. In many cases,
users are unwilling to provide personal information unless the
privacy of sensitive information is guaranteed. In this paper,
we propose a new framework for privacy preserving data mining
of multi-dimensional data. Previous work for privacy preserving
data mining uses a perturbation approach which reconstructs data
distributions in order to perform the mining. Such an approach
treats each dimension independently and therefore ignores the
correlations between the different dimensions. In addition, it
requires the development of a new distribution based algorithm
for each data mining problem, since it does not use the multi-dimensional
records, but uses aggregate distributions of the data as input.
This leads to a fundamental re-design of data mining algorithms.
In this paper, we will develop a new and flexible approach for privacy
preserving data mining which does not require new problem-specific
algorithms, since it maps the original data set into a new anonymized
data set. This anonymized data closely matches the characteristics
of the original data including the correlations among the different
dimensions. We present empirical results illustrating the effectiveness
of the method.
W. Fan, H. Wang, and P.S. Yu, "Mining Extremely Skewed Trading
Anomalies", Proceedings of the 9th International Conference
on Extending Database Technology, Heraklion-Crete, Greece, March
2004.
Abstract
Trading surveillance systems screen and detect anomalous trades
of equities, bonds, and mortgage certificates, among others. This is
to satisfy federal trading regulations as well as to prevent crimes
such as insider trading and money laundering. Most existing trading
surveillance systems are based on hand-coded expert-rules. Such
systems are known to result in a long development process and extremely
high “false positive” rates. We participate in co-developing a
data mining based automatic trading surveillance system for one
of the biggest banks in the US. The challenge of this task is
to handle very skewed positive classes (< 0.01%) as well as a very
large volume of data (millions of records and hundreds of features).
The combination of a very skewed distribution and a huge data volume
poses a new challenge for data mining; previous work addresses these
issues separately, and existing solutions are rather complicated
and not very straightforward to implement. In this paper, we propose
a simple systematic approach to mine “very skewed distribution
in very large volume of data”.