Modern sensor and information technologies make it possible to continuously
collect sensor data, which is typically obtained as real-time and real-valued
numerical data. Examples include vehicles driving around in cities or a
power plant generating electricity, which can be equipped with numerous
sensors that produce data from moment to moment.
Though data-gathering systems are becoming relatively mature, much innovative research remains to be done on knowledge discovery from these huge repositories of data. This project focuses on developing knowledge discovery methodologies mainly for real-valued data generated in manufacturing industries such as the automotive and other heavy industries.
Theory of subsequence time-series clustering
As shown in the webpage on Data Analytics for Structured Data, treating graphs as structured data strongly motivates us to extend traditional
machine learning methodologies. Similarly, sensor data is interesting in
that it has features very different from traditional vector data.
Until 2003, the uniqueness of time-series data had attracted little attention in the data mining community, and many machine learning techniques were applied under the optimistic assumption that time-series data could be treated essentially in the same way as traditional vector data. However, this optimism was refuted by a surprising report published in 2003.
Subsequence time-series clustering (STSC) had been a widely used pattern extraction technique for time-series data since the late 1990s (see the figure below). In spite of its popularity, it was pointed out in 2003 that this method produces only pseudo-patterns that are almost independent of the nature of the input time-series. The mechanism behind this surprising phenomenon remained unknown.
In 2006, we theoretically explained why STSC breaks down (T. Ide, PKDD 2006).
We showed how the sliding-window-based k-means STSC introduces a mathematical
artifact into the data, and why this artifact is so unexpectedly strong that
the resulting cluster centers are dominated by it, irrespective of the details of the data.
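The phenomenon is easy to reproduce. Below is a minimal sketch (assuming NumPy; the helper functions and the toy k-means are illustrative, not the code from the paper) that runs sliding-window k-means on a pure random walk. As the analysis predicts, the resulting cluster centers tend to come out as smooth, sinusoid-like waveforms even though the input contains no such pattern.

```python
import numpy as np

def sliding_window_subsequences(x, w):
    """Extract all length-w subsequences with a stride of 1 (the STSC setup)."""
    return np.array([x[i:i + w] for i in range(len(x) - w + 1)])

def kmeans(X, k, n_iter=50, seed=0):
    """A minimal k-means; returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each subsequence to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers

rng = np.random.default_rng(1)
series = rng.standard_normal(500).cumsum()   # a random walk: no periodic structure
X = sliding_window_subsequences(series, w=32)
centers = kmeans(X, k=3)                     # centers look smooth regardless of input
```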
While the 2003 experimental report challenged the optimism of the data mining community by pointing out the contrary reality, our work can be thought of as a practical basis for future attempts to make some new form of STSC more meaningful.
Change analysis of highly dynamic systems
Perhaps the most practically important task in data mining from sensor data is anomaly detection. Statistics-based anomaly detection has a long history of study. In the context of data mining, there also exist standard methods such as those using principal component analysis (PCA).
However, such standard approaches have been confronted with difficulties when treating high-dimensional and highly dynamic systems such as automobiles. This is mainly because large fluctuations of the system make it extremely difficult to quantitatively define the normal state.
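For concreteness, a minimal version of such a PCA baseline (a standard technique, not the method developed in this project) scores each test sample by its reconstruction error outside the principal subspace learned from normal-operation data:

```python
import numpy as np

def pca_anomaly_scores(X_train, X_test, n_components=2):
    """Score test samples by reconstruction error outside the PCA subspace
    fitted to normal-operation training data."""
    mu = X_train.mean(axis=0)
    # principal directions from the SVD of the centered training data
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    V = Vt[:n_components].T
    R = X_test - mu
    proj = R @ V @ V.T                      # projection onto the principal subspace
    return ((R - proj) ** 2).sum(axis=1)    # residual energy = anomaly score

rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 5))     # "normal" training samples
X_test = np.vstack([rng.standard_normal((5, 5)),
                    10 + rng.standard_normal((1, 5))])  # last sample is anomalous
scores = pca_anomaly_scores(X_train, X_test)
```

The shifted final sample receives by far the largest score; the difficulty noted above is that for highly dynamic systems no single "normal" subspace is stable enough for this scheme to work well.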
Recently, we showed that practical anomaly analysis can be performed even in highly dynamic systems, based on the notion of the neighborhood preservation principle (Ide, Papadimitriou, and Vlachos, ICDM 2007).
Our approach reduces the failure analysis task for multi-sensor systems
to a problem of graph comparison (see the figure above). Similarities between
sensors are first computed to produce a complete graph representing the
relationships among sensors. Unfortunately, this graph is quite unstable
due to dynamic fluctuations of the system. To handle this, we decompose
the global graph into a set of small subgraphs, each of which includes only
tightly coupled sensor pairs. The anomaly score of each sensor is computed
by evaluating how each tightly coupled pair has changed. In our latest
extensive experiments on a sensor validation task, our method achieved a
detection ratio of more than 98%.
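A simplified sketch of this idea, using plain correlation as the similarity between sensors (the actual method uses more refined machinery), could look like the following:

```python
import numpy as np

def anomaly_scores(ref, test, k=2):
    """For each sensor, keep its k most strongly coupled neighbors in the
    reference data (the 'tightly coupled' pairs) and score the sensor by
    how much those pairwise similarities change in the test data."""
    C_ref = np.corrcoef(ref.T)
    C_test = np.corrcoef(test.T)
    n = C_ref.shape[0]
    scores = np.zeros(n)
    for i in range(n):
        sims = np.abs(C_ref[i]).copy()
        sims[i] = -np.inf                         # exclude self-similarity
        nbrs = np.argsort(sims)[-k:]              # the k tightest neighbors of sensor i
        scores[i] = np.abs(C_ref[i, nbrs] - C_test[i, nbrs]).mean()
    return scores

rng = np.random.default_rng(0)
base = rng.standard_normal((300, 1))
ref = base + 0.1 * rng.standard_normal((300, 4))  # four strongly coupled sensors
test = ref.copy()
test[:, 3] = rng.standard_normal(300)             # sensor 3 breaks its coupling
scores = anomaly_scores(ref, test)                # sensor 3 scores highest
```

Because each score depends only on a sensor's tightest couplings, it stays low under global fluctuations but rises sharply when a preserved neighborhood relationship is broken.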
Data analytics for massive data
Now that vast amounts of sensor data are being collected automatically,
we face scalability problems in data analytics. In general, the problems
of massive data analytics are twofold. The first is how to compress the
data effectively without losing its essential features. The second is how
to optimize the runtime environment for analytic computations.
Regarding the first problem, we have studied matrix compression techniques (see the figure below). This work accelerates PCA by carefully using an approach from Krylov subspace learning (Ide and Tsuda, SDM 2007). We applied this approach to a change-detection method called singular spectrum transformation (Ide and Inoue, SDM 2005), achieving a speedup of more than 50 times over the conventional method. Krylov subspace learning has recently been shown to have interesting theoretical relationships with several advanced machine learning methods, which is also an interesting research area.
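To illustrate the kind of machinery involved, the sketch below uses the classic Lanczos iteration, a representative Krylov subspace method, to approximate the leading eigenvector of a symmetric matrix from a small number of matrix-vector products. This is an illustrative example, not the algorithm of the SDM 2007 paper.

```python
import numpy as np

def lanczos_top_eigvec(A, m=20, seed=0):
    """Approximate the leading eigenvector of a symmetric matrix A with an
    m-step Lanczos (Krylov subspace) iteration.  Full reorthogonalization
    is used for numerical stability in this small sketch."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    Q = np.zeros((n, m))
    alpha = np.zeros(m)
    beta = np.zeros(m)
    for j in range(m):
        Q[:, j] = q
        w = A @ q
        alpha[j] = q @ w
        # orthogonalize against all previous Lanczos vectors
        w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)
        beta[j] = np.linalg.norm(w)
        if beta[j] < 1e-12:               # invariant subspace found; stop early
            m = j + 1
            break
        q = w / beta[j]
    # eigenpairs of the small tridiagonal matrix T approximate those of A
    T = np.diag(alpha[:m]) + np.diag(beta[:m - 1], 1) + np.diag(beta[:m - 1], -1)
    evals, evecs = np.linalg.eigh(T)
    return Q[:, :m] @ evecs[:, -1]        # Ritz vector for the largest eigenvalue

rng = np.random.default_rng(1)
B = rng.standard_normal((50, 50))
A = B @ B.T                               # a symmetric positive semi-definite matrix
v = lanczos_top_eigvec(A)
true_v = np.linalg.eigh(A)[1][:, -1]      # exact leading eigenvector for comparison
```

Because only a handful of matrix-vector products is needed, this kind of iteration is far cheaper than a full eigendecomposition when the matrix is large, which is the source of the speedup mentioned above.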
For the second problem, we have studied parallelization techniques for analysis algorithms on large-scale computer systems. To improve the performance of large-scale computational tasks such as city traffic simulations, we developed a resource allocation framework on IBM's multi-node BlueGene/L supercomputer (see the figure below). A user divides the analysis algorithm into small modules and builds these modules into the framework. The framework monitors system resources, e.g., CPU usage and the network cost among the modules, and reallocates the modules to appropriate nodes to minimize the computational time.
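As a toy illustration of the reallocation idea (the module names and costs are invented, and the real framework also accounts for inter-module network cost), a greedy balancer that places each module on the currently least-loaded node might look like this:

```python
def balance(modules, n_nodes):
    """Greedy longest-processing-time-first assignment of modules to nodes,
    based on measured per-module CPU costs."""
    nodes = [[] for _ in range(n_nodes)]
    loads = [0.0] * n_nodes
    # place the most expensive modules first, each on the least-loaded node
    for name, cost in sorted(modules.items(), key=lambda kv: -kv[1]):
        i = loads.index(min(loads))
        nodes[i].append(name)
        loads[i] += cost

    return nodes, loads

# hypothetical per-module CPU costs for a traffic-simulation pipeline
modules = {"ingest": 4.0, "map_match": 7.0, "simulate": 9.0,
           "aggregate": 2.0, "render": 3.0}
nodes, loads = balance(modules, n_nodes=2)
```

In the real framework this decision is repeated as monitoring data arrives, so modules migrate between nodes as their resource profiles change over time.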