Modern sensor and information technologies make it possible to collect sensor data continuously, typically as real-time, real-valued numerical streams. Examples include vehicles driving around a city or a power plant generating electricity, each of which can be equipped with numerous sensors that produce data from moment to moment.

Though data-gathering systems are becoming relatively mature, much innovative research remains to be done on knowledge discovery from these huge data repositories. This project focuses on developing knowledge discovery methodologies, mainly for real-valued data generated in manufacturing industries such as the automotive and other heavy industries.

Theory of subsequence time-series clustering

As shown in the webpage on Data Analytics for Structured Data, treating graphs as structured data strongly motivates us to extend traditional machine learning methodologies. Similarly, sensor data is interesting in that its features differ greatly from those of traditional vector data.

Until 2003, the uniqueness of time-series data had attracted little attention in the data mining community, and many machine learning techniques had been used with an optimistic assumption that time-series data could be treated essentially in the same way as traditional vector data. However, this optimism was refuted by a surprising report published in 2003.

Subsequence time-series clustering (STSC) had been a widely used pattern extraction technique for time-series data since the late '90s (see the figure below). In spite of its popularity, it was pointed out in 2003 that this method produces only pseudo-patterns that are almost independent of the nature of the input time series. The mechanism behind this surprising phenomenon was unknown.


Subsequence time-series clustering
Fig. 1 - Subsequence time-series clustering. Subsequences are generated with the sliding-window technique and then clustered with standard k-means (see Ide, PKDD 2006).
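The sliding-window-plus-k-means pipeline in Fig. 1 can be sketched in a few lines. This is a generic illustration only; the window length, number of clusters, and toy series below are arbitrary choices, not taken from the cited papers:

```python
import numpy as np

def sliding_windows(x, w):
    """Extract all length-w subsequences of a 1-D series (stride 1)."""
    return np.stack([x[i:i + w] for i in range(len(x) - w + 1)])

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # assign each subsequence to its nearest center
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers, labels

# toy series: a noisy periodic signal
rng = np.random.default_rng(1)
t = np.arange(1000)
x = np.sin(2 * np.pi * t / 50) + 0.1 * rng.standard_normal(1000)

X = sliding_windows(x, w=64)        # (937, 64): one row per subsequence
centers, labels = kmeans(X, k=3)    # each center is an extracted "pattern"
```

Each row of `centers` would then be reported as a representative waveform of the series, which is exactly the kind of output whose validity the 2003 report called into question.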

In 2006, we theoretically explained why STSC breaks down (T. Ide, PKDD 2006). We provided a theoretical explanation of how the sliding-window-based k-means STSC introduces a mathematical artifact into the data, and why this artifact is so unexpectedly strong that the resulting cluster centers are dominated by it, irrespective of the details of the data.
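A minimal experiment hints at the strength of this artifact: running STSC on pure white noise, which contains no repeated patterns by construction, still yields smooth waveform-like "cluster centers", because each center averages hundreds of heavily overlapping windows. The window length, cluster count, and series length below are arbitrary illustrative choices:

```python
import numpy as np

def sliding_windows(x, w):
    return np.stack([x[i:i + w] for i in range(len(x) - w + 1)])

def kmeans_centers(X, k, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers

noise = np.random.default_rng(0).standard_normal(3000)  # no patterns by construction
C = kmeans_centers(sliding_windows(noise, w=128), k=3)

# the extracted "patterns" are far smoother than the data they came from,
# because each center is an average over hundreds of overlapping windows
roughness = lambda v: np.abs(np.diff(v)).mean()
print(round(roughness(noise), 2), [round(roughness(c), 3) for c in C])
```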

While the 2003 experimental report challenged the optimism of the data mining community by pointing out this contrary reality, our work can be thought of as a practical basis for future attempts to make some new form of STSC meaningful.

Change analysis of highly dynamic systems

Perhaps the most practically important task in data mining from sensor data is anomaly detection. Statistics-based anomaly detection has a long history of study. In the context of data mining, there also exist standard methods such as those using principal component analysis (PCA).
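As a baseline, a PCA-based anomaly score is typically defined as the reconstruction error of a point outside the principal subspace learned from normal data. The sketch below is a generic textbook version of this idea; the data, dimensions, and perturbation are invented for illustration:

```python
import numpy as np

def pca_anomaly_scores(X_train, X_test, n_components):
    """Score test points by squared reconstruction error off the principal subspace."""
    mu = X_train.mean(axis=0)
    # principal directions from the SVD of the centered training data
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    V = Vt[:n_components].T                 # (d, n_components)
    Z = X_test - mu
    residual = Z - Z @ V @ V.T              # component outside the subspace
    return (residual ** 2).sum(axis=1)

# synthetic "normal" data lying near a 2-D plane in 5-D space
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 5))
normal = rng.standard_normal((500, 2)) @ A + 0.05 * rng.standard_normal((500, 5))

probe = normal[:6].copy()
probe[-1] += 5.0 * rng.standard_normal(5)   # knock the last point off the plane
scores = pca_anomaly_scores(normal, probe, n_components=2)
print(scores.round(3))                      # the perturbed point scores highest
```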

However, such standard approaches run into difficulties when treating high-dimensional, highly dynamic systems such as automobiles, mainly because large fluctuations of the system make it extremely difficult to define the normal state quantitatively.

Recently, we showed that practical anomaly analysis can be done even in highly dynamic systems, based on the notion of the neighborhood preservation principle (Ide, Papadimitriou, and Vlachos, ICDM 2007).

Anomaly detection based on the neighborhood preservation principle
Fig. 2 - Anomaly detection based on the neighborhood preservation principle (Ide, Papadimitriou, and Vlachos, ICDM 2007).

Our approach reduces the failure analysis task for multi-sensor systems to a problem of graph comparison (see the figure above). Similarities between sensors are first computed to produce a complete graph representing the relationships among the sensors. Unfortunately, this graph is quite unstable due to dynamic fluctuations of the system. To handle this, we decompose the global graph into a set of small subgraphs that include only tightly coupled sensor pairs. The anomaly score of each sensor is then computed by evaluating how each tightly coupled pair has changed. In our latest extensive experiments on a sensor validation task, our method achieved a detection ratio of more than 98%.
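The pipeline above can be illustrated with a much-simplified sketch: build a correlation graph over sensors, keep only each sensor's most tightly coupled neighbors, and score a sensor by how much those couplings change between a reference run and the current run. This is not the ICDM 2007 algorithm itself, just a toy illustration of the reduction; the sensor traces and parameters are invented:

```python
import numpy as np

def coupling_graph(X):
    """Absolute correlation between every pair of sensor traces (rows of X)."""
    G = np.abs(np.corrcoef(X))
    np.fill_diagonal(G, 0.0)
    return G

def anomaly_scores(X_ref, X_now, n_neighbors=2):
    """Score each sensor by how much its tightest couplings have changed."""
    G_ref, G_now = coupling_graph(X_ref), coupling_graph(X_now)
    scores = []
    for i in range(len(G_ref)):
        nbrs = np.argsort(G_ref[i])[-n_neighbors:]   # most tightly coupled partners
        scores.append(np.abs(G_ref[i, nbrs] - G_now[i, nbrs]).mean())
    return np.array(scores)

# four sensors driven by a common signal in the reference run
rng = np.random.default_rng(0)
base = rng.standard_normal(400)
X_ref = np.stack([base + 0.1 * rng.standard_normal(400) for _ in range(4)])

X_now = X_ref.copy()
X_now[2] = rng.standard_normal(400)   # sensor 2 decouples from the others
scores = anomaly_scores(X_ref, X_now)
print(scores.round(2))                # sensor 2 gets the highest score
```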

Data analytics for massive data

Now that vast amounts of sensor data are being collected automatically, we face scalability problems in data analytics. In general, the problems of massive data analytics are twofold: first, how to compress the data effectively without losing its essential features; and second, how to optimize the runtime environments of analytic computer systems.

Regarding the first problem, we have studied matrix compression techniques (see the figure below). This work accelerates PCA by carefully applying an approach from Krylov subspace learning (Ide and Tsuda, SDM 2007). We applied this approach to a change detection method called singular spectrum transformation (Ide and Inoue, SDM 2005), achieving a speedup of more than 50 times over the conventional method. Krylov subspace learning has recently been shown to have interesting theoretical relationships with several advanced machine learning methods, which is itself a promising research area.
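The core computational trick, approximating leading eigenpairs from a Krylov subspace built by repeated matrix-vector products, can be illustrated with a plain Lanczos iteration. This is a generic sketch of the idea, not the SDM 2007 algorithm; the data matrix and iteration count are arbitrary:

```python
import numpy as np

def lanczos_top_eigs(matvec, dim, n_steps=30, seed=0):
    """Approximate leading eigenpairs of a symmetric operator, given only a
    matrix-vector product, via the Lanczos iteration."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(dim)
    q /= np.linalg.norm(q)
    Q, alpha, beta = [q], [], []
    for j in range(n_steps):
        v = matvec(Q[-1])
        if j > 0:
            v -= beta[-1] * Q[-2]
        a = Q[-1] @ v
        alpha.append(a)
        v -= a * Q[-1]
        for u in Q:                     # full reorthogonalization (cheap here)
            v -= (u @ v) * u
        b = np.linalg.norm(v)
        if b < 1e-12:                   # Krylov subspace is exhausted
            break
        beta.append(b)
        Q.append(v / b)
    m = len(alpha)
    T = np.diag(alpha) + np.diag(beta[:m - 1], 1) + np.diag(beta[:m - 1], -1)
    evals, evecs = np.linalg.eigh(T)    # tiny tridiagonal eigenproblem
    Qm = np.stack(Q[:m], axis=1)
    return evals[::-1], (Qm @ evecs)[:, ::-1]   # descending order

# PCA of a 2000 x 200 data matrix without ever forming the 200 x 200 covariance
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 200))
X = X - X.mean(axis=0)
cov_mv = lambda v: X.T @ (X @ v) / (len(X) - 1)  # covariance acts via two thin products
vals, vecs = lanczos_top_eigs(cov_mv, dim=200)
```

After a few dozen matrix-vector products, the extreme Ritz values closely match the leading eigenvalues of the full covariance, which is what makes this family of methods attractive for accelerating PCA.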

Fast PCA using Krylov subspace learning
Fig. 3 - Fast PCA using Krylov subspace learning (Ide and Tsuda, SDM 2007).

For the second problem, we have studied parallelization techniques of analysis algorithms for large-scale computer systems. To improve the performance of large-scale computational tasks such as city traffic simulations, we developed a resource allocation framework on IBM's multi-node BlueGene/L supercomputer (see the figure below). A user divides the analysis algorithm into small modules and builds these modules into the framework. The framework monitors system resources, e.g., CPU usage and network cost among the modules, and reallocates the modules to appropriate nodes so as to minimize the computation time.
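The underlying allocation problem can be illustrated with a toy model: given per-module CPU demands and inter-module traffic, choose a module-to-node assignment that minimizes an estimated cost. The cost model, weights, and exhaustive search below are hypothetical simplifications for illustration only; a real framework would measure these quantities at runtime and use a scalable heuristic rather than brute force:

```python
import itertools
import numpy as np

def placement_cost(assign, cpu, traffic, n_nodes, comm_weight=0.1):
    """Toy cost: busiest node's CPU load plus weighted cross-node traffic."""
    loads = np.zeros(n_nodes)
    for module, node in enumerate(assign):
        loads[node] += cpu[module]
    cross = sum(traffic[i, j]
                for i in range(len(assign)) for j in range(len(assign))
                if assign[i] != assign[j])
    return loads.max() + comm_weight * cross

def best_placement(cpu, traffic, n_nodes):
    """Exhaustive search over assignments (feasible only for tiny examples)."""
    return min(itertools.product(range(n_nodes), repeat=len(cpu)),
               key=lambda a: placement_cost(a, cpu, traffic, n_nodes))

cpu = [4.0, 4.0, 1.0, 1.0]              # measured per-module CPU demand
traffic = np.zeros((4, 4))
traffic[0, 2] = traffic[2, 0] = 5.0     # modules 0 and 2 communicate heavily
traffic[1, 3] = traffic[3, 1] = 5.0
best = best_placement(cpu, traffic, n_nodes=2)
print(best)   # co-locates each chatty pair while balancing the two nodes
```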

Resource allocation framework for multi-node computer systems
Fig. 4 - Resource allocation framework for multi-node computer systems (Takahashi and Mizuta, Proc. Winter Simulation Conference 2006, pp. 919-925).