Modern sensor and information technologies make it possible to collect
sensor data continuously, typically as real-time, real-valued numerical
streams. Examples include vehicles driving around a city or a power plant
generating electricity, each of which can be equipped with numerous sensors
that produce data from moment to moment.

Though data-gathering systems are becoming relatively mature, much innovative research remains to be done on knowledge discovery from these huge repositories of data. This project focuses on developing knowledge discovery methodologies, mainly for real-valued data generated in manufacturing industries such as the automotive and other heavy industries.

## Theory of subsequence time-series clustering

As shown on the webpage on Data Analytics for Structured Data, treating
graphs as structured data strongly motivates extensions of traditional
machine learning methodologies. Similarly, sensor data is interesting in
that its features differ markedly from those of traditional vector data.

Until 2003, the uniqueness of time-series data had attracted little attention
in the data mining community, and many machine learning techniques had
been applied under the optimistic assumption that time-series data could
be treated essentially the same way as traditional vector data. However,
this optimism was refuted by a surprising report published in 2003.

Subsequence time-series clustering (STSC) had been a widely used pattern-extraction
technique for time-series data (see the figure below) since
the late 1990s. Despite its popularity, it was pointed out in 2003 that
this method produces only pseudo-patterns that are almost independent of
the nature of the input time series. The mechanism behind this surprising
phenomenon was unknown.
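The phenomenon is easy to reproduce. The following is an illustrative sketch of sliding-window STSC with a plain Lloyd's k-means (not any particular published implementation): even on a random walk with no periodic structure, the cluster centers come out as smooth, wave-like shapes.

```python
import numpy as np

def sliding_window_subsequences(x, w):
    """Extract all length-w subsequences of x with stride 1."""
    n = len(x) - w + 1
    return np.stack([x[i:i + w] for i in range(n)])

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns the cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each subsequence to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return centers

# a random-walk series with no periodic structure at all
rng = np.random.default_rng(1)
x = np.cumsum(rng.standard_normal(2000))
X = sliding_window_subsequences(x, w=64)
centers = kmeans(X, k=4)
```

Plotting `centers` shows smooth, sinusoid-like shapes, which are the pseudo-patterns reported in 2003 rather than structure genuinely present in the input.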

In 2006, we theoretically explained why STSC breaks down (T. Ide, PKDD 2006). Specifically, we showed how sliding-window-based
k-means STSC introduces a mathematical artifact into the data, and why
this artifact is so unexpectedly strong that the resulting cluster centers
are dominated by it, irrespective of the details of the data.

While the 2003 experimental report challenged the optimism of the
data mining community by pointing out the contrary reality, our work can
be thought of as a practical basis for future attempts to make some new
form of STSC more meaningful.

## Change analysis of highly dynamic systems

Perhaps the most practically important task in mining sensor data is anomaly detection. Statistics-based anomaly detection has a long history of study, and in the data mining context there are also standard methods such as those based on principal component analysis (PCA).
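For concreteness, here is a minimal sketch of the PCA-based flavor of anomaly detection: fit a principal subspace on normal data and score test points by their reconstruction error outside it. All names and parameter choices are illustrative, not a specific published method.

```python
import numpy as np

def pca_anomaly_scores(X_train, X_test, n_components=2):
    """Score test points by squared reconstruction error outside
    the top principal subspace of the (normal) training data."""
    mu = X_train.mean(0)
    # principal directions from the SVD of the centered training data
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    V = Vt[:n_components].T            # (dims, n_components)
    R = X_test - mu
    proj = R @ V @ V.T                 # projection onto the subspace
    return ((R - proj) ** 2).sum(1)    # residual energy = anomaly score

# normal data varies mostly along the first three axes;
# the last test point is perturbed along a low-variance axis
rng = np.random.default_rng(0)
normal = rng.standard_normal((200, 5)) @ np.diag([3, 2, 1, 0.1, 0.1])
test = np.vstack([normal[:5], normal[:1] + 8 * np.eye(5)[4]])
scores = pca_anomaly_scores(normal, test, n_components=3)
# the perturbed last point receives a much larger score than the rest
```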

However, such standard approaches run into difficulties when treating
high-dimensional and highly dynamic systems such as automobiles. This is
mainly because large fluctuations of the system make it extremely difficult
to define the normal state quantitatively.

Recently, we showed that practical anomaly analysis can be done even in
highly dynamic systems, based on the notion of the neighborhood preservation
principle (Ide, Papadimitriou, and Vlachos, ICDM 2007).

Our approach reduces the failure-analysis task for multi-sensor systems
to a problem of graph comparison (see the figure above). Similarities between
sensors are first computed to produce a complete graph representing the
relationships among the sensors. Unfortunately, this graph is quite unstable
due to dynamic fluctuations of the system. To handle this, we decompose
the global graph into a set of small subgraphs, each of which includes only
tightly coupled sensor pairs. The anomaly score of each sensor is then computed
by evaluating how each tightly coupled pair has changed. In our latest extensive
experiments on a sensor validation task, our method achieved a detection ratio
of more than 98%.
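The idea can be sketched in a few lines. The code below is a deliberately simplified stand-in for the ICDM 2007 method: it builds the sensor graph from plain correlations, keeps only tightly coupled pairs, and scores each sensor by how much its tight couplings change between a reference window and a test window. The threshold and cost choices are illustrative assumptions.

```python
import numpy as np

def sensor_anomaly_scores(X_ref, X_test, tau=0.6):
    """Score each sensor by the largest change among its tightly
    coupled correlations between two windows.
    X_ref, X_test: (time, sensors) arrays; tau: coupling threshold."""
    C_ref = np.corrcoef(X_ref, rowvar=False)
    C_test = np.corrcoef(X_test, rowvar=False)
    n = C_ref.shape[0]
    scores = np.zeros(n)
    for i in range(n):
        # neighbors tightly coupled with sensor i in the reference window
        nbrs = [j for j in range(n) if j != i and abs(C_ref[i, j]) >= tau]
        if nbrs:
            scores[i] = max(abs(C_ref[i, j] - C_test[i, j]) for j in nbrs)
    return scores

# sensors 0 and 1 are tightly coupled; sensor 2 is independent
rng = np.random.default_rng(0)
s = rng.standard_normal(500)
X_ref = np.stack([s, s + 0.1 * rng.standard_normal(500),
                  rng.standard_normal(500)], axis=1)
X_test = X_ref.copy()
X_test[:, 1] = rng.standard_normal(500)   # sensor 1's coupling breaks
scores = sensor_anomaly_scores(X_ref, X_test)
# sensors 0 and 1 flag the broken pair; sensor 2 has no tight neighbors
```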

## Data analytics for massive data

Now that vast amounts of sensor data are collected automatically,
we face scalability problems in data analytics. In general, the problems
of massive data analytics are twofold: first, how to compress the data
effectively without losing its essential features; second, how to optimize
the runtime environments of analytic computer systems.

Regarding the first problem, we have studied matrix compression techniques
(see the figure below). This work accelerates PCA by carefully using an
approach from Krylov subspace learning (Ide and Tsuda, SDM 2007). We applied this approach to a change detection method called singular
spectrum transformation (Ide and Inoue, SDM 2005), achieving a speedup of more than 50 times over the conventional method.
Krylov subspace learning has recently been shown to have interesting
theoretical relationships with several advanced machine learning methods,
which is itself an interesting research area.
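To illustrate the Krylov-subspace idea in its textbook form (a generic Lanczos iteration, not the exact SDM 2007 algorithm): instead of a full eigendecomposition, a few matrix-vector products build a small tridiagonal matrix whose extreme eigenvalues approximate those of the original symmetric matrix, which is why it pairs so well with PCA, where only a few leading components are needed.

```python
import numpy as np

def lanczos_top_eigs(A, k, m=30, seed=0):
    """Approximate the top-k eigenpairs of a symmetric matrix A using
    at most m Lanczos steps (a Krylov subspace method), with full
    reorthogonalization for numerical stability."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    Q = np.zeros((n, m))
    alpha = np.zeros(m)
    beta = np.zeros(m)
    for j in range(m):
        Q[:, j] = q
        w = A @ q
        alpha[j] = q @ w
        # orthogonalize the residual against all previous Lanczos vectors
        w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)
        beta[j] = np.linalg.norm(w)
        if beta[j] < 1e-10 or j == m - 1:
            break  # invariant subspace found, or iteration budget spent
        q = w / beta[j]
    # Rayleigh-Ritz step on the small tridiagonal matrix T
    T = np.diag(alpha[:j + 1]) + np.diag(beta[:j], 1) + np.diag(beta[:j], -1)
    evals, evecs = np.linalg.eigh(T)
    order = np.argsort(evals)[::-1][:k]
    return evals[order], Q[:, :j + 1] @ evecs[:, order]

# a symmetric test matrix with a clear spectral gap at the top
A = np.diag(np.concatenate([[10.0, 5.0, 2.0], np.linspace(0.1, 1.0, 47)]))
vals, vecs = lanczos_top_eigs(A, k=2, m=30)
# vals closely approximates the two largest eigenvalues, 10 and 5
```

The appeal for large-scale data is that only `A @ q` products are needed, so `A` never has to be formed or decomposed in full.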

For the second problem, we have studied parallelization techniques
for analysis algorithms on large-scale computer systems. To improve the
performance of large-scale computational tasks such as city traffic simulations,
we developed a resource allocation framework on IBM's multi-node supercomputer
**BlueGene/L** (see the figure below). A user divides the analysis algorithm into
small modules and builds them into the framework. The framework
monitors system resources, e.g., CPU usage and the network cost among
the modules, and reallocates the modules to appropriate nodes so as to
minimize the total computation time.
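The placement decision at the heart of such a framework can be sketched as a small optimization. The cost model and the exhaustive search below are toy assumptions for illustration only, not the actual BlueGene/L framework, whose monitoring and scheduling are far more sophisticated.

```python
from itertools import product

def estimated_time(assignment, cpu_load, comm_cost):
    """Toy cost model: the busiest node's total CPU load, plus the
    communication cost of every module pair split across nodes."""
    node_load = {}
    for m, node in enumerate(assignment):
        node_load[node] = node_load.get(node, 0.0) + cpu_load[m]
    network = sum(comm_cost[i][j]
                  for i in range(len(assignment))
                  for j in range(i + 1, len(assignment))
                  if assignment[i] != assignment[j])
    return max(node_load.values()) + network

def best_assignment(n_modules, n_nodes, cpu_load, comm_cost):
    """Exhaustively choose the module-to-node placement with the
    lowest estimated time (fine for a handful of modules only)."""
    best, best_cost = None, float("inf")
    for assignment in product(range(n_nodes), repeat=n_modules):
        c = estimated_time(assignment, cpu_load, comm_cost)
        if c < best_cost:
            best, best_cost = assignment, c
    return best, best_cost

# three hypothetical modules on two nodes: modules 0 and 1 talk a lot,
# so the cheapest placement keeps them together and isolates module 2
cpu = [0.6, 0.3, 0.5]
comm = [[0.0, 0.4, 0.0],
        [0.4, 0.0, 0.1],
        [0.0, 0.1, 0.0]]
placement, cost = best_assignment(3, 2, cpu, comm)
```

A real system would replace the exhaustive search with a heuristic and feed the cost model with live monitoring data, but the trade-off it balances, CPU load versus inter-node traffic, is the same.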