Winning the 2007 ICDM Data Mining Contest
In 2007, the IEEE International Conference on Data Mining (ICDM), one of the most important international conferences in the data mining field, sponsored the 2007 Data Mining Contest, in which all of the entrants analyzed the same data and were judged by the estimation accuracy of their models.
Our TRL data analytics team won first prize, with accuracy more than 10% higher
than any other entry. This achievement was due to our skills with the latest
machine learning techniques and our extensive experience analyzing our
customers' data in real business scenarios.

Goal of the Contest
The task of the contest was to locate a mobile device indoors. Indoor location
estimation is an important component of applications in many areas, such as
marketing and healthcare. The challenge was to find the location of a mobile
device using the signal strength values from several indoor access points.
The data was represented as time series of signal strength values, and only
some of the instances were labeled with location information. The task was to
determine the locations for the unlabeled instances in the time series.
Physical approaches such as triangulation could not be used, since the data
was too noisy because of reflections, interference, shielding, temperature
changes, and so on.
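To make the data layout concrete, here is a minimal sketch in Python of what such partially labeled signal-strength time-series data might look like. The array names, sizes, and the use of (x, y) coordinates as labels are illustrative assumptions, not the official contest format.

import numpy as np

rng = np.random.default_rng(0)

n_steps = 1000          # time steps in the recorded series
n_access_points = 100   # signal-strength readings per step (roughly 100 dimensions)

# Signal strengths: one row per time step, one column per access point.
signals = rng.normal(loc=-70.0, scale=10.0, size=(n_steps, n_access_points))

# Location labels: only a small fraction of time steps carry a label;
# the rest are NaN and must be estimated from the labeled ones.
locations = np.full((n_steps, 2), np.nan)               # assumed (x, y) coordinates
labeled_idx = rng.choice(n_steps, size=n_steps // 10, replace=False)
locations[labeled_idx] = rng.uniform(0.0, 50.0, size=(len(labeled_idx), 2))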

This indoor location estimation problem is also important in the real world.
Our data analytics team is now working on customer flowline analysis with
an IBM consulting team. This service analyzes the in-store flowlines
(tracked movements) of customers to help retailers in the consumer
products industry manage their stores more effectively.
Capturing the Intrinsic Structure of the Data
This task was very difficult because more than 90% of the data points had no
location labels. In addition, the distribution of the data points was complex,
with the original time-series data recorded in a data space of approximately
100 dimensions. To deal with these problems, we needed to grasp the intrinsic
structure of the original data (Figure 1).
The key to understanding this intrinsic structure was our use of the unlabeled data together with a definition of distance appropriate for this problem.

Figure 1: Grasping the intrinsic structure of the original data
Applied Method
We applied a semi-supervised approach using a problem-specific definition
of distance and succeeded in capturing the intrinsic structure of the data.
Semi-supervised machine learning refers to methods that exploit unlabeled
data as well as labeled data. In this case, we connected similar instances
and estimated similar labels for the connected instances (Figure 2).

Figure 2: Estimating labels for similar instances
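As an illustration of the general graph-based semi-supervised idea (connect similar instances, then pass labels along those connections), here is a minimal sketch in Python. The neighborhood size, the use of plain Euclidean distance here, and the iteration scheme are assumptions chosen for clarity; this is not the contest code.

import numpy as np

def propagate_labels(X, y, n_neighbors=5, n_iters=50):
    # Graph-based label propagation: connect each instance to its most
    # similar neighbors and iteratively pass labels along the edges.
    # y holds a class index (0, 1, ...) for labeled instances and -1 otherwise.
    n = len(X)

    # Pairwise Euclidean distances; a smaller distance means more similar.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)

    # Symmetric k-nearest-neighbor graph connecting similar instances.
    W = np.zeros((n, n))
    neighbors = np.argsort(dist, axis=1)[:, :n_neighbors]
    for i in range(n):
        W[i, neighbors[i]] = 1.0
    W = np.maximum(W, W.T)

    # One-hot label matrix; unlabeled rows start at zero.
    classes = np.unique(y[y >= 0])
    one_hot = (y[y >= 0][:, None] == classes[None, :]).astype(float)
    F = np.zeros((n, len(classes)))
    F[y >= 0] = one_hot

    for _ in range(n_iters):
        # Each instance takes the normalized average of its neighbors' labels;
        # the known labels are then clamped back to their true values.
        F = W @ F / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
        F[y >= 0] = one_hot

    return classes[np.argmax(F, axis=1)]

# Toy usage: two well-separated clusters, one labeled point in each.
X = np.vstack([np.random.randn(20, 3), np.random.randn(20, 3) + 5.0])
y = np.full(40, -1)
y[0], y[20] = 0, 1
print(propagate_labels(X, y))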
Here we introduced a form of similarity defined by both time and data-space
information to obtain a proper definition of similar instances.
For similarity in the data space, we found that a broader definition of
a norm than the standard Euclidean norm improved the prediction accuracy
(Figure 3).

Figure 3: Equidistance curves using different norm definitions
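The article does not state which broader norm family was used. As one common example only, a Minkowski (L_p) distance generalizes the Euclidean norm, and values of p other than 2 produce the kind of non-circular equidistance curves sketched in Figure 3.

import numpy as np

def minkowski_distance(a, b, p=2.0):
    # L_p (Minkowski) distance between two signal-strength vectors.
    # p = 2 is the standard Euclidean norm, p = 1 is the Manhattan norm,
    # and p < 1 gives still broader equidistance curves (no longer a true
    # norm, but still usable as a dissimilarity measure).
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

# Toy signal-strength vectors (values are illustrative).
a = np.array([-70.0, -55.0, -80.0])
b = np.array([-72.0, -60.0, -75.0])
for p in (2.0, 1.0, 0.5):
    print(f"p={p}: distance = {minkowski_distance(a, b, p):.2f}")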
For similarity along the time axis, we exploited the fact that, even when
the signal strength values change abruptly, two instances can be considered
similar if they are adjacent on the time axis. With these two
definitions we achieved large improvements in accuracy on the given task.
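How the two notions of similarity were actually combined is not given in the article. The sketch below shows just one simple way to encode the idea that temporally adjacent instances count as similar even when their signal vectors differ sharply; the parameters p, gamma, and time_window, and the use of a maximum to combine the two terms, are illustrative assumptions.

import numpy as np

def combined_similarity(x_i, x_j, t_i, t_j, p=1.0, gamma=0.1, time_window=1):
    # Data-space similarity: closeness under an L_p distance, mapped to (0, 1].
    space_dist = np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p)
    space_sim = np.exp(-gamma * space_dist)

    # Time-axis similarity: instances adjacent in time are treated as similar
    # even if their signal values changed abruptly between the two steps.
    time_sim = 1.0 if abs(t_i - t_j) <= time_window else 0.0

    # Two instances are similar if either criterion holds.
    return max(space_sim, time_sim)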
