
Winning the 2007 ICDM Data Mining Contest

In 2007, the IEEE International Conference on Data Mining (ICDM) sponsored the 2007 Data Mining Contest, in which all entrants tackled the same task and were judged on the estimation accuracy of their models. The ICDM is one of the most important international conferences in the data mining area.

Our TRL data analytics team won first prize, with accuracy more than 10% higher than that of the other entries. This achievement was due to our skill with the latest machine learning techniques and our extensive experience in analyzing customers' data in real business scenarios.

Contest result


Goal of the Contest

The task of the contest was to locate a mobile device indoors. This task is a very important part of applications in many areas, such as marketing or healthcare. The challenge was to find the location of a mobile device using the signal strength values from several indoor access points. The data was given as a time series of signal strength values, only some of which were labeled with location information. The task was to determine the locations of the unlabeled points in the time series.

Physical approaches such as triangulation could not be used, because the data was too noisy owing to reflections, interference, shielding, temperature changes, and so on.

Locating a mobile device


This indoor location estimation problem is also important in the real world. Our data analytics team is now working on customer flowline analysis with an IBM consulting team. This service analyzes the in-store flowlines (tracked movements) of customers to help retailers in the consumer product industry manage their stores more effectively.

Capturing the Intrinsic Structure of the Data

This task was very difficult because more than 90% of the data had no location labels. In addition, the distribution of the data points was complex, since the original time-series data was recorded in a data space of approximately 100 dimensions. To deal with these problems, we needed to grasp the intrinsic structure of the original data (Figure 1).

The key to understanding the intrinsic structure of this data was making use of the unlabeled data and choosing a proper definition of distance for this problem.


Figure 1: Grasp of the intrinsic structure of the original data


Applied Method

We applied a semi-supervised approach using a problem-specific definition of distance and succeeded in grasping the intrinsic structure of the data. Semi-supervised machine learning is an approach that uses unlabeled data as well as labeled data. In this case, we connected similar instances and estimated similar labels for similar instances (Figure 2).

Figure 2: Estimation of the label for the similar instances
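
The idea of connecting similar instances and giving them similar labels can be sketched as a simple graph-based label propagation. This is only an illustrative sketch under our own assumptions, not the team's actual contest code: the function name `propagate_labels`, the weight matrix layout, and the toy chain of five instances are all hypothetical.

```python
def propagate_labels(weights, labels, n_classes, n_iter=100):
    """Spread labels over a similarity graph (illustrative sketch).

    weights: symmetric n x n list of non-negative similarities.
    labels:  list of length n; a class index for labeled instances,
             None for unlabeled ones.
    Returns a list with one predicted class index per instance.
    """
    n = len(weights)
    # Labeled instances start as one-hot scores, unlabeled ones as uniform.
    scores = []
    for y in labels:
        if y is None:
            scores.append([1.0 / n_classes] * n_classes)
        else:
            scores.append([1.0 if c == y else 0.0 for c in range(n_classes)])

    for _ in range(n_iter):
        new_scores = []
        for i in range(n):
            if labels[i] is not None:
                new_scores.append(scores[i])  # keep known labels fixed
                continue
            total = sum(weights[i][j] for j in range(n) if j != i)
            if total == 0.0:
                new_scores.append(scores[i])  # isolated instance: no update
                continue
            # Each unlabeled instance takes the similarity-weighted
            # average of its neighbours' scores.
            new_scores.append([
                sum(weights[i][j] * scores[j][c] for j in range(n) if j != i) / total
                for c in range(n_classes)
            ])
        scores = new_scores

    return [max(range(n_classes), key=lambda c: s[c]) for s in scores]


# Toy example: a chain of five instances, with only the two ends labeled.
# The link between instances 2 and 3 is weak, so the labels split there.
toy_weights = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
toy_labels = [0, None, None, None, 1]
predicted = propagate_labels(toy_weights, toy_labels, n_classes=2)
```

In this sketch the quality of the result depends almost entirely on how the similarities in `weights` are defined, which is why the definition of similarity matters so much for this kind of method.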


Here we introduced a form of similarity defined by both time and data-space information, so that similar instances are properly defined. For similarity in the data space, we found that a broader definition of a norm than the standard Euclidean norm improved the prediction accuracy (Figure 3).
  

Figure 3: Equidistance curves using different norm definitions
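
To illustrate how changing the norm changes which instances count as close, the sketch below uses the Minkowski family of distances with a parameter `p`. This is an assumption made for illustration: the text does not say which generalized norm the team actually used, and the function name `minkowski_distance` is ours.

```python
def minkowski_distance(x, y, p=2.0):
    """Minkowski distance between two equal-length vectors.

    p = 2 gives the standard Euclidean distance and p = 1 the
    Manhattan distance. For 0 < p < 1 the result is no longer a
    true norm, but it can still be used as a dissimilarity.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if p <= 0:
        raise ValueError("p must be positive")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Two hypothetical signal-strength readings from two access points.
reading_a = (0.0, 0.0)
reading_b = (3.0, 4.0)
d_euclidean = minkowski_distance(reading_a, reading_b, p=2.0)
d_manhattan = minkowski_distance(reading_a, reading_b, p=1.0)
```

Different values of `p` produce differently shaped equidistance curves around a point, so the same pair of readings can be relatively near under one choice and relatively far under another.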

For similarity along the time axis, we utilized the fact that even when the signal strength values change abruptly, two instances can be considered similar if they are next to each other on the time axis. With these two definitions, we achieved large improvements in accuracy on the given task.
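
One possible way to combine the two kinds of similarity is sketched below. It is a hypothetical construction, not the team's published formula: we assume a Gaussian kernel on the Euclidean distance for the data space, and we assume that instances adjacent in time receive at least a fixed similarity `time_weight`. The names `combined_similarity`, `sigma`, and `time_weight` are ours.

```python
import math


def combined_similarity(series, sigma=1.0, time_weight=1.0):
    """Build a similarity matrix from data-space and time-axis information.

    series: list of signal-strength vectors, ordered by time.
    sigma:  width of the Gaussian kernel in the data space (assumed).
    time_weight: minimum similarity given to time-adjacent instances (assumed).
    """
    n = len(series)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Data-space similarity: close readings get a value near 1.
            squared = sum((a - b) ** 2 for a, b in zip(series[i], series[j]))
            sim = math.exp(-squared / (2.0 * sigma ** 2))
            # Time-axis similarity: neighbours in time stay connected
            # even if the signal strength jumped between them.
            if j - i == 1:
                sim = max(sim, time_weight)
            weights[i][j] = weights[j][i] = sim
    return weights


# Hypothetical readings with an abrupt change between index 1 and index 2.
toy_series = [(-60.0, -70.0), (-61.0, -70.0), (-90.0, -40.0), (-89.0, -41.0)]
toy_matrix = combined_similarity(toy_series, sigma=5.0, time_weight=1.0)
```

In this toy example, instances 1 and 2 differ greatly in signal strength but remain strongly connected because they are adjacent in time, while instances 0 and 2 are neither close in the data space nor adjacent in time and so are almost unconnected.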
