Growing needs for analyzing data from physical sensors are driving new themes in machine learning and data mining technologies. We are focusing on geospatial-temporal analysis among other kinds of data analytics for sensor data. For example, geospatial-temporal analytics can analyze the movement patterns of objects that are being tracked on maps. Now this same kind of technology is being applied in various new forms of marketing. Our group is working on theoretical approaches for geospatial-temporal analytics (using machine learning and optimization techniques) and also on approaches using simulations. In this webpage, we introduce some of our recent contributions.
Trajectory Regression
One of the main theme of geospatial-temporal analytics is to analyze the trajectories of mobile objects. Trajectory data is generally not i.i.d. (independent and identically distributed) data; which is a standard working assumption in most statistical approaches. This makes the trajectory data even more interesting as part of recent data mining research.
One useful approach is to expand the typical tasks in the data mining technology so that they can be applied to trajectory data. We proposed a theoretical approach for trajectory regression (in Ide-Kato SDM 2009, and in Ide-Sugiyama AAAI 2011). In these studies, we expanded regression analysis beyond i.i.d. samples to handle vector data. There are challenges in creating mathematically consistent trajectory regression theories for general trajectories, but we began with a practical solution for trajectories on networks (see Fig. 1).
Fig.1 - trajectory regression
A typical application of trajectory regression in networks is to predict the travel times and trajectories, even though networks based on real-world maps have enormous numbers of possible trajectories, which makes the trajectory regression problem ill-posed. We solve these problems using graph Laplacian technique based on the propagation of information within the networks.
Trajectory regression can be applied not only to those trajectories as defined by maps, but to various collections of ordered entries.
We are now working on the theory of trajectory regression, which will lead to new applications.
Geospatial-Temporal Analytics for Location Recommendation
Location recommendation is now a hot topic due to the emergence of location-based services and the widespread adoption of mobile devices with GPS and location logging capabilities. Recommendation systems based on techniques that incorporate social relationships (such as friendship relations in the social network data) into collaborative filtering algorithms have been developed for improving the accuracy of location recommendation services. However, only a few of them exploit the geospatial-temporal correlations among the locations.
We propose a new approach for location recommendation using the Link Propagation principle to exploit the geospatial-temporal correlations among the locations for various collaborative filtering algorithms. Link Propagation is one of the state-of-the-art semi-supervised learning algorithms for link prediction in graphs based on well known label propagation approach that exploits information about the similarities of links and nodes. This was first proposed in by our research group (in Kashima et al., Proc. of the 9th SIAM International Conference on Data Mining, 2009) and then was extended to cope with larger datasets (in Raymond and Kashima, Proc. of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2010).
At a high level, while the collaborative filtering is based on the assumption that a user will likely visit places visited by friends (or similar users under various metrics), the Link Propagation principle is based on the assumption that a user will most likely visit places near previously-visited places. In our approach, the Link Propagation principle is combined with the scores from the collaborative filtering algorithms as a minimization problem. Its solution corresponds to scores that reflect both the social relationships among users (from the collaborative filtering part) and the geospatial correlations among locations (from the link propagation part). In other words, the solution is expected to guide a user not only to locations visited by his friends, but also to locations near those previously visited either by him or his friends. The similarities of the locations can be easily determined from the geospatial data, e.g., by using the inverses of the distances among the locations.
We used the new location recommendation for a bus system to predict the origins, destinations, and desired pick-up and arrival times of its riders. The prediction accuracy of various collaborative filtering schemes can be boosted by making use of the geospatial-temporal similarities (in Raymond et al., Proc. the 19th ACM SIGSPATIAL International Symposium on Advances in Geographic Information Systems, 2011).
References:
- Hisashi Kashima, Tsuyoshi Kato, Yoshihiro Yamanishi, Masashi Sugiyama and Koji Tsuda: Link Propagation: A Fast Semi-supervised Learning Algorithm for Link Prediction, In Proc. the 9th SIAM Conference on Data Mining (SDM), pp. 1099-1110, Sparks, Nevada, USA, 2009.
- Rudy Raymond and Hisashi Kashima: Fast and Scalable Algorithms for Semi-supervised Link Prediction on Static and Dynamic Graphs, In Proc. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pp.131-147, Barcelona, Spain, 2010.
- Rudy Raymond, Takamitsu Sugiura, Kota Tsubouchi: Location recommendation based on location history and spatio-temporal correlations for an on-demand bus system. In Proc. the 19th ACM SIGSPATIAL International Symposium on Advances in Geographic Information Systems, ACM GIS 2011: 377-380
Link Cost Prediction
Geospatial-temporal analytics can be applied to risk-sensitive dynamic decisions. In transportation applications involving risk-sensitive decision processes, we often need probability distributions for the travel time on each link of a road network. The travel-time distributions can be estimated from actual probe-car data, but usually only a small amount of probe-car data is available. We need algorithms that can accurately predict times for all of the links (even for links where there is no data) while being computationally efficient and scalable to large network.
We believe that risk-sensitive decisions based on overly strict assumptions (such as Gaussian or log-normal travel-time distributions) are not realistic, because they do not include appropriate risk metrics. Our challenge is to develop the efficient computational algorithms for realistic travel-time distributions for all of the links in a given road network.
We are solving this challenging problem by using (i) a sparse and globally-optimal nonparametric estimator to assign a basis function (which is modeled by a mixture of gamma or log-normal probability density functions) for each link having at least two samples, (ii) an interpolation of each basis function with an approximate diffusion kernel, and (iii) a sparse and globally-optimal optimization of the link importance weights. Our algorithm has significantly higher prediction performances than the previous regression methods(in Takahashi et al., Proc. of the 12th SIAM International Conference on Data Mining, 2012). In addition, we showed that the inherent optimization with the sparse nonparametric methods has the same form as the optimization in convex clustering, and proposed an efficient Sequential Minimal Optimization (SMO) algorithm (in Takahashi, Statistical Analysis and Data Mining, 2011).
For large datasets whose distributions are far from Gaussian, the new sparse nonparametric estimator and its fast optimization has good prediction performance for real phenomena, is stable for the fitting parameters, and is computationally efficient. Beyond our application in transportation research, we believe this algorithm will work for many real-world applications with complex datasets.
Fig. 2 -
Fitting basis density functions for the nonparametric estimator (1st),
and analyzing the gradient of the objective function in convex clustering (2nd-4th)
