Project overview
In today's IT infrastructure, which consists of many distributed servers with complex interdependencies, detecting a problem and finding its root cause is often difficult. Many problems cannot be detected by looking at individual computers. Self-healing, one of the key objectives of a self-managing computing system, addresses such problems. Highly available distributed systems can be created by autonomously monitoring the systems, detecting problems, analyzing their root causes, and selecting the best remedial actions. Problem determination is the technology that provides the foundation for a self-healing system. We are developing a problem determination system that monitors network data to effectively detect problems involving multiple servers and take appropriate remedial actions. Our ultimate goal is a self-healing distributed system.
Research items
We are tackling the following research areas to develop problem determination technologies leading to self-healing systems.
Dependency discovery
Our problem determination system monitors the data flowing through a network and discovers the dependencies within a system. This is done automatically by applying data mining and machine learning techniques to data describing the application flow. In complex distributed computing systems, this dependency information is very useful for understanding the behavior of a system or application. In addition, because the data is captured from the network, the monitoring places no additional load on the monitored systems.
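As a rough illustration of how dependencies might be inferred from network flow data, the following minimal sketch uses a simple co-occurrence heuristic: if requests arriving at a server are usually followed, within a short time window, by a connection from that server to another one, the second server is treated as a dependency of the first. This is not the project's actual algorithm. The Flow record, the discover_dependencies function, the window and min_conf parameters, and the host names are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Flow(NamedTuple):
    """One observed network flow (hypothetical record layout)."""
    ts: float   # start time in seconds
    src: str    # host that opened the connection
    dst: str    # host that received the connection


def discover_dependencies(flows, window=0.5, min_conf=0.8):
    """Return {(server, downstream): confidence}.

    Confidence is the fraction of inbound flows to `server` that were
    followed within `window` seconds by a flow from `server` to
    `downstream`. Pairs below `min_conf` are dropped as noise.
    """
    flows = sorted(flows)
    out_by_src = defaultdict(list)
    for f in flows:
        out_by_src[f.src].append(f)

    inbound = Counter()    # inbound flows seen per server
    followed = Counter()   # inbound flows followed by a call to a downstream host
    for f in flows:
        inbound[f.dst] += 1
        downstream = {
            g.dst
            for g in out_by_src.get(f.dst, [])
            if f.ts <= g.ts <= f.ts + window and g.dst != f.src
        }
        for d in downstream:
            followed[(f.dst, d)] += 1

    return {
        pair: count / inbound[pair[0]]
        for pair, count in followed.items()
        if count / inbound[pair[0]] >= min_conf
    }


# Illustrative data: three requests pass through web -> app -> db,
# plus one unrelated flow from web to a mail host.
flows = []
for t in (0.0, 1.0, 2.0):
    flows += [
        Flow(t, "client", "web"),
        Flow(t + 0.1, "web", "app"),
        Flow(t + 0.2, "app", "db"),
    ]
flows.append(Flow(1.2, "web", "mail"))

deps = discover_dependencies(flows)
print(deps)
```

In this sketch the one-off flow from web to mail falls below the confidence threshold and is discarded, while the web to app and app to db edges are kept. A real system would need more robust statistics to cope with concurrent requests and shared services.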
Failure detection
Our system monitors the dependencies between computers, which it obtains by analyzing data about flows in the network. We are studying techniques that apply time-series analysis to the critical points of these dependencies so that anomalies are detected before a problem becomes fatal. This makes it possible to analyze root causes and perform healing actions before a failure occurs. In addition, because our problem determination system monitors the health of the entire system, it can detect failures that cannot be detected by looking at individual computers.
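To make the idea of time-series anomaly detection on a dependency more concrete, here is a minimal sketch that tracks an exponentially weighted moving average (EWMA) of a metric and flags samples that deviate from it by more than a few standard deviations. The choice of EWMA, the ewma_anomalies function, its parameters, and the sample response times are assumptions for illustration only; the text does not specify which time-series method the project uses.

```python
import math


def ewma_anomalies(values, alpha=0.3, threshold=3.0, warmup=5):
    """Return indices of samples that deviate from the running EWMA
    baseline by more than `threshold` standard deviations.

    The first `warmup` samples only train the baseline and are never flagged.
    """
    mean = values[0]
    var = 0.0
    anomalies = []
    for i, x in enumerate(values[1:], start=1):
        std = math.sqrt(var)
        if i >= warmup and std > 0 and abs(x - mean) > threshold * std:
            anomalies.append(i)
            continue  # keep the outlier out of the baseline
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return anomalies


# Hypothetical response times (ms) observed on one dependency, e.g. web -> app.
response_ms = [100, 102, 98, 101, 99, 100, 102, 98, 250, 101]
flagged = ewma_anomalies(response_ms)
print(flagged)
```

Running one such detector per discovered dependency, rather than per server, is what would let slowdowns that span several machines show up even when each machine looks healthy on its own.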
Examples of detectable failures
- Abnormalities at the server level
- Partial stoppage and slowdowns of service
- Abnormalities at the application level
- Anomalous transaction terminations
- Abnormalities of the application flow caused by, for example, unauthorized access