Real-time Active Inference and Learning (RAIL)

The objective of this project is to develop efficient techniques for real-time inference (diagnosis and prognosis) and learning (model adaptation to system changes) in complex distributed systems. An important property that differentiates our approach from "passive" data analysis is its ability to actively select and execute tests and measurements online, for more cost-efficient reasoning and learning. (For example, one of our applications uses adaptive selection of probes, or test transactions, for real-time fault diagnosis in distributed computer systems.) Another important feature of our approach is its focus on achieving flexible and controllable trade-offs between multiple objectives, such as the accuracy of diagnosis versus its computational complexity and the cost of obtaining the information. (In the probing application, for example, probes provide valuable information about the unknown system state but may also increase network load and data maintenance costs. A further trade-off must be struck between the time required to diagnose a problem and the accuracy of the diagnosis, which requires the use of approximation algorithms.) A short project description is also provided here.

RAIL

As distributed computer systems and networks continue to grow in size and complexity, tasks such as real-time fault localization and problem diagnosis become significantly more challenging. As a result, more sophisticated tools are needed to assist with these management tasks, both by responding quickly and accurately to the ever-increasing volume of system measurements and by actively selecting the minimum number of most-informative tests to run. In other words, a "passive" data-mining approach must be replaced by an active, real-time information-gathering and inference system that can "ask the right questions at the right time". Moreover, the focus on the cost-efficiency and scalability of real-time problem diagnosis is particularly important for making it useful in extremely large, geographically distributed (GRID) computing systems, and in the face of new technological challenges related to autonomic computing, IBM's vision for new-generation IT systems capable of self-management and self-repair.

We are currently developing methods and a system for real-time, active problem diagnosis using probing technology. A probe is a command or a transaction (e.g., a ping or traceroute command, an email message, or a web-page access request) sent from a particular machine, called a probing station, to a server or a network element in order to test a particular service (e.g., IP connectivity, database access, or web access). A probe returns a set of measurements, such as response times, a status code (OK/not OK), and so on. Probing technology is commonly used to measure the quality of network performance, often motivated by the requirements of service-level agreements (SLAs). Examples of probing technology include the IBM T.J. Watson EPP technology [1] and the Keynote product [2]. However, using probing for real-time problem diagnosis (and prognosis) appears to be an open area. We address these problems using methods from artificial intelligence and call the resulting approach Intelligent Probing.
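To make the idea of actively selecting probes more concrete, here is a minimal Python sketch. It is not the project's actual algorithm. It assumes a deliberately simplified setting: exactly one component is faulty, every component is equally likely to be the faulty one, and a probe returns "not OK" exactly when its path includes the faulty component. The probe names, the component names, the dependency map `PROBES`, and the helpers `expected_entropy`, `diagnose`, and `simulate` are all hypothetical. At each step the sketch sends the probe whose outcome is expected to leave the least uncertainty (entropy) about the fault location.

```python
import math
from typing import Callable, Dict, FrozenSet, List, Tuple

# Hypothetical dependency map (illustration only): each probe name maps to the
# set of components that the probe's transaction passes through.
PROBES: Dict[str, FrozenSet[str]] = {
    "p1": frozenset({"A", "B"}),
    "p2": frozenset({"B", "C"}),
    "p3": frozenset({"C", "D"}),
}
COMPONENTS: FrozenSet[str] = frozenset({"A", "B", "C", "D"})


def expected_entropy(candidates: FrozenSet[str], path: FrozenSet[str]) -> float:
    """Expected entropy (in bits) of a uniform belief over `candidates`
    after observing the outcome of a probe that traverses `path`."""
    n = len(candidates)
    failing = len(candidates & path)  # candidates that would make the probe fail
    return sum((k / n) * math.log2(k) for k in (failing, n - failing) if k > 0)


def diagnose(
    probes: Dict[str, FrozenSet[str]],
    components: FrozenSet[str],
    run_probe: Callable[[str], bool],
) -> Tuple[FrozenSet[str], List[str]]:
    """Greedy active diagnosis under the single-fault assumption.

    `run_probe(name)` returns True if the probe came back OK.
    Returns the remaining fault candidates and the probes sent, in order."""
    candidates = frozenset(components)
    remaining = dict(probes)
    sent: List[str] = []
    while len(candidates) > 1:
        # Only probes that split the candidate set can reduce uncertainty.
        useful = sorted(
            name for name, path in remaining.items()
            if 0 < len(candidates & path) < len(candidates)
        )
        if not useful:
            break  # no remaining probe can narrow the diagnosis further
        best = min(useful, key=lambda name: expected_entropy(candidates, remaining[name]))
        path = remaining.pop(best)
        sent.append(best)
        if run_probe(best):
            candidates = candidates - path  # OK: the fault is not on this path
        else:
            candidates = candidates & path  # not OK: the fault is on this path
    return candidates, sent


def simulate(fault: str, probes: Dict[str, FrozenSet[str]]) -> Callable[[str], bool]:
    """Simulated probing station: a probe is OK iff it avoids the faulty component."""
    return lambda name: fault not in probes[name]
```

For instance, `diagnose(PROBES, COMPONENTS, simulate("C", PROBES))` sends `p1` and then `p2` and narrows the candidates to component `C`, without ever sending `p3`. A realistic system would also have to handle noisy probe results, multiple simultaneous faults, and differing probe costs, which this sketch ignores.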