Photo
Real-time Active Inference and Learning  (RAIL)

Intelligent Probing

The use of probing technology for real-time diagnosis requires addressing two issues: (1) a planning phase in which a set of  probes is selected and scheduled,   followed by (2) a diagnosis phase in which problem determination is performed using the results of both scheduled probes and probes adaptively selected on-line if required. 

Probing

Phase 1 (planning): probing stations must first be selected at one or more locations in the network. Then the probes must be configured; it must be decided which network elements to target and which station each probe should originate from. Using probes imposes a cost, both because of the additional network load that their use entails and also because the probe results must be collected, stored and analyzed. Cost-effective diagnosis requires a small probe set, yet the probe set must also provide wide coverage, in order to locate problems anywhere in the network. By reasoning about the interactions among the probe paths, we construct an estimate of the  information gain  provided by each probe, and use this estimate as a probe selection heuristic. This yields a quadratic-time greedy-search algorithm which finds near-optimal probe sets. We also implement a linear-time algorithm which can be used to find small probe sets very quickly; a reduction of almost 50% in the probe set size is achieved. The results are reported in [3,4]. Moreover, in [5] we also provide some theoretical bounds on the diagnosis error and derive necessary conditions on the number of probes required for an asymptotically error-free diagnosis (using  an anlogy with the noisy channel coding). A short summary of the results is also available in [6]. 

A related aspect of planning phase is selection of probe frequency in order to ensure a certain estimation accuracy for the 
key SLA parameters, such as total "down-time".  This is a nontrivial problem because a distributed system is dynamic, and thus the services and network performance are stochastic. In this work we study how to achieve accurate performance estimation (i.e. both unbiased and robust) by integrating temporal information (e.g. sending probes more frequently), and 
using correlations from multiple probes.  Our  initial analysis in provided in [7]. 

Phase 2 (real-time inference):  once the probes have been selected and sent, fault diagnosis is performed by analyzing the probe outcomes. In real-life scenarios this must be done in an environment of noise and uncertainty. For example, a probe 
can  fail even though all the nodes it goes through are OK (e.g., due to packet loss). Conversely, there is a chance that a 
probe succeeds even if a node on its path has failed (e.g., dynamic routing may result in the probe following a different path). Thus the task is to determine the  most likely configuration of the states of the network elements.  We use the graphical framework of Bayesian networks, where probe results correspond to observed nodes (evidence), while the problems to be diagnosed are represented by hidden nodes;  in order to represent temporal dependencies, we plan to extend the model to a k-order Dynamic Bayesian Network. Since exact diagnosis in very large networks is intractable, we investigate approximation techniques which  provide user-controllable  accuracy-efficiency trade-offs [8]. Our empirical studies show that these approximations ''degrade gracefully'' with noise and often yield a nearly-optimal solution when noise is low enough; our initial theoretical analysis explains some aspects of such behavior [5]. 

Finally, we consider the issue of learning (or updating) a Bayesian network given data in order to make models adaptive to intermittent failures, dynamic routing, and other non-stationarities in the network state and behavior. A part of our  work addresses a generic problem of choosing appropriate model-selection criteria that allow learning models that not only fit data well, but also have other important properties such as computational simplicity of inference (since in general, inference in BNs is NP-hard) [9], and sensitivity of probabilistic query to errors in parameter estimates [10]. Our most current focus of our work is on: 

  • active diagnosis - i.e. adjusting the probe set dynamically to improve diagnosis;
  • extending  local approximation techniques to incremental, real-time scenarios;
  • handling intermittent failures, dynamic routing, and other non-stationarities in the network state and behavior using on-line learning;
  • active learning using flexibility in probe selecion.
In summary,  our research can provide value to customers in at least the following two aspects. First, it provides solutions needed in the planning phase of probe deployment, such as determining the appropriate trade-offs for probe placement and deployment, as well as  means for better estimating SLA compliance based on probe information. Second,  it provides a means for diagnosing problems in the network. Compared with existing approaches (e.g. SMARTS[11]), we consider the uncertainties resulting from network traffic variation, missing information, and asynchronous probe measurements. We also incorporate prior knowledge so as to detect problems with high accuracy and confidence.