IBM®
Skip to main content
    Country/region [change]    Terms of use
 
 
 
    Home    Products    Services & solutions    Support & downloads    My account    

IBM Systems Journal

Real-Time and Event-Based Systems   Volume 47, Number 2, 2008
Table of contents: HTMLPDF This article: HTML PDFDOI: 10.1147/sj.472.0265Copyright info

Pulsar: A resource-control architecture for time-critical service-oriented applications

by M. Astley,
S. Bhola,
M. J. Ward,
K. Shagin,
H. Paz,
and G. Gershinsky

The complexity of real-time systems is growing extremely rapidly, as they move from isolated devices to multilevel networked systems. Traditional methodologies for developing and managing these systems are not scaling to meet the requirements of a new generation of distributed applications. While developers of complex real-time applications are looking to service-oriented architecture to address their needs for ease of development and flexibility of integration, current software infrastructures for service-oriented applications do not address the issue of predictable latency for the applications they host. In this paper, we present Pulsar, a resource-control architecture for managing the end-to-end latency of a set of distributed, time-critical applications. The primary entity of Pulsar is called a controller, which regulates an aspect of resource allocation or scheduling policy. Controllers utilize policy configurations, which may include latency targets to be achieved or resource allocations to be honored, and interact with resource allocators and schedulers (e.g., thread schedulers, memory allocators, or bandwidth reservation mechanisms) to effect local policy. Controllers also provide feedback on how well they are executing a policy. Pulsar includes an application model which captures resource-sensitive behavior and requirements and is independent of high-level programming models and application programming interfaces.

Introduction

As real-time systems become increasingly complex and accommodate a new generation of distributed applications, traditional methodologies for developing and managing these systems are not scaling to meet the requirements of these applications. While service-oriented architecture (SOA) offers an approach to addressing the needs of developers of complex real-time applications with respect to ease of development and flexibility of integration, the issue of predictable latency for these SOA applications is inadequately addressed by current software infrastructures.

Real-time applications which are deployed over such infrastructures have several key features:

  • Distributed resources. Components of an application may be distributed, and may compete with other distributed applications for resources such as CPU, networking bandwidth, and memory;

  • Mixed requirements. The set of applications may include hard real-time (i.e., those in which deadlines must be met and all events must be handled) and soft real-time applications, as well as non-real-time workloads;

  • Dynamic loads. Workloads may be event-driven and can vary significantly over time; and

  • Dynamic resources. The set of available resources (both on the server and the network) may change over time due to failures or other reasons.

For example, a trade execution application for a stock exchange may have components which are distributed (for scalability and fault-tolerance) over a cluster of servers. The application is responsible for providing timely data from the market, as well as accepting and executing orders to buy or sell securities. The load experienced by the application is heavily influenced by trade volume, which varies both predictably (e.g., due to the trade backlog at market open) and unpredictably (e.g., due to breaking news). In addition, the application may service a variety of clients with different requirements and importance. Large investment banks, for example, are willing to pay a high premium to ensure a bounded latency on trade execution. Individual investors, on the other hand, may settle for a low-cost, best-effort service.

Satisfying the timeliness requirements for such an application is a challenging problem. For example, the dynamic load and availability of the system prohibit traditional techniques such as static allocation and scheduling. Instead, the system must be able to shift resource allocation spontaneously as needed. Likewise, the system must optimize resource allocation according to differing service level and latency requirements. For example, clients may express their requirements as a service level agreement (SLA), which quantifies the importance and flexibility of meeting requirements at different service levels. SLAs can be used to allocate resources in a manner which optimizes overall benefit to clients. Finally, scalable and fault-tolerant control mechanisms are needed to minimize overhead and provide resiliency in the face of server and network failures.

In this paper, we present Pulsar, a resource-control architecture for managing the end-to-end latency of a set of distributed, time-critical applications. Pulsar applications are described using a programming model that captures resource-sensitive behavior and requirements that are independent of high-level programming models and application programming interfaces (APIs). In particular, timeliness requirements are modeled as utility functions which map utility as a function of end-to-end latency. Depending on the shape of the utility function, the system is able to make trade-offs between total utility and available resources. These trade-offs are reflected in resource-control policies which are distributed among the nodes of the system.

Resource-control policies are constructed and enforced using controllers, which are arranged in a hierarchy in order to regulate various aspects of resource allocation or scheduling policy. Top-level controllers establish overall resource-control policies based on utility trade-offs. These policies are passed to “child” controllers and may include latency targets which must be achieved or resource allocations which must be honored. Intermediate controllers (e.g., one per server) translate policies received from parent controllers into policies delegated to child controllers. At the lowest level, controllers interact directly with resource allocators and schedulers, e.g., thread schedulers, memory allocators, or bandwidth reservation mechanisms, in order to effect local policies.

In this paper, we present: (1) a distributed, hierarchical control mechanism for distributed real-time applications; (2) an abstract model of resources which allows high-level control of these applications without exposing explicit resource-control mechanisms (e.g., resource pooling and sharing); (3) a framework for utility-based optimization which maximizes total system utility based on a novel intermediate model of application requirements; and (4) a method for online feedback and error correction to improve system performance.

The paper is structured as follows. We present a programming model for real-time applications in the next section. This is followed by a description of an architecture which supports this model. We demonstrate our approach by way of a distributed power-line fault detection application which we describe in the section “Scenario: Power-line fault detection.” In the section “Evaluation,” we evaluate the performance of our architecture in this scenario. We review related work in the subsequent section, followed by our conclusion.

Programming model

Whereas real-time applications typically target a specific high-level programming model, such as the Real-time CORBA** specification1 or the Real-time Specification for Java**,2 our techniques are focused on low-level resource management mechanisms which are often independent of high-level APIs. As a result, in this paper we describe applications in terms of a low-level model which captures resource-sensitive behavior and requirements. We view our model as a possible intermediate target for high-level APIs and languages, including those that facilitate building service-oriented applications like  Web Services Business Process Execution Language  (WS-BPEL)3 and Service Component Architecture (SCA).4

Applications with real-time deadlines are defined in terms of jobs, job sets, and job flows. A job is a discrete unit of computation that is localized on a node in the system. Jobs may execute concurrently, and all jobs are preemptible. Constraints among jobs (e.g., mutual exclusion) may be specified in job sets as described below. Jobs are organized into job sets consisting of a job list (a list of jobs contained in the set), a dependency graph (a directed acyclic graph [DAG] with a unique root, one or more leaf nodes, vertices which represent jobs, and directed edges which represent dependencies between jobs), and trigger events (a set of external events which cause the job set to be executed).

The unique root of a job set is called the start job and the leaf nodes are called end jobs. An edge in a job set dependency graph is either a local edge or a remote edge, according to whether the linked jobs execute on the same node or different nodes. Dependencies between jobs indicate ordering constraints, conditional paths, mutual exclusion, or communication. We view job sets as types, instances of which are created when the system receives an appropriate triggering event and agrees to schedule a start job for execution. Due to conditional paths, the jobs executed for a given instance of a job set may differ from other instances of the same job set.

In terms of execution, a job set instance defines a simple data flow which starts when the start job receives a trigger event. Edges carry discrete information in the form of events, and a job in a job set is eligible for dispatch when all inbound edges contain an event. When a job completes execution, it may generate an event on each outgoing edge. Events are used for modeling purposes and may not always have a physical realization. For example, the implementation of edges used to indicate ordering, conditional execution, or mutual exclusion need not transmit actual events. Communication edges, in contrast, transmit events representing data.

A job flow is a sequence of job set instances with the same timeliness requirements (e.g., a specific deadline for each job set). At run time, each job flow receives a stream of trigger events which cause the appropriate job set instances to be instantiated and scheduled for execution.

Figure 1 shows an example of the application model. In the figure, we illustrate the history of a particular job flow (F1) which shows the release time and execution time of multiple job set instantiations (two instantiations of job set type S1 and one of S2.) The instantiations of S1 differ in their job execution paths (e.g., due to conditional execution.) S2 has the same timeliness requirements as S1 and is included in the same flow. The view shown in the figure is a history because the job set rendering shows the actual jobs executed, not all possible paths. The latter may be different due to conditional execution. For clarity, we have shown job set executions which are non-overlapping in time, though there is no such restriction in the model.

Figure 1  Application model example Figure 1

This application model embodies the main features of service orientation. Jobs naturally map to services and are not tied to any implementation technology. Communication between jobs in a job set may be through procedure invocation, either local or remote, or by message passing. The model supports both request-response and one-way message exchange patterns; one-way message exchange may be one-to-one or one-to-many (we are investigating extensions to support many-to-one message exchange, if this is necessary.) We have validated this model with the WS-BPEL and SCA standards, two leading methods for specifying service-oriented applications. The Pulsar model fully supports the SCA Assembly Model for aggregating components and linking them through wiring. WS-BPEL specifies a rich set of structured activities for prescribing the order in which a collection of activities is executed. The current Pulsar model does not support the <while> and <repeatUntil> structured activities; we are investigating ways to support restricted cycles in the dependency graph if time-critical applications present such requirements.

Timeliness requirements

Application timeliness requirements are specified using a utility function, which is a non-increasing function that maps job set latency to a utility value. The maximum allowable latency may be limited by a critical time beyond which latency may not extend, regardless of utility. Thus, critical time is analogous to a deadline.

Utility functions are a generalization of simple deadlines where, in addition to defining absolute constraints (e.g., latency must not exceed the critical time), the shape of the function can be used to derive trade-offs between latency (i.e., resource allocation) and benefit to the application. Our goal is to satisfy all application deadlines (i.e., critical times) while maximizing utility.

The latency (and hence utility) of a job set depends on the latency experienced by individual jobs, which further depend on resource allocation and may vary according to application parameters. Job flows are expected to define properties which help to determine the latency for jobs (e.g., worst-case or average-case execution time), as well as other resource requirements (e.g., network bandwidth). Such properties could be derived from or corrected by runtime measurements. We can combine these specifications (including trigger-event specifications) with a model of resources to derive the predicted latency for a job.

In the case of soft real-time flows, it is often inefficient to specify performance as a function of worst-case execution since worst-case latency is often much larger than average latency. Our programming model allows flow latency to be based on particular latency samples (e.g., the 95th percentile). Likewise, our implementation uses these samples for monitoring and fine-tuning resource allocations.

Analysis

The analysis of job flow and resource allocation combines models of job flow latency and resource utilization to maximize a system scheduling objective. In our system, the scheduling objective is to maximize the sum of the utilities of the job flows, while ensuring that the latency of all job set instances remains below the critical time for the job flow it belongs to.

To perform scheduling analysis, the following information must be extracted from the application model: the timeliness requirements (i.e., utility function) and dependency graph for each job set; the inter-release time of the start jobs for job set instances in each flow; the probability of observing particular dependency graph instances for each job set (this implies knowing the probability of following each conditional branch in the dependency graph); and the execution time (e.g., worst-case execution time) for each job.

It may not be possible to extract such information from all applications. In extreme cases, it may be necessary to estimate application behavior from an online model. Our system performs a limited form of such estimation by way of error correction. However, more-complete online modeling is being considered for future work. Likewise, we consider only deterministic job dependency graphs in this paper; graphs with probabilistic branches are being considered for future work.

A latency model for a flow can be constructed by considering the latency of each job set. The latency of a job set, in turn, can be modeled according to the latencies due to jobs and edges. Note that both kinds of latencies include queuing delays of various kinds (communication buffer queues, thread scheduler queues, etc.). However, latency at a non-communication local edge is assumed to be zero.

The latencies of jobs and edges are affected by resource allocations, such as CPU share, bandwidth, and memory. In our approach, we define node-specific resource models which describe the effect of a particular allocation on latency. For example, increasing CPU share for a particular job will decrease its latency but may increase latency for concurrent jobs, since less of this resource is available.

Depending on how the latency models are constructed, it may be possible to perform a cost-versus-benefit analysis which leads to an optimal allocation strategy. We illustrate this technique using a worst-case latency model in the section “Evaluation.”

Architecture and implementation

Figure 2 presents a high-level view of the Pulsar system architecture. This architecture consists of three main components: the controller hierarchy, job execution, and monitoring and feedback. The controller hierarchy determines optimal resource allocations from node-level models of resource utilization. The job execution layer executes jobs and updates local models to correct for modeling error or other performance differences. We describe each of these components in more detail in the following.

Figure 2  High-level view of Pulsar system architecture Figure 2

Controller hierarchy

The controller hierarchy subdivides the task of global resource allocation between a node controller (one for each node), which utilizes specific models of resource requirements, and a global controller, which combines resource models with latency and flow utility models to determine appropriate resource allocations (see the section “Programming model”). The global controller does not need to be centralized. For example, a global controller may be created per flow with each controller interacting with the others to derive resource allocations.

Global controllers determine resource policies (i.e., specific resource assignments for each node) and forward these policies to each of the node controllers. Conversely, node controllers use local monitoring to improve local resource models and periodically forward these models to each global controller. The structure of the models and the mechanisms used by the global controllers to determine allocations may vary widely; this is beyond the scope of this paper. However, we consider an explicit example of this approach in the section “Evaluations.”

In the current architecture, the node-level latency models describe latency as a function of particular CPU and network share allocations. That is, we assume some form of a proportional share (PS) scheduling5,6 mechanism but do not require a particular implementation. These models vary according to the underlying resource and modeling error. For example, speed differences in CPUs or the choice of different share scheduling implementations will yield slightly different models. The node controllers attempt to compensate for modeling error by monitoring performance and incorporating error correction into the models.

Node controllers perform the additional function of enforcing resource allocations according to the policies provided by global controllers. In the current architecture, resource allocations are enforced by directing the kernel to assign shares to job threads and network links (see the section “Implementation details”).

Job execution

Job execution is handled by two components. The Job Execution Environment (JEE) provides a container for the job flows described by the application model. The TransFab component augments the JEE by providing access to network resources when required by job set dependencies. The JEE uses an event dispatch model to implement the facilities required by the application model.

At system start-up, an object is instantiated for each job in each job set which may execute in a job flow. Each job object is associated with an event queue. The job objects encapsulate the application code to be executed for the job, and the event queue stores events which represent dependent interactions in a job set instance. Each job object is uniquely associated with a job set type (i.e., job objects are never shared between job set types).

In the current implementation, each job object is also associated with a dedicated thread having a share which is assigned according to the current policy. The events on the event queue are used to dispatch these threads in a serial fashion. The thread is suspended if the event queue is empty.

At run time, a job set instance is represented by its triggering event, which is queued in the event queue of the appropriate start job object. A job set instance is viewed as “released” when its triggering event has been enqueued. During execution, a job object may produce at most one event on each outgoing link, as determined by the job dependency graph. When a job object completes, each such event is enqueued on the appropriate downstream job object queue. If the downstream object queue resides on a different node, then Transfab is invoked to forward the event to that node.

Monitoring and feedback

The observed latency of job flows may differ from that predicted by our models due to error induced by model abstractions and slight variations in run-time performance. Our architecture attempts to compensate for these errors by monitoring performance at the node level and computing smoothed additive error terms for the latency models.

The monitoring mechanisms and computation of error are heavily dependent on the form of the latency models. For example, if latency models are worst-case models, not all latency samples will be worst-case, and filtering will be necessary so that only representative samples are used to adjust the model. Likewise, error correction depends on where samples are measured and what dependent variables may be adjusted to influence error. We present an explicit example of monitoring and error correction in the section “Evaluations.”

Implementation details

We have implemented our architecture in Java executing in a Real-Time Specification for Java (RTSJ) implementation with some extended capabilities. Figure 3 illustrates the software stack. Pulsar executes within an RTSJ implementation running atop Linux** with RT support. Applications are executed directly by Pulsar. Certain resource management operations (e.g., CPU scheduling) require a JNI** (Java Native Interface) library, which invokes system calls directly.

Figure 3  Pulsar software stack Figure 3

Application code consists of a specification file, which defines flow properties (i.e., jobs, job sets, and job flows), and a set of classes, each of which implement jobs or events used in the application. The application code is executed by the Pulsar layer, which interprets the specification file and instantiates and dispatches application classes as necessary. The Pulsar layer also facilitates communication between jobs (e.g., through communication edges in job sets).

The Pulsar layer executes atop the IBM WebSphere* Real Time,7 which contains an RTSJ implementation and a customized Linux environment. The RTSJ implementation includes augmented real-time support facilities such as the Metronome8 deterministic garbage collector. This layer, in turn, executes atop the IBM Real-Time Linux (IBM-RTLinux) implementation, which is a customized derivative of Red Hat** Enterprise Linux. IBM-RTLinux offers capabilities such as priority inheritance and direct access to physical memory.

While most of the functionality required by Pulsar is available in RTSJ, we found it necessary, for efficiency, to implement the surplus-fair-share scheduler at the OS (operating system) level. Our implementation is based on the one described in Reference 9, with modifications to enable a mix of priority, share, and non-RT threads. In particular, share-scheduled threads also have a priority and the existing real-time priority scheduler in Linux is used to determine the next eligible thread. If the next eligible thread (i.e., the one which is active and has the highest priority) has been marked for share scheduling, the share scheduler is invoked and the selected thread may be replaced based on share surpluses. This mixed implementation simplifies short-lived preemption (e.g., by assigning high priority) for certain critical threads such as I/O handlers and the garbage collector. Although it is possible to expose the share scheduler through RTSJ facilities, to simplify the implementation we chose to implement a JNI library which, in turn, invokes the appropriate system calls on the OS.

Scenario: Power-line fault detection

We have used Pulsar to manage the latency requirements of a simulated application for power-line fault detection and analysis. In this section, we describe the fault detection application and its real-time requirements. In the following section, we evaluate the performance of Pulsar in this simulated setting.

In many parts of the world, power lines are part of the critical infrastructure for delivering electricity and other services to populated areas. Several recent high-profile failures have highlighted the need for problem monitoring and expedited problem resolution. To enable these capabilities, electricity providers have begun to deploy distributed monitoring capabilities over power-line networks. These monitoring capabilities take the form of remote terminal units (RTUs), which are hardware components that are attached directly to power-line installations (e.g., poles and underground conduits). RTUs monitor the physical characteristics of the power supply (e.g., frequency and amplitude) and are networked, so that measurements can be communicated to nearby power stations. By analyzing RTU measurements, either locally at the RTU or in aggregate across several installations, it may be possible to detect dangerous fluctuations in the power supply and take preventive action before wide-scale outages occur.

An application designed to monitor and react to this infrastructure faces several challenges. Because the power-line infrastructure is ubiquitous, depending on the density of RTUs, the resulting bandwidth will be likely to require careful management in order to prevent data loss or delay. Failing to report anomalous power-line conditions in a timely fashion may lead to cascading overloads or other critical failures. Therefore, it is important that the monitoring application be sensitive to latency requirements. Reliable but delayed information may be of little use. RTUs are limited in capability, for economic reasons. While it is reasonable to expect RTUs to have some basic buffering and analysis capabilities, the monitoring application should assume that the RTUs are a limited resource which must be carefully managed.

To test the latency management capabilities of Pulsar, we have simulated a limited power-line fault detection and monitoring infrastructure that uses a remote monitoring application to detect and act on possible faults. We attempt to answer each of the challenges mentioned by using the capabilities of Pulsar for managing bandwidth and CPU utilization in a manner that is sensitive to the latency requirements of applications.

Figure 4 illustrates our simulated power-line infrastructure. We simulate monitoring at one substation of the power-line infrastructure. This substation is responsible for monitoring 30 power-line poles, each of which is monitored by an RTU. We also assume that bidirectional communication between RTUs and the substation is made possible by a broadband connection over the power lines. To model the network, we normalize over unit messages and assume that all data is transmitted as a sequence of one or more messages. The network has a fixed capacity of 600 unit messages per second (msgs/s).

Figure 4  Simulated-power-line infrastructure Figure 4

Power distribution has several interesting physical characteristics, but there is insufficient bandwidth available for each RTU to transmit all of these characteristics. Therefore, under normal (i.e., non-fault) operating conditions, we assume that each RTU sends a limited summary of the current pole status. We call these summaries live messages and, as the figure shows, this data consumes 10 unit msgs/s per RTU. When the substation receives these messages, they are processed by a live application which checks the data for possible fault conditions. When only live messages are processed, we say that the system is in zero fault mode.

Occasionally, a live message may indicate a possible fault or other significant condition. When this occurs, it is desirable to gather more detailed information, possibly including data from other power-line poles. We simulate this behavior by way of a replay module which, when notified by the live module, will send a request for more detailed data from up to three power-line poles (including the pole that transmitted the summary which triggered the possible fault). The more detailed data consumes more bandwidth; in our case, it was 300 messages per RTU, or 900 messages in total. These messages are received by the replay module, which determines whether further action is required (e.g., notifying a human operator). When the system has requested or is still processing replay data, the system is said to be in one fault mode.

The end-to-end behavior of the system in zero-fault and one-fault mode can be visualized as a job graph that captures both the CPU and communication dependencies and can be used as a basis for partitioning resources to meet latency requirements.

To make the scenario more realistic, we assume that there is an ambient load at the substation corresponding to other real-time monitoring or service operations. The real-time portion of this load is denoted by an AppX module which executes a task with a fixed period. The remaining general application load is denoted by the AppY module and has no timeliness or bandwidth requirements.

From this scenario, we generate three timeliness requirements and one bandwidth requirement:

  • Live data latency. The delay between the generation of a live message (at the RTU) and its processing at the live module on the substation must be bounded. This is necessary to ensure that possible faults are detected in a timely manner, and so that any other required data is still available at the RTUs when requested.

  • Replay latency. The delay between the generation of a live message indicating a possible fault, the request for more detailed data by the replay module, and the processing of all replay messages must be bounded. This is necessary to ensure that verified faults are handled in a timely fashion.

  • AppX latency. The AppX module requires periodic service and must execute its tasks with a bounded delay. This constraint is limited to the CPU; AppX does not access the network.

  • Live and replay bandwidth. Total bandwidth must be managed during zero-fault and one-fault modes so that all timeliness requirements are met. Live, replay, and AppX latency requirements are specified as hard bounds on total latency. These are enforced by controlling the transmit latency of messages and the CPU shares of the modules which process data (only CPU share in the case of AppX). The bandwidth requirement is satisfied using message-traffic-shaping techniques so that transmit latency is predictable and bounded. We describe a more-detailed evaluation of this scenario, with specific latency and bandwidth bounds, in the next section.

Evaluation

We have evaluated our architecture using synthetic workloads and a workload motivated by the power-line fault detection scenario presented in the previous section. Here, we describe the evaluation results for power-line fault detection, which illustrates the significant challenges due to the dynamic workload and how our approach addresses them.

Experimental setup

We emulated the power-line fault scenario by running the RTU logic and the substation processing on blade servers. The constrained bandwidth, broadband over power line (BPL) network was emulated using a token bucket so that the total message rate was constrained to 600 msgs/s (see previous section). In this section, we refer to the zero-fault live data as the live job flow. Likewise, the replay data (generated in one-fault mode) is referred to as the replay job flow. We constrained the emulation such that the complete system at any point in time had either zero potential faults (zero-fault mode) or one potential fault (one-fault mode).

In terms of the architecture presented in Figure 2, the global controller in the scenario is centralized and located at the substation server. There is a node controller deployed at each RTU that receives a token bucket allocation at the time a replay is requested. A node controller at the substation server receives share allocations for all computing jobs. Resource allocation is controlled in the system using two types of resource allocators: (1) a token bucket to shape the replay traffic from each RTU; and (2) CPU share allocation at the substation server.

We present results from three kinds of experiments. In the first, the global policy sets a deadline for the live job flow. In the second, the global policy sets a deadline for the replay job flow. In the third, round-robin scheduling is used for the CPU at the substation server instead of proportional-share scheduling.

Latency model and resource allocation

Given the sudden transition in the workload from zero-fault to one-fault mode, a conventional latency analysis cannot be used. We adopt the approach of having a different model for zero-fault and for one-fault mode, and a model that describes the transient conditions between the two. All these models represent worst-case latency, due to the strong real-time requirements of the power-line fault detection scenario.

In this section, we present a few details of the model and how it is used to perform resource allocation. In particular, we focus on live latency and replay latency.

Zero-fault latency model
The live latency for zero-fault mode is described by the following equation:

latencyL0 = Trtu + N/C + (N*TL + G)/sL0(1)

where

Trtuis the processing time for live data at an RTU,
Nis the number of RTUs,
Cis the bandwidth of the BPL network,
N/Cis the latency due to the BPL network,
TLis the worst-case execution time (WCET) of a live message at the substation server,
Gis the lag of the proportional share scheduling algorithm,
sL0is the CPU share allocation for processing live jobs, and
(N*TL + G)/sL0  is the latency at the substation server.

All of these parameters, except for sL0, are fixed for the experiment. The sL0 parameter is variable and is set by the global controller. For all the latency equations described in this section, we also include an additive error term that is specific to each equation, and use a least-squares fit for online error correction.

One-fault latency model: Live latency
The live latency for one-fault mode is described by the following equation:

latencyL1 = Trtu + N/C + (3*b/C) + (N*TL + G)/sL1(2)

where bR is the size of the replay token bucket at each RTU, in number of messages. Note that compared to Equation 1, the only differences are the additional term (3*bR/C), which represents network latency due to sharing the network with the replay traffic, and the share parameter sL1, which is the CPU share allocation in one-fault mode. The bR and sL1 parameters are variable and set by the global controller.

One-fault latency model: Replay latency
The replay latency for one-fault mode is described by the following equation:

latencyR1 = latencyL0 + (BR − bR)/rR + (N + 3*b)/C + (TR + G)/sR1(3)

where

latencyL0is from Equation 1,
BRis total number of messages being replayed by an RTU,
rRis the rate, in messages per second, of filling the token bucket of that RTU,
(BR − bR)/rRrepresents the latency at the replay token bucket of the RTU,
(N + 3*bR)/C  is the latency in the BPL network,
TRis the WCET of processing all the replay messages for a fault,
sR1is the CPU share allocation for processing the replay messages, and
(TR + G)/sR1  is the latency at the substation server.

The BR and sR1 parameters are variable and set by the global controller.

Transient intervals
In addition to the above steady-state equations, the transient conditions when transitioning from one-fault to zero-fault mode have a significant effect on latency. We model this using an equation that includes the time to clear the backlog of live jobs in both the BPL network and the substation server CPU. The details are omitted from this paper.

Computing resource allocations
The global controller computes the values of sL0, sL1, sR1, BR, bR (and the CPU shares for AppX and AppY, which are not described here). The node controllers at the RTUs and the substation server enact the settings of these parameters.

For the experiments described in the following evaluation, this computation is done in a straightforward manner by solving a system of equations. These equations are solved repeatedly, as the error terms in the latency model change. Hence the behavior of the system changes over time. We have observed the change to be a gradual improvement, since the latency error terms become more accurate and stable over time. The stability of the system as a whole is always guaranteed, regardless of the accuracy of the latency model error correction, because we ensure that all jobs get the minimum resource allocation needed to prevent unbounded queuing.

Live deadline policy

The primary end-to-end deadline in the global policy, which is an input to the global controller, is the live deadline. It is set to 400 ms for zero-fault and 1100 ms for one-fault mode. The secondary end-to-end deadline is for AppX, which is 280 ms and 490 ms for the zero-fault and one-fault cases, respectively.

Figure 5(A) depicts the live job flow latency at the beginning of the experiment. The live max line refers to the observed worst-case (maximum) end-to-end latency, computed over time intervals. The “zero-fault deadline” and “one-fault deadline” show the global policy setting. The plot also shows predicted latencies. The square wave represents the occurrence of a fault and when the fault processing ends. The fault occurrences correspond to one-fault deadline policy, and the observed latency is increasing. The increase in latency is due to contention, both on the shared BPL network and the shared CPU at the substation server.

Figure 5  Live job flow latency: (A) at the start of a run; (B) later in the run Figure 5

The following observations can be made based on this plot:

  1. There is a small latency spike at the start of the run which is due to initial transient conditions.

  2. Subsequent to that spike, and up to time 30,000, there is a difference between the observed zero-fault latency and the predicted latency. This difference declines to the point of negligibility, due to the feedback (from the node controllers to the global controller) which corrects the latency model.

  3. The predicted and observed zero-fault latency is significantly smaller than the zero-fault deadline. This is due to significant spare resources in the absence of a fault.

  4. The observed one-fault latency is significantly greater than the deadline (the deadline and predicted one-fault latency are identical). This changes in a subsequent plot.

The AppX job flow shows a similar model correction for the zero-fault latency, which causes the CPU share for AppX in zero-fault mode to be reduced, such that it conforms to the deadline setting.

Figure 5(B) shows the live job flow latency at a later point in the run, starting from around 390,000 ms, which is about 5 minutes later than the first plot. The latency for one-fault mode has declined and is closer to the deadline. This is due to a slow model error correction, which eventually causes the token bucket for replay job flows to be decreased in size from 190 messages to 80 messages. The error correction is much slower than the zero-fault case since there are fewer latency samples available for one-fault mode.

Replay deadline policy

In this experiment, instead of setting a one-fault deadline for the live job flow, we set a one-fault deadline for the replay job flow with a value of 2000 ms. All other deadlines are the same as in the previous experiment.

Figure 6(A) depicts the latency of the replay job flow, from the start of the run. Note that the predicted latency tracks the observed latency. However, the observed latency does not reach the deadline of 2000 ms, and instead settles to a value close to 2500 ms. This is despite giving it the maximum possible allocation, without starving the other job flows. In fact, given this high allocation of resources to the replay job flow, the one-fault deadline for the live job flow increases to more than 2000 ms (not shown in the figure). Hence, this is an example of a deadline that cannot be achieved given the resources in the system.

Figure 6  (A) Replay deadline policy: replay job flow latency; (B) round-robin scheduling: live job flow latency Figure 6

Round-robin scheduling

In this experiment, we used round-robin scheduling for the CPU at the substation server, instead of proportional-share scheduling. The deadlines were the same as in the first experiment.

Figure 6(B) depicts the live job flow latency. The latency increases over time, and eventually exceeds 30,000 ms. Thus round-robin scheduling leads to starvation of this job flow. Unlike the live job flow latency, the AppX job flow latency does not increase indefinitely. However, the latency is not close to the deadline. Specifically, the latency for zero-fault mode can be either above or below the deadline. For instance, it was observed to be sometimes one-half the deadline, and sometimes three times the deadline.

Related work

Research related to our approach can be grouped in four categories: (1) deadline slicing in real-time systems; (2) schedulability analysis and feedback-control scheduling; (3) utility optimization; and (4) multilevel resource management.

Deadline slicing

Deadline slicing techniques1016 attempt to find the deadlines that optimize a predefined measure of schedulability. These algorithms work with a fixed end-to-end deadline and limited characterizations of tasks (e.g., periodic tasks with WCET), and are performed offline.

Basic Slicing Technique (BST)10 is a search algorithm that assigns slices—static execution windows in time—to tasks. The algorithm iteratively computes paths in the task graph that minimize the overall laxity (the difference between the time a task would complete if started now and its deadline) and assigns slices to the tasks by evenly distributing the path laxity. Adaptive Slicing Technique (AST)11,12 uses a task assignment algorithm which extends BST for the case where the resource-to-task mapping is not fixed a priori. Neither BST nor AST account for resource capacity.

Garcia and Harbour13 provide an iterative offline algorithm which assigns deadlines to sequential tasks. At each iteration, new deadlines are computed based on how far each subtask is from meeting its desired deadline, given its current deadline. Saksena and Hong14 derive subtask deadlines by maximizing the critical scaling factor17 of the task set. However, this algorithm works only with sequential tasks and, as the authors assert, their solution is not optimal. Bettati and Liu15 focus on distributed systems that can be characterized by a flow shop model: sequential subtasks execute on different resources in turn, following the same order. The authors assume identical execution times for all subtasks executing on the same resource and assign local deadlines by evenly distributing the end-to-end task deadline. In the context of soft real-time systems, Kao and Garcia-Molina16 divide the deadline assignment problem into serial and parallel subtask problems. They propose simple strategies for both problems based on minimizing the deadline miss ratio. The utility function of a job flow in our framework can be interpreted as a generalization of a deadline, which can be used to express soft deadlines and job flows of different importance.

Schedulability analysis

Traditionally, scheduling and schedulability analysis techniques have been used for meeting deadline requirements. However, most scheduling algorithms focus on controlling performance on a single processor. Approaches such as the rate monotonic approach and its extensions17 use static allocation and assume a priori knowledge of the task parameters. In contrast, dynamic scheduling algorithms work with incomplete knowledge about the task set. These algorithms guarantee schedulability either by relying on information about the system resources18 or through admission control and planning.19,20 Feedback-control scheduling2123 circumvents the problems of static and dynamic scheduling in unpredictable environments by continuous monitoring and adjustment of deadlines based on system feedback. Lu et al.22 propose a feedback-control scheduling framework to minimize the deadline miss ratio of soft, independent tasks. Abdelzaher et al.23 use utilization-based schedulability24 to guarantee deadlines of aperiodic requests to a Web server.

In the context of distributed systems, Stankovic et al.25 use feedback-control scheduling to guarantee deadlines for sets of independent tasks that run on different processors. Lu et al.26,27 apply utilization-based schedulability techniques24 to schedule periodic end-to-end tasks on a distributed platform. They propose both centralized and distributed algorithms that control the invocation rate of tasks in order to adjust the utilization of resources.

We consider much broader task behavior (analogous to job flow) by allowing flexibility in latency sampling (i.e., sampling at different percentiles in the overall distribution of latencies), tasks whose resource model can only be known at runtime, and utility functions to express elasticity of the deadline and different importance of tasks.

Utility optimization

Research into utility optimization in the context of computer system resource management has a large body of literature. Here we consider a few areas which bear some resemblance to our job flow model.

Several approaches for optimizing a system-wide utility function have been proposed in the area of network flow optimization. For practical reasons, distributed algorithms are highly desirable in this context, and are often based on the dual decomposition.28,29 Relevant research includes work in unicast3033 and multicast34,35 flow optimization, in which flow rates are varied in order to optimize system utility. The main difference with our work is that we are allocating resources to achieve a high utility expressed in terms of latency, while these works focus on utility in terms of network throughput.

Jensen et al.36 have proposed the idea of time utility functions (TUF) for real-time systems, and have applied it in multiple contexts.37 Unlike our work, theirs does not consider the use of feedback to adjust the policy dynamically at runtime.

Multilevel resource management

Multilevel resource management addresses resource management concerns at several levels of abstraction and achieves a high degree of scalability. Cardei et al.38 describe a general-purpose hierarchical architecture for adaptive resource management, and many related works implicitly comply with this architecture. Abeni et al.39 propose a two-level hierarchy that consists of a system-wide resource manager governing application-level controllers. Liu et al.40 present a two-level bandwidth management scheme for mobile ad hoc networks connected through a satellite. Manghwani et al.41 describe a three-level hierarchy that consists of system-wide, node-local, and resource-specific controllers. Similar to our work, the proposed system aims at providing end-to-end quality of service (QoS). Rohloff et al.42 present another three-level resource management hierarchy that consists of system-wide, mission-specific, and application-specific controllers. Decision making and performance evaluation in this control system is entirely based on a set of (hierarchical) utility functions.

Conclusion

We have presented a distributed, hierarchical control mechanism for distributed, time-critical applications which uses an abstract model of system resources. A novel intermediate model describes the resource and timeliness requirements of a broad range of service-oriented applications and supports utility-based optimization that maximizes total system utility. We have evaluated our architecture using a workload derived from a power-line fault detection scenario, demonstrating how a global application latency policy translates to the allocation of constrained resources that optimize system utility.

*Trademark, service mark, or registered trademark of the International Business Machines Corporation in the United States, other countries, or both.
**Trademark, service mark, or registered trademark of the Object Management Group, Inc., Sun Microsystems, Inc., Linus Torvalds, or Red Hat, Inc., in the United States, other countries, or both.

Cited references

Accepted for publication November 20, 2007; Published online May 6, 2008.


    About IBMPrivacyContact