| |
 |
Performance
Modeling and Analysis
|
|
|
|
History |
| |
Performance modeling and analysis has been and continues to be of
great practical and theoretical importance in research labs in the
design development and optimization of computer and communication
systems and applications. This includes a broad spectrum of
research activities from the use of more empirical methods (ranging
from experimental tweaking of simple existing models up to building
and experimenting with prototype implementations) through the use
of simulation to more sophisticated mathematical methods. IBM
Research has a long and rich history in this area of research.
A small subset of examples include: Time-Sharing
Computer Model, Token Ring Local Area Networks,
Product Form Networks and RESQ Package, Computer
Scheduling, S/390 Sysplex, Traffic
Management, Parallel Scheduling.
For more information, refer to S.S. Lavenberg and M.S. Squillante,
Performance Evaluation in Industry: A Personal Perspective,
in "Performance Evaluation -- Origins and Directions", G. Haring,
C. Lindemann, M. Reiser (eds.), Springer-Verlag, 1999, and the references
cited therein. |
| |
|
|
| |
A Simple Queueing Network Model of a Time
Sharing System |
| |
|
One
of the earliest documented successful applications of analytical computer
performance modeling in industry is due to Lassettre and Scherr (1972)
at IBM in the late 1960s during the development of IBM's OS/360 Time
Sharing Option (TSO). The machine repairman model, originally
developed in the operations research literature to model a single
repairman servicing machines that break down, was used to represent
OS/360 TSO running with a single memory partition and supporting multiple
terminal attached users via time sharing the execution of their programs.
With a single memory partition, only one user's program could be resident
in memory and the system was time shared by swapping programs into
and out of memory. The machine repairman model has a very simple
analytical solution which was used to estimate the number of users
the system could support without exceeding a specified average response
time. The model was used in conjunction with measurements.
Average program execution time was estimated from measurements on
a TSO system with a second computer system simulating the users by
running benchmark scripts and generating exponentially distributed
user think times. The measured average response time was compared
with the mean response time computed using the model under the assumption
of exponentially distributed execution times with mean equal to the
measured average. It is interesting to note that a substantial difference
between the measured and predicted response time was usually due to
bugs in the operating system. Once the bugs were removed the measured
and predicted results tracked closely. In approximately 75% of the
test cases, the prediction error was less than 10% with a maximum
error of 24%. This was surprising due to the simplistic model assumptions.
In particular, program execution times were assumed to be exponentially
distributed, although measurements showed they were not, and programs
were assumed to be executed first come first served, rather than via
time sharing. Unknown at the time, this queueing model is an example
of a product form queueing network. Product form queueing network
results of the 1970s showed invariance of performance measures including
mean response time when a first-come-first-served queue with
exponential service times is replaced by a processor sharing queue
with general service times. A processor sharing queue with general
service times would have been a more realistic model of the TSO system,
but the model's predictions would not have been different, thus helping
to explain the model's accuracy. |
| |
|
|
| |
The
Performance of Token Ring Local Area Networks |
| |
|
One
of the most influential analytical performance modeling studies in
IBM was done by Bux (1981). This study compared the analytically derived
delay-throughput characteristics of local area networks based on ring
and bus topologies. Included were the token ring, a ring network with
access controlled by a single circulating token that was being built
at IBM's Research Lab in Zurich, and the CSMA-CD (carrier sensing,
multiple access with collision detection) bus that was the basis of
Ethernet, originally developed by Xerox in the 1970s. The study primarily
used existing analytical queueing results, modifying them as required
to capture the essential characteristics of the networks being modeled.
For example, the key to analyzing token ring performance was recognizing
that a token ring functioned like a single server that served multiple
queues by round robin polling. An elegant discrete time queueing analysis
of such a polling system had appeared in Konheim and Meister (1974).
(The discrete time results were converted to continuous time by letting
the discrete time interval approach zero.) The study showed that the
delay-throughput characteristics of the token ring and CSMA-CD bus
were comparable at low transmission speeds, e.g. 1 Mb/sec, but the
token ring was superior at higher speeds, e.g. 10 Mb/sec. (The
key parameter affecting the relative performance of the token ring
and CSMA-CD bus is the ratio of propagation delay to packet transmission
time, with higher ratios favoring the token ring.) While many factors
influenced IBM's decision to develop a token ring local area network
product, the performance of the token ring as demonstrated in this
study was a key factor. |
| |
|
|
| |
Product
Form Networks and The Research Queueing (RESQ) Package |
| |
|
The
discovery of product form queueing networks and their properties and
the development of efficient computational algorithms for product
form networks was a breakthrough in analytical performance modeling.
Within the computer science literature, the classic paper that first
defined product form networks and gave their properties is Baskett,
Chandy, Muntz and Palacios-Gomez (1975). Researchers in IBM
developed the main computational algorithms for solving product form
networks, first the convolution algorithm, Reiser and Kobayashi (1975),
and later the Mean Value Analysis (MVA) algorithm, Reiser and Lavenberg
(1980), and they incorporated these algorithms in performance modeling
software packages to make them available to performance modeling practitioners.
The first such package was QNET4, software for specifying and solving
closed multichain product form queueing networks. QNET4 provided a
textual interface for the user to specify the queues and routing chains
and their associated parameter values. Performance measures were computed
using the convolution algorithm. Shortly after QNET4 was developed,
it was integrated into a more general performance modeling package,
the Research Queueing (RESQ) package.
RESQ allowed a user to specify and solve product form networks
(initially using the convolution algorithm; the MVA algorithm was
added later), but it also allowed a user to specify more general
``extended queueing networks'' and use discrete event simulation
to estimate performance measures. The then recently developed regenerative
method for estimating confidence intervals and controlling simulation
run length was incorporated in RESQ. One of the key extensions
was the inclusion of passive queues, which provide a convenient
way to model simultaneous resource possession. QNET4's textual user
interface was extended in a natural way to allow specification of
extended queueing networks. The modeling level of abstraction provided
by extended queueing networks and the implementation in RESQ proved
very useful. It allowed the rapid development of simulation models
without the programming required with a simulation programming language.
It helped guard against the pitfall of developing overly detailed
simulation models by forcing a higher level of abstraction. It included
modern statistical simulation techniques and made them easy to use.
It also helped bridge the gap between analytical modeling and simulation
modeling by incorporating both product form networks and extended
queueing networks. RESQ was a major success in IBM. Developed
by researchers, it began to be widely used in product development
groups in IBM in the late 1970s to model computer systems and subsystems
and local and wide area computer networks. It was enhanced
over time with additional computational algorithms and statistical
methods, a graphical user interface, simulation animation and other
features, and its widespread use in IBM continued into the 1990s.
It was also made available for use in research and teaching at universities.
|
| |
|
|
| |
Computer
System Scheduling |
| |
|
The
performance modeling and related stochastic literature over the past
five decades is rich with studies of scheduling optimization problems.
This includes optimal scheduling results for minimizing a weighted
sum of the per-class mean response times, as well as for achieving
a given vector of per-class mean response times, in a single queue
or a queueing network, with or without side constraints (i.e., a per-class
performance constraint that must be satisfied in addition to the global
objective function). These results have basically established
that, in many cases, the space of achievable performance measures
is a polymatroid, or extended polymatroid, whose vertices correspond
to the performance of the system under all possible fixed priority
rules. Furthermore, the optimal or desired performance vector
is a vertex or an interior point of this performance polytope, and
the scheduling strategies which satisfy these classes of objective
functions are some form or mixture of fixed priority policies (dynamic
priority policies are considered below).
As computer technology advanced and the complexity of computer
systems and applications continued to grow, new customer and/or
user requirements arose that were not fully addressed by previous
classes of scheduling objective functions. For this reason,
research studies at IBM investigated specific scheduling optimization
problems to consider the needs of certain IBM computer platforms.
Two such studies became the basis for the processor scheduling algorithms
in the Application System/400 (AS/400) and System/390 (S/390) computer
systems.
Basis for the AS/400 Processor Scheduler:
Much of the previous scheduling research considered scheduling
strategies for minimizing a weighted sum of the per-class mean
response times. An important additional objective is to
maintain a low variance of response times for each class.
Two related research studies at IBM that investigated this problem
resulted in the concepts of Delay Cost Scheduling, due to Franaszek
and Nelson (1995), and Time-Function Scheduling, due to Fong and
Squillante (1995). These two studies considered different
forms of objective functions that are based on minimizing a weighted
sum of per-class second moment measures of response time, and
they used different approaches to establish certain structural
properties for the respective optimal solutions.
One scheduling strategy with the structural properties for a
particular instance of the scheduling objectives considered by
Fong and Squillante is based on the use of general time-based
functions to obtain effective and flexible control over the allocation
of resources. This scheduling strategy is in part a generalization
of the linear time-dependent priority discipline due to Kleinrock
(1964) in which the priority of each job increases (linearly)
according to a per-class function of some measure of time and
the job with the highest instantaneous priority value in the queue
is selected for execution at each scheduling epoch. In its
most general setting, the time parameter for each per-class function
can include any measure of the time spent waiting for a resource
or set of resources (response mode), any measure of the time spent
using a resource or set of resources (usage mode), and any combination
of such modes, as developed by Fong, Hough and Squillante (1997).
An adaptive feedback mechanism is used together with mathematical
control formulas to adjust these per-class time functions, as
well as to migrate each job to a different class upon the occurrence
of certain events or upon the job exceeding some criteria, in
order to satisfy the scheduling objective function across all
and/or varying workloads. These control formulas can also
be used to obtain a scheduling strategy that realizes the optimal
or desired performance vector which is a (interior) point in the
performance space, while providing better per-class response time
variance properties. A number of important scheduling issues
can be easily accommodated in the per-class time functions and/or
addressed by the use of these time functions to control resource
scheduling decisions; examples include priority inversion and
processor-cache affinities. The theoretical properties of
time-function scheduling can be further exploited to obtain very
efficient implementations.
The above scheduling strategy is a fundamental aspect of the
dynamic priority scheduling algorithms employed in AS/400 systems.
Based on the control formulas mentioned above and a general set
of assumptions regarding workloads and performance objectives,
this scheduling technology offering has proven to be a great success
with AS/400 customers providing efficient and effective control
over resource management to address various customer requirements
across numerous diverse workloads without any perceived increase
in complexity by the system administrator and users.
Basis for the S/390 Processor Scheduler:
Instead of minimizing a weighted sum of the per-class mean response
times, it can be more natural to associate a mean response time
goal with each class and to consider the performance of the class
relative to its goal. The corresponding objective function
then is to minimize a vector of the per-class ratios of the response
time mean to the response time goal. This problem is studied
within the context of a multi-class M/GI/1 queue, with and without
feedback, by Bhattacharya et al. (1993,1995) where adaptive scheduling
strategies are presented and shown to lexicographically minimize
the vector of per-class performance ratios (exactly or approximately).
The results also apply to other systems including certain multi-class
Jackson networks and multi-class M/GI/c queues.
Consider a $K$-class system at a scheduling decision time epoch
in which the mean response time for class $i$ realized over the
previous scheduling time interval(s) is $x_i$ and the specified
mean response time goal for this class is $g_i$. A fixed
scheduling policy that gives priority to jobs of class $i$ over
jobs of class $j$ if $x_i/g_i \geq x_j/g_j$ is then used for the
next scheduling time interval. In other words, priority
is given to the class that has received the worse performance,
relative to its goal, over the previous time interval(s).
The time intervals between scheduling decision epochs, in which
priorities are updated, can be arbitrary provided that they are
bounded above by a finite measure. This scheduling strategy
is proven optimal in the sense that it converges asymptotically
in the number of time intervals to the optimal solution which
lexicographically minimizes the vector of ratios $x_i/g_i$ arranged
in non-increasing order $x_1/g_1 \geq x_2/g_2 \geq \cdots \geq
x_K/g_K$ (with probability 1).
The above scheduling strategy is an important aspect of the processor
scheduling algorithms employed in S/390 systems, where $x_i$ and
$g_i$ can be functions of performance metrics other than just
mean response time -- in particular, response time percentiles
and/or velocity goals can be used together with or instead of
mean response time goals. This S/390 concept is commonly
referred to as goal-oriented scheduling. It is a key component
of the workload management system provided on S/390 computers,
which has been a great success for IBM in the control and management
of mainframes and mainframe clusters.
|
| |
|
|
| |
Cluster Architectures and S/390 Parallel Sysplex |
| |
|
As
commercial workloads expand and evolve, clusters of multiple computers
are required to support high transaction processing rates and high
availability in large-scale commercial computing environments, which
include on-line transaction processing systems and parallel database
systems. The principal cluster architectures proposed to support
these scalable commercial application environments are the shared-nothing
(or partitioned), the shared-disk, the virtual shared-disk, and the
Parallel Sysplex} models. The shared-nothing architecture consists
of partitioning the database and disks among the nodes, where either
function-shipping (i.e., a remote function call to be executed by
the remote node with the results, if any, returned to the local node)
or I/O shipping (i.e., a remote I/O request to fetch the required
data from the remote node) is used when a local transaction needs
to access data located at a remote node in the cluster. Advantages
of this architecture include a higher local buffer hit ratio and no
need for global concurrency control, whereas the main disadvantages
center around the various additional costs for remote requests to
access non-local databases and load balancing and availability problems.
The shared-disk architecture consists of essentially having all nodes
directly access the disks on which shared data is located, where each
node has a local database buffer cache and a global concurrency control
protocol is used to maintain consistency among the local caches as
well as the database. This architecture has advantages with
respect to load balancing and availability, but it can suffer from
additional overhead to acquire and release global locks, as well as
large overhead, latency and increased I/O activity for hot shared
data (so-called ping-ponging). The virtual shared-disk architecture
is functionally identical to the shared-nothing architecture with
I/O shipping, while providing the view of a shared-disk model to the
database (i.e., the partitioned disks are transparent to the database).
The Parallel Sysplex architecture consists of the shared-disk model
together with a shared coupling facility that provides a shared database
buffer and highly-efficient support for locking, cache coherency and
general-purpose queueing.
Various aspects of each of these principal cluster architecture
alternatives have been analyzed with performance models, many of
which have been formulated by decomposing the original problem into
more tractable parts. The solutions of these hierarchical
models have often involved a combination of different methods including
analytical, mathematical optimization and simulation. The
results of this analysis of the principal cluster architectures
by researchers at IBM, such as King et al. (1997), and elsewhere
influenced the design of the IBM S/390 Parallel Sysplex architecture,
and was used by IBM to demonstrate the key advantages of this design
over alternative cluster architectures. In particular, the
Parallel Sysplex architecture provides the benefits of shared-disk
architectures and exploits the coupling facility services to obtain
very efficient intertransaction concurrency control, buffer cache
coherency control, shared buffer management, and shared job queues.
This results in transaction rate scaling which is close to linear
in the number of nodes, high shared buffer hit ratios which can
reduce the I/O rate per node, and excellent dynamic load balancing
even in systems with heterogeneous nodes.
The Parallel Sysplex technology provides the fundamental infrastructure
for IBM's large-scale enterprise server environments, and the above
performance modeling studies played a key role in its design.
|
| |
|
|
| |
Network
Traffic Management |
| |
|
The
allocation and management of shared network resources among different
classes of traffic streams with a wide range of performance requirements
and traffic characteristics in high-speed packet-switched network
architectures, such as ATM, is more complex than in traditional networks.
An important aspect of this traffic management problem is to characterize
the effective bandwidth requirement of both individual connections
and the aggregate bandwidth usage of multiple connections statistically
multiplexed on a given network link, as a function of their statistical
properties and the desired level of service. These metrics can
then be used for efficient bandwidth management and traffic control
in order to achieve high utilization of network resources while maintaining
the desired level of service for all connections. Guerin et
al. (1991) proposed a methodology for the computation of the effective
bandwidth requirement of individual and multiplexed connections based
on a combination of two complementary approaches, namely a fluid model
to estimate the bandwidth requirement when the impact of individual
connection characteristics is critical, and the stationary bit rate
distribution to estimate the bandwidth requirement when the effect
of statistical multiplexing is significant. While the fluid
model and its extension to capture the impact of multiplexing can
be used to obtain an exact computation of the bandwidth requirement,
the computational complexity involved is too high in general, and
particularly for real-time network traffic management. Guerin
et al. therefore used the proposed methodology to develop a computationally
simple approximate expression for the effective bandwidth requirement
of individual and multiplexed connections, which is shown to have
sufficiently good accuracy across the range of possible connection
characteristics in comparison with both exact computations and simulation
results. This approximation has proven quite successful as one
of the key mechanisms used for traffic management in IBM's Networking
BroadBand Services (NBBS) architecture, as well as in similar offerings
from other companies. |
| |
|
|
| |
Parallel
Scheduling |
| |
|
A
significant body of the performance modeling research literature has
focused on various aspects of the parallel computer scheduling problem
-- i.e., the allocation of computing resources among the parallel
jobs submitted for execution. Several classes of scheduling
strategies have been proposed for such computing environments, each
differing in the way the parallel resources are shared among the jobs.
This includes the class of space-sharing strategies that share the
processors in space by partitioning them among different parallel
jobs, the class of time-sharing strategies that share the processors
by rotating them among a set of jobs in time, and the class of gang-scheduling
strategies that combine both space-sharing and time-sharing.
The numerous performance modeling and optimization studies in this
area by researchers at IBM and elsewhere have played a fundamental
and important role in the design, development and implementation of
different forms of space-sharing and gang-scheduling strategies in
commercial parallel supercomputing systems. In fact, the IBM
SP family of parallel computers supports various forms of space-sharing
and gang-scheduling; support for space-sharing and/or gang-scheduling
is also included in many other commercial parallel supercomputers,
such as the Cray T3E, the Intel Paragon, the Meiko CS-2 and the SGI
Origin. |
| |
|
|
|