1. Introduction
As late as the 17th century in many European universities, only
the very best students were told that they could someday hope to
conquer long division if they applied themselves. At that time,
Roman numerals were the preferred system of numerical representation.
This demonstrates that if a bad system is chosen to represent aspects
of a problem, relatively simple problems can be made quite difficult.
The popular instructions per cycle (IPC) is a poor metric to use in
discussing processor performance because it does not lend itself to
intuition about what the components of that performance are. That is,
if a pipeline that runs at x IPC has improvements made to it
so that it runs at x +
IPC,
there is no way to intuit
(in units of IPC)
on the basis of those improvements.
The first point made in this paper is that while IPC makes for good
marketing, its inverse is what makes for good engineering intuition.
This paper strongly advocates the use of cycles per instruction (CPI)
as the metric of choice in processor microarchitecture discussions.
The second point made in this paper is a corollary to this first point.
That is, scalar pipeline performance (in units of CPI) can be split
into three relatively independent pieces: one involving the inherent
execution component of the workload (cited incorrectly as
"performance" in many studies), one involving pipeline effects, and
one involving the memory hierarchy. Changes to the processor
microarchitecture can be translated into changes in CPI almost
intuitively by considering these three components independently.
Since this is the fiftieth anniversary of electronic digital computing,
von Neumann's machine built at the Institute for Advanced Studies
(henceforth called the IAS machine) is used to illustrate the point.
This first-generation machine shares many similarities with modern
processors, and it is used to demonstrate that it is crucial to
understand instruction flow in a simple processor before hoping to
understand flow in a complex (e.g., superscalar) processor. This notion
is self-evident, but surprisingly many resist it.
The third point made in the paper builds on the first two points: The
CPI components of scalar (single decode per cycle) flow imply bounds
for the CPI components of superscalar (multiple decodes per cycle)
flow. Therefore, if understanding (or undertaking the development
of a detailed simulator for) the flow in a superscalar processor is an
overwhelming task, a great deal can be gained by understanding the flow
for a much simpler (scalar) case. Given a simple understanding of that
simple (scalar) case, extrapolation to the more general case can be (at
least) bounded.
The fourth point is a divergence from the first three points; it
focuses on an aspect of performance that will become more dominant in
the coming decade, hence, on an aspect of design that must evolve.
This is the performance and design of the memory bus. The simpler
designs and slower cycle times of past designs have not stressed bus
utilization the way future designs will. Queueing effects grow
nonlinearly once utilization exceeds certain thresholds. This must be
mitigated with a new approach to bus arbitration.
The final point of the paper is that as an industry, we are
devoting disproportionate resources to overly complex superscalar
designs because we have not found a metric that helps us to avoid
those features in the microarchitecture that hurt cycle time. That
metric is power consumption. A clear focus on power will ultimately
lead the industry through the GHz barrier, and will converge the
microarchitectures of the client and server spaces in the process.
2. Separable components of CPI
A digital computer executes an instruction stream, which is a
translated statement of a high-level problem. Computer performance is
"benchmarked" using instruction streams that are thought to be
representative of the typical work done by the computer.
An instruction stream is a sequence of instructions that effects a
sequence of events when executed on a computer. The particular events
effected depend on the microarchitecture of the computer, but the
frequency of those events is inherent in the instruction stream.
The basic unit of time within a computer is the cycle, and each event
caused by an instruction in the program takes an integer number of
cycles. Some events are concurrent, and some events take zero cycles
when they occur in certain microarchitectures.
Cycles per instruction (CPI) is the most natural metric for expressing
processor performance because it is the product of two measurable
things:
CPI = (cycles per event) · (events per instruction).
The number of cycles per event is determined by the event
type for a particular microarchitecture (it can be zero for some event
types), and the number of events per instruction is known for each
workload (independent of the microarchitecture).
For example, consider branches in the program. If one out of five
instructions in a program is a branch instruction, the event frequency
is 0.2 branches per instruction for that program. If a branch causes a
three-cycle delay in a processor, the CPI contribution of branches in
that processor is 0.6 CPI for that program. That is, on the average, an
instruction incurs 0.6 cycles of delay because of branches in the
program. A performance enhancer such as a branch prediction mechanism
can be thought of as a mechanism that either reduces the frequency of
branches or reduces the average penalty for a branch.
Other events that can cause delay include data dependencies
(instructions that are delayed because they need the results of
previous instructions that have not yet executed), cache misses
(referenced data that are not in the local cache and must be fetched
from the memory hierarchy), and serialization events (explicit draining
of the processor pipeline to ensure correct function of certain complex
operations). For the most part, the frequencies of these events can be
measured directly from a program (benchmark) without knowledge of
the microarchitecture of the processor that will execute the program.
Therefore, a benchmark is really a specification of event
frequencies, and those frequencies apply to any microarchitecture. A
specific microarchitecture determines the delay (penalty) associated
with each event type (which can be zero). CPI is the most natural
expression of performance because it is the dot product of a benchmark
(event frequencies) and a microarchitecture (event penalties) for all
event types. Further, a very small number of events account for most of
the performance of a processor.
3. Some observations on von Neumann's IAS machine
Figure 1 shows a sketch of the
organization of the IAS machine, which is considered to be the first
stored-program computer. This sketch was taken directly from Prof. John
Hayes's book [1] as he adapted it from
von Neumann's verbal description of the machine
[2], and it is reproduced here courtesy
of Prof. Hayes and of McGraw-Hill. The machine comprises a central
processing unit (CPU) that is coupled to a main memory and some simple
input-output equipment (I/O). For the purposes of this discussion, we
ignore the I/O portion of the figure because we are primarily concerned
with the active processing part of the computer. In the case of the IAS
machine, the I/O equipment is used to preload the main memory with a
program and its data, and then to start the CPU. The program and its
data live completely in main memory until completion of the program;
i.e., the I/O does not interact with the program.
Figure 1
The CPU comprises a data processing unit and a program control unit.
The data processing unit contains an arithmetic logic unit and some
internal registers, and the program control unit contains buffer
registers that hold instructions and circuitry for decoding those
instructions and generating the requisite control signals that govern
the operation of the aggregation in Figure 1.
In modern parlance, the data processing unit is called an execution unit
(EU), and the program control unit is called an instruction unit (IU).
In the IAS machine, main memory has the appearance of a large register
file from the instruction-level semantic point of view, but some of the
register space contains the program itself. (The instructions of the
program are within the data space, and can be dynamically modified by
the program.) By program construction, the program and its data fit
completely in main memory. Roughly speaking, main memory has the
appearance of a modern-day cache that never misses (both because
the program is constrained to live in main memory and because it is
semantically impossible to miss within a register file).
The operation of the IAS machine is as follows. Each instruction is
fetched and executed in its entirety only after the previous
instruction has completed in its entirety, and before the following
instruction is fetched. Furthermore, each instruction is processed in
two nonoverlapped phases. First, the instruction is fetched and decoded
(I phase), and then the instruction is executed by the data
processing unit (E phase). These phases are not pipelined; they are
done sequentially for each instruction.
Note that there is a single bus between main memory and the CPU. This
bus is used to transfer an instruction during the I phase, and an
operand during the E phase. This is what is commonly referred to as the
"von Neumann bottleneck," which is widely (and incorrectly)
perceived to be the "flaw" of the von Neumann programming model.
(Very interestingly, the von Neumann bottleneck has its genesis in the
solution to a power-budget constraint, as is explained later.)
A programming model is the high-level part of a computer architecture.
Computer architecture is the contractual interface between the hardware
and the program; that is, a computer architecture is a set of rules
that describe the logical function of a machine as observable by a
program running on that machine.
Computer architecture does not specify what the hardware actually does;
it specifies what the hardware appears to do. The entire point of
defining an architecture is to isolate the programmer from those
details. Architecture is a level of abstraction that allows a
program to have consistent function when run on machines having
different implementations (microarchitectures).
The von Neumann programming model is simply this: A program (which
is a sequence of instructions) generates outputs that are consistent
with the outputs generated by that program if each of the instructions
in the program sequence is fetched and executed in its entirety only
after all instructions preceding it have completed, and before any
instructions following it are fetched. A simpler (albeit ambiguous)
statement of the same thing is that "executing a sequence of
instructions should have the effect of executing the instructions in
the order in which they appear in the sequence," or more simply, "a
sequence of instructions is executed in sequence."
The von Neumann programming model is merely a formal statement that the
execution of a program should not be nonsensical. It does not imply
that the hardware cannot execute instructions out of order, and it does
not imply that there is only one bus or an inherent bottleneck. All it
implies is that the hardware must not generate program outputs that are
inconsistent with the outputs that would be generated by hardware that
is constrained in the fashion implied by the programming model.
In fact, the IAS machine has a single bus to memory, and the I and E
phases are not overlapped (not pipelined). Each vacuum tube in the IAS
machine runs with its cathode at 200 V, and turns on when its grid
exceeds 20 V. Putting 20 V on the grid is most easily accomplished by
using a resistive divider, but this results in an undesirable
steady-state power dissipation. To contain their power budget, the
designers of the IAS machine used a more complex scheme to propagate
logic values.
Instead of having resistive dividers between the stages of the
arithmetic-logic unit (i.e., the carry chain), the IAS machine uses
bypass capacitors to couple the stages. Therefore, instead of propagating
static states (high or low voltages), the IAS machine propagates
pulses.¹ (This was the first dynamic logic.)
If logic pulses are passed repeatedly through a capacitor with no
quiescent period, a dc bias builds up, creating unacceptable noise
susceptibility. The nonpipelined nature of the IAS microarchitecture
provides a natural solution to this problem: The execution unit
quiesces during the I phase, and the instruction unit quiesces
during the E phase.
The von Neumann bottleneck has little to do with architecture and a lot
to do with power. As described later, power budgets will steer
microarchitecture in the next decade in a fairly obvious way.
4. The modern processor and its performance
Figure 2 is a high-level diagram of a
modern processor. The processor core comprises an instruction unit
(IU), an execution unit (EU), and a cache (C). The processor core is
coupled to a main memory which can be a hierarchical system.
Figure 2
At this level of abstraction, the processor core is very similar to the
IAS machine in many ways. (The greatest similarity is that the IAS
machine never experiences a cache miss, just like any modern processor
when running ISPEC.)
The major difference in the microarchitecture is that the IU and the EU
operate concurrently in modern processors. This is basic
"pipelining," and in a well-designed processor, the cache can
accommodate the bandwidth requirements of both units running
concurrently. In some modern architectures, instruction data and
operand data are distinct, and reside in separate caches that are
(necessarily) visible to the instruction set architecture (ISA). In the
IAS machine, the two forms of data are indistinguishable, and the
interpretation of a datum is contextual.
The performance components of this processor correspond to its major
hardware components, which are enclosed by the dotted lines in
Figure 2. Therefore, partitioning performance
into these three major components (as described below) is the most
natural means for understanding processor performance.
The total CPI for a processor is the sum of an infinite-cache
performance and a finite-cache effect (FCE). The infinite-cache
performance is the CPI for the core processor under the assumption that
there are no cache misses. Roughly speaking, this is raw processor
speed when the effects of the memory hierarchy are ignored. The FCE
accounts for the effects of the memory hierarchy.
Finite-cache effect (FCE)
The FCE is the CPI associated with cache misses:
FCE = (cycles per miss) · (misses per instruction).
The misses-per-instruction component is commonly called
the miss rate, and the cycles-per-miss component is called the miss
penalty. (In the literature, the terms "miss rate" and "miss
ratio" are often used interchangeably. This is incorrect. The miss
ratio is the ratio of misses to references, and this can only be
related to the instruction frequency by knowing the average number of
references per instruction.)
The beauty of separating the FCE in this way is that for a given
cache geometry (size, line size, and set associativity), it allows a
benchmark to be characterized by its miss rate quite independently
of the details of the cache-to-memory interface. That miss rate applies
to any other processor whose cache shares the same geometry.
Measuring the miss rate does not require complex simulation, but it
requires simulating many misses (e.g., millions) for an accurate
measurement. A miss rate simulation merely requires a data structure
that reflects the cache geometry (the "simulated cache"), and a
long reference stream to run through that data structure to effect
replacements (misses). The simulation does not require any level of
detail, or any emulation of actual hardware mechanisms.
Characterization of the miss penalty requires a detailed simulation of
the processor and the memory system to which it is coupled, but an
accurate characterization can be done with a relatively small number of
misses (e.g., thousands). That is, to characterize miss penalty, the
actual hardware mechanisms that process a miss must be simulated in
detail, and they must be simulated within the context of the processor.
This ensures certainty regarding the average delay associated with a
miss, and this average converges rapidly.
Therefore, the complex simulation (to characterize miss penalty) can be
fairly short, and the lengthy simulation (to characterize miss rate)
can be very simple. It is also true that a (good) performance
practitioner can make a fairly accurate guess as to miss penalty
for a given processor-memory structure if he has experience with
similar structures (i.e., detailed simulation of derivative designs
might be unnecessary for someone who is knowledgeable). Note that
simulation is unnecessary in the trivial case of a blocking cache,
because nothing overlaps the processing of the miss.
Figure 3 shows an idealized plot of a
processor's CPI as a function of miss rate. Note that the slope of
the finite-cache CPI is the miss penalty (cycles per miss), and the
y-intercept for this line is the infinite-cache CPI.
Since the infinite-cache CPI is independent of miss rate, it is
horizontal.
Figure 3
Therefore, if the infinite-cache CPI can be determined for a processor,
this fixes the y-intercept. If the miss penalty can be
determined, this fixes the slope of the finite-cache CPI, and that line
can be drawn. For a given miss rate (x coordinate), the CPI
is determined.
Infinite-cache performance
Measuring infinite-cache performance requires a detailed
simulation of the processor core. This simulation (using no cache
misses) can be relatively short if there is reasonable confidence that
both the instruction mix and the software-module mix are
representative of steady-state performance.
The infinite-cache CPI is the sum of two parts (each of which is
independent of the miss rate), as is shown in
Figures 2 and 3. These
are the execute busy (EBusy) and the execute idle (EIdle) components.
EBusy represents the average intrinsic work (execution cycles) done per
instruction. In a scalar processor, EBusy corresponds to the CPI that
would be obtained if there were no pipeline delays. Another
interpretation of EBusy is infinite-cache CPI with the EU utilization
at 100%. For a scalar RISC machine, EBusy = 1; for a CISC
machine, EBusy > 1. Note that EBusy is independent of pipeline
delays. EBusy is the dot product of the instruction frequencies and
their execution times; it is a pure measure of intrinsic work.
EIdle represents the average delay per instruction caused by pipeline
effects. Very simply, EIdle is the difference between the
infinite-cache CPI and EBusy. In general, it accounts for all of the
numerous pipeline effects, but only a few of these are really
significant. EIdle can also be separated into a number of relatively
independent components [3], although
this is not done in detail here.
As can be surmised from some of the previous discussion, the first
crude separation of EIdle is as follows:
- Events in which instructions are not available in time to
keep the EU busy. These are branches, dynamic effects associated with
cache utilization, and effects associated with the fetching of
variable-length instructions (if applicable).
- Events in which operand data are not ready in time to keep
the EU busy. The most predominant is the address generate interlock
(AGI) associated with indirect data fetching (e.g., pointer chasing),
but execution dependencies in general fall into this category.
- Serialization events in which the pipeline is deliberately
drained between instructions to ensure correct function. These events
are predominant in MP environments.
Note that each of these events has a frequency that can be
measured for any benchmark, and that frequency is independent of the
microarchitecture. Each event has an associated penalty that can be
determined from the microarchitecture. The CPI contribution of each
event type has the form
CPI = (cycles per event) · (events per instruction).
A further interesting observation is that most of the
EIdle events, and all of the predominant ones (branches, AGIs, and
serialization), have penalties that are directly proportional to the
number of stages in the pipeline (in general), and specifically to the
number of cycles required for a cache access. That is, EIdle CPI varies
linearly with the number of stages in the pipeline (i.e., with the
degree of pipelining).
Superpipelining and wave pipelining are techniques that achieve a
fast cycle time by making the pipeline granularity fine. It has been
shown that processor cycle time decays to an asymptote when the
pipeline granularity is increased, while CPI increases linearly, as
shown respectively in Figures 4(a) and
4(b) [4]. The performance (MIPS) is the
inverse of the product of the two, and that product has a quadratic
form (thus, a unique maximum), as shown in
Figure 4(c).
Figure 4
Because of unpredictable branching, and because of a high AGI frequency
due to heavy pointer use, EIdle events in commercial code render
superpipelining techniques fruitless. These techniques are better
suited to the realm of scientific workloads.
Real workloads
Figure 3 is an idealized plot, but in fact,
real workloads exhibit
a small correlation between EBusy and miss rate. This does not mean
that miss rate is causal to EBusy; it merely means that there is a
statistical dependence (correlation). This is why rigorous benchmark
selection requires an accurate module mix. An accurate instruction mix
alone is insufficient.
Furthermore, the actual finite-cache CPI becomes superlinear when the
miss rate is very low or very high, for different reasons:
- When the miss rate is very low, the misses do not cluster
(in time) quite as much as they would in steady state; therefore, no
portions of misses overlap each other. The net effect is that the
average miss penalty is slightly larger.
- When the miss rate is very high, the memory system and its
buses become saturated servers. The misses serviced by the saturated
system then incur inordinate queueing delays, and the average miss
penalty increases. This effect is nonlinear, and is discussed later in
this paper.
Out-of-order execution
in modern pipelines
Tomasulo described a "common data-bus algorithm" that permits
out-of-order execution in the floating-point hardware of the
System/360* Model 91 computer [5]. In
the 1960s, floating-point
operations required multiple execution cycles, so allowing a logically
subsequent instruction to bypass an operation that was in progress
helped performance.
The Tomasulo algorithm was extended for use in the S/390* ES/9000*
family of processors to include register renaming and a means for
taking precise interrupts while executing out of order
[6]. The general scheme is described by
Smith and Pleszkun [7], and is now
used in many modern processors.
Two aspects of the general scheme can be compelling under certain
circumstances:
- If the number of architected registers is insufficient for
supporting the maximum number of instructions that can be in progress
at any time (proportional to the product of the number of pipeline
stages and the number of instructions decoded per cycle), register
renaming provides a means to increase the number of physical registers
so as to support this peak rate.
- Register renaming can simplify the control required to take
precise interrupts and/or to initiate retry operations.
However, the out-of-order aspect of the broad scheme is generally
not valuable in modern processors, where the number of execution cycles
needed per instruction is almost always one (exceptions are divide,
square root, and some decimal operations).
Out-of-order execution is not usually valuable because in-order flow is
generally maintained in all of the pipeline except for the execution
phase, which is usually a single cycle. Specifically, the in-order
decoding of instructions is required so that the hardware can make
sense of the logical instruction sequence, and in-order instruction
completion is required so that exceptions are handled properly and
interrupts are taken precisely. Between the decoding and the completion
of an instruction, CISC pipelines access an operand and then execute
the instruction, and RISC pipelines merely execute the instruction. In
general, the ordering of operand accesses is preserved to satisfy
constraints imposed by MP-coherency issues (specifically, fetches are
kept in order, stores are kept in order, and fetches and stores to the
same location are kept in order).
Therefore, the situation in modern processors is that order within the
pipeline is maintained up until execution, it is restored immediately
after execution, and execution takes a single cycle. Enabling
out-of-order execution in this environment is not worthwhile, because
it will seldom happen. If enabling out-of-order execution complicates a
design, it should not be done.
Further, if register renaming is used specifically to enable
out-of-order execution (i.e., if the two circumstances listed above are
not motivating factors), it should certainly not be done. Not only will
this mechanism fail to improve performance, but the renaming steps will
lengthen the pipeline (or will have a cycle-time impact) which will
increase CPI as previously described, thereby actually hurting
performance.
In addition to arithmetic multicycle operations, there are other
classes of multicycle operations in some instruction-set architectures.
Specifically, there can be long moves, and there can be
operating-system assist instructions. In both cases, the opportunity
for them to overlap other operations is very slight. Long moves fully
utilize critical resources (e.g., buses) and prevent much else from
happening while they are in progress. Operating-system assists are
usually serialization events which explicitly preclude overlap.
High-MIPS effects and
workload characterization
The initial velocity of a falling object is determined
predominantly by gravity. As this predominant effect increases the
velocity, second-order effects (which increase with velocity) gain
significance, and work against the effect of gravity. In physics, this
situation is clearly described by a simple differential equation in
which the object reaches a terminal (constant) velocity. The falling
object can go no faster than this.
Similar phenomena affect processor speed, although the effects
cannot be crisply described by differential equations. Collectively
"high-MIPS" effects (HMEs) are effects that tend to slow a
processor as it runs faster. Most of the HMEs arise because the
latency (not the bandwidth) associated with the I/O subsystem does not
scale with processor speed.
In a multiprogramming environment, as the processor speeds up (via
faster circuits, higher ILP, or higher degree of MP), the latency of
the I/O subsystem appears larger from the perspective of an executing
program. If the multiprogramming level is maintained, the added speed
causes the processor utilization to drop (i.e., the added speed is not
utilized). To maintain a high utilization, the multiprogramming level
is increased.
When the multiprogramming level increases, the software queues become
longer, which causes dispatchers and other operating-system modules to
utilize a higher percentage of the system time. Operating systems
compensate for this by using different algorithms, which changes the
basic instruction mix somewhat. Further, the larger number of
coexistent processes generate a higher rate of synchronization events,
and those events must synchronize with more processes. Since the
aggregate "footprint" of the coexistent processes is larger, the
cache miss rate and the TLB miss rate will be larger for each process.
These effects all slow the system down.
Note that as processes (mainstream and I/O) and processors (mainstream
and I/O) interact, they necessarily spend some portion of their time in
wait-loops waiting for asynchronous events to complete (e.g., a
page-in). As the number of processors in a multiprocessor increases, as
those processors run faster, and as the degree of multiprogramming
increases, the portion of time spent waiting increases.
When measuring the real performance of a running system by measuring
instruction throughput (MIPS), it is important to subtract out that
portion of the performance that is spent in wait loops (or other
analogous code). Not only are wait loops nonproductive (and should not
be included in any measure of "work"), but they can actually run
very fast, and can bias a measurement of MIPS in an artificially
favorable way.
5. IPC and other metrics that defy intuition
Thus far, the case has been made for CPI as the performance metric
that most naturally lends itself to intuitive interpretation and simple
estimation techniques. CPI is a dot product involving a set of event
types, where the contribution of each event type is the product of its
frequency and its associated penalty. The frequencies are intrinsic to
workloads, and the penalties can be readily derived from the
microarchitecture.
An experienced designer should know approximate numbers for key events,
key workloads, and basic microarchitectures. Changes to a
microarchitecture rarely affect more than one or two event types, and
that effect should be readily understood by an experienced designer.
Quick estimation using this dot product is simple and remarkably
accurate.
IPC is the inverse of CPI. It defies intuition while hinting at high
performance. Merely intoning "instructions per cycle" creates a
subliminal feeling that massive instruction-level parallelism (ILP) is
being achieved. It is not.
When performance is expressed as IPC, EBusy and EIdle become
intermingled and inseparable. This does not clarify performance; it
shrouds it. The use of IPC has been the largest impediment to the
understanding of pipeline effects, and the largest contributor to the
notion that EIdle defies intuitive quantification. It has been the
largest impediment to simple analyses of superscalar and VLIW processor
designs.
In fact, so many papers quote some theoretical maximum speed in the
absence of any possible accounting for pipeline effects as
"performance," that for several years we assumed that most of the
industry was measuring CPI in the complex-number domain, as shown in
Figure 5.
Figure 5
Complex CPI
As was explained in the previous section on infinite-cache
performance,
CPI = EBusy + EIdle.
On the basis of the numbers quoted in many papers, the
only reasonable conclusion that is possible is that published
performance is just EBusy, which can be interpreted as a projection of
complex CPI onto the real axis,
Re(CPI) = EBusy,
where the angle associated with the projection is
= arccos
(EBusy / EBusy + EIdle).
The corresponding projection onto the imaginary axis is
Im(CPI) = (EBusy + EIdle) sin
( ).
Think of the imaginary component of CPI as that component of
performance many authors would like to imagine they need not deal with.
Complex CPI is
CPI = EBusy + i Im(CPI).
A good indicator of the usefulness of a measurement is the sensitivity of
that measurement to a parameter of interest. Since changes to a
microarchitecture usually target EIdle (i.e., improving EU utilization),
the sensitivity of CPI to EIdle is this indicator. This is
which is a purely imaginary complex trigonometric function that is
messy, hence unintuitive. (It also happens that if EBusy and EIdle are
comparable, this sensitivity is small, so complex CPI is not a useful
measurement anyway.)
Thus, just as for the inverse of CPI (IPC), it is apparent that the
complex form of CPI does not lead to clear intuition. Nonetheless,
computing complex CPI from published performance numbers does yield
another useful metric. In particular,
is a direct measure
of the degree to which the published number is out of phase with reality.
6. Limits of superscalar performance as understood from scalar
components
Thus far, discussion has focused on performance for pipelined
machines that issue a single instruction per cycle. It was argued that
CPI can be partitioned into independent pieces for such a machine, and
those pieces can be easily understood.
Superscalar processors decode and execute multiple instructions per
cycle, and the execution can be done out of order with respect to the
decoding. Understanding the pipelined flow of multiple instructions per
cycle (possibly out of order) is more difficult than understanding the
flow of single instructions per cycle. Similarly, implementing a
superscalar simulator is more difficult than implementing a scalar
simulator.
For some proposed microarchitectures, the tasks of understanding
performance and of simulating performance seem overwhelming. While the
dissection of superscalar performance into its components is not done
in this paper, its key is a solid grasp of the analogous scalar flow.
Frequently, this grasp is sufficient for making very good estimates.
Understanding superscalar performance without understanding the
analogous scalar flow is hopeless.
Every designer and performance analyst who is working on a superscalar
processor without a detailed model of such and without the resources or
time to construct one should construct (at least) the analogous scalar
model. That is, construct a model of the same pipeline, and study that
pipeline running in scalar mode.
The scalar performance components are hard bounds (fixed CPI limits)
for the superscalar components; they give definite indications of what
levels of performance are and are not achievable in superscalar mode.
In particular, the FCE is identical; scalar EBusy is an upper bound;
and scalar EIdle is a lower bound.
Recall that FCE is the product of the miss rate and the miss penalty.
The miss penalty depends on the memory hierarchy, not on the
microarchitecture of the processor. The miss rate (misses per
instruction) is a characteristic of the workload, and is independent of
the rate at which instructions are executed. (Note that prefetching
mechanisms, branch prediction mechanisms, and explicit speculation can
increase the miss rate, but that increase can be accurately estimated
without a detailed model of the processor.) Therefore, it is an
excellent approximation that the FCE in superscalar mode is identical
to the FCE in scalar mode. (This paper does not address
multithreading; suffice it to say that it makes the FCE component
worse, and it should not be used if a primary goal is high
performance.)
EBusy is the inherent work done by the program. This work is identical
in scalar and superscalar modes. The difference is that in superscalar
mode, some of the work is done in parallel. As such, the scalar EBusy
is an upper bound, and the superscalar EBusy should be smaller. If
there are n functional units, it cannot be more than
n times smaller. Estimates that approach EBusy/n
are probably wrong.
EIdle is (primarily) a measure of the functional interlocks that are
intrinsic to the program, e.g., branches, AGIs, and serialization
events. The penalties associated with these intrinsic interlocks depend
on the pipeline, and on the temporal proximities of the interlocks to
the events that resolve them (e.g., AGI).
The interlocks do not disappear in superscalar mode; on the contrary,
their effects are amplified. If superscalar operation is having the
desired effect (i.e., the execution rate is increased), the temporal
proximities of interlocks are shortened with respect to the events that
resolve them. That is, sequences of instructions are compressed into a
smaller temporal "window," and the resolving events have less time
to resolve the interlocks before the interlocks stop pipeline flow.
Therefore, the intrinsic part of scalar EIdle is a lower bound. As the
execution rate increases, EIdle can only become larger.
There are also EIdle effects that arise because of resource conflicts
(e.g., insufficient buffering, buses, or ports). These are called
"bottlenecks." They are (generally) extrinsic to the program, and
should be relatively small. If these effects are manifest in a scalar
design, the design is poor. Design resource should be directed at these
bottlenecks instead of at a superscalar control structure.
If these effects are ignorable in a scalar design, they nonetheless
might be manifest in a superscalar design if buffers and
bandwidth-related resources are not scaled accordingly. Therefore, as
was true of the intrinsic EIdle, extrinsic EIdle in a scalar design (if
any) can only become worse in a superscalar design. Scalar EIdle is a
lower bound.
Of the three scalar bounds, EIdle is the most difficult to extrapolate
to the superscalar realm. As was discussed in the subsection on
infinite-cache performance, the principal components of EIdle have
penalties that are directly proportional to the number of cycles
required for a cache access. For this reason, a designer should know
the coefficient of infinite-cache CPI with respect to this number.
Recall that this is strictly linear, so the increase in CPI per cycle
increase in the cache-access time is constant. This coefficient is
useful in quickly evaluating whether a potential reduction in miss rate
from a larger cache justifies the potential exposure of a longer
cache-access time.
Small fast L0 caches
A caveat in regard to the aforementioned pipeline coefficient is
that it should not be used to extrapolate CPI downward into the realm
of L0 caches. L0 caches cannot be conceptualized in the same manner as
L1 caches because their (rare) benefits cannot be quantified in terms
of hit rate. This is because L0 caches are very small and have hit
rates that are below the "critical mass" required to conceptualize
their steady-state operation.
Specifically, the quanta of pipeline flow are basic blocks which are
sequences of instructions between successive taken branches (or between
successive mispredicted branches). If all data and instruction
references in a basic block are satisfied by the L1 cache, that basic
block "sees" the pipeline flow involving the L1 access path. If L1
misses are rare (relative to the number of instructions in the basic
block), "the pipeline" is the pipeline associated with the L1, and
L1 misses can be treated as isolated events that are accounted for in
the FCE.
The L0 cache allows the basic block to start early (assuming that the
initial access of the block hits in the L0), and therefore to finish
early (assuming that all references in the basic block hit in the L0).
If any access in the basic block misses in the L0 and requires an L1
access, the flow for that block reverts to the L1 pipeline flow, and
the "head start" afforded by the L0 is lost.
Therefore, only those basic blocks that live completely in the L0
benefit from the L0. For typical L0 sizes and miss rates, most
workloads do not have a predominance of basic blocks with this
characteristic. In any case, the probability of losing performance due
to L0 misses is not proportional to the L0 miss rate; instead, it is
closer to a geometric distribution of the miss rate whose degree is
proportional to the average number of accesses per basic block.
The disadvantage of
decoupling the I and E phases
A model of high-performance computing that has become pervasive in
the last several years has the I phase decoupled from the E phase. The
two phases are implemented with engines that are autonomous with
respect to each other, and they operate on a common instruction queue.
In this model, the I engine is directed by an extremely accurate
branch-prediction mechanism, and it is able to run far ahead of the E
engine, fetching instructions at a high rate and dispatching them to
the common instruction queue. The E engine is a superscalar dataflow
machine. The dataflow analyzer chooses instructions that are not
interlocked, and dispatches them out of order (if necessary) to a set
of parallel execution elements at a hypothetically high rate. The cache
and the register file are assumed to be highly multiported so that they
do not bottleneck execution.
Many studies that involve this model ignore the details of the I phase
and assume that it can keep ahead of the E phase. The focus then
shifts to the dataflow analyzer, which is a more tractable problem. If
the constraints imposed by considerations for an MP environment are
loosened (specifically, the ordering of fetches and the completion of
stores), and if the cache miss rate is extremely low, the execution
time is a very impressive (and very unsurprising) number that is
proportional to the depth of the dataflow graph.
This model yields very impressive results that are flawed on several
levels. The first level is a pure "catch-22" that does not require
any understanding of microarchitecture: If the E phase accelerates to
its theoretical speed limit (determined by the dataflow graph), it must
catch up with the I phase. Therefore, there cannot be instructions in
the common queue to dispatch (i.e., if the E phase goes as fast as it
is supposed to, the processor runs out of instructions). The point is
that the I phase is not ignorable.
In real commercial workloads, the frequency of unpredictable branches
prevents the I phase from getting far ahead of the E phase. Given the
best techniques available, a branch will be guessed wrong every 20-30
instructions in real commercial code. (In SPEC**, the numbers are much
better.) Therefore, the theoretical maximum execution rate can only be
maintained for intervals of 20-30 instructions, and each interval is
followed by a wrong-guess recovery action that requires restarting the
I phase. For real hardware to approach the bounds implied by this
model, branch prediction must become much better than it is today
(except in SPEC).
In real commercial code, an L1 cache miss occurs every 10-20
instructions for L1 sizes in the 64KB to 256KB range (in SPEC, the
numbers are much better); thus, maintaining a high execution rate
requires moving cache lines at a rate corresponding to the resulting
miss frequency. This is not realistic at very high execution rates. For
example, if a processor can execute five instructions per cycle, and
there is an L1 miss every ten instructions, the memory hierarchy must
be able to move an entire cache line in two processor cycles for the
processor to sustain this rate. (This is still an unrealistically
optimistic statement, because it assumes that there is no miss latency,
which would require perfect prefetching.)
Another characteristic of real commercial code is that a majority
of conditional branches are resolved by instructions that have an
AGI interlock with an immediately preceding instruction. (An example is
a loop that searches for a record in a linked list.) When this is the
case, it is disadvantageous to separate (temporally) the I and E phases
by adding pipeline stages (the cycles required to put instructions into
a queue, analyze them, and redispatch them). Instead, it is desirable
to couple the phases tightly so that there is minimal latency between
the initial dispatch of a load instruction and its execution outcome.
(This is much less of a constraint in SPEC.)
7. Future trends in cache miss penalty
In high-performance systems of the 1980s, cost was a less crucial
factor than it has become in the 1990s. Symmetric multiprocessor (SMP)
systems of the 1980s typically had private unidirectional
point-to-point buses between processors and the memory system. Those
buses were driven by water-cooled circuits onto low-dielectric
packaging, and they ran at the same speed as the processor. Cache line
sizes were moderate by today's standards.
In the 1990s, processors have become faster (shorter cycle time) and
buses have not kept pace. There are many systems in which the processor
runs at a higher frequency than the bus, and that difference is
becoming larger. The number of processors in SMPs has increased, and to
contain costs, many systems are shared-bus systems. With the growth of
object-oriented programming, miss rates have increased, and with higher
instruction-level parallelism, miss frequencies (in time) are larger
for the same miss rates. Further, it is becoming the fashion to
increase line size.
Each of these trends increases bus utilization, and thus exacerbates
the effects of bus utilization. Bus queueing (which was an ignorable
effect in high-performance systems of the 1980s) will dominate
system-level performance in the next decade if new protocols are not
adopted to mitigate it.
Figure 6 is a conceptual drawing of the
temporal components of a cache miss. The leading-edge component is the
time it takes the memory system to deliver the first datum of a miss.
Since a miss causes the transfer of both the datum that caused the miss
and all other data in the same line, the trailing-edge component
accounts for those cycles that are lost because an entire line was
transferred, i.e., the cycles following the leading edge.
Figure 6
Note that the trailing-edge delay that affects the processor is less
than the total number of cycles that it takes to transfer a line, but
it is directly proportional to that number, and thus directly
proportional to the line size. Roughly speaking, trailing-edge effects
fall into three categories:
- Since spatial locality exists, there is frequently an upstream
reference (immediately following the miss event) to a datum that is in
the same line as the datum that caused the miss. This should not be
counted as a second miss, but the second reference will experience a
delay if the associated datum is still in transit at the time of the
reference event.
- The incoming line consumes bandwidth at the L1 cache and interferes
with the running processor.
- There are finitely many systems at the L1 interface (e.g., only one
system in many processors) for containing the necessary state for
controlling the processing of each miss in progress. That is, there is
a maximum number of misses that can be in progress at any time, and
that number is typically small. If all of these systems are occupied, a
new miss cannot be started until a miss in progress has completed in
its entirety (i.e., until an entire trailing edge is over).
Assessment of the leading-edge penalty is more straightforward.
When a miss is recognized by the L1 cache, there can be queueing at the
address bus in some systems. Once the processor gets control of the
address bus, it transmits an address to the L2 cache (or whatever is
next in the hierarchy). At the receiving end, there could be an ECC
verification cycle, and queueing at the L2. Once the request is
accepted, there is a (fixed) L2 access time, perhaps followed by
another ECC verification cycle. Again, there can be queueing at the
data bus that returns the first datum to the processor, followed by the
transfer of that datum (and perhaps another ECC verification cycle at
the receiving end). The trailing edge follows this.
Therefore, leading edge comprises several fixed delays that can be
estimated directly from the microarchitecture of the cache hierarchy,
and a few chances for queueing. Whenever there is a chance for
queueing, the average delay incurred is a nonlinear function of the
utilization of the subject resource (in this case the address bus, the
L2, and the data bus).
The address bus is used for only one bus cycle every miss, so its
utilization is small. The L2 utilization can be large, but there are
many straightforward techniques for mitigating these effects, e.g.,
interleaving. The utilization of the data bus is strongly related to
the trailing edge. All current system-level design trends tend toward
increasing this utilization, hence its associated queueing delay. As is
described below, queueing delay explodes if utilization is driven past
some general thresholds.
For the sake of illustration, the general trend in queueing effects is
analyzed below, using an open queueing model. While this is not
accurate because the real system is closed, the solution for the open
system is simple, and yields valuable intuition as to the general
trend. (The solution for a closed system is very complex, and does not
yield to intuition.) The notation used is as follows:
Description of bus rate
f =
frequency of the processor in MHz.
f =
frequency of the bus in MHz.
BR = f
/f
=
bus ratio (number of processor cycles per bus cycle).
Description of miss rates
mr = miss rate for a single processor (misses per instruction).
CPI = cycles per instruction for a single processor.
P = number of processors that share the bus.
BMF = (mr × P)/CPI = bus miss frequency
(misses per cycle on the bus).
Description of trailing edge
L = line size (bytes per line).
W = width of the bus (bytes).
P =
L/W = number of "packets" per line.
Service time for a miss
TE = P
×
BR = trailing edge (number of processor cycles to transfer a
line).
Bus utilization
U = BMF × TE = bus utilization (a
probability).
Note that the trailing edge is the service time for a miss
when the bus is the "server" in the queueing model. (Keep in mind
that the trailing edge is distinct from the trailing-edge effect, as
was previously described.)
The open aspect of the model is easily seen in the definition of
utilization above. In particular, since utilization is a probability,
it is physically bounded between zero and one. In the open definition,
it is the product of a rate and a service time which are assumed to be
constant; hence, there is no feedback and no implicit bounding. In the
real closed system, there is feedback to bound the utilization.
Specifically, increasing bus utilization increases the queueing effect,
which increases CPI, which decreases BMF, which keeps the utilization
less than one.
Nevertheless, the queueing delay (Q) is calculated for the
open system below. The calculation is expressed in two forms to
illustrate a point (as explained later). Q is the average
number of cycles that each miss spends waiting to get control of the
data bus. This is part of the leading-edge penalty.
The first form of the equation is written directly from the
following intuition. If a miss requires service, there are i
misses in progress with probability
U , and
the new miss must wait for the trailing edges of all i misses to
complete. On the average, the first trailing edge is half over when the
new miss finds that the bus is busy (hence, the negative term before
the summation to correct for this). There can be as many as
n misses outstanding in the system, so a new miss can have
no more than n - 1 misses ahead of it (which is the
limit of the summation). Most processors today can only have a single
miss outstanding, so in most shared-bus MP systems, n = P.
The first expression for Q shows that the average queueing
delay is a polynomial in U, and the degree of the polynomial
is one less than the number of misses that can be simultaneously
outstanding in the system. In most systems, the degree is P
- 1. The second expression for Q is the same as the
first, but it has been rewritten to show that the first term is
proportional to the square of the trailing edge, hence, the square of
the line size.
If U is kept relatively small,
U
dies out, and the queueing delay is proportional to the square of the
line size. If U is allowed to grow large (say, >0.5),
U does not
die out, and the queueing delay explodes.
For example, consider a four-processor system where each processor runs
at 2 CPI, and each processor can have (at most) one miss outstanding.
Assume that the bus is 32 bytes wide, and that it runs at half the
speed of the processor. Let the line size be 128 bytes. If the miss
rate is 0.02 (i.e., if the intermiss distance is one miss every 50
instructions), the equations above yield a queueing delay of 3.71
cycles per miss. These system-level assumptions are fairly conservative
for most systems that are being projected, and this queueing delay is
quite significant.
Figure 7 is a plot of queueing delay as
a function of trailing edge for various miss frequencies. The dotted
lines show constant utilizations. This shows that if U is
kept reasonably small (e.g., <0.3), the queueing delay is fairly
flat, but if U is large (e.g., >0.5), the delay becomes
exceedingly large.
Figure 7
The example above is shown as point A on the plot, and it
demonstrates that modern systems are on the edge of a nonlinear range
in bus queueing. A new approach to transferring data is needed in the
coming decade. The effect of queueing delay on the base CPI is
In the example above,
(CPI) = 0.074,
which is nearly 4% of the total system performance.
Figure 7 shows that this is headed in a very
dramatic direction as utilization is increased further.
8. Future trends in bus protocols
In the previous section, the various trends in system design that
drive bus utilization were discussed, and the effects of utilization on
queueing delay were shown. To summarize, the trends that drive bus
utilization are the following:
- More processors in a system.
- More processors sharing a bus.
- Faster processors (cycle time) and higher bus ratios.
- Larger line sizes (= larger trailing edge).
- Lower CPI (= higher miss frequency for the same miss rate).
- Higher miss rates (new code with larger working sets).
- Multiple misses outstanding per processor (larger n).
- Speculative execution (= higher miss rate).
- Prefetching (= higher miss rate).
These are all good trends, but they make a previously ignorable
effect significant. It is very simple to propose solutions such as
making the buses faster and/or wider, adding more buses to the system,
or reducing the miss rate. However, these are not real solutions. Much
effort is already spent making buses as fast as possible and as wide as
possible. They cannot be made faster and wider than what is possible.
Miss rates are inherent, and they are getting bigger.
The problem with a highly utilized bus is that when a demand miss
occurs, the exigent datum (the word that the processor needs
immediately) is queued behind nonessential and nonurgent traffic that
is flowing in the system. Those data fall into three categories:
- Trailing edges that follow other exigent data.
- Speculatively fetched data that are not actually needed.
- Prefetched data that are not needed anytime soon.
Current bus protocols treat the transfers of cache lines as atomic
events, and allow exigent data to queue behind nonexigent data.
Since data bus utilization will become large in the next decade, a
family of protocols is needed that allows the transfers of lines to be
broken up so that exigent data can be transferred on demand. This
requires that a small amount of control information be added to a miss
request that identifies the urgency and/or priority of the miss. In
current protocols, limited control information is already sent that
tells the hierarchy the nature of the request (e.g., read, write, fetch
exclusive). The new information requires a few more bits depending on
how exotic the protocol is.
When future memory systems send data back to a set of processors, they
will have to arbitrate each bus cycle. This will allow the trailing
edges of multiple misses to be interleaved. Specifically, future
protocols must allow the first datum of a demand miss to interrupt the
returning traffic (other trailing edges) to get that urgently needed
datum to the requesting processor, to then resume transferring the
interrupted stream, and to merge the trailing edge of the new stream
into the existing stream.
In future systems, misses will not be atomic transactions. Instead,
they will be divisible strings of contiguous data. The returning
control bus must be active every cycle to handle this; i.e., since a
miss will not be returned as an atomic transaction, control data are
required to identify the data on the bus for every cycle in which there
are data on the bus. (Minimally, it must indicate what processor the
data are for, and the miss ID number, as it currently must for each
atomic transaction.)
This will be required in shared-bus systems, and will obviate
mainstream point-to-point systems because it enables point-to-point bus
performance (electrical considerations aside) in a shared-bus
environment by minimizing queueing delay even at very high bus
utilizations.
Line buffers at both the memory and the processor ends of a bus will be
required to time-multiplex the bus among multiple trailing edges. At
the memory end, the buffer must hold as many lines as there can be
misses outstanding in the system. At the processor end, the buffer must
hold as many incoming lines as the processor can have misses
outstanding.
In current bus protocols, when a processor issues a miss request, it
sends the following three pieces of information to the memory
hierarchy:
- Address (real) of the datum that is to be returned first
(and thereby, explicitly, the address of the line that contains the
datum).
- A miss ID, which (on a shared-bus system) comprises two
parts:
- The ID of the processor that is issuing the miss.
- The ID of the miss (in case the processor can have more than one
miss outstanding).
- The miss type, which tells the memory what status is needed
for the line (e.g., shared, exclusive).
When existing memory systems return data, they send back the
second field above with the data. In this way, the processors on the
bus can tell what the data are (i.e., who they are for, and the miss to
which they correspond).
Existing protocols treat each miss as an atomic transaction, and they
lock up the bus for the duration of the associated line transfer. For
these protocols, the second field (above) need not be sent back on
every bus cycle; it is just sent at the beginning of each transaction.
In future bus protocols, each bus cycle will be used for a unique
transaction, and the second field (above) will be sent on every cycle
to identify the data for that cycle.
In future protocols, there could be as many as three additional control
fields sent with each miss request. All three together would
provide a much richer set of protocols than is actually needed.
Nonetheless, all fields are listed below to provide a complete
specification for future protocols. Each field is optional, and each
can be arbitrarily simple or complex. The new fields are the following:
- A priority level is used to distinguish as many of the
following types of misses as desired:
- Demand data fetch (exigent).
- Demand instruction fetch (exigent).
- Demand data fetch down conditional path (conditionally exigent).
- Demand instruction fetch down conditional path (conditionally
exigent).
- Data fetch down speculative path (speculatively exigent).
- Instruction fetch down speculative path (speculatively exigent).
- Data prefetch initiated by prefetch mechanism (prefetch).
- Instruction prefetch initiated by prefetch mechanism (prefetch).
- Multiplex control specifies how this data stream is to be
multiplexed with respect to the other active streams. Assuming that
Field 1 above is not used, Field 2 might specify one of the following
actions:
- Put the entire stream behind the currently active traffic.
(This is the current default; i.e., this is what current protocols do.)
- Suspend all other active traffic for one bus cycle to return the
first datum from this miss immediately, then resume the suspended
traffic, and put the remainder of this stream at the end of the
queue.
- Suspend all other active traffic to return this entire stream,
then resume the suspended traffic.
- Suspend all other active traffic for one bus cycle to return the
first datum from this miss immediately, then interleave the packets of
this stream with the ongoing traffic.
- Return the packets from this stream only if there is no other
traffic present.
- Other things deemed useful.
If Field 1 above is used, Field 2 would specify the
same control functions listed above, except that it would enforce them
with respect to priority levels, e.g.:
- Suspend traffic of equal or lower priority for one
cycle to return the first packet of this stream, then queue the
remainder of this stream behind traffic of higher or equal priority.
- Suspend all traffic for one cycle to return the first packet of
this stream, then queue the remainder of this stream behind traffic of
higher or equal priority.
- A command field rescinds or changes instructions as to how
previously issued misses are to be handled. This allows for the
changing of characteristics of misses that have not yet begun, or of
misses that are in progress. For example, Field 3 could specify the
following:
- Change the type of request of an outstanding miss. This allows
the upgrade of a read-only request to an exclusive request, or the
downgrade of an exclusive request.
- Change Field 1 on an outstanding miss. This allows the upgrade of
the priority of a miss, which would be appropriate if a speculative
prefetch became a known demand miss.
- Rescind miss request (i.e., cancel a previous request). This
would be useful if it were found that a prefetch that had previously
been issued was down a path that is now known not taken.
- Reset starting address. This is useful if a previously initiated
prefetch for a line was done without specific knowledge of the address
of the datum that would be needed first, and the address of that datum
is now known.
- Add a new starting address. This is useful if an upstream
reference to an incoming line is discovered. It is different from the
previous command because it does not change the first doubleword
address; it just says that instead of returning the line sequentially,
the line should be returned starting with the first two explicitly
named (nonsequential) words.
- Other things deemed useful.
This new family of protocols will emerge in the coming decade. It
enables high performance on a shared bus, and it allows the bus to be
run at very high utilization with minimal queueing delays.
There is currently a resurgence of interest in sectored caches for some
of the reasons outlined above. In a future generation of processors,
sectoring will re-emerge as an intermediate step in the evolution
toward these protocols.
9. Power and microarchitecture for high-frequency design
In the description of the IAS machine, the power budget was shown
to be partly responsible for a basic aspect of many modern ISAs known
as the von Neumann bottleneck. In fact, power will play an important
role in shaping microarchitecture in the coming decade.
A popular viewpoint is that power is an unfortunate constraint which
causes a divergence in the microarchitectures that target the server
and client spaces. The reasoning is that the client requires low power,
and thus a simple core, and the server requires a high degree of ILP,
hence high power.
Instead, consider power to be a metric that is useful for guiding
microarchitectural development into the highest performance realm for
the server. The client and server spaces will converge to the same
microarchitectural core with the same physical floorplans, albeit with
different circuit designs.
Specifically, there has been confusion in the last decade because there
is no general methodology for assessing the impact of a
microarchitectural feature on cycle time. Most proposed changes to a
microarchitecture target ILP. It is known that adding
microarchitectural features to a processor makes it bigger, and that
making a processor bigger probably does not improve its cycle time, but
the extent to which an added microarchitectural feature hurts cycle
time is generally not known.
It is human nature to give credit to a design where there is a tangible
benefit (quantifiable CPI reduction), and to discount the unknown. That
is, if the cycle-time impact of an added feature might be negligible,
we tend to assume that it is negligible; if the CPI benefit is known,
adding a feature to a machine to achieve higher ILP is perceived to be
desirable on balance.
This is a dangerous trend, particularly in CMOS, because CMOS is a
wiring-driven technology. In CMOS, the area of a machine is (almost)
unrelated to the number of gates in the machine, and is more strongly
dependent on the number of wires that must be run to connect them.
Things that drive ILP tend to drive wiring-track usage in an
exponential manner. When parallel execution units, parallel buses, and
multiple ports are added to parts of a machine, wiring (hence area)
increases dramatically.
While it is true that planarization technology allows for more levels
of wiring, more than a few levels are useless for increasing
interconnectivity. Very simply, once the lower levels of metal are
heavily congested, vias cannot be dropped through them, and the upper
levels of metal cannot be connected to the silicon surface. Sai-Halasz
advocates using upper levels of metal for "fat wires" which have
lower resistance, hence higher speed for signals that must travel more
than a few millimeters [8].
Figures 8(a) and
8(b) respectively
show the cycle time and the CPI as a function of the degree of ILP in a
superscalar processor. In this case, CPI decreases to an asymptote, and
cycle time increases linearly as more parallelism is added to the
processor. This is the dual of
Figure 4 (the superpipelining case), so
MIPS as shown in Figure 8(c) has the same
form as in Figure 4(c). The lesson is that there
is an optimal level of ILP, and the level appears to be small (say 2 to
4, or when pressed for an exact number,
), mostly because of the
sensitivity of cycle time to adding wires to the machine.
Figure 8
While the coefficient of the line in Figure 8(b)
is very difficult to
assess, and is highly debatable, power can be the metric that provides
real guidance in achieving high performance. When hardware is added to
a machine, the power impact is readily tangible. If the goal of a
microarchitecture is low power and, more specifically, low work (the
product of power and time as it affects battery life), only those
features that pervasively provide low CPI are included. Features that
only help CPI sometimes (and that hurt cycle time all of the time) are
eliminated if low power is a goal. Those elements should also be
eliminated if high performance is a goal.
As the industry pushes processor design into the GHz range and beyond,
there will be a resurgence of the RISC approach. While superscalar
design is very fashionable, it remains so largely because its impact on
cycle time is not well understood. Complex superscalar design stands in
the path of the highest performance; he who achieves the highest MHz
runs the fastest.
A clear focus on power yields a clear focus on high performance. This
trend will make the microarchitectures of the client and the server
converge. In the client processor, the circuit design will be optimized
for low power. Since CMOS is wiring-driven, adding active silicon area
via resizing devices can usually be done with little impact to the
physical floorplan. The high-performance server can then be derived
directly from the client core by resizing devices to optimize for
speed.
Complex CPI, relativity,
and adiabatics
In a previous subsection on complex CPI, the discussion was
limited to values of CPI in the first quadrant only. Now that the
discussion has turned to power, the other quadrants in
Figure 5 take on
significant physical interest. In particular, points in quadrants II
and III have negative real components. The only reasonable
interpretation of such points is that they represent the performance of
a processor that is running the program backward.
One possible interpretation of quadrants III and IV, which have
negative complex components, is that they represent a new paradigm in
circuit performance; in particular, they represent processors that run
faster than the speed of light. According to simple relativistic
theory, when the machine runs faster than light, time moves backward
relative to our inertial frame of reference. According to this theory,
quadrants I and III are indistinguishable, since quadrant III has the
computation being run in reverse while time moves backward. As such,
quadrant III is uninteresting.
of adiabatic computing. A processor that can run adiabatically
in quadrant II acts as a power source, hence a perpetual motion
machine. In quadrant IV, if a machine enters an adiabatic realm, it
becomes a black hole. If this happens, it will change the world as we
know it.
10. Conclusion
In this paper, several points were made that are antithetical to
some of the modern philosophy in processor microarchitecture. These
points are based on simple observations relating to the machinations of
electronic von Neumann computers, which have been in existence since
the onset of this industry.
First, the most popular performance metric, IPC (instructions per
cycle), is the reciprocal of the metric that should be used, CPI
(cycles per instruction). This is primarily because CPI is a simple dot
product of a few numbers that any experienced designer should have at
his fingertips. It is intuitive, and it makes for remarkably quick and
remarkably accurate estimates.
On the other hand, IPC does not yield to intuition. Instead, it shrouds
fundamental issues in mystery, and it has much of the industry (and
academia) running down blind corridors in a state of general confusion.
Second, the separability of CPI into three independent components was
demonstrated. The three components account for the intrinsic work done
by the computer, the pipeline structure of the computer, and the memory
hierarchy. It was argued that a solid grasp of each of these three
components is necessary in understanding the performance of a
superscalar processor, because the scalar components are hard bounds
for the analogous superscalar components. Essentially, the argument is
that one must have a grasp of the simple case before one can hope to
understand the general case.
Third, attention was focused on a trend in future systems in which data
bus utilizations cross a threshold that will make queueing at the
memory bus a limitation of system performance. A new family of bus
protocols that can mitigate this effect was proposed. These protocols
will emerge in the coming decade because of the impending delays due to
queueing.
Finally, an argument was made that power consumption will drive the
development of microarchitecture in the coming decade, and that the
aspects of a microarchitecture that result in low power also result in
high performance. This is particularly true in CMOS, which is a
wiring-driven technology. This trend will cause the client
microarchitecture and the server microarchitecture to converge.
*Trademark or registered trademark of International Business
Machines Corporation.
**Trademark or registered trademark of Standard Performance Evaluation
Corporation.
¹Related in a conversation with J. Pomerene,
engineer on the IAS project.
References
Received August 8, 1996; accepted for publication February 19, 1997
|