Introduction
As power becomes an increasingly important constraint, circuit power implications must be included in order to evaluate correctly the impact of architectural changes. In previous work [1–8], this has been attempted with broad metrics combining global power and performance. For example, maximizing MIPS^n/watt and minimizing power × delay^n are expressed as goals. Arguments are made that n = 0 (power per operation) and n = 1 (energy per operation) are inadequate for evaluating tradeoffs, and n = 2 (energy–delay product) is commonly used. Attention to supply-voltage scaling [8] gives n = 3 as more appropriate in some domains. Reference [8] also provides a good overview and a more refined version of the metric.
Some issues are common to these approaches. First, the exponent is global and typically integer, while the power–performance tradeoffs at the circuit level are generally local and continuous. Second, correct evaluation of the terms is often difficult because important side effects are easily neglected. For example, using MIPS^n/watt, it is straightforward to estimate changes in instructions executed per cycle (IPC); however, MIPS and watts also include changes in frequency that result from added logic (significant and often neglected), or operation near the power limit. Changes with marginal IPC improvement are much more likely to be accepted when such circuit power implications are overlooked.
With these issues in mind, we propose a metric of hardware intensity η, useful for evaluating issues which affect both circuits and microarchitecture. In the first section, we define η and other related parameters and illustrate how they can be measured with actual design data. We next derive relations between η and supply voltage v for a power-efficient design under progressively more general conditions. Specifically, we examine a single pipeline stage, multiple independent stages, and sequences within a stage. These relations allow the metric to overcome the first issue above. Hardware intensity is continuous, and can be used in both global and local contexts as appropriate. No assumptions concerning technology voltage scaling are required. All introduced parameters have clear physical meanings and a method for measuring them.
After this we show how hardware intensity can be incorporated into an existing architectural metric. Some of the terms can be straightforwardly measured or estimated. For others we derive equations relating them to expressions involving hardware intensity and calculable terms. We show how, under simplifying assumptions, the more common metrics can be derived. These equations explicitly capture the effects which are often neglected in the second issue above. We conclude with a brief summary and a list of possible extensions of the ideas.
Hardware intensity
In the design of pipelined processors, the hardware in each stage is optimized by restructuring logic and tuning transistor sizes to meet the cycle requirement. The tighter the delay budget, the greater the parallelism required at the gate level and the larger the transistor sizes needed, which leads to higher power. To quantify these speed–power tradeoffs, we introduce a notion of hardware intensity, and a variable η associated with it. We define the physical meaning of η as a parameter in the cost function for optimizing hardware:
$$F_{cost}(E, D) = (E/E_0)\,(D/D_0)^{\eta}, \qquad \eta \ge 0, \tag{1}$$
where D is the critical path delay through the circuit, E is the average energy dissipated per cycle, and D0 and E0 are the corresponding lower bounds that can be achieved through tuning and logic restructuring for a fixed supply voltage. Many types of functions can be used as a cost function. The particular form (1) was chosen because of the property
$$\frac{dF_{cost}}{F_{cost}} = \frac{dE}{E} + \eta\,\frac{dD}{D}, \tag{2}$$
which makes it useful as a common language in circuit and architectural communities, as is apparent in the following sections. Cost functions of form (1) have been used in previous work [3, 4, 6, 7, 9, 10] with fixed or variable η to optimize or compare hardware implementations in the power–performance space. In this paper we relate η to the power-supply voltage in energy-efficient designs and link it to the architectural energy efficiency criterion derived in [11].
A notion of the energy-efficient family was introduced in [7], [9], and [10] as a set of implementations of a given hardware function, each of which results in the highest performance among all possible configurations dissipating the same power. If plotted in the energy-versus-delay coordinates, the energy-efficient configurations form a convex hull of all possible implementations of a given hardware function.
Under a very general assumption that the curvature of the energy–delay curve is higher than the curvature of the contour of the cost function (1) at any point at which the two touch, we can show that for any power-supply voltage v, every point on the energy–delay curve corresponds to a certain value of the hardware intensity η, 0 ≤ η < +∞. Then, the energy-efficient curve in the energy-versus-delay coordinates can be viewed as a parameterized curve: D = D(η, v), E = E(η, v).
Figure 1 gives a graphical interpretation of the hardware intensity. The solid line plots a typical energy-efficient curve for some hardware function. Dotted curves show several contours of the cost function (1) for two values of the hardware intensity η. The point (D, E) at which the energy-efficient curve is tangent to the lowest of the contours [Fcost(E, D) = A with the smallest value of A] corresponds to the energy-efficient implementation for this value of the hardware intensity η. Using (2), the tangent to the energy-efficient curve at this point can be expressed as
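$$\left.\frac{\partial E}{\partial D}\right|_v = -\eta\,\frac{E}{D}. \tag{3}$$
Then, we have the following property for the hardware intensity:
$$\eta = -\left.\frac{D}{E}\,\frac{\partial E}{\partial D}\right|_v, \tag{4}$$
or
$$\eta = -\left.\frac{\partial \ln E}{\partial \ln D}\right|_v. \tag{5}$$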
Figure 1
Thus, the hardware intensity is the ratio of the relative increase in energy to the corresponding relative gain in performance achievable locally through logic restructuring and tuning at a fixed power-supply voltage for a power-efficient design. Simply put, it is the value of % energy per % performance for an energy-efficient design:
$$\eta = \frac{\%\ \text{energy}}{\%\ \text{performance}}. \tag{6}$$
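The definition above suggests a direct way to estimate η from tuner output: take the negative slope of the energy-efficient curve in log-log coordinates. The following is a minimal sketch under that reading; the function name `hardware_intensity` and the sample data are hypothetical and are not taken from the measurements in this paper.

```python
import math

def hardware_intensity(points):
    """Estimate eta at the interior points of an energy-efficient curve.

    points: (delay, energy) pairs measured at one supply voltage,
            sorted by increasing delay.
    Returns (delay, eta) pairs, where eta is the negative central-difference
    slope of ln(E) versus ln(D).
    """
    estimates = []
    for k in range(1, len(points) - 1):
        d_lo, e_lo = points[k - 1]
        d_mid, _ = points[k]
        d_hi, e_hi = points[k + 1]
        slope = (math.log(e_hi) - math.log(e_lo)) / (math.log(d_hi) - math.log(d_lo))
        estimates.append((d_mid, -slope))
    return estimates

# Hypothetical curve with E * D**2 = const, so eta should come out as 2.
curve = [(d, 100.0 / d ** 2) for d in (1.0, 1.1, 1.2, 1.3, 1.4)]
etas = hardware_intensity(curve)
```

With real tuner data the points are noisy, so a local fit over several neighbors may be preferable to a two-point difference.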
Figure 2 shows, on a logarithmic scale, energy-efficient curves for two tuned adders, a vector reduction unit, a latch, and several ASIC cells, all implemented in a 0.13-µm technology (some in bulk, others in SOI). The energy-efficient curve for the latch was obtained by tuning several latches with a dynamic transistor-level Spice-based circuit tuner, run with different cost functions. The tuned points for all simulated latches were combined into a common energy-efficient family, as described in [10]. For ASIC cells, different power levels (from A to I) were used as points on the energy-efficient family, assuming that every ASIC cell is optimally tuned. Energy and delay values for the cells were looked up for various power levels directly from the design databook for the assumed load capacitances. The adder curves were obtained using a formal static tuner, EinsTuner [12], for a variety of targets for the total device width. The curve for the vector reduction unit was obtained using multiple ASIC synthesis runs for different frequency targets. The IBM BooleDozer* synthesis tool was used.
Figure 2
An interesting observation is that energy-efficient curves for widely different hardware functions, obtained using different methods, are remarkably similar. A recent theoretical work [7] predicts the dependence E = E(D) as (E − E0)(D − D0) = E0D0, plotted as a dashed curve. Our results in Figure 2 show a substantial deviation from this prediction even for simple gates. However, the expression above can be modified to fit the experimental data as follows: (E − E0)(D − D0) = γE0D0, where 0 < γ < 1.
To explain this form of the dependence, let us rewrite the expression D = D0 + RCld, used for calculating delays of the ASIC cell, as follows: (D − D0)/D0 = γCld/Ccell, where Ccell is the sum of the cell input and internal capacitances, Cld is the load capacitance, and γ = RCcell/D0. The value of γ is approximately constant for every type of cell across a range of power levels, because the output resistance R of a cell is roughly inversely proportional to the sizes of transistors used in the cell, and thus inversely proportional to Ccell, R ~ 1/Ccell. For standard cells in a 0.13-µm technology, the value of γ is in the range from 0.2 to 0.4, depending on the cell type. The expression for energy can be roughly approximated as proportional to the sum of the cell capacitance and the load capacitance, E ~ (Ccell + Cld). If Ccell << Cld for the minimum-size cell, the expression can be further approximated as (E − E0)/E0 = Ccell/Cld, where E0 is the energy dissipated by the minimum-size cell. Multiplying the expressions for energy and delay, we arrive at (E − E0)(D − D0) = γE0D0. The dashed curve in Figure 2 that corresponds to γ = 0.2 is in much better agreement with the experimental results.
The formula for the energy–delay curve can also be derived using the logical effort delay model [13] as follows: D = τ(gh + P), where τ is the intrinsic delay of an inverter, g is the logical effort of the gate, h is the ratio of load capacitance to input capacitance, h = (Cout/Cin), and P is the delay of the gate driving zero load, τP = D0. Then (D − D0)/D0 = gh/P.
Approximating energy for a fixed output load as
hence
The range of values of γ (0.2 < γ < 0.4) that we measured is consistent with data reported in [13] for a 0.18-µm technology.
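To make the fitted curve concrete, the sketch below evaluates the model (E − E0)(D − D0) = γE0D0 and the hardware intensity it implies, η = −(D/E)(dE/dD). It is only an illustration of the analytical form; the normalized defaults (D0 = E0 = 1, γ = 0.2) and the function names are assumptions chosen for the example.

```python
def energy_on_curve(d, d0=1.0, e0=1.0, gamma=0.2):
    """Energy on the model curve (E - E0)(D - D0) = gamma * E0 * D0."""
    if d <= d0:
        raise ValueError("delay must exceed the lower bound D0")
    return e0 * (1.0 + gamma * d0 / (d - d0))

def eta_on_curve(d, d0=1.0, e0=1.0, gamma=0.2):
    """Hardware intensity eta = -(D/E) dE/dD evaluated on the model curve."""
    e = energy_on_curve(d, d0, e0, gamma)
    de_dd = -e0 * gamma * d0 / (d - d0) ** 2  # analytic derivative of E(D)
    return -(d / e) * de_dd

# Tighter delay targets (D closer to D0) correspond to higher eta.
eta_tight = eta_on_curve(1.2)
eta_relaxed = eta_on_curve(2.0)
```

In this model η grows without bound as D approaches D0 and falls toward zero for relaxed delay targets, which matches the qualitative picture of Figure 1.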
Through the remainder of the work, we assume that all implementations of any hardware belong to the energy-efficient family; however, none of the results depend on any analytical formula for the shape of the energy–delay curves.
Voltage intensity
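For the energy-efficiency analysis that follows, it is useful to introduce the dimensionless derivatives of the delay and energy with respect to the power-supply voltage, and their ratio, referred to as voltage intensity θ:
$$D_v = -\left.\frac{v}{D}\,\frac{\partial D}{\partial v}\right|_{\eta}, \qquad E_v = \left.\frac{v}{E}\,\frac{\partial E}{\partial v}\right|_{\eta}, \qquad \theta = \frac{E_v}{D_v}. \tag{7}$$
Thus, the voltage intensity θ is the ratio of the relative increase in energy to the corresponding relative reduction in delay achievable locally through varying the power supply at a fixed hardware intensity:
$$\theta = -\left.\frac{\partial E / E}{\partial D / D}\right|_{\eta}.$$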
Theoretical formulas could be used to predict Dv, Ev, and θ as functions of v. Alternatively, a more practical way to calculate the values of these coefficients is to simulate representative circuits over a range of v.
For a fixed logic style and a fixed technology we observed a close resemblance between the dependencies Ev(v) and Dv(v) for different functional units, and for hardware blocks optimized for different values of hardware intensity η.
As an illustration we plotted in Figure 3 simulation results for a chain of XOR gates and a 32-bit adder implemented in a 0.13-µm technology, tuned for several values of η. For the energy analysis, PowerMill** was used with random patterns at the inputs with a switching factor of 0.3, run for 200 cycles. The PathMill** static timer was used for delay analysis.
Figure 3
For all of the blocks, the value of Ev is higher than the value of 2 that corresponds to the E = CV^2 dependence. This superquadratic dependence of energy on the supply voltage is explained by short-circuit power that grows faster than the square of v [14], and by the higher glitching activity in large blocks of logic at higher supply voltages that we observed in our experiments. Although curves for different circuits in Figure 3 are very close to one another, we observed higher variation for hardware blocks designed in different circuit styles or using different design flows [11]. According to the experimental results in Figure 3, the voltage intensity θ grows almost linearly with the power supply v.
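The simulation-based route to Dv, Ev, and θ can be sketched as a finite-difference calculation over a voltage sweep. This is a minimal illustration, assuming the sweep is taken at a fixed hardware intensity; the function name and the power-law sample data (D ~ 1/v, E ~ v^2.2) are hypothetical and do not reproduce the data of Figure 3.

```python
import math

def voltage_coefficients(sweep):
    """Estimate D_v, E_v and theta = E_v / D_v from a supply-voltage sweep.

    sweep: (v, delay, energy) triples for one circuit at fixed tuning,
           sorted by increasing v.
    Returns (v, D_v, E_v, theta) for each interior sweep point, using
    central differences of ln(D) and ln(E) versus ln(v).
    """
    results = []
    for k in range(1, len(sweep) - 1):
        v_lo, d_lo, e_lo = sweep[k - 1]
        v_mid = sweep[k][0]
        v_hi, d_hi, e_hi = sweep[k + 1]
        dlnv = math.log(v_hi) - math.log(v_lo)
        d_v = -(math.log(d_hi) - math.log(d_lo)) / dlnv  # delay falls as v rises
        e_v = (math.log(e_hi) - math.log(e_lo)) / dlnv
        results.append((v_mid, d_v, e_v, e_v / d_v))
    return results

# Hypothetical sweep: delay ~ 1/v and a slightly superquadratic energy ~ v**2.2.
sweep = [(v, 1.0 / v, v ** 2.2) for v in (1.0, 1.1, 1.2, 1.3)]
coefficients = voltage_coefficients(sweep)
```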
Balance between hardware intensity and voltage intensity
Typically, the cycle-time requirement can be met at different combinations of hardware intensity η and power supply v. In this section we derive a condition for the optimal balance between v and η, such that for a given critical path delay requirement D = Dr, the energy reaches its minimum over the two-dimensional space (η, v). We derive optimality relations for progressively more general assumptions about the pipeline, starting with a single-stage assumption and ending with a general case of a multistage nonuniform pipeline. We also show how to abstract an aggregate hardware intensity ηag for nonuniformly optimized pipelines to be used in the microarchitecture-level power optimization that follows.
Single pipeline stage
Consider an “ideal” system in which the hardware is evenly distributed among multiple identical stages, which means that the same value of the hardware intensity η applies to all stages. By solving the problem of minimizing the energy as a function of two variables η and v, E(η, v), subject to the constant delay constraint D(η, v) = Dr, we arrive at
$$\frac{\partial E / \partial \eta}{\partial D / \partial \eta} = \frac{\partial E / \partial v}{\partial D / \partial v}. \tag{8}$$
Using (4) and the definition for the voltage intensity in (7), we arrive at
$$\eta = \frac{E_v}{D_v} = \theta. \tag{9}$$
Hardware intensity must equal voltage intensity. This formula can be interpreted as follows: For an optimal balance between the power-supply voltage and the hardware intensity, the relative gain in performance achieved at the cost of a given relative increase in energy due to an increment in the supply voltage must equal the relative gain in performance achieved at the cost of a given relative increase in energy due to an increase in the hardware intensity.
With the help of (9), an optimal value for η can be determined for every value of v. For example, if Dv = 1 and Ev = 2 for a given power-supply voltage and technology, then, according to (9), for the optimal balance the hardware intensity must be set to η = 2, so that a 1% gain in the critical path delay, achieved by retuning the circuit, costs a 2% increase in energy.
Relation (9) disproves the common misconception that the lowest power can be achieved by building the fastest circuit and then reducing the power supply to the lowest value for which the clocking rate requirement is still satisfied. For example, if Dv = 1 and Ev = 2 (v = 1.6 V), and the circuit is optimized for η = 4 instead of η = 2, the balance between power supply and hardware intensity is not optimal. It is easy to calculate for the circuit in Figure 1 that by retuning the circuit for η = 2 and increasing the power supply appropriately for an unchanged performance, power reduction close to 10% will be achieved.
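The balance condition can be checked numerically on a toy model. The sketch below is an illustration under stated assumptions, not the circuit of Figure 1: the hardware curve is taken as (E − E0)(D − D0) = γE0D0 with γ = 0.2 in normalized units, and the voltage dependence is assumed to be a pure power law with constant Ev = 2 and Dv = 1. A grid search for the minimum energy at a fixed delay target should then land where η equals Ev/Dv.

```python
GAMMA = 0.2   # assumed curve-shape parameter
E_V = 2.0     # assumed constant relative energy sensitivity to voltage
D_V = 1.0     # assumed constant relative delay sensitivity to voltage

def hw_delay(x):
    """Normalized delay D/D0 at the reference voltage; x = (D - D0)/D0."""
    return 1.0 + x

def hw_energy(x):
    """Normalized energy E/E0 on the curve (E - E0)(D - D0) = GAMMA*E0*D0."""
    return 1.0 + GAMMA / x

def eta(x):
    """Hardware intensity -(D/E) dE/dD on the model curve."""
    return GAMMA * (1.0 + x) / (x * (x + GAMMA))

def energy_at_fixed_delay(x, d_req=1.5):
    """Energy after rescaling the supply so that the delay equals d_req."""
    v = (hw_delay(x) / d_req) ** (1.0 / D_V)   # from D = hw_delay * v**(-D_V)
    return hw_energy(x) * v ** E_V

# Grid search over tuning points for the minimum-energy design.
xs = [0.01 + k * 1e-4 for k in range(50000)]
x_opt = min(xs, key=energy_at_fixed_delay)
eta_opt = eta(x_opt)

# Tuning point where eta = 4 in this model (root of 4x^2 + 0.6x - 0.2 = 0).
x_eta4 = (-0.6 + (0.36 + 3.2) ** 0.5) / 8.0
saving = 1.0 - energy_at_fixed_delay(x_opt) / energy_at_fixed_delay(x_eta4)
```

In this model the search settles at η close to 2, and `saving` is positive, meaning the design tuned for η = 4 and run at a lower voltage dissipates more energy at the same delay than the balanced one.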
Multistage pipeline
The simple case of an isolated hardware macro considered above can be applied only to an ideally uniform pipeline. In real designs, different stages of the pipeline usually have different amounts of complexity, and it would be incorrect to tune all of them for the same value of hardware intensity. In this subsection we derive an optimality criterion for nonuniform pipelines.
Assume that there are N stages in a pipeline which are different in the amount of logic and time slack available. Each stage consists of a single block of logic followed by a latch, both tuned for one value of energy weight wi and hardware intensity ηi as shown in Figure 4. Then, to achieve the optimum in the power–performance characteristics of the whole pipeline, the values of hardware intensity for different stages may be different. There are N + 1 independent variables corresponding to the hardware intensities in the N pipeline stages, η1, …, ηN, and a single power supply, v.
Figure 4
Since all stages are optimized for the same clocking rate, D1 = D2 = … = DN. Then, the problem is reduced to minimizing the function
$$E = \sum_{i=1}^{N} E_i(\eta_i, v) \tag{10}$$
subject to N constraints
$$D_i(\eta_i, v) = D, \qquad i = 1, \ldots, N. \tag{11}$$
Solving the optimization problem and taking advantage of the earlier discussed property that Ev and Dv for all stages of the pipeline are equal, we arrive at
$$\sum_{i=1}^{N} w_i \eta_i = \frac{E_v}{D_v}, \tag{12}$$
where wi = (Ei /E) are the energy weights of the pipeline stages, ∑i wi = 1. In the presence of clock gating, the weights of those pipeline stages that are not activated every cycle are scaled down by the corresponding activity factors.
The optimality criterion (12) together with the cycle time requirement conditions (11) allows us to derive the optimal values for the hardware intensity at different stages of the pipeline as functions of the supply voltage. It can also be used to calculate the optimal value for the power-supply voltage, after a preliminary version of the pipeline is designed, by summing (with energy weights) the values of hardware intensities that were needed to meet the clock cycle target for every pipeline stage. If (12) is not satisfied, this indicates that power can be reduced without performance loss by changing voltage and retuning circuits. This information can then be used as feedback to reevaluate the choice of the power-supply voltage and the clock-cycle target, and possibly the partitioning of the pipeline into stages.
As an example application of this second relation, consider the system illustrated in Figure 5(a), consisting of two pipelines of one and three stages. Suppose that there is a large architectural cost to lengthening the first pipeline, but that reducing the second pipeline to two stages would have only a negligible impact on the architectural performance. The indicated allocations (again for Ev = 2, Dv = 1) show how an overall target of ηag = 2 is obtained by using high-hardware-intensity (η = 5) circuitry in the architecturally important cycle and balancing that with lower η = 1.25 in the less critical pipe. Note that we neglect the important effect of changing latch count in this example.
A second example [Figure 5(b)] shows how the movement of logic between adjacent cycles can also be used to facilitate power efficiency. Again the overall target is η = 2, but the initial partitioning has η = 3 in both cycles. However, perhaps because of large differences in the sizes of logic cones between the cycles, most of the power is burned in the second stage. By moving a portion of the second-stage logic to the first cycle (again ignoring any changes in latch count), and by resizing the two cycles, the η values are made more unbalanced but the overall weighted aggregate is reduced to the target value.
Figure 5
For the higher-level microarchitectural analysis of energy–performance tradeoffs, it is useful to abstract a single aggregate quantity for hardware intensity ηag that represents the whole pipeline, such that
$$\eta_{ag} = -\left.\frac{D}{E}\,\frac{\partial E}{\partial D}\right|_v,$$
where D is the clock period of the pipeline and E is the average energy dissipated per cycle, E = ∑Ei. To derive an expression for ηag, notice that increasing the clock cycle time by dD through retuning the circuits in all stages of the pipeline changes the total energy of the pipeline by
$$dE = \sum_i dE_i = -\sum_i \eta_i E_i\,\frac{dD_i}{D_i},$$
where the summation is performed over all stages of the pipeline. Since (11) is satisfied,
$$dE = -\frac{dD}{D}\sum_i \eta_i E_i,$$
which means that the aggregate hardware intensity for a multistage pipeline is expressed through the hardware intensities of individual stages ηi as
$$\eta_{ag} = \sum_i w_i \eta_i. \tag{13}$$
Then (12) is identical to (9), with η = ηag.
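The energy-weighted sum of stage intensities and its comparison with Ev/Dv can be sketched as follows. The stage energies in the example are an assumption chosen so that a one-stage pipe at η = 5 and a three-stage pipe at η = 1.25 aggregate to 2; they are not read from Figure 5, and the function names are hypothetical.

```python
def aggregate_intensity(stage_energies, stage_etas):
    """Energy-weighted aggregate hardware intensity, sum_i w_i * eta_i."""
    total = sum(stage_energies)
    return sum(e / total * eta for e, eta in zip(stage_energies, stage_etas))

def balance_gap(stage_energies, stage_etas, e_v, d_v):
    """Aggregate hardware intensity minus voltage intensity E_v / D_v.

    A positive gap suggests circuits tuned too aggressively for this supply
    voltage (raise v and relax tuning); a negative gap suggests the opposite.
    """
    return aggregate_intensity(stage_energies, stage_etas) - e_v / d_v

# Hypothetical energies: one critical stage carrying 20% of the energy,
# three less critical stages sharing the remaining 80%.
energies = [3.0, 4.0, 4.0, 4.0]
etas = [5.0, 1.25, 1.25, 1.25]
eta_ag = aggregate_intensity(energies, etas)
gap = balance_gap(energies, etas, e_v=2.0, d_v=1.0)
```

With clock gating, the energies passed in would be the activity-scaled averages, as described for the weights wi.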
Composite pipeline stage
Pipeline stages usually consist of multiple blocks that are designed and optimized independently. In any conventional pipeline, at least two independent blocks (latches and logic) can be distinguished, and these are usually designed and tuned independently of each other. Consequently, different blocks in the same pipeline stage may have different values for the optimal hardware intensity [Figure 6(a)]. Then, there are M + 1 independent variables corresponding to the hardware intensities in the M blocks of a pipeline stage, η1, …, ηM, and the single power-supply voltage v. The goal is to find a relation between η1, …, ηM and v that leads to the minimum energy
$$E = \sum_{i=1}^{M} E_i(\eta_i, v) \tag{14}$$
subject to the total delay requirement Dr; disregarding interblock delay coupling effects, this relation can be written as
$$\sum_{i=1}^{M} D_i(\eta_i, v) = D_r. \tag{15}$$
Solving this optimization problem, we arrive at M expressions,
$$\eta_i\,\frac{w_i}{u_i} = \frac{E_v}{D_v}, \qquad i = 1, \ldots, M, \tag{16}$$
where ui is the delay weight of block i, ui = (Di /D), and wi is the corresponding energy weight, wi = (Ei /E), calculated taking into account the activity factors in clock-gated designs. Note that within a single pipeline stage this implies
$$\eta_1\,\frac{w_1}{u_1} = \eta_2\,\frac{w_2}{u_2} = \cdots = \eta_M\,\frac{w_M}{u_M}. \tag{17}$$
Thus, in a pipeline stage that consists of multiple blocks designed independently, blocks that have lower energy weight and higher delay weight should be designed more aggressively than blocks with lower delay weight and higher energy weight. This balance equation is immediately useful in describing the relation between latches and logic. Again, use target values of Ev = 2, Dv = 1 and suppose that all of the pipeline stages are similar, with latches using 20% of the cycle delay but 50% of the power, as in Figure 6(b). Equation (16) can then be used to determine the optimal hardware intensity for the latches at ηlatch = (0.2/0.5) × 2.0 = 0.8, and for the logic at ηlogic = (0.8/0.5) × 2.0 = 3.2. Thus, for these assumptions logic must be optimized much more aggressively than latches.
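The latch and logic numbers above follow from scaling the voltage intensity by the ratio of delay weight to energy weight for each block. A minimal sketch of that arithmetic is given below; the function name is hypothetical, and the final line simply checks that the energy-weighted sum of the block intensities returns the target value.

```python
def block_intensities(delay_weights, energy_weights, theta):
    """Per-block hardware intensity eta_i = (u_i / w_i) * theta."""
    return [u / w * theta for u, w in zip(delay_weights, energy_weights)]

# Latches: 20% of the cycle delay, 50% of the power. Logic: the remainder.
delay_weights = [0.2, 0.8]
energy_weights = [0.5, 0.5]
eta_latch, eta_logic = block_intensities(delay_weights, energy_weights, theta=2.0)

# Energy-weighted sum of the block intensities for the whole stage.
eta_stage = sum(w * eta for w, eta in zip(energy_weights, (eta_latch, eta_logic)))
```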
Figure 6
To derive an expression for the aggregate hardware intensity of a composite pipeline stage, notice that increasing the clock cycle time by dD = ∑dDi through retuning the circuits in all blocks of the pipeline stage changes the total energy by
$$dE = -\sum_i \eta_i E_i\,\frac{dD_i}{D_i},$$
or
$$dE = -\frac{E}{D}\sum_i \eta_i\,\frac{w_i}{u_i}\,dD_i.$$
If (17) is satisfied, the expression reduces to
$$dE = -\eta_k\,\frac{w_k}{u_k}\,E\,\frac{dD}{D},$$
where k is any sub-block in the pipeline stage. Thus, the aggregate hardware intensity ηag for a composite stage is expressed through the hardware intensities of individual sub-blocks ηi as follows:
$$\eta_{ag} = \eta_k\,\frac{w_k}{u_k}, \tag{18}$$
where k is any sub-block in the pipeline stage.
Multistage pipeline with composite stages
Now we derive the optimality relation for a more general case, representative of a realistic microprocessor, in which the pipeline consists of N stages and there are at most M sub-blocks in each pipeline stage that are designed independently of one another (Figure 7). Let Eij be the energy dissipated in sub-block j of pipeline stage i, Dij be the corresponding critical path delay, and ηij be the corresponding hardware intensity, 1 ≤ i ≤ N, 1 ≤ j ≤ M. The goal is to minimize the total energy in the space of N × M + 1 variables,
$$E = \sum_{i=1}^{N}\sum_{j=1}^{M} E_{ij}(\eta_{ij}, v) \tag{19}$$
subject to the N constraints
$$\sum_{j=1}^{M} D_{ij}(\eta_{ij}, v) = D, \qquad i = 1, \ldots, N. \tag{20}$$
Figure 7
Solving this problem, we arrive at N(M − 1) relations, M − 1 relations for every pipeline stage i, which are similar to (17):
$$\eta_{i1}\,\frac{w_{i1}}{u_{i1}} = \eta_{i2}\,\frac{w_{i2}}{u_{i2}} = \cdots = \eta_{iM}\,\frac{w_{iM}}{u_{iM}}, \qquad i = 1, \ldots, N, \tag{21}$$
and one expression similar to (12):
$$\sum_{i=1}^{N} \eta_{ik}\,\frac{w_{ik}}{u_{ik}} = \frac{E_v}{D_v}, \tag{22}$$
where index ik refers to any sub-block k within pipeline stage i, uij is the delay weight of sub-block j in pipeline stage i, uij = (Dij /D), and wij is the corresponding energy weight, wij = (Eij /E), calculated taking into account the activity factors.
To derive an expression for the aggregate hardware intensity of a multistage pipeline with composite stages, notice that increasing the clock cycle time by dD = ∑j dDij through retuning circuits in all pipeline stages changes the total energy by
$$dE = -\sum_i\sum_j \eta_{ij} E_{ij}\,\frac{dD_{ij}}{D_{ij}},$$
or
$$dE = -\frac{E}{D}\sum_i\sum_j \eta_{ij}\,\frac{w_{ij}}{u_{ij}}\,dD_{ij}.$$
If (21) is satisfied, the expression reduces to
$$dE = -E\,\frac{dD}{D}\sum_i \eta_{ik}\,\frac{w_{ik}}{u_{ik}},$$
where index ik refers to any sub-block k within pipeline stage i. Thus, the aggregate hardware intensity ηag for a pipeline with composite stages is expressed through the hardware intensities of individual sub-blocks ηij as
$$\eta_{ag} = \sum_i \eta_{ik}\,\frac{w_{ik}}{u_{ik}}, \tag{23}$$
where index ik refers to any sub-block k within pipeline stage i. Notice that the optimality relation (22) is equivalent to (9), with η = ηag.
Using the expression for the aggregate hardware intensity within pipeline stages (18), relation (22) can be rewritten as
$$\sum_{i=1}^{N} w_i\,\eta_{ag,i} = \frac{E_v}{D_v}, \tag{24}$$
where wi is the total energy weight of pipeline stage i and ηag,i is the aggregate hardware intensity in pipeline stage i.
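For a pipeline described as a nested list of sub-blocks, the within-stage balance and the pipeline-level aggregate can be checked together. The sketch below assumes each sub-block is given as an (energy, delay, η) triple and that the within-stage balance means η·(Eij/Ei)/(Dij/D) is the same for every sub-block j of stage i; the function name and the sample numbers are hypothetical.

```python
def stage_summary(blocks, cycle):
    """Summarize a pipeline with composite stages.

    blocks[i] is a list of (energy, delay, eta) triples for stage i;
    cycle is the common clock period D.
    Returns, per stage, (stage energy weight, list of balance ratios), where
    each ratio is eta * (E_ij / E_i) / (D_ij / D). Equal ratios within a
    stage indicate a balanced stage, and the common value is the stage
    aggregate hardware intensity.
    """
    total_energy = sum(e for stage in blocks for e, _, _ in stage)
    summary = []
    for stage in blocks:
        stage_energy = sum(e for e, _, _ in stage)
        ratios = [eta * (e / stage_energy) / (d / cycle) for e, d, eta in stage]
        summary.append((stage_energy / total_energy, ratios))
    return summary

# Hypothetical two-stage pipeline, each stage = [latch, logic].
pipeline = [
    [(1.0, 0.2, 2.4), (1.0, 0.8, 9.6)],    # small, aggressively tuned stage
    [(2.0, 0.25, 1.0), (6.0, 0.75, 1.0)],  # large, relaxed stage
]
summary = stage_summary(pipeline, cycle=1.0)
eta_pipeline = sum(w * ratios[0] for w, ratios in summary)
```

For these numbers the stage aggregates are 6 and 1 with energy weights 0.2 and 0.8, so the pipeline aggregate is 2, which would be balanced against Ev/Dv = 2.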
Relation to the architectural metric
So far the paper has focused on balancing performance and power at the circuit level. It was shown in [15] that the concept of hardware intensity is closely related to the architectural energy-efficiency metric. To achieve the energy-optimal design in the global architecture-circuit space, architectural choices must be balanced with circuit-level decisions. We next present a methodology that allows architects to optimize the architecture in the global energy-performance space by balancing the architectural complexity with the aggressiveness of the design at the implementation level.
To derive the architectural energy-efficiency criterion, we introduce a discrete variable ξ that represents the architectural complexity of a processor [11, 16], and we express the average power W and performance P of a processor as functions of three variables: architectural complexity ξ, power-supply voltage v, and aggregate hardware intensity η, as follows:
$$P(\xi, \eta, v) = \frac{f(\xi, \eta, v)\,I(\xi)}{N(\xi)}, \tag{25}$$
$$W(\xi, \eta, v) = f(\xi, \eta, v)\,I(\xi)\,E(\xi, \eta, v), \tag{26}$$
where I is the average number of instructions executed per cycle (IPC), which is a measure of the architectural speed, N is the dynamic instruction count, f is the maximum clock frequency, and E is the average energy dissipated per executed instruction, measured on the same set of benchmarks as the IPC. The architectural characteristics N and I do not depend on η and v, whereas f and E depend on all three design variables.
We pose the optimization problem as a problem of optimizing performance subject to a constant power budget. In discrete terms, for an architectural feature Δξ under evaluation we will find a condition for which ΔP > 0, assuming that the power-supply voltage v and the aggregate hardware intensity η are adjusted accordingly to satisfy the constraint ΔW = 0.
To derive the criterion, we make an assumption that for every architectural configuration the processor pipeline is tuned according to the optimal balance between the aggregate hardware intensity and the power supply [Equations (9) and (23)], η = θ(v). Then, disregarding second-order terms, the increment in performance and the constraint of fixed power can be expressed as follows:
| (27) |
| (28) |
Substituting expressions (25) and (26) into these formulas and using notation from (7), we rewrite (27) and (28) as
| (29) |
and
| (30) |
Using the definitions of the voltage intensity in (7) and aggregate hardware intensity in (5), and the assumption about the optimal power–performance balance in the pipeline, η = θ, we rewrite the constraint of fixed power (30) as follows:
| (31) |
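When expressions (29) and (31) are combined, the condition of an increase in performance ΔP > 0 subject to the constant power constraint ΔW = 0 leads to the following formula:
$$\eta\left(\frac{1}{f}\frac{\Delta f}{\Delta \xi} + \frac{1}{I}\frac{\Delta I}{\Delta \xi}\right) - (1 + \eta)\,\frac{1}{N}\frac{\Delta N}{\Delta \xi} - \frac{1}{E}\frac{\Delta E}{\Delta \xi} > 0. \tag{32}$$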
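The step from the balance assumption to this criterion can be sketched compactly. The following is a reconstruction under the definitions in the text (P = fI/N, W = fIE, and η = θ along the tuning path), not a reproduction of the intermediate equations; the symbol a, denoting the relative frequency change obtained by adjusting v and retuning, is introduced here only for the sketch. Along the balanced path each 1% of frequency costs η% of energy, so power changes by (1 + η)a.

```latex
% a: relative frequency change from adjusting v and retuning with eta = theta
\begin{align*}
\frac{\Delta P}{P} &= \frac{\Delta f}{f} + \frac{\Delta I}{I} - \frac{\Delta N}{N} + a > 0, \\
\frac{\Delta W}{W} &= \frac{\Delta f}{f} + \frac{\Delta I}{I} + \frac{\Delta E}{E} + (1 + \eta)\,a = 0, \\
\Rightarrow\quad
& \eta\left(\frac{\Delta f}{f} + \frac{\Delta I}{I}\right)
  - (1 + \eta)\,\frac{\Delta N}{N} - \frac{\Delta E}{E} > 0 .
\end{align*}
```

Eliminating a between the first two lines and multiplying by (1 + η) gives the third. Here the increments Δf, ΔI, ΔN, and ΔE are those caused by the architectural change alone, at fixed η and v.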
In this formula (Δf/f), (ΔI/I), (ΔE/E), and (ΔN/N) are relative increments in the processor frequency, architectural speed, average energy per instruction, and dynamic instruction count arising from a modification at the architectural or microarchitectural level, evaluated for a fixed hardware intensity ηag and power supply v. Thus, all deltas in (32) have the meaning of partial derivatives with respect to the architectural complexity.
The terms (ΔI/I) and (ΔN/N) in (32) can be measured by running the benchmark suite on an architectural simulator. Next we present a methodology for estimating the two remaining terms, (Δf/f) and (ΔE/E), and derive a new form of the energy-efficiency criterion that does not require estimating Δf.
The key assumption in deriving the energy-efficiency criterion (32) was the assumption about the optimal tuning of circuits in every pipeline stage (21) and (22) for every architectural alternative, such that the aggregate hardware intensity of the processor ηag (23) is unchanged between designs implementing architectural alternatives. This assumption imposes special rules on the calculation of (Δf/f) and (ΔE/E); in particular, these relative increments must be calculated assuming that the processor pipeline is reoptimized after every modification to the microarchitecture to satisfy (21) and (22).
Suppose that an architectural feature under evaluation introduces an additional complexity into several (or all) stages of the pipeline, leading to increments ΔDi|no retuning in critical path delays in the corresponding pipeline stages, assuming that no retuning is done to recover the clock frequency. Suppose that the corresponding increments in average energies are ΔEi|no retuning. The increments ΔDi|no retuning and ΔEi|no retuning should be evaluated consistently with the initial hardware intensities of the corresponding stages. For example, logic added to stage i should be tuned (or assumed to be tuned) according to Equation (17). Then, after the logic is added, the aggregate hardware intensity (18) in pipeline stage i will not change. The delay and energy increments may be either positive or negative, and in those pipeline stages that are unaffected by the architectural modification, the delay and energy increments are zero, ΔDi|no retuning = 0, ΔEi|no retuning = 0, as shown in Figure 8.
Figure 8
Circuit designers usually have no difficulties estimating the “nonretuned” increments in delay and energy. For example, adding an execution bypass in the 10FO4 pipeline results in increments in the critical path delay and average energy of the execution stage of the pipeline which are approximately
whereas adding an extra read port to a multiported register file may result in
with no impact on other stages of the pipeline.
To recover the clock frequency, circuits in those stages of the pipeline that are negatively affected by the architectural modification must be tuned up for a higher hardware intensity. To restore the energy-optimal balance in the pipeline (24), circuits in all remaining stages must be tuned down for a lower hardware intensity, so that
$$\sum_i \left(w_i\,\Delta\eta_i + \eta_i\,\Delta w_i\right) = 0, \tag{33}$$
where Δηi is the increment in the aggregate hardware intensity in stage i as a result of retuning (Δηi = ηi,final − ηi,initial), as illustrated in Figure 8, whereas Δwi is the net increment in the corresponding energy weight as a result of both adding hardware and subsequent retuning [Δwi = (ΔEi /E) − wi(ΔE/E)].
We designate by ΔDi|retuning and ΔEi|retuning the increments in delay and energy in the pipeline stage i as a result of retuning the processor, whereas by ΔDi = ΔD and ΔEi we designate the net increment in delay and energy in pipeline stage i as a result of both modifying the function and subsequent retuning:
$$\Delta D = \Delta D_i|_{\text{no retuning}} + \Delta D_i|_{\text{retuning}}; \tag{34}$$
$$\Delta E_i = \Delta E_i|_{\text{no retuning}} + \Delta E_i|_{\text{retuning}}. \tag{35}$$
Thus, the net delay and energy increments in every pipeline stage consist of increments due to a change in the functionality resulting from a microarchitectural modification, and additional increments as a result of retuning the circuits. The net delay increment ΔD does not require any index because all pipeline stages are assumed to have the same delay before and after the retuning, ΔDi = ΔD. The relative increment in the maximum clock frequency is related to ΔD as
$$\frac{\Delta f}{f} = -\frac{\Delta D}{D}. \tag{36}$$
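Assuming small changes in hardware intensities in all pipeline stages and neglecting second-order terms, the increments in energies ΔEi|retuning as a result of the retuning can be expressed through the corresponding increments in delays ΔDi|retuning as follows:
$$\Delta E_i|_{\text{retuning}} = -\eta_i E_i\,\frac{\Delta D_i|_{\text{retuning}}}{D}. \tag{37}$$
By using (34) and (35), the final increments in energies can be expressed as
$$\Delta E_i = \Delta E_i|_{\text{no retuning}} - \eta_i E_i\,\frac{\Delta D - \Delta D_i|_{\text{no retuning}}}{D}. \tag{38}$$
The total increment in energy of the whole pipeline, ΔE = ∑ΔEi, is calculated by summing expressions (38) over all pipeline stages and taking advantage of (13) and (36):
$$\frac{\Delta E}{E} = \frac{\Delta E|_{\text{no retuning}}}{E} + \eta\,\frac{\Delta f}{f} + \sum_i w_i \eta_i\,\frac{\Delta D_i|_{\text{no retuning}}}{D}. \tag{39}$$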
Substituting this expression into the derived energy-efficiency criterion (32), we notice that the term (Δf/f) cancels out, since in both expressions it has the same meaning of a partial derivative with respect to architectural complexity ξ. Then, dropping Δξ in the denominators of all terms, we arrive at the form of the energy-efficiency criterion that does not require estimating the increment in frequency:
$$\eta\,\frac{\Delta I}{I} - (1 + \eta)\,\frac{\Delta N}{N} - \frac{\Delta E|_{\text{no retuning}}}{E} - \sum_i w_i \eta_i\,\frac{\Delta D_i|_{\text{no retuning}}}{D} > 0, \tag{40}$$
where
$$\Delta E|_{\text{no retuning}} = \sum_i \Delta E_i|_{\text{no retuning}}$$
is the total increase in average energy dissipated per instruction, assuming that no retuning is done, summation being done over all stages in the pipeline affected by the architectural modification.
Expression (40) is a more convenient form of the energy-efficiency criterion than (32). According to (40), in order to evaluate the energy efficiency of some architectural feature, the architects must supply the relative gain (or loss) in the architectural performance ( I/I) and relative change in the dynamic instruction count ( N/N) that result from this feature. These estimates can be obtained by running an architectural simulator, or timer, such as Turandot [17, 18]. The second term, N, is nonzero if changes to the instruction set architecture (ISA) are considered, or compiler optimizations are analyzed for energy efficiency. It may also be nonzero if microarchitectural changes that affect the average number of instructions executed from mispredicted paths are considered in a speculative issue processor.
The architect then needs to consult circuit designers to estimate the impact of the architectural feature under consideration on the average energy dissipated per instruction and on the critical path delay through every stage of the pipeline affected by this feature. A significant advantage of the derived formula is that in estimating the relative changes in energy and critical path delays, the circuit designer does not need to worry about retuning the circuits to recover the frequency, or reducing the positive timing slack to save some power. The relative increments in critical path delays are then multiplied by the appropriate energy weights and hardware intensities and summed. The higher the energy weight wi and the hardware intensity ηi of a part of the pipeline i affected by the architectural feature, the higher the weight of the increase in the critical path delay through this part of the pipeline.
Expression (40) is then evaluated. If the inequality holds, the architectural feature under evaluation is energy-efficient; that is, after adopting it, the processor will deliver higher net performance at the same power budget, after appropriate retuning and, possibly, adjustment in the power-supply voltage are done to meet the power budget.
The energy weights wi in (40) are typically available as part of power budgeting at the early stages of the definition of the processor pipeline. The only additional data required in order to use the energy-efficiency criterion are hardware intensities ηi in all blocks of the processor. Those quantities can be measured by static tuning tools such as EinsTuner [12] on the basis of simulations of previous designs, or set as targets at early planning of the microarchitecture, in the same way the power targets are budgeted.
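The bookkeeping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' tooling: the class and function names are hypothetical, the numbers are made up, and the sketch computes only the two stage-level sums that the text describes (the energy-weighted energy increment without retuning, and the delay increments weighted by energy weight and hardware intensity). It does not evaluate criterion (40) itself.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class StageEstimate:
    """Per-stage inputs for one affected pipeline stage (hypothetical names)."""
    name: str
    energy_weight: float       # w_i, share of pipeline energy from the power budget
    hardware_intensity: float  # eta_i, measured on prior designs or set as a target
    rel_energy_change: float   # Delta E_i / E_i, estimated with no retuning
    rel_delay_change: float    # Delta D_i / D, estimated with no retuning


def criterion_inputs(stages: Iterable[StageEstimate]) -> Tuple[float, float]:
    """Return (energy_term, delay_term) summed over the affected stages.

    energy_term = sum_i w_i * (Delta E_i / E_i)
    delay_term  = sum_i w_i * eta_i * (Delta D_i / D)
    """
    stages = list(stages)
    energy_term = sum(s.energy_weight * s.rel_energy_change for s in stages)
    delay_term = sum(
        s.energy_weight * s.hardware_intensity * s.rel_delay_change
        for s in stages
    )
    return energy_term, delay_term


if __name__ == "__main__":
    # Made-up estimates for two affected stages, for illustration only.
    affected = [
        StageEstimate("issue", 0.20, 2.0, 0.05, 0.03),
        StageEstimate("execute", 0.30, 1.0, 0.02, 0.01),
    ]
    energy_term, delay_term = criterion_inputs(affected)
    print(f"energy term: {energy_term:.4f}, delay term: {delay_term:.4f}")
```

Keeping the per-stage estimates in one record mirrors the division of labor in the text: the weights come from the power budget, the intensities from tuning data or targets, and the two relative increments from the circuit designers.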
We refer the reader to [16] for realistic examples of using the derived energy-efficiency criterion, and a graphical interpretation of the iterative process of refining the architecture using the criterion.
Conclusions and future work
The concept of hardware intensity leads to a number of quantitative relations which can be used to communicate information between circuit designers and architects. Circuit designers can use existing designs to provide typical hardware intensity values to architects for use in evaluating the power efficiency of a starting design. Architects in turn can use these relations to provide guidance to the circuit designers on appropriate levels of power/performance to target. Note that the metric η can be used as a target for circuit tuning, or straightforwardly evaluated for a tuned circuit. The relations on η also provide guidance for choosing an appropriate supply voltage. Overall, attention to these concepts ensures a more power-efficient design.
We plan to extend this analysis in a number of ways. In the examples considered so far, the delay of a block has been considered as a single value. In practice, however, macros typically display irregular boundary conditions. There may be critical paths of greatly different lengths in a single macro. Fortunately, it is also possible to compute sensitivities of macro power with respect to various timing assertions and derive conditions for power optimality based on them. This work is underway. We have mentioned briefly the fact that moving latch boundaries generally results in changing the number of latches, which is significant because of the large contribution of latches to overall power. Such considerations generally push the analysis into the realm of microarchitecture, and collaboration is underway to extend the analysis in this way [11, 19]. For planning purposes, it is important to understand the values for hardware intensity implied by real designs. Existing circuits are being studied to understand the hardware intensities in practice.
The existing analysis asserts relatively rigid cycle boundaries, while in practice some degree of transparency or cycle stealing may be possible. There may be mixtures of clock domains of differing frequencies. Techniques for enhanced power efficiency such as multiple thresholds, oxides, or supplies should be considered. In addition to facilitating communication between circuits and microarchitecture, hardware intensity could also be used between circuits and technology. Decisions regarding FET parameters could be aided by considering the power–performance characteristics of such devices. The supply voltage for power limits may differ from the supply voltage used for nominal timing, and this can affect the scaling. As is clear from the above discussion, these ideas provoke a number of interesting extensions addressing practical issues.
Appendix
For completeness, we derive below a closed-form expression for the term Δf/f, which appears in form (32) of the energy-efficiency criterion and relations for energy increments. The goal is to express this term through the easily measurable quantities ΔEi|no retuning and ΔDi|no retuning.
To derive the formula, we use N expressions (38) and the condition of the unchanged aggregate hardware intensity (33) which, using (12), can be expressed as
| (41) |
To close the system of equations, we express the increments in hardware intensities Δηi, resulting from the retuning in expression (41), through the corresponding increments in delays as
$$\Delta\eta_i = \frac{\partial \eta_i}{\partial D_i}\,\Delta D_i|_{\text{retuning}}.$$
The partial derivatives (∂ηi/∂Di) can be expressed through the second-order derivative of energy with respect to delay as
Using (34), the increments in hardware intensities Δηi can be expressed as
| (42) |
Substituting N relations (38) for ΔEi and N relations (42) for Δηi into Equation (41), we can express the final increments in the clock period ΔD and frequency Δf through “nonretuned” increments in energy and delay, ΔEi|no retuning and ΔDi|no retuning, in pipeline stages i = 1, …, N as
| (43) |
| (44) |
where bi and ci are weighting factors expressed as
| (45) |
$$c_i = w_i(\eta_i - \eta), \qquad (46)$$
where wi are the energy weights of pipeline stages, as defined before. Thus, the relative change in the clock period is a weighted average of relative increments in critical path delays, calculated over all pipeline stages. The only new quantity in these expressions is the normalized second-order derivative of energy with respect to delay, which, like hardware intensity, can be obtained from energy–delay tradeoff curves.
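As a rough illustration of how both quantities might be read off an energy-delay tradeoff curve, the sketch below applies central differences to sampled points. It assumes the definition η = −(D/E)·∂E/∂D from the earlier sections and takes the normalized second-order derivative to be (D²/E)·∂²E/∂D²; this normalization, the function name, and the synthetic curve are assumptions made for the example rather than the paper's own formulation.

```python
from typing import Sequence, Tuple


def intensity_and_curvature(
    delays: Sequence[float], energies: Sequence[float], k: int
) -> Tuple[float, float]:
    """Estimate (eta, theta) at interior sample k of an energy-delay curve.

    Central differences on a uniformly spaced delay grid are used for
      eta   = -(D / E) * dE/dD         (assumed definition)
      theta = (D**2 / E) * d2E/dD2     (assumed normalization)
    """
    if not 0 < k < len(delays) - 1:
        raise ValueError("k must index an interior sample")
    h = delays[k + 1] - delays[k]  # grid spacing, assumed uniform
    d, e = delays[k], energies[k]
    first = (energies[k + 1] - energies[k - 1]) / (2.0 * h)
    second = (energies[k + 1] - 2.0 * e + energies[k - 1]) / (h * h)
    return -(d / e) * first, (d * d / e) * second


if __name__ == "__main__":
    # Synthetic curve E = D**-2: analytically eta = 2 and theta = 6,
    # so the estimates should land close to those values.
    grid = [1.0 + 0.001 * i for i in range(-1, 2)]
    curve = [x ** -2 for x in grid]
    eta, theta = intensity_and_curvature(grid, curve, 1)
    print(eta, theta)
```

For a power-law curve E ∝ D^(−k), these definitions give η = k and a normalized second derivative of k(k + 1), which is at least η². Real tradeoff curves from a tuner would be noisier, so a fitted curve may be preferable to raw differences.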
It can be shown under very general assumptions about the shape of the energy–delay tradeoff curve, such as a uniform growth of hardware intensity as a function of the critical path delay, η = η(D), that the normalized second-order derivative of energy with respect to delay grows at least as the square of the hardware intensity and, as a consequence, ∑bi > 0. Then, the higher the energy weight wi and hardware intensity ηi of pipeline stage i, the higher the weight of the corresponding increase in the critical path delay bi in the weighted average (43).
Changes in energies of pipeline stages due to architectural modifications also affect the increment in the clock period. According to (46), the higher the energy weight wi and hardware intensity ηi of pipeline stage i, the higher the weight of the increase in the energy ci in (43). Notice that ∑ci = 0.
For architectural changes that uniformly affect all pipeline stages,
ΔEi|no retuning = ΔEj|no retuning
and
ΔDi|no retuning = ΔDj|no retuning,
formula (43) is reduced to intuitive expressions
The same simplification applies to uniformly tuned pipelines, with bi = bj, ηi = ηj, and wi = wj.
Acknowledgment
The authors would like to thank P. Bose for valuable discussions and J. Moreno and K. Warren for management support.
Footnotes
1For designs in which the worst-case power is the main criterion or designs in which clock gating is not implemented, use should be made of a different form of expression for power which leads to a similar expression for the energy-efficiency criterion [11, 14].
2For compactness, the subscript “ag” of η has been omitted in most of the formulas in this section.
3It was shown in [11] that the reciprocal problem of minimizing power subject to a constant performance requirement leads to the same result.
*Trademark or registered trademark of International Business Machines Corporation.
**Trademark or registered trademark of Synopsys, Inc.
Received October 9, 2002; accepted for publication February 14, 2003; Internet publication September 23, 2003