IBM Journal of Research and Development

Advanced Silicon Technology   Volume 50, Number 4/5, 2006
DOI: 10.1147/rd.504.0419

Optimizing CMOS technology for maximum performance

by D. J. Frank, W. Haensch, G. Shahidi, and O. H. Dokumaci

Since power dissipation is becoming a dominant limitation on the continued improvement of CMOS technology, technologists must understand the best way to design transistors in the presence of power constraints. The primary objective is to obtain as much performance as possible for a fixed amount of power, and it is chip performance, not device performance, that matters. In order to investigate this regime, we have captured in simplified models the basic elements for determining chip performance, including intrinsic transistor characteristics, circuit delay, tolerance issues, basic microprocessor composition, and power dissipation and heat removal considerations. These models have been assembled in a processor-level technology-optimization program to study the characteristics of optimal technology across many generations of CMOS. The results that are presented elucidate the limits of future CMOS technology improvements, the optimal energy consumption conditions, and the relative benefits of various proposed technology enhancements, including high-k gate insulators, metal gates, high-mobility semiconductors, improved heat removal, and the use of multiple layers of circuitry.

Introduction

For the past several decades, the semiconductor industry has relied on a progression of smaller, denser, faster, cheaper MOSFETs to provide increasingly better products for digital electronics. This process of shrinking CMOS transistors in order to attain these improvements is known as scaling, and its progress is often characterized by measuring the device speed. As CMOS scaling continues, however, it is increasingly important to analyze potential technology design points for their impact on overall chip performance, and not just for their impact on device speed, because chip-level power constraints as well as device and process variability can seriously diminish the value of device innovations. The high cost of developing new technology options also makes it vital to gain an early understanding of their potential benefit to the final products, so that developments with little benefit can be avoided. This paper describes a high-level technology optimization tool, and the results of using it to perform chip-level analyses of potential technology options for the 45-nm and 32-nm generations. These options include enhanced mobility, high-permittivity gate-insulating materials (“high-k”), metal gate workfunctions, and thermal solutions.

Most prior work in this area has focused on system-level power and performance estimation for extrapolating the behavior of future technologies, using estimated critical paths and fairly detailed system descriptions to determine clock frequency, often with much attention paid to the nature and use of the wiring hierarchy. Early examples of prior work are described in [1–3]. Second-generation modeling systems have included GENESYS [4, 5], RIPE [6], BACPAC [7], and, more recently, GTX [8]. Although some of these models are quite detailed at the system level, and optimize various aspects of the wiring and its usage, they have not generally sought to optimize the device technology, preferring to treat information on device technology as a user input. J. D. Meindl et al. have performed a system-level analysis of the limits to scaling for both devices and wires, looking at limits that are caused by a wide range of physical effects [9]. Threshold- and supply-voltage optimization has frequently been studied as a means to reduce power dissipation (see for example [10, 11]), while other work has focused on minimizing power for a fixed performance, subject to various constraints [5, 12, 13]. Recently it has been argued [14, 15] that power should be considered as the primary constraint that determines how far technology can scale, and that the different power requirements associated with different applications result in different limits to scaling. These prior studies formed the basis for an interdisciplinary CMOS design space study [16], in which many aspects of CMOS design were combined into a single model for optimizing CMOS device and wiring technology, with an emphasis on new device scaling aspects.

The work described in this paper builds on the previous design space studies by adding an improved, calibrated device model, accounting for on-chip tolerance issues, accurately capturing the area and power allocations of real processor chips, and implementing detailed temperature dependences and heat-sink models. This tool is believed to provide the best available analysis of the relative utility of proposed technology options because it attempts to “self-consistently” optimize the devices themselves.

The next section describes the overall optimization approach, followed by an explanation of some of the details of the models. The results of the optimizations are then discussed.

Optimization methodology

If Moore's law [17] could provide a path for unbounded progress, as some projections have implied, our goal of seeking an optimum CMOS technology would not be meaningful. The reality, however, is that CMOS technology is bounded. The existence of an optimum technology is illustrated schematically in Figure 1. When device dimensions are large and threshold voltages are high (at the left side of the curves), dissipation caused by leakage currents can be low. The shrinking of dimensions reduces capacitance and enables increasing performance at fixed power. However, device scaling eventually leads to increasing leakage current, due to quantum-mechanical tunneling and subthreshold current. When the total power is constrained, this leakage dissipation ultimately dominates the power consumption, leaving very little power left over for active circuit switching, which leads to a loss of overall performance beyond some point in the scaling process, even though the devices may be getting faster (the right side of the curves). The height and position of the maximum performance-versus-scaling curve depends on the power constraint and other system conditions, but a maximum does exist. This performance-versus-scaling situation applies to cases in which the power constraint is associated with a roughly fixed degree of architectural complexity. This argument for the existence of an optimum should generally apply to all computational electronics, because it depends only on some very broad features of device physics: 1) electrostatics imposes geometric constraints on the relative device dimensions; 2) thermodynamics imposes constraints on voltage reduction; 3) quantum-mechanical tunneling effects inevitably cause exponentially increasing leakage currents when dimensions are sufficiently reduced; and 4) various practical considerations limit power dissipation.

Figure 1

An optimization tool has been developed to determine the technology parameters that lead to this optimum performance for CMOS technology. The program involves a collection of models that span the material, device, circuit, and system levels, some aspects of which are described in more detail below [9, 16]. The overall structure of the optimization tool is shown schematically in Figure 2. The goal is to find the values of the device technology parameters that will result in the greatest possible processor performance for a given power level. We have chosen to measure performance in terms of a total logic net transition rate, LTR. This is the total number of state changes per second for all of the logic nets in the processor core(s) combined. We have deliberately chosen this metric, rather than a critical path-delay metric, because it allows a substantial degree of independence from the architectural details, making our results more generally useful for changing architectures. This metric relies on the expectation that the rate at which useful instructions can be executed by the processor will monotonically increase with LTR. An alternate optimization metric also was considered in [16], based on total computation received per dollar spent (on both chip and energy) over the expected life of the chip. Because it was shown in [16] that both optimization approaches give similar results, we have focused on maximizing LTR at fixed power in this paper.

Figure 2

As in [16], to reduce complexity and narrow our focus, only the logic devices in the processor core are actually optimized; it is assumed that the power and speed of the clock and latch circuits, registers, memories, and I/O can ultimately all be optimized with essentially the same power/performance result as the logic part of the processor. To achieve accurate results at the chip level, we can use actual chip data to set the processor core complexity, and use the fraction of the chip area and power that is used for logic. The optimization controls the actual size of the chip, and hence the power density, by adjusting the device and wire sizes.

Because memory actually occupies the majority of the chip in modern processors, it may seem unreasonable not to include it in the optimizations. However, we have not done so because of the previously mentioned observation that different applications must be separately optimized [14]. Memory has very different requirements than logic, which lead to optima that are quite different from those for logic. The best system performance can certainly be obtained by creating a technology that offers different, separately optimized devices (and voltages) for memory and for logic. If economic considerations force memory and logic to use the same devices, optimization across both sets of requirements might very well find different results than those reported here, which are for logic by itself.

The basic optimization methodology starts with definitions for power and delay as functions of the underlying technology parameters. In an inner programming loop, one degree of freedom (usually the supply voltage, VDD) is used to satisfy the power constraint, and then in the outer loop, the remaining variables are optimized to find the maximum possible performance.
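The two-level structure described above can be sketched with a deliberately simplified toy model. This is not the authors' program: the delay and power models, every constant except the two taken from Table 1, and the names `stage_delay`, `total_power`, `solve_vdd`, and `optimize` are assumptions made only for illustration. The inner loop uses bisection on VDD to meet the power budget, and the outer loop scans a single remaining variable (VT) for the highest transition rate.

```python
# Toy two-level optimization: the inner loop picks VDD to meet the power budget,
# the outer loop scans VT for the highest logic transition rate (LTR).
# Both models and all constants not marked "Table 1" are illustrative assumptions.

N_GATES = 1e7      # number of logic gates (assumed)
ACTIVITY = 0.012   # activity factor over logic depth (Table 1)
C_LOAD = 1e-15     # average load capacitance per stage, F (assumed)
K_DRIVE = 1e-3     # drive-current prefactor, A/V^alpha (assumed)
ALPHA = 1.462      # I-V power-law exponent (Table 1)
I_LEAK0 = 1e-5     # off current per gate at VT = 0, A (assumed)
SWING = 0.1        # subthreshold swing, V/decade (assumed)


def stage_delay(vdd, vt):
    """CV/I delay of one stage with an alpha-power-law drive current."""
    overdrive = vdd - vt
    if overdrive <= 0.0:
        return float("inf")
    return C_LOAD * vdd / (K_DRIVE * overdrive ** ALPHA)


def total_power(vdd, vt):
    """Dynamic switching power plus subthreshold leakage power, in watts."""
    p_dyn = N_GATES * ACTIVITY * C_LOAD * vdd ** 2 / stage_delay(vdd, vt)
    p_leak = N_GATES * vdd * I_LEAK0 * 10.0 ** (-vt / SWING)
    return p_dyn + p_leak


def logic_transition_rate(vdd, vt):
    """Total logic net transitions per second in the toy model."""
    return N_GATES * ACTIVITY / stage_delay(vdd, vt)


def solve_vdd(vt, p_budget, vdd_max=5.0):
    """Inner loop: bisection on VDD so that total power equals the budget."""
    lo, hi = vt, vdd_max
    if total_power(lo, vt) > p_budget:
        return None  # leakage alone already exceeds the budget
    if total_power(hi, vt) < p_budget:
        return hi    # budget cannot be reached below vdd_max
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if total_power(mid, vt) > p_budget:
            hi = mid
        else:
            lo = mid
    return lo


def optimize(p_budget, vt_grid):
    """Outer loop: keep the VT that gives the highest LTR at fixed power."""
    best = None
    for vt in vt_grid:
        vdd = solve_vdd(vt, p_budget)
        if vdd is None:
            continue
        ltr = logic_transition_rate(vdd, vt)
        if best is None or ltr > best[0]:
            best = (ltr, vt, vdd)
    return best


if __name__ == "__main__":
    grid = [i * 0.01 for i in range(81)]
    print(optimize(50.0, grid))
```

In this toy setting the best VT lies strictly inside the scanned range: lowering VT speeds up the devices but diverts power to leakage, which is the same trade-off that produces the optimum in Figure 1.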

The total power (PTOT) calculation includes dynamic switching power (PDYN), power due to subthreshold leakage current (PsubVT), power due to gate oxide tunneling current (Pox), and power due to drain-to-body tunneling current (PB2B), as defined in the following equations:

$$P_{TOT} = (P_{DYN} + P_{subVT} + P_{ox} + P_{B2B})_{logic} + (P_{DYN} + P_{subVT} + P_{ox} + P_{B2B})_{repeaters}, \tag{1}$$

Equation 2(2)

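$$P_{subVT} = 1.7\,\beta_{Ioff} \cdot N_{CKT} \cdot V_{DD} \cdot W \cdot J_{off}(V_T, V_{DD}, t_{ox}, \eta, L_{CH}), \tag{3}$$
$$P_{ox} = N_{CKT} \cdot F_I \cdot V_{DD} \cdot L_G \cdot W \cdot J_{ox}(V_T, V_{DD}, t_{ox}, \eta), \tag{4}$$
$$P_{B2B} = N_{CKT} \cdot F_I \cdot V_{DD} \cdot (L_G/2) \cdot W \cdot J_{B2B}(F_{Max}, V_{DD}), \tag{5}$$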

where VDD, VH, VL, and VT are the supply voltage, high logic level, low logic level, and threshold voltage, respectively, and τ is the average switching delay of a loaded logic stage (a NAND gate, with average fan-in, FI, usually set to 2, and average fan-out of 1.65). NCKT is the number of logic gates, ⟨C⟩ is the average total load capacitance, αS is the switching activity factor, ℓD is the logic depth, Joff is the off-current density (at VL) of a typical logic FET (see next section), Jox is the oxide tunneling current density (at VH) [15], and JB2B is the band-to-band tunneling current density from drain to body (at VH, and using junction area ½LGW) [15, 18]. tox is the oxide thickness, η is the subthreshold ideality, W is the average FET width, LG and LCH are the gate length and channel length, FMax is the peak field in the body–drain junction, which depends on the doping and the voltage, and βτ and βIoff are tolerance-related correction factors. The power contributions are separately computed for logic and for the buffers that are placed in long wires (i.e., repeaters).
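As a worked illustration of Equations (3) to (5), the sketch below evaluates the three static power terms for one set of made-up inputs. The function name `static_power` and all numerical values are hypothetical. The current densities are passed in as plain numbers rather than computed from the device model, and the units are an assumption: Joff per unit width (A/cm), Jox and JB2B per unit area (A/cm²), with lengths in cm.

```python
def static_power(n_ckt, fan_in, vdd, width, l_gate, beta_ioff, j_off, j_ox, j_b2b):
    """Evaluate the static power terms of Equations (3)-(5), in watts.

    Assumed units: width and l_gate in cm, j_off in A/cm (per unit width),
    j_ox and j_b2b in A/cm^2 (per unit area), vdd in volts.
    """
    p_subvt = 1.7 * beta_ioff * n_ckt * vdd * width * j_off      # Eq. (3)
    p_ox = n_ckt * fan_in * vdd * l_gate * width * j_ox          # Eq. (4)
    p_b2b = n_ckt * fan_in * vdd * (l_gate / 2) * width * j_b2b  # Eq. (5)
    return {"subVT": p_subvt, "ox": p_ox, "B2B": p_b2b}


if __name__ == "__main__":
    # Hypothetical inputs: 1e7 gates, 1-um-wide FETs, 40-nm gate length.
    terms = static_power(
        n_ckt=1e7, fan_in=2, vdd=1.0, width=1e-4, l_gate=4e-6,
        beta_ioff=2.0, j_off=1e-3, j_ox=10.0, j_b2b=1.0,
    )
    for name, watts in terms.items():
        print(f"P_{name} = {watts:.4g} W")
```

With these assumed inputs the subthreshold term dominates the other two, but that ranking depends entirely on the chosen current densities.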

The performance metric, LTR, is computed from the delay as

Equation 6(6)

where CPIeff is a correction factor to take into account the impact of long wires and their repeaters, described in a later section. The loaded logic delay computation proceeds in steps as in [5]. First, the basic device delay is computed using a modified CV/I form that has been shown to accurately take into account output conductance effects [19]:

Equation 7(7)

where Ieff = ½[IDS(VDD, ½VDD) + IDS(VDD, VDD)], the Cs are average capacitances, as indicated by their subscripts, and IDS(VDS, VGS) is drain current as a function of drain and gate voltages, which is described in the next section. Next, the wire RC and time-of-flight delays are computed, and combined using an empirical formula [3, 5]:

Equation 8(8)

$$\tau_3 = L_{wire} / (c/2), \tag{9}$$
$$\tau_4 = \left(\tau_2^{4/3} + \tau_3^{4/3}\right)^{3/4}, \tag{10}$$

where Rwire is the temperature-dependent wire resistance, Lwire is the average wire length, and c is the speed of light. Finally, these delays are combined and divided by a rise-time correction factor due to Sakurai and Newton [19]:

Equation 11(11)

where α is the power-law exponent described in the next section.
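Equations (9) and (10) can be evaluated directly, as in the short sketch below. The wire RC delay τ2 of Equation (8) is taken as an input because its form is not reproduced in the text here, and the function names are illustrative choices, not names from the original tool.

```python
C_LIGHT = 299_792_458.0  # speed of light in vacuum, m/s


def time_of_flight(l_wire):
    """Equation (9): time of flight at half the speed of light; l_wire in meters."""
    return l_wire / (C_LIGHT / 2)


def combined_wire_delay(tau_rc, l_wire):
    """Equation (10): empirical 4/3-power combination of RC and time-of-flight delay."""
    tau_tof = time_of_flight(l_wire)
    return (tau_rc ** (4 / 3) + tau_tof ** (4 / 3)) ** (3 / 4)


if __name__ == "__main__":
    l_wire = 1e-3  # a 1-mm wire (assumed example length)
    for tau_rc in (0.0, 5e-12, 50e-12):
        print(tau_rc, combined_wire_delay(tau_rc, l_wire))
```

The combination reduces to whichever delay dominates when the other is negligible, and it always lies between the larger of the two delays and their sum.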

Model details

Device current–voltage model, calibration, and technology generations

The structural portion of the device model assumes bulk or partially depleted silicon-on-insulator (PD-SOI) FETs, and uses the effective doping, Neff, in conjunction with a first-order analytic 2D Poisson solution to determine VT, η, and DIBL (drain-induced barrier lowering). This model yields continuous, physically realistic device characteristics for all gate lengths, from punch-through to long-channel. Because the Poisson solution depends on LCH, gate insulator thickness and material, and body doping, we have fully captured the underlying technology dependencies in a very general way. The current–voltage model is a generalization of Sakurai's alpha power-law model [19], in which we use the Fermi–Dirac function F_{α−1} to achieve a smooth transition between an alpha power law above VT and an appropriate subthreshold exponential tail so that the same model can be used for both ON and OFF currents. The intrinsic saturation current is given by

Equation 12(12)

where VGS is the gate-to-source voltage, εI is the gate insulator permittivity, and toxeff is the effective thickness of the gate insulator, including quantum effects and poly-Si depletion. e is the electronic charge, EC is the characteristic field in the velocity–field relationship, μ0 is a calibration parameter with units of mobility, β and s are exponents fitted to available data, μ(E,T) is the universal mobility curve [20] as a function of the effective perpendicular field and the temperature, and μx is a mobility enhancement factor used to account for technologies that improve mobility. VDS dependence is accommodated by means of a DIBL adjustment to VT. Source and drain resistance, RCS, is included by using a numerical iteration to self-consistently adjust VGS and VDS to account for the extrinsic voltage drops. Table 1 gives the values used for the constant-valued parameters.


Table 1 Values for various constant model parameters.

Description | Symbol | Value
Activity factor over logic depth | α/ℓD | 0.012
I–V curve power law | α | 1.462
I–V formula gate-length exponent | β | 0.405
I–V formula mobility exponent | s | 0.430
Mobility calibration parameter | μ0 | 132.3 cm²/V-s
Critical field | EC | 2.5 × 10⁴ V/cm
Halo exponent 1 | n1 | −0.574
Halo exponent 2 | n2 | 2.18
Maximum logic depth | ℓDmx | 10
Number of logic stages in typical instruction | nLI | 60
Latency penalty weighting factor | γ | 0.1

Halo doping is a fabrication process in which body dopants are implanted at angles from both the source and drain side of a FET. This is very useful because it causes the effective doping, Neff, in the channel of a MOSFET to increase when the gate length becomes shorter, which tends to compensate for the electrostatic short-channel effects. This is accounted for in our model through the fitting function

Equation 13(13)

where ND sets the doping magnitude, n1 and n2 are fitting exponents, and the parameter xe sets the channel length scale over which Neff varies. xe should be related to the characteristic length scale of the halo-doping profile. This is one of the parameters that must improve from generation to generation in order for scaling to proceed. The source/drain doping usually causes overlap between the gate and the source and drain, so the channel length is offset from the gate length (LCH = LG − xovlp) by an overlap distance, xovlp, that must also decrease as technology improves.

The model is calibrated to 2D drift/diffusion FIELDAY (FInite ELement Device Analysis) [21] simulations at several technology nodes, as shown in Figure 3. The correlation between full 2D device simulations and our simple compact model is excellent, considering that only 14 fitting parameters are used to match this entire set of data, which includes three different gate lengths at both the 90-nm and 45-nm technology nodes. The values of most of these parameters are included in Tables 1 and 2.

Figure 3

On the basis of fits to FIELDAY simulations and ITRS roadmap considerations, the set of adjustable parameters for the FET model was chosen for each technology node, as shown in Table 2.


Table 2 Fixed technology parameters that vary by node.

Description | Symbol | 130 nm | 90 nm | 65 nm | 45 nm | 32 nm
Wire 1/2 pitch (nm) |  | 175 | 120 | 90 | 70 | 50
Gate overlap (nm) | xovlp | 19 | 8.74 | 3 | −1.03 | −1.5
Halo scale length (nm) | xe | 61.5 | 53.4 | 46.3 | 40.2 | 34.9
Contact resistance (Ω-cm) | RCS | 0.0129 | 0.0136 | 0.0144 | 0.0152 | 0.0161
LER sigma of LG at W = 1 μm (nm) |  | 0.28 | 0.28 | 0.28 | 0.28 | 0.28
ACLV (nm) |  | 3.9 | 2.7 | 1.8 | 1.3 | 1
Mobility enhancement factor | μx | 1 | 1.4 | 1.7 | 2 | 2
Gate depletion (nm) |  | 0.4 | 0.4 | 0.3 | 0.3 | 0.01
Wiring permittivity | kwiring | 3.9 | 3.5 | 3.2 | 2.8 | 2.5
Permittivity (gate insulator) | εI | 4.7 (oxynitride) | 4.7 (oxynitride) | 4.7 (oxynitride) | 4.7 (oxynitride) | 20 (HfO2)

These are the parameters that are fixed for each node, and are thought of as the best that technology will be capable of at that node. The gate length, oxide thickness, and voltages are not fixed by node, but rather are determined by optimization.

Tolerance modeling

The following within-chip tolerances are included in the analysis: discrete dopant VT variation, random gate-length variation due to line-edge roughness (LER), across-chip gate-length variation (ACLV), VDD variations, and signal coupling noise. The model estimates the impact of these variations on the average subthreshold leakage current and on the worst-case delay. It also checks that 6σ VT and noise shifts do not cause an individual NAND gate to fail, although this is not usually a problem.

The impact on subthreshold leakage current is estimated by observing that the doping variations, gate-length variations, and noise combine to create an approximately Gaussian distribution, ρ, of equivalent threshold voltage with sigma, σVTeff. When this Gaussian distribution is integrated against the exponential off-current dependence, it yields an average shift [14]:

Equation 14(14)

Thus, the background leakage current increases by the factor

Equation 15(15)

and so does the static power dissipation. If σVTeff exceeds kT, this factor can become quite large.
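The size of this leakage increase can be checked numerically. The sketch below assumes the standard result for averaging an exponential over a Gaussian: if the off current varies as exp(−VT/(ηkT/q)) and VT has standard deviation σ, the average current rises by exp(σ²/(2(ηkT/q)²)). Whether Equation (15) is written in exactly this form is an assumption, since the equation is not reproduced in the text here; the function names and example values are also illustrative.

```python
import math

K_B_OVER_Q = 8.617333262e-5  # Boltzmann constant divided by electron charge, V/K


def leakage_factor_closed_form(sigma_vt, eta, temp_k):
    """Mean of exp(-dVT / (eta kT/q)) for Gaussian dVT with std sigma_vt (volts)."""
    v_slope = eta * K_B_OVER_Q * temp_k
    return math.exp(0.5 * (sigma_vt / v_slope) ** 2)


def leakage_factor_numeric(sigma_vt, eta, temp_k, n_steps=20000, span=10.0):
    """Same average by midpoint-rule integration over +/- span standard deviations."""
    v_slope = eta * K_B_OVER_Q * temp_k
    lo = -span * sigma_vt
    step = 2.0 * span * sigma_vt / n_steps
    norm = 1.0 / (sigma_vt * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i in range(n_steps):
        dv = lo + (i + 0.5) * step
        pdf = norm * math.exp(-0.5 * (dv / sigma_vt) ** 2)
        total += pdf * math.exp(-dv / v_slope) * step
    return total


if __name__ == "__main__":
    # Assumed example: sigma = 30 mV, eta = 1.3, T = 300 K.
    print(leakage_factor_closed_form(0.030, 1.3, 300.0))
    print(leakage_factor_numeric(0.030, 1.3, 300.0))
```

Because σ enters squared inside the exponent, doubling the effective threshold spread raises the logarithm of the factor fourfold, which is why the factor grows quickly once σ is comparable to the thermal voltage.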

The impact of random variations on worst-case delay is estimated on the basis of the following analysis. Each path i in the set of all paths has a distribution ρi(t) of worst-case delays, where “worst case” means the longest possible delay that can occur due to any conceivable instruction sequence (i.e., due to worst-case signal coupling and supply noise). The distribution is over random intrinsic device variations (e.g., variations in VT). The distribution should also include across-chip variation. Across-chip variations probably have correlations, but to a first approximation we may treat them as independently random and consider them as part of the intrinsic variations.

Thus, the probability of path i failing (by taking too long because its delay is longer than the clock time) is $q_i = \int_{t_{CK}}^{\infty} \rho_i(t)\,dt$, where $t_{CK}$ is the desired clock delay. Then, the yield for the whole chip is $Y = \prod_i (1 - q_i)$, where the product is over all independent paths. Each path is considered as a set of stages, so that $\tau_i = \sum_{j=1}^{n_i} \tau_{ij}$, summing over the $n_i$ stages of the ith path. Next, approximate $\langle\tau_{ij}\rangle$ as $\tau_0$, the nominal value of the delay, and assume that the distribution of $\tau_{ij}$ is Gaussian and can be characterized by a $\sigma_\tau$ that can be estimated numerically by using a worst-case vector of variations away from the nominal case. Then, set $\tau_i = n_i\tau_0$ and $\sigma_{\tau i} = \sqrt{n_i}\,\sigma_\tau$. Now,

Equation 16(16)

where

Equation a

Next, treat ni as a continuous variable, and let P(n) be the density of effectively independent paths. Then,

Equation 17(17)

where ℓDmx is the maximum logic depth. If the P(n) density function can be treated as trapezoidal, then to first order only the value at ℓDmx matters, and Equation (16) may be approximately evaluated as

Equation 18(18)

This equation gives yield Y as a function of tCK, which is implicitly present in u.

Since we are given Y and want to find tCK, we can numerically reverse this equation and solve for tCK by iteration. Normalizing by ℓDmx τ0, which is the nominal value, gives

$$\beta_\tau = t_{CK} / (\ell_{Dmx}\,\tau_0). \tag{19}$$

Figure 4 shows contours of constant βτ for a range of στ/τ0 and yield. The value of P(ℓDmx) is not well known, but fortunately βτ has only a weak logarithmic dependence on this parameter. Our calculations use P(ℓDmx) = 0.125NCKT.
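The numerical inversion can be sketched under a cruder assumption than the trapezoidal path density used above: here every independent path is taken to have the maximum logic depth, so the yield is simply (1 − q) raised to the number of paths, with Gaussian path delays of mean ℓDmx τ0 and standard deviation √ℓDmx στ. This simplification, the function names, and the example inputs are illustrative assumptions and do not reproduce Equations (16) to (18).

```python
import math


def chip_yield(beta, sigma_rel, depth, n_paths):
    """Yield when all n_paths independent paths have `depth` Gaussian stages.

    beta is t_CK divided by the nominal path delay; sigma_rel is sigma_tau / tau_0.
    """
    u = (beta - 1.0) * math.sqrt(depth) / sigma_rel  # clock margin in path sigmas
    q = 0.5 * math.erfc(u / math.sqrt(2.0))          # per-path failure probability
    return math.exp(n_paths * math.log1p(-q))


def solve_beta(target_yield, sigma_rel, depth, n_paths):
    """Invert chip_yield for beta by bisection (yield rises monotonically with beta)."""
    lo = 1.0
    hi = 1.0 + 40.0 * sigma_rel / math.sqrt(depth)  # a 40-sigma margin gives yield 1
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chip_yield(mid, sigma_rel, depth, n_paths) < target_yield:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    # Assumed example: 90% yield, sigma_tau/tau_0 = 0.1, depth 10, 0.125 * 1e7 paths.
    print(solve_beta(0.9, 0.1, 10, 0.125 * 1e7))
    print(solve_beta(0.9, 0.1, 10, 1.25 * 1e7))
```

Running the solver with ten times as many paths changes the result only slightly, which matches the weak logarithmic dependence on the path count noted above.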

Figure 4

System composition

As noted before, to reduce complexity, only the logic devices and repeaters in the processor core are actually optimized; we assume that the power and speed of the clock and latch circuits, registers, memories, and I/O will be separately optimized and that the power/performance result of doing so will be essentially the same as for the logic part of the processor (although the optimal devices will be different). The assumed packing density of logic transistors has been adjusted to reflect actual design practices, as have the power allocations. On the basis of analyses of 90-nm- and 65-nm-generation IBM processor chips, we have assumed that 33% of core power and 15% of core area is associated with logic. For this purpose, “logic” excludes latches, clock buffers, registers, and all other forms of RAM. Depending on the processor chip configurations being simulated, we have assumed that 50–75% of the chip power is dissipated in the cores and that 50–75% of the chip area is devoted to cache.

Rent's rule [22] is used to determine average wire length for shorter wires. Repeaters are placed in long wires, with the average repeater width and separation being optimized as part of the overall optimization. In addition, we assume that wires with repeaters are on higher levels of the wiring hierarchy and are two times the size of the regular wires, thus lowering their resistance. This is a very simplified approximation to the detailed optimization of the wiring hierarchy that has been pursued by some [1–3], but we believe that it is sufficient for addressing the underlying device technology optimization in which we are interested. Furthermore, rather than independently optimizing the repeaters on individual wires [5], which raises questions of which sort of optimization to perform, we have chosen to merge the repeater optimization into that of the whole chip. The impact of the long wires on overall performance is captured in a latency-oriented model described in [16]. The model is based on the observation that long wire delay does not directly influence cycle time when designing a new processor, because it can be absorbed in increased pipelining, but the latency of long wires does contribute to the inefficiency of the processor by increasing the effective CPI (cycles per instruction) due to “instruction misses” (times when the processor must wait for a previous instruction to finish before launching a new instruction because the processor needs the previous result). To account for this latency, an application-dependent latency penalty factor γ is introduced, and an effective CPI (associated with latency issues only) is computed as a weighted average between the case in which instructions can be launched immediately (CPI = 1) and the case in which the previous instructions must finish first:

Equation 20(20)

where τcycle = ℓDmx τ is the cycle time, and τinstr = nLI τ + nRI τR is the total time required to complete a typical instruction, from beginning to end, including all the stages of logic (nLI) and the transmission time for long wires, which depends on the repeater stage delay, τR. This penalty factor enables the repeater characteristics to be included in the optimizations. The number of repeaters in a typical instruction is taken to be nRI = 2√(Acore)/SR, where SR is the average spacing between repeaters, and 2√(Acore) means that the total wire length requiring repeaters is twice the edge of the processor core. We usually set γ = 0.1.
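The latency penalty can be illustrated with a small calculation. Equation (20) is not reproduced in the text here, so the weighted-average form used below, CPIeff = (1 − γ) + γ τinstr/τcycle, is an assumption inferred from the verbal description. The function names and the example delays, core area, and repeater spacing are likewise hypothetical.

```python
import math


def repeaters_per_instruction(a_core, s_r):
    """n_RI = 2 * sqrt(A_core) / S_R, with A_core and S_R in consistent units."""
    return 2.0 * math.sqrt(a_core) / s_r


def effective_cpi(tau, tau_r, a_core, s_r, l_dmx=10, n_li=60, gamma=0.1):
    """Assumed form of the latency-only effective CPI.

    Weighted average of CPI = 1 (launch immediately) and CPI = tau_instr / tau_cycle
    (wait for the previous instruction to finish), with weight gamma on the latter.
    """
    tau_cycle = l_dmx * tau
    n_ri = repeaters_per_instruction(a_core, s_r)
    tau_instr = n_li * tau + n_ri * tau_r
    return (1.0 - gamma) + gamma * tau_instr / tau_cycle


if __name__ == "__main__":
    # Assumed example: 10-ps logic stages, 20-ps repeater stages,
    # a 0.25-cm^2 core, and repeaters every 0.1 cm.
    print(effective_cpi(tau=10e-12, tau_r=20e-12, a_core=0.25, s_r=0.1))
```

Under this assumed form, shortening the repeater stage delay or widening the repeater spacing lowers the effective CPI, which is how the repeater variables can enter the overall optimization.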

Thermal modeling

A thermal model has been implemented that allows the junction temperature to be self-consistently determined from the power dissipation. Temperature dependence is included in the subthreshold leakage current, the mobility model, and the wire resistance model. The heat-sink model is illustrated in Figure 5, and can accommodate hot spots, 2D and 3D thermal spreading resistance, and a wide range of materials.

Figure 5

Junction temperature constraints can also be imposed on the optimizations, to reflect realistic reliability concerns. At low power levels, when the junction temperature does not reach the constraint value, there is no effect, but at high power levels, when the temperature would exceed the constraint value, the chip area is increased by the addition of non-dissipating, unused areas. This increase in area reduces the power density just enough to keep the temperature at the constraint value. Such design points are undesirable, but represent the best that could be done if one insisted on dissipating excessive power.

Optimization results

Figure 6 shows the detailed results of optimizations for the 90-nm to 32-nm-technology nodes, using the node characteristics shown in Table 2. These optimizations are performed for a dual-core processor chip with aggressive air cooling. Seven variables have been optimized: gate length, oxide thickness, halo doping, mean width, mean repeater spacing, mean repeater width, and VDD. The peak in performance versus power seen in Figure 6(a) occurs because the heat-sink technology is fixed and the temperature rise is constrained. The peak corresponds to the power at which the maximum temperature is first reached, as can be seen from the constant temperature contours in the figure. In this case, the maximum temperature rise is 60°C. The only way to increase power further, without increasing temperature, is to spread the chip out, as described in the previous section. This lengthens wires and slows down the chip even though the power level is higher. Design points at power levels beyond the peak are undesirable and should be avoided in practice. Low-power designs require larger, less scaled devices [Figures 6(b), 6(c)] in order to reduce leakage currents, indicating that only the highest-power applications can utilize extremely scaled devices. Figure 7 shows the optimal allocation of power dissipation among the various mechanisms for a processor using water cooling (which allows higher power dissipation), from which it can be seen that gate leakage dissipation should not exceed a few percent, but optimal subthreshold leakage can exceed 50% for very high-power designs.

Figure 6

Figure 7

In an effort to understand the accuracy of our calculations, we have checked to see how our model predictions for past technology generations compare with what was actually built. We have found that the gate lengths [e.g., Figure 6(b)] agree reasonably well at the power levels for which technology generations have been targeted, but our supply and threshold voltages tend to be lower than what was used in practice, rising only slowly as the lithographic dimension increases. Two main reasons exist for this voltage discrepancy: 1) We have not yet included process variations in our analysis, which would undoubtedly slightly increase our optimized voltages; and 2) voltages used in past designs were probably not optimal. Ten to fifteen years ago, supply voltages were determined by external considerations, such as the “industry standard” five-volt power supply, and there was much resistance to the idea of lowering voltages, even when it became clear that reliability concerns demanded a change [23]. Furthermore, technologists tended to think that standby power should be quite low (unlike the optimized results in Figure 7), which required higher VT, and commensurately higher VDD. Consequently, we believe that past technology generations had non-optimally high voltages, making our comparison partially unsuccessful. On the basis of comparisons with more recent technologies, we expect our modeling to accurately predict trends, but exact optimum values for a specific scenario may be less accurate than the trend predictions because of the many simplifying assumptions.

Next we consider future technology options. Metal gates are simulated by removing the poly-Si depletion effects and adjusting the workfunction. High-k gate stacks are simulated using a double-layer bandgap-dependent tunneling model. As can be seen in Figure 8, high-k combined with metal-gate can potentially yield excellent chip-level performance enhancement, as is also seen in the 32-nm node in Figure 6(a), but metal gates by themselves do not offer much benefit over poly-Si, even for workfunctions that are equivalent to poly-Si. As the workfunction shifts from band edge toward midgap, a significant loss of performance occurs for both metal-gate and high-k combined with metal-gate. This loss occurs because the optimizations compensate for midgap workfunctions by lower doping (which increases depletion depth) and by raising the supply voltage (which necessitates thicker oxide). According to these optimizations, the benefits of the use of high-k are lost for PD-SOI by the time the workfunction reaches quarter-gap.

Figure 8

Many future technology options involve increasing mobility, such as the use of strain, hybrid-orientation substrates, and SiGe layers. Figure 9 shows that mobility increases can indeed increase chip performance, though with diminishing returns for large increases in mobility. This performance increase is larger for high-power chip designs than for low-power designs. It is not yet clear, however, how much mobility improvement will be possible at 32 nm, because much of the available increase will already have been achieved in previous nodes.

Figure 9

If we pessimistically suppose that some improvements will not be manufacturable, we obtain the results shown in Figure 10. Figure 10(a) serves as a baseline, in which improvements are successfully implemented according to Table 2. This is the same data as for Figure 6(a), on a linear scale, and we have added a high-k option for the 45-nm node, which clearly illustrates the potential benefits of an optimal high-k solution. Figure 10(b) shows the results if we are unable to improve the wiring dielectric constant and it remains fixed at 2.8. This reduces the peak performance of the 32-nm generation to the same value as for the 45-nm node. Finally, in Figure 10(c) we assume that the device technology is also fixed, with LG = 36 nm, tox = 1.1 nm, and the wiring dielectric constant kwiring = 2.8. In this case, the generation-to-generation changes involve only the packing density, as driven by the widths and wiring pitch, which are not fixed. A significant peak performance loss occurs in future generations, with very little gain even at low power, making it clear that density improvements alone are insufficient for future technology generations.

Figure 10

In Figure 11, the optimizer is used to assess and compare the potential performance gain achievable by a variety of proposed technology options. In this figure, the “base” case is the “baseline” 65-nm-node technology to which the other cases should be compared. Each point plotted is the peak performance that is possible with the given heat sink and the specified temperature rise, corresponding to the highest points on curves similar to those in Figure 6(a). Because the power level at which the peak occurs varies depending on the technology option, the data points are somewhat scattered along the x-axis. Among the options compared, the following appear to be effective for improving performance: reducing the wiring permittivity by 0.64x, using 3D integration with two layers of active circuitry, and turning off the supply to inactive logic. Reducing variability by 0.7x also helps performance somewhat, while simply making the wiring smaller does not appear very beneficial, as has already been discussed. Overall, technology changes that truly lower the switching energy appear useful, while changes that only make devices faster or denser at the same switching energy are not valuable when the circuits are power-limited. Because power is the controlling factor, improved heat removal is also quite effective. Note that maximum performance should be achieved by implementing all of the favorable changes simultaneously.

Figure 11

As noted above, improved heat-sink technology offers a direct path to larger performance gains than those provided by improved device technology, as shown in Figure 12. One may also gain a modest performance increase by decreasing the heat-sink temperature, but this gain generally disappears if the refrigerator power must be taken into account. Thermal solutions are difficult because such high-power processors are very inefficient: performance increases only roughly as the logarithm of the power. Note that it may not actually be possible to reliably deliver such high power levels to the chip, but experiments have shown that microchannel liquid cooling can remove the associated heat [24, 25].
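
The roughly logarithmic dependence of performance on power noted above can be illustrated with a minimal sketch. The function name, the reference power, and the slope are hypothetical placeholders chosen for illustration; they are not values fitted to the optimizer results. The sketch only shows the consequence of a logarithmic law: each doubling of power buys the same fixed increment of performance.

```python
import math


def relative_performance(power_w, ref_power_w=100.0, slope=0.25):
    """Performance relative to a reference design, assuming perf ~ log(power).

    ref_power_w and slope are illustrative assumptions, not fitted values.
    """
    if power_w <= 0 or ref_power_w <= 0:
        raise ValueError("power values must be positive")
    return 1.0 + slope * math.log(power_w / ref_power_w)


if __name__ == "__main__":
    # Each doubling of power adds the same increment (slope * ln 2).
    for power in (50.0, 100.0, 200.0, 400.0):
        perf = relative_performance(power)
        print(f"{power:6.0f} W -> relative performance {perf:.3f}")
```

Under this assumed law, going from 200 W to 400 W yields no more additional performance than going from 100 W to 200 W, which is why the high-power end of the curve is so costly.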

Figure 12

This efficiency challenge is captured in Figure 13, which shows the average energy dissipated per logic transition (total logic power divided by LTR) versus overall performance. These optimizations cross technology generations by also including the wire pitch and the halo behavior among the optimized variables, for a total of nine optimized variables, and they are performed for four-core processor chips. Clearly, the very high-power designs are quite energy-inefficient, on a logic transition basis, compared to what is possible at lower power. The knee in this curve is very interesting, because it turns out to be within a factor of ~3 of both the lowest-energy designs and the highest-performance designs. Architectural innovations may allow most applications below and above the knee of this curve to efficiently utilize the device design at the knee, making the knee a very important technology design point.
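
The energy metric used here, total logic power divided by LTR, is a simple ratio, and a short sketch may make the units concrete. The function name and the two design points below are hypothetical and purely illustrative; they are not data read from Figure 13.

```python
def energy_per_transition(total_logic_power_w, logic_transition_rate_hz):
    """Average energy per logic transition in joules: logic power / LTR."""
    if logic_transition_rate_hz <= 0:
        raise ValueError("logic transition rate must be positive")
    return total_logic_power_w / logic_transition_rate_hz


if __name__ == "__main__":
    # Hypothetical design points: (label, logic power in W, LTR in transitions/s).
    designs = [
        ("lower-power design", 50.0, 2.0e16),
        ("higher-power design", 200.0, 4.0e16),
    ]
    for label, power_w, ltr_hz in designs:
        energy_fj = energy_per_transition(power_w, ltr_hz) * 1e15
        print(f"{label}: {energy_fj:.2f} fJ per transition")
```

In this made-up example the higher-power design doubles the transition rate but needs four times the power, so its energy per transition doubles. That is the kind of inefficiency the text describes for the high-power end of the curve.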

Figure 13

One way to address the energy inefficiency of the high-power design points involves the use of smaller, lower-power cores in parallel. This is examined in Figure 14, in which a fixed number of transistors is divided into different numbers of processor cores. The cases with the higher numbers of cores have higher performance because the smaller cores result in relatively less wiring capacitance due to the shorter wires. This basic performance increase must of course be adjusted for architecture and system effects associated with increased parallelism, but these issues are outside the scope of this work.
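
The wire-shortening argument can be sketched with a simple geometric scaling. The assumptions here are ours and are not taken from the optimizer: core area is taken to be proportional to the number of transistors per core, and the characteristic wire length is taken to scale as the square root of the core area. The function name is likewise hypothetical.

```python
import math


def relative_wire_length(num_cores):
    """Characteristic wire length relative to a single-core design.

    Assumes a fixed total transistor count, core area proportional to
    transistors per core, and wire length proportional to sqrt(core area).
    These are illustrative assumptions, not the paper's wiring model.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    return 1.0 / math.sqrt(num_cores)


if __name__ == "__main__":
    for cores in (1, 2, 4, 8, 16):
        print(f"{cores:2d} cores -> relative wire length {relative_wire_length(cores):.3f}")
```

Under these assumptions, splitting the same transistors into four cores halves the characteristic wire length, and with it the wiring capacitance per wire to first order.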

Figure 14

Conclusion

Our results show very clearly that power constraints have a great effect on technology scaling. It is no longer possible to scale CMOS technology from one generation to another without taking into account power dissipation. Many of the proposed technology enhancements do show promise, but careful optimizations are necessary at every power level to ensure that the most appropriate technology is being used. The optimizations show that there is still room for significant progress in high-performance CMOS out to at least the 32-nm generation, especially for high-power applications, but low-power applications will require less-scaled devices. In the future, the dominant CMOS market may include technologies such as those with characteristics near the knee of Figure 13; however, smaller markets will undoubtedly continue to exist for both high-performance logic and very low-power technology.

Acknowledgments

The authors wish to thank M. Wisniewski, M. Scheuermann, P. Restle, S. Kosonocky, E. Colgan, and J. Magerlein for useful discussions and information.

References


Footnote

1. The International Technology Roadmap for Semiconductors (ITRS) is an assessment of semiconductor technology requirements and is a cooperative effort of manufacturers, suppliers, government organizations, and universities.

Received October 3, 2005; accepted for publication December 22, 2005; Published online August 6, 2006.

