IBM Journal of Research and Development  
Volume 46, Numbers 2/3, 2002
Scaling CMOS to the Limits
DOI: 10.1147/rd.462.0245
   

Interconnect opportunities for gigascale integration

by J. D. Meindl, J. A. Davis, P. Zarkesh-Ha, C. S. Patel, K. P. Martin, and P. A. Kohl
Throughout the past four decades, semiconductor technology has advanced at exponential rates in both productivity and performance. In recent years, multilevel interconnect networks have become the primary limit on the productivity, performance, energy dissipation, and signal integrity of gigascale integration. Consequently, a broad spectrum of novel solutions to the multifaceted interconnect problem must be explored. Here we review recent salient results of this exploration. Based upon prediction of the complete stochastic signal interconnect length distribution of a megacell, optimal reverse scaling of each pair of wiring levels provides a prime opportunity to minimize cell area, clock period, power dissipation, or number of wiring levels. Using a heterogeneous version of Rent's rule, a design methodology for the global signal, clock, and power/ground distribution networks for a system-on-a-chip has been derived. Wiring area, bandwidth, and signal integrity are the prime constraints on the design of the networks. Three-dimensional integration offers the opportunity to reduce the length of the longest global interconnects in a distribution by as much as 75%. Wafer-level batch fabrication of chip input/output interconnects and chip scale packages provides new benefits such as I/O bandwidth enhancement, simultaneous switching-noise reduction, and lower cost of packaging and testing. Microphotonic interconnects have long-term potential to reduce latency, power dissipation, and crosstalk while increasing bandwidth.

1. Introduction

Semiconductor productivity and performance have increased at exponential rates in the last forty years. Three generic strategies have guided these advances: 1) scaling down minimum feature size, 2) increasing die size, and 3) enhancing packing efficiency (defined as the number of transistors or length of interconnect per minimum feature square of silicon area). Scaling of transistors reduces their cost, intrinsic switching delay, and energy dissipation per binary transition. Scaling of interconnects serves to reduce cost but increases latency (response time) in absolute value and energy dissipation relative to that of transistors. These increases result from relatively larger average interconnect lengths (measured in gate pitches) and larger die sizes for successive generations. Therefore, interconnects have become the primary limit on both the performance and the energy dissipation of gigascale integration (GSI).

Following this brief introduction, Section 2 quantifies the key facets of the interconnect problem. The principal generic opportunities to resolve it, including new materials and processes, scaling, and novel architectures, are reviewed in Section 3 with emphasis on scaling. Reverse scaling of multilevel interconnect networks is based upon prediction of stochastic signal wiring distributions to achieve minimum area, power dissipation, clock period, or number of metal levels. A methodology to derive an integrated architecture for global signal, power, and clock distribution networks for a system-on-a-chip is reviewed in Section 4. Sections 5, 6, and 7 explore three unconventional approaches to alleviating the on-chip interconnect problem. These are respectively novel three-dimensional structures, high-density input/output interconnect enhancements, and compatible microphotonic interconnects. A brief conclusion is provided in Section 8.

2. The interconnect problem

What is the quintessential purpose of an interconnect? In a single word, it is communication. To give a more complete definition, it is communication between distant points with small latency. A lucid illustration that displays this key purpose is a graph whose vertical axis is reciprocal interconnect length squared and whose horizontal axis is interconnect latency [1]. Using logarithmic scales on both axes, a diagonal line is a locus of constant distributed resistance–capacitance product, the principal figure of merit of the large majority of interconnects used for GSI. As illustrated in Figure 1, reducing the distributed resistance–capacitance product moves the diagonal locus toward the lower left corner of the figure, providing smaller latency for a given interconnect length. However, during the past four decades interconnect scaling has increased the distributed resistance–capacitance product, moving toward the upper right corner of the figure and therefore demanding larger latency for a given interconnect length. In stark contrast, scaling of transistors reduces the power–delay product or switching energy of a binary transition, therefore moving toward the lower left corner of the power–delay plane to reduce simultaneously both average power transfer and delay.

Figure 1

To quantify the exploding disparity between the latency of interconnects and transistors, consider the comparisons illustrated in Table 1. For the 1-µm-generation technology of the late 1980s, the “CV/I,” or intrinsic switching delay of a MOSFET [2] before it is loaded with parasitic or wiring capacitance, is approximately 20 ps. However, for the same generation, the total resistance–capacitance product or RC delay of a “benchmark” 1.0-mm-long interconnect is about 1.0 ps. In comparison, for the 100-nm generation projected for early production in 2005, the CV/I delay of a MOSFET decreases to 5 ps, while the RC latency of a 1.0-mm-long wire increases to 30 ps. The relevant observation is that as semiconductor technology is advancing from the 1.0-µm to the 100-nm generation, the RC delay or response time of a benchmark 1.0-mm-long interconnect is devolving from 20 times faster to six times slower than transistor intrinsic switching delay. Furthermore, the 1999 International Technology Roadmap for Semiconductors (ITRS) projection for 35-nm technology in 2014 suggests a 2.5-ps transistor delay and a 250-ps RC latency for a 1.0-mm-long interconnect [3]. For completeness, the time of flight (ToF) of a 1.0-mm-long interconnect is included in Table 1. As indicated, ToF delay is independent of scaling but does depend on the value of the relative permittivity of the interconnect dielectric.


Table 1   MOSFET and interconnect latency for 1.0-µm, 100-nm, and 35-nm technology generations [3].

| Technology generation | MOSFET switching delay, td = CV/I (ps) | RC response time, Lint = 1 mm (ps) | Time of flight, Lint = 1 mm (ps) |
|---|---|---|---|
| 1.0 µm (Al, SiO2) | ~20 | ~1 | ~6.6 |
| 100 nm (Cu, k = 2.0) | ~5 | ~30 | ~4.6 |
| 35 nm (Cu, k = 2.0) | ~2.5 | ~250 | ~4.6 |
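
The time-of-flight column and the RC-to-transistor delay ratios quoted in the text can be checked with a short Python sketch. The relative permittivity of 3.9 for SiO2 is an assumption (a commonly used value that the text does not state), and the function and variable names are hypothetical. With k = 2.0 the sketch gives about 4.7 ps, slightly above the ~4.6 ps listed in Table 1.

```python
import math

C0 = 299_792_458.0  # speed of light in free space, m/s


def time_of_flight_ps(length_m, eps_r):
    """Time of flight of a line of given length in a dielectric, in ps."""
    return length_m * math.sqrt(eps_r) / C0 * 1e12


# (generation, MOSFET delay in ps, RC delay of 1 mm in ps, assumed eps_r)
# Delays are the Table 1 values; eps_r = 3.9 for SiO2 is an assumption.
rows = [
    ("1.0 um", 20.0, 1.0, 3.9),
    ("100 nm", 5.0, 30.0, 2.0),
    ("35 nm", 2.5, 250.0, 2.0),
]

for name, t_mosfet, t_rc, eps_r in rows:
    ratio = t_rc / t_mosfet  # < 1: wire faster than transistor
    tof = time_of_flight_ps(1e-3, eps_r)
    print(f"{name}: RC/MOSFET = {ratio:g}, ToF = {tof:.1f} ps")
```

A ratio of 0.05 corresponds to the "20 times faster" wire of the 1.0-µm generation, and a ratio of 6 to the "six times slower" wire of the 100-nm generation.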

To underscore the formidable challenge presented by interconnects to continued performance improvements for GSI, it is noteworthy that the numerical values for RC delay cited in Table 1 represent simple best-case calculations. For example, the results do not account for the adverse results of surface scattering, high-frequency skin effect, liner thickness for copper interconnects, or temperature rises in a multilevel wiring network.

Beyond latency, interconnects present an energy-dissipation problem illustrated in Table 2 that also limits the performance of GSI as a consequence of practical constraints on the heat-removal capacity of the package of a gigascale chip or the energy-storage capacity of its portable power source. Again comparing technology generations, it is evident that the energy dissipation associated with a binary transition of a minimum-geometry MOSFET versus a 1.0-mm-long interconnect is respectively 33%, five times, and thirty times larger for the interconnect for the 1.0-µm, 100-nm, and 35-nm-technology generations. These gross imbalances clearly indicate that the power-dissipation problem of gigascale chips is essentially an interconnect problem.


Table 2   ITRS projections for switching delay, switching energy, clock frequency, total chip current drain, maximum number of wiring levels, maximum total wire length per chip, and chip pad count for 1.0-µm, 100-nm, and 35-nm technology generations [3].

| Parameter | 1.0 µm | 100 nm | 35 nm |
|---|---|---|---|
| MOSFET switching delay (ps) | ~20 | ~5 | ~2.5 |
| Interconnect RC response time, Lint = 1 mm (ps) | ~1 | ~30 | ~250 |
| MOSFET switching energy (fJ) | ~300 | ~2 | ~0.1 |
| Interconnect switching energy (fJ) | ~400 | ~10 | ~3 |
| Clock frequency | ~30 MHz | ~2–3.5 GHz | ~3.6–13.5 GHz |
| Supply current (A), Vdd = 5.0, 1.0, 0.5 V | ~2.5 | ~150 | ~360 |
| Maximum number of wiring levels | 3 | 8–9 | 10 |
| Maximum total wire length per chip (m) | ~100 | ~5000 | – |
| Chip pad count | ~200 | ~3000–4000 | 4000–4400 |

The preceding discussion of latency and energy-dissipation problems presented by interconnects is concerned with signal wiring. Historical records and ITRS projections [3] of clock frequencies for high-performance microprocessors summarized in Table 2 indicate 30 MHz, 3.0 GHz, and 13 GHz as the respective nominal clock frequencies for the 1.0-µm, 100-nm, and 35-nm-technology generations. These rapidly escalating requirements place enormous new demands on the interconnects that implement clock distribution networks of gigascale chips. Bandwidth, power dissipation, variation in the time of arrival of a clock pulse at different points on a chip (skew), and differences in clock pulse width (jitter) represent increasingly formidable issues.

Although gigascale signal and clock distribution network problems are daunting, power distribution may well match them in difficulty. As noted in Table 2, estimated maximum chip current drain is respectively 2.5 A, 150 A, and 360 A for the 1.0-µm, 100-nm, and 35-nm-technology generations. Concurrently, power-supply voltage scales down from 5.0 V to 1.0 V to 0.5 V for the corresponding generations. These aggressive expectations for high-current, low-voltage power distribution impose utterly unprecedented demands on interconnect networks.
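
The energy ratios and supply figures quoted from Table 2 can be reproduced with a few lines of Python. This is only a sketch: the dictionary keys and function names are hypothetical, and the chip power P = Vdd × I is a derived quantity computed here for illustration rather than a value listed in the table.

```python
# Values copied from Table 2; key names are illustrative only.
generations = {
    "1.0 um": {"e_mosfet_fj": 300.0, "e_wire_fj": 400.0, "vdd_v": 5.0, "i_a": 2.5},
    "100 nm": {"e_mosfet_fj": 2.0, "e_wire_fj": 10.0, "vdd_v": 1.0, "i_a": 150.0},
    "35 nm": {"e_mosfet_fj": 0.1, "e_wire_fj": 3.0, "vdd_v": 0.5, "i_a": 360.0},
}


def energy_ratio(gen):
    """Interconnect switching energy divided by MOSFET switching energy."""
    return gen["e_wire_fj"] / gen["e_mosfet_fj"]


def supply_power_w(gen):
    """Chip power implied by supply voltage times supply current (derived)."""
    return gen["vdd_v"] * gen["i_a"]


for name, gen in generations.items():
    print(f"{name}: E_wire/E_mosfet = {energy_ratio(gen):.2f}, "
          f"P = {supply_power_w(gen):.1f} W")
```

The first ratio, about 1.33, is the "33% larger" figure; the other two are the fivefold and thirtyfold imbalances cited in the text.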

Finally, the targets for number of wiring levels, maximum total interconnect length, and number of bonding pads or input/output interconnects per chip cited in Table 2 add significantly to expectations for future interconnect capabilities. In short, the highly demanding requirements that are projected for on-chip wiring compel comprehensive research over the most extensive and multidimensional solution space that can be conceived.

3. Reverse scaling

Approximate expressions for the latency ($\tau$) of a single isolated interconnect that is RC limited with an ideal return path are given by

$$\tau_{90\%} \cong r_{int} c_{int} L^2 + 2.3 R_{tr} c_{int} L + 2.3 C_L (r_{int} L + R_{tr}), \tag{1a}$$

$$\tau_{90\%} \cong r_{int} c_{int} L^2 + 2.3 R_{tr} c_{int} L \quad \text{for} \quad C_L \ll c_{int} L, \tag{1b}$$

and

$$\tau_{90\%} \cong r_{int} c_{int} L^2 \quad \text{for} \quad C_L \ll c_{int} L \quad \text{and} \quad R_{tr} \ll r_{int} L, \tag{1c}$$

where $r_{int}$ and $c_{int}$ are the interconnect resistance and capacitance per unit length, respectively, $R_{tr}$ is the source resistance, $C_L$ is the load capacitance, and $L$ is the interconnect length. The latency of a low-resistance interconnect that is resistance-, inductance-, and capacitance- or RLC-limited is given by
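
As a rough illustration of Equation (1a) and its limiting cases, the following Python sketch evaluates the 90% response time of an RC-limited line. The function name and the sample values are hypothetical; the per-unit-length resistance and capacitance were chosen only so that a 1.0-mm wire lands at the ~30 ps listed in Table 1 for the 100-nm generation.

```python
def tau90(r_int, c_int, length, r_tr=0.0, c_load=0.0):
    """90% response time from Eq. (1a), in seconds (SI units throughout).

    With c_load = 0 this reduces to Eq. (1b); with r_tr = 0 as well,
    it reduces to the distributed-RC term of Eq. (1c).
    """
    wire_term = r_int * c_int * length ** 2
    driver_term = 2.3 * r_tr * c_int * length
    load_term = 2.3 * c_load * (r_int * length + r_tr)
    return wire_term + driver_term + load_term


# Hypothetical wire: 150 ohm/mm and 0.2 fF/um, expressed per metre.
R_INT = 1.5e5   # ohm/m (assumed)
C_INT = 2.0e-10  # F/m (assumed)

bare_wire = tau90(R_INT, C_INT, 1e-3)
with_driver = tau90(R_INT, C_INT, 1e-3, r_tr=1e3, c_load=1e-15)
print(f"Eq. (1c) limit: {bare_wire * 1e12:.1f} ps")
print(f"Eq. (1a) with driver and load: {with_driver * 1e12:.1f} ps")
```

Because the wire term grows as the square of length, doubling the length quadruples the Eq. (1c) latency.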

$$\tau_{90\%} \cong \mathrm{ToF} = \frac{L}{c_0/\sqrt{\varepsilon_r}}, \tag{2a}$$

where

$$\frac{R_{int}}{Z_0} \le 2 \ln\left(\frac{4 Z_0}{R_{tr} + Z_0}\right), \quad R_{tr} < 3 Z_0, \quad \text{and} \quad C_L \ll c_{int} L \tag{2b}$$

are required for ToF response. In Equations (2), $Z_0$ is the characteristic impedance and $R_{int} = r_{int} L$ is the total resistance of the interconnect; $c_0$ is the velocity of light in free space, and $\varepsilon_r$ is the relative permittivity of the interconnect insulator. Since RC-limited performance is far more common than ToF limitations, the RC case is considered in this section.
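
The three conditions of Equation (2b) can be collected into a small predicate. The sketch below is illustrative only: the function name is hypothetical, and the numerical threshold used to interpret "much less than" for the load capacitance (`load_ratio`, default 0.1) is an assumption, since the text gives no specific factor.

```python
import math


def is_tof_limited(r_total, z0, r_tr, c_load, c_wire_total, load_ratio=0.1):
    """Check the Eq. (2b) conditions for time-of-flight response.

    r_total      -- total line resistance R_int = r_int * L (ohm)
    z0           -- characteristic impedance (ohm)
    r_tr         -- source (driver) resistance (ohm)
    c_load       -- load capacitance (F)
    c_wire_total -- total line capacitance c_int * L (F)
    load_ratio   -- assumed threshold standing in for "much less than"
    """
    resistance_ok = r_total / z0 <= 2.0 * math.log(4.0 * z0 / (r_tr + z0))
    driver_ok = r_tr < 3.0 * z0
    load_ok = c_load <= load_ratio * c_wire_total
    return resistance_ok and driver_ok and load_ok


# Hypothetical 50-ohm line driven by a matched source.
print(is_tof_limited(r_total=20.0, z0=50.0, r_tr=50.0,
                     c_load=1e-15, c_wire_total=2e-13))
print(is_tof_limited(r_total=100.0, z0=50.0, r_tr=50.0,
                     c_load=1e-15, c_wire_total=2e-13))
```

In the second call the line resistance is too high relative to the characteristic impedance, so the line would be treated as RC limited.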

The simple relationship given by Equation (1c) serves as the basis for reviewing the principal generic opportunities for solving the key latency problem. The latency of an RC-limited interconnect can be expressed as the product of three factors, as indicated in Equation (3):

$$\tau = [\rho\varepsilon] \left[\frac{1}{HT}\right] [L^2]. \tag{3}$$

The resistivity–permittivity factor $[\rho\varepsilon]$ identifies opportunities to reduce latency through new materials and processes such as the replacement of aluminum with copper [4]. The $[1/HT]$ factor, where $H$ defines metal height and $T$ defines insulator thickness, represents device- and circuit-level [1] opportunities to reduce latency through reverse scaling. Finally, $L$ defines the length of an interconnect, and the $[L^2]$ factor represents system-level [5] opportunities to improve latency through the use of new microarchitectures that serve to “keep interconnects short.” Solutions to the latency problem must be pursued at each of the levels represented in Equation (3): material and process, device, circuit, and system [1]. The scope of this section is confined to device-, circuit-, and system-level opportunities to reduce latency through reverse scaling. In comparison to alternatives such as new materials and processes as well as novel architectures, the compelling advantages of reverse scaling are 1) minimal time to implementation, 2) low cost of implementation, 3) low risk, and 4) high payoff.
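
The factorization in Equation (3) can be traced back to Equation (1c) under a simple parallel-plate idealization. The following is a sketch rather than the authors' derivation: the wire width $W$ and the neglect of fringing and line-to-line capacitance are assumptions introduced here.

$$r_{int} = \frac{\rho}{WH}, \qquad c_{int} \approx \frac{\varepsilon W}{T} \quad \Longrightarrow \quad \tau \cong r_{int} c_{int} L^2 = \frac{\rho}{WH} \cdot \frac{\varepsilon W}{T} \cdot L^2 = [\rho\varepsilon] \left[\frac{1}{HT}\right] [L^2].$$

In this idealization the width $W$ cancels, so latency for a fixed length is set by the material product $\rho\varepsilon$ and by the vertical dimensions $H$ and $T$, which is why enlarging cross-sectional dimensions on upper wiring levels (reverse scaling) reduces latency.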

The key to optimal reverse scaling is the capability to predict the complete stochastic interconnect density distribution for a projected next-generation product. Consider the case of a macrocell consisting of a random logic network of N microcells or logic gates. As illustrated in Figure 2, the macrocell can be modeled as a square array of logic gates. Rent's rule ($R = kN^p$) [6] and the principle of conservation of interconnects are applied recursively to the macrocell, as indicated in Figure 2. A closed-form expression for the complete stochastic signal wiring distribution resulting from this process is given [7] by the following:

Region 1: $1 \le L < \sqrt{N}$,

$$f(L) = \Gamma \frac{\alpha k}{2} \left( \frac{L^3}{3} - 2\sqrt{N} L^2 + 2NL \right) L^{2p-4}; \tag{4a}$$

Region 2: $\sqrt{N} \le L \le 2\sqrt{N}$,

$$f(L) = \Gamma \frac{\alpha k}{6} \left( 2\sqrt{N} - L \right)^3 L^{2p-4}; \tag{4b}$$

$$\Gamma = \frac{2N(1 - N^{p-1})}{-N^p \dfrac{1 + 2p - 2^{2p-1}}{p(2p-1)(p-1)(2p-3)} - \dfrac{1}{6p} + \dfrac{2\sqrt{N}}{2p-1} - \dfrac{N}{p-1}}. \tag{4c}$$

Equation (4a) applies to shorter interconnects and Equation (4b) to longer interconnects in the distribution. These expressions reveal the dependence of interconnect density [f(L) in units of number of interconnects of length L per gate pitch] versus interconnect length L in gate pitches. The dependencies on interconnect length L, number of gates in the network N, Rent's coefficient k, and Rent's exponent p are evident. As demonstrated in Figure 3, this stochastic wiring distribution is found to be in close agreement with experimental data characterizing commercial products [7]. The key to obtaining close agreement between predicted and actual wiring distributions for a new product is to derive appropriate values of Rent's coefficient k and exponent p using data from previous generations of a product family. These two empirical parameters appear to have genetic characteristics.
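
A minimal Python sketch of Equations (4a) through (4c) is given below, together with two numerical sanity checks: continuity of f(L) at L = √N, and the total interconnect count obtained by integrating f(L). Several things here are assumptions rather than statements from the text: the parameter values (N, k, p, α) are hypothetical, α is treated simply as an input constant, and the expected total αkN(1 − N^(p−1)) is the normalization that the factor Γ appears designed to enforce. The expressions are singular at p = 0.5, 1, and 1.5, so those values must be avoided.

```python
import math


def gamma(n_gates, p):
    """Normalization factor of Eq. (4c)."""
    root_n = math.sqrt(n_gates)
    numerator = 2 * n_gates * (1 - n_gates ** (p - 1))
    denominator = (
        -(n_gates ** p) * (1 + 2 * p - 2 ** (2 * p - 1))
        / (p * (2 * p - 1) * (p - 1) * (2 * p - 3))
        - 1 / (6 * p)
        + 2 * root_n / (2 * p - 1)
        - n_gates / (p - 1)
    )
    return numerator / denominator


def density(length, n_gates, k, p, alpha):
    """Interconnect density f(L) of Eqs. (4a)/(4b); L in gate pitches."""
    root_n = math.sqrt(n_gates)
    prefactor = gamma(n_gates, p) * alpha * k * length ** (2 * p - 4)
    if 1 <= length < root_n:  # Region 1, Eq. (4a)
        poly = length ** 3 / 3 - 2 * root_n * length ** 2 + 2 * n_gates * length
        return prefactor / 2 * poly
    if root_n <= length <= 2 * root_n:  # Region 2, Eq. (4b)
        return prefactor / 6 * (2 * root_n - length) ** 3
    return 0.0


def simpson_log(func, a, b, steps=2000):
    """Simpson's rule on a logarithmic grid (steps must be even)."""
    ua, ub = math.log(a), math.log(b)
    h = (ub - ua) / steps
    total = 0.0
    for i in range(steps + 1):
        x = math.exp(ua + i * h)
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * func(x) * x  # dL = L du for L = exp(u)
    return total * h / 3


# Hypothetical parameters, for illustration only.
N, K, P, ALPHA = 10_000, 4.0, 0.6, 0.75
ROOT_N = math.sqrt(N)


def f(length):
    return density(length, N, K, P, ALPHA)


total_wires = simpson_log(f, 1.0, ROOT_N) + simpson_log(f, ROOT_N, 2 * ROOT_N)
expected = ALPHA * K * N * (1 - N ** (P - 1))
print(f"integrated: {total_wires:.1f}, expected: {expected:.1f}")
```

The logarithmic grid is a design choice: f(L) falls steeply near L = 1, so spacing the samples evenly in ln L resolves the short-wire end of the distribution without an excessive number of points.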

Figure 2
Figure 3

An optimal architecture for a multilevel interconnect network that minimizes macrocell area, power dissipation, clock cycle time, or number of wiring levels can be derived using the stochastic interconnect distribution given by Equation (4). A derivation for minimum macrocell area begins with the formulation of a wiring area “supply and demand” equation (5a) [8]:

$$2 e_w A_m = \chi\, p_n \sqrt{\frac{A_m}{N}} \int_{L_{n-1}}^{L_n} L f(L)\, dL; \tag{5a}$$

$$p_n = 2 \sqrt{\frac{1.1\,(6.2)\,\rho\,\varepsilon_r \varepsilon_0\, f_c}{\beta}}\; \sqrt{\frac{A_m}{N}}\; L_n; \tag{5b}$$

$$p_n = 2.5\, \frac{2 f_c}{\beta} \sqrt{6.2\,\rho\,\varepsilon_r \varepsilon_0 R_0 C_0}\; \sqrt{\frac{A_m}{N}}\; L_n. \tag{5c}$$

The available area for an orthogonal pair of wiring levels can be expressed as $2e_wA_m$, where $e_w$ is a wiring efficiency factor that must be estimated from previous designs and $A_m$ is the area of the macrocell. The required area is defined by the right-hand side of Equation (5a), where $\chi < 1$ converts point-to-point wire length to net length [7]. (Net length is the total length of wiring that connects the output terminal of a driver gate to the inputs of its load gates.) The factor $p_n$ is wire pitch, the square-root factor $\sqrt{A_m/N}$ is gate pitch (in cm), and the integral represents the total length of wire in gate pitches between its upper ($L_n$) and lower ($L_{n-1}$) length limits. On the basis of a distributed RC network model, Equation (5b) imposes a latency requirement on the longest interconnect (of length $L_n$) on a given pair of wiring levels. The required latency is expressed by $\beta/f_c$, where $\beta < 1$ and $1/f_c$ is the clock period. In essence, the first and second equations are solved simultaneously for the minimum pitch $p_n$ and maximum corresponding wire length $L_n$ for each pair of wiring levels until $L_n$ equals the maximum required wire length of the macrocell on its top pair of wiring levels. Equations (5a) and (5b) are solved simultaneously if repeaters are not used, while Equations (5a) and (5c) apply if optimal repeaters are used [8, 9]. The parameters $R_0$ and $C_0$ respectively represent the output resistance and input capacitance of a minimum-geometry MOSFET used as the basis for the repeater circuits [10].
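
The latency constraints of Equations (5b) and (5c) can be sketched as two small Python functions that return the wire pitch needed for a longest wire of L_n gate pitches. This covers only the pitch side of the problem, not the simultaneous solution with Equation (5a). All numerical inputs below (resistivity, β, repeater R0 and C0, area, gate count) are hypothetical, SI units are used throughout instead of the centimetres quoted in the text, and the outputs are not meant to reproduce the architectures of Figure 4.

```python
import math

EPS0 = 8.854e-12  # permittivity of free space, F/m


def gate_pitch(area_m2, n_gates):
    """Gate pitch sqrt(A_m / N), in metres."""
    return math.sqrt(area_m2 / n_gates)


def pitch_no_repeaters(l_n, area_m2, n_gates, f_c, beta, rho, eps_r):
    """Wire pitch from Eq. (5b) for a longest wire of l_n gate pitches."""
    rc_factor = math.sqrt(1.1 * 6.2 * rho * eps_r * EPS0 * f_c / beta)
    return 2.0 * rc_factor * gate_pitch(area_m2, n_gates) * l_n


def pitch_with_repeaters(l_n, area_m2, n_gates, f_c, beta, rho, eps_r, r0, c0):
    """Wire pitch from Eq. (5c), assuming optimal repeaters."""
    rc_factor = math.sqrt(6.2 * rho * eps_r * EPS0 * r0 * c0)
    return 2.5 * (2.0 * f_c / beta) * rc_factor * gate_pitch(area_m2, n_gates) * l_n


# Hypothetical inputs, for illustration only.
params = dict(area_m2=1.0e-4, n_gates=1.0e7, f_c=1.0e9, beta=0.25,
              rho=1.7e-8, eps_r=2.0)

short = pitch_no_repeaters(1000, **params)
long_ = pitch_no_repeaters(2000, **params)
repeated = pitch_with_repeaters(2000, r0=1.0e4, c0=1.0e-16, **params)
print(f"no repeaters: {short * 1e9:.0f} nm -> {long_ * 1e9:.0f} nm")
print(f"with repeaters: {repeated * 1e9:.0f} nm")
```

Without repeaters the required pitch grows linearly with L_n and as the square root of the clock frequency; with repeaters it grows linearly with both, but with a prefactor set by the repeater time constant R0·C0.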

An example of minimization of macrocell area is illustrated in Figure 4(a). A random logic network consisting of 12.4 million gates implemented with 100-nm-generation technology using eight levels (n = 8) of copper interconnects and operating at a clock frequency $f_c$ = 578 MHz is considered. Two alternative wiring network architectures are compared. The first architecture (shown on the left) is restricted to two and only two different cross-sectional dimensions (or two tiers) for eight levels of wiring. It requires two levels of 100-nm wiring and six levels of 540-nm wiring as well as a macrocell area $A_m$ = 2.34 cm² to interconnect the macrocell. The second architecture (shown on the right) is optimized to use three tiers of wiring in order to minimize cell area. It therefore requires four levels of 100-nm wiring, two levels of 150-nm wiring, and two levels of 300-nm wiring, as well as a macrocell area $A_m$ = 0.70 cm². The decisive macrocell-area advantage of the three-tier architecture is achieved using the methodology defined in Equations (5a), (5b), and (5c), whose central feature is demand prediction based upon a complete stochastic wiring distribution f(L) [8, 9].

Figure 4

A second and currently more realistic example of an optimal multilevel network architecture is illustrated in Figure 4(b). In this case the macrocell consists of an 11.3-million-gate random logic network implemented with 100-nm technology using eight levels of copper wiring (n = 8) and operating at a clock frequency of 1.56 GHz. If the pitch is chosen a priori to double for every pair of levels, the resulting architecture consists of two levels each of 100-, 200-, 400-, and 800-nm wiring, which require a 1.45-cm² area. In contrast, using the methodology prescribed by Equations (5a) and (5b), the optimal wire-level-pair dimensions are 100, 130, 300, and 580 nm, yielding a macrocell area of 0.98 cm² or a reduction of approximately 32%. If 1.5 × 10⁶ optimal repeaters are used, the macrocell clock frequency can be increased to $f_c$ = 2.0 GHz and the area reduced to 0.48 cm², as illustrated in Figure 4(c) [11].

As indicated by Equation (5a), determination of the area available for signal wiring on an orthogonal pair of levels requires estimation of the wiring efficiency factor ew based on results of previous designs. As the number of wiring levels and the number of repeater circuits increase, via blockage tends to reduce wiring efficiency. The impact of via blockage can be estimated by calculation of a via blockage factor,

$$B_V = A_V / A_m, \tag{6a}$$

where AV is the area blocked by vias on a given level of wiring and Am is the macrocell or chip area. As illustrated in Figure 5(a), terminal vias (which connect a particular interconnect net to a transistor) cause a “ripple effect” that reduces the number of wiring tracks available in a given area. In contrast, turn vias (which connect two wiring levels) do not cause via blockage. To elucidate with a simple example illustrated in Figure 5(a), BV = 0 for the five uninterrupted wiring tracks on the left without terminal vias, and BV = 0.2 for the four wiring tracks on the right, where 20% of the available wiring area is blocked by terminal vias [three of which are shown in Figure 5(a)] [12]. Assuming a uniform distribution of terminal vias as illustrated in Figure 5(b), a general expression can be derived for BV in terms of the geometry of the wiring layout [12]:

$$B_V = \sqrt{\frac{N_V (2W + s\lambda)^2}{A_m}}, \tag{6b}$$

where $N_V$ is the total number of terminal vias for a particular metal level on a chip and $W$, $s$, and $\lambda$ are defined in Figure 5(b). The number of terminal vias $N_V$ for a given wiring level is determined by the total number of interconnects on wiring levels above the given wiring level using the methodology defined by Equations (4a), (4b), and (4c). From Equation (6), the via blockage factors for the eight wiring levels used in two macrocells (with F = 100 nm and N = 12.4 million gates) similar to those described in Figure 4 are illustrated in Figure 6 [12]. Figure 6 reveals two striking features of via blockage due to signal interconnects as predicted by the new model. First, via blockage is more problematic for relatively small-area macrocells because of their greater interconnect density. More significantly, via blockage is severe on only the first level of wiring, where 15–30% of the total wiring area of a representative macrocell can be blocked. The via blockage estimate based on a previous model [13] is also illustrated in Figure 6.
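
The two via-blockage expressions can be written as short Python helpers. This is a sketch under stated assumptions: the function names are hypothetical, and because W, s, and λ are defined only in Figure 5(b), the code treats them as plain numbers with the product s·λ expressed in the same length unit as W. The numerical inputs in the second call are arbitrary and serve only to exercise the formula.

```python
import math


def via_blockage_from_area(blocked_area, total_area):
    """Eq. (6a): fraction of wiring area blocked by terminal vias."""
    return blocked_area / total_area


def via_blockage(n_vias, w, s, lam, total_area):
    """Eq. (6b): blockage factor for uniformly distributed terminal vias.

    w and the product s * lam must share one length unit, and
    total_area must be in that unit squared (an assumption here).
    """
    return math.sqrt(n_vias * (2.0 * w + s * lam) ** 2 / total_area)


# One of five wiring tracks lost to terminal vias, as in the Figure 5(a) example.
print(via_blockage_from_area(1.0, 5.0))

# Arbitrary illustrative numbers, not taken from the text.
print(via_blockage(n_vias=4, w=1.0, s=1.0, lam=2.0, total_area=64.0))
```

Written this way, Equation (6b) makes the scaling explicit: for a fixed via footprint, blockage grows as the square root of the via density N_V/A_m, which is consistent with the observation that smaller macrocells suffer more.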

Figure 5
Figure 6

4. System-on-a-chip (SoC)

The previous section deals with reverse scaling of signal wiring for a macrocell that may be modeled as a largely homogeneous block of microcells. A second commonly encountered situation is a system-on-a-chip consisting of a number of heterogeneous megacells such as control logic networks, cache memory arrays, arithmetic logic units, and register files. Each of these megacells can be characterized by a particular equivalent number of gates NGi, Rent's coefficient Ki, and Rent's exponent Pi [14]. The question to be addressed is the following: How can the global signal, power, and clock distribution networks for the heterogeneous SoC be designed to 1) fit all of the global wiring into the top two metal levels, 2) meet the required system clock frequency, and 3) limit the crosstalk noise to a specified maximum value? An initial response to this question follows.

The methodology begins by engaging a recently derived heterogeneous version of Rent's rule [14]. For the heterogeneous system-on-a-chip illustrated in Figure 7, this expanded version is defined by

$$T_{eq} = K_{eq} N^{P_{eq}}, \tag{7a}$$

where

$$K_{eq} = \left( \prod_{i=1}^{n} K_i^{N_{Gi}} \right)^{1/N_{Geq}}, \tag{7b}$$

$$P_{eq} = \frac{\sum_{i=1}^{n} P_i N_{Gi}}{N_{Geq}}, \tag{7c}$$

and

$$N_{Geq} = \sum_{i=1}^{n} N_{Gi}. \tag{7d}$$

In this power-law relationship (7a), Rent's coefficient $K_{eq}$ is expressed as a weighted geometric average (7b) and Rent's exponent $P_{eq}$ as a weighted arithmetic average (7c). Heterogeneous Rent's rule is used to derive three probability density distributions as summarized in Figure 8 [14]. The first is a net fan-out (FO) distribution that defines the number of nets $N_{Net}(m)$ versus the number of net terminals m = FO + 1, where $N_m$ is the total number of megacells in the SoC. The second is a net bounding area distribution that describes the number of nets versus the average net bounding area for nets with a specific number of terminals m. The average bounding area dimension of a square net a(m) is shown in Figure 8, where $\eta_p$ is an empirical placement efficiency factor that is estimated on the basis of previous designs [14]. The third distribution is an average net length distribution that describes the number of nets versus average net length for nets with a specific number of terminals m. An expression for the average value of net length $L_{av}(m)$ is given in Figure 8. These three distributions are combined to derive the total global signal wiring requirement $L_{tot}$ as shown in Figure 8 [14].
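
The weighted averages of Equations (7b) through (7d) are easy to sketch in Python. The megacell values below are hypothetical, the function names are illustrative, and evaluating Equation (7a) at N = N_Geq is an assumption about how the power law is meant to be applied. The geometric average is computed in the logarithmic domain to avoid overflow when gate counts are large.

```python
import math


def heterogeneous_rent(megacells):
    """Return (K_eq, P_eq, N_Geq) for a list of (n_gates, k, p) megacells.

    K_eq is the gate-count-weighted geometric mean, Eq. (7b);
    P_eq is the gate-count-weighted arithmetic mean, Eq. (7c);
    N_Geq is the total equivalent gate count, Eq. (7d).
    """
    n_eq = sum(n for n, _, _ in megacells)
    log_k_eq = sum(n * math.log(k) for n, k, _ in megacells) / n_eq
    p_eq = sum(n * p for n, _, p in megacells) / n_eq
    return math.exp(log_k_eq), p_eq, n_eq


def terminals(k_eq, p_eq, n):
    """Eq. (7a): terminal count T_eq = K_eq * N ** P_eq."""
    return k_eq * n ** p_eq


# Hypothetical SoC with two megacells: (gate count, Rent k, Rent p).
cells = [(1.0e6, 4.0, 0.6), (3.0e6, 2.0, 0.4)]
k_eq, p_eq, n_eq = heterogeneous_rent(cells)
print(f"K_eq = {k_eq:.3f}, P_eq = {p_eq:.3f}, N_Geq = {n_eq:.0f}")
print(f"T_eq at N = N_Geq (assumed): {terminals(k_eq, p_eq, n_eq):.0f}")
```

For a homogeneous system, where every megacell shares the same k and p, both averages collapse to those values and the conventional form of Rent's rule is recovered.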

Figure 7
Figure 8

Figure 9 summarizes this new methodology and compares model predictions with data from a commercial product. The graph in Figure 9 plots number of interconnect nets per mm or net density versus average interconnect net length in mm. The first dashed locus describes the density of nets with a fan-out of 1; the second describes nets with a fan-out of 2; the third a fan-out of 3, etc. The solid locus is the total interconnect net density distribution in number of nets per mm versus average interconnect length as calculated using the new model. The open circles represent data describing a commercial microprocessor consisting of 20 heterogeneous megacells [14].

Figure 9

In essence, the summation in Figure 8 defines the total length of global signal wiring required for a heterogeneous SoC. The next wiring resource requirement that must be defined concerns power distribution. Figure 10 presents the results of modeling the required area for power distribution, $A_{power}$, for the cases of peripheral bonding pads or an area array of bonding pads. For peripheral bonding pads, it is assumed that an equipotential ring surrounds the chip, as illustrated in Figure 10(a). For area array bonding pads, illustrated in Figure 10(b), it is assumed that $V_{dd}$ is the potential of each bonding pad and that the current drain at each orthogonal intersection of the power grid lines is constant. $A_{SoC}$ is the total SoC area [14]. In Figure 10, $\delta = \Delta V_{dd}/V_{dd}$ is the normalized voltage drop from a bonding pad to the most distant via at the intersection of an orthogonal pair of power grid lines, $V_{dd}$ is supply voltage, $H$ is metal height, $P_{tot}$ is total chip power dissipation, and $\rho_w$ is metal resistivity. Note that $A_{power}$ for area array bonding pads can be reduced effectively by increasing the number of bonding pads, $n_{pad}$.

Figure 10

The most critical clock distribution network requirement that must be met is imposed by the bandwidth necessary for rapid transitions of the clock waveform. It is assumed that global clock distribution is implemented with a balanced H-array. This array is modeled as a distributed RC network whose maximum length extends from the chip clock input pad to a terminal buffer/repeater of the global H-array. The approximate value of this maximum length is the dimension of the chip edge $l$. Figure 11 defines the clock frequency limit $f_{Clock}$ as a function of chip area $A_{SoC} = l^2$ [14].

Figure 11

The final performance requirement that is imposed on the global wiring network is a crosstalk noise limit. A model used for an approximate calculation of global crosstalk noise is illustrated in Figure 12. In this representation, a global signal line or “victim” is assumed to be surrounded by two near and two far “attackers.” Simultaneous in-phase switching of the four attackers causes crosstalk noise on the victim due to coupling of both mutual capacitance and mutual inductance. A nearby high-quality return path is assumed to be available. Treating the five coupled lines as distributed RLC networks, a set of partial differential equations quantifies the problem [15, 16]. Some results of a solution to this set of equations are illustrated in Figure 13, which plots the ratio of crosstalk-to-binary signal voltage swing versus time [16]. Comparing the three- and five-line loci, it is evident that in the presence of a nearby high-quality return path, the near attackers shield the victim from the far attackers. Therefore, using the three-line model, simplified expressions for peak crosstalk voltage derived from the solutions of the set of partial differential equations are

$$\frac{V_n}{V_{dd}} \cong \frac{1}{2} \cdot \frac{c_{mutual}}{c_{line} + c_{mutual}} \tag{8a}$$

for distributed RC models [17] and

$$\frac{V_n}{V_{dd}} \cong \frac{\pi}{2} \cdot \frac{1}{2} \cdot \frac{c_{mutual}}{c_{line} + c_{mutual}} \tag{8b}$$

for distributed RLC models [15, 16], where $c_{mutual}$ is the line-to-line distributed capacitance and $c_{line}$ is the line-to-return-path distributed capacitance.
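
Equations (8a) and (8b) differ only by a factor of π/2, which the following Python sketch makes explicit. The function names and the sample capacitances are hypothetical.

```python
import math


def peak_crosstalk_rc(c_mutual, c_line):
    """Eq. (8a): peak crosstalk as a fraction of Vdd, distributed RC model."""
    return 0.5 * c_mutual / (c_line + c_mutual)


def peak_crosstalk_rlc(c_mutual, c_line):
    """Eq. (8b): peak crosstalk as a fraction of Vdd, distributed RLC model."""
    return (math.pi / 2.0) * peak_crosstalk_rc(c_mutual, c_line)


# Hypothetical per-unit-length capacitances in F/m, equal for simplicity.
c_m, c_l = 1.0e-10, 1.0e-10
print(f"RC model:  {peak_crosstalk_rc(c_m, c_l):.3f} of Vdd")
print(f"RLC model: {peak_crosstalk_rlc(c_m, c_l):.3f} of Vdd")
```

With equal mutual and line capacitance the RC estimate is 25% of the supply voltage, while the RLC estimate rises to roughly 39%, so neglecting inductance understates the noise on low-resistance global lines.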

Figure 12
Figure 13

A summary of the complete set of three compact models that define the primary global interconnect design requirements is given in Table 3. The models are expressed in terms of the physical parameters of the two global wiring levels illustrated in Figure 14. In the wiring resource requirement, the three terms respectively represent the signal and power wiring areas and unused area. The second and third models respectively describe the clock wiring bandwidth and signal wiring crosstalk noise limit. In Figure 15, the three models are applied to a particular SoC consisting of 20 heterogeneous megacells containing a total of approximately six million transistors [14]. In the global interconnect design plane, the vertical axis represents interconnect thickness $H$ and the horizontal axis, interconnect width $W$. The allowable design region that satisfies all primary global wiring requirements is the zone bounded by the resource, bandwidth, and noise limit loci. For example, an interconnect width $W \cong 2.4$ µm and height $H \cong 2.0$ µm satisfies the prime design constraints with a minimum pitch. Projections of the allowable design regions for several future generations of technology are illustrated in Figure 16. Here it is evident that the amount of compression of the allowable design region becomes unacceptable, and additional flexibility such as expansion of the number of global wiring levels appears to become necessary.


Table 3   Summary of complete set of requirements to be imposed on global signal, power, and clock distribution networks expressed in terms of the geometry of the two global wiring levels. Simplified expressions for the global wiring requirements are given in terms of $w$, $s$, $H$, and $T_{ox}$.

Wiring resource requirement:

$$(w + s) \sum_{m=2}^{N_m} N_{Net}(m) L_{av}(m) + \frac{2m_p - 1}{m_p^2} A_{SoC} + 0.5 \left( 1 - \frac{1}{m_p} \right)^2 A_{SoC} \le A_{SoC},$$

where

$$N_{Net}(m) \approx \frac{K_{eq} N_m \left[ m^{P_{eq}-1} - (m+1)^{P_{eq}-1} \right]}{m + 1},$$

$$L_{av}(m) \approx \left( 0.5\sqrt{m} + 1 \right) \frac{m - 1}{m + 1} \sqrt{A_{SoC} \left[ \eta_p + (N_m/m)(1 - \eta_p) \right]}, \qquad m_p = \frac{16\, \delta\, V_{dd}^2 H}{P_{tot}\, \rho_w}.$$

Wiring bandwidth requirement:

$$f_c \le \frac{1}{4\pi \rho_w \varepsilon_0 \varepsilon_r \left( \dfrac{1}{H T_{ox}} + \dfrac{1}{ws} \right) A_{SoC}}.$$

Wiring noise limit:

$$\frac{\pi}{4} \cdot \frac{1/ws}{1/(H T_{ox}) + 1/ws} \le \%\ \text{noise}.$$
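
To illustrate how a candidate geometry might be screened against the bandwidth and noise rows of Table 3, the Python sketch below evaluates both expressions for a single design point. It is a partial check only: the wiring resource requirement is omitted because it depends on the net distributions of Figure 8. The width and height follow the W ≅ 2.4 µm, H ≅ 2.0 µm example in the text, but the spacing, insulator thickness, chip area, resistivity, permittivity, target frequency, and noise budget are all hypothetical values chosen for illustration.

```python
import math

EPS0 = 8.854e-12  # permittivity of free space, F/m


def coupling_sum(w, s, h, t_ox):
    """Return 1/(H*Tox) + 1/(w*s), the geometry term shared by both limits."""
    return 1.0 / (h * t_ox) + 1.0 / (w * s)


def max_clock_frequency(w, s, h, t_ox, area, rho_w, eps_r):
    """Upper bound on f_c from the wiring bandwidth requirement (Hz)."""
    return 1.0 / (4.0 * math.pi * rho_w * EPS0 * eps_r
                  * coupling_sum(w, s, h, t_ox) * area)


def noise_fraction(w, s, h, t_ox):
    """Left-hand side of the wiring noise limit, as a fraction of Vdd."""
    return (math.pi / 4.0) * (1.0 / (w * s)) / coupling_sum(w, s, h, t_ox)


def meets_requirements(f_target, noise_budget, **geometry):
    """True if both the bandwidth and noise limits are satisfied."""
    geom = {k: geometry[k] for k in ("w", "s", "h", "t_ox")}
    return (f_target <= max_clock_frequency(**geometry)
            and noise_fraction(**geom) <= noise_budget)


# w and h from the text's example; every other value is an assumption.
design = dict(w=2.4e-6, s=2.4e-6, h=2.0e-6, t_ox=2.0e-6,
              area=1.0e-4, rho_w=1.7e-8, eps_r=3.9)

print(f"f_c limit: {max_clock_frequency(**design) / 1e9:.2f} GHz")
print(f"noise: {noise_fraction(design['w'], design['s'], design['h'], design['t_ox']):.1%}")
print(meets_requirements(1.0e9, 0.35, **design))
```

The two limits pull in different directions: widening the spacing s lowers the noise fraction and raises the bandwidth limit, but it consumes wiring area, which is the quantity the omitted resource requirement constrains.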

Figure 14
Figure 15
Figure 16

In summary, the methodology presented in this section enables early projections of key physical parameters of a global interconnect network that simultaneously satisfies the primary requirements of a SoC for signal, power, and clock distribution. The compact physical models that serve to implement the methodology offer a convenient opportunity to establish a quantitative guide to detailed design of a SoC. Therefore, the methodology may serve as a useful precursor to final design. Enhancements of this methodology that include, for example, the effects of clock skew, nonideal return paths, and simultaneous switching noise are needed.

5. Three-dimensional integration

Achieving three-dimensional (3D) integration in semiconductor technology requires the capability to stack multiple strata, each containing both transistors and multilevel interconnect networks, as discussed in preceding sections. This is a formidable challenge that is unlikely to be engaged seriously absent a convincing case for substantive benefits. Therefore, what are the primary benefits that can be projected for 3D integration? It appears that the singular generic advantage of 3D integration is a substantial reduction in length of the longest global interconnects used in a SoC.

Several rigorous derivations of stochastic interconnect distributions for 3D random logic networks [18–20] based upon the 2D distribution discussed in Section 3 [7, 8] have been reported. Using the analytic models derived in [20], the stochastic interconnect distributions for a 4.0-million-gate random logic network implemented with 1, 4, and 16 strata are illustrated in Figure 17(a). Note that for simplicity these distributions assume that the interstratal pitch r = 1, which strictly imposes the condition that the interstratal pitch equals the intrastratal logic gate pitch. The loci of Figure 17(a) clearly indicate that multiple strata or 3D integration exerts very little impact on the density of local interconnects, but it has a profound effect on the length of the longest interconnects of the logic network. This observation is illustrated with greater clarity in Figure 17(b). The right vertical axis indicates a length of approximately 4000 gate pitches for a corner-to-corner interconnect in a single-stratum implementation, 2000 gate pitches for a four-stratum implementation, and 1000 gate pitches for a 16-stratum implementation. For time-of-flight-limited global interconnects, this could result in a 4:1 reduction of latency and the possibility of an approximately fourfold increase in global clock frequency, at the expense of a 16-stratum implementation of the system.
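The corner-to-corner lengths quoted above can be approximated with simple geometry. The sketch below is a back-of-the-envelope estimate, not the stochastic model of [20]: it assumes the gates of each stratum form a square array, that the longest wire runs along two edges of that array (Manhattan routing), and that it then crosses all strata at an interstratal pitch r measured in gate pitches. The function name is hypothetical.

```python
import math


def longest_interconnect(num_gates, strata, r=1.0):
    """Approximate corner-to-corner Manhattan length, in gate pitches.

    Assumes a square gate array per stratum and a vertical run through
    every stratum at interstratal pitch r (both are simplifying assumptions).
    """
    if num_gates <= 0 or strata <= 0:
        raise ValueError("num_gates and strata must be positive")
    side = math.sqrt(num_gates / strata)        # gates along one edge of a stratum
    return 2.0 * side + (strata - 1) * r        # two edges plus the vertical run


for n_strata in (1, 4, 16):
    length = longest_interconnect(4_000_000, n_strata)
    print(f"{n_strata:2d} strata: about {length:.0f} gate pitches")
```

Under these assumptions the estimate gives 4000, 2003, and 1015 gate pitches for 1, 4, and 16 strata, consistent with the approximate values read from Figure 17(b) and with a roughly fourfold reduction for 16 strata.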

Figure 17

A key simplifying assumption limiting the projections illustrated in Figures 17(a) and 17(b) is that the interstratal pitch equals the intrastratal gate pitch, or r = 1. Setting aside this assumption, a generic 3D wiring distribution for a 4.0-million-gate random logic network whose interstratal pitch is treated as an independent variable has been rigorously derived [20]. Figure 17(c) illustrates a key result of this new derivation for interstratal pitches r = 1 and r = 50. The two distributions are quite similar for short local and long global interconnect lengths. The only region in which the two loci deviate is the midrange of interconnect lengths, where interconnect length and stratal pitch are roughly equal. Consequently, it appears that interstratal separation distance is not a critical parameter in determining 3D wiring distributions.

The generic benefit of substantial reductions in length of the longest global interconnects in a distribution resulting from 3D integration is an inherent advantage of 3D wiring. A concomitant inherent disadvantage of 3D structures is heat removal [21]. Beyond these general issues, the attraction of 3D integration for specific applications may be dominated by the peculiar features of the application itself. For example, two-dimensional sensor arrays that require direct access to each sensor cell for immediate signal preprocessing are interesting prospects for 3D integration [22]. More broadly, the capacity to explore opportunities for extraordinary performance enhancements through 3D integration would benefit from generic advances in capabilities to fabricate 3D structures.

6. Input/output interconnect enhancements

The intent of input/output interconnect enhancements is to improve the cost, size, reliability, and performance of a gigascale SoC. Historically, bonding wires have been the dominant approach to chip input/output (I/O) interconnects [23]. IBM pioneered the introduction of solder-bump I/O interconnects using flip-chip technology with a thin layer of glass passivation sealing the chip encapsulated in silicone gel, which prevented the formation of continuous water films [24, 25]. A particular novel technology that is currently under investigation for I/O enhancements is described as Sea of Leads (SoL) [26]. This technology proposes the use of wafer-level batch fabrication of compliant polymer packages, ultrahigh-density (>10⁴/cm²) x–y–z flexible metal leads, and solder-like bumps attached to the lead tips, as illustrated in Figure 18(a). A short sequence of full-wafer SoL batch-fabrication processes constituting a “tail-end-of-the-line” (TEOL) is envisaged to follow conventional back-end-of-the-line (BEOL) wafer processing. The further intent of SoL technology is to complete all final electrical testing and burn-in operations prior to wafer dicing, which then yields known good packaged die ready for immediate shipment to customers. The flexible leads are designed to provide sufficient x–y–z compliance to accommodate typical differences in the thermal coefficients of expansion between a silicon chip and the substrate to which it is attached. The need for epoxy underfill is thereby precluded, and the possibility of convenient detachment of a chip from a substrate module is enabled.

Figure 18

Concurrent fabrication of packages and leads of all chips on a wafer extends the historically potent economies of wafer-level batch processing to the relatively costly die-by-die assembly, bonding, packaging, testing, and burn-in operations [27, 28]. Moreover, the size of the SoL package is the minimum for a chip-scale package (CSP). Significant reliability improvement may result from avoidance of epoxy underfill often needed to relieve stress on relatively rigid solder-ball connections between chip and substrate. Figure 18(b) is a photomicrograph of an SoL. The circular pattern is the via linking a die-bonding pad with the lead itself, which is the “question-mark-shaped” copper pattern. This peculiar shape is designed to provide a high degree of x–y axis flexibility and thus accommodate chip–substrate thermal expansion differences. The somewhat rounded region beneath the copper lead defines the boundaries of a polymer interposer air cavity that is introduced to enhance z-axis compliance. This compliance is added in order to provide convenient and reliable temporary electrical contacts between an array of electrical test probes and the leads of the dice under test, especially when the probe tips are not in a precisely planar arrangement. A photomicrograph of the cross section of an air cavity is shown in Figure 18(c). An SEM of a 1 × 1-cm die with an SoL density of 12000 per cm² is shown in Figure 18(d). The leads are oriented along the contours of expansion of the die to provide a higher degree of compliance proceeding radially outward from the center to the edge of the die.
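To give a feel for what these lead densities imply geometrically, the short sketch below converts an areal lead density into a center-to-center pitch. It assumes the leads sit on a uniform square grid, which is a simplification (the text notes that the leads are oriented along contours of expansion), and the function name is hypothetical.

```python
import math


def lead_pitch_um(density_per_cm2):
    """Center-to-center lead pitch in micrometres for a uniform square grid.

    The square-grid layout is an assumption made only for this estimate.
    """
    if density_per_cm2 <= 0:
        raise ValueError("density must be positive")
    leads_per_cm = math.sqrt(density_per_cm2)   # leads along 1 cm of die edge
    return 1.0e4 / leads_per_cm                 # 1 cm = 10,000 micrometres


for density in (1.0e4, 1.2e4):
    print(f"{density:.0f} leads/cm^2 -> pitch of about {lead_pitch_um(density):.1f} um")
```

On this estimate, a density of 10⁴ per cm² corresponds to a 100-µm pitch, and the 12000 per cm² die of Figure 18(d) to a pitch of a little over 91 µm.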

Key performance enhancements that appear to be in the offing for SoL technology include the following [16, 26]:

  1. Substantially increased input/output bandwidth for a chip resulting from the significantly larger (e.g., ~10×) number of signal leads that are available.
  2. “Time-of-flight” global signal interconnect latency for a chip resulting from exiting and then reentering the die using external on-module wiring, or “exterconnects,” to implement very-low-loss time-of-flight internal global wiring links for the chip.
  3. Reduced global clock skew due to use of time-of-flight exterconnects to implement global clock trees.
  4. Reduced global clock power dissipation through recycling the energy of reflected clock pulses distributed through low-loss exterconnects [29].
  5. Suppression of far-attacker crosstalk noise on global signal interconnects through the use of exterconnects with nearby high-quality return paths provided by module power and ground planes.
  6. Suppression of simultaneous switching noise (SSN) and reduced parasitic IR voltage drop in the power/ground distribution networks resulting from the significantly larger (e.g., ~10×) number of power and ground leads that are available.
  7. Improved isolation and reduced interference in mixed-signal systems resulting from use of separate power/ground input/output leads for analog and digital signals.
Additional opportunities that are available through SoL include the capacity to satisfy the voracious appetite of 3D integration for I/O capacity and the potential for compatibility of electrical, rf wireless, and photonic I/O interconnects.

In short, SoL can be described as a “disruptive” technology, because the intent is to use batch-fabricated ultrahigh-density input/output leads to improve the cost, size, reliability, and performance of an SoC [16, 26].

7. Photonic interconnects

An exposition of interconnect opportunities for GSI would not be complete without consideration of photonic or optical interconnects [30–33]. In order to be competitive with electrical interconnects for GSI, photonics must provide small, low-power, high-speed, low-cost photon emitters, detectors, and conductors or waveguides that are compatible with CMOS technology. Consequently, this section focuses on compatible photonics, or photonic technologies with the long-range potential to satisfy the extremely stringent and particular demands of GSI.

The most challenging objective for CMOS-compatible photonic interconnects is an efficient room-temperature silicon light emitter. A novel silicon diode which exploits dislocation loops to introduce a local strain field that modifies the band structure to confine carriers near the junction and therefore enhance light emission was recently demonstrated [34].

Short of high-quality silicon photoemitters, a most interesting approach to compatible photonics is based upon heteroepitaxial deposition on Si of SiGe, followed by Ge, followed by GaAs, and finally by AlGaAs [31, 32]. The close lattice-constant match of Ge and GaAs provides a basis for growing high-quality single-crystal layers of GaAs. This heteroepitaxial approach to compatible photonics has the potential to provide III–V compound semiconductor lasers, Ge detectors, and polycrystalline or monocrystalline Si waveguides. Figure 19 illustrates the current–voltage curves of heteroepitaxial SiGe and GaAs diodes on a Si substrate [32]. Figure 20 displays photomicrographs of a right-angle bend and a junction in a polycrystalline Si waveguide [33]. Transmission loss is less than 0.5 dB in the bend and 1.0 dB in the junction. The waveguide width is 0.5 µm, which is comparable to dimensions of upper-level metal interconnects. These recent advances are encouraging demonstrations of the long-range promise of compatible microphotonic interconnects.
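The decibel figures quoted for the waveguide bend and junction can be turned into an optical power budget. The sketch below sums per-element losses along a path and converts the total to a transmitted power fraction using the standard relation P_out/P_in = 10^(−dB/10). It treats 0.5 dB per bend and 1.0 dB per junction as upper-bound values taken from the text; the function names and the example path are hypothetical, and straight-section propagation loss is ignored.

```python
def path_loss_db(bends, junctions, bend_db=0.5, junction_db=1.0):
    """Total loss in dB for a path with the given number of bends and junctions.

    Defaults are the upper-bound figures quoted in the text; propagation
    loss in straight waveguide sections is deliberately left out.
    """
    if bends < 0 or junctions < 0:
        raise ValueError("element counts must be non-negative")
    return bends * bend_db + junctions * junction_db


def transmitted_fraction(loss_db):
    """Fraction of optical power remaining after loss_db decibels of loss."""
    return 10.0 ** (-loss_db / 10.0)


# Hypothetical path: four right-angle bends and two junctions.
total_db = path_loss_db(bends=4, junctions=2)
print(f"total loss: {total_db:.1f} dB, "
      f"transmitted fraction: {transmitted_fraction(total_db):.2f}")
```

Because decibel losses add while power fractions multiply, a budget of this kind makes it easy to see how many bends and junctions a clock or signal path could tolerate before the detector receives too little power.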

Figure 19
Figure 20

It has long been proposed that the most likely point of entry of photonic interconnects into silicon integrated electronics is in clock distribution [35, 36]. Recently, a polymer waveguide network with volume grating output couplers embedded in a printed wiring board (PWB) was proposed to transfer photons from a printed wiring board to one or more silicon photodetectors fabricated in a CMOS chip [37]. This approach to optical clock distribution does not utilize on-chip photon emitters and enables a planar package configuration.

8. Conclusions

Interconnect latency is now the primary performance issue for GSI, and the problem promises to become more serious for future generations of technology. Opportunities to address the problem range, for example, from carbon nanotube conductors that may enable ultrahigh-speed ballistic transport [38] to new single-chip, distributed shared memory, cellular arrays of microprocessors [39, 40] that serve to keep interconnects short. A second interconnect problem, one that is not broadly recognized as such, is energy dissipation. The keys to solving this problem are short interconnects and transistors with the smallest possible subthreshold swing and therefore the smallest possible binary signal swing. Crosstalk and simultaneous switching noise represent a third interconnect problem, signal integrity, which is difficult to describe using compact physical models.

For virtually any family of gigascale chips, the key to optimal reverse scaling of multilevel signal interconnect networks is prediction of the complete stochastic wiring distribution of a next-generation product. More general signal integrity models that can be incorporated into reverse scaling methodologies are needed.

The task of conjointly optimizing the architecture of the global signal, clock, and power/ground distribution networks of a system-on-a-chip consisting of a set of heterogeneous megacells is demanding. A first attempt to address this task comprehensively engages a new stochastic model for global signal wiring, a new model for global power/ground wiring area, a global clock bandwidth requirement, and a crosstalk noise requirement. Enhancements of current methodologies that include, for example, the effects of clock skew, nonideal return paths, and simultaneous switching noise are needed.

The generic benefit of substantial reductions (e.g., >50%) in the length of the longest global interconnects in a distribution is an inherent advantage of 3D integration. The capacity to explore novel opportunities for extraordinary performance enhancements through 3D integration would benefit from generic advances in capabilities to fabricate 3D structures.

In order to maintain historic rates of advance of monolithic semiconductor technology, more attention to ancillary features and particularly to input/output interconnects is unavoidable. Sea of Leads represents an early effort to more intimately couple the chip itself to its environment and then to exploit concomitant new opportunities. Key projected performance enhancements include substantially increased input/output bandwidth, reduced global signal interconnect latency, reduced global clock skew, reduced global clock power dissipation, greater suppression of simultaneous switching noise, and improved signal integrity in mixed-signal systems. More broadly, Sea of Leads represents an effort to extend the quintessential feature of semiconductor technology—wafer-level batch fabrication of several hundred chips—to the traditional die-by-die packaging and testing domains.

To become widely used in GSI, photonics must provide small, low-power, high-speed, low-cost photon emitters, detectors, and conductors or waveguides that are compatible with CMOS technology [41]. Recent advances in heteroepitaxial deposition on Si of SiGe, followed by Ge, followed by GaAs, which have been used to demonstrate light-emitting and light-detecting diodes as well as Si waveguides, are promising.

Acknowledgments

The intellectual contributions to this paper by Azad Naeemi, Raguraman Venkatesan, Muhannad Bakir, Hiren Thacker, Qiang Chen, James Joyner, and Tony Mulé of the Georgia Institute of Technology Microelectronics Research Center are gratefully acknowledged. In addition, the authors wish to express their appreciation to DARPA, Contract No. F33615-97-C-1132, MARCO, Contract No. MDA 972-99-1-002, and the SRC, Contract No. 448:048, for their generous support.

References

Received May 22, 2001; accepted for publication January 7, 2002