Volume 41, Numbers 4/5, 1997
IBM S/390 G3 and G4
DOI: 10.1147/rd.414.0489
Circuit design techniques for the high-performance CMOS IBM
S/390 Parallel Enterprise Server G4 microprocessor
by L. Sigal, J. D. Warnock, B. W. Curran, Y. H. Chan, P. J. Camporese, M. D. Mayo, W. V. Huott, D. R. Knebel, C. T. Chuang, J. P. Eckhardt, and P. T. Wu
This paper describes the circuit design techniques used
for the IBM S/390® Parallel Enterprise Server G4 microprocessor to
achieve operation up to 400 MHz. A judicious choice of process
technology and concurrent top-down and bottom-up design approaches
reduced risk and shortened the design time. The use of timing-driven
synthesis/placement methodologies improved design turnaround time and
chip timing. The combined use of static, dynamic, and self-resetting
CMOS (SRCMOS) circuits facilitated the balancing of design time and
performance return. The use of robust PLL design, floorplanning, and
clock distribution minimized clock skew. Innovative latch designs
permitted performance optimization without adding risk.
Microarchitecture optimization and circuit innovations improved the
performance of timing-critical macros. Full custom array
design with extensive use of SRCMOS circuit techniques resulted in an
on-chip L1 cache having 2.0-ns cycle time.
Introduction
The CMOS G4 central processor (CP) is a single-chip CMOS
microprocessor that has been designed for the IBM S/390* Parallel
Enterprise Server G4 system. It was designed to execute the S/390
instruction set at speeds up to 400 MHz. Achieving this performance
required close integration of logic design and microarchitecture
optimizations, high-performance CMOS circuit and array techniques, an
advanced custom design tool suite, and full utilization of the advanced
process technology. This paper describes the innovative CMOS circuit
design techniques used to achieve high performance. The technology
features and chip characteristics are discussed first, followed by the
chip floorplan and global design strategy. The phase-locked loop (PLL),
which provides a processor clock that runs at twice the system bus
frequency, is then described. High-frequency design issues for both the
standard cell library and full custom circuits, the judicious choice
and mixed use of static and dynamic circuits, and timing-driven
placement are discussed. We then focus on the clock distribution and
latch designs, which are critical for the high-frequency
operation of all of the logic circuits. Examples are given for several
timing-critical macros to illustrate how microarchitecture optimization
and innovative circuit techniques are employed to improve timing.
Finally, we discuss the array circuits, which use self-resetting CMOS
(SRCMOS) circuit techniques extensively to achieve 2.0-ns access time
and 500-MHz operation.
Technology features and chip characteristics
The microprocessor was implemented in the IBM CMOS 6S technology
[1,2]. Typical chip technology parameters are shown in Table 1. The
technology features a 0.2-µm effective channel length, a 5.5-nm-thick
gate oxide, low-resistance Ti-salicided n+ and p+ polysilicon and
diffusions, shallow-trench isolation, metal fuses for laser cuts, n+
precision resistors, five levels of metallization, and local
interconnections. The power supply is 2.5 V.
Table 1
Typical chip technology parameters (from [3], reproduced with permission; © 1997 IEEE).
| Leff | 0.2 µm |
| Gate oxide | 5.5 nm |
| M1 pitch | 1.2 µm |
| M2 pitch | 1.8 µm |
| M3 pitch | 1.8 µm |
| M4 pitch | 1.8 µm |
| M5 pitch | 4.8 µm |
| Power supply | 2.5 V |
A CP chip micrograph is shown in
Figure 1.
The chip measures 17.35 mm × 17.30 mm and contains 7.8 million
transistors (Table 2). A phase-locked loop
provides a processor clock that runs at twice the system bus frequency.
There are about 3.8 million logic transistors and 4.0 million
array transistors on the chip. The measured power dissipation at 300
MHz is 37 W. The chip contains 1600 C4 terminals and 448 off-chip
signal I/Os, and has operated successfully at frequencies up to 400 MHz.
Figure 1
Table 2
Key aspects of the CP chip (from [3], reproduced with permission; © 1997 IEEE).
| Transistor count | 7.8 million |
| Logic | 3.8 million |
| Array | 4.0 million |
| Die size | 17.35 mm × 17.30 mm |
| Power dissipation | 37 W @ 2.5 V, 300 MHz |
| System bus frequency | Half of processor |
| On-chip decoupling cap | 102 nF |
| C4 chip interconnections | 1600 |
| Off-chip signal I/O | 448 |
| Maximum frequency | 400 MHz |
Dedicated thin-oxide capacitors totaling 102 nF are provided for
on-chip decoupling. This, combined with the inherent nonswitching
well-to-substrate and diffusion-to-well capacitances, provides about
200 nF of on-chip decoupling capacitance. Each capacitor has a gated
n-FET control device with an external decap-enable pin for leakage
current measurements during test. Additionally, each capacitor contains
a "built-in" fuse which, in the presence of a large current
resulting from oxide defects, results in an open in the first level of
chip metallization (M1). The decoupling capacitor cell
(Figure 2) is designed to fit under the dataflow wiring
tracks. The cell is double-bit-pitch wide (43.2 µm) and 14 tracks
tall (25.2 µm). Two of the fourteen horizontal wiring tracks are
specifically blocked for the decoupling capacitor wiring so that the
capacitor can fit under the wiring. A low-resistance layout of the
capacitor cell provides a fast time constant, approximately 85 ps.
Figure 2
Chip floorplan and global design strategy
The CP chip comprises five major units: instruction unit (IU),
fixed-point unit (FXU), floating-point unit (FPU), buffer control
element (BCE), and recovery and error-detection unit (RU)
[3]. The
IU, FXU, and FPU are duplicated on the chip for increased system
reliability. The BCE and RU, with error-detection logic, are centrally
located in the floorplan to support communication with both sets of
instruction and execution units. The BCE contains a unified 64KB L1
cache [4] and 32KB read-only memory
(ROM) for millicode storage. The
RU maintains an error-checking and correction (ECC) protected copy of
the processor states, and various system support functions for error
detection and recovery. The PLL is located near the center of the chip
and generates internal system clocks that run at twice the system bus
frequency.
Each unit consists of many smaller design elements called macros, and
within each unit macros are classified as dataflow or control. Dataflow
macros are custom-designed and are arranged in bit stacks. Most of the
dataflow and control logic was implemented with static CMOS circuits.
Dynamic circuits were used only in isolated, performance-critical
cases. Approximately 75% of the logic macros were implemented with
custom schematics and layouts, and the remaining 25% were synthesized.
Control macros (also designated as random logic macros, or RLMs) are
synthesized logic. This logic was implemented using a standard-cell
approach in which gates are selected from a common fixed library and
are placed and routed with automated tools. In most cases the RLMs also
included gate array backfill of spare logic in the unused area within
the macro. The gate array's spare logic consists of unwired
transistors which can be personalized at the metal-only levels to
facilitate quick design changes.
Functional descriptions in VHSIC Hardware Description
Language (VHDL) and physical design hierarchies match at the
macro level. The physical partitioning of the chip was largely
influenced by the logical partition at the macro level. The chip
physically comprises at least three levels of hierarchy (chip,
unit, and macro), and design implementations were achieved concurrently
at each level. Thus, interconnections between units and
interconnections between macros were implemented in parallel with the
macro layouts. Our concurrent design approach supported both a top-down
and bottom-up design methodology, with constraints being passed at each
level of the hierarchy. The shape and size of arrays and width of
dataflow stacks were designed from the bottom up. The aspect ratio of
control macros and placement of control pins were designed from the top
down. Design constraints were passed in both directions of the
hierarchy in the implementation of the chip power distribution. The
power mesh on higher metal levels (M4 and M5) was predefined at the
chip level and considered fixed at the unit and macro levels. The power
mesh on the M3 level and below was designed from the bottom up, giving
the most flexibility to the macros and units. In some cases, such as
over the dataflow areas, the M2 power mesh was predefined.
The power distribution supports an average dc voltage drop of 23 mV.
The current transients were managed by including additional on-chip
decoupling capacitors around large noise sources such as the off-chip
drivers, clock buffers, and on-chip drivers with large loads. Since a
large amount of switching capacitance occurs in the dataflow stacks,
decoupling capacitors were also placed under the wiring tracks with
little or no area impact.
Phase-locked loop (PLL)
The PLL provides a processor clock that runs at twice the system
bus frequency. It has a current-controlled oscillator and an
on-chip-generated analog VDD. It operates over the
range from 36 MHz to 571 MHz, with less than 4 ps/mV of long-term
phase error between the PLL output and reference clock due to power
supply noise, and less than a 1-ps/mV reduction in cycle time due to
other noise sources. Figure 3 shows a
block diagram of the PLL. The phase detector generates pulses with widths
equal to the phase difference between its two inputs (the reference
clock $f_{\mathrm{ref}}$ and $f_{\mathrm{clock}}/M_{\mathrm{feedback}}$, where
$f_{\mathrm{clock}}$ is the chip clock). The charge pump
converts the pulse width into charge stored on a capacitor pair to set
a differential bias. The amount of charge on the capacitor is increased
or decreased until the pulse widths are reduced to the minimum value
(i.e., no phase difference between $f_{\mathrm{ref}}$ and
$f_{\mathrm{clock}}/M_{\mathrm{feedback}}$), and
the correct bias voltage is reached. The voltage-to-current converter
converts the bias voltage to a current which is injected into the
current-controlled oscillator (ICO). The amount of current determines
the oscillator frequency. The forward divide and feedback divide
determine the relationship between the frequencies of the reference
clock and the oscillator and chip clock, such that
$f_{\mathrm{clock}} = f_{\mathrm{osc}}/M_{\mathrm{forward}}$ and
$f_{\mathrm{ref}} = f_{\mathrm{clock}}/M_{\mathrm{feedback}}$, where
$M_{\mathrm{forward}}$ and $M_{\mathrm{feedback}}$
are the divide values of the forward and feedback divides,
respectively. A constant-current source generates the reference current
$I_{\mathrm{ref}}$, which biases the current-controlled analog
circuits of the PLL.
Figure 3
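The divider relations above can be illustrated with a minimal sketch. The function name, the divider values, and the 150-MHz reference are hypothetical; the text states only that the processor clock runs at twice the system bus frequency, and taking the reference clock to be at the bus frequency is an assumption of this example.

```python
def pll_frequencies(f_ref_mhz, m_forward, m_feedback):
    """Return (f_osc, f_clock) in MHz for a locked PLL.

    In lock: f_ref = f_clock / m_feedback and f_clock = f_osc / m_forward.
    """
    f_clock = f_ref_mhz * m_feedback   # chip clock implied by the feedback divide
    f_osc = f_clock * m_forward        # oscillator frequency implied by the forward divide
    return f_osc, f_clock


# Hypothetical values: a 150-MHz reference with a feedback divide of 2
# yields a chip clock at twice the reference frequency.
f_osc, f_clock = pll_frequencies(150.0, m_forward=1, m_feedback=2)
print(f_osc, f_clock)
```

With a larger forward divide the oscillator runs proportionally faster than the chip clock, which is one way a single oscillator range can serve a wide span of chip frequencies.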
A primary barometer of PLL performance is jitter. In the CP, the PLL
aligns its clock with other chips in the system. Consequently,
accumulated phase error (the phase difference between the PLL output
and the reference clock) is important. Other critical measurements are
cycle-to-cycle jitter and long-term jitter, which reduce the cycle
period available for logic operations. In the present design, there are
three places where PLL performance can be severely degraded by power
supply noise: the constant-current source
($I_{\mathrm{ref}}$), the current injected into the ICO by the
voltage-to-current converter, and the ICO itself. The
$I_{\mathrm{ref}}$ source supplies the constant current which biases
the entire PLL. Any fluctuation in this current results in both
cycle-to-cycle and long-term jitter. This error is reduced by
minimizing the sensitivity to supply voltage (dc shifts and
low-frequency noise), and by the use of capacitive coupling
(high-frequency noise). The voltage-to-current converter supplies bias
current to the oscillator. Noise added to this bias results in both
types of jitter. The output of the converter is single-ended and
can be sensitive to power-supply shifts. The present design not
only minimizes the injected noise, but also uses a common-mode
noise-rejection technique so that the noise is added as common mode
and cancels out at the output. The ICO is a differential cascoded
current source design which relies on the variation of the input
current to control the frequency.
To minimize the noise and add flexibility to the PLL usage, the analog
VDD is generated on-chip with the simple
RC filter shown in Figure 4. This
approach eliminates the need for an additional power supply and the
associated package-coupled noise, and makes the design compatible with
the existing power-supply systems. The on-chip coupling noise on the
analog VDD is minimized both by careful
placement and wiring and by the low-impedance path to
VDD through the RC filter.
Figure 4
High-frequency design strategy
The fundamental design strategy was to design the dataflow to the
most aggressive cycle time achievable, and then to match the control
portion to the dataflow cycle time. In cases where control paths lag
the dataflow cycle time after all possible logic optimizations, cycles
per instruction (CPI) were traded off for cycle time. The CP achieved
its high-frequency design goals by unique combinations of improved
timing methodologies and design techniques. These methodologies
included advanced synthesis techniques for automatically generating
timing-optimized schematics for the control functions, and more
accurate extraction and modeling of the layout capacitances and
resistances. Two types of high-frequency design strategies (logical and
physical) were developed to resolve long-path timing problems. The
first type incorporated high-leverage logic design optimizations, and
included macro-partitioning optimizations of the timing-critical logic
and VHDL optimizations [5]. When logical
strategies did not resolve a
long-path timing problem, physical strategies of placement, wiring,
high-performance circuit, and clocking techniques were exercised. Most
dataflow macros were implemented with custom static circuits. All
control macros were implemented from a static cell library with fine
power granularity.
Standard cell library
A completely full custom implementation was not feasible because
of the complexity of the S/390 architecture and our resource and
schedule constraints. Thus, two carefully tuned standard cell booksets
were developed for use with synthesis. One of these contained fixed
schematics and layouts, and the other was parameterizable with
capabilities for automatic layout generation
[5]. Both bookset
designs were limited to relatively simple functions: INVERTERs and
BUFFERs; two-, three-, and four-way NANDs and ANDs; two- and three-way
NORs and ORs; and simple AOIs, OAIs, AOs, and OAs (2 by 1 and 2 by 2).
Many nonbuffered power levels were designed for each book type. These
power levels were created both by reducing the width of the FET fingers
and by adding FET fingers. Within the fixed bookset, each book type was
individually simulated and beta-tuned (ratio of p-FET device width to
n-FET device width) to achieve nearly equal rising and falling output
delays.
The parameterized bookset offered the most flexibility and highest
performance. Multiple beta ratio books were designed for the book types
with the highest usage. The beta ratio was changed by both increasing
and decreasing the ratio of p-FET width to n-FET width. Thus, rising
output delay could be improved at the expense of falling output delay, and vice
versa. (Synthesis exploited these various beta ratios as it attempted
to minimize both falling-input-to-falling-output delay and
rising-input-to-rising-output delay for a path.) All of the books in
this two-dimensional bookset (the two dimensions being power level and
beta ratio) were timing-characterized for synthesis. Timing
characterization included input-pin-to-output-pin base delays,
and delay sensitivities to input-pin slew and output-pin
capacitive load. Synthesis using these carefully tuned booksets yielded
schematics which were nearly as optimal as full custom schematics.
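One simple way to picture such a characterization is a first-order linear delay model per input-pin-to-output-pin arc. The sketch below is an assumption for illustration only: the class name, the linear form, and all coefficient values are hypothetical, and the text does not state how the sensitivities were actually combined.

```python
from dataclasses import dataclass


@dataclass
class TimingArc:
    """Hypothetical first-order model of one input-pin-to-output-pin arc."""
    base_delay_ps: float        # delay at zero input slew and zero load (assumed reference point)
    slew_sens: float            # ps of added delay per ps of input slew
    load_sens_ps_per_ff: float  # ps of added delay per fF of output load

    def delay_ps(self, input_slew_ps, load_ff):
        # Base delay plus the two linear sensitivity terms
        return (self.base_delay_ps
                + self.slew_sens * input_slew_ps
                + self.load_sens_ps_per_ff * load_ff)


# Hypothetical coefficients for one power level of a two-way NAND book
nand2 = TimingArc(base_delay_ps=40.0, slew_sens=0.25, load_sens_ps_per_ff=2.0)
print(nand2.delay_ps(input_slew_ps=100.0, load_ff=20.0))
```

In a two-dimensional bookset, each combination of power level and beta ratio would carry its own set of such coefficients for rising and falling transitions.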
Much effort was expended to minimize the internal parasitic resistances
and capacitances of each book. Polygate resistances were reduced by
placing the input pin between the n-FET and p-FET fingers. Source and
drain diffusion capacitances were reduced by sharing as much of the
diffusion as possible between adjacent series-connected p-FETs and
n-FETs. The local interconnection level was leveraged for the internal
book connections of FET drains and sources.
Full custom circuit design
Each custom circuit designer had full control of circuit
schematics and circuit topologies, and this permitted them to exploit
fully the delay advantages of transmission gates, AOIs, and OAIs.
Transmission gates were used primarily to implement selector functions
when the data path was more timing-critical than the select path.
Manual logic transforms were applied to the schematics to reduce the
number of stages and critical path delays through the schematics.
Transforms included buffer insertion and deletion, NAND-NAND-to-AOI
conversions, NOR-NOR-to-OAI conversions, and Shannon expansions.
The noncritical loads on a timing-critical net were buffered in a
manner which reduced the load on the net's driver.
After the circuit schematic optimizations were completed, individual
gates were tuned. Gates in the non-timing-critical paths were resized
to achieve 10-90% rising and falling slew targets
[5]. Next, stage
gains in the critical paths were reduced to a value typically between 2
and 3, and equalized throughout the entire path. Stage gain was defined
to be the ratio of a gate output load capacitance to its input pin
capacitance. The output drivers (gates driving the macro output pins)
were properly sized to drive the full lumped load capacitance instead
of buffering the signal via multiple inverters to increase drive
strength. The preceding gates (i.e., gates driving output drivers) were
accordingly sized up to reduce their stage gain. Stage gains were also
reduced by increasing the FET widths of the critical gate load on a
timing-critical net and/or by lowering the FET widths of the
noncritical gate loads on a timing-critical net. Equalizing stage gains
was accomplished by increasing and/or decreasing the FET widths in the
path until the slews of all of the nets in that path were approximately
equal. (Excessive gain on a circuit stage causes an excessive slew on
the output net of that stage.) The beta ratios of books were modified
to equalize the rising-input-to-rising-output delay and
falling-input-to-falling-output delay for the timing-critical paths, as
synthesis did for the control functions.
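The stage-gain definition used above (output load capacitance divided by input pin capacitance) can be sketched for a simple chain of gates. This is an idealized illustration: it assumes each gate drives only the next gate, ignores wire and parasitic capacitance, and uses hypothetical capacitance values and function names.

```python
def stage_gains(input_caps_ff, final_load_ff):
    """Stage gain of each gate in a chain: its load capacitance / its input capacitance."""
    loads = input_caps_ff[1:] + [final_load_ff]
    return [load / cin for cin, load in zip(input_caps_ff, loads)]


def equalize(first_input_cap_ff, final_load_ff, n_stages):
    """Input capacitances that give every stage of the chain the same gain."""
    gain = (final_load_ff / first_input_cap_ff) ** (1.0 / n_stages)
    caps = [first_input_cap_ff * gain ** i for i in range(n_stages)]
    return gain, caps


# Unequal sizing: the middle stage carries an excessive gain (and hence slew)
print(stage_gains([10.0, 20.0, 90.0], final_load_ff=270.0))

# Equalized sizing for the same first-stage input capacitance and final load
gain, caps = equalize(10.0, 270.0, n_stages=3)
print(gain, caps)
```

In this example the equalized gain comes out at about 3, the upper end of the range quoted in the text; a lower gain would require more stages or a larger first-stage input capacitance.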
The layout techniques employed during development of the standard
cell booksets were also used during full custom circuit layouts.
Additionally, to minimize wire capacitance in timing-critical custom
paths, circuits were manually placed and wired. In situations where
long wires were unavoidable (principally on nets with large fanout),
wire RC delay was modeled with various wire widths to find
the width which resulted in minimum path delay.
Dynamic circuits
Array circuits such as the L1 cache, L1 directory, TLBs, ROM,
traces, and write buffer were designed with self-resetting CMOS
(SRCMOS) circuits [4]. The only
nonarray macros designed with dynamic
circuits were the cache hit logic [6],
the dynamic multiplexors used
with latches, and the divide and square-root lookup ROM in the FPU. The
rest of the macros were designed with static circuits. All known static
techniques were applied before considering dynamic circuit solutions.
Timing considerations
The address-generation path was still timing-critical after
application of all of the physical placement, wiring, and circuit
techniques described above. Fortunately, the L1 cache and ROM access
paths had been designed with timing margin; thus, some of this margin
was transferred to the address-generation path. This was accomplished
by simply adding a programmable clock delay block within the cache and
ROM arrays. In other similar situations, where a path was
timing-critical and the prior or subsequent cycle path had timing
margin, we opted to reposition the latch to rebalance the delays of the
back-to-back cycle paths [5].
Many critical paths arose because of excessive delay in the propagation
of a signal between distant macros. The microprocessor floorplan was
modified to move these macros closer together; even small movements
improved timing, since (to first order) wire propagation delay is
proportional to the square of wire length. If the wire was still
timing-critical, several other solutions were explored. For wires with
more than one load, we investigated the feasibility and timing impact
of duplicating the net driver and net so that one of the drivers merely
drives the timing-critical load. Solutions involving wide or tapered
wires were analyzed via AS/X simulations. Widening a wire reduces its
resistance at a faster rate than its capacitance increases, thus
permitting a reduction in the RC delay of the wire. A
tapering solution involved the use of a wider section of wire at the
near (driver) end and a narrower section of wire at the far (receiver)
end. These solutions further reduced the RC delay of the wire. Care was
taken to ensure that no polysilicon wires were used on critical wiring
nets.
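Both wire effects described above, the quadratic dependence of delay on length and the benefit of widening, can be seen in a minimal distributed-RC sketch. The Elmore approximation is substituted here for the AS/X simulations mentioned in the text; the sheet resistance and capacitance coefficients are hypothetical, and driver resistance, receiver load, and tapering are not modeled.

```python
def wire_delay_ps(length_um, width_um, r_sheet=0.07, c_area=0.03, c_fringe=0.08):
    """Elmore delay of a uniform distributed RC wire, in ps.

    r_sheet in ohm/square, c_area in fF/um^2, c_fringe in fF/um
    (all hypothetical values chosen only for illustration).
    """
    r_total = r_sheet * length_um / width_um               # ohms
    c_total = (c_area * width_um + c_fringe) * length_um   # fF
    return 0.5 * r_total * c_total * 1e-3                  # ohm * fF = 1e-3 ps


base = wire_delay_ps(1000.0, 1.0)
print(wire_delay_ps(2000.0, 1.0) / base)  # doubling the length roughly quadruples the delay
print(wire_delay_ps(1000.0, 2.0) / base)  # widening lowers the delay, but by less than half
```

Because only the area component of the capacitance grows with width while the fringe component does not, the resistance falls faster than the capacitance rises, which is the effect the text relies on.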
In addition to implementing techniques to reduce cycle times, careful
noise analysis was required to ensure that capacitive metal-to-metal
noise couplings would not adversely affect the functionality of the
circuits at high frequencies. All distant inputs to transfer gate
latches were buffered locally with receivers. The parasitic coupling
analysis tools identified nets that required grounded metal shields.
Design for reliability
It was challenging to design this high-speed microprocessor to
operate functionally in an accelerated burn-in environment. In
normal operation, the CP operates in a 2.5-V environment. However, to
ensure high quality and reliability, the CP was required to pass
high-voltage and temperature stress testing. A dynamic voltage screen
(DVS) test consists of writing patterns at 4.25 V, while a 5.0-V bump
is applied at the end of each DVS pattern. The CP was designed to be
functional and to pass IDDQ leakage-current testing under
these stress conditions. In addition, the CP was also designed to be
functional under burn-in conditions at 140°C and 3.75 V.
Clock distribution and latches
The microprocessor clock is a single-phase clock that is
distributed from the chip central clock buffer to all latches inside
the macros in three levels of hierarchy. The first two levels of
distribution are in the form of balanced H-like trees. The
first-level tree routes the global clock from the central clock
buffer to the nine sector buffers. Each IU, FXU, FPU, and RU unit
has one sector buffer, while the BCE has two sector buffers.
Figure 5 shows the first-level clock
wiring
tree. The sector buffers repower the clock to all macros inside the
sectors. There are 580 macro clock pins among all the units. The clock
propagation delay along the tree is balanced against the macro input
capacitance and RLC characteristics of the tree wires. To
form the horizontal wiring of each tree, use is made of the relatively
low-resistance M5 layer. At various places along the tree, inductive
coupling is reduced, and the return path is improved by using power
wires for shielding. Decoupling capacitors are incorporated into
central and sector buffers to reduce delta-I noise. Extensive AS/X
circuit simulations were done to guarantee low clock skew at macro
pins. Typical simulated RLC delay of the first-level tree is
300 ps, with a skew of 20 ps at the sector buffers. The sector buffer
delay is 230 ps. Typical simulated RLC delay within sectors
is 210 ps, with a skew of 30 ps at the macros.
Figure 6 shows the measured waveforms of
the central clock
buffer output and clocks at ten points (marked on
Figure 5) of the 580
macro pin locations driven by the second-level clock tree. The results
indicate a mean delay of 740 ps and less than 30-ps skew from the
central clock buffer to the macro pin. The last level of clock
distribution is local to each macro.
Figure 7
shows the clocking scheme within macros. From the macro pin the
clocks are wired to the clock blocks. The overall target skew for
this wiring is less than 20 ps. For large-area macros, multiple clock
pins were used to reduce the wire length to clock blocks. The clock
block generates local clocks that drive latches, as is explained later.
The target skew for local clocks is less than 50 ps. All macro-level
wiring is done by hand for custom macros or with a place-and-route tool
for synthesized macros. For synthesized macros that contained many
latches, and therefore multiple clock blocks, a clock optimization tool
was used that reassigned latches to clock blocks on the basis of
placement. This resulted in clock blocks driving latches that were
placed closest to them. Macro layouts were extracted for R
and C parasitics, and the extracted netlists were used to
time the macros. Therefore, any skew in the last level of clock
distribution was captured in that macro's timing abstraction.
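As a quick consistency check, the typical simulated delays quoted above for the first two levels of distribution can be summed and compared with the measured mean delay of 740 ps from the central clock buffer to the macro pins. Treating the three contributions as simply additive is an assumption of this sketch.

```python
# Typical simulated delays quoted in the text, in picoseconds
stage_delays_ps = {
    "first-level tree": 300,
    "sector buffer": 230,
    "in-sector tree": 210,
}

total_delay_ps = sum(stage_delays_ps.values())
print(total_delay_ps)  # 740, matching the measured mean delay
```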
Figure 5
Figure 6
Figure 7
Two types of latches are used on the microprocessor outside the arrays:
L2-only latch and L1-L2 pair; both are cycle boundary latches (i.e.,
there are no midcycle or phase latches). The cycle boundary is defined
to occur at 'CLKG' falling. Corresponding to the two
latch types are two types of clock blocks. The first clock block/latch
combination is shown in Figure 8(a). This
clock block chops the global clock on the falling edge to create a
short pulse 'CLKL' that triggers the latch. By using
either a dynamic multiplexor in front of the latch
[Figure 8(a)] or a preset static multiplexor
[Figure 8(b)], a fast latch
delay is achieved. This mux/latch combination interfaces smoothly with
the static circuits, yet allows fast delays for multi-input high-fanout
registers typical of the dataflow. While 'CLKL' is
inactive (high), the dynamic multiplexor is in the precharge state
while the latch is holding its state. Similarly, the static multiplexor
is preset high (node mux_A is high) while the latch is holding
its state. When the 'CLKL' low-going pulse
arrives, the latch node 'LAT' begins to
discharge. If the multiplexors' data/selects are such that the
multiplexors remain in their precharged/preset state, node
'LAT' discharges fully. If, however, the multiplexors
evaluate, node 'LAT' is driven high. The n/p
transistors in the multiplexors are skewed to favor the transition
launched by the arrival of the 'CLKL' pulse. Strength
ratios in the preset static multiplexor are limited to 2.5:1 to
maintain a reasonable noise margin and to allow adequate time for
presetting all gates at the end of the clock pulse. The seven-input
preset static multiplexor evaluates in about 225 ps compared to about
400 ps for a standard static multiplexor with the same input
capacitance and area. When 'CLKL' is active, the latch
is in the transparent mode. Full chip and macro wire RC
extractions were carried out to guarantee no early-mode problems.
Figure 8
The second clock block/latch combination is shown in
Figure 9. This clock block splits the
global clock on the
falling edge to create C1/C2 clocks. C1-C2 clock overlap at the
cycle boundary is set close to 0 ps. Having a positive overlap would
reduce the latch propagation delay, but it would also require more
early-mode padding. These latches are used in non-timing-critical
dataflow macros and in control macros where all latches are
single-input and the speed advantage of an L2-only latch is
reduced.
Figure 9
All of the latches used in the design are fully scannable and
LSSD-compatible. The largest area penalty for fully scannable latches
was in the general-purpose register (GPR) arrays at 25%. The overall
chip area penalty is less than 5%. Having an LSSD-compatible design
increased productivity during test vector generation and allowed
greater than 99.5% dc stuck-at-fault test coverage. The goal of the
design was also to achieve high ac transition test coverage, since ac
test coverage often suffers from an inability to create appropriate
transitions due to latch adjacency in the scan chain. This is most
prevalent in dataflow macros where complicated logic functions are
being done (e.g., multipliers). To eliminate the latch adjacency
problems, nonfunctional scan-only latches were inserted into some
macros for every register bit. With this addition, ac transition test
coverage greater than 91% was achieved.
Specialized circuitry was added to all clock blocks to allow for edge
shifting at the cycle boundary. Taking the clock splitter as an
example, we are able to
- Completely un-overlap C1 falling and C2 rising by a large amount.
This provides a work-around for potential early-mode problems.
- Delay C1 falling from its nominal value. This allows stressing of
early mode and provides a determination of how much margin exists in the
design.
- Delay C1 falling and C2 rising together from their nominal values. This
allows cycle stealing from the following cycle.
The last feature was found to be extremely useful in debugging
late-mode problems. Since clock stressing is local to each macro, when
the first late-mode path was found, more time was given to the failing
cycle through cycle stealing. This usually did not create late-mode
problems in the following cycle for the stressed macro. With the worst
late-mode path now masked, work could continue in finding the next
late-mode path that failed. Figure 10
shows how the clock splitter block was modified to cause un-overlap
between C1 falling and C2 rising.
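The effect of delaying a macro's cycle-boundary clock edges on late-mode timing can be sketched with a simple slack calculation. The function and all numbers are hypothetical, and latch setup time, clock skew, and latch propagation delay are ignored; the sketch only illustrates why the path into the stressed macro gains time while the path leaving it loses the same amount.

```python
def late_mode_slack_ps(cycle_ps, path_delay_ps, launch_shift_ps=0.0, capture_shift_ps=0.0):
    """Late-mode slack of a latch-to-latch path; positive means the path meets timing.

    The shifts are the delays applied to the local clock edges of the
    launching and capturing macros, respectively.
    """
    return cycle_ps + capture_shift_ps - launch_shift_ps - path_delay_ps


CYCLE = 2500   # ps, a hypothetical 400-MHz cycle
SHIFT = 150    # ps, hypothetical delay applied to the stressed macro's clocks

# Failing path that ends in the stressed macro
print(late_mode_slack_ps(CYCLE, 2600))                          # negative: fails
print(late_mode_slack_ps(CYCLE, 2600, capture_shift_ps=SHIFT))  # positive: masked

# Path that starts in the stressed macro gives up the same amount of time
print(late_mode_slack_ps(CYCLE, 2300, launch_shift_ps=SHIFT))   # still positive
```

As long as the outgoing path has enough margin, the failing path is masked without introducing a new failure, and debugging can move on to the next late-mode path.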
Figure 10
Microarchitecture optimization and circuit innovations
In this section, we present several examples of how
microarchitecture optimization and innovative circuit techniques were
employed to improve timing.
FXU binary-zero-detect circuit
The execution of instructions in the fixed-point unit involves
computing results in the 64-bit binary adder, 32-bit decimal adder,
and 64-bit logic unit, and setting the condition codes, all
in a single cycle. The critical path was computing the 64-bit
binary sum, doing a 64-bit zero-detect on the sum, and using the
result to set binary instruction condition codes. A scheme was used
where the 64-bit zero-detect was initiated and completed before the
binary sum result was actually available
[7]. This led to an
execution cycle of less than 3 ns.
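The details of the scheme are given in [7] and are not reproduced here. As an illustration of how a zero result can be detected without waiting for the sum, the sketch below uses a well-known carry-free test for A + B = 0; it is a stand-in technique and is not necessarily the circuit used on the chip. If every sum bit is zero, the carry into bit i must equal a[i] XOR b[i], and the carry out of the bit below must equal a[i-1] OR b[i-1], so the check reduces to a per-bit comparison followed by a wide AND.

```python
def sum_is_zero(a, b, width=64, carry_in=0):
    """Return True if (a + b + carry_in) mod 2**width == 0, without forming the sum."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    propagate = a ^ b                                # carry-in each bit needs for a zero sum bit
    required = (((a | b) << 1) | carry_in) & mask    # carry each bit receives if the bit below is zero
    return propagate == required                     # per-bit XNOR followed by a wide AND


print(sum_is_zero(1, (1 << 64) - 1))   # 1 + (-1) wraps to zero
print(sum_is_zero(2, 2))               # 4 is not zero
```

Each bit of the comparison depends only on two adjacent bit positions of the operands, so it does not sit behind the adder's carry-propagation path.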
FPU 120-bit adder
The 120-bit floating-point adder was implemented as a carry-select
adder. Care was taken to make sure that the adder would never produce a
negative result. This eliminated the step of converting a
two's-complement negative result to sign-magnitude format. The adder
was split into 15 bytes, and each byte generated two sums. The
critical path was the carry-merge tree, which was implemented in a
binary tree fashion. This carry-merge tree produced the 15 selects for
byte sums. The adder computed the result in 1.35 ns in eleven
levels of logic.
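The carry-select organization described above can be modeled behaviorally as follows. This is a sketch only: each of the 15 bytes forms both candidate sums, as in the text, but the byte selects are resolved here by a simple sequential loop rather than by the binary carry-merge tree used in the actual design, and the function name is hypothetical.

```python
BYTE_BITS = 8
N_BYTES = 15   # 15 bytes x 8 bits = 120 bits


def carry_select_add(a, b, carry_in=0):
    """Return (sum, carry_out) of two 120-bit operands using per-byte carry select."""
    total = 0
    carry = carry_in
    for i in range(N_BYTES):                 # byte 0 is the least significant byte here
        a_byte = (a >> (BYTE_BITS * i)) & 0xFF
        b_byte = (b >> (BYTE_BITS * i)) & 0xFF
        sum0 = a_byte + b_byte               # candidate sum assuming carry-in = 0
        sum1 = a_byte + b_byte + 1           # candidate sum assuming carry-in = 1
        chosen = sum1 if carry else sum0     # the select picks one of the two candidates
        total |= (chosen & 0xFF) << (BYTE_BITS * i)
        carry = chosen >> BYTE_BITS          # carry into the next byte
    return total, carry


print(carry_select_add((1 << 120) - 1, 1))
```

In hardware all 30 candidate byte sums are available in parallel, so the critical path is the generation of the 15 selects, which is why the carry-merge tree dominates the delay.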
Binary priority encoders
Several places in the logic design required the use of priority
encoders. An example was a priority encoder in asynchronous interrupt
logic where a 16-bit asynchronous interrupt vector had to be
binary-encoded. Another use of this circuit was in leading-zero
detectors. A fast priority encoder circuit was designed with pass-gate
multiplexors, as shown in Figure 11.
This circuit computes the result in about five levels of logic on 16-bit
input vectors.
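A behavioral sketch of a 16-bit priority encoder built from successive two-way selections is shown below. It does not reproduce the pass-gate multiplexor structure of Figure 11; the bit numbering (bit 0 is the leftmost, highest-priority bit) and the function name are assumptions made for this example.

```python
def priority_encode16(vector):
    """Return (valid, index) for a 16-bit vector.

    index is the position of the highest-priority set bit, where bit 0 is
    the leftmost (most significant) bit; valid is False if no bit is set.
    """
    assert 0 <= vector < (1 << 16)
    index = 0
    width = 16
    while width > 1:
        half = width // 2
        upper = vector >> half
        if upper:                        # a set bit exists in the left half: select it
            vector = upper
        else:                            # otherwise select the right half
            vector &= (1 << half) - 1
            index += half
        width = half
    return bool(vector), index


print(priority_encode16(0x0180))
```

Each pass through the loop corresponds to one selection level and resolves one bit of the encoded result, so a 16-bit vector needs four such levels.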
Figure 11
FXU 64-bit rotator
A 64-bit rotator in the fixed-point unit was implemented in two
levels of 8-to-1 multiplexors. A straightforward implementation
would look like that shown in
Figure 12(a).
In this implementation bits 0-6 of the first-level 8-to-1
multiplexor had to drive the entire width of the data stack to reach
bits 57-63. This increased the wire capacitance of these bits
eightfold and significantly increased the delay of the first-level
multiplexor. The implementation that was used instead duplicated some
bits of the first-level multiplexor, as shown in
Figure 12(b).
This increased the capacitance of the first-level multiplexor select
and data lines by only 12.5%, while keeping the wire capacitance of
the second-level multiplexor small. We found the latter
implementation to be much faster.
Figure 12
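A behavioral Python sketch of a two-level rotator may help show where the long wires come from. It assumes that the first level rotates by multiples of 8 bits and the second level by 0 to 7 bits, which is consistent with the wiring described above but is not stated explicitly; bit 0 is taken as the most significant bit, and all names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def to_bits(x: int):
    """Bit list with index 0 as the most significant bit."""
    return [(x >> (63 - i)) & 1 for i in range(64)]

def from_bits(bits):
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def mux_level(bits, select: int, step: int):
    """64 parallel 8-to-1 muxes; output i picks among inputs spaced `step` apart."""
    return [bits[(i + select * step) % 64] for i in range(64)]

def rotate_left_two_level(x: int, amount: int) -> int:
    coarse, fine = amount >> 3, amount & 7          # two 3-bit mux selects
    level1 = mux_level(to_bits(x), coarse, 8)       # rotate by 0, 8, ..., 56
    level2 = mux_level(level1, fine, 1)             # rotate by 0..7
    return from_bits(level2)
```

In this model the second-level multiplexor for bit 63 reads first-level bits 63, 0, 1, ..., 6, so first-level bits 0-6 are needed at the far end of the stack. Placing copies of those bits next to bits 57-63 keeps the second-level wires short.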
Interleaving bits in the FXU data stack
The fixed-point-unit data stack was 64 bits wide to accommodate
instructions that work on 64-bit operands. However, a large number of
instructions in the architecture move data from the upper 32 bits to
the lower 32 bits and vice versa. To minimize both area and cycle time,
we avoided the many wiring channels that would be required for this by
interleaving bits 0, 32, 1, 33, ..., 31, 63. This had zero cost on
most macros that are bit-sliced, such as registers, GPR, FPR, mask, and
BLU. Even in the 64-bit rotator there was no penalty. The only
macro that was made more complicated was the 64-bit binary adder in
its carry-merge tree.
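The interleaved ordering can be written as a simple mapping from logical bit number to position in the stack. The Python sketch below uses a hypothetical helper, `physical_slot`, to show that bit i and bit i + 32 always land in adjacent positions, so a move between the upper and lower words needs only a connection to the neighboring bit slice.

```python
def physical_slot(bit: int) -> int:
    """Position of logical bit 0..63 in the order 0, 32, 1, 33, ..., 31, 63."""
    return 2 * (bit % 32) + bit // 32

# Logical bit numbers listed in physical order across the data stack.
stack_order = sorted(range(64), key=physical_slot)
```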
Clock checking in the RU unit
Every cycle, the data representing the architected state of the
processor are sent to the RU unit from both copies of the processor to
be compared, ECC-encoded, and written into the checkpoint array.
Special precautions had to be taken to ensure that data are written
into the checkpoint array at the appropriate time. There can be a
malfunction in the write address decode circuit or the clock circuit
that can prevent the clock from becoming active (thus not writing the
checkpoint array), or cause it to become active when it should not,
causing a spurious write. Therefore, the entire data vector cannot be
written into the checkpoint array by a single clock--two clocks are
required so that they can check each other. The data vector is
partitioned into two halves, and a unique clock is sent to each half,
as shown in Figure 13. The use of two toggle
latches with an error detector guarantees that a single fault in the
write address decode circuit or clock circuit will be detected.
Figure 13
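Figure 13 is not reproduced here, so the following Python sketch is only one plausible reading of the mechanism: each write clock flips its own toggle latch, and an error detector flags any cycle in which the two latches disagree. The function name `first_clock_error` and the per-cycle event lists are hypothetical modeling devices.

```python
def first_clock_error(clock_a_fires, clock_b_fires):
    """Return the first cycle in which the toggle latches disagree, else None.

    Each argument is a per-cycle list of 0/1 values saying whether that
    half's write clock became active in the cycle.
    """
    toggle_a, toggle_b = 0, 0
    for cycle, (fire_a, fire_b) in enumerate(zip(clock_a_fires, clock_b_fires)):
        toggle_a ^= fire_a         # a clock pulse flips its toggle latch
        toggle_b ^= fire_b
        if toggle_a != toggle_b:   # error detector: latches must agree
            return cycle
    return None
```

In this model a missing clock on one half, or a spurious clock on one half, makes the latches disagree in the cycle where the fault occurs, while a fault-free sequence keeps them in step.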
Array circuits
The CP has a unified 64KB store-through L1 cache. It uses a line
size of 128 bytes to buffer the data needed by the processor units as
well as to interface with the off-chip L2 cache. The L1 cache was an
absolute address cache with four-way set associativity and two-way
interleaving. Because of the store-through design, the L1 uses
parity instead of ECC while maintaining good error recoverability.
During a cache fetch operation, the L1 cache, directory, and TLB are
read simultaneously in the beginning of the cycle. Results from the
directory and TLB are fed into the address comparators and encode logic
to determine whether a cache hit occurs. If a hit occurs, a 1-out-of-4
late-select signal is generated and passed to the cache output
multiplexor to select one of the four cache sets. The directory,
TLB, L1 cache, and the associated comparator and hit decode logic
must all operate within a single system cycle.
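The late-select flow can be sketched behaviorally in Python. The geometry follows from the figures above (64KB, 128-byte lines, four-way set associativity, hence 128 congruence classes), but the model makes simplifying assumptions: address translation by the TLB is treated as already done, the two-way interleaving is ignored, and all names (`split`, `fetch`, `directory`, `data`) are hypothetical.

```python
LINE_BYTES, WAYS, CACHE_BYTES = 128, 4, 64 * 1024
SETS = CACHE_BYTES // (LINE_BYTES * WAYS)    # 128 congruence classes

def split(addr: int):
    """Split an absolute address into (tag, set index, byte offset)."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % SETS
    tag = addr // (LINE_BYTES * SETS)
    return tag, index, offset

def fetch(directory, data, addr: int):
    """Return the addressed byte on a hit, or None on a miss."""
    tag, index, offset = split(addr)
    # All four ways are read at the start of the cycle, before the hit is known.
    candidates = [data[index][way] for way in range(WAYS)]
    # The tag compare yields the 1-out-of-4 late-select signal.
    late_select = [directory[index][way] == tag for way in range(WAYS)]
    if not any(late_select):
        return None
    return candidates[late_select.index(True)][offset]
```

Reading every way before the compare completes is what allows the directory lookup and the data access to overlap within one cycle; the late select only has to steer the final output multiplexor.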
To achieve these performance goals, the arrays utilized various
innovative dynamic circuit techniques. Some of the key design
attributes are
- A six-device CMOS SRAM array cell (33-µm² cell area).
- A high-density ROM array cell (4.4-µm² cell area).
- Multiport register file cells.
- Peripheral array circuits utilizing synchronized pipelined designs
with dynamic and self-resetting circuit techniques for fast access and
cycle times [4].
- All array circuits designed for system robustness, remaining
fully functional at extended power supply and temperature ranges with
worst-case process parameters.
- Advanced ABIST (array built-in self-test) engines. These are used in
the SRAM and ROM macros to provide comprehensive array test coverage.
The ABIST engines are scan-programmable for pattern-generation
flexibility and are capable of performing accurate cycle- and
access-time measurements.
Table 3 lists the characteristics of
various high-performance array macros. As the table shows, these arrays
achieved bipolar-like access and cycle times with the density leverage
of CMOS technology. The L1 cache achieved 2.0-ns access and
operation up to 500 MHz.
Table 3
Characteristics of various array macros.
| Array type | Macro density (Kb) | Macros per chip | Access/Cycle (ns) | Power (mW/Kb) | Area (Kb/mm²) |
| L1 directory | 15 | 1 | <1.5/2.5 | 60 | 7 |
| TLB-A | 7 | 1 | <1.5/2.5 | 60 | 7 |
| TLB-L | 14 | 1 | <1.5/2.5 | 60 | 7 |
| L1 cache | 288 | 2 | <2.4/2.5 | 13 | 14 |
| ROS | 144 | 2 | <2.5/2.5 | 6 | 62 |
| Store buffer | 4.5 | 1 | <1.4/2.5 | 58 | 3 |
Summary
We have described the circuit design techniques used for the S/390
G4 microprocessor to achieve operation up to 400 MHz. The target
performance of 300 MHz was exceeded by focusing on timing and
high-frequency circuit design methodology and techniques. At the
architecture level, cycle time was emphasized over cycles per
instruction. Both synthesis and placement methodologies were developed
with timing as top priority. PLL design, chip floorplan, and clock
distribution were skew-optimized. For the custom design suite,
judicious mixed use of static, dynamic, and SRCMOS array circuits
achieved high-frequency operation without sacrificing design
time. Innovative latch designs allowed performance optimizations in a
controlled manner. Microarchitecture optimizations and circuit
innovations were shown to be effective in improving the performance of
timing-critical macros. The well-planned concurrent top-down and
bottom-up design approach has been shown to be viable for
high-frequency designs as well as for reducing design risk and
shortening the design time.
Acknowledgments
The authors would like to acknowledge the contributions of the
following persons: R. Allen, C. Anderson, R. Averill, J. Badar,
D. Bair, A. Barish, U. Bakhru, D. Balazich, K. Barkley, D. Beece,
D. Becker, I. Bendrihem, M. Billeci, F. Bozso, J. Braun,
T. Bucelot, J. Burns, C. Bui, S. Carey, Y. Chan, M. Check,
C. Chen, K. Chin, E. Cho, J. Clabes, S. Coleman, R. Crea,
A. Dansky, J. Dickerson, R. Dennis, G. Ditlow, R. Dussault,
P. Emma, K. Eng, R. Farrell, R. Franch, J. Feldman, B. Giamei,
J. Gilligan, B. Grossman, A. Gruodis, H. Hansen, R. Hanson,
A. Hartstein, R. Hatch, D. Heidel, D. Hillerud, D. Hoffman,
D. Hui, K. Jenkins, J. Ji, D. Jones, G. Jordy, G. Katopis,
V. Klimanis, A. Kohli, N. Kollesar, T. Koprowski, S. Kowalczyk,
B. Krumm, C. Krygowski, L. Lacey, L. Lange, K. Lewis, W. Li,
J. Liptay, P. Liu, P. Lu, J. Ludwig, V. Lund, R. Mauri,
S. McCabe, T. McNamara, T. McPherson, D. Merrill, P. Minear,
P. Muench, M. Mullen, J. Navarro, J. Neely, H. Ngo, T. Nguyen,
G. Northrop, D. Ostapko, P. Patel, A. Pelella, E. Pell, C. Price,
J. Rawlins, W. Reohr, P. Restle, S. Risch, R. Rizzolo,
B. Robbins, R. Robortaccio, G. Scharff, M. Scheuerman,
E. Schwartz, A. Sharif, K. Shepard, K. Shum, T. Slegel, H. Smith,
P. Strenski, S. Swaney, F. Tanzi, S. Tsapepas, A. Tuminaro,
P. Turgeon, G. VanHuben, S. Walker, L. Wang, D. Webber,
C. Webb, B. Wile, P. Williams, B. Winter, T. Wohlfahrt, D. Wong,
B. Wu, and F. Yee.
*Trademark or registered trademark of International Business
Machines Corporation.
References
Received December 12, 1996; accepted for publication May 16, 1997