VLIW prototype semicustom chip
The VLIW prototype includes an IBM CMOS IIs standard cell chip
designed at the gate level. Since the CMOS IIs technology was both
area-limited and pin-limited for VLIW, extensive partitioning was
required. The semicustom chip implements three different functions,
depending on a "chip characteristic" input:
- a 24-port register file;
- a next-address multiplexer for 8-way branching and conditional
execution; and
- a crossbar switch for performing 4 memory accesses from 8
interleaved memory banks.
Register file
The register file (RF) holds 64 registers of 33 bits each and has 24 ports,
which allow 16 read and 8 write operations in each cycle for the 8 ALUs.
The result of an operation performed in cycle n is written into
the register file in cycle n+1 during the subsequent write-back
pipeline stage. However, bypass paths between all ALUs are implemented,
so that any ALU can use the result of any other ALU in the immediately
following cycle.
- Partitioning:
- There are two copies of the register file, each allowing 8 read
and 8 write operations; each chip implements a 5-bit slice of the
8R/8W port register file (7 chips per copy to cover the 33 bits,
14 RF chips total).
- Critical delays:
- At the beginning of an execution cycle, the ALU input is read
from the register file or an immediate field, or data that has
not yet been written to the register file is bypassed from an
ALU write buffer (also inside the RF chip);
this path contains three gate levels (plus output driver overhead).
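The operand selection described above can be sketched behaviorally in Python. This is an illustration only, not the gate-level design: the names read_operand, write_back and write_buffer are hypothetical, and the sketch assumes the write buffer holds at most one pending result per destination register.

```python
def read_operand(src, regs, write_buffer, immediate=None):
    """Return the ALU input for one source operand.

    regs         -- list of 64 register values (committed state)
    write_buffer -- dict {register number: value} of results produced in
                    the previous cycle and not yet written to the RF
    immediate    -- immediate field value, or None for a register operand
    """
    if immediate is not None:
        return immediate           # operand comes from the instruction
    if src in write_buffer:
        return write_buffer[src]   # bypass: result not yet in the RF
    return regs[src]               # ordinary register file read


def write_back(regs, write_buffer):
    """Commit pending results (the write-back stage in cycle n+1)."""
    for reg, value in write_buffer.items():
        regs[reg] = value
    write_buffer.clear()
```

With a result for register 5 still pending in the write buffer, read_operand(5, ...) returns the bypassed value rather than the stale register contents; after write_back the same read is served from the register file.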
Next-address multiplexer
The next-address multiplexer (NAMUX) implements the 8-way branching in
a tree-instruction, conditional execution, and the interrupt
logic.
- Partitioning:
- There are four copies of the chip, with each one being responsible
for driving the address for two target instructions of the current
tree-instruction. The outputs from the four chips are ORed together
externally, and sent to the instruction memory. Only one of the
8 possible next addresses will be driven to the instruction memory.
- Critical delays:
- Copies of condition codes are kept in each NAMUX chip. The next
instruction address is computed from the condition codes within
two AND-OR gate levels in each chip (plus output driver overhead
and ORing).
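The partitioning above can be illustrated with a small Python model. It is a sketch under stated assumptions: the text does not say how the taken path is derived from the condition codes, so the model takes a one-hot list of path selects as input; the assignment of targets 2k and 2k+1 to chip k is also assumed, as is the convention that a chip holding no selected target drives all zeros so that the external OR passes the single selected address through. The names namux_chip and next_address are hypothetical.

```python
def namux_chip(chip, selects, targets):
    """Output of one NAMUX copy: an AND-OR over its two target addresses.

    selects -- 8 booleans, one per target of the tree-instruction,
               assumed one-hot (exactly one path through the tree is taken)
    targets -- 8 candidate next addresses (integers)
    """
    out = 0
    for t in (2 * chip, 2 * chip + 1):  # assumed target-to-chip assignment
        if selects[t]:
            out |= targets[t]
    return out


def next_address(selects, targets):
    """External OR of the four chip outputs, sent to instruction memory."""
    assert sum(selects) == 1, "exactly one target must be selected"
    addr = 0
    for chip in range(4):
        addr |= namux_chip(chip, selects, targets)
    return addr
```

Because the selects are one-hot, three of the four chips contribute zero and the OR yields exactly the selected target address.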
Crossbar switch
A 4*8 crossbar switch (XBAR) allows four memory accesses to any of the
eight interleaved banks of data memory (i.e. 4 memory ports are
connected to 8 memory banks). The crossbar switch takes its addresses
from special "MAR" (memory address) registers inside the crossbar,
which must have been loaded in a previous cycle. All four accesses
complete in one cycle when there are no bank conflicts. When more than
one port accesses the same bank, the machine stalls for the minimum
number of cycles needed to resolve the conflict.
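The stall behavior can be sketched as follows. This is a minimal model, assuming each bank services one access per cycle and that the bank is selected by the three address bits just above the byte-within-word bits of a 32-bit word; the exact bit assignment is not given in the text, and the function names are hypothetical.

```python
from collections import Counter

NUM_BANKS = 8


def bank_of(address):
    """Bank selected by a byte address (assumed word-interleaved banks)."""
    return (address >> 2) % NUM_BANKS


def cycles_needed(addresses):
    """Cycles to complete the accesses, one access per bank per cycle."""
    if not addresses:
        return 0
    per_bank = Counter(bank_of(a) for a in addresses)
    return max(per_bank.values())


def stall_cycles(addresses):
    """Extra cycles beyond the single conflict-free cycle."""
    return max(cycles_needed(addresses) - 1, 0)
```

Under these assumptions, four accesses to four different banks need one cycle and no stall, while three accesses that map to the same bank need three cycles, that is, two stall cycles.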
- Partitioning:
- There are eight copies of the chip, each implementing a 4-bit
slice of the 4+8 data/address paths. Each data path is subdivided
into 4-bit slices, to facilitate the required byte rotates for
byte and halfword loads, with the following bit configuration:
(0,8,16,24), (1,9,17,25),...(7,15,23,31)
- Critical delays:
- Copies of the low-order bits of all MAR registers (used for bank
selection and byte/halfword selection within a word)
are kept in every XBAR chip.
For a load, the 4-way port-to-bank address multiplexing is done
in two AND-OR gate levels starting from the MAR registers
(including one logic level hidden inside LSSD latches). The
8-way bank-to-port data multiplexing, store buffer bypassing,
and rotate for byte/halfword loads take two AND-OR levels on
the way back from the data memory (plus driver/receiver overhead).
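The bit configuration listed under Partitioning means that chip k holds bits k, k+8, k+16 and k+24 of each 32-bit path, so a rotate by whole bytes never moves a bit to a different chip. The short Python check below illustrates this; the function names are hypothetical, and the rotate direction is chosen arbitrarily since the property holds either way.

```python
WORD_BITS = 32
NUM_CHIPS = 8


def slice_bits(chip):
    """Bit positions held by one XBAR chip: chip, chip+8, chip+16, chip+24."""
    return tuple(range(chip, WORD_BITS, NUM_CHIPS))


def rotate_bytes(bit, nbytes):
    """Position of a bit after rotating the word by nbytes bytes."""
    return (bit + 8 * nbytes) % WORD_BITS


def rotate_stays_on_chip():
    """Check that every byte rotate maps each slice onto itself."""
    for chip in range(NUM_CHIPS):
        bits = set(slice_bits(chip))
        for nbytes in range(4):
            if {rotate_bytes(b, nbytes) for b in bits} != bits:
                return False
    return True
```

Since a byte rotate changes a bit position by a multiple of 8, the position modulo 8 (and hence the chip) is unchanged, which is why each chip can perform its part of the rotate without exchanging data with the other slices.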
We concentrated on reducing the following path delays:
- RF access or bypass, ALU operation
(RF clock to data output + ALU + wire delays + skew +
setup of "condition code" inputs for NAMUX, before next clock);
- data cache load
(XBAR clock to bank address output + PAL for SRAM chip select +
SRAM access time + XBAR bank-to-port mux delay + wire delays + skew +
setup of "write buffer" inputs for RF, before next clock);
- computation of next address from condition codes, and VLIW fetch
(NAMUX clock to instruction address output +
external logic for ORing addresses +
external logic for muxing the address from NAMUX and the address from PS/2 +
SRAM access time + wire delays + skew +
setup of "source register number" inputs for RF, before next clock).
These paths were all measured to be under 60 ns. However, at least one
unexpected longer path was discovered after the system was built,
forcing the use of a longer cycle time (90 ns) than in the original
design. The long path occurs, for example, when two store operations
cause a bank conflict by accessing the same memory bank:
- from the condition codes, a NAMUX chip
generates a signal that indicates
a store operation is on the taken path of the tree;
- the signals from the 4 NAMUX chips are ORed externally;
- based on this signal, XBAR generates a signal that there is a bank
conflict, i.e. more than one memory operation is
accessing the same bank (this XBAR delay was longer than
anticipated);
- the bank-conflict signal is ORed on the board, in a PAL, with
other stall signals, generating a clock gate signal;
- the clock gate signal is driven to many places in the machine,
and must be available sufficiently before the next clock period
starts (so less than the full clock period is available to
generate the clock gate signal).