Journal of Research and Development  
Volume 41, Numbers 4/5, 1997
IBM S/390 G3 and G4
DOI: 10.1147/rd.414.0405
   

S/390 Parallel Enterprise Server Generation 3: A balanced system and cache structure

by G. Doettling, K. J. Getzlaff, B. Leppla, W. Lipponer, T. Pflueger, T. Schlipf, D. Schmunkamp, and U. Wille
Since initiating the information technology industry-wide transition from bipolar to CMOS technology with the first generation of S/390® processors in 1994, IBM reached another major milestone with the introduction of the third generation in September 1996. The balanced system and cache structure and the modularity of the components of Generation 3 support a wide performance range from a uniprocessor to a high-performance multiprocessing system. Because of this modularity, Generation 4 is also based on this structure.

Introduction

The IBM S/390* Parallel Enterprise Server Generation 3 and the IBM S/390 Multiprise* 2000 (both called G3) and the later G4 systems are tightly coupled S/390 symmetrical multiprocessing systems with up to ten processors, a three-level cache hierarchy, and up to 8 GB (for G3) of physical memory. The modularity of the system and cache structure supports a wide performance range based on the same chip set. The G3 spans a range from a uniprocessor with one memory card to a high-end system with ten processors and four memory cards. There are either one or two additional processors in the system structure, which serve as system assist processors (SAPs) for I/O operations. The balanced system and cache structure design makes it possible to retain all elements unchanged for G4, except for replacing the PU (processing unit, or processor) and L2 chips. First an overview of the G3 system and cache structure is given, followed by discussions of the implementation and the features of the PU and L2 chip and the common G3/G4 parts (BSN, MBA, STC/memory card, and CGC). Special focus is given to the error-detection and recovery capabilities of the G3 system.

G3 system structure

The 12-way four-bus system structure and components of the chip set are shown in Figure 1. The chip set consists of processor units (PUs), level-2 caches (L2s), bus-switching network adapters (BSNs), storage controllers (STCs) on the memory cards, I/O adapters (MBAs), and the clock chip (CGC). A level-1 cache is included on the PU, and a level-3 cache (called L2.5) on the BSN.

Figure 1

All system buses are 16 bytes wide and bidirectional. Because of constraints on chip pad area and silicon area, the L2, BSN, and STC are each implemented as a pair of identical chips, referred to as the high (H) and low (L) chip. In the G3 system, a 1:1 ratio between the processor cycle and the bus cycle is reached. The cycle time is 5.9 ns in the high-end system, while the smaller one- or two-bus systems run with a relaxed cycle time.

In the four-bus system the PU, L2, BSN, and MBA chips are packaged on a 127-mm air-cooled multichip module (MCM). Figure 2 shows the chip placement on the MCM, which is optimized for minimal bus wire length. The 127-mm MCM uses 20 pairs of wiring planes and has a total of 1732 signal module pins. The CGC chip and the STC chips are packaged on single-chip modules (SCMs). Two STC chips are placed on a memory card. The MCM, the CGC chip, and the memory cards are placed on a planar board.

Figure 2

The one- and two-bus systems have a card-on-board (COB) package with all MCMs and SCMs mounted on one card. Four PU chips and two L2 chips are assembled on a 63-mm MCM, and the BSN, MBA, and CGC chips are packaged on SCMs. Table 1 presents an overview of the capabilities of the chip set to support different system configurations.

Table 1 System configurations.
System                   Four-bus   Two-bus   One-bus
Number of PU chips       2-12       2-8       2-4
L2 size per PU           256 KB     128 KB    64 KB
Number of BSN chips      8          4         2
L2.5 size                2 MB       1 MB      512 KB
Number of MBA chips      2          1         1
I/O path                 12         6         3
Number of memory cards   4          2         1

Cache structure

The memory subsystem consists of a three-level cache hierarchy and the memory. This cache structure is common throughout the low-end, intermediate, and high-end systems, differing only in cache size and number of buses. The line size for all hierarchy levels is 128 bytes.

o Level-1 cache (L1)
The level-1 cache is integrated on the PU chip and is implemented as a unified instruction and data cache. The size is limited to 32 KB to fit on the PU chip and to achieve a one-cycle access. To resolve L1 cache misses, a line request is sent to the private L2 with the address of the missing quadword (16 bytes). This quadword is returned first to allow the PU to proceed immediately. The remaining seven quadwords follow and may wrap within a cache line. The L1 cache is parity-checked. Intermittent errors (soft errors, etc.) are recovered by reloading the defective line from the L2. The L1 cache includes the following features:

  • Store-through concept.
  • Size: 32 KB.
  • Eight-way set-associative.
  • Parity-checked with recovery.
  • One-cycle access.
  • 16-byte access.
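
The critical-quadword-first delivery described above can be sketched as follows. This is a minimal illustration, not the hardware's actual sequencing: the text states only that the missing quadword is returned first and that the remaining seven "may wrap within a cache line", so the ascending wrap-around order used here is an assumption, and the function name is hypothetical.

```python
LINE_SIZE = 128                       # bytes per cache line (from the text)
QUADWORD = 16                         # bytes per quadword (from the text)
QW_PER_LINE = LINE_SIZE // QUADWORD   # eight quadwords per line


def quadword_return_order(miss_address: int) -> list[int]:
    """Return quadword indexes (0-7) in the assumed order of delivery.

    The quadword containing the missing address comes first; the other
    seven are assumed to follow in ascending order, wrapping at the end
    of the 128-byte line.
    """
    first = (miss_address % LINE_SIZE) // QUADWORD
    return [(first + i) % QW_PER_LINE for i in range(QW_PER_LINE)]


if __name__ == "__main__":
    # A miss at byte offset 0x50 within its line falls in quadword 5.
    print(quadword_return_order(0x1050))
```

Whatever the exact order, the point of the scheme is that the PU can resume as soon as the first 16 bytes arrive, instead of waiting for all eight transfers.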
o Level-2 cache (L2)
The L2 cache chip comprises four independent cache partitions, each with a size of 64 KB. They can operate concurrently, and each is assigned to one PU to serve as a private L2 cache (see Figure 3). The chip is implemented with an 8-byte dataflow and interface. A pair of L2 chips work synchronously in the 16-byte bus structure, providing four PU ports and two bus ports for the L2-BSN bus connection. A PU has a 16-byte bus to each of the L2 pairs and routes the cache-line requests according to the defined address class scheme.

Figure 3

The L2 is a store-in cache. It always holds the current copy of a line, because its L1 cache stores through. The L2 cache performs bus snooping on all L2-BSN buses to maintain the data coherence within the L1- and L2-cache-level hierarchy. The L2 cache includes the following features:

  • Private cache.
  • Store-in concept.
  • Size: 256 KB per PU.
  • Eight-way set-associative.
  • ECC-checked.
  • Five-cycle access.
  • 16-byte access.
o Level-3 cache (L2.5)
The L2.5 cache is a shared cache [1] serving all PUs. It is implemented on pairs of BSN chips. A pair of BSN chips performs the required bus arbitration and bus switching, and provides the L2.5 cache with a size of 512 KB per bus. The L2.5 cache contains two banks, which can operate concurrently; its features include
  • Shared cache.
  • Store-through concept.
  • 512 KB per bus, 2 MB maximum.
  • Eight-way set-associative.
  • ECC-checked with deletion feature per bank.
  • 14-cycle access.
  • 16-byte access.
  • Two independent banks.
o Memory
The memory is implemented with DRAMs on memory cards, and is controlled by a pair of STC chips which are located on the card and run synchronously. A memory card contains two memory banks, which can operate concurrently; one to four cards may be used depending on the system configuration. The S/390 key storage is implemented on the memory cards. Each card contains the full set of key entries to avoid bus interference for combined data and key operations. The memory card features include
  • Shared-system memory.
  • Size: up to 2 GB per card.
  • One to four cards depending on the configuration.
  • ECC-checked, with redundancy bits.
  • 32-cycle access.
  • Two independent banks.

Bus structure

The bus structure of a high-end system (Figure 1) comprises four logical buses numbered 0 to 3. Each of these buses is controlled by a pair of BSN chips [2]. A "logical" bus can consist of up to three "physical" buses to connect the PU-L2 clusters to a pair of BSN chips and two additional buses to connect the MBA chips. Each of the pairs of BSN chips controls one memory bus, with a pair of STC chips located on the memory card. All system buses are 16 bytes wide, parity-checked, and bidirectional, and are connected by pairs of L2, BSN, and STC chips. The synchronous operation of a pair is checked every cycle. Command and address are duplicated on each half of the bus during the command/address cycle to allow both chips to run synchronously.

o PU-L2 cluster
PU and L2 cache chips are grouped in clusters (A, B, C). Each cluster contains one to four PUs and one or two pairs of L2 chips (Figure 3). A PU owns a 64KB cache partition on each of the four L2 chips, giving a total of 256 KB per PU in the four-bus system. The cluster contains two L2 chips in the one- or two-bus system, with a total of 64 KB or 128 KB per PU. A pair of L2 chips has four PU ports, each with a 16-byte data bus and a 4-byte address bus. These are private buses with a simple handshaking protocol (PU-L2 bus). They are used by the PUs

  • To request cache lines from the L2 with the address put on the address bus.
  • To perform store-through operations with the address put on the address bus and a 16-byte datablock on the data bus.
  • To communicate with other PUs, MBAs, or memory via the L2 cache.
o Logical L2-BSN bus
A pair of L2 chips contains two bus ports, each 16 bytes wide. These are the L2-BSN buses, which connect the PU-L2 clusters with pairs of BSN chips. The four-bus system contains two pairs of L2 chips within each cluster, providing a connectivity of 4 × 16-byte buses to the memory via the BSN chips. A simple routing scheme is used to select the bus and bank for a memory access. Bits 22 and 23 of the line address define the bus number, and bit 21 selects one of the two memory banks. This maps the cache lines with the low-order line address 0 and 1 in the first memory card, cache lines with the line address 2 and 3 in the next card, and so on. This fine granularity provides an equally distributed load on the buses.
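
The routing scheme can be illustrated with a short sketch. It assumes IBM-style bit numbering within a 32-bit word (bit 0 is the most significant bit, bit 31 the least significant), so that bits 25-31 are the byte offset within a 128-byte line; this numbering is an assumption, chosen because it reproduces the mapping of lines 0 and 1 to the first card, 2 and 3 to the next, and so on. The function name is hypothetical.

```python
def route_line(address: int) -> tuple[int, int]:
    """Return (bus, bank) for a byte address.

    Assumes bit 31 is the least significant bit, so:
      bits 22-23 -> shift right by 8, mask two bits (bus number 0-3)
      bit 21     -> shift right by 10, mask one bit (memory bank 0-1)
    """
    bus = (address >> 8) & 0b11
    bank = (address >> 10) & 0b1
    return bus, bank


if __name__ == "__main__":
    # Walk through consecutive cache lines (128 bytes apart).
    for line in range(10):
        bus, bank = route_line(line * 128)
        print(f"line {line}: bus {bus}, bank {bank}")
```

Under this reading, consecutive pairs of lines rotate across the four buses and every block of eight lines alternates between the two banks, which is what spreads the load evenly.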

The L2-BSN buses are shared between the PUs and the MBAs and are controlled by pairs of BSN chips with a simple handshake protocol. There is no separate address bus available; command/address and data are multiplexed on the same bus. The BSN controls access to the bus, granting access to one of the PUs or to the MBAs. It redrives the command/address cycle to the memory via the STC and to the other PU-L2 clusters to allow bus snooping.

A cache line is transferred in eight datashots of 16 bytes each; it is redriven to the STC for line-store operations and to the PU-L2 clusters for line-fetch operations. The L2-BSN bus supports two-way interleaving by using the latency between the command/address cycle and the first data transfer cycle for line-fetch operations. In this gap another command/address cycle can be issued to the other bank of this bus, utilizing the two independent banks of the L2.5 cache and the memory. A pair of BSN chips has four L2-BSN bus ports to support a maximum of four PU-L2 clusters (three are used in the G3 system), two MBA-BSN bus ports, and one BSN-STC bus port to connect the STC chips and the memory.

o BSN-STC bus
The BSN-STC bus is controlled by a pair of BSN chips. The command/address format and the basic protocol are similar to those of the L2-BSN bus, since the BSN simply operates as a switch between them. There is an internal latency of two cycles for the command/address cycle and for a data-transfer cycle. (This bus works in interleave mode as well.) The bus protocol allows the cancellation of line-fetch operations that have already been started without knowing whether the line is in the L2.5 cache.

Bus operations

o Bus commands
The bus structure connects all components of the system; it provides communication among them and allows access to internal objects such as the key storage on the memory card or to I/O adapters via the MBA. The bus operations can be broken down into eight command groups:

  • Line fetch.
  • Line store/line fetch.
  • Cast-out.
  • Line-invalidate.
  • Absolute fetch/store operations with 1-8 bytes.
  • Key-storage operations.
  • Sense/control operations.
  • Fetch/store MBA with 1-128 bytes.
There are several line-fetch commands; they specify whether the line is requested for a data fetch or store operation (LF-DFETCH or LF-DSTORE) or for an instruction fetch (LF-IFETCH), or whether the key from the key storage is requested together with the data (LFK). The line-store/line-fetch command combines an LRU cast-out with the line fetch to free up an entry in the L2 cache. This LRU cast-out is also available as a separate command, which is used for cache-to-cache cast-out operations as well (cross-cache cast-out). A line-invalidation command is signaled from one PU to all other PUs via the L2 cache so that the requesting PU becomes the single owner (exclusive state) of a line. Absolute fetch/store commands allow access to the memory, bypassing the caches. They are used for internal functions. S/390 key-storage commands perform read/write operations to a key-storage entry or modification to the reference and change bits.

Sense/control commands are used for many internal purposes. They permit the exchange of control and status information among the PU and L2, BSN, MBA, or the STC and are also used for communication between PUs via the L2 caches.

MBA fetch/store commands provide direct memory access for the I/O adapters, connected to the MBA with self-timed interfaces (STIs). This access is monitored by the L2 caches to maintain the data coherence within the memory subsystem.

o Bus arbitration
Centralized bus arbitration is performed by the BSN chips for each bus and bank. It prioritizes the access of the MBA and keeps the access to the banks in balance.

o Cache coherence
Data coherence within the memory subsystem is controlled by the L2 caches using a bus-snooping mechanism. The snooping is limited to the line address class of one logical bus; there is no interference with other buses. Line address hits during bus snooping may cause the L2 cache(s) to change the status of the line in their directories or to cast out a modified line to the requestor and to the memory for update. Table 2 shows the actions of L2 caches for line fetch due to fetch (LF-DFETCH) or due to store (LF-DSTORE), depending on the state of the cache line. The implemented scheme follows the MESI protocol.

Table 2 L2 cache bus-snooping response.
Operation    Old state   New state   Cast-out   Memory update
LF-DFETCH    I           I           no         no
             S           S           no         no
             E           S           no         no
             M           S           yes        yes
LF-DSTORE    I           I           no         no
             S           I           no         no
             E           I           no         no
             M           I           yes        no
Cache-line status: M = Modified, E = Exclusive, S = Shared, I = Invalid.
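
One reading of Table 2 can be written as a lookup table. This is a sketch under the assumption that the four rows per command correspond to the old states I, S, E, and M in that order; it agrees with the statements elsewhere in the text that a modified line is cast out to the requestor and that the memory update is skipped for LF-DSTORE. The names `SNOOP_RESPONSE` and `snoop_response` are hypothetical.

```python
# (command, old state) -> (new state, cast-out, memory update)
SNOOP_RESPONSE = {
    ("LF-DFETCH", "I"): ("I", False, False),
    ("LF-DFETCH", "S"): ("S", False, False),
    ("LF-DFETCH", "E"): ("S", False, False),
    ("LF-DFETCH", "M"): ("S", True, True),
    ("LF-DSTORE", "I"): ("I", False, False),
    ("LF-DSTORE", "S"): ("I", False, False),
    ("LF-DSTORE", "E"): ("I", False, False),
    ("LF-DSTORE", "M"): ("I", True, False),
}


def snoop_response(command: str, old_state: str) -> tuple[str, bool, bool]:
    """Return how a snooping L2 reacts to a line fetch seen on the bus."""
    return SNOOP_RESPONSE[(command, old_state)]


if __name__ == "__main__":
    for (command, old), (new, cast_out, mem) in SNOOP_RESPONSE.items():
        print(f"{command:10} {old} -> {new}  cast-out={cast_out}  memory update={mem}")
```

The asymmetry between the two commands is the interesting part: a fetch leaves other copies readable (shared), while a fetch due to store invalidates them so the requestor becomes the single owner.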

The snooping function requires the command/address information from all L2 caches of one logical bus. The BSN receives the command/address from the requestor on the L2-BSN bus and distributes it to all other PU-L2 clusters with one cycle of latency on each of the other physical L2-BSN buses. Hits in a private L2 cache do not require cross-cache communication. The advantage of this concept is that there is no need for a central control element or central directory copies; it therefore allows concurrent operations on four buses. No extra bus cycles are spent in the critical access path, since the bus snooping is done in parallel.

o Line-fetch operation
The execution of a line-fetch operation depends on the type of line-fetch command and on the state of the line in the private L2 cache, in the other L2 caches, or in the L2.5 cache. Figure 4 shows the different sources and paths (1 to 4) for a requested line and the minimum access times in processor cycles (including the repetition of the failing instruction). The possible paths are as follows.

Figure 4

Path 1   The L2 cache has a hit and provides the line, usually within five cycles.

Path 2   The L2.5 cache has a hit and returns the line in 14 cycles if it is "shared" or the command is a line fetch for instructions (LF-IFETCH). Those lines are shared by definition. Otherwise, the L2.5 cache must wait for the result of bus snooping, because the line may be modified in one L2 cache; eight or more waiting cycles are added in this case. The line is also stored in the L2 cache.

Path 3   There is no hit in the L2 or L2.5 cache. The memory returns the line in 32 cycles or more, depending on its availability. The memory also provides the line for a line fetch with key command (LFK), regardless of a hit in the L2 or L2.5 cache, because the key from the key storage must be delivered. The line is also stored in the L2 and L2.5 caches.

Path 4   Another L2 cache has the requested line in "modified" state and returns it in 23 cycles or more, depending on its availability. During this cross-cache cast-out, the line is put on the L2-BSN bus for the requestor and on the BSN-STC bus for a memory update. This update is not done for a line fetch that is due to store (LF-DSTORE), because the requestor changes the line again immediately. The line is also stored in the L2 and L2.5 caches.
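
The four paths can be summarized in a small decision function. This is a simplified sketch: the minimum cycle counts come from the text, but the precedence among the conditions (for example, how LFK interacts with a line modified in another L2) is an assumption, and the function and parameter names are hypothetical.

```python
def line_fetch_path(command: str, l2_hit: bool, l25_hit: bool,
                    l25_shared: bool, modified_in_other_l2: bool) -> tuple[int, int]:
    """Return (path number, minimum access time in processor cycles)."""
    if command == "LFK":
        # The key must come from key storage, so memory always supplies the line.
        return 3, 32
    if l2_hit:
        return 1, 5
    if l25_hit and (l25_shared or command == "LF-IFETCH"):
        # Shared lines (and instruction lines) need not wait for snooping.
        return 2, 14
    if modified_in_other_l2:
        # Cross-cache cast-out from the owning L2.
        return 4, 23
    if l25_hit:
        # Non-shared L2.5 hit: at least eight cycles waiting for snoop results.
        return 2, 14 + 8
    return 3, 32


if __name__ == "__main__":
    print(line_fetch_path("LF-DFETCH", False, True, True, False))
    print(line_fetch_path("LF-DSTORE", False, False, False, True))
```

All figures are minimums; the text notes that paths 2 to 4 can take longer depending on availability.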

o Example of bus timing
An example of bus timing for a line-fetch operation for a shared line with an L2.5 cache hit is shown in Figure 5. A PU from PU-L2 cluster A gets an L1 cache miss for a fetch operation (F). The failing address is sent to the L2 cache, which performs an L2 directory search (miss) and raises the request line to the BSN. The BSN accepts the request and gives the bidirectional L2-BSN bus to the L2 cache (L2-BSN bus enable = L2). The requesting L2 cache puts command and address (C) on the L2-BSN bus, and the BSN redrives command and address on the L2-BSN buses for the PU-L2 clusters B, C. The L2 caches in clusters B, C search their directories in the next cycle (bus snooping).

Figure 5

The BSN accesses the configuration table (see the related section) and sends the command and the physical address via the BSN-STC bus to the memory to start the memory operation in parallel. The BSN searches the L2.5 directory in the same cycle and fetches the first quadword from the L2.5 cache. The BSN gets the hit and the shared state for this line from the L2.5 directory and immediately puts the first and the following datashots on the L2-BSN buses (L2-BSN bus enable = BSN). The BSN also cancels the memory operation.

Command/address (C) and the last datashot (7) on the bus are always driven for two cycles in order to quiesce the bus. Finally the requesting L2 drops the request with the first data cycle and sends the data with one cycle of latency to its PU. The PU repeats the failing instruction (F) and continues with processing while the remaining data blocks are transferred.

o Storage access penalty
The storage access penalty is the latency between the data request to the cache hierarchy and the first returned 16-byte data block, plus the time required to repeat the missing instruction. The L1 cache access of one cycle is counted as part of the instruction processing.

Table 3 shows the cache sizes, the access penalties, and the cache hit rate for each level of the cache hierarchy for a ten-way and a five-way system. The penalty is given in processor cycles of 5.9 ns, and the hit rate in percent per instruction for a typical S/390 commercial workload with heavy memory and bus load. Both systems have the same L1 cache size with a hit rate of 89%, so 11% of the storage references must be resolved by the following cache hierarchy or memory. Of this remaining 11%, 5 percentage points are covered by the L2 cache (4 in the five-way system), 3 by the L2.5 cache, and 3 by the memory (4 in the five-way system). Both systems have almost identical hit rates, providing equally balanced bus and cache behaviors. On the basis of the results of Table 3, three design points were considered: a cache hierarchy 1) with L1; 2) with L1 and L2; 3) with L1, L2, and L2.5.

Table 3 Typical cache hit rate in a ten-way/five-way system.
                                Ten-way, four buses      Five-way, two buses
Memory        Access penalty    Cache size   Hit rate    Cache size   Hit rate
subsystem     (PU cycles)                    (%)                      (%)
L1 cache      1                 32 KB        89          32 KB        89
L2 cache      5                 256 KB/PU    5           128 KB/PU    4
L2.5 cache    14                2 MB         3           1 MB         3
Memory        32                8 GB max     3           4 GB max     4

An arbitrary scale is used for the storage access penalty (see Figure 6), with the L1 cache-only design normalized to 1. The storage access penalty decreases rapidly from point 1 to point 2 and moderately from point 2 to point 3. For this sample, the diagram illustrates the efficiency of the three-level cache hierarchy and the benefit of the L2.5 cache.
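
A rough way to compare the three design points is to weight each level's minimum access penalty by its hit rate from Table 3. The sketch below does this for the ten-way system. It assumes that references not caught by a missing cache level fall through to memory at the minimum 32-cycle penalty and it ignores queueing, so it is not necessarily how Figure 6 was derived; the names are hypothetical.

```python
PENALTY_CYCLES = {"L2": 5, "L2.5": 14, "memory": 32}   # from Table 3
TEN_WAY_HIT_PERCENT = {"L2": 5, "L2.5": 3, "memory": 3}  # from Table 3


def mean_penalty(hit_percent: dict[str, int]) -> float:
    """Average storage access penalty in cycles per instruction."""
    return sum(pct * PENALTY_CYCLES[level] for level, pct in hit_percent.items()) / 100


def design_points(hit_percent: dict[str, int]) -> list[float]:
    """Penalty for 1) L1 only, 2) L1 + L2, 3) L1 + L2 + L2.5."""
    total_miss = sum(hit_percent.values())  # 11% of references miss the L1
    l1_only = {"memory": total_miss}
    l1_l2 = {"L2": hit_percent["L2"], "memory": total_miss - hit_percent["L2"]}
    return [mean_penalty(l1_only), mean_penalty(l1_l2), mean_penalty(hit_percent)]


if __name__ == "__main__":
    points = design_points(TEN_WAY_HIT_PERCENT)
    for number, cycles in enumerate(points, start=1):
        print(f"point {number}: {cycles:.2f} cycles, normalized {cycles / points[0]:.2f}")
```

Under these assumptions the ten-way figures give about 3.52, 2.17, and 1.63 cycles per instruction, which shows the same qualitative shape as described: a large step when the L2 is added and a smaller one for the L2.5.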

Figure 6

o Bus bandwidth
The bus bandwidth determines the dynamic part of the storage penalty. An insufficient bandwidth of the memory subsystem increases the latency from the processor's viewpoint, because the queueing on the bus leads to more waiting cycles. This bus structure minimizes the impact by utilizing

  • 16-byte-wide buses.
  • Up to four concurrently operating buses.
  • Two-way interleaving on each bus.
  • Low bus cycle time.
  • Bus protocol.
The bus protocol avoids unnecessary waiting cycles once a command has been granted by the BSN arbiter and issued by an L2. The command is routed immediately to the L2.5 cache and further to the memory. Bus snooping by the other L2 caches and the L2.5 directory lookup are done in parallel and may cause a late cancellation of the memory operation.

The peak data rate on the L2-BSN bus is reached with permanent requests for shared lines and L2.5 cache hits; 256 bytes can be transferred in 26 bus cycles, including the command/address cycles, the access cycles, and the eight data cycles per cache line. This results in a data rate of 1.67 GB/s per bus, or a total of 6.68 GB/s for a four-bus system at 5.9-ns cycle time.
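
The peak-rate figures follow directly from the numbers in the paragraph above; the short calculation below reproduces them. It takes 1 GB as 10^9 bytes, which is an assumption that matches the quoted result; the function name is hypothetical.

```python
def peak_rate_gb_per_s(bytes_moved: int = 256, bus_cycles: int = 26,
                       cycle_ns: float = 5.9) -> float:
    """Peak data rate of one bus in GB/s (bytes per nanosecond)."""
    return bytes_moved / (bus_cycles * cycle_ns)


if __name__ == "__main__":
    per_bus = peak_rate_gb_per_s()
    print(f"per bus:   {per_bus:.2f} GB/s")
    print(f"four buses: {4 * per_bus:.2f} GB/s")
```

The quoted total of 6.68 GB/s is four times the rounded per-bus value of 1.67 GB/s; multiplying the unrounded value gives a figure that differs only in the last digit.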

A bus-timing sequence is given in Figure 13 (shown later) with a line fetch from memory interleaved with a line fetch from the L2.5 cache at a peak data rate of 1.4 GB/s per bus. This example shows the bus interleaving effect. Overall, a four-bus system will provide a sustained throughput of 5-6 GB/s. Calculation of the bus utilization for typical S/390 workloads shows a utilization of 30-40% for the L2-BSN bus and 20-30% for the BSN-STC bus.

o Bus cycle time
All system buses run at the cycle time of the processor. This avoids additional latency in the PU or L2 chip for cycle-time adaptation and improves the bus bandwidth. This is achieved by the point-to-point characteristic of the buses (with the exception of the MBA-BSN bus). Optimized chip placement on the MCM provides balanced wiring density and short bus wire length. The bus wire length on the MCM does not exceed four chip spaces. The module was carefully wired with min/max wiring rules for each net in order to reach the bus cycle time.

Processing unit (PU)

o Overview
The processing unit (Figure 7) supports the S/390 architecture instruction set [3]. The most frequently used commercial instructions (108) and all floating-point instructions (54) are hard-wired. The remaining instructions are executed by the firmware.

Figure 7

The features of the PU chip include the following:

  • 14.5-mm chip size, CMOS 5X, 7.2 million transistors.
  • 5.9-ns cycle time.
  • 744 chip I/Os used.
  • 32KB unified cache (L1).
  • 32KB RAM and 32KB ROM for firmware.
  • Floating-point unit.
  • Data-compression unit.
  • S/390 timers.
  • Two 16-byte-wide bidirectional buses to the L2 for data transfer and 4 × 4-byte address buses.
  • System measurement instrument (SMI) to support performance verification.
  • Trace buffers for system debug.
  • Escape logic to support system debug.
  • Serial link to the clock chip.
o S/390 implementation
Addressing and dataflow are implemented in 32-bit CISC processors. As in RISC processors, 108 instructions are executed by hardware, with a four-stage pipelining concept. The pipeline stages are i-fetch, decode, execution, and writeback. If an instruction requires more than one execution cycle, it is not sent for writeback until decode and execution are complete.

The i-fetch obtains data from the L1 cache if no other cache operation takes place. The fetched instructions are routed to two 16-byte instruction buffers. Address bit 27 of the virtual instruction address selects which buffer is loaded. To get the instruction buffers filled, an indicator is set if one of the two buffers is empty and a request for reload is raised. The reload is started if the instruction in the decode stage will not use the cache in the execution stage. The decode stage performs the address calculation and operand register read for the S/390 instructions. It loads and starts the firmware program for instructions that are not hard-wired. Firmware and hard-wired instructions such as integer and cache operations are processed in the execution stage. Additional hardware is implemented to speed up firmware programs. The results from the execution stage are transferred in the writeback stage either to an architectured register or to the L1 cache. This instruction implementation leads to an infinite-cache cycles-per-instruction (cpi) performance of 2.4 in a typical commercial environment.

o L1 cache
Since most of the S/390 instructions have a data storage reference, it is important that the L1 cache can be accessed within one cycle. Access to the translation lookaside buffer (TLB), the L1 directory, and the L1 cache is done in parallel. The L1 is an eight-way set-associative cache with store-through and is addressed with the low-order real address bits. In the case of a cache miss, the requested cache line is transferred from the L2 in eight shots of 16 bytes to the L1 fetch buffer (see Figure 8). The instruction which caused the cache miss is continued after the transmission of the first 16-byte shot from the L1 fetch buffer into the L1 cache. During the processing of the following S/390 instructions, the remaining data from the L1 fetch buffer are copied to the L1 cache if they are not being used by instruction processing. Store data and L1 fetch buffer data are merged if the quadword addresses are the same.

Figure 8

Systemwide data integrity is maintained by a broadcasting mechanism between the L1 and L2 caches. After each instruction or between the execution of units of operations, the PU is interruptible by a broadcasting request to update the L1 cache directory.

Virtual addresses have to be translated into absolute addresses. Both are kept in a four-way set-associative TLB with 64 entries in each set, together with the S/390 storage key and protection information. In the case of a cache miss, the TLB provides the 4KB page address to form the complete 35-bit absolute address.

o Firmware
The firmware is used to execute complex instructions and the interrupt handling of the PU. It resides in RAM, ROM, or the memory DRAM subsystem. The addressable firmware region is 128 KB in size; 64 KB are held in the RAM and ROM, and the remaining part is transferred on request from the memory DRAM subsystem to a 128-byte register inside the PU. The firmware instructions have a vertical format; each instruction has a size of 2 bytes. The pipe depth is the same as for the hard-wired S/390 instructions.

o Floating-point
The main floating-point dataflow consists of an add/subtract flow with 116-bit width, and a signed multiplier of 60 × 60 bits. All floating-point operations except for divide, square root, and extended multiply require only one cycle in pipelined mode. Synchronization for instructions which require more than one cycle is done by the floating-point interface, which stalls the PU. A zero-cycle branch is available in conjunction with the floating-point unit, which speeds up the loop processing.

o Data compression
Data compression is based on a Lempel-Ziv algorithm and is fully implemented in hardware. Expansion of 1 byte takes two cycles, and eight cycles are needed for compression.

o Error detection and recovery
To maintain data integrity throughout the system, all data paths, including external buses, are parity-checked on byte boundaries. The multiplier in the floating-point unit includes a residue-checking scheme. The occurrence of a parity check or residue check within a PU will check-stop this PU and propagate the check-stop state to the clock chip and the corresponding L2. The defective PU is fenced "on the fly," and the remaining system, including all L2 caches, continues with processing, thus enhancing availability to the user. Soft errors in large arrays are also detected by parity checkers. In the case of an error, the faulty part is reloaded automatically. If, after the reload, the error comes up again, it is handled as a check-stop condition. Problems during debug can be analyzed by reading the 256-entry-deep trace buffer. With this information, it is possible to track what happened within the PU in the last 256 consecutive cycles. To circumvent hardware problems, a programmable "escape logic" is used to change the behavior of the hardware.
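
Parity checking on byte boundaries can be illustrated in a few lines. The sketch assumes odd parity (each byte plus its parity bit carries an odd number of ones); the text does not say which polarity the G3 uses, and the function names are hypothetical.

```python
def parity_bit(byte: int) -> int:
    """Odd-parity bit for one byte (assumed polarity)."""
    return 1 - (bin(byte & 0xFF).count("1") % 2)


def parity_bits(data: bytes) -> list[int]:
    """One parity bit per byte, as carried alongside a data path."""
    return [parity_bit(b) for b in data]


def failing_bytes(data: bytes, parity: list[int]) -> list[int]:
    """Indexes of bytes whose stored parity no longer matches the data."""
    return [i for i, (b, p) in enumerate(zip(data, parity)) if parity_bit(b) != p]


if __name__ == "__main__":
    quadword = bytes(range(16))
    stored = parity_bits(quadword)
    corrupted = bytearray(quadword)
    corrupted[5] ^= 0x08            # flip a single bit in byte 5
    print(failing_bytes(bytes(corrupted), stored))
```

Per-byte parity detects any single-bit flip within a byte but cannot correct it, which is why the recovery described above relies on reloading the faulty part from a good copy.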

Level-2 cache (L2)

o Overview
The L2 caches (Figure 9) are private caches, each associated with a dedicated processor chip. One chip holds four cache macros of 64 KB each. The dataflow per chip is 8 bytes wide. As with the BSN chip, there is also a switching function. The buses of four PUs are multiplexed to obtain access to two BSN buses. This structure keeps the number of chip I/Os on PU, L2, and BSN chips balanced.

Figure 9

The L2 chip features are as follows:

  • 16.4-mm chip size, CMOS 5X, 17.9 million transistors.
  • 5.9-ns cycle time.
  • 928 chip I/Os used.
  • Four 64KB private caches, eight-way set-associative.
  • Three cycles of access time.
  • Line-fetch operations in parallel with bus snooping.
  • Pipelining of subsequent line fetches.
  • Caches with ECC, directories with parity and duplicated.
  • Fencing against defective PUs.
o L2 cache access
The L2 cache chip is optimized for fast access time. To achieve this, a separate bidirectional address bus in addition to the bidirectional data bus between the PU and the L2 chip is provided. The address bus carries the absolute address for each L1 reference in order to maintain L1/L2 cache consistency. This address is used by the L2 to address port 1 of the directory array (also called tag RAM). This array has a second address port to support bus snooping via the BSN bus. Using a two-port array avoids an arbitration cycle between the PU and BSN bus addresses. When the cache array is not busy, directory access and first-cache access are done in the same cycle. Fast address-compare macros at the directory outputs create a late column-select signal. The data of the selected column are latched in a data-out register and sent to the PU interface register via the error-correction logic (ECC). This results in a three-cycle access from address-in to data-out.

o Pipelining of data and addresses
The L2 cache uses a store-in scheme. Store-through from the L1 cache is done in units of 16 bytes, disregarding the fact that the PU can store only 1-8 bytes per cycle. The L1 cache merges new store data bytes with old data bytes, thus sending 16 bytes on quadword boundaries. This avoids a time-consuming prefetch, byte merge, and ECC generation in the L2 cache.

Store-throughs have lower priority than line fetches in the L2 cache. Store datashots prior to a line fetch are buffered in an L2 store buffer (see Figure 8), which is two entries deep. The buffer content is transferred into the cache when a line-fetch operation results in an "L2 miss." Thus, all store operations are completed before an LRU cast-out can occur.

A line fetch with an L2 hit requires eight array-read cycles and needs the bidirectional data bus for eight cycles, plus an extra "quiesce" cycle prior to bus direction change. The term quiesce means that the last pattern is repeated, so the reflections from the far end of the bus are clamped in the driver circuit, which is still in low-impedance state. During these nine cycles the bus is not available for store-through operations by the PU. Therefore, the PU has implemented one L1 store buffer per L2 pair. These buffers are four entries deep. The address bus is not busy during the line-fetch operation, so subsequent fetch, store, or L1 miss addresses can be sent. In the case of store-through operations, associated address/column information is buffered in the L2 address buffer. The interface ensures that the PU store buffer and the L2 address buffer fill and empty in synchronism.

Figure 10 shows a timing diagram for a sequence of five PU store operations a-e, where c and e encounter an L1 miss, and e additionally encounters an L2 miss. The normal pipeline of stores a and b is interrupted when the L1 miss condition for store c is signaled to the L2 chip. Datashots a and b are saved in the L2 store buffer. The cache write is suppressed, and eight cache-read cycles are performed to provide the data for store c. The PU repeats store c when the first fetched quadword arrives in the L1 receive register. Now the L1 references d and e can be processed. The L2 address buffer holds the corresponding cache row address and column for stores c and d, and the PU buffers the datashots c and d. This buffer is emptied at the end of the line-fetch operation in the PU, resulting in the L2 write cycles c and d. The pending stores a and b are executed after the end of the L2 line fetch. Store e leads to an L1 miss and an L2 miss. The L2 starts the bus operation by raising the L2-BSN request in parallel with the current line-fetch operation for store c.

Figure 10

o L1/L2 cache consistency
The data integrity across all caches in the system is guaranteed by the MESI protocol (M = modified, E = exclusive, S = shared, I = invalid). The M state is known only to the L2 cache. The L1 does not need to know this state because of the L1 store-through scheme. Data integrity between L1 and the private L2 must also be maintained.

L1 store-through requires that the L1 content must always be a subset of the L2 content. Generally, this requirement is fulfilled as long as the L2 cache associativity is equal to or greater than the L1 cache associativity and as long as the LRU replacement algorithms of both caches are perfect. A difficulty arises in systems with depopulated L2 caches (two-bus or one-bus systems). Here, two or even four L1 rows are mapped into one L2 row. Therefore, the L2 must enforce the invalidation of L1 cache lines when the L2 replaces a line with the same address. A fixed protocol is used for this purpose. The third and fourth cycles after the missing address e (see Figure 10) are reserved on the bidirectional address bus for the L2. When an L2 miss occurs, the L2 sends its LRU line address, indicated as l2, to the L1 cache. The L1 performs a directory search and an invalidation in the case of a match. Prior to the L2 LRU address, the L1 LRU address is on the address bus. This is used for a different purpose, the so-called "mini-broadcast."

Mini-broadcast
The L2 cache directory maintains a copy of the L1-valid bits, as well as the indicator bits used by the MESI protocol. An L1-valid bit in the L2 is always set after the missing address for a line fetch has appeared at the L2. An L1-valid bit is reset after the L1 LRU address has been sent over the address bus. Using the L1-valid bit gives some performance advantages in the MP system. Other processors may request a change of a cache-line state via the L2 and BSN chips. The command (line fetch due to fetch, line fetch due to store, line invalidate, I/O fetch or store, etc.) and the address are routed over the L2-BSN bus, and an L2 directory access is done at the second directory port. When an L2 hit occurs, the line state must be changed according to Table 2. An L2-to-L1 protocol ensures that the states of the L2 and L1 cache lines change concurrently. The PU is interrupted for five cycles to perform the appropriate action. This performance loss can be avoided when the L1-valid bit is off. The L2 does the state change alone in a "read-modify-write" action on the directory.
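The bookkeeping of the L1-valid bit can be sketched as follows. This is a minimal illustration, assuming one bit per L2 directory entry; the class and method names are hypothetical, and the line-state changes of Table 2 are not modeled.

```python
class L2DirectoryEntry:
    """Hypothetical model of the L1-valid bit kept in one L2 directory entry."""

    PU_INTERRUPT_CYCLES = 5  # from the text: PU is interrupted for five cycles

    def __init__(self):
        self.l1_valid = False

    def on_line_fetch_address(self):
        # The missing address for a line fetch has appeared at the L2.
        self.l1_valid = True

    def on_l1_lru_address(self):
        # The L1 has sent its LRU address (mini-broadcast): line left the L1.
        self.l1_valid = False

    def remote_state_change_cost(self):
        """PU cycles lost when another processor changes this line's state."""
        # With the bit off, the L2 updates its directory alone
        # (read-modify-write) and the PU is not interrupted.
        return self.PU_INTERRUPT_CYCLES if self.l1_valid else 0
```

In this sketch a remote request costs five PU cycles only while the line may still be in the L1.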

o Error detection and recovery

PU fencing
The clock-generation chip reacts differently under PU check-stop conditions than under L2, L2.5, or memory check-stop conditions. The system continues its operation even when a PU enters check-stop state. Thus, system availability is increased significantly. Since the L2 may contain modified memory data, it is mandatory that the L2-BSN interface is not disturbed. The PU fencing function is implemented simply by a negative active "PU check-stop" signal which is latched in the L2 chip. Incoming control signals are suppressed, and the address bus-in register switches over to hold its old value. The handshaking for broadcast operations between L2 and PU due to cross-cache interrogation is emulated by the L2 chip.

The clock-generation chip forces all I/O drivers of a defective PU to high-impedance state. The signal levels of all receivers go to minus because of the integrated pull-down resistors. The negative active PU check-stop signal utilizes this effect and stays active. Systems with depopulated or deactivated PU chips behave similarly; the open PU port at the L2 chip is fenced.

Soft errors in large arrays
Soft errors due to alpha particles in large arrays must not create check-stop situations. These errors are usually single-bit errors. ECC with single-error correct, double-error detect capability is implemented on the L2 data caches. The directories require an equivalent correction/detection mechanism. The implementation of ECC would have cost one extra cycle of L2 access; instead, a duplication of arrays was chosen and the following rules obeyed:

  • Parity check in one array only: Use the other one.
  • No parity check but array outputs unequal: Check-stop.
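The two rules above can be sketched as a small selection function. This is an illustration only: the names are hypothetical, odd parity is an assumption, and the case of a parity check in both arrays is not covered by the text and is treated here as a check-stop.

```python
class CheckStop(Exception):
    """Hypothetical stand-in for a check-stop condition."""


def parity_ok(word: int, parity_bit: int) -> bool:
    # Assumption: odd parity over the data bits plus the parity bit.
    return (bin(word).count("1") + parity_bit) % 2 == 1


def select_directory_entry(a, b):
    """Pick the directory output from two duplicated arrays.

    a and b are (word, parity_bit) pairs read from the two arrays.
    """
    a_ok, b_ok = parity_ok(*a), parity_ok(*b)
    if a_ok and b_ok:
        if a[0] != b[0]:
            # No parity check but array outputs unequal: check-stop.
            raise CheckStop("array outputs unequal")
        return a[0]
    if a_ok:
        return a[0]  # parity check in array b only: use array a
    if b_ok:
        return b[0]  # parity check in array a only: use array b
    # Assumption: both arrays failing is unrecoverable.
    raise CheckStop("parity check in both arrays")
```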
Synchronism check
The control parts of two L2 chips are identical and must always be in synchronism. The error-detection capability is improved when the state of the control logic for these two chips is compared every cycle. The implementation is simple. The main control signals from each of the four caches on a chip are exclusive-ORed, driven via latches to the second chip of a pair, and compared there with the equivalent XOR sum. A mismatch leads to check-stop. The XOR tree is large. For a better isolation down to the source of a defect, it is advisable to implement indicator latches for groups of signals. The latches can be inspected in the scan chain after a sync check has occurred. Mismatching latch states between the chips point to the error.
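The synchronism check and the suggested indicator latches can be sketched as follows. This is a simplified model with hypothetical names; signals are plain 0/1 values, and the grouping of signals is an assumption.

```python
from functools import reduce
from operator import xor


def xor_sum(signals):
    """Reduce a sequence of 0/1 control signals to a single bit."""
    return reduce(xor, signals, 0)


def sync_check(chip_a_signals, chip_b_signals):
    """Return True if the XOR sums of both chips agree in this cycle."""
    return xor_sum(chip_a_signals) == xor_sum(chip_b_signals)


def mismatching_groups(chip_a_groups, chip_b_groups):
    """Compare per-group indicator latches to narrow down a mismatch.

    Both arguments map a group name to that group's list of signals.
    """
    return [name for name in chip_a_groups
            if xor_sum(chip_a_groups[name]) != xor_sum(chip_b_groups[name])]
```

A single XOR bit reports a mismatch only when an odd number of signals differ, which is one more reason why per-group latches help isolate a defect.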

Bus-switching network (BSN)

o Overview
The BSN chip (Figure 11) is required to connect different physical data buses to one logical data bus. To support a high system throughput, the bus control logic, caches, and a memory address translation are provided. The BSN chip features include

  • 14.5-mm chip size, CMOS 5X, 16.6 million transistors.
  • 5.9-ns cycle time.
  • 758 chip I/Os used.
  • Four 64KB shared caches (L2.5) per chip.
  • Configuration table (CFT) for DSR/2 support.
  • High-speed switch for seven electrically decoupled ports, four L2s, two MBAs, and one STC.
  • Support for L2 and L2.5 cache coherence.
  • 8-byte-wide buses on BSN chips.
  • System measurement instrument (SMI) support.
  • Redrive logic for processor-to-processor communication.
  • Trace buffers for system debug.

Figure 11

o Switching part
The BSN is used as bus controller and bus-switching chip. Up to four L2 cache chips can be connected to the chip with point-to-point nets for buses and control signals. Up to two MBAs can be connected to the BSN with three-point nets (one MBA and two BSNs). Finally, one memory card with its STC is connected with point-to-point nets. This structure allows the chip as well as the buses to operate with system cycle speed; no speed-matching buffers are required. Additionally, the switching network must connect various internal units to the external buses:

  • Four L2.5 caches that are especially designed to hold shared data for all PUs.
  • Configuration table (CFT), used to implement the S/390 DSR/2 feature (dynamic storage reconfiguration).
  • Parts of the system measurement instrument (SMI) to support performance measurements, especially in the multiprocessing environment.
  • Cycle and command trace buffers for system bring-up/debug.
To achieve good system performance, the BSN must guarantee a high bus bandwidth combined with a low bus latency. For the required bandwidth, two-way bus and memory interleaving is supported. A two-cycle bus operation (low latency) from L2 cache to L2 cache is achieved by a special bus-switching logic on the BSN chip. This logic is placed between master- and slave-clocked latches (see Figure 12) and causes command and data to be flushed through the BSN chip, disregarding the chip-to-chip clock skew, which is about 20% of the system cycle time. This concept allows a two-cycle operation with only one clock skew added to the total path delay; the cycle time of the BSN buses and control signals can therefore keep pace with the system. The switching logic shown in Figure 12 is also used for the chip-to-chip control signals to implement a fast, low-latency protocol for data transfer and bus snooping. Two round-robin arbiters for up to twelve PUs and two MBAs reduce the number of wires in the MP system and lead to a good bus utilization. The BSN control logic must maintain and support data integrity and cache coherence for all connected chips.

Figure 12

Besides the two-way memory interleaving, the switching logic generates gaps on the bus to allow the L2s and MBAs to put new commands on the bus [4], therefore improving bus utilization. An example of a typical bus sequence is shown in Figure 13. In this case two interleaving line fetches are processed; one line fetch is served by the DRAM and the other by the L2.5 cache on the BSN chip.

Figure 13

The timing diagram shows three different ports (two L2-BSN buses and the BSN-STC bus): F indicates a line-fetch command which is driven/redriven for two cycles on the buses, and 0···77 indicates the requested line-fetch data of 128 bytes, transferred in eight datashots (0 to 7), each 16 bytes wide. For electrical reasons, the last shot of a command or data packet must always be driven for two cycles, which is why shot 7 appears twice. The ...Sel signals validate the command on the different ports; the XFer... signals validate the line-fetch data. The second STC Sel for the interleaving line fetch cancels the line-fetch request to the STC and is caused by the L2.5 match.

o L2.5 cache
The L2.5 cache reduces the latency for data not kept in the L1 and L2 caches of the requesting processor. It provides the data immediately if the requested cache line is already shared by other processors. Memory access latency is thereby avoided, saving 18 cycles in the cache-line "fetch" operation. This increases the overall processor performance by about 12% in the G3 systems, where this three-level cache hierarchy scheme was first implemented and tested.

L2.5 cache organization
The total L2.5 cache is organized to support the cache-line interleaving mechanism; it is therefore split into four identical independent parts called quadrants, as shown in Figure 11. An L2.5 cache bank comprises two quadrants. The switch selects the banks by address bit 21 and the quadrants by address bit 20. Each quadrant holds an eight-way set-associative cache of 64 KB with a line size of 128 bytes. The design effort could be kept at a minimum by using the same array macros as for the L2 chip in the G3 system.
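The geometry implied by these figures can be worked out in a few lines. This is a sketch with hypothetical names; the numbering of the four quadrants is an assumption, since the text states only which address bits select bank and quadrant.

```python
QUADRANT_BYTES = 64 * 1024   # 64 KB per quadrant
WAYS = 8                     # eight-way set-associative
LINE_BYTES = 128             # line size
QUADRANTS_PER_CHIP = 4

# Number of congruence classes (sets) in one quadrant.
SETS_PER_QUADRANT = QUADRANT_BYTES // (WAYS * LINE_BYTES)

# Total L2.5 capacity on one BSN chip.
L25_BYTES_PER_CHIP = QUADRANT_BYTES * QUADRANTS_PER_CHIP


def quadrant_index(bit21: int, bit20: int) -> int:
    """Combine the bank select (bit 21) and quadrant select (bit 20).

    Assumption: quadrants are numbered bank-major, 0 to 3.
    """
    return 2 * bit21 + bit20
```

Under these figures each quadrant has 64 sets, and one BSN chip carries 256 KB of L2.5 cache.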

L2.5 cache-line-state handling
The data integrity within the three-level cache hierarchy is controlled according to the MESI protocol. The M state is not included in the L2.5 cache, since data integrity for modified data is maintained by L2 caches. All other MESI states are handled by the L2.5 cache, and its behavior is optimized to hold currently shared data. Any request for a missing line in the processor's L1 and L2 caches is routed to the L2.5 cache and to the memory. If the requested data are kept in the L2.5 cache and shared among other processors (S state), the L2.5 cache will provide the data immediately. If the data are available but not yet shared (E state), the L2.5 waits to provide data until all L2 caches have searched their directories, with the result that none of them holds the line in the M state. For an active M state in an L2, this L2 provides the current data, the switch in the BSN routes the data to the requesting PU, and the L2.5 is updated in parallel. In both cases the line is marked as S if it is requested because of a "fetch"-type instruction in the processor. For a "store"-type instruction, it remains in the E state in the L2.5 cache. Data are provided by the memory only if the requested line is not kept in the L2.5 cache (I state). The switch provides the data to the requesting PU, and the L2.5 cache is updated in parallel. The same rules apply to the transfer into the S or E state in the L2.5 cache.

Table 4 summarizes cache-line-state handling by the L2.5 cache. It shows the state of a cache line before and after the execution of a line-fetch operation. The L2.5 cache receives different line-fetch commands, according to the type of instruction which caused the line-fetch operation (LF) in the processor; these are line fetch for data-fetch-type instructions (LF-DFETCH); line fetch for instruction fetch (LF-IFETCH); and line fetch for data-store-type instructions (LF-DSTORE). If a cache line is not kept in the L2.5 cache (see the I-state column) for an LF-DFETCH, the new state can be either E if no other processor owns the cache line or S if another processor has this line in the E or S state already. This is indicated in the table by E/S.

Table 4 L2.5 cache-line state before and after line-fetch operation.

LF operation type    Cache-line state before line fetch    New state
LF-DFETCH            I                                     E/S
LF-DFETCH            E                                     S
LF-DFETCH            S                                     S
LF-IFETCH            I                                     S
LF-IFETCH            E                                     S
LF-IFETCH            S                                     S
LF-DSTORE            I                                     S
LF-DSTORE            E                                     E
LF-DSTORE            S                                     E
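The entries of Table 4 can be expressed as a lookup, transcribed as they appear in the table. This is a sketch only; the function name and the `other_owner` flag are hypothetical, and cross-cache cast-outs from an L2 in the M state are not modeled.

```python
# (line-fetch command, state before) -> new state, as listed in Table 4.
NEW_STATE = {
    ("LF-DFETCH", "I"): "E/S",
    ("LF-DFETCH", "E"): "S",
    ("LF-DFETCH", "S"): "S",
    ("LF-IFETCH", "I"): "S",
    ("LF-IFETCH", "E"): "S",
    ("LF-IFETCH", "S"): "S",
    ("LF-DSTORE", "I"): "S",
    ("LF-DSTORE", "E"): "E",
    ("LF-DSTORE", "S"): "E",
}


def new_l25_state(command: str, before: str, other_owner: bool = False) -> str:
    """Return the L2.5 state after a line fetch.

    other_owner resolves the E/S entry: S if another processor already
    holds the line in the E or S state, otherwise E.
    """
    after = NEW_STATE[(command, before)]
    if after == "E/S":
        return "S" if other_owner else "E"
    return after
```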

Table 4 does not show the fact that for all line-fetch operations in the I or E state, the requested data could be kept in another L2 cache in the M state. In this case, this L2 would provide the data (cross-cache cast-out), which would be loaded in parallel into the L2.5 cache. The implemented scheme is optimized for processor performance. Actual measurements show its effectiveness, since more than 80% of the cache lines kept in the L2.5 are in the S state. This cache-line state provides the most benefit, since processor data latency is reduced by 18 cycles compared to an access to memory.

o Configuration table
The configuration table is the implementation of the dynamic storage reconfiguration (DSR/2) facility. It is used to map the absolute addresses coming from the PUs or MBAs into physical memory addresses. Depending on the memory size, up to 512 storage elements, which can be different in size, are supported. The elements can be reconfigured during normal system operation. The address mapping is done within one cycle, before the address is routed to the memory card. The PUs can write and read the configuration table with special controls and senses. A bypass of the table can be activated and deactivated, and some special memory functions always bypass the table (i.e., senses and controls to the memory card).
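The mapping can be illustrated with a small model. This is a sketch under stated assumptions: the `StorageElement` fields and the linear search are hypothetical, whereas the hardware performs the mapping within one cycle. Returning `None` stands in for an invalid entry, for which no command is forwarded to the memory card.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StorageElement:
    """Hypothetical table entry: one reconfigurable storage element."""
    abs_base: int    # first absolute address covered by the element
    size: int        # element size in bytes (elements may differ in size)
    phys_base: int   # physical address the element is mapped to
    valid: bool = True


def map_address(table: List[StorageElement], absolute: int,
                bypass: bool = False) -> Optional[int]:
    """Map an absolute address to a physical memory address."""
    if bypass:
        # With the bypass active the address passes through unchanged.
        return absolute
    for element in table:
        in_range = element.abs_base <= absolute < element.abs_base + element.size
        if element.valid and in_range:
            return element.phys_base + (absolute - element.abs_base)
    return None  # invalid entry: nothing is sent to the memory card
```

Reconfiguring storage then amounts to rewriting entries of `table` while the system keeps running.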

o System configurations
Different systems (G3 and G4) with their different PU/L2 chips, memory cards with different sizes and access times, and varying configurations (one-, two-, and four-bus systems) require considerable programmable logic inside the switch to achieve a good bus performance; for example, different DRAMs require different access times.

Additional logic is used for PU-to-PU communication signals and MBA-to-PU interrupt signals. To reduce the required I/O pin count for the high-end configuration, this "redrive logic" is spread over multiple chips. To provide the same functions in smaller systems, especially the one-bus system with only two BSN chips, this logic has to be programmable. At system start-up, the BSN chips are "personalized" for these configurations.

o Error detection and recovery
For a highly available and reliable system, error detection and error recovery are very important in avoiding data loss and ensuring data integrity.

Switching part
Because the BSN is stimulated by external signals, the first step for error detection is the checking of all chip inputs. Additionally, the internal states of the chip must be checked. The implemented functions are as follows:

  • The buses and the internal dataflow are parity-checked on byte boundaries and cause a system check-stop in the event of an error.
  • The control signal inputs of the chip, which are wired point-to-point, use the fact that two BSN chips per bus always work in parallel. So-called sync checks observe the correct function of these inputs, and, in the case of a mismatch, the system is stopped. The information concerning which other chip caused the error is stored and can be accessed by the service element.
  • The powering trees for the internal multiplexors are checked in order to avoid incorrect routing of data through the system.
  • Bus protocol functions check important control signals (check the checkers).
  • Addressing invalid entries in the configuration table suppresses command forwarding to the memory card. This is signaled to the PU or MBA.
  • The memory card can handle certain "accept errors" which are routed via the BSN to the requesting PU or MBA. Examples of accept errors are "address exceeds maximum range," "bad parity on data," and "illegal command sequence."
  • Especially for problem fixes during system bring-up, so-called "escape logic" is implemented to allow the detection and correction of protocol errors, e.g., time-outs.

L2.5 cache
The entire dataflow of the L2.5 cache [5] includes parity, which is checked for correctness in all operations. In addition, parity is included in the L2.5 cache directory, which keeps track of the lines included in the L2.5 cache and their state. The data in the cache include a double-bit error-detection and single-bit error-correction scheme (ECC) [6]. Single-bit errors are corrected before the data are transferred to the switch. A hardware "deconfiguration" scheme is implemented for both parity errors in the directory and double-bit errors in the data. In these cases, the failing quadrant is "deleted"; i.e., it no longer participates as a cache. All further accesses are automatically routed to the memory subsystem. No data loss occurs, since the L2.5 does not include the M state. Because of this mechanism, the inclusion of a cache in the BSN does not decrease system reliability and availability, but increases overall processor performance.

Configuration table
The configuration table comprises two arrays (per chip) which hold the same address-mapping tables. In the case of a single-bit error, the "good" array is used. The correct address is routed to the memory card, and the single-bit error is latched in the BSN. The error latch can be sensed by the PU, and the configuration table can then be rewritten.

Memory bus adapter (MBA)

o Overview
The MBA (Figure 14) is a high-speed, low-latency DMA controller that provides the connection between memory and I/O. The key features of the MBA are the following:

  • Bandwidth of 2 GB/s through two BSN adapters.
  • High-speed self-timed interface (STI) with cable lengths between 5 cm and 20 m.
  • CMOS 5X technology.
  • 15.5-mm chip size.
  • 770 chip I/Os used.
  • 3.6 million transistors.
  • 5.9-ns cycle time.

Figure 14

o Main functional units
There are two primary types of storage operations: Fetch operations (transfer data from memory to I/O), and store operations (transfer data from I/O to memory). Programming of the MBA is done with sense and control instructions issued by a PU: Control instructions set/modify registers, and sense instructions read registers. The main functional units on the MBA are described in the following subsections.

BSN adapter unit
There are two BSN adapter units on each MBA chip. The BSN adapter connects MBA-BSN bus 0/2 (MBA-BSN bus 1/3) to the speed-matching buffers, which contain the command and data for the storage operations. In addition, it provides an interface to the central sense/control unit. One BSN adapter has a bandwidth of 1 GB/s.

Speed-matching buffer (SMB)
For every BSN adapter there are three speed-matching buffers, to hold commands, store data, and fetch data. These buffers are necessary to adapt the word-wide STI macro interface to the quadword-wide BSN interface. The command buffer has room for four fetch and four store commands; the store and fetch data buffers each have room for four lines.

Switch
The switch connects the STI path logic to the speed-matching buffers. To realize a high bandwidth, a split transaction protocol is implemented, allowing the concurrent execution of two store data transfers, two fetch data transfers, and one command transfer.

STI path logic (SPL)
This logic has three purposes:

  1. It splits information packets received from the STI-REC macro into a command part and a data part, which are sent over the switch to the speed-matching buffers.
  2. It receives fetch data from the switch and builds an information packet (IP) as required by the STI-SND macro.
  3. It receives from the port sense/control logic the data to send sense/control commands down the STI link. Again, it builds the IPs as required by the STI-SND macro.

Central sense/control (CSC)
This unit provides the PU with access to the registers in the MBA. To balance sense/control loads on the BSN buses, there are two CSC units on each MBA chip, and each CSC is connected to the BSN 02 adapter and the BSN 13 adapter. STI ports 0, 1, and 2 are accessed via central sense/control BSN 02; STI ports 3, 4, and 5 are accessed via central sense/control BSN 13. Central sense/control BSN 02 connects to the ETR (external time reference). In addition, it contains all logic for the master TOD (time of day) and the facilities to synchronize several local TODs to the master TOD.

Port sense/control (PSC)
This logic provides data and commands for sense/control signals sent via the STI. It contains a set of registers to manage interrupt and busy conditions on a channel basis. Further, it permits programming the STI interface to run with a byte-transfer time of 3 ns or 4 ns.

STI macro
The STI link provides the connection via the fast internal bus adapter (FIB) to the channel [7] and to the intersystem channel (ISC) [8]. The system supports up to 256 channels and up to 32 intersystem channels. The macro consists of two parts, a receive and a send macro. The STI [9] is a byte-wide very high-speed data interface using differential drivers/receivers. It is a full-duplex bus with a raw data rate of 250/333 MB/s in each direction. In addition to the eight data bits and the parity bit, a clock is sent in each direction. With every clock edge, data are transmitted/received. Information is transmitted on the link in "information packets" (IP) consisting of header and data blocks. The link protocol causes overhead, which leads to an effective data rate of approximately 200 MB/s in each direction.
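The raw data rates follow directly from the byte-wide link and the programmable 3-ns and 4-ns byte-transfer times mentioned for the port sense/control logic. A short calculation, with a hypothetical function name and 1 MB taken as 10^6 bytes:

```python
def raw_rate_mb_per_s(byte_time_ns: float) -> float:
    """Raw rate of a byte-wide link that moves one byte per byte_time_ns."""
    bytes_per_second = 1e9 / byte_time_ns
    return bytes_per_second / 1e6  # assumption: 1 MB = 10**6 bytes


slow = raw_rate_mb_per_s(4.0)  # 4-ns byte time
fast = raw_rate_mb_per_s(3.0)  # 3-ns byte time
```

The two settings give 250 MB/s and about 333 MB/s; the effective rate of approximately 200 MB/s quoted above is lower because of the link-protocol overhead.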

There are nine different clock domains on the chip (six STI receive clocks, one STI send clock, one system clock, and one ETR clock). One of the goals of the design was to minimize latency, which is caused by crossing asynchronous clock boundaries. For example, if an IP arrives at the STI receive macro, this is signaled to the SPL. The SPL, which runs with the system clock, starts to read out an IP buffer of the STI if sufficient data have been received in the IP buffer. Since there are many combinations of system clock speeds and STI clock speeds, a set of programmable counters was implemented in order to determine the best read-out start time.

o Error detection and recovery
All registers in the data path and most state machines in the control flow are parity-checked. If a check occurs, this is signaled to the originator of the operation (e.g., the channel) and the corresponding operation is retried. If the error is of intermittent nature, the system continues to run; if it is a permanent error, the system tries to continue operation in a deconfigured mode. A special mechanism was designed to check for failures in the speed-matching buffer.

The principle is shown in Figure 15. In a horizontal data-checking mechanism, each doubleword of data (cmd) is protected by a corresponding parity byte. In a vertical data-checking approach, a block of data is protected by a longitudinal redundancy check (LRC) byte. This protection mechanism is typical for link protocols. Both approaches can be combined in a concurrent signature-checking approach. Every time cmd/data are written into the buffers, a new signature is generated from the data to be written and the previously written signature. This signature is saved as in the horizontal data-checking scheme. Every time cmd/data are read out of the buffers, a new signature is generated from the data read and the previously read signature. The newly calculated signature must be identical in every cycle with the signature stored in the array. The scheme described has a modest circuit overhead and very good error-detection capabilities. Note that the number of array bits is the same as in the horizontal checking approach.
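The concurrent signature-checking idea can be sketched as follows. The text does not specify the signature function, so this sketch assumes the simplest choice: the XOR of the doubleword's parity byte with the previous signature. All names are hypothetical.

```python
class SignatureError(Exception):
    """Raised when a recomputed signature differs from the stored one."""


def parity_byte(doubleword: bytes) -> int:
    """One parity bit per byte of an 8-byte doubleword (even parity assumed)."""
    if len(doubleword) != 8:
        raise ValueError("expected an 8-byte doubleword")
    result = 0
    for i, value in enumerate(doubleword):
        result |= (bin(value).count("1") & 1) << i
    return result


class SignedBuffer:
    """Buffer that stores one signature byte next to each doubleword."""

    def __init__(self):
        self.entries = []        # list of (doubleword, signature) pairs
        self._last_written = 0   # previously written signature

    def write(self, doubleword: bytes) -> None:
        # New signature from the data to be written and the previous signature.
        self._last_written ^= parity_byte(doubleword)
        self.entries.append((doubleword, self._last_written))

    def read_all(self):
        previous = 0  # previously read signature
        data = []
        for doubleword, stored in self.entries:
            if (parity_byte(doubleword) ^ previous) != stored:
                raise SignatureError("signature mismatch on read-out")
            data.append(doubleword)
            previous = stored
        return data
```

Because each stored byte depends on everything written before it, the check can flag not only a corrupted doubleword but also entries read out in the wrong order, while using the same number of array bits as a plain parity byte.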

Figure 15

o Verification
Compared to its predecessor, the MBA G3 increased the bandwidth by a factor of 10 (2 GB/s versus 200 MB/s). To achieve such a high bandwidth, a heavily queued chip had to be built; therefore, the number of concurrent operations increased from 3 to 30. Verification of a chip with such a large number of concurrent operations was a challenging task. Part of the switch logic was verified using formal verification [10].

Storage controller (STC) and memory subsystem

o Overview
The memory subsystem [11] is designed to serve as S/390 main and expanded storage for the G3 and G4 systems. Both levels are located on the same physical unit, a separate memory card. To reduce the bus traffic between PU and memory, certain memory-related operations are implemented on the memory card, as well as the basic store and fetch line functions. The S/390 storage key protection also resides on the memory card, with all the necessary logic. Special features have been implemented which increase the memory reliability and availability. Memory card characteristics (Figure 16) are the following:

  • Maximum card size 6 GB, based on 64Mb DRAM technology.
  • 4Mb/16Mb/64Mb Extended Data Output (EDO) DRAM support (50-ns and 60-ns RAS access).
  • Latency 17 cycles at 5.9 ns (50 ns) or 18 cycles at 6.25 ns (60 ns).
  • Busy time 25 cycles at 5.9 ns for one line.
  • Card technology mixed-grid array (MGA) 12 signal, 10 power layers (9 × 11 in.).
  • DRAM package correction ECC; two spare DRAMs per card.

Figure 16

Figure 17 is a photograph of the card. STC characteristics (Figure 18) are the following:

  • 12.7-mm chip size, CMOS 5X, approximately 1.3 million transistors.
  • 5.9-ns cycle time.
  • 748 chip I/Os used for signals.
  • Independent line buffers for store and fetch operations.
  • Line interleaving on one memory card.
  • Expanded storage support.
  • Fetch before store for fast LRU cast-outs.
  • Wraparound feature; first quadword of line served first.
  • Redundancy setting per DRAM module on the fly.
  • Active parallel redundancy for the key store.
  • Trace buffers for system debug.
  • Programmable DRAM and SRAM self-test performed by STC.

Figure 17 Figure 18

o Memory card structure

Two independent storage banks
Because of the relatively long DRAM access time (50 ns or 60 ns), two independent storage banks have been implemented. This permits utilization of the BSN-STC bus for command or data transfer of one storage bank while the other bank is active with RAM accesses (even/odd interleave on the BSN-STC bus). This maximizes the BSN-STC bus utilization by filling the "latency gap" on an individual bus. Each storage bank utilizes independent store and fetch buffers in order to decouple the activities on the BSN-STC bus from those on the buses to the DRAMs.

On-card array bus structure
To meet the bandwidth requirements, every STC physically interleaves four array buses (each array bus is 64 + 12 bits wide), and for one line transfer it selects an individual DRAM module twice. This sequence supplies eight doublewords per STC to complete one line. This structure exploits the CAS cycle time of 25 ns for 60-ns standard EDO RAMs (4 × 6.25 ns = 25 ns).

The electrical redrive challenge
On the 6GB card (maximum configuration), the two STCs must drive at their DRAM interface 48 PSIMMs, 19 DRAM modules per PSIMM, four data I/Os, 12 addresses, and three controls per DRAM. Overall, 17366 DRAM I/Os and 600 SRAM I/Os must be attached to the support logic. Of the 748 signal I/Os per storage controller, 520 were available for the DRAM interface; they were arranged in the following matrix:

  • 304 data I/Os for four data buses, 76 bits per data bus (both storage banks share the data bus).
  • 88 address I/Os for four data buses and two storage banks.
  • 48 address I/Os for every SIMM for the most critical address.
  • 48 CAS I/Os for every SIMM.
  • 24 RAS I/Os for every SIMM pair.
  • 8 write-enable I/Os for four data buses and two storage banks.
Because of the pin-count limitation, one STC supports the SRAM addresses while the other supports data and enabling signals, logically multiplexed over the same physical pins. The rest of the STC I/Os are used for the BSN-STC bus, STC-to-STC communication, spare DRAMs, clock signals, and test support. Overall, this array wiring structure supports cycle times at the DRAM interface down to 5.1 ns if DRAMs with 50-ns RAS access time are used.
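The I/O counts quoted above can be checked with a short calculation. The variable names are hypothetical, and the inclusion of the two spare DRAMs in the total is an assumption that happens to reproduce the stated figure of 17366.

```python
PSIMMS = 48
DRAMS_PER_PSIMM = 19
SPARE_DRAMS = 2                  # assumption: the two spares per card are counted
IOS_PER_DRAM = 4 + 12 + 3        # data I/Os + addresses + controls

total_dram_ios = (PSIMMS * DRAMS_PER_PSIMM + SPARE_DRAMS) * IOS_PER_DRAM

# The 520 STC I/Os available for the DRAM interface, as listed above.
stc_dram_interface = {
    "data": 4 * 76,              # four data buses, 76 bits each
    "bus addresses": 88,
    "critical SIMM address": 48,
    "CAS": 48,
    "RAS": 24,
    "write enable": 8,
}
stc_dram_ios = sum(stc_dram_interface.values())

# 19 DRAM modules with four data I/Os each match the 76-bit (64 + 12) bus.
bits_per_psimm = DRAMS_PER_PSIMM * 4
```

The ratio of roughly 17000 DRAM I/Os to 2 × 520 controller I/Os illustrates why the redrive structure is described as a challenge.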

Memory scalability and granularity
One of the major requirements in designing the memory subsystem was to support the modular concept, as well as a very wide range of memory sizes for different product offerings with just one design point (e.g., one raw card and one storage controller part number). From the point of view of memory card design, the overall G3/G4 system memory size ranges from 64 MB to 24 GB (factor 384), although not all options are actually used in the system. This has been achieved with the following steps:

  • Factor 4 by number of buses in the system (one-, two-, four-bus system).
  • Factor 2 by using only one storage bank on the card (even/odd interleaving disabled).
  • Factor 3 by populating PSIMMs for one, two, or three RAS banks (address depth).
  • Factor 16 by DRAM technology (4 Mb, 16 Mb, 64 Mb).
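The four factors multiply to the overall range stated above, as this short check shows (names are illustrative):

```python
from math import prod

FACTORS = {
    "buses (1, 2, 4)": 4,
    "storage banks (1 or 2)": 2,
    "RAS banks (1, 2, 3)": 3,
    "DRAM technology (4 Mb to 64 Mb)": 16,
}

total_factor = prod(FACTORS.values())

MIN_SYSTEM_MB = 64
max_system_mb = MIN_SYSTEM_MB * total_factor
max_system_gb = max_system_mb // 1024
```

This reproduces the factor of 384 and the span from 64 MB to 24 GB.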
o Functions complying with S/390 architecture
The following S/390 architecture storage-related functions have been implemented on the memory card.

Key store
The S/390 architecture requires a storage-protection mechanism which ensures that only those elements of the system which have a correct access key can obtain access to storage locations. The access key protects storage in increments of 4KB pages.

The storage for this key and the related fetch, reference, and change information has been implemented as a decentralized fast SRAM array ("key store") residing on the memory card. The STC compares the access key of a requestor with the key stored on the card and permits or denies the alteration of the storage location. The time required for the key control is included in the latency numbers mentioned earlier. If permission is denied, the STC communicates with the other STC to suppress the storage of data.
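The key comparison can be sketched as follows. The text does not spell out the rules, so this sketch follows the general S/390 key-controlled protection scheme in simplified form: an access key of zero or a matching key permits the access, and fetches are also permitted when the fetch-protection bit is off. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StorageKey:
    """Key-store entry for one 4KB page (simplified)."""
    acc: int                  # access-control bits
    fetch_protect: bool       # fetch-protection bit
    referenced: bool = False  # reference bit
    changed: bool = False     # change bit


def access_permitted(key: StorageKey, access_key: int, is_store: bool) -> bool:
    keys_match = access_key == 0 or access_key == key.acc
    if is_store:
        return keys_match
    return keys_match or not key.fetch_protect


def access(key: StorageKey, access_key: int, is_store: bool) -> bool:
    """Check the access and update reference/change bits if permitted."""
    permitted = access_permitted(key, access_key, is_store)
    if permitted:
        key.referenced = True
        if is_store:
            key.changed = True
    return permitted
```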

Fast LRU cast-out
To support fast LRU cast-out, the STC contains separate fetch and store buffers for each storage bank. The command is sent together with the store data to the STC. While the STC initiates the fetch access to the DRAMs, the store data are deposited in the store buffer. After the fetch data are delivered on the BSN-STC bus, the STC begins a second access cycle to store the data (see Figure 18). During this time, the other storage bank can concurrently receive and execute other commands.

Expanded storage support
A similar mechanism, called "move page fetch/move page store," is used to support the exchange of lines between the S/390 main storage and expanded storage. The lines are fetched from the source address into the fetch buffer and subsequently stored through the store buffer to the destination address.

Data handling within one line
Certain operations require that individual bytes or sequences of bytes be changed within one line. This is also done in the storage controller. It examines the byte address and the access key, initiates the prefetch of the line, merges the data, and stores the changed line back to the memory. This mechanism allows the alteration of 1 to 127 bytes within one 128-byte line. Examples of this operation are partial stores from the I/O and the conditional marking of individual lines.
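The merge step of such a partial store amounts to a read-modify-write on one line. The following minimal sketch shows only the merge; the function name is hypothetical, and the key check and the DRAM accesses are left out.

```python
LINE_SIZE = 128  # bytes per line


def merge_partial_store(line, offset, data):
    """Return a copy of `line` with `data` written at byte `offset`."""
    if len(line) != LINE_SIZE:
        raise ValueError("a line must be exactly 128 bytes")
    if not 1 <= len(data) <= LINE_SIZE - 1:
        raise ValueError("a partial store changes 1 to 127 bytes")
    if offset < 0 or offset + len(data) > LINE_SIZE:
        raise ValueError("data must stay within one line")
    merged = bytearray(line)                 # prefetched line
    merged[offset:offset + len(data)] = data  # merge the new bytes
    return bytes(merged)                     # line to be stored back
```

For example, `merge_partial_store(bytes(128), 4, b"\xaa\xbb")` changes bytes 4 and 5 and leaves the other 126 bytes as fetched.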

o Error detection and recovery
Because of the high number of RAM modules, the memory card contains several error-detection/correction schemes to provide high reliability and availability.

DRAM ECC
A (76, 64) ECC scheme has been implemented for the DRAMs which corrects all errors within one DRAM module, including completely defective DRAM modules (four out of four DRAM I/Os defective). This gives the capability of DRAM chip-kill correction on the fly, without impact to the running application. The ECC scheme also detects a very high percentage of failures of two or three DRAM modules within one ECC word. In addition to the correction of defective data, the DRAM address has been included in the check-bit generation in order to detect misaddressing. Errors detected by one STC are communicated to the other STC to ensure synchronous data delivery.

SAP-controlled error scrubbing
In a background process, the system assist processor periodically monitors the DRAMs by scanning the complete address space. If a line has correctable errors, it is corrected by the STC and restored into the memory. Depending on the nature of the defect, this eliminates or "scrubs" soft errors in memory areas which are seldom used.

DRAM sparing on the fly
During the scrubbing process a hardware error map is written which contains error counters, defective PSIMM locations, defective DRAM modules on a PSIMM, and the defective I/Os, as well as the nature of the failure (correctable/uncorrectable). The error map is also used in card manufacturing.

If defined thresholds are exceeded, the redundant module is activated, replacing the most defective DRAM on the card. This is done dynamically; i.e., the data that are still correctable are corrected and copied to the redundant module. After the copy is complete, the defective DRAM is shut off. The redundant module on each half of the card can be assigned to any DRAM location. This mechanism ensures an uninterrupted customer operation.
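The interplay of scrubbing, error map, and sparing can be sketched as follows. This is a simplified model under stated assumptions: the threshold value is invented, the error map is reduced to one counter per DRAM, one spare exists per card half, and the names are hypothetical.

```python
from collections import Counter

SPARING_THRESHOLD = 2  # hypothetical value; the text gives no numbers


class CardHalf:
    """Toy model of one card half with a single redundant DRAM module."""

    def __init__(self):
        self.error_map = Counter()  # DRAM location -> correctable errors
        self.spared_dram = None     # location replaced by the spare

    def scrub_pass(self, scan_results):
        """scan_results: iterable of (address, failing_dram or None)."""
        for _address, failing_dram in scan_results:
            if failing_dram is not None:
                self.error_map[failing_dram] += 1
        return self._spare_if_needed()

    def _spare_if_needed(self):
        if self.spared_dram is not None or not self.error_map:
            return None
        worst, count = self.error_map.most_common(1)[0]
        if count <= SPARING_THRESHOLD:
            return None
        # Corrected data would be copied to the spare here, after
        # which the defective DRAM is shut off.
        self.spared_dram = worst
        return worst
```

The model activates the spare for the DRAM with the highest error count only once the count exceeds the threshold, and it never reassigns a spare that is already in use.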

Key-store redundancy
To provide high availability, the key store uses redundancy; i.e., each key-store entry is stored twice in two separate SRAM modules. In the case of a failure, the entry with the correct parity is used. Either of the two key stores can be disabled.
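The selection between the two copies can be sketched in a few lines. The parity polarity (even parity here), the entry width, and the names are assumptions for illustration; a disabled key store is modeled as `None`.

```python
def parity_bit(value):
    # Even parity (assumed): the bit makes the total number of ones even.
    return bin(value).count("1") & 1


def make_entry(value):
    return (value, parity_bit(value))


def entry_ok(entry):
    value, parity = entry
    return parity_bit(value) == parity


def read_key(copy_a, copy_b):
    """Return the key from the first enabled copy with good parity."""
    for entry in (copy_a, copy_b):
        if entry is not None and entry_ok(entry):
            return entry[0]
    raise RuntimeError("no key-store copy with correct parity")
```

Passing `None` for one copy models the case in which one of the two key stores has been disabled.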

o Integrated test support
Traditionally, memory cards are tested on separate, stand-alone card testers in manufacturing or in a laboratory environment. With the progress of CMOS technology, it became difficult and very expensive to provide equipment which keeps pace with the cycle-time requirements. Therefore, a freely programmable test processor has been implemented in the storage controller which can generate practically any test sequence.

Internal test mode
In this mode the test processor stimulates the BSN-STC bus with the same commands or command sequences as those issued by the BSN. It analyzes the response of the memory card and monitors protocol and data correctness. The card "tests itself."

External test mode
In the external mode, the test processor sends the stimuli to the memory card connector. Thus, a memory card without DRAMs and SRAMs, with only two STCs, can stimulate a normal memory card via the card connector. The card "tests another card."

As additional equipment, only external power supplies, a PC, and a clock generator card are required. Compared to commercially available stand-alone test equipment, this concept results in very low cost for test equipment, both in the laboratory and in manufacturing. The test processor program can be loaded in the G3/G4 system during power-on from the service element. The major advantage of the integrated test processor is that the technology of the "tester" migrates with the technology of the device to be tested.

Clock-generation chip (CGC)

o Overview
The clock-generation chip (Figure 19) provides the necessary clock pulses and run control signals for all S/390-related hardware building blocks of the G3/G4 processor subsystem. The key features of the CGC are the following:

  • CMOS 5X technology.
  • 10.0-mm chip size, 0.5 million transistors.
  • 414 chip I/Os used.
  • 44-mm MLCI single-chip module.
  • 5.9-ns cycle time.
  • <0.4-ns on-chip clock skew, <1.2-ns chip-to-chip clock skew.
  • Two supply voltages.
    • 2.5 V for connections to other CMOS 5X chips.
    • 3.6 V for connections to the ETR and the service element.
  • Clock and control-signal generation.
    • Power-on reset recognition.
    • Five-wire interface to service element.
    • Start/stop control of entire G3/G4 processor subsystem (including check recognition and handling).
    • Serial link to all PU chips.
    • External time reference (ETR) support.
    • Programming of an external PLL.
    • Self-test control for all connected chips.
Figure 19

o Power-on reset
After recognition of an external power-on reset signal or an equivalent command from the service element, the CGC controls the hardware initialization of all chips, consisting of the following phases:

  • Detection of all chips connected to the CGC.
  • Shift in zeros through the serial SRL network (initial self-test data).
  • Test all embedded arrays via ABIST (array built-in self-test) and subsequent initialization of embedded arrays (storing zeros with good parity and directory array valid bits OFF).
  • Self-test of the chip internal logic via LBIST (logic built-in self-test).
  • Set up the S/390 processor(s) for transfer of bootstrapping code by the service element.
Any chip that does not pass the ABIST or LBIST remains disabled and is no longer available for system operation. However, the system may operate in a degraded mode until the defective component has been replaced.
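The initialization phases and the handling of failing chips can be sketched as a simple filter over the detected chips. The function, the chip naming scheme, and the two self-test callbacks are hypothetical stand-ins; the sketch models only which chips remain available, not the scan or array operations themselves.

```python
def power_on_reset(chips, abist_pass, lbist_pass):
    """Return (available, disabled, bootstrap_pus) after initialization.

    chips:      names of the chips detected by the CGC
    abist_pass: callable(name) -> bool, array self-test result
    lbist_pass: callable(name) -> bool, logic self-test result
    """
    available = list(chips)  # phase 1: detection
    # Phase 2 (shifting zeros through the SRL network) is not modeled.
    # Phase 3: ABIST; arrays of passing chips would be initialized here.
    available = [c for c in available if abist_pass(c)]
    # Phase 4: LBIST on the chips that are still enabled.
    available = [c for c in available if lbist_pass(c)]
    disabled = [c for c in chips if c not in available]
    # Phase 5: the remaining PUs are set up for bootstrap code transfer.
    bootstrap_pus = [c for c in available if c.startswith("PU")]
    return available, disabled, bootstrap_pus
```

A non-empty `disabled` list with at least one entry in `bootstrap_pus` corresponds to operation in degraded mode.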

o Attachment to the service element
The clock chip connects to the service element subsystem via the five-wire interface, which is the physical connection between the service element and the CGC and consists of the following lines:

  • Shift gate instructs the CGC to shift 1-bit information on the data-in/data-out lines.
  • Address/data declares the information transported on the data lines as either address or data, depending on the polarity.
  • Set pulse is used to strobe data and for commands.
  • Data-in is the input line for serial bit transport.
  • Data-out is the output line for serial bit transport.
The CGC implements an address register holding the address shifted in by the service element and a decoder to address an individual target based on the contents of the address register. Each chip connected to the CGC has a unique address, making it an individual target. Some addresses are reserved for facilities on the CGC itself, e.g., for the "status register," which provides status information about the entire system to the service element. Another example of a CGC address is the command to initiate an initial microprogram load (IML).

During address mode, a 9-bit address consisting of eight address bits and one parity bit is shifted into the address register, which selects the target for the following set pulse or shift gate in data mode. While the shift gate in data mode propagates a single bit into an SRL chain, the set pulse is used to assist the array access on the chips and to control the CGC. The five-wire interface supports service functions such as

  • Serial R/W access of all SRL chains and embedded arrays on building blocks connected.
  • Serial R/W access of SRL chains on the CGC itself.
  • Single cycle.
  • Start/stop of the G3/G4 processor.
In addition to the five-wire interface, the CGC has the capability to alert the service element by raising the interrupt line to indicate an asynchronous event (e.g., a check condition). This mechanism eliminates the need for polling for certain conditions and increases the overall performance.
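The address mode of the five-wire interface can be illustrated by a sketch that serializes an 8-bit address with its parity bit and checks it on the receiving side. The bit order (most significant bit first), the parity polarity (odd), and the position of the parity bit (last) are assumptions, and all names are hypothetical.

```python
def address_frame(address):
    """Return the 9 bits shifted in during address mode."""
    if not 0 <= address <= 0xFF:
        raise ValueError("address must fit in eight bits")
    bits = [(address >> i) & 1 for i in range(7, -1, -1)]
    parity = 1 - (sum(bits) & 1)  # odd parity (assumed)
    return bits + [parity]


class AddressRegister:
    """Toy 9-bit shift register with a parity check on decode."""

    def __init__(self):
        self.bits = []

    def shift_in(self, bit):
        # One bit per shift gate; only the last nine bits are kept.
        self.bits = (self.bits + [bit])[-9:]

    def decode(self):
        if len(self.bits) != 9 or sum(self.bits) % 2 != 1:
            raise ValueError("address parity check")
        value = 0
        for bit in self.bits[:8]:
            value = (value << 1) | bit
        return value
```

In this model the decoded value would select the target chip or CGC facility for the subsequent set pulse or shift gate in data mode.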

o Clock and control-signal generation
The CGC provides the clocks and control signals as well as the address decoding for the service element subsystem for the following maximum configuration:

  • Twelve PU chips.
  • Twelve L2 chips.
  • Eight BSN chips.
  • Eight STC chips (four memory cards).
  • Two MBA chips.
  • One ETR chip.
  • Two crypto chips.
The control signals of the CGC are optimized for a high-end system implemented entirely in chips. The clock chip supports any combination of chips. The service element must validate each configuration and disable particular chips if necessary. The valid configurations for a G3 and a G4 processor subsystem differ from one another. All chips except the ETR chip receive clock signals that are synchronous with the system cycle of 5.9 ns. Each clock chip generates its 5.9-ns clock signal, locally derived from a 23.6-ns external reference clock, via an on-chip PLL. The ETR and those parts of the MBA chips that communicate with the ETR chip get a separate clock with a 27-ns cycle. The CGC has the capability to start and stop each PU chip on an individual basis. The CGC starts the other chips if at least one PU chip is active. All start/stop actions execute synchronously, which means that the clocks for all chips are enabled or disabled within the same cycle. There is also a clock domain available that never stops once it has been started after power-on. This "continuous-running" clock domain is used on the PU and MBA chips for the timer facilities and on the STC chips to control the refresh of the DRAMs.

o On-product clock generation (OPCG)
In order to achieve this function, the CGC distributes a raw oscillator signal to all chips with an individual line for each chip. Each chip including the CGC itself receives this oscillator signal and multiplies the frequency of this signal by four by means of an internal PLL. The output of the PLL feeds a clock-generation network (CGN). The CGN uses the PLL output and some control lines driven by the CGC and generates the master, slave, shift, and array clocks locally on each chip. The CGC drives the control lines as multidrop nets, and each chip synchronizes them locally and generates "Allow_x_int" signals.
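The frequency multiplication can be checked with a short calculation. It assumes that the oscillator signal distributed by the CGC is the 23.6-ns reference clock mentioned in the preceding subsection; the constant and function names are illustrative.

```python
REFERENCE_PERIOD_NS = 23.6  # external reference clock period
PLL_MULTIPLIER = 4          # frequency multiplication in the on-chip PLL


def pll_output_period_ns(reference_period_ns, multiplier):
    # Multiplying the frequency by N divides the period by N.
    return reference_period_ns / multiplier


cycle_ns = pll_output_period_ns(REFERENCE_PERIOD_NS, PLL_MULTIPLIER)
frequency_mhz = 1000.0 / cycle_ns
```

Under this assumption the PLL output period equals the 5.9-ns system cycle, which corresponds to a clock frequency of roughly 169.5 MHz.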

Figure 20 shows the logic of the CGN implemented on each chip. The CGN uses the standard book set; higher drive capability is achieved by paralleling the standard circuits. The numbers in the diagram show the maximum number of parallel circuits. The output load of each circuit is tightly controlled during the physical design process, resulting in a clock skew of less than 0.4 ns for any two latches on the same chip [12]. Some additional adders contribute to the clock skew between any two latches on different chips:

  • Different driver characteristics of the drivers of the CGC.
  • Tolerances in card/module wiring.
  • On-chip PLL phase error plus jitter.
Figure 20

All of these contributors add up to a total clock skew of less than 1.2 ns for any two latches on different chips. The technique of distributing just an oscillator signal and several control lines and using OPCG has several advantages compared to distributing all master, slave, shift, and array clocks from the CGC:

  • Only the reference clock signals must be length-adjusted on all packaging levels.
  • Clock-gating signals may be implemented as multidrop signals, reducing the pin count of the CGC.
o Error detection
Each chip can signal a check condition to the clock chip via an individual line. The action performed by the CGC depends on the source of the check. These external checks are divided into two groups: all PU chips, and all L2, STC, BSN, MBA, and crypto chips. If the check is raised by any PU chip, only this PU is stopped by turning off the appropriate control signals. The remaining PUs continue to run without interruption; the CGC informs them of this situation by sending a "malfunction alert" to all other PUs via the clock-PU serial link. Any check signaled to the clock chip by any other chip stops the entire system. In addition, the CGC detects internal malfunctions and alerts the service element accordingly. In all cases the service element must perform further investigation of the check.
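The two-group policy can be summarized in a short decision function. The representation of a check source as a (chip type, index) pair and the shape of the result are hypothetical; CGC-internal malfunctions and the service element interaction are not modeled.

```python
def handle_check(source, running_pus):
    """Decide the CGC action for an external check.

    source:      (chip_type, index), e.g. ("PU", 2) or ("BSN", 0)
    running_pus: indices of the PUs that are currently running
    """
    chip_type, index = source
    if chip_type == "PU":
        # Only the failing PU is stopped; all others receive an alert.
        remaining = [pu for pu in running_pus if pu != index]
        return {
            "system_stopped": False,
            "running_pus": remaining,
            "malfunction_alert_to": list(remaining),
        }
    # A check from any L2, STC, BSN, MBA, or crypto chip stops everything.
    return {
        "system_stopped": True,
        "running_pus": [],
        "malfunction_alert_to": [],
    }
```

For instance, `handle_check(("PU", 2), [0, 1, 2, 3])` leaves PUs 0, 1, and 3 running and addresses the malfunction alert to them.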

o Clock-PU serial link
The communication between the clock chip and the PUs is done by means of a synchronous serial interface and is bidirectional [13]. Several checking mechanisms have been introduced to improve reliability. Some of the conditions signaled via this interface are active PUs, malfunction alert, start pulse, soft stop state, and wait state.

The clock chip is the master of all communication. A communication package consists of a specific header, a PU field, a command field, and a checksum sent by the clock chip. All data besides the header are return-to-zero (RZ) coded, and one bit takes two cycles to transmit. The header is B'110'. This is the only case in which two consecutive ones are sent, and it informs all PUs of the beginning of a new frame. The header is the only source for synchronization. Two consecutive ones occurring anywhere else in the frame lead to a check. After sending the header, the CGC sends the 16-bit PU field. Each bit in the PU field selects a specific PU. Depending on the command field that follows, these bits take on a different meaning. The command field specifies the command to be executed after the checksum has been verified by the PU. There are commands that must be executed by each PU and commands that are valid for individual PUs. Finally, the CGC sends an 8-bit checksum to provide error detection on the interface. Each PU then sends its 8-bit response to the CGC, leading to the layout for a complete frame given in Table 5.

Table 5 Complete frame.
Bits  Cycles  Content
   3          Header
  16      32  PU field
  16      32  Command field
   8      16  Checksum (number of ones in PU and command field)
   8      16  Response bits PU0 to PU15
  --      96  Total cycles
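The part of the frame sent by the clock chip can be sketched as an encoder and a matching checker on the PU side. Several details are assumptions made for illustration: a one is RZ-coded as the cycle pair 1, 0 and a zero as 0, 0; fields are sent most significant bit first; the header occupies one cycle per bit. The response bits are not modeled, and all names are hypothetical.

```python
HEADER = [1, 1, 0]  # B'110', the only place with two consecutive ones


def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


def from_bits(bits):
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def rz_encode(bits):
    # Each bit takes two cycles; the second cycle always returns to zero.
    encoded = []
    for bit in bits:
        encoded += [bit, 0]
    return encoded


def checksum(pu_field, command):
    # Number of ones in the PU field and the command field.
    return bin(pu_field).count("1") + bin(command).count("1")


def build_frame(pu_field, command):
    payload = (to_bits(pu_field, 16) + to_bits(command, 16)
               + to_bits(checksum(pu_field, command), 8))
    return HEADER + rz_encode(payload)


def parse_frame(frame):
    """Return (pu_field, command) or raise ValueError on a check."""
    if frame[:3] != HEADER:
        raise ValueError("missing header")
    body = frame[3:]
    if len(body) != 80:
        raise ValueError("wrong frame length")
    if any(body[i + 1] for i in range(0, 80, 2)):
        raise ValueError("RZ violation")
    bits = body[0::2]
    pu_field = from_bits(bits[:16])
    command = from_bits(bits[16:32])
    if from_bits(bits[32:40]) != checksum(pu_field, command):
        raise ValueError("checksum mismatch")
    return pu_field, command
```

With this coding no two consecutive ones can occur after the header, so the header pattern remains unambiguous as the synchronization mark, and a single flipped data bit is caught by the checksum comparison.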

Summary

The IBM S/390 Parallel Enterprise Server Generation 3 is based on a well-balanced cache and modular system structure. The G3 processor chip set covers a wide performance range from a uniprocessor to a high-performance multiprocessing system. A three-level cache hierarchy and a high-speed processor interconnection scheme reduce the data latency for the processor significantly. High reliability and availability are guaranteed by the error-detection and recovery mechanisms implemented throughout the entire chip set. The design quality of the G3 chip set was outstanding: the first silicon was functional to the extent that all hardware verification programs, including the operating systems, could be tested. Only one metallization change (metal EC) was needed for each of the G3 chips to achieve full functionality. This was the key to keeping the time from first silicon power-on to general availability of the G3 system to only eight months.

Acknowledgments

The authors would like to thank all colleagues in the IBM development laboratories in Boeblingen, Poughkeepsie, and Endicott who contributed to the success of the S/390 Parallel Enterprise Server Generation 3.

*Trademark or registered trademark of International Business Machines Corporation.

References

Received January 13, 1997; accepted for publication July 15, 1997