Michael Gschwind



The Cell project at IBM Research 
  Project description 


The Cell Synergistic Processor Unit (SPU)

The SPU architecture was designed to

  • provide a large register file,
  • simplify code generation,
  • reduce the size and power consumption by unifying resources, and
  • simplify decode and dispatch.

These goals were achieved by architecting a novel SIMD-based architecture with 32-bit-wide instructions that encode a 3-operand instruction format. Designing a new instruction set architecture (ISA) allowed us to streamline the instruction side and to provide 7-bit register operand specifiers, so that every instruction can directly address 128 registers. A single, pervasive SIMD computation approach is used for both scalar and vector data: a unified 128-entry, 128-bit SIMD register file provides scalar, condition, and address operands for conditional operations, branches, and memory accesses.
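The arithmetic behind the encoding can be sketched as follows: three 7-bit register specifiers take 21 of the 32 instruction bits, leaving 11 bits for everything else. The sketch below assumes those 11 bits form a single opcode field and assumes a particular field order; neither detail is stated in the text, and the `encode` and `decode` helpers are hypothetical names used only for illustration.

```python
# Hypothetical field layout: [opcode:11][rb:7][ra:7][rt:7] = 32 bits.
# The 7-bit register fields come from the text; the 11-bit opcode width
# and the field order are assumptions made for this illustration.
REG_BITS = 7
REG_MASK = (1 << REG_BITS) - 1    # 0x7F, enough to name registers 0..127
OPCODE_BITS = 32 - 3 * REG_BITS   # bits left over after three register fields
OPCODE_MASK = (1 << OPCODE_BITS) - 1


def encode(opcode, rt, ra, rb):
    """Pack an opcode and three register numbers into one 32-bit word."""
    if not 0 <= opcode <= OPCODE_MASK:
        raise ValueError("opcode does not fit in the opcode field")
    for reg in (rt, ra, rb):
        if not 0 <= reg <= REG_MASK:
            raise ValueError("register number must be in 0..127")
    return (opcode << 3 * REG_BITS) | (rb << 2 * REG_BITS) | (ra << REG_BITS) | rt


def decode(word):
    """Split a 32-bit word back into (opcode, rt, ra, rb)."""
    return (
        (word >> 3 * REG_BITS) & OPCODE_MASK,
        word & REG_MASK,
        (word >> REG_BITS) & REG_MASK,
        (word >> 2 * REG_BITS) & REG_MASK,
    )
```

Because every register field is 7 bits wide, any operand position can name any of the 128 registers, which is what lets one register file serve scalar, condition, and address operands alike.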

Figure: Cell SPU block diagram

While the SPU ISA is a novel architecture, the operations selected for the SPU are closely aligned with the functionality of the Power VMX unit. This facilitates and simplifies code portability between the Power main processor and the SPU SIMD-based co-processors. However, the range of data types supported in the SPU has been reduced for most computation formats. While VMX supports a number of densely packed saturating integer data types, these data types lead to a loss of dynamic range which typically degrades computation results. The preferred computation approach is to widen integer data types for intermediate operations and perform them without saturation. Unpack and saturating pack operations allow memory bandwidth and memory footprint to be reduced while maintaining high data integrity.
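To illustrate why widening is preferred over saturating at every step, the sketch below computes `a + b - c` per lane in two ways. It is a scalar Python model, not SPU or VMX code: plain Python integers stand in for the wider lanes, and the function names are hypothetical.

```python
def sat_u8(value):
    """Clamp an integer to the unsigned 8-bit range 0..255."""
    return max(0, min(255, value))


def saturating_8bit(a, b, c):
    """Compute a + b - c with every intermediate clamped to 8 bits."""
    return [sat_u8(sat_u8(x + y) - z) for x, y, z in zip(a, b, c)]


def widened(a, b, c):
    """Unpack to wider lanes, compute without saturation, saturate once on pack."""
    wide = [x + y - z for x, y, z in zip(a, b, c)]  # stands in for wider lanes
    return [sat_u8(v) for v in wide]                # saturating pack back to 8 bits


a = [200, 10, 250]
b = [100, 20, 250]
c = [80, 5, 100]
print(saturating_8bit(a, b, c))  # first lane loses information at a + b
print(widened(a, b, c))          # first lane keeps the exact result
```

In the first lane the exact answer, 220, fits in 8 bits, but the saturating version clamps the intermediate sum and returns a different value. Saturating only in the final pack keeps results exact whenever the final value is representable.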

Floating-point data types automatically support a wide dynamic data range and saturation, so no additional data conditioning is required. To reduce area and power requirements, floating-point arithmetic is restricted to the most common and useful modes. As a result, denormalized numbers are automatically flushed to 0, both when they are presented as input and when an operation would generate a denormalized result. Also, only a single rounding mode is supported.
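The flush-to-zero behavior described above can be modeled in a few lines. This is a software sketch operating on the IEEE single-precision bit layout, not a description of the hardware: keeping the sign of a flushed value is an assumption, the helper names are hypothetical, and the sketch does not model the SPU's single rounding mode (Python's `struct` rounds to nearest when converting to single precision).

```python
import struct


def flush_denormal(x):
    """Convert x to single precision; return a zero if the result is denormal."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0 and fraction != 0:
        bits &= 0x80000000  # assumption: keep only the sign bit
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def ftz_multiply(a, b):
    """Multiply with denormal inputs and denormal results flushed to zero."""
    a, b = flush_denormal(a), flush_denormal(b)
    return flush_denormal(a * b)


print(ftz_multiply(1e-20, 1e-20))  # product is denormal in single precision
print(ftz_multiply(1e-40, 1e30))   # denormal input is flushed before use
```

Both calls yield zero: the first because the result would be denormal, the second because the denormal input is flushed before the multiplication takes place.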

Single-precision floating-point computation is geared for throughput of media and 3D graphics objects. In this vein, the decision to support only a subset of IEEE floating-point arithmetic and to sacrifice full IEEE compliance was driven by the target applications. Multiple rounding modes and IEEE-compliant exceptions are typically unimportant for these workloads, and are not supported. This design decision is based on the real-time nature of game workloads and other media applications: most often, saturation is mathematically the right solution. Also, occasional small display glitches caused by saturation in a display frame are tolerable. On the other hand, incomplete rendering of a display frame, missing objects, or tearing video due to long exception handling is objectionable.

Memory access is performed via a DMA-based interface using copy-in/copy-out semantics, and data transfers can be initiated by either the Power processor or an SPU. The DMA-based interface uses the Power page protection model, giving all processor structures a consistent interface to the system storage map despite the heterogeneous instruction set architectures on the chip. A high-performance on-chip bus connects the SPU and Power computing elements.
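Copy-in/copy-out semantics can be sketched with two byte buffers: data is copied from system memory into a buffer local to the SPU, processed there, and copied back. The sketch below is a toy model only. The function names, the synchronous transfers, and the `scale_in_place` example are hypothetical, and page protection is not modeled.

```python
def _check(buffer, start, size):
    """Reject transfers that fall outside a buffer."""
    if start < 0 or size < 0 or start + size > len(buffer):
        raise IndexError("transfer outside buffer bounds")


def dma_get(main_memory, effective_addr, local_store, local_addr, size):
    """Copy-in: copy `size` bytes from main memory into the local buffer."""
    _check(main_memory, effective_addr, size)
    _check(local_store, local_addr, size)
    local_store[local_addr:local_addr + size] = \
        main_memory[effective_addr:effective_addr + size]


def dma_put(local_store, local_addr, main_memory, effective_addr, size):
    """Copy-out: copy `size` bytes from the local buffer back to main memory."""
    _check(local_store, local_addr, size)
    _check(main_memory, effective_addr, size)
    main_memory[effective_addr:effective_addr + size] = \
        local_store[local_addr:local_addr + size]


def scale_in_place(main_memory, effective_addr, size, factor, local_store):
    """Copy a block in, scale it locally with saturation, and copy it out."""
    dma_get(main_memory, effective_addr, local_store, 0, size)
    for i in range(size):
        local_store[i] = min(255, local_store[i] * factor)  # compute on the local copy only
    dma_put(local_store, 0, main_memory, effective_addr, size)


memory = bytearray([1, 2, 3, 200, 5])
scale_in_place(memory, 1, 3, 2, bytearray(16))
print(list(memory))
```

The compute loop touches only the local copy; system memory changes only when the copy-out transfer runs.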

The SPU is an in-order dual-issue statically scheduled architecture. Two SIMD instructions can be issued per cycle: one compute instruction and one memory operation. The SPU branch architecture does not include dynamic branch prediction, but instead relies on compiler-generated branch prediction using "prepare-to-branch" instructions to redirect instruction prefetch to branch targets.
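The effect of static scheduling on an in-order dual-issue machine can be illustrated with a toy cycle counter. The model below assumes that two adjacent instructions issue together whenever one is a compute instruction and the other is a memory operation. It ignores data dependencies, latencies, branches, and any further pairing constraints the hardware may impose; the function name and the `"compute"`/`"memory"` labels are hypothetical.

```python
def count_issue_cycles(stream):
    """Count issue cycles for an in-order stream of 'compute'/'memory' labels."""
    for kind in stream:
        if kind not in ("compute", "memory"):
            raise ValueError("unknown instruction kind: %r" % (kind,))
    cycles = 0
    i = 0
    while i < len(stream):
        # Assumption: a pair issues together only if it has one of each kind.
        if i + 1 < len(stream) and {stream[i], stream[i + 1]} == {"compute", "memory"}:
            i += 2
        else:
            i += 1
        cycles += 1
    return cycles


unscheduled = ["compute", "memory", "compute", "compute", "memory", "memory"]
scheduled = ["compute", "memory"] * 3
print(count_issue_cycles(unscheduled), count_issue_cycles(scheduled))
```

Both streams contain the same mix of three compute and three memory instructions, but the interleaved order issues in fewer cycles under this model. Since the hardware issues in order, finding such an ordering is left to the compiler.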

 