The Cell Synergistic Processor Unit (SPU)
The SPU architecture was built with the following goals:
- provide a large register file,
- simplify code generation,
- reduce size and power consumption by unifying resources, and
- simplify decode and dispatch.
These goals were achieved by defining a novel SIMD-based
architecture with 32-bit-wide instructions that encode a 3-operand
instruction format. Designing a new instruction set
architecture (ISA) allowed us to streamline
the instruction side and to provide 7-bit register operand specifiers
that directly address 128 registers from all instructions, using a single pervasive SIMD
computation approach for both scalar and vector data. In this approach,
a unified 128-entry, 128-bit SIMD register file also provides the scalar,
condition, and address operands needed for conditional operations,
branches, and memory accesses.
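
The arithmetic behind the operand specifiers can be sketched in a few lines: three 7-bit register fields use 21 of the 32 instruction bits, which leaves 11 bits for an opcode. The field order below and the helper names `encode` and `decode` are assumptions made for illustration, not a statement of the actual SPU encoding.

```python
# Hypothetical layout of a 32-bit, 3-operand instruction word
# (most significant bits first): [ opcode:11 | rb:7 | ra:7 | rt:7 ]
REG_BITS = 7
REG_MASK = (1 << REG_BITS) - 1   # 0x7F, so specifiers address registers 0..127
OPCODE_BITS = 32 - 3 * REG_BITS  # 11 bits remain for the opcode
OPCODE_MASK = (1 << OPCODE_BITS) - 1


def encode(opcode, rt, ra, rb):
    """Pack an opcode and three register specifiers into one 32-bit word."""
    if not 0 <= opcode <= OPCODE_MASK:
        raise ValueError("opcode does not fit in 11 bits")
    for reg in (rt, ra, rb):
        if not 0 <= reg <= REG_MASK:
            raise ValueError("register specifier does not fit in 7 bits")
    return (opcode << 21) | (rb << 14) | (ra << 7) | rt


def decode(word):
    """Split a 32-bit word back into its four fields."""
    return {
        "opcode": (word >> 21) & OPCODE_MASK,
        "rb": (word >> 14) & REG_MASK,
        "ra": (word >> 7) & REG_MASK,
        "rt": word & REG_MASK,
    }
```

Because every field sits at a fixed position, decoding is a handful of shifts and masks, which is one way to read the "simplify decode and dispatch" goal.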

[Figure: Cell SPU Block Diagram]

While the SPU ISA is a novel architecture, the
operations selected for the SPU are closely aligned with the functionality
of the Power VMX unit. This facilitates and simplifies code portability
between the Power main processor and the SPU SIMD-based co-processors.
However, the range of data types supported in the SPU has been reduced
for most computation formats. VMX supports a number of densely
packed saturating integer data types, but these data types lead to a
loss of dynamic range, which typically degrades computation results.
The preferred approach is to widen integer data types
for intermediate operations and perform those operations without saturation.
Unpack and saturating pack operations allow memory bandwidth and
memory footprint to be reduced while maintaining high data integrity.
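
The difference between saturating at every step and saturating only when packing can be illustrated with a small sketch. Python integers stand in for widened 16-bit lanes, and all function names here are hypothetical; this models the idea only, not any SPU or VMX instruction.

```python
def unpack_u8(values):
    """Widen unsigned 8-bit elements; Python ints stand in for 16-bit lanes."""
    for v in values:
        if not 0 <= v <= 255:
            raise ValueError("element is not an unsigned 8-bit value")
    return list(values)


def saturating_pack_u8(values):
    """Narrow wide elements back to 8 bits, clamping to the range 0..255."""
    return [min(255, max(0, v)) for v in values]


def add_sub_widened(a, b, c):
    """Compute a + b - c with wide, unsaturated intermediates, then pack."""
    wide = [x + y - z for x, y, z in zip(unpack_u8(a), unpack_u8(b), unpack_u8(c))]
    return saturating_pack_u8(wide)


def add_sub_saturating_each_step(a, b, c):
    """Compute a + b - c in 8 bits, saturating after every operation."""
    summed = [min(255, x + y) for x, y in zip(a, b)]
    return [max(0, s - z) for s, z in zip(summed, c)]
```

For the inputs `a = [200]`, `b = [100]`, `c = [60]`, the widened version returns `[240]`, while saturating after each step clips the intermediate sum to 255 and returns `[195]`.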

Floating-point data types automatically support
a wide dynamic range and saturation, so no additional data
conditioning is required. To reduce area and power requirements,
floating-point arithmetic is restricted to the most common and useful
modes. As a result, denormalized numbers are automatically flushed
to 0, both when presented as input and when a denormalized result is
generated. Also, only a single rounding mode is supported.
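
Flush-to-zero behavior on inputs and results can be modeled as follows. This is a minimal sketch, assuming single-precision values: the helper names are hypothetical, the sign of the flushed zero is not modeled, and the conversion to single precision uses Python's rounding rather than the SPU's rounding mode.

```python
import struct


def flush_denormal_f32(x):
    """Round x to single precision and replace a denormal value with 0.0."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0 and mantissa != 0:
        # Denormal: zero exponent field with a nonzero mantissa.
        return 0.0
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def multiply_flush_to_zero(a, b):
    """Multiply with denormals flushed on both inputs and on the result."""
    a = flush_denormal_f32(a)
    b = flush_denormal_f32(b)
    return flush_denormal_f32(a * b)
```

For example, `1e-40` is below the smallest normalized single-precision value, so `flush_denormal_f32(1e-40)` returns `0.0`, and `multiply_flush_to_zero(1e-20, 1e-20)` also returns `0.0` because the product is denormal.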

Single-precision floating-point computation is
geared toward throughput on media and 3D graphics objects. In this
vein, the decision to support only a subset of IEEE floating-point
arithmetic and sacrifice full IEEE compliance was driven by the
target applications. Multiple rounding modes and IEEE-compliant
exceptions are typically unimportant for these workloads and are
not supported. This design decision is based on the real-time nature
of game workloads and other media applications: most often, saturation
is mathematically the right solution. Also, occasional small display
glitches caused by saturation in a display frame are tolerable. On
the other hand, incomplete rendering of a display frame, missing
objects, or tearing video due to long exception handling is objectionable.

Memory access is performed via a DMA-based interface
using copy-in/copy-out semantics, and data transfers can be initiated
by either the Power processor or an SPU. The DMA-based interface
uses the Power page protection model, giving all processor structures a consistent interface
to the system storage map despite the
heterogeneous instruction set architectures. A high-performance
on-chip bus connects the SPU and Power computing elements.
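
Copy-in/copy-out semantics can be pictured with a toy model in which byte arrays stand in for system memory and a private working buffer. The names `dma_get`, `dma_put`, and `process_block` are hypothetical, and page protection and asynchronous transfer are not modeled.

```python
def dma_get(main_memory, addr, size):
    """Copy-in: return a private copy of a region of main memory."""
    return bytearray(main_memory[addr:addr + size])


def dma_put(main_memory, addr, local_buffer):
    """Copy-out: write a private buffer back to main memory."""
    main_memory[addr:addr + len(local_buffer)] = local_buffer


def process_block(main_memory, addr, size):
    """Copy a block in, increment every byte locally, and copy it back out."""
    local = dma_get(main_memory, addr, size)
    for i in range(len(local)):
        local[i] = (local[i] + 1) & 0xFF
    # Main memory is unchanged until the explicit copy-out below.
    dma_put(main_memory, addr, local)
```

The point of the sketch is that computation only ever touches the private copy; changes become visible in main memory only after the explicit copy-out.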

The SPU is an in-order, dual-issue,
statically scheduled architecture. Two SIMD instructions can be issued per cycle: one
compute instruction and one memory operation. The SPU branch architecture
does not include dynamic branch prediction; instead, it relies on
compiler-generated branch prediction, using "prepare-to-branch" instructions
to redirect instruction prefetch to branch targets.
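
Why static scheduling matters for an in-order dual-issue machine can be seen in a simplified issue model. The sketch below assumes that two adjacent instructions issue together whenever one is a compute instruction and the other a memory operation; it ignores data dependencies, latencies, and any alignment rules, and the function name is hypothetical.

```python
def count_issue_cycles(stream):
    """Count issue cycles for an in-order stream of 'compute'/'memory' ops."""
    cycles = 0
    i = 0
    while i < len(stream):
        next_two = set(stream[i:i + 2])
        if next_two == {"compute", "memory"}:
            i += 2  # one compute and one memory op issue in the same cycle
        else:
            i += 1  # no valid pair at the head of the stream: single issue
        cycles += 1
    return cycles
```

Under these assumptions, an interleaved stream `["compute", "memory", "compute", "memory"]` issues in 2 cycles, while the same instructions ordered as `["compute", "compute", "memory", "memory"]` take 3, so the order chosen by the compiler directly determines throughput.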