Project Skip to main content
IBM Research Homepage 
 Research Home  >> The Compiler Project for Scalable Architectures at IBM Research


mimd compilation

Content
arrow Project Abstract
arrow Cell Architecture
arrow Parallelization
arrow Simdization
  Alignment Handling
  Code Examples
arrow Optimized CodeGen
arrow Performance Eval
arrow Presentations
arrow Publications
arrow Patents

simd compilation

Related Projects 
arrow IBM Cell Research
arrow BlueGene Research

Overview of Automatic Simdization

Targets: VMX (JS20, Cell PPE), SPE (Cell), BlueGene/L (Double Hummer)

The SIMD vectorization (a.k.a. simdization) is implemented in IBM's XL product compiler, and thus supports multiple programming languages (e.g., C, C++, and Fortran), and multiple target machines (e.g., SPU, VMX, and BG/L).

As shown below, successful simdization involves several tasks that closely interact with each other. First Single Instruction Multiple Data (SIMD) parallelism must be extracted from an application, which may involve extraction from a basic-block, a loop, or a combination of both. Once that parallelism is extracted, various additional SIMD hardware constraints must be satisfied, such as memory alignment requirements, physical vector length, and hardware instruction set.

Simdization ApproachSatisfy ConstraintsExtract SIMD ParallelismSimdization Algorithm

Our approach uses a virtual vector representation [ICS05] as an initial model to successfully extract SIMD parallelism from multiple sources in the code. Virtual registers have no alignment constraint and have arbitrary lengths. The virtual vectors are then progressively lowered in level to the physical SIMD registers of the target machine in successive de-virtualization steps.

Simdization Features

A summary of simdization features:

  • Loop-level simdization
    • Loops with induction/reduction/local variables, invariant computation, stride-one array accesses
    • Loops with small constant trip count
    • Loops with mixed data types
  • Basic-block level simdization
    • Unrolled loops
    • Computation on adjacent members of aggregates
    • Scalar packing
    • Quasi-isomorphic packing
    • Aggregate copy statement
  • Alignment handling
    • Data realignment on the fly (only when data are relatively misaligned)
    • Minimizing data realignment overhead
    • Versioning for relative alignment (for BG/L)
    • SIMD prologue/epilogue for misaligned stores
    • Scalar prologue/epilogue for misaligned stores to avoid false sharing and for memory protection
  • Other optimizations
    • Diagnostic information on whether loops are simdizable and reasons for not being simdized
    • Interprocedure alignment analysis
    • Idiom recognition to exploit special hardware instructions
    • Loop distribution to isolate simdizable statements
    • Array recovery for loops with pointer arithmetics
    • Runtime dependence analysis
  • And a lot more to come ....

Examples of code extract handled by our frameork are found here.

Simdization Framework: Extracting SIMD Parallelism

Embedded in the Toronto Portable Optimizer (TPO) component of the XL compiler, the simdization optimization leverages a rich set of high-level optimizations such as interprocedure analysis and loop transformations. Here we focus on describing the new components that are introduced specifically for simdization purposes. There are 6 new phases. The first 3 phases are mainly concerned with extracting SIMD parallelism from various code structures, e.g., basic blocks or loops.


Phase 1: Basic-Block Level Aggregation.
During this phase, we aggregate isomorphic statements that address consecutive memory locations into virtual vectors. This phase is particularly successful at extracting SIMD parallelism in unrolled loops and consecutive elements of structs, such as 3-dimensional coordinates found in multimedia applications. It also handles semi-isomorphic statements, for example statements that are nearly identical but for an add versus a subtract operation.

Phase 2: Short-Loop Aggregation. During this phase, we aggregate entire loops with small trip counts into virtual vectors. This approach is successful in eliminating short innermost loop and may thus enable further SIMD aggregation within the second innermost loop. Typical examples are FIR based algorithms.

Phase 3: Loop-Level Aggregation. During this phase, we aggregate instances of a statement in consecutive iterations of a loop. This phase is particularly successful at extracting SIMD parallelism in loops and recognizing special patterns such reductions and linear recurrences. The blocking factor is selected so that the length of each virtual vector is a multiple of the physical vector length: 16 bytes in the VMX/Cell/BGL architectures.

These first 3 phases proceed regardless of the alignments and lengths of the vectors, as if targeting an idealized SIMD machine. This simplifies the extraction phases greatly, especially in the presence of mixed data sizes. Each phase extracts SIMD parallelism, treating input scalar or virtual vectors (generated by a prior phase) in a similar fashion.

Simdization Framework: Satisfying SIMD Unit Constraints

The next 3 phases deal with specific constraints of SIMD units, such alignment, physical vector length, and SIMD instruction sets.


Phase 4: Alignment Devirtualization.
During this phase, we satisfy the alignment constraint which states that the data being logically operated on by a SIMD operation must reside in the same slot of their respective input and output SIMD registers. In other words, when adding a[i] and b[i+1] in SIMD fashion, we must ensure that both a[0] and b[1] values are in the same relative position in their respective register. Data realignment instructions are inserted only for relatively misaligned memory streams. We attempt to minimize the number of data realignment instructions, both in the presence of compile-time [PLDI04] and runtime [CGO05] misalignment.

More information on the impact of alignment and our approach to minimize the impact of misalignment is found here.

After the completion of this phase, all virtual vector accesses are to aligned data regardless of the compile time/runtime nature of their alignment. A virtual vector may still have a virtual length, i.e. a length that is too long with respect to the physical vector registers provided by the underlying architecture.

Phase 5: Length Devirtualization. During this phase, long vectors are chunked into physical length vector registers, or they may revert back to scalar statements. This phase also handles data size conversion [CGO05], e.g., generating twice as many instructions for 4-byte integer expressions that use results generated by 2-byte short integer expressions.

After this phase, all virtual vectors are aligned and have the physical vector length, namely they are compatible with the physical vector registers provided by the underlying SIMD architecture.

Phase 6: SIMD Code Generation. Up to this point, most SIMD instructions were generic operators such as "add," "mult," or primitive data reorganization instructions. During this phase, we generate SIMD instructions that are specific to the target SIMD units.

After this last phase, the simdization process is complete and the SIMD operations can be safely scheduled and register allocated.

 Privacy | Legal | Contact | IBM Home | Research Home | Project List | Research Sites | Page Contact