| Project |
| Research Home >> | The
Compiler Project for Scalable Architectures at IBM Research |
|||||||||||||||||||||||||||||||||||||||||||||||||
|
Targets: VMX (JS20, Cell PPE), SPE (Cell), BlueGene/L (Double Hummer) The SIMD vectorization (a.k.a. simdization) is implemented in IBM's XL product compiler, and thus supports multiple programming languages (e.g., C, C++, and Fortran), and multiple target machines (e.g., SPU, VMX, and BG/L). As shown below, successful simdization involves several tasks that closely interact with each other. First Single Instruction Multiple Data (SIMD) parallelism must be extracted from an application, which may involve extraction from a basic-block, a loop, or a combination of both. Once that parallelism is extracted, various additional SIMD hardware constraints must be satisfied, such as memory alignment requirements, physical vector length, and hardware instruction set.
Our approach uses a virtual vector representation [ICS05] as an initial model to successfully extract SIMD parallelism from multiple sources in the code. Virtual registers have no alignment constraint and have arbitrary lengths. The virtual vectors are then progressively lowered in level to the physical SIMD registers of the target machine in successive de-virtualization steps.
A summary of simdization features:
Examples of code extract handled by our frameork are found here.
Embedded in the Toronto Portable Optimizer (TPO) component of the XL compiler, the simdization optimization leverages a rich set of high-level optimizations such as interprocedure analysis and loop transformations. Here we focus on describing the new components that are introduced specifically for simdization purposes. There are 6 new phases. The first 3 phases are mainly concerned with extracting SIMD parallelism from various code structures, e.g., basic blocks or loops.
Phase 2: Short-Loop Aggregation. During this phase, we aggregate entire loops with small trip counts into virtual vectors. This approach is successful in eliminating short innermost loop and may thus enable further SIMD aggregation within the second innermost loop. Typical examples are FIR based algorithms. Phase 3: Loop-Level Aggregation. During this phase, we aggregate instances of a statement in consecutive iterations of a loop. This phase is particularly successful at extracting SIMD parallelism in loops and recognizing special patterns such reductions and linear recurrences. The blocking factor is selected so that the length of each virtual vector is a multiple of the physical vector length: 16 bytes in the VMX/Cell/BGL architectures. These first 3 phases proceed regardless of the alignments and lengths of the vectors, as if targeting an idealized SIMD machine. This simplifies the extraction phases greatly, especially in the presence of mixed data sizes. Each phase extracts SIMD parallelism, treating input scalar or virtual vectors (generated by a prior phase) in a similar fashion.
The next 3 phases deal with specific constraints of SIMD units, such alignment, physical vector length, and SIMD instruction sets.
More information on the impact of alignment and our approach to minimize the impact of misalignment is found here. After the completion of this phase, all virtual vector accesses are to aligned data regardless of the compile time/runtime nature of their alignment. A virtual vector may still have a virtual length, i.e. a length that is too long with respect to the physical vector registers provided by the underlying architecture. Phase 5: Length Devirtualization. During this phase, long vectors are chunked into physical length vector registers, or they may revert back to scalar statements. This phase also handles data size conversion [CGO05], e.g., generating twice as many instructions for 4-byte integer expressions that use results generated by 2-byte short integer expressions. After this phase, all virtual vectors are aligned and have the physical vector length, namely they are compatible with the physical vector registers provided by the underlying SIMD architecture. Phase 6: SIMD Code Generation. Up to this point, most SIMD instructions were generic operators such as "add," "mult," or primitive data reorganization instructions. During this phase, we generate SIMD instructions that are specific to the target SIMD units. After this last phase, the simdization process is complete and the SIMD operations can be safely scheduled and register allocated.
|
|||||||||||||||||||||||||||||||||||||||||||||||||
| Privacy | Legal | Contact | IBM Home | Research Home | Project List | Research Sites | Page Contact | ||||||||||||||||||||||||||||||||||||||||||||||||||