Project Skip to main content
IBM Research Homepage 
 Research Home  >> The Compiler Project for Scalable Architectures at IBM Research


mimd compilation

Content
arrow Project Abstract
arrow Cell Architecture
arrow Parallelization
arrow Simdization
  Alignment Handling
  Code Examples
arrow Optimized CodeGen
arrow Performance Eval
arrow Presentations
arrow Publications
arrow Patents

simd compilation


Related Projects 
arrow IBM Cell Research
arrow BlueGene Research

Alignment Issue, Prior Work

Unlike in scalar code, data alignment matters when accessing or operating on data that reside in vector registers. Consider the example below, which correspond to a naive simdization of a b[i+1] + c[i+3] in a loop iterating over a sufficient number of "i"s.

In Figure 1, as well as all other figures in this page, the scrolls correspond to the data in memory and the discrete shadowed boxes correspond to data in vector registers. To simplify the discussion below, we highlight the i=0 first loop iteration by highlighting the data involved in that iteration using yellow circles. The data of the first simdized iteration (containing four 4-byte integers within their respective 16-byte SIMD registers) are shown in dark gray.

Figure 1 below illustrates an erroneous simdization where no attention is paid to the alignment of the b[i+1] + c[i+3] data in memory. When performing a SIMD load on b[1] (iteration i=0), the unaligning load simply returns 4 integers starting from the truncated b[1] address, an address that is truncated to the nearest lower 16-byte boundary. In hardware terms, the lower 4 bits of the address are simply ignored and assumed to be zero.

Thus, a SIMD load of b[1] yields the b0...b3 value with the desired b1 value in the second slot of the vector register. Proceeding similarly with the c[3] SIMD load, we obtain the c0...c3 values with the desired c3 value in the fourth and last slot of the vector register.

simd naive
Figure 1. Naive, erroneous SIMD code for b[i+1] + c[i+3].

As seen in Figure 1, simply adding the loaded vector registers does not give the desired b[1]+c[3] result, as the alignment of the b1 and c3 values do not match in their respective registers as they are being added.

It is relatively straightforward to see that the alignment of the data in memory matches the alignment of that same data as it is loaded in register, unless that data is shifted, either in hardware as it is being loaded from memory by an aligning load, or in software by scheduling a shift or permute operation in the computation. Consider the same b[1]+c[3] computation, where the data is shifted so that the data touched by the first iteration (namely b1 and c3) are both in the first slot of their respective vector registers, as shown below in Figure 2.

simd zero
Figure 2. Correct SIMD code for a[i+2] = b[i+1] + c[i+3].

Clearly, since the data corresponding to the first iteration (yellow circles) reside in the first slot of their respective vector registers, adding these two registers will yield the expected b[1]+c[3] result, also in the first slot. Assuming that the result is then stored in a[i+2], then the result must be shifted back to the alignment expected by the store.

This scheme above is the traditional analogue of what a hardware aligning load and store would do. When such hardware support is not available, this scheme is also similar to what prior compiler technology does in software using extra shift or permute instructions.

 

Alignment Issue, Our Contribution

The key idea for our contribution is to realize that when alignment is done in software, there is no reason to remain as constrained as a hardware approach, which has little knowledge about the alignment of the entire computation. Indeed, prior work corresponds to an "align-to-slot-zero" approach where the data of the first loop iteration is artificially aligned to the first slot in their respective vector registers.

To illustrate our approach, consider the same a[i+2] = b[i+1] + c[i+3] as before, in Figure 3. But this time, the compiler is using global alignment knowledge to directly shift the misaligned input memory stream to the alignment of the output memory stream.

simd eager
Figure 3. Optimized SIMD code for a[i+2] = b[i+1] + c[i+3].

In doing so, we eliminate the need for misalignment handling before the store and thus reduce the number of shifts by 50%, from 3 to 2 per simdized iteration, in this example.

We can generalize the insertion of shifts by inserting shift operation in the original dependence graph whenever there is a boundary between two alignments. One constraint is that the alignment of memory operations is dictated by the alignment property of arrays. The second constraint is that whenever an operation is to proceed, all its inputs and its output must have an identical alignment. Finally, the property of the shifts (referred to as "vshiftstreams" here) is that it can convert the alignment property of its single input data streams to an arbitrary alignment for its output data stream.

Using the above definition, a "shift-to-zero" policy is shown below in Figure 4a, where each of the memory inputs is shifted to the zero alignment, and then the outcome of the computation is shifted back to the alignment of the final store alignment.

The optimized shift policy used to derive Figure 3 is shown in Figure 4b. With this policy, refered to as "eager shift," each of the misaligned inputs is eagerly shifted to the alignment of the final store.

simd shift policies
Figure 4. SIMD shift placement policy.

A "lazy shift" policy is illustrated in Figure 4c for a slightly different example, a[i+3] = b[i+1] + c[i+1]. In this example, both inputs are respectively aligned with each other. Thus, we may lazily shift the result of the addition instead eagerly aligning of each of the addition's inputs.

Finally, Figure 4d illustrates a "dominant shift" policy where the inputs are aligned to the most frequent alignment seen in the computation.

 Privacy | Legal | Contact | IBM Home | Research Home | Project List | Research Sites | Page Contact