IBM Skip to main content
  Home     Products & services     Support & downloads     My account  
  Select a country  
Journals Home  
  Systems Journal  
  ·  Current Issue  
  ·  Recent Issues  
  ·  Papers in Progress  
  ·  Search/Index  
  ·  Orders  
  ·  Description  
  ·  Author's Guide  
Journal of Research
and Development
  Staff  
  Contact Us  
Systems Journal  
Volume 37, Number 3, 1998
Java Technology
 Table of contents: arrowHTML arrowASCII   This article: HTML arrowASCII   DOI: 10.1147/sj.373.0409 arrowCopyright info
   

Optimizing array reference checking in Java programs

by S. P. Midkiff, J. E. Moreira, and M. Snir
The Java™ language specification requires that all array references be checked for validity. If a reference is invalid, an exception must be thrown. Furthermore, the environment at the time of the exception must be preserved and made available to whatever code handles the exception. Performing the checks at run time incurs a large penalty in execution time. In this paper we describe a collection of transformations that can dramatically reduce this overhead in the common case (when the access is valid) while preserving the program state at the time of an exception. The transformations allow trade-offs to be made in the efficiency and size of the resulting code, and are fully compliant with the Java language semantics. Preliminary evaluation of the effectiveness of these transformations shows that performance improvements of 10 times and more can be achieved for array-intensive Java programs.

Three goals of the Java** programming language1 and execution environment are bit-for-bit reproducibility of results on different platforms, safety of execution, and ease of programming and testing. A crucial component of the Java language specification for achieving the latter two goals is that a program only be allowed to access objects via a valid pointer, and that programs only be allowed to access elements of an array that are part of the defined extent of the array. A naive implementation of this specification component requires every access to every element of an array to check the validity of all base pointers and indices involved in the access.

Figure 1 shows a typical representation of a four-dimensional array in Java. A naive checking of a reference to element A[1][2][3][4], indicated in the figure, would require four base pointer tests and four range tests: A, A[1], A[1][2], and A[1][2][3] must all be tested as valid pointers; 1, 2, 3, and 4 must all be tested as valid indices along the corresponding axes of the array. If the entire array is accessed, a total of 4N base pointer tests and 4N range tests will be necessary, where N is the total number of elements in the array. In this paper, we present a suite of techniques to greatly reduce the number of run-time tests. In many cases, the run-time tests can be completely eliminated.

Figure 1

The above-mentioned goal of bit-for-bit reproducibility of results, combined with the ability to catch exceptions and examine the state of the program at the time the exception was thrown, imply that it is not sufficient to determine that an invalid reference occurred. Rather, any optimizations of valid reference checking must cause all exceptions that would have been thrown in the original program to still be thrown. The exceptions must be thrown in precisely the same order, with respect to the rest of the computation, dictated by the original program semantics. The techniques we describe fulfill this requirement. Our methods also handle loops that catch exceptions within their bodies (i.e., with nested try blocks).

We note that the techniques and methods described here are not restricted to Java. They can be applied to any language in which the bounds of an array can be determined at run time. They could be used, for example, in C and FORTRAN compilers that want to provide an option to check array references.

The rest of this paper is organized as follows. The next section presents an informal overview of our optimizations and is followed by an introduction to the concept of safe bounds, used throughout the paper. The fourth section presents the first of our optimization techniques, the exact method. The succeeding sections present other techniques, the general, compact, and restricted methods, respectively. The eighth section discusses an inspector-executor variant of our methods, to handle more complex cases. The ninth section explains our optimizations in the context of multithreaded execution. The tenth section presents some experimental results, followed by a discussion of related work. Finally, we present our conclusions.

Overview of the optimizations

The main goal of our work is to develop techniques that minimize the number of run-time tests that must be performed for array reference operations in Java. An array A is defined by a lower bound lo(A) and an upper bound up(A). We use a Java-like notation for array declarations:

doubleA[] = new double[lo(A):up(A)]

declares a one-dimensional array A of doubles. Note that we allow an explicit declaration of the lower bound lo(A) of array A. In Java, the lower bound is always zero. Also, Java array declarations specify the number of elements of the array. Therefore, the Java declaration

double A[] = new double[n]

would correspond to the declaration

double A[] = new double[0:n - 1]

in our notation.

An array element reference is denoted by A[sigma], where sigma is an index into the array. For the reference to be valid, A must not be null, and sigma must be within the valid range of indices for A: lo(A), lo(A) + 1, ... , up(A). If A is null, then the reference causes a null pointer violation. In Java, this corresponds to the throwing of a NullPointerException. If A is a valid pointer and sigma is less than the lower bound lo(A) of the array, then the reference causes a lower bound violation. If A is a valid pointer and sigma is greater than the upper bound up(A) of the array, then the reference causes an upper bound violation. Java does not distinguish between lower and upper bound violations. A bound violation (lower or upper) in Java causes an ArrayIndexOutOfBoundsException to be thrown.

Multidimensional arrays are treated as arrays of arrays:

double M[][] = new double[l1:u1] [l2:u2]

declares an array M with lower bound l1 and upper bound u1. Each element of M is an array of doubles, with lower bound l2 and upper bound u2. M is an example of a rectangular array. Rectangular arrays have uncoupled bounds along each axis. As in Java, we also allow ragged arrays, where the bounds for one axis depend on the coordinate along another axis. A triangular array is an example of a ragged array:

double T[][] = new double[1:n][]

T[i] = new double[1:i], i = 1,...,n

Even though ragged arrays are allowed and treated by our techniques, we do have optimizations that take advantage of rectangular arrays.

Arbitrary array lower bounds, the distinction between lower bound and upper bound violations, and treatment for rectangular arrays are features of our work that are not strictly necessary for Java applications. Our motivation to include them is twofold: First, we want our methods to be applicable to other programming languages, in particular C, C++, and FORTRAN 90. Second, we want to be prepared to handle extensions to Java that are being considered to efficiently support numerical computing, as described in Reference 2.

Array accesses typically occur in the body of loops. Our optimizations operate on do-loops, which are for-loops of the form:

for (i = l; iless than or equal tou; i++){
B(i)
}

where i is the index variable of the loop, l defines the lower bound of the loop, and u defines the upper bound of the loop. The iteration space of the loop is defined by the range of values that i takes: l, l + 1, ... , u. All loops have a unit step, and therefore a loop with l > u has an empty iteration space. B(i) is the body of the loop, which typically contains references to the loop index variable i. As a shorthand, we represent the above loop structure by the notation L(i, l, u, B(i)). Note that loops with positive nonunity strides can be normalized by the transformation

for (i = l; iless than or equal tou; i = i + s){
B(i)
}
becomes
for (i = 0; iless than or equal to left floor(u - l)/s right floor; i++){
B(l + is)
}
A loop with negative stride can be first transformed into a loop with a positive stride:
for (i = u; i greater than or equal to l; i = i - s){
B(i)
}
becomes
for (i = l; iless than or equal to u; i = i + s){
B(u + l - i)
}

Loops are often nested within other loops. Standard control-flow and data-flow techniques3 can be used to recognize many for, while, and do-while loops, which occur in Java and C, as do-loops. Many goto loops, occurring in C and FORTRAN, can be recognized as do-loops as well.

Four different tests can be applied to an array reference A[i]. A null test verifies whether A is a null pointer. An lb test verifies whether i greater than or equal to lo(A), and a ub test verifies whether i less than or equal to up(A). Finally, a test called all tests verifies whether lo(A) less than or equal to i less than or equal to up(A). Tests for bounds subsume null tests, since a null array can be set to have an empty extent. Furthermore, we show in the next section that for do-loops only one of three cases can occur for each iteration and each array reference: (1) an lb test is required, (2) a ub test is required, or (3) no test is required. In many situations it is possible to precisely determine which case occurs. In other situations, one has to be conservative and adopt a stronger test than absolutely required. In particular, an all tests can be used to detect any possible violations.

We introduce later a method for deriving the minimum set of tests required at each iteration of a simple loop with no nesting. This method will be referred to as the exact method. It provides the general framework that is used by the subsequent methods, which handle nested loops. The exact method specifies, for each iteration and each array reference, which test of the three named above is required, if any. A loop with rho array references in its body is split at compile time into up to 3rho consecutive regions, each with a specialized version of the loop body. At run time, no more than 2rho + 1 regions are actually executed.

This level of code replication is not practical in general. However, one can reduce the number of code versions by merging together regions, at the expense of performing superfluous tests. This is shown in the section describing the general method. In practice, this does not increase execution time. The regions where tests are required will usually be found to be empty at run time so that the tests in these regions are never normally executed. Reducing the number of distinct regions will not only decrease code size, but may also decrease execution time. A practical version of the exact method splits an iteration space i = l, l + 1, ... , u into three regions:

  1. A region of the iteration space where no tests are needed. This region is defined by a safe lower bound ls and a safe upper bound us. The range of values of i in this region is ls, ls + 1, ... , us.
  2. A region of the iteration space where the index variable has values smaller than the safe lower bound ls. For this region i = l, l + 1, ... , ls - 1.
  3. A region of the iteration space where the index variable has values greater than the safe upper bound us For this region i = us + 1, us + 2, ... , u.
The loop is split into three loops, each associated with a different code version. As a simpler alternative, the number of code versions can be further reduced to two: one with no tests, and one with all tests enabled.

We extend this method to nested loops in the later section, "The general method," recursively splitting each iteration space into three regions. When applied to a perfect d-dimensional loop nest, this method leads to 3d distinct loop nests, each potentially executing a different code version. One can reduce the number of distinct code versions by merging different versions. In the extreme case, it is possible to use only two versions. This method will be referred to as the general method.

In the sixth section we present a method that avoids this exponential blow-up in static code size. This method is referred to as the compact method. When applied to a perfect d-dimensional loop nest, it results in 2d + 1 loop nests. Depending on the simplifications adopted, it can use from 2 up to 2d + 1 different code versions.

Finally, in the seventh section we describe a tiling method that can be applied to certain types of loop nests. It effectively implements the compact method through generated code of a very simple form. The method uses two versions for each loop body (no tests and all tests enabled). This method is referred to as the restricted method.

These four methods apply to Java as well as to FORTRAN or C and work for all machine architectures. Additional optimizations are possible in important special cases. We briefly discuss two of these optimizations now: how to determine a valid index with a single test and how to test for null pointers. We do not discuss special optimization techniques any further in this paper.

In the particular case of Java an all tests test can be implemented with a single comparison. Because Java does not distinguish between lower bound and upper bound violations, and because lo(A) = 0 always, it suffices to test if i < up(A) + 1 using unsigned arithmetic. A negative number appears, in unsigned arithmetic, as a very large positive number which, by the Java language specification, is always larger than the largest possible array extent. This technique is used, for example, in the IBM HPCJ (High-Performance Compiler for Java) project.4

The optimization of checks for null pointer violations in array references is a direct consequence of the optimization of checks for bound violations, as discussed in the next section. There are also many ad hoc solutions to the optimization of null pointer violations. For machines with segmented memory architecture, such as the IBM POWER2 Architecture*, null pointers can be represented by a pointer to a protected segment. For machines with linear, but paged, memory architecture, a protected segment can be simulated with a region of protected pages. Any attempts at access through a null pointer will immediately cause a hardware exception that can then be translated into the appropriate software exception. This implementation completely eliminates the need for any explicit checks for null pointer violations in the code. It is also applicable to all pointer dereferences, not just array accesses. Despite the complications in properly translating a hardware exception, this technique has been successfully used, for example, in the CACAO project.5

Computing safe bounds

The basic idea of our methods is to partition the iteration space of a loop into regions. A region is a contiguous subset of the iteration space. It is characterized by a collection of tests that need to be performed for array references in that subset. We are particularly interested in finding safe regions, where we know there can be no violations in array references. If an array reference is guaranteed not to generate a violation, explicit tests are unnecessary.

We illustrate the computation of safe regions with a simple example. We then proceed to formalize it to more general cases. Consider the simple loop:

for (i = l; i less than or equal to u; i++){
B(i)
}

which we represent as L(i, l, u, B(i)). B(i) is the body of the loop and, in general, contains references to the loop index variable i. The iteration space of this loop consists of the range of values i = l, l + 1, ... , u.

Let there be a single array reference A[i] in the loop body B(i). We denote the lower bound of A by lo(A) and the upper bound of A by up(A). Therefore, for the array reference A[i] to be valid we must have lo(A) less than or equal to i less than or equal to up(A). If A is null, we define lo(A) = 0 and up(A) = - 1. This guarantees that A[i] is never valid if A is null. We can split the iteration space of the loop into three regions special R[1], special R[2], and special R[3], defined as follows:

special R[1] : (l less than or equal to i less than or equal to u) ^ (i < lo(A)) (1)
special R[2] : (l less than or equal to i less than or equal to u) ^ (lo(A) less than or equal to i less than or equal to up(A)) (2)
special R[3] : (l less than or equal to i less than or equal to u) ^ (i > up(A)) (3)

Region special R[1] corresponds to those iterations of the loop for which the index i into array A is too small. Therefore, an lb test is required before each array reference in this region. No tests are required in region special R[2], since the index i falls within the bounds of A. Finally, a ub test is required in region special R[3], because the values of i are too large to index A.

Using Equation 2, we can compute the lower and upper bounds special L and special U of the safe region:
special L = max(l, lo(A)) (4)
special U = min(u, up(A)) (5)
special R[2] : i = special L,..., special U (6)

Similarly, we use Equation 1 and Equation 3 to compute the lower and upper bounds of regions special R[1] and special R[3], respectively:
special R[1] : i = l,..., min(u + 1, lo(A)) - 1 (7)
special R[3] : i = max(l - 1, up(A)) + 1,..., u (8)

Note that min(u + 1, lo(A)) = special L, except when lo(A) < l or lo(A) > u + 1. In the former case, special R[1] is empty; in the latter case, special R[2] is empty. To handle these cases, and the symmetric upper bound cases, we redefine
special L = min(u + 1, max(l, lo(A))) (9)
special U = max(l - 1, min(u, up(A))) (10)

We can now express the bounds of each of the regions just in terms of l, u, special L, and special U:
special R[1] : i = l,..., special L - 1 (11)
special R[2] : i = special L,..., special U (12)
special R[3] : i = special U + 1,..., u (13)

Equations 11-13 define the same regions as Equations 6-8. The values of region bounds are different only for empty regions, but they are still empty.

In Figure 2 we illustrate the values of special L and special U for different relative positions of iteration bounds and array bounds. Figure 2A has empty regions special R[1] and special R[3], whereas region special R[2] comprises the entire iteration space. Region special R[1] is empty in Figure 2B, whereas region special R[3] is empty in Figure 2C. All three regions are nonempty in Figure 2D. Regions special R[2] and special R[3] are empty in Figure 2E, because all values of i fall below the lower bound of A. Finally, regions special R[1] and special R[2] are empty in Figure 2F, since all values of i fall above the upper bound of A.

Figure 2

In the general case, we have an array reference of the form A[f(i)] in the body B(i) of a loop. Depending on the behavior of f(i), we can compute safe bounds special L and special U that partition the iteration space into three regions, as defined by Equations 11-13. Region special R[2] is safe, and no test is defined in it. We define tests tau[1] and tau[3] that are sufficient for regions special R[1] and special R[3], respectively. Expressions for special L, special U, and tau for various forms of f(i) are described in Appendix A. The safe bounds are computed so that we always have special U greater than or equal to special L - 1 and special L less than or equal to special U + 1. These properties are important for the correct functioning of our methods.

An exact method

In this section we consider the case of a simple (depth one) loop, with multiple references. The exact method described here performs only the tests strictly necessary (as specified in Appendix A) to detect the array reference violations that occur during the execution of a loop. It is, in general, of limited practicality because up to 3rho versions of the loop body must be instantiated in the code, where rho is the number of array references in the loop body. In some situations, when enough information is available to the compiler, it is possible to generate efficient code using this technique. Our main interest in studying this method is that the other, more practical, transformations are derived from this technique.

In this section we treat references of the form A[f(i)] that allow the computation of safe bounds as described in the previous section. The eighth section discusses how to handle more complex subscripts. We first describe how the method generates optimized code, and then we illustrate the method with an example.

Code generation. The method works as follows. Let L(i, l, u, B(i)) be a loop on index variable i and body B(i). Let there be rho references of the form Aj[fj(i)], j = 1, ... , rho, in body B(i). Each array reference Aj[fj(i)] partitions the iteration space into three (each possibly empty) regions:

special Rj[1] : i = l,..., special L j - 1

special Rj[2] : i = special L j,..., special Uj

special Rj[3] : i = special U j + 1,..., u

as defined by Equations 11-13. We call these regions defined by a single array reference simple regions. Also, each region special R j[k] has a test tau j[k] associated with it, which describes what kind of test has to be performed for reference Aj[fj(i)] in that region. The test tauj[2] for region special R j[2] always specifies no test, because special R j[2] is a safe region with respect to this reference. The tests tau j[1] and tau j[3], for regions special R j[1] and special R j[3], respectively, can be any one of ub test (an upper bound test), lb test (a lower bound test), or all tests (both a lower and an upper bound test). The exact choice depends on characteristics of the array reference Aj[fj(i)]. This issue is discussed in detail in Appendix A. For each region special R j[k] we denote its lower bound by special R j[k].l and its upper bound by special R j[k].u.

Two references Aj[fj(i)] and Ak[fk(i)] can be thought of as partitioning the iteration space into five regions defined by the four points special L j, special Uj, special Lk, and special Uk. We illustrate this in Figure 3 for two references, A1[f1(i)] and A2[f2(i)]. Note that some of the resulting regions may be empty, in general. The resulting regions are a subset of the 3 · 3 = 9 regions formed by all combinations of intersections of two simple regions, one from each reference.

Figure 3

In general, given the rho references Aj[fj(i)], we can create a vector of all 3rho regions formed by intersecting the simple regions from different array references:

special R [x1·x2·...·x rho] =
special R ¹[x1] intersect special R ²[x2] intersect... intersect special R rho [xrho](14)

Each index xk is 1, 2, or 3. The lower bound of a region special R [x1 · x2 · ... · xrho] is the maximum of the lower bounds of the forming simple regions. Correspondingly, its upper bound is the minimum of the upper bounds of the forming simple regions:
special R [x1 · x2 · ... · xrho].l =
max (special R¹ [x1].l, special R² [x2].l,..., special R rho [xrho].l) (15)
special R [x1 · x2 · ... · xrho].u =
min(special R¹ [x1].u, special R² [x2].u,..., special R rho [xrho].u) (16)

Finally, the tests that characterize region special R [x1 · x2 · ... · x rho] can be described by:
tau [x1·x2·...· xrho] = {tauj [xj], j = 1,..., rho} (17)

That is, the combination of the tests for each simple region special Rj [xj].

Using this partitioning of the iteration space into 3rho regions, the loop
for (i = l; i less than or equal tou; i++){
      B(i)
}
(18)

can be transformed into 3rho loops, each one implementing one of the regions special R [x1 · x2 · ... · xrho]. The body of each region can be specialized to perform exactly the tests described by tau [x1 · x2 · ... · xrho]. We use Bx1x2... xrho(i) to denote the body B(i) specialized with the tests described by tau [x1 · x2 · ... · xrho]:
for (i = special R [1 · 1 ·...· 1].l;
     i less than or equal to special R [1 · 1 ·...· 1].u; i++) {B11...1(i)}
for (i = special R [1 · 1 ·...· 2].l;
     i less than or equal to special R [1 · 1 ·...· 2].u; i++) {B11...2(i)}
for (i = special R [1 · 1 ·...· 3].l;
     i less than or equal to special R [1 · 1 ·...· 3].l; i++) {B11...3(i)}

·
·
·

for (i = special R [1 · 2 ·...· 1].l;
     i less than or equal to special R [1 · 2 ·...· 1].u; i++) {B12...1(i)}
for (i = special R [1 · 2 ·...· 2].l;
     i less than or equal to special R [1·2·...·2].u; i++) {B12...2(i)}
for (i special R [1 · 2 ·...· 3].l;
      iless than or equal to special R [1 · 2 ·...· 3].u; i++){B12...3(i)}

·
·
·

for (i = special R [3 · 3 ·...· 1].l;
     i less than or equal to special R [3 · 3 ·...· 1].u; i++) {B33...1(i)}
for (i = special R [3 · 3 ·...· 2].l;
     i less than or equal to special R [3·3·...·2].u; i++) {B33...2(i)}
for (i = special R [3 · 3 ·...· 3].l;
     i less than or equal to special R [3 · 3 ·...· 3].u; i++) {B33...3(i)}

(19)

Note that the order of the resulting loops is important, although there are many correct orders. The requirement is that, for any value of j, region special R [x1 · ... · x - 1 · 1 · xj + 1 · ... · xrho] has to precede special R [x1 · ... · xj - 1 · 2 · xj + 1 · ... · xrho] which in turn has to precede special R [x1 · ... · xj - 1 · 3 · xj + 1 · ... · xrho].

Out of the 3rho possible regions formed by the intersection of the simple regions, no more than 2rho + 1 are nonempty and are actually executed at run time. Which of the 3rho regions are nonempty depends on the relative positions of the 2rho safe bounds special L¹,..., special L rho, special U¹,..., special U rho. Let p1,..., p2rho be the list obtained by sorting special L¹, ..., special L rho, special U¹ + 1,...,special U rho + 1 in ascending order. Define p0 = l and p 2rho + 1 = u + 1. The safe bounds special L¹,..., special L rho, special U¹, ... , special U rho partition the iteration space into 2rho + 1 (each possibly empty) regions special P[k], k = 0, 1,..., 2rho defined by
special P[k] : i = pk,...,pk + 1 - 1 (20)

Each region special P[k] corresponds to a region special R [x1 · x2 · ... · xrho] defined by
xj ={ 1 if special Lj greater than or equal to pk + 1
3 if special Uj < pk
2 if (special L j < pk + 1) ^ (special Uj greater than or equal topk)
(21)

(It can occur that both special Lj greater than or equal to pk + 1 and special U j < pk. In that case, region special P[k] is necessarily empty, and the choice of xj is irrelevant.) Let M(k) be the function that defines the correspondence special R [M(k)] ident special P[k] for k = 0,..., 2rho. Then the loop
for (i = l; i less than or equal to u; i++){
     B(i)
}
(22)

can be transformed to
for (k = 0; k less than or equal to 2rho; k++){
     for (i = pk; i < pk + 1; i++){
           BM(k)(i)
}
(23)

If the order of the safe bounds special L¹,..., special L rho, special U¹,..., special U rho is known at compile time, the mapping function M(k) can also be determined. In this case, only the loop body versions that are actually executed need to be generated. In general, the order of the safe bounds is not known, and BM(k)(i) has to be selected at run time from the set of all possible 3rho body versions.

An example. In the interest of conciseness, we use a very simple code fragment to illustrate this method. Figure 4A illustrates the original loop to be transformed. It has two array references, A1[i - 2] and A2[i + 2], each generating three simple regions: special R¹[1:3] and special R²[1:3], respectively. The straightforwardly transformed code, using the exact method, is shown in Figure 4B. We note that:
special L¹ = min(u + 1, max(l, 3)) (24)
special L² = min(u + 1, max(l, - 1)) (25)
special U¹ = max(l - 1, min(u, n + 2)) (26)
special U² = max(l - 1, min(u, n - 2)) (27)
tau¹[1] = tau²[1] = lb test (28)
tau¹[3] = tau²[3] = ub test (29)

which does not provide us with enough information to order the safe bounds special L¹, special L², special U¹, and special U². We implement the code in Figure 4B according to Equation 19. We could have used the method in Equation 23, but it would still have required all 3rho body versions to be generated.

Figure 4

Now let n = 10, as shown in Figure 5A. In this case, the expressions for the safe bounds become:
special L¹ = min(u + 1, max(l, 3)) (30)
specialL² = min(u + 1, max(l, - 1)) (31)
special U¹ = max(l - 1, min(u, 12)) (32)
special U² = max(l - 1, min(u, 8)) (33)

and we can order special L² less than or equal to special L¹ less than or equal to special U² less than or equal to special U¹. In particular, this is the ordering shown in Figure 3. We only need to generate code for five loops, as shown in Figure 5B. In some cases, a compiler with appropriate symbolic analysis can even eliminate some of these five loops. For example, if the compiler could prove that l > 2, neither of the first two loops (body versions B11(i) and B12(i)) would be necessary.

Figure 5

Summary of the exact method. The exact method transforms code so that, for each iteration of the original loop, only those tests that cannot be shown to be unnecessary are performed. In the straightforward application of the method to a loop with rho array references in its body, 3rho new loops are generated, each with a slightly different body. No more than 2rho + 1 of these loops are actually executed at run time. In some situations, compile-time analysis can show that some of these loops implement empty regions of the iteration space and, therefore, can be discarded.

The general method

The general method works by partitioning the iteration space of a loop always into three regions, independent of the number of array references in the body of the loop. One of the regions is a safe region: no array reference in that region can cause a violation. Another region precedes the safe region, in iteration order. Finally, the third region succeeds the safe region, in iteration order. This method can be applied to each and every loop of an arbitrary loop nest individually. The general method does not specialize the tests as much as the exact method, but it does identify the same safe regions that can be executed without any tests.

Transforming a single loop. Consider the loop L(i, l, u, B(i)), with rho references of the form Aj[fj(i)] in its body. Using the concepts developed in the previous two sections, we can compute its safe region as the intersection of all simple safe regions. This safe region is defined by the range of values of i = ls, ls + 1, ... , us where:
ls=max(special L¹, special L²,..., special Lrho) (34)
us=max(ls-1, min(special U¹, special U²,..., special Urho)) (35)

The ls - 1 term in the expression for us is necessary to handle cases where the safe region is empty. We can then partition the iteration space of the loop into three (each possibly empty) regions:
special R[1] : i = l,..., ls-1 (36)
special R[2] : i = ls,..., us (37)
special R[3] : i = us+1,..., u (38)

This partitioning is not very different from our very first example in the third section, except that now it applies to a loop with an arbitrary number of array references in its body. We illustrate this for two references: A1[f1(i)] and A2[f2(i)] in Figure 6. Region special R[2] is the intersection of the safe simple regions special R¹[2] and special R²[2]. Correspondingly, region special R[1] is the union of the unsafe simple regions special R¹[1] and special R²[1]. Region special R[3] is the union of the unsafe simple regions special R¹[3] and special R²[3]. With reference to Figure 3, special R[1] is formed by merging all regions preceding special R[2 · 2], and region special R[3] is formed by merging all regions succeeding region special R[2 · 2].

Figure 6

Region special R[2] is the safe region, and its body can be implemented without any tests. We denote this version of the loop body by B(notest(i)). Conversely, the bodies of regions special R[1] and special R[3] need at least some tests. For simplicity, we can implement both with a version B(test(i)) of body B(i). This version performs an all tests test before each and every array reference.

The implementation of the general method consists of the transformation:

for(i = l; iless than or equal tou; i++){
     B(i)
}

becomes
for (i = l; iless than or equal tols-1; i++){
     B(test(i))
}
for (i = ls; iless than or equal tous; i++){
     B(notest(i))
}
for (i = us+1; iless than or equal tou; i++){
     B(test(i))
}
(39)

or, in shorthand:

L(i, l, u, B(i)) implies {L1(i, l, ls-1, B(test(i)))
L2(i, ls, us, B(notest(i)))
L3(i, us+1, u, B(test(i)))
(40)

The code expansion in this case is only twofold, since the same code for B(test(i)) can be used twice. Methods to realize all three resulting loops using only two instances of B(i) are discussed in Appendix B.

Recursive application of the transformation. If the body B of a loop contains other loops, then the same transformation can be applied to each of the loops in B. The transformation can be applied individually to each and every loop in a general loop nest. In fact, the final result is independent of the order in which the individual loops are transformed.

Consider the case of the two-dimensional loop nest L(i, li, ui, L'(j, lj, uj, B(i, j))). By first applying the transformation to the L loop, we generate a region without any tests on array references indexed by i:

L(i, li, ui, L'(j, lj, uj, B(i, j)))

becomes

L1(i, li, lis-1, L'(j, lj, uj, B(test(i), j)))
L2(i, lis, uis, L'(j, lj, uj, B
(notest(i), j)))
L3(i, uis+1, ui, L'(j, lj, uj, B
(test(i), j)))

We can now apply the transformation to each of the three L' loops. This will generate two regions without any tests on array references indexed by j and one region without any tests on array references indexed by either i or j as seen in Figure 7. This transformation partitions the iteration space into nine regions, and requires four different versions of the loop body B(i).

Figure 7

An example of the general method. For simplicity, we give an example of single-threaded code with rectangular arrays. In the ninth section is a discussion of the application of these optimizations to multithreaded code. The program of Figure 8 will be used in this example. It implements a step of a two-dimensional Jacobi relaxation.6 We chose it because it is a well-known operation that illustrates a loop nest with various array references in its body. Also, the array references all use slightly different indices, making our example more interesting.

Figure 8

Figure 9 shows the loop implemented with naive tests. The statement

ifAccessViolation(A,i) throwException

throws the appropriate exception if A is a null pointer, i is below the lower bound of A, or i is above the upper bound of A. In the actual executable code, the tests would be interweaved with the individual array references, and the order of tests would be slightly different, but this example gives an idea of the cost of naive testing. Figure 10 shows the loop nest optimized for testing using the general method. In each of the resulting loops, we list the violations that can occur in its body. The code with explicit tests for the different versions of the loop body are shown in Figure 11. The minimal tests needed for the i1 and i3 loops differ in that the i1 loop only needs to test for lower bounds violations, and the i3 loop only needs to test for upper bounds violations. By using the transformation of Equation 39, the version for both loops can be the same, as indicated in Figure 10.

Figure 9 Figure 10 Figure 11

Transforming the loops. The optimized code is the result of the following transformations. First, the outer i loop is split into three loops (i1, i2, and i3), as shown in Figure 10. The middle loop, i2, corresponds to those iterations of the original i loop that cannot cause bound violations on array references indexed by i. Within this loop, these array references need not be tested. This is the safe region for loop i.

The first loop, i1, executes those iterations of the original i loop whose values of i precede the safe region. Within this loop, all array references indexed by i need to be tested. Because the subscript expressions on i are monotonically increasing across the iteration space of the i loop, only lower bound violations can occur in the i1 region.

The third loop, i3, executes those iterations of the original i loop whose values of i succeed the safe region. Again, within this loop, all array references indexed by i need to be tested. Because the subscript expressions on i are monotonically increasing across the iteration space of the i loop, only upper bound violations can occur in the i3 region.

Within each of the resulting loops i1, i2, and i3 the j loop is similarly split. For example, the body of resulting loop j1,3 executes all iterations of the nest that attempt to reference elements of a or b that are below the corresponding lower bound, and elements of b[i], a[i], a[i + 1], and a[i - 1] that are above the corresponding upper bound. In a typical correct numerical code, where all references are in bounds, only the body of loop j2,2 will execute. This body is represented by B(notest(i), notest(j)), and because it has no tests, it executes faster.

Computing the bounds of the split loops. To compute the bounds for the resulting i1, i2, and i3 loops, we first have to compute the safe bounds for loop i: lis and uis. In the body of the i loop there are five array references indexed by i: b[i], a[i] (appearing twice), a[i + 1], and a[i - 1]. The values of special Lik and special Uik are shown in Figure 12A. They are computed using Equation 64 of Appendix A. The values of lis and uis can be computed according to Equations 34 and 35:

lis = max(special Li1, special Li2, special Li3, special Li4, special Li5)

uis = max(lis-1, min(special Ui1, special Ui2, special Ui3, special Ui4, special Ui5))

The bounds for the resulting j loops are computed similarly. In the body of the j loop there are five array references indexed by j: (a[i])[j + 1], (a[i])[j - 1], (a[i + 1])[j], (a[i - 1])[j], and (b[i])[j]. Note that b[i], a[i], a[i + 1], and a[i - 1] are the arrays being accessed. The values of ljs and ujs can be computed by Equations 34 and 35, using values of special Ljk and special Ujk obtained through Equation 64 and shown in Figure 12B. In this example, the arrays are rectangular, and the lower and upper bounds for a[i], a[i + 1], a[i - 1], and b[i] do not depend on the value of i. (Note that, in general, ljs and ujs would be functions of i.)

Figure 12

Checks that must still be done. The transformation just discussed creates regions of the loop that are free from violations on array references indexed by either i, j, or both. Figure 10 lists the specific violations that can occur in each region. Note that this particular behavior is specific to this loop. We chose to implement the loop using versions of the body that perform tests covering more than the strictly necessary violations. This allowed us to use only four versions of the loop body, as shown in Figure 10 and detailed in Figure 11. We perform an all tests test on any i reference when outside the safe region for i. Correspondingly, we perform an all tests on any j reference when outside the safe region for j.

Summary of the general method. The exact method of the previous section formed a different region of the iteration space of a loop for each possible combination of necessary array reference tests. The general method always partitions the iteration space of a loop into three regions: (1) a region for those iterations that need no test (the safe region), (2) a region for those iterations that occur prior to the safe region, and (3) a region for those iterations that occur after the safe region. In each of the nonsafe loop regions, any test that might be needed for a reference in any iteration of that region is performed in all iterations of that region. This means that some additional tests might be executed compared to the exact method. For a nest of loops, the general method can be applied to each and every loop recursively and independently. Thus, for a loop nest of depth d, 3d loop regions are created.

When generating code for the regions, several approaches can be taken. One approach generates a different version of the loop body for each region. When applied to the example of Figure 10, this would generate only lower bound tests on arrays indexed by i in the i1 loop and only upper bound tests in the i3 loop. This approach leads to 3d versions of the loop body for a loop nest of depth d. A second approach, actually used in our example of Figures 10 and 11, uses only two versions of the loop body for each loop index: one with no tests on that index and one with all tests. This leads to 2d versions of the loop body for a loop nest of depth d. A third approach would be to generate only two versions of the loop body, independent of the number of nested loops: one with all tests on all indices and one with no tests on any index. Obviously, the version with no tests can only be used in that region where no violations in any array reference can occur.

The compact method

The exponential expansion on the number of loops caused by the application of the general method can be highly undesirable in some cases. In this section we propose an alternative method that results in only a linear increase in the number of regions. The difference resides in how the transformation is applied to loop nests. In the compact method, the partitioning into three regions is always applied in outermost to innermost loop order. Furthermore, it is only applied to inner loops in the untested version of an outer loop body.

Again, consider the case of the two-dimensional loop nest L(i, li, ui, L'(j, lj, uj, B(i, j))). By first applying the transformation to the outer i loop, we generate a region without tests on array references indexed by i:

L(i, li, ui, L'(j, lj, uj, B(i, j)))

becomes
L1(i, li, lis-1, L'(j, lj, uj, B(test(i), j)))
L2(i, lis, uis, L'(j, lj, uj, B(
notest(i), j)))
L3(i, uis+1, ui, L'(j, lj, uj, B(test(i), j)))
(41)

We now apply the transformation to the instance of L' in the L2 region. This results in a region without any tests on array references indexed by either i or j as seen in Figure 13.

Figure 13

This transformation partitions the two-dimensional iteration space into five regions, and requires three different versions of the loop body B(i). It still generates the same region safe on i and j as the general method.

An example of the compact method. As an example, we apply the compact method to the two-dimensional loop nest of Figure 8. The resulting code is shown in Figure 14. Note that only loop i2 has its j loop partitioned into three regions. When applied to a d-dimensional loop nest, the compact method generates 2d + 1 regions of the loop iteration space. Because the two regions preceding and succeeding a safe region are similar, only d + 1 versions of the loop body must be generated.

Figure 14

Summary of the compact method. Like the general method of the previous section, the compact method divides each loop into one safe and two nonsafe regions. The differences between the two methods are manifested only when applying the transformation to a loop nest. The general method splits all nested loops into three regions, whereas the compact method splits only loops nested within the version without tests. This implies that within the nonsafe versions of an outer loop the compact method may perform more tests than the general method. The benefit is that only a linear number of loop regions (2d + 1 for a loop nest of depth d) are necessary with the compact method, as opposed to an exponential number with the general method. The number of versions of the loop needed is a linear function of the nesting depth of the loop (d + 1 for a loop nest of depth d).

The number of versions of code can be further reduced to two by using a fully tested version on any region that needs tests. Also, if the loop nesting is perfect, the structure of the generated code can be greatly simplified when only two versions are used. These observations provide the motivation for the restricted method of the next section.

The restricted method

We now consider a transformation that can be applied to perfect loop nests, or any loop nest that can be transformed to a perfect loop nest via compiler techniques. The restricted method partitions the loop into multiple regions. Each region executes either a test (all array references are fully tested) or a notest (no array references are tested) version of the loop body. Having a perfect loop nest allows us to map every iteration of the loop nest onto a single point in a d-dimensional space, where d is the depth of the loop nest. Partitioning the loop into test and notest regions is equivalent to tiling this iteration space.

The advantage of the restricted method, when applied to a perfect d-dimensional loop nest, is that it can be implemented with only one instance of each version of the loop body. Also, the instances can be generated in place in a fixed loop structure. The previous methods, when implemented with only two versions of code, require a specialized loop structure that invokes those versions from more than one point in the program.

We start by considering a d-dimensional rectangular loop nest:

L1(i1, li1, ui1, L2(i2, li2, ui2,...,
Ld(id, lid, uid, B(i1, i2,..., id))...)) (42)

That is, the bounds lik and uik of loop ik do not depend on the values of i1, ... , ik-1. The iteration space of this loop can be partitioned into consecutive regions described by a vector special R[1:udelta]. Each entry in this vector has the form:
special R[delta] = (l(delta), u(delta), tau) (43)
l(delta) = (li1(delta), li2(delta),..., lid(delta)) (44)
u(delta) = (ui1(delta), ui2(delta),..., uid(delta)) (45)
tau = {test|notest} (46)

The vectors l and u define the lower and upper bounds for each loop in the region, respectively. The elements lij(delta) and uij denote the lower and upper bounds, respectively, of loop ij in the region special R[delta]. The definition of a region also contains a flag tau that has the value test if the region requires any array reference to be tested and notest otherwise.

With partitioning, the original loop can be executed by a driver loop that iterates over the regions:

for (delta=1; deltaless than or equal toudelta; delta++){
     L1(i1, li1(delta), ui1(delta), L2(i2, li2(delta), ui2(delta),...,
          Ld(id, lid(delta), uid(delta), B(i1, i2,..., id))...)).

}

Note that in previous sections single-dimensional regions were described by a vector containing the upper and lower bounds for the regions. Here a multidimensional region is described by vectors with upper and lower bounds for every index variable in the loop nest. The techniques of the previous sections formed regions loop by loop, whereas the techniques of this section form regions for the entire loop nest.

Our goal is to partition the iteration space into regions that we can identify as either requiring tests on array references or not. Those regions that do not require tests can then be executed with a notest version of the loop body. We use the same kind of partitioning described in the compact method. First, an outer loop is divided into three regions. Then, the inner loop in the region without tests is divided again. This partitioning is illustrated in Figure 15, for a two-dimensional iteration space.

Figure 15

Computing vector special R. Vector special R can be computed by procedure regions in Figure 16. Procedure regions takes seven parameters. The first five are input parameters and describe the loop nest being optimized. They are:

  1. The index j indicating that region extents along index variable ij are being computed
  2. The vector (alpha1, alpha2, ... , alphaj-1), where alphak is the lower bound for loop index ik in the regions to be computed
  3. The vector (omega1, omega2, ... , omegaj-1), where omegak is the upper bound for loop index ik in the regions to be computed
  4. The dimensionality d of the loop nest
  5. The vector special R[1:d], where special R[k] = (lik, uik, liks, uiks) contains the full and safe bounds for loop ik

Figure 16

The next two parameters are the output of the procedure. The first output parameter is the vector of regions, special R, described in Equations 43 through 46, for the loop. The second is udelta, the count of those regions. Note that udelta is also used as an input, which gets incremented each time a new region is created.

An invocation of procedure regions for a given value of j partitions the loop nest formed by loops ij, ij+1, ... , id. It is only performed for safe values of i1, i2, ... , ij-1. Procedure regions partitions loop ij into three parts. The first part consists of iterations of ij that precede the safe region of ij. The inner loops of ij are not partitioned. These regions necessarily require testing of the array references, since they are unsafe on references indexed by ij. Regions corresponding to this part of the iteration space are computed in statements S1 and S2, and correspond to the regions special R[1] (for j = 1), special R[2], special R[5], special R[8], and special R[11] (for j = 2) in the two-dimensional iteration space of Figure 15. The third part consists of the iterations of ij that succeed the safe region of ij. Again, the inner loops are not partitioned, and testing is required. Regions corresponding to this part of the iteration space are computed in statements S11 and S12, and correspond to the regions special R[4], special R[7], special R[10], special R[13] (for j = 2), and special R[14] (for j = 1) in Figure 15. The middle (second) part consists of the iterations of ij that are within its safe region. If ij is the innermost loop, this is computed by statements S4 and S5, and the partitioning of loop ij is complete. No tests are required. If ij is not the innermost loop, the partitioning is applied recursively to loop ij+1 for each iteration of ij in its safe region, as shown in lines S7 and S8.

To compute the entire vector of regions, the invocation regions(1, (), (), d, B, R, udelta = 0) should be performed. At the end of the computation, vector special R contains the description of the regions, and the value of udelta is the total number of regions in special R.

Although the example in Figure 15 and the notation imply that the iteration space passed to regions in B is rectangular, the algorithm is not restricted to rectangular loop nests. In particular, if the expressions for lij, uij, lijs, and uijs are functions of ik, 1 less than or equal to k < j, the computation performed by regions is correct.

We can optimize the execution of the loop nest by using two versions of code inside the driver loop: (1) a version B(notest(i1), notest(i2), ... , notest(id)) that does not perform any of the array reference tests, and (2) a version B(test(i1), test(i2), ... , test(id)) that performs tests on all array references. Version 1 is used only for regions where special R[delta].tau is notest, while version 2 is used for all other regions. This corresponds to forcing all regions with any potential reference violation to perform all reference tests. The optimized loop nest can be implemented by the following construct:

for (delta=1; deltaless than or equal toudelta; delta++){
if (special R[delta].tau ==test){
L1(i1, li1(delta), ui1(delta),
L2(i2, li2(delta), ui2(delta),
... ,
Ld(id, lid(delta), uid(delta), B(test(i1),
test(i2),..., test(id)
...
)
)
}else{
L1(i1, li1(delta), ui1(delta),
L2(i2, li2(delta), ui2(delta),
... ,
Ld(id, lid(delta), uid(delta), B(notest(i1),
notest(i2),..., notest(id)))
...
)
)
}
}(47)

An important optimization. If, for a particular value of j, lijs = lij and uijs = uij, then the safe region along axis ij of the iteration space corresponds to the entire extent of the axis. If liks = lik and uiks = uik for k = j, ... , d, then axis ij-1 can be partitioned into only three regions: (1) one region from lj-1 to lij-1s - 1, (2) one region from lij-1s to uij-1s, and (3) one region from uij-1s + 1 to uij-1. Each of these regions spans the entire iteration space along axes ij, ij+1, ... , id. This situation for a two-dimensional iteration space is illustrated in Figure 17. Note that the new partitioning results in only three regions. Collapsing multiple regions into a single region reduces the cost of computing the regions, the total number of iterations in the driver loop (47), and, consequently, reduces the run-time overhead of the restricted method. To incorporate this optimization in the computation of regions, procedure regions is modified as shown in Figure 37 in Appendix F.

Figure 17

An example of the restricted method. We apply the restricted method to the two-dimensional loop nest of Figure 8. The logical structure of the generated code is the same as that generated by the compact method in Figure 14. The actual implementation with the driver loop and the two versions of code is shown in Figure 18. The driver loop (with index delta) of Figure 18 executes udelta iterations, where udelta is the number of regions computed by procedure regions. On each iteration, either the version of the loop with run-time tests or the version with no run-time tests is executed. The code in Figure 18 instantiates the regions dynamically. This contrasts with the static instantiation of Figure 14.

Figure 18

Summary of the restricted method. The restricted method works on perfect loop nests. It partitions the iteration space into multidimensional regions and generates two versions of the loop body. Each region is executed with either the test version of the body or the notest version. Only regions that are free from any possible access violation can use the notest version.

Iterative computation of loop bounds

In the third section and in Appendix A we discuss how to partition an iteration space into regions for a variety of common forms of subscript functions. In this section, we discuss how regions-based testing optimization can often be performed even when the subscript expressions are not one of the forms discussed in those sections.

The technique is similar to that of the inspector/ executor method for parallelizing loops.7 We decompose the computation and execution of the regions into an inspector phase and an executor phase. The inspector phase examines the references within the iteration space of a loop and computes a list of iteration subspaces (regions). Some regions need run-time tests, whereas others do not. The inspector phase is analogous to procedure regions of the previous section. The executor phase traverses the list of iteration subspaces and executes them using different versions of the loop body. Thus the executor phase corresponds to the driver loop of the last section.

Construction of the inspector phase. We show how to construct an inspector for a singly nested loop. This method can then be recursively applied to a loop nest. Let L(i, l, u, B(i)) be a loop on i, where B(i) contains a series of rho references A1[sigma1], A2[sigma2], ... , Arho[sigmarho], where sigmaj is a function defined in the iteration space of L. A reference can be of the form Aj[Ak[sigmak]], where Ak is also an array. In general, Ak[sigmak] may be present as a term in sigmaj. We label the array references so that Aj[sigmaj] executes before Aj+1[sigmaj+1].

The inspector constructed takes the form shown in Figure 19A. The first argument to procedure regions is the output special R, a vector of regions. An element special R[delta] of the region vector consists of:

special R[delta]=(l(delta), u(delta), tau)

where l(delta) and u(delta) are the bounds of region special R[delta], and tau is its test flag. The next argument, udelta, is also an output: the number of regions in vector special R. The other arguments for procedure regions are the input of the procedure. They are: (1) the lower and upper bounds, l and u, of the loop, and (2) the list of array references (A1[sigma1], A2[sigma2], ... , Arho[sigmarho]).

Figure 19

The inspector enumerates all iterations from the loop, by executing a for (i = l; i less than or equal to u; i++) loop. For each iteration i, it checks all array references Aj[sigmaj] and determines if a test is necessary. If the evaluation of sigmaj itself causes a violation, we define sigmaj = -infinity. If sigmaj is invariant with respect to i, optimizations can be performed to reduce the number of evaluations in the inspector. The inspector marks in a flag check if a test is required for iteration i. If check is the same as oldcheck, this iteration i belongs to the same region as the previous iterations, and we update the upper bound of the current region. Otherwise, it is the first iteration of a new region.

Construction of the executor phase. The executor, shown in Figure 19B, consists of a driver loop (indicated by an index variable delta), that iterates over the regions. Each region is executed with one of the two different versions of the loop body, one with tests and one without.

Optimizations and alternate constructions. The inspector/executor technique can be applied to each loop in a loop nest independently, as we did for the general method discussed earlier. Alternatively, it can be directly applied to a d-dimensional perfect loop nest. In this case, the inspector has to enumerate all iterations in the d-dimensional iteration space, and the executor consists of a driver loop around different versions of the loop nest.

For the inspector/executor method to be effective, the inspector phase has to be hoisted out of a loop nest. When that is possible, the same vector of regions special R[1:udelta] can be used in multiple instances of the executor. The number of checks is reduced by the number of iterations in the loops from which the inspector was hoisted. We show such an example in the following subsection.

Finally, it is also possible to build an inspector/executor pair that uses more refined versions of the loop body. For example, one could use all 3rho versions generated in the exact method.

An example. In this example, the loop of Figure 20A is transformed. The inspector for this loop is shown in Figure 20B and the executor in Figure 20C. Note that we have optimized the loop by moving the inspector phase out of the j loop. This is possible because the array reference patterns are not dependent on j. Thus, even though there are O(uiuj) computations being performed, only O(ui) tests are necessary.

Figure 20

Summary of inspector-based methods. The inspector-based methods differ from the exact, general, compact, and restricted methods in how regions are formed. In the former methods, regions are computed analytically using the formulas of the third section and Appendix A. In the inspector-based methods the regions are computed by executing the code that generates the subscripts for the references being optimized. This allows references whose subscripts preclude an analytic computation of regions to be optimized. There are two caveats, however. First, because the inspector overhead is proportional within a constant factor to the cost of executing an instance of the loop, the methods are most effective when the inspector for a loop can be hoisted out of that loop. Second, there are still some loops which cannot be optimized with inspector-based methods. Figure 21 shows such a loop. Because the execution of an inspector for this loop would have side effects (other than region computation) that live beyond the life of the inspector, forming an inspector with the methods we have discussed would alter the outcome of the program. This loop can be optimized using speculative methods. Because they are not region-based, they are beyond the scope of this paper. They are discussed in Reference 8.

Figure 21

Multithreading considerations

Up to now, the problem of other threads of execution and optimized code being active at the same time has been ignored. If arrays had static shape, that is, if it were not possible for an array to change shape during the course of execution of a routine, multithreading would not be a problem. Multithreading is beyond the scope of the current FORTRAN, C, and C++ language definitions. Java, however, has made multithreading an integral part of the language. Furthermore, it is possible for one thread of a Java program to alter the shape of a multidimensional array that is being accessed by other threads.

The Java memory model enables a solution to this problem. The coherent copy of a shared variable resides in main memory. A local copy of any shared variable can be copied by a thread from main memory to working memory. The working memory is accessible only by the thread, and therefore is not alterable by any other thread. Java requires that:

  1. No synchronization can occur within a thread between time tm, when a variable is copied from main to local memory, and time ta, when the variable is accessed by the thread. This can be enforced by refreshing the shared variable in local memory after a synchronization and before the access.
  2. Before a synchronization within a thread is finished, all shared variables that have been modified since the last synchronization point in that thread must be written back to the main memory.
  3. If a shared variable has been written by a thread, it must be written to main memory before its value can be read from main memory by some thread. This prevents locally written values to shared variables from being lost.
  4. As long as item 3 is obeyed, the value of a shared variable can be updated from global memory at any time prior to an access by a thread.

The consequences of the Java memory model rules to our work are threefold. First, it is legal and valid within a thread to cache in local memory the values of a variable—even if the thread is not synchronized. Second, if the thread is not synchronized, the cached values need not be written back. Third, if a synchronization point exists within a loop nest being optimized, the shapes of shared arrays have to be conservatively assumed to have changed at the synchronization point. To prove that the shape of a shared array does not change across a synchronization requires proving that none of the threads that have access to the array at that time can change its shape.

We now describe how to handle the case in which a body of code without synchronization accesses a (potentially) shared array. We can divide arrays into two distinct parts. The first part is comprised of descriptors, those parts of an array that contain pointers to other descriptors or to rows of data. The second part is comprised of data, the one-dimensional rows that contain the actual elements. In Figure 22, the dashed boxes indicate descriptors, and the solid boxes indicate data. In general, a d-dimensional rectangular array A[1:n1][1:n2] ... [1:nd] consists of one level-0 descriptor that points to a vector of n1 level-1 descriptors. Each level-1 descriptor points to a vector of n2 level-2 descriptors, for a total of n1n2 level-2 descriptors. There are a total of O(n1n2 ... nd-1) descriptors. The last axis of the array contains the actual data, in the form of vectors of nd elements, pointed to by the level-(d - 1) descriptors.

Figure 22

The following then ensures the thread safety of our transformations: before every optimized body of code, a copy of the descriptors part of each shared array A is cached in working memory. Note that if a cached descriptor for A that is valid by the Java thread semantics is already present, it may be used for the purposes we describe. This action insulates the thread executing the optimized code from any changes to the shared array shape caused by other threads. Also note that only the array shape is relevant in computing the safe and unsafe regions for our transformations. The data part of the array can be modified by other threads without any effects on those regions. By using this caching of descriptors, the only places that shape changes become visible are at synchronization points. The shape of a shared array should be marked as variant across those points. This, in turn, can prevent many optimizations.

At first glance, the caching of descriptors may seem expensive. In general, given a d-dimensional rectangular array A[1:n1][1:n2] ... [1:nd], only O(n1n2 ... nd-1) storage needs to be cached. Thus, a one-dimensional array needs only a constant amount of storage, and an n1 × n2 array needs only n1 words of storage. Therefore, if most of the array is accessed within the loop, the amount of storage cached is approximately a factor of nd less than the number of references.

Experimental results

To test the effectiveness of our compiler transformations, we developed a prototype framework. This prototype framework currently implements only the restricted method and only handles perfectly nested rectangular loops. The framework consists of a Java-to-Java translator that produces the two versions of code necessary for the restricted method. These versions are then compiled into executable object code using the IBM High Performance Compiler for Java (HPCJ).4 HPCJ has a switch to generate code without any run-time checks, on a class basis.

Benchmarks. Using the prototype framework, we applied the transformations in the restricted method to Java programs in a benchmark suite. This suite consists of two numerical kernels and three application kernels, all with array-intensive computations. The two numerical kernels are a vector dot-product operation (DDOT), and a matrix-multiply operation (MATMUL). The three application kernels are a shallow-water simulation kernel (SHALLOW), a data-mining kernel (BSOM), and a cryptography kernel (CBC). We compare the performance of a Java implementation of each benchmark with a corresponding version written in either C (DDOT, MATMUL, and CBC) or C++ (SHALLOW and BSOM). The C and C++ versions were written and compiled according to what we call Java rules: (1) two-dimensional arrays are represented as vectors of pointers to one-dimensional arrays, and (2) any compiler optimizations that violate the Java IEEE 754 floating-point semantics (exact reproducibility of results) are disabled. The impact of these rules on the performance of C or C++ programs is beyond the scope of this paper. We discuss each of the benchmarks in more detail below.

DDOT. The DDOT benchmark computes the dot-product sumni=1 xiyi of two vectors x and y of n double-precision floating-point numbers. The Java code that implements DDOT is straightforward and shown in Figure 23. The C code is very similar. The computation of DDOT requires 2n floating-point loads and 2n floating-point operations (n multiplies and n adds). For our measurements we use n = 106. We report the performance of DDOT in Mflops.

Figure 23

MATMUL. The MATMUL benchmark computes the matrix operation C = C + A × B, where C is an m × p matrix, A is an m × n matrix, and B is an n × p matrix. The elements are double-precision floating-point numbers. The Java implementation of this matrix