|
Three goals of the Java** programming
language1
and execution environment are bit-for-bit
reproducibility of results on different platforms, safety of execution,
and ease of programming and testing. A crucial component of the Java
language specification for achieving the latter two goals is that a
program only be allowed to access objects via a valid pointer, and that
programs only be allowed to access elements of an array that are part
of the defined extent of the array. A naive implementation of this
specification component requires every access to every element of an
array to check the validity of all base pointers and indices involved
in the access.
Figure 1 shows a typical representation
of a four-dimensional array in Java. A naive checking of a
reference to element A[1][2][3][4], indicated in the figure, would
require four base pointer tests and four range tests: A, A[1],
A[1][2], and A[1][2][3] must all be tested as
valid pointers; 1, 2, 3, and 4 must all be tested as valid indices
along the corresponding axes of the array. If the entire array is
accessed, a total of 4N base pointer tests and 4N
range tests will be necessary, where N is the total number
of elements in the array. In this paper, we present a suite of
techniques to greatly reduce the number of run-time tests. In many
cases, the run-time tests can be completely eliminated.
Figure 1
The above-mentioned goal of bit-for-bit reproducibility of results,
combined with the ability to catch exceptions and examine the
state of the program at the time the exception was thrown, imply that
it is not sufficient to determine that an invalid reference occurred.
Rather, any optimizations of valid reference checking must cause all
exceptions that would have been thrown in the original program to still
be thrown. The exceptions must be thrown in precisely the same order,
with respect to the rest of the computation, dictated by the original
program semantics. The techniques we describe fulfill this requirement.
Our methods also handle loops that catch exceptions within their bodies
(i.e., with nested try blocks).
We note that the techniques and methods described here are not
restricted to Java. They can be applied to any language in which the
bounds of an array can be determined at run time. They could be used,
for example, in C and FORTRAN compilers
that want to provide an option to check array references.
The rest of this paper is organized as follows. The next section
presents an informal overview of our optimizations and is followed by
an introduction to the concept of safe bounds, used throughout the
paper. The fourth section presents the first of our optimization
techniques, the exact method. The succeeding sections present other
techniques, the general, compact, and restricted methods, respectively.
The eighth section discusses an inspector-executor variant of our
methods, to handle more complex cases. The ninth section
explains our optimizations in the context of multithreaded execution.
The tenth section presents some experimental results, followed by a
discussion of related work. Finally, we present our conclusions.
Overview of the optimizations
The main goal of our work is to develop techniques that minimize
the number of run-time tests that must be performed for array reference
operations in Java. An array A is defined by a lower bound
lo(A) and an upper bound up(A). We use a
Java-like notation for array declarations:
doubleA[] = new
double[lo(A):up(A)]
declares a one-dimensional array A of
doubles. Note that we allow an explicit
declaration of the lower bound lo(A) of array A.
In Java, the lower bound is always zero. Also, Java array declarations
specify the number of elements of the array. Therefore, the Java
declaration
double A[] = new double[n]
would correspond to the declaration
double A[] = new double[0:n - 1]
in our notation.
An array element reference is denoted by
A[ ], where
is an index into the array. For the reference to be valid,
A must not be null, and
must be within
the valid range of indices for A: lo(A),
lo(A) + 1, ... , up(A). If A
is null, then the reference causes a null pointer
violation. In Java, this corresponds to the throwing of a
NullPointerException. If A is a valid pointer
and is less than the
lower bound lo(A) of the array,
then the reference causes a lower bound violation. If
A is a valid pointer and is greater than the upper bound
up(A) of the array, then the reference causes an upper
bound violation. Java does not distinguish between lower and upper
bound violations. A bound violation (lower or upper) in Java causes an
ArrayIndexOutOfBoundsException to be thrown.
Multidimensional arrays are treated as arrays of arrays:
double M[][] = new
double[l1:u1]
[l2:u2]
declares an array M with lower bound
l1 and upper bound u1.
Each element of M is an array of
doubles, with lower bound
l2 and upper bound u2. M
is an example of a rectangular array. Rectangular arrays
have uncoupled bounds along each axis. As in Java, we also allow
ragged arrays, where the bounds for one axis depend on the
coordinate along another axis. A triangular array is an example of a
ragged array:
double T[][] = new
double[1:n][]
T[i] = new double[1:i],
i = 1,...,n
Even though ragged arrays are allowed and treated by our
techniques, we do have optimizations that take advantage of rectangular
arrays.
Arbitrary array lower bounds, the distinction between lower bound and
upper bound violations, and treatment for rectangular arrays are
features of our work that are not strictly necessary for Java
applications. Our motivation to include them is twofold: First, we want
our methods to be applicable to other programming languages, in
particular C, C++, and FORTRAN 90. Second,
we want to be prepared to handle extensions to Java that are being
considered to efficiently support numerical computing, as described in
Reference 2.
Array accesses typically occur in the body of loops. Our optimizations
operate on do-loops, which are for-loops of the form:
for (i = l;
i u;
i++){
| | B(i)
| | }
|
where i is the index variable of the loop, l
defines the lower bound of the loop, and u defines the upper
bound of the loop. The iteration space of the loop is defined by the
range of values that i takes: l, l +
1, ... , u. All loops have a unit step, and therefore a loop
with l > u has an empty iteration space.
B(i) is the body of the loop, which typically contains
references to the loop index variable i. As a shorthand, we
represent the above loop structure by the notation L(i, l, u,
B(i)). Note that loops with positive nonunity strides can be normalized by the transformation
| for (i = l;
i u;
i = i + s){
| | | B(i)
| | }
| |
| | becomes
| |
| | for (i = 0;
i
(u - l)/s
; i++){
| | | B(l + is)
| | }
| |
| | A loop with negative stride can be first transformed into a loop
with a positive stride:
| |
| | for (i = u; i
l; i = i
- s){
| | | B(i)
| | }
| |
| | becomes
| |
| | for (i = l;
i
u; i = i + s){
| | | B(u + l - i)
| | }
|
Loops are often nested within other loops. Standard control-flow
and data-flow techniques3 can be used
to recognize many for, while, and do-while loops, which occur in Java
and C, as do-loops. Many goto loops, occurring in C and
FORTRAN, can be recognized as do-loops as well.
Four different tests can be applied to an array reference
A[i]. A null test verifies whether A is a
null pointer. An lb test verifies whether i
lo(A), and a ub test verifies whether i
up(A). Finally, a test called all tests verifies
whether lo(A)
i
up(A). Tests for
bounds subsume null tests, since a null array can
be set to have an empty extent. Furthermore, we show in the next
section that for do-loops only one of three cases can occur for each
iteration and each array reference: (1) an lb test is required, (2) a
ub test is required, or (3) no test is required. In many situations it
is possible to precisely determine which case occurs. In other
situations, one has to be conservative and adopt a stronger test than
absolutely required. In particular, an all tests can be used to detect
any possible violations.
We introduce later a method for deriving the minimum set of tests
required at each iteration of a simple loop with no nesting. This
method will be referred to as the exact method. It provides
the general framework that is used by the subsequent methods, which
handle nested loops. The exact method specifies, for each iteration and
each array reference, which test of the three named above is required,
if any. A loop with array
references in its body is split at compile time into up to
3 consecutive
regions, each with a specialized version of the loop body. At run time,
no more than 2
+ 1 regions are actually executed.
This level of code replication is not practical in general. However,
one can reduce the number of code versions by merging together regions,
at the expense of performing superfluous tests. This is shown in the
section describing the general method. In practice, this does not
increase execution time. The regions where tests are required will
usually be found to be empty at run time so that the tests in these
regions are never normally executed. Reducing the number of distinct
regions will not only decrease code size, but may also decrease
execution time. A practical version of the exact method splits an
iteration space i = l, l + 1, ... , u into
three regions:
- A region of the iteration space where no tests are needed.
This region is defined by a safe lower bound
ls and a safe upper bound
us. The range of values of i in this
region is ls, ls + 1, ... ,
us.
- A region of the iteration space where the index variable has
values smaller than the safe lower bound
ls. For this region i = l, l +
1, ... , ls - 1.
- A region of the iteration space where the index variable
has values greater than the safe upper bound us
For this region i = us + 1, us +
2, ... , u.
The loop is split into three loops, each associated with a
different code version. As a simpler alternative, the number of code
versions can be further reduced to two: one with no tests, and one with
all tests enabled.
We extend this method to nested loops in the later section, "The
general method," recursively splitting each iteration space into
three regions. When applied to a perfect d-dimensional loop
nest, this method leads to 3d distinct loop
nests, each potentially executing a different code version. One can
reduce the number of distinct code versions by merging different
versions. In the extreme case, it is possible to use only two versions.
This method will be referred to as the general method.
In the sixth section we present a method that avoids this exponential
blow-up in static code size. This method is referred to as the
compact method. When applied to a perfect
d-dimensional loop nest, it results in 2d +
1 loop nests. Depending on the simplifications adopted, it can use
from 2 up to 2d + 1 different code versions.
Finally, in the seventh section we describe a tiling method that can be
applied to certain types of loop nests. It effectively implements the
compact method through generated code of a very simple form.
The method uses two versions for each loop body (no tests and all tests
enabled). This method is referred to as the restricted
method.
These four methods apply to Java as well as to
FORTRAN or C and work for all machine
architectures. Additional optimizations are possible in important
special cases. We briefly discuss two of these optimizations now: how
to determine a valid index with a single test and how to test for null
pointers. We do not discuss special optimization techniques any further
in this paper.
In the particular case of Java an all tests test can be implemented
with a single comparison. Because Java does not distinguish between
lower bound and upper bound violations, and because lo(A) =
0 always, it suffices to test if i < up(A)
+ 1 using unsigned arithmetic. A negative number appears, in
unsigned arithmetic, as a very large positive number which, by the Java
language specification, is always larger than the largest possible
array extent. This technique is used, for example, in the
IBM HPCJ (High-Performance Compiler for
Java) project.4
The optimization of checks for null pointer violations in
array references is a direct consequence of the optimization of checks
for bound violations, as discussed in the next section. There are also
many ad hoc solutions to the optimization of null
pointer violations. For machines with segmented memory architecture,
such as the IBM POWER2 Architecture*,
null pointers can be represented by a pointer to a protected
segment. For machines with linear, but paged, memory architecture, a
protected segment can be simulated with a region of protected pages.
Any attempts at access through a null pointer will
immediately cause a hardware exception that can then be translated into
the appropriate software exception. This implementation completely
eliminates the need for any explicit checks for null pointer
violations in the code. It is also applicable to all pointer
dereferences, not just array accesses. Despite the complications in
properly translating a hardware exception, this technique has been
successfully used, for example, in the
CACAO project.5
Computing safe bounds
The basic idea of our methods is to partition the iteration space
of a loop into regions. A region is a contiguous subset of
the iteration space. It is characterized by a collection of tests that
need to be performed for array references in that subset. We are
particularly interested in finding safe regions, where we
know there can be no violations in array references. If an array
reference is guaranteed not to generate a violation, explicit tests are
unnecessary.
We illustrate the computation of safe regions with a simple example. We
then proceed to formalize it to more general cases. Consider the simple loop:
|
| for (i = l; i
u; i++){
| | B(i)
| }
|
which we represent as L(i, l, u, B(i)).
B(i) is the
body of the loop and, in general, contains references to the loop index
variable i. The iteration space of this loop consists of the
range of values i = l, l + 1, ... , u.
Let there be a single array reference A[i] in the loop
body B(i). We denote the lower bound of A by
lo(A) and the upper bound of A by
up(A). Therefore, for the array reference A[i]
to be valid we must have lo(A)
i
up(A). If A is null, we define
lo(A) = 0 and up(A) = - 1. This guarantees that
A[i] is never valid if A is null. We
can split the iteration space of the loop into three regions
[1],
[2], and
[3], defined as follows:
[1] :
(l
i u)
^ (i < lo(A)) |
(1) |
[2] :
(l
i u) ^
(lo(A)
i
up(A)) |
(2) |
[3] :
(l
i u) ^
(i > up(A)) |
(3) |
Region [1] corresponds
to those iterations
of the loop for which the index i into array A is
too small. Therefore, an lb test is required before each array
reference in this region. No tests are required in region
[2], since the index
i falls within
the bounds of A. Finally, a ub test is required in region
[3], because the values
of i are too large to index A.
Using Equation 2, we can compute the lower and upper bounds
and
of the safe region:
=
max(l, lo(A)) |
(4) |
=
min(u, up(A)) |
(5) |
[2] :
i = ,...,
 |
(6) |
Similarly, we use Equation 1 and Equation 3 to compute the lower
and upper bounds of regions
[1] and
[3], respectively:
[1] :
i = l,..., min(u + 1, lo(A)) - 1 |
(7) |
[3] :
i = max(l - 1, up(A)) + 1,..., u |
(8) |
Note that min(u + 1, lo(A)) =
, except when lo(A)
< l or lo(A) > u + 1.
In the former case, [1] is
empty; in the latter case,
[2] is empty. To
handle these cases, and the symmetric upper bound cases, we redefine
= min(u + 1,
max(l, lo(A))) |
(9) |
= max(l - 1,
min(u, up(A))) |
(10) |
We can now express the bounds of each of the regions just in terms
of l, u, ,
and :
[1] : i = l,...,
- 1 |
(11) |
[2] :
i = ,...,
 |
(12) |
[3] :
i = + 1,...,
u |
(13) |
Equations 11-13 define the same regions as Equations 6-8. The
values of region bounds are different only for empty regions, but they
are still empty.
In Figure 2 we illustrate the values of
and
for different relative
positions of iteration bounds and array bounds.
Figure 2A has empty regions
[1] and
[3], whereas region
[2] comprises the entire
iteration space. Region
[1] is empty in
Figure 2B, whereas region
[3] is empty in
Figure 2C. All three regions are
nonempty in Figure 2D. Regions
[2] and
[3] are empty in
Figure 2E, because all values of
i fall below the lower bound of A. Finally,
regions [1] and
[2] are
empty in Figure 2F, since all values of i
fall above the upper bound of A.
Figure 2
In the general case, we have an array reference of the form
A[f(i)] in the body B(i) of a loop.
Depending on the behavior of f(i), we can compute safe
bounds and
that partition the
iteration space into three regions, as defined by Equations 11-13. Region
[2] is safe, and no
test is defined in it. We
define tests [1] and
[3] that are sufficient for regions
[1] and
[3],
respectively. Expressions for
,
, and
for various forms of f(i) are described in Appendix A. The
safe bounds are computed so that we always have
- 1 and
+ 1. These
properties are important for the correct functioning of our methods.
An exact method
In this section we consider the case of a simple (depth
one) loop, with multiple references. The exact method
described here performs only the tests strictly necessary (as specified
in Appendix A) to detect the array reference violations that occur
during the execution of a loop. It is, in general, of limited
practicality because up to
3 versions of the
loop body must be instantiated in the code, where
is the number of array
references in the loop body. In some situations, when enough
information is available to the compiler, it is possible to generate
efficient code using this technique. Our main interest in studying this
method is that the other, more practical, transformations are derived
from this technique.
In this section we treat references of the form
A[f(i)] that allow the computation of safe bounds
as described in the previous section. The eighth section discusses how
to handle more complex subscripts. We first describe how the method
generates optimized code, and then we illustrate the method with an
example.
Code generation. The method works as follows. Let L(i,
l, u, B(i)) be a loop on index variable i and body
B(i). Let there be
references of the form
Aj[fj(i)], j = 1, ... ,
, in body B(i).
Each array reference
Aj[fj(i)] partitions the iteration
space into three (each possibly empty) regions:
j[1] :
i = l,...,
j - 1
j[2] :
i =
j,...,
j
j[3] :
i =
j + 1,..., u
as defined by Equations 11-13. We call these regions defined by a
single array reference simple regions. Also, each region
j[k] has a test
j[k] associated with it, which describes
what kind of test has to be performed for reference
Aj[fj(i)] in that region. The test
j[2] for region
j[2] always specifies no
test, because
j[2] is a safe region with respect to this reference.
The tests
j[1] and
j[3],
for regions
j[1] and
j[3], respectively, can be
any one of ub test (an upper bound test), lb test (a lower bound test),
or all tests (both a lower and an upper bound test). The exact choice
depends on characteristics of the array reference
Aj[fj(i)]. This issue is discussed
in detail in Appendix A. For each region
j[k] we denote its lower bound by
j[k].l and its upper bound by
j[k].u.
Two references Aj[fj(i)] and
Ak[fk(i)] can be thought of as
partitioning the iteration space into five regions defined by the four
points
j,
j,
k, and
k. We
illustrate this in Figure 3 for two
references, A1[f1(i)] and
A2[f2(i)]. Note that some of the
resulting regions may be empty, in general. The resulting regions are a
subset of the 3 · 3 = 9 regions formed by all combinations
of intersections of two simple regions, one from each reference.
Figure 3
In general, given the references
Aj[fj(i)], we can create a vector
of all 3 regions formed
by intersecting the simple regions from different array references:
[x1·x2·...·x
] =
|
¹[x1]
²[x2]
...
[x ] | (14) |
Each index xk is 1, 2, or 3. The lower
bound of a region
[x1 · x2 · ... ·
x ] is the
maximum of the lower bounds of the forming simple regions.
Correspondingly, its upper bound is the minimum of the upper bounds of
the forming simple regions:
[x1 · x2 · ... ·
x ].l =
| | max
( ¹
[x1].l,
²
[x2].l,...,
[x ].l) |
(15) |
[x1 · x2 · ... ·
x ].u =
| | min( ¹
[x1].u,
²
[x2].u,...,
[x ].u) |
(16) |
Finally, the tests that characterize region
[x1 · x2
· ... · x
] can be described by:
[x1·x2·...·
x ] =
{ j
[xj], j = 1,...,
} |
(17) |
That is, the combination of the tests for each simple region
j
[xj ].
Using this partitioning of the iteration space into
3 regions, the loop
for (i = l; i
u;
i++){
B(i)
} |
(18) |
can be transformed into 3
loops, each one implementing one of the regions
[x1 · x2
· ... · x ].
The body of each region can be specialized to perform exactly the tests described by
[x1 · x2 · ...
· x ].
We use Bx1x2...
x (i)
to denote the body B(i) specialized with the tests described
by
[x1 · x2 ·
... · x ]:
Note that the order of the resulting loops is important,
although there are many correct orders. The requirement is that, for
any value of j, region
[x1 · ... ·
x - 1 · 1 · xj + 1 ·
... · x ]
has to precede
[x1 · ... ·
xj - 1 · 2 · xj + 1 ·
... · x ]
which in turn has to precede
[x1 · ... ·
xj - 1 · 3 · xj + 1 ·
... · x ].
Out of the 3 possible
regions formed by the intersection of the simple regions, no more than 2 + 1 are nonempty and are actually executed at run time. Which of the
3 regions
are nonempty depends on the relative positions of the
2 safe bounds
¹,...,
,
¹,...,
. Let
p1,...,
p2 be the list
obtained by sorting ¹,
...,
,
¹
+ 1,...,
+ 1 in ascending
order. Define p0 = l and p
2 + 1 = u + 1.
The safe bounds ¹,...,
,
¹, ... ,
partition the iteration
space into 2 + 1 (each possibly
empty) regions [k],
k = 0, 1,...,
2 defined by
[k] : i =
pk,...,pk + 1 - 1 |
(20) |
Each region [k]
corresponds to a region
[x1 · x2 · ... ·
x ]
defined by
| xj = | { |
1 if j
pk + 1
3 if j < pk
2 if (
j < pk + 1) ^
( j
pk) |
(21) |
(It can occur that both
j
pk + 1 and
j < pk. In that
case, region [k] is
necessarily empty, and the choice of xj is
irrelevant.) Let M(k) be the function that defines the
correspondence
[M(k)]
[k] for k =
0,..., 2 . Then the loop
for (i = l; i
u; i++){
B(i)
} |
(22) |
can be transformed to
for (k = 0;
k
2 ; k++){
for (i = pk;
i < pk + 1; i++){
BM(k)(i)
} |
(23) |
If the order of the safe bounds
¹,...,
,
¹,...,
is known at
compile time, the
mapping function M(k) can also be determined. In this case,
only the loop body versions that are actually executed need to be
generated. In general, the order of the safe bounds is not known, and
BM(k)(i) has to be selected at run time from the
set of all possible 3
body versions.
An example. In the interest of conciseness, we use a very
simple code fragment to illustrate this method.
Figure 4A illustrates the original loop to be
transformed. It has two array references, A1[i
- 2] and A2[i + 2], each
generating three simple regions:
¹[1:3] and
²[1:3], respectively.
The straightforwardly transformed code, using the exact method, is shown in
Figure 4B. We note that:
¹ = |
min(u + 1, max(l, 3)) |
(24) |
² = |
min(u + 1, max(l, - 1)) |
(25) |
¹ = |
max(l - 1, min(u, n + 2)) |
(26) |
² = |
max(l - 1, min(u, n - 2)) |
(27) |
¹[1] =
²[1] = |
lb test |
(28) |
¹[3] =
²[3] = |
ub test |
(29) |
which does not provide us with enough information to order the
safe bounds ¹,
²,
¹,
and ².
We implement the code in Figure 4B according to Equation 19. We could have used the method in Equation
23, but it would still have required all
3
body versions to be generated.
Figure 4
Now let n = 10, as shown in
Figure 5A.
In this case, the expressions for the safe bounds become:
¹ =
min(u + 1, max(l, 3)) |
(30) |
² =
min(u + 1, max(l, - 1)) |
(31) |
¹ =
max(l - 1, min(u, 12)) |
(32) |
² =
max(l - 1, min(u, 8)) |
(33) |
and we can order ²
¹
²
¹. In particular, this is the
ordering shown in Figure 3. We only need to generate code for five loops, as shown in
Figure 5B. In some cases, a compiler with
appropriate symbolic analysis can even eliminate some of these five
loops. For example, if the compiler could prove that l >
2, neither of the first two loops (body versions
B11(i) and B12(i)) would
be necessary.
Figure 5
Summary of the exact method. The exact method transforms code
so that, for each iteration of the original loop, only those tests that
cannot be shown to be unnecessary are performed. In the straightforward
application of the method to a loop with
array references in its
body, 3 new loops are generated, each with a slightly different body. No more than
2 + 1 of these loops are actually
executed at run time. In some situations, compile-time analysis can
show that some of these loops implement empty regions of the iteration
space and, therefore, can be discarded.
The general method
The general method works by partitioning the iteration space of a
loop always into three regions, independent of the number of array
references in the body of the loop. One of the regions is a safe
region: no array reference in that region can cause a violation.
Another region precedes the safe region, in iteration order. Finally,
the third region succeeds the safe region, in iteration order. This
method can be applied to each and every loop of an arbitrary loop nest
individually. The general method does not specialize the tests as much
as the exact method, but it does identify the same safe regions that
can be executed without any tests.
Transforming a single loop. Consider the loop L(i, l, u,
B(i)), with references of the form
Aj[fj(i)] in its body. Using the
concepts developed in the previous two sections, we can compute its
safe region as the intersection of all simple safe regions. This safe
region is defined by the range of values of i =
ls, ls + 1, ... , us
where:
ls=max( ¹, ²,...,  ) |
(34) |
us=max(ls-1, min( ¹, ²,...,  )) |
(35) |
The ls - 1 term in the expression for
us is necessary to handle cases where the safe
region is empty. We can then partition the iteration space of the loop
into three (each possibly empty) regions:
[1] : i = l,..., ls-1 |
(36) |
[2] : i = ls,..., us |
(37) |
[3] : i = us+1,..., u |
(38) |
This partitioning is not very different from our very first
example in the third section, except that now it applies to a loop with
an arbitrary number of array references in its body. We illustrate this
for two references:
A1[f1(i)] and
A2[f2(i)] in
Figure 6. Region [2] is
the intersection of the safe simple regions
¹[2] and
²[2]. Correspondingly, region
[1] is the union of the unsafe simple regions
¹[1] and
²[1]. Region
[3] is the union of the unsafe simple regions
¹[3] and
²[3]. With reference to
Figure 3,
[1] is formed by merging all regions preceding
[2 · 2], and region
[3] is formed by merging all regions succeeding
region [2 · 2].
Figure 6
Region [2] is the safe region, and its body can
be implemented without any tests. We denote this version of the loop
body by B(notest(i)). Conversely, the bodies of
regions [1] and [3] need at
least some tests. For simplicity, we can implement both with a version
B(test(i)) of body B(i). This version
performs an all tests test before each and every array reference.
The implementation of the general method consists of the
transformation:
for(i = l; i u; i++){
B(i)
} |
becomes
for (i = l; i ls-1; i++){
B(test(i))
}
for (i = ls; i us; i++){
B(notest(i))
}
for (i = us+1; i u; i++){
B(test(i))
} |
(39) |
or, in shorthand:
L(i, l, u, B(i))  |
{ | L1(i, l, ls-1, B(test(i)))
L2(i, ls, us, B(notest(i)))
L3(i, us+1, u, B(test(i))) |
(40) |
The code expansion in this case is only twofold, since the same
code for B(test(i)) can be used twice. Methods
to realize all three resulting loops using only two instances of
B(i) are discussed in Appendix B.
Recursive application of the transformation. If the body
B of a loop contains other loops, then the same
transformation can be applied to each of the loops in B. The
transformation can be applied individually to each and every loop in a
general loop nest. In fact, the final result is independent of the
order in which the individual loops are transformed.
Consider the case of the two-dimensional loop nest L(i,
li, ui, L'(j, lj, uj,
B(i, j))). By first applying the transformation to the
L loop, we generate a region without any tests on array
references indexed by i:
L(i, li, ui, L'(j, lj, uj, B(i, j)))
becomes
L1(i, li, lis-1, L'(j, lj, uj, B(test(i), j)))
L2(i, lis, uis, L'(j, lj, uj, B(notest(i), j)))
L3(i, uis+1, ui, L'(j, lj, uj, B(test(i), j)))
We can now apply the transformation to each of the three
L' loops. This will generate two regions without any tests
on array references indexed by j and one region without any
tests on array references indexed by either i or
j as seen in Figure 7. This
transformation partitions the iteration space into nine regions, and
requires four different versions of the loop body B(i).
Figure 7
An example of the general method. For simplicity, we give an
example of single-threaded code with rectangular arrays. In the ninth
section is a discussion of the application of these optimizations to
multithreaded code. The program of
Figure 8 will be used in this example. It
implements a step of a two-dimensional Jacobi
relaxation.6 We chose it because it is a well-known
operation that illustrates a loop nest with various array references in
its body. Also, the array references all use slightly different
indices, making our example more interesting.
Figure 8
Figure 9 shows the loop
implemented with naive tests. The statement
ifAccessViolation(A,i) throwException
throws the appropriate exception if A is a
null pointer, i is below the lower bound of
A, or i is above the upper bound of
A. In the actual executable code, the tests would be
interweaved with the individual array references, and the order of
tests would be slightly different, but this example gives an idea of
the cost of naive testing. Figure 10
shows the loop nest optimized for testing using the general method. In
each of the resulting loops, we list the violations that can occur in
its body. The code with explicit tests for the different versions of
the loop body are shown in Figure 11.
The minimal tests needed for the i1 and
i3 loops differ in that the
i1 loop only needs to test for lower bounds
violations, and the i3 loop only needs to test
for upper bounds violations. By using the transformation of Equation
39, the version for both loops can be the same, as indicated
in Figure 10.
Figure 9
Figure 10
Figure 11
Transforming the loops. The optimized code is the result of
the following transformations. First, the outer i loop is
split into three loops (i1, i2, and
i3), as shown in Figure 10. The middle loop,
i2, corresponds to those iterations of the
original i loop that cannot cause bound violations on array
references indexed by i. Within this loop, these array
references need not be tested. This is the safe region for
loop i.
The first loop, i1, executes those iterations of
the original i loop whose values of i precede the
safe region. Within this loop, all array references indexed by
i need to be tested. Because the subscript expressions on
i are monotonically increasing across the iteration space of
the i loop, only lower bound violations can occur in the
i1 region.
The third loop, i3, executes those iterations of
the original i loop whose values of i succeed the
safe region. Again, within this loop, all array references indexed by
i need to be tested. Because the subscript expressions on
i are monotonically increasing across the iteration space of
the i loop, only upper bound violations can occur in the
i3 region.
Within each of the resulting loops i1,
i2, and i3 the j
loop is similarly split. For example, the body of resulting loop
j1,3 executes all iterations of the nest that
attempt to reference elements of a or b that are
below the corresponding lower bound, and elements of b[i],
a[i], a[i + 1], and a[i - 1] that are
above the corresponding upper bound. In a typical correct numerical
code, where all references are in bounds, only the body of loop
j2,2 will execute. This body is represented by
B(notest(i), notest(j)), and because
it has no tests, it executes faster.
Computing the bounds of the split loops. To compute the bounds
for the resulting i1, i2, and
i3 loops, we first have to compute the safe
bounds for loop i: lis and
uis. In the body of the i loop
there are five array references indexed by i: b[i], a[i]
(appearing twice), a[i + 1], and a[i -
1]. The values of
ik and
ik
are shown in
Figure 12A. They are computed using Equation
64 of Appendix A. The values of lis and
uis can be computed according to Equations
34 and 35:
lis = max( i1, i2, i3, i4, i5)
uis = max(lis-1, min( i1, i2, i3, i4, i5))
The bounds for the resulting j loops are computed
similarly. In the body of the j loop there are five
array references indexed by j: (a[i])[j + 1],
(a[i])[j - 1], (a[i + 1])[j], (a[i -
1])[j], and (b[i])[j]. Note that b[i],
a[i], a[i + 1], and a[i - 1] are the
arrays being accessed. The values of ljs and
ujs can be computed by Equations 34 and 35,
using values of
jk and
jk
obtained through
Equation 64 and shown in Figure 12B. In this example, the
arrays are
rectangular, and the lower and upper bounds for a[i], a[i
+ 1], a[i - 1], and b[i] do not depend on
the value of i. (Note that, in general,
ljs and ujs would
be functions of i.)
Figure 12
Checks that must still be done. The transformation just
discussed creates regions of the loop that are free from violations on
array references indexed by either i, j, or both.
Figure 10
lists the specific violations that can occur in each region. Note that
this particular behavior is specific to this loop. We chose to
implement the loop using versions of the body that perform tests
covering more than the strictly necessary violations. This allowed us
to use only four versions of the loop body, as shown in
Figure 10 and
detailed in Figure 11. We perform an all tests
test on any i
reference when outside the safe region for i.
Correspondingly, we perform an all tests on any j reference
when outside the safe region for j.
Summary of the general method. The exact method of the
previous section formed a different region of the iteration space of a
loop for each possible combination of necessary array reference tests.
The general method always partitions the iteration space of a loop into
three regions: (1) a region for those iterations that need no test (the
safe region), (2) a region for those iterations that occur prior to the
safe region, and (3) a region for those iterations that occur after the
safe region. In each of the nonsafe loop regions, any test that might
be needed for a reference in any iteration of that region is performed
in all iterations of that region. This means that some additional tests
might be executed compared to the exact method. For a nest of loops,
the general method can be applied to each and every loop recursively
and independently. Thus, for a loop nest of depth d,
3d loop regions are created.
When generating code for the regions, several approaches can be taken.
One approach generates a different version of the loop body for each
region. When applied to the example of
Figure 10, this would generate
only lower bound tests on arrays indexed by i in the
i1 loop and only upper bound tests in the
i3 loop. This approach leads to
3d versions of the loop body for a loop nest of
depth d. A second approach, actually used in our example of
Figures 10 and 11,
uses only two versions of the loop body for each
loop index: one with no tests on that index and one with all tests.
This leads to 2d versions of the loop body for a
loop nest of depth d. A third approach would be to generate
only two versions of the loop body, independent of the number of nested
loops: one with all tests on all indices and one with no tests on any
index. Obviously, the version with no tests can only be used in that
region where no violations in any array reference can occur.
The compact method
The exponential expansion on the number of loops caused by the
application of the general method can be highly undesirable in some
cases. In this section we propose an alternative method that results in
only a linear increase in the number of regions. The difference resides
in how the transformation is applied to loop nests. In the compact
method, the partitioning into three regions is always applied in
outermost to innermost loop order. Furthermore, it is only applied to
inner loops in the untested version of an outer loop body.
Again, consider the case of the two-dimensional loop nest L(i,
li, ui, L'(j, lj, uj,
B(i, j))). By first applying the transformation to the outer
i loop, we generate a region without tests on array
references indexed by i:
L(i, li, ui, L'(j, lj, uj, B(i, j)))
becomes
L1(i, li, lis-1, L'(j, lj, uj, B(test(i), j)))
L2(i, lis, uis, L'(j, lj, uj, B(notest(i), j)))
L3(i, uis+1, ui, L'(j, lj, uj, B(test(i), j))) |
(41) |
We now apply the transformation to the instance of L'
in the L2 region. This results in a region
without any tests on array references indexed by either i or
j as seen in Figure 13.
Figure 13
This transformation partitions the two-dimensional iteration space into
five regions, and requires three different versions of the loop body
B(i). It still generates the same region safe on
i and j as the general method.
An example of the compact method. As an example, we apply the
compact method to the two-dimensional loop nest of
Figure 8. The
resulting code is shown in Figure 14.
Note that only loop i2 has its j loop
partitioned into three regions. When applied to a
d-dimensional loop nest, the compact method generates
2d + 1 regions of the loop iteration space. Because the
two regions preceding and succeeding a safe region are similar, only
d + 1 versions of the loop body must be generated.
Figure 14
Summary of the compact method. Like the general method of the
previous section, the compact method divides each loop into one safe
and two nonsafe regions. The differences between the two methods are
manifested only when applying the transformation to a loop nest. The
general method splits all nested loops into three regions, whereas the
compact method splits only loops nested within the version without
tests. This implies that within the nonsafe versions of an outer loop
the compact method may perform more tests than the general method. The
benefit is that only a linear number of loop regions (2d +
1 for a loop nest of depth d) are necessary with the
compact method, as opposed to an exponential number with the general
method. The number of versions of the loop needed is a linear function
of the nesting depth of the loop (d + 1 for a loop nest
of depth d).
The number of versions of code can be further reduced to two by using a
fully tested version on any region that needs tests. Also, if the loop
nesting is perfect, the structure of the generated code can be greatly
simplified when only two versions are used. These observations provide
the motivation for the restricted method of the next section.
The restricted method
We now consider a transformation that can be applied to perfect
loop nests, or any loop nest that can be transformed to a perfect loop
nest via compiler techniques. The restricted method partitions the loop
into multiple regions. Each region executes either a test
(all array references are fully tested) or a notest (no
array references are tested) version of the loop body. Having a perfect
loop nest allows us to map every iteration of the loop nest onto a
single point in a d-dimensional space, where d is
the depth of the loop nest. Partitioning the loop into test
and notest regions is equivalent to tiling this iteration
space.
The advantage of the restricted method, when applied to a perfect
d-dimensional loop nest, is that it can be implemented with
only one instance of each version of the loop body. Also, the instances
can be generated in place in a fixed loop structure. The previous
methods, when implemented with only two versions of code, require a
specialized loop structure that invokes those versions from more than
one point in the program.
We start by considering a d-dimensional rectangular loop
nest:
| L1(i1, li1, ui1, L2(i2, li2, ui2,...,
| |
Ld(id, lid, uid, B(i1, i2,..., id))...))
| (42) |
That is, the bounds lik and
uik of loop ik do
not depend on the values of i1, ... ,
ik-1. The iteration space of this loop can be
partitioned into consecutive regions described by a vector
[1:u ]. Each entry in
this vector has the form:
[ ] = |
(l( ), u( ), ) |
(43) |
l( ) = |
(li1( ), li2( ),..., lid( )) |
(44) |
u( ) = |
(ui1( ), ui2( ),..., uid( )) |
(45) |
= |
{test|notest} |
(46) |
The vectors l and u define the lower and
upper bounds for each loop in the region, respectively. The elements
lij( ) and
uij denote the lower and upper bounds,
respectively, of loop ij in the region
[ ]. The definition of a region also contains a
flag that has the value test if the region requires any
array reference to be tested and notest otherwise.
With partitioning, the original loop can be executed by a driver loop
that iterates over the regions:
for ( =1;  u ; ++){
L1(i1, li1( ), ui1( ), L2(i2, li2( ), ui2( ),...,
Ld(id, lid( ), uid( ), B(i1, i2,..., id))...)).
}
Note that in previous sections single-dimensional regions were
described by a vector containing the upper and lower bounds for the
regions. Here a multidimensional region is described by vectors with
upper and lower bounds for every index variable in the loop
nest. The techniques of the previous sections formed regions loop by
loop, whereas the techniques of this section form regions for the
entire loop nest.
Our goal is to partition the iteration space into regions that we can
identify as either requiring tests on array references or not. Those
regions that do not require tests can then be executed with a
notest version of the loop body. We use the same kind of
partitioning described in the compact method. First, an outer loop is
divided into three regions. Then, the inner loop in the region without
tests is divided again. This partitioning is illustrated in
Figure 15, for a two-dimensional iteration
space.
Figure 15
Computing vector .
Vector can
be computed by procedure regions in
Figure 16. Procedure regions takes
seven parameters. The first five are input parameters and describe the
loop nest being optimized. They are:
- The index j indicating that region extents along
index variable ij are being computed
- The vector (
1,
2, ... , j-1), where
k is the lower bound for loop index
ik in the regions to be computed
- The vector (
1,
2, ... , j-1), where
k is the upper bound for loop index
ik in the regions to be computed
- The dimensionality d of the loop nest
- The vector
[1:d], where
[k] = (lik,
uik, liks,
uiks) contains the full and safe
bounds for loop ik
Figure 16
The next two parameters are the output of the procedure. The first
output parameter is the vector of regions, ,
described in Equations 43 through 46, for the loop. The second is
u , the count of those regions. Note that
u is also used as an input, which gets
incremented each time a new region is created.
An invocation of procedure regions for a given value of
j partitions the loop nest formed by loops
ij, ij+1, ... , id.
It is only performed for safe values of i1,
i2, ... , ij-1. Procedure
regions partitions loop ij into three
parts. The first part consists of iterations of
ij that precede the safe region of
ij. The inner loops of ij
are not partitioned. These regions necessarily require testing of the
array references, since they are unsafe on references indexed by
ij. Regions corresponding to this part of the
iteration space are computed in statements S1 and
S2, and correspond to the regions [1]
(for j = 1), [2],
[5], [8], and
[11] (for j = 2) in the
two-dimensional iteration space of Figure 15. The third part consists
of the iterations of ij that succeed the safe
region of ij. Again, the inner loops are not
partitioned, and testing is required. Regions corresponding to this
part of the iteration space are computed in statements S11
and S12, and correspond to the regions
[4], [7],
[10], [13] (for
j = 2), and [14] (for
j = 1) in Figure 15. The middle (second) part consists
of the iterations of ij that are within its safe
region. If ij is the innermost loop, this is
computed by statements S4 and S5, and the
partitioning of loop ij is complete. No tests
are required. If ij is not the innermost loop,
the partitioning is applied recursively to loop
ij+1 for each iteration of
ij in its safe region, as shown in lines
S7 and S8.
To compute the entire vector of regions, the invocation
regions(1, (), (), d,
B, R,
u = 0) should be performed. At the end of the
computation, vector contains the description of
the regions, and the value of u is the total
number of regions in .
Although the example in Figure 15 and the notation imply that the
iteration space passed to regions in B is
rectangular, the algorithm is not restricted to rectangular loop nests.
In particular, if the expressions for lij,
uij, lijs, and
uijs are functions of
ik, 1 k < j, the computation
performed by regions is correct.
We can optimize the execution of the loop nest by using two versions of
code inside the driver loop: (1) a version
B(notest(i1),
notest(i2), ... ,
notest(id)) that does not perform any of
the array reference tests, and (2) a version
B(test(i1),
test(i2), ... ,
test(id)) that performs tests on all array
references. Version 1 is used only for regions where
[ ]. is notest, while version 2
is used for all other regions. This corresponds to forcing all regions
with any potential reference violation to perform
all reference tests. The optimized loop nest can be
implemented by the following construct:
for ( =1;  u ; ++){
| | if ( [ ]. ==test){
| | | L1(i1, li1( ), ui1( ),
| | | | L2(i2, li2( ), ui2( ),
| | | | ... ,
| | | | | Ld(id, lid( ), uid( ), B(test(i1),
| | | | | | test(i2),..., test(id)
| | | | | ...
| | | | )
| | | )
| | }else{
| | | L1(i1, li1( ), ui1( ),
| | | | L2(i2, li2( ), ui2( ),
| | | | | ... ,
| | | | | Ld(id, lid( ), uid( ), B(notest(i1),
| | | | | | notest(i2),..., notest(id)))
| | | | | ...
| | | | )
| | | )
| | }
| | } | (47) |
An important optimization. If, for a particular value of
j, lijs = lij
and uijs = uij,
then the safe region along axis ij of the
iteration space corresponds to the entire extent of the axis. If
liks = lik and
uiks = uik for
k = j, ... , d, then axis
ij-1 can be partitioned into only three
regions: (1) one region from lj-1 to
lij-1s - 1, (2) one region
from lij-1s to
uij-1s, and (3) one region from
uij-1s + 1 to
uij-1. Each of these regions spans the
entire iteration space along axes ij,
ij+1, ... , id. This situation for a
two-dimensional iteration space is illustrated in
Figure 17. Note that the new partitioning
results in only three regions. Collapsing multiple regions into a
single region reduces the cost of computing the regions, the total
number of iterations in the driver loop (47), and, consequently,
reduces the run-time overhead of the restricted method. To incorporate
this optimization in the computation of regions, procedure
regions is modified as shown in Figure 37 in Appendix F.
Figure 17
An example of the restricted method. We apply the restricted
method to the two-dimensional loop nest of
Figure 8. The logical
structure of the generated code is the same as that generated by the
compact method in Figure 14. The actual implementation with the driver
loop and the two versions of code is shown in
Figure 18. The driver loop (with index ) of
Figure 18 executes u iterations, where
u is the number of regions computed by
procedure regions. On each iteration, either the version of
the loop with run-time tests or the version with no run-time tests is
executed. The code in Figure 18 instantiates the regions dynamically.
This contrasts with the static instantiation of Figure 14.
Figure 18
Summary of the restricted method. The restricted method works
on perfect loop nests. It partitions the iteration space into
multidimensional regions and generates two versions of the loop body.
Each region is executed with either the test version of the
body or the notest version. Only regions that are free from
any possible access violation can use the notest version.
Iterative computation of loop bounds
In the third section and in Appendix A we discuss how to partition
an iteration space into regions for a variety of common forms of
subscript functions. In this section, we discuss how regions-based
testing optimization can often be performed even when the subscript
expressions are not one of the forms discussed in those sections.
The technique is similar to that of the inspector/
executor method for parallelizing loops.7 We
decompose the computation and execution of the regions into an
inspector phase and an executor phase. The
inspector phase examines the references within the iteration
space of a loop and computes a list of iteration subspaces (regions).
Some regions need run-time tests, whereas others do not. The
inspector phase is analogous to procedure regions
of the previous section. The executor phase traverses the
list of iteration subspaces and executes them using different versions
of the loop body. Thus the executor phase corresponds to the
driver loop of the last section.
Construction of the inspector phase. We show how
to construct an inspector for a singly nested loop. This method can
then be recursively applied to a loop nest. Let L(i, l, u,
B(i)) be a loop on i, where B(i) contains a
series of references
A1[ 1],
A2[ 2], ... ,
A [ ], where
j is a function defined in the iteration
space of L. A reference can be of the form
Aj[Ak[ k]], where
Ak is also an array. In general,
Ak[ k] may be present as a term
in j. We label the array references so that
Aj[ j] executes before
Aj+1[ j+1].
The inspector constructed takes the form shown in
Figure 19A. The first argument to procedure
regions is the output , a vector of
regions. An element [ ] of the region vector
consists of:
[ ]=(l( ), u( ), )
where l( ) and u( ) are the bounds of
region [ ], and is its test flag. The next
argument, u , is also an output: the number of
regions in vector . The other arguments for
procedure regions are the input of the procedure. They are:
(1) the lower and upper bounds, l and u, of the
loop, and (2) the list of array references
(A1[ 1],
A2[ 2], ... ,
A [ ]).
Figure 19
The inspector enumerates all iterations from the loop, by executing a
for (i = l; i u; i++) loop. For each iteration i, it checks all array references
Aj[ j] and determines if a test is necessary. If the evaluation of j itself causes a violation, we define j = - . If
j is invariant with respect to i,
optimizations can be performed to reduce the number of evaluations in
the inspector. The inspector marks in a flag check if a test
is required for iteration i. If check is the same
as oldcheck, this iteration i belongs to the same
region as the previous iterations, and we update the upper bound of the
current region. Otherwise, it is the first iteration of a new region.
Construction of the executor phase. The
executor, shown in Figure 19B, consists of a driver loop
(indicated by an index variable ), that iterates over the regions.
Each region is executed with one of the two different versions of the
loop body, one with tests and one without.
Optimizations and alternate constructions. The
inspector/executor technique can be applied to each loop in a loop nest
independently, as we did for the general method discussed earlier.
Alternatively, it can be directly applied to a d-dimensional
perfect loop nest. In this case, the inspector has to enumerate all
iterations in the d-dimensional iteration space, and the
executor consists of a driver loop around different versions of the
loop nest.
For the inspector/executor method to be effective, the inspector phase
has to be hoisted out of a loop nest. When that is possible, the same
vector of regions
[1:u ] can be used in
multiple instances of the executor. The number of checks is reduced by
the number of iterations in the loops from which the inspector was
hoisted. We show such an example in the following subsection.
Finally, it is also possible to build an inspector/executor
pair that uses more refined versions of the loop body. For example, one
could use all 3 versions generated in the exact method.
An example. In this example, the loop of
Figure 20A is transformed. The inspector for
this loop is shown in Figure 20B and the executor in
Figure 20C. Note
that we have optimized the loop by moving the inspector
phase out of the j loop. This is possible because the array
reference patterns are not dependent on j. Thus, even though
there are O(uiuj) computations being
performed, only O(ui) tests are necessary.
Figure 20
Summary of inspector-based methods. The inspector-based
methods differ from the exact, general, compact, and restricted methods
in how regions are formed. In the former methods, regions are computed
analytically using the formulas of the third section and Appendix A. In
the inspector-based methods the regions are computed by executing the
code that generates the subscripts for the references being optimized.
This allows references whose subscripts preclude an analytic
computation of regions to be optimized. There are two caveats, however.
First, because the inspector overhead is proportional within a constant
factor to the cost of executing an instance of the loop, the methods
are most effective when the inspector for a loop can be hoisted out of
that loop. Second, there are still some loops which cannot be optimized
with inspector-based methods. Figure 21
shows such a loop. Because the execution of an inspector for this loop
would have side effects (other than region computation) that live
beyond the life of the inspector, forming an inspector with the methods
we have discussed would alter the outcome of the program. This loop can
be optimized using speculative methods. Because they are not
region-based, they are beyond the scope of this paper. They are
discussed in Reference 8.
Figure 21
Multithreading considerations
Up to now, the problem of other threads of execution and optimized
code being active at the same time has been ignored. If arrays had
static shape, that is, if it were not possible for an array to change
shape during the course of execution of a routine, multithreading would
not be a problem. Multithreading is beyond the scope of the current
FORTRAN, C, and C++ language definitions.
Java, however, has made multithreading an integral part of the
language. Furthermore, it is possible for one thread of a Java program
to alter the shape of a multidimensional array that is being accessed
by other threads.
The Java memory model enables a solution to this problem. The coherent
copy of a shared variable resides in main memory. A local
copy of any shared variable can be copied by a thread from main memory
to working memory. The working memory is accessible only by
the thread, and therefore is not alterable by any other thread. Java
requires that:
- No synchronization can occur within a thread between time
tm, when a variable is copied from main to local
memory, and time ta, when the variable is
accessed by the thread. This can be enforced by refreshing the shared
variable in local memory after a synchronization and before the access.
- Before a synchronization within a thread is finished, all
shared variables that have been modified since the last synchronization
point in that thread must be written back to the main memory.
- If a shared variable has been written by a thread, it must be
written to main memory before its value can be read from main memory by
some thread. This prevents locally written values to shared variables
from being lost.
- As long as item 3 is obeyed, the value of a shared variable can
be updated from global memory at any time prior to an access by a
thread.
The consequences of the Java memory model rules to our work are
threefold. First, it is legal and valid within a thread to cache in
local memory the values of a variableeven if the thread is not
synchronized. Second, if the thread is not synchronized, the cached
values need not be written back. Third, if a synchronization point
exists within a loop nest being optimized, the shapes of shared arrays
have to be conservatively assumed to have changed at the
synchronization point. To prove that the shape of a shared
array does not change across a synchronization requires proving that
none of the threads that have access to the array at that time can
change its shape.
We now describe how to handle the case in which a body of code without
synchronization accesses a (potentially) shared array. We can divide
arrays into two distinct parts. The first part is comprised of
descriptors, those parts of an array that contain pointers
to other descriptors or to rows of data. The second part is comprised
of data, the one-dimensional rows that contain the actual elements. In
Figure 22, the dashed boxes indicate
descriptors, and the solid boxes indicate data. In general, a
d-dimensional rectangular array
A[1:n1][1:n2] ... [1:nd]
consists of one level-0 descriptor that points to a vector of
n1 level-1 descriptors. Each level-1 descriptor
points to a vector of n2 level-2
descriptors, for a total of n1n2
level-2 descriptors. There are a total of
O(n1n2 ... nd-1)
descriptors. The last axis of the array contains the actual data, in
the form of vectors of nd elements, pointed to
by the level-(d - 1) descriptors.
Figure 22
The following then ensures the thread safety of our transformations:
before every optimized body of code, a copy of the descriptors part of
each shared array A is cached in working memory. Note that
if a cached descriptor for A that is valid by the Java
thread semantics is already present, it may be used for the purposes we
describe. This action insulates the thread executing the optimized code
from any changes to the shared array shape caused by other threads.
Also note that only the array shape is relevant in computing the safe
and unsafe regions for our transformations. The data part of the array
can be modified by other threads without any effects on those regions.
By using this caching of descriptors, the only places that shape
changes become visible are at synchronization points. The shape of a
shared array should be marked as variant across those points. This, in
turn, can prevent many optimizations.
At first glance, the caching of descriptors may seem expensive. In
general, given a d-dimensional rectangular array
A[1:n1][1:n2] ... [1:nd],
only
O(n1n2 ... nd-1)
storage needs to be cached. Thus, a one-dimensional array needs only a
constant amount of storage, and an n1 ×
n2 array needs only n1 words of
storage. Therefore, if most of the array is accessed within the loop,
the amount of storage cached is approximately a factor of
nd less than the number of references.
Experimental results
To test the effectiveness of our compiler transformations, we
developed a prototype framework. This prototype framework currently
implements only the restricted method and only handles perfectly nested
rectangular loops. The framework consists of a Java-to-Java translator
that produces the two versions of code necessary for the restricted
method. These versions are then compiled into executable object code
using the IBM High Performance Compiler
for Java (HPCJ).4
HPCJ has a switch to generate code without
any run-time checks, on a class basis.
Benchmarks. Using the prototype framework, we applied the
transformations in the restricted method to Java programs in a
benchmark suite. This suite consists of two numerical kernels and three
application kernels, all with array-intensive computations. The two
numerical kernels are a vector dot-product operation
(DDOT), and a matrix-multiply
operation (MATMUL). The three
application kernels are a shallow-water simulation kernel
(SHALLOW), a data-mining kernel
(BSOM), and a cryptography kernel
(CBC). We compare the performance of
a Java implementation of each benchmark with a corresponding version
written in either C (DDOT,
MATMUL, and
CBC) or C++
(SHALLOW and
BSOM). The C and C++ versions
were written and compiled according to what we call
Java rules: (1) two-dimensional arrays are
represented as vectors of pointers to one-dimensional arrays, and
(2) any compiler optimizations that violate the Java
IEEE 754 floating-point semantics (exact
reproducibility of results) are disabled. The impact of these rules on
the performance of C or C++ programs is beyond the scope of this paper.
We discuss each of the benchmarks in more detail below.
DDOT. The DDOT benchmark
computes the dot-product ni=1
xiyi of two vectors x and
y of n double-precision floating-point numbers.
The Java code that implements DDOT
is straightforward and shown in
Figure 23. The C code is very similar. The
computation of DDOT requires
2n floating-point loads and 2n floating-point
operations (n multiplies and n adds). For our
measurements we use n = 106. We report the
performance of DDOT in Mflops.
Figure 23
MATMUL. The MATMUL
benchmark computes the matrix operation C = C + A
× B, where C is an m × p matrix,
A is an m × n matrix, and B is
an n × p matrix. The elements are double-precision
floating-point numbers. The Java implementation of this matrix |