 |
 |
Many numerically intensive computations done in a scientific computing environment require uniformly distributed pseudorandom numbers in the range (0, 1) and (-1, 1). For multiplicative congruential generators with modulus 2k, k 52, and period 2k2, we show that the cost per random number for these two distributions is 3 and 3.125 multiplyadds on RS/6000® processors. Our code, on the IBM POWER2 Model 590, produces more than 40 million uniformly distributed pseudorandom numbers per second for both ranges (0, 1) and (1, 1). Additionally, our code sustains the 40 million per second rate for data out of cache. The Numerical Aerodynamic Simulation (NAS) parallel benchmarks use a linear congruential generator with modulus 246. Our result is about 50 times faster than the generic implementation given in the benchmarks. The extra-accuracy fused multiplyadd instruction of RS/6000 machines combined with a few algorithmic innovations gives rise to the 50-fold increase. If IEEE 64-bit arithmetic is used with our Fortran code on POWER and PowerPC® architectures, the results we obtain are bit-wise identical to the generic algorithms. The paper gives several illustrations of a general technique called the Algorithm and Architecture approach. We demonstrate herein that programmer-controlled unrolling of loops is equivalent to customized vectorization of RISC-type code. Customized vectorization is more powerful than ordinary vectorization, and it is only possible on RISC-type machines. We illustrate its use to show that RS/6000 processors can compute the distribution (1, 1) at the rate of 3.125 multiplyadds. We also specify a linear congruential generator that is related to the multiplicative congruential generator referred to above. It has a full period of 2k, where 2k is the modulus. The cost per random number [in the range (0, 1)] for this generator is four multiplyadds on RS/6000 processors. Our code, on the IBM POWER2 Model 590, for this generator produces more than 30 million uniformly distributed pseudorandom numbers per second for the range (0, 1). We show that this generator is embarrassingly parallel, or EP. Using the Algorithm and Architecture approach, we describe a new concept called generalized unrolling. Finally, we present a multiplicative congruential generator for which the modulus is not a power of 2. Such a generator, as well as one with modulus 2k, is selectable as the generator used in the RANDOM_NUMBER intrinsic function of IBM XL Fortran and XL High Performance Fortran. All of the generators reported here are EP. Using an IBM SP2 machine with 250 wide nodes, it is possible to compute more than ten billion uniform random numbers in a second.
|
 |
 |
|
1. Introduction
The extra-accurate fused multiplyadd (FMA) operation of the RS/6000* and PowerPC* family of RISC microprocessors offers many opportunities to use mathematical innovation to produce fast algorithms for numerically intensive computation (NIC). In this paper we illustrate this assertion by giving several examples and by demonstrating a 50-fold increase in performance (over generic algorithms [1]) for pseudorandom-number generation. The results obtained are bit-wise identical to the results of the generic algorithms.
For multiplicative congruential generators [2] of the type
|
si+1 = asi mod 2k,
xi+1 = 2ksi+1 or 2k+1si+1 1,
|
(1)
|
with k 52, we show that the cost per random number for the two distribution intervals (0, 1) and (1, 1) is respectively 3 and 3.125 multiplyadds on RS/6000 processors.
We also specify a linear congruential generator of the form
si+1 = (asi + c) mod 2k,
xi+1 = 2ksi+1,
related to the generator in [2], which has a full period of 2k. The cost per random number [in the range (0, 1)] for this generator is four multiplyadds on RS/6000 processors.
Finally, we present a multiplicative congruential generator for the interval (0, 1) of the form
si+1 = asi mod (2k 1),
(discussed in [3]), for which the modulus is not a power of 2. Such a generator, as well as a generator of type (1), is selectable as the generator used in the RANDOM_NUMBER intrinsic function of IBM XL Fortran (XLF) [4] and XL High Performance Fortran (XLHPF) [5]. [By a generator of type (n), or generator (n), we mean a generator whose description is given by equation(s) (n).]
We first introduce some notation. We use q to represent the modulus, either 2k or 2k 1, depending on the generator. Arithmetic operators in equations are exact. Operands are IEEE normalized numbers. We use the operators , , , ø for IEEE double-precision (64-bit) floating-point arithmetic, and fl(x) for the correctly rounded double-precision value corresponding to x. In most cases, x = ab + c, where a, b, and c are IEEE numbers. Unless otherwise stated, the rounding mode we use is round-to-zero (chop). This leads to the best performance.
Let a, b, and c be arbitrary IEEE 64-bit floating-point numbers. The fused multiplyadd operation on RS/6000 and PowerPC computes the correctly rounded d = fl(ab + c) for any of the four IEEE rounding modes. On RS/6000 machines, all 106 bits of the product a * b are added to c in order to guarantee that d will be correctly rounded for all possible values of a, b, and c. POWER and POWER2 RS/6000 machines can on a continuous basis compute one and two FMAs every machine cycle if the results of the FMAs do not cause any pipeline delays. The pipeline delay for these machines is two or three cycles. Loop unrolling can be used to avoid any pipeline delays for the pseudorandom-number calculation.
We first consider a multiplicative congruential pseudorandom-number generator for which the modulus is a power of 2. See Knuth [2] for a thorough discussion of pseudorandom-number generators. Let 0 < s0 < 2k, s0 odd, 1 < a < 2k, k 52, and for i 0, let
|
si+1 = asi mod 2k,
xi+1 = 2ksi+1.
|
(3)
|
For k = 46 and a = 513 we have a specific instance of such a generator. This generator has period 2k2, and it is extensively used by the Numerical Aerodynamic Simulation (NAS) suite of paper and pencil benchmarks [1]. We mention here that the random-number generator (3) is embarrassingly parallel, or EP. In [1, p. 31, bullet 3], this point is made for the EP benchmark. Also, on page 29 of [1], Bailey describes the binary algorithm for exponentiation that allows one to compute an mod 2k in log2(n) steps. The fact that an mod 2k is computable in log2(n) steps is crucial to making the random-number generator (3) embarrassingly parallel. Bailey implicitly points this out in [1, p. 31, bullet 2]. On the IBM POWER2 Model 590, our bit-wise identical algorithms corresponding to Equation (3) compute more than 40 million random numbers per second. On the IBM SP2 machine, using the Model 590 for its nodes, and using p such nodes, we can compute more than 40p million random numbers per second because of the EP nature of these algorithms. Thus, using 25 nodes our algorithms can compute more than a billion random numbers in a second.
In [1], Bailey gives a generic algorithm that simulates base 223 multiple-precision arithmetic to compute the si's in (3). This algorithm requires 18 floating-point operations and four convert-to-integer operations per random number. We implement the same generator using three multiplyadds. Any pseudorandom si from (3) can then be placed in the range 0 < xi < 1 by the scaling xi = 2ksi or be put in the range 1 < xi < 1 by computing xi = 2k+1si 1. These computations respectively require one and two additional floating-point operations. In this paper, we redefine (3) to compute each xi directly, without first computing si, and thus compute
xi+1 = axi mod 1.
This change avoids the actual scalings done above. We also show how these computations can be done by three and 3.125 multiplyadds per random number on RS/6000 machines.
When doing modular arithmetic on integers, it is natural to use the greatest integer function. In floating-point arithmetic this is achieved by using the IEEE round-to-zero (chop) rounding mode. Throughout this paper, except for one place in Section 2, we use the chop rounding mode.
Many NIC algorithms are rated by their megaflop rate, although this popular measure is often misleading. We think that pseudorandom-number generation is such a case. An NIC computation related to pseudorandom generation is the EP NAS benchmark [1, 6]. By using mathematical innovation and the fast random-number generation described here, we were able to demonstrate that the RS/6000 POWER2 Model 590 could currently outperform a Cray YMP for EP [6]. In this paper we compare the ratio of the computing time for the generic algorithm [1] to the time of our algorithm. In both cases we use the same model of RS/6000. The new algorithm is written entirely in Fortran and is compiled using the XL Fortran (XLF) compiler [4].
We also implemented another generator of the type (3) in which a = 44485709377909 and k = 48. This generator is the RANF( ) pseudorandom function used on the CDC Cyber 174 computer, and is also generator 2 in the RANDOM_NUMBER intrinsic function provided with the IBM XLF and XLHPF compilers. It is described in the book Stochastic Simulation by Brian Ripley [7, p. 216], who presents statistical test results for several generators; he finds this generator quite acceptable. Gordon Slishman has implemented this generator1 using three multiplyadds for single-processor execution of RANDOM_NUMBER. The parallel implementation in XLF and XLHPF is described in Section 4.
Slishman also implemented the generator (2), for which the modulus is not a power of 2, for generator 1 of single-processor RANDOM_NUMBER. Parallel versions of both generators 1 and 2 in RANDOM_NUMBER were implemented by Robert Enenkel for SP (distributed-memory) machines, and by Enenkel and Xinmin Tian2 for SMP (shared-memory) machines. Generator (2) is discussed in detail in Section 5.
In Section 2 we describe some elementary mathematical ideas that can be used to compute (3) rapidly. These ideas are used to derive an algorithm to compute pseudorandom numbers in the range (0, 1) in three multiplyadds and pseudorandom numbers in the range (1, 1) in 3.125 multiplyadds. Also, using the IEEE round-to-nearest mode, we show how the latter computation can be done in three multiplyadds.
We now discuss POWER and PowerPC models. Before the advent of vector processors, integer arithmetic was generally faster than floating-point arithmetic. It was then customary to produce pseudorandom numbers using fixed-point arithmetic and then convert these integers to floating point with scaling to get floating-point pseudorandom numbers. When fast floating-point processors became available, it became more economical to produce the random numbers directly in floating point. For example, the first equation of generator (1) could be computed on an RS/6000 by setting 0 < s0 < 2k, s0 odd, and letting, for i 0,
u = fl(asi + 252+k),
v = u 252+k,
si+1 = fl(asi v).
The above three statements define pseudorandom integers in the range 0 < si < 2k that are bit-wise identical to the generator (1). However, one usually wants random floating-point numbers in the range (0, 1) or (1, 1). The formulas above would then require modification; i.e.,
xi = 2ksi or xi = 2k+1si 1.
However, these computations require an extra multiplyadd in addition to the four needed to compute si. One intention of this paper is to demonstrate that this additional multiplyadd can be removed. This results in improvement factors of 4/3 and 4/3.125, which come to 33% and 28% improvements per iteration over computation based on the above four statements.
In Section 3 we describe two full-period linear congruential generators,
|
xi+1 = (axi + c) mod 1,
|
(4)
|
that compute pseudorandom numbers in the range (0, 1) in four multiplyadds for c = 2k and c = a2k. We show a proof that these generators are EP. The proof involves showing that (1 + a + · · · + an1) mod q can be computed in log2(n) steps. Our four-multiplyadd implementation of these generators requires us to avoid conditional operations in the unrolled inner loop. It turns out that ordinary unrolling via vectorization fails. To overcome this failure, we introduce generalized unrolling, which becomes possible because the generator is EP.
In [2, Section 3.6, pp. 170173], Knuth provides a summary on how to generate random numbers. He recommends using a linear congruential generator,
X (aX + c) mod q,
|
(5)
|
that satisfies seven properties. However, he states that this class of generator applies primarily to machine-level coding and hence is not portable. For IEEE arithmetic, it makes sense to choose q = 2k. The fused multiplyadd instruction is provided by many computer architectures, including the IBM RS/6000 POWER and PowerPC, IBM S/390* G5, Intel/HP IA-64 Itanium**, Apple Power Mac**, HP Precision Architecture RISC 2.0 (e.g., HP900/800), and SGI MIPS** R-8000.3 Thus, within this more restrictive domain, we can propose our four-multiplyadd generator as being portable.
In Section 5 we extend the ideas of the previous sections to the generator (2). The modulus of this generator is 231 1, and not a power of 2 as for the previous generators. To achieve high performance, nontrivial changes to the algorithm are required.
In Section 6 we describe the performance of these algorithms relative to the generic algorithm of [1].
2. Three multiplicative congruential generators
Let a and y be positive integers less than 2k, where k 52. Clearly a and y are IEEE numbers. Let x = 2ky so that 0 < x < 1. Also, x is an IEEE number. Let p = ax. The base-2 (bit) representation of p is p1p2 · · · pk.pk+1pk+2 · · · p2k, where each pi is 0 or 1. Represent p = I + F, where I = p1 · · · pk and F = .pk+1 · · · p2k; i.e., I and F are the integer and fractional parts of p. Note that ax mod 1 = F, that I and F are IEEE numbers, and that 252 p + 252 253 1. Let
All 252 positive integers in the range 252 to 253 1 are all of the IEEE numbers in that range. Thus, u in (6) is an integer, and since we are in chop mode, u is the largest integer not exceeding p + 252; i.e., u = p + I. (See Lemma 1 for a detailed constructive proof.) Let
v = u 252.
|
(7)
|
The computation in (7) is done exactly, since v is computed using IEEE arithmetic, in which u, 252, and v = I are all IEEE numbers. Now let
On RS/6000, w is the fractional part F of ax because the fused multiplyadd instruction always delivers the correct answer when its operands and result are IEEE numbers. Note that ax v = F. Thus,
We have just proved the following.
Theorem 1
The computation in (6), (7), and (8) above produces the result (9). The value computed by (8) is bit-wise identical to the infinitely precise result (9).
When unrolled, Equations (6), (7), and (8) constitute a three-multiplyadd implementation of the random-number generator (3). Let a = 513 and choose 1 < s0 < 2k with s0 an odd integer and k = 46. Set x0 = s02k. Then 0 < x0 < 1. Note that the seven lower-order bits of x0 are zero. In fact, each xi, i 0, given by
has this property.
We now discuss some aspects of the Algorithm and Architecture approach [8] and how it relates to unrolling. In Section 1.3 of [1], Bailey et al. describe sample codes for the NAS benchmarks. There were two codes distributed for random-number generation, namely RANDLC and VRANLC; the latter is of interest to this paper. Subroutine VRANLC generates n REAL*8 uniform pseudorandom numbers in the range (0, 1) by using Equation (3). The documentation states that VRANLC is the standard version designed for scalar or RISC systems. A comment in this code states that the DO loop below in (11) which generates the n uniform pseudorandom numbers is not vectorizable.
T1 = R23 * A
A1 = AINT (T1)
A2 = A T23 * A1
C
C Generate N results. This loop is not
C vectorizable.
C
DO 120 I = 1, N
C
C Break X into two parts such that
C X = 223 * X1 + X2, compute
C Z = A1 * X2 + A2 * X1 (mod 223), and then
C X = 223 * Z + A2 * X2 (mod 246).
C
T1 = R23 * X
X1 = AINT (T1)
X2 = X T23 * X1
T1 = A1 * X2 + A2 * X1
T2 = AINT (R23 * T1)
Z = T1 T23 * T2
T3 = T23 * Z + A2 * X2
T4 = AINT (R46 * T3)
X = T3 T46 * T4
Y(I)= R46 * X
We believe the authors' statement about the code in (11). Today's compilers are not able to vectorize this loop. On the other hand, Equation (3) is vectorizable. The Algorithm and Architecture approach to handling Equation (3) would be to rewrite the code so that it performs well when run on some actual machine, e.g., an RS/6000 RISC-type machine. In the present case we would also need to unroll Equations (6), (7), and (8). Please note that the code in (11) can be replaced by the following code, in which TP52 = 252:
DO i = 0, n 1
u = a * x(i) + TP52
v = u TP52
x(i+1) = a * x(i) v
However, the above code will not execute in three cycles because of pipeline delays. This code is also not vectorizable by a compiler. In producing the code in (12), we have used the extra accuracy feature of the fused multiplyadd instruction that is present on RS/6000 and PowerPC machines. Also, the code in (12) is equivalent to using Equation (9). A compiler cannot recognize this fact. This is where we use the Architecture feature of our approach. We now show how to unroll the code in (12). In doing so, we vectorize the code in (12).
Let aj = aj mod 2k for 1 j m. Note that, from (3),
|
si+j
|
= ajsi mod q
|
|
= ajsi mod q,
|
so that
|
xi+j
|
=
|
ajsi mod q
|
|
|
q
|
|
=
|
ajxi mod 1, 1 j m.
|
|
(13)
|
Thus, given xi and (13), we have defined the next m iterations of (10). This unrolling of (10) by m constitutes using a vector of length m to compute these next m iterations of (10). In effect, RS/6000 machines can be viewed as vector machines with a short vector length. Because RS/6000 machines possess functional parallelism (see [9]), RS/6000 machines have very low vector start-up costs. This fact implies that using a short vector length is not a performance drawback. However, to achieve this vectorization it must be programmed. Now let m = 4. Thus, the code in (12) becomes
DO i = 0, n 4, 4
u = a * x(i) + TP52
u2 = a2 * x(i) + TP52
u3 = a3 * x(i) + TP52
u4 = a4 * x(i) + TP52
v = u TP52
v2 = u2 TP52
v3 = u3 TP52
v4 = u4 TP52
x(i+1) = a * x(i) v
x(i+2) = a2 * x(i) v2
x(i+3) = a3 * x(i) v3
x(i+4) = a4 * x(i) v4
ENDDO
The above code eliminates most of the pipeline delays. Every target of an FMA in that code is separated by three independent FMAs. However, the last FMA of the loop is followed by the first FMA of the loop with i replaced by i + 4; thus, there is no separation between the target x(i + 4) and the target u. In order to eliminate all pipeline delays, we need a more complicated unrolling of the code in (12). Consider
xi = x(4)
xi3 = x(3)
xi2 = x(2)
xi1 = x(1)
DO i = 0, n 8, 8
u4 = a4 * xi + TP52
u3 = a3 * xi + TP52
u2 = a2 * xi + TP52
u = a * xi + TP52
x(i+3) = xi3
x(i+4) = xi
v4 = u4 TP52
v3 = u3 TP52
v2 = u2 TP52
v = u TP52
xi0 = a4 * xi v4
xi3 = a3 * xi v3
x(i+1) = xi1
x(i+2) = xi2
xi2 = a2 * xi v2
xi1 = a * xi v
u4 = a4 * xi0 + TP52
u3 = a3 * xi0 + TP52
u2 = a2 * xi0 + TP52
u = a * xi0 + TP52
x(i+7) = xi3
x(i+8) = xi0
v4 = u4 TP52
v3 = u3 TP52
v2 = u2 TP52
v = u TP52
xi = a4 * xi0 v4
xi3 = a3 * xi0 v3
x(i+5) = xi1
x(i+6) = xi2
xi2 = a2 * xi0 v2
xi1 = a * xi0 v
The code in (14) eliminates all pipeline delays and hence should execute at the rate of one pseudorandom number every three multiplyadds. To start the pipeline, we need to precompute outside of the loop the first four pseudorandom numbers, x(i), i = 1, · · · , 4. The code in (14) is unrolled by 8.
Now we demonstrate another feature of the Algorithm and Architecture approach. Suppose we decide to use (13) to unroll by m = 8. We have seen that a compiler cannot unroll the present loop (12). We were forced to program the unrolling and to introduce the concept of a vector instruction into RS/6000 coding. We now claim that this necessity of having to program a vector instruction gives rise to a feature of superscalar machines that vector machines do not possess. This feature allows us to produce customized vector instructions. We remark that this is a part of the Algorithm and Architecture approach, and we now illustrate this concept with the example of generating uniform pseudorandom numbers in the range (1, 1). A standard implementation requires an extra operation to convert the range from (0, 1) to (1, 1).
Let c1 = 253 and c2 = 253 1. Let 0 < s < 2k with s odd and k = 46. Let x = 2k+1s. Then 0 < x < 2. Let, for i 1,
v = u c2,
|
(16)
|
and
where 1 j < 8. For j = 8 we let
v = u c1,
|
(19)
|
and
xi+8 = x 1.
|
(21)
|
Equations (15) to (21) define a loop, unrolled by 8, in which the next eight random iterates are computed and also the seed x is updated in Equation (20).
The first seven iterations cost three multiplyadds, while the last iteration costs four multiplyadds. The functional parallelism of RS/6000 indicates that this loop will execute in 25 multiplyadds, or 25/8 = 3.125 multiplyadds per iteration.
We close Section 2 by demonstrating a use of the IEEE round-to-nearest rounding mode. Let 0 < ui < 1 be the xi computed by Equation (10). For each ui, let xi = 2ui 1. Then 1 < xi < 1. Let c1 = 253 + 2k. Let SEED be the seed x0 for the ui computation given by Equation (10):
C use round-to-nearest mode
x(0) = TWO * SEED ONE
DO i = 0, n 1
uc = a * x(i) + c1
v = uc c1
x(i+1) = a * x(i) v
ENDDO
SEED = HALF * x(n) + HALF
We now prove the following.
Theorem 2
The above code produces results that are bit-wise identical to xi = 2ui 1, where ui is the xi given by (10).
Proof Let aui = I + F, where 0 < F < 1, so ui+1 = F. We want to show xi+1 = 2F 1. Let u = axi + c1. Since a is odd, a = 2a1 + 1 for some a1, and r = 2(I a1) + c1 + 2F 1. Note that 2k < axi < 2k, so that 253 < u < 253 + 2k+1. For IEEE numbers x with exponent 53, namely those x that satisfy 253 x 254 2, x is an even integer. For round-to-nearest, the computed value of u, namely uc, is the IEEE number closest to u. We claim uc = c1 + 2(I a1). Since uc is an even number with exponent 53, it is an IEEE number. Also, u uc = 2F 1 or 1 < u uc < 1 as 0 < F < 1. Thus, uc is the closest IEEE number to u. The fused multiplyadd instruction computes v and xi+1 exactly. Thus, v = 2(I a1) and xi+1 = 2F 1.
3. Two full-period generators
We now specify a high-level description of the full-period generators. Recall that the full-cycle generator is a linear congruential generator of the form (5) with q = 2k. We follow the recommendation of Knuth (see [2, p. 171]) and choose the value of c to be 1 or a. These two choices of c give rise to the two versions of our full-period generator. These generators have implementations that are nearly the same as the three-multiplyadd generator given by Equations (6), (7), and (8).
Let x be the current iterate, in which 0 < x < 1. Then the following four multiplyadd statements in (22) to (25) below describe the full-period generator:
v = u 252,
|
(23)
|
x = w .
|
(25)
|
In (25), the last operation, x = w , computes the next random number. In (25), = 2k when c = 1, and = a2k when c = a. We first consider the case = 2k. We were not able to reduce the number of multiplyadds to three as we did in Equations (15) to (21). We would need to define c2 = 252 + 246, and this value cannot be represented in a double-precision word. In Equation (25), note that w < 1. This fact follows from (9). Since is the smallest random number, we are guaranteed that the new x 1. In fact, x will equal 1 only once in 2k iterations of the generator. For k = 46, this is a very rare event. Suppose x = 1. Then (22), (23), and (24) compute w = 0 because x is an integer. However, the new x = , and we reach the conclusion that the generator is self-correcting! This feature is very important (see [3] for another example), because it is not necessary to check x during each iteration of the calculation. Suppose we chose 246 < < 1. Then, x in (25) could be greater than 1. In that case we would have to test and possibly correct x during every iteration:
IF
(x .gt. 1) x = x 1
|
(26)
|
A conditional statement of this sort, when present in a pipelined inner loop, can significantly degrade performance. Thus, the choice of = 2k is essential to making the performance of the full-cycle generator fast.
To obtain a new random number every four multiplyadds, it is necessary to unroll (22), (23), (24), and (25) by at least a factor of 2. Our numerical experiments on POWER2 showed that unrolling by 8 was necessary. The code for a linear congruential generator without unrolling is
DO i = 0, n 1
u = a * x(i) + TP52
v = u TP52
w = a * x(i) v
x(i+1) = w +
Because of pipeline delays, this code will not execute at the rate of one random number produced every four multiplyadds. It turns out that unrolling via program vectorization described in Section 2 does not work well. To see this, note that
xj+i = ajxi + j,
|
(28)
|
where aj = aj mod 2k and j = [(1 + a + · · · + aj1) mod 2k] . The j's are now variable! We have previously seen that the choice of = 2k was essential in order to maintain good performance. Any other value of gave rise to the use of conditional statements in the code, such as the statement in (26). Thus, the type of unrolling that comes from ordinary vectorization fails to provide peak performance. However, another form of unrolling works. In the present context, we call this generalized unrolling.
Let N = 2n be an arbitrary even integer. Suppose that we can compute an and n of Equation (28) in log2(n) steps. Then we can unroll the code by 2 in (27) as follows:
DO i = 0, n 1
u = a * x(i) + TP52
u1 = a * x(i+n) + TP52
v = u TP52
v1 = u1 TP52
w = a * x(i) v
w1 = a * x(i+n) v1
x(i+1) = w +
x(n+i+1) = w1 +
The code in (29) constitutes generalized unrolling. We have divided the vector x(0:N 1) into two vectors x(0:n 1) and y(0:n 1) = x(n:N 1) and worked on them independently. This type of unrolling gives rise to a form of single-instruction, multiple-data (SIMD) parallelism on a single processor. In [9] we also exploit this form of SIMD parallelism on POWER2 machines. The crucial point of the code in (29) for the present application is that the two vectors x and y both use the same .
We show how to compute n in log2(n) steps. First note that
j+i mod 1 = (aj mod 2k)( i mod 1) + j mod 1
|
(30)
|
and
|
aj+i mod 2k = (aj mod 2k)(ai mod 2k)
|
(31)
|
hold for all i and j. We construct a table = TBL(2, 0:46) whose ith entries TBL(1:2, i) are a2i and 2i, for 0 i 46. Note that
|
TBL(1, i + 1) = TBL(1, i) * TBL(1, i) mod 2k,
|
(32)
|
|
TBL(2, i + 1) = [TBL(1, i) + 1] * TBL(2, i) mod< 1,
|
(33)
|
along with TBL(1, 0) = a and TBL(2, 0) = 246, generates the table in 46 steps. We can now, given x0 and the precomputed table, use Equations (28), (30), and (31) to compute xn in log2(n) steps as follows:
t = x(0)
n0 = n
lb = 0
1 nn = n0/2
IF (n0 .gt. 2 * nn) THEN
u = t * TBL (1, lb) + TP52
v = u TP52
w = t * TBL (1, lb) v
t = t + TBL (2, lb)
IF (t .gt. 1) t = t 1
ENDIF
lb = lb + 1
n0 = nn
IF (n0 .gt. 0) goto 1
We now consider the second generator; this generator has = a2k. Consider the following code, in which TPMK = 2k:
DO i = 0, n 1
u = x(i) + TPMK
v = a * u + TP52
w = v TP52
x(i+1) = a * u w
Here we have 0 xi < 1. When xi = 1 2k, xi+1 = 0 as u = 1. Again, this generator is self-correcting. The unrolled-by-2 version of (35) becomes
DO i = 0, n 1
u = x(i) + TPMK
u1 = x(i+n) + TPMK
v = a * u + TP52
v1 = a * u1 + TP52
w = v TP52
w1 = v1 TP52
x(i+1) = a * u w
x(n+i+1) = a * u1 w1
Equations (30), (31), (32), and (34) remain unchanged. Equation (33) is modified to
|
TBL(2, i + 1) = [TBL(1, i) + a] * TBL(2, i) mod 1.
|
(37)
|
4. Parallel implementation for HPF
The IBM XL Fortran (XLF) [4] and XL High Performance Fortran (HPF) [5] languages include a RANDOM_NUMBER intrinsic function that implements two multiplicative congruential generators. These generators are also part of PESSL, the IBM Parallel Engineering and Scientific Subroutine Library for AIX [10]. Generator 1 is of type (2), with a = 75, k = 31, and modulus q = 2k 1. Generator 2 is of type (3), with a = 44 485 709 377 909, k = 48, and modulus q = 2k. The generator that is to be used is selected by setting the generator argument of RANDOM_NUMBER to 1 or 2, respectively. In this section, we discuss implementation issues arising from the parallel directives in HPF for the modulus 248 generator. The modulus 231 1 generator is discussed in Section 5.
HPF parallel directives
The HPF statement
CALL RANDOM_NUMBER (A)
fills the scalar, vector, or array A with random numbers. HPF provides directives (ALIGN, DISTRIBUTE, BLOCK, CYCLIC, etc.) that allow the programmer to specify what elements of A are to reside on what processor. To achieve high performance, the parallel algorithm works by computing on each processor only those random numbers resident on that processor, ultimately achieving linear speedups for large quantities of random numbers.
The N random numbers required on a particular processor j {0, · · · , P 1} are sub-sequences of x0, x1, x2, · · · , xN from (3) of the form
|
xij, xij + k, xij + 2k, · · · , xij + (N 1)k,
|
(38)
|
where k is called the stride of the sequence. If A is BLOCK-distributed, k = 1 and ij = jN, j = 0, · · · , P 1. If A is CYCLIC-distributed, k = P and ij = j, j = 0, · · · , P 1. A contiguous sequence is one with stride k = 1, and one with k > 1 is called strided. (There is also a BLOCK-CYCLIC distribution and other variations which are not discussed here for brevity, but which are handled by applying techniques similar to those presented here.)
The rest of this section shows how it is possible to compute (38) in O(N + log ij) time, resulting in asymptotically linear speedup. The algorithm used depends on whether the required sequence is contiguous or strided.
Contiguous random-number sequences
A contiguous random-number sequence has the form
|
xi, xi+1, xi+2, · · · , xi+N1.
|
Contiguous sequences are required in HPF on each processor when the random numbers are written to a BLOCK-distributed vector or array. Once the starting number xi is known for a particular processor, the rest of the sequence can be directly computed by the code (14). The fast calculation of the starting numbers on each processor is dealt with next.
Strided random-number sequences
If the CYCLIC distribution directive is used in HPF with more than one processor, the sequence of random numbers resident on each processor consists of strided (k > 1) sub-sequences of the form (38). In addition, for a BLOCK distribution, the starting values xij, j = 0, · · · , P 1 in (38) must be computed from some available previous xi, for example, x0. Both of these situations require the efficient computation of multiple steps of the recurrence (13). That is, given xi and n, we must compute xi+n. The code in (14) does this for several fixed n to achieve the unrolling. However, since n is not known a priori in the current context, we cannot simply precompute the value of
and must therefore devise an efficient means to compute it for general n.
We now show how to compute (13) in O(log2n) steps, which is crucial to the asymptotically linear speedup of the parallel algorithm. Binary algorithms for exponentiation are described in [2], and one is used in Bailey's random-number generator implementation [1]. We use a binary exponentiation algorithm here, but achieve high performance through the use of the FMA instruction.
Let the binary representation of n be
n = bk · · · b0, bk 0, k = log2n , n 0.
Then
|
n =
|
k
|
2jbj
|
|
|
j = 0
|
and
|
anxi mod 1
|
= [(a kj = 02jbj) mod q] xi mod 1
|
|
=
|
|
|
k
|
a2j
|
|
mod q
|
|
xi mod 1
|
|
|
j=0
|
bj 0
|
|
=
|
|
k
|
(a2j mod q)
|
|
xi mod 1
|
|
|
j=0
|
bj 0
|
|
(40)
|
The values Ti = a2i mod q, i = 0, · · · , log2q 1, are precomputed and stored in a table, allowing the computation of the product (40) in k log2n steps. The product is accumulated in a loop, as in the following pseudocode:
t = X(i)
DO i = 0, k
IF (bi 0) THEN
C Compute t = tTi mod 1.
u = t * T(i) + TP52
v = u TP52
t = t * T(i) v
ENDIF
ENDDO
5. A modulus 231 1 generator
We now consider efficient implementation of the generator of type (2) used in RANDOM_NUMBER. This is an extension of the work in the previous section, but is significantly more complicated since the modulus is no longer a power of 2. For the modulus 231 1 generator, high performance is achieved by scaling the integer recurrence by 231, finding a fast implementation of the scaled recurrence, and finally transforming the resulting numbers to the range (0, 1). Both the mathematical issues and the implementation issues arising from the parallel directives in HPF are discussed next.
Generator recurrence
The random-number sequence
xi (0, 1), i = 0, 1, · · ·
|
(42)
|
produced by the generator is defined by the recurrence
where s0 is an integer, 0 < s0 < q, and
a = 75 = 16807,
q = 231 1.
It has a period of q 1.
This value of q is well suited to an implementation on a machine with 32-bit floating-point arithmetic, such as the IBM S/360 [3]. Although the target architecture of XLF and XLHPF is the IBM RS/6000, which uses IEEE floating-point, this generator has been provided for compatibility with existing applications. The mathematical properties of this generator are discussed in [3].
As for the modulus 2k generator, the algorithm used depends on whether the required sequence is contiguous or strided.
Contiguous random-number sequences
Before discussing the main contribution of this section, parallel algorithms for strided random-number sequences, it is useful to describe the sequential algorithms (due to G. Slishman4) on which they are based. Although the ideas here are based on earlier sections of this paper, the implementation for this generator is more involved, since q is not a power of 2.
RANDOM_NUMBER uses a sequential implementation of (43) and (44) to produce contiguous random-number sequences of the form
xi, xi+1, xi+2, · · · , xi+N1,
|
when the starting number xi is known. These sequences are also required by HPF when the random numbers are written to a BLOCK-distributed vector or array.
It is convenient for computational purposes to rewrite (43) and (44) in terms of the value
so that
|
yi+1
|
|
| |
|
|
=
|
ayi231
|
|
ayi231
|
|
q
|
|
|
q
|
|
|
|
231
|
|
| |
|
|
= ayi
|
|
ayi231
|
|
(1 231)
|
|
|
q
|
| | |
|
= ayi ayis (1 231)
|
|
|
(46)
|
where
|
s
|
|
|
|
|
| |
|
|
| |
|
=
|
1 + 231 + 262 + 293 + · · ·
|
|
|
(47)
|
We now use Lemmas 1, 2, and 6 in the Appendix to transform (46) into an expression (51) that can be efficiently computed with FMAs. Let
= 1 + 231,
which is exactly representable as a double-precision IEEE floating-point number. By Lemma 2, we can use in place of s in (48), giving (49). By Lemma 1, we compute ayi , giving (50). [Since ayi = 75231si(1 + 231) where si < 231 1, ayi < 215, satisfying the condition of the lemma.] By Lemma 6, we rearrange the terms in the form of three FMAs (51):
|
yi+1
|
= ayi ayis (1 231)
|
(48)
|
| |
= ayi ayi (1 231)
|
(49)
|
| |
= ayi (ayi 252 252)(1 231)
|
(50)
|
| |
= fl[ayi (ayi 252)(1 231) 252(1 231)].
|
(51)
|
|
The random number xi+1 is recovered via (44) and (45) as
|
xi+1
|
=
|
231yi+1
|
|
|
231 1
|
| |
|
= syi+1.
|
The floating-point values i+1 and i+1, corresponding respectively to yi+1 and xi+1, are computed in RANDOM_NUMBER by the sequence of instructions
0 = y0
and
u = fl( i + 252),
|
(52)
|
i+1 = fl( ia v),
|
(54)
|
where
= a ,
k1 = 1 231,
and
k2 = 252 221
are exactly representable in IEEE double precision. The program (52)(55) computes u =  i + 252, v is computed exactly, and it follows that i+1 = yi+1 and i+1 = fl(xi+1). (A detailed proof is given by Lemma 7 in the Appendix.)
Although an RS/6000 POWER machine is capable of computing one FMA per clock cycle, the code in (52)(55) will not execute in four cycles because of pipeline delays. The reason for this is that the result of each FMA is not available for 23 cycles, although another independent FMA can be started immediately. Pipeline delays can be eliminated by loop unrolling [as explained in Section 2 for the modulus 2k generator (12)], if a means of efficiently computing yi+n, given yi, is available. Using (43), (44), and (45), and proceeding as in the derivation of (49), we have
|
si+n
|
= ansi mod q
= ansi mod q,
|
|
yi+n
|
= anyi anyi (1 231),
|
|
(56)
|
where
an = an mod q.
This is computed by the following sequence of instructions:
0 = y0
and
v = fl( ian+u),
|
(58)
|
w = v 252,
|
(59)
|
x = w 252,
|
(60)
|
y = fl( ian x),
|
(61)
|
i+n = fl(231x + y),
|
(62)
|
where
n = 231an.
|
(64)
|
The program (57)(63) computes x = an i , y is computed exactly, and it follows that i+1 = yi+1 and i+1 = fl(xi+1). (A detailed proof is given by Lemma 8 in the Appendix.)
Unlike the modulus 2k generator, the modulus 231 1 generator requires more instructions to compute a strided random-number sequence than a contiguous one. An unrolled loop based on (57)(63) takes seven (instead of four) cycles per number on an RS/6000 POWER machine. Therefore, the direct unrolling used for the 2k generator in (14) does not achieve maximum performance for the 231 1 generator.
Instead, the sequential implementation of the 231 1 generator in RANDOM_NUMBER produces contiguous random-number sequences by using two nested unrolled loops based on (57)(63) and (52)(55), analogous to the generalized unrolling in Section 3. For a fixed, preselected batch size m = 32, precomputed values of am = am, a2m = a2m, a3m = a3m, and abar = , ambar = m, a2mbar = 2m, a3mbar = 3m are kept. Given an initial seed yi = yi, the following code (65) then computes random numbers x(0), · · · , x(N) in batches of size 4m. (We assume that N is a multiple of 4m for simplicity.) It takes 18 + 16m cycles per 4m random numbers, for an average of 4.14 cycles per number.
This approach is also adopted in the following XLHPF 1.2 parallel algorithm for generating contiguous random-number sequences, which are required when the RANDOM_NUMBER argument vector or array is BLOCK-distributed:
DO i = 0, N 4 * m, 4 * m
C (57)(63) unrolled by 3
u1 = yi * ambar
u2 = yi * a2mbar
u3 = yi * a3mbar
v1 = yi * am + u1
v2 = yi * a2m + u2
v3 = yi * a3m + u3
w1 = v1 + TP52
w2 = v2 + TP52
w3 = v3 + TP52
x1 = w1 TP52
x2 = w2 TP52
x3 = w3 TP52
y1 = yi * am x1
y2 = yi * a2m x2
y3 = yi * a3m x3
yipm = TPM31 * x1 + y1
yip2m = TPM31 * x2 + y2
yip3m = TPM31 * x3 + y3
DO j = 1, m
C (52)(55) unrolled by 4
u = yi * abar + TP52
u1 = yipm * abar + TP52
u2 = yip2m * abar + TP52
u3 = yip3m * abar + TP52
v = u * k1 k2
v1 = u1 * k1 k2
v2 = u2 * k1 k2
v3 = u3 * k1 k2
yj = yi * a v
yjpm = yipm * a v1
yjp2m = yip2m * a v2
yjp3m = yip3m * a v3
x(i+j) = yj * sbar
x(i+m+j) = yjpm * sbar
x(i+2 * m+j) = yjp2m * sbar
x(i+3 * m+j) = yjp3m * sbar
END DO
yi = yjp3m
Strided random-number sequences
If the CYCLIC distribution directive is used in HPF with more than one processor, the sequence of random numbers resident on each processor consists of strided (k > 1) sub-sequences of the form (38). In addition, for a BLOCK distribution, the starting values xij, j = 0, · · · , P 1, in (38) must be computed from some available previous xi, for example x0. Both of these situations require the efficient computation of multiple steps of the recurrence (43), (44). That is, given yi and n in (56), we must compute yi+n. This is performed by the code in (57)(63). However, since n is not known a priori, we cannot simply precompute the value of
an = an mod q,
and must therefore devise an efficient means to compute it for general n.
We now consider how to compute an in O(log2n) steps, which is crucial to the asymptotically linear speedup of the parallel algorithm. The outline of the approach is similar to (41), but it is complicated by the need to compute the product modulo q instead of 1:
an = 1
DO i = 0, k
IF (bi 0) THEN
an = anTi mod q
ENDIF
A means for efficiently computing the modulus operation in (66) using IEEE double-precision arithmetic and the RS/6000 FMA instruction is a main contribution of this section, and is derived next.
Computing products modulo q
We now consider how to efficiently compute modular products of the form (66). (This is a more detailed description of results that were briefly outlined in [11].) Let
d = bc mod q,
where b and c are integers representable in IEEE double precision (IEEE integers, for short), with 0 < b < q and 0 < c < q, where b and c are powers of a, modulo q. [In particular, the values b = an and c = T(i) from (66) satisfy these conditions.] Then
|
d = bc
|
|
bc
|
|
q.
|
|
|
q
|
By (47),
so that
d = bc bc231s (2311).
By Lemma 3 in the Appendix, it follows that the infinite series s can be replaced by its first two terms; that is,
d = bc bc231 (231 1).
|
(68)
|
[In Lemma 3 we use m = bc < q2 < 262 and the condition that q m is satisfied, since q b and q c (b < q and c < q) and q is prime.]
We now show how (68) can be computed.
Theorem 3
Let b and c be integers representable in IEEE double precision (IEEE integers), with 0 < b < 231 1 and 0 < c < 231 1. Let
d = bc mod (231 1),
and compute as follows using IEEE double-precision arithmetic with the round-to-zero rounding mode:
u = (262b) c,
|
(69)
|
|
v = fl[(231b)c + u],
|
(70)
|
w = v 252,
|
(71)
|
x = w 252,
|
(72)
|
y = 231 x,
|
(73)
|
= z x.
|
(75)
|
Then = d.
Proof (See Appendix.)
In RANDOM_NUMBER, strided random-number sequences are computed by an unrolled loop based on (57)(63), similar to (76) below. Given a stride k, ai = aki, i = 1, · · · , 4, are first computed by (66) and (69)(75). Let aibar = ki, i = 1, · · · , 4. Starting with y1 = yi3, y2 = yi2, y3 = yi1, yi = yi, the unrolled loop (76) computes strided random numbers X(i) = xki, i = 1, · · · , N. It takes seven cycles per random number.
DO i = 0, N 4, 4
u4 = yi * a4bar
u3 = yi * a3bar
u2 = yi * a2bar
u1 = yi * a1bar
X(i+1) = y1 * sbar
v4 = yi * a4 + u4
v3 = yi * a3 + u3
v2 = yi * a2 + u2
v1 = yi * a1 + u1
X(i+2) = y2 * sbar
w4 = v4 + TP52
w3 = v3 + TP52
w2 = v2 + TP52
w1 = v1 + TP52
X(i+3) = y3 * sbar
x4 = w4 TP52
x2 = w2 TP52
x3 = w3 TP52
x1 = w1 TP52
X(i+4) = yi * sbar
z4 = yi * a4 x4
z3 = yi * a3 x3
z2 = yi * a2 x2
z1 = yi * a1 x1
yi = TPM31 * x4 + z4
y3 = TPM31 * x3 + z3
y2 = TPM31 * x2 + z2
y1 = TPM31 * x1 + z1
6. POWER2 timing results
We now give timings for the generators described in Sections 2 and 3. We have confined our timing studies to the RS/6000 POWER2 Model 590. POWER2 is capable of producing two FMAs every cycle, with a pipeline delay of two or three cycles [8, 9]. The POWER2 Model 590 has a clock cycle of 15 ns. The code in Equation (14) gives a pipeline delay of two cycles. To be safe we have doubled the unrolling from 8 to 16 in the actual code used for the timing below. (There are enough floating-point registers on the POWER2 to allow one to double the unrolling factor without negative impact. Running at the lesser unrolling, however, did not affect performance in any measurable way.) Since each uniform random number in the range (0, 1) costs three FMAs, the peak possible rate of random-number generation is 44.44 million per second. We measured performance of the (0, 1) generator for batches of double-word random numbers of size n = 2i, where i = 12 to 21. The size of the 590 cache is 215 doublewords, and the size of its TLB is 218 doublewords. All timing measurements were done using the XLF RTC (real-time clock) utility, and represent actual elapsed wall-clock time, including system time, etc. For i between 14 and 21, the performance was essentially constant. It varied between 42.90 and 43.50 million random numbers per second. For i = 12 and 13, the values were 39.33 and 41.79. This dropoff is due to a fixed setup cost for the generator. The setup cost consists of saving and restoring the user rounding mode, setting the rounding mode to chop, initializing the unrolled loop so that stores to memory are optimal, and the completion of the loop modulo the unrolled count. Without the setup cost, we measured a rate of 42.33 million random numbers per second for i = 12. For large n this cost becomes negligible. The (0, 1) generator runs at 98% of the peak obtainable rate. The result was independent of a and k. We tested two generators where a = 513 and k = 46, and a = 44 485 709 377 909 and k = 48.
The generic generator of Bailey et al. [1] ran at the rate of 810000 pseudorandom numbers for the data in cache and 803000 for data not in cache. Thus, the new generator is 53 times faster than the generic generator for both data in cache and data not in cache.
We tested two versions of the (1, 1) generator. For one of these, the number of cycles for 16 random numbers was 25. This is the 3.125-FMA generator, in which rounding is toward zero. The second one produced 16 random numbers in 24 cycles. This is the three-FMA generator, where rounding is to nearest. The results for the round-to-nearest (1, 1) generator were identical to the results for the (0, 1) generator. Returning to the 3.125-FMA (1, 1) generator, for i = 14 to 21 the performance was essentially constant; it varied between 41.15 and 40.52 million random numbers per second. The peak obtainable rate is 24/25 of 44.44 = 42.67 million random numbers per second. The measured results were at 96.4% of the peak obtainable rate. For i = 12 and 13, the rates were 37.53 and 39.91 million random numbers per second. Here again we see the effect of the fixed setup cost. In summary, for large enough batches of numbers the multiplicative random-number generators deliver more than 40 million random numbers per second regardless of whether the batch size fits in cache.
Now we discuss the four-FMA linear congruential generator. Here we chose batches of size 2i for i = 12, 13, 14, and 15. For i = 12, 13, and 14, we generated 27.62, 29.35, and 30.14 million random numbers per second. The peak possible rate is 33.33 million per second. Thus, our measured rate is about 90% of the peak obtainable. The smaller value for i = 12 was due to the fixed setup cost for this generator. For i = 15, the result was 26.40 million random numbers per second. For larger values of i, this rate per second dropped off very sharply because of cache and TLB thrashing. To alleviate this problem, we set n = 2i 544 * 8. The number 544 is the sum of a page plus a line in doublewords. The factor of 8 was obtained by dividing n by 8 to set up eight different store queues. However, almost any other value of n would have worked equally well. We tried new batches of size n, where i = 15 to 21. The rates were 30.32, 29.76, 29.91, 29.76, 29.76, 29.75, and 29.76 million random numbers per second. In summary, the four-FMA generator delivers about 30 million pseudorandom numbers per second both for data in cache and data out of cache.
7. Conclusions
In this paper, we have given several illustrations of a general technique called the Algorithm and Architecture approach [11]. We have used algorithmic innovation and the FMA instruction in the design of several uniformly distributed pseudorandom-number generators for the intervals (0, 1) and (1, 1).
We have implemented multiplicative congruential pseudorandom-number generators, for the ranges (0, 1) and (1, 1), of the form
si+1 = asi mod 2k,
|
xi+1 = 2ksi+1 or 2k+1si+1 1,
|
(77)
|
which have a period of 2k2. We have shown that the theoretical cost per random number for the two ranges is respectively 3 and 3.125 multiplyadds on RS/6000 processors. Our codes, on the IBM POWER2 Model 590, run at 98% and 96.4% of the peak obtainable rate, respectively, and produce more than 40 million uniformly distributed pseudorandom numbers per second for both ranges. Additionally, our code sustains the 40 million per second rate for data out of cache. Our code is about 50 times faster than the generic implementation with k = 46 given in the NAS parallel benchmarks [1], while producing bit-wise identical results.
We also implemented a linear congruential generator of the form
si+1 = (asi + c) mod 2k,
which has the full period of 2k. We have shown that the theoretical cost per random number [in the range (0, 1)] for this generator is four multiplyadds on RS/6000 processors. Our code, on the IBM POWER2 Model 590, runs at about 90% of the peak obtainable rate and produces more than 30 million uniformly distributed pseudorandom numbers per second.
Finally, we implemented a multiplicative-congruential generator for the interval (0, 1) of the form
si + 1 = asi mod (2k 1),
for which the modulus is not a power of 2. Generators of type (77) and (79), for the interval (0, 1), are available in the RANDOM_NUMBER intrinsic function of IBM XL Fortran [4] and XL High Performance Fortran [5].
All of the generators reported here are embarrassingly parallel. Using an IBM SP2 machine with 250 POWER2 wide nodes, it is possible to compute more than ten billion uniform random numbers in a second.
8. Appendix
This section contains detailed proofs for the lemmas used in the paper.
We first prove a result that allows the floor operation to be computed quickly using IEEE arithmetic.
Lemma 1
Let v be representable in IEEE double precision, with 0 v < 252. Then, with the round-to-zero rounding mode,
v 252 = v + 252
and
(v 252) 252 = v .
Proof The lemma is trivially true for v = 0. Therefore, assume that v > 0. Since v is IEEE, it can be written in binary as
v = b0 . b1 · · · b52 × 2e, b0 = 1,
where e < 52 since v < 252. Then,
v =
|
|
b0 · · · be
| |
e 0,
|
|
0
|
e < 0.
|
Clearly v is an IEEE number.
Suppose first that e 0. Then v 252 performs the following addition, the result of which is truncated to 53 bits. The number of zeros inserted for the normalization of v is k = 52 e, and k 1 since e < 52:
|
v
|
= 0 . 0 0 b0 be be+1 b52
|
× 252,
|
|
252
|
= 1 .
|
× 252,
|
v 252
|
= 1 . 0 0 b0 be
|
× 252.
|
If e < 0, then k > 52 and (v 252) = 252.
Thus,
v 252 = v + 252,
and this establishes the first part of Lemma 1. Now
(v 252) 252 = v ,
|
because the operands and result are IEEE numbers.
|
|
Corollary 1
Let x and y be IEEE double-precision numbers satisfying 0 xy < 252. Then fl(xy + 252) = xy + 252 and fl(xy + 252) 252 = xy .
Proof Use Lemma 1, except that v 252 in the lemma is replaced by fl(xy + 252), and also
|
xy
|
= 0 . 0 · · · 0 b0 · · · be be+1 · · · b105
|
× 252
|
|
252
|
= 1 .
|
× 252
|
|
fl(xy + 252)
|
= 1 . 0 · · · 0 b1 · · · be
|
× 252.
|
Therefore fl(xy + 252) = xy + 252, and the result follows.
|
|
Lemma 2
Let a = 75, yi = 231si, where si is an integer with 0 < si < q = 231 1, and = 1 + 231, s = 1 + 231 + 262 + · · · . Then ayis = ayi .
Proof Let m = 231ayi. Then Lemma 3 gives the required result, provided we can show that m satisfies the conditions of Lemma 3.
First, we show that m is an integer,
|
m = 231ayi = 23175231si = 75si,
|
(80)
|
which is an integer because si is.
Second, we show that 0 < m < 262. From (80), m > 0 since si is. Also from (80), m < 75231 < 85231 = 246 < 262.
Third, we show that q m. Since q is prime, if q | m, then q | a or q | si. But q > a and q > si; thus, q m.
|
|
Lemma 3
Let m Z, 0 < m < 262, (231 1) m. Then
231m + 262m = 231m + 262m + 293m + 2124m + · · · .
Proof Let the binary representation of m be
m = b1b2 · · · b62,
where each bi is either 0 or 1. Then,
|
231m =
|
b1 · · · b31 .
|
b32 · · · b62,
|
|
262m =
| |
. b1 · · · b31
|
b32 · · · b62,
|
|
293m =
| |
. 0 · · · 0
|
b1 · · · b31
|
b32 · · · b62.
|
Let
g = 231m + 262m
and
h3 = 293m,
|
h4
|
=
|
293m + 2124m,
|
|
.
.
.
|
|
hi
|
=
|
i
|
231jm,
|
i = 3, 4, · · · ,
|
|
|
j=3
|
| |
|
h
|
=
|
|
231jm,
|
|
|
j=3
|
| |
|
=
|
lim hi
|
|
i
|
|
(81)
|
We must show that g + h = g . Suppose, to get a contradiction, that g + h ![]() |