|  |
 |
Table of contents:
|  | HTML |  | PDF |
This article:
|  |
HTML
|  | PDF | DOI: 10.1147/rd.492.0465 | Copyright info |  |
 |
 |
Custom math functions for molecular dynamics
|  |  |
by R. F. Enenkel, B. G. Fitch, R. S. Germain, F. G. Gustavson, A. Martin, M. Mendell, J. W. Pitera, M. C. Pitman, A. Rayshubskiy, F. Suits, W. C. Swope, and T. J. C. Ward |
|
|  |
 |  |  |
|
| |
|
Sequencing the genome has enabled scientists to read the “words” in the building blocks of life. All-atom molecular dynamics is one of the tools in the grand challenge of understanding the stories told by those words.
We want to model the time–series behavior of a covalently bonded structure, such as a protein molecule that is surrounded by water molecules, as it would be in a living cell. We usually imagine a single protein molecule in a cubic box of a few thousand water molecules, and then imagine that there are identical boxes stacked in all directions, rather like atomic-scale synchronized swimming, with the swimmer made up of balls held together with springs. To understand the behavior of a single “spring” would require quantum mechanics, but on the larger scale of wanting to understand the “swimmer,” classical mechanics is sufficient.
Most of the forces to be calculated are the long-range electrostatic forces between atoms in separate water molecules, but the interesting behavior is related to the short-range forces along the springs and between various three-atom and four-atom bonded groups. This requires calculation of large quantities of square roots and their reciprocals (for multiplying and dividing by distances); error functions (one way of approaching the electrostatics); angles (between pairs of springs); periodic images (to work out which swimmer a water molecule is nearest to); and polynomials (for Lennard–Jones bonded forces and to softly switch off forces as pairs of atoms move farther apart and fade to the background).
| |
|
Source code for the functions presented here can be found at [1].
| |
|
The IBM xlc compiler can schedule instructions flexibly within a basic block, that is, a sequence of code with no conditional branches and no entry points other than the first instruction. This paper explains how to exploit this for functions commonly used in molecular dynamics; if the compiler can be enabled to see a sufficient number of independent instructions, it will schedule instructions to avoid stalls in the floating-point execution pipeline, and so the hardware will run at a high fraction of peak throughput. To make good use of the compiler instruction scheduling facility, the use of branch instructions should be minimized. This means that special cases and error handling should be omitted or done in a way that avoids branches. Therefore, all of these math functions will return a scalar result, will not set errno2, and will not signal a NaN (a not-a-number exception value in IEEE floating-point) in any useful way. Wrapper code could be placed around the functions to produce conventional results for out-of-domain cases, for example, to produce NaN for log(−1), but for molecular dynamics, we are generally confident that they will not be asked to process out-of-domain cases, and so the extra computation involved in obtaining conventional answers is best skipped.
One way to enhance scheduling opportunities by exposing independent instructions to the compiler is to write each independent computation explicitly in the source code. Another way is to compute the same basic block repeatedly with different arguments in a counted loop and verify that the compiler can see that loop iterations are independent; the compiler then applies loop transformation optimizations, such as unrolling3 and modulo scheduling4, to construct the appropriate work itself. Both techniques aim to reduce stalls5.
| |
|
The function log may be vectorized by appreciating that a floating-point number is represented as an exponent k and a mantissa (also called a fraction) m; i.e., as m × 2k, for some m in [1.0, 2.0) and for integer k,
ln(m ×2k) = ln(m) + ln(2k).
The approximation is produced as three terms, which are added together to give the result.
The variable k is extracted as the exponent part of the argument, giving the first term of the result as k × ln(2).
The variable m is expressed as m0 × m1, where m0 is 1 + (a/16) for integer a in (0, 15), and m1 is m/[1 + (a/16)].
The variable a is determined by extracting the first four bits after the binary point from m.
The expression 1/[1 + (a/16)] is looked up in a 16-element table, and this gives a value for m1 roughly between 1 and [1 + (1/16)].
The second term of the result is ln(m0), which comes from another 16-element table.
The third term of the result comes from a Taylor series for ln(1 + x). This converges quite rapidly for x < (1/16).
The full result then is
ln(a) k × ln(2) + Lookup(a) + TaylorSeries(x).
An improvement comes from a slight modification, where m1 is arranged to be in the domain [1 − (1/32), 1 + (1/32)), and so the Taylor series is used for |x| < (1/32).
| |
|
The function exp may be vectorized by using the relation
exp(a0 + a1 + a2 + a3) = exp(a0) × exp(a1) × exp(a2) × exp(a3).
The variable a0 is extracted as the integer part of the argument; a1 is the next four bits; a2 is the subsequent four bits; a3 is the remaining bits; a3 is a number between 0 and (1/256).
The variable a0 is shifted into the exponent of the resulting floating-point number; exp(a1) and exp(a2) are looked up in 16-element tables; exp(a3) is estimated by a Taylor series, which converges quite rapidly for 0 < a3 < (1/256).
Again, an improvement comes from a slight modification, setting a3 in the domain [−(1/512), + (1/512)).
IBM PowerPC* and follow-on hardware supports a floating-point “select” instruction that performs the equivalent of
double fsel (double a, double b, double c) { if (a >=0 . 0) return b ; return c }
as a single hardware instruction. This can be used to arrange that exp(x) returns 0 for a sufficiently large negative argument and Inf6 for a sufficiently large positive argument without causing a branch in the generated code.
| |
|
Traditionally in molecular dynamics codes, erfc(x) has been approximated using the approximation for erf(x) in Section 7.1.26 of [2], related by erfc(x) + erf(x) = 1. The Abramowitz and Stegun approximation from the reference is
erfx = 1 − (a1t + a2t2 + a3t3 + a4t4 + a5t5)e−x2 + ε(x),
|ε(x)| ≤ 1.5 × 10−7,
p = 0.3275911,
a1 = 0.254829592,
a2 = −0.284496736,
a3 = 1.421413741,
a4 = −1.453152027,
a5 = 1.061405429.
Vectorizable exp(x) can be used to form vectorizable erfc(x) in the obvious way, but there is an alternative that can be used to form a more accurate result, which is desirable in molecular dynamics because it should give better energy conservation for a given timestep size or, alternatively, will allow a larger timestep size before numerical instability sets in.
The reciprocal required above is a special case; for molecular dynamics codes, the dividend will be in the single-precision range, and there is no point returning a result much more accurate than the one part in 105 of the complete approximation. This leads to a faster expression of reciprocal than the hardware double-precision divide will give (more on this below).
For molecular dynamics, we are interested in erfc to support electrostatics, erfc(x) for a limited domain of x, typically (−4, 4).
We partition the domain into equal-sized subdomains, say [−4, −3), [−3, −2), …, [3, 4). Represent x as x0 + x1, where x1 is in [−0.5, 0.5) and x0 is an integer that identifies the subdomain. Each subdomain is associated with a polynomial approximator—a set of eight Chebyshev polynomials works well.
Select the appropriate polynomial by using x0 to index an array, and erfc(x) follows.
It is relatively easy to set the polynomials up to give erfc(x) accurate within 1 ulp7 over the whole domain. It is desirable to use fsel to avoid misleading results in case the function is used for a value of x outside the designed domain.
It is possible to exploit the symmetry between erfc(x) and erfc(−x) to halve the number of tables required.
The required table for Chebyshev coefficients is machine-generated. The algorithm is shown in [3]. First, the Chebyshev coefficients for (d/dx) erfc(x) are generated using the analytic expression (−2/ ) exp(−x2). Then the coefficients for erfc(x) are generated by applying the appropriate transformation on these.
| |
|
Derivative erfc is (−2/ ) exp(−x2) and may be vectorized using vectorizable exp(x).
However, for molecular dynamics, it is desirable to have derivative erfc and erfc related accurately as derivative and integral of each other; this results in better reported energy conservation and better accuracy when switch or soft force cutoff is in use.
When the Abramowitz and Stegun approximation for erfc(x) is in use, we can differentiate the expression analytically. The derivative has an exponential term of the same form as the original, i.e., exp(−x2), so a single evaluation of exp(X) will do duty for both functions when erfc and its derivative are both required in a computation.
When the multiple Chebyshev approach is in use, another set of Chebyshev polynomials can be used to deliver derivative erfc. If these are on the same subdomains, there is a computational economy.
| |
|
In molecular dynamics, erfc and its derivative are used in the evaluation of electrostatic forces. Another approximation (particle mesh) means that it is not useful to get erfc(x) more precise than a relative error of about 10−5; the imprecision due to the “particle mesh” approximation dominates.
However, it is important for the values returned for erfc(x) and its derivative to be continuous and an analytic integral/derivative pair.
This can be satisfied by approximating (d/dx) erfc(x) with a set of cubic splines, matching the (−2/ ) exp(−x2) function and its derivative at the piecewise endpoints and integrating these polynomials to give piecewise-quartic approximations for erfc(x). A set of 64 piecewise-cubic polynomials and their integrals, for domains [0, (1/16)), [(1/16), (2/16)), …, [(63/16), (64/16)) gives the ability to approximate erfc(x) and its derivative to the required precision in the domain [0–4).
| |
|
It is convenient to use a multiple-Chebyshev-polynomial approach for this as well. Divide sin(x) into domains [−45, 45), [45, 135), [135, 225), and [225, 315) degrees and repeat cyclically.
In domains [−45, 45) degrees and [135, 225) degrees, use a Chebyshev polynomial for [sin(x)/x], and multiply the result by x. This arranges that the result for small |x| can be within an ulp without requiring an excessive number of terms in the polynomial.
In domains [45, 135) and [225, 315), use a Chebyshev polynomial for cos(x).
The required Chebyshev polynomials are always even functions, such that f(x) = f(−x). This economizes on the computation.
After the polynomial evaluation, fix up the result using a suitable multiply and add, according to the subdomain.
Since cos(x) = sin(x + 90) with angles in degrees, cos and sin are related.
The tables are machine-generated offline, using higher-precision sin and cos functions and the algorithm in [3].
| |
|
Sometimes an application knows the sin and cos of an angle and wishes to evaluate the angle. Traditional arcsin involves an ambiguity as to the angle (as between 80 degrees or 100 degrees, for example), is ill-conditioned in ranges near 90 and 270 degrees, and usually involves a conditional branch and a square root.
By expressing it as
double acossin(double cos_angle, double sin_angle)
we can overcome these limitations and produce an implementation without branches.
We want to compute θ such that cosangle = cos θ and sinangle = sin θ. Let c = |cosangle|, s = |sinangle|, and use the fsel instruction to obtain minsc = min (c, s) and maxsc = max (c, s). Then 0 < minsc < √0.5 and √0.5 < maxsc < 1, and there is an angle in [0,45] degrees such that minsc = sin and maxsc = cos .
Then we use the compound angle formula
sin (a − b) = sin (a) cos (b) − cos (a) sin (b)
for b = 22.5 degrees to form the sine of an angle in [−22.5, 22.5] degrees, a value approximately in the domain [−0.38, 0.38].
Next, we use the Taylor expansion for arcsin(x), which converges quite rapidly over this domain, and we multiply by and add suitable constants (according to whether the original parameters were negated and which was smaller) to evaluate the called-for angle.
| |
|
The natural way to express this is
double a=1.0/sqrt (x) ;
The IBM xlc compiler “-qnostrict” option causes this to be recognized as an idiom. There is a hardware reciprocal square root estimate instruction that gives a result accurate to five bits (POWER3)8 or 13 bits (Blue Gene/L)9 using lookup tables in the same amount of time that a multiply–add instruction would take; and the compiler generates a suitable number of iterations of Newton's method, or a suitable Taylor correction polynomial, to bring the result to double-precision accuracy. This avoids the division operation, and this direct “reciprocal square root” evaluation is faster than “square root” would be.
Newton's iteration is expressed in terms of multiplies and adds. The “divide by b” that seems to be required is replaced with “multiply by estimate of 1/b.” The running estimate of 1/b is steadily improving, so quadratic convergence is maintained.
| |
|
The compiler recognizes the use of √x in a source program
double a=sqrt (x)
and rather than calling a function, it generates the fsqrt hardware instruction on the POWER3 processor. Blue Gene/L (BG/L) lacks this instruction, so the compiler, in effect, changes the computation to x/√x, which it implements with the help of the floating reciprocal square root estimate instruction. However, x/√x on its own will give “not-a-number” for x = 0; the compiler generates additional code to handle this case correctly, but it is computationally expensive.
If the source program is not dependent on the result for x = 0, it will run better on both POWER3 and BG/L if coded as
double a=x/sqrt (x) .
| |
|
Molecular dynamics is frequently run with periodic boundary conditions, i.e., where we imagine that the simulation volume is surrounded by a never-ending sequence of matching simulation volumes and the interaction force between a pair of atoms is calculated as if one of the atoms is influenced by only by the nearest of the 27 images of the other atom.
To find the nearest image vector between a pair of atoms, one algorithm would remap the simulation volume to a unit cube and scale the vector appropriately, drop the integer part of the x, y, and z coordinates of the vector (each of which would be −1, 0, or +1), and subtract
giving a vector in
and rescale back to the original coordinate system.
This appears to require divisions, tests, and conditional branches, but can actually be calculated without requiring any of these.
| |
|
Vectorizable nearest_integer relies on the IEEE floating-point representation. Double precision takes 64 bits. The top bit is a sign bit, the next 11 bits are a binary exponent, and the remaining 52 bits are a binary mantissa, with an implied leading 1.
IEEE addition, with the hardware in its usual mode, is specified to round to the nearest representable number. Thus, if one takes a double-precision floating-point number and adds (252 + 251), the fractional part is dropped. One can then subtract the (252 + 251) and obtain the integer nearest to the number used to start.
There is a range around 252 in which one obtains the nearest even integer, so this is not applicable in all cases, but is acceptable for molecular dynamics.
The compiler is being asked to generate code for (x + k) − k. It is important to prevent the optimizer from reassociating this to x + (k − k) and then optimizing this to x + 0, that is, x.
The sample code does this by expressing (x + k × k1) × k1 − k, where k1 is 1.0, but the compiler is unable to tell that k1 is a constant. Since the basic floating-point instruction in the IBM Power Architecture* is multiply–add, this does not cause any extra processing cycles.
| |
|
Molecular dynamics is generally concerned with forces between atoms in an imagined simulation box with periodic boundary conditions. Computation of the force between a pair of atoms is skipped if the atoms are more than a threshold distance apart.
For computational convenience, the atoms are grouped into fragments, typically a water molecule or a covalently bonded set of atoms within a larger molecule. The question arises, “Given fragment a, what is the set of fragments {b0, b1, …} such that an atom in a is in range of an atom in each bi, accounting for the periodic boundary?” The simulation will be functionally correct if extra fragments b are in the set, because the forces involved will evaluate to zero, but the simulation is more efficient with fewer extra fragments.
There is an algorithm for this that makes 100% use of the floating-point units (FPUs), successively slicing for slab, cylinder, and sphere.
There is another algorithm that does not use the FPUs; instead, it uses the integer units with wrap at 232, successively slicing for slab, square prism, and cube. It then uses the FPUs to slice for sphere. On POWER3 and BG/L, the integer algorithm is faster. Either algorithm is sufficiently fast that our implementation of the molecular dynamics code does not have to maintain lists of fragments (known as Verlet lists) that may be within a “cut-off” distance.
These algorithms show how to do “vector compress,” i.e., produce a vector that is a subset of a starting vector, including only those elements matching a selection criterion, without requiring a conditional branch.
| |
|
The reciprocal square root function evaluates the reciprocal square root for each of nine values, as would be needed to support the calculation of distances between atoms in a pair of three-site10 water molecules.
Figure 1 (Part b) shows source code and the compiler-generated assembly listing for the BG/L machine architecture. Compiler intermediate code with cycle counts and corresponding listings for POWER3 can be found at [1].
Figure 1 Part a
Figure 1 Part b
Values are copied into local variables to make it clear to the compiler what is intended if the function is called with source and target overlapping in memory.
POWER3 requires a vector of length at least 6 to keep the FPUs fully busy on this algorithm. BG/L requires a vector of length 10. In each case, the compiler finds an optimal instruction sequence; 100% floating-point utilization for POWER3 and 90% utilization (four “parallel” ops, then a “primary” op) for BG/L.
The “reciprocal square root estimate” instruction of POWER3 gives five bits of precision; that of BG/L gives 13 bits of precision. BG/L requires fewer follow-on instructions to converge the estimate to double precision. POWER3 uses a Newton–Raphson algorithm for convergence; BG/L uses a Taylor expansion.
The theoretical peak rate for each 440 processor core in the BG/L hardware is ten double-precision square roots per 40 clock cycles. By enclosing similar code in a “for” loop, it is possible to get the VisualAge* compiler to generate code that achieves within a few cycles of this rate.
Examining the machine code reveals that when a floating-point value is calculated, there are at least four other floating-point instructions between the calculation and the first use of the result. This keeps the floating-point pipeline full, allowing the FPU to operate at maximum throughput.
Each 440 processor core in BG/L can dispatch two instructions per clock cycle. Each has one floating-point instruction pipeline, one load/store pipeline, two integer pipelines, and one branch pipeline. The 440 does not have “out-of-order” processing capability or “rename” registers; both of these cost transistors and electrical energy, which the BG/L design puts to better use elsewhere. Therefore, we are dependent for good performance on the ability of the compiler to schedule instructions and allocate registers in the optimal patterns for the real hardware. The compiler effort to exploit the two-instructions-per-cycle capability can be seen in the assembly fragment shown in the figure.
IBM Power Architecture defines 32 double-precision floating-point registers. Floating-point operations, in general, work on three operand registers and a result register. For example, “floating-point multiply–add” might evaluate f1 = f2 + (f3 × f4) in a single pass through the FPU. The Blue Gene/L chip has an additional 32 double-precision floating-point registers, an additional FPU, and extensions to the instruction decoder to implement “parallel” versions of these, such as
f1 = f2 + (f3 × f4), s1 = s2 + (s3 × s4)
and various “cross” versions, such as
f1 = f2 + (f3 × f4), s1 = s2 + (s3 × 44)
or
f1 = f2 + (f3 × s4), s1 = s2 + (s3 × f4)
and an “antisymmetric” version,
f1 = f2 + (f3 × f4), s1 = s2 − (s3 × s4).
Several of these can be seen in the assembly fragment shown in Figure 1.
| |
|
When we started designing the protein folding application, we imagined that we would be unable to fully exploit the floating-point capacity of a modern uniprocessor because of the sequential nature of the scalar library functions, which we expected would limit the performance of the application. This would limit the fraction of peak flops that we would achieve on the massively parallel machine we had in mind.
Working with the life scientists on the actual requirements of the application and with the compiler programmers on optimization capabilities has resulted in techniques for evaluating the required functions and presenting the machine with sufficient independent work that we, in fact, achieved a high fraction of peak flops on a uniprocessor. This is expressible as source code in a high-level language, such as C++; it has not been necessary to hand-code anything in assembler.
Knowing that we can well exploit a uniprocessor, we are motivated to use the machine efficiently as we take the next steps to scale up to the 40,960 processors that will be available to us in our contribution to the “grand challenge” of understanding the Protein Economy.11
Blue Gene/L can do more than just lead the world at linear algebra. It takes some thought, but the results are worth the effort.
| |
The Blue Gene/L project has been supported and partially funded by the Lawrence Livermore National Laboratory on behalf of the United States Department of Energy under Lawrence Livermore National Laboratory Subcontract No. B517552.
*Trademark or registered trademark of International Business Machines Corporation.
**Trademark or registered trademark of Linus Torvalds in the United States, other countries, or both.
| |
| |
1On Blue Gene*/L, if code has dependencies such that a, b, and c must be computed in that order, with b depending on a and c depending on b, and no other work is available, the machine will deliver 10% of its theoretical peak performance. Here, the term vectorizable stands for assorted techniques to get closer to 100%.
2A global variable used to indicate which error has occurred.
3Grouping of multiple loop iterations so that the instructions from multiple iterations can be worked on in parallel.
4Software pipelining of loops—rearranging them to work on parts of more than one iteration at a time, the way a button is sewn on a shirt.
5Situations in which an instruction must wait before entering the processor because the calculations which produce one or more operands have not yet completed.
6A bit pattern representing infinity, or larger than the largest representable value, in IEEE floating-point.
7Ultimate limit of precision—one double-precision unit in the last place of the IEEE fraction part.
8Using a stepwise lookup.
9Using a piecewise-linear lookup, stepwise for “offset” and “slope,” then passing through the multiply–add unit, which would otherwise be idle. It is better precision with the same transistor count.
10A three-site water molecule is a model with electrostatic charges centered on the three atom locations. A five-site model has fractional electron charges at two other locations. Models run this way often match experiment more closely, and always take more computation for a simulation timestep.
11Proteomics: the Protein Economy. Deoxyribonucleic acid (DNA) stores information. We know; we have sequenced it. Ribonucleic acid (RNA) copies information. We know; we can make it happen in the laboratory. Ribosome (a protein) reads the RNA and assembles a protein out of amino acids according to the recipe it has read. We know; every living thing on earth works in much the same way. The protein folds into its equilibrium structure, more or less, quickly or slowly. We're curious about this. Function follows form, and all the diversity of life happens.
Received May 5, 2004; accepted for publication August 17, 2004; Published online April 12, 2005.
|
|