|  |
 |
Table of contents:
|  | HTML |  | PDF |
This article:
|  |
HTML
|  | PDF | DOI: 10.1147/rd.521.0145 | Copyright info |  |
 |
 |
Blue Matter: Scaling of N-body simulations to one atom per node
|  |  |
by B. G. Fitch, A. Rayshubskiy, M. Eleftheriou, T. J. C. Ward, M. E. Giampapa, M. C. Pitman, J. W. Pitera, W. C. Swope, and R. S. Germain
|
|
|  |
 |  |  |
|
| |
|
Numerical simulation of molecular systems can yield unique insights into the details of the structure and dynamics of biomolecules [1]. Such simulations are used to sample the configurations assumed by the biomolecule at a specified temperature and also to study the evolution of the dynamical system under some specified set of conditions. Among the many challenges facing the biomolecular simulation community, the one that stresses computer systems to the utmost is increasing the timescales probed by simulation in order to better make contact with physical experiment. Even sampling techniques that do not themselves yield kinetic information can benefit from an increased computation rate for a single trajectory.
Classical biomolecular simulation includes both Monte Carlo and molecular dynamics (MD) [2]. The focus of our work has been on MD, although the replica exchange or parallel tempering method [3], which combines MD with Monte Carlo-style moves, has been implemented in Blue Matter as well [4]. Classical MD is an N-body problem in which the evolution of the system is computed by numerical integration of the classical equations of motion. At each timestep, forces on particles are computed, and then the equations of motion are integrated to update the velocities and positions of the particles. The forces on the particles can be classified as follows:
Bonded forces—These forces act between covalently bonded atoms and include bond stretches, angle bends, and torsions.
Nonbonded forces—These forces act between all pairs of particles and include the hard-core repulsion and van der Waals interactions, which are typically modeled by a Lennard–Jones 6-12 potential of the form

as well as the electrostatic forces, which have a potential energy of the form
Ve(rij) = qi · qj /rij.
The forces induced by the Lennard–Jones potential drop off rapidly with distance (varying as 1/rij13 and 1/rij7) and can be modeled as finite-ranged forces. The electrostatic forces fall off much more slowly with distance (varying as 1/rij2) and cannot be approximated simply by neglecting interactions between pairs of particles beyond some cutoff distance (even when a smooth switching function is used) [5].
Ideally, a simulation would model a protein in an infinite volume of water, but this is not practical. Instead, the usual approach is to use periodic boundary conditions so that the simulation models an infinite array of identical cells that contain the biomolecules under study, along with the water. This is generally preferred over simulating a single finite box in a vacuum (or some dielectric) because it eliminates the interface at the surface of the simulation cell. The choice of simulation cell size is important. If the cell is too large, unnecessary computations will be done; too small, and the interactions between biomolecules in different cells of the periodic array can introduce artifacts. The most commonly used techniques for handling the long-range interactions with periodic boundary conditions are based on the Ewald summation method and particle-mesh techniques that divide the electrostatic force evaluation into a real-space portion that can be approximated by a potential with a finite-range cutoff and a reciprocal space portion that involves a convolution of the charge distribution with an interaction kernel [6–9]. This convolution is evaluated using a fast Fourier transform (FFT) method in all of the particle-mesh techniques, including the particle-particle particle-mesh Ewald (P3ME) technique [6, 9] used by Blue Matter. In this case, the O(n2) dependence on particle number n is reduced to O(n log n). The evaluation of the three-dimensional (3D) FFT and its inverse on every timestep imposes a global data dependency on the program. That is, the result depends on the position and value of every charge in the system.
Various algorithmic techniques for increasing the effective rate at which the kinetics of the systems evolve have been explored, including kinetic acceleration techniques [10, 11] and multiple timestepping algorithms [12]. Issues such as the difficulty in defining appropriate states between which transitions take place within a simulation have raised concerns about the applicability of the kinetic acceleration techniques to biomolecular simulation. In our experience, even with a correct splitting of the electrostatic forces [13], the use of multiple timestepping leads to significantly larger drifts in the total energy for constant particle number, volume, and energy (NVE) simulations over that obtained using the velocity Verlet integrator [14]. Our view is that direct kinetic simulation is still the most reliable technique for accessing dynamical information over a long timescale. This is in spite of the challenge of scaling a fixed-size N-body problem with some tens of thousands of particles onto a parallel computer with many thousands of nodes (strong scaling). The scale of this challenge can be envisioned by considering that a simulation rate of 1 μs per 2 weeks of wall-clock time requires each MD timestep to complete within 1.2 ms, or fewer than one million processor clock cycles on the IBM Blue Gene/L* machine when using a timestep size of one femtosecond (10−15 seconds). Carrying out such a long timescale simulation also potentially exposes correctness issues with the implementation of MD in a particular application. In this context, correctness refers to the degree to which the numerically simulated trajectory is representative of an actual trajectory within the model (potential surface) used. For a constant energy simulation, the size of both the short-term fluctuations and especially the long-term drift in the energy can be used as indicators of correctness or the lack thereof (see Reference [15] and the references therein). Given the plateauing of microprocessor clock speeds, any attempt to directly access millisecond timescales would require completion of each MD timestep in fewer than 10,000 cycles.
| |
|
Prior to the availability of the Blue Gene/L hardware platform [16], the highest degree of strong scaling in the published literature was that achieved by the Nanoscale Molecular Dynamics (NAMD) [17] package, which demonstrated continued speedup through about 60 atoms per processor (and a time per timestep of about 15 ms without multiple timestepping) on the Pittsburgh Supercomputing Center Lemieux system using 1,536 processors [18]. At that time, the NAMD code used a combination of volume and force decompositions [19, 20] for the evaluation of the real-space forces. This made it possible to create a large number of units of work—14 times the number of volume elements in the system, where the dimension of each volume element was larger than a cutoff radius—that could be distributed for load balancing. Subsequently, the NAMD developers incrementally increased the number of units of work available by effectively splitting volume elements in half along a selected axis or axes. Also, the parallel decomposition used for the 3D FFT in the NAMD code has thus far been a slab decomposition, which limits the distribution of work for that module to N for an N × N × N FFT. Much of the published scientific work using NAMD has been on simulations of large biomolecular systems rather than on smaller systems at very long timescales.
The intellectual point of departure for much of the work on highly scalable parallel decompositions of the N‐body problem (the real-space portion, at least) is the work of Hendrickson and Plimpton [19, 21], who demonstrated a force decomposition method for which the number of communicating partners of a single node scales like O( ), where p is the number of nodes. Algorithms that attempted to achieve comparable theoretical scaling behavior within a volume decomposition have been published by Snir [22] and Shaw [23], although, to our knowledge, no implementation of either of these algorithms has been published. Subsequent to the publication of our initial description and performance characterization of the Blue Matter Version 4 technique in September 2005 [24, 25], which is described below, the D. E. Shaw team published a description of essentially the same technique [26]. An implementation of this algorithm on a commodity cluster was described in a subsequent publication [27], but performance was reported using aggressive approximations that included the use of single-precision arithmetic and multiple timestepping, with a reported energy drift of 7 × 10−4 K/ns (100 times larger than the worst energy drift measured on Blue Matter under production or benchmarking conditions).
| |
|
Blue Matter has provided a concrete context in which to explore algorithmic techniques and programming models required to exploit massively parallel machine architectures, such as the IBM Blue Gene* supercomputer, as well as providing the capability required to execute one of the primary goals of the IBM Blue Gene Project [28]: to use the unprecedented computational resource developed during the course of the project to attack grand-challenge life-sciences problems, such as advancing our understanding of biologically important processes, in particular, the mechanisms behind protein folding. Blue Matter has been used in production by computational scientists on the Blue Gene Project since 2003, initially on the IBM SP* platform and later on the Blue Gene/L platform [29–36]. Through the use of the real-space parallelization techniques, described below in the section “Parallel decompositions,” and the highly scalable 3D FFT developed as part of this project [37], the Blue Matter MD code has demonstrated continued speedup through approximately one atom per node on 16,384 Blue Gene/L nodes and a time per timestep of less than 2 ms for a 43,222-atom solvated membrane protein system. All of this was achieved with methodologically rigorous methods for classical fixed-charge force fields. Our success in meeting the challenges of strong scaling has enabled detailed atomistic simulations of biologically interesting systems at timescales and with ensemble sizes that were previously unattainable, including 26 trajectories of 100 ns each of a G-protein-coupled receptor (GPCR), rhodopsin, in a realistic membrane environment [34] and multiple-microsecond-scale simulations of that system [36] and others. Furthermore, since the path to increased hardware performance now seems to lie more along the path of increasing concurrency (multiple CPU cores per chip and increased parallelism) rather than increasing clock speed [38], future work with even very large molecular systems with hundreds of thousands of atoms may require scalability to small ratios of atoms per node. While there have been some theoretical studies of scaling in this limit [39], the Blue Matter classical biomolecular simulation application running on the Blue Gene/L machine represents the first demonstration of strong scaling of such a code to this degree [15, 24, 25, 40]. With access to such timescales comes increased concern about whether the simulations are valid, and Blue Matter has also demonstrated the ability to generate trajectories with excellent energy conservation over microsecond timescales.
| |
|
We describe MD as comprised of four major modules:
- Real space, nonbonded (finite-range pair interactions).
- K space (FFT based).
- Bonded (graph based).
- Integration (per particle).
Before attempting to scale an algorithm onto many thousands of nodes, it is useful to estimate how much concurrency (potential for parallelism) is inherent in various components of that algorithm. It is interesting that while the machine architecture of the Blue Gene supercomputer forced us to think this way, it is likely that this sort of analysis would be useful on other massively parallel machines. First, consider the anatomy of an MD timestep starting with the availability of the coordinates and velocities of all of the particles in the system (ri, vi):
-
Compute forces on each particle due to bonded (intramolecular) interactions:
- Bond stretches.
- Angle bends.
- Torsions.
-
Compute forces on each particle due to nonbonded interactions (assuming periodic boundary conditions):
-
Hard-core repulsive and van der Waals forces (usually represented by a Lennard–Jones 6-12 potential function that is smoothly switched off beyond some cutoff distance Rc).
-
Electrostatic forces (1/r2 forces that are most commonly evaluated using the Ewald summation technique or its mesh-based variants [9]).
-
Accumulate the total force on each particle and use that force along with the current position and velocity of the particle to propagate its motion forward in time by some small increment.
It is possible to view an MD timestep (or any parallel computation) as the successive materialization of distributed data structures on which local computation takes place. Given a choice of granularity below which no parallelism will be attempted and taking into account the data dependencies in the algorithm, one can estimate the number of data-independent computations required or available at each phase. An example of such an analysis for an MD simulation using the P3ME method to treat the long-range electrostatic forces is shown in Figure 1. One can afford a lack of concurrency in components that impose very little computation or communication burden, but eventually even these will become bottlenecks if they are not parallelized. This is an instance of Amdahl's Law, which states that the amount of speedup possible for a fixed-size problem is limited by the fraction of the problem that is nonparallelizable [41]. More precisely, the PotentialSpeedup = 1/(f + (1 − f)/N), where f is the fraction of time consumed by serial operations (or those replicated on every node) and N is the number of nodes.
Figure 1
In the limit of very strong scaling, the P3ME convolution step is expected to be the limiting factor, assuming that a good distribution of work can be achieved for the bonded and real-space nonbonded force computations. We conjecture that this would be the case even for an architecture with full bisectional bandwidth because of latencies due to the hardware and software overheads in the successive communication phases required by the P3ME shown in Figure 1. The development of a highly scalable 3D FFT was essential for Blue Matter and has been reported in detail previously [37].
| |
|
Our explorations of parallel decompositions have included replicated data methods that leverage the hardware facilities of the Blue Gene/L platform to globalize and reduce data structures [42, 43] and combined spatial and interaction decompositions [15, 24, 25]. This paper focuses on the two spatial decompositions used in Blue Matter for which we have strong scaling data. We refer to these two decompositions, described below, as Version 4 (V4) and Version 5 (V5). Our investigations began with the realization that for finite-range pair potentials, given a 3D domain decomposition of the simulation cell onto a set of nodes, it is possible to limit the broadcast of positions by any originating node to those nodes containing a portion of simulation space within half a cutoff radius of the boundary of the originating node. V4, the first spatial decomposition deployed in Blue Matter, uses geometric criteria to determine where the real-space nonbonded interaction between two particles is to be computed, namely, on whichever node contains the point halfway between the two particles. This provides a large number of distinct units of computational work that can be distributed by partitioning space using an optimal recursive bisection scheme [24, 25, 40]. The implementation of this method entails the broadcast of a particle position to a sphere with radius Reff. Nominally, Reff is half the MD cutoff distance Rc for real-space nonbonded interactions for both V4 and V5. However, enabling the preservation of particle assignments to nodes over several timesteps requires the introduction of a guard zone that increases the Reff beyond half of the MD cutoff. The size of the guard zone is a tuning parameter. The use of this guard zone in V4 also ensures that the particle positions required by the K-space and bonded-force modules are usually available without additional pairs of communicating nodes.
The most recent parallel decomposition, V5, uses geometry primarily as a heuristic to prime the set-based optimization process that follows [15]. Whereas V4 managed data distribution (of particle positions) and reduction (of forces) by means of a data push and a caching module, V5 specializes the communication between the integrator module and the three force-computation modules described in the previous section. Whereas the V4 push and cache method required a distinct communication phase, V5 allows the overlap of the communication or computation, or both, of one module with that of another module. On the Blue Gene/L machine, which has two processors per node, modules are scheduled to cores to maximize overlap. Currently, the scheduling is static and we place the longest running module on its own core.
Achieving a reduction in the number of communicating partners per node (for the real-space nonbonded computations) was the major impetus behind the evolution from V4 to V5. In Blue Matter V4, each node (the home node) broadcasts the positions of the particles homed on that node to each node containing a volume element that intersects the volume enclosed by an imaginary shell extending half the cutoff distance (with some additional guard distance) away from the volume owned by the home node, as shown in Figure 2(a). This local volumetric broadcast and the corresponding reduction of forces back to the home node involve a number of nodes that scale like O(p), the same scaling obtained with a classic volumetric decomposition (though broadcasting to only one-eighth of the volume in simulation space). The advantage of a technique such as that used by V4 on a mesh interconnect topology is that the number of hops for each message is minimized.
Figure 2
The next incremental step in improving the V4 algorithm was taken by noting that all interactions can still be computed if the local broadcast only goes to nodes containing a portion of the imaginary shell that forms the boundary of volumetric broadcast used in the V4 technique, as shown in Figure 2(b). This improves the scaling of the number of communicating partners per node to O(p2/3). In this method, the interaction between two particles can be computed on any of the nodes that contain the intersection of the broadcast shells for the nodes containing the two particles. In contrast to the V4 technique, there is no simple geometric construction (analogous to the choice of the midpoint in V4) to select the node that should compute the interaction.
Implementation realities can affect the purely geometric analysis of the volumetric and shell broadcast techniques as well as those of other geometric approaches [22, 23]. The use of a guard zone added onto the nominal position broadcast radius enables the assignment of particles and interaction computations to nodes in order to remain fixed over multiple timesteps. This means that, for example, the ideal broadcast to nodes that intersect a spherical surface of zero thickness becomes a broadcast to nodes that intersect a spherical surface with a thickness equal to that of the guard zone.
The insight that the real-space nonbonded algorithm can be cast as an optimization problem leads to the method actually implemented in Blue Matter V5 [15]. The full optimization problem is to minimize the average execution time per MD timestep, and this is beyond our ability to solve at present. Attempting to minimize the number of communicating partners for each node, subject to the constraint of load balance, is a much more tractable problem, particularly given a good heuristic (such as the broadcast to a shell described above) for starting this optimization process. We begin a description of the optimization process used in V5 with a set of definitions and initializations:
-
Begin by defining the set of nodes to which each node will potentially broadcast particle coordinates. In the case of Blue Matter V5, this is the set of nodes that contains some portion of the surface in simulation space defined by the locus of points that are exactly a broadcast distance Reff away from the surface of the simulation space volume assigned to the originating, or central, node i. We call this the Candidate Send To Node Set or, in the case of V5, the Surface Node Set, which we denote Ci. Of course, the optimization process could also use the set of nodes defined by the broadcast volume in V4 or some other set of nodes as a starting point.
-
For each pair of nodes i and j, define the Interaction Assignment Option Set to be the intersection of the corresponding Candidate Send To Node Sets. In principle, the task of computing the interactions between particles in i and j can be assigned to any node in the Interaction Assignment Option Set, which we will call Oij.
-
The Interacting Pair Assignment structure is defined to be an upper triangular two-dimensional integer array indexed by node identifiers i and j. When the optimization procedure is finished, the array element Iij will contain the node identifier of exactly one of the nodes in Oij, which is where the interactions between the node pair will be computed.
-
We also define the Sparse Send To Node Set Si for each node i. Si is empty at the start, but at the end, Si will be a subset of Ci.
Next, we outline the iterative procedure used to construct the Sparse Send To Set Si: {Construct Sparse Send To Node Set}
Initialize all elements of the Interacting Pair Assignment structure Iij to −1 Let sequence P = {(i, j) ∈ P × P| (i < j)} where P is the set of node identifiers Sort P according to the size of the corresponding Interaction Assignment Option Set ||Oij|| {smallest first}
for k = 0 to ||P|| − 1 do
(i, j) = Pk
D = Oij {Interaction Assignment Option Set} if ∃a ∈ P|a ∈ (D ∩ Si ∩ Sj) then {No need to add any nodes to Sparse Send To Node Sets} else if ∃a ∈ P|a ∈ (D ∩ Si) ∨ (D ∩ Sj) then
Choose the element a that appears in the smallest number of Sparse Send To Node Sets Sn and append it to Si and to Sj {One of these appends will be a no-op because (a ∈ Si) ∨ (a ∈ Sj) already} else
From b ∈ D choose b|b appears in the smallest number of Sparse Send To Sets Sm
and append it to Si and to Sj
end if
end for
The results of this optimization process in terms of the communicating partner count are shown in Table 1. In fact, scaling plots of the communicating partner count as a function of node count [15] indicate that the Blue Matter V5 algorithm equals and is sometimes better than the O( ) scaling of the number of communicating partners achieved by the Plimpton–Hendrickson technique.
|
| Table 1 Communicating partner counts as a function of partition size for Blue Matter Version 4 (V4) volumetric broadcast, full-skin broadcast, and Version 5 (V5) set-based optimization. The V5 results give the minimum and maximum values as well as the average because the V5 algorithm result depends on the detailed distribution of particles. |
|
|
|
|
|
| Node count | Partition | V4 | V5 |
|
|
| Px | Py | Pz | Full skin | Sparse (average) | Sparse (minimum) | Sparse (maximum) |
|
| 512 | 8 | 8 | 8 | 45 | 42 | 22 | 19 | 24 |
| 1,024 | 8 | 8 | 16 | 63 | 58 | 27 | 24 | 30 |
| 2,048 | 8 | 16 | 16 | 105 | 90 | 38 | 34 | 40 |
| 4,096 | 16 | 16 | 16 | 147 | 122 | 50 | 47 | 54 |
| 4,096 | 8 | 32 | 16 | 147 | 122 | 51 | 46 | 52 |
| 8,192 | 16 | 32 | 16 | 209 | 162 | 65 | 61 | 67 |
| 16,384 | 16 | 32 | 32 | 349 | 242 | 89 | 84 | 91 |
|
| |
|
| |
|
Table 2 provides information about the specific parameters used in the runs whose performance is described here. Figure 3 shows the computational throughput in timesteps per second compared with the number of atoms per node. Plotting the data as a function of atoms per node provides some degree of normalization for system size when the real-space nonbonded interactions are the dominant contribution to the iteration time. One can observe that the scalability plots in the left-hand portion of Figure 3 (corresponding to larger values of atoms per node and lower node counts) seem to lie on a universal curve. This would correspond even more closely to a universal curve if we plotted the number of real-space nonbonded interactions per node rather than atoms per node on the horizontal axis. Figure 3 includes the Blue Matter V5 [system programming interface (SPI) only] data, as well the V4 data on both SPI and Message Passing Interface (MPI) [40].
|
| Table 2 Details about the systems benchmarked with Blue Matter. All runs were made with the velocity Verlet integrator [14], all runs used the P3ME technique to handle long-range electrostatic interactions, and all runs were NVE simulations. Rigid water models were used, and all heavy atom-to-hydrogen bonds in nonwater molecules were constrained using Rattle [44]. All runs performed the P3ME calculation on every timestep. Except for the SOPE (643) data, these choices are those used in production scientific work (hairpin, rhodopsin) or attempt to match (or exceed) benchmarking conditions reported elsewhere (ApoAl) [15, 24, 40]. |
|
|
|
|
|
| System | Total atoms | Cutoff/switch (Å) | P3ME mesh | Timestep (fs) |
|
| Hairpin | 5,239 | 9.0/1.0 | 643 | 1 |
| SOPE | 13,758 | 9.0/1.0 | 643; 1283 | 1 |
| Rhodopsin | 43,222 | 9.0/1.0 | 1283 | 2 |
| ApoAl | 92,224 | 10.0/2.0 | 1283 | 1 |
|
Figure 3
Table 3 provides both the time per timestep and the length of time required for the neighborhood broadcast and reduce required by the V4 and V5 real-space nonbonded algorithms. Table 4 shows a different view of V5 performance by providing the number of days required to simulate a microsecond for various molecular systems as a function of partition size. The effects of the reduction in numbers of communicating partners from V4 to V5, shown in Table 1 for rhodopsin, are manifest in the broadcast and reduction data in Table 3. The data also shows the dramatic difference in scalability between MPI and SPI for V4 that was previously reported [46]. The application-based tracing facility used by Blue Matter allows us to collect timing data from all of the nodes in the system. Trace points in this facility are placed in start–finish pairs within the source code and are turned on by compiletime macro definitions. For a single node, a timing diagram could be constructed by setting an indicator variable equal to 1 when the start-trace point is executed and resetting it to 0 after the stop-trace point is executed. In order to create something analogous to a timing diagram that contains data aggregated from all the nodes, we compute the distribution function of timestamps (over nodes) for every timestep. The distribution function is a function of time that gives the fraction of nodes for which the application program had executed the specified trace point at time t.
|
| Table 3 Performance data for Version 4 (V4) and Version 5 (V5) algorithms. Results for two 4,096-node partition geometries are shown: 4,096r is 8 × 32 × 16 and 4,096c is 16 × 16 × 16. |
|
|
|
|
|
| β-hairpin |
|
| Node count | Total (ms) | Broadcast (ms) | Reduce (ms) | Atoms per node |
|
|
|
| V4 | V5 | V4 | V5 | V4 | V5 |
|
|
|
|
|
|
| SPI | SPI | SPI | SPI | SPI | SPI |
|
| 512 | 3.01 | 2.25 | 0.25 | 0.11 | 0.24 | 0.11 | 10.2 |
| 1,024 | 2.02 | 1.45 | 0.26 | 0.11 | 0.25 | 0.10 | 5.1 |
| 2,048 | 1.48 | 1.00 | 0.25 | 0.12 | 0.23 | 0.10 | 2.6 |
| 4,096r | 1.52 | 1.12 | 0.26 | 0.14 | 0.23 | 0.17 | 1.3 |
| 4,096c | 1.26 | 0.83 | 0.24 | 0.12 | 0.22 | 0.08 | 1.3 |
|
| SOPE |
|
Node count | Total (ms) | Broadcast (ms) | Reduce (ms) | Atoms per node |
|
|
|
| V4 | V5 | V4 | V5 | V4 | V5 |
|
|
|
|
|
|
| MPI | SPI | SPI | SPI 643 | MPI | SPI | SPI | SPI 643 | MPI | SPI | SPI | SPI 643 |
|
| 512 | 7.47 | 6.81 | 6.22 | 4.83 | 0.56 | 0.35 | 0.16 | 0.13 | 0.44 | 0.35 | 0.19 | 0.16 | 26.9 |
| 1,024 | 5.25 | 4.30 | 3.89 | 2.79 | 0.63 | 0.31 | 0.14 | 0.10 | 0.53 | 0.30 | 0.16 | 0.12 | 13.4 |
| 2,048 | 4.66 | 2.81 | 2.45 | 1.80 | 0.88 | 0.25 | 0.12 | 0.09 | 0.86 | 0.23 | 0.15 | 0.10 | 6.7 |
| 4,096r | 5.61 | 2.57 | 2.11 | 1.28 | 1.38 | 0.25 | 0.14 | 0.09 | 1.33 | 0.24 | 0.13 | 0.10 | 3.4 |
| 4,096c | 5.08 | 1.95 | 1.60 | 1.25 | 1.40 | 0.22 | 0.12 | 0.08 | 1.31 | 0.21 | 0.15 | 0.08 | 3.4 |
| 8,192 | 7.31 | 1.89 | 1.50 | 0.97 | 2.49 | 0.23 | 0.14 | 0.08 | 2.21 | 0.21 | 0.11 | 0.08 | 1.7 |
|
| Rhodopsin |
|
| Node count | Total (ms) | Broadcast (ms) | Reduce (ms) | Atoms per node |
|
|
|
| V4 | V5 | V4 | V5 | V4 | V5 |
|
|
|
|
|
|
| MPI | SPI | SPI | MPI | SPI | SPI | MPI | SPI | SPI |
|
| 512 | 16.77 | 16.82 | 17.52 | 0.77 | 0.47 | 0.39 | 0.54 | 0.51 | 0.44 | 84.4 |
| 1,024 | 9.42 | 9.50 | 9.48 | 0.77 | 0.39 | 0.24 | 0.59 | 0.39 | 0.26 | 42.2 |
| 2,048 | 6.46 | 5.58 | 5.07 | 0.95 | 0.35 | 0.19 | 0.77 | 0.33 | 0.19 | 21.1 |
| 4,096r | 5.83 | 3.55 | 3.21 | 1.44 | 0.28 | 0.19 | 1.23 | 0.26 | 0.15 | 10.6 |
| 4,096c | 5.56 | 3.47 | 3.11 | 1.42 | 0.29 | 0.18 | 1.16 | 0.27 | 0.15 | 10.6 |
| 8,192 | 7.17 | 2.51 | 2.16 | 2.47 | 0.24 | 0.20 | 2.05 | 0.23 | 0.13 | 5.3 |
| 16,384 | 12.88 | 2.28 | 1.88 | 5.30 | 0.25 | 0.28 | 4.52 | 0.24 | 0.20 | 2.6 |
|
| ApoAl |
|
| Node count | Total (ms) | Broadcast (ms) | Reduce (ms) | Atoms per node |
|
|
|
| V4 | V5 | V4 | V5 | V4 | V5 |
|
|
|
|
|
|
| MPI | SPI | SPI | MPI | SPI | SPI | MPI | SPI | SPI |
|
| 512 | 35.56 | 36.37 | 38.42 | 1.05 | 1.08 | 0.66 | 0.68 | 0.97 | 0.90 | 180.1 |
| 1,024 | 19.29 | 19.18 | 18.95 | 1.02 | 0.74 | 0.40 | 0.71 | 0.76 | 0.51 | 90.1 |
| 2,048 | 11.48 | 10.68 | 9.97 | 1.14 | 0.60 | 0.26 | 0.88 | 0.55 | 0.30 | 45.0 |
| 4,096r | 8.57 | 6.33 | 5.39 | 1.66 | 0.51 | 0.22 | 1.54 | 0.47 | 0.23 | 22.5 |
| 4,096c | 7.55 | 5.97 | 5.44 | 1.37 | 0.50 | 0.19 | 1.23 | 0.48 | 0.21 | 22.5 |
| 8,192 | 7.34 | 3.68 | 3.14 | 2.22 | 0.43 | 0.15 | 2.15 | 0.38 | 0.16 | 11.3 |
| 16,384 | 12.58 | 2.57 | 2.09 | 4.83 | 0.40 | 0.13 | 4.80 | 0.34 | 0.13 | 5.6 |
|
|
| Table 4 The performance results for Blue Matter Version 5 expressed as the number of days required to simulate 1 μs for a specified timestep size. Results for two 4,096-node partition geometries are shown: 4,096r is 8 × 32 × 16 and 4,096c is 16 × 16 × 16. |
|
|
|
|
|
| | β-Hairpin | SOPE 643 | SOPE 1283 | Rhodopsin | ApoA |
|
| Timestep (fs) | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 |
|
| Partition size | Days required to reach 1 μs |
|
| 512 | 26.0 | 55.9 | 72.0 | 101.4 | 444.7 |
| 1,024 | 16.8 | 32.4 | 45.0 | 54.9 | 219.3 |
| 2,048 | 11.6 | 20.8 | 28.4 | 29.3 | 115.4 |
| 4,096r | 13.0 | 14.9 | 24.4 | 18.6 | 63.0 |
| 4,096c | 9.6 | 14.6 | 18.5 | 18.0 | 62.4 |
| 8,192 | | 11.2 | 17.4 | 12.5 | 36.3 |
| 16,384 | | | 19.2 | 10.9 | 24.2 |
|
This information is used to construct the plots, shown in Figure 4, that present the aggregated data from all 16,384 nodes for Blue Matter V5 running the rhodopsin system on the Blue Gene/L computer. In Figure 4(a), only the distribution functions of the start–finish trace-point pairs that bracket the entire timestep are shown. The periodic long timestep is caused by the need to migrate particles from one node to another as their positions shift. We avoid migration on every timestep by keeping track of particles that are somewhat farther away than half the cutoff distance (the additional amount is the guard zone mentioned earlier in the introduction to the section “Parallel decompositions”) and also by monitoring the drift of each particle since the last migration. When one or more particles drift far enough, an alarm is raised using the fast short vector reductions available on the Blue Gene/L system, and a new migration phase follows. In Figure 1, these are the guard zone 3-bit all_reduce operations shown in the rectangle just below the circle containing the Velocity Verlet integrator engine. Figure 4(b) zooms in on a few timesteps and shows a plot of the distribution functions for a number of important components of a timestep. It is obvious from this plot that the forward and reverse 3D‐FFT operations dominate the total timestep. Also, the scheduling of communication and computation tasks on both CPUs on each node can also be seen, e.g., the overlap of the reverse 3D FFT and the real-space reduce.
Figure 4
A detailed breakdown of the time required for the various operations required to complete a timestep is provided for the rhodopsin system in Figure 5. It can be seen from the figure that with more than 2,048 nodes, the dominant contribution to the timestep switches from the real-space nonbonded computation to the total K space. At the highest node count, it is also evident that the time required for the various localized broadcasts and reductions increases, and even if the total K-space contribution could be decreased, the reductions, in particular, would become the limiting factors for scalability. The total time per timestep is not simply the sum of the various contributions shown in Figure 5 for two reasons: First, both cores are used on each node, and so, for example, the real-space nonbonded computation can overlap with portions of the total K-space work as seen in Figure 4(b), and second, some of the quantities plotted are already aggregations of other quantities in the plot, e.g., the various K-space contributions and total K space.
Figure 5
| |
|
In general, in order to maximize throughput, a computational scientist wants to use the largest timestep possible consistent with providing results that adequately approximate an ideal simulation of the system (potential surface) under study. Other performance-critical simulation parameters affecting simulation accuracy and stability include the FFT mesh spacing for P3ME methods [47] and the force-splitting scheme and timestep ratios chosen for symplectic multiple timestepping methods [12, 13, 48].
It is a considerable challenge to determine the optimal parameters for simulations that involve billions or tens of billions of timesteps and require machines running at multiteraflops or faster. Figure 6 shows the change in the total energy in a simulation of a 66,728-atom solvated protein system, the lambda repressor, for a series of timestep sizes. Previously, we reported a measured energy drift of 6 × 10−4 K/ns (over a 1.6-μs simulation) for the 43,222-atom rhodopsin system using a 2-fs timestep [15, 46], which is consistent with the drift of 5.1 × 10−4 K/ns seen in the 2-fs timestep data for the lambda repressor.
Figure 6
| |
|
We have described the progression of parallel decompositions for the real-space nonbonded forces explored as part of the Blue Matter effort. The progression started with nongeometric replicated data decompositions, continued with the V4 spatial and interaction hybrid decomposition with geometric assignment of workload, and culminated in the V5 decomposition that uses geometry only as a heuristic that defines the starting point of the set-based optimization procedure for interaction assignment. As implemented in Blue Matter, these algorithms have made methodologically rigorous simulations of biologically interesting systems on the microsecond-scale routine on the Blue Gene Watson supercomputer. Furthermore, we have demonstrated that excellent energy conservation over microsecond-scale NVE simulations of solvated protein and membrane–protein systems can be achieved. The quality and time to solution that has been demonstrated by Blue Matter running on the Blue Gene/L platform enables increased ability to make contact with experiment by accessing timescales that are also accessible to physical experiment and it has already had significant scientific impact [31, 34–36].
| |
We thank the members of the Blue Gene hardware team, including P. Heidelberger, A. Gara, M. Blumrich, and J. Sexton, as well as others who have made use of and contributed to the Blue Matter code over the years, particularly Y. Zhestkov, Y. Sham, F. Suits, A. Grossfield, and R. Zhou.
*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.
| |
|
Received February 13, 2007; accepted for publication April 2, 2007; Published online December 19, 2007.
|
|