Introduction
Astrophysical N-body simulation
Simulations play multiple roles in science. They may elucidate fundamental phenomena such as why proteins fold quickly or how astrophysical structures from the solar system through large-scale structure form and evolve. Simulations can also be key for data assimilation—from using isolated weather observations to build a full model of the state of the earth's ocean and atmosphere to using observations of the universe to aid in our understanding of its nature, including the roles of dark matter and energy.
On the cosmological side, we have epochal surveys that have thrown down the gauntlet for cosmological simulation. We want to use those surveys to understand a variety of questions, the most important one at the moment being the equation of state of dark energy. For the solar system, we want to understand how trillions of small rocky bodies form so few planets and how the inner solar system is cleansed of small bodies so well that the last major assault on the earth was when the dinosaurs became extinct.
We have reached a key point in astrophysical N-body simulation. N-body simulations were useful for showing qualitative features in the classic “hypothesis-driven” mode of science. A wonderful example was the classic study of bridges, tails, and mergers by Toomre and Toomre [1]. However, it was exceedingly dangerous to use simulations in a “discovery science” mode [2], as most “discoveries” were dominated by numerical artifacts, leading for example to the belief that it was impossible to form clusters of galaxies, systems that were bound and stable with substructures that lasted for many dynamical times. Simulations had always shown that all substructure detonated spontaneously, forming a single overmerged object. This matched galaxies and made clusters mysterious. With higher-quality simulations, we have discovered that substructure is preserved at all levels in the Cold Dark Matter (CDM) model, making the observed overmerging in galaxies the mystery instead [3]. This result has been reproduced by a number of groups and is now recognized as one of the lingering problems in a theory that has worked extremely well on larger scales [4]. We have also discovered that galaxies do influence one another in clusters. When small disk galaxies experience the tidal shocks of fast fly-by collisions by large galaxies, they are shaken to elevated star formation for a few Gyr (billion years) and then beaten into featureless spheroidal galaxies. We called this galaxy harassment [5]. At the time of the discovery of galaxy harassment, the consensus was that it had to be wrong because it was such a big effect that it should not have been missed before. However, it was easy to miss an effect that transformed galaxies when the numerical artifacts were completely destroying them!
In this paper, we describe the techniques that we use to address these challenges and place them in the context of cosmology as well as the formation and stability of planetary systems. With planetary systems, the challenge is to perform accurate integrations that retain Hamiltonian properties for 1013 timesteps.
Cosmological N-body simulations
Simulations are required to calculate the nonlinear final states of theories of structure formation as well as to design and analyze observational programs. Galaxies have six coordinates of velocity and position, but observations determine just two coordinates of position and the line-of-sight velocity that bundles the expansion of the universe (the distance via Hubble's law) together with random velocities created by the mass concentrations (see Figure 1). To determine the underlying structure and masses, and separate the physical structures from those created in velocity space, we must use simulations. If we want to determine the structure of a cluster of galaxies, how large must the survey volume be? Without using simulations to define observing programs, the scarce resource of observing time on $2 billion space observatories may be misspent. Finally, to test theories for the formation of structure, we must simulate the nonlinear evolution to the present epoch.
Figure 1
This relationship to observational surveys defines our goal for the next decade. The Sloan Digital Sky Survey (SDSS) [6] will produce fluxes and sky positions for 5 × 107 galaxies with redshifts for the brightest 106. Our ambitious observational colleagues have cut steel and ground glass to survey a “fair volume” that we must simulate, but we would need N = 1012 particles to do this in a simplistic brute-force way. Direct summation of the gravitational forces using fixed timesteps would take 1010 teraflop-years.
We explain why this is a unique time to survey the universe, as well as describing the technical breakthroughs required to create a better survey of the cosmos. We then present the three keys to a realistic floating-point operation (flop) count:
-
Spatially adaptive potential solvers.
-
Temporally adaptive integrators.
-
Volume renormalizations.
Another goal of this paper is to define “high-quality simulations” and the niche science that can be done with N 108 particles.
The progress of simulations
Over the last 20 years, the N of our simulations has increased as log10N = 0.3(year - 1973). Figure 2 shows the relative contributions of hardware and algorithms. The power of computers has doubled every 18 months (open circles, log scale to the right), with algorithmic advances having a slightly more rapid pace (filled circles, log scale to the left). Together, the doubling time in power is eight months, accumulating to a trillionfold increase in less than three decades. We cannot wait to simulate 1012 particles; we would have to invent algorithms that are a thousand times faster!
Figure 2
There are two constraints on our choice of N. The cost of computing a full cosmological simulation is 105.7N4/3 flops. Here, we have scaled N within a fixed physical volume. The scaling with N4/3 arises from the increased time resolution needed as interparticle separation decreases. This extra 1/3 power in the scaling with resolution elements is an inherent part of nearly all physical simulation. For example, in a hydro grid code, one must follow sound waves across an element of the grid which decreases in size as the number of grid points to the 1/3 power if the overall physical size is fixed. Clearly, the scaling would be different if N represented the number of residues in a simulation of protein dynamics, so that the physical scales were changing. A simulation with a particularly large N was the “Hubble Volume Simulation” done with 109 particles by the Virgo consortium [7]. Since this simulation used an extremely large volume (roughly 100 times the size of the Sloan survey volume), the resolution was very low, and the timesteps could be large. The size of the dataset made it particularly difficult to analyze at the time that it was done, and there were no scientific results that made use of the larger volume. Clearly, both the resolution and the sample size are important to the science goals of N-body simulations.
This means that the “one byte per flop” rule of system design that applied to gigaflop machines is radically different for petaflop machines, where one likely needs only 30–100 terabytes of memory for most physical problems.
The memory needed to run a simulation is 102N bytes. If we fix N by filling memory, the time needed to run a simulation is 25 days × (bytes/flop rate) × (N/100 million)1/3. Current machines are well-balanced for our simulations. With gigaflops and gigabytes, we have run simulations with N 107.5. With teraflops and terabytes, we can simulate the behavior of 1010 particles or that of fewer particles in more depth. Simulations with N 1012 require 100 TB and 2 months on a petaflop machine. If we limit ourselves to runs of about a month, then 50 TB suffices on a petaflop machine.
There are a variety of problems in which N 106 represents a minimum ante. For example, clusters of galaxies are extremely important for determining cosmological parameters such as the density of the universe. Within a cluster, the galaxies are 1–10% of the mass, and there are roughly 103 of them. If the galaxies have fewer than 102–103 particles [8], they dissolve before the present epoch owing to two-body relaxation in the tidal field of the cluster. To prevent this, we need N > 106 particles per cluster. With the advances in machines and algorithms, these simulations are becoming possible on inexpensive group servers or small clusters. However, if we want to get a good sample of clusters within a Sloan volume, this would require N 1012 particles.
There are 1020 solar masses within the SDSS volume, so even 1012 particles is a paltry number, since each particle would represent 108 solar masses. We need ten times more to represent the internal structure of galaxies. N will always be far smaller than the true number of particles in the universe and will compromise the physics of the system at some level. We can only make sure that
-
The physics being examined has not been compromised by discreteness effects owing to N-deprivation.
-
Gravitational softening, discrete timesteps, force accuracy, and simulation volume do not make matters even worse.
N is not always the figure of merit in simulations, but it should be! The N-body Constitution [9] provides a set of necessary but not sufficient guidelines for N-body simulation.
The main physical effect of discreteness is the energy exchange that results from two-body collisions. Gravity has a negative specific heat owing to the total negative energy (sum of gravitational binding and kinetic energy) of a bound ensemble, like a star cluster. As a star cluster evolves, stars are scattered out by close gravitational encounters, leaving with positive energy. The remaining stars have greater negative energies, the cluster shrinks, the gravitational binding energy increases, and the stars move faster. In galaxies and clusters of galaxies, the time scale for this to occur is 103 to 106 times the age of the universe. In many simulations, the combination of discreteness in mass, time, and force evaluation can make the time scale much shorter, leading to grossly unphysical results. Thus, we must use sufficiently large N to ensure that physical heating mechanisms dominate over numerical artifacts, or that the numerical heating time scale is much longer than the time we simulate. We have inventoried all of the physical heating mechanisms experienced by galaxies in clusters and discovered a unique new phenomenon we call galaxy harassment [5].
Parallel spatially/temporally adaptive N-body solvers with “volume renormalization”
Performance gains of the recent past and near future rely on parallel computers that reduce CPU-years to wallclock-days. The challenge lies in dividing work among the processors while minimizing the latency of communication.
The dynamic range in densities demands that spatially and temporally adaptive methods be used. Our group has concentrated on tree codes [10] that can be made fully spatially and temporally adaptive. These codes use multipole expansions to approximate the gravitational acceleration on each particle. A tree is built with each node storing its multipole moments. Each node is recursively divided into smaller subvolumes until the final leaf nodes are reached. Starting from the root node and moving level by level toward the leaves of the tree, we obtain a progressively more detailed representation of the underlying mass distribution. In calculating the force on a particle, we can tolerate a cruder representation of the more distant particles, leading to an O(N log N) method. We use a rigorous error criterion to ensure the accuracy of forces.
As the number of particles in a cosmological simulation grows, so do the density contrasts and the range of dynamical times
( 1/ ). If we take the final state of a simulation and weight the work done on particles inversely with their natural timesteps, we find a potential gain of 50. The leapfrog time evolution operator, D( /2)K( )D( /2), is the one most often used:
where r is the position vector, v is the velocity, a is the acceleration, and is the timestep. This operator evolves the system under the Hamiltonian
where Herr is of order 2 [11]. The existence of this surrogate Hamiltonian ensures that the leapfrog is symplectic (that is, volume-conserving in phase space, so that total energy is conserved); it is the exact solution of an approximate Hamiltonian. Herr explores the ensemble of systems close to the initial system rather than an ensemble of non-Hamiltonian time evolution operators near the desired one. Leapfrog is a second-order symplectic integrator requiring only one costly force evaluation per timestep and only one copy of the physical state of the system. These properties are so desirable that we have concentrated on making an adaptive leapfrog. Unfortunately, simply choosing a new timestep for each leapfrog step evolves (r, v, ) in a manner that may not be Hamiltonian; hence, it is neither symplectic nor time-reversible. The results can be unmanageable [12]. Time reversibility can be restored [13] if the timestep is determined implicitly from the state of the system at both the beginning and the end of the step. This requires backing up timesteps, throwing away expensive force calculations, and using auxiliary storage. However, we can define an operator A that “adjusts” the timestep yet retains time reversibility and calculates a force only if it is used to complete the timestep [14]. This is done by choosing A such that it commutes with K, so that the sequence DAKD is equivalent to DKAD. Since K changes only the velocities, an A operator that depends entirely on positions satisfies the commutation requirement. The “natural definition” of a timestep,
1/
is ideal but difficult to define when a few particles are close together. Synchronization is maintained by choosing timesteps that are a power-of-2 subdivision of the largest timestep, S. That is, I = ( S/2j), where I is the timestep at a level of the timestep hierarchy. We are currently experimenting with this approach and encourage others to look at variants.
“Volume renormalization” uses a large-scale simulation with modest resolution to identify regions of particular interest: sites of galaxy/QSO formation, large clusters of galaxies, etc. Next, initial conditions are reconstructed using the same low-frequency waves but adding higher spatial frequencies. We have achieved 103-parsec resolution within a cosmological volume of size 108 parsecs to study the origin of quasars [15].
There are methods of calculating potentials that are advertised as being O(N) rather than O(N log N), such as the fast multipole method (FMM) [16]. We use a number of great tricks from FMM (our tree codes now use multipoles and combine and translate the multipole expansions). Nonetheless, there are several reasons why we do not use FMM. The first is that there is no method that is truly O(N). This can be seen with one-dimensional gravity that satisfies Gauss' theorem. In this case, the gravitational force is proportional to the number of particles to one's left, minus the number of particles to one's right. As a result, gravity is equivalent to a sort which cannot be reduced to O(N) for arbitrary real numbers. The one-dimensional gravity is a particularly useful example, as the scaling is O(N log N) to calculate the relative positions, but then one performs only one gravitational interaction per particle [so it is not just O(N), but truly “N”]. Similarly, the adaptive FMM must first build a tree which involves a sort [O(N log N)], but then calculates O(N) gravitational interactions. The interaction calculation is expensive enough that studies have generally seen the algorithm scale as O(N). These arguments are a bit simplistic. We have assumed that the sort necessary to build a tree is O(N log N). There are O(N) sorting routines that work as long as the distribution of points is not pathological, as in a Cantor set. However, the late stages of cosmological simulations become progressively more like a Cantor set! Requiring that the simulation be stably integrated means that particles cannot move too much, so tree reconstruction after each step should certainly be faster than a full build and should scale more like O(N).
Finally, we use a tree code to employ adaptive timesteps. There is a tendency to focus on reducing the simple calculation of forces from the “sum over all particles” N2 scheme to one that is O(N log N) or O(N). However, we need to solve the full dynamical problem that is naively N7/3, where the additional N1/3 comes from the need to take finer steps to implement higher resolution. With the tree code, one divides work into the calculation of forces on a particular particle. In this way, one can employ adaptive timestepping, with each particle having an individual timestep. In the FMM, work is effectively divided by scale, and the scales cannot be decoupled in any way to implement temporal adaptivity. The best sign of speed and relative efficiencies of the methods is that the tree build is a large fraction of the cost of a force evaluation in the tree code with adaptive timesteps, whereas its scaling can be seen in the FMM implementations.
One last class of methods is based on fast Fourier transforms (FFTs) combined with corrected particle–particle sums on small scales. The adaptive versions of this code by Couchman [17] with further refinements by Couchman, Thomas, and Pearce [18, 19] have powered the large Virgo consortium runs [7]. These codes use nested meshes with tailored Green functions with a final particle–particle summation on the smallest scales. For a single potential evaluation, this method is somewhat faster than the tree code, with the tree code winning only when the distributions are heavily clustered owing to its more efficient use of adaptive timestepping. There is also a hybrid composed of a particle-mesh code and a tree code [20]. One pathology of the tree code is that it requires the use of many more terms when the distribution of particles is nearly uniform, such as the early stages of a cosmological simulation. This is just the condition for which mesh codes are blindingly fast and accurate. This is one of the motivations for the more flexible simulation package discussed in the last section.
The distinctions between adaptive mesh, FMM, and tree code are becoming increasingly blurred. As one last example of this, our current method of collecting the interactions for each particle, or “tree-walking,” has been improved by applying a technique inspired by FMM. Instead of collecting all interactions for a single particle (or a small nearby set of particles), we consider which interactions are common to all subcells of a given cell at each level of the tree. This is done in a top-down manner, maintaining a “checklist” structure of cells that remain to be checked either for inclusion on an interaction list or to be opened and have their subcells placed on the checklist [21]. This results in many fewer comparisons in deciding which interactions to perform, thereby reducing one of the big integer-operation-limited steps of tree codes. With this tree-walk and the use of multipoles that are combined and translated, one could argue that our current code is as close to being a variant of FMM [16] as it is to the original tree codes [10, 22].
Current cosmological challenges
We now have multiple mysteries in cosmology as we continue on the path set by Copernicus, who showed that the earth was not the center of the solar system. We now know that our solar system resides on the outskirts of a galaxy which is at the edge of a cluster of galaxies, and that the entire system moves toward a greater attractor [23]. Not only that, but the universe appears to consist more of dark matter than of identifiable matter. Still more of the universe is in dark energy, which has a behavior entirely different from that of matter [24, 25]. Dark energy has a negative pressure or tension that makes the universe expand faster and faster, whereas the gravitational tug of normal matter slows it down. Strangely, we seem to live in a special epoch in which dark energy has just become dominant over normal and even dark matter. We still do not know what either the dark matter or the dark energy might be. A first step will be to determine its equation of state: Does it behave just like Einstein's cosmological constant, or differently?
Matching clusters, groups, galaxies, and cosmic flows has been a sensitive test of cosmological models. The key data for the next decade will be the Sloan Digital Sky Survey. Our proposed program to simulate the Sloan volume has been as follows:
-
Simulate the entire volume (800 Mpc)3 with N = 108, where Mpc = one million parsecs.
-
“Renormalize” dozens of groups, clusters, etc. and simulate with 108 particles.
-
Run selected subvolumes that are 10-3 of the total volume or (800 Mpc)3 with 108 particles to examine the evolution of groups and follow the early evolution of “rare objects.”
The first simulation took 1016 floats or 10 petaflop- seconds and has been run for a few cosmological models. Like many other groups, we found that matching structures in the universe required that the universe be dominated by a dark energy or the cosmological constant that was introduced by Einstein. The dark energy was also needed to fit observations of distant supernovae [26] and has now been determined to very high precision by the Wilkinson Microwave Anisotropy Probe (WMAP) observations of the cosmic microwave background [27]. This satellite mission determined to high precision the age of the universe, its expansion rate, and the relative proportions of normal matter, dark matter, and dark energy. However, these experiments have not been sensitive to the dark energy equation of state, which will stimulate a new generation of dynamical probes coupled with simulations. A recent study [28] finds that there is a strong sensitivity of the cosmic Mach number [29] (the ratio of the flow velocity to the local velocity dispersion) to the equation of state of the dark matter. The next step for using this sensitivity will be to obtain very accurate Mach numbers for a variety of flows.
The fate of the solar system
Advances in hardware and numerical methods have finally enabled us to integrate the solar system over its lifetime. Such an integration represents a thousandfold advance over the best, longest, most accurate integration ever performed [30] and can address numerous questions such as the following: Is the solar system stable? Do all of the planets remain approximately in their current orbits over the lifetime of the solar system, or are there drastic changes, or perhaps even an ejection of a planet? What is the stability of other planetary systems?
Integration of the solar system will allow us to address other questions, such as the following.
What is the effect of orbital changes on the planetary climates? According to the Milankovich hypothesis, climatic variations on the earth are caused by insolation changes arising from slow oscillations in the earth's orbital elements and the direction of the earth's spin [29]. Remarkably, the geophysical data (primarily the volume of water locked up in ice as determined by the 18O/16O ratio in sea-bed cores) covers a longer time than any accurate ephemeris.
How does weak chaos alter the evolution of the solar system? Why does the solar system appear stable if its Lyapunov time, that is, the time scale for two nearly identical trajectories to diverge significantly, is so short?
How are the giant planets related to terrestrial planets in the “habitable zone” between boiling and freezing of water by the central star? Without a cleansing of planetesimals from the solar system by giant planets [30], the bombardment of the earth by asteroids would be steady and frequent throughout the main sequence lifetime of the sun [31]. The chaos produced by Jupiter and Saturn may have played a role in ensuring that planetesimals collided to form the terrestrial planets,1 but too much chaos will eject planets in the habitable zone. While a search for giant planets is the only technically feasible one today, it may be the ideal way to screen systems before searching for terrestrial planets.
Integrating nine planets for 1011 dynamical times
When Laplace expanded the mutual perturbations of the planets to first order in their masses, inclinations, and eccentricities, he found that the orbits could be expressed as a sum of periodic terms—implying stability. Poincaré [32] showed that these expansions do not converge because of the occurrence of resonances. Using the Kolmogorov–Arnold–Moser (KAM) theorem, Arnold [33] derived constraints on planetary masses, eccentricities, and inclinations sufficient to ensure stability. The solar system does not meet his stringent conditions, but this does not imply that it is unstable. Laskar [34] tested the quasi-periodic hypothesis by numerically integrating the perturbations calculated to second order in mass and fifth order in eccentricities and inclinations, 150,000 polynomial terms. Fourier analysis of his 200-million- year integration reveals that the solution is not a sum of periodic terms and implies an instability that is surprisingly short, just 5 Myr (million years). The second method for attacking the stability problem is to explicitly integrate the orbits of the planets (Table 1).
|
| Table 1
Summary of direct solar system integrations. |
|
|
|
|
|
| Year | Team | Length (Myr) | No. of planets | General relativity? | Moon? | Special hardware? |
|
| 1951 | Eckert et al. [35] | 0.00035 | 5 | no | no | no |
| 1965 | Cohen and Hubbard [36] | 0.12 | 5 | no | no | no |
| 1973 | Cohen et al. [37] | 1. | 5 | no | no | no |
| 1984 | Kinoshita and Nakai [38] | 5. | 5 | no | no | no |
| 1986 | Applegate et al. [39] | 217. | 5 | no | no | yes |
| 1986 | Applegate et al. [39] | 3. | 8 | no | no | yes |
| 1987 | Carpino et al. [40] | 100. | 5 | yes | no | no |
| 1988 | Sussman and Wisdom [41] | 845. | 5 | no | no | yes |
| 1989 | Richardson and Walker [42] | 2. | 9 | no | no | no |
| 1991 | Quinn et al. [28] | 6. | 9 | yes | yes | no |
| 1992 | Sussman and Wisdom [43] | 100. | 9 | partly | yes | partly |
| 2002 | Ito and Tanikawa [44] | 10,000. | 5 | no | no | no |
| 2002 | Ito and Tanikawa [44] | 5,000. | 9 | no | no | no |
| 2003 | Varadi et al. [45] | 5,000. | 9 | no | no | no |
| 2003 | Varadi et al. [45] | 90. | 9 | yes | yes | no |
| 2003 | Armstrong et al. [46] | 1,000. | 9 | yes | yes | no |
|
As early as 1965, Pluto's behavior was deemed suspicious. In the last ten years, it has become clear that the solar system is chaotic. However, the source of the chaos is unclear, because the system of resonances is complex and the Lyapunov exponents appear to be sensitive to fine details of initial conditions. Nonetheless, the solar system is almost certainly chaotic. Laskar [47], examining the fate of Mercury, estimates that there is a significant chance of ejection during the lifetime of the solar system. Our belief in the apparent regularity of the solar system may be due to our inability to know that before the last few ejections, there may have been 10, 11, or even 12 planets a few billion years ago. At the very least, the chaotic motion leads to a horizon of predictability for the detailed motions of the planets. With a divergence time scale of 4–5 Myr, an error as small as 10-10 in the initial conditions will lead to a 100% discrepancy in 100 Myr. Every time that NASA launches a rocket, it can turn winter to spring in a mere 10 Myr.
Are the integrations meaningful given this sensitivity to the initial conditions? We investigate Hamiltonian systems that are as close to the solar system as possible. KAM theory tells us that the qualitative behavior of nearby Hamiltonians should be similar. While the exact phasing of winter and spring is uncertain after millions of years, the severity of winter or spring owing to changes in the earth–sun distance and the obliquity are predictable.
We have started a 9-Gyr integration: 4.5 Gyr into the past, when the solar system was formed, and 4.5 Gyr into the future, when the sun becomes a red giant. One basic requirement is a computer with fast quadratic precision to overcome roundoff problems. The IBM Power Series* has been the machine of choice. Our first digital orrery was an IBM POWER2* workstation that evolved the solar system 109 times faster than “real time”; this was two to three orders of magnitude faster than other available CPUs owing to IBM's implementation of quadratic precision. We have now completed a 1-Gyr integration [46]. To understand any chaos, we will need to see it by an independent means and devise methods to determine its underlying source.
A parallel method does not seem promising, since there are only nine planets to distribute among processors. We employ a different form of parallelism—the “time-slice concurrency method” (TSCM) [48]. In this method, each processor takes a different time slice; the initial conditions for processor 2 are the final conditions for processor 1, and so on. The trick is to start processor 2 with a good prediction for what processor 1 will eventually output, and iterate to convergence. This is analogous to the waveform relaxation technique used to solve some partial differential equations [49]. However, Kepler ellipses are a good approximation to the orbits for a time scale that is proportional to the ratio of the mass of the sun to that of Jupiter. Tests show that it is extremely efficient to iterate to convergence in double precision (typically 14 iterations, each costing 10–15% of a quad iteration), then perform just two iterations to get convergence in quad. In this way, the total overhead of the full 16 iterations can be less than a factor of 4.
There are still many algorithmic issues to be addressed. For long-term integrations, TSCM has been formulated in a way that preserves the Hamiltonian structure and exploits the nearness to an exactly soluble system; otherwise, errors grow quadratically with time. TSCM should enable us to integrate 5 Gyr per day on a 512-node IBM POWER4*, a speedup over real time of 1012. This should make it feasible to study the instability of other solar systems. Detailed development and implementation will be much more challenging than for previous methods, and our high-quality serial integration will be required for comparison and validation.
Finally, we use a new technique to gauge the origin of instabilities (the “tangent equation method”) [50]. In the past, it was common to integrate orbits from many slightly different initial conditions. While that works, it is more rigorous and also more economical to integrate the linearized or tangent equations—the equations for differences from nearby orbits. We integrate the tangent equations along with the main orbit equations.
Cosmology meets cosmogony: Planetary system formation
Theories of solar system formation are traditionally divided into four stages [51]: collapse of the local cloud into a protostellar core and a flattened rotating disk (nebular hypothesis); sedimentation of grains from the cooling nebular disk to form condensation sites for planetesimals; growth of planetesimals through binary collision and mutual gravitational interaction to form protoplanets (planetesimal hypothesis); and the final assembly to planets, with the remaining disk cleansed by ejections from chaotic zones. Our cosmology code is ideal for the third stage of solar system formation, particularly in the inner regions where gas was not a primary component and gravitational interactions dominated the evolution. The first stage entails magnetohydrodynamics, the complicated small-particle physics and gas dynamics of the second stage is still not well understood, and the fourth stage is the purview of long-term stability codes.
All that is required for a detailed simulation of the third stage is a model of the collisional physics and a code capable of dealing with a large number of particles. However, previous direct simulations of the planetesimal stage (summarized in Table 2) fall far short of capturing the full dynamic range of the problem. Our cosmology code has the potential to treat as many as 107 particles simultaneously for 107 dynamical times, a ten-millionfold improvement that makes us enthusiastic! Only statistical methods [52] employing prescriptions for the outcomes of gravitational encounters have been used to take a look at this regime. We reach an important threshold at N = 107 in our ability to follow planetesimal evolution. At early times, the relative velocities between planetesimals are small, and inelastic physical collisions lead to “runaway” growth of planetary embryos [53]. Eventually, gravitational scattering increases the planetesimal eccentricities to such an extent that collisions result in fragmentation, not growth. The embryos continue to grow owing to their large mass, but at a slower rate as their “feeding zones” are depleted [54]. The total mass of our planetary system is 448 times that of the earth (448Mearth) or 3.6 × 104Mmoon, while the inner planetesimal disk amenable to simulation had a mass of 102Mmoon. To capture both growth and fragmentation [52] requires a minimum particle mass of 10-5Mmoon, leading to our N = 107.
|
| Table 2
Direct simulations of the formation of the inner planets. |
|
|
|
|
|
| Year | Team | N | t (yr) | a (AU) | Giants? |
|
| 1986 | Lecar and Aarseth [55] | 200 | 6 × 104 | 0.5–1.5 | no |
| 1990 | Beaugé and Aarseth [53] | 200 | 6 × 105 | 0.6–1.6 | no |
| 1992 | Ida and Makino [56, 57] | 800 | 5000 | 0.3 | no |
| 1993 | Ida and Makino [54] | 800 | 5000 | 0.3 | no |
| 1993 | Aarseth et al. [58] | 400 | 1.2 × 104 | 0.04 | no |
| 1996 | Kokubo and Ida [59] | 5000 | 2 × 104 | 0.4 | no |
| 1998 | Kokubo and Ida [60] | 5000 | 2 × 104 | 0.4 | no |
| 1998 | Chambers and Wetherill [61] | 100 | 108 | 0.5–2.0 | yes |
| 1998 | Richardson et al. [62] | 105 | 1000 | 1.2–3.6 | yes |
| 2000 | Richardson et al. [63] | 106 | 1000 | 0.8–3.8 | yes |
|
A detailed direct simulation of planet formation can address a variety of important questions, including the following: Was there runaway growth of a few embryos, or a continuously evolving homogeneous mass distribution? How does the primordial surface density alter the evolution? What fixes the spin orientation and period of the planets—uniform spin-up from planetesimal accretion [64], or a stochastic process dominated by the very last giant collisions [65]? Is it feasible that the earth suffered a giant impact late in its growth that led to the formation of the moon [29]? How much radial mixing was there, and can it explain observed compositional gradients in the asteroid belt [66]? Finally, what is the dominant physical mechanism that drives the late stages of growth—are intrinsic gravitational instabilities between embryos sufficient, or are perturbations by the giant gas planets required?
This last point is of key importance to future searches for terrestrial planets. We strongly suspect that the end result of our research may be the assertion that one should concentrate searches for terrestrial planets in those systems that have giant planets. We have begun to address these issues with a modified version of the cosmology code. Collisions are detected (rather than “softened away”), and the outcomes are determined by the impact energy, the lowest energies generally leading to mergers and the highest energies leading to fragmentation. In our current code, merging and bouncing are implemented, while fragmentation is in the testing stage. Integrations are carried out in the heliocentric frame and may include the giant planets as perturbers. Additional programs are used to generate appropriate initial conditions and to analyze the results of the simulation, but the main work is performed by the modified cosmology code.
Figure 3 shows the mass density (a) and semi-major axis a vs. both eccentricity e and inclination i (b), respectively, at the end of a 100-yr run that began with 106 identical cold planetesimals in a disk from 0.8 to 3.8 AU (astronomical units) with surface density proportional to r -3/2. The present-day outer planets were included in the calculation. The simulation took 60 hours to finish on a Cray T3E** computer2 with 128 dedicated processors using a fixed timestep of 0.01 yr. The effect of Jupiter on the disk, which extends well into the present-day asteroid belt, can be seen clearly in the density plot: There is a large density gap at the 2:1 resonance at 3.2 AU and a narrow groove at the 3:1 at 2.5 AU, along with spiral wave patterns and other telltale features. Corresponding features in Figure 3(b) show how Jupiter stirs up planetesimals at the mean-motion resonances. Note that conservation of the Jacobi integral accounts for the slight bending of the e peaks toward smaller a. Meanwhile, planetesimal growth has proceeded unmolested in the inner region of the disk (under the assumption of perfect accretion). The largest planetesimal at the end of the run is eight times its starting size.
Figure 3
As far as we are aware, this is the largest simulation of a self-gravitating planetesimal disk that has ever been attempted. The figures show, however, that to reach the regime of runaway growth ( 104–105 yr), a new timestepping approach is needed. We are currently developing a technique to exploit the near-Keplerian motion of the planetesimals. For weakly interacting particles, we divide the Hamiltonian into a Kepler component, implemented using Gauss' f and g functions to step along elliptical rather than linear segments, and a perturbation component owing to the force contributions of all of the other particles. In this regime, timesteps can be of the order of the dynamical (i.e., orbital) time, resulting in computational speedups of 10–100. For strongly interacting particles (defined as particles with overlapping Hill spheres), the Hamiltonian is factored into the standard kinetic and potential energy components, with the central force of the sun as an external potential. In this regime, particles are advanced in small steps, which allows for the careful determination of collision circumstances. It also allows the detection of collisions in the correct sequence even if a single particle suffers more than one collision during the interval. This new technique will transfer well to the protein folding problem.
The code that was used to do this simulation has considerable room for improvement. Nonetheless, we can compare the time to solution with GRAPE (GRAvity PipE) computer3/HARP computer4 simulations [59, 60]. We found that our code had an order of magnitude faster time to solution in the host computer used by this group (a Compaq EV5 Alpha computer5) than they obtain by employing the special-purpose hardware back end. Since all hardware is a moving target, we are working on a new version of the code that will have a similar relationship to the current-generation GRAPE-6 (a 64-teraflop computer for stellar dynamics studies). The gains that we anticipate are summarized in Table 3.
|
| Table 3
Software and hardware for integrating planetesimals. |
|
|
|
|
|
| Gain factors |
|
I. Improved software gain
(factors do not all accumulate; adopt 20, with 5 as a reserve factor) | 20 |
| Gravity module (now coding) | 4 |
| Improve tree build and walk | |
| Use tensor multipoles for recursion | |
| Enable local expansions | |
| Collision module (testing) | 4 |
| New collision-detection scheme | |
| Timestepping (testing) | 5–30 |
| Use MVS with collision detection | |
| II. Total hardware gain | 240 |
| Use 512 CPUs (available) | 4 |
| Run 600 hours (available) | 10 |
| Run on current CPUs | 10 |
(special functions used extensively; baseline via CRAY T3E (300 MHz), boosted via Compaq EV5 Alpha) | |
| III. Evolution | |
| Particle number decreases as system evolves | 10 |
| Increase initial cross section to accelerate number decrease | 3 |
|
| Total gains | 10,000 |
| Emergency reserve factors | 15 |
|
Total speedup relative to HARP-2 for 106 particles | 500,000,000 |
| Alpha EV5/HARP-2 ratio on test case = 10 | |
| Algorithmic scaling including HARP-2 efficiency gains | 2,000 |
| 200× speedup on 512-node machine | |
| 10× gain from next-generation CPUs | |
| 20× gain from software | |
|
Note: In this table, we have included some “optimistic” projections as reserve factors (shown in italics). We think that much of this reserve of 15× speedup is likely to be realized, but we expect to encounter some problems achieving every goal in the table, so we hold it in reserve to be reasonably conservative.
|
|
The challenge is to predict when particles will change between the regimes of weak and strong interaction. One method we are considering is to construct a new binary tree ordered by perihelion and aphelion. Those particles with orbits separated by less than a Hill sphere are flagged for further testing. This screening has a cost of N log N and is performed only once per long Kepler step. Flagged pairs of particles with phases that are certain to stay separated over the integration step are reset. The remaining particles are tested by solving Kepler's equation in an elliptical cylindrical coordinate system to determine the time of actual Hill sphere overlap. Switching between Hamiltonians is not strictly symplectic, but it occurs infrequently enough for any given particle that it is not a concern. Dissipating collisions are inherently non-symplectic anyway. (There is some skepticism from colleagues about switching the integration scheme. True conservatives are also worried that the normal tree codes are not momentum-conserving. There is a momentum-conserving variant [67], but it loses this property with adaptive timestepping.) Once particles separate beyond their Hill spheres (or merge), they are returned to the Kepler drift scheme. Although much work remains to be done, the reward will be the first self-consistent direct simulation of planetesimals evolving into planets in a realistic disk. The results can be used to study related problems, such as the formation of planetary satellites, orbital migration of giant planets in a sea of planetesimals, and ultimately the ubiquity and diversity of extra-solar planetary systems.
To resolve the collisions and their outcomes, we are testing a scheme in which the particles are replaced by “rubble piles” when they collide. Rubble piles have little or no tensile strength; they are assemblages of particles that are held together only by gravity [68]. This approach captures the dynamics of gravitational reaccumulation explicitly without relying on questionable scalings of strength-regime laboratory experiments to gravity-regime km-size planetesimals. After a fixed interval, particles that have accumulated into sufficiently large bodies are replaced with single large particles, while the remaining debris is treated as “dust” that slowly accumulates on larger particles and drags on their orbits over time. We believe that this mechanism may be responsible for the low orbital eccentricities of present-day terrestrial planets, a property that has so far eluded all numerical simulations of planet formation.
Combining special hardware and advanced algorithms
In all of the codes that we run, gravity plays an important role. The calculation of gravitational interactions is essential and is normally a large fraction of the simulation cost. The GRAPE consortium has built special hardware to accelerate the gravitational calculations. We have noted that our algorithms lead to faster times to solution on their host machine than would be achieved by using their special-purpose hardware and simpler algorithms. What about using both spatially and temporally adaptive algorithms and their special hardware?
The effort involved in creating a typical gravity-based code comprises the following: 1) tree construction and traversal (including neighbor list determinations), 20%; 2) computation of pairwise gravitational interactions, 50%; and 3) computation of periodic boundary conditions, special functions for planetary integrations, SPH (smooth particle hydrodynamics) gas physics, etc., 30%. To date, special-purpose hardware has focused on making the pairwise gravitational interactions blindingly fast. In the limit that it is infinitely fast, this provides only a factor of 2 speedup. However, obtaining even half of this gain is a severe challenge to the I/O designs of the special hardware. For example, GRAPE was originally designed for the specific problem of globular cluster dynamics, where it has had a number of successes [69]. These systems have 106 stars, so there is no reason to use a larger N. This sets a very modest limit to the amount of memory needed in the special box. With a “sum over all pairs,” it means that for every 24 bytes transferred for the particle positions (and returned for the particle velocities), one does 3 × 107 floating-point operations. Thus, the ratio of the I/O channel speed to the floating-point speed need be only roughly 2 MB/s for every teraflop, so a peripheral component interconnect (PCI) bus is sufficient to connect to a teraflop of hardware. However, in our simulations, there are typically only a few hundred pairwise interactions calculated, so the ratio needed to spend more time on computation than on data transfer is 10ß GB/s for every teraflop, where ß is the number of times that the particle is loaded into the special hardware (it has to be there every time it appears on an interaction list, so with efficient traversals and caching, we might keep ß 10). Thus, it is not straightforward to use special hardware with aggressive algorithms.
Looking toward the future
The astute reader will notice that the phenomenal algorithmic progress shown in Figure 2 stopped a couple of years ago. There are both scientific and social reasons for this. It is extremely important for the fidelity of the scientific results to ensure that the integrators are symplectic. There are many subtle attributes of these aggressive algorithms that will alter the Hamiltonian character of the problem. We were surprised to discover that some of the problems were easy to address, while others were far more difficult. One example was that all past cosmological simulations had performed the integrations in noncanonical coordinates [21], resulting in a strong transient energy change in the early stages of the integration when the expansion rate of the universe was high. Since it had been included in every simulation, it became a “feature.”
We are concerned about a number of other “features” and have been engaged in extensive tests to understand them. These derive from the fact that the forces in a tree code are normally not symmetrical, so that Newton's law of equal and opposite forces (which ensures conservation of global momentum) is not strictly upheld. When a code explicitly conserves a quantity, it is a bug check. When it does not, as in this case, it becomes something to monitor as a reality check on error propagation. We are more concerned about putting in “switches” to the integrators that change timesteps in a fixed integration scheme or completely change the integration scheme, as in the collisions of planetesimals. This requires extensive testing.
The last section makes it clear that the codes have enough complexity that when one element is greatly speeded up, the bottlenecks move elsewhere. This is also true at a higher level. The simulation code stopped being the main bottleneck to scientific production. We had a several-year push to bring the machines and algorithms to the level at which we could do simulations fairly easily with N 107, and it required only mild heroics to run N 108. This also meant that N 106 became a problem for workstations. This opened up a wealth of scientific possibilities, which was the main driving force in the first place. It also meant that we could generate enormous data sets that require analysis for scientific results.
What we need now is more flexible analysis code and a greater range of physics represented in the codes. We also want to create a venue to test and compare codes. There is demand for using Couchman's adaptive mesh techniques for the early stages of cosmological simulation and then switching to tree methods. In some cases, it would certainly be useful to incorporate GRAPE or Blue Gene* hardware [70]. For such cases, we might want to couple our framework to Blue Matter [71], the molecular simulation software to be used on Blue Gene. Both of these are broader community needs and must incorporate work from multiple groups. To this end, we started the “Whole NChilada Project”6 to tackle the simulations, data handling, and analysis. This process mirrors the evolution that has occurred in a number of simulation projects and is reflected in the changes of support from the HPCC (High Performance Computing and Communications) projects in multiple agencies to the newer initiatives such as the NSF (National Science Foundation) ITR (Information Technology Research), the NASA Earth System Modeling Framework and its broader Computing Technologies Project, and the interagency National Virtual Observatory. It also reflects the success of the first-generation University of Washington N-body shop; many of the former students and postdoctoral fellows are now group leaders at other universities. NChilada is a vehicle for realizing collective synergies from the efforts of allied but individual groups.
Summary: Virtual petaflops
Past planetesimal simulations used codes with an algorithmic complexity that would be similar to the point labeled “full tune” in Figure 2 and computers with speeds of 10 Mflops. (Special-purpose “GRAPE” hardware of 106 Mflops has been used, but such implementations involve sums over all interactions, and so they are closer to the direct-sum case in floating-point cost [60].) Our algorithms result in a collective speedup of 108 for simulations with N 107 (with rough accounting for the reduction in N over the course of the simulation in the solar system formation runs).
We have emphasized that both hardware and algorithm improvements are required to achieve the performance gains that we require. Figure 2 shows that the speedup factor from algorithms vastly exceeds that of the hardware. It is not sufficient to simply wait for computers to get better, nor does it seem to pay to build special hardware. McMillan et al. [72] asserted that if GRAPE-6 were built, it could use its petaflop speed to follow 106 planetesimals by 2003. They claimed that this would offer a seven-year advantage over general-purpose computers that would only be able to follow 104 particles by the year 2000. As of September 2003, a 64-teraflop GRAPE-6 system has been built, and smaller GRAPE-6 systems garnered IEEE Gordon Bell prizes in 2000 and 2001.
Our test simulations without the new integrator were already ten years ahead of their projections [63]. Our approach has outpaced special hardware efforts over the last several years, and we believe that it will continue to do so for the foreseeable future. Sir Isaac would love to see the enhancement of “the entire human intellect” by high-performance computing.
Acknowledgments
This project is supported by the NASA High Performance Computing and Communications/Earth and Space Sciences (HPCC/ESS) program, Innovative Research Program (IRP), Applied Information Systems Research (AISR), and the NIST Advanced Technology Program (ATP), the NSF Knowledge and Distributed Intelligence (KDI) and Information Technology Research (ITR) programs, and the Washington State University Band Professorship fund.
Footnotes
*Trademark or registered trademark of International Business Machines Corporation.
**Trademark or registered trademark of Cray Inc.
1In Ancient Greek, chaos was “the great abyss out of which Gaia flowed.”
2Cray, Inc., Seattle, WA.
3RIKEN, Wako, Japan.
4Sundance Corporation, Reno, NV.
5HP Compaq Corporation, Palo Alto, CA.
6http://hpcc.astro.washington.edu/nchilada/bin/view/Nchilada/WebHome.
Received October 16, 2003;
accepted
for publication December 17, 2003; Internet publication March 2, 2004 |