IBM®
Skip to main content
    Country/region [change]    Terms of use
 
 
 
    Home    Products    Services & solutions    Support & downloads    My account    

IBM Journal of Research and Development

Blue Gene   Volume 49, Number 2/3, 2005
Table of contents: HTMLPDF This article: HTML PDFDOI: 10.1147/rd.492.0457Copyright info

Scalable framework for 3D FFTs on the Blue Gene/L supercomputer: Implementation and early performance measurements

by M. Eleftheriou,
B. G. Fitch,
A. Rayshubskiy,
T. J. C. Ward,
and R. S. Germain

This paper presents results on a communications-intensive kernel, the three-dimensional fast Fourier transform (3D FFT), running on the 2,048-node Blue Gene®/L (BG/L) prototype. Two implementations of the volumetric FFT algorithm were characterized, one built on the Message Passing Interface library and another built on an active packet Application Program Interface supported by the hardware bring-up environment, the BG/L advanced diagnostics environment. Preliminary performance experiments on the BG/L prototype indicate that both of our implementations scale well up to 1,024 nodes for 3D FFTs of size 128 × 128 × 128. The performance of the volumetric FFT is also compared with that of the Fastest Fourier Transform in the West (FFTW) library. In general, the volumetric FFT outperforms a port of the FFTW Version 2.1.5 library on large-node-count partitions.

Introduction

One of the most commonly used computational approaches for modeling large biophysical systems is the molecular dynamics method, in which the Newtonian equations of motion are solved numerically. Molecular dynamics permits the computation of thermodynamic and dynamic properties of biological systems [1]. Such studies can enhance our understanding of biological functions and provide insight into processes related to the action of potential new medications [2]. One of the most computationally expensive components of this method is the calculation of long-range electrostatic forces [3]. A commonly used technique for computing these forces is the particle-particle particle-mesh Ewald (P3ME) technique [4], which requires the evaluation of a convolution using three-dimensional fast Fourier transforms (3D FFTs). Our interest has been to provide an efficient and scalable 3D FFT implementation on Blue Gene*/L (BG/L). The metric for efficiency used here is the “total time-to-solution” for the problem. Our use cases for the 3D FFT from molecular simulation require strong scaling for relatively small data sizes, such as 323, 643, and 1283.

A distributed 3D FFT represents a considerable challenge for the communications infrastructure of a parallel machine because of the all-to-all nature of the distributed transposes required, and it stresses aspects of the machine that complement those addressed by other benchmark kernels, such as Linpack [5].

A typical decomposition for performing a 3D FFT in parallel is slabwise. With a slab decomposition, the data is partitioned along a single axis. For example, to compute an N × N × N FFT on P nodes, each node would be assigned a slab of size N × N × N/P. While effective in reducing communications costs, the scalability of this method is limited by N, the extent of the data along a single axis. This becomes an issue when one wants to scale to very large node counts for a massively parallel machine, such as BG/L.

To address the above issues, we implemented a 3D FFT that is a good match to the volume domain decomposition natural to BG/L. The volumetric FFT achieves scalability to a larger number of nodes because it allows the distribution of work for an N × N × N FFT over a maximum of N2 nodes rather than N nodes for a slab decomposition, but at the cost of additional communication. The description of an earlier version of the volumetric FFT implementation and its performance on a more conventional machine (POWER4* full bisectional bandwidth switch) was previously published [6].

The major contribution of our work is a highly scalable 3D FFT framework for BG/L that can use any serial 1D FFT as a building block. To assess the scalability and the performance of the volumetric FFT, we provide the following:

  • Scalability characterization of our implementations on BG/L using two communication protocols.
  • A comparison with a port of Fastest Fourier Transform in the West (FFTW) [7] on BG/L.

We present a detailed review of the published FFT algorithm [6] and describe a variation to the algorithm that increases its scalability (i.e., increases the number of nodes on which it is able to run). The algorithm was reimplemented to improve the uniprocessor performance and to allow performance comparison on two communication layers: Message Passing Interface (MPI) and active packets. The performance of the volumetric FFT on BG/L for both communication layers is presented. The measured performance is compared with limits inherent in the hardware capabilities of the machine, such as the bandwidth of the BG/L torus communications network. Our measurements show that the volumetric FFT scales well on the BG/L architecture. The volumetric FFT now comes within 50% of the performance of FFTW on a single processor, while outperforming FFTW at larger node counts for a 1283 FFT. It is important to note that the MPI implementation of the volumetric FFT uses the FFTW 1D serial FFT as a building block.

The paper is organized as follows. We review the volumetric decomposition algorithm for computing the 3D FFT and describe the modified implementation that allows the method to scale to larger numbers of processors. Also in that section, we provide details of the 3D FFT kernel implementation on two different communication protocols. The hardware limits to the ultimate performance of the 3D FFT on a BG/L machine based on the hardware “speeds and feeds” [8] are next discussed, followed by a description of our experimental measurements and corresponding results. Finally, we summarize our work.

Parallel implementation

For a machine with a 3D torus and mesh interconnect, such as BG/L, it is natural to use volumetric decomposition to map the data to perform a 3D FFT onto the machine. Let us assume an array A of complex numbers Nx × Ny × Nz distributed onto a Px × Py × Pz logical processor mesh. Processor (x, y, z) initially stores a block of data nx × ny × nz with

Equation a

to a local array A′. Then,

A′(ijk)[xyz] ≡ A(xnx + i, yny + j, znz + k).

To allow scalability beyond that obtainable with the previous proposed implementation, we impose a new restriction on the sizes of the local array, A′:

nx · ny = alpha × Pz,

nx · nz = β × Py,

and

ny · nz = gamma × Px,

where alpha, β, and gamma are integers.

In evaluating a 3D FFT on an array A of size Nx × Ny × Nz, the “rowcolumn” method is used to decompose the problem into successive evaluations of 1D FFTs along each dimension. More specifically, we perform the 3D FFT in three phases; we first compute Nx × Ny 1D FFTs along the z dimension, then Nx × Nz 1D FFTs along the y dimension, and finally, Ny × Nz 1D FFTs along the x dimension. Figure 1 indicates where the all-to-all exchanges of data take place for each of these three phases.

Figure 1 Figure 1

Each phase of the 3D FFT comprises the communication and computation associated with one of the three dimensions. The communication and computation associated with each phase are analogous. Here we describe only the first phase.

In the first phase, all Nx × Ny 1D FFTs of size Nz are independent. Therefore, it is sufficient to consider only the case of the 1D grid of Pz nodes that has to compute nx × ny 1D FFTs of size Nz. Suppose A(0:nx − 1,0 : ny − 1,0 : Nz − 1) is an nx × ny × Nz array of complex numbers, block-distributed along the z dimension onto Pz nodes. If

A′(0:nx − 1,0 : ny −1,0 : nz − 1)[p]

is the local array in node p, then

A′(ijk)[pz] ≡ A(ijpnz + k).

The data along the x and y dimensions is redistributed onto the Pz nodes. That is, if A″(0:alpha,0 : Nz − 1)[p], with nx · ny = alpha × Pz, is the local array in node p after the redistribution, then

Equation b

Each node p sequentially computes a independent 1D FFTs of size Nz that are stored in its local array A″. For these 1D FFT computations, we typically use the services provided by a port of the FFTW library or the FFT library [9] from the Technical University of Vienna to BG/L, although we have also used a simple 1D FFT implementation developed by our group for the packet layer work. Once the FFTs are computed, data is redistributed to the original distribution described by Equation (1).

In summary, the computation of a 3D FFT is performed in three phases, along the z, y, and x dimensions. In each phase, we first redistribute the data, perform 1D FFTs, and return the data to its original distribution. The number of communication operations in our implementation is minimized by combining the post-FFT data redistribution of one phase with the pre-FFT data redistribution of the next phase. Thus, we reduce the data redistributions to three, as opposed to six.

Hardware limits on 3D FFT performance

Lower bounds can be placed on the bandwidth-limited communication time for the 3D FFT assuming that three all-to-all communications (along a row or within a plane) are required. Simulations of the BG/L torus network indicate that the all-to-all communication time for data that is distributed over a set of nodes in a line, plane, or volume can be estimated using the expression1

Equation c

where Vreceived is the volume of data received by each node; Nhops is the average number of hops required (for a 3D torus in which each dimension is p, Nhops = p/4 for alltoall in a line, Nhops = p/2 for alltoall in a plane, and Nhops = 3p/4 for alltoall in a volume); Nlinks is the number of links available to each node (two for linear communication, four for planar communication, and six for volumetric communication); BW is the raw bandwidth of the torus per link (two bits per processor clock cycle); and f is the link utilization (assumed to be 80%). This expression indicates that the time required for all-to-all communication is independent of the dimensionality of the communication because the increase of the average hop count with dimensionality is compensated for by the increase in the number of links available for the communication. At the limits of scalability, this expression, on the basis of bandwidth considerations, will become inadequate because the hardware and software latencies associated with sending a packet will become significant.

For an idealized bound on the computation time required to compute a single 1D FFT of length N, we assume 8N log2 N cycles for a fused multiply–add machine (although the ideal limit for a 1D FFT is 5N log2 N floating-point operations, data dependencies force a fused multiply–add machine to use eight cycles when one assumes that a fused multiply–add is issued every cycle). Highly optimized FFT implementations, such as the library developed by the team at the Technical University of Vienna [9], can approach this idealized value. However, the nature of the 3D FFT is such that it is communications-bound for even small node counts, and the performance of the 1D FFT building block is not critical.

Parallel performance analysis

All of the performance-measurement results presented in this section were obtained on the 2,048-node Blue Gene/L prototype comprising two racks. Each rack has two 512-processor midplanes, each midplane comprises 16 node cards, each node card contains 16 compute cards, and each compute card contains two nodes. The processor clock speed in the current prototype is 700 MHz. The prototype can be configured to complete a torus in all three dimensions when it is partitioned as 512, 1,024, or 2,048 nodes. Unless otherwise noted, the scalability measurements reported here were taken in mesh mode to simplify the interpretation of the scalability data.

We performed benchmarks of our FFT implementation on the 2,048-node BG/L prototype in which we varied both the number of nodes and the size of data being transformed. For each experiment, the execution time was measured at a series of node counts, while the FFT size was held fixed. The MPI tasks were organized in a 3D grid using the MPI topology constructs. The running times reported here for the MPI implementation are averages of ten runs. Each experiment was run 11 times, and the first run was discarded. The active packet layer execution times were derived from application-specific trace data analyzed after the runs.

We compared the performance of the MPI implementation of the volumetric FFT using the 1D serial FFT from the FFTW library as a building block, and the 3D FFT from the FFTW library [10] on the mesh network. FFTW is a well-established and widely used library in the computational science community for computing multidimensional FFTs. It relies on the slab decomposition approach for running on multiple nodes.

Figure 2 shows the speedup of both the volumetric FFT and FFTW. The speedup achieved by FFTW is observed to be good up to 128 processors. However, the volumetric FFT continues to exhibit good speedups through 1,024 nodes for the 128 × 128 × 128-size FFT, while FFTW loses parallel efficiency above 128 nodes. This flattening on the speedup occurs because FFTW is based on the slab decomposition, which limits its scalability to 128 nodes.

Figure 2 Figure 2

The time-to-solution of the volumetric FFT implementations on both communication layers and FFTW are shown in Figure 3. The sequential FFT from the FFTW library is about 50% faster than the single-processor volumetric FFT using the MPI implementation. While both the FFTW and the volumetric FFT have comparable performance up to 128 nodes, the volumetric FFT implementations on both communication protocols continue to scale up to 1,024 nodes.

Figure 3 Figure 3

Figure 4 shows the measured execution times for the volumetric 3D FFT as implemented using MPI collective communications and the active packet interface on the 2,048-node prototype. Speedups in excess of 250 are observed for 1283 FFTs at 1,024 nodes for both the MPI and active packet implementations. We observe comparable performance on both communication layers up to 256 processors on the mesh topology. We have currently limited the data available on the two modes—mesh and torus—for a larger number of processors, and that prevents us from providing a complete characterization of the performance characteristics of all implementations on all of the available mesh and torus modes. However, we expect the active packet implementation to be faster at large node counts as a result of the smaller software overheads in the active packet implementation. In addition, the performance of the MPI implementation at the limits of scalability of the FFT should improve as more optimized implementations of the MPI_Alltoall [v] collective operation become available.

Figure 4 Figure 4

The comparison of the bounds on execution time—based on hardware bandwidths and idealized 1D serial FFT execution times with the measured total time to solution in Figure 4—shows significant divergences at small node counts. We conjecture that this divergence from the limits estimated from hardware capabilities is caused by memory hierarchy effects, since floating-point efficiencies of the 1D FFTs used as building blocks for the 3D FFT implementation are fairly high; for example, for a 64-point FFT, the efficiency of the FFT from the BG/L FFT library supplied by the Technical University of Vienna is more than 60%, and the naively vectorized FFT implementation used in the active packet implementation has an efficiency of more than 40%.

The memory access pattern of the in-memory transpose required for the 3D FFT is presumably very unfavorable for prefetching, and at low node counts the memory footprint per node of the larger-size 3D FFTs will certainly spill out of the 4-MB L3 cache. No effort has been made to tile the implementation of the memory transposes in the volumetric FFT. Of course, at the high node counts that represent the limits to scalability for the FFT, the data will sit in cache and the measured performance will more closely approach the bandwidth-limited constraints on performance. Note that even without any effort at tiling, the uniprocessor performance of the volumetric FFT implementation is within 50% of that of the FFTW library implementation. We intend to use instrumentation available on the BG/L chip that can access the hardware performance counters to eventually measure the memory hierarchy effects.

Figure 5 shows the measured data in a way that tries to capture how closely measurements approach bandwidth-limited behavior, as discussed above. We conjecture that the deviations from bandwidth-limited performance at high node counts for the smaller FFT sizes are due to software and hardware overheads that become significant at very small message sizes.

Figure 5 Figure 5

In Table 1, data taken on a 512-node partition in both the torus and mesh modes is presented. If the execution time of the FFT is totally dominated by bandwidth effects, the ratio of Tmesh/Ttorus should equal 2. A ratio in excess of 1.8 is realized for the active packet implementation for the largest FFT size, 1283, which is impressively close to the ideal hardware behavior.


Table 1 The Blue Gene 512-way prototype can be operated with the torus network enabled as a torus or as a mesh (without wrapping). Taking data in both torus and mesh modes allows a check on the amount of time taken by the FFT that is attributable to network bandwidth effects since the effective bandwidth available doubles in torus mode.
Mesh dimensionActive packetMPI collectiveModel



323643128332364312833236431283

Tmesh (512)1.36 × 10−45.94 × 10−44.55 × 10−35.01 × 10−34.66 × 10−53.77 × 10−43.05 × 10−3
Ttorus (512)8.5 × 10−53.56 × 10−42.52 × 10−33.17 × 10−32.47 × 10−52.02 × 10−41.65 × 10−3
Tmesh/Ttorus1.6001.6691.8061.5781.8891.8701.851
Tnon-bandwidth3.40 × 10−51.18 × 10−44.90 × 10−41.34 × 10−32.74 × 10−62.63 × 10−52.46 × 10−4
Tbandwidth5.10 × 10−52.38 × 10−42.03 × 10−31.84 × 10−32.19 × 10−51.76 × 10−41.40 × 10−3
Tnon-bandwidth/[N3 log2(N)]2.08 × 10−107.5 × 10−113.34 × 10−119.12 × 10−111.67 × 10−111.67 × 10−111.67 × 10−11
Tbandwidth/N31.56 × 10−99.08 × 10−109.68 × 10−108.76 × 10−106.7 × 10−106.7 × 10−106.7 × 10−10

Conclusions

The parallel computation of three-dimensional FFTs has been studied from two different viewpoints. The first one is aimed at parallelizing the basic one-dimensional FFT [1113]; in the second, the transpose method, 3D FFTs are carried out by successive evaluations of independent 1D local FFTs along each direction [101417]. In this work we have presented a volumetric decomposition FFT that belongs to the second category. We have discussed the implementation of a 3D FFT for the Blue Gene/L supercomputer. We base our approach on a volumetric decomposition of data. This decomposition maps naturally to the torus topology of BG/L and allows scaling to very large numbers of nodes. We can also leverage any serial 1D FFT as a building block for our algorithm.

In this paper, we have reviewed the 3D FFT implementation and provided lower bounds on execution time based on the hardware capabilities of the BG/L supercomputer. In our measurements, we found that the volumetric algorithm performs impressively well up through 1,024 nodes on both the active packet and MPI communication layers. Moreover, we found that the volumetric FFT outperforms FFTW by a significant margin on large numbers of nodes. The results obtained thus far are extremely encouraging with respect to our ability to exploit the capabilities of the BG/L architecture in the context of a real application kernel and to enable the scalability of applications that depend on the evaluation of 3D FFTs to the very large node counts required to achieve new levels of performance.

At the limits of scalability, approached by the 323 FFT at 512 nodes, the active packet implementation is significantly faster than the MPI-based FFT. This performance difference will presumably narrow as further optimization of the MPI collectives takes place. Future work will involve instrumenting the code to understand the role of memory access patterns in the performance at small node counts and continuing optimization of the implementations on both communication layers.

*Trademark or registered trademark of International Business Machines Corporation.

References


Footnote

1A. Gara, P. Heidelberger, and B. D. Steinmacher-Burow, private communication with author.

Received July 20, 2004; accepted for publication September 24, 2004; Published online April 12, 2005.


    About IBMPrivacyContact