IBM Skip to main content
  Home     Products & services     Support & downloads     My account  
  Select a country  
Journals Home  
  Systems Journal  
Journal of Research
and Development
  ·  Current Issue  
  ·  Recent Issues  
  ·  Papers in Progress  
  ·  Search/Index  
  ·  Orders  
  ·  Description  
  ·  Patents  
  ·  Recent publications  
  ·  Author's Guide  
  Staff  
  Contact Us  
IBM Journal of Research and Development  
Volume 44, Number 6, 2000
Advanced microprocessor design
 Table of contents: arrowHTML arrowPDF arrowASCII   This article: HTML arrowPDF arrowASCII    DOI: 10.1147/rd.446.0823 arrowCopyright info
   

Minimal-storage high-performance Cholesky factorization via blocking and recursion

by F. G. Gustavson and I. Jonsson
We present a novel practical algorithm for Cholesky factorization when the matrix is stored in packed format by combining blocking and recursion. The algorithm simultaneously obtains Level 3 performance, conserves about half the storage, and avoids the production of Level 3 BLAS for packed format. We use recursive packed format, which was first described by Andersen et al. [1]. Our algorithm uses only DGEMM and Level 3 kernel routines; it first transforms standard packed format to packed recursive lower row format. Our new algorithm outperforms the Level 3 LAPACK routine DPOTRF even when we include the cost of data transformation. (This is true for three IBM platforms—the POWER3, the POWER2, and the PowerPC 604e.) For large matrices, blocking is not required for acceptable Level 3 performance. However, for small matrices the overhead of pure recursion and/or data transformation is too high. We analyze these costs analytically and provide detailed cost estimates. We show that blocking combined with recursion reduces all overheads to a tiny, acceptable level. However, a new problem of nonlinear addressing arises. We use two-dimensional mappings (tables) or data copying to overcome the high costs of directly computing addresses that are nonlinear functions of i and j.

1. Introduction

We present a novel practical algorithm for Cholesky factorization by combining recursion and blocking. Our aim is to conserve storage and to simultaneously obtain Level 3 performance. Also, we want to avoid producing Level 3 BLAS for packed format.

In [1] a new data format called recursive packed format was used to produce a new algorithm for Cholesky factorization. We use that format to produce a Level 3 performing code using only DGEMM and Level 3 kernel routines. By so doing we achieve the algorithm alluded to in the first paragraph. Our algorithm requires that the user's matrix AP, input in lower packed format [2, 3], first be transformed to packed recursive lower row format, PRLF. Our new algorithm is then applied to PRLF, which has overwritten the input matrix AP. Standard algorithms for Cholesky factorization are given in Version 3 of LAPACK. They are called DPPTRF and DPOTRF. The PP algorithm accepts input in packed format AP, while the PO algorithm requires input to be in full format, which requires about twice the storage of packed format. However, on many platforms, including RISC workstations, the LAPACK PO codes perform about three times faster. It is our view that most users prefer higher performance, so they use twice the storage. However, some applications can only run using half the storage, and these applications must use the packed routines.

Our new algorithm outperforms the LAPACK Level 3 DPOTRF even when we include the cost of the data transformation from packed lower format to packed recursive lower format. A slight drawback is that the data transformation requires a temporary buffer of size 1/8n2, which is allocated and then deallocated.

The new algorithm was implemented and tuned for the ESSL library, in which only lower packed format is supported. However, ESSL provides DPPF, which produces both an LLT factorization (Cholesky) and an LDLT factorization (Gauss). Additionally, like LAPACK, the ESSL library provides routine DPPICD, which computes the inverse matrix AP–1. We mention that we produced new codes for LDLT and matrix inverse for data in packed recursive lower row format. However, we describe only the Cholesky factorization LLT in this paper.

For large matrices, blocking is not needed, because the pure recursive algorithm for Cholesky gives acceptable performance; see [1] for details. However, for small matrices the overhead of recursion slows down the performance to an unacceptable level for a high-performance library such as ESSL. This paper examines in detail the overheads related to pure recursion. We introduce blocking and combine it with recursion to overcome the losses due to the overhead by essentially reducing the overhead to a tiny acceptable amount. However, because the packed recursive formats do not allow for any data expansion, we are faced with a new problem: nonlinear addressing associated with packed recursive storage format. We solve this problem by one of two methods. The first is to use a two-dimensional mapping (table) whose (i, j) entry is the location in PRLF where a(i, j) is stored. The other method is to data-copy a small submatrix A in PRLF to a buffer which stores A in full format so that standard linear addressing can be used. Additionally, our new algorithms require only that two mappings be generated, which results in only a tiny amount of additional storage. Also, since these mappings are used many times by the algorithm, the cost of their generation is amortized by their multiple reuse.

Our paper is organized into five sections, including the Introduction as Section 1. Section 2, called Algorithmic considerations, contains six subsections. In the first, we discuss the existing routines for doing Cholesky and LDLT factorization on matrices stored in packed format. We give our recursive formulation of the Cholesky factorization, called CHOL, in the second subsection, together with sketches of the conventional LAPACK algorithms. In the third, a new data format is presented which enables routines to make better use of high-performance DGEMM routines. The fourth subsection shows how the DTRSM and DSYRK routines are affected by the recursive packed format. We produce recursive algorithms TRSM and SYRK and show proofs of correctness. In the fifth subsection, we explain why one of the formats gives a better data access pattern. Finally, in the sixth subsection, we provide an outline of how to transform data in standard packed format to recursive packed format. Section 3 is titled Algorithmic components. In the first subsection, we describe the problem of a large recursion tree; the second describes our blocked version and how we overcome the problem. We present block counterparts of CHOL, TRSM, and SYRK, which we call BC, BT, and BS, and we briefly develop a major theme of computer science—that an algorithm is tied to its data structure when we apply this theme to BC, BT, and BS. The third subsection describes three types of kernel routines, two of which are the mapping approach and the data entry approach described briefly above. The third type, called the compiled code approach, was not used.

We see in Section 4 that we obtain excellent Level 3 performance using recursive packed row format for Cholesky factorization, especially for large n. The new algorithm uses half the storage (an allocation of 1/8n2 elements is returned after the data transformation is complete) and performs better than most Level 3 implementations. Also, the majority of the processing is done in calls to DGEMM on submatrix blocks of variable size. In the SMP environment, ESSL's DGEMM automatically uses the existing threads to obtain excellent SMP performance. These two factors automatically make our new recursive Cholesky implementation an SMP parallel implementation.

In general, a negative feature of the recursive approach, and for this data format in particular, is that the addressing of the individual (i, j) elements in any triangle becomes nonlinear. This drawback is inevitable: It is ironic, however, that good data locality comes at the cost of making its reference patterns nonlinear.

In Section 5, we give our conclusions, summarizing the successful features of our new algorithm. We briefly indicate why packed format should become more important, and the role recursion plays in its return to prominence. Next, we discuss some other results of our paper: why recursion is particularly suited to the Cholesky algorithm, results not covered, and issues related to Level 2 and Level 3 packed BLAS. Finally, we explain why our current algorithm is not the fastest-performing Cholesky routine.

2. Algorithmic considerations

In this section, we show why routines using the packed format have not displayed high performance. We show how to overcome this by using a recursive data format.

  Preliminary remarks and rationale
Existing codes for Cholesky and LDLT factorization use either full storage or packed storage data format. In earlier times, when there was a uniform memory hierarchy, packed format was usually the method of choice because it conserved memory and performed about as well as a full-storage implementation. See the LINPACK User's Guide [4] for more details.

ESSL [3] and LAPACK [2] support both data formats, so a user can choose either for his application. Also, the ESSL packed storage implementation has always been Level 3, sometimes at great programming cost. On the other hand, LAPACK and some other libraries do not produce Level 3 implementations for packed data formats.

For portability and other reasons, mainly performance and ease of programming, users today generally use the full data format to represent their symmetric matrices. Nonetheless, saving half the storage is an important consideration, especially for those applications which will run with packed storage and fail to run with full storage.

Another important user consideration is migration. Many existing codes use one or the other or both of these formats. Producing an algorithm such as Cholesky using a new data format has little appeal, since existing massive programs cannot use the new algorithm. Thus, the approach we took in ESSL was to redo its packed Cholesky and LDLT factorization codes but then instead use the new recursive packed data structure.

The idea was simple: Given AP holding symmetric A in lower packed storage mode, overwrite AP with A in the recursive packed row format. To do so requires a temporary array, which we allocate and then deallocate, of size 1/8n2. Next, we execute the new recursive Level 3 Cholesky algorithm using only the ½n2 original storage. Even when one includes the cost of converting the data from lower packed to recursive packed lower row format, the performance, as we will see, is better than that of the LAPACK Level 3 DPOTRF.

In summary, the users can now obtain full Level 3 performance using packed format and thereby save about half of their storage.

  Recursive Cholesky factorization
Our recursive algorithm of the Cholesky factorization of A uses a divide and conquer procedure, which gives rise to a binary tree. At every non-leaf node, the identical algorithm executes on one of the submatrices of A. In Figure 1, a parent node and its two children depict this situation.

Figure 1Figure 1

Let a submatrix A of size n be associated with the parent node. Since A is symmetric, we are dealing with an isosceles right triangle of size n. Divide A into two congruent triangles of size n/2 and a square between them. The computation proceeds in the direction of the arrows, and Equation (1), which constitutes Equations (2) to (5) (see below), describes the computation that is done at the parent node. Now we formally describe our algorithm.

In our recursive algorithm, the Cholesky factorization of a positive definite symmetric n × n matrix A,

A = open parenthesis A11 A21T close parenthesis = LLT = open parenthesis L11 0 close parenthesis open parenthesis L11T L21T close parenthesis , (1)
A21 A22 L21 L22 0 T22

is initiated by a recursive Cholesky factorization of the upper left square matrix A11 of order n1 = left floorn/2right floor; i.e.,

A11 = L11L11T. (2)

The lower left matrix A21 is then transformed into L21 by the multiple solving of n2 = left ceilingn/2right ceiling triangular systems of equations (each of size n1 = left floorn/2right floor),

L21L11T = A21. (3)

Then, the lower right square matrix of order n2 is symmetrically rank-k updated by L21; i.e.,
A tilde22 = A22L21L21T. (4)

Finally, the matrix A tilde22 is recursively factored,

A tilde22 = L22L22T. (5)

The recursion in Equations (2) and (5) stops when the matrices to be factored A11, A22 have small order or, if full recursion is used, have order 1.

Proof of correctness     Our method of proof is mathematical induction. The recursion takes place on the order n of A. We want to prove correctness for n = 1, 2, · · · . However, recursion breaks the problem into two nearly equal parts, n1 = left floorn/2right floor and n2 = left ceilingn/2right ceiling = nn1. On the basis of this observation, we use mathematical induction on k = log2 n, k = 0, 1, · · · .

Suppose that the result is true for 0 < n less or = to 2k. Then we establish the result for all j, 2k < j less or = to 2k+1. Initially we need to establish the result for n = 1.

Now we give the proof: For n = 1, we compute l11 = square roota11. Since A is positive definite, we know all principal minors are positive, so l11 exists. Now suppose that the result is true for 0 < j less or = to 2k. Let 2k < j less or = to 2k+1, j1 = left floorj/2right floor and j2 = jj1. Since j > 1, we do the computations indicated by Equations (2), (3), (4), and (5) in that order.

Equation (2) is satisfied by the induction hypothesis. Equation (3) is a calculation. It can be performed since the triangular matrix L11 has nonzero diagonal elements (in fact, they are positive). Equation (4) is also a calculation. Finally, Equation (5) is satisfied by the induction hypothesis. Using Equations (2)–(5), we want to show that Equation (1) is true. However, this is trivially true because all that is needed is to follow the rules of 2 × 2 block multiplication where the partitioning of A is j = j1 + j2. The conditions that the block elements of L11, L21, L22 must satisfy are exactly those of Equations (2)–(5), which we have shown to be true. box

Figure 2 gives the details of the recursive algorithm CHOL(n). We assume that A is an n × n positive definite symmetric matrix. The algorithm will detect that A is not positive definite symmetric in the if clause when and only when it finds aii less or = to 0 for some i. In the else clause there are two recursive calls, one on matrix A(1:n1, 1:n1), the other on matrix A(j1:n, j1:n); the two computations solve XA(1:n1, 1:n1)T = A(j1:n, 1:n1) and update A(j1:n, j1:n) = A(j1:n, j1:n) – A(j1:n, 1:n1) • A(j1:n, 1:n1)T. Here we use colon notation; see page 19 of [5] for a definition. The two computations consist mostly of calls to DGEMM [actually, the first computation is DTRSM (multiple triangular solve) and the second is DSYRK (rank-k update of a symmetric matrix); however, both DTRSM and DSYRK consist mostly of calls to DGEMM]. The correctness of the algorithm CHOL follows immediately from our induction proof.

Figure 2Figure 2

Figures 3 and 4 give annotated descriptions of the algorithms DPOTRF and DPPTRF. These two routines are the LAPACK Level 3 and Level 2 versions of Cholesky factorization, with uplo = 'L'. The former uses full storage and runs about three times faster than the latter, which uses packed storage.

Figure 3Figure 3 Figure 4Figure 4

The algorithm in Figure 3 is a block nb left-looking algorithm. The algorithm in Figure 4 is a Level 2 right-looking algorithm. In earlier versions of LAPACK, the algorithm DPOTRF was right-looking; i.e., it was the Level 3 block analog of Figure 4.

In Section 4 we compare the performance of DPOTRF and DPPTRF versus our new algorithm BC, which is the blocked form of the algorithm CHOL.

  Packed recursive format
In the previous subsection we gave the recursive Cholesky algorithm and proved its correctness for data assumed to be stored in full lower format. In this section we describe the new recursive packed formats. These data formats are modeled in the same way as the recursive blocked row and column formats of a triangle (see [6] for details). This modeling was first described by Andersen, Gustavson, and Was accentniewski in [1]. It turns out that the recursive Cholesky algorithm will work on all four instances of this new data structure, referred to as column lower, column upper, row lower, and row upper (Figure 5). In each case, the algorithm and its proof are similar to the proof of the preceding subsection; hence, they are not repeated.

Figure 5Figure 5

These packed recursive data formats are hybrid triangular formats consisting of (n – 1) full-format rectangles of varying sizes and n triangles of size 1 × 1 on the diagonal. They use the same amount of data storage as the ordinary packed triangular format, i.e., = n(n + 1)/2. Because the rectangles (square submatrices) are in full format, it is possible to use high-performance Level 3 BLAS on the square submatrices. The difference between the formats is shown in Figure 5 for the special case n = 7.

Notice (as in Figure 6, shown later) that each of the original triangles is split into two triangles of sizes n1 = n/2 and n2 = nn1 and a rectangle of size n2 × n1 for lower format and n1 × n2 for upper format. The elements in the upper left triangle are stored first, the elements in the rectangle follow, and the elements in the lower right triangle are stored last. The order of the elements in each triangle is again determined by the recursive scheme of dividing the sides n1 and n2 by 2 and ordering these sets of points in the order triangle, square, triangle. The elements in the rectangle are stored in full format, either by row or by column.

Notice that we can store the elements of a rectangle in two different ways. The first is by column (standard Fortran order), and the second is by row (standard C order). Assume that A is in lower recursive packed format; then the rectangle is size n2 × n1, n1 less or = to n2. Suppose we store A by row; then lda = n1, and the local address of the element a(i, j) with respect to the beginning of the matrix A is

loc[a(i, j)] = j + n1i. (6)

Suppose we store A by column. Then lda = n2, and

loc[a(i, j)] = i + n2j. (7)

Next assume that A is in upper recursive packed format. Then the rectangle is size n1 × n2, n1 less or = to n2. Suppose we store A by row. Then lda = n2, and

loc[a(i, j)] = j + n2i. (8)

Suppose we store A by column. Then lda = n1, and

loc[a(i, j)] = i + n1j. (9)

Now, the storage layout for the matrix in L format is identical to the storage layout of the transpose matrix in U format. This can be seen by noting that (8) and (9) (for AT) become
loc[at(i, j)] = i + n2j (10)

and

loc[at(i, j)] = j + n1i. (11)

We can now state a relationship among the four storage layouts of the lower and upper recursive data formats.

Theorem 1     The storage layout of the recursive lower packed row (column) data format is identical to the storage layout of the recursive upper packed column (row) data format.

Proof     The algorithms for U and for L have the diagonal in common and hence define the diagonal elements in the same way. Alternatively, we can prove this by induction. Thus, the storage layout is identical for the two formats. Now consider the off-diagonal elements. We use induction. The induction hypothesis is for k = log2n, k = 0, 1, · · · . Assume that the result is true for 0 < j less or = to 2k. Let 2k < j less or = to 2k+1. The induction hypothesis states that the layouts of T1 and T3 and of T2 and T4 are identical [Figures 6(a) and 6(b)]. Now S1 and S2 start at the same location. By using Equations (6) and (11) we see that the row layout of S1 is identical to the column layout of S2, and by using Equations (7) and (10) we see that the column layout of S1 is identical to the row layout of S2. box

Figure 6

We have just seen that recursive row storage of L is identical to recursive column storage of U and that recursive column storage of L is identical to recursive row storage of U. We next examine an interesting consequence of this result.

Redundancy of symmetric-type algorithms leads to faster algorithms and less coding effort
Theorem 1 states that we have two essentially different formats for storing a symmetric matrix. Let us arbitrarily choose recursive packed lower column storage and recursive packed upper column storage as the two different formats. Next, note that most mathematical libraries, such as LAPACK, LINPACK, and ESSL, provide symmetric algorithms supporting both data formats. Because of symmetry, both algorithms perform exactly the same computation; the only difference between the algorithms is the way in which they access storage. Before continuing, consider running a symmetric algorithm in C instead of Fortran. Since C stores matrices by row instead of by column, the L algorithm of C becomes the U algorithm of Fortran, and vice versa. (We mention without proof that a theorem similar to Theorem 1 holds for full-format storage and hence in the Fortran and C programming languages.) Now we pose an interesting question: Of the two data formats, which one gives the faster execution for the symmetric algorithm? Several answers are possible, and the answer is dependent on the symmetric algorithm. However, the most natural answer is that the algorithms perform about the same for both data formats. Now, for definiteness, assume that the algorithm is the Cholesky factorization algorithm. In that case, we argue that one should always choose recursive row storage of L even if the input data for L is given in recursive lower column storage. The reason follows from the fact that computations done stride 1 are almost always faster than computations done stride n. Also, Cholesky factorization is a Level 3 computation, while matrix transposition is a Level 2 computation. Finally, if one expresses L in recursive lower row storage, the majority of Cholesky factorization will be done with stride 1 instead of stride n. The gain is multiplied by the Level 3/Level 2 ratio.

  DTRSM and DSYRK using recursive packed format
We now discuss adapting DTRSM and DSYRK to the new data structure. We examine a triangle T of size n spanning nt consecutive storage locations, where

nt = n(n + 1)   .

2

Without loss of generality, we can let T be lower triangular and occupy memory starting at 0 and ending at nt – 1. In global A, T starts at ist and ends at ist + nt – 1. Let n = n1 + n2, where n1 = left floorn/2right floor. By definition, T consists of two triangles T1 and T2 and a “square” S1 of size n2 × n1 [Figure 6(a)]. Hence, let T1, T2, and S1 have “local” addresses in the space 0 to nt – 1. This means that T1 has addresses 0 to nt1 – 1, S1 has addresses 0 to n2n1 – 1, and T2 has addresses 0 to nt2 – 1, where

nt1 = n1(n1 + 1)

2

and

nt2 = n2(n2 + 1)

2

In terms of T coordinates, T1 starts at 0, S1 starts at nt1, and T2 starts at

nt1 + n2n1 = n1(n + n2 + 1)   .

2

It is clear from Figure 6(a) that T1 and T2 are stored recursively, whereas S1 is stored in full format.

In order to use Level 3 BLAS DTRSM and DSYRK, we must have both the triangle (either T1 or T2) and the rectangle (S1) in full format. In our case the triangle is not in full format. However, T1 and T2 consist respectively of (n1 – 1) and (n2 – 1) squares, and their total areas are respectively nt1n1 and nt2n2.

We propose that DTRSM and DSYRK be expressed recursively according to the recursive data layout of T1 and T2. Take DTRSM first. It requires T1 and S as its operands. Now divide T1 into T11, S1, and T12 according to its recursive definition. Let A = T1 and B = S. The relation between Figure 6(a) and Figure 7 is that n2 = m and n1 = k. Now the computation B = BAT becomes

XAT = B, (12)

or

m{( k1 k2 ) open parenthesis A11T         A21T close parenthesis   = m{( k1 k2 ) = B.
Curly Brace on top Curly Brace on top Curly Brace on top Curly Brace on top
X1 X2 0         A22T B1 B2

If we break (12) into its component pieces, we obtain

X1A11T = B1, (13)

B tilde2 = B2X1A21T, (14)

X2A22T = B tilde2. (15)

Figure 7Figure 7

Note that (14) is a DGEMM computation on matrices (A21, B2, B1), which are stored in full format. Computations (13) and (15) are smaller instances of the original problem, i.e., a DTRSM computation where the triangle is stored recursively and the rectangle is stored in full format. It now follows that we can define DTRSM in our recursive data structure, and it comprises only calls to DGEMM and the Level 1 routine DSCAL. We now state the proof of the recursive DTRSM and give the full algorithm.

The recursive algorithm for the triangular solution resembles the Cholesky factorization algorithm. Assume that A is a nonsingular triangular matrix of size k and B is a rectangular matrix of size m × k. Then the recursion starts by solving for X1 using the upper left triangle A11 of order k1 = left floork/2right floor; see (13). The lower matrix B2 of order k2 = left ceilingk/2right ceiling is transformed into B tilde2 by subtracting the product of the solved X1 and lower left matrix A21; see (14). Finally, we can recursively solve for X2; see (15).

The recursion in (13) and (15) stops when the matrices to be solved, X1 and X2, have a small number of rows or, if full recursion is used, one row.

Proof of correctness     Once again mathematical induction is used. The recursion takes place on the order k of A. We want to prove correctness for k = 1, 2, · · · . Since recursion breaks the problem into two nearly equal parts, k1 = left floork/2right floor and k2 = left ceilingk/2right ceiling = kk1, we use mathematical induction on i = log2k, i = 0, 1, · · · .

Suppose that the result is true for 0 < k less or = to 2i. Then we establish the results for all j, 2i < j less or = to 2i+1. As a base for our induction, we need to establish the result for k = 1.

For k = 1, we scale the row X = Ba-111. Since A is nonsingular, a11 neq 0, so X exists. Assume that the result is true for 0 < k less or = to 2i. Let 2i < j less or = to 2i+1, j1 = left floorj/2right floor, and j2 = jj1. Since j > 1, the computations described in (13), (14), and (15) are performed in that order.

Equation (13) is satisfied by the induction hypothesis. Equation (14) is a calculation (matrix multiply), and finally, Equation (15) is also satisfied by the induction hypothesis. Now we want to show that Equation (12) is true by using Equations (13)–(15). This is trivially true as we follow the rules of 2 × 2 block multiplication, where the partitioning of A is k = k1 + k2. The conditions that the block elements of X1 and X2 must satisfy are exactly those of Equations (13)–(15), which we have shown to be true. In Figure 8 we give the details of recursive algorithms TRSM(k).

Figure 8Figure 8

Now we express DSYRK recursively. Since the procedure is very similar to that presented for DTRSM, we just state the algorithm and prove its correctness. First we define some notation. DSYRK requires S [S1 in Figure 6(a)] and T2 as operands. Divide T2 into T21, S2, and T22 according to its recursive definition. Now, let C = T2 and A = S; see Figure 9. The relation between Figure 6(a) and Figure 9 is again n2 = m and n1 = k. The algorithm is shown in Figure 10.

Figure 9Figure 9 Figure 10Figure 10

The correctness of SYRK follows by mathematical induction; its proof is omitted, since it is similar to the DTRSM proof.

  Cholesky algorithm applied to recursive lower row storage produces stride 1 storage access throughout
Having established the content of the preceding subsection, we may now turn to demonstrating why transposing recursive lower column storage leads to stride 1 performance throughout. This is true for Cholesky factorization and many other symmetric algorithms. Assume that we use full recursion. Then the factor part of the code becomes n square root calculations. The rest of the code consists of calls to RTRSM and RSYRK. However, both RTRSM and RSYRK are recursive, and when used with full recursion they always consist of calls to DGEMM and Level 1 calls to DSCAL and DDOT. Now take a DGEMM call from RTRSM. Note that C = C – ABT is computed, where C is m × n, A is m × k, and B is n × k. Similarly, a DGEMM call from RSYRK has the same form, C = C – ABT. Now transpose this generic DGEMM computation to obtain CT = CTBAT and assume that L is stored in recursive lower row-wise storage. Since storing a full matrix row-wise is identical to storing its transpose column-wise, we see that CT = CT – BAT becomes D = D – ETF, where D = CT, E = BT, and A = FT. Note that each computation problem for C and D consists of doing mn dot products each of size k. The form C = C – ABT computes dot products stride lda, ldb, while the form D = D – ETF computes dot products stride 1.

Before continuing, we use the above results to state a possible result about symmetric full storage. Assume that the answer to the question as to which data format gives faster execution for the symmetric algorithm is that the U format algorithm is faster by a sufficient amount. Then two things occur: 1) better performance using the single code, and 2) instead of having to write two codes, the uplo = 'L' code disappears. In its place one need only invoke an existing in-place square transpose code twice, at the beginning and at the end. Moving twice the data twice is necessary for migration purposes.

  Data transformation
In order for the recursive algorithm to use the recursive packed data format, the user's packed data must be transformed to recursive packed format. In the Cholesky case, the user's symmetric matrix is in lower packed format, which is overwritten with the matrix in recursive packed upper format, as described in the two preceding subsections. The recursive packed upper format is chosen so that the DGEMM operation will be ATB; i.e., the A and B matrices are accessed with stride 1.

The in-place data transformation of a lower packed triangle of size n to recursive packed upper format is done in eight steps; see Figure 11 and Table 1. These steps are listed below. Now we overview the algorithm. First we do an in-place data transformation of the firstleft floorn/2right floor columns of the packed array. This figure is a trapezoid and a rectangle of size left ceilingn/2right ceiling by left floorn/2right floor. The lower triangle is moved out of place to reside in the auxiliary buffer in recursive upper packed format. Next the rectangle is moved in place to full (column major order) format to reside in its final position of the trapezoid. Finally, the contents of the auxiliary buffer are copied to the triangle part of the trapezoid. These three steps are described in steps 1 to 6 below. The second major step is described in steps 7 and 8 below.

  1. The first left triangle of size left floorn/2right floor of the packed lower triangle is stored in packed recursive row lower format [see Figure 6(a)] in the auxiliary buffer at the first position (0) to position

    left floorn/2right floor • (left floorn/2right floor + 1) – 1.

    2

    This is done using an out-of-place data transformation subroutine. Notice that the upper left triangle is not stored contiguously in the packed format, so this copy to buffer is a gather operation. That is, all of the vectors in the original matrix, stored in packed format, are now stored recursively in the contiguous memory of the buffer. See Table 1, Step 1, where the vectors containing elements 11–31, 22–32, and 33 of the packed triangular array of size 7 are stored as one vector of length 6 in the auxiliary buffer.

  2. It is easy to do an in-place transpose of a square, but almost impossible to do an in-place transpose of a rectangle. A transpose of a nonsquare rectangle is very hard to do in place, so if n is odd, the first row of the rectangle, which is of size left ceilingn/2right ceiling × left floorn/2right floor, is also stored contiguously in the transfer buffer at position

    left floorn/2right floor • (left floorn/2right floor + 1)

    2

    to

    left floorn/2right floor • (left floorn/2right floor + 3) – 1.

    2

    In Table 1, Step 2, the elements 41, 42, and 43 (that is, the row directly above the lower left 3 × 3 square in the 7 × 7 triangle) are stored contiguously as a vector of length 3 in the auxiliary buffer, following directly after the triangle that was created in Step 1.

  3. The gap between the columns in the remaining lower left square is removed by moving column i, 0 lesser or = to i < left floorn/2right floor, from positions

    i open parenthesis 2ni – 1 close parenthesis + left ceilingn/2right ceiling

    2

    to

    i open parenthesis 2ni – 1 close parenthesis + n – 1

    2

    to positions

    left floorn/2right floor open parenthesis 2i + 2left ceilingn/2right ceilingleft floorn/2right floor + 1 close parenthesis

    2

    to

    left floorn/2right floor open parenthesis 2i + 2left ceilingn/2right ceilingleft floorn/2right floor + 3 close parenthesis – 1.

    2

    This is done in place, starting with the rightmost column, by left floorn/2right floor – 1 calls to DCOPY. Notice that when i = left floorn/2right floor – 1, the rightmost column is already in its final place.

  4. The square lower part of the rectangle is transposed, since we want the resulting matrix to be in recursive packed column upper format (which is equivalent to recursive packed row lower format). The lower left square of size left floorn/2right floor × left floorn/2right floor is contiguous in memory at positions

    left floorn/2right floor open parenthesis 2left ceilingn/2right ceilingleft floorn/2right floor + 1 close parenthesis

    2

    to

    left floorn/2right floor open parenthesis 2nleft floorn/2right floor + 1 close parenthesis – 1.

    2

    This square is transposed in place by a call to the ESSL routine DGETMI.

  5. If n is odd, the row stored in step 2 is now copied to positions

    left floorn/2right floor(left floorn/2right floor + 1)

    2

    to

    left floorn/2right floor(left floorn/2right floor + 3) – 1.

    2

    Compare with Table 1, Step 5. Since this is a contiguous vector, DCOPY is used.

  6. The triangle stored in Step 1 is copied back. At this time, the trapezoid consisting of the entire left part of the triangle is in packed recursive row lower transposed format, with the upper left triangle in the recursive packed lower format and the rectangle in row-major order.
  7. Now, the lower right triangle of size left ceilingn/2right ceiling is stored in packed recursive row lower format in the auxiliary buffer.
  8. The triangle stored from Step 7 is copied back to the packed triangle.
Figure 11Figure 11

Table 1   Data transformation for n = 7. Compare with Figures 5 and 11.

Operation   Storage Auxiliary buffer


1 2 3 4 5 1 2 3 4 5

Initial 0 11 21 31 41 51 0
values 5 61 71 22 32 42 5
(packed 10 52 62 72 33 43
lower) 15 53 63 73 44 54
20 64 74 55 65 75
25 66 76 77
 
(1) 0 (11) (21) (31) 0 [11] [21] [31] [22] [32]
Copy/ 5 (22) (32) 5 [33]
transform 10 (33)
triangle 15
to buffer 20
(gather) 25
 
(2) 0 (41) 0
Copy 5 (42) 5 [41] [42] [43]
rectangle 10 (43)
row 15
(gather) 20
25
 
(3) 0 4(51) 0
Move 5 4(61) 4(71) [51]4 5
square 10 [3(52)61]4 [2(62)71]4 [1(72)52]3 [62]2 [72]1
columns 15 {53} {63} {73}
20
25
 
(4) 0 0
 
Transpose 5 51 5
square 10 52 53 61 62 63
in-place 15 71 72 73
20
25
(5) 0 0
Copy 5 [41] [42] [43] 5 (41) (42) (43)
saved row 10
back 15
20
25
 
(6) 0 [11] [21] [31] [22] [32] 0 (11) (21) (31) (22) (32)
Copy triangle 5 [33] 5 (33)
back 10
15
20
25
 
(7) 0 0 [44] [54] [55] [64] [65]
Copy/ 5 5 [74] [75] [66] [76] [77]
transform 10
lower right 15 (44) (54)
triangle 20 (64) (74) (55) (65) (75)
(gather) 25 (66) (76) (77)
 
(8) 0 0 (44) (54) (55) (64) (65)
Copy 5 5 (74) (75) (66) (76) (77)
triangle 10
back 15 [44] [54]
triangle 20 [55] [64] [65] [74] [75]
(gather) 25 [66] [76] [77]
Final 0 11 21 31 22 32 0
result 5 33 41 42 43 51 5
(recursive 10 52 53 61 62 63
lower 15 71 72 73 44 54
row 20 55 64 65 74 75
format) 25 66 76 77

Legend:
t(ij) Element a(i, j) at location is accessed at time t (read operation). [Order of accesses is only important in operation (3).]
[ij]u Element a(i, j) at location is modified at time u (write operation). [Order of accesses is only important in operation (3).]
{ } Data is part of the operation but neither read nor written [rightmost column in operation (3)].

In Table 1, explicit details for the illustrative example where n = 7 is given. This example contains the salient feature of the in-place data transformation algorithm for a general n. In Table 1, the notation (ij) in position p means that a(i, j) is read from memory location p. The notation [ij] in position p means that a(i, j) is written into memory location p. The notation [t(ij)kl]u in position p means that the element a(i, j) is read from memory location p at time t and that the same memory location p is overwritten with element a(k, l) at time u. Steps 1 and 2 of Table 1 are self-explanatory. Step (3) is more complex. The general idea is to move data in place so that data written at time t does not overwrite any data that will be read at time u > t. In step 3, column 2 [i.e., elements a(5:7, 2) of AP] is moved from locations 11, 12, 13 to locations 13, 14, 15. Since these storage locations overlap, we move this data in a backward manner; i.e., at time step 1, a(7, 2) is read from location 13 and written into location 15; at time step 2, a(6, 2) is read from location 12 and written into location 14; at time step 3, a(5, 2) is read from location 11 and written into location 13. Next, a(5:7, 1) of AP is moved from locations 5:7 to locations 10:12. Since these storage locations do not overlap, this move can be done by calling DCOPY at composite time step 4. Note that column 3, a(5:7, 3), is neither read nor written, as this column is already in place. After step 3, A is stored in full format as a square matrix. Step 4 transposes this matrix in place by a standard in-place square transpose algorithm. Thus, in step 4, we just exhibit the elements of the matrix after transposition has occurred. Steps 5, 6, 7, and 8 of Table 1 are self-explanatory.

3. Algorithmic components

  The problem with a large recursion tree
Blocking reduces a large recursion tree to a recursion tree of small size. Recall that a Cholesky program of size N has a binary tree with N leaves and N – 1 interior nodes. The overhead at any node increases with n, where n is the size of the Cholesky subproblem, since each interior node performs the same Cholesky computation. To see this, note that each call is recursive, so a call at size 2n includes two calls at size n plus calls to recursive routines RTRSM and RSYRK. What is important for high performance is the ratio r of the cost of the recursive call overhead to the computation cost (number of FLOPs). It turns out that when n is large, this ratio is tiny and can be safely neglected. We quantify this ratio in Section 3. Here we discuss the pruning of the recursion tree so that every node of the reduced tree has negligible r.

At a given tree level i, all nodes (there are 2i nodes, each of size m = ni or m = ni + 1, where ni = N/2i) again execute the Cholesky algorithm on a submatrix of size m. The FLOP count of Cholesky is 1/3m3. Thus, the overhead to FLOP-count ratio at tree level i is about eight times smaller than at level i + 1, and the number of nodes doubles. At some level, say k, this ratio times 2k becomes too high. This requires that we prune the recursion tree at level k and below. Suppose nb is some blocking parameter where 2k less or = to nb < 2k+1. If the size of the Cholesky subproblem is n greater or = to 2k+1, another recursion step is attempted. Otherwise, n < 2k+1, and the given Cholesky problem of size n is factored using a direct method. Using this strategy guarantees that the smallest direct-method problem size will be n greater or = to 2k if the original problem size is N greater or = to 2k. If N < 2k, we used a simple packed format code that used register blocking because the entire packed matrix fit easily into cache. In our implementation k was chosen to be 4. An interior node has n greater or = to 2k+1. Here we execute a recursive TRSM problem of size n1 and a recursive SYRK problem of size n2, where n1 + n2 = n, in addition to recursive Cholesky calls of sizes n1 and n2. The FLOP counts of TRSM and SYRK at a node of size m, n are m2n and (m2 + m)n, respectively. If n1 < 2k+1, we solve the TRSM BLAS problem by a direct method and solve the SYRK BLAS problem similarly. Otherwise, we can safely do a recursion step, since the ratio r for the problem will be tiny.

  Block version of recursive packed format Cholesky
In [1], Andersen, Gustavson, and Was accentniewski show an implementation of recursive Cholesky. They let the recursion go down to a single element. Because of the overhead of the recursive calls, the implementation has a significant performance loss for small problems. However, for large problems most of the computation takes place in the DGEMM, so that performance loss, while significant for smaller problems, is minor. The difference between that implementation and the one described here is that we combine recursion with blocking.

In our implementation, however, we combine blocking and recursion to produce a blocked version of their recursive algorithm. We do recursion only down to a fixed block size 2nb. To solve problems for sizes smaller than 2nb, we use a variety of algorithmic techniques, some new and some old, to produce optimized unrolled kernel routines. These techniques are used for both the Cholesky factorization routine and the recursive TRSM and recursive SYRK routines. The main technique we use to produce these fast kernels is register and Level 1 cache blocking. The specific kernels are described in the subsection on three types of kernel routines.

Next we give details about our block version and demonstrate analytically that blocking the recursion tree via pruning completely eliminates performance problems for small matrix sizes.

Algorithms for block Cholesky, block TRSM, and block SYRK
In this short section we modify the algorithms CHOL (Figure 2), TRSM (Figure 9), and SYRK (Figure 10) to produce their block counterparts, algorithms BC, BT, and BS in Figures 12, 13, and 14, respectively. In each case, we have a blocking factor nb, and the only change to these three algorithms is to make their if-then-else clauses dependent on nb. Also, the code of each if clause becomes a call to a kernel routine instead of a square root operation or a call to a Level 1 BLAS.

Figure 12 Figure 13 Figure 14

We omit all proofs of correctness because they are similar to the proofs given in Section 2. Note, however, that when nb = 1 we get algorithms CHOL, TRSM, and SYRK, because then the three Level 3 kernel routines BCK, BTK, and BSK reduce respectively to SQRT, DSCAL, and DDOT computations.

Now we discuss how the packed recursive lower format, PRLF, and algorithms BC, BT, and BS relate to one another. Take BC first. In the recursive part of the code there are four calls, two to BC and one each to BT and BS. The definition of PRLF guarantees that the two BC calls will receive lower triangular submatrices than are in PRLF. To see this, simply note that the if-then-else control logic of BC, where n is replaced by n1 = left floorn/2right floor and n2 = nn1, is precisely the same as the control logic used to define PRLF (see the subsection on data transformation). Now look at the calls to BT and BS. Each of these recursive routines has two operands: a triangle in PRLF and a conventional rectangular matrix stored in row major order with a leading dimension of n1. This fact follows immediately from the definition of PRLF and the fact that BC control logic is identical to the control logic defining PRLF. Turning now to recursive algorithms BT and BS, we see the same if-then-else control logic that was used in BC. Hence, the two recursive calls in both BT and BS define triangle and rectangle operands that are in PRLF and row major format, respectively. Note, however, that the leading dimension of the rectangle remains constant for all recursive calls of BT and BS.

Finally, note that because the algorithm and the data structure are so closely tied together, no data movement of the operands occurs during any of the recursive calls. Also, the floating-point operations occur only in calls to BT, BS, at various internal nodes of the Cholesky tree, and in calls to the kernel routine BCK at the leaves of the Cholesky tree. Now consider any internal node of the Cholesky tree where there is one call each to BT and BS. In both the BT and BS calls there is an associated recursion tree. For BT, the tree is the left child of the node; for BS, the tree is the right child of the node. Now consider either of these trees. At each interior node there is one call to DGEMM, and at the leaves there are calls to kernel routines BTK and BSK. We discuss the implementation of the three kernel routines later in Section 3. The discussion of the DGEMM implementation is beyond the scope of this paper; see [7] for some details. We merely note that DGEMM is a Level 3 BLAS, and by its very nature should be high-performance and fully tuned for a given architecture/platform. The data movements of the operands occur in the calls to DGEMM, BCK, BTK, and BSK. In the Conclusion we give a further discussion of data operand movement in the calls to DGEMM.

Recursion tree has 2i nodes at every level

Theorem 2     Given N greater or = to nb, let 2qnb less or = to N < 2q+1nb determine q. Let ni = left floorN/2iright floor. Now either 1) 2qnb less or = to N less or = to 2q+1nb – 2q, or 2) 2q+1nb – 2q < N < 2q+1nb. When Case 1 holds, the binary recursion tree for the blocked Cholesky algorithm BC(N) has 2q leaves and 2q – 1 interior nodes; i.e., it has q + 1 levels, where Level k has 2k nodes, 0 less or = to k less or = to q. When Case 2 holds, N = 2q+1nb – 2q + i, where 1 less or = to i < 2q. The binary recursion tree for BC(N) has 2q + i leaves and 2q + i – 1 interior nodes. It has q + 2 levels, where Level k has 2k nodes, 0 less or = to k less or = to q, and Level q + 1 has 2 • i leaves. Consider the binary recursion tree associated with BC(N). In both cases, for 0 less or = to k less or = to q we have the following: At a given level k, alphak of these nodes denotes a block Cholesky algorithm BC(nk), and the remaining ßk = 2kalphak denotes a block Cholesky algorithm BC(nk + 1). Also, alphaknk + ßk(nk + 1) = N holds. Additionally, in Case 2 we have nq = 2nb – 1, alphaq = 2qi, and ßq = i, and at Level q + 1 there are 2i leaves with nq+1 = nb.

Proof     We use mathematical induction. For i = 0, n0 = N, alpha0 = 1, ß0 = 0 as we have the original problem BC(n0); see Figure 12. Assume that the result is true for i = j. The induction hypothesis is alphajnj + ßj(nj + 1) = N and alphaj + ßj = 2j. Now nj+1 = left floornj/2right floor. There are two cases, depending on whether nj is even or odd. When nj is even, alphaj+1 = 2alphaj + ßj and ßj+1 = ßj. When nj is odd, alphaj+1 = alphaj and ßj+1 = alphaj + 2ßj. The equations for alphaj+1 and ßj+1 follow directly from the BC algorithm in Figure 12. In both cases, a direct calculation gives alphaj+1nj+1 + ßj+1(nj+1 + 1) = alphajnj + ßj(nj + 1), which equals N via the induction hypothesis. Also, alphaj+1 + ßj+1 = 2(alphaj + ßj), which equals 2 • 2j via the induction hypothesis. Similar proofs hold for Algorithms BT and BS of Figures 13 and 14, respectively. box

Some notes: In going from Level i to Level i + 1, the number of Cholesky factorizations doubles, but their size is halved. This means that the total number of FLOPs decreases by a factor of approximately 4 in going down one tree level. Precise details about this are given on page 740 of [8].

In Figure 15 we give an example with N = 743 and nb = 16. The recursion tree has 63 nodes. At level 0 there is one computation consisting of the single (BT, BS) pair of size (371, 372). The two trees are the two children of node 743, namely nodes 371 and 372. Hence, for BT there are 15 calls to DGEMM and 16 calls to kernel BTK. For BS there are also 15 calls to DGEMM and 16 calls to kernel BSK. For both BT and BS, the sizes (m, n, k) of the DGEMM calls are given by the node labels, which for kernels BT and BS are the same. At Level 1 there are two (BT, BS) pairs of sizes (185, 186) and (186, 186), respectively. The four subtrees are the four trees for BT and BS, two each. For each tree there are seven DGEMM and eight kernel calls. At Level 2 there are one (92, 93) pair and three (93, 93) pairs. The eight subtrees are the eight trees for BT and BS, four each. For each tree there are three DGEMM and four kernel calls. At Level 3 there are one (46, 46) pair and seven (46, 47) pairs. The sixteen subtrees are the sixteen trees for BT and BS, eight each. For each tree there are a single DGEMM and two kernel calls. At Level 4 there are nine (23, 23) pairs and seven (23, 24) pairs. There are 32 leaves, and these leaves correspond to sixteen BT and BS calls. Each of the 32 calls is a kernel routine call. At Level 5 there are 32 BCK calls, 25 of size 23 and seven of size 24. This is summarized in Table 2.

Figure 15Figure 15

Table 2   Values for ni, alphai, and ßi for different levels at the recursion tree for N = 743. Notice that alphaini + ßi(ni + 1) = 743 holds for all values of i, 0 less or = to i less or = to 5.

Level i ni alphai ßi ni + 1

0 743 1 0 744
1 371 1 1 372
2 185 1 3 186
3 92 1 7 93
4 46 9 7 47
5 23 25 7 24

The example for N = 743 is the typical case. Let 512 less or = to N < 1024. Dividing by 2nb, we have 16 less or = to (N/2nb) < 32. This partitions N into nb intervals [512, 544), [544, 576), · · · , [992, 1024) labeled by the numbers nb to 2nb – 1. Note that 743 belongs to the partition [736, 768) labeled by the number 23. Partitions 16 to 30 have the same binary tree as in Figure 15, namely one of 63 nodes. Partition 31, Case 2 of Theorem 2, constitutes a transition between a binary tree of 31 interior nodes and 32 leaves and one of 63 interior nodes and 64 leaves, the next power of 2. There are 2nb numbers 992 + i, 0 less or = to i < 2nb in partition 31. For N = 992 + i, the binary tree has 31 + i interior nodes and 32 + i leaves. The leaf nodes consist of 2i nodes of size nb and 32 – i nodes of size 2nb – 1. Later in Section 3, we refer to this partition as the special case. Finally, note that the interval [29, 210) is generic, as Theorem 2 shows.

Overhead of recursion is negligible
To calculate the cost of the recursive calls, the recursive algorithm is written as a recursive cost function. Let tcall be the cost of a function call overhead, including cycles spent to initialize variables (e.g., n1 = left floorn/2right floor). Let tflop be the cost of a floating-point operation.

Start with Algorithm BC(N) of Figure 12. When n greater or = to 2nb, the else clause of the if statement is executed, and two calls to BC, one call to BT, and one call to BS are performed. All of these calls operate on matrix sizes equal to about half of the original. After investigating the algorithms for BT in Figure 13 and BS in Figure 14, we can, when n greater or = to 2nb, write these costs as

C(n) = 1 • tcallC + C(left floorn/2right floor) + C(left ceilingn/2right ceiling) + T(left floorn/2right floor, left ceilingn/2right ceiling)+ S(left ceilingn/2right ceiling, left floorn/2right floor); (16)

T(n, m) = 1 • tcallT + T(left floorn/2right floor, m) + T(left ceilingn/2right ceiling, m) + G(left ceilingn/2right ceiling, m, left floorn/2right floor); (17)

S(n, m) = 1 • tcallS + S(left floorn/2right floor, m) + S(left ceilingn/2right ceiling, m) + G(left floorn/2right floor, left ceilingn/2right ceiling, m), (18)

where C(n) is the cost of the Cholesky factorization, T(n, m) is the cost of a triangular solution with m right sides, S(n, m) is the cost of a symmetric rank-k update with k = m, and G(m, n, k) is the cost of a matrix–matrix multiply. Notice that each function accounts for the call to itself. For n < 2nb, no recursive calls are performed, so the cost is approximated by

C(n) = 1 • tcallC + 2n3 + 3n2 + n   • tflop; (19)

6

T(n, m) = 1 • tcallT + n2mtflop; (20)

S(n, m) = 1 • tcallS + (n2 + n)mtflop. (21)

Now, only the cost formulation of the matrix–matrix multiply is needed to solve the cost equations:

G(m, n, k) = 1 • tcallG + 2mnktflop. (22)

If Case 1 of Theorem 2 holds (i.e., 2qnb less or = to n less or = to 2qnb – 2q), the solution to Equation (16) can be written as
C(n) = 2n3 + 3n2 + n   • tflop + (2 • 2q – 1) • tcallC

6
 
+ [2q • (q – 1) + 1] • tcallT + [2q • (q – 1) + 1] • tcallS + [2q • (q – 2) + 2] • tcallG. (23)

That is, for Case 1, in order to Cholesky-factorize a symmetric matrix of size n, there will be 2 • 2q – 1 calls to the Cholesky routine, 2q • (q – 1) + 1 calls to each of the TRSM and SYRK routines, and 2q • (q – 2) + 2 calls to the GEMM routine.

Bounds for (23) for all values of n are found in the Appendix. For now, we discuss the properties of Equation (23) for Case 1 values of n with nb = 16.

The important relation here is the number of tcalls compared to the number of tflops. Notice that the number of calls (tcallC + tcallT + tcallS + tcallG) for Cholesky is 3q2q – 2q+1 + 3, while the number of floating-point operations is well known, 1/6n(n + 1)(2n + 1). This clearly shows that the overhead due to recursion is very small, as the quotient between the leading terms of these two expressions is

no. of calls   = 9   log2 n/nb   .


no. of FLOPs n2nb

The relative overhead is illustrated in Figure 16, where the ratio between the number of calls to subroutines and arithmetic work is plotted. Notice that the recursive code handles problems where n greater or = to nb, since we provide special kernels for small problems where 1 less or = to n < nb. The plot is exact only for Case 1 values of n, since the formula underestimates the number of function calls for Case 2 values of n. As an example, let n = 128, nb = 16 (kernels operate on problem sizes 16 to 31), and assume that the overhead of a function call, including the small setup part of the recursive functions, takes 200 times the amount of time it takes to perform a floating-point operation. Then the overhead due to the recursion is about 1.7%! This overhead is also dropping fast; for n = 256, it is about 0.6%. For our example with n = 743, the number of FLOPs is n • (n + 1) • (2n + 1)/6 = 137000284, while the number of calls is 419, with nb = 16; i.e.,

no. of calls   • 200 =   419   • 200 approximate approximate 6.12 • 10–4 approximate approximate 0.06%.


no. of FLOPs 137000284