|
1. Introduction
With higher processor speeds but a lack of a corresponding speedup in disk access, the tendency in server-class computers is toward increasingly large main-memory size in proportion to processor speed. A result is the expectation that memory will be a dominant cost factor in the central electronic complex, comprising in the near future possibly fifty to ninety percent of the cost of typical large machines, despite the usual trends in decreasing memory cost. A possible approach to mitigating this cost is to compress the contents of main memory. It is well known that such contents are generally compressible by a factor of 2 or better, and some commercial programs have been available to exploit this fact (for example, see [1]). Such observations have led to the consideration of systems in which the contents of main memory are maintained in compressed form, and decompressed/compressed on a page basis (for example, see [2, 3]). However, when the unit of compression is a cache line (that is, cache lines in main memory are decompressed on misses and recompressed on writebacks), it is necessary to develop some new technology in order to obtain a practical system. Here we describe results associated with the development of IBM Memory Expansion Technology (MXT) [4]. These results form the basis for the design of the compressed-memory organization of MXT (for additional details see [57]).
First, very fast compression/decompression hardware is required, permitting operation at main-memory bandwidths. There has been some progress in this direction, based on extensions to LempelZiv (LZ) methods. In particular, parallel, shared-dictionary techniques [8], implemented using content-addressable memory (CAM), permit effective compression/decompression at main-memory bandwidths. A serial approach, but based on a larger alphabet of multiple byte entries [2], also offers some speedup over standard LZ approaches.
Next, changes in memory management must be made to the operating system, since with compression, the logical total main-memory size may vary dynamically. Issues associated with such management are discussed in [9]. Finally, a way must be found to efficiently store and access the variable-length objects obtained from compression. This problem is the main topic of the current paper. We consider the design of a compressed random-access memory (C-RAM) with the following properties:
-
Logically, the memory M consists of a collection of randomly accessible fixed-size lines, where L is the line size.
-
Internally, the ith line is stored in a compressed format as L(i) bytes, where L(i)
L, and where L(i) may change on each cache cast-out of this line. L(i) L is ensured by the capability of storing lines in an uncompressed format with suitable directory entries.
-
M comprises a standard random-access memory with a minimum access size (granule) of g bytes. We will generally assume that g is 32.
-
Memory accesses invoke a translation between a logical line address and an internal address. This correspondence is stored in a directory D contained in M. Translation, fetching, and memory management within the C-RAM are carried out by a memory controller rather than by operating system (OS) software.
Use of the directory D to access memory generally implies an added level of addressing indirection, requiring an additional memory fetch. Also, the compressed format may require some average additional latency to handle misses from the next higher level in the memory hierarchy. As described below, this latency can be traded off against compression by varying the line size; sizes in the range of 512 to 1024 bytes appear suitable. Line sizes of this magnitude may require a third level of caching (L3), one more than is typical for many of today's systems. However, L3 caches may be necessary in any case (even without compression), since it is difficult to bring substantially larger memories uniformly close to the processors. The result is that with typical anticipated L3 miss ratios, we expect that average memory access latency will not be significantly affected by the use of a C-RAM for main memory. In the following, we generally assume an L3 line size L of 1024 bytes. The value for L was chosen via a combination of compression results using shared-dictionary parallel compression [7] and estimated L3 cache-miss latencies. An overview of the system structure, showing the L3 cache and the compressed main memory, is given in Figure 1.
Figure 1
Three classes of techniques for managing a memory of the above type are 1) organizing M to be a linear space, where variable-length intervals are allocated and deallocated; 2) organizing M as a collection of blocks of possibly multiple sizes, where space for a variable-length object is allocated as an integral number of such blocks; and 3) organizing M as a collection of blocks, but permitting a variable amount of space to be allocated within a block. The techniques we investigate here are of the third type. We use blocks of a single fixed size, and compressed lines are allocated some integral number of blocks, with leftover pieces (termed fragments) from possibly multiple lines sharing an additional block. A line fragment shares a block with line fragments drawn from a cohort of other lines. The smaller the cohorts, the fewer the memory operations required to store a line, but the larger the potential fragmentation. This is also true for the number of line fragments permitted to share a block. A principal observation is that the size of the cohorts and the number of lines allowed to share a block can be quite small, with low resulting fragmentation. The result is that the technique is of low complexity, and is suitable for implementation in hardware.
2. Design overview
A high-level design for a system using a C-RAM for the lowest level of the main-memory hierarchy is shown in Figure 2. As discussed in the Introduction, we assume that the C-RAM memory M is used to read and write lines resulting from cache misses and stores, respectively, at the next higher level of memory in the memory hierarchy, shown as L3 in Figure 2. The C-RAM memory consists of two parts, the directory D and a collection of fixed-size blocks. Highly compressible lines may be stored entirely within directory entries. Otherwise, the directory entry points to one or more of the fixed-size blocks, which are used to store the line in its compressed format. In the case that fragments from two or more lines are combined and stored in a common block, the directory entries corresponding to the given lines also contain the information necessary to find the fragments.
Figure 2
As usual, virtual addresses are translated to real addresses by means of page tables, which can be of various types depending on details of the processor architecture and operating system. Here, a real address corresponds to the location in D of the directory entry for the line.
As an example, suppose that pages are of size 4 KB, and that the L3 cache immediately above the C-RAM has a line size of 1 KB. In contrast to the usual type of page-table design, when a virtual address is translated, the result is used not only as an associated real memory address, but also as a pointer or an index to an entry in the C-RAM directory D. Compressed lines that do not fit entirely within directory entries are stored using one or more fixed-size blocks, which for this example we assume to be of size 256 bytes (the issue of choosing an optimal block size is discussed in Section 5). Each directory entry contains flags, fragment-combining information, and pointers for up to four blocks. At most four blocks are required, since we assume that if a given line does not compress (in general, this is always a possibility given a fixed compression method), it is stored in an uncompressed format, as indicated by a flag in the directory entry. Finally, sufficiently compressible lines are stored in the directory entry itself (as indicated by another flag bit), in which case no blocks need be allocated.
On an L3 cache miss, the memory controller and decompression hardware find the blocks allocated to store the compressed line and dynamically decompress the line to handle the miss. Similarly, when a new or modified line is stored, the blocks currently allocated to the line are made free (if the line currently resides in the C-RAM), and the line is then compressed and stored in the C-RAM by allocating the required number of blocks. As this is done by hardware, under normal operation the fact that main memory is compressed is transparent to the OS.
Given the parameters above (1KB lines and 256-byte blocks), consider the following scenario: Suppose each line compresses to 1, 2, 3, · · · , or 1024 bytes, with equal likelihood. Then the expected compressed line size is 512.5 bytes; that is, compression is almost exactly 50% (50.05%). However, if some number of full blocks is used to store each line, it is easily seen that the expected number of blocks required to store a line is 2.5. This gives a compression of 62.5%, significantly worse than 50%. One way to address this problem is to make block sizes smaller. However, if block sizes are significantly smaller, the size of the directory can increase dramatically. Another approach to reducing storage fragmentation is to combine two or more fragments, that is, the left-over pieces in the last blocks used to store compressed lines, into single blocks.
In order to make fragment combining feasible using a hardware-based memory controller, and in particular to provide a guarantee of the maximum latency, some constraint must be made on the sets of lines for which fragment combining is allowed. We will call a set of lines for which fragment combining is allowed a cohort. In order to have a small upper bound on the time required for directory scans, ideally the size of cohorts should be small. In subsequent sections we look at cohort sizes ranging from 2 to 16 lines. Another design issue is the fashion in which cohorts are determined. Here there are two general approaches: partitioned cohorts and sliding cohorts.
In a partitioned-cohort approach, lines are divided into a number of disjoint sets (all of a given fixed size, the cohort size), where each such set is a cohort. For example, with a cohort of size 2, the first two 1KB lines in each 4KB page could form one cohort and the last two lines another cohort. Similarly, grouping lines by pages, all of the lines in each page give cohorts of size 4; all of the lines in two consecutive pages give cohorts of size 8; and so on.
In contrast, in a sliding-cohort design, cohorts are not disjoint, but overlap. For example, with a cohort of size 4, the cohort corresponding to any given line could consist of the set containing that line and the previous three lines, and similarly for other cohort sizes. The motivation for considering this type of design is the following. In a partitioned-cohort design with a cohort size of 4, as each successive line is stored sequentially, the average number of lines already stored for which combining is possible is 1.5. In a sliding-cohort design, however, again with a cohort size of 4, assuming that the lines in the previous page have already been stored, each successive line always has three other lines with which it can potentially combine. This suggests that this design might yield decreased fragmentation.
Another issue is the method by which fragments are combined. First, there is a question of the number of fragments that can be combined in a block. From a hardware point of view, the simplest case is to allow only two fragments per block. However, if significant gains would be realized by allowing three-way or greater combining, the increased complexity might be acceptable. Also, given a fragment and the sizes and locations of other fragments in the cohort for which combining is possible, which fragment (or fragments) should be chosen? Two options are to use first-fit or best-fit methods. We also study some other methods primarily for the sake of comparison, since they are impractical to implement in hardware. These are 1) fragment catenation, in which all of the fragments in a cohort are catenated, with fragments crossing block boundaries as necessary; and 2) optimal fit, in which (assuming two-way combining, for example) all possible ways of combining the fragments in a cohort are exhaustively examined, and the one yielding the minimal number of blocks is selected.
Finally, there are design issues related to choice of block size, which we consider in Section 5, and to the design of the directory structure. Here we present an overview of directory structure design; for more details, including possible directory-entry formats, addressability issues, etc., see Section 3 of [5]. There are two types of directory structures. The first is static, and is configured so as to have the required number of entries to support a maximum compression factor of F (where F is greater than 1, with 1/F the compression expressed as a fraction). That is, if the C-RAM has a capacity of N uncompressed lines, the directory contains entries for FN lines. Thus, for example, if F were 2 (i.e., 50% compression), D would contain entries for 2N lines. A possible problem with this type of design is that the maximum compression is limited to a predetermined value: If the contents of memory at some point are more compressible, this cannot be taken advantage of to provide more real memory. On the other hand, if compression is significantly less than this value, there is wasted directory space. These problems can potentially be avoided by the second type of structure we consider, in which the directory is dynamic.
Using a dynamic directory structure, directory entries are created (deleted) whenever real addresses are allocated (deallocated). In this case, free main-memory blocks could be allocated (deallocated) and used for the directory entries for one or more pages whenever the pages were created (deleted). Here, we are assuming that real memory increases (decreases) in units of at least a page. In this case, a pointer or index to the C-RAM directory entries for the page or pages would be maintained as part of the real address by the OS in a page-table entry. An interesting property of the 128-byte block case is that, assuming directory-entry formats as described in [5], one 128-byte block is exactly the right size to hold the four directory entries for the four lines of the page, provided that a maximum physical-memory addressability of 256 GB is sufficient (the maximum logical real-memory addressability would be F times as much, for example 512 GB for F = 2). The implication is that the C-RAM memory could be uniformly divided into a collection of 128-byte blocks, which could be allocated either as directory entries or as blocks to hold compressed data, as required.
3. Uniformly distributed compressed lines
In this and the following section we consider approaches to combining line fragments, that is, those parts of compressed lines occupying a fraction of the last block used to store the line, into single blocks. Issues include the number of fragments to be combined, the way the fragments are chosen, and whether combinations of fragments within a cohort are reorganized.
We first study these issues assuming a uniform distribution of compressed line sizes; in the next section we present results based on memory dumps. In more detail, with parameters as in previous examples (1KB line sizes, 256-byte blocks), we assume that each line is compressed to 1, 2, 3, · · · , or 1024 bytes, with equal likelihood, independently of the compression of any other line. Without fragment combining, each line therefore compresses to 1, 2, 3, or 4 blocks, each with equal likelihood.
It is possible to derive some results for fragment combining for some simple cases (such as for cohorts of size 2) using analytic methods, where it is assumed that each fragment size is a real-valued random variable uniformly distributed on the unit interval, for example. However, in practice, memory at the level of the C-RAM in the memory hierarchy is accessed in certain minimum-size units, which we call granules. For memories of the size we are considering here, granules of size of the order of 32 bytes are appropriate. It is desirable for performance reasons to store each fragment as a number of complete granules. With 256-byte blocks and 32-byte granules, it follows that the last block used to store a line, that is, the fragment, consists of one to eight granules, with each size equally likely.
With these assumptions, it is possible to compute exactly the expected number of blocks used to combine the fragments in a cohort, given the number of granules per block, the fragment-combining method, the degree of fragment combining, and the cohort size, simply by examining all possible cases (providing the cohort size is not too large). The expected number of blocks can then be used to derive the expected compression.
In more detail, this is computed as follows. Assume that the cohort size is 4 and that there are eight granules per block, and let F(f1, f2, f3, f4) be a function that computes, for a given fragment-combining method, the number of blocks required to combine four fragments of sizes f1, f2, f3, and f4 (1 fi 8). Assuming a uniform independent distribution of compressed line sizes as above, the expected number of blocks used by a compressed line not counting the last block is (0 + 1 + 2 + 3)/4 = 1.5. The expected number of blocks used for the fragments in the cohort, say X, is computed as follows:
|
X =
|
1
|
|
|
|
|
F(f1, f2, f3, f4).
|
|
|
84
|
|
1 f1 8
|
1 f2 8
|
1 f3 8
|
1 f4 8
|
 |
 |
 |
 |
 |
 |
The expected compression, computed for the entire cohort, is therefore
|
(compression) =
|
256(1.5C + X)
|
|
|
1024C
|
where C = (cohort size) = 4. This formula is the same for any cohort size; however, for cohort sizes other than 4, the formula for computing X is changed in the obvious way.
Compression results for two-way fragment combining obtained using the above methods are shown in Table 1. The methods used were fragment catenation (CAT), first-fit (FF2), best-fit (BF2), and optimal-fit (OF2). As previously described, catenation and optimal-fit are shown only for comparison.
|
|
Table 1 Expected compression (%), two-way combining.
|
| |
| |
| |
| |
| |
|
C
|
CAT
|
FF2
|
BF2
|
OF2
|
|
|
2
|
57.03
|
57.03
|
57.03
|
57.03
|
|
3
|
55.21
|
56.68
|
56.68
|
56.68
|
|
4
|
54.30
|
55.82
|
55.77
|
55.49
|
|
5
|
53.75
|
55.57
|
55.51
|
55.24
|
|
6
|
53.38
|
55.23
|
55.15
|
54.73
|
|
7
|
53.13
|
55.05
|
54.96
|
54.54
|
|
8
|
52.93
|
54.86
|
54.76
|
54.25 |
|
Exact results on expected compression, using the same set of methods, were also obtained for three-way fragment combining (up to three fragments are allowed to share a block), as shown in Table 2 (results for the catenation method, which of course does not depend on the degree of fragment combining, are shown again for ease of comparison).
|
|
Table 2 Expected compression (%), three-way combining.
|
| |
| |
| |
| |
| |
|
C
|
CAT
|
FF3
|
BF3
|
OF3
|
|
|
3
|
55.21
|
55.76
|
55.76
|
55.76
|
|
4
|
54.30
|
55.23
|
55.21
|
55.08
|
|
5
|
53.75
|
54.83
|
54.76
|
54.55
|
|
6
|
53.38
|
54.55
|
54.45
|
54.19
|
|
7
|
53.13
|
54.35
|
54.21
|
53.93
|
|
8
|
52.93
|
54.18
|
54.01
|
53.71 |
|
First consider the results as compared to not using combining. The expected compression using no combining is obtained by setting X to 1 above, which gives 62.5% (as in a previous example). Thus, combining gives a substantial improvement, even for the simplest case, in which the cohort is of size 2, in which the expected compression is 57.03% (for a cohort of size 2, all combining methods are equivalent, since there is no choice for selection of fragments to combine).
Next, examining various combining results, we note that first-fit, best-fit, and optimal-fit are essentially equivalent, with best-fit giving at most a fraction of a percent in compression improvement over first-fit, and optimal-fit a slightly larger but still fractional percent improvement over best-fit. Since optimal-fit is not suitable for an actual implementation (it is included only as a bound), the conclusion is that whichever method is easiest for use in a hardware-based memory controller can be used with no impact on performance. Note also that even if we remove the constraint that fragments must be combined into single blocks (i.e., the constraint that fragments not cross block boundaries), as in the catenation method, there is still only a modest improvement. Using catenation, we obtain the best compression possible, subject to the constraints that the entire cohort is stored as an integral number of blocks and that each line is stored as an integral number of granules. Although this is not practical for a hardware-based memory controller, it is interesting that practical methods such as two-way first-fit or best-fit fragment combining do almost as well.
Next we examine the effects of cohort size and degree of fragment combining. Consider the results for first-fit, for example. The choice of cohort size and degree of fragment combining have an impact on the complexity and latency of operations for the memory controller. With two-way combining, compression is slightly improved (from about 55.8% to 54.9%) by increasing the cohort size from 4 to 8; that is, less than one percent improvement in compression performance is achieved. Increasing the degree of fragment combining has even less of an effect. Since we might expect this to be more significant for larger cohort sizes, consider the first-fit results for a cohort size of 8: By moving from two-way to three-way fragment combining, compression is improved only from about 54.9% to 54.2%. Overall, of the various factors related to choosing a practical design, the most improvement is seen when one moves up from a minimal cohort of size 2 to larger cohort sizes.
Simulations were also used to investigate fragment-combining methods. These were used to find average compression for larger cohort sizes (up to 16), and also validated the exact computation method results for smaller cohort sizes. Furthermore, simulations permitted studies of a C-RAM under steady-state conditions, in which after filling the C-RAM with compressed lines (with randomly generated sizes as above), lines were then selected randomly (any line equally likely), the size of the selected line changed to a new random value (again, from 1 to 1024 bytes equally likely), and then recombined (if possible) with the other lines in its cohort. In contrast, the above exact computation results correspond to what we term an initial-fill simulation, in which all of the lines in a C-RAM are stored in order (with multiple runs providing an average). Finally, simulations were used for preliminary studies of various alternatives in fragment recombining methods in the steady-state case, and of sliding-cohort-type designs as previously described.
Results for partitioned cohorts, where each cohort consists of one, two, or four pages, are shown in Table 3. These results are for two-way fragment combining, using first-fit (FF2) or best-fit (BF2) methods. Both initial-fill and steady-state results are shown.
|
|
Table 3 Average compressed 4KB page sizes (simulations).
|
| |
| |
| |
| |
| |
|
Cohort size
|
FF2
|
BF2
|
|
|
Initial fill
|
Steady state
|
Initial fill
|
Steady state
|
|
|
4
|
2286
|
2304
|
2284
|
2300
|
|
|
(55.8%)
|
(56.3%)
|
(55.8%)
|
(56.2%)
|
|
8
|
2247
|
2272
|
2243
|
2262
|
|
|
(54.9%)
|
(55.5%)
|
(54.8%)
|
(55.2%)
|
|
16
|
2216
|
2249
|
2211
|
2231
|
|
|
(54.1%)
|
(54.9%)
|
(54.0%)
|
(54.5%) |
|
Note that compression is slightly better in the initial-fill cases than in the corresponding steady-state cases. At first this might seem counterintuitive, since in the steady-state case the fragment of each changed line can potentially recombine with the fragment of any other line in the cohort, whereas in the initial-fill case each fragment can combine only with the preceding fragments in the cohort. However, a more detailed analysis reveals that conditions arise in the steady-state case in which fragments are not combined that would have been combined under initial fill.
Last, we comment on the general characteristics of the sliding-cohort simulation results. The results were mixed, in that slightly better compression was obtained for initial-fill cases, but slightly worse compression was obtained for steady-state cases.
4. Analysis using memory dumps
In this section we present results for three examples of real systems: main-memory dumps were obtained for 1) a 64MB AIX* system, running a memory-intensive logic simulator; 2) a 32MB Windows NT** system in which a dump was obtained immediately after the system boot process completed; and 3) a 32MB Windows NT system in which a number of different applications were started so as to allocate all of main memory. Below, these are referred to respectively as the AIX(active), NT(boot), and NT(active) memory dumps. A program emulating a four-way parallel shared-dictionary compression method [7] was used to compress each memory dump, one line (1024 bytes) at a time, resulting in a sequence of compressed- line sizes. Each sequence of compressed-line sizes was then used as input to various initial-fill-type simulations, as described in the previous section.
Results for the AIX and NT dumps are shown in Tables 4, 5, and 6 for block sizes of 64, 128, and 256 bytes, respectively; in all cases the granule size is 32 bytes. A block size of 64 bytes is a special case: With 32-byte granules, each fragment is either full or half full. Given a cohort, the same number of blocks are produced by any of the fragment-combining methods in this case (including catenation): If there are an even number of half-full fragments in a cohort, they all combine pairwise; otherwise, with an odd number, all but one fragment combine pairwise, leaving exactly one half-full fragment. Therefore, in Table 4 we give (for each memory dump and cohort size) one compression result. For the larger block sizes we give compression results for FF2 and FF3; best-fit results were also obtained but are not shown because, to the accuracy used in the tables (0.1%), the compression results for best-fit are essentially the same as for first-fit.
|
|
Table 4 Compression (%) of memory dumps, block size = 64.
|
| |
| |
| |
| |
| |
|
Cohort size
|
AIX(active) CAT, FF, BF
|
NT(boot) CAT, FF, BF
|
NT(active) CAT, FF, BF
|
|
|
4
|
36.0
|
50.7
|
38.2
|
|
8
|
35.8
|
50.5
|
38.1
|
|
16
|
35.7
|
50.4
|
38.0 |
|
|
|
Table 5 Compression (%) of memory dumps, block size = 128.
|
| |
| |
| |
| |
| |
Cohort size
|
AIX(active)
|
NT(boot)
|
NT(active)
|
|
|
|
|
FF2
|
FF3
|
FF2
|
FF3
|
FF2
|
FF3
|
|
|
4
|
38.1
|
37.2
|
53.4
|
52.0
|
41.2
|
39.5
|
|
8
|
37.8
|
36.7
|
53.0
|
51.3
|
40.8
|
38.8
|
|
16
|
37.5
|
36.4
|
52.7
|
51.0
|
40.5
|
38.5 |
|
|
|
Table 6 Compression (%) of memory dumps, block size = 256.
|
| |
| |
| |
| |
| |
Cohort size
|
AIX(active)
|
NT(boot)
|
NT(active)
|
|
|
|
|
FF2
|
FF3
|
FF2
|
FF3
|
FF2
|
FF3
|
|
|
4
|
43.3
|
40.4
|
59.0
|
55.0
|
47.4
|
42.6
|
|
8
|
42.7
|
38.9
|
58.3
|
53.3
|
46.7
|
40.8
|
|
16
|
42.1
|
38.3
|
57.7
|
52.5
|
46.1
|
40.0 |
|
As discussed in the previous section, the compression obtained using catenation is the best possible, subject to the constraints that each line is stored as an integral number of granules and that each cohort is stored as an integral number of blocks. In fact, the compression results shown in Table 4 (for a block size of 64 bytes) are quite close to the raw line size compressions; that is, they are close to the average of the compressed-line sizes used as input to the initial-fill-type simulation divided by 1024. For comparison, these are 34.1% for the AIX memory dump, 48.8% for the NT(boot) dump, and 36.3% for the NT(active) dump. However, even though the compression results shown here are best for the 64-byte block size, we see in the next section that this is not the best choice of block size when the required directory space is taken into account.
In contrast to the results based on a uniform distribution of compressed-line sizes, note that there is a significant improvement from using three-way combining for a block size of 256 bytes (the improvement is less for the smaller block sizes). An analysis of the distribution of line sizes for the various dumps does in fact show a strong degree of nonuniformity, with sharp peaks for a number of small compressed-line sizes. For example, in the case of the AIX dump, out of 65536 lines, the most common compressed-line size is 13 bytes (corresponding to a line consisting of the same repeated byte, e.g., a line consisting of all nulls); such lines occur 5828 times; i.e., they comprise approximately 9% of all lines. Owing to the presence of these and other small compressed-line sizes, one would expect an advantage in being able to combine more than two fragments. However, as noted previously, in a fixed-size directory entry design, it is possible to store sufficiently small compressed lines entirely in the directory entry for the line. With 256-byte blocks, directory entries could consist of four pointers plus flag bits, etc., and require 16 bytes; in such a case, lines compressing to 15 bytes or less could be stored entirely within the directory entry. Thus, in practice, three-way combining might not be as effective as indicated above, since the small compressed lines that give the improvement would not be available for combining when stored within directory entries. In order to investigate this, results were obtained for the 256-byte block case with the following modification: The compressed-line size for all lines compressing to 15 bytes or less was reset to 0 (emulating storing the line entirely within the directory entry). The results are shown in Table 7.
|
|
Table 7
|
Compression (%) of memory dumps, block size = 256 (lines compressing to 15 bytes or less in directory).
|
| |
| |
| |
| |
| |
Cohort size
|
AIX(active)
|
NT(boot)
|
NT(active)
|
|
|
|
|
FF2
|
FF3
|
FF2
|
FF3
|
FF2
|
FF3
|
|
|
4
|
41.1
|
39.5
|
55.2
|
53.5
|
43.3
|
41.1
|
|
8
|
40.5
|
38.3
|
54.5
|
52.3
|
42.6
|
39.8
|
|
16
|
39.9
|
37.6
|
54.0
|
51.7
|
42.0
|
39.1 |
|
5. Optimal block sizes
In this section we consider the problem of choosing an optimal block size. The tradeoff involved in choosing a block size is that the larger the block size, the more wasted space there is due to fragmentation; the smaller the block size, the more space is required for the directory.
In order to obtain an analytic result, we make some simplifying assumptions. One of these is that the C-RAM has a fixed average compression (that is, we assume that the average compression is a constant). In the case of a static-directory structure, our analysis applies to the case in which the average compression is set to the maximum level of compression supported by the directory structure; the resulting block size will be (subject to the simplifying assumptions) that which minimizes the total required space for a static-directory design with the given maximum compression parameter. In the case of a dynamic-directory structure, the average compression should be set to a value which is expected to be close to that which will occur in practice; the resulting block size will be approximately optimal (in terms of total required space) when the C-RAM is operating at this degree of compression. The following analysis actually applies to any storage structure in which a number of fixed-size blocks are used to store objects of variable size; therefore, we use the term object to refer to a variable-size entity that is being stored; in the context of C-RAM design, each object is a compressed line.
Given a memory of total size M bytes, this memory contains a directory which, for each object (i.e., compressed line in the C-RAM context), gives the location of the blocks used to store the object. Here we use the term directory in a general fashion: The directory consists of all data in the memory other than blocks. For example, a simplistic directory structure is the following: 1) external tables provide a pointer to the first block used to store a given object; 2) each block is extended with a pointer field, which is used to point to the next block used to store the object, or is null if the block is the last such block used. In this example, the collection of all pointers is considered to be the directory.
Let q be the number of bytes used by the directory (in the above general sense) per block stored in the memory. For the previous simple design example, determining q is straightforward: q is just the length (in bytes) of a block pointer. In other cases, estimating q can require additional information. For example, in the case of a directory structure as previously illustrated in Figure 1, each directory entry could have four pointers to blocks, plus flags and fragment-combining information. Suppose this directory entry requires q' bytes. In order to estimate q, we also need to know the average compression. If the average compression is 50%, say, then even though the average number of blocks pointed to by directory entries could be greater than 2 (due to fragment combining), the total number of blocks is half the total number of pointers in the directory. In this case, q would be q'/2. If, for example, each directory entry were 16 bytes long, q would be 8 bytes; if, on the other hand, compression were 25%, q would be 16 bytes, and so on. Although formally q could be considered to be a function of block size and possibly other parameters, the above examples illustrate that in practice q can be estimated easily given a particular directory structure, together with (in some cases) the assumed constant average compression. Therefore, we assume that q is a constant; the implications of this simplifying assumption are discussed below.
Next, let p be the average size (in bytes) of an object, and let b be the block size. Our objective is to find a value of b that maximizes the total number of objects that can be stored in a memory of size M (equivalently, as shown below, given the total number of objects, we will find a value of b that minimizes the memory size M required to store the objects). In order to do this, we need to consider the effect of fragment combining. Suppose there were no fragment combining. Then, assuming a continuous uniform distribution of fragments, the expected number of blocks to store an object would not be simply p/b, but rather p/b + 1/2, since the last block used to store an object (under these assumptions) is on the average half full. If pairs of fragments were catenated, then again, assuming a uniform continuous distribution of fragment sizes, it can be shown that the expected size of a fragment would be 1/2 of a block. However, now this fragment would be shared between two objects, so in this case the expected number of blocks to store a compressed object would be p/b + 1/4. We parameterize the effect of fragment combining as follows: Given a particular fragment-combining scheme, if the average number of blocks to store an object under this scheme is found to be p/b + 1/n, we call n the fragment-combining effectiveness. For fragment-combining methods described in previous sections, such as two-way combining using first-fit or best-fit with a cohort size of 4 and 32-byte granules, n can be computed and has been found to be approximately 4 (e.g., n is found to be 4.3 for FF2 using the values from Table 1 of Section 3).
Given the total memory size M, as discussed earlier, the tradeoff in finding an optimal value for the block size b is as follows: As b increases, there is more wasted space in fragments, since on a per-object basis each fragment is on the average a fraction 1/n of a full block; as b decreases, there is a larger total number of blocks, and therefore more overhead in directory space, since each block requires q bytes of directory space. An estimate of the optimal value of b can be found as follows, using a continuous approximation. First, the total number of blocks in the memory is
Next, the total number of objects T that can be stored in the memory as a function of b is given by the following:
|
T(b) =
|
M/(b + q)
|
=
|
Mnb
|
|
|
|
p/b + 1/n
|
(b + q)(np + b)
|
From this expression, it is easily found (see [4]) that, under the above assumptions, the optimal value of b is given by the following:
bopt = (npq).
As an example, suppose we use a directory structure for a C-RAM with directory entries consisting of K block pointers (which includes flags and fragment-combining information), and suppose each such entry requires 4K bytes (where K depends on the block size; that is, smaller blocks require more pointers for a given fixed line size in the case that the line cannot be compressed). As in the discussion above, under 50% compression this gives a value for q of approximately 8 bytes (note that this is independent of K). Setting p to 512 bytes (for 50% compression of 1024-byte lines), and setting n to 4 (which is typical for two-way combining using first-fit or best-fit with granules of size 32 bytes), we find
bopt (4 × 512 × 8) = 128.
|