1. Introduction
Because of the slow mechanical nature of many storage devices, the importance of optimizing I/O operations has been well recognized. As a result, a plethora of optimization techniques including caching, write buffering, prefetching, request scheduling, and parallel I/O have been invented. The relative effectiveness of these techniques, however, is not clear because they have been studied in isolation by different researchers using different methodologies. Furthermore, because many of the techniques have not been evaluated with real workloads, their actual effect is not known. Some of the ideas have been proposed or implemented, but few or no performance results have been published (e.g., opportunistic prefetching). As the performance gap between the processor and disk-based storage continues to widen [1, 2], increasingly aggressive optimization of the storage system is needed, and this requires a good understanding of the real potential of the various I/O optimization techniques and how they work together. In this paper, we systematically investigate how the different techniques affect actual performance by using trace-driven simulations with a large set of traces gathered from a wide range of real-world settings, including both server and personal computer (PC) environments. To make our findings more broadly applicable, we focus on general rules of thumb about what can be expected from each of these techniques rather than precise quantification of improvement for a particular workload and a specific implementation.
Tremendous efforts have also gone into improving the underlying technology of disks. The improvement in disk technology is usually quantified by using physical metrics, such as the tracks or bits per inch, the average seek time, and the rotational speed. Relating physical metrics like these to the performance delivered to real workloads is, however, difficult. Thus, it is not apparent how an improvement in one metric compares with an improvement in another in terms of their real-world impact. Furthermore, some of the metrics are not focused on performance but have a significant effect on it. For example, increasing the recording density could improve performance because, if the bits are packed more closely together, they can be accessed with a smaller physical movement. In this paper, we break down the steady improvements in disk technology into four major basic effects: seek time reduction resulting from actuator improvement, increase in rotational speed, linear density improvement, and increase in track density. Then we separately analyze each one to determine its effect on real workloads. We also examine the historical rates of improvement and use the trends to project the actual performance improvement that can be expected from disk technology scaling.
In a companion paper [3], we analyze in detail the characteristics of the various workloads we use, specifically 1) the I/O intensity of the workloads and the overall significance of I/O in the workloads; 2) how the I/O load varies over time and how it behaves when aggregated; and 3) the interaction of reads and writes and how it affects performance. Although the current paper is self-contained, readers are encouraged to also read the companion paper to better understand the workloads on which this analysis is based. The insights gained from the current study motivated the idea of automatic locality-improving storage (ALIS) [4], which is a storage system that continually monitors the way it is accessed and then automatically reorganizes selected disk blocks so that accesses become effectively more sequential. In fact, the results we derive here serve as the baseline for the analysis of ALIS in [4]. Therefore, this paper has an emphasis on the optimizations that directly affect ALIS—in particular, the prefetching.
The remainder of this paper is organized as follows. Section 2 contains a brief overview of previous work in evaluating I/O optimization techniques. Section 3 discusses our methodology and describes the traces that we use. In Section 4, we analyze the effect of the various optimization techniques. In Section 5, we consider the real impact of disk technology improvement over time. Section 6 concludes and summarizes this paper. Because of the huge amount of data that is involved in this study, we can present only a characteristic cross section here. More detailed graphs and data are available in [5].
2. Related work
Various I/O optimization techniques have been individually evaluated by different researchers using dissimilar methodologies including discrete event simulation and analytical modeling. In some cases, the simulations are based on traces of real workloads and in others, randomly generated synthetic workloads. For instance, disk caching is extensively analyzed in [6, 7], prefetching in [8, 9], write buffering in [10, 11], request scheduling in [12–14], and striping in [15, 16]. At the logical level, caching, prefetching, and write buffering are well covered in [17, 18]. Several researchers have also explored ways to improve the various techniques in special situations where the reference pattern is known ahead of time (e.g., [19]). Because improving I/O performance is important, there has been a lot of research on I/O optimization techniques. We mention only some of the more recent work. The reader is referred to [20] for a comprehensive survey of early work on I/O optimization.
3. Methodology
The methodology used in this paper is trace-driven simulation [21, 22]. In trace-driven simulation, relevant information about a system is collected while the system is handling the workload of interest. This is referred to as tracing the system and is usually achieved by using hardware probes or by adding instrumentation to the software. In the second phase, the resulting trace of the system is played back to drive a model of the system under study. Trace-driven simulation is thus a form of event-driven simulation in which the events are taken from a real system that is operating under conditions similar to the ones being simulated. A common difficulty in using trace-driven simulations to study I/O systems is to realistically model timing effects, specifically to account for events that occur faster or slower in the simulated system than in the original system. This difficulty arises because information about how the arrival of subsequent I/Os depends upon the completion of previous requests cannot be easily extracted from a system and recorded in the traces. As described below, we create and use a new method for replaying I/O traces that more accurately models the timing of I/O arrivals and that allows the I/O rate to be more realistically scaled (e.g., when processor power is increased) than previous practice.
Modeling timing effects
In general, simulation models used to evaluate storage system performance can be broadly classified into open and closed models, depending on how request arrivals are choreographed. The closed model traditionally maintains a constant population of outstanding requests. Whenever a request is completed, a new request is issued in its place, sometimes after a simulated “think” time. These models essentially assume that all of the I/Os are time-critical [23], so that a new I/O is issued only after a previous request is completed. By maintaining a constant population of outstanding requests, these models effectively smooth out any burstiness in the I/O traffic. Such an approach is clearly not representative of real workloads, which have been shown in several studies (e.g., [3]) to have bursty I/O traffic patterns.
In the open model, requests arrive at predetermined times (e.g., traced time in [24] and traced interarrival time scaled by a constant factor in [14]), independently of the performance of the storage system. Such models assume that the workload consists exclusively of time-noncritical requests [23], so that whether a preceding request is completed has no bearing on when the system is able to issue subsequent I/Os. Again, this is clearly not true in real systems, in which an overloaded storage system, by being slow, automatically exerts back pressure on the processes generating the I/Os. For example, 66–91% of the I/Os are flagged as synchronous in PC workloads [3] and 52–74% in UNIX** workloads [25]. In other words, the system generally has to wait for I/Os to be completed before it can continue with subsequent processing. Such data highlights the importance of accounting for the feedback effect between request completion and subsequent request issuance. From a practical perspective, having a feedback mechanism also ensures that the number of outstanding requests will not grow without bound whenever the storage system is unable to handle the incoming workload.
Modeling the feedback effect and thereby limiting the number of outstanding requests is especially helpful in this study because we have a diverse set of workloads collected over the span of several years and a wide range of experiments in which the performance of the storage system is significantly varied. Some of our experiments evaluate techniques that are opportunistic; i.e., they take advantage of idle time. Therefore, we have to account for the burstiness seen in real I/O traffic. With these requirements in mind, we came up with a methodology designed to incorporate feedback between request completion and subsequent I/O arrivals, and model burstiness.
Results in [3] show that there is effectively little multiprocessing in PC workloads and that most of the I/Os are synchronous. Such predominantly single-process workloads can be modeled by assuming that after completing an I/O, the system has to do some processing and the user some “thinking” before the next set of I/Os can be issued. For instance, in the timeline in Figure 1, after request R0 is completed, there are delays during which the system is processing and the user is thinking before requests R1, R2, and R3 are issued. Because R1, R2, and R3 are issued after R0 has been completed, we consider them to be dependent on R0. Similarly, R4 and R5 are deemed to be dependent on R1. Presumably, if R0 is completed earlier, R1, R2, and R3 will be dragged forward and issued earlier. If this in turn causes R1 to be finished earlier, R4 and R5 will be similarly moved forward in time. The “think” time between the completion of a request and the issuance of its dependent requests can be adjusted to speed up or slow down the workload. In short, we consider a request to be dependent on the last completed request, and we issue a request only after its parent request has completed. For multiprocessing workloads, this dependence relationship should be maintained on a per-process basis, but unfortunately process information is not always available in I/O traces. To account for multiprocessing workloads, we merge multiple traces to form a workload with several independent streams of I/O, each obeying the dependence relationship described above.
Figure 1
In essence, we have built an out-of-order multiple-issue machine that tries to preserve the dependency structure between I/O requests. We maintain an issue window of 64 requests. A request within this window is issued when the request on which it is dependent completes and the think time has elapsed. Inferring the dependencies based on the last completed request is the best we can do given the block-level traces we have. If the workloads were completely described using logical and higher-level system events (e.g., system calls and interrupts), we might be able to more accurately model feedback effects using a system-level model (e.g., [23]). In the limit, we can run the workloads on a system simulator where we have control over the timing of events [26] or on a virtual machine [27] or on a real system with a timing-accurate storage emulator [28]. However, getting real users to release traces of reference address is difficult enough. Asking them for logical data about their computer operations is next to impossible. Moreover, “capturing” a workload so that it can be realistically replayed may be relatively easy for batch jobs, but it is very difficult for interactive workloads. We essentially end up with the same problem of having to decide what happens when the system reacts faster. For instance, will the user click the mouse earlier?
Workloads and traces
The traces analyzed in this study were collected from both server and PC systems running real user workloads on three different platforms—Microsoft Windows NT**, IBM AIX*, and HP-UX**. All of them were collected downstream of the database buffer pool and the file system cache. Thus, these are real I/O traces, not logical ones. The PC traces were collected by using VTrace [29], a software tracing tool for Intel** x86 PCs running Windows NT/2000**. In this study, we are primarily interested in the disk activities, which are collected by VTrace through the use of device filters. We have verified the disk activity collected by VTrace with the raw traffic observed by a bus (SCSI) analyzer. Both the IBM AIX and HP-UX traces were collected using kernel-level trace facilities built into the respective operating systems. Most of the traces were gathered over periods of several months, but to keep the simulation time manageable, we use only the first 45 days of the traces, of which the first 20 days are used to warm up the simulator.
The PC traces were collected from the primary systems of a wide variety of users, including engineers, graduate students, a secretary, and several people in senior managerial positions. Because we have a wide variety of users in our sample, we believe that our traces are illustrative of the PC workloads in many offices, especially those involved in research and development. Note, however, that the traces should not be taken as typical or representative of any other system or environment. Despite this disclaimer, the fact that many of their characteristics correspond to those obtained previously (see [3]), albeit in somewhat different environments, suggest that our findings can, to a large extent, be generalized. Table 1(a) summarizes the characteristics of these traces. We denote the PC traces as P1, P2, …, P14 and the arithmetic mean of their results as P-Avg. As detailed in [3], the PC traces contain only I/Os that occur when the user is actively interacting with the system. Specifically, we consider the system to be idle from ten minutes after the last user keyboard or mouse activity until the next such user action, and we assume that there is no I/O activity during the idle periods. We believe that this is a reasonable approximation in the PC environment, although it is possible that we are ignoring some level of activity resulting from periodic system tasks such as daemons. This latter type of activity should have a negligible effect on the I/O load, and is not likely to be noticed by the user.
| (a) Personal systems |
|
Desig- nation | User type | System configuration | Trace characteristics |
|
|
| System (MHz) | Memory (MB) | File systems (GB) | Storage used i (GB) | No. disks | Duration | Footprint iv (GB) | Traffic (GB) | Requests (×106) | R/W ratio |
|
| P1 | Engineer | 333 P6 | 64 | 1 GB FATii 5 GB NTFSiii | 6 | 1 | 45 days (7/26/99–9/8/99) | 0.945 | 17.1 | 1.88 | 2.51 |
| P2 | Engineer | 200 P6 | 64 | 1.2, 2.4, 1.2 GB FAT | 4.8 | 2 | 39 days (7/26/99–9/2/99) | 0.509 | 9.45 | 1.15 | 1.37 |
| P3 | Engineer | 450 P6 | 128 | 4, 2 GB NTFS | 6 | 1 | 45 days (7/26/99–9/8/99) | 0.708 | 5.01 | 0.679 | 0.429 |
| P4 | Engineer | 450 P6 | 128 | 3, 3 GB NTFS | 6 | 1 | 29 days (7/27/99–8/24/99) | 4.72 | 26.6 | 2.56 | 0.606 |
| P5 | Engineer | 450 P6 | 128 | 3.9, 2.1 GB NTFS | 6 | 1 | 45 days (7/26/99–9/8/99) | 2.66 | 31.5 | 4.04 | 0.338 |
| P6 | Manager | 166 P6 | 128 | 3, 2 GB NTFS | 5 | 2 | 45 days (7/23/99–9/5/99) | 0.513 | 2.43 | 0.324 | 0.147 |
| P7 | Engineer | 266 P6 | 192 | 4 GB NTFS | 4 | 1 | 45 days (7/26/99–9/8/99) | 1.84 | 20.1 | 2.27 | 0.288 |
| P8 | Secretary | 300 P5 | 64 | 1, 3 GB NTFS | 4 | 1 | 45 days (7/27/99–9/9/99) | 0.519 | 9.52 | 1.15 | 1.23 |
| P9 | Engineer | 166 P5 | 80 | 1.5, 1.5 GB NTFS | 3 | 2 | 32 days (7/23/99–8/23/99) | 0.848 | 9.93 | 1.42 | 0.925 |
| P10 | CTO | 266 P6 | 96 | 4.2 GB NTFS | 4.2 | 1 | 45 days (1/20/00–3/4/00) | 2.58 | 16.3 | 1.75 | 0.937 |
| P11 | Director | 350 P6 | 64 | 2, 2 GB NTFS | 4 | 1 | 45 days (8/25/99–10/8/99) | 0.73 | 11.4 | 1.58 | 0.831 |
| P12 | Director | 400 P6 | 128 | 2, 4 GB NTFS | 6 | 1 | 45 days (9/10/99–10/24/99) | 1.36 | 6.2 | 0.514 | 0.758 |
| P13 | Grad. student | 200 P6 | 128 | 1, 1, 2 GB NTFS | 4 | 2 | 45 days (10/22/99–12/5/99) | 0.442 | 6.62 | 1.13 | 0.566 |
| P14 | Grad. student | 450 P6 | 128 | 2, 2, 2, 2 GB NTFS | 8 | 3 | 45 days (8/30/99–10/13/99) | 3.92 | 22.3 | 2.9 | 0.481 |
| P-Avg. | — | 318 | 109 | — | 5.07 | 1.43 | 41.2 days | 1.59 | 13.9 | 1.67 | 0.816 |
|
| (b) Servers |
|
Desig- nation | Primary function | System configuration | Trace characteristics |
|
|
| System | Memory (MB) | File systems | Storage used (GB) | No. disks | Duration | Footprint iv (GB) | Traffic (GB) | Requests (× 106) | R/W ratio |
|
| FS1 | File server (NFSv) | HP 9000/720** (50 MHz) | 32 | 3 BSD** FFSviii (3 GB) | 3 | 3 | 45 days (4/25/92–6/8/92) | 1.39 | 63 | 9.78 | 0.718 |
| TS1 | Time-sharing system | HP 9000/877** (64 MHz) | 96 | 12 BSD FFS (10.4 GB) | 10.4 | 8 | 45 days (4/18/92–6/1/92) | 4.75 | 123 | 20 | 0.794 |
| DS1 | Database server (ERPvi) | IBM RS/6000* R30 SMPvii (4× 75 MHz) | 768 | 8 AIX* JFSix (9 GB), 3 paging (1.4 GB), 30 raw database partitions (42 GB) | 52.4 | 13 | 7 days (8/13/96–8/19/96) | 6.52 | 37.7 | 6.64 | 0.607 |
| S-Avg. | — | — | 299 | — | 18.5 | 8 | 32.3 days | 4.22 | 74.6 | 12.1 | 0.706 |
|
i Sum of all file systems and allocated volumes
ii File allocation table
iii Microsoft NTFS
iv Amount of data referenced at least once
v Sun Microsystems Network File System** |
vi Enterprise resource planning
vii Symmetric multiprocessor
viii Wind River Systems BSD fast file system
ix IBM AIX journal file system
|
|
The servers traced include a file server, a time- sharing system, and a database server. The characteristics of these traces are summarized in Table 1(b). Throughout this paper, we use the term S-Avg. to denote the arithmetic mean of the results for these server workloads. The first file server trace (FS1) was taken off a file server for nine clients at the University of California, Berkeley. This system was primarily used for compilation and editing. It is referred to as Snake in [25]. The trace denoted TS1 was gathered on a time-sharing system at an industrial research laboratory. It was mainly used for news, mail, text editing, simulation, and compilation. It is referred to as Cello in [25]. The database server trace (DS1) was collected at one of the largest health insurers in the nation. The system traced was running an enterprise resource planning (ERP) application on top of a commercial database system. This trace is only seven days long, and the first three days are used to warm up the simulator. More details about the traces and how they were collected can be found in [3].
In addition to these base workloads, we scale up the traces to obtain workloads that are more intense. Results reported in [3] show that for the PC workloads, the processor utilization during the intervals between the issuance of an I/O and the last I/O completion is related to the length of the interval by a function of the form f(x) = 1/(ax + b), where a = 0.0857 and b = 0.0105. To model a processor that is n times faster than the one in the traced system, we would scale only the system processing time by n, leaving the user portion of the think time unchanged. Specifically, we would replace an interval of length x with one of length x[1 - f(x) + f(x)/n]. In this paper, we run each workload preserving the original think time. For the PC workloads, we also evaluate what happens in the limit when systems are infinitely fast; i.e., we replace an interval of length x with one of x[1 - f(x)]. We denote these speeded-up PC workloads as P1s, P2s, …, P14s and the arithmetic mean of their results as Ps-Avg.
We also merge ten of the longest PC traces to obtain a workload with ten independent streams of I/O, each of which obeys the dependence relationship discussed above. We refer to this merged trace as Pm. The volume of I/O traffic in this merged PC workload is similar to that of a server supporting multiple PCs. However, its locality characteristics are different because there is no sharing of data among the different users, so that if two users are both using the same application, they end up using different copies of the application. Pm might be construed as the workload of a system on which multiple independent PC workloads are consolidated. For the server workloads, we merge the FS1 and TS1 traces to obtain another trace which we denote as Sm. Note that neither method for scaling up the workloads is perfect, but we believe that they are more realistic than simply scaling the interarrival time, as is commonly done. In this paper, we often use the term PC workloads to refer collectively to the base PC workloads, the speeded-up PC workloads, and the merged PC workload. The term server workloads likewise refers to the base server workloads and the merged server workload.
Simulation model
The major components of our simulation model are presented in Figure 2. In practice, optimizations such as caching, prefetching, write buffering, request scheduling, and striping may be performed at multiple levels in the storage system. For instance, there may be several outboard storage controllers, storage adapters (host bus adapters), and disk drives, and they may all perform some of the optimizations to some extent. The number of combinations of which does what and to what extent is large, and the interaction among the optimizations performed at the various levels is complicated and obscure. To gain fundamental insights into the effectiveness of each of the optimizations, we collapse the different levels and model each of the optimizations once.
Figure 2
For example, we model only a single level of cache instead of a disk-drive cache, an adapter cache, a controller cache, and so on. This approach does not expose the interference that occurs when the different levels in the storage stack are all attempting to do some of the same optimizations, but cutting down on the interference is the only way we can look at the real effect of each of the optimizations. The interference is interesting but beyond the scope of the current paper. Furthermore, a well-designed system will have a level at which a particular technique dominates. For instance, for caching, the adapter cache should be bigger than the disk-drive cache so that its effect dominates. For other techniques, such as request scheduling, there is a level at which it can best be implemented. At appropriate points in the paper, we discuss such issues and how we handle them in our simulator.
Even though we simulate only a single instance of each of the optimization techniques, there are many parameters for each technique, and their combination makes for a huge design space. In order to systematically examine the effect of each technique, we pick two reasonable base configurations and perturb them in one dimension at a time. The default parameters used in these base configurations are summarized in Figure 2. As we study each technique individually, the relevant parameters are analyzed and described in detail. As its name suggests, the resource-rich configuration is meant to represent an environment in which resources in the storage system are plentiful, as may be the case when there is a large outboard controller. The resource-poor environment is supposed to mimic a situation in which the storage system consists of only disks and low-end disk adapters.
Our simulator is written in C++ using the CSIM simulation library [30]. It is based upon a detailed model of the mechanical components of the IBM Ultrastar* 73LZX [31] family of disks that is used in disk development and that has been validated against test measurements obtained on several batches of the disk. We model the disk to a similar level of detail as in the publicly available DiskSim package [32]. However, instead of using the same seek profile for reads and writes and accounting for the difference by a constant write-settling delay, we use separate read and write seek curves to more accurately model the disk. As shown in Figure 3, the seek curves for this disk can be approximated by a power function for seeks of less than 5000 tracks and a linear function for longer seeks.
Figure 3
The IBM Ultrastar 73LZX family of 10K-rpm disks was introduced in early 2001 and consists of four members with storage capacities of 9.1, 18.3, 36.7, and 73.4 GB. The performance characteristics of each are almost identical, with the difference in capacity coming from the number of platters. The higher-capacity disk should have a longer seek time because of the increased inertia of the disk arm, but the effect is small. The average seek time is specified to be 4.9 ms, and the data rate varies from 29 MB/s at the inner edge to 57 MB/s at the outer edge. The track density for this series of disks is 27,000 tracks per inch, while the linear density is as high as 480,000 bits per inch. The tracks range in size from 160 KB to 340 KB. More details about the specifications of this family of disks can be found in [31]. To understand the effects of the evolution of disk technology, in Section 5 we scale these disk characteristics according to technology trends that we derive by analyzing the specifications of disks introduced in the last ten years.
For workloads with multiple disk volumes, we concatenate the volumes to create a single address space. In the base configurations, each workload is fitted to the smallest disk from the IBM Ultrastar 73LZX family that is bigger than the total volume size, resulting in an average storage-used-to-disk-capacity ratio of about 55%. We leave a headroom of 20% because the results presented here are part of a larger study that examines the results when up to 20% of the disk blocks are replicated and laid out in an area of the disk that is specially set aside [4]. When we study parallel I/O, we look at the effect of striping the data across multiple disks. Note that we have a separate read cache and write buffer so that we can adjust the size of each independently. Results in [3] show that there is not a lot of interaction between the reads and the writes.
Performance metrics
I/O performance can generally be measured at different levels in the storage hierarchy. In order to quantify the effect of a wide variety of storage optimization techniques, we measure performance from the time at which requests are issued to the storage system before they are potentially broken up by the volume manager for requests that span multiple disks. The two important metrics in I/O performance are response time and throughput. Response time includes both the time needed to service the request and the time spent waiting or queuing for service. Throughput is the maximum number of I/Os that can be handled per second by the system. Quantifying the throughput is generally difficult with trace-driven simulation because the workload, as recorded in the trace, is constant. We can try to scale or speed up the workload to determine the maximum workload the system can sustain, but this is difficult to achieve in a realistic manner.
In this paper, we estimate the throughput by considering the amount of critical resource each I/O consumes. Specifically, we consider the average amount of time the disk arm is busy per request, deeming the disk arm to be busy both when it is being moved into position to service a request and when it has to be kept in position to transfer data. We refer to this metric as the service time. Throughput can be approximated by taking the reciprocal of the average service time. It should be noted that there are opportunistic techniques, especially for reads (e.g., preemptible read ahead), that can be used to improve performance. The service time does not include the otherwise idle time that the opportunistic techniques exploit. This means that the reciprocal of the service time tends to be an optimistic estimate of the maximum throughput, especially in the case of a lightly loaded disk where opportunistic techniques are likely to have a bigger effect.
To gain insight into the workings of the different optimization techniques, we also examine the effective miss ratio of the read cache and the write buffer. The miss ratio is generally defined as the fraction of I/O requests that are not satisfied by the cache or buffer or, in other words, the fraction of requests that require physical I/O. To make our results more useful for subsequent mathematical analyses and modeling by others, we fitted our data to various functional forms through nonlinear regression, which we solved by using the Levenberg–Marquardt method [33].
4. Effect of I/O optimizations
Read caching
Caching is a general technique for improving performance by temporarily holding in a faster memory those data items that are (believed to be) likely to be used. The faster memory is called the cache. In the context of this paper, the data items are disk blocks requested from the storage system, and the faster memory refers to dynamic random access memory (DRAM). The fraction of requests satisfied by the cache is commonly called the hit ratio. The fraction of requests that have to be handled by the underlying storage system is referred to as the miss ratio. The data items can be entered into the cache when they are demand-fetched or when it is anticipated that they are likely to be referenced soon. Caching usually refers only to the former. The latter is generally called prefetching and is discussed in detail in the next section. Note that to focus on the effect of caching, we disable prefetching. This is an exception to our general approach of perturbing, at any one time, only the parameters for one technique from their default values listed in Figure 2.
Figure 4 shows the effectiveness of read caching at reducing physical reads. Unless otherwise noted, the cache block size is 4 KB. We use the least-recently-used (LRU) replacement policy, since variations of it are commonly used throughout computer systems. Notice from Figure 4(a) that the cache is not very useful for sizes up to 32 MB. This is expected because we are looking at the physical reference stream, which has been filtered by the caching going on upstream in the host system. Today, it is common even for PC systems to have more than 100 MB of main memory, much of which can be used for file caching. Yet most disks have only 2–4 MB of cache, with some offering an 8-MB option. Our results suggest that the disk drive cache is not very effective at such sizes. It serves primarily as a buffer for prefetching.
Figure 4
Note that if the cache is large enough to hold all of the blocks that will be referenced again, the performance will obviously be very good. However, we will need a huge cache, because [Figure 4(b)] the miss ratio continues to improve at cache sizes that are beyond 4% of the storage used (allocated). In practice, there is a limit to the size of the cache because of addressing and packaging limitations and cost. Today, most enterprise-class outboard storage controllers, when fully loaded, have cache sizes that are in the range of 0.05–0.2% of the storage space [34–36]. In this study, in the resource-rich environment we set the cache size aggressively to 1% of the storage used and 8 MB per disk in the resource-poor environment. The cost per GB for DRAM is currently about 50 times higher than for disk storage. This means that a cache that is 1% of the storage space and does nothing but helps mask the poor performance of the disk will cost as much as half the disk storage. Though high, this level of cost is likely to be acceptable because it is about half that incurred by sites that mirror instead of parity-protect their disks. Note also that as disks become a lot bigger and PCs have at least one disk, the amount of cache needed in the PC environment to hold 1% of the data stored may be much less than the amount of cache needed to store 1% of the disk capacity.
To establish a rule of thumb relating the read miss ratio to the size of the cache, we took the average of the five plots in Figure 4(b) and fitted various functional forms to it. As shown in the figure, a good fit is obtained with a power function of the form f(x) = a(x - b)c, where a, b, and c are constants. This relationship, based on the physical I/O stream, turns out to be functionally similar to that found at the logical level for large database systems [17]. However, at the logical level, the c value is about half of the -1 in our case. This means that the physical read miss ratio for our workloads improves faster with increase in the cache size than is the case at the logical level for large database systems. Such results suggest that caching can be effective at the physical level provided that the cache is large enough.
In Table 2, we summarize the effectiveness of read caching at improving performance. Throughout this paper, we define improvement as (original value - new value)/(original value) if a smaller value is better, and (new value - original value)/(original value) otherwise. Note that some amount of cache memory is needed as a speed-matching buffer between the disk media and the disk interface with the host. In other words, we need to configure our simulator with some small but non-zero amount of cache memory. Therefore, the improvement reported in Table 2 is relative to the performance with a small (512-KB) cache. As discussed earlier, in the resource-poor environment, caching is relatively ineffective, achieving only about 6% improvement in average read response time and about 4% in average read service time. In the resource-rich environment, the improvement ranges from about 20% for the base PC workloads to more than 50% for the merged workloads.
|
| Table 2
Performance with read caching (LRU), showing percentage of improvement over a system with almost no (512 KB) cache. |
|
|
|
|
|
| | Resource-poor environment | Resource-rich environment |
|
|
| Average read response time | Average read service time | Read miss ratio | Average read response time | Average read service time | Read miss ratio |
|
|
|
|
|
|
| (ms) | (%)i | (ms) | (%)i | | (%)i | (ms) | (%)i | (ms) | (%)i | | (%)i |
|
| P-Avg. | 6.27 | 2.46 | 4.31 | 2.11 | 0.934 | 2.12 | 5.00 | 22.9 | 3.42 | 22.3 | 0.746 | 22.0 |
| S-Avg. | 5.34 | 9.01 | 3.88 | 8.34 | 0.864 | 8.93 | 3.54 | 38.8 | 2.72 | 35.7 | 0.623 | 34.2 |
| Ps-Avg. | 6.96 | 2.33 | 4.34 | 2.09 | 0.934 | 2.13 | 5.73 | 20.6 | 3.49 | 21.5 | 0.746 | 22.0 |
| Pm | 6.04 | 2.26 | 4.18 | 1.83 | 0.935 | 1.81 | 3.15 | 49.0 | 2.27 | 46.6 | 0.521 | 45.2 |
| Sm | 5.69 | 6.36 | 4.10 | 6.34 | 0.891 | 7.27 | 2.69 | 55.7 | 1.99 | 54.5 | 0.463 | 51.8 |
|
i
Improvement over 512-KB cache (buffer) [(original value - new value)/(original value)].
Resource-poor environment: 8 MB per disk, least-recently-used (LRU) replacement.
Resource-rich environment: 1% of storage used, LRU replacement.
|
|
Note that these numbers are for a cache block size of 4 KB. For historical reasons, the sector, or smallest addressable unit in most storage systems today, is 512 B. Managing the cache at such a small granularity is very inefficient because of the large data structures needed to manage them and because most I/O transfers are much larger than 512 B. To reduce the management overhead, a larger cache block can be used together with valid bits to indicate whether each sector within the block is present in the cache. This is similar to the sector cache approach in processor cache. In [5], we evaluate the impact of using a large cache block on the effectiveness of the cache and find that a cache block size of 4 KB is reasonable for our workloads. We use this block size for the rest of this paper. Note that the cache block size is the unit of cache management. It is independent of the fetch or transfer size, which we analyze in the following section.
Prefetching
Prefetching is the technique of predicting blocks that are likely to be used in the future and fetching them before they are actually needed. The overall effectiveness of prefetching at improving performance hinges on 1) the accuracy of the prediction, 2) the amount of extra resource (memory use, disk and data-path busy time, etc.) that is consumed by the prefetch, and 3) the timeliness of the prefetch, i.e., whether the prefetch is completed before the blocks are actually needed.
The prediction is usually based on past access patterns [9, 17], although in certain situations, system-generated plans [37, 38], user-disclosed hints [19], and guidance from speculative execution [39] may be available to help with the prediction. In general, the prediction is not perfect, so that prefetching consumes more resource than demand fetching. Specifically, it congests the I/O system and may pollute memory with unused pages. Memory pollution is the loading of pages that are not referenced and the displacement of pages that will be referenced. For many storage devices, particularly disk drives, however, a large sequential access is much more efficient than multiple small random accesses. For such devices, prefetching of sequential pages has the potential to increase I/O efficiency by transforming several small block I/Os into one large block I/O that can be more efficiently handled by the I/O device. Moreover, most workloads exhibit sequentiality in their I/O access patterns, so that sequential prefetch, especially if performed on a cache miss, scores well on all three criteria (prediction accuracy, cost, and timeliness) listed above. Therefore, practically all storage systems today implement some form of sequential prefetch on cache miss. We focus on such a prefetch in this paper. By default, we assume that data is prefetched into the cache and managed as if it were demand-fetched. The prefetched data could instead be placed in a separate buffer or be handled in the cache differently than demand-fetched data (e.g., be evicted earlier). The interested reader is referred to [17] for an evaluation of such alternatives.
Several researchers have also proposed schemes for automatically matching up access patterns with previously observed contexts, and then prefetching according to the previously recorded reference patterns (e.g., [8]). Such prefetching schemes should score well in the accuracy criteria, but because they incur additional random I/Os (which are slow and inefficient) to perform the prefetch, they may not do as well in the cost and timeliness criteria. We look at an alternative to context-based prefetch in [4].
Large fetch unit
Sequential prefetch can be achieved relatively easily by using a large fetch unit, or transfer size. For example, if the fetch unit is 64 sectors or blocks, a read request for blocks 60–68 causes blocks 0–127 to be fetched. Thus, a large fetch unit, effectively a large block size, generally prefetches blocks both preceding and following the target blocks. Because the preceding blocks are fetched before the target blocks to avoid an extra disk revolution, there is a response-time penalty for having a large fetch unit. Furthermore, the entire transfer must be complete before an I/O interrupt is received, although in an alternate design the fetch could be broken into one that terminated at the target blocks while a second one obtained the remaining blocks.
In Figure 5(a), we plot the effect of having a large fetch unit on the read miss ratio and the average read response time for the resource-rich environment. The corresponding plots for the resource-poor environment are similar and can be found in [5]. Observe that a large fetch unit significantly reduces the read miss ratio, with most of the effect occurring at fetch units that are smaller than about 64 KB. As the fetch unit is increased beyond 64 KB, the average read response time starts to rise because the penalty of having to fetch the blocks preceding and following the target blocks inline begins to outweigh the benefit of the relatively small marginal improvement in read miss ratio. Previously, a one-track fetch unit was recommended [6], but since then physical track sizes have grown from the 10-KB range to about 512 KB today. However, the ability of workloads to use larger fetch units effectively has not kept pace. For all of our workloads, a relatively small fetch unit of 64 KB, or 1/8 of a track, works well.
Figure 5
Read ahead
In read ahead, after the system has fetched the blocks needed to satisfy a read request, it continues to read the blocks following; i.e., it reads ahead of the current request, hence its name. We consider the read request to be completed once all of the requested blocks have been fetched. This typically means that two start I/Os are issued—one for the requested blocks and another to read ahead and prefetch data. In Figure 5(b), we explore the performance effect of reading ahead by various amounts. Observe from the figure that a read ahead of 32 KB performs well for all of our workloads. Beyond 32 KB, the read response time begins to rise slightly for some of the workloads because the read ahead is holding up subsequent demand requests, and the marginal improvement in read-miss ratio at such large read-ahead amounts is not enough to overcome the effect of this delay. Later in this section, we look at preempting the read ahead whenever a demand request arrives.
In Table 3, we summarize the effectiveness of the different prefetching schemes at improving performance over that of a non-prefetching system. Observe that a large fetch unit tends to reduce the read miss ratio more than read ahead does. It also has a slight advantage in read service time for the PC workloads, because the PC workloads tend to exhibit spatial locality and not just sequentiality. In other words, not just the blocks that are following, but those that are near blocks that have been recently referenced, are likely to be accessed in the near future. Thus, a large fetch unit, by causing the blocks around the requested data to be prefetched, can achieve a higher hit ratio. However, because the use of a large fetch unit results in fetching the surrounding blocks before returning from servicing a request, it performs worse than read ahead in terms of response time, especially for the server workloads.
|
| Table 3
Performance improvement with prefetching, showing percentage of improvement over a system that does not prefetch. |
|
|
|
|
|
| | Average read response time | Average read service time | Read miss ratio |
| LFU i | RA i | CSP i |
LFU i | RA i | CSP i |
LFU i | RA i | CSP i |
| (ms) | (%)ii
| (ms) | (%)ii
| (ms) | (%)ii
| (ms) | (%)ii
| (ms) | (%)ii
| (ms) | (%)ii
| | (%)ii
| | (%)ii
| | (%)ii
|
| Resource-poor environment |
| P-Avg. | 4.29 | 32.1 | 4.25 | 32.4 | 4.14 | 34.3 | 2.74 | 36.2 | 2.99 | 30.4 | 2.93 | 31.9 | 0.484 | 48.2 | 0.587 | 37.2 | 0.518 | 44.6 |
| S-Avg. | 3.29 | 39.3 | 3.03 | 44.5 | 2.75 | 49.6 | 2.27 | 41.3 | 2.26 | 41.9 | 2.08 | 46.5 | 0.393 | 54.4 | 0.427 | 50.6 | 0.383 | 55.6 |
| Ps-Avg. | 4.92 | 29.8 | 4.64 | 33.4 | 4.65 | 33.4 | 2.78 | 35.8 | 2.84 | 34.4 | 2.81 | 35.0 | 0.484 | 48.2 | 0.587 | 37.2 | 0.518 | 44.6 |
| Pm | 4.10 | 32.1 | 3.98 | 34.1 | 3.89 | 35.6 | 2.74 | 34.4 | 2.94 | 29.5 | 2.91 | 30.3 | 0.495 | 47.1 | 0.596 | 36.2 | 0.533 | 43.0 |
| Sm | 4.23 | 25.7 | 3.79 | 33.4 | 3.61 | 36.6 | 3.13 | 23.6 | 3.07 | 25.3 | 2.93 | 28.6 | 0.505 | 43.3 | 0.551 | 38.1 | 0.523 | 41.3 |
| Resource-rich environment |
| P-Avg. | 3.33 | 33.8 | 3.43 | 31.5 | 3.33 | 33.5 | 2.12 | 37.8 | 2.40 | 29.5 | 2.35 | 30.7 | 0.375 | 49.4 | 0.473 | 36.4 | 0.415 | 44.0 |
| S-Avg. | 1.96 | 39.0 | 1.75 | 47.7 | 1.52 | 53.7 | 1.41 | 37.9 | 1.34 | 43.7 | 1.18 | 49.6 | 0.253 | 53.5 | 0.265 | 52.9 | 0.223 | 59.3 |
| Ps-Avg. | 3.96 | 31.2 | 3.81 | 33.4 | 3.82 | 33.3 | 2.16 | 37.5 | 2.28 | 34.1 | 2.26 | 34.4 | 0.375 | 49.4 | 0.472 | 36.5 | 0.414 | 44.1 |
| Pm | 1.91 | 39.3 | 2.01 | 36.3 | 1.94 | 38.3 | 1.33 | 41.2 | 1.56 | 31.3 | 1.54 | 32.3 | 0.247 | 52.6 | 0.321 | 38.5 | 0.280 | 46.3 |
| Sm | 2.27 | 15.7 | 1.87 | 30.7 | 1.72 | 36.1 | 1.72 | 13.7 | 1.53 | 23.2 | 1.41 | 29.1 | 0.285 | 38.6 | 0.284 | 38.6 | 0.260 | 44.0 |
i
LFU: Large fetch unit (64 KB); RA: Read ahead (32 KB); CSP: conditional sequential prefetch (16-KB segments for PC workloads, 8-KB segments for server workloads, prefetch trigger of 1, prefetch factor of 2). ii
Improvement over no prefetch [(original value - new value)/(original value)].
|
|
Conditional sequential prefetch
To reduce resource wastage from unnecessary prefetch, sequential prefetch can be initiated only when the access pattern is likely to be sequential. Generally, the amount of resources committed to prefetching should increase with the likelihood that the prediction is correct. For instance, previous studies [9, 17] have shown the benefit of determining the prefetch amount by conditioning on the length of the run or sequential pattern observed thus far. We refer to such schemes as conditional sequential prefetch. To condition on the run length, we need to be able to discover the sequential runs in the reference stream. This is generally difficult because of the complex interleaving of references from different processes. In this paper, we use a general sequential detection scheme patterned after that proposed in [17].
The sequential detector keeps track of references at the granularity of multiple sectors or blocks, a unit we refer to as the segment. A segment is considered to be referenced if any page within that segment is referenced. By detecting sequentiality in segment references, we can very effectively capture pseudosequential reference patterns. The sequential detector maintains an LRU-organized list of segments. Each entry in the segment directory has a sequential run counter that tracks the length of the run ending at that segment. On a read, if the corresponding segment is not already in the segment directory, we insert it. The run counter value of the new segment entry is set to 1 if the preceding segment is not in the directory and to 1 + run counter value of the preceding segment otherwise. In the latter case, we remove the entry corresponding to the preceding segment. Note that the segment directory tracks sequential patterns in the actual reference stream. It is therefore updated only when read requests are encountered, not when blocks are prefetched. On a read miss, if the run counter for the segment exceeds a threshold known as the prefetch trigger, we initiate sequential prefetch. In this paper, the prefetch amount is set to 2 × run counter value × segment size, subject to a maximum of 256 KB. The size of the segment directory governs the number of potential sequential or pseudo-sequential streams that can be tracked by the sequential detector. We use a generous 64 entries for all of our simulations.
In Figure 6, we explore the performance sensitivity to the segment size and the prefetch trigger. As we would expect, lower settings for the prefetch trigger perform better because the cost of fetching additional blocks once the disk head is properly positioned is small compared to the cost of a random I/O that might have to be performed later if the blocks are not prefetched. For all of the workloads, the best performance is obtained with a prefetch trigger of 1, meaning that prefetch is triggered on every cache miss. A segment size of 16 KB works well for the PC workloads. For the server workloads, the optimal segment size is 8 KB.
Figure 6
In a similar fashion, we can additionally prefetch preceding blocks when a backward sequential pattern is detected. To avoid having to wait a disk revolution for the preceding blocks to appear under the disk head, the preceding blocks are fetched before the requested blocks. Except for a slight performance improvement in some of the PC workloads, such backward conditional sequential prefetch turns out not to be very useful [5].
In Table 3, we compare the performance of conditional sequential prefetch with that of large fetch unit and read ahead. The three schemes achieve roughly the same average read response time for the PC workloads, reducing it by more than 30%. For the server workloads, conditional sequential prefetch is clearly superior, improving the average read response time by 36–54%. As for read service time, the PC workloads are improved by 30–40%, with large fetch unit having an edge. For the server workloads, conditional sequential prefetch again reigns supreme, with improvement of 29–50%. In the resource-poor environment, about 40–60% of the reads remain after caching and prefetching. In the resource-rich environment, about 25–45% remain.
Opportunistic prefetch
Another way to reduce the potential negative impact of prefetch is to perform the prefetch using only resources that would otherwise be idle or wasted. We refer to such an approach as opportunistic prefetch. In general, opportunistic prefetch can best be performed close to the physical device, where detailed information is available about the critical physical resources. Because a disk access costs much more than a semiconductor memory access, the cost of accessing prefetched data should be largely independent of the layer in the storage stack into which the data is prefetched. However, data prefetched into the disk drive cache tends to be evicted sooner, sometimes even before it is used, because the disk drive cache is typically smaller than the adapter and outboard controller cache. To model this effect, we enter opportunistically prefetched data into an 8-MB (LRU) prefetch buffer instead of the large cache in the resource-rich environment. The prefetch buffer turns out to significantly reduce pollution of the large cache.
The simplest form of opportunistic prefetch is to read ahead up to a maximum amount or until a demand request arrives, at which point the read ahead is terminated. This is known as preemptible read ahead. Read ahead by the disk is usually preemptible. At the adapter/controller level, it is generally difficult to preempt a request that has been issued to the disk, but the request can be broken up into smaller subrequests to enable preemption between them (e.g., [40]). By terminating the read ahead as soon as another demand request arrives, preemptible read ahead avoids holding up subsequent requests. Thus, its performance does not degrade as the maximum read-ahead amount is increased (Figure 7). However, preemptible read ahead tends not to perform as well as non-preemptible read ahead, especially for the speeded-up workloads, because it may be preempted before it can perform any effective prefetch. Such results suggest a hybrid approach of performing preemptible read ahead in addition to the non-opportunistic prefetching schemes discussed above. In such an approach, we would always perform some amount of prefetch (non-opportunistic) and, if idle resources were available, we would prefetch more (opportunistic). We find that with the hybrid approach, an opportunistic prefetch limit of 128 KB works well in almost all of the cases [5]. This is the value that we assume for the rest of the paper. An opportunistic prefetch limit of 128 KB means that blocks are opportunistically prefetched only until a total of 128 KB of data has been prefetched.
Figure 7
Table 4 summarizes the impact of performing preemptible read ahead in addition to the various non-opportunistic prefetching schemes in the resource-rich environment. The corresponding table for the resource-poor environment can be found in [5]. We find that in the resource-poor environment, preemptible read ahead improves average read response time by about 5% for large fetch unit and read ahead. The improvement is smaller for conditional sequential prefetch because conditional sequential prefetch already uses resources carefully by determining the amount to prefetch on the basis of the potential usefulness of the prefetch. In the resource-rich environment, preemptible read ahead has a bigger effect, especially for the server workloads, which are improved by about 15–20%.
|
| Table 4
Additional effect of opportunistic prefetch in a resource-rich environment, showing percentage of improvement over a system that performs only non-opportunistic prefetch. |
|
|
|
|
|
| | Average read response time | Average read service time | Read miss ratio |
| LFU i | RA i | CSP i |
LFU i | RA i | CSP i |
LFU i | RA i | CSP i |
| (ms) | (%)ii
| (ms) | (%)ii
| (ms) | (%)ii
| (ms) | (%)ii
| (ms) | (%)ii
| (ms) | (%)ii
| | (%)ii
| | (%)ii
| | (%)ii
|
| Preemptible read ahead |
| P-Avg. | 2.97 | 11.9 | 3.13 | 10.2 | 3.15 | 6.84 | 1.81 | 14.9 | 2.10 | 13.1 | 2.15 | 9.25 | 0.336 | 10.8 | 0.411 | 13.7 | 0.381 | 8.95 |
| S-Avg. | 1.71 | 15.2 | 1.45 | 18.8 | 1.32 | 13.5 | 1.20 | 17.7 | 1.04 | 22.8 | 0.98 | 16.4 | 0.212 | 19.1 | 0.204 | 23.7 | 0.186 | 16.1 |
| Ps-Avg. | 3.76 | 5.40 | 3.65 | 4.99 | 3.69 | 4.49 | 1.99 | 7.79 | 2.13 | 7.24 | 2.13 | 6.43 | 0.371 | 1.15 | 0.437 | 8.07 | 0.391 | 6.28 |
| Pm | 1.67 | 12.3 | 1.96 | 2.46 | 1.98 | -2.15 | 1.14 | 14.6 | 1.45 | 6.99 | 1.50 | 2.25 | 0.223 | 9.90 | 0.296 | 7.75 | 0.275 | 1.60 |
| Sm | 2.05 | 9.8 | 1.44 | 22.8 | 1.38 | 19.6 | 1.56 | 9.47 | 1.14 | 25.4 | 1.10 | 22.1 | 0.257 | 9.79 | 0.212 | 25.3 | 0.205 | 21.2 |
| +Read-any-free-blocksiii |
| P-Avg. | 2.76 | 18.3 | 2.57 | 26.7 | 2.66 | 22.2 | 1.68 | 20.8 | 1.71 | 29.6 | 1.79 | 25.0 | 0.310 | 17.8 | 0.332 | 30.5 | 0.313 | 25.6 |
| S-Avg. | 1.65 | 18.7 | 1.32 | 27.4 | 1.20 | 22.6 | 1.16 | 20.8 | 0.947 | 31.3 | 0.886 | 25.7 | 0.204 | 22.6 | 0.182 | 32.8 | 0.167 | 26.0 |
| Ps-Avg. | 3.47 | 13.1 | 3.03 | 22.1 | 3.15 | 19.3 | 1.81 | 16.3 | 1.69 | 26.7 | 1.75 | 23.8 | 0.336 | 10.7 | 0.348 | 27.0 | 0.319 | 24.0 |
| Pm | 1.57 | 17.8 | 1.59 | 20.9 | 1.65 | 15.0 | 1.07 | 19.9 | 1.17 | 25.3 | 1.24 | 19.4 | 0.207 | 16.0 | 0.238 | 25.8 | 0.226 | 19.4 |
| Sm | 1.99 | 12.4 | 1.44 | 22.8 | 1.38 | 19.9 | 1.52 | 11.8 | 1.15 | 24.7 | 1.10 | 22.2 | 0.249 | 12.6 | 0.212 | 25.3 | 0.204 | 21.3 |
| +Just-in-time seekiv |
| P-Avg. | 2.71 | 19.9 | 2.66 | 24.3 | 2.80 | 18.0 | 1.50 | 29.7 | 1.52 | 37.4 | 1.65 | 30.7 | 0.287 | 24.1 | 0.318 | 33.6 | 0.308 | 26.8 |
| S-Avg. | 1.63 | 20.1 | 1.29 | 29.5 | 1.21 | 22.1 | 1.04 | 29.7 | 0.800 | 42.6 | 0.769 | 36.0 | 0.198 | 24.7 | 0.171 | 37.5 | 0.162 | 28.5 |
| Ps-Avg. | 3.38 | 15.5 | 3.20 | 17.8 | 3.37 | 13.5 | 1.46 | 32.9 | 1.49 | 35.4 | 1.64 | 28.5 | 0.303 | 19.9 | 0.335 | 30.0 | 0.317 | 24.5 |
| Pm | 1.54 | 19.3 | 1.65 | 17.8 | 1.76 | 9.31 | 0.927 | 30.5 | 1.01 | 35.1 | 1.13 | 26.4 | 0.190 | 22.9 | 0.229 | 28.7 | 0.223 | 20.1 |
| Sm | 1.97 | 13.0 | 1.42 | 24.1 | 1.33 | 22.4 | 1.38 | 20.0 | 1.01 | 34.4 | 0.94 | 33.3 | 0.242 | 14.9 | 0.204 | 28.1 | 0.194 | 25.1 |
i
LFU: Large fetch unit (64 KB); RA: Read ahead (32 KB); CSP: Conditional sequential prefetch (16-KB segments for PC workloads, 8-KB segments for server workloads, prefetch trigger of 1, prefetch factor of 2). ii
Improvement over non-opportunistic prefetch [(original value - new value)/(original value)]. iii
Preemptible read ahead + read-any-free-blocks. iv
Preemptible read ahead + read-any-free-blocks + just-in-time seek.
|
|
Another opportunistic prefetching technique is to start reading once the disk head is positioned over the correct track. Such a scheme is known as read-any-free-blocks, or zero latency read. Basically, it uses the rotational delay to perform some prefetching free of cost. Such a scheme may prefetch some blocks that precede the requested data and/or some blocks that come after, depending on the time at which the head is properly positioned. For example, if the head is positioned to read just after the requested data has rotated under, read-any- free-blocks fetches the succeeding blocks until the end of the track and then continues reading the blocks at the beginning of the track. As shown in Table 4, read-any- free-blocks is quite effective at improving performance. Our results indicate that in the resource-poor environment, read-any-free-blocks with preemptible read ahead is able to reduce the average read response time with read ahead by about 20% for the PC workloads and by more than 10% for the server workloads. In the resource-rich environment, the additional improvement is more than 20% for all of the workloads. Again, conditional sequential prefetch is improved less because it performs large prefetches only when they are warranted. As for large fetch unit, it is improved the least by read-any-free-blocks because it already prefetches some of the preceding blocks.
The dual technique of read-any-free-blocks is just-in-time seek, or delayed preemption [41]. The idea here is that when a request arrives while the disk is performing preemptible read ahead, the disk should continue with the read ahead and move the head to service the incoming request only in time for the head to be positioned over the correct track before the requested data rotates under. Basically, this allows the disk to prefetch more of the succeeding blocks. As shown in Table 4, the additional use of just-in-time seek improves performance for large fetch unit slightly over performing only read-any-free- blocks and preemptible read ahead. For read ahead and conditional sequential prefetch, just-in-time seek offers a marginal performance improvement in addition to that from read-any-free-blocks and preemptible read ahead for the server workloads, but loses out for the PC workloads.
During the rotational delay, the disk can also be used to perform I/Os that are tagged as lower-priority. This technique is called freeblock scheduling [42] and is meant to allow tasks such as disk scrubbing and data mining to be performed in the background with no impact on the foreground work. For instance, if the next block to be read is halfway around the track, the disk head could be positioned to service background requests “free of cost” as long as it could be moved back in time to read the block as it rotates under the head. However, given that read- any-free-blocks and just-in-time seek are effective at improving performance, such background I/Os may not be totally free for our workloads.
In general, in both the resource-poor and resource-rich environments, the best performance is obtained for the PC workloads when preemptible read ahead and read-any- free-blocks are performed in addition to simple read ahead. Specifically, this means starting to read once the disk head is positioned over the correct track, and reading beyond the requested data by 32 KB and, if there are no incoming requests, by up to 128 KB. The average read response time in this case is improved by almost 50% over that of a system that does not prefetch (Table 5). For the server workloads, performance improvement of 42–54% in the resource-poor environment and up to 65% in the resource-rich environment is achieved when conditional sequential prefetch is supplemented by preemptible read ahead and read-any-free-blocks.
|
| Table 5
Overall effect of performing preemptible read ahead and read-any-free-blocks in addition to non-opportunistic prefetch, showing percentage of improvement over a system that does not prefetch. |
|
|
|
|
|
| | Resource-poor environment | Resource-rich environment |
| Average read response time | Average read service time | Read miss ratio | Average read response time | Average read service time | Read miss ratio |
| LFU i | RA i | CSP i |
LFU i | RA i | CSP i |
LFU i | RA i | CSP i |
LFU i | RA i | CSP i |
LFU i | RA i | CSP i |
LFU i | RA i | CSP i |
| P-Avg. | 39.5 | 47.3 | 46.1 | 44.9 | 48.3 | 46.3 | 53.7 | 53.9 | 56.8 | 45.7 | 49.7 | 48.2 | 50.6 | 50.4 | 48.1 | 58.3 | 55.8 | 58.4 |
| S-Avg. | 44.3 | 51.7 | 54.2 | 46.8 | 51.1 | 52.9 | 59.4 | 59.8 | 61.7 | 48.7 | 61.8 | 64.6 | 47.9 | 61.0 | 63.2 | 61.7 | 68.0 | 70.2 |
| Ps-Avg. | 35.3 | 45.3 | 43.5 | 42.6 | 49.5 | 48.0 | 50.8 | 51.8 | 55.9 | 40.0 | 48.0 | 46.1 | 47.5 | 51.7 | 50.0 | 54.8 | 53.6 | 57.5 |
| Pm | 38.8 | 48.0 | 46.8 | 42.1 | 46.7 | 44.5 | 51.7 | 52.1 | 54.8 | 50.1 | 49.6 | 47.6 | 52.9 | 48.6 | 45.5 | 60.2 | 54.4 | 56.7 |
| Sm | 30.6 | 40.8 | 42.1 | 28.7 | 34.8 | 36.2 | 47.5 | 47.5 | 48.4 | 26.2 | 46.5 | 48.8 | 23.9 | 42.2 | 44.8 | 46.3 | 54.2 | 55.9 |
i
LFU: Large fetch unit (64 KB); RA: Read ahead (32 KB); CSP: Conditional sequential prefetch (16-KB segments for PC workloads, 8-KB segments for server workloads, prefetch trigger of 1, prefetch factor of 2).
|
|
Sensitivity to cache size
In Figure 8, we analyze performance sensitivity to cache size when data is prefetched into the cache and managed as if it were demand-fetched. We use the default parameters shown in Figure 2, meaning that in the resource-poor environment [Figure 8(a)], we read ahead by at least 32 KB on every cache miss and up to 128 KB if there are no incoming requests. We also perform read-any- free-blocks. In the resource-rich environment [Figure 8(b)], we perform conditional sequential prefetch, together with preemptible read ahead and read-any-free-blocks.
Figure 8
Observe that with prefetching, more than 50% of the reads can be satisfied by a 4-MB cache. Increasing the cache size beyond 4 MB to 32 MB achieves only diminishing returns. Such results suggest that disk drive caches in the megabyte range are sufficient. On the other hand, for very large caches, the miss ratio continues to improve as the cache size is increased beyond 4% of the storage used. As discussed earlier, most enterprise-class outboard storage controllers today, when fully loaded, have a front-end to back-end (cache size to storage space) ratio between 0.05% and 0.2% [34–36]. Our results suggest that increasing the cache size for these systems is likely to continue to improve performance, although whether such large cache sizes are cost-effective is another issue. Note also that the desirable amount of cache may not scale linearly with the size of the system.
Write buffering
The term write buffering refers to the technique of temporarily holding written data in fast memory (typically semiconductor) before destaging the data to permanent storage. A write operation can be reported as completed once its data has been accepted into the buffer. Because writes tend to come in bursts [3], the write buffer helps to better regulate the flow of data to permanent storage. To prevent any loss of data if the system fails before the buffered data is written to permanent storage, the write buffer is typically implemented with some form of non-volatile storage (NVS). In some environments, (e.g., UNIX file system, PC disks), a less expensive approach of periodically (usually every 30 seconds) flushing the buffer contents to disk is considered sufficient. By delaying the time at which the written data is destaged to permanent storage, write buffering makes it possible to combine multiple writes to the same location into a single physical write, thereby reducing the number of physical writes that have to be performed by the system. It may also increase the efficiency of writes by allowing multiple consecutive writes to be merged into a single big-block I/O. In addition, more sophisticated techniques can be used to schedule the writes to take advantage of the characteristics and the state of the storage devices.
In short, the write buffer achieves three main effects. First, it hides the latency of writes by deferring them to some later time. Second, it reduces the number of physical writes, and third, it enables the remaining physical writes to be performed efficiently. In this paper, we evaluate write buffering using a general framework that is flexible enough for us to examine the three effects of write buffering separately. In this framework, a background destage process is initiated whenever the fraction of “dirty” (or modified) blocks in the write buffer exceeds a high-limit threshold, highMark, and is suspended once the fraction of dirty blocks in the buffer drops below a low-limit threshold, lowMark. By appropriately setting highMark, we can ensure that buffer space is available to absorb the incoming writes. To avoid impact on the read response time, destage requests are not serviced unless there are no pending read requests or the write buffer is full. In the latter case, destage requests are serviced at the same priority as the reads. Analysis in [3] shows that the I/O workload is bursty, which implies that the storage system has idle periods during which the destage requests can be handled.
To reduce the number of physical writes, we use the least-recently-written (LRW) policy to decide which blocks to destage [17]. The LRW policy is similar to the LRU policy for read caching and is so named because it selects the block that was least recently written for destage. To examine the effect of limiting the age of dirty data in the buffer, we also destage a block when its age exceeds the maximum allowed. Destage policies have been studied in some detail recently, but the focus has been on selecting blocks to destage on the basis of the efficiency with which buffer space can be reclaimed. For instance, in [10] the track with the most dirty blocks is selected for destage. In [11] the blocks that can be written most quickly are selected. But a destage policy that strives to quickly reclaim buffer space may not be effective if the blocks that are destaged will be dirtied again in the near future. Moreover, with the layered approach of building systems, estimates of the cost of destaging operations may not be available to the destage process. For example, the adapter or outboard controller housing the write buffer typically has no accurate knowledge of the state and geometry of the underlying disks.
The approach we take is to first focus on reducing the number of physical writes by destaging blocks that are less likely to be rewritten and to then perform the remaining writes efficiently. To achieve the latter, whenever a destage request is issued, we include in the same request contiguous blocks that are also dirty. The resulting disk write may span tracks, but it is a large sequential write which can be efficiently handled by the disk. Also, we allow as many outstanding destage requests (contiguous blocks) as the maximum queue depth seen by the host, and once the destage process is initiated, |