|  |
 |
Table of contents:
|  | HTML |  | PDF |
This article:
|  |
HTML
|  | PDF | DOI: 10.1147/rd.502.0199 | Copyright info |  |
 |
 |
Reliability of modular mesh-connected intelligent storage brick systems
|  |  |
by C. Fleiner, R. B. Garner, J. L. Hafner, KK Rao, D. R. Kenchammana-Hosekote, W. W. Wilcke, and J. S. Glider |
|
|  |
 |  |  |
|
| |
|
The Intelligent Bricks project investigates storage systems based on a modular brick architecture with the objectives of simplifying system management, providing a large scaling range, and creating a reliable system from commodity components. Storage servers built with a single type of module, or brick, are attractive in terms of simplicity, scalability, and cost. Bricks include processing, memory, networking, and storage sufficient to run a distributed software system that delivers higher data reliability than that offered by the underlying hardware.
A key property of an intelligent storage brick (ISB) system is its fail-in-place or deferred-maintenance architecture: By over-provisioning or adding additional bricks while operating, hardware maintenance can be delayed for several years—possibly for the entire lifetime of the system. The distributed system software is responsible for automatically invoking spare disks or bricks as components fail. The only maintenance task users are expected to perform is to physically add bricks to meet growing capacity requirements.
This paper presents quantitative insights into the operating characteristics of mesh-connected ISB systems, in which bricks communicate only with physically adjacent bricks. We characterize such systems by the fraction of unusable bricks due to degrading internal bandwidth and external host connectivity, and a reliability expression that approximates the length of time that ISB systems can operate without replacement of failed bricks. Our goal is that ISB systems provide nearly 100% data availability, no ongoing hardware maintenance actions for several years, and a very low probability of data loss due to multiple failures. This paper is a companion to [1], which presents the overall ISB system and an operational 3 × 3 × 3-brick prototype.
| |
|
The approach of distributing data across independent machines to build scalable storage systems has been explored in DataMesh [2], FAB [3], Self-* [4], Petal [5], and OceanStore [6]. Several companies are shipping products based on distributed data redundancy, including Panasas [7], Pivot3, LeftHand Networks, and Isilon. However, none of these approaches focus on fail-in-place or deferred maintenance. The Panasas system, while implementing a distributed RAID5 scheme, is oriented more toward delivering high performance. DataMesh [2] was a two-dimensional mesh-connected storage server that most closely resembled our ISB system and introduced concepts of distributed redundancy, fault isolation, and recovery. The more recent FAB project [3] proposed to build a brick storage system from commodity parts. The Self-* project [4] has a focus on simplifying administration by using a brick storage system, including mechanisms to schedule resources, classify files, and manage replicas.
An initial analysis of overprovisioning for capacity and bandwidth in an ISB system was first described by Kirkpatrick et al. [8]. They conservatively defined usable bricks as those that were connected to at least two or three other bricks. Our approach places data on bricks that may have only a single remaining connection to other bricks while avoiding possible set partitioning due to brick failures.
| |
|
A pristine cube (i.e., an initial cube with no failed bricks) contains N bricks arranged as a two-dimensional (2D) (h × h) or three-dimensional (3D) (h × h × h) nearest-neighbor network mesh. Each brick contributes storage, network bandwidth, memory, and processing resources. The bricks run system software that manages the storage data and implements a distributed RAID (dRAID) scheme, in which storage data is copied or encoded in multiple chunks and each placed on a distinct brick. As bricks progressively fail, a pristine cube slowly declines in performance and capacity. In this section we establish operating ranges of usable bricks in 2D and 3D mesh-connected ISB systems.
For purposes of our analysis, we make the following assumptions:
-
All bricks in the system are identical and contain sufficient processing, memory, networking, and data storage. Bricks communicate with adjacent (neighbor) bricks in a 2D or 3D network mesh topology.
-
A given brick is either completely functional (live) or completely inoperative (failed). When a brick fails, it reduces system network bandwidth as links between it and neighbors are lost, creating “holes” in the mesh.
-
Storage data is redundantly distributed across multiple bricks to ensure a high probability of restoring data after a failure of disks or bricks. When there is a failure, redundant data is rebuilt by the operating software over all surviving storage bricks. We assume that data is randomly distributed across all storage bricks, parameterized by the number of bricks k over which the redundant data chunks are distributed (its set-k, ranging from set-2 for simple mirroring to set-14 for space-efficient or higher-fault-tolerant codes). For example, a traditional 6-data and 1-parity RAID5 scheme would be set-7. Although details of distributed redundancy schemes are not discussed here, see [9] for implementation approaches.
-
The system is overprovisioned with bricks when it is assembled. In realistic deployments, a user could add bricks over time to compensate for brick losses, undoubtedly with improved cost and capacity attributes. Nevertheless, to simplify the analysis, we assume that bricks are not added over time.
Although we analyze symmetrical 3D systems in this paper, our approach can also be applied to systems with nonsquare, rectangular cross sections. This is relevant for actual implementations, because the system height may be limited by floor loading or other structural considerations.
| |
|
For our analysis, we performed Monte Carlo simulations of mesh-connected cubes with randomly selected failing bricks. To find the number of usable bricks in a degrading cube, we assumed that redundant data is placed via a distributed-redundancy, set-k scheme in equal-sized chunks across k bricks. A brick may be live but unusable because of the set-k placement constraints described next.
For each run, the simulator started with an initial, pristine cube and progressively failed one brick per iteration step. For each step, the placement algorithm first found the largest network of connected bricks with at least two surface bricks and then constructed as many set-ks as possible. An additional placement constraint was that the potential failure of any brick would leave at least k − 1 usable bricks for all of the constructed set-ks (i.e., a set-k cannot be partitioned if any single brick fails). This estimation process yielded a lower bound on the number of usable bricks for a given set-k placement [10]. We also assumed that chunks in 3D cubes are not placed together in vertical columns of bricks that can fail due to shared power and cooling in the column. (We found that this had little effect on the results.) We ran at least 300 different simulation runs for each pristine cube size.
Figures 1(a) and 1(b) respectively show the average and standard deviation results for degrading 6 × 6 × 6 = 216 and 15 × 15 = 225 brick systems. We also ran simulations for larger and smaller cubes and found that the results presented here generally apply to 3D cubes down to 64 bricks and 2D systems down to 16 bricks (with set-5 placement).
Figure 1
| |
|
Figure 1(a) plots the number of usable bricks in a degrading 3D system of 6 × 6 × 6 = 216 bricks with distributed redundant data. It plots the percentage of live bricks along the x-axis and the percentage of usable bricks along the y-axis. A pristine cube corresponds to the point in the upper right-hand corner. As a cube degrades, the number of live bricks decreases, as illustrated from right to left in the figure. Note that one should track the lower end of the standard deviation error bars, not the statistical mean.
The data shows that as live bricks decline to 60%, even the large set-14 distributed data placement is able to use nearly all live bricks. However, on losing more than about 50% of the bricks, there exists a noticeable fraction (>10%) of live but unusable bricks. As can be seen in the “bad” operating range, because of the increasing variability in the standard deviation of usable bricks, we suggest that a 3D cube not operate below approximately 60% live bricks.
| |
|
Figure 1(b) plots the number of usable bricks in a degrading 2D system of 15 × 15 = 225 bricks with distributed redundant data. These 2D plots show that once more than 15% of the bricks fail, the number of live but unusable bricks rises rapidly. Note that these 2D results are not applicable to smaller 2D systems (for <16 bricks with smaller set-k placements). We do not examine 2D systems further in this paper.
| |
|
In the previous section we presented simulation results for the fraction of usable bricks in a degrading cube. In this section, we look at the performance of the interconnection network as bricks fail in terms of total and average random bandwidth (for 3D systems only). These results were derived from the Monte Carlo simulations described above and are combined in Figure 2, where the x-axis shows the percentage of usable bricks—not the percentage of live bricks, as in Figures 1(a) and 1(b)—and the y-axis shows the percentage of change in the metric as bricks fail progressively from right to left in the figure.
Figure 2
| |
|
Network distance, the number of hops required to route a packet between two bricks via the shortest path, is a key metric, reflecting brick-to-brick latency and internal bandwidth. For a pristine, mesh-connected cube of edge length h, the average network distance (i.e., the average of the distance between two bricks taken over all source and destination bricks in a cube) is given by dpristine = h − (1/h), derived under the assumption that all source bricks communicate with all destination bricks via the shortest paths. As bricks fail and the cube degrades, the simulated brick configurations are evaluated at each step to determine the actual average network distance of the cube.
The average distance plot (Figure 2) increases slowly at the beginning as bricks fail (right side) and then levels off at about 40% usable bricks (on the left side), with an average network distance about 40% greater than the pristine cube.
From this plot, we conclude that the average network distance does not appreciably increase in 3D cubes as long as about 70% of the bricks remain usable. At 60% usable bricks, the average network distance has increased by about 10%.
| |
|
The total peak internal bandwidth of a cube is given by the number of connections between all usable bricks multiplied by the bandwidth provided by an internal brick face. The internal bandwidth plot in Figure 2 decreases faster than the loss of usable bricks: At 80% of usable bricks, only about 65% of the original peak bandwidth remains, while at 60% of usable bricks, only about 40% remains.
Although total internal bandwidth degrades faster than failing bricks, the number of usable bricks also declines, so the average bandwidth degradation per brick does not decline as quickly. We plot the average internal bandwidth per brick by dividing the total peak internal bandwidth by the average network distance and by the number of usable bricks. The average bandwidth per brick plot in Figure 2 is nearly proportional to the number of usable bricks: At 80% usable bricks, about 80% of the original average bandwidth per brick remains, and at 60% of usable bricks, about 50% remains.
| |
|
The external bandwidth of a cube is given by the number of usable surface bricks multiplied by the bandwidth provided by a surface brick face. The external bandwidth plot in Figure 2 indicates that the external bandwidth of the hosts decreases proportionally as the number of usable bricks drops.
Figure 2 also shows that the number of usable surface bricks in a cube is proportional to the total number of usable bricks (labeled surface bricks/usable bricks). This ratio remains nearly constant as bricks fail.
| |
|
We now examine the reliability of host connectivity to a degrading cube. For simulations, we assumed that external hosts can access the set of remaining usable bricks in the cube via at least one surface brick. However, this implies that every host is connected to every brick on the surface of the cube, which is not practical. Instead, we need to assume that each host has multiple cube connections, each to different surface bricks. We then calculate the probability that the usable surface connections to a host will not be members of the usable bricks in the cube.
Given a cube with the number of usable surface bricks U out of the total number of surface bricks S = 6 − 12 + 8, C connections per host to the cube, and assuming usable surface bricks proportional to total usable bricks (as in the previous section), the probability that a host has no connections to the usable bricks is given by
Using the above equation, Figure 3 plots the probability of a host being unconnected to the usable bricks against the number of per-host connections with either 70% or 80% usable bricks. The graph illustrates that, for every additional two per-host connections, the risk of being unconnected to the usable bricks is reduced by a factor of approximately 10. To achieve Phost_has_connection > 0.99999 in a cube with 70% usable bricks, the number of per-host connections should be 9 or greater. Plotting the same graph for different cube sizes shows that the probability depends primarily on the number of host connections, not size.
Figure 3
From these results, we conclude that either an external network switch is needed between the surface bricks and the hosts or, alternatively, the hosts are in the cube itself (which then moves the surface connectivity problem to the clients of the hosts). Although it is possible for a user to remove a host connection from a failed brick and reconnect it to a remaining usable surface brick, we assume that such maintenance activity is undesirable.
| |
|
A key objective for ISB systems is deferred maintenance, i.e., that they can operate without replacement of failed hardware for long periods of time. Failed bricks remain in place at least until service or maintenance is performed, and possibly until end of life of the system. Our goal is a system that provides nearly 100% data availability with no ongoing maintenance actions (except possibly to add capacity). In this section, we calculate the deferred-maintenance time period for a system as a function of the fraction of brick failures and the reliability of disk and brick controller electronics. We assume that a system is fully functional and optionally overprovisioned with bricks at its start of operation.
The reliability of a system is defined as the probability that it operates properly over a time t. As illustrated in Figure 4, a storage brick contains controller electronics in series with d parallel disks that can fail independently while leaving the brick operational. The reliability of the independent, parallel-connected disks of a brick is the complement of their unreliability; i.e., it is one minus the probability that all of its disks fail over time t.
Figure 4
Furthermore, the reliability of the series-connected controller electronics of a brick and its collection of parallel disks is the product of their reliabilities [11]. Thus, the hardware reliability of a storage brick over a time period t is given by
From this equation, we observe that the parallel disks can achieve a very high hardware reliability, even with disks that have high rates of failure (e.g., Rdisks = 0.99999 for six parallel disks at 3% annual constant failure rate each over five years). Thus, to simplify the analysis below, we assume that the reliability of the storage brick is approximately equal to the reliability of the controller electronics, or Rbrick(t) ≈ Rcontroller(t). Note that Rcontroller(t) includes everything in the brick exclusive of disks.
The system hardware reliability Rsystem(t) is defined as the probability that a system will meet its mission to provide a target number of live bricks and storage capacity through the deferred-maintenance time period t. Our approach to approximate Rsystem(t) is to use the frequency interpretation of probability [12] and assume ideal, identically deployed systems with no systematic environmental influences, each with identical bricks and disks that fail independently. The hardware reliability of our uniform brick system can then be approximated by the cumulative binomial distribution [11]—the probability that M out of N independent and identical bricks survive over time period t:
To determine the maximum deferred-maintenance time duration t of our system, we first decide on a target value for Rsystem(t) and then solve the binomial probability distribution for t as a function of that target reliability, M/N fraction of live bricks, and the predicted brick reliability.
For setting a system reliability target, if our acceptable goal is that one out of D deployed systems fails to achieve a deferred-maintenance mission, we can set target Rsystem(t) = 1 − 1/D. For example, if Rsystem(t) = 0.99999, approximately one system out of 100,000 will fail its mission to remain above the target number of live bricks or storage capacity over deferred-maintenance time t.
Substituting an expected, constant, and memoryless failure rate λbrick with brick reliability Rbrick(t) = e−λbrickt into the above binomial distribution yields
Using target constant values for Rsystem(t), M, N, and a range of brick failure rates λbrick, we iteratively solve this equation for maximum values of t.
As an example, we select a system hardware reliability target of Rsystem(t) = 0.99999 for N = 6 × 6 × 6 = 216 brick systems and vary the fraction M/N of live bricks from 90% down to 60% in 10% steps, corresponding to 10% to 40% of failed bricks and optional initial overprovisioning levels of 1/(1 − M/N), or 11%, 24%, 43%, and 67%, respectively.
In Figure 5, we plot the brick failure rate λbrick on the x-axis against the maximum deferred-maintenance period t on the y-axis for these four brick failure cases. The labeled compute brick point shows that a brick failure rate of 2% in a system that expects 20% failed bricks (optionally 25% overprovisioned), would allow for deferred-maintenance system operation for nearly six years. The labeled storage brick point in Figure 5 illustrates a 2.5-year deferred-maintenance period, achievable with a combined controller, and a disk failure rate of 4.5% per year, as described next.
Figure 5
Next we look at the hardware reliability of storage capacity; that is, the probability that the system will have sufficient storage disk capacity at the end of the deferred-maintenance time period. Although all of the disks operate in parallel and can fail independently, from the perspective of the data stored on a disk, the brick controller electronics are in series with each disk. Because the reliability of series-dependent components is the product of the component reliabilities [11], the reliability of the storage capacity itself is Rstorage(t) = Rcontroller(t) × Rdisk(t). With our assumption of constant failure rate and exponential reliability for brick electronics and disks, the overall resulting storage capacity failure rate is λstorage = λcontroller + λdisk, per these equivalent equations:
| Rstorage(t) = Rcontroller(t) × Rdisk(t), |
| |
| e−λstoraget = e−λcontrollert × e−λdiskt, |
and
| λstorage = λcontroller + λdisk. |
Substituting λstorage in place of λbrick as the failure rate in the cumulative binomial equation, Figure 5 then shows the expected deferred-maintenance intervals for storage capacity. For example, if one assumes a brick controller annual failure rate of 1.5% [mean time between failures (MTBF) ≈ 580,000 hours] and a disk failure rate of 3% (MTBF ≈ 300,000 hours), the effective hardware storage failure rate is 1.5% + 3% = 4.5% per year, yielding a system deferred-maintenance duration of 2.5 years for 100,000 systems that expect up to 20% failed bricks (optionally 25% overprovisioned).
| |
|
The previous section examined the relationship between brick hardware failure rates and the deferred-maintenance duration of ISB systems before accumulated hardware failures warrant maintenance. To implement a high level of data reliability and to guard against irrecoverable data loss, a distributed data redundancy scheme is implemented by the system software across multiple bricks. In this section we present results of Markov models for deriving the probability that a system will not lose data after multiple brick or disk failures, assuming different distributed data redundancy schemes and system performance characteristics.
Standard storage redundancy schemes store either one or more copies of the original data or precomputed redundancy information, such as parity. After brick or disk failure and data erasure are detected, the system software uses a copy to rebuild the original and redundant data in spare, unused storage. Whether a particular hardware or software fault results in the loss of storage data is a function of three elements: the fault tolerance of the redundancy coding scheme; the system network and brick-internal bus bandwidths available to rebuild redundant data (which determines rebuild time); and the probability of additional hardware or software faults during the rebuild process. Note that for performance considerations, redundant parity information is seldom verified on storage read operations.
We assume that if a brick or disk fails, the distributed software will detect that erasure event, for example, when disks return error codes or are unresponsive to multiple retry commands, or when bricks do not respond after multiple out-of-network reboot attempts. Our goal is a very small probability of data loss after multiple brick or disk failures: no more than one data-loss event in five years for 100 one-petabyte systems, or, equivalently, only two data-loss events per exabyte-year.
The redundancy scheme that is the easiest to implement is to mirror or duplicate data. Although the storage write performance is highest, the data storage efficiency is low. There exist several redundancy schemes that guard against multiple failures with high storage efficiency—with the tradeoff of lower write performance. These schemes include RAID5, RAID6, and others that tolerate three or more failures [13, 14]. Redundancy schemes for higher fault tolerance can be realized through several erasure codes such as EVENODD [15], Reed–Solomon [16], low-density parity check (LDPC) [17], and WEAVER, a new constant-efficiency, high-fault-tolerant erasure code [18].
Redundancy schemes can be characterized by several parameters: number of faults tolerated, storage efficiency, and storage write performance. These factors are interrelated; optimizing for two usually results in tradeoffs against the third. From the user's perspective, these parameters translate into the reliability of the storage data, its cost, and overall application performance. Thus, the choice of a particular data redundancy scheme depends on the business cost of a data-loss event, how much the user is willing to pay for higher levels of fault tolerance, and the level of application performance required. Ideally, the operating software will allow the system administrator to make reliability tradeoffs for the user's data.
The fault tolerance and storage efficiency of a particular redundancy scheme defines the minimum number of bricks that must be usable and corresponds to the set-k metric defined earlier. For example, an 80% efficient, single-fault-tolerant RAID5 scheme requires a minimum of five usable bricks for placing data and parity chunks (set-5). A 50% efficient scheme with three data and three parity chunks requires a set-6 placement.
When a brick or disk failure is detected, the distributed operating software rebuilds the erased data across all of the surviving usable bricks. Because the large aggregate bandwidth of the entire 3D network is available for this task, the rebuild times scale well with the size of the cube, minimizing the exposure time to subsequent failures. Note that data redundancy may be implemented not only across bricks, but within bricks as well. Schemes with in-brick redundancy assume that failing disks are not replaced, so erased data in the brick is rebuilt across the remaining operational disks in a brick (assuming sufficient remaining capacity).
By constructing and analyzing Markov chain failure models of representative systems, we compared the efficacy of several dRAID schemes assuming typical brick and disk hardware failure rates. As an example, a simple loss-of-data scenario for a redundancy scheme supporting 1-fault-tolerance1 among bricks with no in-brick redundancy is illustrated in Figure 6.2 Loss of data due to multiple failures occurs when either a brick controller or a disk fails (at rates λb and λd), followed by a rebuild and repair process (at rates μb and μd), during which time a second brick controller or disk fails—or the more probable case of a hard and unrecoverable error occurring during a disk read (with probability er). The 3-fault-tolerant Markov graph is more complex, requiring 16 states, as described in [19].
Figure 6
The Markov model reliability calculations for 1-, 2-, and 3-fault-tolerant dRAID schemes for a pristine 4 × 4 × 4 cube with 12-disk bricks are presented in Figure 7, which shows that mirrored, single-fault-tolerant schemes such as RAID5 fall far short of our target, coming in at about 30,000 data-loss events per exabyte-year. Mirroring between bricks, together with single-fault-tolerant RAID5 in the bricks, is 1,000 times better, at about 30 data losses per exabyte-year.
Figure 7
At least 2-fault tolerance is required between bricks to achieve our target of two data-loss events per exabyte-year, using either of two schemes:
-
A 2-fault-tolerant dRAID scheme across bricks, combined with single-fault-tolerant RAID5 in the bricks for 0.009 data-loss events per exabyte-year (or 9 data losses per zettabyte-year); or, alternatively,
-
A 3-fault-tolerant dRAID scheme across bricks with no redundancy in the bricks for 0.001 data-loss events per exabyte-year (or one data loss per zettabyte-year).
The parameters used in these Markov model calculations include disk MTBF = 300,000 hours, brick electronics MTBF = 400,000 hours, disk read hard error rate of 10−14, 4 × 4 × 4 = 64 bricks, 12 disks per brick, 40-MB/s disk bandwidth, six 800-MB/s 3D mesh network links per brick, rebuild block size = 128 KB, and 10% bandwidth utilization for data redundancy rebuilding. These calculations, including assumptions and sensitivity analyses, are discussed in more detail elsewhere [19]. In summary, we found that the probability of a data-loss event was generally insensitive to the cube size, but sensitive to three variables: the disk failure rate in bricks without internal RAID, the brick controller electronics failure rate, and the rebuild block size.
| |
|
We quantified the effects of brick failures in modular, mesh-connected ISB systems as gauged by capacity, network performance, system reliability, and the probability that a system does not lose data after multiple disk or brick failures. Using Monte Carlo simulations, we examined the effects of loss of usable bricks for distributed data placement on network bandwidth in order to recommend a fraction of usable bricks above which systems should operate.
Our studies show that for 3D, mesh-connected cubes with failing bricks, essentially all bricks are usable for data placement as long as 60% or more of the bricks are operational. Below this level, we found an increase in the number of isolated and unusable bricks and a progressive decrease in internal bandwidth.
Although our results show that 3D systems scale well down to about 60% live bricks, to minimize optional overprovisioning and allow margin for dependent or nonconstant failure rates, we expect that ISB systems can operate down to an 80% live-brick level (corresponding to optional 25% overprovisioning). While operating in this region for the deferred-maintenance duration, we found the following:
-
Essentially all live bricks are usable by distributed data redundancy schemes up to set-14.
-
Because the internal average bandwidth per brick is approximately proportional to the number of usable bricks, the average bandwidth per brick is about 80% of the initial, pristine cube value.
-
Either multiple connections per host or an external switch is necessary between hosts and the cube surface bricks.
We also showed the following for 6 × 6 × 6 cubes down to the 80% live brick level:
-
Deployment of 100,000 systems could achieve a deferred-maintenance duration of approximately 2.5 years assuming typical hardware failure rates for disks (3% per year) and brick controller electronics (1.5% per year).
-
A 3-fault-tolerant dRAID scheme across bricks can achieve a storage reliability target of less than two data-loss events per exabyte-year caused by multiple disk or brick failures.
Assuming modest overprovisioning, we demonstrated that maintenance in 3D mesh-connected storage systems can be deferred for several years while maintaining adequate performance and achieving high levels of data reliability and availability.
*Trademark, service mark, or registered trademark of International Business Machines Corporation.
**Trademark, service mark, or registered trademark of SPARC International, Inc. in the United States, other countries, or both.
| |
| |
1The number of disks or bricks that can fail, f, is specified by the fault tolerance of the redundancy scheme. The Hamming distance of the redundancy scheme is f + 1.
2Note that in the figure a transition from state 2 to 1 (a disk failure followed by its brick controller failure before rebuild completes) is not shown, but has little effect on the results.
Received July 5, 2005; accepted for publication August 8, 2005; Published online February 22, 2006.
|
|