Estimation of Data Reduction
To date, most attention in the storage community has been devoted to devising reduction methods and optimizing their performance. However, since these methods do not come for free, knowing how much data reduction to expect from a given method is imperative.
While many empirical studies have shown what can be expected from different techniques on various workloads (for example, deduplication on virtual machines), in practice little is known about how to accurately estimate the data reduction level for a specific, given workload.
This question is becoming increasingly relevant. In our paper, Estimation of Deduplication Ratios in Large Data Sets, published at MSST 2012, we address the problem of accurately estimating the data reduction ratio achieved by deduplication and compression on a specific data set. While this is feasible for compression, it turns out to be a difficult problem for deduplication.
Based on this study, we developed Comprestimator, an IBM tool for estimating the compression ratio achieved when employing RTC (IBM's Real-time Compression) on a block device. The tool samples the volume's data and produces estimates of two central values:
- Fraction of non-zero blocks: how many of the volume's blocks contain actual data. A block is considered a zero block only if its entire content is zeros.
- Compression ratio of non-zero blocks: RTC is run on all non-zero blocks; the goal is to estimate how much these blocks would compress.
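The sampling approach behind these two estimates can be sketched as follows. This is an illustrative reconstruction, not IBM's actual implementation: the block size, the `read_block(i)` interface, and the use of zlib as a stand-in compressor are all assumptions for the example.

```python
import random
import zlib

BLOCK_SIZE = 32 * 1024  # hypothetical block size; RTC's actual unit may differ


def estimate_volume(read_block, num_blocks, sample_size=1000, seed=0):
    """Estimate (a) the fraction of non-zero blocks and (b) the
    compression ratio of the non-zero blocks, by reading a uniform
    random sample of blocks rather than the whole volume.

    `read_block(i)` is an assumed interface returning block i as bytes;
    zlib stands in for the real compressor.
    """
    rng = random.Random(seed)
    indices = rng.sample(range(num_blocks), min(sample_size, num_blocks))

    nonzero = 0
    raw_bytes = 0
    compressed_bytes = 0
    for i in indices:
        data = read_block(i)
        if not any(data):          # zero block: every byte is zero
            continue
        nonzero += 1
        raw_bytes += len(data)
        compressed_bytes += len(zlib.compress(data))

    nonzero_fraction = nonzero / len(indices)
    ratio = compressed_bytes / raw_bytes if raw_bytes else 1.0
    return nonzero_fraction, ratio
```

Because both quantities are averages of per-block values, a uniform sample suffices; this is precisely why compression estimation is tractable, in contrast to deduplication, where a block's contribution depends on every other block.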
While the tool produces estimates rather than exact values, it gives explicit guarantees on their accuracy, derived from a mathematical analysis of the sampling method. For example, the accompanying figure shows how three executions of Comprestimator converge towards the actual value, with all runs well within the guaranteed bounds.
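To give a flavor of how such guarantees arise (the paper's exact analysis may differ), a standard Hoeffding-style bound relates the sample size to the error: since each sampled block contributes a value in [0, 1] (the non-zero indicator, or a normalized per-block compression ratio), n independent samples put the estimate within ε of the true mean with probability at least 1 − δ once n ≥ ln(2/δ) / (2ε²).

```python
import math


def hoeffding_sample_size(epsilon, delta):
    """Sample size sufficient, by Hoeffding's inequality, for the mean
    of [0, 1]-bounded per-block values to be within `epsilon` of the
    true mean with probability at least 1 - delta:
        n >= ln(2 / delta) / (2 * epsilon**2)
    Note the bound is independent of the volume's total size.
    """
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))
```

For instance, a 5% error with 95% confidence needs only a few hundred sampled blocks, regardless of whether the volume holds gigabytes or petabytes.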
From IDC's Real-time Compression Advances Storage Optimization: "Organizations can avail themselves of IBM's Comprestimator tool to model expected compression benefits for specific environments. Using the results and expected growth rates, organizations could derive potential savings by deploying Storwize V7000 with compression or enabling compression on existing volumes. IBM could proactively leverage this tool to provide both customers and prospects with a view of potential savings."
In contrast to compression, estimating the deduplication ratio turns out to be a challenging task. It has been shown, both empirically and analytically, that essentially all of the data at hand must be inspected to produce an accurate estimate when deduplication is involved. Moreover, even when all the data may be inspected, devising a method that is both efficient and accurate is challenging. In our MSST 2012 paper, we presented a method that inspects all the data but requires only a minimal memory footprint compared with the naïve approach; empirically, in some special cases, only a small portion of the data is inspected. Furthermore, we establish the limitations and inherent difficulties of the problem.
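One generic way to inspect all the data while storing only a small fingerprint sample is hash-based sampling: every chunk is hashed, but only fingerprints falling in a deterministic fraction of the hash space are retained, so duplicates of a sampled chunk are always sampled as well. The sketch below illustrates this idea only; it is not the exact algorithm of the MSST 2012 paper, and the chunk interface and sampling rule are assumptions.

```python
import hashlib


def estimate_dedup_ratio(chunks, sample_rate=256):
    """Approximate the deduplication ratio (distinct chunks / total
    chunks) in a single pass with low memory.

    All data is inspected, but only fingerprints whose first hash byte
    is 0 mod `sample_rate` are kept (a deterministic 1/sample_rate
    sample of the hash space). Because sampling depends only on the
    chunk's content, every occurrence of a sampled chunk is counted,
    making distinct/occurrences within the sample an approximately
    unbiased estimator of the true ratio.
    """
    seen = {}  # sampled fingerprint -> occurrence count
    for chunk in chunks:
        fp = hashlib.sha1(chunk).digest()
        if fp[0] % sample_rate == 0:
            seen[fp] = seen.get(fp, 0) + 1

    sampled_occurrences = sum(seen.values())
    if sampled_occurrences == 0:
        return 1.0  # sample too small to say anything; assume no dedup
    return len(seen) / sampled_occurrences
```

Note that the pass over `chunks` still touches every byte, consistent with the result that accurate deduplication estimation cannot avoid reading essentially all of the data; only the memory footprint is reduced.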