With the explosion of data stored by individuals and organizations, deduplication and compression techniques are becoming ever more essential. Commonly referred to as ‘data reduction’, these techniques are studied mainly in the context of storage systems. We focus on making IBM storage solutions efficient and easy to use by contributing to the design and implementation of compression and deduplication in IBM’s storage products.

Deduplication and Compression Estimation

Our expertise lies in managing data reduction, and in particular in estimating its potential savings. Such estimates help businesses decide where and when to invest in data reduction so as to maximize the benefit from these services. Our cutting-edge estimation techniques are used by IBM storage systems and have been published in leading academic storage conferences.

We developed tools that estimate the compression and deduplication ratios of very large datasets while keeping memory usage and run time practical. The IBM Data Reduction Estimator Tool is based on our deduplication research and on prior research on compression estimation (see Comprestimator below). Our paper “Estimating Unseen Deduplication – from Theory to Practice” provides additional details.
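As a rough illustration of the quantity these tools estimate, the sketch below computes the exact (full-scan) deduplication ratio by fingerprinting fixed-size chunks; the estimation techniques in our papers approximate this value without hashing and indexing the entire dataset. The chunk size and the use of SHA-256 are illustrative assumptions, not the parameters of the IBM tools.

    # Illustrative full-scan deduplication ratio (the quantity that estimators approximate).
    # Chunk size and hash choice are assumptions for this sketch only.
    import hashlib

    CHUNK_SIZE = 8 * 1024  # fixed-size chunking; real systems may also use variable-size chunking

    def dedup_ratio(paths):
        total = 0
        distinct = set()
        for path in paths:
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(CHUNK_SIZE)
                    if not chunk:
                        break
                    total += 1
                    distinct.add(hashlib.sha256(chunk).digest())
        # Ratio of unique chunks to total chunks; 1.0 means no duplicates were found.
        return len(distinct) / total if total else 1.0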

Comprestimator

We developed Comprestimator, an IBM tool for estimating the compression ratio of a block device when employing IBM's Real-Time Compression (RTC). The tool samples the volume's data and estimates two central values (a simplified sketch of the sampling approach follows the list):

  1. Fraction of non-zero blocks: How many of the volume's blocks contain actual data. A block is considered a zero block only if all of its content is zeros.
  2. Compression ratio of non-zero blocks: since RTC will be run on all non-zero blocks, the goal is to estimate how well these blocks will compress.
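The sketch below illustrates the sampling approach: it reads randomly chosen blocks from a volume, counts all-zero blocks, and compresses the non-zero ones to estimate their compression ratio. It assumes the volume is available as a regular file or image, uses zlib as a stand-in for RTC, and picks an arbitrary block size and sample size; none of these are the actual parameters of Comprestimator.

    # Simplified sampling-based estimator in the spirit of Comprestimator.
    # Block size, sample size, and the use of zlib are assumptions for illustration.
    import os
    import random
    import zlib

    BLOCK_SIZE = 32 * 1024   # assumed block granularity
    SAMPLE_SIZE = 2000       # number of randomly sampled blocks

    def estimate(volume_path):
        # Assumes the volume is at least BLOCK_SIZE bytes and stored as a regular file/image.
        num_blocks = os.path.getsize(volume_path) // BLOCK_SIZE
        zero_blocks = 0
        original = compressed = 0
        with open(volume_path, "rb") as vol:
            for _ in range(SAMPLE_SIZE):
                index = random.randrange(num_blocks)
                vol.seek(index * BLOCK_SIZE)
                block = vol.read(BLOCK_SIZE)
                if block.count(0) == len(block):   # all-zero block
                    zero_blocks += 1
                    continue
                original += len(block)
                compressed += len(zlib.compress(block))
        zero_fraction = zero_blocks / SAMPLE_SIZE
        ratio = compressed / original if original else 1.0
        return zero_fraction, ratio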

While the tool gives estimates rather than exact values, it provides explicit guarantees on the accuracy of these estimates. The level of accuracy is derived from a mathematical analysis of the sampling method. For example, the accompanying figure shows how three executions of Comprestimator converge towards the actual value, with all runs well within the guaranteed bounds. The tool is based on technology described in the paper “To Zip or not to Zip - Effective Resource Usage for Real-Time Compression”.
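To give a feel for how such guarantees arise from sampling, the snippet below applies a standard Hoeffding-style bound to compute how many random samples suffice for a desired accuracy and confidence. This is a textbook bound shown for illustration only; it is not necessarily the exact analysis used in the paper.

    import math

    def required_samples(epsilon, delta):
        # Hoeffding bound for values in [0, 1]: n >= ln(2/delta) / (2 * epsilon^2)
        # samples guarantee the sample mean is within +/- epsilon of the true mean
        # with probability at least 1 - delta.
        return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

    # e.g. accuracy of 2% with 99% confidence:
    print(required_samples(0.02, 0.01))   # 6623 samples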

Data Compression

Our research also touches on other aspects of compression. For example, we explore improving the speed of software-based compression. We implemented and presented a fast implementation of the Deflate format that is substantially faster than the fastest mode of the Zlib software package, yet maintains full compatibility with the Deflate standard (the details are described in “A Fast Implementation of Deflate”). In addition, in “SDGen: Mimicking Datasets for Content Generation in Storage Benchmarks” we explored how to ease the evaluation of compression techniques by showing how to create representative datasets for compression comparisons.
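As a point of reference for the speed/ratio trade-off that this work improves upon, the snippet below benchmarks standard zlib at several compression levels on a dataset of your choice. The file name is a placeholder and the harness is purely illustrative; it does not reproduce the measurements from the paper.

    # Baseline speed/ratio measurement with standard zlib (the package referenced above).
    import time
    import zlib

    def benchmark(data, level):
        # Compress once at the given level and report throughput (MB/s)
        # and compression ratio (compressed size / original size).
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        return len(data) / elapsed / 1e6, len(compressed) / len(data)

    with open("sample.bin", "rb") as f:    # placeholder for any representative dataset
        data = f.read()
    for level in (1, 6, 9):                # 1 = fastest, 9 = best compression
        mbps, ratio = benchmark(data, level)
        print(f"level {level}: {mbps:7.1f} MB/s, ratio {ratio:.3f}")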

Security and Deduplication

Security is a major concern when designing client-side deduplication. In a sequence of papers, we point out threats that arise when over-the-wire deduplication is performed across different tenants/users. One threat is the leakage of small amounts of information, namely whether other users have already uploaded a given object in the past. We described this threat in the paper "Side Channels in Cloud Services". A more direct threat is one in which a client can pretend to upload data that it does not own, by obtaining only a simple (and very small) hash value of this data. In "Proofs of ownership in remote storage systems," we presented solutions to this threat in the form of proofs of ownership (PoWs), by which a client proves to the cloud that it indeed owns the data at hand. We formally defined the security of this notion, presented several schemes, and implemented and tested the performance of our most efficient scheme. This work is an enabling technology for performing over-the-wire deduplication in open environments.
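To make the challenge-response idea behind proofs of ownership concrete, the sketch below shows a simplified Merkle-tree construction: the cloud keeps only the root, and a client proves it holds the data by answering challenges on randomly chosen leaves. This is a didactic simplification under assumed parameters (block size, hash function); it is not the scheme from our paper, which adds an encoding phase and a careful security analysis.

    # Simplified Merkle-tree challenge-response, for illustration only.
    import hashlib
    import os
    import random

    BLOCK = 4 * 1024   # assumed leaf granularity

    def _h(x):
        return hashlib.sha256(x).digest()

    def build_tree(data):
        # Leaves are hashes of fixed-size blocks; each level hashes sibling pairs.
        leaves = [_h(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)] or [_h(b"")]
        levels = [leaves]
        while len(levels[-1]) > 1:
            prev = levels[-1]
            if len(prev) % 2:
                prev = prev + [prev[-1]]       # duplicate the last node on odd-sized levels
            levels.append([_h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
        return levels

    def prove(data, leaf_index):
        # The prover returns the challenged leaf and its sibling path up to the root.
        levels = build_tree(data)
        path, idx = [], leaf_index
        for level in levels[:-1]:
            if len(level) % 2:
                level = level + [level[-1]]
            path.append(level[idx ^ 1])
            idx //= 2
        return levels[0][leaf_index], path

    def verify(root, leaf_hash, leaf_index, path):
        # The verifier recomputes the root from the leaf and the sibling path.
        node, idx = leaf_hash, leaf_index
        for sibling in path:
            node = _h(node + sibling) if idx % 2 == 0 else _h(sibling + node)
            idx //= 2
        return node == root

    # Usage: the cloud stores only the root; answering random challenges
    # requires holding the data itself, not just its hash.
    data = os.urandom(64 * 1024)
    root = build_tree(data)[-1][0]
    challenge = random.randrange((len(data) + BLOCK - 1) // BLOCK)
    leaf, path = prove(data, challenge)
    assert verify(root, leaf, challenge, path)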