Data Redundancy Avoidance Toolkit (DRAT)

Despite the ever-increasing capacities of storage systems and network links, there are often benefits to storing or transmitting objects efficiently, that is, using as few bytes as possible. Environments that would see such benefits include mobile devices with limited storage, communication over telephone links, storage of reference data (data that are written, saved permanently, and often never again accessed), electronic mail, wide-area transfers of large objects (such as scientific datasets), and transfers over saturated links.

There are numerous techniques for reducing object sizes, including data compression (eliminating redundancy internal to an object), duplicate suppression (eliminating redundancy caused by identical objects), and delta-encoding (eliminating redundancy of an object relative to another object, often an earlier version of the object by the same name). Another technique divides larger objects into smaller variable-sized “chunks” and eliminates duplicate chunks; the chunk boundaries can be determined by any of several functions, including Rabin Fingerprints, computed over a sliding window of the content. Such “content-defined blocks” isolate changes within an object: a change in one part of the object does not shift the boundaries of other parts, so duplicate blocks can still be detected across objects. This technique was first proposed for the Low-Bandwidth File System (LBFS) and has since been applied in other systems.
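The chunking idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LBFS or DRAT implementation: a simple polynomial rolling hash stands in for Rabin Fingerprints, and the window size, boundary mask, and the function names `chunk` and `dedup` are all hypothetical choices made for the example.

```python
import hashlib
from typing import Dict, List, Tuple

WINDOW = 16            # sliding-window size in bytes (assumed value)
MASK = (1 << 6) - 1    # boundary when the low 6 hash bits are zero (~64-byte chunks)
BASE = 257             # polynomial base for the rolling hash (assumed)
MOD = (1 << 31) - 1    # modulus for the rolling hash (assumed)
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window


def chunk(data: bytes) -> List[bytes]:
    """Split data at positions chosen by the content of the last WINDOW bytes."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i >= WINDOW:
            # Drop the contribution of the byte sliding out of the window.
            h = (h - data[i - WINDOW] * TOP) % MOD
        h = (h * BASE + b) % MOD
        # Declare a boundary only once a full window has been hashed.
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


def dedup(objects: List[bytes]) -> Tuple[Dict[str, bytes], List[List[str]]]:
    """Store each distinct chunk once; each object becomes a list of chunk keys."""
    store: Dict[str, bytes] = {}
    recipes: List[List[str]] = []
    for obj in objects:
        recipe = []
        for c in chunk(obj):
            key = hashlib.sha1(c).hexdigest()
            store.setdefault(key, c)   # duplicate chunks are stored only once
            recipe.append(key)
        recipes.append(recipe)
    return store, recipes
```

Because each boundary depends only on the bytes inside the window, inserting a few bytes at the front of an object changes only the first chunk or so; the later chunks keep their content and hash to the same keys, so `dedup` stores them once. A production chunker would normally also enforce minimum and maximum chunk sizes, which this sketch omits.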

Still another technique is known as Delta Encoding via Resemblance Detection (DERD). DERD attempts to extend delta-encoding by identifying similar objects, which may otherwise have no spatial or temporal association with the object being encoded, and then performing delta-encoding of the object against a chosen similar object. The resemblance detection step uses Rabin Fingerprints to select a small number of substrings (called fingerprints) from each object; two objects with many of these fingerprints in common are likely to have much of their content in common overall.
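The two DERD steps, resemblance detection followed by delta-encoding, can be illustrated with a minimal Python sketch. Several things here are assumptions rather than the published method: a truncated BLAKE2b hash stands in for Rabin Fingerprints, the features are chosen as the `k` smallest window hashes, zlib with a preset dictionary stands in for a real delta encoder, and the function names and parameter values are hypothetical.

```python
import hashlib
import zlib
from typing import FrozenSet, List


def features(data: bytes, window: int = 16, k: int = 8) -> FrozenSet[int]:
    """Hash every window-sized substring and keep the k smallest hash values."""
    hashes = set()
    for i in range(max(1, len(data) - window + 1)):
        digest = hashlib.blake2b(data[i:i + window], digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return frozenset(sorted(hashes)[:k])


def resemblance(a: bytes, b: bytes) -> int:
    """Count the features two objects share; more shared features, more overlap."""
    return len(features(a) & features(b))


def pick_base(target: bytes, candidates: List[bytes]) -> bytes:
    """Choose the candidate that shares the most features with the target."""
    return max(candidates, key=lambda c: resemblance(target, c))


def delta_encode(target: bytes, base: bytes) -> bytes:
    """Encode target relative to base (zlib preset dictionary as a stand-in)."""
    comp = zlib.compressobj(zdict=base)
    return comp.compress(target) + comp.flush()


def delta_decode(delta: bytes, base: bytes) -> bytes:
    """Recover the target; the decoder must hold the same base object."""
    decomp = zlib.decompressobj(zdict=base)
    return decomp.decompress(delta) + decomp.flush()
```

A caller would run `pick_base(new_object, stored_objects)` and then `delta_encode(new_object, chosen)`; if the chosen base really is similar, the delta is far smaller than the object compressed on its own. Note that zlib only looks back 32 KB into its dictionary, so this stand-in is reasonable only for small objects.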

The purpose of the Data Redundancy Avoidance Toolkit (DRAT) is to provide support to applications, distributed file systems, and other environments to pick and choose automatically from a smorgasbord of techniques like these. We are currently integrating DRAT with the Distributed Storage Tank project at IBM Almaden, and evaluating new data redundancy avoidance techniques. We call one such technique Redundancy Elimination at the Block Level (REBL), as it combines the content-defined chunking of systems like LBFS with the delta-encoding via resemblance detection of systems like DERD. A paper on REBL and how it compares to several other techniques has been submitted for publication.

In the next stage of the project, we will develop autonomic support for applying these techniques and explore new metrics for evaluating the tradeoffs among them.


For more information, contact Fred Douglis or John M. Tracey.
