Deduplication for the Cloud

We focus on problems that arise in deduplication of data in cloud environments.

WAN-based deduplication for the Cloud

Over-the-wire deduplication identifies and eliminates redundant data before it is sent over the wire or saved to a persistent storage device, reducing network and storage load. In the VISION Cloud project, we designed and implemented WAN-based deduplication for objects in the cloud as part of the infrastructure optimizations of the underlying object service. In the VISION Cloud scenario, this means that instead of transferring the data directly, a hash value of the data is computed when the data object is first added to the object service. This hash is then sent to a candidate transfer destination inside the VISION Cloud storage system. The actual data is transferred only if it does not already exist at the destination.
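The hash-first exchange described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Destination` class, its `has` and `put` methods, the `transfer` function, and the choice of SHA-256 are all hypothetical and are not taken from the VISION Cloud implementation. For brevity the sketch computes the hash at transfer time rather than when the object is first added.

```python
import hashlib


class Destination:
    """Hypothetical destination store that indexes objects by content hash."""

    def __init__(self):
        self._objects = {}

    def has(self, digest):
        # Answer the "do you already hold this content?" query.
        return digest in self._objects

    def put(self, digest, data):
        self._objects[digest] = data


def transfer(data, destination):
    """Send the hash first; ship the payload only if the destination lacks it.

    Returns the number of payload bytes that actually crossed the wire.
    """
    digest = hashlib.sha256(data).hexdigest()  # assumed hash function
    if destination.has(digest):
        return 0  # duplicate: only the short hash was sent
    destination.put(digest, data)
    return len(data)
```

Calling `transfer` twice with the same bytes sends the payload once; the second call costs only the hash exchange.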

In VISION Cloud, we support deduplication of full objects (as opposed to chunks of objects) within each cluster. This means that only a single copy of an object is stored per cluster, although copies of the same object will still exist when it is stored at different clusters. Our implementation uses the catalog components (implemented on Cassandra) and links in the underlying file system.
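A minimal sketch of full-object deduplication within one cluster follows. It assumes a plain Python dictionary as a stand-in for the catalog (the real system uses Cassandra) and a hard link as the file-system link; the function name `store_object`, its parameters, and the use of SHA-256 are hypothetical choices made for illustration.

```python
import hashlib
import os


def store_object(cluster_dir, catalog, name, data):
    """Store `data` under `name` inside one cluster directory.

    `catalog` maps a content hash to the path of the single stored copy.
    Returns True if a new copy was written, False if a link was created.
    """
    digest = hashlib.sha256(data).hexdigest()  # assumed hash function
    target = os.path.join(cluster_dir, name)
    existing = catalog.get(digest)
    if existing is not None:
        # Same content already stored in this cluster: add a hard link
        # instead of writing a second copy.
        os.link(existing, target)
        return False
    with open(target, "wb") as f:
        f.write(data)
    catalog[digest] = target
    return True
```

Because the catalog is kept per cluster, storing the same object in a second cluster with its own catalog would write a second copy there, matching the per-cluster scope described above.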

Security and Privacy in Cloud WAN-based deduplication

When designing client-side deduplication, care must be taken with respect to security. In a sequence of papers, we point out threats posed by over-the-wire deduplication when it is performed across multiple tenants or users.

One threat is the leakage of a small amount of information, namely the answer to the question "did other users already upload this object in the past?" This threat is described in the paper "Side Channels in Cloud Services", by Danny Harnik, Benny Pinkas, and Alexandra Shulman-Peleg, published in IEEE Security and Privacy Magazine (special issue of Cloud Security).

In the CCS 2011 paper "Proofs of ownership in remote storage systems", by Shai Halevi, Danny Harnik, Benny Pinkas, and Alexandra Shulman-Peleg, we focus on a more direct threat, in which a client can claim to hold data that it does not own after obtaining only a short hash value of that data. We present solutions to this threat in the form of proofs of ownership (PoWs), by which a client proves to the cloud that it indeed owns the data at hand. We formally define the security of this notion, present several schemes, and implement and test the performance of our most efficient scheme. This work is an enabling technology for performing over-the-wire deduplication in open environments.
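To illustrate the idea of a proof of ownership, the sketch below uses a simplified Merkle-tree challenge-response: the server keeps only a root hash, challenges the client on block positions, and checks each returned block against the root. This is not one of the schemes from the paper; the block size, the function names (`build_levels`, `prove`, `verify`, `pick_challenges`), and the domain-separation bytes are all assumptions chosen for illustration.

```python
import hashlib
import secrets

BLOCK = 4  # tiny block size, for illustration only


def _h(b):
    return hashlib.sha256(b).digest()


def _blocks(data):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)] or [b""]


def build_levels(data):
    """Return all Merkle tree levels, leaf hashes first, root level last."""
    level = [_h(b"\x00" + blk) for blk in _blocks(data)]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # duplicate last node on odd levels
        level = [_h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def pick_challenges(num_blocks, count):
    """Server side: choose random block indices to challenge."""
    return [secrets.randbelow(num_blocks) for _ in range(count)]


def prove(data, index):
    """Client side: return the challenged block and its sibling path."""
    path, idx = [], index
    for level in build_levels(data)[:-1]:
        sib = idx ^ 1
        # A missing sibling means this node was duplicated when hashing.
        path.append(level[sib] if sib < len(level) else level[idx])
        idx //= 2
    return _blocks(data)[index], path


def verify(root, index, block, path):
    """Server side: recompute the root from the block and sibling path."""
    node, idx = _h(b"\x00" + block), index
    for sibling in path:
        if idx % 2 == 0:
            node = _h(b"\x01" + node + sibling)
        else:
            node = _h(b"\x01" + sibling + node)
        idx //= 2
    return node == root
```

A client that knows only a hash of the file cannot produce the challenged blocks, so it fails `verify`; a client holding the full data answers any index chosen by `pick_challenges`.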