Storage techniques for Big Data

Introduction

Data volumes are growing at annual rates of 50% and more. Combined with the need to keep storage costs at a manageable level, this growth calls for efficient data storage: compression and data deduplication, but also careful placement of data on the best-suited storage medium.

To find the optimal placement of data across storage tiers (e.g. SSD, disk, tape; see Figure 1) in terms of cost and performance, it is crucial to understand how the data is accessed. This knowledge can be used not only for data placement but also for intelligent caching and pre-fetching of data, which is one of the research goals of the DOME project.
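As a rough illustration of placement driven by access patterns, the following Python sketch assigns a file to a tier from two simple statistics. It is not the method used in the DOME project; the `FileStats` fields, the `choose_tier` function, and both thresholds are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
HOT_ACCESSES_PER_DAY = 10.0   # at or above this rate, treat the file as hot
COLD_IDLE_DAYS = 90           # untouched this long, treat the file as cold


@dataclass
class FileStats:
    """Assumed per-file access statistics (not from the original text)."""
    name: str
    accesses_last_30_days: int
    days_since_last_access: int


def choose_tier(stats: FileStats) -> str:
    """Return 'ssd', 'disk', or 'tape' based on a file's access pattern."""
    accesses_per_day = stats.accesses_last_30_days / 30
    # Files that have been idle for a long time go to the cheapest tier.
    if stats.days_since_last_access >= COLD_IDLE_DAYS:
        return "tape"
    # Frequently accessed files go to the fastest tier.
    if accesses_per_day >= HOT_ACCESSES_PER_DAY:
        return "ssd"
    # Everything else stays on disk.
    return "disk"
```

A real policy would weigh more signals, such as file size, recall latency from tape, and cost per tier; the sketch only shows how access statistics can feed a placement decision.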

Magnetic tape is the storage medium that offers by far the lowest cost of ownership for long-term storage. An analysis of the tape market showed that most of today’s tape drive and media characteristics (capacity, throughput, longevity, cost, etc.) satisfy customer requirements for use cases such as backup and archival storage. However, there is a strong demand for improvements in tape manageability and usability, for example a non-proprietary, simple and cost-effective integration of tape storage into the tiered storage hierarchy or even into cloud storage systems. This seamless tape integration can be achieved by using the open Linear Tape File System (LTFS) format developed by IBM together with IBM’s General Parallel File System (GPFS).

Such integration enables tape systems to play an important role in active archives, in which data can be seamlessly migrated to the most appropriate storage tier (e.g. SSD, HDD, tape) and remains online and accessible to users on every tier through a common file system that presents all tiers in a single namespace.

Big Data analytics has become a significant driver of large storage capacity requirements and of the demand for highly optimized and responsive storage systems. At the same time, data analytics can also be an enabler for optimized data management and storage systems.

Figure 1. Smart tiering.