Overview
As the volume of digital information continues to grow, we face a paradox. We can read and interpret the Dead Sea Scrolls, written almost 2000 years ago, but we cannot do the same with data written 20 years ago to a 5.25-inch floppy disk. Ironically, as the world becomes digital, we may be entering a digital "Dark Ages" in which business, public, and personal assets are in ever greater danger of being lost. At the same time, the need for long-lived digital information is increasing. Recent compliance legislation, such as HIPAA and the Sarbanes-Oxley Act, requires long-term data viability and has increased the need to study how to preserve myriad types of information, such as scientific, financial, healthcare, artistic, and cultural data, for tens and even hundreds of years.
Preserving information is more than just storing bits of data. It involves preserving the understandability and usability of complex interrelated objects even as computer hardware, operating systems, data management products, and applications are replaced with newer ones, and as the data consumers (designated communities) themselves change frequently. This poses new requirements for ensuring long-term access and understandability, while enabling new interpretations of the same data.
At the heart of any solution to the preservation problem resides a storage component, which is the permanent location of the information. Traditional archival storage considers only bit preservation, if it considers preservation issues at all. We argue that in order to better preserve data and understandability for long periods, a new type of storage must emerge to take preservation considerations into account.
Preservation DataStores (PDS) is such a storage component: a novel design that supports digital preservation environments and ensures data usability and integrity over long periods of time. PDS adds functions and extensions that are specific to logical preservation. It encapsulates the raw data together with its complex interrelated metadata objects, so that they remain inseparable during migration and when the data is accessed in the future. PDS reduces data transfer between applications and storage by offloading data-intensive functions, such as fixity computations and transformations, to the storage. It also simplifies applications by transferring responsibility for managing storage-related events, such as provenance events, to the storage itself.
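The idea of keeping data and metadata together while letting the storage compute fixity and record provenance can be sketched in a few lines of Python. This is a toy illustration only, not the PDS interface: the class names (PreservationObject, ToyPreservationStore), the method names, and the event layout are all assumptions made for the example, and SHA-256 is just one possible fixity algorithm.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class PreservationObject:
    """Raw content kept together with its metadata (hypothetical layout)."""
    object_id: str
    content: bytes
    metadata: dict
    provenance: list = field(default_factory=list)


class ToyPreservationStore:
    """In-memory stand-in for a preservation-aware store; not the PDS API."""

    def __init__(self):
        self._objects = {}

    def ingest(self, object_id, content, metadata):
        # Content and metadata are stored as one unit, so they travel together.
        obj = PreservationObject(object_id, content, dict(metadata))
        self._objects[object_id] = obj
        self._record(obj, "ingest", f"{len(content)} bytes stored")
        return obj

    def compute_fixity(self, object_id, algorithm="sha256"):
        # The digest is computed next to the data; only the short hex string
        # is returned to the caller, not the content itself.
        obj = self._objects[object_id]
        digest = hashlib.new(algorithm, obj.content).hexdigest()
        self._record(obj, "fixity", f"{algorithm}:{digest}")
        return digest

    def provenance(self, object_id):
        # Return a copy so callers cannot alter the recorded history.
        return list(self._objects[object_id].provenance)

    def _record(self, obj, event_type, detail):
        # The store, not the application, appends the provenance event.
        obj.provenance.append({
            "type": event_type,
            "detail": detail,
            "time": datetime.now(timezone.utc).isoformat(),
        })


store = ToyPreservationStore()
store.ingest("obs-001", b"raw sensor readings", {"format": "text/plain"})
digest = store.compute_fixity("obs-001")
events = [event["type"] for event in store.provenance("obs-001")]
```

In this sketch the application never reads the content back to verify it and never writes a provenance record itself; both happen inside the store, which is the division of responsibility the paragraph above describes.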
PDS can be used in a preservation solution as the storage infrastructure that manages preservation objects over time. It can be integrated into a cloud-based service that provides archiving, long-term retention, and preservation. PDS can also be integrated with an enterprise content management (ECM) system to give the ECM OAIS-based logical preservation support. In that case, PDS maps the preservation objects to the ECM object model, taking advantage of that model while ensuring the encapsulation, co-location, and usability of preservation objects over time.
PDS serves as the storage infrastructure of CASPAR, and was installed and evaluated at the European Space Agency (ESA), where it was tested with scientific data. A PDS prototype is available for public download and free evaluation at IBM AlphaWorks. Some PDS concepts are being carried into standardization efforts via the SNIA LTR technical working group.