Scientific data models for large-scale applications


Lloyd A. Treinish 
lloydt@us.ibm.com

IBM Thomas J. Watson Research Center 
Yorktown Heights, NY

Abstract

The instrumentation technology behind data generators in a myriad of disciplines is rapidly improving, typically much faster than the techniques available to manage and use the resultant data.  Hence, greater cognizance of the impact of these data volumes is required.  If adequate mechanisms for use of the actual data that meet the requirements and expectations of the scientific users of such data are not available, then the efforts to generate and archive the data will be for naught.  Within the context of scientific visualization, for example, the role of data management can be expressed by the need for a class of data models that is matched to the structure of the data as well as how such data may be used.  Visualization is a good metaphor for general scientific applications because of its diverse computation and data access requirements.  Traditional methods of handling scientific data such as flat sequential files are generally inefficient in storage, access or ease-of-use for large complex data sets, i.e., O(1 GB) in size, particularly for such applications.  Relational data management systems are also deficient because the relational model does not accommodate multidimensional, irregular or hierarchical structures often found in scientific data sets nor the type of access that associated computations require.  Attempts to extend the relational model for such data incur overhead that limits the effective performance for large data sets.  Therefore, there is a need for some type of data (base) model that possesses elements of a data base management system but is oriented towards scientific data sets and applications.
These requirements demand a generalized data model as a mechanism to classify and access data and to map data efficiently to operations.  Such a model provides a self-describing representation of the physical storage structure (e.g., format), the structural representation of the data (e.g., data base schema), and a higher-level logical structure (e.g., operations).  The implementation of such a model effectively decouples the management of and access to the data from the actual application and is as important a component of a data visualization system, for example, as the underlying graphics and imaging technology.  The justification and implementation of data models for large-scale scientific applications are discussed.
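
The three levels named in the abstract (physical storage, structural description, logical operations) can be illustrated with a minimal sketch.  The byte layout, the header fields and the function names below are all hypothetical and chosen only for illustration; they do not correspond to any particular file format or product.

```python
import json
import struct


def write_self_describing(values, shape):
    """Physical level: pack values behind a header that describes them.

    Hypothetical layout: 4-byte header length, JSON header, raw float64 data.
    """
    header = json.dumps(
        {"type": "float64", "byte_order": "little", "shape": list(shape)}
    ).encode("utf-8")
    payload = struct.pack(f"<{len(values)}d", *values)
    return struct.pack("<I", len(header)) + header + payload


def read_self_describing(blob):
    """Structural level: recover the layout from the header, then the data."""
    (header_length,) = struct.unpack_from("<I", blob, 0)
    header = json.loads(blob[4:4 + header_length].decode("utf-8"))
    if header["type"] != "float64" or header["byte_order"] != "little":
        raise ValueError("this sketch only decodes little-endian float64")
    count = 1
    for extent in header["shape"]:
        count *= extent
    values = list(struct.unpack_from(f"<{count}d", blob, 4 + header_length))
    return header, values


def value_at(header, values, row, column):
    """Logical level: index a 2-D grid without knowing the byte layout."""
    rows, columns = header["shape"]
    if not (0 <= row < rows and 0 <= column < columns):
        raise IndexError("grid index out of range")
    return values[row * columns + column]


blob = write_self_describing([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], (2, 3))
header, values = read_self_describing(blob)
```

An application that calls only `value_at` is insulated from changes to the byte layout, which is the decoupling the abstract argues for.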

Example Implementation:

Data Explorer Data Model

The IBM Visualization Data Explorer is a general-purpose tool kit for data analysis and visualization.  It is built upon an implementation of a unified data model, which supports various types of simulation and observational data, whether in memory or on disk.

Data structures that can be represented include:

The data samples can be defined over spaces of any dimensionality, and, independently, can also be connected by primitives of various dimensionalities (allowing, for example, triangular and quadrilateral meshes defined over 2- or 3-dimensional points).  The data values can be associated either with the sample points or with the connections between the sample points.

Available data types include:

Data are stored in the form of Objects for use by Data Explorer.  An Object is a data structure stored in memory that contains an indication of the Object's type, along with additional type-dependent information.  The bulk of the data is encapsulated in Array Objects.
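
As a rough illustration of the description above, the following sketch holds sample positions, the primitives that connect them, and data values that depend on one or the other.  The class name `SampledField` and its attributes are hypothetical stand-ins and are not the Data Explorer programming interface.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SampledField:
    """Hypothetical container for sampled data; not the Data Explorer API."""

    positions: List[Tuple[float, ...]]   # sample locations, any dimensionality
    connections: List[Tuple[int, ...]]   # primitives as indices into positions
    data: List[float]                    # one value per position or connection
    dep: str = "positions"               # which component the data depend on

    def __post_init__(self) -> None:
        if self.dep not in ("positions", "connections"):
            raise ValueError("dep must be 'positions' or 'connections'")
        if self.dep == "positions":
            expected = len(self.positions)
        else:
            expected = len(self.connections)
        if len(self.data) != expected:
            raise ValueError(
                f"expected {expected} data values, got {len(self.data)}"
            )
        for primitive in self.connections:
            if any(i < 0 or i >= len(self.positions) for i in primitive):
                raise ValueError("connection refers to a missing position")


# Four points in 2-D joined into two triangles.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
triangles = [(0, 1, 2), (1, 3, 2)]

# The same mesh carrying values at the sample points, or one value per triangle.
per_point = SampledField(points, triangles, [10.0, 20.0, 30.0, 40.0])
per_cell = SampledField(points, triangles, [1.5, 2.5], dep="connections")
```

Keeping the dimensionality of the positions independent of the size of each primitive is what lets one structure describe, for example, triangles over either 2- or 3-dimensional points.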

The data model centers on the notion of a sampled field. The next section describes the Field, Array, and Group Objects that implement sampled fields in Data Explorer. In addition to these basic Object types, other types are used to construct models for rendering (e.g., Transforms, Clipped Objects, Lights, and Cameras). These are described in "Data Explorer Native Files" and in the IBM Visualization Data Explorer Programmer's Reference.

Data are also stored in permanent file storage in the form of the same Objects. Although Data Explorer supports the creation of Objects from data stored in other file formats (such as netCDF), the Data Explorer file format offers significant additional functionality and flexibility.

Note that the Data Explorer file format is versatile, allowing for future expansion of the capabilities of the system without requiring changes to the file format. It is possible to represent data types in a Data Explorer file that cannot be processed by the current version of Data Explorer. For example, in the current release of Data Explorer, most functions support only 1-, 2-, or 3-dimensional coordinates.
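
The distinction between what a file can represent and what a given function can process might be sketched as follows.  The constant `SUPPORTED_DIMENSIONS` and the function `can_process` are hypothetical names for illustration only; the supported range of 1 to 3 dimensions is taken from the paragraph above.

```python
# Dimensionalities a hypothetical processing function accepts.
SUPPORTED_DIMENSIONS = (1, 2, 3)


def can_process(positions):
    """Return True if every position has the same, supported dimensionality.

    The stored positions may have any dimensionality; only the processing
    step is restricted, so the stored representation never has to change.
    """
    dimensions = {len(position) for position in positions}
    return len(dimensions) == 1 and dimensions.pop() in SUPPORTED_DIMENSIONS


planar = [(0.0, 0.0), (1.0, 0.5)]
four_dimensional = [(0.0, 0.0, 0.0, 0.0), (1.0, 0.5, 0.2, 0.1)]
```

Here `four_dimensional` is perfectly valid as stored data but would be declined by a function restricted to three dimensions.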


A Function-Based Data Model
 
The definition and implementation of coherent methods of representing and using data as a formal data model have been used effectively to create general-purpose tools for visualization, computation and data management.  Despite the relative success and capabilities of these data models, there can still be gaps in their ability either to support specific classes of data or to map well to a user's problem domain.  Therefore, a higher-level data model based upon a simple functional definition is proposed and tested, which leverages some of the capabilities of extant implementations.
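
The abstract does not spell out the functional definition, so the following is only one possible reading: a data set is treated as a function from a domain to a range, and the samples are an implementation detail.  The helper `as_function` and the use of linear interpolation are assumptions made for this sketch, not the proposed model itself.

```python
from bisect import bisect_right


def as_function(xs, ys):
    """Wrap 1-D samples as a callable f(x), using linear interpolation.

    Assumes xs is strictly ascending; this is a hypothetical helper.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two matching samples")
    if any(b <= a for a, b in zip(xs, xs[1:])):
        raise ValueError("xs must be strictly ascending")

    def f(x):
        if x < xs[0] or x > xs[-1]:
            raise ValueError("x lies outside the sampled domain")
        # Index of the interval containing x, clamped to the last interval.
        i = min(bisect_right(xs, x) - 1, len(xs) - 2)
        t = (x - xs[i]) / (xs[i + 1] - xs[i])
        return ys[i] + t * (ys[i + 1] - ys[i])

    return f


temperature = as_function([0.0, 1.0, 2.0], [10.0, 20.0, 40.0])
```

A caller evaluates `temperature(0.5)` without knowing how many samples exist or how they are stored, which is the kind of mapping to a user's problem domain that the paragraph above describes.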

