Scientific data models for large-scale applications
Lloyd A. Treinish
lloydt@us.ibm.com
IBM Thomas J. Watson Research Center
Yorktown Heights, NY
The instrumentation technology behind data generators in a myriad of disciplines
is rapidly improving, typically much faster than the techniques available
to manage and use the resultant data. Hence, greater cognizance of
the impact of these data volumes is required. If adequate mechanisms
for use of the actual data that meet the requirements and expectations
of the scientific users of such data are not available, then the efforts
to generate and archive the data will be for naught. Within the context
of scientific visualization, for example, the role of data management can
be expressed by the need for a class of data models that is matched to
the structure of the data as well as how such data may be used. Visualization
is a good metaphor for general scientific applications because of its diverse
computation and data access requirements. Traditional methods of
handling scientific data, such as flat sequential files, are generally inefficient
in storage, access or ease of use for large, complex data sets (i.e., on the order
of 1 GB in size), particularly for such applications. Relational data
management systems are also deficient because the relational model does
not accommodate multidimensional, irregular or hierarchical structures
often found in scientific data sets nor the type of access that associated
computations require. Attempts to extend the relational model to handle
such data incur overhead that limits effective performance for large data
sets. Therefore, there is a need for some type of data (base) model
that possesses elements of a data base management system but is oriented
towards scientific data sets and applications. These requirements
demand a generalized data model as a mechanism to classify and access
data and to map data efficiently to operations. Such a model provides
a self-describing representation of the physical storage structure (e.g.,
format), the structural representation of the data (e.g., data base schema),
and a higher-level logical structure (e.g., operations). The implementation
of such a model effectively decouples the management of and access to the
data from the actual application and is as important a component of a data
visualization system, for example, as underlying graphics and imaging technology.
The justification and implementation
of data models for large-scale scientific applications are discussed.
The IBM Visualization Data Explorer
is a general-purpose tool kit for data analysis and visualization.
It is built upon an implementation of a unified data model, which supports
various types of simulation and observational data, whether in memory or
on disk. Data structures that can be represented include:
- Data defined on a regular orthogonal grid
- Data defined on a deformed regular or curvilinear grid
- Data defined on various irregular grids, such as triangular, quadrilateral,
  and tetrahedral meshes
- Unstructured data with no connections between the data samples.
The data samples can be defined over spaces of any dimensionality, and,
independently, can also be connected by primitives of various dimensionalities
(allowing, for example, triangular and quadrilateral meshes defined over
2- or 3-dimensional points). The data values can be associated either with
the sample points or with the connections between the sample points. Available
data types include:
- Real and complex data
- Scalar, vector and tensor data
- Byte, short, integer (signed and unsigned), and floating-point data
Data are stored in the form of Objects
for use by Data Explorer. An Object is a data structure stored in
memory that contains an indication of the Object's type, along with additional
type-dependent information. The bulk of the data is encapsulated in Array
Objects.
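The structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Data Explorer API: the class names `Array` and `Field` echo the Object types named in the text, while the attribute names (`positions`, `connections`, `data`, `dep`, `object_type`) and the validation rules are hypothetical choices made for this example. It shows data values attached either to the sample points or to the connections between them, and an unstructured case with no connections.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Array:
    """Holds the bulk of the data as a list of equally sized items."""
    items: List[Tuple[float, ...]]

    def __post_init__(self) -> None:
        if len({len(item) for item in self.items}) > 1:
            raise ValueError("all items in an Array must have the same length")


@dataclass
class Field:
    """A sampled field: positions, optional connections, and data values."""
    positions: Array                                      # sample locations
    data: Array                                           # sampled values
    connections: Optional[List[Tuple[int, ...]]] = None   # None = unstructured
    dep: str = "positions"                                # what the data attach to
    object_type: str = "field"                            # hypothetical type tag

    def __post_init__(self) -> None:
        n_points = len(self.positions.items)
        # Every connection element must index existing positions.
        for element in self.connections or []:
            if any(i < 0 or i >= n_points for i in element):
                raise ValueError("connection refers to a missing position")
        # The data count must match the component the data depend on.
        if self.dep == "positions":
            expected = n_points
        elif self.dep == "connections" and self.connections is not None:
            expected = len(self.connections)
        else:
            raise ValueError("dep must name an existing component")
        if len(self.data.items) != expected:
            raise ValueError(f"expected {expected} data items")


# Four 2-dimensional points joined by two triangles.
points = Array([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
triangles = [(0, 1, 2), (1, 3, 2)]

# One scalar per sample point.
per_point = Field(points, Array([(10.0,), (11.0,), (12.0,), (13.0,)]), triangles)
# One scalar per triangle.
per_cell = Field(points, Array([(0.5,), (0.7,)]), triangles, dep="connections")
# Unstructured data: the same points with no connections at all.
scattered = Field(points, Array([(1.0,), (2.0,), (3.0,), (4.0,)]))
```

Keeping the dependency as an explicit attribute lets the same positions and connections carry either point-based or cell-based values without changing the containers.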
The data model centers on the notion of a sampled field. The
next section describes the Field, Array, and Group Objects
that implement sampled fields in Data Explorer. In addition to these
basic Object types, other types are used to construct models for rendering
(e.g., Transforms, Clipped Objects, Lights, and Cameras). These are described
in "Data Explorer Native Files" and in IBM Visualization Data Explorer
Programmer's Reference.
Data are also stored in permanent file storage in the form of the same
Objects. Although Data Explorer supports the creation of Objects from data
stored in other file formats (such as netCDF), the Data Explorer file format
offers significant additional functionality and flexibility.
Note that the Data Explorer file format is versatile, allowing for future
expansion of the capabilities of the system without requiring changes to
the file format. It is possible to represent data types in a Data Explorer
file that cannot be processed by the current version of Data Explorer.
For example, in the current release of Data Explorer, most functions support
only 1-, 2-, or 3-dimensional coordinates.
A Function-Based Data Model
The definition and implementation of coherent methods of representing
and using data as a formal data model have been used effectively to
create general-purpose tools for visualization, computation and data management.
Despite the relative success and capabilities of these data models, gaps
can remain in their ability either to support specific classes
of data or to map well to a user's problem domain. Therefore, a
higher-level data model based upon a simple functional definition is proposed
and tested, which leverages some of the capabilities of extant implementations.
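The text does not spell out the functional definition. One possible reading, offered here purely as an assumption, is that a data set is treated as a function from a domain to a range that is known only at sample points and is evaluated elsewhere by interpolation. The sketch below illustrates that reading for a one-dimensional domain; the name `sampled_function` and the choice of linear interpolation are hypothetical and are not taken from the source.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def sampled_function(xs: Sequence[float], ys: Sequence[float]) -> Callable[[float], float]:
    """Return f(x) that linearly interpolates the samples (xs[i], ys[i]).

    xs must be strictly increasing and contain at least two values.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two samples of equal count")
    if any(b <= a for a, b in zip(xs, xs[1:])):
        raise ValueError("sample locations must be strictly increasing")

    def f(x: float) -> float:
        if x < xs[0] or x > xs[-1]:
            raise ValueError("x lies outside the sampled domain")
        # Locate the interval [xs[i], xs[i + 1]] that contains x.
        i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
        t = (x - xs[i]) / (xs[i + 1] - xs[i])
        return ys[i] + t * (ys[i + 1] - ys[i])

    return f


# Three samples of some quantity along one axis.
temperature = sampled_function([0.0, 1.0, 2.0], [0.0, 10.0, 40.0])
midpoint_value = temperature(0.5)   # halfway between the first two samples
```

Under this reading, an application asks for values of the function and does not need to know how the samples are stored.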