1. Introduction
The instrumentation technology behind data generators in a myriad of disciplines is rapidly improving, typically much faster than the techniques available to manage and use the resultant data. In fact, an onslaught of orders of magnitude more data is expected over the next several years. Hence, greater cognizance of the impact of these data volumes is required. For example, NASA's Earth Observing System (EOS), which is planned for initial deployment later this year, will have to receive, process and store up to ten TB of complex, interdisciplinary, multidimensional earth sciences data per day, for over a decade, from a number of instruments. These data will be compared with and utilized alongside models and other data. Another example is the work under the US Department of Energy (DOE) Accelerated Strategic Computing Initiative (ASCI) to shift from physical testing to computational methods for ensuring the safety, reliability and performance of the nuclear weapons stockpile. This requires the most advanced, state-of-the-art scientific simulations, derived from a number of distinct codes, operating on the world's largest supercomputers (i.e., TB of RAM, TFLOPS of performance and PB of disk). The results of such codes must be coupled and compared to the archived data from the physical testing.
If adequate mechanisms for use of
the actual data that meet the requirements and expectations of the scientific
users of such data are not available, then the efforts to generate and
archive the data will be for naught. Within the context of scientific
visualization it has long been recognized that the role of data management
can be expressed by the need for a class of data models that is matched
to the structure of the data as well as how such data may be used. Traditional
methods of handling scientific data such as flat sequential files are generally
inefficient in storage, access or ease-of-use for large complex data sets.
Relational data management systems are also deficient because the relational
model does not accommodate multidimensional, irregular or hierarchical
structures often found in scientific data sets nor the type of access that
associated computations require. Attempts to extend the relational
model for such data incur overhead that limits the effective performance for
large data sets. Therefore, there is a need for some type of data
(base) model that may possess elements of a data base management system
but is oriented towards the data sets and applications typical of visualization.
These requirements demand the utilization of a generalized data model as
a mechanism to classify and access data as well as to map data efficiently
to operations. Such a model provides a self-describing representation of the physical
storage structure (e.g., format), a structural representation of the data
(e.g., data base schema), and a higher-level logical structure (e.g., operations).
The implementation of such a model effectively decouples the management
of and access to the data from the actual application and is as important
a component of a data visualization system, for example, as underlying
graphics and imaging technology.
2. Conceptual Models in Visualization
Although computer-based visualization has changed rapidly over the last few decades, the tools and systems that support it have typically evolved rather than being formally designed. This common situation is due to the fact that the "science" of visualization, which involves the integration of graphics, imaging, data management and human perception, is just beginning. In addition, two divergent trends in visualization have become common recently. The first is that, due to the proliferation of data coupled with the low cost of supporting (hardware) technology, visualization is more available, practical and recognized as important than ever before, particularly in operational or non-research environments. However, the supporting technology is often not approachable for these applications. The second trend emerges from the aforementioned data glut problem: typical ad hoc approaches do not scale to large, complex problems. Despite these competing requirements, access to data is the common barrier. A first step is to decompose visualization into a set of transformations that can highlight these limitations, by defining a conceptual model and developing a taxonomy.
Figure 1 shows one general decomposition
of visualization into a set of underlying models:
Although this idea has been suggested before (e.g., Hibbard, 1994), there is a limitation when these models are considered as a set of layers. On the left side, interactions between layers for typical visualization operations are shown as a set of colored arrows, under the view that the data model is most fundamental. This suggests that the organization on the right is a better illustration of the role of a data model.
Figure 1. Conceptual Models of
Visualization
2.1 Data Models
A data model is a representation of data, that is, how data are described (e.g., abstract data type) and how data are used (e.g., applications programming interface). Borrowing from the data base community, there exist a physical representation (format, media), a logical representation (data structures, schema) and a visual representation (object and conceptual representation, user view), illustrated more simply in Figure 2.
To serve as a lexicon of data (i.e., a lingua franca for software and users), a data model must include formal definitions and an algebra to express the organization and manipulation of data. In this context, visualization itself is not treated as special -- it is just another consumer and generator of data. Another way to view this idea is as a layer that provides a logical link between the concepts that scientists use to think about their field (e.g., particle trajectory, cerebral cortex shape, plasma temperature profile, or gene) and the underlying data from a simulation, experiment or storage system.
Figure 2. Role of Data Models in Visualization Software.
In particular, this layer provides
tools that are common to all applications for data definition, metadata
support, and query formulation and execution. This layer supports
computational, analysis and visualization tools, and provides the infrastructure
for data representation and access (e.g., data model definition, metadata
support, query formulation, etc.). Like a data base management system,
it would reside above the operating system and enable the building of applications.
Hence, data models hide the complexity of underlying computational systems
for simulation, analysis and visualization, freeing scientists to focus
on data comprehension by providing a common mechanism for access, utilization
and interchange.
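As a concrete illustration, the layer described above can be sketched as a minimal abstraction in which metadata travel with the data and queries hide the physical storage. This is an illustrative sketch only; the names (Field, query) are hypothetical and do not come from any of the systems discussed.

```python
from dataclasses import dataclass, field

@dataclass
class Field:
    """A hypothetical data-model object: data plus the metadata to interpret it."""
    name: str
    units: str                      # metadata travels with the data
    values: list                    # logical representation; storage is hidden
    attrs: dict = field(default_factory=dict)

    def query(self, predicate):
        """Return values satisfying a predicate, hiding physical access."""
        return [v for v in self.values if predicate(v)]

# An application works with the abstraction, never the underlying format.
temperature = Field("temperature", "K", [271.4, 280.2, 290.9, 301.5])
warm = temperature.query(lambda v: v > 285.0)
```

Any computational, analysis or visualization tool built on such a layer would share the same definitions and query mechanism, which is precisely the decoupling argued for above.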
3. Related Efforts for Data Model Implementations
These classic limitations have been recognized by a few groups, primarily in support of a number of scientific applications. While all of their implementations have been shown to be effective and are widely used, they have some notable limitations in either the aforementioned representations or their ability to support access to complex data. Although they will not be discussed directly herein, there are other classes of data models. These include geographic information systems (a set of static, two-dimensional spatial layers), mechanical computer-aided design (static, 3d hierarchies and non-mesh representations), and of course, the relational data model (tables of discrete non-spatially-oriented values).
3.1 CDF/netCDF
Common Data Format (CDF), developed at NASA/Goddard Space Flight Center initially in the mid-1980s, was one of the first implementations of a scientific data model [NSSDC, 1996]. It is based upon the concept of providing abstract support for a class of scientific data that can be described by a multidimensional block structure. From the CDF effort spawned the Unidata Program Center's netCDF, which is more focused on data transport [Unidata, 1998]. Both CDF and netCDF are supported by a large collection of utilities and applications, ports to a wide variety of platforms and usage in a diversity of disciplines.
Although netCDF and CDF support the same basic data model, the interfaces they present and their physical storage are quite different. NetCDF has only one physical form -- a single file written in the IEEE format. The multi-dimensional arrays are written by C convention (last dimension varies fastest). In addition, the current software supports only limited direct editing or other transactions on the files in place (i.e., without copying). In contrast, CDF supports multiple physical forms: IEEE or native, single or multiple file (one header file and one file for each variable), row (i.e., C) or column (i.e., FORTRAN) major organization and the ability to interoperate among them. CDF software supports caching and direct utilization of the file system to provide rapid access and in-place updates.
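The row-major versus column-major distinction between the two organizations can be illustrated with a small sketch (using NumPy here purely for illustration): the same logical array linearizes to different physical orders.

```python
import numpy as np

# Row-major (C convention): the last index varies fastest, as in netCDF files.
a_c = np.arange(6, dtype=np.float32).reshape(2, 3)
# Column-major (FORTRAN convention): the first index varies fastest, as CDF permits.
a_f = np.asfortranarray(a_c)

# Logically the two arrays are identical; only the physical layout differs.
flat_c = a_c.ravel(order="K")   # memory order: 0 1 2 3 4 5
flat_f = a_f.ravel(order="K")   # memory order: 0 3 1 4 2 5
```

Interoperating between the two forms, as CDF does, amounts to transposing this linearization transparently on access.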
3.2 HDF
Another important example is the Hierarchical Data Format (HDF) developed by the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign [NCSA, 1998a]. HDF uses an extensible tagged file organization to provide access to basic data types like a raster image (and an associated palette), a multidimensional block, etc. In this sense, HDF provides access (i.e., via its C and FORTRAN bindings) to a number (six) of different flat file organizations. The HDF software has been ported to a wide variety of platforms and is supported by a number of applications. Currently, all of HDF's data structures are memory resident. This limits the ability of an application using HDF software to effectively utilize or even randomly access large disk-resident data sets. One of the newer storage schemes, Vset, attempts to supersede the simple multi-dimensional block support inherent in the CDF/netCDF and the conventional HDF data models. It supports regular and irregular data and the ability to form hierarchical groupings. The lack of scalability in HDF is being addressed with a new implementation (HDF5) that offers improved performance by enabling users to have more control over how data are stored (e.g., compression, chunking and parallel I/O on supercomputers) and the ability to do aggregation for large data sets [NCSA, 1998b]. This approach can enable HDF to provide an underlying array access layer that could be used by other software for higher-order data types.
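The chunking idea behind HDF5 can be sketched independently of the HDF5 library itself: if an array is stored as fixed-size chunks, a subset read touches only the chunks it intersects. The helper below is illustrative, not HDF5 API.

```python
from itertools import product

def chunks_for_slab(slab_start, slab_stop, chunk_shape):
    """Chunk indices (per axis) intersected by the slab [start, stop)."""
    ranges = [
        range(lo // c, (hi - 1) // c + 1)
        for lo, hi, c in zip(slab_start, slab_stop, chunk_shape)
    ]
    return list(product(*ranges))

# A 100 x 100 array stored in 10 x 10 chunks: reading rows 5..14, cols 0..9
# touches only 2 of the 100 chunks, rather than the whole data set.
touched = chunks_for_slab((5, 0), (15, 10), (10, 10))
```

This is why chunked storage helps with random access to large disk-resident data sets: I/O cost scales with the size of the request, not the size of the array.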
3.3 VisAD
VisAD (Visualization for Algorithm Development) was developed by the University of Wisconsin to provide interactive computation and visualization facilities derived from a set of abstract models. The VisAD data model assumes that data objects are approximations to mathematical objects [Hibbard et al, 1994]. It supports a set of data types ranging from primitive (e.g., approximations to real numbers, integers, and text strings) to complex (finite tuples of other types, finite representations of sets over domains defined by tuples of real types, and finite samplings of functions from domain types to range types). These are associated with rich metadata (precision, existence and sampling information as well as conventional characteristics). In the current implementation, these are embodied as abstract data classes in Java. These facilities are coupled with a display model implemented in Java3D, which is defined by a set of mappings from primitive data types to primitive display types [Hibbard, 1998].
3.4 DX
IBM Visualization Data Explorer is an extended, client-server data-flow system for visualization that is built upon a data model, which supports general field representation with an API, high-level-language and visual program access. It includes support for curvilinear and irregular meshes and hierarchies (e.g., trees, series, composites), vector and tensor data, etc., in addition to the class of data supported by the aforementioned implementations [Abram and Treinish, 1995]. Currently, the physical disk-based format, dx, provides only simple sequential access. The data model builds upon the observation by researchers at Sandia National Laboratories that the mathematical notion of fiber bundles provides a useful abstraction for scientific data management and applications [Butler and Pendley, 1989].
The fiber bundle model is illustrated in Figure 3. It is a methodology for decomposing complex topologies (differentiable manifolds) into simpler components. Essentially, a topological space that is a product of two spaces locally, but not necessarily globally, is a fiber bundle. It is a description of how discrete samples of continuous variables are organized and of their topological relationship. In the figure, an abstract representation of a space is shown as the Cartesian (topological) product of a base space and a fiber space, S = W × Y (a fiber bundle).
Figure 3. The Fiber Bundle Field
Data Model.
A copy of the fiber space associated with each element of the base space is a fiber. One element of each fiber constitutes a fiber bundle section. The base space is analogous to the independent variables of a function, while the fiber space is analogous to the dependent variables, and a fiber bundle section is analogous to a function (field). Therefore, the fiber bundle space is where a function is graphed [Butler and Bryson, 1992].
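A minimal numerical sketch of this vocabulary, with hypothetical names: the base space is a set of sample locations, a section assigns one fiber element to each, and the resulting pairs are exactly the graph of the field.

```python
import numpy as np

# Base space: discrete samples of the independent variable.
base_space = np.linspace(0.0, 1.0, 5)

# A fiber bundle section: one element of the fiber space per base point,
# i.e., a field sampled over the base space.
section = np.sin(2 * np.pi * base_space)

# The section is "where the function is graphed": pairs (x, f(x)).
graph = np.column_stack([base_space, section])
```

The same pairing generalizes to multidimensional base spaces (meshes) and multi-valued fibers (vectors, tensors), which is what Data Explorer exploits below.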
In Data Explorer, this idea is specialized and extended to incorporate localized, piecewise field descriptions, support compact representations that exploit regularity, and data-parallel execution. This permits the same consistent access to data independent of its underlying grid, type or hierarchical structure(s) via a uniform abstraction to provide polymorphic functions. Data communication among subsequent operations is accomplished by passing pointers. In addition, sharing of these structures among such operations is supported [Haber et al, 1991].
3.5 ASCI Common Data Model (CDM)
From the aforementioned efforts, ASCI,
as part of its overall data management effort, is developing a very comprehensive
data model to ensure coverage of the many different representations of
simulation data that are generated at the DOE laboratories. The main entities
in the model are computational meshes and the simulation variables related
to them. This data model also uses the notion of fiber bundle sections
for the mapping between topological spaces. It then introduces the idea
of cell complexes, which are structures that tile a physical space with
geometric cells that share common faces that serve as a good metaphor for
computational meshes. Under the ASCI program, there is on-going development
effort for this model and as well as tools that utilize it. The model is
based upon a three-layer approach of providing data structures for array
and table access, then fiber bundles to provide basic mappings, and finally
a mesh/field level to provide field as well as cell complex access. The
aforementioned HDF 5 is being used to provide the underlying access to
storage [Ambrosiano, 1998].
4. A Function-Based Unified Data Model
Despite the significant capabilities of these available data models, there remain a few areas which are not adequately addressed. These range from the representation of other data types such as ordered structures (e.g., molecular models) or tables and relations (e.g., spreadsheets, RDBMS, categorical data), to highly irregular, inconsistent or non-spatial sampling (i.e., observations), and the aggregation of disparate types. As is often the case, highly generic systems and approaches are difficult for many scientists to adapt to their own problem domains. Therefore, another aspect that needs addressing is a more direct mapping at a user level. An approach to address these problems is complementary to, and not competitive with, the ASCI CDM, DX and VisAD data model efforts. In fact, these implementations can be used to help create a new unified data model.
To begin it is necessary to take a step back and look at the fundamental organization and definition of data. Any data set may be considered as a single or multi-valued function of one or more independent variable(s) called dimensions, enumerated from 1 to j. Such dimensions may be space (length, width, height), time, energy, etc. For example, zero-dimensional data are just numbers such as sales, while two-dimensional data could depend on an area such as barometric pressure over a state.
The function(s) composing a data set really are dependent variable(s) -- the data themselves, which may be called parameters. They are dependent on the dimensions, such as sales or temperature. Thus, data or D implies a parameter or field of one or more (dependent) values that is a function of one or more (independent) variables,
D = [y_{1}, y_{2}, ..., y_{i}] = [f_{1}(x_{1}, x_{2}, ..., x_{j}), f_{2}(x_{1}, x_{2}, ..., x_{j}), ..., f_{i}(x_{1}, x_{2}, ..., x_{j})]. (1)
A parameter may have more than one value, which is characterized by tensor rank, i, the number of values per dependent variable. Rank 0 is a scalar (one value), such as temperature (a magnitude -- a single-valued function). Rank 1 is a vector, such as wind velocity (a magnitude and a direction: two values in two dimensions, three values in three dimensions). Vectors of size n are n-valued functions. Rank 2 is a tensor, such as stress on an airframe (four values in two dimensions, nine values in three dimensions). A rank 2 tensor in n-dimensional space is an n x n matrix of functions (e.g., stress). Dimensionality and rank are thus related. The number of elements in a particular parameter is j^{i}, which can be generalized as a set of tuples.
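The relationship between rank, dimensionality and values per sample can be checked with a small sketch (the array shapes are chosen purely for illustration):

```python
import numpy as np

j = 3              # number of dimensions
grid = (4, 4, 4)   # sample points along each dimension

# A rank-i parameter carries j**i values per sample point, matching the
# scalar (1), vector (3) and tensor (9) cases described above.
scalar = np.zeros(grid)            # rank 0: j**0 = 1 value per point
vector = np.zeros(grid + (j,))     # rank 1: j**1 = 3 values per point
tensor = np.zeros(grid + (j, j))   # rank 2: j**2 = 9 values per point
```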
Independent of this functional relationship is the data type, which includes the physical primitive, which describes how data values are stored on some medium (e.g., byte, int, float, etc.). It can include machine representations (e.g., little endian vs. big endian, IEEE vs. VAX, etc.). In addition, there can be a category of such types, i.e., real, complex or quaternion.
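The machine-representation distinction can be illustrated with the standard library's struct module: the same IEEE 754 value stored little-endian versus big-endian yields byte-reversed representations, which is precisely the kind of detail a self-describing format must record.

```python
import struct

value = 1.0
little = struct.pack("<f", value)   # little-endian IEEE 754 single precision
big = struct.pack(">f", value)      # big-endian IEEE 754 single precision

# Same bits, opposite byte order; a reader must know which was written.
```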
4.1 Functional Mapping
Consider F(D), where D are data and F is some computation, which may include a visualization operation such as realization, transformation, etc. D can be extended beyond mesh sampling or aggregation (i.e., the fiber bundle description) by examining more fundamental topology:
a : T_{1} --> T_{2} (2)

where a is a mapping between two topological spaces (e.g., a visualization operation). If both a and a^{-1} are continuous, then a is considered homeomorphic. Commonly, there may be more than one mapping, such that

a_{1} : X --> Y and a_{2} : X --> Y (3)

If a_{1} can be deformed to a_{2}, then a_{1} is considered homotopic to a_{2}. Such deformable mappings occur often in visualization, as will be illustrated later. Thus,

F : X × [0, 1] --> Y (4)

if F is continuous such that F(x, 0) = a_{1}, F(x, 1) = a_{2}, and as the real variable, t, in F(x, t) varies continuously over [0, 1], a_{1} is deformed continuously into a_{2}. As a result, homeomorphism generates equivalence classes whose members are topological spaces, while homotopy generates equivalence classes whose members are continuous maps. Hence, homotopy equivalence classes are topological invariants of X and Y, which enables one to vary X or Y through a family of spaces, C(X, Y). This is exactly the process that can be used for cartographic warping [Treinish, 1994].
Now reconsider F(D) such that F : X × [0, 1] --> Y; then the following is true:
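A minimal sketch of such a homotopy, with illustrative mappings a_1 and a_2: the linear deformation F(x, t) = (1 - t) a_1(x) + t a_2(x) reduces to a_1 at t = 0 and a_2 at t = 1, the construction underlying cartographic warping.

```python
import numpy as np

def a1(x):
    # An illustrative continuous mapping (identity placement of samples).
    return x

def a2(x):
    # A second illustrative continuous mapping of the same domain.
    return np.sin(np.pi * x / 2)

def F(x, t):
    # Linear homotopy: continuous deformation of a1 into a2 as t goes 0 -> 1.
    return (1 - t) * a1(x) + t * a2(x)

x = np.linspace(0.0, 1.0, 11)
```

Animating t through [0, 1] produces the family of intermediate spaces, which is how a flat map can be warped smoothly onto a globe.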
This effort so far has focused mostly on the upper two layers, as shown in Figure 4, to provide a conceptual specification. It certainly leverages the experience gained in the earlier implementations in that multiple interfaces are needed but can be extended (i.e., at each level). This enables the abstraction at each level to be presented in a simple fashion, while efficiencies in implementation can be addressed at the lower levels. To illustrate this, consider Figures 5, 6 and 7, which show the underlying components and their relationships.
Figure 5 illustrates one taxonomy of the field layer or abstraction, which can be decomposed into meshes on which the data are sampled that are either implicitly positioned (i.e., regular) or explicitly positioned (i.e., irregular). In the latter case, the mesh may be topologically regular, in which only the locations of the sample points need be specified, or topologically irregular, in which connectivity information is required. In all cases, the actual data values on the sample points are defined. All of this information is stored as a collection of multidimensional arrays (i.e., the array layer). The metadata associated with the arrays provide the semantics to define a field.
Figure 5. Taxonomy of the Field
Layer.
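The implicit/explicit distinction in the field layer can be sketched as follows (the values are illustrative): a regular mesh stores only an origin and spacing, an irregular one lists positions per point, yet both describe the same sample locations backed by the array layer.

```python
import numpy as np

# Implicitly positioned (regular) mesh: positions derived from origin + spacing.
origin, delta, n = 0.0, 0.5, 5
implicit = origin + delta * np.arange(n)

# Explicitly positioned mesh: the same locations, but listed per sample point
# (a topologically irregular mesh would additionally store connectivity).
explicit = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
```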
Figure 6 shows sample taxonomies for the aggregation layer, which describes classes of data that can be broken down into simple fields. They may range from a (time) series, which is merely a sequence of instances of a field, to cases that can be spatially decomposed into sets of subfields, such as a multizone grid or a collection of spatial partitions for parallel processing. Adding hierarchy enables a description of an adaptive mesh. Some data may be sampled over a very topologically complex mesh. In this case, it may be easier to decompose it into a set of simpler meshes, each of which is of the same class.
Figure 6. Taxonomy of Aggregation
Layer.
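A sketch of the aggregation idea, with illustrative values: a time series is simply an ordered collection of instances of one field, and operations on the aggregate decompose into operations on its members.

```python
import numpy as np

# A time series aggregate: three time steps of a 2-D scalar field.
series = [np.full((4, 4), float(t)) for t in range(3)]

# An operation on the aggregate decomposes into operations on the members;
# the same pattern applies to multizone grids or spatial partitions.
means = [instance.mean() for instance in series]
```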
Figure 7 shows a taxonomy of the function layer, with the five cases defined as necessary to perform practical operations. In each case, the functions can be decomposed into a set of operations on fields, as illustrated here, or on aggregates.
Figure 7. Taxonomy of Function
Layer.
5. Examples
To demonstrate the flexibility of this approach and to test the applicability of this model, a number of applications have been identified for each of the five cases. Only a handful are shown herein. Visualization and interaction are used to verify the decomposition and mapping. The data model and functionality of Data Explorer are extended to provide the function-based interface for each of the examples.
Figure 8 shows different coordinate systems and sampling for a single scalar data set, which is total column ozone in the earth's atmosphere, irregularly sampled on a two-dimensional manifold over time. The image only shows a single time step out of a long series of daily observations. Essentially, two equivalent class mappings are presented via generalized cartographic projections to three dimensions, which is a Case 2 example. The window in the upper part of the center of the figure shows the full data set, mapped to color, opacity and radial deformation. The window at the lower right shows only half (the southern hemisphere). In both cases, other data are registered in the final rendering and the same mapping applied for annotation purposes (topography and coastline data, respectively). The field is discretely sampled in one dimension (requested via widget) and shown as a plot. The remapping is packaged as a new function to create this application. The user only needs to specify the field representing the original data and the new space for the mapping.
Figure 8. Different Spaces and
Sampling.
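The remapping function can be sketched as follows; the function name and the scale parameter for radial deformation are illustrative, not taken from the implementation.

```python
import numpy as np

def to_sphere(lon_deg, lat_deg, value=0.0, radius=1.0, scale=0.0):
    """Map a (longitude, latitude) sample, in degrees, into an earth-centered
    3-D space, optionally deforming the radius by the data value."""
    lon, lat = np.radians(lon_deg), np.radians(lat_deg)
    r = radius + scale * value      # radial deformation driven by the data
    return np.array([r * np.cos(lat) * np.cos(lon),
                     r * np.cos(lat) * np.sin(lon),
                     r * np.sin(lat)])

north_pole = to_sphere(0.0, 90.0)
equator = to_sphere(0.0, 0.0)
```

Applying the same function to every sample of the ozone field, and to the annotation data, is what keeps the distinct data sets registered in the final rendering.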
Figure 9 shows different coordinate systems and sampling for a single static scalar data set (topography and bathymetry) regularly sampled on a two-dimensional manifold at much higher than screen resolution (~9.3 million quads). Two equivalent class mappings are presented (cartesian and spherical), which are Case 1 and 2 examples, respectively. The latter, as with one of the previous examples, is a generalized cartographic projection to three dimensions. Three different levels of sampling have been generated to create a simple multi-resolution hierarchy. The figure only shows two. The left-hand windows show low-resolution data and enable selection of a region of interest in either of the mappings. The right-hand windows show full-resolution data for the selected region. In addition to the spherical mapping presented at a functional level, the multi-resolution "query" is packaged at the same level as a single-resolution data set.
Figure 9. Multi-Resolution Representation.
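A toy version of this multi-resolution scheme (the sizes and strides are illustrative): coarse levels are built by strided subsampling, a region is chosen on a coarse level, and the corresponding window is then fetched at full resolution.

```python
import numpy as np

full = np.arange(64 * 64, dtype=float).reshape(64, 64)   # full-resolution field
levels = [full[::4, ::4], full[::2, ::2], full]          # coarse, medium, full

# Select a region of interest on the coarsest level ...
r0, c0 = 2, 3                                            # coarse-level indices
# ... then fetch it at full resolution (each coarse cell spans 4 x 4 points):
window = full[r0 * 4:(r0 + 1) * 4, c0 * 4:(c0 + 1) * 4]
```

Packaging this query at a functional level, as described above, lets an application treat the hierarchy as if it were a single-resolution data set.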
Figure 10 shows several data sets, both scalar and vector, that are sampled irregularly and differently on distinct three-dimensional and two-dimensional manifolds, with different irregular sampling in time. The image only shows a single time step out of a series for each of the data sets: ultraviolet intensity, proton density, temperature and speed, topography and bathymetry, and three different magnetic field data sets. Essentially, one equivalent class mapping is presented via a generalized cartographic projection to three dimensions. Unlike the previous examples, these may be either Case 2 or Case 4, because some of the data sets are discretely sampled with no information on the relationship between the samples. Each variable is processed separately by using the same remapping function, with which the user specifies the original data and the new (spherical) space. In the application that utilizes this function, several choices of realization mappings are offered for the different data sets, which are registered in the final rendering in the remapped (earth-centered, spherical) coordinate system. In addition, two of the data sets are plotted at the lower right as a function of their original time sampling.
Figure 10. Data Fusion and Irregular
Sampling in Space and Time.
Figure 11 illustrates what appear to be conventional representations of two data sets sampled irregularly on a three-dimensional manifold over time. The image only shows a single time step out of a series of computed results. But there are some important distinctions. The first is that the domain on which the data are defined is symmetric, with only one-fourth being specified by the manifold. The second is that while one of the fields is a traditional interval data set (density), the other (material) is not. Material is categorical, specifically nominal; that is, there is a "name" associated with each sample point that may or may not be related to other sample points. Therefore, for density this is Case 1, while for material it is Case 3. Hence, access to the categorical data is packaged as a new function to provide the equivalent of a traditional field as well as calculation of a derived one for sample points where the material is not homogeneous. In both cases, the relationship between the sampling manifold and the domain is hidden within the function. The application shown in the figure allows the user to specify various realizations and interact with them. In addition, the fields may be queried by invoking the new functions directly. The results are shown both as values in the three-dimensional scene as well as plotted in the lower right. In addition, the material function enables remapping between those data and density, so that isosurfaces of density can be colormapped by material or the density of specific materials can be illustrated.
Figure 11. Sampling of Interval
and Nominal Data.
Figure 12 shows three-dimensional representations of several variables. They are in fact from a set of some 60 parameters provided as a table from a relational data base management system handling credit card transactions. Effectively, these data are "sampled" by record or instance. In this case there are some 48,000 records. These data are all categorical, either ordinal or nominal. Ordinarily, this would be Case 5. However, some of the ordinal parameters may be interval-like. Hence, access is packaged as a new function, which provides the equivalent of a traditional field (Case 4) via Delaunay tetrahedralization. In this case, the user interactively chooses the mapping of parameters out of the set of 60 to three spatial dimensions to form the basis of the independent variables. In this application, up to three additional variables may be selected as being dependent on the first three for creating planar mappings, which are then pseudo-colored and contoured. In addition, the original data may be queried.
Figure 12. "Re-Sampling" of Tabular
Data.
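The first step of this "re-sampling" can be sketched as follows; the column names and values are hypothetical. A Delaunay tetrahedralization of the resulting points (e.g., via scipy.spatial.Delaunay) would then supply the connectivity needed to treat the cloud as a traditional field; that step is omitted here.

```python
import numpy as np

records = np.array([            # rows: records, columns: parameters
    # age, balance, limit, tenure   (hypothetical table columns)
    [34, 1200.0, 5000.0, 5],
    [51,  300.0, 2000.0, 12],
    [28, 4500.0, 9000.0, 2],
])
columns = ["age", "balance", "limit", "tenure"]

# The user selects three columns to act as spatial dimensions, turning each
# record into a point in 3-D: the independent variables of the new field.
chosen = ["age", "balance", "tenure"]
idx = [columns.index(c) for c in chosen]
points = records[:, idx]
```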
6. Conclusions
A study of current and on-going data
model efforts and limitations in how tools for visualization, data management
and computation map to end user requirements has led to a taxonomy of conceptual
models for visualization. In turn, this taxonomy has enabled a simple formalism
to define a set of higher-level functions that map directly to how data
may be used in visualization. To test these ideas, a collection of higher-order
functions have been implemented by leveraging the capabilities of a lower-level
data model. This has demonstrated not only the feasibility of this idea,
but of its potential applicability to a large number of problems. Current
plans include the further extension and development of these higher-order functions,
continuing to refine the formalism, and applying them to a larger class of
data.
7. Acknowledgements
The data used to create the various
figures used in this paper have been provided courtesy of NASA/Goddard
Space Flight Center, Lawrence Livermore National Laboratory and Citicorp.
8. References