Interactive archives for scientific data

Lloyd A. Treinish
IBM Thomas J. Watson Research Center
Yorktown Heights, NY 10598
914-784-5038 (voice)
914-784-7667 (fax)
lloydt@us.ibm.com
 

Preamble

The original version of this paper was written in 1993 as an outgrowth of ideas the author first developed at NASA/GSFC in the late 1980s, and efforts to prototype these notions in the early 1990s.  It was derived from experience in implementing interactive and integrated data and information systems for earth, space and environmental sciences at NASA/GSFC, and the underlying data management, analysis and visualization tools.  The motivation was to develop a scalable approach to utilizing the large data sets that would be acquired by NASA and other agencies in the 1990s and beyond, since the extant methods would be insufficient.

The work was first presented at the 1994 Goddard Conference on Space Applications of Artificial Intelligence, where it received the best paper award.  Later that year, the paper was published in Informatics and Telematics, 11, n. 4, November 1994.  Afterwards, the effort remained dormant, due primarily to a lack of interest on the part of the potential beneficiaries of this approach, until fairly recently.

The advent of the internet as an effective means of remotely accessing data, coupled with the emergence of the world-wide-web and the plunging cost of high-performance (by early 1990s' standards) desktops and servers capable of interactive graphics and imaging, has renewed interest in some of these ideas.  The growth in the capabilities of data generators, which has exceeded predictions, particularly in simulation, has resulted in the recognition that alternatives for data access and utilization are necessary.  Advances in both geometric- and image-based graphics storage and rendering methods have expanded the potential for the realization of interactive data archives.  In addition, this idea maps well to the advantages and limitations of the world-wide-web and web browsers.

The bulk of the paper that follows is from the original version, which explains the references to hardware that is obsolete by current standards.  However, the author believes that the ideas, cast in terms of modern equipment, are even more feasible.  The text is supplemented with some references to newer technologies, and the results of more recent efforts to prototype these ideas in the context of current methods.  Many of these are shown in italics for the convenience of the reader.

Introduction

The instrumentation technology behind data generators in a myriad of disciplines, whether observational or computational, is rapidly improving, especially in the earth and space sciences, computational physics, etc. The capability to produce data is typically growing much faster than the techniques available to manage and use them. Traditional bulk access to data archives does not scale well to the volume and complexity of current or planned data streams nor to demanding applications of such data, such as analysis and modelling. A fundamental change in the modes of accessing archives, from static, batch systems to dynamic, interactive systems, is required. As a first step in enabling this paradigm shift, consider appropriate data management techniques coupled with the process or methods of scientific visualization. The technologies that support visualization show promise in helping to address some of these issues.

Data management

Data management issues relevant to the access and utilization of large scientific data sets (e.g., locating, retrieving, and using data of interest) may be classified into four levels:

I.  At the back-end are warehousing issues related to problems of media, bandwidth, protocols, formats.

II.  Above that are data models, access techniques, data base systems, query languages, and programming interfaces.

III.  As a medium between the access of data and the user, consider browsing as an aid for feature identification, which serves as a guide in the data selection process.

IV.  At the front-end are human factors issues (e.g., system ergonomics, human perception) and the incorporation of domain-specific knowledge to permit a system to be usable and useful (e.g., domain-driven task-based interfaces and tools).

At level I, high-speed and high-capacity storage and networking are either available now or will be common in the near future. A chief concern is not the volume of data nor the signaling speeds of these devices, but the effective capacity and bandwidth that can be utilized by applications that require warehousing (i.e., above level I). Typical storage and communications protocols are not a match for GB archives, yet current archives are in the TB (2^40 bytes) range, while those being planned are measured in PB (2^50 bytes). In addition, accepted data distribution mechanisms such as CD-ROMs hardly scale to data rates planned for projects like NASA's Earth Observing System, which are measured in TB/day (1 TB/day is about 2000 CD-ROMs/day).  Newer technology like DVD-ROMs only reduces this by about an order of magnitude at best.  Similar data volumes are expected in other domains, whether from large-scale acquisitions (e.g., medical imagery, seismic surveys, high-energy physics) or from simulations on computers that can exceed a TFLOPS (10^12 floating-point operations/second), where the latter often imply fewer data sets, but of larger size (e.g., computational fluid dynamics).

At level II and to a limited extent level I, consider what is stored and how it is stored. There are many interfaces and structures developed by or used in the scientific community (Treinish, 1993a; Brown et al, 1993), which are driven by application, politics, tradition and requirements. For the kind of data rates that need to be supported for specific computational codes, applications, etc., it is NOT sufficient to provide only bulk access. The software that accesses the data must be able to do so in a meaningful way. Self-describing physical storage formats and structures should be used. The self-describing formats must have associated disk-based structures and software interfaces (e.g., at least abstract data types for programmers and applications, and simple utilities for those who do not wish to program). They must be transparently accessible at the desk of an end-user. These structures and interfaces must be consistent with the high-speed access to be provided. This must also include task-to-task communication (e.g., transparent workstation access to high-speed storage).
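To make the notion of a self-describing interface concrete, the following is a minimal sketch in Python of what such an abstract data type might look like to an application programmer. The class, the header layout (a small JSON description of name, units, type, shape and coordinates followed by the raw array), and the method names are all hypothetical and are not drawn from any of the formats cited above; the point is only that an application can inspect the description and then read a subset of the data without prior knowledge of the physical format.

    import json
    import struct

    import numpy as np

    class SelfDescribingField:
        """Hypothetical abstract data type for a self-describing archive field.

        Assumed file layout: a 4-byte header length, a JSON header describing
        the array (name, units, dtype, shape, coordinate labels), then the raw
        array bytes.  Applications query the description first and read only
        the subset they need.
        """

        def __init__(self, path):
            with open(path, "rb") as f:
                hdr_len = struct.unpack(">I", f.read(4))[0]
                self.meta = json.loads(f.read(hdr_len))
                self._offset = 4 + hdr_len
            self.path = path

        def describe(self):
            # Everything an application needs to interpret the data.
            return {k: self.meta[k] for k in ("name", "units", "dtype", "shape", "coords")}

        def read(self, time_index):
            # Read a single time step without loading the whole variable.
            shape = self.meta["shape"]           # e.g. [365, 7, 180, 360]
            step = int(np.prod(shape[1:]))
            dtype = np.dtype(self.meta["dtype"])
            with open(self.path, "rb") as f:
                f.seek(self._offset + time_index * step * dtype.itemsize)
                data = np.frombuffer(f.read(step * dtype.itemsize), dtype=dtype)
            return data.reshape(shape[1:])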

Such requirements demand the utilization of a generalized data model as a mechanism to classify and access data as well as to map data to operations. Further, it defines and organizes a set of data objects (Treinish, 1993a). The model (e.g., Haber et al. (1991)) must provide a self-describing, logical structure for such objects that is independent of their physical storage.

Since access to data organized via such a model is described in terms of a logical structure, a generalized data model provides the foundation for implementing a data access server, as shown in Figure 1. Without such a data model, performance of high-speed devices will not be realized. High-data-rate computations like signal processing, visualization, and some classes of modelling also require such support.

Level II requires enabling tools for data/information systems. Having efficient storage and access is critical for driving applications, but will not be directly useful for helping find data of interest. At the very least, there is a need for low-level metadata management that enables both content and context for the warehousing information. Traditionally, a relational data base management system (RDBMS) could be used for its implementation. Although this concept was first prototyped over 10 years ago, an RDBMS cannot handle semantics, spatial information, or the data volume (Treinish and Ray, 1985). The RDBMS would not contain any data per se, but would have pointers to the bulk storage, enabling simple querying of what is there and where it is stored by characteristics such as spacecraft, instrument, mission, code, date/time, investigator, and owner. It would have to be supplemented with a non-relational system to adequately support spatial metadata (e.g., Fekete (1990)), as shown in figure 1.

Figure 1. Data management layers for interactive archives.
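A minimal sketch of this catalog-of-pointers idea follows, assuming a hypothetical table layout and using an in-memory SQLite data base purely for illustration. The data base holds only descriptive attributes and a pointer into bulk storage, so a query returns candidates for retrieval by the data access server of figure 1 rather than the data themselves.

    import sqlite3

    # In-memory stand-in for the relational metadata catalog; the column
    # names (mission, instrument, start_time, ...) are illustrative only.
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE granule (
            granule_id   INTEGER PRIMARY KEY,
            mission      TEXT,
            instrument   TEXT,
            parameter    TEXT,
            start_time   TEXT,
            stop_time    TEXT,
            investigator TEXT,
            location     TEXT   -- pointer into bulk storage, not the data itself
        )""")
    db.execute(
        "INSERT INTO granule VALUES (NULL, 'Nimbus-7', 'TOMS', 'column ozone', "
        "'1986-09-24', '1986-11-23', 'Treinish', 'archive://toms/1986/268-327')")

    # A query returns pointers and context, which a data access server
    # (figure 1) would then resolve against the warehouse.
    rows = db.execute(
        "SELECT location, start_time, stop_time FROM granule "
        "WHERE instrument = ? AND parameter = ?",
        ("TOMS", "column ozone")).fetchall()
    for location, t0, t1 in rows:
        print(location, t0, t1)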

The advent in recent years of extended relational data base systems, or so-called object-relational systems (ORDBMS), such as Informix Universal Server or IBM Universal Data Base, shows some potential to go beyond this classical limitation.  The traditional method of extending an RDBMS is simply to provide access to non-relational data via a pointer to a large array or BLOB (Binary Large OBject).  Any operations on the array (e.g., query) or semantics associated with it are left as an exercise to the user.  That these ORDBMS appear to have the ability to store and query non-relational data, such as spatial scientific data or images, efficiently is encouraging.  While bulk data may still remain outside of the DBMS for the most appropriate access, abstractions as non-relational data (compressed meshes, images, animations, geometries, etc.) or metadata could reside in the DBMS.  Some implementations of ORDBMS enable the methods that are used to generate such "data" to be stored in the data base as well.  This could lead to convenient, integrated access/query for data and metadata, even if they are stored independently.  Scalability for scientific data of interesting size and complexity remains an open question.

Figure 2. User-centric control of interactive archives.
 
There are many challenges in the implementation of an effective data system at level II providing complete access to data of interest simply and easily. A key role for data systems is as a means of efficient and intelligent querying and searching for relevant data, based upon the assumption that the data volume and complexity are sufficiently large that practical examination of more than a tiny fraction of archive contents is prohibitively expensive. Therefore, the implementation of searches as meta-operations on abstractions of the archive (e.g., contextual, domain-driven, spatial, user-driven, and visual), though difficult, would be highly beneficial. Visual searches or browsing imply the perusal of pictorial representations of data or their abstractions with sufficient content for a user to determine the utility for further, more detailed examination. Practical methods of scientific visualization developed recently show promise in the implementation of visual searches. Hence, consider a very simple notion: looking at data of interest in an "appropriate" fashion to determine the merit of accessing such data for further study. Thus, visualization can be an adjunct to archive management. This idea is illustrated in figure 2, where the user is in control of an interactive selection, querying and browsing session/environment.  The support of iteration and feedback is essential for the exploratory access of data.

History

The idea of visually browsing data is hardly new. Scientists have always used visual abstractions (on paper) of their experimental results for communication. Often these same visualizations en masse were filed and later perused to help recall findings from earlier work. The mechanism for storing, scanning, and distributing such visualizations was the same as for text -- initially shelves and files of paper followed by boxes of microfiche. With the advent of digital data collection instrumentation (e.g., seismic soundings for oil prospecting, study of the earth's atmosphere from spacecraft), this same paradigm was adopted with computer-generated paper- and later microfiche-based visualizations. These visualizations were based upon the graphics technology of the era and the traditions in these fields (e.g., line drawings of iso-contours). Monochromatic microfiche became an effective means of storing visual information compactly and were relatively inexpensive to generate after the cost of the production equipment was amortized. Further, they were easily distributed, were cheap to examine remotely, and were archivable. These are important virtues to consider in a modern visualization approach to browsing that goes beyond the relatively limited medium of microfiche.  (Obviously, the use of microfiche is not being advocated herein; the point is merely how they were used when that technology was appropriate.)

For example, figure 3 is a photographic reproduction of a microfiche (approximately three inches by five inches in size) showing contour maps of daily total column ozone observed by a NASA spacecraft over Antarctica during two months of 1986. Although the system that generated the microfiche is primarily intended for interactive use in the analysis of many kinds of earth and space science data (Treinish, 1989), until relatively recently the needs of a low-cost browsing medium could only be met by employing its tools in the batch production of microfiche.
 
 
Figure 3. Azimuthal equidistant contour maps of Antarctic daily column ozone for September 24, 1986 through November 23, 1986 (from microfiche).

The advent of digital instrumentation, especially in remote sensing, created another role for browsing that is based upon the assumption that there are insufficient computing resources to support the analysis of all acquired data from all experiments of a specific project or mission. Hence, visual abstractions were created to help identify what subset of data should actually be fully processed to support scientific investigations. These graphical representations, known as summary plots, were generated on microfiche and distributed to all participants in a project. They usually contained a time sequence of simple visualizations of instrument outputs in specific modes at sufficient resolution to suitably identify a set of "events" as interesting, and thus warrant further processing. The particular presentations were chosen to highlight the signatures of such events within the limited graphical techniques that could be supported on monochromatic microfiche (line drawing, simple gray-scale polygon fill). Users of such a mechanism would receive a stack of microfiche each week, for example, and visually browse them, looking for patterns in the plots that would be indicative of something interesting (Treinish, 1982). As such, this approach represented an early form of content-based directed search.

With improvements in technology to support interactive computing, most efforts associated with data browsing at level III in the aforementioned hierarchy of archive data management and access have focused on simple image representations of image data. Furthermore, implementations are generally confined to one data set or a small number of similar data sets (cf. Simpson and Harkins, 1993; Oleson, 1992). This has been a conscious choice based upon the need for supporting only a limited domain and/or for driving interactivity. Unfortunately, these techniques are not applicable to many classes of data or when multiple disparate data sets are to be considered simultaneously, which are the characteristics of most current or planned archives.  Further, these issues also arise when one considers large-scale simulations rather than collections of archived data.

Approach

There are four issues associated with interactive data browsing using visualization from archives of scientific data, each of which is considered in the sections that follow: the abstraction of data into suitable browse products, the distribution of those products, interactivity, and the systems required to support them.

Figure 4 is a schematic of a simple interactive archive system. Each of the aforementioned levels of data management is shown, such that level I services are provided via archive, warehouse, and data distribution servers. Level II services are within the gray box: data and information server, and metadata server. Figure 1 illustrates the relationship between these components and specific data management technologies.


Figure 4. An architecture for an interactive archive system utilizing visualization browsing

Browsing (level III) is provided via the browse server and local and remote clients. At this level of detail, any level IV services would be embedded within the local and remote clients. The schematic only indicates the major interaction paths as arrows between each component. The arrow thickness corresponds to the relative bandwidth of communications, where the thickest arrows imply the greatest bandwidth (e.g., ANSI High-Performance Parallel Interface or HiPPI), and the thinnest arrow implies the least (e.g., Ethernet). A jagged arrow refers to traditional remote network communications, which today would be embodied as the world-wide-web and access via a web browser.  In addition, remote communications could be replaced by the distribution of products to a remote site via an alternate medium (e.g., CD-ROM) to be utilized with a local client at the remote site.

Abstraction

Efforts in scientific visualization typically focus on the analysis of data in a post-processing mode. This is still the emphasis even with the potential advent of computational steering or telescience, the remote sensing equivalent. Browsing is a subjective process involving the human visual system, which is consistent with one of the origins of the notion of scientific visualization as a method of computing. Consider what happens when an individual as an observer walks into a room. That person scans the room and identifies and categorizes the contents (objects) in the room. From the classification or browsing process, further interaction with those contents may take place. Hence, the requirements for qualitative browsing presentation are not the same as the requirements for analysis.

Choosing effective visualization strategies for browsing implies navigation through a complex design space.  Thus, as with primitive summary plots, browsing visualization methods should intentionally transform the structure of the data to highlight features to attract attention, scale dimensions to exaggerate details, or segment the data into regions (Rogowitz and Treinish, 1993a). Traditional image browsing is very limited in this regard. Highlighting can only be achieved by modifying a pseudo-color or false-color presentation, which may not be sufficient even for image data or data that can be represented by an image. Of course, this method cannot even be considered for data that cannot be visualized as a simple image (e.g., any non-scalar data, any data of more than two dimensions).
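The following numpy sketch illustrates the kind of intentional transformation described above. The exaggeration factor, the number of segments and the highlighting threshold are arbitrary stand-ins rather than values from any of the systems discussed; they serve only to show that a browse product may deliberately distort or segment the data in ways that an analysis product should not.

    import numpy as np

    def browse_transform(field, exaggeration=3.0, n_segments=4, low=None):
        """Transform a 2D scalar field into a qualitative browse product."""
        f = np.asarray(field, dtype=float)

        # Exaggerate departures from the mean to attract attention to extremes.
        emphasized = f.mean() + exaggeration * (f - f.mean())

        # Segment the range into a few discrete bands (e.g., for color binning).
        edges = np.linspace(f.min(), f.max(), n_segments + 1)
        segments = np.digitize(f, edges[1:-1])

        # Flag a feature of interest, e.g. anomalously low values; the
        # two-standard-deviation threshold is purely illustrative.
        if low is None:
            low = f.mean() - 2.0 * f.std()
        highlight = f < low

        return emphasized, segments, highlight

    # Example: a synthetic 180x360 field standing in for a global scalar grid.
    demo = np.random.normal(300.0, 40.0, size=(180, 360))
    emphasized, segments, highlight = browse_transform(demo)
    print(segments.max() + 1, "bands;", int(highlight.sum()), "cells highlighted")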

Visual highlighting is often ill-advised for analysis because of the intentional distortion, which can be misleading (Rogowitz and Treinish, 1996).  In general, the browsing visualization should show data in the proper context (e.g., spatial or temporal). This is especially true in the earth and space sciences, where acquired data need to be registered in some (geographic) coordinate system for viewing. If the browsing visualization is to show multiple parameters, then the highlighting techniques must preserve the fidelity of the data (e.g., regions of missing data, original mesh structure) well enough, so that the artifacts in the presentation are not erroneously interpreted as potential (correlative) features in the data. Although the focus of such techniques for browsing is not for quantitative study, they may have a role in such tasks (Treinish, 1994).
 
Distribution

One potential problem with the visual browsing of large data archives is the need to be in relatively close proximity to either the data archives or a facility that can generate the browse products (i.e., high-bandwidth access to an interactive system). In general, high-bandwidth communication between an archive and its users is not always practical, given the typical geographic dispersal of scientists and their need for utilizing more than one archive.

Since interactive browsing has usually considered only image data, the distribution of browse products has focused on the distribution of these same images in some form, usually lossy compressed, because of the typical size of image archives (Simpson and Harkins, 1993). In this sense, the browsing and visualization media and the data are all the same. Hence, the compression is of the original data. The compressed data will be easier to distribute than the original data, but compression is applicable only to a limited class of data (i.e., images). Furthermore, the resultant compressed form may not be suitable for browsing, and the quality may be poor.

Alternatively, consider the distribution of compressed visualizations instead of compressed data. These visualizations would be represented as images or geometries, so that available compression technology could be utilized. As with image data, uncompressed visualizations would be of high quality but of sufficient volume to be impractical to distribute to remote sites. The compressed visualizations would be easy to distribute, as with compressed image data. The key difference is that the compressed visualization could apply to any collection of data and would be of high quality.  Since only a representation of the data is compressed, the original data remain preserved for potential detailed and accurate analysis.  Hence, one could consider the compressed visualizations as the summary plots of the 1990s, because the lack of resources to access or distribute all archived data or their abstractions is similar to the lack of resources to process or analyze all acquired bits, which motivated the generation of summary plots in the past.

Such an approach could be extended for predefined access and browsing scenarios, where the compressed visualizations are available for on-line remote access, dynamic requests on the world-wide-web or via CD-ROM. In this case, the viewing of such compressed visualizations would require relatively simple, low-cost display software and hardware, which is readily available in desktop environments, including within web browsers. In general, the access and distribution costs associated with compressed visualizations will be similar to those associated with image data, but will, of course, be significantly less than those associated with the original data. Generating the visual browse products obviates the need to distribute the actual data, but entails an additional cost, which would be justified given the added value.

Interactivity

For visual browsing to be effective, it must be interactive. Otherwise, it is little different than watching television or traditional image browsing. One aspect of the interaction is in terms of data management: selection of and access to data of potential interest and metadata to help in guiding that process. In terms of visualization, the ability to interact with the browse "objects" (e.g., spatially, temporally) in near-real-time is critical. This requires rapid access to the data and rapid generation of the browse products.

Implementation

Abstraction

The requirements to create qualitative visualizations that are effective as browse products have implications for the software used to create them. Registration of data into an appropriate coordinate system for viewing requires the support of user-defined coordinate systems. To be able to properly show more than one data set simultaneously requires the ability to operate on different grid structures simultaneously and to transform grid geometry independent of data. Depending on the visualization strategies being used, rendered images may need to contain different geometries (e.g., points, lines, surfaces, and volumes independent of color or opacity or of original grid structure). (See Treinish (1994) for a discussion of these ideas with respect to data analysis.)

Figure 5.  Architecture of IBM Visualization Data Explorer

A commercial visualization environment (IBM Visualization Data Explorer or DX) has been used to experimentally implement the aforementioned browsing techniques. DX is a general-purpose software package for data visualization and analysis (Abram and Treinish, 1995).  It employs a client-server architecture with an extended data-flow execution model and is available on Unix workstations (e.g., Sun, Silicon Graphics, Hewlett-Packard, IBM, DEC and Data General) and Intel-based personal computers running Windows NT and Windows 95.  It was originally developed for a parallel supercomputer, IBM POWER Visualization System (PVS).  The architecture of DX is summarized in figure 5.  Its modular construction and client-server implementation are well suited to the implementation of an interactive archive as in figure 4 with the client providing the mechanism for interaction and the server providing data and representations thereof.

Distribution

Near-real-time browsing of visualizations at sufficient resolution to enable the user to see relevant features requires the distribution of a large number of images or geometries. Clearly, lossy compression is necessary to drive viewing with update rates near the refresh rate of display controllers. For utilization remote from the archive or browse server, low-cost, simple display software and hardware are needed as well. There is much literature on data compression strategies and algorithms, but these will not be discussed herein. (A few hundred citations are reported over the last decade by the National Technical Information Service alone (NTIS, 1992).) This notion of visualization distribution is built upon the extant and growing body of implementations of compression algorithms, which are being utilized in scientific, multimedia and entertainment applications.

The idea of distributing imagery for visualization is not new. For example, Johnston et al (1989) experimented with both block-truncation and Lempel-Ziv compression for the distribution of visualization animation. Rombach et al (1991) discussed the Joint Photographic Experts Group (JPEG) compression scheme for the distribution of cardiographic imagery from different sources like ultrasound, magnetic resonance imagery, and angiography. In these and other cases, the authors considered a low-cost viewing environment on the desktop as being critical, especially if the expense of generating the images to be distributed is high. Therefore, in this initial implementation, block-truncation (lossy, i.e., reduces the number of colors to represent a full-color pixel), modified Lempel-Ziv (lossless, e.g., like Unix compress), and temporal coherence between animation frames (lossy, e.g., the Moving Picture Experts Group, MPEG (LeGall, 1991)) will be considered.
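As an illustration of the first of these, the following is a minimal sketch of classic single-channel block truncation coding. The 8:1 variant referred to in the text operates on full-color pixels and differs in detail (this per-channel form yields 2 bits per 8-bit pixel, or 4:1), but the core idea is the same: each small tile is reduced to two reconstruction levels plus a one-bit mask per pixel, preserving the tile's mean and variance.

    import numpy as np

    def btc_encode(gray, block=4):
        """Classic block truncation coding of an 8-bit grayscale image."""
        h, w = gray.shape
        levels, masks = [], []
        for i in range(0, h, block):
            for j in range(0, w, block):
                tile = gray[i:i + block, j:j + block].astype(float)
                mu, sigma = tile.mean(), tile.std()
                mask = tile >= mu                  # one bit per pixel
                q, m = max(int(mask.sum()), 1), tile.size
                if q == m:                         # constant tile
                    a = b = mu
                else:
                    a = mu - sigma * np.sqrt(q / (m - q))     # "low" level
                    b = mu + sigma * np.sqrt((m - q) / q)     # "high" level
                levels.append((a, b))
                masks.append(mask)
        return levels, masks

    def btc_decode(levels, masks, shape, block=4):
        out = np.empty(shape, dtype=float)
        k = 0
        for i in range(0, shape[0], block):
            for j in range(0, shape[1], block):
                a, b = levels[k]
                out[i:i + block, j:j + block] = np.where(masks[k], b, a)
                k += 1
        return out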

To illustrate the viability of this approach, consider a modest data set, composed of a rectilinear scalar field of 32-bit floating-point numbers (e.g., atmospheric temperature) at one-degree geographic resolution at seven levels in the earth's atmosphere. Each time step would require about 1.7 MB (1 MB = 2^20 bytes). If these are daily data, then less than a year would fit on a single CD-ROM, uncompressed. This does not include ancillary data required for annotation, such as coastline maps, topography and other reference material. Image-oriented lossy compression is not directly relevant since the data are not imagery. Lossy compression could be applied to each layer of the atmosphere individually; however, the results would be rather poor (i.e., the two-dimensional spatial resolution is already low, 180 x 360), and spatial coherence for the entire volume could not be maintained. If losslessly compressed, decompression could be expensive (e.g., using Lempel-Ziv) or inconvenient (e.g., using scaled/encoded 12- or 16-bit integers). Either compression approach is highly sensitive to the contents of the data set.

Alternatively, visualization compression is independent of data characteristics, and only the resolution of the visualization(s) drives the compression/distribution/decompression cost. Of course, the distribution of uncompressed browse visualizations is expensive, potentially more than that of the uncompressed data. Lossless compression, although cheaper to distribute, would still require the decompression process. The decompression of losslessly compressed images could also be more expensive than the decompression of losslessly compressed data. Hence, the lossy compression of the visualization imagery is the best approach both from a cost perspective as well as from the perspective of image quality. For a sequence of 640x480 24-bit image representations of the simple volumetric data set, over 14 years' worth of frames for such daily data could be stored on a single CD-ROM using a simple 8:1 block truncation compression (i.e., each 24-bit pixel is represented by three bits). Using 32:1 JPEG compression, a sequence of over two years of these images at hourly resolution could be stored on a single CD-ROM.  Obviously, such compression makes utilization via the world-wide-web quite feasible.
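The figures quoted in the preceding two paragraphs can be checked with a few lines of arithmetic. The CD-ROM capacity assumed below (about 650 MB) is an assumption typical of the period rather than a number taken from the text.

    MB = 2 ** 20
    CD_ROM = 650 * MB                      # assumed CD-ROM capacity of ~650 MB

    # Raw data: one-degree global grid, 7 levels, 32-bit floats, daily.
    step = 180 * 360 * 7 * 4               # bytes per time step
    print("per time step: %.2f MB" % (step / MB))            # ~1.73 MB
    print("one year of steps: %.0f MB" % (step * 365 / MB))  # ~632 MB, roughly one CD-ROM

    # Browse visualization: 640x480 24-bit frames at 8:1 block truncation.
    frame = 640 * 480 * 3 // 8             # bytes per compressed daily frame
    frames = CD_ROM // frame
    print("daily frames per CD-ROM: %d (~%.0f years)" % (frames, frames / 365.0))

    # The same frames at 32:1 JPEG, stored hourly.
    hourly = CD_ROM // (640 * 480 * 3 // 32)
    print("hourly frames per CD-ROM: %d (~%.1f years)" % (hourly, hourly / (24 * 365.0)))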

Most of the work in compression has focused on two-dimensional forms like images and video.  Recently, however, efforts to develop algorithms to compress three-dimensional, geometric shapes have been initiated because of the growth of polyhedral models in domains ranging from scientific visualization to computer-aided manufacturing and design to video games.  The application of such models has been limited because of their size and subsequent demands on storage, transmission and rendering performance in remote utilization over the internet (e.g., the Virtual Reality Modeling Language, VRML -- http://www.vrml.org), immersive environments, etc. (Taubin and Rossignac, 1996).  In particular, compression ratios of up to 50:1 can be obtained for large models by compressing connectivity data without loss of information (i.e., to approximately two bits per triangle), a process known as topological surgery.  Variable length encoding with optionally lossy quantization is used for spatial coordinates, normals, colors, and texture coordinates.
 
Interactivity

For browsing to be effective it must be interactive with near-real-time system response. With data sets of interesting size, e.g., >= O(1 GB), immediate interaction cannot take place on current conventional systems (i.e., graphics workstations). Even though a 1 GB data set is admittedly modest by today's standards for data generation, the access and visualization of an entire data set or even a large fraction of it for browsing may place significant burdens on the floating point and bandwidth capacities of the computer system being employed. The bandwidth requirements are derived from the bulk access speeds of large data sets and the transmission of images sufficiently fast to be interactive. The floating point requirements stem from three classes of computation for visualization: the transformation of data (e.g., coordinate conversion or regridding), the realization of data as renderable geometry (e.g., color and opacity mapping, deformation), and rendering itself.

Although the visualization requirements are different, the computational needs of interactive browsing are very similar to those of visualization in a virtual world environment (Bryson and Levit, 1991).

Early experimentation with a commercial parallel supercomputer (PVS) and the aforementioned DX environment has shown the viability of such interactive visual browsing, even with multiple data sets. In this effort, the PVS and DX combination has been used as the browse server with local and remote clients on workstations as indicated in figure 4. The PVS functions as an archive server in this context. Figure 6 shows the relationship between the browse server and the data and information server and its components in the interactive archive system shown in figure 4.

Figure 6. Architecture for a data and information system incorporating visualization browsing.

The IBM POWER Visualization System (PVS), introduced in 1991, was a medium-grain, coherent shared-memory parallel supercomputer with the interactivity of a workstation. This was achieved not via special-purpose hardware but instead via a programmable (general-purpose) approach that maintained balance among floating point performance via moderate parallelism, large physical memory, and high-speed external and internal bandwidth. The PVS consisted of three major hardware components: server, disk array, and video controller. The server was a coherent, shared-memory, symmetric multi-processor with up to 32 processors (40 MHz Intel i860XR or 44 MHz Intel i860XP), a 1.28 GB/sec (at 40 MHz) internal backplane supporting a hierarchical bus structure, hierarchical memory (16 MB local memory per processor and up to 2 GB global/shared memory), ANSI HiPPI communications, fast and wide SCSI-2, and an IBM RISC System/6000 support processor (for local area network and storage access). The server supported parallelized computations for visualization via DX. The disk array was a HiPPI-attached RAID-3 device with either 50 MB/sec or 95 MB/sec sustained access speeds, or a fast and wide SCSI-2 four-bank RAID-3 device with 76 MB/sec sustained access speeds. It provided access to archived data to be browsed. The video controller was a programmable 24-bit double-buffered frame buffer with 8-bit alpha overlay (for custom XWindow server) attached to an IBM RISC System/6000 workstation. It received images from the PVS server via HiPPI (either compressed or uncompressed) at resolutions up to 1920x1536, including HDTV, for real-time image updates. The video controller provided an interface for interaction with and viewing of the browsing visualization at speeds up to 95 MB/sec.

Results

Abstraction

Figure 7 shows a traditional two-dimensional visualization of ozone data similar to those shown in figure 3. The data are realized with a pseudo-color map and iso-contour lines for September 30, 1992. The rectangular presentation of the data is consistent with the provided mesh in that it is torn at the poles and at a nominal International Date Line. The ozone data are overlaid with a map of world coastlines and national boundaries in magenta as well as fiducial lines (of latitude and longitude) in white.

Figure 7.  Pseudo-color-mapped global column ozone on September 30, 1992.

To provide a qualitative impression for browsing, the data are transformed to a three-dimensional continuous spherical surface in figure 8. The ozone is triply redundantly mapped to radial deformation, color, and opacity so that high ozone values are thick, far from the earth, and reddish while low ozone values are thin, close to the earth, and bluish. Replacing the map for annotation is a globe in the center of this ozone surface. The use of three redundant realization techniques results in textures for qualitatively identifying regions of spatial or temporal interest. The gauges on the left illustrate the daily total ozone statistics. The pseudo-hour hand position ranges from 100 to 650 Dobson Units, while the color corresponds to that of the ozone surface. From the top they show the mean, minimum, and maximum for each day. The value corresponding to the geographic view for each frame is shown next. At the bottom is a bar chart indicating the standard deviation of the daily measurements. This approach to qualitative visualization is potentially applicable to a large variety of simulated or observed earth and planetary data on a large spatial scale, especially for two- and three-dimensional scalar fields.

Figure 8. Radially deformed pseudo-color and opacity-mapped spherically warped surface of global column ozone on September 30, 1992 with annotation.
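A minimal sketch of the triply redundant mapping described above is given below. The radius offset, color ramp and opacity ramp are hypothetical stand-ins for whatever the actual visualization network used to produce figure 8; the sketch is intended only to show how a single scalar (column ozone in Dobson Units) can simultaneously drive deformation, color and opacity on a spherical surface.

    import numpy as np

    def ozone_to_surface(ozone, lat, lon, base_radius=1.0, relief=0.3,
                         lo=100.0, hi=650.0):
        """Map a column ozone field (Dobson Units) on a lat/lon grid to a
        radially deformed, colored, translucent spherical surface."""
        t = np.clip((ozone - lo) / (hi - lo), 0.0, 1.0)     # normalized value

        # 1. Radial deformation: high ozone is far from the globe ("thick").
        r = base_radius + relief * t

        # 2. Color: blue (low) to red (high), as an illustrative RGB ramp.
        rgb = np.stack([t, 0.2 * np.ones_like(t), 1.0 - t], axis=-1)

        # 3. Opacity: high ozone is more opaque, low ozone nearly transparent.
        alpha = 0.2 + 0.7 * t

        # Spherical surface positions from the deformed radius.
        phi = np.radians(lat)[:, None]
        lam = np.radians(lon)[None, :]
        x = r * np.cos(phi) * np.cos(lam)
        y = r * np.cos(phi) * np.sin(lam)
        z = r * np.sin(phi) * np.ones_like(lam)
        return np.stack([x, y, z], axis=-1), rgb, alpha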

A browsing animation of the ozone would be one that illustrates the data on a daily basis as in figure 8 for the entire archive of available data (i.e., from late 1978 through early 1993) or at least a useful subset. The geographic view of these data would change with each day to provide reasonable coverage of the entire globe over a complete year. The view would be chosen to concentrate on interesting regions such as the poles during appropriate seasons like Spring. Treinish (1993b) presents examples of such browsing animations. Since these animations were produced, they have been useful for identifying periods of time or geographic regions that warrant further study.

These browsing animation sequences are derived from about 1 GB of data, a two-dimensional scalar field over a torn geographic mesh. For each of the more than 5800 frames of the sequence, several calculations are required to create a visualization image. Each image is actually composed of two images that have been blended. There is a background image, which is composed of frame-variant contents under a constant view: primarily opaque polygonal text, dials, and bars. This annotation changes to summarize daily statistics. There is a foreground image, which is also composed of frame-variant contents, but with a frame-variant view. Each foreground image contains a static globe with the surrounding translucent ozone surface. For each day, the ozone data are transformed (irregularized to remove regions of missing data and warped onto a sphere), realized (color and opacity mapped and surface deformed), and full-color rendered (about 45,000 to 50,000 translucent quads for the ozone; 259,200 full-color-mapped opaque quads on a sphere with normals for the globe). The foreground and background images are brightened and blended to compose each final frame. Each frame at workstation resolution (about 1.2 million pixels) using DX on a 32-way (40 MHz) PVS required about 12 seconds of computing time. Hence, the entire animation took about 15 hours of compute time for transformation, realization and software rendering to generate at that resolution.  It should be noted that modern, high-end graphics workstations, with SMP configurations similar to that of the PVS for the non-rendering tasks, are capable of generating animations of this class at interactive speeds.  Hence, the notion of a local interactive browse server is feasible today.  For remote utilization, parallelized visualization operations on a multi-processor server remain the appropriate mechanism.

Distribution

Earlier efforts on data distribution have focused on the application of compression techniques to three sample animation sequences on a 32-way (40 MHz) PVS equipped with a RAID-3 HiPPI disk array capable of 50 MB/sec sustained access speeds. The first example is a high-resolution sequence of 2040x1536 32-bit (8-bits of red, green, blue, alpha) images, 88 frames in length totalling about 1052 MB. An 8:1 block-truncation lossy compression (i.e., each 32-bit pixel is represented by 4 bits) required about 41 seconds, resulting in a rate of approximately 26 MB/second or 2.15 Hz disk to disk. Lempel-Ziv lossless compression was applied to the entire sequence as a whole, not on a frame-by-frame basis. As expected the results took considerably longer, requiring about 2 minutes, 2 seconds, yielding a rate of approximately 8.7 MB/second disk-to-disk to achieve 45.4% compression. In both cases, the compression algorithms were parallelized on the PVS.

The second example is a 151-frame sequence of 640x480 32-bit images of about six months worth of animation similar to that illustrated in figure 7. Results with this considerably smaller collection (about 177 MB) are quite similar, pointing to the potential scalability of shared-memory, symmetric multiprocessor systems like a PVS to this problem. For the 8:1 lossy compression, about eight seconds were required, yielding a rate of approximately 23 MB/second or 18.9 Hz disk to disk. The lossless compression of the entire sequence required about 22 seconds to achieve 62.6% compression at 8.0 MB/second disk-to-disk.

The third example is 5853 frames of a digital video (D1) sequence (Treinish, 1993a). Most of the sequence consists of frames similar to Figure 7 -- one for each day from January 1, 1979 through December 31, 1991. Each D1 frame is composed of 10-bits each of YUV (a chrominance and intensity-based specification of color) at 720 x 487 resolution for playback at 30 Hz. Hence, this 3 minute, 15 second sequence is 10.6 GB in size, which is maintained as a single file on a PVS. A PVS-based (parallelized software) MPEG compression facility was used to create an approximately 225:1, lossy-compressed MPEG-1 sequence. About 12 minutes, 5 seconds were required for this operation, yielding a rate of 14.7 MB/second or 8 Hz, disk-to-disk.

The results from the early compression experiments are summarized in the following table.
 
Frames   Resolution   Colors        Size (GB)   Compression                      Speed (MB/sec)   Rate (Hz)
88       2040x1536    32-bit RGBA   1.03        8:1 (lossy, block-truncation)    26               2.15
88       2040x1536    32-bit RGBA   1.03        2.2:1 (lossless, Lempel-Ziv)     8.7              --
151      640x480      32-bit RGBA   0.17        8:1 (lossy, block-truncation)    23               18.9
5853     720x487      30-bit YUV    10.6        225:1 (lossy, MPEG-1)            14.7             8
 
More recent efforts at developing mechanisms for browsing have focused on disseminating results from high-resolution, operational numerical weather prediction models.  In particular, such browse products can be used for overall assessment of the data and as source material for distribution of forecast information potentially suitable for the non-meteorologist (e.g., media, World-Wide-Web, etc.) (Treinish and Rothfusz, 1997).  An example of such a visualization is shown in figure 9.  It shows a three-dimensional illustration of clouds as a translucent, white isosurface of cloud water density at 10^-5 kg/kg registered with a terrain map.  This familiar-looking representation can effectively show gross atmospheric motion, convective activity, and potential distribution of moisture.  The terrain is pseudo-colored by total precipitation to further enhance the information portrayed by indicating where and how much rainfall is predicted by the model.  This type of visualization, particularly in animation, represents a compact four-dimensional presentation of the results of a large simulation.

Figure 9.  A numerical weather prediction of thunderstorm activity in Georgia on August 4, 1996 at 9:00 pm EDT.

Translucent, cyan isosurfaces are forecast radar reflectivities at a threshold of 25 dBz, approximating rain shafts.  These results are registered in a terrain-following, stereographic grid with a topographic surface, color-coded by total precipitation, where heavy rainfall is shown as blue puddles.  The surface is overlaid with state (white) and coastline (black) maps and vector arrows of surface wind velocity, color-coded by speed.  The locations of Atlanta and Savannah are indicated on the map.  The results were computed at 8 km horizontal resolution and are shown for the time during the Closing Ceremonies of the Atlanta Summer Olympic Games. The model, as illustrated through three-dimensional visualization, correctly predicted thunderstorm activity in the vicinity of Atlanta and the Closing Ceremonies, but NOT over the city itself. Thus, forecasters were able to give Olympic officials an "all clear" for the Closing Ceremonies despite thunderstorms in the area. A 144-frame MPEG animation of these data illustrates the formation of a cluster of thunderstorms and its dissipation during a 24-hour prediction.

Realizations such as these are composed of several polygonal surfaces, which are then rendered.  At the resolution of the model, the image in figure 9 would represent over 343 thousand triangles -- a considerable volume to download via the world-wide-web.  To address this problem, each individual component is separately simplified using a scheme based upon volume tolerance (Gueziec, 1996).  The algorithm attempts to construct a simplified surface such that the distance from each point (that is, vertices and points contained inside triangles) of the simplified surface to the original surface will be less than the maximum distance bound and also such that the distance from each point of the original surface to the simplified surface will be less than that bound.  The volume of a solid bounded by the surface is also preserved.  Figure 10 shows the results of applying this algorithm to the visualization in figure 9, where the simplified representation is shown on the right.  It is composed of only 40 thousand triangles (less than 12% of the original volume).


Figure 10.  An example geometric simplification of multiple surfaces composing a browsing visualization.
 
While the differences between the geometries are apparent in the side-by-side images, the simplified form does preserve the salient features.  A version of the geometry on the right is available to examine as a VRML model.
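The effect of the simplification on download size can be estimated with a rough calculation. The per-triangle costs assumed below describe a generic uncompressed indexed triangle set (three 32-bit coordinates per vertex, three 32-bit indices per triangle, and roughly half as many vertices as triangles); they are assumptions for illustration, not measurements of the actual VRML files behind figures 9 and 10.

    def mesh_bytes(n_triangles, vertices_per_triangle=0.5,
                   bytes_per_vertex=12, bytes_per_index=4):
        # Indexed triangle set: xyz floats per vertex plus 3 indices per triangle.
        n_vertices = n_triangles * vertices_per_triangle
        return n_vertices * bytes_per_vertex + n_triangles * 3 * bytes_per_index

    original = mesh_bytes(343_000)
    simplified = mesh_bytes(40_000)
    print("original:   %.1f MB" % (original / 2**20))    # ~5.9 MB
    print("simplified: %.1f MB" % (simplified / 2**20))  # ~0.7 MB

    # Connectivity at ~2 bits per triangle (topological surgery) would shrink
    # the index portion of the original mesh from ~3.9 MB to roughly 84 KB.
    print("compressed connectivity: %.0f KB" % (343_000 * 2 / 8 / 1024))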

Although these results are encouraging, many limitations in this technology persist.  The utilization of a geometric representation on the world-wide-web is conceptually similar to the vector information used to create contour line plots for remote perusal in systems from the 1980s (e.g., Treinish, 1989).  Therefore, even though there is interactivity with the geometry on an appropriate workstation once the model is downloaded, it is insufficient for a browsing application.  The ability to interrogate the geometry and connect it back to the data is essentially non-existent.  The lack of support for time sequences is a further handicap.  Efforts to enhance VRML and VRML browsers to support these capabilities are under discussion.

Interactivity and systems

Figure 11 is a snapshot of a DX Motif-based interface for a prototype interactive browsing system. It provides very simple modes of interaction: selection of space and time (i.e., geographic regions or seasons of interest) for browsing via the spherically warped presentation shown in figure 8. There are dial widgets for the specification of the geographic viewing centroid of the global "object", and slider widgets for selection of the year to examine. Other Motif widgets, most of which are not visible in figure 11, provide choices of other visualization techniques, use of cartographic presentations, selection of analysis techniques, etc. A VCR-like widget provides control over the choice of the portion of the year to browse by Julian day. Optionally, a zonal slice of the data being browsed at the specified longitude may be shown as a pseudo-colored line plot of the latitudinal distribution. The data illustrated in figure 11 are the aforementioned global column ozone data derived from the same 14-year archive as shown in figures 3, 7 and 8.


Figure 11. Example user interface for browsing showing global column ozone on September 30, 1992.
 
The prototype browsing system is built as a client-server, consistent with the architecture of DX, as shown in figure 5. A PVS functions as the browse server in this implementation. High-speed display of browsing visualizations is local to the PVS. Remote display is via standard XWindow services with update rates limited to what the network infrastructure can provide. This prototype also operates at lower performance (i.e., less interactive) with a Unix workstation as a browse server.
 
The visualization shown in figure 9 is from an operational application used to assess and interact with the results from a mesoscale weather model.  A screen dump of the DX-based application is shown in figure 12.  It has the ability to view and interact with the data in a latitude-longitude (from the model's stereographic grid)-altitude (from the model's height calculations) coordinate system.  The view is annotated with base maps.  The focus of this application is qualitative assessment (browsing) of the model output.  It has the ability to create both time-based and key-frame (flyover) animation.  There is a control panel of indirect interactors, which provides the ability to select specific surface and upper air variables to visualize with pre-defined methods.  It also provides control over the creation of key-frame animation.  Another panel allows the user to select the particular model run of interest, including one that may still be executing, and to create products for web pages.  Since the data produced by the simulation need to be analyzed in a timely fashion, they are not archived.  Thus, the aforementioned warehouse server is replaced by a computational server (an IBM RS/6000 SP) that executes the weather forecasting model.  The browse server ideally is an SMP workstation with graphics hardware support.


Figure 12.  Example user interface for numerical weather prediction model browsing and tracking.
 
Another approach to remote, interactive access on the world-wide-web is through image-based rendering.  In this case, a three-dimensional scene is defined by a small set of images (e.g., interior faces of a bounding box), which are rendered from the geometric realization of the data of interest.  A browser uses these images to stitch together a derived image for a current view.  Such a panoramic viewer locates the viewpoint in one spot in relation to the image.  Users see the scene from only that vantage point, but the movement is very smooth and the view is highly realistic.  In a geometric browser, like a VRML viewer, a user can freely move the viewpoint anywhere in a 3D scene, but the motion is jerky, especially if the scene has realistic-looking texture maps, unless high-end graphics hardware is available.  A panoramic viewer is strictly software-based and can operate on commodity personal computers.  These experiments have been done with IBM PanoramIX, which provides a non-immersive virtual viewer for panoramic scenes as a web browser plug-in.  With PanoramIX, users can interactively explore panoramic images compiled from photographs, rendered images, and even scanned hand-drawn sketches.  It features zoom and pan on complex scenes, and provides straight up and down views on virtual scenes.  Figure 13 is a screen dump of the PanoramIX browser applied to the weather forecasting example shown in Figure 9.  It shows much promise of highly interactive utilization for lightweight clients.


Figure 13.  Image-based interaction of weather forecasting scene in a web browser.

Current personal computer technology provides computational and graphics capabilities that exceed those of today's low-end Unix workstations or mid-range systems of the recent past.  Since the former are available at a very modest cost, they are becoming commonplace on desktops at work and home.  However, the "traditional" visualization systems, whether turnkey packages or development toolkits, will be unapproachable for the typical user of such personal computers, even though these machines have more than adequate facilities for supporting such software.  We are already seeing artifacts today that justify this assertion -- the availability of such conventional visualization tools on low-cost Unix and especially Windows 95/NT desktops has given access to powerful facilities in non-research environments.  In these operational, mission-critical, and often commercial settings today, there is often a mismatch in interface.  On the typical personal computer desktop this mismatch will be exaggerated by the expectation of the usual shrinkwrap packaging of software.  Although traditional users of visualization tools on Unix workstations will employ them on personal computers, they represent only a small fraction of personal computer users.

Figure 14.  Fluid client-server architecture for visualization.

An initial response to this dilemma is to consider the ubiquitous web browser as the vehicle to deliver visualization services to the cheap desktop via a client-server relationship.  This is hardly a new idea. However, the implementations to date have generally been based upon the notion that the desktop is a very lightweight client with fixed functions networked to a server, which is doing the bulk of the visualization work.  Such an approach does not take advantage of the growing power of these desktop systems. Instead consider an enhancement of the client-server, where the division of labor is not fixed.  This notion is illustrated in figure 14, which shows a fluid division between the client and server.  The boundary between the client and server roles is dictated by two factors, which are functions of the visualization task and requirements as well as the capability of the desktop.

To enable such an approach for ubiquitous visualization, the traditional visualization functions, whether operating on the client or on the server (i.e., the latter will still be dominated by Unix systems, especially on the high-end), must live within the web browser.  Since browsers are evolving rapidly from competitive pressures not influenced by visualization, visualization tools must learn to adapt to the visual media of web browsers.  Although not ideal, there does exist infrastructure for delivery of images, animations, geometry and interactions.

A recent effort by the Image Information Systems group at IBM Thomas J. Watson Research Center has been to create an architecture for the interactive and remote querying and retrieval of archived, large multi-spectral images (Li et al, 1997).  Given the volume of typical, multi-spectral images, access is expensive.  Therefore, operations such as feature-specific classifiers, template-matching, etc. are done in the (wavelet) compressed domain, to achieve one to two orders of magnitude speed-up.  The data are organized into a multi-resolution pyramid, with either lossless or lossy compression.  The system is explicitly organized as a client-server for interactive utilization, where the client is assumed to be lightweight running a web browser, within which the user can specify a query, and display and interact with the results.  The server parses and executes the query, performs image processing operations, compression and decompression, generates and maintains a metadatabase, etc.

Figure 15.  System architecture for remote, content-based search through image archives.

The architecture of the initial implementation of this framework is shown in Figure 15.  It consists of Java clients, an HTTP server, the search engine, a database management system (IBM DB2), an index and image archive, and a library containing various feature extraction, template matching, clustering and classification modules.
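The multi-resolution, coarse-to-fine idea can be sketched independently of the actual system. The following Python fragment substitutes simple 2x2 averaging for the wavelet transform and brute-force squared-difference matching for the real classifiers and template matchers; the function names are hypothetical and do not correspond to the interfaces of the system described above.

    import numpy as np

    def build_pyramid(image, levels=3):
        """Successive 2x2 averaging as a stand-in for a wavelet approximation."""
        pyr = [np.asarray(image, dtype=float)]
        for _ in range(levels):
            a = pyr[-1]
            h, w = (a.shape[0] // 2) * 2, (a.shape[1] // 2) * 2
            a = a[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
            pyr.append(a)
        return pyr

    def match_coarse_to_fine(pyr, template_pyr):
        """Find the best match at the coarsest level, then refine locally."""
        def best(img, tmpl):
            th, tw = tmpl.shape
            score = np.full((img.shape[0] - th + 1, img.shape[1] - tw + 1), np.inf)
            for i in range(score.shape[0]):
                for j in range(score.shape[1]):
                    score[i, j] = np.sum((img[i:i + th, j:j + tw] - tmpl) ** 2)
            return np.unravel_index(np.argmin(score), score.shape)

        i, j = best(pyr[-1], template_pyr[-1])
        # Refine: scale the coarse location up to full resolution and search
        # only a small neighborhood there instead of the whole image.
        scale = 2 ** (len(pyr) - 1)
        i0, j0 = i * scale, j * scale
        th, tw = template_pyr[0].shape
        window = pyr[0][max(i0 - scale, 0): i0 + scale + th,
                        max(j0 - scale, 0): j0 + scale + tw]
        di, dj = best(window, template_pyr[0])
        return max(i0 - scale, 0) + di, max(j0 - scale, 0) + dj

    # Example: locate a 16x16 patch inside a 256x256 synthetic image.
    rng = np.random.default_rng(0)
    img = rng.normal(size=(256, 256))
    tmpl = img[96:112, 40:56]
    loc = match_coarse_to_fine(build_pyramid(img), build_pyramid(tmpl))
    print(loc)   # expected (96, 40)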

Conclusions

A study of qualitative methods of presenting data shows that visualization provides a mechanism for browsing independent of the source of data and is an effective alternative to traditional image-based browsing of image data. To be generally applicable, however, such visualization methods must be based upon an underlying data model with support for a broad class of data types and structures.

Interactive, near-real-time browsing for data sets of interesting size today requires a browse server of considerable power. A symmetric multi-processor with very high internal and external bandwidth demonstrates the feasibility of this concept. Although this technology is likely to be available on the desktop within a few years, the increase in the size and complexity of archived data will continue to exceed the capacity of "workstation" systems. Hence, a higher class of performance, especially in bandwidth, will generally be required for on-demand browsing.

A few experiments with various digital compression techniques indicate that an MPEG-1 implementation within the context of a high-performance (i.e., parallelized) browse server is a practical method of converting a browse product to a form suitable for network or CD-ROM distribution.

Future work

From this initial prototype implementation of an interactive data browser, there are several areas for future work. Since practical low-cost decompression of JPEG-compressed images is becoming available on the desktop, experimentation with JPEG is warranted (Pennebaker and Mitchell, 1993). As with JPEG, the MPEG-1 motion video compression technique is becoming available for multimedia applications of video sequences on the desktop, whether the animation is distributed via network or via CD-ROM. Additional testing with MPEG-1, and with the higher-quality MPEG-2 as it becomes available, is required within the browse server as well as on various desktop playback systems.

The second area of research would focus on fleshing out more of the interactive archive architecture schematically illustrated in Figures 1, 2, 4 and 6. Specifically, the prototype interface and visualization could migrate to a data-driven one conceptually similar to the primitive implementation discussed by Treinish (1989), which could be integrated with metadata and data servers to achieve a browsing archive system (i.e., with data management services at the aforementioned levels I, II and III). This would also imply the availability of integrated data and information services similar to those in the rudimentary system described by Treinish and Ray (1985). Such an approach would be further enhanced by the integration of the prototype browsing system with tools for data analysis, which are already available.

The strategies for qualitative visualization have focused on only a few methods for spherically-oriented data with large spatial extent. Clearly, investigation of alternative approaches to highlighting features in such data is required, for which there are a number of potential issues (Rogowitz and Treinish, 1993). In addition, the extension of this browsing architecture to other classes of data is also warranted.

References

Abram, G. and L. A. Treinish. An Extended Data Flow Architecture for Data Analysis and Visualization. Proceedings IEEE Visualization '95, pp. 263-269, October 1995.

Brown, S. A., M. Folk, G. Goucher and R. Rew. Software for Portable Scientific Data Management. Computers in Physics, 7, n. 3, pp. 304-308, May/June 1993.

Bryson, S. and C. Levit. The Virtual Wind Tunnel: An Environment for the Exploration of Three-Dimensional Unsteady Flows. Proceedings IEEE Visualization '91, pp. 17-24, October 1991.

Fekete, G. Rendering and Managing Spherical Data with Sphere Quadtrees. Proceedings IEEE Visualization '90, pp. 176-186, October 1990.

Gueziec, A. Surface Simplification Inside a Tolerance Volume. IBM Technical Report, 1996.

Haber, R., B. Lucas and N. Collins. A Data Model for Scientific Visualization with Provisions for Regular and Irregular Grids. Proceedings IEEE Visualization '91 Conference, pp. 298-305, October 1991.

Johnston, W. E., D. W. Robertson, D. E. Hall, J. Huang, F. Renema, M. Rible, J. A. Sethian. Video-based scientific visualization. Proceedings of a Workshop on Geometric Analysis and Computer Graphics, University of California at Berkeley, Berkeley, CA, pp. 89-102, May 1989.

LeGall, D. J. The MPEG Video Compression Standard. Proceedings IEEE COMPCON Spring '91, pp. 334-335, 1991.

Li, C.-S., J. J. Turek, and E. Feig. Progressive Template Matching for Content-based Retrieval in Earth Observing Satellite Image Databases. Accepted for publication in IEEE Transactions on Signal Processing, 1997.

National Technical Information Service. Data Compression (Citations from the NTIS Database). NTIS Document PB92-802750, U. S. Department of Commerce, Springfield, VA, 1992.

Oleson, L. The Global Land Information System. Eros Data Center, U. S. Geological Survey, Sioux Falls, SD, 1992.

Pennebaker, W. B. and J. L. Mitchell. JPEG: Still Image Data Compression Standard. Van Nostrand Reinhold, 1993.

Rogowitz, B. E. and L. A. Treinish. Data Structures and Perceptual Structures. Proceedings of the SPIE/SPSE Symposium on Electronic Imaging, 1913, pp. 600-612, February 1993.

Rogowitz, B. E. and L. A. Treinish. An Architecture for Perceptual Rule-Based Visualization. Proceedings IEEE Visualization '93, pp. 236-243, October 1993.

Rogowitz, B. E. and L. A. Treinish.  How Not to Lie with Visualization. Computers in Physics, 10, n.3, pp. 268-274, May/June 1996.
 
Rombach, M. R., U. Solzbach, U. Tites, A. M. Zeiher, H. Wollschlager, H. Just. PACS for cardiology - Perspective or Fiction? Proceedings IEEE Computers in Cardiology, Venice, Italy, pp. 77-79, September 1991.

Simpson, J. J. and D. N. Harkins. The SSable System: Automated Archive, Catalog, Browse and Distribution of Satellite Data in Near-Real Time. IEEE Transactions on Geoscience and Remote Sensing, 31, n.2, pp. 515-525, March 1993.
 
Taubin, G. and J. Rossignac. Geometric Compression Through Topological Surgery. IBM Research Technical Report RC-20340, January 16, 1996.

Treinish, L. A. The Dynamics Explorer Summary Plot Software System. NASA/Goddard Space Flight Center, Internal Report, March 1982.

Treinish, L. A. An Interactive, Discipline-Independent Data Visualization System. Computers in Physics, 3, n. 4, July 1989.
 
Treinish, L. A. Unifying Principles of Data Management for Scientific Visualization. Animation and Scientific Visualization Tools and Applications (R. Earnshaw and D. Watson, editors), Academic Press, pp. 141-169, 1993.
 
Treinish, L. A. Climatology of Global Stratospheric Ozone (1979 through 1991). ACM SIGGRAPH Video Review, 93, August 1993.

Treinish, L. A.  Visualization of Disparate Data in the Earth Sciences, Computers in Physics, 8, November/December 1994.

Treinish, L. A. and S. N. Ray. An Interactive Information System to Support Climate Research. Proceedings of the First International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography and Hydrology, American Meteorology Society, pp 72-79, January 1985.

Treinish, L. A. and L. Rothfusz. Three-Dimensional Visualization for Support of Operational Forecasting at the 1996 Centennial Olympic Games. Proceedings of the Thirteenth International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography and Hydrology, American Meteorology Society, February 1997.

