|
Executive Summary
A primary feature of public health problems is that they typically affect large numbers of people over extended geographic regions. While detailed methodologies have been developed within epidemiology to characterize non-spatial aspects of
risk assessment, the use of spatial data for risk evaluation and intervention has lagged. Preliminary studies indicate that spatial information, such as remotely sensed data, adds significantly to our ability to
understand these public health problems. However, these problems usually reflect temporally and spatially dynamic processes that are not sufficiently captured by a single, static data set and simple queries.
Comprehensive investigation thus far has often been confined to focused areas because of severe resource constraints. The models established through this approach are usually difficult to generalize to larger scales.
Furthermore, the lack of full coverage of some of the data prevents accurate prediction of future threats from the model.
The recent rapid growth of large-scale remotely sensed images and data from missions such as
Mission to Planet Earth has presented an exciting and unprecedented array of opportunities for environmental epidemiology. The availability of these new data enables comprehensive coverage of the entire world
spatially, spectrally, and temporally. Furthermore, the rapid advances in content-based retrieval, data mining, and knowledge discovery have made it possible to discover both simple and complex trends, association
rules, and complex patterns from a large amount of data efficiently.
The goal of this work is to perform the basic research required for applying content-based retrieval techniques to a set of federated image and
data archives in order to generate and validate environmental epidemiology models. The key problems to be explored in this research project are:
- Validate a hypothetical environmental epidemiology model through the use of content-based retrieval techniques on satellite images, geospatial data, and other data,
- Determine the relative importance of various environmental factors to a specific epidemic disease through data mining and knowledge discovery techniques as well as interactive user refinement,
- Locate available images and data from a set of federated archives that can be used to build the model,
- Scale the modeling and content-based retrieval algorithms to a large amount of data (in excess of 280 GB/day),
- Develop a flexible system that can be easily adapted to a wide variety of models by different users.
To illustrate the goal of this project, consider the following Hantavirus scenarios. This public health problem is not easily resolved at present; however, if the proposed research succeeds, a powerful set
of tools and methods will be at the disposal of researchers to develop new solutions.
- Suppose the user suspects that the location of a house (proximity to a wet grassland makes it more prone to large populations of mice) and recent temperature and moisture patterns (such as a
wet season followed by a dry season) are the most important factors for predicting an outbreak of Hantavirus. The user formulates a content-based query by composing a search consisting of
- the co-occurrence of a specific texture and spectral pattern (to locate wetlands) and houses,
- a weather pattern specified by a time series, and
- ground moisture and temperature levels.
The results are ranked by their similarity to each of the three criteria, using the user-assigned weighting of the three criteria. The user compares the
results with the historical database which contains the geographical coordinates of each disease outbreak.
- The user would like to test additional factors for contribution to the outbreak of Hantavirus. The user formulates a new query by adding vegetation index, ground moisture level, and ozone level to
the model. Through a process of iterative refinement which compares the query results with the historical database, the model is revised.
- Although houses can be located with high-resolution images (5 meters or below), these images may be prohibitively expensive or simply not available. Furthermore, the spatial coverage of precipitation
and temperature data is not complete. Consequently, alternative methods need to be developed to substitute for the missing data. For example, lower-resolution data such as TM data (with 30-meter
resolution) from LANDSAT can be used to infer the location of houses by extrapolating from the neighboring areas. The rainfall and moisture data can be extracted from the vegetation index and water
vapor channel provided by many satellites. The system will thus determine which set of data to use for a given location based on a set of substitution rules and budget constraints. In general,
the data will reside at multiple sites which are managed by different archive centers.
- After the model is developed from the initial data sets, the user can extend the model spatially and temporally to consider a wider area of the world over a longer period of time. This large scale
application of the model presents a great challenge for the required scalability of the algorithms used to build and validate the model.
- After the model for Hantavirus is established, the user can pursue new types of diseases, such as malaria or Lyme disease. The user expects minimal modification in the approaches listed above in
order to establish the new models.
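The data-substitution step in the third scenario above can be illustrated with a minimal sketch. It assumes a simple greedy policy: for each variable the model needs, take the most preferred source that is available and still fits the remaining budget. The rule table, source names, costs, and the functions `choose_source` and `plan_sources` are all hypothetical and are not part of the proposed system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DataSource:
    name: str
    cost: float        # hypothetical acquisition cost, arbitrary units
    available: bool    # whether an archive covers the location of interest


# Hypothetical substitution rules: for each variable the model needs,
# candidate sources listed in order of preference (best first).
SUBSTITUTION_RULES: Dict[str, List[str]] = {
    "houses": ["high_res_imagery", "landsat_tm"],
    "rainfall": ["rain_gauge", "vegetation_index"],
}


def choose_source(
    variable: str, catalog: Dict[str, DataSource], budget: float
) -> Optional[DataSource]:
    """Return the most preferred source that is available and affordable."""
    for name in SUBSTITUTION_RULES.get(variable, []):
        source = catalog.get(name)
        if source is not None and source.available and source.cost <= budget:
            return source
    return None


def plan_sources(
    variables: List[str], catalog: Dict[str, DataSource], budget: float
) -> Dict[str, Optional[str]]:
    """Greedily pick one source per variable, spending from a shared budget."""
    remaining = budget
    selection: Dict[str, Optional[str]] = {}
    for variable in variables:
        source = choose_source(variable, catalog, remaining)
        selection[variable] = source.name if source else None
        if source:
            remaining -= source.cost
    return selection


# Illustrative catalog: high-resolution imagery is too expensive for the
# budget, and rain gauge data does not cover the location.
catalog = {
    "high_res_imagery": DataSource("high_res_imagery", cost=100.0, available=True),
    "landsat_tm": DataSource("landsat_tm", cost=10.0, available=True),
    "rain_gauge": DataSource("rain_gauge", cost=1.0, available=False),
    "vegetation_index": DataSource("vegetation_index", cost=5.0, available=True),
}
example_plan = plan_sources(["houses", "rainfall"], catalog, budget=50.0)
print(example_plan)
```

A greedy pass is only one possible policy; a real system with sources spread over several archive centers would likely need to weigh accuracy loss against cost rather than follow a fixed preference order.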
The research proposed in this project solves two key problems:
- the query tools and methods for generating and validating the models are made routine and accessible and
- the system generalizes to a wide variety of disciplines (e.g. agriculture, forestry, fishing, and transportation).
In the first two of the public health scenarios above, the role of content-based retrieval is critical. In the third scenario, content-based interaction within a federated system is essential. Concurrent access
to large databases by large numbers of users is required by scenarios 4 and 5.
Recently, techniques for implementing content-based querying of images have been explored. In particular, the IBM QBIC project and the
Virage system (one of the Informix datablades) allow the retrieval of images based on texture, color histogram, and shape. The Alexandria project from UCSB allows the retrieval of images based on local texture
features. The SaFe/VisualSeek project from Columbia University and the Blobworld/Bodyplan project from UC Berkeley allow the retrieval of image objects based on their spatial configurations. However, these
systems are insufficient for developing and validating the proposed sophisticated models. For example, we need effective methods for defining intricate composite objects (as illustrated in scenarios 1 and 2) and for
efficiently querying the composite objects. We need flexibility in specifying the simple object constructs of composite objects in terms of user-defined features, pixels, semantics, and user annotations. We need
extensibility of the rule set which defines relationships between the objects along spatial, temporal, and spectral dimensions.
The approach we propose is based on structural decomposition of the search target. Each
search target (a potential location for disease outbreak) is decomposed into a list of entities, possibly with spatial, temporal, and spectral constraints. Each entity can be described by specific pixel patterns, features
(texture, spectral histogram, NDVI, or shape), semantics (such as urban or grassland), and time series patterns. The structural relationships among entities and the weighting of each entity are related to the statistical
model that can be used for predicting future disease outbreaks.
We intend to build a prototype of the results of the research that is based on digital images from EOSDIS and other missions to planet Earth. We
propose to make the modeling system, search engine, raw data, and modeling output available via the Internet to other ESIP partners and research communities. The testbed will examine a data collection that is of
significant size, which will be used to estimate the quality of the techniques when applied to databases much larger than the testbed database.
|
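The structural decomposition described above, together with the weighted ranking from the first scenario, can be sketched as a small data structure. This is only an illustration under simplifying assumptions: each entity is reduced to a similarity function returning a value in [0, 1], the target score is a weighted average of those values, and spatial or temporal constraints between entities are not modeled. The names `Entity`, `SearchTarget`, `closeness`, `rank`, and `hit_rate`, the feature keys, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Set, Tuple

Location = Dict[str, Any]  # hypothetical record of precomputed features per location


def closeness(observed: Sequence[float], pattern: Sequence[float]) -> float:
    """Map the mean absolute difference of two equal-length series into (0, 1]."""
    diff = sum(abs(o - p) for o, p in zip(observed, pattern)) / len(pattern)
    return 1.0 / (1.0 + diff)


@dataclass
class Entity:
    name: str
    similarity: Callable[[Location], float]  # assumed to return a value in [0, 1]
    weight: float                            # user-assigned importance


@dataclass
class SearchTarget:
    entities: List[Entity]

    def score(self, location: Location) -> float:
        """Weighted average of the per-entity similarities."""
        total = sum(e.weight for e in self.entities)
        if total == 0:
            return 0.0
        return sum(e.weight * e.similarity(location) for e in self.entities) / total


def rank(target: SearchTarget, locations: Dict[str, Location]) -> List[Tuple[str, float]]:
    """Return (location name, score) pairs, best match first."""
    scored = [(name, target.score(loc)) for name, loc in locations.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def hit_rate(ranked: List[Tuple[str, float]], outbreaks: Set[str], k: int) -> float:
    """Fraction of the top-k ranked locations that have a recorded outbreak."""
    top = [name for name, _ in ranked[:k]]
    return sum(name in outbreaks for name in top) / len(top) if top else 0.0


WET_THEN_DRY = [1.0, 1.0, 0.0, 0.0]  # illustrative seasonal rainfall pattern

target = SearchTarget([
    Entity("wetland_near_houses", lambda loc: loc["wetland_house_match"], 0.5),
    Entity("weather_pattern", lambda loc: closeness(loc["rainfall"], WET_THEN_DRY), 0.3),
    Entity("ground_conditions", lambda loc: loc["ground_match"], 0.2),
])
locations = {
    "site_a": {"wetland_house_match": 0.9, "rainfall": [1.0, 1.0, 0.0, 0.0], "ground_match": 0.8},
    "site_b": {"wetland_house_match": 0.2, "rainfall": [0.0, 0.0, 1.0, 1.0], "ground_match": 0.5},
}
ranked = rank(target, locations)
print(ranked, hit_rate(ranked, outbreaks={"site_a"}, k=1))
```

Comparing `hit_rate` across different weightings is one simple way to mimic the iterative refinement against the historical outbreak database; the statistical model the proposal has in mind may relate weights to outcomes quite differently.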