
Executive Summary

A primary feature of public health analyses is that they typically affect large numbers of people over extended geographic regions. While detailed methodologies have been developed within epidemiology to characterize non-spatial aspects of risk assessment, the use of spatial data for risk evaluation and intervention has lagged. Preliminary studies indicate that spatial information, such as remotely sensed data, adds significantly to our ability to understand these public health problems. However, these problems usually reflect temporally and spatially dynamic processes that are not sufficiently captured by a single, static data set and simple queries. Because of severe resource constraints, comprehensive investigations have so far often been confined to focused areas. The models established through this approach are usually difficult to generalize to larger scales. Furthermore, because some of the data lack full coverage, such models cannot accurately predict future threats.

The recent rapid growth of large-scale remotely sensed images and data from programs such as Mission to Planet Earth has presented an exciting and unprecedented array of opportunities for environmental epidemiology. The availability of these new data enables comprehensive coverage of the entire world spatially, spectrally, and temporally. Furthermore, rapid advances in content-based retrieval, data mining, and knowledge discovery have made it possible to efficiently discover trends, association rules, and complex patterns in large amounts of data.

The goal of this work is to perform the basic research required for applying content-based retrieval techniques to a set of federated image and data archives in order to generate and validate environmental epidemiology models. The key problems to be explored in this research project are:

  • Validate a hypothetical environmental epidemiology model through the use of content-based retrieval techniques on satellite images, geospatial data, and other data,
  • Determine the relative importance of various environmental factors to a specific epidemic disease through data mining and knowledge discovery techniques as well as interactive user refinement,
  • Locate available images and data from a set of federated archives that can be used to build the model,
  • Scale the modeling and content-based retrieval algorithms to a large amount of data (in excess of 280 GB/day), and
  • Develop a flexible system that can be easily adapted to a wide variety of models by different users.

To illustrate the goal of this project, consider the following scenarios involving Hantavirus. This public health problem is not easily resolved at present. If the proposed research succeeds, however, researchers will have a powerful set of tools and methods at their disposal for developing new solutions.

  • Consider that the user suspects that the location of a house (proximity to a wet grassland makes it more prone to large populations of mice) and recent temperature and moisture patterns (such as a wet season followed by a dry season) are the most important factors for predicting an outbreak of Hantavirus. The user formulates a content-based query by composing a search consisting of
    1. the co-occurrence of a specific texture and spectral pattern (to locate wetlands) and houses,
    2. a weather pattern specified by a time series, and
    3. ground moisture and temperature levels.
    The results are ranked by their similarity to each of the three criteria, combined using the user-assigned weighting of the three criteria. The user then compares the results with the historical database, which contains the geographical coordinates of each disease outbreak.
  • The user would like to test additional factors for their contribution to the outbreak of Hantavirus. The user formulates a new query by adding vegetation index, ground moisture level, and ozone level to the model. The model is revised through a process of iterative refinement that compares the query results with the historical database.
  • Although houses can be located in high-resolution images (5 meters or finer), these images may be prohibitively expensive or simply not available. Furthermore, the spatial coverage of precipitation and temperature data is not complete. Consequently, alternative methods need to be developed to substitute for the missing data. For example, lower-resolution data such as TM data (with 30 meter resolution) from LANDSAT can be used to infer the location of houses by extrapolating from the neighboring areas. The rainfall and moisture data can be extracted from the vegetation index and water vapor channel provided by many satellites. The system will thus determine which set of data to use for a given location based on a set of substitution rules and budget constraints. In general, the data will reside at multiple sites that are managed by different archive centers.
  • After the model is developed from the initial data sets, the user can extend the model spatially and temporally to consider a wider area of the world over a longer period of time. This large-scale application of the model presents a great challenge for the scalability of the algorithms used to build and validate the model.
  • After the model for Hantavirus is established, the user can pursue new types of diseases, such as malaria or Lyme disease. The user expects that only minimal modification of the approaches listed above will be needed to establish the new models.
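The ranking and validation step of the first scenario can be sketched as follows. This is only an illustrative sketch, not the project's search engine: the names `Candidate`, `combined_score`, `rank`, and `hit_rate`, the assumption that each criterion yields a similarity in [0, 1], the use of a weighted average to combine them, and the sample sites and weights are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass
class Candidate:
    """A candidate location with one similarity score per query criterion.

    Scores are assumed (for this sketch) to lie in [0, 1].
    """
    name: str
    texture_spectral: float   # similarity to the wetland-plus-houses pattern
    weather_series: float     # similarity to the wet-then-dry time series
    ground_conditions: float  # similarity to target moisture and temperature


def combined_score(c: Candidate, weights: Tuple[float, float, float]) -> float:
    """Weighted average of the three similarities using user-assigned weights."""
    w_tex, w_wea, w_gnd = weights
    total = w_tex + w_wea + w_gnd
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = (w_tex * c.texture_spectral
                + w_wea * c.weather_series
                + w_gnd * c.ground_conditions)
    return weighted / total


def rank(candidates: Iterable[Candidate],
         weights: Tuple[float, float, float]) -> List[Candidate]:
    """Return candidates ordered from most to least similar to the query."""
    return sorted(candidates, key=lambda c: combined_score(c, weights), reverse=True)


def hit_rate(ranked: List[Candidate], outbreak_sites: Set[str], top_k: int) -> float:
    """Fraction of the top_k ranked candidates found in the historical record."""
    top = ranked[:top_k]
    if not top:
        return 0.0
    return sum(c.name in outbreak_sites for c in top) / len(top)


# Hypothetical data: three sites and a made-up historical outbreak record.
sites = [
    Candidate("site_a", 0.9, 0.8, 0.7),
    Candidate("site_b", 0.4, 0.9, 0.5),
    Candidate("site_c", 0.2, 0.3, 0.9),
]
user_weights = (0.5, 0.3, 0.2)
ranking = rank(sites, user_weights)
print([c.name for c in ranking])
print(hit_rate(ranking, {"site_a", "site_c"}, top_k=2))
```

Under these assumptions, the iterative refinement of the second scenario amounts to changing the weights or adding criteria and checking whether the agreement with the historical record improves.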

The research proposed in this project solves two key problems:

  • the query tools and methods for generating and validating the models are made routine and accessible, and
  • the system generalizes to a wide variety of disciplines (e.g., agriculture, forestry, fishing, and transportation).

In the first two of the public health scenarios above, the role of content-based retrieval is critical. In the third scenario, content-based interaction within a federated system is essential. Scenarios 4 and 5 require concurrent access to large databases by large numbers of users.

Recently, techniques for implementing content-based querying of images have been explored. In particular, the IBM QBIC project and the Virage system (one of the Informix datablades) allow the retrieval of images based on texture, color histogram, and shape. The Alexandria project from UCSB allows the retrieval of images based on local texture features. The SaFe/VisualSeek project from Columbia University and the Blobworld/Bodyplan project from UC Berkeley allow the retrieval of image objects based on their spatial configurations.

However, these systems are insufficient for developing and validating the sophisticated models proposed here. For example, we need effective methods for defining intricate composite objects (as illustrated in scenarios 1 and 2) and for efficiently querying them. We need flexibility in specifying the simple object constructs of composite objects in terms of user-defined features, pixels, semantics, and user annotations. We also need an extensible rule set that defines relationships between the objects along spatial, temporal, and spectral dimensions.

The approach we propose is based on structural decomposition of the search target. Each search target (a potential location for disease outbreak) is decomposed into a list of entities, possibly with spatial, temporal, and spectral constraints. Each entity can be described by specific pixel patterns, features (texture, spectral histogram, NDVI, or shape), semantics (such as urban or grassland), and time series patterns. The structural relationships among entities and the weighting of each entity are related to the statistical model that can be used for predicting future disease outbreaks.
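One way to picture this decomposition is as a small data structure: a search target holding weighted entities plus constraints that relate them along the spatial, temporal, or spectral dimension. The sketch below is a hypothetical illustration only; the class names (`Entity`, `Constraint`, `SearchTarget`), the field names, the relation labels, and the 500-unit distance are assumptions, not part of the proposed system. The `ndvi` helper uses the standard definition of the Normalized Difference Vegetation Index, (NIR - Red) / (NIR + Red).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

DIMENSIONS = {"spatial", "temporal", "spectral"}


def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).

    Returning 0.0 when both bands are zero is a convention chosen for this sketch.
    """
    denom = nir + red
    return 0.0 if denom == 0 else (nir - red) / denom


@dataclass
class Entity:
    """One component of a search target (hypothetical structure)."""
    name: str
    semantics: Optional[str] = None                          # e.g. "urban", "grassland"
    features: Dict[str, float] = field(default_factory=dict)  # e.g. {"ndvi": 0.6}
    weight: float = 1.0                                       # importance in the model


@dataclass
class Constraint:
    """A relationship between two entities along one dimension."""
    subject: str
    relation: str            # free-form label, e.g. "within_distance"
    other: str
    dimension: str           # "spatial", "temporal", or "spectral"
    limit: Optional[float] = None


@dataclass
class SearchTarget:
    name: str
    entities: List[Entity] = field(default_factory=list)
    constraints: List[Constraint] = field(default_factory=list)

    def check(self) -> None:
        """Raise ValueError if a constraint is malformed."""
        names = {e.name for e in self.entities}
        for c in self.constraints:
            if c.dimension not in DIMENSIONS:
                raise ValueError(f"unknown dimension: {c.dimension}")
            if c.subject not in names or c.other not in names:
                raise ValueError(f"constraint refers to unknown entity: {c}")

    def normalized_weights(self) -> Dict[str, float]:
        """Entity weights rescaled to sum to 1."""
        total = sum(e.weight for e in self.entities)
        if total <= 0:
            raise ValueError("entity weights must sum to a positive value")
        return {e.name: e.weight / total for e in self.entities}


# Hypothetical target loosely following the Hantavirus scenario.
target = SearchTarget(
    name="hantavirus_risk",
    entities=[
        Entity("wet_grassland", semantics="grassland",
               features={"ndvi": ndvi(0.5, 0.1)}, weight=2.0),
        Entity("house", semantics="urban", weight=1.0),
        Entity("wet_then_dry_season", weight=1.0),
    ],
    constraints=[
        Constraint("house", "within_distance", "wet_grassland", "spatial", 500.0),
    ],
)
target.check()
print(target.normalized_weights())
```

In a representation like this, revising the statistical model corresponds to adjusting entity weights or editing the constraint list, while the entity descriptions determine what the retrieval engine must match.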

We intend to build a prototype, based on the results of this research, that uses digital images from EOSDIS and other missions to planet Earth. We propose to make the modeling system, search engine, raw data, and modeling output available via the Internet to other ESIP partners and research communities. The testbed will examine a data collection of significant size, which will be used to estimate the quality of the techniques when applied to databases much larger than the testbed database.

Original Proposal

System Design.
