Members:
Yuan-Chi Chang
Bishwaranjan Bhattacharjee
Christian Lang
Timothy Malkemus
George Mihaila
Ioana Stanoi
Min Wang
Database Research at Watson
Semi-structured and Unstructured Data Storage and Retrieval
With an increasing amount of data being captured with partial or no structure, efficient and meaningful access to this data becomes paramount. We are investigating methods to store, index, and retrieve semi-structured XML and WWW documents in large document repositories.
Autonomous Data Management in Fast Changing Environments
The highly dynamic, fast-paced business environment demands that business support software adapt at an equal or faster pace. Business applications in turn pass this adaptation requirement on to data management middleware. It is increasingly acknowledged that the current practice of manually defining schemas is not a sustainable model, for two reasons. First, the pace of schema change has increased dramatically in areas such as business process refinement and integration. Second, real-time integration of data cannot be done because of the time it takes to evolve the schema manually. The project was motivated by two separate scenarios, e-commerce catalogs and business performance management, both of which demand dynamic schema declaration, data population, and efficient retrieval. In the former scenario, catalog content unforeseen at tool design time must be added and managed seamlessly by the catalog tool. In the latter scenario, new business performance monitors are plugged in and out across the enterprise with no centralized control, but the monitored performance data needs to be archived for search and retrieval. Both call for a set of data management tools with self-configuration and self-optimization features. We are developing a set of data management tools that evolve along with the business requirements by embedding the logic of configuration and performance tuning. We have demonstrated a data management layer supporting e-commerce catalogs that can accommodate a wide variety of heterogeneous catalog content. We have also been developing sensor data management for a smart oil field early warning system. While most autonomous features today are embedded in database tuning, this project targets the higher layer of autonomous support for applications.

Contributors: Yuan-Chi Chang, Ioana Stanoi
 
Query-Aware XML View Materialization
The current version of the Store2XML component of the XDM system provides the basic framework for generating XML documents from relational databases. However, if an application wants to extract some data from these documents using a query language, the whole document must be generated before it can be queried. In this project we consider the problem of rewriting an XML view into several custom views in order to support clients that require various portions of the mapping-defined data. Each custom view extracts only the relevant data from the underlying relational databases. View rewriting reduces the amount of shipped data and, potentially, the query processing time at the client.
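
The difference between materializing a full view and a custom view can be illustrated with a minimal sketch. It does not use the Store2XML or XDM interfaces; the tables, the functions `full_view` and `custom_view_orders_of`, and the sample data are all hypothetical, and the "rewriting" is done by hand for a single path query.

```python
import sqlite3
import xml.etree.ElementTree as ET

def make_db():
    # Hypothetical relational source behind the XML view.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customer(id INTEGER PRIMARY KEY, name TEXT, city TEXT);
        CREATE TABLE orders(id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        INSERT INTO customer VALUES (1, 'Ada', 'Paris'), (2, 'Bob', 'Oslo');
        INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
    """)
    return db

def full_view(db):
    """Materialize the whole mapping-defined document, regardless of the query."""
    root = ET.Element("customers")
    for cid, name, city in db.execute(
            "SELECT id, name, city FROM customer ORDER BY id"):
        cust = ET.SubElement(root, "customer", name=name, city=city)
        for oid, total in db.execute(
                "SELECT id, total FROM orders WHERE customer_id = ? ORDER BY id",
                (cid,)):
            ET.SubElement(cust, "order", id=str(oid), total=str(total))
    return root

def custom_view_orders_of(db, name):
    """Hand-written custom view for the path customer[@name=...]/order.

    The selection is pushed into SQL, so only the relevant rows are fetched.
    """
    root = ET.Element("orders")
    rows = db.execute(
        "SELECT o.id, o.total FROM orders o "
        "JOIN customer c ON c.id = o.customer_id "
        "WHERE c.name = ? ORDER BY o.id", (name,))
    for oid, total in rows:
        ET.SubElement(root, "order", id=str(oid), total=str(total))
    return root

db = make_db()
full = full_view(db)
# Querying the full document and building the custom view give the same orders.
via_full = [o.get("id") for o in full.findall("customer[@name='Ada']/order")]
via_custom = [o.get("id") for o in custom_view_orders_of(db, "Ada")]
```

In this sketch the custom view builds fewer elements than the full document while answering the same query, which is the saving in shipped data that the project aims for.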

Contributors: George Mihaila, Oded Shmueli (Technion University)
 
Encoding XML for Faster Processing
The most important obstacle to efficient XML processing today is the compute-intensive parsing of XML documents in their textual representation. This problem is exacerbated in an e-business environment, where XML messages need to be parsed multiple times as they are processed by several subsystems (for example, payment processing, inventory, warehouse management, and shipping). One way to alleviate the XML parsing problem is to convert the XML messages from their textual form to a binary encoded form. We are designing a binary format for XML targeted at efficient parsing and query processing. The objective is to allow fast, streaming query processing over encoded XML messages.
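
To make the idea of a binary encoding concrete, here is a minimal sketch of one possible token format. It is not the format designed in this project: the layout (a tag-name table followed by `START`, `TEXT`, and `END` tokens) and the function names `encode` and `events` are assumptions chosen for illustration, and attributes, tail text, and namespaces are ignored.

```python
import struct
import xml.etree.ElementTree as ET

START, END, TEXT = 1, 2, 3  # hypothetical token codes

def encode(xml_text):
    """Parse the text once and emit a compact token stream."""
    names = {}          # tag name -> small integer id
    body = bytearray()

    def visit(elem):
        tag_id = names.setdefault(elem.tag, len(names))
        body.extend(struct.pack("<BH", START, tag_id))
        if elem.text and elem.text.strip():
            data = elem.text.strip().encode("utf-8")
            body.extend(struct.pack("<BI", TEXT, len(data)))
            body.extend(data)
        for child in elem:
            visit(child)
        body.append(END)

    visit(ET.fromstring(xml_text))
    # Header: number of names, then each name length-prefixed, in id order
    # (dicts preserve insertion order, which matches the assigned ids).
    header = bytearray(struct.pack("<H", len(names)))
    for name in names:
        raw = name.encode("utf-8")
        header.extend(struct.pack("<H", len(raw)))
        header.extend(raw)
    return bytes(header + body)

def events(blob):
    """Stream (kind, value) events without scanning for '<' or '>'."""
    (count,) = struct.unpack_from("<H", blob, 0)
    pos = 2
    names = []
    for _ in range(count):
        (n,) = struct.unpack_from("<H", blob, pos)
        pos += 2
        names.append(blob[pos:pos + n].decode("utf-8"))
        pos += n
    while pos < len(blob):
        kind = blob[pos]
        pos += 1
        if kind == START:
            (tag_id,) = struct.unpack_from("<H", blob, pos)
            pos += 2
            yield ("start", names[tag_id])
        elif kind == TEXT:
            (n,) = struct.unpack_from("<I", blob, pos)
            pos += 4
            yield ("text", blob[pos:pos + n].decode("utf-8"))
            pos += n
        elif kind == END:
            yield ("end", None)
        else:
            raise ValueError(f"unknown token code {kind}")

blob = encode("<order><item>pen</item><qty>2</qty></order>")
stream = list(events(blob))
```

Because tag names are replaced by integer ids and text is length-prefixed, each downstream subsystem can walk the stream with fixed-size reads instead of re-parsing characters, which is the kind of saving a binary format is meant to provide.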

Contributors: George Mihaila, Yi Chen (University of Pennsylvania)
 
Memory and Bandwidth Adaptive Access Methods for Tagged Documents
An emerging trend in data exchange and management is to annotate data with metadata in order to allow for more meaningful semantic interpretation. One example is the increasing popularity of XML for data representation. For some applications, it is crucial to filter efficiently through a collection of documents. To answer the need for efficient and portable filters, our project focuses on developing effective structural filters that operate under memory constraints. In other applications, it is more important to process queries quickly over large documents. For existing structural indexes to be practical, they need efficient incremental maintenance algorithms that keep the data access structures consistent with the underlying data. We have developed the first such algorithms with provable guarantees on the quality of the resulting structural index. Furthermore, we considered the processing of XML documents in environments where portability of the index is a requirement, and developed a family of data access methods based on navigational aids that conform to an a priori assigned memory size.

Contributors: Christian Lang, Ioana Stanoi
 
XPathLearner: An On-Line Self-Tuning Markov Histogram for XML Path Selectivity Estimation
The Extensible Markup Language (XML) is gaining widespread use as a format for data exchange and storage on the World Wide Web. Queries over XML data require accurate selectivity estimation of path expressions to optimize query execution plans. Selectivity estimation of XML path expressions is usually based on summary statistics about the structure of the underlying XML repository. All previous methods require an off-line scan of the XML repository to collect the statistics. In this project, we propose XPathLearner, a method for estimating the selectivity of the most commonly used types of path expressions without looking at the XML data. XPathLearner gathers and refines the statistics using query feedback in an on-line manner. It is especially suited to queries in Internet-scale applications, where the underlying XML repository is either inaccessible or too large to be scanned in its entirety. Besides the on-line property, our method has two other novel features: (a) XPathLearner is workload-aware in collecting the statistics and thus can be more accurate than the more costly off-line method under tight memory constraints, and (b) XPathLearner automatically adjusts the statistics using query feedback when the underlying XML data change.
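
The feedback loop can be sketched as follows. This is a simplified illustration, not the XPathLearner algorithm itself: it assumes a first-order Markov table over tag pairs, a plain multiplicative correction, and no memory bound, and the class name `MarkovPathEstimator` and its parameters are hypothetical.

```python
class MarkovPathEstimator:
    """Counts for paths of length 1 and 2, refined only from query feedback."""

    def __init__(self, default_count=1.0, learning_rate=0.5):
        self.default_count = default_count  # guess for paths never seen
        self.learning_rate = learning_rate  # 1.0 = trust feedback fully
        self.counts = {}                    # tuple of 1 or 2 tags -> count

    def _get(self, key):
        return self.counts.get(key, self.default_count)

    def estimate(self, path):
        tags = tuple(path)
        if len(tags) <= 2:
            return self._get(tags)
        # Markov assumption:
        # f(t1/.../tn) ~ f(t1/t2) * product of f(ti/ti+1) / f(ti)
        est = self._get(tags[:2])
        for i in range(1, len(tags) - 1):
            est *= self._get(tags[i:i + 2]) / max(self._get(tags[i:i + 1]), 1e-9)
        return est

    def feedback(self, path, actual):
        """Refine the table with the true result size observed for a query."""
        tags = tuple(path)
        if len(tags) <= 2:
            old = self._get(tags)
            self.counts[tags] = old + self.learning_rate * (actual - old)
            return
        # Longer path: spread the correction evenly over the tag pairs.
        ratio = max(actual, 1e-9) / max(self.estimate(tags), 1e-9)
        n_pairs = len(tags) - 1
        factor = ratio ** (self.learning_rate / n_pairs)
        for i in range(n_pairs):
            key = tags[i:i + 2]
            self.counts[key] = self._get(key) * factor

est = MarkovPathEstimator(default_count=10.0, learning_rate=1.0)
est.feedback(["book", "author"], 40)          # observed result sizes
est.feedback(["author"], 20)
est.feedback(["author", "name"], 5)
guess = est.estimate(["book", "author", "name"])  # never executed before
```

The estimator never reads the XML data: every number in its table comes from the result size of an executed query, so repeated feedback also tracks changes in the underlying repository.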

Contributors: Min Wang, Lipyeow Lim (Duke University), Sriram Padmanabhan (IBM Silicon Valley Labs), Jeffrey Scott Vitter (Purdue University), and Ronald Parr (Duke University)

 
Dynamic Maintenance of Web Indexes Using Landmarks
Recent work on incremental crawling has enabled a search engine to keep its indexed document collection more synchronized with the changing World Wide Web. However, the information in this synchronized collection is not immediately searchable, since the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web, and a complete index rebuild at high frequency is expensive. Previous work on incremental inverted index updates has been restricted to adding and removing documents. Updating the inverted index for previously indexed documents that have changed has not been addressed. In this project, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index.
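
The saving can be illustrated with a small cost model. This is a sketch under simplifying assumptions rather than the project's method: landmarks are taken to be fixed-size blocks of words, any block touched by an edit is re-indexed as a whole, Python's `difflib` stands in for the diff algorithm, and the function names are hypothetical.

```python
import difflib

def absolute_update_cost(old, new):
    """Postings to delete plus postings to insert when each posting
    stores an absolute word position."""
    old_p = {(w, pos) for pos, w in enumerate(old)}
    new_p = {(w, pos) for pos, w in enumerate(new)}
    return len(old_p - new_p) + len(new_p - old_p)

def landmark_update_cost(old, new, block_size):
    """Postings to delete plus postings to insert when each posting
    stores (landmark, offset) and only edited blocks are re-indexed.

    Shifting the start positions in the landmark directory is not counted:
    that is one entry per landmark, not one per posting.
    """
    if not old:
        return len(new)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    touched = set()
    removed_old = added_new = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        removed_old += i2 - i1
        added_new += j2 - j1
        # A pure insertion is attributed to the block at the insertion point.
        first = min(i1, len(old) - 1)
        last = max(first, i2 - 1)
        touched.update(range(first // block_size, last // block_size + 1))
    deleted = sum(min(block_size, len(old) - k * block_size) for k in touched)
    inserted = (deleted - removed_old) + added_new
    return deleted + inserted

old_doc = [f"w{i}" for i in range(100)]
new_doc = old_doc[:5] + ["inserted"] + old_doc[5:]   # one word added near the top
abs_cost = absolute_update_cost(old_doc, new_doc)
lm_cost = landmark_update_cost(old_doc, new_doc, block_size=10)
```

With absolute positions, a single inserted word shifts every later posting in the document. With landmark-relative positions, only the postings in the edited block change, so `lm_cost` is far smaller than `abs_cost` for this example.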

Contributors: Min Wang, Lipyeow Lim (Duke University), Sriram Padmanabhan (IBM Silicon Valley Labs), Jeffrey Scott Vitter (Purdue University), and Ramesh Agarwal
