Members:
Yuan-Chi Chang
Bishwaranjan Bhattacharjee
Christian Lang
Timothy Malkemus
George Mihaila
Ioana Stanoi
Min Wang
Watson Research Center
 
Database Research at Watson
Efficient Data Access Methods for Relational Databases
Efficient RDBMS algorithms and data structures are essential to the success of higher-level functionality. We are researching novel data storage, retrieval, and statistics collection techniques for disk-based and potentially highly parallel RDBMSs.
Multi Dimensional Clustering for DB2
This project deals with the design and implementation of a new data layout scheme, called Multi-Dimensional Clustering, in DB2 Universal Database Version 8. Many applications, e.g., OLAP and data warehousing, process a table or tables in a database using a multi-dimensional access paradigm. Currently, most database systems can only support organization of a table using a primary clustering index. Secondary indexes are created to access the tables when the primary key index is not applicable. Unfortunately, secondary indexes perform many random I/O accesses against the table even for a simple operation such as a range query. Our work on multi-dimensional clustering addresses this important deficiency in database systems. Multi-Dimensional Clustering is based on the definition of one or more orthogonal clustering attributes (or expressions) on a table. The table is organized physically by grouping records with similar values for the dimension attributes into a cluster. We have implemented novel techniques for maintaining this physical layout efficiently, as well as methods of processing database operations that provide significant performance improvements.
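
The layout idea can be illustrated with a small sketch. This is not the DB2 implementation; it is a toy in-memory model written for this page, and the names `MdcTable`, `block_size`, and the sample columns `region`, `year`, and `amount` are all assumptions. Rows that share the same values on every dimension are stored together in blocks, so a query on the dimensions reads only the blocks of the matching cells.

```python
from collections import defaultdict


class MdcTable:
    """Toy model of a table clustered on several dimensions (illustrative only)."""

    def __init__(self, dimensions, block_size=4):
        self.dimensions = tuple(dimensions)
        self.position = {d: i for i, d in enumerate(self.dimensions)}
        self.block_size = block_size
        # Cell key (one value per dimension) -> list of blocks; a block is a list of rows.
        self.cells = defaultdict(list)

    def insert(self, row):
        """Place the row in the last block of its cell, opening a new block when full."""
        key = tuple(row[d] for d in self.dimensions)
        blocks = self.cells[key]
        if not blocks or len(blocks[-1]) >= self.block_size:
            blocks.append([])
        blocks[-1].append(row)

    def query(self, **allowed):
        """allowed maps a dimension name to a set of accepted values.

        Returns the matching rows and the number of blocks that had to be read.
        """
        rows, blocks_read = [], 0
        for key, blocks in self.cells.items():
            if all(key[self.position[d]] in values for d, values in allowed.items()):
                blocks_read += len(blocks)
                for block in blocks:
                    rows.extend(block)
        return rows, blocks_read


# Hypothetical sample data: 12 sales rows over 2 regions and 3 years.
table = MdcTable(dimensions=("region", "year"), block_size=2)
for i in range(12):
    table.insert({"region": ["east", "west"][i % 2], "year": 2001 + i % 3, "amount": i})

rows, blocks_read = table.query(region={"east"}, year={2001, 2002})
total_blocks = sum(len(blocks) for blocks in table.cells.values())
print(len(rows), "rows from", blocks_read, "of", total_blocks, "blocks")
```

In this model, a query restricted on the dimensions never touches blocks belonging to other cells, which is the contrast with a secondary index that may fetch one page per qualifying row.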

Contributors: Ramesh Agarwal, Bishwaranjan Bhattacharjee, Timothy Malkemus, Sriram Padmanabhan (IBM Silicon Valley Labs), Leslie Cranston (IBM Toronto Labs), Matthew Huras (IBM Toronto Labs), Tony Lai (IBM Toronto Labs)

SASH: A Self-Adaptive Histogram Set for Dynamically Changing Workloads
Most RDBMSs maintain a set of histograms for estimating the selectivities of given queries. These selectivities are typically used for cost-based query optimization. While the problem of building an accurate histogram for a given attribute or attribute set has been well studied, little attention has been given to the problem of building and tuning a set of histograms collectively for multidimensional queries in a self-managed manner based only on query feedback. In this project, we propose SASH, a Self-Adaptive Set of Histograms that addresses the problem of building and maintaining a set of histograms. SASH uses a novel two-phase method to automatically build and maintain itself using query feedback information only. In the online tuning phase, the current set of histograms is tuned online in response to the estimation error of each query. In the restructuring phase, a new and more accurate set of histograms replaces the current set. The new set of histograms (attribute sets and memory distribution) is found using information from a batch of query feedback.
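
To make the online tuning phase concrete, here is a minimal sketch of feedback-driven tuning for a single one-dimensional histogram. It is not the SASH algorithm: the equi-width buckets, the `damping` factor, and the rule that spreads the error over the touched buckets in proportion to coverage are all assumptions chosen for illustration, and the restructuring phase is not shown.

```python
class FeedbackHistogram:
    """Equi-width histogram refined only from query feedback (illustrative only)."""

    def __init__(self, lo, hi, num_buckets, total_rows):
        self.lo = lo
        self.width = (hi - lo) / num_buckets
        # With no data access, start from a uniform distribution.
        self.freq = [total_rows / num_buckets] * num_buckets

    def _overlaps(self, q_lo, q_hi):
        """Yield (bucket index, fraction of that bucket covered by [q_lo, q_hi))."""
        for i in range(len(self.freq)):
            b_lo = self.lo + i * self.width
            b_hi = b_lo + self.width
            covered = min(b_hi, q_hi) - max(b_lo, q_lo)
            if covered > 0:
                yield i, covered / self.width

    def estimate(self, q_lo, q_hi):
        """Estimated number of rows with value in [q_lo, q_hi)."""
        return sum(self.freq[i] * frac for i, frac in self._overlaps(q_lo, q_hi))

    def tune(self, q_lo, q_hi, actual, damping=0.5):
        """Adjust the touched buckets toward the observed result size; return the error."""
        overlaps = list(self._overlaps(q_lo, q_hi))
        error = actual - sum(self.freq[i] * frac for i, frac in overlaps)
        total_frac = sum(frac for _, frac in overlaps)
        for i, frac in overlaps:
            # Share the damped error among buckets by coverage; never go negative.
            self.freq[i] = max(0.0, self.freq[i] + damping * error * frac / total_frac)
        return error


hist = FeedbackHistogram(lo=0, hi=100, num_buckets=4, total_rows=1000)
before = hist.estimate(0, 50)
hist.tune(0, 50, actual=900)  # feedback: the query really returned 900 rows
after = hist.estimate(0, 50)
print(before, after)
```

Each feedback record moves the estimate for the same range part of the way toward the observed count, so repeated queries over a skewed region gradually correct the initial uniform assumption.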

Contributors: Min Wang, Lipyeow Lim (Duke University), and Jeffrey Scott Vitter (Purdue University)

Estimating Statistics of Operators in Continuous Queries
Statistics estimation, such as estimating the output size of operators, is a well-studied subject in the database research community, mainly for the purpose of query optimization. The usual assumption, however, is that queries are ad hoc, and therefore the emphasis has been on capturing the data distribution. When long-standing continuous queries on a changing database are concerned, a more direct approach, namely building an estimation model for each operator, is possible. In this project, we propose a novel learning-based method. Our method consists of two steps. The first step is to design a dedicated feature extraction algorithm that can be used incrementally to obtain feature values from the underlying data. The second step is to use a data mining algorithm to generate an estimation model based on the feature values extracted from the historical data. Experimental results show that this approach provides accurate statistics estimates with low overhead.
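
The two steps can be sketched as follows. This is only an illustration under strong assumptions: the feature is a plain running row count, the operator is a hypothetical selection `value >= 75`, the data is synthetic, and ordinary least-squares line fitting stands in for the data mining algorithm, which the description above does not name.

```python
class RowCountFeature:
    """Step 1: a feature maintained incrementally as rows arrive (illustrative only)."""

    def __init__(self):
        self.value = 0

    def update(self, row):
        self.value += 1


def fit_line(xs, ys):
    """Step 2 stand-in: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def selection(row):
    """Hypothetical operator of a continuous query: keep values of at least 75."""
    return row >= 75


# Collect (feature value, observed output size) pairs from synthetic historical data.
feature = RowCountFeature()
history = []
output_size = 0
for i in range(300):
    row = i % 100
    feature.update(row)
    output_size += selection(row)  # True counts as 1
    if (i + 1) % 100 == 0:
        history.append((feature.value, output_size))

slope, intercept = fit_line(*zip(*history))


def estimate_output_size(feature_value):
    """Per-operator estimation model learned from the history."""
    return slope * feature_value + intercept


print(estimate_output_size(1000))
```

The point of the design is that the model is tied to one operator of a long-running query, so it can be learned from that operator's own history instead of being derived from a general description of the data distribution.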

Contributors: Min Wang, Like Gao (University of Vermont), Xiaoyang Sean Wang (University of Vermont), and Sriram Padmanabhan (IBM Silicon Valley Labs)
