Database Research at Watson
Efficient Data Access Methods for Relational Databases
The development of efficient RDBMS algorithms and data structures is essential for the success of
higher-level functionality. We are researching novel data storage and retrieval techniques, as well as
statistics collection techniques, for disk-based and potentially highly parallel RDBMSs.
Multi-Dimensional Clustering for DB2
This project deals with the design and implementation of a new data layout scheme, called Multi-Dimensional Clustering, in DB2
Universal Database Version 8. Many applications, e.g., OLAP and data warehousing, process a table or tables in a database
using a multi-dimensional access paradigm. Currently, most database systems can only support organization of a table using a primary clustering index.
Secondary indexes are created to access the tables when the primary key index is not applicable. Unfortunately, secondary indexes perform many
random I/O accesses against the table for a simple operation such as a range query. Our work on multi-dimensional clustering addresses this important
deficiency in database systems. Multi-Dimensional Clustering is based on the definition of one or more orthogonal clustering attributes (or expressions)
on a table. The table is organized physically by placing records with similar values for the dimension attributes in the same cluster.
We have implemented novel techniques for maintaining this physical layout efficiently, as well as methods of processing database operations that provide
significant performance improvements.
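As a rough illustration of the layout idea only (not the DB2 implementation), the sketch below groups rows into cells keyed by their dimension values, so that a range query reads only the cells whose keys qualify. The class name `CellClusteredTable`, its methods, and the sample data are all hypothetical, and scanning every cell key stands in for whatever index over cells a real system would use.

```python
from collections import defaultdict


class CellClusteredTable:
    """Toy in-memory table that stores rows together by dimension values."""

    def __init__(self, dimensions):
        self.dimensions = tuple(dimensions)
        # One list of rows per distinct combination of dimension values.
        self.cells = defaultdict(list)

    def insert(self, row):
        # The cell key is the tuple of the row's dimension values.
        key = tuple(row[d] for d in self.dimensions)
        self.cells[key].append(row)

    def range_query(self, ranges):
        """Return (rows, cells_read) for inclusive (low, high) ranges.

        `ranges` maps a dimension name to a (low, high) pair. Only cells
        whose key satisfies every range contribute rows.
        """
        rows, cells_read = [], 0
        for key, cell_rows in self.cells.items():
            values = dict(zip(self.dimensions, key))
            if all(lo <= values[d] <= hi for d, (lo, hi) in ranges.items()):
                cells_read += 1
                rows.extend(cell_rows)
        return rows, cells_read


# Hypothetical sample data: 2 regions x 3 years x 2 amounts.
table = CellClusteredTable(["region", "year"])
for region in ("east", "west"):
    for year in (2001, 2002, 2003):
        for amount in (10, 20):
            table.insert({"region": region, "year": year, "amount": amount})

rows, cells_read = table.range_query({"year": (2002, 2003)})
```

In this toy example the query on `year` touches four of the six cells, and every row in a touched cell qualifies, which is the property that avoids per-row random accesses.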
Contributors: Ramesh Agarwal, Bishwaranjan Bhattacharjee, Timothy Malkemus, Sriram Padmanabhan (IBM Silicon Valley Labs), Leslie Cranston (IBM Toronto Labs),
Matthew Huras (IBM Toronto Labs), Tony Lai (IBM Toronto Labs)
Publications
SASH: A Self-Adaptive Histogram Set for Dynamically Changing Workloads
Most RDBMSs maintain a set of histograms for estimating the selectivities of given
queries. These selectivities are typically used for cost-based query optimization. While the
problem of building an accurate histogram for a given attribute or attribute set has been
well-studied, little attention has been given to the problem of building and tuning a set of histograms
collectively for multidimensional queries in a self-managed manner based only on query feedback.
In this project, we propose SASH, a Self-Adaptive Set of Histograms that addresses the problem of
building and maintaining a set of histograms. SASH uses a novel two-phase method to automatically
build and maintain itself using query feedback information only. In the online tuning phase, the
current set of histograms is tuned after each query in response to that query's estimation
error. In the restructuring phase, a new and more accurate set of histograms replaces the current
set. The new set of histograms (its attribute sets and memory distribution) is found
using information from a batch of query feedback.
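To make the idea of tuning from query feedback concrete, here is a minimal sketch for a single one-dimensional equi-width histogram. It is not the SASH algorithm: the proportional update rule, the `damping` parameter, and the class name `FeedbackHistogram` are assumptions chosen for illustration, and the restructuring phase is not modeled at all.

```python
class FeedbackHistogram:
    """Equi-width histogram whose bucket counts are adjusted by feedback."""

    def __init__(self, low, high, buckets, total_rows):
        self.low = low
        self.width = (high - low) / buckets
        # Start from a uniform distribution; no data scan is assumed.
        self.freq = [total_rows / buckets] * buckets

    def _overlap(self, i, lo, hi):
        """Fraction of bucket i covered by the query range [lo, hi)."""
        b_lo = self.low + i * self.width
        b_hi = b_lo + self.width
        covered = min(hi, b_hi) - max(lo, b_lo)
        return max(0.0, covered) / self.width

    def estimate(self, lo, hi):
        """Estimated number of rows with a value in [lo, hi)."""
        return sum(f * self._overlap(i, lo, hi)
                   for i, f in enumerate(self.freq))

    def tune(self, lo, hi, actual, damping=0.5):
        """Spread a damped share of the error over the overlapping buckets."""
        error = damping * (actual - self.estimate(lo, hi))
        overlaps = [self._overlap(i, lo, hi) for i in range(len(self.freq))]
        weights = [f * o for f, o in zip(self.freq, overlaps)]
        if sum(weights) == 0:
            weights = overlaps  # fall back to range overlap alone
        total = sum(weights)
        if total == 0:
            return  # the query range lies outside the histogram
        for i, w in enumerate(weights):
            self.freq[i] = max(0.0, self.freq[i] + error * w / total)


# Hypothetical feedback: the range [0, 20) actually returned 600 rows.
hist = FeedbackHistogram(0, 100, buckets=10, total_rows=1000)
before = hist.estimate(0, 20)
hist.tune(0, 20, actual=600)
after = hist.estimate(0, 20)
```

With a damping factor below 1, each observed query moves the estimate only part of the way toward the observed result, so a single unusual query does not dominate the histogram.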
Contributors: Min Wang, Lipyeow Lim (Duke University), and Jeffrey Scott Vitter (Purdue University)
Publications
Estimating Statistics of Operators in Continuous Queries
Statistics estimation, such as estimating the output size of operators, is a
well-studied subject in the database research community, mainly for
the purpose of query optimization. The usual assumption, however, is that
queries are ad hoc, and therefore the emphasis has been on capturing
the data distribution. When long-standing continuous queries on a
changing database are concerned, a more direct approach, namely
building an estimation model for each operator, is possible. In this
project, we propose a novel learning-based method. Our method consists
of two steps. The first step is to design a dedicated feature
extraction algorithm that can be used incrementally to obtain feature
values from the underlying data. The second step is to use a data
mining algorithm to generate an estimation model based on the feature
values extracted from the historical data. Experimental results show that this approach
provides accurate statistics estimates with low overhead.
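The two steps can be sketched as follows, under strong simplifying assumptions. The source does not name the features or the data mining algorithm, so this example uses a running row count and mean as the incrementally maintained features and ordinary least squares on a single feature as a stand-in model. The names `IncrementalFeatures` and `fit_line`, and the `history` values, are hypothetical and purely illustrative.

```python
class IncrementalFeatures:
    """Step 1: features kept up to date as rows arrive or leave."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, value):
        self.count += 1
        self.total += value

    def remove(self, value):
        self.count -= 1
        self.total -= value

    def values(self):
        mean = self.total / self.count if self.count else 0.0
        return self.count, mean


def fit_line(xs, ys):
    """Step 2: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up history of (row count feature, observed operator output size).
history = [(100, 21), (200, 39), (300, 61), (400, 79)]
slope, intercept = fit_line([f for f, _ in history], [s for _, s in history])

# Maintain the feature incrementally, then estimate the current output size.
features = IncrementalFeatures()
for value in range(500):
    features.add(value)
count, mean = features.values()
predicted_size = slope * count + intercept
```

Because the features are updated per row rather than recomputed, producing a fresh estimate costs only one model evaluation, which is in the spirit of the low overhead the project description reports.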
Contributors: Min Wang, Like Gao (University of Vermont), Xiaoyang Sean Wang (University of Vermont),
and Sriram Padmanabhan (IBM Silicon Valley Labs)
Publications