|
Multi Dimensional Clustering (MDC)
in DB2
Providing
an elegant method for flexible, continuous, and automatic clustering
of data along multiple dimensions with applications in business
intelligence and transaction processing environments.
Multi
Dimensional Clustering (MDC) is a new
data layout scheme incorporated in DB2 Universal Database (UDB)
Version 8, which allows a relational table to be clustered on
one or more orthogonal clustering attributes (or expressions)
of a table. Many applications, e.g., Online Analytic Processing
(OLAP) and data warehousing, process a table or tables in a database
using a multi-dimensional access paradigm. Currently, most database
systems can only support organization of a table using a primary
clustering index on one dimension. Secondary indexes are created
to access the tables when the primary key index is not applicable.
Unfortunately, secondary indexes perform many random I/O access
against the table for a simple operation such as a range query.

A three-dimensional MDC
layout on nation, itemid
and year (orderDate)
Our work in MDC addresses this important deficiency
in database systems. It allows the robustness and scalability
provided by relational database systems, but is able to provide
efficiency for multidimensional data access. The above figure
indicates a simple MDC configuration on three dimensions based
on the following table columns: nation, itemId and a third generated
from the year of the orderDate column. The intersection of the
dimension values form logical entities called cells, which are,
in turn made up of physical entities called blocks. These are
accessed using block indexes. A configuration like this would
allow efficient access for user queries on the dimensions.
MDC also provides for a continuous and automatic
maintenance of the clustering over time, which reduces the need
to reorganize the table to regain clustering -- as happens with
current technology. MDC also allows the ability to co-exist with
existing database features such as row based indexes, table constraints,
materialized views, high speed load and mass delete.
These features make MDC useful for a business
intelligence environment as well as a transaction processing environment.
Many major DB2 clients are realizing significant query performance
gains by using MDC in their production systems.
As reported in "MDC Performance
Customer Examples & Experience" at the DB2
Information Management Technical Conference 2004 in Madrid,
Spain by Leslie
Cranston, IBM clients like the Canadian Astronomy Data Center
(CADC), BrasilTelecom, BankOne and Thomas West have reported significant
performance gains in their workloads and benchmarks using MDC.
In particular, the talk mentions Thomas West as reporting "The
new multidimensional data clustering capability has improved performance
of our most complex queries by up to 30 times while removing the
need for additional reorganization". MDC has been used in
systems as large as 32 terabytes and in major industrial benchmarks
like the DB2
10TB TPCH benchmark.
Members of the MDC team have been awarded the
IBM Outstanding Innovation Award for their contributions to DB2
V8 for MDC.
B. Bhattacharjee, S. Padmanabhan, T. Malkemus,
T. Lai, L. Cranston, M. Huras, "Efficient
Query Processing for Multi-Dimensionally Clustered Tables in DB2",
Proceedings of the 29th VLDB Conference 2003.
B. Bhattacharjee, L. Cranston, T. Malkemus, S.
Padmanabhan, "Boosting
Query Performance: Multidimensional Clustering", DB2
Magazine, Quarter 2, 2003.
B. Bhattacharjee, L. Cranston, T. Malkemus, S.
Padmanabhan, "The
Luster of Data Clustering", eServer Magazine,
April 2003.
S. Padmanabhan, B. Bhattacharjee, T. Malkemus,
L. Cranston, M. Huras, "Multi-
Dimensional Clustering: A New Data Layout Scheme in DB2",
Proceedings of the ACM SIGMOD Conference 2003.
Copyright © (2003) by Association for Computing
Machinery, Inc. Permission to make digital or hard copies of part
of all of this work for personal or classroom use is granted without
fee provided that copies are not made or distributed for profit
or commercial advantage. To copy otherwise, to republish, to post
on servers, or to redistribute to lists, requires prior specific
permission and/or a fee.
M. Gonzales, G. Robinson, The
OLAP-Aware Database, DB2 Magazine, Quarter 2, 2003.
Meet
The Experts: Matt Huras and Version 8 Enhancements For DB2 on Linux,
Unix and Windows, DB2 Developers Domain, 23 July 2002.
Multidimensional
Clustering, DB2 Information Center, 2003.
P Gunning, Database
Technology Leaps Ahead, DB2 Magazine, Quarter 4, 2002.
R. Welgan, Comparing
Query Performance, MDC vs. Non-MDC Tables, DB2 Magazine,
Quarter 2, 2003.
S. Gausden, T. Mason, Getting
Started With MDC Space Management, The IDUG Solutions Journal,
October 2002. |
 |
 |
|
Bishwaranjan
Bhattacharjee
Researcher
|
What is the most exciting potential
future use for the work you're doing?
The potential usage of MDC
in Online Transaction Processing (OLTP) for reorganization
avoidance. Reorganizing a table generally means the table
cannot be used during that time and any technology which
helps avoid it can make a major difference in OLTP environments.
What is the most interesting part
of your research?
The Block Index Lookahead (BILA) prefetching technique
used for inter block prefetching for block index scans in
MDC was the most interesting part of my research. This allowed
us to prefetch blocks of data by looking into the block
index, identifying the exact blocks which are need by the
query and prefetching them. It has many advantages over
the existing sequential detection technique for prefetching
and has proved to be useful for other scenarios like block
index mostly scans.
What inspired you to go into this
field?
The impact MDC could have on query processing performance
and its ability to prove to be a key differentiator between
our products and the competition in the Online Analytic
Processing (OLAP) and data warehousing environment was the
key motivator for this work.
What is your favorite invention
of all time?
Powered Flight
Learn
more
|
|