![]() |
![]() |
![]() |
![]() |
|
| Multi Dimensional Clustering (MDC) | |||
|
|
|||
| |
|
|
The Multi Dimensional Clustering (MDC) project deals with the the design and implementation of a new physical data layout scheme in DB2 Universal Database Version 8 ment for a multi dimensional access paradigm . Logically, a database schema defines a large multidimensional cube containing the active domains of its attributes. Many applications, OLAP and data warehousing in particular, usually consider several of these attributes as dimensions for processing and maintenance. For example, a retail warehouse contains dimension attributes based on time, region, customer, product, forecast, and others. Applications and users access the data from this database using subsets of dimensions. Logically, these access patterns can be considered as selecting regions from the large multidimensional cube attributes.
MDC / OLAP Star Schema Example with 5 Logical Dimensions
Most commercial database systems have slowly integrated primary clustering indexes in their repertoire. In this scheme, an index composed of one or more keyparts is identified as the basis for data clustering. All records are organized based on their attribute values for these index keyparts. Two records are physically close to each other if the attributes defining the clustering index keyparts have similar values. Note that the clustering index typically contains one entry for each record. In our terminology, these clustering indexes provide unidimensional physical clustering of data. Records are clustered in the order of the index keyparts. When the query contains predicates on attributes in the index which do not conform to primary order, it generates skip sequential accesses of the data. However, either a secondary index or a table scan method is needed to perform this access. These skip sequential accesses can generate significant I/O seek times as it jumps between pages on disk. Several relational database systems also support a coarser grain range-partitioning technology. Most of these mechanisms are also fairly unidimensional and provide rigid syntax that introduces many manageablility issues if they are to be extended to multiple orthogonal dimensions. OLAP systems recognized this problem of relational databases and provide a solution that physically organizes the data in some multidimensional cube form. A request to find regions in this cube are easily satisfied by OLAP systems. However, these systems usually suffer from sparsity and scalability problems. On the other hand, scalability and ability to address sparsity are strengths of the relational databases. It is to be noted that there are many reasearch publications dealing with spatial clustering. We do not claim to invent a radically different idea for clustering. Instead, we extend basic research ideas to practically surmount the problem of providing efficient clustering using multiple attributes. The basic aim of the MDC project was to provide a scheme that maintains the efficiency and scalability provided by relational database systems but is able to provide efficiency for multidimensional data access. A MDC table is defined to include one or more clustering dimensions. Unlike the unidimensional primary clustering index, we support a physical layout that mimics a multi-dimensional cube by associating a physical region for each unique combination of dimension attribute values. Records containing the same unique values for dimension attributes are clustered in one or more regions called blocks. These blocks are the units of addressability for our clusters. We provide a higher granularity indexing scheme, called block indexes, to address these blocks. Block indexes are quite compact since they identify several records using one entry. In all other aspects, block indexes resemble regular B-tree indexes. We also associate a data structure called the Block Map to maintain information about the state of blocks in a table. How MDC works: When a table is created one can specify one or more keys as dimensions along which to cluster the table's data. In DB2, a new clause called ORGANIZE BY DIMENSIONS has been added for this purpose. For example, the following DDL describes a Sales table organized by storeId and salesDate dimensions. CREATE TABLE Sales( Each of these dimensions can consist of one or more columns, as index keys do. In fact, a 'dimension block index' will be automatically created for each of the dimensions specified and will be used to quickly and efficiently access data. A composite block index will also be created automatically if necessary, containing all dimension key columns, and will be used to maintain the clustering of data over insert and update activity. Every unique combination of dimension values forms a logical 'cell', which is physically organized as blocks of pages, where a block is a set of consecutive pages on disk. The set of blocks that contain pages with data having a certain key value of one of the dimension block indexes is called a 'slice'. Every page of the table is part of exactly one block, and all blocks of the table consist of the same number of pages, viz., the blocksize. In DB2, the the block size is the extent size of the tablespace so that block boundaries line up with extent boundaries.
A 3 Dimension MDC layout on nation, itemid and
year (orderDate) The above figure illustrates these concepts. This MDC table is clustered along the dimensions nation, itemId and year(orderDate). Records in the table are stored in blocks, which contain an extent's worth of consecutive pages on disk. In the diagram, a block is represented by an square, and is numbered according to the logical order of allocated extents in the table. The grid in the diagram represents the logical partitioning of these blocks and each square represents a logical cell. A column or row in the grid represents a slice for a particular dimension. For example, all records containing the value 'Canada' in the nation column are found in the blocks contained in the slice defined by the 'Canada' column in the grid. In fact, each block in this slice only contains records having 'Canada' in the nation field. To facilitate the determination of blocks belonging to a slice or which blocks contain all records having a particular dimension key value, a dimension block index is automatically created for each dimension when the table is created. In our example, a dimension block index is created on nation, one on itemId and one on year(orderDate). Each dimension block index is structured in the same manner as a traditional RID index except that at the leaf level the keys point to a block ID (BID) instead of a record ID (RID). Since each block contains potentially many pages of records, these block indexes are much smaller than RID indexes and need only be updated whenever a new block is added to a cell, or existing blocks are emptied and removed from a cell. A slice, or the set of blocks containing pages with all records having a particular key value in a dimension, will be represented in the associated dimension block index by a BID list for that key value. In the example above, to find the slice containing all records with 'Canada' for the nation dimension, we would look up this key value in the nation dimension block index and find a key. This key points to the exact set of BIDs for the particular value. Similarly, to find the list of blocks containing all records having '1999' for the year(orderDate) dimension, we would look up this value in the year(orderDate) dimension block index, and find the key. Thus, block indexes provide an efficient access mechanism to the data which are clustered in blocks.
New Innovations Incorporated With MDC: The following are some of the new innovations incorporated with MDC 1. Block Predicates: A predicate is a condition in the query. Predicates are of 3 types – index, data and deferred. With current technology they are all applied on every record. With MDC we have introduced a new type of predicate called a block predicate which is applied only 1 time per block. Since predicate processing can be quite costly in CPU usage, the ability to run a predicate one time per block instead of every record where applicable results in query processing advantages. Block predicates can also result in I/O savings since when the predicate fails the rest of the block (which need not be processed) need not be read from disk. 2. Block Locking: MDC tables supports 3 levels of locking – record, table and a MDC specific block lock. The block lock controls access to a block . Having a third lock type which is between a record and table lock allows more concurrency for applications. 3. Block Index Look Ahead Prefetching: An important aspect of query processing is prefetching. DB2 supports sequential detection for record index scans. Sequential detection is mostly useful if the index is well clustered. Also it can become difficult if used on the inner of a nested loop join or if the index scan has a selective index predicate on it. Block index scans use Block Index Look Ahead (BILA) Prefetching instead of sequential detection. Here we look into the index and prefetch only those blocks which qualify for the query. Prefetching here is independent on the inter block clustering and works irrespective of the selectivity of the index predicate or its usage in the inner of the nested loop join.
MDC is being used by many major DB2 customers in their production systems with significant query performance gains. As reported in the talk "MDC Performance Customer Examples & Experience" at the DB2 Information Management Technical Conference 2004 at Madrid, Spain by Leslie Cranston, customers like the Canadian Astronomy Data Center (CADC), BrasilTelecom, BankOne and Thomas West have reported significant performance gains in their workloads and benchmarks using MDC. In particular the talk mentions Thomas West as reporting "The new multidimensional data clustering capability has improved performance of our most complex queries by up to 30 times while removing the need for additional reorganization" . MDC has been used in systems as large as 32 TB and in major industrial benchmarks like the DB2 10TB TPCH benchmark. |
| About IBM | Privacy | Terms of use | Contact |