The following slides were presented at EDBT '98 in March 1998.



First, let me introduce the problem. Consider a situation in which many aggregate queries must be computed, and all of those queries are computed from a single source relation. We would like to compute these multiple aggregate queries from one source relation efficiently.

This figure shows that our methods work 16 times faster than conventional methods over a wide range of grouping selectivities when there are 100 aggregate queries to be computed from the same relation. Next, I would like to explain the problem in more detail.

Let's think about a situation in which multiple aggregates are required. First of all, I would like to talk about a motivating example in OLAP. Consider a relation of 10 million bank customers. It has 30 numeric attributes, such as Age, Balance, and Income, and 100 categorical attributes, such as Occupation, Sex, Credit class, and Overdue experience. Both the number of records and the number of attributes are taken from a real-life data set. Since both numeric and categorical attributes can be used for making groups, we can build a 130-dimensional data cube, and this data cube has 2 to the 130th power vertices. Since response times are crucial in OLAP and computing the complete data cube is infeasible, OLAP systems pre-compute frequently used queries to reduce response times, and the materialized views are used to help answer other aggregate queries. Moreover, pre-computation times should be small, because the source relation is updated frequently.

Let's think about how many materialized views should be pre-computed. Harinarayan et al. presented a greedy algorithm for deciding which subset of the data cube should be pre-computed. Suppose that each dimension has 100 distinct values, distributed uniformly, and that the OLAP system must make every 1-D aggregate computable from some materialized view. Using Harinarayan's cost model, we find that 44 3-D views should be pre-computed to make all 1-D aggregates computable from a materialized view. Similarly, at least 2,795 3-D views should be pre-computed to cover all 2-D aggregates. So many aggregate queries must be pre-computed, and all of them are over the same relation.
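These view counts can be checked with a simple covering argument: a 3-D view can answer any 1-D or 2-D aggregate over a subset of its three dimensions. The sketch below is my own back-of-the-envelope reconstruction, not Harinarayan et al.'s full cost model, but it reproduces the numbers on the slide:

```python
from math import ceil, comb

dims = 130  # grouping attributes: 30 numeric + 100 categorical

# Each 3-D view covers C(3,1) = 3 of the 130 1-D aggregates,
# so at least ceil(130 / 3) = 44 views are needed.
views_for_1d = ceil(comb(dims, 1) / comb(3, 1))

# Each 3-D view covers C(3,2) = 3 of the C(130,2) = 8,385 2-D aggregates,
# so at least ceil(8385 / 3) = 2,795 views are needed.
views_for_2d = ceil(comb(dims, 2) / comb(3, 2))

print(views_for_1d, views_for_2d)  # 44 2795
```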

Let's consider another example using the same bank data. We have fast algorithms for computing association rules from 1-D or 2-D data distributions along numeric attributes. This rule is one example of the interesting 2-D rules. The figure shows that a person whose age and balance fall into this region tends to delay his or her card-loan payments with high probability. The x-axis indicates savings balance and the y-axis indicates age; the distribution of all customers and the ratio of unreliable customers are shown by the brightness and the redness of each pixel. To compute a single rule, the data distributions of all customers and of the unreliable customers are necessary.

For data mining, it is common to compute all rules and sort them by given criteria. The conditional part of a 2-D rule is a region in the Euclidean plane indexed by two numeric attributes, and the target property has the form of a categorical attribute being equal to a value, such as "overdue experience equals YES". Any combination of two numeric attributes may become the conditional part of a rule, and any combination of a categorical attribute and one of its values may become a target property. Since we have 30 numeric attributes, the number of combinations of two numeric attributes is 435. And there are 100 categorical attributes, each with between 2 and several dozen values, so the possible number of target properties ranges from several hundred to several thousand. The number of required data distributions is 435 times this large number. Since each numeric attribute is divided into 20 to 400 buckets, each distribution contains from 20 squared to 400 squared groups, so a large volume of memory is required. In short, our data-mining system requires large numbers of aggregate queries and groups.
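Here is the arithmetic behind these counts as a small sketch; the 1,000 target properties below is an illustrative value within the "several hundred to several thousand" range from the slide:

```python
from math import comb

numeric_attrs = 30
conditional_parts = comb(numeric_attrs, 2)   # C(30,2) = 435 attribute pairs

target_properties = 1000                     # illustrative, per the slide's range
distributions = conditional_parts * target_properties   # 435,000 in this example

# Each axis has 20 to 400 buckets, so one 2-D distribution
# holds from 20^2 = 400 up to 400^2 = 160,000 groups.
groups_low, groups_high = 20 ** 2, 400 ** 2
print(conditional_parts, distributions, groups_low, groups_high)
```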

Algorithms for a single aggregate query are well studied, for example, optimizations for a single aggregation and parallel algorithms for a single query; I will talk about these algorithms later. Optimization of inter-related multiple aggregate queries on a serial machine is also known. But we have many aggregate queries, and we would like to compute multiple aggregate queries on shared-nothing multiprocessors.

Next, I will talk about the serial algorithm for a single aggregate query, called the hash-based algorithm. It works as follows. In step 1, all tuples of the source relation are read, and a hash table is constructed by hashing on the group-by attributes of each tuple. In step 2, when the group values do not all fit into the memory allocated for the hash table, the tuples that do not belong to groups in memory are hash-partitioned into multiple bucket files on disk; the number of partition files is chosen as large as necessary to ensure no future memory overflows. In step 3, the overflow bucket files are processed one by one, in the same way as in step 1. When memory overflow occurs, this algorithm pays extra IO cost.
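A minimal single-process sketch of this hash-based algorithm, assuming SUM aggregation over Python dicts; in-memory lists stand in for the on-disk bucket files, and all names are illustrative:

```python
from collections import defaultdict

def hash_aggregate(tuples, group_by, agg_attr, max_groups, n_buckets=8, level=0):
    """Hash-based SUM aggregation with hash-partitioned overflow buckets."""
    table = {}                     # in-memory hash table: group value -> sum
    buckets = defaultdict(list)    # stand-in for overflow bucket files on disk

    # Steps 1-2: scan the input; tuples of groups that no longer fit
    # in memory are hash-partitioned into bucket files (the extra IO cost).
    for t in tuples:
        g = t[group_by]
        if g in table:
            table[g] += t[agg_attr]
        elif len(table) < max_groups:
            table[g] = t[agg_attr]
        else:
            buckets[hash((g, level)) % n_buckets].append(t)

    # Step 3: process each overflow bucket in the same way; the level salt
    # re-splits groups, mimicking a partition count chosen to avoid overflow.
    results = dict(table)
    for bucket in buckets.values():
        results.update(hash_aggregate(bucket, group_by, agg_attr,
                                      max_groups, n_buckets, level + 1))
    return results
```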

Next, I will discuss parallel algorithms on shared-nothing multiprocessors for a single aggregate query. We suppose that the source relation is partitioned across the local disks. Three parallel algorithms based on the hash-based algorithm are known. The first is the Two-phase algorithm, which works well when the number of groups is small; I will call it 2P. The second is the Repartition algorithm, which works well when the number of groups is large; I will call it Rep. The last is the Broadcasting algorithm, which has been considered impractical for shared-nothing multiprocessors; I will call it BC. However, high-bandwidth multiprocessor interconnects, such as the High Performance Switch of the IBM SP2, have recently become available, and we should evaluate the feasibility of the broadcasting approach on such equipment. Next, I will discuss these algorithms in turn.

This figure shows the relationship between the number of groups and the response times of the three algorithms for a configuration of 16 processors with a high-speed interconnect. As expected, 2P works well for a small number of groups because it communicates less, and Rep works well for a large number of groups because of its efficient use of memory. BC cannot win at all for a single query, because the broadcasting cost is high. But we have multiple queries. Applying the single-query algorithms sequentially is possible, but it is not the best solution.

We extended the algorithms for a single aggregate query.

The main idea is that the multiple queries to be computed are over the same relation, so we can share common work such as scanning the data. By processing multiple queries simultaneously, we extend the single-query algorithms to algorithms for multiple queries: the m2P, mRep, and mBC algorithms, which are based on the 2P, Rep, and BC algorithms respectively. By processing Q aggregate queries simultaneously, the required memory size is Q times as large as for a single-query algorithm. But the communication cost of the mBC algorithm is the same as that of the BC algorithm for a single query; in other words, it is one Qth of the cost of running BC Q times. The m2P has the disadvantage that memory overflow can occur easily, because of its inefficient use of memory. The mBC has a high communication cost, but that cost is shared by Q queries.

The Two-phase algorithm simply partitions the source relation. In the first phase, each node in the system computes aggregates on its local partition of the relation. In the second phase, these partial results are collected at one of the nodes, which merges them to produce the final results. The advantage of the 2P algorithm is its low communication cost. The disadvantages are that the memory required at each node is as large as the total result size, and that the duplication of aggregation work between the first and second phases becomes significant when the number of groups is large.
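A minimal sketch of 2P, simulating the nodes as lists of tuples in one process; SUM aggregation and the names are illustrative:

```python
from collections import Counter

def two_phase(partitions, group_by, agg_attr):
    """2P sketch: partitions[i] holds node i's local tuples."""
    # Phase 1: each node aggregates its local partition independently.
    # Note each node's table may grow to the size of the *entire* result.
    partials = []
    for local in partitions:
        table = Counter()
        for t in local:
            table[t[group_by]] += t[agg_attr]
        partials.append(table)

    # Phase 2: one node collects the partial results and merges them;
    # groups seen on several nodes are aggregated again here (duplicated work).
    final = Counter()
    for table in partials:
        final.update(table)        # Counter.update sums values per group
    return dict(final)
```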

The m2P algorithm processes multiple aggregate queries simultaneously: each node reads its local relation once and aggregates for all the queries. The second phase can also be parallelized.
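A sketch of m2P under the same assumptions, with each query given as a (group-by attribute, aggregated attribute) pair; the single shared scan is the point, at the price of one hash table per query:

```python
from collections import Counter

def m2p(partitions, queries):
    """m2P sketch: queries is a list of (group_by, agg_attr) pairs."""
    # Phase 1: one shared scan per node, one hash table per query,
    # so memory grows Q-fold while the scan cost is paid only once.
    partials = [[Counter() for _ in queries] for _ in partitions]
    for node, local in enumerate(partitions):
        for t in local:
            for q, (g_attr, a_attr) in enumerate(queries):
                partials[node][q][t[g_attr]] += t[a_attr]

    # Phase 2: merge per query; different queries could be merged
    # on different nodes, which is how this phase parallelizes.
    finals = []
    for q in range(len(queries)):
        merged = Counter()
        for node_tables in partials:
            merged.update(node_tables[q])
        finals.append(dict(merged))
    return finals
```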

The Repartitioning algorithm partitions groups, and can work well for a large number of groups. It redistributes the data on the group-by attributes, and performs aggregation on each partition in parallel. This algorithm is efficient when the grouping selectivity is high, because it eliminates duplication of work by processing each value for aggregation just once. It also reduces the memory requirement, since each group value is stored in only one place. However, when the grouping selectivity is so low that the number of groups is smaller than the number of processors, this algorithm cannot use all the processors, which severely affects the performance.
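A sketch of Rep in the same simulation: lists stand in for the network, hashing on the group-by value decides which node owns a group, and each group is aggregated exactly once:

```python
from collections import Counter

def repartition(partitions, group_by, agg_attr):
    """Rep sketch: redistribute on the group-by attribute, aggregate locally."""
    n = len(partitions)
    inboxes = [[] for _ in range(n)]   # stand-in for network transfers

    # Repartition phase: route every tuple to the node owning its group.
    for local in partitions:
        for t in local:
            inboxes[hash(t[group_by]) % n].append(t)

    # Aggregation phase: each node aggregates only the groups it owns,
    # so each group value is processed and stored in exactly one place.
    results = {}
    for inbox in inboxes:
        table = Counter()
        for t in inbox:
            table[t[group_by]] += t[agg_attr]
        results.update(table)
    return results
```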

The mRep algorithm processes multiple aggregate queries simultaneously: each node reads its local relation once, but since every query partitions groups on its own group-by attributes, the tuples must be repartitioned once per query.
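A sketch of mRep under the same assumptions: the scan is shared, but each tuple is routed to one owner node per query, so the repartitioning cost is paid Q times:

```python
from collections import Counter

def mrep(partitions, queries):
    """mRep sketch: one shared scan, one repartitioning per query."""
    n, num_q = len(partitions), len(queries)
    inboxes = [[[] for _ in range(num_q)] for _ in range(n)]

    # Shared scan: each tuple is sent once per query to that query's owner.
    for local in partitions:
        for t in local:
            for q, (g_attr, _) in enumerate(queries):
                inboxes[hash(t[g_attr]) % n][q].append(t)

    # Each node aggregates, per query, only the groups it owns.
    finals = [{} for _ in range(num_q)]
    for node in range(n):
        for q, (g_attr, a_attr) in enumerate(queries):
            table = Counter()
            for t in inboxes[node][q]:
                table[t[g_attr]] += t[a_attr]
            finals[q].update(table)
    return finals
```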

The Broadcasting algorithm broadcasts all local disk pages to all nodes. Receiving the entire source relation, each node computes the aggregations for the groups assigned to it. Since each group value is stored in only one place, the memory requirement of this algorithm is the same as that of the Repartitioning algorithm. The BC algorithm computes both groups and aggregations on the receiving nodes. When the number of groups is smaller than the number of nodes, the BC algorithm cannot use all the processors, for the same reason as in Rep.
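A sketch of BC in the same simulation; the flat list stands in for broadcasting every local page to every node, which is exactly the communication cost that makes BC lose for a single query:

```python
from collections import Counter

def broadcast(partitions, group_by, agg_attr):
    """BC sketch: every node receives the whole relation, keeps its groups."""
    n = len(partitions)
    stream = [t for local in partitions for t in local]  # the broadcast

    results = {}
    for node in range(n):
        table = Counter()
        for t in stream:
            # A node aggregates a tuple only if it owns the tuple's group,
            # so each group is stored in one place, as with Rep.
            if hash(t[group_by]) % n == node:
                table[t[group_by]] += t[agg_attr]
        results.update(table)
    return results
```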

The mBC algorithm processes multiple aggregate queries simultaneously: each node reads its local relation and broadcasts it to all the nodes, and on receiving the broadcast data, each node computes the aggregations of all the queries for its assigned groups.
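A sketch of mBC under the same assumptions; the single broadcast is shared by all Q queries, which is where the one-Qth communication saving comes from:

```python
from collections import Counter

def mbc(partitions, queries):
    """mBC sketch: one broadcast of the relation serves all Q queries."""
    n = len(partitions)
    stream = [t for local in partitions for t in local]  # broadcast once

    finals = [{} for _ in queries]
    for node in range(n):
        for q, (g_attr, a_attr) in enumerate(queries):
            table = Counter()
            for t in stream:
                if hash(t[g_attr]) % n == node:
                    table[t[g_attr]] += t[a_attr]
            finals[q].update(table)
    return finals
```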

These tables compare the costs of scanning, communication, and extra IO between running a single-query algorithm Q times and the corresponding algorithm for multiple queries. In brief, the multiple-query algorithms scan the source relation once instead of Q times, and mBC additionally shares a single broadcast among all Q queries.

This figure shows the relationship between the number of groups per query and the response times of these algorithms. The number of queries is 100 in this case. As you can see, the m2P works well for a small number of groups, and the mRep works well for a very large number of groups. And, surprisingly, the mBC algorithm wins for a middle range, between 5 thousand and 150 thousand groups. The mBC also works well for a very large number of groups because of the trade-off between extra IO and the number of scans: both mBC and mRep scan the source relation only once and pay the necessary extra IO, but the extra IO of mRep is larger than that of mBC.

These figures show the best algorithm for a 16-node system depending on the problem size, that is, the number of queries and the number of groups per query. The switching points between m2P and mBC lie on a line parallel to the dotted line, on which the entire result size is equal to the main memory size of a single node. From these results, we can conclude that none of the algorithms gives a satisfactory performance over the entire range of problem sizes, and that it is necessary to choose the best algorithm according to the overall result size and the available memory size. The mBC algorithm for multiple queries can be practical when high-speed networks are available and the result size is larger than the main memory size of a single node.

The reasons why mBC wins for the middle range are as follows. The mRep repeats repartitioning 100 times, which is costly. The m2P, because of its inefficient use of memory, has to pay extra IO cost when the total result size exceeds the local memory size of each node. Therefore, the broadcasting algorithm is effective when the total result does not fit into the main memory of a single node.

These figures compare the best performance of the proposed algorithms (m2P, mRep, mBC) with that of the conventional approach, namely repeating 2P, Rep, or BC Q times, when there are 100 queries. From the empirical results, we can see that the best of the proposed algorithms is about 16 times faster than the best of the conventional algorithms over a wide range of grouping selectivities.

I will conclude my presentation. We compared the extended Two-phase, Repartitioning, and Broadcasting algorithms, both on analytical models and on empirical results. Although the Broadcasting algorithm has been considered impractical for today's shared-nothing parallel architectures, we showed that it wins when the total result size does not fit into the memory of a single node, provided the system has a high-performance network. None of the algorithms gives a satisfactory performance over the entire range of problem sizes; it is necessary to choose the best algorithm according to the problem size and the available memory size. Analytical cost models will help to predict the best algorithm for a given problem. Finally, we have ignored the relationships among the multiple queries and treated them as independent, but actual queries are often interrelated; optimizing the algorithms to exploit these interrelationships is future work.