The following slides were presented at EDBT '98 in March 1998.
First, let me introduce the problem.
Consider a setting in which many aggregate queries must be computed, all from a single source relation. We would like to compute these multiple aggregate queries efficiently.
This figure shows that our methods run 16 times faster than conventional methods over a wide range of grouping selectivities when 100 aggregate queries are computed from the same relation. Next, I would like to explain the problem in more detail.
Let's think about a situation in which multiple aggregates are required.
First of all, I would like to talk about a motivating example in OLAP.
Let's consider a relation of 10 million bank customers.
It has 30 numeric attributes, such as Age, Balance, and Income, and 100 categorical attributes, such as Occupation, Sex, Credit class, and Overdue experience. Both the number of records and the number of attributes are taken from a real-life data set.
Since both numeric and categorical attributes can be used for grouping, we can build a 130-dimensional data cube. This data cube has 2 to the 130th power vertices.
Since response times are crucial in OLAP, and computing the complete data cube is difficult, OLAP systems pre-compute frequently used queries to reduce response times.
These materialized views are then used to help answer other aggregate queries.
Pre-computation times should also be small, because the source relation is updated frequently.
Let's think about how many materialized views should be pre-computed.
Harinarayan et al. presented a greedy algorithm for deciding which subset of the data cube's views should be pre-computed.
We suppose that each dimension has 100 distinct values, uniformly distributed, and that the OLAP system must make every 1-D aggregate computable from some materialized view.
Using Harinarayan's cost model, we find that 44 3-D views must be pre-computed to make all 1-D aggregates computable from a materialized view. Similarly, at least 2,795 3-D views must be pre-computed for all 2-D aggregates.
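These counts can be reproduced by a simple covering argument (my reconstruction; the slide states only the results). Each 3-D view makes C(3,1) = 3 of the 130 1-D aggregates computable, so at least ceil(130/3) = 44 views are needed. Likewise, each 3-D view covers C(3,2) = 3 of the C(130,2) = 8,385 attribute pairs, so at least 8,385/3 = 2,795 views are needed for all 2-D aggregates.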
Thus many aggregate queries must be pre-computed, and all of them are computed from the same relation.
Let's consider another example using the same bank data.
We have fast algorithms for computing association rules from 1D or
2D data distributions along numeric attributes.
This is one example of an interesting 2-D rule.
This figure shows that a person whose age and balance fall into this region tends to delay his or her card-loan payments with high probability. The x-axis indicates savings balance, the y-axis indicates age, and the distribution of all customers and the ratio of unreliable customers are shown by the brightness and redness of each pixel. To compute a single rule, the data distributions of all customers and of the unreliable customers are necessary.
For data mining, it is common to compute all rules and sort them by given criteria.
The conditional part of a 2-D rule is a region in the Euclidean plane indexed by two numeric attributes, and the target property has the form of a categorical attribute equaling a value, such as "overdue experience = YES". Any combination of two numeric attributes may serve as the conditional part of a rule, and any combination of a categorical attribute and one of its values may serve as a target property.
Since we have 30 numeric attributes, the number of combinations of two numeric attributes is C(30,2) = 435.
There are 100 categorical attributes, and the number of values per attribute is between 2 and several dozen.
Therefore, the number of possible target properties ranges from several hundred to several thousand.
The number of required data distributions is therefore 435 times this large number. Since each numeric attribute is divided into 20 to 400 buckets, each distribution has between 20-squared = 400 and 400-squared = 160,000 groups, so a large volume of memory is required. In short, our data-mining system requires large numbers of aggregate queries and groups.
Algorithms for a single aggregate query are well studied: for example, optimizations for a single aggregation and parallel algorithms for a single query; I will talk about these algorithms later. Optimization of inter-related multiple aggregate queries on a serial machine is also known. But we have many aggregate queries, and we would like to compute them on shared-nothing multiprocessors.
Next, I will talk about the serial algorithm for a single aggregate query, called the hash-based algorithm.
This algorithm works as follows.
In step 1, all tuples of the source relation are read, and a hash
table is constructed by hashing on the group-by attributes of the tuple.
In step 2, when the group values do not all fit into the memory allocated for the hash table, the tuples that do not belong to groups already in memory are hash-partitioned into multiple bucket files on disk. There are as many partition files as necessary to ensure no future memory overflows.
In step 3, the overflow bucket files are processed one by one, in the same way as in step 1. When memory overflow occurs, this algorithm incurs extra I/O cost.
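A minimal sketch of this scheme in Python (SUM as the example aggregate; the table capacity, the number of partitions, and the use of in-memory lists in place of disk bucket files are my illustrative assumptions, not the paper's implementation):

    from collections import defaultdict

    MAX_GROUPS = 100_000   # assumed capacity of the in-memory hash table
    NUM_PARTS = 8          # assumed number of overflow partitions

    def hash_aggregate(tuples, group_key, agg_value):
        table = defaultdict(int)                   # group value -> running SUM
        overflow = [[] for _ in range(NUM_PARTS)]  # stand-ins for bucket files on disk

        # Steps 1 and 2: build the hash table on the group-by attributes;
        # once the table is full, hash-partition non-resident tuples instead.
        for t in tuples:
            g = group_key(t)
            if g in table or len(table) < MAX_GROUPS:
                table[g] += agg_value(t)
            else:
                overflow[hash(g) % NUM_PARTS].append(t)  # spill to a bucket

        results = dict(table)

        # Step 3: process each overflow bucket one by one, as in step 1;
        # partitioning keeps each bucket small enough to aggregate in memory.
        for bucket in overflow:
            sub = defaultdict(int)
            for t in bucket:
                sub[group_key(t)] += agg_value(t)
            results.update(sub)
        return results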
Next, I will discuss parallel algorithms for a single aggregate query on shared-nothing multiprocessors.
We suppose that the source relation is partitioned across the local disks.
Three parallel algorithms based on the hash-based algorithm are known.
The first is the Two-phase algorithm, which works well when the number of groups is small; I will call it 2P.
The second is the Repartitioning algorithm, which works well when the number of groups is large; I will call it Rep.
The last is the Broadcasting algorithm, which has been considered impractical for shared-nothing multiprocessors; I will call it BC.
However, high-bandwidth multiprocessor interconnects, such as the High Performance Switch of the IBM SP2, have recently become available.
We should evaluate the feasibility of the broadcasting approach on such high-performance networks.
Next, I will discuss each of these algorithms in turn.
This figure shows the relationship between the number of groups and the response times of the three algorithms for a configuration of 16 processors with a high-speed interconnect.
As expected, the 2P works well for a small number of groups, because it requires less communication.
The Rep works well for a large number of groups, because it uses memory efficiently.
The BC cannot win at all for a single query, because its broadcasting cost is high. But we have multiple queries.
Applying the single-query algorithms sequentially is possible, but it is not the best solution.
We extended the algorithms for a single aggregate query.
The main idea is that since the multiple queries are computed from the same relation, we can share common work, such as scanning the data.
By processing multiple queries simultaneously, we extend the single-query algorithms into algorithms for multiple queries. We introduce the m2P, mRep, and mBC algorithms, which are based on the 2P, Rep, and BC algorithms, respectively.
By processing multiple aggregate queries simultaneously, the required memory size becomes Q times that of the single-query algorithms. But the communication cost of the mBC algorithm is the same as that of the BC algorithm for a single query; that is, the communication cost of mBC is one Qth of the cost of running BC Q times.
The m2P has the disadvantage that memory overflow occurs easily, because of its inefficient use of memory.
The mBC has a high communication cost, but that cost is shared by the Q queries.
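Stated as a formula (a restatement of the claim above, writing comm(X) for the communication volume of algorithm X): comm(mBC) = comm(BC), whereas running BC separately for Q queries costs Q × comm(BC); hence the communication cost of mBC is one Qth of that of the sequential approach.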
The two-phase algorithm simply partitions the source relation.
In the first phase, each node in the system computes aggregates on its local partition of the relation.
In the second phase, these partial results are collected at one of the nodes, which merges them to produce the final results.
The advantage of the 2P algorithm is that it requires less communication.
Its disadvantages are that the memory required on each node equals the total result size, and that the duplication of aggregation work between the first and second phases becomes significant when the number of groups is large.
The m2P algorithm processes multiple aggregate queries simultaneously: it reads the local relation once and aggregates for all the queries. The second phase can also be parallelized.
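A minimal sketch of m2P in Python, under the same assumptions as before (SUM as the aggregate; `queries` mapping a query name to a pair of functions (group_key, agg_value) is my illustrative interface, not the paper's):

    from collections import defaultdict

    def m2p_local_phase(partition, queries):
        # Phase 1: one shared scan of the local partition feeds a
        # separate hash table for every query.
        tables = {q: defaultdict(int) for q in queries}
        for t in partition:                        # the relation is read only once
            for q, (group_key, agg_value) in queries.items():
                tables[q][group_key(t)] += agg_value(t)
        return tables

    def m2p_merge_phase(per_node_tables, queries):
        # Phase 2: a coordinator merges every node's partial results;
        # partial SUMs are additive, so merging is simple accumulation.
        final = {q: defaultdict(int) for q in queries}
        for tables in per_node_tables:             # one entry per node
            for q, table in tables.items():
                for group, value in table.items():
                    final[q][group] += value
        return final

    # Usage: results = m2p_merge_phase(
    #     [m2p_local_phase(p, queries) for p in partitions], queries)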
The Repartitioning algorithm partitions groups, and can work well
for a large number of groups.
It redistributes the data on the group-by attributes, and performs
aggregation on each partition in parallel.
This algorithm is efficient when the grouping selectivity is high,
because it eliminates duplication of work by processing each value
for aggregation just once.
It also reduces the memory requirement, since each group value is
stored in only one place.
However, when the grouping selectivity is so low that the number of
groups is smaller than the number of processors, this algorithm
cannot use all the processors, which severely affects the performance.
The mRep algorithm processes multiple aggregate queries simultaneously.
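The Repartitioning idea can be sketched in Python as follows (a minimal single-query illustration with SUM as the aggregate; in-memory lists stand in for network sends). mRep shares the single scan across queries but, as noted later, repeats the repartitioning once per query.

    from collections import defaultdict

    def rep_aggregate(partitions, group_key, agg_value, num_nodes):
        # Each node scans its local partition and routes every tuple to
        # the node that owns the tuple's group (lists stand in for sends).
        inbox = [[] for _ in range(num_nodes)]
        for partition in partitions:
            for t in partition:
                inbox[hash(group_key(t)) % num_nodes].append(t)

        # Every node then aggregates its own groups independently;
        # each group value is stored in exactly one place.
        results = {}
        for node_tuples in inbox:
            table = defaultdict(int)
            for t in node_tuples:
                table[group_key(t)] += agg_value(t)
            results.update(table)
        return results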
The Broadcasting algorithm broadcasts all local disk pages to all nodes.
On receiving the entire source relation, each node computes aggregations for the groups assigned to it.
Since each group value is stored in only one place, the memory
requirement of this algorithm is the same as that of the
Repartitioning algorithm.
The BC algorithm computes both groups and aggregations on the receiving nodes.
When the number of groups is smaller than the number of nodes, the BC algorithm cannot use all the processors, for the same reason as the Rep.
The mBC algorithm processes multiple aggregate queries simultaneously. Each node reads its local relation and broadcasts it to all nodes. On receiving the broadcast data, each node computes aggregations for all the queries.
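A minimal sketch of mBC under the same illustrative interface as the m2P sketch (`queries` maps a query name to (group_key, agg_value) functions; assigning groups to nodes by hashing is my assumption):

    from collections import defaultdict

    def mbc_aggregate(partitions, queries, num_nodes):
        # Every node receives the entire relation once (the broadcast),
        # but aggregates only the groups it owns, for all Q queries at
        # once, so a single broadcast is shared by every query.
        final = {q: {} for q in queries}
        for node in range(num_nodes):              # simulate each node's work
            tables = {q: defaultdict(int) for q in queries}
            for partition in partitions:           # all pages reach all nodes
                for t in partition:
                    for q, (group_key, agg_value) in queries.items():
                        group = group_key(t)
                        if hash(group) % num_nodes == node:  # owned groups only
                            tables[q][group] += agg_value(t)
            for q in queries:
                final[q].update(tables[q])         # group sets are disjoint across nodes
        return final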
These tables compare the costs of scanning, communication, and extra I/O between running a single-query algorithm Q times and the corresponding algorithm for multiple queries.
This figure shows the relationship between the number of groups per query and the
response times of these algorithms.
The number of queries is 100 in this case.
As you can see, the m2P works well for a small number of groups.
And the mRep works well for a very large number of groups.
And, surprisingly, the mBC algorithm wins in the middle range, between five thousand and one hundred and fifty thousand groups.
The mBC also works well for a very large number of groups because of the trade-off between extra I/O and the number of scans: both mBC and mRep scan the source relation only once and incur extra I/O as needed, but the extra I/O of mRep is larger than that of mBC.
These figures show the best algorithm for a 16-node system depending on the problem size, which is given by the number of queries and the number of groups per query.
The switching points of m2P and mBC lie on a line parallel to the dotted line, on which
the entire result size is equal to the main memory size of a single node.
From these results, we can conclude that none of the algorithms gives satisfactory performance over the entire range of problem sizes, and that it is necessary to choose the best algorithm according to the overall result size and the available memory size.
The mBC algorithm for multiple queries can be practical when high-speed networks are available and the result size is larger than the main memory size of a single node.
The reasons why the mBC wins in the middle range are as follows.
The mRep repeats the repartitioning 100 times, which is costly.
Because of its inefficient use of memory, the m2P has to pay extra I/O costs when the total result size exceeds the local memory size of each node.
Therefore, the broadcasting algorithm is effective when the total results do not fit into the main memory of a single node.
These figures compare the best performance of the proposed algorithms (m2P, mRep, mBC) with that of the conventional algorithms (repeating 2P, Rep, or BC Q times) when there are 100 queries.
From the empirical results, we can see that the best of the proposed algorithms is about 16 times faster than the best of the conventional algorithms over a wide range of grouping selectivities.
I will conclude my presentation.
We compared the extended Two-phase, Repartitioning, and Broadcasting algorithms using both analytical models and empirical results.
Although the Broadcasting algorithm has been considered impractical for today's shared-nothing parallel architectures, we showed that it wins when the total result size does not fit into the memory of a single node, provided the system has a high-performance network.
None of the existing algorithms gives satisfactory performance over the entire range of problem sizes; it is necessary to choose the best algorithm according to the problem size and the available memory size.
Analytical cost models will help to predict the best algorithm for given problems.
We have ignored the relationships among the multiple queries and treated them as independent.
But actual queries are often interrelated.
We need to optimize the algorithms to exploit the interrelationships among multiple queries.