Here, let me explain the motivation for creating ``equal-sized'' buckets. We call a bucket finest when it contains a single value.
Ideally, all buckets should be finest, because a combination of finest buckets gives the range of the optimized rule.
Let's look at a relation of bank customers that has millions of tuples.
The relation has attributes ``Age'' and ``Balance''.
The domain of age is the non-negative integers bounded by 120 or so, so the number of finest buckets for age is less than 130.
But the domain of the balance of millions of bank customers may range from 0 dollars up to the amount the richest customer has. In this case, the number of finest buckets may amount to millions.
By sorting all the data on balance we could obtain the finest buckets, but we do not want to sort when we handle a huge database that cannot fit into the main memory.
Instead, we generate a smaller number of almost equal-sized buckets with a lightweight randomized bucketing algorithm that requires neither sorting the whole data nor a large main memory.
First, from the original data, draw a random sample of size S that is small enough to fit in the main memory, for instance ten thousand tuples.
Second, sort the sample, which takes O(S log S) time.
Lastly, scan the sorted sample and pick every (S/M)-th value as a bucket boundary, which yields M almost equal-sized buckets.
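To make these three steps concrete, here is a minimal Python sketch; the function name and the use of reservoir sampling for the first step are illustrative choices, not necessarily the ones in our implementation.
\begin{verbatim}
import random

def make_bucket_boundaries(values, num_buckets, sample_size):
    """Pick approximately equal-frequency bucket boundaries.

    values      -- an iterable over the N original values of one attribute
    num_buckets -- M, the desired number of buckets
    sample_size -- S, e.g. 40 * M; must fit in main memory
    """
    # Step 1: draw an S-sized random sample in a single scan
    # (reservoir sampling, so the data never has to fit in memory).
    sample = []
    for i, v in enumerate(values):
        if i < sample_size:
            sample.append(v)
        else:
            j = random.randint(0, i)
            if j < sample_size:
                sample[j] = v

    # Step 2: sort only the sample, O(S log S).
    sample.sort()

    # Step 3: pick every (S/M)-th sample value as a boundary,
    # yielding M buckets of roughly N/M original tuples each.
    step = len(sample) / num_buckets
    return [sample[int(k * step)] for k in range(1, num_buckets)]
\end{verbatim}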
A bucket is expected to contain N/M of the original tuples; suppose a bucket actually contains X of them.
We say a bucket is not acceptable if the relative difference |X - N/M| / (N/M) exceeds some threshold delta, say 50%.
The probability that a bucket is not acceptable depends on S/M (the sample size divided by the number of buckets), not on the size N of the original data.
This figure shows the relationship between S/M and that probability.
The probability drops sharply as S/M increases.
It becomes small enough when S/M is around 40, and it decreases only slowly once S/M is greater than 40.
Thus in our implementation, we use 40 times M as the sample size S.
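The figure is not reproduced here, but the dependence on S/M rather than on N is easy to check with a small simulation; the sketch below is only an illustration on uniform random data, not one of the reported experiments.
\begin{verbatim}
import bisect
import random

def unacceptable_rate(n, m, s_over_m, delta=0.5, trials=5):
    """Estimate the fraction of buckets whose size deviates from the
    expected N/M by more than delta, with a sample of size S = s_over_m * M."""
    bad, total = 0, 0
    for _ in range(trials):
        data = [random.random() for _ in range(n)]
        sample = sorted(random.sample(data, s_over_m * m))
        step = len(sample) / m
        bounds = [sample[int(k * step)] for k in range(1, m)]
        counts = [0] * m
        for v in data:
            counts[bisect.bisect_left(bounds, v)] += 1
        expected = n / m
        bad += sum(abs(c - expected) / expected > delta for c in counts)
        total += m
    return bad / total

# Roughly the same rate for different N, but it drops as S/M grows:
for n in (100_000, 1_000_000):
    print(n, unacceptable_rate(n, m=100, s_over_m=10),
             unacceptable_rate(n, m=100, s_over_m=40))
\end{verbatim}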
By keeping the bucket boundaries in an ordered binary tree, we can distribute all N original tuples into the M buckets, and so count the size of every bucket, in O(N log M) time.
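As a sketch of this counting step, binary search over a sorted array of boundaries plays the same role as an ordered binary tree and gives the same O(N log M) cost; the function below is illustrative.
\begin{verbatim}
import bisect

def count_bucket_sizes(values, boundaries):
    """Count how many of the original values fall into each bucket.

    boundaries -- the M-1 sorted boundaries produced above; bucket k
                  roughly holds the values v with
                  boundaries[k-1] < v <= boundaries[k]
                  (the first and last buckets are open-ended).
    """
    # Each lookup is a binary search over M-1 boundaries, so scanning
    # all N values costs O(N log M).
    counts = [0] * (len(boundaries) + 1)
    for v in values:
        counts[bisect.bisect_left(boundaries, v)] += 1
    return counts
\end{verbatim}
Here counts[k] plays the role of X above and can be compared against the expected size N/M.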
We generated random test data with 8 numeric attributes and 8 Boolean
attributes. As a test case, we divided the data into 1,000 buckets with respect to each numeric attribute, and counted the number of tuples in every bucket for each Boolean attribute.
We compared our algorithm with two conventional methods.
One is ``Naive Sort'', which sorts all the data with Quicksort for each numeric attribute.
The other is ``Vertical Split Sort'', which first splits the data vertically into a temporary table of tuple IDs and a single numeric attribute, so that the temporary table fits into the main memory, and then sorts that temporary table.
This graph shows the results.
This line shows the result of ``Naive Sort'', this line is ``Vertical Split Sort'', and this is our method.
``Naive Sort'' becomes useless when the data size exceeds one million tuples.
The execution times of ``Vertical Split Sort'' and of our method grow almost linearly in the number of tuples.
Because a vertically split table for 5 million tuples still fits into the main memory of today's computers, ``Vertical Split Sort'' worked fine at that scale. But when the data grows to 100 million tuples, this method will also become useless.
I would like to briefly touch on how to determine the number of buckets.
The conclusion is that the number of buckets should be much greater than the inverse of the support.
Let's consider the possible error caused by the granularity of the buckets.
The optimal range could be computed by using the finest buckets, so let's assume that we know this optimal range, [x, y].
Now assume that we use these buckets to compute an optimal range.
Because x and y may fall in the middle of some buckets, our algorithm will output one of those ranges.
Here are the bounds on the approximated support and confidence.
For example, when the support of the optimal range is 30% and the confidence is 70%, this table shows the bounds on the errors.
If you want to compute a rule with small support, you should use many buckets, so that 2B, where B is the fraction of the data held by a single bucket (about 1/M), becomes much smaller than the support.
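As a rough illustration of this point, here is a sketch under the assumption that each bucket holds about a 1/M fraction of the data and that the reported range differs from [x, y] by at most one bucket at each end (the precise bounds are the ones in the table above):
\[
  \bigl|\,\mathrm{supp}(\text{reported range}) - \mathrm{supp}([x,y])\,\bigr|
  \;\le\; 2B \approx \frac{2}{M},
  \qquad\text{so we want}\qquad
  \frac{2}{M} \ll \mathrm{supp}([x,y])
  \quad\Longleftrightarrow\quad
  M \gg \frac{2}{\mathrm{supp}([x,y])}.
\]
For example, with M = 1000 buckets, 2/M is only 0.2%, which is negligible compared with a 30% support.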