Some of the following foils were presented at ACM SIGMOD'96 in June 1996.

Suppose that we are given a relation with many numeric attributes and we want to understand a particular property of interest in the database.

To this end, it would be helpful if we could find a numeric attribute together with a range of its values, or a pair of numeric attributes together with a region in their plane, that best describes the target condition.

For instance, consider a relation of millions of customers in a bank that has many numeric attributes, such as the balance of the checking account, the balance of the savings account, and so on.

Suppose that we want to find conditions that characterize well those unreliable customers who have delayed the payment of their credit card charges.

We want to find a rule of the form

(Balance in [X1,X2]) => (CardLoanDelay = yes),
which states that customers whose balances fall in the range [X1,X2] have delayed the payment of their credit card charges with some probability. We call such a rule one-dimensional because it refers to only one numeric attribute. We may also need a two-dimensional one.

An example of a two-dimensional rule is
(Age,Balance) in R => (CardLoanDelay = yes),
which states that customers whose ages and balances fall in the region R have delayed the payment of their credit card charges with some probability.

Now let us consider the size of the region R. Suppose that the green circle in the foil illustrates an R that contains a very small number of customers, say 0.05% of all customers. Then the 2D rule applies only to a very small set of customers, and therefore is not very meaningful.

On the other hand, consider the case when R, shown by the red circle, contains 99% of all customers. The 2D rule can be applied to almost all customers, and hence it cannot characterize unreliable customers well.

Thus we should avoid generating very small or very large regions. To this end, it is effective to impose a minimum threshold, say 10%, on the percentage of customers in the region R. We then look for an optimal region that maximizes the probability of unreliable customers.


Before getting into the details of the algorithms for computing the optimized regions, we show the output image of our system.

Suppose that we are given a database in which 3.54% of the customers are unreliable, and we look for a rule of the form

(Age,Balance) in R => (CardLoanDelay = yes).

If we give a 10% minimum threshold for the ratio of customers in R, the system extracts the region enclosed by the blue lines, in which 14.39% of the customers are unreliable, much higher than the average of 3.54%.

In the figure, the redness of each pixel shows the probability that a customer in the pixel is unreliable.
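Such an image can be produced directly from the per-pixel counts computed during the discretization described later; the following minimal sketch assumes those counts as 2D arrays (the support and hit counts defined in the next foil) and uses matplotlib as one plausible rendering choice.

    import numpy as np
    import matplotlib.pyplot as plt

    def render_confidence(support, hit):
        # Color each pixel by hit/support, i.e. the probability of
        # unreliable customers within the pixel (redder = higher).
        support = np.asarray(support, dtype=float)
        hit = np.asarray(hit, dtype=float)
        conf = np.divide(hit, support, out=np.zeros_like(hit),
                         where=support > 0)
        plt.imshow(conf.T, origin="lower", cmap="Reds")
        plt.xlabel("Age bucket")
        plt.ylabel("Balance bucket")
        plt.colorbar(label="confidence per pixel")
        plt.show()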


We now introduce some technical terms. Consider the rule of the form:
(Age,Balance) in R => C

support(R) means the percentage of tuples in the database that meet the presumptive condition "(Age, Balance) in R."

hit(R) is the ratio of tuples that satisfy both the presumptive condition and the target condition in the conclusion.

confidence(R) is hit(R) divided by support(R), and it expresses the probability that customers whose ages and balances are in the region R meet the target condition C.
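For concreteness, here is a minimal Python sketch of these three measures for a rectangular region; the tuple layout and the names (age, balance, delayed) are hypothetical stand-ins for the bank relation.

    def measures(tuples, region):
        # tuples: (age, balance, delayed) triples; delayed is True when the
        # tuple meets the target condition C (hypothetical layout).
        # region: ((age_lo, age_hi), (bal_lo, bal_hi)), a rectangle for R.
        (a_lo, a_hi), (b_lo, b_hi) = region
        total = len(tuples)
        in_r = [d for (a, b, d) in tuples
                if a_lo <= a <= a_hi and b_lo <= b <= b_hi]
        support = len(in_r) / total     # meets "(Age, Balance) in R"
        hit = sum(in_r) / total         # ...and also meets C
        confidence = hit / support if support else 0.0
        return support, hit, confidence

Note that hit(R) = support(R) * confidence(R) by definition.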

For instance, the red circle in the foil shows a region with 10% support and 30% confidence. The product of those figures gives 3% hit. The green circle illustrates another region with 30% support and 20% confidence.

We may find a region with higher confidence at a smaller support. Thus there is a tradeoff between support and confidence. This tradeoff naturally raises the following types of optimization problems.

Given a minimum threshold for support(R) (or hit(R)), we search for optimized confidence regions that maximize confidence(R).

Given a minimum threshold for confidence(R), we look for optimized support (or hit) regions that maximize support(R) (or hit(R), respectively).
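Naively, both problem types are searches over candidate regions. The brute-force sketch below only pins down the two objectives (the algorithms discussed later avoid this enumeration); measures_of is assumed to return the triple (support, hit, confidence) of a region, as in the previous sketch.

    def optimized_confidence(candidates, measures_of, min_support):
        # Among regions with support(R) >= min_support, maximize confidence(R).
        feasible = [r for r in candidates if measures_of(r)[0] >= min_support]
        return max(feasible, key=lambda r: measures_of(r)[2], default=None)

    def optimized_support(candidates, measures_of, min_confidence):
        # Among regions with confidence(R) >= min_confidence, maximize support(R).
        feasible = [r for r in candidates if measures_of(r)[2] >= min_confidence]
        return max(feasible, key=lambda r: measures_of(r)[0], default=None)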


In order to generate images and extract regions, we first discretize the relation into 2D matrices. For instance, for each pair of numeric attributes, say Age and Balance, we divide the Euclidean plane into pixels.

This can be done as follows:

  1. First we sort the tuples in the relation according to the values of Age and Balance, respectively. For other pairs of attributes, we can perform the same procedure to extract regions.
  2. Second, we generate equi-depth buckets for each attribute so that the tuples are distributed to each bucket almost evenly.
  3. Finally, for each pixel P, we count support(P) and hit(P). In order to minimize the data transfer between the secondary disk and the main memory, we scan the relation on the secondary disk once and, for each tuple, find the pixel it falls into. To locate the pixel efficiently, we can use a hash function or a balanced binary search tree over the buckets of each attribute (a code sketch of these steps appears after this list).
In the above steps, the first step could be the bottleneck when we handle a huge relation that occupies much more space than the main memory. In such cases, we instead use randomized bucketing, which takes a small sample, sorts the sample, and computes substitutes for the equi-depth buckets of each attribute.
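The following is a minimal Python sketch of these steps, assuming the relation fits in memory as (age, balance, delayed) triples; bisect over the bucket boundaries stands in for the hash function or balanced tree of step 3.

    import bisect

    def equi_depth_boundaries(values, n_buckets):
        # Steps 1-2: sort one attribute and cut it into equi-depth buckets;
        # returns the n_buckets - 1 inner bucket boundaries.
        s = sorted(values)
        step = len(s) / n_buckets
        return [s[int(i * step)] for i in range(1, n_buckets)]

    def count_pixels(tuples, n_buckets):
        # Step 3: one scan over the relation; each tuple is routed to its
        # pixel by binary search over the boundaries and counted.
        a_cuts = equi_depth_boundaries([a for (a, b, d) in tuples], n_buckets)
        b_cuts = equi_depth_boundaries([b for (a, b, d) in tuples], n_buckets)
        support = [[0] * n_buckets for _ in range(n_buckets)]
        hit = [[0] * n_buckets for _ in range(n_buckets)]
        for (a, b, d) in tuples:
            i = bisect.bisect_right(a_cuts, a)   # bucket index along Age
            j = bisect.bisect_right(b_cuts, b)   # bucket index along Balance
            support[i][j] += 1
            if d:                                # meets the target condition
                hit[i][j] += 1
        return support, hit

For the randomized-bucketing variant, equi_depth_boundaries would simply be applied to a small sample of each column instead of the whole column.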

The next step is to compute the optimized regions. In our SIGMOD'96 paper, we consider two types of regions: rectangles and x-monotone regions. Rectangles are the most standard, but they are not always adequate for characterizing the target customers of interest; we may need regions of more varied shapes.

We consider x-monotone regions, whose intersection with any vertical line is undivided.
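On the pixel grid, this property says that in every column of pixels the region occupies one contiguous run. A small hypothetical check (it tests only the per-column condition, not connectivity of the region as a whole):

    def is_x_monotone(mask):
        # mask[i][j] is True when pixel (column i, row j) is in the region.
        for column in mask:
            rows = [j for j, inside in enumerate(column) if inside]
            if rows and rows[-1] - rows[0] + 1 != len(rows):
                return False   # this column's pixels split into >= 2 runs
        return True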


Now let us give some results for computing optimized confidence/support/hit regions.

In the case of ranges, at ACM PODS'96 we presented an algorithm for optimized support rules and another for optimized confidence rules. Let M denote the number of tuples in the input database. Both algorithms run in O(M) time if the data is sorted.

Let n denote the number of pixels. We have O(n^1.5) time algorithms for computing optimized rectangles by reducing the problem to the case of ranges.

Computing optimized x-monotone regions is a hard problem. It can be shown that no algorithm running in polynomial time with respect to n and log M exists unless P=NP.

In what follows, we present an O(n log M) time approximation algorithm.


With each region R we associate a stamp point (support(R), hit(R)). Since there could be a huge number of regions, we cannot afford to generate all those stamp points; we just imagine those points on the Euclidean plane.

Let us focus on a stamp point (support(R), hit(R)). The slope of the line between the origin and the stamp point gives the confidence, confidence(R).


Suppose that the optimized region of interest is the red point hidden inside the upper convex hull (light blue lines) of all stamp points. As we pointed out before, computing the red point itself is a hard problem. We instead look for a point on the upper convex hull, say the blue point, that is close to the red point, and we substitute the blue point for the red one.

We will explain how to find such an approximation, but before that we show how to find the optimized confidence/support/hit regions.


Consider the case of computing the optimized confidence region for a given minimum support threshold, illustrated by the green line in the picture.

We need to consider all stamp points on the right side of the green line; among them, the red point represents the optimized confidence region, which, however, is hidden inside the upper convex hull.

So we substitute the blue point for the red one.


Next, let us consider the case of computing the optimized confidence region for a given minimum hit threshold, illustrated by the green line in the picture.

We need to consider all stamp points above the green line; among them, the red point represents the optimized confidence region.

So we substitute the blue point for the red one.


The next problem is to compute the optimized support region for a given minimum confidence threshold, illustrated by the green line in the picture.

We need to consider all stamp points above the green line; among them, the red point represents the optimized support region.

So we substitute the blue point for the red one.


The problem of computing the optimized hit region for a given minimum confidence threshold can be solved in a way similar to the previous solutions.

Now we are in a position to show how to probe a point on the upper convex hull of all stamp points.

At ACM SODA'96, Asano, Chen, Katoh, and Tokuyama proposed a technique called "hand-probing." Observe that for any point on the upper convex hull, there exists a line tangent to the hull at that point. Also note that for any non-negative slope (theta in the picture), there is a line with that slope that touches at least one point on the upper hull. Suppose that the touching point represents a region R. Then R maximizes

hit(R) - theta * support(R).
Using the technique introduced in Asano et al.'s ACM SODA'96 paper, we obtained an algorithm that computes the region R maximizing the above value in time proportional to the number of pixels. Details of the algorithm can be found in our paper presented at ACM SIGMOD'96.
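For intuition only, here is a heavily simplified sketch of that maximization: assign each pixel P the gain hit(P) - theta * support(P) and, in every column, pick the contiguous interval of maximum total gain (Kadane's algorithm). The actual SIGMOD'96 algorithm additionally stitches the column intervals together into one x-monotone region by dynamic programming, which this sketch omits.

    def best_column_intervals(support, hit, theta):
        # For each pixel column, find the contiguous interval maximizing
        # sum(hit - theta * support) with Kadane's algorithm.
        regions = []
        for col in range(len(support)):
            gains = [hit[col][j] - theta * support[col][j]
                     for j in range(len(support[col]))]
            best, best_ival = 0.0, None      # the empty interval is allowed
            run, run_start = 0.0, 0
            for j, g in enumerate(gains):
                if run <= 0:
                    run, run_start = g, j    # restart the running interval
                else:
                    run += g
                if run > best:
                    best, best_ival = run, (run_start, j)
            regions.append(best_ival)        # None when the column is skipped
        return regions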

Using this technique of probing a point on the upper convex hull, we can search for a region that approximates the optimized region.

For instance, the green point in the picture shows the target optimized confidence region for a given minimum support threshold. We start by using a horizontal line (named "1" in the picture) to find a point, say A.

Next, we compute the slope of the line through the origin and the point just touched, and we use a line with that slope (named "2") to touch another point, say B, on the upper hull.

Since the given minimum support threshold lies between A and B, we proceed to touch another point between A and B by using the slope of the line through A and B.

We repeat this binary search until we reach the target point.
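A sketch of this guided binary search, assuming a hypothetical probe(theta) oracle that performs one hand-probing step and returns the stamp point (support, hit) touched by the tangent line of slope theta:

    def probe_for_min_support(probe, min_support, max_iter=40):
        # probe(theta) -> (support, hit) of the hull point touched by the
        # tangent line of slope theta: a hypothetical hand-probing oracle.
        a = probe(0.0)             # line "1": horizontal tangent, touches A
        b = probe(a[1] / a[0])     # line "2": slope of origin-A, touches B
        # As in the foil, assume support(b) < min_support <= support(a).
        for _ in range(max_iter):
            theta = (a[1] - b[1]) / (a[0] - b[0])  # slope of the chord AB
            c = probe(theta)
            if c == a or c == b:   # no hull vertex strictly between A and B
                break
            if c[0] >= min_support:
                a = c              # C still meets the threshold: tighten from above
            else:
                b = c              # C violates it: tighten from below
        return a                   # hull point closest to the threshold from above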


The figure shows the performance results. As you can see, the computation time grows almost linearly with the number of pixels.