Some of the following foils were presented at
ACM SIGMOD'96 in June 1996.
Suppose that we are given a relation with
many numeric attributes and we want to understand
a particular property of interest in the database.
To this end, it would be helpful if we could find a numeric attribute together with a range of its values, or a pair of numeric attributes together with a region, that best describes the target condition.
For instance, consider a relation of millions of customers in a bank that has many numeric attributes, such as the balance of checking accounts, the balance of savings accounts, and so on.
Suppose that we want to find conditions that well characterize unreliable customers, namely those who have at some time delayed the payment of their credit card charges.
We want to find a rule of the form
An example of a 2 dimensional rule is
Now let us consider the size of the region R. Suppose that the green circle in the foil illustrates an R that contains a very small number of customers, say 0.05% of all customers. Then the 2D rule applies to only a very small set of customers, and therefore is not so meaningful.
On the other hand, consider the case when R, which is shown by the red circle, contains 99% of all customers. The 2D rule can be applied to almost all customers, and hence it cannot well characterize unreliable customers.
Thus we should avoid generating very small or very large regions.
To this end, it is effective to provide a minimum threshold, say 10%,
for the percentage of customers in the region R.
We then look for an optimal region that maximizes the probability of
unreliable customers.
Before getting into the details of the algorithms for computing
the optimal regions,
we will show the output image of our system.
Suppose that we are given a database that contains 3.54% of unreliable customers, and we look for the rule of the form
If we give 10% minimum threshold for the ratio of customers in R, the system extracts the region enclosed in the blue lines that contains 14.39% of unreliable customers, which is much higher than the average of 3.54%.
In the figure, the redness of each pixel shows the probability of
unreliable customers in the pixel.
We now introduce some technical terms.
Consider the rule of the form:
support(R) means the percentage of tuples in the database that meet the presumptive condition "(Age, Balance) in R."
hit(R) shows the ratio of tuples that satisfy both the presumptive condition and the target condition in the conclusion.
confidence(R) is hit(R) over support(R), and it expresses the probability that customers whose ages and balances are in region R meet the target condition C.
For instance, the red circle in the foil shows a region with 10% support and 30% confidence. The product of those figures gives 3% hit. The green circle illustrates another region with 30% support and 20% confidence.
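To make the three measures concrete, here is a minimal sketch that computes them for a rectangular region over hypothetical customer records of the form (age, balance, is_unreliable); the attribute ranges and data below are illustrative, not from the actual system.

```python
# Sketch of support/hit/confidence for a rule "(Age, Balance) in R => unreliable".
# The region R is a rectangle given by an age range and a balance range.

def region_stats(customers, age_range, balance_range):
    """Return (support, hit, confidence) of the region over the customer list."""
    in_region = [c for c in customers
                 if age_range[0] <= c[0] <= age_range[1]
                 and balance_range[0] <= c[1] <= balance_range[1]]
    support = len(in_region) / len(customers)          # meet the presumptive condition
    hit = sum(1 for c in in_region if c[2]) / len(customers)  # meet both conditions
    confidence = hit / support if support > 0 else 0.0
    return support, hit, confidence

# hypothetical data: 5 customers, 2 of them unreliable inside the region
customers = [(25, 100, True), (40, 900, False),
             (30, 200, True), (55, 1500, False),
             (28, 150, False)]
s, h, c = region_stats(customers, (20, 35), (0, 300))
# the region contains 3 of 5 customers, 2 of them unreliable:
# support = 0.6, hit = 0.4, confidence = 2/3
```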
We may find a region with higher confidence for a smaller support. Thus there is a tradeoff between smaller supports and higher confidence. This tradeoff naturally raises the following types of optimization problems.
Given a minimum threshold for support(R) (or hit(R)), we search for optimized confidence regions, which maximize confidence(R).
Given a minimum threshold for confidence(R),
we look for optimized support (or, hit) regions
that maximize support(R) (or hit(R), respectively).
In order to generate images and extract regions,
we first discretize the relation into 2D matrices.
For instance, for each pair of two numeric attributes,
say Age and Balance,
we divide the Euclidean plane into pixels.
This can be done as follows:
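As a sketch, the bucketing step might look like the following; equi-width buckets over the hypothetical attributes Age and Balance are used here for simplicity, and the actual system may well use a different (e.g. equi-depth) bucketing.

```python
# Discretize two numeric attributes into an n x n pixel grid, producing one
# matrix of total tuple counts and one of target-condition tuple counts.

def discretize(tuples, n):
    """tuples: list of (age, balance, is_unreliable).
    Returns (total, target), two n x n count matrices."""
    ages = [t[0] for t in tuples]
    bals = [t[1] for t in tuples]
    a_min, a_max = min(ages), max(ages)
    b_min, b_max = min(bals), max(bals)

    def bucket(v, lo, hi):
        # map v in [lo, hi] to a bucket index in 0..n-1 (equi-width)
        if hi == lo:
            return 0
        return min(int((v - lo) / (hi - lo) * n), n - 1)

    total = [[0] * n for _ in range(n)]
    target = [[0] * n for _ in range(n)]
    for a, b, unreliable in tuples:
        i, j = bucket(a, a_min, a_max), bucket(b, b_min, b_max)
        total[i][j] += 1
        if unreliable:
            target[i][j] += 1
    return total, target
```

With the counts per pixel in hand, support, hit, and confidence of any pixel region follow by summation.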
The next step is to calculate optimized regions. In our SIGMOD'96 paper, we consider two types of regions: rectangles and x-monotone regions. Rectangles would be the most standard, but they are not always helpful for characterizing the target customers of interest. We may need regions of more varied shapes.
We considered x-monotone regions, in which the intersection of the region
with any vertical line is a single, undivided interval.
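On the pixel grid, this definition is easy to check; the sketch below assumes a region given as a boolean mask indexed mask[column][row], a representation chosen here for illustration.

```python
# A region on a pixel grid is x-monotone when every column's intersection
# with the region is one contiguous run of pixels.

def is_x_monotone(mask):
    for column in mask:
        runs, prev = 0, False
        for cell in column:
            if cell and not prev:
                runs += 1   # a new run of region pixels starts here
            prev = cell
        if runs > 1:        # a vertical line would hit the region twice
            return False
    return True

is_x_monotone([[False, True, True], [True, True, False]])  # True
is_x_monotone([[True, False, True], [True, True, True]])   # False: split column
```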
Now let us give some results for computing optimized
confidence/support/hit regions.
In the case of ranges, at ACM PODS'96 we presented an algorithm for optimized support rules and another algorithm for optimized confidence rules. Let M denote the number of tuples in the input database. Both algorithms run in O(M) time if the data is sorted.
Let n denote the number of pixels. We have O(n^1.5) time algorithms for computing optimized rectangles by reducing the problem to the case of ranges.
Computing optimized x-monotone regions is a hard problem. It can be shown that no algorithm running in polynomial time with respect to n and log M exists unless P=NP.
In what follows, we present an O(n log M) time
approximation algorithm.
For a region R, we associate a stamp point (support(R), hit(R)).
Since there could be an enormous number of regions,
we cannot afford to generate all those stamp points.
We just imagine those points on a Euclidean plane.
Let us focus on a stamp point (support(R),hit(R)).
Then, the slope of the line between the origin and the stamp point gives
the confidence "confidence(R)."
Suppose that the optimized region of interest is the red point
that is hidden inside of the upper convex hull (light blue lines)
of all stamp points.
As we pointed out before, computing the red point itself is a hard problem.
We instead look for a point on the upper convex hull,
say the blue point, that is close to the red point, and
we substitute the blue point for the red one.
We will explain how to find such an approximation, but before it
we show how to find optimized confidence/support/hit regions.
Consider the case of computing the optimized confidence region for
a given minimum support threshold that is illustrated
by the green line in the picture.
We need to consider all stamp points on the right side of the green line, and among them the red point represents the optimized confidence region, which however is hidden inside the upper convex hull.
So we substitute the blue point for the red one.
Next let us
consider the case of computing the optimized confidence region for
a given minimum hit threshold
that is illustrated by the green line in the picture.
We need to consider all stamp points on the upper side of the green line, and among them the red point represents the optimized confidence region.
So we substitute the blue point for the red one.
The next problem is to compute the optimized support region for
a given minimum confidence threshold
that is illustrated by the green line in the picture.
We need to consider all stamp points on the upper side of the green line, and among them the red point represents the optimized support region.
So we substitute the blue point for the red one.
The problem of computing the optimized hit region for
a given minimum confidence threshold can be solved in a way similar to
the previous solutions.
Now we are in a position to show how to probe a point
on the upper convex hull of all stamp points.
At ACM SODA'96, Asano, Chen, Katoh and Tokuyama proposed a technique called "hand-probing." Observe that for any point on the upper convex hull, there exists a line that is tangent to the hull at this point. Also note that for any non-negative slope (theta, in the picture), there is a line with that slope that touches at least one point on the upper hull. Suppose that the touching point represents region R. Then, R maximizes hit(R) - theta * support(R).
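A minimal sketch of a probe, assuming an explicit list of stamp points; in the actual algorithm the touching region is computed on the pixel grid (by dynamic programming over columns) without enumerating regions.

```python
# Hand-probing: for a slope theta, the tangent line of that slope touches
# the stamp point (support, hit) maximizing hit - theta * support.

def probe(stamp_points, theta):
    """Return the stamp point touched by the tangent line of slope theta."""
    return max(stamp_points, key=lambda p: p[1] - theta * p[0])

# hypothetical stamp points (support, hit)
points = [(0.1, 0.03), (0.3, 0.06), (0.5, 0.07), (0.9, 0.08)]
probe(points, 0.0)  # horizontal tangent: the point with maximum hit
probe(points, 0.1)  # a steeper tangent touches a point further to the left
```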
Using the technique of probing a point on the upper convex hull,
we can search for a region that approximates the optimized region.
For instance, the green point in the picture shows the target optimized confidence region for a given minimum support threshold. We start by using a horizontal line (named "1" in the picture) to find a point, say A.
Next, we compute the slope of the line through the origin and the point that has just been touched, and we use a line with that slope (named "2") to touch another point, say B, on the upper hull.
Once we see that the given minimum support threshold lies between A and B, we proceed to touch another point between A and B by using the slope of the line through A and B.
We repeat this binary search until we reach the target point.
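The search described above can be sketched as follows, again over an explicit list of stamp points for illustration (the real algorithm evaluates each probe on the pixel grid); the fallback to the rightmost point and the stopping rule are simplifying assumptions of this sketch.

```python
# Guided binary search over tangent slopes for an approximately optimized
# confidence region under a minimum support threshold.

def probe(stamp_points, theta):
    """Return the stamp point maximizing hit - theta * support."""
    return max(stamp_points, key=lambda p: p[1] - theta * p[0])

def approx_optimized_confidence(points, min_support):
    a = probe(points, 0.0)                  # line "1": horizontal probe
    if a[0] < min_support:                  # fall back to the rightmost point
        a = max(points, key=lambda p: p[0])
        if a[0] < min_support:
            return None                     # no region meets the threshold
    b = probe(points, a[1] / a[0])          # line "2": slope = confidence of A
    if b[0] >= min_support:
        return b                            # feasible, and at least as confident as A
    while True:                             # threshold lies between B and A
        theta = (a[1] - b[1]) / (a[0] - b[0])   # slope of the chord AB
        c = probe(points, theta)
        if c == a or c == b:                # AB is a hull edge: stop at A
            return a
        if c[0] >= min_support:
            a = c                           # keep the bracket around the threshold
        else:
            b = c

# hypothetical stamp points on and below the upper hull
points = [(0.1, 0.05), (0.3, 0.09), (0.5, 0.08), (0.6, 0.12), (0.9, 0.13)]
approx_optimized_confidence(points, 0.4)
```

Each probe touches a vertex of the upper hull strictly between the current bracket points, so the search terminates after at most as many probes as there are hull vertices.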
The figure shows the performance result.
As you can see,
the computation time grows almost linearly in the number of pixels.