Some of the following slides were used in KDD '97 in August 1997.

An association rule has the following form:
A presumptive condition C1 implies an objective condition C2. For example, the presumptive condition `a person buys pizza' implies the objective condition `that person also buys coke.'

There are two parameters that characterize an association rule: `support' and `confidence'.

The support of a rule is the ratio of the number of tuples that satisfy C1 to the total number of tuples. The confidence of a rule is the conditional probability that a tuple satisfies C2 given that it satisfies C1.
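As a minimal illustration of these two definitions (the toy transactions and predicates below are my own, not taken from any slide), the support and confidence of a rule `C1 implies C2' could be computed like this:

```python
def support_and_confidence(tuples, c1, c2):
    """Compute support and confidence of the rule 'C1 implies C2'.

    tuples : iterable of records (dicts here)
    c1, c2 : predicates that take a record and return True or False
    """
    total = sat_c1 = sat_both = 0
    for t in tuples:
        total += 1
        if c1(t):
            sat_c1 += 1
            if c2(t):
                sat_both += 1
    support = sat_c1 / total if total else 0.0          # fraction of tuples satisfying C1
    confidence = sat_both / sat_c1 if sat_c1 else 0.0   # P(C2 | C1)
    return support, confidence

# Toy example: "buys pizza" implies "buys coke".
data = [
    {"pizza": True,  "coke": True},
    {"pizza": True,  "coke": False},
    {"pizza": False, "coke": True},
    {"pizza": True,  "coke": True},
]
print(support_and_confidence(data, lambda t: t["pizza"], lambda t: t["coke"]))
# -> (0.75, 0.666...)
```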

An interesting rule is a rule that has both a sufficiently large support and a sufficiently high confidence.

Association rules are typically applied to databases whose attributes are Boolean values indicating, for each item, whether it was purchased.


But databases in the real world usually also have numeric attributes. So we consider this kind of database, which has numeric attributes such as ``Balance'', the balance of a customer's bank account, and Boolean attributes such as ``Card Loan'', indicating whether the customer uses a credit card loan.

A one-dimensional association rule has the following form: if the numeric attribute ``Balance'' of a tuple falls in a range [s, t], then the Boolean attribute ``Card Loan'' of that tuple is ``yes'' with a high probability. This probability is the confidence of the rule, and the ratio of the number of tuples whose Balance falls inside the range [s, t] to the total number of tuples is the support of the rule.

We consider two kinds of optimized ranges here: the optimized-support range and the optimized-confidence range. The optimized-support range has the maximum support subject to a given confidence threshold, and the optimized-confidence range has the maximum confidence subject to a given support threshold.
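To make these definitions concrete, here is a small brute-force sketch for the optimized-confidence range; it assumes the numeric attribute has already been discretized into B buckets, and it is only an O(B^2) illustration, not the efficient algorithm from our earlier work.

```python
def optimized_confidence_range(hit, total, min_support):
    """Brute-force search for the optimized-confidence bucket range [s, t].

    hit[i]      : number of tuples in bucket i with Card Loan = yes
    total[i]    : number of tuples in bucket i
    min_support : minimum fraction of all tuples the range must cover
    """
    n_all = sum(total)
    best = (None, None, -1.0)              # (s, t, confidence)
    for s in range(len(total)):
        h = c = 0
        for t in range(s, len(total)):
            h += hit[t]
            c += total[t]
            if c > 0 and c >= min_support * n_all and h / c > best[2]:
                best = (s, t, h / c)
    return best

# Example: 6 buckets of Balance, minimum support 30%.
print(optimized_confidence_range([1, 5, 9, 8, 2, 1], [10, 10, 10, 10, 10, 10], 0.3))
# -> (2, 3, 0.85)
```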

This is a histogram of the data. A red bar represents the number of tuples that satisfy Card Loan = yes, and a green bar represents the number of tuples that satisfy Card Loan = no. The range [s, t] marked by the orange line is an example of the optimized-confidence range under some support threshold.



In the real world, one numeric attribute is often not enough to characterize the objective condition, so we consider an association between a pair of numeric attributes and one Boolean attribute, called a two-dimensional association rule.

A two-dimensional association rule has the following form: if the point to which the two numeric attributes of a tuple are mapped in the Euclidean plane lies inside a region R, then the Boolean attribute CardLoan is yes with a high probability.

The optimized-confidence and optimized-support regions are defined similarly to the one-dimensional case.

This figure shows an example of the distribution of the data. A red point corresponds to a tuple whose coordinates are the values of the two numeric attributes Age and Balance and whose Boolean attribute CardLoan is yes. A green point corresponds to a tuple whose CardLoan is no. The region R surrounded by the orange line is an example of the optimized-confidence region under some support threshold.

Since it is hard to compute a region of arbitrary shape, we restrict ourselves to regions with some nice geometric property in order to obtain a good two-dimensional rule.



We previously considered two geometric classes of regions for obtaining good two-dimensional association rules. One is the rectangular region and the other is the X-monotone region. An X-monotone region is a connected region whose intersection with any vertical line is always undivided.

We presented efficient algorithms for computing the optimized rectangular and X-monotone regions for two-dimensional association rules.

But, each of these two geometric classes of region has a problem.

Rectangular regions are not flexible enough to cover various data distributions. For example, if the data that satisfy the objective condition are scattered along a diagonal band, the optimized-confidence rectangular region will not give a high confidence.

X-monotone regions tend to overfit the training data. This means that even if we obtain an optimized X-monotone region from a dataset, that region will not give a good prediction for unseen data added in the future. I will talk about overfitting on a later slide.

So there is a flexibility problem for rectangular regions and an overfitting problem for X-monotone regions. We therefore present a new geometric class of regions, called rectilinear regions, to overcome both of these problems.

A rectilinear region is a connected region whose intersection with any vertical line, as well as with any horizontal line, is always undivided.
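To make the definition concrete, here is a small sketch that checks whether a pixel region, given as a Boolean grid (my own representation, not the paper's data structure), is rectilinear in this sense, namely connected with every row and every column intersecting it in a single unbroken run:

```python
from collections import deque

def is_rectilinear(grid):
    """grid[y][x] is True if pixel (x, y) belongs to the region."""
    H, W = len(grid), len(grid[0])
    cells = [(x, y) for y in range(H) for x in range(W) if grid[y][x]]
    if not cells:
        return False

    # Each horizontal and vertical line must intersect the region in one run.
    def single_run(line):
        runs, prev = 0, False
        for v in line:
            if v and not prev:
                runs += 1
            prev = v
        return runs <= 1

    if not all(single_run(row) for row in grid):
        return False
    if not all(single_run([grid[y][x] for y in range(H)]) for x in range(W)):
        return False

    # Connectivity check by breadth-first search over 4-neighbours.
    seen = {cells[0]}
    queue = deque([cells[0]])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < W and 0 <= ny < H and grid[ny][nx] and (nx, ny) not in seen:
                seen.add((nx, ny))
                queue.append((nx, ny))
    return len(seen) == len(cells)
```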

This is the computational time for finding the optimized rectangular region. Since the optimized X-monotone region and the optimized rectilinear region cannot be computed in polynomial time with respect to N and log M, we instead compute a region that is very close to the optimized one.



I'll explain the basic idea of the algorithm for computing the optimized rectilinear region in the following three slides.

This is the form of the two-dimensional association rule for which we compute the region R. A and B are numeric attributes and C is a Boolean attribute.

Let support(R) denote the number of tuples in R, and let hit(R) denote the number of tuples in R that meet the objective condition. Each region R can be mapped to a stamp point in the Euclidean plane spanned by support and hit. The slope of the point for R, that is, hit(R) divided by support(R), is called the confidence of R.

Let's think about how to find the optimized region using this figure. Suppose this line is a minimum support threshold and we want the optimized-confidence region. The point corresponding to the optimized-confidence region lies to the right of this line and has the steepest slope. The red point Rc is the optimized-confidence region in this case.

Now suppose the slope of this line is a minimum confidence threshold and we want the optimized-support region. The point corresponding to the optimized-support region lies above this line and has the largest support. The red point Rs is the optimized-support region in this case.

Finding the optimized regions corresponding to those red points is intractable in general. So imagine the upper convex hull of all the stamp points in this plane. Since a huge number of points are densely plotted along the hull, we can find the stamp point on the hull that satisfies the threshold condition and is closest to the optimized region. So we compute the point Rc' instead of the optimized-confidence region Rc, and the point Rs' instead of the optimized-support region Rs. By using the hand-probing technique, such a region can be computed by examining O(log M) stamp points on the convex hull.
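As a rough sketch of the hand-probing idea for the optimized-confidence case (max_gain_region and the fixed iteration count below are placeholders; the oracle itself is the topic of the next slides, and the loop stands in for the O(log M) probes), the search could look like this:

```python
def probe_confidence(max_gain_region, min_support, tau_lo=0.0, tau_hi=1.0, iters=50):
    """Hand-probing sketch for the optimized-confidence region.

    max_gain_region(tau) is an oracle returning (region, support, hit) for the
    region that maximizes hit - tau * support; geometrically, this is the stamp
    point where a line of slope tau touches the upper convex hull.

    A larger tau touches the hull further to the left (smaller support), and on
    a concave hull a smaller feasible support carries a higher confidence, so we
    binary-search tau toward the smallest support still meeting the threshold.
    Slopes are searched in [0, 1] since hit never grows faster than support.
    """
    best = None
    for _ in range(iters):
        tau = (tau_lo + tau_hi) / 2.0
        region, support, hit = max_gain_region(tau)
        if support >= min_support:
            best = (region, support, hit)   # feasible: try a steeper tangent
            tau_lo = tau
        else:
            tau_hi = tau                    # infeasible: flatten the tangent
    return best
```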

So the next question is how to compute a region on the convex hull.



For each point R on the upper hull, there exists a line with some slope tau that is tangent to the hull at that point. Such a point maximizes hit(R) - tau * support(R), which we call the gain of R.

So the problem is to compute the rectilinear region R that maximizes its gain for a given slope tau.

We define four types of columns, or vertical ranges, in a rectilinear region: W, U, D, and N. A column where the region gets wider from the left is of type W, a column that slants upward is of type U, a column that slants downward is of type D, and a column where the region gets narrower is of type N. Since a rectilinear region is connected, any column in a rectilinear region belongs to at least one of these types.

Then we define fX(m, [s, t]) as the maximum gain over all rectilinear regions whose rightmost column is the vertical range [s, t] in the m-th column and has type X.

The maximum of fX(m, [s, t]) over all parameters X, m, s, and t is the maximum gain over all rectilinear regions. The computation of these values is performed by dynamic programming.



In a rectilinear region, the types of two adjacent columns obey some restrictions. For example, the column to the left of a U-type column is always of type U or W, never of type D or N. This figure shows the restrictions that two adjacent columns in a rectilinear region must obey.

Using these restrictions, we obtain eleven cases of rectilinear regions, which are also illustrated in Figure 3 of our paper.

This is the recurrence used for computing fW(m, [s, t]). fW is computed using only the fW values at the previous and the current columns that satisfy these restrictions. fU, fD, and fN are computed by similar recurrences. Please refer to our paper for the details of this dynamic programming.

Roughly speaking, since there are three parameters, m, s, and t, each of which ranges from 1 to N, the computational time is O(N^3).
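The full recurrence with the four column types and eleven cases is in the paper; as an illustration of the column-by-column dynamic-programming style only, here is a deliberately naive sketch for the simpler X-monotone case, where one column type suffices and connectivity merely requires consecutive column ranges to overlap. As written it is far slower than O(N^3); it shows the structure of the recurrence, not the optimized algorithm.

```python
import numpy as np

def max_gain_x_monotone(gain):
    """Naive column-by-column DP maximizing total gain over X-monotone regions.

    gain : 2-D numpy array; gain[m][y] is hit - tau * support of pixel y in
           column m.  f[(s, t)] is the best gain of a region whose rightmost
           column is the current column restricted to rows s..t.
    """
    n_cols, n_rows = gain.shape
    best = -np.inf
    f_prev = {}
    for m in range(n_cols):
        pref = np.concatenate(([0.0], np.cumsum(gain[m])))
        f_cur = {}
        for s in range(n_rows):
            for t in range(s, n_rows):
                col_gain = pref[t + 1] - pref[s]
                # Either start a new region at this column, or extend the best
                # overlapping range [s2, t2] from the previous column.
                extend = max(
                    (v for (s2, t2), v in f_prev.items() if s2 <= t and t2 >= s),
                    default=0.0,
                )
                f_cur[(s, t)] = col_gain + max(0.0, extend)
                best = max(best, f_cur[(s, t)])
        f_prev = f_cur
    return best
```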



We performed some experiments to compare the characteristics of the three classes of regions with respect to overfitting.

We generated two synthetic datasets that represent typical cases in practice. A and B are numeric attributes ranging from -1 to 1, which span a two-dimensional plane, and C is a Boolean attribute. The number of records in each dataset is 10,000 (ten thousand). The figures show the two synthetic datasets.

The points are distributed uniformly at random in the plane. For the D-linear dataset, a point closer to the diagonal is set to C=yes with higher probability; for the D-circular dataset, a point closer to the origin is set to C=yes with higher probability. A deeper red pixel means a higher ratio of data that satisfy C=yes.

Please refer to our paper for the details of the synthetic dataset generation.
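As a purely hypothetical sketch of this kind of generator (the decay function and the width parameter are my own assumptions, not the paper's procedure), one could write:

```python
import numpy as np

def generate_dataset(n=10_000, kind="linear", width=0.3, seed=0):
    """Hypothetical generator in the style of the D-linear / D-circular data.

    A and B are uniform on [-1, 1]; C = yes with a probability that is higher
    near the diagonal (kind="linear") or near the origin (kind="circular").
    """
    rng = np.random.default_rng(seed)
    a = rng.uniform(-1.0, 1.0, n)
    b = rng.uniform(-1.0, 1.0, n)
    if kind == "linear":
        dist = np.abs(a - b) / np.sqrt(2.0)   # distance to the diagonal A = B
    else:
        dist = np.sqrt(a * a + b * b)         # distance to the origin
    p_yes = np.exp(-(dist / width) ** 2)      # assumed decay profile
    c = rng.uniform(size=n) < p_yes
    return a, b, c
```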


To compare the three classes of regions with respect to overfitting, we perform N-fold cross validation. First we divide the dataset into N equal-sized subsets. We then take the union of N-1 subsets and use it as the training dataset for extracting the optimized-confidence regions with a minimum support threshold of 50%. The training confidence is the confidence of the extracted region on this training dataset. We then use the remaining subset as the test dataset and compute the confidence of the extracted region against it, namely the ratio of the number of test tuples inside the region that satisfy the objective condition to the number of test tuples inside the region; this is called the test confidence. We repeat these three steps N times and compute the averages of the training and test confidences.
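A rough sketch of this procedure (extract_optimized_region and region_confidence are placeholders for the extraction and confidence-counting steps described above) might look like this:

```python
import numpy as np

def cross_validate(records, n_folds, extract_optimized_region, region_confidence):
    """N-fold cross validation of an optimized-confidence region.

    extract_optimized_region(training) -> region  (placeholder for the
        optimized-confidence extraction with a 50% minimum support)
    region_confidence(region, data) -> float      (fraction of tuples inside
        the region that satisfy the objective condition)
    """
    rng = np.random.default_rng(0)
    folds = np.array_split(rng.permutation(len(records)), n_folds)
    train_confs, test_confs = [], []
    for i in range(n_folds):
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        training = [records[k] for k in train_idx]
        test = [records[k] for k in folds[i]]
        region = extract_optimized_region(training)
        train_confs.append(region_confidence(region, training))
        test_confs.append(region_confidence(region, test))
    return float(np.mean(train_confs)), float(np.mean(test_confs))
```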

The training data is considered the recorded data from which we derive rules, and the test data is considered the unobserved data whose features are predicted by the derived rules.

Generally, the test confidence is lower than the training confidence, since the extracted region overfits the training dataset. So we can measure overfitting by the difference between the training confidence and the test confidence.


This is an example of the regions extracted from the training dataset of D-linear. These pictures are taken from our data mining tool. A pixel with a deeper red color has a higher confidence, and the extracted region is surrounded by the light blue line.

The ideal optimized-confidence region for this dataset is the region bounded by two lines parallel to the diagonal. Here I'll give some intuitive explanation of what the results might look like.

Rectangular regions do not seem well suited to this data distribution, so the confidence of the optimized rectangular region is presumed to be much lower than that of the other two.

X-monotone regions, as well as rectilinear regions, seem rather well suited to this data distribution. But as we can see, an X-monotone region overfits the dataset: it adapts its shape to even a small deviation from the ideal distribution. As a result, it is presumed that the difference between the training confidence and the test confidence is much larger for X-monotone regions than for the other two classes of regions, and that the test confidence of X-monotone regions may not be the highest.

Rectilinear regions do not seem to overfit the training dataset as much as X-monotone regions do. It is presumed that the difference between the training and test confidences of rectilinear regions is smaller than that of X-monotone regions, and that the test confidence of rectilinear regions is higher than that of rectangular regions.

On the next slide, I'll show the experimental results that confirm these intuitive explanations.



Here are the results for the two datasets, D-linear and D-circular.

These bar graphs show the confidences of the optimized-confidence rectangular, X-monotone, and rectilinear regions for various numbers of pixels ranging from 8x8 to 64x64. A blue bar shows a training confidence and a yellow bar shows a test confidence.

First, we can observe the major advantage of rectilinear regions. Look at the yellow bars: the test confidence of rectilinear regions is the highest among the three classes of regions. This means that a rectilinear region gives the best prediction for unseen data.

Second, the difference in length between a blue bar and a yellow bar shows the extent to which the optimized region overfits the training dataset. The difference increases as the number of pixels grows, but the increase is much larger for X-monotone regions than for the other classes of regions, which implies that the optimized X-monotone region seriously overfits the training data.



Finally, let's review and summarize the results.

We considered the problem of finding two-dimensional association rules, in which a region is extracted from a two-dimensional plane. We previously developed algorithms to find the optimized rectangular and X-monotone regions for association rules. But each of these two classes of regions has its own problem.

A rectangular region is not flexible enough to cover various data distributions, so the optimized-confidence rectangular region may not give a high confidence, or the optimized-support rectangular region may not give a high support. This is the problem with rectangular regions.

An X-monotone region overfits the training dataset too much. Even if the optimized-confidence X-monotone region gives a high confidence on the training dataset, the confidence of the region against the test dataset tends to be much worse. This is the problem with X-monotone regions.

We therefore proposed a new class of regions, called rectilinear regions, to overcome both of these problems. We presented an algorithm to compute optimized rectilinear regions and performed experiments.

The results of the experiments showed that the optimized-confidence rectilinear region gives a high confidence and does not overfit the training dataset very much.