Introduction


Some of the slides here were presented at ACM PODS '96 in June 1996.

An association rule has the form "C1 ==> C2", where C2 is a target property of interest and C1 is a condition that characterizes the target property.

A rule has a support and a confidence. The support of a rule is the probability that a tuple satisfies both C1 and C2, and the confidence is the probability that a tuple satisfying C1 also satisfies C2.
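To make the two measures concrete, here is a minimal Python sketch; the record representation and the condition functions are illustrative assumptions, not part of any particular system.

    def support_and_confidence(records, c1, c2):
        """Return (support, confidence) of the rule c1 ==> c2.

        support    = fraction of records satisfying both c1 and c2
        confidence = fraction of records satisfying c1 that also satisfy c2
        """
        n_both = sum(1 for r in records if c1(r) and c2(r))
        n_c1 = sum(1 for r in records if c1(r))
        support = n_both / len(records)
        confidence = n_both / n_c1 if n_c1 else 0.0
        return support, confidence

    # Hypothetical market-basket records, each a set of purchased items:
    baskets = [{"coke", "pizza", "potato"}, {"coke", "pizza"},
               {"milk"}, {"coke", "pizza", "potato"}]
    print(support_and_confidence(baskets,
                                 lambda b: {"coke", "pizza"} <= b,  # C1
                                 lambda b: "potato" in b))          # C2
    # -> (0.5, 0.6666...)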

The first example expresses that "a customer who buys Coke and pizza also buys potatoes". This rule contains only "yes-no" attributes, so we call such rules "Boolean association rules".

The second example expresses that "a bank customer whose age is between x1 and x2 has ever delayed his or her card loan payment". In this rule, "unreliable customer" is the target property, and the attribute "age" explains unreliable customers. The rule contains a single numeric attribute in its condition part, so we call such rules "one-dimensional numeric association rules".

The last example expresses that "a bank customer whose age and balance are in some two-dimensional region R has ever delayed his or her card loan payment". This rule contains two numeric attributes, so we call such rules "two-dimensional numeric association rules".

We need different techniques to derive one-dimensional and two-dimensional rules. In these web pages, we focus on one-dimensional rules, like the second example.



A one-dimensional numeric association rule has the form "A in [x1, x2] ==> C".

Given a database with many numeric attributes, and a target property of interest, our goal is to find numeric attributes and their ranges that best characterize the target property.

One problem then emerges: what is a good range? What range characterizes the target well?

We have two measures for ranges: support and confidence.

A rule with large support is good because it applies to much of the data, and a rule with high confidence is good because it holds with high probability.

However, in selecting a range, there is a trade-off between support and confidence.

A range with too small a support is not useful, even if its confidence is very high.

A range with too large a support is not interesting, since its confidence will be almost the same as the average.

Therefore, we need to define the interestingness of a range.



To define the interestingness of a range, we introduce the following two terms.

We call a rule "ample" if the support of the rule is not less than a given minimum support threshold. An ample rule is important because its support is large.

We call a rule "confident" if the confidence of the rule is not less than a given minimum confidence threshold. A confident rule is also important because its confidence is high.

Using these terms, we define two types of optimized rules. One is the "optimized confidence rule", an ample rule that maximizes the confidence.

The other is the "optimized support rule", a confident rule that maximizes the support.

These are our definitions of good rules.
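As a minimal sketch of these two criteria, suppose the candidate ranges have already been enumerated as (range, support, confidence) triples; that representation is an assumption for illustration only.

    def optimized_confidence_rule(ranges, min_support):
        """Among the ample ranges (support >= min_support), maximize confidence."""
        ample = [r for r in ranges if r[1] >= min_support]
        return max(ample, key=lambda r: r[2], default=None)

    def optimized_support_rule(ranges, min_confidence):
        """Among the confident ranges (confidence >= min_confidence), maximize support."""
        confident = [r for r in ranges if r[2] >= min_confidence]
        return max(confident, key=lambda r: r[1], default=None)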

In preparation, we divide the domain of a numeric attribute A into almost equal-sized buckets.

We then concatenate consecutive buckets to produce the candidate ranges for the optimized rules.

Let support(Bi) denote the number of tuples in the i-th bucket.

Similarly, let hit(Bi) denote the number of tuples in the i-th bucket that satisfy the target property C of the rule.
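The following sketch illustrates this preparation step, assuming the data is a list of (value of A, satisfies C) pairs. Here the buckets are produced by simply sorting; the randomized bucketing algorithm mentioned at the end of these pages serves the same purpose.

    def make_buckets(data, n_buckets):
        """Split data into almost equal-sized buckets along attribute A and
        return the lists support[i] and hit[i] for the i-th bucket."""
        data = sorted(data, key=lambda pair: pair[0])    # order by attribute A
        n = len(data)
        support, hit = [], []
        for b in range(n_buckets):
            lo, hi = b * n // n_buckets, (b + 1) * n // n_buckets
            bucket = data[lo:hi]
            support.append(len(bucket))                  # support(Bi)
            hit.append(sum(1 for _, c in bucket if c))   # hit(Bi)
        return support, hit

    # Hypothetical (age, delayed-payment) pairs:
    data = [(23, False), (35, True), (41, False), (29, True), (52, False), (47, True)]
    print(make_buckets(data, 3))   # -> ([2, 2, 2], [1, 1, 1])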

[Next | Prev | Index ]


You might think that you can compute the optimized range very easily.

A naive idea for computing the optimized confidence rule in linear time would be the following:

  1. Enumerate all ranges whose support is equal to the threshold.
  2. Choose the range that maximizes the confidence.

This seems to work fine, but actually it does not.

For example, let us consider the optimized confidence rule under the condition that the support of the range is not less than 20%.

The domain of the numeric attribute A is divided into 10 almost equal-sized buckets, each containing about 100 tuples.

The first bucket has 101 tuples, 9 of which satisfy the target property C. So its support is 101 and its hit is 9.

Since there are 10 buckets, connecting two consecutive buckets yields a range with about 20% support.

Computing the confidence of all ranges with 20% support, we find the two-bucket range that has the maximum confidence.

But the truly optimal range is a concatenation of three consecutive buckets: its support is 30% and its confidence is 51%.

Obviously this three-bucket range is better than the two-bucket ranges, because both its support and its confidence are greater than theirs.

So the naive idea is not correct. We must revise it as follows:

  1. Enumerate all ranges whose support is not less than the threshold, which takes quadratic time.
  2. Choose the best one.
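The revised idea is straightforward to implement. Below is a quadratic-time sketch in Python; the bucket counts are hypothetical numbers chosen to reproduce the example above, since the slides' own figure is not shown here.

    def optimized_confidence_brute_force(support, hit, min_support):
        """Enumerate every range of consecutive buckets with support >=
        min_support and return the one with maximum confidence."""
        n = len(support)
        best = None   # (confidence, first_bucket, last_bucket), 0-indexed
        for s in range(n):
            sup = hits = 0
            for t in range(s, n):
                sup += support[t]
                hits += hit[t]
                if sup >= min_support:                  # the range is ample
                    conf = hits / sup
                    if best is None or conf > best[0]:
                        best = (conf, s, t)
        return best

    # Ten buckets of about 100 tuples; 20% of the 1001 tuples is about 200.
    support = [101, 100, 100, 100, 100, 100, 100, 100, 100, 100]
    hit     = [  9,  10,  60,  30,  63,  12,   8,  11,  10,   9]
    print(optimized_confidence_brute_force(support, hit, min_support=200))
    # -> (0.51, 2, 4): the three-bucket range beats every two-bucket range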



This chart summarizes what we want to say.

  1. We give a linear-time algorithm for the optimized confidence rule.
  2. We also give a linear-time algorithm for the optimized support rule.

    Each algorithm uses techniques from computational geometry to achieve linear time complexity; a sketch of the geometric idea follows this list.

  3. We give a randomized bucketing algorithm that computes almost equal-sized buckets.
  4. Experiments show that our algorithms are fast in practice.
  5. We have developed a prototype system using these algorithms.
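The full linear-time algorithms are beyond these pages, but the geometric idea behind the optimized confidence rule can be sketched. Map each bucket prefix to the point P_k = (U_k, V_k) of cumulative support and cumulative hits; the confidence of a range of buckets is then the slope between two such points, and for each right endpoint the best left endpoint lies on the lower convex hull of the admissible prefix points. The simplified sketch below locates that hull point by binary search, which costs O(n log n) rather than the linear time claimed above; the function name and the 0-indexed range representation are our own. The optimized support rule is handled symmetrically.

    def optimized_confidence_geometric(support, hit, min_support):
        n = len(support)
        U = [0] * (n + 1)   # U[k] = total support of buckets 0..k-1
        V = [0] * (n + 1)   # V[k] = total hits of buckets 0..k-1
        for i in range(n):
            U[i + 1] = U[i] + support[i]
            V[i + 1] = V[i] + hit[i]

        def cross(o, a, b):  # orientation of the turn P_o -> P_a -> P_b
            return (U[a] - U[o]) * (V[b] - V[o]) - (V[a] - V[o]) * (U[b] - U[o])

        def conf(s, t):      # confidence of buckets s..t-1 = slope from P_s to P_t
            return (V[t] - V[s]) / (U[t] - U[s])

        hull, m, best = [], 0, None
        for t in range(1, n + 1):
            # admit every prefix point P_s that makes buckets s..t-1 ample,
            # maintaining the lower convex hull of the admitted points
            while m < t and U[t] - U[m] >= min_support:
                while len(hull) >= 2 and cross(hull[-2], hull[-1], m) <= 0:
                    hull.pop()
                hull.append(m)
                m += 1
            if not hull:
                continue
            # conf(s, t) is unimodal along the hull: binary-search its maximum
            lo, hi = 0, len(hull) - 1
            while lo < hi:
                mid = (lo + hi) // 2
                if conf(hull[mid + 1], t) > conf(hull[mid], t):
                    lo = mid + 1
                else:
                    hi = mid
            s = hull[lo]
            if best is None or conf(s, t) > best[0]:
                best = (conf(s, t), s, t - 1)   # buckets s..t-1, 0-indexed inclusive
        return best

    # Same hypothetical bucket counts as before; the answer matches the brute force.
    support = [101, 100, 100, 100, 100, 100, 100, 100, 100, 100]
    hit     = [  9,  10,  60,  30,  63,  12,   8,  11,  10,   9]
    print(optimized_confidence_geometric(support, hit, min_support=200))
    # -> (0.51, 2, 4)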
