Now I would like to explain the algorithm that computes optimized
confidence rules. Let's consider the sequence of points Qm, whose coordinates are the sum of support and hit for the first bucket through the m-th bucket.
When range R is the union of consecutive buckets from the (m+1)-th bucket to the t-th bucket,
The horizontal distance of Qm and Qt is equal to the support of the range.
The vertical distance of Qm and Qt is equal to the hit of the range.
And the slope of Qm Qt gives the confidence of the rule.
This thick broken line is the upper convex hull of the points that are farther than minimum support threshold from Qm in a horizontal distance.
Let's consider the tangent of Qm and the upper convex hull. This thin line is the tangent.
It's obvious that the tangent has the maximum slope of Qm and this upper convex hull.
Thus, the tangent of Qm and corresponding upper convex hull for some m gives the range of the optimized confidence rule.
So, we focus on computing tangents for all m.
I would like to explain how to compute tangents for all m. We compute all tangents from left to right.
Let's focus on Qk, the dotted line is the corresponding upper convex hull. The horizontal distance between Qk and the hull must be greater than the minimum support threshold.
We find the touching point of the tangent by clockwise search. That is, visit each node on the convex hull from the leftmost node in clockwise order, and find the line with the maximum slope.
When we move to Qm, the horizontal distance between Qm and these nodes gets to be less than the threshold. So these nodes have to be removed from the upper convex hull. And these nodes have been hidden inside of the previous upper hull, and therefore they are to be newly added to the convex hull.
Let's assume that L is the tangent most recently found.
When Qm is above L. The slope of the tangent of Qm and the corresponding upper convex hull can not be greater than the slope of L. So, in this case, we don't compute the
So, we focus on computing tangents for all m. tangent of Qm.
This chart shows how to handle the case when Qm is under
L. When L doesn't touch the upper convex hull because the touching point was removed in order to meet the minimum support threshold.
In this case, we find the touching point of the tangent by clockwise search. That is, visit each node on the convex hull starting from the leftmost node in clockwise order, and find the line with the maximum slope.
In the case when L still touches the convex hull, we find the touching point of the tangent by counter-clockwise search. That is, visit each node on the hull starting from touching point of L, in counter-clockwise order, and find the line with the maximum slope.
Throughout this algorithm, we can prove that both the clockwise search and the counter clockwise search scan each edge of all upper convex hulls at most once.
Thus the algorithm computes the tangents with the maximum slop in time linear in the number of edges, namely in the number of buckets.
In this algorithm, we assumed that it is possible to access the two neighbors of each node on a upper hull in constant time. Actually it is possible in by maintaining the upper convex hull with stacks.
We make stack S store Ui for i equals M through 0.When i equals M, push QM onto S. This trivially makes S store UM.
For each i equals to M-1 through 0, assuming S stores Ui+1, visit each node on Ui+1 from Qi+1 in clockwise order, that is this order on S, find the node that makes the maximum slop with Qi, move these skipped nodes to stack Di, and push Qi onto stack S.
After the termination, the stack stores U0.
Stack Dis are used for recording the nodes deleted at each step. By using Dis, we can conversely make Ui+1 from Ui.
We make stack S store all nodes of an upper convex hull orderly so
that we can access two neighbors of each node on a convex hull in
constant time. This part shows the state transition of a single stack, S, with respect to this example.
Clockwise traversal on convex hull corresponds to top to down search on the stack.
The algorithm to build this data structure is almost the same as the algorithm for computing tangents. So I don't explain it here. Please read the paper for detail.
We implemented the algorithm and evaluated it comparing the results
with a naive algorithm that computes the confidence of all ranges in
quadratic time. Experiments shows that our algorithm is quite fast in practice.
The execution time of our algorithm grows up linearly in the number of bucket in wide range.