
Coppersmith, D., Hong, S. J., and Hosking, J. R. M. (1999).
Data Mining and Knowledge Discovery, 3, 197-217.
Abstract.
To find the optimal branching of a nominal attribute at a node in an
L-ary decision tree, one is often forced to search over all possible
L-ary partitions for the one that yields the minimum impurity measure.
For binary trees (L=2) when there are just two classes a short-cut
search is possible that is linear in n, the number of distinct values
of the attribute.
For the general case in which the number of classes, k, may be
greater than two, Burshtein et al. have shown that the optimal
partition satisfies a condition that involves the existence of
L(L-1)/2 hyperplanes in the class probability space.
We derive a property of the optimal partition for
concave impurity measures (including in particular
the Gini and entropy impurity measures)
in terms of the existence of L vectors in the dual
of the class probability space, which implies the earlier condition.
Unfortunately, these insights still do not offer a practical search
method when n and k are large, even for binary trees.
We therefore present a new heuristic search algorithm to
find a good partition. It is based on ordering the attribute's
values according to their principal component scores in the class
probability space, and is linear in n. We demonstrate the
effectiveness of the new method through Monte Carlo simulation
experiments and compare its performance against other heuristic methods.
[ J. R. M. Hosking's home page |
IBM Research home page ][
IBM home page |
Order |
Search |
Contact IBM |
Help |
(C) |
(TM)
]