Our regression tree algorithm employs the simplest conceivable heuristics: we use axis-parallel splits, select splits that minimize variance, and approximate the function by constant values in the leaf nodes. A brief description appears below; more details can be found in (Breiman et al., 1984).
The trees are constructed in a ``batch'' mode using a fixed set of
training cases. Each training case has d input attribute values, and an
associated function value which may be adjusted during training.
Given a training set N, a regression tree is constructed
recursively as follows:
First, the average a and the
variance v of the function values of the cases are calculated. If |N|
is less than a given threshold or if v
is sufficiently small, the algorithm terminates, returning a
leaf node which approximates the function by the constant a.
Otherwise, the best axis-parallel split of the cases is found by examining all
possible splits on each attribute. The best split is defined as follows:
consider a split that divides N into sets
and
, with
respective variances
and
.
The split which minimizes the quantity
is the best split. A new internal node
defined by this split is then created. The training set is
separated into two subsets, and two subtrees for the new node are created
recursively. Finally, the algorithm returns the node and terminates.
We performed some initial investigations of minimal error-complexity pruning, as described in (Breiman et al., 1984). However, it was found that setting a minimum of five cases in each leaf node and not using any pruning gave better results. Minimal error-complexity pruning seems ill-suited to our problem, perhaps because the policies generated by using the tree are more important than the accuracy of the tree's approximation of the function. Other pruning algorithms have not been investigated.
In our problem, the training cases consist of random price pairs (corresponding to uniform random exploration of the state-action space), plus an additional ``knowledge-engineered'' binary attribute which is set to 1 when the Q-learner's price is less than the opponent's price, and 0 otherwise. This helps the tree represent the critically important ``undercutting'' discontinuity present in all three models.
The function values for
the cases are initialized to the seller's two-step immediate reward for
that state and action (as in Section 3). An initial
tree is built from these cases. Then,
repeated sweeps are performed through the set of training cases, in which
the case values
are adjusted according to:
where the max-Q value for the
successor state is found using the current tree.
A new tree is built after each sweep
through the training set. The algorithm terminates after a fixed number of
sweeps. Training runs typically used a fixed learning rate
,
which seemed to give good results even though convergence theorems
require decreasing
with time.
In the case of multiple Q-learners, the tree updates are performed as follows. First, the sweep through the training set of each Q-learner is done, with the most recent policy of the other Q-learners (determined from their most recently constructed tree) used to calculate the immediate reward r. Then, the trees of all the Q-learners are reconstructed. This method promotes consistency of the Q-functions, as all Q-learners have access to equally recent information about other Q-learners.
One slightly different algorithm was also examined, which involved building a tree as above, and then further refining the leaf node values. A random new point in the space is chosen, and the corresponding leaf node is determined. Let a be the current leaf node estimate of the function, k be the number of cases that fell in that leaf node during previous training, and f be the function value of the new point. The leaf node's estimate is then updated according to: a' = (a * k + f) / (k + 1). This process continues for a number of new points, after which a new tree is built from a newly constructed training set. This algorithm worked well in the case of a single Q-learner, but for multiple Q-learners was not as good at producing stable policies as the batch updating algorithm described above.