next up previous
Next: Results Up: Multi-agent Q-learning and regression Previous: Difficulties of neural network

Q-learning with regression trees

Our regression tree algorithm employs the simplest conceivable heuristics: we use axis-parallel splits, select splits that minimize variance, and approximate the function by constant values in the leaf nodes. A brief description appears below; more details can be found in (Breiman et al., 1984).

The trees are constructed in a ``batch'' mode using a fixed set of training cases. Each training case has d input attribute values, and an associated function value which may be adjusted during training. Given a training set N, a regression tree is constructed recursively as follows: First, the average a and the variance v of the function values of the cases are calculated. If |N| is less than a given threshold or if v is sufficiently small, the algorithm terminates, returning a leaf node which approximates the function by the constant a. Otherwise, the best axis-parallel split of the cases is found by examining all possible splits on each attribute. The best split is defined as follows: consider a split that divides N into sets tex2html_wrap_inline383 and tex2html_wrap_inline385 , with respective variances tex2html_wrap_inline387 and tex2html_wrap_inline389 . The split which minimizes the quantity tex2html_wrap_inline391 is the best split. A new internal node defined by this split is then created. The training set is separated into two subsets, and two subtrees for the new node are created recursively. Finally, the algorithm returns the node and terminates.

We performed some initial investigations of minimal error-complexity pruning, as described in (Breiman et al., 1984). However, it was found that setting a minimum of five cases in each leaf node and not using any pruning gave better results. Minimal error-complexity pruning seems ill-suited to our problem, perhaps because the policies generated by using the tree are more important than the accuracy of the tree's approximation of the function. Other pruning algorithms have not been investigated.

In our problem, the training cases consist of random price pairs (corresponding to uniform random exploration of the state-action space), plus an additional ``knowledge-engineered'' binary attribute which is set to 1 when the Q-learner's price is less than the opponent's price, and 0 otherwise. This helps the tree represent the critically important ``undercutting'' discontinuity present in all three models.

The function values for the cases are initialized to the seller's two-step immediate reward for that state and action (as in Section 3). An initial tree is built from these cases. Then, repeated sweeps are performed through the set of training cases, in which the case values tex2html_wrap_inline393 are adjusted according to:

  equation39

where the max-Q value for the successor state is found using the current tree. A new tree is built after each sweep through the training set. The algorithm terminates after a fixed number of sweeps. Training runs typically used a fixed learning rate tex2html_wrap_inline299 , which seemed to give good results even though convergence theorems require decreasing tex2html_wrap_inline299 with time.

In the case of multiple Q-learners, the tree updates are performed as follows. First, the sweep through the training set of each Q-learner is done, with the most recent policy of the other Q-learners (determined from their most recently constructed tree) used to calculate the immediate reward r. Then, the trees of all the Q-learners are reconstructed. This method promotes consistency of the Q-functions, as all Q-learners have access to equally recent information about other Q-learners.

One slightly different algorithm was also examined, which involved building a tree as above, and then further refining the leaf node values. A random new point in the space is chosen, and the corresponding leaf node is determined. Let a be the current leaf node estimate of the function, k be the number of cases that fell in that leaf node during previous training, and f be the function value of the new point. The leaf node's estimate is then updated according to: a' = (a * k + f) / (k + 1). This process continues for a number of new points, after which a new tree is built from a newly constructed training set. This algorithm worked well in the case of a single Q-learner, but for multiple Q-learners was not as good at producing stable policies as the batch updating algorithm described above.


next up previous
Next: Results Up: Multi-agent Q-learning and regression Previous: Difficulties of neural network

kephart
Tue Mar 21 00:52:15 EST 2000