In the case of single-agent Q-learning, the batch training algorithm
converged rapidly, within a few dozen iterations, taking
1 minute or
less of CPU time on a fast RS/6000 workstation.
This represents a huge improvement over neural network training,
which ranged from a few hours to several days of CPU time.
Furthermore, in all three economic models, regression tree learning
was consistently able to match the exact optimal
policies that were obtained from lookup-table Q-learning.
(Again, this is superior to neural nets, which usually were not
able to find the optimal policies.)
Figure 3: Results of single-agent Q-learning with regression trees
in the Shopbot model,
.
(a) Average Q-learner profit per time step as function of training set
size. Error bars represent min and max profit over
10 runs with different randomly-generated training sets.
Lower dashed line represents average profit in myopic vs. myopic play.
Upper dashed line indicates average profit of exact optimal policy
obtained by lookup table Q-learning.
(b) Average tree size as a function of training set size.
These optimal policies were found with training set sizes that were
relatively small fractions of the total state space size.
Typically we find that data sets
20-25% of the
state space size in the
Price-Quality and Information-Filtering models, and
40-50% in
the Shopbot model, are sufficient to generate exact or near-optimal
policies. As shown in figure 3, the algorithm's behavior
as a function of training set size is quite good. The policy
profitability quickly asymptotes at the level of the lookup table
policy, and for smaller training sets, there is graceful degradation:
in most cases a near-optimal policy is found, however, there is an
increasing chance that a poor solution will be obtained as the
training set size is reduced.
We can also see in figure 3 that the trees grow to reasonable sizes (much smaller than the corresponding lookup tables) as a function of training set size. For example, in the Shopbot model, trees with approximately 300 nodes (or 150 leaf nodes) were able to generate optimal policies. Trees with approximately 600 nodes generated optimal policies in the Price-Quality model, with the higher quality seller as the Q-learner.
Figure 4: Results of varying Minobjs, the minimum no. of
cases per leaf node. Single-agent Q-learning with regression trees
in the Shopbot model,
, 1500 training cases.
(a) Average Q-learner profit per time step as function of
Minobjs. Error bars represent min and max profit over
10 runs with different randomly-generated training sets.
(b) Average tree size as a function of Minobjs.
Most of our results were obtained with the minimum number of training cases per leaf node (``Minobjs'') set to 5. This lenient criterion resulted in a large number of leaf nodes, with relatively good constant-value approximation within each leaf node. We have also investigated more stringent stopping criteria by increasing Minobjs. This results in smaller trees, with potentially cruder function approximation within the leaf nodes. Generally we find robust behavior with increasing Minobjs, as illustrated in figure 4. Increasing Minobjs from 5 to 20 was found to reduce the tree size by nearly a factor of 3, with only a slight decrease in policy profitability.
An illustration of the quality of regression tree function approximation
compared to an exact lookup table solution is
shown in figure 5. The tree correctly fits the diagonal
discontinuity along the undercutting line
(aided greatly by the ``knowledge-engineered'' input attribute), and
unlike neural networks, it also correctly fits the second discontinuity
in the small
regime. This is important in obtaining the optimal policy.
Neural nets fit the flat and smoothly curving portions of the landscape
better, but this is irrelevant to finding the optimal policy.
The results of multi-agent Q-learning using regression trees were also
promising. In the Price-Quality model, results were identical to those
generated with lookup tables: robust convergence was found to a
self-consistent pair
of policies independent of
. In the Information-Filtering model,
convergence was obtained for
up to 0.5, with cumulative profits
for both sellers increasing with higher
, exactly as in
the lookup table case. Results for
the Shopbot model were mixed. For
up to 0.2, convergence to
symmetric policies was obtained. Using the leaf-node updating algorithm
described briefly in Section 4, the same asymmetric solutions were
obtained for
. Using the batch training algorithm with large
, the policies of the sellers remain symmetric for some time, after
which they suddenly become asymmetric and no convergence is obtained. Although
these results are not completely understood, they seem to correspond well
with the lookup table results, where it was also seen that the symmetric
solution for multi-agent Q-learning was highly unstable.
Figure 5: Plot of the Q-function for single-agent Q-learning vs. a
myopic opponent in the Shopbot model at
.
is the Q-learner's price and
is the myopic player's
price. (a) Exact optimal Q-function obtained by lookup table
Q-learning. (b) Regression tree approximation to the Q-function.
The tree was trained on 1500 cases and contained at least 10 cases
per leaf node.
Training times in the multi-agent case increased, but convergence was still
usually obtained in under 5 minutes. Partly this was due to having
to use a smaller value of
, whereas in the single
agent case, it was often possible to go as high as
and still obtain convergence. Tree sizes increased
slightly in multi-agent Q-learning tests, which is reasonable since the
profit landscapes are more complicated. In the Shopbot model, trees that
generated optimal policies ranged from 350-400 nodes, and trees with
approximately 700 nodes generated optimal policies for the higher-quality
seller in the Price-Quality model.