Using lookup tables to represent Q-functions as described in the previous two sections can only be feasible for small-scale problems. It is likely that the situations that will be faced by software agents in the real world will be too large-scale and complex to tackle via lookup tables, and that some sort of function approximation scheme will be necessary. This section examines the use of multi-layer neural networks to represent the Q-functions in the same economic models studied previously. Some initial results are presented for the case of a single adaptive QNN (Q-learning neural network) agent, training vs. a fixed-strategy myopic agent, as was described previously in section 3.
The neural networks studied here are multi-layer perceptrons (MLPs)
as used in back-propagation. The same Q-learning equation
5 is used as previously, however, the quantity
is interpreted as the output error signal used
in a backprop-style gradient calculation of weight changes.
As in the previous sections, the state-action pairs (s,a)
are chosen by uniform random exploration, although there is some
preliminary evidence that somewhat better policies can be
obtained by training on actual trajectories. Also in the
experiments below, a fixed learning-rate constant
was
used, rather than the time-varying schedule
described
previously. This appears to give a significant speed increase
at the cost of only a slight degradation in final network
performance.
One of the most important issues in using neural networks is the
design of the input state representation scheme. Schemes that
incorporate specialized knowledge of the domain can often do
better than naive representation schemes. The only knowledge
included here is that it is important for a seller to know whether
its price is greater than, less than, or equal to the other
seller's price. This suggests a coding scheme using five input
units. The first two units represent the two seller prices
as real numbers, and the remaining three units
are binary units representing the three logical conditions
.
In the experiments below, the networks contained a single linear
output unit, and a single hidden layer of 10 hidden units,
fully connected to the input layer. For the Shopbot model,
it was found that the
network's ability to approximate the correct Q-function
was generally poor, with the worst accuracy obtained at large
values of
. Furthermore, the improvement in approximation
error with training was extremely slow, and continued to decrease
at an extremely slow rate for as long as the training was continued.
Typical training runs lasted for several tens of thousands of
sweeps through all possible price pairs. In the case of
an extremely long training run of several million sweeps was
performed, after which the approximation accuracy was quite good,
but the error was still decreasing at a very slow rate.
Figure 11: Plot of expected utility for a single Q-learning
agent, training against a fixed myopic opponent, in the Shopbot
model, as a function of
. Filled circles represent
a neural network Q-learner, while open circles represent
a lookup table Q-learner. Each data point represents the best
policy obtained during a training run.
By comparison, the baseline myopic vs. myopic expected utility
is indicated by the dashed line.
The difficulty in obtaining accurate function approximation
could have resulted from inaccurate targets
used in training, or it could be due to intrinsic limitations
in the function approximator itself. In a separate series of
experiments, neural nets were trained with exact targets provided
by the lookup tables, and this yielded no measurable advantage
in approximation accuracy, suggesting that the problem is not
due to inaccurate heuristic teacher signals in equation 5.
During the neural net training runs, the expected performance
of the neural net policy was periodically measured vs. the
myopic opponent. While the absolute error in the Q-function
improved monotonically, the policy's expected profit was found
to reach a peak relatively quickly (usually withing 100-200 sweeps,
but longer for large
) and then either level off
or decrease slightly. A plot of the peak expected utility
for each training run as a function of
is shown in figure 11. For comparison, the
expected utility of the exact optimal policy, obtained by
lookup table Q-learning, is also plotted. It is encouraging
to note that, although the absolute accuracy of the neural net
Q-function is poor and improves extremely slowly, the resulting
policies nevertheless give reasonably decent performance and
can be trained relatively quickly. This once again re-emphasizes
a point found in other successful applications of neural nets
and reinforcement learning: the neural net approach can often
give a surprisingly strong policy, even though the absolute
accuracy of the value function is poor.