next up previous
Next: Conclusions Up: Pricing in agent economies Previous: Multi-agent Q-learning

Q-learning with neural networks

Using lookup tables to represent Q-functions as described in the previous two sections can only be feasible for small-scale problems. It is likely that the situations that will be faced by software agents in the real world will be too large-scale and complex to tackle via lookup tables, and that some sort of function approximation scheme will be necessary. This section examines the use of multi-layer neural networks to represent the Q-functions in the same economic models studied previously. Some initial results are presented for the case of a single adaptive QNN (Q-learning neural network) agent, training vs. a fixed-strategy myopic agent, as was described previously in section 3.

The neural networks studied here are multi-layer perceptrons (MLPs) as used in back-propagation. The same Q-learning equation 5 is used as previously, however, the quantity tex2html_wrap_inline711 is interpreted as the output error signal used in a backprop-style gradient calculation of weight changes. As in the previous sections, the state-action pairs (s,a) are chosen by uniform random exploration, although there is some preliminary evidence that somewhat better policies can be obtained by training on actual trajectories. Also in the experiments below, a fixed learning-rate constant tex2html_wrap_inline715 was used, rather than the time-varying schedule tex2html_wrap_inline717 described previously. This appears to give a significant speed increase at the cost of only a slight degradation in final network performance.

One of the most important issues in using neural networks is the design of the input state representation scheme. Schemes that incorporate specialized knowledge of the domain can often do better than naive representation schemes. The only knowledge included here is that it is important for a seller to know whether its price is greater than, less than, or equal to the other seller's price. This suggests a coding scheme using five input units. The first two units represent the two seller prices tex2html_wrap_inline465 as real numbers, and the remaining three units are binary units representing the three logical conditions tex2html_wrap_inline721 .

In the experiments below, the networks contained a single linear output unit, and a single hidden layer of 10 hidden units, fully connected to the input layer. For the Shopbot model, it was found that the network's ability to approximate the correct Q-function was generally poor, with the worst accuracy obtained at large values of tex2html_wrap_inline443 . Furthermore, the improvement in approximation error with training was extremely slow, and continued to decrease at an extremely slow rate for as long as the training was continued. Typical training runs lasted for several tens of thousands of sweeps through all possible price pairs. In the case of tex2html_wrap_inline449 an extremely long training run of several million sweeps was performed, after which the approximation accuracy was quite good, but the error was still decreasing at a very slow rate.

   figure175
Figure 11: Plot of expected utility for a single Q-learning agent, training against a fixed myopic opponent, in the Shopbot model, as a function of tex2html_wrap_inline443 . Filled circles represent a neural network Q-learner, while open circles represent a lookup table Q-learner. Each data point represents the best policy obtained during a training run. By comparison, the baseline myopic vs. myopic expected utility is indicated by the dashed line.

The difficulty in obtaining accurate function approximation could have resulted from inaccurate targets tex2html_wrap_inline711 used in training, or it could be due to intrinsic limitations in the function approximator itself. In a separate series of experiments, neural nets were trained with exact targets provided by the lookup tables, and this yielded no measurable advantage in approximation accuracy, suggesting that the problem is not due to inaccurate heuristic teacher signals in equation 5.

During the neural net training runs, the expected performance of the neural net policy was periodically measured vs. the myopic opponent. While the absolute error in the Q-function improved monotonically, the policy's expected profit was found to reach a peak relatively quickly (usually withing 100-200 sweeps, but longer for large tex2html_wrap_inline443 ) and then either level off or decrease slightly. A plot of the peak expected utility for each training run as a function of tex2html_wrap_inline443 is shown in figure 11. For comparison, the expected utility of the exact optimal policy, obtained by lookup table Q-learning, is also plotted. It is encouraging to note that, although the absolute accuracy of the neural net Q-function is poor and improves extremely slowly, the resulting policies nevertheless give reasonably decent performance and can be trained relatively quickly. This once again re-emphasizes a point found in other successful applications of neural nets and reinforcement learning: the neural net approach can often give a surprisingly strong policy, even though the absolute accuracy of the value function is poor.


next up previous
Next: Conclusions Up: Pricing in agent economies Previous: Multi-agent Q-learning

kephart
Wed Sep 29 11:51:48 EDT 1999