next up previous
Next: References Up: Pseudo-convergent Q-Learning by Competitive Previous: Which solution will occur?

Conclusions

In this paper and in our earlier work  [7], we have observed interesting and unexpected phenomena when using simultaneous Q-learning to attempt to obtain optimal pricing policies for competing pricebots. These findings include:

(1) In the shopbot model studied here, and in the two additional models studied in  [7], there do exist exact optimal Q-learning solutions, in which the two players' policies are simultaneously optimal vs. each other. Furthermore, the Q-learning procedure is able to find these solutions. In general, for arbitrary payoff functions, there is no reason a priori to expect that there will always exist a pair of response curves such that tex2html_wrap_inline664 is optimal vs. tex2html_wrap_inline662 and vice versa. The fact that such solutions exist in three different economically motivated models suggests that the structure of the payoff functions of naturally occuring real-world problems may be sufficiently ``nice'' to lead to such exact solutions, even though they cannot be expected for arbitrary problems.

(2) In addition to the exact solutions, we also find in the shopbot model, and in the Information Filtering model studied in  [7], a non-stationary ``pseudo-convergent'' solution with small non-zero Bellman error. This is a novel phenomenon that does not occur in ordinary single-agent Q-learning. Here we found that the non-stationarity is inherently cyclic in nature; this had previously been masked by the random exploration of the Q-learning algorithm. The pseudo-solution in the shopbot model is of further interest because it is a broken symmetry solution, whereas the payoff functions for the two players are symmetric. In theoretical physics, broken symmetry solutions to symmetric equations often indicate interesting underlying physics. Broken symmetry solutions may be of similar interest in understanding the dynamics of multi-agent Q-learning.

(3) We find that the exact solution and the pseudo-solution can co-exist, and that the choice of which solution is obtained by Q-learning depends on initial conditions, the discount parameter, and (to a lesser extent) on the randomness of exploration.

This paper has presented analytic and numerical techniques for calculating the forms of the solutions, the dynamics of the learning procedure (how the policies and Bellman errors behave with time), and understanding the conditions under which each solution is chosen. A large- tex2html_wrap_inline576 approximation for the various thresholds describing the policies would be desirable, as would a more detailed characterization of the basins of attraction. More broadly, we would like to develop a deeper and more general understanding of pseudo-convergent solutions and the range of scenarios and conditions under which they occur. From our experience to date, we suspect that they will be found generally in automated price-setting applications, where instabilities generated by undercutting may be quite endemic. But only after conducting a broader analytical and experimental study of multi-agent Q-learning in various applications (most likely aided by techniques from nonlinear dynamical systems theory) will we know whether pseudo-convergence is a phenomenon confined narrowly to non-cooperative non-zero-sum games between two players, or a fundamental characteristic of multi-agent Q-learning.


next up previous
Next: References Up: Pseudo-convergent Q-Learning by Competitive Previous: Which solution will occur?

kephart
Tue Mar 21 00:33:02 EST 2000