In this paper and in our earlier work [7], we have observed interesting and unexpected phenomena when using simultaneous Q-learning to attempt to obtain optimal pricing policies for competing pricebots. These findings include:
(1) In the shopbot model studied here, and in the two additional
models studied in [7], there do exist
exact optimal Q-learning solutions, in which the two players'
policies are simultaneously optimal vs. each other.
Furthermore, the Q-learning procedure is able to find these solutions.
In general, for arbitrary payoff functions, there is no reason
a priori to expect that there will always exist a pair of
response curves such that
is optimal vs.
and vice versa.
The fact that such solutions exist in three different economically
motivated models suggests that the structure of the payoff functions
of naturally occuring real-world problems may be sufficiently ``nice''
to lead to such exact solutions, even though they cannot be expected
for arbitrary problems.
(2) In addition to the exact solutions, we also find in the shopbot model, and in the Information Filtering model studied in [7], a non-stationary ``pseudo-convergent'' solution with small non-zero Bellman error. This is a novel phenomenon that does not occur in ordinary single-agent Q-learning. Here we found that the non-stationarity is inherently cyclic in nature; this had previously been masked by the random exploration of the Q-learning algorithm. The pseudo-solution in the shopbot model is of further interest because it is a broken symmetry solution, whereas the payoff functions for the two players are symmetric. In theoretical physics, broken symmetry solutions to symmetric equations often indicate interesting underlying physics. Broken symmetry solutions may be of similar interest in understanding the dynamics of multi-agent Q-learning.
(3) We find that the exact solution and the pseudo-solution can co-exist, and that the choice of which solution is obtained by Q-learning depends on initial conditions, the discount parameter, and (to a lesser extent) on the randomness of exploration.
This paper has presented analytic and numerical techniques for
calculating the forms of the solutions, the dynamics of the
learning procedure (how the policies and Bellman errors behave
with time), and understanding the conditions under which each
solution is chosen.
A large-
approximation for the various thresholds describing
the policies would be desirable, as would a
more detailed characterization of the basins of attraction.
More broadly, we would like to develop a deeper and
more general understanding of pseudo-convergent solutions and the
range of scenarios and conditions under which they occur. From
our experience to date, we suspect that they will be found
generally in automated price-setting applications, where
instabilities generated by undercutting may be quite endemic.
But only after conducting a broader analytical and experimental
study of multi-agent Q-learning in various
applications (most likely aided by techniques from nonlinear
dynamical systems theory) will we know
whether pseudo-convergence is a phenomenon
confined narrowly to non-cooperative non-zero-sum games
between two players, or a fundamental characteristic of multi-agent
Q-learning.