We have examined single-agent and multi-agent Q-learning in three models of a two-seller economy in which the sellers alternately take turns setting prices, and then instantaneous profits are given to both sellers based on the current price pair. Such models fall into the category of two-player, alternating-turn, arbitrary-sum Markov games, in which both the rewards and the state-space transitions are deterministic. The game is Markov because the state space is fully observable and the rewards are not history dependent.
In all three models (Price-Quality, Information-Filtering, and
Shopbot), large-amplitude cyclic price wars are obtained when the
sellers myopically optimize their instantaneous profits
without regard to longer-term impact of their pricing policies.
We find that, in all three models, the use of Q-learning by
one of the sellers against a myopic opponent invariably results
in exact convergence to the optimal Q-function and optimal
policy against that opponent, for all allowed values of the discount
parameter
. The use of the Q-derived policy yields
greater expected profit for the Q-learner, with monotonically
increasing profit as
increases. In many cases,
it has a side benefit of also
enhancing the welfare of the myopic opponent.
This comes about by reducing the amplitude of the undercutting
price-war regime, or in some cases, eliminating it completely.
We have also studied the more interesting and challenging
situation of simultaneously training Q-functions for both sellers.
This is more difficult because as each seller's Q-function and
policy change, it provides a non-stationary environment for
adaptation of the other seller. No convergence proofs exist
for such simultaneous Q-learning by multiple agents.
Nevertheless, despite the absence of theoretical guarantees,
we do find generally good behavior of the algorithm in our model economies.
In two of the models (Shopbot and Price-Quality), we find
exact or very good approximate convergence to simultaneously
self-consistent Q-functions and optimal policies for any
value of
, whereas in the Information-Filtering model,
simultaneous convergence was found for
.
In the Information-Filtering and Shopbot models, monotonically
increasing expected profits for both sellers were also found
for small values of
. In the Price-Quality model,
simultaneous Q-learning yields an asymmetric solution, corresponding
to the solution found in (Tesauro and Kephart, 1999), that is
highly advantageous to the lesser-quality seller, but slightly
disadvantageous to the higher-quality seller, when compared
to myopic vs. myopic pricing. A similar asymmetric solution
is also found in the Shopbot model for large
, even
though the profit functions for both players are symmetric.
For each model, there exists a range of discount parameter values where the solutions obtained by simultaneous Q-learning are self-consistently optimal, and outperform the solutions obtained in (Tesauro and Kephart, 1999). This is presumably because the previously published methods were based on limited lookahead, whereas the Q-functions in principle look ahead infinitely far, with appropriate discounting.
It is intruiging that simultaneous Q-learning works well in our models, despite the lack of theoretical convergence proofs. Sandholm and Crites also found that simultaneous Q-learning generally converged in the Iterated Prisoner's Dilemma game. These empirical findings suggest that a deeper theoretical analysis of simultaneous Q-learning may be worth investigating. There may be some underlying theoretical principles that can explain why simultaneous Q-learning works, for at least certain classes of arbitrary-sum profit functions.
Several important challenges will also be faced in extending our approach to larger-scale, more realistic simulations. While there are some economic situations in the real world where there are only two dominant sellers, in general the number of sellers can be much greater. The situation that we foresee in agent economies is that the number of competing sellers will be very large. In this case, the seller profits and pricing functions will have such high input dimensionality that it will be infeasible to use lookup table state-space representations, and most likely some sort of compact representation combined with a function approximation scheme will be necessary. Furthermore, with many sellers, the concept of sellers taking turns adjusting their prices in a well-defined order becomes problematic. This could lead to an additional combinatorial explosion, if the mechanism for calculating expected reward has to anticipate all possible orderings of opponent responses.
Furthermore, while our economic models have a moderate degree of realism in their profit functions, they are unrealistic in the assumptions of knowledge and dynamics. In the work reported here, the state space was fully observable infinitely frequently at zero cost and with zero propagation delays. The expected consumer demand for a given price pair was instantaneous, deterministic and fully known to both players. Indeed, the players' exact profit functions were fully known to both players. It was also assumed that the players would alternately take turns equally often in a well-defined order in adjusting their prices. Under such assumptions of knowledge and dynamics, one could hope to develop an algorithm that could calculate in advance something like a game-theoretic optimal pricing algorithm for each agent.
However, in realistic agent economies, it is likely that agents will have much less than full knowledge of the state of the economy. Agents may not know the details of other agents' profit functions, and indeed an agent may not know its own profit function, to the extent that buyer behavior is unpredictable. The dynamics of buyers and sellers may also be more complex, random and unpredictable than what we have assumed here. There may also be information delays for both buyers and sellers, and part of the economic game may involve paying a cost in order to obtain information about the state of the economy faster and more frequently, and in greater detail. Finally, we expect that buyer behavior will be non-stationary, so that there will be a more complex co-evolution of buyer and seller strategies.
While such real-world complexities are daunting, there are reasons to believe that learning approaches such as Q-learning may play a role in practical solutions. The advantage of Q-learning is that one does not need a model of either the instantaneous payoffs or of the state-space transitions in the environment. One can simply observe actual rewards and transitions and base learning on that. While the theory of Q-learning requires exhaustive exploration of the state space to guarantee convergence, this may not be necessary when function approximators are used. In that case, after training a function approximator on a relatively small number of observed states, it may then generalize well enough on the unobserved states to give decent practical performance. Several recent empirical studies have provided evidence of this (Tesauro, 1995; Crites and Barto, 1996; Zhang and Dietterich, 1996).