next up previous
Next: Acknowledgements Up: Pricing in agent economies Previous: Q-learning with neural networks

Conclusions

This paper has examined single-agent and multi-agent Q-learning in three models of a two-seller economy in which the sellers alternately take turns setting prices, and then instantaneous utilities are given to both sellers based on the current price pair. Such models fall into the category of two-player, alternating-turn, arbitrary-sum Markov games, in which both the rewards and the state-space transitions are deterministic. The game is Markov because the state space is fully observable and the rewards are not history dependent.

In all three models (Price-Quality, Information-Filtering, and Shopbot), large-amplitude cyclic price wars are obtained when the sellers myopically optimize their instantaneous utilities without regard to longer-term impact of their pricing policies. It is found that, in all three models, the use of Q-learning by one of the sellers against a myopic opponent invariably results in exact convergence to the optimal Q-function and optimal policy against that opponent, for all allowed values of the discount parameter tex2html_wrap_inline443 . The use of the Q-derived policy yields greater expected utility for the Q-learner, with monotonically increasing utility as tex2html_wrap_inline443 increases. In many cases, it has a side benefit of enhancing social welfare by also giving greater expected utility for the myopic opponent. This comes about by reducing the amplitude of the undercutting price-war regime, or in some cases, eliminating it completely.

The more interesting and challenging situation of simultaneously training Q-functions for both sellers has also been studied. This is more difficult because as each seller's Q-function and policy change, it provides a non-stationary environment for adaptation of the other seller. No convergence proofs exist for such simultaneous Q-learning by multiple agents. Nevertheless, despite the absence of theoretical guarantees, generally good behavior of the algorithm was found. In two of the models (Shopbot and Price-Quality), exact or very good approximate convergence was obtained to simultaneously self-consistent Q-functions and optimal policies for any value of tex2html_wrap_inline443 , whereas in the Information-Filtering model, simultaneous convergence was found for tex2html_wrap_inline741 . In the Information-Filtering and Shopbot models, monotonically increasing expected utilities for both sellers were also found for small values of tex2html_wrap_inline443 . In the Price-Quality model, simultaneous Q-learning yields an asymmetric solution, corresponding to the solution found in (Tesauro and Kephart, 1999), that is highly advantageous to the lesser-quality seller, but slightly disadvantageous to the higher-quality seller, when compared to myopic vs. myopic pricing. A similar asymmetric solution is also found in the Shopbot model for large tex2html_wrap_inline443 , even though the utility functions for both players are symmetric.

For each model, there exists a range of discount parameter values where the solutions obtained by simultaneous Q-learning are self-consistently optimal, and outperform the solutions obtained in (Tesauro and Kephart, 1999). This is presumably because the previously published methods were based on limited lookahead, whereas the Q-functions in principle look ahead infinitely far, with appropriate discounting.

It is intruiging that simultaneous Q-learning works well in our models, despite the lack of theoretical convergence proofs. Sandholm and Crites also found that simultaneous Q-learning generally converged in the Iterated Prisoner's Dilemma game. These empirical findings suggest that a deeper theoretical analysis of simultaneous Q-learning may be worth investigating. There may be some underlying theoretical principles that can explain why simultaneous Q-learning works, for at least certain classes of arbitrary-sum utility functions.

Some initial steps have also been taken in combining nonlinear function approximation, using neural nets, with the Q-learning approach. It was found that a single neural net Q-learner facing a myopic opponent can exhibit reasonably good pricing policies, despite difficulties in obtaining an accurate approximation to the Q-function.

In addition to replacing lookup tables with function approximators, several other important challenges will also be faced in extending our approach to larger-scale, more realistic simulations. First, the three economic models used here quite deliberately ignored frictional effects such as agent search costs. Such effects can damp out price wars, and can lead to different system behaviors such as partial equilibria that support stable price differentiation. Eventually such frictional effects will have to be considered, although it has been argued in prior studies of these models (Sairamesh and Kephart, 1998; Kephart, Hanson and Sairamesh, 1998; Greenwald and Kephart, 1999) that frictional effects in Web-based agent economies will be considerably smaller than in traditional human economies. Also, with many sellers, the concept of sellers taking turns adjusting their prices in a well-defined order becomes problematic. This could lead to an additional combinatorial explosion, if the mechanism for calculating expected reward has to anticipate all possible orderings of opponent responses.

Furthermore, while these economic models have a moderate degree of realism in their utility functions, they are unrealistic in the assumptions of knowledge and dynamics. In the work reported here, the state space was fully observable infinitely frequently at zero cost and with zero propagation delays. The expected consumer response to a given price pair was instantaneous, deterministic and fully known to both players. Indeed, the players' exact utility functions were fully known to both players. It was also assumed that the players would alternately take turns equally often in a well-defined order in adjusting their prices. Under such assumptions of knowledge and dynamics, one could hope to develop an algorithm that could calculate in advance something like a game-theoretic optimal pricing algorithm for each agent.

However, in realistic agent economies, it is likely that agents will have much less than full knowledge of the state of the economy. Agents may not know the details of other agents' utility functions, and indeed an agent may not know its own utility function, to the extent that buyer behavior is unpredictable. The dynamics of buyers and sellers may also be more complex, random and unpredictable than what we have assumed here. There may also be information delays for both buyers and sellers, and part of the economic game may involve paying a cost in order to obtain information about the state of the economy faster and more frequently, and in greater detail. Finally, one may expect that buyer behavior will be non-stationary, so that there will be a more complex co-evolution of buyer and seller strategies.

While such real-world complexities are daunting, there are reasons to believe that learning approaches such as Q-learning may play a role in practical solutions. The advantage of Q-learning is that one does not need a model of either the instantaneous payoffs or of the state-space transitions in the environment. One can simply observe actual rewards and transitions and base learning on that. While the theory of Q-learning requires exhaustive exploration of the state space to guarantee convergence, this may not be necessary when function approximators are used. In that case, after training a function approximator on a relatively small number of observed states, it may then generalize well enough on the unobserved states to give decent practical performance. Several recent empirical studies have provided evidence of this (Tesauro, 1995; Crites and Barto, 1996; Zhang and Dietterich, 1996).


next up previous
Next: Acknowledgements Up: Pricing in agent economies Previous: Q-learning with neural networks

kephart
Wed Sep 29 11:51:48 EDT 1999