Reinforcement learning in multi-agent environments is a challenging forefront of machine learning research that could have immense practical benefit in many real-world problems. Many application domains are envisioned in which teams of software agents cooperate to achieve a global objective. We also foresee signifcant applications for self-interested agents, for example, in electronic marketplaces, where economic software agents can interact with humans and/or other agents to maximize their own individual profits.
This paper investigates the use of Reinforcement Learning in multi-agent economies. Specifically, we study the use of Q-learning (Watkins, 1989) to learn pricing strategies in a competitive marketplace. Q-learning is an algorithm for learning to estimate the long-term expected reward for a given state-action pair. It has the nice properties that it does not need a model of the environment, and it can be used for on-line learning. With a lookup table representing the Q-function, the learning procedure is guaranteed to converge to the optimal value function and optimal policy.
The are two main challenges in using Q-learning in multi-agent systems. First, most of the existing convergence proofs only apply to single-agent, stationary Markov Decision Problems. (Important first steps in analyzing the two-agent case have been reported in (Littman, 1994) and (Hu and Wellman, 1998).) However, when multiple agents simultaneously adapt, each agent provides an effectively non-stationary environment for the other agents. Hence it is unknown in general whether any global convergence will be obtained in this case, and if so, whether such solutions are optimal. Second, we expect that large multi-agent systems will have states spaces that are too large for lookup tables to be feasible, and hence some sort of function approximation scheme seems necessary to represent the Q-function.
The present work combines single-agent and multi-agent Q-learning with tree-based function approximation in a model two-seller economy. The sellers alternately take turns setting prices, taking into account the other seller's current price. After the price has been set, the consumers choose either seller 1's product or seller 2's product, based on the current price pair. This leads to an instantaneous reward or profit given to the sellers. We assume that both sellers have full knowledge of the expected consumer response for any price pair, and moreover have full knowledge of both reward functions.
Q-learning is one of a variety of ways of endowing agents with ``foresight,'' i.e. an ability to anticipate long-term consequences of actions. Foresight was found in previous work (Tesauro and Kephart, 1999a,b) to improve profitability, and to damp out or eliminate the pathological behavior of unending cyclic ``price wars,'' in which long episodes of repeated undercutting amongst the sellers alternate with large jumps in price. Such price wars were found to be rampant in prior studies of agent economy models (Kephart, Hanson and Sairamesh, 1998; Sairamesh and Kephart, 1998) when agents use ``myopically optimal'' or ``myoptimal'' pricing algorithms that optimize immediate reward, but do not anticipate any longer-term consequences. Q-learning in particular is a principled way to obtain deep lookahead, since the Q-function represents the cumulative discounted reward looking infinitely far ahead in time. In contrast, the prior work of (Tesauro and Kephart, 1999a) was based on shallow finite lookahead.
Q-learning with lookup tables was previously studied in (Tesauro and Kephart, 1999b). A single Q-learner playing against a myoptimal opponent always converged to an optimal policy. In the more interesting case of two agents simultaneously Q-learning against each other, convergence was again obtained in all three models, at least for small-to-moderate values of the discount parameter. In (Tesauro, 1999), preliminary studies of Q-learning with neural networks found reasonably good policies but excessively long training times. This is a potentially major drawback in an on-line scenario where agents must learn quickly to maintain profitability. The current study of regression trees as an alternate function approximator was motivated by the goal of improving on the training time of neural networks while generating policies of at least equal quality to those given by a neural network.
The remainder of this paper is organized as follows. Section 2 describes the structure and dynamics of the model two-seller economy, and presents three economically-based seller profit functions which are prone to price wars when agents myopically optimize their short-term payoffs. Section 3 describes the implementation of Q-learning in these model economies, and summarizes previous results of lookup table and neural network Q-learning. In Section 4, the algorithms used for constructing regression trees and adapting them to Q-learning in our study are presented. Section 5 presents the results of using Q-learning with regression trees and compares these results to those of section 3. Finally, section 6 summarizes the main conclusions and discusses promising directions and challenges for future work.