Reinforcement Learning (RL) procedures have been established as powerful and practical methods for solving Markov Decision Problems. One of the most significant and actively investigated RL algorithms is Q-learning (Watkins, 1989). Q-learning is an algorithm for learning to estimate the long-term expected reward for a given state-action pair. It has the nice property that it does not need a model of the environment, and it can be used for on-line learning. A number of powerful convergence proofs have been given showing that Q-learning is guaranteed to converge with probability 1, in cases where the state space is small enough so that lookup table representations can be used (Watkins and Dayan, 1992). Furthermore, in large state spaces where lookup table representations are infeasible, RL methods can be combined with function approximators to give good practical performance despite the lack of theoretical guarantees of convergence to optimal policies.
Most real-world problems are not fully Markov in nature - they are often non-stationary, history-dependent and/or not fully observable. In order for RL methods to be more generally useful in solving such problems, they need to be extended to handle these non-Markovian properties. One important application domain where the non-Markovian aspects are paramount is the area of multi-agent systems. This area is expected to be increasingly important in the future, due to the potential rapid emergence of ``agent economies'' consisting of large populations of interacting software agents engaged in various forms of economic activity. The problem of multiple agents simultaneously adapting is in general non-Markov, because each agent provides an effectively non-stationary environment for the other agents. Hence the existing convergence guarantees do not hold, and in general, it is not known whether any global convergence will be obtained, and if so, whether such solutions are optimal.
Some progress has been made in analyzing certain special case
multi-agent problems. For example, the problem of ``teams,'' where
all agents share a common utility function, has been studied, for
example, in (Crites and Barto, 1996).
Likewise, the purely competitive case of zero-sum utility functions
has been studied in (Littman, 1994), where an algorithm called
``minimax-Q'' was proposed for two-player zero-sum games, and shown
to converge to the optimal value function and policies for both players.
Sandholm and Crites studied simultaneous Q-learning by two
players in the Iterated Prisoner's Dilemma game
(Sandholm and Crites, 1995), and found that
the learning procedure generally converged to stationary solutions.
However, the extent to which those solutions were ``optimal'' was unclear.
Recently, Hu and Wellman proposed an algorithm for calculating
optimal Q-functions in two-player arbitrary-sum games
(Hu and Wellman, 1998). This algorithm is an important first step.
However, it does not yet appear to be useable for practical problems,
because it assumes that policies followed by both players will be
Nash equilibrium policies, and it does not address the ``equilibrium
coordination'' problem, i.e. if there are multiple Nash equilibria,
how do the agents decide which equilibrium to choose?
This may be a serious problem, since according to the
``folk theorem of iterated games'' (Kreps, 1990), there can be
a proliferation of Nash equilibria when there is sufficiently
high emphasis on future rewards, i.e., a large value of the
discount parameter
. Furthermore,
there may be inconsistencies between the assumed Nash policies,
and the policies implied by the Q-functions calculated by the
algorithm.
The present work examines simultaneous Q-learning in an economically
motivated two-player game. The players are assumed to be two
sellers of similar or identical products, who compete against
each other on the basis of price. At each time step, the sellers
alternately take turns setting prices, taking into account the
other seller's current price. After the price has been set,
the consumers then respond instantaneously and deterministically,
choosing either seller 1's product or seller 2's product (or no
product) based on the current price pair
, leading to an
instantaneous reward or utility
given to
sellers 1 and 2 respectively. It is assumed that both
sellers have full knowledge of the expected consumer response
for any given price pair, and in fact have full knowledge of
both utility functions.
This work builds on prior research reported in (Tesauro and Kephart, 1998; Tesauro and Kephart, 1999). Those papers examined the effect of including foresight, i.e. an ability to anticipate longer-term consequences of an agent's current action. Two different algorithms for agent foresight were presented: (i) a generalization of the minimax search procedure in two-player zero-sum games; (ii) a generalization of the Policy Iteration method from dynamic programming, in which both players' policies are simultaneously improved, until self-consistent policy pairs are obtained that optimize expected reward over two time steps. It was found that including foresight in the agents' pricing algorithms generally improved overall agent profitability, and usually damped out or eliminated the pathological behavior of unending cyclic ``price wars,'' in which long episodes of repeated undercutting amongst the sellers alternate with large jumps in price. Such price wars were found to be rampant in prior studies of agent economy models (Kephart, Hanson and Sairamesh, 1998; Sairamesh and Kephart, 1998) when the agents use ``myopically optimal'' or ``myoptimal'' pricing algorithms that optimize immediate reward, but do not anticipate the longer-term consequences of an agent's current price setting.
There are three primary motivations for studying simultaneous Q-learning in this paper. First, if Q-functions can be learned simultaneously and self-consistently for both players, the policies implied by those Q-functions should be self-consistently optimal. In other words, an agent will be able to correctly anticipate the longer-term consequences of its own actions, the other agents' actions, and will correctly model the other agents as having an equivalent capability. Hence the classic problem of infinite recursion of opponent models will be avoided. In contrast, in other approaches to adaptive multi-agent systems, these issues are more problematic. For example, (Hu and Wellman, 1996) study the situation of a single ``strategic'' agent, which is able to anticipate the market impact of its pricing actions, in a population of ``reactive'' agents, which have no such anticipatory capability. Likewise, (Vidal and Durfee, 1998) propose a recursive opponent modeling scheme, in which level-0 agents do no opponent modeling, level-1 agents model the opponents as being level-0, level-2 agents model the opponents as being level-1, etc.. In both of these approaches, there is no effective way for an agent to model other agents as being at an equivalent level of depth or complexity.
The second advantage of Q-learning is that the solutions should
correspond to deep lookahead: in principle,
the Q-function represents the expected reward looking infinitely
far ahead in time, exponentially weighted by a discount parameter
. In contrast, the prior work of
(Tesauro and Kephart, 1999) was based on shallow finite
lookahead. Finally, in comparison to directly modeling agent
policies, the Q-function approach seems more extensible
to the situation of very large economies with many competing
sellers. Approximating Q-functions with nonlinear function
approximators such as neural networks seems intuitively
more feasible than approximating the corresponding policies.
Furthermore, in the Q-function approach, each agent only needs
to maintain a single Q-function for itself, whereas in the
policy modeling approach, each agent needs to maintain a
policy model for every other agent; the latter seems infeasible
when the number of sellers is large.
The remainder of this paper is organized as follows. Section 2 describes the structure and dynamics of the model two-seller economy, and presents three economically-based models of seller utility (Price-Quality, Information-Filtering, and Shopbot) which are known to be prone to price wars when agents myopically optimize their short-term payoffs. System parameters are chosen to place each of these systems in a price-war regime. Section 3 describes implementation details of Q-learning in these model economies. As a first step, the simple case of ordinary Q-learning is considered, where one of the two sellers uses Q-learning and the other seller uses a fixed pricing policy (the myopically optimal, or ``myoptimal'' policy). Section 4 examines the more interesting and novel situation of simultaneous Q-learning by both sellers. Section 5 studies single-agent Q-learning in these models using neural networks, and compares the results to those of section 3 using lookup tables. Finally, section 6 summarizes the main conclusions and discusses promising directions and challenges for future work.