Within the next few years, we expect electronic commerce to be an important multi-agent domain in which reinforcement learning will find numerous applications. One such application is automated dynamic pricing by software agents [1]. Suppose that each seller agent individually attempts to maximize profits through judicious setting of prices and other product parameters. Even if the seller agents do not communicate with one another directly, market forces may strongly couple their actions, resulting in a highly dynamic multi-agent system. Since decision making in markets and economies benefits greatly from one's ability to forecast economic trends and opponents' strategies, reinforcement learning is likely to be an essential component of decision making by economically-motivated software agents.
Unfortunately, from a theoretical perspective, the issue of what happens when multiple interacting agents simultaneously adapt, using RL or other approaches, is largely an open question. This stands in contrast to the case of single-agent RL: in stationary Markov Decision Problems, a solid theoretical understanding has been provided by research on algorithms such as Dynamic Programming and Q-learning. Various theorems establish that global convergence to a unique optimal value function and optimal policy will always be obtained. However, these theorems do not apply in the multi-agent case, as adapting agents provide effectively non-stationary environments for other agents.
Some progress has been made in analyzing certain special-case multi-agent problems. For example, cooperative teams of agents sharing a common goal or utility function have been studied in [6], among others. The purely competitive case of zero-sum utility functions has been studied in [4], where an algorithm called ``minimax-Q'' was proposed for two-player zero-sum games, and shown to converge to the optimal value function and policies for both players. Simultaneous Q-learning by two players in the Iterated Prisoner's Dilemma game was studied empirically in [5], who found that the learning procedure frequently converged to stationary solutions. An important first step in analyzing Q-learning for arbitrary-sum two-player games was recently taken in [3]. This algorithm assumes that the players follow Nash equilibrium policies. Issues remaining to be addressed include the ``equilibrium coordination'' problem (i.e. how the agents choose from amongst multiple Nash equilibria) and verification that the policies implied by the learned Q-functions are consistent with the initially assumed Nash policies.
In our previous work [7], we studied simultaneous
Q-learning by two price-setting agents in three simple models, showing
that in all cases Q-learning by one or both of the agents raised both
of their profits substantially from what they would obtain using
myopic best-response pricing. Simultaneous convergence of both
sellers to stationary policies was found in some but not all cases,
depending on the model's payoff functions and on the discount
parameter
. In other cases, we observed
``pseudo-convergence,'' i.e. near-convergence to a
slightly unstable solution with non-zero Bellman error--an
intriguing phenomenon
that has no analog in ordinary single-agent Q-learning. Moreover,
in the shopbot model, the pseudo-convergent policies of the two
agents were asymmetric even though the two agents were identical
in every respect, including their payoffs.
In this paper, we delve into the nature and behavior
of simultaneous Q-learning by two interacting agents. As a
vehicle for our study, we focus on the ``shopbot'' model originally
introduced in [1]. In any given training run, we had
observed [7] that either of two solutions could be
obtained: symmetric and stable, or asymmetric and pseudo-convergent.
We also found that the asymmetric solution was increasingly favored as
the discount parameter
was increased. Having observed
pseudo-convergence under a wide range of experimental
conditions--even in entirely different agent models--we conjecture
that it is somehow endemic to the problem of pricing in multi-agent
domains. By studying these phenomena in the context of the shopbot
model, we hope to develop more broadly useful analytic and
experimental tools and insights that will bring us to a broader
understanding of the nature of multi-agent Q-learning, particularly in
competitive, non-zero-sum settings like markets and economies.
The rest of this paper is organized as follows. Section 2 provides some background on the shopbot model and our previous work on simultaneous Q-learning in that context. Section 3 presents an analysis of both the symmetric and asymmetric solutions, and the dynamic behavior of Bellman error vs. time in the latter case. Section 4 analyzes the conditions under which each solution is chosen, and the final section gives concluding remarks and suggested directions for future work.