next up previous
Next: Model agent economies Up: Pricing in agent economies Previous: Pricing in agent economies

Introduction

Reinforcement Learning (RL) procedures have been established as powerful and practical methods for solving Markov Decision Problems. One of the most significant and actively investigated RL algorithms is Q-learning (Watkins, 1989). Q-learning is an algorithm for learning to estimate the long-term expected reward for a given state-action pair. It has the nice property that it does not need a model of the environment, and it can be used for on-line learning. A number of powerful convergence proofs have been given showing that Q-learning is guaranteed to converge with probability 1, in cases where the state space is small enough so that lookup table representations can be used (Watkins and Dayan, 1992). Furthermore, in large state spaces where lookup table representations are infeasible, RL methods can be combined with function approximators to give good practical performance despite the lack of theoretical guarantees of convergence to optimal policies.

Most real-world problems are not fully Markov in nature - they are often non-stationary, history-dependent and/or not fully observable. In order for RL methods to be more generally useful in solving such problems, they need to be extended to handle these non-Markovian properties. One important application domain where the non-Markovian aspects are paramount is the area of multi-agent systems. This area is expected to be increasingly important in the future, due to the potential rapid emergence of ``agent economies'' consisting of large populations of interacting software agents engaged in various forms of economic activity. The problem of multiple agents simultaneously adapting is in general non-Markov, because each agent provides an effectively non-stationary environment for the other agents. Hence the existing convergence guarantees do not hold, and in general, it is not known whether any global convergence will be obtained, and if so, whether such solutions are optimal.

Some progress has been made in analyzing certain special case multi-agent problems. For example, the problem of ``teams,'' where all agents share a common utility function, has been studied, for example, in (Crites and Barto, 1996). Likewise, the purely competitive case of zero-sum utility functions has been studied in (Littman, 1994), where an algorithm called ``minimax-Q'' was proposed for two-player zero-sum games, and shown to converge to the optimal value function and policies for both players. Sandholm and Crites studied simultaneous Q-learning by two players in the Iterated Prisoner's Dilemma game (Sandholm and Crites, 1995), and found that the learning procedure generally converged to stationary solutions. However, the extent to which those solutions were ``optimal'' was unclear. Recently, Hu and Wellman proposed an algorithm for calculating optimal Q-functions in two-player arbitrary-sum games (Hu and Wellman, 1998). This algorithm is an important first step. However, it does not yet appear to be useable for practical problems, because it assumes that policies followed by both players will be Nash equilibrium policies, and it does not address the ``equilibrium coordination'' problem, i.e. if there are multiple Nash equilibria, how do the agents decide which equilibrium to choose? This may be a serious problem, since according to the ``folk theorem of iterated games'' (Kreps, 1990), there can be a proliferation of Nash equilibria when there is sufficiently high emphasis on future rewards, i.e., a large value of the discount parameter tex2html_wrap_inline443 . Furthermore, there may be inconsistencies between the assumed Nash policies, and the policies implied by the Q-functions calculated by the algorithm.

The present work examines simultaneous Q-learning in an economically motivated two-player game. The players are assumed to be two sellers of similar or identical products, who compete against each other on the basis of price. At each time step, the sellers alternately take turns setting prices, taking into account the other seller's current price. After the price has been set, the consumers then respond instantaneously and deterministically, choosing either seller 1's product or seller 2's product (or no product) based on the current price pair tex2html_wrap_inline465 , leading to an instantaneous reward or utility tex2html_wrap_inline467 given to sellers 1 and 2 respectively. It is assumed that both sellers have full knowledge of the expected consumer response for any given price pair, and in fact have full knowledge of both utility functions.

This work builds on prior research reported in (Tesauro and Kephart, 1998; Tesauro and Kephart, 1999). Those papers examined the effect of including foresight, i.e. an ability to anticipate longer-term consequences of an agent's current action. Two different algorithms for agent foresight were presented: (i) a generalization of the minimax search procedure in two-player zero-sum games; (ii) a generalization of the Policy Iteration method from dynamic programming, in which both players' policies are simultaneously improved, until self-consistent policy pairs are obtained that optimize expected reward over two time steps. It was found that including foresight in the agents' pricing algorithms generally improved overall agent profitability, and usually damped out or eliminated the pathological behavior of unending cyclic ``price wars,'' in which long episodes of repeated undercutting amongst the sellers alternate with large jumps in price. Such price wars were found to be rampant in prior studies of agent economy models (Kephart, Hanson and Sairamesh, 1998; Sairamesh and Kephart, 1998) when the agents use ``myopically optimal'' or ``myoptimal'' pricing algorithms that optimize immediate reward, but do not anticipate the longer-term consequences of an agent's current price setting.

There are three primary motivations for studying simultaneous Q-learning in this paper. First, if Q-functions can be learned simultaneously and self-consistently for both players, the policies implied by those Q-functions should be self-consistently optimal. In other words, an agent will be able to correctly anticipate the longer-term consequences of its own actions, the other agents' actions, and will correctly model the other agents as having an equivalent capability. Hence the classic problem of infinite recursion of opponent models will be avoided. In contrast, in other approaches to adaptive multi-agent systems, these issues are more problematic. For example, (Hu and Wellman, 1996) study the situation of a single ``strategic'' agent, which is able to anticipate the market impact of its pricing actions, in a population of ``reactive'' agents, which have no such anticipatory capability. Likewise, (Vidal and Durfee, 1998) propose a recursive opponent modeling scheme, in which level-0 agents do no opponent modeling, level-1 agents model the opponents as being level-0, level-2 agents model the opponents as being level-1, etc.. In both of these approaches, there is no effective way for an agent to model other agents as being at an equivalent level of depth or complexity.

The second advantage of Q-learning is that the solutions should correspond to deep lookahead: in principle, the Q-function represents the expected reward looking infinitely far ahead in time, exponentially weighted by a discount parameter tex2html_wrap_inline469 . In contrast, the prior work of (Tesauro and Kephart, 1999) was based on shallow finite lookahead. Finally, in comparison to directly modeling agent policies, the Q-function approach seems more extensible to the situation of very large economies with many competing sellers. Approximating Q-functions with nonlinear function approximators such as neural networks seems intuitively more feasible than approximating the corresponding policies. Furthermore, in the Q-function approach, each agent only needs to maintain a single Q-function for itself, whereas in the policy modeling approach, each agent needs to maintain a policy model for every other agent; the latter seems infeasible when the number of sellers is large.

The remainder of this paper is organized as follows. Section 2 describes the structure and dynamics of the model two-seller economy, and presents three economically-based models of seller utility (Price-Quality, Information-Filtering, and Shopbot) which are known to be prone to price wars when agents myopically optimize their short-term payoffs. System parameters are chosen to place each of these systems in a price-war regime. Section 3 describes implementation details of Q-learning in these model economies. As a first step, the simple case of ordinary Q-learning is considered, where one of the two sellers uses Q-learning and the other seller uses a fixed pricing policy (the myopically optimal, or ``myoptimal'' policy). Section 4 examines the more interesting and novel situation of simultaneous Q-learning by both sellers. Section 5 studies single-agent Q-learning in these models using neural networks, and compares the results to those of section 3 using lookup tables. Finally, section 6 summarizes the main conclusions and discusses promising directions and challenges for future work.


next up previous
Next: Model agent economies Up: Pricing in agent economies Previous: Pricing in agent economies

kephart
Wed Sep 29 11:51:48 EDT 1999