Real agent economies are likely to contain large numbers of agents, with complex details of how the agents behave and interact with each other on multiple time scales. Our approach toward modeling and understanding such complexity is to begin by making a number of simplifying assumptions. We first consider the simplest possible case of two competing seller agents offering similar or identical products to a large population of consumer agents. The sellers compete on the basis of price, and we assume that prices are discretized and can lie between a minimum and maximum price, such that the number of possible prices is at most a few hundred. This renders the state space small enough that it is feasible to use lookup tables to represent the agents' pricing policies and expected profits. Time in the simulation is also discretized; at each time step, we assume that the consumers compare the current prices of the two sellers, and instantaneously and deterministically choose to purchase from at most one seller. Hence at each time step, for each possible pair of seller prices, there is a deterministic reward or profit given to each seller. The simulation can iterate forever, and there may or may not be a discounting factor for the present value of future rewards.
It is worth noting that the consumers are not regarded as ``players'' in the model. The consumers have no strategic role: they behave according to an extremely simple, fixed, short-term greedy rule (buy the lowest priced product at each time step), and are regarded as merely providing a stationary environment in which the two sellers can compete in a two-player game. This is clearly a simplifying first step in the study of multi-agent phenomena, and in future work, the models will be extended to include strategic and adaptive behavior on the part of the consumers as well. This will change the notion of ``desirable'' system behavior. In the present model, desirable behavior would resemble ``collusion'' between the two sellers in charging very high prices, so that both could obtain high profits. Obviously this is not desirable from the consumers' viewpoint.
Regarding the dynamics of seller price adjustments, we assume that the sellers alternately take turns adjusting their prices, rather than simultaneously setting prices (i.e. the game is extensive-form rather than normal-form). Our choice of alternating-turn dynamics is motivated by two considerations: (a) As the number of sellers becomes large and the model becomes more realistic, it seems more reasonable to assume that the sellers will adjust their prices at different times rather than at the same time, although they probably will not take turns in a well-defined order. (b) With alternating-turn dynamics, we can stay within the normal Q-learning framework where the Q-function implies a deterministic optimal policy: it is known that in two-player alternating turn games, there always exists a deterministic policy that is as good as any non-deterministic policy (Littman, 1994). In contrast, in games with simultaneous moves (for example, rock-paper-scissors), it is possible that no deterministic policy is optimal, and that the existing Q-learning formalism for MDPs would have to be modified and extended so that it could yield non-deterministic optimal policies.
We study Q-learning in three different economic models that have been described in detail elsewhere (Sairamesh and Kephart, 1998; Kephart, Hanson and Sairamesh, 1998; Greenwald and Kephart, 1999). The first model, called the ``Price-Quality'' model (Sairamesh and Kephart, 1998), models the sellers' products as being distinguished by different values of a scalar ``quality'' parameter, with higher-quality products being perceived as more valuable by the consumers. The consumers are modeled as trying to obtain the lowest-priced product at each time step, subject to threshold-type constraints on both quality and price, i.e., each consumer has a maximum allowable price and a minimum allowable quality. The similarity and substitutability of seller products leads to a potential for direct price competition; however, the ``vertical'' differentiation due to differening quality values leads to an asymmetry in the sellers' profit functions. It is believed that this asymmetry is responsible for the unending cyclic price wars that emerge when the sellers employ myoptimal pricing strategies.
The second model is an ``Information-Filtering'' model described in detail in (Kephart, Hanson and Sairamesh, 1998). In this model there are two competing sellers of news articles in somewhat overlapping categories. In contrast to the vertical differentiation of the Price-Quality model, this model contains a horizontal differentiation in the differing article categories. To the extent that the categories overlap, there can be direct price competition, and to the extent that they differ, there are asymmetries introduced that again lead to the potential for cyclic price wars.
The third model is the so-called ``Shopbot'' model described in (Greenwald and Kephart, 1999), which is intended to model the situation on the Internet in which some consumers may use a Shopbot to compare prices of all sellers offering a given product, and select the seller with the lowest price. In this model, the sellers' products are exactly identical and the profit functions are symmetric. Myoptimal pricing leads the sellers to undercut each other until the minimum price point is reached. At that point, a new price war cycle can be launched, due to buyer asymmetries rather than seller asymmetries. The fact that not all buyers use the Shopbot, and some buyers instead choose a seller at random, means that it can be profitable for a seller to abandon the low-price competition for the bargain hunters, and instead maximally exploit the random buyers by charging the maximum possible price.
An example profit function that we study, taken from the
Price-Quality model, is as follows: Let
and
represent
the prices charged by seller 1 and seller 2 respectively. Let
and
represent their respective quality parameters,
with
. Let c(q) represent the cost to a seller
of producing an item of quality q. Then assuming the
particular model of consumer behavior described in
(Sairamesh and Kephart, 1998) one can show analytically that
in the limit of infinitely many consumers, the instantaneous
profits per consumer
and
obtained by
seller 1 and seller 2 respectively are given by:
A plot of the profit landscape for seller 1 as a function of prices
and
is given in
figure 1, for the following parameter settings:
,
, and c(q) = 0.1 (1+q).
(These specific parameter settings were chosen because they are
known to generate harmful price wars when the agents use myopic
optimal pricing.) We can see in this
figure that the myopic optimal price for seller 1 as a function of
seller 2's price,
, is obtained for each value of
by sweeping across all values of
and choosing the value
that gives the highest profit. We can see that for small values
of
, the peak profit is obtained at
, whereas for
larger values of
, there is eventually a discontinuous shift
to the other peak, which follows along the parabolic-shaped ridge
in the landscape. An analytic expression for the myopic optimal price for
seller 1 as a function of
is as follows (defining
and
):
Similarly, the myopic optimal price for
seller 2 as a function of the price set by
seller 1,
, is given by the following formula
(assuming that prices are discrete and that
is
the price discretization interval):
Figure 1: Sample profit landscape for seller 1 in
Price-Quality model, as a function of seller 1 price
and seller 2 price
.
We also note in passing that there are similar profit landscapes for each of the sellers in the Information-Filtering model and in the Shopbot model. In all three models, it is the existence of multiple, disconnected peaks in the landscapes, with relative heights that can change depending on the other seller's price, that leads to price wars when the sellers behave myopically.
Regarding the information set that is made available to the sellers, we have made a simplifying assumption as a first step that the players have essentially perfect information. They can model the consumer behavior perfectly, and they also have perfect knowledge of each other's costs and profit functions. Hence our model is thus a two-player perfect-information deterministic game that is very similar to games like chess. The main differences are that the profits in our model are not strictly zero-sum, and that there are no terminating or absorbing nodes in our model's state space. Also in our model, payoffs are given to the players at every time step, whereas in games such as chess, payoffs are only given at the terminating nodes.
As mentioned previously, we constrain the prices set by
the two sellers to
lie in a range from some minimum to maximum allowable price.
The prices are discretized, so that one can create
lookup tables for the seller profit functions
. Furthermore, the optimal pricing policies
for each seller as a function of the other seller's price,
and
, can also be
represented in the form of table lookups.