According to Table 1, the five myoptimal sellers in Fig. 1 each receive an average profit of 0.0337, or about two-thirds of the theoretical maximum of 0.05 that would be achieved by a cartel. One would expect an improvement if the algorithm were modified to take into account the anticipated future pricing behavior of competitors. Sellers might avoid undercutting if they could foresee that it might invite retaliation. Such thinking is very natural and intuitive for humans. The challenge is to endow software agents with such economic foresight without requiring that they be anywhere near the level of sophistication required to pass the Turing test.
One promising method that we have investigated is a reinforcement
learning technique called Q-learning. A seller employing this
technique learns a Q function that expresses the anticipated
future-discounted
profit for each possible price that it might charge, given the current
prices of all of its
competitors [26, 27, 28].
The anticipated future-discounted profit is simply the expected profit
during the current time interval plus
times the expected profit
one time step in the future plus
times the expected profit
two time steps in the future, etc., where the future discount parameter
lies between 0 and 1.
The seller's pricing policy is to select the
price that offers the highest future-discounted profit.
Figure 2: a) Cross plot of Q pricing policy vs. MY pricing policy.
The future discount parameter is
. The thin line
shows a typical price trajectory from a given initial condition
(represented by dot). b) Associated price-war time series.
Q-learning is known to converge when the function to be learned is
stationary. Thus a single Q-learner is guaranteed to develop an
optimal policy against a single non-adaptive DF, GT, or MY
agent. Against DF and GT, Q behaves essentially like
MY [11]. However, against MY, Q learns a superior
policy. Fig. 2a depicts Q's policy
, i.e. its price as a function of MY's
price. Rotated and superimposed on this is MY's pricing policy
. Successive
iteration of the two policies from a particular initial condition
yields a trajectory in price vector space that is represented in
Fig. 2a by a thin zigzag curve. Assuming that Q moves
first from the initial condition
, the trajectory moves
vertically down to the
curve, then horizontally to the
curve, and then alternates between
vertical and horizontal movements. The price trajectory is represented
more conventionally in Fig. 2b.
Q's policy is identical to that of MY for moderate to high prices, so
the price trajectory begins as a price war similar in form to that of
the 5MY cell in Fig. 1. However, below a critical price
, Q's policy differs from that of
MY. Instead of slightly undercutting MY, Q drops its price
aggressively to
. MY responds
by raising its price to the monopolist price 0.5, whereupon Q
undercuts it, beginning the price war cycle anew. By eliminating
the portion of the price-war cycle below price 0.196, Q improves
its average profit from 0.0892 to 0.1089, and it also improves
MY's average profit to 0.1076.
When two Q-learners are pitted against another, each creates a
non-stationary environment for the other. There are no theoretical
guarantees of convergence in this case, and in fact we have observed
both convergence and non-convergence. Convergence typically occurs
when the future discount parameter
is small (i.e., relatively
little emphasis is placed on future earnings). In this case, two
Q-learners typically converge to a stable, symmetric pair of policies,
each of which is quite similar to Q's optimal policy against a single
MY. When a moderate to strong emphasis is placed on the value of
future rewards, the Q-learners' policies nearly converge to a strongly
asymmetric pair of policies in which one of the agents has a clear
advantage over the other. The convergence is imperfect, however, and
the Q-functions and associated policies continue to make small
oscillations around a well-defined but not quite attainable asymmetric
solution. This asymmetric quasi-solution has no analog in ordinary
single-agent Q-learning. Our early efforts to characterize this
solution and the circumstances under which it arises hint of a rich
field for investigation by theorists in machine learning and dynamical
systems.
Figure 3: a) Cross plot of asymmetric pricing policies derived from
simultaneous learning by two Q-learners. The future discount parameter
. The thin line
shows a typical price trajectory from a given initial condition
(represented by dot). b) Associated price-war time series.
The asymmetric solution is illustrated in Fig. 3a, which
shows the crossed policies for Agents 1 and 2. Both pricing policies
are approximately describable by a small number of simple line
segments, except for slight perturbations that arise from the
instability of the asymmetric solution. The positions and sizes of the
perturbations shift unpredictably as the Q-learners continue to learn.
Superimposed on these two crossed policies is a trajectory in price
vector space that starts from the initial condition
and proceeds vertically and horizontally as in
Fig. 2a. The classic price war pattern is again evident in
Fig. 3a and (more conventionally) in Fig. 3b.
However, in this case Agent 1 ends the price war quite
early by aggressively lowering its price, inducing Agent 2 to set its
price to the monopolistic value of 0.5. Since prices never fall below
0.44, the price war is short-lived, and profits remain very high. On
average, Agents 1 and 2 receive 0.1254 and 0.1171, respectively. This
exceeds what two MY sellers would obtain (0.0892 each), and it even
exceeds the profits that would be obtained by two DFs (0.1127 each).
An important challenge is to extend the Q-learning technique so that
it can feasibly handle more than one opponent. The natural way to
express the Q-function is as a lookup table with one Q value per
possible price vector. For a system with s sellers, the table
is s-dimensional. Although we have found s=2 to be manageable,
the lookup table becomes infeasible in both size and trainability
for
. The obvious solution is to replace the lookup
table with a function approximator. We have explored the use of
neural nets [27] and regression trees as function
approximators. In tests on two-seller systems, neural nets appear
to have difficulty approximating the lookup table accurately
and take a very long time to train, but regression trees appear
to offer good approximation and good training times. We are
beginning to assess their performance on systems with three sellers.