We first consider ordinary single-agent Q-learning in the above
two-seller economic models. The procedure for Q-learning is as
follows. Let Q(s,a) represent the discounted long-term expected
reward to an agent for taking action a in state s. The
discounting of future rewards is accomplished by a discount
parameter
such that the value of a reward expected at
n time steps in the future is discounted by
.
Assume that the Q(s,a) function is represented by a lookup
table containing a value for every possible state-action pair,
and assume that the table entries are initialized to arbitrary
values. Then the procedure for solving for Q(s,a) is to
infinitely repeat the following two-step loop:
1. Select a particular state s and a particular action a, observe the immediate reward r for this state-action pair, and observe the resulting state s'.
2. Adjust Q(s,a) according to the following equation:
where
is the learning rate parameter, and
the
operation represents choosing the optimal action b
among all possible actions that can be taken in the successor
state s' leading to the greatest Q-value. A wide variety
of methods may be used to select state-action pairs in step 1,
provided that every state-action pair is visited infinitely often.
For any stationary Markov Decision Problem, the Q-learning
procedure is guaranteed to converge to the correct values,
provided that
is decreased over
time with an appropriate schedule.
We first consider using Q-learning for one of the two sellers
in our economic models, while the other seller maintains a fixed
pricing policy. In the simulations described below the fixed
policy is in fact the myoptimal policy
represented for
example in the Price-Quality model by equations
3 and 4.
In our pricing application, the distinction between states and actions is somewhat blurred. We will assume that the ``state'' for each seller is sufficiently described by the other seller's last price, and that the ``action'' is the current price decision. This should be a sufficient state description because no other history is needed either for the determination of immediate reward, or for the calculation of the myoptimal price by the fixed-strategy player. We have also modified the concepts of immediate reward r and next-state s' for the two-agent case. We define s' as the state that is obtained, starting from s, of one action by the Q-learner and a response action by the fixed-strategy opponent. Likewise, the immediate reward is defined as the sum of the two rewards obtained after those two actions. These modifications were introduced so that the state s' would have the same player to move as state s. (A possible alternative to this, which we have not investigated, is to include the side-to-move as additional information in the state-space description.)
In the simulations reported below, the sequence of state-action pairs selected for the Q-table updates were generated by uniform random selection from amongst all possible table entries. The initial values of the Q-tables were generally set to the immediate reward values. (Consequently the initial Q-derived policies corresponded to myoptimal policies.) The learning rate was varied with time according to:
where the initial learning rate
was
usually set to 0.1, and the constant
when the
simulation time t was measured in units of
, the size of
the Q-table. (N is the number of possible prices that could be
selected by either player.) A number of different values of
the discount parameter
were studied, ranging from
to
.
Results for single-agent Q-learning in all three models indicated
that Q-learning worked well (as expected) in each case. In each
model, for each value of the discount parameter, exact convergence
of the Q-table to a stationary optimal solution was found. The
convergence times ranged from a few hundred sweeps through each
table element, for smaller values of
, to at most a few
thousand updates for the largest values of
. In addition,
once Q-learning converged, we then measured the expected cumulative
profit of the policy derived from the Q-function. We ran
the Q-policy against the other player's myopic policy from 100 random
starting states, each for 200 time steps, and averaged the resulting
cumulative profit for each player. We found that, in each case,
the seller achieved greater profit against a myopic opponent by using
a Q-derived policy than by using a myopic policy. (This was true
even for
, because, due to the redefinition of Q updates
summing over two time steps, the case
effectively
corresponds to a two-step optimization, rather than the one-step
optimization of the myopic policies.) Furthermore, the cumulative
profit obtained with the Q-derived policy monotonically increased
with the increasing
(as expected).
Figure 2: Results of single-agent Q-learning in the
Shopbot model. (a) Average profit per time step for
Q-learner (seller 1, filled circles) and myopic seller
(seller 2, open circles) vs. discount parameter
. Dashed line
indicates baseline expected profit when both sellers are myopic.
(b) Cross-plot of Q-derived price curve (seller 1) vs.
myopic price curve (seller 2) at
. Dashed line and arrows
indicate a temporal price-pair trajectory using these policies, starting
from filled circle.
It was also interesting
to note that in many cases, the expected profit of the myopic
opponent also increased when playing against the Q-learner, and
also improved monotonically with increasing
. The explanation
is that, rather than better exploiting the myopic opponent, as
would be expected in a zero-sum game, the Q-learner instead reduced
the region over which it would participate in a mutually
undercutting price war. Typically we find in these models that
with myopic vs. myopic play, large-amplitude price wars are generated
that start at very high prices and persist all the way down to
very low prices. When a Q-learner competes against a myopic opponent,
there are still price wars starting at high prices, however, the
Q-learner abandons the price war more quickly as the prices decrease.
The effect is that the price-war regime is smaller and confined to
higher average prices, leading to a closer approximation to cooperative
or collusive behavior, with greater expected utilites for both players.
An illustrative example of the results of single-agent Q-learning
is shown in figure 2.
Figure 2(a) plots the average profit for both
sellers in the Shopbot model, when one of the sellers is myopic
and the other is a Q-learner. (As the model is symmetric, it doesn't
matter which seller is the Q-learner.)
Figure 2(b) plots the myopic price curve of seller 2
against the Q-derived price curve (at
) of seller 1.
We can see that both curves have a maximum price of 1 and a
minimum price of approximately 0.58. The portion of both curves
lying along the diagonal indicates undercutting behavior, in
which case the seller will respond to the opponent's price by
undercutting by
, the price discretization interval.
The system dynamics for the state
in figure 2(b)
can be obtained by alternately applying the two pricing policies.
This can be done by a simple iterative graphical
construction, in which for any given starting point, one first
holds
constant and moves horizontally to the
curve, and then one
holds
constant and moves vertically to the
curve. We see in this figure that
the iterative graphical construction leads to an unending
cyclic price war, whose trajectory is indicated by the dashed line.
Note that the price-war behavior begins at the price pair (1, 1),
and persists until a price of approximately 0.83. At this point,
seller 1 abandons the price war, and resets its price to 1,
leading once again to another round of undercutting.
The amplitude of this price war is diminished compared to the situation in which both players use a myopic policy. In that case, seller 1's curve would be a mirror image of seller 2's curve, and the price war would persist all the way to the minimum price point, leading to a lower expected profit for both sellers.