Let us first consider ordinary single-agent Q-learning in the above two-seller economic models: one seller uses Q-learning to learn a Q-function and corresponding policy, while the other seller maintains a fixed pricing policy.
The procedure for Q-learning is as
follows. Let Q(s,a) represent the discounted long-term expected
reward to an agent for taking action a in state s. The
discounting of future rewards is accomplished by a discount
parameter
such that the value of a reward expected at
n time steps in the future is discounted by
.
Assume that the Q(s,a) function is represented by a lookup
table containing a value for every possible state-action pair,
and assume that the table entries are initialized to arbitrary
values. Then the procedure for solving for Q(s,a) is to
infinitely repeat the following two-step loop:
1. Select a particular state s and a particular action a, observe the immediate reward r for this state-action pair, and observe the resulting state s'.
2. Adjust Q(s,a) according to the following equation:
where
is the learning rate parameter, and
the
operation represents choosing the optimal action b
among all possible actions that can be taken in the successor
state s' leading to the greatest Q-value. A wide variety
of methods may be used to select state-action pairs in step 1,
provided that every state-action pair is visited infinitely often.
For any stationary Markov Decision Problem, the Q-learning
procedure is guaranteed to converge to the correct values,
provided that the learning rate parameter is decreased over
time with an appropriate schedule.
In the simulations described below the fixed-policy seller uses
the myoptimal policy
represented for
example in the Price-Quality model by equations
3 and 4.
In this model, the distinction between states and actions is somewhat blurred. It will be assumed that the ``state'' for each seller is sufficiently described by the other seller's last price, and that the ``action'' is the current price decision. This should be a sufficient state description because no other history is needed either for the determination of immediate reward, or for the calculation of the myoptimal price by the fixed-strategy player. The definitions of immediate reward r and next-state s' have also been modified for the two-agent case as follows: let s' be the state that is obtained, starting from s, of one action by the Q-learner and a response action by the fixed-strategy opponent. Likewise, the immediate reward is defined as the sum of the two rewards obtained after those two actions. These modifications were introduced so that the state s' would have the same player to move as state s. (A possible alternative to this, which has not been investigated, is to include the side-to-move as additional information in the state-space description.)
An important issue in Q-learning is ``exploration'' of the state space. Each state-action pair must be visited sufficiently often for learning to converge. This often necessitates some sort of randomized off-policy trajectory generation, for example, by Boltzmann exploration. In the simulations reported below, the sequence of state-action pairs selected for the Q-table updates were generated by uniform random selection from amongst all possible table entries. This does not correspond to actual on-line training in a real environment, but is appropriate for situations where one has an accurate simulation of the environment, and can envision attempting to solve for optimal strategies for both players via off-line training in the laboratory.
The initial values of the Q-tables were generally set to the immediate reward values. (Consequently the initial Q-derived policies corresponded to myoptimal policies.) The learning rate was varied with time according to:
where the initial learning rate
was
usually set to 0.1, and the constant
when the
simulation time t was measured in units of
, the size of
the Q-table. (N is the number of possible prices that could be
selected by either player.) A number of different values of
the discount parameter
were studied, ranging from
to
.
Results for single-agent Q-learning in all three models indicated
that Q-learning worked well (as expected) in each case. In each
model, for each value of the discount parameter, exact convergence
of the Q-table to a stationary optimal solution was found. The
convergence times ranged from a few hundred sweeps through each
table element, for smaller values of
, to at most a few
thousand updates for the largest values of
. In addition,
once Q-learning converged, the expected cumulative profit or utility of
the policy derived from the Q-function was then measured, by running
the Q-policy against the other player's myopic policy from 100 random
starting states, each for 200 time steps, and averaging the resulting
cumulative utility for each player. In each case, it was found that
the seller achieved greater profit against a myopic opponent by using
a Q-derived policy than by using a myopic policy. (This was true
even for
, because, due to the redefinition of Q updates
summing over two time steps, the case
effectively
corresponds to a two-step optimization, rather than the one-step
optimization of the myopic policies.) Furthermore, the cumulative
utility obtained with the Q-derived policy monotonically increased
with the increasing
(as expected).
It was also interesting
to note that in many cases, the expected utility of the myopic
opponent also increased when playing against the Q-learner, and
also improved monotonically with increasing
. The explanation
is that, rather than better exploiting the myopic opponent, as
would be expected in a zero-sum game, the Q-learner instead reduced
the region over which it would participate in a mutually
undercutting price war. Typically one finds in these models that
with myopic vs. myopic play, large-amplitude price wars are generated
that start at very high prices and persist all the way down to
very low prices. When a Q-learner competes against a myopic opponent,
there are still price wars starting at high prices, however, the
Q-learner abandons the price war more quickly as the prices decrease.
The effect is that the price-war regime is smaller and confined to
higher average prices, leading to a closer approximation to cooperative
or collusive behavior, with greater expected utilites for both players.
It is interesting that this can come about even though both players
are entirely selfish.
An illustrative example of the results of single-agent Q-learning is shown below in figures 2 and 3. Figure 2 plots the average utility for both sellers in the Shopbot model, when one of the sellers is myopic and the other is a Q-learner. (As the model is symmetric, it doesn't matter which seller is the Q-learner.)
Figure 2: Plot of average utility per time step for seller 1 and seller 2,
as a function of discount parameter
, in
the Shopbot model when one seller (seller 2, open circles)
uses a myopic policy, and the other seller (seller 1, filled circles)
uses Q-learning to learn an optimal policy against
the myopic player. The dashed line indicates baseline expected
utility when both sellers are myopic.
Figure 3: Cross-plot of myopic price curve vs. Q-derived price curve
in the Shopbot model at
. Seller 1 is the Q-learner
and seller 2 is myopic. The dashed line and arrows indicate how the prices
will evolve over time using these pricing policies, starting from
a particular price pair, indicated by the filled circle.
Figure 3 plots the myopic price curve of seller 2
against the Q-derived price curve (at
) of seller 1.
We can see that both curves have a maximum price of 1 and a
minimum price of approximately 0.58. The portion of both curves
lying along the diagonal indicates undercutting behavior, in
which case the seller will respond to the opponent's price by
undercutting by
, the price discretization interval.
The system dynamics for the state
in figure 3
can be obtained by alternately applying the two pricing policies.
This can be done by a simple iterative graphical
construction, in which for any given starting point, one first
holds
constant and moves horizontally to the
curve, and then one
holds
constant and moves vertically to the
curve. We see in this figure that
the iterative graphical construction leads to an unending
cyclic price war, whose trajectory is indicated by the dashed line.
Note that the price-war behavior begins at the price pair (1, 1),
and persists until a price of approximately 0.83. At this point,
seller 1 abandons the price war, and resets its price to 1,
leading once again to another round of undercutting.
The amplitude of this price war is diminished compared to the situation in which both players use a myopic policy. In that case, seller 1's curve would be a mirror image of seller 2's curve, and the price war would persist all the way to the minimum price point, leading to a lower expected utility for both sellers.