Figure 2: Cross plot of symmetric response function solutions
for Q-learning with
.
For all values of
in the range
, the
symmetric best-response policy R(p) is observed to have a functional
form that depends on just two parameters
and
:
Figure 2 illustrates the symmetric
R(p) for the case
and
.
This differs from the myoptimal policy of Fig. 1
in that undercutting does not continue all the way down to
.
Instead, when the price gets down to
, the
agent aggressively drops its price all the way down
to a value
. The opponent's best response to
is to set the price back up to v. While this
aggressive price lowering decreases the agent's immediate profit, it
proves advantageous in the long run for at least two reasons. First,
both agents avoid the lower portion of the price-war cycle, and so the
average price over the course of a cycle is increased. Second,
when the competitor responds with price v, the agent can then
undercut at the price v-1, yielding a relatively high profit.
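
The first of these two effects can be illustrated with a minimal sketch. It assumes unit price steps and a cycle that starts at v; the names `stop` and `drop_to` and all numeric values are hypothetical choices made for illustration, not the parameters of Fig. 2. The sketch compares only the average posted price over one cycle, not profits.

```python
def cycle_prices(v, stop, drop_to=None):
    """Prices posted during one price-war cycle (illustrative sketch).

    Agents alternate, each undercutting the current price by 1, starting
    from v and continuing down to `stop`. If `drop_to` is given, the next
    mover then jumps straight to `drop_to` (the aggressive drop); in either
    case the following move resets the price to v, starting a new cycle.
    """
    prices = list(range(v, stop - 1, -1))  # v, v-1, ..., stop
    if drop_to is not None:
        prices.append(drop_to)
    return prices


def average(xs):
    return sum(xs) / len(xs)


V = 100  # hypothetical maximum price

# Myopic-style cycle: undercut one step at a time down to a low floor (assumed 20).
myopic = cycle_prices(V, stop=20)

# Aggressive cycle: stop the step-by-step war early (assumed at 60),
# then drop once to a low price (assumed 30).
aggressive = cycle_prices(V, stop=60, drop_to=30)

print(f"myopic cycle average price:     {average(myopic):.2f}")
print(f"aggressive cycle average price: {average(aggressive):.2f}")
```

With these assumed numbers the aggressive cycle skips the long run of low prices, so its average posted price is higher even though it includes one very low price.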
To calculate the parameters
and
,
consider first that
is the price below
which no agent will price because the advantage
of increased market share is outweighed by the very small profit
margin. The price
can be determined by noting that the
discounted reward must be just marginally higher than that of
choosing price v. Taking price quantization into account,
must be the smallest integer such that
.
For low to moderate values of
, very accurate approximations
to both Q values can be computed. First, consider
. As the undercutter, B's expected profit
according to Eq. 1
is
. Next, Agent A will respond
with
. Since B is still the undercutter, its
profit will be another
.
B will then respond by undercutting A with price v-1.
Therefore, using Eq. 3,
Now consider
. Since B is undercut by A, its
expected reward is
. A then responds with v-1;
this too undercuts B, and thus B again receives
. At its next opportunity, B undercuts Agent A with price
v-2. Thus the Q-value in this case is
To compute
and
, note that these price
vectors are at or near the beginning of the price war cycle. Until
the price drops down to
, B will alternately be the
undercutter and the undercuttee. Thus, when
B sets its price to p, the expected profits from its move
and A's countermove will simply be
. At B's next turn,
it will set its price to p-2, and so on. Therefore,
where the approximation comes about because the
finite arithmetico-geometric series is being approximated
by an infinite one; this is valid to the extent that
-- i.e. it assumes that B
places negligible weight on events after the end
of the price-war cycle.
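
The quality of this infinite-series approximation can be checked numerically. The sketch below uses the standard closed form for an arithmetico-geometric series whose terms fall linearly while being discounted geometrically. The names `a` (first term), `d` (per-turn decrease), `r` (effective discount per turn of B), and `n` (number of turns before the cycle ends) are hypothetical stand-ins, not the paper's notation, and the numeric values are arbitrary.

```python
def finite_ag_sum(a, d, r, n):
    """Sum of (a - d*k) * r**k for k = 0 .. n-1."""
    return sum((a - d * k) * r**k for k in range(n))


def infinite_ag_sum(a, d, r):
    """Closed form of the same series extended to infinity (needs 0 <= r < 1)."""
    if not 0 <= r < 1:
        raise ValueError("r must satisfy 0 <= r < 1")
    return a / (1 - r) - d * r / (1 - r) ** 2


# Assumed illustrative values: terms start at 100 and fall by 2 per turn,
# with 40 turns in the cycle.
a, d, n = 100.0, 2.0, 40

for r in (0.5, 0.8, 0.95):
    finite = finite_ag_sum(a, d, r, n)
    infinite = infinite_ag_sum(a, d, r)
    print(f"r={r}: finite={finite:.3f}  infinite={infinite:.3f}  "
          f"error={infinite - finite:.3f}  r**n={r**n:.2e}")
```

For small `r` the truncated tail is negligible and the two sums agree closely; as `r` approaches 1, `r**n` is no longer small and the approximation degrades.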
Noting that
must be the integer just greater than the
value obtained by equating
Eqs. 6 and 7, and using
Eq. 8, we obtain
where the approximation is quite accurate provided that the following condition is satisfied:
Similarly,
must be the smallest price for which
,
i.e. the point at which the future discounted reward of
slightly undercutting
is just barely higher than that of
aggressive undercutting to
. In the first scenario, the
price sequence leading into the beginning of the price-war
cycle is
, i.e. B's
immediate profit will be higher but it will then be
undercut twice in a row by A. In the second scenario, the
price sequence leading into the price-war cycle
will be
, i.e. B's immediate profit will be lower but it
will undercut A twice in a row. The Q functions for these
two scenarios can be computed:
and
and equated to yield
which is accurate provided that Eq. 10 holds. A simpler but less accurate expression can be derived by substituting Eq. 9, ignoring the integer restrictions, and neglecting terms of O(1):
Figure 3 plots the values of
and
as a function of
for
. The solid circles
represent measurements taken by running the Q-learning algorithm to convergence,
while the curves represent the
theoretical approximations obtained from Eqs. 13 and
9 up to the point where
Eq. 10 becomes seriously violated.
Figure 3: Symmetric solution: theoretical and observed
and
as a function of
. Wiggles in the theoretical curves are due to the
integer ceiling functions in Eqs. 9 and 13.