The asymmetric solution is always observed to have the form
illustrated in Figure 4.
It can be described using one parameter (
)
for one of the agents (say Agent A) and three parameters
(
,
, and
) for the other agent
(say Agent B). The
functional forms of the response curves are described by:
Figure 4: Cross plot of asymmetric response function solutions at
.
(Not all of these parameters are independent: B's price
is just below A's price-war threshold
, i.e.
.)
We can derive
approximations to the values of
,
,
and
that are accurate for sufficiently small
.
First, consider the determination of
, the lowest
price that A is willing to undercut. At this value, A
is just on the verge of preferring to opt out of a price
war and raise its price to v. Therefore, temporarily disregarding the
fact that
must be an integer, we seek
such that
.
These two Q-values can be computed by following the price trajectories
up to the point where they join the price-war trajectory:
Equating the right-hand sides of these equations and rounding up to the nearest integer, we obtain
The parameter
is the value of
at which Agent B decides to
set its price aggressively low. This
is (approximately) the point at which
. Following the price trajectories up to the
point where they join the standard price-war trajectory, we find:
from which we obtain
Similarly, we can compute
by setting
:
from which we obtain
Figure 5: Asymmetric solution: theoretical and observed
,
,
, and
as a function of
.
Figure 5 plots the values of
,
,
, and
as a function of
for
. The
solid circles represent measurements taken by running the Q-learning
algorithm until the Bellman error is minimized, while the solid curves
represent the theoretical approximations given by
Eqs. 18,
20, and 22, which are valid
provided that
, along with the relation
.
Figure 6: Bellman error for simultaneous Q-learning by agents
1 and 2, with
. Each time unit represents a number
of random updates equal to the total number of
price pairs.
Interestingly, this solution just barely fails to be fully self-consistent.
A clear symptom of inconsistency can be seen in
Figure 6, which plots the Bellman error
(the discrepancy between the left-hand and right-hand sides of
Eq. 3) as a function of training
time for sellers A and B.
The Bellman error, defined as the average RMS error
weighted equally over all price pairs, comes extremely close to zero
but then suddenly shoots up.
The error soon decreases again, dropping nearly to zero before
shooting up once more,
and the cycle continues indefinitely. For example, at
time 464, the policies have the canonical pseudo-solution
form, and the Bellman error is just
0.0007 for A and 0.0012 for B. However, at time 465, the
response curve for A suddenly shifts from
to
as one ridge in
at
just rises
above a ridge at
. This is manifested as a long finger extending
across the crossplot illustrated in Figure 7.
Figure 7: Policy crossplot at time t=465 during the Q-learning
run of Fig. 6.
The location of the finger suggests that the problem lies in a part of
that is only relevant during transients before the
price-war cycle has begun -- somewhere around
.
In fact, detailed analysis reveals that, in this region,
very slightly exceeds
if all other Q
values are taken as described in Eqs. 15
and 16. As the Q function gradually
becomes more self-consistent and accurate, it finally reaches the
point where, for some value of
in the critical range, A's best
response shifts from v to
. Analogous fingers may develop
for other
in this range as well.
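As a rough illustration of how such a finger can appear, the following sketch treats a response curve as the argmax of a tabular Q function. The array layout (rows indexed by the opponent's price, columns by the agent's own price), the price grid, the numerical values, and the name response_curve are all assumptions made for this example and are not taken from the paper. It shows only that a very small change in one Q value can move the best response from v to a much lower price.

```python
import numpy as np

def response_curve(Q, prices):
    """Best-response price for each opponent price.

    Assumed layout (hypothetical): Q[i, j] is the agent's Q value when the
    opponent charges prices[i] and the agent answers with prices[j].
    """
    return prices[np.argmax(Q, axis=1)]

# Hypothetical price grid; the last entry plays the role of v.
prices = np.array([0.25, 0.5, 0.75, 1.0])

# Illustrative Q table: a ridge at the highest price for every opponent price.
Q_before = np.zeros((3, 4))
Q_before[:, 3] = 1.0
Q_before[1, 1] = 0.999   # a competing ridge, just below the ridge at v

# The same table after the competing ridge rises very slightly.
Q_after = Q_before.copy()
Q_after[1, 1] = 1.001

curve_before = response_curve(Q_before, prices)
curve_after = response_curve(Q_after, prices)
```

In this toy table only the response to the middle opponent price changes, which corresponds to a single isolated point leaving the line at v in a crossplot.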
B soon discovers that, simply by shifting its threshold
from
to
, it can undercut A. Interestingly,
analysis and observation demonstrate that A cannot retaliate
by extending its finger a little further to the left; instead
it retreats back to playing v. After a while, B shifts its
threshold back up to
, and the policy cycle
is ready to begin anew.
The dramatic and cyclical shift in the policies
translates into large cyclical spikes in the Bellman error for both players.
The irregularity in amplitude and frequency is due to
the randomness of the Q-learning algorithm, and the gradual lengthening
of the period is due to the cooling of the
parameter.
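The Bellman error plotted in Figure 6 can be sketched as follows. This is a minimal sketch under stated assumptions: Eq. 3 is not reproduced in this section, so the code assumes the standard deterministic form Q(s, a) = R(s, a) + gamma * max over a' of Q(s', a'), with a tabular Q function and a known successor state for every state-action pair. The names bellman_rms_error, R, next_state, and gamma are hypothetical, and the toy numbers are not the paper's data.

```python
import numpy as np

def bellman_rms_error(Q, R, next_state, gamma):
    """RMS Bellman residual, weighted equally over all table entries.

    Assumptions (not from the paper): Q and R have shape (S, A);
    next_state[s, a] is the integer index of the state reached after
    taking action a in state s; gamma is the discount factor.
    """
    # Right-hand side: immediate reward plus discounted best future value.
    targets = R + gamma * Q[next_state].max(axis=-1)
    # Residual between left-hand and right-hand sides for every entry.
    residual = Q - targets
    return float(np.sqrt(np.mean(residual ** 2)))

# Toy example: every action earns reward 1 and leads back to state 0.
S, A = 3, 4
R = np.ones((S, A))
next_state = np.zeros((S, A), dtype=int)
gamma = 0.5

# Q = 2 everywhere is self-consistent here, since 2 = 1 + 0.5 * 2.
err_consistent = bellman_rms_error(np.full((S, A), 2.0), R, next_state, gamma)

# Q = 0 everywhere misses the target of 1 at every entry.
err_zero = bellman_rms_error(np.zeros((S, A)), R, next_state, gamma)
```

Tracking this quantity after each block of updates would give a curve of the kind described above, near zero when the Q function is almost self-consistent and spiking when a policy shift changes many targets at once.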