Detailed results of lookup table-based Q-learning are presented
in (Tesauro and Kephart, 1999b). In brief,
single-agent Q-learning in all three models was found to always
yield exact convergence to a stationary optimal solution (as expected).
The resulting Q-derived policies always outperformed a myopic
strategy when tested against myopic opposition,
and the expected profit increased monotonically with
the discount parameter.
In many cases, Q-learning had the side benefit of also improving
the myopic opponent's expected profit.
This improvement is due to the
Q-learner learning to abandon undercutting behavior more readily
as the price decreases.
The price-war regime is thus smaller and confined to
higher average prices, leading to a closer approximation to
collusive behavior, with greater expected profits for both sellers.
For simultaneous Q-learning by both sellers, the procedure utilized was to alternately adjust a random entry in seller 1's Q-function, followed by a random entry in seller 2's Q-function, using the same formalism presented above. As the Q-functions evolved, the policies were correspondingly updated so that they optimized the agents' current Q-function. In modeling the two-step payoff r to a seller in equation 1, the opponent's current policy was used, as implied by its current Q-function.
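The alternating update procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the price grid, the toy `profit` function (the cheaper seller captures all demand), the learning rate `alpha`, and the exact form of the two-step payoff are all hypothetical stand-ins, since the three economic models and equation 1 are defined elsewhere in the paper.

```python
import random

PRICES = [i / 10 for i in range(1, 11)]  # hypothetical price grid 0.1 .. 1.0
N = len(PRICES)


def profit(own, other):
    """Toy profit (an assumption, not one of the paper's models):
    the cheaper seller takes the whole market, ties split it."""
    if own < other:
        return own
    if own == other:
        return own / 2
    return 0.0


def greedy(Q):
    """Policy implied by Q: for each opponent price index s,
    choose the own-price index a that maximizes Q[s][a]."""
    return [max(range(N), key=lambda a: Q[s][a]) for s in range(N)]


def update_entry(Q, opp_policy, gamma, alpha, rng):
    """Adjust one randomly chosen entry Q[s][a] of a seller's lookup table."""
    s = rng.randrange(N)   # opponent's current price (the state)
    a = rng.randrange(N)   # this seller's new price (the action)
    b = opp_policy[a]      # opponent's reply under its current policy
    # Assumed two-step payoff: profit right after moving, plus profit
    # after the opponent has replied.
    r = profit(PRICES[a], PRICES[s]) + profit(PRICES[a], PRICES[b])
    target = r + gamma * max(Q[b])
    Q[s][a] += alpha * (target - Q[s][a])


def simultaneous_q_learning(gamma=0.5, alpha=0.1, steps=5000, seed=0):
    rng = random.Random(seed)
    Q1 = [[0.0] * N for _ in range(N)]
    Q2 = [[0.0] * N for _ in range(N)]
    for _ in range(steps):
        # Alternate: one entry of seller 1's table, then one of seller 2's,
        # each time using the opponent's policy implied by its current table.
        update_entry(Q1, greedy(Q2), gamma, alpha, rng)
        update_entry(Q2, greedy(Q1), gamma, alpha, rng)
    return Q1, Q2, greedy(Q1), greedy(Q2)
```

Recomputing the opponent's greedy policy before every update keeps the sketch close to the description (policies always track the current Q-functions) at the cost of some redundant work; a practical version would update only the affected policy entry.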
Simultaneous Q-learning in the Price-Quality model yielded
robust convergence to a unique pair of policies, independent of
the discount parameter and identical to the solution found by shallow lookahead in
(Tesauro and Kephart, 1999a). In the Shopbot model,
exact convergence of the Q-functions was only found for
. For
, there was
very good approximate convergence, in which the Q-functions
converged to stationary solutions to within
small random fluctuations.
Different solutions were obtained at each value of
the discount parameter.
For small values of the discount parameter,
a symmetric solution is generally obtained
(in which the shapes of
and
are identical),
whereas a broken-symmetry solution,
similar to the Price-Quality solution, is obtained
at large values of the discount parameter.
There was a range of values,
between 0.1 and 0.2, where either a symmetric or asymmetric
solution could be obtained, depending on initial conditions.
The asymmetric solution seems counter-intuitive because one
would expect that symmetric profit functions would lead to
symmetric policies.
Finally, in the Information-Filtering model,
simultaneous Q-learning produced exact or good approximate
convergence for
. For larger
,
no convergence was obtained. The Q-derived
policies yielded reduced-amplitude price wars, and
monotonically increasing profitability for both sellers
as a function of the discount parameter,
up to
.
Figure 2: Results of simultaneous Q-learning with lookup tables in the
Shopbot model. (a) Average utility per time step for
seller 1 (solid diamonds) and seller 2 (open diamonds)
vs. discount parameter
. Dashed line
indicates baseline myopic vs. myopic expected utility.
(b) Cross-plot of Q-derived price curves
at
. Dashed line and arrows
indicate a sample price dynamics trajectory.
Figure 2 illustrates the results of
simultaneous Q-learning.
Panel (a) plots the average profit for both
sellers in the Shopbot model.
Panel (b) plots the Q-derived price curves
(at
) of seller 1 and seller 2 against each other.
The system dynamics
can be obtained by alternately applying the two pricing policies.
This can be done by an iterative graphical
construction, in which for any given starting point, one first
holds
constant and moves horizontally to the
curve, and then one
holds
constant and moves vertically to the
curve.
For these particular curves, the graphical construction leads
to a very short cyclic price war, indicated by the dashed line.
The price-war behavior begins at the price pair (1, 1),
lasts only a couple of steps, and then drops to
.
At this point seller 1 resets its price to
and the
cycle repeats.
The price war amplitude is diminished compared to myopic vs. myopic play,
where a long price war would persist all the way to a
minimum price point
.
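The graphical construction described above amounts to alternately applying the two pricing policies until a price pair repeats. A minimal sketch follows, assuming that seller 1 moves first and that prices are represented as integer levels; the `undercut` policy is a hypothetical myopic-style rule used only to exercise the function, not one of the Q-derived curves of Figure 2.

```python
def price_trajectory(policy1, policy2, start, max_steps=1000):
    """Alternate two pricing policies from start = (p1, p2).

    Returns (transient, cycle): the price pairs visited before the
    dynamics start repeating, and the pairs that make up the repeating
    cycle. The cycle is empty if none is found within max_steps.
    """
    p1, p2 = start
    mover = 1            # assumption: seller 1 moves first
    seen = {}            # (p1, p2, mover) -> index in path
    path = []
    while (p1, p2, mover) not in seen and len(path) < max_steps:
        seen[(p1, p2, mover)] = len(path)
        path.append((p1, p2))
        if mover == 1:
            p1 = policy1(p2)   # hold seller 2's price, move seller 1
        else:
            p2 = policy2(p1)   # hold seller 1's price, move seller 2
        mover = 3 - mover
    first = seen.get((p1, p2, mover), len(path))
    return path[:first], path[first:]


def undercut(opponent_level):
    """Hypothetical toy policy on integer price levels 1..10:
    undercut by one level, or reset to the top level once the
    opponent's price is at level 3 or below."""
    return opponent_level - 1 if opponent_level > 3 else 10


transient, cycle = price_trajectory(undercut, undercut, start=(10, 10))
```

With Q-derived policies stored as lookup tables, `policy1` and `policy2` would simply index into those tables; a shorter cycle confined to higher prices corresponds to the reduced-amplitude price war described in the text.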