Let us now turn to the more challenging situation of simultaneous training of Q-functions and policies for both sellers. The procedure studied here is to alternately adjust a random entry in seller 1's Q-function, followed by a random entry in seller 2's Q-function, using the same formalism presented in the previous section. As each seller's Q-function evolves, the seller's pricing policy is correspondingly updated so that it optimizes the agent's current Q-function. In modeling the two-step payoff r to a seller in equation 5, the opponent's current policy is used, as implied by its current Q-function. The parameters in the experiments below were generally set to the same values as in the previous section. In most of the experiments, the Q-functions were initialized to the instantaneous payoff values (so that the policies corresponded to myopic policies), although other initial conditions were explored in a few experiments.
For simultaneous Q-learning in the Price-Quality model,
we find robust convergence to a unique pair of pricing policies,
independent of the value of
, as illustrated in
figure 4. This solution also corresponds to
the solution found by generalized minimax and by generalized
DP in (Tesauro and Kephart, 1999). Repeated
application of this pair of price curves leads to a dynamical
trajectory that eventually converges to a fixed-point located
at
. A detailed analysis of these
pricing policies and the fixed-point solution is presented in
(Tesauro and Kephart, 1999). In brief, for sufficiently low
prices of seller 2, it pays seller 1 to abandon the price war
and to charge a very high price,
. The value of
then corresponds to the highest price that seller 2
can charge without provoking an undercut by seller 1, based
on a two-step lookahead calculation (seller 1 undercuts, and then
seller 2 replies with a further undercut). This
fixed point differs from the Nash equilibrium for this game,
which was calculated in (Sairamesh and Kephart, 1998) to be
. The difference is due to two
factors: (i) these models assume alternating-turn dynamics, rather
than the simultaneous-move dynamics assumed by the Nash
calculation; (ii) the game here is an iterated game, whereas
the Nash calculation assumes a one-shot game. It was conjectured
in (Tesauro and Kephart, 1999) that the solution observed
in figure 4 corresponds to a subgame-perfect
equilibrium (Fudenberg and Tirole, 1991) rather than a Nash equilibrium.
Figure 4: Cross-plot of optimal price curves for seller 1 vs. seller 2
in Price-Quality model obtained by simultaneous Q-learning; the
same solution was found for all values of
.
The dashed line indicates how the prices
will evolve over time using these pricing policies, starting from
a particular price pair, indicated by the filled circle. The price
war is eliminated with these pricing curves, and the dynamics instead
evolves to a fixed point indicated by an open circle.
The cumulative utilities obtained by the pair of pricing policies in figure 4 are plotted in figure 5. It is interesting that seller 2, the lower-quality seller, actually obtains a significantly higher profit than seller 1, the higher-quality seller. In contrast, with myopic vs. myopic pricing, seller 2 does worse than seller 1.
Figure 5: Plot of expected utilities for seller 1 (solid diamonds)
and seller 2 (open diamonds)
in Price-Quality model obtained by simultaneous Q-learning.
By comparison, the utilities for seller 1 and seller 2 in
myopic vs. myopic pricing are indicated as dashed lines. Seller 2's
utility is higher than seller 1's in the simultaneous Q-learning
solution, even though seller 2 has a lower quality parameter.
In the Shopbot model, exact convergence of the Q-functions
was not found for all values of
. However, in
those cases where exact convergence was not found, there was
very good approximate convergence, in which the Q-functions
and policies converged to stationary solutions to within
small random fluctuations.
Different solutions were obtained at each value of
.
For small
, a symmetric solution is generally obtained
(in which the shapes of
and
are identical),
whereas a broken symmetry
solution, similar to the Price-Quality solution, is obtained
at large
. There was also a range of
values,
between 0.1 and 0.2, where either a symmetric or asymmetric
solution could be obtained, depending on initial conditions.
The asymmetric solution seems counter-intuitive because
we expected that the symmetry of the two sellers' utility functions
would lead to a symmetric solution. In hindsight, one can apply
the same type of reasoning as in the Price-Quality model to
explain the asymmetric solution. Plots of the symmetric and
asymmetric solution, obtained at
and
respectively, are shown in figures 6 and
7. A plot of the expected utility for both sellers
as a function of
is shown in figure 8.
Figure 6: Cross-plot of optimal price curves for seller 1 vs. seller 2
in the Shopbot model obtained by simultaneous Q-learning at
. The resulting price dynamics is indicated by the
dashed line and arrows.
Figure 7: Cross-plot of optimal price curves for seller 1 vs. seller 2
in the Shopbot model obtained by simultaneous Q-learning at
. The resulting price dynamics is indicated by the
dashed line and arrows.
Figure 8: Plot of expected utilities for seller 1
(solid diamonds) and seller 2 (open diamonds)
in Shopbot model obtained by simultaneous Q-learning, for
values of
ranging from 0.0 to 0.99.
By comparison, the utilities for seller 1 and seller 2 in
myopic vs. myopic pricing are indicated as a dashed line.
Finally, in the Information-Filtering model,
simultaneous Q-learning produced exact or good approximate
convergence for small values of
. For large values of
,
no convergence was obtained. The simultaneous Q-learning
solutions yielded reduced-amplitude price wars, and
montonically increasing profitability for both sellers
as a function of
, at least up to
.
A few data points were examined at
, and even
though there was no convergence, the Q-policies still yielded
greater utility for both sellers than in the myopic vs. myopic
case. A plot of the Q-derived policies and system dynamics
for
is shown in figure 9.
The expected utilities for both players as a function of
is plotted in figure 10.
Figure 9: Cross-plot of optimal price curves for seller 1 vs. seller 2
in the Information-Filtering model obtained by simultaneous Q-learning at
.
Figure 10: Plot of expected utilities for seller 1 (solid diamonds)
and seller 2 (open diamonds) in the
Information-Filtering model obtained by simultaneous Q-learning, for
values of
ranging from 0.0 to 0.7. (The data points at
represent unconverged Q-functions and
policies.)
By comparison, the utilities for seller 1 and seller 2 in
myopic vs. myopic pricing are indicated as dashed lines.