We now examine the more interesting and challenging case of simultaneous training of Q-functions and policies for both sellers. Our approach is to use the same formalism presented in the previous section, and to alternately adjust a random entry in seller 1's Q-function, followed by a random entry in seller 2's Q-function. As each seller's Q-function evolves, the seller's pricing policy is correspondingly updated so that it optimizes the agent's current Q-function. In modeling the two-step payoff r to a seller in equation 5, we use the opponent's current policy as implied by its current Q-function. The parameters in the experiments below were generally set to the same values as in the previous section. In most of the experiments, the Q-functions were initialized to the instantaneous payoff values (so that the policies corresponded to myopic policies), although other initial conditions were explored in a few experiments.
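The alternating update scheme described above can be sketched as follows. This is a minimal illustration and not the authors' implementation. The names `simultaneous_q_learning`, `greedy_policy`, `toy_profit`, `gamma` (the discount parameter) and `alpha` (the learning rate) are assumptions introduced here. The exact form of the two-step payoff in equation 5 is also an assumption (profit immediately after the seller moves plus profit after the opponent's reply), as is the toy undercutting payoff, which stands in for the economic models.

```python
import random


def greedy_policy(q_table):
    """Best own-price index for each opponent-price index (one row per opponent price)."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_table]


def simultaneous_q_learning(profits, n_prices, gamma, alpha=0.1, n_rounds=5000, seed=0):
    """Alternately update one random Q entry for seller 0, then one for seller 1.

    profits[k](own, opp) is seller k's instantaneous profit (hypothetical signature).
    """
    rng = random.Random(seed)
    # q[k][s][a]: seller k's value for price index a when the opponent charges index s.
    # Initialised to instantaneous payoffs, so the starting policies are myopic.
    q = [[[profits[k](a, s) for a in range(n_prices)] for s in range(n_prices)]
         for k in range(2)]
    policy = [greedy_policy(q[0]), greedy_policy(q[1])]
    for _ in range(n_rounds):
        for k in (0, 1):
            s, a = rng.randrange(n_prices), rng.randrange(n_prices)
            reply = policy[1 - k][a]  # opponent's response under its current policy
            # Assumed two-step payoff: profit right after moving, plus profit after the reply.
            r = profits[k](a, s) + profits[k](a, reply)
            target = r + gamma * max(q[k][reply])
            q[k][s][a] += alpha * (target - q[k][s][a])
            # Keep seller k's policy greedy with respect to its updated Q-function.
            policy[k][s] = max(range(n_prices), key=q[k][s].__getitem__)
    return q, policy


def toy_profit(own, opp):
    """Toy undercutting payoff (illustrative only): the cheaper seller takes the market."""
    if own < opp:
        return float(own + 1)
    if own == opp:
        return (own + 1) / 2.0
    return 0.0


q, policy = simultaneous_q_learning((toy_profit, toy_profit), n_prices=5, gamma=0.5)
```

Each seller's policy is refreshed only in the row that was just updated, which is enough to keep it greedy with respect to the whole table, since no other row changes in that step.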
Figure 3: Results of simultaneous Q-learning in the Price-Quality model.
(a) Average profit per time step for
seller 1 (solid diamonds) and seller 2 (open diamonds)
vs. the discount parameter. Dashed line
indicates baseline myopic vs. myopic expected profit.
Note that seller 2's profit is higher than seller 1's,
even though seller 2 has a lower quality parameter.
(b) Cross-plot of Q-derived price curves (at any value of the discount parameter).
Dashed line and arrows indicate a sample price dynamics trajectory,
starting from the filled circle. The price war is eliminated and
the dynamics evolves to a fixed point indicated by an open circle.
For simultaneous Q-learning in the Price-Quality model,
we find robust convergence to a unique pair of pricing policies,
independent of the value of the discount parameter, as illustrated in
figure 3(b). This solution also corresponds to
the solution found by generalized minimax and by generalized
DP in (Tesauro and Kephart, 1999). We note that repeated
application of this pair of price curves leads to a dynamical
trajectory that eventually converges to a fixed-point located
at
. A detailed analysis of these
pricing policies and the fixed-point solution is presented in
(Tesauro and Kephart, 1999). In brief, for sufficiently low
prices of seller 2, it pays seller 1 to abandon the price war
and to charge a very high price,
. The value of
then corresponds to the highest price that seller 2
can charge without provoking an undercut by seller 1, based
on a two-step lookahead calculation (seller 1 undercuts, and then
seller 2 replies with a further undercut). We note that this
fixed point does not correspond to a Nash equilibrium, since
both players have an incentive to deviate, based on a one-step
lookahead calculation.
It was conjectured
in (Tesauro and Kephart, 1999) that the solution observed
in figure 3(b) corresponds to a subgame-perfect
equilibrium (Fudenberg and Tirole, 1991) rather than a Nash equilibrium.
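The idea of repeatedly applying a pair of price curves until the dynamics settle can be illustrated with a short sketch. It assumes each policy is stored as a list mapping the opponent's price index to the seller's reply, and that seller 1 moves first in each round; the function name `price_trajectory` and the small example policies are hypothetical and are not the curves of figure 3(b).

```python
def price_trajectory(policy1, policy2, start, max_steps=1000):
    """Alternate the sellers' replies from (p1, p2) = start until a state repeats.

    policy1[p2] is seller 1's price index given seller 2's; policy2[p1] is the reverse.
    Returns (trajectory, cycle), where cycle is the repeating tail. A cycle of
    length 1 is a fixed point; a longer cycle is a persistent price war.
    """
    p1, p2 = start
    seen = {}
    trajectory = []
    for step in range(max_steps):
        state = (p1, p2)
        if state in seen:
            return trajectory, trajectory[seen[state]:]
        seen[state] = step
        trajectory.append(state)
        p1 = policy1[p2]  # seller 1 responds to seller 2's current price
        p2 = policy2[p1]  # seller 2 then responds to seller 1's new price
    return trajectory, []


# Hypothetical three-price example that settles at a fixed point.
path, cycle = price_trajectory([2, 2, 2], [0, 0, 1], start=(0, 2))
```

Because the price grid is finite and the policies are deterministic, some state must eventually repeat, so the loop ends either at a fixed point or in a cycle.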
The cumulative profits obtained by the pair of pricing policies are plotted in figure 3(a). It is interesting that seller 2, the lower-quality seller, actually obtains a significantly higher profit than seller 1, the higher-quality seller. In contrast, with myopic vs. myopic pricing, seller 2 does worse than seller 1.
Figure 4: Results of simultaneous Q-learning in the
Shopbot model. (a) Average profit per time step for
seller 1 (solid diamonds) and seller 2 (open diamonds)
vs. the discount parameter. Dashed line
indicates baseline myopic vs. myopic expected profit.
(b) Cross-plot of Q-derived price curves
at
; the solution is symmetric.
Dashed line and arrows
indicate a sample price dynamics trajectory.
(c) Cross-plot of Q-derived price curves
at
; the solution is asymmetric.
In the Shopbot model, we did not find exact convergence
of the Q-functions for every value of the discount parameter. However, in
those cases where exact convergence was not found, we did
find very good approximate convergence, in which the Q-functions
and policies converged to stationary solutions to within
small random fluctuations.
Different solutions were obtained at each value of the discount parameter.
We generally found that a symmetric solution, in which the
shapes of the two sellers' price curves are identical, is
obtained at small values of the discount parameter, whereas a
broken-symmetry solution, similar to the Price-Quality solution,
is obtained at large values. We also found a range of
discount-parameter values, between 0.1 and 0.2, where either a
symmetric or an asymmetric solution could be obtained, depending
on initial conditions.
The asymmetric solution was counter-intuitive to us, because
we expected that the symmetry of the two sellers' profit functions
would lead to a symmetric solution. In hindsight, we can apply
the same type of reasoning as in the Price-Quality model to
explain the asymmetric solution.
A plot of the expected profit for both sellers
as a function of the discount parameter is shown in figure 4(a).
Plots of the symmetric and
asymmetric solution, obtained at
and
respectively, are shown in figures 4(b) and
4(c).
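The two notions used in this discussion, approximate convergence and symmetry of the solution, can be made concrete with a small sketch. Both checks are assumptions made for illustration: the text does not state how stationarity "to within small random fluctuations" was measured, so the tolerance `tol` and the snapshot comparison are hypothetical, and symmetry is read here as the two sellers using identical reply curves on a shared price grid.

```python
def is_symmetric(policy1, policy2):
    """True when both sellers use the same reply for every opponent price index."""
    return list(policy1) == list(policy2)


def approximately_converged(q_snapshots, tol):
    """True when consecutive Q-table snapshots differ by at most tol in every entry."""
    for prev, curr in zip(q_snapshots, q_snapshots[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for x, y in zip(row_prev, row_curr):
                if abs(x - y) > tol:
                    return False
    return True


# Hypothetical snapshots of one seller's 2x2 Q-table taken at intervals during training.
snapshots = [
    [[1.00, 2.00], [0.50, 1.50]],
    [[1.01, 1.99], [0.50, 1.51]],
    [[1.00, 2.00], [0.49, 1.50]],
]
stable = approximately_converged(snapshots, tol=0.05)
```

A tolerance-based check of this kind separates exact convergence (`tol=0`) from stationarity up to small fluctuations, which is the distinction drawn in the text.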
Figure 5: Results of multi-agent Q-learning in the
Information-Filtering model. (a) Average profit per time step for
seller 1 (solid diamonds) and seller 2
(open diamonds) vs. the discount parameter.
(The data points at
represent unconverged Q-functions and policies.)
Dashed line
indicates baseline expected profit when both sellers are myopic.
(b) Cross-plot of Q-derived price curves at
.
Finally, in the Information-Filtering model, we found that
simultaneous Q-learning produced exact or good approximate
convergence for small values of the discount parameter. For large values,
no convergence was obtained. The simultaneous Q-learning
solutions yielded reduced-amplitude price wars, and
monotonically increasing profitability for both sellers
as a function of the discount parameter
, at least up to
.
A few data points were examined at
, and even
though there was no convergence, the Q-policies still yielded
greater profit for both sellers than in the myopic vs. myopic
case. A plot of the Q-derived policies and system dynamics
for
is shown in figure 5(b).
The expected profits for both players as a function of the discount parameter
are plotted in figure 5(a).