next up previous
Next: Q-learning with neural networks Up: Pricing in agent economies Previous: Single-agent Q-learning

Multi-agent Q-learning

Let us now turn to the more challenging situation of simultaneous training of Q-functions and policies for both sellers. The procedure studied here is to alternately adjust a random entry in seller 1's Q-function, followed by a random entry in seller 2's Q-function, using the same formalism presented in the previous section. As each seller's Q-function evolves, the seller's pricing policy is correspondingly updated so that it optimizes the agent's current Q-function. In modeling the two-step payoff r to a seller in equation 5, the opponent's current policy is used, as implied by its current Q-function. The parameters in the experiments below were generally set to the same values as in the previous section. In most of the experiments, the Q-functions were initialized to the instantaneous payoff values (so that the policies corresponded to myopic policies), although other initial conditions were explored in a few experiments.

For simultaneous Q-learning in the Price-Quality model, we find robust convergence to a unique pair of pricing policies, independent of the value of tex2html_wrap_inline443 , as illustrated in figure 4. This solution also corresponds to the solution found by generalized minimax and by generalized DP in (Tesauro and Kephart, 1999). Repeated application of this pair of price curves leads to a dynamical trajectory that eventually converges to a fixed-point located at tex2html_wrap_inline653 . A detailed analysis of these pricing policies and the fixed-point solution is presented in (Tesauro and Kephart, 1999). In brief, for sufficiently low prices of seller 2, it pays seller 1 to abandon the price war and to charge a very high price, tex2html_wrap_inline517 . The value of tex2html_wrap_inline657 then corresponds to the highest price that seller 2 can charge without provoking an undercut by seller 1, based on a two-step lookahead calculation (seller 1 undercuts, and then seller 2 replies with a further undercut). This fixed point differs from the Nash equilibrium for this game, which was calculated in (Sairamesh and Kephart, 1998) to be tex2html_wrap_inline659 . The difference is due to two factors: (i) these models assume alternating-turn dynamics, rather than the simultaneous-move dynamics assumed by the Nash calculation; (ii) the game here is an iterated game, whereas the Nash calculation assumes a one-shot game. It was conjectured in (Tesauro and Kephart, 1999) that the solution observed in figure 4 corresponds to a subgame-perfect equilibrium (Fudenberg and Tirole, 1991) rather than a Nash equilibrium.

   figure103
Figure 4: Cross-plot of optimal price curves for seller 1 vs. seller 2 in Price-Quality model obtained by simultaneous Q-learning; the same solution was found for all values of tex2html_wrap_inline443 . The dashed line indicates how the prices will evolve over time using these pricing policies, starting from a particular price pair, indicated by the filled circle. The price war is eliminated with these pricing curves, and the dynamics instead evolves to a fixed point indicated by an open circle.

The cumulative utilities obtained by the pair of pricing policies in figure 4 are plotted in figure 5. It is interesting that seller 2, the lower-quality seller, actually obtains a significantly higher profit than seller 1, the higher-quality seller. In contrast, with myopic vs. myopic pricing, seller 2 does worse than seller 1.

   figure114
Figure 5: Plot of expected utilities for seller 1 (solid diamonds) and seller 2 (open diamonds) in Price-Quality model obtained by simultaneous Q-learning. By comparison, the utilities for seller 1 and seller 2 in myopic vs. myopic pricing are indicated as dashed lines. Seller 2's utility is higher than seller 1's in the simultaneous Q-learning solution, even though seller 2 has a lower quality parameter.

In the Shopbot model, exact convergence of the Q-functions was not found for all values of tex2html_wrap_inline443 . However, in those cases where exact convergence was not found, there was very good approximate convergence, in which the Q-functions and policies converged to stationary solutions to within small random fluctuations. Different solutions were obtained at each value of tex2html_wrap_inline443 . For small tex2html_wrap_inline443 , a symmetric solution is generally obtained (in which the shapes of tex2html_wrap_inline643 and tex2html_wrap_inline647 are identical), whereas a broken symmetry solution, similar to the Price-Quality solution, is obtained at large tex2html_wrap_inline443 . There was also a range of tex2html_wrap_inline443 values, between 0.1 and 0.2, where either a symmetric or asymmetric solution could be obtained, depending on initial conditions. The asymmetric solution seems counter-intuitive because we expected that the symmetry of the two sellers' utility functions would lead to a symmetric solution. In hindsight, one can apply the same type of reasoning as in the Price-Quality model to explain the asymmetric solution. Plots of the symmetric and asymmetric solution, obtained at tex2html_wrap_inline449 and tex2html_wrap_inline451 respectively, are shown in figures 6 and 7. A plot of the expected utility for both sellers as a function of tex2html_wrap_inline443 is shown in figure 8.

   figure126
Figure 6: Cross-plot of optimal price curves for seller 1 vs. seller 2 in the Shopbot model obtained by simultaneous Q-learning at tex2html_wrap_inline449 . The resulting price dynamics is indicated by the dashed line and arrows.

   figure135
Figure 7: Cross-plot of optimal price curves for seller 1 vs. seller 2 in the Shopbot model obtained by simultaneous Q-learning at tex2html_wrap_inline451 . The resulting price dynamics is indicated by the dashed line and arrows.

   figure144
Figure 8: Plot of expected utilities for seller 1 (solid diamonds) and seller 2 (open diamonds) in Shopbot model obtained by simultaneous Q-learning, for values of tex2html_wrap_inline443 ranging from 0.0 to 0.99. By comparison, the utilities for seller 1 and seller 2 in myopic vs. myopic pricing are indicated as a dashed line.

Finally, in the Information-Filtering model, simultaneous Q-learning produced exact or good approximate convergence for small values of tex2html_wrap_inline443 tex2html_wrap_inline691 . For large values of tex2html_wrap_inline443 , no convergence was obtained. The simultaneous Q-learning solutions yielded reduced-amplitude price wars, and montonically increasing profitability for both sellers as a function of tex2html_wrap_inline443 , at least up to tex2html_wrap_inline445 . A few data points were examined at tex2html_wrap_inline699 , and even though there was no convergence, the Q-policies still yielded greater utility for both sellers than in the myopic vs. myopic case. A plot of the Q-derived policies and system dynamics for tex2html_wrap_inline445 is shown in figure 9. The expected utilities for both players as a function of tex2html_wrap_inline443 is plotted in figure 10.

   figure155
Figure 9: Cross-plot of optimal price curves for seller 1 vs. seller 2 in the Information-Filtering model obtained by simultaneous Q-learning at tex2html_wrap_inline445 .

   figure164
Figure 10: Plot of expected utilities for seller 1 (solid diamonds) and seller 2 (open diamonds) in the Information-Filtering model obtained by simultaneous Q-learning, for values of tex2html_wrap_inline443 ranging from 0.0 to 0.7. (The data points at tex2html_wrap_inline459 represent unconverged Q-functions and policies.) By comparison, the utilities for seller 1 and seller 2 in myopic vs. myopic pricing are indicated as dashed lines.


next up previous
Next: Q-learning with neural networks Up: Pricing in agent economies Previous: Single-agent Q-learning

kephart
Wed Sep 29 11:51:48 EDT 1999