Next: Conclusions Up: Pricing in agent economies Previous: Single-agent Q-learning

Multi-agent Q-learning

We now examine the more interesting and challenging case of simultaneous training of Q-functions and policies for both sellers. Our approach is to use the same formalism presented in the previous section, and to alternately adjust a random entry in seller 1's Q-function, followed by a random entry in seller 2's Q-function. As each seller's Q-function evolves, that seller's pricing policy is correspondingly updated so that it remains optimal with respect to the seller's current Q-function. In modeling the two-step payoff r to a seller in equation 5, we use the opponent's current policy as implied by its current Q-function. The parameters in the experiments below were generally set to the same values as in the previous section. In most of the experiments, the Q-functions were initialized to the instantaneous payoff values (so that the initial policies were the myopic policies), although other initial conditions were explored in a few experiments.
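As a rough illustration of the alternating update scheme described above, the following Python sketch trains two tabular Q-functions simultaneously. It is not the code used in the experiments. Several details are assumptions made only for illustration: prices are discretized to a common grid of indices, the payoff tables `profit1` and `profit2` are hypothetical inputs, the two-step payoff is taken to be the seller's profit right after its own move plus its profit after the opponent's reply (equation 5 may differ in detail), and the learning rate `alpha`, the number of updates, and the name `gamma` for the discount parameter are placeholders.

```python
import random


def greedy_policy(Q):
    """For each opponent price index s, return the own price index a maximizing Q[s][a]."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


def simultaneous_q_learning(profit1, profit2, gamma, alpha=0.1,
                            n_updates=20000, seed=0):
    """Alternately update one random Q entry for seller 1, then one for seller 2.

    profit_k[a][s] is the (hypothetical) instantaneous profit of seller k when it
    charges price index a and the opponent charges price index s.
    """
    rng = random.Random(seed)
    n = len(profit1)
    profits = (profit1, profit2)
    # Q[k][s][a]: state s = opponent's price index, action a = own price index.
    # Initialized to instantaneous payoffs, so the initial policies are myopic.
    Q = [[[float(profits[k][a][s]) for a in range(n)] for s in range(n)]
         for k in range(2)]
    policy = [greedy_policy(Q[0]), greedy_policy(Q[1])]

    for step in range(n_updates):
        k = step % 2                      # seller 1 on even steps, seller 2 on odd
        s = rng.randrange(n)              # random state (opponent's price)
        a = rng.randrange(n)              # random action (own price)
        s_next = policy[1 - k][a]         # opponent's reply under its current policy
        # Assumed two-step payoff: profit before and after the opponent's reply.
        r = profits[k][a][s] + profits[k][a][s_next]
        target = r + gamma * max(Q[k][s_next])
        Q[k][s][a] += alpha * (target - Q[k][s][a])
        policy[k] = greedy_policy(Q[k])   # keep the policy greedy in the current Q
    return Q, policy
```

Calling the function with `n_updates=0` returns the myopic policies implied by the payoff tables, which gives a convenient baseline for comparison with the trained policies.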

 



Figure 3: Results of simultaneous Q-learning in the Price-Quality model. (a) Average profit per time step for seller 1 (solid diamonds) and seller 2 (open diamonds) vs. discount parameter $\gamma$. Dashed line indicates baseline myopic vs. myopic expected profit. Note that seller 2's profit is higher than seller 1's, even though seller 2 has a lower quality parameter. (b) Cross-plot of Q-derived price curves (at any $\gamma$). Dashed line and arrows indicate a sample price dynamics trajectory, starting from the filled circle. The price war is eliminated and the dynamics evolve to a fixed point indicated by an open circle.

For simultaneous Q-learning in the Price-Quality model, we find robust convergence to a unique pair of pricing policies, independent of the value of the discount parameter $\gamma$, as illustrated in figure 3(b). This solution also corresponds to the solution found by generalized minimax and by generalized DP in (Tesauro and Kephart, 1999). Repeated application of this pair of price curves leads to a dynamical trajectory that eventually converges to a fixed point located at tex2html_wrap_inline540 . A detailed analysis of these pricing policies and the fixed-point solution is presented in (Tesauro and Kephart, 1999). In brief, for sufficiently low prices of seller 2, it pays seller 1 to abandon the price war and to charge a very high price, tex2html_wrap_inline398 . The value of tex2html_wrap_inline544 then corresponds to the highest price that seller 2 can charge without provoking an undercut by seller 1, based on a two-step lookahead calculation (seller 1 undercuts, and then seller 2 replies with a further undercut). This fixed point does not correspond to a Nash equilibrium, since both players have an incentive to deviate, based on a one-step lookahead calculation. It was conjectured in (Tesauro and Kephart, 1999) that the solution observed in figure 3(b) corresponds to a subgame-perfect equilibrium (Fudenberg and Tirole, 1991) rather than a Nash equilibrium.
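The trajectory obtained by repeatedly applying a pair of price curves can be traced with a few lines of Python. The sketch below is illustrative only: it assumes prices are represented as indices on a discrete grid, that `policy1[p2]` and `policy2[p1]` are the two price curves as lookup tables, and that seller 1 moves first in each round. The function name and the small example policies are hypothetical and are not taken from the experiments.

```python
def follow_price_dynamics(policy1, policy2, start_p2):
    """Alternately apply the two price curves until seller 2's price repeats.

    Returns (path, cycle): path lists the (p1, p2) pairs visited, where p1 is
    seller 1's reply to p2; cycle is the recurring tail of the path. A cycle of
    length 1 is a fixed point; a longer cycle is a persistent price war.
    """
    first_seen = {}          # p2 index -> position in path where it first occurred
    path = []
    p2 = start_p2
    while p2 not in first_seen:
        first_seen[p2] = len(path)
        p1 = policy1[p2]     # seller 1 replies to seller 2's price
        path.append((p1, p2))
        p2 = policy2[p1]     # seller 2 replies to seller 1's price
    return path, path[first_seen[p2]:]


# Hypothetical 4-price example: seller 2 always undercuts by one grid step.
undercut2 = [0, 0, 1, 2]

# Seller 1 also undercuts, but resets to the top price at the bottom: a price war.
war1 = [3, 0, 1, 2]
path_war, cycle_war = follow_price_dynamics(war1, undercut2, start_p2=3)

# Seller 1 always charges the top price: the dynamics settle at a fixed point.
high1 = [3, 3, 3, 3]
path_fix, cycle_fix = follow_price_dynamics(high1, undercut2, start_p2=3)
```

Because the joint dynamics are deterministic and the price grid is finite, every trajectory must end in either a fixed point or a cycle, so the loop always terminates.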

The cumulative profits obtained by the pair of pricing policies are plotted in figure 3(a). It is interesting that seller 2, the lower-quality seller, actually obtains a significantly higher profit than seller 1, the higher-quality seller. In contrast, with myopic vs. myopic pricing, seller 2 does worse than seller 1.

 



Figure 4: Results of simultaneous Q-learning in the Shopbot model. (a) Average profit per time step for seller 1 (solid diamonds) and seller 2 (open diamonds) vs. discount parameter $\gamma$. Dashed line indicates baseline myopic vs. myopic expected profit. (b) Cross-plot of Q-derived price curves at tex2html_wrap_inline334 ; the solution is symmetric. Dashed line and arrows indicate a sample price dynamics trajectory. (c) Cross-plot of Q-derived price curves at tex2html_wrap_inline336 ; the solution is asymmetric.

In the Shopbot model, we did not find exact convergence of the Q-functions for every value of the discount parameter $\gamma$. However, in those cases where exact convergence was not found, we did find very good approximate convergence, in which the Q-functions and policies converged to stationary solutions to within small random fluctuations. Different solutions were obtained at each value of $\gamma$. We generally find that a symmetric solution, in which the shapes of tex2html_wrap_inline526 and tex2html_wrap_inline530 are identical, is obtained at small $\gamma$, whereas a broken-symmetry solution, similar to the Price-Quality solution, is obtained at large $\gamma$. We also found a range of $\gamma$ values, between 0.1 and 0.2, where either a symmetric or an asymmetric solution could be obtained, depending on initial conditions. The asymmetric solution was counter-intuitive to us, because we expected that the symmetry of the two sellers' profit functions would lead to a symmetric solution. In hindsight, we can apply the same type of reasoning as in the Price-Quality model to explain the asymmetric solution. A plot of the expected profit for both sellers as a function of $\gamma$ is shown in figure 4(a). Plots of the symmetric and asymmetric solutions, obtained at tex2html_wrap_inline334 and tex2html_wrap_inline336 respectively, are shown in figures 4(b) and 4(c).
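The distinction drawn above between exact convergence, approximate convergence to within small fluctuations, and no convergence can be made operational in many ways. The Python sketch below shows one possibility and is not the criterion used in the experiments. It assumes that a Q-function is stored as a nested list, that snapshots of it are saved periodically during training, and that only the second half of the run is inspected. The tolerance `tol` is an arbitrary illustrative value.

```python
def max_q_change(q_old, q_new):
    """Largest absolute entry-wise difference between two Q tables."""
    return max(abs(a - b)
               for row_old, row_new in zip(q_old, q_new)
               for a, b in zip(row_old, row_new))


def classify_convergence(snapshots, tol=1e-3):
    """Label a run as 'exact', 'approximate', or 'none' from saved Q snapshots.

    Only the second half of the snapshot-to-snapshot changes is inspected, so
    that the initial transient does not affect the label.
    """
    if len(snapshots) < 2:
        raise ValueError("need at least two snapshots")
    changes = [max_q_change(a, b) for a, b in zip(snapshots, snapshots[1:])]
    tail = changes[len(changes) // 2:]
    if all(c == 0.0 for c in tail):
        return "exact"          # Q stopped changing entirely
    if max(tail) <= tol:
        return "approximate"    # small residual fluctuations remain
    return "none"
```

The same check can be applied to each seller's Q-function separately, since one seller's table may settle while the other continues to fluctuate.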

 



Figure 5: Results of multi-agent Q-learning in the Information-Filtering model. (a) Average profit per time step for seller 1 (solid diamonds) and seller 2 (open diamonds) vs. discount parameter $\gamma$. (The data points at tex2html_wrap_inline340 represent unconverged Q-functions and policies.) Dashed line indicates baseline expected profit when both sellers are myopic. (b) Cross-plot of Q-derived price curves at tex2html_wrap_inline326 .

Finally, in the Information-Filtering model, we found that simultaneous Q-learning produced exact or good approximate convergence for small values of the discount parameter $\gamma$ tex2html_wrap_inline580 . For large values of $\gamma$, no convergence was obtained. The simultaneous Q-learning solutions yielded reduced-amplitude price wars, and monotonically increasing profitability for both sellers as a function of $\gamma$, at least up to tex2html_wrap_inline326 . A few data points were examined at tex2html_wrap_inline588 , and even though there was no convergence, the Q-policies still yielded greater profit for both sellers than in the myopic vs. myopic case. A plot of the Q-derived policies and system dynamics for tex2html_wrap_inline326 is shown in figure 5(b). The expected profits for both players as a function of $\gamma$ are plotted in figure 5(a).



kephart
Wed Sep 29 11:53:06 EDT 1999