Next: Difficulties of neural network Up: Single and Multi-agent Q-learning Previous: Learning algorithm

Results of lookup table learning

Detailed results of lookup table-based Q-learning are presented in (Tesauro and Kephart, 1999b). In brief, single-agent Q-learning in all three models always yielded exact convergence to a stationary optimal solution, as expected. The resulting Q-derived policies always outperformed a myopic strategy when tested against myopic opposition, and the expected profit increased monotonically with the discount parameter $\gamma$. In many cases, Q-learning had the side benefit of also improving the myopic opponent's expected profit. This improvement arises because the Q-learner learns to abandon undercutting behavior more readily as the price decreases. The price-war regime is thus smaller and confined to higher average prices, leading to a closer approximation to collusive behavior, with greater expected profits for both sellers.

For simultaneous Q-learning by both sellers, the procedure utilized was to alternately adjust a random entry in seller 1's Q-function, followed by a random entry in seller 2's Q-function, using the same formalism presented above. As the Q-functions evolved, the policies were correspondingly updated so that they optimized the agents' current Q-function. In modeling the two-step payoff r to a seller in equation 1, the opponent's current policy was used, as implied by its current Q-function.
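The alternating procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: equation 1 is not reproduced in this section, so the update below assumes the standard Q-learning rule, and it assumes the two-step payoff r is the seller's profit immediately after its own move plus its profit after the opponent's reply. The names `simultaneous_q_learning`, `greedy_policy`, and `make_toy_profit`, the learning rate `alpha`, and the toy profit function are all hypothetical and do not correspond to any of the three models.

```python
import random


def greedy_policy(q_table, s):
    """Price index that maximizes the current Q-function in state s."""
    row = q_table[s]
    return max(range(len(row)), key=row.__getitem__)


def make_toy_profit(n_prices):
    """Hypothetical profit: the cheaper seller gets the sale, ties split it."""
    def profit(own, opp):
        price = (own + 1) / n_prices
        if own < opp:
            return price
        if own == opp:
            return price / 2
        return 0.0
    return profit


def simultaneous_q_learning(profit1, profit2, n_prices, gamma,
                            alpha=0.1, steps=20000, seed=0):
    rng = random.Random(seed)
    profit = (profit1, profit2)
    # Q[k][s][a]: value to seller k of setting price index a
    # when the opponent's current price index is s.
    Q = [[[0.0] * n_prices for _ in range(n_prices)] for _ in range(2)]
    for t in range(steps):
        k = t % 2            # alternate between seller 1 and seller 2
        o = 1 - k
        s = rng.randrange(n_prices)   # random table entry (state) ...
        a = rng.randrange(n_prices)   # ... and action
        # The opponent replies with the policy implied by its current Q.
        b = greedy_policy(Q[o], a)
        # Assumed two-step payoff: profit after own move plus profit
        # after the opponent's reply.
        r = profit[k](a, s) + profit[k](a, b)
        target = r + gamma * max(Q[k][b])
        Q[k][s][a] += alpha * (target - Q[k][s][a])
    return Q
```

Because each policy is read directly off the current Q-table through `greedy_policy`, the policies track the evolving Q-functions without a separate policy-update step.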

Simultaneous Q-learning in the Price-Quality model yielded robust convergence to a unique pair of policies, independent of $\gamma$, identical to the solution found by shallow lookahead in (Tesauro and Kephart, 1999a). In the Shopbot model, exact convergence of the Q-functions was only found for tex2html_wrap_inline323 . For tex2html_wrap_inline325 , there was very good approximate convergence, in which the Q-functions converged to stationary solutions to within small random fluctuations. Different solutions were obtained at each value of $\gamma$. For small $\gamma$, a symmetric solution is generally obtained (in which the shapes of the two sellers' price curves are identical), whereas a broken-symmetry solution, similar to the Price-Quality solution, is obtained at large $\gamma$. There was a range of $\gamma$ values, between 0.1 and 0.2, where either a symmetric or an asymmetric solution could be obtained, depending on initial conditions. The asymmetric solution seems counter-intuitive because one would expect symmetric profit functions to lead to symmetric policies. Finally, in the Information-Filtering model, simultaneous Q-learning produced exact or good approximate convergence for tex2html_wrap_inline339 . For larger $\gamma$, no convergence was obtained. The Q-derived policies yielded reduced-amplitude price wars and monotonically increasing profitability for both sellers as a function of $\gamma$, up to tex2html_wrap_inline239 .

Figure 2: Results of simultaneous Q-learning with lookup tables in the Shopbot model. (a) Average utility per time step for seller 1 (solid diamonds) and seller 2 (open diamonds) vs. discount parameter $\gamma$. Dashed line indicates baseline myopic vs. myopic expected utility. (b) Cross-plot of Q-derived price curves at tex2html_wrap_inline237 . Dashed line and arrows indicate a sample price dynamics trajectory.

Figure 2 illustrates the results of simultaneous Q-learning. The left panel plots the average profit for both sellers in the Shopbot model. The right panel plots the Q-derived price curves (at tex2html_wrap_inline237 ) of seller 1 and seller 2 against each other. The system dynamics can be obtained by alternately applying the two pricing policies. This can be done by an iterative graphical construction: from any given starting point, one first holds seller 2's price constant and moves horizontally to seller 1's price curve, and then holds seller 1's price constant and moves vertically to seller 2's price curve. For these particular curves, the graphical construction leads to a very short cyclic price war, indicated by the dashed line. The price-war behavior begins at the price pair (1, 1), lasts only a couple of steps, and then drops to tex2html_wrap_inline361 . At this point seller 1 resets its price to tex2html_wrap_inline363 and the cycle repeats. The price-war amplitude is diminished compared to myopic vs. myopic play, where a long price war would persist all the way to a minimum price point tex2html_wrap_inline365 .
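The graphical construction above amounts to iterating the two policies in turn until a price pair repeats. The sketch below illustrates that idea on a discrete price grid. The function `trace_price_dynamics` and the toy `undercut` policy are hypothetical names introduced here for illustration; the toy policy is a simple undercut-or-reset rule, not the Q-derived curves of Figure 2.

```python
def trace_price_dynamics(policy1, policy2, start, max_steps=1000):
    """Alternately apply two pricing policies and detect the resulting cycle.

    policy1 maps seller 2's price to seller 1's price (horizontal move);
    policy2 maps seller 1's price to seller 2's price (vertical move).
    Returns (transient, cycle) as lists of (p1, p2) pairs.
    """
    p1, p2 = start
    seen = {}   # price pair -> index of its first appearance in path
    path = []
    for _ in range(max_steps):
        state = (p1, p2)
        if state in seen:
            i = seen[state]
            return path[:i], path[i:]
        seen[state] = len(path)
        path.append(state)
        p1 = policy1(p2)   # hold seller 2's price, move to seller 1's curve
        p2 = policy2(p1)   # hold seller 1's price, move to seller 2's curve
    return path, []        # no repeat found within max_steps


def undercut(opponent_price, top=4):
    """Toy policy on price indices 0..top: undercut by one step,
    or reset to the top price once the opponent is at the minimum."""
    return opponent_price - 1 if opponent_price > 0 else top


transient, cycle = trace_price_dynamics(undercut, undercut, start=(4, 4))
```

With the toy policy, the trajectory leaves its starting pair and settles into a repeating undercut-and-reset loop, which is the discrete analogue of the dashed trajectory in the cross-plot.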



kephart
Tue Mar 21 00:52:15 EST 2000