next up previous
Next: Buyers and their search Up: Sellers and automated pricing Previous: Fast myoptimal

Foresight and machine learning

According to Table 1, the five myoptimal sellers in Fig. 1 each receive an average profit of 0.0337, or about two-thirds of the theoretical maximum of 0.05 that would be achieved by a cartel. One would expect an improvement if the algorithm were modified to take into account the anticipated future pricing behavior of competitors. Sellers might avoid undercutting if they could foresee that it might invite retaliation. Such thinking is very natural and intuitive for humans. The challenge is to endow software agents with such economic foresight without requiring that they be anywhere near the level of sophistication required to pass the Turing test.

One promising method that we have investigated is a reinforcement learning technique called Q-learning. A seller employing this technique learns a Q function that expresses the anticipated future-discounted profit for each possible price that it might charge, given the current prices of all of its competitors [26, 27, 28]. The anticipated future-discounted profit is simply the expected profit during the current time interval plus tex2html_wrap_inline780 times the expected profit one time step in the future plus tex2html_wrap_inline808 times the expected profit two time steps in the future, etc., where the future discount parameter tex2html_wrap_inline780 lies between 0 and 1. The seller's pricing policy is to select the price that offers the highest future-discounted profit.

   figure140
Figure 2: a) Cross plot of Q pricing policy vs. MY pricing policy. The future discount parameter is tex2html_wrap_inline682 . The thin line shows a typical price trajectory from a given initial condition (represented by dot). b) Associated price-war time series.

Q-learning is known to converge when the function to be learned is stationary. Thus a single Q-learner is guaranteed to develop an optimal policy against a single non-adaptive DF, GT, or MY agent. Against DF and GT, Q behaves essentially like MY [11]. However, against MY, Q learns a superior policy. Fig. 2a depicts Q's policy tex2html_wrap_inline814 , i.e. its price as a function of MY's price. Rotated and superimposed on this is MY's pricing policy tex2html_wrap_inline816 . Successive iteration of the two policies from a particular initial condition yields a trajectory in price vector space that is represented in Fig. 2a by a thin zigzag curve. Assuming that Q moves first from the initial condition tex2html_wrap_inline818 , the trajectory moves vertically down to the tex2html_wrap_inline814 curve, then horizontally to the tex2html_wrap_inline816 curve, and then alternates between vertical and horizontal movements. The price trajectory is represented more conventionally in Fig. 2b.

Q's policy is identical to that of MY for moderate to high prices, so the price trajectory begins as a price war similar in form to that of the 5MY cell in Fig. 1. However, below a critical price tex2html_wrap_inline824 , Q's policy differs from that of MY. Instead of slightly undercutting MY, Q drops its price aggressively to tex2html_wrap_inline826 . MY responds by raising its price to the monopolist price 0.5, whereupon Q undercuts it, beginning the price war cycle anew. By eliminating the portion of the price-war cycle below price 0.196, Q improves its average profit from 0.0892 to 0.1089, and it also improves MY's average profit to 0.1076.

When two Q-learners are pitted against another, each creates a non-stationary environment for the other. There are no theoretical guarantees of convergence in this case, and in fact we have observed both convergence and non-convergence. Convergence typically occurs when the future discount parameter tex2html_wrap_inline780 is small (i.e., relatively little emphasis is placed on future earnings). In this case, two Q-learners typically converge to a stable, symmetric pair of policies, each of which is quite similar to Q's optimal policy against a single MY. When a moderate to strong emphasis is placed on the value of future rewards, the Q-learners' policies nearly converge to a strongly asymmetric pair of policies in which one of the agents has a clear advantage over the other. The convergence is imperfect, however, and the Q-functions and associated policies continue to make small oscillations around a well-defined but not quite attainable asymmetric solution. This asymmetric quasi-solution has no analog in ordinary single-agent Q-learning. Our early efforts to characterize this solution and the circumstances under which it arises hint of a rich field for investigation by theorists in machine learning and dynamical systems.

   figure166
Figure 3: a) Cross plot of asymmetric pricing policies derived from simultaneous learning by two Q-learners. The future discount parameter tex2html_wrap_inline682 . The thin line shows a typical price trajectory from a given initial condition (represented by dot). b) Associated price-war time series.

The asymmetric solution is illustrated in Fig. 3a, which shows the crossed policies for Agents 1 and 2. Both pricing policies are approximately describable by a small number of simple line segments, except for slight perturbations that arise from the instability of the asymmetric solution. The positions and sizes of the perturbations shift unpredictably as the Q-learners continue to learn. Superimposed on these two crossed policies is a trajectory in price vector space that starts from the initial condition tex2html_wrap_inline832 and proceeds vertically and horizontally as in Fig. 2a. The classic price war pattern is again evident in Fig. 3a and (more conventionally) in Fig. 3b. However, in this case Agent 1 ends the price war quite early by aggressively lowering its price, inducing Agent 2 to set its price to the monopolistic value of 0.5. Since prices never fall below 0.44, the price war is short-lived, and profits remain very high. On average, Agents 1 and 2 receive 0.1254 and 0.1171, respectively. This exceeds what two MY sellers would obtain (0.0892 each), and it even exceeds the profits that would be obtained by two DFs (0.1127 each).

An important challenge is to extend the Q-learning technique so that it can feasibly handle more than one opponent. The natural way to express the Q-function is as a lookup table with one Q value per possible price vector. For a system with s sellers, the table is s-dimensional. Although we have found s=2 to be manageable, the lookup table becomes infeasible in both size and trainability for tex2html_wrap_inline840 . The obvious solution is to replace the lookup table with a function approximator. We have explored the use of neural nets [27] and regression trees as function approximators. In tests on two-seller systems, neural nets appear to have difficulty approximating the lookup table accurately and take a very long time to train, but regression trees appear to offer good approximation and good training times. We are beginning to assess their performance on systems with three sellers.


next up previous
Next: Buyers and their search Up: Sellers and automated pricing Previous: Fast myoptimal

kephart
Mon Mar 20 11:03:38 EST 2000