next up previous
Next: Multi-agent Q-learning Up: Pricing in agent economies Previous: Model agent economies

Single-agent Q-learning

We first consider ordinary single-agent Q-learning in the above two-seller economic models. The procedure for Q-learning is as follows. Let Q(s,a) represent the discounted long-term expected reward to an agent for taking action a in state s. The discounting of future rewards is accomplished by a discount parameter tex2html_wrap_inline324 such that the value of a reward expected at n time steps in the future is discounted by tex2html_wrap_inline444 . Assume that the Q(s,a) function is represented by a lookup table containing a value for every possible state-action pair, and assume that the table entries are initialized to arbitrary values. Then the procedure for solving for Q(s,a) is to infinitely repeat the following two-step loop:

1. Select a particular state s and a particular action a, observe the immediate reward r for this state-action pair, and observe the resulting state s'.

2. Adjust Q(s,a) according to the following equation:

  equation64

where tex2html_wrap_inline460 is the learning rate parameter, and the tex2html_wrap_inline462 operation represents choosing the optimal action b among all possible actions that can be taken in the successor state s' leading to the greatest Q-value. A wide variety of methods may be used to select state-action pairs in step 1, provided that every state-action pair is visited infinitely often. For any stationary Markov Decision Problem, the Q-learning procedure is guaranteed to converge to the correct values, provided that tex2html_wrap_inline460 is decreased over time with an appropriate schedule.

We first consider using Q-learning for one of the two sellers in our economic models, while the other seller maintains a fixed pricing policy. In the simulations described below the fixed policy is in fact the myoptimal policy tex2html_wrap_inline472 represented for example in the Price-Quality model by equations 3 and 4.

In our pricing application, the distinction between states and actions is somewhat blurred. We will assume that the ``state'' for each seller is sufficiently described by the other seller's last price, and that the ``action'' is the current price decision. This should be a sufficient state description because no other history is needed either for the determination of immediate reward, or for the calculation of the myoptimal price by the fixed-strategy player. We have also modified the concepts of immediate reward r and next-state s' for the two-agent case. We define s' as the state that is obtained, starting from s, of one action by the Q-learner and a response action by the fixed-strategy opponent. Likewise, the immediate reward is defined as the sum of the two rewards obtained after those two actions. These modifications were introduced so that the state s' would have the same player to move as state s. (A possible alternative to this, which we have not investigated, is to include the side-to-move as additional information in the state-space description.)

In the simulations reported below, the sequence of state-action pairs selected for the Q-table updates were generated by uniform random selection from amongst all possible table entries. The initial values of the Q-tables were generally set to the immediate reward values. (Consequently the initial Q-derived policies corresponded to myoptimal policies.) The learning rate was varied with time according to:

  equation69

where the initial learning rate tex2html_wrap_inline486 was usually set to 0.1, and the constant tex2html_wrap_inline488 when the simulation time t was measured in units of tex2html_wrap_inline492 , the size of the Q-table. (N is the number of possible prices that could be selected by either player.) A number of different values of the discount parameter tex2html_wrap_inline324 were studied, ranging from tex2html_wrap_inline334 to tex2html_wrap_inline336 .

Results for single-agent Q-learning in all three models indicated that Q-learning worked well (as expected) in each case. In each model, for each value of the discount parameter, exact convergence of the Q-table to a stationary optimal solution was found. The convergence times ranged from a few hundred sweeps through each table element, for smaller values of tex2html_wrap_inline324 , to at most a few thousand updates for the largest values of tex2html_wrap_inline324 . In addition, once Q-learning converged, we then measured the expected cumulative profit of the policy derived from the Q-function. We ran the Q-policy against the other player's myopic policy from 100 random starting states, each for 200 time steps, and averaged the resulting cumulative profit for each player. We found that, in each case, the seller achieved greater profit against a myopic opponent by using a Q-derived policy than by using a myopic policy. (This was true even for tex2html_wrap_inline334 , because, due to the redefinition of Q updates summing over two time steps, the case tex2html_wrap_inline334 effectively corresponds to a two-step optimization, rather than the one-step optimization of the myopic policies.) Furthermore, the cumulative profit obtained with the Q-derived policy monotonically increased with the increasing tex2html_wrap_inline324 (as expected).

 

  figure77


Figure 2: Results of single-agent Q-learning in the Shopbot model. (a) Average profit per time step for Q-learner (seller 1, filled circles) and myopic seller (seller 2, open circles) vs. discount parameter tex2html_wrap_inline324 . Dashed line indicates baseline expected profit when both sellers are myopic. (b) Cross-plot of Q-derived price curve (seller 1) vs. myopic price curve (seller 2) at tex2html_wrap_inline326 . Dashed line and arrows indicate a temporal price-pair trajectory using these policies, starting from filled circle.

It was also interesting to note that in many cases, the expected profit of the myopic opponent also increased when playing against the Q-learner, and also improved monotonically with increasing tex2html_wrap_inline324 . The explanation is that, rather than better exploiting the myopic opponent, as would be expected in a zero-sum game, the Q-learner instead reduced the region over which it would participate in a mutually undercutting price war. Typically we find in these models that with myopic vs. myopic play, large-amplitude price wars are generated that start at very high prices and persist all the way down to very low prices. When a Q-learner competes against a myopic opponent, there are still price wars starting at high prices, however, the Q-learner abandons the price war more quickly as the prices decrease. The effect is that the price-war regime is smaller and confined to higher average prices, leading to a closer approximation to cooperative or collusive behavior, with greater expected utilites for both players.

An illustrative example of the results of single-agent Q-learning is shown in figure 2. Figure 2(a) plots the average profit for both sellers in the Shopbot model, when one of the sellers is myopic and the other is a Q-learner. (As the model is symmetric, it doesn't matter which seller is the Q-learner.) Figure 2(b) plots the myopic price curve of seller 2 against the Q-derived price curve (at tex2html_wrap_inline326 ) of seller 1. We can see that both curves have a maximum price of 1 and a minimum price of approximately 0.58. The portion of both curves lying along the diagonal indicates undercutting behavior, in which case the seller will respond to the opponent's price by undercutting by tex2html_wrap_inline416 , the price discretization interval.

The system dynamics for the state tex2html_wrap_inline346 in figure 2(b) can be obtained by alternately applying the two pricing policies. This can be done by a simple iterative graphical construction, in which for any given starting point, one first holds tex2html_wrap_inline322 constant and moves horizontally to the tex2html_wrap_inline526 curve, and then one holds tex2html_wrap_inline320 constant and moves vertically to the tex2html_wrap_inline530 curve. We see in this figure that the iterative graphical construction leads to an unending cyclic price war, whose trajectory is indicated by the dashed line. Note that the price-war behavior begins at the price pair (1, 1), and persists until a price of approximately 0.83. At this point, seller 1 abandons the price war, and resets its price to 1, leading once again to another round of undercutting.

The amplitude of this price war is diminished compared to the situation in which both players use a myopic policy. In that case, seller 1's curve would be a mirror image of seller 2's curve, and the price war would persist all the way to the minimum price point, leading to a lower expected profit for both sellers.


next up previous
Next: Multi-agent Q-learning Up: Pricing in agent economies Previous: Model agent economies

kephart
Wed Sep 29 11:53:06 EDT 1999