next up previous
Next: Multi-agent Q-learning Up: Pricing in agent economies Previous: Model agent economies

Single-agent Q-learning

Let us first consider ordinary single-agent Q-learning in the above two-seller economic models: one seller uses Q-learning to learn a Q-function and corresponding policy, while the other seller maintains a fixed pricing policy.

The procedure for Q-learning is as follows. Let Q(s,a) represent the discounted long-term expected reward to an agent for taking action a in state s. The discounting of future rewards is accomplished by a discount parameter tex2html_wrap_inline443 such that the value of a reward expected at n time steps in the future is discounted by tex2html_wrap_inline563 . Assume that the Q(s,a) function is represented by a lookup table containing a value for every possible state-action pair, and assume that the table entries are initialized to arbitrary values. Then the procedure for solving for Q(s,a) is to infinitely repeat the following two-step loop:

1. Select a particular state s and a particular action a, observe the immediate reward r for this state-action pair, and observe the resulting state s'.

2. Adjust Q(s,a) according to the following equation:

  equation68

where tex2html_wrap_inline579 is the learning rate parameter, and the tex2html_wrap_inline581 operation represents choosing the optimal action b among all possible actions that can be taken in the successor state s' leading to the greatest Q-value. A wide variety of methods may be used to select state-action pairs in step 1, provided that every state-action pair is visited infinitely often. For any stationary Markov Decision Problem, the Q-learning procedure is guaranteed to converge to the correct values, provided that the learning rate parameter is decreased over time with an appropriate schedule.

In the simulations described below the fixed-policy seller uses the myoptimal policy tex2html_wrap_inline589 represented for example in the Price-Quality model by equations 3 and 4.

In this model, the distinction between states and actions is somewhat blurred. It will be assumed that the ``state'' for each seller is sufficiently described by the other seller's last price, and that the ``action'' is the current price decision. This should be a sufficient state description because no other history is needed either for the determination of immediate reward, or for the calculation of the myoptimal price by the fixed-strategy player. The definitions of immediate reward r and next-state s' have also been modified for the two-agent case as follows: let s' be the state that is obtained, starting from s, of one action by the Q-learner and a response action by the fixed-strategy opponent. Likewise, the immediate reward is defined as the sum of the two rewards obtained after those two actions. These modifications were introduced so that the state s' would have the same player to move as state s. (A possible alternative to this, which has not been investigated, is to include the side-to-move as additional information in the state-space description.)

An important issue in Q-learning is ``exploration'' of the state space. Each state-action pair must be visited sufficiently often for learning to converge. This often necessitates some sort of randomized off-policy trajectory generation, for example, by Boltzmann exploration. In the simulations reported below, the sequence of state-action pairs selected for the Q-table updates were generated by uniform random selection from amongst all possible table entries. This does not correspond to actual on-line training in a real environment, but is appropriate for situations where one has an accurate simulation of the environment, and can envision attempting to solve for optimal strategies for both players via off-line training in the laboratory.

The initial values of the Q-tables were generally set to the immediate reward values. (Consequently the initial Q-derived policies corresponded to myoptimal policies.) The learning rate was varied with time according to:

  equation73

where the initial learning rate tex2html_wrap_inline603 was usually set to 0.1, and the constant tex2html_wrap_inline605 when the simulation time t was measured in units of tex2html_wrap_inline609 , the size of the Q-table. (N is the number of possible prices that could be selected by either player.) A number of different values of the discount parameter tex2html_wrap_inline443 were studied, ranging from tex2html_wrap_inline449 to tex2html_wrap_inline617 .

Results for single-agent Q-learning in all three models indicated that Q-learning worked well (as expected) in each case. In each model, for each value of the discount parameter, exact convergence of the Q-table to a stationary optimal solution was found. The convergence times ranged from a few hundred sweeps through each table element, for smaller values of tex2html_wrap_inline443 , to at most a few thousand updates for the largest values of tex2html_wrap_inline443 . In addition, once Q-learning converged, the expected cumulative profit or utility of the policy derived from the Q-function was then measured, by running the Q-policy against the other player's myopic policy from 100 random starting states, each for 200 time steps, and averaging the resulting cumulative utility for each player. In each case, it was found that the seller achieved greater profit against a myopic opponent by using a Q-derived policy than by using a myopic policy. (This was true even for tex2html_wrap_inline449 , because, due to the redefinition of Q updates summing over two time steps, the case tex2html_wrap_inline449 effectively corresponds to a two-step optimization, rather than the one-step optimization of the myopic policies.) Furthermore, the cumulative utility obtained with the Q-derived policy monotonically increased with the increasing tex2html_wrap_inline443 (as expected).

It was also interesting to note that in many cases, the expected utility of the myopic opponent also increased when playing against the Q-learner, and also improved monotonically with increasing tex2html_wrap_inline443 . The explanation is that, rather than better exploiting the myopic opponent, as would be expected in a zero-sum game, the Q-learner instead reduced the region over which it would participate in a mutually undercutting price war. Typically one finds in these models that with myopic vs. myopic play, large-amplitude price wars are generated that start at very high prices and persist all the way down to very low prices. When a Q-learner competes against a myopic opponent, there are still price wars starting at high prices, however, the Q-learner abandons the price war more quickly as the prices decrease. The effect is that the price-war regime is smaller and confined to higher average prices, leading to a closer approximation to cooperative or collusive behavior, with greater expected utilites for both players. It is interesting that this can come about even though both players are entirely selfish.

An illustrative example of the results of single-agent Q-learning is shown below in figures 2 and 3. Figure 2 plots the average utility for both sellers in the Shopbot model, when one of the sellers is myopic and the other is a Q-learner. (As the model is symmetric, it doesn't matter which seller is the Q-learner.)

   figure79
Figure 2: Plot of average utility per time step for seller 1 and seller 2, as a function of discount parameter tex2html_wrap_inline443 , in the Shopbot model when one seller (seller 2, open circles) uses a myopic policy, and the other seller (seller 1, filled circles) uses Q-learning to learn an optimal policy against the myopic player. The dashed line indicates baseline expected utility when both sellers are myopic.

   figure88
Figure 3: Cross-plot of myopic price curve vs. Q-derived price curve in the Shopbot model at tex2html_wrap_inline445 . Seller 1 is the Q-learner and seller 2 is myopic. The dashed line and arrows indicate how the prices will evolve over time using these pricing policies, starting from a particular price pair, indicated by the filled circle.

Figure 3 plots the myopic price curve of seller 2 against the Q-derived price curve (at tex2html_wrap_inline445 ) of seller 1. We can see that both curves have a maximum price of 1 and a minimum price of approximately 0.58. The portion of both curves lying along the diagonal indicates undercutting behavior, in which case the seller will respond to the opponent's price by undercutting by tex2html_wrap_inline535 , the price discretization interval.

The system dynamics for the state tex2html_wrap_inline465 in figure 3 can be obtained by alternately applying the two pricing policies. This can be done by a simple iterative graphical construction, in which for any given starting point, one first holds tex2html_wrap_inline441 constant and moves horizontally to the tex2html_wrap_inline643 curve, and then one holds tex2html_wrap_inline439 constant and moves vertically to the tex2html_wrap_inline647 curve. We see in this figure that the iterative graphical construction leads to an unending cyclic price war, whose trajectory is indicated by the dashed line. Note that the price-war behavior begins at the price pair (1, 1), and persists until a price of approximately 0.83. At this point, seller 1 abandons the price war, and resets its price to 1, leading once again to another round of undercutting.

The amplitude of this price war is diminished compared to the situation in which both players use a myopic policy. In that case, seller 1's curve would be a mirror image of seller 2's curve, and the price war would persist all the way to the minimum price point, leading to a lower expected utility for both sellers.


next up previous
Next: Multi-agent Q-learning Up: Pricing in agent economies Previous: Model agent economies

kephart
Wed Sep 29 11:51:48 EDT 1999