next up previous
Next: Results of lookup table Up: Single and Multi-agent Q-learning Previous: Single and Multi-agent Q-learning

Learning algorithm

The standard procedure for Q-learning is as follows. Let Q(s,a) represent the discounted long-term expected reward (with discount parameter tex2html_wrap_inline235 ) to an agent for taking action a in state s. (The value of a reward expected at n time steps in the future is discounted by tex2html_wrap_inline283 .) Assume that Q(s,a) is represented by a lookup table containing a value for every possible state-action pair, and that the table entries are initialized to arbitrary values. Then the procedure for solving for Q(s,a) is to infinitely repeat the following two-step loop:

1. Select a particular state s and a particular action a, observe the immediate reward r for this state-action pair, and the resulting state s'.

2. Adjust Q(s,a) according to the following equation:

  equation22

where tex2html_wrap_inline299 is the learning rate parameter. A variety of methods may be used to select state-action pairs in step 1, provided that every state-action pair is visited sufficiently often. When tex2html_wrap_inline299 is decreased over time with an appropriate schedule, the above procedure is guaranteed to converge to the correct values for stationary MDPs.

In our economic models, the distinction between states and actions is somewhat blurred. It will be assumed that the ``state'' for each seller is sufficiently described by the other seller's last price, and that the ``action'' is the current price decision. This should be a sufficient state description because no other history is needed either for the determination of immediate reward, or for the calculation of the myoptimal price by the fixed-strategy player. We have also redefined r and s' for the two-agent case as follows: let s' be the state that is obtained, starting from s, of one action by the Q-learner and a response by the opponent. Likewise, r is defined as the sum of the two rewards obtained after those two actions. These modifications were introduced so that the state s' would have the same player to move as state s. (A possible alternative to this, which has not been investigated, is to include the side-to-move as additional information in the state-space description.)


next up previous
Next: Results of lookup table Up: Single and Multi-agent Q-learning Previous: Single and Multi-agent Q-learning

kephart
Tue Mar 21 00:52:15 EST 2000