Next: Conclusions Up: Pseudo-convergent Q-Learning by Competitive Previous: Asymmetric solution

Which solution will occur?

Here we explore the conditions that determine whether the symmetric or asymmetric solution is obtained. We start by adding Gaussian noise, with mean 0 and amplitude tex2html_wrap_inline602, to the Q-functions of the symmetric solution. We then let Q-learning run and observe which of the two solutions it reaches. Fig. 8 shows an example of the initial perturbed policies at tex2html_wrap_inline596 and tex2html_wrap_inline570; this particular initial condition happened to evolve to the symmetric solution.

Figure 8: Example initial perturbed policies obtained by starting from the symmetric solution at tex2html_wrap_inline570 and adding noise with amplitude tex2html_wrap_inline596. In this trial the final response functions are symmetric; most other trials at this noise amplitude reached the asymmetric solution.
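
The perturbation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that each Q-function is stored as a table indexed by state (row) and action (column), that the noise "amplitude" is the standard deviation of the Gaussian, and that the policy is the greedy (largest Q value) action in each state. The names perturb_q and greedy_policy and the toy numbers are hypothetical.

```python
import random

def perturb_q(q_table, amplitude, rng):
    """Return a copy of q_table with zero-mean Gaussian noise added to each entry.

    Assumption: 'amplitude' is used as the standard deviation of the noise.
    """
    return [[q + rng.gauss(0.0, amplitude) for q in row] for row in q_table]

def greedy_policy(q_table):
    """For each state (row), return the index of the action with the largest Q value."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_table]

# Toy 3-state, 3-action Q table (made-up numbers, not taken from the paper).
q_sym = [
    [1.0, 2.0, 0.5],
    [0.2, 0.1, 3.0],
    [4.0, 1.0, 1.5],
]

rng = random.Random(0)
q_small = perturb_q(q_sym, 0.01, rng)   # slight noise
q_large = perturb_q(q_sym, 50.0, rng)   # noise far larger than the Q values

print(greedy_policy(q_sym))    # [1, 2, 0]
print(greedy_policy(q_small))
print(greedy_policy(q_large))
```

In this toy table the gaps between the best and second-best Q values are much larger than 0.01, so the slightly perturbed table yields the same greedy policy, while the heavily perturbed table yields a policy that is largely determined by the noise.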

As the noise amplitude is increased, the initial policies tend to lie further from the symmetric solution, and the Q-functions evolve more often towards the asymmetric solution. For each of 16 noise amplitudes ranging from 0.01 to 50.00, we conducted 100 trials. At tex2html_wrap_inline1042, the noise is so slight that the policies are usually unchanged. At tex2html_wrap_inline1044, the initial response functions are essentially random.
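
The experimental protocol above (a grid of noise amplitudes, with repeated trials at each) can be outlined as a small harness. This is only a sketch under stated assumptions: the text does not say how the 16 amplitudes are spaced, so a logarithmic grid is assumed here, and toy_trial is a hypothetical stand-in for a full two-player Q-learning run that reports whether the symmetric solution was reached. It does not reproduce the paper's dynamics or results.

```python
import random

def log_spaced(lo, hi, n):
    """Return n values from lo to hi, evenly spaced on a log scale (an assumed grid)."""
    ratio = (hi / lo) ** (1.0 / (n - 1))
    return [lo * ratio ** i for i in range(n)]

def fraction_symmetric(run_trial, amplitudes, n_trials, rng):
    """For each amplitude, run n_trials trials and record the fraction that
    ended in the symmetric solution (run_trial returns True in that case)."""
    results = {}
    for amp in amplitudes:
        hits = sum(1 for _ in range(n_trials) if run_trial(amp, rng))
        results[amp] = hits / n_trials
    return results

def toy_trial(amplitude, rng):
    """Hypothetical placeholder for perturbing the Q-functions and running
    Q-learning to its final state. NOT the paper's dynamics."""
    return rng.random() < 1.0 / (1.0 + amplitude)

amps = log_spaced(0.01, 50.0, 16)
table = fraction_symmetric(toy_trial, amps, 100, random.Random(1))
for amp in amps:
    print(f"{amp:8.3f}  {table[amp]:.2f}")
```

Replacing toy_trial with a function that actually perturbs the symmetric Q-functions and runs Q-learning would give the kind of amplitude-versus-outcome table that such an experiment produces.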

Fig. 9(a) shows the percentage of trials that yielded the symmetric solution as a function of the noise amplitude for tex2html_wrap_inline570 and tex2html_wrap_inline600. The same data can be viewed in a different way, by plotting the probability of obtaining the symmetric solution as a function of the total Manhattan distance of the two noisy initial response curves from the ideal noise-free symmetric solution. This is shown in Fig. 9(b). (The probability is obtained by averaging over 100 trials centered around each distance.)
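
The distance-based view can be illustrated with a short sketch. It assumes that each response curve is a list of numeric responses, one per opponent price, that the "total" distance is the sum of the two players' Manhattan distances, and that "averaging over 100 trials centered around each distance" means a sliding window over trials sorted by distance. These readings, the function names, and the sample data are assumptions, not details confirmed by the text.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length response curves."""
    return sum(abs(x - y) for x, y in zip(a, b))

def total_distance(curves, reference_curves):
    """Sum of the Manhattan distances of both players' curves from the reference."""
    return sum(manhattan(c, r) for c, r in zip(curves, reference_curves))

def windowed_probability(trials, window):
    """trials: list of (distance, reached_symmetric) pairs.

    Sorts the trials by distance and slides a window of `window` consecutive
    trials across them, returning (mean distance, fraction symmetric) per window.
    """
    ordered = sorted(trials)
    points = []
    for start in range(len(ordered) - window + 1):
        chunk = ordered[start:start + window]
        mean_d = sum(d for d, _ in chunk) / window
        frac = sum(1 for _, sym in chunk if sym) / window
        points.append((mean_d, frac))
    return points

# Made-up example: two players, two-point response curves.
noisy = ([1, 2], [3, 4])
ideal = ([1, 1], [1, 4])
print(total_distance(noisy, ideal))

sample = [(1, True), (2, True), (3, False), (4, False)]
print(windowed_probability(sample, 2))
```

With a window of 100, each plotted point would summarize 100 trials whose initial distances are close to one another.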

In addition to the randomness of the starting state, the random exploration dynamics of Q-learning also influence the final state. In experiments that started many trials from the same random starting state, we found that some trials converged to the symmetric solution while others went to the asymmetric solution.

Fig. 9 supports a conceptual interpretation of two-player Q-learning dynamics in terms of a basin of attraction around the symmetric solution, delineated by a distance parameter in either policy space or Q-function space. In both spaces, there is a small region around the symmetric solution such that virtually any starting state within it converges to the symmetric solution. For low tex2html_wrap_inline576, the symmetric solution can be reached even when the starting state is well outside this region, with roughly a 30% chance when tex2html_wrap_inline570. For moderate to high tex2html_wrap_inline576, the symmetric solution cannot be reached if the starting state lies beyond this region. This explains why we observed only the asymmetric solution for moderate to high tex2html_wrap_inline576 in our earlier studies [7].

Figure 9: Probability of obtaining the symmetric solution, starting from randomly perturbed initial Q-functions, at tex2html_wrap_inline570 and tex2html_wrap_inline600. (a) As a function of noise amplitude tex2html_wrap_inline602. (b) As a function of total Manhattan distance of initial policies from the ideal symmetric solution.



kephart
Tue Mar 21 00:33:02 EST 2000