Some preliminary results of combining Q-learning with neural networks were reported in (Tesauro, 1999). The neural nets typically reached peak profitability within a few hundred sweeps through the training cases (corresponding to a few hours of CPU time). At this point the policies were reasonably good and qualitatively similar to the lookup table policies, but the Q-function was poorly approximated, as indicated by a large Bellman error. With much further training (out to several days of CPU time), the Bellman error improved significantly, but policy profitability did not improve. Further improvements in profitability might be found with enough additional training, but the required training times appeared to be prohibitively long.
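To illustrate how a Q-function can have a large Bellman error and still yield a sensible policy, the following is a minimal sketch on a hypothetical one-state toy problem. It is not the pricing model or the network used in the experiments. The function names, the dictionary standing in for a function approximator, the discount factor, and the root-mean-square form of the error are all assumptions made for illustration; the exact error measure used in the experiments may differ.

```python
import math


def bellman_rms_error(q_values, transitions, gamma):
    """Root-mean-square Bellman residual over sampled transitions.

    q_values: dict mapping state -> list of action values
              (a stand-in for any function approximator).
    transitions: list of (state, action, reward, next_state) tuples.
    gamma: discount factor.
    """
    squared = []
    for state, action, reward, next_state in transitions:
        # One-step Q-learning target: r + gamma * max_a' Q(s', a')
        target = reward + gamma * max(q_values[next_state])
        residual = target - q_values[state][action]
        squared.append(residual ** 2)
    return math.sqrt(sum(squared) / len(squared))


def greedy_action(q_values, state):
    """Index of the highest-valued action in the given state."""
    values = q_values[state]
    return values.index(max(values))


# Hypothetical toy problem: a single state "s" that loops back to itself.
# Action 0 pays reward 1, action 1 pays reward 0.
GAMMA = 0.5
TRANSITIONS = [("s", 0, 1.0, "s"), ("s", 1, 0.0, "s")]

# Exact solution of the Bellman equation for this toy problem.
Q_EXACT = {"s": [2.0, 1.0]}
# A poor approximation that still ranks the two actions correctly.
Q_ROUGH = {"s": [5.0, 4.0]}

if __name__ == "__main__":
    for name, q in (("exact", Q_EXACT), ("rough", Q_ROUGH)):
        err = bellman_rms_error(q, TRANSITIONS, GAMMA)
        print(name, "error:", err, "greedy action:", greedy_action(q, "s"))
```

In this sketch both Q-functions select the same greedy action, although only the exact one has zero Bellman error. This mirrors the observation above: reducing the Bellman error need not change the policy, and hence need not change its profitability.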