[comp.ai.neural-nets] A neural net who plays chess

markh@csd4.csd.uwm.edu (Mark William Hopkins) (10/04/90)

   I've been sitting on this extremely simple but potentially very powerful
design which would enable a neural net to learn to play chess effectively,
starting even from scratch.  Haven't gotten around to coding it yet, or
testing it, but if you'd like to try it yourself, I've posted a specification
below.

(1) ARCHITECTURE
   The design is actually of a hybrid architecture: embodying both your typical
game-playing/search AI program and a typical neural net.  Simply put, the
search program (the "left hemisphere") acts as a 'predictor'.  It searches
out the best configurations N moves ahead of the current configuration.  The
neural net (the "right hemisphere") acts as an 'evaluator'.  Its purpose is
to fill in the role formerly held by the hard-coded heuristics you would
otherwise find in a chess-playing program.  These two components
reinforce each other in such a way that the evaluator gradually learns to
"squash" more and more of the N-move search tree into a static global
evaluation function.  This, of course, serves to increase the effectiveness of
the predictor.

   (a) Evaluator
   The input to the evaluator is a board configuration which contains
information to allow you to completely determine the position of any piece
on the board (or whether it has been captured or not).  It can be as simple
as a matrix, or as complex as a net with 'diagonal detectors', 'knight-L
detectors', etc. built directly into the input representation.  Alternatively,
it might just be a list of board locations indexed by piece.  The output is a
number, say, between -1 and 1, indicating the value of the board.  A 1
indicates that the board is a winning board from the program's point of view,
and -1 ... a losing configuration.
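
   Just to make the input side concrete, here is one possible encoding (a
rough sketch; the piece codes and scaling are illustrative, nothing more):

# One possible input encoding (illustrative): one slot per square, signed
# piece codes from the program's point of view, scaled into [-1, 1].
PIECE_CODE = {'P': 1, 'N': 2, 'B': 3, 'R': 4, 'Q': 5, 'K': 6}

def encode_board(board):
    """board: 8x8 nested list; uppercase letters are the program's pieces,
    lowercase the opponent's, None is an empty square."""
    vec = []
    for row in board:
        for sq in row:
            if sq is None:
                vec.append(0.0)
            elif sq.isupper():
                vec.append(PIECE_CODE[sq] / 6.0)            # own piece
            else:
                vec.append(-PIECE_CODE[sq.upper()] / 6.0)    # opponent's piece
    return vec                                               # 64 inputs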

   (b) Predictor
   The predictor is a program to implement search, such as minimax with a
cutoff.  At the cutoff, the evaluator is used to estimate the configuration in
question.  This allows the predictor to return a value in a short, finite
time.
   It also contains the body of rules that determine what comprises a valid
move, and a winning, losing, or stalemate configuration.
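
   A rough sketch of the predictor in Python (the helpers legal_moves(),
apply_move(), is_terminal() and terminal_value() stand for the rule component
described above; evaluator(pos) is the net's output in [-1, 1] from the
program's point of view):

# Sketch of the predictor: minimax with a depth cutoff, calling the
# evaluator at the cutoff.
def search(pos, depth, program_to_move, evaluator):
    if is_terminal(pos):
        return terminal_value(pos)          # +1 win, -1 loss, 0 draw
    if depth == 0:
        return evaluator(pos)               # static estimate at the cutoff
    child_values = [search(apply_move(pos, m), depth - 1,
                           not program_to_move, evaluator)
                    for m in legal_moves(pos)]
    return max(child_values) if program_to_move else min(child_values)

def best_move(pos, depth, evaluator):
    """Return (best move, predicted value of the current position)."""
    scored = [(search(apply_move(pos, m), depth - 1, False, evaluator), m)
              for m in legal_moves(pos)]
    value, move = max(scored, key=lambda vm: vm[0])
    return move, value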

(2) INITIALIZATION
   The evaluator MAY be initialized by training it using an already available
evaluation function as a supervisor to incorporate already developed expertise
(why re-invent the wheel?), or can simply be set to a random configuration to
simulate the process of learning the game from scratch.

(3) THE PLAY CYCLE
   (a) Move generation
   As mentioned above, to generate a move, the predictor performs a search
and uses the evaluator at the cut-off point.  It comes back with a best
possible move AND an estimate as to the value of the current configuration.

   (b) Learning
   Upon completion of a move, the evaluator is updated with the training pair
(value, configuration).

   After the opponent makes a move, the play cycle is then resumed at (a).
Play continues, of course, until the program or its opponent wins (or, using
standard rules, until a draw or stalemate happens).
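
   Putting (a) and (b) together, the cycle might look something like the
following sketch (helper names are hypothetical; best_move() is the predictor
sketched above, and train() is one supervised update of the evaluator toward
the search value):

# Sketch of the play cycle: (a) search for a move, (b) train the evaluator
# on the (value, configuration) pair, then wait for the opponent.
def play_one_game(evaluator, depth, opponent_move, initial_position):
    pos = initial_position
    while not is_terminal(pos):
        move, value = best_move(pos, depth, evaluator)   # (a) move generation
        train(evaluator, pos, value)                     # (b) learning
        pos = apply_move(pos, move)
        if is_terminal(pos):
            break
        pos = apply_move(pos, opponent_move(pos))        # opponent replies
    return pos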

(4) EXPECTED PERFORMANCE
   Provided the network was designed with care, I expect to see impressive
gains in playing ability over time.

   (a) Static evaluation and search-tree squashing.
      As time goes on, because of the feedback the predictor is providing to
      the evaluator, the evaluator will actually learn to make static
      evaluations of the current board, effectively squashing the N-move
      search tree into a single global evaluation.  This kind of static
      evaluation in lieu of searching is precisely what good chess players
      claim to be able to do.

   (b) Spontaneous emergence of openings.
      After getting its butt severely kicked the first few hundred thousand
      games (again, we're talking about the case where the evaluator starts
      from zero-knowledge), natural pathways will develop that correspond to
      the openings typically taught to a novice.

   (c) Spontaneous emergence of endgames.
      If the input network architecture has been designed properly, the
      evaluator may be able to make generalizations (say, of
      translation-invariance) to embody knowledge of the more-or-less
      standard endgame configurations.

(5) EMBEDDING THE PREDICTOR
   An interesting extension to the design specified above would be to embed
the search control structure of the predictor itself in a neural net.  If this
is done, then the hybrid "crutch" (that's all it really is: a crutch) can
finally be discarded.

dave@cogsci.indiana.edu (David Chalmers) (10/06/90)

In article <6758@uwm.edu> markh@csd4.csd.uwm.edu (Mark William Hopkins) writes:

>   I've been sitting on this extremely simple but potentially very powerful
>design which would enable a neural net to learn to play chess effectively,
>starting even from scratch.  Haven't gotten around to coding it yet, or
>testing it, but if you'd like to try it yourself, I've posted a specification
>below.
>
>the search program (the "left hemisphere") acts as a 'predictor'.  It searches
>out the best configurations N moves ahead of the current configuration.  The
>neural net (the "right hemisphere") acts as an 'evaluator'.  It's purpose is
>to fill in the role formerly held by the hard-coded heuristics you usually
>would have otherwise found in a chess-playing program.  These two components
>reenforce each other in such a way that the evaluator gradually learns to
>"squash" more and more of the N-move search tree into a static global
>evaluation function.  This, of course, serves to increase the effectiveness of
>the predictor.

There is the seed of a very nice idea here: the best evaluation
function in chess is a fixed point of the "minimax search" operator in
function space.  (Spelled out, that means that for an optimal evaluation
function E from positions to values, evaluating the position via minimax
search, applying E at the end of the search tree, should provide exactly the
same result as applying E directly to the position.)  Combined with certain
constraints on E, e.g. E(checkmate position) = 1, E(checkmated position) = -1,
E(2 kings) = 0, this characterization should provide a very tightly
constrained evaluation function that would play a mean game of chess.  So,
is there any way to take advantage of this by using methods of fixed-point
analysis?  I confess to knowing almost nothing about computer chess, so I
don't know if the idea is commonplace there.
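
Spelled out one more notch in my own notation (one-step form, for any
non-terminal position p, with M(p) the legal moves and p.m the successor):

% One-step form of the fixed-point property
E(p) =
  \begin{cases}
    \max_{m \in M(p)} E(p \cdot m) & \text{if the program is to move at } p,\\
    \min_{m \in M(p)} E(p \cdot m) & \text{if the opponent is to move at } p,
  \end{cases}
% subject to E(\text{checkmate delivered}) = 1,\ E(\text{checkmated}) = -1,
% and E(\text{two bare kings}) = 0.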

Your suggestion of bootstrapping the evaluator with the predictor is a nice
way of taking advantage of this property.  There are a couple of problems,
though...

>   (b) Learning
>   Upon completion of a move, the evaluator is updated with the training pair
>(value, configuration).

>(4) EXPECTED PERFORMANCE
>   Provided the network was designed with care, I expect to see impressive
>gains in playing ability over time.
>   (a) Static evaluation and search-tree squashing.
>   (b) Spontaneous emergence of openings.
>   (c) Spontaneous emergence of endgames.

You may be overestimating the capabilities of networks just a tad.  Network
training isn't magic... networks are ultimately pretty simple
structures, not great for computing incredibly complex mappings like the ones
required here (yes, I know that you can get universal approximation, but only
at the cost of generalization).  After all, if you expect the above to work,
why not just try: train the network directly on (position, move) pairs
derived from the games of Kasparov and other grandmasters.  We'll get a
grandmaster network, easy!  And once we've done that, why not try: train a
network to simulate the total performance of a person in all domains by
training it from data provided over the lifetime of a person: (sensory
input at time t, motor output at time t).  Human intelligence, just like
that!  Of course, you'll need to use a recurrent network, but that's no
problem, we can just train it up with back-propagation through time or the
Williams/Zipser algorithm.  I think you see the problem by now...

>(2) INITIALIZATION
>   The evaluator MAY be initialized by training it using an already available
>evaluation function as a supervisor to incorporate already developed expertise
>(why re-invent the wheel?), or can simply be set to a random configuration to
>simulate the process of learning the game from scratch.

One thing you don't want to do is to start completely from scratch.  Even
bootstrapping needs to start from somewhere.  All this would guarantee is
that you would arrive at *some* fixed point of the minimax operator -- but
there are a lot of these -- the constant zero function for instance.  You
have to make your initial evaluator minimally competent (incorporating at
minimum the constraints mentioned in the first paragraph above, and preferably
more) if you want bootstrapping to get anywhere.

Actually, as things stand there's no need for the network to play actual games
at all.  You just generate positions randomly, calculate the minimax+evaluator
prediction, and train the evaluator on that.  (Only one step of minimax search
is required, nothing deep.)  In fact, your proposal doesn't incorporate any
feedback from wins and losses, so playing games doesn't serve any purpose
at all (except perhaps to confine the training of evaluation to a "reasonable"
portion of position space).  And by turning it into a fully-supervised learning
problem, you've made it much easier than it would be with the kind of
reinforcement learning that would be required from playing actual games, with
all the accompanying credit-assignment problems.
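
Concretely, that fully-supervised variant could be as simple as the following
sketch (all helper names here are hypothetical):

# Sketch of the fully-supervised variant: random position, one-step
# minimax over the evaluator, train on the backed-up value.
def supervised_step(evaluator):
    pos, program_to_move = random_position()
    if is_terminal(pos):
        train(evaluator, pos, terminal_value(pos))
        return
    children = [apply_move(pos, m) for m in legal_moves(pos)]
    values = [terminal_value(c) if is_terminal(c) else evaluator(c)
              for c in children]
    # one step of minimax: max if the program moves at pos, min otherwise
    target = max(values) if program_to_move else min(values)
    train(evaluator, pos, target)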

Nice idea, and I'm not sure that it couldn't be taken advantage of
in various interesting ways.  Nothing about this need be specific to neural
networks, incidentally -- any kind of learning system would do.  Just one
that has an architecture that guarantees both complete learning and
perfect generalization for the problem at hand.  Which sadly doesn't exist in
1990, but hey, that's only a minor point...

--
Dave Chalmers     (dave@cogsci.indiana.edu)      
Concepts and Cognition, Indiana University.

"It is not the least charm of a theory that it is refutable."

markh@csd4.csd.uwm.edu (Mark William Hopkins) (10/09/90)

In article <62519@iuvax.cs.indiana.edu> dave@cogsci.indiana.edu (David Chalmers) writes:
>...  In fact, your proposal doesn't incorporate any
>feedback from wins and losses, so playing games doesn't serve any purpose
>at all (except perhaps to confine the training of evaluation to a "reasonable"
>portion of position space)...

It wasn't made explicit, but the search program (where the rules and winning,
losing and draw conditions are stored) provides the feedback to the evaluator.

mahler@latcs1.oz.au (Daniel Mahler) (10/09/90)

I did my honours thesis on using NNs to learn to play games.
I used a program based on the principle you describe
to learn noughts and crosses.
There are 2 points worth making about this approach.
1) It is essentially an extension of the learning technique
in Samuel's checkers program.
Samuel used an evaluation function that was a linear combination
of given feature functions.
These weights were optimised by getting the minimax value for a position
using the current weights, and using it as a training signal.
Samuel worked before the perceptron era, so he did not call it
a perceptron.
His technique worked because the feature selectors were designed by
a human expert.
I tried the natural extension: multilevel perceptron with bp,
working from a simple representation of the board and trying to
see if the net can develop tactical and strategic concepts.
I started with random weights, because my interests were
mainly theoretical (that's what I always say when something
doesn't work 8>) )
Part of the reason for my technique's failure is the second point.
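
(For reference, the Samuel-style update amounts to roughly the following
sketch; features() and minimax_value() are hypothetical helpers, and the idea
is a linear evaluation nudged toward the backed-up search value.)

# Sketch of a Samuel-style update: a linear evaluation over expert-designed
# features, with the weights nudged toward the minimax value computed using
# the current weights.
def samuel_update(weights, pos, lr=0.01):
    f = features(pos)                                   # expert feature values
    current = sum(w * x for w, x in zip(weights, f))    # linear evaluation
    target = minimax_value(pos, weights)                # backed-up search value
    error = target - current
    return [w + lr * error * x for w, x in zip(weights, f)]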

2) The minimax operator does not have a unique fixed point
(I like this terminology, shame it's too late for my thesis).
In fact it has infinitely many, as any constant value function
is a fixed point.
This means, in NN terms, lots and lots of local minima.
Supplying known strategic features as inputs simplifies the energy space.
This kind of learning is probably affected by game tree pathology,
especially in the early stages.

I think the solution to this problem is to incorporate genetic
techniques with large populations and/or use reinforcement learning
(Barto & Sutton's Arp element).
Samuel (and I) used a sort of genetic technique with population
size 2.
There are 2 nets that play each other called champion and challenger,
or master and apprentice. The weights in champion are fixed.
Challenger learns by the method under discussion.
When challenger outperforms champion by some criterion,
it becomes champion and a new random challenger is generated.
	This leads to peculiarities which I ascribe to 'inbreeding'.
The graph of the number of games taken by each
successive challenger to become champion is a saw-tooth.
Each new challenger takes longer than the last
(which I interpret as each champion being better)
until suddenly there is a champion that lasts a very short time,
and then the cycle repeats, though the peaks seem to get higher.
(fractal fanatics may wonder if there are peaks within peaks).
I take this to mean that eventually there evolve nets
designed to beat the champion though they can be beaten by
a random player.
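
	A rough sketch of the champion/challenger loop (helper names are
hypothetical): the champion's weights are frozen, the challenger learns
against it, and is promoted once it outperforms the champion.

# random_net(), win_rate(a, b, n) and train_one_game(learner, fixed)
# are assumed helpers.
def champion_challenger(generations, n_trial_games=100, threshold=0.55):
    champion = random_net()
    games_to_promotion = []          # the saw-tooth described above
    for _ in range(generations):
        challenger, games = random_net(), 0
        while win_rate(challenger, champion, n_trial_games) < threshold:
            train_one_game(challenger, champion)   # champion stays fixed
            games += 1
        games_to_promotion.append(games)
        champion = challenger
    return champion, games_to_promotion
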
	Daniel Mahler

mahler@latcs1.oz.au (Daniel Mahler) (10/10/90)

	Several people have pointed out
that my claim about trivial fixpoints
is not true as your program should recognise
terminal game positions and give absolute
value to these.  You of course have to do this to give
bp some goal-directed bias, but for a game of any complexity
there will be only a few terminal nodes in your search horizon.
There will usually be some in chess but NONE
in go or othello until the endgame.
	For noughts and crosses we can overcome all the problems
by tabulating all the legal positions and their true minimax values.
This is feasible as there are fewer than 3^9 legal positions
to evaluate. It happens in a few seconds of real-time.
Then you do not have to worry about using genetics
to avoid inbred strategies, but you do not have to
worry about neural nets or learning either.
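
	For what it's worth, the exhaustive tabulation is only a few lines;
a self-contained sketch:

# Exact minimax values for every reachable noughts-and-crosses position,
# memoised with lru_cache.
from functools import lru_cache

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, to_move):
    """Exact value from X's point of view: +1 X wins, -1 O wins, 0 draw."""
    w = winner(board)
    if w is not None:
        return 1 if w == 'X' else -1
    if ' ' not in board:
        return 0
    other = 'O' if to_move == 'X' else 'X'
    child_values = [value(board[:i] + to_move + board[i+1:], other)
                    for i, sq in enumerate(board) if sq == ' ']
    return max(child_values) if to_move == 'X' else min(child_values)

print(value(' ' * 9, 'X'))   # prints 0: perfect play is a draw
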
	I gave my tic-tac-toe program a limited
horizon because I was interested in going on to go
(bravery is usually born of innocence/ignorance).
	Overall, I do not think this generalisation of
Samuel's technique is viable for learning from scratch,
which is what machine learning people really want. :)
(For this, Lenat's Eurisko, genetics, or Barto & Sutton seem more promising.)
First, it is computationally
very expensive: to produce the teaching signal
you must do a search and evaluate a neural net
at each terminal node. Initially this signal will be poor
anyway. It is likely that tree pathology (Nau's work)
becomes a significant factor under these circumstances.
Also for a complex game the net must
learn to cluster positions by tactical/strategic concepts
from the extremely small percentage of all legal positions
it will ever see. 
	However, a slight modification of Samuel's idea seems good.
Use feature detectors designed by experts as inputs to a small net.
This will then be Samuel's method extended to allow nonlinear combinations
of these features as evaluation functions.
	Daniel

markh@csd4.csd.uwm.edu (Mark William Hopkins) (10/11/90)

In article <8966@latcs1.oz.au> mahler@latcs1.oz.au (Daniel Mahler) writes:

On the emergence of goal-directed behavior:
>	Several people have pointed out
>that my claim about trivial fixpoints
>is not true as your program should recognise
>terminal game positions and give absolute
>value to these.  You of course have to do this to give
>bp some goal directed bias, but for a game of any complexity
>there will be only few terminal nodes in your search horizon.
>There will usually be some in chess but NONE
>in go or othello until the endgame.

The evaluator is always learning based on inputs from positions N moves
ahead.  Initially it will only have substantial input within N moves
of the end, but this learning propagates as it takes THIS input in later games
and evaluates N moves from *it*: now 2N moves from the end.  And so on...  The
evaluator's trying to converge onto a non-trivial fixed point.

The goal directed behavior should propagate from bottom up, as it were.

If it really learns, the first thing you'll notice is that it'll progressively
avoid bad openings one by one because they lead to quick endings.  Eventually,
by exclusion, you'll have the good openings left.

In that sense, the bottom-up learning also leads to top-down learning.

I think the relevant issues are probably whether the network will be big
enough to handle the knowledge a typical chess master has of the state space,
and whether it can learn quickly enough without being beaten against a
wall a hundred thousand times just to get it to do something as easy as
picking up a simple boolean function.

On using hard-coded feature detectors to expedite learning:
>	However, a slight modification of Samuel's idea seems good.
>Use feature detectors designed by experts as inputs to a small net.
>This will then be Samuels method extended to allow nonlinear combinations
>of these features as evaluation functions.

What's to keep the evaluator neural net from spontaneously generating nodes
that emulate feature detection in its hidden layer(s) anyhow?

As a concrete example on a simpler game try this design on tic-tac-toe:

LAYER 3: 1 node (output).
LAYER 2: N nodes (try N = 12).
LAYER 1: 9 nodes (input .. connected to tic-tac-toe grid)

For input, a square occupied by the program's piece evaluates to +1, a square
occupied by an opponent's piece to -1, and an empty square to 0.

Every node in a layer is connected to every node in the adjacent layer.  Use
back-propagation *with thresholds*.  Take your learning input from a minimax
search program that looks merely 1 move ahead, using the output node of the
evaluator to resolve all non-winning/non-losing/non-draw positions.

When using +1 and -1 as limiting activation values, the natural analogue of
the sigmoid function 1/(1 + e^(-X)) is tanh(X), by the way, and the error
correction factor Y(1 - Y) becomes 1 - Y*Y.
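
A rough sketch of that 9-12-1 evaluator and its update (numpy is used only
for brevity; the target value would come from the 1-move-lookahead search
described above):

# 9-12-1 evaluator with tanh units and the 1 - Y*Y error correction factor.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0.0, 0.5, (12, 9)), np.zeros(12)   # layer 1 -> layer 2
W2, b2 = rng.normal(0.0, 0.5, (1, 12)), np.zeros(1)    # layer 2 -> layer 3

def evaluate(x):
    h = np.tanh(W1 @ x + b1)          # hidden activations in (-1, 1)
    y = np.tanh(W2 @ h + b2)          # board value in (-1, 1)
    return h, y

def train_step(x, target, lr=0.05):
    global W1, b1, W2, b2
    h, y = evaluate(x)
    delta_out = (y - target) * (1 - y * y)        # tanh derivative: 1 - Y*Y
    delta_hid = (W2.T @ delta_out) * (1 - h * h)
    W2 -= lr * np.outer(delta_out, h); b2 -= lr * delta_out
    W1 -= lr * np.outer(delta_hid, x); b1 -= lr * delta_hid

# x: 9 inputs, +1 own piece, -1 opponent's, 0 empty; target in [-1, 1]
train_step(np.array([1., -1., 0., 0., 1., 0., 0., 0., -1.]), 1.0)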

I predict two things will happen:
   (1) The program, when playing X, will open in the middle square
       with increasing frequency.  When playing O, it will move into a
       corner in response to an X placed in the middle more and more
       frequently.
   (2) Some of the hidden-layer nodes will evolve spontaneously into
       3-in-a-row detectors.