[comp.ai.neural-nets] Connectionist Finite State Machines -- description of an architecture

markh@csd4.csd.uwm.edu (Mark William Hopkins) (12/01/90)

   This is a description of a rather simple architecture that can be used to
train a neural net to be a finite state machine using only backpropagation.

(0) BASIC DESCRIPTION
   The net consists of 4 layers: an input layer, an output layer, a hidden
layer and a "hidden input" layer.  The hidden input layer is used to help
represent the transition function of the finite state machine.

   The hidden input layer and input layer are both used to determine the
activations of the cells in the hidden layer.  The hidden input layer receives
its activations from the hidden layer itself WITH A DELAY of one time step.

(1) FINITE STATE MACHINES
   For our purposes, a Finite State Machine (FSM) can be defined as an
automaton with the following components:

	      * A set of STATES (S).
	      * A set of INPUT signals (I).
	      * A set of OUTPUT signals (O).
and           * A TRANSITION function T: IxS -> SxO.
		which may be decomposed into a
		* State Transition Function Ts: IxS -> S
           and  * Output Function To: IxS -> O.
		with T(i, s) = (Ts(i, s), To(i, s)).

We are not considering what start states or terminating states might be in this
application of FSM theory.
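
For concreteness, here is one way such an FSM could be written down in code.
This is only a sketch; the Python class and the dictionary encoding of T are
my own illustrative choices, not part of the definition above.

# Sketch of the FSM definition: T is stored as a dictionary mapping
# (input, state) pairs to (next_state, output) pairs, i.e. T: IxS -> SxO.

class FSM:
    def __init__(self, transitions):
        # transitions: dict {(input_signal, state): (next_state, output)}
        self.T = transitions

    def Ts(self, i, s):
        # State transition function Ts: IxS -> S
        return self.T[(i, s)][0]

    def To(self, i, s):
        # Output function To: IxS -> O
        return self.T[(i, s)][1]

    def step(self, i, s):
        # Full transition T(i, s) = (Ts(i, s), To(i, s))
        return self.T[(i, s)]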

   Two Finite State Machines are considered equivalent here if they produce
the same output, given the same history and the same input.  The "history"
should be considered as the left-context if the input is thought of as being
arranged in a linear sequence.  This context may be infinite, since we're
not necessarily considering start states here.  Actually for an N-state FSM
I believe you only need to consider contexts consisting of the previous N
inputs in this definition.

   The task at hand is to train a neural net to converge to an FSM that is
equivalent to the one it is being presented with.  This equivalent machine may
or may not have the same set of states; the intent of the definition given
above was to allow for the possibility that the state sets may be different.

(2) THE FUNCTION OF THE LAYERS.
   Intuitively, the two hidden layers are supposed to represent both the
current state and previous state of the FSM.  The hidden layer effectively
collects the output of the State Transition function using the previous
state represented in the hidden input layer and the input represented in
the input layer, and the output layer collects the outputs derived from the
input and hidden input layers.

   The weights between ALL the layers are variable, so the state set itself
IS BEING LEARNED during training.

(3) ARCHITECTURE.
   This is a standard layered backpropagation neural net, with one minor
modification.  
   (a) External inputs are collected into the input layer.
   (b) The current state, represented by the set of activations of the hidden
       layer during the previous step, is 'shifted' into the hidden input layer.
   (c) Weighted sums of the activations in the hidden input layer and input
       layer are passed to the hidden layer.  A sigmoid function such as
       1/(1 + exp(-x)) is applied to derive the activations of the hidden
       layer.
   (d) The same process is applied to the hidden input layer and input layer to
       derive the activations of the output layer.

An alternative to (d) would be to simply pass the hidden layer's activations
to the output layer; this is a simpler architecture, but it is not guaranteed
to produce valid learning behavior.

   After presentation of the training output, backpropagation is applied just
as it would be for any layered neural net.
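
As a concrete illustration, here is a minimal numpy sketch of one training
step for this kind of architecture, in the simplified variant noted after (d),
keeping a connection from the hidden layer to the output layer so that error
signals can reach the state transition weights.  The layer sizes, learning
rate and variable names are illustrative choices of mine, not taken from the
actual demo.

import numpy as np

class SRN:
    # Sketch of the net described above: input + hidden input (context)
    # layers feed the hidden layer, and the hidden layer feeds the output
    # layer.  All names, sizes, and constants are illustrative.

    def __init__(self, n_in, n_hid, n_out, lr=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.W_ih = rng.normal(0.0, 0.5, (n_hid, n_in))   # input        -> hidden
        self.W_ch = rng.normal(0.0, 0.5, (n_hid, n_hid))  # hidden input -> hidden
        self.W_ho = rng.normal(0.0, 0.5, (n_out, n_hid))  # hidden       -> output
        self.context = np.zeros(n_hid)   # "hidden input" layer: hidden, delayed
        self.lr = lr

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_step(self, x, target):
        # One presentation: forward pass, backprop, then shift the hidden
        # activations into the hidden input layer for the next time step.
        hidden = self._sigmoid(self.W_ih @ x + self.W_ch @ self.context)  # (c)
        output = self._sigmoid(self.W_ho @ hidden)             # simplified (d)

        # Ordinary backpropagation of the squared error, one layer at a time.
        d_out = (output - target) * output * (1.0 - output)
        d_hid = (self.W_ho.T @ d_out) * hidden * (1.0 - hidden)
        self.W_ho -= self.lr * np.outer(d_out, hidden)
        self.W_ih -= self.lr * np.outer(d_hid, x)
        self.W_ch -= self.lr * np.outer(d_hid, self.context)

        self.context = hidden.copy()                           # (b)
        return output

Note that the hidden input layer is treated exactly like an ordinary input
layer during the backward pass: the error is not propagated further back in
time, so the weight updates are the same as for any layered backprop net.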

(4) TRAINING SEQUENCE
    Training an FSM differs slightly from training an ordinary layered backprop
neural net.  The order in which inputs and outputs are presented to the neural
net does matter.  Outputs should faithfully reflect the output that the
FSM being presented to the neural net would produce with history taken into
account.

(5) EXAMPLE
   A flip-flop can be considered as a 2-state FSM with the following
components:
	     INPUTS: (P, Q, N)
	     OUTPUTS: (R, S)
	     STATES: (0, 1)
	     TRANSITION FUNCTION:
	        Ts(P, s) = 0, Ts(Q, s) = 1, Ts(N, s) = s for s = 0, 1.
	        To(P, s) = To(N, 0) = R
	        To(Q, s) = To(N, 1) = S

A valid training sequence would accurately reflect the fact that outputs from
previous time steps are preserved when the input is N:

       (P, R), (N, R), (Q, S), (N, S), (N, S), (P, R), (Q, S), ...

This condition would be violated if the next item in the training sequence
were
			    (N, R).

   A demo neural net was written as per (3) with the simplification mentioned
after (d) and successfully trained to follow this FSM.
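
For what it's worth, here is a sketch of how such a training sequence could be
generated, using the FSM class sketched in section (1).  The dictionary below
just transcribes the transition function given above; the random input policy
is an arbitrary choice.

import random

flip_flop = FSM({
    ('P', 0): (0, 'R'), ('P', 1): (0, 'R'),   # P forces state 0, output R
    ('Q', 0): (1, 'S'), ('Q', 1): (1, 'S'),   # Q forces state 1, output S
    ('N', 0): (0, 'R'), ('N', 1): (1, 'S'),   # N preserves the state
})

def training_sequence(length, seed=0):
    # Produce (input, output) pairs whose outputs respect the history.
    rnd = random.Random(seed)
    s = rnd.choice([0, 1])                    # arbitrary initial state
    seq = []
    for _ in range(length):
        i = rnd.choice(['P', 'Q', 'N'])
        s, o = flip_flop.step(i, s)
        seq.append((i, o))
    return seq

# training_sequence(20) yields pairs of the form shown above: an N following
# a Q always pairs with S, and an N following a P always pairs with R.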

markh@csd4.csd.uwm.edu (Mark William Hopkins) (12/01/90)

In article <7982@uwm.edu> markh@csd4.csd.uwm.edu (Mark William Hopkins) writes:
>   (d) The same process is applied to the hidden input layer and input layer to
>       derive the activations of the output layer.
>
>An alternative to (d) would be to simply pass the hidden layer's activations
>to the output layer; this is a simpler architecture, but it is not guaranteed
>to produce valid learning behavior.

A minor correction: the hidden layer should STILL be connected to the output
layer in either case, or there won't be any way for error signals to propagate
back to train the state transition function.

elman@crl.ucsd.edu (Jeff Elman) (12/01/90)

In article <7982@uwm.edu> Mark Hopkins writes:
>
>   This is a description of a rather simple architecture that can be used to
>train a neural net to be a finite state machine using only backpropagation.
>

Yes, this is an interesting architecture.  It is a variant of a general
class of networks proposed by Mike Jordan in his 1986 UCSD Ph.D. dissertation.

Several of us have experimented with the architecture you describe.
You might be interested in some reports using this architecture.

I have a 1988 TR called 'Finding structure in time'; a revised
version appeared in the March/April issue of Cognitive Science this year.
This sort of network was applied to a variety of domains in which the
task was prediction.  The challenge for the network was to learn the
underlying dynamics which produced the time series.

Servan-Schreiber, Cleeremans, and McClelland report work using the
same architecture (which they called a simple recurrent network)
to predict a time series which was generated by an FSA.  Although the
net did a good job of representing the states of certain FSA's, 
certain limitations in the SRN were revealed.

I've used this architecture to predict words in complex sentences
(i.e., sentences with subordinate clauses).  The issue I was interested
in was the ability of the net to model sentences in which there was
(presumably) an underlyingly hierarchical structure--can such networks
represent constituent structure, using distributed representations?
The network did in fact learn to do this, but there turn out to be
interesting and I think important differences between the state
representation of hierarchy, and the more traditional stack
representation.  This work has also been reported in a couple of TR's.

There is actually quite a bit of interesting work that's been reported
with this architecture, including work by Mike Jordan, Mike Gasser &
Chan-Do Lee, Mary Hare, Janet Wiles, St. John & McClelland, Risto
Miikkulainen & Mike Dyer, Gary Cottrell & Fu-Sheng Tsung, Bob Port,
Steve Small, among many others (sorry--I've undoubtedly missed
something important!).


Jeff Elman
Cog Sci/UCSD

rr2p+@andrew.cmu.edu (Richard Dale Romero) (12/01/90)

I'm also using this structure for a class paper to see how well it can do
predicting the next letter in a word.  More importantly, it should predict
which letters cannot come next in a word.  One thing to watch for in this
sort of task is your training set.  If you don't produce a random distribution
of letter orderings, the network will tend to learn the probability of a
certain letter appearing next in a word.  In other words, if you train it on
twice as many words that contain "st"'s as "sh"'s, it will learn that the
probability of a 't' following an 's' is twice as high as that of an 'h'
following the same 's'.  It has quite successfully learned that certain letter
sequences signify the end of a word (e.g. 'ly', 'y', 'ing') and that vowels
can follow consonants, but I am still working on getting my training
dictionary up to snuff.

-rick

danforth@riacs.edu (Douglas G. Danforth) (12/06/90)

In <7982@uwm.edu> markh@csd4.csd.uwm.edu (Mark William Hopkins) writes:

>   This is a description of a rather simple architecture that can be used to
>train a neural net to be a finite state machine using only backpropagation.
		 (... rest of nice description deleted ...)

Mark,
    Your description brought back an interesting psychology controversy 
from the 60's.  Patrick Suppes at Stanford was able to show theoretically
that a Stimulus-Response model could be trained so that asymptotically it
would emulate a finite state machine.  Your architecture is now an embodiment
of that principle.  Thanks for the posting.

--
Douglas G. Danforth   		    (danforth@riacs.edu)
Research Institute for Advanced Computer Science (RIACS)
M/S 230-5, NASA Ames Research Center
Moffett Field, CA 94035