markh@csd4.csd.uwm.edu (Mark William Hopkins) (12/01/90)
This is a description of a rather simple architecture that can be used to train a neural net to be a finite state machine using only backpropagation.

(0) BASIC DESCRIPTION

The net consists of 4 layers: an input layer, an output layer, a hidden layer and a "hidden input" layer. The hidden input layer is used to help represent the transition function of the finite state machine. The hidden input layer and input layer are both used to determine the activations of the cells in the hidden layer. The hidden input layer receives its activations from the hidden layer itself WITH A DELAY of one time step.

(1) FINITE STATE MACHINES

For our purposes, a Finite State Machine (FSM) can be defined as an automaton with the following components:

   * A set of STATES (S).
   * A set of INPUT signals (I).
   * A set of OUTPUT signals (O).
   * A TRANSITION function T: IxS -> SxO, which may be decomposed into a
     State Transition Function Ts: IxS -> S and an Output Function
     To: IxS -> O, with T(i, s) = (Ts(i, s), To(i, s)).

We are not considering what start states or terminating states might be in this application of FSM theory.

Two Finite State Machines are considered equivalent here if they produce the same output, given the same history and the same input. The "history" should be considered as the left-context if the input is thought of as being arranged in a linear sequence. This context may be infinite, since we're not necessarily considering start states here. Actually, for an N-state FSM I believe you only need to consider contexts consisting of the previous N inputs in this definition.

The task at hand is to train a neural net to converge to an FSM that is equivalent to the one it is being presented with. The equivalent machine may or may not have the same set of states; the intent of the definition given above was to allow for the possibility that the state sets may be different.

(2) THE FUNCTION OF THE LAYERS.
Intuitively, the two hidden layers are supposed to represent both the current state and previous state of the FSM. The hidden layer effectively collects the output of the State Transition function, using the previous state represented in the hidden input layer and the input represented in the input layer, and the output layer collects the outputs derived from the input and hidden input layers. The weights between ALL the layers are variable, so the state set itself IS BEING LEARNED during training.

(3) ARCHITECTURE.

This is a standard layered backpropagation neural net, with one minor modification:

   (a) External inputs are collected into the input layer.
   (b) The current state, represented by the set of activations of the hidden
       layer during the previous step, is 'shifted' into the hidden input
       layer.
   (c) Weighted sums of the activations in the hidden input layer and input
       layer are passed to the hidden layer. A sigmoid function such as
       1/(1 + exp(-x)) is applied to derive the activations of the hidden
       layer.
   (d) The same process is applied to the hidden input layer and input layer
       to derive the activations of the output layer.

An alternative to (d) would be to simply pass the hidden layer's activations to the output layer; this is a simpler architecture, but it is not guaranteed to produce valid learning behavior. After presentation of the training output, backpropagation is applied just as it would be for any layered neural net.

(4) TRAINING SEQUENCE

Training an FSM differs slightly from training a layered backpropagation neural net: the order in which inputs and outputs are presented to the neural net does matter. Outputs should faithfully reflect the output that the FSM being presented to the neural net would produce, with history taken into account.

(5) EXAMPLE

A flip-flop can be considered as a 2-state FSM with the following components:

   INPUTS:  (P, Q, N)
   OUTPUTS: (R, S)
   STATES:  (0, 1)

   TRANSITION FUNCTION:
      Ts(P, s) = 0,  Ts(Q, s) = 1,  Ts(N, s) = s   for s = 0, 1.
      To(P, s) = To(N, 0) = R,  To(Q, s) = To(N, 1) = S.

A valid training sequence would accurately reflect the fact that outputs from previous time steps are preserved when the input is N:

   (P, R), (N, R), (Q, S), (N, S), (N, S), (P, R), (Q, S), ...

This condition would be violated if, for instance, the next item in the training sequence were (N, R).

A demo neural net was written as per (3), with the simplification mentioned after (d), and successfully trained to follow this FSM.
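The architecture in (3) and the flip-flop example can be sketched in a few dozen lines of pure Python. This is a minimal illustration under the simplification mentioned after (d) (the output layer fed from the hidden layer), not the original demo program; all names and hyperparameters (layer sizes, learning rate, epoch counts) are arbitrary choices for the sketch.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class ElmanNet:
    """4-layer net per (3): input + hidden-input (context) -> hidden -> output."""
    def __init__(self, n_in, n_hid, n_out, lr=0.5, seed=0):
        rnd = random.Random(seed)
        mat = lambda r, c: [[rnd.uniform(-0.5, 0.5) for _ in range(c)]
                            for _ in range(r)]
        self.W_ih = mat(n_hid, n_in)    # input -> hidden
        self.W_ch = mat(n_hid, n_hid)   # hidden-input (context) -> hidden
        self.W_ho = mat(n_out, n_hid)   # hidden -> output (simplified (d))
        self.b_h = [0.0] * n_hid
        self.b_o = [0.0] * n_out
        self.context = [0.0] * n_hid    # hidden activations, delayed one step
        self.lr = lr

    def step(self, x, target=None):
        ctx = self.context
        # (c) hidden layer from weighted sums of input and hidden-input layers
        h = [sigmoid(sum(w * v for w, v in zip(self.W_ih[j], x)) +
                     sum(w * v for w, v in zip(self.W_ch[j], ctx)) + self.b_h[j])
             for j in range(len(self.b_h))]
        # simplified (d): output layer fed from the hidden layer
        y = [sigmoid(sum(w * v for w, v in zip(self.W_ho[k], h)) + self.b_o[k])
             for k in range(len(self.b_o))]
        if target is not None:
            # one-step backpropagation; the context is treated as a fixed input
            d_o = [(y[k] - target[k]) * y[k] * (1.0 - y[k])
                   for k in range(len(y))]
            d_h = [h[j] * (1.0 - h[j]) *
                   sum(d_o[k] * self.W_ho[k][j] for k in range(len(y)))
                   for j in range(len(h))]
            for k in range(len(y)):
                for j in range(len(h)):
                    self.W_ho[k][j] -= self.lr * d_o[k] * h[j]
                self.b_o[k] -= self.lr * d_o[k]
            for j in range(len(h)):
                for i in range(len(x)):
                    self.W_ih[j][i] -= self.lr * d_h[j] * x[i]
                for c in range(len(ctx)):
                    self.W_ch[j][c] -= self.lr * d_h[j] * ctx[c]
                self.b_h[j] -= self.lr * d_h[j]
        # (b) shift current hidden activations into the hidden input layer
        self.context = h
        return y

# Flip-flop FSM from section (5)
IN = {'P': [1, 0, 0], 'Q': [0, 1, 0], 'N': [0, 0, 1]}
OUT = {'R': [1, 0], 'S': [0, 1]}

def flip_flop_sequence(n, rnd):
    """Generate (input, output) pairs that respect the FSM's history."""
    state, pairs = 0, []
    for _ in range(n):
        i = rnd.choice('PQN')
        state = {'P': 0, 'Q': 1, 'N': state}[i]        # Ts
        pairs.append((i, 'R' if state == 0 else 'S'))  # To
    return pairs

net = ElmanNet(3, 4, 2)
rnd = random.Random(1)
for epoch in range(200):
    net.context = [0.0] * 4
    for i, o in flip_flop_sequence(50, rnd):
        net.step(IN[i], OUT[o])
```

Note that the training sequences are generated from the FSM's state history rather than sampled independently, which is exactly the condition in (4).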
markh@csd4.csd.uwm.edu (Mark William Hopkins) (12/01/90)
In article <7982@uwm.edu> markh@csd4.csd.uwm.edu (Mark William Hopkins) writes:

>   (d) The same process is applied to the hidden input layer and input layer
>       to derive the activations of the output layer.
>
> An alternative to (d), would be to simply pass the hidden layer's activations
> to the output layer, but this will not be guaranteed to produce valid
> learning behavior though it is a simpler architecture.

A minor correction: the hidden layer should STILL be connected to the output layer in either case, or there won't be any way for error signals to propagate back to train the state transition function.
elman@crl.ucsd.edu (Jeff Elman) (12/01/90)
In article <7982@uwm.edu> Mark Hopkins writes:

> This is a description of a rather simple architecture that can be used to
> train a neural net to be a finite state machine using only backpropagation.

Yes, this is an interesting architecture. It is a variant of a general class of networks proposed by Mike Jordan in his 1986 UCSD Ph.D. dissertation. Several of us have experimented with the architecture you describe.

You might be interested in some reports using this architecture. I have a 1988 TR called 'Finding structure in time'; a revised version appeared in the March/April issue of Cognitive Science this year. This sort of network was applied to a variety of domains in which the task was prediction. The challenge for the network was to learn the underlying dynamics which produced the time series.

Servan-Schreiber, Cleeremans, and McClelland report work using the same architecture (which they called a simple recurrent network) to predict a time series which was generated by FSA. Although the net did a good job of representing the states of certain FSA's, certain limitations in the SRN were revealed.

I've used this architecture to predict words in complex sentences (i.e., sentences with subordinate clauses). The issue I was interested in was the ability of the net to model sentences in which there was (presumably) an underlyingly hierarchical structure--can such networks represent constituent structure, using distributed representations? The network did in fact learn to do this, but there turn out to be interesting and, I think, important differences between the state representation of hierarchy and the more traditional stack representation. This work has also been reported in a couple of TR's.

There is actually quite a bit of interesting work that's been reported with this architecture, including work by Mike Jordan, Mike Gasser & Chan-Do Lee, Mary Hare, Janet Wiles, St. John & McClelland, Risto Miikkulainen & Mike Dyer, Gary Cottrell & Fu-Sheng Tsung, Bob Port, Steve Small, among many others (sorry--I've undoubtedly missed someone important!).

Jeff Elman
Cog Sci/UCSD
rr2p+@andrew.cmu.edu (Richard Dale Romero) (12/01/90)
I'm also using this structure for a class paper, to see how well it can do predicting the next letter in a word. More importantly, it should predict what letters cannot come next in a word.

One thing to watch for in this sort of task is your training set. If you don't produce a random distribution of letter orderings, the network will tend to learn the probability of a certain letter appearing next in a word. In other words, if you train it on twice as many words that contain "st" as contain "sh", it will learn that the probability of a 't' following an 's' is twice as high as that of an 'h' following the same 's'.

It has quite successfully learned that certain letters signify the end of a word, e.g. 'ly', 'y', 'ing', and that vowels can follow consonants, but I am still working on getting my training dictionary up to snuff.

-rick
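That frequency effect is easy to see with a direct count. The toy word list below is made up purely for illustration (it is not the poster's training dictionary):

```python
from collections import Counter

def next_letter_probs(words, prev):
    """Relative frequency of each letter that follows `prev` in the word list."""
    counts = Counter(b for w in words for a, b in zip(w, w[1:]) if a == prev)
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}

# Two "st" words for every "sh" word ...
words = ["stop", "stem", "shin"]
probs = next_letter_probs(words, 's')
# ... so the estimated probability of 't' after 's' comes out twice that of 'h'
```

A network trained to minimize prediction error on such a corpus approximates exactly these conditional frequencies, which is why the balance of the training set matters.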
danforth@riacs.edu (Douglas G. Danforth) (12/06/90)
In <7982@uwm.edu> markh@csd4.csd.uwm.edu (Mark William Hopkins) writes:

> This is a description of a rather simple architecture that can be used to
> train a neural net to be a finite state machine using only backpropagation.
(... rest of nice description deleted ...)

Mark,

Your description brought back an interesting psychology controversy from the 60's. Patrick Suppes at Stanford was able to show theoretically that a Stimulus-Response model could be trained so that asymptotically it would emulate a finite state machine. Your architecture is now an embodiment of that principle.

Thanks for the posting.

--
Douglas G. Danforth (danforth@riacs.edu)
Research Institute for Advanced Computer Science (RIACS)
M/S 230-5, NASA Ames Research Center
Moffett Field, CA 94035