coopere@rocky2.rockefeller.edu (Ellis D. Cooper) (07/29/90)
[Character rendering of the STP graph for BEP, the Superpaint drawing described below: a four-layer feedforward network read bottom to top, with inputs A and B at the bottom; above them sgmd(N(C)) and sgmd(N(D)); above those sgmd(N(E)) and sgmd(N(F)); and at the top sgmd(N(G)) and sgmd(N(H)). At each level, horizontal chains of boxes such as d(G).F ---> W(GF).F and E.W(HE) <--- d(H).E carry the weight-changing messages, while vertical and diagonal dashed arrows connect the layers. The delta equations written at the bottom of the graph are:]

    d(H) := (Q - H).sgmd'(N(H))
    d(G) := (P - G).sgmd'(N(G))
    d(F) := (d(G).W(GF) + d(H).W(HF)).sgmd'(N(F))
    d(E) := (d(G).W(GE) + d(H).W(HE)).sgmd'(N(E))
    d(D) := (d(E).W(ED) + d(F).W(FD)).sgmd'(N(D))
    d(C) := (d(E).W(EC) + d(F).W(FC)).sgmd'(N(C))

I used to remind myself what science is by thinking <observation and tested generalization>. Now I add to this: <comprehensiveness with simplicity>. The addition emphasizes Occam's Razor, which shows up also in modern computing as Reduced Instruction Set Computing (RISC). Along this line I am exploring Symbol Train Processing, which is Reduced Instruction Set Thinking (RIST).

The basic object of study in symbol train processing is the <symbol train>. At the behavioral level, the sequence of words in a spoken sentence, possibly separated by hesitations of varying duration, is a good example of a symbol train. At the neuronal level, the sequence of action potentials in a <spike train> generated at the axon hillock of a cortical neuron is another very good example.

In a narrow sense, a <symbol train> is by definition a sequence of the form

    (*)   dt(1) s(1) dt(2) s(2) ... dt(K) s(K)

where s(1), s(2), ..., s(K) are unanalyzed <primitive symbols> - e.g., words, or spikes - and dt(1), dt(2), ..., dt(K) are intervals of time. (The indices 1, ..., K in (*) ought to be written as subscripts.) In the general case, a symbol train is a spatial-temporal distribution of weighted, targeted symbols

    (**)  dt(1) S(1) dt(2) S(2) ... dt(K) S(K)

where S(k) is a set of primitive symbols, each of which has a numerical <weight> and a <target>. The weight on a symbol is a general measure of <strength>, as are the weights in the artificial neural networks of connectionism. The target of a (weighted) primitive symbol is a symbol train processor (STP). A general definition of STP is not given here; a broad introduction to symbol train processing is available in a research report (Cooper, 1990a).

The notation (**) captures the two orthogonal dimensions of a symbol train, its successive simultaneities: the left-to-right reading of the notation corresponds to the passage of time from past to future, and the use of a <set> of weighted, targeted symbols corresponds to the <parallel> emission and absorption of messages.

Perhaps there are many different systematic ways to describe the generation and processing of symbol trains. In this work, though, <symbol train processing> refers to my particular graphical choice of mechanisms for symbol-train-valued functions of symbol train variables. Symbol train processing exemplifies RIST because it uses only five basic syntactical elements for drawing executable diagrams of algorithms and models of phenomena from a wide range of scientific experience.
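Before turning to those five elements, it may help to fix the notation (**) in code. Here is a minimal C sketch of one reading of it; the type and field names are my own illustration, not taken from the STP report:

    /* One weighted, targeted primitive symbol, as in the notation (**). */
    struct Symbol {
        int    id;      /* which primitive symbol (a word, a spike, ...)  */
        double weight;  /* the general measure of <strength>              */
        int    target;  /* which symbol train processor (STP) receives it */
    };

    /* One "successive simultaneity" dt(k) S(k): an interval of time
       followed by a set of symbols emitted and absorbed in parallel. */
    struct Slice {
        double         dt;     /* the interval of time dt(k) */
        int            count;  /* size of the set S(k)       */
        struct Symbol *set;    /* the set S(k) itself        */
    };

    /* A symbol train is then the sequence dt(1) S(1) ... dt(K) S(K). */
    struct SymbolTrain {
        int           K;
        struct Slice *slice;
    };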
The five elements are: circles, which represent states with lifetimes; rectangles, which represent symbol train processors (these are like subroutines); and three kinds of arrows: arrows with closed arrowheads, representing triggerable state transitions (as in finite state automata) which truncate the life of one state and force another state to be (re-)born; arrows with open arrowheads, representing state transitions with death of one state (timeout) and (re-)birth of another; and dashed arrows, representing paths for transmission of messages between processors. When a state dies naturally at the end of its lifetime (i.e., without having its life truncated by arrival of a trigger), it sends out messages. For flexibility of expressiveness, a lifetime of a state may be of length zero.

I have drawn, and in many cases executed, several dozen small STP systems to illustrate the expressiveness of the STP conventions. These include low-pass filters, band-pass filters, frequency doublers, bi- and multi-lateral inhibitors, the eight-queens problem, the traveling salesman problem, a photoreceptor model in neurophysiology, a wave motor, a time window, a delay transducer, an accumulation and exponential decay, an Euler method forward integration, a second-order Runge-Kutta forward integration, a second-order Runge-Kutta forward integration with adaptive time-step, a random walk of a single particle, a Michaelis-Menten enzyme-substrate chemical reaction, two models of the calcium oscillator of horseshoe crab ventral photoreceptors, an electron transfer between donor and acceptor molecules in a photo-active material, and the deterministic evolution (via timeouts) of a quantum mechanical system disturbed by measurements (via triggers).

The purpose of this application note is to exhibit how easy it is to set up the standard (non-optimized) backward-error propagation algorithm (as in [Rumelhart et al, PDP, v.1, pp. 324-327], <the delta rule for semilinear activation functions in feedforward networks>) using just a small subset of the STP conventions, namely the boxes and dashed arrows. The STP system graph shown above should be compared to the roughly 30 lines of C code for the BEP algorithm presented in [McClelland-Rumelhart, Explorations in PDP, pp. 138-140], with four doubly-nested loops.

In the following explanation of the graph I will be referring to a printout before me of a Macintosh Superpaint drawing (available on request, see below, and suitable for framing). The character-based sketch of that drawing at the top of this post is provided for rough guidance only. This is a four-layer feedforward network with input at the bottom, A and B, and output at the top, G and H, and two units in each hidden layer, C, D and E, F.

In all notations of the form N(X), the index X is really a subscript to N. Likewise for all notations of the form d(X); and in all notations of the form W(XY), the XY is a double subscript to W. All notations consisting of either single letters or subscripted letters are enclosed in tight-fitting rectangular boxes. All notations consisting of two such boxes separated by the multiplication operator, represented here by the period, <.>, are enclosed by another rectangular box. All vertical and diagonal dashed lines have arrowheads at their upper ends. All of the vertical or diagonal dashed arrows are labeled by the notations which have the multiplication operators on the same line as the arrow.
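One more preliminary: the sigmoidal activation function sgmd and its derivative sgmd' appear throughout the graph and in the := equations. A standard C rendering would use the logistic function of [Rumelhart et al, PDP, v.1]; this is my sketch, not code from the STP report:

    #include <math.h>

    /* The logistic activation function of [PDP, v.1]. */
    double sgmd(double x)
    {
        return 1.0 / (1.0 + exp(-x));
    }

    /* Its derivative, expressible through sgmd itself:
       sgmd'(x) = sgmd(x) * (1 - sgmd(x)). */
    double sgmd_prime(double x)
    {
        double s = sgmd(x);
        return s * (1.0 - s);
    }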
All of the notations enclosed in rectangular boxes represent <numerical STPs>. A numerical STP has at any given time a state, which is a numerical value. In other words, a numerical STP is just a variable, as in any programming language. But a numerical STP should be thought of as an object, as in object-oriented programming, since dashed arrows arriving at a numerical STP carry messages which change the state of the numerical STP. Not only that: if two arrows terminate at a numerical STP, then simultaneous messages are summed together, so that the new state is the sum of the values of the messages. All notations of the form N(X) sum their inputs in this way.

There are two kinds of numerical STPs, the independent and the dependent ones. This is just the same as the distinction between independent and dependent variables in calculus. In the STP graph above, the independent numerical STPs are A, B (the inputs), P, Q (the desired outputs), and all notations of the form W(XY) (the weight from unit Y to unit X). All other numerical STPs are dependent. The rule for computing the new state of a dependent STP is simply that it should be re-computed whenever any of its arguments changes state. For example, sgmd(N(C)), which is the sigmoidal activation function sgmd applied to the numerical STP N(C), will be re-computed whenever either of the inputs A or B changes, since N(C) receives messages from both inputs. If they both change at the same time, then the new state of sgmd(N(C)) is sgmd(W(CA).A + W(CB).B). A dependent numerical STP cannot change state by receiving messages: it changes only in response to changes in its arguments.

Both independent and dependent numerical STPs can emit messages when they change state. The dashed arrow with its tail end attached to a box carries the messages from the box. If the arrow is not labeled, then by default the message is assumed to be the value of the new state of the STP represented by the box. For example, when d(G) changes state (for the reason given below), the state of the dependent numerical STP d(G).F changes, which causes the value d(G).F to be sent as a message to the independent numerical STP W(GF). Since the notations d(X) are the <delta terms>, you see that all the horizontal dashed arrows in the STP graph carry weight-changing messages.

With the explanation of STP conventions out of the way, the action of the STP graph for BEP can now be described as follows. Suppose the independent numerical STPs of the form W(XY), the weights, are given initial values by any means whatever (e.g., by a pseudo-random number generator). Suppose we want the network to learn to produce output P, Q given input A, B. The training cycle is as follows. Send the desired values to P and Q. Then the STPs which depend on P and Q, namely d(G) and d(H), will be updated and will send messages down the graph which adjust the weights W(GF), W(HE), W(GE), and W(HF). These changes force re-computation of d(E) and d(F) according to the equations written with := at the bottom of the STP graph. Hence W(ED), etc. are re-computed, and so on, as the consequences of those desired values for P and Q trickle down the graph.

Now, suppose messages to A and B arrive simultaneously with the inputs which are supposed to produce the P and Q. Then sgmd(N(C)) and sgmd(N(D)) are re-computed, which forces sgmd(N(E)) and sgmd(N(F)) to be re-computed, which forces sgmd(N(G)) and sgmd(N(H)) to be re-computed, which forces d(G) and d(H) to be re-computed, and we are back to the trickle-down. The floating of values up from A and B followed by the trickling down of values from d(G) and d(H) constitutes one training cycle. If the resulting values in G := sgmd(N(G)) and H := sgmd(N(H)) are not sufficiently near to the desired P and Q, respectively, then send the same messages to A and B again. (This recycling can be accomplished with a little STP oscillator, not to be given here.)
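To make the cycle concrete, here is a minimal C sketch of one training cycle for the 2-2-2-2 network above, using the sgmd and sgmd_prime of the earlier sketch. It is my own rendering, not the STP interpreter; the learning-rate factor eta is the standard one of [PDP, v.1] and is left implicit in the STP graph:

    /* Uses sgmd() and sgmd_prime() from the sketch above. */

    double W_CA, W_CB, W_DA, W_DB;   /* weights into hidden layer C, D */
    double W_EC, W_ED, W_FC, W_FD;   /* weights into hidden layer E, F */
    double W_GE, W_GF, W_HE, W_HF;   /* weights into output layer G, H */

    void train_cycle(double A, double B, double P, double Q, double eta)
    {
        /* Floating up: each N(X) sums its incoming messages W(XY).Y. */
        double NC = W_CA*A + W_CB*B,  C = sgmd(NC);
        double ND = W_DA*A + W_DB*B,  D = sgmd(ND);
        double NE = W_EC*C + W_ED*D,  E = sgmd(NE);
        double NF = W_FC*C + W_FD*D,  F = sgmd(NF);
        double NG = W_GE*E + W_GF*F,  G = sgmd(NG);
        double NH = W_HE*E + W_HF*F,  H = sgmd(NH);

        /* Trickling down: the := equations at the bottom of the graph. */
        double dG = (P - G) * sgmd_prime(NG);
        double dH = (Q - H) * sgmd_prime(NH);
        double dE = (dG*W_GE + dH*W_HE) * sgmd_prime(NE);
        double dF = (dG*W_GF + dH*W_HF) * sgmd_prime(NF);
        double dC = (dE*W_EC + dF*W_FC) * sgmd_prime(NC);
        double dD = (dE*W_ED + dF*W_FD) * sgmd_prime(ND);

        /* The horizontal dashed arrows: weight-changing messages of the
           form d(X).Y delivered to W(XY), scaled here by eta. */
        W_GE += eta*dG*E;  W_GF += eta*dG*F;
        W_HE += eta*dH*E;  W_HF += eta*dH*F;
        W_EC += eta*dE*C;  W_ED += eta*dE*D;
        W_FC += eta*dF*C;  W_FD += eta*dF*D;
        W_CA += eta*dC*A;  W_CB += eta*dC*B;
        W_DA += eta*dD*A;  W_DB += eta*dD*B;
    }

Note one nuance: the sketch computes all six delta terms before applying any weight change, as in [PDP, v.1]; in the STP graph as described above, the weight-adjusting messages arrive before the lower delta terms are re-computed.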
What's the point of implementing BEP in STP? The outermost question is: why have a visual programming language? Because many of us programmers feel cramped, having to work in 1-dimensional, line-by-line programming. We want to have a 2- or 3-dimensional representation of algorithms which is to regular programming as painting or sculpture is to music. Also, there are pedagogical advantages to visual programming. In particular, the learning curve for object-oriented programming could be diminished by the right visual programming language.

Why is symbol train processing a very good visual programming language? The great variety of examples which have been graphed or executed shows that symbol train processing is comprehensive; its syntax is obviously simple, and the semantics is immediately intuited. It was inspired by neuroscience in the same sense that FORTRAN was inspired by scientific and engineering needs, LISP by AI needs, PASCAL by computer science education needs, APL by algebra needs, PROLOG by logic needs, FORTH by control needs, or C by programming needs.

The point of implementing BEP in STP is, first, to show that it can be done - just more evidence for the comprehensiveness of STP; second, BEP fits naturally into a simple subset of an already very simple visual language; third, STP lends itself readily to concurrent, object-oriented interpretation, so BEP inherits those features.

The following reference materials on symbol train processing are available by email request (be sure to include your address). The easiest way for me to send these things is on a Macintosh 800K disk; also, that way paper is used only for what you really need in hard copy form. (A printout of the Superpaint drawing will accompany the disk.)

Symbol Train Processing (April, 1990a; 50 pages, 20 figures)

An Object-Oriented Interpreter for Symbol Train Processing (May, 1990b; 15 pages, 3 figures; includes the Mathematica STP Notebook and the ChemiKine Instructions)

Symbol Train Processing Applied to Formalizing Mental Models of Calcium Oscillators (July, 1990c; 28 pages, 11 figures)

Motivation and Status of Symbol Train Processing (July, 1990d; 7 pages)

The following demonstration programs are available on a standard 1.2 MB floppy disk to run on an IBM PC/AT clone type of machine:

Bilateral Inhibition (inh.exe in C)

Frequency Doubler (demo.exe in C under Windows)