coopere@rocky2.rockefeller.edu (Ellis D. Cooper) (07/29/90)
[Character rendering of the STP graph for BEP, the Superpaint drawing described below: a four-layer feedforward network read bottom to top, with inputs A and B at the bottom; above them sgmd(N(C)) and sgmd(N(D)); above those sgmd(N(E)) and sgmd(N(F)); and at the top sgmd(N(G)) and sgmd(N(H)). At each level, horizontal chains of boxes such as d(G).F ---> W(GF).F and E.W(HE) <--- d(H).E carry the weight-changing messages, while vertical and diagonal dashed arrows connect the layers. The delta equations written at the bottom of the graph are:]

    d(H) := (Q - H).sgmd'(N(H))
    d(G) := (P - G).sgmd'(N(G))
    d(F) := (d(G).W(GF) + d(H).W(HF)).sgmd'(N(F))
    d(E) := (d(G).W(GE) + d(H).W(HE)).sgmd'(N(E))
    d(D) := (d(E).W(ED) + d(F).W(FD)).sgmd'(N(D))
    d(C) := (d(E).W(EC) + d(F).W(FC)).sgmd'(N(C))

I used to remind myself what science is by thinking <observation and tested generalization>. Now I add to this: <comprehensiveness with simplicity>. The addition emphasizes Occam's Razor, which shows up also in modern computing as Reduced Instruction Set Computing (RISC). Along this line I am exploring Symbol Train Processing, which is Reduced Instruction Set Thinking (RIST).

The basic object of study in symbol train processing is the <symbol train>. At the behavioral level, the sequence of words in a spoken sentence, possibly separated by hesitations of varying duration, is a good example of a symbol train. At the neuronal level, the sequence of action potentials in a <spike train> generated at the axon hillock of a cortical neuron is another very good example.

In a narrow sense, a <symbol train> is by definition a sequence of the form

    (*)   dt(1) s(1) dt(2) s(2) ... dt(K) s(K)

where s(1), s(2), ..., s(K) are unanalyzed <primitive symbols> - e.g., words, or spikes - and dt(1), dt(2), ..., dt(K) are intervals of time. (The indices 1, ..., K in (*) ought to be written as subscripts.) In the general case, a symbol train is a spatial-temporal distribution of weighted, targeted symbols

    (**)  dt(1) S(1) dt(2) S(2) ... dt(K) S(K)

where S(k) is a set of primitive symbols, each of which has a numerical <weight> and a <target>. The weight on a symbol is a general measure of <strength>, as are the weights in the artificial neural networks of connectionism. The target of a (weighted) primitive symbol is a symbol train processor (STP). A general definition of STP is not given here; a broad introduction to symbol train processing is available in a research report (Cooper, 1990a).

The notation (**) captures the two orthogonal dimensions of a symbol train, its successive simultaneities: the left-to-right reading of the notation corresponds to the passage of time from past to future, and the use of a <set> of weighted, targeted symbols corresponds to the <parallel> emission and absorption of messages.

Perhaps there are many different systematic ways to describe the generation and processing of symbol trains. In this work, though, <symbol train processing> refers to my particular graphical choice of mechanisms for symbol-train-valued functions of symbol train variables. Symbol train processing exemplifies RIST because it uses only five basic syntactical elements for drawing executable diagrams of algorithms and models of phenomena from a wide range of scientific experience.
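Before turning to those five elements, it may help to fix the notation (**) in code. Here is a minimal C sketch of one reading of it; the type and field names are my own illustration, not taken from the STP report:

    /* One weighted, targeted primitive symbol, as in the notation (**). */
    struct Symbol {
        int    id;      /* which primitive symbol (a word, a spike, ...)  */
        double weight;  /* the general measure of <strength>              */
        int    target;  /* which symbol train processor (STP) receives it */
    };

    /* One "successive simultaneity" dt(k) S(k): an interval of time
       followed by a set of symbols emitted and absorbed in parallel. */
    struct Slice {
        double         dt;     /* the interval of time dt(k) */
        int            count;  /* size of the set S(k)       */
        struct Symbol *set;    /* the set S(k) itself        */
    };

    /* A symbol train is then the sequence dt(1) S(1) ... dt(K) S(K). */
    struct SymbolTrain {
        int           K;
        struct Slice *slice;
    };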
The five elements are: circles, which represent states with lifetimes; rectangles, which represent symbol train processors (these are like subroutines); and three kinds of arrows: arrows with closed arrowheads, representing triggerable state transitions (as in finite state automata) which truncate the life of one state and force another state to be (re-)born; arrows with open arrowheads, representing state transitions with death of one state (timeout) and (re-)birth of another; and dashed arrows, representing paths for transmission of messages between processors. When a state dies naturally at the end of its lifetime (i.e., without having its life truncated by arrival of a trigger), it sends out messages. For flexibility of expressiveness, a lifetime of a state may be of length zero.

I have drawn, and in many cases executed, several dozen small STP systems to illustrate the expressiveness of the STP conventions. These include low-pass filters, band-pass filters, frequency doublers, bi- and multi-lateral inhibitors, the eight-queens problem, the traveling salesman problem, a photoreceptor model in neurophysiology, a wave motor, a time window, a delay transducer, an accumulation and exponential decay, an Euler method forward integration, a second-order Runge-Kutta forward integration, a second-order Runge-Kutta forward integration with adaptive time-step, a random walk of a single particle, a Michaelis-Menten enzyme-substrate chemical reaction, two models of the calcium oscillator of horseshoe crab ventral photoreceptors, an electron transfer between donor and acceptor molecules in a photo-active material, and the deterministic evolution (via timeouts) of a quantum mechanical system disturbed by measurements (via triggers).

The purpose of this application note is to exhibit how easy it is to set up the standard (non-optimized) backward-error propagation algorithm (as in [Rumelhart et al, PDP, v.1, pp. 324-327], <the delta rule for semilinear activation functions in feedforward networks>) using just a small subset of the STP conventions, namely the boxes and dashed arrows. The STP system graph shown above should be compared to the roughly 30 lines of C code for the BEP algorithm presented in [McClelland-Rumelhart, Explorations in PDP, pp. 138-140], with four doubly-nested loops.

In the following explanation of the graph I will be referring to a printout before me of a Macintosh Superpaint drawing (available on request, see below, and suitable for framing). The character-based sketch of that drawing at the top of this post is provided for rough guidance only. This is a four-layer feedforward network with input at the bottom, A and B, and output at the top, G and H, and two units in each hidden layer, C, D and E, F.

In all notations of the form N(X), the index X is really a subscript to N. Likewise for all notations of the form d(X); and in all notations of the form W(XY), the XY is a double subscript to W. All notations consisting of either single letters or subscripted letters are enclosed in tight-fitting rectangular boxes. All notations consisting of two such boxes separated by the multiplication operator, represented here by the period, <.>, are enclosed by another rectangular box. All vertical and diagonal dashed lines have arrowheads at their upper ends. All of the vertical or diagonal dashed arrows are labeled by the notations which have the multiplication operators on the same line as the arrow.
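One more preliminary: the sigmoidal activation function sgmd and its derivative sgmd' appear throughout the graph and in the := equations. A standard C rendering would use the logistic function of [Rumelhart et al, PDP, v.1]; this is my sketch, not code from the STP report:

    #include <math.h>

    /* The logistic activation function of [PDP, v.1]. */
    double sgmd(double x)
    {
        return 1.0 / (1.0 + exp(-x));
    }

    /* Its derivative, expressible through sgmd itself:
       sgmd'(x) = sgmd(x) * (1 - sgmd(x)). */
    double sgmd_prime(double x)
    {
        double s = sgmd(x);
        return s * (1.0 - s);
    }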
All of the notations enclosed in rectangular boxes represent <numerical STPs>. A numerical STP has at any given time a state, which is a numerical value. In other words, a numerical STP is just a variable, as in any programming language. But a numerical STP should be thought of as an object, as in object-oriented programming, since dashed arrows arriving at a numerical STP carry messages which change the state of the numerical STP. Not only that: if two arrows terminate at a numerical STP, then simultaneous messages are summed together, so that the new state is the sum of the values of the messages. All notations of the form N(X) sum their inputs in this way.

There are two kinds of numerical STPs, the independent and the dependent ones. This is just the same as the distinction between independent and dependent variables in calculus. In the STP graph above, the independent numerical STPs are A, B (the inputs), P, Q (the desired outputs), and all notations of the form W(XY) (the weight from unit Y to unit X). All other numerical STPs are dependent. The rule for computing the new state of a dependent STP is simply that it should be re-computed whenever any of its arguments changes state. For example, sgmd(N(C)), which is the sigmoidal activation function sgmd applied to the numerical STP N(C), will be re-computed whenever either of the inputs A or B changes, since N(C) receives messages from both inputs. If they both change at the same time, then the new state of sgmd(N(C)) is sgmd(W(CA).A + W(CB).B). A dependent numerical STP cannot change state by receiving messages: it changes only in response to changes in its arguments.

Both independent and dependent numerical STPs can emit messages when they change state. The dashed arrow with its tail end attached to a box carries the messages from the box. If the arrow is not labeled, then by default the message is assumed to be the value of the new state of the STP represented by the box. For example, when d(G) changes state (for the reason given below), the state of the dependent numerical STP d(G).F changes, which causes the value d(G).F to be sent as a message to the independent numerical STP W(GF). Since the notations d(X) are the <delta terms>, you see that all the horizontal dashed arrows in the STP graph carry weight-changing messages.

With the explanation of STP conventions out of the way, the action of the STP graph for BEP can now be described as follows. Suppose the independent numerical STPs of the form W(XY), the weights, are given initial values by any means whatever (e.g., by a pseudo-random number generator). Suppose we want the network to learn to produce output P, Q given input A, B. The training cycle is as follows. Send the desired values to P and Q. Then the STPs which depend on P and Q, namely d(G) and d(H), will be updated and will send messages down the graph which adjust the weights W(GF), W(HE), W(GE), and W(HF). These changes force re-computation of d(E) and d(F) according to the equations written with := at the bottom of the STP graph. Hence W(ED), etc. are re-computed, and so on, as the consequences of those desired values for P and Q trickle down the graph.

Now, suppose messages to A and B arrive simultaneously with the inputs which are supposed to produce the P and Q. Then sgmd(N(C)) and sgmd(N(D)) are re-computed, which forces sgmd(N(E)) and sgmd(N(F)) to be re-computed, which forces sgmd(N(G)) and sgmd(N(H)) to be re-computed, which forces d(G) and d(H) to be re-computed, and we are back to the trickle-down. The floating of values up from A and B followed by the trickling down of values from d(G) and d(H) constitutes one training cycle. If the resulting values in G := sgmd(N(G)) and H := sgmd(N(H)) are not sufficiently near to the desired P and Q, respectively, then send the same messages to A and B again. (This recycling can be accomplished with a little STP oscillator, not to be given here.)
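To make the cycle concrete, here is a minimal C sketch of one training cycle for the 2-2-2-2 network above, using the sgmd and sgmd_prime of the earlier sketch. It is my own rendering, not the STP interpreter; the learning-rate factor eta is the standard one of [PDP, v.1] and is left implicit in the STP graph:

    /* Uses sgmd() and sgmd_prime() from the sketch above. */

    double W_CA, W_CB, W_DA, W_DB;   /* weights into hidden layer C, D */
    double W_EC, W_ED, W_FC, W_FD;   /* weights into hidden layer E, F */
    double W_GE, W_GF, W_HE, W_HF;   /* weights into output layer G, H */

    void train_cycle(double A, double B, double P, double Q, double eta)
    {
        /* Floating up: each N(X) sums its incoming messages W(XY).Y. */
        double NC = W_CA*A + W_CB*B,  C = sgmd(NC);
        double ND = W_DA*A + W_DB*B,  D = sgmd(ND);
        double NE = W_EC*C + W_ED*D,  E = sgmd(NE);
        double NF = W_FC*C + W_FD*D,  F = sgmd(NF);
        double NG = W_GE*E + W_GF*F,  G = sgmd(NG);
        double NH = W_HE*E + W_HF*F,  H = sgmd(NH);

        /* Trickling down: the := equations at the bottom of the graph. */
        double dG = (P - G) * sgmd_prime(NG);
        double dH = (Q - H) * sgmd_prime(NH);
        double dE = (dG*W_GE + dH*W_HE) * sgmd_prime(NE);
        double dF = (dG*W_GF + dH*W_HF) * sgmd_prime(NF);
        double dC = (dE*W_EC + dF*W_FC) * sgmd_prime(NC);
        double dD = (dE*W_ED + dF*W_FD) * sgmd_prime(ND);

        /* The horizontal dashed arrows: weight-changing messages of the
           form d(X).Y delivered to W(XY), scaled here by eta. */
        W_GE += eta*dG*E;  W_GF += eta*dG*F;
        W_HE += eta*dH*E;  W_HF += eta*dH*F;
        W_EC += eta*dE*C;  W_ED += eta*dE*D;
        W_FC += eta*dF*C;  W_FD += eta*dF*D;
        W_CA += eta*dC*A;  W_CB += eta*dC*B;
        W_DA += eta*dD*A;  W_DB += eta*dD*B;
    }

Note one nuance: the sketch computes all six delta terms before applying any weight change, as in [PDP, v.1]; in the STP graph as described above, the weight-adjusting messages arrive before the lower delta terms are re-computed.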
What's the point of implementing BEP in STP? The outermost question is: why have a visual programming language? Because many of us programmers feel cramped, having to work in 1-dimensional, line-by-line programming. We want to have a 2- or 3-dimensional representation of algorithms which is to regular programming as painting or sculpture is to music. Also, there are pedagogical advantages to visual programming. In particular, the learning curve for object-oriented programming could be diminished by the right visual programming language.

Why is symbol train processing a very good visual programming language? The great variety of examples which have been graphed or executed shows that symbol train processing is comprehensive; its syntax is obviously simple, and the semantics is immediately intuited. It was inspired by neuroscience in the same sense that FORTRAN was inspired by scientific and engineering needs, LISP by AI needs, PASCAL by computer science education needs, APL by algebra needs, PROLOG by logic needs, FORTH by control needs, or C by programming needs.

The point of implementing BEP in STP is, first, to show that it can be done - just more evidence for the comprehensiveness of STP; second, BEP fits naturally into a simple subset of an already very simple visual language; third, STP lends itself readily to concurrent, object-oriented interpretation, so BEP inherits those features.

The following reference materials on symbol train processing are available by email request (be sure to include your address). The easiest way for me to send these things is on a Macintosh 800K disk; also, that way paper is used only for what you really need in hard copy form. (A printout of the Superpaint drawing will accompany the disk.)

Symbol Train Processing (April, 1990a; 50 pages, 20 figures)

An Object-Oriented Interpreter for Symbol Train Processing (May, 1990b; 15 pages, 3 figures; includes the Mathematica STP Notebook and the ChemiKine Instructions)

Symbol Train Processing Applied to Formalizing Mental Models of Calcium Oscillators (July, 1990c; 28 pages, 11 figures)

Motivation and Status of Symbol Train Processing (July, 1990d; 7 pages)

The following demonstration programs are available on a standard 1.2 MB floppy disk to run on an IBM PC/AT clone type of machine:

Bilateral Inhibition (inh.exe in C)

Frequency Doubler (demo.exe in C under Windows)