drt@chinet.chi.il.us (Donald Tveter) (09/16/90)
Posting-number: Volume 14, Issue 84 Submitted-by: Donald Tveter <drt@chinet.chi.il.us> Archive-name: back-prop/part01 #! /bin/sh # This is a shell archive. Remove anything before this line, then unpack # it by saving it into a file and typing "sh file". To overwrite existing # files, type "sh file -c". You can also feed this as standard input via # unshar, or by typing "sh <file", e.g.. If this archive is complete, you # will see the following message at the end: # "End of archive 1 (of 4)." # Contents: README # Wrapped by drt@chinet on Fri Aug 31 08:17:03 1990 PATH=/bin:/usr/bin:/usr/ucb ; export PATH if test -f 'README' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'README'\" else echo shar: Extracting \"'README'\" \(34957 characters\) sed "s/^X//" >'README' <<'END_OF_FILE' X.ce XFast Back-Propagation X.ce XCopyright (c) 1990 by Donald R. Tveter X X X.ul XIntroduction X X The programs described below were produced for my own use in studying Xback-propagation and for doing experiments that are found in my Xintroduction to Artificial Intelligence textbook, \fIThe Basis of XArtificial Intelligence\fR, to be published by Computer Science Press. XI have copyrighted these files but I hereby give permission to anyone to Xuse them for experimentation, educational purposes or to redistribute Xthem on a not for profit basis. All others that may want to use, change Xor redistribute these programs for commercial purposes, should contact Xme by mail at: X X.na X.nf X Dr. Donald R. Tveter X 5228 N. Nashville Ave. X Chicago, Illinois 60656 X USENET: drt\@chinet.chi.il.us X.ad X.fi X XAlso, I would be interested in hearing your suggestions, bug reports Xand major successes or failures. X X There are four simulators that can be constructed from the Xincluded files. The program, rbp, does back-propagation using double Xprecision floating point weights and arithmetic. 
The program, bp, does Xback-propagation using 16-bit integer weights, 16 and 32-bit integer Xarithmetic and some double precision floating point arithmetic. The Xprogram, sbp, uses 16-bit integer symmetric weights but only allows Xtwo-layer networks. The program srbp does the same using 64-bit Xfloating point weights. The purpose of sbp and srbp is to produce Xnetworks that can be used with the Boltzmann machine relaxation Xalgorithm (not included). X X In most cases, the 16-bit integer programs are the most useful, Xbecause they are the fastest. With a 10 MHz 68010, connections can be Xprocessed at up to about 45,000 per second and weight changes can be Xdone at up to about 25,000 per second. These values depend on the exact Xproblem. The integer versions will probably be faster on most machines Xthan the versions that use real arithmetic. Unfortunately, sometimes X16-bit integer weights don't have enough range or precision and then Xusing the floating point versions may be necessary. Many other speed-up Xtechniques are included in these programs. X X.ul XMaking the Simulators X X To make a particular executable file, use the makefile given Xwith the data files and make any or all of them like so: X X.ce Xmake bp X.ce X make sbp X.ce X make rbp X.ce X make srbp X XOne option exists for bp and sbp. If your compiler is smart enough Xto divide by 1024 by shifting, use "-DSMART". 
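The division by 1024 comes from the way the integer versions store weights: the Limitations section says each 16-bit weight holds the real value multiplied by 1024. The C fragment below is only a sketch of that representation, not code taken from bp or sbp. The names SCALE, to_fixed, to_real and fixed_mul are made up for the illustration.

```c
#include <stdint.h>

#define SCALE 1024   /* a real value of 1.0 is stored as the integer 1024 */

/* Convert a real weight to the 16-bit scaled form (hypothetical helper).
   Only meaningful for -32.0 <= w < 32.0, the range a 16-bit value holds. */
int16_t to_fixed(double w)
{
    return (int16_t)(w * SCALE);
}

/* Convert a scaled 16-bit weight back to a real number. */
double to_real(int16_t w)
{
    return (double)w / SCALE;
}

/* Multiply a scaled unit value by a scaled weight.  The 32-bit product
   carries the scale factor twice, so it is divided by 1024 once.  For
   non-negative products, "product >> 10" gives the same result; for
   negative products the two can differ, and right-shifting a negative
   value is implementation-defined in C. */
int32_t fixed_mul(int16_t unit, int16_t weight)
{
    int32_t product = (int32_t)unit * (int32_t)weight;
    return product / SCALE;
}
```

Under this scheme a weight of 7.2792968750 would be stored as 7454, and the largest storable weight is 32767/1024, which is where the upper limit of about 31.999 comes from.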
X X To make a record of all the input and output from the programs, Xthe following small UNIX command file I call record can be used: X X.na X.nf Xtrap "" 2 Xoutfile="${1}.record" Xif test -f $outfile X then X rm $outfile X fi Xecho $outfile X(tee -a $outfile | $*) | tee -a $outfile Xprocess=`ps | grep tee | cut -c1-6` Xkill $process X.ad X.fi X XFor example to make a record of all the input and output from the Xprogram bp using data file, xor, use: X X.ce Xrecord bp xor X X X.ul XA Simple Example X X Each version would normally be called with the name of a file to read Xcommands from, as in: X X.ce Xbp xor X XWhen no file name is specified, bp expects to take commands from the Xkeyboard (UNIX stdin file). After the file name from the command line Xis read and the commands in the file are executed, commands are then Xtaken from the keyboard. X X The commands are one letter commands. Most commands have Xoptional parameters. The `*' character is a comment. It can be used Xto make the remainder of the line a comment. Here is an example of Xan input file to do the xor problem: X X.na X.nf X* input file for the xor problem X Xm 2 1 1 * make a 2-1-1 network Xc 1 1 3 1 * add this extra connection Xc 1 2 3 1 * add this extra connection Xs 7 * seed the random number function Xk 0 1 * give the network random weights X Xn 4 * read four new patterns into memory X1 0 1 X0 0 0 X0 1 1 X1 1 0 X Xe 0.5 * set eta to 0.5 (and eta2 to 0.05) Xa 0.9 * set alpha to 0.9 X.ad X.fi X XIn this example, the m command is a command to make a network. The Xnumbers following it are the number of units for each layer. The m Xcommand connects adjacent layers with weights. The following c Xcommands create extra connections from layer 1, unit 1 to layer 3, Xunit 1 and from layer 1, unit 2 to layer 3, unit 1. The `s' command Xsets the seed for the random number function. The `k' command then Xgives the network random weights. The `k' command has another use as Xwell. 
It can be used to try to kick a network out of a local minimum. XHere, the meaning of "k 0 1" is to examine all the weights in the Xnetwork and for every weight equal to 0 (and they all start out at 0), Xadd in a random number between -1 and +1. The `n' command Xspecifies four new patterns to be read into memory. With the `n' Xcommand, any old patterns that may have been present are removed. XThere is also an `x' command that behaves like the `n' command, except Xthe `x' command \fIadds\fR the extra patterns to the current training Xset. The input pattern comes first, followed by the output pattern. XThe statement, e 0.5, sets eta, the learning rate, to 0.5 and eta2 from Xthe differential step size algorithm to one tenth this, or 0.05. The Xlast line sets alpha, the momentum parameter, to 0.9. X X The above statements set up the network and when the list of Xcommands runs out, commands are taken from the keyboard. The Xfollowing messages and prompt appear: X X.na X.nf X.ne2 XFast Backpropagation Copyright (c) 1990 by Donald R. Tveter Xtaking commands from stdin now X[?!*AabCcEefHhijklmnoPpQqRrSstWwx]? X.ad X.fi X XThe square brackets enclose a list of the possible commands. XThe `r' command is used to run the training algorithm. Typing in "r 200 X100" as shown below, means run 200 iterations through the patterns Xand print the output patterns every 100 iterations: X X.na X.nf X.ne3 X[?!*AabCcEefHhijklmnoPpQqRrSstWwx]? r 200 100 Xrunning . . . X.ne5 X100 iterations, s 7, k 0 1.00, file = xor X 1 0.81 (0.03739) X 2 0.13 (0.01637) X 3 0.85 (0.02262) X 4 0.17 (0.02988) X.ne5 X159 iterations, s 7, k 0 1.00, file = xor X 1 0.90 (0.00973) X 2 0.07 (0.00467) X 3 0.92 (0.00565) X 4 0.09 (0.00739) Xpatterns learned to within 0.10 at iteration 159 X.ad X.fi X XThe program immediately prints out the "running . . ." message. 
After Xeach 100 iterations, a header line giving some program parameters Xis printed out, followed by the results that occur when each of the four Xpatterns is submitted to the network. If the second number defining Xhow often to print out values is omitted, the values will not print Xeven when the learning is finished. The values in parentheses at the Xend of each line give the sum of the squared error on the output units Xfor each output pattern. These error numbers are useful to see because Xthey give you some idea of how fast each pattern is being learned. XThe program also reports that the patterns have been learned to within Xthe default tolerance of 0.1. This check for the tolerance being met Xis done for every learning iteration. Sometimes in the integer version Xthe program will do a few extra iterations before declaring Xthe problem done. This is because of truncation errors in the Xarithmetic done to check for convergence. X X A particular test pattern can be input to the network with the `p' Xcommand, as in: X X.na X.nf X.ne2 X[?!*AabCcEefhijklmnoPpQqRrSstwx]? p 1 0 X 0.91 X.ad X.fi X XTo have the system evaluate a particular stored pattern, say pattern Xnumber 4, use the `P' command as in: X X.na X.nf X.ne2 X[?!*AabCcEefHhijklmnoPpQqRrSstWwx]? P4 X 4 0.09 (0.00739) X.ad X.fi X XTo print all the values for all the training patterns without doing Xany learning, type `P': X X.na X.nf X.ne5 X[?!*AabCcEefHhijklmnoPpQqRrSstWwx]? P X 1 0.90 (0.00973) X 2 0.07 (0.00467) X 3 0.92 (0.00565) X 4 0.09 (0.00739) X.ad X.fi X X One thing you might want to know is the values of the weights Xthat have been produced. To see them, there is the `w' command. XThe `w' command gives the value of the weights leading into Xa particular unit and also data about how the activation value of the Xunit is computed. Two integers after the w specify the layer and Xunit number within the layer whose weights should be printed. 
For Xexample, if you want the weights leading into the unit at layer 2, Xposition number 1, type: X X.na X.nf X.ne6 X[?!*AabCcEefHhijklmnoPpQqRrSstWwx]? w 2 1 Xlayer unit unit value weight input from unit X 1 1 1.00000 7.27930 7.27930 X 1 2 1.00000 -5.66797 -5.66797 X 2 t 1.00000 2.74902 2.74902 X sum = 4.36035 X.ad X.fi X XIn this example, the unit at layer 2, number 1 is receiving input from Xunits 1 and 2 in the previous (the input) layer and from a unit, t. XUnit t is the threshold unit. The "unit value" column gives the value Xof the input units for the last time some pattern was placed on the Xinput units. In this case, the fourth pattern was the last one that the Xnetwork has seen. The next column lists the weights on the connections Xinto the unit at (2,1). The final column is the result from multiplying Xtogether the unit value and the weight. Beneath this column, the sum of Xthe inputs is given. X X Another important command is the help command. It is the letter X`h' (not `?') followed by the letter of the command. The help command Xwill give a brief summary of how to use the command. Here, we type Xh h for help with help: X X.na X.nf X.ne3 X[?!*AabCcEefHhijklmnoPpQqRrSstWwx]? h h X Xh <letter> gives help for command <letter>. X.ad X.fi X X Finally, to end the program, the `q' (for quit) command is entered: X X[?!*AabCcEefHhijklmnoPpQqRrSstWwx]? q X X.ul XInput and Output Formats X X The programs are able to read patterns in two different formats. The Xdefault input format is the compressed (condensed) format. In it, each Xvalue is one character and it is not necessary to have blanks between Xthe characters. For example, in compressed format, the patterns for xor Xcould be written out in either of the following ways: X X.ce X101 10 1 X.ce X000 00 0 X.ce X011 01 1 X.ce X110 11 0 X XThe second example is preferable because it makes it Xeasier to see the input and the output patterns. 
Compressed format can Xalso be used to input patterns with the `p' command. XIn addition to using 1 and 0 as input, the character, `?' can be used. XThis character is initially defined to be 0.5, but it can be redefined Xusing the Q command like so: X X.ce XQ 0.7 X XThis sets the value of ? to 0.7. Other valid input characters are the Xletters, `h', `i', `j' and `k'. The `h' stands for `hidden'. Its Xmeaning in an input string is that the value at this point in the string Xshould be taken from the next unit in the second layer of the network. XNormally this will be the second layer of a three-layer network. This Xnotation is useful for specifying simple recurrent Xnetworks. Naturally, `i', `j' and `k' stand for taking input Xvalues from the third, fourth and fifth layers (if they exist). A Xsimple example of a recurrent network is given later. X X The other input format for numbers is real. The number portion must Xstart with a digit (.35 is not allowed, but 0.35 is). Exponential Xnotation is not allowed. Real numbers have to be separated by a space. XThe `h', `i', `j', `k' and `?' characters are also allowed with real Xinput patterns. To take input in this format, it is necessary Xto set the input format to be real using the `f' (format) command as in: X X.ce Xf ir X XTo change back to the compressed format, use: X X.ce Xf ic X XOutput format is controlled with the `f' command as in: X X.ce Xf or X.ce Xf oc X.ce Xf oa X XThe first sets the output to real numbers. The second sets the Xoutput to be condensed mode where the value printed will be a `1' when Xthe unit value is greater than 1.0 - tolerance, a `^' when the value Xis above 0.5 but less than 1.0 - tolerance, a `v' when the value is Xless than 0.5 but greater than the tolerance. Below the tolerance Xvalue, a `0' is printed. The tolerance can be changed using the `t' Xcommand. 
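The condensed output rule just described can be written as a small C function. This is a sketch of the rule as stated here, not the code the simulators use. The function name condensed is made up, and what is printed for a value of exactly 0.5, or exactly on one of the tolerance boundaries, is an assumption because the text does not say.

```c
/* Return the condensed-format character for one output value.
   Assumption: a value of exactly 0.5 is shown as '^', and values exactly
   at 1.0 - tolerance or at tolerance fall into the lower class. */
char condensed(double value, double tolerance)
{
    if (value > 1.0 - tolerance)
        return '1';              /* within tolerance of 1 */
    if (value >= 0.5)
        return '^';              /* upper half, but not close to 1 */
    if (value > tolerance)
        return 'v';              /* lower half, but not close to 0 */
    return '0';                  /* within tolerance of 0 */
}
```

With the default tolerance of 0.1, the value 0.91 prints as `1', 0.6 as `^', 0.3 as `v' and 0.05 as `0'.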
For example, to make all values greater than 0.8 print Xas `1' and all values less than 0.2 print as `0', use: X X.ce Xt 0.2 X XOf course, this same tolerance value is also used to check to see if all Xthe patterns have converged. The third output format is meant to Xgive "analog condensed" output. In this format, a `c' is printed when Xa value is close enough to its target value. Otherwise, if the answer Xis close to X1, a `1' is printed, if the answer is close to 0, a `0' is printed, if Xthe answer is above the target but not close to 1, a `^' is printed and Xif the answer is below the target but not close to 0, a `v' is printed. XThis output format is designed for Xproblems where the output is a real number, as for instance, when the Xproblem is to make a network learn sin(x). X X With the f command, a number of sub-commands can be put on one line Xas in the following, where the input is set to real and the output Xis set to analog condensed: X X.ce Xf ir oa X XAlso, for the sake of convenience, the output format (and only the Xoutput format) can be set without using the `f', so that: X X.ce Xor X Xwill also make the output format real. X X In the condensed formats, the default is to print a blank after every X10 values. This can be altered using the `b' (for inserting breaks) Xcommand. The use for this command is to separate output values into Xlogical groups to make the output more readable. For instance, you may Xhave 24 output units where it makes sense to insert blanks after the X4th, 7th and 19th positions. To do this, specify: X X.ce Xb 4 7 19 X XThen for example, the output will look like: X X.na X.nf X 1 10^0 10^ ^000v00000v0 01000 (0.17577) X 2 1010 01v 0^0000v00000 ^1000 (0.16341) X 3 0101 10^ 00^00v00000v 00001 (0.16887) X 4 0100 0^0 000^00000v00 00^00 (0.19880) X.ad X.fi X XThe `b' command allows up to 20 break positions to be specified. XThe default output format is the real format with 10 numbers per Xline. 
For the output of real values, the `b' command specifies when to Xprint a carriage return, rather than when to print a blank. X X Sometimes the training set is so large that it is annoying to Xhave all the patterns print out every n iterations. To get a summary of Xhow learning is going, instead of all these patterns, use "f s+". XNow, if the command in the xor problem was "r 200 50" the following Xoutput summary will result: X X.na X.nf X 50 0 learned 4 unlearned 0.48364 error/unit X 100 0 learned 4 unlearned 0.16528 error/unit X 150 3 learned 1 unlearned 0.08813 error/unit X 159 4 learned 0 unlearned 0.08203 error/unit Xpatterns learned to within 0.10 at iteration 159 X.ad X.fi X XThe program counts up how many patterns were learned or not learned Xin each training pass before the weights are updated. Therefore, the Xstatus is one iteration out of date. The error/unit is the average Xabsolute value of the error on each unit for each pattern. To switch Xback to the longer report, use "f s-". The P command will list all the Xpatterns no matter what the setting of the summary parameter is. X X.ul XSaving and Restoring Weights and Related Values X X Sometimes the amount of time and effort needed to produce a set of Xweights to solve a problem is so great that it is more convenient to Xsave the weights rather than constantly recalculate them. Weights can Xbe saved as real values (the default) or as binary, to save space. To Xsave the weights enter the command, `S'. The weights are written on a Xfile called "weights". The following file comes from the Xxor problem: X X.na X.nf X159r file = xor X 7.2792968750 X -5.6679687500 X 2.7490234375 X 5.8486328125 X -5.0400390625 X -11.8574218750 X 8.3193359375 X.ad X.fi X XTo write the weights, the program starts with the second layer, writes Xout the weights leading into these units in order with the threshold Xweight last. Then it moves on to Xthe third layer, and so on. To restore these weights, type an `R' for Xrestore. 
At this time, the program reads the header line and sets the Xtotal number of iterations the program has gone through to be the first Xnumber it finds on the header line. It then reads the character Ximmediately after the number. The `r' indicates that the weights will Xbe real numbers represented as character strings. If the weights were Xbinary, the character would be a `b' rather than an `r'. Also, if the Xcharacter is `b', the next character is read. This next character Xindicates how many bytes are used per value. The integer versions, bp Xand sbp write files with 2 bytes per weight, while the real versions, Xrbp and srbp write files with 8 bytes per weight. With this notation, Xweight files written by one program can be read by the other. A binary Xweight format is specified within the `f' command by using "f wb". A Xreal format is specified by using "f wr". If your program specifies Xthat weights should be written in one format, but the weight file you Xread from is different, a warning message will be printed. There is no Xcheck made to see if the number of weights on the file equals the number Xof weights in the network. X X The above formats specify that only weights are written out and Xthis is all you need once the patterns have converged. However, if Xyou're still training the network and want to break off training and Xpick up the training from exactly the same point later, you need to save Xthe old weight changes when using momentum, and the parameters for the Xdelta-bar-delta method if you are using this technique. To save these Xextra parameters on the weights file, use "f wR" to write the extra Xvalues as real and "f wB" to write the extra values as binary. X X In the above example, the command S, was used to save the weights Ximmediately. Another alternative is to save weights at regular Xintervals. The command, S 100, will automatically save weights every X100 iterations the program does, that is, when the total iterations mod X100 = 0. 
The initial rate at which to save weights is set at 100,000, Xwhich generally means that no weights will ever be saved. X X Another use for saving weights has to do with trying to find the Xproper parameters to quickly solve the problem. Ordinarily, a high Xrate of learning is desirable, but often too high a rate of learning Xwill increase the error, rather than decrease it. In trying to find Xthe answer as quickly as possible, if the network seems to be Xconverging with the current parameters you can save the current weights Xand increase the learning rate. If this increased learning rate ruins Xthe convergence, then you can restore the weights you had before you Xmade this increase. X X X.ul XInitializing Weights and Giving the Network a `Kick' X X All the weights in the network initially start out at 0. In Xsymmetric networks then, no learning may result because error signals Xcancel themselves out. Even in non-symmetric Xnetworks, the training process will often converge faster if the weights Xstart out at small random values. To do this, the `k' command will Xtake the network and alter the weights in the following ways. Suppose Xthe command given is: X X.ce Xk 0 0.5 X XNow, if a weight is exactly 0, then the weight will be changed to a Xrandom value between +0.5 and -0.5. The above command can therefore be Xused to initialize the weights in the network. A more complex use of Xthe `k' command is to decrease the magnitude of large weights in the Xnetwork by a certain random amount. For instance, in the following Xcommand: X X.ce Xk 2 8 X Xall the weights in the network that are greater than or equal to 2, will Xbe decreased by a random number between 0 and 8. Weights Xless than or equal to -2 will be increased by a random number Xbetween 0 and 8. The seed to the random number generator can be Xchanged using the `s' command as in "s 7". The integer parameter in the X`s' command is of type, unsigned. 
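The two uses of the `k' command can be summarized in a few lines of C. This is a sketch of the behavior described above, not the simulators' own code. The names kick and unit_random are made up, the standard rand() function stands in for whatever random number function the programs use, and leaving non-zero weights alone when the first parameter is 0 is an assumption drawn from the "k 0 0.5" example.

```c
#include <stdlib.h>

/* Uniform random value in [0, 1], built on the standard rand(). */
static double unit_random(void)
{
    return (double)rand() / (double)RAND_MAX;
}

/* Apply "k size range" to n weights (hypothetical helper).
   size == 0: every weight that is exactly 0 becomes a random value
              between -range and +range; other weights are left alone.
   size  > 0: weights >= size are decreased, and weights <= -size are
              increased, by a random amount between 0 and range. */
void kick(double *w, int n, double size, double range)
{
    int i;

    for (i = 0; i < n; i++) {
        if (size == 0.0) {
            if (w[i] == 0.0)
                w[i] = (2.0 * unit_random() - 1.0) * range;
        } else if (w[i] >= size) {
            w[i] -= unit_random() * range;
        } else if (w[i] <= -size) {
            w[i] += unit_random() * range;
        }
    }
}
```

Calling srand(7) before kick(w, n, 0.0, 1.0) plays the role of "s 7" followed by "k 0 1", although the actual random values will not match those of the simulators.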
X X Another method of giving a network a kick is to add hidden layer Xunits. The command: X X.ce XH 2 0.5 X Xadds one unit to layer 2 of the network and all the weights that are Xcreated are initialized to between - 0.5 and + 0.5. X X The subject of kicking a back-propagation network out of local minima Xhas barely been studied and there is no guarantee that the above methods Xare very useful in general. X X.ul XSetting the Algorithm to Use X X A number of different variations on the original back-propagation Xalgorithm have been proposed in order to speed up convergence. Some Xof these have been built into these simulators. Some of the methods Xcan be mixed together. The two most important choices are the Xderivative term to use and the update method to use. The default Xderivative is the one devised by Fahlman: X X.ce X0.1 + s(1-s) X Xwhere s is the activation value of the unit. The reason for adding in Xthe 0.1 term to the correct formula for the derivative is that when s is Xclose to 0 or 1, the amount of error passed back is very small and so Xlearning is very slow. Adding the 0.1 speeds up the learning process. X(For the original description of this method, see "Faster Learning XVariations of Back-Propagation: An Empirical Study", by Scott E. XFahlman, in \fIProceedings of the 1988 Connectionist Models Summer XSchool\fR, Morgan Kaufmann, 1989.) Besides Fahlman's derivative and the Xoriginal one, the differential step size method (see "Stepsize Variation XMethods for Accelerating the Back-Propagation Algorithm", by Chen and XMars, in \fIIJCNN-90-WASH-DC\fR, Lawrence Erlbaum, 1990) takes the Xderivative to be 1 in the layer going into the output units and uses the Xoriginal derivative for all other layers. The learning rate for the Xinner layers is normally set to 1/10 the rate in the outer layer. 
To Xset the derivative, use the `A' command as in: X X.ne4 X A do * use the original derivative X A df * use Fahlman's derivative X A dd * use the differential step size derivative X X The algorithm command can contain other sub-commands besides the Xsetting of the derivative. The other major choice is the update method. XThe choices are the original one, the differential step size method, XJacobs' delta-bar-delta method, the continuous update method and the Xcontinuous update method with the differential step size etas. To set Xthese update methods use: X X.na X.nf X.ne6 X A uo * the original update method X A ud * the differential step size method X A uj * Jacobs' delta-bar-delta method X A uc * the continuous update method X A uC * the continuous update method with the differential X * step size etas X.ad X.fi X XThe differential step size method uses the standard eta when updates Xare made to the units leading into the output layer. For deeper layers, Xanother value will be used. The default is to use an eta, called eta2, Xfor the inner layers that is one-tenth the standard eta. These etas Xboth get set using the `e' command (not a sub-command of the `A' Xcommand) as in: X X.ce Xe 0.5 0.1 X XThe standard eta will be set to 0.5 and eta2 will be 0.1. If eta2 Xhad been omitted, it would have been set to 0.05. Jacobs' Xdelta-bar-delta method uses a number of special parameters and these Xare set using the `j' command. Jacobs' update method can actually be Xused with any of the three choices for derivatives and the algorithm Xwill find its own value of eta for each weight. The differential Xstep size derivative is often very effective with Jacobs' Xdelta-bar-delta method. X X There are five other `A' sub-commands. First, the activation Xfunction can be either the piece-wise linear function or the original Xsmooth activation function, but the smooth function is only available Xwith the programs that use real weights and arithmetic. 
To set the Xtype of function, use: X X A ap * for the piece-wise activation function X A as * for the smooth activation function X XThe piece-wise function can save quite a lot in execution time despite Xthe fact that it normally increases the number of iterations required Xto solve a problem. X X Second, it has been reported that using a sharper sigmoid shaped Xactivation function will produce faster convergence (see "Speeding Up XBack Propagation" by Yoshio Izui and Alex Pentland in the Proceedings of X\fIIJCNN-90-WASH-DC\fR, Lawrence Erlbaum Associates, 1990). If we let Xthe function be: X X 1 / (1 + exp (-D * x)), X Xincreasing D will make the sigmoid sharper while decreasing D will Xmake it flatter. To set this parameter to, say, 8, use: X X.ce XA D 8 * sets the sharpness to 8 X XThe default value is 1. A larger D is also useful in the integer Xversion of back-propagation where the weights are limited to between X-32 and +31.999. A larger D value in effect magnifies the weights and Xmakes it possible for the weights to stay smaller. Values of D less Xthan one may be useful in extracting a network from a local minimum X(see "Handwritten Numeral Recognition by Multi-layered Neural Network Xwith Improved Learning Algorithm" by Yamada, Kami, Temma and Tsukumo in XProceedings of the 1989 IJCNN, IEEE Press). Also, when you have large Xinput values, values of D less than 1 can be used to scale down the Xactivation to higher level units. X X The third miscellaneous command is the `b' command to control Xwhether or not to backpropagate error for units that have learned Xtheir response to within a given tolerance. The default is to Xalways backpropagate error. The advantage to not backpropagating Xerror is that this can save computer time and sometimes actually Xdecrease the number of iterations that are required to solve the Xproblem. 
This parameter can be set like so: X X A b+ * always backpropagate error X A b- * don't backpropagate error when close X X The fourth `A' sub-command allows you to limit the weights Xthat the network produces to some restricted range. This can be Ximportant in the programs with 16-bit weights. These programs limit Xthe weights to be from -32 to +31.999. When a weight near +31.999 is Xincreased a little it can overflow and produce a negative value. When Xone or more weights overflow, the learning usually takes a dramatic Xturn for the worse, or on rare occasions, it suddenly improves. To Xhave the program check for weights above 30 or below -30, enter: "A l X30". This also limits the Xabsolute values of the weights to be less than or equal to 30. The Xweights are checked after they have been updated and if a weight is Xgreater than this limit, it is set equal to this limit. The first time Xthis happens, a warning message is produced. With this method, it is Xpossible, in principle, for a large weight change to cause overflow Xwithout being caught, but this is unlikely. To stop the weight Xchecking, set the limit to 0. The default is to not check. X X The final miscellaneous `A' sub-command is `s', for skip. Setting Xs = n will have the program skip, for n iterations, whole patterns that Xhave been learned to within the required tolerance. For example, to Xskip learned patterns for 5 iterations, use: "A s 5". X X.ul XJacobs' Delta-Bar-Delta Method and Parameters X X Jacobs' delta-bar-delta method attempts to find a learning rate, Xeta, for each individual weight. The parameters are the initial Xvalue for the etas, the amount by which to increase an eta that seems Xto be too small, the rate at which to decrease an eta that is apparently Xtoo large, a maximum value for each eta and a parameter used in keeping Xa running average of the slopes. 
Here are examples of setting these Xparameters: X X.na X.nf X j d 0.5 * sets the decay rate to 0.5 X j e 0.1 * sets the initial etas to 0.1 X j k 0.25 * sets the amount to increase etas by (kappa) to X * 0.25 X j m 10 * sets the maximum eta to 10 X j t 0.7 * sets the history parameter, theta, to 0.7 X.ad X.fi X XThese settings can all be placed on one line: X X.ce Xj d 0.5 e 0.1 k 0.25 m 10 t 0.7 X XThe version implemented here does not use momentum. X X The idea behind the delta-bar-delta method is to let the program find Xits own learning rate for each weight. The `e' sub-command sets the Xinitial value for each of these learning rates. When the program sees Xthat the slope of the error surface averages out to be in the same Xdirection for several iterations for a particular weight, the program Xincreases the eta value by an amount, kappa, given by the `k' parameter. XThe network will then move down this slope faster. When the program Xfinds the slope changes sign, the assumption is that the program has Xstepped over to the other side of the minimum and it is nearing the Xminimum from the opposite side. Therefore, it cuts down the learning Xrate by the decay factor given by the `d' parameter. For instance, a Xd value of 0.5 cuts the learning rate for the weight in half. The `m' Xparameter specifies the maximum allowable value for an eta. The `t' Xparameter (theta) is used to compute a running average of the slope of Xthe weight and must be in the range 0 <= t < 1. The running average at Xiteration i, a\di\u, is defined as: X X.ce Xa\di\u = (1 - t) slope\di\u + t a\di-1\u, X Xso small values for t make the most recent slope more important than Xthe previous average of the slope. Determining the learning rate for Xback-propagation automatically is, of course, very desirable and this Xmethod often speeds up convergence by quite a lot. Unfortunately, bad Xchoices for the delta-bar-delta parameters give bad results and a lot of Xexperimentation may be necessary. 
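The way the d, k, m and t parameters work together on one weight can be sketched in C as follows. This is an illustration of the description above and not the simulators' code. The struct and function names are made up. The sketch compares each new slope with the running average of the earlier slopes, which is how Jacobs' paper decides whether the direction has held or reversed; that the simulators do exactly the same is an assumption.

```c
/* Per-weight state for the delta-bar-delta sketch (hypothetical names). */
struct dbd_weight {
    double eta;      /* this weight's own learning rate */
    double avg;      /* running average of earlier slopes */
};

struct dbd_params {
    double decay;    /* d: factor applied to eta after a sign change */
    double kappa;    /* k: amount added to eta when the direction holds */
    double max_eta;  /* m: largest eta allowed */
    double theta;    /* t: history parameter, 0 <= theta < 1 */
};

/* Adjust eta for one weight given the slope from the latest iteration,
   then fold that slope into the running average. */
void dbd_update(struct dbd_weight *w, const struct dbd_params *p, double slope)
{
    double agreement = w->avg * slope;

    if (agreement > 0.0) {               /* same direction as before */
        w->eta += p->kappa;
        if (w->eta > p->max_eta)
            w->eta = p->max_eta;
    } else if (agreement < 0.0) {        /* the slope changed sign */
        w->eta *= p->decay;
    }
    w->avg = (1.0 - p->theta) * slope + p->theta * w->avg;
}
```

With d = 0.5 and k = 0.25, an eta of 0.1 grows to 0.35 after one agreeing slope and is then cut to 0.175 by the next slope of opposite sign.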
For more, see "Increased Rates of XConvergence" by Robert A. Jacobs, in \fINeural Networks\fR, Volume 1, XNumber 4, 1988. X X.ul XRecurrent Networks X X Recurrent back-propagation networks take values from higher level Xunits and use them as activation values of lower level units. This Xgives a network a simple kind of short-term memory, possibly a little Xlike human short-term memory. For instance, suppose you want a network Xto memorize the two short sequences, "acb" and "bcd". In the middle of Xboth of these sequences is the letter, "c". In the first case you Xwant a network to take in "a" and output "c". Then take in "c" and Xoutput "b". In the second case you want a network to take in "b" and Xoutput "c". Then take in "c" and output "d". To do this, a network Xneeds a simple memory of what came before the "c". X X Let the network be a 7-3-4 network where input units 1-4 and output Xunits 1-4 stand for the letters a-d. Furthermore, let there be 3 hidden Xlayer units. The hidden units will feed their values back down to the Xinput units 5-7, where they become input for the next step. To see why Xthis works, suppose the patterns have been learned by the network. XInputting the "a" from the first string produces some random pattern of Xactivation on the hidden layer units and "c" on the output units. The Xpattern from the hidden units is copied down to the input layer. XSecond, the letter "c" is presented to the network together with the Xrandom pattern, now on units 5-7. XHowever, if the "b" from the second string is presented first, there Xwill be a different random pattern on the hidden layer units. These Xvalues are copied to units 5-7. These values Xcombine with the "c" to produce another random pattern. This random Xpattern will be different from the pattern the first string produced. XThis difference can be used by the network to make the response for the Xfirst string, "b" and the response for the second string, "d". 
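The copying of hidden unit values down to the input units can be pictured with a short C function that fills the input layer from a compressed-format pattern such as "0010 hhh". This is a sketch under stated assumptions and not the simulators' code. The function name load_input and its parameters are made up, only the codes `0', `1', `?' and `h' are handled, and each `h' is assumed to take the next hidden unit in order, starting with the first.

```c
/* Fill input[] from a compressed-format input pattern (hypothetical
   helper).  hidden[] holds the hidden layer values left over from the
   previous pattern, and unknown is the current value of `?'.
   Returns the number of input values written. */
int load_input(const char *pattern, const double *hidden,
               double unknown, double *input)
{
    int n = 0;            /* next input unit to fill */
    int next_hidden = 0;  /* next hidden unit to copy from */

    for (; *pattern != '\0'; pattern++) {
        switch (*pattern) {
        case '0': input[n++] = 0.0; break;
        case '1': input[n++] = 1.0; break;
        case '?': input[n++] = unknown; break;
        case 'h': input[n++] = hidden[next_hidden++]; break;
        default:  break;  /* blanks between groups are skipped */
        }
    }
    return n;
}
```

For the 7-3-4 network, loading "0010 hhh" puts the letter "c" on input units 1-4 and the three hidden values from the previous step on units 5-7.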
XThe training patterns for the network can be:
X
X   1000 000 0010   * "a" prompts the output, "c"
X   0010 hhh 0100   * inputting "c" should produce "b"
X
X   0100 000 0010   * "b" prompts the output, "c"
X   0010 hhh 0001   * inputting "c" should produce "d"
X
Xwhere the first four values on each line are the normal input and the
Xmiddle three either start out as all zeros or take their values from the
Xprevious values of the hidden units.  The code for taking these values
Xfrom the hidden layer units is "h".  The last set of values represents
Xthe output that should be produced.  To take values from the third layer
Xof a network, the code is "i".  For the fourth and fifth layers (if they
Xexist) the codes are "j" and "k".  Training recurrent networks can take
Xmuch longer than training standard networks.
X
X.ul
XMiscellaneous Commands
X
X   Below is a list of some miscellaneous commands, a short example of
Xeach and a short description of the command.
X
X.IP " ? ? " 15
XA `?' will print program status information.
X
X.IP " ! !cmd " 15
XAnything after `!' will be passed on to UNIX as a command to execute.
X
X.IP " C " 15
XThe C command will clear the network of values, reset the number of
Xiterations, set the seed to 0 and reset other values so that another
Xrun can be made with a new seed value.
X
X.IP " E E 1 " 15
XEntering "E 1" will echo all the input.  "E 0" will stop
Xechoing command input.  The default is to not echo input, since it
Xappears on the screen automatically.  Echoing input is useful when
Xcommands are taken from a file of commands, using the `i' command
Xdescribed below.  It can also be useful when reading commands from
Xa file when there is some kind of error within the file.
X
X.IP " i i f " 15
XEntering "i f" will read commands from the file, f.  When there are
Xno more commands in the file, the program starts reading from the
Xkeyboard.  (It's very handy to have a set of fixed commands in a file
Xto, in effect, create a new command.)
X
X.IP " l l 2 " 15
XEntering "l 2" will print the values of the units on layer 2,
Xor whatever layer is specified.
X
X.IP " T T -3 " 15
XIn sbp and srbp only, "T -3" sets all the threshold weights
Xto -3 or whatever value is specified and freezes them at this value.
X
X.IP " W W 0.9 " 15
XEntering "W 0.9" will remove (whittle away) all the weights with
Xabsolute values less than 0.9.
X.in-15
X
XIn addition, when a user-generated interrupt occurs (by typing DEL)
Xthe program will drop its current task and take the next command.
X
X.ul
XLimitations
X
X   Weights in the bp and sbp programs are 16-bit integer weights, where
Xthe real value of the weight has been multiplied by 1024.  The integer
Xversions cannot handle weights less than -32 or greater than 31.999.
XWeights are only checked if the Algorithm parameter, l, is set to a
Xvalue greater than 0.  Large learning rates combined with the
Xdifferential step size derivative and the continuous update method can
Xproduce overflow.  There are other places in these programs where
Xcalculations can possibly overflow as well and none of these places are
Xchecked.  Overflow seems highly unlikely in these other places, however.
XInput values for the integer versions can run from -31.994 to
X31.999.  Due to the method used to implement recurrent connections,
Xinput values in the real version are limited to -31994.0 and above.
END_OF_FILE
if test 34957 -ne `wc -c <'README'`; then
    echo shar: \"'README'\" unpacked with wrong size!
fi
# end of 'README'
fi
echo shar: End of archive 1 \(of 4\).
cp /dev/null ark1isdone
MISSING=""
for I in 1 2 3 4 ; do
    if test ! -f ark${I}isdone ; then
        MISSING="${MISSING} ${I}"
    fi
done
if test "${MISSING}" = "" ; then
    echo You have unpacked all 4 archives.
    rm -f ark[1-9]isdone
else
    echo You still need to unpack the following archives:
    echo " " ${MISSING}
fi
## End of shell archive.
exit 0