markh@csd4.csd.uwm.edu (Mark William Hopkins) (11/10/90)
A typical intro. to backpropagation that goes into any kind of analysis in depth will derive the weight (and threshold) corrections used in backprop. from the Gradient Descent method applied to a certain least-squares error function. The Gradient Descent method tries to find the minimum of this function, which is supposed to represent the optimal state for the system to be in. Associated with this method is a constant, usually denoted by the Greek letter ETA, which controls the learning rate of the system. You have to set it "by hand". If you make it too large, the system oscillates and fails to learn, and if you set it too small, it learns too slowly.

So, the obvious idea occurred to me way back in November of last year: use Newton's Method to find the zero of the error function. Newton's Method does not generally get stuck in local minima because it's not even trying to find minima in the first place. And when you think about it, finding minima of the error function wasn't really even the goal in the first place -- finding the zero (or a near-zero) was. Also, when it converges, it converges fast.

You might think that using Newton's Method for backprop. is going to entail a computationally expensive operation of calculating derivatives, maybe a couple of matrix inversions and divisions. However, the fact is that backprop. with Newton's Method is VIRTUALLY IDENTICAL to backprop. with Gradient Descent. What's the difference? Newton's Method calculates the optimal learning "constant" ETA on the fly.

One unfortunate thing about the method is that it doesn't actually converge so fast when the function in question behaves like f(x) = x^2 near the zero, and the error function ALWAYS has this property near its zero. The other fault of the method is that it behaves erratically near local minima, and particularly near points where the error function almost but not quite reaches 0.
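The observation that Newton's Method amounts to computing ETA on the fly can be sketched in a few lines (Python; a minimal illustration with made-up names, since the post gives no code -- the root-finding step is taken along the gradient direction):

```python
import numpy as np

def newton_eta_step(w, error_fn, grad_fn):
    # Newton's root-finding step for the scalar error E, taken along
    # the gradient direction: linearizing E(w - eta*g) ~ E - eta*|g|^2
    # and solving for the zero gives eta = E / |g|^2.  The update is
    # then identical in form to a gradient-descent step.
    E = error_fn(w)
    g = grad_fn(w)
    gg = np.dot(g, g)
    if gg == 0.0:
        return w, 0.0          # flat point: no direction to move in
    eta = E / gg               # the learning "constant", computed on the fly
    return w - eta * g, eta

# Toy error E(w) = 0.5*|w|^2, which behaves like x^2 near its zero --
# exactly the case where the post notes convergence is only linear.
error_fn = lambda w: 0.5 * np.dot(w, w)
grad_fn = lambda w: w

w = np.array([4.0, -2.0])
for _ in range(10):
    w, eta = newton_eta_step(w, error_fn, grad_fn)
# On this error surface eta comes out as 0.5 every step, so the
# weights only halve each iteration -- fast, but not quadratic.
```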
Now the erratic behavior of Newton's Method near a local minimum, when you think about it, is actually a desired feature. You WANT your neural net to jump out of a local minimum when it hits one. And because, in determining ETA, you're dividing by a quantity that approaches 0 as you near the local minimum, the system becomes more "volatile" as it approaches such a point. The only time you don't want the system to jump around erratically is when you're actually nearing the zero. This is something that can be fixed quite easily by providing a cut-off: no more learning when the error falls below a certain threshold. Now what you have is a system that even controls, autonomously, when it learns: another desired feature. This also keeps the system from jumping out from near convergence when there is, in fact, no actual zero to the error function but just a near-zero.

A simulation of a Newton's Method backprop. neural net provided some degree of confirmation of these major points. A simple neural net was set up to learn the exclusive-or function. Backprop. without momentum terms exhibits horribly slow convergence on this function, even when its weights are initialized to values near convergence. The problem is that some parts of the underlying surface of the error function require learning constants, ETA, of (say) under 0.25, and other parts require learning constants in excess of 1000000 (!). Newton's Method determines this dynamically. When the weights are set initially near the optimum setting, the neural net literally zooms in on the zero in a relatively small number of steps (under a few hundred). The value of ETA can get as high as 1000000 before the learning "cut-off" mentioned above is applied. More generally, backpropagation with Newton's Method will converge extremely quickly where gradient descent backprop. is left behind, crawling along taking its own sweet time. Not everything's perfect, though.
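The training loop with the learning cut-off described above can be sketched as follows (illustrative names and threshold values, not from the post):

```python
import numpy as np

def train_with_cutoff(w, error_fn, grad_fn, cutoff=1e-3, max_steps=1000):
    # Newton-ETA updates with the learning cut-off described above:
    # stop updating once the error falls below the threshold, so the
    # step E/|g|^2 (which blows up as |g| -> 0 near a local minimum or
    # a near-zero) never kicks the net away from a good solution.
    for step in range(max_steps):
        E = error_fn(w)
        if E < cutoff:               # the cut-off: no more learning
            return w, step
        g = grad_fn(w)
        gg = np.dot(g, g)
        if gg == 0.0:                # flat point: cannot move
            break
        w = w - (E / gg) * g
    return w, max_steps

# Same toy quadratic error as before; learning stops by itself.
error_fn = lambda w: 0.5 * np.dot(w, w)
grad_fn = lambda w: w
w, steps = train_with_cutoff(np.array([4.0, -2.0]), error_fn, grad_fn)
```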
Certain settings of weights and thresholds on even this relatively simple training function caused the net to get stuck in a state where it was trying to emulate the exclusive-or function with an OR function. This is definitely related to Newton's Method's relative helplessness at points far away from the zero of the error function. If you play around with Newton's Method on a one-dimensional graph, you will find that it always gets stuck in V-shaped valleys -- or in two dimensions, in cone-shaped "sink holes". However, Gradient Descent gets stuck here too... But generally, using Newton's Method seems to provide a natural way to accelerate learning when and only when it needs acceleration.
eeoglesb@cybaswan.UUCP (j.oglesby eleceng pgrad) (11/13/90)
In article <7556@uwm.edu> markh@csd4.csd.uwm.edu (Mark William Hopkins) writes:
>
> A typical intro. to backpropagation that goes into any kind of analysis in
...... the rest of the message
> But generally, using Newton's Method seems to provide a natural way to
>accelerate learning when and only when it needs acceleration.

I would like to add a bit to this last message. I believe Mark is talking about a 1D Newton line search to set the step size, to which I would add the following: there are other 1D line searches, just as fast and more robust, that can be used. Performing an accurate line search on each iteration is generally a waste of computation and can in some cases slow you down significantly. Producing a good search direction (i.e. something like conjugate gradient rather than steepest descent), with an approximate but fast line search, will in general yield a faster solution than an accurate line search with steepest-descent directions.

I have recently seen a paper on Newton's Method for 1D line searches, with a variety of strategies for deriving search directions, which leads me to this conclusion: yes, any line search is generally better than setting a constant step size (ETA), but if it is performed accurately then in general you will be wasting time. Unfortunately this paper is still in review so I can't reference it.

I use a quadratic interpolation line search using a couple of points in the search direction. This involves a couple of function evaluations in the search direction, from which an estimate of the step size to the minimum is calculated. This potentially should be good at finding a LOCAL MINIMUM, but I haven't found this to be a problem in 3 years of experience. 4-D EXORs are generally solved in < 10 line searches.

Hope my experience is helpful to some of you folks out there,

JO.
------------------------------------------------------------------------------
John Oglesby.
UUCP  : ...!ukc!pyr.swan.ac.uk!eeoglesb    Speech Processing Group,
JANET : eeoglesb@uk.ac.swan.pyr            Electrical Engineering Dept.,
Phone : +44 792 205678 Ex 4564             University of Wales,
Fax   : +44 792 295686                     Swansea, SA2 8PP, U.K.
Telex : 48358
------------------------------------------------------------------------------
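The quadratic interpolation line search Oglesby describes can be sketched as follows (the trial step sizes and names here are illustrative assumptions; his post gives no code):

```python
def quadratic_line_search(f, x, d, t1=0.5, t2=1.0):
    # Fit a parabola through the error at three trial step sizes
    # (0, t1, t2) along the search direction d, then return the step
    # size at the parabola's minimum -- a couple of extra function
    # evaluations instead of an expensive accurate search.
    t0 = 0.0
    f0, f1, f2 = f(x), f(x + t1 * d), f(x + t2 * d)
    num = (t1**2 - t2**2) * f0 + (t2**2 - t0**2) * f1 + (t0**2 - t1**2) * f2
    den = (t1 - t2) * f0 + (t2 - t0) * f1 + (t0 - t1) * f2
    if den == 0.0:                 # the three points are collinear
        return t2
    return 0.5 * num / den

# Exact on a true parabola: along d = 1 from x = 0, the minimum of
# (t - 0.7)^2 lies at step size t = 0.7.
step = quadratic_line_search(lambda x: (x - 0.7) ** 2, 0.0, 1.0)
```

Since the estimate is exact only when the error really is quadratic along the search direction, it is an approximate search in general -- which is precisely the trade-off the post argues for.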
ins_atge@jhunix.HCF.JHU.EDU (Thomas G Edwards) (11/15/90)
In article <2124@cybaswan.UUCP> eeoglesb@cybaswan.UUCP (j.oglesby eleceng pgrad) writes:
>I use a quadratic interpolation line search using a couple of points in the
>search direction. This involves a couple of function evaluations in the
>search direction, from which an estimate of the step size to the minimum is
>calculated. This potentially should be good at finding a LOCAL MINIMUM, but
>I haven't found this to be a problem in 3 years of experience. 4-D EXORs are
>generally solved in < 10 line searches.

Have you tried the 2 intertwined spirals problem? It really gives my conjugate-gradient backprop neural net program a rough time with local minima. But it is easily solved with cascade-correlation.

-Tom
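For readers who want to try it, the two-spirals benchmark can be generated as below (this uses the commonly cited CMU formulation; the exact constants are an assumption here, not given in the posts):

```python
import numpy as np

def two_spirals(n=97):
    # The two-intertwined-spirals benchmark: n points per spiral,
    # spiraling outward, with the second spiral being the
    # point-reflection of the first.  The classifier must label which
    # spiral each (x, y) point belongs to.
    i = np.arange(n)
    angle = i * np.pi / 16.0
    radius = 6.5 * (104 - i) / 104.0
    x, y = radius * np.cos(angle), radius * np.sin(angle)
    pts = np.concatenate([np.stack([x, y], axis=1),
                          np.stack([-x, -y], axis=1)])
    labels = np.concatenate([np.ones(n), np.zeros(n)])
    return pts, labels

pts, labels = two_spirals()
```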
loren@dweasel.llnl.gov (Loren Petrich) (11/15/90)
In article <6873@jhunix.HCF.JHU.EDU> ins_atge@jhunix.HCF.JHU.EDU (Thomas G Edwards) writes:
>In article <2124@cybaswan.UUCP> eeoglesb@cybaswan.UUCP (j.oglesby eleceng pgrad) writes:
>>I use a quadratic interpolation line search using a couple of points in the
>>search direction. This involves a couple of function evaluations in the
>>search direction, from which an estimate of the step size to the minimum is
>>calculated. This potentially should be good at finding a LOCAL MINIMUM, but
>>I haven't found this to be a problem in 3 years of experience. 4-D EXORs are
>>generally solved in < 10 line searches.
>
>Have you tried the 2 intertwined spirals problem? It really gives my
>conjugate-gradient backprop neural net program a rough time with
>local minima. But it is easily solved with cascade-correlation.

Does anyone know of any readily-available Cascade-Correlation source? I tried writing a Cascade-Correlation routine, without very much success.

I think that the problem of local minima will necessarily plague ANY localized search algorithm, because if it finds one minimum, it will know nothing about the other minima. The only practical way of getting around that difficulty, at least that I know of, is to use a stochastic approach with several different starting points. And even that is not guaranteed to find the global minimum.

I know that, in some cases, the minimization problem is known to be NP-complete, which means that there is no known algorithm that solves the problem in time bounded by some power of the problem size -- the running time of all known solutions of NP-complete problems grows either exponentially or factorially with the problem size. However, all NP-complete problems appear to be fundamentally equivalent, and a solution of one may well apply to all.

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
Loren Petrich, the Master Blaster: loren@sunlight.llnl.gov
Since this nodename is not widely known, you may have to try:
loren%sunlight.llnl.gov@star.stanford.edu
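The multi-start stochastic approach Petrich describes, in sketch form (plain gradient descent stands in for the local search, and all parameters are illustrative):

```python
import numpy as np

def multistart_minimize(error_fn, grad_fn, dim, n_starts=10,
                        steps=200, eta=0.1, seed=0):
    # Run a local search from several random starting points and keep
    # the best endpoint.  As the post notes, even this is not
    # guaranteed to find the global minimum -- it just samples more
    # basins of attraction than a single run would.
    rng = np.random.default_rng(seed)
    best_w, best_E = None, float("inf")
    for _ in range(n_starts):
        w = rng.uniform(-1.0, 1.0, dim)
        for _ in range(steps):
            w = w - eta * grad_fn(w)   # local search: gradient descent
        E = error_fn(w)
        if E < best_E:
            best_w, best_E = w, E
    return best_w, best_E

# Smooth toy error with a single minimum at w = (0.5, 0.5):
error_fn = lambda w: float(np.sum((w - 0.5) ** 2))
grad_fn = lambda w: 2.0 * (w - 0.5)
w_best, E_best = multistart_minimize(error_fn, grad_fn, dim=2)
```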
ins_atge@jhunix.HCF.JHU.EDU (Thomas G Edwards) (11/17/90)
In article <86170@lll-winken.LLNL.GOV> loren@dweasel.llnl.gov (Loren Petrich) writes:
> Does anyone know of any readily-available Cascade-Correlation
>source?

PostScript paper: ftp from cheops.cis.ohio-state.edu
Source: in C and Lisp from pt.cs.cmu.edu (I think the directory is connect/code)

> I tried writing a Cascade-Correlation routine, without very
>much success.

Did you use Quickprop to train the weights? Did you use multiple candidate units? I think both are needed for good performance. I must admit I think Quickprop is kind of a hack, but it does work, though I think conjugate-gradient methods might also be useful for training the weights of a cascade-correlation network.

> I think that the problem of local minima will necessarily
>plague ANY localized search algorithm, because if it finds one
>minimum, it will know nothing about the other minima.

Yeah. The important thing about cascade-correlation is that it is significantly different from backprop, since it does not strictly follow the gradient descent of the error. It is a kind of "compositional learning" which creates useful subgoals (i.e. hidden units as feature detectors which maximally help reduce the error) and puts them together to solve a larger goal. Thus it could be scooting all over the error surface. Whether or not this reduces local-minima problems is not mathematically obvious, but common sense and my examination of results so far say that it probably does (at least for the two-spirals problem).

-Thomas Edwards
(Looking for Ph.D. programs to do neurally-inspired VLSI for next fall)
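The candidate-scoring step at the heart of cascade-correlation can be sketched as below (based on the correlation measure in Fahlman's published paper; the posts themselves give no code):

```python
import numpy as np

def candidate_score(V, E):
    # Fahlman's candidate score for cascade-correlation: the summed
    # magnitude, over output units o, of the covariance between a
    # candidate unit's activation V[p] (over training patterns p) and
    # the residual error E[p, o].  Candidates are trained to maximize
    # this, and the winner is frozen in as a new feature detector --
    # the "compositional" subgoal step described above.
    Vc = V - V.mean()
    Ec = E - E.mean(axis=0)
    return float(np.abs(Vc @ Ec).sum())

# A candidate whose activation tracks the residual error scores high;
# an uncorrelated one scores zero.
err = np.array([[1.0], [-1.0], [1.0], [-1.0]])
good = candidate_score(np.array([1.0, -1.0, 1.0, -1.0]), err)
bad = candidate_score(np.array([1.0, 1.0, -1.0, -1.0]), err)
```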
loren@dweasel.llnl.gov (Loren Petrich) (11/19/90)
In article <6905@jhunix.HCF.JHU.EDU> ins_atge@jhunix.HCF.JHU.EDU (Thomas G Edwards) writes:
>In article <86170@lll-winken.LLNL.GOV> loren@dweasel.llnl.gov (Loren Petrich) writes:
>> Does anyone know of any readily-available Cascade-Correlation
>>source?
>
>PostScript paper: ftp from cheops.cis.ohio-state.edu
>Source: in C and Lisp from pt.cs.cmu.edu (I think the directory is
> connect/code)

I've already "been" to cheops.cis.ohio-state.edu, and I had already gotten Fahlman's paper from there. The big trouble was implementing what he had suggested.

I looked in pt.cs.cmu.edu and I was not allowed access to the contents of connect/code. I got a message stating that

    530 Access not allowed for guest users for path connect/code

I was also not allowed access into any other (suspected) directory.

>> I tried writing a Cascade-Correlation routine, without very
>>much success.
>
>Did you use Quickprop to train the weights? Did you use multiple
>candidate units? I think both are needed for good performance.
>I must admit I think Quickprop is kind of a hack, but it does work,
>though I think conjugate-gradient methods might also be useful for
>training the weights of a cascade-correlation network.

I had used Quickprop -- and multiple candidate units. Without much success. The NN almost always gets stuck with some weights VERY big and others small. And when I put in a big weight decay rate, the NN usually gets very lousy results. I'm training on 0-or-1 (-1 or 1 in my code) decision problems like the encoder and line-symmetry problems.

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
Loren Petrich, the Master Blaster: loren@sunlight.llnl.gov
Since this nodename is not widely known, you may have to try:
loren%sunlight.llnl.gov@star.stanford.edu
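For context, a simplified sketch of the Quickprop update under discussion (after Fahlman, 1988; the constants and edge-case handling here are assumptions, and real implementations vary):

```python
def quickprop_step(slope, prev_slope, prev_dw, eta=0.5, mu=1.75):
    # For each weight, Quickprop treats the error curve as a parabola
    # through the current and previous error slopes and jumps toward
    # that parabola's minimum; mu caps how fast steps may grow.
    if prev_dw == 0.0:
        return -eta * slope            # bootstrap with a gradient step
    denom = prev_slope - slope
    if denom == 0.0 or abs(slope) >= abs(mu * denom):
        return mu * prev_dw            # cap a runaway step
    return prev_dw * slope / denom     # jump to the parabola's minimum

# On a true parabola E = w^2/2 (slope = w) starting at w = 2.0, the
# bootstrap step lands at w = 1.0 and the second step jumps straight
# to the minimum at w = 0.0.
dw0 = quickprop_step(2.0, 0.0, 0.0)        # -> -1.0
dw1 = quickprop_step(1.0, 2.0, dw0)        # -> -1.0
```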
ins_atge@jhunix.HCF.JHU.EDU (Thomas G Edwards) (11/19/90)
In article <86351@lll-winken.LLNL.GOV> loren@dweasel.llnl.gov (Loren Petrich) writes:
> I looked in pt.cs.cmu.edu and I was not allowed access to the
>contents of connect/code.

Hmm... there are two possibilities: 1) that isn't the right address, or 2) it isn't there any more. I'm going to check with R. Scott Crowder, who wrote the code, but until then I'll send a copy of cascor1.c to the first five people who request it (Loren gets copy #1).

> I had used Quickprop -- and multiple candidate units. Without
>much success. The NN almost always gets stuck with some weights VERY
>big and others small. And when I put in a big weight decay rate, the
>NN usually gets very lousy results.

I won't profess to be an expert at Quickprop. You might just want to try good old conjugate gradient for the weight updates. Otherwise, it is not apparent what is going wrong. It should be able to solve the two-spirals problem rather quickly. Oh well, I'll send you cascor1.c.

-Thomas Edwards