[comp.ai.neural-nets] Using Newton's Method to speed up backpropagation.

markh@csd4.csd.uwm.edu (Mark William Hopkins) (11/10/90)

   A typical intro. to backpropagation that goes into any kind of analysis in
depth will derive the weight (and threshold) corrections used in backprop.
from the Gradient Descent method applied to a certain least-squares error
function.

   The Gradient Descent method tries to find the minimum of this function,
which is supposed to represent the optimal state for the system to be in.
Associated with this method is a constant, usually denoted as the Greek letter
ETA, which controls the learning rate of the system.  You have to set it "by
hand".  If you make it too large, the system oscillates and fails to learn, and
if you set it too small, it learns too slowly.
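
   For concreteness, here is that textbook update as a minimal numpy sketch
(the function names are illustrative, not taken from any particular text):

    import numpy as np

    # Minimal sketch of the standard gradient-descent update.  E is the
    # least-squares error over the training set and grad_E its gradient
    # with respect to the weight vector w.
    def gradient_descent_step(w, grad_E, eta=0.25):
        # ETA is the hand-tuned learning rate: too large and the weights
        # oscillate, too small and learning crawls.
        return w - eta * grad_E(w)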

   So, the obvious idea occurred to me way back in November of last year: use
Newton's Method to find the zero of the error function.  Newton's Method does
not generally get stuck in local minima because it's not even trying to find
minima in the first place.  And when you think about it, finding minima of
the error function wasn't really even the goal in the first place -- finding
the zero (or a near-zero) was.  Also, when it converges it converges fast.

   You might think that using Newton's Method for backprop. is going to entail
a computationally expensive operation of calculating derivatives, maybe a
couple matrix inversions and divisions.  However, the fact is that backprop.
with Newton's Method is VIRTUALLY IDENTICAL to backprop. with Gradient
Descent.  What's the difference?  Newton's Method calculates the optimal
learning "constant" ETA on the fly.

   One unfortunate thing about the method is that it doesn't actually
converge so fast when the function in question behaves like f(x) = x^2 near
the zero, and the error function ALWAYS has this property near its zero,
since a sum of squares touches 0 without crossing it.  The other fault of
the method is that it behaves erratically near local minima, and
particularly near points where the error function almost but not quite
reaches 0.

   Now the erratic behavior of Newton's Method near a local minimum, when you
think about it, is actually a desired feature.  You WANT your neural net to
jump out of a local minimum when it hits one.  And because ETA is obtained by
dividing by a quantity that approaches 0 as you near the local minimum, the
system becomes more "volatile" as it approaches such a point.

   The only time you don't want the system to jump around erratically is when
you're actually nearing the zero.  This is something that can actually be fixed
quite easily by providing a cut-off: no more learning when the error falls
below a certain threshold.  Now what you have is a system that even controls,
autonomously, when it learns: another desired feature.  This also keeps the
system from jumping out from near convergence when there is, in fact, no
actual zero to the error function but just a near-zero.
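
   A sketch of the training loop with that cut-off, continuing the
reconstruction above (the threshold value is just illustrative):

    import numpy as np

    # Newton-ETA training loop with the error cut-off described above.
    # E and grad_E are the error function and its gradient.
    def train(w, E, grad_E, cutoff=1e-3, max_steps=10000):
        for _ in range(max_steps):
            err = E(w)
            if err < cutoff:          # near the zero (or a near-zero):
                break                 # stop learning, don't jump away
            g = grad_E(w)
            w = w - (err / np.dot(g, g)) * g
        return w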

   A simulation of a Newton's Method backprop. neural net provided some degree
of confirmation of these major points.  A simple neural net was set up to
learn the exclusive-or function.  Backprop. without momentum terms exhibits
horribly slow convergence on this function, even when its weights are
initialized to values near convergence.  The problem is that some parts of the
underlying surface of the error function require learning constants, ETA, of
(say) under 0.25, and other parts require learning constants in excess of
1000000 (!).

   Newton's method determines this dynamically.  When the weights are set
initially near the optimum setting, the neural net literally zooms in on the
zero in a relatively small number of steps (under a few hundred).  The
value of ETA can get as high as 1000000 before the learning "cut-off"
mentioned above is applied.  More generally, backpropagation with Newton's
method will converge extremely quickly where gradient descent backprop. is
left behind crawling along, taking its own sweet time.

   Not everything's perfect, though.  Certain settings of weights and
thresholds on even this relatively simple training function caused the net to
get stuck in a state where it was trying to emulate the exclusive-or function
with an OR function.  This is definitely related to Newton's method's relative
helplessness at points far away from the zero of the error function.

   If you play around with Newton's Method on a one-dimensional graph, you will
find that it always gets stuck in V-shaped valleys -- or in two dimensions,
in cone-shaped "sink holes".  However, Gradient Descent gets stuck here too...

   But generally, using Newton's Method seems to provide a natural way to
accelerate learning when and only when it needs acceleration.

eeoglesb@cybaswan.UUCP (j.oglesby eleceng pgrad) (11/13/90)

In article <7556@uwm.edu> markh@csd4.csd.uwm.edu (Mark William Hopkins) writes:
>
>   A typical intro. to backpropagation that goes into any kind of analysis in

...... the rest of the message

>   But generally, using Newton's Method seems to provide a natural way to
>accelerate learning when and only when it needs acceleration.

I would like to add a bit to this last message:

I believe Mark talks about a 1D Newton line search to set the step size, to
which I would add the following:

There are other more robust and just as fast 1D line searches that can be used.
Performing an accurate line search on each iteration is generally a waste of
computation and can in some cases slow you down significantly.  Producing a good
search direction, i.e. something like conjugate gradient rather than steepest
descent, with an approximate but fast line search, will in general yield
a faster solution than an accurate line search with steepest-descent directions.
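
For readers who haven't met it, the conjugate-gradient direction is cheap to
form from successive gradients.  A minimal sketch of the Polak-Ribiere form
(an illustration of the standard recipe, not my actual code):

    import numpy as np

    # Conjugate-gradient search direction (Polak-Ribiere form).
    # g_new, g_old are gradients at the current and previous points,
    # d_old is the previous search direction.
    def cg_direction(g_new, g_old, d_old):
        beta = np.dot(g_new, g_new - g_old) / np.dot(g_old, g_old)
        beta = max(beta, 0.0)            # common restart safeguard
        return -g_new + beta * d_old     # steepest descent would be -g_new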

I have recently seen a paper on Newton's method for 1D line searches with
a variety of strategies for deriving search directions, which leads me to
the conclusion: yes, any line search is generally better than setting a
constant step size (ETA), but if it is performed accurately then in general
you will be wasting time.  Unfortunately this paper is still under review
so I can't reference it.

I use a quadratic interpolation line search using a couple of points in the
search direction.  This involves a couple of function evaluations in the search
direction, from which an estimate of the step size to the minimum is calculated.
In principle this should be prone to finding LOCAL MINIMA, but I haven't found
this a problem in 3 years of experience.  4-D EXORs are generally solved in
< 10 line searches.
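
A sketch of the idea (illustrative -- the exact bracketing and trial points
I use aren't given above):

    import numpy as np

    # Quadratic-interpolation line search along direction d: sample
    # phi(t) = E(w + t*d) at t = 0, t1, t2, fit a parabola through the
    # three points, and step to its vertex.  Trial points are illustrative.
    def quadratic_step(E, w, d, t1=0.5, t2=1.0):
        f0, f1, f2 = E(w), E(w + t1 * d), E(w + t2 * d)
        a = ((f2 - f0) / t2 - (f1 - f0) / t1) / (t2 - t1)
        b = (f1 - f0) / t1 - a * t1
        if a <= 0.0:                  # no interior minimum along this fit
            return t1 if f1 < f2 else t2
        return -b / (2.0 * a)         # vertex of the fitted parabola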

Hope my experience is helpful to some of you folks out there,
JO.

------------------------------------------------------------------------------
 John Oglesby.                        UUCP  : ...!ukc!pyr.swan.ac.uk!eeoglesb 
 Speech Processing Group,             JANET : eeoglesb@uk.ac.swan.pyr         
 Electrical Engineering Dept.,        Phone : +44 792 205678  Ex 4564         
 University of Wales,                 Fax   : +44 792 295686                  
 Swansea, SA2 8PP, U.K.               Telex : 48358                           
------------------------------------------------------------------------------

ins_atge@jhunix.HCF.JHU.EDU (Thomas G Edwards) (11/15/90)

In article <2124@cybaswan.UUCP> eeoglesb@cybaswan.UUCP (j.oglesby eleceng pgrad) writes:
>I use a quadratic interpolation line search using a couple of points in the
>search direction.  This involves a couple of function evaluations in the search
>direction, from which an estimate of the step size to the minimum is calculated.
>In principle this should be prone to finding LOCAL MINIMA, but I haven't found
>this a problem in 3 years of experience.  4-D EXORs are generally solved in
>< 10 line searches.

Have you tried the 2 intertwined spirals problem?  It really gives my
conjugate-gradient backprop neural net program a rough time with
local minima.  But it is easily solved with cascade-correlation.

-Tom

loren@dweasel.llnl.gov (Loren Petrich) (11/15/90)

In article <6873@jhunix.HCF.JHU.EDU> ins_atge@jhunix.HCF.JHU.EDU (Thomas G Edwards) writes:
>In article <2124@cybaswan.UUCP> eeoglesb@cybaswan.UUCP (j.oglesby eleceng pgrad) writes:
>>I use a quadratic interpolation line search using a couple of points in the
>>search direction.  This involves a couple of function evaluations in the search
>>direction, from which an estimate of the step size to the minimum is calculated.
>>In principle this should be prone to finding LOCAL MINIMA, but I haven't found
>>this a problem in 3 years of experience.  4-D EXORs are generally solved in
>>< 10 line searches.
>
>Have you tried the 2 intertwined spirals problem?  It really gives my
>conjugate-gradient backprop neural net program a rough time with
>local minima.  But it is easily solved with cascade-correlation.

	Does anyone know of any readily-available Cascade-Correlation
source?

	I tried writing a Cascade-Correlation routine, without very
much success.

	I think that the problem of local minima will necessarily
plague ANY localized search algorithm, because if it finds one
minimum, it will know nothing about the other minima. The only
practical way of getting around that difficulty, at least that I know
of, is a stochastic approach: train from several different
starting points. And even that is not guaranteed to find the global
minimum.
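
	A sketch of that stochastic restart idea (illustrative only):

    import numpy as np

    # Multi-start local search: run whatever local training routine you
    # have from several random initializations and keep the best result.
    # train_fn and E are stand-ins for the local search and error function.
    def multi_start(train_fn, E, n_weights, n_starts=10, scale=0.5, seed=0):
        rng = np.random.default_rng(seed)
        best_w, best_err = None, np.inf
        for _ in range(n_starts):
            w = train_fn(scale * rng.standard_normal(n_weights))
            err = E(w)
            if err < best_err:
                best_w, best_err = w, err
        return best_w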

	I know that, in some cases, the minimization problem is known
to be NP-complete, which means that no known algorithm solves it in
time bounded by some power of the problem size -- all known exact
solutions of NP-complete problems take time that grows exponentially
(or worse) with the problem size. However, all NP-complete problems
are fundamentally equivalent in the sense that each reduces to any
other in polynomial time, so an efficient solution to one would
yield an efficient solution to all.


$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
Loren Petrich, the Master Blaster: loren@sunlight.llnl.gov

Since this nodename is not widely known, you may have to try:

loren%sunlight.llnl.gov@star.stanford.edu

ins_atge@jhunix.HCF.JHU.EDU (Thomas G Edwards) (11/17/90)

In article <86170@lll-winken.LLNL.GOV> loren@dweasel.llnl.gov (Loren Petrich) writes:
>	Does anyone know of any readily-available Cascade-Correlation
>source?

PostScript Paper: ftp from cheops.cis.ohio-state.edu
Source: in C and Lisp from pt.cs.cmu.edu (I think the directory is
   connect/code)

>	I tried writing a Cascade-Correlation routine, without very
>much success.

Did you use Quickprop to train the weights?  Did you use multiple 
candidate units?  I think both are needed for good performance.
I must admit I think Quickprop is kind of a hack, but it does work,
though I think conjugate-gradient methods might also be useful for
training the weights of a cascade-correlation network.
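
For reference, the heart of the Quickprop weight update as I understand it
(hedged -- Fahlman's real code adds safeguards such as a maximum growth
factor and a fallback plain-gradient term):

    import numpy as np

    # Core Quickprop step: treat the error curve for each weight as a
    # parabola through the last two slopes and jump to its minimum.
    # S is dE/dw this epoch; S_prev and dw_prev are the previous slope
    # and weight change (all numpy arrays of the same shape).
    def quickprop_step(S, S_prev, dw_prev, eps=1e-12):
        return dw_prev * S / (S_prev - S + eps)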

>	I think that the problem of local minima will necessarily
>plague ANY localized search algorithm, because if it finds one
>minimum, it will know nothing about the other minima.
  
  Yeah.  The important thing about cascade-correlation is that it is
significantly different from backprop, since it does not strictly
follow the gradient descent of error.  It is a kind of "compositional
learning" which creates useful subgoals (i.e. hidden units as feature
detectors which maximally help reduce the error) and puts them together
to solve a larger goal.  Thus it could be scooting all over the
error surface.  Whether or not this reduces local-minima problems
is not mathematically obvious, but common sense and my examination
of results so far say that it probably does (at least for the
two-spirals problem).
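
To make the subgoal concrete: each candidate hidden unit is trained to
maximize the magnitude of the covariance between its output and the
network's residual error, and the winner is then frozen and installed as a
new feature detector.  A sketch of that score (my paraphrase of Fahlman and
Lebiere, not code from their paper):

    import numpy as np

    # Candidate score from cascade-correlation: V holds the candidate's
    # output for each training pattern (shape [patterns]); E holds the
    # residual error at each output unit (shape [patterns, outputs]).
    def candidate_score(V, E):
        V_c = V - V.mean()
        E_c = E - E.mean(axis=0)
        return np.abs(V_c @ E_c).sum()   # sum of |covariance| over outputs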

-Thomas Edwards
 (Looking for Ph.D. programs to do neurally-inspired VLSI for next fall)

loren@dweasel.llnl.gov (Loren Petrich) (11/19/90)

In article <6905@jhunix.HCF.JHU.EDU> ins_atge@jhunix.HCF.JHU.EDU (Thomas G Edwards) writes:
>In article <86170@lll-winken.LLNL.GOV> loren@dweasel.llnl.gov (Loren Petrich) writes:
>>	Does anyone know of any readily-available Cascade-Correlation
>>source?
>
>PostScript Paper: ftp from cheops.cis.ohio-state.edu
>Source: in C and Lisp from pt.cs.cmu.edu (I think the directory is
>   connect/code)

	I've already "been" to cheops.cis.ohio-state.edu, and I had
already gotten Fahlman's paper from there. Implementing what he had
suggested was the big trouble.

	I looked in pt.cs.cmu.edu and I was not allowed access to the
contents of connect/code. I got a message stating that

530 Access not allowed for guest users for path connect/code

	I was also not allowed access into any other (suspected) directory.

>>	I tried writing a Cascade-Correlation routine, without very
>>much success.
>
>Did you use Quickprop to train the weights?  Did you use multiple 
>candidate units?  I think both are needed for good performance.
>I must admit I think Quickprop is kind of a hack, but it does work,
>though I think conjugate-gradient methods might also be useful for
>training the weights of a cascade-correlation network.

	I had used Quickprop -- and multiple candidate units. Without
much success. The NN almost always gets stuck with some weights VERY
big and others small. And when I put in a big weight decay rate, the
NN usually gets very lousy results.

	I'm training on 0 or 1 (-1 or 1 in my code) decision problems
like the encoder and line-symmetry problems.


$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
Loren Petrich, the Master Blaster: loren@sunlight.llnl.gov

Since this nodename is not widely known, you may have to try:

loren%sunlight.llnl.gov@star.stanford.edu

ins_atge@jhunix.HCF.JHU.EDU (Thomas G Edwards) (11/19/90)

In article <86351@lll-winken.LLNL.GOV> loren@dweasel.llnl.gov (Loren Petrich) writes:
>	I looked in pt.cs.cmu.edu and I was not allowed access to the
>contents of connect/code.

Hmm... there are two possibilities: 1) that isn't the right address, or
2) it isn't there any more.  I'm going to check with R. Scott
Crowder who wrote the code, but until then, I'll send a copy of
cascor1.c to the first five people who request it (Loren gets copy
#1).

>	I had used Quickprop -- and multiple candidate units. Without
>much success. The NN almost always gets stuck with some weights VERY
>big and others small. And when I put in a big weight decay rate, the
>NN usually gets very lousy results.

I won't profess to be an expert at Quickprop.  You might just want to
try good old conjugate-gradient for weight updates.

Otherwise, it is not apparent what is going wrong.  It should be able
to rather quickly solve the two-spirals problem.  Oh well, I'll send
you cascor1.c.

-Thomas Edwards