[comp.ai.neural-nets] Backpropagation with Newton's Method, and recurrence. Source code.

markh@csd4.csd.uwm.edu (Mark William Hopkins) (11/13/90)

   If you would like to test out Newton's method for backpropagation neural
nets, and if you'd like to try out a relatively unknown, but efficient,
algorithm for training recurrent bp. neural nets, you can obtain some demos
via ftp.

   The programs illustrate recurrent Newton's Method backprop. on a familiar
application: learning the exclusive-or function.  Also, you will find a program
that successfully emulates a finite state machine (a flip-flop) using recurrent
backprop. with "persistent activations", and a (not-so-successful) program
that attempts to LEARN the flip-flop.

   Needless to say, training backpropagation to do a flip-flop is not an easy
task, but bp. with recurrence and persistent activations is powerful enough to
represent any finite state machine (including even an entire CPU!) ... so
convergence may be possible with the right presentation strategy, using nothing
more than the generalized delta rule (!).
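
   For the curious, here is a rough sketch (in C, not the QuickBASIC code in
nn.Z; the weights are just values I picked for illustration) of how a single
recurrent unit with a persistent activation can latch like a set/reset
flip-flop:

    /* Hand-wired illustration of a "persistent activation" flip-flop.
     * One sigmoid unit with strong self-feedback acts as a 1-bit latch. */
    #include <stdio.h>
    #include <math.h>

    static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

    int main(void)
    {
        double w_self = 10.0, w_set = 10.0, w_reset = -10.0, bias = -5.0;
        double act = 0.0;                       /* the persistent activation */
        int set_in[]   = {0, 1, 0, 0, 0, 0};
        int reset_in[] = {0, 0, 0, 0, 1, 0};
        int t;

        for (t = 0; t < 6; t++) {
            double net = w_self * act + w_set * set_in[t]
                       + w_reset * reset_in[t] + bias;
            act = sigmoid(net);
            printf("set=%d reset=%d act=%.3f\n", set_in[t], reset_in[t], act);
        }
        /* act latches near 1 after SET and near 0 after RESET, and it
         * persists in between -- a one-bit finite state machine. */
        return 0;
    }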

   Do an anonymous ftp to csd4.csd.uwm.edu, set binary mode and pluck out
nn.Z from the top-level directory, uncompress it, "de-tar" it and run it
on any IBM-compatible with a QuickBASIC 4.5 compiler.  The source has been
written in such a way as to make translation to Berkeley C (using the curses
package) or Microsoft QuickC relatively easy.

From command prompt:
>ftp csd4.csd.uwm.edu

Ftp login procedure:
> Name: anonymous
> Password: ident

From the ftp prompt:
> binary
> get nn.Z
> quit

Back in command prompt:
> uncompress nn.Z
> tar -xf nn

(This sequence takes about 20 to 30 seconds :) ).

I'd be interested in hearing any comments on the software.

 -- Mark Hopkins (markh@csd4.csd.uwm.edu)

------------------------------------------------------------

Disclaimer:  Everything below the dotted line is patently false.

wan@isl.Stanford.EDU (Eric A. Wan) (11/14/90)

We find your recent postings concerning Newton's Method for neural
networks somewhat misleading.  Using Newton's method to find a zero of
the mse surface does not seem like a very wise thing to do.  The mse
surface almost never has a zero in a real problem.  The goal of an
optimization problem like training a neural network is to minimize an
objective function, not to find a zero in that function (especially if
one does not exist). You do not want to use Newton's method to find a
zero of the error surface but rather of the gradient.  Some of the
difficulties you encountered (e.g., infinite jumps near a local
minimum) are a result of this error.  You mention the method does not
work well for f(x) = x^2.  In fact, Newton's Method applied to the
gradient of x^2 converges in one step.
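
For concreteness, here is a small sketch (ours, in C; not anyone's posted
code) contrasting the two uses of Newton's method on f(x) = x^2:

    /* Newton root-finding on f(x) = x^2 versus Newton on its gradient. */
    #include <stdio.h>

    int main(void)
    {
        double x = 4.0;
        int n;

        /* On f itself: x <- x - f(x)/f'(x) = x - x^2/(2x) = x/2. */
        for (n = 0; n < 5; n++) {
            x = x - (x * x) / (2.0 * x);
            printf("root search: %g\n", x);   /* 2, 1, 0.5, 0.25, 0.125 */
        }

        /* On the gradient: x <- x - f'(x)/f''(x) = x - 2x/2 = 0. */
        x = 4.0;
        x = x - (2.0 * x) / 2.0;
        printf("gradient version: %g\n", x);  /* 0: exact minimum, one step */
        return 0;
    }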

Furthermore, you are considering terms independently.  Only diagonal
terms of the "Hessian" (not actually a Hessian in your algorithm) are
being approximated.  This is not Newton's method.  True Newton's
method requires the full Hessian.  Using only the diagonal performs no
rotation and is more akin to rescaling to reduce eigenvalue spread.  Becker
and Le Cun (Proceedings of the 1988 Connectionist Models Summer School)
derived a method for finding the exact diagonal values.  However,
their method is only valid for networks with a single hidden layer.
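
To illustrate the rotation point with a toy example (our own numbers, not
anyone's algorithm): on a quadratic with a correlated Hessian, the full
Newton step lands on the minimum in one step, while a diagonal-only step
merely rescales each axis:

    /* f(w) = 0.5 * w'Hw with a non-diagonal (correlated) Hessian. */
    #include <stdio.h>

    int main(void)
    {
        double H[2][2] = { {4.0, 1.8}, {1.8, 1.0} };   /* positive definite */
        double w[2] = {3.0, -2.0};
        double g[2], full[2], diag[2];
        double det = H[0][0]*H[1][1] - H[0][1]*H[1][0];

        g[0] = H[0][0]*w[0] + H[0][1]*w[1];    /* grad of 0.5*w'Hw is H*w */
        g[1] = H[1][0]*w[0] + H[1][1]*w[1];

        /* Full Newton step: w - H^{-1} g (2x2 inverse written out). */
        full[0] = w[0] - ( H[1][1]*g[0] - H[0][1]*g[1]) / det;
        full[1] = w[1] - (-H[1][0]*g[0] + H[0][0]*g[1]) / det;

        /* Diagonal-only step: w_i - g_i / H_ii. */
        diag[0] = w[0] - g[0] / H[0][0];
        diag[1] = w[1] - g[1] / H[1][1];

        printf("full Newton: (%g, %g)\n", full[0], full[1]);  /* (0, 0)      */
        printf("diag only:   (%g, %g)\n", diag[0], diag[1]);  /* (0.9, -5.4) */
        return 0;
    }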

Note: Newton's Method (applied to the gradient) and other second-order
methods (Conjugate Gradient, Quasi-Newton, etc.) have
been studied for use with neural networks.  In all cases the
algorithms are subject to local minima.  It is true your method is not
subject to local minima.  On the other hand, it will also never
reach a stable solution unless you impose arbitrary halting rules.


- M. Lehr, E. Wan, F. Beaufays, S. Piche

markh@csd4.csd.uwm.edu (Mark William Hopkins) (11/16/90)

In article <1248@helens.Stanford.EDU> wan@isl.Stanford.EDU (Eric A. Wan) writes:
>You mention the method does not work well for f(x) = x^2.  In fact, Newton's
>Method applied to the gradient of x^2 converges in one step.

You lost me here.  Try using Newton's method to find a zero of x^2.  This is
what I was describing.  It has the same (relatively poor) performance as
binary search:

		  x[n+1] = x[n] -  x[n]^2/(2*x[n]) = 1/2 * x[n].

>Using Newton's method to find a zero of the mse surface does not seem like a
>very wise thing to do.  The mse surface almost never has a zero in a real
>problem.  The goal of an optimization problem like training a neural network
>is to minimize an objective function, not to find a zero in that function
>(especially if one does not exist).

I simply disagree.

You have to concede: it's a valid point to say that the goal is to find a zero
or near-zero of the error function.  It's no use finding any minima if they
aren't 'near' zero.  And if there aren't any minima AT ALL near a zero on the
surface, then how could it even make sense to talk about convergence in the
first place?  A "solution" with a significant error is simply not a solution,
even if it's "optimum".

It goes the other way around too: if you're already near a zero, then going
further to find a minimum is just plain counter-productive.  You know the
saying: "if it ain't broke, don't fix it".  All you end up doing is forcing the
net to learn something that, in the judgement of the people who defined "near"
for that particular application, it ALREADY 'knows'.

(On Newton's Method never fully converging):
>On the other hand, it will also never reach a stable solution unless you
>impose arbitrary halting rules.

I pointed this out, but argued that it is a positive feature: it lets the
neural net autonomously determine when it has learned something and when it
hasn't.
The particular halting rule I described was not arbitrary but followed directly
from the considerations above about when you can say something is "solved" and
when it's not "solved".  You just provide a cutoff on the error function
itself.
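
Schematically (this is just a sketch, not the code in nn.Z; the names and the
cutoff value are placeholders), the halting rule amounts to:

    /* Halt on the error itself: "solved" means the total error has fallen
     * below an application-defined cutoff, not that the gradient vanished. */
    #include <stdio.h>

    #define ERROR_CUTOFF 0.01     /* the application's definition of "near" */
    #define MAX_EPOCHS   100000

    int main(void)
    {
        double err = 1.0;         /* stand-in for the net's total error     */
        int epoch;

        for (epoch = 0; epoch < MAX_EPOCHS; epoch++) {
            if (err < ERROR_CUTOFF) {
                printf("solved at epoch %d, error %g\n", epoch, err);
                return 0;         /* the net already 'knows' the task       */
            }
            /* one training pass would go here; the decay below is a dummy  */
            err *= 0.9;
        }
        printf("cutoff never reached\n");
        return 1;
    }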

landman@hanami.Eng.Sun.COM (Howard A. Landman) (11/20/90)

>In article <1248@helens.Stanford.EDU> wan@isl.Stanford.EDU (Eric A. Wan) writes:
>>You mention the method does not work well for f(x) = x^2.  In fact, Newton's
>>Method applied to the gradient of x^2 converges in one step.

In article <7684@uwm.edu> markh@csd4.csd.uwm.edu (Mark William Hopkins) writes:
>You lost me here.  Try using Newton's method to find a zero of x^2.

The gradient of x^2 is 2x.

--
	Howard A. Landman
	landman@eng.sun.com -or- sun!landman

wan@whirlwind.Stanford.EDU (Eric A. Wan) (11/22/90)

In article <7684@uwm.edu> markh@csd4.csd.uwm.edu (Mark William Hopkins) writes:
>In article <1248@helens.Stanford.EDU> wan@isl.Stanford.EDU (Eric A. Wan) writes:
>>You mention the method does not work well for f(x) = x^2.  In fact, Newton's
>>Method applied to the gradient of x^2 converges in one step.
>
>You lost me here.  Try using Newton's method to find a zero of x^2.  This is
>what I was describing.  It has the same (relatively poor) performance as
>binary search:
>
>		  x[n+1] = x[n] -  x[n]^2/(2*x[n]) = 1/2 * x[n].
>

I think you still don't quite understand what we mean.  Near the
current value X_k, look at the Taylor expansion of f through second
order:

    f(X) = f(X_k) + grad(f)(X_k)^T * (X - X_k)
           + (1/2) * (X - X_k)^T * H * (X - X_k) + O(||X - X_k||^3)

Now take the gradient of this equation (dropping the higher-order terms):

   grad(f)(X) = grad(f)(X_k) + H*(X - X_k)

Using Newton's Method to find the zero of the gradient (which corresponds
to a minimum if H is positive definite) yields the following algorithm:

    X_{k+1} = X_k - H^(-1)*grad(f)(X_k)      (1)

This is the version of Newton's Method for minimizing a function found
in most optimization texts.

Now consider the example of f(x) = x^2.  (grad = 2x, H = grad(grad) = 2)
From equation (1) with any initial value of x = x_0
 
   x_1 = x_0 - (1/2)*2x_0 = x_0 - x_0 = 0

One-step convergence to the minimum!  For any quadratic in any number of
dimensions, one-step convergence is also attained (not so if you search
for a zero of the function itself).
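
And to connect this back to the "infinite jumps" problem: when the function
has no zero at all (like an mse surface whose minimum is above zero), the
root-finding version of Newton's method misbehaves, while the gradient
version still lands on the minimum.  A small sketch (ours) with f(x) = x^2 + 1:

    /* f(x) = x^2 + 1: minimum value 1 at x = 0, no zero anywhere. */
    #include <stdio.h>

    int main(void)
    {
        double x = 4.0;
        int n;

        for (n = 0; n < 8; n++) {
            x = x - (x * x + 1.0) / (2.0 * x);   /* root search on f        */
            printf("root search: %g\n", x);      /* never settles; takes a
                                                    huge jump whenever an
                                                    iterate lands near x = 0
                                                    (where f' = 0)          */
        }

        x = 4.0;
        x = x - (2.0 * x) / 2.0;                 /* eq. (1): f' = 2x, H = 2 */
        printf("Newton on gradient: %g\n", x);   /* 0: the true minimum     */
        return 0;
    }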
   
>You have to concede: it's a valid point to say that the goal is to find a zero
>or near-zero of the error function.  It's no use finding any minima if they
>aren't 'near' zero.  And if there aren't any minima AT ALL near a zero on the
>surface, then how could it even make sense to talk about convergence in the
>first place?  A "solution" with a significant error is simply not a solution,
>even if it's "optimum".

Sorry, but we don't agree.  The goal of training neural networks is
finding a minimum which doesn't have to be a zero.  For a simple
example, suppose you want to classify points drawn from two highly
overlapping Gaussian distributions.  In this case the optimal solution
will have a large error.  If the network were to have a large number
of weights, you could cause it to classify the points in your training
set perfectly, and the error of the trained system would be very close
to zero.  Such a network, however, would perform very poorly outside
the training set, because you are overfitting the training data.  To
get good generalization in this situation, you must use a very small
network which is capable of implementing a function not much
more complex than the optimal boundary (which is a hyperellipsoid or a
hyperplane, depending on the relative covariance of the two
distributions).  In this case the minimum solution will clearly have a
large error.  This sort of situation occurs often in real
classification problems.  For the most part, the only case where this
is not true to some extent is in contrived logic problems that are
probably better solved by something other than neural networks.
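
As a concrete (made-up) instance: for two equally likely 1-D Gaussian classes
with unit variance and means one standard deviation apart, even the
Bayes-optimal rule misclassifies about 31% of the points, so the achievable
error is nowhere near zero:

    /* Bayes error for two equally likely, heavily overlapping Gaussians. */
    #include <stdio.h>
    #include <math.h>

    static double normal_cdf(double z)
    {
        return 0.5 * (1.0 + erf(z / sqrt(2.0)));
    }

    int main(void)
    {
        double mu0 = 0.0, mu1 = 1.0, sigma = 1.0;
        double boundary = (mu0 + mu1) / 2.0;   /* optimal for equal priors */
        double bayes_error = normal_cdf((boundary - mu1) / sigma);

        printf("Bayes error = %.3f\n", bayes_error);   /* about 0.309 */
        return 0;
    }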

If you manage to get a neural network to solve speech recognition with
zero mean squared error, we will retract our statements.

>
>(On Newton's Method never fully converging):
>>On the other hand, it will also never reach a stable solution unless you
>>impose arbitrary halting rules.
>
>I pointed this out, but argued that it is a positive feature: it lets the
>neural net autonomously determine when it has learned something and when it
>hasn't.
>The particular halting rule I described was not arbitrary but followed directly
>from the considerations above about when you can say something is "solved" and
>when it's not "solved".  You just provide a cutoff on the error function
>itself.

True, your method provides a way to jump out of local minima, but to us
it doesn't seem like a very good way to go about it.  If it were to hit
a local minimum exactly (where the gradient is zero), your algorithm
would throw the weights to infinity, which doesn't seem like such a good
idea.  Why not just have your algorithm start training from a new random
point if the local minimum is above the threshold you have chosen?
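
A sketch of the restart rule we have in mind (the names and thresholds are
placeholders, not an existing implementation):

    /* If the gradient is essentially zero but the error is still above the
     * cutoff, we are stuck in a bad local minimum, so re-initialize the
     * weights at random rather than let the Newton update blow up. */
    #include <stdio.h>
    #include <stdlib.h>

    #define ERROR_CUTOFF 0.01
    #define GRAD_EPS     1e-8

    static void maybe_restart(double *w, int n, double error, double grad_norm)
    {
        int i;
        if (grad_norm < GRAD_EPS && error > ERROR_CUTOFF)
            for (i = 0; i < n; i++)
                w[i] = 0.5 * (2.0 * rand() / (double)RAND_MAX - 1.0);
        /* otherwise leave the weights alone and keep training */
    }

    int main(void)
    {
        double w[3] = {0.7, -1.2, 0.3};
        maybe_restart(w, 3, 0.25, 0.0);   /* stuck: error 0.25, zero gradient */
        printf("%g %g %g\n", w[0], w[1], w[2]);
        return 0;
    }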

Anyway, it still seems to make more sense to search for a minimum
rather than for a zero in the mse function.

E. Wan, M. Lehr, F. Beaufays, (Steve gave up)

markh@csd4.csd.uwm.edu (Mark William Hopkins) (11/29/90)

In article <3146@exodus.Eng.Sun.COM> landman@hanami.Eng.Sun.COM (Howard A. Landman) writes:
>>In article <1248@helens.Stanford.EDU> wan@isl.Stanford.EDU (Eric A. Wan) writes:
>>>You mention the method does not work well for f(x) = x^2.  In fact, Newton's
>>>Method applied to the gradient of x^2 converges in one step.
>
>In article <7684@uwm.edu> markh@csd4.csd.uwm.edu (Mark William Hopkins) writes:
>>You lost me here.  Try using Newton's method to find a zero of x^2.
>
>The gradient of x^2 is 2x.

Hence the iteration formula:

	      x[n+1] = x[n] - f(x[n])/f'(x[n]) = 1/2*x[n].

So Newton's Method applied to x^2 converges only as slowly as the Method of
Bisection.  The original article wasn't talking about using Newton's Method
on the gradient of x^2, but on x^2 itself.

lehr@whirlwind.Stanford.EDU (Mike Lehr) (11/29/90)

In article <7937@uwm.edu> markh@csd4.csd.uwm.edu (Mark William Hopkins) writes:
>In article <3146@exodus.Eng.Sun.COM> landman@hanami.Eng.Sun.COM (Howard A. Landman) writes:
>>>In article <1248@helens.Stanford.EDU> wan@isl.Stanford.EDU (Eric A. Wan) writes:
>>>>You mention the method does not work well for f(x) = x^2.  In fact, Newton's
>>>>Method applied to the gradient of x^2 converges in one step.
>>
>>In article <7684@uwm.edu> markh@csd4.csd.uwm.edu (Mark William Hopkins) writes:
>>>You lost me here.  Try using Newton's method to find a zero of x^2.
>>
>>The gradient of x^2 is 2x.
>
>Hence the iteration formula:
>
>	      x[n+1] = x[n] - f(x[n])/f'(x[n]) = 1/2*x[n].
>
>So Newton's Method applied to x^2 converges only as slowly as the Method of
>Bisection.  The original article wasn't talking about using Newton's Method
>on the gradient of x^2, but on x^2 itself.

We all give up.

-M. Lehr, F. Beaufays, E. Wan, S. Piche