[comp.ai.neural-nets] Experience with Conjugate gradient minimizations

jbracher@weber.ucsd.edu (Matt Kennel) (09/13/90)

I have had some experience using the C.G. algorithm for training BP
and similar types of networks.

I've found that although it works as advertised, sometimes what it's designed
for is not what you want.  It is indeed very good at finding local minima
quickly and accurately; unfortunately, it often seems to get "stuck"
for me when training a network with 500 or 1000 free parameters instead
of the 5 or so typically considered in classic numerical analysis textbooks.

For an ANN, I found that alternating between a few iterations of
"dumb" gradient descent and then some time with C.G. worked pretty well:
the C.G. zapped along quickly, but when it appeared to get stuck, I'd
switch over to dumb gradient descent.  And when G.D. started to "screw up"
by actually increasing the error due to an inappropriate step size, I'd
switch back to C.G.  The stupidity (and speed) of G.D. seemed to get it out
of the local minima (by being totally blind to them) that sometimes caused
C.G. trouble.
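Roughly, the alternation looks like the sketch below (Python, with a small
quadratic standing in for the real network error; the loss, gradient,
thresholds, and step size are all made up for illustration, and scipy's
generic CG routine stands in for whatever C.G. code you actually use):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 50))
    H = A.T @ A + np.eye(50)            # positive-definite stand-in "Hessian"
    b = rng.normal(size=50)

    def loss(w):                        # stand-in for the network error E(w)
        return 0.5 * w @ H @ w - b @ w

    def grad(w):                        # stand-in for backprop's dE/dw
        return H @ w - b

    w = rng.normal(size=50)
    lr = 1e-3                           # deliberately "dumb" fixed step size

    for outer in range(20):
        # a burst of C.G.: fast while it is actually making progress
        res = minimize(loss, w, jac=grad, method='CG',
                       options={'maxiter': 10})
        if res.fun < loss(w) - 1e-8:    # C.G. still helping, keep going
            w = res.x
            continue
        # C.G. looks stuck: take dumb G.D. steps until one of them *increases*
        # the error (the "screw up" signal), then hand control back to C.G.
        for _ in range(100):
            step = w - lr * grad(w)
            if loss(step) > loss(w):
                break
            w = step

    print("final error:", loss(w))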

Note that the inverse "Hessian" types of algorithms in The Bible (Numerical
Recipes) are not practical for training neural networks because their
internal overhead (not due to function evaluation or gradient computation)
goes as O(N^2), N being the number of free parameters, whereas C.G. goes as
O(N) in time and storage.
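
To make the storage point concrete, here is a minimal sketch (names like
cg_direction and inv_hessian_estimate are invented for illustration): the
only per-iteration state C.G. needs is a couple of length-N vectors, while a
BFGS-style inverse-Hessian method has to carry and update an N x N matrix.

    import numpy as np

    def cg_direction(grad_new, grad_old, d_old):
        """Polak-Ribiere direction update: O(N) work and O(N) storage."""
        beta = grad_new @ (grad_new - grad_old) / (grad_old @ grad_old)
        beta = max(beta, 0.0)           # common restart safeguard
        return -grad_new + beta * d_old

    # Contrast: a quasi-Newton (inverse-Hessian) method keeps an N x N estimate.
    N = 1000                            # free parameters in the network
    inv_hessian_estimate = np.eye(N)    # O(N^2) storage before you even start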

Matt K.
mbk@inls1.ucsd.edu