[comp.ai.neural-nets] State of the Art Feed-Forward Network Training Algorithms

aj3u@opal.cs.virginia.edu (Asim Jalis) (05/17/91)

What is the state of the art in training feed-forward networks?  I
have seen a few papers in conference proceedings that deal with this
topic but did not see anything that improved performance over plain
Backpropagation drastically for the general case.

Does anyone have any references or a bibliography for this, by any
chance?

Thanks.

Asim Jalis.

smagt@fwi.uva.nl (Patrick van der Smagt) (05/17/91)

aj3u@opal.cs.virginia.edu (Asim Jalis) writes:

>What is the state of the art in training feed-forward networks?
...
>did not see anything that improved performance over plain
>Backpropagation drastically for the general case.

The answer to this question is really very, very simple.  Remember
that teaching a feed-forward network is nothing but minimisation
of a function

	E = 1/2 \sum_p (\vec d_p - \vec a_p)^2

by varying the parameters W.  This has been investigated a zillion
times.  Look up any book on numerical analysis, e.g.,

	%A W. H. Press
	%A B. P. Flannery
	%A S. A. Teukolsky
	%A W. T. Vetterling
	%T Numerical Recipes: The Art of Scientific Computing
	%I Cambridge University Press
	%C Cambridge
	%D 1986

	%A J. Stoer
	%A R. Bulirsch
	%T Introduction to Numerical Analysis
	%I Springer-Verlag
	%C New York--Heidelberg--Berlin
	%D 1980

to read about improvements to steepest descent (aka gradient descent)
minimisation.  Amongst the best methods is conjugate gradient optimisation.
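
To make the point concrete, here is a minimal sketch of that view: define
the error E(W) of a tiny feed-forward net and hand it to an off-the-shelf
conjugate gradient minimiser.  Python with numpy/scipy, the 2-2-1 sigmoid
net and the XOR patterns are my own illustrative assumptions, not anything
taken from the books above.

# Training as plain function minimisation: define E(W), give it to a
# general-purpose minimiser.  Toy 2-2-1 sigmoid net on XOR.
import numpy as np
from scipy.optimize import minimize

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # input patterns
D = np.array([[0.], [1.], [1.], [0.]])                   # desired outputs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(w):
    # 2x2 hidden weights, 2 hidden biases, 2x1 output weights, 1 output bias
    return w[0:4].reshape(2, 2), w[4:6], w[6:8].reshape(2, 1), w[8:9]

def E(w):
    # E = 1/2 \sum_p (\vec d_p - \vec a_p)^2, the function being minimised
    W1, b1, W2, b2 = unpack(w)
    A = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    return 0.5 * np.sum((D - A) ** 2)

w0 = np.random.default_rng(0).normal(scale=0.5, size=9)
result = minimize(E, w0, method='CG')    # nonlinear conjugate gradients
print(result.fun)                        # final error E

The gradient is estimated by finite differences here; in practice you would
supply the back-propagation gradient via the jac argument.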

I myself haven't used error back-propagation for over a year, but CG
instead.  It sizzles.

						Patrick van der Smagt 

		`But wait a bit!' the oysters cried
		`Before we have our chat.'
		`For all of us are out of breath'
		`And some of us are fat.'
		`No hurry,' said the carpenter.
		They thanked him much for that.
								    /\/\
                                                                    \  /
Organisation: Faculty of Mathematics & Computer Science             /  \
              University of Amsterdam, Kruislaan 403,            _  \/\/  _
              NL-1098 SJ  Amsterdam, The Netherlands            | |      | |
Phone:        +31 20  525 7524                                  | | /\/\ | |
Fax:          +31 20  525 7490                                  | | \  / | |
                                                                | | /  \ | |
email:        smagt@fwi.uva.nl                                  | | \/\/ | |
                                                                | \______/ |
                                                                 \________/

								    /\/\
``The opinions expressed herein are the author's only and do        \  /
not necessarily reflect those of the University of Amsterdam.''     /  \
                                                                    \/\/

greenba@gambia.crd.ge.com (ben a green) (05/17/91)

In article <1991May17.090435.9180@fwi.uva.nl> smagt@fwi.uva.nl (Patrick van der Smagt) writes:

   aj3u@opal.cs.virginia.edu (Asim Jalis) writes:

   >What is the state of the art in training feed-forward networks?
   ...
   >did not see anything that improved performance over plain
   >Backpropagation drastically for the general case.

   The answer to this question is really very, very simple.  Remember
   that teaching a feed-forward network is nothing but minimisation
   of a function

   ...

   to read about improvements to steepest descent (aka gradient descent)
   minimisation.  Amongst the best methods is conjugate gradient optimisation.

   I myself haven't used error back-propagation for over a year, but CG
   instead.  It sizzles.

More needs to be said about conjugate gradient optimisation.  I agree
fully with Patrick van der Smagt, but word is going around that CG is no good.
There was a recent thread on this newsgroup, "Are CG algorithms any good?"

Through the courtesy of a Government Agency, which I do not have permission
to name, I recently ran a test training a net on some representations of
speech sounds. They had found that backprop trained to 90% accuracy in
about 150,000 presentations of the training set, while CG could not get
past 70%. They were using a CG implementation that shall be nameless, but
which is made available to the public by a university.

My implementation of CG trained to 90% on this problem in 1676 presentations
of the training set. That's a factor of 89 faster than backprop.

I have no idea why the other implementation failed so badly. There are
many choices to make concerning how to do line searches, for example.
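
To give a feel for where implementations can diverge, here is a rough sketch
of a single nonlinear CG iteration, with a Polak-Ribiere direction update and
a plain backtracking (Armijo) line search.  It is a generic Python
illustration (not our code), and only one of many possible line-search rules;
E and gradE stand for whatever error and gradient routines you already have.

# One nonlinear conjugate gradient step (sketch).  Assumes d is a descent
# direction (e.g. d = -gradE(w) at the first step or after a restart).
import numpy as np

def cg_step(w, d, g_old, E, gradE, alpha0=1.0, c=1e-4, shrink=0.5):
    # Backtracking (Armijo) line search along d: accept the first step
    # length giving "sufficient decrease".  Implementations differ most here.
    alpha, E0, slope = alpha0, E(w), g_old @ d
    while E(w + alpha * d) > E0 + c * alpha * slope:
        alpha *= shrink
    w_new = w + alpha * d
    # Polak-Ribiere formula for the new conjugate search direction.
    g_new = gradE(w_new)
    beta = max(0.0, g_new @ (g_new - g_old) / (g_old @ g_old))
    d_new = -g_new + beta * d
    return w_new, d_new, g_new

A serious implementation would also restart with d = -gradE(w) every so often
(a common rule is every n iterations, n being the number of weights) and use a
more careful line search; how those choices are made can apparently make or
break CG on a given problem.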

This is not the only example: Another hit on CG was made in the thesis
of a student of a well-known NN researcher in the Northeast. He said that
CG got stuck in local minima. He was kind enough to share the data with us,
and we trained on the problem very quickly with CG and with no local
minimum problem.

For an introduction to CG applied to NN training, see the excellent article
by Kramer and Sangiovanni-Vincentelli in Advances in Neural Information
Processing Systems 1, pp. 40-48, 1989.  Buy the book from Morgan Kaufmann,
San Mateo, CA, USA,
if you have to.

Please do not ask for my software. GE won't let me give it out.

Ben

--
Ben A. Green, Jr.              
greenba@crd.ge.com
  Speaking only for myself, of course.

ins_atge@jhunix.HCF.JHU.EDU (Thomas G Edwards) (05/18/91)

In article <GREENBA.91May17100557@gambia.crd.ge.com> greenba@gambia.crd.ge.com (ben a green) writes:
>In article <1991May17.090435.9180@fwi.uva.nl> smagt@fwi.uva.nl (Patrick van der Smagt) writes:
>   aj3u@opal.cs.virginia.edu (Asim Jalis) writes:
>   >What is the state of the art in training feed-forward networks?

>   I myself haven't used error back-propagation for over a year, but CG
>   instead.  It sizzles.

>My implementation of CG trained to 90% on this problem in 1676 presentations
>of the training set. That's a factor of 89 faster than backprop.

I think it is important to point out that backpropagation refers to
a method of computing the error gradient w.r.t. the weights.
One might use simple gradient descent, steepest descent with line search,
CG (which really rocks when properly done), or modified Newton methods
(which can go even faster than CG, but not by a heck of a lot).
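
To restate that separation in code: back-propagation is only the gradient
routine, and the optimizer that consumes the gradient is a separate choice.
The sketch below is my own toy illustration (Python with numpy/scipy, a 2-2-1
sigmoid net on XOR); it feeds the same hand-written backprop gradient first to
fixed-step gradient descent and then to conjugate gradients.

# Back-propagation computes dE/dW; what you do with that gradient
# (plain gradient descent, CG, quasi-Newton, ...) is a separate choice.
import numpy as np
from scipy.optimize import minimize

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = np.array([[0.], [1.], [1.], [0.]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(w):
    return w[0:4].reshape(2, 2), w[4:6], w[6:8].reshape(2, 1), w[8:9]

def E(w):
    W1, b1, W2, b2 = unpack(w)
    A = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    return 0.5 * np.sum((D - A) ** 2)

def gradE(w):
    # This is the back-propagation pass: chain rule, layer by layer.
    W1, b1, W2, b2 = unpack(w)
    H = sigmoid(X @ W1 + b1)
    A = sigmoid(H @ W2 + b2)
    delta2 = (A - D) * A * (1.0 - A)            # output-layer error term
    delta1 = (delta2 @ W2.T) * H * (1.0 - H)    # propagated back one layer
    return np.concatenate([(X.T @ delta1).ravel(), delta1.sum(0),
                           (H.T @ delta2).ravel(), delta2.sum(0)])

w = np.random.default_rng(1).normal(scale=0.5, size=9)

# Consumer 1: "plain backprop", i.e. fixed-step gradient descent.
w_gd = w.copy()
for _ in range(5000):
    w_gd -= 0.5 * gradE(w_gd)

# Consumer 2: the same gradient fed to conjugate gradients.
w_cg = minimize(E, w, jac=gradE, method='CG').x

print(E(w_gd), E(w_cg))

The only difference between "plain backprop" and the fancier methods is the
loop at the bottom; gradE itself is untouched.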

Someone at Oregon Graduate Institute used to have a good CG program
available via anonymous ftp.  I have used that implementation, and
it was exceedingly fast.  Yes, you'd find local minima in small problems,
but in most normal-sized problems there were none.

I would recommend that people interested in training NNs look into
Cascade-Correlation (Fahlman, TR available on cheops.cis.ohio-state.edu
in /pub/neuroprose, I believe).  It builds up a network with a minimal
number of hidden units (relatively minimal, I don't think it is 
optimally minimal), and all learning is done on a single layer of
weights at a time, so no nasty backprop pass.  It is exceedingly
fast, especially if you use something better than simple gradient
descent on the correlation and error minimization.
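
For the curious, the score each candidate hidden unit is trained to maximize
is (as I read Fahlman and Lebiere's report) the summed magnitude of the
covariance between the candidate's output and the residual error at each
output unit.  A short sketch in Python, with variable names of my own
choosing:

# Candidate scoring in Cascade-Correlation (my reading of the report):
# S = sum_o | sum_p (V_p - mean V)(E_po - mean_p E_o) |
import numpy as np

def candidate_score(V, E_out):
    # V:     (n_patterns,)            candidate unit's outputs
    # E_out: (n_patterns, n_outputs)  residual errors of the current net
    Vc = V - V.mean()
    Ec = E_out - E_out.mean(axis=0)
    return np.abs(Vc @ Ec).sum()

Each candidate adjusts only its own input weights, by gradient ascent on S
(Quickprop in Fahlman's own implementation, I believe); the best candidate is
then frozen and installed as a new hidden unit, and the output weights are
retrained.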

Cascade-Correlation has recently been extended to recurrent nets,
and I plan to see how it works on a sun activity predictor over the
next 3 months.

-Thomas Edwards