[comp.ai.neural-nets] Back Propagation Algorithm question...

camargo@cs.columbia.edu (Francisco Camargo) (05/30/89)

Can anyone shed some light on the following issue:

How should one compute the weight adjustments in BackProp?
From reading PDP, one gathers the impression that the DELTAS
should be accumulated over all INPUT PATTERNS and only then
is a STEP taken along the gradient. Robbins and Monro suggest
a stochastic algorithm with proven convergence if one takes one
step at each pattern presentation, but damps its effect by a factor
of 1/k, where "k" is the presentation number. Other people (judging from
code that I've seen floating around) seem to take a STEP at each
presentation and don't apply any damping factor. I've tried both
approaches myself and both seem to work. So which is the correct way
of adjusting the weights? Accumulate the errors over all patterns? Or work
towards the minimum as each new pattern is presented? What are the implications?
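
For concreteness, here is a rough sketch of the two schemes I mean, in
present-day Python/NumPy, with made-up data and step sizes (just an
illustration, not anyone's actual code):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                              # 8 patterns, 3 inputs
t = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)   # made-up targets

def forward(w, x):
    return 1.0 / (1.0 + np.exp(-x @ w))                  # single sigmoid unit

def grad(w, x, target):
    y = forward(w, x)
    return (y - target) * y * (1.0 - y) * x              # dE/dw for squared error

eta = 0.5

# (a) Accumulate the DELTAS over all patterns, then take ONE step per epoch.
w_batch = np.zeros(3)
for epoch in range(100):
    g = np.zeros(3)
    for x, target in zip(X, t):
        g += grad(w_batch, x, target)
    w_batch -= eta * g

# (b) Take a step at EACH presentation, damped by 1/k (Robbins-Monro style).
w_online = np.zeros(3)
k = 0
for epoch in range(100):
    for x, target in zip(X, t):
        k += 1
        w_online -= (eta / k) * grad(w_online, x, target)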

Any light on this issue is extremely appreciated.

Francisco A. Camargo
CS Department - Columbia University
camargo@cs.columbia.edu


PS: A few weeks ago, I requested some pointers to Learning Algorithms in NN
and promised a summary of the replies. It is coming. I have not forgotten my
responsibilities to this group. Even though I got more requests than really
new info, I'll have a summary posted shortly. And thanks to all who
contributed.

heumann@hpmtlx.HP.COM ($John_Heumann@hpmtljh) (05/31/89)


A few comments.

1) Note that if backprop is modified by the addition of a line search (rather
than a fixed step size) in the minimization direction, it's simply a form of
gradient descent.  In light of this,

2) If you want to accumulate the gradient accurately over the entire training
set, you're forced to either a) accumulate over all samples before altering
any weights, or b) take a tiny (actually infinitesimal) step after each
sample is presented.  Taking a full step after each sample presentation
destroys the descent property of the algorithm.
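
Here's a toy illustration of that last point (my own construction, not taken
from any of the codes mentioned).  The total error is
E(w) = 0.5 * sum_i (w - a_i)^2 over two one-dimensional "samples":

import numpy as np

a = np.array([0.0, 10.0])                 # two "training samples"
E = lambda w: 0.5 * np.sum((w - a) ** 2)  # total error over both samples

# Accumulate the gradient over both samples, then take one modest step:
w = 4.0
g = np.sum(w - a)                         # full gradient = -2
print(E(w), E(w - 0.4 * g))               # 26.0 -> 25.04: total error goes down

# A full-size step (eta = 1) after EACH sample just jumps onto that sample,
# and the total error goes UP between presentations:
w = 4.0
for sample in a:
    w = w - 1.0 * (w - sample)
    print(E(w))                           # 50.0 after each per-sample step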

3) If you're using backprop as originally presented (i.e. either a fixed step
size or with a momentum term), I don't believe there is any general way to
establish that one method is superior to the other for all problems.  I've
seen search spaces on which one wanders rather aimlessly and the other
converges rapidly; the trouble is that which is good and which is bad
depends on the particular problem!  It's certainly true that doing something
which causes you to deviate from a true descent path (like adding a momentum
term or taking a step after each sample) can help you escape local minima
in selected cases and can lead to more rapid convergence on some problems.
Unfortunately, it can also lead to aimless wandering and poor performance
on others.
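
For reference, the fixed-step-plus-momentum update I mean is just
delta_w = -eta * dE/dw + alpha * previous_delta_w.  A minimal sketch (the
gradient function and the values of eta and alpha below are arbitrary
placeholders, not recommendations):

import numpy as np

def momentum_update(w, grad_fn, eta=0.1, alpha=0.9, steps=200):
    """Fixed step size eta plus momentum alpha; NOT a true descent method."""
    delta = np.zeros_like(w)
    for _ in range(steps):
        delta = -eta * grad_fn(w) + alpha * delta   # momentum carries the old direction
        w = w + delta
    return w

# Example on a simple quadratic bowl whose gradient is just w itself:
print(momentum_update(np.array([3.0, -2.0]), lambda w: w))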

4) If you can find the full reference, I'd be interested in seeing the
Robbins-Monro paper, since I'm unaware of any backprop-like method with proven
convergence for non-convex search spaces.

5) Personally, my choice for optimizing NNs is to modify backprop to be
a true gradient descent method and then use either the Fletcher-Reeves or
the Polak-Ribiere conjugate-gradient method for accelerated convergence.
Doing so means you WILL have trouble with local minima if they're present in
your search space, but it avoids all the tweaky parameters in the original
backpropagation algorithm.  (Since there's no one set of parameters that
appears applicable across a wide range of problems, you can waste a huge
amount of time trying to tweak the learning rate or the size of the momentum
term; to my mind this is simply not practical for large problems.)
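
In practice this just means flattening the weights into one vector and handing
the total error and the backprop-computed gradient to an off-the-shelf
nonlinear conjugate-gradient routine.  A present-day sketch using SciPy's
Polak-Ribiere-type CG (the toy error function below merely stands in for a
real forward pass and backprop pass over all patterns):

import numpy as np
from scipy.optimize import minimize

def total_error(w):
    # stand-in for the network's total error over all patterns
    return np.sum((w - np.arange(len(w))) ** 2) + 0.1 * np.sum(w ** 4)

def total_gradient(w):
    # stand-in for the gradient a full backprop pass would return
    return 2.0 * (w - np.arange(len(w))) + 0.4 * w ** 3

w0 = np.zeros(5)
result = minimize(total_error, w0, jac=total_gradient, method='CG')
print(result.x, result.fun)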

Hope this is of some help.

ps: Note that Rumelhart et al. are purposefully rather vague on whether
the weight adjustment is to be done after each sample presentation.  If
you carefully compare the chapter on backprop in PDP with that in their
Nature paper, you'll find that each paper uses a different tactic!