camargo@cs.columbia.edu (Francisco Camargo) (05/30/89)
Can anyone shed some light on the following issue: how should one compute the weight adjustments in BackProp?

From reading PDP, one gathers the impression that the DELTAS should be accumulated over all INPUT PATTERNS and only then a STEP taken along the gradient. Robbins and Monro suggest a stochastic algorithm with proven convergence if one takes a step at each pattern presentation, but damps its effect by a factor 1/k, where "k" is the presentation number. Other people (judging from code that I've seen flying around) seem to take a STEP at each presentation and don't apply any damping factor. I've tried both approaches myself and they both seem to work. So which is the correct way of adjusting the weights? Accumulate the errors over all patterns? Or work towards the minimum as new patterns are presented? What are the implications?

Any light on this issue is extremely appreciated.

Francisco A. Camargo
CS Department - Columbia University
camargo@cs.columbia.edu

PS: A few weeks ago, I requested some pointers to learning algorithms in NNs and promised a summary of the replies. It is coming; I have not forgotten my responsibilities to this group. Even though I got more requests than really new info, I'll have a summary posted shortly. Thanks to all who contributed.
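To make the three update schemes concrete, here is a minimal sketch of the bookkeeping involved. The helper grad(w, x, t), the pattern format (x, t), and the function names are all made up for illustration; grad stands for whatever a single backprop pass computes for one pattern. This is only a sketch of the three strategies, not anyone's reference implementation.

    import numpy as np

    def batch_update(w, patterns, lr, grad):
        """Accumulate deltas over ALL patterns, then take one step (true gradient)."""
        total = np.zeros_like(w)
        for x, t in patterns:
            total += grad(w, x, t)
        return w - lr * total

    def online_update(w, patterns, lr, grad):
        """Take a full step after EACH pattern presentation (no damping)."""
        for x, t in patterns:
            w = w - lr * grad(w, x, t)
        return w

    def robbins_monro_update(w, patterns, lr, grad, k0=1):
        """Step after each pattern, but damp the k-th step by a factor 1/k."""
        k = k0
        for x, t in patterns:
            w = w - (lr / k) * grad(w, x, t)
            k += 1
        return w

The first variant follows the exact gradient of the total error; the other two follow a noisy per-pattern gradient, with the 1/k schedule shrinking the noise over time.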
heumann@hpmtlx.HP.COM ($John_Heumann@hpmtljh) (05/31/89)
> / hpmtlx:comp.ai.neural-nets / camargo@cs.columbia.edu (Francisco Camargo) / 5:26 pm May 29, 1989 /
>
> Can anyone shed some light on the following issue: how should one compute
> the weight adjustments in BackProp? [...] Accumulate the errors over all
> patterns? Or work towards the minimum as new patterns are presented?
> ----------

A few comments.

1) Note that if backprop is modified by the addition of a search (rather than a fixed step size) in the minimization direction, it's simply a form of gradient descent. In light of this,

2) If you want to accumulate the gradient accurately over the entire search space, you're forced to either a) accumulate over all samples before altering any weights, or b) take a tiny (actually infinitesimal) step after each sample is presented. Taking a full step after each sample presentation destroys the descent property of the algorithm.

3) If you're using backprop as originally presented (i.e. either fixed step size or with a momentum term), I don't believe there is any general way to establish that one method is superior to the other for all problems. I've seen search spaces on which one wanders rather aimlessly and the other converges rapidly; the trouble is that which is good and which is bad depends on the particular problem! It's certainly true that doing something which causes you to deviate from a true descent path (like adding a momentum term or taking a step after each sample) can help you escape local minima in selected cases and can lead to more rapid convergence on some problems. Unfortunately, it can also lead to aimless wandering and poor performance on others.

4) If you can find the full reference, I'd be interested in seeing the Robbins-Monro paper, since I'm unaware of any backprop-like method with proven convergence for non-convex search spaces.

5) Personally, my choice for optimizing NNs is to modify backprop to be a true gradient descent method and then use either the Fletcher-Reeves or Polak-Ribiere method for accelerated convergence. Doing so means you WILL have trouble with local minima if they're present in your search space, but it avoids all the tweaky parameters in the original backpropagation algorithm.
(Since there's no one set of parameters that appears applicable across a wide range of problems, you can waste a huge amount of time trying to tweak the learning rate or the size of the momentum term; to my mind this is simply not practical for large problems.) Hope this is of some help.

ps: Note that Rumelhart et al. are purposefully rather vague about whether the weight adjustment is to be done after each sample presentation. If you carefully compare the chapter on backprop in PDP with their Nature paper, you'll find that each uses a different tactic!
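For anyone who wants to try the approach in point 5, here is a rough sketch of the nonlinear conjugate-gradient idea (Polak-Ribiere variant; Fletcher-Reeves only changes the beta formula, as noted in the comment). The names loss(w) and grad(w) are assumptions standing in for a batch-mode backprop that returns the total error over all patterns and its gradient, and the line search shown is a crude backtracking rule rather than anything sophisticated.

    import numpy as np

    def backtracking_line_search(w, d, loss, grad, step=1.0, shrink=0.5, c=1e-4):
        """Crude Armijo backtracking: shrink the step until the loss drops enough."""
        f0, g0 = loss(w), grad(w)
        slope = np.dot(g0, d)
        while loss(w + step * d) > f0 + c * step * slope and step > 1e-12:
            step *= shrink
        return step

    def conjugate_gradient(w, loss, grad, n_iters=100, tol=1e-8):
        """Nonlinear conjugate gradient on top of batch-mode backprop gradients."""
        g = grad(w)
        d = -g                          # start along steepest descent
        for _ in range(n_iters):
            step = backtracking_line_search(w, d, loss, grad)
            w = w + step * d
            g_new = grad(w)
            if np.linalg.norm(g_new) < tol:
                break
            # Polak-Ribiere beta (clipped at 0); Fletcher-Reeves would use
            # np.dot(g_new, g_new) / np.dot(g, g) instead.
            beta = max(0.0, np.dot(g_new, g_new - g) / np.dot(g, g))
            d = -g_new + beta * d
            g = g_new
        return w

Because each step uses a line search along a descent direction, the total error can only decrease, which is exactly the descent property that a fixed per-sample step gives up.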