[comp.ai.neural-nets] GD and LSE

gblee@maui.cs.ucla.edu (03/25/89)

> There is another paper with a similar claim:
>  "Gradient descent fails to separate" is its title.
>            By: M. Brady and R. Raghavan
>The paper shows the failure of BP in the case of examples where 
>there are no local minima.  They assert (and they could be right as ...
>                                                               
>                                         (FIAT LUX)

Can you tell us where and when the paper you mentioned was published?
Maybe somebody else in this newsgroup is also interested in ...
I know of another similar paper that attacked GD....

 Sutton, R. Two problems with BP and other steepest-descent learning
 procedures for networks.

 He points out:
1. steepest descent is particularly poor for surfaces containing "ravines"
2. steepest descent results in a high level of interference between learning
on different patterns...

Unfortunately, I have forgotten where this paper was published.
Can anybody out there tell us where?

--Geunbae Lee
  AI lab, UCLA

mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) (03/25/89)

In article <22206@shemp.CS.UCLA.EDU> gblee@CS.UCLA.EDU () writes:
>I know of another similar paper that attacked GD....
>
> Sutton, R. Two problems with BP and other steepest-descent learning
> procedures for networks.
>
> He points out:
>1. steepest descent is particularly poor for surfaces containing "ravines"
>2. steepest descent results in a high level of interference between learning
>on different patterns...
>
>Unfortunately, I have forgotten where this paper was published.
>Can anybody out there tell us where?
>
>--Geunbae Lee
>  AI lab, UCLA


1)  Simply going down the gradient direction has been known for ages
to be a poor heuristic if you're in steep valleys.  It's simply
good fortune that it works in as many cases as it does.  In fact,
the "momentum method" is a very simple way to alleviate this problem
by trying to go faster in the slowly-changing direction.
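For concreteness, a minimal sketch of this contrast, assuming a toy ill-conditioned quadratic as a stand-in for the "ravine" (not any particular network):

```python
import numpy as np

# Assumed toy ravine: E(w) = 0.5 * (w1^2 + 50 * w2^2).
# Curvature differs by 50x between the two directions.
curv = np.array([1.0, 50.0])

def grad(w):
    return curv * w

def E(w):
    return 0.5 * np.sum(curv * w * w)

lr, beta, steps = 0.02, 0.9, 100

# Plain fixed-step steepest descent: crawls along the shallow direction.
w_gd = np.array([1.0, 1.0])
for _ in range(steps):
    w_gd = w_gd - lr * grad(w_gd)

# Steepest descent with momentum: the velocity accumulates along the
# slowly-changing (low-curvature) direction and damps the oscillation
# across the steep walls.
w_mom = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(steps):
    v = beta * v - lr * grad(w_mom)
    w_mom = w_mom + v

print(E(w_gd), E(w_mom))   # momentum ends up much closer to the minimum
```

With the same learning rate and step budget, the momentum run reaches a far lower error, precisely because of the accumulated velocity along the valley floor.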

Conjugate gradient is specifically designed to deal with this kind
of problem.  The disadvantages are that 1) the algorithm is much more
complicated to implement, and 2) it is not always faster.  In terms of the number
of "iterations", it's certainly faster than fixed-step steepest descent,
but now each iteration requires a gradient evaluation and several
error evaluations for an approximate line-minimization.  Just try
doing that in silicon...
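A sketch of why the iteration count is so good, again on an assumed quadratic stand-in for the error surface. Note that the closed-form line minimization below only exists because the surface is quadratic; on a real network it would have to be replaced by an approximate line search costing several error evaluations per iteration:

```python
import numpy as np

# Assumed quadratic stand-in for the error surface: E(w) = 0.5 * w^T A w.
A = np.diag([1.0, 50.0])

def grad(w):
    return A @ w

w = np.array([1.0, 1.0])
g = grad(w)
d = -g                 # first search direction: plain steepest descent
for _ in range(2):     # CG minimizes an n-dim quadratic in n iterations
    # Exact line minimization along d -- closed-form only for a quadratic;
    # this is the step a network must approximate with repeated error
    # evaluations.
    t = (g @ g) / (d @ A @ d)
    w = w + t * d
    g_new = grad(w)
    beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves update
    d = -g_new + beta * d
    g = g_new

print(0.5 * w @ A @ w)   # essentially zero after just 2 iterations
```

Two iterations versus the hundred-odd steps fixed-step descent needs on the same surface, but each of those two iterations is considerably more work per step.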

Often, many fast dumb steps beat a smart but slow method.  In my personal
experience, standard steepest descent w/ momentum is often the best in
terms of bottom-line wall-clock performance compared to fancier
algorithms.

2)  I believe that when you add more examples to the training set
the error surface becomes more "bumpy" locally, and thus the gradient
vector at each point doesn't necessarily point to the overall global
minimum as closely.  I'm not sure about this and am trying to verify it.
Using the momentum method helps with this too, because you effectively end up
averaging the gradient vector over several steps.
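A tiny illustration of the interference point, using an assumed single linear unit and two hand-picked patterns (my own example, not Sutton's):

```python
import numpy as np

# Assumed setup: one linear unit, squared error per pattern, and two
# patterns chosen so that their per-pattern gradients conflict.
x1, t1 = np.array([1.0, 0.5]), 1.0
x2, t2 = np.array([0.5, 1.0]), -1.0

def grad(w, x, t):
    # d/dw of 0.5 * (t - w.x)^2
    return -(t - w @ x) * x

w = np.zeros(2)
g1 = grad(w, x1, t1)
g2 = grad(w, x2, t2)
print(g1 @ g2)   # negative: the two gradients point against each other

# A step that reduces the error on pattern 1 *increases* it on pattern 2:
lr = 0.1
w_after = w - lr * g1
err2_before = 0.5 * (t2 - w @ x2) ** 2
err2_after = 0.5 * (t2 - w_after @ x2) ** 2
print(err2_before, err2_after)   # error on pattern 2 goes up
```

Averaging the two gradients (which is roughly what momentum does across steps) gives a direction that reduces the total error instead of thrashing between the patterns.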

In a network whose error surface is non-linear in the weights
(multi-layer perceptron, e.g.) I'm not at all convinced that there
could ever be a general-purpose learning algorithm guaranteed to
give the global minimum solution.  In practical terms, the appropriate
criterion is only "good enough".

Matt Kennel
mbkennel@phoenix.princeton.edu

sdo@linus.UUCP (Sean D. O'Neil) (03/27/89)

In article <22206@shemp.CS.UCLA.EDU> gblee@CS.UCLA.EDU () writes:
>>  "Gradient descent fails to separate" is its title.
>>            By: M. Brady and R. Raghavan
>>The paper shows the failure of BP in the case of examples where 
>>there are no local minima.  They assert (and they could be right as ...
>
>Can you tell us where the paper you mentioned is published and when?


I attended the presentation by Raghavan at ICNN '88 in San Diego, so 
at least the paper has been published in those proceedings.  My 
impression is that their result is fairly simple to understand.  
Essentially, they point out that minimizing the number of misclassifications 
is not identical to the least-squares solution, and they give several 
examples in which linear class separation is possible but the least-squares 
solution does not in fact separate the classes.
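The flavor of that result can be reproduced in a few lines. This is a hedged sketch with made-up one-dimensional data (not one of the paper's actual examples): the classes are linearly separable, yet the sign of the least-squares fit misclassifies a point.

```python
import numpy as np

# Made-up 1-D data (not from the paper): targets are +/-1, and any
# threshold between -0.5 and 1.0 separates the classes perfectly.
x = np.array([-0.5, 1.0, 2.0, 100.0])
t = np.array([-1.0, 1.0, 1.0, 1.0])

# Least-squares fit of f(x) = w*x + b to the +/-1 targets -- the unique
# solution that gradient descent on the summed squared error converges to.
X = np.column_stack([x, np.ones_like(x)])
(wls, bls), *_ = np.linalg.lstsq(X, t, rcond=None)

pred = np.sign(wls * x + bls)
print(pred)   # the point at x = -0.5 comes out as +1: not separated

# The distant (and correctly labeled) example at x = 100 drags the
# least-squares line until the decision boundary crosses the -0.5 point.
```

Minimizing squared error and minimizing misclassifications are simply different objectives, and the far-away example dominates the former while being irrelevant to the latter.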

Aha, here it is.  The paper is:

Brady, M., R. Raghavan, and J. Slawny, "Gradient Descent Fails to
Separate", in IEEE International Conference on Neural Networks,
IEEE, San Diego, CA, pp. I-649 to I-656, July 24-27, 1988.


Sean