gblee@maui.cs.ucla.edu (03/25/89)
> There is another paper with a similar claim.
>
>    "Gradient descent fails to separate" is its title.
>    By: M. Brady and R. Raghavan
>
> The paper shows the failure of BP in the case of examples where
> there are no local minima.  They assert (and they could be right as ...
>
>                                   (FIAT LUX)

Can you tell us where the paper you mentioned is published, and when?
Maybe somebody else in this newsgroup is also interested in ...

I know of another similar paper which attacks GD:

   Sutton, R.  Two problems with BP and other steepest-descent learning
   procedures for networks.

He points out:
1. steepest descent is particularly poor for surfaces containing "ravines"
2. steepest descent results in a high level of interference between learning
   with different patterns...

Unfortunately, I forgot where this paper was published.
Can anybody out there tell us where it was published?

--Geunbae Lee
  AI lab, UCLA
mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) (03/25/89)
In article <22206@shemp.CS.UCLA.EDU> gblee@CS.UCLA.EDU () writes:
>I know of another similar paper which attacks GD:
>
>   Sutton, R.  Two problems with BP and other steepest-descent learning
>   procedures for networks.
>
>He points out:
>1. steepest descent is particularly poor for surfaces containing "ravines"
>2. steepest descent results in a high level of interference between learning
>   with different patterns...
>
>Unfortunately, I forgot where this paper was published.
>Can anybody out there tell us where it was published?
>
>--Geunbae Lee
>  AI lab, UCLA

1) Simply going down the gradient direction has been known for ages to be a
poor heuristic when you're in a steep valley.  It's simply good fortune that
it works in as many cases as it does.  In fact, the "momentum method" is a
very simple way to alleviate this problem: it tries to go faster in the
slowly-changing direction.

Conjugate gradient is specifically designed to deal with this kind of
problem.  The disadvantages are that 1) the algorithm is much more
complicated to implement and 2) it is not always faster.  In terms of the
number of "iterations" it's certainly faster than fixed-step steepest
descent, but now each iteration requires a gradient evaluation and several
error evaluations for an approximate line minimization.  Just try doing that
in silicon...  Often, many fast dumb steps beat a smart but slow method.  In
my personal experience, standard steepest descent with momentum is often the
best in terms of bottom-line wall-clock performance compared to fancier
algorithms.

2) I believe that when you add more examples to the training set, the error
surface becomes more "bumpy" locally, and thus the gradient vector at each
point doesn't necessarily point toward the overall global minimum as
closely.  I'm not sure about this and am trying to verify it.  Using the
momentum method helps here too, because you effectively end up averaging the
gradient vector over several steps.

In a network whose error surface is non-linear in the weights (a multi-layer
perceptron, for example), I'm not at all convinced that there could ever be
a general-purpose learning algorithm guaranteed to give the global-minimum
solution.  In practical terms, the appropriate criterion is only "good
enough".

Matt Kennel
mbkennel@phoenix.princeton.edu
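To make point 1) concrete, here is a minimal sketch (not from the post: plain
NumPy on a toy quadratic "ravine", with an arbitrarily chosen step size,
momentum constant, and starting point) comparing fixed-step steepest descent
with the momentum variant:

# Illustrative sketch only: fixed-step steepest descent vs. the momentum
# method on the toy quadratic "ravine"
#   f(w) = 0.5 * (w[0]**2 + 100 * w[1]**2),
# whose 100:1 curvature ratio stands in for a steep valley.
import numpy as np

def grad(w):
    # gradient of f(w) = 0.5 * (w[0]**2 + 100 * w[1]**2)
    return np.array([w[0], 100.0 * w[1]])

def descend(momentum=0.0, step=0.015, iters=200):
    w = np.array([10.0, 1.0])   # start partway up the ravine wall
    v = np.zeros(2)             # running update ("velocity") for momentum
    for _ in range(iters):
        v = momentum * v - step * grad(w)
        w = w + v
    return w

plain = descend(momentum=0.0)
mom = descend(momentum=0.9)
print("fixed-step steepest descent:", plain, "|w| =", np.linalg.norm(plain))
print("with momentum 0.9          :", mom, "|w| =", np.linalg.norm(mom))

With these (arbitrary) settings the momentum run should end up orders of
magnitude closer to the minimum at the origin: the accumulated velocity keeps
pushing along the shallow axis while the oscillations across the steep axis
largely cancel out.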
sdo@linus.UUCP (Sean D. O'Neil) (03/27/89)
In article <22206@shemp.CS.UCLA.EDU> gblee@CS.UCLA.EDU () writes:
>> "Gradient descent fails to separate" is its title.
>> By: M. Brady and R. Raghavan
>
>>The paper shows the failure of BP in the case of examples where
>>there are no local minima.  They assert (and they could be right as ...
>
>Can you tell us where the paper you mentioned is published and when?

I attended the presentation by Raghavan at ICNN '88 in San Diego, so the
paper has at least been published in those proceedings.

My impression is that their result is fairly simple to understand.
Essentially, they point out that minimizing the number of misclassifications
is not the same as finding the least-squares solution, and they give several
examples in which linear class separation is possible but the least-squares
solution does not in fact separate the classes.

Aha, here it is.  The paper is:

   Brady, M., R. Raghavan, and J. Slawny, "Gradient Descent Fails to
   Separate," in IEEE International Conference on Neural Networks, IEEE,
   San Diego, CA, pp. I-649 to I-656, July 24-27, 1988.

Sean
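The flavor of that result is easy to reproduce with a toy problem.  The
numbers below are my own construction, not one of the examples from the
paper: a linearly separable one-dimensional training set on which the
least-squares fit to +/-1 targets (the global minimum of the squared error,
so no local minima are involved) still misclassifies a training point.

# Illustrative only -- a toy construction, not an example taken from the
# Brady/Raghavan/Slawny paper: a linearly separable 1-D problem whose
# least-squares solution fails to separate the classes.
import numpy as np

x = np.array([1.0, 3.0, 0.0, -100.0])   # inputs; any threshold in (0, 1) separates them
t = np.array([1.0, 1.0, -1.0, -1.0])    # +/-1 class targets

# exact least-squares fit of f(x) = w*x + b to the targets
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, t, rcond=None)

print("w = %.4f, b = %.4f" % (w, b))
print("targets    :", t)
print("sign(f(x)) :", np.sign(w * x + b))   # the x = 0 point comes out positive

The distant class -1 point at x = -100 forces a nearly flat slope and a
positive intercept, so the fitted line's zero crossing lands near x = -24;
the class -1 point at x = 0 is then misclassified even though any threshold
between 0 and 1 separates the classes perfectly.  That is exactly the
mismatch between the least-squares criterion and the number of
misclassifications described above.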