[comp.ai.neural-nets] Scaled Conjugate Gradient

fodslett@daimi.aau.dk (Martin Moller) (11/17/90)

Recently there has been a lot of interest in learning algorithms that are
faster than back-propagation.

Newton's method and conjugate gradient methods have been mentioned.
Common to these algorithms is that they raise the computational cost
per learning iteration considerably: Newton's method by inverting the Hessian
matrix of the error function, and the conjugate gradient methods by
performing a line search in order to determine a good step size.

For the past 1 1/2 years I have been working on developing a conjugate
gradient method that avoids this line search.
Instead of a line search, I use a scaling technique. The algorithm has
been in use for more than half a year now and works well.
Each iteration requires two calculations of the gradient of the error;
for comparison, standard conjugate gradient methods using a line search
need on average 4-20 gradient calculations.
The algorithm is an order of magnitude faster than backpropagation and
seems to be able to handle ravine phenomena much more effectively than BP.
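
To give a rough idea of what replaces the line search, here is a small
illustrative sketch in Python (using numpy) of a single iteration -- it is
not taken from the preprint, and it assumes a routine grad(w) returning the
gradient of the error at the weight vector w. The curvature along the
search direction p is estimated from one extra gradient evaluation and kept
positive by a scale factor lambda:

    import numpy as np

    def scg_step(grad, w, p, r, lmbda=1e-6, sigma0=1e-4):
        # One scaled-conjugate-gradient style step (sketch only).
        # r is the negative gradient at w, carried over from the last step.
        # Curvature along p from one extra gradient call (r == -grad(w)):
        #   s ~ (E'(w + sigma*p) - E'(w)) / sigma + lambda*p
        sigma = sigma0 / np.linalg.norm(p)
        s = (grad(w + sigma * p) + r) / sigma + lmbda * p
        delta = p @ s                 # curvature estimate along p
        pp = p @ p
        if delta <= 0:                # raise the scale until curvature > 0
            lmbda_new = 2.0 * (lmbda - delta / pp)
            delta = -delta + lmbda * pp
            lmbda = lmbda_new
        mu = p @ r
        alpha = mu / delta            # step size -- no line search
        w_new = w + alpha * p
        r_new = -grad(w_new)          # second and last gradient call
        beta = (r_new @ r_new - r_new @ r) / mu
        p_new = r_new + beta * p      # next conjugate search direction
        return w_new, p_new, r_new, lmbda

    # Initialisation: w = small random weights, r = -grad(w), p = r.

A complete version would also regulate lambda according to how well the
quadratic approximation predicted the actual decrease in error; this sketch
leaves that part out -- see the preprint for the exact update rules.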

A preprint describing this algorithm in detail will soon be available
(in 1 or 2 weeks) by ftp. A short abstract of the paper follows:

Abstract-- A supervised learning algorithm (Scaled Conjugate Gradient, SCG)
with superlinear convergence rate is introduced. SCG uses second order 
information from the neural network, but requires only O(N) memory usage.
The performance of SCG is benchmarked against the performance of the standard
backpropagation algorithm (BP) and several recently proposed standard conjugate
gradient algorithms. SCG yields a speed-up of at least an order of magnitude
relative to BP. The speed-up depends on the convergence criterion, i.e., the
greater the demanded reduction in error, the bigger the speed-up. SCG is fully
automated, includes no user-dependent parameters, and avoids the time-consuming
line search which other conjugate gradient algorithms use in order to
determine a good step size.
Incorporating problem-dependent structural information in the architecture of
a neural network often lowers the overall complexity. The smaller the
complexity of the neural network relative to the problem domain, the greater
the chance that the weight space contains long ravines characterized by sharp
curvature. While BP is inefficient on these ravine phenomena, SCG handles them
effectively.


	Martin

NB. Any questions or comments on this short note or on the preprint following
shortly would be appreciated.

______________________________ 

Martin Fodslette Moller
Computer Science Dept.
University of Aarhus
Denmark
e-mail: fodslett@daimi.aau.dk
_______________________________

kingsley@hpwrce.HP.COM (Kingsley Morse) (11/18/90)

Sounds like the scaled approach may hold promise. How much more cpu does it need
as the number of training patterns increases? As the number of inputs 
increases? Does the cpu time increase exponentially? Polynomially? Linearly?

							Kingsley

fodslett@daimi.aau.dk (Martin Moller) (11/27/90)

>Sounds like the scaled approach may hold promise. How much more cpu does it need
>as the number of training patterns increases? As the number of inputs 
>increases? Does the cpu time increase exponentially? Polynomially? Linearly?

>							Kingsley


Well, I do not have any results on the first question yet, but concerning
cpu time versus the number of inputs, there is a good example in the
forthcoming preprint.

The parity problem is used as a test example here. SCG and BP were tested
on the 3-, 4-, 5-, 6- and 7-bit parity problems using 10 different initial
weight vectors. Three- and four-layer network architectures were used for each
problem. The average total number of calls to the error function and the
gradient was used to compare the performance of the two algorithms.
SCG obtained a speed-up of between 18 and 46 relative to BP.
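
For anyone who wants to set up a similar test, the N-bit parity patterns are
easy to generate; a small sketch in Python (not from the preprint):

    import itertools
    import numpy as np

    def parity_patterns(n_bits):
        # All 2**n_bits binary input vectors; target is 1 if the number
        # of ones in the input is odd, otherwise 0.
        inputs = np.array(list(itertools.product([0, 1], repeat=n_bits)))
        targets = inputs.sum(axis=1) % 2
        return inputs, targets

    # Example: the 3-bit parity problem has 2**3 = 8 training patterns.
    X, t = parity_patterns(3)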

Let N be the number of weights and biases in the networks.
The SCG algorithm was in all runs bounded by O(N**2) and O(N log N) function
calls, while BP was bounded by O(N**3) and O(N**2 log N) function calls.
These are experimental results, and the bounds do of course
depend on the nature of the task. BP has been reported to show exponential
scaling; no such behaviour has been observed with SCG.


		Martin