[comp.ai.neural-nets] Are Conjugate Gradient algorithms any good?

baronen@daimi.aau.dk (Carsten Greve) (03/04/91)

Recently there has been much talk about the so-called Conjugate Gradient
algorithms and their use in feed-forward neural networks. We have applied
one of these algorithms, the Scaled Conjugate Gradient algorithm [1], on
the NETtalk problem [2] with poor results.

In their original experiment Sejnowski and Rosenberg used the conventional
back-propagation algorithm with weight updates after each presentation of
one word (word update). Our experiments with the same algorithm confirmed the
results of Sejnowski and Rosenberg. However, experiments showed that
back-propagation was unable to converge if the weights were updated only
after the entire training set had been presented (epoch update).

The SCG algorithm is reported, like several other Conjugate Gradient
algorithms, to outperform ordinary back-propagation with epoch learning.
Yet we found that the SCG algorithm was unable to match the performance of
back-propagation when word updates were used instead. In the SCG algorithm
weights are (normally) updated once after each presentation of the entire
training set (epoch update).

A modified version of the SCG algorithm permits more frequent weight updates.
The net is trained on a small number of patterns for some time after which
the weights are updated. Then some new training patterns are chosen and
the net is trained on those patterns, et cetera. This version gave improved
results. However, the SCG was still unable to match standard back-propagation
with word updates.
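
For concreteness, a rough sketch of this block-update schedule (Python;
`scg_update` is a stand-in for one SCG weight update computed from the error
and gradient over the current block only -- the SCG step itself is not
reproduced here, and the presentation bookkeeping is our reading of the
numbers quoted below):

import random

def train_blockwise(weights, patterns, scg_update,
                    n_presentations=30000, block_size=10, passes_per_block=10):
    """Train on a small block of patterns, update the weights once, then
    draw a fresh block, until the presentation budget is used up.  With a
    block of 10 worked over 10 times per update, 30000 presentations give
    the 300 weight updates reported in the table below."""
    presented = 0
    while presented < n_presentations:
        block = random.sample(patterns, block_size)
        weights = scg_update(weights, block)           # one weight update per block
        presented += block_size * passes_per_block     # block is worked over ~10 times
    return weights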

We have used a fully connected 3-layer net of 7 * 26 input units, 60 hidden
units, and 57 output units (one for each phoneme). The training set consists of
1000 words, and the net was trained for 30000 word presentations.

		   Average error on each pattern	Number of
		   after 30000 word presentations:	weight updates:

Back-prop with word update:   0.125                           30000
Back-prop with epoch update:  2.888 (failed to converge)         30
SCG with block update: *)     0.303                             300
SCG with epoch update:        0.871                              30

	*) A block size of 10 words was used, with each block being
	trained 10 times before weights were updated.

Even after 200000 word presentations the SCG algorithm with epoch update
had only reduced the average error to 0.215.

We have also tested the SCG algorithm on the protein data that Ning Qian
and Sejnowski used [3], and again we found that SCG was unable to compete
with back-propagation with pattern update.

We would like to know whether any of you have successfully applied a Conjugate
Gradient algorithm to a large-scale problem, and, if so, whether it
outperformed ordinary back-propagation with pattern update. We would also like
to know whether there exist any Conjugate Gradient algorithms which allow the
use of pattern update.


References:

1. Moller, Martin F. (1990) "A Scaled Conjugate Gradient Algorithm for Fast
Supervised Learning" Preprint. Available by ftp from cheops.cis.ohio-state.edu
in directory pub/neuroprose as moller.conjugate-gradient.ps.Z

2. Sejnowski, T.J., and Rosenberg, C.R. (1987).  "Parallel networks that
learn to pronounce English text" in Complex Systems, 1, 145-168.

3. Ning Qian and Terrence J. Sejnowski (1988), "Predicting the Secondary
Structure of Globular Proteins Using Neural Network Models" in Journal of
Molecular Biology 202, 865-884.  Academic Press.

esrmm@warwick.ac.uk (Denis Anthony) (03/05/91)

In article <1991Mar4.142559.21857@daimi.aau.dk> baronen@daimi.aau.dk (Carsten Greve) writes:
>Recently there has been much talk about the so-called Conjugate Gradient
>algorithms and their use in feed-forward neural networks. We have applied
>one of these algorithms, the Scaled Conjugate Gradient algorithm [1], on
>the NETtalk problem [2] with poor results.
>
>In their original experiment Sejnowski and Rosenberg used the conventional
>back-propagation algorithm with weight updates after each presentation of
>one word (word update). Our experiments with the same algorithm confirmed the
>results of Sejnowski and Rosenberg. However, experiments showed that
>back-propagation was unable to converge if the weights were updated only
>after the entire training set had been presented (epoch update).
>
>The SCG algorithm is reported, like several other Conjugate Gradient
>algorithms, to outperform ordinary back-propagation with epoch learning.
>Yet we found that the SCG algorithm was unable to match the performance of
>back-propagation when word updates were used instead. In the SCG algorithm
>weights are (normally) updated once after each presentation of the entire
>training set (epoch update).
>

Others have found SCG not that wonderful e.g.

%Q Chan L
%D 1990
%T Efficacy of Different Learning Algorithms of the Back Propagation Network (in Conference Proc. Computer and Communication Systems)
%J Proceedings of the IEEE Region 10

who found that it performed worse than dynamic learning rate adjustment.

I have experimented with the latter (dynamic learning rate and momentum adjustment) as in

%Q Vogl T.P, Mangis J.K, Rigler A.K, Zink W.T, and Alkon D.L
%D 1988
%T Accelerating the Convergence of the Back-Propagation Method
%J Biological Cybernetics
%V 59
%P 257-263

and found, as Carsten Greve reports for SCG, that using epoch updates was inferior to pattern updates.
Specifically, epoch learning with dynamic adjustment was worse than
pattern learning without any parameter tuning. Changing the algorithm to pattern learning with
adjustment does improve convergence, though for reasons I do not understand the convergence rate
is initially worse; after a while the learning rate shoots up and convergence surpasses the "ordinary"
method. Is there some theoretical reason why pattern learning should be better ? The maths in
Rumelhart et al.'s original discussion assumes epoch learning, but they say that with a small learning
rate pattern learning should be approximately the same. That is not what we find; pattern learning
seems to be clearly better.
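
For readers who have not seen it, here is a minimal sketch of a dynamic
adjustment rule in the spirit of Vogl et al. (the constants are illustrative,
not the published values):

def adapt_step(error_new, error_old, lr, momentum,
               grow=1.05, shrink=0.7, tolerance=1.01, base_momentum=0.9):
    """One learning-rate/momentum adjustment, roughly in the spirit of
    Vogl et al. (1988); the constants here are illustrative only.
    Returns (accept_step, new_lr, new_momentum)."""
    if error_new < error_old:
        # Error fell: keep the step and become a little more aggressive.
        return True, lr * grow, base_momentum
    elif error_new > error_old * tolerance:
        # Error rose noticeably: reject the step, shrink the rate, and
        # switch momentum off until things settle down.
        return False, lr * shrink, 0.0
    else:
        # Small increase: accept the step but leave the rate alone.
        return True, lr, momentum

In the pattern-update variant discussed above, error_new and error_old
presumably have to be per-pattern or running estimates rather than the true
total error, which may be part of why its behaviour differs from the epoch
case.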

The odd convergence curve (slower than ordinary learning, then changing slope) that I got for dynamic
pattern learning (it did not occur with epoch dynamic learning, where I just got an asymptotic
convergence to a high error) may be due to initial descent in error ravines. Increasing the learning
rate makes the weights oscillate from side to side of the ravine, but each time at a lower total error,
so the algorithm keeps increasing the learning rate. Once it arrives at the base of the ravine it shoots
along it, as that is now the direction of steepest slope, and does so more quickly because of the higher
learning rate.
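
The ravine picture can at least be reproduced on a toy problem. A sketch
(an elongated quadratic standing in for an error ravine, with a crude
"grow the rate while the error falls, cut it when the error rises" rule;
all constants are arbitrary):

def demo_ravine(steps=60):
    # x is the steep cross-ravine axis, y the shallow axis along the ravine.
    x, y = 1.0, 10.0
    lr, grow, shrink = 0.005, 1.1, 0.5
    err = 0.5 * (100 * x * x + y * y)
    for t in range(steps):
        gx, gy = 100 * x, y                         # gradient of the quadratic
        nx, ny = x - lr * gx, y - lr * gy
        new_err = 0.5 * (100 * nx * nx + ny * ny)
        if new_err < err:                           # still descending: accept, speed up
            x, y, err, lr = nx, ny, new_err, lr * grow
        else:                                       # overshot the ravine wall: back off
            lr *= shrink
        print("%3d  lr=%.4f  x=%+.2e  y=%7.4f  err=%9.4f" % (t, lr, x, y, err))

demo_ravine()

Watching the printout, x flips sign while shrinking (oscillation across the
ravine), the rate keeps growing, and eventually most of the progress is made
along y.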

Is this reasonable ? Any other ideas ?

Denis.

pluto@cornelius.ucsd.edu (Mark Plutowski) (03/07/91)

Reiterating the discussion so far:
Some quite interesting empirical results comparing two gradient descent algorithms:

(1) a version of conjugate gradient ("Scaled Conjugate Gradient") 
(2) backpropagation.


Variations of these two algorithms are formed by processing the training examples 
in the following modes:

(a) batched (a.k.a. "epoch") learning:  the calculation of each weight 
update utilizes information from all of the available training examples.  

(b) pattern learning:  each weight update is performed after presentation 
of a single example.

(c) "Block" learning: this lies between (a) and (b) in the sense that each 
weight update utilizes a subset of the available training examples.
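
For concreteness, a small sketch of where each weight update gets its gradient
information from under the three modes (Python; `examples` is the training set,
and the block is drawn at random here although it could be chosen deliberately):

import random

def pick_examples(mode, examples, step, block_size=10):
    """Return the examples that feed the weight update at `step`:
    (a) epoch/batched -> all of them, (b) pattern -> a single example,
    (c) block -> a subset."""
    if mode == "epoch":                              # (a)
        return examples
    elif mode == "pattern":                          # (b)
        return [examples[step % len(examples)]]
    elif mode == "block":                            # (c)
        return random.sample(examples, block_size)
    raise ValueError("unknown mode: %s" % mode)

The gradient used for a given update is then computed only over the returned
examples.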


Remarks:
I was most interested in the report by Denis Anthony, i.e., 

	>... using epoch updates was inferior to pattern updates.

except it was unclear from context whether this refers to 
a comparison of (1a) v. (1b), or between  (2a) and (2b).  


Regarding the use of the Scaled Conjugate Gradient algorithm to obtain
the combination (1b): unless there is some aspect of Scaled Conjugate Gradient
that makes it fundamentally different from the usual conjugate gradient
techniques, I cannot imagine many situations where (1b) could perform
anywhere near as well as (1a).  Can someone clarify for me what SCG brings to
the usual CG techniques?

In general, if you are doing a line search along the direction obtained by
conjugate gradient (which is the usual way the CG direction is used to obtain
a weight update), this direction should be computed using ALL of the training
examples.  By doing a line search along a direction obtained as in (1b), you
will likely destroy learning obtained over previous examples.
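
To make this concrete: a line search along a CG direction evaluates the TOTAL
error over all training examples at each trial step.  A minimal backtracking
version might look like this (the CG direction itself is taken as given):

def line_search(weights, direction, total_error, step=1.0,
                shrink=0.5, max_tries=20):
    """Backtracking line search along `direction`.  `total_error(w)` must be
    the error summed over ALL training examples; searching along a direction
    computed from a single pattern, against an error measured on that pattern
    alone, is the (1b) situation criticised above."""
    e0 = total_error(weights)
    for _ in range(max_tries):
        trial = [w + step * d for w, d in zip(weights, direction)]
        if total_error(trial) < e0:       # accept the first step that helps
            return trial, step
        step *= shrink                    # otherwise shorten the step and retry
    return weights, 0.0                   # no improving step found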

The math suggests that (1b) should fail miserably; variants of 
conjugate gradient that allow (1b) to work at all would be most interesting, since
they are (seemingly) finding a conjugate gradient direction compatible
with all of the examples by looking at only a subset of the examples;  
it seems plausible that (1c) should work better than (1b), yet not as well as
(1a)  (unless this subset of training examples is chosen judiciously.)  
I have some ideas about how to improve upon (1c) if anyone is interested in 
trying it out on an application.

The comparison of (2a) and (2b) is most interesting: of interest would be 
a simultaneous comparison with a modified Gauss-Newton approach, (where
the "Hessian" matrix is used to scale the weight update appropriately for 
each weight.)  (Recall that the backpropagation update is the special case 
of the Gauss-Newton update obtained by setting the Hessian to the identity matrix.)

[Suggestion: throw away the off-diagonal terms and use only
the diagonal terms;  the resulting matrix is always invertible when the diagonal 
terms are non-zero.  This may alleviate the technical difficulties 
normally encountered with using the Hessian.  In a sense, this approach 
scales the learning rate appropriately for each weight.  Again, I would suspect 
that this would be best done by batching, but would be interested in any 
empirical evidence to the contrary.]
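
A sketch of the bracketed suggestion, assuming a squared-error criterion and a
single-output network, with the diagonal of the summed outer product of
per-example gradients (the outer-product approximation to the Hessian) giving
each weight its own effective learning rate:

import numpy as np

def diagonal_gn_update(weights, examples, model_grad, residual,
                       lr=1.0, floor=1e-8):
    """Per-weight scaling via a diagonal Gauss-Newton approximation.
    model_grad(w, x) -> gradient of the network output w.r.t. the weights;
    residual(w, x, t) -> target minus output.  Off-diagonal terms are
    dropped, so no matrix inversion is needed, and `floor` keeps the
    diagonal strictly positive (hence trivially invertible)."""
    diag = np.zeros_like(weights)
    grad_e = np.zeros_like(weights)
    for x, t in examples:
        g = model_grad(weights, x)
        r = residual(weights, x, t)
        diag += g * g                 # diagonal of the summed outer product
        grad_e += -r * g              # gradient of 0.5 * sum of squared residuals
    return weights - lr * grad_e / np.maximum(diag, floor)

As suggested above, this is probably best done in batch mode: the sums run
over all examples before the weights move.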

radford@ai.toronto.edu (Radford Neal) (03/08/91)

>I was most interested in the report by Denis Anthony, i.e., 
>
>	>... using epoch updates was inferior to pattern updates.

I think this discussion may cause some confusion. From the original
posting, it is clear that what is claimed is that using epoch updates 
is inferior to using pattern updates ** if the criterion for cost is the
number of applications of the network to training cases **. If the
criterion for cost is number of weight changes, then there is no reason
to think epoch updates are inferior, and every reason to think they 
are superior.

This difference is not surprising. If the training set is large and
contains many very similar, maybe even identical, patterns, then applying 
the network to all patterns before changing weights is wasteful. More 
typically, however, the training data is not all that voluminous, and 
omitting some patterns might seriously impair gradient descent unless
the learning rate is set quite low.

In any comparison of learning methods, one must be clear on what the cost 
criterion is, and one must be sure to find the optimal settings for 
the learning rate and other adjustable parameters for each of the
techniques being compared.

    Radford

esrmm@warwick.ac.uk (Denis Anthony) (03/09/91)

In article <pluto.668285404@cornelius> pluto@cornelius.ucsd.edu (Mark Plutowski) writes:
>Reiterating the discussion so far:
>Some quite interesting empirical results comparing two gradient descent algorithms:
>
>(1) a version of conjugate gradient ("Scaled Conjugate Gradient")
>(2) backpropagation.
>
>
>Variations of these two algorithms are formed by processing the training examples
>in the following modes:
>
>(a) batched (a.k.a. "epoch") learning:  the calculation of each weight
>update utilizes information from all of the available training examples.
>
>(b) pattern learning:  each weight update is performed after presentation
>of a single example.
>
>(c) "Block" learning: this lies between (a) and (b) in the sense that each
>weight update utilizes a subset of the available training examples.
>
>
>Remarks:
>I was most interested in the report by Denis Anthony, i.e.,
>
>	>... using epoch updates was inferior to pattern updates.
>
>except it was unclear from context whether this refers to
>a comparison of (1a) v. (1b), or between  (2a) and (2b).
>

2a v 2b (ie. backprop)

Denis

chester@udel.edu (Daniel Chester) (03/09/91)

In his March 6th reply to Denis Anthony, Mark Plutowski made the assertion
that "the backpropagation update is the special case of the Gauss-Newton
update obtained by setting the Hessian to the identity matrix."  This is
incorrect; to get something like the backpropagation update, the Hessian
has to be set to the 0 matrix. Even then it is not the same because it
does a line search where backpropagation does one step.  If you set the
Hessian to the identity matrix, the Gauss-Newton update becomes a
conjugate-gradient method.  See the following reference for details.

David F. Shanno. "Conjugate gradient methods with inexact searches", in
Mathematics of Operations Research, Vol. 3, No. 3, August 1978, 244-256.

This reference claims that the resulting memoryless BFGS algorithm 
"substantially outperforms known conjugate gradient methods on a wide class
of problems", though I haven't tried it.
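
For the curious, the memoryless BFGS search direction is cheap to write down.
A sketch (this is simply the standard BFGS inverse-Hessian update applied to
the identity matrix, which is my reading of "memoryless" here, and not code
taken from Shanno's paper):

import numpy as np

def memoryless_bfgs_direction(g_new, g_old, step_taken):
    """Search direction from the BFGS inverse-Hessian update with the
    previous approximation reset to the identity (limited-memory BFGS
    with a one-step memory).  g_new, g_old: gradients at the new and old
    points; step_taken: x_new - x_old.  Falls back to steepest descent
    when the curvature condition s.y > 0 fails."""
    s, y, g = step_taken, g_new - g_old, g_new
    sy = float(s @ y)
    if sy <= 1e-12:
        return -g                          # no usable curvature information
    rho = 1.0 / sy
    # H = (I - rho s y^T)(I - rho y s^T) + rho s s^T, applied to g without
    # ever forming the matrix:
    v = g - rho * y * float(s @ g)         # (I - rho y s^T) g
    u = v - rho * s * float(y @ v)         # (I - rho s y^T) v
    return -(u + rho * s * float(s @ g))   # -(H g)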

Daniel Chester
University of Delaware
Department of Computer and Information Sciences
Newark, DE 19716

chester@udel.edu
--
Daniel Chester

pluto@mangani.ucsd.edu (Mark Plutowski) (03/12/91)

chester@udel.edu (Daniel Chester) writes:

>In his March 6th reply to Denis Anthony, Mark Plutowski made the assertion
>that "the backpropagation update is the special case of the Gauss-Newton
>update obtained by setting the Hessian to the identity matrix."  This is
>incorrect; to get something like the backpropagation update, the Hessian
>has to be set to the 0 matrix. Even then it is not the same because it
>does a line search where backpropagation does one step.  If you set the
>Hessian to the identity matrix, the Gauss-Newton update becomes a
>conjugate-gradient method.  See the following reference for details.

Thank you, I misspoke.
We are referring to different objects.  I believe you are 
referring to the matrix composed from second derivatives.
I was referring to the matrix that weights the gradient of the squared error  
that appears when deriving gradient descent for squared error via   
a first order Taylor expansion.

In my references (see below) this is referred to as the Gauss-Newton update, 
and the inverse of the outer product (summed over all training examples) of the 
gradient of the network function w.r.t. the weights is used as an approximation
to the Hessian.  (The expectation of this matrix as sample size grows large 
is the Hessian).  Unfortunately, it is sometimes mistakenly referred to as the 
so-called Hessian in some of the literature, as I did above; 
thank you for pointing this out.  

I hope that someday we will be able to post precise equations rather than verbal 
descriptions - this would alleviate this kind of confusion!
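
In the meantime, and at the risk of more imprecision, the update I have in
mind is the standard Gauss-Newton step for a squared-error criterion over
examples (x_n, t_n) with network output f(x_n; w); in LaTeX notation:

    \Delta w \;=\; \Big( \sum_n \nabla_w f(x_n;w)\,\nabla_w f(x_n;w)^{\top} \Big)^{-1}
                   \sum_n \big( t_n - f(x_n;w) \big)\, \nabla_w f(x_n;w)

Replacing the bracketed matrix by the identity leaves a plain batch
gradient-descent step on the squared error, which is the sense in which my
"special case" remark was intended.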

Thanks also for the references for obtaining CG from Gauss-Newton.


References:
1) Fedorov, Theory of Optimal Experiments, 1972, (p.35)
2) Seber & Wild, Nonlinear Regression, 1989, pp. 21-23.

-=-=
M.E. Plutowski,  pluto%cs@ucsd.edu 

UCSD,  Computer Science and Engineering 0114
9500 Gilman Drive
La Jolla, California 92093-0114

landman@hanami.Eng.Sun.COM (Howard A. Landman) (03/13/91)

In article <91Mar7.145659edt.437@neuron.ai.toronto.edu> radford@ai.toronto.edu (Radford Neal) writes:
>In any comparison of learning methods, one must be clear on what the cost 
>criterion is

Absolutely.

>typically, however, the training data is not all that voluminous, and 
>omitting some patterns might seriously impair gradient descent unless
>the learning rate is set quite low.

Do you think it would be fair to say that training data is typically
not very large because people simply don't have machines powerful
enough (or algorithms efficient enough) to deal with anything larger?
I could easily be running a few hundred thousand patterns of a few
hundred inputs each into a few thousand neurons, *IF* I had anything
that could handle it.

One disadvantage of CG methods is that they often require the whole
training set to be memory-resident.  For gigantic training data this
can be a real problem.

Does anyone have any insights on methods for handling large amounts
of training data efficiently?

--
	Howard A. Landman
	landman@eng.sun.com -or- sun!landman

esrmm@warwick.ac.uk (Denis Anthony) (03/13/91)

In article <9682@exodus.Eng.Sun.COM> landman@hanami.Eng.Sun.COM (Howard A. Landman) writes:
>
>Does anyone have any insights on methods for handling large amounts
>of training data efficiently?
>
Data compression of the input set using auto-associative neural nets, as in

%Q Cottrell G.W
%D 1990
%T Face Recognition Using Feature Extraction
%J INNS Int. Neural Networks Conference
%V 1
%P 322-325

or using principal components to do the same thing, as in


%Q Anthony D.M, Hines E.L, Taylor D, and Barham J
%D 1989
%T A Study of Data Compression using Neural Networks and Principal Component Analysis (IEE Colloquium on Biomedical Applications of Digital Signal Processing)
%P 2/1-2/4
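
A minimal sketch of the principal-components route (the standard construction,
not the code from the references above): project the patterns onto the top few
eigenvectors of their covariance and train the network on the shorter vectors.

import numpy as np

def pca_compress(patterns, n_components):
    """Compress (N, d) input patterns to (N, k) codes by projecting onto the
    k leading principal components; returns what is needed to decompress."""
    mean = patterns.mean(axis=0)
    centered = patterns - mean
    cov = centered.T @ centered / len(patterns)
    eigvals, eigvecs = np.linalg.eigh(cov)               # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return centered @ top, top, mean

def pca_decompress(codes, top, mean):
    return codes @ top.T + mean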

greenba@gambia.crd.ge.com (ben a green) (03/13/91)

In article <9682@exodus.Eng.Sun.COM> landman@hanami.Eng.Sun.COM (Howard A. Landman) writes:

	...

   Do you think it would be fair to say that training data is typically
   not very large because people simply don't have machines powerful
   enough (or algorithms efficient enough) to deal with anything larger?
   I could easily be running a few hundred thousand patterns of a few
   hundred inputs each into a few thousand neurons, *IF* I had anything
   that could handle it.

In the commercial and military applications we have investigated, training data
are limited by the nature of the problem. Suppose you want to diagnose faults
in aircraft engines from data collected before every takeoff. You have lots
of normal patterns but typically very few fault patterns of any one kind
of fault.

   One disadvantage of CG methods is that they often require the whole
   training set to be memory-resident.  For gigantic training data this
   can be a real problem.

I don't understand why this is peculiar to CG methods. Any method that requires
repeated updating of weights will want to retain the training set in memory just
in order to avoid being IO-bound.

   Does anyone have any insights on methods for handling large amounts
   of training data efficiently?

If the hundred thousand patterns are required in order to define the decision
boundary with sufficient precision, then there is no alternative. If they are
not, you can try sampling. 

Sorry if these seem like obvious suggestions, but we have thought a lot about the
problem and have come up with nothing better.

BTW, we use a variety of Partial CG with such success that we have stopped trying
anything else. Maybe it depends on the kind of problem you have. We are mystified
by the complaints against CG. A recent MIT thesis blasted it, but we ran the
same data through our version with excellent results.

BG, 5 kyu
--
Ben A. Green, Jr.              
greenba@crd.ge.com
  Speaking only for myself, of course.

landman@hanami.Eng.Sun.COM (Howard A. Landman) (03/21/91)

In article <9682@exodus.Eng.Sun.COM> I wrote:
>>One disadvantage of CG methods is that they often require the whole
>>training set to be memory-resident.  For gigantic training data this
>>can be a real problem.

In article <GREENBA.91Mar13081609@gambia.crd.ge.com> greenba@gambia.crd.ge.com (ben a green) writes:
>I don't understand why this is peculiar to CG methods. Any method that requires
>repeated updating of weights will want to retain the training set in memory
>just in order to avoid being IO-bound.

Assuming that the entire training set can fit into your virtual memory, that's
true, although page faults can cause that to become "IO-bound" as well.
But I had one case where the training data was over 500 MB.  Since my VM size
was less than 500 MB, the program which required data to be memory-resident
simply DIDN'T WORK, but a program that read the data each pass would merely
have been slow.

A more subtle aspect: in some cases (e.g. mine), the "original" training
data is far more dense than the training data which has been massaged into
the input format (or memory layout) of the program.  In extreme cases (e.g.
mine :-) the difference can be greater than two orders of magnitude (2 MB vs
500 MB).  For a "one sample at a time" program, if you have source, it is
possible to embed the code to do this expansion in the program itself, so that
the entire expanded training set never exists anywhere, and all the swap & I/O
problems vanish (at the cost in CPU of reexpanding the data each time).  For
an "all data in memory at once" program, you don't have that choice; the whole
expanded data set must exist in memory even if you can avoid having it on disk.
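
A sketch of the "expand inside the program" idea for a one-sample-at-a-time
trainer (Python; `expand` and `update` stand for whatever turns a compact
record into the network's input format and whatever per-pattern rule is in
use -- both are assumptions, not features of any particular package):

def expanded_samples(compact_records, expand):
    """Yield training samples one at a time from a compact representation,
    so the fully expanded training set (which may be hundreds of times
    larger) never exists in memory or on disk."""
    for record in compact_records:
        yield expand(record)

def train_one_epoch(weights, compact_records, expand, update):
    """One pattern-update pass over data that is expanded on the fly."""
    for sample in expanded_samples(compact_records, expand):
        weights = update(weights, sample)
    return weights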

Even if embedded expansion is possible, it may not always make sense.  This
depends on the relative expense of expanding versus the performance cost of
doing the training with a fully expanded data set.  In my case, expanding
the data was about 1/8th the cost of doing a single CG training cycle, so
the overhead would have been quite acceptable as long as each training cycle
ran more than 12% faster when the program virtual image was 3 MB in size than
when it was 500 MB in size.  (There is also an implicit assumption here that
each datum only needs to be expanded once per training cycle.  This is clearly
true for the "one sample at a time" approaches, but may not be for CG.)  Even
if that speedup was not forthcoming, at least the program would have been able
to handle the full set of training data.

You're right that this doesn't *necessarily* have anything to do with CG
per se, except that the CG program I chose to use ("opt") had the above
features.  I think that line search may more-or-less require all training
data to be present.  If anyone knows of a CG program which *doesn't* need
all data in-memory, please describe it.

--
	Howard A. Landman
	landman@eng.sun.com -or- sun!landman