[comp.ai.neural-nets] Why the more input neurons, the faster the convergence..?

rcoahk@koel.co.rmit.oz (Alvaro Hui Kau) (08/24/90)

Hi, all experts:

From a recent experiment on Gaussian data classification
using the backpropagation (BP) algorithm, I found that the
higher-dimensional cases (which need more input neurons)
converge much, much faster than the lower-dimensional ones.

The difference is nearly 100-fold!

I am wondering whether this is a general behavior of BP nets;
can anyone verify this for me?

Of course, I use the same number of vector pairs in all cases!
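
Roughly, the setup looks like the toy sketch below (the class separation,
hidden-layer size, learning rate and stopping rule here are stand-ins, not
my exact settings):

import numpy as np

def run(d, hidden=4, lr=0.5, tol=0.10, max_epochs=20000):
    # Two Gaussian classes in d dimensions, means at -1 and +1 on every axis.
    rng = np.random.default_rng(0)
    n = 100
    X = np.vstack([rng.normal(-1.0, 1.0, (n, d)), rng.normal(+1.0, 1.0, (n, d))])
    y = np.vstack([np.zeros((n, 1)), np.ones((n, 1))])
    # One hidden layer of sigmoid units, trained by plain batch backprop.
    W1 = rng.normal(0, 1.0 / np.sqrt(d), (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, 1));              b2 = np.zeros(1)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for epoch in range(1, max_epochs + 1):
        h = sig(X @ W1 + b1)                  # forward pass
        out = sig(h @ W2 + b2)
        mse = np.mean((out - y) ** 2)
        if mse < tol:
            break
        d_out = (out - y) * out * (1 - out)   # backprop through the sigmoids
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(0)
        W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(0)
    return epoch, mse

for d in (2, 8, 32):
    epochs, mse = run(d)
    print(f"{d:2d} inputs: stopped after {epochs} epochs, training MSE {mse:.3f}")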


Neural nets would be fun if we all had supercomputers....
===============================================================================
	Alvaro Hui		|ACSnet		akkh@mullian.oz
    4th Year B.E.\ B.Sc.	|Internet &	akkh@mullian.ee.mu.OZ.AU
   University of Melbourne	|Arpanet	rcoahk@koel.co.rmit.OZ.AU 
         	   		|Arpa-relay	akkh%mullian.oz@uunet.uu.net
       	    	 		|Uunet		....!munnari!mullian!akkh   
                  		|EAN		akkh@mullian.ee.mu.oz.au
===============================================================================


ins_atge@jhunix.HCF.JHU.EDU (Thomas G Edwards) (08/25/90)

In article <5462@minyos.xx.rmit.oz> rcoahk@koel.co.rmit.oz (Alvaro Hui Kau) writes:
>From a recent experiment on Gaussian data classification
>using the backpropagation (BP) algorithm, I found that the
>higher-dimensional cases (which need more input neurons)
>converge much, much faster than the lower-dimensional ones.

I have also noticed this.  I believe that networks with few input
dimensions suffer very seriously from local-minimum problems (e.g., XOR).
Remember, it has been shown that BP nets are _very_ sensitive to
initial weight conditions (sorry, my reference isn't here right now),
and different initial weights can change convergence times by orders
of magnitude (at least for small problems).
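
For what it's worth, here is a tiny sketch of that sensitivity (the
learning rate, initialization scale and stopping rule are my own choices,
not from the reference): the same XOR net, trained by vanilla BP from
different random seeds, can need wildly different numbers of epochs, or
get stuck altogether.

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def xor_epochs(seed, hidden=2, lr=0.5, tol=0.01, max_epochs=200000):
    # 2-2-1 sigmoid net trained by plain batch backprop from a random start.
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 1.0, (2, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 1.0, (hidden, 1)); b2 = np.zeros(1)
    for epoch in range(1, max_epochs + 1):
        h = sig(X @ W1 + b1)
        out = sig(h @ W2 + b2)
        err = out - y
        if np.mean(err ** 2) < tol:
            return epoch
        d_out = err * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
        W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)
    return max_epochs          # never reached tol: stuck in a local minimum / flat spot

for seed in range(6):
    print("seed", seed, "->", xor_epochs(seed), "epochs")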

The solution?  Well, you could use high-dimensional networks, but of
course you then have to spend more time per epoch.  I think the best
idea is to use conjugate-gradient methods (see _Numerical_Recipes_ ,
or the paper on efficient parallel learning methods in _Neural_Information_
Processing_Systems_I [ed. Touretzky]) or at least steepest descent
with a line search.  The line searches allow you to quickly cross vast
wastelands of nearly flat error surface, which would take a very long
time with vanilla BP.  Then you won't need a supercomputer to run your
neural networks (although a Sun would be nice...).

Try the conjugate-gradient learning program OPT available via anon ftp
from cse.ogi.edu in the /pub/nnvowels directory.
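
To give a feel for why the line search matters (this is not OPT, just a
toy in which a badly scaled quadratic stands in for a nearly flat error
surface, with arbitrary numbers):

import numpy as np

A = np.diag([1.0, 0.001])          # a steep direction and a nearly flat one
b = np.array([1.0, 1.0])
grad = lambda w: A @ w - b         # gradient of 0.5*w'Aw - b'w
w_star = np.linalg.solve(A, b)     # true minimum, used only for the stopping test

def fixed_step_descent(w, lr=1.0, tol=1e-6, max_iter=200000):
    # Fixed-step steepest descent, like plain batch backprop.
    for i in range(1, max_iter + 1):
        if np.linalg.norm(w - w_star) < tol:
            return i
        w = w - lr * grad(w)
    return max_iter

def conjugate_gradient(w, tol=1e-6, max_iter=100):
    # Linear CG: an exact line search along each (conjugate) search direction.
    r = -grad(w)                   # downhill direction
    d = r.copy()
    for i in range(1, max_iter + 1):
        if np.linalg.norm(w - w_star) < tol:
            return i
        alpha = (r @ r) / (d @ A @ d)      # the line-search step
        w = w + alpha * d
        r_new = r - alpha * (A @ d)
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return max_iter

w0 = np.zeros(2)
print("fixed-step descent :", fixed_step_descent(w0.copy()), "iterations")
print("conjugate gradient :", conjugate_gradient(w0.copy()), "iterations")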

-Thomas Edwards
The Johns Hopkins University / U.S. Naval Research Lab

alexis@oahu.cs.ucla.edu (Alexis Wieland) (08/27/90)

> From a recent experiment on Gaussian data classification
> using the backpropagation (BP) algorithm, I found that the
> higher-dimensional cases (which need more input neurons)
> converge much, much faster than the lower-dimensional ones.

> The difference is nearly 100-fold!

Since BP neural nets work with a weighted sum of the inputs, and since
the variance of an average (a normalized weighted sum) of independent
Gaussian-distributed random variables tends to 0 as the number of terms
gets large, *any* classifier working on Gaussian data should perform
better with more inputs.  This is characteristic of Gaussian classifiers.

The behaviour you report is often even more true for neural nets.  It
is simple to create examples (even inadvertently) where the noise-free
(and, effectively, also the high-dimensional) case is linearly separable
(i.e., a net will learn quickly) and the high-noise case is not
(i.e., learning will be comparatively slow).  A 100-fold difference
is quite believable.

Actually, our experience in the past shows there's often more to it than
that.  Four or so years ago we (like everyone else) did a character
recognition system (ours was independent of rotation).  To make a long
story short, 8x8 images took about 10x the wall-clock time to learn as
16x16 images.  The difference was that the smaller images were so blurry
that it really was hard to distinguish some characters, say a 'C' and
a 90-degree-rotated 'A' (this is all in an INNS '87 paper).  The moral
is "know your data" ....  (Re another discussion: Russ Leighton, the
co-author of that work, later extended those techniques, used lots of
limited receptive fields and a handful of tricks, and found objects in
computer-generated composite images of up to 1024x1024.)

	- alexis.


><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><
  Alexis Wieland                     also part-time/on-call 
  grad student at                    lead scientist at
  UCLA CS Department                 The MITRE Corporation, Washington
  alexis@CS.UCLA.EDU                 (don't ask, it's a long commute).
><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><

demers@odin.ucsd.edu (David E Demers) (08/28/90)

In article <5462@minyos.xx.rmit.oz> rcoahk@koel.co.rmit.oz (Alvaro Hui Kau) writes:
>From a recent experiment on Gaussian data classification
>using the backpropagation (BP) algorithm, I found that the
>higher-dimensional cases (which need more input neurons)
>converge much, much faster than the lower-dimensional ones.

>The difference is nearly 100-fold!

>I am wondering whether this is a general behavior of BP nets;
>can anyone verify this for me?

>Of course, I use the same number of vector pairs in all cases!

This should not be a surprising result.  The more degrees of freedom
you allow your model, the easier it should be to drive down the error
on the training data.  What you might find, however, is that your net
does not generalize well to other inputs.  What the net is doing is
building a smooth function; and as we all know from function
approximation, zero error on the training data does not necessarily
mean we have a good model!
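
A tiny function-approximation sketch of that last point (polynomial fits
stand in for nets of increasing size; the sine curve, noise level and
degrees are just illustrative): the training error keeps falling as the
degrees of freedom go up, while the error on fresh data stops improving.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)                      # the "true" smooth function
x_train = rng.uniform(-1, 1, 12)
y_train = f(x_train) + rng.normal(0, 0.2, 12)        # a few noisy samples
x_test = rng.uniform(-1, 1, 500)
y_test = f(x_test) + rng.normal(0, 0.2, 500)

for degree in (1, 3, 6, 9):                          # "more degrees of freedom"
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}:  train MSE {train_mse:.4f}   test MSE {test_mse:.4f}")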

Dave