[comp.ai.neural-nets] Size limits of BP

frey@eecae.UUCP (Zachary Frey) (05/03/89)

In article <18589@gatech.edu> myke@gatech.UUCP (Myke Reynolds) writes:
>[ART's] memory capacity is no less than that of a linear filter, 
>and its size is
>not limited, unlike BP. Since size = memory capacity, its memory capacity
>is limited only by your implementation of a linear equation solver.

I am not familiar with ART, but I am familiar with back-propagation from
the Rumelhart & McClelland PDP volumes, and I don't remember ever
seeing anything about a size limit to networks implemented with back-
propagation.  Could you elaborate?

I am currently working on implementing a simulation for feedforward
networks using BP as a learning rule that should work for arbitrarily
large networks (limited by computer resources, of course).  Since the
equations involved are recursively defined, I don't see why there should
be a size limit on the net.
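
For concreteness, here is a rough sketch of the kind of thing I mean
(illustrative Python/NumPy only, not the actual simulator; the 2-4-1 XOR
example, the step size, and the epoch count are arbitrary choices).  The
delta recursion itself never refers to the size of the net:

# Illustrative sketch: a minimal feedforward net trained by back-propagation,
# written so that the layer sizes are arbitrary -- nothing in the recursion
# caps them.
import numpy as np

def init_net(sizes, rng):
    """sizes = [n_in, n_hid1, ..., n_out]; small random weights, bias folded in."""
    return [rng.uniform(-0.5, 0.5, (n_out, n_in + 1))     # +1 column for the bias
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(weights, x):
    """Return the activations of every layer (needed for the backward pass)."""
    acts = [x]
    for W in weights:
        acts.append(sigmoid(W @ np.append(acts[-1], 1.0)))  # 1.0 is the bias input
    return acts

def backprop(weights, x, target, eta=0.5):
    """One on-line BP step; the delta recursion works for any number of layers."""
    acts = forward(weights, x)
    delta = (acts[-1] - target) * acts[-1] * (1.0 - acts[-1])   # output-layer delta
    for k in range(len(weights) - 1, -1, -1):
        grad = np.outer(delta, np.append(acts[k], 1.0))
        if k > 0:                                           # pass delta down a layer
            delta = (weights[k][:, :-1].T @ delta) * acts[k] * (1.0 - acts[k])
        weights[k] -= eta * grad
    return 0.5 * np.sum((acts[-1] - target) ** 2)

# Example: XOR with an arbitrary (here 2-4-1) architecture.
rng = np.random.default_rng(0)
net = init_net([2, 4, 1], rng)
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
for epoch in range(5000):
    for x, t in data:
        backprop(net, np.array(x, float), np.array(t, float))
for x, _ in data:
    print(x, forward(net, np.array(x, float))[-1])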

Zach Frey

-- 
* U.S.nail:  Zachary Frey           || e-mail:  frey@frith.egr.msu.edu    *
*            326 Abbot Hall         ||          frey@eecae.ee.msu.edu     *
*            E. Lansing, MI  48825  || voice:   (517)355-6421             *
* DISCLAIMER: My opinions, my responsibility.                             *

myke@gatech.edu (Myke Reynolds) (05/04/89)

In article <15313@eecae.UUCP> frey@eecae.UUCP (Zachary Frey) writes:
>I am not familiar with ART, but I am familiar with back-propagation from
>the Rumelhart & McClelland PDP volumes, and I don't remember ever
>seeing anything about a size limit to networks implemented with back-
>propagation.  Could you elaborate?
>
Try increasing the number of internal nodes without changing the input/output
you train it on. If you were to simulate more complex input/output, an increased
number of internal nodes would be necessary to learn the greater complexity.
But even without greater complexity you will notice a rapid decrease in the
speed of learning as a function of the number of internal nodes, and at a
certain point it stops learning altogether.
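
A rough sketch of that experiment (illustrative Python/NumPy; XOR as the
fixed input/output, and the hidden sizes, stopping criterion, and step size
are arbitrary choices -- what you see will depend on those constants):

# Train the same tiny mapping with ever more hidden units and record how
# many epochs it takes (if it ever succeeds) to reach a small total error.
import numpy as np

def epochs_to_learn(n_hidden, max_epochs=20000, eta=0.5, seed=1):
    rng = np.random.default_rng(seed)
    X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], float)  # last col = bias
    T = np.array([[0], [1], [1], [0]], float)
    W1 = rng.uniform(-0.5, 0.5, (3, n_hidden))
    W2 = rng.uniform(-0.5, 0.5, (n_hidden + 1, 1))
    sig = lambda z: 1 / (1 + np.exp(-z))
    for epoch in range(max_epochs):
        H = np.hstack([sig(X @ W1), np.ones((4, 1))])   # forward pass (+ bias column)
        Y = sig(H @ W2)
        err = float(np.sum((Y - T) ** 2))
        if err < 0.05:
            return epoch, err                           # "learned"
        dY = (Y - T) * Y * (1 - Y)                      # backward pass
        dH = (dY @ W2[:-1].T) * H[:, :-1] * (1 - H[:, :-1])
        W2 -= eta * H.T @ dY                            # batch gradient steps
        W1 -= eta * X.T @ dH
    return max_epochs, err                              # gave up

for n in (2, 4, 16, 64, 256):
    print(n, epochs_to_learn(n))
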
-- 
Myke Reynolds
School of Information & Computer Science, Georgia Tech, Atlanta GA 30332
uucp:	...!{decvax,hplabs,ncar,purdue,rutgers}!gatech!myke
Internet:	myke@gatech.edu

mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) (05/04/89)

In article <18615@gatech.edu> myke@gatech.UUCP (Myke Reynolds) writes:
)In article <15313@eecae.UUCP> frey@eecae.UUCP (Zachary Frey) writes:
))I am not familiar with ART, but I am familiar with back-propagation from
))the Rumelhart & McClelland PDP volumes, and I don't remember ever
))seeing anything about a size limit to networks implemented with back-
))propagation.  Could you elaborate?
))
)Try increasing the number of internal nodes without changing the input/output
)you train it on. If you were to simulate more complex input/output, an increased
)number of internal nodes would be necessary to learn the greater complexity.
)But even without greater complexity you will notice a rapid decrease in the
)speed of learning as a function of the number of internal nodes, and at a
)certain point it stops learning altogether.

Yes, it starts to memorize once you have significantly more free parameters
than examples.  I would think, though, that this is a fundamental limitation:
you can always fit a 5th-degree polynomial through 5 points, for example,
because it has more coefficients than there are points to constrain it.  The
same sort of thing should apply in networks... (plug for my adviser's
research).  If you need to use significantly more free weights than examples,
then you're wasting weights or the functional representation is poor.  And if
the given examples don't map out the input space well enough, you won't be
able to learn a complicated function.
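
As a rough numerical illustration of the free-parameters-versus-examples
point (Python/NumPy; the straight-line target, the noise level, and the
polynomial degrees are arbitrary choices):

# Five noisy samples of a straight line, fit once with 2 coefficients and
# once with 5 (exact interpolation), then checked on held-out points.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 5)
y_train = 2.0 * x_train + 0.3 + 0.1 * rng.standard_normal(5)   # line + noise

line_fit = np.polyfit(x_train, y_train, 1)    # 2 free parameters < 5 examples
interp_fit = np.polyfit(x_train, y_train, 4)  # 5 free parameters = 5 examples

x_test = np.linspace(0.05, 0.95, 50)          # held-out points between samples
y_test = 2.0 * x_test + 0.3
for name, coeffs in [("degree 1", line_fit), ("degree 4", interp_fit)]:
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"{name}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")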

Something that I'd like to explore further is learning with a radial-basis
function network, i.e. a one-hidden-layer network where the hidden units' net
input function is:

 	n_j =   sum_i  (o_i - w_ij)^2;  o_j = 1/(1+n_j^2) e.g.

instead of the conventional  n_j = sum_i o_i * w_ij.

You can learn the output layer of this network with a conventional algorithm
that has guaranteed convergence (linear least squares, e.g. via singular value
decomposition), once you've selected the centers, i.e. the first layer of
weights, with k-means clustering for example.
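
A minimal sketch of that recipe in Python/NumPy (the toy sine data, the
number of centers, and the bare-bones k-means loop are arbitrary choices;
the n_j and o_j formulas are the ones above):

import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center, then move centers to the means
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def rbf_design(X, centers):
    # hidden unit j:  n_j = sum_i (o_i - w_ij)^2,  o_j = 1/(1 + n_j^2)  (as above)
    n = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    H = 1.0 / (1.0 + n ** 2)
    return np.hstack([H, np.ones((len(X), 1))])   # extra column = output bias

# toy data: one period of a sine wave
rng = np.random.default_rng(1)
X = rng.uniform(0, 2 * np.pi, (200, 1))
y = np.sin(X[:, 0])

centers = kmeans(X, k=12)                         # first layer of "weights"
H = rbf_design(X, centers)
w, *_ = np.linalg.lstsq(H, y, rcond=None)         # SVD-based linear least squares
pred = rbf_design(np.array([[np.pi / 2], [np.pi]]), centers) @ w
print(pred)   # should come out near [1, 0]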

With one hidden layer, this network can perform complicated nonlinear
transformations, unlike the simple perceptron.

For predicting chaotic time series, where the inherent locality of the
functional representation is an advantage, I've found this method to be more
accurate and faster to converge.

Matt Kennel
mbkennel@phoenix.princeton.edu