[comp.ai.neural-nets] Several hidden layers in feed-forward networks

vt_ai@abo.fi (01/07/91)

When do 2, 3, 4 or more hidden layers in feed-forward neural networks work
better than just one hidden layer? Has anyone been using 5 or more hidden
layers? Most people seem content with one. Thanks for any responses.

AI research in heat engineering
Åbo Akademi

aam9n@helga0.acc.Virginia.EDU (Ali Ahmad Minai) (01/08/91)

In article <7165.27885d62@abo.fi> vt_ai@abo.fi writes:
>When do 2, 3, 4 or more hidden layers in feed-forward neural networks work
>better than just one hidden layer? Has anyone been using 5 or more hidden
>layers? Most people seem content with one. Thanks for any responses.

Fewer layers are used mainly for three reasons.

1: One layer is usually enough (it is certainly enough in theory).
2: Single hidden-layer nets are easier to "decode" to see what is going
   on, figure out the induced model, etc.
3: Back-propagated error values become increasingly less reliable as
   the number of layers increases. Trying to train very deep networks
   with back-propagation is almost impossible (though ways to improve
   this have been suggested).

Still, people do occasionally use 2, 3, even 4 hidden layers.
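
To make point 3 a little more concrete, here is a rough numerical
sketch (Python with numpy; the layer sizes and weight scales are
arbitrary, made-up choices) of how a back-propagated error signal
tends to shrink as it passes backwards through many layers of sigmoid
units with modestly scaled random weights. It only illustrates the
tendency; with differently scaled weights the signal can just as
easily blow up instead.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A deep stack of fully connected sigmoid layers (sizes are arbitrary).
n_layers = 10
width = 20
weights = [rng.normal(0, 1.0 / np.sqrt(width), (width, width))
           for _ in range(n_layers)]

# Forward pass, remembering each layer's activations.
a = rng.normal(size=width)
activations = [a]
for W in weights:
    a = sigmoid(W @ a)
    activations.append(a)

# Back-propagate an error signal of unit norm from the output,
# printing its size at each successively earlier layer.
delta = rng.normal(size=width)
delta /= np.linalg.norm(delta)
for W, act in zip(reversed(weights), reversed(activations[1:])):
    delta = W.T @ (delta * act * (1 - act))   # chain rule through the sigmoid
    print("back-propagated error norm:", np.linalg.norm(delta))

After ten layers the error norm has dropped by several orders of
magnitude here, which is one way of seeing why the weight updates for
the early layers become very small and unreliable.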

As for why more layers work "better", they often don't. But when they
do, it is because of the greater potential "complexity" available.
Think of each neuron in layer k as forming a distorted linear
superposition of the outputs from the previous layer. If the neurons
in the net have monotonic activation functions, as they usually do,
an output layer neuron in a single hidden-layer net requires about 2n
hidden neurons to compose a function with n modes (peaks). However,
if there are two hidden layers of n and 2 neurons respectively, the
2 hidden neurons of hidden layer 2 can each put together a function
with 0.5n peaks (or thereabouts), and the output neuron can combine
these two functions monotonically and still produce n peaks. The
second network thus needs only n+2 hidden neurons instead of 2n. Of
course, this analysis is extremely simplistic. Once you get into
high-dimensional situations, and depending on the function being
approximated, almost any situation can arise. Also, more hidden layers
can often cost as much extra in computation as they save in the number
of neurons.
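
As a toy illustration of the 2n-units-for-n-peaks picture, here is a
minimal sketch (Python with numpy; the sizes and weights are picked
by hand for illustration, not learned): pairs of shifted monotonic
sigmoids act as "turn on"/"turn off" units, and 2n of them give a
single-hidden-layer output with n peaks.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n = 5                                   # desired number of peaks (modes)
x = np.linspace(0.0, 1.0, 2001)

# 2n monotonic hidden units, used in pairs: one unit "turns on" just
# before each intended peak, its partner "turns off" just after it.
centers = (np.arange(n) + 0.5) / n
half_width = 0.2 / n
steepness = 50.0
out = np.zeros_like(x)
for c in centers:
    turn_on  = sigmoid(steepness * (x - (c - half_width)))  # weight +1
    turn_off = sigmoid(steepness * (x - (c + half_width)))  # weight -1
    out += turn_on - turn_off                               # one bump per pair

# A monotonic squashing at the output would not change the number of
# peaks, so counting local maxima of this sum is enough.
peaks = np.sum((out[1:-1] > out[:-2]) & (out[1:-1] > out[2:]))
print("hidden units:", 2 * n, "  peaks found:", peaks)      # expect n

If the pairs are placed carelessly, neighbouring bumps merge into
plateaus and you get fewer peaks, which is roughly what happens with
arbitrary weights.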

Ali Minai

esrmm@warwick.ac.uk (Denis Anthony) (01/08/91)

In article <1991Jan7.202202.12266@murdoch.acc.Virginia.EDU> aam9n@helga0.acc.Virginia.EDU (Ali Ahmad Minai) writes:
>In article <7165.27885d62@abo.fi> vt_ai@abo.fi writes:
>
>As for why more layers work "better", they often don't. But when they
>do, it is because of the greater potential "complexity" available.
>Think of each neuron in layer k as forming a distorted linear
>superposition of the outputs from the previous layer. If the neurons
>in the net have monotonic activation functions, as they usually do,
>an output layer neuron in a single hidden-layer net requires about 2n
>hidden neurons to compose a function with n modes (peaks).

Why 2n? Is this empirical, or based on maths? Or is it obvious,
i.e. 2n to form n peaks and n troughs? Apologies if I am being
a bit dim.

Denis

tylerh@nntp-server.caltech.edu (Tyler R. Holcomb) (01/09/91)

esrmm@warwick.ac.uk (Denis Anthony) writes:

>In article <1991Jan7.202202.12266@murdoch.acc.Virginia.EDU> aam9n@helga0.acc.Virginia.EDU (Ali Ahmad Minai) writes:
>>In article <7165.27885d62@abo.fi> vt_ai@abo.fi writes:
>>

...

>>an output layer neuron in a single hidden-layer net requires about 2n
>>hidden neurons to compose a function with n modes (peaks).

>Why 2n? Is this empirical, or based on maths? Or is it obvious,
>i.e. 2n to form n peaks and n troughs? Apologies if I am being
>a bit dim.

>Denis

The 2n rule actually has a very strong theoretical basis. If one
views each hidden layer as performing a topological transformation
on an n-sphere (an odd, but completely equivalent view of
feed-forward neural computation), then one can justify the 2n rule
from Whitney's theorem of differential geometry (the embedding
theorem: any smooth n-dimensional manifold can be embedded in R^{2n},
which is where the factor of 2n comes from). This was demonstrated
in an unpublished work of John Cortese (it was a term project
for one of his classes).

To get more information, send e-mail to jcort@tybalt.caltech.edu.

I have a copy of the paper, but I do not have the right to
be distributing an unpublished work that isn't mine.

happy theorizing!

aam9n@helga0.acc.Virginia.EDU (Ali Ahmad Minai) (01/10/91)

In article <1991Jan8.091631.16219@warwick.ac.uk> esrmm@warwick.ac.uk (Denis Anthony) writes:
>In article <1991Jan7.202202.12266@murdoch.acc.Virginia.EDU> aam9n@helga0.acc.Virginia.EDU (Ali Ahmad Minai) writes:
>>In article <7165.27885d62@abo.fi> vt_ai@abo.fi writes:
>>
>>As for why more layers work "better", they often don't. But when they
>>do, it is because of the greater potential "complexity" available.
>>Think of each neuron in layer k as forming a distorted linear
>>superposition of the outputs from the previous layer. If the neurons
>>in the net have monotonic activation functions, as they usually do,
>>an output layer neuron in a single hidden-layer net requires about 2n
>>hidden neurons to compose a function with n modes (peaks).
>
>Why 2n? Is this empirical, or based on maths? Or is it obvious,
>i.e. 2n to form n peaks and n troughs? Apologies if I am being
>a bit dim.

No, you are right. This is neither mathematical nor really empirical.
It is just meant to be an approximate argument. I'll try to clarify.

Suppose we have a 1-N-1 network, where the first 1 is just an input unit.
Assuming that the output neuron uses a linear composition with a
monotonic squashing function, and that all hidden units are monotonic,
the output is a distorted linear superposition of the hidden unit
activations. Since each hidden unit can only provide one "slope"
(due to monotonicity), about 2n will be needed to produce 2n slopes
(= n peaks). However, because the hidden unit activations are
non-linear and can have very different (albeit monotonic) shapes,
it is possible for fewer than 2n hidden units to produce n peaks, but
only if the function shapes are very variable. In general, superposing
2n monotonic functions of approximately similar shape (but with negative
and positive weights) will tend to produce fewer than n peaks. My
argument was that as we increase the number of layers, we provide 
additional scope for recombination of superpositions from previous
layers. Having two hidden layers of N and 2 units is sort of like
having one hidden layer of 2N units, because each unit in the second
layer forms its own independent superposition of the first layer's
outputs, both of which
are then available to the next layer. Again, this is not meant to be
a theorem, just an illustrative argument. Of course, I assume that
a layer takes input only from its immediate predecessor.
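
A quick empirical sketch of the "one slope per hidden unit" point
(again Python with numpy; the value of N, the weight scales and the
number of trials are arbitrary choices, so treat the numbers as
illustrative only): sampling many random 1-N-1 tanh networks and
counting the local maxima of each output, the count stays far below
N, in line with the rule of thumb of roughly two monotonic hidden
units per peak.

import numpy as np

rng = np.random.default_rng(1)

N = 10                                  # hidden units in a 1-N-1 network
x = np.linspace(-5.0, 5.0, 4001)
max_peaks = 0

for trial in range(2000):
    # Random 1-N-1 network: y(x) = sum_i v_i * tanh(a_i * x + b_i).
    a = rng.normal(0, 2.0, N)
    b = rng.normal(0, 2.0, N)
    v = rng.normal(0, 1.0, N)
    y = (v * np.tanh(x[:, None] * a + b)).sum(axis=1)

    # Count interior local maxima of the output.
    peaks = np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:]))
    max_peaks = max(max_peaks, peaks)

print("hidden units:", N, "  most peaks seen in any trial:", max_peaks)

Carefully hand-picked weights can do better than random ones, as in
the bump construction in my earlier post, but even then something
like two units per peak seems to be the going rate.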

Ali Minai