vt_ai@abo.fi (01/07/91)
When do 2, 3, 4 or more hidden layers in feed-forward neural networks work better than just one hidden layer? Has anyone been using 5 or more hidden layers? Most people seem to be content with one. Thanks for any responses.

AI research in heat engineering, Åbo Akademi
aam9n@helga0.acc.Virginia.EDU (Ali Ahmad Minai) (01/08/91)
In article <7165.27885d62@abo.fi> vt_ai@abo.fi writes:
>When do 2, 3, 4 or more hidden layers in feed-forward neural networks work
>better than just one hidden layer? Has anyone been using 5 or more hidden
>layers? Most people seem to be content with one. Thanks for any responses.

Fewer layers are used mainly for three reasons:

1: One hidden layer is usually enough (it is certainly enough in theory).

2: Single hidden-layer nets are easier to "decode", i.e. to see what is going on, figure out the induced model, etc.

3: Back-propagated error values become increasingly less reliable as the number of layers increases, so trying to train very deep networks with back-propagation is almost impossible (though ways to improve this have been suggested).

Still, people do occasionally use 2, 3, even 4 hidden layers.

As for why more layers work "better", they often don't. But when they do, it is because of the greater potential "complexity" available. Think of each neuron in layer k as forming a distorted linear superposition of the outputs from the previous layer. If the neurons in the net have monotonic activation functions, as they usually do, an output-layer neuron in a single hidden-layer net requires about 2n hidden neurons to compose a function with n modes (peaks). However, if there are two hidden layers of n and 2 neurons respectively, the 2 neurons of hidden layer 2 can each put together a function with about 0.5n peaks, and the output neuron can combine these two functions monotonically and still produce n peaks. The second network thus needs only n+2 hidden neurons instead of 2n.

Of course, this analysis is extremely simplistic. Once you get into high-dimensional situations, and depending on the function being approximated, almost any situation can arise. Also, more hidden layers can often exact in computational cost what they give in economy of neurons.

Ali Minai
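[Illustrative aside, not part of Minai's post: the superposition argument above can be sketched numerically. The Python snippet below (the function names and parameter values are my own choices) builds a 1-K-1 net whose K = 2n monotonic sigmoid hidden units are given alternating +1/-1 output weights, so that each consecutive up/down pair forms one bump, and then counts the peaks of the result.]

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_hidden_layer_net(x, centers, signs, gain=20.0):
    """Output of a 1-K-1 net: weighted sum of K monotonic sigmoid units.
    centers -- where each unit's transition happens
    signs   -- +1 or -1 output weight for each unit
    (A monotonic output squashing would not change the peak count, so it is omitted.)
    """
    hidden = sigmoid(gain * (x[:, None] - centers[None, :]))   # shape (len(x), K)
    return hidden @ signs

n = 3                                    # target number of peaks
K = 2 * n                                # the "2n" hidden units
centers = np.linspace(0.1, 0.9, K)       # alternating up/down transitions
signs = np.array([1.0 if i % 2 == 0 else -1.0 for i in range(K)])

x = np.linspace(0.0, 1.0, 2001)
y = one_hidden_layer_net(x, centers, signs)

peaks = int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))   # local maxima
print(f"{K} monotonic hidden units -> {peaks} peaks (target was {n})")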
esrmm@warwick.ac.uk (Denis Anthony) (01/08/91)
In article <1991Jan7.202202.12266@murdoch.acc.Virginia.EDU> aam9n@helga0.acc.Virginia.EDU (Ali Ahmad Minai) writes:
>In article <7165.27885d62@abo.fi> vt_ai@abo.fi writes:
>
>As for why more layers work "better", they often don't. But when they
>do, it is because of the greater potential "complexity" available.
>Think of each neuron in layer k as forming a distorted linear
>superposition of the outputs from the previous layer. If the neurons
>in the net have monotonic activation functions, as they usually do,
>an output layer neuron in a single hidden-layer net requires about 2n
>hidden neurons to compose a function with n modes (peaks).

Why 2n? Is this empirical, or based on maths? Or is it obvious, i.e. 2n to form n peaks and n troughs? Apologies if I am being a bit dim.

Denis
tylerh@nntp-server.caltech.edu (Tyler R. Holcomb) (01/09/91)
esrmm@warwick.ac.uk (Denis Anthony) writes:
>In article <1991Jan7.202202.12266@murdoch.acc.Virginia.EDU> aam9n@helga0.acc.Virginia.EDU (Ali Ahmad Minai) writes:
>>In article <7165.27885d62@abo.fi> vt_ai@abo.fi writes:
>> ...
>>an output layer neuron in a single hidden-layer net requires about 2n
>>hidden neurons to compose a function with n modes (peaks).
>
>Why 2n? Is this empirical, or based on maths? Or is it obvious,
>i.e. 2n to form n peaks and n troughs? Apologies if I am being
>a bit dim.
>
>Denis

The 2n rule actually has a very strong theoretical basis. If one views each hidden layer as performing a topological transformation on an n-sphere (an odd, but completely equivalent view of feed-forward neural computation), then one can justify the 2n rule from Whitney's theorem in differential geometry. This was demonstrated in an unpublished work by John Cortese (it was a term project for one of his classes). For more information, send e-mail to jcort@tybalt.caltech.edu. I have a copy of the paper, but I do not have the right to distribute an unpublished work that isn't mine.

Happy theorizing!
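[Background note, not from Holcomb's post: the result presumably being invoked here is the strong form of Whitney's embedding theorem, stated below. The connection to the 2n rule is the one suggested in the post, namely that 2n coordinates always suffice to realize an n-dimensional object without self-intersection.]

\textbf{Theorem (Whitney).}\quad Every smooth $n$-dimensional manifold $M$ admits a smooth embedding
\[
  f : M \hookrightarrow \mathbb{R}^{2n}.
\]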
aam9n@helga0.acc.Virginia.EDU (Ali Ahmad Minai) (01/10/91)
In article <1991Jan8.091631.16219@warwick.ac.uk> esrmm@warwick.ac.uk (Denis Anthony) writes:
>In article <1991Jan7.202202.12266@murdoch.acc.Virginia.EDU> aam9n@helga0.acc.Virginia.EDU (Ali Ahmad Minai) writes:
>>In article <7165.27885d62@abo.fi> vt_ai@abo.fi writes:
>>
>>As for why more layers work "better", they often don't. But when they
>>do, it is because of the greater potential "complexity" available.
>>Think of each neuron in layer k as forming a distorted linear
>>superposition of the outputs from the previous layer. If the neurons
>>in the net have monotonic activation functions, as they usually do,
>>an output layer neuron in a single hidden-layer net requires about 2n
>>hidden neurons to compose a function with n modes (peaks).
>
>Why 2n? Is this empirical, or based on maths? Or is it obvious,
>i.e. 2n to form n peaks and n troughs? Apologies if I am being
>a bit dim.

No, you are right. This is neither mathematical nor really empirical; it is just meant to be an approximate argument. I'll try to clarify.

Suppose we have a 1-N-1 network, where the first 1 is just an input unit. Assuming that the output neuron uses a linear composition followed by a monotonic squashing function, and that all hidden units are monotonic, the output is a distorted linear superposition of the hidden unit activations. Since each hidden unit can only provide one "slope" (due to monotonicity), about 2n units will be needed to produce 2n slopes (= n peaks). However, because the hidden unit activations are non-linear and can have very different (albeit monotonic) shapes, it is possible for fewer than 2n hidden units to produce n peaks, but only if the function shapes are very variable. In general, superposing 2n monotonic functions of approximately similar shape (but with negative and positive weights) will tend to produce fewer than n peaks.

My argument was that as we increase the number of layers, we provide additional scope for recombination of superpositions from previous layers. Having two hidden layers of N and 2 units is sort of like having one hidden layer of 2N units, because each unit in the second layer forms an independent version of the first layer, both of which are then available to the next layer. Again, this is not meant to be a theorem, just an illustrative argument. Of course, I assume that a layer takes input only from its immediate predecessor.

Ali Minai
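[Illustrative aside, not part of Minai's post: the "each monotonic unit contributes one slope, so about 2n units are needed for n peaks" estimate can be checked empirically. The Python sketch below (all names and parameter ranges are my own choices) draws random 1-K-1 nets with monotonic sigmoid hidden units, counts the peaks of their outputs, and prints the largest peak count observed, to compare against the rough K/2 estimate from the post.]

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def random_net_output(x, K):
    """Weighted sum of K monotonic sigmoid hidden units with random biases,
    gains and output weights (a monotone output squashing cannot add peaks,
    so it is left out)."""
    centers = rng.uniform(0.0, 1.0, K)
    gains = rng.uniform(5.0, 15.0, K)
    weights = rng.normal(0.0, 1.0, K)
    hidden = sigmoid(gains * (x[:, None] - centers[None, :]))
    return hidden @ weights

def count_peaks(y, tol=1e-9):
    """Count local maxima of a sampled curve, ignoring near-flat steps."""
    dy = np.diff(y)
    step = np.sign(np.where(np.abs(dy) < tol * np.max(np.abs(y)), 0.0, dy))
    step = step[step != 0]                     # keep only clear up/down steps
    return int(np.sum((step[:-1] > 0) & (step[1:] < 0)))

x = np.linspace(0.0, 1.0, 4001)
for K in (2, 4, 8, 16):
    most = max(count_peaks(random_net_output(x, K)) for _ in range(200))
    print(f"K={K:2d} hidden units: max peaks over 200 random nets = {most} "
          f"(rough estimate from the post: K/2 = {K // 2})")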