[comp.ai.neural-nets] BP input scaling, normalization

mcguire@fornax.UUCP (Michael McGuire) (04/20/91)

I have been using back-propagation to combine two sets of 11 parameters 
(22 inputs) into 11 output classes (there are 275 training patterns and 275 
test patterns). The net therefore has 22 inputs, 11 outputs, and possibly some
hidden layers. The inputs in each set were scaled by their own constant so
that the values fell in the range 0 to 1 (this was a requirement of the BP
software).
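
Roughly, the per-set scaling looks something like this (a sketch only; the
numpy arrays and placeholder values are just for illustration):

    import numpy as np

    # Hypothetical data: 275 patterns, two sets of 11 parameters each.
    set_a = np.random.rand(275, 11) * 40.0    # placeholder values
    set_b = np.random.rand(275, 11) * 7.0     # placeholder values

    # Scale each set by its own constant so every input lies in [0, 1],
    # as the BP software requires.  Here the constant is simply the
    # largest value seen in that set.
    inputs = np.hstack([set_a / set_a.max(), set_b / set_b.max()])  # (275, 22)
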
My questions arise from the results I obtained:
	1. Different scaling constants resulted in very different
	   classification performance.
	2. A network with no hidden layers outperformed nets with one hidden
	   layer (both nets had near-perfect classification on the training
	   patterns).

Questions:
	1. What are the effects of scaling the inputs to a BP net, and is
	   there an optimal way to do this (especially since I have two sets of
	   inputs that need to be scaled differently)?
	2. Why would a single-layer net outperform a two-layer net (the two-layer
	   net had only 5 hidden units)? I would expect the two-layer net to
	   do at least as well.
	3. Do output activations of 0.1 and 0.9 (as opposed to 0.0 and 1.0)
	   help the generalization process?
	4. Is there a different neural net better suited to this type of
	   classification (Radial Basis Functions)?

Thanks in advance to all those who respond.

Mike McGuire
Engineering Science
Simon Fraser University
Canada
e-mail: mcguire@cs.sfu.ca 

andy@honda.ece.uiuc.edu (Andy Bereson) (04/21/91)

> From: mcguire@fornax.UUCP (Michael McGuire)
> Subject: BP input scaling, normalization
> 
> 	1. What are the effects of scaling the inputs to a BP net, and is
> 	   there an optimal way to do this (especially since I have two sets of
> 	   inputs that need to be scaled differently)?

Scaling the inputs to back-propagation usually has almost no effect.
Such scaling corresponds trivially to scaling the weight matrix, and any
good that scaling the inputs does tends to be swamped by the (usually
random) initial weights.  I've been experimenting with scaling the
initial weights, but I haven't found anything interesting yet.
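
A quick numerical check of that equivalence (the shapes are arbitrary,
just matching the 22-input, 11-output case):

    import numpy as np

    x = np.random.rand(22)         # one input pattern
    W = np.random.randn(11, 22)    # weights of a single-layer net
    c = 0.1                        # some input scaling constant

    # Scaling the inputs by c and the weights by 1/c leaves the net
    # input to every unit unchanged, so the same solutions exist.
    print(np.allclose(W @ x, (W / c) @ (c * x)))   # True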

> 	2. Why would a single-layer net outperform a two-layer net (the two-layer
> 	   net had only 5 hidden units)? I would expect the two-layer net to
> 	   do at least as well.

If you are getting near-perfect classification with no hidden units, then
you may not need hidden units.  Generally speaking, generalization improves
when the number of hidden units is kept as small as the task allows.

> 	3. Do output activations of 0.1 and 0.9 (as opposed to 0.0 and 1.0)
> 	   help the generalization process?

Some people believe this is true.  Targets of 0.1 and 0.9 set more
realistic goals for the error measure: the sigmoid most often used in
back-prop only asymptotes towards 0.0 and 1.0, so with targets of exactly
0.0 and 1.0 the net sees some residual error even when it outputs the
correct answer, and keeps trying to make further _improvements_.  I usually
do this, but I don't really know whether it's best.  I'd like to see
results that answer this question, but I'm not aware of any.
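
One way to see the point about the asymptotes (a small illustrative
calculation, nothing more):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def inv_sigmoid(t):
        # Net input a unit would need to hit target t exactly.
        return np.log(t / (1.0 - t))

    print(inv_sigmoid(0.9))     # about 2.2 -- reachable with modest weights
    print(inv_sigmoid(0.999))   # about 6.9 -- targets near 1.0 demand ever larger weights
    print(sigmoid(10.0))        # ~0.99995  -- still short of a 1.0 target, so some error remains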

tap@ai.toronto.edu (Tony Plate) (04/23/91)

In article <2533@fornax.UUCP> mcguire@fornax.UUCP (Michael McGuire) writes:
>
>Questions:
>	1. What are the effects of scaling the inputs to a BP net, and is
>	   there an optimal way to do this (especially since I have two sets of
>	   inputs that need to be scaled differently)?
Scaling the inputs has no effect on what solutions the network can
implement (since the weights can be scaled to compensate), but it might
affect the training.
>	2. Why would a single-layer net outperform a two-layer net (the two-layer
>	   net had only 5 hidden units)? I would expect the two-layer net to
>	   do at least as well.
Two possible reasons (there may be others):
(1) Not enough hidden units
(2) Net is overtrained

Solutions are:
(1) Increase the number of hidden units
(2) Stop training earlier (use cross-validation to decide when to stop; see the sketch below)
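
A rough sketch of solution (2), early stopping against a held-out validation
set (the synthetic data, hidden-layer size, learning rate, and patience
threshold below are made-up placeholders, not recommendations for the
22-input problem):

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    # Synthetic two-class problem standing in for the real data.
    X = rng.random((200, 22))
    T = (X[:, :11].sum(1) > X[:, 11:].sum(1)).astype(float).reshape(-1, 1)
    X_tr, T_tr, X_val, T_val = X[:150], T[:150], X[150:], T[150:]

    n_hidden, lr = 5, 0.5
    W1 = rng.normal(0, 0.1, (22, n_hidden))
    W2 = rng.normal(0, 0.1, (n_hidden, 1))

    best_err, best_W, patience = np.inf, None, 0
    for epoch in range(5000):
        # Forward and backward pass (squared error through sigmoids).
        H = sigmoid(X_tr @ W1)
        Y = sigmoid(H @ W2)
        dY = (Y - T_tr) * Y * (1 - Y)
        dH = (dY @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dY / len(X_tr)
        W1 -= lr * X_tr.T @ dH / len(X_tr)

        # The validation error, not the training error, decides when to stop.
        val_err = np.mean((sigmoid(sigmoid(X_val @ W1) @ W2) - T_val) ** 2)
        if val_err < best_err:
            best_err, best_W, patience = val_err, (W1.copy(), W2.copy()), 0
        else:
            patience += 1
            if patience > 50:      # no improvement for a while: stop
                break

    W1, W2 = best_W                # keep the weights that validated best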

>	3. Do output activations of 0.1 and 0.9 (as opposed to 0.0 and 1.0)
>	   help the generalization process?

This might help, but in my experience it is better to use softmax outputs.
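
For reference, a softmax output layer looks something like this (the 11
net-input values are made up):

    import numpy as np

    def softmax(a):
        e = np.exp(a - a.max())    # subtract the max for numerical stability
        return e / e.sum()         # outputs are positive and sum to 1

    a = np.array([1.2, -0.3, 0.8, 2.5, 0.0, -1.1, 0.4, 0.9, -0.5, 1.7, 0.1])
    p = softmax(a)
    print(p.sum(), p.argmax())     # probabilities sum to 1; argmax is the predicted class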

>	4. Is there a different neural net better suited to this type of
>	   classification (Radial Basis Functions)?
>
Depends on the shape of the classes in input space.

-- 
---------------- Tony Plate ----------------------  tap@ai.utoronto.ca -----
Department of Computer Science, University of Toronto, 
10 Kings College Road, Toronto, 
Ontario, CANADA M5S 1A4
----------------------------------------------------------------------------