[comp.ai.neural-nets] generalization in NN's

nealiphc@milton.u.washington.edu (Phillip Neal) (04/03/91)

I have a problem with the ability of a neural net to generalize.
I have 600 observations of a 6 predictor variable input vector
to classify these observations into 1 of 4 groups.

I break the data into a 400 observation training set and
a 200 observation test set.
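(For concreteness, the split is just a shuffle and a cut.  A minimal
sketch, assuming the data sit in a 600 x 6 array with one class label
per row; the array names are made up:)

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 600 observations, 6 predictors, class labels 0..3.
X = rng.normal(size=(600, 6))
y = rng.integers(0, 4, size=600)

# Shuffle once, then cut: 400 for training, 200 for testing.
order = rng.permutation(600)
X_train, y_train = X[order[:400]], y[order[:400]]
X_test,  y_test  = X[order[400:]], y[order[400:]]
```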

When I compare a simple linear discriminant function with separate
covariance matrices against a NN with 6 input, 12 hidden and 4
output nodes, here's what I get for correct classification rates:

			LDF	NN
train			48.5	59.0
test			42.0	37.0

And no matter how long I let the NN run, and no matter what
number of hidden layer nodes, I always get about the same 
results.

So, what's the deal ? Is my sample size too small ? Are there
any good papers that cover this kind of problem ?

I know I am violating the rule of thumb to have 10 times more
training data than nodes in the net. But hey, data is expensive.

Thanks,

Phil Neal
phil@iris.iphc.washington.edu direct to my workstation

sietsma@latcs1.oz.au (Jocelyn Sietsma Penington) (04/03/91)

In article <1991Apr2.205240.24668@milton.u.washington.edu> nealiphc@milton.u.washington.edu (Phillip Neal) writes:
>I have a problem with the ability of a neural net to generalize.
...
>I break the data into a 400 observation training set and
>a 200 observation test set.
...  [NN does better on training set than linear discr. fn., but poorer on
      test set]
>And no matter how long I let the NN run, and no matter what
>number of hidden layer nodes, I always get about the same 
>results.
>
>I know I am violating the rule of thumb to have 10 times more
>training data than nodes in the net. But hey, data is expensive.

For starters, I think the rule of thumb quoted above is nonsense - it
doesn't take any notice of the characteristics of your data.  I think
it was calculated for training random inputs to random outputs, and who
wants to do that? 

The problem here may well be that you are actually training too long.
See the paper by Chauvin in NIPS 2, or by Weigend, Huberman and Rumelhart
(Predicting the future: a connectionist approach - Stanford-PDP-90-01, to 
appear in Int'l J. of Neural Systems) for graphs showing that as training 
continues, performance on the training set continuously improves, but 
performance on the test set reaches a maximum and then declines.

Unfortunately the only cures I know are expensive, either in data or time.

1. You can split your data set in three: training, cross-validation and testing.
Train, periodically checking the error rate on the cross-validation set.
When this starts to rise, stop training.  Use the test set to find the true
generalization performance.
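A rough sketch of that recipe, with a single logistic unit standing in
for the network; the data, names and patience threshold are all made
up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented toy task: 2 predictors, 2 classes, some label noise.
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(float)

# Three-way split: train / cross-validation / test.
X_tr, y_tr = X[:150], y[:150]
X_cv, y_cv = X[150:225], y[150:225]
X_te, y_te = X[225:], y[225:]

def err(w, Xs, ys):
    # Fraction misclassified by the linear rule (Xs @ w > 0).
    return np.mean((Xs @ w > 0) != ys)

w = np.zeros(2)
best_w, best_cv = w.copy(), err(w, X_cv, y_cv)
patience, bad_epochs = 5, 0

for epoch in range(200):
    # One gradient step on the logistic loss (a stand-in for
    # one epoch of back-propagation).
    p = 1.0 / (1.0 + np.exp(-(X_tr @ w)))
    w -= 0.1 * X_tr.T @ (p - y_tr) / len(y_tr)

    cv = err(w, X_cv, y_cv)
    if cv < best_cv:                  # cross-validation error improved
        best_w, best_cv, bad_epochs = w.copy(), cv, 0
    else:                             # it rose or stalled
        bad_epochs += 1
        if bad_epochs >= patience:    # stop training here
            break

# Only now touch the test set, for the true generalization figure.
test_error = err(best_w, X_te, y_te)
```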

2. You can reduce the effective size of your network.  The 2 papers I referenced
above are about adding an extra cost term to the standard back-prop of errors
to encourage the network to eliminate unnecessary units or connections.  This
appears to prevent the overtraining problem.  Unfortunately it greatly 
increases time required for training, and getting the parameter values right
might be difficult. (I haven't tried these, so I don't know.)
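The simplest such cost term is plain weight decay, a penalty
lam * sum(w**2) added to the error; the penalties in the papers above
are fancier (they target whole units or small weights specifically),
but a toy decay sketch shows the effect:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented toy: 6 inputs, but only the first two actually matter.
X = rng.normal(size=(100, 6))
w_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true

lam = 0.01                 # strength of the extra cost term
w = 0.1 * rng.normal(size=6)

for step in range(2000):
    grad_error = X.T @ (X @ w - y) / len(y)   # ordinary error gradient
    grad_decay = 2 * lam * w                  # gradient of lam * sum(w**2)
    w -= 0.1 * (grad_error + grad_decay)

# The penalty drags the weights on the four useless inputs to ~0,
# shrinking the effective size of the model.
```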

2b.  You MIGHT get some improvement by taking your trained network as it is now
and removing any redundant units by one of the available pruning methods.  On
a toy problem, I have found that this improves generalization (Sietsma & Dow,
Neural Networks, 1991).  See Mozer & Smolensky, NIPS 1, and Le Cun, Denker & 
Solla, NIPS 2, for alternate methods of pruning trained networks.
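A crude magnitude-based sketch of the idea (rank hidden units by the
size of their outgoing weights and drop the small ones; this is not
the exact criterion of any of the papers above):

```python
import numpy as np

rng = np.random.default_rng(3)

# Pretend these are the hidden-to-output weights of a trained
# 6-12-4 net; three hidden units came out nearly useless.
W = rng.normal(size=(12, 4))
W[[3, 7, 9], :] *= 0.01

# Score each hidden unit by its total outgoing weight magnitude
# and keep only the units that still matter.
scores = np.abs(W).sum(axis=1)
keep = scores > 0.1 * scores.max()
W_pruned = W[keep]        # the smaller, pruned output layer
```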

hope this helps,

Jocelyn
-- 
(Jocelyn Penington, a.k.a. Sietsma - feel free to use either)
Email: sietsma@LATCS1.oz.au            Address: Materials Research Laboratory
Phone: (03) 319 3775 or (03) 479 1057           PO Box 50, Melbourne 3032
This article does not commit me, LaTrobe Uni or M.R.L. to any act or opinion.

andy@honda.ece.uiuc.edu (Andy Bereson) (04/04/91)

> 
> I have a problem with the ability of a neural net to generalize.
> 
> When I use a simple linear discriminant function with separate
> covariance matrices and compare that against a NN with 6 input,
> 12 hidden and 4 output nodes. Here's what I get for correct
> 
> 			LDF	NN
> train			48.5	59.0
> test			42.0	37.0
> 

A linear discriminant function is similar to back-prop with no hidden
units, and that does better than your twelve hidden units on the test
set.  It sounds like you may be using too many units.  This is a
common cause of poor generalization.

> And no matter how long I let the NN run, and no matter what
> number of hidden layer nodes, I always get about the same 
> results.

After some amount of training, the ability of the network to predict new
examples will begin to degrade, and further training only worsens the
problem.  Reducing the number of units and the number of training
epochs should help.

Good luck

Andy

greenba@gambia.crd.ge.com (ben a green) (04/04/91)

In article <1991Apr3.172206.23469@ux1.cso.uiuc.edu> andy@honda.ece.uiuc.edu (Andy Bereson) writes:

   > 
   > I have a problem with the ability of a neural net to generalize.
   > 
   > When I use a simple linear discriminant function with separate
   > covariance matrices and compare that against a NN with 6 input,
   > 12 hidden and 4 output nodes. Here's what I get for correct
   > 
   > 			LDF	NN
   > train			48.5	59.0
   > test			42.0	37.0
   > 

   A linear discriminant function is similar to back-prop with no hidden
   units, and that does better than your twelve hidden units.  It sounds
   like you may be using too many units.  This is a common cause of poor
   generalization.  

Twelve hidden units can learn two-to-the-twelfth subclasses if one trains
to perfection. The original poster had only 400 training patterns, which
is less than two-to-the-ninth. So I heartily agree with Andy that one should
use many fewer hidden nodes in the quoted problem.

Try 2 for starters.
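The weight counts tell the same story; for the quoted 6-input,
4-output problem (counting a bias on every hidden and output unit):

```python
def n_weights(n_in, n_hidden, n_out):
    # Fully connected 3-layer net, one bias per hidden
    # and output unit.
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

big = n_weights(6, 12, 4)     # the posted net
small = n_weights(6, 2, 4)    # the suggested one
codes = 2 ** 12               # distinct binary hidden patterns

# 400 training patterns is under 2**9 = 512, while the big net has
# 136 free parameters and 4096 possible hidden codes to spend.
```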

Ben
--
Ben A. Green, Jr.              
greenba@crd.ge.com
  Speaking only for myself, of course.

tylerh@nntp-server.caltech.edu (Tyler R. Holcomb) (04/04/91)

nealiphc@milton.u.washington.edu (Phillip Neal) writes:

>I have a problem with the ability of a neural net to generalize.
>I have 600 observations of a 6 predictor variable input vector
>to classify these observations into 1 of 4 groups.
(stuff deleted)
>So, what's the deal ? Is my sample size too small ? Are there
>any good papers that cover this kind of problem ?

>I know I am violating the rule of thumb to have 10 times more
>training data than nodes in the net. But hey, data is expensive.

Several comments.

1. I agree with Andy Bereson.  A single linear unit will recover
the MSE optimal linear discriminant.  This would seem to encourage
trying to use few hidden units.  In particular, use linear
feed through connections (direct connections from input
to output).  The theoretical benefit and practical utility
of this approach have been demonstrated by many authors (myself
included).  
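
By feed-through I mean something like the following forward pass
(made-up weights, sketch only): the output layer sees the raw inputs
directly, so the linear solution is always available and the hidden
units only have to learn the nonlinear residue.

```python
import numpy as np

rng = np.random.default_rng(4)

# Forward pass of a 6-2-4 net with linear feed-through: the output
# layer sees both the hidden units and the raw inputs.
W_ih = 0.1 * rng.normal(size=(6, 2))   # input -> hidden
W_ho = 0.1 * rng.normal(size=(2, 4))   # hidden -> output
W_io = 0.1 * rng.normal(size=(6, 4))   # input -> output (feed-through)
b = np.zeros(4)

def forward(x):
    h = np.tanh(x @ W_ih)              # small nonlinear part
    return h @ W_ho + x @ W_io + b     # linear part rides alongside

# With W_ih and W_ho zeroed this is exactly a linear discriminant,
# so in principle the net can do no worse than the linear solution.
out = forward(rng.normal(size=(5, 6)))
```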

2.  Kramer has shown that 
the underlying structure of a backprop net is ill-suited
for classification tasks like yours.  He suggests Radial
Basis Function Nets (e.g. Moody and Darken, _Neural
Computation_, vol 1, pp 281-294).
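
For the curious, a bare-bones net in the Moody-Darken spirit:
Gaussian units at fixed centres feeding a linear output layer fit by
least squares.  (The centres here are just sampled training points
rather than their k-means step, and the data are an invented toy
problem.)

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented toy data: two well-separated clusters, two classes.
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(+2.0, 1.0, size=(50, 2))])
y = np.array([0.0] * 50 + [1.0] * 50)

centres = X[::10]          # 10 centres taken straight from the data
width = 2.0

def rbf_features(Xs):
    # Gaussian response of every centre to every input point.
    d2 = ((Xs[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * width ** 2))

# The output layer is linear, so it can be fit in one shot by
# least squares instead of gradient descent.
Phi = rbf_features(X)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
accuracy = np.mean((Phi @ w > 0.5) == y)
```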

3.  Neural Networks are not the solution to all of the world's
problems.  Maybe your problem really is optimally separated
by a linear discriminant!  

-- 
                ------------------------------------------------------------
Tyler Holcomb   *   "Remember, one treats others with courtesy and respect *
tylerh@juliet   *   not because they are gentlemen or gentlewomen, but     *
  caltech.edu   *   because you are."       -Garth Henrichs                *