nealiphc@milton.u.washington.edu (Phillip Neal) (04/03/91)
I have a problem with the ability of a neural net to generalize. I have
600 observations of a 6-predictor-variable input vector, and I want to
classify these observations into 1 of 4 groups. I break the data into a
400-observation training set and a 200-observation test set.

I compared a simple linear discriminant function with separate covariance
matrices against a NN with 6 input, 12 hidden and 4 output nodes. Here's
what I get for correct classification rates (%):

            LDF     NN
   train    48.5    59.0
   test     42.0    37.0

No matter how long I let the NN run, and no matter what number of
hidden-layer nodes I use, I always get about the same results.

So, what's the deal? Is my sample size too small? Are there any good
papers that cover this kind of problem? I know I am violating the rule
of thumb to have 10 times more training data than nodes in the net.
But hey, data is expensive.

Thanks,
Phil Neal
phil@iris.iphc.washington.edu   (direct to my workstation)
sietsma@latcs1.oz.au (Jocelyn Sietsma Penington) (04/03/91)
In article <1991Apr2.205240.24668@milton.u.washington.edu> nealiphc@milton.u.washington.edu (Phillip Neal) writes:

>I have a problem with the ability of a neural net to generalize.
   ...
>I break the data into a 400 observation training set and
>a 200 observation test set.
   ...
 [NN does better on training set than linear discr. fn., but poorer
  on test set]
>And no matter how long I let the NN run, and no matter what
>number of hidden layer nodes, I always get about the same
>results.
>
>I know I am violating the rule of thumb to have 10 times more
>training data than nodes in the net. But hey, data is expensive.

For starters, I think the rule of thumb quoted above is nonsense - it
takes no notice of the characteristics of your data. I believe it was
calculated for training random inputs to random outputs, and who wants
to do that?

The problem here may well be that you are actually training too long.
See the paper by Chauvin in NIPS 2, or by Weigend, Huberman and
Rumelhart ("Predicting the future: a connectionist approach",
Stanford-PDP-90-01, to appear in Int'l J. of Neural Systems) for graphs
showing that as training continues, performance on the training set
continuously improves, but performance on the test set reaches a
maximum and then declines.

Unfortunately, the only cures I know are expensive, either in data or
in time:

1.  You can split your data set in three: training, cross-validation
    and testing. Train, periodically checking the error rate on the
    cross-validation set. When this starts to rise, stop training. Use
    the test set to find the true generalization performance.

2.  You can reduce the effective size of your network. The two papers
    referenced above are about adding an extra cost term to the
    standard back-propagation of errors to encourage the network to
    eliminate unnecessary units or connections. This appears to prevent
    the overtraining problem. Unfortunately it greatly increases the
    time required for training, and getting the parameter values right
    might be difficult. (I haven't tried these, so I don't know.)

2b. You MIGHT get some improvement by taking your trained network as it
    now stands and removing any redundant units by one of the available
    pruning methods. On a toy problem, I have found that this improves
    generalization (Sietsma & Dow, Neural Networks, 1991). See Mozer &
    Smolensky, NIPS 1, and Le Cun, Denker & Solla, NIPS 2, for
    alternative methods of pruning trained networks.

hope this helps,
Jocelyn
--
(Jocelyn Penington, a.k.a. Sietsma - feel free to use either)
Email: sietsma@LATCS1.oz.au     Address: Materials Research Laboratory
Phone: (03) 319 3775 or (03) 479 1057    PO Box 50, Melbourne 3032
This article does not commit me, LaTrobe Uni or M.R.L. to any act or opinion.
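[Editor's note: cure 1 above - the three-way split with early stopping -
can be sketched in a few lines of modern Python. The data, model (a
single logistic unit rather than the poster's 6-12-4 net) and patience
threshold here are all illustrative assumptions, not anything from the
thread.]

```python
# Early-stopping sketch: train / cross-validation / test split.
# Hypothetical synthetic data standing in for the 600-observation,
# 6-feature problem; the model is a single logistic unit for brevity.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))
w_true = rng.normal(size=6)
y = (X @ w_true + 0.5 * rng.normal(size=600) > 0).astype(float)

# Three-way split: 400 train / 100 cross-validation / 100 test.
X_tr, y_tr = X[:400], y[:400]
X_cv, y_cv = X[400:500], y[400:500]
X_te, y_te = X[500:], y[500:]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(6)
best_w, best_cv_err = w.copy(), np.inf
patience, bad_epochs = 10, 0       # "patience" is an assumed knob

for epoch in range(500):
    # One gradient step on the training set (logistic loss).
    p = sigmoid(X_tr @ w)
    w -= 0.1 * X_tr.T @ (p - y_tr) / len(y_tr)

    # Periodically check the error rate on the cross-validation set.
    cv_err = np.mean((sigmoid(X_cv @ w) > 0.5) != y_cv)
    if cv_err < best_cv_err:
        best_cv_err, best_w = cv_err, w.copy()
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # cv error stopped improving: stop
            break

# The test set is touched only once, to estimate true generalization.
test_err = np.mean((sigmoid(X_te @ best_w) > 0.5) != y_te)
print(round(test_err, 3))
```

The test set never influences when training stops; only the
cross-validation set does, which is the whole point of the three-way
split.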
andy@honda.ece.uiuc.edu (Andy Bereson) (04/04/91)
> I have a problem with the ability of a neural net to generalize.
>
> When I use a simple linear discriminant function with separate
> covariance matrices and compare that against a NN with 6 input,
> 12 hidden and 4 output nodes. Here's what I get for correct
> classification rates:
>
>             LDF     NN
>    train    48.5    59.0
>    test     42.0    37.0

A linear discriminant function is similar to back-prop with no hidden
units, and that does better than your twelve hidden units on the test
set. It sounds like you may be using too many units. This is a common
cause of under-generalization.

> And no matter how long I let the NN run, and no matter what
> number of hidden layer nodes, I always get about the same
> results.

After some amount of training, the ability of the network to predict
new examples will begin to degrade, and further training will only
worsen this. Reducing the number of units and the number of training
epochs should both help.

Good luck,
Andy
greenba@gambia.crd.ge.com (ben a green) (04/04/91)
In article <1991Apr3.172206.23469@ux1.cso.uiuc.edu> andy@honda.ece.uiuc.edu (Andy Bereson) writes:

> > I have a problem with the ability of a neural net to generalize.
> >
> > When I use a simple linear discriminant function with separate
> > covariance matrices and compare that against a NN with 6 input,
> > 12 hidden and 4 output nodes. Here's what I get for correct
> >
> >             LDF     NN
> >    train    48.5    59.0
> >    test     42.0    37.0
>
> A linear discriminant function is similar to back-prop with no
> hidden units, and that does better than your twelve hidden units.
> It sounds like you may be using too many units. This is a common
> cause of under-generalization.

Twelve hidden units can learn two-to-the-twelfth subclasses if one
trains to perfection. The original poster had only 400 training
patterns, which is less than two-to-the-ninth. So I heartily agree with
Andy that one should use many fewer hidden nodes in the quoted problem.
Try 2 for starters.

Ben
--
Ben A. Green, Jr.              greenba@crd.ge.com
Speaking only for myself, of course.
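[Editor's note: the capacity arithmetic behind Ben's argument is easy to
check. The weight count assumes a fully connected 6-12-4 net with bias
terms, which the original post does not spell out.]

```python
# Hidden-layer capacity: 12 binary hidden units can in principle encode
# 2^12 distinct activation patterns, far more than the 400 training
# examples available (400 is even less than 2^9 = 512).
hidden_codes = 2 ** 12
print(hidden_codes)            # -> 4096
print(400 < 2 ** 9)            # -> True

# Free parameters in a fully connected 6-12-4 net with biases:
# (6*12 weights + 12 biases) + (12*4 weights + 4 biases)
n_params = (6 * 12 + 12) + (12 * 4 + 4)
print(n_params)                # -> 136
```

With 136 adjustable parameters against 400 training patterns, the net
has plenty of room to memorize rather than generalize, which is
consistent with the training/test gap in the original post.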
tylerh@nntp-server.caltech.edu (Tyler R. Holcomb) (04/04/91)
nealiphc@milton.u.washington.edu (Phillip Neal) writes:

>I have a problem with the ability of a neural net to generalize.
>I have 600 observations of a 6 predictor variable input vector
>to classify these observations into 1 of 4 groups.
 (stuff deleted)
>So, what's the deal ? Is my sample size too small ? Are there
>any good papers that cover this kind of problem ?
>I know I am violating the rule of thumb to have 10 times more
>training data than nodes in the net. But hey, data is expensive.

Several comments:

1. I agree with Andy Bereson. A single linear unit will recover the
   MSE-optimal linear discriminant. This would seem to encourage trying
   to use few hidden units. In particular, use linear feed-through
   connections (direct connections from input to output). The
   theoretical benefit and practical utility of this approach have been
   demonstrated by many authors (myself included).

2. Kramer has shown that the underlying structure of a backprop net is
   ill-suited for classification tasks like yours. He suggests radial
   basis function nets (e.g. Moody and Darken, _Neural Computation_,
   vol. 1, pp. 281-294).

3. Neural networks are not the solution to all of the world's problems.
   Maybe your problem really is optimally separated by a linear
   discriminant!

--
------------------------------------------------------------
Tyler Holcomb   * "Remember, one treats others with courtesy and respect
tylerh@juliet   *  not because they are gentlemen or gentlewomen, but
caltech.edu     *  because you are." -Garth Henrichs
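[Editor's note: a minimal sketch of the radial-basis-function approach
suggested in point 2 above, in the spirit of Moody and Darken: Gaussian
units at fixed centers plus a linear least-squares output layer. The
synthetic data, center count and width heuristic are all illustrative
assumptions, not taken from the thread or from the cited paper.]

```python
# Minimal RBF classifier sketch: fixed Gaussian centers, linear output
# layer fit by least squares (no backprop). Synthetic 6-feature,
# 4-class data set with 600 observations, echoing the original problem.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 6))
labels = rng.integers(0, 4, size=600)
X += labels[:, None] * 0.8          # shift each class so it is learnable
Y = np.eye(4)[labels]               # one-of-4 target coding

X_tr, Y_tr = X[:400], Y[:400]
X_te, lab_te = X[400:], labels[400:]

# Centers: a random subset of training points. Width: median
# inter-center distance (an assumed heuristic, not Moody & Darken's
# exact recipe, which uses k-means clustering).
centers = X_tr[rng.choice(400, size=20, replace=False)]
d = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
sigma = np.median(d[d > 0])

def rbf_features(X):
    # Gaussian activation of every unit for every input row.
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

# Output weights by linear least squares on the hidden activations.
H = rbf_features(X_tr)
W, *_ = np.linalg.lstsq(H, Y_tr, rcond=None)

pred = rbf_features(X_te) @ W
test_acc = np.mean(pred.argmax(1) == lab_te)
print(round(test_acc, 3))
```

Because only the output layer is trained, and by a direct linear solve,
there is no long iterative training run to overtrain, which is part of
the appeal of this architecture for small data sets.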