[comp.ai.neural-nets] Generalization Criteria

aam9n@uvaee.ee.virginia.EDU (Ali Minai) (09/19/89)

Following the very interesting but rather inconclusive discussion of
generalization on comp.ai.neural-nets, I have a question. 

Given a *finite* set of input-output values, an estimator/approximator
is used to induce a "reasonable" model for the system that generated
the data. While there are many measures of "goodness of fit" with
respect to the given data set (e.g. mean squared error), there seems
to be no universally accepted measure of the "generalization" achieved.
So far, I have just seen variants of two approaches:

1) Using a "test set" of data not seen by the estimator during training,
   and comparing the error on this set with that on the training set (a
   minimal sketch of this appears after the list).

2) Looking at the statistical properties of the estimator's responses on
   training and testing data to see how similar they are.
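
To make approach (1) concrete, here is a minimal sketch in Python; the
data, the polynomial "estimator" and its degree are assumptions of mine,
chosen only for illustration, and the essential part is just the
comparison of the two errors:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical data: a noisy quadratic, split into training and test sets.
    x = rng.uniform(-1.0, 1.0, size=200)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=x.shape)
    x_train, y_train = x[:150], y[:150]
    x_test,  y_test  = x[150:], y[150:]

    # An assumed estimator: a least-squares polynomial of a chosen degree.
    degree = 5
    coeffs = np.polyfit(x_train, y_train, degree)

    def mse(xs, ys):
        return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

    # Approach (1): compare the error on held-out data with the training error.
    # A large gap suggests the fit has not generalized beyond the training set.
    print("training MSE:", mse(x_train, y_train))
    print("test MSE:    ", mse(x_test, y_test))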

I would appreciate any references that describe/use/propound any other
criteria for generalization. In particular, I am interested in the
philosophical issues behind this problem and its implications for
the scientific method.

Thank you.

Ali Minai
Dept. of Electrical Engg.
Thornton Hall
University of Virginia
Charlottesville, VA 22903

aam9n@uvaee.ee.virginia.edu

cjoslyn@bingvaxu.cc.binghamton.edu (Cliff Joslyn) (09/20/89)

In article <506@uvaee.ee.virginia.EDU> aam9n@uvaee.ee.virginia.EDU (Ali Minai) writes:
>Given a *finite* set of input-output values, an estimator/approximator
>is used to induce a "reasonable" model for the system that generated
>the data. While there are many measures of "goodness of fit" with
>respect to the given data set (e.g. mean squared error), there seems
>to be no universally accepted measure of the "generalization" achieved.

I'm not familiar with this specific application, but it seems to me that
Maximum Entropy (MaxEnt) theory might be appropriate.  On this view, given
that the input-output sets form a distribution, the model is constructed
so as to maximize its entropy relative to those distributions, which act
as constraints.

The philosophy is that this is the *only* model which is completely
parsimonious relative to the data: it uses exactly the data provided,
making no further assumptions and ignoring no data.  This view is
strongly related to traditional information theory and Bayesian
statistics.  Most applications are in data analysis, especially
image reconstruction.
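
As a toy illustration of the recipe, here is a sketch (in Python, with a
support set and a mean constraint that are purely my own assumptions)
which finds the maximum-entropy distribution on a finite set of values
subject to a single mean constraint; the solution has the usual
exponential (Gibbs) form, with the Lagrange multiplier found by bisection:

    import numpy as np

    # Assumed support and an assumed mean constraint estimated from data.
    values = np.arange(1, 7)        # e.g. the faces of a die
    target_mean = 4.5               # the observed average

    def mean_for(lam):
        """Mean of the distribution p_i proportional to exp(-lam * x_i)."""
        w = np.exp(-lam * values)
        p = w / w.sum()
        return float(p @ values)

    # The maximum-entropy distribution with a fixed mean is p_i ~ exp(-lam * x_i);
    # bisect on lam until the mean constraint is satisfied.
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) > target_mean:   # the mean decreases as lam increases
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    w = np.exp(-lam * values)
    p = w / w.sum()
    print("lambda:", lam)
    print("maxent distribution:", p)
    print("entropy:", float(-(p * np.log(p)).sum()))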

Christensen, Ronald: (1985) "Entropy Minimax Multivariate Statistical 
	Modeling", /Int. J. Gen. Sys./, v. 11

Cox, RT: (1946) "Probability, Frequency, and Reasonable Expectation",
     /American J. of Physics/, v. 14, pp. 1-13

Erickson, Gary J., ed.: (1988) /Maximum-Entropy and Bayesian Methods in
     Sci. + Eng./, v. 1,2, Kluwer

Jaynes, ET: (1957) "Information Theory and Statistical Mechanics",
     /Physical Review/, v. 106,108, pp. 620-630

     (1978) "Where Do We Stand on Maximum Entropy?", in: /Maximum Entropy
     Formalism/, ed. RD Levine, pp. 211-314, Cambridge, Cambridge,Mass

     (1990) /Probability Theory as Logic/, NOTE: To appear

Jumarie, Guy: (1987) "New Concepts in Information Theory", /Physica
     Scripta/, v. 35:2, pp. 220-224

     (1988) /Relative Information: Theories and Applications/,
     Springer-Verlag, Berlin

Kapur, JN: (1989) /Maximum Entropy Models in Science and Engineering/,
     John Wiley, New York

Polya, George: (1954) /Patterns of Plausible Inference/, Princeton U.
     Press, Princeton

Skilling, John: (1989) "Classic Maximum Entropy", in: /Maximum Entropy
     + Bayesian Methods/, ed. J. Skilling, pp. 45-52, Kluwer, New York

     ed.: (1989) /Maximum-Entropy and Bayesian Methods/, Kluwer

     (1990) "Quantified Maximum Entropy", in: /Proc. 9th MaxEnt Workshop/

Smith, C Ray: (1990) "From Rationality + Consistency to Bayesian
     Probability", in: /Proc. 9th MaxEnt Workshop/

Tribus, Myron: (1969) /Rational Descriptions, Decisions, and Designs/, 
	Pergamon, New York

Tribus, Myron, and Levine, RD, eds.: (1979) /Maximum Entropy
     Formalism/, MIT Press, Cambridge

-- 
O---------------------------------------------------------------------->
| Cliff Joslyn, Cybernetician at Large
| Systems Science, SUNY Binghamton, cjoslyn@bingvaxu.cc.binghamton.edu
V All the world is biscuit shaped. . .

cybsys@bingvaxu.cc.binghamton.edu (CYBSYS-L Moderator) (09/20/89)

Really-From: Alex Martelli <AIBM002@ICINECA.BITNET>

     The training set and test set are not necessarily separate, i.e. you
may be using parts of the training set itself to judge the "convergence"
or generality of the trained model.  E.g. assume you have a system which
accepts a stream of data points and builds an internal model that gives a
probability distribution for the "value of the next point forthcoming".
As each point arrives, the system first evaluates its likelihood and
outputs a "current generality" measure, such as a running perplexity over
the last 10 points, and only then uses the value to update the training
of the internal model.  This is not philosophically very different from
the training/test set duality, but it can be very practical, since you
judge the model's convergence to generalization not from a single number
measuring generality, but from a time-varying measure of goodness of fit
with training data the model has not yet been trained on, and it may be
possible to use characteristics of this curve (is it flattening out to
some plateau?  is it going down steadily after a peak, suggesting I am
overtraining my model?  etc.) to decide when training can stop.
I was introduced to this approach by Dr. Frederick Jelinek of IBM
Research at Yorktown and his group, and I believe it is expounded in
their publications on speech recognition and language modeling in the
IEEE Transactions on Acoustics, Speech, and Signal Processing.
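
A minimal sketch of this scheme in Python, where the alphabet and the
Laplace-smoothed frequency counter are stand-ins of my own invention for
the real model: each incoming symbol is scored under the current model
first, contributing to a running perplexity over the last 10 points, and
only then used for training:

    import math
    from collections import deque

    # Assumed stand-in model: add-one-smoothed symbol counts over a known alphabet.
    alphabet = "abcd"
    counts = {s: 1 for s in alphabet}
    total = len(alphabet)
    recent_logprobs = deque(maxlen=10)      # window for the running perplexity

    def prob(symbol):
        return counts[symbol] / total

    stream = "abcaabacadabacabaabb"         # hypothetical incoming data
    for symbol in stream:
        # 1. Evaluate the new point under the model *before* training on it.
        recent_logprobs.append(math.log(prob(symbol)))
        if len(recent_logprobs) == recent_logprobs.maxlen:
            perplexity = math.exp(-sum(recent_logprobs) / len(recent_logprobs))
            print(f"running perplexity over last 10 points: {perplexity:.3f}")
        # 2. Only now update the model with the new point.
        counts[symbol] += 1
        total += 1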

A philosophically different approach is to evaluate the amount of
information, extracted from the training data, that would have to be
communicated to a hypothetical receiver armed only with the bare
structure of the model in order for the model-as-trained to be
reconstructed, and to weigh this amount of information against the
information needed to communicate the data themselves given the
model-as-trained (say with arithmetic or Huffman coding).  This also
assumes a probabilistic model, of course, and like the first approach it
has a strong flavor of information theory.  This idea was mentioned to
me in a private note by Dr. Rissanen of IBM Research at Almaden, but I
don't know whether it has actually been tried or is "just" an idea; you
could try contacting Dr. Rissanen himself about this.
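
A toy version of that weighing, under assumptions entirely of my own
choosing (polynomial models, residuals coded under a Gaussian, one
fixed-precision number per coefficient for the model itself), might look
like the Python sketch below; the preferred model is the one with the
smallest total code length:

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(-1, 1, 40)
    y = 0.5 - 2.0 * x + 1.5 * x**2 + rng.normal(scale=0.3, size=x.shape)

    PARAM_BITS = 16   # assumed cost of sending one quantized coefficient

    def description_length(degree):
        coeffs = np.polyfit(x, y, degree)
        residuals = y - np.polyval(coeffs, x)
        sigma2 = max(float(residuals.var()), 1e-12)
        # Bits to send the data given the model: Gaussian negative log-likelihood
        # in bits, an idealized stand-in for an actual arithmetic/Huffman code
        # (a real code would also quantize the residuals).
        data_bits = 0.5 * len(x) * (np.log2(2 * np.pi * sigma2) + 1 / np.log(2))
        # Bits to send the model itself.
        model_bits = PARAM_BITS * (degree + 1)
        return model_bits + data_bits

    for d in range(6):
        print(f"degree {d}: total description length = {description_length(d):.1f} bits")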

aam9n@uvaee.ee.virginia.EDU (Ali Minai) (09/21/89)

A few days ago, I posted a question on generalization criteria, and the
response has been very good. Several people sent me references, and I
will e-mail a list of these to anyone who is interested. Just send me
mail at
        aam9n@uvaee.ee.Virginia.EDU

At this point, let me be a little more specific about my question. I
am looking for a way to get away from the "test set" notion of measuring
generalization. A "test set" is arbitrary, and any generalization measure
it produces is good only relative to this set ---- and only partially so,
because the "test set" itself is only a fragment of the data on which
the estimator will be used. I realize that it is impossible to get away
from this relativity altogether. All measures must be relative to something
or other. In the case of particular applications, it makes sense to
use real test sets to check estimator performance, but in cases of purely
theoretical interest, it might be more appropriate to use *internal*
measures of performance. By "internal" I mean measures which use only
the parameters of the estimator itself, or, at least, are not based on
arbitrarily chosen sets of extraneous data. One good measure which I
have already seen is Rissanen's minimum description length criterion.
Other entropy-related measures are also out there, and it is mainly on these that
I seek information. Of course, if other novel measures have been used,
I will be more than interested.

Here is a very simplified example of the kind of problem I am looking at:

Two networks A and B are trained on a set of two points
in (x,y)-space. Network A, which is very small, learns a straight line.
Network B, which is more complex, learns a sinusoid passing through the
points. I, on the other hand, was actually looking for a parabola (I
was stupid, okay! but this is only an example). I test my networks on
test points from my parabola, and both of them fail miserably. Why should
either of them be considered inadequate though? I had not given them enough
information. But this objection applies only relative to my desire ---- to
learn a parabola. If I remove this desire, there is no such thing as "enough"
information. Whatever is there is all there is. So how do I grade the
performance of my networks now? Do I say that network A is "better"
because it produced a "simpler" model? But that will get me into the
debate we have already been through on whether a polynomial of degree 4
is simpler than a simple exponential, and so forth. Thus, one aspect of
my question is, have people tried to use criteria for assigning a degree
of "simplicity" to functions? For example, if I had been using regularly
spaced samples of a time signal, I could invoke the Shannon sampling
theorem and say that any generalization involving frequencies above the
Nyquist frequency (half the sampling rate) is a "false" one, and has
"generated" information gratuitously.
Given arbitrary samples, is there a natural extension?
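
To make the two-network example concrete, here is a sketch with points
and functional forms I have simply made up: both the line and the
sinusoid interpolate the two training points exactly, so nothing in the
data prefers one to the other, yet both miss the parabola I wanted:

    import numpy as np

    # Two training points; they also happen to lie on the parabola y = x^2.
    x_train = np.array([0.0, 1.0])
    y_train = x_train ** 2

    line     = lambda x: x                        # what small network A might learn
    sinusoid = lambda x: np.sin(np.pi * x / 2.0)  # what larger network B might learn
    parabola = lambda x: x ** 2                   # what I was actually looking for

    # Both models fit the training data perfectly ...
    for name, f in [("line", line), ("sinusoid", sinusoid)]:
        print(name, "max training error:", float(np.max(np.abs(f(x_train) - y_train))))

    # ... yet both fail on test points drawn from the parabola.
    x_test = np.linspace(-2.0, 3.0, 11)
    for name, f in [("line", line), ("sinusoid", sinusoid)]:
        err = float(np.mean((f(x_test) - parabola(x_test)) ** 2))
        print(name, "test MSE against the parabola:", round(err, 3))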

Thank you,

Ali Minai
aam9n@uvaee.ee.Virginia.EDU

cybsys@bingvaxu.cc.binghamton.edu (CYBSYS-L Moderator) (09/21/89)

Really-From: "Michael H. Prager" <MHP100F@ODUVM.BITNET>

On Tue, 19 Sep 89 20:17:26 EDT CYBSYS-L Moderator said:
>From: aam9n@uvaee.ee.virginia.EDU (Ali Minai)
>[ The following is cross-posted from alt.cyb-sys ]
>
>Given a *finite* set of input-output values, an estimator/approximator
>is used to induce a "reasonable" model for the system that generated
>the data. While there are many measures of "goodness of fit" with
>respect to the given data set (e.g. mean squared error), there seems
>to be no universally accepted measure of the "generalization" achieved....

I believe that this is indeed a very difficult and profound question.
One reference of interest is Barron, A.  1984.  Predicted squared error: a
criterion for automatic model selection.  Chapter 4 IN Farlow, S. J., editor.
Self-organizing methods in modeling.  Marcel Dekker, NY.

Barron has also done some work on criteria based on minimum description
length, but I am not very familiar with that work.

It would be interesting to hear what else you come up with.

+--------------------------------------------------------------------+
| Michael H. Prager, Asst. Prof. || FAX: (804) 683-5303
| Department of Oceanography     || Bitnet: MHP100F AT ODUVM
| Old Dominion University        || Internet: mhp100f@oduvm.cc.odu.edu
| Norfolk, VA  23529-0276        || Telephone: (804) 683-6003
+--------------------------------------------------------------------+

ecctp@warwick.ac.uk (Dr J A K Cave) (09/27/89)

        The psychometric literature has spawned an approach known
as G-theory to quantify the generalizability of structural information to
bigger "universes" - in practical terms it looks like a gussied-up
analysis of variance/covariance, but that's OK for some purposes.
I will try to dig up some references, but for starters you might
search the published works of Rich Shavelson: I know he did a summary
for the NAS that appeared in a paper on job performance measurement
within the last 3 years.

Econometrics has come up with a variety of measures roughly related to
MSE.  Examples include Mallows' Cp, Amemiya's PC, mean-square prediction
error on an independent sample, or Rbar-squared (the correlation
coefficient adjusted for degrees of freedom, using the correlation
between predicted and actual values on a disjoint set of data).  Don't
know if this is even close to what you are seeking (different
disciplines, different meanings for the same words...)
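
For what it is worth, a couple of these are easy to compute once a model
has been fit; a Python sketch with made-up data and ordinary least
squares, where the error-variance estimate used in Cp comes from the
largest candidate model:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 60
    x = rng.uniform(-1, 1, n)
    y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=n)

    def sse(degree):
        coeffs = np.polyfit(x, y, degree)
        return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

    max_degree = 5
    s2 = sse(max_degree) / (n - (max_degree + 1))   # error variance from largest model
    sst = float(np.sum((y - y.mean()) ** 2))

    for degree in range(max_degree + 1):
        p = degree + 1                                 # parameters, intercept included
        r2 = 1.0 - sse(degree) / sst
        rbar2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p)   # Rbar-squared
        cp = sse(degree) / s2 - n + 2 * p              # Mallows' Cp
        print(f"degree {degree}: Rbar^2 = {rbar2:.3f}, Cp = {cp:.2f}")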