aam9n@uvaee.ee.virginia.EDU (Ali Minai) (09/19/89)
Following the very interesting but rather inconclusive discussion of generalization on comp.ai.neural-nets, I have a question. Given a *finite* set of input-output values, an estimator/approximator is used to induce a "reasonable" model for the system that generated the data. While there are many measures of "goodness of fit" with respect to the given data set (e.g. mean squared error), there seems to be no universally accepted measure of the "generalization" achieved.

So far, I have seen only variants of two approaches:

1) Using a "test set" of data not seen by the estimator during training, and comparing the error on this set with the error on the training set.

2) Looking at the statistical properties of the estimator's responses on training and testing data to see how similar they are.

I would appreciate any references that describe/use/propound other criteria for generalization. In particular, I am interested in the philosophical issues behind this problem and its implications for the scientific method.

Thank you.

Ali Minai
Dept. of Electrical Engg.
Thornton Hall
University of Virginia
Charlottesville, VA 22903
aam9n@uvaee.ee.virginia.edu
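A minimal sketch of the "test set" approach (1), in Python with NumPy; the data, the train/test split, and the cubic-polynomial "estimator" are hypothetical, chosen only to illustrate the comparison of training error against held-out error:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: noisy samples of an unknown system y = f(x) + noise.
x = rng.uniform(-1.0, 1.0, 60)
y = np.sin(2.5 * x) + 0.1 * rng.normal(size=x.size)

# Split into a training set and a held-out test set.
train, test = slice(0, 40), slice(40, 60)

# "Estimator": a cubic polynomial fitted by least squares.
coeffs = np.polyfit(x[train], y[train], deg=3)

def mse(xs, ys):
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

print("training MSE:", mse(x[train], y[train]))
print("test MSE:    ", mse(x[test], y[test]))
# A large gap between the two numbers is read as poor generalization,
# but the verdict is only relative to this particular test set.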
cjoslyn@bingvaxu.cc.binghamton.edu (Cliff Joslyn) (09/20/89)
In article <506@uvaee.ee.virginia.EDU> aam9n@uvaee.ee.virginia.EDU (Ali Minai) writes:
>Given a *finite* set of input-output values, an estimator/approximator
>is used to induce a "reasonable" model for the system that generated
>the data. While there are many measures of "goodness of fit" with
>respect to the given data set (e.g. mean squared error), there seems
>to be no universally accepted measure of the "generalization" achieved.

I'm not familiar with this specific application, but it seems likely to me that it might be appropriate to use Maximum Entropy (MaxEnt) theory. On this view, given that the input-output sets form a distribution, the model is constructed so as to maximize the entropy relative to those distributions, which act as constraints. The philosophy is that this is the *only* model which is completely parsimonious relative to the data: it uses exactly the data provided, making no further assumptions and ignoring none of it. This view is strongly related to traditional information theory and Bayesian statistics. Most applications are in data analysis, especially image reconstruction.

Christensen, Ronald (1985): "Entropy Minimax Multivariate Statistical Modeling", /Int. J. Gen. Sys./, v. 11
Cox, RT (1946): "Probability, Frequency, and Reasonable Expectation", /American J. of Physics/, v. 14, pp. 1-13
Erickson, Gary J., ed. (1988): /Maximum-Entropy and Bayesian Methods in Sci. + Eng./, v. 1-2, Kluwer
Jaynes, ET (1957): "Information Theory and Statistical Mechanics", /Physical Review/, v. 106, 108, pp. 620-630
Jaynes, ET (1978): "Where Do We Stand on Maximum Entropy?", in: /Maximum Entropy Formalism/, ed. RD Levine, pp. 211-314, MIT Press, Cambridge, Mass.
Jaynes, ET (1990): /Probability Theory as Logic/ (NOTE: to appear)
Jumarie, Guy (1987): "New Concepts in Information Theory", /Physica Scripta/, v. 35:2, pp. 220-224
Jumarie, Guy (1988): /Relative Information: Theories and Applications/, Springer-Verlag, Berlin
Kapur, JN (1989): /Maximum Entropy Models in Science and Engineering/, John Wiley, New York
Polya, George (1954): /Patterns of Plausible Inference/, Princeton U. Press, Princeton
Skilling, John (1989): "Classic Maximum Entropy", in: /Maximum Entropy + Bayesian Methods/, ed. J. Skilling, pp. 45-52, Kluwer, New York
Skilling, John, ed. (1989): /Maximum-Entropy and Bayesian Methods/, Kluwer
Skilling, John (1990): "Quantified Maximum Entropy", in: /Proc. 9th MaxEnt Workshop/
Smith, C Ray (1990): "From Rationality + Consistency to Bayesian Probability", in: /Proc. 9th MaxEnt Workshop/
Tribus, Myron (1969): /Rational Descriptions, Decisions, and Designs/, Pergamon, New York
Tribus, Myron, and Levine, RD, eds. (1979): /Maximum Entropy Formalism/, MIT Press, Cambridge

--
O---------------------------------------------------------------------->
| Cliff Joslyn, Cybernetician at Large
| Systems Science, SUNY Binghamton, cjoslyn@bingvaxu.cc.binghamton.edu
V All the world is biscuit shaped. . .
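A rough sketch of the MaxEnt construction, using Jaynes' well-known dice problem (the only datum is an observed mean of 4.5) and SciPy's general-purpose constrained optimizer; this is an illustration of the idea, not code drawn from any of the works listed above:

import numpy as np
from scipy.optimize import minimize

# Outcomes 1..6; MaxEnt picks the distribution that maximizes
# -sum(p log p) subject to sum(p) = 1 and sum(p * k) = 4.5.
k = np.arange(1, 7)
observed_mean = 4.5

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))

constraints = (
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
    {"type": "eq", "fun": lambda p: np.sum(p * k) - observed_mean},
)
p0 = np.full(6, 1.0 / 6.0)
res = minimize(neg_entropy, p0, bounds=[(0, 1)] * 6, constraints=constraints)
print("MaxEnt distribution:", np.round(res.x, 4))
# No other distribution consistent with the constraints has higher entropy,
# i.e. this is the model that assumes nothing beyond the given data.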
cybsys@bingvaxu.cc.binghamton.edu (CYBSYS-L Moderator) (09/20/89)
Really-From: Alex Martelli <AIBM002@ICINECA.BITNET>

Training set and test set are not necessarily separate, i.e. you may be using parts of the training set to look for convergence or generality of the trained model. For example, suppose you have a system which accepts a stream of data points, building an internal model that gives a probability distribution for the value of the next point to come. As each new point arrives, the system first evaluates its likelihood and outputs a "current generality" measure, such as the running perplexity over the last 10 points, and only then uses the value to update the training of the internal model.

This is not philosophically very different from the training/test set duality, but it can be very practical: you judge model convergence toward generalization not from a single number measuring generality, but from a time-varying measure of goodness of fit with training data before the model was trained on them, and it may be possible to use characteristics of this curve (is it flattening out to some plateau? is it going down steadily after a peak, so that I am overtraining my model? etc.) to decide when training can stop. I was introduced to this approach by Dr. Frederick Jelinek of IBM Research at Yorktown and his group, and I believe it is expounded in their publications on speech recognition and language modeling in the IEEE Transactions on Acoustics, Speech, and Signal Processing.

A philosophically different approach is to evaluate the amount of information, as extracted from the training data, that would have to be communicated to a hypothetical receiver armed only with the bare structure of the model in order to reconstruct the model-as-trained, and to weigh this amount of information against the information needed to communicate the data themselves given the model-as-trained (say with arithmetic or Huffman coding). This also assumes a probabilistic model, of course, and like the first approach it has a strong flavor of information theory. This idea was mentioned to me in a private note by Dr. Rissanen of IBM Research at Almaden, but I don't know whether it has actually been tried or is "just" an idea; you could try contacting Dr. Rissanen himself about this.
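A sketch of the predict-first-then-train scheme described above, on a toy symbol stream with a Laplace-smoothed unigram model; the stream, alphabet size, and 10-point window are made up for illustration and are not taken from the Jelinek group's work:

import numpy as np
from collections import deque

rng = np.random.default_rng(1)
alphabet = 4
stream = rng.choice(alphabet, size=200, p=[0.5, 0.25, 0.15, 0.10])

counts = np.ones(alphabet)          # Laplace prior: one pseudo-count per symbol
recent = deque(maxlen=10)           # log-probs of the last 10 points

for t, symbol in enumerate(stream):
    # 1. Evaluate the incoming point under the model as trained so far.
    p = counts / counts.sum()
    recent.append(np.log2(p[symbol]))
    running_perplexity = 2.0 ** (-np.mean(recent))

    # 2. Only then use the point to update (train) the model.
    counts[symbol] += 1

    if (t + 1) % 50 == 0:
        print(f"t={t+1:3d}  running perplexity over last 10 points: "
              f"{running_perplexity:.3f}")

# Watching this curve flatten to a plateau (or rise again) is the
# convergence/overtraining diagnostic described in the post.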
aam9n@uvaee.ee.virginia.EDU (Ali Minai) (09/21/89)
A few days ago, I posted a question on generalization criteria, and the response has been very good. Several people sent me references, and I will e-mail a list of these to anyone who is interested. Just send me mail at aam9n@uvaee.ee.Virginia.EDU

At this point, let me be a little more specific about my question. I am looking for a way to get away from the "test set" notion of measuring generalization. A "test set" is arbitrary, and any generalization measure it produces is good only relative to this set ---- and only partially so, because the "test set" itself is only a fragment of the data on which the estimator will eventually be used. I realize that it is impossible to get away from this relativity altogether: all measures must be relative to something or other. For particular applications, it makes sense to use real test sets to check estimator performance, but in cases of purely theoretical interest, it might be more appropriate to use *internal* measures of performance. By "internal" I mean measures which use only the parameters of the estimator itself, or, at least, are not based on arbitrarily chosen sets of extraneous data. One good measure which I have already seen is Rissanen's minimum description length criterion. Other entropy-related measures are also out there, and it is mainly on these that I seek information. Of course, if other novel measures have been used, I will be more than interested.

Here is a very simplified example of the kind of problem I am looking at. Two networks, A and B, are trained on a set of two points in (x,y)-space. Network A, which is very small, learns a straight line. Network B, which is more complex, learns a sinusoid passing through the points. I, on the other hand, was actually looking for a parabola (I was stupid, okay! but this is only an example). I test my networks on test points from my parabola, and both of them fail miserably. Why should either of them be considered inadequate, though? I had not given them enough information. But this objection applies only relative to my desire ---- to learn a parabola. If I remove this desire, there is no such thing as "enough" information: whatever is there is all there is. So how do I grade the performance of my networks now? Do I say that network A is "better" because it produced a "simpler" model? But that will get me into the debate we have already been through on whether a polynomial of degree 4 is simpler than a simple exponential, and so forth.

Thus, one aspect of my question is: have people tried to use criteria for assigning a degree of "simplicity" to functions? For example, if I had been using regularly spaced samples of a time signal, I could invoke Shannon's sampling theorem and say that any generalization involving frequencies higher than the Nyquist rate is a "false" one, and has "generated" information gratuitously. Given arbitrary samples, is there a natural extension?

Thank you,
Ali Minai
aam9n@uvaee.ee.Virginia.EDU
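A sketch of the toy example above, with the two "networks" stood in for by closed-form fits; the particular line, sinusoid, and parabola are of course arbitrary choices made only for illustration:

import numpy as np

# Two training points, a "simple" model (a line) and a "complex" model
# (a sinusoid) both fit them exactly, and both are then tested against
# the parabola the experimenter actually had in mind.
x_train = np.array([-1.0, 1.0])
y_train = x_train ** 2              # both points happen to lie on y = x^2

# Network A analogue: the unique line through the two points.
slope = (y_train[1] - y_train[0]) / (x_train[1] - x_train[0])
intercept = y_train[0] - slope * x_train[0]
line = lambda x: slope * x + intercept

# Network B analogue: one sinusoid (of many) passing through both points.
sinusoid = lambda x: np.cos(np.pi * x) + 2.0   # equals 1 at x = -1 and x = +1

x_test = np.linspace(-2.0, 2.0, 41)
y_test = x_test ** 2                           # the intended parabola

for name, f in [("line", line), ("sinusoid", sinusoid)]:
    print(name,
          "training MSE:", float(np.mean((f(x_train) - y_train) ** 2)),
          " test MSE:", float(np.mean((f(x_test) - y_test) ** 2)))

# Both models fit the training data perfectly, and both fail on the parabola;
# nothing in the two points alone says which "generalization" is the right one.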
cybsys@bingvaxu.cc.binghamton.edu (CYBSYS-L Moderator) (09/21/89)
Really-From: "Michael H. Prager" <MHP100F@ODUVM.BITNET> This message was originally submitted by MHP100F@ODUVM to the CYBSYS-L list at BINGVMB. On Tue, 19 Sep 89 20:17:26 EDT CYBSYS-L Moderator said: >From: aam9n@uvaee.ee.virginia.EDU (Ali Minai) >[ The following is cross-posted from alt.cyb-sys ] > >Given a *finite* set of input-output values, an estimator/approximator >is used to induce a "reasonable" model for the system that generated >the data. While there are many measures of "goodness of fit" with >respect to the given data set (e.g. mean squared error), there seems >to be no universally accepted measures for the "generalization" achieved.... I believe that this is indeed a very difficult and profound question. One reference of interest is Barron, A. 1984. Predicted squared error: a criterion for automatic model selection. Chapter 4 IN Farlow, S. J., editor. Self-organizing methods in modeling. Marcel Dekker, NY. Barron has also done some work on criteria based on minimum descriptor length, but I am not very familiar with that work. It would be interesting to hear what else you come up with. +--------------------------------------------------------------------+ | Michael H. Prager, Asst. Prof. || FAX: (804) 683-5303 | Department of Oceanography || Bitnet: MHP100F AT ODUVM | Old Dominion University || Internet: mhp100f@oduvm.cc.odu.edu | Norfolk, VA 23529-0276 || Telephone: (804) 683-6003 +--------------------------------------------------------------------+
ecctp@warwick.ac.uk (Dr J A K Cave) (09/27/89)
The psychometric literature has spawned an approach known as G-theory (generalizability theory) to quantify the generalizability of structural information to bigger "universes". In practical terms it looks like a gussied-up analysis of variance/covariance, but that's OK for some purposes. I will try to dig up some references, but for starters you might search the published works of Rich Shavelson: I know he did a summary for the NAS that appeared in a paper on job performance measurement within the last 3 years.

Econometrics has come up with a variety of measures grossly related to MSE. Examples include Mallows' Cp, Amemiya's PC, mean-square prediction error on an independent sample, and Rbar-squared (the correlation coefficient adjusted for degrees of freedom, using the correlation between predicted and actual values on a disjoint set of data).

Don't know if this is even close to what you are seeking (different disciplines, different meanings for the same words...)
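For concreteness, here are textbook forms of two of the econometric measures named above, adjusted R-squared and Mallows' Cp; conventions (e.g. whether the intercept is counted among the parameters) vary from text to text, and the data and linear fit are hypothetical:

import numpy as np

def adjusted_r_squared(y, y_hat, n_predictors):
    # Rbar-squared: R^2 penalized for the degrees of freedom used up.
    n = y.size
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

def mallows_cp(y, y_hat_subset, n_params_subset, sigma2_full):
    # Cp = SSE_p / s^2 - n + 2p, with s^2 estimated from the full model.
    n = y.size
    sse = np.sum((y - y_hat_subset) ** 2)
    return sse / sigma2_full - n + 2 * n_params_subset

# Toy usage with a simple linear fit.
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 50)
y = 3.0 * x + 0.3 * rng.normal(size=x.size)
coeffs = np.polyfit(x, y, 1)
y_hat = np.polyval(coeffs, x)
sigma2_full = np.sum((y - y_hat) ** 2) / (x.size - 2)  # full-model variance estimate
print("adjusted R^2:", round(adjusted_r_squared(y, y_hat, 1), 4))
print("Mallows' Cp: ", round(mallows_cp(y, y_hat, 2, sigma2_full), 4))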