CALENDAR@IBM-SJ.ARPA (03/07/86)
IBM Almaden Research Center 650 Harry Road San Jose, CA 95120-6099 CALENDAR March 10 - 14, 1986 Computer STOCHASTIC COMPLEXITY AND THE MDL AND PMDL PRINCIPLES Science J. J. Rissanen, IBM Almaden Research Center Colloquium Thurs., Mar. 13 There is no rational basis in traditional 3:00 P.M. statistics for the comparison of two models Rear Audit. unless they have the same number of parameters. Hence, for example, the important selection-of-variables problem has a dozen or so solutions, none of which can be preferred over the others. Recently, inspired by the algorithmic notion of complexity, we introduced a new concept in statistics, the Stochastic Complexity of the observed data, relative to a class of proposed probabilistic models. In broad terms, it is defined as the least number of binary digits with which the data can be encoded by use of the selected models. The stochastic complexity also represents the smallest prediction errors which result when the data are predicted by use of the models. Accordingly, the associated optimal model represents all the statistical information in the data that can be extracted with the proposed models, and for this reason its computation, which we call the MDL (Minimum Description Length) principle, may be taken to be the fundamental problem in statistics. In this talk, we describe a special form of the MDL principle, which amounts to the minimization of squared "honest" prediction errors, and we apply it to two examples of polynomial curve fitting as well as to contingency tables. In the first example, which calls for the prediction of weight growth of mice, the degree of the MDL polynomial agrees with the optimal degree, determined in retrospect after the predicted weights were seen. The associated predictions also far surpass those made with the best traditional statistical techniques. A fundamental theorem is given, which permits comparison of models in the spirit of the Cramer-Rao inequality, except that the models need not have the same number of parameters. It also settles the issue of how the selection-of-variables problem is to be solved. Host: R. Arps (Refreshments at 2:45 P.M.) [...]