ck@rex.cs.tulane.edu (Cris Koutsougeras) (09/08/89)
I am tempted to present my thoughts about the issues raised in the "step function" discussion. Let me try to lead through the concepts of "correctness" and "generalization". Given a training set (TS), suppose that a net learns a function F1 which "contains" (or, if you wish, agrees with) the training set. On the basis of the information in TS alone, one has to accept F1 as correct. The issue of proper generalization has to do with whether the inference of F1 on the basis of TS is "reasonable". "Reasonable" is not easy to define. Intuitively, though, it means that F1 reflects a relation characteristic of TS, through which TS can be reconstructed. Also, of the many such relations characterizing TS (each of which alone can reconstruct TS), the most "reasonable" one would be (as a matter of definition) the simplest one.

Let's take an example, because otherwise I sound too arbitrary. Suppose I want to teach someone the Greek word "louloudi", and I present as positive instances some red flowers, and as negative instances some trees, some houses, and some non-flower plants. Suppose also that none of the negative instances are red. Would anyone who concluded that louloudi means a red flower, or something red in general, be considered stupid? Well, louloudi means flower, and it is my fault that I chose a TS in which color plays a determinant role, because I should expect that the "student" would try to identify the "simplest" discriminant rule, which in this case happens to be the color alone.

So I would relate the concept of a learnable function to the TS in use. I would suggest that a given function is learnable with respect to a given TS if it is the "simplest" function which is correct under TS, and that the "simplest" correct function is unique. In the case of Back Prop, as a measure of simplicity we can adopt the degree of nonlinearity of the function. Let's see what happens in the BP operation.
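The louloudi situation can be made concrete with a toy sketch (my own illustration, not from the posts: the feature names and the two-feature hypothesis space are assumptions). With the biased TS below, a learner that searches simplest-first over one-feature rules finds that the color rule is just as consistent with the data as the flower rule:

```python
# Each instance: (is_flower, is_red), label = is it a "louloudi"?
TS = [
    ((True,  True),  True),   # red flower       -> positive
    ((True,  True),  True),   # red flower       -> positive
    ((False, False), False),  # tree             -> negative
    ((False, False), False),  # house            -> negative
    ((False, False), False),  # non-flower plant -> negative
]

FEATURES = ["is_flower", "is_red"]

def consistent_single_feature_rules(ts):
    """Return every one-feature rule that agrees with all of ts --
    the 'simplest' hypotheses in this tiny hypothesis space."""
    rules = []
    for i, name in enumerate(FEATURES):
        if all(x[i] == label for x, label in ts):
            rules.append(name)
    return rules

print(consistent_single_feature_rules(TS))  # -> ['is_flower', 'is_red']
```

Note that adding a single non-red flower as a positive instance, e.g. `((True, False), True)`, eliminates the color rule and leaves only `is_flower` — exactly the repair of the TS that the post's argument suggests.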
Components (primitive features) of the input which are irrelevant to the I/O relation will have random values that make no sense (in other words, that would result in a very complicated description of the I/O relation), if we assume that we have a good TS. On the other hand, those which are relevant will have recurrent properties (or values, to keep it simple). The changes of the weights implied by the non-relevant components will be random and in the long run will cancel each other out, simply causing an oscillation which becomes smaller and smaller with the annealing schedule. Those which are relevant, though, will cause a steady and directed change. That is how BP identifies the relevant features and (easy to understand) by extension how it synthesizes a description.

If the net does not have excessive non-linearity, the relevant features (and their recurrent values) will dominate the changes and the end result. If it does, then any small bias in the values of irrelevant features will be picked up and will remain in the net's internal representations. It thus should become apparent that we should be talking about a function learnable with respect to a TS, and that learnability should be connected with an optimality measure of some sort.

Most of what I expressed here are thoughts that puzzled me when I was writing my dissertation. My escape then was to develop another model, based on an effort to alleviate the problem of the fixed non-linearity of Rumelhart's net. But I haven't yet addressed the problem of learnability in the general case as it was put earlier here. Which means that whoever feels that s/he wants to seriously contribute on this is welcome to cooperate, and let's get a paper out.
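The cancellation argument above can be seen in a minimal sketch (my own, not from the post; a one-layer delta rule stands in for full BP, and the ±1 inputs and learning rate are assumptions). The target depends only on the relevant input, so its weight accumulates a steady, directed change, while updates from the irrelevant input are random and largely cancel:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

relevant   = rng.choice([-1.0, 1.0], size=n)  # carries the I/O relation
irrelevant = rng.choice([-1.0, 1.0], size=n)  # pure noise, uncorrelated
target = relevant                             # output depends on the relevant input only

w = np.zeros(2)
lr = 0.01
for x_r, x_i, t in zip(relevant, irrelevant, target):
    x = np.array([x_r, x_i])
    y = w @ x
    w += lr * (t - y) * x  # delta-rule weight update

# The relevant weight converges near 1; the irrelevant weight only
# oscillates around 0, since its updates cancel in the long run.
print(w)
```

A fixed annealing schedule is omitted for brevity; shrinking `lr` over time would damp the residual oscillation of the irrelevant weight further, as the post describes.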
Cris Koutsougeras
==============================================================================
bill@boulder.Colorado.EDU (09/09/89)
[From Cris Koutsougeras:]
> So I would relate the concept of a learnable function with the TS in use.
> I would suggest that a given function is learnable with respect to a given
> TS if it is the "simplest" function which is correct under TS and that the
> "simplest" correct function is unique.

Ah yes, Occam's razor slashes again! But the whole point, the very crux of the problem, is that "simplest" is not a well-defined concept. I will illustrate.

There is an old mathematician's joke-paradox-problem, which goes as follows. The first five elements of a sequence are 1, 2, 4, 8, 16 -- what is the next element? Answer: 31. Why? Well, the way to extend a sequence is to find the simplest function that is consistent with the numbers you have. But the simplest sorts of functions, as every mathematician knows, are polynomials; and the polynomial

    (1/24)n^4 - (1/12)n^3 + (11/24)n^2 + (7/12)n + 1 ,

for n = 0, 1, 2, 3, 4, is the lowest-order one that gives the members of the sequence. So to get the next element, just plug in n = 5; if you do, you get 31. You say that there is a simpler function? I say you're wrong. The function 2^n is really quite a complicated one -- it only looks simpler to you because you're familiar with it.

Okay? Well, you needn't take the example too seriously in order to get the message, which is that "simplicity is in the eye of the beholder". I realize that not everybody will agree with this notion, and it can be made to seem extremely counterintuitive: check out Hofstadter's 'Goedel, Escher, Bach' for an opposing viewpoint. Still, I think our attempts to design learning systems are gradually forcing us to realize that there is just no such thing as a universal notion of "simplicity".
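The joke checks out numerically. A sketch in exact rational arithmetic (my own; Lagrange interpolation is one standard way to get the lowest-order polynomial through given points) confirms that the degree-4 polynomial through (n, 2^n) for n = 0..4 yields 31, not 32, at n = 5:

```python
from fractions import Fraction

# The five given points: (0,1), (1,2), (2,4), (3,8), (4,16)
points = [(n, 2**n) for n in range(5)]

def lagrange_eval(points, x):
    """Evaluate the unique lowest-order interpolating polynomial
    through `points` at x, in exact rational arithmetic."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

print(lagrange_eval(points, 5))  # -> 31
```

Using `Fraction` avoids the floating-point round-off that `numpy.polyfit` would introduce with these coefficients, so the answer 31 comes out exactly.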