[comp.ai.neural-nets] Original STEP discussion continued

ck@rex.cs.tulane.edu (Cris Koutsougeras) (09/08/89)

I am tempted to present my thoughts about the issues raised in the
"step function" discussion.

Let me try to work through the concepts of "correctness" and
"generalization". Given a training set, suppose that a net learns a function
F1 which "contains" (or, if you wish, agrees with) the training set (TS). On
the basis of the information in TS alone, one has to accept F1 as correct.
The issue of proper generalization has to do with whether the inference of
F1 on the basis of TS is "reasonable". Reasonable is not something easy to
define. Intuitively, though, it means that F1 reflects a relation
characteristic of TS, one through which TS can be reconstructed. Also, of
the many such relations characterizing TS (each of which can by itself
reconstruct TS), the most "reasonable" one would be, as a matter of
definition, the simplest one.

Let's take an example, because otherwise I sound too arbitrary. Suppose I
want to teach someone the Greek word "louloudi" and I present as positive
instances some red flowers, and as negatives some trees, some houses, and
some non-flower plants. Suppose also that none of the negative instances
are red. Would anyone who thought that louloudi means a red flower, or
something red in general, be considered stupid? Well, louloudi means
flower, and it is my fault for choosing a TS in which the color plays a
determining role: I should expect that the "student" would try to identify
the "simplest" discriminant rule, which in this case happens to be the
color alone.
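[As an illustrative aside, here is a small Python sketch of that point; the
attributes, training set, and candidate rules below are invented purely for
illustration and are not from the original discussion. A learner that just
looks for a rule consistent with TS has no reason to prefer "flower" over
"red", since both single-attribute rules agree with every example.]

    # Hypothetical toy version of the "louloudi" training set.
    # Each instance: (is_flower, is_red, label); label 1 = "louloudi".
    TS = [
        (True,  True,  1),   # red flowers given as positive instances
        (True,  True,  1),
        (False, False, 0),   # trees, houses, non-flower plants; none red
        (False, False, 0),
        (False, False, 0),
    ]

    candidate_rules = {
        "is_flower":     lambda flower, red: flower,
        "is_red":        lambda flower, red: red,
        "is_red_flower": lambda flower, red: flower and red,
    }

    consistent = [name for name, rule in candidate_rules.items()
                  if all(bool(rule(f, r)) == bool(y) for f, r, y in TS)]
    print(consistent)   # all three rules fit TS, so TS alone cannot
                        # force the intended meaning "flower"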

So I would relate the concept of a learnable function to the TS in use.
I would suggest that a given function is learnable with respect to a given
TS if it is the "simplest" function which is correct under TS, and that
this "simplest" correct function is unique.

In the case of Back Prop, we can adopt the degree of nonlinearity of the
function as a measure of simplicity. Let's see what happens during BP's
operation. Assuming we have a good TS, components (primitive features) of
the input which are irrelevant to the I/O relation will have random values
that make no sense (in other words, they would only yield a very
complicated description of the I/O relation). On the other hand, those
which are relevant will have recurrent properties (or values, to keep it
simple). The weight changes implied by irrelevant components will be random
and in the long run will cancel each other out, simply causing an
oscillation which grows smaller and smaller with the annealing schedule.
The relevant components, though, will cause a steady and directed change.
That is how BP identifies the relevant features and, by an easy extension,
how it synthesizes a description from them. If the net does not have
excessive non-linearity, the relevant features (and their recurrent values)
will dominate the changes and the end result. If it does, then any small
bias in the values of irrelevant features will be picked up and will remain
in the net's internal representations.
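[Again as an illustrative aside, not from the original post: a crude way to
see the cancellation argument numerically. The sketch below is not BP on a
multi-layer net; it is plain LMS gradient descent on a single linear unit,
and the learning-rate decay is only a stand-in for the "annealing schedule"
mentioned above. The feature names and constants are made up.]

    import random

    random.seed(0)
    w_rel, w_irr = 0.0, 0.0   # weights for a relevant / an irrelevant input
    rate = 0.1                # decays over time, a crude annealing schedule

    for step in range(2000):
        x_rel = random.choice([-1.0, 1.0])
        x_irr = random.choice([-1.0, 1.0])   # noise, unrelated to the target
        target = 2.0 * x_rel                 # the I/O relation uses x_rel only
        error = target - (w_rel * x_rel + w_irr * x_irr)
        w_rel += rate * error * x_rel        # steady, directed change
        w_irr += rate * error * x_irr        # random pushes that cancel out
        rate *= 0.999

    print(round(w_rel, 3), round(w_irr, 3))  # w_rel near 2.0, w_irr near 0.0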

It should thus be apparent that we ought to be talking about a function
learnable with respect to a TS, and that learnability should be tied to an
optimality measure of some sort.

Most of what I have expressed here are thoughts that puzzled me while I
was writing my dissertation. My escape then was to develop another model,
based on an effort to alleviate the problem of the fixed non-linearity of
Rumelhart's net. But I have not yet addressed the problem of learnability
in the general case as it was posed earlier here. Which means that whoever
wants to contribute seriously on this is welcome to cooperate, and let's
get a paper out.



Cris Koutsougeras

==============================================================================
,___,   _  _   _              _   _   ___,               ___         ,__
  |     |  |   |       /\     |\  |   |                 |   `       /
  |     |  |   |      /--\    | \ |   |-        *       |            \
  |     |__|   |__,  /    \   |  \|   |___,             |___,  *    __/  *

==============================================================================

bill@boulder.Colorado.EDU (09/09/89)

[From Cris Koutsougeras:]
>  So I would relate the concept of a learnable function to the TS in use.
>  I would suggest that a given function is learnable with respect to a given
>  TS if it is the "simplest" function which is correct under TS, and that
>  this "simplest" correct function is unique.

  Ah yes, Occam's razor slashes again!  But the whole point, the very crux
of the problem, is that "simplest" is not a well-defined concept.  I will
illustrate.

  There is an old mathematician's joke-paradox-problem, which goes as
follows.  The first five elements of a sequence are 1,2,4,8,16 -- what
is the next element?

  Answer: 31.

  Why?  Well, the way to extend a sequence is to find the simplest function
that is consistent with the numbers you have.  But the simplest sorts of
functions, as every mathematician knows, are polynomials; and the
polynomial (1/24)n^4 - (1/12)n^3 + (11/24)n^2 + (7/12)n + 1  , 
for n = 0,1,2,3,4, is the lowest order one that gives the members of the
sequence.  So to get the next element, just plug in n = 5; if you do,
you get 31.
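
  [For anyone who wants to check the arithmetic, here is a short Python
sketch, added for illustration rather than taken from the original article.
It builds the unique degree-4 polynomial through the five points by
Lagrange interpolation, using exact rational arithmetic, and evaluates it
at n = 5.]

    from fractions import Fraction

    xs = [0, 1, 2, 3, 4]
    ys = [1, 2, 4, 8, 16]

    def lagrange_eval(x):
        # Value at x of the unique degree-4 polynomial through (xs[i], ys[i]).
        total = Fraction(0)
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = Fraction(yi)
            for j, xj in enumerate(xs):
                if j != i:
                    term *= Fraction(x - xj, xi - xj)
            total += term
        return total

    print([int(lagrange_eval(n)) for n in range(6)])   # [1, 2, 4, 8, 16, 31]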

  You say that there is a simpler function?  I say you're wrong.  The
function 2^n is really quite a complicated one -- it only looks simpler
to you because you're familiar with it.

  Okay?
  Well, you needn't take the example too seriously in order to get the
message, which is that "Simplicity is in the eye of the beholder".  I
realize that not everybody will agree with this notion, and it can be
made to seem extremely counterintuitive:  check out Hofstadter's
'Goedel, Escher, Bach' for an opposing viewpoint.  Still, I think
our attempts to design learning systems are gradually forcing
us to realize that there is just no such thing as a universal
notion of "simplicity".