[comp.ai.neural-nets] Neuron Digest V5 #45

neuron-request@HPLABS.HP.COM ("Neuron-Digest Moderator Peter Marvit") (11/13/89)

Neuron Digest   Friday, 10 Nov 1989
                Volume 5 : Issue 45

Today's Topics:
                             Re: Step Function
                        KNN (was Re: Step function)
                    Original STEP discussion continued
                  Re: Original STEP discussion continued
                          Re: Step function issue
                  Re: Step Function. Biases are necessary
                  Re: Step Function. Biases are necessary
                             Re: Step Function
                  Re: Step Function. Biases are necessary
                  Re: Step Function. Biases are necessary
                  Re: Step Function. Biases are necessary
                  Re: Step Function. Biases are necessary


Send submissions, questions, address maintenance and requests for old issues to
"neuron-request@hplabs.hp.com" or "{any backbone,uunet}!hplabs!neuron-request"
Use "ftp" to get old issues from hplpm.hpl.hp.com (15.255.176.205).

------------------------------------------------------------

Subject: Re: Step Function
From:    chrisley@kanga.uucp (Ron Chrisley UNTIL 10/3/88)
Organization: Xerox Palo Alto Research Center
Date:    06 Sep 89 04:34:37 +0000 


>  Is it necessary to have a bias in order to be able to learn?
>
>  Not always. [...]


This assumes, as has much of the discussion on this topic, that one is
dealing with deterministic classification problems, that is, ones in which
the proper response to an input is constant.  But many problems, such as
the classification of natural signals like speech, are statistical, in that
at different times the same point will be mapped to different classes.
Then you can at best model the discriminant function for each of the
categories and, given a point, choose the class that is most probable.
This requires more than one instance per input point in order to use the
memorization technique.  So either you abandon memorization (since there
are usually an infinite number of inputs anyway, and you don't know in
advance what an effective discretization of the input space would be) and
therefore use a bias, or, in the cases where memorization is possible, you
will have to look at more than one instance of each input (I'm wondering:
isn't this how KNN or k-means classifiers, or even codebooks, work?).

But what will be the proper number of instances to look at?  That cannot be
determined in advance, so that may be yet another way bias can creep in...
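
[[ Editor's note: a minimal Python sketch of the "more than one instance
of each input" idea above.  For a statistical task, memorization must see
several instances per input point so it can pick the most probable class;
the toy observations below are invented for illustration. ]]

    from collections import Counter, defaultdict

    def fit_majority(observations):
        """Map each repeated input point to the class it was most often
        paired with -- memorization for a statistical task."""
        counts = defaultdict(Counter)
        for x, y in observations:
            counts[x][y] += 1
        return {x: c.most_common(1)[0][0] for x, c in counts.items()}

    # The same input point maps to different classes at different times:
    obs = [("x1", "A"), ("x1", "B"), ("x1", "A"), ("x2", "B")]
    print(fit_majority(obs))   # {'x1': 'A', 'x2': 'B'}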

Ron Chrisley

------------------------------

Subject: KNN (was Re: Step function)
From:    danforth@riacs.edu (Douglas G. Danforth)
Organization: Research Institute for Advanced Computer Science
Date:    06 Sep 89 16:39:40 +0000 

[[ Regarding Ron Chrisley's message ]]

    By using a large (256-bit) but finite discrete representation of, say,
speech, memorization is still possible.  The continuous-to-discrete mapping
does introduce a bias (not too bad if the bit-pattern size is large).
Large bit spaces can be handled by large numbers of hidden nodes which act
as sponges to soak up the information in the input patterns.  The output
pattern (class label) is then the pooled decision of locally activated
hidden units.  A "rule of thumb" one can use is that the number of
activated units should be approximately the square root of the number of
hidden units.  The above comments apply to sparse distributed memories.

    Since Ron raised the issue of K nearest neighbor, I would like to put
forth a challenge to the Net:

POSTULATE: K Nearest Neighbor will beat any Artificial Neural Network as
           a PRACTICAL alternative.

    I stress the word "practical".  There are published results (Xerox
PARC) that have ANNs beating out KNNs in some experiments.  The issue as
I see it: is it worth the effort to perfect ANN learning rules when KNN is
fast, simple, and very good on average?

    Disadvantages of KNN: little biological plausibility, little "insight"
gained by using KNN.

    Note, however, that by using KNN with "train only on error", the size
of the memory can be greatly reduced, and the majority of the stored
samples end up near the boundaries.  With K small but larger than 1, the
effects of "label errors" can be reduced (averaged out).  For these
reasons, I find it difficult to dismiss KNN.
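
[[ Editor's note: a small Python sketch, not from the posting, of KNN with
"train only on error": a sample is stored only when the current memory
misclassifies it, so the memory stays small and concentrates near the
class boundaries.  The two-class toy problem is invented. ]]

    import numpy as np

    def knn_predict(x, mem_x, mem_y, k=3):
        """Majority vote among the k nearest stored samples."""
        d = np.linalg.norm(mem_x - x, axis=1)
        votes = mem_y[np.argsort(d)[:k]]
        return np.bincount(votes).argmax()

    def train_on_error(stream, k=3):
        """Store a sample only if the memory currently gets it wrong."""
        mem_x, mem_y = [], []
        for x, y in stream:
            if not mem_x or knn_predict(x, np.array(mem_x),
                                        np.array(mem_y), k) != y:
                mem_x.append(x)
                mem_y.append(y)
        return np.array(mem_x), np.array(mem_y)

    rng = np.random.default_rng(0)
    xs = rng.normal(0.0, 1.0, (500, 2))
    ys = (xs[:, 0] > 0).astype(int)              # class = side of x0 = 0
    mem_x, mem_y = train_on_error(zip(xs, ys))
    print(len(mem_x), "of 500 samples stored")   # typically a small fraction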

Comments?

Doug Danforth
danforth@riacs.edu

------------------------------

Subject: Original STEP discussion continued
From:    ck@rex.cs.tulane.edu (Cris Koutsougeras)
Organization: Computer Science Dept., Tulane Univ., New Orleans, LA
Date:    08 Sep 89 07:38:10 +0000 

I am tempted to present my thoughts about the issues raised in the "step
function" discussion.

Let me try to walk through the concepts of "correctness" and
"generalization".  Given a training set, suppose that a net learns a
function F1 which "contains" (or agrees with, if you wish) the training set
(TS).  On the basis of the info in TS only, one has to accept F1 as
correct.  The issue of proper generalization has to do with whether the
inference of F1 on the basis of TS is "reasonable".  Reasonable is not
something easy to define.  What it intuitively means, though, is whether F1
reflects a relation characteristic of TS through which TS can be
reconstructed.  Also, of the many such relations characterizing TS (each of
which alone can reconstruct TS), the most "reasonable" one would be (as a
matter of definition) the simplest one.

Let's take an example, because otherwise I sound too arbitrary.  Suppose I
want to teach someone the Greek word "louloudi" and I present as positive
instances some red flowers, and as negatives some trees, some houses, and
some non-flower plants.  Suppose also that none of the negative instances
are red.  Would anyone who thought that louloudi means a red flower, or
something red in general, be considered stupid?  Well, louloudi means
flower, and it is my fault that I chose a TS in which the color plays a
determinant role, because I should expect that the "student" would try to
identify the "simplest" discriminant rule, which in this case happens to be
the color alone.

So I would relate the concept of a learnable function with the TS in use.
I would suggest that a given function is learnable with respect to a given
TS if it is the "simplest" function which is correct under TS and that the
"simplest" correct function is unique.

In the case of Back Prop, as a measure of simplicity we can adopt the
degree of nonlinearity of the function.  Let's see what happens in the BP
operation.  Components (primitive features) of the input which are
irrelevant to the I/O relation will have random values that make no sense
(in other words, they would result in a very complicated description of the
I/O relation), if we assume that we have a good TS.  On the other hand,
those which are relevant will have recurrent properties (or values, to keep
it simple).  The changes to the weights implied by non-relevant components
will be random and in the long run will cancel each other out, simply
causing an oscillation which gets smaller and smaller with the annealing
schedule.  Those which are relevant, though, will cause a steady and
directed change.  That is how BP identifies the relevant features and (easy
to understand) by extension how it synthesizes a description.  If the net
does not have excessive non-linearity, the relevant features (and their
recurrent values) will dominate the changes and the end result.  If it
does, then any small bias in the values of irrelevant features will be
picked up and will remain in the net's internal representations.
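
[[ Editor's note: Cris's point about irrelevant components canceling can
be seen in a tiny Python experiment (invented here): for a linear unit
with zero initial weights, the delta-rule update for weight i is
proportional to x_i * target, which averages to ~1 for a feature that
determines the target and to ~0 for a random, irrelevant one. ]]

    import numpy as np

    rng = np.random.default_rng(0)
    x_rel = rng.choice([-1, 1], 1000)   # relevant feature
    x_irr = rng.choice([-1, 1], 1000)   # irrelevant feature
    t = x_rel                           # target follows the relevant feature
    # Mean weight updates: directed for x_rel, canceling for x_irr.
    print((x_rel * t).mean(), (x_irr * t).mean())   # 1.0 vs ~0.0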

It thus should become apparent that we should be talking about a function
learnable with respect to a TS and that learnability should be connected
with an optimality measure of some sort.

Most of what I expressed here are thoughts that puzzled me when I was
writing my dissertation.  My escape then was to develop another model based
on an effort to alleviate the problem of the fixed non-linearity of
Rumelhart's net.  But I haven't yet addressed the problem of learnability
in the general case as it was put earlier here.  So whoever feels that s/he
wants to contribute seriously to this is welcome to cooperate, and let's
get a paper out.

Cris Koutsougeras


------------------------------

Subject: Re: Original STEP discussion continued
From:    bill@boulder.Colorado.EDU
Organization: University of Colorado, Boulder
Date:    08 Sep 89 20:56:45 +0000 

[From Cris Koutsougeras:]
>  I would suggest that a given function is learnable with respect to a given
>  TS if it is the "simplest" function which is correct under TS and that the
>  "simplest" correct function is unique.

  Ah yes, Occam's razor slashes again!  But the whole point, the very crux
of the problem, is that "simplest" is not a well-defined concept.  I will
illustrate.

  There is an old mathematician's joke-paradox-problem, which goes as
follows.  The first five elements of a sequence are 1,2,4,8,16 -- what
is the next element?

  Answer: 31.

  Why?  Well, the way to extend a sequence is to find the simplest function
that is consistent with the numbers you have.  But the simplest sorts of
functions, as every mathematician knows, are polynomials; and the
polynomial (1/24)n^4 - (1/12)n^3 + (11/24)n^2 + (7/12)n + 1, for n =
0,1,2,3,4, is the lowest-order one that gives the members of the sequence.
So to get the next element, just plug in n = 5; if you do, you get 31.
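
[[ Editor's note: the claim is easy to check.  A short Python
verification, using exact rational arithmetic: ]]

    from fractions import Fraction

    # p(n) = (1/24)n^4 - (1/12)n^3 + (11/24)n^2 + (7/12)n + 1
    coeffs = [Fraction(1, 24), Fraction(-1, 12), Fraction(11, 24),
              Fraction(7, 12), Fraction(1)]

    def p(n):
        return sum(c * n ** e for c, e in zip(coeffs, [4, 3, 2, 1, 0]))

    print([int(p(n)) for n in range(6)])   # [1, 2, 4, 8, 16, 31]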

  You say that there is a simpler function?  I say you're wrong.  The
function 2^n is really quite a complicated one -- it only looks simpler to
you because you're familiar with it.

  Okay?

  Well, you needn't take the example too seriously in order to get the
message, which is that "Simplicity is in the eye of the beholder".  I
realize that not everybody will agree with this notion, and it can be made
to seem extremely counterintuitive: check out Hofstadter's 'Goedel, Escher,
Bach' for an opposing viewpoint.  Still, I think our attempts to design
learning systems are gradually forcing us to realize that there is just no
such thing as a universal notion of "simplicity".

------------------------------

Subject: Re: Step function issue
From:    ck@rex.cs.tulane.edu (Cris Koutsougeras)
Organization: Computer Science Dept., Tulane Univ., New Orleans, LA
Date:    09 Sep 89 09:44:07 +0000 

[Bill Skaggs writes : ]
>  Ah yes, Occam's razor slashes again!  But the whole point, the very crux
>of the problem, is that "simplest" is not a well-defined concept.......
>  You say that there is a simpler function?  I say you're wrong........

We are in no disagreement. And I am aware of the nature of the problem.
That is why I used quotes when talking about undefined concepts. Also, I
gave the example of Rumelhart's net to point out that for specific domains
one could possibly define (and I clarified "arbitrarily") a measure for
such concepts. I am copying a part:

  So I would relate the concept of a learnable function with the TS in use.
  I would suggest that a given function is learnable with respect to a given
  TS if it is the "simplest" function which is correct under TS and that the
  "simplest" correct function is unique.

Besides, my concern was merely to communicate an idea, or an intuition if
you wish, and not to be rigorous. I thought that these postings are the
appropriate place for the exchange of ideas, even if not yet crystallized.
Otherwise I would write a paper. Frankly, I think that you did not read
carefully; otherwise you wouldn't feel any need for razor comments.

Cris Koutsougeras

------------------------------

Subject: Re: Step Function. Biases are necessary
From:    apr@cbnewsl.ATT.COM (anthony.p.russo)
Organization: AT&T Bell Laboratories
Date:    11 Sep 89 12:22:46 +0000 

Summary: Biases are necessary, I'M CONVINCED

I am absolutely convinced that bias is necessary for generalization.  When
any machine is presented with an incomplete set of examples of a function
and asked to generalize, that machine must choose among all the possible
functions that are consistent with the examples.  Its basis for choosing is
*DEFINED* as its bias. Without bias, the choice would be rather random, and
generalization would be impossible.  Therefore, if we are to define
learnability, it must be with respect to a bias or set of biases.

Now, a bias can be any definable criterion (this may or may not exclude
"simplicity" as a bias). It can be in the form of hardwiring (net topology)
or previously learned information (weights).  This supports someone's
comment that learnability should depend on network architecture.

The question arises: which is more important, learning biases or functions?
Well, since generalization is not possible without biases, functions cannot
be learned without them (only memorized). So, if you want a machine to
really learn a function (generalize on it), biases are more important.

Ron Chrisley writes:

> [...] I do not see how the fact that
> generalization = bias implies the optimality of learning the boundary
> conditions, and would be very interested in having you elaborate on why you
> think it might.


My reply to this is to give a simplified, one-dimensional case.

A boundary is most efficiently (read: learning will be faster) defined by
its location in n-dimensional space. Since neural nets don't learn this
way, the next most efficient definition of a boundary is obtained by giving
examples of two items very (infinitesimally) close to the boundary but on
different sides of it.  In this way, in 1-D space for example, two points
can define a boundary.  Those two points or examples are the most important
ones to present to the net.

If, for instance, we wanted to teach the concept of negative and positive
(zero is the boundary), -1 and +1 (in integer space) would be a sufficient
set of examples (given, of course, some definition of bias).  By contrast,
examples like -102312341 and +823456 are not very helpful.
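
[[ Editor's note: Tony's 1-D case can be made concrete.  In this invented
Python sketch, the set of thresholds consistent with a training set for
sign classification is just the gap between the largest negative example
and the smallest positive one; boundary examples pin it down, distant
examples barely constrain it. ]]

    def consistent_thresholds(examples):
        """Interval of thresholds t for which sign(x - t) labels every
        (x, label) pair correctly (labels are -1 or +1)."""
        lo = max(x for x, y in examples if y == -1)  # t must exceed all negatives
        hi = min(x for x, y in examples if y == +1)  # ...and undercut all positives
        return lo, hi

    print(consistent_thresholds([(-1, -1), (+1, +1)]))
    # (-1, 1): the boundary is pinned near zero
    print(consistent_thresholds([(-102312341, -1), (+823456, +1)]))
    # (-102312341, 823456): almost unconstrained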

 ~ tony ~

        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        ~        Tony Russo             " Surrender to the void."       ~
        ~  AT&T Bell Laboratories                                       ~
        ~   apr@cbnewsl.ATT.COM                                         ~
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

------------------------------

Subject: Re: Step Function. Biases are necessary
From:    chrisley@kanga.uucp (Ron Chrisley UNTIL 10/3/88)
Organization: Xerox Palo Alto Research Center
Date:    12 Sep 89 06:59:09 +0000 

I wrote:

> [...] I do not see how the fact that
> generalization = bias implies the optimality of learning the boundary
> conditions, and would be very interested in having you elaborate on why you
> think it might.


Then, Tony Russo said:

"My reply to this is to give a simplified, one-dimensional case... [[...]]"

I claim that although there might be algorithms that learn generalization
biases for which the boundary cases provide quickest learning, there are
also algorithms for which this is not the case.  For instance, some
algorithms may learn biases better if you provide exemplars.

I know this is exactly what you are claiming not to be the case, but I
don't yet see an argument.  What is the difference between -1:1 and
-100000:100000?  If there is a difference in the quality of bias learning,
I am sure that it depends on some assumptions concerning the bias
learning algorithm you have in mind, or concerning the nature of the data.

The "boundary is best" does not seem to be true for arbitrary learning
algorithms, especially for particular generalization tasks.  Consider a 1D
task, where everything within distance D of the origin is in cat 1, and all
points outside of this region are in cat 2.  Now consider the following way
of learning bias: Start with the bias that after seeing n samples, you will
categorize everything within radius r of any of the samples as the class of
those samples, r being small.  Then, r is increased in a least squares way,
until generalization error is minimized.  Clearly, it would be best to use
samples near the origin to train this task/bias learning algorithm
combination.  If samples near the boundaries are used, then there will only
be small error in estimated generalization, resulting in small changes to
r, which would converge to the following classification: cat 1 if the
sample is within epsilon (the small value of r) of +D or -D.  But if
samples from the interiors of the classes are used, estimated
generalization error will better match actual error, which will be
initially high, resulting in an increase of r.  Thus we will wind up with
the following classification: cat 1 if the sample is within D of 0.
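
[[ Editor's note: a rough Python sketch of the radius-growing bias learner
Ron describes.  The uniform test grid used to estimate generalization
error, and the default class for points covered by no sample, are
assumptions of this sketch, not part of the posting. ]]

    import numpy as np

    D = 1.0

    def label(x):                 # the task: cat 1 within distance D of 0
        return 1 if abs(x) <= D else 2

    def predict(x, train_x, train_y, r):
        """Class of the nearest stored sample if it lies within radius r;
        uncovered points default to cat 2 (an assumption)."""
        d = np.abs(train_x - x)
        i = int(np.argmin(d))
        return train_y[i] if d[i] <= r else 2

    def best_r(train_x, test_x):
        """Sweep r and keep the value minimizing estimated error."""
        train_y = np.array([label(x) for x in train_x])

        def err(r):
            return np.mean([predict(x, train_x, train_y, r) != label(x)
                            for x in test_x])

        return min(np.linspace(0.01, 3.0, 300), key=err)

    test_x = np.linspace(-3.0, 3.0, 601)
    print(best_r(np.array([-0.1, 0.1]), test_x))
    # With interior samples, r grows until the covered region matches cat 1.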

Don't get me wrong, I do think that learning near the boundaries, a la
LVQ2, is a good idea.  But I don't think it is a good idea for all tasks; I
am not convinced that it is a good way to learn 2nd-order *biases* (as
opposed to 1st-order distributions); and even if it is good for that, I
question whether it has anything to do with the fact that generalization =
bias, as opposed to the Bayesian arguments Prof. Kohonen gives.  If it were
true for Bayesian reasons, you would probably also be assuming that the
bias learning is performed after you already have a relatively good
solution to the problem.

The reason it was not a good idea in the example I gave is that the bias
learning algorithm needs information about the entire distribution.
Looking only at boundaries throws that away.

But of course, I may be off track here.  You certainly seem to hold the
gen=bias => boundary cases implication in high regard.  Please explain if I
have misunderstood.

Ron

By the way, has anybody looked at 2nd order bias learning as I have
sketched it out here?  Thanks to Tony for pointing me in the right
direction...

------------------------------

Subject: Re: Step Function
From:    Arun Jagota <sunybcs!sybil!jagota@RUTGERS.EDU>
Organization: SUNY @ Buffalo
Date:    12 Sep 89 14:40:04 +0000 

[[ Regarding Bill Skaggs' original comment ]]

>  Is it necessary to have a bias in order to be able to learn?  Not always.
[[ ... ]]

Even then, it might be *useful* to consider the complexity of the
representation (the functional form) of the learned function that the
device forms.  A device will exhibit a good ability to learn functional
forms if it forms a *good* representation.

Since what constitutes the *best* representation (minimum number of
parameters, length of the representation in bits, etc.) is a subjective
matter, this amounts to needing a bias.  Moreover, an unbiased
representation will be a table look-up, which has no generalisation
capability (whereas a *better* representation should have some
generalisation capability).
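
[[ Editor's note: a two-line Python contrast (invented) of the unbiased
table look-up versus a biased parametric rule: ]]

    # A table look-up memorizes its training pairs but cannot generalize:
    table = {0: "even", 1: "odd", 2: "even"}
    print(table.get(3))                  # None -- no answer for unseen input

    # A parametric rule (a bias toward one functional form) generalizes:
    print("odd" if 3 % 2 else "even")    # 'odd'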

CSNET  :  jagota@cs.buffalo.edu
BITNET :  jagota@sunybcs

------------------------------

Subject: Re: Step Function. Biases are necessary
From:    apr@cbnewsl.ATT.COM (anthony.p.russo)
Organization: AT&T Bell Laboratories
Date:    13 Sep 89 11:55:11 +0000 

[[ Ron Chrisley wrote about qualities of learning algorithms and then about
learning bias. ]]

I think since the task is with respect to the origin, the bias should be
also.  Then only the distance D would need to be learned, and all the
information about D would be included in the boundary of radius D.

For instance, when I talk of learning boundaries, my bias must be that
everything in between those boundaries is of the same class.

> Then, r is increased in a least squares way ...
> [ sound argument deleted ]
> The reason why it was not a good idea in the example I gave was because that
> bias learning alg needs information about the entire distribution.  Only
> looking at boundaries throws that away.

Bayesian classifiers are really boundary sets.  The boundaries are
*calculated* from a priori knowledge of the distributions, but once a
boundary is calculated, the information about the entire distribution *is*
thrown away.  By teaching the machine those boundaries we have done the
same thing.
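
[[ Editor's note: a tiny Python illustration of this point.  For two 1-D
Gaussian classes with equal priors and equal variances, the Bayes rule
reduces to a single threshold midway between the means; once it is
computed, the distributions themselves can be discarded.  The numbers are
invented: ]]

    # Two 1-D Gaussian classes, equal priors and variances:
    mu1, mu2 = -2.0, 2.0
    threshold = (mu1 + mu2) / 2.0         # derived from the distributions...

    def classify(x):
        return 1 if x < threshold else 2  # ...but only the boundary is kept

    print(classify(-0.5), classify(0.5))  # 1 2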

I believe a couple of points have been brought out in our discussion over the
past few weeks. In my *opinion*,
1) learning and memorization are two very different things.
2) learning implies generalization and rule-extraction. Memorization does not.
3) Biases of some sort are required to learn anything.
4) Learning is fastest with borderline patterns that require the machine
to differentiate subtle differences between classes. But it also seems reasonable
that strikingly different examples also play an important role in learning.
5) Learnability should be defined in terms of a particular set of biases,
perhaps dependent on network architecture. (e.g. some things are just not
learnable by a particular network or machine)

Not bad.

> By the way, has anybody looked at 2nd order bias learning as I have
> sketched it out here?  Thanks to Tony for pointing me in the right
> direction...

You're welcome. It's a lot of fun. I just have this vision of a bunch of
researchers quietly reading these messages and jotting down notes for
future work and papers. More people should join the discussion; none
of the five points above is proven.

 ~ tony ~

------------------------------

Subject: Re: Step Function. Biases are necessary
From:    danforth@riacs.edu (Douglas G. Danforth)
Organization: Research Institute for Advanced Computer Science
Date:    13 Sep 89 16:39:54 +0000 

[[ Tony Russo writes 5 points ]]

In regard to points (1) and (2).

     In a standard random-access memory, where all possible addresses can
be represented (a 24-bit address => 16 MB), there is no generalization.
Each slot is filled independently of every other slot.  However, when
dealing with large numbers of bits, say 1,000, it is not possible to
represent all possible addresses, and yet a "memory" can be constructed for
this case.  The memory is sparse in the address space.  Only a sampling of
the possible memory addresses can be present.  These "hard locations" can
act as repositories for information written into the memory by distributing
the information among a set of hard locations which are "near" the desired
(but not physically present) address.  One can read from an arbitrary
address by "pooling" the information stored in the "nearby" hard locations
and then thresholding the result.
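
[[ Editor's note: a compact Python sketch of the sparse distributed memory
Doug describes (after Kanerva).  The sizes, the Hamming radius, and the
single-pattern demo are invented for illustration. ]]

    import numpy as np

    rng = np.random.default_rng(0)
    N, M, R = 256, 2000, 113     # word bits, hard locations, Hamming radius

    hard = rng.integers(0, 2, size=(M, N))    # fixed random hard locations
    counters = np.zeros((M, N), dtype=int)    # one counter vector per location

    def nearby(addr):
        """Hard locations within Hamming distance R of the address."""
        return np.count_nonzero(hard != addr, axis=1) <= R

    def write(addr, word):
        """Distribute the word into all nearby hard locations
        (+1 for 1-bits, -1 for 0-bits)."""
        counters[nearby(addr)] += np.where(word == 1, 1, -1)

    def read(addr):
        """Pool the counters of nearby locations and threshold at zero."""
        return (counters[nearby(addr)].sum(axis=0) > 0).astype(int)

    # Store one pattern autoassociatively, then recall it from a noisy cue.
    pattern = rng.integers(0, 2, N)
    write(pattern, pattern)
    cue = pattern.copy()
    cue[rng.choice(N, 20, replace=False)] ^= 1   # flip 20 of the 256 bits
    print(np.count_nonzero(read(cue) != pattern), "bits wrong after recall")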

     The reason for this preamble is to show that reading from (presenting
an input pattern to) a sparse distributed memory (a neural net) can indeed
produce output which is a "generalization".  The generalization can take
the form of: (A) what's the most similar thing to this pattern that I have
seen before, or (B) what is the Platonic ideal of this fuzzy pattern?

     When dealing with very high-dimensional spaces it becomes difficult
to dismiss the generalization characteristics of a sparse distributed
memory, for they begin having animal-like capabilities.  Most neural net
research to date has focused on very small numbers of nodes: input, hidden,
and output.  For these small cases, I agree, the utility of memory may not
seem great.

     By "rule extraction" I assume you mean in analogy to human concious
throught where one can articulate the "rule" that one has discovered.  This
is an ongoing area of debate.  Is it necessary to "interpret" the
connection weights or just evaluate the performance of the system?  IMHO,
that depends upon your goals.


Doug Danforth
danforth@riacs.edu

------------------------------

Subject: Re: Step Function. Biases are necessary
From:    bill@boulder.Colorado.EDU
Organization: University of Colorado, Boulder
Date:    13 Sep 89 17:52:30 +0000 


  Well, the discussions on this topic are getting long-winded, with lots of
nested quotations and such, so it's probably pretty much exhausted its
vitality -- but I can't resist taking one more fling, and then I shall
remain resolutely silent.

>1) learning and memorization are two very different things.

    I bet there isn't a single psychologist in the whole world who
    doesn't think that memorization is a kind of learning.

>2) learning implies generalization and rule-extraction. Memorization does not.

    "Learning", as the word is usually used, is a more-or-less enduring
  change in behavior, caused by experience.  It implies generalization only
  in the sense that no two stimuli are ever exactly the same.  (As
  Heraclitus put it: you can't step twice into the same river.  The second
  time, it's a different river, and you're a different person.)  Whether it
  implies rule-extraction depends on what you mean by a "rule".  If a rule
  is simply an association between some inputs and outputs, then you're
  right; if it is more than that, you're not.

>3) Biases of some sort are required to learn anything.

    If I understand this statement, what it means is: In order to be able
  to generalize, a device must be capable of inferring responses to inputs
  it has not experienced, _and_there_is_no_uniquely_correct_
  way_of_doing_that_.

>4) Learning is fastest with borderline patterns that require the machine
>to differentiate subtle differences between classes. But it also seems reasonable
>that strikingly different examples also play an important role in learning.

    Learning (of a categorization task) is usually _fastest_, at least in
  the early stages, with inputs that are "typical" of their categories: if
  you want to teach someone "mammal", you start with a mouse or a dog, not
  a dolphin or bat.  Borderline inputs are useful later on, after the
  categories have been roughly sketched out, because they give precise
  information about where the borders are.
   
>5) Learnability should be defined in terms of a particular set of biases,
>perhaps dependent on network architecture. (e.g. some things are just not
>learnable by a particular network or machine)

    This point seems completely correct to me, and it is the most important
  point of the whole discussion.


------------------------------

Subject: Re: Step Function. Biases are necessary
From:    apr@cbnewsl.ATT.COM (anthony.p.russo)
Organization: AT&T Bell Laboratories
Date:    14 Sep 89 11:57:51 +0000 

[[ Responding to Doug Danforth comments on generalization ]]

I don't really want to belabor this discussion much longer, but I don't see
any generalization in your example. What has been generalized? On the
surface, you're saying the fuzzy pattern has been generalized to the
"ideal" of the stored templates, but I think that the memory has simply
calculated a predetermined distance function. In light of the ongoing
discussion, there are two ways to view this:

1) The energy function is the "bias" we've been talking about. But in this
case there is no other learning going on, so the generalization is due
solely to the bias. If you accept this, then every classifier generalizes,
and we need a better definition of generalization.

2) The energy function is not this "bias"; the biases are instead due to
the topology etc. of the network. In this case, the rule for similarity was
"given" to the network, not extracted by it.

It would be nice and clean if the second case were correct, but I wouldn't
bet on it.

> By "rule extraction" I assume you mean in analogy to human concious
> throught where one can articulate the "rule" that one has discovered.
> This is an ongoing area of debate.  Is it necessary to "interpret" the
> connection weights or just evaluate the performance of the system?  IMHO,
> that depends upon your goals.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You're right. 

 ~ tony ~

        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        ~        Tony Russo             " Surrender to the void."       ~
        ~   apr@cbnewsl.ATT.COM                                         ~
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

------------------------------

End of Neurons Digest
*********************