[comp.ai.neural-nets] : Step Function. Biases are necessary

apr@cbnewsl.ATT.COM (anthony.p.russo) (09/11/89)

I am absolutely convinced that bias is necessary for generalization.
When any machine is presented with (an incomplete set of) examples
of a function and asked to generalize, that machine must choose among
all the possible functions that are consistent with the examples.
Its basis for choosing is *DEFINED* as its bias. Without bias, the
choice would be essentially random, and generalization would be impossible.
Therefore, if we are to define learnability, it must be with respect to
a bias or set of biases.

Now, a bias can be any definable criterion (this may or may not exclude
"simplicity" as a bias). It can be in the form of hardwiring (net
topology) or previously learned information (weights).
This supports someone's comment that learnability should be dependent
on network architecture.

The question arises: which is more important, learning biases or functions?
Well, since generalization is not possible without biases, functions
cannot be learned without them (only memorized). So, if you want a machine
to really learn a function (generalize on it), biases are more
important.

Ron Chrisley writes:

> [...] I do not see how the fact that
> generalization = bias implies the optimality of learning the boundary
> conditions, and would be very interested in having you elaborate on why you
> think it might.
> 

My reply to this is to give a simplified, one-dimensional case.

A boundary is most efficiently (read: learning will be faster)
defined by its location in n-dimensional space. Since neural nets
don't learn this way, the next most efficient definition of a
boundary is obtained by giving examples of two items very (infinitesimally)
close to the boundary, but on different sides of it.
In this way, in 1-D space for example, two points can define a boundary.
Those two points or examples are the most important ones to present
to the net. 

If, for instance, we wanted to teach the concept of 
negative and positive (zero is the boundary),
-1 and +1 (in integer space) would be a sufficient set of examples
(given, of course, some definition of bias).
Conversely, examples like -102312341 and +823456 are not very helpful.
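
To make this concrete, here is a rough sketch (my own toy illustration,
nothing a real net computes): suppose the machine's bias is that the concept
is a single threshold t, with "positive iff x >= t". Then the examples only
pin down the interval in which t may lie, and boundary-hugging examples
shrink that interval the most.

def consistent_threshold_interval(examples):
    # examples: list of (x, label) pairs, label '+' or '-'.
    # Under the rule "positive iff x >= t", any consistent t satisfies
    # lo < t <= hi, where lo and hi come from the closest examples.
    lo = max((x for x, lab in examples if lab == '-'), default=float('-inf'))
    hi = min((x for x, lab in examples if lab == '+'), default=float('inf'))
    return lo, hi    # the width hi - lo measures how underdetermined t is

print(consistent_threshold_interval([(-1, '-'), (+1, '+')]))               # tight
print(consistent_threshold_interval([(-102312341, '-'), (823456, '+')]))   # huge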


 ~ tony ~

	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	~  	 Tony Russo		" Surrender to the void."	~
	~  AT&T Bell Laboratories					~
	~   apr@cbnewsl.ATT.COM						~
	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

chrisley@kanga.uucp (Ron Chrisley UNTIL 10/3/88) (09/12/89)

I wrote:

> [...] I do not see how the fact that
> generalization = bias implies the optimality of learning the boundary
> conditions, and would be very interested in having you elaborate on why you
> think it might.
>

Then, Tony Russo said:

"My reply to this is to give a simplified, one-dimensional case...

A boundary is most efficiently (read: learning will be faster)
defined by its location in n-dimensional space. Since neural nets
don't learn this way, the next most efficient definition of a
boundary is obtained by giving examples of two items very (infinitesimally)
close to the boundary, but on different sides of it.
In this way, in 1-D space for example, two points can define a boundary.
Those two points or examples are the most important ones to present
to the net.

If, for instance, we wanted to teach the concept of
negative and positive (zero is the boundary),
-1 and +1 (in integer space) would be a sufficient set of examples
(given, of course, some definition of bias).
Conversely, examples like -102312341 and +823456 are not very helpful."

I claim that although there might be algorithms that learn generalization
biases for which the boundary cases provide quickest learning, there are
also algorithms for which this is not the case.  For instance, some algorithms
may learn biases better if you provide exemplars.

I know this is exactly what you are claiming to not be the case, but I don't
yet see an argument.  What is the difference between -1:1 and -100000:100000?
If there is a difference in the quality of bias learning, I am sure that it
is dependent on some assumptions concerning the bias learning algorithm you
have in mind, or concerning the nature of the data.

The "boundary is best" does not seem to be true for arbitrary learning
algorithms, especially for particular generalization tasks.  Consider a 1D
task, where everything within distance D of the origin is in cat 1, and
all points outside of this region are in cat 2.  Now consider the following
way of learning bias:  Start with the bias that after seeing n samples, you
will categorize everything within radius r of any of the samples as the class
of those samples, r being small.  Then, r is increased in a least squares way,
until generalization error is minimized.  Clearly, it would be best to use
samples near the origin to train this task/bias learning algorithm combination.
If samples near the boundaries are used, then there will only be small error
in estimated generalization, resulting in small changes to r, which would
converge to the following classification:  cat 1 if the sample is within 
epsilon (the small value of r) of +D or -D.  But if samples from the interiors
of the classes are used, estimated generalization error will better match
actual error, which will be initially high, resulting in an increase of r.
Thus we will wind up with the following classification: cat 1 if the sample is
within D of 0.
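
To make the failure concrete, here is a rough numerical reconstruction of
that example (mine, in Python; the particular numbers and the grid-based
estimate of generalization error are my own assumptions):

import numpy as np

D = 1.0
grid = np.linspace(-3 * D, 3 * D, 6001)          # test points
truth = np.abs(grid) <= D                        # True means cat 1

def gen_error(cat1_samples, r):
    # predicted cat 1 iff within radius r of some cat-1 training sample
    dists = np.min(np.abs(grid[:, None] - cat1_samples[None, :]), axis=1)
    return np.mean((dists <= r) != truth)

boundary = np.array([-(D - 0.01), D - 0.01])     # cat-1 samples hugging the border
interior = np.array([0.0])                       # a cat-1 sample from the interior

radii = np.linspace(0.01, 2 * D, 400)
for name, s in [("boundary samples", boundary), ("interior sample", interior)]:
    errs = [gen_error(s, r) for r in radii]
    best = int(np.argmin(errs))
    print("%s: best r = %.2f, error = %.3f" % (name, radii[best], errs[best]))

The interior sample drives r out to about D with essentially no error, while
the boundary samples never get below roughly a third of the test points
wrong, which is the failure I am describing for this bias-learning rule.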

Don't get me wrong, I do think that learning near the boundaries, a la LVQ2,
is a good idea.  But I don't think it is a good idea for all tasks; I am
not convinced that it is a good way to learn 2nd-order *biases* (as opposed
to 1st order distributions), and even if it is good for that, I question
whether it has anything to do with the fact that generalization = bias, as
opposed to the Bayesian arguments Prof. Kohonen gives.  If it were true for
Bayesian reasons, you would also probably be assuming that the bias learning
is performed after you already have a relatively good solution to the problem.

The reason it was not a good idea in the example I gave is that this
bias-learning algorithm needs information about the entire distribution.  Only
looking at boundaries throws that away.

But of course, I may be off track here.  You certainly seem to hold the
gen=bias => boundary cases implication in high regard.  Please explain if I
have misunderstood.


Ron

By the way, has anybody looked at 2nd order bias learning as I have sketched it
out here?  Thanks to Tony for pointing me in the right direction...

apr@cbnewsl.ATT.COM (anthony.p.russo) (09/13/89)

Ron Chrisley wrote:

> I claim that although there might be algorithms that learn generalization
> biases for which the boundary cases provide quickest learning, there are
> also algorithms for which this is not the case.  For instance, some algorithms
> may learn biases better if you provide exemplars.
> 
> I know this is exactly what you are claiming to not be the case, but I don't
> yet see an argument.  What is the difference between -1:1 and -100000:100000?
> If there is a difference in the quality of bias learning, I am sure that it
> is dependent on some assumptions concerning the bias learning algorithm you
> have in mind, or concerning the nature of the data.
> 
> The "boundary is best" does not seem to be true for arbitrary learning
> algorithms, especially for particular generalization tasks.  Consider a 1D
> task, where everything within distance D of the origin is in cat 1, and
> all points outside of this region are in cat 2.  Now consider the following
> way of learning bias:  Start with the bias that after seeing n samples, you
> will categorize everything within radius r of any of the samples as the class
> of those samples, r being small.  

***
I think since the task is with respect to the origin, the bias should be also.
Then only the distance D would need to be learned, and all the information
about D would be included in the boundary of radius D.
For instance, when I talk of learning boundaries, my bias must be that
everything in between those boundaries is of the same class.
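
A quick sketch of what I mean (mine, purely illustrative): with the bias
"cat 1 iff |x| <= D", the only thing left to learn is the single number D,
and borderline examples pin it down almost exactly.

def estimate_D(samples):
    # samples: list of (x, label) pairs, label 1 (inside) or 2 (outside).
    # One simple way to fit the lone parameter: put D midway between the
    # farthest cat-1 point and the nearest cat-2 point seen so far.
    inner = max(abs(x) for x, lab in samples if lab == 1)
    outer = min(abs(x) for x, lab in samples if lab == 2)
    return (inner + outer) / 2.0

print(estimate_D([(0.99, 1), (-0.98, 1), (1.01, 2), (-1.02, 2)]))  # ~1.00, tight
print(estimate_D([(0.10, 1), (-0.20, 1), (2.50, 2), (-3.00, 2)]))  # 1.35, loose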
***

> Then, r is increased in a least squares way,
> until generalization error is minimized.  Clearly, it would be best to use
> samples near the origin to train this task/bias learning algorithm combination

> [ sound argument deleted ]

> Don't get me wrong, I do think that learning near the boundaries, a la LVQ2,
> is a good idea.  But I don't think it is a good idea for all tasks; I am
> not convinced that it is a good way to learn 2nd-order *biases* (as opposed
> to 1st order distributions), and even if it is good for that, I question
> whether it has anything to do with the fact that generalization = bias, as
> opposed to the Bayesian arguments Prof. Kohonen gives.  If it were true for
> Bayesian reasons, you would also probably be assuming that the bias learning
> is performed after you already have a relatively good solution to the problem.
> 
> The reason why it was not a good idea in the example I gave was because that
> bias learning alg needs information about the entire distribution.  Only
> looking at boundaries throws that away.

***
Bayesian classifiers are really boundary sets. 
The boundaries are *calculated* from a priori knowledge of the distributions, but
once a boundary is calculated, the information about the entire distribution
*is* thrown away.
By teaching the machine those boundaries we have done the same thing.
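
Here is a toy version of what I mean (my own sketch; the two Gaussians and
the equal priors are assumptions chosen only for illustration). For two 1-D
Gaussian classes with equal priors and equal variances, the Bayes rule
collapses to a single threshold halfway between the means; once that
threshold is in hand, the distributions themselves are never consulted again.

import math

mu1, mu2, sigma = -2.0, 3.0, 1.0                 # assumed class distributions

def decide_from_distributions(x):
    # compare class-conditional likelihoods (equal priors assumed)
    p1 = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
    p2 = math.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
    return 1 if p1 > p2 else 2

threshold = (mu1 + mu2) / 2.0                    # the "boundary set": one number

def decide_from_boundary(x):
    return 1 if x < threshold else 2

# The two rules agree everywhere, yet the second has thrown the
# distributions away and kept only the boundary.
assert all(decide_from_distributions(x) == decide_from_boundary(x)
           for x in [-5.0, -1.0, 0.4, 0.6, 2.0, 7.0])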
***

> 
> But of course, I may be off track here.  You certainly seem to hold the
> gen=bias => boundary cases implication in high regard.  Please explain if I
> have misunderstood.
> 
***
I believe a couple of points have been brought out in our discussion over the
past few weeks. In my *opinion*,
1) learning and memorization are two very different things.
2) learning implies generalization and rule-extraction. Memorization does not.
3) Biases of some sort are required to learn anything.
4) Learning is fastest with borderline patterns that require the machine
to differentiate subtle differences in classes. But, it also seems reasonable
that strikingly different examples also play an important role in learning.
5) Learnability should be defined in terms of a particular set of biases,
perhaps dependent on network architecture. (e.g. some things are just not
learnable by a particular network or machine)

Not bad.
***

> By the way, has anybody looked at 2nd order bias learning as I have sketched it
> out here?  Thanks to Tony for pointing me in the right direction...

***
You're welcome. It's a lot of fun. I just have this vision of a bunch of
researchers quietly reading these messages and jotting down notes for
future work and papers. More people should join the discussion; none
of the five points above are proven.
***

 ~ tony ~

danforth@riacs.edu (Douglas G. Danforth) (09/13/89)

Tony Russo writes:

>I believe a couple of points have been brought out in our discussion over the
>past few weeks. In my *opinion*,
>1) learning and memorization are two very different things.
>2) learning implies generalization and rule-extraction. Memorization does not.
>3) Biases of some sort are required to learn anything.
>4) Learning is fastest with borderline patterns that require the machine
>to differentiate subtle differences in classes. But, it also seems reasonable
>that strikingly different examples also play an important role in learning.
>5) Learnability should be defined in terms of a particular set of biases,
>perhaps dependent on network architecture. (e.g. some things are just not
>learnable by a particular network or machine)


In regard to points (1) and (2).

     In a standard random access memory where all possible addresses
can be represented (a 24-bit address => 16 MB) there is no generalization.  Each
slot is filled independently of every other slot.  However, when dealing
with large numbers of bits, say 1,000, it is not possible to represent all possible
addresses and yet a "memory" can be constructed for this case.  The memory
is sparse in the address space.  Only a sampling of the possible memory addresses
can be present.  These "hard locations" can act as repositories for information
written into the memory by distributing the information among a set of hard
locations which are "near" the desired  (but not physically present) address.
One can read from an arbitrary address by "pooling" the information stored
in the "nearby" hard locations and then thresholding the result.

     The reason for this  preamble is to show that reading from (presenting
an input pattern to) a sparse distributed memory (a neural net) can indeed
produce output which is a "generalization".  The generalization can take
the form of: (A) what's the most similar thing to this pattern that I have
seen before, or (B) what is the Platonic ideal of this fuzzy pattern?

     When dealing with very large dimensional spaces it becomes difficult to
dismiss the generalization characteristics of a sparse distributed memory
for they begin having animal-like capabilities.  Most neural net research
to date has focused on very small numbers of nodes: input, hidden, and output.
For these small cases, I agree, the utility of memory may not seem great.

     By "rule extraction" I assume you mean in analogy to human concious throught
where one can articulate the "rule" that one has discovered.  This is an ongoing
area of debate.  Is it necessary to "interpret" the connection weights or just
evaluate the performance of the system?   IMHO, that depends upon your goals.


Doug Danforth
danforth@riacs.edu

bill@boulder.Colorado.EDU (09/14/89)

  Well, the discussions on this topic are getting long-winded,
with lots of nested quotations and such, so it's probably
pretty much exhausted its vitality -- but I can't resist taking
one more fling, and then I shall remain resolutely silent.

>I believe a couple of points have been brought out in our discussion over the
>past few weeks. In my *opinion*,
>1) learning and memorization are two very different things.

    I bet there isn't a single psychologist in the whole world who doesn't
  think that memorization is a kind of learning.

>2) learning implies generalization and rule-extraction. Memorization does not.

    "Learning", as the word is usually used, is a more-or-less enduring
  change in behavior, caused by experience.  It implies generalization
  only in the sense that no two stimuli are ever exactly the same.  (As
  Heraclitus put it:  you can't step twice into the same river.  The
  second time, it's a different river, and you're a different person.)
  Whether it implies rule-extraction depends on what you mean by a "rule".
  If a rule is simply an association between some inputs and outputs, then
  you're right; if it is more than that, you're not.

>3) Biases of some sort are required to learn anything.

    If I understand this statement, what it means is:  In order to be
  able to generalize, a device must be capable of inferring responses
  to inputs it has not experienced, _and_there_is_no_uniquely_correct_
  way_of_doing_that_.

>4) Learning is fastest with borderline patterns that require the machine
>to differentiate subtle differences in classes. But, it also seems reasonable
>that strikingly different examples also play an important role in learning.

    Learning (of a categorization task) is usually _fastest_, at least in 
  the early stages, with inputs that are "typical" of their categories:
  if you want to teach someone "mammal", you start with a mouse or a
  dog, not a dolphin or bat.  Borderline inputs are useful later on,
  after the categories have been roughly sketched out, because they give
  precise information about where the borders are.
   
>5) Learnability should be defined in terms of a particular set of biases,
>perhaps dependent on network architecture. (e.g. some things are just not
>learnable by a particular network or machine)

    This point seems completely correct to me, and it is the most 
  important point of the whole discussion.

>Not bad.

apr@cbnewsl.ATT.COM (anthony.p.russo) (09/14/89)

Doug Danforth wrote:
> 
>      The reason for this  preamble is to show that reading from (presenting
> an input pattern to) a sparse distributed memory (a neural net) can indeed
> produce output which is a "generalization".  The generalization can take
> the form of: (A) what's the most similar thing to this pattern that I have
> seen before, or (B) what is the Platonic ideal of this fuzzy pattern?
> 
***

I don't really want to belabor this discussion much longer, but I don't
see any generalization in your example. What has been generalized? On the
surface, you're saying the fuzzy pattern has been generalized to the
"ideal" of the stored templates, but I think that the memory has simply
calculated a predetermined distance function. In light of the ongoing
discussion, there are two ways to view this:

1) the energy function is the "bias" we've been talking about. But in this case
there is no other learning going on, so the generalization is due solely
to the bias. If you accept this, then every classifier generalizes, and
we need a better definition of generalization.

2) The energy function is not this "bias"; the biases are instead due
to the topology etc. of the network. In this case, the rule for similarity
was "given" to the network, not extracted by it.

It would be nice and clean if the second case were correct, but I 
wouldn't bet on it.
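
To spell out what I mean by "a predetermined distance function" above, here
is a toy version (mine, purely illustrative): classification by Hamming
distance to fixed stored templates. Nothing is learned; the apparent
generalization is entirely the built-in distance measure.

import numpy as np

templates = {"A": np.array([1, 1, 0, 0, 1, 1, 0, 0]),
             "B": np.array([0, 0, 1, 1, 0, 0, 1, 1])}

def classify(pattern):
    # pick the stored template with the smallest Hamming distance
    return min(templates, key=lambda name: int(np.sum(templates[name] != pattern)))

noisy_A = np.array([1, 1, 0, 1, 1, 1, 0, 0])     # "A" with one bit flipped
print(classify(noisy_A))                         # -> A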

> By "rule extraction" I assume you mean in analogy to human concious throught
> where one can articulate the "rule" that one has discovered.  This is an ongoing
> area of debate.  Is it necessary to "interpret" the connection weights or just
> evaluate the performance of the system?   IMHO, that depends upon your goals.
                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You're right. 

 ~ tony ~

	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	~  	 Tony Russo		" Surrender to the void."	~
	~   apr@cbnewsl.ATT.COM						~
	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~