[comp.ai.neural-nets] : Step Function

apr@cbnewsl.ATT.COM (anthony.p.russo) (08/24/89)

In article <6980@sdcsvax.UCSD.Edu>, heirich@beowulf.ucsd.edu (Alan Heirich) writes:
> [...] I have not seen any literature
> proving bounds on learnability.  This is obviously a very important area,
> but if people are still quibbling about the prevalence of local minima in
> error landscapes, I can't imagine that anyone has proven such a result about
> which functions are learnable!
> 

Some things have been proven not to be learnable, e.g. EX-OR.
On the other hand, I'm sure no one has proven whether step-function
scaling is learnable or not, but intuitively it seems that it is.
Perhaps not by backprop ...

There is, happily, a small body of research in this area, but not nearly
enough. The best reference I can give is:

L.G. Valiant, "A Theory of the Learnable," Communications of the ACM,
vol. 27, no. 11, 1984.

	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	~  	 Tony Russo		" Surrender to the void."	~
	~  AT&T Bell Laboratories					~
	~   apr@cbnewsl.ATT.COM						~
	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

vaidyana@cb.ecn.purdue.edu (Ramaswamy Vaidyanathan) (08/24/89)

In a recent article, Heirich writes:

> This statement makes me uncomfortable.  It has been proven (White) that a
> two-layer feed forward network with a sigmoid activation function (such as
> you find in standard back propagation networks) can approximate any Borel [...]
Could you please post where this paper by White was published?
I have not seen it, and it could be useful in my work.
Thanks in advance.
Vaidya.
 -------------------------

demers@beowulf.ucsd.edu (David E Demers) (08/25/89)

In article <1602@cbnewsl.ATT.COM> apr@cbnewsl.ATT.COM (anthony.p.russo) writes:
>In article <6980@sdcsvax.UCSD.Edu>, heirich@beowulf.ucsd.edu (Alan Heirich) writes:
>> [...] I have not seen any literature
>> proving bounds on learnability.  This is obviously a very important area,
				^  I think Alan may mean "learnability
of the step function" here.

>> but if people are still quibbling about the prevalence of local minima in
>> error landscapes, I can't imagine that anyone has proven such a result about
>> which functions are learnable!

>Some things have been proven not to be learnable, e.g. EX-OR.

Surely you jest.  XOR is not learnable by a neural net with
no hidden layers, but is certainly a learnable function.

>On the other hand, I'm sure no one has proven whether step-function
>scaling is learnable or not, but intuitively it seems that it is.
>[...] the best reference I can give is:

>L.G. Valiant, "A theory of the Learnable," Communications of the ACM,
>vol. 27, no.11, 1984.

Also interesting are:
Ehrenfeucht, Haussler, Kearns & Valiant, "A General Lower Bound on
the Number of Examples Needed for Learning" appearing in Information
and Computation, 1988 [I just have a preprint...]

and
Haussler, "Quantifying Inductive Bias: AI Learning Algorithms and
Valiant's Learning Framework", Artificial Intelligence 36 (1988).

The proceedings of the 1988 workshop on Computational Learning
Theory are available from Morgan Kaufmann.  One of my summer
projects was to read several of the papers from the workshop...
maybe at Christmas...

But back to the original subject -

I think Alan Heirich is right that the step function may be
APPROXIMATED to arbitrary precision by a net with one hidden
layer and sigmoid transfer functions.  Backpropagation may not
be an adequate learning method; however the cited work by
White (appearing in the current issue of Neural Networks, I
believe { vol 2 no. 3}) shows that the function IS learnable.
This is stronger than showing that a net can approximate it.

So, given the function and given a bound on the error, there
is a net which can approximate the function to that bound, and
an algorithm to learn the mapping.  Our job is to discover them :-)
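
As a concrete aside (a rough sketch in Python, not White's construction),
the approximation half is easy to see for a single sigmoid unit: scale up
the input weight and the unit hugs a hard threshold everywhere except in a
shrinking window around the jump.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def step(x):
    return 1.0 if x >= 0.0 else 0.0

# as the gain w grows, the worst-case error away from the jump at 0 shrinks
xs = [i / 100.0 for i in range(-200, 201)]
for w in (1.0, 10.0, 100.0):
    worst = max(abs(sigmoid(w * x) - step(x)) for x in xs if abs(x) > 0.05)
    print("gain", w, " max error for |x| > 0.05:", round(worst, 4))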

Dave DeMers
demers@cs.ucsd.edu
Dept. of Computer Science  & Engineering
UCSD
La Jolla, CA 92093

apr@cbnewsl.ATT.COM (anthony.p.russo) (08/29/89)

In article <6981@sdcsvax.UCSD.Edu>, demers@beowulf.ucsd.edu (David E Demers) writes:
> 
> >Some things have been proven not to be learnable, e.g. EX-OR.
> 
> Surely you jest.  XOR is not learnable by a neural net with
> no hidden layers, but is certainly a learnable function.
> 

X-OR is not learnable. If you are given the first three entries in the truth
table, you could not possibly generalize to the last entry with
any confidence. That is, of the two possibilities for the last entry,
neither is preferable.

X-OR is *memorizable*, not learnable.

In either case, learnability is not a requirement for neural net applicability.
Nets memorize well.

	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	~  	 Tony Russo		" Surrender to the void."	~
	~   apr@cbnewsl.ATT.COM						~
	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ackley@jeeves.bellcore.com (David H Ackley) (08/30/89)

In article<1667@cbnewsl.ATT.COM> apr@cbnewsl.ATT.COM (anthony.p.russo) writes:
>X-OR is not learnable. If you are given the first three entries in the truth
>table, you could not possibly generalize to the last entry with
>any confidence. 

This renders the term "learnable" meaningless.  Over the space of
boolean functions, your claim is equally true of all of them.  Without
some kind of preference structure or bias over the space, both outputs
are equally likely for the last row, regardless of the rest of the
table.  If you do allow such biases, I can pick one (say, a "parity
bias") that makes XOR "learnable".

Absolutely, generalization is a big important hard problem --- but
it's different than learnability, and it's not why XOR is famous.
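
To make this concrete (a small enumeration in Python, purely illustrative):
with no bias, the three known rows of the XOR table are consistent with two
of the sixteen boolean functions of two inputs, and those two disagree on
the last row; a "parity bias" that prefers functions depending only on the
parity of the input bits singles out XOR.

from itertools import product

inputs = list(product([0, 1], repeat=2))          # (0,0), (0,1), (1,0), (1,1)
train = {(0, 0): 0, (0, 1): 1, (1, 0): 1}         # first three rows of XOR

# every boolean function of two inputs, written as its output column
candidates = [dict(zip(inputs, outs)) for outs in product([0, 1], repeat=4)]
consistent = [f for f in candidates
              if all(f[x] == y for x, y in train.items())]

print(len(consistent), "functions fit the three rows")                  # -> 2
print("their predictions for (1,1):", [f[(1, 1)] for f in consistent])  # -> [0, 1]

# the parity bias: keep only hypotheses that depend on the parity of the bits
parity = [f for f in consistent
          if all(f[x] == (x[0] ^ x[1]) for x in inputs)]
print("with the parity bias, (1,1) ->", parity[0][(1, 1)])              # -> 0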

| David Ackley   Cognitive Science Research Group |
|"To act or      Bell Communications Research Inc.|
|   to react,            ackley@flash.bellcore.com|
|     that is the question"    ...!bellcore!ackley|
				   
				   

apr@cbnewsl.ATT.COM (anthony.p.russo) (08/30/89)

> In article<1667@cbnewsl.ATT.COM> I wrote:
> >X-OR is not learnable. If you are given the first three entries in the truth
> >table, you could not possibly generalize to the last entry with
> >any confidence. 
> 
In article <17522@bellcore.bellcore.com>, ackley@jeeves.bellcore.com (David H Ackley) writes:
> 
> This renders the term "learnable" meaningless.  Over the space of
> boolean functions, your claim is equally true of all of them.  Without
> some kind of preference structure or bias over the space, both outputs
> are equally likely for the last row, regardless of the rest of the
> table.  If you do allow such biases, I can pick one (say, a "parity
> bias") that makes XOR "learnable".
> 
I agree that X-OR is learnable *with* bias. That, however, is not the
definition of "learnable" in the strictest sense. I won't attempt
to prove or disprove your claim regarding all boolean functions,
but why must boolean functions be learnable at all? When you saw
X-OR for the first time, you most likely *memorized* the truth table.
If not, then after *all four* entries, you determined the rule
that if both inputs are equal, output "0"; else output "1".
Could you have determined this rule with just three examples? No.
How would you have chosen a bias? 

Perhaps our definitions of "learnable" are different. Mine is that,
with a fraction of the possible samples, one can generalize to 100%
accuracy. Otherwise, for example, if after 99 of 100 samples 
one cannot correctly predict the last output 
with 100% confidence, nothing has been learned
at all about the function in question. If it does take 100% of the
possibilities to learn something, then I claim it has not been learned,
but rather *memorized*.

	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	~  	 Tony Russo		" Surrender to the void."	~
	~   apr@cbnewsl.ATT.COM	           put disclaimer here.	        ~
	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mathew@jane.Jpl.Nasa.Gov (Mathew Yeates) (08/30/89)

>Perhaps our definitions of "learnable" are different. Mine is that,
>with a fraction of the possible samples, one can generalize to 100%
>accuracy. Otherwise, for example, if after 99 of 100 samples 
>one cannot correctly predict the last output 
>with 100% confidence, nothing has been learned
>at all about the function in question. If it does take 100% of the
>possibilities to learn something, then I claim it has not been learned,
>but rather *memorized*.

Under this definition, no function is learnable. Considering a function as
a set of ordered pairs, seeing some subset of the ordered pairs can
never lead to complete knowledge of the entire set. At least for
a mere mortal. A rigorous definition of learnability from examples can
be found in Valiant's "A Theory of the Learnable".

dg1v+@andrew.cmu.edu (David Greene) (08/31/89)

> Excerpts from netnews.comp.ai.neural-nets: 30-Aug-89 Re: : Step Function
> anthony.p.russo@cbnewsl. (1967)

> Perhaps our definitions of "learnable" are different. Mine is that,
> with a fraction of the possible samples, one can generalize to 100%
> accuracy. Otherwise, for example, if after 99 of 100 samples 
> one cannot correctly predict the last output 
> with 100% confidence, nothing has been learned
> at all about the function in question. If it does take 100% of the
> possibilities to learn something, then I claim it has not been learned,
> but rather *memorized*.


I believe you are confusing 'sample' with 'census'.   If the total
population of examples is 100 (and assuming no noise), then perhaps your
statement would be appropriate.  However, if the 100 represents a 
sample, then the issue of learnability after 99 examples from that
sample depends on the size of the total population and the degree
of confidence you require.  (e.g. Valiant's PAC model of learnability --
Valiant, L. G., "A Theory of the Learnable," Communications of the ACM,
27(11):1134-1142, November 1984)
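
For the flavor of the calculation, here is a Python sketch of the standard
bound for a finite hypothesis class and a consistent learner (the textbook
form, not a line from Valiant's paper, and the numbers are only
illustrative): the sample size you need depends on the size of the
hypothesis space and on the error and confidence levels you demand, not on
seeing the whole population.

import math

def pac_sample_size(h_size, epsilon, delta):
    # examples sufficient for a consistent learner over a finite hypothesis
    # class to have error <= epsilon with probability >= 1 - delta
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# e.g. all 2**16 boolean functions of 4 inputs, 10% error, 95% confidence
print(pac_sample_size(2 ** 16, epsilon=0.10, delta=0.05))   # -> 141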


-David

--------------------------------------------------------------------
 David Perry Greene        ||    ARPA:          dg1v@andrew.cmu.edu
 GSIA /Robotics            ||                   dpg@isl1.ri.cmu.edu
 Carnegie Mellon Univ.     ||    BITNET:  dg1v%andrew@vb.cc.cmu.edu
 Pittsburgh, PA 15213      ||    UUCP: !harvard!andrew.cmu.edu!dg1v
--------------------------------------------------------------------
"You're welcome to use my opinions, just don't get them all wrinkled."

pluto@beowulf.ucsd.edu (Mark E. P. Plutowski) (08/31/89)

In article <1683@cbnewsl.ATT.COM> apr@cbnewsl.ATT.COM (anthony.p.russo) writes:
>Perhaps our definitions of "learnable" are different. Mine is that,
>with a fraction of the possible samples, one can generalize to 100%
>accuracy.

I would suggest you introduce one of two things into your 
definition of "learnability."   Either 1) make it dependent on
a particular architecture, or 2) make it dependent upon a
complexity measure, such as time needed to learn, number of units,
or number of weights.  

It makes sense to discuss learnability when we talk about a set
of concepts (e.g., functions) given a hypothesis space with which
to learn the concepts (e.g., threshold-logic).  Then, it is an
issue whether the concept in question can be represented within
the hypothesis space -- and then, we can discuss whether it can
be learned given a feasible amount of resources.

apr@cbnewsl.ATT.COM (anthony.p.russo) (08/31/89)

> >Perhaps our definitions of "learnable" are different. Mine is that,
> >with a fraction of the possible samples, one can generalize to 100%
> >accuracy. Otherwise, for example, if after 99 of 100 samples [...]
> 
In article <1989Aug30.162345.9569@elroy.jpl.nasa.gov>, mathew@jane.Jpl.Nasa.Gov (Mathew Yeates) writes:

> Under this definition, no function is learnable. Considering a function as
> a set of ordered pairs, seeing some subset of the ordered pairs can
> never lead to complete knowledge of the entire set. At least for
> a mere mortal. A rigorous definition of learnability from examples can
> be found in Valiant's "A Theory of the Learnable".

Yes, I should stick to Valiant's paper. However, one thing: a known function
is, in most cases, more than a set of ordered pairs. It is a set of ordered
pairs related in some way. Discover the relation and no further examples
are needed to have complete knowledge of the entire set. The question of
"learnability" is whether the function, or rule, can possibly be
extracted from a set of samples *with probability of error less than some
value*. This is one mistake I made when I said "100% confidence"--that's
not possible.

However, getting back to X-OR, this probability of error = 0.5, clearly
too high to consider it learnable.

Aside: someone commented that learnability should be defined dependent
on architecture. I don't see why.

	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	~  	 Tony Russo		" Surrender to the void."	~
	~   apr@cbnewsl.ATT.COM		   put disclaimer here		~
	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pluto@beowulf.ucsd.edu (Mark E. P. Plutowski) (09/01/89)

In article <1697@cbnewsl.ATT.COM> apr@cbnewsl.ATT.COM (anthony.p.russo) writes:
>... someone commented that learnability should be defined dependent
>on architecture. I don't see why.

I mentioned this for concreteness' sake.  
More generally, you want to know whether the 
set of hypotheses your learning protocol has
at its disposal can represent the function
you wish to learn to a satisfactory level of
approximation - is the hypothesis space sufficient
to characterize the set of concepts you wish to
learn?  A particular architecture can represent
a subset of some concept space - this is its 
hypothesis space - and it may only be capable
of effectively learning some subset of its
hypothesis space. 

In the case you mention, a single 2-input threshold gate
cannot learn x-or because it cannot represent x-or.
A 2-input threshold gate network with at least 1 hidden
unit may not be able to effectively (feasibly, satisfactorily)
learn x-or, perhaps due to the ineffectiveness of the 
learning law.  
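
To separate representability from learnability concretely, here is a
hand-wired example in Python (the weights are chosen by hand, nothing is
learned): a 2-2-1 network of linear threshold gates does represent x-or,
which no single 2-input threshold gate can.

def gate(weights, bias, inputs):
    # a single linear threshold unit
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

def xor_net(x, y):
    h1 = gate([1, 1], -0.5, [x, y])       # OR
    h2 = gate([-1, -1], 1.5, [x, y])      # NAND
    return gate([1, 1], -1.5, [h1, h2])   # AND of the two -> XOR

for x in (0, 1):
    for y in (0, 1):
        print(x, y, "->", xor_net(x, y))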

With the definition of learning you initially proposed,
we can say very little about learning - except that it
will be impossible for all practical purposes.

ackley@wind.bellcore.com (David H Ackley) (09/01/89)

In article <1697@cbnewsl.ATT.COM> apr@cbnewsl.ATT.COM (anthony.p.russo) writes:
>Yes, I should stick to Valiant's paper. However, one thing: a known function
							       ^^^^^^^^^^^^^^
>is, in most cases, more than a set of ordered pairs. It is a set of ordered
>pairs related in some way. 

By referring to "known functions" you are implicitly incorporating a
bias in your notion of learnability --- i.e., you are preferring
functions that have some (presumably simple) relation between input
and output.  I say "presumably simple" because I imagine you want to
rule out properties like "appearing in the same truth table" as
sufficient grounds for being "related in some way".

Can you make your criteria for "known-ness" explicit?  

| David Ackley   Cognitive Science Research Group |
| "There are     Bell Communications Research Inc.|
|     no facts,          ackley@flash.bellcore.com|
|       only factors"          ...!bellcore!ackley|
				   

apr@cbnewsl.ATT.COM (anthony.p.russo) (09/01/89)

In article <17538@bellcore.bellcore.com>, ackley@wind.bellcore.com (David H Ackley) writes:
> In article <1697@cbnewsl.ATT.COM> apr@cbnewsl.ATT.COM (anthony.p.russo) writes:
> >Yes, I should stick to Valiant's paper. However, one thing: a known function
> 							       ^^^^^^^^^^^^^^
> >is, in most cases, more than a set of ordered pairs. It is a set of ordered
> >pairs related in some way. 
> 
> By referring to "known functions" you are implicitly incorporating a
> bias in your notion of learnability --- i.e., you are preferring
> functions that have some (presumably simple) relation between input
> and output.  I say "presumably simple" because I imagine you want to
> rule out properties like "appearing in the same truth table" as
> sufficient grounds for being "related in some way".
> 
I could have omitted the word "known" and had a perfectly legitimate statement.
That is, ANY deterministic FUNCTION is a set of related ordered pairs,
whether known or not.

> Can you make your criteria for "known-ness" explicit?  

No. You bring up a good point, because your argument is really, "What functions
shall we consider in the hypothesis space?" 

I can't tell you; this appears to be getting very deep. On the surface,
it seems that biases are  !necessary! for learning anything at all.
If so, then the biases are probably hard-wired and not learned, since
they would have to be learned in terms of other biases, etc.

Does this make sense to anyone else, or have I gone off the deep end?

 ~ tony ~

	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	~  	 Tony Russo		" Surrender to the void."	~
	~   apr@cbnewsl.ATT.COM		   put disclaimer here		~
	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pluto@beowulf.ucsd.edu (Mark E. P. Plutowski) (09/02/89)

In article <1727@cbnewsl.ATT.COM> apr@cbnewsl.ATT.COM (anthony.p.russo) 
writes:
>I can't tell you; this appears to be getting very deep. On the surface,
>it seems that biases are  !necessary! for learning anything at all.
>If so, then the biases are probably hard-wired and not learned, since
>they would have to be learned in terms of other biases, etc.
> ~ tony ~

Bias is indeed necessary to learn:  
if there is no bias, then there is no
criterion for selecting one hypothesis over another, and 
the learning algorithm can do no better than random selection,
on average.  

bill@boulder.Colorado.EDU (09/04/89)

  Is it necessary to have a bias in order to be able to learn?

  Not always.  If the training set includes an example of every
possible input, then the device only needs to be able to "memorize"
the correct responses -- it doesn't need a preset bias.

  If, though, there are inputs not included in the training set,
then the device can only produce responses by somehow "generalizing"
from what it has seen, and the rules it uses for generalization
constitute a bias.  (I actually prefer the word "heuristic" to
"bias".  What I am saying here is that any device that learns from
incomplete data must use some sort of heuristic.)

  For devices that learn solely from examples -- which includes all
the neural networks I've seen -- that's all there is to it.  It's
not a deep question at all.

  It may be more interesting to wonder about more "intelligent"
devices, which are capable of learning from commentary as well as
examples.  ("Commentary" is, e.g., your English teacher telling you
that all proper nouns should be capitalized.)  Even this, though,
does not change the general principle.  A comment may be thought of
as a constraint upon the set of possible responses.  If all of the
examples, plus all of the constraints, do not force a unique response
to every possible input, then the device, in order to pick a response,
must use some sort of bias.

apr@cbnewsl.ATT.COM (anthony.p.russo) (09/05/89)

In article <11308@boulder.Colorado.EDU>, bill@boulder.Colorado.EDU writes:
> 
>   Is it necessary to have a bias in order to be able to learn?
> 
>   Not always. [...]
>   
>   If, though, there are inputs not included in the training set,
> then the device can only produce responses by somehow "generalizing"
> from what it has seen, and the rules it uses for generalization
> constitute a bias.  (I actually prefer the word "heuristic" to
> "bias".  What I am saying here is that any device that learns from
> incomplete data must use some sort of heuristic.)
> 
From what I have thought about and read here, it seems that learning
actually constitutes learning *biases* and nothing else (unless it is
memorization).

*IF* this is true (and I'm not sure it is), then for a classifier,
from an information-theoretic point of view, it then makes the most sense
to give the network examples that are on the borderline
of a rule or class. These "borderline" patterns should contain more
information about biases than good examples of a class.

In ways this makes sense -- it is akin to telling the network only where the
boundaries between classes are.

	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	~  	 Tony Russo		" Surrender to the void."	~
	~   apr@cbnewsl.ATT.COM		   put disclaimer here		~
	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pluto@beowulf.ucsd.edu (Mark E. P. Plutowski) (09/06/89)

In article <11308@boulder.Colorado.EDU> bill@synapse.Colorado.EDU (Bill Skaggs) writes:
>
>  Is it necessary to have a bias in order to be able to learn?
>
>  Not always.  If the training set includes an example of every
>possible input, then the device only needs to be able to "memorize"
>the correct responses -- it doesn't need a preset bias.
>

This is a strong form of a learning bias.  
The device assumes that everything
it sees during training is relevant, 
and sufficient.  

>  It may be more interesting to wonder about more "intelligent"
>devices, which are capable of learning from commentary as well as
>examples.  ("Commentary" is, e.g., your English teacher telling you
>that all proper nouns should be capitalized. ... 

You might be interested in some approaches to learning theory
in which the device has access to a teacher that can provide more
than just an error-signal  ... e.g., once a hypothesis is formed by
the device, the teacher, rather than just saying whether or not
the hypothesis is correct for given values of inputs, can provide
counterexamples to the device.  Of course, this requires a teacher
with expert knowledge!  And, the ability to ascertain what 
hypothesis the device currently entertains.  

Applying this idea to neural networks is difficult.  
The question is: how to apportion this type of training
signal to the appropriate units in the network.  And, once
received by the pertinent units, what to make of it.  
Any ideas on how to backpropagate such training information?
"Commentary" feedback can apply to the entire hypothesis formed 
by the device, not just its performance on a particular input.   

chrisley@kanga.uucp (Ron Chrisley UNTIL 10/3/88) (09/06/89)

Tony Russo writes:

"*IF* this is true (and I'm not sure it is), then for a classifier,
from an information-theoretic point of view, it then makes the most sense
to give the network examples that are on the borderline
of a rule or class. These "borderline" patterns should contain more
information about biases than good examples of a class.

In ways this makes sense -- it is akin to telling the network only where the
boundaries between classes are."

This is exactly the learning procedure behind Kohonen's LVQ2 (Learning Vector
Quantization) algorithm.  See his paper in ICNN '88.  He derives the optimality
of this method from Bayesian arguments.  I do not see how the fact that
generalization = bias implies the optimality of learning the boundary
conditions, and would be very interested in having you elaborate on why you
think it might.
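
For readers who have not seen LVQ2, here is a rough Python paraphrase of the
update rule (a simplification of Kohonen's description, so treat the window
test and the constants as approximate rather than authoritative): only
"borderline" inputs, those falling in a window between the two nearest
prototypes of different classes, trigger any learning at all.

import math

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def lvq2_update(codebook, labels, x, x_label, lr=0.05, window=0.3):
    # one LVQ2-style step: find the two nearest prototypes; update only if
    # they belong to different classes, one of them is the correct class,
    # and x lies near the midplane between them
    order = sorted(range(len(codebook)), key=lambda k: dist(x, codebook[k]))
    i, j = order[0], order[1]
    di, dj = dist(x, codebook[i]), dist(x, codebook[j])
    s = (1 - window) / (1 + window)
    in_window = di > 0 and dj > 0 and min(di / dj, dj / di) > s
    if in_window and labels[i] != labels[j] and x_label in (labels[i], labels[j]):
        right, wrong = (i, j) if labels[i] == x_label else (j, i)
        codebook[right] = [m + lr * (xk - m) for m, xk in zip(codebook[right], x)]
        codebook[wrong] = [m - lr * (xk - m) for m, xk in zip(codebook[wrong], x)]

# toy usage: two prototypes on a line, one borderline training point
protos = [[0.0], [1.0]]
labels = ["A", "B"]
lvq2_update(protos, labels, x=[0.45], x_label="A")
print(protos)   # the "A" prototype moves toward 0.45, the "B" one moves away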

Ron Chrisley

bill@boulder.Colorado.EDU (09/06/89)

>  You might be interested in some approaches to learning theory
>  in which the device has access to a teacher that can provide more
>  than just an error-signal  ... e.g., once a hypothesis is formed by
>  the device, the teacher, rather than just saying whether or not
>  the hypothesis is correct for given values of inputs, can provide
>  counterexamples to the device.  Of course, this requires a teacher
>  with expert knowledge!  And, the ability to ascertain what 
>  hypothesis the device currently entertains.  
>
>  Applying this idea to neural networks is difficult.  
>  The question is: how to apportion this type of training
>  signal to the appropriate units in the network.  And, once
>  received by the pertinent units, what to make of it.  
>  Any ideas on how to backpropagate such training information?
>  "Commentary" feedback can apply to the entire hypothesis formed 
>  by the device, not just its performance on a particular input.   

  In principle, if you can come up with the training information,
you can use back-prop (or something very much like it) to apply it.

  Back-prop is essentially gradient descent for an error function.
Usually the error function is either the total squared error in
the output layer for a randomly chosen input (this is "online" back-
prop), or the average total squared error for a set of inputs (which
is "batch-mode" back-prop).  But as far as the mathematics is concerned,
the error function can be anything you please.

  For a neural network, the "hypothesis" it forms is encoded by its
weights, and any sort of "comment" you please can be jammed into the
error function, if only you can formulate it as a numerical measure
calculable from the weights.

  In fact, a few people have already been trying to do that sort of
thing.  As you probably know, a persistent problem with back-prop
is that it tends to give networks that don't generalize very well to
new inputs.  One hope for getting better performance is to augment
the error function with an extra term representing the "complexity"
of the network, since it seems intuitively reasonable that simpler
networks should generalize better.  It isn't obvious how best to
measure complexity:  maybe the sum of the magnitudes of all the weights;
or maybe the number of weights exceeding some fixed threshold; or maybe
something else.  This is work in progress, but there have been some
promising beginnings.  (Perhaps somebody directly involved can comment.)
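
A minimal sketch of what "augmenting the error function with a complexity
term" can look like (the squared-magnitude penalty and the constants below
are just one common choice, not a claim about what anyone in particular is
using): the penalty's gradient simply pulls every weight toward zero on each
back-prop step.

def penalized_error(weights, data_error, decay=0.001):
    # ordinary total squared error plus a "complexity" term; here the sum
    # of squared weight magnitudes, one of several measures mentioned above
    return data_error + decay * sum(w * w for w in weights)

def weight_step(w, data_gradient, lr=0.1, decay=0.001):
    # the usual gradient step, plus the gradient of the penalty term
    # (d/dw of decay * w**2 is 2 * decay * w)
    return w - lr * (data_gradient + 2 * decay * w)

# with no data gradient at all, a weight just decays slowly toward zero
w = 1.0
for _ in range(10):
    w = weight_step(w, data_gradient=0.0)
print(round(w, 4))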

chrisley@kanga.uucp (Ron Chrisley UNTIL 10/3/88) (09/06/89)

In article <11308@boulder.Colorado.EDU> bill@synapse.Colorado.EDU (Bill Skaggs)
writes:
>
>  Is it necessary to have a bias in order to be able to learn?
>
>  Not always.  If the training set includes an example of every
>possible input, then the device only needs to be able to "memorize"
>the correct responses -- it doesn't need a preset bias.
>

This assumes, as has much of the discussion on this topic, that one is dealing
with deterministic classification problems, that is, ones in which the proper
response to an input is constant.  But many problems, such as the
classification of natural signals like speech, are statistical in that at
different times, the same point will be mapped to different classes.  Then you
can at best model the discriminant function for each of the categories, and
then, given a point, choose the class that is most probable.  This requires
more than one instance per sample in order to use the memorization technique.
So either you abandon memorization (since there are usually an infinite # of
inputs anyway, and you don't know in advance what an effective discretization
of the input space would be) and therefore use a bias, or, in the cases where
memorization is possible, you will have to look at more than one instance of
each input (I'm wondering:  Isn't this how KNN or k-means classifiers, or even
codebooks, work?).
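
A toy Python sketch of the "choose the most probable class" idea (the data
below are made up, and simple frequency counts stand in for a real
discriminant function): when the same input can appear with different
labels, you need repeated instances before a majority vote means anything.

from collections import Counter, defaultdict

# the same input point shows up with different class labels
observations = [("x1", "A"), ("x1", "A"), ("x1", "B"),
                ("x2", "B"), ("x2", "B"), ("x2", "A")]

counts = defaultdict(Counter)
for point, label in observations:
    counts[point][label] += 1

def classify(point):
    # pick the most frequently observed class for this point; with only one
    # instance per point this would reduce to plain memorization
    return counts[point].most_common(1)[0][0]

print(classify("x1"), classify("x2"))   # -> A B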

But what will be the proper number of instances to look at?  That cannot be
determined in advance, so that may be yet another way bias can creep in...

Ron Chrisley

zmacv61@flamingo.doc.ic.ac.uk.doc.ic.ac.uk (L K C Leighton) (09/09/89)

I am new to neural networks, and to news.  dubious as i was about sending
a message over the network, i had to query this:

>X-OR is *memorizable*, not learnable.

>In either case, learnability is not a requirement for neural net applicability.
>Nets memorize well.

i thought the whole idea of neural nets was to provide learning abilities, not
just to provide a memorising function.  computers these days are fast enough
as it is, and with good programming style, a good database can be provided
which would far outstrip any neural network.
        perhaps someone could enlighten me on this, as programs i have been
working on (limited to 1000 neurons approx) have centred around imitating the
brain.  i saw memory recall-type neural nets as limited, as they have no means
to link one neural pattern to another (thought processing we do all the time).
        also - does anyone have any programs along these lines, or any others?
as i am new to this subject, i have not yet tackled any books or articles on
this subject, and a lot of the terminology is lost on me.  programs i can
understand :-)

many thanks

	luke leighton.

| Luke Leighton @ Imperial College          |   zmacv61@uk.ac.ic.doc           |
| they said it was impossible.  i agreed.   |   tharg@uk.ac.ic.cc              |
| i said it was impossible. they disagreed. |   zdac131@uk.ac.kcl.cc.elm       |

jagota@sybil.cs.buffalo.edu (Arun Jagota) (09/12/89)

In article <11308@boulder.Colorado.EDU> bill@synapse.Colorado.EDU (Bill Skaggs) writes:
>
>  Is it necessary to have a bias in order to be able to learn?
>
>  Not always.  If the training set includes an example of every
>possible input, then the device only needs to be able to "memorize"
>the correct responses -- it doesn't need a preset bias.
>

Even then, it might be *useful* to consider the complexity of the 
representation (the functional form) of the learned function that the 
device forms. A device will exhibit good functional form learning ability 
if it forms a *good* representation.
Since what constitutes the *best* representation (min # of params, 
length of representation in bits, etc.) is a subjective matter, this 
amounts to needing a bias. Moreover, an unbiased representation will be 
a table look-up which has no generalisation capability (whereas a 
*better* representation should have some generalisation capability).
CSNET  :  jagota@cs.buffalo.edu
BITNET :  jagota@sunybcs