neuron-request@HPLABS.HP.COM ("Neuron-Digest Moderator Peter Marvit") (11/13/89)
Neuron Digest   Friday, 10 Nov 1989   Volume 5 : Issue 45

Today's Topics:
        Re: Step Function
        KNN (was Re: Step function)
        Original STEP discussion continued
        Re: Original STEP discussion continued
        Re: Step function issue
        Re: Step Function. Biases are necessary
        Re: Step Function. Biases are necessary
        Re: Step Function
        Re: Step Function. Biases are necessary
        Re: Step Function. Biases are necessary
        Re: Step Function. Biases are necessary
        Re: Step Function. Biases are necessary

Send submissions, questions, address maintenance and requests for old
issues to "neuron-request@hplabs.hp.com" or
"{any backbone,uunet}!hplabs!neuron-request".
Use "ftp" to get old issues from hplpm.hpl.hp.com (15.255.176.205).

------------------------------------------------------------

Subject: Re: Step Function
From: chrisley@kanga.uucp (Ron Chrisley UNTIL 10/3/88)
Organization: Xerox Palo Alto Research Center
Date: 06 Sep 89 04:34:37 +0000

> Is it necessary to have a bias in order to be able to learn?
>
> Not always. [...]

This assumes, as has much of the discussion on this topic, that one is
dealing with deterministic classification problems, that is, ones in which
the proper response to an input is constant. But many problems, such as the
classification of natural signals like speech, are statistical, in that at
different times the same point will be mapped to different classes. Then you
can at best model the discriminant function for each of the categories and,
given a point, choose the class that is most probable. This requires more
than one instance per sample in order to use the memorization technique.
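[[ Editor's note: Ron's point -- that for statistical problems one can only
estimate per-class probabilities from repeated observations and then pick
the most probable class -- can be sketched in a few lines of Python. The
toy data and names below are illustrative inventions, not anything from the
discussion. ]]

```python
from collections import Counter, defaultdict

def empirical_classifier(observations):
    """For a statistical (non-deterministic) labelling, estimate
    P(class | input) from repeated observations of each input, and
    map each input to its most frequently observed class."""
    counts = defaultdict(Counter)
    for x, label in observations:
        counts[x][label] += 1
    return {x: c.most_common(1)[0][0] for x, c in counts.items()}

# The same input 'a' is labelled 1 twice and 2 once, so 'a' -> 1:
obs = [('a', 1), ('a', 1), ('a', 2), ('b', 2), ('b', 2)]
print(empirical_classifier(obs))  # → {'a': 1, 'b': 2}
```

Note that this only works with more than one instance per input, which is
exactly the requirement Ron identifies.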
So either you abandon memorization (since there are usually an infinite
number of inputs anyway, and you don't know in advance what an effective
discretization of the input space would be) and therefore use a bias, or, in
the cases where memorization is possible, you will have to look at more than
one instance of each input (I'm wondering: isn't this how KNN or k-means
classifiers, or even codebooks, work?). But what will be the proper number
of instances to look at? That cannot be determined in advance, so that may
be yet another way bias can creep in...

Ron Chrisley

------------------------------

Subject: KNN (was Re: Step function)
From: danforth@riacs.edu (Douglas G. Danforth)
Organization: Research Institute for Advanced Computer Science
Date: 06 Sep 89 16:39:40 +0000

[[ Regarding Ron Chrisley's message ]]

By using a large (256 bit) but finite discrete representation of, say,
speech, memorization is still possible. The continuous-to-discrete mapping
does introduce a bias (not too bad if the bit pattern size is large). Large
bit spaces can be handled by large numbers of hidden nodes which act as
sponges to soak up the information in the input patterns. The output pattern
(class label) is then the pooled decision of locally activated hidden units.
A "rule of thumb" one can use is that the number of activated units should
be approximately the square root of the number of hidden units. The above
comments apply to sparse distributed memories.

Since Ron raised the issue of K nearest neighbor, I would like to put forth
a challenge to the Net:

POSTULATE: K Nearest Neighbor will beat any Artificial Neural Network as a
PRACTICAL alternative.

I stress the word "practical". There are published results (Xerox PARC) that
have ANNs beating out KNNs for some experiments. The issue as I see it: is
it worth the effort to perfect ANN learning rules when KNN is fast, simple,
and very good on the average?
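[[ Editor's note: for readers who have not met the technique being debated,
here is a minimal k-nearest-neighbor sketch in Python. The function name
and toy data are ours, purely for illustration: a query point is classified
by majority vote among its k closest training samples. ]]

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training
    samples.  `train` is a list of (point, label) pairs; points are
    tuples of floats.  Plain Euclidean distance, no libraries."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda s: dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 1-D data: negatives below zero-ish, positives above.
train = [((-1.0,), 'neg'), ((0.0,), 'neg'),
         ((1.0,), 'pos'), ((2.0,), 'pos'), ((3.0,), 'pos')]
print(knn_classify(train, (2.5,), k=3))  # → pos
```

The appeal Doug points to is visible here: no learning rule at all, just
storage and a distance function, with k > 1 averaging out label errors.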
Disadvantages of KNN: little biological plausibility, little "insight"
gained by using KNN. Note, however, that by using KNN with "train only on
error" the size of memory can be greatly reduced, and the boundaries are
where the majority of the samples are placed. With K small but larger than
1, the effects of "label errors" can be reduced (averaged out). For these
reasons, I find it difficult to dismiss KNN.

Comments?

Doug Danforth
danforth@riacs.edu

------------------------------

Subject: Original STEP discussion continued
From: ck@rex.cs.tulane.edu (Cris Koutsougeras)
Organization: Computer Science Dept., Tulane Univ., New Orleans, LA
Date: 08 Sep 89 07:38:10 +0000

I am tempted to present my thoughts about the issues raised in the "step
function" discussion. Let me try to lead through the concepts of
"correctness" and "generalization".

Given a training set, suppose that a net learns a function F1 which
"contains" (or agrees, if you wish) with the training set (TS). On the basis
of the info in TS only, one has to accept F1 as correct. The issue of a
proper generalization has to do with the question of whether the inference
of F1 on the basis of TS is "reasonable". Reasonable is not something easy
to define. What it intuitively means, though, is whether F1 reflects a
relation characteristic of TS through which TS can be reconstructed. Also,
of the many such relations characterizing TS (each of which is such that it
alone can reconstruct TS), the most "reasonable" one would be (as a matter
of definition) the simplest one.

Let's take an example, because otherwise I sound too arbitrary. Suppose I
want to teach someone the Greek word "louloudi" and I present as positive
instances some red flowers, and as negatives some trees, some houses, and
some non-flower plants. Suppose also that none of the negative instances
are red. Would anyone who thought that louloudi means a red flower, or
something red in general, be considered stupid???
Well, louloudi means flower, and it is my fault that I chose a TS in which
the color plays a determinant role, because I should expect that the
"student" would try to identify the "simplest" discriminant rule, which in
this case happens to be the color alone.

So I would relate the concept of a learnable function with the TS in use. I
would suggest that a given function is learnable with respect to a given TS
if it is the "simplest" function which is correct under TS and the
"simplest" correct function is unique. In the case of Back Prop, as a
measure for simplicity we can adopt the degree of nonlinearity of the
function.

Let's see what happens in the BP operation. Components (primitive features)
of the input which are irrelevant to the I/O relation will have random
values that make no sense (in other words, would result in a very
complicated description of the I/O relation), if we assume that we have a
good TS. On the other hand, those which are relevant will have recurrent
properties (or values, to keep it simple). The changes of the weights
implied by non-relevant components will be random and in the long run will
cancel each other, simply causing an oscillation which will become smaller
and smaller with the annealing schedule. Those which are relevant, though,
will cause a steady and directed change. That is how BP identifies the
relevant features and (easy to understand) by extension how it synthesizes
a description. If the net does not have excessive non-linearity, the
relevant features (and their recurrent values) will dominate the changes
and the end result. If it does, then any small bias in the values of
irrelevant features will be identified and will remain in the net's
internal representations.

It thus should become apparent that we should be talking about a function
learnable with respect to a TS, and that learnability should be connected
with an optimality measure of some sort.
Most of what I expressed here are thoughts that puzzled me when I was
writing my dissertation. My escape then was to develop another model, based
on an effort to alleviate the problem of fixed non-linearity in Rumelhart's
net. But I haven't yet addressed the problem of learnability in the general
case as it was put earlier here. Which means that whoever feels that s/he
wants to seriously contribute to this is welcome to cooperate, and let's
get a paper out.

Cris Koutsougeras

------------------------------

Subject: Re: Original STEP discussion continued
From: bill@boulder.Colorado.EDU
Organization: University of Colorado, Boulder
Date: 08 Sep 89 20:56:45 +0000

[From Cris Koutsougeras:]
> I would suggest that a given function is learnable with respect to a given
> TS if it is the "simplest" function which is correct under TS and that the
> "simplest" correct function is unique.

Ah yes, Occam's razor slashes again! But the whole point, the very crux of
the problem, is that "simplest" is not a well-defined concept. I will
illustrate.

There is an old mathematician's joke-paradox-problem, which goes as follows.
The first five elements of a sequence are 1, 2, 4, 8, 16 -- what is the next
element? Answer: 31. Why? Well, the way to extend a sequence is to find the
simplest function that is consistent with the numbers you have. But the
simplest sorts of functions, as every mathematician knows, are polynomials;
and the polynomial

    (1/24)n^4 - (1/12)n^3 + (11/24)n^2 + (7/12)n + 1 ,

for n = 0, 1, 2, 3, 4, is the lowest-order one that gives the members of the
sequence. So to get the next element, just plug in n = 5; if you do, you get
31. You say that there is a simpler function? I say you're wrong. The
function 2^n is really quite a complicated one -- it only looks simpler to
you because you're familiar with it. Okay?

Well, you needn't take the example too seriously in order to get the
message, which is that "simplicity is in the eye of the beholder".
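[[ Editor's note: Bill's polynomial really does pass through 1, 2, 4, 8, 16
and then produce 31. A few lines of Python with exact rational arithmetic
confirm it: ]]

```python
from fractions import Fraction as F

def p(n):
    """The lowest-order polynomial through (0,1),(1,2),(2,4),(3,8),(4,16),
    exactly as given in Bill's message, evaluated with exact rationals."""
    n = F(n)
    return F(1, 24)*n**4 - F(1, 12)*n**3 + F(11, 24)*n**2 + F(7, 12)*n + 1

print([int(p(n)) for n in range(6)])  # → [1, 2, 4, 8, 16, 31]
```

(This is the classic circle-division sequence; the point stands that the
interpolating quartic is, by polynomial-order lights, "simpler" than 2^n.)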
I realize that not everybody will agree with this notion, and it can be made
to seem extremely counterintuitive: check out Hofstadter's "Goedel, Escher,
Bach" for an opposing viewpoint. Still, I think our attempts to design
learning systems are gradually forcing us to realize that there is just no
such thing as a universal notion of "simplicity".

------------------------------

Subject: Re: Step function issue
From: ck@rex.cs.tulane.edu (Cris Koutsougeras)
Organization: Computer Science Dept., Tulane Univ., New Orleans, LA
Date: 09 Sep 89 09:44:07 +0000

[Bill Skaggs writes:]
> Ah yes, Occam's razor slashes again! But the whole point, the very crux
> of the problem, is that "simplest" is not a well-defined concept.......
> You say that there is a simpler function? I say you're wrong........

We are in no disagreement. And I am aware of the nature of the problem. That
is why I used quotes when talking about undefined concepts. Also, I gave the
example of Rumelhart's net to point out that for specific domains one could
possibly define (and I clarified "arbitrarily") a measure for such concepts.
I am copying a part:

    So I would relate the concept of a learnable function with the TS in
    use. I would suggest that a given function is learnable with respect to
    a given TS if it is the "simplest" function which is correct under TS
    and that the "simplest" correct function is unique.

Besides, my concern was merely to communicate an idea, or an intuition if
you wish, and not to be rigorous. I thought that these postings were the
appropriate place for the exchange of ideas, even if not crystallized.
Otherwise I would write a paper. Frankly, I think that you did not read
carefully, otherwise you wouldn't feel any need for razor comments.

Cris Koutsougeras

------------------------------

Subject: Re: Step Function. Biases are necessary
From: apr@cbnewsl.ATT.COM (anthony.p.russo)
Organization: AT&T Bell Laboratories
Date: 11 Sep 89 12:22:46 +0000
Summary: Biases are necessary, I'M CONVINCED

I am absolutely convinced that bias is necessary for generalization. When
any machine is presented with (an incomplete number of) examples of a
function and asked to generalize, that machine must choose among all the
possible functions that are consistent with the examples. Its basis for
choosing is *DEFINED* as its bias. Without bias, the choice would be rather
random, and generalization would be impossible. Therefore, if we are to
define learnability, it must be with respect to a bias or set of biases.

Now, a bias can be any definable criterion (this may or may not exclude
"simplicity" as a bias). It can be in the form of hardwiring (net topology)
or previously learned information (weights). This supports someone's
comment that learnability should be dependent on network architecture.

The question arises: which is more important, learning biases or functions?
Well, since generalization is not possible without biases, functions cannot
therefore be learned (only memorized). So, if you want a machine to really
learn a function (generalize on it), biases are more important.

Ron Chrisley writes:
> [...] I do not see how the fact that
> generalization = bias implies the optimality of learning the boundary
> conditions, and would be very interested in having you elaborate on why
> you think it might.

My reply to this is to give a simplified, one-dimensional case. A boundary
is most efficiently (read: learning will be faster) defined by its location
in n-dimensional space. Since neural nets don't learn this way, the next
most efficient definition of a boundary is obtained by giving examples of
two items very (infinitesimally) close to the boundary but on different
sides of it. In this way, in 1-D space for example, two points can define a
boundary.
Those two points or examples are the most important ones to present to the
net. If, for instance, we wanted to teach the concept of negative and
positive (zero is the boundary), -1 and +1 (in integer space) would be a
sufficient set of examples (given, of course, some definition of bias).
Conversely, examples like -102312341 and +823456 are not very helpful.

 ~ tony ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~  Tony Russo              " Surrender to the void."             ~
~  AT&T Bell Laboratories                                        ~
~  apr@cbnewsl.ATT.COM                                           ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

------------------------------

Subject: Re: Step Function. Biases are necessary
From: chrisley@kanga.uucp (Ron Chrisley UNTIL 10/3/88)
Organization: Xerox Palo Alto Research Center
Date: 12 Sep 89 06:59:09 +0000

I wrote:
> [...] I do not see how the fact that
> generalization = bias implies the optimality of learning the boundary
> conditions, and would be very interested in having you elaborate on why
> you think it might.

Then, Tony Russo said:
"My reply to this is to give a simplified, one-dimensional case... [[...]]"

I claim that although there might be algorithms that learn generalization
biases for which the boundary cases provide quickest learning, there are
also algorithms for which this is not the case. For instance, some
algorithms may learn biases better if you provide exemplars. I know this is
exactly what you are claiming not to be the case, but I don't yet see an
argument. What is the difference between -1:1 and -100000:100000? If there
is a difference in the quality of bias learning, I am sure that it is
dependent on some assumptions concerning the bias learning algorithm you
have in mind, or concerning the nature of the data. "Boundary is best" does
not seem to be true for arbitrary learning algorithms, especially for
particular generalization tasks.
Consider a 1D task, where everything within distance D of the origin is in
cat 1, and all points outside of this region are in cat 2. Now consider the
following way of learning bias: start with the bias that after seeing n
samples, you will categorize everything within radius r of any of the
samples as the class of those samples, r being small. Then, r is increased
in a least squares way, until generalization error is minimized.

Clearly, it would be best to use samples near the origin to train this
task/bias learning algorithm combination. If samples near the boundaries
are used, then there will only be small error in estimated generalization,
resulting in small changes to r, which would converge to the following
classification: cat 1 if the sample is within epsilon (the small value of
r) of +D or -D. But if samples from the interiors of the classes are used,
estimated generalization error will better match actual error, which will
be initially high, resulting in an increase of r. Thus we will wind up with
the following classification: cat 1 if the sample is within D of 0.

Don't get me wrong, I do think that learning near the boundaries, a la
LVQ2, is a good idea. But I don't think it is a good idea for all tasks, I
am not convinced that it is a good way to learn 2nd-order *biases* (as
opposed to 1st-order distributions), and even if it is good for that, I
question whether it has anything to do with the fact that generalization =
bias, as opposed to the Bayesian arguments Prof. Kohonen gives. If it were
true for Bayesian reasons, you would also probably be assuming that the
bias learning is performed after you already have a relatively good
solution to the problem. The reason why it was not a good idea in the
example I gave was because that bias learning alg needs information about
the entire distribution. Only looking at boundaries throws that away. But
of course, I may be off track here.
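[[ Editor's note: a toy rendering of Ron's hypothetical bias learner, with
every name and number invented for illustration. A query point gets a
sample's label if it lies within radius r of that sample; tuning r against
generalization error is left aside, since Ron's point is about which
training samples make that tuning work. ]]

```python
def radius_classifier(samples, r):
    """Ron's hypothetical bias: a query gets a sample's label if it lies
    within radius r of that sample; otherwise it is left unclassified.
    `samples` is a list of (point, label) pairs on the real line."""
    def classify(x):
        for p, label in samples:
            if abs(x - p) <= r:
                return label
        return None  # outside every radius-r ball
    return classify

# Interior samples of cat 1 (the true class is |x| <= D, here D = 2.0).
# With r grown to 2.0, interior training covers the whole class:
clf = radius_classifier([(0.0, "cat1"), (-0.5, "cat1"), (0.5, "cat1")], r=2.0)
print(clf(1.0), clf(5.0))  # → cat1 None
```

Training the same classifier only on points within epsilon of +D and -D
would leave the interior of cat 1 uncovered, which is the failure mode Ron
describes.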
You certainly seem to hold the gen=bias => boundary cases implication in
high regard. Please explain if I have misunderstood.

Ron

By the way, has anybody looked at 2nd-order bias learning as I have
sketched it out here? Thanks to Tony for pointing me in the right
direction...

------------------------------

Subject: Re: Step Function
From: Arun Jagota <sunybcs!sybil!jagota@RUTGERS.EDU>
Organization: SUNY @ Buffalo
Date: 12 Sep 89 14:40:04 +0000

[[ Regarding Bill Skaggs' original comment ]]
> Is it necessary to have a bias in order to be able to learn?
> Not always. [[ ... ]]

Even then, it might be *useful* to consider the complexity of the
representation (the functional form) of the learned function that the
device forms. A device will exhibit good functional form learning ability
if it forms a *good* representation. Since what constitutes the *best*
representation (min # of params, length of representation in bits, etc.) is
a subjective matter, this amounts to needing a bias. Moreover, an unbiased
representation will be a table look-up, which has no generalisation
capability (whereas a *better* representation should have some
generalisation capability).

CSNET  : jagota@cs.buffalo.edu
BITNET : jagota@sunybcs

------------------------------

Subject: Re: Step Function. Biases are necessary
From: apr@cbnewsl.ATT.COM (anthony.p.russo)
Organization: AT&T Bell Laboratories
Date: 13 Sep 89 11:55:11 +0000

[[ Ron Chrisley wrote about qualities of learning algorithms and then about
learning bias. ]]

I think since the task is with respect to the origin, the bias should be
also. Then only the distance D would need to be learned, and all the
information about D would be included in the boundary of radius D. For
instance, when I talk of learning boundaries, my bias must be that
everything in between those boundaries is of the same class.

> Then, r is increased in a least squares way ...
> [ sound argument deleted ]
> The reason why it was not a good idea in the example I gave was because
> that bias learning alg needs information about the entire distribution.
> Only looking at boundaries throws that away.

Bayesian classifiers are really boundary sets. The boundaries are
*calculated* from a priori knowledge of the distributions, but once a
boundary is calculated, the information about the entire distribution *is*
thrown away. By teaching the machine those boundaries we have done the same
thing.

I believe a couple of points have been brought out in our discussion over
the past few weeks. In my *opinion*,

1) Learning and memorization are two very different things.
2) Learning implies generalization and rule-extraction. Memorization does
   not.
3) Biases of some sort are required to learn anything.
4) Learning is fastest with borderline patterns that require the machine to
   differentiate subtle differences in classes. But it also seems
   reasonable that strikingly different examples play an important role in
   learning.
5) Learnability should be defined in terms of a particular set of biases,
   perhaps dependent on network architecture. (E.g., some things are just
   not learnable by a particular network or machine.)

Not bad.

> By the way, has anybody looked at 2nd-order bias learning as I have
> sketched it out here? Thanks to Tony for pointing me in the right
> direction...

You're welcome. It's a lot of fun. I just have this vision of a bunch of
researchers quietly reading these messages and jotting down notes for
future work and papers. More people should join the discussion; none of the
five points above are proven.

 ~ tony ~

------------------------------

Subject: Re: Step Function. Biases are necessary
From: danforth@riacs.edu (Douglas G. Danforth)
Organization: Research Institute for Advanced Computer Science
Date: 13 Sep 89 16:39:54 +0000

[[ Tony Russo writes 5 points ]]

In regard to points (1) and (2):
In a standard random access memory, where all possible addresses can be
represented (24-bit address => 16 MB), there is no generalization. Each
slot is filled independently of every other slot. However, when dealing
with large numbers of bits, say 1,000, it is not possible to represent all
possible addresses, and yet a "memory" can be constructed for this case.
The memory is sparse in the address space. Only a sampling of the possible
memory addresses can be present. These "hard locations" can act as
repositories for information written into the memory, by distributing the
information among a set of hard locations which are "near" the desired (but
not physically present) address. One can read from an arbitrary address by
"pooling" the information stored in the "nearby" hard locations and then
thresholding the result.

The reason for this preamble is to show that reading from (presenting an
input pattern to) a sparse distributed memory (a neural net) can indeed
produce output which is a "generalization". The generalization can take the
form of: (A) what's the most similar thing to this pattern that I have seen
before, or (B) what is the Platonic ideal of this fuzzy pattern? When
dealing with very large dimensional spaces, it becomes difficult to dismiss
the generalization characteristics of a sparse distributed memory, for they
begin having animal-like capabilities. Most neural net research to date has
focused on very small numbers of nodes: input, hidden, and output. For
these small cases, I agree, the utility of memory may not seem great.

By "rule extraction" I assume you mean in analogy to human conscious
thought, where one can articulate the "rule" that one has discovered. This
is an ongoing area of debate. Is it necessary to "interpret" the connection
weights, or just evaluate the performance of the system? IMHO, that depends
upon your goals.

Doug Danforth
danforth@riacs.edu

------------------------------

Subject: Re: Step Function. Biases are necessary
From: bill@boulder.Colorado.EDU
Organization: University of Colorado, Boulder
Date: 13 Sep 89 17:52:30 +0000

Well, the discussions on this topic are getting long-winded, with lots of
nested quotations and such, so it's probably pretty much exhausted its
vitality -- but I can't resist taking one more fling, and then I shall
remain resolutely silent.

> 1) Learning and memorization are two very different things.

I bet there isn't a single psychologist in the whole world who doesn't
think that memorization is a kind of learning.

> 2) Learning implies generalization and rule-extraction. Memorization does
>    not.

"Learning", as the word is usually used, is a more-or-less enduring change
in behavior, caused by experience. It implies generalization only in the
sense that no two stimuli are ever exactly the same. (As Heraclitus put it:
you can't step twice into the same river. The second time, it's a different
river, and you're a different person.) Whether it implies rule-extraction
depends on what you mean by a "rule". If a rule is simply an association
between some inputs and outputs, then you're right; if it is more than
that, you're not.

> 3) Biases of some sort are required to learn anything.

If I understand this statement, what it means is: in order to be able to
generalize, a device must be capable of inferring responses to inputs it
has not experienced, _and_there_is_no_uniquely_correct_way_of_doing_that_.

> 4) Learning is fastest with borderline patterns that require the machine
>    to differentiate subtle differences in classes. But it also seems
>    reasonable that strikingly different examples play an important role
>    in learning.

Learning (of a categorization task) is usually _fastest_, at least in the
early stages, with inputs that are "typical" of their categories: if you
want to teach someone "mammal", you start with a mouse or a dog, not a
dolphin or a bat.
Borderline inputs are useful later on, after the categories have been
roughly sketched out, because they give precise information about where the
borders are.

> 5) Learnability should be defined in terms of a particular set of biases,
>    perhaps dependent on network architecture. (E.g., some things are just
>    not learnable by a particular network or machine.)

This point seems completely correct to me, and it is the most important
point of the whole discussion.

------------------------------

Subject: Re: Step Function. Biases are necessary
From: apr@cbnewsl.ATT.COM (anthony.p.russo)
Organization: AT&T Bell Laboratories
Date: 14 Sep 89 11:57:51 +0000

[[ Responding to Doug Danforth's comments on generalization ]]

I don't really want to belabor this discussion much longer, but I don't see
any generalization in your example. What has been generalized? On the
surface, you're saying the fuzzy pattern has been generalized to the
"ideal" of the stored templates, but I think that the memory has simply
calculated a predetermined distance function.

In light of the ongoing discussion, there are two ways to view this:

1) The energy function is the "bias" we've been talking about. But in this
   case there is no other learning going on, so the generalization is due
   solely to the bias. If you accept this, then every classifier
   generalizes, and we need a better definition of generalization.
2) The energy function is not this "bias"; the biases are instead due to
   the topology etc. of the network. In this case, the rule for similarity
   was "given" to the network, not extracted by it.

It would be nice and clean if the second case were correct, but I wouldn't
bet on it.

> By "rule extraction" I assume you mean in analogy to human conscious
> thought, where one can articulate the "rule" that one has discovered.
> This is an ongoing area of debate. Is it necessary to "interpret" the
> connection weights, or just evaluate the performance of the system? IMHO,
> that depends upon your goals.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You're right.

 ~ tony ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~  Tony Russo              " Surrender to the void."             ~
~  apr@cbnewsl.ATT.COM                                           ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

------------------------------

End of Neurons Digest
*********************