apr@cbnewsl.ATT.COM (anthony.p.russo) (09/11/89)
I am absolutely convinced that bias is necessary for generalization. When any
machine is presented with an incomplete set of examples of a function and
asked to generalize, that machine must choose among all the possible functions
that are consistent with the examples. Its basis for choosing is *DEFINED* as
its bias. Without bias, the choice would be essentially random, and
generalization would be impossible. Therefore, if we are to define
learnability, it must be with respect to a bias or set of biases.

Now, a bias can be any definable criterion (this may or may not exclude
"simplicity" as a bias). It can take the form of hardwiring (net topology) or
of previously learned information (weights). This supports an earlier comment
that learnability should be dependent on network architecture.

The question arises: which is more important to learn, biases or functions?
Well, since generalization is not possible without biases, functions cannot
be learned without them (only memorized). So, if you want a machine to really
learn a function (generalize on it), biases are more important.

Ron Chrisley writes:

> [...] I do not see how the fact that
> generalization = bias implies the optimality of learning the boundary
> conditions, and would be very interested in having you elaborate on why you
> think it might.

My reply is to give a simplified, one-dimensional case. A boundary is most
efficiently defined (read: learning will be fastest) by its location in
n-dimensional space. Since neural nets don't learn this way, the next most
efficient definition of a boundary is obtained by giving examples of two
items infinitesimally close to the boundary but on opposite sides of it. In
this way, two points can define a boundary in 1-D space, for example. Those
two points, or examples, are the most important ones to present to the net.
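As a rough illustration (the toy learner below and all its names are my own
invention, not anything from the literature), consider a 1-D machine whose
bias is "put the boundary midway between the closest opposite-class pair of
examples seen so far". Near-boundary examples pin the boundary down exactly;
distant ones leave it badly underdetermined:

```python
# Toy sketch (illustrative assumptions only): a 1-D learner biased to place
# its decision boundary midway between the tightest opposite-class pair.

def boundary_from_examples(examples):
    """examples: list of (x, label) pairs, label in {-1, +1}.
    Assumes at least one example of each class."""
    neg = [x for x, y in examples if y < 0]
    pos = [x for x, y in examples if y > 0]
    # The tightest opposite-class pair carries the boundary information.
    return (max(neg) + min(pos)) / 2.0

# Two examples hugging the true boundary (zero) locate it exactly:
print(boundary_from_examples([(-0.001, -1), (+0.001, +1)]))   # 0.0
# Distant examples leave its location far from the truth:
print(boundary_from_examples([(-1000.0, -1), (+250.0, +1)]))  # -375.0
```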
If, for instance, we wanted to teach the concept of negative versus positive
(zero is the boundary), -1 and +1 (in integer space) would be a sufficient
set of examples (given, of course, some definition of bias). Conversely,
examples like -102312341 and +823456 are not very helpful.

~ tony ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Tony Russo                " Surrender to the void."           ~
~ AT&T Bell Laboratories                                        ~
~ apr@cbnewsl.ATT.COM                                           ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
chrisley@kanga.uucp (Ron Chrisley UNTIL 10/3/88) (09/12/89)
I wrote:

> [...] I do not see how the fact that
> generalization = bias implies the optimality of learning the boundary
> conditions, and would be very interested in having you elaborate on why you
> think it might.

Then, Tony Russo said:

> My reply to this is to give a simplified, one-dimensional case... A
> boundary is most efficiently defined (read: learning will be faster) by its
> location in n-dimensional space. Since neural nets don't learn this way,
> the next most efficient definition of a boundary is obtained by giving
> examples of two items infinitesimally close to the boundary but on
> different sides of it. In this way, in 1-D space for example, two points
> can define a boundary. Those two points or examples are the most important
> ones to present to the net.
>
> If, for instance, we wanted to teach the concept of negative and positive
> (zero is the boundary), -1 and +1 (in integer space) would be a sufficient
> set of examples (given, of course, some definition of bias). Conversely,
> examples like -102312341 and +823456 are not very helpful.

I claim that although there might be algorithms that learn generalization
biases for which the boundary cases provide quickest learning, there are also
algorithms for which this is not the case. For instance, some algorithms may
learn biases better if you provide exemplars. I know this is exactly what you
are claiming not to be the case, but I don't yet see an argument. What is the
difference between -1:+1 and -100000:+100000? If there is a difference in the
quality of bias learning, I am sure it depends on some assumptions about the
bias learning algorithm you have in mind, or about the nature of the data.

"Boundary is best" does not seem to be true for arbitrary learning
algorithms, especially for particular generalization tasks. Consider a 1-D
task, where everything within distance D of the origin is in cat 1, and all
points outside of this region are in cat 2.
Now consider the following way of learning bias. Start with the bias that,
after seeing n samples, you will categorize everything within radius r of any
of the samples as the class of those samples, with r small. Then r is
increased in a least-squares way until generalization error is minimized.

Clearly, it would be best to use samples near the origin to train this
task/bias-learning-algorithm combination. If samples near the boundaries are
used, then there will be only small error in estimated generalization,
resulting in small changes to r, which would converge to the following
classification: cat 1 if the sample is within epsilon (the small value of r)
of +D or -D. But if samples from the interiors of the classes are used,
estimated generalization error will better match actual error, which will
initially be high, resulting in an increase of r. Thus we will wind up with
the following classification: cat 1 if the sample is within D of 0.

Don't get me wrong: I do think that learning near the boundaries, a la LVQ2,
is a good idea. But I don't think it is a good idea for all tasks. I am not
convinced that it is a good way to learn 2nd-order *biases* (as opposed to
1st-order distributions), and even if it is good for that, I question whether
it has anything to do with the fact that generalization = bias, as opposed to
the Bayesian arguments Prof. Kohonen gives. If it were true for Bayesian
reasons, you would also probably be assuming that the bias learning is
performed after you already have a relatively good solution to the problem.

The reason why it was not a good idea in the example I gave is that that
bias-learning algorithm needs information about the entire distribution;
looking only at boundaries throws that away. But of course, I may be off
track here. You certainly seem to hold the gen=bias => boundary cases
implication in high regard. Please explain if I have misunderstood.
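A toy version of this task/bias-learning combination can be sketched as
follows (all code, names, and numeric choices are my own illustrative
assumptions; actual error scored on a fixed grid stands in for the
estimated-generalization step, but the qualitative outcome is the same):

```python
# Hypothetical sketch of the radius-r bias learner described above.
# Task: points within distance D of the origin are cat 1, the rest cat 2.

D = 5.0
grid = [i / 10.0 for i in range(-150, 151)]   # fixed 1-D test points

def true_cat(x):
    return 1 if abs(x) <= D else 2

def predicted_cat(x, samples, r):
    # Cat 1 iff x falls within radius r of some cat-1 sample; default cat 2.
    return 1 if any(abs(x - s) <= r for s in samples) else 2

def error(samples, r):
    return sum(predicted_cat(x, samples, r) != true_cat(x) for x in grid)

candidates = [0.1 * k for k in range(1, 101)]  # r = 0.1 .. 10.0

def best_r(samples):
    # min() breaks ties toward the earliest (smallest) candidate r.
    return min(candidates, key=lambda r: error(samples, r))

interior = [0.0]            # a sample from the middle of cat 1
boundary = [4.99, -4.99]    # samples hugging the cat-1/cat-2 boundary

# Interior sample: minimization drives r out to D, recovering the class.
# Boundary samples: error is essentially flat in r, so r never grows, and
# the learned region is just thin shells around +D and -D.
print(best_r(interior), best_r(boundary))
```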
Ron

By the way, has anybody looked at 2nd-order bias learning as I have sketched
it out here? Thanks to Tony for pointing me in the right direction...
apr@cbnewsl.ATT.COM (anthony.p.russo) (09/13/89)
Ron Chrisley wrote:

> I claim that although there might be algorithms that learn generalization
> biases for which the boundary cases provide quickest learning, there are
> also algorithms for which this is not the case. For instance, some
> algorithms may learn biases better if you provide exemplars.
>
> I know this is exactly what you are claiming to not be the case, but I
> don't yet see an argument. What is the difference between -1:1 and
> -100000:100000? If there is a difference in the quality of bias learning,
> I am sure that it is dependent on some assumptions concerning the bias
> learning algorithm you have in mind, or concerning the nature of the data.
>
> The "boundary is best" does not seem to be true for arbitrary learning
> algorithms, especially for particular generalization tasks. Consider a 1D
> task, where everything within distance D of the origin is in cat 1, and
> all points outside of this region are in cat 2. Now consider the following
> way of learning bias: Start with the bias that after seeing n samples, you
> will categorize everything within radius r of any of the samples as the
> class of those samples, r being small.

*** I think that since the task is defined with respect to the origin, the
bias should be also. Then only the distance D would need to be learned, and
all the information about D would be included in the boundary of radius D.
For instance, when I talk of learning boundaries, my bias must be that
everything in between those boundaries is of the same class. ***

> Then, r is increased in a least squares way, until generalization error is
> minimized. Clearly, it would be best to use samples near the origin to
> train this task/bias learning algorithm combination.
> [ sound argument deleted ]
> Don't get me wrong, I do think that learning near the boundaries, ala LVQ2,
> is a good idea.
> But I don't think it is a good idea for all tasks, I am not convinced that
> it is a good way to learn 2nd-order *biases* (as opposed to 1st order
> distributions), and even if it is good for that, I question whether it has
> anything to do with the fact that generalization = bias, as opposed to the
> Bayesian arguments Prof. Kohonen gives. If it were true for Bayesian
> reasons, you would also probably be assuming that the bias learning is
> performed after you already have a relatively good solution to the problem.
>
> The reason why it was not a good idea in the example I gave was because
> that bias learning alg needs information about the entire distribution.
> Only looking at boundaries throws that away.

*** Bayesian classifiers are really boundary sets. The boundaries are
*calculated* from a priori knowledge of the distributions, but once a
boundary is calculated, the information about the entire distribution *is*
thrown away. By teaching the machine those boundaries we have done the same
thing. ***

> But of course, I may be off track here. You certainly seem to hold the
> gen=bias => boundary cases implication in high regard. Please explain if I
> have misunderstood.

*** I believe a couple of points have been brought out in our discussion over
the past few weeks. In my *opinion*:

1) Learning and memorization are two very different things.
2) Learning implies generalization and rule-extraction. Memorization does
   not.
3) Biases of some sort are required to learn anything.
4) Learning is fastest with borderline patterns that require the machine to
   differentiate subtle differences between classes. But it also seems
   reasonable that strikingly different examples play an important role in
   learning.
5) Learnability should be defined in terms of a particular set of biases,
   perhaps dependent on network architecture. (E.g., some things are just
   not learnable by a particular network or machine.)

Not bad.
***

> By the way, has anybody looked at 2nd order bias learning as I have
> sketched it out here? Thanks to Tony for pointing me in the right
> direction...

*** You're welcome. It's a lot of fun. I just have this vision of a bunch of
researchers quietly reading these messages and jotting down notes for future
work and papers. More people should join the discussion; none of the five
points above is proven. ***

~ tony ~
danforth@riacs.edu (Douglas G. Danforth) (09/13/89)
Tony Russo writes:

>I believe a couple of points have been brought out in our discussion over
>the past few weeks. In my *opinion*,
>1) learning and memorization are two very different things.
>2) learning implies generalization and rule-extraction. Memorization does
>not.
>3) biases of some sort are required to learn anything.
>4) learning is fastest with borderline patterns that require the machine
>to differentiate subtle differences in classes. But, it also seems
>reasonable that strikingly different examples also play an important role
>in learning.
>5) learnability should be defined in terms of a particular set of biases,
>perhaps dependent on network architecture. (e.g. some things are just not
>learnable by a particular network or machine)

In regard to points (1) and (2): in a standard random-access memory where
all possible addresses can be represented (a 24-bit address => 16 MB), there
is no generalization. Each slot is filled independently of every other slot.
However, when dealing with large numbers of bits, say 1,000, it is not
possible to represent all possible addresses, and yet a "memory" can still
be constructed for this case. The memory is sparse in the address space:
only a sampling of the possible memory addresses can be present. These "hard
locations" can act as repositories for information written into the memory,
by distributing the information among the set of hard locations which are
"near" the desired (but not physically present) address. One can read from
an arbitrary address by "pooling" the information stored in the nearby hard
locations and then thresholding the result.

The reason for this preamble is to show that reading from (presenting an
input pattern to) a sparse distributed memory (a neural net) can indeed
produce output which is a "generalization". The generalization can take the
form of: (A) what's the most similar thing to this pattern that I have seen
before? or (B) what is the Platonic ideal of this fuzzy pattern?
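A toy sketch of such a memory (in the spirit of Kanerva's sparse distributed
memory; the sizes, names, and thresholding details below are my own
illustrative assumptions, far smaller than the 1,000-bit case):

```python
# Toy sparse distributed memory: writes distribute a word into all "hard
# locations" near the target address; reads pool nearby counters and
# threshold. Sizes are illustrative assumptions only.
import random
random.seed(0)

N = 64          # address/word length in bits
M = 300         # hard locations: a sparse sample of the 2**64 addresses
RADIUS = 26     # Hamming radius defining "near"

hard_addrs = [[random.randint(0, 1) for _ in range(N)] for _ in range(M)]
counters = [[0] * N for _ in range(M)]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def write(addr, word):
    # Distribute the word into every hard location near the address.
    for h, ctr in zip(hard_addrs, counters):
        if hamming(h, addr) <= RADIUS:
            for i, bit in enumerate(word):
                ctr[i] += 1 if bit else -1

def read(addr):
    # Pool the counters of nearby hard locations, then threshold.
    pooled = [0] * N
    for h, ctr in zip(hard_addrs, counters):
        if hamming(h, addr) <= RADIUS:
            for i, c in enumerate(ctr):
                pooled[i] += c
    return [1 if p > 0 else 0 for p in pooled]

# Store a pattern at its own address, then read back from a noisy probe:
pattern = [random.randint(0, 1) for _ in range(N)]
write(pattern, pattern)
probe = pattern[:]
for i in random.sample(range(N), 5):   # flip 5 bits of noise
    probe[i] ^= 1
# Distance from the recalled word to the original (0 = perfect clean-up):
print(hamming(read(probe), pattern))
```

The read from the noisy probe illustrates form (A) of the generalization:
the memory returns the most similar thing it has seen before.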
When dealing with very large dimensional spaces, it becomes difficult to
dismiss the generalization characteristics of a sparse distributed memory,
for they begin having animal-like capabilities. Most neural net research to
date has focused on very small numbers of nodes: input, hidden, and output.
For these small cases, I agree, the utility of memory may not seem great.

By "rule extraction" I assume you mean something analogous to human
conscious thought, where one can articulate the "rule" that one has
discovered. This is an ongoing area of debate. Is it necessary to
"interpret" the connection weights, or just to evaluate the performance of
the system? IMHO, that depends upon your goals.

Doug Danforth
danforth@riacs.edu
bill@boulder.Colorado.EDU (09/14/89)
Well, the discussions on this topic are getting long-winded, with lots of
nested quotations and such, so it's probably pretty much exhausted its
vitality -- but I can't resist taking one more fling, and then I shall
remain resolutely silent.

>I believe a couple of points have been brought out in our discussion over
>the past few weeks. In my *opinion*,
>1) learning and memorization are two very different things.

I bet there isn't a single psychologist in the whole world who doesn't think
that memorization is a kind of learning.

>2) learning implies generalization and rule-extraction. Memorization does
>not.

"Learning", as the word is usually used, is a more-or-less enduring change
in behavior, caused by experience. It implies generalization only in the
sense that no two stimuli are ever exactly the same. (As Heraclitus put it:
you can't step twice into the same river. The second time, it's a different
river, and you're a different person.) Whether it implies rule-extraction
depends on what you mean by a "rule". If a rule is simply an association
between some inputs and outputs, then you're right; if it is more than that,
you're not.

>3) Biases of some sort are required to learn anything.

If I understand this statement, what it means is: in order to be able to
generalize, a device must be capable of inferring responses to inputs it has
not experienced, _and_there_is_no_uniquely_correct_way_of_doing_that_.

>4) Learning is fastest with borderline patterns that require the machine
>to differentiate subtle differences in classes. But, it also seems
>reasonable that strikingly different examples also play an important role
>in learning.

Learning (of a categorization task) is usually _fastest_, at least in the
early stages, with inputs that are "typical" of their categories: if you
want to teach someone "mammal", you start with a mouse or a dog, not a
dolphin or a bat.
Borderline inputs are useful later on, after the categories have been
roughly sketched out, because they give precise information about where the
borders are.

>5) Learnability should be defined in terms of a particular set of biases,
>perhaps dependent on network architecture. (e.g. some things are just not
>learnable by a particular network or machine)

This point seems completely correct to me, and it is the most important
point of the whole discussion.

>Not bad.
apr@cbnewsl.ATT.COM (anthony.p.russo) (09/14/89)
Doug Danforth wrote:

> The reason for this preamble is to show that reading from (presenting
> an input pattern to) a sparse distributed memory (a neural net) can indeed
> produce output which is a "generalization". The generalization can take
> the form of: (A) what's the most similar thing to this pattern that I have
> seen before, or (B) what is the Platonic ideal of this fuzzy pattern?

*** I don't really want to belabor this discussion much longer, but I don't
see any generalization in your example. What has been generalized? On the
surface, you're saying the fuzzy pattern has been generalized to the "ideal"
of the stored templates, but I think the memory has simply calculated a
predetermined distance function. In light of the ongoing discussion, there
are two ways to view this:

1) The energy function is the "bias" we've been talking about. But in this
case there is no other learning going on, so the generalization is due
solely to the bias. If you accept this, then every classifier generalizes,
and we need a better definition of generalization.

2) The energy function is not this "bias"; the biases are instead due to the
topology etc. of the network. In this case, the rule for similarity was
given to the network, not extracted by it.

It would be nice and clean if the second case were correct, but I wouldn't
bet on it. ***

> By "rule extraction" I assume you mean in analogy to human conscious
> thought where one can articulate the "rule" that one has discovered. This
> is an ongoing area of debate. Is it necessary to "interpret" the
> connection weights or just evaluate the performance of the system? IMHO,
> that depends upon your goals.
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You're right.

~ tony ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Tony Russo                " Surrender to the void."           ~
~ apr@cbnewsl.ATT.COM                                           ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~