neuron-request@HPLABS.HP.COM ("Neuron-Digest Moderator Peter Marvit") (11/11/89)
Neuron Digest   Friday, 10 Nov 1989   Volume 5 : Issue 44

Today's Topics:
                 Administrivia
                 Learning a step function using back propagation
                 Re: Step Function
                 Re: Re: Step Function
                 Re: Step Function
                 Re: Re: Step Function
                 Step, approximations, etc.
                 Re: Step Function
                 Re: Step Function
                 Re: Step Function
                 Re: Step Function
                 Re: Step Function
                 Re: Step Function
                 Re: Step Function
                 Re: Step Function
                 Re: Step Function
                 Re: Step Function
                 Re: Step function
                 Re: Step Function
                 Re: Step Function
                 Re: Re: Step Function
                 Re: Step Function
                 Re: Re: Step Function

Send submissions, questions, address maintenance and requests for old
issues to "neuron-request@hplabs.hp.com" or "{any
backbone,uunet}!hplabs!neuron-request". Use "ftp" to get old issues
from hplpm.hpl.hp.com (15.255.176.205).

------------------------------------------------------------

Subject: Administrivia
From: "Neuron-Digest Moderator -- Peter Marvit" <neuron@hplabs.hp.com>
Date: Fri, 10 Nov 89 15:48:55 -0800

As usual, there have been more messages than Digests, especially from
the USENET feed of comp.ai.neural-nets; I may never get out from under
the flurry. Nonetheless, this Digest (and possibly the next) is an
interesting series of exchanges which took place several months ago.
I've done minimal editing, mostly to remove unneeded re-quotations.
Any lack of continuity should therefore be attributed to me. I've made
few editorial comments, since the speakers seem more than capable. As
usual, my editorial marks or comments are enclosed by double brackets
[[ like this ]].

	-Peter Marvit, Immoderator

------------------------------

Subject: Learning a step function using back propagation
From: rohit vidwans <cs.utexas.edu!ut-emx!walt.cc.utexas.edu!rohit@TUT.CIS.OHIO-STATE.EDU>
Organization: The University of Texas at Austin, Austin, Texas
Date: 06 Aug 89 23:09:39 +0000

I am trying to teach a network to accept a step function as input and
return as output the same step function, or a scaled version of it. I
have tried single and multiple inputs and outputs with multiple hidden
layers for my network, but am having no luck whatsoever. I am using
back propagation learning, and my hidden layers use the sigmoidal
function. If anybody has any suggestions, please post a reply on the
net or send mail to the above address.

Neural Network Group
Chemical Engineering Department
University of Texas at Austin

------------------------------

Subject: Re: Step Function
From: ck@rex.cs.tulane.edu (Cris Koutsougeras)
Organization: Computer Science Dept., Tulane Univ., New Orleans, LA
Date: Wed, 23 Aug 89 09:19:37 +0000

With back prop, only an analytic (continuous, differentiable) function
can be learned. The perfect step function is not analytic, therefore
it will not be learned. A close approximation can be learned if your
training set has enough samples around the point of the jump and, of
course, enough coverage of the rest of the interval. Look at it from
another point of view: no matter how many units (nodes) you put in
your net, you are never going to have enough non-linearity to match
the perfect step function. If you want, I can send you a paper on some
related experiments from our group at Tulane U.

	C. Koutsougeras

------------------------------

Subject: Re: Re: Step Function
From: heirich@beowulf.ucsd.edu (Alan Heirich)
Organization: EE/CS Dept. U.C. San Diego
Date: 24 Aug 89 03:18:10 +0000

[[ Regarding Cris Koutsougeras' response ]]

This statement makes me uncomfortable.
It has been proven (White) that a two-layer feed-forward network with
a sigmoid activation function (such as you find in standard back
propagation networks) can approximate any Borel measurable function to
an arbitrary degree of precision. So, for all intents and purposes,
such a network can match a perfect step function to any discernible
precision. On the other hand, I have not seen any literature proving
bounds on learnability. This is obviously a very important area, but
if people are still quibbling about the prevalence of local minima in
error landscapes, I can't imagine that anyone has proven such a result
about which functions are learnable!

Alan Heirich
Comp. Sci. & Eng., Cognitive Science C-014
University of California, San Diego 92093
heirich@cs.ucsd.edu   aheirich@ucsd.bitnet

------------------------------

Subject: Re: Step Function
From: apr@cbnewsl.ATT.COM (anthony.p.russo)
Organization: AT&T Bell Laboratories
Date: 24 Aug 89 13:43:32 +0000

[[ Regarding Alan Heirich's response ]]

Some things have been proven not to be learnable, i.e. EX-OR. On the
other hand, I'm sure no one has proven whether step-function scaling
is learnable or not, but intuitively it seems that it is. Perhaps not
by backprop... There is, happily, a small body of research in this
area, but not nearly enough. The best reference I can give is:

L. G. Valiant, "A Theory of the Learnable," Communications of the ACM,
vol. 27, no. 11, 1984.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Tony Russo        "Surrender to the void."    ~
~ AT&T Bell Laboratories                        ~
~ apr@cbnewsl.ATT.COM                           ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

------------------------------

Subject: Re: Re: Step Function
From: demers@beowulf.ucsd.edu (David E Demers)
Organization: EE/CS Dept. U.C. San Diego
Date: Fri, 25 Aug 89 04:39:32 +0000

>> [...] I have not seen any literature proving bounds on
>> learnability. This is obviously a very important area,

I think Alan may mean "learnability of the step function" here.

> Some things have been proven not to be learnable, i.e. EX-OR.

Surely you jest. XOR is not learnable by a neural net with no hidden
layers, but is certainly a learnable function.

> On the other hand, I'm sure no one has proven whether step-function
> scaling is learnable or not, but intuitively it seems that it is.
> [...] the best reference I can give is:
> L. G. Valiant, "A Theory of the Learnable," Communications of the
> ACM, vol. 27, no. 11, 1984.

Also interesting are:

Ehrenfeucht, Haussler, Kearns & Valiant, "A General Lower Bound on the
Number of Examples Needed for Learning," appearing in Information and
Computation, 1988 [I just have a preprint...]

and

Haussler, "Quantifying Inductive Bias: AI Learning Algorithms and
Valiant's Learning Framework," Artificial Intelligence 36 (1988).

The proceedings of the 1988 workshop on Computational Learning Theory
are available from Morgan Kaufmann. One of my summer projects was to
read several of the papers from the workshop... maybe at Christmas...

But back to the original subject - I think Alan Heirich is right that
the step function may be APPROXIMATED to arbitrary precision by a net
with one hidden layer and sigmoid transfer functions. Backpropagation
may not be an adequate learning method; however, the cited work by
White (appearing in the current issue of Neural Networks, I believe
{vol 2 no. 3}) shows that the function IS learnable. This is stronger
than showing that a net can approximate it.
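[[ Editor's note: a concrete, if anachronistic, illustration of the
approximation point (in Python, my choice; none of the posters
supplied code). One sigmoid unit y = sigmoid(w*x + b) matches the unit
step ever more tightly away from the jump as |w| grows, yet at the
jump itself it always yields 0.5, so the fit is an approximation and
never exact. The second part shows plain online gradient descent on
squared error pushing w the same way; all names and constants are my
own invention for the sketch. -PM ]]

    import math, random

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def step(x):
        return 1.0 if x >= 0 else 0.0

    xs = [i / 50.0 for i in range(-50, 51)]   # grid on [-1, 1]

    # Approximation by construction: steeper sigmoids match the step
    # arbitrarily well at every grid point except the jump at x = 0.
    for w in (5.0, 50.0, 500.0):
        worst = max(abs(sigmoid(w * x) - step(x)) for x in xs if x != 0)
        print("w = %5.0f   worst error off the jump = %.5f" % (w, worst))

    # Learnability in practice: online gradient descent on squared
    # error grows w in the same direction (slowly, as the unit
    # saturates). Samples cover the whole interval, including the jump.
    data = [(x, step(x)) for x in xs]
    w, b, rate = 1.0, 0.0, 0.5
    random.seed(1)
    for _ in range(100000):
        x, t = random.choice(data)
        y = sigmoid(w * x + b)
        delta = (y - t) * y * (1.0 - y)   # dE/d(pre-activation), E = (y-t)^2/2
        w, b = w - rate * delta * x, b - rate * delta
    print("learned w = %.1f, b = %.2f" % (w, b))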
So, given the function and given a bound on the error, there is a net
which can approximate the function to that bound, and an algorithm to
learn the mapping. Our job is to discover them :-)

Dave DeMers			demers@cs.ucsd.edu
Dept. of Computer Science & Engineering
UCSD, La Jolla, CA 92093

------------------------------

Subject: Step, approximations, etc.
From: Cris Koutsougeras <rex!ck@G.MS.UKY.EDU>
Organization: Computer Science Dept., Tulane Univ., New Orleans, LA
Date: 29 Aug 89 06:28:57 +0000

Let me make this a collective response to keep the number of messages
smaller...

> From: kroger@titan.tsd.arlut.utexas.edu (Jim Kroger)
>
> Hi, Chris. I am very interested in understanding what you mean by
> learning a function (continuous or otherwise) in the following....

Well, it is understood that the Rumelhart net and back prop constitute
a general tool for curve fitting. Given a certain configuration of the
net and its weights and thresholds/biases, one can derive (but do not
try it if you have many units) a closed-form expression for the
input-output function. That is the function "learned" by the process
of adaptation. Now the process of adaptation is based on the samples
of the training set. These samples can be seen not only as "input
pattern - output class" but, more generally, as "input vector values -
output vector values". In the case we were discussing, teaching a
function means sampling a given function and using the samples as the
training set. The expectation is to see the net learn the sampled
function while trying to fit a curve to the samples. In other words,
the net's final transfer function should be mathematically equivalent
to (or an approximation of) the sampled one. From there all the
difficult problems start, which are basically due to the degree of
non-linearity needed versus that available, etc.

> From: Ali Minai <aam9n@uvaee.ee.virginia.edu>
>
> This is in response to your posting on the net about neural nets
> learning a step function. If you have any results on this, I would
> be very interested in references, comments etc. Thanks.

All those who have sent me personal requests will get my response. I
cannot respond "yesterday" since classes are starting now, but I will.
It would save some effort, though, if you are able to retrieve the
following publications. If you are not, then let me know.

George, R., B. Geraci, R. Shrikanth and C. Koutsougeras, "A Methodical
Study of the Rumelhart Model," 5th IASTED Intern. Conf. on Expert
Systems and Neural Networks, Hawaii, Aug. 1989.

The effect of more or fewer units than required in a (general) network
is discussed in:

Koutsougeras, C. and C. A. Papachristou, "A Neural Network Model for
Discrete Mappings," in Proceedings of the IEEE International
Conference on Languages for Automation (LFA '88), August 1988.

Other references:

Duda & Hart, Pattern Classification and Scene Analysis, John Wiley &
Sons, 1973.

Koutsougeras, C. and C. A. Papachristou, "Training of a Neural Network
Model for Pattern Classification Based on an Entropy Measure," in
Proceedings of the IEEE International Conference on Neural Networks
(ICNN '88), IEEE, July 1988.

Also, Alan Heirich writes:

> This statement makes me uncomfortable. [[ etc... ]]

That is what I was trying to point out. The more non-linearity you
introduce the better the precision, but there is no way to get an
exact fit. The amount of non-linearity can be increased by introducing
more units per layer or more layers, although additional layers can
easily get the net out of hand.
In fact, Ken-ichi Funahashi, in "On the approximate realization of
continuous mappings by neural networks" (Neural Networks, Vol. 2,
1989), has proven that arbitrary approximation is possible with only
one hidden layer. That is, one only needs to add units in the hidden
layer. Nevertheless, we agree on the arbitrary approximation; the
point here is that it is an approximation and not an exact fit for the
particular function.

Cris Koutsougeras
Computer Science Dept., Tulane University

------------------------------

Subject: Re: Step Function
From: "anthony.p.russo" <att!cbnewsl!apr@UCBVAX.BERKELEY.EDU>
Organization: AT&T Bell Laboratories
Date: 29 Aug 89 11:40:29 +0000

> Surely you jest. XOR is not learnable by a neural net with no hidden
> layers, but is certainly a learnable function.

X-OR is not learnable. If you are given the first three entries in the
truth table, you could not possibly generalize to the last entry with
any confidence. That is, of the two possibilities for the last entry,
neither is preferable. X-OR is *memorizable*, not learnable. In either
case, learnability is not a requirement for neural net applicability.
Nets memorize well.

-- Tony Russo   apr@cbnewsl.ATT.COM

------------------------------

Subject: Re: Step Function
From: ackley@wind.bellcore.com (David H Ackley)
Organization: Bellcore, Morristown, NJ
Date: 30 Aug 89 07:29:49 +0000

In article <1667@cbnewsl.ATT.COM> apr@cbnewsl.ATT.COM
(anthony.p.russo) writes:
> X-OR is not learnable. If you are given the first three entries in
> the truth table, you could not possibly generalize to the last entry
> with any confidence.

This renders the term "learnable" meaningless. Over the space of
boolean functions, your claim is equally true of all of them. Without
some kind of preference structure or bias over the space, both outputs
are equally likely for the last row, regardless of the rest of the
table. If you do allow such biases, I can pick one (say, a "parity
bias") that makes XOR "learnable". Absolutely, generalization is a big
important hard problem --- but it is different from learnability, and
it is not why XOR is famous.

| David Ackley            Cognitive Science Research Group |
| "To act or              Bell Communications Research Inc.|
|  to react,                      ackley@flash.bellcore.com|
|  that is the question"              ...!bellcore!ackley  |

------------------------------

Subject: Re: Step Function
From: apr@cbnewsl.ATT.COM (anthony.p.russo)
Organization: AT&T Bell Laboratories
Date: 30 Aug 89 12:55:37 +0000

[[ Regarding David Ackley's response ]]

> This renders the term "learnable" meaningless.

I agree that X-OR is learnable *with* bias. That, however, is not the
definition of "learnable" in the strictest sense. I won't attempt to
prove or disprove your claim regarding all boolean functions, but why
must boolean functions be learnable at all? When you saw X-OR for the
first time, you most likely *memorized* the truth table. If not, then
after *all four* entries, you determined the rule that if both inputs
are equal, output "0"; else output "1". Could you have determined this
rule with just three examples? No. How would you have chosen a bias?

Perhaps our definitions of "learnable" are different. Mine is that,
with a fraction of the possible samples, one can generalize to 100%
accuracy.
Otherwise, for example, if after 99 of 100 samples one cannot
correctly predict the last output with 100% confidence, nothing has
been learned at all about the function in question. If it does take
100% of the possibilities to learn something, then I claim it has not
been learned, but rather *memorized*.

-- Tony Russo   apr@cbnewsl.ATT.COM

------------------------------

Subject: Re: Step Function
From: mathew@jane.Jpl.Nasa.Gov (Mathew Yeates)
Organization: Image Analysis Systems Grp, JPL
Date: 30 Aug 89 16:23:45 +0000

Under this definition [[ Russo's ]], no function is learnable.
Considering a function as a set of ordered pairs, seeing some subset
of the ordered pairs can never lead to complete knowledge of the
entire set. At least for a mere mortal. A rigorous definition of
learnability from examples can be found in Valiant's "A Theory of the
Learnable".

------------------------------

Subject: Re: Step Function
From: dg1v+@andrew.cmu.edu (David Greene)
Organization: Graduate School of Industrial Administration, Carnegie
Mellon, Pittsburgh, PA
Date: 30 Aug 89 17:07:35 +0000

[[ Regarding Russo's definition of learnability ]]

I believe you are confusing 'sample' with 'census'. If the total
population of examples is 100 (and assuming no noise) then perhaps
your statement would be appropriate. However, if the 100 represents a
sample, then the issue of learnability after 99 examples from that
sample depends on the size of the total population and the degree of
confidence you require (e.g. Valiant's PAC model of learnability --
Valiant, L. G., "A Theory of the Learnable," Communications of the
ACM, 27(11):1134-1142, November 1984).

-David

--------------------------------------------------------------------
David Perry Greene      ||  ARPA:   dg1v@andrew.cmu.edu
GSIA / Robotics         ||          dpg@isl1.ri.cmu.edu
Carnegie Mellon Univ.   ||  BITNET: dg1v%andrew@vb.cc.cmu.edu
Pittsburgh, PA 15213    ||  UUCP:   !harvard!andrew.cmu.edu!dg1v
--------------------------------------------------------------------
"You're welcome to use my opinions, just don't get them all wrinkled."

------------------------------

Subject: Re: Step Function
From: pluto@beowulf.ucsd.edu (Mark E. P. Plutowski)
Organization: EE/CS Dept. U.C. San Diego
Date: 31 Aug 89 04:22:41 +0000

I would suggest you introduce one of two things into your definition
of "learnability": either 1) make it dependent on a particular
architecture, or 2) make it dependent upon a complexity measure, such
as the time needed to learn, the number of units, or the number of
weights. It makes sense to discuss learnability when we talk about a
set of concepts (e.g., functions) given a hypothesis space with which
to learn the concepts (e.g., threshold logic). Then, it is an issue
whether the concept in question can be represented within the
hypothesis space -- and then, we can discuss whether it can be learned
given a feasible amount of resources.

------------------------------

Subject: Re: Step Function
From: apr@cbnewsl.ATT.COM (anthony.p.russo)
Organization: AT&T Bell Laboratories
Date: 31 Aug 89 11:22:21 +0000

[[ Replying to Mathew Yeates ]]

Yes, I should stick to Valiant's paper. However, one thing: a known
function is, in most cases, more than a set of ordered pairs. It is a
set of ordered pairs related in some way.
Discover the relation, and no further examples are needed to have
complete knowledge of the entire set. The question of "learnability"
is whether the function, or rule, can possibly be extracted from a set
of samples *with probability of error less than some value*. This is
one mistake I made when I said "100% confidence" -- that's not
possible. However, getting back to X-OR, this probability of error is
0.5, clearly too high to consider it learnable.

Aside: someone commented that learnability should be defined dependent
on architecture. I don't see why.

-- Tony Russo   apr@cbnewsl.ATT.COM

------------------------------

Subject: Re: Step Function
From: pluto@beowulf.ucsd.edu (Mark E. P. Plutowski)
Organization: EE/CS Dept. U.C. San Diego
Date: 31 Aug 89 17:01:20 +0000

> ... someone commented that learnability should be defined dependent
> on architecture. I don't see why.

I mentioned this for concreteness' sake. More generally, you want to
know whether the set of hypotheses your learning protocol has at its
disposal can represent the function you wish to learn to a
satisfactory level of approximation - is the hypothesis space
sufficient to characterize the set of concepts you wish to learn? A
particular architecture can represent a subset of some concept space -
this is its hypothesis space - and it may be capable of effectively
learning only some subset of its hypothesis space. In the case you
mention, a single 2-input threshold gate cannot learn x-or because it
cannot represent x-or. A 2-input threshold-gate network with at least
1 hidden unit may still not be able to effectively (feasibly,
satisfactorily) learn x-or, perhaps due to the ineffectiveness of the
learning law. With the definition of learning you initially proposed,
we can say very little about learning - except that it will be
impossible for all practical purposes.

------------------------------

Subject: Re: Step Function
From: David H Ackley <wind!ackley@FLASH.BELLCORE.COM>
Organization: Bellcore, Morristown, NJ
Date: 31 Aug 89 18:02:23 +0000

In article <1697@cbnewsl.ATT.COM> apr@cbnewsl.ATT.COM
(anthony.p.russo) writes:
> Yes, I should stick to Valiant's paper. However, one thing: a known
> function is, in most cases, more than a set of ordered pairs. It is
> a set of ordered pairs related in some way.

By referring to "known functions" you are implicitly incorporating a
bias into your notion of learnability --- i.e., you are preferring
functions that have some (presumably simple) relation between input
and output. I say "presumably simple" because I imagine you want to
rule out properties like "appearing in the same truth table" as
sufficient grounds for being "related in some way". Can you make your
criteria for "known-ness" explicit?

| David Ackley            Cognitive Science Research Group |
| "There are              Bell Communications Research Inc.|
|  no facts,                      ackley@flash.bellcore.com|
|  only factors"                      ...!bellcore!ackley  |

------------------------------

Subject: Re: Step Function
From: apr@cbnewsl.ATT.COM (anthony.p.russo)
Organization: AT&T Bell Laboratories
Date: 01 Sep 89 13:52:34 +0000

I could have omitted the word "known" and had a perfectly legitimate
statement. That is, ANY deterministic FUNCTION is a set of related
ordered pairs, whether known or not.

> Can you make your criteria for "known-ness" explicit?

No.
You bring up a good point, because your argument is really, "What
functions shall we consider in the hypothesis space?" I can't tell
you; this appears to be getting very deep. On the surface, it seems
that biases are !necessary! for learning anything at all. If so, then
the biases are probably hard-wired and not learned, since they would
have to be learned in terms of other biases, etc. Does this make sense
to anyone else, or have I gone off the deep end?

-- Tony Russo   apr@cbnewsl.ATT.COM

------------------------------

Subject: Re: Step function
From: danforth@riacs.edu (Douglas G. Danforth)
Organization: Research Institute for Advanced Computer Science
Date: 01 Sep 89 22:18:45 +0000

[[ Regarding Tony Russo's previous message ]]

Two comments:

(1) If you will allow me to modify your words, it is interesting that
a bias can also be looked upon as a "basis". By choosing a biased view
of the world one does at least have a place to stand. If an organism
has a collection of biased views, then there is a possibility that the
collection can "span" a space and act as a basis for representing any
other biased view in that space.

(2) In a deep way we are all limited by the sensory space in which we
live. A function which spans a space larger than we are aware of can
only be approximated by functions within the smaller space. We are
blind to that which we do not know. We know we don't know some things.
However, there are things we don't know we don't know. Very humbling
(see Terry Winograd and Fernando Flores's book, Understanding
Computers and Cognition).

Doug Danforth
danforth@riacs.edu

------------------------------

Subject: Re: Step Function
From: bill@boulder.Colorado.EDU
Organization: University of Colorado, Boulder
Date: 03 Sep 89 18:56:49 +0000

Is it necessary to have a bias in order to be able to learn? Not
always. If the training set includes an example of every possible
input, then the device only needs to be able to "memorize" the correct
responses -- it doesn't need a preset bias. If, though, there are
inputs not included in the training set, then the device can only
produce responses by somehow "generalizing" from what it has seen, and
the rules it uses for generalization constitute a bias. (I actually
prefer the word "heuristic" to "bias". What I am saying here is that
any device that learns from incomplete data must use some sort of
heuristic.) For devices that learn solely from examples -- which
includes all the neural networks I've seen -- that's all there is to
it. It's not a deep question at all.

It may be more interesting to wonder about more "intelligent" devices,
which are capable of learning from commentary as well as examples.
("Commentary" is, e.g., your English teacher telling you that all
proper nouns should be capitalized.) Even this, though, does not
change the general principle. A comment may be thought of as a
constraint upon the set of possible responses. If all of the examples,
plus all of the constraints, do not force a unique response to every
possible input, then the device, in order to pick a response, must use
some sort of bias.
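[[ Editor's note: the X-OR exchange above, and this point about bias,
can be checked by brute force. The sketch below (Python, my choice; no
poster supplied code) enumerates all sixteen boolean functions of two
inputs and keeps those consistent with three rows of the XOR truth
table. Exactly two survive, one for each completion of the fourth row,
so a learner with no bias has no grounds to prefer either -- some
heuristic must break the tie. -PM ]]

    from itertools import product

    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    xor = {x: x[0] ^ x[1] for x in inputs}

    # A boolean function of two inputs is an assignment of one output
    # bit to each of the four input pairs: 2^4 = 16 functions in all.
    hypotheses = [dict(zip(inputs, outs))
                  for outs in product([0, 1], repeat=4)]

    # "Train" on the first three rows of the XOR table only.
    train = inputs[:3]
    consistent = [h for h in hypotheses
                  if all(h[x] == xor[x] for x in train)]

    print(len(consistent))                  # 2 hypotheses remain
    print([h[(1, 1)] for h in consistent])  # [0, 1]: both completions survive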
------------------------------

Subject: Re: Step Function
From: apr@cbnewsl.ATT.COM (anthony.p.russo)
Organization: AT&T Bell Laboratories
Date: 05 Sep 89 11:26:00 +0000

> Is it necessary to have a bias in order to be able to learn?
> Not always. [...]

From what I have thought about and read here, it seems that learning
actually constitutes learning *biases* and nothing else (unless it is
memorization). *IF* this is true (and I'm not sure it is), then for a
classifier, from an information-theoretic point of view, it makes the
most sense to give the network examples that are on the borderline of
a rule or class. These "borderline" patterns should contain more
information about biases than good examples of a class do. In some
ways this makes sense -- it is akin to telling the network only where
the boundaries between classes are.

-- Tony Russo   apr@cbnewsl.ATT.COM

------------------------------

Subject: Re: Re: Step Function
From: pluto@beowulf.ucsd.edu (Mark E. P. Plutowski)
Organization: EE/CS Dept. U.C. San Diego
Date: Tue, 05 Sep 89 20:12:11 +0000

> Not always. If the training set includes an example of every
> possible input, then the device only needs to be able to "memorize"
> the correct responses -- it doesn't need a preset bias.

This is a strong form of a learning bias. The device assumes that
everything it sees during training is relevant, and sufficient.

You might be interested in some approaches to learning theory in which
the device has access to a teacher that can provide more than just an
error signal... e.g., once a hypothesis is formed by the device, the
teacher, rather than just saying whether or not the hypothesis is
correct for given values of inputs, can provide counterexamples to the
device. Of course, this requires a teacher with expert knowledge, and
the ability to ascertain what hypothesis the device currently
entertains.

Applying this idea to neural networks is difficult. The question is
how to apportion this type of training signal to the appropriate units
in the network, and, once it is received by the pertinent units, what
to make of it. Any ideas on how to backpropagate such training
information? "Commentary" feedback can apply to the entire hypothesis
formed by the device, not just to its performance on a particular
input.

------------------------------

Subject: Re: Step Function
From: chrisley@kanga.uucp (Ron Chrisley UNTIL 10/3/88)
Organization: Xerox Palo Alto Research Center
Date: 06 Sep 89 01:05:30 +0000

[[ Tony Russo writes about borderline patterns, et alia ]]

This is exactly the learning procedure behind Kohonen's LVQ2 (Learning
Vector Quantization) algorithm. See his paper in ICNN '88. He derives
the optimality of this method from Bayesian arguments. I do not see
how the fact that generalization = bias implies the optimality of
learning the boundary conditions, and I would be very interested in
having you elaborate on why you think it might.

Ron Chrisley

------------------------------

Subject: Re: Re: Step Function
From: bill@boulder.Colorado.EDU
Organization: University of Colorado, Boulder
Date: Wed, 06 Sep 89 02:24:07 +0000

In principle, if you can come up with the training information, you
can use back-prop (or something very much like it) to apply it.
Back-prop is essentially gradient descent for an error function.
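[[ Editor's note: a minimal sketch of that remark (Python, my choice).
True back-prop computes the derivatives analytically by the chain
rule; finite differences stand in here, which makes plain that the
procedure is just gradient descent and that the error function is a
pluggable argument -- it can include any "comment" you can turn into a
number, such as the complexity penalty discussed below. The toy error
function and its penalty weight are invented for illustration. -PM ]]

    def numeric_gradient(error_fn, w, eps=1e-5):
        # Finite-difference estimate of d(error)/d(w_i) for each weight.
        grad = []
        for i in range(len(w)):
            up = list(w); up[i] += eps
            dn = list(w); dn[i] -= eps
            grad.append((error_fn(up) - error_fn(dn)) / (2 * eps))
        return grad

    def descend(error_fn, w, rate=0.1, steps=5000):
        # Gradient descent, w <- w - rate * dE/dw, for ANY scalar error.
        for _ in range(steps):
            g = numeric_gradient(error_fn, w)
            w = [wi - rate * gi for wi, gi in zip(w, g)]
        return w

    def example_error(w):
        # Squared error for a toy constraint (fit w0 + w1 = 1), plus a
        # small "complexity" term penalizing large weights.
        return (w[0] + w[1] - 1.0) ** 2 + 0.01 * sum(wi * wi for wi in w)

    print(descend(example_error, [5.0, -3.0]))  # settles near w0 = w1 = 0.5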
Usually the error function is either the total squared error in the
output layer for a randomly chosen input (this is "online" back-prop),
or the average total squared error for a set of inputs (which is
"batch-mode" back-prop). But as far as the mathematics is concerned,
the error function can be anything you please. For a neural network,
the "hypothesis" it forms is encoded by its weights, and any sort of
"comment" you please can be jammed into the error function, if only
you can formulate it as a numerical measure calculable from the
weights.

In fact, a few people have already been trying to do that sort of
thing. As you probably know, a persistent problem with back-prop is
that it tends to give networks that don't generalize very well to new
inputs. One hope for getting better performance is to augment the
error function with an extra term representing the "complexity" of the
network, since it seems intuitively reasonable that simpler networks
should generalize better. It isn't obvious how best to measure
complexity: maybe the sum of the magnitudes of all the weights; or
maybe the number of weights exceeding some fixed threshold; or maybe
something else. This is work in progress, but there have been some
promising beginnings. (Perhaps somebody directly involved can
comment.)

------------------------------

End of Neurons Digest
*********************