neuron-request@HPLABS.HP.COM (Neuron-Digest Moderator Peter Marvit) (11/13/88)
Neuron Digest   Thursday, 10 Nov 1988   Volume 4 : Issue 21

Today's Topics:
        How to ensure that back-propagation separates separable patterns
        reply on back-propagation fails to separate paper
        Re: reply on back-propagation fails to separate paper
        Re: reply on back-propagation fails to separate paper
        Re: reply on back-propagation fails to separate paper
        LMS fails even in separable cases
        ANNs for Image classification (Remote Sensing)
        INNS paper reference
        looking for papers
        Political science viewed as a Neural Net :-)
        polyvalued digits in classifier messages?
        Production rule LHS pattern matching with neural nets
        Response wanted on neural net user interfaces

Send submissions, questions, address maintenance and requests for old issues
to "neuron-request@hplabs.hp.com" or
"{any backbone,uunet}!hplabs!neuron-request"

------------------------------------------------------------

Subject: How to ensure that back-propagation separates separable patterns
From: Geoffrey Hinton <hinton@ai.toronto.edu>
Date: Thu, 27 Oct 88 15:39:44 -0400

[[ Editor's Note: This series of entries is from a mailing list of active
Connectionist researchers.  For you readers who are pursuing different
aspects of ANNs, you might be interested. -PM ]]

We have recently obtained the following results, and would like to know if
anyone has seen them previously written up.  The description is informal
but accurate:

There has been some interest in understanding the behavior of
back-propagation in feedforward nets with no hidden neurons.  This is a
particularly simple case that does not require the full power of
back-propagation (it can also be approached from the point of view of
perceptrons), but we believe that it provides useful insights into the
structure of the error surface, and that understanding this case is a
prerequisite for understanding the more general case with hidden layers.

In [1], the authors give examples illustrating the fact that while a
training set may be separable, a net performing back-propagation (gradient)
search may get stuck in a solution which fails to separate.  The objective
of this note is to point out that if one uses instead a threshold
procedure, where we do not penalize values "beyond" the targets, then such
counterexamples cease to exist, and in fact one has a convergence theorem
that closely parallels that for perceptrons: the continuous gradient
adjustment procedure is such that, from any initial weight configuration, a
separating set of weights is obtained in finite time.

We also show how to modify the example given in [2] to conclude that there
is a training set consisting of 125 binary vectors and a network
configuration for which there are nonglobal local minima, even if threshold
LMS is used.  In this example, the training set is of course not separable.

Finally, we compare our results to more classical pattern recognition
theorems, in particular to the convergence of the relaxation procedure for
threshold LMS with linear output units ([3]), and show that the continuous
adjustment approach is essential in the study of the nonlinear case.
Another essential difference from the linear case is that in the latter,
nonglobal local minima cannot occur even if the data is not separable.

References:

[1] Brady, M., R. Raghavan and J. Slawny, "Backpropagation fails to
    separate where perceptrons succeed," submitted for publication.
    Summarized version in "Gradient descent fails to separate," in Proc.
    IEEE International Conference on Neural Networks, San Diego,
    California, July 1988, Vol. I, pp. 649-656.

[2] Sontag, E.D., and H.J. Sussmann, "Backpropagation can give rise to
    spurious local minima even for networks without hidden layers,"
    submitted.

[3] Duda, R.O., and P.E. Hart, Pattern Classification and Scene Analysis,
    Wiley, New York, 1973.

Geoff Hinton
Eduardo Sontag
Hector Sussmann
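[[ Illustrative sketch (not taken from the paper above): one way to realize
the "threshold" cost the authors describe, for a net with no hidden units,
is to assign zero error whenever an output is already at or beyond its
target, and ordinary squared error otherwise.  The 0.8/0.2 targets, the
learning rate, and the function names below are assumptions made only for
illustration; the loop is a discretized stand-in for the continuous
gradient adjustment analyzed in the note. ]]

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def threshold_lms_epoch(w, b, X, labels, lr=0.5, hi=0.8, lo=0.2):
    """One pass of gradient adjustment using the thresholded squared error."""
    for x, label in zip(X, labels):
        y = sigmoid(np.dot(w, x) + b)
        target = hi if label == 1 else lo
        # No penalty if the output is already "beyond" its target.
        if (label == 1 and y >= target) or (label == 0 and y <= target):
            continue
        err = target - y                   # thresholded LMS error
        grad = err * y * (1.0 - y)         # derivative through the sigmoid
        w = w + lr * grad * x
        b = b + lr * grad
    return w, b

Repeated calls to threshold_lms_epoch play the role of the continuous
adjustment procedure described above, here in discrete-step form.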
------------------------------

Subject: reply on back-propagation fails to separate paper
From: Richard Lippmann <rpl@ll-sst.ARPA>
Date: Fri, 28 Oct 88 09:42:23 -0400

Geoff,

We came up with the same conclusion a while ago, when some people were
worried about the performance of back propagation, but never published it.
Back propagation with limits seems to converge correctly for those
contrived deterministic cases where minimizing total squared error does not
minimize the percentage of patterns classified correctly.  The limits cause
the algorithm to change from an LMS mean-square-minimizing approach to a
perceptron-like error-corrective approach.  Typically, however, the
difference in the percentage of patterns classified correctly between the
local and global solutions in those cases tends to be small.  In practice,
we found that convergence for the one contrived case we tested with limits
took rather a long time.

I have never seen this published, and it would be good to see your result
published with a convergence proof.  I have also seen little published on
the effect of limits on the performance of classifiers or on final weight
values.

Rich

------------------------------

Subject: Re: reply on back-propagation fails to separate paper
From: mehra@aquinas.csl.uiuc.edu (Pankaj Mehra)
Date: Fri, 28 Oct 88 11:44:02 -0500

Hi everybody.

When I heard Brady et al.'s talk at ICNN-88, I thought that the results
simply pointed out that a correct approach to classification may not give
equal importance to all training samples.  As is well known, classical
back-prop converges to a separating surface that depends on the LMS error
summed uniformly over all training samples.  I think that the new results
provide a case for attaching more importance to the elements on concept
boundaries.

I have been working on this problem (of trying to characterize "boundary"
elements) off and on, without much success.  Basically, geometric
characterizations exist, but they are too complex to evaluate.  What is
interesting, however, is the fact that the complexity of learning (and
hence the time for convergence) depends on the nature of the separating
surface.  Theoretical results also involve similar concepts, e.g.
VC-dimension.

Also notice that if one could somehow "learn" the characteristics of
boundary elements, then one could ignore a large part of the training
sample and still converge properly using a threshold procedure like that
suggested in Geoff's note.

Lastly, since back-prop is not constrained to always use LMS as the error
function, one wonders if there is an intelligent method (that can be
automated) for constructing error functions based on the complexity of the
separating surface.

- Pankaj Mehra
  {mehra%aquinas@uxc.cso.uiuc.edu}
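[[ Illustrative sketch (a heuristic reading of the "boundary elements" idea
above, not an algorithm from any of the postings or papers cited): treat
the training samples with the smallest margin under the current weights as
the boundary elements and give them extra weight in subsequent LMS updates.
The margin definition, the 20% cutoff, and the boost factor are
assumptions. ]]

import numpy as np

def boundary_emphasis(w, b, X, labels, boost=5.0, fraction=0.2):
    """Per-sample weights that emphasize small-margin (boundary) samples."""
    signs = np.where(np.asarray(labels) == 1, 1.0, -1.0)
    margins = signs * (X @ w + b)              # signed margin under current weights
    cutoff = np.quantile(margins, fraction)    # the smallest `fraction` of margins
    return np.where(margins <= cutoff, boost, 1.0)

def weighted_lms_epoch(w, b, X, labels, sample_weights, lr=0.1):
    """One LMS pass with non-uniform importance over training samples."""
    for x, label, sw in zip(X, labels, sample_weights):
        err = (1.0 if label == 1 else -1.0) - float(x @ w + b)
        w = w + lr * sw * err * x
        b = b + lr * sw * err
    return w, b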
------------------------------

Subject: Re: reply on back-propagation fails to separate paper
From: sontag@fermat.rutgers.edu (Eduardo Sontag)
Date: Sat, 29 Oct 88 10:50:52 -0400

Mark,

Re your question to Mehra about complexity of boundaries and VC dimension:
we just had a talk at Rutgers yesterday by Eric Baum
(baum@pupgg.princeton.edu) about this.  You should ask him for copies of
his papers on the subject, which also contain references to Valiant's and
other related work.

I think that the relation between Brady et al.'s results and ours can be
explained better in terms of threshold vs. nonthreshold costs than in terms
of relative weightings of terms.

-eduardo
Eduardo D. Sontag
Rutgers Center for Systems and Control (SYCON)
Rutgers University
(sontag@fermat.rutgers.edu)

------------------------------

Subject: Re: reply on back-propagation fails to separate paper
From: terry@cs.jhu.edu (Terry Sejnowski)
Date: Mon, 31 Oct 88 19:35:09 -0500

The proceedings of the CMU Connectionist Models Summer School have two
papers on the optimal choice of training set based on "critical" or
"boundary" patterns: Karen Huyser on Boolean functions and Subutai Ahmad on
the majority function.  The proceedings are available from Morgan Kaufmann.

Terry

------------------------------

Subject: LMS fails even in separable cases
From: neural!jsd@hplabs.HP.COM (John Denker)
Date: Wed, 02 Nov 88 11:25:35 -0500

Yes, we noticed that a Least-Mean-Squares (LMS) network, even with no
hidden units, fails to separate some problems.  Ben Wittner spoke at the
IEEE NIPS meeting in Denver, November 1987, describing !two! failings of
this type.  He gave an example of a situation in which LMS algorithms
(including ordinary versions of back-prop) are metastable, i.e. they fail
to separate the data for certain initial configurations of the weights.  He
went on to describe another case in which the algorithm actually !leaves!
the solution region after starting within it.

He also pointed out that this can lead to learning sessions in which the
categorization performance of back-prop nets (with or without hidden units)
is not a monotonically improving function of learning time.  Finally, he
presented a couple of ways of modifying the algorithm to prevent these
problems, and proved a convergence theorem for the modified algorithms.
One of the key ideas is something that has been mentioned in several recent
postings, namely, to have zero penalty when the training pattern is
well-classified or "beyond".

We cited Minsky & Papert as well as Duda & Hart; we believe they were
more-or-less aware of these bugs in LMS, although they never presented
explicit examples of the failure modes.

Here is the abstract of our paper in the proceedings, _Neural Information
Processing Systems -- Natural and Synthetic_, Denver, Colorado, November
8-12, 1987, Dana Anderson, Ed., AIP Press.  We posted the abstract back in
January '88, but apparently it didn't get through to everybody.  Reprints
of the whole paper are available.

       Strategies for Teaching Layered Networks Classification Tasks

                  Ben S. Wittner (1) and John S. Denker
                        AT&T Bell Laboratories
                      Holmdel, New Jersey 07733

ABSTRACT: There is a widespread misconception that the delta-rule is in
some sense guaranteed to work on networks without hidden units.  As
previous authors have mentioned, there is no such guarantee for
classification tasks.  We will begin by presenting explicit
counter-examples illustrating two different interesting ways in which the
delta rule can fail.  We go on to provide conditions which do guarantee
that gradient descent will successfully train networks without hidden
units to perform two-category classification tasks.  We discuss the
generalization of our ideas to networks with hidden units and to
multi-category classification tasks.

(1) Currently at NYNEX Science and Technology
    500 Westchester Ave., White Plains, NY 10604
------------------------------

Subject: ANNs for Image classification (Remote Sensing)
From: csrobe@cs.wm.edu (Chip Roberson)
Date: Sun, 06 Nov 88 21:33:15 -0500
Keywords: Artificial Neural Networks, Remote Sensing, Image Processing,
          Landsat, Land Use

I am beginning to do some image processing of remote sensing data for land
use studies.  Much of my work will involve taking Landsat and SPOT raster
data (as well as vector data from various sources) and classifying the
spectral signatures so we can determine how land segments are used
(vegetation, barren, stripped, fresh water, etc.).

What I am wondering is: what neural computation techniques have been used
in this domain?  I have seen ANNs used as maximum-likelihood classifiers
for pattern recognition, but has anyone done any specific applications that
are related to what I'm attempting to do?  Are there any interesting issues
regarding this kind of image processing?

I'm interested in receiving references, comments, wild theories, names,
phone numbers, caveats, random thoughts, and questions.

Many Thanks,
-chip

Chip Roberson                  csrobe@cs.wm.edu [128.239.1.30]
Dept. of Computer Science      csrobe@icase.edu
College of William and Mary    #csrobe@wmmvs.bitnet
Williamsburg, VA 23185         ...!uunet!pyrdc!gmu90x!wmcs!csrobe
(804) 253-4393
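[[ Illustrative sketch, offered as an assumption rather than a report of
prior work: one common formulation treats each pixel's band values (e.g.
seven Landsat TM bands) as a feature vector and has a small feed-forward
net emit one of several land-cover classes.  The band count, class list,
and layer sizes below are placeholders, and the random weights stand in
for weights that would be learned by back-propagation on labelled pixels. ]]

import numpy as np

rng = np.random.default_rng(0)
n_bands, n_hidden = 7, 10
classes = ["vegetation", "barren", "fresh water", "urban"]

W1 = rng.normal(scale=0.1, size=(n_bands, n_hidden))      # placeholder weights
W2 = rng.normal(scale=0.1, size=(n_hidden, len(classes)))

def classify(pixels):
    """pixels: array of shape (n_pixels, n_bands) of per-band reflectances."""
    h = np.tanh(pixels @ W1)          # hidden layer
    scores = h @ W2                   # one score per land-cover class
    return [classes[i] for i in scores.argmax(axis=1)]

print(classify(rng.random((3, n_bands))))   # placeholder pixels, not real data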
------------------------------

Subject: INNS paper reference
From: Rockwell.henr@Xerox.COM
Reply-To: rockwell.henr@xerox.com
Organization: Xerox
Date: 01 Nov 88 10:45:00 -0500

I'm looking for a reference for a paper presented at the INNS conference.
Someone spoke about it briefly during another talk.  The paper dealt with
boundary cases for training data and how network performance was increased
by including as many boundary cases as possible.  Does this ring any bells,
or are there other known papers on this subject?

------------------------------

Subject: looking for papers
From: aam9n@uvaee.ee.virginia.EDU (Ali Minai)
Organization: EE Dept, U of Virginia, Charlottesville
Date: 03 Nov 88 00:36:18 +0000

I am looking for two papers by Lapedes and Farber, and will be grateful if
some kind soul out there could send me copies.  The papers are:

        GENETIC DATA BASE ANALYSIS WITH NEURAL NETS

        NONLINEAR SIGNAL PROCESSING USING NEURAL NETS

Both are by A. Lapedes & R. Farber, and both were presented at the IEEE
Conference on Neural Information Processing Systems -- Natural and
Synthetic.  Any info about how to obtain the proceedings for this
conference will also be appreciated.

Thanks in advance,

Ali Minai
Dept. of Electrical Engg., Thornton Hall
University of Virginia, Charlottesville, VA 22901
aam9n@uvaee.ee.Virginia.EDU
aam9n@maxwell.acc.Virginia.EDU

------------------------------

Subject: Political science viewed as a Neural Net :-)
From: Scott.Fahlman@B.GP.CS.CMU.EDU
Date: Tue, 25 Oct 88 21:24:04 -0400

[[ Editor's note: I would have liked to get this off before last Tuesday,
but it's better late than never.  After all, Canada is having its election
shortly as well.  Given that most elections generate more heat than light,
I'd say a Boltzmann Machine model may be more apropos. -PM ]]

I was thinking about the upcoming U.S. election today, and it occurred to
me that the seemingly useless electoral college mandated by the U.S.
constitution might actually be of some value.

A direct democratic election is basically a threshold decision function
with lots of inputs and with fixed weights; add the electoral college and
you've got a layered network with fifty hidden units, each with a
non-linear threshold function.  A direct election can only choose a winner
based on some linearly separable function of voter opinions.  You would
expect to see complex issues forcibly projected onto some crude 1-D scale
(e.g. "liberal" vs. "conservative" or "wimp" vs. "macho").  With a
multi-layer decision network, the system should be capable of performing a
more complex separation of the feature space.  Though they lacked the
sophisticated mathematical theories available today, the designers of our
constitution must have sensed the severe computational limitations of
direct democracy and opted for the more complex decision system.

Unfortunately, we do not seem to be getting the full benefit of this added
flexibility.  What the founding fathers left out of this multi-layer
network is a mechanism for adjusting the weights in the network based on
how well the decision ultimately turned out.  Perhaps some form of
back-propagation would work here.  It might be hard to agree on a proper
error measure, but the idea seems worth exploring.  For example, everyone
who voted for Nixon in 1972 should have the weight of his future votes
reduced by epsilon; a large momentum term would be added to the reduction
for those people who had voted for Nixon previously.  The reduction would
be greater for voters in states where the decision was close (if any such
states can be found).

There is already a mechanism in place for altering the output weights of
the hidden units: those states that correlate positively with the ultimate
decision end up with more political "clout", and then with more
defense-related jobs.  This leads to an influx of people and ultimately to
more electoral votes for that state.  Some sort of weight-decay term would
be needed to prevent a runaway process in which all of the people end up in
California.

We might also want to add more cross-connections in the network.  At
present, each voter affects only one hidden unit, the state where he
resides.  This somewhat limits the flexibility of the learning process in
assigning arbitrary functions to the hidden units.  To fix this, we could
allow voters to register in more than one state.  George Bush has five or
six home states; why not make this option available to all voters?

More theoretical analysis of this complex system is needed.  Perhaps NSF
should fund a center for this kind of thinking.  The picture is clouded by
the observation that individual voters are not simple predicates: most of
them have a rudimentary capacity for simple inference, and in some cases
they even exhibit a form of short-term learning.  However, these minor
perturbations probably cancel out on the average, and can be treated as
noise in the decision units.  Perhaps the amount of noise can be
manipulated to give a crude approximation to simulated annealing.

-- Scott
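[[ A toy rendering of the analogy above, with made-up numbers rather than
real states or electoral votes: a direct election is a single threshold
unit over all voter inputs, while the electoral college is a layer of
per-state majority (threshold) units feeding a final unit weighted by
electoral votes. ]]

import numpy as np

rng = np.random.default_rng(1988)
voters_per_state = [3, 5, 7, 9, 11]           # hypothetical small states
electoral_votes  = [3, 4, 5, 6, 7]            # hypothetical output weights
ballots = [rng.integers(0, 2, n) for n in voters_per_state]   # 1 = candidate A

# Direct election: one linear-threshold decision over every voter.
direct = "A" if 2 * sum(b.sum() for b in ballots) > sum(voters_per_state) else "B"

# Electoral college: a hidden threshold unit per state, then a weighted threshold.
state_out = [1 if 2 * b.sum() > len(b) else 0 for b in ballots]  # state majorities
score = sum(ev for ev, s in zip(electoral_votes, state_out) if s)
college = "A" if 2 * score > sum(electoral_votes) else "B"

print(direct, college)    # the two decision rules need not agree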
------------------------------

Subject: polyvalued digits in classifier messages?
From: brian@caen.engin.umich.edu (Brian Holtz)
Date: Sat, 05 Nov 88 20:43:57 -0500

Does anyone know of any references that describe classifier systems whose
messages are composed of digits that may take more than two values?  For
instance, I want to use a genetic algorithm to train a classifier system to
induce lexical gender rules in Latin.  Has any work been done on managing
the complexity of going beyond binary-coded messages, or (better yet) on
encoding characters in messages in a useful, non-ASCIIish way?

Brian Holtz

------------------------------

Subject: Production rule LHS pattern matching with neural nets
From: pratt@paul.rutgers.edu (Lorien Y. Pratt)
Date: Fri, 28 Oct 88 16:54:11 -0400

Hello,

I'm interested in any work involving matching the left-hand side of a
production rule using a neural network.  I imagine one could also use a
representation language other than first-order logic for the LHS, one that
might be more amenable to a neural network approach.

Thanks (again!) for any pointers,
Lori

-------------------------------------------------------------------
Lorien Y. Pratt                        Computer Science Department
pratt@paul.rutgers.edu                 Rutgers University
(201) 932-4634                         Busch Campus
                                       Piscataway, NJ 08854

------------------------------

Subject: Response wanted on neural net user interfaces
From: cs162faw@sdcc18.ucsd.EDU (Phaedrus)
Organization: University of California, San Diego
Date: Wed, 02 Nov 88 02:36:39 +0000

I am an undergraduate student who is currently helping to design/test a
user interface for a neural-net software/hardware package.  The package in
question is from HNC; currently they are using a simple menu-based system
called Netset, and HNC is presently developing an applications language
called Axon.

I know there are many other products on the market that simulate neural
networks (whether through hardware or software).  I would be interested in
hearing from people who have used either menu-based or language-like
systems about what they liked or disliked.  Our particular goal is to make
the interface easy to use, so any comments about other software's user
interfaces would be VERY appreciated.  Also, please mention how proficient
you already are with neural networks, your programming skill, and any other
relevant personal information.  If anyone out there has used Netset,
comments about it would be especially appreciated.

Please reply by e-mail.  Thank you in advance.

No signature file....yet.
James Shaw

------------------------------

End of Neurons Digest
*********************