neuron-request@HPLABS.HP.COM ("Neuron-Digest Moderator Peter Marvit") (04/19/89)
Neuron Digest   Tuesday, 18 Apr 1989   Volume 5 : Issue 18

Today's Topics:
                        Cure for a PDP large .net bug.
                        Questions About Hopfield Networks
                        S. Pinker / A. Prince
                        Re: S. Pinker / A. Prince
                        Re: S. Pinker / A. Prince
                        Pinker and Prince article
                        Re: S. Pinker / A. Prince
                        Re: S. Pinker / A. Prince
                        Training
                        Re: Training
                        Re: Training
                        Re: Training
                        Re: Training
                        Re: Training
                        Re: Training
                        Re: Training
                        Re: Training
                        Re: Training

Send submissions, questions, address maintenance, and requests for old
issues to "neuron-request@hplabs.hp.com" or
"{any backbone,uunet}!hplabs!neuron-request".  ARPANET users can get old
issues via ftp from hplpm.hpl.hp.com (15.255.16.205).

------------------------------------------------------------

Subject: Cure for a PDP large .net bug.
From: jdg@antique.UUCP (John Gabbe)
Organization: AT&T Bell Labs, Holmdel, NJ
Date: Fri, 24 Feb 89 09:19:34 -0500

[[ Editor's Note: John and I had quite a time of it trying to push his
message through the mail, hence the apparent lateness. -PM ]]

James McClelland accepted this fix for later versions of the PDP program.
For those of you who have earlier versions, the following may be helpful.

April 20, 1988

PDP Software Inquiries
c/o Texts Manager
MIT Press
55 Hayward Street
Cambridge, MA 02142

Dear Sir:

The PDP software file ../src/weights.c contains errors in the segments
reproduced below.  The pertinent lines are marked with an E# or C in the
left margin, which serves as a key to the comments that follow the code.

* * * * * * * * * * * * * * * * * * * *

/* This file is part of the PDP software package.
   Copyright 1987 by James L. McClelland and David E. Rumelhart.
   Please refer to licensing information in the file license.txt,
   which is in the same directory with this source file and is
   included here by reference. */

/* file: weights.c
   read in network descriptions, and set up constraints.
   First version implemented by Elliot Jaffe.
   Date of last revision: 8-12-87/JLM. */

line 21 . . .
line 609
    if (nlinks) {
        constraints = (struct constraint *)
            emalloc((unsigned int)(sizeof(struct constraint) * (nlinks + 1)));
        for (i = 0; i < nlinks; i++) {
            constraints[i].num = 0;
            constraints[i].max = MAXCONSTRAINTS;
            constraints[i].cvec = ((float **)
                emalloc((unsigned int)(sizeof(float *) * MAXCONSTRAINTS)));
            constraints[i].ivec = ((float **)
                emalloc((unsigned int)(sizeof(float *) * MAXCONSTRAINTS)));
E1          for (j = 0; j < nunits; j++) {
                constraints[i].cvec[j] = NULL;
                constraints[i].ivec[j] = NULL;
            }
        }
    }
line 626
. . .
line 783
/* realloc positive_constraints, negative_constraints, and link constraints
   this is called whenever the allocated constraint lists run out of space
   for additional constraints   14-May-87 MAF / 15-May-87 JLM */

enlarge_constraints(con_index)
int con_index;
{
E2  if (con_index = ENLARGE_POS) {
C       maxpos += 100;
        positive_constraints = ((float **)
            erealloc((char *) positive_constraints,
C               (unsigned int) ((maxpos - 100) * sizeof(float *)),
                (unsigned int) (maxpos * sizeof(float *))));
    }
E3  else if (con_index = ENLARGE_NEG) {
C       maxneg += 100;
        negative_constraints = ((float **)
            erealloc((char *) negative_constraints,
C               (unsigned int) ((maxneg - 100) * sizeof(float *)),
                (unsigned int) (maxneg * sizeof(float *))));
    }
    else {
C       constraints[con_index].max += 100;
        constraints[con_index].cvec = ((float **)
            erealloc((char *) constraints[con_index].cvec,
C               (unsigned int) ((constraints[con_index].max - 100) * sizeof(float *)),
                (unsigned int) (constraints[con_index].max * sizeof(float *))));
        constraints[con_index].ivec = ((float **)
            erealloc((char *) constraints[con_index].ivec,
C               (unsigned int) ((constraints[con_index].max - 100) * sizeof(float *)),
                (unsigned int) (constraints[con_index].max * sizeof(float *))));
E4  }
}

* * * * * * * * * * * * * * * * * * * *

E1. At the line marked E1, MAXCONSTRAINTS items have just been emalloc'd,
but nunits items are NULL'ed.
When nunits > MAXCONSTRAINTS, unallocated memory is overwritten, destroying
(on Suns and Vaxen) malloc's bookkeeping and leading to a segmentation
error.  This line should be replaced by:

    for (j = 0; j < nunits && j < MAXCONSTRAINTS; j++) {

E2, E3. These should be compares, not assignments.  Replace `=' with `=='.
As the code is written, positive_constraints is always erealloc'd and
nothing else is ever erealloc'd.

E4. If it is necessary to NULL the pointers at E1, then it should be
necessary to NULL the additional space allocated here.

C. For consistency, all the 100's should be replaced by MAXCONSTRAINTS.
The following code inserted at E4 will then initialize the additional
space:

    for (j = constraints[con_index].max - MAXCONSTRAINTS;
         j < constraints[con_index].max; j++) {
        constraints[con_index].cvec[j] = NULL;
        constraints[con_index].ivec[j] = NULL;
    }

These comments are provided as a courtesy; no claims are made or
responsibility assumed regarding their correctness.

Yours truly,
John D. Gabbe
AT&T Bell Laboratories

copy to netnews comp.ai.neural-nets

P.S. In the official revision, McClelland introduced an additional
constant to implement comment C, instead of overloading MAXCONSTRAINTS as
is done above. - JDG 2/24/89.

------------------------------

Subject: Questions About Hopfield Networks
From: mmm@cup.portal.com (Mark Robert Thorson)
Organization: The Portal System (TM)
Date: Mon, 20 Mar 89 06:08:58 +0000

I have a question or two about Hopfield networks used as associative
memory.  Hopfield says that when multiple parallel networks are driven
from the same clue bus and drive the same output bus, only the network
containing the memory corresponding to the clue responds (assuming the
memory is stored in one of the networks, and only one of the networks).
Elsewhere, Hopfield says these networks always drive themselves to a
stable state.  My question is: is there some kind of "null" or
"indecision" stable state?
When multiple memory planes are used, what state do they settle into when
presented with a memory which they do not hold?

------------------------------

Subject: S. Pinker / A. Prince
From: "Roger Gonzales" <spam@sun.soe.clarkson.edu>
Date: Wed, 22 Mar 89 21:59:26 +0000

I just read a rather scathing article by Steven Pinker (MIT) and Alan
Prince (Brandeis) that tore PDP apart.  Has anyone seen any responses to
this article that defend Rumelhart and McClelland?  The article bummed me
out, and I'm not up enough on language processing to really debate what
they claim.

++ Roger Gonzalez ++ spam@clutx.clarkson.edu ++ Clarkson University ++

------------------------------

Subject: Re: S. Pinker / A. Prince
From: kortge@Portia.Stanford.EDU (Chris Kortge)
Organization: Stanford University
Date: Thu, 23 Mar 89 04:29:28 +0000

The article doesn't tear PDP apart at all (I assume you mean the one in
_Cognition_, published also as a book, "Connections and Symbols").  What
it does do is tear apart one very simple model of a complex phenomenon
(the learning of past tenses).  As far as I could tell, virtually all of
its criticisms could be answered with a multilayer network model; that
is, the faults of the R&M model derive mostly from the fact that it's
just a single-layer associative network with hand-wired representations.
As I understand it (having heard Dave Rumelhart's response), the R&M
model was never intended to be anything near a complete model of tense
acquisition -- rather, it was mainly supposed to demonstrate that
"rule-like" behavior can coexist with "special case" behavior in the same
set of connection weights.

If you want to read an article which _attempts_ to tear PDP _in general_
apart, read the one by Fodor and Pylyshyn in the same book.  It didn't
make much sense to me, but then I guess I don't have the MIT perspective.
If someone really wants to blast PDP, there are _real_ problems with it,
like the scaling of learning time, which make more sense to focus on than
the things F&P talk about.

Chris Kortge
kortge@psych.stanford.edu

------------------------------

Subject: Re: S. Pinker / A. Prince
From: "Roger Gonzalez,,," <sun.soe.clarkson.edu!spam@TCGOULD.TN.CORNELL.EDU>
Date: 23 Mar 89 21:46:49 +0000

> As far as I could tell, virtually all its criticisms
> could be answered with a multilayer network model; that is, the faults
> of the R&M model derive mostly from the fact that it's just a single
> layer associative network, with hand-wired representations.

Yeah, so I noticed after I waded through the last 50 pages.  My
impressions changed after I was done reading the article.  It seems to me
that they were getting picky about details that I don't think R&M ever
intended to be the God's truth about language processing.  Correct me if
I'm wrong, but weren't R&M just saying, "Look, here's a simple little
model that does a pretty good job with past tenses... and look, it even
seems to exhibit some of the behavior of children learning language"?
Some of P&P's accusations seemed pretty trivial anyway: "It can learn
rules found in no human language."  SOoooo?  (Or are they assuming the
ol' language acquisition device?)  Anyway, I'm reading the "nastier"
article you suggested right now.

++ Roger Gonzalez ++ spam@clutx.clarkson.edu ++ Clarkson University ++

------------------------------

Subject: Pinker and Prince article
From: gblee@maui.cs.ucla.edu
Organization: UCLA Computer Science Department
Date: Fri, 24 Mar 89 00:14:38 +0000

The article is:

Pinker, S. and Prince, A. (1988).  On language and connectionism:
Analysis of a parallel distributed processing model of language
acquisition.  Cognition, Vol. 28, pp. 73-193.

Actually, this volume of Cognition contains two other papers which attack
the PDP model, and it is published as a book.  The two other articles
are:

Fodor, J. and Pylyshyn, Z.
(1988).  Connectionism and cognitive architecture: A critical analysis.

Lachter, J. and Bever, T. G. (1988).  The relation between linguistic
structure and associative theories of language learning -- A constructive
critique of some connectionist learning models.

Geunbae Lee
AI Lab, UCLA

------------------------------

Subject: Re: S. Pinker / A. Prince
From: kavuri@cb.ecn.purdue.edu (Surya N Kavuri)
Organization: Purdue University Engineering Computer Network
Date: Thu, 23 Mar 89 22:25:56 +0000

There is another paper with a similar claim, "Gradient descent fails to
separate," by M. Brady and R. Raghavan.  The paper shows the failure of
BP on examples where there are no local minima.  They assert (and they
could be right, as such claims have been "romantic," as Minsky put it!)
that least-squares solutions do not minimize the number of
misclassifications.  They have examples where the perceptron does well
while gradient descent with a least-squares error fails.  They conclude
that the failure of gradient descent and least-squares error may be much
more widespread than presumed.

SURYA (FIAT LUX)

------------------------------

Subject: Re: S. Pinker / A. Prince
From: Allan F Randall <ncc!alberta!randall@uunet.uu.net>
Organization: U. of Alberta, Edmonton, AB
Date: 25 Mar 89 17:53:48 +0000

> If someone really wants to blast PDP, there are _real_
> problems with it, like scaling of learning time, which make more
> sense to focus on than the things F&P talk about.

I think the reason Fodor and Pylyshyn do not concentrate on those issues
is that they are criticizing the philosophy of connectionism as a general
approach to studying the mind, rather than attacking specific problems
with current systems.  The points they do discuss are all intended to be
general problems inherent in the philosophy of connectionism, which they
do not anticipate being solved.  There are a few aspects of their
reasoning that puzzle me; I thought I'd respond by posting my own
reactions to the article.
I would be interested in hearing from anyone who perhaps understands
their perspective better, particularly since I haven't read any
connectionist responses to Fodor and Pylyshyn (does anybody know of any?).

First of all, though, I think they do a reasonably good job of clearing
up a few contentious points concerning which issues are directly relevant
to the connectionist/symbolist debate and which are not.  For instance,
while parallelism is central to most connectionist systems, there is
nothing contrary to the classical symbolist view in massive parallelism.
The same goes for the idea of soft or fuzzy constraints.

I have two major problems with the rest of their article.  First, they
seem very limited in the types of connectionist systems they are willing
to discuss.  Most of the examples they give are of systems where each
node represents some concept or proposition.  Hence, they only discuss
connectionist systems that already have a lot in common with symbolic
systems.  They talk very little about distributed representations and
sub-symbolic processes.  This seems strange to me, since I would consider
these things to be the central justification for the connectionist
approach.  Fodor and Pylyshyn seem to be artificially limiting the
connectionist architecture to a narrow form of symbolism and then judging
it on its performance as a logic.  What they fail to realize is that it
is these very assumptions they use in judging PDP that connectionists are
calling into question in the first place.  *Of course* PDP, at least in
its current forms, fails as a general logical inference mechanism.  What
(many of) the new connectionists are saying is that these systems work
better at *certain types of tasks* than the classical systems.  They are
meant to address problems with the symbolist approach.  Yes, they fail
miserably at many things symbol systems do well, but this does not mean
we must choose one over the other.
This brings me to the other point, which I think is the key problem with
Fodor and Pylyshyn's approach.  They do not seem to consider the
possibility of using *both* approaches.  Their main argument is that
mental representations have a "combinatorial syntax and semantics" and
"structure sensitivity of processes."  The upshot of this is that to do
the sorts of things humans are good at, a system must have a systematic
way to generate mental representations that have a constituent structure.
Connectionist systems lack this language-like ability.  This is an
argument for a "Language of Thought."

Because of this emphasis on the language-like aspects of cognition, many
of Fodor and Pylyshyn's arguments are about the inability of PDP nets to
deal with language.  They then generalize to the rest of cognition.
While this is not entirely invalid, I think it really weakens their
argument, as language is the one aspect of cognition that seems to be the
most symbolic and the least connectionistic.

However, I would still agree with much of what they say.  It is true that
thought must have these properties.  Cognition must be more than the
statistical modelling of the environment.  But Fodor and Pylyshyn give
short shrift to the idea that both types of architectures will be needed
to handle all aspects of cognition.  Why could we not have a
connectionist system modelling the environment and creating distributed
representations that are used by a more classical symbolic processor?
(This is, of course, only one way of looking at hybrid systems.)  While
Fodor and Pylyshyn do spend a little time discussing this sort of thing,
it seems to be more of an afterthought rather than a central part of
their argument.  This seems strange, especially since this is where the
field of AI as a whole seems to be going.

In short, Fodor and Pylyshyn are extreme symbolists.
They believe in the classical symbolist view in its most extreme form:
physical symbol systems are a necessary and sufficient condition for
cognition.  Their article does a good job of arguing for the "necessary"
part, but pays little attention to the more central "sufficient" part.
Like the extreme connectionists, they seem convinced that we must choose
one or the other.  They show that a pure connectionist system could not
work and thus conclude that pure symbolism is the answer.  To give them
credit, while I disagree with their conclusions, I think they do a good
job of explaining why an intelligent system must display these properties
and why current connectionist architectures are insufficient on their own
to do the job.  They build a good case against extreme connectionism, but
fail to explain why this implies the other extreme.

To summarize, my problems with Fodor and Pylyshyn are:

i) they criticize connectionism as (a) a symbolic logic and (b) a
language, largely ignoring its other aspects as unimportant to cognition;

ii) they ignore hybrid connectionist/symbolist approaches.

Finally, I think Fodor and Pylyshyn simply have different intuitions than
I do.  They seem to feel that the statistical and distributed nature of
intelligence is really not very crucial, if it is there at all.  While I
disagree, I can certainly respect that.  But I was disappointed with
their article, because I didn't think it really addressed the issue.  I
would love to see a critique of connectionism that considered the
"sufficient" aspect of symbol systems as rigorously as they discussed the
"necessary" aspects.

Allan Randall
Dept. Computing Science
University of Alberta
Edmonton, AB

------------------------------

Subject: Training
From: "Roger Gonzalez" <spam@sun.soe.clarkson.edu>
Date: Sun, 19 Mar 89 20:28:03 +0000

Has anyone come up with a good way to train a net without knowing a
"target" in advance?
For example, if my net is in an undesirable state S1, and it is totally
clueless about how to get to state S2, I would like it to strengthen the
connections that do get it to state S2, and impede any others.  Now,
since it has no data at all to go on, and the only info it gets after one
pass is "nope, not in state 2 yet," what can I use as a target?  All the
learning methods I've studied so far are inadequate for this.  Thanks in
advance,

-Roger

++ Roger Gonzalez ++ spam@clutx.clarkson.edu ++ Clarkson University ++

------------------------------

Subject: Re: Training
From: hal@vicom.COM (Hal Hardenbergh)
Organization: Vicom Systems Inc., San Jose, Cal.
Date: Tue, 21 Mar 89 21:14:47 +0000

A colleague and I have tried several of the back-prop "speedup" methods.
All that we have tried do speed up convergence to some degree, as
measured by the number of epochs (training iterations).  However, none of
them reliably provides a speed improvement as measured by wall-clock
time; the ones which do sometimes provide a slight improvement do not do
so reliably.  It's sort of like varying the convergence and momentum
factors: depending on the random initialization of the weights and
biases, c1 and m1 will work better than c2 and m2, and vice versa.  As
long as we are simulating artificial neural nets in software (if
simulating is the right word here), does anyone know of a back-prop
speedup trick which reduces the wall-clock training time?

Hal Hardenbergh  [incl std dsclmr]  hal@vicom.com
Vicom Systems Inc  2520 Junction Ave  San Jose CA  (408) 432-8660

surely nobody here remembers the newsletter DTACK Grounded?

------------------------------

Subject: Re: Training
From: kavuri@cb.ecn.purdue.edu (Surya N Kavuri)
Organization: Purdue University Engineering Computer Network
Date: Wed, 22 Mar 89 04:07:05 +0000

This, I believe, could be tried using other inexact search methods with
BP.
REF: "Conjugate-gradient methods with inexact searches," in Mathematics
of Operations Research, vol. 3.

Gradient evaluation is an expensive computation in BP.  A gradient-reuse
algorithm was suggested whose idea is the following: gradients are reused
several times, until the resulting weight updates no longer lead to a
reduction in error.  Further, the use of batching makes the search
direction more accurate.  Batching means considering the total error (to
be minimized) as the sum of squared differences between the desired
output and the observed output, summed over all patterns.

SURYA (FIAT LUX)

------------------------------

Subject: Re: Training
From: wlp@calmasd.Prime.COM (Walter L. Peterson, Jr.)
Organization: Prime-Calma, San Diego R&D, Object and Data Management Group
Date: Wed, 22 Mar 89 17:23:01 +0000

State S2 is, in and of itself, the target.  That state, whatever it might
be, has to be represented to the machine in some manner.  The fact that
S2 might be a very long text string or some very large binary number or
some arbitrary bit pattern that you make up makes no difference; if you
know where you are and you know where you want to be, you can get there
(it might not be an easy trip, though :-) ).  The point is that you must
not confuse your interpretation of the "meaning" of the bit pattern with
the bit pattern itself.

As for your statement that "...since it has no data at all to go on...",
I submit that this is meaningless, not only for a neural net but for any
computer program.  Your connections and their weights, by themselves,
constitute "data to go on."

Given that the system starts in some arbitrary state S1, which must be
known or knowable to the system, and given that its goal is to get to
some other known or knowable state S2, there are three ways to proceed.
First, if there is some known transform that can be applied to state S1
to turn it into some other valid state Sn, then that transform must be
applied repeatedly to the current state, call it Sc, and then Sc must be
tested against S2 to see if we have found S2.  The existence of some
transform implies that there is some set of parameters upon which the
transform operates, and lacking any data to go on, the system must
perform a systematic, exhaustive search of the parameter space.

The second method is a variation on the first.  Lacking a known
transform, the best that you can do is apply random values to the things
that constitute the state and wait for S2 to turn up.  This is the
situation that you would have in your example, where you say that "...it
is totally clueless about how to get to S2...".  Totally clueless implies
that you don't even have an algorithm for transforming S1 into ANY other
valid state Sn.  This, of course, assumes that you have no means of
determining how "far away" the current state Sc is from S2.

If you do have some way of determining that distance, then what you've
got is, essentially, the gradient descent of back-propagation, which is
the third way of going from state S1 to state S2.  This "distance"
information (the error function in back-prop) and the connections with
their weights are, essentially, all the data that you need.

Walt Peterson
Prime - Calma San Diego R&D (Object and Data Management Group)
"The opinions expressed here are my own and do not necessarily reflect
those of Prime, Calma, or anyone else."
...{ucbvax|decvax}!sdcsvax!calmasd!wlp

------------------------------

Subject: Re: Training
From: mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel)
Organization: Princeton University, NJ
Date: Thu, 23 Mar 89 02:34:23 +0000

I'm using conjugate-gradient minimization in neural networks.
Each iteration of conjugate gradient gains significantly more on the
error surface than "standard" backprop, but then again, each iteration
requires an entire direction search and minimization, which may take
anywhere from 5 to 10 net evaluations (over the whole training set or
whatever).  It's not always clear whether or not it saves "wall-clock"
time.  Sometimes, when the function is "easy," it sniffs out the minimum
very quickly.  It has the advantage that you're not dependent at all on a
"learning parameter."  But, of course, it's significantly more difficult
to code than naive gradient descent, and can't possibly have any
neurological justification.

When I minimize the error surface with conjugate gradient, I usually take
the sum over most or all elements of the training set.  Conjugate
gradient is very good at seeking out local minima, and so I think some
precautions must be noted:

1) Don't try to minimize on one example only, and then try minimizing on
the next, and so on.  C.G. will very quickly find a minimum that
"memorizes" the first example, and then just as easily forget it and
memorize the second, etc., with very little generalization over many
examples.

2) C.G. will still find local minima, or regions where the error
decreases very slowly, even when training over all examples.  Thus I
don't train over the same examples for every iteration of C.G.; I train
over some large random subset for a while and then switch off, either
training over the whole set or another random subset, depending on how
well I did.

It has some advantages, though: it's very dogged, and doesn't screw up
royally very often.  I often find that I can do very well with regular BP
getting down towards a minimum, and then find that it screws up and
starts to climb back up the other side, and because the step size is too
large, can't find the narrow valley at the bottom.  C.G. almost always
will minimize your function, though it can get stuck in local minima more
often.
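[[ Editor's Note: For readers who want to experiment, here is a minimal
sketch of the conjugate-gradient update Matt describes, run on a 2-D
quadratic test function rather than a network error surface.  The matrix,
starting point, and iteration count are invented for illustration; on a
real error surface the closed-form step below would be replaced by a line
search (e.g., Brent's method) and the quadratic gradient by
back-propagated derivatives. -PM ]]

```c
#include <assert.h>
#include <math.h>

/* Minimize f(x) = 0.5*x'Ax - b'x over x in R^2 with Polak-Ribiere
   conjugate gradient.  For a quadratic, the optimal step length along a
   direction d has a closed form (a = -g'd / d'Ad), which stands in for
   the line minimization a network would need. */

static const double A[2][2] = {{3.0, 1.0}, {1.0, 2.0}};  /* SPD matrix */
static const double b[2]    = {1.0, 1.0};

static void grad(const double x[2], double g[2])   /* g = Ax - b */
{
    for (int i = 0; i < 2; i++)
        g[i] = A[i][0] * x[0] + A[i][1] * x[1] - b[i];
}

void cg_minimize(double x[2], int iters)
{
    double g[2], gold[2], d[2];

    grad(x, g);
    d[0] = -g[0];  d[1] = -g[1];          /* start with steepest descent */

    for (int k = 0; k < iters; k++) {
        double Ad[2] = { A[0][0] * d[0] + A[0][1] * d[1],
                         A[1][0] * d[0] + A[1][1] * d[1] };
        double dAd = d[0] * Ad[0] + d[1] * Ad[1];
        if (dAd <= 0.0)
            break;                        /* cannot happen for SPD A */
        double a = -(g[0] * d[0] + g[1] * d[1]) / dAd;  /* exact step */

        x[0] += a * d[0];  x[1] += a * d[1];
        gold[0] = g[0];  gold[1] = g[1];
        grad(x, g);

        /* Polak-Ribiere beta; clamping at 0 restarts with steepest
           descent when the surface stops looking quadratic. */
        double beta = (g[0] * (g[0] - gold[0]) + g[1] * (g[1] - gold[1]))
                    / (gold[0] * gold[0] + gold[1] * gold[1]);
        if (beta < 0.0)
            beta = 0.0;
        d[0] = beta * d[0] - g[0];
        d[1] = beta * d[1] - g[1];
    }
}
```

On this toy problem the minimum (x = (0.2, 0.4), the solution of Ax = b)
is reached in two iterations, the textbook CG property for an n-D
quadratic; the "5 to 10 net evaluations" Matt mentions are what the line
search costs when no closed-form step exists.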
Please note that I'm doing a "hard" task where I need to map continuous
values to continuous values with as high accuracy as possible.

>Gradient evaluation is an expensive computation in BP.  A gradient-reuse
>algorithm was suggested whose idea is the following: gradients are
>reused several times, until the resulting weight updates no longer lead
>to a reduction in error.

In this case, you're just doing a very naive line minimization in the
gradient direction.  You could do much better by using a good line-min
routine which can interpolate to the presumed minimum much faster.  (I
use the BRENT routine from Numerical Recipes for my C.G.)  I actually
implemented this line-minimization algorithm, but was disappointed in its
performance.  Regular BP usually did better.

>Further, the use of batching makes the search direction more accurate.
>Batching means considering the total error (to be minimized) as the sum
>of squared differences between the desired output and the observed
>output, summed over all patterns.

Yes, but if you add on the gradient each time, later examples get the
benefit of whatever global features have been learnt from earlier
examples.  I usually find that standard back-prop (update after each
pattern) with momentum is faster.  Indeed, because of its simplicity and
speed, I've found it hard to beat in real wall-clock time.  90% of a
back-prop program's time is spent in the forward-iteration and
gradient-computation subroutines.  Because back-prop with fixed steps is
so simple, it's very easy to double the complexity of the inner loops and
thus really slow down the program a lot.  I use standard back-prop for a
while before starting on C.G. -- when starting from a random position,
C.G. is usually pretty clueless and needs to get closer in.
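[[ Editor's Note: A minimal sketch of the per-pattern momentum update
Matt favors, reduced to a single linear unit learning y = 2x + 1 so that
the update rule stands out.  The learning rate, momentum factor, and
epoch count are invented for illustration; a real back-prop net applies
the same delta-w rule to every weight in every layer, with the error
derivatives propagated backward through the hidden units. -PM ]]

```c
#include <assert.h>
#include <math.h>

/* Per-pattern gradient descent with momentum on a single linear unit:
   out = w*x + bias, squared error E = 0.5*(out - target)^2.
   The update is  dw(t) = -lr * dE/dw + momentum * dw(t-1),
   i.e., each change is the current downhill step plus a fraction of the
   previous change, which smooths the per-pattern noise. */

void train_momentum(double *w, double *bias, int epochs)
{
    static const double xs[4]      = {0.0, 1.0, 2.0, 3.0};
    static const double targets[4] = {1.0, 3.0, 5.0, 7.0};  /* y = 2x+1 */
    const double lr = 0.02, momentum = 0.8;   /* illustrative values */
    double dw = 0.0, db = 0.0;                /* previous weight changes */

    for (int e = 0; e < epochs; e++) {
        for (int p = 0; p < 4; p++) {         /* update after each pattern */
            double err = (*w * xs[p] + *bias) - targets[p];
            dw = -lr * err * xs[p] + momentum * dw;
            db = -lr * err         + momentum * db;
            *w    += dw;
            *bias += db;
        }
    }
}
```

Starting from w = bias = 0, a few thousand epochs drive w toward 2 and
bias toward 1; the whole inner loop is a handful of multiply-adds per
pattern, which is why, as Matt says, fancier methods have trouble beating
it in wall-clock time.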
Matt Kennel
mbkennel@phoenix.princeton.edu

------------------------------

Subject: Re: Training
From: rich@bunny.UUCP (Rich Sutton)
Organization: GTE Laboratories, Waltham, MA
Date: Thu, 23 Mar 89 16:17:59 +0000

The learning methods that you are looking for are called "reinforcement
learning" methods.  They are less well known than supervised learning
methods, but hardly obscure.  And of course they are more powerful in the
sense you refer to -- they can learn what to do without a teacher that
already knows the correct answers.  I recommend the following papers.
-Rich Sutton

A. Barto, R. Sutton, & P. Brouwer, "Associative search network: A
reinforcement learning associative memory," Biological Cybernetics, 40,
1981, pp. 201-211.

A. Barto & R. Sutton, "Landmark learning: An illustration of associative
search," Biological Cybernetics, 42, 1981, pp. 1-8.

Barto, Sutton, and Anderson, "Neuronlike elements that can solve
difficult learning control problems," IEEE Trans. on Systems, Man, and
Cybernetics, 13, 1983, pp. 835-846.

Williams, R. J., "Toward a theory of reinforcement-learning connectionist
systems," Technical Report NU-CCS-88-3, College of Computer Science,
Northeastern University, Boston, Massachusetts, 1988.

My dissertation is also directly relevant, and I invite you to write me
for a copy at GTE Labs, Waltham MA 02254.

------------------------------

Subject: Re: Training
From: kavuri@cb.ecn.purdue.edu (Surya N Kavuri)
Organization: Purdue University Engineering Computer Network
Date: Fri, 24 Mar 89 05:31:02 +0000

Besides gradient and conjugate-gradient methods, there are others one
could try.  There are methods known as quasi-Newton methods that are
known to perform much better in non-linear optimization.  This is because
they use higher-order derivatives besides the first (as in gradient
methods), and thus use more knowledge of the objective function contours.
Second and higher-order derivatives can be thought of as indicators of
the error (surfaces) arising from gradient application.  This additional
knowledge is used to achieve much more rapid convergence.  The
computational difficulty with quasi-Newton approaches is the evaluation
of the Hessian.  There are inexpensive updating methods such as the BFGS
(Broyden-Fletcher-Goldfarb-Shanno) algorithm.  (Most NLP books should
cover this.)

Surya

------------------------------

Subject: Re: Training
From: mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel)
Organization: Princeton University, NJ
Date: Sat, 25 Mar 89 02:19:57 +0000

I don't think you want to use quasi-Newton methods such as BFGS for most
neural-net problems.  They require O(N^2) storage, where N is the number
of _weights_, and each iteration requires at least O(N^2) operations.  If
you're dealing with small N, this is usually insignificant next to the
time to evaluate your functions, but for large N it might be a problem.
I use conjugate gradient, which doesn't have this problem but probably
requires more function evaluations than BFGS by a moderate, but not huge,
factor.

Matt K.
mbkennel@phoenix.princeton.edu

------------------------------

Subject: Re: Training
From: "Roger Gonzelez" <spam@sun.soe.clarkson.edu>
Date: Mon, 03 Apr 89 21:29:15 +0000

Let me restate my question on training in a slightly different way.
Since I'm an undergraduate doing research in NN and I can get away with
it, my "problem" is not one of much importance or practicality.  Here it
is:

There is a small organism in my pseudo-universe that has a life that
consists mainly of swimming up or down, eating things, and dodging
annoying predators that nibble at it.  The shallow waters are safe, but
all the food (and predators) are down deeper.  There are other stimuli as
well, but these will suffice to give you the gist of what I'm playing
with.  Let's assume that our friend is in deep water (literally) and a
predator is munching on him.
(It's never fatal, but the "pain" and "fear" inputs are bothering it.)
The correct solution to get out of this state is to swim up.  However, I
don't want to have to tell it this.  The solution that I'm playing with
right now (suggested to me via mail) is something I'm told is called
"weakly supervised learning."  I haven't implemented much of this
solution yet, so I don't know how well it will work.  The basic (and
obvious) method seems to be to negatively reinforce any attempt that
didn't get it out of the undesirable state.  I can't believe I didn't
think of this on my own.  Any other suggestions?

- Roger

++ Roger Gonzalez ++ spam@clutx.clarkson.edu ++ Clarkson University ++

------------------------------

Subject: Re: Training
From: "John B. Nagle" <shelby!glacier!jbn@DECWRL.DEC.COM>
Organization: Stanford University
Date: 04 Apr 89 04:29:04 +0000

This is a vivarium approach to AI, and a good one.  See Ann Marion's
classic "aquarium" program.  Mike Travers's recent MS thesis at MIT
offers some insight on current thinking in this area.

Worth considering for a system like yours, where you really want to
cold-start with no built-in knowledge at all, is to add some analogue of
"evolution" to the system.  The simulator should cause the fish to die if
they don't get enough food or get bitten too much.  Fish that thrive
should reproduce, with some tweaking of the parameters of the offspring
to simulate mutation.  Provided that the initial environment is not so
hostile that all the fish die -- which can be handled by arranging the
simulator so that the predators are very few until the population rises
above some level -- the fish should become more effective over time.
Once it's working, let it run over a weekend and see what happens.

John Nagle

------------------------------

End of Neurons Digest
*********************