clutx.clarkson.edu (Roger Gonzalez) (03/20/89)
Has anyone come up with a good way to train a net without knowing a "target" in advance? For example, if my net is in an undesirable state S1, and it is totally clueless about how to get to state S2, I would like it to improve the connections that do get it to state S2, and impede any others. Now, since it has no data at all to go on, and the only info it gets after 1 pass is "nope, not in state 2 yet", what can I use as a target? All the learning methods I've studied so far are inadequate for this. Thanks in advance, -Roger ++ Roger Gonzalez ++ spam@clutx.clarkson.edu ++ Clarkson University ++ "Just like I've always said; there's nothing an agnostic can't do if he's not sure he believes in anything or not!" - Monty Python
bph@buengc.BU.EDU (Blair P. Houghton) (03/21/89)
In article <2698@sun.soe.clarkson.edu> spam@clutx.clarkson.edu writes: >Has anyone come up with a good way to train a net >without knowing a "target" in advance? [...] >All the learning methods I've studied so far >are inadequate for this. Sounds like good ol' negative reinforcement to me. You could send it to a Zen Buddhist monastery, or an English public school... Have you tried just getting a degree in education and coding it? --Blair "...using C--, the 'Object-Lesson' C..." P.S. 8-D
hal@vicom.COM (Hal Hardenbergh) (03/22/89)
In article <2351@buengc.BU.EDU> bph@buengc.bu.edu (Blair P. Houghton) writes: >In article <2698@sun.soe.clarkson.edu> spam@clutx.clarkson.edu writes: >>Has anyone come up with a good way to train a net >>without knowing a "target" in advance? >[...] >>All the learning methods I've studied so far >>are inadequate for this. > >Sounds like good ol' negative reinforcement to me. > >You could send it to a Zen Buddhist monastery, or an English public >school... > >Have you tried just getting a degree in education and coding it? > > --Blair > "...using C--, the 'Object-Lesson' C..." > >P.S. 8-D A colleague and I have tried several of the back-prop "speedup" methods. All that we have tried do speed up convergence, to some degree, as measured by the number of epochs (training iterations). However, none of them reliably provides a speed improvement as measured by wall-clock time; at best they give a slight, occasional improvement. It's much like varying the convergence and momentum factors: depending on the random initialization of the weights and biases, c1 and m1 will work better than c2 and m2, and vice versa. As long as we are simulating artificial neural nets in software (if simulating is the right word here), does anyone know of a back-prop speedup trick which reduces the wall-clock training time? Hal Hardenbergh [incl std dsclmr] hal@vicom.com Vicom Systems Inc 2520 Junction Ave San Jose CA (408) 432-8660 surely nobody here remembers the newsletter DTACK Grounded?
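[Hal's epochs-versus-wall-clock point is easy to reproduce. Below is a minimal modern sketch (Python/NumPy, obviously not anything from this thread; all names invented) of plain batch back-propagation on XOR. `lr` and `mu` stand in for the convergence and momentum factors he mentions; changing them, or the random seed, shows how erratically the epoch count moves.]

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # XOR inputs
T = np.array([[0.], [1.], [1.], [0.]])                  # XOR targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, b1, W2, b2):
    H = sigmoid(X @ W1 + b1)
    return H, sigmoid(H @ W2 + b2)

# random initialization of weights and biases (as Hal notes, this alone
# can decide which (lr, mu) pair wins)
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)
vW1, vb1 = np.zeros_like(W1), np.zeros_like(b1)
vW2, vb2 = np.zeros_like(W2), np.zeros_like(b2)

lr, mu = 0.3, 0.7            # the "convergence" and "momentum" factors

_, Y = forward(W1, b1, W2, b2)
err0 = float(((Y - T) ** 2).sum())       # error before training

for epoch in range(10000):
    H, Y = forward(W1, b1, W2, b2)
    dY = (Y - T) * Y * (1 - Y)           # output-layer delta
    dH = (dY @ W2.T) * H * (1 - H)       # hidden-layer delta
    # momentum update: step = mu * previous step - lr * gradient
    vW2 = mu * vW2 - lr * (H.T @ dY); W2 += vW2
    vb2 = mu * vb2 - lr * dY.sum(0);  b2 += vb2
    vW1 = mu * vW1 - lr * (X.T @ dH); W1 += vW1
    vb1 = mu * vb1 - lr * dH.sum(0);  b1 += vb1

_, Y = forward(W1, b1, W2, b2)
err1 = float(((Y - T) ** 2).sum())       # error after training
print(err0, err1)
```

[Timing this loop against any "speedup" variant, rather than counting epochs, is the comparison Hal is asking about.]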
efrethei@afit-ab.arpa (Erik J. Fretheim) (03/22/89)
In article <1577@vicom.COM> hal@vicom.COM (Hal Hardenbergh (236)) writes: >In article <2351@buengc.BU.EDU> bph@buengc.bu.edu (Blair P. Houghton) writes: >>In article <2698@sun.soe.clarkson.edu> spam@clutx.clarkson.edu writes: >>>Has anyone come up with a good way to train a net >>>without knowing a "target" in advance? >>[...] >>>All the learning methods I've studied so far >>>are inadequate for this. >> >>Sounds like good ol' negative reinforcement to me. >> >>You could send it to a Zen Buddhist monastery, or an English public >>school... >> >>Have you tried just getting a degree in education and coding it? > > >As long as we are simulating artificial neural nets in software (if simulating "Emulating," I believe, is the correct term, if not the term of choice. Of course, it seems rather natural that speeding up backprop is, like all the rest of pattern recognition, an art rather than a science. After all, if the results were predictable, would it be any fun any more? .signature and .disclaimer to be invented someday.
kavuri@cb.ecn.purdue.edu (Surya N Kavuri ) (03/22/89)
In article <1577@vicom.COM>, hal@vicom.COM (Hal Hardenbergh) writes: > > A colleague and I have tried several of the back-prop "speedup" methods. All > that we have tried do speed up convergence, to some degree, as determined by > the number of epochs (training iterations). However, none of them reliably > provide a speed improvement as measured by wall clock time. > > The ones which do (sometimes) provide a slight improvement in wall clock time > do not do so reliably. It's sort of like varying the convergence and momentum > factors. Depending on the random initialization of the weights and biases, > c1 and m1 will work better than c2 and m2, and vice versa. > > As long as we are simulating artificial neural nets in software (if simulating > is the right word here), does anyone know of a back-prop speedup trick which > reduces the wall-clock training time? > > Hal Hardenbergh [incl std dsclmr] hal@vicom.com This, I believe, could be tried using other inexact search methods with BP. REF: Conjugate-gradient methods with inexact searches, in Mathematics of Operations Research, vol. 3. Gradient evaluation is an expensive computation in BP. A gradient-reuse algorithm has been suggested, whose idea is the following: gradients are reused several times until the resulting weight updates no longer lead to a reduction in error. Further, the use of batching makes the search direction more accurate. Batching means: considering the total error (to be minimized) as the sum of squared differences between the desired and observed outputs, summed over all patterns. SURYA (FIAT LUX)
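[The gradient-reuse idea can be sketched as follows (an illustrative toy, not the algorithm from the cited paper): compute the batch gradient once, then keep stepping along it until the total squared error stops decreasing, and only then pay for a fresh gradient evaluation. The quadratic objective here stands in for a network's error surface.]

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 5))
t = rng.normal(size=20)

def error(w):        # "batching": total squared error over all patterns
    return float((A @ w - t) @ (A @ w - t))

def gradient(w):     # the expensive evaluation we want to reuse
    return 2.0 * A.T @ (A @ w - t)

w = np.zeros(5)
err_start = error(w)
step = 0.005

for outer in range(100):
    g = gradient(w)          # one expensive gradient evaluation...
    while True:              # ...reused for as many updates as it pays off
        trial = w - step * g
        if error(trial) >= error(w):
            break            # no further reduction: time for a fresh gradient
        w = trial

err_end = error(w)
print(err_start, err_end)
```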
wlp@calmasd.Prime.COM (Walter L. Peterson, Jr.) (03/23/89)
In article <2698@sun.soe.clarkson.edu>, spam@sun.soe!clutx.clarkson.edu (Roger Gonzalez) writes: > Has anyone come up with a good way to train a net > without knowing a "target" in advance? For example, > if my net is in an undesirable state S1, and it is > totally clueless about how to get to state S2, I would > like it to improve the connections that do get it to > state S2, and impede any others. Now, since it has > no data at all to go on, and the only info it gets after > 1 pass is "nope, not in state 2 yet", what can I use as > a target? All the learning methods I've studied so far > are inadequate for this. > State S2 is, in and of itself, the target. That state, whatever it might be, has to be represented to the machine in some manner. The fact that S2 might be a very long text string, some very large binary number, or some arbitrary bit pattern that you make up makes no difference: if you know where you are, and you know where you want to be, you can get there ( it might not be an easy trip, though :-) ). The point is that you must not confuse your interpretation of the "meaning" of the bit pattern with the bit pattern itself. As for your statement that "...since it has no data at all to go on...", I submit that this is meaningless, not only for a neural net, but for any computer program. Also, your connections and their weights, by themselves, constitute 'data to go on'. Given that the system starts in some arbitrary state S1, which must be known or knowable to the system, and given that its goal is to get to some other known or knowable state S2, there are three ways to proceed. First, if there is some known transform that can be applied to state S1 to turn it into some other valid state Sn, then that transform must be continually applied to the current state, call it Sc, and then Sc must be tested against S2 to see if we have found S2.
The existence of some transform implies that there is some set of parameters upon which the transform operates, and lacking any data to go on, the system must perform a systematic exhaustive search of the parameter space. The second method is a variation on the first. Lacking a known transform, the best that you can do is apply random values to the things that constitute the state and wait for S2 to turn up. This is the situation that you would have in your example, where you say that "...it is totally clueless about how to get to S2...". Totally clueless implies that you don't even have an algorithm for transforming S1 into ANY other valid state, Sn. This, of course, assumes that you have no means of determining how "far away" the current state Sc is from S2. If you have some way of determining that distance, then what you've got is, essentially, the gradient descent of back-propagation, which is the third way of going from state S1 to state S2. This "distance" information ( the error function in back-prop ) and the connections with their weights are, essentially, all the data that you need. -- Walt Peterson. Prime - Calma San Diego R&D (Object and Data Management Group) "The opinions expressed here are my own and do not necessarily reflect those of Prime, Calma, nor anyone else." ...{ucbvax|decvax}!sdcsvax!calmasd!wlp
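[Walt's third case, where a "distance to S2" is available and plays the role of the error function, can be illustrated with a toy in which the state is a bit string and the distance is the Hamming distance. Both the goal pattern and the metric are invented for the example.]

```python
S2 = [1, 0, 1, 1, 0, 0, 1, 0]     # the goal state, as a made-up bit pattern

def distance(s):
    # Hamming distance: the "how far away" measure Walt describes
    return sum(a != b for a, b in zip(s, S2))

state = [0] * 8                   # the starting state S1
while distance(state) > 0:
    # try single-bit changes; keep the first one that reduces the distance
    for i in range(len(state)):
        trial = state[:]
        trial[i] ^= 1
        if distance(trial) < distance(state):
            state = trial
            break

print(state)
```

[Without the distance signal, the loop above degenerates into Walt's second method: random flailing until S2 happens to turn up.]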
mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) (03/23/89)
In article <769@cb.ecn.purdue.edu> kavuri@cb.ecn.purdue.edu writes: >In article <1577@vicom.COM>, hal@vicom.COM (Hal Hardenbergh) writes: >> [...] >> As long as we are simulating artificial neural nets in software (if simulating >> is the right word here), does anyone know of a back-prop speedup trick which >> reduces the wall-clock training time? > > This, I believe, could be tried using other inexact search methods >with BP. >REF: Conjugate-gradient methods with inexact searches, in Mathematics of >Operations Research vol. 3 I'm using conjugate-gradient minimization in neural networks. Each iteration of conjugate gradient gains significantly more on the error surface than "standard" backprop, but then again, each iteration requires an entire direction search & minimization, which may take anywhere from 5 to 10 net evaluations (over the whole training set or whatever). It's not always clear whether or not it saves "wall-clock" time. Sometimes, when the function is "easy", it sniffs out the minimum very quickly. It has the advantage that you're not dependent at all on a "learning parameter". But, of course, it's significantly more difficult to code than naive gradient descent, and can't possibly have any neurological justification.
When I minimize the error surface w/ conjugate gradient, I usually take the sum over most or all elements of the training set. Conjugate gradient is very good at searching out local minima, and so I think some precautions must be noted: 1) Don't try to minimize on one example only, and then try minimizing on the next and so on. C.G. will very quickly find a minimum that "memorizes" the first example, and then just as easily forget it and memorize the second, etc., with very little generalization over many examples. 2) C.G. will still find local minima or regions where the error decreases very slowly, even training over all examples. Thus I don't train over the same examples for every iteration of C.G.; I train over some large random subset for a while and then switch off, either training over the whole set or another random subset, depending on how well I did. It has some advantages though: it's very dogged, and doesn't screw up royally very often. I often find that I can do very well with regular BP getting down towards a minimum, and then find that it screws up and starts to climb back up the other side, and because the step size is too large, can't find the narrow valley at the bottom. C.G. almost always will minimize your function, though it can get stuck in local minima more often. Please note that I'm doing a "hard" task where I need to map continuous values to continuous values with as high accuracy as possible. > >Gradient evaluation is an expensive computation in BP. A gradient-reuse >algorithm was suggested whose idea is the following: >Gradients are reused several times until the resulting weight updates no >longer lead to a reduction in error. In this case, you're just doing a very naive line-minimization in the gradient direction. You could do much better by using a good line-min routine which can interpolate to the presumed minimum much faster. (I use the BRENT from Numerical Recipes for my C.G.)
I actually implemented this line-minimization algorithm, but was disappointed in its performance. Regular BP usually did better. > >Further, usage of batching makes the search direction more accurate. >Batching means: >Considering the total error (to be minimized) as the sum of squared differences > between the desired output and the observed, summed over all patterns. > Yes, but if you add on the gradient each time, later examples get the benefit of whatever global features have been learnt in earlier examples. I usually find that standard back-prop (update after each pattern) w/ momentum is faster. Indeed, because of its simplicity & speed, I've found it hard to beat in real wall-clock time. 90% of a back-prop program's time is spent in the forward iteration & gradient computation subroutines. It's very easy to double the complexity of the inner loops, b/c back-prop w/ fixed steps is so simple, and thus really slow down the program a lot. I use standard back-prop for a while before starting on C.G. -- when starting from a random position, C.G. is usually pretty clueless and needs to get closer in. > > SURYA > (FIAT LUX) Matt Kennel mbkennel@phoenix.princeton.edu
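[For the curious, the conjugate-gradient-with-line-minimization scheme Matt describes can be sketched as below: Polak-Ribiere CG, with SciPy's `minimize_scalar` standing in for the Numerical Recipes BRENT routine (an anachronism, purely to keep the sketch short). The quadratic error function is a stand-in for a network's error surface; none of these names are from the thread.]

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
A = rng.normal(size=(30, 10))
t = rng.normal(size=30)

def f(w):                     # stand-in for the net's total squared error
    return float((A @ w - t) @ (A @ w - t))

def grad(w):                  # stand-in for the back-propagated gradient
    return 2.0 * A.T @ (A @ w - t)

w = np.zeros(10)
err_start = f(w)
g = grad(w)
d = -g                        # first direction: steepest descent
for it in range(20):
    # Brent line minimization along d (the "5 to 10 net evaluations")
    alpha = minimize_scalar(lambda a: f(w + a * d), method='brent').x
    w = w + alpha * d
    g_new = grad(w)
    if g_new @ g_new < 1e-12:
        break                 # gradient vanished: at (or near) the minimum
    # Polak-Ribiere formula for the next conjugate direction
    beta = max(0.0, g_new @ (g_new - g) / (g @ g))
    d = -g_new + beta * d
    g = g_new

print(err_start, f(w))
```

[Note there is no learning parameter anywhere, which is the advantage Matt points out; the price is all the extra function evaluations inside the line search.]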
bph@buengc.BU.EDU (Blair P. Houghton) (03/23/89)
In article <997@afit-ab.arpa> efrethei@blackbird.afit.af.mil (Erik J. Fretheim) writes: >In article <1577@vicom.COM> hal@vicom.COM (Hal Hardenbergh (236)) writes: >>In article <2351@buengc.BU.EDU> bph@buengc.bu.edu (Blair P. Houghton) writes: >>> >>>You could send it to a Zen Buddhist monastery, or an English public >>>school... >> >>As long as we are simulating artificial neural nets in software (if simulating > >"Emulating" I belive is the correct term, if not the term of choice. "Modelling." --Blair "What is the sound of one Usener posting?"
rich@bunny.UUCP (Rich Sutton) (03/24/89)
In article <2698@sun.soe.clarkson.edu> spam@clutx.clarkson.edu writes: > Has anyone come up with a good way to train a net > without knowing a "target" in advance? ... > All the learning methods I've studied so far > are inadequate for this. The learning methods that you are looking for are called "reinforcement learning" methods. They are less well known than supervised learning methods, but hardly obscure. And of course they are more powerful in the sense you refer to -- they can learn what to do without a teacher that already knows the correct answers. I recommend the following papers. -Rich Sutton A. Barto, R. Sutton, & P. Brouwer, ``Associative search network: A reinforcement learning associative memory," Biological Cybernetics, 40, 1981, pp. 201--211. A. Barto & R. Sutton, ``Landmark learning: An illustration of associative search," Biological Cybernetics, 42, 1981, pp. 1--8. Barto, Sutton, and Anderson, ``Neuronlike elements that can solve difficult learning control problems,'' IEEE Trans. on Systems, Man, and Cybernetics, 13, 1983, pp. 835--846. Williams, R. J., ``Toward a theory of reinforcement-learning connectionist systems,'' Technical Report NU-CCS-88-3, College of Computer Science, Northeastern University, Boston, Massachusetts, 1988. My dissertation is also directly relevant and I invite you to write me for a copy at GTE Labs, Waltham MA 02254.
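[The flavor of these reinforcement-learning methods -- no target output is ever given, only a scalar reward after each action -- can be conveyed with a toy two-armed bandit. This is illustrative only, not an algorithm from any of the cited papers; all names and numbers are invented.]

```python
import random

random.seed(0)
p = 0.5                       # probability of choosing action 1
alpha = 0.05                  # learning rate

def reward(action):
    # the environment: action 1 pays off more often (unknown to the learner)
    return 1.0 if random.random() < (0.8 if action == 1 else 0.2) else 0.0

for trial in range(2000):
    action = 1 if random.random() < p else 0
    r = reward(action)
    # reinforcement update: shift p toward the action just taken when the
    # reward beats a crude baseline of 0.5, away from it when it doesn't
    direction = 1.0 if action == 1 else -1.0
    p += alpha * (r - 0.5) * direction
    p = min(0.99, max(0.01, p))     # keep both actions possible

print(p)
```

[After training, p sits near its ceiling: the learner has discovered the better action without ever being told what the "correct" answer was.]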
kavuri@cb.ecn.purdue.edu (Surya N Kavuri ) (03/24/89)
Besides gradient and conjugate gradient methods there are others one could try. There are methods known as quasi-Newton methods that are known to perform much better in non-linear optimization. This is because they exploit second-order (curvature) information about the objective function beyond the gradient, and thus use more knowledge of the objective function contours. Second-order derivatives can be thought of as describing the curvature of the error surface, which the gradient alone misses. This additional knowledge is used to achieve much more rapid convergence. The computational difficulty with quasi-Newton approaches is the evaluation of the Hessian. There are inexpensive updating methods such as the BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm. (Most NLP books should have this.) Surya
mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) (03/25/89)
In article <774@cb.ecn.purdue.edu> kavuri@cb.ecn.purdue.edu (Surya N Kavuri ) writes: > Besides gradient and conjugate gradient methods there are others one could try. > There are methods known as Quasi-Newton methods that are known > to perform much better in non-linear optimization. > [...] > There are inexpensive updating methods as BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm. > (most NLP books should have this) > Surya I don't think you want to use quasi-Newton methods such as BFGS for most neural-net problems. They require O(N^2) storage where N is the number of _weights_, and each iteration requires at least O(N^2) operations. If you're dealing with small N, this is usually insignificant next to the time to evaluate your functions, but for large N, it might be a problem. I use conjugate-gradient, which doesn't have this problem but probably requires more function evaluations than BFGS by a moderate, but not huge, factor. Matt K. mbkennel@phoenix.princeton.edu
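[Matt's trade-off is easy to see empirically. The sketch below uses SciPy's optimizer interface (an anachronism here, and not anyone's posted code) to run BFGS and nonlinear CG on the same stand-in error function; the interesting difference at large N is BFGS's N-by-N inverse-Hessian estimate versus CG's few length-N vectors.]

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
A = rng.normal(size=(40, 15))
t = rng.normal(size=40)

def f(w):                     # stand-in for a network's total squared error
    return float((A @ w - t) @ (A @ w - t))

def grad(w):
    return 2.0 * A.T @ (A @ w - t)

w0 = np.zeros(15)             # N = 15 weights here; imagine N in the thousands
res_bfgs = minimize(f, w0, jac=grad, method='BFGS')  # keeps an N x N matrix
res_cg   = minimize(f, w0, jac=grad, method='CG')    # keeps a few N-vectors

# both reach essentially the same minimum; the difference that matters at
# large N is the O(N^2) storage and per-iteration work of BFGS
print(res_bfgs.fun, res_bfgs.nfev)
print(res_cg.fun, res_cg.nfev)
```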
hbhatnag@hawk.ulowell.edu (Himanshu Bhatnagar) (03/30/89)
In article <2698@sun.soe.clarkson.edu> spam@clutx.clarkson.edu writes: >Has anyone come up with a good way to train a net >without knowing a "target" in advance? For example, >if my net is in an undesirable state S1, and it is >totally clueless about how to get to state S2, I would ^^^^^^^^ It depends: if you know what the state S2 is, or what it could be (heuristically), then you could use the methods of Unsupervised Learning (Self-Organising Nets) or Competitive Learning... I could explain in detail a bit later, am in a hurry now... >-Roger Himanshu....
clutx.clarkson.edu (Roger Gonzalez) (04/04/89)
Let me restate my question on training in a slightly different way. Since I'm an undergraduate doing research in NN and I can get away with it, my "problem" is not one of much importance or practicality. Here it is: There is a small organism in my pseudo-universe that has a life that consists mainly of swimming up or down, eating things, and dodging annoying predators that nibble at it. The shallow waters are safe, but all the food (and predators) are down deeper. There are other stimuli as well, but these will suffice to give you the gist of what I'm playing with. Let's assume that our friend is in deep water (literally) and a predator is munching on him. (It's never fatal, but the "pain" and "fear" inputs are bothering it.) The correct solution to get out of this state is to swim up. However, I don't want to have to tell it this. The solution that I'm playing with right now (suggested to me via mail) is something I'm told is called "weakly supervised learning". I haven't implemented much of this solution yet, so I don't know how well it will work. The basic (and obvious) method seems to be to negatively affect any attempt that didn't get it out of the undesirable state. I can't believe I didn't think of this on my own. Any other suggestions? - Roger ++ Roger Gonzalez ++ spam@clutx.clarkson.edu ++ Clarkson University ++ "Just like I've always said; there's nothing an agnostic can't do if he's not sure he believes in anything or not!" - Monty Python
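[The "impede whatever was tried while the pain persists, improve whatever made it stop" scheme Roger describes can be sketched with a one-step version of the predator problem. Everything here -- weights, step sizes, the rule that swimming up immediately ends the pain -- is invented for the toy.]

```python
import math
import random

random.seed(4)
w_up, w_down = 0.0, 0.0          # connection "weights" for the two actions

for episode in range(100):
    in_pain = True               # predator munching; the only feedback signal
    while in_pain:
        # choose an action stochastically from the current weights
        p_up = 1.0 / (1.0 + math.exp(w_down - w_up))
        action = 'up' if random.random() < p_up else 'down'
        # the world: swimming up (to shallow water) ends the pain
        in_pain = (action != 'up')
        # the update: impede the attempt if the pain persists,
        # improve it if the pain stopped
        delta = -0.1 if in_pain else 0.1
        if action == 'up':
            w_up += delta
        else:
            w_down += delta

print(w_up, w_down)
```

[After a few episodes the creature swims up almost immediately, having never been told that "up" was the target action, only that its other attempts left it in pain.]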
jbn@glacier.STANFORD.EDU (John B. Nagle) (04/04/89)
This is a vivarium approach to AI, and a good one. See Ann Marion's classic "aquarium" program. Mike Travers' recent MS thesis at MIT offers some insight on current thinking in this area. Worth considering for a system like yours, where you really want to cold-start with no built-in knowledge at all, is adding some analogue of "evolution" to the system. The simulator should cause the fish to die if they don't get enough food or get bitten too much. Fish that thrive should reproduce, with some tweaking of the parameters of the offspring to simulate mutation. Provided that the initial environment is not so hostile that all the fish die (this can be handled by arranging the simulator so that predators are very few until the population rises above some level), the fish should become more effective over time. Once it's working, let it run over a weekend and see what happens. John Nagle
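[John's evolutionary scheme can be sketched in a few lines: fish with one heritable parameter (how deep they swim), food that increases with depth, predators below a threshold, death for the least fit, and mutated offspring from the survivors. All numbers are invented for the illustration.]

```python
import random

random.seed(5)

def fitness(depth):
    food = depth                            # more food deeper...
    bites = max(0.0, depth - 6.0) ** 2      # ...but predators below depth 6
    return food - bites

# initial population: 20 fish, each defined by one heritable parameter
population = [random.uniform(0.0, 12.0) for _ in range(20)]

for generation in range(50):
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]             # the less fit half dies
    # survivors reproduce; offspring get a small "mutation" of the parameter
    offspring = [max(0.0, d + random.gauss(0.0, 0.5)) for d in survivors]
    population = survivors + offspring

best = max(population, key=fitness)
print(best)
```

[The population drifts toward the depth that balances food against bites, with no individual fish ever learning anything -- selection does all the work, which is exactly the cold-start property John is after.]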