[comp.ai.neural-nets] Training

clutx.clarkson.edu (Roger Gonzalez,,,) (03/20/89)

Has anyone come up with a good way to train a net
without knowing a "target" in advance?  For example,
if my net is in an undesirable state S1, and it is
totally clueless about how to get to state S2, I would
like it to improve the connections that do get it to
state S2, and impede any others.  Now, since it has
no data at all to go on, and the only info it gets after
1 pass is "nope, not in state 2 yet", what can I use as
a target?  All the learning methods I've studied so far
are inadequate for this.

Thanks in advance,
-Roger



++ Roger Gonzalez ++ spam@clutx.clarkson.edu ++ Clarkson University ++

   "Just like I've always said; there's nothing an agnostic can't do
    if he's not sure he believes in anything or not!" - Monty Python

bph@buengc.BU.EDU (Blair P. Houghton) (03/21/89)

In article <2698@sun.soe.clarkson.edu> spam@clutx.clarkson.edu writes:
>Has anyone come up with a good way to train a net
>without knowing a "target" in advance?
[...]
>All the learning methods I've studied so far
>are inadequate for this.

Sounds like good ol' negative reinforcement to me.

You could send it to a Zen Buddhist monastery, or an English public
school...

Have you tried just getting a degree in education and coding it?

				--Blair
				  "...using C--, the 'Object-Lesson' C..."

P.S.  8-D

hal@vicom.COM (Hal Hardenbergh) (03/22/89)

In article <2351@buengc.BU.EDU> bph@buengc.bu.edu (Blair P. Houghton) writes:
>In article <2698@sun.soe.clarkson.edu> spam@clutx.clarkson.edu writes:
>>Has anyone come up with a good way to train a net
>>without knowing a "target" in advance?
>[...]
>>All the learning methods I've studied so far
>>are inadequate for this.
>
>Sounds like good ol' negative reinforcement to me.
>
>You could send it to a Zen Buddhist monastery, or an English public
>school...
>
>Have you tried just getting a degree in education and coding it?
>
>				--Blair
>				  "...using C--, the 'Object-Lesson' C..."
>
>P.S.  8-D

A colleague and I have tried several of the back-prop "speedup" methods.  All
that we have tried do speed up convergence, to some degree, as determined by
the number of epochs (training iterations).  However, none of them reliably
provide a speed improvement as measured by wall clock time.

The ones which do (sometimes) provide a slight improvement in wall clock time
do not do so reliably.  It's sort of like varying the convergence and momentum
factors.  Depending on the random initialization of the weights and biases,
c1 and m1 will work better than c2 and m2, and vice versa.
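
A minimal sketch of the weight update in question (Python/NumPy, purely
illustrative, not the code referred to above), with the convergence
factor c and momentum factor m written out as explicit parameters; the
toy problem and the variable names are assumptions made for the example:

    # One-hidden-layer back-prop with momentum on XOR.
    # c = convergence (learning-rate) factor, m = momentum factor.
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)

    W1 = rng.normal(0, 0.5, (2, 4)); b1 = np.zeros(4)   # random initialization
    W2 = rng.normal(0, 0.5, (4, 1)); b2 = np.zeros(1)
    vW1 = np.zeros_like(W1); vb1 = np.zeros_like(b1)     # momentum terms
    vW2 = np.zeros_like(W2); vb2 = np.zeros_like(b2)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    c, m = 0.5, 0.9        # the two factors being varied (c1/m1 vs c2/m2)
    for epoch in range(5000):
        h = sigmoid(X @ W1 + b1)                 # forward pass
        y = sigmoid(h @ W2 + b2)
        dy = (y - T) * y * (1 - y)               # backward pass, squared error
        dh = (dy @ W2.T) * h * (1 - h)
        gW2, gb2 = h.T @ dy, dy.sum(0)
        gW1, gb1 = X.T @ dh, dh.sum(0)
        # momentum update:  v <- m*v - c*grad,  w <- w + v
        vW2 = m * vW2 - c * gW2; W2 += vW2; vb2 = m * vb2 - c * gb2; b2 += vb2
        vW1 = m * vW1 - c * gW1; W1 += vW1; vb1 = m * vb1 - c * gb1; b1 += vb1

Whether a given (c, m) pair converges in fewer epochs than another still
depends on the random initialization, which is exactly the effect described
above.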

As long as we are simulating artificial neural nets in software (if simulating
is the right word here), does anyone know of a back-prop speedup trick which
reduces the wall-clock training time?

Hal Hardenbergh  [incl std dsclmr]  hal@vicom.com
Vicom Systems Inc  2520 Junction Ave  San Jose  CA  (408) 432-8660
surely nobody here remembers the newsletter DTACK Grounded ?

efrethei@afit-ab.arpa (Erik J. Fretheim) (03/22/89)

In article <1577@vicom.COM> hal@vicom.COM (Hal Hardenbergh (236)) writes:
>In article <2351@buengc.BU.EDU> bph@buengc.bu.edu (Blair P. Houghton) writes:
>>In article <2698@sun.soe.clarkson.edu> spam@clutx.clarkson.edu writes:
>>>Has anyone come up with a good way to train a net
>>>without knowing a "target" in advance?
>>[...]
>>>All the learning methods I've studied so far
>>>are inadequate for this.
>>
>>Sounds like good ol' negative reinforcement to me.
>>
>>You could send it to a Zen Buddhist monastery, or an English public
>>school...
>>
>>Have you tried just getting a degree in education and coding it?
>
>
>As long as we are simulating artificial neural nets in software (if simulating

"Emulating" I belive is the correct term, if not the term of choice.


Of course, it seems rather natural that speeding up backprop is like all of
the rest of pattern recognition, an art rather than a science.  After all, if
the results were predictable, would it be any fun any more?

.signature and .disclaimer to be invented someday.

kavuri@cb.ecn.purdue.edu (Surya N Kavuri ) (03/22/89)

In article <1577@vicom.COM>, hal@vicom.COM (Hal Hardenbergh) writes:
> 
> A colleague and I have tried several of the back-prop "speedup" methods.  All
> that we have tried do speed up convergence, to some degree, as determined by
> the number of epochs (training iterations).  However, none of them reliably
> provide a speed improvement as measured by wall clock time.
> 
> The ones which do (sometimes) provide a slight improvement in wall clock time
> do not do so reliably.  It's sort of like varying the convergence and momentum
> factors.  Depending on the random initialization of the weights and biases,
> c1 and m1 will work better than c2 and m2, and vice versa.
> 
> As long as we are simulating artificial neural nets in software (if simulating
> is the right word here), does anyone know of a back-prop speedup trick which
> reduces the wall-clock training time?
> 
> Hal Hardenbergh  [incl std dsclmr]  hal@vicom.com


  This, I believe, could be tried using other inexact search methods
with BP.
REF: Conjugate-gradient methods with inexact searches, in Mathematics of
Operations Research, vol. 3.

Gradient evaluation is an expensive computation in BP.  A gradient-reuse
algorithm was suggested whose idea is the following:
Gradients are reused several times until the resulting weight updates no
longer lead to a reduction in error.

Further, the use of batching makes the search direction more accurate.
Batching means:
Considering the total error (to be minimized) as the sum of squared differences
between the desired output and the observed, summed over all patterns.
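
The gradient-reuse idea can be sketched as follows (Python/NumPy, an
illustration under assumed interfaces: net(w, X) returns the outputs and
net_grad(w, X, T) the batch gradient for weight vector w -- these names
are placeholders, not from any particular package):

    import numpy as np

    def batch_error(w, X, T, net):
        # "Batching": sum of squared differences between desired and
        # observed outputs, summed over all patterns.
        Y = net(w, X)
        return 0.5 * np.sum((Y - T) ** 2)

    def train_gradient_reuse(w, X, T, net, net_grad, lr=0.1, epochs=100):
        for _ in range(epochs):
            g = net_grad(w, X, T)               # expensive: computed once...
            err = batch_error(w, X, T, net)
            while True:                          # ...then reused several times
                w_new = w - lr * g
                new_err = batch_error(w_new, X, T, net)
                if new_err >= err:               # stop when the reused gradient
                    break                        # no longer reduces the error
                w, err = w_new, new_err
        return w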


                                                                SURYA
                                                       (FIAT LUX)

wlp@calmasd.Prime.COM (Walter L. Peterson, Jr.) (03/23/89)

In article <2698@sun.soe.clarkson.edu>, spam@sun.soe!clutx.clarkson.edu (Roger Gonzalez,,,) writes:
> Has anyone come up with a good way to train a net
> without knowing a "target" in advance?  For example,
> if my net is in an undesirable state S1, and it is
> totally clueless about how to get to state S2, I would
> like it to improve the connections that do get it to
> state S2, and impede any others.  Now, since it has
> no data at all to go on, and the only info it gets after
> 1 pass is "nope, not in state 2 yet", what can I use as
> a target?  All the learning methods I've studied so far
> are inadequate for this.
> 



State S2 is, in and of itself, the target.  That state, whatever it
might be, has to be represented to the machine in some manner.  The
fact that S2 might be a very long text string or some very large
binary number or some arbitrary bit pattern that you make up makes no
difference: if you know where you are, and you know where you want to
be, you can get there ( it might not be an easy trip, though :-) ).

The point is that you must not confuse your interpretation of the
"meaning" of the bit pattern with the bit pattern itself.

As for your statement that "...since it has no data at all to go
on...", I submit that this is meaningless, not only for a neural net,
but for any computer program.  Also, your connections and their
weights, by themselves, constitute 'data to go on'.

Given that the system starts in some arbitrary state S1, which
must be known or knowable to the system, and given that its
goal is to get to some other known or knowable state S2 there
are three ways to proceed.  First, if there is some known transform
that can be applied to state S1 to turn it into some
other valid state Sn, then that transform must be continually applied
to the current state, call it Sc, and then Sc must be tested against
S2 to see if we have found S2.  The existence of some transform
implies that there is some set of parameters upon which the transform
operates, and lacking any data to go on, the system must perform
a systematic exhaustive search of the parameter space. The second
method is a variation on the first. Lacking a known transform, the
best that you can do is apply random values to the things that
constitute the state and wait for S2 to turn up. This is the situation
that you would have in your example, where you say that "...it is
totally clueless about how to get to S2...". Totally clueless implies
that you don't even have an algorithm for transforming S1 into ANY
other valid state, Sn.

This, of course, assumes that you have no means of determining how
"far away" the current state Sc is from S2.  If you have some way of
determining that distance, then what you've got is, essentially, the
gradient descent of back-propagation which is the third way of going
from state S1 to state S2.  This "distance" information ( the error
function in back-prop ) and the connections with their weights are,
essentially, all the data that you need.
--
Walt Peterson.  Prime - Calma San Diego R&D (Object and Data Management Group)
"The opinions expressed here are my own and do not necessarily reflect those of
Prime, Calma, nor anyone else."
...{ucbvax|decvax}!sdcsvax!calmasd!wlp
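
To make the contrast above concrete, here is a toy sketch (Python, with
states as bit vectors -- an assumption made purely for illustration) of
the "totally clueless" random search versus a search guided by a
distance-to-S2 signal:

    import random

    S2 = [1, 0, 1, 1, 0, 1, 0, 0]                  # the target state

    def distance(s):                               # how "far away" we are:
        return sum(a != b for a, b in zip(s, S2))  # Hamming distance to S2

    def random_search(s, max_tries=100000):
        # Clueless case: flip bits at random and test against S2 each pass.
        tries = 0
        while s != S2 and tries < max_tries:
            i = random.randrange(len(s))
            s = s[:i] + [1 - s[i]] + s[i + 1:]
            tries += 1
        return s, tries

    def guided_search(s):
        # With a distance signal: keep a flip only if it moves us closer.
        # This is the germ of the gradient-descent / error-function case.
        while s != S2:
            i = random.randrange(len(s))
            candidate = s[:i] + [1 - s[i]] + s[i + 1:]
            if distance(candidate) < distance(s):
                s = candidate
        return s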

mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) (03/23/89)

In article <769@cb.ecn.purdue.edu> kavuri@cb.edn.purdue writes:
>In article <1577@vicom.COM>, hal@vicom.COM (Hal Hardenbergh) writes:
>> 
>> A colleague and I have tried several of the back-prop "speedup" methods.  All
>> that we have tried do speed up convergence, to some degree, as determined by
>> the number of epochs (training iterations).  However, none of them reliably
>> provide a speed improvement as measured by wall clock time.
>> 
>> The ones which do (sometimes) provide a slight improvement in wall clock time
>> do not do so reliably.  It's sort of like varying the convergence and momentum
>> factors.  Depending on the random initialization of the weights and biases,
>> c1 and m1 will work better than c2 and m2, and vice versa.
>> 
>> As long as we are simulating artificial neural nets in software (if simulating
>> is the right word here), does anyone know of a back-prop speedup trick which
>> reduces the wall-clock training time?
>> 
>> Hal Hardenbergh  [incl std dsclmr]  hal@vicom.com
>
>
>  This, I believe, could be tried using other inexact search methods 
>with BP. 
>REF: Conjugate-gradient methods with inexact searches , in Mathematics of
>Operations Research vol.3 

I'm using conjugate-gradient minimization in neural networks.  
Each iteration of conjugate gradient gains significantly more on the
error surface than a "standard" backprop step, but then again, each iteration
requires an entire direction search & minimization, which may take
anywhere from 5 to 10 net evaluations (over the whole training set or whatever).

It's not always clear whether or not it saves "wall-clock" time.  Sometimes,
when the function is "easy", it sniffs out the minimum very quickly.  It
has the advantage that you're not dependent at all on a "learning parameter".
But, of course, it's significantly more difficult to code than naive
gradient descent, and can't possibly have any neurological justification.

When I minimize the error surface w/ conjugate gradient, I usually take the
sum over most or all elements of the training set.  Conjugate gradient is very
good at searching out local minima, and so I think some precautions must
be noted:

	1)  Don't try to minimize on one example only, and then try
	    minimizing on the next and so on.  C.G. will very quickly
            find a minimum that "memorizes" the first example, and then
	    just as easily forget it and memorize the second, etc. with
	    very little generalization over many examples.

	2)  C.G. will still find local minima or regions where
	    the error decreases very slowly,  even training over all examples.
	    Thus I don't train over the same examples for every iteration
	    of C.G.; I train over some large random subset for a while
	    and then switch off, either training over the whole set
	    or another random subset, depending on how well I did.

It has some advantages though:  it's very dogged, and doesn't screw up
royally very often.  I often find that I can do very well with regular
BP getting down towards a minimum, and then find that it screws up and
starts to climb back up the other side, and because the step size is
too large, can't find the narrow valley at the bottom.  C.G. almost
always will minimize your function, though it can get stuck in local
minima more often.  Please note that I'm doing a "hard" task where I need
to map continuous values to continuous values with as high accuracy as
possible.
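
The shape of the procedure described above -- conjugate gradient with a
line minimization per iteration -- can be sketched like this (Python/NumPy,
with SciPy's Brent routine standing in for the Numerical Recipes BRENT;
'error' and 'error_grad' are assumed to evaluate the batch error and its
gradient over whichever subset of the training set is in use):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def conjugate_gradient(w, error, error_grad, iters=50):
        # Polak-Ribiere conjugate gradient.  Each iteration performs one
        # line minimization along d, costing several net evaluations.
        g = error_grad(w)
        d = -g
        for _ in range(iters):
            line = lambda a: error(w + a * d)      # 1-D slice of the surface
            a = minimize_scalar(line, bracket=(0.0, 1.0), method='brent').x
            w = w + a * d
            g_new = error_grad(w)
            beta = max(0.0, g_new @ (g_new - g) / (g @ g))   # Polak-Ribiere
            d = -g_new + beta * d                  # new conjugate direction
            g = g_new
        return w

The step size comes out of the line minimization each time, which is why
no learning parameter appears anywhere above.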
	    
>
>Gradient evaluation is an expensive computation in BP.  A gradient-reuse
>algorithm was suggested whose idea is the following:
>Gradients are reused several times until the resulting weight updates no
>longer lead to a reduction in error.     

In this case, you're just doing a very naive line-minimization in the
gradient direction.  You could do much better by using a good line-min
routine which can interpolate to the presumed minimum much faster.
(I use the BRENT routine from Numerical Recipes for my C.G.)  I actually implemented
this line-minimization alg., but was disappointed in its performance.  Regular
BP usually did better.

>
>Further, the use of batching makes the search direction more accurate.
>Batching means:
>Considering the total error (to be minimized) as the sum of squared differences
>between the desired output and the observed, summed over all patterns.
>
Yes, but if you add on the gradient each time, later examples get the
benefit of whatever global features have been learnt in earlier
examples.

I usually find that standard back-prop (update after each pattern) w/
momentum is faster.  Indeed, because of its simplicity & speed, I've
found it hard to beat in real wall-clock time.  90% of a back-prop
program's time is spent in the forward iteration & gradient computation
subroutines.  It's very easy to double the complexity of the inner loops
b/c back-prop w/ fixed steps is so simple, and thus really slow down
the program a lot.

I use standard back-prop for a while before starting on C.G. -- when starting
from a random position, C.G. is usually pretty clueless and needs to
get closer in.

>
>                                                                SURYA
>                                                       (FIAT LUX)

Matt Kennel
mbkennel@phoenix.princeton.edu

bph@buengc.BU.EDU (Blair P. Houghton) (03/23/89)

In article <997@afit-ab.arpa> efrethei@blackbird.afit.af.mil (Erik J. Fretheim) writes:
>In article <1577@vicom.COM> hal@vicom.COM (Hal Hardenbergh (236)) writes:
>>In article <2351@buengc.BU.EDU> bph@buengc.bu.edu (Blair P. Houghton) writes:
>>>
>>>You could send it to a Zen Buddhist monastery, or an English public
>>>school...
>>
>>As long as we are simulating artificial neural nets in software (if simulating
>
>"Emulating" I belive is the correct term, if not the term of choice.

"Modelling."

				--Blair
				  "What is the sound of one
				   Usener posting?"

rich@bunny.UUCP (Rich Sutton) (03/24/89)

In article <2698@sun.soe.clarkson.edu> spam@clutx.clarkson.edu writes:

> Has anyone come up with a good way to train a net
> without knowing a "target" in advance?
...
> All the learning methods I've studied so far
> are inadequate for this.

The learning methods that you are looking for are called
"reinforcement learning" methods.  They are less well known than
supervised learning methods, but hardly obscure.  And of course they
are more powerful in the sense you refer to -- they can learn what to
do without a teacher that already knows the correct answers.  
I recommend the following papers.
					-Rich Sutton


A. Barto, R. Sutton, & P. Brouwer, ``Associative search network:
A reinforcement learning associative memory,'' Biological
Cybernetics, 40, 1981, pp. 201--211.

A. Barto & R. Sutton, ``Landmark learning: An illustration of
associative search,'' Biological Cybernetics, 42, 1981,
pp. 1--8.

Barto, Sutton, and Anderson, ``Neuronlike elements that can solve
difficult learning control problems,'' IEEE Trans. on Systems, Man, and
Cybernetics, 13, 1983, pp. 835--846.

Williams, R. J., ``Toward a theory of reinforcement-learning
connectionist systems,'' Technical Report NU-CCS-88-3, College of
Computer Science, Northeastern University, Boston, Massachusetts, 1988.

My dissertation is also directly relevant and I invite you to write me
for a copy at GTE Labs, Waltham MA 02254.
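
As a concrete (and drastically simplified) taste of such methods, here is
a sketch of a single stochastic unit trained by a reward-correlation rule
in the general spirit of the associative-search papers above; it is an
illustration only, not code from any of them, and the toy environment and
its reward function are assumed placeholders.  The only training signal is
a scalar reward -- no target output is ever supplied:

    import numpy as np

    rng = np.random.default_rng(1)
    n_inputs = 4
    w = np.zeros(n_inputs)
    alpha = 0.1                               # learning rate

    def step(x, reward_fn):
        p = 1.0 / (1.0 + np.exp(-(w @ x)))    # probability of firing
        y = float(rng.random() < p)           # stochastic binary action
        r = reward_fn(x, y)                   # scalar reward from environment
        # Reinforcement update: correlate reward with (action - expectation).
        w[:] += alpha * r * (y - p) * x
        return y, r

    # Toy environment: reward +1 when the action copies x[0], else -1.
    for t in range(2000):
        x = rng.integers(0, 2, n_inputs).astype(float)
        step(x, lambda xi, yi: 1.0 if yi == xi[0] else -1.0)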

kavuri@cb.ecn.purdue.edu (Surya N Kavuri ) (03/24/89)

 Besides gradient and conjugate-gradient methods, there are others one could try.
 There are methods known as quasi-Newton methods that are known
 to perform much better in non-linear optimization.  It is
 because they use higher-order derivatives besides the first (as in gradient
 methods), and thus use more knowledge of the objective function contours.
 Second- and higher-order derivatives can be thought of as describing the
 curvature of the error surface, which a pure gradient step ignores.  This
 additional knowledge is used to achieve much more rapid convergence.
 The computational difficulty with Newton-type approaches is the evaluation
 of the Hessian.  There are inexpensive updating methods, such as the
 BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm, which build up an
 approximation to it from successive gradients.
      (most NLP books should have this)
                                           Surya
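
In practice the BFGS update is rarely coded by hand; as a rough illustration
(Python/SciPy, with a stand-in quadratic error surface -- a real net would
compute the error and gradient by forward/backward passes over the training
set for a flattened weight vector), a library routine can be handed the same
error and gradient functions used for back-prop:

    import numpy as np
    from scipy.optimize import minimize

    A = np.diag([1.0, 10.0, 100.0])        # stand-in for the error surface

    def error(w):
        return 0.5 * w @ A @ w

    def error_grad(w):
        return A @ w

    w0 = np.array([1.0, 1.0, 1.0])
    # BFGS builds up an approximation to the (inverse) Hessian from
    # successive gradients instead of evaluating second derivatives.
    result = minimize(error, w0, jac=error_grad, method='BFGS')
    print(result.x, result.nit)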

mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) (03/25/89)

In article <774@cb.ecn.purdue.edu> kavuri@cb.ecn.purdue.edu (Surya N Kavuri ) writes:
>
> Besides gradient and conjugate-gradient methods, there are others one could try.
> There are methods known as quasi-Newton methods that are known
> to perform much better in non-linear optimization.  It is
> because they use higher-order derivatives besides the first (as in gradient
> methods), and thus use more knowledge of the objective function contours.
> Second- and higher-order derivatives can be thought of as describing the
> curvature of the error surface, which a pure gradient step ignores.  This
> additional knowledge is used to achieve much more rapid convergence.
> The computational difficulty with Newton-type approaches is the evaluation
> of the Hessian.  There are inexpensive updating methods, such as the
> BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm, which build up an
> approximation to it from successive gradients.
>      (most NLP books should have this)
>                                           Surya


I don't think you want to use Quasi-Newton methods such as BFGS for
most neural-net problems.  They require O(N^2) storage where N is the
number of _weights_, and in each iteration require at least
O(N^2) operations.  If you're dealing with small N, this is usually
insignificant next to the time to evaluate your functions, but for
large N, it might be a problem.  

I use conjugate-gradient, which doesn't have this problem but probably
requires more function evaluations than BFGS by a moderate, but not
huge factor.

Matt K.
mbkennel@phoenix.princeton.edu

hbhatnag@hawk.ulowell.edu (Himanshu Bhatnagar) (03/30/89)

In article <2698@sun.soe.clarkson.edu> spam@clutx.clarkson.edu writes:
>Has anyone come up with a good way to train a net
>without knowing a "target" in advance?  For example,
>if my net is in an undesirable state S1, and it is
>totally clueless about how to get to state S2, I would
				      ^^^^^^^^
It depends: if you know what the state S2 is, or what it could be (heuristically),
then you could use the methods of Unsupervised Learning (Self-Organising Nets)
or Competitive Learning... I could explain in detail a bit later, am in a
hurry now...

>-Roger

Himanshu....
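
(The core of the competitive-learning rule mentioned above fits in a few
lines -- a sketch of the standard winner-take-all update, Python/NumPy,
not taken from any particular package; X is assumed to be an array with
one input pattern per row:)

    import numpy as np

    rng = np.random.default_rng(3)

    def competitive_learning(X, n_units=4, lr=0.1, epochs=20):
        # Unsupervised: no targets.  Units compete for each pattern and the
        # winner moves its weight vector toward that pattern.
        W = rng.normal(size=(n_units, X.shape[1]))
        for _ in range(epochs):
            for x in X:
                winner = np.argmin(np.sum((W - x) ** 2, axis=1))
                W[winner] += lr * (x - W[winner])
        return W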

clutx.clarkson.edu (Roger Gonzalez,,,) (04/04/89)

Let me restate my question on training in a slightly different
way.  Since I'm an undergraduate doing research in NN and I can
get away with it, my "problem" is not one of much importance
or practicality.  Here it is:

There is a small organism in my pseudo-universe that has a life that
consists mainly of swimming up or down, eating things, and dodging
annoying predators that nibble at it.  The shallow waters are safe, but
all the food (and predators) are down deeper.  There are other stimuli
as well, but these will suffice to give you the gist of what I'm playing
with. 

Let's assume that our friend is in deep water (literally) and a predator
is munching on it.  (It's never fatal, but the "pain" and "fear" inputs
are bothering it.)

The correct solution to get out of this state is to swim up.  However, I
don't want to have to tell it this.  The solution that I'm playing with
right now (suggested to me via mail) is something I'm told is called
"weakly supervised learning".  I haven't implemented much of this solution
yet, so I don't know how well it will work.  The basic (and obvious)
method seems to be to negatively affect any attempt that didn't get it
out of the undesirable state.  I can't believe I didn't think of this
on my own.
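
(In code, that "negatively affect failed attempts" rule boils down to
something like the following toy sketch -- Python/NumPy, with a made-up
two-component stimulus [pain, fear] and a three-action repertoire; it is
an illustration of the rule, not the actual simulation:)

    import numpy as np

    rng = np.random.default_rng(2)
    ACTIONS = ["up", "down", "stay"]              # assumed action set
    w = np.zeros((len(ACTIONS), 2))               # weights from [pain, fear]

    def choose(stimulus):
        scores = w @ stimulus
        p = np.exp(scores - scores.max())
        p /= p.sum()                              # softmax over the actions
        return rng.choice(len(ACTIONS), p=p)

    def update(stimulus, action, escaped, lr=0.2):
        # No target action is ever given; the only feedback is whether the
        # undesirable state ended.  Failed attempts are impeded.
        sign = 1.0 if escaped else -1.0
        w[action] += lr * sign * stimulus

    # One deep-water encounter: pain and fear both on; swimming up is the
    # only escape, but that fact is never told to the learner.
    stimulus = np.array([1.0, 1.0])
    for t in range(200):
        a = choose(stimulus)
        update(stimulus, a, escaped=(ACTIONS[a] == "up"))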

Any other suggestions?

   - Roger

++ Roger Gonzalez ++ spam@clutx.clarkson.edu ++ Clarkson University ++

   "Just like I've always said; there's nothing an agnostic can't do
    if he's not sure he believes in anything or not!" - Monty Python

jbn@glacier.STANFORD.EDU (John B. Nagle) (04/04/89)

      This is a vivarium approach to AI, and a good one.  See Ann Marion's
classic "aquarium" program.  Mike Travers' recent MS thesis at MIT offers
some insight on current thinking in this area.

      Worth considering for a system like yours, where you really want to
cold-start with no built-in knowledge at all, is to add some analogue of
"evolution" to the system.  The simulator should cause the fish to die
if they don't get enough food or get bitten too much.  Fish that thrive
should reproduce, with some tweaking of the parameters of the offspring
to simulate mutation.  Provided that the initial environment is not so
hostile that all the fish die (which can be handled by arranging the
simulator so that the predators are very few until the population
increases above some level), the fish should become more effective over
time.  Once it's working, let it run over a weekend and see what happens.
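
A bare-bones version of that loop looks something like this (Python sketch;
the fitness function is a placeholder for whatever the simulator measures
about food eaten and bites taken):

    import random

    def evolve(population, fitness, mutation=0.05, generations=100):
        # population: a list of parameter vectors (one per fish).
        for _ in range(generations):
            ranked = sorted(population, key=fitness, reverse=True)
            survivors = ranked[:len(ranked) // 2]       # the rest die off
            offspring = [
                [p + random.gauss(0.0, mutation) for p in parent]
                for parent in survivors                 # thriving fish reproduce,
            ]                                           # with mutated parameters
            population = survivors + offspring
        return population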

					John Nagle