[comp.ai.neural-nets] Neuron Digest V5 #18

neuron-request@HPLABS.HP.COM ("Neuron-Digest Moderator Peter Marvit") (04/19/89)

Neuron Digest   Tuesday, 18 Apr 1989
                Volume 5 : Issue 18

Today's Topics:
                       Cure for a PDP large .net bug.
                     Questions About Hopfield Networks
                           S. Pinker / A. Prince
                         Re: S. Pinker / A. Prince
                         Re: S. Pinker / A. Prince
                         Pinker and Prince article
                         Re: S. Pinker / A. Prince
                         Re: S. Pinker / A. Prince
                                  Training
                                Re: Training
                                Re: Training
                                Re: Training
                                Re: Training
                                Re: Training
                                Re: Training
                                Re: Training
                                Re: Training
                                Re: Training


Send submissions, questions, address maintenance and requests for old issues to
"neuron-request@hplabs.hp.com" or "{any backbone,uunet}!hplabs!neuron-request"
ARPANET users can get old issues via ftp from hplpm.hpl.hp.com (15.255.16.205).

------------------------------------------------------------

Subject: Cure for a PDP large .net bug.
From:    jdg@antique.UUCP (John Gabbe)
Organization: AT&T Bell Labs, Holmdel, NJ
Date:    Fri, 24 Feb 89 09:19:34 -0500 

[[ Editor's Note:  John and I had quite a time of it trying to push his
message through the mail, hence the apparent lateness. -PM ]]

James McClelland accepted this fix for later versions of the PDP program.
For those of you who have earlier versions, the following may be helpful.


                                        April 20, 1988
PDP Software Inquiries
c/o Texts Manager
MIT Press
55 Hayward Street
Cambridge, MA  02142

Dear Sir:

The PDP software file ../src/weights.c contains errors in the segments
reproduced below.  The pertinent lines are marked with an E# or C in the
left margin, which serves as a key to the comments that follow the code.

                * * * * * * * * * * * * * * * * * * * * 

/*

       This file is part of the PDP software package.
                 
       Copyright 1987 by James L. McClelland and David E. Rumelhart.
       
       Please refer to licensing information in the file license.txt,
       which is in the same directory with this source file and is
       included here by reference.
*/


/* file: weights.c

        read in network descriptions, and set up constraints.
        
        First version implemented by Elliot Jaffe.
        
        Date of last revision: 8-12-87/JLM.
*/
line 21
 .
 .
 .
line 609
    if (nlinks) {
        constraints = (struct constraint   *) 
          emalloc ((unsigned int)(sizeof (struct constraint) * (nlinks + 1)));
        for (i = 0; i < nlinks; i++) {
            constraints[i].num = 0;
            constraints[i].max = MAXCONSTRAINTS;
            constraints[i].cvec = ((float **) 
                emalloc((unsigned int)(sizeof(float *) * MAXCONSTRAINTS)));
            constraints[i].ivec = ((float **) 
                emalloc((unsigned int)(sizeof(float *) * MAXCONSTRAINTS)));
E1      for (j = 0; j < nunits; j++) {
                constraints[i].cvec[j] = NULL;
                constraints[i].ivec[j] = NULL;
            }
        }
    }
line 626
 .
 .
 .
line 783
/* realloc positive_constraints, negative_constraints, and link constraints
   this is called whenever the allocated constraint lists run out of
   space for additional constraints  14-May-87 MAF / 15-May-87 JLM */

enlarge_constraints(con_index) int con_index; {
E2  if (con_index = ENLARGE_POS) {
C       maxpos += 100;
        positive_constraints = ((float **) erealloc 
          ((char *) positive_constraints,
C             (unsigned int) ((maxpos - 100) * sizeof(float *)),
              (unsigned int) (maxpos * sizeof(float *))));
    }
E3  else if (con_index = ENLARGE_NEG) {
C       maxneg += 100;
        negative_constraints = ((float **) erealloc 
          ((char *) negative_constraints,
C            (unsigned int) ((maxneg -100) * sizeof (float *)),
             (unsigned int) (maxneg * sizeof(float *))));
    }
    else {
C       constraints[con_index].max += 100;
        constraints[con_index].cvec = ((float **) erealloc 
          ((char *)constraints[con_index].cvec,
           (unsigned int)
C            ((constraints[con_index].max - 100) * sizeof(float *)),
           (unsigned int)
             (constraints[con_index].max * sizeof(float *))));
        constraints[con_index].ivec = ((float **) erealloc 
          ((char *)constraints[con_index].ivec,
           (unsigned int)
C            ((constraints[con_index].max - 100) * sizeof(float *)),
           (unsigned int)
             (constraints[con_index].max * sizeof(float *))));
E4 }
}

                * * * * * * * * * * * * * * * * * * * * 

E1.     At the line marked E1, MAXCONSTRAINTS items have just been emalloc'd
        but nunits items are NULL'ed. When nunits > MAXCONSTRAINTS, unallocated
        memory is overwritten, destroying (on Suns and Vaxen) malloc's
        bookkeeping and leading to a segmentation error.  This line should
        be replaced by:

                for (j = 0; j < nunits && j < MAXCONSTRAINTS; j++) {

E2, E3. These should be compares, not assignments. Replace `=' with `=='.
        As the code is written, positive_constraints is always erealloc'd
        and nothing else is ever erealloc'd. 

E4.     If it is necessary to NULL the pointers at E1, then it should
        be necessary to NULL the additional space allocated here.
C.      For consistency, all the 100's should be replaced by MAXCONSTRAINTS.
        The following code inserted at E4 will then initialize the
        additional space:
                
                for (j = constraints[con_index].max - MAXCONSTRAINTS; 
                                j < constraints[con_index].max; 
                                j++) {  
                        constraints[con_index].cvec[j] = NULL;  
                        constraints[con_index].ivec[j] = NULL;  
                }       


These comments are provided as a courtesy; no claims are made or responsibility
assumed regarding their correctness.

Yours truly,

John D. Gabbe
AT&T Bell Laboratories

copy to netnews comp.ai.neural-nets

P.S.  In the official revision, McClelland introduced an additional constant
      to implement comment C, instead of overloading MAXCONSTRAINTS as is
      done above. - JDG 2/24/89.


------------------------------

Subject: Questions About Hopfield Networks
From:    mmm@cup.portal.com (Mark Robert Thorson)
Organization: The Portal System (TM)
Date:    Mon, 20 Mar 89 06:08:58 +0000 

I have a question or two about Hopfield networks used as associative memory.
Hopfield says that when multiple parallel networks are driven from the same
clue bus, and drive the same output bus, only the network containing the
memory corresponding to the clue responds (assuming the memory is stored in
one of the networks, and only one of the networks).  In other places,
Hopfield says these networks always drive themselves to a stable state.

My question is: is there some kind of "null" or "indecision" stable state?
When multiple memory planes are used, what state do they have when presented
with a memory which they do not hold?
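
For concreteness, here is a sketch of the single-network recall dynamics I
have in mind (just the textbook discrete Hopfield net with made-up names,
not the multi-plane arrangement with the shared clue and output buses):

#define N 64                          /* number of units                  */

int    s[N];                          /* unit states, +1 or -1            */
double w[N][N];                       /* symmetric weights, w[i][i] == 0  */

/* Hebbian storage of one pattern x (entries +1 or -1). */
void store(int x[N])
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            if (i != j)
                w[i][j] += (double) x[i] * x[j] / N;
}

/* Asynchronous recall: update units until none changes sign.  Each
   update can only lower the energy E = -1/2 * sum w[i][j]*s[i]*s[j],
   so the net always settles into *some* stable state -- but that state
   may be a spurious mixture rather than one of the stored patterns,
   which is the sort of "indecision" outcome I am asking about. */
void recall(void)
{
    int i, j, changed = 1;
    while (changed) {
        changed = 0;
        for (i = 0; i < N; i++) {
            double h = 0.0;
            int new_s;
            for (j = 0; j < N; j++)
                h += w[i][j] * s[j];
            new_s = (h >= 0.0) ? 1 : -1;
            if (new_s != s[i]) {
                s[i] = new_s;
                changed = 1;
            }
        }
    }
}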

------------------------------

Subject: S. Pinker / A. Prince
From:    "Roger Gonzales" <spam@sun.soe.clarkson.edu>
Date:    Wed, 22 Mar 89 21:59:26 +0000 

I just read a rather scathing article by Steven Pinker (MIT) and Alan Prince
(Brandeis) that tore PDP apart.  Has anyone seen any responses to this
article that defend Rumelhart and McClelland?  The article bummed me out,
and I'm not up enough on language processing to really debate what they
claim.

++ Roger Gonzalez ++ spam@clutx.clarkson.edu ++ Clarkson University ++

------------------------------

Subject: Re: S. Pinker / A. Prince
From:    kortge@Portia.Stanford.EDU (Chris Kortge)
Organization: Stanford University
Date:    Thu, 23 Mar 89 04:29:28 +0000 


The article doesn't tear PDP apart at all (I assume you mean the one in
_Cognition_, published also as a book, "Connections and Symbols").  What it
does do is tear apart one very simple model of a complex phenomenon
(learning of past tenses).  As far as I could tell, virtually all its
criticisms could be answered with a multilayer network model; that is, the
faults of the R&M model derive mostly from the fact that it's just a single
layer associative network, with hand-wired representations.  As I understand
it (having heard Dave Rumelhart's response), the R&M model was never
intended to be anything near a complete model of tense acquisition--rather,
it was mainly supposed to demonstrate that "rule-like" behavior can coexist
with "special case" behavior in the same set of connecting weights.

If you want to read an article which _attempts_ to tear PDP _in general_
apart, read the one by Fodor and Pylyshyn in the same book.  It didn't make
much sense to me, but then I guess I don't have the MIT perspective.  If
someone really wants to blast PDP, there are _real_ problems with it, like
scaling of learning time, which make more sense to focus on than the things
F&P talk about.

Chris Kortge
kortge@psych.stanford.edu

------------------------------

Subject: Re: S. Pinker / A. Prince
From:    "Roger Gonzalez,,," <sun.soe.clarkson.edu!spam@TCGOULD.TN.CORNELL.EDU>
Date:    23 Mar 89 21:46:49 +0000 


> As far as I could tell, virtually all its criticisms
> could be answered with a multilayer network model; that is, the faults
> of the R&M model derive mostly from the fact that it's just a single
> layer associative network, with hand-wired representations. 

Yeah, so I noticed after I waded through the last 50 pages.

My impressions changed after I was done reading the article.  Seems to me
that they were getting all picky about details that I don't think R&M
intended to be the god's truth about language processing...  Correct me if
I'm wrong, but weren't R&M just saying "Look, here's a simple little model
that does a pretty good job with past tenses...  and look, it even seems to
exhibit some of the behavior of children learning language.."

Some of P&P's accusations seemed pretty trivial anyway: "It can learn rules
found in no human language"

SOoooo?

(Or are they assuming the ol' language acquisition device?)


Anyway, I'm reading the "nastier" article you suggested right now.

++ Roger Gonzalez ++ spam@clutx.clarkson.edu ++ Clarkson University ++

------------------------------

Subject: Pinker and Prince article
From:    gblee@maui.cs.ucla.edu
Organization: UCLA Computer Science Department
Date:    Fri, 24 Mar 89 00:14:38 +0000 


The article is :
Pinker, S. and Prince, A. (1988). On language and connectionism: Analysis
of a parallel distributed processing model of language acquisition.
Cognition, 28, pp. 73-193.

Actually, this volume of Cognition contains two other papers that attack the
PDP model, and it has also been published as a book.

the two other articles are:

Fodor, J. and Pylyshyn, Z. (1988). Connectionism and cognitive architecture:
A critical analysis.

Lachter, J. and Bever, T. G. (1988). The relation between linguistic structure
and associative theories of language learning -- A constructive critique of
some connectionist learning models.

Geunbae Lee
AI Lab, UCLA

------------------------------

Subject: Re: S. Pinker / A. Prince
From:    kavuri@cb.ecn.purdue.edu (Surya N Kavuri )
Organization: Purdue University Engineering Computer Network
Date:    Thu, 23 Mar 89 22:25:56 +0000 


 There is another paper with a similar claim.

  "Gradient descent fails to separate" is its title.
            By: M. Brady and R. Raghavan

The paper shows the failure of BP on examples where there are no local
minima.  They assert (and they could be right, as such claims have been
"romantic", as Minsky put it!) that least-squares solutions do not minimize
the number of misclassifications.  They have examples where the perceptron
does well while gradient descent on the least-squares error (LSE) fails.
They conclude that the failure of GD and LSE may be much more widespread
than presumed.
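
For reference, here are the two update rules being compared, in a minimal
sketch (one linear threshold unit, hypothetical names; alpha is the
learning rate):

/* Perceptron rule: the weights change only when the thresholded output
   is wrong, so it attacks misclassifications directly.                 */
void perceptron_update(int n, double w[], double x[], int target,
                       double alpha)
{
    int j, out;
    double net = 0.0;

    for (j = 0; j < n; j++)
        net += w[j] * x[j];
    out = (net >= 0.0) ? 1 : -1;             /* target is +1 or -1      */
    if (out != target)
        for (j = 0; j < n; j++)
            w[j] += alpha * target * x[j];
}

/* Gradient descent on the squared error (LMS / delta rule): the weights
   change on every pattern, in proportion to the *linear* error.  This
   minimizes a different quantity from the number of misclassifications,
   which is the point of the paper.                                     */
void lms_update(int n, double w[], double x[], double target,
                double alpha)
{
    int j;
    double net = 0.0;

    for (j = 0; j < n; j++)
        net += w[j] * x[j];
    for (j = 0; j < n; j++)
        w[j] += alpha * (target - net) * x[j];
}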

                                          SURYA
                                         (FIAT LUX)
                                                               

------------------------------

Subject: Re: S. Pinker / A. Prince
From:    Allan F Randall <ncc!alberta!randall@uunet.uu.net>
Organization: U. of Alberta, Edmonton, AB
Date:    25 Mar 89 17:53:48 +0000 


> If someone really wants to blast PDP, there are _real_
> problems with it, like scaling of learning time, which make more
> sense to focus on than the things F&P talk about.

I think the reason Fodor and Pylyshyn do not concentrate on those issues is
because they are criticizing the philosophy of connectionism as a general
approach to studying the mind, rather than attacking specific problems with
current systems. The points they do discuss are all intended to be general
problems inherent in the philosophy of connectionism, which they do not
anticipate being solved.

There are a few aspects of their reasoning that puzzle me; I thought I'd
respond by posting my own reactions to the article. I would be interested in
hearing from anyone who perhaps understands their perspective better,
particularly since I haven't read any connectionist responses to Fodor and
Pylyshyn (does anybody know of any?).

First of all, though, I think they do a reasonably good job of clearing up a
few contentious points concerning which issues are directly relevant to the
connectionist/symbolist debate and which are not. For instance, while
parallelism is central to most connectionist systems, there is nothing
contrary to the classical symbolist view in massive parallelism. The same
goes for the idea of soft or fuzzy constraints.

I have two major problems with the rest of their article. First, they seem
very limited in the types of connectionist systems they are willing to
discuss.  Most of the examples they give are of systems where each node
represents some concept or proposition. Hence, they only discuss
connectionist systems that already have a lot in common with symbolic
systems. They talk very little about distributed representations and
sub-symbolic processes. This seems strange to me, since I would consider
these things to be the central justification for the connectionist approach.
Fodor and Pylyshyn seem to be artificially limiting the connectionist
architecture to a narrow form of symbolism and then judging it on its
performance as a logic. What they fail to realize is that it is these very
assumptions they use in judging PDP that connectionists are calling into
question in the first place. *Of course* PDP, at least in its current forms,
fails as a general logical inference mechanism. What (many of) the new
connectionists are saying is that these systems work better at *certain
types of tasks* than the classical systems. They are meant to address
problems with the symbolist approach. Yes, they fail miserably at many
things symbol systems do well, but this does not mean we must choose one
over the other.

This brings me to the other point, which I think is the key problem with
Fodor and Pylyshyn's approach. They do not seem to consider the possibility
of using *both* approaches. Their main argument is that mental
representations have a "combinatorial syntax and semantics" and "structure
sensitivity of processes."  The upshot of this is that to do the sorts of
things humans are good at, a system must have a systematic way to generate
mental representations that have a constituent structure. Connectionist
systems lack this language-like ability.  This is an argument for a
"Language of Thought." Because of this emphasis on the language-like aspects
of cognition, many of Fodor and Pylyshyn's arguments are about the inability
of PDP nets to deal with language. They then generalize to the rest of
cognition. While this is not entirely invalid, I think it really weakens
their argument, as language is the one aspect of cognition that seems to be
the most symbolic and the least connectionistic.

However, I would still agree with much of what they say. It is true that
thought must have these properties. Cognition must be more than the
statistical modelling of the environment. But Fodor and Pylyshyn give short
shrift to the idea that both types of architectures will be needed to handle
all aspects of cognition. Why could we not have a connectionist system
modelling the environment and creating distributed representations that are
used by a more classical symbolic processor? (This is, of course, only one
way of looking at hybrid systems.) While Fodor and Pylyshyn do spend a
little time discussing this sort of thing, it seems to be more of an
afterthought, rather than a central part of their argument. This seems
strange, especially since this is where the field of AI as a whole seems to
be going.

In short, Fodor and Pylyshyn are extreme symbolists. They believe in the
classical symbolist view in its most extreme form: physical symbol systems
are a necessary and sufficient condition for cognition. Their article does a
good job of arguing for the "necessary" part, but pays little attention to
the more central "sufficient" part. Like the extreme connectionists, they
seem convinced that we must choose one or the other. They show that a pure
connectionist system could not work and thus conclude that pure symbolism is
the answer.

To give them credit, while I disagree with their conclusions, I think they
do a good job of explaining why an intelligent system must display these
properties and why current connectionist architectures are insufficient on
their own to do the job. They build a good case against extreme
connectionism, but fail to explain why this implies the other extreme.

To summarize, my problems with Fodor and Pylyshyn are:
  i) they criticize connectionism as (a) a symbolic logic and (b) a model of
     language, largely ignoring its other aspects as unimportant to cognition.
  ii) they ignore hybrid connectionist/symbolist approaches.

Finally, I think Fodor and Pylyshyn simply have different intuitions than I
do.  They seem to feel that the statistical and distributed nature of
intelligence is really not very crucial, if it is there at all. While I
disagree, I can certainly respect that. But I was disappointed with their
article, because I didn't think it really addressed the issue. I would love
to see a critique of connectionism that considered the "sufficient" aspect
of symbol systems as rigorously as they discussed the "necessary" aspects.

Allan Randall
Dept. Computing Science
University of Alberta
Edmonton, AB

------------------------------

Subject: Training
From:    "Roger Gonzalez" <spam@sun.soe.clarkson.edu>
Date:    Sun, 19 Mar 89 20:28:03 +0000 

Has anyone come up with a good way to train a net without knowing a "target"
in advance?  For example, if my net is in an undesirable state S1, and it
is totally clueless about how to get to state S2, I would like it to improve
the connections that do get it to state S2, and impede any others.  Now,
since it has no data at all to go on, and the only info it gets after 1 pass
is "nope, not in state 2 yet", what can I use as a target?  All the learning
methods I've studied so far are inadequate for this.

Thanks in advance,
-Roger

++ Roger Gonzalez ++ spam@clutx.clarkson.edu ++ Clarkson University ++

------------------------------

Subject: Re: Training
From:    hal@vicom.COM (Hal Hardenbergh)
Organization: Vicom Systems Inc. San Jose, Cal.
Date:    Tue, 21 Mar 89 21:14:47 +0000 

A colleague and I have tried several of the back-prop "speedup" methods.
All that we have tried do speed up convergence, to some degree, as
determined by the number of epochs (training iterations).  However, none of
them reliably provide a speed improvement as measured by wall clock time.

The ones which do (sometimes) provide a slight improvement in wall clock
time do not do so reliably.  It's sort of like varying the convergence and
momentum factors.  Depending on the random initialization of the weights and
biases, c1 and m1 will work better than c2 and m2, and vice versa.

As long as we are simulating artificial neural nets in software (if
simulating is the right word here), does anyone know of a back-prop speedup
trick which reduces the wall-clock training time?

Hal Hardenbergh  [incl std dsclmr]  hal@vicom.com
Vicom Systems Inc  2520 Junction Ave  San Jose  CA  (408) 432-8660
surely nobody here remembers the newsletter DTACK Grounded ?

------------------------------

Subject: Re: Training
From:    kavuri@cb.ecn.purdue.edu (Surya N Kavuri )
Organization: Purdue University Engineering Computer Network
Date:    Wed, 22 Mar 89 04:07:05 +0000 


  This, I believe, could be tried using other inexact search methods with
BP.  REF: Conjugate-gradient methods with inexact searches, in Mathematics
of Operations Research, vol. 3.

Gradient evaluation is an expensive computation in BP.  A gradient-reuse
algorithm has been suggested whose idea is the following: gradients are
reused several times until the resulting weight updates no longer lead to a
reduction in error.

Further, the use of batching makes the search direction more accurate.
Batching means considering the total error (to be minimized) as the sum of
squared differences between the desired and observed outputs, summed over
all patterns.
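
In code, batching is just a matter of where the weight update is applied;
a sketch for a single linear unit (hypothetical names -- a full BP net
batches the same way, accumulating the gradient for every weight):

#define NPAT 100                      /* number of training patterns      */
#define NIN  8                        /* number of inputs                 */

double x[NPAT][NIN];                  /* input patterns                   */
double t[NPAT];                       /* desired outputs                  */
double w[NIN];                        /* weights                          */

/* One batched epoch: the error gradient is accumulated over *all*
   patterns and only then is a single weight step taken, so the step
   follows the true gradient of the total squared error.  (Per-pattern
   updating would instead apply the weight change inside the pattern
   loop, following a noisy estimate of that gradient.)                */
void batch_epoch(double alpha)
{
    int p, j;
    double grad[NIN];

    for (j = 0; j < NIN; j++)
        grad[j] = 0.0;
    for (p = 0; p < NPAT; p++) {
        double out = 0.0, err;
        for (j = 0; j < NIN; j++)
            out += w[j] * x[p][j];
        err = t[p] - out;
        for (j = 0; j < NIN; j++)
            grad[j] += err * x[p][j];         /* sum over patterns        */
    }
    for (j = 0; j < NIN; j++)
        w[j] += alpha * grad[j];              /* one step per epoch       */
}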


                                                                SURYA
                                                       (FIAT LUX)

------------------------------

Subject: Re: Training
From:    wlp@calmasd.Prime.COM (Walter L. Peterson, Jr.)
Organization: Prime-Calma, San Diego R&D, Object and Data Management Group
Date:    Wed, 22 Mar 89 17:23:01 +0000 


State S2 is, in and of itself, the target.  That state, whatever it might
be, has to be represented to the machine in some manner.  The fact that S2
might be a very long text string or some very large binary number or some
arbitrary bit pattern that you make up makes no difference; if you know
where you are, and you know where you want to be, you can get there ( it
might not be an easy trip, though :-) ).

The point is that you must not confuse your interpretation of the "meaning"
of the bit pattern with the bit pattern itself.

As for your statement that "...since it has no data at all to go on...", I
submit that this is meaningless, not only for a neural net, but for any
computer program. Also, your connections and their weights, by themselves,
constitute 'data to go on'.

Given that the system starts in some arbitrary state S1, which must be known
or knowable to the system, and given that its goal is to get to some other
known or knowable state S2 there are three ways to proceed.  First, if there
is some known transform that can be applied to state S1 to turn it into some
other valid state Sn, then that transform must be continually applied to the
current state, call it Sc, and then Sc must be tested against S2 to see if
we have found S2.  The existence of some transform implies that there is
some set of parameters upon which the transform operates, and lacking any
data to go on, the system must perform a systematic exhaustive search
of the parameter space. The second method is a variation on the first.
Lacking a known transform, the best that you can do is apply random values
to the things that constitute the state and wait for S2 to turn up. This is
the situation that you would have in your example, where you say that "...it
is totally clueless about how to get to S2...". Totally clueless implies
that you don't even have an algorithm for transforming S1 into ANY other
valid state, Sn.

This, of course, assumes that you have no means of determining how "far
away" the current state Sc is from S2.  If you have some way of determining
that distance, then what you've got is, essentially, the gradient descent of
back-propagation which is the third way of going from state S1 to state S2.
This "distance" information ( the error function in back-prop ) and the
connections with their weights are, essentialy, all the data that you
need.--

Walt Peterson.  Prime - Calma San Diego R&D (Object and Data Management Group)
"The opinions expressed here are my own and do not necessarily reflect those
Prime, Calma nor anyone else.
...{ucbvax|decvax}!sdcsvax!calmasd!wlp

------------------------------

Subject: Re: Training
From:    mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel)
Organization: Princeton University, NJ
Date:    Thu, 23 Mar 89 02:34:23 +0000 

I'm using conjugate-gradient minimization in neural networks.  Each
iteration of conjugate gradient gains significantly more on the error
surface than "standard" backprop, but then again, each iteration requires an
entire direction search & minimization, which may take anywhere from 5 to 10
net evaluations (over the whole training set or whatever).

It's not always clear whether or not it saves "wall-clock" time.  Sometimes,
when the function is "easy", it sniffs out the minimum very quickly.  It has
the advantage that you're not dependent at all on a "learning parameter".
But, of course, it's significantly more difficult to code than naive
gradient descent, and can't possibly have any neurological justification.

When I minimize the error surface w/ conjugate gradient, I usually take the
sum over most or all elements of the training set.  Conjugate gradient is
very good at searching out local minima, and so I think some precautions
must be noted:

        1)  Don't try to minimize on one example only, and then try
            minimizing on the next and so on.  C.G. will very quickly
            find a minimum that "memorizes" the first example, and then
            just as easily forget it and memorize the second, etc. with
            very little generalization over many examples.

        2)  C.G. will still find local minima or regions where
            the error decreases very slowly,  even training over all examples.
            Thus I don't train over the same examples for every iteration
            of C.G.; I train over some large random subset for a while
            and then switch off, either training over the whole set
            or another random subset, depending on how well I did.

It has some advantages though: it's very dogged, and doesn't screw up
royally very often.  I often find that I can do very well with regular BP
getting down towards a minimum, and then find that it screws up and starts
to climb back up the other side, and because the step size is too large,
can't find the narrow valley at the bottom.  C.G. almost always will
minimize your function, though it can get stuck in local minima more often.
Please note that I'm doing a "hard" task where I need to map continuous
values to continuous values with as high accuracy as possible.
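
For anyone who wants to try it, the skeleton of the loop I use looks
roughly like this (Polak-Ribiere flavor; gradient() and line_minimize()
stand in for your own net-evaluation and Brent-style search routines):

#define NW 200                        /* number of weights                */

extern void gradient(double w[], double g[]);   /* dE/dw over training set */
extern void line_minimize(double w[], double d[]);
              /* move w along direction d to the one-dimensional minimum;
                 this is where the 5 to 10 net evaluations per iteration go */

void conjgrad(double w[], int niter)
{
    double g[NW], g_old[NW], d[NW];
    double beta, num, den;
    int i, k;

    gradient(w, g);
    for (i = 0; i < NW; i++)
        d[i] = -g[i];                 /* first direction: steepest descent */

    for (k = 0; k < niter; k++) {
        line_minimize(w, d);          /* w moves to the minimum along d    */
        for (i = 0; i < NW; i++)
            g_old[i] = g[i];
        gradient(w, g);

        num = den = 0.0;              /* Polak-Ribiere coefficient         */
        for (i = 0; i < NW; i++) {
            num += g[i] * (g[i] - g_old[i]);
            den += g_old[i] * g_old[i];
        }
        beta = num / den;
        if (beta < 0.0)
            beta = 0.0;               /* restart with steepest descent     */

        for (i = 0; i < NW; i++)
            d[i] = -g[i] + beta * d[i];
    }
}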
            
>Gradient evaluation is an expensive computation in BP.  A gradient-reuse
>algorithm has been suggested whose idea is the following: gradients are
>reused several times until the resulting weight updates no longer lead to a
>reduction in error.

In this case, you're just doing a very naive line-minimization in the
gradient direction.  You could do much better by using a good line-min
routine which can interpolate to the presumed minimum much faster.  (I use
the BRENT routine from Numerical Recipes for my C.G.)  I actually
implemented this line-minimization alg., but was disappointed in its
performance.  Regular BP usually did better.

>
>Further, the use of batching makes the search direction more accurate.
>Batching means considering the total error (to be minimized) as the sum of
>squared differences between the desired and observed outputs, summed over
>all patterns.

Yes, but if you add on the gradient each time, later examples get the
benefit of whatever global features have been learnt from earlier
examples.

I usually find that standard back-prop (update after each pattern) w/
momentum is faster.  Indeed, because of its simplicity & speed, I've found
it hard to beat in real wall-clock time.  90% of a back-prop program's time
is spent in the forward iteration & gradient computation subroutines.  It's
very easy to double the complexity of the inner loops b/c back-prop w/ fixed
steps is so simple, and thus really slow the program down a lot.

I use standard back-prop for a while before starting on C.G. -- when starting
from a random position, C.G. is usually pretty clueless and needs to get
closer in.

Matt Kennel
mbkennel@phoenix.princeton.edu

------------------------------

Subject: Re: Training
From:    rich@bunny.UUCP (Rich Sutton)
Organization: GTE Laboratories, Waltham, MA
Date:    Thu, 23 Mar 89 16:17:59 +0000 

The learning methods that you are looking for are called "reinforcement
learning" methods.  They are less well known than supervised learning
methods, but hardly obscure.  And of course they are more powerful in the
sense you refer to -- they can learn what to do without a teacher that
already knows the correct answers.  I recommend the following papers.
                                        -Rich Sutton


A. Barto, R. Sutton, & P. Brouwer, ``Associative search network:
A reinforcement learning associative memory," Biological
Cybernetics, 40, 1981, pp. 201--211.

A. Barto & R. Sutton, ``Landmark learning: An illustration of
associative search," Biological Cybernetics, 42, 1981, 
pp. 1--8.

Barto, Sutton, and Anderson, ``Neuronlike elements that can solve
difficult learning control problems,'' IEEE Trans. on Systems, Man, and
Cybernetics, 13, 1983, pp. 835--846.

Williams, R. J., ``Toward a theory of reinforcement-learning
connectionist systems,'' Technical Report NU-CCS-88-3, College of
Computer Science, Northeastern University, Boston, Massachusetts, 1988.

My dissertation is also directly relevant and I invite you to write me
for a copy at GTE Labs, Waltham MA 02254.
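
To give the flavor of these methods, here is a minimal sketch in the spirit
of the algorithms in the papers above (not code taken from any of them; all
names are made up):

#include <math.h>
#include <stdlib.h>

/* One stochastic logistic unit.  Given inputs x it tries an action:
   it fires (y = 1) with probability p, stored for the caller.        */
int try_action(int n, double w[], double x[], double *p_out)
{
    int j;
    double net = 0.0;

    for (j = 0; j < n; j++)
        net += w[j] * x[j];
    *p_out = 1.0 / (1.0 + exp(-net));
    return (((double) rand() / RAND_MAX) < *p_out) ? 1 : 0;
}

/* Once the environment hands back a scalar reward r for that action
   (say +1 for success, 0 for failure), the weights move so that
   rewarded actions become more probable in that situation.  No
   teacher ever supplies the "correct" output.                        */
void learn_from_reward(int n, double w[], double x[], int y, double p,
                       double r, double alpha)
{
    int j;
    for (j = 0; j < n; j++)
        w[j] += alpha * r * ((double) y - p) * x[j];
}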

------------------------------

Subject: Re: Training
From:    kavuri@cb.ecn.purdue.edu (Surya N Kavuri )
Organization: Purdue University Engineering Computer Network
Date:    Fri, 24 Mar 89 05:31:02 +0000 


 Besides gradient and conjugate-gradient methods there are others one could
try.  There are methods known as quasi-Newton methods that are known to
perform much better in non-linear optimization.  This is because they use
second- and higher-order derivatives besides the first (which is all that
gradient methods use), and thus use more knowledge of the objective function
contours.  Second- and higher-order derivatives can be thought of as
indicators of the error (in approximating the surface) that arises from
following the gradient alone.  This additional knowledge is used to achieve
much more rapid convergence.  The computational difficulty with Newton-type
approaches is the evaluation of the Hessian; quasi-Newton methods get around
this with inexpensive updating schemes such as the BFGS
(Broyden-Fletcher-Goldfarb-Shanno) algorithm.  (Most NLP books should have
this.)
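
For the curious, one step of the BFGS update of the inverse-Hessian
approximation looks like this (a sketch only, hypothetical names; a real
routine would also guard against a non-positive y'.s and would take the
search direction as d = -Hg):

#include <stdlib.h>

/* One BFGS update of the inverse-Hessian approximation H (n x n,
   symmetric), given the last step s = w_new - w_old and the change
   in gradient y = g_new - g_old.  Only O(n^2) work and storage; the
   Hessian itself is never evaluated directly.                        */
void bfgs_update(int n, double **h, double *s, double *y)
{
    int i, j;
    double ys = 0.0, yhy = 0.0;
    double *hy = (double *) malloc(n * sizeof(double));

    for (i = 0; i < n; i++)
        ys += y[i] * s[i];                    /* y'.s                     */
    for (i = 0; i < n; i++) {
        hy[i] = 0.0;
        for (j = 0; j < n; j++)
            hy[i] += h[i][j] * y[j];          /* H.y                      */
        yhy += y[i] * hy[i];                  /* y'.H.y                   */
    }
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            h[i][j] += (1.0 + yhy / ys) * s[i] * s[j] / ys
                     - (s[i] * hy[j] + hy[i] * s[j]) / ys;
    free(hy);
}
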
                                           Surya

------------------------------

Subject: Re: Training
From:    mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel)
Organization: Princeton University, NJ
Date:    Sat, 25 Mar 89 02:19:57 +0000 


I don't think you want to use Quasi-Newton methods such as BFGS for most
neural-net problems.  They require O(N^2) storage where N is the number of
_weights_, and in each iteration require at least O(N^2) operations.  If
you're dealing with small N, this is usually insignificant next to the time
to evaluate your functions, but for large N, it might be a problem.

I use conjugate-gradient, which doesn't have this problem but probably
requires more function evaluations than BFGS by a moderate, but not huge
factor.

Matt K.
mbkennel@phoenix.princeton.edu

------------------------------

Subject: Re: Training
From:    "Roger Gonzelez" <spam@sun.soe.clarkson.edu>
Date:    Mon, 03 Apr 89 21:29:15 +0000 

Let me restate my question on training in a slightly different way.  Since
I'm an undergraduate doing research in NN and I can get away with it, my
"problem" is not one of much importance or practicality.  Here it is:

There is a small organism in my pseudo-universe that has a life that
consists mainly of swimming up or down, eating things, and dodging annoying
predators that nibble at it.  The shallow waters are safe, but all the food
(and predators) are down deeper.  There are other stimuli as well, but these
will suffice to give you the gist of what I'm playing with.

Let's assume that our friend is in deep water (literally) and a predator is
munching on him. (It's never fatal, but the "pain" and "fear" inputs are
bothering it.)

The correct solution to get out of this state is to swim up.  However, I
don't want to have to tell it this.  The solution that I'm playing with
right now (suggested to me via mail) is something I'm told is called "weakly
supervised learning".  I haven't implemented much of this solution yet, so I
don't know how well it will work.  The basic (and obvious) method seems to
be to negatively affect any attempt that didn't get it out of the
undesirable state.  I can't believe I didn't think of this on my own.
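
In code, the loop I have in mind looks roughly like this (everything here
is hypothetical; choose_action() would be the net's forward pass, and
reinforce() the weight adjustment):

extern int  choose_action(double state[]);   /* the net's forward pass     */
extern void do_action(int action);           /* swim up/down, eat, etc.    */
extern void read_senses(double state[]);     /* refresh "pain", "fear" ... */
extern int  in_bad_state(double state[]);    /* is it still being bitten?  */
extern void reinforce(int action, double state[], double amount);
                      /* nudge the weights that voted for 'action'         */

/* Weakly supervised loop: no target output is ever given.  Every
   attempt that leaves the creature in the undesirable state gets a
   small penalty; the attempt that finally escapes gets rewarded.     */
void escape(double state[])
{
    while (in_bad_state(state)) {
        int action = choose_action(state);
        do_action(action);
        read_senses(state);
        if (in_bad_state(state))
            reinforce(action, state, -0.1);   /* didn't help: impede      */
        else
            reinforce(action, state, +1.0);   /* got out: strengthen      */
    }
}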

Any other suggestions?

   - Roger

++ Roger Gonzalez ++ spam@clutx.clarkson.edu ++ Clarkson University ++

------------------------------

Subject: Re: Training
From:    "John B. Nagle" <shelby!glacier!jbn@DECWRL.DEC.COM>
Organization: Stanford University
Date:    04 Apr 89 04:29:04 +0000 


      This is a vivarium approach to AI, and a good one.  See Ann Marion's
classic "aquarium" program.  Mike Travers recent MS thesis at MIT offers
some insight on current thinking in this area.

      Worth considering for a system like yours, where you really want to
cold-start with no built-in knowledge at all, is to add some analogue of
"evolution" to the system.  The simulator should cause the fish to die if
they don't get enough food or get bitten too much.  Fish that thrive should
reproduce, with some tweaking of the parameters of the offspring to simulate
mutation.  Provided that the initial environment is not so hostile that all
the fish die, which can be handled by arranging the simulator so that the
predators are very few until the population increases above some level, the
fish should become more effective over time.  Once it's working, let it run
over a weekend and see what happens.
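
A sketch of such a generation loop (hypothetical structure; the fitness
measure and the mutation noise are the knobs to experiment with):

#include <stdlib.h>

#define NFISH    50
#define NWEIGHTS 100

struct fish {
    double weights[NWEIGHTS];     /* the net controlling this fish        */
    double food_eaten;
    double times_bitten;
};

struct fish population[NFISH];

double fitness(const struct fish *f)
{
    return f->food_eaten - f->times_bitten;
}

static int by_fitness(const void *a, const void *b)
{
    double fa = fitness((const struct fish *) a);
    double fb = fitness((const struct fish *) b);
    return (fa < fb) ? 1 : (fa > fb) ? -1 : 0;    /* fittest first        */
}

/* Small random tweak of one weight, standing in for mutation. */
double mutate(double w)
{
    return w + 0.05 * ((double) rand() / RAND_MAX - 0.5);
}

/* After each simulated lifetime: the bottom half of the population dies,
   and each empty slot is refilled with a mutated copy of a survivor from
   the top half.  No built-in knowledge anywhere; over many generations
   the surviving parameter sets should simply get better at staying fed
   and unbitten.                                                         */
void next_generation(void)
{
    int i, j;

    qsort(population, NFISH, sizeof(struct fish), by_fitness);
    for (i = NFISH / 2; i < NFISH; i++) {
        const struct fish *parent = &population[i - NFISH / 2];
        for (j = 0; j < NWEIGHTS; j++)
            population[i].weights[j] = mutate(parent->weights[j]);
        population[i].food_eaten   = 0.0;
        population[i].times_bitten = 0.0;
    }
}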

                                        John Nagle

------------------------------

End of Neurons Digest
*********************