[comp.ai.neural-nets] Neuron Digest V5 #29

neuron-request@HPLABS.HP.COM ("Neuron-Digest Moderator Peter Marvit") (07/08/89)

Neuron Digest	Friday,  7 Jul 1989
		Volume 5 : Issue 29

Today's Topics:
	rochester connectionist simulator available on uunet.uu.net
		     Two Problems with Back-propagation
		   How to simulate Foreign Exchange Rates
		 Re: How to simulate Foreign Exchange Rates
	       References on learning and memory in the brain
			 DARPA Neural Network Study
		    ART and non-stationary environments
			    Karmarkar algorithm
	     NEURAL NETWORKS TODAY (a new book on the subject)
			 Re: Neural-net marketplace
		       RE: climatological data wanted
	       Kohonen musical application of neural network
		   Back Propagation Algorithm question...
		  Back Propagation question... (follow up)
		Re: Back Propagation question... (follow up)
		Re: Back Propagation question... (follow up)
		Re: Back Propagation question... (follow up)
		Re: Back Propagation question... (follow up)
		 Re: Back Propagation Algorithm question...
		  Accelerated learning using Dahl's method
			      Info on DYSTAL?
			 3-Layer versus Multi-Layer
		       Re: 3-Layer versus Multi-Layer
		       Re: 3-Layer versus Multi-Layer
		       Re: 3-Layer versus Multi-Layer


Send submissions, questions, address maintenance and requests for old issues to
"neuron-request@hplabs.hp.com" or "{any backbone,uunet}!hplabs!neuron-request"
ARPANET users can get old issues via ftp from hplpm.hpl.hp.com (15.255.16.205).

------------------------------------------------------------

Subject: rochester connectionist simulator available on uunet.uu.net
From:    bukys@cs.rochester.edu (Liudvikas Bukys)
Organization: U of Rochester, CS Dept, Rochester, NY
Date:    Fri, 21 Apr 89 14:39:56 +0000 

A number of people have asked me whether the Rochester Connectionist
Simulator is available by uucp.  I am happy to announce that uunet.uu.net
has agreed to be a redistribution point of the simulator for their uucp
subscribers.  It is in the directory ~uucp/pub/rcs on uunet:

 -rw-r--r--  1 8        11           2829 Jan 19 10:07 README
 -rw-r--r--  1 8        11         527247 Jan 19 09:57 rcs_v4.1.doc.tar.Z
 -rw-r--r--  1 8        11           9586 Jul  8  1988 rcs_v4.1.note.01
 -rw-r--r--  1 8        11            589 Jul  7  1988 rcs_v4.1.patch.01
 -rw-r--r--  1 8        11           1455 Apr 19 19:18 rcs_v4.1.patch.02
 -rw-r--r--  1 8        11            545 Aug  8  1988 rcs_v4.1.patch.03
 -rw-r--r--  1 8        11         837215 May 19  1988 rcs_v4.1.tar.Z

These files are copies of what is available by FTP from the directory
pub/rcs on cs.rochester.edu (192.5.53.209).  We will still send you, via
U.S. Mail, a tape and manual for $150 or just a manual for $10.

If you are interested in obtaining the simulator via uucp, but you aren't a
uunet subscriber, I can't help you, because I don't know how to sign up.
Maybe a note to postmaster@uunet.uu.net would get you started.

Liudvikas Bukys
<simulator-request@cs.rochester.edu>

------------------------------

Subject: Two Problems with Back-propagation
From:    Rich Sutton <rich@gte.com>
Date:    Tue, 25 Apr 89 14:03:03 -0400 

A recent posting referred to my paper that analyzes steepest descent
procedures such as back-propagation.  That posting requested the full
citation to the paper:

   Sutton, R.S. ``Two problems with backpropagation and other steepest-descent
   learning procedures for networks'', Proceedings of the Eighth Annual
   Conference of the Cognitive Science Society, 1986, pp. 823-831.

The paper is not really ``an attack'' on gradient descent, but an analysis
of its strengths and weaknesses with an eye to improving it.  The analysis
suggests several directions in which to look for improvements, but pursues
none very far.  Subsequent work by Jacobs (Neural Networks, 1988, p.295) and
Scalettar and Zee (Complex Systems, 1987) did pursue some of the ideas, but
others remain unexplored.  Most of the discussion is still relevant today,
though I now have more doubts about simply adopting conventional
(non-steepest) descent algorithms for use in connectionist nets.

 -Rich Sutton


------------------------------

Subject: How to simulate Foreign Exchange Rates
From:    harish@mist.CS.ORST.EDU (Harish Pillay)
Organization: Oregon State University, E&CE, Corvallis, Oregon 97331
Date:    Wed, 26 Apr 89 05:32:51 +0000 

I am taking a grad course on NNs and am planning a project that tries to
predict foreign exchange rates, specifically the following:

   US$ vs British Pound vs Japanese Yen vs Singapore $ vs German Marks

I am using NeuralWorks and am thinking of using the backprop strategy.  So
far, all I've done is gather the exchange rates reported in the WSJ from
March 17 to today.  I've normalized them to lie between 0 and 1, but my
problem is in trying to train the network.  Has anyone out there done
anything similar?  If so, what desired output values did you use for
training?  I understand that it is naive to just take the rates themselves
and try to find a pattern or correlation.  Should I be looking at other
values too?  What kind of transfer function should I use?  I think one
hidden layer may be sufficient.

I would really appreciate any suggestions, and will post something once I
get this project done.

Thanks.

Harish Pillay                                Internet: harish@ece.orst.edu
Electrical and Computer Engineering          MaBell: 503-758-1389 (home)
Oregon State University                              503-754-2554 (office)
Corvallis, OR 97331

------------------------------

Subject: Re: How to simulate Foreign Exchange Rates
From:    andrew@berlioz (Andrew Palfreyman)
Organization: National Semiconductor, Santa Clara
Date:    Wed, 26 Apr 89 09:06:30 +0000 

One brute-force method, to separate the chicken from the egg, might be to
use the changes instead of the absolute values (especially since you're
using localised data which doesn't span a boom or a crash).  Maybe then you
could use 3 inputs in parallel (3 currencies) and 2 outputs, and just ring
the changes (5C3 = 10 ways) until the input deltas produce correct output
deltas.  An associative net might do this better.  Alternatively, you could
play with recurrent nets (Jordan, etc.), whereby you try to predict
tomorrow's 5-vector, given today's.
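
To make the "use the changes" idea concrete, here is a minimal sketch of
turning daily rate series into training pairs of deltas, with today's change
vector as input and tomorrow's as target.  It is plain Python; the
currencies, numbers, and one-day window are invented for illustration, not a
recommendation.

    # Turn daily rate series (one per currency) into day-to-day deltas, and
    # pair today's delta vector with tomorrow's as an (input, target) example.
    def deltas(series):
        return [b - a for a, b in zip(series, series[1:])]

    def delta_patterns(rate_table):
        """rate_table: {currency: [daily rates]}; returns (input, target) pairs."""
        cols = [deltas(v) for v in rate_table.values()]
        vectors = list(zip(*cols))              # one delta vector per day
        return list(zip(vectors, vectors[1:]))  # today's vector -> tomorrow's

    # Toy example (numbers invented):
    rate_table = {
        "GBP": [1.695, 1.701, 1.698, 1.705],
        "JPY": [0.00752, 0.00749, 0.00755, 0.00751],
        "DEM": [0.534, 0.536, 0.533, 0.537],
    }
    patterns = delta_patterns(rate_table)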

Andrew Palfreyman 		USENET: ...{this biomass}!nsc!logic!andrew
National Semiconductor M/S D3969, 2900 Semiconductor Dr., PO Box 58090,
Santa Clara, CA 95052-8090 ; 408-721-4788 		there's many a slip
							'twixt cup and lip

[[ Editor's Note: This problem points out the more general one of
input/output representation; this is still a hot topic in AI circles and
even in many traditional computing fields.  The representation often
determines the architecture and outcome.  Strict and useful guidelines for
ANNs don't yet exist.  Modeling the changes, rather than values, seems like
a neat solution amenable to many problems, however. -PM ]]

------------------------------

Subject: References on learning and memory in the brain
From:    honavar@goat.cs.wisc.edu (Vasant Honavar)
Organization: U of Wisconsin CS Dept
Date:    Wed, 26 Apr 89 17:20:57 +0000 

I am looking for papers (good reviews in particular) on learning and memory
mechanisms in the brain from the perspectives of neuroscience and
psychology. Please e-mail me the lists of papers that you know of and I will
compile a bibliography and make it available to anyone that is interested.
Thanks.

Vasant Honavar
Computer Sciences Dept.
University of Wisconsin-Madison

honavar@cs.wisc.edu

[[ Editor's Note: Hmmm, there are more papers and books on this subject
than I can remember. However, one of the best recent (survey) books,
complete with a reasonable bibliography to get you started, is "Memory and
Brain" by Larry R. Squire (1987, Oxford University Press). -PM ]]

------------------------------

Subject: DARPA Neural Network Study
From:    djlinse@phoenix.Princeton.EDU (Dennis Linse)
Organization: Princeton Unversity, Princeton, NJ
Date:    Thu, 27 Apr 89 01:57:43 +0000 

I recently saw an advertisement for the complete report of the October 1987
to February 1988 DARPA study on U.S. national perspectives on neural
networks.

Has anyone seen/read this report?  Is it useful for a researcher, or is it
written more from the funding agency perspective?  Any information would be
useful.

And before I get inundated with requests, the publication information is:

$49.95 casebound, over 600 pages.  Shipping/handling: $5.00 for the first
copy, $1.50 for each additional copy shipped to a U.S. or Canada address;
$10 per copy to all other addresses.

AFCEA International Press
4400 Fair Lakes Court, Dept. S1
Fairfax VA 22033-3899
(703) 631-6190
(800) 336-4583 ext. 6190


Dennis   (djlinse@phoenix.princeton.edu)
 
Found at the top of a looonnng homework assignment:
   "Activity is the only road to knowledge"  G.B. Shaw

------------------------------

Subject: ART and non-stationary environments
From:    adverb@bucsb.UUCP (Josh Krieger)
Organization: Boston Univ Comp. Sci.
Date:    Thu, 27 Apr 89 20:40:50 +0000 

I think it's important to say one last thing about ART:

ART is primarily useful in a statistically non-stationary environment
because its learned categories will not erode with the changing input.  If
your input environment is stationary, then there may be little reason to use
the complex machinery behind ART; your vanilla backprop net will work just
fine.

 -- Josh Krieger

------------------------------

Subject: Karmarkar algorithm
From:    andrew@berlioz (Lord Snooty @ The Giant Poisoned Electric Head)
Organization: National Semiconductor, Santa Clara
Date:    Sat, 29 Apr 89 00:49:04 +0000 

Does anybody have comparative data on the Karmarkar algorithm with respect
to neural-net implementations?  The algorithm is apparently quite efficient
at optimising under constraints in large parameter spaces, an area where
comparative data on the neural approach would be very interesting.  In
particular, does anybody know of attempts to somehow encode this algorithm
in a neural/parallel fashion, or indeed whether this is possible?
Finally, could anybody recommend a reference work on Karmarkar? (for
novices, please!) - thanks in advance.

Andrew Palfreyman 		USENET: ...{this biomass}!nsc!logic!andrew
National Semiconductor M/S D3969, 2900 Semiconductor Dr., PO Box 58090,
Santa Clara, CA 95052-8090 ; 408-721-4788 		there's many a slip
							'twixt cup and lip

------------------------------

Subject: NEURAL NETWORKS TODAY (a new book on the subject)
From:    mmm@cup.portal.com (Mark Robert Thorson)
Organization: The Portal System (TM)
Date:    Thu, 04 May 89 22:36:01 +0000 

[[ Editor's Note: Caution!  Advertisement here.  Caveat emptor, especially
vis-a-vis the slight hyperbole below.  However, possibly quite useful to the
small audience who needs it.  I also saw a service advertised at IJCNN which
would mail you quarterly updates on what patents were filed in this field.
Sorry, I didn't pick up the literature. -PM ]]

I have just finished a book titled NEURAL NETWORKS TODAY, which is available
for $35 (plus $5 for postage and handling and 7% sales tax if you live in
California).

It's 370 pages, not counting title page, table of contents, and the
separators between chapters.  Velo-bound with soft vinyl covers.

Its contents include descriptions of 14 hardware implementations of neural
networks, by Leon Cooper, John Hopfield, David Tank, Dick Lyon, and others.
These implementations come from Nestor Associates, AT&T, Synaptics, and
others.  The source material is the U.S. patents currently in force in the
field.  About 250 pages are copies of these patents, about 100 pages are my
commentary on these patents.  (Patents are somewhat difficult to read; my
commentary makes everything clear.)

This book should be of primary interest to researchers doing patentable work
in the field of neural networks.  It's like getting a patent search for a
fraction of the usual price.

This book should also be of interest to people new to the field of neural
networks.  My commentary is at an introductory level, while the source
material is at a very detailed and technical level.  My commentary can help
someone acquire expert knowledge in a short period of time.

I will accept checks and corporate or university purchase orders.

Mark Thorson
12991 B Pierce Rd.
Saratoga, CA
95070

------------------------------

Subject: Re: Neural-net marketplace
From:    demers@beowulf.ucsd.edu (David E Demers)
Organization: EE/CS Dept. U.C. San Diego
Date:    Sun, 14 May 89 20:51:52 +0000 

In article <159@spectra.COM> eghbalni@spectra.COM (Hamid Eghbalnia) writes:
>	This is a purely a curiosity question.  How are the "NN-type"
>	companies doing?  I suppose the underlying question is:  Has
>	anybody been able to use the technology to develop applications
>	that has excited government or industry beyond just research?

I just read that SAIC received a $100 million contract to provide airports
with bomb-sensing luggage-scanning devices based on neural nets.  I believe
the order was from the FAA.  I don't have the article handy; it might have
been in EETimes - definitely a trade publication - within the past week or
so.  SAIC, of course, is not primarily a NN company, but $100MM is a big
chunk of business no matter who you are.

Dave

------------------------------

Subject: RE: climatological data wanted
From:    Albert Boulanger <bbn.com!aboulang@BBN.COM>
Date:    20 May 89 19:54:50 +0000 

maurer@nova2.Stanford.EDU (Michael Maurer) writes:

      Does anybody out there know of electronic sources for climatological
  data compiled by the US weather service?  The National Environmental
  Satellite Data and Information Service publishes thick books full of daily
  weather records from weather stations around the country, but I would
  prefer the data in machine-readable format.

       I am doing a research project for an Adaptive Systems class and
  am interested in short-term weather prediction using a system that learns
  from past weather records.  Please e-mail any info you might have.

How do you plan to do short-term prediction from data that is long term (and
hence below the sampling resolution you want)?

You can get computer media of this from:

National Climate Center, Environmental Data Information Service
NOAA, Dept of Commerce
Federal Building
Asheville NC 28801 (704) 258-2850

I got the initial pointer from a friend and the complete address from
the book:

Information USA
Matthew Lesko
Viking Press 1983
ISBN 0-670-39823-3 (hardcover)
ISBN 0 14 046.564 2 (paperback)

For shorter term records (~several days), there are a couple of dozen
services that provide such information (WSI in MA being one).

Albert Boulanger
BBN Systems & Technologies Corp
aboulanger@bbn.com
	

------------------------------

Subject: Kohonen musical application of neural network
From:    viseli@uceng.UC.EDU (victor l iseli)
Organization: Univ. of Cincinnati, College of Engg.
Date:    Tue, 23 May 89 22:57:44 +0000 

[[ Editor's Note: This subject was discussed at some length in previous
issues of Neuron Digest.  At IJCNN, Kohonen gave a "paper" describing some
of his latest experiments using a variation of an associative memory to
analyze pieces of music (by one composer or period) and remember sequences,
then generate "new" music by recursively recalling those sequences of notes
with some small variation.  The paper can be found in the Proceedings of
IJCNN 89.  My judgement?  As a musician, I found the music tedious and
lacking in substance; Mozart's musical dice do better!  The music we heard
owed more to its production (echoing organ-like synthesizer tones in various
registers and timbres) than to the notes.  As an engineer: a good start,
though without much musical foundation.  I certainly hope that the scheduled
two-hour concert at Winter IJCNN will be more than an indulgence in a famous
man's toy. -PM ]]

I am looking for information regarding Kohonen's demonstration of a neural
network at the INNS conference last September (1988).  He apparently trained
a neural net to compose music or harmonize to a melody in the manner of
famous classical composers (??).

I am also interested in any other information/references regarding the
recent discussion on neural nets in music.

Please send email.

         || ||  Victor Iseli                 (viseli@uceng.uc.edu)
         ||     University of Cincinnati
\\    // || ||  Dept. Electrical Engr.
 \\  //  || ||  811K Rhodes Hall
  \\//   || ||  Cincinnati, OH 45219      

------------------------------

Subject: Back Propagation Algorithm question...
From:    camargo@cs.columbia.edu (Francisco Camargo)
Organization: Columbia University Department of Computer Science
Date:    Mon, 29 May 89 23:26:49 +0000 


Can anyone shed some light on the following issue:

How should one compute the weight adjustments in BackProp?  From reading
PDP, one gets the impression that the DELTAS should be accumulated over
all INPUT PATTERNS and only then is a STEP taken down the gradient.
Robbins and Monro suggest a stochastic algorithm with proven convergence if
one takes a step at each pattern presentation but damps its effect by a
factor 1/k, where "k" is the presentation number.  Other people (judging
from code that I've seen flying around) seem to take a STEP at each
presentation and don't apply any damping factor.  I've tried both
approaches myself and they both seem to work.  After all, which is the
correct way of adjusting the weights?  Accumulate the errors over all
patterns?  Or work towards the minimum as new patterns are presented?
What are the implications?
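
In outline, the two alternatives are roughly the following (a Python sketch;
grad(w, p) stands for whatever routine computes the error gradient for
weights w on pattern p -- it is a placeholder, not from any particular
package):

    # "Batch": accumulate the deltas over all patterns, then take one step.
    def batch_step(w, patterns, lr, grad):
        total = [0.0] * len(w)
        for p in patterns:
            total = [t + g for t, g in zip(total, grad(w, p))]
        return [wi - lr * t for wi, t in zip(w, total)]

    # "Online": take a (small) step immediately after each presentation.
    def online_epoch(w, patterns, lr, grad):
        for p in patterns:
            w = [wi - lr * g for wi, g in zip(w, grad(w, p))]
        return w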

Any light on this issue would be extremely appreciated.

Francisco A. Camargo
CS Department - Columbia University
camargo@cs.columbia.edu


PS: A few weeks ago, I requested some pointers to Learning Algorithms in NNs
and promised a summary of the replies.  It is coming.  I have not forgotten
my responsibilities to this group.  Even though I got more requests than
genuinely new info, I'll have a summary posted shortly.  And thanks to all
who contributed.

------------------------------

Subject: Back Propagation question... (follow up)
From:    camargo@cs.columbia.edu (Francisco Camargo)
Organization: Columbia University Department of Computer Science
Date:    Tue, 30 May 89 14:18:30 +0000 

Hi there,

I'm re-posting my previous message together with a reply that I received from
Tony Plate and my reply to him. I'd really appreciate comments on this issue.
Thanks to all.

| There are two standard methods of doing the updates, sometimes called
| "batch" and "online" learning.
|
| In "batch" learning, all the changes are accumulated for one pass through
| all the examples.  At the end of the pass (or "epoch") the update is made.
| Thus, each link requires an extra storage field in which to accumulate
| the changes.
|
| In "online" learning, the change is made after seeing each example.
|
| Some people claim online is better, others claim batch is better.
|
| "dumping" (you mean "weighting") each change by 1/k, where k is the number
| of the example (?) sounds really wierd, do you mean if you had four examples
| in your training set changes from the fourth would be worth only a quarter
| as much as changes from the second? surely you don't mean this!
|
| Some people use a momentum term, and some change the learning rate during
| learning.  Using momentum seems to be generally a good thing, and it's
| easy to do.  Automatically changing the learning rate is much harder.
|
| .....
| ..... Connectionist Learning Algorithms by Hinton....
| .....
|
| tony plate

Hi Tony,

Sorry for my previous message being so unspecific.  What I meant is that the
damping occurs after each "epoch."  The idea is that the changes in the
weights tend to be of lesser and lesser importance.  Actually, the way the
algorithm is stated, one should damp (I really do mean damp) the step size by
a sequence of terms {a_k} with sum({a_k}^2) < infinity while sum({a_k}) itself
diverges.  In any case, using {a_k} = 1/k for k = "epoch number" should be
enough.
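
In code, the schedule amounts to no more than this (a sketch; grad, the base
rate, and the epoch count are placeholders, not recommendations):

    # Robbins-Monro style damping: a_k = 1/k across epochs, so that
    # sum(a_k^2) < infinity while sum(a_k) still diverges.
    def damped_online_training(w, patterns, grad, base_lr=0.5, n_epochs=100):
        for k in range(1, n_epochs + 1):
            lr_k = base_lr / k          # a_k = 1/k, fixed within epoch k
            for p in patterns:
                w = [wi - lr_k * g for wi, g in zip(w, grad(w, p))]
        return w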

My problem is that I can't find any (theoretical) justification for the
"online" method other than the Robbins-Monro algorithm (I may have misspelled
the name, for which I apologize, but I don't have my references nearby).
But then, the damping factor is required for guaranteed convergence.  I
tried the "online" method and it does seem to perform better.  But WHY does
it work?  How come it converges so well (despite making {a_k} = 1)?

I am familiar with the use of "momentum" in the learning process, but I
really want to understand the theoretical reasons for the "online" method
better.  Having started my studies with the "batch" mode, it seems a little
like black magic that the "online" method works.

I have the paper by Hinton, "Connectionist Learning Procedures",
CMU-CS-87-115.  Is this the paper you referred to?  Any other improvements on
this work?  I appreciate your time and effort.  Thanks,

/Kiko.
camargo@cs.columbia.edu

------------------------------

Subject: Re: Back Propagation question... (follow up)
From:    demers@beowulf.ucsd.edu (David E Demers)
Organization: EE/CS Dept. U.C. San Diego
Date:    Tue, 30 May 89 19:23:47 +0000 

[Tony replied]
| In "batch" learning, all the changes are accumulated for one pass through
| all the examples.  At the end of the pass (or "epoch") the update is made.
| Some people use a momentum term, and some change the learning rate during
| learning.  Using momentum seems to be generally a good thing, and it's
| easy to do.  Automatically changing the learning rate is much harder.

[No it's not...]
>------------------------------------------------------------------------------
[Francisco tries to explain what he means by "damping", and the
Robbins-Monro algorithm...]

[[Editor's Note: Most of the quotations deleted from above. -PM]]

Sorry to quote so much of the prior postings, but I thought it worth
it to retain context.

I am not sure that I fully understand Francisco's question.  But I'll answer
it anyway :-)

Essentially, what backpropagation is trying to do is achieve a minimum
mean squared error by following the gradient of the error as a function of
the weights.  The "batch" method works well because you get a good picture
of the true gradient after seeing all of the input-output pairs.  However,
as long as the corrections made go "downhill", we will converge (possibly
to a local rather than a global minimum).  Making weight changes after the
presentation of each training example will not necessarily follow the
gradient, but with a small learning rate we will, in the aggregate, still be
moving downhill (reducing MSE).
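
A tiny check of the "in the aggregate" point, for a single linear unit with
squared error (numpy, with invented toy data; nothing here is specific to
any simulator): the batch gradient is exactly the sum of the per-pattern
gradients, which is why small per-pattern steps add up to roughly one batch
step.

    import numpy as np

    # One linear output unit, error E = 0.5 * sum_p (t_p - w.x_p)^2.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 3))        # 10 patterns, 3 inputs each
    t = rng.normal(size=10)             # targets
    w = rng.normal(size=3)              # weights

    def grad_one(w, x, tp):             # gradient of 0.5*(tp - w.x)^2 w.r.t. w
        return -(tp - w @ x) * x

    per_pattern_sum = sum(grad_one(w, x, tp) for x, tp in zip(X, t))
    batch_gradient = -X.T @ (t - X @ w) # gradient of the summed error
    assert np.allclose(per_pattern_sum, batch_gradient)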

Dave

------------------------------

Subject: Re: Back Propagation question... (follow up)
From:    dhw@itivax.iti.org (David H. West)
Organization: Industrial Technology Institute
Date:    Tue, 30 May 89 20:09:55 +0000 

]My problem is that I can't find any (theoretical) justification for the
]"online" method other than the Robbins-Monro algorithm.

]But WHY does it work?  How come it
]converges so well (despite making {a_k} = 1)?

It's related to an old statistical hack for calculating the change in the
mean of a set of observations when another is added.  That formula takes 2
or 3 lines of algebra to derive, on a bad day.
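
Presumably the hack in question is the running-mean update; here is a quick
sketch (plain Python, toy numbers) of the identity and of why it has the
shape of an online update with a 1/k step size:

    # With m the mean of the first n-1 observations, adding x_n gives
    #   m_n = m_{n-1} + (x_n - m_{n-1}) / n
    # i.e. the new observation pulls the estimate toward itself by a 1/n
    # step -- the same shape as an online weight update with a_k = 1/k.
    xs = [2.0, 5.0, 1.0, 4.0]
    m = 0.0
    for n, x in enumerate(xs, start=1):
        m += (x - m) / n
    assert abs(m - sum(xs) / len(xs)) < 1e-12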

 -David       dhw@itivax.iti.org

------------------------------

Subject: Re: Back Propagation question... (follow up)
From:    mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel)
Organization: Princeton University, NJ
Date:    Tue, 30 May 89 20:28:32 +0000 

>But WHY does it work?  How come it
>converges so well (despite making {a_k} = 1)?
>
>I am familiar with the use of "momentum" in the learning process, but I 
>really want to understand more the theoretical reasons for the "online"
>method. Having started my studies with the "batch" mode, it seems a little
>like black magic that the "online" method works.

I have an intuitive explanation, but it's not rigorous by any means, and it
could even be completely wrong, but here goes...

In most problems, there is some underlying regularity that _all_ examples
possess that you're trying to learn.  Thus, if you update the weights after
each example, you get the benefit of learning from the previous examples,
but if you only update after a whole run through the training set, it takes
much longer to learn this regularity.

In my experiments, I've found that "online" learning works much better at
the beginning, when the network is completely untrained, because presumably
it's learning the general features of the whole set quickly, but later on,
when trying to learn the fine distinctions among examples, "online" learning
does worse, because it tries to "memorize" each example in turn instead of
learning the whole mapping.  In this regime, you have to use batch learning.

For many problems though, you never need this level of accuracy (I needed
continuous-valued outputs accurate to <1%) and so "online" learning is good
enough, and often significantly faster, especially with momentum.  Momentum
smooths out the weight changes from a few recent examples.  (Actually, for
my stuff, I like conjugate gradient on the whole "batch" error surface.)
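
For reference, a per-pattern update with momentum is only a couple of lines
(a sketch; grad and the constants are placeholders):

    # Per-pattern update with momentum: the velocity v is a decaying average
    # of recent gradients, which smooths out pattern-to-pattern noise.
    def online_epoch_momentum(w, v, patterns, grad, lr=0.1, mu=0.9):
        for p in patterns:
            g = grad(w, p)
            v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
            w = [wi + vi for wi, vi in zip(w, v)]
        return w, v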

Matt Kennel
mbkennel@phoenix.princeton.edu (6 more days only!!! )
kennel@cognet.ucla.edu  (after that)

------------------------------

Subject: Re: Back Propagation question... (follow up)
From:    artzi@cpsvax.cps.msu.edu (Ytshak Artzi - CPS)
Organization: Michigan State University, Computer Science Department
Date:    Tue, 30 May 89 23:51:56 +0000 

   As a general comment, you must be careful in choosing the particular
instance of the problem you try to solve.  If the initial state is close to
the correct solution, then both methods will work.  For any problem there
exists an instance for which convergence is not guaranteed for either
method.  Unfortunately, there is no good method available for detecting such
an instance, given an arbitrary problem.

  Now consider the following equation:

       DELTA w_p,ji = n (t_pj - O_pj) i_pi = n d_pj i_pi

  This rule changes the weights following presentation of I/O pair p, where

       t_pj     is the target for the j-th component of output pattern p
       O_pj     is the j-th element of the actual output pattern produced by
                input p
       i_pi     is the i-th element of input pattern p
       d_pj     = t_pj - O_pj
       n        is the learning rate
       DELTA w_p,ji  is the change to be made to the weight from the i-th to
                the j-th unit after input p
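
  Spelled out in code for a single layer of linear units (a Python sketch of
the rule above; the weight layout and learning rate are illustrative):

    # Delta rule for one I/O pair: DELTA w_ji = n * (t_j - O_j) * i_i,
    # with linear outputs O_j = sum_i w[j][i] * i_i.
    def delta_rule_update(w, inp, target, lr=0.1):
        """w[j][i] is the weight from input unit i to output unit j."""
        out = [sum(wji * xi for wji, xi in zip(row, inp)) for row in w]
        for j, row in enumerate(w):
            d_j = target[j] - out[j]
            for i, xi in enumerate(inp):
                row[i] += lr * d_j * xi
        return w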

  Hope it helps...

   Itzik.

------------------------------

Subject: Re: Back Propagation Algorithm question...
From:    heumann@hpmtlx.HP.COM ($John_Heumann@hpmtljh)
Organization: HP Manufacturing Test Division - Loveland, CO
Date:    Wed, 31 May 89 14:42:03 +0000 

A few comments.

1) Note that if backprop is modified by the addition of a search (rather
than a fixed step size) in the minimization direction, it's simply a form of
gradient descent.  In light of this,

2) If you want to accumulate the gradient accurately over the entire search
space, you're forced to either a) accumulate over all samples before altering
any weights, or b) take a tiny (actually infinitesimal) step after each
sample is presented.  Doing a full step after each sample presentation
destroys the descent property of the algorithm.

3) If you're using backprop as originally presented (i.e. either fixed step
size or with a momentum term), I don't believe there is any general way to
establish that one method is superior to the other for all problems.  I've
seen search spaces on which one wanders rather aimlessly and the other
converges rapidly; the trouble is that which is good and which is bad
depends on the particular problem!  It's certainly true that doing something
which causes you to deviate from a true descent path (like adding a momentum
term or taking a step after each sample) can help you escape local minima in
selected cases and can lead to more rapid convergence on some problems.
Unfortunately, it can also lead to aimless wandering and poor performance on
others.

4) If you can find the full reference, I'd be interested in seeing the
Robbins-Monro paper, since I'm unaware of any backprop-like method with
proven convergence for non-convex search spaces.

5) Personally, my choice for optimizing NNs is to modify backprop to be a
true gradient descent method and then use either the Fletcher-Reeves or the
Polak-Ribiere method for accelerated convergence.  Doing so means you WILL
have trouble with local minima if they're present in your search space, but
it avoids all the tweaky parameters in the original backpropagation
algorithm.  (Since no one set of parameters appears applicable across a wide
range of problems, you can waste a huge amount of time trying to tweak the
learning rate or the size of the momentum term; to my mind this is simply
not practical for large problems.)
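
As a sketch of what gradient descent with conjugate directions looks like
(Python; f and grad_f stand for the total batch error and its gradient over
all weights, flattened into one vector -- placeholders, not any particular
package):

    import numpy as np

    def line_search(f, w, d, steps=(2.0 ** -np.arange(12))):
        """Crude search along direction d: best of a few candidate steps."""
        return min(steps, key=lambda a: f(w + a * d))

    def polak_ribiere_cg(f, grad_f, w, n_iter=50):
        g = grad_f(w)
        d = -g                          # first direction: steepest descent
        for _ in range(n_iter):
            a = line_search(f, w, d)
            w = w + a * d
            g_new = grad_f(w)
            # Polak-Ribiere coefficient, clipped at zero (restart if negative)
            beta = max(0.0, g_new @ (g_new - g) / (g @ g))
            d = -g_new + beta * d       # new conjugate direction
            g = g_new
        return w

A production version would use a proper line search (e.g. one enforcing the
Wolfe conditions) and periodic restarts, but the skeleton is the same.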

Hope this is of some help.

ps: Note that Rumelhart et al. are purposely rather vague on whether the
weight adjustment is to be done after each sample presentation.  If you
carefully compare the chapter on backprop in PDP with their Nature paper,
you'll find that each uses a different tactic!

------------------------------

Subject: Accelerated learning using Dahl's method
From:    csstddm@cc.brunel.ac.uk (David Martland)
Organization: Brunel University, Uxbridge, UK
Date:    Tue, 06 Jun 89 16:34:28 +0000 

Has anyone out there tried to implement the accelerated learning method
described by Dahl in ICNN87 vol II, p523-530? It appears to work by
parabolic interpolation, but is not very clearly described.

Alternatively, does anyone have an email address for Dahl?

Thanks, dave martland

------------------------------

Subject: Info on DYSTAL?
From:    "Pierce T. Wetter" <wetter@CSVAX.CALTECH.EDU>
Organization: California Institute of Technology
Date:    14 Jun 89 05:34:24 +0000 

   In the new Scientific American, D.L. Alkon describes some work he has
done on biological learning and describes a program called DYSTAL which uses
this work to train artificial NN. Unfortunately, he doesn't describe the
algorithm. Does anyone have any info on DYSTAL or its training method so
that I can include it in my NN software?

Pierce
 
wetter@csvax.caltech.edu | wetter@tybalt.caltech.edu | pwetter@caltech.bitnet

------------------------------

Subject: 3-Layer versus Multi-Layer
From:    Jochen Ruhland <mcvax!unido!cosmo!jochenru%cosmo.UUCP@uunet.uu.net>
Organization: CosmoNet, D-3000 Hannover 1, FRG
Date:    20 Jun 89 01:19:09 +0000 

During a local meeting here in Germany I heard somebody talking about a
theorem that a three-layer perceptron is capable of performing any given
in/out function, given a sufficient number of hidden units in the network.

I forgot to ask where to look for the proof - so I'm asking here.  Responses
may be in German or English.

Thanks in advance
       Jochen

------------------------------

Subject: Re: 3-Layer versus Multi-Layer
From:    merrill@bucasb.bu.edu (John Merrill)
Organization: Boston University Center for Adaptive Systems
Date:    Tue, 27 Jun 89 18:56:01 +0000 

One reference to such a result is

Funahashi, K. (1989). "On the Approximate Realization of Continuous Mappings
by Neural Networks", Neural Networks 2, 183-192.

There are actually several different theorems which prove the same thing,
but Funahashi's is the first that I know of which does it with standard
sigmoidal semi-linear nodes.

John Merrill			|	ARPA:	merrill@bucasb.bu.edu
Center for Adaptive Systems	|	
111 Cummington Street		|	
Boston, Mass. 02215		|	Phone:	(617) 353-5765

------------------------------

Subject: Re: 3-Layer versus Multi-Layer
From:    demers@beowulf.ucsd.edu (David E Demers)
Organization: EE/CS Dept. U.C. San Diego
Date:    Wed, 28 Jun 89 18:57:45 +0000 

For "perceptrons", there is no such proof, since multilayer linear units can
easily be collapsed into two-layers.  See, e.g., Minsky & Papert,
"Perceptrons" (1969).  If, however, units can take on non-linear
activations, then it can be shown that a three layer network can approximate
any Borel-measurable function to any desired degree of accuracy (exponential
in the number of units, however!).  Hal White et al have shown this, and
have also shown that the mapping is learnable.  This paper is going to
appear this year in the Journal of INNS, Neural Networks.

The source of this is frequently listed as the Kolmogorov superposition
theorem.  Robert Hecht-Nielsen has a paper in the 1987 Proceedings of the
First IEEE conference on Neural Networks about this theorem.  The theorem is
not constructive, however.  It shows that a function from R^m to R^n can be
represented by the superposition of {some number linear in m & n} bounded,
monotonic, non-linear functions of the m inputs.  However, there is no way
of determining these functions...

I am writing all of this from memory, all of my papers are elsewhere right
now...  but I know that others have similar results.

Dave
 

------------------------------

Subject: Re: 3-Layer versus Multi-Layer
From:    Matthew Kennel <mara!mickey.cognet.ucla.edu!kennel@LANAI.CS.UCLA.EDU>
Organization: none
Date:    03 Jul 89 18:06:42 +0000 

I recently saw a preprint by some EE professors at Princeton who made a
constructive proof using something called the "inverse Radon transform", or
something like that.

What I think the subject needs is work on characterizing the "complexity" of
continuous mappings, w.r.t. neural networks--- i.e.  how many hidden units
(free coefficients) are needed to reproduce some mapping with a certain
accuracy?

Obviously, this depends crucially on the functional basis and architecture
of the network---we might thus be able to evaluate various network types on
their power and efficiency in a practical way, and not just formally (i.e.
given infinitely many hidden neurons).

My undergrad thesis adviser, Eric Baum, has been working on this type of
problem, but for binary-valued networks, i.e. networks that classify the
input space into arbitrary categories.  The theory is quite
mathematical---as a "gut feeling" I suspect that for continuous-valued
networks, only approximate results would be possible.

Matt Kennel
kennel@cognet.ucla.edu

------------------------------

End of Neurons Digest
*********************