[comp.ai.neural-nets] Back Propagation question...

camargo@cs.columbia.edu (Francisco Camargo) (05/30/89)

Hi there,

I'm re-posting my previous message together with a reply that I received from
Tony Plate and my reply to him. I'd really appreciate comments on this issue.
Thanks to all.
-----------------------------------------------------------------------------


|In article <224@cs.columbia.edu> you write:
||
||Can anyone shed some light on the following issue:
||
||How should one compute the weight adjustments in BackProp?
||From reading PDP, one gets the impression that the DELTAS
||should be accumulated over all INPUT PATTERNS, and only then
||is a STEP taken down the gradient.  Robbins and Monro suggest
||a stochastic algorithm with proven convergence if one takes one
||step at each pattern presentation, but damps its effect by a factor
||of 1/k, where "k" is the presentation number.  Other people (judging
||from code that I've seen flying around) seem to take a STEP at each
||presentation and don't apply any damping factor.  I've tried both
||approaches myself and they both seem to work.  After all, which is the
||correct way of adjusting the weights?  Accumulate the errors over all
||patterns?  Or work towards the minimum as new patterns are presented?
||What are the implications?
||
||Any light on this issue is extremely appreciated.
||
-----------------------------------------------------------------------------
| There are two standard methods of doing the updates, sometimes called
| "batch" and "online" learning.
|
| In "batch" learning, all the changes are accumulated for one pass through
| all the examples.  At the end of the pass (or "epoch") the update is made.
| Thus, each link requires an extra storage field in which to accumulate
| the changes.
|
| In "online" learning, the change is made after seeing each example.
|
| Some people claim online is better, others claim batch is better.
|
| "dumping" (you mean "weighting") each change by 1/k, where k is the number
| of the example (?) sounds really wierd, do you mean if you had four examples
| in your training set changes from the fourth would be worth only a quarter
| as much as changes from the second? surely you don't mean this!
|
| Some people use a momentum term, and some change the learning rate during
| learning.  Using momentum seems to be generally a good thing, and it's
| easy to do.  Automatically changing the learning rate is much harder.
|
| .....
| ..... Connectionist Learning Algorithms by Hinton....
| .....
|
| tony plate
------------------------------------------------------------------------------
Hi Tony,

Sorry for my previous message being so unspecific.  What I meant is that
the damping occurs after each "epoch."  The idea is that the changes in
the weights become less and less important.  Actually, the way the
algorithm is stated, one should damp (I really do mean damp) the step size
by a series of terms {a_k} where sum({a_k}^2) < infinity while
sum({a_k}) = infinity.  In any case, using {a_k} = 1/k for k = "epoch number"
satisfies both conditions.
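
To make that concrete, here is a toy sketch (nothing network-specific;
the numbers are made up) of the bare stochastic-approximation step with
a_k = 1/k, used to track the mean of a noisy stream:

    import random

    # Toy Robbins-Monro iteration: drive the estimate m toward the mean
    # of a noisy stream using damped steps a_k = 1/k.  Note that
    # sum(a_k) diverges while sum(a_k^2) is finite, which is what the
    # convergence theorem requires.
    random.seed(1)
    true_mean = 3.0
    m = 0.0
    for k in range(1, 10001):
        sample = true_mean + random.gauss(0.0, 1.0)  # one noisy "presentation"
        a_k = 1.0 / k
        m += a_k * (sample - m)                      # damped correction step
    print(m)  # ends up close to 3.0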

My problem is that I can't find any (theoretical) justification for the "online"
method other than the Robbins-Monro algorithm.  But there, the damping
factor is required for guaranteed convergence.  I tried the "online"
method and it does seem to perform better.  But WHY does it work?  How come it
converges so well (despite setting {a_k} = 1)?

I am familiar with the use of "momentum" in the learning process, but I
really want to better understand the theoretical reasons for the "online"
method.  Having started my studies with the "batch" mode, it seems a little
like black magic that the "online" method works.

I have the paper by Hinton, "Connectionist Learning Procedures", CMU-CS-87-115.
Is this the paper you referred to?  Are there any other improvements to this work?
I appreciate your time and effort.
Thanks,


/Kiko.
camargo@cs.columbia.edu

demers@beowulf.ucsd.edu (David E Demers) (05/31/89)

In article <226@cs.columbia.edu> camargo@cs.columbia.edu (Francisco Camargo) writes:
>I'm re-posting my previous message together with a reply that I received from
>Tony Plate and my reply to him. I'd really appreciate comments on this issue.
>-----------------------------------------------------------------------------
|In article <224@cs.columbia.edu> [Francisco Camargo] writes:
||How should one compute the weight adjustments in BackProp?
||From reading PDP, one gets the impression that the DELTAS
||should be accumulated over all INPUT PATTERNS, and only then
||is a STEP taken down the gradient.  Robbins and Monro suggest
||a stochastic algorithm with proven convergence if one takes one
||step at each pattern presentation, but damps its effect by a factor
||of 1/k, where "k" is the presentation number.  Other people (judging
||from code that I've seen flying around) seem to take a STEP at each
||presentation and don't apply any damping factor.  I've tried both
||approaches myself and they both seem to work.  After all, which is the
||correct way of adjusting the weights?  Accumulate the errors over all
||patterns?  Or work towards the minimum as new patterns are presented?
||What are the implications?
-----------------------------------------------------------------------------
[Tony replies]
| There are two standard methods of doing the updates, sometimes called
| "batch" and "online" learning.
|
| In "batch" learning, all the changes are accumulated for one pass through
| all the examples.  At the end of the pass (or "epoch") the update is made.
| Some people use a momentum term, and some change the learning rate during
| learning.  Using momentum seems to be generally a good thing, and it's
| easy to do.  Automatically changing the learning rate is much harder.

[No it's not...]
>------------------------------------------------------------------------------
[Francisco tries to explain what he means by "damping", and the
Robbins-Monro algorithm...]
>the damping factor is required for guaranteed convergence.  I tried the "online"
>method and it does seem to perform better.  But WHY does it work?  How come it
>converges so well (despite setting {a_k} = 1)?
>
>I am familiar with the use of "momentum" in the learning process, but I
>really want to better understand the theoretical reasons for the "online"
>method.  Having started my studies with the "batch" mode, it seems a little
>like black magic that the "online" method works.
>
>I have the paper by Hinton, "Connectionist Learning Procedures", CMU-CS-87-115.
>Is this the paper you referred to?  Are there any other improvements to this work?


Sorry to quote so much of the prior postings, but I thought it worth
it to retain context.

I am not sure that I fully understand Francisco's question.  But I'll
answer it anyway :-)  

Essentially, what backpropagation is trying to do is achieve a minimum
mean squared error by following the gradient of the error as a function
of the weights.  The "batch" method works well because you get a good
picture of the true gradient after seeing all of the input-output
pairs.  However, as long as each correction moves "downhill",
we will converge (possibly to a local rather than a global minimum).
Making weight changes after the presentation of each training example
will not necessarily follow the gradient, but with a small learning
rate, in the aggregate we will still be moving downhill (reducing
the MSE).
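
To make the two schemes concrete, here is a rough sketch for a single
linear unit with squared error (purely illustrative; the data and
learning rate are made up):

    # One linear unit; error on pattern (x, target) is 0.5*(output - target)^2.

    def predict(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    def gradient(w, x, target):
        # Gradient of the squared error with respect to each weight.
        err = predict(w, x) - target
        return [err * xi for xi in x]

    def batch_epoch(w, data, lr):
        # "Batch": accumulate the gradient over ALL patterns, then one step.
        acc = [0.0] * len(w)
        for x, target in data:
            acc = [a + g for a, g in zip(acc, gradient(w, x, target))]
        return [wi - lr * a for wi, a in zip(w, acc)]

    def online_epoch(w, data, lr):
        # "Online": take a (noisy) step after EACH pattern presentation.
        for x, target in data:
            w = [wi - lr * g for wi, g in zip(w, gradient(w, x, target))]
        return w

    # Either scheme drives the weights toward [1.0, -1.0] here.
    data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
    w = [0.0, 0.0]
    for _ in range(100):
        w = online_epoch(w, data, lr=0.1)
    print(w)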


Dave

dhw@itivax.iti.org (David H. West) (05/31/89)

In article <226@cs.columbia.edu> camargo@cs.columbia.edu (Francisco Camargo) writes:
]Hi there,
]| "dumping" (you mean "weighting") each change by 1/k, where k is the number
]| of the example (?) sounds really wierd, do you mean if you had four examples
]| in your training set changes from the fourth would be worth only a quarter
]| as much as changes from the second? surely you don't mean this!

]| tony plate

]My problem is that I can't find any (theoretical) justification for the "online"
]method other than the Robbins-Monro algorithm.  But there, the damping
]factor is required for guaranteed convergence.  I tried the "online"
]method and it does seem to perform better.  But WHY does it work?  How come it
]converges so well (despite setting {a_k} = 1)?

]/Kiko.
]camargo@cs.columbia.edu

It's related to an old statistical hack for calculating the change
in the mean of a set of observations when another is added.  That 
formula takes 2 or 3 lines of algebra to derive, on a bad day.
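
For reference, the recurrence is: with m_k the mean of the first k
observations, m_k = ((k-1) m_{k-1} + x_k)/k = m_{k-1} + (x_k - m_{k-1})/k.
The new observation enters with weight 1/k, which is exactly the 1/k
gain in the Robbins-Monro scheme.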

-David       dhw@itivax.iti.org

mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) (05/31/89)

In article <226@cs.columbia.edu> camargo@cs.columbia.edu (Francisco Camargo) writes:
>[stuff deleted]
>
>My problem is that I can't find any (theoretical) justification for the "online"
>method other than the Robbins-Monro algorithm.  But there, the damping
>factor is required for guaranteed convergence.  I tried the "online"
>method and it does seem to perform better.  But WHY does it work?  How come it
>converges so well (despite setting {a_k} = 1)?

>
>I am familiar with the use of "momentum" in the learning process, but I
>really want to better understand the theoretical reasons for the "online"
>method.  Having started my studies with the "batch" mode, it seems a little
>like black magic that the "online" method works.

I have an intuitive explanation.  It's not rigorous by any means, and it
could even be completely wrong, but here goes...

In most problems, there is some underlying regularity that _all_ examples
possess that you're trying to learn.  Thus, if you update the weights
after each example, you get the benefit of learning from the previous
examples, but if you only update after a whole run through the training
set, it takes much longer to learn this regularity.

In my experiments, I've found that "online" learning works much better
at the beginning, when the network is completely untrained, because
presumably it's learning the general features of the whole set quickly,
but later on, when trying to learn the fine distinctions among examples,
"online" learning does worse, because it tries to "memorize" each example
in turn instead of learning the whole mapping.  In this regime, you
have to use batch learning.

For many problems, though, you never need this level of accuracy (I needed
continuous-valued outputs accurate to <1%), so "online" learning
is good enough, and often significantly faster, especially with
momentum.  Momentum smooths out the weight changes over the last few
examples.  (Actually, for my stuff, I like conjugate gradient
on the whole "batch" error surface.)
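
For the curious, a momentum update looks roughly like this (a sketch
only; the fixed "gradient" in the example is just a stand-in for
whatever per-example gradient you compute):

    # Heavy-ball momentum: the actual step is a decaying average of
    # recent gradients, which smooths out per-example noise.
    def momentum_step(w, v, grad, lr=0.1, alpha=0.9):
        # v is the "velocity"; alpha is the momentum coefficient.
        v = [alpha * vi - lr * gi for vi, gi in zip(v, grad)]
        w = [wi + vi for wi, vi in zip(w, v)]
        return w, v

    # Example: repeated steps along a fixed toy gradient.
    w, v = [0.0, 0.0], [0.0, 0.0]
    for _ in range(5):
        w, v = momentum_step(w, v, grad=[1.0, -2.0])
    print(w)   # has moved opposite the gradient, with gathering speed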

>/Kiko.
>camargo@cs.columbia.edu

Matt Kennel
mbkennel@phoenix.princeton.edu (6 more days only!!! )
kennel@cognet.ucla.edu  (after that)

artzi@cpsvax.cps.msu.edu (Ytshak Artzi - CPS) (05/31/89)

In article <226@cs.columbia.edu> camargo@cs.columbia.edu (Francisco Camargo) writes:
>My problem is that I can't find any (theoretical) justification for the "online"
>method other than the Robbins-Monro algorithm.  But there, the damping
>factor is required for guaranteed convergence.  I tried the "online"
>method and it does seem to perform better.  But WHY does it work?  How come it
>converges so well (despite setting {a_k} = 1)?
>
   As a general comment, you must be careful in choosing the particular
instance of the problem you try to solve.  If the initial state is close
to the correct solution, then both methods will work.  For any problem
there exists an instance for which convergence is not guaranteed for
either method.  Unfortunately, there is no good method available for
detecting such an instance, given an arbitrary problem.

  Now consider the following equation:

       DELTA_p w_ji = n (t_pj - O_pj) i_pi = n d_pj i_pi

  This rule changes the weights following presentation of I/O pair p:

  t_pj is the target for the j-th component of output pattern p

  O_pj is the j-th element of the actual output pattern produced by
       input pattern p

  i_pi is the i-th element of input pattern p

  d_pj = t_pj - O_pj

  n    is the learning rate

  DELTA_p w_ji is the change to be made to the weight from the i-th
       to the j-th unit after presentation of pattern p
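
  In code, one presentation might look like this (a sketch, assuming
linear output units; the learning rate n = 0.1 is arbitrary):

    # One presentation of the delta rule above: w[j][i] is the weight
    # from input unit i to output unit j, n is the learning rate.
    def delta_rule_update(w, x, t, n=0.1):
        for j in range(len(w)):
            o_j = sum(w[j][i] * x[i] for i in range(len(x)))  # O_pj
            d_j = t[j] - o_j                                  # d_pj
            for i in range(len(x)):
                w[j][i] += n * d_j * x[i]                     # DELTA_p w_ji
        return w

    # Example: one output unit, two inputs; repeated presentations of a
    # single pattern drive the output toward its target of 1.0.
    w = [[0.0, 0.0]]
    for _ in range(50):
        w = delta_rule_update(w, x=[1.0, 0.5], t=[1.0])
    print(w)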

  Hope it helps...

   Itzik.