[comp.ai.neural-nets] question

cheng@ral.rpi.edu (Wei-Ying Cheng) (03/20/91)

I have a question which may be interesting:

Suppose we have a sample set S which contains a huge number of
samples, so that it is impossible to sum over all the samples in S.
Instead, we choose a sample s=(x,y) from S at random according to a
prior distribution and ask the NN to learn this sample. If the output
of the NN is y' when the input is x, then we have a norm d = |y - y'|.
Clearly d is a random variable. The problem is how to develop a
learning rule such that d is minimized in a probabilistic sense, e.g.
E(d) is minimized, or the probability that d is minimal converges
to 1, etc. Is there any reference on this problem? Since it is
closely related to the generalization problem of NNs, I think it is
very interesting. I would greatly appreciate any help.

nicwi@isy.liu.se (Niclas Wiberg) (03/21/91)

cheng@ral.rpi.edu (Wei-Ying Cheng) writes:

>I have a question which may be interesting:

>Suppose we have a sample set S which contains a huge number of
>samples, so that it is impossible to sum over all the samples in S.
>Instead, we choose a sample s=(x,y) from S at random according to a
>prior distribution and ask the NN to learn this sample. If the output
>of the NN is y' when the input is x, then we have a norm d = |y - y'|.
>Clearly d is a random variable. The problem is how to develop a
>learning rule such that d is minimized in a probabilistic sense, e.g.
>E(d) is minimized, or the probability that d is minimal converges
>to 1, etc. Is there any reference on this problem? Since it is
>closely related to the generalization problem of NNs, I think it is
>very interesting. I would greatly appreciate any help.

What you describe sounds very much like "epoch" vs. "sample"
updating in the backpropagation algorithm. Most people seem to use the
sample version, but the mathematical formulations I have seen apply
only to the epoch version. Since I have spent some time on this myself
recently, I might as well share my thoughts.

The norm d is a function of the chosen sample s=(x,y) and the weight
state W of the network:

    d(W,s) = |y - f(W,x)|

where f(W,x) is the output of the network, which depends on the
weights W and the input x.
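
As a small concrete sketch (my own illustration, not part of the
original question; the single tanh layer and the numpy notation are
just assumptions, any differentiable network would do), d(W,s) can be
written directly in Python:

    import numpy as np

    def f(W, x):
        # Network output for input x; here a hypothetical single
        # layer of tanh units with weight matrix W.
        return np.tanh(W @ x)

    def d(W, s):
        # Per-sample error d(W,s) = |y - f(W,x)| for one sample s = (x, y).
        x, y = s
        return np.linalg.norm(y - f(W, x))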


USUAL FORMULATION OF BACKPROPAGATION

The usual formulation of backpropagation and similar training
algorithms defines a global error function as the sum of d(W,s) over
all samples s:

    e(W) = SUM(s) d(W,s)

The gradient e'(W) of e(W) is then used to update W in order to
decrease the error. We can motivate this by studying the Taylor
expansion of the error function around W:

    e(W+dW) = e(W) + dW e'(W) + O(dW^2)

(the "O" term is "very small" if dW is "small"). If we use
dW = - C e'(W) to update W, and choose C as a sufficiently small
positive number, we can omit the last term, and get

    e(W - C e'(W)) = e(W) - C e'(W) e'(W)

Because the e'(W)e'(W) is negative, the error is always decreasing.
To compute the gradient of e(W), we must sum the gradients of each
d(W,s) over all s:

    e'(W) = SUM(s) d'(W,s)

So we have to evaluate f(W,x) for all s before W is updated. This is
known as "epoch training". Epoch training has some significant
drawbacks, especially if the training set is large.
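
To make the epoch version concrete, here is a minimal sketch (my own
illustration, reusing f and d from above and estimating d'(W,s) by
finite differences rather than by backpropagation proper): the
per-sample gradients are summed over the whole training set before W
is changed once.

    def grad_d(W, s, eps=1e-5):
        # Numerical estimate of the gradient d'(W,s); backpropagation
        # would give the same vector analytically, and much faster.
        g = np.zeros_like(W)
        for i in np.ndindex(W.shape):
            dW = np.zeros_like(W)
            dW[i] = eps
            g[i] = (d(W + dW, s) - d(W - dW, s)) / (2 * eps)
        return g

    def epoch_update(W, samples, C=0.01):
        # Epoch training: accumulate e'(W) = SUM(s) d'(W,s) over all
        # samples, then take a single step dW = -C e'(W).
        g = sum(grad_d(W, s) for s in samples)
        return W - C * g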


STOCHASTIC FORMULATION

If s is instead regarded as a random variable, d(W,s) also becomes a
random variable (as you noted in your question). We can then redefine
our global error function as the expectation of d(W,s):

    e(W) = E[ d(W,s) ]
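
Since the huge sample set rules out summing over all of S, this
expectation can only be estimated. A minimal Monte-Carlo sketch (my
own illustration, reusing d from above; the sample count n is an
arbitrary choice):

    import random

    def estimate_e(W, samples, n=100):
        # Monte-Carlo estimate of e(W) = E[ d(W,s) ]: average d(W,s)
        # over n samples drawn from the training set.
        # (random.choice draws uniformly; a non-uniform prior over S
        # would need weighted draws.)
        drawn = [random.choice(samples) for _ in range(n)]
        return sum(d(W, s) for s in drawn) / n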

The gradient of the new e(W) can be rewritten as the expectation of
the gradient of d(W,s):

    e'(W) = E[ d'(W,s) ]

So, if we use d'(W,s) to update W, we walk in the correct direction
"in expectation". More formally, we can set dW = - C d'(W,s) and
evaluate e(W) at the new point (by a Taylor expansion similar to the
one above):

    e(W - C d'(W,s)) = e(W) - C d'(W,s) e'(W)

And, by taking expectation:

    E[ e(W - C d'(W,s)) ] = e(W) - C e'(W) e'(W)

We see that the expectation of e(W) after the update is lower than
the original e(W) (unless the gradient e'(W) is zero). This holds as
long as C, and thus dW, is sufficiently small. Thus, we can update W
after each sample presentation.
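
A corresponding sketch of the sample (stochastic) version, again my
own illustration, reusing grad_d and random.choice from above with a
hypothetical fixed step size C: W is changed after every single
presentation, and only the expectation of each step follows -e'(W).

    def stochastic_training(W, samples, C=0.01, presentations=1000):
        # Sample updating: draw s at random, then take one step
        # dW = -C d'(W,s).  Each individual step decreases e(W) only
        # in expectation, not for every single s.
        for _ in range(presentations):
            s = random.choice(samples)
            W = W - C * grad_d(W, s)
        return W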

    +    +    +    +

Hope this helps. I am studying backpropagation with random noise added
to the outputs of the hidden layer. Therefore, I need to formulate the
backpropagation rule on a stochastic basis.
-- 
----------------------------------------------------------------------
Niclas Wiberg                         nicwi@isy.liu.se
Dept. of EE    Linkoping University   Sweden