cheng@ral.rpi.edu (Wei-Ying Cheng) (03/20/91)
I have a question which may be interesting. Suppose we have a sample
set S which contains a huge number of samples, so that it is
impossible to sum over all the samples in S. Suppose we choose a
sample s = (x,y) from S at random according to a prior distribution,
and we want a neural network (NN) to learn this sample. If the output
of the NN is y' when the input is x, then we have a norm d = |y - y'|.
It is obvious that d is a random variable. The problem is how to
develop a learning rule such that d is minimized in a probabilistic
sense, e.g. E(d) is minimized, or P(min(d)) converges to 1, etc. Are
there any references on this problem? Since it is closely related to
the generalization problem of NNs, I think it is very interesting. I
would greatly appreciate any help.
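[As a concrete illustration of the setting above: the quantity d can
be estimated by sampling rather than by summing over all of S. This is
a minimal sketch; the toy linear "network", the sampling distribution,
and all the numbers in it are illustrative assumptions, not part of
the original question.]

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the network: a fixed linear map f(W, x) = W x.
W = np.array([[0.5, -0.2],
              [0.1,  0.8]])

def f(W, x):
    return W @ x

def draw_sample():
    # Assumed prior over S: x is Gaussian, y is a noisy linear target.
    x = rng.normal(size=2)
    y = np.array([[0.6, -0.1],
                  [0.2,  0.7]]) @ x + 0.05 * rng.normal(size=2)
    return x, y

# d = |y - y'| is a random variable; estimate E(d) by Monte Carlo,
# i.e. without ever summing over all of S.
ds = []
for _ in range(10000):
    x, y = draw_sample()
    ds.append(np.linalg.norm(y - f(W, x)))

print(np.mean(ds))
```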
nicwi@isy.liu.se (Niclas Wiberg) (03/21/91)
cheng@ral.rpi.edu (Wei-Ying Cheng) writes:

>Suppose we have a sample set S which contains a huge number of
>samples, so that it is impossible to sum over all the samples in S.
>Suppose we choose a sample s=(x,y) from S at random according to a
>prior distribution, and we want a NN to learn this sample. If the
>output of the NN is y' when the input is x, then we have a norm
>d = |y - y'|. It is obvious that d is a random variable. The problem
>is how to develop a learning rule such that d is minimized in a
>probabilistic sense, e.g. E(d) is minimized, or P(min(d)) converges
>to 1, etc. Are there any references on this problem? Since it is
>closely related to the generalization problem of NNs, I think it is
>very interesting.

What you talk about seems very much like "epoch" vs. "sample" updating
in the backpropagation algorithm. Most people seem to use the sample
version, but the mathematical formulations I have seen apply only to
the epoch version. Since I spent some time on this myself recently, I
might as well share my thoughts.

The norm d is a function of the chosen sample s=(x,y) and the weight
state W of the network:

    d(W,s) = |y - f(W,x)|

where f(W,x) is the output of the network, which depends on the
weights W and the input x.

USUAL FORMULATION OF BACKPROPAGATION

The usual formulation of backpropagation and similar training
algorithms defines a global error function as the sum of d(W,s) over
all samples s:

    e(W) = SUM(s) d(W,s)

The gradient e'(W) of e(W) is then used to update W in order to
decrease the error. We can motivate this by studying the Taylor
expansion of the error function around W:

    e(W+dW) = e(W) + dW e'(W) + O(dW^2)

(the "O" term is "very small" if dW is "small").
If we use dW = - C e'(W) to update W, and choose C as a sufficiently
small positive number, we can omit the last term and get

    e(W - C e'(W)) = e(W) - C e'(W) e'(W)

Because e'(W) e'(W) is a squared norm and hence non-negative, the term
- C e'(W) e'(W) is non-positive, so the error never increases. To
compute the gradient of e(W), we must sum the gradients of each d(W,s)
over all s:

    e'(W) = SUM(s) d'(W,s)

So we have to evaluate f(W,x) for all s before W is updated. This is
known as "epoch training". Epoch training has some significant
drawbacks, especially if the training set is large.

STOCHASTIC FORMULATION

If s instead is regarded as a random variable, d(W,s) also becomes a
random variable (as you noted in your question). We can then redefine
our global error function as the expectation of d(W,s):

    e(W) = E[ d(W,s) ]

The gradient of the new e(W) can be rewritten as the expectation of
the gradient of d(W,s):

    e'(W) = E[ d'(W,s) ]

So, if we use d'(W,s) to update W, we will "by expectation" walk in
the correct direction. More formally, we can set dW = - C d'(W,s) and
evaluate e(W) at the new point (a similar Taylor expansion as above):

    e(W - C d'(W,s)) = e(W) - C d'(W,s) e'(W)

And, by taking expectations:

    E[ e(W - C d'(W,s)) ] = e(W) - C e'(W) e'(W)

We see that the expectation of e(W) after the update is lower than the
original e(W). This holds as long as dW is sufficiently small. Thus,
we can update W after each sample presentation.

Hope this helps. I am studying backpropagation with random noise added
to the outputs of the hidden layer, and therefore I need to formulate
the backpropagation rule on a stochastic basis.

--
----------------------------------------------------------------------
Niclas Wiberg                      nicwi@isy.liu.se
Dept. of EE   Linkoping University   Sweden
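[The epoch vs. sample updating discussed above can be sketched
numerically on a toy problem. This is only an illustration: the scalar
linear model, the use of a squared-error d (chosen so the gradient is
smooth, unlike |y - y'|), and the step size C are all assumptions made
for the sketch, not part of the original post.]

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative training set S: scalar inputs, targets from a noisy line
# with true slope 3.0.
X = rng.normal(size=200)
Y = 3.0 * X + 0.1 * rng.normal(size=200)

def grad(w, x, y):
    # With d(w,s) = (y - w x)^2, the per-sample gradient is
    # d'(w,s) = -2 x (y - w x).
    return -2.0 * x * (y - w * x)

C = 0.01  # a small positive step size

# Epoch training: accumulate d'(W,s) over all s before each update.
w_epoch = 0.0
for _ in range(300):
    w_epoch -= C * np.mean([grad(w_epoch, x, y) for x, y in zip(X, Y)])

# Sample (stochastic) training: update after each presentation; the
# expected step points along -e'(W), as derived above.
w_sample = 0.0
for _ in range(5):
    for x, y in zip(X, Y):
        w_sample -= C * grad(w_sample, x, y)

print(w_epoch, w_sample)  # both should approach the true slope 3.0
```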