olky@vax5.CIT.CORNELL.EDU (05/18/89)
Hi, I am working on various applications of neural networks to learning control systems, adaptive learning, etc. Now I have a little problem: I would like to train a feedforward net with input-output patterns that have a wide dynamic range. For example, the outputs of the net will vary, say, between -4.0 and 4.0, or in some cases you don't even know the range, because the NN will be part of a dynamic system with a certain number of degrees of freedom.

So what did I do? I modified the activation function for the output units and used f(x) = x, and also made the necessary changes in the error derivation where you need the derivative of f(x). Anyway, I got very bad results: huge numbers, on the order of billions. I then used sigmoids at the output layer with a range of 0 to 4.0; it didn't work. Then I tried a symmetric sigmoid with range -2 to 2, again nothing. Then I used this limited sigmoid for every layer, with no results again.

In some papers people claim that they trained a net with a linear output activation function. How? I even bought PDP Volume III to check their bp.c software. They used only the common 0-1 sigmoid. So I was disappointed, and a little bit mad that I spent 30 bucks on it (since I couldn't get reimbursed for this by my advisor...).

At any rate, can somebody help me with this situation? If anybody has worked on problems with wide dynamic ranges, suggestions are greatly appreciated...

Kemal Ciliz
olky@vax5.cit.cornell.edu
mkciliz@cmx.npac.syr.edu
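[Editor's note: for concreteness, here is a minimal sketch of what "linear output units" amounts to in the backward pass. This is not the poster's code; the layer sizes, names, and learning rate are illustrative. With f(x) = x we have f'(x) = 1, so the output delta is simply (target - output), with no sigmoid-derivative factor.]

import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

n_in, n_hid, n_out = 2, 4, 1           # illustrative sizes
lr = 0.005                             # small learning rate
W1 = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hid)]
W2 = [[random.uniform(-0.1, 0.1) for _ in range(n_hid)] for _ in range(n_out)]

def train_step(x, target):
    # forward pass: sigmoid hidden layer, LINEAR output layer f(x) = x
    h = [sigmoid(sum(W1[j][i] * x[i] for i in range(n_in))) for j in range(n_hid)]
    y = [sum(W2[o][j] * h[j] for j in range(n_hid)) for o in range(n_out)]
    # backward pass: since f'(x) = 1, the output delta has no sigmoid factor
    d_out = [target[o] - y[o] for o in range(n_out)]
    d_hid = [h[j] * (1.0 - h[j]) * sum(d_out[o] * W2[o][j] for o in range(n_out))
             for j in range(n_hid)]
    for o in range(n_out):
        for j in range(n_hid):
            W2[o][j] += lr * d_out[o] * h[j]
    for j in range(n_hid):
        for i in range(n_in):
            W1[j][i] += lr * d_hid[j] * x[i]
    return y

[Because the output delta is unbounded here, learning rates that work for 0-1 targets can be far too large; the follow-up post below reaches the same conclusion.]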
mkciliz@cmx.npac.syr.edu (M. Kemal Ciliz) (05/24/89)
Hi again. Thanks to all who sent messages explaining possible cures for my problem... The problem was the large constant (learning rate) I used in the gradient descent algorithm. I was using something like 0.3-0.5, which basically caused the weights, and consequently the outputs, to blow up. I used very low learning rates like 0.001-0.005, increased the number of nodes in the two hidden layers (10, 8), and got better results, but now convergence is extremely slow. In any case, I think that's better than having outputs that blow up. But I am still curious whether something else can be done to speed up the convergence. Somebody suggested the use of BAMs, but I didn't quite understand the procedure.

Regards,
Kemal Ciliz
mkciliz@cmx.npac.syr.edu
olky@vax5.cit.cornell.edu
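[Editor's note: one standard cure for slow convergence at a small learning rate, not mentioned by the posters but described in the PDP volumes the poster cites, is a momentum term: each weight change carries over a fraction of the previous one. A minimal sketch, with illustrative constants:]

lr, alpha = 0.005, 0.9   # learning rate and momentum; values are illustrative

# prev_dw must be stored per weight across pattern presentations
def momentum_update(w, grad, prev_dw):
    dw = lr * grad + alpha * prev_dw   # reuse a fraction of the last step
    return w + dw, dw                  # new weight, new prev_dw

[Along a consistently-signed gradient the steps accumulate to roughly lr/(1 - alpha), an effective rate about ten times larger here, without the blow-up a large raw learning rate causes.]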
james@blake.acs.washington.edu (James Taylor) (05/26/89)
In article <18635@vax5.CIT.CORNELL.EDU> olky@vax5.CIT.CORNELL.EDU writes:

[...]
> I would like to train a feedforward net with input-output patterns
> that have a wide dynamic range. For example, the outputs of the net
> will vary, say, between -4.0 and 4.0, or in some cases you don't even
> know the range, because the NN will be part of a dynamic system with
> a certain number of degrees of freedom.
>
> So what did I do? I modified the activation function for the output
> units and used f(x) = x, and also made the necessary changes in the
> error derivation where you need the derivative of f(x). Anyway, I got
> very bad results: huge numbers, on the order of billions.
>
> Kemal Ciliz
> olky@vax5.cit.cornell.edu
> mkciliz@cmx.npac.syr.edu
[...]

I have done similar things, scaling problem patterns to <0,1> or <-n,n>. I ran into that sort of unstable training behavior in the magnitude of the weights when I screwed up the feedback for the network. The experiments I ran worked OK as long as I guaranteed that the feedback really did go to zero as the node activation went to +-n. I.e., if (ignoring subscripts; everything is at node i, approximately following the notation of the PDP books):

    sig(y)     = {2n/(1-exp[-y])} - n
               = n*(1+exp[-y])/(1-exp[-y])

    dsig(y)/dy = n*(-2exp[-y])/{(1-exp[-y])^2}
               = n*(1+sig(y))*(1-sig(y))
               = n*(1+x)*(1-x)
               = sig'(y)

which implies that

    Delta_W = delta*alpha*x
            = -sig'(y) * SUM[of delta*W, next layer] * alpha*x

Now, if n > 1 with the above definition of the sigmoid, then when |sig(y)| > 1, sig'(y) changes sign and the feedback Delta_W becomes **positive** feedback. Bad.

The simplest solution is to make n <= 1 and scale the output of the network. Put a fixed gain at the output, scale your target output to [-1,+1] during training, then rescale during testing if you really want to see the larger-magnitude outputs.

I played a little with alternate sigmoid definitions, but in every case I came out with a similar problem and a much more complex training algorithm. If you come up with an alternate solution I'd be very interested.

James Taylor
james@uw-isdl.ee.washington.edu
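[Editor's note: a minimal sketch of the fixed-gain scheme James describes; the GAIN value and function names are illustrative, not from his post:]

GAIN = 4.0   # assumed bound on |target|, e.g. outputs in [-4, 4]

def scale_target(t):
    return t / GAIN    # map targets into [-1, +1] for training

def rescale_output(y):
    return y * GAIN    # map net output back to the problem's range at test time

[If the range is not known in advance, GAIN can be set from the largest |target| seen in the training set; targets beyond that bound cannot be reached, since the net's own output stays within [-1, +1].]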
gary@desi.ucsd.edu (Gary Cottrell) (05/29/89)
You have some bugs in your math. You have:

>sig(y) = {2n/(1-exp[-y])} - n
>       = n*(1+exp[-y])/(1-exp[-y])
>
>James Taylor
>james@uw-isdl.ee.washington.edu

The sigmoid is sq(x) = 1/(1+exp(-x)), not 1/(1-exp(-x)). I will use k instead of n, since for any simulation it is a constant. We want the derivative of 2k*sq(x) - k. Let's call sq(x) OLD, 2k*sq(x) - k NEW, and the derivative DER. So we want

    DER(NEW) = DER(2k*OLD - k)     since NEW = 2k*OLD - k
             = 2k*DER(OLD)
             = 2k*(1-OLD)*OLD      since DER(OLD) = (1-OLD)*OLD

But if in your simulation your squash is the new one, you need the derivative in terms of the new squashing function. OLD = (NEW + k)/(2k), so we have

    DER(NEW) = 2k*(1-OLD)*OLD
             = 2k*[1 - (NEW + k)/2k]*(NEW + k)/2k
             = [1 - (NEW + k)/2k]*(NEW + k)
             = [(2k - (NEW + k))/2k]*(NEW + k)
             = [(k - NEW)/2k]*(k + NEW)
             = (k^2 - NEW^2)/2k, if you prefer.

gary cottrell 619-534-6640
Computer Science and Engineering C-014
UCSD, La Jolla, Ca. 92093
gary@cs.ucsd.edu (ARPA)
{ucbvax,decvax,akgua,dcdwest}!sdcsvax!gary (USENET)
gcottrell@ucsd.edu (BITNET)
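[Editor's note: a quick numeric check of the closed form above, added as a sketch; k = 2.0 matches the -2..2 symmetric sigmoid mentioned earlier in the thread. The finite-difference derivative of NEW agrees with (k^2 - NEW^2)/(2k):]

import math

k = 2.0

def sq(x):
    return 1.0 / (1.0 + math.exp(-x))

def new(x):
    return 2.0 * k * sq(x) - k          # range (-k, +k)

def der_from_new(v):
    return (k * k - v * v) / (2.0 * k)  # Gary's closed form in terms of NEW

for x in (-3.0, -0.5, 0.0, 1.0, 4.0):
    eps = 1e-6
    numeric = (new(x + eps) - new(x - eps)) / (2.0 * eps)
    print(x, numeric, der_from_new(new(x)))   # the two columns agree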