[comp.ai.neural-nets] Does back-propagation work with a wider dynamic range...?

olky@vax5.CIT.CORNELL.EDU (05/18/89)

	Hi,

	I am working on various applications of NNs for learning control
	systems, adaptive learning, etc. Now I have a little problem...
	
	I would like to train a feedforward net with input-output patterns
	that have a wider dynamic range. For example, the outputs of the net
	may vary, say, between -4.0 and 4.0, or in some cases you don't even
	know the range, because the NN will be part of a dynamic system with
	a certain degree of freedom.

	So what did I do? I modified the activation function for the output
	units to f(x)=x, and made the necessary changes in the
	error derivation where the derivative of f(x) is needed. Anyway,
	I got very bad results: huge numbers on the order of billions.
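
	In C terms, the change at the output units amounts to something
	like this (just a rough sketch; the function names are mine, not
	from bp.c):

	/* delta at an output unit: with the usual 0-1 sigmoid the error
	   is multiplied by f'(net) = out*(1-out); with a linear output
	   unit f(net) = net, so f'(net) = 1.                           */

	double output_delta_sigmoid(double target, double out)
	{
	    return (target - out) * out * (1.0 - out);
	}

	double output_delta_linear(double target, double out)
	{
	    return (target - out);          /* f'(net) = 1 */
	}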

	I used sigmoids at the output layer with a range of 0 to 4.0; it
	didn't work...
	Then I tried a symmetric sigmoid with a range of -2 to 2; again nothing.
	Then I used this limited sigmoid for each layer; no results again.

	In some papers people claim that they trained a net with a linear
	output activation function. How?
	I even bought PDP volume III to check their bp.c software. They used
	only the common 0-1 sigmoid. So I was disappointed, and a little
	bit mad that I spent 30 bucks for it (since I couldn't get
	reimbursed for this by my advisor...).

	At any rate, CAN somebody help me with this situation? If anybody
	has worked on problems with wider dynamic ranges, suggestions
	are greatly appreciated...


							Kemal Ciliz
						olky@vax5.cit.cornell.edu
						mkciliz@cmx.npac.syr.edu

mkciliz@cmx.npac.syr.edu (M. Kemal Ciliz) (05/24/89)

        Hi again,

        Thanks to all who sent messages to me explaining possible
        cures to my problem...

        The problem was the large constant (learning rate) I used
        in the gradient descent algorithm. I was using something like 0.3-0.5,
        which basically caused the weights, and consequently the outputs, to blow up.

        I used very low learning rates, around 0.001-0.005, and increased the
        number of nodes in the two hidden layers (to 10 and 8), and got better
        results, but now convergence is extremely slow. In any case I think
        it's better than having outputs that blow up.
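
        Roughly, the update I am running now looks like this (just a
        sketch; the names are mine):

        #define ETA 0.001    /* was 0.3-0.5, which blew the weights up */

        /* plain gradient-descent weight update, no momentum */
        void update_weight(double *w, double delta, double input)
        {
            *w += ETA * delta * input;
        }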

        But I am still curious whether something else can be done to speed up
        the convergence. Somebody suggested the use of BAMs, but I didn't
        quite understand the procedure.

        Regards,

        Kemal Ciliz
        mkciliz@cmx.npac.syr.edu
        olky@vax5.cit.cornell.edu

james@blake.acs.washington.edu (James Taylor) (05/26/89)

In article <18635@vax5.CIT.CORNELL.EDU>, olky@vax5.CIT.CORNELL.EDU (Kemal Ciliz) writes:
[...]
>	I would like to train a feedforward net with input-output patterns
>	that have a wider dynamic range. For example, the outputs of the net
>	may vary, say, between -4.0 and 4.0, or in some cases you don't even
>	know the range, because the NN will be part of a dynamic system with
>	a certain degree of freedom.
>
>	So what did I do? I modified the activation function for the output
>	units to f(x)=x, and made the necessary changes in the
>	error derivation where the derivative of f(x) is needed. Anyway,
>	I got very bad results: huge numbers on the order of billions.
>
>							Kemal Ciliz
>						olky@vax5.cit.cornell.edu
>						mkciliz@cmx.npac.syr.edu
[...]

I have done similar things - scaling problem patterns to <0,1>, or
<-n,n>.  I ran into that sort of unstable training behavior in the
magnitude of the weights when I screwed up the feedback for the network.
The experiments I ran worked OK as long as I guaranteed that the
feedback really did go to zero as the node activation went to +-n, i.e.

if (ignoring subscripts, everything is at node i, approximately
    following the notation of the PDP books)

sig(y) = {2n/(1-exp[-y])} - n
       = n*(1+exp[-y])/(1-exp[-y])

dsig(y)/dy = n*(-2exp[-y])/{(1-exp[-y])^2}
	   = n*(1+sig(y))*(1-sig(y))
	   = n*(1+x)*(1-x)
           = sig'(y)

Which implies that 

Delta_W = delta * alpha * x
        = -sig'(y) * SUM[of delta*W next layer] * alpha * x


Now if n > 1 with the above definition of the sigmoid

when
	| sig(y) | > 1

sig'(y) changes sign and the feedback Delta_W becomes **positive**
feedback.  Bad.

The simplest solution is to make n<=1 and scale the output of the 
network.  Put a fixed gain at the output, scale your target output
to [-1,+1] during training, then rescale during testing if you really
want to see the larger magnitude outputs.
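
Something like this is all I mean (a sketch in C; the names and the
GAIN value are mine, picked for the -4..+4 example):

#define GAIN 4.0                      /* fixed output gain */

double scale_target(double t)         /* used during training */
{
    return t / GAIN;                  /* maps [-GAIN,+GAIN] into [-1,+1] */
}

double rescale_output(double out)     /* used during testing */
{
    return out * GAIN;                /* back to the original range */
}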

I played a little with alternate sigmoid definitions, but in every
case I came out with a similar problem and a much more complex
training algorithm.

If you come up with an alternate solution I'd be very interested.

James Taylor

james@uw-isdl.ee.washington.edu

gary@desi.ucsd.edu (Gary Cottrell) (05/29/89)

You have some bugs in your math.
You have:

>sig(y) = {2n/(1-exp[-y])} - n
>       = n*(1+exp[-y])/(1-exp[-y])
>James Taylor
>
>james@uw-isdl.ee.washington.edu

The sigmoid is sq(x) = 1/(1+exp(-x)), not 1/(1-exp(-x)).

I will use k instead of n, since for any simulation, it is
a constant.

We want the derivative of  2k*sq(x) - k.

Let's call sq(x) OLD, and 2k*sq(x)-k NEW, and derivative DER.
So we want
	
DER(NEW)=DER(2k*OLD-k)		since NEW = 2k*OLD-k
	= 2k*DER(OLD)
	= 2k*(1-OLD)*OLD,	since DER(OLD) = (1-OLD)*OLD

But if in your simulation, your squash is the new one,
you need the derivative in terms of the new squashing function.

OLD = (NEW + k)/(2k),
so we have
DER(NEW)= 2k*(1-OLD)*OLD
	= 2k*[1 - (NEW + k)/(2k)]*(NEW + k)/(2k)
	=    [1 - (NEW + k)/(2k)]*(NEW + k)
	=    [(2k - (NEW + k))/(2k)]*(NEW + k)
	=    [(k - NEW)/(2k)]*(k + NEW)
	=    (k^2 - NEW^2)/(2k), if you prefer.
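
In C that comes out to something like the following (just a sketch;
the names are mine):

#include <math.h>

/* the new squashing function, 2k*sq(x) - k, which runs from -k to +k */
double squash(double x, double k)
{
    return 2.0 * k / (1.0 + exp(-x)) - k;
}

/* its derivative, written in terms of the unit's output NEW = squash(x,k) */
double squash_deriv(double out, double k)
{
    return (k * k - out * out) / (2.0 * k);
}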


gary cottrell	619-534-6640
Computer Science and Engineering C-014
UCSD, 
La Jolla, Ca. 92093
gary@cs.ucsd.edu (ARPA)
{ucbvax,decvax,akgua,dcdwest}!sdcsvax!gary (USENET)
gcottrell@ucsd.edu (BITNET)