[comp.ai.neural-nets] SUMMARY: Need activation levels in backprop networks be betw 0 and 1?

bettingr@pegasus.cs.Buffalo.EDU (Keith E. Bettinger) (06/29/91)

I got many responses to my recent question about activation levels in
backpropagation networks, thanks!  I really learned a lot, enough that
I thought that I would share it with the net for other novice backprop
investigators.  

-------------------------------------------------------------------------

THE ORIGINAL QUESTION: (reproduced in its entirety for completeness)
======================
>In backpropagation networks, is it *inherent* in the equations involved
>that the activation range be on the unit interval [0,1] or the
>symmetric unit interval [-1,1]?
>
>BACKGROUND:
>==========
>I've been trying to relate a set of real-valued inputs to a
>set of real-valued outputs using a 3-layer backpropagation neural
>network, without any success.  The network seems to max out
>immediately, with the hidden nodes going directly to either minimum
>or maximum activation levels, and no appreciable learning taking
>place thereafter.
>   I *was* able to get a running network, though, if I scaled each
>input and output down to a [0,1] range.  But this procedure has its
>own problems, not the least of which is the necessity of knowing the
>entire range of inputs and outputs before beginning.
>
>LONG QUESTION:
>=============
>Is a [0,1] range (or a [-1,1] range, which also worked) necessary for
>backprop nets?  If so, can the equations be modified to allow a
>wider, hopefully unlimited, activation range?  If not, are there any
>special techniques needed to get such a network going?

------------------------------------------------------------------------------

SUMMARY OF RESPONSES:
=====================
I think that the answer to my Short Question turned out to be
effectively Yes, although the strict answer is in all likelihood No.
The key, as far as I can see, turned out to be the dynamic range of
whatever activation function is used (e.g., the dynamic range of the
logistic function is its sharply curving section between -5 and +5).
Feeding in unscaled real-valued inputs drives the hidden units'
activation functions outside their effective dynamic range.  The
solution seems to be to use activation functions with wider dynamic
ranges, but I wonder what effect doing so would have on learning.  My
guess is that there is a trade-off: either a small dynamic range or
slow learning.  Restricting your inputs and outputs to [0,1] keeps the
dynamic range small enough to permit quick learning.
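To make that concrete, here is a small Python sketch (my own
illustration, not any respondent's code) of how the logistic function
flattens out once its input leaves the dynamic range:

```python
import math

def logistic(x):
    """Standard logistic (sigmoid) activation, range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Net input in the dynamic region of the curve: output is sensitive
# to small changes, so the unit can still learn.
print(logistic(0.5) - logistic(0.0))    # a noticeable difference

# Net input far outside [-5, 5]: the unit is saturated; the output
# barely moves, so effectively no learning takes place.
print(logistic(50.5) - logistic(50.0))  # effectively zero
```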

I will anonymously offer below some of the suggestions I got; anyone
of my respondents who wants to come forward and claim their ideas has
my blessing. :-)

NO, no inherent restrictions:
-----------------------------
   People who told me that no inherent restrictions existed also
tended to suggest that I use a linear activation function for the
output.  I had already tried that before posting, but it didn't help,
due to the dynamic range problem I mentioned before.
   The apparent restriction in activation range lies only in the
definition of the activation function.  It is easy to redefine the
activation function to lie on some other range.  Some suggestions I
got for alternative activation functions included:
        o tanh (range [-1,1])
        o sin (from -pi/2 to pi/2, range [-1,1])
        o the rescaled logistic function
             ((Max-Min)/(1+exp(-net))) + Min,  for a range [Min,Max].
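For concreteness, here is a short Python sketch of two of those
alternatives (my own code, not any respondent's):

```python
import math

def tanh_act(net):
    # tanh activation: range [-1, 1]
    return math.tanh(net)

def rescaled_logistic(net, lo, hi):
    # Logistic rescaled to [lo, hi]:
    #   (hi - lo) / (1 + exp(-net)) + lo
    return (hi - lo) / (1.0 + math.exp(-net)) + lo

# At net = 0 each function sits at the midpoint of its range,
# and the extremes are approached only asymptotically.
print(tanh_act(0.0))                      # 0.0
print(rescaled_logistic(0.0, -2.0, 2.0))  # 0.0, midpoint of [-2, 2]
```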


YES, activation levels need to be in [0,1] (or [-1,1]):
------------------------------------------------------
   Many people agreed that, yes, Keith, you had to do what you did to
get a real-valued network to work.  They offered some suggestions on
how to better do the scaling, which included:
 
        o Scale outputs into [0.1,0.9] or [0.2,0.8] to keep the
          network from driving the weights unnecessarily high.
          Otherwise, the network would do this in trying to reach
          the unattainable asymptotic values 0 and 1.
        o Try scaling with a logarithm to cast a wide range into a
          small numeric space.
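A minimal Python sketch of the first suggestion (my own code; the
exact recipe my respondents had in mind may differ):

```python
def scale(values, lo=0.1, hi=0.9):
    """Min-max scale a list of reals into [lo, hi].

    Note: this needs the full range of the data up front, which is
    exactly the drawback complained about in the original posting.
    """
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    return [lo + (hi - lo) * (v - vmin) / span for v in values]

outputs = [12.0, 47.5, 3.2, 99.0]
print(scale(outputs))  # smallest maps to 0.1, largest to 0.9
```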
         
   In my original posting, I complained that scaling required me to
know the entire range of inputs and outputs before running my network.
One person offered that such foreknowledge was necessary anyway, since
backprop networks are most effective at interpolation, not
extrapolation.  I meant to ask for more details about that idea.
Also, one person suggested that the bounds set by the logistic and
other activation functions might be necessary to learning complex
associations, in that real neurons couldn't handle an unlimited
response range.

ISSUE OF DYNAMIC RANGE CONSTRAINTS:
-----------------------------------
   Two people transcended my (probably) clumsy explanation of my
problem to offer what I think is the answer to my query:  if the
inputs to a node are too high, the activation function will be
saturated, and not much learning can take place.
   In the case of the standard logistic function, this loss of dynamic
range occurs around +-5.  Outside of the interval [-5,5],
logistic(X)~=logistic(X+epsilon), so that even relatively large moves
along the curve result in almost no change in the function's output.  In
order for learning to take place, the values coming into the logistic
function must stay on the more dynamic, semilinear area of the curve.
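The saturation shows up directly in the derivative of the logistic,
which is the factor backprop multiplies the error signal by.  A small
Python sketch (my own illustration, not from any respondent):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_deriv(x):
    # d/dx logistic(x) = logistic(x) * (1 - logistic(x))
    y = logistic(x)
    return y * (1.0 - y)

# On the semilinear part of the curve the slope is large, so the
# error signal propagates and weights actually move.
print(logistic_deriv(0.0))   # 0.25, the maximum slope

# In the saturated region the slope is nearly zero, so the
# backpropagated error vanishes and learning stalls.
print(logistic_deriv(10.0))  # on the order of 1e-5
```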
   One suggestion I got on how to help my real-valued network stay on
that section was to initialize the weight matrix to have magnitudes
around 1/<avg. magnitude of the input vector>.  I didn't have any
success with this heuristic; the weights didn't stay in this range
once learning started.
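Here is my reading of that initialization heuristic as a Python sketch
(the assumptions are mine; the respondent's exact recipe may have
differed):

```python
import random

def init_weights(n_inputs, input_samples):
    """Draw a weight vector with magnitudes on the order of
    1 / <average input magnitude>, so that the initial net inputs
    land inside the logistic's dynamic range, roughly [-5, 5].
    """
    avg_mag = sum(abs(x) for s in input_samples for x in s) / (
        n_inputs * len(input_samples))
    bound = 1.0 / avg_mag
    return [random.uniform(-bound, bound) for _ in range(n_inputs)]

# Large raw inputs, as in an unscaled real-valued problem.
samples = [[120.0, -80.0, 200.0], [150.0, -60.0, 180.0]]
w = init_weights(3, samples)

# The initial net input to a hidden unit stays un-saturated.
net = sum(wi * xi for wi, xi in zip(w, samples[0]))
print(net)  # small in magnitude, inside the dynamic range
```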
   The obvious solution to the saturation problem is to use a function
with a wider dynamic range (not a taller activation range, as I had
asked for).  But I wonder what effect that might have on learning; my
guess is that it would slow it considerably.  But perhaps that is the
cost of learning on a range wider than [0,1].

LIST OF RESPONDENTS:   Thanks to all of you. :-)
====================
larry@spike.rprc.washington.edu (Larry Shupe)
cherwig@eng.clemson.edu (christoph bruno herwig)
adams@probitas.cs.utas.edu.au (Tony Adams)
paulonis@kodak.com
Denis Anthony <esrmm@cu.warwick.ac.uk>
Geoffrey Talvola <GTALVO%AUVM.BITNET@ubvm.cc.buffalo.edu>
dbailey%icsia4.Berkeley.EDU@berkeley.edu (David R. Bailey)
Yu Shen <shen@iro.umontreal.ca>
munro@learning.siemens.com (Paul Munro)
kooijman@duteca.et.tudelft.nl (Richard ...)
demers@cs.ucsd.edu (David DeMers)

FINAL REMARKS:
=============
   This problem seemed like a fundamental one that any newcomer to
using backpropagation networks would come across, so I thought this
summary might be useful to a lot of people.  Pardon me if I got too
verbose. :-)
   Thanks again for all your feedback.

-------------------------------------------------------------------------
Keith E. Bettinger                  "All of us get lost in the darkness
SUNY at Buffalo Computer Science     Dreamers learn to steer by the stars
                                     All of us do time in the gutter
                                     Dreamers turn to look at the cars."
INTERNET: bettingr@cs.buffalo.edu                           - Neil Peart
UUCP: ..{bbncca,decvax,rocksvax,watmath}!sunybcs!bettingr
-------------------------------------------------------------------------