bettingr@pegasus.cs.Buffalo.EDU (Keith E. Bettinger) (06/29/91)
I got many responses to my recent question about activation levels in
backpropagation networks; thanks!  I really learned a lot, enough that I
thought I would share it with the net for other novice backprop
investigators.

-------------------------------------------------------------------------
THE ORIGINAL QUESTION: (reproduced in its entirety for completeness)
======================

>In backpropagation networks, is it *inherent* in the equations involved
>that the activation range be on the unit interval [0,1] or the
>symmetric interval [-1,1]?
>
>BACKGROUND:
>==========
>I've been trying to relate a set of real-valued inputs to a set of
>real-valued outputs using a 3-layer backpropagation neural network,
>without any success.  The network seems to max out immediately, with
>the hidden nodes going directly to either minimum or maximum activation
>levels, and no appreciable learning taking place thereafter.
>  I *was* able to get a working network, though, if I scaled each input
>and output down to a [0,1] range.  But this procedure has its own
>problems, not the least of which is the necessity of knowing the entire
>range of inputs and outputs before beginning.
>
>LONG QUESTION:
>=============
>Is a [0,1] range (or a [-1,1] range, which also worked) necessary for
>backprop nets?  If so, can the equations be modified to allow a wider,
>hopefully unlimited, activation range?  If not, are there any special
>techniques needed to get such a network going?
-------------------------------------------------------------------------
SUMMARY OF RESPONSES:
=====================
I think the answer to my short question turned out to be effectively
Yes, although the strict answer is in all likelihood No.  The key, as
far as I can see, lies in the dynamic range of whatever activation
function is used (e.g., the dynamic range of the logistic function is
the sharply curving section between -5 and +5).
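To make the dynamic-range point concrete, here is a small sketch (my
own, in Python; the function names are mine, not from any respondent)
showing how flat the logistic curve gets outside roughly [-5,5], and the
simple min-max scaling I used to squeeze values into [0,1]:

```python
import math

def logistic(net):
    """Standard logistic (sigmoid) activation: range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def logistic_deriv(net):
    """Derivative of the logistic: logistic(net) * (1 - logistic(net)).
    Backprop scales every weight update by this factor, so where it is
    near zero, almost no learning takes place."""
    y = logistic(net)
    return y * (1.0 - y)

print(logistic_deriv(0.0))   # 0.25: plenty of gradient near net = 0
print(logistic_deriv(10.0))  # about 4.5e-05: saturated, learning stalls

def scale_unit(values):
    """Min-max scaling into [0,1].  Note it needs the full range of the
    data up front, the very drawback complained about in the question."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(scale_unit([3.0, 7.0, 11.0]))  # [0.0, 0.5, 1.0]
```

The derivative is the crux: a hidden unit driven to net = 10 by raw
real-valued inputs gets weight updates some 5000 times smaller than a
unit sitting near net = 0, which is exactly the "maxing out" behavior
described above.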
Feeding in raw real-valued inputs pushes the activation functions of the
hidden units beyond their effective dynamic range.  The solution seems
to be to use activation functions with wider dynamic ranges, but I
wonder what effect that would have on learning.  My guess is that you
get either a small dynamic range or slow learning.  Restricting your
inputs and outputs to [0,1] keeps the dynamic range small enough to
permit quick learning.

I will anonymously offer below some of the suggestions I got; any of my
respondents who want to come forward and claim their ideas have my
blessing. :-)

NO, no inherent restrictions:
-----------------------------
People who told me that no inherent restrictions existed also tended to
suggest that I use a linear activation function for the output layer.  I
had already tried that before posting, but it didn't help, because of
the dynamic range problem mentioned above.

The apparent restriction in activation range lies only in the definition
of the activation function, and it is easy to redefine the activation
function to lie on some other range.  Suggestions I got for alternative
activation functions included:

  o tanh (range [-1,1])
  o sin (from -pi/2 to pi/2, range [-1,1])
  o the rescaled logistic function ((Max-Min)/(1+exp(-net))) + Min,
    for a range [Min,Max]

YES, activation levels need to be in [0,1] (or [-1,1]):
-------------------------------------------------------
Many people agreed that, yes, Keith, you had to do what you did to get a
real-valued network to work.  They offered some suggestions on how to do
the scaling better, including:

  o Scale outputs into [0.1,0.9] or [0.2,0.8] to keep the network from
    setting the weights unnecessarily high.  Otherwise, the network
    drives the weights up trying to reach the unattainable extremes
    0 and 1.
  o Scale with a logarithm to compress a wide range into a small
    numeric space.
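For what it's worth, here is a sketch (again my own Python rendering;
the names and defaults are mine) of two of the suggestions above: the
rescaled logistic with a [Min,Max] output range, and squeezing targets
into [0.1,0.9] instead of [0,1]:

```python
import math

def rescaled_logistic(net, lo, hi):
    """Logistic rescaled from (0,1) to the range (lo, hi):
    (hi - lo) / (1 + exp(-net)) + lo."""
    return (hi - lo) / (1.0 + math.exp(-net)) + lo

def scale_targets(values, lo=0.1, hi=0.9):
    """Map targets linearly into [lo, hi] (default [0.1, 0.9]) so the
    network never chases the unattainable asymptotes 0 and 1."""
    vmin, vmax = min(values), max(values)
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]

# With lo = -1, hi = +1 this is essentially a stretched sigmoid, like
# tanh (math.tanh gives the other suggested [-1,1] activation directly):
print(rescaled_logistic(0.0, -1.0, 1.0))  # 0.0, midpoint of [-1, 1]
print(scale_targets([10.0, 20.0, 30.0]))  # roughly [0.1, 0.5, 0.9]
```

Note that the [0.1,0.9] squeeze has the same drawback as scaling to
[0,1]: the full target range has to be known before training begins.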
In my original posting, I complained that scaling required me to know
the entire range of inputs and outputs before running my network.  One
person offered that such foreknowledge is necessary anyway: backprop
networks are most effective at interpolation, not extrapolation.  I
meant to ask for more details about that idea.

Also, one person suggested that the bounds set by the logistic and
other activation functions might be necessary for learning complex
associations, in that real neurons can't handle an unlimited response
range either.

ISSUE OF DYNAMIC RANGE CONSTRAINTS:
-----------------------------------
Two people saw past my (probably) clumsy explanation of my problem to
offer what I think is the answer to my query: if the inputs to a node
are too large, the activation function saturates, and not much learning
can take place.  For the standard logistic function, this loss of
dynamic range occurs at about +/-5.  Outside the interval [-5,5],
logistic(x) ~= logistic(x+epsilon), so even relatively large moves along
the curve produce almost no change in the function's output.  For
learning to take place, the values coming into the logistic function
must stay in the more dynamic, roughly linear region of the curve.

One suggestion for helping my real-valued network stay in that region
was to initialize the weights to magnitudes around
1/<avg. magnitude of input vector>.  I didn't have any success with this
heuristic; the weights didn't stay in that range once learning started.

The obvious solution to the saturation problem is to use a function with
a wider dynamic range (not a taller activation range, as I had asked
for).  But I wonder what effect that would have on learning; my guess is
that it would slow it considerably.  Perhaps that is the cost of
learning over a range wider than [0,1].

LIST OF RESPONDENTS:  Thanks to all of you. :-)
====================
larry@spike.rprc.washington.edu (Larry Shupe)
cherwig@eng.clemson.edu (christoph bruno herwig)
adams@probitas.cs.utas.edu.au (Tony Adams)
paulonis@kodak.com
Denis Anthony <esrmm@cu.warwick.ac.uk>
Geoffrey Talvola <GTALVO%AUVM.BITNET@ubvm.cc.buffalo.edu>
dbailey%icsia4.Berkeley.EDU@berkeley.edu (David R. Bailey)
Yu Shen <shen@iro.umontreal.ca>
munro@learning.siemens.com (Paul Munro)
kooijman@duteca.et.tudelft.nl (Richard ...)
demers@cs.ucsd.edu (David DeMers)

FINAL REMARKS:
=============
This problem seemed like a fundamental one that any newcomer to
backpropagation networks would run into, so I thought this summary might
be useful to a lot of people.  Pardon me if I got too verbose. :-)

Thanks again for all your feedback.

-------------------------------------------------------------------------
Keith E. Bettinger                   "All of us get lost in the darkness
SUNY at Buffalo Computer Science      Dreamers learn to steer by the stars
                                      All of us do time in the gutter
                                      Dreamers turn to look at the cars."
INTERNET: bettingr@cs.buffalo.edu                           - Neil Peart
UUCP: ..{bbncca,decvax,rocksvax,watmath}!sunybcs!bettingr
-------------------------------------------------------------------------