pjhamvs@cs.vu.nl (Summeren van Peter) (09/20/90)
Greetings to the reader, this file contains : 1. a literature referencing for troff 2. a file for printing on a Pp4p an explanation of the method from Powell. The implementation was ftp'ed and the ftp adress was in this group before. But I put it here some adresses: for a hardcopy of docs: vincew@cse.ogc.edu for having done the ftp: opt-dist@cse.ogc.edu for the bugs:opt-bugs@cse.ogc.edu. 3. The implementation of opt was done by Etienne Barnard and Ronald Cole, (OGC Tech. Report No.CSE 89-014) You have to use your own troff commands for referencing. Ask the local troff specialists about it (it's not that difficult). I use a macro that places the references on the same page, but you may have other habits. After the next line starts a file for referencing literature in troff. ========================================================================= %E William H.Press %T Numerical Recipes in C %I Cambridge University Press %D 1989 %A M.J.D. Powell %T Restart Procedures for the conjugate gradient method %J Mathematical Programming %P 12(1977) 241-254 %A E.M.L. Beale %T A derivation of conjugate gradients %E F.A.Lootsma %B Numerical Methods for nonlinear optimization %I Academic Press %C London %D 1972 %P pp. 39-43 %A J.Stoer A.Bulirsch %T Introduction to numerical analysis %I Springer Verlag %C New York %D 1980 After the next line the troff program starts ========================================================================== .EQ delim $$ .EN .nr PS 12 .nr VS 15 .ps 12 .vs 15 .ft R .NH Introduction .LP A neural network can be built in many ways. The terms used are neural node, neural node function, sites, links. They are shown below in the figure. .PS scale = 2 .ps .ps 10 ellipse at 3.737,8.012 wid 1.000 ht 1.000 ellipse at 1.988,6.513 wid 1.000 ht 1.000 ellipse at 1.238,9.012 wid 1.000 ht 1.000 dashwid = 0.050i line dashed from 1.988,9.137 to 2.987,9.137 line dashed from 2.487,9.762 to 2.487,8.512 line from 1.800,6.513 to 1.800,6.513 line dashed from 1.050,8.387 to 1.238,8.200 line from 1.149,8.253 to 1.238,8.200 to 1.184,8.288 line from 3.887,9.412 to 3.862,9.512 to 3.837,9.412 line from 3.862,9.512 to 3.862,8.512 line from 0.657,8.536 to 0.675,8.637 to 0.612,8.556 line from 0.675,8.637 to 0.425,8.075 line from 0.600,8.567 to 0.675,8.637 to 0.575,8.611 line from 0.675,8.637 to 0.113,8.325 line from 0.675,8.637 to 0.675,8.637 line from 3.362,7.638 to 3.237,7.450 to 3.425,7.325 to 3.550,7.575 line from 2.362,6.200 to 2.550,6.075 to 2.612,6.200 to 2.425,6.325 line from 1.550,6.200 to 1.425,6.138 to 1.488,6.013 to 1.675,6.138 line from 1.488,6.513 to 1.300,6.513 to 1.300,6.388 to 1.488,6.388 line from 0.988,8.575 to 0.925,8.450 to 1.050,8.387 to 1.113,8.575 line from 0.800,8.825 to 0.613,8.700 to 0.738,8.575 to 0.863,8.700 spline from 1.988,8.762 to 2.237,8.825 to 2.362,8.950 to 2.487,9.137 to 2.675,9.450 to 2.925,9.575 to 3.175,9.575 line from 2.652,6.125 to 2.550,6.138 to 2.635,6.079 line from 2.550,6.138 to 2.584,6.122\ to 2.617,6.107\ to 2.650,6.093\ to 2.682,6.079\ to 2.713,6.065\ to 2.745,6.052\ to 2.775,6.038\ to 2.805,6.026\ to 2.835,6.013\ to 2.864,6.001\ to 2.892,5.990\ to 2.921,5.978\ to 2.948,5.967\ to 2.975,5.957\ to 3.002,5.947\ to 3.028,5.937\ to 3.054,5.927\ to 3.080,5.918\ to 3.129,5.900\ to 3.177,5.884\ to 3.224,5.869\ to 3.269,5.856\ to 3.313,5.843\ to 3.355,5.832\ to 3.396,5.822\ to 3.435,5.814\ to 3.474,5.807\ to 3.511,5.800\ to 3.548,5.796\ to 3.583,5.792\ to 3.617,5.789\ to 3.651,5.788\ to 3.684,5.788\ to 3.716,5.789\ to 3.747,5.791\ to 3.778,5.794\ to 3.808,5.798\ to 
3.838,5.803\ to 3.867,5.809\ to 3.896,5.817\ to 3.925,5.825\ line from 3.925,5.825 to 3.955,5.835\ to 3.985,5.848\ to 4.014,5.862\ to 4.043,5.879\ to 4.073,5.897\ to 4.101,5.916\ to 4.130,5.938\ to 4.158,5.960\ to 4.186,5.984\ to 4.214,6.010\ to 4.241,6.036\ to 4.268,6.063\ to 4.294,6.091\ to 4.320,6.120\ to 4.346,6.150\ to 4.371,6.180\ to 4.395,6.210\ to 4.419,6.241\ to 4.442,6.272\ to 4.465,6.303\ to 4.486,6.334\ to 4.508,6.364\ to 4.528,6.395\ to 4.548,6.425\ to 4.567,6.454\ to 4.585,6.483\ to 4.602,6.511\ to 4.619,6.539\ to 4.634,6.565\ to 4.649,6.591\ to 4.675,6.638\ line from 4.675,6.638 to 4.697,6.678\ to 4.722,6.720\ to 4.748,6.764\ to 4.775,6.811\ to 4.804,6.860\ to 4.818,6.885\ to 4.833,6.910\ to 4.848,6.936\ to 4.863,6.962\ to 4.879,6.989\ to 4.894,7.016\ to 4.910,7.044\ to 4.925,7.071\ to 4.941,7.100\ to 4.956,7.128\ to 4.972,7.157\ to 4.987,7.186\ to 5.002,7.216\ to 5.017,7.245\ to 5.032,7.275\ to 5.047,7.306\ to 5.062,7.336\ to 5.076,7.367\ to 5.090,7.398\ to 5.104,7.429\ to 5.117,7.461\ to 5.130,7.492\ to 5.143,7.524\ to 5.155,7.556\ to 5.166,7.588\ to 5.178,7.620\ to 5.188,7.652\ to 5.198,7.684\ to 5.208,7.717\ to 5.217,7.749\ to 5.225,7.782\ to 5.233,7.814\ to 5.240,7.847\ to 5.246,7.880\ to 5.252,7.912\ to 5.256,7.945\ to 5.260,7.977\ to 5.263,8.010\ to 5.266,8.042\ to 5.267,8.075\ to 5.267,8.107\ to 5.267,8.139\ to 5.265,8.171\ to 5.263,8.203\ to 5.259,8.235\ to 5.254,8.267\ to 5.249,8.298\ to 5.242,8.329\ to 5.234,8.360\ to 5.224,8.391\ to 5.214,8.422\ to 5.202,8.452\ to 5.189,8.483\ to 5.175,8.512\ line from 5.175,8.512 to 5.159,8.543\ to 5.141,8.572\ to 5.122,8.600\ to 5.101,8.628\ to 5.079,8.653\ to 5.056,8.678\ to 5.006,8.724\ to 4.979,8.746\ to 4.951,8.766\ to 4.923,8.785\ to 4.894,8.802\ to 4.864,8.819\ to 4.833,8.834\ to 4.801,8.848\ to 4.770,8.861\ to 4.737,8.872\ to 4.704,8.882\ to 4.671,8.891\ to 4.637,8.899\ to 4.604,8.905\ to 4.570,8.911\ to 4.536,8.914\ to 4.502,8.917\ to 4.468,8.918\ to 4.434,8.918\ to 4.400,8.916\ to 4.367,8.913\ to 4.334,8.909\ to 4.301,8.903\ to 4.269,8.896\ to 4.237,8.887\ line from 4.237,8.887 to 4.190,8.869\ to 4.146,8.843\ to 4.107,8.808\ to 4.069,8.762\ to 4.051,8.735\ to 4.033,8.705\ to 4.015,8.672\ to 3.998,8.635\ to 3.980,8.594\ to 3.962,8.550\ to 3.944,8.502\ to 3.934,8.477\ to 3.925,8.450\ line from 0.973,8.285 to 0.988,8.387 to 0.926,8.304 line from 0.988,8.387 to 0.974,8.361\ to 0.961,8.335\ to 0.938,8.285\ to 0.917,8.238\ to 0.899,8.195\ to 0.883,8.154\ to 0.870,8.116\ to 0.859,8.079\ to 0.851,8.046\ to 0.845,8.013\ to 0.841,7.983\ to 0.839,7.954\ to 0.839,7.927\ to 0.842,7.900\ to 0.847,7.875\ to 0.863,7.825\ line from 0.863,7.825 to 0.891,7.778\ to 0.932,7.737\ to 0.957,7.719\ to 0.983,7.703\ to 1.011,7.687\ to 1.039,7.672\ to 1.068,7.659\ to 1.097,7.645\ to 1.125,7.633\ to 1.152,7.621\ to 1.200,7.598\ to 1.238,7.575\ line from 1.238,7.575 to 1.276,7.546\ to 1.324,7.512\ to 1.350,7.492\ to 1.377,7.472\ to 1.404,7.451\ to 1.432,7.430\ to 1.460,7.408\ to 1.487,7.386\ to 1.512,7.364\ to 1.537,7.343\ to 1.580,7.301\ to 1.613,7.263\ line from 1.613,7.263 to 1.639,7.219\ to 1.653,7.190\ to 1.667,7.157\ to 1.682,7.117\ to 1.699,7.070\ to 1.708,7.043\ to 1.717,7.014\ to 1.727,6.983\ to 1.738,6.950\ .ps .ps 12 .ft .ft R "A NETWORK" at 2.175,5.296 ljust "activity" at 2.175,10.046 ljust "input" at 2.800,8.921 ljust "node" at 1.738,6.546 ljust "link" at 1.300,7.608 ljust "site" at 1.175,8.108 ljust .ps .ft .PE .nr PS 12 .nr VS 15 .ps 12 .vs 15 .ft R A kind of gradient descent algorithm is normally used in the backpropagation method 
to calculate the weights of the network.
The activation of a neuron is a nondecreasing and differentiable
function of its total input.
If one looks at the output layer it is easy to make the connection
between what happens there and any gradient-based method
(gradient descent or conjugate gradient).
.br
The output can be thought of as a vector whose dimension N is the number
of output units:
$ act sub i sup p $ denotes the response of output unit \fIi\fR to
pattern \fIp\fR.
In the learning phase the desired output $ s sub i sup p $ is known.
The quantity to be minimised is the total error over all patterns,
.br
.sp
.ce
E = $ SIGMA sub p ~ E sup p $ with $ E sup p ~=~ SIGMA sub {i = 1} sup N ~ ( act sub i sup p ~-~ s sub i sup p ) sup 2 $.
.br
The input to unit \fIi\fR itself is
.br
.sp
.ce
$ input sub i sup p ~=~ SIGMA sub j ~ w sub {i j} ~ act sub j sup p ~+~ bias sub i $.
.sp
The gradient of E is now:
.br
.ce
$ {partial E sup p } over {partial w sub {i j} } ~=~ {partial E sup p} over {partial input sub i sup p } ~ {partial input sub i sup p} over {partial w sub {i j}} $.
.br
But ${partial input sub i sup p} over {partial w sub {i j}} $ is known to be $ act sub j sup p$.
.br
We still have to work out the term
.br
.ce
${partial E sup p} over {partial input sub i sup p } $,
.br
which is equal to
.br
.sp
.ce
$ {partial E sup p} over {partial act sub i sup p } ~ {partial act sub i sup p} over { partial input sub i sup p} $.
.br
The last term is the derivative of the node function \fIF(x)\fR.
The first term differs between an output unit and a unit somewhere inside
the network.
In the case of an output unit one can differentiate $ E sup p $ directly.
In the case of a hidden unit
.br
.sp
.ce
$ {partial E sup p} over {partial act sub i sup p} ~=~ SIGMA sub {h = 1} sup N ~ { partial E sup p} over {partial input sub h sup p} ~ { partial input sub h sup p} over { partial act sub i sup p}$.
.br
The last term equals $w sub {h i} $, and so one gets:
.br
.sp
.ce
$ {partial E sup p} over {partial act sub i sup p} ~=~ SIGMA sub {h = 1} sup N ~ { partial E sup p} over {partial input sub h sup p} ~ w sub {h i} $.
.br
Now we can collect the results of all these calculations for a hidden unit:
.br
.sp
.ce
${partial E sup p} over {partial w sub {i j}} ~=~ italic F roman \(qs(input sub i sup p) ~ {act sub j sup p} ~ SIGMA sub {h = 1} sup N ~ { partial E sup p} over {partial input sub h sup p} ~ w sub {h i} $
.sp
\fRWorking back from the output layer to the input layer one can calculate
all the necessary gradients.
Therefore methods like gradient descent and conjugate gradient can be
implemented.
A small sketch in C of the output-layer case is given below; the whole
procedure is then sketched in the figure.
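.LP
This sketch is illustrative only and not the opt implementation: the
fixed fan-in NJ, the variable names and the logistic node function are
assumptions made for the example.
.DS L
/*
 * Gradient of E^p with respect to the weights of one output unit i,
 * for a single pattern p, as derived in the text.  Sketch only.
 */
#include <math.h>

#define NJ 4                      /* number of units feeding unit i      */

double f(double x)                /* assumed node function 1/(1+exp(-x)) */
{
    return 1.0 / (1.0 + exp(-x));
}

double fprime(double a)           /* its derivative, written in act a    */
{
    return a * (1.0 - a);
}

/* grad_i[j] = dE^p/dw_ij = 2(act_i - s_i) F'(input_i) act_j             */
void output_gradient(double act_j[NJ], double w_i[NJ], double bias_i,
                     double s_i, double grad_i[NJ])
{
    double input_i = bias_i, act_i, delta;
    int    j;

    for (j = 0; j < NJ; j++)
        input_i += w_i[j] * act_j[j];      /* input_i = sum_j w_ij act_j */
    act_i = f(input_i);
    delta = 2.0 * (act_i - s_i) * fprime(act_i);    /* dE^p/dinput_i     */
    for (j = 0; j < NJ; j++)
        grad_i[j] = delta * act_j[j];               /* dE^p/dw_ij        */
}
.DE
.LP
For a hidden unit only the factor delta changes: it becomes the weighted
sum over the units h fed by unit i, as in the last formula above.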
.PS scale = 1.6 .ps .ps 10 ellipse at 0.988,5.763 wid 0.500 ht 0.500 ellipse at 0.487,7.263 wid 0.500 ht 0.500 ellipse at 1.238,8.762 wid 0.500 ht 0.500 dashwid = 0.050i line dashed from 1.113,5.575 to 2.175,4.638 line from 2.083,4.685 to 2.175,4.638 to 2.117,4.722 line from 6.487,2.8 to 6.487,4.450 to 2.175,4.450 to 2.175,2.8 to 6.487,2.8 line from 3.550,3.888 to 3.550,3.513 to 4.925,3.513 line from 4.825,3.488 to 4.925,3.513 to 4.825,3.538 line from 4.8,3.8 to 4.8,4.4 to 2.237,4.4 to 2.237,3.8 to 4.8,3.8 line from 2.862,6.763 to 2.862,5.013 line from 2.837,5.112 to 2.862,5.013 to 2.887,5.112 line from 6.487,5.513 to 6.487,8.7 to 2.237,8.7 to 2.237,5.513 to 6.487,5.513 line from 2.987,6.763 to 2.987,6.013 to 2.987,5.763 to 4.987,5.763 line from 4.888,5.737 to 4.987,5.763 to 4.888,5.788 line from 4.175,6.763 to 4.175,6.263 to 4.987,6.263 line from 4.888,6.237 to 4.987,6.263 to 4.888,6.288 line from 2.987,8.075 to 2.987,7.513 line from 2.962,7.612 to 2.987,7.513 to 3.012,7.612 line from 5.0,6.8 to 5.0,7.513 to 2.362,7.513 to 2.362,6.8 to 5.0,6.8 line from 4.112,8.075 to 4.112,7.575 to 4.925,7.575 line from 4.825,7.550 to 4.925,7.575 to 4.825,7.600 line from 4.987,8.075 to 4.987,8.637 to 2.300,8.637 to 2.300,8.075 to 4.987,8.075 line from 0.925,6.013 to 0.613,7.075 line from 0.665,6.986 to 0.613,7.075 to 0.617,6.972 line from 1.238,9.012 to 1.238,9.512 line from 1.262,9.412 to 1.238,9.512 to 1.213,9.412 line from 1.488,8.325 to 1.988,9.262 to 2.487,9.262 line from 2.387,9.237 to 2.487,9.262 to 2.387,9.287 line from 1.425,8.493 to 1.363,8.575 to 1.379,8.473 line from 1.363,8.575 to 1.550,8.137 line from 0.863,6.013 to 0.863,6.013 line from 0.863,6.013 to 0.863,6.013 line from 0.863,6.013 to 0.863,6.013 line from 1.095,8.473 to 1.113,8.575 to 1.049,8.494 line from 1.113,8.575 to 0.613,7.450 line from 1.090,8.317 to 0.988,8.325 to 1.074,8.270 spline from 0.988,8.325 to 1.175,8.262 to 1.300,8.262 to 1.488,8.325 line from 1.401,8.270 to 1.488,8.325 to 1.385,8.317 .ps .ps 10 .ft .ft R "$SIGMA sub {h = 1} sup N { partial E sup p} over {partial input sub h sup p} w sub {h i} $ " at 4.987,3.579 ljust "FOR HIDDEN UNITS" at 2.237,4.579 ljust "$SIGMA sub {h = 1} sup N { partial E sup p} over {partial input sub h sup p}{ partial input sub h sup p} over { partial act sub i sup p}$ " at 2.425,4.25 ljust "$ {partial E sup p } over {partial w sub {i j} }$ " at 2.237,9.1 ljust " $ act sub j sup p$" at 5.112,7.517 ljust "$ ~~~~{partial E sup p} over {partial act sub i sup p } ~~~~~~~~~~~~ {partial act sub i sup p}over { partial input sub {i j}} $ " at 2.487,7.3 ljust "$ 2({act sub i sup p} - {s sub i sup p})$" at 5.050,5.829 ljust " \fnodefunction\(qs(x)\fR " at 5.050,6.329 ljust "$~~~~{partial E sup p} over {partial input sub i sup p }~~~~~~~~~~~~ {partial input sup p} over { w sub {i j}} $" at 2.425,8.5 ljust .ps .ps 12 "K" at 0.97,5.796 ljust "J" at 0.47,7.296 ljust "I" at 1.22,8.796 ljust .ps 10 "$~~~~ input sub i sup p $ = $ SIGMA sub j $ $w sub i sub j$ $ act sub j sup p$ + $ bias sub j $ " at 2.487,9.296 ljust "E = $ SIGMA $ $ E sup p $ = $ SIGMA sub 1 sup N $ $ (act sub i sup p $ - $ s sub i sup p$ $ ) sup 2 $ " at 0.988,9.608 ljust .ps .nr PS 12 .nr VS 15 .ps 12 .vs 15 .ft .PE \fRWe will now turn to conjugate gradient. .NH The conjugate gradient algorithm .LP We want to know the least value of a function F(\fBx\fR) where \fBx\fR is a vector of dimension \fIn\fR. In the conjugate gradient method there is no need to calculate second derivatives. 
We denote the value of \fBx\fR at the $k sup th $ iteration as
$ bold x sub k $ and the gradient vector as
.br
.sp
.ce
\fBg(x)\fR = $grad F$(\fBx\fR).
.sp
If $ bold x sub 1 $ is a given starting point in the space of variables,
we can calculate the gradient $ bold g sub 1 $ there.
The steepest descent direction is of course the opposite direction,
$- bold g sub 1 $, and this is taken as the first search direction
$ bold d sub 1 $.
For \fIk\fR > 1 we apply the formula
.sp
.ce
$ bold d sub k ~=~ - bold g sub k ~+~ beta sub k ~ bold d sub {k-1}$,
.sp
.ce
\fRwhere $ beta sub k $ = ${norm( {bold g sub k} , {bold g sub k})} over {norm( {bold g sub {k-1}}, {bold g sub {k-1}} )} $.
.br
.sp
\fRHere norm( , ) with two arguments denotes an inner product; the
ordinary Euclidean inner product can be used, so that
norm(\fBg\fR, \fBg\fR) is the squared length of \fBg\fR.
Now we have to find the minimum of F(\fBx\fR) along the search direction.
This is certainly not an easy task; implementations for it can be found in
.[ [
Recipes
.]].
In the course of the algorithm one obtains in this way the successive
search directions $ bold d sub k $ and the points
.sp
.ce
$bold x sub {k+1} ~=~ bold x sub k ~+~ lambda sub k ~ bold d sub k $,
.sp
where $ lambda sub k $ is chosen so that F($ bold x sub {k+1}$) is minimal
along the line.
It would be very advantageous if the search directions led to the desired
solution in a very short time.
For a quadratic function the method above indeed achieves this.
Let
.sp
.ce
F(\fBx\fR) = $ 1 over 2 ~ bold x sup T A bold x ~+~ bold b sup T bold x ~+~ c$,
.sp
where T denotes transposition.
Then it can be proven that the search directions calculated above have the
property
.sp
.ce
$ bold d sub i sup T $ A $ bold d sub j ~=~ 0$ for $ i != j$.
.sp
We say that $ bold d sub i $ and $ bold d sub j $ are conjugate.
Because every line search finds the lowest point along its direction and
the directions are mutually conjugate, $bold x sub {k+1}$ is the lowest
point of F on the subspace spanned by $bold d sub 1 , ~...~ , bold d sub k$.
There are at most \fIn\fR such directions, so the algorithm reaches the
least value in at most \fIn\fR iterations.
It is the essence of Powell's idea to make use of this property.
One can consider every function as built up from approximately quadratic
pieces.
In every such subregion one can use a kind of conjugate gradient for
\fIn\fR steps, and after these \fIn\fR steps a restart is made.
The point is then to guard the convergence while traversing a
non-quadratic region; here lie a number of tricks which we give later.
.br
To understand what happens in the very many non-quadratic cases, a
geometrical representation of the principle of the conjugate gradient
method is given in the figure below.
At a starting point on the left side of the ellipse the first search
direction is chosen opposite to the gradient vector.
A line search then finds the point where the function F is minimal along
that direction.
The box in the corner elaborates what happens there: the gradient
increases, so the weight of the old search direction increases according
to $ beta $, and a new search direction is established.
This new search direction leads to a third point, and so on.
One iteration of the scheme is also written out in the short code sketch
that precedes the figure.
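.LP
This is a minimal sketch in C of one iteration, not the opt code: the
line search routine line_minimise() is an assumed helper that returns the
step length minimising F along the search direction (for instance a
bracketing and Brent style routine as in Numerical Recipes), and the
caller is assumed to recompute the gradient after every step.
.DS L
extern double line_minimise(double x[], double d[], int n);  /* assumed */

double dot(double a[], double b[], int n)    /* Euclidean inner product */
{
    double s = 0.0;
    int    i;

    for (i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* One conjugate gradient iteration: g holds grad F(x) on entry;
 * on the very first call pass first_step = 1 so that d_1 = -g_1.       */
void cg_step(double x[], double g[], double g_old[], double d[],
             int n, int first_step)
{
    double beta, lambda;
    int    i;

    if (first_step)
        beta = 0.0;
    else                                     /* beta_k of the text      */
        beta = dot(g, g, n) / dot(g_old, g_old, n);

    for (i = 0; i < n; i++)                  /* d_k = -g_k + beta d_k-1 */
        d[i] = -g[i] + beta * d[i];

    lambda = line_minimise(x, d, n);         /* minimise F along d_k    */

    for (i = 0; i < n; i++) {
        x[i]    += lambda * d[i];            /* x_k+1 = x_k + lambda d_k */
        g_old[i] = g[i];                     /* keep g_k for next beta   */
    }
}
.DE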
.sp .PS scale = 2 .ps .ps 10 ellipse at 4.987,4.650 wid 0.250 ht 0.250 ellipse at 4.987,5.013 wid 0.525 ht 1.025 ellipse at 4.987,4.513 wid 0.025 ht 0.025 ellipse at 4.987,6.513 wid 1.650 ht 4.025 ellipse at 4.987,5.825 wid 0.775 ht 2.650 line from 4.300,7.213 to 4.300,7.312 to 4.175,7.312 to 4.175,7.213 to 4.300,7.213 line from 4.225,7.263 to 7.225,6.950 line from 7.123,6.935 to 7.225,6.950 to 7.128,6.985 line from 4.987,6.013 to 4.987,6.013 line from 5.175,7.000 to 5.175,7.312 to 4.800,7.312 to 4.800,7.000 to 5.175,7.000 line from 5.112,7.312 to 7.225,8.338 line from 7.146,8.271 to 7.225,8.338 to 7.124,8.316 .ps .ps 20 line from 8.175,9.137 to 8.662,9.075 line from 8.458,9.051 to 8.662,9.075 to 8.470,9.150 line from 8.175,9.137 to 8.113,8.438 line from 8.080,8.641 to 8.113,8.438 to 8.180,8.632 dashwid = 0.050i line dashed from 8.175,9.137 to 8.550,8.438 line from 8.411,8.590 to 8.550,8.438 to 8.500,8.637 line dashed from 9.738,8.338 to 9.738,9.775 to 7.225,9.775 to 7.225,8.338 to 9.738,8.338 line dashed from 4.987,7.213 to 6.237,5.013 line from 6.095,5.162 to 6.237,5.013 to 6.182,5.211 line from 9.863,4.388 to 9.863,9.825 to 3.862,9.825 to 3.862,4.388 to 9.863,4.388 .ps .ps 12 .ft .ft R "new search direction" at 6.362,5.108 ljust "grad" at 7.362,8.546 ljust "Sold" at 8.675,9.171 ljust "Snew" at 8.675,8.546 ljust .ps .ft .PE .sp .nr PS 12 .nr VS 15 .ps 12 .vs 15 .ft R Yet in the figure all the disadvantages of the conjugate gradient method can be seen. Most functions are not quadratic at all in a large domain. The effect of this can be seen in the next figure. .br .sp .PS scale = 2 .ps .ps 10 ellipse at 2.900,6.250 wid 1.350 ht 1.350 ellipse at 2.900,6.250 wid 0.200 ht 0.200 ellipse at 2.900,6.250 wid 0.300 ht 0.300 ellipse at 2.900,6.250 wid 0.525 ht 0.525 ellipse at 2.900,6.250 wid 0.425 ht 0.425 ellipse at 2.900,6.250 wid 0.625 ht 0.625 .ps .ps 20 ellipse at 2.900,6.300 wid 3.100 ht 3.100 .ps .ps 10 line from 4.002,5.261 to 3.913,5.312 to 3.967,5.225 line from 3.913,5.312 to 4.362,4.875 dashwid = 0.050i line dashed from 6.013,9.775 to 7.463,9.000 line from 7.363,9.025 to 7.463,9.000 to 7.386,9.069 line from 6.013,9.775 to 5.963,9.675 line from 5.985,9.776 to 5.963,9.675 to 6.030,9.753 line from 6.013,9.775 to 7.513,9.050 line from 7.412,9.071 to 7.513,9.050 to 7.433,9.116 line from 4.825,10.088 to 4.825,10.088 line from 7.825,8.637 to 7.825,10.300 to 4.825,10.300 to 4.825,8.637 to 7.825,8.637 .ps .ps 20 dashwid = 0.037i line dotted from 3.987,7.912 to 4.875,8.637 line from 4.752,8.472 to 4.875,8.637 to 4.688,8.550 line from 4.037,7.338 to 4.037,7.912 to 3.525,7.912 to 3.525,7.338 to 4.037,7.338 dashwid = 0.050i line dashed from 1.450,9.050 to 6.537,5.938 line from 6.341,5.999 to 6.537,5.938 to 6.393,6.085 .ps .ps 10 line from 3.750,7.562 to 3.785,7.533\ to 3.814,7.508\ to 3.861,7.466\ to 3.892,7.436\ to 3.913,7.412\ line from 3.913,7.412 to 3.957,7.363\ to 3.982,7.332\ to 4.009,7.300\ to 4.035,7.267\ to 4.060,7.237\ to 4.100,7.188\ line from 4.100,7.188 to 4.113,7.164\ to 4.128,7.136\ to 4.146,7.104\ to 4.165,7.071\ to 4.183,7.037\ to 4.200,7.005\ to 4.215,6.975\ to 4.225,6.950\ line from 4.225,6.950 to 4.239,6.921\ to 4.255,6.885\ to 4.271,6.844\ to 4.287,6.800\ to 4.302,6.754\ to 4.316,6.711\ to 4.328,6.671\ to 4.338,6.638\ line from 4.338,6.638 to 4.341,6.612\ to 4.346,6.579\ to 4.350,6.543\ to 4.354,6.503\ to 4.358,6.463\ to 4.361,6.425\ to 4.362,6.391\ to 4.362,6.362\ line from 4.362,6.362 to 4.362,6.322\ to 4.358,6.273\ to 4.356,6.246\ to 4.354,6.218\ to 4.350,6.189\ to 
4.347,6.160\ to 4.343,6.131\ to 4.339,6.103\ to 4.335,6.075\ to 4.331,6.049\ to 4.322,6.001\ to 4.312,5.963\ line from 4.312,5.963 to 4.300,5.922\ to 4.283,5.875\ to 4.273,5.849\ to 4.263,5.822\ to 4.251,5.795\ to 4.240,5.768\ to 4.228,5.742\ to 4.216,5.715\ to 4.192,5.666\ to 4.170,5.622\ to 4.150,5.588\ line from 4.150,5.588 to 4.118,5.545\ to 4.077,5.496\ to 4.054,5.470\ to 4.030,5.443\ to 4.005,5.416\ to 3.979,5.389\ to 3.953,5.363\ to 3.927,5.337\ to 3.901,5.312\ to 3.876,5.288\ to 3.828,5.246\ to 3.788,5.213\ line from 3.788,5.213 to 3.759,5.189\ to 3.723,5.163\ to 3.682,5.134\ to 3.637,5.105\ to 3.592,5.077\ to 3.548,5.051\ to 3.508,5.029\ to 3.475,5.013\ line from 3.475,5.013 to 3.432,4.999\ to 3.407,4.991\ to 3.381,4.984\ to 3.352,4.977\ to 3.323,4.970\ to 3.293,4.963\ to 3.262,4.956\ to 3.232,4.950\ to 3.201,4.945\ to 3.172,4.939\ to 3.144,4.935\ to 3.117,4.931\ to 3.092,4.928\ to 3.050,4.925\ line from 3.050,4.925 to 3.003,4.922\ to 2.975,4.922\ to 2.946,4.922\ to 2.915,4.922\ to 2.883,4.923\ to 2.850,4.925\ to 2.817,4.927\ to 2.784,4.930\ to 2.752,4.933\ to 2.720,4.937\ to 2.689,4.941\ to 2.660,4.946\ to 2.634,4.951\ to 2.587,4.963\ line from 2.587,4.963 to 2.548,4.978\ to 2.501,5.001\ to 2.477,5.014\ to 2.451,5.028\ to 2.425,5.043\ to 2.399,5.059\ to 2.373,5.074\ to 2.348,5.090\ to 2.301,5.121\ to 2.259,5.150\ to 2.225,5.175\ line from 2.225,5.175 to 2.178,5.225\ to 2.151,5.256\ to 2.121,5.294\ to 2.086,5.338\ to 2.067,5.363\ to 2.047,5.391\ to 2.025,5.421\ to 2.002,5.453\ to 1.977,5.488\ to 1.950,5.525\ line from 2.140,5.233 to 2.225,5.175 to 2.177,5.266 .ps .ps 20 line from 2.688,10.300 to 2.691,10.258\ to 2.692,10.227\ to 2.688,10.188\ line from 2.688,10.188 to 2.690,10.139\ to 2.688,10.110\ to 2.688,10.088\ line from 2.688,10.088 to 2.701,10.063\ to 2.719,10.036\ to 2.750,9.988\ line from 2.750,9.988 to 2.748,9.962\ to 2.744,9.932\ to 2.743,9.901\ to 2.750,9.875\ line from 2.750,9.875 to 2.773,9.855\ to 2.800,9.825\ line from 2.800,9.825 to 2.794,9.798\ to 2.777,9.772\ to 2.750,9.725\ line from 2.750,9.725 to 2.745,9.701\ to 2.744,9.672\ to 2.750,9.625\ line from 2.750,9.625 to 2.748,9.599\ to 2.748,9.569\ to 2.750,9.538\ to 2.750,9.512\ line from 2.750,9.512 to 2.719,9.464\ to 2.701,9.436\ to 2.688,9.412\ line from 2.688,9.412 to 2.688,9.363\ line from 2.688,9.363 to 2.681,9.338\ to 2.666,9.310\ to 2.638,9.262\ line from 2.638,9.262 to 2.639,9.231\ to 2.638,9.200\ line from 2.638,9.200 to 2.638,9.150\ line from 2.638,9.150 to 2.640,9.127\ to 2.638,9.100\ line from 2.638,9.100 to 2.640,9.078\ to 2.640,9.050\ to 2.638,9.000\ line from 2.638,9.000 to 2.642,8.973\ to 2.638,8.950\ line from 2.638,8.950 to 2.615,8.921\ to 2.587,8.900\ line from 2.587,8.900 to 2.587,8.868\ to 2.587,8.838\ line from 2.587,8.838 to 2.593,8.791\ to 2.593,8.762\ to 2.587,8.738\ line from 2.587,8.738 to 2.563,8.713\ to 2.538,8.688\ line from 2.538,8.688 to 2.537,8.661\ to 2.538,8.637\ line from 2.538,8.637 to 2.513,8.607\ to 2.487,8.588\ line from 2.487,8.588 to 2.460,8.557\ to 2.438,8.525\ line from 2.438,8.525 to 2.407,8.506\ to 2.388,8.475\ line from 2.388,8.475 to 2.388,8.425\ line from 2.388,8.425 to 2.357,8.399\ to 2.325,8.375\ line from 2.325,8.375 to 2.316,8.353\ to 2.305,8.324\ to 2.275,8.275\ line from 2.275,8.275 to 2.227,8.243\ to 2.198,8.232\ to 2.175,8.225\ line from 2.175,8.225 to 2.140,8.211\ to 2.095,8.196\ to 2.050,8.181\ to 2.013,8.162\ line from 2.013,8.162 to 1.991,8.141\ to 1.962,8.113\ line from 1.962,8.113 to 1.924,8.089\ to 1.883,8.065\ to 1.837,8.042\ to 1.788,8.018\ to 
1.762,8.006\ to 1.735,7.994\ to 1.706,7.981\ to 1.676,7.969\ to 1.645,7.956\ to 1.613,7.943\ to 1.579,7.930\ to 1.543,7.916\ to 1.506,7.903\ to 1.467,7.889\ to 1.427,7.874\ to 1.385,7.860\ to 1.341,7.845\ to 1.295,7.829\ to 1.248,7.814\ to 1.198,7.797\ to 1.173,7.789\ to 1.147,7.781\ to 1.120,7.772\ to 1.093,7.764\ to 1.065,7.755\ to 1.037,7.746\ to 1.009,7.737\ to 0.979,7.728\ to 0.950,7.719\ to 0.919,7.709\ to 0.888,7.700\ to 0.857,7.690\ to 0.825,7.680\ to 0.792,7.670\ to 0.759,7.660\ to 0.725,7.650\ line from 2.438,10.238 to 2.455,10.201\ to 2.468,10.174\ to 2.487,10.137\ line from 2.487,10.137 to 2.497,10.115\ to 2.510,10.087\ to 2.538,10.037\ line from 2.538,10.037 to 2.587,9.988\ line from 2.587,9.988 to 2.602,9.963\ to 2.618,9.933\ to 2.631,9.902\ to 2.638,9.875\ line from 2.638,9.875 to 2.642,9.827\ to 2.640,9.798\ to 2.638,9.775\ line from 2.638,9.775 to 2.640,9.750\ to 2.638,9.725\ line from 2.638,9.725 to 2.640,9.690\ to 2.642,9.646\ to 2.641,9.600\ to 2.638,9.562\ line from 2.638,9.562 to 2.630,9.520\ to 2.623,9.493\ to 2.615,9.464\ to 2.606,9.435\ to 2.598,9.408\ to 2.587,9.363\ line from 2.587,9.363 to 2.587,9.339\ to 2.588,9.310\ to 2.587,9.262\ line from 2.587,9.262 to 2.589,9.231\ to 2.587,9.200\ line from 2.587,9.200 to 2.590,9.153\ to 2.590,9.124\ to 2.587,9.100\ line from 2.587,9.100 to 2.579,9.066\ to 2.563,9.024\ to 2.547,8.983\ to 2.538,8.950\ line from 2.538,8.950 to 2.536,8.925\ to 2.537,8.894\ to 2.539,8.863\ to 2.538,8.838\ line from 2.538,8.838 to 2.529,8.805\ to 2.516,8.763\ to 2.501,8.720\ to 2.487,8.688\ line from 2.487,8.688 to 2.466,8.651\ to 2.438,8.606\ to 2.409,8.561\ to 2.388,8.525\ line from 2.388,8.525 to 2.355,8.506\ to 2.325,8.475\ line from 2.325,8.475 to 2.325,8.425\ line from 2.325,8.425 to 2.278,8.396\ to 2.248,8.386\ to 2.225,8.375\ line from 2.225,8.375 to 2.200,8.350\ to 2.175,8.325\ line from 2.175,8.325 to 2.131,8.296\ to 2.104,8.281\ to 2.074,8.266\ to 2.044,8.251\ to 2.014,8.239\ to 1.987,8.230\ to 1.962,8.225\ line from 1.962,8.225 to 1.938,8.219\ to 1.907,8.216\ to 1.873,8.215\ to 1.836,8.216\ to 1.800,8.218\ to 1.766,8.220\ to 1.736,8.223\ to 1.712,8.225\ line from 1.712,8.225 to 1.685,8.223\ to 1.653,8.220\ to 1.617,8.218\ to 1.579,8.216\ to 1.542,8.215\ to 1.507,8.216\ to 1.476,8.219\ to 1.450,8.225\ line from 1.450,8.225 to 1.427,8.231\ to 1.400,8.240\ to 1.371,8.254\ to 1.341,8.269\ to 1.312,8.284\ to 1.284,8.300\ to 1.238,8.325\ line from 1.238,8.325 to 1.188,8.348\ to 1.159,8.362\ to 1.137,8.375\ line from 1.137,8.375 to 1.113,8.393\ to 1.085,8.415\ to 1.054,8.440\ to 1.023,8.467\ to 0.992,8.497\ to 0.965,8.527\ to 0.942,8.558\ to 0.925,8.588\ line from 0.925,8.588 to 0.922,8.620\ to 0.927,8.662\ to 0.930,8.703\ to 0.925,8.738\ line from 0.925,8.738 to 0.915,8.763\ to 0.899,8.790\ to 0.879,8.819\ to 0.856,8.848\ to 0.832,8.877\ to 0.810,8.904\ to 0.775,8.950\ line from 0.775,8.950 to 0.749,8.980\ to 0.718,9.020\ to 0.690,9.063\ to 0.675,9.100\ line from 0.675,9.100 to 0.669,9.124\ to 0.669,9.153\ to 0.675,9.200\ line from 0.675,9.200 to 0.664,9.250\ to 0.656,9.279\ to 0.648,9.309\ to 0.639,9.339\ to 0.630,9.368\ to 0.613,9.412\ line from 0.613,9.412 to 0.600,9.449\ to 0.585,9.476\ to 0.562,9.512\ line from 2.175,9.875 to 2.219,9.849\ to 2.257,9.827\ to 2.289,9.806\ to 2.316,9.787\ to 2.357,9.755\ to 2.388,9.725\ line from 2.388,9.725 to 2.402,9.691\ to 2.416,9.647\ to 2.428,9.602\ to 2.438,9.562\ line from 2.438,9.562 to 2.436,9.515\ to 2.436,9.486\ to 2.438,9.463\ line from 2.438,9.463 to 2.435,9.438\ to 2.438,9.412\ line from 
2.438,9.412 to 2.435,9.390\ to 2.435,9.361\ to 2.438,9.312\ line from 2.438,9.312 to 2.435,9.284\ to 2.438,9.262\ line from 2.438,9.262 to 2.435,9.237\ to 2.435,9.206\ to 2.435,9.175\ to 2.438,9.150\ line from 2.438,9.150 to 2.437,9.102\ to 2.439,9.073\ to 2.438,9.050\ line from 2.438,9.050 to 2.416,9.012\ to 2.384,8.970\ to 2.350,8.931\ to 2.325,8.900\ line from 2.325,8.900 to 2.315,8.863\ to 2.303,8.817\ to 2.289,8.772\ to 2.275,8.738\ line from 2.275,8.738 to 2.256,8.702\ to 2.229,8.659\ to 2.200,8.618\ to 2.175,8.588\ line from 2.175,8.588 to 2.153,8.560\ to 2.125,8.529\ to 2.097,8.499\ to 2.075,8.475\ line from 2.075,8.475 to 2.051,8.464\ to 2.021,8.449\ to 1.990,8.435\ to 1.962,8.425\ line from 1.962,8.425 to 1.919,8.420\ to 1.891,8.420\ to 1.862,8.420\ to 1.834,8.420\ to 1.806,8.421\ to 1.762,8.425\ line from 1.762,8.425 to 1.725,8.436\ to 1.681,8.452\ to 1.637,8.468\ to 1.600,8.475\ line from 1.600,8.475 to 1.568,8.478\ to 1.526,8.477\ to 1.484,8.474\ to 1.450,8.475\ line from 1.450,8.475 to 1.408,8.490\ to 1.360,8.511\ to 1.335,8.524\ to 1.309,8.537\ to 1.283,8.552\ to 1.257,8.567\ to 1.232,8.582\ to 1.207,8.598\ to 1.160,8.629\ to 1.119,8.660\ to 1.087,8.688\ line from 1.087,8.688 to 1.058,8.719\ to 1.027,8.761\ to 0.998,8.804\ to 0.975,8.838\ line from 0.975,8.838 to 0.965,8.874\ to 0.952,8.919\ to 0.938,8.964\ to 0.925,9.000\ line from 0.925,9.000 to 0.915,9.033\ to 0.901,9.075\ to 0.886,9.117\ to 0.875,9.150\ line from 0.875,9.150 to 0.869,9.182\ to 0.864,9.220\ to 0.859,9.263\ to 0.853,9.308\ to 0.847,9.353\ to 0.840,9.395\ to 0.833,9.433\ to 0.825,9.463\ line from 0.825,9.463 to 0.804,9.510\ to 0.789,9.538\ to 0.772,9.567\ to 0.756,9.597\ to 0.742,9.625\ to 0.725,9.675\ line from 0.725,9.675 to 0.725,9.725\ .ft .ps 10 .ft R "a very slow approximation" at 4.362,4.893 ljust "gradient" at 5.388,9.418 ljust "Sold" at 7.050,9.418 ljust .ps .ft .PE .nr PS 12 .nr VS 15 .ps 12 .vs 15 The hills on the left side are very steep, so the search direction is a large vector. Then when one arrives at the landscape of the quadratic function, the gradient becomes much smaller, but big enough to do harm. The method of the conjugate gradient requires to conserve a portion of the old search direction, and if the landscape is chosen well , this portion is very big in comparison with the local gradient. Then the quadratic function is travelled through in a very slow spiral. This effect can be observed in neural networks. In the start of the algorithm everything works well. A considerable diminishing of the error takes place. But when the error approaches zero, it takes a lot of iterations to get a factor two less - so it is very time consuming. Powell .[ [ Powell .]] points at the difference between two expressions: .sp .ce $ beta sub k $ = ${norm( {bold g sub k} , {bold g sub k})} over {norm( {bold g sub {k-1}}, {bold g sub {k-1}})}~~ $ $ beta sub k $ = ${norm( {bold g sub k} , {bold g sub k} - {bold g sub {k-1})}} over {norm( {bold g sub {k-1}}, {bold g sub {k-1}} )} $ .sp \fR If the function is quadratic there is no difference between the two expressions. But in the case that the searchdirection is formed by a part of the function which is not quadratic at all, then it makes a lot of difference. To understand this again a geometric representation can help. 
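.LP
The two expressions can also be put side by side in code before we turn
to the geometric picture.
This is a small sketch only; the dot() inner product helper of the
earlier sketch is assumed.
.DS L
extern double dot(double a[], double b[], int n);  /* earlier sketch    */

/* the first expression, the one used so far                            */
double beta_first(double g[], double g_old[], int n)
{
    return dot(g, g, n) / dot(g_old, g_old, n);
}

/* the second expression, the one Powell points at                      */
double beta_second(double g[], double g_old[], int n)
{
    return (dot(g, g, n) - dot(g, g_old, n)) / dot(g_old, g_old, n);
}
.DE
.LP
For a quadratic function with exact line searches successive gradients
are orthogonal, so dot(g, g_old) vanishes and both routines return the
same value, in agreement with the remark above.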
.sp
.PS
scale = 2
.ps
.ps 20
line from 1.175,7.200 to 3.175,7.950
line from 3.005,7.833 to 3.175,7.950 to 2.970,7.927
line from 2.300,8.637 to 3.237,7.950
line from 3.047,8.028 to 3.237,7.950 to 3.106,8.109
line from 1.175,7.200 to 2.300,8.637
line from 2.216,8.449 to 2.300,8.637 to 2.137,8.511
.ps
.ps 12
.ft
.ft R
"$ space 0 bold d sub k $" at 2.237,7.421 ljust
"$ space 0 {beta sub k} {bold d sub {k-1}} $" at 2.737,8.796 ljust
"$ space 0 bold -g sub k $" at 0.925,7.983 ljust
.ps
.ft
.PE
.nr PS 12
.nr VS 15
.ps 12
.vs 15
.ft R
.sp
\fRThe angle between $ bold -g sub k $ and $ bold d sub k $ is called
$ theta sub k$; with an exact line search it satisfies
.sp
.ce
norm($ bold d sub k $) = sec $theta sub k$ norm($ bold g sub k $).
.sp
If we replace k by k+1 and eliminate norm($ bold d sub k $), then
.sp
.ce
$ tan theta sub {k+1}$ = sec$theta sub k$ ${norm({bold g sub {k+1}} - {bold g sub k})} over {norm({bold g sub k})} $
.sp
\fRand therefore
.sp
.sp
.ce
$ tan theta sub {k+1}$ > tan$theta sub k$ ${norm({bold g sub {k+1}} - {bold g sub k})} over {norm({bold g sub k})} $.
.sp
\fRNow we are able to give some idea of what happens in the very slow
spiralling: if $theta sub k$ is close to ${1 over 2} pi$ and the iteration
takes only a small step, the gradient hardly changes, so the ratio of the
gradient norms is close to one and the weight of the old search direction
stays large.
The result is that $ theta sub {k+1} $ is again close to ${1 over 2} pi$:
and there is the slow approach of the spiral.
.br
The other formula however can save the day:
.sp
.ce
$ tan theta sub {k+1}$ < sec$theta sub k$ ${norm({bold g sub {k+1}} - {bold g sub k})} over {norm({bold g sub k})} $.
.sp
The problem is however that it does not work all the time; Powell gives an
example of this.
Although it is a matter of experience that restarting every \fIn\fR
iterations does help, this is a meager result for all the trouble.
Such a restart often has the disadvantage that the new search direction,
taken along the gradient, is no better than the old one - and this is
understandable.
So one looks for a direction that is better than the gradient and also
better than the conjugate gradient.
Beale
.[ [
Beale
.]]
gives arguments for taking a multiple of the restart direction
$ bold d sub t$ and adding it to the conjugate gradient direction.
If for the moment we take $ bold d sub t$ to be an arbitrary direction and
F to be quadratic, is it then possible to make
$ bold d sub t $, $ bold d sub {t+1} $, $ bold d sub {t+2} $,
$ bold d sub {t+3} $, ... mutually conjugate?
We use:
.sp
.ce
$ bold d sub k ~=~ - bold g sub k ~+~ beta sub k ~ bold d sub {k-1} ~+~ gamma sub k ~ bold d sub t$.
.sp
\fRThe factors $beta$ and $gamma$ \fRlook like the old $beta$, but now the
restart direction enters through $gamma$:
.sp
.ce
$ beta sub k $ = ${norm( {bold g sub k} , {bold g sub k} - {bold g sub {k-1}})} over {norm( {bold d sub {k-1}}, {bold g sub k} - {bold g sub {k-1}} )} $
.sp
.ce
$ gamma sub k $ = ${norm( {bold g sub k} , {bold g sub {t+1}} - {bold g sub t})} over {norm( {bold d sub t}, {bold g sub {t+1}} - {bold g sub t} )} $
.sp
But this augmented $bold d sub k$ \fRcan easily point uphill instead of
downhill.
One way to avoid that situation is to demand that
.sp
.ce
$ norm( {bold d sub k} , {bold g sub k} ) ~<~ 0 $ \fRfor k > t.
.sp
\fRAnd this is just what Powell's algorithm does!
It checks the inner product of successive gradients, which appears in the
$beta$ of the conjugate gradient method, and demands that
.sp
.ce
$ norm( {bold g sub {k-1}} , {bold g sub k}) ~~<=~~ 0.2 ~ norm( {bold g sub k}, {bold g sub k}) $.
.sp
Furthermore the search direction has to be sufficiently downhill:
.sp
.ce
$ -1.2 ~ norm( {bold g sub k} , {bold g sub k}) ~~<=~~ norm( {bold d sub k} , {bold g sub k}) ~~<=~~ -0.8 ~ norm( {bold g sub k}, {bold g sub k}) $.
.sp
\fRThe line search method has not been mentioned above, but there too it is
wise to impose a condition:
.sp
.ce
$ norm( {bold g sub k} , {bold g sub {k+1}}) ~~<~~ 0.1 ~ norm( {bold d sub k} , {bold g sub k}) $,
.sp
\fRto ensure that one really has a decrease in the directional derivative.
To require that the angle $theta$ between $bold g sub {k+1} $ and
$bold d sub k$ is nearly ${1 over 2}pi $ also makes sense according to
Powell.
(The restart and descent tests above are written out once more in a short
code sketch at the end of this note.)
.sp
To compare the gain in efficiency we take the well-known XOR problem.
We ask for a total error of 0.000001.
The implementation of the normal backpropagation method had 2 inputs,
1 hidden layer with 2 hidden units, and 1 output unit.
The unit function was
.br
.sp
.ce
f(x) = $ 1 over { 1 + e sup {-x + bias}}$.
.sp
.sp
In the implementation that used Powell's method the unit function had
bias zero; therefore a hidden layer of three units was needed.
The results are given below.
.sp
.sp
.TS
center allbox tab(|);
c s
c s
c c
l l.
error 0.000001
number of iterations
Powell|backprop.
46|> 320,000
.TE
.sp
.nr PS 12
.nr VS 15
.ps 12
.vs 15
.ft R
Of course the comparison is not a fair one: it only shows that the method
normally used in backpropagation can be very slow.
It made, however, the difference between success and failure in the case
of this research.
There are some 154 binary inputs, one layer of 9 hidden units, and 16
classes.
A network trained with the normal backpropagation method could learn some
50 different cases, but it took about 24 hours to learn 100 cases, and the
whole set of training cases could not be mastered at all.
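.LP
To close, the restart test and the sufficient-descent test quoted earlier
can be written as two small predicates.
This is one reading of those conditions and not the opt implementation;
dot() is the inner product helper of the earlier sketches, and the
constants 0.2, -1.2 and -0.8 are the ones given in the text.
.DS L
extern double dot(double a[], double b[], int n);  /* earlier sketch    */

/* successive gradients must stay nearly orthogonal:
 * (g_k-1, g_k) <= 0.2 (g_k, g_k), otherwise a restart is made          */
int gradients_nearly_orthogonal(double g[], double g_old[], int n)
{
    return dot(g_old, g, n) <= 0.2 * dot(g, g, n);
}

/* the search direction must be sufficiently downhill:
 * -1.2 (g_k, g_k) <= (d_k, g_k) <= -0.8 (g_k, g_k)                     */
int sufficiently_downhill(double d[], double g[], int n)
{
    double dg = dot(d, g, n);
    double gg = dot(g, g, n);

    return -1.2 * gg <= dg && dg <= -0.8 * gg;
}
.DE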