geirt@idt.unit.no (Geir Torheim) (10/30/90)
Does anybody have an idea of how to train a back-prop net with layers divided
into feature maps? The units in a feature map share weights. I have read an
article by Le Cun about this, but he does not explain how he updates the
shared weights. I guess the best thing to do is to find the delta w for each
unit in the map and then use the average of these delta w's as the delta w
for the shared weight. If someone has other ideas, please send me an email.

I am also interested in guidelines for how big the receptive fields should
be, and how the maps should be interconnected. Is it a good idea to let a
unit in a map connect to several different maps in the previous layer, like
Cun does? How much should the receptive fields overlap? I am going to use
the maps in character recognition.

If there is any interest, I'll send a summary to this newsgroup.

- Geir
--
geirt@idt.unit.no    or    TORHEIM@NORUNIT (BITNET/EARN)
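P.S. In case my description of "feature maps" is unclear, here is a minimal
sketch of what I mean (Python-style pseudocode; the function name, the tanh
squashing, the bias and the stride are my own illustrative choices, not
details taken from Le Cun's paper). Every unit in a map applies the SAME
small weight kernel, shifted to its own receptive field of the input:

    import numpy as np

    def feature_map_forward(image, kernel, bias=0.0, stride=1):
        # One feature map: every unit applies the same kernel (shared
        # weights) to its own receptive field of the input image.
        kh, kw = kernel.shape
        ih, iw = image.shape
        oh = (ih - kh) // stride + 1
        ow = (iw - kw) // stride + 1
        out = np.zeros((oh, ow))
        for r in range(oh):
            for c in range(ow):
                field = image[r * stride : r * stride + kh,
                              c * stride : c * stride + kw]
                out[r, c] = np.tanh(np.sum(field * kernel) + bias)
        return out

    # e.g. a 16x16 character image and a 5x5 receptive field give a 12x12 map
    map_ = feature_map_forward(np.random.rand(16, 16), 0.1 * np.random.randn(5, 5))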
geirt@idt.unit.no (Geir Torheim) (11/11/90)
Hi,

Because of trouble with our news-server, it looks like 'world' didn't get
this article. Therefore I'm sending it once more. Sorry if you have seen it
before. Here we go:

A short summary of the responses I got:

Tony Plate <tap@neuron.ai.toronto.edu> writes

>> Does anybody have an idea of how to train a back-prop net with
>> layers divided into feature maps? The units in a feature map share
>> weights. I have read an article by Le Cun about this, but he does
>> not explain how he updates the shared weights. I guess the best
>> thing to do is to find the delta w for each unit in the map and
>> then use the average of these delta w's as the delta w for the
>> shared weight. If someone has other ideas, please send me an email.

> It is better to regard each shared weight as a function of
> an underlying parameter. (That function is just the
> identity function.) Then you compute the partial derivative
> of E with respect to that parameter, and use that derivative
> to work out an update to the weights.
>
> E.g., consider the following part of a feed-forward network:
>
>    O x1     O x2       xi is total input
>    ^        ^
>    | w1     | w2       wi is weight
>    |        |
>    O y1     O y2       yi is total output
>
> In the standard, non-shared network,
>
>    dE/dw_i = (dE/dx_i) * y_i
>
> In the network where w1 and w2 are "shared weights", call
> the underlying parameter "p" (so that w1 = p and w2 = p), and:
>
>    dE/dp = (dE/dw_1)(dw_1/dp) + (dE/dw_2)(dw_2/dp)    (by the chain rule)
>
>          = dE/dw_1 + dE/dw_2          (dw_i/dp = 1, since w_i = p)
>
>          = (dE/dx_1) * y_1 + (dE/dx_2) * y_2
>
> So, put simply, the derivative for the shared weight is just
> the sum of the derivatives for the weights being shared.
>
> If you are using steepest descent with a fixed step size,
> just multiply dE/dp by your stepsize to get your "delta".
>
> [ Note that the above is correct only in the case where
> x1 does not influence x2, or vice versa. ]
>
>> I am also interested in guidelines for how big the receptive fields
>> should be, and how the maps should be interconnected. Is it a good
>> idea to let a unit in a map connect to several different maps in the
>> previous layer, like Cun does? How much should the receptive fields
>> overlap?
>
> This is where the magic spells come in handy.
>
> [ By the way, I believe the surname of "Yann Le Cun" is "Le
> Cun", not "Cun". ]
> --
> ---------------- Tony Plate ---------------------- tap@ai.utoronto.ca -----
> Department of Computer Science, University of Toronto,
> 10 Kings College Road, Toronto,
> Ontario, CANADA M5S 1A4
> ----------------------------------------------------------------------------

I have trouble seeing that

    dE/dw_i = (dE/dx_i) * y_i

The idea of using an underlying variable is, however, excellent.

From McClelland/Rumelhart (note: d means partial derivative; Delta_i is the
difference between the target value and the output value; see PDP vol. I,
p 323):

    dE/dw_i = (dE/dy_i) * (dy_i/dw_i) = -Delta_i * y_i

With Tony's underlying variable and two weights w1 and w2, we get:

    dE/dp = (dE/dw_1)(dw_1/dp) + (dE/dw_2)(dw_2/dp)
          = dE/dw_1 + dE/dw_2
          = -(Delta_1 * y_1 + Delta_2 * y_2)        since w_1 = w_2 = p

This means that the weight change should be *proportional* to the sum of
the Deltas (see PDP I, p 322). It seems like a good idea to use a smaller
eta (learning rate, step size) than 'normally', since you add up all the
delta w's. Or perhaps simpler, use the average of the deltas, and a more
'normal' step size (which is just what my intuition tells me to do :-)
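For concreteness, here is a minimal sketch of the update rule (Python-style
pseudocode; the tanh output units, the squared error, and all the names are
my own illustrative choices, not taken from Le Cun's paper or the PDP book).
The shared parameter p is used at two positions; its gradient is the sum,
or optionally the average, of the per-position gradients (dE/dx_i) * y_i:

    import numpy as np

    # Toy case: one shared weight p used at two positions.
    # x_i = p * y_i,  o_i = tanh(x_i),  E = 0.5 * sum_i (t_i - o_i)^2

    def shared_weight_step(p, y, t, eta=0.1, average=False):
        x = p * y                             # both positions use the same weight
        o = np.tanh(x)
        dE_dx = -(t - o) * (1.0 - o ** 2)     # dE/do_i * do_i/dx_i  (= -Delta_i)
        per_position = dE_dx * y              # dE/dw_i = (dE/dx_i) * y_i
        dE_dp = per_position.mean() if average else per_position.sum()
        return p - eta * dE_dp                # steepest descent on the shared p

    p = 0.5
    y = np.array([0.3, -0.8])                 # outputs feeding the two positions
    t = np.array([0.2,  0.9])                 # targets for the two output units
    for step in range(5):
        p = shared_weight_step(p, y, t, average=True)

With average=True the step size plays the role of a 'normal' eta; with
average=False (the plain sum) you would probably want a smaller eta, as
discussed above.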
--------------------------------------------------------------

Nick Porcino <nporcino@uvcw.UVic.CA> writes:

> A problem with feature maps is that they are typically
> winner-take-all and provide a binary input to the back propagation
> network.
>
> If you are using winner-take-all nets, ignore the bp, and just
> use a lookup table with prestored answers in it!
>
> I solved the problem by associating fuzzy membership functions
> with the feature map vectors, thus providing something for bp
> to work with (paper submitted to Neural Computation).
>
> -nick

Looks like you are thinking of another type of feature map than I am, Nick.
I was thinking of making a perceptron-like net (feedforward) with several
maps in each layer. No winner-take-all. (It would be great if you could send
an abstract of your paper to this newsgroup.)

--------------------------------------------------------------

Yaakov Metzger <coby@shum.huji.ac.il> writes

> I am interested in the same thing. We tried here averaging the weights after
> each cycle and it seems to work fine. Seems to me that it can be theoretically
> justified since the error function is the sum of all the contributions.
>
> We are in a very early stage, so I don't have any definite results about
> field sizes etc.
>
> Coby

--------------------------------------------------------------

Tony Robinson <ajr@eng.cam.ac.uk> writes

[stuff deleted]

> This is just to confirm that you have the right idea. If you want another
> example of weight sharing then I suggest you read the TDNN papers, ref below.
> Weight sharing is a neat idea, good luck.
>
> Tony.
>
> @article{WaibelHanazawaHintonShikanoLang89,
>   author  = "A. Waibel and T. Hanazawa and G. Hinton and
>              K. Shikano and K. J. Lang",
>   title   = "Phoneme Recognition Using Time-Delay Neural Networks",
>   month   = mar,
>   year    = 1989,
>   journal = assp,
>   volume  = 37,
>   number  = 3,
>   pages   = "328-339"
> }

The article I referred to in my question to this newsgroup was the following:

  "Generalization and Network Design Strategies"
  Yann Le Cun
  From the book "Connectionism in Perspective",
  R. Pfeifer, Z. Schreter, F. Fogelman-Soulie, L. Steels (eds),
  North-Holland, 1989, pp. 143-155

Another interesting article is:

  "Optical Character Recognition and Neural-net Chips"
  Y. Le Cun, L. D. Jackel and others
  Proceedings of the INNC Paris-90, pp. 651-655

Any comments?

-Geir
--
email: geirt@idt.unit.no
       TORHEIM@NORUNIT (BITNET/EARN)