[comp.ai.neural-nets] Feature maps and weight sharing

geirt@idt.unit.no (Geir Torheim) (10/30/90)

Does anybody have an idea how to train a back-prop net with
layers divided into feature maps? The units in a feature map share
weights.

I have read an article by Le Cun
about this, but he does not explain how he updates the shared weights.
I guess the best thing to do is to find the delta w for each
unit in the map and then use the average of these delta w's as
the delta w for the shared weight.  If someone has other ideas,
please send me an email.

I am also interested in guidelines for
how big the receptive fields should be, and how the maps should be
interconnected. Is it a good idea to let a unit in a map connect to several
different maps in the previous layer, like Le Cun does? How much should the
receptive fields overlap?


I am going to use the maps in character recognition.

If there is any interest, I'll send a summary to this newsgroup.

  - Geir

-- 
  geirt@idt.unit.no       or       TORHEIM@NORUNIT  (BITNET/EARN)

geirt@idt.unit.no (Geir Torheim) (11/11/90)

Hi,

Because of trouble with our news server, it looks like 'world' didn't
get this article. Therefore I'm sending it once more. Sorry if
you have seen it before.

Here we go:


A short summary of the responses I got:

Tony Plate <tap@neuron.ai.toronto.edu> writes

>> Does anybody have an idea how to train a back-prop net with
>> layers divided into feature maps? The units in a feature map share
>> weights.

>> I have read an article by Le Cun
>> about this, but he does not explain how he updates the shared weights.
>> I guess the best thing to do is to find the delta w for each
>> unit in the map and then use the average of these delta w's as
>> the delta w for the shared weight.  If someone has other ideas,
>> please send me an email.


> It is better to regard each shared weight as a function of
> an underlying parameter.  (That function is just the
> identity function.)  Then you compute the partial derivative
> of E with respect to that parameter, and use that derivative
> to work out an update to the weights.
> 
> E.g., consider the following part of a feed-forward network:
> 
>           O x1      O x2            xi is total input
>           ^         ^
> 	  | w1 	    | w2            wi is weight
> 	  |   	    |
>           O y1	    O y2            yi is total output
> 
> 
> In the standard, non-shared network,
> 
>      dE/dw_i  =  (dE/dx_i) * y_i
> 
> In the network where w1 and w2 are "shared weights", call
> the underlying parameter "p" (so that w1=p and w2=p), and:
> 
>      dE/dp  =  (dE/dw_1)(dw_1/dp) + (dE/dw_2)(dw_2/dp)    ( by the chain rule )
> 
>             =  dE/dw_1 + dE/dw_2          ( dw_i/dp = 1, since w_i = p )
> 
>             =  (dE/dx_1) * y_1  +  (dE/dx_2) * y_2
> 
> So, put simply, the derivative for the shared weight is just
> the sum of the derivatives for the weights being shared.
> 
> If you are using steepest descent with a fixed step size,
> just multiply dE/dp by your stepsize to get your "delta".
> 
> [ Note that the above is correct only in the case where
> x1 does not influence x2, or vice versa. ]
> 
> > I am also interested in guidelines for
> > how big the receptive fields should be, and how the maps should be
> > interconnected. Is it a good idea to let a unit in a map connect to several
> > different maps in the previous layer, like Le Cun does? How much should the
> > receptive fields overlap?
> 
> This is where the magic spells come in handy.
> 
> [ By the way, I believe the surname of "Yann Le Cun" is "Le
> Cun", not "Cun". ]
> -- 
> ---------------- Tony Plate ----------------------  tap@ai.utoronto.ca -----
> Department of Computer Science, University of Toronto, 
> 10 Kings College Road, Toronto, 
> Ontario, CANADA M5S 1A4
> ----------------------------------------------------------------------------
> 
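
Tony's rule (the derivative for the underlying parameter is the sum of
the derivatives of the weight copies) is easy to check numerically.
Here is a small sketch of my own, using his two-unit picture with
x_i = p * y_i and a quadratic error; the numbers are made up for
illustration:

```python
# Check: dE/dp equals the sum of the per-copy derivatives dE/dw_i.
# Setup: two independent units with x_i = p * y_i and
# E = 0.5 * sum_i (t_i - x_i)^2, so dE/dx_i = -(t_i - x_i).

def error(p, ys, ts):
    return 0.5 * sum((t - p * y) ** 2 for y, t in zip(ys, ts))

def shared_gradient(p, ys, ts):
    # dE/dp = sum_i dE/dw_i = sum_i (dE/dx_i) * y_i
    return sum(-(t - p * y) * y for y, t in zip(ys, ts))

ys = [0.7, -1.2]    # total outputs y1, y2 feeding the shared weight
ts = [0.3, 0.9]     # targets t1, t2
p = 0.5

# A finite-difference estimate of dE/dp agrees with the summed gradient
eps = 1e-6
fd = (error(p + eps, ys, ts) - error(p - eps, ys, ts)) / (2 * eps)
assert abs(fd - shared_gradient(p, ys, ts)) < 1e-6
```

(Note that x1 does not influence x2 here, as Tony's remark requires.)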

I have trouble seeing that

>      dE/dw_i  =  (dE/dx_i) * y_i

The idea of using an underlying variable is, however, excellent.

From McClelland/Rumelhart
(note: d means partial derivative; Delta_i is the difference between
the target value and the output value, see PDP vol. I, p. 323):

     dE/dw_i  =  (dE/dy_i)(dy_i/dw_i)  =  -Delta_i * y_i


With Tony's underlying variable and two weights w1 and w2,
we get:

    dE/dp  =  (dE/dw_1)(dw_1/dp) + (dE/dw_2)(dw_2/dp)

           =  dE/dw_1 + dE/dw_2        ( since w_1 = w_2 = p )

           =  -( Delta_1 * y_1  +  Delta_2 * y_2 )

This means that the weight change should be *proportional* to
the sum of the Deltas (see PDP I, p. 322).

It seems like a good idea to use a smaller
eta (learning rate, step size) than 'normally', since you add
up all the delta w's. Or, perhaps simpler, use the average of
the deltas and a more 'normal' step size.
(Which is just what my intuition tells me to do :-)
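
Applied to a feature map, this means accumulating the gradient of each
shared (kernel) weight over every position where it is used, and then
making one update per shared weight. A minimal 1-D sketch of my own
(the kernel, inputs, and targets are hypothetical):

```python
# Minimal sketch: one 1-D feature map with a shared 3-weight kernel.
# Each kernel weight is used at every map position; by the chain rule
# its gradient is the SUM of the per-position gradients.

def forward(kernel, inputs):
    n = len(inputs) - len(kernel) + 1
    return [sum(kernel[j] * inputs[i + j] for j in range(len(kernel)))
            for i in range(n)]

def kernel_gradient(kernel, inputs, targets):
    # E = 0.5 * sum_i (t_i - x_i)^2, so dE/dx_i = -(t_i - x_i)
    xs = forward(kernel, inputs)
    grad = [0.0] * len(kernel)
    for i, (x, t) in enumerate(zip(xs, targets)):
        for j in range(len(kernel)):
            grad[j] += -(t - x) * inputs[i + j]  # accumulate over positions
    return grad

kernel = [0.1, -0.2, 0.3]
inputs = [1.0, 0.5, -1.0, 2.0, 0.0]
targets = [0.2, -0.1, 0.4]

eta = 0.05
grad = kernel_gradient(kernel, inputs, targets)
# One update per shared weight. Averaging instead of summing just
# rescales by 1/(number of positions), i.e. it is the same as using
# eta/N with the sum.
kernel = [w - eta * g for w, g in zip(kernel, grad)]
```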

--------------------------------------------------------------

Nick Porcino <nporcino@uvcw.UVic.CA>  writes :
 
> A problem with feature maps is that they are typically
>  winner-take-all and provide a binary input to the back propagation 
>   network.
> 
> If you are using winner-take-all nets, ignore the bp, and just
>  use a lookup table with prestored answers in it!
> 
> I solved the problem by associating fuzzy membership functions
>  with the feature map vectors, thus providing something for bp
>   to work with (paper submitted to Neural Computation).
> 
> -nick

Looks like you are thinking of another type of feature map than I am,
Nick. I was thinking of making a perceptron-like net (feedforward) with
several maps in each layer. No winner-take-all.

(It would be great if you could send an abstract of your paper to this
newsgroup.)

--------------------------------------------------------------

Yaakov Metzger <coby@shum.huji.ac.il> writes 

> I am interested in the same thing. We tried here averaging the weights after
> each cycle and it seems to work fine. Seems to me that it can be theoretically
> justified, since the error function is the sum of all the contributions.
> 
> We are at a very early stage, so I don't have any definite results about
> field sizes etc.
> 
> Coby
> 
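
Coby's scheme (train the weight copies freely for a cycle, then average
them back together) might be sketched like this; the helper name and
data layout are my own invention:

```python
# Sketch: retie shared weights by replacing each copy with its group mean.

def retie_shared_weights(weights, shared_groups):
    """weights: name -> value; shared_groups: lists of names that are
    supposed to hold the same shared weight."""
    for group in shared_groups:
        mean = sum(weights[name] for name in group) / len(group)
        for name in group:
            weights[name] = mean
    return weights

w = {"w1": 0.52, "w2": 0.48, "w3": 1.00}     # after one unconstrained cycle
w = retie_shared_weights(w, [["w1", "w2"]])  # w1 and w2 share one weight
assert w["w1"] == w["w2"]                    # copies are tied again
```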

--------------------------------------------------------------

Tony Robinson <ajr@eng.cam.ac.uk> writes

[stuff deleted]

> This is just to confirm that you have the right idea.  If you want another
> example of weight sharing then I suggest you read the TDNN papers, ref below.
> Weight sharing is a neat idea, good luck.
>
> Tony.
>
> @article{WaibelHanazawaHintonShikanoLang89,
> 	author=		"A. Waibel and T. Hanazawa and G. Hinton and
> 			K. Shikano and K. J. Lang",
> 	title=		"Phoneme Recognition Using Time-Delay Neural Networks",
> 	journal=	"IEEE Transactions on Acoustics, Speech, and Signal Processing",
> 	month=		mar,
> 	year=		1989,
> 	volume=		37,
> 	number=		3,
> 	pages=		"328-339"
> }
>

The article I referred to in my question to this newsgroup was the following:

"Generalization and Network Design Strategies"
Yann Le Cun
From the book "Connectionism in Perspective",
R. Pfeifer, Z. Schreter, F. Fogelman-Soulié, L. Steels (eds),
North-Holland, 1989, pp 143-155

Another interesting article is:

"Optical Character Recognition and Neural-Net Chips"
Y. Le Cun, L. D. Jackel and others
Proceedings of the INNC-90, Paris, pp 651-655


Any comments ?


   -Geir

-- 
email:      geirt@idt.unit.no      TORHEIM@NORUNIT (BITNET/EARN)