[comp.ai.neural-nets] updating neural nets

al@gtx.com (Alan Filipski) (12/08/89)

Suppose a net has been trained (e.g., using backprop) on several
thousand samples and a few more samples come along.  One could retrain
the net using all (several thousand + few) samples, or one could
somehow modify the weights using only the few new samples.  The risk
of the latter approach is that the gain in learning about the
few new samples may be outweighed by degraded performance on the old
samples.  The advantage, of course, is that all the old samples do not
have to be retained.

To look at a simple analog, suppose we have an estimate of the mean of
1000 numbers, and then 3 more numbers come in.  It is easy to compute
the mean of all 1003 numbers without seeing the first 1000 again--
their mean, along with the count of how many there were, provides a
sufficient statistic for updating the mean.
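The running-mean analogy above can be sketched in a few lines (the function name and values are illustrative):

```python
def update_mean(old_mean, old_count, new_samples):
    """Update a mean given only the old mean, the old count, and the
    new samples -- the original data need not be retained."""
    total = old_mean * old_count + sum(new_samples)
    new_count = old_count + len(new_samples)
    return total / new_count, new_count

# Suppose the first 1000 numbers had mean 5.0, and 3 more arrive.
mean, n = update_mean(5.0, 1000, [8.0, 9.0, 10.0])
```

The pair (mean, count) is the sufficient statistic; the question is what, if anything, plays that role for a trained net.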

Does anyone know of any work on updating neural nets to account for new
samples without completely retraining them?


  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 ( Alan Filipski, GTX Corp, 8836 N. 23rd Avenue, Phoenix, Arizona 85021, USA )
 ( {decvax,hplabs,uunet!amdahl,nsc}!sun!sunburn!gtx!al         (602)870-1696 )
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

danforth@riacs.edu (Douglas G. Danforth) (12/09/89)

]>From: al@gtx.com (Alan Filipski)
]Subject: updating neural nets
]Date: 7 Dec 89 20:34:13 GMT
]
]Does anyone know of any work on updating neural nets to account for new
]samples without completely retraining them?
]

     Consider a 3-layer network in which the weights between layers 1 & 2
(input to hidden) are frozen (this is the case for certain forms of associative
memory); all subsequent learning then takes place in the adjustment of the
hidden-to-output weights.  If the task at hand is one of CLASSIFICATION,
then either a 1-out-of-n or an orthogonal encoding of the class labels can be
used.  For any input pattern a subset of the hidden nodes will be activated
(assuming a step function, or nearly a step function, for the activation rule).
The code for the class label written into each of these activated hidden nodes
(increment or decrement the weights leading to the activated hidden node
commensurate with the label encoding scheme) will then simply be a COUNT of
the number of times that class label occurred with the activation of that
hidden node.

     With the above view we see that hidden nodes can be interpreted as LOCAL
estimators of the conditional probability density for each class, where we are
conditioning on some region of the input space.  The pooled and summed
weights from the activated nodes for a given input pattern then give the
best estimate of which class is dominant in that region of the input space.
This is very close to a form of Bayes estimation.

     Note that with a fixed increment or decrement scheme for the weights
during training, the order of presentation does not matter, since only
frequency counts of class labels are being retained.  The problem of
destroying the state of the system by further training is avoided by the
above approach and is governed by the same conditions that govern statistical
sampling (large samples give more robust estimators).
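A minimal sketch of this count-based scheme (the random frozen layer, sizes, and names are illustrative, not any particular system):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes = 10, 64, 3

# Frozen, random input-to-hidden weights: layers 1 & 2 never change.
W_in = rng.standard_normal((n_hidden, n_in))

def hidden(x):
    # Step-function activation: a subset of hidden nodes turns on.
    return (W_in @ x > 0).astype(float)

# Hidden-to-output weights hold per-class frequency counts.
W_out = np.zeros((n_classes, n_hidden))

def train(samples, labels):
    # Increment the count for each (class label, activated hidden node) pair.
    for x, y in zip(samples, labels):
        W_out[y] += hidden(x)

def classify(x):
    # Pooled counts from the activated nodes vote for the dominant class.
    return int(np.argmax(W_out @ hidden(x)))
```

Absorbing a few new samples is just another call to `train` on those samples alone, and because the weights are pure frequency counts, the result is identical to retraining on all samples in any order.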

     Systems that use approaches similar to this can be found in the work of
James Albus (CMAC) and Pentti Kanerva (Sparse Distributed Memory), and the
approach is related to some aspects of the cerebellum, as pointed out by
David Marr.

     Douglas Danforth
     danforth@riacs.edu


-- 
Douglas G. Danforth
danforth@riacs.edu
NASA Ames Research Center
Moffett Field, CA 94035

chrisley@kanga.uucp (Ron Chrisley UNTIL 10/3/88) (12/09/89)

From a previous message (Alan Filipski):

Does anyone know of any work on updating neural nets to account for new
samples without completely retraining them?



______________________________________________

Yes.  This was one of the things Geoff Hinton was trying to do with fast and
slow weights.  I think he has a paper "Using Fast Weights to Deblur Old
Memories" (with David Plaut) that appeared in the Proceedings of the Cog Sci
Society '87.
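The general fast/slow-weights idea can be caricatured in a few lines.  This is a toy sketch of the concept only, not Hinton's actual algorithm; all names, learning rates, and the squared-error update are my own assumptions:

```python
import numpy as np

# Each connection has a slow weight (stable, long-term knowledge) and a
# fast weight (quickly learned, quickly decaying correction).  The
# effective weight is their sum, so a few new samples can be absorbed
# into the fast weights without disturbing the slow ones.
rng = np.random.default_rng(1)
n = 4
slow = rng.standard_normal((n, n))
fast = np.zeros((n, n))

DECAY = 0.95    # fast weights decay toward zero each step
FAST_LR = 0.5   # fast weights learn quickly

def effective():
    return slow + fast

def absorb(x, target):
    """One gradient step on squared error, applied to the fast
    weights only (hypothetical update rule for illustration)."""
    global fast
    err = target - effective() @ x
    fast = DECAY * fast + FAST_LR * np.outer(err, x)
```

Each call to `absorb` moves the network's output for `x` toward `target` while the slow weights, and hence the old memories, stay put.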

Is there a better reference for this paper?

Ron Chrisley
chrisley.pa@xerox.com