[comp.ai.neural-nets] TD Model of Conditioning -- Paper Announcement

rich@GTE.COM (Rich Sutton) (05/20/89)

Andy Barto and I have just completed a major new paper relating
temporal-difference learning, as used, for example, in our
pole-balancing learning controller, to classical conditioning in
animals.  The paper will appear in the forthcoming book ``Learning and
Computational Neuroscience,'' edited by J.W. Moore and M. Gabriel, MIT
Press.  A preprint can be obtained by emailing
rich%gte.com@relay.cs.net with your physical-mail address.  The paper
has no abstract, but begins as follows:


	   TIME-DERIVATIVE MODELS OF PAVLOVIAN REINFORCEMENT

			   Richard S. Sutton
		     GTE Laboratories Incorporated

			    Andrew G. Barto
		      University of Massachusetts

This chapter presents a model of classical conditioning called the
temporal-difference (TD) model.  The TD model was originally developed
as a neuron-like unit for use in adaptive networks (Sutton & Barto,
1987; Sutton, 1984; Barto, Sutton & Anderson, 1983).  In this paper,
however, we analyze it from the point of view of animal learning theory.
Our intended audience is both animal learning researchers interested in
computational theories of behavior and machine learning researchers
interested in how their learning algorithms relate to, and may be
constrained by, animal learning studies.

We focus on what we see as the primary theoretical contribution of the
TD and related models to animal learning theory: the hypothesis that
reinforcement in classical conditioning is the time derivative of a
composite association combining innate (US) and acquired (CS)
associations.  We call models based on some variant of this hypothesis
``time-derivative models'', examples of which are the models by Klopf
(1988), Sutton & Barto (1981a), Moore et al. (1986), Hawkins & Kandel
(1984), Gelperin, Hopfield & Tank (1985), Tesauro (1987), and Kosko
(1986); we examine several of these models in relation to the TD model.
We also briefly explore relationships with animal learning theories of
reinforcement, including Mowrer's drive-induction theory (Mowrer, 1960)
and the Rescorla-Wagner model (Rescorla & Wagner, 1972).
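
As a rough illustration (not taken from the paper), the hypothesis can
be written in discrete time as follows.  The variable names and the
particular one-step discretization below are ours: Y(t) is the
composite association combining the innate US signal lambda(t) with the
acquired CS associations, and reinforcement is approximated by its
one-step difference.

    # Sketch of the generic time-derivative hypothesis (illustration only).
    # Y(t) = lambda(t) + sum_i V_i * X_i(t) is the composite association;
    # reinforcement is approximated by the difference Y(t) - Y(t-1).
    import numpy as np

    def composite_association(lam_t, V, X_t):
        """lambda(t) plus the acquired CS associations at time t."""
        return lam_t + float(np.dot(V, X_t))

    def time_derivative_reinforcement(Y_t, Y_prev):
        """Discrete-time stand-in for dY/dt."""
        return Y_t - Y_prev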

In this paper, we systematically analyze the inter-stimulus interval
(ISI) dependency of time-derivative models, using realistic stimulus
durations and both forward and backward CS--US intervals.  The models'
behaviors are compared with the empirical data for rabbit eyeblink
(nictitating membrane) conditioning.  We find that our earlier
time-derivative model (Sutton & Barto, 1981a) has significant problems
reproducing features of these data, and we briefly explore partial
solutions in subsequent time-derivative models proposed by Moore et al.
(1986), Klopf (1988), and Gelperin et al. (1985).
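
For readers who want to reproduce the flavor of this analysis, here is
one way such an ISI sweep might be set up.  The time base, stimulus
durations, and onset choices below are illustrative assumptions, not
the paper's simulation parameters; forward pairings use a positive
CS-onset-to-US-onset interval, backward pairings a negative one.

    # Illustrative trial generator for an ISI sweep (parameters assumed).
    import numpy as np

    def make_trial(isi_steps, cs_duration=10, us_duration=2, trial_len=80):
        """Return binary CS and US time series for one trial.

        isi_steps is the CS-onset-to-US-onset interval in time steps;
        negative values give backward (US-before-CS) pairings.
        """
        cs = np.zeros(trial_len)
        us = np.zeros(trial_len)
        cs_onset = 20                            # leave room for backward pairings
        cs[cs_onset:cs_onset + cs_duration] = 1.0
        us_onset = max(0, cs_onset + isi_steps)
        us[us_onset:us_onset + us_duration] = 1.0
        return cs, us

    # e.g., a forward pairing with a 4-step ISI and a backward pairing:
    cs_f, us_f = make_trial(4)
    cs_b, us_b = make_trial(-6)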

The TD model was designed to eliminate these problems by relying on a
slightly more complex time-derivative theory of reinforcement.  In this
paper, we motivate and explain this theory from the point of view of
animal learning theory, and show that the TD model solves the ISI
problems and other problems with simpler time-derivative models.
Finally, we demonstrate the TD model's behavior in a range of
conditioning paradigms including conditioned inhibition, primacy effects
(Egger & Miller, 1962), facilitation of remote associations, and
second-order conditioning.
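
For concreteness, here is a minimal real-time rendering of a TD-style
update in this spirit.  It is a sketch under our assumptions: the
learning-rate, discount, and eligibility-trace parameters, and the
exact trace update, are illustrative rather than the paper's
specification.  The prediction Y(t) is formed from the CS associations
and clipped at zero, and the reinforcement term is
lambda(t) + gamma*Y(t) - Y(t-1).

    # Minimal TD-style conditioning update (illustrative parameters).
    import numpy as np

    def td_conditioning_trial(cs, us, V, alpha=0.1, gamma=0.95, trace_decay=0.9):
        """Run one trial and update the CS association strengths V in place.

        cs: (T, n) array of CS traces; us: (T,) array of US intensities.
        """
        xbar = np.zeros_like(V)                   # eligibility traces, one per CS
        y_prev = 0.0
        for t in range(len(us)):
            y = max(0.0, float(np.dot(V, cs[t])))   # prediction, clipped at zero
            delta = us[t] + gamma * y - y_prev      # TD reinforcement signal
            V += alpha * delta * xbar
            xbar = trace_decay * xbar + cs[t]       # decay, then add current CS
            y_prev = y
        return V

    # Example: 50 forward-pairing trials with a single CS
    T, n = 80, 1
    cs = np.zeros((T, n)); cs[20:30, 0] = 1.0
    us = np.zeros(T); us[24:26] = 1.0
    V = np.zeros(n)
    for _ in range(50):
        V = td_conditioning_trial(cs, us, V)

Run over successive trials at different CS-US intervals (for instance,
via the trial generator sketched earlier), the resulting V values trace
out an ISI curve that can be compared against the eyeblink data.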