[comp.protocols.appletalk] PAP -1101 errors

cck@cunixc.columbia.EDU (Charlie C. Kim) (03/18/88)
Here's my analysis of how to handle the -1101 (no release received)
error.  The implicit release and extended timeouts will both be
present in the next release of papif/pap.

Comments are welcome (except you should be warned that I feel very
strongly about returning the noRel(-1101) error when the response
cache timeout occurs).

Charlie C. Kim
User Services
Columbia University

The effects of the ATP No Release Errors on PAP: revisited in detail

Author: Charlie C. Kim, User Services, Columbia University, March 1988

We assume in the following that the reader understands the basic
operation of the AppleTalk Transaction Protocol and the Printer Access
Protocol.

Abstract
--------

An ATP exactly once (XO) transaction defines the concept of a Release
packet that causes a Responding node to delete a response cache entry
kept to retransmit packets.  This cache entry may also be deleted by a
timeout occurring (by specification) 30 seconds after the last request
for transmission or retransmission.
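
The cache lifetime just described can be sketched as follows.  This is
a toy Python model, not CAP code: the class ResponseCache, its method
names, and the explicit "now" argument (in seconds) are assumptions
made for illustration.  Only the 30-second figure comes from the text.

```python
RELEASE_TIMEOUT = 30.0  # seconds, the figure quoted in the text above


class ResponseCache:
    """Toy model of an ATP XO responder's response cache (hypothetical)."""

    def __init__(self, timeout=RELEASE_TIMEOUT):
        self.timeout = timeout
        self.entries = {}  # transaction id -> time of the last request seen

    def on_request(self, tid, now):
        # A first request or a retransmission request (re)starts the timer.
        self.entries[tid] = now

    def on_release(self, tid):
        # A Release packet deletes the entry; a duplicate Release is harmless.
        self.entries.pop(tid, None)

    def expire(self, now):
        # Delete and return the entries whose release timer has run out.
        dead = [t for t, last in self.entries.items()
                if now - last >= self.timeout]
        for t in dead:
            del self.entries[t]
        return dead


cache = ResponseCache()
cache.on_request(7, now=0.0)
cache.on_request(7, now=20.0)   # retransmission request restarts the timer
print(cache.expire(now=45.0))   # [] since only 25 s have passed
print(cache.expire(now=50.0))   # [7] since 30 s passed with no Release
```

Passing the clock in as an argument keeps the sketch deterministic; a
real implementation would hang the timer off its own event loop.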

The problem occurs when the Release packet is either lost or delayed
beyond the response cache timeout.  In the absence of other
information, the connection must be torn down when this event occurs,
both because the responder cannot know whether all response packets
have been correctly received by the requester and because tearing it
down prevents a deadlock.

These problems were showing up with regularity in using the Printer
Access Protocol to communicate from a Unix host running CAP/KIP to a
LaserWriter.

Getting around the problems
---------------------------

The workaround to the problem of delayed packets, posted in bug fix 14
for CAP 4.0 with patches 5, 6, and 7, was to extend the ATP response
cache timeout.  This is not a solution because it is still possible to
delete the response cache entry before the transaction has completed;
however, the probability of this occurrence has been reduced
drastically--especially knowing that a primary culprit was the time
"compression" that can occur when running the protocol with a
LaserWriter that is printing, which prevents it from responding in a
"reasonable" amount of time ("reasonable" as judged by the Unix host
based upon the protocol-specified timeouts).

When conducting the ATP "responder" transactions for PAP, you must
return an error and close down the connection if no "release" has
been received and a response cache timeout occurs on a particular
transaction.  If you do not, you cannot be sure that the remote side
has received all the data (for that transaction).  Furthermore, you
open up the possibility of a deadlock.
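
A minimal sketch of that rule, in Python for brevity.  The class
PapConnection and its methods are hypothetical names invented here,
not the papif/pap interface; only the error value -1101 and the
policy itself come from this article.

```python
NO_REL = -1101  # the noRel error discussed in this article


class PapConnection:
    """Hypothetical responder-side view of one PAP connection."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.open = True
        self.last_request_at = None  # None: no response awaiting a Release

    def response_sent(self, now):
        # Called whenever the response is transmitted or retransmitted.
        self.last_request_at = now

    def release_received(self):
        self.last_request_at = None

    def poll(self, now):
        """Return 0 if all is well, NO_REL if the response cache timed out."""
        if self.last_request_at is None:
            return 0
        if now - self.last_request_at < self.timeout:
            return 0
        # No Release arrived in time: we cannot know whether the data got
        # through, so report the error and tear the connection down.
        self.last_request_at = None
        self.open = False
        return NO_REL
```

The caller is expected to stop using the connection once poll()
returns NO_REL, rather than quietly moving on to the next transaction.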

The deadlock may occur if the requester (e.g. a LaserWriter doing a
PAPRead) is sending the ATP request with infinite retry and
exactly-once mode and, for some reason, is unable to ask for
retransmissions of missed packets inside the Response Cache Timeout
interval, so that the cache entry is removed by the responder.  The
requester (PAPRead/LaserWriter) will continue requesting the data, but
will never get it because the responder has already sequenced to the
next transaction: if it hasn't, then it isn't really timing out the
response cache.

A lost Release packet is a slightly different story.  It is actually
possible to detect this situation when ATP is being run by PAP.  PAP
is a sequenced protocol with single ATP transactions outstanding (it
is sufficient to note that neither side is allowed to have more than
one PAPRead outstanding per connection).  A PAPWrite on a host sending
data to a LaserWriter works as follows.  The LaserWriter must have
issued a PAPRead that sends a PAP "SendData" ATP Request with a
particular PAP sequence number <n>, whereupon the host (responder)
sends a response with the data, placing it into the cache, etc.

If the release from the LaserWriter to the host gets lost, it is
possible for the host to resynchronize.  Simply note that if the host
receives a SendData request with PAP sequence <n+1> (within the
response cache timeout), then it can, with impunity, cancel the last
response and continue with the protocol because all the data in the
response must have been received by the LaserWriter (requester).  One
knows this because the LaserWriter cannot legitimately issue a new
PAPRead (send a new SendData request) until all the responses to the
previous PAPRead/"SendData request" have been received by the
LaserWriter.  In other words, the new SendData is an implicit release
of the previous transaction.  There are indications that this may have
been done by Apple in their implementation of PAP (cf. Release a
RspCB, Page VII-13, Inside AppleTalk, June 1986, Apple Comp.).
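
The implicit release can be sketched as below.  This is an
illustration in Python under stated assumptions, not the papif/pap
code: PapWriter and its methods are hypothetical names,
retransmission is keyed on the PAP sequence number rather than on the
ATP transaction ID, and sequence number wraparound is ignored.

```python
class PapWriter:
    """Hypothetical host-side (responder) state for PAPWrite."""

    def __init__(self):
        self.cached_seq = None   # sequence number of the cached response
        self.cached_data = None

    def on_send_data(self, seq, next_data):
        """Handle a SendData request; return (action, data)."""
        if self.cached_seq is not None and seq == self.cached_seq:
            # Same sequence number: the requester wants a retransmission.
            return ("retransmit", self.cached_data)
        if self.cached_seq is not None and seq == self.cached_seq + 1:
            # New SendData: the previous response must have arrived in
            # full, so treat this request as the missing Release.
            self.cached_seq = None
        if self.cached_seq is None:
            self.cached_seq, self.cached_data = seq, next_data
            return ("respond", next_data)
        return ("ignore", None)  # anything else is out of sequence

    def on_release(self, seq):
        # The ordinary, explicit Release path.
        if seq == self.cached_seq:
            self.cached_seq = self.cached_data = None


w = PapWriter()
print(w.on_send_data(5, b"page 1"))  # ('respond', b'page 1')
# ... the Release for sequence 5 is lost on the network ...
print(w.on_send_data(6, b"page 2"))  # ('respond', b'page 2')
```

The second call succeeds without any Release having been seen: the
request for <n+1> alone is enough to free the entry for <n>.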


Other ideas
-----------

Dan Lanciani of Harvard suggests the following.  By not deleting the
response cache entry on a release timeout (it must then be deleted by
hand) or by never letting it time out (setting the timeout to
infinity), the problem of delayed responses disappears.  Basically,
one assumes that the responses will eventually be requested and that
massive or continuing disruptions in network communications or
shutdown of the remote will be signaled by the Tickling mechanism.
One must be careful with this idea because if a release is lost and
the response cache is never timed out, then the next PAPWrite will not
be issued by the host: in other words, this can result in a deadlock.

However, with the use of the "lost" release handling outlined above,
this idea becomes workable.
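
To illustrate why the combination matters, here is a small Python
model of a host whose response cache never times out.  The function
name, the event tuples, and the trace are all made up for this
sketch; sequence wraparound is ignored.

```python
def writes_completed(events, implicit_release):
    """Count the PAPWrites a host completes when its cache never expires.

    events: ("send_data", seq) and ("release", seq) tuples as seen by the
    host.  A write completes only when its cache entry is freed.
    """
    outstanding = None  # sequence number of the cached response, if any
    completed = 0
    for kind, seq in events:
        if kind == "release" and seq == outstanding:
            outstanding, completed = None, completed + 1
        elif kind == "send_data":
            if (outstanding is not None and implicit_release
                    and seq == outstanding + 1):
                # The new SendData stands in for the lost Release.
                outstanding, completed = None, completed + 1
            if outstanding is None:
                outstanding = seq  # host responds; entry enters the cache
            # Otherwise the host is still waiting on the old entry.
    return completed


# The Release for sequence 1 is lost; the LaserWriter then asks for 2.
trace = [("send_data", 1), ("send_data", 2)]
print(writes_completed(trace, implicit_release=False))  # 0
print(writes_completed(trace, implicit_release=True))   # 1
```

Without the implicit release the first write never completes, which is
the deadlock described above; with it, the host moves on.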

One can envision pathological cases in which the above would result in
a "hung" connection.  For example, it could be that Tickle requests
can make their way through a gateway, but other packets cannot.
Further thought on the pathological conditions that may hold is
required to determine whether this is a safe course of action.  At
present, extending the Release Timeout to several minutes is
considered the most prudent course of action.

- END -