cck@cunixc.columbia.EDU (Charlie C. Kim) (03/18/88)
Here's my analysis of how to handle the -1101 (no release received) error. The implicit release and extended timeouts will both be present in the next release of papif/pap. Comments are welcome (though you should be warned that I feel very strongly about returning the noRel (-1101) error when the response cache timeout occurs).

Charlie C. Kim
User Services
Columbia University

The effects of the ATP No Release Errors on PAP: revisited in detail

Author: Charlie C. Kim, User Services, Columbia University, March 1988

We assume in the following that the reader understands the basic operation of the AppleTalk Transaction Protocol and the Printer Access Protocol.

Abstract
--------

An ATP exactly-once (XO) transaction defines the concept of a Release packet that causes a responding node to delete a response cache entry kept to retransmit packets. This cache entry may also be deleted by a timeout occurring (by specification) 30 seconds after the last request for transmission or retransmission. The problem occurs when the Release packet is either lost or delayed beyond the response cache timeout. In the absence of other information, the connection must be torn down when this event occurs, both because the responder cannot know whether all response packets have been correctly received by the requester and to prevent a deadlock. These problems were showing up with regularity when using the Printer Access Protocol to communicate from a Unix host running CAP/KIP to a LaserWriter.

Getting around the problems
---------------------------

The workaround to the problem of delayed packets, posted in bug fix 14 for CAP 4.0 with patches 5, 6, and 7, was to extend the ATP response cache timeout.
This is not a solution because it is still possible to delete the response cache entry before the transaction has completed; however, the probability of this occurring has been reduced drastically, especially since a primary culprit was the time "compression" that can occur when running the protocol with a printing LaserWriter and that prevents it from responding in a "reasonable" amount of time ("reasonable" as judged by the Unix host based upon the protocol-specified timeouts).

In conducting the ATP "responder" transactions for PAP, it must be pointed out that you must return an error and close down the connection if no "release" has been received and a response cache timeout on a particular transaction occurs. If you do not, you cannot be sure that the remote side has received all the data (for that transaction). Furthermore, you open up the possibility of a deadlock. The deadlock may occur if the requester (e.g. a LaserWriter doing a PAPRead) is sending the ATP request with infinite retry in exactly-once mode and, for some reason, is unable to ask for retransmission of missed packets inside the response cache timeout interval, so that the cache entry is removed by the responder. The requester (PAPRead/LaserWriter) will continue requesting the data, but will never get it because the responder has already sequenced to the next transaction: if it hasn't, then it isn't really timing out the response cache.

A lost Release packet is a slightly different story. It is actually possible to detect this situation when ATP is being run by PAP. PAP is a sequenced protocol with single ATP transactions outstanding (it is sufficient to note that neither side is allowed to have more than one PAPRead outstanding per connection). A PAPWrite on a host sending data to a LaserWriter works as follows.
The LaserWriter must have issued a PAPRead that sends a PAP "SendData" ATP Request with a particular PAP sequence number <n>, whereupon the host (responder) sends a response with the data, placing it into the cache, etc. If the release from the LaserWriter to the host gets lost, it is possible for the host to resynchronize. Simply note that if it receives a SendData request with PAP sequence <n+1> (within the response cache timeout), then it can, with impunity, cancel the last response and continue with the protocol because all the data in the response must have been received by the LaserWriter (requester). One knows this because the LaserWriter cannot legitimately issue a new PAPRead (send a new SendData request) until all the responses to the previous PAPRead/"SendData request" have been received by the LaserWriter. In other words, the new SendData is an implicit release of the previous transaction. There are indications that this may have been done by Apple in their implementation of PAP (cf. Release a RspCB, Page VII-13, Inside AppleTalk, June 1986, Apple Comp.).

OTHER IDEAS
-----------

Dan Lanciani of Harvard suggests the following. By not deleting the response cache entry on a release timeout (one must delete it by hand in this case), or by never letting it time out (setting the timeout to infinity), the problem of delayed responses disappears. Basically, one assumes that the responses will eventually be requested and that massive or continuing disruptions in network communications or shutdown of the remote will be signaled by the Tickling mechanism.

One must be careful with this idea because if a release is lost and the response cache is never timed out, then the next PAPWrite will not be issued by the host: in other words, this can result in a deadlock. However, with the use of the "lost" release handling outlined above, this idea becomes workable. One can envision pathological cases in which the above would result in a "hung" connection.
For example, it could be that Tickle requests can make their way through a gateway while other packets cannot. Further thought on the pathological conditions that may hold is required to determine whether or not this is a safe course of action. At present, extending the Release Timeout to several minutes is considered the most prudent course of action.

- END -
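The two responder rules argued for above (return noRel and close the connection when the response cache times out; treat a SendData with sequence <n+1> as an implicit release) can be sketched roughly as follows. This is a minimal illustration in Python, not the CAP papif code: the class and method names (PapResponder, respond, on_send_data, on_tick) and the NO_ERR value are invented for the example, sequence number wraparound is ignored, and only the -1101 code and the 30-second timeout come from the text.

```python
from dataclasses import dataclass
from typing import Optional

NO_REL = -1101         # "no release received" error code named in the post
NO_ERR = 0             # assumed success value (hypothetical)
CACHE_TIMEOUT = 30.0   # seconds after the last (re)transmission request


@dataclass
class CachedResponse:
    pap_seq: int         # PAP sequence number <n> of the SendData answered
    data: bytes          # response data kept for retransmission
    last_request: float  # time of the last request or retransmission request


class PapResponder:
    """Hypothetical host-side state for one PAP connection.

    PAP allows only one PAPRead outstanding per connection, so a single
    cache slot is enough for this sketch.
    """

    def __init__(self, timeout: float = CACHE_TIMEOUT) -> None:
        self.timeout = timeout
        self.cache: Optional[CachedResponse] = None
        self.open = True

    def respond(self, pap_seq: int, data: bytes, now: float) -> None:
        # Answer SendData <n> and keep the response in the cache.
        self.cache = CachedResponse(pap_seq, data, now)

    def on_release(self) -> int:
        # Explicit ATP Release: the normal way the cache entry goes away.
        self.cache = None
        return NO_ERR

    def on_send_data(self, pap_seq: int, now: float) -> str:
        if not self.open:
            return "closed"
        entry = self.cache
        if entry is None:
            return "new"
        if pap_seq == entry.pap_seq:
            # Retransmission request: restart the cache timer.
            entry.last_request = now
            return "retransmit"
        if pap_seq == entry.pap_seq + 1:
            # The requester may not issue <n+1> until it holds all of <n>,
            # so this request acts as an implicit release of <n>.
            self.cache = None
            return "implicit-release"
        return "ignored"

    def on_tick(self, now: float) -> int:
        # Timeout with no release seen: drop the entry, close, report noRel.
        entry = self.cache
        if entry is not None and now - entry.last_request >= self.timeout:
            self.cache = None
            self.open = False
            return NO_REL
        return NO_ERR
```

Under the extended-timeout workaround, the only change in this sketch would be constructing PapResponder with a timeout of several minutes instead of 30 seconds.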