[comp.protocols.tcp-ip] TCP and long pauses

jc@minya.UUCP (John Chambers) (03/13/91)
While investigating a curious problem on some DECstations and AIX
systems connected via some modems and a SLIP link, I ran across
a curious feature (?) of TCP that perhaps someone here can explain.

The problem was that, although the link seemed up and alive, and
all UDP-based applications (ping, tftp, NFS, among others) seemed
to work just fine, all TCP-based applications were subject to long
pauses.  There seemed to be a 4K size somehow involved, and roughly
a 2-minute timeout.

One scenario is illustrative:  We put line monitors on both serial
ports, and did an rlogin across the link.  So far, so good.  We 
cd'ed to a big directory and typed "ls -l".  The listing started
to appear in the window, ran for a few pages, and halted in its
tracks.  Looking at the monitor, we could see that about 4K more
data had been delivered than had appeared on the screen.  After
about 2 minutes, the last 4K was retransmitted, and everything
ran fine for a while, until it stopped again.

In reading through several documents on the subject (including Comer's
famous book), it eventually struck me that perhaps this behavior is
implicit in TCP.  All the texts describe a sliding window protocol,
in which the ACKs include the number of bytes correctly received.
Suppose that the first packet in the window is garbled, and all the
rest get through OK.  The receiver can't send any ACKs, because it
hasn't correctly received the first packet in the window.  So it sits,
hoping that the sender will remember to send that first packet.  After
a while, the sender times out the ACK, and retransmits the first packet,
which is dutifully ACKed by the receiver.  But meanwhile, the sender
sees that the second packet has also timed out, and retransmits it,
etc.  Given the speeds of processors and networks, it is pretty much
guaranteed that the entire window will be queued for retransmission
before an ACK is received.  It's unlikely that many implementations 
have a method for deleting ACKed packets from the send queue, so they 
all get sent.
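
To make the cumulative-ACK reasoning above concrete, here is a small
Python sketch of the receiver side only.  The function name
cumulative_acks and the use of segment indices are invented for
illustration (real TCP acknowledges byte sequence numbers), and the
sketch assumes the receiver keeps segments that arrive out of order,
which may or may not be true of the implementations in question.

```python
def cumulative_acks(arrivals, window):
    """Return the ACK value sent after each arriving segment.

    arrivals: segment indices (0-based) in the order they arrive intact.
    window:   number of segments in the window.
    The ACK value is the index of the next segment the receiver expects,
    i.e. every segment below it has been received in order.
    """
    received = [False] * window
    next_expected = 0
    acks = []
    for seg in arrivals:
        received[seg] = True
        # Advance past every contiguous segment we now hold.
        while next_expected < window and received[next_expected]:
            next_expected += 1
        acks.append(next_expected)
    return acks


# Segment 0 is garbled; 1..3 arrive intact, then 0 is retransmitted.
print(cumulative_acks([1, 2, 3, 0], window=4))  # prints [0, 0, 0, 4]
```

Under that buffering assumption the ACK stays stuck at 0 until the
first segment is retransmitted, and then jumps to cover the whole
window at once.  Whether the sender has already queued the rest for
retransmission by then is a separate question that this sketch does
not model.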

Comer and others also describe somewhat vaguely a dynamic scheme for
adjusting the timeout interval.  It seems that on a slow link such as
a 9600-baud modem, a 2-minute timeout is a fairly reasonable value for
this scheme to produce.  This implies that such long pauses are part
of TCP's normal behaviour, and there's probably not much I can do to
correct the problem, which pretty much makes most TCP-based tools
unusable across a SLIP link.  (It's amusing that NFS+cp works, but
ftp doesn't.)
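
For reference, the adaptive timeout can be sketched using the example
formula given in RFC 793: a smoothed round-trip time (SRTT) and a
retransmission timeout (RTO) clamped between a lower and upper bound.
The function name rto_sequence, the choice of seeding SRTT with the
first sample, and the RTT samples below are my own assumptions; the
default constants are the example values suggested in the RFC, and a
given kernel may well use a different estimator or add backoff on
repeated retransmissions, which is not modelled here.

```python
def rto_sequence(rtt_samples, alpha=0.875, beta=2.0, lbound=1.0, ubound=60.0):
    """Return the RTO (in seconds) computed after each RTT sample.

    Follows the RFC 793 example:
        SRTT = alpha * SRTT + (1 - alpha) * RTT
        RTO  = min(ubound, max(lbound, beta * SRTT))
    Seeding SRTT with the first sample is an assumption of this sketch.
    """
    srtt = rtt_samples[0]
    rtos = []
    for rtt in rtt_samples:
        srtt = alpha * srtt + (1.0 - alpha) * rtt
        rtos.append(min(ubound, max(lbound, beta * srtt)))
    return rtos


# Hypothetical samples: two quick round trips, then two slow ones
# (roughly what a 4K burst over a 9600-baud line might look like).
print(rto_sequence([0.5, 0.5, 4.5, 4.5]))  # prints [1.0, 1.0, 2.0, 2.875]
```

With these example constants the RTO can never exceed the 60-second
upper bound, so a 2-minute stall would have to come from something
beyond this formula alone, such as different bounds or backoff in the
particular implementation.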

Any comments?  I'd sure like to hear that I'm wrong.  Even better,
I'd like to hear about tools that might exist (or which I might be
able to implement with the right information) to correct the problem.

For instance, I'm not at all shy about opening /dev/kmem, lseeking to
some place that nlist() has told me about, and writing a new value in
the appropriate bytes.  If there's a way to dig out the kernel's values
that control the timeout, I can certainly zap them.  But so far my
perusing TFM and /usr/include/*/*.h hasn't turned up any hints as
to where the timeout value might be hidden.  I suspect that it isn't
actually stored in one place, but is an implicit value that is some
function of several variables.  Not having access to the source is
somewhat limiting at this point...

(Responding via email might be helpful, as I'm about 200 articles
behind in this newsgroup.  ;-)

-- 
All opinions Copyright (c) 1991 by John Chambers.  Inquire for licensing at:
Home: 1-617-484-6393 
Work: 1-508-486-5475
Uucp: ...!{bu.edu,harvard.edu,ima.com,eddie.mit.edu,ora.com}!minya!jc