[comp.protocols.tcp-ip] Why do TCP connections hang?

wisner@hayes.fai.alaska.edu (Bill Wisner) (11/30/89)

I sent a message to this august assemblage recently complaining that FTP
on our entire campus has an unpleasant habit of hanging in the middle of
transfers. It now turns out that the problem is much more general. Any
TCP connection may hang under certain unknown circumstances.

Background: we have two subnets. The main subnet is populated by a VMS
machine (with TWG WINS TCP) and many PCs and Macs running NCSA Telnet.
It is connected with a Proteon router to the world at large. The other
subnet has another VAX (with TWG) and a slew of Sun workstations; it
is connected by a second Proteon router to the main subnet. TCP links
from any campus machine to any machine off campus are prone to failure.
(The same is true of connections from off-campus to on-.) 

I've used SunOS's etherfind utility to try to find the problem. I have
hex dumps of an rlogin connection and an FTP transfer, both of which
hung. (A certain message in my mailbox on a remote machine will cause my
connection to hang every time if I merely read it. It contains no 
obvious clues -- it's just another normal looking message.) I could find
nothing in the hex dumps. No packets were truncated; one packet arrived
and the next one didn't.

The hung connections seem to be data-specific. If I try transferring the
same file twice, both FTP connections are likely to hang at the same
point in the file. As an exercise, I took a file and split it into several
small chunks, then tried transferring it. The connection still hung at the
same point in the fragmented file.

I have come to believe that perhaps the problem resides in the Proteon
box that connects us to the Internet-at-large. It is, after all, the one
common factor shared by all campus machines. But I am at a loss to figure
out just what the problem might be.

Any ideas or suggestions would be greatly appreciated.

<wisner@hayes.fai.alaska.edu>

meggers@orion.oac.uci.edu (mark eggers) (12/01/89)

Bill,
     One thing that may be biting you is a particular bit pattern. If you
have marginal connections to the rest of the world, or if your
CSU/DSUs are a bit flaky, they can mangle certain bit patterns. We
had this happen with a connection on CERFnet from here at UCI to
SWRL (Cal State University net). PacBell and Brian Roode (another
member of our network team) found the flaky link between Seal
Beach and Long Beach. I think that they found it by doing extensive
BERTs (bit error rate tests).
     Since you seem to be on the track of this problem, you might try 
to artificially generate the bit pattern in a file (a quick and dirty C 
program), and then do a binary FTP to another system. If the FTP 
hangs, then you have some cause to suspect that bit pattern. You 
might then want the circuit provider to run a BERT using the
suspected pattern and watch for errors.
     At that point, you can start replacing components to bring the
error rate down.
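
A minimal sketch of the "quick and dirty C program" mentioned above, for
anyone who wants a starting point. The function name write_pattern_file
and its parameters are illustrative assumptions, not anything from the
thread. It writes a caller-supplied byte sequence to a file a given
number of times, in binary mode, so that the file can then be sent with
a binary FTP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Write `repeats` copies of the byte sequence `pat` (of length `pat_len`)
 * to the file `path`, opened in binary mode so no bytes are translated.
 * Returns 0 on success, -1 on any I/O error.
 * (Hypothetical helper for illustration only.)
 */
int write_pattern_file(const char *path, const unsigned char *pat,
                       size_t pat_len, unsigned long repeats)
{
    FILE *fp;
    unsigned long i;

    assert(pat != NULL && pat_len > 0);

    fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    for (i = 0; i < repeats; i++) {
        if (fwrite(pat, 1, pat_len, fp) != pat_len) {
            fclose(fp);
            return -1;
        }
    }

    /* fclose flushes buffered data; a failure here means lost bytes. */
    return fclose(fp) == 0 ? 0 : -1;
}
```

The suspect pattern in this case is not yet known, so the bytes passed in
are up to the tester. For example, calling write_pattern_file("suspect.bin",
pat, sizeof pat, 4096) with pat holding the bytes copied from the point
where a transfer hung would produce a file that exercises that sequence
repeatedly.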

Of course, this is just a guess (with 2 hours of sleep at that ;-) ).

Good luck - Mark Eggers, Network Communications Analyst
            University of California, Irvine

email:      meggers@uci.edu

henry@utzoo.uucp (Henry Spencer) (12/03/89)

In article <8911301020.AA14662@hayes.fai.alaska.edu> wisner@hayes.fai.alaska.edu (Bill Wisner) writes:
>The hung connections seem to be data-specific. If I try transferring the
>same file twice, both FTP connections are likely to hang at the same
>point in the file. As an exercise, I took a file and split it into several
>small chunks, then tried transferring it. The connection still hung at the
>same point in the fragmented file.

You should probably pursue this further to pin down the exact data pattern
that causes the problem.  This might yield some insight.
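
One possible way to narrow the pattern down, sketched under assumptions:
copy a byte range out of the file that hangs, transfer just that range,
and keep halving the range that still hangs. The function name
copy_range and its arguments are hypothetical; this is only an
illustration of the bisection idea, not a tested diagnostic tool.

```c
#include <stdio.h>

/*
 * Copy up to `length` bytes, starting at byte `offset` of `src_path`,
 * into `dst_path`. Both files are opened in binary mode.
 * Returns the number of bytes copied (which may be less than `length`
 * if the source ends first), or -1 on error.
 * (Hypothetical helper for illustration only.)
 */
long copy_range(const char *src_path, const char *dst_path,
                long offset, long length)
{
    FILE *in, *out;
    long copied = 0;
    int c;

    if (offset < 0 || length < 0)
        return -1;

    in = fopen(src_path, "rb");
    if (in == NULL)
        return -1;
    if (fseek(in, offset, SEEK_SET) != 0) {
        fclose(in);
        return -1;
    }

    out = fopen(dst_path, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    while (copied < length && (c = getc(in)) != EOF) {
        if (putc(c, out) == EOF) {
            fclose(in);
            fclose(out);
            return -1;
        }
        copied++;
    }

    if (ferror(in)) {
        fclose(in);
        fclose(out);
        return -1;
    }
    fclose(in);
    if (fclose(out) != 0)
        return -1;
    return copied;
}
```

If the transfer of a piece still hangs, the offending bytes lie inside
that piece; extract its first half and second half and try each again.
Once a piece is small enough to read in a hex dump, the pattern itself
should be visible.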
-- 
Mars can wait:  we've barely   |     Henry Spencer at U of Toronto Zoology
started exploring the Moon.    | uunet!attcan!utzoo!henry henry@zoo.toronto.edu