[comp.protocols.tcp-ip] ftp hang while trying to talk to cu20b.columbia.edu

david@ms.uky.edu (David Herron -- Resident E-mail Hack) (02/17/88)

I'm experiencing some strange hangs while attempting to ftp files
from cu20b.  Specifically, the pattern is that I make the connection,
log in as anonymous and david@ms.uky.edu as the password.  Then I
start transferring some data (either a directory listing or an
actual get command) and after some short time (approximately 15K
worth of data has been transmitted) the transfer hangs and stops
doing things.

Running tcpdump to watch things and telling it to restrict itself
to the ftp port, sure enough the packets simply stop going by
after a time.

The local machines I have tried this from are an 11/750 with deuna
running mtxinu 4.3+nfs (we're one MR behind the current version),
uVaxII's with DEQNA's and the same OS, Sun 3/something running 3.4,
and uVaxStation2000's with whatever the ether hardware is and
the next-to-current version of Ultrix.  Our sequent didn't
know how to reach that network so I wasn't able to try from
that machine.

Using ping, I get response times of about 250 ms at best, 2500 ms
at worst, and 600-700 ms on average, with packet lossage somewhere
between 5% and 30% depending on the tides, or some other somewhat
unrelated/unknown effect.  I'm doing these transfers at night,
most recently at 3AM on a Saturday morning.

Our net is an ethernet attached to a proteon p4200 ip router which
is the gateway to suranet/nsfnet.  From there we go through some
gateway in wash dc (maryland?) to get to net 10.

One thing we've noticed recently on our net is something which may
be rwho related broadcast storms.  I haven't traced this down yet,
but will be doing some tracing on this in the next couple of days.
One thing I notice from ping is that occasionally there are
bursts of packets being lost and surrounding the lost packets
are long ping times.

Um, some hard facts gathered with ping.  I watched for about 5 or 6
minutes just now.  Every minute like clockwork we'd experience some
packet loss, and generally for the next 20-30 seconds the ping
time would be much worse than the rest of the time.  But there was
enough variance in the details of what happened, as well as variation
in other parts of the minute, that it's not completely clear what
is happening.  I did see one extreme example in that time where
we lost about 40 packets and didn't have ANY pings come back for
about 30-40 seconds.

This does seem to be extreme lossage ... however, isn't tcp supposed
to be able to handle loss of packets and do re-transmissions?  Why
is tcp getting wedged?  Or possibly is it something in the way that ftp
works?  (I must admit that I know very little about how ftp operates
other than it runs using two tcp channels, one for data transfers
and the other for commands).

Any ideas?
-- 
<---- David Herron -- The E-Mail guy            <david@ms.uky.edu>
<---- or:                {rutgers,uunet,cbosgd}!ukma!david, david@UKMA.BITNET
<----
<---- It takes more than a good memory to have good memories.

sy.Ken@CU20B.COLUMBIA.EDU (Ken Rossman) (02/18/88)

  I'm experiencing some strange hangs while attempting to ftp files from
  cu20b.  Specifically, the pattern is that I make the connection, log
  in as anonymous and david@ms.uky.edu as the password.  Then I start
  transferring some data (either a directory listing or an actual get
  command) and after some short time (approximately 15K worth of data
  has been transmitted) the transfer hangs and stops doing things...

You are far from the only site having this problem with FTP to CU20B.
We're getting lots of complaints about it.  As near as I can tell, we're
running out of IP free space, after which all of our daemons get
very confused and stop working properly.  No one is really working on this
problem here because CU20B's lifetime is only a few more months, but if
anyone happens to have collected up some pertinent IP free space manager
monitor patches, and they don't look too complex to put in, I guess I'll
take a crack at them.

At least this is what I *think* is going on here...  /Ken
-------

pavlov@hscfvax.harvard.edu (G.Pavlov) (02/18/88)

In article <8377@g.ms.uky.edu>, david@ms.uky.edu (David Herron -- Resident E-mail Hack) writes:
> I'm experiencing some strange hangs while attempting to ftp files
> from cu20b.  Specifically, the pattern is that I make the connection,
> log in as anonymous and david@ms.uky.edu as the password.  Then I
> start transferring some data (either a directory listing or an
> actual get command) and after some short time (approximately 15K
> worth of data has been transmitted) the transfer hangs and stops
> doing things.

  According to the Kermit folks at Columbia, there are known problems at the
  cu20b end.  I experienced the same symptoms for several months (trying to
  get thru on a variety of machines) and finally gave up.  Used KERMSRV instead.
   
   greg pavlov, fstrf, amherst, ny

WWB.TYM@OFFICE-1.ARPA (Bill Barns) (02/19/88)

Readers not interested in the guts of TCP implementation might as well skip 
this message.

I've had to muck about with Tenex TCP which is "related" to TOPS-20 TCP, and 
has much worse constraints with buffer space due to being part of a single 
section monitor.  Some of what I've done to try to cope with free storage 
problems may be relevant to your monitor, but only you can tell for sure.  I 
think there must be a jillion locally-hacked subflavors of this TCP code, and 
who knows how much resemblance remains between yours and mine.  I can say that 
I do have a copy of "DEC's source" as of about 2.5 years ago and it seems to 
have the same problems which I'm about to describe, so maybe you have them too.

Refer to the TCP packetizer near PKTZ10 (in source file TCPPZ or TCPTCP).  The 
call on TCPIPK will nonskip-return if you are indeed out of space.  Code in a 
literal tries to queue you to retry but as I understand the code, there's a 
problem.  Your TCB is not queued anywhere at this instant, but TSFP or TSEP is 
very probably on (else why are you here?)  So ENCPKT and/or DLAYPZ will 
effectively no-op and you're out of the packetizer without being queued 
anywhere.  Any future Force or Encourage will meet the same fate because of 
those same bits.  You're trapped in the Twilight Zone.  Cure: SETZRO 
<TSFP,TSEP>,(TCB) as the first thing in the literal that calls ENCPKT after 
TCPIPK failure.  If this scenario happened, it would be likely to yield the 
symptoms described by David Herron; but so might other things.
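
For readers who don't read MACRO, the trap described above can be modeled in
a few lines of Python.  This is only a toy sketch, not the monitor code: the
class names, the single queue, and the assumption that the flags are cleared
on a successful send are all invented for illustration.  Only the names
TSFP, TSEP, ENCPKT and TCPIPK come from the description above.

```python
from collections import deque


class TCB:
    """Toy connection block; tsfp/tsep stand in for the TSFP/TSEP bits."""

    def __init__(self, name):
        self.name = name
        self.tsfp = False  # "force packetizer" request pending
        self.tsep = False  # "encourage packetizer" request pending


class Packetizer:
    """Hypothetical model of the packetizer queue, not the real TCPPZ."""

    def __init__(self, free_buffers, clear_flags_on_failure):
        self.queue = deque()
        self.free_buffers = free_buffers
        self.clear_flags_on_failure = clear_flags_on_failure
        self.sent = []

    def encourage(self, tcb):
        # Modeled on ENCPKT as described: a set flag is taken to mean
        # "already queued", so the call quietly does nothing.
        if tcb.tsfp or tcb.tsep:
            return
        tcb.tsep = True
        self.queue.append(tcb)

    def run_once(self):
        if not self.queue:
            return
        tcb = self.queue.popleft()      # off the queue, flags still set
        if self.free_buffers == 0:      # buffer request fails (TCPIPK nonskip)
            if self.clear_flags_on_failure:
                tcb.tsfp = tcb.tsep = False   # the proposed cure
            self.encourage(tcb)         # try to queue a retry
            return
        self.free_buffers -= 1
        tcb.tsfp = tcb.tsep = False     # assumption: cleared on success
        self.sent.append(tcb.name)


def demo(clear_flags_on_failure):
    pz = Packetizer(free_buffers=0,
                    clear_flags_on_failure=clear_flags_on_failure)
    tcb = TCB("ftp-data")
    pz.encourage(tcb)
    pz.run_once()          # fails: out of space
    pz.free_buffers = 4    # space comes back later
    pz.encourage(tcb)      # a later Encourage
    pz.run_once()
    return pz.sent
```

In this model demo(False) never sends anything, even after space returns,
because the retry was never queued and every later Encourage sees the stale
flag.  demo(True) queues the retry and the data goes out once there is room.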

I made several changes to the TCPPRC routine, a little too long to list here.
Basically they are: don't run a free-storage scavenge more than once a second,
so as not to hog the CPU; and don't run TVTOPR on any pass that did a scavenge,
in hopes of making fewer and bigger Telnet packets.
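
The shape of those two changes can be sketched as below.  This is a rough
Python illustration under stated assumptions, not the TCPPRC code: the class,
the explicit "now" argument, and the log strings are hypothetical, and
"telnet-output" merely stands in for whatever TVTOPR does.

```python
class ProcessLoop:
    """Toy stand-in for a TCPPRC-style background pass (hypothetical)."""

    def __init__(self, min_scavenge_interval=1.0):
        self.min_interval = min_scavenge_interval  # seconds between scavenges
        self.last_scavenge = None
        self.log = []

    def run_pass(self, now, low_on_space):
        """Run one pass at time `now`; return True if it scavenged."""
        scavenged = False
        due = (self.last_scavenge is None
               or now - self.last_scavenge >= self.min_interval)
        if low_on_space and due:
            self.last_scavenge = now
            scavenged = True
            self.log.append("scavenge")
        if not scavenged:
            # Skip Telnet output on a scavenging pass so characters
            # accumulate into fewer, larger packets.
            self.log.append("telnet-output")
        return scavenged
```

Calling run_pass at times 0.0, 0.25, 0.5 and 1.0 while short of space
scavenges only on the first and last pass; the two passes in between do
Telnet output instead.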

It's better to avoid running out of space in the first place, even if that 
takes something drastic.  With an 1822 interface it's absolutely crucial not to
let the input interrupt level run out of buffers, so as to avoid RFNM-related 
deadlocks.  Solution: Never give Internet the "last" input buffer.  Put it back
on the input buffer list after processing the 1822 leader.  I suspect this 
isn't your problem though, since your addresses are class B/C, thus probably 
not 1822.
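
The "never give away the last input buffer" rule might look like this in
outline.  It is a minimal Python sketch with made-up names (InputBufferPool,
receive, release), assuming a simple free list.  The point it tries to show
is that the leader is always read, so the input side never stalls, while the
data is dropped when handing the buffer up would leave nothing to read into.

```python
class InputBufferPool:
    """Hypothetical input-buffer free list for an interrupt-level reader."""

    def __init__(self, count):
        self.free = list(range(count))  # buffer ids available for input
        self.handed_up = []             # buffers owned by the Internet layer
        self.dropped = 0

    def receive(self):
        """Handle one arriving message; return True if passed upward."""
        buf = self.free.pop()
        # (the leader would be examined here, before the decision below)
        if not self.free:
            # buf was the last input buffer: recycle it and drop the data
            # instead of handing it to the Internet layer.
            self.free.append(buf)
            self.dropped += 1
            return False
        self.handed_up.append(buf)
        return True

    def release(self, buf):
        """The Internet layer is done with buf; return it to the free list."""
        self.handed_up.remove(buf)
        self.free.append(buf)
```

With three buffers, the first two messages are passed up and every later one
is dropped until a buffer is released; the free list never goes empty.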

It would help to have some idea of what most of your space is being used for 
when you run out.  Absent specific data, I'd suspect huge retransmit queues 
caused by big windows and slow gateways between you and the FTPers.  You can 
brute-force cope with this somewhat either by clamping received windows, or by 
finagling the packetizer to refuse to packetize for any connection that has 
more than n packets on the RX queue, or where the first packet on the RX queue 
has actually been retransmitted (a quick test for congestion).  This will slow 
things down, but that's what you need to do when you're short of space.  You 
can condition this code on INTFSP being less than some threshold and shove it 
into the PKTZ10 area too.
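
As a sketch of the tests described above (Python for readability, with
hypothetical names and parameters; free_space plays the role of INTFSP and
the retransmit queue is reduced to a list of per-packet retransmit counts,
oldest first):

```python
def clamp_window(advertised_window, limit):
    """Use at most `limit` bytes of whatever window the peer offers."""
    return min(advertised_window, limit)


def may_packetize(free_space, rx_queue, *, space_threshold, max_rx_packets):
    """Decide whether to build another packet for one connection.

    rx_queue holds one retransmit count per unacknowledged packet,
    oldest first.  All thresholds are illustrative knobs.
    """
    if free_space >= space_threshold:
        return True                 # plenty of space: no throttling
    if len(rx_queue) > max_rx_packets:
        return False                # too much already awaiting ACK
    if rx_queue and rx_queue[0] > 0:
        return False                # oldest packet was resent: congestion
    return True
```

For example, with space_threshold=1000 and max_rx_packets=4, a connection
whose oldest unacknowledged packet has already been retransmitted once is
held off while free space is 500, but runs normally at 5000.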

If you have a lot of TVT (Telnet) tinygram traffic, you might want to add code 
in this same area to ask TCPIPK for only the size of buffer you need, rather 
than a max size buffer, when space is below the threshold.  Also in the OPSCAN 
routine (TTTVDV or TTANDV source file?) around OPSCA1+10 or so, just after the 
JUMPE T3,OPSCA7 you might add JN TSEP,(TCB),OPSCA7 which will prevent this 
routine from undoing any delay previously imposed by some other routine.  
Further down in this same routine you should also have a change published by 
Westfield and Crispin about 2-3 years ago which includes a test on whether the 
RX queue is empty.  This change is mainly performance-oriented but will save 
free storage too in some situations.
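
The buffer-sizing idea for tinygrams can be put in one small function.  This
is a Python sketch only; the function name, the parameters, and the example
sizes are assumptions, not values taken from the monitor.

```python
def buffer_request_size(payload_len, header_len, max_size,
                        free_space, space_threshold):
    """Return how many bytes to ask the buffer allocator for.

    When free space is below the threshold, ask only for what this
    packet needs; otherwise keep asking for a max-size buffer.
    """
    if free_space < space_threshold:
        return min(header_len + payload_len, max_size)
    return max_size
```

A one-character Telnet packet with 40 bytes of headers would then take 41
bytes instead of a full 576-byte buffer whenever space is tight.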

These cover the highlights of things I've done that seem relevant.  You can 
talk bits with me further if you're interested, of course.  I wanted to post 
this much in case it stirs up any comments from TOPS-20 hackers out there.  
Maybe someone out there has already done these changes in a form that will 
slide directly into CU20B monitor.  -b