david@ms.uky.edu (David Herron -- Resident E-mail Hack) (02/17/88)
I'm experiencing some strange hangs while attempting to ftp files from cu20b. Specifically, the pattern is that I make the connection, log in as anonymous and david@ms.uky.edu as the password. Then I start transferring some data (either a directory listing or an actual get command) and after some short time (approximately 15K worth of data has been transmitted) the transfer hangs and stops doing things. Running tcpdump to watch things and telling it to restrict itself to the ftp port, sure enough the packets simply stop going by after a time.

The local machines I have tried this from are an 11/750 with deuna running mtxinu 4.3+nfs (we're one MR behind the current version), uVaxII's with DEQNA's and the same OS, Sun 3/something running 3.4, and uVaxStation2000's with whatever the ether hardware is and the next-to-current version of Ultrix. Our sequent didn't know how to reach that network so I wasn't able to try from that machine.

Using ping, I get response times like 250 ms at best, 2500 ms at worst, and 600-700 ms avg, with packet lossage somewhere between 5% and 30% depending on the tides, or some other somewhat unrelated/unknown effect. I'm doing these transfers at night, most recently at 3AM on a saturday morning.

Our net is an ethernet attached to a proteon p4200 ip router which is the gateway to suranet/nsfnet. From there we go through some gateway in wash dc (maryland?) to get to net 10.

One thing we've noticed recently on our net is something which may be rwho related broadcast storms. I haven't traced this down yet, but will be doing some tracing on this in the next couple of days.

One thing I notice from ping is that occasionally there are bursts of packets being lost and surrounding the lost packets are long ping times.

um, some hard facts gathered with ping. I watched for about 5 or 6 minutes just now.
Every minute like clockwork we'd experience some packet loss, and generally for the next 20-30 seconds the ping time would be much worse than the rest of the time. But there was enough variance in the details of what happened, as well as variation in other parts of the minute, that it's not completely clear what is happening. I did see one extreme example in that time where we lost about 40 packets and didn't have ANY pings come back for about 30-40 seconds.

This does seem to be extreme lossage ... however, isn't tcp supposed to be able to handle loss of packets and do re-transmissions? Why is tcp getting wedged? Or possibly is it something in the way that ftp works? (I must admit that I know very little about how ftp operates other than it runs using two tcp channels, one for data transfers and the other for commands).

Any ideas?

--
<---- David Herron -- The E-Mail guy <david@ms.uky.edu>
<---- or: {rutgers,uunet,cbosgd}!ukma!david, david@UKMA.BITNET
<----
<---- It takes more than a good memory to have good memories.
sy.Ken@CU20B.COLUMBIA.EDU (Ken Rossman) (02/18/88)
> I'm experiencing some strange hangs while attempting to ftp files
> from cu20b. Specifically, the pattern is that I make the connection,
> log in as anonymous and david@ms.uky.edu as the password. Then I
> start transferring some data (either a directory listing or an
> actual get command) and after some short time (approximately 15K
> worth of data has been transmitted) the transfer hangs and stops
> doing things...

You are by far not the only site having this problem with FTP to CU20B. We're getting lots of complaints about it. Near as I can tell, what we're doing is running out of IP free space, after which, all of our daemons get very confused and stop working properly.

No one is really working on this problem here because CU20B's lifetime is only a few more months, but if anyone happens to have collected up some pertinent IP free space manager monitor patches, and they don't look too complex to put in, I guess I'll take a crack at them.

At least this is what I *think* is going on here... /Ken
-------
pavlov@hscfvax.harvard.edu (G.Pavlov) (02/18/88)
In article <8377@g.ms.uky.edu>, david@ms.uky.edu (David Herron -- Resident E-mail Hack) writes:
> I'm experiencing some strange hangs while attempting to ftp files
> from cu20b. Specifically, the pattern is that I make the connection,
> log in as anonymous and david@ms.uky.edu as the password. Then I
> start transferring some data (either a directory listing or an
> actual get command) and after some short time (approximately 15K
> worth of data has been transmitted) the transfer hangs and stops
> doing things.

According to the Kermit folks at Columbia, there are known problems at the cu20b end. I experienced the same symptoms for several months (trying to get thru on a variety of machines) and finally gave up. Used KERMSRV instead.

greg pavlov, fstrf, amherst, ny
WWB.TYM@OFFICE-1.ARPA (Bill Barns) (02/19/88)
Readers not interested in the guts of TCP implementation might as well skip this message.

I've had to muck about with Tenex TCP which is "related" to TOPS-20 TCP, and has much worse constraints with buffer space due to being part of a single section monitor. Some of what I've done to try to cope with free storage problems may be relevant to your monitor, but only you can tell for sure. I think there must be a jillion locally-hacked subflavors of this TCP code, and who knows how much resemblance remains between yours and mine. I can say that I do have a copy of "DEC's source" as of about 2.5 years ago and it seems to have the same problems which I'm about to describe, so maybe you have them too.

Refer to the TCP packetizer near PKTZ10 (in source file TCPPZ or TCPTCP). The call on TCPIPK will nonskip-return if you are indeed out of space. Code in a literal tries to queue you to retry but as I understand the code, there's a problem. Your TCB is not queued anywhere at this instant, but TSFP or TSEP is very probably on (else why are you here?) So ENCPKT and/or DLAYPZ will effectively no-op and you're out of the packetizer without being queued anywhere. Any future Force or Encourage will meet the same fate because of those same bits. You're trapped in the Twilight Zone.

Cure: SETZRO <TSFP,TSEP>,(TCB) as the first thing in the literal that calls ENCPKT after TCPIPK failure. If this scenario happened, it would be likely to yield the symptoms described by David Herron; but so might other things.

I made several changes to TCPPRC routine, a little too long to list here. Basically they are: not to run a free-storage scavenge more than once a second, so as not to hog the CPU; and don't run TVTOPR on any pass that did a scavenge, in hopes of making fewer and bigger Telnet packets.

It's better to avoid running out of space in the first place, even if that takes something drastic.
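For readers who don't have the monitor sources handy, the packetizer wedge near PKTZ10 can be modelled in a few lines. This is only a toy sketch of the scenario as described in this message, not the TOPS-20 or Tenex code: the class and method names are invented for illustration, a single "tsep" flag stands in for the TSFP/TSEP bits, and "encourage" is assumed to do nothing when the flag is already set.

```python
from collections import deque


class TCB:
    """Toy connection block; 'tsep' stands in for the TSFP/TSEP bits."""

    def __init__(self, name):
        self.name = name
        self.tsep = False


class Packetizer:
    """Hypothetical model of the packetizer queue, not the real monitor code."""

    def __init__(self, clear_bits_on_failure):
        self.queue = deque()
        self.free_buffers = 0
        self.clear_bits_on_failure = clear_bits_on_failure
        self.sent = []

    def encourage(self, tcb):
        # Assumption: a set bit is taken to mean "already queued", so no-op.
        if tcb.tsep:
            return
        tcb.tsep = True
        self.queue.append(tcb)

    def run_once(self):
        if not self.queue:
            return
        tcb = self.queue.popleft()  # TCB is now on no queue, bit still set
        if self.free_buffers == 0:  # stands in for the buffer request failing
            if self.clear_bits_on_failure:
                tcb.tsep = False    # the cure: clear the bit before requeueing
            self.encourage(tcb)
            return
        self.free_buffers -= 1
        tcb.tsep = False
        self.sent.append(tcb.name)


def simulate(clear_bits_on_failure):
    p = Packetizer(clear_bits_on_failure)
    tcb = TCB("ftp-data")
    p.encourage(tcb)
    p.run_once()        # no buffers: the retry path is taken
    p.free_buffers = 4  # free space comes back later
    p.encourage(tcb)    # a later Encourage from some other routine
    p.run_once()
    return p.sent
```

In this model, simulate(False) never sends anything even after space returns, because the stale bit makes every later encourage a no-op; simulate(True) requeues the connection and the transfer resumes.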
With an 1822 interface it's absolutely crucial not to let the input interrupt level run out of buffers, so as to avoid RFNM-related deadlocks. Solution: Never give Internet the "last" input buffer. Put it back on the input buffer list after processing the 1822 leader. I suspect this isn't your problem though, since your addresses are class B/C, thus probably not 1822.

It would help to have some idea of what most of your space is being used for when you run out. Absent specific data, I'd suspect huge retransmit queues caused by big windows and slow gateways between you and the FTPers. You can brute-force cope with this somewhat either by clamping received windows, or by finagling the packetizer to refuse to packetize for any connection that has more than n packets on the RX queue, or where the first packet on the RX queue has actually been retransmitted (a quick test for congestion). This will slow things down, but that's what you need to do when you're short of space. You can condition this code on INTFSP being less than some threshold and shove it into the PKTZ10 area too.

If you have a lot of TVT (Telnet) tinygram traffic, you might want to add code in this same area to ask TCPIPK for only the size of buffer you need, rather than a max size buffer, when space is below the threshold.

Also in the OPSCAN routine (TTTVDV or TTANDV source file?) around OPSCA1+10 or so, just after the JUMPE T3,OPSCA7 you might add JN TSEP,(TCB),OPSCA7 which will prevent this routine from undoing any delay previously imposed by some other routine. Further down in this same routine you should also have a change published by Westfield and Crispin about 2-3 years ago which includes a test on whether the RX queue is empty. This change is mainly performance-oriented but will save free storage too in some situations.

These cover the highlights of things I've done that seem relevant. You can talk bits with me further if you're interested, of course.
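The low-space throttling suggested above (refuse to packetize when the retransmit queue is long or its head has already been retransmitted, and ask for smaller buffers) might look roughly like the sketch below. Everything here is an illustrative assumption: the function and field names are invented, "free_space" plays the role of INTFSP, and the threshold and limit values are left to the caller because the message does not give any.

```python
from dataclasses import dataclass, field


@dataclass
class QueuedPacket:
    size: int
    retransmit_count: int = 0  # how many times this packet has been resent


@dataclass
class Connection:
    # Retransmit ("RX") queue, oldest unacknowledged packet first.
    rx_queue: list = field(default_factory=list)


def may_packetize(conn, free_space, low_space_threshold, max_rx_packets):
    """Decide whether to build another packet for this connection."""
    if free_space >= low_space_threshold:
        return True   # plenty of space: no throttling at all
    if len(conn.rx_queue) > max_rx_packets:
        return False  # too much already awaiting acknowledgement
    if conn.rx_queue and conn.rx_queue[0].retransmit_count > 0:
        return False  # head of queue was retransmitted: path looks congested
    return True


def buffer_request_size(needed, max_size, free_space, low_space_threshold):
    """Ask for only what is needed when space is short, else a full buffer."""
    if free_space < low_space_threshold:
        return min(needed, max_size)
    return max_size
```

With space below the threshold, a connection whose oldest queued packet has been retransmitted is held back until that packet is acknowledged, which trades throughput for free storage in the way described.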
I wanted to post this much in case it stirs up any comments from TOPS-20 hackers out there. Maybe someone out there has already done these changes in a form that will slide directly into CU20B monitor. -b