jmr1@Ra.MsState.Edu (Mike Rackley) (05/15/91)
For months we have been chasing the following problem. We had it with our original csc2 processor and are having it with our recent csc3 upgrade. We are running 8.2(2) software, but we saw the same problem with previous versions of the software. The problem is that output coming back to dial up terminals through the terminal server will hang for anywhere from 30 seconds to 2 minutes at a time. Once output resumes, it will run fine for varying lengths of time and then hang again. On bad days, a user might see such hangs every few minutes. The problem occurs regardless of which host the terminal user is telnetted to. Users have complained of it when logged in to Sun SPARCserver 490's, Vax 780's and UNISYS 1100's. The most annoying manifestation of the problem is while the host is echoing keyboard input back to the terminal. The user can continue typing, and the host will continue accepting the input, but the user won't see his input echoed back to the terminal until the hang condition clears. Last week, cisco customer support called and said that another user at another site had complained of an identical problem. He finally tracked it down to the fact that the terminal server ethernet port was connected to a transceiver that had SQE enabled. When SQE was disabled, the hanging problem went away. We checked our transceiver and SQE was disabled. For good measure, we switched to another transceiver with SQE disabled and the hangs stayed with us. Has anyone out there seen a similar problem? And more importantly, does anybody have a clue as to what might be going on and how to fix it? Mike Rackley, Mississippi State University Internet: jmr1@CC.MsState.Edu Bitnet: JMR1@MsState Phone: (601)325-7028 FAX: (601)325-8921
MAP@lcs.mit.edu (Michael A. Patton) (05/15/91)
   From: jmr1@Ra.MsState.Edu (Mike Rackley)
   Date: Tue, 14 May 91 15:28:44 CDT

> For months we have been chasing the following problem. The problem is
> that output coming back to dial up terminals through the terminal server
> will hang for anywhere from 30 seconds to 2 minutes at a time.

We have seen exactly this same problem and are also trying to track it down. We are presently running 8.1(25) on that system, but saw it 5 months ago with 7.1(7). I'm not sure but it seems to be correlated with heavy load on the terminal side, not the Ethernet side. It has become a problem now during two consecutive final exam periods. We thought we'd finally tracked it down last time, but I guess it was just end-of-finals. It came back this month to haunt us again.

We have 24 active dialups running at a fixed 19.2KB rate between the TS and the modem using hardware flow-control to compensate for speed mismatch in the calling modem. Some of the lines are being used for upload/download, but most are regular terminal sessions.

            __
  /|  /|  /|  \         Michael A. Patton, Network Manager
 / | / | /_|__/         Laboratory for Computer Science
/  |/  |/  |atton       Massachusetts Institute of Technology

Disclaimer: The opinions expressed above are a figment of the phosphor on your screen and do not represent the views of MIT, LCS, or MAP. :-)
chen@cunixf.cc.columbia.edu (Bill Chen) (05/15/91)
We've had and are still having the problem. We run csc/2s with 8.1 (14) software. I don't think we've ever had it hang for up to 2 minutes, but definitely 30 seconds. Bill Chen Columbia University
lim@slc6.INS.CWRU.Edu (Hock Koon Lim) (05/15/91)
We have the same problem too. We are running Version 7.1(9) with csc-2 processor. There are 80 modem connections on the terminal server. -- Hock-Koon Lim, Information Network services Case Western Reserve University; Cleveland, Ohio, USA 44106 (216) 368-2982 lim@ins.cwru.edu
deschape@UDAVXB.OCA.UDAYTON.EDU (05/15/91)
We've experienced similar delays, but on the order of 10-15 seconds rather than the longer delays you mention. I've always chalked it up to high traffic on the net, or delay from the hosts, but now that you've asked I'd also be interested in hearing from others with similar experiences, and possible solutions. +-------------------------------------------------------------------------+ | Barb Deschapelles, University of Dayton, | | Office for Computing Activities | | Bitnet: DESCHAPE@DAYTON Internet: deschape@udavxb.oca.udayton.edu | | VOICE: 513 - 229-4040 FAX: 513 - 229-4000 | +-------------------------------------------------------------------------+
josevela@mtecv2.mty.itesm.mx (Jose Angel Vela Avila) (05/15/91)
We have the same problem! But sometimes the terminals never come back until the terminal server is rebooted. Bye. Jose A. Vela A. josevela@mtecv2.mty.itesm.mx
ric@optima.UUCP (Ric Anderson) (05/16/91)
From article <35090@boulder.Colorado.EDU>, by chen@cunixf.cc.columbia.edu (Bill Chen):
> We've had and are still having the problem. We run csc/2s with
> 8.1 (14) software. I don't think we've ever had it hang for up
> to 2 minutes, but definitely 30 seconds.
>
> Bill Chen

I've seen two flavors of hangs, completely unrelated.

Flavor 1 involved rlogin connections to Sun systems at SunOS 4.1 and above (4.0.3 did not show the problem). The terminal session would hang for 5 seconds, and then continue. The workaround (which cisco phone support supplied) was to set
    ip mtu 1064
on the servers. This completely eliminated the problem. The 5 second time was a deadman timer on the Suns. It is possible that the 30 second hang is due to the timer being longer on whatever host you have.

Flavor 2 involves all terminals on a server (including the console). Basically, a user connected to a system is unaffected, but the server stops processing connects and disconnects, so if you log out, your terminal goes dead. A "send *" does appear on all terminals (including the dead ones), and you can get in to do a "send *" via telnet to one of the vty's. This appears to have been a bad processor card, as it has not reappeared since we swapped out the cpu.

All of this was under 8.1(14).

Ric (ric@cs.arizona.edu <Ric Anderson>)
William "Chops" Westfield <BILLW@mathom.cisco.com> (05/16/91)
> We've had and are still having the problem. We run csc/2s with
> 8.1 (14) software. I don't think we've ever had it hang for up
> to 2 minutes, but definitely 30 seconds.

This is probably a different bug. 8.1(14) had a bug in the SWS avoidance code that would cause pauses on large outputs from certain TCP implementations (notably SUNOS 4.1, but not SUNOS 4.0). This was fixed in 8.1(25) and remains fixed in the current release. There is also a workaround - lower the ethernet MTU to 1064 or so. (What happened was that SUNOS started actually sending packets of 1460 bytes, instead of 1024 bytes (of data). The cisco's TCP window is 2144, and the SWS avoidance bug creeps in when two full packets don't fit in the window...)

The MAP/Rackley problem is almost certainly different - apparently the pauses are box wide rather than per connection...

Bill Westfield
cisco Systems.
-------
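The window arithmetic in the post above can be checked with a few lines of Python. This is only a sketch of the numbers quoted in the thread (a 2144-byte window, a 1500-byte default Ethernet MTU, and the 1064 workaround); the 40-byte header overhead assumes plain 20-byte IP and 20-byte TCP headers with no options, and the function names are made up for illustration.

```python
# Assumption: 20-byte IP header + 20-byte TCP header, no options.
IP_TCP_HEADERS = 40
# TCP window size quoted in the post for the cisco terminal server.
WINDOW = 2144


def max_segment_data(mtu):
    """Bytes of TCP data that fit in one packet at the given MTU."""
    return mtu - IP_TCP_HEADERS


def full_segments_in_window(mtu, window=WINDOW):
    """How many full-sized segments fit in the advertised window."""
    return window // max_segment_data(mtu)


for mtu in (1500, 1064):
    print(mtu, max_segment_data(mtu), full_segments_in_window(mtu))
```

With a 1500-byte MTU the sender's full segment carries 1460 bytes and only one fits in the window; at 1064 the segment drops to 1024 bytes and two fit, which is the condition the post says avoids the bug.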
gumby@pokey.cray.com (Scott Rick) (05/16/91)
Mike, We here at Cray have also seen the problem you're describing. We spent many weeks with the H.P. analyzer, and found nothing conclusive. We were told by cisco (we were more than likely the other customer cisco was referring to) that the problem was from the broadcasts on our backbone network. We have now moved all the terminal servers (5) to their own net and the net is connected to an AGS+. We are still seeing the problem. There is no rhyme or reason to the pauses, that we can see. If you figure this one out I would like to hear from you. --------------------------------------------------------------------------- D. Scott Rick ATT: (612) 683-3111 Cray Research Incorporated E-Mail: gumby@cray.com 655E Lone Oak Drive. Eagan, MN 55121
evan@is.rice.edu (Evan R. Wetstone) (05/16/91)
We have experienced delays as well. They only appear between our terminal server and our Sun 4/490. It does not appear to happen between the terminal server and any of our slower machines (3/280's, 4/65's, 4/280's). I can pretty much reproduce the problem at will.

Analysis with a Sniffer shows that when the terminal server advertises a relatively small TCP window (800-900 bytes) *AND* the 4/490 actually fills the window completely, there is about a 5 second delay. Looks kind of like this:

   Terminal server acks with window size of 870,
   Sun responds immediately with 7 bytes
   Terminal server acks with window size of 863,
   Sun responds 5 seconds later with 863 bytes......

Maybe it is a SunOS bug that appears when it has to fragment a send buffer to a smaller size than it wants to use?

-- Evan Wetstone Network Support Rice University
dd@ariel.unm.edu (Don Doerner) (05/16/91)
Folks-

Several people in this mail alias are having problems with terminal servers having degraded response, for example:

<bill chen> We've had and are still having
<bill chen> the problem. We run csc/2s with
<bill chen> 8.1 (14) software. I don't think
<bill chen> we've ever had it hang for up
<bill chen> to 2 minutes, but definitely 30 seconds.

<mike rackley> For months we have been chasing the
<mike rackley> following problem. We had it with
<mike rackley> our original csc2 processor and are
<mike rackley> having it with our recent csc3
<mike rackley> upgrade. We are running 8.2(2)
<mike rackley> software, but we saw the same problem
<mike rackley> with previous versions of the software.
<mike rackley> The problem is that output coming back
<mike rackley> to dial up terminals through the
<mike rackley> terminal server will hang for anywhere
<mike rackley> from 30 seconds to 2 minutes at a
<mike rackley> time. Once output resumes, it will
<mike rackley> run fine for varying lengths of time
<mike rackley> and then hang again. On bad days, a
<mike rackley> user might see such hangs every few
...

If none of your terminal servers are used extensively across wide area networks, you might want to try the configuration command "no service nagle". The Nagle algorithm reduces per-packet overhead and improves general throughput on a TCP/IP network with slow links - appropriate if you are using a wide area net, but not so appropriate if you are using a local area net. We had some problems like this with our terminal servers, and this turned out to be the source...

Hope this helps!
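For readers unfamiliar with what "service nagle" controls, the core of Nagle's rule fits in one small function. The following Python is a simplified sketch of the textbook rule, not the cisco implementation: it ignores the receiver's window and any timers, and the function and parameter names are hypothetical.

```python
def nagle_can_send(queued_bytes, mss, unacked_bytes):
    """Return True if a sender following Nagle's rule would transmit now.

    queued_bytes  -- data waiting to be sent
    mss           -- maximum segment size for the connection
    unacked_bytes -- data already sent but not yet acknowledged
    """
    if queued_bytes <= 0:
        return False           # nothing to send
    if queued_bytes >= mss:
        return True            # a full segment is always allowed out
    # A small segment goes out only when nothing is outstanding;
    # otherwise it waits for the ACK and is coalesced with later data.
    return unacked_bytes == 0


# One keystroke with an earlier keystroke still unacknowledged is held back.
print(nagle_can_send(queued_bytes=1, mss=1460, unacked_bytes=1))
```

On a slow link this coalescing saves many tiny packets; on a LAN the wait for the ACK mostly shows up as added latency on small writes, which is the trade-off the post describes.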
MAP@lcs.mit.edu (Michael A. Patton) (05/17/91)
   From: evan@is.rice.edu (Evan R. Wetstone)
   Date: Wed, 15 May 91 10:05:56 CDT

> We have experienced delays as well. ... between our TS and our Sun 4/490. ...
>    Terminal server acks with window size of 870,
>    Sun responds immediately with 7 bytes
>    Terminal server acks with window size of 863,
>    Sun responds 5 seconds later with 863 bytes......

This isn't what we're seeing. We see it on connections to at least half a dozen different systems and the symptomology is very different from what you describe. The pauses are substantially longer than 5 seconds (I timed one at a minute and a half) and are most noticed during interactive echoing as delays in the echo time.

In our case the symptoms seem to have gone away again. This coincided pretty closely with end of classes, so I'm getting more convinced that it's a total box load issue. I wonder if BillW could comment on how the system degrades when the total output rate approaches the box's limit. Could this affect other lines without as much output? Might it cause lost interrupts on the Ethernet interface? What effect would THAT have? Would it just drop an incoming packet or two (which would cause short delays awaiting retransmit timers) because you didn't pick them up from the interface in time? Might it cause packet output to stop (this was one symptom we thought we noticed last time but didn't have sufficient data to prove) because you missed an interrupt that said you could send now?

            __
  /|  /|  /|  \         Michael A. Patton, Network Manager
 / | / | /_|__/         Laboratory for Computer Science
/  |/  |/  |atton       Massachusetts Institute of Technology
remaker@icarus.amd.com (Phillip Remaker) (05/17/91)
ric@optima.UUCP (Ric Anderson) writes:
>Flavor 1 involved rlogin connections to Sun systems at SunOS 4.1
>and above (4.0.3 did not show the problem). The terminal session
>would hang for 5 seconds, and then continue.
>The workaround (which cisco phone support supplied) was to set
>    ip mtu 1064
>on the servers. This completely eliminated the problem.

This hanging problem was fixed in 8.2. As I upgrade to 8.2 throughout the company, I am raising the MTUs back to 1500.

Also, I have had similar flow-control-related hanging problems that I am still working on, especially on dial-up connections.

-- Phillip A. Remaker A.M.D. M/S 167 P.O. Box 3453 Sunnyvale, CA 94088-3000 TCP/IP internetworking from hell. DoD #185 remaker@amd.com 408-749-2552 Things to do today: 1) Get a clue. 2) Get a job. 3) Get a life.
MAP@lcs.mit.edu (Michael A. Patton) (05/17/91)
   Date: Thu 16 May 91 14:10:18-PDT
   From: William Chops Westfield <BILLW@mathom.cisco.com>

> Very doubtful. The ethernet drivers (what kind do you have?) ...

We presently have an MCI in the TS. The previous problem occurred first on a (really old) 3Com card which was upgraded to the MCI to see if that fixed it. [I'm still not CERTAIN it's the same problem, but the symptoms are very close.]

> The ethernet driver could "hang", but it includes code to detect such
> things, and will reset the interface when it happens. No one who has
> this problem has had an inappropriate number of interface resets...

Right. We have 1 interface reset in four weeks covering the time when the problem occurred.

> If the console responds during these periods, I'd like to get the output
> from show process/interface/tcp DURING the hang. That might help a lot.

The problems only last for a couple of minutes at most and it takes me at least 5 minutes to get from my office to the console even if I do get a timely report.
BILLW@mathom.cisco.com (WilliamChops Westfield) (05/17/91)
> We have experienced delays as well. ... between our TS and our Sun 4/490. ...
>    Terminal server acks with window size of 870,
>    Sun responds immediately with 7 bytes
>    Terminal server acks with window size of 863,
>    Sun responds 5 seconds later with 863 bytes......

This is the SWS avoidance algorithm bug I mentioned earlier - it tends to cause short pauses, 5-10 seconds or so.

> This isn't what we're seeing. We see it on connections to at least half
> a dozen different systems and the symptomology is very different from
> what you describe. The pauses are substantially longer than 5 seconds
> (I timed one at a minute and a half) and are most noticed during
> interactive echoing as delays in the echo time.

And this is apparently something entirely different; something we don't understand at all yet.

> In our case the symptoms seem to have gone away again. This coincided
> pretty closely with end of classes, so I'm getting more convinced that
> it's a total box load issue. I wonder if BillW could comment on how the
> system degrades when the total output rate approaches the box's limit.

In all the performance tests that I've run, the system has degraded "gracefully" when run past its limit. That is, all the lines slow down a little bit, rather than a particular line being starved. Input is still processed and sent on to the host, and so on.

> Could this affect other lines without as much output?

Yes, but not catastrophically. Essentially, the scheduler is a round-robin sort of thing, so that all other users get some cpu time before the current user gets to run again. The IP input process is the only one that runs at a higher priority - if you send the box a fast stream of IP packets (say, filling up the window with 1-byte TCP packets), it is conceivable that the box would be unresponsive during that time.

> Might it cause lost interrupts on the Ethernet interface? What effect
> would THAT have?

Very doubtful. The ethernet drivers (what kind do you have?) do not assume that there is only one packet per interrupt, or anything so foolish...

> Would it just drop an incoming packet or two (which would cause short
> delays awaiting retransmit timers) because you didn't pick them up from
> the interface in time? Might it cause packet output to stop (this was
> one symptom we thought we noticed last time but didn't have sufficient
> data to prove) because you missed an interrupt that said you could send
> now.

The ethernet driver could "hang", but it includes code to detect such things, and will reset the interface when it happens. No one who has this problem has had an inappropriate number of interface resets...

If the console responds during these periods, I'd like to get the output from show process/interface/tcp DURING the hang. That might help a lot.

Bill Westfield
cisco Systems.
-------
BILLW@mathom.cisco.com (WilliamChops Westfield) (05/17/91)
> We have seen this type of behavior occasionally on other TCP/IP
> implementations. We have never completely resolved the problem; however,
> the best guess is that it is related to the way a particular TCP/IP
> implementation handles calculation of expected transmission delays and
> how aggressive it is in performing retries when a response is not
> received in the expected amount of time. We have found 3COM telnet
> terminal server boxes to be relatively aggressive in their
> retransmissions and we don't normally experience this phenomenon when
> telnetting to their boxes. However, we have seen these types of delays
> occasionally when telnetting to some host based telnet implementations.

Hmmph. The cisco terminal server's TCP uses your basic Karn/Jacobson "network-friendly" exponential backoff algorithm, but this should not result in significant delays unless something else is causing the packets to be lost...

> It is a very hard problem to track down.

It's very frustrating - all the tools are there - the cisco practically gives you a sniffer in every box, but to debug problems with durations of <2 minutes, you pretty much have to be sitting at the console waiting for it to fail. Most people have better ways to spend their time.

BillW
-------
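As background for the Karn/Jacobson remark above, here is a minimal Python sketch of that style of retransmission timer. It is not the cisco code: the gains (1/8 and 1/4), the `srtt + 4 * rttvar` formula, the initial timeout, and the cap are the commonly published textbook values, and the class and function names are invented for illustration.

```python
class RtoEstimator:
    """Jacobson-style RTT smoothing with Karn-style exponential backoff."""

    def __init__(self, initial_rto=3.0, max_rto=64.0):
        self.srtt = None        # smoothed round-trip time (seconds)
        self.rttvar = None      # smoothed mean deviation of the RTT
        self.rto = initial_rto  # current retransmission timeout
        self.max_rto = max_rto

    def on_ack(self, rtt_sample, was_retransmitted):
        # Karn: an ACK for a retransmitted segment is ambiguous, so its
        # timing is not used to update the estimate.
        if was_retransmitted:
            return
        if self.srtt is None:
            self.srtt = rtt_sample
            self.rttvar = rtt_sample / 2
        else:
            err = rtt_sample - self.srtt
            self.srtt += err / 8
            self.rttvar += (abs(err) - self.rttvar) / 4
        self.rto = min(self.srtt + 4 * self.rttvar, self.max_rto)

    def on_timeout(self):
        # Exponential backoff: double the timer on every retransmission.
        self.rto = min(self.rto * 2, self.max_rto)


def total_wait(first_rto, losses, max_rto=64.0):
    """Seconds spent waiting if one segment is lost `losses` times in a row."""
    est = RtoEstimator(initial_rto=first_rto, max_rto=max_rto)
    waited = 0.0
    for _ in range(losses):
        waited += est.rto
        est.on_timeout()
    return waited


print(total_wait(first_rto=1.0, losses=5))
```

The doubling is why the backoff alone is harmless on a healthy path but adds up quickly once several consecutive packets are lost, which matches the point that something else would have to be dropping packets for the timer to matter.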
chris@gargoyle.uchicago.edu (Chris Johnston) (05/20/91)
In article <35086@boulder.Colorado.EDU> you write:
>The problem is that output coming back to dial up terminals through the
>terminal server will hang for anywhere from 30 seconds to 2 minutes at a
>time.
>
>Mike Rackley, Mississippi State University
>Internet: jmr1@CC.MsState.Edu Bitnet: JMR1@MsState
>Phone: (601)325-7028 FAX: (601)325-8921

Hi Mike,

I'm the guy who had the SQE problem. However, it was not the transceiver on the terminal server. I turned off SQE on the transceiver attached to our multiport hub.

 | | | | | | |
 -----HUB-----
       |   <- disable SQE on "out" side of hub

A former colleague of mine who still works at U of Chicago disables SQE on all his transceivers.

Disabling SQE cleared up the following symptoms on a very small six node ethernet...

Character echoing on terminals was falling 8 to 20 characters behind.
NFS performance between some hosts was poor.
Remote tape throughput between some hosts was poor (about 30 seconds per tape block).

We isolated the problem to one transceiver by partitioning our network.

Can anyone explain why disabling SQE cleared up my problems? SQE is sometimes called Heartbeat or Jam.

cj
chris@gargoyle.uchicago.edu 312-786-4889
I work for a company named AM Investors