[comp.dcom.sys.cisco] Terminal server hangs

jmr1@Ra.MsState.Edu (Mike Rackley) (05/15/91)

For months we have been chasing the following problem.  We had it with
our original csc2 processor and are having it with our recent csc3
upgrade.  We are running 8.2(2) software, but we saw the same problem
with previous versions of the software.

The problem is that output coming back to dial up terminals through the 
terminal server will hang for anywhere from 30 seconds to 2 minutes at a 
time.  Once output resumes, it will run fine for varying lengths of time
and then hang again.  On bad days, a user might see such hangs every few
minutes.  The problem occurs regardless of which host the terminal user
is telnetted to.  Users have complained of it when logged in to Sun
SPARCserver 490's, Vax 780's and UNISYS 1100's.  The most annoying
manifestation of the problem is while the host is echoing keyboard
input back to the terminal.  The user can continue typing, and the host
will continue accepting the input, but the user won't see his input
echoed back to the terminal until the hang condition clears.

Last week, cisco customer support called and said that another user at
another site had complained of an identical problem.  He finally tracked
it down to the fact that the terminal server ethernet port was connected
to a transceiver that had SQE enabled.  When SQE was disabled, the
hanging problem went away.  We checked our transceiver and SQE was
disabled.  For good measure, we switched to another transceiver with SQE
disabled and the hangs stayed with us.

Has anyone out there seen a similar problem?  And more importantly, does
anybody have a clue as to what might be going on and how to fix it?


Mike Rackley, Mississippi State University
Internet: jmr1@CC.MsState.Edu    Bitnet: JMR1@MsState
Phone:    (601)325-7028          FAX:    (601)325-8921

MAP@lcs.mit.edu (Michael A. Patton) (05/15/91)

   From: jmr1@Ra.MsState.Edu (Mike Rackley)
   Date: Tue, 14 May 91 15:28:44 CDT

   For months we have been chasing the following problem.  The problem
   is that output coming back to dial up terminals through the
   terminal server will hang for anywhere from 30 seconds to 2 minutes
   at a time.

We have seen exactly this same problem and are also trying to track it
down.  We are presently running 8.1(25) on that system, but saw it 5
months ago with 7.1(7).  I'm not sure but it seems to be correlated
with heavy load on the terminal side, not the Ethernet side.  It has
become a problem now during two consecutive final exam periods.  We
thought we'd finally tracked it down last time, but I guess it was
just end-of-finals.  It came back this month to haunt us again.

We have 24 active dialups running at a fixed 19.2 kbps rate between the
TS and the modem using hardware flow-control to compensate for speed
mismatch in the calling modem.  Some of the lines are being used for
upload/download, but most are regular terminal sessions.

            __
  /|  /|  /|  \         Michael A. Patton, Network Manager
 / | / | /_|__/         Laboratory for Computer Science
/  |/  |/  |atton       Massachusetts Institute of Technology

Disclaimer: The opinions expressed above are a figment of the phosphor
on your screen and do not represent the views of MIT, LCS, or MAP. :-)

chen@cunixf.cc.columbia.edu (Bill Chen) (05/15/91)

We've had and are still having the problem. We run csc/2s with
8.1 (14) software. I don't think we've ever had it hang for up
to 2 minutes, but definitely 30 seconds.

Bill Chen
Columbia University

lim@slc6.INS.CWRU.Edu (Hock Koon Lim) (05/15/91)

  We have the same problem too.  We are running Version 7.1(9) with csc-2 
processor.  There are 80 modem connections on the terminal server.


 

-- 
Hock-Koon Lim, Information Network services
Case Western Reserve University; Cleveland, Ohio, USA  44106   
(216) 368-2982        lim@ins.cwru.edu

deschape@UDAVXB.OCA.UDAYTON.EDU (05/15/91)

We've experienced similar delays, but on the order of 10-15 seconds 
rather than the longer delays you mention.  I've always chalked it
up to high traffic on the net, or delay from the hosts, but now
that you've asked I'd also be interested in hearing from others
with similar experiences, and possible solutions.
+-------------------------------------------------------------------------+
|  Barb Deschapelles, University of Dayton,                               |
|  Office for Computing Activities                                        |
|  Bitnet: DESCHAPE@DAYTON    Internet: deschape@udavxb.oca.udayton.edu   |
|  VOICE: 513 - 229-4040      FAX:  513 - 229-4000                        |
+-------------------------------------------------------------------------+

josevela@mtecv2.mty.itesm.mx (Jose Angel Vela Avila) (05/15/91)

 We have the same problem !!!
 But sometimes terminals never come back !!!! until Terminal Server Reboot ....

 Bye.


Jose A. Vela A.
josevela@mtecv2.mty.itesm.mx

ric@optima.UUCP (Ric Anderson) (05/16/91)

From article <35090@boulder.Colorado.EDU>, by chen@cunixf.cc.columbia.edu (Bill Chen):
> 
> 
> We've had and are still having the problem. We run csc/2s with
> 8.1 (14) software. I don't think we've ever had it hang for up
> to 2 minutes, but definitely 30 seconds.
> 
> Bill Chen

I've seen two flavors of hangs, completely unrelated.

Flavor 1 involved rlogin connections to Sun systems at SunOS 4.1
and above (4.0.3 did not show the problem).  The terminal session
would hang for 5 seconds, and then continue.

The workaround (which cisco phone support supplied) was to set 
	ip mtu 1064
on the servers.  This completely eliminated the problem.

The 5 second time was a deadman timer on the Suns.  It is possible
that the 30 second hang is due to the timer being longer on
whatever host you have.

Flavor 2 involves all terminals on a server (including the
console).  Basically, a user connected to a system is unaffected,
but the server stops processing connect and disconnects, so
if you log out, your terminal goes dead.  A "send *" does appear
on all terminals (including the dead ones), and you can
get in to do a "send *" via telnet to one of the vty's.

This appears to have been a bad processor card, as it
has not reappeared since we swapped out the cpu.

All of this was under 8.1(14).
Ric (ric@cs.arizona.edu <Ric Anderson>)

William "Chops" Westfield <BILLW@mathom.cisco.com> (05/16/91)

    We've had and are still having the problem. We run csc/2s with
    8.1 (14) software. I don't think we've ever had it hang for up
    to 2 minutes, but definitely 30 seconds.

This is probably a different bug.  8.1(14) had a bug in the SWS
avoidance code that would cause pauses on large outputs from certain
TCP implementations (notably SUNOS 4.1, but not SUNOS 4.0).  This was
fixed in 8.1(25) and remains fixed in the current release.  There is
also a workaround - lower the ethernet MTU to 1064 or so.  (What
happened was that SUNOS started actually sending packets of 1460 bytes,
instead of 1024 bytes (of data).  The cisco's TCP window is 2144, and
the SWS avoidance bug creeps in when two full packets don't fit in the
window...)
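
The arithmetic behind the MTU workaround can be sketched as follows.  This
is only an illustration: it assumes 40 bytes of IP and TCP headers per
packet (no options), which is consistent with the 1460 and 1024 byte
figures above but is not stated in the post.  The function name is
hypothetical.

```python
IP_TCP_HEADERS = 40  # assumption: 20-byte IP header + 20-byte TCP header, no options


def full_segments_in_window(mtu, window):
    """Return (data bytes per packet, number of full packets that fit in the window)."""
    payload = mtu - IP_TCP_HEADERS
    return payload, window // payload


for mtu in (1500, 1064):
    payload, count = full_segments_in_window(mtu, 2144)
    print(f"MTU {mtu}: {payload} data bytes per packet, "
          f"{count} full packet(s) fit in a 2144-byte window")
```

With the default Ethernet MTU only one full packet fits in the window;
with the MTU lowered to 1064, two do.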

The MAP/Rackley problem is almost certainly different - apparently the
pauses are box wide rather than per connection...

Bill Westfield
cisco Systems.
-------

gumby@pokey.cray.com (Scott Rick) (05/16/91)

Mike,

We here at Cray have also seen the problem you're describing.  We spent
many weeks with the H.P. analyzer, and found nothing conclusive.  We
were told by cisco (we were more than likely the other customer cisco
was referring to) that the problem was from the broadcasts on our backbone
network.  We have now moved all the terminal servers (5) to their own net
and the net is connected to an AGS+. 
We are still seeing the problem.  There is no rhyme or reason to the pauses,
that we can see.
If you figure this one out I would like to hear from you.
---------------------------------------------------------------------------
D. Scott Rick                                  ATT: (612) 683-3111
Cray Research Incorporated                  E-Mail: gumby@cray.com
655E Lone Oak Drive.
Eagan, MN 55121

evan@is.rice.edu (Evan R. Wetstone) (05/16/91)

We have experienced delays as well.  They only appear between our terminal
server and our Sun 4/490.  It does not appear to happen between the terminal
server and any of our slower machines (3/280's, 4/65's, 4/280's).  I can
pretty much reproduce the problem at will.


Analysis with a Sniffer shows that when the terminal server advertises a
relatively small TCP window (800-900 bytes) *AND* the 4/490 actually fills
the window completely, there is about a 5 second delay.  Looks kind of like
this:

Terminal server acks with window size of 870,
Sun responds immediately with 7 bytes
Terminal server acks with window size of 863,
Sun responds 5 seconds later with 863 bytes......


Maybe it is a SunOS bug that appears when it has to fragment a send buffer
to a smaller size than it wants to use?
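
One possible reading of the trace above, offered as a sketch rather than
a diagnosis: a sender that follows the sender-side silly window syndrome
(SWS) avoidance rules of RFC 1122 holds back data when it can send
neither a full-sized segment nor at least half of the largest window it
has seen.  The 1460-byte segment size, the 2144-byte maximum window
(a figure quoted elsewhere in this thread) and the 4000 bytes of queued
data are assumptions for illustration; nothing here is taken from SunOS
or cisco code.

```python
def should_send_now(queued, usable_window, eff_mss, max_send_window,
                    pushed, idle=True):
    """Sender-side SWS avoidance test, after RFC 1122 section 4.2.3.4.

    Returns True if data may go out immediately, False if the sender
    should wait for more window or for its override timer.
    """
    sendable = min(queued, usable_window)
    if sendable >= eff_mss:
        return True                      # a full-sized segment can be sent
    if idle and pushed and queued <= usable_window:
        return True                      # everything queued fits in one send
    if idle and sendable >= max_send_window / 2:
        return True                      # at least half the largest window seen
    return False                         # otherwise hold the data back


# 7 bytes queued, 870-byte window: all queued data fits, so it is sent at once.
print(should_send_now(7, 870, 1460, 2144, pushed=True))

# Hypothetical 4000 bytes queued, 863-byte window: no rule is satisfied.
print(should_send_now(4000, 863, 1460, 2144, pushed=True))
```

In the second case a sender built this way would sit on the data until
its override timer fired and then fill the 863-byte window.  Whether that
is what the Sun was actually doing is not established in this thread.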

--
Evan Wetstone
Network Support
Rice University

dd@ariel.unm.edu (Don Doerner) (05/16/91)

Folks-

Several people in this mail alias are having problems with terminal
servers having degraded response, for example:

<bill chen> We've had and are still having
<bill chen> the problem. We run csc/2s with
<bill chen> 8.1 (14) software. I don't think
<bill chen> we've ever had it hang for up
<bill chen> to 2 minutes, but definitely 30 seconds.

<mike rackley> For months we have been chasing the
<mike rackley> following problem.  We had it with
<mike rackley> our original csc2 processor and are
<mike rackley> having it with our recent csc3
<mike rackley> upgrade.  We are running 8.2(2)
<mike rackley> software, but we saw the same problem
<mike rackley> with previous versions of the software.

<mike rackley> The problem is that output coming back
<mike rackley> to dial up terminals through the 
<mike rackley> terminal server will hang for anywhere
<mike rackley> from 30 seconds to 2 minutes at a 
<mike rackley> time.  Once output resumes, it will
<mike rackley> run fine for varying lengths of time
<mike rackley> and then hang again.  On bad days, a
<mike rackley> user might see such hangs every few ...

If none of your terminal servers are used extensively across wide area
networks, you might want to try the configuration command "no service
nagle".  The nagle algorithm is an algorithm for minimizing the
overhead, and improving general throughput on a TCP/IP network with
slow links - appropriate if you are using a wide area net, but not so
appropriate if you are using a local area net.  We had some problems
like this with our terminal servers, and this turned out to be the
source...
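
For readers unfamiliar with it, the core rule of the Nagle algorithm
(RFC 896) can be sketched in a few lines.  This is a simplified model
with hypothetical names, not the terminal server's implementation: small
data is sent immediately only when nothing previously sent is still
waiting for an acknowledgement.

```python
def nagle_allows_send(pending_bytes, mss, unacked_bytes):
    """Simplified Nagle test: send a full segment at any time, but send a
    small segment only if no earlier data is still unacknowledged."""
    return pending_bytes >= mss or unacked_bytes == 0


# A single keystroke on an otherwise quiet connection goes out at once.
print(nagle_allows_send(pending_bytes=1, mss=1460, unacked_bytes=0))

# The next keystroke is held while the first is still unacknowledged.
print(nagle_allows_send(pending_bytes=1, mss=1460, unacked_bytes=1))

# A full segment is never held back.
print(nagle_allows_send(pending_bytes=1460, mss=1460, unacked_bytes=1))
```

On a slow link this batches keystrokes into fewer packets; on a local
net the holding step mostly adds delay.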

Hope this helps!

MAP@lcs.mit.edu (Michael A. Patton) (05/17/91)

   From: evan@is.rice.edu (Evan R. Wetstone)
   Date: Wed, 15 May 91 10:05:56 CDT

   We have experienced delays as well.  ... between our TS and our Sun 4/490.
	...
   Terminal server acks with window size of 870,
   Sun responds immediately with 7 bytes
   Terminal server acks with window size of 863,
   Sun responds 5 seconds later with 863 bytes......

This isn't what we're seeing.  We see it on connections to at least
half a dozen different systems and the symptomology is very different
from what you describe.  The pauses are substantially longer than 5
seconds (I timed one at a minute and a half) and are most noticed
during interactive echoing as delays in the echo time.

In our case the symptoms seem to have gone away again.  This coincided
pretty closely with end of classes, so I'm getting more convinced that
it's a total box load issue.  I wonder if BillW could comment on how
the system degrades when the total output rate approaches the box's
limit.  Could this affect other lines without as much output?  Might
it cause lost interrupts on the Ethernet interface?  What effect would
THAT have?  Would it just drop an incoming packet or two (which would
cause short delays awaiting retransmit timers) because you didn't pick
them up from the interface in time?  Might it cause packet output to
stop (this was one symptom we thought we noticed last time but didn't
have sufficient data to prove) because you missed an interrupt that
said you could send now?

            __
  /|  /|  /|  \         Michael A. Patton, Network Manager
 / | / | /_|__/         Laboratory for Computer Science
/  |/  |/  |atton       Massachusetts Institute of Technology

remaker@icarus.amd.com (Phillip Remaker) (05/17/91)

ric@optima.UUCP (Ric Anderson) writes:


>Flavor 1 involved rlogin connections to Sun systems at SunOS 4.1
>and above (4.0.3 did not show the problem).  The terminal session
>would hang for 5 seconds, and then continue.

>The workaround (which cisco phone support supplied) was to set 
>	ip mtu 1064
>on the servers.  This completely eliminated the problem.

This hanging problem was fixed in 8.2.  As I upgrade to 8.2 throughout
the company, I am upgrading the MTU's back to 1500.

Also, I have had flowcontrol-related hanging problems like that that 
I am still working on, especially on dial-up connections.



--
Phillip A. Remaker A.M.D. M/S 167 P.O. Box 3453 Sunnyvale, CA 94088-3000
TCP/IP internetworking from hell. DoD #185 remaker@amd.com  408-749-2552   
   Things to do today:  1) Get a clue.  2) Get a job.  3) Get a life.

MAP@lcs.mit.edu (Michael A. Patton) (05/17/91)

   Date: Thu 16 May 91 14:10:18-PDT
   From: William Chops Westfield <BILLW@mathom.cisco.com>

   Very doubtful.  The ethernet drivers (what kind do you have?) ...

We presently have an MCI in the TS.  The previous problem occurred
first on a (really old) 3Com card which was upgraded to the MCI to see
if that fixed it.  [I'm still not CERTAIN it's the same problem, but
the symptoms are very close.]

   The ethernet driver could "hang", but it includes code to detect such
   things, and will reset the interface when it happens.  No one who has
   this problem has had an inappropriate number of interface resets...

Right.  We have 1 interface reset in four weeks covering the time when
the problem occurred.

   If the console responds during these periods, I'd like to get the output
   from show process/interface/tcp DURING the hang.  That might help a lot.

The problems only last for a couple of minutes at most and it takes me
at least 5 minutes to get from my office to the console even if I do
get a timely report.

BILLW@mathom.cisco.com (WilliamChops Westfield) (05/17/91)

   We have experienced delays as well.  ... between our TS and our Sun 4/490.
	...
   Terminal server acks with window size of 870,
   Sun responds immediately with 7 bytes
   Terminal server acks with window size of 863,
   Sun responds 5 seconds later with 863 bytes......

This is the SWS avoidance algorithm bug I mentioned earlier - it tends to
cause short pauses, 5-10 seconds or so.


    This isn't what we're seeing.  We see it on connections to at least
    half a dozen different systems and the symptomology is very different
    from what you describe.  The pauses are substantially longer than 5
    seconds (I timed one at a minute and a half) and are most noticed
    during interactive echoing as delays in the echo time.

And this is apparently something entirely different; something we don't
understand at all yet.


    In our case the symptoms seem to have gone away again.  This coincided
    pretty closely with end of classes, so I'm getting more convinced that
    it's a total box load issue.  I wonder if BillW could comment on how
    the system degrades when the total output rate approaches the box's limit.

In all the performance tests that I've run, the system has degraded
"gracefully" when run past its limit.  That is, all the lines slow down
a little bit, rather than a particular line being starved.  Input is still
processed and sent on to the host, and so on.


    Could this affect other lines without as much output?

Yes, but not catastrophically.  Essentially, the scheduler is a round-robin
sort of thing, so that all other users get some cpu time before the current
user gets to run again.  The IP input process is the only one that runs
at a higher priority - if you send the box a fast stream of IP packets (say,
filling up the window with 1-byte TCP packets), it is conceivable that the
box would be unresponsive during that time.
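
As a toy model of the behaviour described above (an illustration only;
the process names and the one-packet-per-slot accounting are assumptions,
not the real cisco scheduler): terminal-line processes take turns
round-robin, but a higher-priority IP input process runs first whenever
packets are waiting.

```python
from collections import deque


def run_scheduler(line_procs, ip_packets_pending, slots):
    """Return the order in which processes run over `slots` time slots.

    IP input preempts the round-robin whenever packets are pending;
    each IP input slot is assumed to consume one packet.
    """
    ready = deque(line_procs)
    order = []
    for _ in range(slots):
        if ip_packets_pending > 0:
            order.append("ip_input")
            ip_packets_pending -= 1
        else:
            proc = ready.popleft()
            order.append(proc)
            ready.append(proc)       # back of the queue: everyone else runs first
    return order


# No packet backlog: every line gets its turn.
print(run_scheduler(["tty1", "tty2", "tty3"], ip_packets_pending=0, slots=6))

# A burst of four small packets: the lines wait until the burst is drained.
print(run_scheduler(["tty1", "tty2", "tty3"], ip_packets_pending=4, slots=6))
```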


    Might it cause lost interrupts on the Ethernet interface?  What effect
    would THAT have?

Very doubtful.  The ethernet drivers (what kind do you have?) do not assume
that there is only one packet per interrupt, or anything so foolish...


    Would it just drop an incoming packet or two (which would cause short
    delays awaiting retransmit timers) because you didn't pick them up
    from the interface in time?  Might it cause packet output to stop
    (this was one symptom we thought we noticed last time but didn't have
    sufficient data to prove) because you missed an interrupt that said
    you could send now?

The ethernet driver could "hang", but it includes code to detect such
things, and will reset the interface when it happens.  No one who has
this problem has had an inappropriate number of interface resets...

If the console responds during these periods, I'd like to get the output
from show process/interface/tcp DURING the hang.  That might help a lot.

Bill Westfield
cisco Systems.
-------

BILLW@mathom.cisco.com (WilliamChops Westfield) (05/17/91)

    We have seen this type of behavior occasionally on other TCP/IP
    implementations.  We have never completely resolved the problem;
    however, the best guess is that it is related to the way a particular
    TCP/IP implementation handles calculation of expected transmission
    delays and how aggressive it is in performing retries when a
    response is not received in the expected amount of time.

    We have found 3COM telnet terminal server boxes to be relatively
    aggressive in their retransmissions and we don't normally experience
    this phenomenon when telneting to their boxes.  However, we have
    seen these types of delays occasionally when telneting to some
    host based telnet implementations.

Hmmph.  The cisco terminal server's TCP uses your basic Karn/Jacobson
"network-friendly" exponential backoff algorithm, but this should not
result in significant delays unless something else is causing the
packets to be lost...
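
To illustrate why exponential backoff only produces long stalls when
packets are lost repeatedly, here is a sketch of the doubling schedule.
The 1-second initial timeout and 64-second cap are assumed values for
illustration, not figures from the cisco implementation, and the
function name is hypothetical.

```python
def backoff_schedule(initial_rto, retransmissions, cap=64.0):
    """Return the wait before each successive retransmission of the same
    segment, doubling the timeout each time up to `cap` seconds."""
    waits = []
    rto = initial_rto
    for _ in range(retransmissions):
        waits.append(rto)
        rto = min(rto * 2, cap)
    return waits


waits = backoff_schedule(1.0, 6)
print(waits)       # wait before each retransmission, in seconds
print(sum(waits))  # total time stalled if all six attempts are lost
```

A single lost packet costs one timeout; it takes several losses in a row
on the same segment before the accumulated wait reaches the order of a
minute.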


    It is a very hard problem to track down.

It's very frustrating - all the tools are there - the cisco practically
gives you a sniffer in every box, but to debug problems with durations
of <2 minutes, you pretty much have to be sitting at the console waiting
for it to fail.  Most people have better ways to spend their time.

BillW
-------

chris@gargoyle.uchicago.edu (Chris Johnston) (05/20/91)

In article <35086@boulder.Colorado.EDU> you write:
>The problem is that output coming back to dial up terminals through the 
>terminal server will hang for anywhere from 30 seconds to 2 minutes at a 
>time.  
>
>Mike Rackley, Mississippi State University
>Internet: jmr1@CC.MsState.Edu    Bitnet: JMR1@MsState
>Phone:    (601)325-7028          FAX:    (601)325-8921


Hi Mike,

  I'm the guy who had the SQE problem.  However, it was not the
transceiver on the terminal server.  I turned off SQE on the transceiver
attached to our multiport hub.

    | | | | | | |
    -----HUB----- 
             | <- disable SQE on "out" side of hub

  A former colleague of mine who still works at U of Chicago disables
SQE on all his transceivers.

Disabling SQE cleared up the following symptoms on a very small six
node ethernet...
  Character echoing on terminals was falling 8 to 20 characters behind.
  NFS performance between some hosts was poor.
  Remote tape throughput between some hosts was poor (about 30 seconds
per tape block).

  We isolated the problem to one transceiver by partitioning our
network.

  Can anyone explain why disabling SQE cleared up my problems?

  SQE is sometimes called Heartbeat or Jam.

cj


chris@gargoyle.uchicago.edu 
312-786-4889
I work for a company named AM Investors