[comp.sys.encore] telnet/rlogin "bouncing"

dcourte@eve.wright.edu (Dale Courte) (02/07/90)

Have any of you experienced the following when attempting to telnet to
your Multimax?:

   Trying...
   Connected to xxxxxx
   Escape character is '^]'.

   Umax 4.2 (xxxxxx)

   Login: Connection closed by foreign host.

The "connection closed" comes up immediately after the word login
appears. It is impossible to log in.

If you rlogin, the login is accomplished (provided the machine you are
rlogging in from is in your .rhosts), the message of the day is printed,
the system prompt appears, then BANG, you get "connection closed".

Only a "call" from an annex server will get you logged in at this point.
Most of our users on campus don't use an annex.

Here's what I have found through countless encounters with this very
annoying problem:

   1) It is the next available pseudoterminal which seems to be the
      problem. For instance, if the problem exists and I do a 'w' and
      see:
          
            3:54pm  up 1 day, 18:50,  13 users,  load average: 1.25 1.93 2.67
          User     tty       login@  idle   JCPU   PCPU  what
          cse4a031 ttyp0    11:11am         2:03   1:41  ksh
          cse4a028 ttyp1     3:49pm            7      2  ksh
          cse1001  ttyp3     3:54pm            2         ksh
          cse6028  ttyp4     3:51pm           11      2  ksh

      the problem will persist until either the user on ttyp0 or ttyp1
      logs off. Then another user can log in. Then the problem
      returns. I concluded from this behavior that the problem lies
      with the pseudoterminal ttyp2 (or whichever one is next in line;
      it is not always the same one).

   2) If I peruse the output of 'ps -aux' when the problem occurs, I can
      always find a process "hanging around" assigned to that
      pseudoterminal. Sometimes multiple processes, other times just
      the telnetd or rlogind belonging to root.

   3) If the processes assigned to the pseudoterminal are killed (which,
      of course, is not always advisable in an environment where
      faculty research jobs are often running in the background) the
      problem goes away.

   4) The people at Encore (at least the _old_ Encore) never seemed to
      believe my story. I reported it three times, then just decided to
      sit back and wait for the "imminent" arrival of Umax 4.3. That was
      at least 6 months ago. Meanwhile, I just spend a lot of time
      waiting for the phone to ring and for someone to say "There's a
      pseudoterminal hosed up again. Can you fix it?"
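
The check in point 1 can be sketched as a small filter over the output
of 'w'. This is only an illustration under assumptions that are not in
the post: that ptys are handed out in order from ttyp0 to ttypf, that
the tty name is the second column of each line, and that only this
first bank of sixteen ptys matters. The name first_free_pty is made up
for the example.

```shell
# Sketch: read 'w' output on stdin and print the lowest-numbered ttypN
# that no logged-in user holds. If logins are bouncing, that is the
# pty to suspect. Assumes in-order allocation from ttyp0 to ttypf.
first_free_pty() {
    awk '
        # Remember the unit character of every ttypN seen in column 2.
        $2 ~ /^ttyp[0-9a-f]$/ { used[substr($2, 5, 1)] = 1 }
        END {
            n = split("0 1 2 3 4 5 6 7 8 9 a b c d e f", unit, " ")
            # Walk the units in allocation order and report the first gap.
            for (i = 1; i <= n; i++)
                if (!(unit[i] in used)) { print "ttyp" unit[i]; exit }
        }
    '
}
```

Used as 'w | first_free_pty'; fed the listing shown in point 1, it
prints ttyp2.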

If any of you have seen this problem, please let me know if:

   1) You have found a solution to the problem?

   2) You would be willing to corroborate my story if I take it to the
      _new_ Encore?

   3) If you are a beta site for Umax 4.3, has the problem been fixed in
      4.3?

Speaking of the _new_ Encore, I would also like to know what experiences
any of you have had with the new organization. I am interested because I
lost a sales rep, was never informed of the fact until I posted a minor
flame on the net, was then contacted by a very apologetic person who
said he was "temporarily" my rep. I then heard nothing more for quite a
long time. Next, the person who does our accounting here asked me for
our rep's name. I gave him this "temporary" rep's name. He called, only to
find that we had yet another rep and had once again not been informed.

Let me make it clear that I am _not_ trying to start a flame-fest here.
I merely want to know if my experiences are unique. I am not unhappy
with our Multimax. There are several problems I have contacted (_old_)
Encore about and have gotten prompt service. The Multimax hardware has
been _very_ reliable, and hardware service has always been prompt. As
mentioned above, I have had almost no contact with the new Encore and
that bothers me, as does the one nagging software problem described
above.

Let me know if you have any input which might help me.

Thank You.

-Dale Courte
 Wright State University
 CSNET: dcourte@eve.wright.edu    BITNET: dcourte@wsu

francis@chook.ua.oz (Francis Vaughan) (02/09/90)

From article <1052@thor.wright.EDU>, by dcourte@eve.wright.edu (Dale Courte):
> Have any of you experienced the following when attempting to telnet to
> your Multimax?:
>
>    Trying...
>    Connected to xxxxxx
>    Escape character is '^]'.
>
>    Umax 4.2 (xxxxxx)
>
>    Login: Connection closed by foreign host.
>
> The "connection closed" comes up immediately after the word login
> appears. It is impossible to log in.

... plus explanation

> -Dale Courte

We have experienced the same problem from the word go, two years ago, on our
Multimax running UMAX. We also reported the problem. Since most of our
systems hackers use Suns, we found it necessary to keep a terminal attached
to an annex, ready all the time, solely to enable us to clear the problem via
the "call" protocol. The system managers would always have the line

   echo `tty`

in their .logins so they could immediately identify the blocking ptty.
They would then log in via "call" on the annex line and shoot the process
attached to the ptty. We never killed a real job; it was always just one
process still attached to the ptty (other than getty), owned by an ordinary
user. It gets to be quite fun when the ptty that blocks up is ptty0.
Occasionally people could still log in; that would happen when two people
logged in together and the lucky one got allocated the next ptty after the
blocking one.
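
The "find what is attached to the ptty" step can be sketched as a filter
over 'ps' output. This is an illustration only: it assumes the listing
has a header line naming a PID column and a TT (or TTY) column, that
the terminal is shown either in full (ttyp2) or in the short BSD form
(p2), and that no earlier column contains spaces. The name procs_on_pty
is invented for the example, and nothing here kills anything.

```shell
# Sketch: read 'ps' output on stdin and print the PIDs of processes
# whose terminal column matches the pty named in $1 (e.g. ttyp2).
procs_on_pty() {
    awk -v want="$1" '
        NR == 1 {
            # Locate the PID and terminal columns from the header line.
            for (i = 1; i <= NF; i++) {
                if ($i == "PID") pidcol = i
                if ($i == "TT" || $i == "TTY") ttcol = i
            }
            next
        }
        # Accept both the full name and the short form without "tty".
        pidcol && ttcol && ($ttcol == want || ("tty" $ttcol) == want) {
            print $pidcol
        }
    '
}
```

Something like 'ps -aux | procs_on_pty ttyp2' would then list the
candidates; whether to shoot them is still a judgment call for whoever
is logged in on the annex line.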

The logout happens when the starting shell fires up and attaches itself to
the ptty. For some reason the first thing it reads is an EOF and that's it.

We have had a lot of trouble with processes hanging around, or not
correctly dying. A lot of the new students (or worse, those that had been
using VMS before) would type control-Z to stop compilations and other
things (remember, ^Z is EOF on VMS). They would then just hit break on the
annex line and think they were logged out. A kill command on the annex (or
a timeout) would send SIGHUP to all the processes. Instead of quietly dying,
we found some (in particular programs written in Pascal) would go nuts.
They would go into an infinite loop and start to allocate lots of memory.
Eventually we were forced to write a daemon to kill these off. The best
explanation we could think of was that the signal handler under UMAX was
broken. We know it is one part that Encore had rewritten.
A lot of this has calmed down now, but this may be because we have better
behaved students, rather than Encore having fixed the problems.

Dept of Computer Science                        Francis Vaughan
Adelaide University                             francis@cs.ua.oz.au
South Australia

aej@wpi.wpi.edu (Allan E Johannesen) (02/09/90)

I've complained bitterly and repeatedly to Encore about this bug.
I've sent many dumps with specific pointers to ptys being broken in
the dumps (I never crashed a system due to the problem, but indicated
whether there were hung ptys in the system if it did crash).

My "solution" was a cheap hack which grabbed a pty in the same order
as all the pty-handlers (telnetd, script, etc.), then just detached and
hung onto the thing.  The theory was that the "broken" state of the
pty would still be detectable in the dump, but it would be out of the
way so we could operate our system.  I guess it was pointless, since
the bug has yet to be fixed.

Someone at Encore must believe it was a bug.  At one point, they gave
me a telnetd, script, etc. which grabbed ptys at random, the theory
being that you wouldn't be solidly locked out if a pty died.  You'd
have some statistical chance to get through on your second try.  I
preferred suffering with the old way so that I could continue to know
when it happened, identify the situation in dumps, etc.

The unfortunate news is that it is STILL in 4.3.  Two of the major
reasons I wanted to Beta test 4.3 were to get away from the pty bug (I
see another is vainly hoping for relief from the bug by 4.3, so I no
longer feel foolish in thinking my suffering would be over) and so
that I could run my 2,000 student system with quotas that wouldn't
crash the system (the quota code was entirely replaced in 4.3, but
still crashes the system).

dcourte@thor.wright.edu (Dale Courte,040P Lib. Annex,873-4030,) (02/12/90)

From article <763@sirius.ucs.adelaide.edu.au>, by francis@chook.ua.oz (Francis Vaughan):
> We have had a lot of trouble with processes hanging around, or not
> correctly dying. A lot of the new students (or worse, those that had been
> using VMS before) would type control-Z to stop compilations and other
> things (remember, ^Z is EOF on VMS). They would then just hit break on the
> annex line and think they were logged out. A kill command on the annex (or
> a timeout) would send SIGHUP to all the processes. Instead of quietly dying,
> we found some (in particular programs written in Pascal) would go nuts.
> They would go into an infinite loop and start to allocate lots of memory.
> Eventually we were forced to write a daemon to kill these off. The best
> explanation we could think of was that the signal handler under UMAX was
> broken. We know it is one part that Encore had rewritten.

I have seen a lot of this also. We had a particular problem with Franz
Lisp. As above, stopped jobs were not killed when the user logged off;
instead they were restarted in some sort of hard loop. My load average
would get up close to 10, and I'd look and find five or six spinning
copies of Lisp.

Through a large amount of experimentation, I found that this did not
happen when using the Korn shell. Since ksh is a nice shell anyway,
including job control and command line recall, I decided we would just
use ksh as our default login shell. We converted over, and the problem
with Franz Lisp disappeared. However, a similar problem developed with,
believe it or not, mail! Every day when I logged in I had to kill one or
two mail processes which had racked up hundreds of minutes of CPU time
in a hard loop. Bizarre. What could mail be doing?

My conclusion was also that the signal handler was broken, and when
making a service call to Encore, the person I talked to seemed to agree.
They had been able to duplicate the Franz Lisp problem.

The workaround I have in place now is to place a ksh ulimit command in
/etc/profile (which is executed when ksh users log in), limiting the cpu
time for a process to 5 minutes. Users who need to run long jobs can
reset this limit. So the mail processes spin for five minutes, then die.
Users who override this limit tend to be more sophisticated Unix people
and don't leave jobs hanging around, so this solution has worked quite
well. The C shell has similar cpu limiting commands, but no single file
equivalent to /etc/profile that I know of.
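
A minimal sketch of such an /etc/profile entry follows. The exact line
used at Wright State is not given in the post, so this is a guess at
its shape: it assumes a shell whose ulimit accepts -S and -t with the
value in seconds (ksh93, bash and dash do), and the variable name
CPU_LIMIT_SECONDS is made up. Only the soft limit is set, on the
assumption that this is what lets users reset it for long jobs.

```shell
# Hypothetical /etc/profile fragment, not the original.
# 300 seconds is the 5 minutes mentioned in the post.
CPU_LIMIT_SECONDS=300

# Cap CPU time per process for this login shell and its children.
# -S sets only the soft limit, so a user can still raise it again.
ulimit -S -t "$CPU_LIMIT_SECONDS"
```

A user who needs a long run could then raise the soft limit again with
something like 'ulimit -S -t unlimited' before starting the job,
provided the hard limit allows it.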

-Dale Courte, University Computing Services' Unix Systems Administrator
 email: dcourte (dcourte@eve.wright.edu)
 phone: 873-4030
 office: 040P Lib. Annex

dcourte@thor.wright.edu (Dale Courte,040P Lib. Annex,873-4030,) (02/12/90)

From article <8020@wpi.wpi.edu>, by aej@wpi.wpi.edu (Allan E Johannesen):
> 
> The unfortunate news is that it is STILL in 4.3.  Two of the major
> reasons I wanted to Beta test 4.3 were to get away from the pty bug (I
> see another is vainly hoping for relief from the bug by 4.3, so I no
> longer feel foolish in thinking my suffering would be over) and so
> that I could run my 2,000 student system with quotas that wouldn't
> crash the system (the quota code was entirely replaced in 4.3, but
> still crashes the system).

Is everyone still experiencing this in 4.3? I had heard otherwise, and I
don't want to get my hopes up only to be disappointed.

Also, can I have more detail about the quota system bug? I have been in
the process of planning for the implementation of quotas, as my disks are
filling up, and much of the space I fear is not really being used. If it
will crash my system, I'll have to wait. That is unacceptable.

bowen@cs.Buffalo.EDU (Devon Bowen) (02/14/90)

In article <1069@thor.wright.EDU>, dcourte@thor.wright.edu (Dale
Courte,040P Lib. Annex,873-4030,) writes:
> Is everyone still experiencing this in 4.3? I had heard otherwise, and I
> don't want to get my hopes up only to be disappointed.

We haven't had this problem for quite some time. That doesn't prove it's
not there, but for a University with freshmen who don't understand the
concept of backgrounding it's pretty strong evidence.

Devon

fuat@cunixf.cc.columbia.edu (Fuat C. Baran) (02/14/90)

In article <1052@thor.wright.EDU> dcourte@eve.wright.edu (Dale Courte) writes:
>If any of you have seen this problem, please let me know if:
>
>   1) You have found a solution to the problem?
>
>   2) You would be willing to corroborate my story if I take it to the
>      _new_ Encore?
>
>   3) If you are a beta site for Umax 4.3, has the problem been fixed in
>      4.3?

We are a UMAX 4.3 Beta site, and we haven't seen it (yet), though
apparently some other site has.  We do see it under SunOS 4.0.1
though...  We do an "rsh host /usr/local/bin/sps law" from another
host, find the offending process, and kill it.

						--Fuat


Internet: fuat@columbia.edu          U.S. MAIL: Columbia University
  BITNET: fuat@cunixc                           Center for Computing Activities
    UUCP: ...!rutgers!columbia!cunixc!fuat      712 Watson Labs, 612 W115th St.
   Phone: (212) 854-5128  Fax: (212) 662-6442   New York, NY 10025

bowen@cs.Buffalo.EDU (Devon Bowen) (02/14/90)

In article <763@sirius.ucs.adelaide.edu.au>, francis@chook.ua.oz
(Francis Vaughan) writes:
> systems hackers use Suns we found it necessary to have a terminal attached
> to an annex, ready all the time solely to enable us to clear the problem via
> the "call" protocol.

A cheaper fix is to run a backgrounded rlogin just before a regular rlogin.
The backgrounded one grabs the bad ptty long enough for the other rlogin to
grab a valid one. I did this from home a lot.
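
Roughly, the trick looks like the sketch below. The function name, the
RLOGIN and DELAY variables and the two-second pause are all invented
for illustration; how long the decoy actually holds its ptty depends on
the rlogin in use and on how it is started, so the delay may need
tuning or may not be needed at all.

```shell
# Sketch only. RLOGIN and DELAY are illustrative knobs, not from the post.
RLOGIN=${RLOGIN:-rlogin}
DELAY=${DELAY:-2}

rlogin_past_bad_pty() {
    host=$1
    # Decoy: start one rlogin in the background, output discarded, in
    # the hope that it is handed the next (possibly broken) ptty.
    $RLOGIN "$host" >/dev/null 2>&1 &
    decoy=$!
    # Give the decoy a moment to connect before the real attempt.
    sleep "$DELAY"
    # Real session: run in the foreground and remember its exit status.
    $RLOGIN "$host"
    status=$?
    # Clean up the decoy if it is still running.
    kill "$decoy" 2>/dev/null || true
    return $status
}
```

Called as 'rlogin_past_bad_pty multimax', where multimax stands in for
the real host name.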

Devon