dcourte@eve.wright.edu (Dale Courte) (02/07/90)
Have any of you experienced the following when attempting to telnet to your Multimax?

    Trying...
    Connected to xxxxxx
    Escape character is '^]'.

    Umax 4.2 (xxxxxx)

    Login: Connection closed by foreign host.

The "connection closed" comes up immediately after the word login appears. It is impossible to log in. If you rlogin, the login is accomplished (provided the machine you are rlogging in from is in your .rhosts), the message of the day is printed, the system prompt appears, then BANG, you get "connection closed". Only a "call" from an annex server will get you logged in at this point. Most of our users on campus don't use an annex.

Here's what I have found through countless encounters with this very annoying problem:

1) It is the next available pseudoterminal which seems to be the problem. For instance, if the problem exists and I do a 'w' and see:

    3:54pm up 1 day, 18:50, 13 users, load average: 1.25 1.93 2.67
    User     tty   login@  idle JCPU PCPU what
    cse4a031 ttyp0 11:11am 2:03 1:41 ksh
    cse4a028 ttyp1 3:49pm  7 2 ksh
    cse1001  ttyp3 3:54pm  2 ksh
    cse6028  ttyp4 3:51pm  11 2 ksh

the problem will persist until either the user on ttyp0 or ttyp1 logs off. Then another user can log in. Then the problem returns. I assumed from this behavior that the problem was therefore with the pseudoterminal ttyp2 (or whatever - it is not always the same one).

2) If I peruse the output of 'ps -aux' when the problem occurs, I can always find a process "hanging around" assigned to that pseudoterminal. Sometimes there are multiple processes, other times just the telnetd or rlogind belonging to root.

3) If the processes assigned to the pseudoterminal are killed (which, of course, is not always advisable in an environment where faculty research jobs are often running in the background) the problem goes away.

4) The people at Encore (at least the _old_ Encore) never seemed to believe my story. I reported it three times, then just decided to sit back and wait for the "imminent" arrival of Umax 4.3.
That was at least 6 months ago. Meanwhile, I just spend a lot of time waiting for the phone to ring and for someone to say "There's a pseudoterminal hosed up again. Can you fix it?"

If any of you have seen this problem, please let me know if:

1) You have found a solution to the problem?

2) You would be willing to corroborate my story if I take it to the _new_ Encore?

3) If you are a beta site for Umax 4.3, has the problem been fixed in 4.3?

Speaking of the _new_ Encore, I would also like to know what experiences any of you have had with the new organization. I am interested because I lost a sales rep, was never informed of the fact until I posted a minor flame on the net, and was then contacted by a very apologetic person who said he was "temporarily" my rep. I then heard nothing more for quite a long time. Next, the person who does our accounting here asked me for our rep's name. I gave him this "temporary" rep's name. He called, only to find that we had yet another rep and had once again not been informed.

Let me make it clear that I am _not_ trying to start a flame-fest here. I merely want to know if my experiences are unique. I am not unhappy with our Multimax. There are several problems I have contacted (_old_) Encore about and have gotten prompt service. The Multimax hardware has been _very_ reliable, and hardware service has always been prompt. As mentioned above, I have had almost no contact with the new Encore and that bothers me, as does the one nagging software problem described above.

Let me know if you have any input which might help me. Thank You.

-Dale Courte
 Wright State University
 CSNET: dcourte@eve.wright.edu
 BITNET: dcourte@wsu
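The diagnosis in point 1 above (the lowest-numbered pseudoterminal missing from the 'w' listing is the suspect) can be sketched in a few lines of Python. This is only an illustration: the names `pty_search_order`, `ttys_in_use`, and `first_free_pty` are hypothetical, and the search order (banks p through s, units 0 through f) is an assumption based on common BSD pty naming, not something confirmed for Umax.

```python
import re

# Assumption: BSD-style pty names, searched bank by bank (p, q, r, s)
# and unit by unit (0-9, a-f). Umax may use a different range.
BANKS = "pqrs"
UNITS = "0123456789abcdef"


def pty_search_order():
    """Return tty names in the order a login daemon is assumed to try them."""
    return [f"tty{bank}{unit}" for bank in BANKS for unit in UNITS]


def ttys_in_use(w_output):
    """Collect pseudoterminal names from the second column of 'w' output."""
    in_use = set()
    for line in w_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and re.fullmatch(r"tty[p-s][0-9a-f]", fields[1]):
            in_use.add(fields[1])
    return in_use


def first_free_pty(w_output):
    """Return the first pty in search order that no listed user holds."""
    in_use = ttys_in_use(w_output)
    for name in pty_search_order():
        if name not in in_use:
            return name
    return None


SAMPLE_W = """\
3:54pm up 1 day, 18:50, 13 users, load average: 1.25 1.93 2.67
User tty login@ idle JCPU PCPU what
cse4a031 ttyp0 11:11am 2:03 1:41 ksh
cse4a028 ttyp1 3:49pm 7 2 ksh
cse1001 ttyp3 3:54pm 2 ksh
cse6028 ttyp4 3:51pm 11 2 ksh
"""

if __name__ == "__main__":
    print(first_free_pty(SAMPLE_W))
```

Fed the listing from the post, the sketch reports the gap between ttyp1 and ttyp3; the processes still attached to that tty would then be looked up with 'ps -aux' as described in point 2.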
francis@chook.ua.oz (Francis Vaughan) (02/09/90)
From article <1052@thor.wright.EDU>, by dcourte@eve.wright.edu (Dale Courte):
> Have any of you experienced the following when attempting to telnet to
> your Multimax?:
>
> Trying...
> Connected to xxxxxx
> Escape character is '^]'.
>
> Umax 4.2 (xxxxxx)
>
> Login: Connection closed by foreign host.
>
> The "connection closed" comes up immediately after the word login
> appears. It is impossible to log in.

... plus explanation

> -Dale Courte

We have experienced the same problem from the word go, two years ago on our Multimax running UMAX. We also reported the problem. Since most of our systems hackers use Suns we found it necessary to have a terminal attached to an annex, ready all the time solely to enable us to clear the problem via the "call" protocol. The system managers would always have the line echo `tty` in their .logins so they could immediately identify the blocking ptty. Then they would log in via "call" on the annex line and shoot the process attached to the ptty. We never killed a real job; it was always just one process still attached to the ptty (other than getty), owned by an ordinary user. It gets to be quite fun when the ptty that blocks up is ptty0.

Occasionally people could still log in; that would happen when two people logged in together and the lucky one got allocated the next ptty after the blocking one. The logout happens when the starting shell fires up and attaches itself to the ptty. For some reason the first thing it reads is an EOF and that's it.

We have had a lot of trouble with processes hanging around, or not correctly dying. A lot of the new students (or worse, those that had been using VMS before) would type control-Z to stop compilations and other things (remember ^Z is EOF on VMS). They would then just hit break on the annex line and think they were logged out. A kill command on the annex (or a timeout) would send SIGHUP to all the processes. Instead of quietly dying we found some (in particular programs written in Pascal) would go nuts.
They would go into an infinite loop and start to allocate lots of memory. Eventually we were forced to write a daemon to kill these off. The best explanation we could think of was that the signal handler under UMAX was broken. We know it is one part that Encore had rewritten.

A lot of this has calmed down now, but this may be because we have better behaved students, rather than Encore having fixed the problems.

Dept of Computer Science        Francis Vaughan
Adelaide University             francis@cs.ua.oz.au
South Australia
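The post does not say how the clean-up daemon chose its victims. As a rough sketch only, one plausible selection rule is "owned by an ordinary user, no controlling terminal, and more CPU time than any abandoned job should have". Everything here is hypothetical: the `Proc` record, the `select_runaways` name, the five-minute threshold, and the assumption that the ps TIME field reads as minutes:seconds.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    user: str
    tty: str          # "?" when the process has no controlling terminal
    cpu_seconds: int  # accumulated CPU time


def parse_cpu_time(text):
    """Convert a ps-style TIME field such as "123:45" (min:sec) to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def select_runaways(procs, cpu_limit_seconds=300, protected_users=("root",)):
    """Return pids of detached, non-protected processes over the CPU limit."""
    return [
        p.pid
        for p in procs
        if p.tty == "?"
        and p.user not in protected_users
        and p.cpu_seconds > cpu_limit_seconds
    ]


if __name__ == "__main__":
    sample = [
        Proc(101, "alice", "?", parse_cpu_time("250:10")),
        Proc(102, "bob", "ttyp1", parse_cpu_time("300:00")),
        Proc(103, "root", "?", parse_cpu_time("999:59")),
    ]
    # A real daemon would follow this with os.kill(pid, signal.SIGKILL)
    # for each pid; that step is deliberately left out of the sketch.
    print(select_runaways(sample))
```

Keeping the selection separate from the kill makes it easy to log the candidates for a while before trusting the rule on a machine where legitimate background jobs also run.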
aej@wpi.wpi.edu (Allan E Johannesen) (02/09/90)
I've complained bitterly and repeatedly to Encore about this bug. I've sent many dumps with specific pointers to ptys being broken in the dumps (I never crashed a system due to the problem, but indicated whether there were hung ptys in the system if it did crash).

My "solution" was a cheap hack which grabbed a pty in the same order as all the pty-handlers (telnetd, script, etc.) and just detached and hung onto the thing. The theory was that the "broken" state of the pty would still be detectable in the dump, but it would be out of the way so we could operate our system. I guess it was pointless, since the bug has yet to be fixed.

Someone at Encore must believe it was a bug. At one point, they gave me a telnetd, script, etc. which grabbed ptys at random, the theory being that you wouldn't be solidly locked out if a pty died. You'd have some statistical chance to get through on your second try. I preferred suffering with the old way so that I could continue to know when it happened, identify the situation in dumps, etc.

The unfortunate news is that it is STILL in 4.3. Two of the major reasons I wanted to Beta test 4.3 were to get away from the pty bug (I see another is vainly hoping for relief from the bug by 4.3, so I no longer feel foolish in thinking my suffering would be over) and so that I could run my 2,000 student system with quotas that wouldn't crash the system (the quota code was entirely replaced in 4.3, but still crashes the system).
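The difference between the stock allocation order and the random-order telnetd can be illustrated with a toy model. This is a sketch under assumptions, not Encore's code: `pick_pty` is a hypothetical name, and a "broken" pty is modelled simply as one that still looks free to the daemon.

```python
import random


def pick_pty(free_ptys, rng=None):
    """Return the pty a login daemon would take from the free list.

    With rng=None the daemon takes the first free pty (sequential order).
    With an rng it takes any free pty at random.
    """
    if not free_ptys:
        return None
    if rng is None:
        return free_ptys[0]
    return rng.choice(free_ptys)


if __name__ == "__main__":
    # ttyp2 is the hung pty; it still appears free, so it heads the list.
    free = ["ttyp2", "ttyp5", "ttyp6", "ttyp7"]
    broken = "ttyp2"

    # Sequential order: every attempt lands on the hung pty.
    print(all(pick_pty(free) == broken for _ in range(10)))

    # Random order: only some attempts land on it, so a retry can succeed.
    rng = random.Random(0)
    hits = sum(pick_pty(free, rng) == broken for _ in range(1000))
    print(hits < 1000)
```

The same model shows why the "grab and hold" hack works with sequential allocation: once something else holds the hung pty, it drops off the free list and the next login gets a good one.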
dcourte@thor.wright.edu (Dale Courte,040P Lib. Annex,873-4030,) (02/12/90)
From article <763@sirius.ucs.adelaide.edu.au>, by francis@chook.ua.oz (Francis Vaughan):
> We have had a lot of trouble with processes hanging around, or not
> correctly dying. A lot of the new students (or worse those that had been
> using VMS before) would type control-Z to stop compilations and other
> things (Remember ^Z is EOF on VMS). They would then just hit break on the
> annex line and think they were logged out. A kill command on the annex (or
> a timeout) would send SIGHUP to all the processes. Instead of quietly dying
> we found some (in particular programs written in Pascal) would go nuts.
> They would go into an infinite loop and start to allocate lots of memory.
> Eventually we were forced to write a daemon to kill these off. The best
> explanation we could think of was that the signal handler under UMAX was
> broken. We know it is one part that Encore had rewritten.

I have seen a lot of this also. We had a particular problem with Franz Lisp. As above, stopped jobs were not killed when the user logged off; they were restarted in some sort of hard loop. My load average would get up close to 10, and I'd look and find five or six spinning copies of Lisp.

Through a large amount of experimentation, I found that this did not happen when using the Korn shell. Since ksh is a nice shell anyway, including job control and command line recall, I decided we would just use ksh as our default login shell. We converted over, and the problem with Franz Lisp disappeared. However, a similar problem developed with, believe it or not, mail! Every day when I logged in I had to kill one or two mail processes which had racked up hundreds of minutes of CPU time in a hard loop. Bizarre. What could mail be doing?

My conclusion was also that the signal handler was broken, and when making a service call to Encore, the person I talked to seemed to agree. They had been able to duplicate the Franz Lisp problem.
The workaround I have in place now is a ksh ulimit command in /etc/profile (which is executed when ksh users log in), limiting the CPU time for a process to 5 minutes. Users who need to run long jobs can reset this limit. So the mail processes spin for five minutes, then die. Users who override this limit tend to be more sophisticated Unix people and don't leave jobs hanging around, so this solution has worked quite well. The C shell has similar CPU limiting commands, but no single file equivalent to /etc/profile that I know of.

-Dale Courte, University Computing Services' Unix Systems Administrator
 email: dcourte (dcourte@eve.wright.edu)
 phone: 873-4030
 office: 040P Lib. Annex
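The post does not give the exact profile line; in ksh a CPU limit of this kind is normally set with `ulimit -t` and a value in seconds. The same per-process limit can be sketched with Python's standard `resource` module (Unix only). The helper names `cpu_limit_pair` and `apply_cpu_limit` are hypothetical, and the 300-second figure simply mirrors the five minutes mentioned above.

```python
import resource


def cpu_limit_pair(seconds, current):
    """Compute a new (soft, hard) RLIMIT_CPU pair.

    The soft limit is set to `seconds`, capped at the existing hard limit;
    the hard limit is left alone so users can still raise the soft limit.
    """
    _soft, hard = current
    if hard != resource.RLIM_INFINITY:
        seconds = min(seconds, hard)
    return (seconds, hard)


def apply_cpu_limit(seconds=300):
    """Apply the soft CPU-time limit to the calling process."""
    new_limits = cpu_limit_pair(seconds, resource.getrlimit(resource.RLIMIT_CPU))
    resource.setrlimit(resource.RLIMIT_CPU, new_limits)
    return new_limits


if __name__ == "__main__":
    # Show what would be set, without changing this process's limits.
    print(cpu_limit_pair(300, resource.getrlimit(resource.RLIMIT_CPU)))
```

Because only the soft limit is lowered, this matches the behaviour described in the post: a spinning process is stopped after five CPU minutes, while a user who needs a long job can raise the limit again up to the hard ceiling. One way to use it is to pass `apply_cpu_limit` as `preexec_fn` to `subprocess.Popen` so that it applies only to the child.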
dcourte@thor.wright.edu (Dale Courte,040P Lib. Annex,873-4030,) (02/12/90)
From article <8020@wpi.wpi.edu>, by aej@wpi.wpi.edu (Allan E Johannesen):
>
> The unfortunate news is that it is STILL in 4.3. Two of the major
> reasons I wanted to Beta test 4.3 were to get away from the pty bug (I
> see another is vainly hoping for relief from the bug by 4.3, so I no
> longer feel foolish in thinking my suffering would be over) and so
> that I could run my 2,000 student system with quotas that wouldn't
> crash the system (the quota code was entirely replaced in 4.3, but
> still crashes the system).

Is everyone still experiencing this in 4.3? I had heard otherwise, and I don't want to get my hopes up only to be disappointed.

Also, can I have more detail about the quota system bug? I have been in the process of planning for the implementation of quotas, as my disks are filling up, and much of the space, I fear, is not really being used. If it will crash my system, I'll have to wait. That is unacceptable.
bowen@cs.Buffalo.EDU (Devon Bowen) (02/14/90)
In article <1069@thor.wright.EDU>, dcourte@thor.wright.edu (Dale Courte,040P Lib. Annex,873-4030,) writes:
> Is everyone still experiencing this in 4.3? I had heard otherwise, and I
> don't want to get my hopes up only to be disappointed.

We haven't had this problem for quite some time. That doesn't prove it's not there, but for a University with freshmen who don't understand the concept of backgrounding it's pretty strong evidence.

Devon
fuat@cunixf.cc.columbia.edu (Fuat C. Baran) (02/14/90)
In article <1052@thor.wright.EDU> dcourte@eve.wright.edu (Dale Courte) writes:
>If any of you have seen this problem, please let me know if:
>
> 1) You have found a solution to the problem?
>
> 2) You would be willing to corroborate my story if I take it to the
> _new_ Encore?
>
> 3) If you are a beta site for Umax 4.3, has the problem been fixed in
> 4.3?

We are a UMAX 4.3 Beta site, and we haven't seen it (yet), though apparently some other site has. We do see it under SunOS 4.0.1 though... We do an "rsh host /usr/local/bin/sps law" from another host, find the offending process, and kill it.

--Fuat

Internet: fuat@columbia.edu                   U.S. MAIL: Columbia University
BITNET:   fuat@cunixc                                    Center for Computing Activities
UUCP:     ...!rutgers!columbia!cunixc!fuat               712 Watson Labs, 612 W115th St.
Phone:    (212) 854-5128  Fax: (212) 662-6442            New York, NY 10025
bowen@cs.Buffalo.EDU (Devon Bowen) (02/14/90)
In article <763@sirius.ucs.adelaide.edu.au>, francis@chook.ua.oz (Francis Vaughan) writes:
> systems hackers use Suns we found it necessary to have a terminal attached
> to an annex, ready all the time solely to enable us to clear the problem via
> the "call" protocol.

A cheaper fix is to run a backgrounded rlogin just before a regular rlogin. The backgrounded one grabs the bad ptty long enough for the other rlogin to grab a valid one. I did this from home a lot.

Devon