roy@phri.UUCP (10/22/86)
Recently, a strange problem has cropped up with rlogin. We've got a Vax-11/750 running 4.2BSD and a bunch of Sun-3's running Sun 3.0 Unix. One of our Sun-3/50's (adenine) can't rlogin to the vax. It used to work fine, but then for no apparent reason it stopped working. Rebooting adenine seemed to have fixed the problem, but then it came back.

The symptoms are: If I try to rlogin from adenine to the vax, I get "connection timed out". I can rlogin from the vax to adenine, however. All the other suns can rlogin to the vax without problems, and I can also rlogin from adenine to some other sun and from there rlogin to the vax. Telnet from adenine directly to the vax works fine.

I'm stumped. My guess is that there is a pty that's hung up and for some reason the triple (localhost=adenine, remotehost=vax, protocol=rlogin) always finds that same pty. Why that should happen I have no idea, nor do I know how to go about testing this theory or fixing the problem. Any ideas?
--
Roy Smith, {allegra,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016
ables@mcc-pp.UUCP (King Ables) (10/23/86)
> Recently, a strange problem has cropped up with rlogin. We've got
> a Vax-11/750 running 4.2BSD and a bunch of Sun-3's running Sun 3.0 Unix.
> One of our Sun-3/50's (adenine) can't rlogin to the vax. It used to work
> fine, but then for no apparent reason it stopped working. Rebooting
> adenine seemed to have fixed the problem, but then it came back. The
> symptoms are:

I've seen the same problem here. The only clue I've had the time to find is that it has something to do with a process left around from an old login. If I rlogin to my Vax from a Sun3 and something happens (the window dies, or you quit suntools w/o logging out, perhaps, I don't know because I haven't been able to reproduce it) then you get a job hanging around on the Vax that says it's coming from the Sun, but the sun no longer has it. Once you kill the process on the vax, then you can rlogin from the sun again. It's almost as if something is saying "no, you already have one of those, you can't have two", which is crazy since you can have multiple rlogins from a sun under normal circumstances.

We haven't had time to try to figure it out and since we have a work-around now, it hasn't been a big inconvenience (it only happens rarely, anyway). However, I'd sure love to know why it does.

-King
ARPA: ables@mcc.com
UUCP: {gatech,ihnp4,nbires,seismo,ucbvax}!ut-sally!im4u!milano!mcc-pp!ables
-------
UNPARALLELED SERVICE means they can only do one thing at a time.
mac@esl.UUCP (10/29/86)
In article <1904@mcc-pp.UUCP> ables@mcc-pp.UUCP (King Ables) writes:
>> Recently, a strange problem has cropped up with rlogin. We've got
>> a Vax-11/750 running 4.2BSD and a bunch of Sun-3's running Sun 3.0 Unix.
>> One of our Sun-3/50's (adenine) can't rlogin to the vax. It used to work
>> fine, but then for no apparent reason it stopped working. Rebooting
>> adenine seemed to have fixed the problem, but then it came back. The
>> symptoms are:
>
>I've seen the same problem here. The only clue I've had the time
>to find is that it has something to do with a process left around
>from an old login. If I rlogin to my Vax from a Sun3 and something
>happens (the window dies, or you quit suntools w/o logging out, perhaps,
>I don't know because I haven't been able to reproduce it) then you
>get a job hanging around on the Vax that says it's coming from the
>Sun, but the sun no longer has it. Once you kill the process on the
>vax, then you can rlogin from the sun again. It's almost as if something
>is saying "no, you already have one of those, you can't have two"
>which is crazy since you can have multiple rlogins from a sun under
>normal circumstances.
>
>We haven't had time to try to figure it out and since we have a
>work-around now, it hasn't been a big inconvenience (it only
>happens rarely, anyway). However, I'd sure love to know why it does.
>
>-King
>ARPA: ables@mcc.com
>UUCP: {gatech,ihnp4,nbires,seismo,ucbvax}!ut-sally!im4u!milano!mcc-pp!ables
>-------
>UNPARALLELED SERVICE means they can only do one thing at a time.

There are a number of strange things going on in configurations of Sun, suntools, and rlogin. You Sun users have seen the occasional lockup of ptys follow Sun from Sun 100U's running Sun 1.0 through Sun 3/180's running 3.1FCS.
I mean you do a w and see strange descriptions of users on some of the ptys:

User tty login@ idle JCPU PCPU what
amp p3board 2:13pm 1:43 1:32 16 -csh
amp pty2 4:38pm238:48 123:92 123:21 - 7%4( %^
mac ttyp3 4:35pm 6:44 4:35 w

A reboot cleans these up. Until a reboot, users attempting to vi get weird window sizes (indeed, anything accessing window sizes over ptys gets strange data -- this includes people logged in over ethernet.)

Any one know a better fix than L1 A ( or shutdown -r 5 Resetting Sun :-)?

An aside: we had a real weird one the other day. One Sun (a Sun 3/180) wouldn't boot: the vmunix file was trashed by hard disk errors on the xy0a partition. We booted it from tape, as it couldn't run /genvmunix because the xy0a file system wouldn't pass fsck. (A good idea: save a copy of the generic /vmunix as something like /genvmunix on each of your machines, so you can boot off that if vmunix gets 86'ed.) Then I backed up the other xy0 partitions for safekeeping and ran the SMD standalone fix command on just the blocks of the xy0a partition. I then copied xy0a back in from backup tape and booted the machine.

Everything worked fine, EXCEPT rlogin from another machine to this machine. I got an error "getxfile: swap problems in getting file" or some such. I thought maybe rlogind was bad, so I copied another one to /usr and ran that one instead of /etc/rlogind. Same problem. So I ran fix on the swap partition, and it found errors there also. Great, I thought, that's why there was the getxfile error message. Same problem. Did an nm on /vmunix, and getxfile was in the kernel. Hmm, probably wasn't rlogind causing the problem.

Bit the bullet, shut down the machine, and ran fix on the whole disk. Went home. Came in the next day, reloaded unix on it, then restored /usr. Things are all fine now. Perhaps partial fixes are not recommended. It looked so promising in the manual, though: just reformat a piece, not the whole thing. Anyway.
Cheers
--
Michael Mc Namara
ESL Incorporated
ARPA: mac%esl.UUCP@ames.ARPA
      mac%esl.UUCP@shasta.ARPA
      mac%esl.UUCP@lll-lcc.ARPA
Note: esl used to be called tflop; the path mac%tflop will still work awhile
richl@penguin.uss.tek.com (Rick Lindsley) (10/29/86)
In article <332@esl.UUCP> mac@esl.UUCP (Mike McNamara) writes:
> User tty login@ idle JCPU PCPU what
> amp p3board 2:13pm 1:43 1:32 16 -csh
> amp pty2 4:38pm238:48 123:92 123:21 - 7%4( %^
> mac ttyp3 4:35pm 6:44 4:35 w
>
> A reboot cleans these up. Until a reboot, users attempting to vi get weird
> window sizes (indeed, anything accessing window sizes over ptys gets strange
> data -- this includes people logged in over ethernet.)
>
> Any one know a better fix than L1 A ( or shutdown -r 5 Resetting Sun :-)?

Yes, I think so. I spent the better part of a day tracking this down. I'm running 3.0 on a 3/50. The same bug exists under 3.0 on a 3/160 too.

3.0 has kernel structures for window sizes. Of course. These are settable via an ioctl. Termcap has been altered to look at these sizes. If they are non-zero, they apparently REPLACE the li: and co: entries in the termcap entry. Normally, this makes perfect sense. Unless ... unless you aren't running in a resizeable window. Something takes care to set these to 0 when you start. However, this DOESN'T happen if there is already a process running on the pty. Judging from this, I'd say the kernel, upon final close in the tty driver, is in charge of resetting these to 0. I can think of many situations where this makes perfect sense, but nevertheless it is the cause of the bug.

The 'w' output you gave (and the amazing amount of cpu time) indicates there is probably another process still running on that terminal. Did you do a "ps atp1"? That will show all the processes running on ttyp1, and might verify that this is the case.

To reproduce it, make a window. Run the program below, tst, to show the window size. Do "sleep 600 &" and close the window. Now rlogin from another host. Check to see that you have the right tty. Run tst again, and you'll see the "window size" hasn't changed. Unless you are running on a terminal which happens to be the same size as your window was, your termcap won't work.
Tset does NOT reset the window size, either. (I can see arguments in favor of this behavior also.) It might be reasonable for stty to report (and set) these sizes. Perhaps rlogind should do this. Doesn't matter; the fix is Sun's to ponder, not mine.

I couldn't find a Sun program which allowed me to change the window size, so I wrote tst2. If you run tst2, below, you'll find your troubles go away, if they are from this problem. No complaints about style please; I admit they are quick and dirty! Embellish at will.

tst:
-----
#include <stdio.h>
#include <sys/ioctl.h>

/* Print the window size the kernel has recorded for the tty on stdin. */
int main()
{
	struct ttysize size;

	if (ioctl(0, TIOCGSIZE, &size) < 0)
		perror("ioctl");
	else
		printf("Size is %d, %d\n", size.ts_lines, size.ts_cols);
	return 0;
}
-----

tst2:
-----
#include <stdio.h>
#include <sys/ioctl.h>

/* Reset the kernel's recorded window size to 0,0 so termcap's li: and co: apply again. */
int main()
{
	struct ttysize size;

	size.ts_lines = 0;
	size.ts_cols = 0;
	if (ioctl(0, TIOCSSIZE, &size) < 0)
		perror("ioctl");
	return 0;
}
-----

Rick Lindsley
chris@umcp-cs.UUCP (Chris Torek) (11/02/86)
In article <2460@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>... If I try to rlogin from adenine to the vax, I get "connection timed
>out". I can rlogin from the vax to adenine, however. All the other suns
>can rlogin to the vax without problems and I can also rlogin from adenine
>to some other sun and from there I can rlogin to the vax. Telnet from
>adenine directly to the vax works fine.
>
>... My guess is that there is a pty that's hung up ....

No: it is a TCP port. If you run netstat on the machines, one of them will have a connection stuck in FIN_WAIT_2. Install 4.3BSD.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP: seismo!umcp-cs!chris
CSNet: chris@umcp-cs ARPA: chris@mimsy.umd.edu