[net.unix-wizards] Problems with rlogin

roy@phri.UUCP (10/22/86)

	Recently, a strange problem has cropped up with rlogin.  We've got
a Vax-11/750 running 4.2BSD and a bunch of Sun-3's running Sun 3.0 Unix.
One of our Sun-3/50's (adenine) can't rlogin to the vax.  It used to work
fine, but then for no apparent reason it stopped working.  Rebooting
adenine seemed to have fixed the problem, but then it came back.  The
symptoms are:

	If I try to rlogin from adenine to the vax, I get "connection timed
out".  I can rlogin from the vax to adenine, however.  All the other suns
can rlogin to the vax without problems and I can also rlogin from adenine
to some other sun and from there I can rlogin to the vax.  Telnet from
adenine directly to the vax works fine.

	I'm stumped.  My guess is that there is a pty that's hung up and for
some reason the triple (localhost=adenine, remotehost=vax, protocol=rlogin)
always finds that same pty.  Why that should happen I have no idea, nor do
I know how to go about testing this theory or fixing the problem.  Any
ideas?
-- 
Roy Smith, {allegra,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

ables@mcc-pp.UUCP (King Ables) (10/23/86)

> 	Recently, a strange problem has cropped up with rlogin.  We've got
> a Vax-11/750 running 4.2BSD and a bunch of Sun-3's running Sun 3.0 Unix.
> One of our Sun-3/50's (adenine) can't rlogin to the vax.  It used to work
> fine, but then for no apparent reason it stopped working.  Rebooting
> adenine seemed to have fixed the problem, but then it came back.  The
> symptoms are:

I've seen the same problem here.  The only clue I've had the time
to find is that it has something to do with a process left around
from an old login.  If I rlogin to my Vax from a Sun3 and something
happens (the window dies, or you quit suntools w/o logging out, perhaps,
I don't know because I haven't been able to reproduce it) then you
get a job hanging around on the Vax that says it's coming from the
Sun, but the sun no longer has it.  Once you kill the process on the
vax, then you can rlogin from the sun again.  It's almost as if something
is saying "no, you already have one of those, you can't have two"
which is crazy since you can have multiple rlogins from a sun under
normal circumstances.

We haven't had time to try to figure it out and since we have a
work-around now, it hasn't been a big inconvenience (it only
happens rarely, anyway).  However, I'd sure love to know why it does.

-King
ARPA: ables@mcc.com
UUCP: {gatech,ihnp4,nbires,seismo,ucbvax}!ut-sally!im4u!milano!mcc-pp!ables
-------
UNPARALLELED SERVICE means they can only do one thing at a time.

mac@esl.UUCP (10/29/86)

In article <1904@mcc-pp.UUCP> ables@mcc-pp.UUCP (King Ables) writes:
>> 	Recently, a strange problem has cropped up with rlogin.  We've got
>> a Vax-11/750 running 4.2BSD and a bunch of Sun-3's running Sun 3.0 Unix.
>> One of our Sun-3/50's (adenine) can't rlogin to the vax.  It used to work
>> fine, but then for no apparent reason it stopped working.  Rebooting
>> adenine seemed to have fixed the problem, but then it came back.  The
>> symptoms are:
>
>I've seen the same problem here.  The only clue I've had the time
>to find is that it has something to do with a process left around
>from an old login.  If I rlogin to my Vax from a Sun3 and something
>happens (the window dies, or you quit suntools w/o logging out, perhaps,
>I don't know because I haven't been able to reproduce it) then you
>get a job hanging around on the Vax that says it's coming from the
>Sun, but the sun no longer has it.  Once you kill the process on the
>vax, then you can rlogin from the sun again.  It's almost as if something
>is saying "no, you already have one of those, you can't have two"
>which is crazy since you can have multiple rlogins from a sun under
>normal circumstances.
>
>We haven't had time to try to figure it out and since we have a
>work-around now, it hasn't been a big inconvenience (it only
>happens rarely, anyway).  However, I'd sure love to know why it does.

There are a number of strange things going on in configurations of Sun,
suntools, and rlogin.  You Sun users have seen the occasional lockup of ptys
follow Sun from Sun 100U's running Sun 1.0 through Sun 3/180's running
3.1FCS.  I mean you do a w and see strange descriptions of users on some of
the ptys:

User     tty       login@  idle   JCPU   PCPU  what
amp      p3board   2:13pm  1:43   1:32     16  -csh 
amp      pty2      4:38pm238:48 123:92 123:21  - 7%4( %^
mac      ttyp3     4:35pm         6:44   4:35  w

A reboot cleans these up.  Until a reboot, users attempting to vi get weird
window sizes (indeed, anything accessing window sizes over ptys gets strange
data -- this includes people logged in over ethernet).

Anyone know a better fix than L1-A (or shutdown -r 5 Resetting Sun :-)?

An aside: we had a real weird one the other day.  One Sun (a Sun 3/180)
wouldn't boot -- the vmunix file was trashed -- hard disk errors on the xy0a
partition.  We booted it from tape, as it couldn't run /genvmunix either:
the xy0a file system wouldn't pass fsck.  (A good idea: save a copy of the
generic /vmunix as something like /genvmunix on each of your machines, so
you can boot off that if vmunix gets 86'ed.)  Then I backed up the other xy0
partitions, for safekeeping, and ran the SMD standalone fix command on
just the blocks of the xy0a partition.  I then copied xy0a back in from backup
tape, and booted the machine.  Everything worked fine, EXCEPT rlogin from
another machine to this machine.  I got an error "getxfile: swap problems in
getting file" or some such.  I thought maybe rlogind was bad, so I copied
another one to /usr, and ran that one instead of /etc/rlogind.  Same problem.

So I ran fix on the swap partition, and it found errors there also. Great, I
thought. That's why there was the getxfile error message. 

Same problem.  Did an nm on /vmunix, and getxfile was in the kernel. Hmm,
probably wasn't rlogind causing the problem.  

Bit the bullet, shut down the machine, and ran fix on the whole disk. Went
home. Came in the next day, reloaded Unix on it, then restored /usr.
Things are all fine now.  Perhaps partial fixes are not recommended.
It looked so promising in the manual, though.  Just reformat a piece, not the
whole thing.

Anyway. Cheers

-- 
------------------------------------+-----------------------------------------+
| Michael Mc Namara                 |  MM MM MM    OO   SSS   AAA   II  CCCC  |
| ESL Incorporated                  |   M   M  M  O  O S         A   I C    C |
| ARPA: mac%esl.UUCP@ames.ARPA      |   M   M  M  O  O  SSS   AAAA   I C      |
|       mac%esl.UUCP@shasta.ARPA    |   M   M  M  O  O     S A   A   I C    C |
|       mac%esl.UUCP@lll-lcc.ARPA   |  MM   M  M   OO   SSS   AAA A III CCCC  |
------------------------------------+-----------------------------------------+
| Note: esl used to be called tflop; the path mac%tflop will still work awhile|
------------------------------------+-----------------------------------------+

richl@penguin.uss.tek.com (Rick Lindsley) (10/29/86)

In article <332@esl.UUCP> mac@esl.UUCP (Mike McNamara) writes:

> User     tty       login@  idle   JCPU   PCPU  what
> amp      p3board   2:13pm  1:43   1:32     16  -csh 
> amp      pty2      4:38pm238:48 123:92 123:21  - 7%4( %^
> mac      ttyp3     4:35pm         6:44   4:35  w
> 
> A reboot cleans these up.  Until a reboot, users attempting to vi get weird
> window sizes (indeed, anything accessing window sizes over ptys gets strange
> data -- this includes people logged in over ethernet.)
> 
> Anyone know a better fix than L1-A (or shutdown -r 5 Resetting Sun :-)?

Yes, I think so. I spent the better part of a day tracking this down.
I'm running 3.0 on a 3/50. The same bug exists under 3.0 on a 3/160
too.

3.0 has kernel structures for window sizes. Of course. These are
settable via an ioctl. Termcap has been altered to look at these sizes.
If they are non-zero, they apparently REPLACE the li: and co: entries
in the termcap entry. Normally, this makes perfect sense. Unless ...
unless you aren't running in a resizeable window.

Something takes care to set these to 0 when you start.  However, this
DOESN'T happen if there is already a process running on the pty.
Judging from this, I'd say the kernel, upon final close in the tty
driver, is in charge of resetting these to 0.  I can think of many
situations where this makes perfect sense, but nevertheless it is the
cause of the bug.

The 'w' output you gave (and the amazing amount of cpu time) indicates
there is probably another process still running on that terminal. Did
you do a "ps atp1"? That will show all the processes running on ttyp1,
and might verify that this is the case.

Reproduce it by making a window. Run the program below, tst, to show the
window size. Do "sleep 600 &" and close the window. Now rlogin from
another host. Check to see that you have the right tty. Run tst again,
and you'll see the "window size" hasn't changed. Unless you are running
on a terminal which happens to be the same size as your window was,
your termcap won't work. Tset does NOT reset the window size, either.
(I can see arguments in favor of this behavior also.) It might be
reasonable for stty to report (and set) these sizes. Perhaps rlogind
should do this. Doesn't matter; the fix is Sun's to ponder, not mine. I
couldn't find a Sun program which allowed me to change the window size,
so I wrote tst2. If you run tst2, below, you'll find your troubles go
away, if they are from this problem. No complaints about style please;
I admit they are quick and dirty! Embellish at will.

tst:
-----
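#include <stdio.h>
#include <sys/ioctl.h>

/* Print the window size the kernel has recorded for the tty on stdin. */
int
main()
{
    struct ttysize size;

    if (ioctl(0, TIOCGSIZE, &size) < 0) {
	perror("ioctl");
	return 1;
    }
    printf("Size is %d, %d\n", size.ts_lines, size.ts_cols);
    return 0;
}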
-----

tst2:
-----
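#include <stdio.h>
#include <sys/ioctl.h>

/* Zero the kernel's window size for the tty on stdin. */
int
main()
{
    struct ttysize size;

    size.ts_lines = 0;
    size.ts_cols = 0;
    if (ioctl(0, TIOCSSIZE, &size) < 0) {
	perror("ioctl");
	return 1;
    }
    return 0;
}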
-----

Rick Lindsley

chris@umcp-cs.UUCP (Chris Torek) (11/02/86)

In article <2460@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>... If I try to rlogin from adenine to the vax, I get "connection timed
>out".  I can rlogin from the vax to adenine, however.  All the other suns
>can rlogin to the vax without problems and I can also rlogin from adenine
>to some other sun and from there I can rlogin to the vax.  Telnet from
>adenine directly to the vax works fine.
>
>... My guess is that there is a pty that's hung up ....

No: it is a TCP port.  If you run netstat on the machines, one of them
will have a connection stuck in FIN_WAIT_2.  Install 4.3BSD.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu