[comp.sys.isis] V2.1 Problem Regarding group membership changes

alext@ccwf.cc.utexas.edu (Alex Tomlinson) (02/28/91)

We have a problem in ISISV2.1 regarding killing the rank zero member
of a process group.  We set up a toy program to notify each member
of the process group when the group membership changes (PG_MONITOR).
Everything works as expected for the following cases:

	- any member of the group leaves via pg_leave
	- any member except the rank zero member is terminated via a UNIX "kill"

When the rank zero member of the group is killed, ISIS seems not to notice.
We base this on the following:

	- no members are notified
	- .../run_isis/cmd  reports that the process is still part of the group
	- the process doesn't exist as far as UNIX's "ps" command is concerned

Is this a bug in ISISV2.1?
Please email or post any ideas.

Thanks,
Alex Tomlinson
Greg Hoagland

ken@gvax.cs.cornell.edu (Ken Birman) (02/28/91)

In article <ALEXT.91Feb27191225@doc.cc.utexas.edu> alext@ccwf.cc.utexas.edu (Alex Tomlinson) writes:
>
>We have a problem in ISISV2.1 regarding killing the rank zero member
>of a process group....

We don't observe this problem when we try simple experiments, such
as running "grid" and killing the lowest ranked group member.  The
most likely explanation is that your UNIX system is not reporting
an error condition to the ISIS protocols server when a TCP or UNIX
Domain (depending on the flavor of machine) connection breaks.

The point is that when a process is running under ISIS, we know it
is alive because it has a connection open to bin/protos on its
host machine.  We detect that it has failed when we see this connection
"break".  In UNIX, this normally results in a select reporting one or
both of "data ready" and "exception" -- ISIS checks for both conditions.
But I have recently noticed a problem, especially under SunOS,
whereby UNIX doesn't report this condition in some situations.
For example, if an ISIS system is running for a long time but inactive,
so that bin/isis is swapped out, and then you kill bin/isis, it seems
that protos won't notice its death -- although it detects this
promptly when bin/isis is not swapped out.
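
To make the detection mechanism concrete, here is a minimal sketch of
the general technique described above: watch a connection with select
for both "data ready" and "exception", then treat end-of-file or a
socket error as a broken connection.  This is not ISIS code.  The
names peer_has_failed and demo are hypothetical, a local socketpair
stands in for the client-to-protos connection, and closing one end
stands in for the kernel closing a killed process's descriptors.

```python
import select
import socket


def peer_has_failed(conn, timeout=1.0):
    """Return True if the peer on `conn` appears to be gone.

    Hypothetical helper, not part of ISIS: it asks select() about both
    the "data ready" and "exception" sets, then peeks at the socket.
    """
    readable, _, exceptional = select.select([conn], [], [conn], timeout)
    if not readable and not exceptional:
        return False  # nothing reported within the timeout
    try:
        # Peek without blocking and without consuming real data.
        data = conn.recv(1, socket.MSG_PEEK | socket.MSG_DONTWAIT)
    except BlockingIOError:
        return False  # exception flagged, but no EOF or error to read
    except OSError:
        return True   # e.g. connection reset by the peer
    return data == b""  # an empty read means EOF: the peer closed


def demo():
    # A UNIX-domain socket pair stands in for the client/server link.
    server_end, client_end = socket.socketpair()
    while_alive = peer_has_failed(server_end, timeout=0.1)
    client_end.close()  # simulates the client process dying
    after_close = peer_has_failed(server_end, timeout=1.0)
    server_end.close()
    return while_alive, after_close
```

If select never reports the socket at all, as in the swapped-out case
described above, a check like this has nothing to act on, which would
match the symptom of a dead process still being listed as a member.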

Your bug report doesn't give us much to work with, because you didn't
give the machine type and OS revision level you are on, didn't send
any sort of dump output (see the manual) and didn't send your program,
which could be buggy (no offense intended).  For this reason, we advise
people to send bug reports to us using email (isis-bugs@cs.cornell.edu)
and to delay posting until they know the cause of the problem and also
the fix or work-around.

So, let's take this one offline and, if there is an ISIS or UNIX problem
here, we can post something later.  Email the info requested above to
me and I'll have a look (use "cmd snap" to get the log files).

-- Ken