alext@ccwf.cc.utexas.edu (Alex Tomlinson) (02/28/91)
We have a problem in ISISV2.1 regarding killing the rank zero member
of a process group. We set up a toy program to notify each member of
the process group when the group membership changes (PG_MONITOR).

Everything works as expected for the following cases:
- any member of the group leaves via pg_leave
- any member except the rank zero is terminated via a UNIX "kill"

When the rank zero member of the group is killed, ISIS seems not to
notice. We base this on the following:
- no members are notified
- .../run_isis/cmd reports that the process is still part of the group
- the process doesn't exist as far as UNIX's "ps" command is concerned

Is this a bug in ISISV2.1? Please email or post any ideas.

Thanks,
Alex Tomlinson
Greg Hoagland
ken@gvax.cs.cornell.edu (Ken Birman) (02/28/91)
In article <ALEXT.91Feb27191225@doc.cc.utexas.edu> alext@ccwf.cc.utexas.edu (Alex Tomlinson) writes:
>
>We have a problem in ISISV2.1 regarding killing the rank zero member
>of a process group....

We don't observe this problem when we try simple experiments, such as
running "grid" and killing the lowest ranked group member.

The most likely explanation is that your UNIX system is not reporting
an error condition to the ISIS protocols server when a TCP or UNIX
Domain (depending on the flavor of machine) connection breaks.

The point is that when a process is running under ISIS, we know it is
alive because it has a connection open to bin/protos on its host
machine. We detect that it has failed when we see this connection
"break". In UNIX, this normally results in a select reporting one or
both of "data ready" and "exception" -- ISIS checks for both conditions.

But I have recently noticed a problem, especially under SUN OS, whereby
UNIX doesn't report this condition in some situations. For example, if
an ISIS system has been running for a long time but is inactive, so that
bin/isis is swapped out, and you then kill bin/isis, it seems that
protos won't notice its death -- although it detects this promptly when
bin/isis is not swapped out.

Your bug report doesn't give us much to work with, because you didn't
give the machine type and OS revision level you are on, didn't send any
sort of dump output (see the manual) and didn't send your program, which
could be buggy (no offense intended).

For this reason, we advise people to send bug reports to us using email
(isis-bugs@cs.cornell.edu) and to delay posting until they know the
cause of the problem and also the fix or work-around.

So, let's take this one offline and, if there is an ISIS or UNIX problem
here, we can post something later. Email the info requested above to me
and I'll have a look (use "cmd snap" to get the log files).

-- Ken
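
The detection mechanism described in the reply above (a broken connection showing up in select as "data ready" and/or "exception") can be sketched roughly as follows. This is only an illustration in Python, not ISIS code: `peer_has_failed` is a hypothetical helper name, and treating any "exception" report as a failure is an assumption taken from the description in the post. On a stream socket, a closed peer typically appears as "data ready" with a zero-length read.

```python
import select
import socket


def peer_has_failed(conn, timeout=0.0):
    """Return True if the connection to the peer looks broken.

    Hypothetical helper, not part of ISIS. It polls one socket with
    select() and checks both the "data ready" and "exception" sets.
    """
    readable, _, exceptional = select.select([conn], [], [conn], timeout)

    if conn in exceptional:
        # Assumption from the post: an exception report counts as a failure.
        return True

    if conn in readable:
        try:
            # Peek so that real data stays queued for the normal reader.
            data = conn.recv(1, socket.MSG_PEEK)
        except ConnectionError:
            # For example, a connection reset by the peer.
            return True
        # A zero-length read on a stream socket means the peer closed it.
        return data == b""

    # select() reported nothing, so the connection still looks alive.
    return False
```

Note that a monitor built this way only notices a death if the operating system actually reports the closed connection to select; if that report never arrives, the function keeps returning False. Unread data still queued on the socket also hides the end-of-file until it has been consumed.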