[comp.sys.isis] SUN OS4.1c bug causes irritating problems in ISIS

ken@cs.cornell.edu (Ken Birman) (06/11/91)

A number of ISIS users are upgrading to SUN OS 4.1c and then running
into the following problem:

ISIS is up for a while and things seem fine, even pretty idle,
and then the system starts to "not notice" the failure/exit of
application programs.  Pretty quickly, it becomes impossible to
connect to ISIS, or the system runs out of something called "clist"
space (client-list for a group) and panics, or things get hung
because of the unnoticed failure.

Reason for the problem:  Under SUN OS 4.1c there seems to be a serious
OS bug.  Pipes don't "break" correctly when a process terminates if either
end of the pipe is swapped out at the time of termination.  bin/protos
notices failures by receiving SIGPIPE, so this leaves the system
thinking the program that terminated is still around and is simply idle.

Strangely, once this happens you can still use cmd to connect to ISIS
and can even ask for a snapshot, which involves writing a message into
the pipe that failed to break.  SUN OS simply refuses to signal an
exception in this case, and the write "completes" without error
indications of any kind.
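
For comparison, here is a minimal sketch of what a write into a broken
pipe is normally expected to do on a UNIX system.  It is not ISIS code:
the function name write_reports_broken_pipe is made up for illustration,
and an ordinary pipe() stands in for whatever channel protos really
uses.  With SIGPIPE ignored, the write should fail with EPIPE; a write
that "completes" is the symptom described above.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical helper (not part of ISIS).
 * Returns 1 if writing to a pipe whose read end is closed fails with
 * EPIPE (the expected behavior), 0 if the write "succeeds" (the
 * symptom described in the post), and -1 on any other error. */
int write_reports_broken_pipe(void)
{
    int fds[2];
    void (*old_handler)(int);
    ssize_t n;
    int saved_errno;

    if (pipe(fds) < 0)
        return -1;

    /* Ignore SIGPIPE so the failure shows up as an errno value
     * instead of terminating this process. */
    old_handler = signal(SIGPIPE, SIG_IGN);
    if (old_handler == SIG_ERR) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    close(fds[0]);                  /* simulate the reader going away */
    n = write(fds[1], "x", 1);
    saved_errno = errno;            /* save before further calls */

    close(fds[1]);
    signal(SIGPIPE, old_handler);   /* restore the previous disposition */

    if (n < 0)
        return saved_errno == EPIPE ? 1 : -1;
    return 0;
}
```

This only demonstrates the normal case in a single process; it does not
reproduce the swapped-out condition that seems to trigger the bug.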

Bug fixes/work-arounds:

1) A theoretical solution.  My idea is that you recompile with UNIX_DOM
   disabled (not defined in isis.h, pr.h). You will get a TCP channel;
   this involves a different close-down mechanism and perhaps the problem
   will go away.  Haven't tried this myself.  It would slow things down,
   but not by a lot.  The default is to use a UNIX domain connection when
   possible.

2) Use isis_probe.  In principle, ISIS will notice that you are dead and
   kick your application off even if the channel doesn't break.

3) Hack in the following.  "kill" reports an error when the process in
   question is not around, so
	a) run protos as root
	b) change it to do a "kill USR1" every, say, 10 seconds to each
	   local process.  If the kill fails because the process is
	   dead, call client_crashed
	c) change cl_isis.c to catch and ignore USR1 signals.
   I think this would work, but it is certainly a little grungy.
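
   A rough sketch of the hack in 3), under stated assumptions: the names
   process_is_gone, probe_clients and ignore_probe_signal are invented
   for illustration, and the on_crash callback merely stands in for
   client_crashed, whose real signature is not given here.  Only
   standard kill() and signal() calls are used.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical helper: returns 1 if sending `sig` to `pid` fails
 * because no such process exists, 0 otherwise. */
int process_is_gone(pid_t pid, int sig)
{
    if (kill(pid, sig) == 0)
        return 0;               /* signal accepted: process exists */
    return errno == ESRCH;      /* other errors (e.g. EPERM) imply it exists */
}

/* Hypothetical step (b): probe each local client once.  Calls
 * on_crash (a stand-in for client_crashed) for every process that
 * is gone and returns how many were found dead. */
int probe_clients(const pid_t *pids, int n, int sig,
                  void (*on_crash)(pid_t))
{
    int i, dead = 0;

    for (i = 0; i < n; i++) {
        if (process_is_gone(pids[i], sig)) {
            dead++;
            if (on_crash != NULL)
                on_crash(pids[i]);
        }
    }
    return dead;
}

/* Hypothetical step (c): a client ignores the probe signal so that
 * the periodic kill has no effect on it.  Returns 0 on success. */
int ignore_probe_signal(void)
{
    return signal(SIGUSR1, SIG_IGN) == SIG_ERR ? -1 : 0;
}
```

   protos would call probe_clients with SIGUSR1 from whatever timer it
   already has, roughly every 10 seconds.  Where kill accepts signal 0,
   passing 0 instead performs the same existence check without
   delivering anything, which might make step (c) unnecessary.  Note
   that an exited process still counts as "around" until its parent
   has waited for it.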

4) Complain loudly to SUN.  Maybe they have a patch by now.


-- 
Kenneth P. Birman                              E-mail:  ken@cs.cornell.edu
4105 Upson Hall, Dept. of Computer Science     TEL:     607 255-9199 (office)
Cornell University Ithaca, NY 14853 (USA)      FAX:     607 255-4428