ken@cs.cornell.edu (Ken Birman) (06/11/91)
A number of ISIS users are upgrading to SUN OS 4.1c and then running into the following problem: ISIS is up for a while and things seem fine, even pretty idle, and then the system starts to "not notice" the failure/exit of application programs. Pretty quickly, it becomes impossible to connect to ISIS, or the system runs out of something called "clist" space (client-list for a group) and panics, or things get hung because of the unnoticed failure.

Reason for the problem:

Under SUN OS 4.1c there seems to be a serious OS bug. Pipes don't "break" correctly when a process terminates if either end of the pipe was swapped out when the termination occurs. bin/protos notices failures by receiving SIGPIPE, so this leaves the system thinking that the program that terminated is still around and is simply idle.

Strangely, once this happens you can still use cmd to connect to ISIS and can even ask for a snapshot, which involves writing a message into the non-broken pipe. SUN OS simply refuses to signal an exception in this case, and the write "completes" without error indications of any kind.

Bug fixes/work-arounds:

1) A theoretical solution. My idea is that you recompile with UNIX_DOM disabled (not defined in isis.h, pr.h). The default is to use a UNIX domain connection when possible; with UNIX_DOM disabled you will get a TCP channel instead. This involves a different close-down mechanism, and perhaps the problem will go away. I haven't tried this myself. It would slow things down, but not by a lot.

2) Use isis_probe. In principle, ISIS will notice that you are dead and kick your application off even if the channel doesn't break.

3) Hack in the following. "kill" reports when the process in question is not around, so
   a) run protos as root
   b) change it to do a "kill USR1" every, say, 10 seconds to each local process. If the kill fails because the process is dead, call client_crashed
   c) change cl_isis.c to catch and ignore USR1 signals.
   I think this would work, but it is certainly a little grungy.

4) Complain loudly to SUN.
Maybe they have a patch by now.

-- 
Kenneth P. Birman                            E-mail: ken@cs.cornell.edu
4105 Upson Hall, Dept. of Computer Science   TEL: 607 255-9199 (office)
Cornell University Ithaca, NY 14853 (USA)    FAX: 607 255-4428
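
For anyone who wants to try work-around 3, here is a rough sketch of the two halves of the hack. It is not taken from the ISIS sources: the names probe_process, ignore_probe_signal, PROBE_ALIVE and PROBE_GONE are made up for illustration, and where this would hook into protos and cl_isis.c is left open. It assumes only the standard behavior of kill(), which fails with errno set to ESRCH when no such process exists.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical result codes for the sketch; not ISIS identifiers. */
#define PROBE_ALIVE 0
#define PROBE_GONE  1

/*
 * Probe side (what protos would do every 10 seconds or so for each
 * local client).  Sends SIGUSR1 and inspects the result.
 * ESRCH means the process no longer exists; in protos this is the
 * point at which client_crashed would be called.  Any other failure
 * (e.g. EPERM when not running as root) means the process exists but
 * could not be signalled, so it is conservatively reported as alive.
 */
int probe_process(pid_t pid)
{
    if (kill(pid, SIGUSR1) == 0)
        return PROBE_ALIVE;
    return (errno == ESRCH) ? PROBE_GONE : PROBE_ALIVE;
}

/*
 * Client side (the change to cl_isis.c): make sure the probe signal
 * is harmless.  The default action for SIGUSR1 is to terminate the
 * process, so the client has to ignore it before the first probe.
 * Returns 0 on success, -1 on failure.
 */
int ignore_probe_signal(void)
{
    return (signal(SIGUSR1, SIG_IGN) == SIG_ERR) ? -1 : 0;
}
```

Two caveats. Treating EPERM as "alive" is the reason step a) says to run protos as root: without permission to signal a client, the probe can confirm that the process exists but never delivers USR1. Also, a process that has exited but has not yet been waited for by its parent still accepts signals, so the probe only reports a client as gone once it has been reaped.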