[comp.sys.isis] failure detection?

schwartz@shire.cs.psu.edu (Scott E. Schwartz) (03/02/90)

Hi all,
	Playing with Isis I've noticed something which surprised me.
If I run several instances of the grid demo I can kill and restart
individual processes with no problem.  But if I send SIGTSTP (i.e.,
control-Z) to one of them, they all hang, seemingly forever, or until I
continue the stopped job.  I expected Isis to eventually decide that
the stopped participant had failed and continue without it.  Worse, if
I wait too long before allowing the stopped process to continue, the
system never seems to recover at all.  Have I misunderstood something
crucial?  This is on a sun4/260 under 4.0.3.


--
Scott Schwartz		schwartz@cs.psu.edu
"the same idea is applied today in the use of slide rules." -- Don Knuth 

ken@gvax.cs.cornell.edu (Ken Birman) (03/03/90)

In article <Epmj=c1@cs.psu.edu> schwartz@shire.cs.psu.edu (Scott E. Schwartz) writes:
>
>Hi all,
>	Playing with Isis I've noticed something which surprised me.
>If I run several instances of the grid demo I can kill and restart
>individual processes with no problem.  But if I send SIGTSTP (i.e.,
>control-Z) to one of them, they all hang, seemingly forever, or until I
>continue the stopped job.  I expected Isis to eventually decide that
>the stopped participant had failed and continue without it.  Worse, if
>I wait too long before allowing the stopped process to continue, the
>system never seems to recover at all.  Have I misunderstood something
>crucial?  This is on a sun4/260 under 4.0.3.

I guess the answer is that you haven't exactly misunderstood anything, but
rather were missing some crucial information about just what grid actually
does.

The basic grid algorithm is as follows.  For each grid instance we create
one ISIS task that picks a cell to update and multicasts an appropriate
message.  When this arrives, one of the recipients will do an additional
update, but NOT necessarily the one that did the first update.  In this manner,
a single computational thread of the grid system can wander around the
group of grid processes, first multicasting in one, then in another, etc.
In particular, grid doesn't have anything like a "for loop" in which one
task in one process would issue lots and lots of multicasts.
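To make the "wandering thread" concrete: here is a toy Python simulation of the control flow Ken describes (this is purely illustrative and does not use the real ISIS API; `GridProcess`, `deliver`, and `multicast` are made-up names for the sketch). Each "multicast" is seen by every member, but only one recipient, not necessarily the sender, forks the task that issues the next update:

```python
import random

class GridProcess:
    """Toy stand-in for one grid instance (NOT the real ISIS API)."""
    def __init__(self, name):
        self.name = name
        self.stopped = False      # models SIGTSTP: messages arrive, nothing acts

    def deliver(self, cell, responsible):
        # Every member sees the update; only the chosen member continues.
        if responsible is self and not self.stopped:
            return self.update_task(cell)
        return None

    def update_task(self, cell):
        # One short-lived task: pick the next cell to "multicast".
        return (cell + 1) % 64

def multicast(group, cell):
    """Deliver one update; any recipient may become responsible for the next."""
    responsible = random.choice(group)
    nxt = None
    for p in group:
        r = p.deliver(cell, responsible)
        if r is not None:
            nxt = r
    return nxt                    # None means the thread of control stalled

group = [GridProcess(f"grid{i}") for i in range(3)]
cell, steps = 0, 0
while cell is not None and steps < 100:
    cell = multicast(group, cell)
    steps += 1
print("updates performed:", steps)
```

If you set `group[1].stopped = True` before the loop, the random choice soon lands on the stopped member, no one issues the next update, and the whole chain stalls: exactly the hang Scott observed when he control-Z'd one grid.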

One implication of this is that grid isn't quite as fast as it could be,
since we do a task create for each multicast in this model.  Another
is that if you cause one of the grid processes to hang, eventually all
the active multicast tasks expect this process to do the next multicast
and hence grid stops doing updates.  A third is that if you use the
start/stop button under V1.3.1, grid may be slower after you restart
it than before: in some situations it "loses" some of the update
tasks when this is done (we are fixing this particular problem in V2.0).

Now, as explained in the ISIS manual, ISIS doesn't try to detect software
crashes in application processes; without knowing what your code is doing,
it has no way to figure out if you are healthy or not.  For all ISIS knows,
grid was actually healthy, or perhaps you might be running dbx on it
to debug some problem in a sequenced way (this works).  So, from ISIS's
perspective, a process is up until it actually exits.

ISIS does know enough about "protos" to notice if protos is down, so
killing protos (using SIGSTOP if you like) will trigger the failure
detector, after a delay of about 3*f seconds, where 'f' is the delay
factor protos was given at startup (f=60 by default).  If you use a
SIGSTOP, wait until the failure is detected, and then do a SIGCONT,
protos will notice that "I am dead" and will commit suicide.
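The timing rule above can be illustrated with a toy check (the numbers come from the posting; `declared_dead` is an invented name, not protos's actual code):

```python
# Illustrative only: protos is declared faulty after ~3*f seconds
# with no sign of life, where f is the startup delay factor.
f = 60                       # protos default delay factor, per the posting

def declared_dead(last_heartbeat, now, factor=f):
    """Toy failure detector: silent for more than 3*f seconds => faulty."""
    return (now - last_heartbeat) > 3 * factor

# protos stopped with SIGSTOP at t=0; it falls silent.
assert not declared_dead(0, 100)   # 100s < 180s: still considered up
assert declared_dead(0, 200)       # 200s > 180s: declared faulty

# After a SIGCONT at this point, the resumed protos learns it was
# excluded and exits ("commits suicide") rather than rejoin with
# stale state.
```

So with the default f=60, expect roughly a three-minute delay before a stopped protos is dropped from the system.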

Hope this clears things up.

Ken