schwartz@shire.cs.psu.edu (Scott E. Schwartz) (03/02/90)
Hi all,

Playing with Isis I've noticed something which surprised me. If I run several instances of the grid demo I can kill and restart individual processes with no problem. But if I send SIGTSTP (i.e., control-Z) to one of them, they all hang, seemingly forever, or until I continue the stopped job. I expected Isis to eventually decide that the stopped participant had failed and continue without it. Worse, if I wait too long before allowing the stopped process to continue, the system never seems to recover at all. Have I misunderstood something crucial? This is on a sun4/260 under 4.0.3.

--
Scott Schwartz schwartz@cs.psu.edu
"the same idea is applied today in the use of slide rules." -- Don Knuth
ken@gvax.cs.cornell.edu (Ken Birman) (03/03/90)
In article <Epmj=c1@cs.psu.edu> schwartz@shire.cs.psu.edu (Scott E. Schwartz) writes:
>
>Hi all,
> Playing with Isis I've noticed something which surprised me.
>If I run several instances of the grid demo I can kill and restart
>individual processes with no problem. But if I send SIGTSTP (i.e,
>control-Z) to one of them, they all hang, seemingly forever, or until I
>continue the stopped job. I expected Isis to eventually decide that
>the stopped participant had failed and continue without it. Worse, if
>I wait too long before allowing the stopped process to continue, the
>system never seems to recover at all. Have I misunderstood something
>crucial? This is on a sun4/260 under 4.0.3.

I guess the answer is that you haven't exactly misunderstood anything, but rather were missing some crucial information about just what grid actually does.

The basic grid algorithm is as follows. For each grid instance we create one ISIS task that picks a cell to update and multicasts an appropriate message. When this arrives, one of the recipients will do an additional update, but NOT necessarily the one that did the first update. In this manner, a single computational thread of the grid system can wander around the group of grid processes, first multicasting in one, then in another, etc. In particular, grid doesn't have anything like a "for loop" in which one task in one process would issue lots and lots of multicasts.

One implication of this is that grid isn't quite as fast as it could be, since we do a task create for each multicast in this model. Another is that if you cause one of the grid processes to hang, eventually all the active multicast tasks expect this process to do the next multicast, and hence grid stops doing updates. A third is that if you use the start/stop button under V1.3.1, grid may be slower after you restart it than before: in some situations it "loses" some of the update tasks when this is done (we are fixing this particular problem in V2.0).
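The stall described above can be illustrated with a toy simulation. This is only a sketch and not the grid source: the function name `run_grid`, the uniform random choice of the next updater, and the step-based timing are all assumptions made for illustration. Each update thread is modeled as a token naming the process expected to issue the next multicast; a token that lands on a stopped process never moves again.

```python
import random

def run_grid(n_procs, stopped, max_steps, seed=0):
    """Toy model of grid's wandering update threads (hypothetical, not ISIS code).

    Returns a list with the number of updates performed at each step.
    """
    rng = random.Random(seed)
    # One update thread per grid instance, initially owned by that instance.
    holders = list(range(n_procs))
    updates_per_step = []
    for _ in range(max_steps):
        done = 0
        for i, proc in enumerate(holders):
            if proc in stopped:
                # A stopped process never issues its multicast,
                # so this thread is stuck waiting on it.
                continue
            done += 1
            # Assumption: some recipient, chosen at random, does the next update.
            holders[i] = rng.randrange(n_procs)
        updates_per_step.append(done)
    return updates_per_step

if __name__ == "__main__":
    print(run_grid(3, stopped=set(), max_steps=10))
    print(run_grid(3, stopped={1}, max_steps=200)[-5:])
```

With no stopped process every thread keeps making progress; with one process stopped, the threads drift onto it one by one until no updates happen at all, which matches the hang seen after control-Z.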
Now, as explained in the ISIS manual, ISIS doesn't try to detect software crashes in application processes; without knowing what your code is doing, it has no way to figure out if you are healthy or not. For all ISIS knows, grid was actually healthy, or perhaps you might be running dbx on it to debug some problem in a sequenced way (this works). So, from ISIS's perspective, a process is up until it actually exits.

ISIS does know enough about "protos" to notice if protos is down, so killing protos (using SIGSTOP if you like) will trigger the failure detector, after a delay of about 3*f seconds, where 'f' is the delay factor protos was given at startup (f=60 by default). If you use a SIGSTOP, wait until the failure is detected, and then do a SIGCONT, protos will notice that "I am dead" and will commit suicide.

Hope this clears things up.

Ken
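Two points from the reply can be checked with a short sketch; none of it is ISIS code, and the helper names are hypothetical. The first function shows that a process stopped with SIGSTOP has not exited as far as the operating system is concerned, which is why an "up until it exits" rule treats it as alive. The second just works out the quoted detection delay of about 3*f seconds. It assumes a Unix-like system.

```python
import signal
import subprocess
import sys
import time

def stopped_child_still_running():
    """Stop a child with SIGSTOP and report whether it still counts as not exited."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    try:
        child.send_signal(signal.SIGSTOP)
        time.sleep(0.2)
        # poll() returns None while the child has not exited; being stopped is not exiting.
        still_up = child.poll() is None
        child.send_signal(signal.SIGCONT)
        return still_up
    finally:
        child.kill()
        child.wait()

def approx_detection_delay(f=60):
    """Approximate protos failure-detection delay in seconds, per the 3*f rule quoted above."""
    return 3 * f

if __name__ == "__main__":
    print(stopped_child_still_running())
    print(approx_detection_delay())
```

With the default f=60, the delay works out to roughly 180 seconds, so a stopped protos would need to stay stopped for about three minutes before the failure detector fires.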