ken@gvax.cs.cornell.edu (Ken Birman) (08/10/89)
-- INFINITE LOOP --

In the "frozen" release of ISIS V1.2, there is a bug that causes bin/protos to go into an infinite loop:

SYMPTOM: During startup of a site, say site 32, another site crashes. Protos at site 32 suddenly goes into an infinite loop and becomes totally unresponsive.

CAUSE: In pr_fdect.c, the code looked at the current view before it was actually committed.

FIX: Edit protos/pr_fdect.c. Modify from line 886 as follows:

    886:    if (recov_coordinator)
    NEW->   {
    887:        for (s_id = failed_sites; *s_id; s_id++)
    888:            if (*s_id == recov_coordinator)
    889:                panic ("site restart.... ");
    NEW->       return;
    NEW->   }
            ... etc ...

REASONING: The code in question is used by a "non-coordinator" to inform the failure-detector coordinator that a site failure has been noticed. However, when it is run by a site that is recovering, there is a period when a failure might be noticed but the current view is not yet committed, and hence not yet defined. In this case, the recovering site should not attempt to inform the coordinator of the failure it thinks it has noticed.

This fix will be included in ISIS release 2.0 next fall.

-- MINOR ANNOYANCE --

While you are editing pr_fdect.c, you might as well make a second change. This one allows the protocol to deal with a failure/recovery sequence that used to make it throw up its hands and panic. (Haven't noticed this? Well, it isn't easy to provoke; you need 6 or 7 sites and must simultaneously bring some up and others down...)

SYMPTOM: During startup of a site, say site 32, another site crashes. The coordinator site suddenly panics ("5 attempts to install a new view have failed"). Then all the other sites panic too, with messages about being in a minority partition or about the "coordinator for my recovery is not responding".

CAUSE: In pr_fdect.c, site-ids were compared in full, incarnation number included, so a recovering site's id could fail to match its entry in the proposed site-id list.

FIX: Edit protos/pr_fdect.c.
Modify line 855 as follows:

    854:    for (ss_id = proposed_slist; *ss_id; ss_id++)
    855:        if (*s_id == *ss_id)
    856:        {
    857:            if (!qu_find (pending_failures, *s_id))

becomes:

    OLD:    for (ss_id = proposed_slist; *ss_id; ss_id++)
    NEW->       if (SITE_NO(*s_id) == SITE_NO(*ss_id))
    OLD:        {
    OLD:            if (!qu_find (pending_failures, *s_id))

REASONING: The code in question is used by a "coordinator" to react when a failure is detected while running the protocol to install a new view. Unfortunately, when a site is recovering, the proposed site-id list shows it with a different incarnation number than the caller of fd_seemdead() may have used, hence the id wasn't found and the protocol hangs.
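
For anyone who wants to see why the comparison on line 855 missed, here is a small stand-alone sketch. It is not ISIS code: the type demo_site_id, the macros DEMO_MAKE_ID and DEMO_SITE_NO, and the bit layout (site number in the low byte, incarnation in the next byte) are all assumptions made up for the illustration, and the real SITE_NO() in the ISIS headers may be defined differently. It only shows that a full-id comparison fails when the incarnation differs, while comparing site numbers alone still finds the site.

```c
#include <assert.h>

/* Hypothetical site-id layout, NOT taken from the ISIS headers:
   low byte = site number, next byte = incarnation number. */
typedef unsigned short demo_site_id;

#define DEMO_MAKE_ID(no, incarn) \
    ((demo_site_id)((((incarn) & 0xFF) << 8) | ((no) & 0xFF)))
#define DEMO_SITE_NO(id) ((id) & 0xFF)

/* Old-style search: compares the whole id, incarnation included.
   The list is terminated by a zero entry. */
int find_exact(const demo_site_id *list, demo_site_id id)
{
    const demo_site_id *p;
    for (p = list; *p; p++)
        if (*p == id)
            return 1;
    return 0;
}

/* New-style search: compares site numbers only, ignoring incarnation. */
int find_by_site_no(const demo_site_id *list, demo_site_id id)
{
    const demo_site_id *p;
    for (p = list; *p; p++)
        if (DEMO_SITE_NO(*p) == DEMO_SITE_NO(id))
            return 1;
    return 0;
}

/* Scenario from the post: site 32 is recovering, so the proposed list
   carries it with a newer incarnation than the one the caller reports.
   Returns 1 if the exact search misses it and the site-number search
   finds it. */
int demo_incarnation_mismatch(void)
{
    demo_site_id proposed[] = { DEMO_MAKE_ID(7, 1), DEMO_MAKE_ID(32, 3), 0 };
    demo_site_id reported = DEMO_MAKE_ID(32, 2);

    assert(DEMO_SITE_NO(reported) == 32);
    return !find_exact(proposed, reported)
        && find_by_site_no(proposed, reported);
}
```

Calling demo_incarnation_mismatch() from a test main() returns 1 under the assumed layout. One consequence of matching on site number alone is that a report about an old incarnation is treated as applying to the current one; the post's fix accepts that for this code path.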