ken@gvax.cs.cornell.edu (Ken Birman) (08/10/89)
-- INFINITE LOOP --
In the "frozen" release of ISIS V1.2, there is a bug that causes
bin/protos to go into an infinite loop:
SYMPTOM: During startup of a site, say site 32, another site crashes.
Protos at site 32 suddenly goes into an infinite loop and
becomes totally unresponsive.
CAUSE: Code in pr_fdect.c looked at the current view before it was
actually committed.
FIX: Edit protos/pr_fdect.c. Modify from line 886 as follows:
886:      if (recov_coordinator)
NEW->     {
887:          for (s_id = failed_sites; *s_id; s_id++)
888:              if (*s_id == recov_coordinator)
889:                  panic ("site restart.... ");
NEW->         return;
NEW->     }
          ... etc ...
REASONING: The code in question is used by a "non-coordinator" to inform
the failure-detector coordinator that a site-failure has been
noticed. However, when run by a site that is recovering,
there is a period when a failure might be noticed but the
current view is not yet committed, and hence not yet defined. In
this case, the recovering site should not attempt to inform
the coordinator of the failure it thinks it has noticed.
This fix will be included in ISIS release 2.0 next fall.
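
For anyone who wants to see the guard in isolation, here is a minimal
standalone sketch of the decision the patched code makes. It is not the
pr_fdect.c source: the site_id typedef, the fd_action enum and the
fd_decide() function are hypothetical names made up for illustration, and
it assumes that recov_coordinator is nonzero only while the site is still
recovering (no view committed yet).

```c
#include <stddef.h>

typedef unsigned short site_id;   /* hypothetical; 0 terminates a list */

/* Hypothetical outcomes for a non-coordinator that notices failures. */
enum fd_action { FD_REPORT, FD_SUPPRESS, FD_PANIC };

/*
 * Assumption: recov_coordinator is nonzero only while this site is
 * still recovering, i.e. before its first view has been committed.
 */
enum fd_action
fd_decide(site_id recov_coordinator, const site_id *failed_sites)
{
    const site_id *s_id;

    if (recov_coordinator) {
        for (s_id = failed_sites; *s_id; s_id++)
            if (*s_id == recov_coordinator)
                return FD_PANIC;    /* our recovery coordinator died */
        return FD_SUPPRESS;         /* no committed view yet: stay quiet */
    }
    return FD_REPORT;               /* normal case: tell the coordinator */
}
```

The point of the early return is the FD_SUPPRESS case: a recovering site
that has noticed some unrelated failure simply does nothing, instead of
consulting a view that does not exist yet.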
-- MINOR ANNOYANCE --
While you are editing pr_fdect.c, you might as well make a second
change. This one allows the protocol to deal with a failure/recovery
sequence that used to make it throw up its hands and panic. (Haven't
noticed this? Well, it isn't easy to provoke; you need 6 or 7 sites
and must simultaneously bring some up and others down...)
SYMPTOM: During startup of a site, say site 32, another site crashes.
The coordinator site suddenly panics ("5 attempts to install
a new view have failed"). Then all the other sites panic too,
with messages about being in a minority partition or about
the "coordinator for my recovery is not responding".
CAUSE: In pr_fdect.c, site-ids were compared with their incarnation
numbers included, so the id of a recovering site was not found
in the proposed site list.
FIX: Edit protos/pr_fdect.c. Modify line 855 as follows:
854:      for (ss_id = proposed_slist; *ss_id; ss_id++)
OLD:          if (*s_id == *ss_id)
NEW->         if (SITE_NO(*s_id) == SITE_NO(*ss_id))
856:          {
857:              if (!qu_find (pending_failures, *s_id))
REASONING: The code in question is used by a "coordinator" to react when
a failure is detected while running the protocol to install
a new view. Unfortunately, when a site is recovering, the
proposed site-id list shows it with a different incarnation
number than the caller of fd_seemdead() may have used; hence
the id isn't found and the protocol hangs.
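
To make the incarnation problem concrete, here is a small sketch that
contrasts the two comparisons. The bit layout (site number in the high
byte, incarnation in the low byte), the MAKE_SITE_ID macro and the two
slist_find_*() helpers are assumptions made up for this example; only the
name SITE_NO comes from the fix above, and its real definition in ISIS may
differ.

```c
typedef unsigned short site_id;   /* hypothetical; 0 terminates a list */

/* Assumed layout: site number in high byte, incarnation in low byte. */
#define MAKE_SITE_ID(no, incarn) ((site_id)(((no) << 8) | ((incarn) & 0xFF)))
#define SITE_NO(s)               (((s) >> 8) & 0xFF)

/* Old behaviour: match only if number AND incarnation agree. */
int
slist_find_exact(const site_id *slist, site_id s)
{
    for (; *slist; slist++)
        if (*slist == s)
            return 1;
    return 0;
}

/* New behaviour: match on site number alone, ignoring incarnation. */
int
slist_find_site(const site_id *slist, site_id s)
{
    for (; *slist; slist++)
        if (SITE_NO(*slist) == SITE_NO(s))
            return 1;
    return 0;
}
```

With a proposed list holding site 32 at incarnation 2 and a caller that
reports site 32 at incarnation 1, slist_find_exact() misses the entry while
slist_find_site() finds it, which is the case the one-line change is meant
to cover.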