[comp.sys.isis] Two bugs in the ISIS failure detector

ken@gvax.cs.cornell.edu (Ken Birman) (08/10/89)

-- INFINITE LOOP --

In the "frozen" release of ISIS V1.2, there is a bug that causes
bin/protos to go into an infinite loop:

SYMPTOM:  During startup of a site, say site 32, another site crashes.
          Protos at site 32 suddenly goes into an infinite loop and
          becomes totally unresponsive.

CAUSE:    Code in pr_fdect.c looked at the current view before it was
          actually committed.

FIX:      Edit protos/pr_fdect.c.  Modify from line 886 as follows:

886:    if (recov_coordinator)   
NEW->   { 
887:        for (s_id = failed_sites; *s_id; s_id++)
888:            if (*s_id == recov_coordinator)
889:                panic ("site restart.... ");
NEW->       return; 
NEW->   }
            ... etc ...

REASONING: The code in question is used by a "non-coordinator" to inform
           the failure-detector coordinator that a site-failure has been
           noticed.  However, when run by a site that is recovering,
           there is a period when a failure might be noticed but the
           current view is not yet committed, and hence not yet
           defined.  In this case, the recovering site should not
           attempt to inform the coordinator of the failure it thinks
           it has noticed.
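
The decision the patched code makes can be sketched as a small standalone
function.  This is only an illustration: fd_decide() and enum fd_action are
made-up names, not part of ISIS.  The sketch assumes that a nonzero
recov_coordinator means the site is still recovering, and it returns a code
where protos itself would call panic().

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short site_id;     /* hypothetical type; 0 terminates a list */

/* Hypothetical result codes, for illustration only. */
enum fd_action {
    FD_REPORT_TO_COORDINATOR,       /* view is committed: report as usual */
    FD_SUPPRESS,                    /* still recovering: report nothing */
    FD_RESTART                      /* our recovery coordinator has failed */
};

/*
 * Decide what a non-coordinator should do with a list of failed sites.
 * Assumption: recov_coordinator != 0 means this site is still recovering,
 * so its current view is not yet committed.
 */
enum fd_action fd_decide(site_id recov_coordinator, const site_id *failed_sites)
{
    const site_id *s_id;

    assert(failed_sites != NULL);

    if (recov_coordinator == 0)
        return FD_REPORT_TO_COORDINATOR;

    /* The real code panics here ("site restart"); the sketch returns a code. */
    for (s_id = failed_sites; *s_id; s_id++)
        if (*s_id == recov_coordinator)
            return FD_RESTART;

    /* The early return added by the patch: do not touch the uncommitted view. */
    return FD_SUPPRESS;
}
```

With failed_sites = {7, 0}, a site that is not recovering reports as usual,
a site being recovered by site 7 restarts, and a site being recovered by
site 3 stays quiet.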

This fix will be included in ISIS release 2.0 next fall.

-- MINOR ANNOYANCE --

While you are editing pr_fdect.c, you might as well make a second
change.  This one allows the protocol to deal with a failure/recovery
sequence that used to make it throw up its hands and panic.  (Haven't
noticed this?  Well, it isn't easy to provoke; you need 6 or 7 sites
and must simultaneously bring some up and others down...)

SYMPTOM:  During startup of a site, say site 32, another site crashes.
          The coordinator site suddenly panics ("5 attempts to install
          a new view have failed").  Then all the other sites panic too,
          with messages about being in a minority partition or about
          the "coordinator for my recovery is not responding".

CAUSE:    Code in pr_fdect.c compared full site-ids, including the
          incarnation number, so a recovering site was not found in
          the proposed site list.

FIX:      Edit protos/pr_fdect.c.  Modify line 855 as follows:

854:      for (ss_id = proposed_slist; *ss_id; ss_id++)
855:          if (*s_id == *ss_id)
856:          {
857:              if (!qu_find (pending_failures, *s_id))

OLD:      for (ss_id = proposed_slist; *ss_id; ss_id++)
NEW->         if (SITE_NO(*s_id) == SITE_NO(*ss_id))
OLD:          {
OLD:              if (!qu_find (pending_failures, *s_id))

REASONING: The code in question is used by a "coordinator" to react when
           a failure is detected while running the protocol to install
           a new view.  Unfortunately, when a site is recovering, the
           proposed site-id list shows it with a different incarnation
           number than the caller of fd_seemdead() may have used, hence
           the id wasn't found and the protocol hung.
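
To see why the comparison matters, here is a small sketch that contrasts
the two tests.  The bit layout is an ASSUMPTION made for illustration (site
number in the low byte, incarnation in the high byte); the real SITE_NO()
definition lives in the ISIS headers and may differ.  SITE_INCARN(),
MAKE_SITE_ID() and the two slist_* helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short site_id;     /* hypothetical type; 0 terminates a list */

/* ASSUMED layout, for illustration only: low byte = site number,
 * high byte = incarnation number. */
#define SITE_NO(s)              ((s) & 0xff)
#define SITE_INCARN(s)          (((s) >> 8) & 0xff)
#define MAKE_SITE_ID(no, inc)   ((site_id)((((inc) & 0xff) << 8) | ((no) & 0xff)))

/* Old test: match only if site number AND incarnation both agree. */
int slist_has_exact(const site_id *slist, site_id s)
{
    const site_id *ss_id;

    assert(slist != NULL);
    for (ss_id = slist; *ss_id; ss_id++)
        if (*ss_id == s)
            return 1;
    return 0;
}

/* New test: match on site number alone, whatever the incarnation. */
int slist_has_site(const site_id *slist, site_id s)
{
    const site_id *ss_id;

    assert(slist != NULL);
    for (ss_id = slist; *ss_id; ss_id++)
        if (SITE_NO(*ss_id) == SITE_NO(s))
            return 1;
    return 0;
}
```

If the proposed list holds site 32 at incarnation 4 and the caller reports
site 32 at incarnation 3, slist_has_exact() misses it while
slist_has_site() finds it.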