ken@gvax.cs.cornell.edu (Ken Birman) (06/17/89)
This is in response to someone who asked for more details on the "symptoms" of the vsync bug we just fixed.

The bug was in pr_addr.c; a line that read if(used_view)... should have read if(!used_view)... The effect was that ISIS wasn't doing read/write locking on process group addresses. Somehow I deleted the '!' operator the other day when doing some code rearrangement. The result was that messages were sometimes delivered in the wrong view.

The symptoms of this were several:

1) The grid programs sometimes showed different states when stopped. This is really awful when doing demos...
2) pg_join sometimes hung
3) cbcast sometimes went into an infinite loop filling a log with messages about "look at 3"
4) it was possible to get a panic "shr_gunlock: wasn't locked"
5) gbcast sometimes hung

With the bug fixed, ISIS is back to its old self. We pounded on the system quite a bit and nothing upsetting happened.

In a sense, this is encouraging; the point is that ISIS itself depends rather strongly on virtual synchrony, and things that break virtual synchrony show up in extreme ways rather quickly. This makes it reasonably likely that we will notice and be able to fix ISIS bugs when they occur...

Ken