[comp.sys.isis] A "property" of the grid program

ken@gvax.cs.cornell.edu (Ken Birman) (09/28/90)
>From: aprakash@dost.eecs.umich.edu (Atul Prakash)
>Subject: Join problem in grid demo.


>I am trying out the grid demo with V2.1 general release with BYPASS
>enabled. I notice that if one member is already running, then any
>attempts to make another member join do not seem to succeed until
>the running member is stopped. Apparently, isis_accept_inputs()
>is not working as it should in allowing other processes to join the
>group. Is this a bug?

>I was running both grid processes on the same machine (SUN4). I believe that
>the problem may have to do with the BYPASS mechanism -- I remember that
>joins used to succeed in the beta release compiled with BYPASS disabled.
>Thanks.

This is actually not a bug.  First, note that if you press the stop
button to let the join finish, the joining process gets right in.

It turns out that the OLD behavior was actually incorrect, and that
it could definitely allow executions that are not virtually synchronous.
If you go back to V2.1 beta and play with this, you can get the grid out
of sync, and the reason is closely tied to the trick that let the join
squeeze through.

With a single member running this way, grid is effectively in an infinite
loop: on reception of each multicast, it immediately issues a new one.  The
effect is that ISIS always ends up with a message on the delivery queue
or a task on the runqueue in this application.  ISIS used to put up with
this for a while and then, after a hundred or so rounds, let the join
squeeze in.  But this algorithm was technically incorrect.  When grid is
behaving this way, the "correct" thing to do is to let the execution
schedule continue in FIFO order.  For all ISIS knows, the order might
matter.  Calling isis_accept_events() doesn't help, since ISIS knows
about the pending join -- it just isn't allowed to "schedule" it.

It is easy to change the demo program to detect when a join is in
progress:

1) change all calls to update() to call "start_update()"

2) code start_update() like this:

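start_update()
  {
	/* Sole member with a join waiting: defer the next update via a
	   timeout instead of issuing it at once, so the join can be
	   scheduled. */
	if (n_memb == 1 && (isis_state & ISIS_WJOIN))
	    isis_timeout(1000, update, 0);
	else
	    update();
  }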

With this change, grid should work fine.  But the key point is that the
change alters the algorithm: it breaks the "infinite loop" that grid
explicitly contains as currently coded.  Hitting the stop button has the
same effect.
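
To make the scheduling argument concrete, here is a toy model of it in C.
It is only a sketch and not ISIS code: the function updates_before_join()
and its parameters are hypothetical, and it assumes simply that a pending
join can be scheduled only once the delivery queue has drained, and that
each update handler either issues the next update at once or defers it
when a join is waiting.

```c
#include <stdbool.h>

/*
 * Toy model (hypothetical, not the ISIS scheduler).  Returns how many
 * updates were delivered before the pending join could be scheduled, or
 * -1 if the join is still waiting after max_rounds deliveries.
 *
 *   defer_on_join   - true models the start_update() change, false models
 *                     grid as currently coded
 *   join_arrives_at - number of deliveries after which the join shows up
 */
int updates_before_join(bool defer_on_join, int join_arrives_at,
                        int max_rounds)
{
    int queued = 1;            /* one update multicast already queued */
    int delivered = 0;
    bool join_pending = false;

    while (delivered < max_rounds) {
        if (delivered == join_arrives_at)
            join_pending = true;

        /* The queue drains only after a handler deferred its update,
           which happens only while a join is pending; the join can
           now be scheduled. */
        if (queued == 0)
            return delivered;

        /* Deliver the update at the head of the FIFO queue. */
        queued--;
        delivered++;

        /* Handler: issue the next update immediately, unless the fix
           is in place and a join is waiting (timer armed instead). */
        if (!(defer_on_join && join_pending))
            queued++;
    }
    return -1;                 /* queue never drained: join starved */
}
```

In this model, updates_before_join(false, 3, 1000) returns -1, since the
queue is never empty, while updates_before_join(true, 3, 1000) returns 4:
the update already queued when the join arrives is delivered, its handler
defers, and the join gets in.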

Maybe grid should refuse to run with just a single member in cbcast/bypass
mode?
 
Normal services should never encounter this problem, which arises only
because grid in bypass cbcast mode has this infinite loop built into it.
The issue is avoided when there are multiple members running because of
the way that grid moves the source of updates around from member to
member.

Ken