[comp.sys.isis] Large group structures and other ideas

ken@gvax.cs.cornell.edu (Ken Birman) (06/15/90)

I received some interesting questions about structuring large
applications under ISIS and want to summarize them and the apparent
answers.

Basically, the issue concerns how best to structure:
  1) Applications with a small group of processes that multicast to a
     huge number of processes, e.g. a tickerplant in a stock brokerage.
  2) Applications with a huge number of "clients" that make periodic
     requests and need a rapid reply, fault-tolerantly.
  3) Applications with a service that spreads over huge numbers of sites.
The problem here is that all of these need to somehow be shoehorned into
the bypass mechanism for good performance.

Roughly, here's what we recommend and are planning to do about it:
   1) In this case, we will probably implement some sort of a "diffusion
      group tool".  We'll need a way to be told which processes are passive
      and which are active; a strong possibility is to extend the pg_client
      interface for this purpose.  The tool will try to use ethernet
      multicast and other tricks to give high performance.  Expect to see
      something with this structure towards the end of this year.  The
      UDP bypass scheme won't work well in this case because it sends one
      packet per destination, which gets slow as the number of
      destinations grows.
   2) In this case, we will probably go with large numbers of small groups.
      For example (again using pg_client), we might create a small group
      with the two servers and one client for each client who wants to use
      the service.  This results in an acyclic communication graph and
      hence all the clients can efficiently multicast to the servers, which
      can operate redundantly or use a coord-cohort scheme.  Again, expect
      to see support for this in place sometime late this summer or in fall.
   3) This is the case covered by Robert Cooper's hierarchical group tool.
      This should be out in the same time frame.  
We are writing a paper on scaling in ISIS that should cover these sorts
of issues in detail.
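
To make cases 1 and 2 a little more concrete, here is a rough sketch in
plain C.  None of these names are ISIS calls; they are hypothetical and
exist only for illustration.  The first two functions count the packets
a sender emits for one multicast: one per receiver for a per-destination
transport (as described for the UDP bypass scheme), versus a single
frame for hardware multicast, assuming every receiver sits on one
ethernet segment.  The last function lays out the per-client group of
case 2, assuming the two servers are numbered 0 and 1 and client i is
numbered 2 + i.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration only; not part of ISIS. */

/* Case 1, per-destination transport: one packet per receiver. */
unsigned long packets_per_destination(unsigned long receivers)
{
    return receivers;
}

/* Case 1, hardware multicast: assumes all receivers share a single
 * ethernet segment, so one frame reaches every one of them. */
unsigned long packets_hw_multicast(unsigned long receivers)
{
    return receivers > 0 ? 1UL : 0UL;
}

/* Case 2: one small group per client, holding both servers and
 * that one client. */
enum { SERVERS = 2, GROUP_SIZE = SERVERS + 1 };

struct small_group {
    int members[GROUP_SIZE];
};

/* Assumed numbering: servers are 0 and 1, client i is SERVERS + i. */
void make_client_group(int client_index, struct small_group *g)
{
    g->members[0] = 0;                      /* first server  */
    g->members[1] = 1;                      /* second server */
    g->members[2] = SERVERS + client_index; /* the client    */
}
```

With 5000 receivers the per-destination count is 5000 packets per
multicast against 1 for hardware multicast under the assumption above.
In the case 2 layout every group has three members no matter how many
clients exist, so the group count grows with the clients but no single
group gets large.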

On a related problem, I was asked about an application that needs to
react very quickly to failures, rolling over to a backup mode in less
time than ISIS normally takes to notice failures.

For such cases, I would maintain a replicated system configuration data
structure indicating the "best server" for a given type of request, as
well as other system configuration data.  On detecting an apparent problem,
I recommend that you update this replicated structure and use the update
to trigger application-level rollover.  Later, tagging along, ISIS will
notice the failure (if it was a failure) and update things to catch up.
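
A minimal sketch of that idea in plain C follows.  All names are
hypothetical and none are ISIS calls.  It shows only the local data
structure and the two kinds of update; in a real system each update
would have to be applied at every replica in the same order, which this
sketch does not attempt.  It also assumes the "best server" is simply
the lowest-numbered server that is neither suspected nor confirmed
failed.

```c
#include <string.h>

/* Hypothetical configuration record; not an ISIS data structure. */
#define NSERVERS 3

struct sys_config {
    unsigned version;               /* bumped on every update           */
    int best_server;                /* preferred server, -1 if none     */
    int suspected[NSERVERS];        /* set by the application itself    */
    int confirmed_failed[NSERVERS]; /* set when the slower detector
                                       catches up                       */
};

/* Lowest-numbered server that is neither suspected nor failed. */
static int pick_best(const struct sys_config *c)
{
    int i;
    for (i = 0; i < NSERVERS; i++)
        if (!c->suspected[i] && !c->confirmed_failed[i])
            return i;
    return -1;
}

void config_init(struct sys_config *c)
{
    memset(c, 0, sizeof *c);
    c->best_server = pick_best(c);
}

/* Application noticed an apparent problem: roll over right away. */
void config_suspect(struct sys_config *c, int server)
{
    c->suspected[server] = 1;
    c->version++;
    c->best_server = pick_best(c);
}

/* The slower failure detector reports in: record a real failure, or
 * clear the suspicion if it was a false alarm. */
void config_confirm(struct sys_config *c, int server, int really_failed)
{
    c->suspected[server] = 0;
    c->confirmed_failed[server] = really_failed ? 1 : 0;
    c->version++;
    c->best_server = pick_best(c);
}
```

Calling config_suspect() moves requests to the next server at once;
config_confirm() later either makes that permanent or, on a false
alarm, lets the original server become eligible again.  Whether to
switch back automatically after a false alarm is a design choice this
sketch makes for simplicity.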

Finally, I received a suggestion that we include a bcast timeout option:
	bcast_l("T2....", ....)
i.e. time out after 2 seconds.  Thanks for this idea!  We'll add it soon.

Ken