[comp.sys.isis] Fast causal multicast

ken@gvax.cs.cornell.edu (Ken Birman) (11/14/90)

I received the following inquiry and thought it might interest
the readers of this group.  comp.sys.isis is unmoderated and this
topic is debatable...  Feel free to post counter-arguments. 

> From scott@amethyst.omg.ORG Tue Nov 13 15:16:30 1990
> To: ken@gvax.cs.cornell.edu
> Subject: Fast Causal Multicast - multiple overlapping process groups

> Ken:

> Don't know if you'll remember me (Scott Meyer).  I attended Fingerlakes '89.
> At the time your description of bypass CBCAST piqued my interest.  The
> life of an engineer being what it is, I haven't had the chance to read
> your "Fast Causal Multicast" paper until now.  I do have some comments to
> which I'd be interested in hearing your response.

> Early on, you state: "Particularly valuable to us has been the ability
> to support multiple, possibly overlapping process groups..."  Could
> you provide examples (or pointers to examples) that substantiate this?

> In particular, I am skeptical of the need to enforce causal ordering
> across overlapping process groups as not all applications may need it.
> It seems to me that making this assumption engenders a disproportionate
> amount of complexity (runtime graph analysis), while not making it
> does not prevent applications that need such functionality from
> obtaining it.  To wit: suppose the membership of the groups g1 and g2
> overlaps.  Suppose further that there is a set of messages, M, that
> we wish to deliver to both g1 and g2 with a causal ordering.  We can
> achieve this by creating a new group g3 = g1 U g2 and delivering
> all messages in M to g3.  This method does require distributed system
> developers to have a clear understanding of causal relationships that
> your bypass CBCAST protocol might save them the work of making explicit.
> The question in my mind is: "Is providing that extra service worthwhile
> given the extra (substantial) complexity that implementing it requires?
> If yes, why?"  Obviously, my own opinion is that it is not worthwhile
> to implement inter-group causal message ordering, and by extension that
> distributed system designers will have to have a good understanding of
> message causality (and a variety of other issues) in order to successfully
> implement a distributed system.

> Thanks in advance for your response.  Regards,

> Scott Meyer
> Netwise Inc.
> 2477 55th St.
> Boulder, CO  80301
> tel 303/442-8280

Scott,

We actually agree with your first point, but we also see a convincing
counterexample.  As for the business about the need to maintain and
reason about the communication graph, I think you misunderstood how that
mechanism works (we reason about the graph, but we don't need to maintain
it anywhere).  

Issue 1: causality over group boundaries.

Let me grant that if two groups are not supposed to "interfere" then enforcing
causality could create an undesired effect.

But, say that I build an application in the style that the LINDA people
advocate.

I have a computational thread that writes some parameters on a blackboard
(some stuff that you might not need to see, but if you do, it costs a lot
to get it wrong).  Then, it puts a bunch of work requests into a bag
of work requests.

Server programs pull out work requests and start computing.  They can check
the blackboard if desired.

Now, I would argue that:
 1) You need to do the blackboard update asynchronously (if it is replicated)
    because otherwise the delay associated with waiting for the update to get
    done might be noticeable and could even be a big problem (imagine a 1 BIP
    workstation in 1998 running over a comparatively slow link to an equally
    fast machine down the hall...).  So, the bboard update is asynchronous,
    and the task-bag update now creates a race condition.

    Specifically, will the server see the bboard update when it looks?
    With a causal guarantee, it will.  (A small sketch of this pattern
    appears after this list.)

 2) This is not an uncommon paradigm.  In fact, with OO programming becoming
    common, this may actually be the usual style of distributed computing in
    the future.  I have lots of other examples of this sort (you register an
    activity and then someone who sees it goes off and thinks, then sends in
    an "answer" that races the original registration to some third party).

 3) Fault-tolerance can create more extreme problems, where "holes in the past"
    become a real difficulty.  Here, the issue is as much due to replication as
    asynchrony: I may have seen the state of some replica which then crashed,
    as did the original sender.  Now I am talking to some other replica in a
    state that "observed" a "past" update... will that replica ever see the
    update?  Will it see it before it hears from me?

The RPC community has run into this too: the Mercury project at MIT raised
many similar questions...

So do database systems.  Say that I do an update, then a commit.  The commit
releases a lock and you acquire it and do a read.  Do you see my update?
Well, if we run synchronously, sure you do.  But, if we run asynchronously
(for a big speedup) the exact same race arises.  And, again, it is easy to
believe that the process group used for locking is separate from the one used
for the replicated data item itself.  So, we definitely see a cross-group
causality issue here, too.
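
The same sketch carries over, with the lock group and the replicated-data
group in place of the blackboard and the bag (again with a hypothetical
interface, not the ISIS one):

    /* Same race, database flavor.  The lock group and the replicated
     * data group are separate process groups.                          */
    #include <stddef.h>
    typedef int group_t;                         /* hypothetical        */
    extern group_t g_data, g_locks;
    extern void causal_mcast(group_t g, const void *msg, size_t len);

    void update_then_commit(const void *update, size_t len, int lock_id)
    {
        causal_mcast(g_data,  update,   len);            /* async write */
        causal_mcast(g_locks, &lock_id, sizeof lock_id); /* release lock */
        /* Whoever acquires the lock next and reads the data item will
         * see the write only if delivery is causal across both groups. */
    }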

Robert and I are writing a paper on precisely this -- a paper on redoing
ISIS that asks what the appropriate rules for process groups ought to be.
Our paper argues for causality classes: causality would be enforced within
multiple groups in a single class, but might be violated over class boundaries.
The idea is that most groups, by default, would be in some single class.
But, if you build a system with a realtime subcomponent, or some other
independent mechanism in it, the group(s) associated with that mechanism
would be explicitly placed in a different class and hence kept separate from
the ones in the base class.

There are a number of additional issues that we are also looking hard at:
  1) Why have the system manage groups at all?  Why not put this at the
     application level?
  2) Given groups, why not do multicast using parallel RPC?  (Or, perhaps,
     transactions, if you want atomicity)?
  3) What do groups usually "look like" and how big are they?
  4) How much synchronization is needed between group membership and multicast?
  5) Causality: see above
  6) Do we need the other ISIS protocols (ABCAST, etc, etc, etc, etc)?
  7) Is there a simple, clean architecture for all this?

We'll make copies of this available when we finish the paper.  

Issue 2: reasoning about the communication graph

Next, regarding your question about the complexity of the graph needed in
the multicast paper (now being revised for ACM/TOCS).  We had a nice 
idea on this issue.  We realized (and subsequently learned of a brief
paper on this by Prof. Singhal -- U. Minn.) that if A -> B and A carries
a field VTg, then B can omit VTg as long as VTg hasn't changed.  This makes it
possible to logically carry all VT's for all groups on all messages, while
actually transmitting much smaller amounts of information.  So, we are revising
the paper to move the combined VT/LT stuff into a "special issues" section,
and to focus on the VT scheme with this optimization as our main protocol.

In particular, this change essentially eliminates the need to do any sort of
runtime estimation of the CG graph.  Instead, you just piggyback VT's
all around the system, but usually don't turn out to send very many of them.
This protocol is also extremely simple, and I wonder if it doesn't push the
balance back in the other direction?
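
To give a feeling for how little bookkeeping is involved, here is a rough
sketch of the piggybacking rule.  The data structures, the per-destination
cache, and the bounds are simplifications of mine for this note, not the
protocol as it appears in the TR; the real protocol also has to cope with
membership changes, failures, and garbage collection.

    /* Minimal sketch of the piggybacking rule: a sender remembers, per
     * destination, the last value of each group timestamp it shipped,
     * and piggybacks VTg on the next message only if VTg has advanced. */

    #include <string.h>

    #define NGROUPS  8     /* groups this process belongs to (example)  */
    #define NMEMBERS 16    /* bound on members per group (example)      */
    #define NDESTS   64    /* bound on destinations (example)           */

    typedef struct { int entry[NMEMBERS]; } vtime;    /* one VTg        */

    static vtime vt[NGROUPS];                  /* my current timestamps */
    static vtime last_sent[NGROUPS][NDESTS];   /* per-destination cache */

    /* Append VTg to the outgoing message only when it has changed.     */
    int vts_to_piggyback(int dest, int gids_out[NGROUPS], vtime vts_out[NGROUPS])
    {
        int g, n = 0;
        for (g = 0; g < NGROUPS; g++) {
            if (memcmp(&vt[g], &last_sent[g][dest], sizeof(vtime)) != 0) {
                gids_out[n] = g;
                vts_out[n]  = vt[g];
                last_sent[g][dest] = vt[g];  /* remember what we shipped */
                n++;
            }
        }
        return n;   /* usually small, even though every VT is "carried" */
    }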

I should also note that in the combined VT/LT scheme, we don't maintain the
CG graph explicitly.

The current TR version of the paper includes schemes for deducing some simple 
information about the graph based on information directly accessible by
looking at groups to which you belong, and being told about the groups to
which other members belong.  We never actually go out and build the graph
or explicitly test it for cycles.  Thus, even if you do use the VT/LT scheme,
which remains interesting because it guarantees that you don't need to
piggyback certain timestamps at all, you certainly don't need any
information that wouldn't already be available locally -- and in particular,
nobody actually maintains a structure called the CG graph.  The idea is
more like proving that a system can't deadlock by reasoning about a wait-for
graph: even without building the graph, you may be able to say something
about how it would look if you did go ahead and build it.

ken@gvax.cs.cornell.edu (Ken Birman) (11/14/90)

>From scott@amethyst.omg.ORG Tue Nov 13 23:10:48 1990
>Subject: Re: fast causal broadcast

You write:

>  But, say that I build an application in the style that the LINDA people
>  advocate.
>
>  I have a computational thread that writes some parameters on a
>  blackboard (some stuff that you might not need to see, but if you do,
>  it costs a lot to get it wrong).  Then, it puts a bunch of work
>  requests into a bag of work requests.
>
>  Server programs pull out work requests and start computing.  They can
>  check the blackboard if desired.
>
>  Now, I would argue that:
>
>   1) [... ]  So, the bboard update is asynchronous, and the task bag
>      update now creates a race condition.
>
>      Specifically, will the server see the bboard update when it looks?
>      With a causal guarantee, it will.

It is exactly my contention that because there is this causal
relationship between the blackboard update and the work-request-bag
update the application developer must EXPLICITLY handle this
relationship by creating a process group containing the PG members from
both the blackboard server and the work-request-bag server.  The
original two PGs still exist; they may be used by clients that are not
concerned with inter-PG causality.  This scheme assumes that PG members
simply discard messages that they don't understand.
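
A sketch of what I have in mind, again with hypothetical calls (pg_join()
and pg_cbcast() are invented names, not anyone's actual interface):

    /* The explicit approach: a union group carries the messages whose
     * cross-group ordering matters, and members simply ignore message
     * types they don't handle.                                          */
    #include <stddef.h>
    typedef int group_t;
    extern group_t pg_join(const char *name);           /* join/create   */
    extern void    pg_cbcast(group_t g, int msgtype,
                             const void *msg, size_t len);

    enum { MSG_BBOARD_UPDATE, MSG_TASK };

    void producer(void)
    {
        /* g3 = g1 U g2: blackboard members and task-bag members both
         * join it, so the ordering between the two updates is preserved
         * within a single group; no inter-group machinery is needed.    */
        group_t g3 = pg_join("bboard+taskbag");
        int update = 42, task = 7;                      /* placeholders  */

        pg_cbcast(g3, MSG_BBOARD_UPDATE, &update, sizeof update);
        pg_cbcast(g3, MSG_TASK,          &task,   sizeof task);
    }

    /* A task-bag member's receive handler just drops MSG_BBOARD_UPDATE. */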
 
You might argue that the increased group size would incur a substantial
performance penalty.  Under a pessimistic protocol it probably
would.  However, with an optimistic protocol better suited to the
reliability characteristics of the pervasive CSMA/CD Ethernet or
Token Ring physical layers, there would be little performance penalty.
PG members would only have to adjust their VT/LTs.
 
I see this problem as a general concurrency problem, rather than
a distributed system problem.  A concurrent programmer faced with such
a problem would use some sort of MUTEX mechanism to ensure that
the causal relationship between the two processes is respected.
 
To pursue this chain of reasoning somewhat further, I see the
field of distributed computing as an amalgam of two fields:
concurrent computing and communication protocols.  Many
approaches seem to do well in one regard but not in the other.
For example, the Hermes system (R. E. Strom et al, IBM) handles
concurrency nicely but sidesteps the communication issues.
Isis seems to be just the opposite (somewhat unavoidably, as
it is implemented in C).  My theoretical inclination is to start with a
good concurrent language (Actors, as proposed by Agha and Hewitt,
is my current favorite) and install useful communication semantics
under it.  My practical inclination is of course to do something
in the C/Unix environment so people will use it.
 
>   2) This is not an uncommon paradigm.
 
Agreed.  However, I dispute the assumption I see lurking behind this
statement, that the benefits of concurrency (or any other style of
programming) may be obtained "for free" - i.e., that through clever
systems programming on the part of a few, J. Random Hacker may write
generally useful concurrent programs in blissful ignorance of the
fundamentals of concurrent programming.  New tools demand new skills.
As an engineer, my concern is to develop a set of tools that will allow
other engineers, skilled at concurrent programming, to develop
distributed applications.  Developing this set of tools may require the
abandonment or revision of some common paradigms just as functional and
OO programming have.  Further Kuhnian rantings elided...
 
>   3) Fault-tolerance [... ]
 
I have to think about this one.  It's too complex for me to bash
it into my model on the fly.
 
>  Our paper argues for causality classes: causality would be enforced
>  within multiple groups in a single class, but might be violated over
>  class boundaries.  The idea is that most groups, by default, would
>  be in some single class.
 
Here is the crux of our discussion.  I am proposing to regulate
causality using a DAG of homogeneous nodes (each node being a PG) and
you, if I understand your argument correctly, are proposing to relax
the acyclic condition on the PG graph and avoid unneeded causality
enforcement by adding a second DAG (or flat list - are causality
classes hierarchical in your model?) encoding the exact causality
relationship desired.  It seems clear to me that your PG model is
theoretically more powerful, as it allows the group membership graph to
be arbitrary while guaranteeing causally correct message delivery.  A
more precise wording of my original question is: What practical
advantage does your more powerful theoretical model allow?  What
systems can you build that I can't?
 
>  There are a number of additional issues that we look hard at, too:
>    1) Why have the system manage groups at all?  Why not put this at the
>       application level?
 
Something must know about the physical topology of the network, which
must be assumed to be heterogeneous and of arbitrary complexity.  Forcing
applications to manage their own groups means that each application
will have to contain some encoding (albeit in a "nice" form) of the
physical topology.  Not a pleasant prospect for the network administrator.
In the absence of some standard mechanism for making the network
topology available to applications, I would expect a distributed
system development tool to provide some sort of distributed facility where
the network topology could be encoded.
 
 
>    2) Given groups, why not do multicast using parallel RPC?  (Or, perhaps,
>       transactions, if you want atomicity)?
 
I have looked at this (big surprise).  I assume, by the way, that you
mean "broadcast RPC", augmented by some sort of multiplexing over
different communications media.  I surmise that RPC protocols (even
over broadcast media) do not provide exactly the functionality we want
and hence will tend to be less efficient, likely prohibitively so.
Some RPC-like functionality, transfer syntaxes in particular, is
necessary.  I do believe that a stub generator (much like an RPC
compiler) is an excellent interface to an underlying PG abstraction.
The application developer might specify the set of messages as
augmented C function declarations much the way the current Netwise RPC
TOOL does.  Rather than generating client and server stubs, the PG stub
generator would generate a broadcast stub which PG members would call
to broadcast messages to other PG members, and a receive stub which
would be called by whatever was running the protocol when it needed to
deliver messages.  PG members would run an event loop (obviating the
need for threads).
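
Roughly the shape I have in mind.  The specification syntax and the
generated names below are invented for illustration; they are not the
actual Netwise TOOL output.

    /* Specification, written as an augmented C declaration (made-up syntax):
     *
     *     broadcast void price_update(int symbol, double price);
     *
     * From it the generator would emit something like the following.   */

    /* Broadcast stub: any PG member calls this to multicast to the group. */
    void price_update(int symbol, double price);

    /* Receive stub: the protocol layer calls this on delivery; it unpacks
     * the message and invokes the member's handler.                    */
    void price_update_recv(const void *msg, int len);

    /* The handler the application supplies.                            */
    void price_update_handler(int symbol, double price);

    /* Members then run a simple event loop rather than threads.        */
    extern void pg_next_event(void);   /* hypothetical: block, then call
                                          the appropriate receive stub  */
    int main(void)
    {
        for (;;)
            pg_next_event();
    }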
 
Speaking of threads: how does Isis handle non-kernel threads?  Using
Sun's lwp library, for example, a blocking system call will block all
other threads in the process unless you are willing to pay the 500%
performance penalty (yes, I benchmarked it) for using the non-blocking
I/O library (assuming you can; liblwp uses the interval timer and some
signals).  Yes, but threads are more efficient...  Cough.
 
>    3) How much synchronization is needed between group membership and
>       multicast
>    4) Causality: see above
>    5) Do we need the other ISIS protocols (ABCAST, etc, etc, etc, etc)
 
For the moment I think so.  It would be nice to have a tool that would
deduce the correct protocol to use from some sort of specification but
I haven't seen even a good guess as to what that specification would
look like.  I currently am playing with four classes of multicast
protocols:
 
   1. Unreliable Broadcast: essentially UDP, no PG assumptions
   necessary
 
   2. Atomic Broadcast: (your FBCAST?) provides atomicity "eventually"
   (best effort) but makes no guarantees about message ordering.
   Requires PGs.
 
   3. Causal Broadcast: (your CBCAST) provides atomicity and causal message
   ordering (within PGs only).
 
   4. Totally Ordered Broadcast: (your ABCAST) provides atomicity and
   global (intra-PG) message ordering.
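
In interface terms, I picture these four as a single multicast call with an
ordering parameter, roughly as below (all names purely illustrative):

    /* Illustrative only: the four classes as an ordering parameter to
     * one multicast entry point, rather than four separate protocols.  */
    #include <stddef.h>
    typedef int group_t;

    typedef enum {
        MC_UNRELIABLE,   /* 1: UDP-style, no PG assumptions             */
        MC_ATOMIC,       /* 2: eventual/best-effort atomicity           */
        MC_CAUSAL,       /* 3: atomic + causal order within the PG      */
        MC_TOTAL         /* 4: atomic + total order within the PG       */
    } mc_class;

    extern void pg_mcast(group_t g, mc_class c, const void *msg, size_t len);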
 
>    6) Is there a simple, clean architecture for all this?
 
Of course there is.  The obvious next step in this direction is to
move from procedure call interfaces to compiled specification file
interfaces.
 
Thanks again for the discussion.  It's been very stimulating.
 
--Scott

ken@gvax.cs.cornell.edu (Ken Birman) (11/14/90)

In article <48346@cornell.UUCP> scott@amethyst.omg.ORG writes:

> It is [...] my contention that [...]
> the application developer must EXPLICITLY handle this relationship [...]

I see this as being at odds with the OO paradigm, which says that you
generally don't know how things you call were implemented.  How can I
be explicit about preserving properties for a subsystem whose implementation
is a black box to me?

> I see this problem as a general concurrency problem, rather than
> a distributed system problem.  A concurrent programer faced with such
> a problem would use some sort of MUTEX mechanism to ensure that
> the causal relationship between the two processes is respected.

Here, you are basically saying that this whole issue is because I believe
in asynchronous communication and you believe in a more synchronized 
approach.  Well, this is exactly what I would have expected: if the system
doesn't enforce causality in an automatic way, you just can't take advantage
of asynchronous computation...

The Strom/Yemini work and the recent follow-on are for a fairly restrictive
model: it can't support concurrent threads in one address space (well, it
can with a lot of limits on scheduling, but you lose a lot of flexibility
as a programmer in order to let them log information whenever a non-
deterministic event occurs).  So, yes, the model they use is nice, but no,
I don't see this as solving the same problem...

> [..].  However, I dispute the assumption [..] that the benefits of
> concurrency (or any other stye of programming) may be obtained "for
> free" - ie.  that [...] J. Random Hacker may write generally useful
> concurrent programs [...]

Well, ISIS users do it all the time.  

> [... re: causality classes] ... Here is the crux of our discussion...

What we have in mind is pretty simple compared to your suggestion.  Basically,
a system call sequence:
        SetDefaultCausalClass(unique-identifier);
            ... create groups ...
        ResetDefaultCausalClass();
would do the trick (you just keep track of the value on a per-thread basis).
There is no causality graph or anything.  The result is that each group
is in some class, and the system then enforces causality for groups in the
same class.
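
For example, an application with a realtime subcomponent might do something
like the sketch below.  The pg_create() call is a hypothetical stand-in for
whatever the group-creation primitive turns out to be; only the two class
calls are the ones proposed above.

    extern void SetDefaultCausalClass(int unique_id);
    extern void ResetDefaultCausalClass(void);
    extern int  pg_create(const char *name);   /* hypothetical */

    #define REALTIME_CLASS  2

    void setup(void)
    {
        int g_bboard, g_taskbag, g_sensors;

        /* Ordinary groups land in the default class; causality is
         * enforced among them.                                         */
        g_bboard  = pg_create("bboard");
        g_taskbag = pg_create("taskbag");

        /* Groups for the realtime subcomponent live in their own class,
         * so causality is not enforced across the class boundary.      */
        SetDefaultCausalClass(REALTIME_CLASS);
        g_sensors = pg_create("sensors");
        ResetDefaultCausalClass();

        (void)g_bboard; (void)g_taskbag; (void)g_sensors;
    }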

> Something must know about the physical topology of the network...

I agree with this.  We need to get information like this into Mach and Chorus
if we expect to put systems like ISIS (or Netwise) over these general 
frameworks and still benefit from special hardware.  I think it could be
largely automated.

> Speaking of threads.  How does Isis handle non-kernel threads?

We use a non-blocking I/O library of our own.  There seems to be no major
performance penalty, at least in our applications.  The system can do about
10,000 lightweight context switches per second on a Sparc 1 or SUN 3/60,
and the cost is one select system call amortized over multiple users,
plus two context switches per call to "isis_select", our lightweight interface.
We also found SUN lwp slow (and big) and we no longer link it in by default.
However, in fairness to SUN, lwp also has a lot of mechanism that ISIS doesn't
use...
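
To give the flavor of it, here is a much-simplified sketch of the idea.  The
thread primitives (t_yield, t_resume, t_self) are hypothetical, not the ISIS
ones, and the real library obviously handles errors, timeouts, and partial
I/O.

    /* A thread that wants to read registers its descriptor and yields;
     * one dispatcher loop runs select() once over everyone's descriptors
     * and resumes the threads whose I/O is ready.                       */

    #include <sys/select.h>

    #define MAXFD 64
    static void *waiter[MAXFD];            /* thread waiting on each fd  */
    extern void  t_yield(void);            /* hypothetical               */
    extern void  t_resume(void *thread);   /* hypothetical               */
    extern void *t_self(void);             /* hypothetical               */

    /* What a "blocking" read looks like to application threads.         */
    void wait_readable(int fd)
    {
        waiter[fd] = t_self();
        t_yield();          /* dispatcher resumes us when fd is ready    */
    }

    /* The dispatcher: one select() amortized over all waiting threads.  */
    void dispatch_once(void)
    {
        fd_set rd;
        int fd;

        FD_ZERO(&rd);
        for (fd = 0; fd < MAXFD; fd++)
            if (waiter[fd]) FD_SET(fd, &rd);

        select(MAXFD, &rd, 0, 0, 0);    /* block until something is ready */

        for (fd = 0; fd < MAXFD; fd++)
            if (waiter[fd] && FD_ISSET(fd, &rd)) {
                void *t = waiter[fd];
                waiter[fd] = 0;
                t_resume(t);
            }
    }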

Ken

ken@gvax.cs.cornell.edu (Ken Birman) (11/15/90)

(From: scott@amethyst.omg.ORG (Scott Meyer))
(Subject: fast causal multicast)

>  > It is [...] my contention that [...]
>  > the application developer must EXPLICITLY handle this relationship [...]
>
>  I see this as being at odds with the OO paradigm, which says that you
>  generally don't know how things you call were implemented.  How can I
>  be explicit about preserving properties for a subsystem whose
>  implementation is a black box to me?

I'm not exposing the implementation of the things I call.  I am trying
to make explicit the implementation of the call itself (how messages
are delivered) - something that most OOL designers seem to have left
rather vague.  Each subsystem remains completely opaque.  Subsystems
are accessed through a hierarchical structure that allows the developer
to make strong (synchronous) guarantees about how the subsystems will
interact, *when necessary*.  Now, it is true that a naive application
of hierarchical PGs will lead to a system with excessive synchrony;
however, the availability of HPGs does not preclude the use of
basic cleverness.  To reconsider your database example, an
implementation allowing a greater degree of asynchrony would have
the database server send back a ticket associated with the request.
This ticket would then be passed to other processes that might
depend on the associated update having taken place.
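
Concretely, I picture something like the following.  The ticket contents and
both calls are only a guess at what such an interface might look like.

    /* Sketch of the ticket idea.  In a vector-timestamp world the ticket
     * would essentially be the timestamp the update was assigned; a
     * reader presents it and is held up until its replica has caught up. */
    #include <stddef.h>

    #define NREPLICAS 4
    typedef struct { long stamp[NREPLICAS]; } ticket_t;

    extern ticket_t db_update(const void *rec, size_t len);  /* returns ticket */
    extern void     db_read(int key, const ticket_t *after); /* waits until the
                                                                 replica reflects
                                                                 the ticket     */

    void example(void)
    {
        int rec = 17;
        ticket_t t = db_update(&rec, sizeof rec);

        /* Any process that must observe the update carries t along and
         * hands it to the read; processes that don't care pass a null
         * ticket and accept whatever state their replica is in.         */
        db_read(17, &t);
    }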

>  Here, you are basically saying that this whole issue is because I
>  believe in asynchronous communication and you believe in a more
>  synchronized approach.

I'm not sure that we're so far apart.  As you point out with your
examples, completely asynchronous systems are problematic.  The
question is how to introduce just enough synchrony to make a
fundamentally asynchronous system easy to reason about.  Excessive
synchrony is a fundamental performance problem (von Neumann bottleneck).
 
>  Well, ISIS users do it all the time.

I think it's fair to say that ISIS users are as a whole considerably
more knowledgeable about concurrent programming than are, say, PC
applications developers or MIS programmers (the people who use the
software that Netwise sells).  As I understand it, ISIS currently
allows a knowledgeable programmer to implement a solution to your
database/blackboard example problems.  ISIS itself cannot detect or
prevent the occurrence of such a problem.  When you implement your fast
CBCAST protocol (ISIS 2.0?), ISIS will be able to prevent that problem,
but at the cost of unnecessary synchrony (how much depends on how
frequently processes are members of more than one PG).  If you then
implement your causality class solution, it would seem that you are
back where you started: no excessive synchrony but no
detection/prevention of concurrency problems.  Forgive me, but I'm
still skeptical that you can deliver a free lunch here...

My inclination is to try and present a single, simple mechanism for
introducing various degrees of synchrony (PG + Broadcast protocol) that
will allow distributed systems developers to create an initial system
quite rapidly by making use of fairly gross synchrony requirements.
After the system is running, performance can be increased (if
necessary) by relaxing synchrony requirements, replacing HPGs with
synchronization mechanisms specifically tailored to the application.

Regards,

--Scott

ken@gvax.cs.cornell.edu (Ken Birman) (11/15/90)

In article <48414@cornell.UUCP> scott@amethyst.omg.ORG (Scott Meyer) writes:
> [...]  To reconsider your database example, an
>implementation allowing a greater degree of asynchrony would have
>the database server send back a ticket associated with the request.
>This ticket would then be passed to other processes that might
>depend on the associated update having taken place.

Well, this approach is feasible and the MIT group (Ladin/Liskov/Shrira)
advocates exactly what you propose here.  But, if you don't realize that
the subsystem you called uses a distributed implementation, you certainly
won't realize that you are supposed to obtain and carry these tickets around.
So, I think this "exposes" an aspect of the implementation.

>... the question is how to introduce just enough synchrony to make a
>fundamentally asynchronous system easy to reason about.  Excessive
>synchrony is a fundamental performance problem (von Neumann bottleneck).

I think that here we are pretty close to agreement.

>I think it's fair to say that ISIS users are as a whole considerably
>more knowledgable about concurrent programming than are, say, PC
>applications developers or MIS programmers (the people who use the
>software that Netwise sells).....

Well, the idea in ISIS is that we provide toolkits (and utility programs,
like the network resource manager that ISIS Distributed Systems will be
selling) that pretty much can be used "stand alone".  So, naive programmers
wouldn't need to learn about how ISIS works to use them.  Currently, my
feeling is that ISIS is fairly useful this way, but that it looks a bit
complex because so many different tools and utilities are all discussed in
the same tutorial.  Split into pieces, I think a system like this could
be quite approachable by, say, a high-school educated hacker.  

But, this definitely implies that users will be combining canned subsystems
and that the subsystem can't trust the user to lug some sort of ticket
around with them.

Don't you see it as contradictory to expect users to be totally naive
but also expect them to correctly manage the ticket sent back by the
database?  In the MIT scheme (and the ISIS one), these tickets are
not such small objects -- and, the best chance for keeping the size down
is to use fairly complex compression algorithms, which is the last thing
a typical user could be asked to do...

So, as I see it, you end up with system support for maintaining causality
either way -- precisely because you don't expect your users to be
sophisticated enough about system internals to do this for you.


... too bad we don't have comp.sys.isis readers from the MIT project,
because it would be interesting to have their perspective on this.
They do have some papers on their approach, which is the one you are
advocating -- they call it "lazy replication".  The most readily
available is in the proceedings of a conference called PODC 1990 (Principles
of Distributed Computing).  One thing that stands out is that the system
model that this group uses is fundamentally different from the ISIS model.
First, they really are not working with large numbers of groups, while
ISIS is moving more and more towards applications with huge numbers of
groups.  Also, they don't allow clients to broadcast directly to servers;
only a server can initiate broadcasts.  Finally, the handling of failures
is a little different.  On the other hand, if you compare our scheme with
theirs in the case of a single group, the two are very similar.  In fact,
both derive from the same basic idea (vector times, which were developed
a long time ago, perhaps by Jerry Popek and perhaps by Keith Marzullo and
Susan Owicki).  We had toyed with the idea of using this scheme, especially
after the MIT group reported good results with it. But, ISIS uses large
numbers of overlapping groups and this forced us to do a lot of work to come
up with a version that works in ISIS.  

The full details of our scheme will be in Pat Stephenson's thesis, which
should be available in TR form by the end of this year (from us).  The MIT
work is reported both in Rivka Ladin's thesis and in the PODC paper.

>... when you implement this (ISIS V2.0)?

Actually, the implementation is out as part of ISIS V2.1 now.  But, it
lacks support for "bypass" communication when a process is a client of a
process group; this extension (an important one for many of our large
users) will be in ISIS V3.0, which is due out "soon".