ken@kama.cs.cornell.edu (Ken Birman) (07/28/87)
Keywords: Transactions, reliability, distributed systems, virtual synchrony.

We based the first version of ISIS on nested transactions and have moved away from this towards a "toolkit" scheme in the new version. Transactions can be built in ISIS-2 using tools based on stuff from ISIS-1, but I really doubt that people will do so very often. To summarize my feelings, transactions turn out to be too costly and unwieldy to employ on a system-wide basis and are not the most appropriate or natural way to make O.S. mechanisms fault-tolerant. They represent an over-constrained reliability technique, and there are more flexible and equally powerful alternatives available.

1. Transactions and a transactional directory data structure.

Although transactions (nested ones) are elegant, problems arose when my group started coding applications in ISIS-1, which was based on transactional "resilient objects". My group is interested in replication, but the discussion that follows is independent of that aspect except in one respect: I will assume that concurrency control can be costly, because in my setting replicated write-locking is quite costly. You may want to question this if you work in a non-replicated setting.

Let's take a simple example: a directory that binds unique names to values. Clients come along and either look up the value for some name, insert a new name-value tuple, or change the value bound to some existing name. The changes/inserts run slowly: say they take a few seconds to complete. Queries are frequent and short-running, and we don't want insert/change operations to block them.

Under the straight transaction model there is no simple way to implement a directory data structure to this specification! The problem is that most representation data structures, like linked lists, have a dual implementation: at one level there are nodes linked together, a free list, etc., and at a second, internal level there is the data inside the nodes. The standard transaction model, however, would treat both of these as data within the directory. So, in order to update or insert a node you have to lock out access to that node, and other transactions get stuck when they try to access these locked nodes, even if they don't happen to be looking for the item that the locked node contains.

What about a more sophisticated representation of the data structure? For example, we could step outside of the transaction model when manipulating links and only access the contents of each tuple transactionally. This is done using, for example, the ARGUS ``toplevel'' construct. Oops: how can a transaction tell whether it is interested in the contents of a node without looking at the contents of the name field? Perhaps we could store this field at the same logical level as the links and only manipulate the value field transactionally? Now, this starts to sound like something that would work, except for the need for a complex "abort" mechanism tied to the semantics of our particular data structure...

Now, let's abstract a bit. Once we have finally implemented our spiffy transactional directory, it would be nice to put it in a library for people to exploit as they need it. But how can concurrency control be done efficiently in this case? Some applications may already have a ``covering'' lock, and in such cases directory-level locking would not be needed. On the other hand, other programs might start up by talking to the directory, and they will definitely need locks to ensure serializability.
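To make the locking problem concrete, here is a minimal sketch of the two-level representation, with plain mutexes standing in for transactional write locks. This is my own illustration, not how ISIS-1 resilient objects were actually implemented.

/* Sketch of the directory from section 1 (illustrative only, not ISIS code).
 * The structural level is the linked list itself; the contents level is the
 * name/value fields inside each node.  A straight transactional
 * implementation locks whole nodes until commit, so a slow update blocks
 * lookups of completely unrelated names.  Structural changes (insert, the
 * free list) are omitted for brevity. */

#include <stdlib.h>
#include <string.h>
#include <pthread.h>

struct dnode {
    struct dnode   *next;   /* structural level: list linkage             */
    pthread_mutex_t lock;   /* stand-in for a per-node transactional lock */
    char           *name;   /* contents level: the bound name ...         */
    char           *value;  /* ... and its value                          */
};

/* Lookup: to decide whether it even cares about a node, a reader must look
 * at the name field, which sits under the same lock as the value -- so it
 * blocks behind any in-progress update it happens to walk past. */
char *dir_lookup(struct dnode *head, const char *name)
{
    for (struct dnode *n = head; n != NULL; n = n->next) {
        pthread_mutex_lock(&n->lock);      /* blocks if an update holds it */
        char *hit = (strcmp(n->name, name) == 0) ? strdup(n->value) : NULL;
        pthread_mutex_unlock(&n->lock);
        if (hit)
            return hit;                    /* caller frees */
    }
    return NULL;
}

/* Change: holds the node lock for the duration of a slow recomputation,
 * the way a transaction would hold a write lock until commit. */
void dir_change(struct dnode *head, const char *name,
                char *(*slow_compute)(const char *))
{
    for (struct dnode *n = head; n != NULL; n = n->next) {
        pthread_mutex_lock(&n->lock);
        if (strcmp(n->name, name) == 0) {
            char *v = slow_compute(n->value);   /* "takes a few seconds" */
            free(n->value);
            n->value = v;
            pthread_mutex_unlock(&n->lock);
            return;
        }
        pthread_mutex_unlock(&n->lock);
    }
}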
Always acquiring locks will be costly and superfluous, especially if data is replicated (but even if it is not). Sounds like we are going to need a very fancy concurrency control scheme...

Now, here's the ironic part: if we weren't trying to achieve serializability, this application would be quite trivial! Any reasonable synchronization mechanism would suffice. So, we are going to real extremes just to twist a simple problem with an obvious solution into a serializable one.

2. Orphans can wreak havoc by taking toplevel actions

Are we done? Well, that depends on the system. In ISIS-1, replication and roll-forward restart from failures combined to make this a correct solution. But in a system like ARGUS or CLOUDS, orphans can arise, and this raises further questions. An orphan is using an invalid system state, and above we proposed that our directory use toplevel actions, which do not get aborted when the transaction (or orphan) gets killed. Sounds like we could end up with duplicate copies of nodes under some circumstances... better have someone look at exactly how the orphan termination algorithm would handle this case before putting that directory object in the public library!

3. Let's get to the point.

What's my point? Well, transactions look good and are easy to understand, but here we have about the simplest possible data structure and it already sounds like only a real wizard could conceivably get it to work. Interesting applications have dozens (or more) of such data structures. Obviously, transactions are just not going to permit the construction of a very large class of such applications (some will be built anyhow, but it won't be easy). I would argue that we should leave transactions to the database people. Transactions work well if you are compiling and executing queries... they make far less sense in the highly dynamic world of distributed systems, where interactions between processes are common and persist over long periods of time.

4. Counter proposal: Preserving potential causal orderings

I have a counter proposal, but to introduce it I need to work through a second (shorter) example. The question is: what sort of correctness constraint could we substitute for transactional serializability?

Consider this scenario: say that p1 and p2 are processes (executing programs) and p1 -> p2 means "p2 reads something from p1". Then a transactional model says that an interaction like

    p1 -> p2 -> p1 -> p2 ...

is illegal: it is not serializable. But in any OS this would be a simple, common interaction. Should we carve it up into little transactions? The ARGUS group has argued that this, plus the idea of a "non-deterministic" data structure that can legally return a result that might intuitively be incorrect (e.g. a set object that returns empty when it is not empty), solves most of the issues I have raised. This would give a sequence of transactions:

    (1) p1 -> p2
    (2) p2 -> p1
    (3) p1 -> p2
    ... etc ...

But now, nothing in our model says that the "second" interaction p1 -> p2 occurs after the first one -- strictly speaking, it could be serialized first. That is, in a normal transactional system it would be legal to execute this sequence as if it had run in the order:

    (3) p1 -> p2
    (2) p2 -> p1
    (1) p1 -> p2

Think what this would mean in practice: p1 sends two requests to p2 and the second is executed as if the first had never been seen! Obviously this is not going to work.
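As a preview of the fix proposed next, here is a minimal sketch of a delivery layer that refuses to hand request (3) to p2 before request (1) has been delivered. It uses standard vector clocks; it is my own illustration and not any particular ISIS protocol.

/* Sketch of causal message delivery using vector clocks (illustrative only).
 * deliverable() says whether a message m from sender s can be delivered at
 * this process: it must be the next message from s, and everything the
 * sender had already seen from other processes must have been delivered
 * here first. */

#define NPROC 8

struct msg {
    int sender;          /* index of the sending process           */
    int vt[NPROC];       /* sender's vector clock when it was sent */
    /* ... application data ... */
};

/* vt_local[i] = number of messages from process i delivered so far here */
static int vt_local[NPROC];

int deliverable(const struct msg *m)
{
    for (int i = 0; i < NPROC; i++) {
        if (i == m->sender) {
            if (m->vt[i] != vt_local[i] + 1)   /* must be the next one from s */
                return 0;
        } else {
            if (m->vt[i] > vt_local[i])        /* depends on something unseen */
                return 0;
        }
    }
    return 1;
}

void deliver(const struct msg *m)
{
    vt_local[m->sender] += 1;   /* record the delivery ...                 */
    /* ... then hand the message to the application                        */
}

/* Messages that are not yet deliverable are simply held back and retried
 * after each delivery, so request (3) can never be acted on before
 * request (1) has been seen. */

Nothing here serializes anything; the only property being enforced is that a message is held back until everything that could have influenced it has been delivered.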
Thus, we might want to carve up our transaction into little actions, but we will also want something to enforce the idea that if action (1) might affect the outcome of action (2), and action (1) came first, then action (2) definitely sees the effects of action (1). Let me call this the problem of arranging that "potentially causal" relationships are preserved.

This idea has antecedents. Lamport defined potential causality, although my work (and that of Larry Peterson at Arizona) is the first to worry about having the system enforce this sort of ordering, the way a database system enforces serializability. Related readings would include Herlihy's work on linearizability, the NIL system, and Jefferson's virtual time work. (I realize this is an incomplete list! Sorry...)

5. Packaging this proposal: Virtual synchrony and the ISIS-2 toolkit

So, what's the actual proposal? In ISIS-2 we are using a new scheme called "virtual synchrony" as our correctness constraint. I have a technical report that talks about this (it is being revised for SOSP, but you can have an old copy by sending a request to me).

A virtually synchronous execution is one in which it seems that one event happens at a time, system-wide, although the actual executions can be concurrent. Thus, the apparent synchrony of the system is a virtual property of a real system that is not at all synchronous. In addition, we require that if one event preceded another, then everyone must see a consistent event ordering. In our approach, an "event" can be a communication event -- a message or RPC (point-to-point or multicast to a group of processes), a change in the membership of a group of processes, a failure (process or site), or a recovery. This is like serializability, but instead of working with transactions, we work with events. The idea is that in a virtually synchronous setting one implements algorithms as if the environment were synchronous and then executes them as cheaply as possible by relaxing synchrony whenever possible. This might sound a bit hairy, but it works. And we can build a correct directory data structure in a very simple way using our system; a small sketch of what this might look like appears below, just before the summary.

The packaging of our new system is as a set of tools that guarantee virtually synchronous executions to their users. Tools are used to glue conventional non-distributed components together: they permit the grouping of processes into groups, communication with individual members, whole groups, or even sets of groups, synchronization, recovery mechanisms, and even mechanisms for dynamically reconfiguring after failures. This stuff all works, and we are now benchmarking it and implementing some application software on top of it. The tools themselves are built on top of a bunch of multicast protocols. One of our graduate students, Frank Schmuck, will cover a lot of the theory in his PhD dissertation, if I can convince him to stop hacking long enough to write it...

6. Not a panacea

The bad news is that virtual synchrony doesn't make distributed computing trivial. But with experience we are getting very good at programming in this setting, and it is far easier than it would have been in an overconstrained transactional setting (or an underconstrained RPC setting). Anyhow, once a big enough set of tools exists -- give me a year -- most people can just use the tools and "group RPC" mechanisms and forget about the virtual synchrony aspect.
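To make the contrast with section 1 concrete, here is a sketch of how the directory might be rebuilt on top of an ordered group multicast. The interface names below (group_join, abcast_send) are hypothetical stand-ins, not the real ISIS toolkit calls; the point is only that every replica applies updates in the same virtually synchronous order, so queries are answered locally with no locking and no abort machinery.

/* Sketch of the directory rebuilt on a virtually synchronous group
 * (illustrative only; group_join and abcast_send are assumed, hypothetical
 * interfaces, not the actual ISIS toolkit).  Updates are delivered to every
 * member in the same global order, one event at a time with respect to
 * local queries, so a lookup simply reads the local copy. */

#include <stdlib.h>
#include <string.h>

struct dnode { struct dnode *next; char *name; char *value; };
static struct dnode *local_copy;        /* this member's replica */

/* Hypothetical toolkit hooks (assumptions for this sketch):
 *   group_join(gname, handler)     - join the group, receive update events
 *   abcast_send(g, name, value)    - totally ordered multicast of an update */
struct group;
extern struct group *group_join(const char *gname,
                                void (*handler)(const char *, const char *));
extern void abcast_send(struct group *g, const char *name, const char *value);

/* Called at every member, in the same order everywhere. */
static void apply_update(const char *name, const char *value)
{
    for (struct dnode *n = local_copy; n != NULL; n = n->next) {
        if (strcmp(n->name, name) == 0) {       /* change existing binding */
            free(n->value);
            n->value = strdup(value);
            return;
        }
    }
    struct dnode *n = malloc(sizeof *n);        /* or insert a new one */
    n->name = strdup(name);
    n->value = strdup(value);
    n->next = local_copy;
    local_copy = n;
}

/* A query never blocks behind a slow update: it just reads local state. */
const char *dir_lookup(const char *name)
{
    for (struct dnode *n = local_copy; n != NULL; n = n->next)
        if (strcmp(n->name, name) == 0)
            return n->value;
    return NULL;
}

/* An update is just an ordered multicast; the slow part (computing the new
 * value) happens before the multicast, outside any critical section. */
void dir_update(struct group *g, const char *name, const char *value)
{
    abcast_send(g, name, value);
}

A member would start with something like group_join("directory", apply_update) and then serve dir_lookup calls against its local copy.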
7. Summary and conclusions

So, I would argue that transactions have their place and we ought to keep them around, but at the same time there are other mechanisms that work much better if one is building distributed application software. In particular, it seems to me that transactions work well when running queries on data structures that are themselves passive. They are ill-suited to systems in which lots of active entities interact through shared data or message passing (RPC). Virtual synchrony, a weaker correctness constraint than serializability, turns out to be easier to program with and cheap to support, and is now felt by the ISIS project to represent a more appropriate tool for building reliable distributed systems.

Ken Birman