ken@kama.cs.cornell.edu (Ken Birman) (07/28/87)
Keywords: Transactions, reliability, distributed systems, virtual synchrony.

We based the first version of ISIS on nested transactions and have moved away from this towards a "toolkit" scheme in the new version. Transactions can be built in ISIS-2 using tools based on stuff from ISIS-1, but I really doubt that people will do so very often. To summarize my feelings, transactions turn out to be too costly and unwieldy to employ on a system-wide basis and are not the most appropriate or natural way to make O.S. mechanisms fault-tolerant. They represent an over-constrained reliability technique, and there are more flexible and equally powerful alternatives available.

1. Transactions and a transactional directory data structure.

Although transactions (nested ones) are elegant, problems arose when my group started coding applications in ISIS-1, which was based on transactional "resilient objects". My group is interested in replication, but the discussion that follows is independent of that aspect except in one respect: I will assume that concurrency control can be costly, because in my setting replicated write-locking is quite costly. You may want to question this if you work in a non-replicated setting.

Let's take a simple example: a directory that binds unique names to values. Clients come along and either look up the value for some name, insert a new name-value tuple, or change the value bound to some existing name. The changes/inserts run slowly: say they take a few seconds to complete. Queries are frequent and short-running, and we don't want insert/change operations to block them.

Under the straight transaction model there is no simple way to implement a directory data structure to this specification! The problem is that most representation data structures, like linked lists, have a dual implementation: at one level there are nodes linked together, a free list, etc., and at a second, internal level there is the data inside the nodes. The standard transaction model, however, would treat both of these as data within the directory. So, in order to update or insert a node you have to lock out access to that node, and other transactions get stuck when they try to access these locked nodes, even if they don't happen to be looking for the item that the locked node contains.

What about a more sophisticated representation of the data structure? For example, we could step outside of the transaction model when manipulating links and only access the contents of each tuple transactionally. This is done using, for example, the ARGUS ``toplevel'' construct. Oops: how can a transaction tell whether it is interested in the contents of a node without looking at the contents of the name field? Perhaps we could store this field at the same logical level as the links and only manipulate the value field transactionally? Now, this starts to sound like something that would work, except for the need for a complex "abort" mechanism tied to the semantics of our particular data structure...

Now, let's abstract a bit. Once we have finally implemented our spiffy transactional directory, it would be nice to put it in a library for people to exploit as they need it. But how can concurrency control be done efficiently in this case? Some applications may already have a ``covering'' lock, and in such cases directory-level locking would not be needed. On the other hand, other programs might start up by talking to the directory, and they will definitely need locks to ensure serializability.
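To make the locking problem concrete, here is a minimal sketch of the two-level representation, with plain mutexes standing in for transactional write locks. This is my own illustration, not how ISIS-1 resilient objects were actually implemented.

/* Sketch of the directory from section 1 (illustrative only, not ISIS code).
 * The structural level is the linked list itself; the contents level is the
 * name/value fields inside each node.  A straight transactional
 * implementation locks whole nodes until commit, so a slow update blocks
 * lookups of completely unrelated names.  Structural changes (insert, the
 * free list) are omitted for brevity. */

#include <stdlib.h>
#include <string.h>
#include <pthread.h>

struct dnode {
    struct dnode   *next;   /* structural level: list linkage             */
    pthread_mutex_t lock;   /* stand-in for a per-node transactional lock */
    char           *name;   /* contents level: the bound name ...         */
    char           *value;  /* ... and its value                          */
};

/* Lookup: to decide whether it even cares about a node, a reader must look
 * at the name field, which sits under the same lock as the value -- so it
 * blocks behind any in-progress update it happens to walk past. */
char *dir_lookup(struct dnode *head, const char *name)
{
    for (struct dnode *n = head; n != NULL; n = n->next) {
        pthread_mutex_lock(&n->lock);      /* blocks if an update holds it */
        char *hit = (strcmp(n->name, name) == 0) ? strdup(n->value) : NULL;
        pthread_mutex_unlock(&n->lock);
        if (hit)
            return hit;                    /* caller frees */
    }
    return NULL;
}

/* Change: holds the node lock for the duration of a slow recomputation,
 * the way a transaction would hold a write lock until commit. */
void dir_change(struct dnode *head, const char *name,
                char *(*slow_compute)(const char *))
{
    for (struct dnode *n = head; n != NULL; n = n->next) {
        pthread_mutex_lock(&n->lock);
        if (strcmp(n->name, name) == 0) {
            char *v = slow_compute(n->value);   /* "takes a few seconds" */
            free(n->value);
            n->value = v;
            pthread_mutex_unlock(&n->lock);
            return;
        }
        pthread_mutex_unlock(&n->lock);
    }
}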
Always acquiring locks will be costly and superfluous, especially if data is replicated (but even if it is not). Sounds like we are going to need a very fancy concurrency control scheme...

Now, here's the ironic part: if we weren't trying to achieve serializability, this application would be quite trivial! Any reasonable synchronization mechanism would suffice. So, we are going to real extremes just to twist a simple problem with an obvious solution into a serializable one.

2. Orphans can wreak havoc by taking toplevel actions

Are we done? Well, that depends on the system. In ISIS-1, replication and roll-forward restart from failures combined to make this a correct solution. But in a system like ARGUS or CLOUDS, orphans can arise, and this raises further questions. An orphan is using an invalid system state, and above we proposed that our directory use toplevel actions, which do not get aborted when the transaction (or orphan) gets killed. Sounds like we could end up with duplicate copies of nodes under some circumstances... better have someone look at exactly how the orphan termination algorithm would handle this case before putting that directory object in the public library!

3. Let's get to the point.

What's my point? Well, transactions look good and are easy to understand, but here we have about the simplest possible data structure and it already sounds like only a real wizard could conceivably get it to work. Interesting applications have dozens (or more) of such data structures. Obviously, transactions are just not going to permit the construction of a very large class of such applications (some will be built anyhow, but it won't be easy). I would argue that we should leave transactions to the database people. Transactions work well if you are compiling and executing queries... they make far less sense in the highly dynamic world of distributed systems, where interactions between processes are common and persist over long periods of time.

4. Counter proposal: Preserving potential causal orderings

I have a counter proposal, but to introduce it I need to work through a second (shorter) example. The question is: what sort of correctness constraint could we substitute for transactional serializability?

Consider this scenario: say that p1 and p2 are processes (executing programs) and p1 -> p2 means "p2 reads something from p1". Then a transactional model says that an interaction like

    p1 -> p2 -> p1 -> p2 ...

is illegal: it is not serializable. But in any OS this would be a simple, common interaction. Should we carve it up into little transactions? The ARGUS group has argued that this, plus the idea of a "non-deterministic" data structure that can legally return a result that might intuitively be incorrect (e.g. a set object that returns empty when it is not empty), solves most of the issues I have raised. This would give a sequence of transactions:

    (1) p1 -> p2
    (2) p2 -> p1
    (3) p1 -> p2
    ... etc ...

But now, nothing in our model says that the "second" interaction p1 -> p2 occurs after the first one -- strictly speaking, it could be serialized first. That is, in a normal transactional system it would be legal to execute this sequence as if it had run in the order:

    (3) p1 -> p2
    (2) p2 -> p1
    (1) p1 -> p2

Think what this would mean in practice: p1 sends two requests to p2 and the second is executed as if the first had never been seen! Obviously this is not going to work.
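As a preview of the fix proposed next, here is a minimal sketch of a delivery layer that refuses to hand request (3) to p2 before request (1) has been delivered. It uses standard vector clocks; it is my own illustration and not any particular ISIS protocol.

/* Sketch of causal message delivery using vector clocks (illustrative only).
 * deliverable() says whether a message m from sender s can be delivered at
 * this process: it must be the next message from s, and everything the
 * sender had already seen from other processes must have been delivered
 * here first. */

#define NPROC 8

struct msg {
    int sender;          /* index of the sending process           */
    int vt[NPROC];       /* sender's vector clock when it was sent */
    /* ... application data ... */
};

/* vt_local[i] = number of messages from process i delivered so far here */
static int vt_local[NPROC];

int deliverable(const struct msg *m)
{
    for (int i = 0; i < NPROC; i++) {
        if (i == m->sender) {
            if (m->vt[i] != vt_local[i] + 1)   /* must be the next one from s */
                return 0;
        } else {
            if (m->vt[i] > vt_local[i])        /* depends on something unseen */
                return 0;
        }
    }
    return 1;
}

void deliver(const struct msg *m)
{
    vt_local[m->sender] += 1;   /* record the delivery ...                 */
    /* ... then hand the message to the application                        */
}

/* Messages that are not yet deliverable are simply held back and retried
 * after each delivery, so request (3) can never be acted on before
 * request (1) has been seen. */

Nothing here serializes anything; the only property being enforced is that a message is held back until everything that could have influenced it has been delivered.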
Thus, we might want to carve up our transaction into little actions, but we will also want something to enforce the idea that if action (1) might affect the outcome of action (2), and action (1) came first, then action (2) definitely sees the effects of action (1). Let me call this the problem of arranging that "potentially causal" relationships are preserved.

This idea has antecedents. Lamport defined potential causality, although my work (and that of Larry Peterson at Arizona) is the first to worry about having the system enforce this sort of ordering, the way a database system enforces serializability. Related readings would include Herlihy's work on linearizability, the NIL system, and Jefferson's virtual time work. (I realize this is an incomplete list! Sorry...)

5. Packaging this proposal: Virtual synchrony and the ISIS-2 toolkit

So, what's the actual proposal? In ISIS-2 we are using a new scheme called "virtual synchrony" as our correctness constraint. I have a technical report that talks about this (it is being revised for SOSP, but you can have an old copy by sending a request to me).

A virtually synchronous execution is one in which it seems that one event happens at a time, system-wide, although the actual executions can be concurrent. Thus, the apparent synchrony of the system is a virtual property of a real system that is not at all synchronous. In addition, we require that if one event preceded another, then everyone must see a consistent event ordering. In our approach, an "event" can be a communication event -- a message or RPC (point-to-point or multicast to a group of processes), a change in the membership of a group of processes, a failure (process or site), or a recovery. This is like serializability, but instead of working with transactions, we work with events. The idea is that in a virtually synchronous setting one implements algorithms as if the environment were synchronous and then executes them as cheaply as possible by relaxing synchrony whenever possible. This might sound a bit hairy, but it works. And we can build a correct directory data structure in a very simple way using our system; a small sketch of what this might look like appears below, just before the summary.

The packaging of our new system is as a set of tools that guarantee virtually synchronous executions to their users. Tools are used to glue conventional non-distributed components together: they permit the grouping of processes into groups, communication with individual members, whole groups, or even sets of groups, synchronization, recovery mechanisms, and even mechanisms for dynamically reconfiguring after failures. This stuff all works, and we are now benchmarking it and implementing some application software on top of it. The tools themselves are built on top of a bunch of multicast protocols. One of our graduate students, Frank Schmuck, will cover a lot of the theory in his PhD dissertation, if I can convince him to stop hacking long enough to write it...

6. Not a panacea

The bad news is that virtual synchrony doesn't make distributed computing trivial. But with experience we are getting very good at programming in this setting, and it is far easier than it would have been in an overconstrained transactional setting (or an underconstrained RPC setting). Anyhow, once a big enough set of tools exists -- give me a year -- most people can just use the tools and "group RPC" mechanisms and forget about the virtual synchrony aspect.
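To make the contrast with section 1 concrete, here is a sketch of how the directory might be rebuilt on top of an ordered group multicast. The interface names below (group_join, abcast_send) are hypothetical stand-ins, not the real ISIS toolkit calls; the point is only that every replica applies updates in the same virtually synchronous order, so queries are answered locally with no locking and no abort machinery.

/* Sketch of the directory rebuilt on a virtually synchronous group
 * (illustrative only; group_join and abcast_send are assumed, hypothetical
 * interfaces, not the actual ISIS toolkit).  Updates are delivered to every
 * member in the same global order, one event at a time with respect to
 * local queries, so a lookup simply reads the local copy. */

#include <stdlib.h>
#include <string.h>

struct dnode { struct dnode *next; char *name; char *value; };
static struct dnode *local_copy;        /* this member's replica */

/* Hypothetical toolkit hooks (assumptions for this sketch):
 *   group_join(gname, handler)     - join the group, receive update events
 *   abcast_send(g, name, value)    - totally ordered multicast of an update */
struct group;
extern struct group *group_join(const char *gname,
                                void (*handler)(const char *, const char *));
extern void abcast_send(struct group *g, const char *name, const char *value);

/* Called at every member, in the same order everywhere. */
static void apply_update(const char *name, const char *value)
{
    for (struct dnode *n = local_copy; n != NULL; n = n->next) {
        if (strcmp(n->name, name) == 0) {       /* change existing binding */
            free(n->value);
            n->value = strdup(value);
            return;
        }
    }
    struct dnode *n = malloc(sizeof *n);        /* or insert a new one */
    n->name = strdup(name);
    n->value = strdup(value);
    n->next = local_copy;
    local_copy = n;
}

/* A query never blocks behind a slow update: it just reads local state. */
const char *dir_lookup(const char *name)
{
    for (struct dnode *n = local_copy; n != NULL; n = n->next)
        if (strcmp(n->name, name) == 0)
            return n->value;
    return NULL;
}

/* An update is just an ordered multicast; the slow part (computing the new
 * value) happens before the multicast, outside any critical section. */
void dir_update(struct group *g, const char *name, const char *value)
{
    abcast_send(g, name, value);
}

A member would start with something like group_join("directory", apply_update) and then serve dir_lookup calls against its local copy.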
7. Summary and conclusions

So, I would argue that transactions have their place and we ought to keep them around, but at the same time there are other mechanisms that work much better if one is building distributed application software. In particular, it seems to me that transactions work well when running queries on data structures that are themselves passive. They are ill-suited to systems in which lots of active entities interact through shared data or message passing (RPC). Virtual synchrony, a weaker correctness constraint than serializability, turns out to be easier to program with and cheap to support, and is now felt by the ISIS project to represent a more appropriate tool for building reliable distributed systems.

Ken Birman