[comp.os.mach] Recovery

strom@andreadoria.watson.ibm.com (Rob Strom) (09/13/90)

In article <45212@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes:
>In article <1990Aug30.164153.25008@swbatl.sbc.com> cbs@swbatl.sbc.com (Brad Slaten 529-7636) writes:
>>
>>    Our current application is using an object-oriented implementation
>>under the ISIS lightweight tasking library and Unix.  We have provided an
>>interface to the Inter-process Communication (IPC) mechanism such that
>>the application does not know when an object with which it is
>>communicating requires IPC.  The problem which occurs is that in
>>making the IPC transparent, the lightweight task context switch is
>>also transparent.  The problem results from the fact that certain
>>objects are shared among multiple tasks.  When a context switch occurs
>>from one task to another, any or all shared objects may be modified by
>>a different task without the initial task having knowledge of this.
>
>Because of the wide posting this received and the mention of ISIS, I
>want to comment on how our group sees this issue.  
>
> [summary of problems of transaction systems, and of ISIS's approach] ...
>
>I should probably point to other non-transactional work that uses this
>approach ("rollback if in trouble").  The best known is Jefferson's virtual
>time system ("Time Warp OS") and the Strom/Yemini system reported in
>TOCS a few years back.  Neither approach really took off except in certain
>specialized languages and applications, notably simulation.  Willy Zwaenepoel
>(Rice) has recently done a version of this in his work on "Sender based
>message logging", which is cheap, but doesn't support rollback and hence
>isn't an appropriate tool for solving this particular problem (he is more
>interested in a cheap transparent fault-tolerance mechanism, and anyhow,
>he assumes a deterministic, non-threaded, execution).  

Since Ken has referred to our work, I'd like to clarify our approach
and the status of our experimental prototypes.

Our group (Distributed Systems Software Technology) at IBM Research is
studying techniques for simplifying the implementation of distributed
systems of the future.  We use a two-pronged approach: (1) a simpler
high-level language model which hides implementation complexity from
the programmer, and (2) program transformations and other transparent
techniques for producing an efficient implementation of our high-level
model.

(1) Our high-level model is based upon processes communicating by
making calls and sends over ports.  Our processes correspond to
modules of conventional programming languages and to objects of
object-oriented languages.  Each process has a sequential thread of
control, and private data. Ports are first-class data objects which
can be passed in messages; this enables a capability-based approach to
access control and dynamic configuration.  Our high-level model hides
from the programmer details of data representation, communication,
failure, and recovery.  We have developed a secure high-level
programming language, Hermes, which exploits this model.  In contrast
to approaches based on shared abstract data types or shared objects,
Hermes processes show no internal concurrency, although such
concurrency can be introduced by program transformations.  Multiple
calls to a single port are queued until the process dequeues them.
Therefore, Brad Slaten's problem would not exist in Hermes.  By
exploiting compile-time checking, Hermes can safely execute large
numbers of ``processes'' in a single physical address space, obtaining
process-switching costs essentially comparable to procedure-calling
costs of conventional languages.  (In the implementation of NIL, a
predecessor of Hermes, the overhead of a cross-process call was 9
instructions on an IBM 370.)
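A minimal sketch of this discipline, in Python rather than Hermes (all
names hypothetical): each ``process'' owns private data and a single
serving thread, and concurrent callers enqueue on its port, so calls are
served one at a time and the shared-object race Brad Slaten describes
cannot arise.

```python
import queue
import threading

class Process:
    """A Hermes-style 'process': private data plus one sequential thread.

    Calls arrive on a port (modeled as a queue) and are dequeued one at
    a time, so the private state never changes under a caller's feet.
    """
    def __init__(self):
        self.port = queue.Queue()    # calls queue here until dequeued
        self.balance = 0             # private data: no outside access
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            op, amount, reply = self.port.get()   # dequeue one call
            if op == "deposit":
                self.balance += amount
            reply.put(self.balance)               # answer the caller

    def call(self, op, amount=0):
        """Synchronous call over the port (blocks for the reply)."""
        reply = queue.Queue()
        self.port.put((op, amount, reply))
        return reply.get()

# Many client threads call concurrently; the port serializes the calls.
p = Process()
clients = [threading.Thread(target=p.call, args=("deposit", 1))
           for _ in range(100)]
for c in clients: c.start()
for c in clients: c.join()
print(p.call("balance"))   # -> 100: no lost updates despite concurrency
```

The point of the sketch is only the queuing: no locks appear in the
``process'' body, because mutual exclusion falls out of the sequential
thread of control.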

(2) We believe that the high-level approach described above is an
ideal model for the programmer provided that it can be implemented
efficiently.  The second aspect of our research addresses transparent
program optimizations which allow the implementer to improve the
efficiency of a distributed implementation without having to rewrite
the high-level code.  We have explored a number of such techniques
based upon the principle of *optimism* --- that is, guessing that some
good thing has happened or will happen, in order to gain speed when
we guess correctly, at the expense of having to roll back
computations when we guess wrong.  We have done some exploratory work
on optimistic techniques for process replication, and for
parallelizing logically sequential processes.  Our most mature work
in this direction is on Optimistic Recovery.  Optimistic Recovery (OR)
recovers a recent consistent state of a distributed system after a
failure of one or more system components.

>Recently, I understand
>that Rob Strom's group has done something similar under Mach, but again
>the focus is no longer on arbitrary rollback but rather is on roll-forward
>for fault-tolerance with determinism assumptions.  Multithreaded OO systems 
>are usually not deterministic because of the thread scheduling issue.
>
>To summarize, there seems to be no simple way out of this problem.
>
>...
>
>			Ken Birman
>			ken@cs.cornell.edu or isis@cs.cornell.edu
>			607-255-9199

The OR model is very general: any system describable as a collection
of ``recovery units'' that have local state and communicate by
exchanging messages can be recovered using this technique.  Each
recovery unit is independently made recoverable by logging input
messages.  The optimism arises from the fact that we do not
synchronize logging with computation or communication; a recovery unit
is allowed to receive messages from a sender recovery unit whose
current state is not yet recoverable.  The optimism results in lower
overhead and faster commit time during normal (failure-free)
execution; in exchange, recovery after a failure may be slower.  The
original Strom/Yemini OR algorithm has been refined. Suggested
improvements have been published by Johnson and Zwaenepoel of Rice, by
myself, Shaula Yemini, and David Bacon at IBM, and by others.  Our
rollback technique is completely general, and can be exploited for
other kinds of optimistic algorithms besides OR.  It is true that each
recovery unit is required to be ``piecewise deterministic''.  However,
this property holds in many realistic environments (e.g.
coroutine-based C-threads) and can be made to hold in ``disciplined''
multithreaded environments in which threads access shared storage in a
concurrent-reader exclusive-writer discipline enforced by lock
primitives.
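A toy illustration of the replay idea behind OR (in Python, with
invented names; this is not the Strom/Yemini algorithm itself, only the
piecewise-determinism property it relies on): a recovery unit is
restored by re-feeding its logged input messages, in order, to a
deterministic state machine.

```python
class RecoveryUnit:
    """A piecewise-deterministic unit recovered by log replay.

    In the real system the log is written asynchronously (the source of
    the optimism); here it is written inline to keep the sketch short.
    """
    def __init__(self):
        self.state = 0
        self.log = []            # input-message log

    def deliver(self, msg):
        self.log.append(msg)     # log the input...
        self.step(msg)           # ...and process it

    def step(self, msg):
        # Any deterministic transition will do for the illustration.
        self.state = self.state * 2 + msg

    def crash_and_recover(self):
        """Lose volatile state, then replay the log from the start."""
        self.state = 0                 # simulate the failure
        for msg in self.log:           # piecewise-deterministic replay
            self.step(msg)

ru = RecoveryUnit()
for m in (3, 1, 4):
    ru.deliver(m)
before = ru.state
ru.crash_and_recover()
assert ru.state == before    # replay reproduces the pre-crash state
```

Piecewise determinism is exactly what makes the final assertion hold:
given the same inputs in the same order, the unit recomputes the same
state.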

A Hermes implementation for RT, SUN, and RISC/6000 is available free
of charge to interested researchers; a manual and tutorial have also
been written and will be published at the end of the year.

A prototype of a transparently recoverable Mach based on OR is being
built.  Under this system, distributed Mach applications can be made
recoverable by recompiling and relinking them with our OR library.  We
hope to make this system available sometime in the future; a
``demo'' prototype supporting a subset of this functionality is now
working.

Interested persons are welcome to contact me by phone, email, or by
directing followups to comp.lang.misc (Hermes) or comp.os.mach (Mach
OR prototype).

Rob Strom, strom@ibm.com, (914) 784-7641
IBM Research, 30 Saw Mill River Road, P.O. Box 704, Yorktown Heights, NY  10958

dbj@titan.rice.edu (Dave Johnson) (09/21/90)

My work with Willy Zwaenepoel has recently been referred to in this newsgroup
in messages by Ken Birman and Rob Strom.  However, we have progressed well
beyond what is represented in those messages, and I would like to take
this opportunity to comment on some of our recent results.

In article <45212@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes:
>Willy Zwaenepoel
>(Rice) has recently done a version of this in his work on "Sender based
>message logging", which is cheap, but doesn't support rollback and hence
>isn't an appropriate tool for solving this particular problem (he is more
>interested in a cheap transparent fault-tolerance mechanism, and anyhow,
>he assumes a deterministic, non-threaded, execution).

Sender-based message logging was our initial work in this area, and it
is, as Ken points out, transparent and very fast, but it is restricted to
deterministic processes and can guarantee recovery from at most a single
failure at a time.  A recent paper describing the implementation and
performance of sender-based message logging is available as technical
report Rice COMP TR90-119.  
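The bookkeeping behind sender-based message logging can be sketched in a
few lines of Python (names are illustrative, and only the logging, not
the full protocol, is shown): the *sender* keeps each message together
with the receive sequence number (RSN) the receiver assigned to it, so a
crashed receiver can be rebuilt by replaying the senders' logs in RSN
order.

```python
class Sender:
    """Logs each sent message, keyed by the receiver's RSN."""
    def __init__(self):
        self.log = {}                       # rsn -> message

    def send(self, receiver, msg):
        rsn = receiver.receive(msg)         # receiver returns its RSN
        self.log[rsn] = msg                 # log message with its RSN

class Receiver:
    def __init__(self):
        self.next_rsn = 0
        self.state = []

    def receive(self, msg):
        rsn, self.next_rsn = self.next_rsn, self.next_rsn + 1
        self.state.append(msg)
        return rsn

    def recover(self, senders):
        """Replay messages from the senders' logs in RSN order."""
        self.state, self.next_rsn = [], 0
        replay = sorted((rsn, m) for s in senders
                        for rsn, m in s.log.items())
        for _, msg in replay:
            self.receive(msg)

s1, s2, r = Sender(), Sender(), Receiver()
s1.send(r, "a"); s2.send(r, "b"); s1.send(r, "c")
before = list(r.state)
r.recover([s1, s2])
assert r.state == before     # ["a", "b", "c"], rebuilt from sender logs
```

The single-failure restriction is visible in the sketch: the logs live
in the senders' volatile memory, so losing a sender and the receiver
together loses part of the replay.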

Beyond sender-based message logging, we have also developed optimistic
recovery methods, as mentioned by Rob Strom.  As with Strom and Yemini's
system, our optimistic system supports recovery from any number of
failures, including a total failure.  Furthermore, our current
optimistic recovery
system no longer requires deterministic process execution, and can thus
support arbitrary multi-threaded processes.

The principle in our work remains the same as in Strom and Yemini's original
paper.  As with their method, but unlike ISIS, we provide transparent
fault tolerance, relieving the programmer from having to worry about it.
Our current design stresses reducing overhead during failure-free execution in
exchange for possibly somewhat slower recovery times.  Each process
logs its own input messages, and takes occasional checkpoints, independently
of other processes.  Our system differs from Strom and Yemini's in that we
use a more space-efficient dependency encoding method (requiring only one
extra integer per message), and do not rely on an underlying reliable message
delivery protocol.  Our system also guarantees recovery to the maximum
consistent system state available from the surviving processes and the 
information on stable storage.
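The one-extra-integer dependency encoding can be sketched as follows (in
Python, with invented names; the real protocol is in the Rice reports):
each process's execution is divided into state intervals, each message
piggybacks only the sender's current state-interval index, and the
receiver records, per sender, the highest interval it now depends on.

```python
class Proc:
    """Tracks dependencies using one integer piggybacked per message."""
    def __init__(self, pid):
        self.pid = pid
        self.interval = 0        # current state-interval index
        self.deps = {}           # pid -> max interval depended on

    def send(self, msg):
        # Exactly one extra integer rides along with the message.
        return (self.pid, self.interval, msg)

    def receive(self, packet):
        sender, interval, msg = packet
        # Receiving a message starts a new state interval that depends
        # on the sender's interval at send time.
        self.interval += 1
        self.deps[sender] = max(self.deps.get(sender, -1), interval)
        return msg

a, b = Proc("A"), Proc("B")
b.receive(a.send("m1"))       # B depends on A's interval 0
a.interval += 1               # A enters a new state interval
b.receive(a.send("m2"))       # dependency advances to A's interval 1
assert b.deps["A"] == 1
```

During recovery, transitive dependencies can be reconstructed from these
per-sender maxima, which is what keeps the per-message cost at a single
integer rather than a full vector.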

We have completed a full implementation of optimistic recovery in the
V-System, although this implementation uses our earlier recovery algorithm
that does not yet support nondeterministic process execution.  Based on
measured performance, this method is very efficient.  Since logging is
asynchronous, very little overhead is imposed on individual communication
operations.  A 32-byte Send-Receive-Reply (similar to an RPC) between two
SUN-3/60's takes 1.6 milliseconds with logging vs. 1.4 milliseconds without,
for an overhead of 14 percent.  Similarly, a bulk data transfer of 64
kilobytes takes 89 milliseconds with logging vs. 87 milliseconds without,
for an overhead of only 2 percent.  The cost of checkpointing is reduced by
doing pre-copying of the address space to disk before freezing the process,
similarly to the method used by Theimer for process migration.  The resulting
overhead is somewhat application-dependent but is in general quite small:
the process is typically frozen for less than 50 milliseconds.  We have also
measured the performance of several distributed application programs with
the system.  For 300 x 300 Gaussian elimination with partial pivoting
running on 8 machines, the message logging adds less than 1 percent
overhead, and checkpointing each process (independently) every 15 seconds
adds less than another 2 percent, a small price to pay for the ability to
recover from any number of failures.  A complete account of this system
is included in my Ph.D. dissertation, available as technical report
Rice COMP TR89-101.
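The pre-copy trick that keeps the freeze time short can be illustrated
with a toy model (Python, all names hypothetical; real systems work on
pages and dirty bits, not dictionaries): copy the address space while
the process keeps running, then freeze it only long enough to re-copy
the pages dirtied during the first pass.

```python
class PreCopyCheckpoint:
    """Two-phase checkpoint: long copy while running, short copy frozen."""
    def __init__(self, pages):
        self.pages = dict(pages)     # the live address space
        self.dirty = set()           # pages written since last clear

    def write(self, page, value):
        """The process mutating its memory (sets the page's dirty bit)."""
        self.pages[page] = value
        self.dirty.add(page)

    def checkpoint(self):
        # Phase 1: pre-copy everything; the process is NOT frozen.
        self.dirty.clear()
        snapshot = dict(self.pages)          # the long, concurrent copy
        self.write(0, "changed-during-copy") # mutation during phase 1
        # Phase 2: freeze briefly; re-copy only the dirtied pages.
        for page in self.dirty:
            snapshot[page] = self.pages[page]
        return snapshot                      # consistent at freeze time

p = PreCopyCheckpoint({0: "a", 1: "b", 2: "c"})
snap = p.checkpoint()
assert snap == p.pages    # checkpoint matches memory at freeze time
```

The freeze in phase 2 is proportional to the pages dirtied during the
pre-copy rather than to the whole address space, which is why the
measured freeze stays under 50 milliseconds.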

A new implementation is currently underway.  This implementation will
support nondeterministic processes, and will provide better recovery times
and reduced output latency, in addition to further improved failure-free
performance.  Nondeterministic processes are supported by allowing a process
to dynamically turn off message logging, and rely on checkpointing alone for
recovery.  Output latency and recovery times are improved by loosely
coordinating the message logging and/or checkpointing.  Details may be
found in a recent technical report, Rice COMP TR90-118.  We are also
considering implementing this algorithm under Unix or Mach.

Finally, another member of our group, Mootaz Elnozahy, has been
experimenting with a new pessimistic protocol, applicable to both message
logging and replication.  Although we do not yet have an implementation of
this method, it seems to combine many of the advantages of optimistic and
pessimistic protocols.  His thesis proposal containing the first draft of
this method is available as Rice COMP TR90-120.

All of the above technical reports or any further information can be
obtained by sending mail to me (dbj@rice.edu) or Willy Zwaenepoel
(willy@rice.edu).

                                        David B. Johnson
                                        Department of Computer Science
                                        Rice University
--