[comp.os.research] How do you tell if a remote site is alive?

darrell@sdcsvax.UCSD.EDU (Darrell Long) (06/09/87)

Here's a question for you: In a distributed system, how do you tell if a remote
site is alive or not?  A time-out could be used, but it's not reliable and it's
also very slow.  I can think of many approximate solutions, but reliability is
important.

When constructing a distributed system, the network is the slowest component
and presents the bottleneck.  From what I've read, most folks just assume that
there is a way to tell if a remote site is dead.  But, this information is very
important to many algorithms.

How is it done in practice?  For us university-types, time-out is the usual
approximation to a solution since we're usually just out to prove a concept
and not to build a product.

How do the folks in industry do it?  Performance is critical there, unlike a
university prototype.

DL
-- 
Darrell Long
Department of Computer Science & Engineering, UC San Diego, La Jolla CA 92093
ARPA: Darrell@Beowulf.UCSD.EDU  UUCP: darrell@sdcsvax.uucp
Operating Systems submissions to: mod-os@sdcsvax.uucp

darrell@sdcsvax.UUCP (06/10/87)

> ... In a distributed system, how do you tell if a remote
> site is alive or not?  A time-out could be used, but it's not reliable and it's
> also very slow.  I can think of many approximate solutions, but reliability is
> important.
> 
> 
> How do the folks in industry do it?  Performance is critical there, unlike a
> university prototype.

Many systems [a la STARLAN {AT&T}] use a connection-mode service with a
keep-alive signal.  The absence of the signal within a long time frame kills
the connection.  When a node dies, or is dead, its LOGICAL name is removed from
the network.  Any attempt to establish a new connection responds with
"NO SUCH DEVICE ON NETWORK" (-: the electronic version thereof as app. :-)

When you can't afford the wait, and must know "NOW", you send a message
through the "connection" or attempt to establish a new "connection" and
the lack of an ACK packet, or addressable logical name provides you with
the information in question.
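
A minimal sketch of the keep-alive bookkeeping described above, in Python.
The class, method names, and the 30-second limit are hypothetical and are not
taken from STARLAN; clock values are passed in explicitly so the example is
deterministic.

```python
class KeepAliveMonitor:
    """Tracks the last keep-alive seen from each logical name (illustrative)."""

    def __init__(self, idle_limit):
        self.idle_limit = idle_limit  # seconds of silence tolerated (assumed value)
        self.last_seen = {}           # logical name -> time of last keep-alive

    def heard_from(self, name, now):
        """Record a keep-alive signal from `name` at time `now`."""
        self.last_seen[name] = now

    def expire(self, now):
        """Remove and return the names silent for longer than idle_limit."""
        dead = [n for n, t in self.last_seen.items() if now - t > self.idle_limit]
        for n in dead:
            del self.last_seen[n]
        return sorted(dead)

    def connect(self, name):
        """A new connection to a removed name fails at once."""
        if name not in self.last_seen:
            raise LookupError("NO SUCH DEVICE ON NETWORK")
        return name


monitor = KeepAliveMonitor(idle_limit=30.0)
monitor.heard_from("alpha", now=0.0)
monitor.heard_from("beta", now=0.0)
monitor.heard_from("alpha", now=25.0)
removed = monitor.expire(now=40.0)  # beta has been silent for 40 s, alpha for 15 s
```

In this sketch the immediate "no such device" answer is only as fresh as the
last call to expire(), which is itself driven by the idle limit.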

[ Doesn't this still imply a time-out? --DL ]

Robert.

ken@gvax.cs.cornell.edu (Ken Birman) (06/10/87)

In the ISIS system, we use a software protocol that triggers higher level
failure actions.  The protocol is very fast and quite simple -- a multiphase
commit, basically.  It gets triggered by timeouts, but the idea is that
when an overload occurs or something else causes an incorrect timeout, the
system shouldn't suddenly become inconsistent.  This is important because
timeouts are really flakey in LAN systems (as any SUN user will tell you).
Thus, it is hard to build really robust software on top of a purely timeout
based scheme.  Our approach has no overhead at all except when a failure
actually takes place and normally kicks in within about 2 seconds of a
crash (but this can be tuned adaptively, say if overloads are common).
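
As an illustration of what tuning a timeout adaptively could look like, here
is a small Python sketch.  It is NOT the ISIS mechanism, which the post does
not describe; it swaps in a generic estimator (a smoothed delay plus a multiple
of the smoothed deviation).  The class name, gain, and margin are assumptions.

```python
class AdaptiveTimeout:
    """Smoothed response delay plus a safety margin (generic, not ISIS)."""

    def __init__(self, initial_delay, gain=0.25, margin=4.0):
        self.mean = initial_delay  # smoothed response delay, in seconds
        self.dev = 0.0             # smoothed absolute deviation, in seconds
        self.gain = gain           # weight given to each new sample (assumed)
        self.margin = margin       # deviations of slack before suspecting a crash

    def observe(self, delay):
        """Fold one measured response delay into the running estimates."""
        err = delay - self.mean
        self.mean += self.gain * err
        self.dev += self.gain * (abs(err) - self.dev)

    def timeout(self):
        """Current timeout: grows when delays grow or become erratic."""
        return self.mean + self.margin * self.dev


estimator = AdaptiveTimeout(initial_delay=0.1)
before = estimator.timeout()
estimator.observe(0.5)       # an overloaded host answers slowly
after = estimator.timeout()  # the timeout stretches instead of firing falsely
```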

People may want to refer to a recent publication in the February issue of
TOCS for details, or send to me requesting reprints of this and other ISIS
related papers.  The protocols described in that paper, including the failure
detector, are operational now at Cornell.  We'll have performance figures
soon.  If you know us mostly from our old work on resilient objects, you
may want to look at this new material.  It focuses on fault-tolerance and
reliability without transactions (I no longer believe in transactions) and
uses a toolkit approach.  Much of the rest of the system is running too, and
we will have some nontrivial application software coming up during the summer.

Ken Birman

heddaya@harvard.harvard.edu (Abdelsalam Heddaya--aka Solom) (06/11/87)

In-reply-to: darrell@sdcsvax.UCSD.EDU's message of 9 Jun 87 06:35:08 GMT

The question of detecting failures in a distributed system is a very
tricky one.  The standard method of sending a "ping" message to the
machine in question and timing out on the "pong" (or ack) suffers from
the following two problems:

1. The machine may simply be slow in responding, and ends up being
   assumed dead by some machines on the network, and alive by some
   others.  Worse still, it doesn't know that it is assumed dead and
   may happily proceed to destroy the consistency of shared data.  One
   (unsatisfactory) solution is for the other machines to force the
   slow one to fail, aborting all the updates it may have done while
   in the twilight zone.  This problem may go away if the ping-pong
   messages are handled by dedicated hardware, which, if running, has
   constant speed.

2. Distinguishing between link failures and processor failures is
   hard, especially if the link failure is such that only some
   messages are dropped or delayed.  A timeout can indicate either a
   message loss or delay, a link failure, or a processor failure.
   Again, if we are to successfully distinguish these failure modes,
   special low-level network support will be needed.  In particular,
   the network must guarantee that if a path exists, the message will
   be delivered within some bounded time delay.  A weaker version of
   this requirement, which turns a message delay into a message loss,
   is easily achievable.  The network can guarantee that a message is
   either delivered within a bounded time, or not at all, by having
   the receiver check the sender timestamp on the message and dropping
   it if it is too far in the past.
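
The weaker guarantee in point 2 (deliver within a bound, or not at all) can be
sketched in Python as follows.  The Message type, the deliver() function, and
the one-second bound are hypothetical, and the sketch assumes that sender and
receiver clocks are closely synchronized, which the scheme needs in order to
compare timestamps at all.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    sent_at: float  # timestamp from the sender's clock, in seconds
    body: str


def deliver(messages, now, max_delay):
    """Keep messages no older than max_delay; drop the rest as if lost."""
    return [m for m in messages if now - m.sent_at <= max_delay]


inbox = [
    Message("a", sent_at=9.5, body="ping"),       # 0.5 s old: delivered
    Message("b", sent_at=3.0, body="late ping"),  # 7 s old: dropped
]
fresh = deliver(inbox, now=10.0, max_delay=1.0)
```

A delayed message thus looks exactly like a lost one to the layers above, so
they only have to handle loss.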

For these two reasons, many of the protocols in the literature do not
require immediate or accurate failure detection (e.g. the various
flavors of voting protocols).

[ Voting protocols allow some flexibility by only requiring a majority of ]
[ sites.  The problem here is still "How long do you wait?"  --DL         ]

---Abdelsalam Heddaya
	heddaya@harvard.harvard.edu	heddaya@harvard.cs.net
	heddaya@harvunxh.bitnet		heddaya@harvard.uucp
	{rutgers, topaz, ihnp4, allegra, seismo, ...}!harvard!heddaya
	Aiken Lab, 33 Oxford St., Harvard U., Cambridge, MA 02138.
	(617) 495-3998

darrell@sdcsvax.UUCP (06/12/87)

> How is it done in practice?  For us university-types, time-out is the usual
> approximation to a solution since we're usually just out to prove a concept
> and not to build a product.
> 
> How do the folks in industry do it?  Performance is critical there, unlike a
> university prototype.

In much (user-level) software, it's still done by timeouts.  Even in the Sun
NFS kernel there are layers of timeouts on top of the UDP protocol (they use
UDP because it's fast on a small LAN, and then put timeouts of 1 second on
top of it ... strange, huh?).

In one distributed application I read about, hosts periodically send out
sanity checks which have (host#,state) pairs, either just for themselves or
for all hosts they know about.  When a host discovers that another host is
down, it modifies its state information for that host; in the next broadcast,
all other hosts learn of the dead host.  When it comes back up it tells
everyone "I'm alive again".  This gets very expensive when there are a lot of
hosts on the network, but it does keep other hosts from having to timeout.
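
A rough Python sketch of how such (host#,state) tables might be merged when a
broadcast arrives.  The version counter used to decide which report is newer
is an assumption added for the example; the description above does not say
how conflicting reports are resolved.

```python
def merge(local, received):
    """Merge a received table into the local one; the newer version wins.

    Both tables map host number -> (version, state).  The version field is
    a hypothetical addition, not part of the scheme as described.
    """
    for host, (version, state) in received.items():
        if host not in local or version > local[host][0]:
            local[host] = (version, state)
    return local


mine = {1: (3, "up"), 2: (1, "up")}
broadcast = {2: (2, "down"), 3: (1, "up")}  # a peer has seen host 2 go down
merge(mine, broadcast)
```

After the merge this host also regards host 2 as down and has learned of
host 3, without having timed out on either of them itself.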

I'm not sure of any other methods; you might want to check to see what LOCUS did
-- I know they're not a commercial product (or are they?), but it seemed like
a well done system to me.

I wish I knew of better methods (than timeouts), but I don't.  I'd be
very interested in hearing what you discover in this area.  Thanks, and happy
researching (from the land of development!).

			      --      jad      --
				 John A Dilley
			      Hewlett Packard Co.
			  Colorado Networks Division
			Fort Collins, Colorado	 80525

ARPA:			    jad%hpcndm@hplabs.HP.COM
UUCP:		           {ihnp4,hplabs} !hpfcla!jad

darrell@sdcsvax.UUCP (06/12/87)

In article <3302@sdcsvax.UCSD.EDU> jad@hpcndm.HP.COM (John A Dilley) writes:
>> How is it done in practice?  For us university-types, time-out is the usual
>> approximation to a solution since we're usually just out to prove a concept
>> and not to build a product.
>> 
>> How do the folks in industry do it?  Performance is critical there, unlike a
>> university prototype.
>
>In much (user-level) software, it's still done by timeouts.  Even in the Sun
>NFS kernel there are layers of timeouts on top of the UDP protocol (they use
>UDP because it's fast on a small LAN, and then put timeouts of 1 second on
>top of it ... strange, huh?).
>
...
>				 John A Dilley

	If you are using a token-bus, as opposed to ethernet, or other similar
protocols, you should be able to know if a node is down in one circuit of the
bus.  This requires a timeout, but it is minuscule compared to things like
1 sec.  If the node's packet was somehow lost, or the node comes back up,
it notices it isn't in the ring and inserts a request to be added at a 
predefined spot of the ring.  (This is of course an over-simplified 
explanation, but I think everyone should get the idea).
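
The same idea as a toy Python model.  This is not an implementation of any
real token-bus standard: the TokenRing class, the ascending address order,
and the `alive` set that stands in for "took the token before the short
timeout" are all assumptions made for the example.

```python
class TokenRing:
    """Toy logical ring: dead stations drop out as the token is passed."""

    def __init__(self, stations):
        self.ring = sorted(stations)  # logical ring ordered by address (assumed)

    def pass_token(self, holder, alive):
        """Hand the token to the next live station after `holder`.

        Stations that fail to take the token are removed from the ring.
        """
        i = self.ring.index(holder)
        candidates = self.ring[i + 1:] + self.ring[:i]
        for station in candidates:
            if station in alive:
                return station
            self.ring.remove(station)  # no response within the short timeout
        return holder  # sole survivor keeps the token

    def rejoin(self, station):
        """A recovered station is re-inserted at its predefined spot."""
        if station not in self.ring:
            self.ring.append(station)
            self.ring.sort()


ring = TokenRing([10, 20, 30, 40])
next_holder = ring.pass_token(10, alive={10, 30, 40})  # station 20 is down
after_failure = list(ring.ring)
ring.rejoin(20)                                        # station 20 comes back
```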

	Randell Jesup
	jesup@ge-crd.arpa
	jesup@steinmetz.uucp

jack@cwi.nl (Jack Jansen) (06/17/87)

I might misunderstand the whole matter, but as far as I know there is
no way to distinguish between a network partitioning and a crashed
host, is there?

Maybe you could come up with some scheme where the host that was cut off from
the main network would die voluntarily, but I'm not sure whether this would
do you any good. Moreover, if the network is partitioned into two parts
with an equal number of hosts, you don't want to bring the whole
network down, I guess.
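
One way to read "die voluntarily" is a strict-majority rule, sketched here in
Python.  The function name and the rule itself are assumptions made for
illustration, not something proposed in this thread, and the sketch shows the
even-split problem directly.

```python
def has_quorum(reachable, total):
    """True if this partition holds a strict majority of all hosts.

    `reachable` counts the hosts this host can still talk to, including
    itself; `total` is the full size of the network.
    """
    return 2 * reachable > total


majority_side = has_quorum(3, 5)  # the 3-host side of a 3/2 split keeps running
minority_side = has_quorum(2, 5)  # the 2-host side would stop
even_split = has_quorum(2, 4)     # in a 2/2 split neither side has a majority
```

Under this rule an even split stops both halves, which is the case where the
whole network would be brought down.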

Also, usually you are interested in *when* the host crashed, i.e. what it
got done before crashing. Your application will have to find that out anyway,
so why not just assume that it did crash as soon as you have that suspicion?

-- 
	Jack Jansen, jack@cwi.nl (or jack@mcvax.uucp)
	The shell is my oyster.