darrell@sdcsvax.UCSD.EDU (Darrell Long) (06/09/87)
Here's a question for you: In a distributed system, how do you tell if a remote site is alive or not? A time-out could be used, but it's not reliable and its also very slow. I can think of many approximate solutions, but reliability is important. When constructing a distributed system, the network is the slowest component and presents the bottle-neck. From what I've read, most folks just assume that there is a way to tell if a remote site is dead. But, this information is very important to many algorithms. How is it done in practice? For us university-types, time-out is the usual approximation to a solution since we're usually just out to prove a concept and not to build a product. How do the folks in industry do it? Performance is critical there, unlike a university prototype. DL -- Darrell Long Department of Computer Science & Engineering, UC San Diego, La Jolla CA 92093 ARPA: Darrell@Beowulf.UCSD.EDU UUCP: darrell@sdcsvax.uucp Operating Systems submissions to: mod-os@sdcsvax.uucp
darrell@sdcsvax.UUCP (06/10/87)
> ... In a distributed system, how do you tell if a remote > site is alive or not? A time-out could be used, but it's not reliable and its > also very slow. I can think of many approximate solutions, but reliability is > important. > > > How do the folks in industry do it? Performance is critical there, unlike a > university prototype. Many systems [Ah La STARLAN {AT&T}] use a conection mode service with a keep alive signal. The absence of the signal within a long time frame kills the connection. When a node dies, or is dead, it's LOGICAL name is removed from the network. Any attempt to establish a new connection responds with "NO SUCH DEVICE ON NETWORK" (-: the electronic version thereof as app. :-) When you can't afford the wait, and must know "NOW" you send a message through the "connection" or attempt to establish a new "connection" and the lack of an ACK packet, or addressable logical name provides you with the information in question. [ Doesn't this still imply a time-out? --DL ] Robert. -- Darrell Long Department of Computer Science & Engineering, UC San Diego, La Jolla CA 92093 ARPA: Darrell@Beowulf.UCSD.EDU UUCP: darrell@sdcsvax.uucp Operating Systems submissions to: mod-os@sdcsvax.uucp
ken@gvax.cs.cornell.edu (Ken Birman) (06/10/87)
In the ISIS system, we use a software protocol that triggers higher level failure actions. The protocol is very fast and quite simple -- a multiphase commit, basically. It gets triggered by timeouts, but the idea is that when an overload occurs or something else causes an incorrect timeout, the system shouldn't suddenly become inconsistent. This is important because timeouts are really flakey in LAN systems (as any SUN user will tell you). Thus, it is hard to build really robust software on top of a purely timeout based scheme. Our approach has no overhead at all except when a failure actually takes place and normally kicks in within about 2 seconds of a crash (but this can be tuned adaptively, say if overloads are common). People may want to refer to a recent publication in the February issue of TOCS for details, or send to me requesting reprints of this and other ISIS related papers. The protocols described in that paper, including the failure detector, are operational now at Cornell. We'll have performance figures soon. If you know us mostly from our old work on resilient objects, you may want to look at this new material. It focuses on fault-tolerance and reliability without transactions (I no longer believe in transactions) and uses a toolkit approach. Much of the rest of the system is running too, and we will have some nontrivial application software coming up during the summer. Ken Birman
heddaya@harvard.harvard.edu (Abdelsalam Heddaya--aka Solom) (06/11/87)
In-reply-to: darrell@sdcsvax.UCSD.EDU's message of 9 Jun 87 06:35:08 GMT The question of detecting failures in a distributed system is a very tricky one. The standard method of sending a "ping" message to the machine in question and timing out on the "pong" (or ack) suffers from the following two problems: 1. The machine may simply be slow in responding, and ends up being assumed dead by some machines on the networks, and live by some others. Worse still, it doesn't know that it is assumed dead and may happily proceed to destroy the consistency of shared data. One (unsatisfactory) solution is for the other machines to force the slow one to fail, aborting all the updates it may have done while in the twilight zone. This problem may go away if the ping-pong messages are handled by dedicatead hardware, which, if running, has constant speed. 2. Distinguishing between link failures and processor failures is hard, especially if the link failure is such that only some messages are dropped or delayed. A timeout can indicate either a message loss or delay, a link failure, or a processor failure. Again, if we are to successfully distinguish these failure modes, special low-level network support will be needed. In particular, the network must guarantee that if a path exists, the message will be delivered within some bounded time delay. A weaker version of this requirement, which turns a message delay into a message loss, is easily achievable. The network can guarantee that a message is either delivered within a bounded time, or not at all, by having the receiver check the sender timestamp on the message and dropping it if it is too far in the past. For these two reasons, many of the protocols in the literature do not require immediate or accurate failure detection (e.g. the various flavors of voting protocols). [ Voting protocols allow some flexibility by only requiring a majority of ] [ sites. The problem here is still "How long do you wait?" --DL ] ---Abdelsalam Heddaya heddaya@harvard.harvard.edu heddaya@harvard.cs.net heddaya@harvunxh.bitnet heddaya@harvard.uucp {rutgers, topaz, ihnp4, allegra, seismo, ...}!harvard!heddaya Aiken Lab, 33 Oxford St., Harvard U., Cambridge, MA 02138. (617) 495-3998
darrell@sdcsvax.UUCP (06/12/87)
> How is it done in practice? For us university-types, time-out is the usual > approximation to a solution since we're usually just out to prove a concept > and not to build a product. > > How do the folks in industry do it? Performance is critical there, unlike a > university prototype. In much (user-level) software, it's still done by timeouts. Even in the Sun NFS kernel there are layers of timeouts on top of the UDP protocol (they use UDP because it's fast on a small LAN, and then put timeouts of 1 second on top of it ... strange, huh?). In one distributed application I read about, hosts periodically send out sanity checks which have (host#,state) pairs, either just for themselves or for all hosts they know about. When a host discovers that another host is down, it modifies its state information for that host; in the next broadcast, all other hosts learn of the dead host. When it comes back up it tells everyone "I'm alive again". This gets very expensive when there are a lot of hosts on the network, but it does keep other hosts from having to timeout. Not sure of any other methods, you might want to check to see what LOCUS did -- I know they're not a commercial product (or are they?), but it seemed like a well done system to me. I wish I knew of more better methods (than timeouts), but I don't. I'd be very interested in hearing what you discover in this area. Thanks, and happy researching (from the land of development!). -- jad -- John A Dilley Hewlett Packard Co. Colorado Networks Division Fort Collins, COlorado 80525 ARPA: jad%hpcndm@hplabs.HP.COM UUCP: {ihnp4,hplabs} !hpfcla!jad
darrell@sdcsvax.UUCP (06/12/87)
In article <3302@sdcsvax.UCSD.EDU> jad@hpcndm.HP.COM (John A Dilley) writes: >> How is it done in practice? For us university-types, time-out is the usual >> approximation to a solution since we're usually just out to prove a concept >> and not to build a product. >> >>How do the folks in industry do it? Performance is critical there, unlike a >> university prototype. > >In much (user-level) software, it's still done by timeouts. Even in the Sun >NFS kernel there are layers of timeouts on top of the UDP protocol (they use >UDP because it's fast on a small LAN, and then put timeouts of 1 second on >top of it ... strange, huh?). > ... > John A Dilley If you are using a token-bus, as opposed to ethernet, or other similar protocols, you should be able to know if a node is down in one circuit of the bus. This requires a timeout, but it is miniscule compared to things like 1 sec. If the node's packet was somehow lost, or the node comes back up, it notices it isn't in the ring and inserts a request to be added at a predefined spot of the ring. (This is of course an over-simplified explanation, but I think everyone should get the idea). Randell Jesup jesup@ge-crd.arpa jesup@steinmetz.uucp
jack@cwi.nl (Jack Jansen) (06/17/87)
I might misunderstand the whole matter, but as far as I know there is no way to distinguish between a network partitioning and a crashed host, is there? Maybe you could come up with some scheme where the host that was cut off from the main network would die voluntarily, but I'm not sure wether this would do you any good. Moreover, if the network is partitioned into two parts with an equal number of hosts, you don't wat to bring the whole network down, I guess. Also, usually you are interested in *when* the host crashed, i.e. what it got done before crashing. Your application will have to find that out anyway, so why not just assume that it did crash as soon as you have that suspicion? -- Jack Jansen, jack@cwi.nl (or jack@mcvax.uucp) The shell is my oyster.