jimbo@tandem.com (Jim Lyon) (02/02/90)
In article <35300@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>My problem is that it WASN'T about technology, and we ought to turn
>it back into a technology discussion that might be useful:
> a) Various people do fault-tolerance various ways.  How about
>    people who know posting some things to explain how they work,
>    and what the strengths and weaknesses of the various ways are?

Now that the discussion is back to technology, I'll be happy to put
in my two cents' worth.  The following is NOT to be taken as a
pronouncement on Tandem strategy, about which I know relatively
little (and am willing to say even less).  In general, the following
represents merely the opinions of one lowly grunt (me).

Before talking too much about fault tolerance, it is important to
know a little bit about a fault model.  In general, most people think
of a fault in a component as causing one of the following two
behaviors:

a) It stops dead.  or
b) It goes insane.

The latter case is VERY difficult to deal with.  People put in a
great deal of work trying to translate it into the first case.  TMR
schemes try to shoot the insane processor before it manages to poison
the outside world.  At higher levels, software is very frequently
full of tests for violated assertions (which are evidence of
insanity), in an attempt to kill the software component before the
insanity spreads.  One is not always 100% successful at this.

Hardware faults are frequently classed as either transient (a cosmic
ray flipped a bit in memory) or hard (a transistor is broken and a
bit in memory will always return zero).

Software faults are harder.  They are frequently classed as either
Bohr bugs or Heisenbugs.  A Bohr bug is a deterministic bug (every
time I try to run hack, the system fails).  A Heisenbug, on the other
hand, is a nondeterministic bug (if an interrupt occurs in a
particularly sensitive part of code, we'll corrupt a data structure
so that the next user of the data structure will die).  This
breakdown also applies to hardware, but people don't usually talk
about it in that context (I don't know why).

A good QA process will catch nearly 100% of the Bohr bugs and many,
but by no means all, of the Heisenbugs in a product.  It is realistic
to expect that a released product has more Heisenbugs left in it than
Bohr bugs.

These days, a typical complex system is built in lots of layers.  The
reliability of the system is the product of the reliabilities of each
of the layers [e.g., hardware, microcode, operating system, database,
application, etc.].  In a normal world, a failure in one layer will
cause the immediately higher layer to either:

a) notice the failure and correct for it, or
b) notice the failure and throw up its hands (because it doesn't know
   how to correct for it), or
c) fail to notice the failure, thereby failing itself.

In case (a), that's the end of it.  Your system has successfully
tolerated a fault.  In cases (b) and (c), the failure in one layer
has just been translated into the failure of the next layer up, and
we need to repeat.  Notice that in most systems, the uppermost layer
is a person (liveware?).  So, a failure at a low layer propagates up
and up, until we find a layer smart enough to deal with it.

However, not every failure starts at the bottom.  Operating systems
fail.  Device drivers have bugs.  Database managers have bugs.
Applications have bugs.
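For what it's worth, here is roughly what the violated-assertion
tests mentioned above look like in practice.  This is a minimal
sketch; the queue structure and its invariants are invented for
illustration, not taken from any particular system.

    /*
     * Fail-fast assertions: halt the component on a violated
     * invariant, turning "goes insane" into "stops dead" before
     * the insanity spreads.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define ASSERT_SANE(cond)                                       \
        do {                                                        \
            if (!(cond)) {                                          \
                fprintf(stderr, "insanity detected: %s (%s:%d)\n",  \
                        #cond, __FILE__, __LINE__);                 \
                abort();   /* stop dead, right now */               \
            }                                                       \
        } while (0)

    struct queue { int head, tail, count, size; };

    void queue_check(const struct queue *q)
    {
        /* A violated invariant means the structure is already
           corrupt; halting here leaves the failure in the form
           the next layer up knows how to handle. */
        ASSERT_SANE(q->size > 0);
        ASSERT_SANE(q->count >= 0 && q->count <= q->size);
        ASSERT_SANE(q->head >= 0 && q->head < q->size);
    }

    int main(void)
    {
        struct queue q = { 0, 0, 0, 8 };
        queue_check(&q);          /* sane: passes silently */
        q.count = -1;             /* simulate corruption */
        queue_check(&q);          /* halts the component here */
        return 0;
    }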
All of this having been said, the question still remains: Why is
checkpointing good/bad?  The prime virtue of checkpointing is that
you can do it again at each layer.

Conceptually, you introduce a new layer between each pair of your
previous layers.  Where before, layer 3 made direct requests of layer
2, now we have layer 3 use a new layer 2a.  Layer 2a maintains
interfaces to two (or more) instances of layer 2.  Should one
instance of layer 2 fail, layer 2a transparently redirects requests
to the other instance of layer 2.  The client, layer 3, never sees
the failure.  Of course, this gives rise to a couple of requirements:

a) All of the requests to a replicated layer must be idempotent.  If
   I ask an instance of layer 2 to debit my bank account by $100, and
   it fails after doing so but before reporting success, I don't want
   the other instance to debit another $100.  There are well-known
   schemes (using sequence numbers) to turn non-idempotent requests
   into idempotent ones; a sketch follows this list.

b) If a layer maintains state about its clients, this state needs to
   be kept synchronized among the various instances of that layer.
   Typically, the instances do this by informing each other whenever
   they change their state.  This is what we call a checkpoint.
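Here is a minimal sketch of the sequence-number scheme from
requirement (a).  The account record and request shape are invented
for illustration; this is not a description of Tandem's actual
protocol.

    /*
     * Sequence numbers make the debit idempotent: retrying the same
     * request (against this instance, or against a backup whose
     * state was checkpointed to match) replays the old reply
     * instead of debiting twice.
     */
    #include <stdio.h>

    struct account {
        long balance;
        long last_seqno;  /* highest sequence number applied so far */
        long last_reply;  /* reply that was sent for that request   */
    };

    long debit(struct account *a, long seqno, long amount)
    {
        if (seqno <= a->last_seqno)
            return a->last_reply;       /* duplicate: replay reply */
        a->balance -= amount;           /* first time: do the work */
        a->last_seqno = seqno;
        a->last_reply = a->balance;
        return a->last_reply;
    }

    int main(void)
    {
        struct account a = { 500, 0, 0 };
        printf("%ld\n", debit(&a, 1, 100));  /* 400 */
        printf("%ld\n", debit(&a, 1, 100));  /* still 400: replayed */
        return 0;
    }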
If we do this replication and checkpointing at every layer of the
system, we can achieve very high reliability.  It turns out that we
can mask all of the single hardware failures (both transient and
hard), as well as most of the software Heisenbugs.  This technique
does not mask the Bohr bugs; if a layer contains a Bohr bug such that
a certain request causes an instance of that layer to fail, then each
instance of that layer will end up failing, one at a time.  [One of
these days I'll send something to alt.computers.folklore about the
bug that caused 34 CPUs to fail sequentially, at 4-second intervals.]

In summary, checkpointing not only allows you to survive most of your
hardware failures, but also most of your operating system bugs, most
of your database manager bugs, most of your communication protocol
bugs, most of your transaction manager bugs, and even many of your
application bugs.

So, why doesn't everybody use checkpointing, all of the time?  In
particular, why didn't Tandem use checkpointing in the S2?  Well, ...

1) It's hard.  No doubt about it, it's frequently twice as hard to
   design a piece of software with checkpointing as it is without it.

2) If you already have a piece of software that was designed without
   checkpointing, it's VERY hard to add it as an afterthought.

3) If you insist on retrofitting checkpointing into something that
   wasn't designed with it in mind, you are likely to see VERY poor
   performance.

Remember that the Tandem S2's mission in life is to run Unix.  If you
want a machine to run Unix, you don't do checkpointing.  If you
really wanted to, you could, with a huge amount of work, put
checkpointing into the Unix kernel.  You couldn't, even if you wanted
to, manage to put checkpointing into any significant fraction of the
third-party software (like database managers, bizarre comm managers,
applications, etc.).  So, the only reliability you could add via
checkpointing is this: you could mask some of the Heisenbugs in the
kernel.  There just aren't that many there.

So, what DO you do if you want a high-reliability Unix system?  You:

a) Use TMR on the processor and memory.  We've just tolerated all of
   the single faults from these components.

b) Duplex the disks.  We're now in a position to tolerate hard disk
   errors.

c) Beef up the device drivers.  A large number of the panics that
   Unix systems experience are directly traceable to a transient
   error of one sort or another on a device.  Put in code to recover
   from these errors (a sketch follows this list), and use some
   aggressive test strategies to make sure that this code actually
   works.

d) Clean up a small number of other places where the kernel just
   gives up (primarily due to resource exhaustion).
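One plausible shape for the retry code in item (c); a sketch only,
with dev_read() and its failure behavior invented as stand-ins for a
real device interface:

    /*
     * Driver-level recovery from transient device errors: retry a
     * bounded number of times instead of letting one soft error
     * turn into a panic.
     */
    #include <stdio.h>

    #define MAX_RETRIES 3

    static int soft_faults = 1;      /* simulate one transient error */

    static int dev_read(int unit, long block, char *buf)
    {
        if (soft_faults > 0) {
            soft_faults--;
            return -1;               /* transient device error */
        }
        buf[0] = 0;                  /* pretend we read the block */
        return 0;
    }

    static int robust_read(int unit, long block, char *buf)
    {
        int tries, err = -1;

        for (tries = 0; tries < MAX_RETRIES; tries++) {
            err = dev_read(unit, block, buf);
            if (err == 0)
                return 0;            /* transient fault masked */
        }
        return err;                  /* report upward; don't panic */
    }

    int main(void)
    {
        char buf[512];
        printf("robust_read: %d\n", robust_read(0, 0L, buf));
        return 0;
    }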
The result is a system which:

a) Isn't perfect.
b) Will fail far less often than a conventional Unix system.

SUMMARY: If you want the highest degree of fault tolerance possible,
design it from the start to use checkpointing [If you come work for
Tandem for a few years, you'll learn how].  If you can't design it
[or redesign it] from the start, don't use checkpointing.  Depending
on where the real reasons for failure are, you may or may not benefit
from running it on a system that uses checkpointing at a lower level.

I hope this has been informative and hasn't sounded too much like a
Tandem commercial.  If not, well, I'll put on my asbestos suit now.

-- Jim Lyon
-- Tandem Computers
-- jimbo@tandem.com

dhepner@hpisod2.HP.COM (Dan Hepner) (02/07/90)
From: jimbo@tandem.com (Jim Lyon)

>In article <35300@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>> a) Various people do fault-tolerance various ways.  How about
>
>Now that the discussion is back to technology, I'll be happy to put
>in my two cents' worth.

We'll thank John Mashey for his contribution.

>fault in a component as causing one of the following two behaviors:
>
>a) It stops dead.  or
>b) It goes insane.
>
>The latter case is VERY difficult to deal with.  People put in a
>great deal of work trying to translate it into the first case.  TMR
>schemes try to shoot the insane processor before it manages to
>poison the outside world.

Are you suggesting that the techniques available to a TMR/QMR
designer are not totally effective in isolating an insane processor?
Could you offer an example of a type of insanity which can spread
through a voting/comparison barrier?

[a lot of good stuff on software faults deleted]

>In summary, checkpointing not only allows you to survive most of
>your hardware failures, but also most of your operating system bugs,
>most of your database manager bugs, most of your communication
>protocol bugs, most of your transaction manager bugs, and even many
>of your application bugs.

and later:

>If you want the highest degree of fault tolerance possible, design
>it from the start to use checkpointing [If you come work for Tandem
>for a few years, you'll learn how].

Could you clarify the claim here?  It sure seems you are suggesting
that the checkpointing system, including HW and SW, is inherently
more reliable than would be a TMR system which had had an equivalent
amount of effort devoted to reliability enhancement of its SW.  But
most of the excellent recommendations made WRT the SW are equally
applicable to both products, or even to non-FT products.

What is unique to checkpointing is the notion that each SW layer has
available to it one or more backup processes, and that the
checkpointing mechanism can be used as a tool for abandoning work
which led to some failure, thereby avoiding the Heisenbug panic case.

As long as we're willing to pay substantially increased SW
development costs, we might consider what else we might get for our
money.  There are other tools which can be used to attain high
reliability, and the basic save-state/fall-back-on-failure mechanism
(sketched below) can be used in the absence of even a backup process,
let alone a process in a different memory space.  Is there really
something offered by checkpointing to another memory space which
makes such SW inherently more reliable?  And from there, is there
really something offered by completely checkpointed HW/SW systems
which is not achievable on TMR/QMR systems?
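For concreteness, the single-process form of save-state/fall-back
might look roughly like this.  A sketch only: do_risky_work() is
invented, and a real recovery block would also restore the data
structures, not just the control state that setjmp captures.

    /*
     * Save state, fall back on failure, within one process and one
     * memory space, using setjmp/longjmp.
     */
    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf save_point;
    static int attempts;

    static void do_risky_work(void)
    {
        if (++attempts == 1)          /* simulated transient fault */
            longjmp(save_point, 1);   /* fall back to saved state  */
        printf("work completed on attempt %d\n", attempts);
    }

    int main(void)
    {
        if (setjmp(save_point) != 0)  /* re-entered on fall-back */
            fprintf(stderr, "fell back; retrying\n");
        do_risky_work();
        return 0;
    }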
>I hope this has been informative and hasn't sounded too much like a
>Tandem commercial.  If not, well, I'll put on my asbestos suit now.
>
>-- Jim Lyon
>-- Tandem Computers
>-- jimbo@tandem.com

Thanks a lot.

Dan Hepner

dhepner@hpisod2.HP.COM (Dan Hepner) (02/08/90)
From: rtrauben@cortex.Sun.COM (Richard Trauben)

[hopefully someone can answer the excellent "just how do you get it
stopped" questions]

>Presumably no-one is interested in dumping the state of a failed
>PE-pair's write-back $; execution would resume from the last process
>checkpoint.

Hmm.  If what you mean by "PE-pairs" is what is generally called Quad
Modular Redundancy (QMR), with two lockstep processors constituting a
PE, and two of those constituting a logical processor, there is no
requirement for checkpointing; the instruction will be successfully
executed.  If, however, what you mean by a PE-pair is two lockstepped
processors which, upon detection of a miscomparison, take themselves
offline, then indeed a checkpoint needs to be done to preserve the
state for some backup processor.  Part of the checkpointing mechanism
is the necessity to abandon all effects of processing done after the
checkpoint by the failed processor, which includes any write-back
state.

>How about resuming from the checkpoint and unintentionally resending
>redundant mass store and datacom messages?

The checkpoint itself must be atomic, in that it must complete fully
or not at all.  "Half-checkpoints" must be seen as effects of
processing done after the last successful checkpoint, and be
abandoned.

The IO request atomicity can be addressed as part of the problem of
checkpoint atomicity.  Once the atomic checkpoint mechanism is
developed, the initiation of IO requests can be incorporated, so that
the initiation of an IO request happens only at the time of a
successful checkpoint.  From the recovery processor's point of view,
either the checkpoint/IO request happened or it didn't, and that is
discernible.  This has covered the case of processor failure, and
guaranteed that the request has been issued once and only once.  As
noted, reissuing a disk write after an arbitrary amount of other
activity has happened could raise real havoc.  Left uncovered is the
potential for the requestee of the IO request to lose it, but that's
a different question.

>I/O caching and TCP/IP packet sequence numbers might conceal some of
>these problems but probably not all.

WRT disks, it seems essential to get it perfect.  Some comm might be
different.

>Richard
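To make the atomic checkpoint/IO combination concrete, a minimal
sketch follows.  The record layout is invented, and the backup
processor's memory is simulated with a static copy; this is not any
particular vendor's mechanism.

    /*
     * Write all of the checkpoint first, publish it with a single
     * final store, and initiate IO only after the publish.  The
     * recovery processor sees either a whole checkpoint or none.
     */
    #include <stdio.h>

    #define MAX_IO 8

    struct io_req   { int device; long block; long seqno; };
    struct ckpt_rec {
        char          state[1024];  /* checkpointed process state   */
        struct io_req io[MAX_IO];   /* IO to initiate iff committed */
        int           n_io;
        long          committed;    /* written LAST; 0 = incomplete */
    };

    static struct ckpt_rec backup;  /* stands in for the backup CPU */

    static void start_io(const struct io_req *r)
    {
        printf("start IO: dev %d block %ld seq %ld\n",
               r->device, r->block, r->seqno);
    }

    void checkpoint_and_start_io(struct ckpt_rec *c, long seqno)
    {
        int i;

        c->committed = 0;
        backup = *c;               /* bulk copy: may be torn        */
        backup.committed = seqno;  /* single-word publish: atomic   */
        for (i = 0; i < c->n_io; i++)
            start_io(&c->io[i]);   /* initiate only after publish   */
    }

    /* On takeover, reissue IO from the last committed checkpoint;
       sequence numbers make the duplicate issue harmless. */
    void takeover(void)
    {
        int i;

        if (backup.committed == 0)
            return;                /* half-checkpoint: abandon it   */
        for (i = 0; i < backup.n_io; i++)
            start_io(&backup.io[i]);
    }

    int main(void)
    {
        struct ckpt_rec c = { {0}, { {1, 7L, 42L} }, 1, 0 };
        checkpoint_and_start_io(&c, 1L);
        takeover();                /* reissues IO seq 42: harmless  */
        return 0;
    }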
Dan Hepner

rtrauben@cortex.Sun.COM (Richard Trauben) (02/09/90)
Dan Hepner responds to a thread about redundant mass-store and
datacom requests WRT rolling back to a checkpoint after a PE-pair
failure:

>> The IO request atomicity can be addressed as part of the problem
>> of checkpoint atomicity.  Once the atomic checkpoint mechanism is
>> developed, the initiation of IO requests can be incorporated, so
>> that the initiation of an IO request happens only at the time of a
>> successful checkpoint.  From the recovery processor's point of
>> view, either the checkpoint/IO request happened or it didn't, and
>> that is discernible.

A consequence of what you suggest is that a unique checkpoint must
exist for every packet in a duplex conversation (over a link) where
there are dependencies between talker and listener (debit/credit):
as in one checkpoint per TCP/IP or X.25 packet.  While it works, I
suspect it becomes THE bottleneck in packet transmission rates and
might lead to a very high frequency of checkpoints per second.

Richard
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (02/09/90)
In article <38@exodus.Eng.Sun.COM> rtrauben@cortex.EBay.Sun.COM
(Richard Trauben) writes:
>Dan Hepner responds to a thread about redundant mass-store and
>datacom requests WRT rolling back to a checkpoint after a PE-pair
>failure:
>
>>> The IO request atomicity can be addressed as part of the problem
>>> of checkpoint atomicity.  Once the atomic checkpoint mechanism is
>>> developed, the initiation of IO requests can be incorporated, so
>>> that the initiation of an IO request happens only at the time of
>>> a successful checkpoint.  From the recovery processor's point of
>>> view, either the checkpoint/IO request happened or it didn't, and
>>> that is discernible.
>
>A consequence of what you suggest is that a unique checkpoint must
>exist for every packet in a duplex conversation (over a link) where
>there are dependencies between talker and listener (debit/credit):
>as in one checkpoint per TCP/IP or X.25 packet.

The checkpointing systems that I'm aware of do not perform a
checkpoint on every IO.  Instead, they treat IO as a form of message
traffic.  Whenever a process receives a message (does a read), a copy
of the message is also put in a special queue.  When the process is
checkpointed, the queue is cleared.  So, yes, there is an overhead
per application-level IO operation.  But, no, the overhead is not a
complete checkpoint.  In the case of a read from a read-only file, I
suppose that the "message" could be a description of the read
request, instead of being a copy of the actual data.

Reliability is never without a price, but the price can be a lot
lower in selected cases.  For example: just ask the customer to try
again.

Also, "end to end" is a more general concept than some people seem to
think.  Suppose that a salesman sends orders to a central system, but
also keeps a copy in his local machine.  At intervals, the salesman
can have his machine prepare a summary, compress it, and send it in
when the telephone rates are low.  The central system can use
late-night cycles to check summaries against the online data.  This
sort of lazy checksumming is really cheap, and _eventually_ the files
are as correct as any other method could get them.
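A minimal sketch of that message-queue scheme, with invented names
and sizes; a real system would keep the log and the checkpoint
somewhere that survives the failure being tolerated, not in the
failing process's own memory.

    /*
     * Checkpoint plus message logging: each receive is one cheap
     * copy into a log; a checkpoint clears the log; recovery
     * restores the checkpoint and replays the logged messages.
     */
    #include <stdio.h>
    #include <string.h>

    #define LOG_SLOTS 64
    #define MSG_BYTES 256

    struct msg { unsigned len; char data[MSG_BYTES]; };

    static struct msg log_q[LOG_SLOTS]; /* messages since last ckpt */
    static int        log_n;
    static char       state[1024];      /* live process state       */
    static char       ckpt_state[1024]; /* last checkpointed state  */

    /* A full checkpoint saves the state and clears the log. */
    void checkpoint(void)
    {
        memcpy(ckpt_state, state, sizeof state);
        log_n = 0;
    }

    /* Per-receive cost is one copy, not a checkpoint. */
    void on_receive(const char *data, unsigned len)
    {
        if (log_n == LOG_SLOTS)
            checkpoint();          /* log full: take a checkpoint  */
        if (len > MSG_BYTES)
            len = MSG_BYTES;       /* truncate: sketch only        */
        log_q[log_n].len = len;
        memcpy(log_q[log_n].data, data, len);
        log_n++;
        /* ... apply the message to `state' here ... */
    }

    /* Recovery: reload the checkpoint, then replay the log. */
    void recover(void)
    {
        int i;

        memcpy(state, ckpt_state, sizeof state);
        for (i = 0; i < log_n; i++)
            printf("replay message %d (%u bytes)\n", i, log_q[i].len);
    }

    int main(void)
    {
        on_receive("deposit 100", 12);
        checkpoint();
        on_receive("debit 40", 9);
        recover();                 /* replays only "debit 40"      */
        return 0;
    }

--
Don             D.C.Lindsay      Carnegie Mellon Computer Science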