dhepner@hpisod2.HP.COM (Dan Hepner) (01/18/90)
From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)

>Some machines had redundant hardware (e.g. two ALUs) and could
>reconfigure to cut failed units out of the "processor complex".
>I don't see micros following this path.

Fault tolerant micro machines are following a path at least something
like this.  The general scheme, as implemented by Stratus, Sequoia,
and Tandem [S2], involves redundant CPUs and a scheme for "cutting
out" any offender who gets a wrong answer.

>Live CPU recovery has become much less interesting since
>multiprocessors came along.  With the right software, a failed
>processor does not imply a failed process.  For example, Tandem
>checkpoints each process regularly, so that a different processor can
>do a prompt checkpoint-resumption.

Tandem has apparently decided that this was not the correct model for
implementing fault tolerance, although I've searched for, and not yet
found, an official statement on just how the S2 does do FT.

>The CPU and IO interconnects have
>to be up to it, of course (dual port those disks).

Dual porting is a classic question in FT.  You're going to use
redundant disks, of course.  Once you have redundant disks, and have
paid attention to your interconnect scheme to insure that a path
failure won't take down both disks, dual porting will not further
enhance FT, if FT is defined as "the ability to sustain any single
point of failure".

>And besides: if a
>master/checker pair of CPUs disagree, which one was the one that
>failed?  Better to ignore them both and force the board into self test
>mode.

This scheme works fine for Stratus, but one can get to roughly the
same place by using three processors, and tossing out any one which
disagrees with the other two.

>Well, nonstop machines are ruggedized and rated for e.g. sudden
>overpressures (no kidding).  This might influence a chip company to
>change its chip packaging, but not its chip design.

Maybe you could expand on this.

Great.  A discussion of fault tolerance.

Dan Hepner
dhepner@hpda.hp.com
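The "use three and toss out the dissenter" approach Dan describes is the
classic triple-modular-redundancy (TMR) vote.  A minimal sketch in Python
(the function names are mine, not any vendor's implementation) shows why
TMR can identify the failed unit while a master/checker pair can only
detect a miscompare:

```python
def tmr_vote(a, b, c):
    """Majority-vote the results of three redundant computations.

    Returns (value, failed_index), where failed_index identifies the
    dissenting unit (None if all three agree).  If all three disagree,
    no majority exists and the fault cannot be masked.
    """
    if a == b == c:
        return a, None          # normal case: full agreement
    if a == b:
        return a, 2             # unit c disagrees: cut it out
    if a == c:
        return a, 1             # unit b disagrees
    if b == c:
        return b, 0             # unit a disagrees
    raise RuntimeError("no majority: multiple simultaneous faults")


def pair_check(master, checker):
    """A master/checker pair can only detect a miscompare, not decide
    which side failed -- hence Stratus's 'ignore them both and force
    the board into self test'."""
    return master if master == checker else None
```

With a pair, a miscompare yields no usable answer; with three units, the
computation continues and the dissenter is cut out in the same step.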
rwpratt@Neon.Stanford.EDU (Robert W. Pratt) (01/19/90)
In article <13910004@hpisod2.HP.COM> dhepner@hpisod2.HP.COM (Dan Hepner) writes:
>From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
>>Live CPU recovery has become much less interesting since
>>multiprocessors came along.  With the right software, a failed
>>processor does not imply a failed process.  For example, Tandem
>>checkpoints each process regularly, so that a different processor can
>>do a prompt checkpoint-resumption.
>
>Tandem has apparently decided that this was not the correct model
>to implement fault tolerance, although I've searched for and not
>found yet an official statement on just how the S2 does do FT.

I think a more accurate statement would be that Tandem decided not to
do checkpointing for fault tolerance under UNIX System V.3, since that
would have called for radical (IMHO) changes to UNIX.  Guardian
(Tandem's proprietary OS) still uses checkpointing.

Disclaimer: I did not work on the S2, and the above is exclusively my
opinion, not Tandem's or Stanford's.

Bob P.
--
Bob Pratt    INTERnet: pratt@jessica.stanford.edu (much more reliable)
             pratt_robert@comm.tandem.com (checked more, but flaky)
USMail: 2225 Sharon Rd. #323, Menlo Park, CA 94025
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (01/19/90)
In article <13910004@hpisod2.HP.COM> dhepner@hpisod2.HP.COM (Dan Hepner) writes:
>From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
>>Live CPU recovery has become much less interesting since
>>multiprocessors came along.  With the right software, a failed
>>processor does not imply a failed process.  For example, Tandem
>>checkpoints each process regularly, so that a different processor can
>>do a prompt checkpoint-resumption.
>
>Tandem has apparently decided that this was not the correct model
>to implement fault tolerance, although I've searched for and not
>found yet an official statement on just how the S2 does do FT.

The most basic thing is to contain the damage, so it's usual to use
messages between machines that don't share memory.  The next problem
is the data that was inside the failed thing.  There are several ways
out:

1- Whatever it was doing is dead.  (Rarely acceptable, but if the
   customer is just buying a lottery ticket, I suppose you can ask him
   to do it again.)
2- Another CPU recomputes the lost data from its copy of the last
   checkpoint, plus copies of all the messages and signals sent to the
   dead machine.
3- Another CPU has been computing in parallel, and contains data that
   should be identical to the lost data.

Case 1 is an application-level choice, but still needs error
detection.  Case 2 is done with software and with error detection.
Case 3 can be done like case 2, or it can be done with a
master-and-two-checkers.  As you pointed out, the three chips can hold
a vote, and decide that only two of the three contain valid data.
That can be a lot of data: with on-chip caches and PIDs, it can also
be data from several different processes.

>>Well, nonstop machines are ruggedized and rated for e.g. sudden
>>overpressures (no kidding).  This might influence a chip company to
>>change its chip packaging, but not its chip design.
>
>Maybe you could expand on this.

I'm probably out of touch with ruggedizing, but there are companies
that specialize in it.  They like things that tolerate flexing.  They
like sealed spaces (to keep out salt air and conductive dust and
tobacco tar).  They like to coat things: a grease layer is used on
some automotive chips.  They like hermetically sealed chips, and they
used to like ceramic over plastic.  They used to worry about pin
corrosion, but I don't know what they think of TAB.

The system specs can involve higher ambient temperatures, explosion
(overpressure), shock, high voltage shorted onto the rack, ground
currents on the Ethernet shield, ambient RF, industrial-strength noise
on the power lines, locks on the rack door, having to push a telex
signal through a one-henry coil.  I recall a system that failed from
being too near a gamma-ray source: but that wasn't in the spec.  (The
gamma rays were just enough that an EPROM would forget one bit after
about a month.  We solved this with inverse-square shielding.  That
is, we pushed the desk further down the hall.)
--
Don    D.C.Lindsay    Carnegie Mellon Computer Science
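Case 2 in Don's list above can be pictured as checkpoint-plus-message-log
replay.  This is a toy model of the idea (class and method names are my
own, not any vendor's protocol), and it assumes message handling is
deterministic so that replay reproduces the lost state exactly:

```python
import copy

class ReplayableProcess:
    """Toy model of case 2: checkpoint plus message-log recovery.

    The primary periodically checkpoints its state; every message
    delivered after the checkpoint is also logged somewhere a backup
    CPU can reach.  After a failure, the backup restores the last
    checkpoint and replays the logged messages in order.
    """
    def __init__(self, state):
        self.state = state
        self.checkpoint_state = copy.deepcopy(state)
        self.message_log = []

    def handle(self, msg):
        # Deterministic state update: here, just accumulate a total.
        self.state["total"] += msg
        self.message_log.append(msg)

    def checkpoint(self):
        # Save the state; messages before this point need not be kept.
        self.checkpoint_state = copy.deepcopy(self.state)
        self.message_log.clear()

    def recover_on_backup(self):
        """What a surviving CPU does after the primary fails: restore
        the checkpoint and replay every message sent since."""
        restored = copy.deepcopy(self.checkpoint_state)
        for msg in self.message_log:
            restored["total"] += msg
        return restored
```

The recomputation cost grows with the time since the last checkpoint,
which is exactly the recovery-latency point raised later in the thread.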
dhepner@hpisod2.HP.COM (Dan Hepner) (02/03/90)
From: mash@mips.COM (John Mashey)

> a) Various people do fault-tolerance various ways.  How about
> people who know posting some things to explain how they work,
> and what the strengths and weaknesses of the various ways are?

While various people do FT various ways, it is reasonable to decide to
discuss a restricted set.  If the space shuttle needs five computers,
programmed by different companies using different algorithms, that's
fine, but that would appear beyond the scope of how FT is best done
using microprocessors.

The ways divide cleanly into two: checkpointing vs. redundantly
executing instructions with enough processors to guarantee completion
regardless of any failure.  All FT schemes should include some means
to detect failure, of course.

"The problem" which must be solved by any FT scheme is how to get each
instruction executed once and only once WRT the user.  Checkpointing
solves this by saving enough process state, in a place available to
another processor, to restart the process in the event of original
processor failure.  Any post-checkpoint execution by the failed
processor is guaranteed to be abandoned.  Redundant processor machines
always execute the instruction with more than one processor (3 for
Tandem, 4 for Stratus), and compare results.  Miscomparisons result in
reliable detection of "crazy processors", whether the fault is in
cache, logic, or whatever.

> b) Of particular interest in this discussion: are there features
> in fault-tolerant OLTP systems that:

OLTP, while clearly correlated with FT, is a separate topic.  Tandem's
Guardian line does seem to assume that the best reason one might need
FT is for OLTP, but other markets for FT exist as well.
Communications applications stick out, e.g. telecom.

> a) are in UNIX
> b) aren't in UNIX, but could be
> c) aren't in UNIX, but would require complete rewrites
> to get them there.

While there are a bunch of requests for UNIX enhancements from the
OLTP community, these are OLTP requests, not FT requests.  There are
also a bunch of requests for UNIX enhancements from people who
advocate reliable OSs.  Laudable no doubt, but again not FT requests.

Ideally FT would exist completely in the hardware, and present a
platform to the OS which looks like a non-FT machine.  The reality is
that this can't be quite true.  FT vendors will always be required to
supply whatever kernel support their idiosyncratic implementation
requires, and an OS port to such a machine will always be more
difficult than to a non-FT platform.  However, by and large the
progress of UNIX, and the related progress of DBMS software, should
proceed without undue concern for the needs of FT.  The ball is in the
other court: FT machines should be (and are) being designed with the
needs of UNIX and DBMS software in mind.

> d) What's the tradeoff between:
>    degree of software fault-tolerancy
> and
>    ability to run standard software, with no changes

"Software fault tolerance" implies that _someone_ must figure out when
and what to checkpoint, or that an extreme penalty will be paid
because everything is checkpointed.  How much standard software has
such checkpointing already programmed in?  None, of course.

>Anyway, I'd observe that from the publicly-available data, it is clear that
>the 2 Tandem product lines don't really overlap very much, and are aimed
>at different markets, for different reasons, and hence, trying to read too
>much into this about the merit or lack thereof of a specific technical
>feature just doesn't make sense.

I guess we just come to different interpretations of publicly
available, and maybe ambiguous, information.  It still looks to me
like a technological generation change.  Time will certainly tell.

>the VAX meant that DEC thought PDP-11s were Wrong Things, of course I'd
>have objected.  Unless my memory fails, I thought DEC made something like
>$1B last year on PDP-11-based products... 10 years after the introduction of
>the VAX.  Although the two overlapped in some areas, they didn't at all in
>others.

I guess what is at the core of the disagreement over what is
appropriate for discussion is that many of the problems of changing
from older to newer technology are universal to all successful
companies, and discussing one instantiation doesn't seem directed at
the subject being discussed.

HP continues to sell "traditional" 16 bit HP-3000s, while having moved
to 32 bit RISC.  When they heard the HP-PA announcement, most
observers speculated that HP was moving away from the 16 bit
architecture, and that HP believed it could do better than before.  HP
surely never concluded that traditional HP-3000s were "wrong things",
nor of course will Tandem ever come to such a conclusion about a
product which is on par with the HP-3000 in being successful both
commercially and technically.  Obsolescence does not imply wrongness.
As shown by the PDP-11, it doesn't imply lack of commercial success,
and it certainly doesn't imply lack of support.

>As a matter of style, I believe that it is much better to carefully
>label speculations as such, and ask questions, than to make strong-sounding
>statements that can easily mislead the casual observer.

I'll accept your advice, and thank you for indeed a better way of
presenting the case.  But I'll point out that anyone who is misled by
a strong-sounding statement in comp.arch is a candidate for being sold
a bridge.

>I have no desire to suppress such discussions, as the interactions of
>technology and business are extremely important to understand.

The problem is that companies, _all_ long-lived technology companies,
control technology for business reasons.  Taking company announcements
at face value is not the way to understand such interactions.

>-john mashey

DISCLAIMER: <generic disclaimer, I speak for me only, etc>

Dan Hepner
Not a statement of Hewlett-Packard Co.
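Dan's "once and only once WRT the user" requirement is why a takeover,
whether from a checkpoint or a redundant processor, must not let the user
see a transaction applied twice.  One common software illustration of the
requirement (my own sketch, not a description of Guardian or the S2) is
to tag each request with a sequence number, so a restarted server can
return the remembered reply instead of re-executing:

```python
class OnceOnlyServer:
    """Illustration of the once-and-only-once requirement.

    Each request carries a client-assigned sequence number.  If the
    server (or a backup that has taken over) sees a sequence number it
    has already processed -- e.g. because the request was replayed
    after a failure -- it returns the remembered reply instead of
    executing the request a second time.
    """
    def __init__(self):
        self.completed = {}     # seqno -> remembered reply
        self.balance = 0

    def debit(self, seqno, amount):
        if seqno in self.completed:
            return self.completed[seqno]   # duplicate: don't re-execute
        self.balance -= amount             # the once-and-only-once effect
        reply = self.balance
        self.completed[seqno] = reply      # must persist with the state
        return reply
```

The table of completed sequence numbers has to be part of whatever state
is checkpointed or replicated, or the guarantee is lost on takeover.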
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (02/06/90)
In article <13910010@hpisod2.HP.COM> dhepner@hpisod2.HP.COM (Dan Hepner) writes:
>Ideally FT would exist completely in the hardware, and present
>a platform to the OS which looks like a non-FT machine.

I'm not so sure.  How would this catch the software bugs that Tandem's
scheme does catch?

>Redundant processor machines always execute the instruction
>with more than one processor (3 for Tandem, 4 for Stratus),
>and compare results.  Miscomparisons result in reliable detection
>of "crazy processors", including cache, logic, or whatever.

There is "lockstep" redundant execution, and then there are looser
forms.

Lockstep redundancy is very simple to build, but it cannot catch
Heisenbugs - order dependencies that aren't supposed to be in the
software, but are anyway.

Looser redundancy schemes declare synchronization events at (say) a
kilohertz.  This is nowhere near as clear cut, because processes may
not have been scheduled in the same order on all machines.  Both
interrupts and traps become interesting topics in such a system.  And
you can't expect all machines to reach the same synchronization event
at the exact same moment.

However, a loose redundancy scheme is essentially the same as a
checkpoint scheme, except for latency.  A redundant process has been
out there fighting for cycles all along.  Checkpoint systems recover
by running a shadow process forwards from the last checkpoint.  So,
the Space Shuttle uses redundancy, because they don't want The Pause
That Refreshes to happen during reentry.
--
Don    D.C.Lindsay    Carnegie Mellon Computer Science
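Don's distinction can be pictured as comparing replica state at periodic
synchronization events rather than comparing every bus cycle.  A toy
sketch of the loose form (the digest-based comparison and interval are
my illustration; real systems compare far more carefully):

```python
import hashlib
import pickle

def state_digest(state):
    """Hash a (picklable) replica state so replicas can be compared
    cheaply at each synchronization event."""
    return hashlib.sha256(pickle.dumps(state)).hexdigest()

def run_replicas(replica_states, ops, sync_interval=1000):
    """Apply the same operation stream to each replica, voting on a
    state digest at every synchronization event.

    Unlike cycle-by-cycle lockstep, the replicas are free to proceed
    independently between events; only the digests must agree.  A
    divergence detected here could be a hardware fault -- or one of
    the order-dependent Heisenbugs lockstep cannot expose.
    """
    for i, op in enumerate(ops, 1):
        for state in replica_states:
            op(state)                     # each replica runs the op itself
        if i % sync_interval == 0:        # synchronization event
            digests = {state_digest(s) for s in replica_states}
            if len(digests) != 1:
                raise RuntimeError("replica divergence at event %d" % i)
    return replica_states
```

As the post notes, the hard parts this sketch hides are interrupts,
traps, and scheduling order, which can make honest replicas diverge.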
donnh@ziggy.SanDiego.NCR.COM (Donn Holtzman) (02/06/90)
In article <13910010@hpisod2.HP.COM> dhepner@hpisod2.HP.COM (Dan Hepner) writes:
>
>Ideally FT would exist completely in the hardware, and present
>a platform to the OS which looks like a non-FT machine.
>

From my perspective, one problem with a "HW only" solution to FT is
the issue of SW failures.  As the gentleman from Tandem pointed out,
there is a class of faults (Heisenbugs) which are very timing and case
dependent.  A loosely coupled approach, such as Tandem's, will recover
from many SW-based faults simply because the timing and load
characteristics are different on another processor.  If the bug causes
your kernel to hang, TMR or pair-and-spare approaches won't succeed.
Performance is certainly an issue, but one can trade checkpointing
overhead for recovery speed (at least in the OLTP arena).

On the other hand, the "HW only" approaches are easier to explain and
sell.  They certainly are conceptually simpler, if not actually
simpler to implement correctly.

>The reality is that this can't be quite true.  FT vendors will
>always be required to supply whatever kernel support that their
>idiosyncratic implementation requires, and an OS port to such a
>machine will always be more difficult than on a non-FT platform.
>

This is a good point.  I would be surprised if Tandem didn't have to
make kernel changes to make their machine work.  But in this day of
standards and narrow market windows, it was probably easier to sell
this approach to management than a large SW effort (kernel changes or
no kernel changes) to support a loosely coupled approach.

Interesting stuff.

Donn Holtzman
NCR E&M San Diego
Donn.Holtzman@SanDiego.NCR.COM
dhepner@hpisod2.HP.COM (Dan Hepner) (02/07/90)
From: lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay)
>
>>Ideally FT would exist completely in the hardware, and present
>>a platform to the OS which looks like a non-FT machine.
>
>I'm not so sure.  How would this catch the software bugs that Tandem's
>scheme does catch?

Tandem's checkpointing scheme catches SW bugs by deliberate SW design,
not because the scheme provides some inherent resistance to them.
That resistance has proven to be limited by its own bugs.  What is as
yet unestablished is whether the schemes described for detection of SW
bugs are the best available, or, if they are, whether they can't be
equally applied to a redundant CPU implementation.

We can assume we would prefer _all_ our applications to run reliably
in the face of HW failures.  In a checkpointing system, this means we
must incorporate checkpointing logic into each of them, as compared to
running them straight off on a redundant CPU machine.  If we need to
"harden" certain SW, making it resistant to SW bugs while already
being impervious to HW failures, we are free to do so on a redundant
CPU machine.

>Lockstep redundancy is very simple to build, but it cannot catch
>Heisenbugs - order dependencies that aren't supposed to be in the
>software, but are anyway.

Lockstep redundancy, and the alternatives for that matter, are not
designed to catch SW bugs at all, Heisenbugs or otherwise.  Deliberate
SW design to achieve resistance is the only technique that catches SW
bugs.

This is not to deny the conceptual difference between "single memory
space" and "multiple memory space" machines, although the line can be
a bit blurry.  Intuitively, it sure seems more likely that the global
state will be more easily irretrievably corrupted if there is only one
memory space.  But is there any actual evidence that those techniques
(they're all SW techniques) which minimize the potential of
irretrievable corruption due to SW bugs apply equally well to both
systems?  Ultimately we have to just trust the released SW.  If some
programmer writes precisely the line of code which corrupts the entire
system, and that line of code manages to get past whatever QA process
is in place, we have no defense.

>Looser redundancy schemes declare
>synchronization events at (say) a kilohertz.  [...]
>However, a loose redundancy scheme is essentially the same as a
>checkpoint scheme, except for latency.

Right.

Dan Hepner
dhepner@hpisod2.HP.COM (Dan Hepner) (02/07/90)
From: donnh@ziggy.SanDiego.NCR.COM (Donn Holtzman)

>If the bug causes your kernel to hang, TMR or pair-and-spare
>approaches won't succeed.

This is an interesting case.  Kernel "hangs" are actually double SW
bugs: a combination of some original bug and a lack of detection by a
failed assertion.  There is a last defense TMR/QMR schemes have
against such hangs: the deadman-switch-initiated "fast reboot", which
effectively converts the hang into a "panic".

Which brings up the comparison of reactions to the more generic panic.
In order to make the case that loose redundancy is superior to
lockstep in its response time to panics, one must assert that the
backup loose processor will notice the failure of the primary, and
complete its takeover, in less time than the primary could have reset
itself and achieved a similar state.  How does this comparison turn
out in real life?

The question of whether the state of the machine is sound enough to
return to seems independent of the basic question; one can do a fast
reboot and leave machine state mostly intact, maybe suffering a
repeat.  Alternatively one can bet on a checkpointed machine and
suffer the same repeat.  Is there some fundamental difference that
makes the takeover by the checkpointed machine faster?

>Performance is certainly an issue but one can trade
>checkpointing overhead for recovery speed (at least in the OLTP
>arena).

Maybe you could elaborate here.

>>an OS port to such a
>>machine will always be more difficult than on a non-FT platform.
>>
>This is a good point.  I would be surprised if Tandem didn't have to
>make kernel changes to make their machine work.

There is a real dividing line: can you port the next kernel, or do you
have to retrofit the new functionality into your existing, 80%
proprietary-code kernel?

>Interesting stuff.

Yes!

>Donn Holtzman

Dan Hepner
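The deadman switch Dan mentions amounts to a watchdog timer that the
kernel must "pet" within each timeout window; when the petting stops, a
silent hang becomes a forced reset.  A minimal sketch (the class name,
timeout, and petting interface are illustrative, not any vendor's
hardware):

```python
import time

class DeadmanSwitch:
    """Toy model of a deadman switch.

    The kernel must call pet() within each timeout window.  A hang --
    the double SW bug above: an original fault plus a missed assertion
    -- stops the petting, and the watchdog's expiry converts the
    silent hang into an explicit panic / fast reboot.
    """
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_pet = time.monotonic()

    def pet(self):
        """Called periodically by a healthy kernel."""
        self.last_pet = time.monotonic()

    def expired(self):
        """Checked by the watchdog hardware; True means force a reset."""
        return time.monotonic() - self.last_pet > self.timeout
```

The trade-off discussed in the post is then between this reset-and-resume
latency and the time for a loosely coupled backup to notice the silence
and complete its takeover.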