dowlati@mips2.ma30.bull.com (Saadat Dowlati) (06/20/91)
I have been reading a lot of papers on fault-tolerant systems. One thing they all have in common is the many wonderful expectations that they have from the underlying hardware: fail-stop processors, self-checking components, non-partitionable networks, etc. But none says how. So, I am curious.

I like to know, for example:

 - What are the symptoms of a failing CPU, i.e., fault types?
 - How soon a failing/failed CPU can be detected?
 - What are the techniques used in detecting a failing/failed CPU?
   (I know about processor-pair technique)
 - What are the techniques used to report a failed CPU to the OS?

I also have similar questions about Buses, Disks, and the Memory subsystem.

I would especially like to hear from those who have actual experience.

Regards,
-- 
Saadat Dowlati
Affiliation: Bull HN Information Systems, Inc.
300 Concord Road, MA30-826A
Billerica, Massachusetts 01821-4186
U.S.A.
Voice: (508) 294-3426
Fax: (508) 294-3807
E-mail: S.Dowlati@ma30.bull.com
dhepner@hpcuhc.cup.hp.com (Dan Hepner) (06/21/91)
From: dowlati@mips2.ma30.bull.com (Saadat Dowlati)

> - What are the symptoms of a failing CPU, i.e., fault types?

Impossible to predict, which means that any conceivable failure must be handled. Early fault tolerant machines were able to react to total processor failure, but were vulnerable to, for example, a processor adding 2+2 and getting 5, or a disk controller processor corrupting a bit in a sector on the way through. These don't meet modern expectations of a fault tolerant system.

> - How soon a failing/failed CPU can be detected?

Failure must be detected before it affects anything else, such as the state of external memory, and before any IO request is issued. It _can_ be detected sooner, either by internal checks or before modification of unshared cache memory, but need not be, and for performance reasons it is usually advantageous not to check any more frequently than necessary.

> - What are the techniques used in detecting a failing/failed CPU?
>   (I know about processor-pair technique)

Three of the four major commercial FT architectures (Tandem Guardian, Stratus, Sequoia, and Tandem S2) use other processors to check each other, albeit each in a unique way. The exception, Tandem's Guardian, uses a combination of redundant components and parity checking.

> - What are the techniques used to report a failed CPU to the OS?
>Saadat Dowlati  Affiliation: Bull HN Information Systems, Inc.

There are two kinds of reasons why the OS might care. One is to support diagnostic messages that prompt replacement; this is handled similarly to "ordinary" degraded conditions.

The other reason, which is more fundamental to the Guardian and Sequoia architectures, is to prevent future work from being scheduled on the failed processor. In this case, there is always the option of the processor module ceasing to process any more instructions; this is soon detected by the Sequoia OS, or by another Guardian. Both react by scheduling the in-progress work on another processor.

The Stratus and S2 OSs need not solve this problem, as the processor module never fails completely and continues, from the OS point of view, as if nothing had happened.

Dan Hepner
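The "stops executing, is soon detected, work is rescheduled" behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes detection by a heartbeat timeout, and the names `Processor`, `detect_and_reschedule`, and `timeout` are hypothetical. It is not how the Sequoia OS or Guardian actually implement it.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    # Hypothetical model of one processor module as the scheduler sees it.
    name: str
    last_heartbeat: float = 0.0
    tasks: list = field(default_factory=list)
    alive: bool = True


def detect_and_reschedule(processors, now, timeout):
    """Mark silent processors as failed and move their tasks elsewhere.

    A processor that has stopped executing stops sending heartbeats, so it
    is declared failed once it has been silent for longer than `timeout`.
    Returns a dict mapping each moved task to its new processor's name.
    """
    for cpu in processors:
        if cpu.alive and now - cpu.last_heartbeat > timeout:
            cpu.alive = False

    survivors = [cpu for cpu in processors if cpu.alive]
    if not survivors:
        raise RuntimeError("no working processor left")

    moved = {}
    for cpu in processors:
        if not cpu.alive and cpu.tasks:
            for task in cpu.tasks:
                # Assumed policy: give the task to the least loaded survivor.
                target = min(survivors, key=lambda c: len(c.tasks))
                target.tasks.append(task)
                moved[task] = target.name
            cpu.tasks = []
    return moved
```

The least-loaded placement is an arbitrary choice for the sketch; the point is only that a fail-stop processor reduces failure detection to noticing silence.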
mshute@cs.man.ac.uk (Malcolm Shute) (06/21/91)
In article <1991Jun19.172757.20852@mips2.ma30.bull.com> dowlati@mips2.ma30.bull.com (Saadat Dowlati) writes:
>I like to know, for example:
> - What are the symptoms of a failing CPU, i.e., fault types?
> - How soon a failing/failed CPU can be detected?
> - What are the techniques used in detecting a failing/failed CPU?
>   (I know about processor-pair technique)
> - What are the techniques used to report a failed CPU to the OS?

I know it is not quite what you were asking for, but it might be slightly relevant: for ULSI/WSI in CMOS, many types of fault cause the affected processor to draw enormous amounts of current (for many possible different reasons, e.g. short circuits between signal wires and/or power rails, or pull-up and pull-down transistors being on simultaneously).

Consequently, supply current can be quite a good initial thermometer of a processor's health.
-- 
Malcolm SHUTE. (The AM Mollusc: v_@_ ) Disclaimer: all
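As a rough illustration of using supply current as a first health check, the sketch below flags any processor drawing far more than its nominal current. The function name, the milliamp units, and the default factor of 3 are all assumptions made for the example, not figures from the post.

```python
def overcurrent_suspects(readings_ma, nominal_ma, factor=3.0):
    """Return the ids of processors drawing more than factor * nominal current.

    readings_ma: dict mapping processor id -> measured supply current (mA).
    nominal_ma:  expected current for a healthy processor (mA).
    factor:      assumed threshold multiplier; a real value would have to be
                 chosen for the particular process and design.
    """
    limit = factor * nominal_ma
    return sorted(pid for pid, ma in readings_ma.items() if ma > limit)
```

A check like this only catches faults that show up as excess current, so it would complement other detection techniques and not replace them.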
janm@dramba.neis.oz (Jan Mikkelsen) (06/22/91)
In article <1991Jun19.172757.20852@mips2.ma30.bull.com> dowlati@mips2.ma30.bull.com (Saadat Dowlati) writes:
>
>I have been reading a lot of papers on fault-tolerant systems. One thing
>they all have in common is the many wonderful expectations that they have
>from the underlying hardware: fail-stop processors, self-checking
>components, non-partitionable networks, etc. But none says how. So, I am
>curious.

We just installed a Tandem S2 (MIPS based, fault-tolerant Unix), and have had Tandem Guardian machines in our parent company for some time.

> I like to know, for example:
>
> - What are the symptoms of a failing CPU, i.e., fault types?

In the S2, you have three CPUs, each executing the same code. Whenever they require access to "global" memory, they hold a vote to confirm that they are all attempting to do the same thing. If one of the CPUs loses the vote, it is taken off-line, and processing continues with two processors. If these two disagree, then the system is stopped.

So, I think the point here is that rather than looking for a specific type of failure, this implementation goes with a majority decision. It is of course possible, but unlikely, that two will fail in the same way, and one will succeed.

> - How soon a failing/failed CPU can be detected?

As soon as the instruction flow requires access to something outside of a processor's local memory. Memory in the S2 is organised into local and global memory; a vote is required whenever access is required to global memory or when an I/O operation is attempted. I am not sure of the size of transfers between local and global memory.

> - What are the techniques used in detecting a failing/failed CPU?
>   (I know about processor-pair technique)

See above.

> - What are the techniques used to report a failed CPU to the OS?

I suspect that this would vary considerably between a machine with multiple logical processors and a machine with a single logical processor.

Each logical processor in the Tandem S2 architecture consists of three physical CPUs. In a machine with one logical CPU, the failure of a physical CPU should probably not affect the OS. The failure of a logical CPU obviously has a more severe impact on a machine like this.

In a machine with multiple logical CPUs, the failure of a logical CPU should involve the OS, which should start rescheduling jobs to working CPUs. How do machines like the Sequent or the Stratus i860 based machines handle this?

>
>I also have similar questions about Buses, Disks, and the Memory subsystem.

I think in essence, the Tandem philosophy is to have two or more of everything.
-- 
Jan Mikkelsen
janm@dramba.neis.oz.AU or janm%dramba.neis.oz@metro.ucc.su.oz.au
"She really is."
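The voting rule described in the post above (three CPUs vote, the loser goes off-line, and a disagreement between the remaining two stops the system) can be sketched as below. This is a minimal illustration, not the S2's actual mechanism: the names `vote` and `SystemHalt` are hypothetical, and each CPU's intended global-memory access is assumed to be representable as a hashable value such as an (address, data) tuple.

```python
from collections import Counter


class SystemHalt(Exception):
    """Raised when the online CPUs cannot form a majority."""


def vote(outputs):
    """Compare what each online CPU wants to do at a global access.

    outputs: dict mapping CPU id -> the access it is attempting, e.g. an
             (address, data) tuple. Only online CPUs should be included.
    Returns (agreed_access, cpus_to_take_offline).
    Raises SystemHalt if no strict majority exists, which covers both
    three CPUs all disagreeing and two remaining CPUs disagreeing.
    """
    counts = Counter(outputs.values())
    value, agreeing = counts.most_common(1)[0]
    if agreeing * 2 <= len(outputs):
        raise SystemHalt("no majority among online CPUs")
    losers = sorted(cpu for cpu, v in outputs.items() if v != value)
    return value, losers
```

For example, with CPUs A and B attempting `("0x10", 5)` and CPU C attempting `("0x10", 4)`, the call returns the majority access and `["C"]`; a later call with only A and B that disagree raises `SystemHalt`. As the post notes, a majority vote cannot catch two CPUs failing in the same way.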