[comp.arch] Fault-Tolerant Systems

dowlati@mips2.ma30.bull.com (Saadat Dowlati) (06/20/91)

I have been reading a lot of papers on fault-tolerant systems. One thing 
they all have in common is the many wonderful expectations that they have 
from the underlying hardware: fail-stop processors, self-checking 
components, non-partitionable networks, etc. But none says how. So, I am 
curious. I like to know, for example:

	- What are the symptoms of a failing CPU, i.e., fault types?
	- How soon a failing/failed CPU can be detected?
	- What are the techniques used in detecting a failing/failed CPU?
  	  (I know about processor-pair technique)
	- What are the techniques used to report a failed CPU to the OS?

I also have similar questions about Buses, Disks, and the Memory subsystem. 
I would especially like to hear from those who have actual experience.

Regards,
-- 
Saadat Dowlati		   Affiliation:	Bull HN Information Systems, Inc.
Voice:	(508) 294-3426			300 Concord Road, MA30-826A
Fax:	(508) 294-3807			Billerica, Massachusetts 01821-4186
E-mail:	S.Dowlati@ma30.bull.com       	U.S.A.

dhepner@hpcuhc.cup.hp.com (Dan Hepner) (06/21/91)

From: dowlati@mips2.ma30.bull.com (Saadat Dowlati)

>	- What are the symptoms of a failing CPU, i.e., fault types?

Impossible to predict, which means that any conceivable failure must 
be handled.  Early fault-tolerant machines were able to react to total 
processor failure, but were vulnerable to, for example, a processor adding 
2+2 and getting 5, or a disk controller processor corrupting a bit in a 
sector on the way through.  These don't meet modern expectations of a 
fault-tolerant system.

>	- How soon a failing/failed CPU can be detected?

Failure must be detected before it affects anything else, such as the 
state of external memory, and before any I/O request is issued.  It _can_ be
detected sooner, either by internal checks or before modification
of unshared cache memory, but need not be, and it is usually advantageous
for performance not to detect any sooner (that is, not to check more
frequently) than necessary. 
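
A minimal sketch of checking only at the output boundary, in Python.  The
names (CheckedPair, commit, bad_adder) and the two-replica arrangement are
assumptions made for illustration, not a description of any vendor's
hardware.  Each replica computes privately; the results are compared only
when something is about to become externally visible.

```python
class ReplicaMismatch(Exception):
    """Raised when the two replicas disagree at an output boundary."""


class CheckedPair:
    """Two replicas compute privately; results are compared only on commit."""

    def __init__(self, replica_a, replica_b):
        self.replicas = (replica_a, replica_b)
        self.external_memory = {}  # stands in for shared memory or I/O
        self.checks = 0            # number of comparisons performed

    def commit(self, address, operation, *args):
        # Each replica computes independently. Nothing becomes externally
        # visible until both results have been compared.
        results = [replica(operation, *args) for replica in self.replicas]
        self.checks += 1
        if results[0] != results[1]:
            # Fail-stop: refuse to modify external state.
            raise ReplicaMismatch(f"{results[0]!r} != {results[1]!r}")
        self.external_memory[address] = results[0]


def good_cpu(operation, *args):
    """A replica that computes correctly."""
    return operation(*args)


def bad_adder(operation, *args):
    """Hypothetical fault: every result is off by one (2+2 gives 5)."""
    return operation(*args) + 1
```

With CheckedPair(good_cpu, bad_adder), a commit of 2+2 raises
ReplicaMismatch and external_memory stays empty, so the fault is caught
before it affects anything else.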
 
>	- What are the techniques used in detecting a failing/failed CPU?
>  	  (I know about processor-pair technique)

Three of the four major commercial FT architectures (Tandem Guardian,
Stratus, Sequoia, and Tandem S2) use other processors to check each other,
albeit each in a unique way.  Tandem's Guardian instead uses a combination
of redundant components and parity checking.
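
For readers unfamiliar with parity checking, here is a small sketch in
Python.  It shows generic even parity over a non-negative integer word; the
function names are hypothetical and nothing here is specific to the
Guardian hardware.

```python
def parity_bit(word):
    """Even-parity bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") % 2


def store(word):
    """Return the word together with its parity bit, as it would be stored."""
    return word, parity_bit(word)


def check(word, stored_parity):
    """True if the word read back is consistent with the stored parity."""
    return parity_bit(word) == stored_parity
```

A single flipped bit makes check() return False.  Two flipped bits cancel
out and go unnoticed, which is one reason parity is combined with other
forms of redundancy.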

>	- What are the techniques used to report a failed CPU to the OS?

There are two reasons why the OS might care.  One is to support
diagnostic messages that prompt replacement; this is handled similarly
to "ordinary" degraded conditions.  The other reason, which is more fundamental
to the Guardian and Sequoia architectures, is to prevent future work from
being scheduled on the failed processor.  In this case, there is always the
option of the processor module ceasing to execute any more instructions; this
is soon detected by the Sequoia OS, or by another Guardian processor.  Both
react by rescheduling the in-progress work on another processor.  The Stratus
and S2 OSs need not solve this problem, as the processor module never fails
completely and continues, from the OS point of view, as if nothing had
happened.
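
A rough sketch, in Python, of noticing a processor that has gone silent and
moving its work elsewhere.  The heartbeat timestamps, the timeout, and the
round-robin reassignment are all assumptions chosen for illustration; the
post above does not say how either OS actually detects the silence.

```python
def find_silent(last_heartbeat, now, timeout):
    """Return the processors whose last heartbeat is older than `timeout`."""
    return sorted(cpu for cpu, seen in last_heartbeat.items()
                  if now - seen > timeout)


def reschedule(assignments, failed, last_heartbeat):
    """Move jobs from failed processors to the survivors, round-robin."""
    survivors = sorted(cpu for cpu in last_heartbeat if cpu not in failed)
    if not survivors:
        raise RuntimeError("no working processor left")
    new_assignments = {}
    moved = 0
    for job, cpu in sorted(assignments.items()):
        if cpu in failed:
            # In-progress work is restarted on a surviving processor.
            new_assignments[job] = survivors[moved % len(survivors)]
            moved += 1
        else:
            new_assignments[job] = cpu
    return new_assignments
```

A processor that stops executing instructions also stops sending
heartbeats, so fail-stop behaviour turns an arbitrary fault into one that is
easy to detect from outside.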

Dan Hepner

mshute@cs.man.ac.uk (Malcolm Shute) (06/21/91)

In article <1991Jun19.172757.20852@mips2.ma30.bull.com> dowlati@mips2.ma30.bull.com (Saadat Dowlati) writes:
>I like to know, for example:
>	- What are the symptoms of a failing CPU, i.e., fault types?
>	- How soon a failing/failed CPU can be detected?
>	- What are the techniques used in detecting a failing/failed CPU?
>  	  (I know about processor-pair technique)
>	- What are the techniques used to report a failed CPU to the OS?

I know it is not quite what you were asking for, but it might be slightly relevant:
for ULSI/WSI in CMOS, many types of fault cause the affected processor to draw
enormous amounts of current (for many possible reasons, e.g. short circuits
between signal wires and/or power rails, or pull-up and pull-down transistors on
simultaneously).  Consequently, supply current can be quite a good initial
thermometer of a processor's health.
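
As a toy illustration in Python: flag any processor whose measured supply
current is far above nominal.  The readings in milliamps, the nominal value,
and the factor of 3 are made-up assumptions; a real threshold would have to
come from the process and the design.

```python
def overcurrent_suspects(samples_ma, nominal_ma, factor=3.0):
    """Return processors drawing more than `factor` times nominal current."""
    limit = factor * nominal_ma
    return sorted(cpu for cpu, ma in samples_ma.items() if ma > limit)
```

For example, with a nominal 400 mA and readings of 410, 2900 and 395 mA,
only the processor drawing 2900 mA is reported.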
--

Malcolm SHUTE.         (The AM Mollusc:   v_@_ )        Disclaimer: all

janm@dramba.neis.oz (Jan Mikkelsen) (06/22/91)

In article <1991Jun19.172757.20852@mips2.ma30.bull.com> dowlati@mips2.ma30.bull.com (Saadat Dowlati) writes:
>
>I have been reading a lot of papers on fault-tolerant systems. One thing 
>they all have in common is the many wonderful expectations that they have 
>from the underlying hardware: fail-stop processors, self-checking 
>components, non-partitionable networks, etc. But none says how. So, I am 
>curious.

We just installed a Tandem S2 (MIPS based, fault-tolerant Unix),
and have had Tandem Guardian machines in our parent company
for some time.

>          I like to know, for example:
>
>	- What are the symptoms of a failing CPU, i.e., fault types?

In the S2, you have three CPUs, each executing the same code.  Whenever
they require access to "global" memory, they have a vote to confirm that
they are all attempting to do the same thing.

If one of the CPUs loses the vote, it is taken off-line, and processing
continues with two processors.  If these two disagree, then the system
is stopped.

So, I think the point here is that rather than looking for a specific type
of failure, this implementation goes with a majority decision.  It is of
course possible, but unlikely, that two will fail in the same way, and one
will succeed.
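
The voting behaviour described above can be sketched in Python as follows.
This is only a software model of the idea (three CPUs, drop the loser, stop
when two disagree); the class and exception names are hypothetical and the
real S2 does this in hardware.

```python
from collections import Counter


class SystemStopped(Exception):
    """Raised when the remaining CPUs cannot agree."""


class Voter:
    def __init__(self, cpu_ids):
        self.online = list(cpu_ids)

    def vote(self, outputs):
        """Vote on what each online CPU proposes to do.

        `outputs` maps a CPU id to the (hashable) value it wants to write.
        Returns the agreed value, taking any outvoted CPU off-line.
        """
        if not self.online:
            raise SystemStopped("no CPUs online")
        proposals = [outputs[cpu] for cpu in self.online]
        winner, count = Counter(proposals).most_common(1)[0]
        if count * 2 <= len(self.online):
            # No strict majority: two CPUs disagree, or all three differ.
            self.online = []
            raise SystemStopped("no majority among online CPUs")
        # Keep only the CPUs that agreed with the majority.
        self.online = [cpu for cpu in self.online if outputs[cpu] == winner]
        return winner
```

Note that the voter never asks what kind of fault occurred.  It also cannot
help if two CPUs fail in the same way: the healthy one would simply be
outvoted.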

>	- How soon a failing/failed CPU can be detected?

As soon as the instruction flow requires access to something outside of
a processor's local memory.  Memory in the S2 is organised into local and
global memory;  a vote is required whenever access is required to global
memory or when an I/O operation is attempted.  I am not sure of the size
of transfers between local and global memory.

>	- What are the techniques used in detecting a failing/failed CPU?
>  	  (I know about processor-pair technique)

See above.

>	- What are the techniques used to report a failed CPU to the OS?

I suspect that this would vary considerably between a machine with multiple
logical processors and a machine with a single logical processor.  Each
logical processor in the Tandem S2 architecture consists of three physical
CPUs.

In a machine with one logical CPU, the failure of a physical CPU should
probably not affect the OS.  The failure of a logical CPU obviously has
a more severe impact on a machine like this.

In a machine with multiple logical CPUs, the failure of a logical CPU
should involve the OS, which should start rescheduling jobs to working
CPUs.  How do machines like the Sequent or the Stratus i860 based machines
handle this?

>
>I also have similar questions about Buses, Disks, and the Memory subsystem. 

I think, in essence, the Tandem philosophy is to have two or more of everything.

-- 
Jan Mikkelsen
janm@dramba.neis.oz.AU or janm%dramba.neis.oz@metro.ucc.su.oz.au
"She really is."