[comp.arch] Reliability

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (01/17/90)

In article <34469@mips.mips.COM> mash@mips.COM (John Mashey) writes:
> a) What are CPU differences between micros and mainframes in this area?
> Are there reliability features of current big machine CPUs that
> are impossible to duplicate in micros? hard to duplicate? easy, but,
> too expensive? Are there features of micros that make them easier to
> make reliable systems from?

Mainframes tend to put parity _everywhere_, and most micros are
completely without. Ah, but what do you do when an error occurs?
Microcoded machines drop into diagnostic microcode, which analyzes,
reports, and then tries to resume/restart the macroinstruction.  Some
machines had redundant hardware (e.g. two ALUs) and could reconfigure
to cut failed units out of the "processor complex". I don't see
micros following this path.
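
As a minimal illustration of what one parity bit per word buys you, here
is a Python sketch. It is not taken from any real machine, and the
function names are made up for the example. A single flipped bit is
detected, but not located, and a double flip goes unnoticed. Detection is
the easy part; what to do next is the question raised above.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 when the word holds an odd number of 1 bits."""
    return bin(word).count("1") & 1


def store(word: int):
    """Model a storage cell that keeps the word together with its parity bit."""
    return word, parity_bit(word)


def check(word: int, stored_parity: int) -> bool:
    """True when the word still agrees with the parity bit stored beside it."""
    return parity_bit(word) == stored_parity


word, p = store(0b10110010)
single_flip = word ^ (1 << 3)   # one bit corrupted
double_flip = word ^ 0b11       # two bits corrupted

print(check(word, p))           # intact word passes the check
print(check(single_flip, p))    # single-bit error is caught
print(check(double_flip, p))    # double-bit error is missed by plain parity
```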

> b) Are any of these reliability features from mainframes not so
> necessary when entire CPUs are on single chips?

Yes: the stuff above. Chips have failure modes, and "age", but to the
first order, the (un)reliability of a box depends on its chip-pin
count. Putting the CPU on one chip, instead of 2,000, has a serious
impact. Further, the micro solution allows tricks like master/checker
pairs, which you just wouldn't do if the processor was in a box 40
feet long.
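
To make the first-order claim concrete, here is a back-of-the-envelope
model. It is an assumption for illustration, not a measured result: treat
every chip pin (with its solder joint and trace) as an independent
failure source with the same constant failure rate \lambda_{pin}, and
treat the box as failed when any one of them fails.

```latex
\lambda_{\mathrm{box}} \approx N_{\mathrm{pins}} \, \lambda_{\mathrm{pin}},
\qquad
\mathrm{MTBF}_{\mathrm{box}} \approx \frac{1}{N_{\mathrm{pins}} \, \lambda_{\mathrm{pin}}}
```

Under that model, cutting the total pin count by a factor of k raises the
MTBF by the same factor k.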

The #1 reason for "parity everywhere" was to detect that you were in
trouble. The #2 reason was to identify the field-replaceable module
(which for a micro is the whole CPU, or more). The trailing #3 reason
was the hope of live CPU recovery.

Live CPU recovery has become much less interesting since
multiprocessors came along. With the right software, a failed
processor does not imply a failed process. For example, Tandem
checkpoints each process regularly, so that a different processor can
do a prompt checkpoint-resumption. The CPU and IO interconnects have
to be up to it, of course (dual port those disks). And besides: if a
master/checker pair of CPUs disagree, which one was the one that
failed? Better to ignore them both and force the board into self test
mode.
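
The master/checker dilemma can be sketched in a few lines of Python. This
is a toy model under stated assumptions: the two "CPUs" are plain
functions, the fault (a result bit stuck at 1) is invented for the
example, and none of the names come from a real design. The comparison
shows that the outputs differ, but nothing in it says which side is
wrong, so the only safe reaction is to stop trusting the pair.

```python
import operator


class PairDisagreement(Exception):
    """Raised when master and checker produce different results."""


def good_cpu(op, a, b):
    return op(a, b)


def faulty_cpu(op, a, b):
    # Hypothetical fault: bit 4 of every result is stuck at 1.
    return op(a, b) | 0x10


def run_pair(master, checker, op, a, b):
    """Run the same operation on both units and compare the outputs."""
    m = master(op, a, b)
    c = checker(op, a, b)
    if m != c:
        # The comparator cannot tell which unit failed.
        raise PairDisagreement(f"master={m} checker={c}")
    return m


try:
    run_pair(good_cpu, faulty_cpu, operator.add, 2, 3)
except PairDisagreement as err:
    print("pair disagrees:", err, "-> take the board offline and self-test")
```

Swapping the two arguments raises the same exception, which is the point:
the pair gives detection, and recovery has to come from somewhere else
(another processor picking up a checkpoint, for instance).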

> c) Beyond the CPUs, what are the issues that might be different
> at the system level?

Well, nonstop machines are ruggedized and rated for e.g. sudden
overpressures (no kidding). This might influence a chip company to
change its chip packaging, but not its chip design.

> d) ECC, parity, nothing: where are the boundaries on tradeoffs?

Well, the Cyclone uses cache refill as a way to fix cache parity
errors. And, they have extra cache RAMs that they can spare in.  But
it would probably be OK (and simpler) if the machine just disabled
a quarter or a half of its cache, and then ran on one lung. 
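
Here is a rough sketch of the refill idea in Python. It is not a
description of the Cyclone's actual mechanism. It assumes a write-through
cache, so main memory always holds a good copy of every cached word; the
class and field names are invented for the example. A parity mismatch on
a read is then treated like a miss: discard the line and fetch it again.

```python
def parity(word: int) -> int:
    """Even-parity bit for a word."""
    return bin(word).count("1") & 1


class Cache:
    def __init__(self, memory):
        self.memory = memory            # backing store, assumed correct
        self.lines = {}                 # addr -> (data, parity bit)
        self.refills_after_error = 0    # count of parity-triggered refills

    def _fill(self, addr):
        """Load a word from memory and record its parity alongside it."""
        data = self.memory[addr]
        self.lines[addr] = (data, parity(data))
        return data

    def read(self, addr):
        if addr not in self.lines:
            return self._fill(addr)     # ordinary miss
        data, stored = self.lines[addr]
        if parity(data) != stored:
            # Parity error: the cached copy is suspect, so refill it.
            self.refills_after_error += 1
            return self._fill(addr)
        return data


memory = {0x100: 0xCAFE}
cache = Cache(memory)
cache.read(0x100)                       # first read fills the line
data, stored = cache.lines[0x100]
cache.lines[0x100] = (data ^ 1, stored) # simulate a flipped bit in the cache RAM
print(hex(cache.read(0x100)), cache.refills_after_error)
```

With a write-back cache the trick only works for clean lines, since a
dirty line has no good copy elsewhere.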
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science

news@haddock.ima.isc.com (overhead) (01/18/90)

In article <7608@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>In article <34469@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>> a) What are CPU differences between micros and mainframes in this area?
>> Are there reliability features of current big machine CPUs that
>> are impossible to duplicate in micros? hard to duplicate? easy, but,
>> too expensive? Are there features of micros that make them easier to
>> make reliable systems from?

At least around here, the UNIX based workstations run around the
clock.  If you want reliability, why not periodically run
diagnostics?  A 'cron' job could run at 3 AM and report any
errors.  The CPU, FPU, RAM, DISK, and some other parts could be
checked reasonably well without consuming too much resources.
I'd prefer this to parity, where the only thing the machine does
is crash.  I still say provide real error correction, or don't bother.

Lots of machines out there have never had any real diagnostics
run on them since installation.  In particular, disk blocks do
become bad.  This can be very frustrating, etc.  The idea of
running periodic diagnostics is not new.
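
As a rough idea of what such a job might look like, here is a small
user-level memory pattern test in Python. Everything in it is an
assumption for illustration: the script path in the crontab comment is
hypothetical, and a user process only exercises whatever physical pages
the OS happens to hand it, so this is far weaker than a stand-alone
diagnostic.

```python
# Hypothetical crontab entry to run the check at 3 AM every day:
#   0 3 * * * /usr/local/diag/memcheck.py
# cron normally mails any output to the job's owner, so printing only on
# failure gives a report when something is wrong and silence otherwise.

PATTERNS = (0x00, 0xFF, 0x55, 0xAA)   # all zeros, all ones, alternating bits


def fill(buf: bytearray, pattern: int) -> None:
    """Write the pattern byte into every position of the buffer."""
    buf[:] = bytes([pattern]) * len(buf)


def verify(buf: bytearray, pattern: int):
    """Return (offset, expected, found) for each byte that reads back wrong."""
    return [(i, pattern, b) for i, b in enumerate(buf) if b != pattern]


def pattern_test(size: int = 1 << 16):
    """Fill and verify one buffer with each pattern; return all mismatches."""
    buf = bytearray(size)
    errors = []
    for pattern in PATTERNS:
        fill(buf, pattern)
        errors.extend(verify(buf, pattern))
    return errors


for offset, expected, found in pattern_test():
    print(f"memory mismatch at offset {offset}: "
          f"wrote {expected:#04x}, read {found:#04x}")
```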

Stephen.

aglew@oberon.csg.uiuc.edu (Andy Glew) (01/18/90)

>At least around here, the UNIX based workstations run around the
>clock.  If you want reliability, why not periodically run
>diagnostics?  A 'cron' job could run at 3 AM and report any
>errors.  The CPU, FPU, RAM, DISK, and some other parts could be
>checked reasonably well without consuming too much resources.
>I'd prefer this to parity, where the only thing the machine does
>is crash.  I still say provide real error correction, or don't bother.

Good idea...

Except that most of the diagnostic programs written concurrent with
hardware development (the diagnostics that may consume most of the
hardware development budget) assume that they have exclusive control
of the CPU.  They can do things like turning cache on and off,
deliberately writing bad data and then waiting for the trap, etc.
This is why most of these diagnostic programs only run when the
system is booting, or otherwise not running UNIX.

Some sorts of stress diagnostics can be run on a normal UNIX system.
But, normal multiuser activity, such as mandatory interrupts every
1/60th of a second, can mask the very sort of errors that you are
looking for.

Note that many of these activities also require hardware privilege
to do things like turning off the TLB. (Not just root).  I have heard
of kernels that have diagnostics integrated with them, but the kernel
is large enough already.

When we were placing Gould's Real Time UNIX on the Gould NPL, the
diagnostics engineers started thinking about putting diagnostics up
that would run under UNIX.  Real-Time UNIX gave (privileged) user
processes the ability to acquire any sort of hardware privilege, and, in
effect, take over the entire system. This was particularly attractive
for multiple-CPU systems, where one CPU could be isolated, diagnostics
run, and then released back to normal UNIX operations.  Even on a single
CPU system a privileged diagnostic process could take over the system,
run diagnostics, and then return to UNIX.  It would probably have to be
more careful about starting UNIX up, though, than in the multiple CPU
case.

SUMMARY:
    Regularly running diagnostics benefits from real-time UNIX features
    and multiple CPUs.


aglew@uiuc.edu
--
Andy Glew, aglew@uiuc.edu

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (01/18/90)

In article <15679@haddock.ima.isc.com> suitti@anchovy.UUCP (Stephen Uitti) writes:
>In article <7608@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>clock.  If you want reliability, why not periodically run
>diagnostics?  A 'cron' job could run at 3 AM and report any

In fact, it is not unheard of to run ALU diagnostics in "idle".
When you go beyond that, to memory, channels, disks, etc., you
can easily start hurting the performance of other things going on (by
flushing the cache(s), by affecting other processors and processes, by
thrashing the disks, or using network bandwidth...)  But, it
is a good idea to check the ALU when the system is on idle, as long as you
can do it without hurting performance.  The systems where I saw it done
in the past did not have caches, so a few stray memory references were not
a big deal.  I am not sure if you could write an effective ALU diagnostic
that didn't have the effect of flushing the cache after a few million
instructions...   Does anyone know if any Unix kernels have this capability?
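
One way to keep such a check from disturbing the cache is to make it a
short table of known-answer tests, so that code and data stay within a
few cache lines however many times it runs. The sketch below shows the
shape of that idea in Python. It is only a stand-in: a real version would
be a handful of machine instructions in the kernel idle loop, and the
test vectors and the has_runnable_work hook are invented for the example.

```python
import operator

# (operation, operand a, operand b, expected result)
VECTORS = (
    (operator.add,    0x7FFF, 0x0001, 0x8000),
    (operator.sub,    0x0000, 0x0001, -1),
    (operator.mul,    0x1234, 0x0010, 0x12340),
    (operator.and_,   0xF0F0, 0x0FF0, 0x00F0),
    (operator.or_,    0xF0F0, 0x0FF0, 0xFFF0),
    (operator.xor,    0xAAAA, 0x5555, 0xFFFF),
    (operator.lshift, 0x0001, 15,     0x8000),
)


def alu_self_test():
    """Run every vector once; return a list describing any wrong results."""
    failures = []
    for op, a, b, expected in VECTORS:
        got = op(a, b)
        if got != expected:
            failures.append((op.__name__, a, b, expected, got))
    return failures


def idle_loop(has_runnable_work, max_passes):
    """Repeat the self test until real work shows up, a failure is seen,
    or the pass budget runs out. Returns (passes completed, failures)."""
    passes = 0
    while passes < max_passes and not has_runnable_work():
        failures = alu_self_test()
        if failures:
            return passes, failures
        passes += 1
    return passes, []


print(idle_loop(lambda: False, 1000))
```

Checking for runnable work before every pass keeps the test from delaying
anything that matters; whether a table this small catches real ALU faults
is a separate question.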


  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

vinoski@apollo.HP.COM (Stephen Vinoski) (01/18/90)

In article <AGLEW.90Jan17142256@oberon.csg.uiuc.edu> aglew@oberon.csg.uiuc.edu (Andy Glew) writes:
>>At least around here, the UNIX based workstations run around the
>>clock.  If you want reliability, why not periodically run
>>diagnostics?  A 'cron' job could run at 3 AM and report any
>>errors.  The CPU, FPU, RAM, DISK, and some other parts could be
>>checked reasonably well without consuming too much resources.
>
>Good idea...
>
>Except that most of the diagnostic programs written concurrent with
>hardware development (the diagnostics that may consume most of the
>hardware development budget) assume that they have exclusive control
>of the CPU.  They can do things like turning cache on and off,
>deliberately writing bad data and then waiting for the trap, etc.
>This is why most of these diagnostic programs only run when the
>system is booting, or otherwise not running UNIX.
>
>Some sorts of stress diagnostics can be run on a normal UNIX system.

Stress diagnostics become very applicable when machines are single-user, such as
in the (ideal) workstation world.  The Testability and Diagnostics Department
here at Apollo has a system called SAX (System Acceptance EXercisor) which does
just that.  It doesn't assume that it has exclusive control of the CPU, but it
"beats up" the system so much when it is running that no other useful work can
be done.  It runs on top of the operating system and, to my knowledge, uses no
special system calls.

It can be configured so that it runs automatically in a chosen time slot; it
then notifies the user of any problems via email.  It is usually run overnight,
and it can and regularly does catch problems well before they become critical.
Because it runs in a multitasking environment, it also catches
problems that cannot be detected by most boot diagnostics and stand-alone
diagnostics, such as bus arbitration problems and cache coherency troubles.


-steve
| Steve Vinoski                                                                |
| Hewlett-Packard Apollo Division, Testability and Diagnostics Dept.           |
| Chelmsford, MA    01824    (508)256-6600 x5904                               | 
| Internet: vinoski@apollo.com UUCP: {mit-eddie,yale,uw-beaver}!apollo!vinoski |

andrew@frip.WV.TEK.COM (Andrew Klossner) (01/19/90)

[]

	"I'd prefer this to parity, where the only thing the machine
	does is crash.  I still say provide real error correction, or
	don't bother."

The usual counter-argument is to consider a system which takes an error
while printing paychecks.  You'd much rather see the system crash than
finish printing bad checks.

  -=- Andrew Klossner   (uunet!tektronix!frip.WV.TEK!andrew)    [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]