[comp.arch] The Killer Micro From Hell (actually, reliability)

mash@mips.COM (John Mashey) (01/15/90)

In article <7566@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:

>Yes. A case in point is the new Cyclone processor from Tandem. I'm
>not knocking it: I'm sure that it was built by sharp people, and will
>be sold successfully. It has the important property that it's bit-
>compatible with Tandem's previous stack machines.
....
>Is it reliable? Well, yes, it's a Tandem product. It has parity and
>temperature compensation and a diagnostic processor and spare cache
>RAMs. But a Killer Micro with the same throughput could be made more
>reliable, at a lower price, simply from its reduced chip count.

This comparison is somewhat orthogonal to the previous line of discussion,
in that I'd expect the tradeoffs for Cyclones are probably different
from those for either K.M.s or supercomputers.  Also, Tandem is obviously
aware of K.M.s and is incorporating them into its product line.

However, this might start off a new line of discussion.  There has been plenty
of discussion of the differences between micros and {mainframes, supers}
in areas like I/O.  How about one in the area of reliability, with issues like:
	a) What are CPU differences between micros and mainframes in this area?
	Are there reliability features of current big machine CPUs that
	are impossible to duplicate in micros? hard to duplicate? easy, but
	too expensive? Are there features of micros that make it easier to
	build reliable systems from them?
	b) Are any of these reliability features from mainframes not so
	necessary when entire CPUs are on single chips?
	c) Beyond the CPUs, what are the issues that might be different
	at the system level?
	d) ECC, parity, nothing: where are the boundaries on tradeoffs?
		(example: R3000s require/support parity on caches;
		from customer demand, IDT's 3001 can omit parity to lessen
		costs, because many embedded customers don't want it;
		of course, this would give other kinds of embedded customers
		(avionics, switches for example) nightmares.)
		How big does a write-back cache get before you really want ECC?
		How about parity and ECC on busses?
		How about parity on ALU operations?
Needless to say, hard data on error rates, and on environments in which
errors are OK or not-OK, would be good....
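
[Editor's sketch, not part of the original post.] To make the parity/ECC
tradeoff in (d) concrete: a single parity bit detects any odd number of
flipped bits but cannot say which bit flipped, and it misses a double flip
entirely, while a single-error-correcting code can also repair the word.
The Python below is a minimal illustration under stated assumptions: the
function names are hypothetical, the word sizes are toy ones, and a textbook
Hamming(7,4) code stands in for whatever ECC a real cache or bus would use.

```python
def parity_bit(word):
    """Even-parity bit: 1 when the word holds an odd number of 1 bits."""
    return bin(word).count("1") & 1

def parity_ok(word, stored_parity):
    """True when the recomputed parity matches the stored parity bit."""
    return parity_bit(word) == stored_parity

DATA_POSITIONS = (3, 5, 6, 7)    # Hamming(7,4): data bits live here
CHECK_POSITIONS = (1, 2, 4)      # check bits live at the powers of two

def hamming74_encode(nibble):
    """Encode 4 data bits into codeword positions 1..7 (index 0 unused)."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    for p in CHECK_POSITIONS:
        for pos in range(1, 8):
            if pos != p and pos & p:
                bits[p] ^= bits[pos]
    return bits

def hamming74_decode(bits):
    """Return (nibble, syndrome); a nonzero syndrome is the repaired position."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    if syndrome:
        bits[syndrome] ^= 1      # correct the single flipped bit
    nibble = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, syndrome

# Parity: one flipped bit is caught, two flipped bits slip through.
word = 0b10110010
stored = parity_bit(word)
one_flip_detected = not parity_ok(word ^ 0b00000100, stored)
two_flips_detected = not parity_ok(word ^ 0b00010100, stored)

# ECC: one flipped bit is located and repaired.
codeword = hamming74_encode(0b1011)
codeword[6] ^= 1
recovered, fixed_position = hamming74_decode(codeword)
```

This bears on the write-back cache question: with parity alone, a detected
error in a dirty line leaves no clean copy to refetch, whereas a correcting
code can restore the data in place.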
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (01/17/90)

In article <34469@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>		How about parity and ECC on busses?
>		How about parity on ALU operations?

Has anyone a good reference for error checking on ALU operations?  I
remember reading about error checking in the IBM 360/67.  The typical
pipelined machine of my youth, the CDC 7600, had no such checking.  It is
really fun to be in a central service bureau the day you discover that
your F.P. operations have been broken for the last three days  :-)
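
[Editor's sketch, not part of the original post.] One well-known family of
techniques for checking arithmetic is residue checking: carry a small
residue (here mod 3) alongside each operand, predict the residue of the
result independently, and compare. The Python below is only an illustration
of the idea on unbounded integers; it is not a description of what the
360/67 or any other machine did, and all names are hypothetical.

```python
def checked_add(a, b, alu):
    """Run alu(a, b) and compare its mod-3 residue with the predicted one."""
    result = alu(a, b)
    predicted = (a % 3 + b % 3) % 3    # computed without using the ALU result
    return result, (result % 3 == predicted)

def good_alu(a, b):
    return a + b

def alu_with_stuck_bit(a, b):
    # Hypothetical fault: bit 5 of the sum is inverted.
    return (a + b) ^ (1 << 5)

def alu_off_by_three(a, b):
    # Hypothetical fault whose error is a multiple of 3.
    return a + b + 3

ok_result = checked_add(1000, 2345, good_alu)
stuck_result = checked_add(1000, 2345, alu_with_stuck_bit)
off_by_three_result = checked_add(1000, 2345, alu_off_by_three)
```

A single flipped bit changes the result by a power of two, which is never a
multiple of 3, so the check flags it. An error that happens to be a multiple
of 3 passes unnoticed, so the check raises confidence without giving
certainty.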

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117

rpw3@rigden.wpd.sgi.com (Robert P. Warnock) (01/18/90)

In article <40694@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov
(Hugh LaMaster) writes:
+---------------
| In article <34469@mips.mips.COM> mash@mips.COM (John Mashey) writes:
| >		How about parity on ALU operations?
| ...really fun to be in a central service bureau the day you discover that
| your F.P. operations have been broken for the last three days  :-)
+---------------

Real live case that "mere" parity on the ALU ops wouldn't have caught:

Circa 1970 at the Emory University Chemistry Department we had a DEC PDP-10
(KA10) which started giving wrong answers on "a few" programs (actually,
only on one or two specific input data sets on each of one or two programs).

Turned out there was a transistor going leaky on the clear line of the
latch (part of the instruction register) that stored which general register
the results of a floating-point op got written back to. Some input
data sets made enough "noise" in the floating point that "occasionally"
floating-point instructions wrote their results to AC0 instead of whichever
AC they were supposed to. AC0 was FORTRAN's subroutine value return reg,
so generally the stomping had no "obvious" disastrous effects -- no wild
array ref's, no wild jumps. (And of course the kernel uses no fl-pt.)
*HARD* to find; trivial to fix.

ALU parity adds some confidence, but not certainty. And no help in this case.
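
[Editor's sketch, not part of the original post.] A toy model of why parity
is no help against this kind of fault: the arithmetic is right and the
stored word carries valid parity, but it lands in the wrong register. The
Python below is a loose illustration, not a model of the KA10. The class and
function names, the `leaky_clear` flag, and the use of integers in place of
floating-point words are all assumptions made for the example.

```python
def parity(word):
    """Even-parity bit for a non-negative integer word."""
    return bin(word).count("1") & 1

class RegisterFile:
    """Sixteen accumulators, each stored as (value, parity bit)."""

    def __init__(self, count=16):
        self.regs = [(0, 0)] * count

    def write(self, index, value):
        self.regs[index] = (value, parity(value))

    def all_parity_ok(self):
        return all(parity(value) == p for value, p in self.regs)

def add_to_register(regfile, dest, a, b, leaky_clear=False):
    """Add a and b and write the sum to AC[dest].

    With leaky_clear set, the destination field is cleared before the
    write-back, so the (correct) result goes to AC0 instead.
    """
    result = a + b
    if leaky_clear:
        dest = 0
    regfile.write(dest, result)

acs = RegisterFile()
add_to_register(acs, dest=5, a=20, b=22, leaky_clear=True)
value_in_ac0 = acs.regs[0][0]     # holds the sum that belonged in AC5
value_in_ac5 = acs.regs[5][0]     # never updated
parity_clean = acs.all_parity_ok()
```

Every register still passes its parity check, yet the machine state is
wrong, which matches the point above: data-path checks say nothing about
control-path faults.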

-Rob


-----
Rob Warnock, MS-9U/510		rpw3@sgi.com		rpw3@pei.com
Silicon Graphics, Inc.		(415)335-1673		Protocol Engines, Inc.
2011 N. Shoreline Blvd.
Mountain View, CA  94039-7311