mash@mips.COM (John Mashey) (01/15/90)
In article <7566@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>Yes. A case in point is the new Cyclone processor from Tandem. I'm
>not knocking it: I'm sure that it was built by sharp people, and will
>be sold successfully. It has the important property that it's
>bit-compatible with Tandem's previous stack machines.
....
>Is it reliable? Well, yes, it's a Tandem product. It has parity and
>temperature compensation and a diagnostic processor and spare cache
>RAMs. But a Killer Micro with the same throughput could be made more
>reliable, at a lower price, simply from its reduced chip count.

This comparison is somewhat orthogonal to the previous line of discussion, in that I'd expect the tradeoffs for Cyclones are probably different from those for either K.M.s or supercomputers. Also, Tandem is obviously aware of K.M.s and is incorporating them into its product line.

However, this might start off a new line of discussion. There has been plenty of discussion of the differences between micros and {mainframes, supers} in areas like I/O. How about one in the area of reliability, with issues like:

a) What are CPU differences between micros and mainframes in this area?
   Are there reliability features of current big machine CPUs that are
   impossible to duplicate in micros? hard to duplicate? easy, but
   too expensive?
   Are there features of micros that make them easier to make reliable
   systems from?

b) Are any of these reliability features from mainframes not so
   necessary when entire CPUs are on single chips?

c) Beyond the CPUs, what are the issues that might be different at the
   system level?

d) ECC, parity, nothing: where are the boundaries on tradeoffs?
   (example: R3000s require/support parity on caches; from customer
   demand, IDT's 3001 can omit parity to lessen costs, because many
   embedded customers don't want it; of course, this would give other
   kinds of embedded customers (avionics, switches for example)
   nightmares.)
   How big does a write-back cache get before you really want ECC?
   How about parity and ECC on busses?
   How about parity on ALU operations?

Needless to say, hard data on error rates, and environments in which errors are OK, not-OK, would be good....
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
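
To make the "ECC, parity, nothing" boundary in (d) a little more concrete, here is a minimal sketch of what each option buys. It is an illustration only: the function names are hypothetical, and the toy Hamming(7,4) code stands in for the wider SEC-DED codes real memory systems use. A single parity bit detects any one flipped bit but cannot say which one; a Hamming code spends more check bits and can locate and repair it.

```python
def parity_bit(word):
    """Even-parity check bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") & 1


def hamming74_encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword.

    Codeword positions 1..7 hold p1, p2, d0, p4, d1, d2, d3;
    position n is stored in bit n-1 of the returned integer.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(code):
    """Return (nibble, syndrome). A nonzero syndrome is the position
    (1..7) of the single bit that was flipped and has been corrected."""
    bits = [(code >> i) & 1 for i in range(7)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome:
        bits[syndrome - 1] ^= 1  # repair the offending bit
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, syndrome
```

Note what neither scheme does: parity misses any even number of flipped bits, and plain Hamming(7,4) will "correct" a double-bit error into the wrong data, which is why practical ECC adds an overall parity bit to get double-error detection.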
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (01/17/90)
In article <34469@mips.mips.COM> mash@mips.COM (John Mashey) writes:
> How about parity and ECC on busses?
> How about parity on ALU operations?

Has anyone a good reference for error checking on ALU operations? I remember reading about error checking in the IBM 360/67. The typical pipelined machine of my youth, the CDC 7600, had no such checking. It is really fun to be in a central service bureau the day you discover that your F.P. operations have been broken for the last three days :-)

Hugh LaMaster, m/s 233-9, UUCP ames!lamaster
NASA Ames Research Center ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035 Phone: (415)694-6117
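
One textbook approach to checking ALU operations is residue checking: carry each operand's value modulo a small number alongside it, do the same operation on the residues, and compare. The sketch below assumes modulus 3 and unbounded Python integers (so no wraparound correction is needed); the names are hypothetical, and it is not a claim about what the 360/67 or any other particular machine did.

```python
MOD = 3  # assumed check modulus; small odd moduli are the usual choice


def checked_add(a, b, alu_add=lambda x, y: x + y):
    """Add via a (possibly faulty) ALU and verify the result mod 3.

    Returns (result, ok). ok is False when the residue of the result
    disagrees with the residue predicted from the operands.
    """
    result = alu_add(a, b)
    expected = (a % MOD + b % MOD) % MOD
    return result, result % MOD == expected


def flip_bit_adder(bit):
    """Hypothetical faulty adder that flips one bit of every sum."""
    return lambda x, y: (x + y) ^ (1 << bit)


def off_by_three_adder(x, y):
    """Hypothetical faulty adder whose error is a multiple of 3."""
    return x + y + 3
```

A single flipped bit changes the sum by plus or minus a power of two, which is never a multiple of 3, so `checked_add(20, 22, flip_bit_adder(5))` reports a failure. An error that happens to be a multiple of 3, as in `off_by_three_adder`, slips through, so this adds confidence rather than certainty.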
rpw3@rigden.wpd.sgi.com (Robert P. Warnock) (01/18/90)
In article <40694@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
+---------------
| In article <34469@mips.mips.COM> mash@mips.COM (John Mashey) writes:
| > How about parity on ALU operations?
| ...really fun to be in a central service bureau the day you discover that
| your F.P. operations have been broken for the last three days :-)
+---------------

Real live case that "mere" parity on the ALU ops wouldn't have caught:

Circa 1970 at the Emory University Chemistry Department we had a DEC PDP-10 (KA10) which started giving wrong answers on "a few" programs (actually, only on one or two specific input data sets on each of one or two programs).

Turned out there was a transistor going leaky on the clear line of the latch (part of the instruction register) that stored which general register the results of a floating-point op got written back to. Some input data sets made enough "noise" in the floating point that "occasionally" floating-point instructions wrote their results to AC0 instead of whichever AC they were supposed to.

AC0 was FORTRAN's subroutine value return reg, so generally the stomping had no "obvious" disastrous effects -- no wild array ref's, no wild jumps. (And of course the kernel uses no fl-pt.)

*HARD* to find; trivial to fix.

ALU parity adds some confidence, but not certainty. And no help in this case.

-Rob

-----
Rob Warnock, MS-9U/510 rpw3@sgi.com rpw3@pei.com
Silicon Graphics, Inc. (415)335-1673 Protocol Engines, Inc.
2011 N. Shoreline Blvd.
Mountain View, CA 94039-7311
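
The failure described above can be modeled in a few lines to show why data parity is no help. This is a toy sketch, not a KA10 simulation: the register file is a list of (value, parity) pairs, the values are small integers rather than 36-bit floats, and the `dest_cleared` flag is a hypothetical stand-in for the leaky clear line on the destination latch.

```python
def parity(word):
    """Even-parity check bit for a data word."""
    return bin(word).count("1") & 1


def fp_writeback(regs, dest, value, dest_cleared=False):
    """Write value and its parity bit to an accumulator.

    When dest_cleared is set, the destination field has been spuriously
    cleared, so the result lands in AC0 instead of the intended AC.
    Returns the accumulator number actually written.
    """
    actual = 0 if dest_cleared else dest
    regs[actual] = (value, parity(value))
    return actual


def all_parity_ok(regs):
    """True if every accumulator's stored parity matches its value."""
    return all(parity(value) == p for value, p in regs)
```

With `regs = [(0, 0)] * 16`, calling `fp_writeback(regs, 5, 0o1234, dest_cleared=True)` leaves AC5 stale and AC0 stomped, yet `all_parity_ok(regs)` still returns True: the data word is intact, it is simply in the wrong place.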