lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (01/17/90)
In article <34469@mips.mips.COM> mash@mips.COM (John Mashey) writes:
> a) What are CPU differences between micros and mainframes in this area?
> Are there reliability features of current big machine CPUs that
> are impossible to duplicate in micros? hard to duplicate? easy, but,
> too expensive? Are there features of micros that make them easier to
> make reliable systems from?

Mainframes tend to put parity _everywhere_, and most micros are
completely without it.

Ah, but what do you do when an error occurs? Microcoded machines drop
into diagnostic microcode, which analyzes, reports, and then tries to
resume/restart the macroinstruction. Some machines had redundant
hardware (e.g. two ALUs) and could reconfigure to cut failed units
out of the "processor complex". I don't see micros following this
path.

> b) Are any of these reliability features from mainframes not so
> necessary when entire CPUs are on single chips?

Yes: the stuff above. Chips have failure modes, and "age", but to the
first order, the (un)reliability of a box depends on its chip-pin
count. Putting the CPU on one chip, instead of 2,000, has a serious
impact. Further, the micro solution allows tricks like master/checker
pairs, which you just wouldn't do if the processor was in a box 40
feet long.

The #1 reason for "parity everywhere" was to detect that you were in
trouble. The #2 reason was to identify the field-replaceable module
(which for a micro is the whole CPU, or more). The trailing #3 reason
was the hope of live CPU recovery.

Live CPU recovery has become much less interesting since
multiprocessors came along. With the right software, a failed
processor does not imply a failed process. For example, Tandem
checkpoints each process regularly, so that a different processor can
do a prompt checkpoint-resumption. The CPU and IO interconnects have
to be up to it, of course (dual-port those disks). And besides: if a
master/checker pair of CPUs disagree, which one was the one that
failed? Better to ignore them both and force the board into self-test
mode.

> c) Beyond the CPUs, what are the issues that might be different
> at the system level?

Well, nonstop machines are ruggedized and rated for e.g. sudden
overpressures (no kidding). This might influence a chip company to
change its chip packaging, but not its chip design.

> d) ECC, parity, nothing: where are the boundaries on tradeoffs?

Well, the Cyclone uses cache refill as a way to fix cache parity
errors. And they have extra cache RAMs that they can spare in. But it
would probably be OK (and simpler) if the machine just disabled a
quarter or a half of its cache, and then ran on one lung.
-- 
Don    D.C.Lindsay    Carnegie Mellon Computer Science
news@haddock.ima.isc.com (overhead) (01/18/90)
In article <7608@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>In article <34469@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>> a) What are CPU differences between micros and mainframes in this area?
>> Are there reliability features of current big machine CPUs that
>> are impossible to duplicate in micros? hard to duplicate? easy, but,
>> too expensive? Are there features of micros that make them easier to
>> make reliable systems from?

At least around here, the UNIX based workstations run around the
clock. If you want reliability, why not periodically run diagnostics?
A 'cron' job could run at 3 AM and report any errors. The CPU, FPU,
RAM, disk, and some other parts could be checked reasonably well
without consuming too many resources. I'd prefer this to parity,
where the only thing the machine does is crash. I still say provide
real error correction, or don't bother.

Lots of machines out there have never had any real diagnostics run on
them since installation. In particular, disk blocks do become bad.
This can be very frustrating. The idea of running periodic
diagnostics is not new.

Stephen.
aglew@oberon.csg.uiuc.edu (Andy Glew) (01/18/90)
>At least around here, the UNIX based workstations run around the
>clock. If you want reliability, why not periodically run
>diagnostics? A 'cron' job could run at 3 AM and report any
>errors. The CPU, FPU, RAM, DISK, and some other parts could be
>checked reasonably well without consuming too many resources.
>I'd prefer this to parity, where the only thing the machine does
>is crash. I still say provide real error correction, or don't bother.

Good idea...

Except that most of the diagnostic programs written concurrently with
hardware development (the diagnostics that may consume most of the
hardware development budget) assume that they have exclusive control
of the CPU. They can do things like turning the cache on and off,
deliberately writing bad data and then waiting for the trap, etc.
This is why most of these diagnostic programs only run when the
system is booting, or otherwise not running UNIX.

Some sorts of stress diagnostics can be run on a normal UNIX system.
But normal multiuser activity, such as mandatory interrupts every
1/60th of a second, can mask the very sort of errors that you are
looking for. Note that many of these activities also require hardware
privilege to do things like turning off the TLB (not just root). I
have heard of kernels that have diagnostics integrated with them, but
the kernel is large enough already.

When we were porting Gould's Real-Time UNIX to the Gould NPL, the
diagnostics engineers started thinking about putting up diagnostics
that would run under UNIX. Real-Time UNIX gave (privileged) user
processes the ability to acquire any sort of hardware privilege and,
in effect, take over the entire system. This was particularly
attractive for multiple-CPU systems, where one CPU could be isolated,
diagnostics run on it, and then released back to normal UNIX
operation. Even on a single-CPU system a privileged diagnostic
process could take over the system, run diagnostics, and then return
to UNIX. It would probably have to be more careful about starting
UNIX up again, though, than in the multiple-CPU case.

SUMMARY: Regularly running diagnostics benefits from real-time UNIX
features and multiple CPUs.

aglew@uiuc.edu
--
Andy Glew, aglew@uiuc.edu
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (01/18/90)
In article <15679@haddock.ima.isc.com> suitti@anchovy.UUCP (Stephen Uitti) writes:
>In article <7608@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>clock. If you want reliability, why not periodically run
>diagnostics? A 'cron' job could run at 3 AM and report any

In fact, it is not unheard of to run ALU diagnostics in "idle". When
you go beyond that, to memory, channels, disks, etc., you can easily
start hurting the performance of other things going on (by flushing
the cache(s), by affecting other processors and processes, by
thrashing the disks, or by using network bandwidth...)

But it is a good idea to check the ALU when the system is idle, as
long as you can do it without hurting performance. The systems where
I saw it done in the past did not have caches, so a few stray memory
references were not a big deal. I am not sure you could write an
effective ALU diagnostic that didn't have the effect of flushing the
cache after a few million instructions...

Does anyone know if any Unix kernels have this capability?

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035    Phone:  (415)694-6117
vinoski@apollo.HP.COM (Stephen Vinoski) (01/18/90)
In article <AGLEW.90Jan17142256@oberon.csg.uiuc.edu> aglew@oberon.csg.uiuc.edu (Andy Glew) writes:
>>At least around here, the UNIX based workstations run around the
>>clock. If you want reliability, why not periodically run
>>diagnostics? A 'cron' job could run at 3 AM and report any
>>errors. The CPU, FPU, RAM, DISK, and some other parts could be
>>checked reasonably well without consuming too many resources.
>
>Good idea...
>
>Except that most of the diagnostic programs written concurrent with
>hardware development (the diagnostics that may consume most of the
>hardware development budget) assume that they have exclusive control
>of the CPU. They can do things like turning cache on and off,
>deliberately writing bad data and then waiting for the trap, etc.
>This is why most of these diagnostic programs only run when the
>system is booting, or otherwise not running UNIX.
>
>Some sorts of stress diagnostics can be run on a normal UNIX system.

Stress diagnostics become very applicable when machines are
single-user, such as in the (ideal) workstation world. The
Testability and Diagnostics Department here at Apollo has a system
called SAX (System Acceptance EXercisor) which does just that. It
doesn't assume that it has exclusive control of the CPU, but it
"beats up" the system so much when it is running that no other useful
work can be done. It runs on top of the operating system and, to my
knowledge, uses no special system calls. It can be configured to run
automatically in a chosen time slot; it then notifies the user of any
problems via email. It is usually run overnight, and it can and
regularly does catch problems well before they become critical.
Because it runs in a multitasking environment, it also catches
problems that cannot be detected by most boot diagnostics and
stand-alone diagnostics, such as bus arbitration problems and cache
coherency troubles.
-steve

| Steve Vinoski                                                     |
| Hewlett-Packard Apollo Division, Testability and Diagnostics Dept.|
| Chelmsford, MA 01824         (508)256-6600 x5904                  |
| Internet: vinoski@apollo.com                                      |
| UUCP: {mit-eddie,yale,uw-beaver}!apollo!vinoski                   |
andrew@frip.WV.TEK.COM (Andrew Klossner) (01/19/90)
[]

"I'd prefer this to parity, where the only thing the machine does is
crash. I still say provide real error correction, or don't bother."

The usual counter-argument is to consider a system which takes an
error while printing paychecks. You'd much rather see the system
crash than finish printing bad checks.

  -=- Andrew Klossner   (uunet!tektronix!frip.WV.TEK!andrew)    [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]