carey@m.cs.uiuc.edu (08/05/88)
I would like to get (adapt, or write) a memory test routine for
a VAX-750 or a VAX-780.

I want it to run under UNIX, and if it crashes the machine when it runs
and encounters a hard error, that is OK (because I would normally run it
only when there was a hard error present). It doesn't need to print out
any extra messages, because the kernel should print out an error message
when it encounters a memory error. However, I wouldn't complain if it did
print out its own messages.

This should be pretty simple, but I don't really know how to go about it.
If I have to write it myself, I would appreciate any hints or advice.
Any references to writing VAX assembly code would be appreciated
(especially if I am likely to find them in our library), and any help in
embedding such a program in a C program would be great, too.

Maybe there are some tools like this available; if so, please let me know.

carey@cs.uiuc.edu
carey@{uunet,seismo,pur-ee...}!uiucdcs
chris@mimsy.UUCP (Chris Torek) (08/05/88)
In article <3300032@m.cs.uiuc.edu> carey@m.cs.uiuc.edu writes:
>I would like to get (adapt, or write) a memory test routine for
>a VAX-750 or a VAX-780.
>
>I want it to run under UNIX, ....

Memory tests usually cannot run underneath a virtual memory O/S, since
they have no way to force the machine to allocate particular memory
regions. To have any hope of testing all of memory (except that part in
which the test program is loaded, unless it moves itself), the program
will have to run standalone.

>This should be pretty simple, but I don't really know how to go about it.

A trivial test, like the ones in DEC's microdiagnostics, is not too hard.
A complete test is infeasible: there are 2^(total number of bits)
patterns to test. You can test for common faults such as address and
data line shorts, stuck bits, etc.; but no matter what you test, there
will always be *some* failure mode you will miss. (For instance, early
MicroVAX IIs would sometimes crash if idle. This happened only when
running Unix, never under VMS. It turned out that the refresh on some of
the DRAMs was weak, and after not being accessed for 30 seconds or so,
they would begin to forget. VMS's idle loop did enough work that the
external RAS- or CAS-only refresh hit all the rows or columns and thus
took care of it. The 4BSD idle loop is just a few instructions, and did
not.)

>If I have to write it myself, I would appreciate any hints or advice.

The easiest test for a Vax with ECC is to let the machine run normally.
Single-bit errors will be corrected automatically, with a note from the
O/S (as reported by the hardware) when this happens. The `address' and
`syndrome' bits identify exactly which chip is failing, although the only
way to go from address+syndrome to chip is via the manufacturer's tables
(or trace the board! ... not for me, thanks), which are sometimes hard
to find. (Of course, if the ECC logic is broken . . . .)
-- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163) Domain: chris@mimsy.umd.edu Path: uunet!mimsy!chris
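
The common-fault checks mentioned above (stuck bits, data line shorts, address line shorts) can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the function names test_data_bits and test_region are hypothetical, and the code exercises an ordinary buffer handed to it. Run under a virtual memory O/S, it tests only whatever physical pages happen to back that buffer, and a cache may satisfy the reads, which is exactly the limitation described in the post above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walking-ones test on a single word (hypothetical helper).
 * Writes a lone 1 bit in every bit position and reads it back,
 * which catches stuck or shorted data bits.
 * Returns 0 if every pattern reads back correctly, -1 otherwise.
 */
static int test_data_bits(volatile unsigned long *addr)
{
    unsigned long pattern;

    for (pattern = 1; pattern != 0; pattern <<= 1) {
        *addr = pattern;
        if (*addr != pattern)
            return -1;
    }
    return 0;
}

/*
 * Address-in-address test over a region (hypothetical helper).
 * Pass 1 stores each word's own index; if two addresses alias
 * because of a shorted or stuck address line, one of them reads
 * back the wrong index. Pass 2 repeats with the complement so
 * that every bit is tested as both 0 and 1.
 * Returns the index of the first failing word, or -1 if none failed.
 */
static long test_region(volatile unsigned long *base, size_t nwords)
{
    size_t i;

    for (i = 0; i < nwords; i++)
        base[i] = (unsigned long)i;
    for (i = 0; i < nwords; i++)
        if (base[i] != (unsigned long)i)
            return (long)i;

    for (i = 0; i < nwords; i++)
        base[i] = ~(unsigned long)i;
    for (i = 0; i < nwords; i++)
        if (base[i] != ~(unsigned long)i)
            return (long)i;

    return -1;
}
```

A caller would pass the start of the region and its length in words, for example test_region(buf, nwords) on a buffer it owns. A standalone version would instead be pointed at physical addresses, which is outside what this sketch attempts.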
jbs@fenchurch.MIT.EDU (Jeff Siegal) (08/05/88)
In article <12849@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>The easiest test for a Vax with ECC is to let the machine run
>normally. Single-bit errors will be corrected automatically, with a
>note from the O/S (as reported by the hardware) when this happens.

Except on MicroVAX-II's, where ECC memory is not officially supported.
You can get ECC memory from third-party vendors, but all the products
I've seen correct any errors silently and do not report them to the O/S,
in order to stay compatible. (One of them kept an internal error log you
could read by looking at the appropriate address.)

Actually, it was a while ago when I looked at this, so it may no longer
be true.

Jeff Siegal
jack@swlabs.UUCP (Jack Bonn) (08/06/88)
In article <12849@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>The easiest test for a Vax with ECC is to let the machine run
>normally. Single-bit errors will be corrected automatically, with a
>note from the O/S (as reported by the hardware) when this happens. The
>`address' and `syndrome' bits identify exactly which chip is failing,
>although the only way to go from address+syndrome to chip is via the
>manufacturer's tables (or trace the board! ... not for me, thanks),
>which are sometimes hard to find.

On a high reliability system we designed, we had an ECC error "gleaner"
which would read and write successive memory locations in order to correct
soft single bit errors before alpha particles (or whatever) changed them
into double bit errors and made them uncorrectable.

Is there a more proper name for this? To me a "gleaner" brings to mind
someone working at the feet of the grim reaper. Not too pleasant a
thought.

-Jack
--
Jack Bonn, <> Software Labs, Ltd, Box 451, Easton CT 06612
uunet!swlabs!jack (UUCP) jack%swlabs.uucp@uunet.uu.net (INTERNET)
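
For readers who have not met ECC syndromes before, the toy example below shows how a syndrome can point at a single failing bit. It uses a 7-bit Hamming code over 4 data bits purely for illustration. Real memory controllers use wider codes with their own syndrome tables, so nothing here reflects the actual VAX encoding, and the function names are hypothetical.

```c
#include <assert.h>

/*
 * Encode 4 data bits into a 7-bit Hamming codeword.
 * Bit p of the result holds code position p (1..7); bit 0 is unused.
 * Positions 1, 2 and 4 are parity bits; 3, 5, 6 and 7 carry data.
 */
static unsigned hamming_encode(unsigned data)
{
    static const unsigned data_pos[4] = { 3, 5, 6, 7 };
    unsigned code = 0, i, p, pos;

    for (i = 0; i < 4; i++)
        if ((data >> i) & 1u)
            code |= 1u << data_pos[i];

    /* Parity bit p gives even parity over all positions with bit p set. */
    for (p = 1; p <= 4; p <<= 1) {
        unsigned parity = 0;
        for (pos = 1; pos <= 7; pos++)
            if ((pos & p) && ((code >> pos) & 1u))
                parity ^= 1u;
        if (parity)
            code |= 1u << p;
    }
    return code;
}

/*
 * Syndrome: XOR of the position numbers of all set bits.
 * 0 means no single-bit error was detected; any other value is
 * the position (1..7) of the bit that flipped.
 */
static unsigned hamming_syndrome(unsigned code)
{
    unsigned s = 0, pos;

    for (pos = 1; pos <= 7; pos++)
        if ((code >> pos) & 1u)
            s ^= pos;
    return s;
}

/* Correct a single-bit error by flipping the bit the syndrome names. */
static unsigned hamming_correct(unsigned code)
{
    unsigned s = hamming_syndrome(code);

    return s ? code ^ (1u << s) : code;
}
```

With a code like this, two flipped bits produce a nonzero syndrome that names the wrong position, which is why correcting single-bit errors before a second one accumulates in the same word matters.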
seeger@beach.cis.ufl.edu (Charles Seeger) (08/07/88)
In article <1743@swlabs.UUCP> jack@swlabs.UUCP (Jack Bonn) writes:
>..., we had an ECC error "gleaner"
>which would read and write successive memory locations in order to correct
>soft single bit errors before ...
>into double bit errors and made them uncorrectable.
>Is there a more proper name for this?
>-Jack

Generally, this is called "memory scrubbing." I think it will become more
widespread as memory chips become denser and soft errors become more
common.

Cheers
Chuck
greg@vertical.oz (Greg Bond) (08/08/88)
In article <1743@swlabs.UUCP> jack@swlabs.UUCP (Jack Bonn) writes:
.On a high reliability system we designed, we had an ECC error "gleaner"
.which would read and write successive memory locations in order to correct
.soft single bit errors before alpha particles (or whatever) changed them
.into double bit errors and made them uncorrectable.
.
.Is there a more proper name for this?
Usually called memory scrubbing. A task for the null process?
And is it usually implemented in hardware or software? Do hardware
implementations automagically write back corrected value?
--
Gregory Bond, Vertical Software, Melbourne (greg@vertical.oz)
I used to be a pessimist. Now I am a realist.
levy@ttrdc.UUCP (Daniel R. Levy) (08/09/88)
In article <1743@swlabs.UUCP>, jack@swlabs.UUCP (Jack Bonn) writes:
# On a high reliability system we designed, we had an ECC error "gleaner"
# which would read and write successive memory locations in order to correct
# soft single bit errors before alpha particles (or whatever) changed them
# into double bit errors and made them uncorrectable.
# Is there a more proper name for this? To me a "gleaner" brings to mind
# someone working at the feet of the grim reaper. Not too pleasant a
# thought.

I've heard this referred to as "scrubbing."
--
|------------Dan Levy------------| THE OPINIONS EXPRESSED HEREIN ARE MINE ONLY
| Bell Labs Area 61 (R.I.P., TTY)| AND ARE NOT TO BE IMPUTED TO AT&T.
|        Skokie, Illinois        |
|-----Path: att!ttbcad!levy-----|
mat@amdahl.uts.amdahl.com (Mike Taylor) (08/10/88)
In article <2854@ttrdc.UUCP>, levy@ttrdc.UUCP (Daniel R. Levy) writes:
> In article <1743@swlabs.UUCP>, jack@swlabs.UUCP (Jack Bonn) writes:
> # On a high reliability system we designed, we had an ECC error "gleaner"
> # which would read and write successive memory locations in order to correct
> # soft single bit errors before alpha particles (or whatever) changed them
> # into double bit errors and made them uncorrectable.
> # Is there a more proper name for this? To me a "gleaner" brings to mind
> # someone working at the feet of the grim reaper. Not too pleasant a
> # thought.

On our systems we call it storage "patrol." Has a nice ring to it?
--
Mike Taylor ...!{hplabs,amdcad,sun}!amdahl!mat
[ This may not reflect my opinion, let alone anyone else's. ]
rminnich@super.ORG (Ronald G Minnich) (08/10/88)
In article <167@vertical.oz> greg@vertical.oz (Greg Bond) writes:
>Usually called memory scrubbing. A task for the null process?
>And is it usually implemented in hardware or software? Do hardware
>implementations automagically write back corrected value?

At Burroughs it was called healing. The memory controller did it all
automatically. On a cheaper machine there is no reason (I can think of)
that you couldn't do it in software: just read the value and write it
back. Of course, if you have snooping caches or channels or other things
that might whomp memory, that has side effects, whereas on the Burroughs
machines, which did it in hardware, that was not an issue.

ron
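
The software approach described above (read each word, write it back) can be sketched in C as below. The struct and function names are hypothetical, and the sketch assumes hardware where a read returns ECC-corrected data and a write stores it with fresh check bits. It works in small chunks so that something like an idle loop could call it repeatedly. The read and the write are not atomic, so if another processor or an I/O channel updates a word in between, the write-back overwrites that update. That is the side-effect hazard raised in the post above, and this sketch does nothing to guard against it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scrubber state: the region to scrub and a cursor into it. */
struct scrubber {
    volatile unsigned long *base;  /* start of the region */
    size_t nwords;                 /* region length in words */
    size_t next;                   /* index of the next word to scrub */
};

/*
 * Scrub up to `chunk` words, wrapping around at the end of the region.
 * Each word is read and then written back unchanged; on ECC hardware
 * the read is where a soft single-bit error would get corrected.
 * Returns the number of words touched.
 */
static size_t scrub_step(struct scrubber *s, size_t chunk)
{
    size_t done = 0;

    while (done < chunk && s->nwords > 0) {
        unsigned long v = s->base[s->next];  /* read (corrected) value */
        s->base[s->next] = v;                /* write it back */
        s->next = (s->next + 1) % s->nwords;
        done++;
    }
    return done;
}
```

Calling scrub_step with a small chunk from an idle loop spreads the work out, and the cursor ensures every word in the region is eventually revisited.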