[comp.arch] VAX Memory Test

carey@m.cs.uiuc.edu (08/05/88)

I would like to get (adapt, or write) a memory test routine for
a VAX-750 or a VAX-780.

I want it to run under UNIX.  If it crashes the machine when it
encounters a hard error, that is OK (because I would normally
run it only when there was a hard error present).

It doesn't need to print out any extra messages, because the kernel
should print out an error message when it encounters a memory error.
However, I wouldn't complain if it did print out its own messages.

This should be pretty simple, but I don't really know how to go about it.
If I have to write it myself, I would appreciate any hints or advice.
Any references on writing VAX assembly code would be appreciated (especially
ones I am likely to find in our library), and any help in embedding
such a routine in a C program would be great, too.

Maybe there are some tools like this available; if so, please let me know.

carey@cs.uiuc.edu
carey@{uunet,seismo,pur-ee...}!uiucdcs

chris@mimsy.UUCP (Chris Torek) (08/05/88)

In article <3300032@m.cs.uiuc.edu> carey@m.cs.uiuc.edu writes:
>I would like to get (adapt, or write) a memory test routine for
>a VAX-750 or a VAX-780.
>
>I want it to run under UNIX, ....

Memory tests usually cannot run underneath a virtual memory O/S, since
they have no way to force the machine to allocate particular memory
regions.  To have any hope of testing all of memory (except that
part in which the test program is loaded, unless it moves itself),
the program will have to run standalone.

>This should be pretty simple, but I don't really know how to go about it.

A trivial test, like the ones in DEC's microdiagnostics, is not too
hard.  A complete test is infeasible: there are 2^(total number of bits)
patterns to test.  You can test for common faults such as address and
data line shorts, stuck bits, etc.; but no matter what you test, there
will always be *some* failure mode you will miss.
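
A minimal sketch of the kind of trivial test described above, assuming it
runs over an ordinary buffer rather than over physical memory.  It makes a
walking-ones pass for stuck or shorted data bits and an address-in-address
pass for shorted address lines.  The function names are made up for
illustration.  Under a virtual memory O/S this exercises only whatever
pages the buffer happens to occupy, which is the limitation already noted.

```c
#include <stddef.h>
#include <stdint.h>

/* Walking-ones test on a single word: write each one-bit pattern and
   its complement, and collect any bits that read back wrong.
   Returns a mask of failing bits (0 means no fault was seen). */
static uint32_t test_data_bits(volatile uint32_t *word)
{
    uint32_t bad = 0;
    for (uint32_t bit = 1; bit != 0; bit <<= 1) {
        uint32_t inv = ~bit;
        *word = bit;
        bad |= *word ^ bit;
        *word = inv;
        bad |= *word ^ inv;
    }
    return bad;
}

/* Address-in-address test: store each word's index in that word, then
   verify; repeat with the complemented index.  A shorted or stuck
   address line makes two indices alias the same cell, so one of them
   reads back the wrong value.
   Returns the index of the first mismatch, or n if none was found.
   (Indices are truncated to 32 bits, so this sketch assumes n < 2^32.) */
static size_t test_address_lines(volatile uint32_t *mem, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        mem[i] = (uint32_t)i;
    for (i = 0; i < n; i++)
        if (mem[i] != (uint32_t)i)
            return i;
    for (i = 0; i < n; i++)
        mem[i] = ~(uint32_t)i;
    for (i = 0; i < n; i++) {
        uint32_t want = ~(uint32_t)i;
        if (mem[i] != want)
            return i;
    }
    return n;
}
```

Calling test_data_bits(&buf[0]) and then test_address_lines(buf, n) on a
buffer of n words gives a zero mask and a return value of n when nothing
is wrong.  Neither pass would catch a pattern-sensitive or time-dependent
fault such as the weak-refresh case.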

(For instance, early MicroVAX IIs would sometimes crash if idle.  This
happened only when running Unix, never under VMS.  It turned out that
the refresh on some of the DRAMs was weak, and after not being accessed
for 30 seconds or so, they would begin to forget.  VMS's idle loop did
enough work that the external RAS- or CAS-only refresh hit all the rows
or columns and thus took care of it.  The 4BSD idle loop is just a few
instructions, and did not.)

>If I have to write it myself, I would appreciate any hints or advice.

The easiest test for a Vax with ECC is to let the machine run
normally.  Single-bit errors will be corrected automatically, with a
note from the O/S (as reported by the hardware) when this happens.  The
`address' and `syndrome' bits identify exactly which chip is failing,
although the only way to go from address+syndrome to chip is via the
manufacturer's tables (or trace the board! ... not for me, thanks),
which are sometimes hard to find.
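
To illustrate how a syndrome can point at a single failing bit, here is a
toy Hamming(7,4) code in C.  This is only a sketch of the principle: real
memory controllers use wider codes with extra check bits for double-error
detection, and nothing here reflects the actual VAX check-bit layout or
the manufacturer's tables.  The function names are hypothetical.

```c
#include <stdint.h>

/* Syndrome of a 7-bit codeword: the XOR of the 1-based positions of
   all set bits.  It is 0 for a valid codeword; after a single-bit
   flip it equals the position of the flipped bit. */
static unsigned syndrome(uint8_t code)
{
    unsigned s = 0;
    for (unsigned pos = 1; pos <= 7; pos++)
        if (code & (1u << (pos - 1)))
            s ^= pos;
    return s;
}

/* Encode 4 data bits.  Data bits go to positions 3, 5, 6, 7; the
   check bits at positions 1, 2, 4 are set so the syndrome is 0. */
static uint8_t encode(uint8_t data)
{
    static const unsigned data_pos[4] = {3, 5, 6, 7};
    unsigned code = 0;
    for (unsigned i = 0; i < 4; i++)
        if (data & (1u << i))
            code |= 1u << (data_pos[i] - 1);
    /* Syndrome of the data bits alone says which check bits to set. */
    unsigned s = syndrome((uint8_t)code);
    if (s & 1u) code |= 1u << 0;   /* check bit at position 1 */
    if (s & 2u) code |= 1u << 1;   /* check bit at position 2 */
    if (s & 4u) code |= 1u << 3;   /* check bit at position 4 */
    return (uint8_t)code;
}

/* Correct a single-bit error.  Stores the failing position (0 if
   none) in *failed_pos and returns the repaired codeword. */
static uint8_t correct(uint8_t code, unsigned *failed_pos)
{
    unsigned s = syndrome(code);
    *failed_pos = s;
    if (s != 0)
        code ^= (uint8_t)(1u << (s - 1));
    return code;
}
```

In a real memory array each bit position of the codeword lives in a
different chip, which is why address plus syndrome is enough to name the
failing part once the board's mapping is known.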

(Of course, if the ECC logic is broken . . . .)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

jbs@fenchurch.MIT.EDU (Jeff Siegal) (08/05/88)

In article <12849@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>The easiest test for a Vax with ECC is to let the machine run
>normally.  Single-bit errors will be corrected automatically, with a
>note from the O/S (as reported by the hardware) when this happens.  

Except on MicroVAX-II's, where ECC memory is not officially supported.
You can get ECC memory from third-party vendors, but all the products
I've seen correct any errors silently and do not report them to the
O/S in order to stay compatible (one of them kept an internal error
log you could read by looking at the appropriate address).

Actually, it was a while ago when I looked at this, so it may no
longer be true.

Jeff Siegal

jack@swlabs.UUCP (Jack Bonn) (08/06/88)

In article <12849@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>The easiest test for a Vax with ECC is to let the machine run
>normally.  Single-bit errors will be corrected automatically, with a
>note from the O/S (as reported by the hardware) when this happens.  The
>`address' and `syndrome' bits identify exactly which chip is failing,
>although the only way to go from address+syndrome to chip is via the
>manufacturer's tables (or trace the board! ... not for me, thanks),
>which are sometimes hard to find.

On a high reliability system we designed, we had an ECC error "gleaner"
which would read and write successive memory locations in order to correct
soft single bit errors before alpha particles (or whatever) changed them 
into double bit errors and made them uncorrectable.

Is there a more proper name for this?  To me a "gleaner" brings to mind
someone working at the feet of the grim reaper.  Not too pleasant a 
thought.

-Jack
-- 
Jack Bonn, <> Software Labs, Ltd, Box 451, Easton CT  06612
uunet!swlabs!jack (UUCP)	jack%swlabs.uucp@uunet.uu.net (INTERNET)

seeger@beach.cis.ufl.edu (Charles Seeger) (08/07/88)

In article <1743@swlabs.UUCP> jack@swlabs.UUCP (Jack Bonn) writes:

>..., we had an ECC error "gleaner"
>which would read and write successive memory locations in order to correct
>soft single bit errors before ...
>into double bit errors and made them uncorrectable.
>Is there a more proper name for this? 
>-Jack

Generally, this is called "memory scrubbing."  I think it will become more
widespread as memory chips become denser and soft errors become more common.

Cheers
Chuck

greg@vertical.oz (Greg Bond) (08/08/88)

In article <1743@swlabs.UUCP> jack@swlabs.UUCP (Jack Bonn) writes:
.On a high reliability system we designed, we had an ECC error "gleaner"
.which would read and write successive memory locations in order to correct
.soft single bit errors before alpha particles (or whatever) changed them 
.into double bit errors and made them uncorrectable.
.
.Is there a more proper name for this?

Usually called memory scrubbing.  A task for the null process?

And is it usually implemented in hardware or software?  Do hardware
implementations automagically write back the corrected value?
-- 
Gregory Bond,  Vertical Software, Melbourne (greg@vertical.oz)
I used to be a pessimist. Now I am a realist.

levy@ttrdc.UUCP (Daniel R. Levy) (08/09/88)

In article <1743@swlabs.UUCP>, jack@swlabs.UUCP (Jack Bonn) writes:
# On a high reliability system we designed, we had an ECC error "gleaner"
# which would read and write successive memory locations in order to correct
# soft single bit errors before alpha particles (or whatever) changed them 
# into double bit errors and made them uncorrectable.
# Is there a more proper name for this?  To me a "gleaner" brings to mind
# someone working at the feet of the grim reaper.  Not too pleasant a 
# thought.

I've heard this referred to as "scrubbing."
-- 
|------------Dan Levy------------|  THE OPINIONS EXPRESSED HEREIN ARE MINE ONLY
| Bell Labs Area 61 (R.I.P., TTY)|  AND ARE NOT TO BE IMPUTED TO AT&T.
|        Skokie, Illinois        | 
|-----Path:  att!ttbcad!levy-----|

mat@amdahl.uts.amdahl.com (Mike Taylor) (08/10/88)

In article <2854@ttrdc.UUCP>, levy@ttrdc.UUCP (Daniel R. Levy) writes:
> In article <1743@swlabs.UUCP>, jack@swlabs.UUCP (Jack Bonn) writes:
> # On a high reliability system we designed, we had an ECC error "gleaner"
> # which would read and write successive memory locations in order to correct
> # soft single bit errors before alpha particles (or whatever) changed them 
> # into double bit errors and made them uncorrectable.
> # Is there a more proper name for this?  To me a "gleaner" brings to mind
> # someone working at the feet of the grim reaper.  Not too pleasant a 
> # thought.

On our systems we call it storage "patrol." Has a nice ring to it?
-- 
Mike Taylor                               ...!{hplabs,amdcad,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

rminnich@super.ORG (Ronald G Minnich) (08/10/88)

In article <167@vertical.oz> greg@vertical.oz (Greg Bond) writes:
>Usually called memory scrubbing.  A task for the null process?
>And is it usually implemented in hardware or software?  Do hardware
>implementations automagically write back the corrected value?
At Burroughs it was called healing. The memory controller did it
all automatically. On a cheaper machine there is no reason
(I can think of) that you couldn't do it in software: just read the
value and write it back. Of course, if you have snooping caches
or channels or other things that might whomp memory, that
has side effects, whereas on the Burroughs machines, which did it in
hardware, that was not an issue.
ron
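
The "read the value and write it back" approach can be sketched in C as
below.  This is an illustration only, with hypothetical names; it assumes
that the read actually reaches ECC-protected memory (not a cache) and that
the hardware hands back corrected data on a single-bit error.  The chunked
form is meant for something like an idle loop that does a little work on
each pass.

```c
#include <stddef.h>
#include <stdint.h>

/* Scrub `chunk` words starting at index `cursor`, wrapping around at
   `nwords`.  Each word is read (letting the ECC logic correct a
   single-bit error) and written back (storing the corrected data with
   fresh check bits).  Returns the cursor for the next call.
   NOTE: the read and the write are two separate accesses, so anything
   else that writes the same word in between (DMA, another processor)
   would have its update overwritten. */
static size_t scrub_chunk(volatile uint32_t *mem, size_t nwords,
                          size_t cursor, size_t chunk)
{
    for (size_t k = 0; k < chunk && nwords != 0; k++) {
        uint32_t v = mem[cursor];   /* read: hardware corrects if needed */
        mem[cursor] = v;            /* write back the (corrected) value */
        cursor = (cursor + 1) % nwords;
    }
    return cursor;
}
```

The gap between the read and the write is the side effect mentioned
above: a software scrubber can clobber a concurrent write unless the
read-modify-write is made indivisible, which a scrubber built into the
memory controller gets for free.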