[comp.arch] Workstation Data Integrity

jackk@shasta.stanford.edu (Jack Kouloheris) (08/04/90)

I'm a bit puzzled by the lack of any type of memory error detection/
correction on many workstations and high-end PCs. These workstations
are beginning to have memories that rival or exceed those of
the previous generation of minicomputers, which almost always used
some sort of ECC protection. Do manufacturers feel that it isn't needed
any more ?

A 1Mbit DRAM chip may have a typical soft error rate of 
.001-.005 PPM/KPOH/bit. Suppose we have a workstation with
16 Megabytes of memory ( = approx 1.34 * 10^8 bits). This
yields a memory system error rate of .671 errors/KPOH, a non-negligible
number. Servers may have even more memory than this, and may
be running continually, so some errors are bound to occur. What
happens if a bit flips, and then the data is paged out or written to
a file ? The error is now permanent and can propagate. 
Why does no one worry about this ?
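
For concreteness, here is that arithmetic as a throwaway C program --
the rates are just the ones quoted above, not vendor data:

#include <stdio.h>

/* Expected soft errors for 16 MB of DRAM at the rates quoted above.
 * PPM/KPOH/bit = 1e-6 errors per 1000 power-on hours per bit. */
int main(void)
{
	double bits = 16.0 * 1024 * 1024 * 8;	/* ~1.34e8 bits */
	printf("low:  %.3f errors/KPOH\n", 0.001e-6 * bits);
	printf("high: %.3f errors/KPOH\n", 0.005e-6 * bits);
	return 0;
}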

Some SUNs have parity checking on the memory system, but what does
the OS do when a parity error occurs, since correction is not
possible ?

Jack

bobeson@saturn.ucsc.edu (Robert Ellefson) (08/04/90)

The IBM RS/6000 line has full memory ECC.  They use 40 bits per 32-bit
word: 32 data bits, 7 check bits (single-error correction, double-error
detection), and 1 unused bit.  All busses have 8-bit parity checking.

They also 'scrub' the memory, which involves periodically reading
and correcting 1-bit errors before they become uncorrectable 2-bit
errors.
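
Conceptually the scrubber is just a slow background sweep.  A sketch in
C (the RS/6000 does this in the memory controller, not in software, so
this is only the idea, not IBM's design):

#include <stddef.h>

/* Conceptual scrub loop.  Reading a word forces an ECC check and
 * single-bit correction; writing the corrected value back clears the
 * latent error before a second bit can flip in the same word. */
void scrub(volatile unsigned long *mem, size_t words)
{
	size_t i;
	for (i = 0; i < words; i++)
		mem[i] = mem[i];	/* read, correct, write back */
}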

For a good reference on this, see the "RS/6000 Technology" Book.

-Bob

darrylo@hpnmdla.HP.COM (Darryl Okahata) (08/05/90)

In comp.arch, jackk@shasta.stanford.edu (Jack Kouloheris) writes:

> I'm a bit puzzled by the lack of any type of memory error detection/
> correction on many workstations and high-end PCs. These workstations
> are beginning to have memories that rival or exceed those of
> the previous generation of minicomputers, which almost always used
> some sort of ECC protection. Do manufacturers feel that it isn't needed
> any more ?

     Just as a data point, ECC memory is available as an option on
Hewlett-Packard workstations.

     For those HP workstations with only parity-checked memory, the
system administrator can choose one of three actions upon the occurrence
of a parity error:

1. Print a "Parity error" message to the console.

2. Print a "Parity error" message to the console, plus:

	If user state, it kills the current process (which may not
	always be the process which caused the error, as with a DMA
	card) and prints an error message to the tty.

	If supervisor state, it panics with a "parity error" message to
	the console.

3. Always panics with a "parity error" message to the console.

The last one (#3 above) is the default action (with the other actions,
data corruption could occur depending on where the RAM parity error
occurred).
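
In outline the handler's decision looks something like this -- all
names here are hypothetical, not HP-UX source:

#include <stdio.h>
#include <stdlib.h>

enum policy { REPORT_ONLY, KILL_OR_PANIC, ALWAYS_PANIC };

static void panic(const char *msg) { printf("panic: %s\n", msg); exit(1); }
static void kill_current_process(void) { /* stub for the real thing */ }

void parity_nmi(enum policy p, int user_state)
{
	printf("Parity error\n");		/* every policy reports */
	if (p == REPORT_ONLY)
		return;				/* action 1 */
	if (p == KILL_OR_PANIC && user_state) {
		kill_current_process();		/* action 2, user state */
		return;				/* (maybe not the culprit) */
	}
	panic("parity error");			/* action 3, the default */
}

int main(void)
{
	parity_nmi(KILL_OR_PANIC, 1);	/* kills the victim, continues */
	parity_nmi(ALWAYS_PANIC, 0);	/* default: halt everything */
	return 0;
}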

     -- Darryl Okahata
	UUCP: {hplabs!, hpcea!, hpfcla!} hpnmd!darrylo
	Internet: darrylo%hpnmd@hp-sde.sde.hp.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion or policy of Hewlett-Packard or of the
little green men that have been following him all day.

henry@zoo.toronto.edu (Henry Spencer) (08/05/90)

In article <1990Aug3.204358.330@portia.Stanford.EDU> jackk@shasta.stanford.edu (Jack Kouloheris) writes:
>I'm a bit puzzled by the lack of any type of memory error detection/
>correction on many workstations and high-end PCs. These workstations
>are beginning to have memories that rival or exceed those of
>the previous generation of minicomputers, which almost always used
>some sort of ECC protection...

DRAM chips have improved a great deal in the last decade.  Thank heavens.
Speed pressures have also increased a lot, and ECC in particular tends to
incur speed penalties.

And it's a tempting thing to leave off when timing or board space gets tight.
After all, the thing still works...

>Some SUNs have parity checking on the memory system, but what does
>the OS do when a parity error occurs, since correction is not
>possible ?

Depends on the situation.  A parity error in a code page is harmless --
just bring in a fresh copy from disk.  A parity error in data in an
ordinary user program can be dealt with by killing that program.  You
get into difficulties only when the error hits the kernel or some vital
system daemon.  If errors are rare enough, parity is adequate.
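
A sketch of that disposition table -- hypothetical, no particular
vendor's kernel:

#include <stdio.h>

enum page_kind { TEXT_PAGE, USER_DATA, KERNEL_OR_DAEMON };

static const char *disposition(enum page_kind k)
{
	switch (k) {
	case TEXT_PAGE: return "refetch a clean copy from disk";
	case USER_DATA: return "kill the owning process";
	default:        return "panic -- no safe recovery";
	}
}

int main(void)
{
	printf("code page:   %s\n", disposition(TEXT_PAGE));
	printf("user data:   %s\n", disposition(USER_DATA));
	printf("kernel data: %s\n", disposition(KERNEL_OR_DAEMON));
	return 0;
}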

(Many people -- e.g. the imbeciles who have their kernels kill processes
at random when swap space is short -- overlook the fact that some of
the daemons are every bit as vital to proper operation as the kernel.
Fortunately they're often not all that large, and are less likely to
get hit by memory errors than elephantine user programs.)

If you want something to be concerned about, consider that while most
PCs have parity, almost all PC software ignores parity errors.
-- 
The 486 is to a modern CPU as a Jules  | Henry Spencer at U of Toronto Zoology
Verne reprint is to a modern SF novel. |  henry@zoo.toronto.edu   utzoo!henry

davec@nucleus.amd.com (Dave Christie) (08/07/90)

In article <1990Aug4.231129.1358@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <1990Aug3.204358.330@portia.Stanford.EDU> jackk@shasta.stanford.edu (Jack Kouloheris) writes:
>>I'm a bit puzzled by the lack of any type of memory error detection/
>>correction on many workstations and high-end PCs. These workstations
>>are beginning to have memories that rival or exceed those of
>>the previous generation of minicomputers, which almost always used
>>some sort of ECC protection...
>

 [some valid points about current dram quality and the temptation to not
  bother with the extra hardware deleted]

>>Some SUNs have parity checking on the memory system, but what does
>>the OS do when a parity error occurs, since correction is not
>>possible ?
>
>Depends on the situation.  A parity error in a code page is harmless --
>just bring in a fresh copy from disk.  A parity error in data in an
>ordinary user program can be dealt with by killing that program.  You

Spoken like a true sysadmin :-).

>get into difficulties only when the error hits the kernel or some vital
>system daemon.  If errors are rare enough, parity is adequate.

"Rare enough" is pretty relative - one has to consider the run time of
one's programs.  (John McCalpin was recently talking of runtimes on the
order of months!)  And since most cycles are spent running user programs 
(hopefully!) I think they deserve a little more consideration.  But the 
workstation market is pretty cutthroat and cost/performance is critical - 
fault tolerance hardware tends to push that ratio in the wrong direction
so there's some incentive to leave it out.  When comparing the current
workstations with previous systems, one has to consider that those
systems consisted of many more parts, with a lot more interconnections -
a significant cause of failure (especially unsoldered ones); today's
increased densities have improved this.  And such systems were more often
used in enterprise situations, such as maintaining critical company
records, rather than for single users.

Certain segments of the market certainly do require more fault tolerance 
than one finds in unix/workstation systems, and if such systems want to
penetrate those segments, they are going to have to learn a few lessons
from the mainframe hardware and software world.  (Gee, I can almost hear
some people who think unix on a workstation is the be-all and end-all in
computer systems gagging.)  And of course it doesn't come for free (I've
heard that the fault-tolerance aspects of the 3081/3090 were as big a
project as the rest of the system!).  The RS/6000 has been mentioned: ECC 
on memory, with an extra bit which is used as a last resort to replace a 
hard failure that can't be scrubbed.  This is what one would expect from a
company such as IBM - fault tolerance is a way of life for all mainframe/mini 
manufacturers.  And I bet the associated software is the larger part of 
the work - I wouldn't be overly surprised if it wasn't all supported yet.

But all in all, the overall error rate of workstations, relative to the
runtimes of the applications most people run, must be satisfactory;
it doesn't seem to be a big issue.  I know that's true in my environment 
(uP design) - a few problems now and then, but not enough to push me over 
the edge and demand better hardware.

---------------------------------
Dave Christie             My opinions only.
All purpose comp.arch disclaimer: It depends.

cprice@mips.COM (Charlie Price) (08/09/90)

In article <1990Aug3.204358.330@portia.Stanford.EDU> jackk@shasta.stanford.edu (Jack Kouloheris) writes:
>I'm a bit puzzled by the lack of any type of memory error detection/
>correction on many workstations and high-end PCs. These workstations
>are beginning to have memories that rival or exceed those of
>the previous generation of minicomputers, which almost always used
>some sort of ECC protection. Do manufacturers feel that it isn't needed
>any more ?
>A 1Mbit DRAM chip may have a typical soft error rate of 
>.001-.005 PPM/KPOH/bit. Suppose we have a workstation with
>16 Megabytes of memory ( = approx 1.34 * 10^8 bits). This
>yields a memory system error rate of .671 errors/KPOH, a non-negligible
>number. Servers may have even more memory than this, and may
>be running continually, so some errors are bound to occur. What
>happens if a bit flips, and then the data is paged out or written to
>a file ? The error is now permanent and can propagate. 
>Why does no one worry about this ?
>
>Some SUNs have parity checking on the memory system, but what does
>the OS do when a parity error occurs, since correction is not
>possible ?

The answer seems to be that the user community "votes"
for particular performance/reliability/cost configurations with
their money and that is what gets produced.
Successful vendors of general-purpose systems build systems
that have market-success-defined "acceptable" error rates
that sell for an "acceptable" amount of money.

MIPS, for example, produces both systems with parity and with ECC.
The "little" machines, the tower-like servers and the workstations,
use parity.  The tower-like machines have custom memory cards
and the workstations use SIMMs.
The bigger machines, the M/2000, the RC3260, and RC6280
all use ECC with 1-bit correction, 2-bit detection on
large (9U) custom boards.
The caches for all these machines are parity-protected
(and with a write-through cache, you just refetch from main
memory when you see a cache parity error).

Parity detects most memory errors, at a moderate cost
of an extra bit every now and then
(typically per byte, but it could be per word)
and a fairly simple parity tree to check/generate parity.
ECC is quite a bit more expensive than parity.
You need several extra bits per word, which makes SIMMs
less easy to use, and you need a more complicated device
to generate and check ECC.
With a fast memory system you probably have to use multiple ECC
chips (or VERY fast ECC chips) since you use multiple memory banks
to achieve high bandwidth memory.
This all adds to manufacturing cost, design cost, testing cost,
software cost...
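
(The byte-wide parity tree really is simple -- in C it is three XORs,
and the hardware is the same idea as a tree of XOR gates:)

#include <stdio.h>

/* Fold a byte down to one parity bit. */
static unsigned parity8(unsigned b)
{
	b ^= b >> 4;
	b ^= b >> 2;
	b ^= b >> 1;
	return b & 1;		/* 1 => odd number of 1-bits */
}

int main(void)
{
	printf("parity of 0x5A = %u\n", parity8(0x5A));	/* 4 bits set: 0 */
	return 0;
}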

Most PCs (including the MACs I've seen) don't have or at least
don't use parity.
They silently accept occasional wrong computations rather than
stop a computation that gets a transient memory error.
Cost seems to be extremely important for PCs.

For some uses, real workstations among them, the acceptable level
of error seems to be "occasionally" having a computation explicitly
fail (system panic or process killed) rather than silently producing
an erroneous result.
Cost in workstations seems to be important for success.
Parity is OK for this environment then (at least by demonstration).

A server, or a system that needs to support more reliable computation,
may include ECC to overcome alpha hits.

Real fault tolerance is yet another topic, and though there are
companies that do well in the market, most of us don't want
to pay for it.

-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/09/90)

In article <40694@mips.mips.COM> cprice@mips.COM (Charlie Price) writes:
| >A 1Mbit DRAM chip may have a typical soft error rate of 
| >.001-.005 PPM/KPOH/bit. Suppose we have a workstation with
| >16 Megabytes of memory ( = approx 1.34 * 10^8 bits). This
| >yields a memory system error rate of .671 errors/KPOH, a non-negligible
| >number. Servers may have even more memory than this, and may
| >be running continually, so some errors are bound to occur. What
| >happens if a bit flips, and then the data is paged out or written to
| >a file ? The error is now permanent and can propagate. 
| >Why does no one worry about this ?

  The answer is that at those error rates the chances of a two-bit error
(which would slip past parity checking) are so low that it is not worth
worrying about. Not that paranoids like myself don't validate their files
with an external 32-bit CRC program on a regular basis.
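
  The paranoid's tool is easy enough to build.  A bitwise CRC-32 in
portable C -- a sketch of the idea, not the actual program I use:

#include <stdio.h>
#include <stddef.h>

/* Bitwise CRC-32 (reflected 0xEDB88320 polynomial).  Slow, but short
 * enough to verify by eye. */
static unsigned long crc32(const unsigned char *p, size_t n)
{
	unsigned long crc = 0xFFFFFFFFUL;
	size_t i;
	int k;

	for (i = 0; i < n; i++) {
		crc ^= p[i];
		for (k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0xEDB88320UL & (0UL - (crc & 1)));
	}
	return crc ^ 0xFFFFFFFFUL;
}

int main(void)
{
	/* standard check value -- expect cbf43926 */
	printf("%08lx\n", crc32((const unsigned char *)"123456789", 9));
	return 0;
}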

| Most PCs (including the MACs I've seen) don't have or at least
| don't use parity.
| They silently accept occasional wrong computations rather than
| stop a computation that gets a transient memory error.
| Cost seems to be extremely important for PCs.

  The IBM PC, AT, and PS/2 models use per-byte parity, as do all of the
clone machines built by other vendors. This provides adequate
protection. The Mac and Amiga don't use parity (at least the older ones
don't). The term PC includes both business PCs, with minicomputer
features, and machines intended primarily for games and home use, which
are built as cheaply as possible for a customer base which doesn't
understand or care about data security, and which is highly price conscious.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me

henry@zoo.toronto.edu (Henry Spencer) (08/11/90)

In article <2399@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>| Most PCs (including the MACs I've seen) don't have or at least
>| don't use parity.
>  The IBM PC, AT, and PS/2 models use per-byte parity, as do all of the
>clone machines built by other vendors. This provides adequate
>protection... The term PC includes both business PCs, with minicomputer
>features, and machines intended primarily for games and home use, which
>are built as cheaply as possible...

But, but, but... virtually all MSDOS software *explicitly ignores*
parity errors.  A friend of mine, working for a clone builder, had
an interesting story to tell.  They were horrified to discover that
their parity circuit didn't work... after a good many of the machines
were in the field and functioning fine!  It hadn't been caught in
the factory because there is no way that software can test the IBMPC
parity system, and it hadn't been caught by the customers because all
the commercial software just ignored it.

People who think their MSDOS "business PCs" are somehow "protected"
against memory errors by the parity hardware are kidding themselves.
-- 
It is not possible to both understand  | Henry Spencer at U of Toronto Zoology
and appreciate Intel CPUs. -D.Wolfskill|  henry@zoo.toronto.edu   utzoo!henry

dhinds@portia.Stanford.EDU (David Hinds) (08/11/90)

In article <1990Aug10.171744.9639@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>>  The IBM PC, AT, and PS/2 models use per-byte parity, as do all of the
>>clone machines built by other vendors. This provides adequate
>>protection... The term PC includes both business PCs, with minicomputer
>>features, and machines intended primarily for games and home use, which
>>are built as cheaply as possible...
>
>But, but, but... virtually all MSDOS software *explicitly ignores*
>parity errors.  A friend of mine, working for a clone builder, had
>an interesting story to tell.  They were horrified to discover that
>their parity circuit didn't work... after a good many of the machines
>were in the field and functioning fine!  It hadn't been caught in
>the factory because there is no way that software can test the IBMPC
>parity system, and it hadn't been caught by the customers because all
>the commercial software just ignored it.
>
    Wait - what do you mean, the parity circuit didn't work?  That it
couldn't detect parity errors, or what?  On the IBM PC, and most clones,
I think, a parity error raises a non-maskable interrupt.  Under DOS, this
is not a recoverable error - i.e., a parity error hangs the system.  DOS
just prints some dumb message, and stops dead in its tracks.  I suppose
commercial software could patch the interrupt vector to try to recover
from the error, but no one bothers.  As far as I know, yes, there isn't a
way for software to tell if the parity system is working, but then
wouldn't that be a bit much to expect on a PC?

 -David Hinds
  dhinds@popserver.stanford.edu

dricejb@drilex.UUCP (Craig Jackson drilex1) (08/12/90)

In article <1990Aug10.171744.9639@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
|In article <2399@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
|>Someone else wrote:
|>| Most PCs (including the MACs I've seen) don't have or at least
|>| don't use parity.
|>  The IBM PC, AT, and PS/2 models use per-byte parity, as do all of the
|>clone machines built by other vendors. This provides adequate
|>protection... The term PC includes both business PCs, with minicomputer
|>features, and machines intended primarily for games and home use, which
|>are built as cheaply as possible...
|
|But, but, but... virtually all MSDOS software *explicitly ignores*
|parity errors.  A friend of mine, working for a clone builder, had
|an interesting story to tell.  They were horrified to discover that
|their parity circuit didn't work... after a good many of the machines
|were in the field and functioning fine!  It hadn't been caught in
|the factory because there is no way that software can test the IBMPC
|parity system, and it hadn't been caught by the customers because all
|the commercial software just ignored it.

While this may be a good story, I've never truly heard of software routinely
disabling the parity check, or the NMIs it reports.  Although I have
not been associated with any mainstream applications, I know that nothing
my company has delivered disables NMIs.  There's really no reason to--
the 16k, 64k, and 256k chips used in most PCs just don't have that many
errors.  Lots of people will report that they have seen a parity error
message from a PC, but only rarely.

Parity in "personal" computers was one of the innovations of IBM--their
corporate standards required it.  Up until the PC came out, hardly any
of the computers sold as "personal" computers (Apples, CP/M boxes) had
parity.  I'm not sure if even the contemporary Unix boxes (Onyxs) did.
The PCjr was the first computer IBM ever shipped without parity--I'm sure
that the angst nearly killed somebody.

|People who think their MSDOS "business PCs" are somehow "protected"
|against memory errors by the parity hardware are kidding themselves.

Admittedly, modern computer users (both businesspersons and engineers)
rarely view their hardware with the skepticism that it deserves...
They haven't lived through the era of "If you don't like the answers,
run it again.  They might change."  (CDC 6400, circa 1976)

|-- 
|It is not possible to both understand  | Henry Spencer at U of Toronto Zoology
|and appreciate Intel CPUs. -D.Wolfskill|  henry@zoo.toronto.edu   utzoo!henry

On this, I agree with Henry.  Anybody who claims to appreciate the 80x8x
line of Intel CPUs needs education, medical attention, or both.
-- 
Craig Jackson
dricejb@drilex.dri.mgh.com
{bbn,axiom,redsox,atexnet,ka3ovk}!drilex!{dricej,dricejb}

henry@zoo.toronto.edu (Henry Spencer) (08/12/90)

In article <1990Aug10.223619.6223@portia.Stanford.EDU> dhinds@portia.Stanford.EDU (David Hinds) writes:
>>...They were horrified to discover that
>>their parity circuit didn't work...
>    Wait - what do you mean, the parity circuit didn't work?  That it
>couldn't detect parity errors, or what?

He wasn't specific, but the implication was that under at least some
circumstances it falsely reported errors.

>On the IBM PC, and most clones,
>I think, a parity error raises a non-maskable interrupt.  Under DOS, this
>is not a recoverable error - i.e., a parity error hangs the system.  DOS
>just prints some dumb message, and stops dead in its tracks.  I suppose
>commercial software could patch the interrupt vector to try to recover
>from the error, but no one bothers.

What he said was that, based on their experience, *everyone* bothers --
or did at the time (this wasn't recent) -- and the "recovery" consisted 
of ignoring the error completely.  Since I avoid Intel processors :-),
I can't confirm or deny this myself.

>As far as I know, yes, there isn't a
>way for software to tell if the parity system is working, but then
>wouldn't that be a bit much to expect on a PC?

Had the Japanese designed it, you can bet it would have been testable.
(Another input to the parity encoder, controlled by software or even
a DIP switch, would suffice.)  The only way you can improve quality
is if you can measure it.
-- 
It is not possible to both understand  | Henry Spencer at U of Toronto Zoology
and appreciate Intel CPUs. -D.Wolfskill|  henry@zoo.toronto.edu   utzoo!henry

landon@Apple.COM (Landon Dyer) (08/12/90)

>Depends on the situation.  A parity error in a code page is harmless --
>just bring in a fresh copy from disk.

Assuming it's fresh.  Nearly all of the I/O systems I've seen on small
computers lack end-to-end parity or ECC.  For instance, SCSI data and commands
are subject to mangling by poor termination, bad connections, transients, and
firmware or hardware failure (e.g. an insane controller that wiggles a bus
line at random).

This (and to be fair, other real-world catastrophes including scrambled file
systems, flakey packet routers, media decay and buggy drivers) is what causes
some application writers to put checksum fields in their document formats.
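
In its simplest form the document carries its own checksum, verified
after every load.  A sketch with a hypothetical header and a
deliberately trivial checksum (a real application would use a CRC):

#include <stdio.h>
#include <stddef.h>

struct doc_header {
	unsigned long len;	/* bytes in the body */
	unsigned long sum;	/* checksum over the body */
};

static int doc_intact(const struct doc_header *h, const unsigned char *body)
{
	unsigned long s = 0;
	size_t i;

	for (i = 0; i < h->len; i++)
		s += body[i];
	return s == h->sum;	/* 0 => mangled between disk and memory */
}

int main(void)
{
	unsigned char body[4] = { 1, 2, 3, 4 };
	struct doc_header h = { 4, 10 };

	printf("intact: %d\n", doc_intact(&h, body));	/* prints 1 */
	return 0;
}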


    Q:	Let's get this straight.  The data on _disk_ is checksummed
	within an inch of its life.  The data in _memory_ is ECC'd
	and can't be harmed.  But going from disk to memory, the
	data is, ah, er ...

    A:	Let's see what the standard sez ... [flip, flip] ... "Naked in
	the breeze?"


-- 
Landon Dyer (landon@apple.com)  :::::::::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::: making the merry-go-round SPIN FASTER
Apple Computer, Inc.            :: so that everyone has to HOLD ON TIGHTER
NOT THE VIEWS OF APPLE COMPUTER :: just to keep from being THROWN TO THE WOLVES

peter@stca77.stc.oz (Peter Jeremy) (08/13/90)

In article <2399@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>  The IBM PC, AT, and PS/2 models use per-byte parity, as do all of the
>clone machines built by other vendors.
Although the hardware is there, it relies on software to do anything useful,
and the software generally isn't there.  I have also heard (rumour time)
that at least one PC has a design fault in its parity circuitry.  This
doesn't appear to have hindered that machine at all.

> The Mac and Amiga don't use parity (at least the older ones
>don't).
None of the A500, A1000, A2000 or A3000 have provision for parity for
built-in memory.  There is nothing to stop you using parity on additional
RAM (you would also need to add software to handle the errors sanely).

As an additional comment: I use a Motorola Delta 1147 clone.  It's basically
a single-board 68030 with 8MB RAM.  The RAM includes parity, but the checking
is switchable.  Apparently the parity checking is slow, so you have a choice
of stretching the memory cycles by 1 clock to get the error reported correctly,
or having the parity error reported on the following cycle.  (You can also
disable it totally).

I run it with delayed parity (which means any parity error causes a PANIC)
and haven't had any parity errors in 18 months operation.
-- 
Peter Jeremy (VK2PJ)         peter@stca77.stc.oz.AU
Alcatel STC Australia        ...!uunet!stca77.stc.oz!peter
240 Wyndham St               peter%stca77.stc.oz@uunet.UU.NET
ALEXANDRIA  NSW  2015

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/13/90)

In article <1990Aug10.171744.9639@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:

| But, but, but... virtually all MSDOS software *explicitly ignores*
| parity errors.  

  Please cite any software (at least from the top 20 best seller list)
which does this. We have 800 or so PCs here running MS-DOS, PC-DOS
and at least four flavors of UNIX, and every one of them seems to see
parity errors, although most just stop dead when they do. Given an error
rate of 4-5 cases a year in that many systems, I think that's a better
thing to do than produce wrong answers. Ever.

  I have had people tell me that the Mac had better hardware because
"they don't get those stupid parity errors," but I don't even try to
explain, I just give their names to headhunters.

| People who think their MSDOS "business PCs" are somehow "protected"
| against memory errors by the parity hardware are kidding themselves.
| -- 
| It is not possible to both understand  | Henry Spencer at U of Toronto Zoology
| and appreciate Intel CPUs. -D.Wolfskill|  henry@zoo.toronto.edu   utzoo!henry

  Note from the sig that Henry makes no pretension of being unbiased in
this. The PC uses an Intel processor, so if the hardware can't be faulted
for not having parity, the software design must be corrupted by being
run on a CPU made by Intel.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
       "This is your PC. This is your PC on OS/2. Any questions?"

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/14/90)

In article <1016@stca77.stc.oz> peter@stca77.stc.oz (Peter Jeremy) writes:

| I run it with delayed parity (which means any parity error causes a PANIC)
| and haven't had any parity errors in 18 months operation.

  I concluded some time ago that with memory as reliable as it is, and
the cost of an undetected parity error as high as it could be, while
a panic is not the *best* way to handle parity errors, it is more
acceptable than ignoring them.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
       "This is your PC. This is your PC on OS/2. Any questions?"

henry@zoo.toronto.edu (Henry Spencer) (08/15/90)

In article <2421@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>| But, but, but... virtually all MSDOS software *explicitly ignores*
>| parity errors.  
>
>  Please cite any software (at least from the top 20 best seller list)
>which does this...

As I thought I made clear, my information is secondhand.  As I probably
should have made clearer, it is also a bit old.  If the situation has
changed, I am (a) pleased, and (b) surprised. :-)
-- 
It is not possible to both understand  | Henry Spencer at U of Toronto Zoology
and appreciate Intel CPUs. -D.Wolfskill|  henry@zoo.toronto.edu   utzoo!henry

eli@aspasia.gang.umass.edu (Eli Brandt) (08/15/90)

In article <1990Aug10.171744.9639@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <2399@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>>| Most PCs (including the MACs I've seen) don't have or at least
>>| don't use parity.
>>  The IBM PC, AT, and PS/2 models use per-byte parity, as do all of the
>>clone machines built by other vendors. This provides adequate
>>protection... The term PC includes both business PCs, with minicomputer
>>features, and machines intended primarily for games and home use, which
>>are built as cheaply as possible...
>
>But, but, but... virtually all MSDOS software *explicitly ignores*
>parity errors.  A friend of mine, working for a clone builder, had
>an interesting story to tell.  They were horrified to discover that
>their parity circuit didn't work... after a good many of the machines
>were in the field and functioning fine!  It hadn't been caught in
>the factory because there is no way that software can test the IBMPC
>parity system, and it hadn't been caught by the customers because all
>the commercial software just ignored it.
>
>People who think their MSDOS "business PCs" are somehow "protected"
>against memory errors by the parity hardware are kidding themselves.
>-- 
>It is not possible to both understand  | Henry Spencer at U of Toronto Zoology
>and appreciate Intel CPUs. -D.Wolfskill|  henry@zoo.toronto.edu   utzoo!henry

Almost all PC hardware that I know of detects parity errors and handles them -
well, "handles" by crashing with a "Parity error" message.  Better than a
corrupted filesystem.  The one exception that I know of is that some laptops
leave off parity checking to save *weight*, of all things.  How much can 11% of
your DRAM weigh?  It's possible that some fly-by-night cloners leave off 
parity checking, but I've never heard of any machines that do this.

I can personally testify that PS/2's, at least, know about parity errors.  I
was playing around with the DRAM refresh rate and managed to get parity errors
quite definitively.  A parity error triggers an NMI which calls, I think, an
INT 1.  Not at all sure about that.  I don't know of any commercial software
that turns off parity checking (by trapping the interrupt, presumably).  Can
you name any?

davecb@yunexus.YorkU.CA (David Collier-Brown) (08/15/90)

henry@zoo.toronto.edu (Henry Spencer) writes:
>>| But, but, but... virtually all MSDOS software *explicitly ignores*
>>| parity errors.  

>In article <2421@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>>  Please cite any software (at least from the top 20 best seller list)
>>which does this...

  How about Lotus 1-2-3?  The **newest** version may or may not do so, if
it is not required to run on 8088-series processors, but older versions
certainly did.
  One of its competitors, that I worked on, had to block the parity error
logic in order to run reliably.  It would have been unacceptable to merely
crash every time we had a parity error since our competitors didn't!
  We also usually got them hourly on our XTs (named, if memory serves,
Thud and Nermal).  Maybe on some other machines, too...  I don't remember
the ATs having the problem.

  Strangely enough, this blocking seemed to have no effect on the program: a
recalculation after a parity error yielded the same answers as before.
Subsequently it was suggested that the PC and XT chip addressing logic was
having addressing errors, which were reported erroneously as parity errors.
So maybe we never had parity errors at all (:-)).

--dave
-- 
David Collier-Brown,  | davecb@Nexus.YorkU.CA, ...!yunexus!davecb or
72 Abitibi Ave.,      | {toronto area...}lethe!dave 
Willowdale, Ontario,  | "And the next 8 man-months came up like
CANADA. 416-223-8968  |   thunder across the bay" --david kipling

seanf@sco.COM (Sean Fagan) (08/19/90)

In article <2421@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>  I have had people tell me that the Mac had better hardware because
>"they don't get those stupid parity errors," but I don't even try to
>explain, I just give their names to headhunters.

Wonderful stuff, PC parity.

The two times I've gotten parity errors (that didn't clear up on a reboot),
the chip I've had to replace was the parity chip.

Yep.  I'm so glad it's there.

-- 
Sean Eric Fagan  | "let's face it, finding yourself dead is one 
seanf@sco.COM    |   of life's more difficult moments."
uunet!sco!seanf  |   -- Mark Leeper, reviewing _Ghost_
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/20/90)

In article <1990Aug18.210132.25203@sco.COM> seanf@sco.COM (Sean Fagan) writes:

| The two times I've gotten parity errors (that didn't clear up on a reboot),
| the chip I've had to replace was the parity chip.

  Yes, and I got called for jury duty for the tenth time last month.
Isn't it wonderful how unlikely bad things happen so much more often
than unlikely good things? If I could get stuck in an elevator with
a beautiful woman as often as I get behind someone in the "cash only"
checkout who is trying to pay with an out-of-state third-party check, I'd
be content.

  In spite of all that, I'd rather have parity checking, because I have
had real genuine errors in the data memory, and I want to know about it
when it happens.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
       "This is your PC. This is your PC on OS/2. Any questions?"

hankd@dynamo.ecn.purdue.edu (Hank Dietz) (08/20/90)

In article <14623@drilex.UUCP> dricejb@drilex.UUCP (Craig Jackson drilex1) writes:
>In article <1990Aug10.171744.9639@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>|But, but, but... virtually all MSDOS software *explicitly ignores*
>|parity errors.  A friend of mine, working for a clone builder, had
>|an interesting story to tell.  They were horrified to discover that
>|their parity circuit didn't work... after a good many of the machines
>|were in the field and functioning fine!
...
>While this may be a good story, I've never truly heard of software routinely
>disabling the parity check, or the NMIs it reports.

Ignoring interrupts used to be the norm back when polled I/O was
common.  Many micros ran with interrupts disabled because they could
interfere with the activities of some "dumb" floppy disk controllers,
etc., which depended on timing of memory accesses (e.g., the old
NorthStar floppy disk controller and their BCD floating point board).

...
>Parity in "personal" computers was one of the innovations of IBM--their
>corporate standards required it.  Up until the PC came out, hardly any
>of the computers sold as "personal" computers (Apples, CP/M boxes) had
>parity.  I'm not sure if even the contemporary Unix boxes (Onyxs) did.
>The PCjr was the first computer IBM ever shipped without parity--I'm sure
>that the angst nearly killed somebody.

Not so.  Lots of CP/M machines had memory boards with byte parity long
before the IBM PC.  Note that I'm not saying people used it -- in
fact, I vaguely recall at least one board which had sockets for parity
RAM, but standardly came with that portion of the board unpopulated.
Of course, one could argue that before the IBM PC, "hardly any"
microprocessor-based computers of any kind were sold.  ;-)

BTW, none of the old machines I've played with has ever had a parity error
(i.e., bad RAM chip), although I've seen a fair number in newer
machines.  Remember the days when companies used to actually test
machines *BEFORE* shipping them...?  ;-)

						-hankd@ecn.purdue.edu

md89mch@cc.brunel.ac.uk (Martin Howe) (08/20/90)

In article <14623@drilex.UUCP> dricejb@drilex.UUCP (Craig Jackson drilex1) writes:
>On this, I agree with Henry.  Anybody who claims to appreciate the 80x8x
>line of Intel CPUs needs education, medical attention, or both.
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

OK, I'll take them, as long as YOU're paying !

(Never seen my .sig before ? (the cat speech))
-- 
  -   /|  . . JCXZ ! MOVSB ! SGDT ! iAPX ! | "Good morning Citizens. I would
  \`O.O' .    Martin Howe, Microelectronics|  remind you that Armed Robbery
  ={___}=     System Design MSc, Brunel U. |  is illegal in Megacity One." - JD
   ` U '      Any unattributed opinions are mine -- Brunel U. can't afford them.

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/22/90)

In article <1990Aug20.151438.27121@ecn.purdue.edu> hankd@dynamo.ecn.purdue.edu (Hank Dietz) writes:

| BTW, none of old machines I've played with has ever had a parity error
| (i.e., bad RAM chip), although I've seen a fair number in newer
| machines.  Remember the days when companies used to actually test
| machines *BEFORE* shipping them...?  ;-)

  You are either really lucky or had top quality machines not available
to us mortals. We used to run memory test as the idle daemon in S100
systems, and right after boot and fully warm with our Intellec systems.
I wrote my own memory test for the Z80, to force the M1 fetch into every
byte, so I could try the worst case timing.

  I haven't had a parity error in a memory chip with more than four
hours burn-in on any of my systems in quite a while, but I still have a
ziplock full of 1702's from the "old days." Note I don't call them good.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (08/24/90)

In article <1990Aug20.151438.27121@ecn.purdue.edu> 
	hankd@dynamo.ecn.purdue.edu (Hank Dietz) writes:
>Ignoring interrupts used to be the norm back when polled I/O was
>common.  Many micros ran with interrupts disabled because they could
>interfere with the activities of some "dumb" floppy disk controllers,
>etc., which depended on timing of memory accesses (e.g., the old
>NorthStar floppy disk controller and their BCD floating point board).

When a feature is unused, it often doesn't actually work.  At one
time, Unix was the toughest diagnostic for PDP-11 MMU's...

When my company built one of the first IBM PC clones, we had
mysterious software crashes.  It turned out that no one else was
sending non-maskable interrupts (NMIs) to their 8088s.  So, we got
to be the people who noticed the NMI hardware bug.  Recall that the
8088 has prefix instructions, which change the addressing of the
following instruction.  An NMI could be honored between the two, but
the interrupt return would "forget" the prefixing.

While we're on the subject, RISC machines with branch delay slots
have a similar problem.  Of course, the easy instruction decoding
means that they can push some of the work into the interrupt
handlers.  Does anyone want to describe how their favorite machine
did this?
-- 
Don		D.C.Lindsay

cprice@mips.COM (Charlie Price) (08/24/90)

In article <10307@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>...  So, we got
>to be the people who noticed the NMI hardware bug.  Recall that the
>8088 has prefix instructions, which change the addressing of the
>following instruction.  An NMI could be honored between the two, but
>the interrupt return would "forget" the prefixing.
>
>While we're on the subject, RISC machines with branch delay slots
>have a similar problem.  Of course, the easy instruction decoding
>means that they can push some of the work into the interrupt
>handlers.  Does anyone want to describe how their favorite machine
>did this?

Here is an answer for the MIPS R2000, R3000, and R6000.

An exception causes a trap to kernel code and loads a couple registers:
EPC - Exception Program Counter - the address at which execution
      should resume.
Cause - various bits of information about the cause of the exception.
Normally, the EPC points at the instruction that caused the exception
or, in the case of interrupts, that was about to be fetched.
If an exception occurs during execution of the instruction
in a branch delay slot or "between" a branch and the
instruction in the branch-delay slot,
the Cause register has the Branch Delay (BD) bit set and the
EPC register contains the address of the branch instruction.

For interrupts, you don't generally care about this and no
special processing is required.
Only if you have to examine the instruction that caused the fault
do you have to decide whether to look at the instruction pointed
at by the EPC or the next instruction.
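
In code, locating the actual faulting instruction is one test (BD is
bit 31 of Cause on these parts -- from memory, so check the manual):

#include <stdio.h>

/* Locate the faulting instruction from EPC and the Cause BD bit.
 * Instructions are 4 bytes. */
static unsigned long fault_pc(unsigned long epc, unsigned long cause)
{
	if (cause & 0x80000000UL)	/* BD: EPC points at the branch */
		return epc + 4;		/* victim is in the delay slot */
	return epc;
}

int main(void)
{
	printf("%lx\n", fault_pc(0x1000UL, 0x80000000UL));	/* 1004 */
	return 0;
}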

This isn't *quite* the whole story.
TLB misses, for instance, are special and have other hardware support
so you don't have to look at the instruction to figure out
the address that missed in the TLB.

If the kernel has to emulate the instruction in the branch delay slot
(wierd FP stuff for instance) then it can't re-execute the
branch instruction and will need to emulate it as well.
This is a rare case, so the performance is not a problem.
-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650

jkenton@pinocchio.encore.com (Jeff Kenton) (08/24/90)

From article <10307@pt.cs.cmu.edu>, by lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay):
> 
>                         An NMI could be honored between the two, but
> the interrupt return would "forget" the prefixing.
> 
> While we're on the subject, RISC machines with branch delay slots
> have a similar problem.  Of course, the easy instruction decoding
> means that they can push some of the work into the interrupt
> handlers.  Does anyone want to describe how their favorite machine
> did this?

On the 88000 (my current favorite machine) the instruction pipeline has
three stages -- XIP, NIP, FIP.  These tell you where you've been, where
you are and where you're going.  Restoring the proper values gets you
back exactly where you are supposed to be.  No real trouble.

The only problem to beware of is single stepping or other debugging, where
you may be looking at a program interrupted at a delay slot.  In this case
"go (or step) from the pc" can be ambiguous.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      jeff kenton  ---	temporarily at jkenton@pinocchio.encore.com	 
		   ---  always at (617) 894-4508  ---
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/25/90)

In article <10307@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:

|                                                     Recall that the
| 8088 has prefix instructions, which change the addressing of the
| following instruction.  An NMI could be honored between the two, but
| the interrupt return would "forget" the prefixing.

  It could have been worse, it could have remembered it. Then when
servicing the interrupt the normal fetch from the CS:PC would have
fetched an instruction byte from somewhere else.

  I would think disallowing interrupts after a prefix is the best way to
solve it, rather than trying to hack things to save the state after the
chip was designed.

  Was this problem only with NMI? I've run a lot of interrupts into an
original XT and never seen a problem.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

colin@array.UUCP (Colin Plumb) (08/25/90)

In article <10307@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
> While we're on the subject, RISC machines with branch delay slots
> have a similar problem.  Of course, the easy instruction decoding
> means that they can push some of the work into the interrupt
> handlers.  Does anyone want to describe how their favorite machine
> did this?

The Am29000 just has two program counters, current and next.  (It
provides a previous as well, but doesn't need it.)  Most faults leave
the processor ready to continue with the next instruction; if you want
to retry instead of emulating (e.g. data TLB miss), you have to back up
the PC's a cycle.

Exceptions: instruction-fetch errors (TLB miss, protection violation, etc.),
illegal opcode (as opposed to software traps) and protection violation
(a supervisor-only instruction).   These retry the current instruction.
I don't think there's any deep reason, it was just easier to do that way
because it's detected during decode.  To skip one of these instructions,
you just point the current PC at a NOP somewhere and leave the next PC
alone.
-- 
	-Colin

tim@proton.amd.com (Tim Olson) (08/25/90)

| In article <10307@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
| >...  So, we got
| >to be the people who noticed the NMI hardware bug.  Recall that the
| >8088 has prefix instructions, which change the addressing of the
| >following instruction.  An NMI could be honored between the two, but
| >the interrupt return would "forget" the prefixing.
| >
| >While we're on the subject, RISC machines with branch delay slots
| >have a similar problem.  Of course, the easy instruction decoding
| >means that they can push some of the work into the interrupt
| >handlers.  Does anyone want to describe how their favorite machine
| >did this?

The Am29000 simply uses 2 PC buffer registers to hold the return
address(es).  Normally these are sequential, but if an interrupt or
trap occurs between a branch and its delay slot, then PC1 points to
the delay instruction and PC0 points to the branch target.  The
interrupt-return (IRET) instruction uses both these addresses (if
required) to restart the instruction stream correctly.

In article <41066@mips.mips.COM> cprice@mips.COM (Charlie Price) writes:
| Here is an answer for the MIPS R2000, R3000, and R6000.
| 
| An exception causes a trap to kernel code and loads a couple registers:
| EPC - Exception Program Counter - the address at which execution
|       should resume.
| Cause - various bits of information about the cause of the exception.
| Normally, the EPC points at the instruction that caused the exception
| or, in the case of interrupts, that was about to be fetched.
| If an exception occurs during execution of the instruction
| in a branch delay slot or "between" a branch and the
| instruction in the branch-delay slot,
| the Cause register has the Branch Delay (BD) bit set and the
| EPC register contains the address of the branch instruction.

Just curious -- what happens in the perverse case that someone tries a
conditional branch-and-link instruction using the link register as a
conditional source, i.e.:

	<r31 contains -1>
	bltzal	r31, label
				<- interrupt here
	<delay operation>

Does the link portion of the branch-and-link take place anyway,
destroying the conditional information and preventing the branch from
being restartable correctly?
	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

rwallace@vax1.tcd.ie (08/26/90)

In article <2434@crdos1.crd.ge.COM>, davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:
>   In spite of all that, I'd rather have parity checking, because I have
> had real genuine errors in the data memory, and I want to know about it
> when it happens.

If the operating system just told you about it when there was a parity error
I'd agree with you, something like flashing up a message on the screen:
"Parity error detected in code segment at 1234:5678, reboot? (Y/N) ".
However automatically crashing the computer is NOT acceptable behaviour: I'd
much rather do without the parity checking. Consider: suppose a parity error
occurs on a 640K machine. The error is probably in either an unused area of
memory (e.g. at the DOS prompt) or in a section of code that isn't going to be
executed on this session with the program (e.g. error handling code). (I'm
talking only about transient errors here: the boot time memory check will get
other kinds). So ignoring the parity error will probably have no effect. If
it's in a section of code that will be executed, the machine will just crash
which is what would have happened anyway. OK take the very unlikely case that
it is in your data. For me that means in the source for a program I'm writing.
This is no problem, I can just fix the one trashed character when the compiler
barfs on the code. Much better than having the machine crash and lose several
minutes' work. Or say the error is in a floating-point number in a spreadsheet.
Chances are the program will crash with a floating-point error or at least
produce obviously wrong results e.g. profit for 1989 was $-32198742.88888.

The point is that ignoring a parity error is a pretty safe thing to do; there's
very little chance of getting a misleading answer. Much better than crashing
the computer, which is guaranteed to lose you whatever you had in memory.
(Suppose you have a parity error while running a Speed Disk program: kiss your
hard disk goodbye. Let's see, when did I do my last full backup?). So the PC
parity protection is worse than useless.

"To summarize the summary of the summary: people are a problem"
Russell Wallace, Trinity College, Dublin
rwallace@vax1.tcd.ie

jfc@athena.mit.edu (John F Carr) (08/27/90)

In article <10307@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:

>While we're on the subject, RISC machines with branch delay slots
>have a similar problem.  Of course, the easy instruction decoding
>means that they can push some of the work into the interrupt
>handlers.  Does anyone want to describe how their favorite machine
>did this?

The IBM RT doesn't allow interrupts between a branch-and-execute and the
following instruction.  This seems to me the best solution (doesn't require
any special logic in the interrupt handling software or in the hardware to
restart after an interrupt).

The new IBM RISC machine avoids problems with branch delay and interrupts
by not having delay slots.  Instead, instruction prefetch follows branches.
I think the 68040 also does this.


--
    --John Carr (jfc@athena.mit.edu)

don@zl2tnm.gp.govt.nz (Don Stokes) (08/28/90)

rwallace@vax1.tcd.ie writes:

> If the operating system just told you about it when there was a parity error
> I'd agree with you, something like flashing up a message on the screen:
> "Parity error detected in code segment at 1234:5678, reboot? (Y/N) ".
> However automatically crashing the computer is NOT acceptable behaviour: I'd
> much rather do without the parity checking. Consider: suppose a parity error
> occurs on a 640K machine. The error is probably in either an unused area of
> memory (e.g. at the DOS prompt) or in a section of code that isn't going to be
> executed on this session with the program (e.g. error handling code).

Since when was PC memory not used to the last bit?  Every PC I have ever
had to deal with suffered severe "ram cram" (why, oh why do people
insist on treating '386s as 8086s, with their measly memory maps?).
A memory mix of 100K of system software, 200K of application and 300K of
data isn't unreasonable or atypical.

>                                        OK take the very unlikely case that
> it is in your data.  

Unlikely?  In the mix I give, I make it about 50%.  "Unlikely" isn't how
I'd describe it.

> For me that means in the source for a program I'm writing

Ah.  Now we reveal true colours.  I hate to bring the real world crashing
about your ears, but the *vast* majority of PC users are *not*
programmers.  You take an extremely self-centered view.

> This is no problem, I can just fix the one trashed character when the compiler
> barfs on the code.

Even as a programmer, do you not fear unexpected bugs creeping into
your code due to undetected errors?  I recall back in the dim dark
days of my Apple ][ era (no flames please, I was receiving good money for
this), I ran into a problem where well tested code simply stopped
working properly; it didn't give errors, just wrong answers (not nice
when the incorrect answers are cheque totals to go into a general
ledger).  It turned out that there was a bug in the Apple-supplied
BASIC line renumbering program, which would result in constants
occasionally being "renumbered" as well as GOTO/GOSUB targets.  It cost
me over half a day to find the problem, fix it and fully verify that the
problem had indeed been fixed.  It would have been Real Nice if it had
happened just before implementation, wouldn't it?

Compilers often do not barf on single bit errors.  Single bit errors are
the difference between a '+' and '*', '.' and '/', 0 and 1, 'x' and 'y'
(don't try to tell me you don't use single letter temporary variables!).
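
Check them yourself:

#include <stdio.h>

/* Each pair differs in exactly one bit of its ASCII code. */
int main(void)
{
	const char p[][2] = { {'+','*'}, {'.','/'}, {'0','1'}, {'x','y'} };
	int i;

	for (i = 0; i < 4; i++)
		printf("'%c' ^ '%c' = 0x%02x\n",
		       p[i][0], p[i][1], p[i][0] ^ p[i][1]);
	return 0;
}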

The PC's I have to deal with are used for typesetting work as well as
more "normal" applications such as running Lotus, word processing etc
(although most users use the spreadsheet available on the VAX).

> minutes' work. Or say the error is in a floating-point number in a spreadsheet.
> Chances are the program will crash with a floating-point error or at least
> produce obviously wrong results e.g. profit for 1989 was $-32198742.88888.

...or plus or minus $65,536, which in a ~$200,000 bottom line, is easily
missed, but could represent the difference between a profit and a loss;
not to mention the poor bean-counter's job when the mistake is
discovered.

> (Suppose you have a parity error while running a Speed Disk program: kiss your
> hard disk goodbye. Let's see, when did I do my last full backup?). So the PC
> parity protection is worse than useless.

Implementation detail: Norton's Speed Disk writes the directory entry
only after moving files; until the file pointers are finally changed,
they still point to the old, valid copies of data.  The machine crashes
-- so what?  Nothing has been lost.  Now imagine an undetected single
bit error while moving the BACKUP program, that causes backups to be
soundlessly trashed.  Now imagine the inevitable hard disk crash.  Now
imagine the trashed backups being your business records; your livelihood.
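
The ordering that makes Speed Disk safe, in miniature (an illustration
of the technique, not Norton's actual code):

#include <stdio.h>
#include <string.h>

struct dirent {
	char where[32];		/* names the current valid copy */
};

static void safe_move(struct dirent *d, const char *new_location)
{
	/* step 1 (not shown): copy the data to new_location and verify.
	 * A crash anywhere before step 2 -- parity panic included --
	 * leaves d->where pointing at the old, intact copy. */

	strcpy(d->where, new_location);	/* step 2: the commit point */
}

int main(void)
{
	struct dirent d = { "cluster-17" };

	safe_move(&d, "cluster-99");
	printf("directory now points at %s\n", d.where);
	return 0;
}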

Don Stokes, ZL2TNM  /  /                            Home: don@zl2tnm.gp.govt.nz
Systems Programmer /GP/ Government Printing Office  Work:        don@gp.govt.nz
__________________/  /__Wellington, New Zealand_____or:_PSI%(5301)47000028::DON

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/29/90)

In article <6797.26d6edce@vax1.tcd.ie> rwallace@vax1.tcd.ie writes:

|                                             Consider: suppose a parity error
| occurs on a 640K machine. The error is probably in either an unused area of
| memory (e.g. at the DOS prompt) or in a section of code that isn't going to be
| executed on this session with the program (e.g. error handling code). (I'm
| talking only about transient errors here: the boot time memory check will get
| other kinds). So ignoring the parity error will probably have no effect.

  You may be the only DOS user on the planet who has any bits unused.

|                                           OK take the very unlikely case that
| it is in your data. For me that means in the source for a program I'm writing.
| This is no problem, I can just fix the one trashed character when the compiler
| barfs on the code. 

  One bit changes 0 to 1, + to *, etc. Your compiler catches this?

|                    Much better than having the machine crash and lose several
| minutes' work. Or say the error is in a floating-point number in a spreadsheet.
| Chances are the program will crash with a floating-point error or at least
| produce obviously wrong results e.g. profit for 1989 was $-32198742.88888.
| 
| The point is that ignoring a parity error is a pretty safe thing to do; there's
| very little chance of getting a misleading answer.

  A little thought should show you the error in that reasoning. Over half
the word is dedicated to the less significant bits of the mantissa. An error
in any of those bits will result in a change of a percent or less.
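
Try it:

#include <stdio.h>
#include <string.h>

/* Flip the lowest mantissa bit of an IEEE double and look at the
 * damage.  Byte 0 holds low mantissa bits on little-endian machines;
 * flip b[7] instead on big-endian ones. */
int main(void)
{
	double x = 12345.678, y;
	unsigned char b[sizeof x];

	memcpy(b, &x, sizeof x);
	b[0] ^= 1;			/* the one-bit "parity error" */
	memcpy(&y, b, sizeof y);
	printf("%.17g\n%.17g\n", x, y);	/* differ only far out */
	return 0;
}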

|                                                    Much better than crashing
| the computer, which is guaranteed to lose you whatever you had in memory.

  Exactly. No answer is better than a wrong answer. What would anyone
bother to run on a computer if the output is so valueless that they don't
care whether the answer is right, just that they get one? If that's
the case, why not make one up?

| (Suppose you have a parity error while running a Speed Disk program: kiss your
| hard disk goodbye. Let's see, when did I do my last full backup?). So the PC
| parity protection is worse than useless.

  In the first place, anyone who runs a program like that without
running a backup first is really careless with their data. To quote my
old sig "stupidity, like virtue, is its own reward." And if you have an
error and *don't* catch it, you can blindly read in all the data on the
disk, corrupt it, and rewrite it wrong. Is this better? A good disk
packer will only lose a portion of the data on a crash, and it will be
gone, not corrupted. A parity error will corrupt the data.

  I have worked with systems which didn't have parity, and I hated it. I
don't have anything on my system I don't need, so I care about all of it.
While my data doesn't represent a threat to someone's life if it's
wrong, it could be a threat to my income. That's important enough for
me.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

antony@george.lbl.gov (Antony A. Courtney) (08/29/90)

In article <6797.26d6edce@vax1.tcd.ie> rwallace@vax1.tcd.ie writes:
>In article <2434@crdos1.crd.ge.COM>, davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:
>
>However automatically crashing the computer is NOT acceptable behaviour: I'd
>much rather do without the parity checking. Consider: suppose a parity error
>occurs on a 640K machine. The error is probably in either an unused area of
>memory (e.g. at the DOS prompt) or in a section of code that isn't going to be
>executed on this session with the program (e.g. error handling code).
> ...
>OK take the very unlikely case that
>it is in your data. For me that means in the source for a program I'm writing.
>This is no problem, I can just fix the one trashed character when the compiler
>barfs on the code. Much better than having the machine crash and lose several
>minutes' work.

Well, since you seem to be using the example of a compiler environment, I
notice you like to claim that the error will be "easy to catch" regardless of
where it occurred.

People have already pointed out loads of examples where this can cause
problems, but seem to have skipped the obvious one:

Stray pointer problems.  Suppose it DOES corrupt the data your compiler
is working on.  It could change the address of some pointer to
god-knows-where.  Worse yet, it might only change one of the low-order
bits, so that it only causes rare, spurious problems for you.  (i.e. it
still writes into data space, just into the WRONG chunk of data space.)
Worse YET, suppose the error occurs not in the execution or test phase,
but in the actual COMPILE phase, so that the address of some static
pointer gets corrupted BEFORE the new binary gets written out to disk.
That would be loads of fun to try and track down! :-)

Stray pointer problems are hard enough to track when __I__ am at fault, I
don't want to even THINK what it would be like to try and find them when the
hardware is flakey and doesn't tell me about it! :-)

Obviously written by someone who hasn't had to do enough debugging....

>The point is that ignoring a parity error is a pretty safe thing to do; there's
>very little chance of getting a misleading answer.

#include <stdio.h>

int main(void)
{
	static int g[4] = {1, 2, 3, 4};
	/* legal source -- the binary itself was corrupted at compile time */
	printf("%d %d %d %d\n", g[0], g[1], g[2], g[3]);
	return 0;
}

output:

segmentation fault (core dumped)

I dunno about you, but I wouldn't exactly call this easy to catch...

>Russell Wallace, Trinity College, Dublin
>rwallace@vax1.tcd.ie

		~antony

--
*******************************************************************************
Antony A. Courtney                                        antony@george.lbl.gov
Advanced Development Group                           ucbvax!csam.lbl.gov!antony
Lawrence Berkeley Laboratory                                     (415) 486-6692

cprice@mips.COM (Charlie Price) (08/29/90)

In article <6797.26d6edce@vax1.tcd.ie> rwallace@vax1.tcd.ie writes:
>
>If the operating system just told you about it when there was a parity error
 ...
>The error is probably in either an unused area of
>memory (e.g. at the DOS prompt) or in a section of code that isn't going to be
>executed on this session with the program (e.g. error handling code). (I'm
>talking only about transient errors here: the boot time memory check will get
>other kinds). So ignoring the parity error will probably have no effect. If
>it's in a section of code that will be executed, the machine will just crash
>which is what would have happened anyway.
 [some stuff deleted]
>The point is that ignoring a parity error is a pretty safe thing to do; there's
>very little chance of getting a misleading answer.

I sense some confusion.
A boot-time memory check might detect a permanent error,
and this seems to be what you are talking about,
but this isn't what parity is mostly for.

Mostly, parity is to detect transient errors caused by
alpha particles (or some such).
The memory chip doesn't have a permanent problem,
it just forgot the value of a bit.
Parity is computed and written for every store operation.
For every data item fetched (data or instruction) the parity of
the data bits is computed and compared to the parity bit(s) that
were also fetched from memory.  If the computed and stored bits
are not the same, you have a parity error.
A (parity) error in a memory location is only detected at the
time it is fetched -- that is, at the moment the data is about to be
used -- so your argument about it being unimportant is basically invalid.
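
As a minimal sketch of the mechanism Charlie describes -- one check bit
written with every store, recomputed and compared on every fetch -- the
logic amounts to something like this (the byte width and the names are
illustrative only; real machines do this in the memory controller):

#include <stdio.h>

/* fold a byte down to a single even-parity check bit */
static unsigned even_parity(unsigned char b)
{
	unsigned p = 0;

	while (b) {
		p ^= b & 1;
		b >>= 1;
	}
	return p;
}

int main(void)
{
	unsigned char cell = 0x5A;              /* the stored byte          */
	unsigned check = even_parity(cell);     /* written along with it    */

	cell ^= 0x10;                           /* a soft error flips a bit */

	if (even_parity(cell) != check)         /* recomputed on fetch      */
		printf("parity error trapped on fetch\n");
	return 0;
}

Note that the comparison happens only on the fetch path, which is why an
error can sit unnoticed for as long as the location goes unread.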
-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650

powell@crg5.UUCP (Powell Quiring) (08/29/90)

In article <56qmo1w162w@zl2tnm.gp.govt.nz> don@zl2tnm.gp.govt.nz (Don Stokes) writes:
>rwallace@vax1.tcd.ie writes:
>
>> If the operating system just told you about it when there was a parity error
>> I'd agree with you, something like flashing up a message on the screen:
>> "Parity error detected in code segment at 1234:5678, reboot? (Y/N) ".
>> However automatically crashing the computer is NOT acceptable behaviour: I'd
>> much rather do without the parity checking. Consider: suppose a parity error
>> occurs on a 640K machine. The error is probably in either an unused area of
>> memory (e.g. at the DOS prompt) or in a section of code that isn't going to be
>> executed on this session with the program (e.g. error handling code).

The fact that you got a parity error indicates that the memory has
been read.  The only question is how much this incorrect value
is going to screw you up.

peter@ficc.ferranti.com (Peter da Silva) (08/29/90)

In article <2469@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
> Why would anyone
> bother to run anything on a computer so valueless that they don't care
> if they get a right answer, just so long as they get an answer?

Videogames.
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
peter@ferranti.com

hrich@emdeng.Dayton.NCR.COM (George.H.Harry.Rich) (08/29/90)

rwallace@vax1.tcd.ie writes:

> hard disk goodbye. Let's see, when did I do my last full backup?). So the PC
> parity protection is worse than useless.

I've had three hard disks trashed by hardware problems.  Every one of them
went, not because systems  failed to update the directory and allocation map, 
but because they updated the directory and allocation map from trashed memory 
and the trashing was undetected.  Lest I create paranoia, two of the systems 
were prototypes, and one was an ancient, abused, non-standard, and  flaky 
PDP-11/05.

Given a properly organized disk caching system, loss of data already on
disk due to system halts is a very low probability occurrence.  If it
happens often, you should get a different caching system.  (Sales pitch
deleted).

My own experience is that a parity error on a good workstation is a once in a
blue moon occurrence. (Sales pitch deleted).  In my environment, where I'm
doing a lot of changing and testing, stoppages due to software bugs or
incompatibilities are much more frequent.

My suggestion is that you don't commit more than
half an hour's work to RAM or more than a day's work to hard disk.  A day's
contingency in a software schedule will take care of this.  If you don't
have a day's contingency, you should be making selective backups more
often; you can do them during think time if you organize them properly.

Regards,

	Harry Rich

Disclaimer:  The ideas expressed here are mine and not necessarily those
of my employer.

tif@doorstop.austin.ibm.com (Paul Chamberlain/32767) (08/29/90)

In article <2469@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>No answer is better than a wrong answer. Why would anyone
>bother to run anything on a computer so valueless that they don't care
>if they get a right answer, just so long as they get an answer?

I'm sorry, but I have to go into reality mode here.  I can understand
if you were running a simulation on the space shuttle you'd rather
get no answer than a wrong answer.  But let's say you were doing something
more typical, like ... oh ... replying to a long article in news.  You've
been typing and researching for an hour now.  I ask you this: would you
rather I just blow away that entire article and crash your machine or change
a single random character?

Paul Chamberlain | I do NOT represent IBM         tif@doorstop, sc30661@ausvm6
512/838-7008     | ...!cs.utexas.edu!ibmaus!auschs!doorstop.austin.ibm.com!tif

rsc@merit.edu (Richard Conto) (08/29/90)

In article <3294@awdprime.UUCP> tif@doorstop.austin.ibm.com (Paul Chamberlain/32767) writes:
>I'm sorry, but I have to go into reality mode here.  I can understand
>if you were running a simulation on the space shuttle you'd rather
>get no answer than a wrong answer.  But let's say you were doing something
>more typical, like ... oh ... replying to a long article in news.  You've
>been typing and researching for an hour now.  I ask you this: would you
>rather I just blow away that entire article and crash your machine or change
>a single random character?

There are more choices than that. If your news is running on a multitasking
machine, I'd hope that the kernel would be able to terminate the task (if
the parity error occurred in task memory rather than kernel memory.)

But think. A parity error MAY occur in the text being manipulated. But it can
also occur in worse places. It could corrupt the datastructures in your
news program, leading to an eventual core dump (but not right away.) If you're
keeping track of core dumps (for whatever reason), do you want to waste time
tracking down an obscure bug like that?

If the memory is in user space, the kernel should at the very least kill the
task. If it doesn't check the page of memory that caused the parity error, it
should (at the very least) never re-allocate that page, and log a nasty message
on the operator's console. If the error occurs in kernel space, it should try
for as graceful a shutdown as it can, which may mean printing a very nasty
message on the operator's console and halting, since it can't trust its disk
system anymore.

--- Richard

dab@myrias.com (Danny Boulet) (08/30/90)

In article <3294@awdprime.UUCP> tif@doorstop.austin.ibm.com (Paul Chamberlain/32767) writes:
>In article <2469@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>>No answer is better than a wrong answer. Why would anyone
>>bother to run anything on a computer so valueless that they don't care
>>if they get a right answer, just so long as they get an answer?
>
>I'm sorry, but I have to go into reality mode here.  I can understand
>if you were running a simulation on the space shuttle you'd rather
>get no answer than a wrong answer.  But let's say you were doing something
>more typical, like ... oh ... replying to a long article in news.  You've
>been typing and researching for an hour now.  I ask you this: would you
>rather I just blow away that entire article and crash your machine or change
>a single random character?

Gee.  That depends.  Consider the characters "1.23456e12".  If the random
change hits the '6' and turns the characters into "1.23452e12" then I probably
don't mind.  If the random change hits the exponent field and I get the
characters "1.23456e92" (a one bit change) then I probably mind quite a bit.
Similar effects occur with binary data (hitting the low order bit of a floating
point number won't matter much but watch what happens when a high order bit
or sign or exponent bit gets hit).
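
Danny's point is easy to make concrete.  A small sketch (the bit
positions refer to the IEEE double-precision layout -- bit 0 is the
lowest mantissa bit, bit 62 the top exponent bit; this is an
illustration, not anyone's posted code):

#include <stdio.h>
#include <string.h>

/* flip one bit of a double's in-memory representation */
static double flip_bit(double x, int bit)
{
	unsigned long long u;

	memcpy(&u, &x, sizeof u);
	u ^= 1ULL << bit;
	memcpy(&x, &u, sizeof u);
	return x;
}

int main(void)
{
	double v = 1.23456e12;

	printf("original         %g\n", v);
	printf("mantissa bit 0   %g\n", flip_bit(v, 0));   /* invisible */
	printf("exponent bit 62  %g\n", flip_bit(v, 62));  /* enormous  */
	return 0;
}

The low mantissa flip is lost in the rounding; the exponent flip moves
the value by hundreds of orders of magnitude.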

I'd prefer that the question was: "would you prefer that I crash the
machine and force you to use the backed up file produced by your editor
or silently produce a wrong answer?".  I'm very much in favour of
answers that I trust.

I know that there are an awful lot of ways that a computer can produce
wrong answers.  That is no excuse for failing to catch the ones that it
is practical to catch.  Adding an extra bit to each byte (or whatever)
seems like a small price to pay for a bit more confidence in the
results.  Also, given the reliability of current memory and such, crashes
due to parity errors would probably be a lot less frequent than crashes
due to other random events (i.e. adding this feature probably wouldn't
do much harm to the MTBF numbers for the system).

One final note:  a lot of small computers are used for business applications
like payroll, accounting, inventory and such.  This may not be as flashy as
simulating the space shuttle but silent failures in these applications can
be pretty devastating to the business.  Unfortunately, the users of such
systems are probably the least likely to appreciate the value of knowing that
the computer detected an error and aborted rather than giving wrong answers.

limey@sparc2.hri.com (Craig Hughes) (08/30/90)

In the interest of keeping the machine up, even with a potentially fatal
problem, how about having the hardware notify the OS (through some kind
of exception) that there is a problem with a certain area of memory? The
O/S could then dynamically remove that portion of physical
memory from its virtual map after copying the data to a new page, 
log the error, and continue processing. The corrupted data isn't fixed,
but at least the machine is still running - and that can be important
sometimes. (reminds me of a military computer I've seen with a big
'combat mode' button on the front - apparently when pressed all exceptions
would be ignored.......)
--
-------------------------------------------------------------------------
Craig S. Hughes				UUCP: ...bbn!hri.com!limey
Horizon Research, Inc.			INET: limey@hri.com
1432 Main Street
Waltham, MA 02154
			<- ------------- ->
-------------------------------------------------------------------------

karsh@trifolium.esd.sgi.com (Bruce Karsh) (08/30/90)

>I know that there are an awful lot of ways that a computer can produce
>wrong answers.  That is no excuse for failing to catch the ones that it
>is practical to catch.  Adding an extra bit to each byte (or whatever)
>seems like a small price to pay for a bit more confidence in the
>results.

But adding the extra bit has a reliability cost too:

    Memory boards need more pins on their connectors.  Mechanical connections
    are a notorious failure point.

    More power is used so the system runs hotter.  There may need to be more
    reliance on fans (which are also notorious) to cool the system.

    The component count is increased so there are more components which can
    potentially fail.

    Parity checking circuitry which can also fail has been added.

    Multiple bit errors may not be detected.

These may all be small effects, but with a modern, well designed memory
system, parity errors are a small effect as well.

> Also, given the reliability of current memory and such, crashes
>due to parity errors would probably be a lot less frequent than crashes
>due to other random events (i.e. adding this feature probably wouldn't
>do much harm to the MTBF numbers for the system).

Given the reliability of current memory and such, how probable is the
event that parity protects against?  I don't have the answer to this, but
I have to believe that someone has studied this problem.  Are memory
parity errors in any way a significant contributor to computer errors?

It seems to me that there are so many other sources of computer error which
are so much more significant that memory parity is just silly.  We don't
usually put parity on floating point processors or internal CPU data paths
and registers.  Putting it on memory seems like a very expensive "spit in
the ocean".

Is there some real hard data which shows that memory is so failure-prone
that parity checking is called for?  If so, why is it that a single bit
of parity checking is adequate?  Is the failure mode such that single-bit
failures are by far the most common kind?  The few memory failures that
I've looked carefully at have been pretty massive, not single-bit.

Has memory parity become a senseless security blanket for the insecure and
uninformed?

>One final note:  a lot of small computers are used for business applications
>like payroll, accounting, inventory and such.  This may not be as flashy as
>simulating the space shuttle but silent failures in these applications can
>be pretty devastating to the business.  Unfortunately, the users of such
>systems are probably the least likely to appreciate the value of knowing that
>the computer detected an error and aborted rather than giving wrong answers.

True, but if the protection is from an extremely unlikely event, it makes
sense to put the cost of protection into protecting against a more likely
event.  Or, alternatively, just leave it off entirely.  You'll never make
a perfectly reliable computer.  You have to settle for some statistical level
of reliability.

I'd like to see a comparison of the probability of a memory parity error
causing a business to make a significant financial mistake, versus the
probability of a software error causing the mistake.

			Bruce Karsh
			karsh@sgi.com

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (08/30/90)

In article <1990Aug29.150017@sparc2.hri.com> limey@hri.com writes:
| In the interest of keeping the machine up, even with a potentially fatal
| problem, how about having the hardware notify the OS (through some kind
| of exception) that there is a problem with a certain area of memory? The
| O/S could then dynamically remove that portion of physical
| memory from its virtual map after copying the data to a new page, 
| log the error, and continue processing. 

  This is how it usually works in a good O/S. If the page is
instructions, it can be reloaded from disk, as can an unmodified data
page. If the data page is dirty the process must be terminated.
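
As a toy illustration of that policy (a sketch only -- not any real
kernel's code, and all the names are made up):

#include <stdio.h>

/* Text and clean data pages still have a good copy on disk;
   a dirty page does not, so its owner has to die. */
enum page_state { PAGE_TEXT, PAGE_CLEAN, PAGE_DIRTY };

static const char *parity_trap_action(enum page_state st)
{
	switch (st) {
	case PAGE_TEXT:  return "reload instructions from the program image";
	case PAGE_CLEAN: return "reload the page from backing store";
	case PAGE_DIRTY: return "terminate the owning process";
	}
	return "?";
}

int main(void)
{
	printf("text page:  %s\n", parity_trap_action(PAGE_TEXT));
	printf("clean page: %s\n", parity_trap_action(PAGE_CLEAN));
	printf("dirty page: %s\n", parity_trap_action(PAGE_DIRTY));
	return 0;
}

(In every case the bad physical frame should also be retired, as Richard
Conto noted above, so it is never handed out again.)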

  On a PC running XX-DOS, it makes more sense to take the whole system
down, since there is no way to tell if the page is dirty or clean, and
no memory mapping to do anything about it if you could. And no penalty,
since the current task is the only task (unless extenders are running).

-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

eli@smectos.gang.umass.edu (Eli Brandt) (08/31/90)

In article <19875@crg5.UUCP> powell@crg5.UUCP (Powell Quiring) writes:
>In article <56qmo1w162w@zl2tnm.gp.govt.nz> don@zl2tnm.gp.govt.nz (Don Stokes) writes:
>>rwallace@vax1.tcd.ie writes:
>>
>>> If the operating system just told you about it when there was a parity error
>>> I'd agree with you, something like flashing up a message on the screen:
>>> "Parity error detected in code segment at 1234:5678, reboot? (Y/N) ".
>>> However automatically crashing the computer is NOT acceptable behaviour: I'd
>>> much rather do without the parity checking. Consider: suppose a parity error
>>> occurs on a 640K machine. The error is probably in either an unused area of
>>> memory (e.g. at the DOS prompt) or in a section of code that isn't going to be
>>> executed on this session with the program (e.g. error handling code).
>
>The fact that you got a parity error indicates that the memory has
>been read.  The only question is how much this incorrect value
>is going to screw you up.

How 'bout when you get a parity error a little window pops up with the
mangled byte and some context?  That way you can fix it if it's in
human-readable data and choose either to continue or reboot otherwise.  I
personally would always choose the latter - I don't want munged FP values,
I don't want corrupted FATs written to disk, and I really don't want a 21h
call changed from 25h to 35h.  Of course, you could always add a few
error-correcting bits, too.  However, *I* wouldn't pay for the extra
RAM/circuitry/design time because I've never had a genuine parity error.

chuckh@apex.UUCP (Chuck Huffington) (09/01/90)

In article <3294@awdprime.UUCP> tif@doorstop.austin.ibm.com (Paul Chamberlain/32767) writes:
|In article <2469@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
|>No answer is better than a wrong answer. Why would anyone
|>bother to run anything on a computer so valueless that they don't care
|>if they get a right answer, just so long as they get an answer?
|
|I'm sorry, but I have to go into reality mode here.  I can understand
|if you were running a simulation on the space shuttle you'd rather
|get no answer than a wrong answer.  But let's say you were doing something
|more typical, like ... oh ... replying to a long article in news.  You've
|been typing and researching for an hour now.  I ask you this: would you
|rather I just blow away that entire article and crash your machine or change
|a single random character?

Two points:

1) How do you feel about a single random character in an ilist or
   in a free block map?

2) Are there really that many workstations that are ONLY used to read
   news?  And NEVER used to do anything critical?  How do you prevent
   a toy workstation from being used in a critical application?  And
   along the same lines, what defines critical?  It would be really nice
   to have fault tolerant systems, but failing that, I would usually
   prefer to have a system crash instead of corrupting its filesystems,
   or silently making "innocent" errors.

rwallace@vax1.tcd.ie (09/01/90)

In article <2469@crdos1.crd.ge.COM>, davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:
> In article <6797.26d6edce@vax1.tcd.ie> rwallace@vax1.tcd.ie writes:
> |                                           OK take the very unlikely case that
> | it is in your data. For me that means in the source for a program I'm writing.
> | This is no problem, I can just fix the one trashed character when the compiler
> | barfs on the code. 
> 
>   One bit changes 0 to 1, + to *, etc. Your compiler catches this?
> 
> |                    Much better than having the machine crash and lose several
> | minutes' work. Or say the error is in a floating-point number in a spreadsheet.
> | Chances are the program will crash with a floating-point error or at least
> | produce obviously wrong results e.g. profit for 1989 was $-32198742.88888.
> | 
> | The point is that ignoring a parity error is a pretty safe thing to do; there's
> | very little chance of getting a misleading answer.
> 
>   A little thought should show you the error of that idea. Over half
> the word is dedicated to the less significant bits of the mantissa. An
> error in any of those bits will result in a change of a percent or less.

OK fair point, and other people have pointed out that parity is only checked
when reading/writing an individual memory word. I still think it would be MUCH
more useful for the operating system to flash up a message telling you that a
parity error had been detected and where it was, so I could decide for myself
whether to reset the machine. (A likely response if doing stuff with very
important data is to save your work under a different filename and then compare
your modified version with the previous version to check if there are any
differences other than in the section you modified).

"To summarize the summary of the summary: people are a problem"
Russell Wallace, Trinity College, Dublin
rwallace@vax1.tcd.ie

leavitt@mordor.hw.stratus.com (Will Leavitt) (09/01/90)

Concerning the additional hardware needed for parity on memory, Bruce
Karsh (karsh@sgi.com) writes:

>But adding the extra bit has a reliability cost too:
>
>  Memory boards need more pins on their connectors.  Mechanical connections
>  are a notorious failure point.
>
>  More power is used so the system runs hotter.  There may need to be more
>  reliance on fans (which are also notorious) to cool the system.
>
>  The component count is increased so there are more components which can
>  potentially fail.
>
>  Parity checking circuitry which can also fail has been added.
>
>  Multiple bit errors may not be detected.

All true...  but modern CMOS memory runs ridiculously cool, parity
detects half of all multi bit errors, and as we'll see, there are very 
definite reasons why memory chips fail more often.

>We don't usually put parity on floating point processors or internal
>CPU data paths and registers.  Putting it on memory seems like a very
>expensive "spit in the ocean".
>
>Is there some real hard data which shows that memory is so failure-prone
>that parity checking is called for?  If so, why is it that a single bit
>of parity checking is adequate?  Is the failure mode such that single-bit
>failures are by far the most common kind?  The few memory failures that
>I've looked carefully at have been pretty massive, not single-bit.

There are at least 3 DRAM failure mechanisms that aren't applicable to
floating point processors, CPU data paths, and registers.
  1) alpha particle flipped bits
  2) DRAMs come in difficult to solder-inspect packages
  3) DRAMs (like all dynamic circuits) are prone to forgetfulness

One at a time:

  1) alpha particle flipped bits

Quoting from a Siemens Information report #6: "Alpha particles are
doubly charged helium nuclei emitted in the radioactive decay of many
radioactive elements (principally Uranium & Thorium).  Naturally
occurring alphas range in energy from about 2 to 9 MeV and are treated
as classical particles.  An alpha interacts electronically with
silicon creating a track of electron-hole pairs along the 25 um
straight line path of the particle."  A track of electron-hole pairs
conducts, by the way.

A data bit is stored as charge on a capacitor in the memory cell, and
is read (sensed) by connecting the cell to a sense amplifier via a bit
line shared by other cells.  Thus if an alpha particle zips through
your capacitor, it can flip a bit in memory.  If it zips through the
bitline while you are reading, you get wrong data, and the data gets
written back wrong.  (Internally, DRAM reads are destructive, and are
always followed by a restore.)  For CMOS 1 Meg parts, a typical error
rate is 270 failures per 10^9 device hours. According to Siemens, bit
line failures now dominate alpha sensitivity.  Both of these lead to
INTERMITTENT SINGLE BIT ERRORS.

Now, what is uranium doing next to your chips?  It can be a
contaminant in the silicon or aluminum, or in the package.  There is
a story that Amdahl built a series of mainframes with no error
correction.  Their DRAM vendor packaged a batch of DRAMs in ceramic
DIPs with a good dose of uranium contaminating the ceramic, and the
resulting mainframes wouldn't stay up for more than a day.  Of course,
they failed a different way each time.  Amdahls now have ECC.

2) DRAMs come in difficult to solder-inspect packages

Crack open the top of your Sparcstation or Iris for this one...  The
most popular package for DRAMs these days is the SOJ; the leads curl
underneath the chip and are impossible to inspect.  The most popular
packages for logic are either through hole (like pin grid arrays), or
gull wing (plastic quad flat pack).  Those big gate arrays are PQFPs.
Both are easy to inspect.  Now if not quite enough solder gets
squeegeed through the solder mask when they make the board, and/or if
the chip has a slightly non-coplanar lead, then instead of being
soldered to the board, the lead ends up resting on a bump of solder
below it.  Because of the springiness in the leads, this will work for
a while, but eventually oxidation will cause intermittent contact.
Now, most DRAMs used for main memory are 1 bit wide parts, so this
results in INTERMITTENT SINGLE BIT ERRORS.

Why are DRAMs packaged in impossible to inspect packages?  Because they
are denser than gull wing, and besides PARITY WILL DETECT ANY PROBLEMS 
ANYWAY.

3) DRAMs (like all dynamic circuits) are prone to forgetfulness

DRAMs store a bit by either charging or not charging a tiny capacitor;
the charge on the capacitor must be refreshed every 15ms before it
dissipates.  Normally this works fine, but marginal chips are prone to
data retention problems -- especially at high temperatures and out of
spec voltage ranges.  Dynamic circuits are used in many CMOS
microprocessors as well, but typically refreshing is not a problem (it
happens on every clock tick, for example).

>Has memory parity become a senseless security blanket for the insecure and
>uninformed?

Probably.  Pretty soon error correction will be standard on all machines
with significant memory sizes.

>I'd like to see a comparison of the probability of a memory parity error
>causing a business to make a significant financial mistake, versus the
>probability of a software error causing the mistake.

True.  But soft memory errors, like bad disk blocks, are a solved problem.
Software errors are not.

        -will
--
-----------------------------------------------------------------
leavitt@mordor.hw.stratus.com 

karsh@trifolium.esd.sgi.com (Bruce Karsh) (09/01/90)

In article <2201@lectroid.sw.stratus.com> leavitt@mordor.sw.stratus.com (Will Leavitt) writes:

>All true...  but modern CMOS memory runs ridiculously cool, parity
>detects half of all multi bit errors, and as we'll see, there are very 
>definite reasons why memory chips fail more often.

CMOS devices run cool when they are switched slowly.  They can consume a lot
of power when they are switched rapidly.  Also, CMOS memory is expensive.
Large memory systems are not often CMOS.

>A data bit is stored as charge on a capacitor in the memory cell, and
>is read (sensed) by connecting the cell to a sense amplifier via a bit
>line shared by other cells.  Thus if an alpha particle zips through
>your capacitor, it can flip a bit in memory.  If it zips through the
>bitline while you are reading, you get wrong data, and the data gets
>written back wrong.  (Internally, DRAM reads are destructive, and are
>always followed by a restore.)  For CMOS 1 Meg parts, a typical error
>rate is 270 failures per 10^9 device hours. According to Siemens, bit
>line failures now dominate alpha sensitivity.  Both of these lead to
>INTERMITTENT SINGLE BIT ERRORS.

That works out to less than one single-bit error every 13 years of
continuous operation on a system with 4 megabytes of CMOS DRAM.  And in
most cases, that single-bit error would not even affect the operation
of the system.  Surely this is a spit in the ocean.  I doubt that most
people would ever observe one of these in their entire computing life.
Certainly there are sources of failure in most computer systems which
are much higher than this.  Like the electrical wall outlet!

If the failure rate of 4 Meg DRAMs is really a lot higher than this,
then perhaps some protection is called for.  But what good is parity?
It just replaces the damage caused by the memory error with the
damage caused by a catastrophic system crash.

>There is
>a story that Amdahl built a series of mainframes with no error
>correction.  Their DRAM vendor packaged a batch of DRAMs in ceramic
>DIPs with a good dose of uranium contaminating the ceramic, and the
>resulting mainframes wouldn't stay up for more than a day.  Of course,
>they failed a different way each time.  Amdahls now have ECC.

A company sent out a bad batch of DRAMs.  So what else is new?  It happens
all the time.  How common is this failure?  It sounds like a spit in the
ocean to me.

>2) DRAMs come in difficult to solder-inspect packages

>Crack open the top of your Sparcstation or Iris for this one...  The
>most popular package for DRAMs these days is the SOJ; the leads curl
>underneath the chip and are impossible to inspect.  The most popular
>packages for logic are either through hole (like pin grid arrays), or
>gull wing (plastic quad flat pack).  Those big gate arrays are PQFPs.
>Both are easy to inspect.  Now if not quite enough solder gets
>squeegeed through the solder mask when they make the board, and/or if
>the chip has a slightly non-coplanar lead, then instead of being
>soldered to the board, the lead ends up resting on a bump of solder
>below it.  Because of the springiness in the leads, this will work for
>a while, but eventually oxidation will cause intermittent contact.
>Now, most DRAMs used for main memory are 1 bit wide parts, so this
>results in INTERMITTENT SINGLE BIT ERRORS.

No doubt true, but at what rate does this failure mode occur?  There
are a lot of high density interconnect schemes now and even more are on
the way.  Are you suggesting that they are so failure prone that they
require error detecting logic?

In almost all cases, this failure would be detected during system thermal
testing and it would never make it out the door.  It is possible that a
certain number would slip through.  How common is this failure?  Would
a typical system ever have a failure because of this failure mode?  Is
it worth adding 12% to the cost and size of a memory system and making
it run more slowly because of this?  Couldn't that money be spent elsewhere
to more effectively improve the reliability of the system?

I think your system is more likely to be hit by lightning than to have
sporadic crashes due to this failure mode.  Do we have any real hard
numbers on how often this failure occurs?

You're probably more likely to see this failure on the SIMM socket rather
than on the chip leads.  In that case, there could easily be more than
a single bit error and the parity detection could still fail to catch the
error.

>3) DRAMs (like all dynamic circuits) are prone to forgetfulness

>DRAMs store a bit by either charging or not charging a tiny capacitor;
>the charge on the capacitor must be refreshed every 15ms before it
>dissipates.  Normally this works fine, but marginal chips are prone to
>data retention problems -- especially at high temperatures and out of
>spec voltage ranges.  Dynamic circuits are used in many CMOS
>microprocessors as well, but typically refreshing is not a problem (it
>happens on every clock tick, for example).

DRAMs, when properly used, are not any more prone to forgetfulness than
other logic chips, unless the DRAM is defective.

A defective memory chip will have errors.  But are memory chips defective
at so much of a higher rate than other chips that it is a problem? If not,
then why single out memory chips for parity protection?

>Probably.  Pretty soon error correction will be standard on all machines
>with significant memory sizes.

I suspect that won't happen.  Memory parity errors are a very rare
failure mode.  I don't think too many designers are going to add extra
cost to their systems to guard against this failure.  Especially not in
the price-competitive computer market of today.  There are just too many
better places to improve reliability.

>>I'd like to see a comparison of the probability of a memory parity error
>>causing a business to make a significant financial mistake, versus the
>>probability of a software error causing the mistake.

>True.  But soft memory errors, like bad disk blocks, are a solved problem.
>Software errors are not.

But if a part whose only job is to decrease the rate of undetected failures
does not make a significant improvement in the rate of undetected failures,
then what good is it?

If someone can show me that those parity chips really do significantly
decrease the rate of undetected system failures, then I'll agree that
they are necessary.  Even if they only make a 5% reduction in this
rate, they may be an acceptable idea.

Somehow I think that if they make any reduction at all, it's several
places to the right of the decimal point.  E.g. .0001%.  Even worse
though, they may actually be decreasing the overall reliability of systems.

			Bruce Karsh
			karsh@sgi.com

gd@geovision.uucp (Gord Deinstadt) (09/02/90)

rwallace@vax1.tcd.ie writes:
> Consider: suppose a parity error
>occurs on a 640K machine. The error is probably in either an unused area of
>memory (e.g. at the DOS prompt) or in a section of code that isn't going to be
>executed on this session with the program (e.g. error handling code).

In all the parity checking memories I've seen, parity is only checked when
the data is fetched by the CPU or DMA controller.  So it *is* in use.
ECC systems do generally access
memory locations that are not in use, but that's because they can usually
do something useful, i.e. fix the data if only a single bit is in error.

> Or say the error is in a floating-point number in a spreadsheet.
>Chances are the program will crash with a floating-point error or at least
>produce obviously wrong results e.g. profit for 1989 was $-32198742.88888.

No doubt 32198742.88888 posters have already replied to this.  What they said.
--
Gord Deinstadt  gdeinstadt@geovision.UUCP

Don_A_Corbitt@cup.portal.com (09/03/90)

> I think your system is more likely to be hit by lightning than to have
> sporadic crashes due to this failure mode.  Do we have any real hard
> numbers on how often this failure occurs?
> 
> You're probably more likely to see this failure on the SIMM socket rather
> than on the chip leads.  In that case, there could easily be more than
> a single bit error and the parity detection could still fail to catch the
> error.
> 
> 
> 
> But if a part whose only job is to decrease the rate of undetected failures
> does not make a significant improvement in the rate of undetected failures,
> then what good is it?
> 
> If someone can show me that those parity chips really do significantly
> decrease the rate of undetected system failures, then I'll agree that
> they are necessary.  Even if they only make a 5% reduction in this
> rate, they may be an acceptable idea.
> 
> Somehow I think that if they make any reduction at all, it's several
> places to the right of the decimal point.  E.g. .0001%.  Even worse
> though, they may actually be decreasing the overall reliability of systems.
> 
> 			Bruce Karsh
>			karsh@sgi.com

Well, I have some anecdotal evidence of the benefits of parity.  I've been
working with IBM PC and clones (the original subject of this discussion)
since the early days.  I've probably been around machines for
	8 years * 4 machines (average)   or
	32 machine years
I've seen 5 or 6 machines that would tend to get parity errors.  Each time,
it was possible to fix by replacing one or more RAM chips (with one exception).
These machines all passed their power-on-self-test, but would fail every few
minutes/hours/days.  Knowing that hardware was broken, we were able to 
blindly swap RAMs until things worked.  If we didn't have parity checking,
we would have suspected our own software (we're SW developers) for bugs,
pointer problems, etc.
Each machine treats parity errors differently.  Some show the suspected
address and RAM chip, others just say "Parity Error R)eboot or I)gnore".

The one time the problem wasn't bad RAM chips was when I installed a memory
expansion board improperly (vendor sent wrong docs).  It used page mode RAM,
but I had the page mode switch turned off.  This was for the upper 4MB of
an 8MB 386 machine.  POST worked fine, using the RAM for a RAM disk worked
fine, but OS/2 would crash with a parity error when booting.  It appeared
that the access pattern would change the failure mode.

What's the point?  RAM is an area where the end-user often gets involved.
Since it is so easy to damage chips when installing them, I find it to be
worthwhile to have some sanity checking on their operation.  Also, most of
the transistors of a given system will be in the RAM chips.  Parity gives
an inexpensive way to reduce the number of "silent wrong answers".

---
Don_A_Corbitt@cup.portal.com      Not a spokesperson for CrystalGraphics, Inc.
Mail flames, post apologies.       Support short .signatures, three lines max.

henry@zoo.toronto.edu (Henry Spencer) (09/04/90)

In article <68362@sgi.sgi.com> karsh@trifolium.sgi.com (Bruce Karsh) writes:
>... perhaps some protection is called for.  But what good is parity?
>It just replaces the damage caused by the memory error with the
>damage caused by a catastrophic system crash.

A hard failure is usually preferable to a silently wrong answer.
-- 
TCP/IP: handling tomorrow's loads today| Henry Spencer at U of Toronto Zoology
OSI: handling yesterday's loads someday|  henry@zoo.toronto.edu   utzoo!henry

karsh@trifolium.esd.sgi.com (Bruce Karsh) (09/05/90)

In article <1990Sep4.163619.24726@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:

>A hard failure is usually preferable to a silently wrong answer.

Given that a memory system is otherwise properly designed and tested
and uses modern 4Mbit DRAM memory chips, is there any evidence that
memory parity makes a measurable difference in the silent wrong answer
rate?

The memory failure component of the silent wrong answer rate seems to be so
small as to be insignificant.

If the answer is no, then isn't parity just a historical and cultural
artifact from the days when parity really was necessary?

			Bruce Karsh
			karsh@sgi.com

wilker@descartes.math.purdue.edu (Clarence Wilkerson) (09/05/90)

I don't remember the exact rate but I thought that with 4 megs of
memory, one expected from alpha radiation alone one error per
two weeks.

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (09/05/90)

In article <68505@sgi.sgi.com> karsh@trifolium.sgi.com (Bruce Karsh) writes:

| Given that a memory system is otherwise properly designed and tested
| and uses modern 4Mbit DRAM memory chips, is there any evidence that
| memory parity makes a measurable difference in the silent wrong answer
| rate?

  If the error rate for 1 bit error is 1 in N, then the rate for a 2 bit
error is 1 in N^2. With N in the order of some millions (or billions),
you make the chance of silent error millions of times less likely.

  EDAC makes it possible to *correct* errors, but the rate of 2 bit
errors which would be caught is small to the point of really being
insignificant. Below the rate of errors caused by noise on the bus, I
believe.

  I would expect EDAC on a 64 bit machine, however, since it is probably
cheaper. Note that for byte parity it takes 8 bits of parity memory, but
for EDAC you can use 1+log2(N) or 7 bits, and get 2 bit error detection,
1 bit error correction, and use less memory.
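
  (A quick way to see where the 7 comes from: a single-error-correcting
code needs the smallest r with 2^r >= N + r + 1, so the syndrome can name
any of the N data bits, any of the r check bits, or "no error".  A sketch
in C -- purely illustrative:)

#include <stdio.h>

/* smallest r with 2^r >= n + r + 1 (the check bits protect themselves) */
static int sec_check_bits(int n)
{
	int r = 1;

	while ((1 << r) < n + r + 1)
		r++;
	return r;
}

int main(void)
{
	int n;

	for (n = 8; n <= 64; n *= 2)
		printf("%2d data bits -> %d check bits\n", n, sec_check_bits(n));
	return 0;
}

For 64 data bits this prints 7, matching the 1+log2(N) rule of thumb;
one more bit on top of that buys double-error detection.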

  This assumes that you can (a) build fast EDAC as cheaply as parity,
and (b) that you use ALL 64 bit data fetches. You can run a 71 bit data
bus and put the EDAC in the bus masters (CPU and I/O controllers) which
access the bus. You can still use parity on I/O devices without
controllers, such as serial ports, if you have any such devices. This is
a bus design issue and doesn't affect the theory at all, just the
cost/simplicity ratio.

Glossary:
  EDAC - error detection and correction
  BMD - bus master devices
  TLA - three letter acronym
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (09/05/90)

In article <13625@mentor.cc.purdue.edu> wilker@math.purdue.edu (Clarence Wilkerson) writes:
| 
| I don't remember the exact rate but I thought that with 4 megs of
| memory, one expected from alpha radiation alone one error per
| two weeks.

  I haven't seen error rates that high in any workstation or 32 bit PC.
I haven't seen a parity error in several years on nine machines with
100+ MB of memory total.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

alix@cerl.uiuc.edu (Chris Alix) (09/05/90)

In article <13625@mentor.cc.purdue.edu> (Clarence Wilkerson) writes:
>  I don't remember the exact rate but I thought that with 4 megs of
>memory, one expected from alpha radiation alone one error per
>two weeks.

In article <2484@crdos1.crd.ge.COM> (Wm E Davidsen Jr) writes:
>  I haven't seen error rates that high in any workstation or 32 bit PC.
>I haven't seen a parity error in several years on nine machines with
>100+ MB of memory total.

NOT A FLAME

In what kind of physical environment are these machines located?  I see
1 or 2 parity errors per year on two 12MB Sun 3/180's in an atypically
"noisy" computer room (lots of unshielded custom hardware, video, fans
switching on and off, etc.)  I'd imagine that a lone workstation in a
Tempest-compliant room might never see an error, but with more and more
standard workstations being used for factory-floor applications, I suspect
that the decision to include parity, secded, etc. is made with the worst
possible physical environment in mind.

--------------------------------------------------------------------------
Christopher Alix                               E-mail: alix@uiuc.edu
University of Illinois                  PLATO/NovaNET: alix / s / cerl
Computer-Based Education Research Lab           Phone: (217) 333-7439
103 S. Mathews  Urbana, IL  61820                 Fax: (217) 244-0793
--------------------------------------------------------------------------

douglas%cirrusl@oliveb.ATC.olivetti.com (Douglas Lee) (09/06/90)

DRAM manufacturers express reliability in terms of FITs (Failures In
Time). One FIT represents one error in one billion (10^9) hours of
operation. Toshiba claims a FIT rate of 252 for 1 Mb DRAMs.
Clearpoint, which makes add-in memory boards, claims the actual rate is
1000 FITs. The FIT rate has steadily decreased for each successive
generation of DRAMs until the 4 Mb. The FIT rate for 4 Mb DRAMs is
higher than for 1 Mb parts.

From the FIT rate you can calculate the MTBF of any memory system. The
MTBF in hours for one DRAM is calculated as 10 ^ 9 / FIT. The MTBF for
a system is just the MTBF of each DRAM divided by the total number of
DRAMs in the system. We are only looking at single bit errors here.
Assuming a FIT rate of 252:

# of DRAMs	Memory Size	MTBF
 32		 4 MB		14.1 years
 96		12 MB		 4.7 years
160		20 MB		 2.8 years

Assuming a FIT rate of 1000:

# of DRAMs	Memory Size	MTBF
 32		 4 MB		3.6 years
 96		12 MB		1.2 years
160		20 MB		260 days
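
The tables are easy to reproduce.  A small sketch (the chip counts assume
1M x 1 parts, eight per megabyte, with no parity bits; the FIT rates are
the ones quoted above):

#include <stdio.h>

/* system MTBF in hours = 10^9 / (FIT per chip * number of chips) */
int main(void)
{
	double fit[] = { 252.0, 1000.0 };
	int chips[] = { 32, 96, 160 };        /* 4, 12 and 20 MB systems */
	int i, j;

	for (i = 0; i < 2; i++) {
		printf("FIT rate %.0f:\n", fit[i]);
		for (j = 0; j < 3; j++) {
			double hours = 1.0e9 / (fit[i] * chips[j]);
			printf("  %3d DRAMs  %8.0f hours  %5.1f years\n",
			       chips[j], hours, hours / 8766.0);
		}
	}
	return 0;
}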

For most PCs (memories < 12 MB) a single bit error should occur rarely
due to soft errors. The FIT rate really only measures errors due to
alpha particle radiation. There can be more soft errors caused by
power supply spikes, drop outs, etc. that have not been accounted for
here. This will cause the FIT rate to go up, reducing the MTBF. The
thing to realize here is that parity will actually make the MTBF go
down, because more parts are added and more things can fail.
Parity does allow you to detect these errors, however.

Error detection and correction (EDAC) has been mentioned as an
alternative, and it is used in many workstations (e.g. Sun). One of
the most popular parts is the Am29C660 and its predecessor, the Am2960.
This part uses a modified Hamming code to detect and correct single
bit errors and to detect double bit errors. It will in fact detect
many multi-bit errors and catastrophic failures such as all 0's or all
1's. The part appends 7 bits to a 32 bit word and 8 bits to a 64 bit
word (two parts are cascaded). For 32 bits the overhead is greater
than parity, 7 vs. 4, but at 64 bits you break even. Similar parts are
made by IDT, and many workstation manufacturers implement the same
function in gate arrays. The advantage of this scheme is that all
single bit errors are corrected. Also, during refresh cycles the EDAC
can scrub memory. This is done by reading one memory location and
correcting any single bit errors during each refresh cycle. By
appropriately partitioning memory, the entire memory can be scrubbed in
a short time, preventing the accumulation of double-bit errors.
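
The scrub loop itself is simple.  A toy version (the EDAC hardware is
simulated by a made-up ecc_correct(); nothing here is the Am29C660's
actual interface):

#include <stdio.h>

#define WORDS 8

static unsigned mem[WORDS];
static unsigned soft_err[WORDS];      /* simulated single-bit faults */

/* stand-in for the EDAC: returns corrected data and flags the fix */
static unsigned ecc_correct(unsigned i, int *fixed)
{
	*fixed = soft_err[i] != 0;
	return mem[i] ^ soft_err[i];
}

int main(void)
{
	unsigned i;
	int fixed;

	soft_err[3] = 1u << 7;        /* an alpha strike hits word 3 */

	for (i = 0; i < WORDS; i++) {         /* one scrub pass */
		unsigned good = ecc_correct(i, &fixed);

		if (fixed) {
			mem[i] = good;        /* write back corrected data */
			soft_err[i] = 0;
			printf("scrubbed single-bit error in word %u\n", i);
		}
	}
	return 0;
}

Each pass repairs any word holding a lone flipped bit, so two errors
rarely get the chance to meet in the same word.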

To calculate the probability of two bit errors occurring, the birthday
paradox is used. This will give the probability of two single bit
errors occurring in the same memory word. Assuming 32 bit words and 252
FITs:

# of DRAMs	Memory Size	MTBF
 39		 4 MB		14,907 years
117		12 MB		 8,607 years
195		20 MB		 6,667 years

For 1000 FITs

# of DRAMs	Memory Size	MTBF
 39		 4 MB		3,757 years
117		12 MB		2,168 years
195		20 MB		1,680 years

This increase is overstated since you have added extra circuitry and
devices that can cause other failures to occur. The expected total
system MTBF increase is 50 to 60 times the non-EDAC system. If
scrubbing is used, then this will be even higher.

What this also neglects is that many single bit errors can occur in
memory locations that are not used, or are not read before they are
written again. Therefore, the system may not detect all the parity
errors that occur.

I would expect that most 64 bit memories will have EDC circuits,
especially memories using DRAMs > 1Mb. Some PC companies have looked
at EDC, but found it too expensive to justify putting in the box.

I now must say that I worked for Advanced Micro Devices supporting the
Am29C660. I no longer am affiliated with them.

I hope this answers some of the questions about memory reliability.

Douglas Lee

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (09/06/90)

In article <2361@cirrusl.UUCP> douglas%cirrusl@oliveb.ATC.olivetti.com 
	(Douglas Lee) writes:
>DRAM manufacturers express reliability in terms of FITs (Failures In
>Time). ... The FIT rate really only measures errors due to
>alpha particle radiation. There can be more soft errors caused by
>power supply spikes, drop outs, etc. that have not been accounted for
>here.

Yes. Also, note that parity/ECC may catch problems with connectors,
bus drivers, fans and filters (== overheating), system environment,
and so on.  Further, FIT MTBF is an average.  There are always
machines "built on a Monday", just as with cars.  

It doesn't contribute much to this discussion to give anecdotal
evidence of zero parity errors.  Others of us have anecdotes to the
contrary.  For example, my workstation has had errors in its frame
buffer - which isn't parity protected, because the occasional extra
pixel isn't too important.  I just refresh the screen. 
-- 
Don		D.C.Lindsay

vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (09/06/90)

In article <10397@pt.cs.cmu.edu>, lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
> 
> Yes. Also, note that parity/ECC may catch problems with connectors,
> bus drivers, fans and filters (== overheating), system environment,
> and so on. ...


Exactly.  There is a workstation in a lab near my office that was having
several parity errors per hour, until an unnamed idiot removed the extra
SIMMS he'd scrounged from a different model of the same brand of machine.
The diagnostics reported no problems, and the errors occurred only when the
machine got hot.  Parity saved days of looking for strange, new kernel
bugs, which would have been the diagnosis without the parity error reports.

Parity errors caused by a timing problem figured prominently in the
resolution, after years of searching, of a problem in the old 68K SGI line.
Without the parity error reports, we would still be looking for a wild
pointer.

From reading the UNIX-on-PC-clones news groups, it seems to me that parity
errors are the main and most universally available and reliable memory
diagnostic on such machines, detecting all kinds of speed, heat, and
compatibility problems.


Vernon Schryver,   vjs@sgi.com

davec@nucleus.amd.com (Dave Christie) (09/06/90)

In article <2483@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>
>  I would expect EDAC on a 64 bit machine, however, since it is probably
>cheaper. Note that for byte parity it takes 8 bits of parity memory, but
>for EDAC you can use 1+log2(N) or 7 bits, and get 2 bit error detection,
>1 bit error correction, and use less memory.

It's been many years since I designed EDAC for a 64-bit machine, but what
I seem to remember is that using 7 bits would only allow you to correct
the 64-bit data portion, not the 7 check bits themselves.  To cover those
you need one more bit (and you really do want those covered as well). 

This extra bit is worthwhile for another reason - it gives you a lot more
freedom in arranging the matrix for generating (& regenerating) the
check bits, which translates into speed.  Eight bits would probably allow
faster encoding and decoding for a 64-bit word than seven.

So lower cost really isn't a factor.  For the 64-bit machines around today,
the money spent on EDAC is a drop in the bucket, and any performance penalty
is greatly reduced either by caches or vector operations.  Moreover, these
machines often run looooong jobs, and they are paid for by charging the 
users.  If a user's 3-day job bombs after 2 1/2 days because of a memory
error, you really can't charge him, and have lost significant revenue.
(Not to mention severely pissing off the user, who undoubtedly has things
timed to finish 1 hour before his paper on the results must be submitted :-).

Future 64-bit workstations will certainly have some different considerations
though.

---------------------------------
Dave Christie		My fuzzy memories only.

tif@doorstop.austin.ibm.com (Paul Chamberlain/32767) (09/06/90)

In article <2483@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>In article <68505@sgi.sgi.com> karsh@trifolium.sgi.com (Bruce Karsh) writes:
>
>| Given that a memory system is otherwise properly designed and tested
>| and uses modern 4Mbit DRAM memory chips, is there any evidence that
>| memory parity makes a measurable difference in the silent wrong answer
>| rate?
>
>  If the error rate for 1 bit error is 1 in N, then the rate for a 2 bit
>error is 1 in N^2. With N in the order of some millions (or billions),
>you make the chance of silent error millions of time less likely.

I do not pretend to answer the original question but only to say that
this answer is unfounded.  According to my statistics class this is only
true if the two events are independent.  Perhaps the question could have
been read as:

	... is there any evidence that there are measurably more single
	bit errors than multiple bit errors?

And now for a slightly biased question:  Is it typical for a workstation
to provide ECC and memory scrubbing like the Risc System/6000 does?  I
am getting at another possible selling point of this machine.

Paul Chamberlain | I do NOT represent IBM         tif@doorstop, sc30661@ausvm6
512/838-7008     | ...!cs.utexas.edu!ibmaus!auschs!doorstop.austin.ibm.com!tif

tif@doorstop.austin.ibm.com (Paul Chamberlain/32767) (09/07/90)

In article <1189@geovision.UUCP> gd@geovision.uucp (Gord Deinstadt) writes:
>rwallace@vax1.tcd.ie writes:
>>Chances are the program will crash with a floating-point error or at least
>>produce obviously wrong results e.g. profit for 1989 was $-32198742.88888.
>No doubt 32198742.88888 posters have already replied to this.  What they said.

Either he got hit by a memory error or he typed that wrong because it's
obvious that the number is wrong. I'd be willing to bet $32198742.88888
that he typed it that way.     :-)

Paul Chamberlain | I do NOT represent IBM         tif@doorstop, sc30661@ausvm6
512/838-7008     | ...!cs.utexas.edu!ibmaus!auschs!doorstop.austin.ibm.com!tif

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (09/07/90)

In article <1990Sep6.141040.3244@mozart.amd.com> davec@nucleus.amd.com (Dave Christie) writes:

| It's been many years since I designed EDAC for a 64-bit machine, but what
| I seem to remember is that using 7 bits would only allow you to correct
| the 64-bit data portion, not the 7 check bits themselves.  To cover those
| you need one more bit (and you really do want those covered as well). 

  I just looked at some C code for Hamming code I wrote years ago, and
it appears to need log2(N)+1 bits, including the EDAC bits themselves.
In any case, if you can have EDAC for the price of parity, why not?
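
For the curious, the skeleton of such code is short.  A minimal
single-error-correcting sketch over 8 data bits (check bits at positions
1, 2, 4 and 8; a real SEC-DED part like the Am29C660 adds an overall
parity bit and works on 32 or 64 bits, so treat this purely as an
illustration):

#include <stdio.h>

static unsigned hamming_encode(unsigned data)   /* low 8 bits used */
{
	unsigned code = 0, pos, b, i = 0;

	for (pos = 1; pos <= 12; pos++)         /* scatter data bits */
		if (pos & (pos - 1))            /* skip powers of two */
			code |= ((data >> i++) & 1) << pos;

	for (b = 1; b <= 8; b <<= 1) {          /* set each check bit so    */
		unsigned parity = 0;            /* its group has even parity */

		for (pos = 1; pos <= 12; pos++)
			if (pos & b)
				parity ^= (code >> pos) & 1;
		code |= parity << b;
	}
	return code;
}

/* 0 means no error; otherwise the value is the bad bit's position */
static unsigned hamming_syndrome(unsigned code)
{
	unsigned s = 0, pos;

	for (pos = 1; pos <= 12; pos++)
		if ((code >> pos) & 1)
			s ^= pos;
	return s;
}

int main(void)
{
	unsigned word = hamming_encode(0xA5);
	unsigned hit  = word ^ (1u << 6);       /* simulated alpha strike */
	unsigned s    = hamming_syndrome(hit);

	printf("syndrome = %u, corrected ok = %d\n",
	       s, ((s ? hit ^ (1u << s) : hit) == word));
	return 0;
}

The syndrome is just the XOR of the positions of all set bits: zero for
a valid codeword, and equal to the flipped position after any single-bit
error.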
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (09/07/90)

In article <3405@awdprime.UUCP> tif@doorstop.austin.ibm.com (Paul Chamberlain/32767) writes:

| I do not pretend to answer the original question but only to say that
| this answer is unfounded.  According to my statistics class this is only
| true if the two events are independent.  Perhaps the question could have
| been read as:

  There are exceptions to every assumption, but assuming that most
memory systems are based on 1 bit wide chips, alpha strikes (which seem
to be the common cause of bit errors) would be limited to one bit in a
word. I think the answer is that multibit errors in a word are rare.

  My hardware guru says that one particle should only hit one bit, even
in the same chip, and that depending on the chip it can only make one
state transition. That means on some chips it can change 0 to 1, but not
back. The alpha hit discharges the capacitor in the cell.

  Sounds right to me, but I don't claim to be a hardware type.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

dhinds@portia.Stanford.EDU (David Hinds) (09/07/90)

In article <2496@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>In article <1990Sep6.141040.3244@mozart.amd.com> davec@nucleus.amd.com (Dave Christie) writes:
>
>| It's been many years since I designed EDAC for a 64-bit machine, but what
>| I seem to remember is that using 7 bits would only allow you to correct
>| the 64-bit data portion, not the 7 check bits themselves.  To cover those
>| you need one more bit (and you really do want those covered as well). 
>
>  I just looked at some C code for Hamming code I wrote years ago, and
>it appears to need log2(N)+1 bits, including the EDAC bits themselves.
>In any case, if you can have EDAC for the price of parity, why not?
>-- 
   But don't you really only need one parity bit per word, if you only
want to be able to detect single-bit errors?  Using one parity bit
per byte is wasteful - which is why the EDAC looks good.  Having one
parity bit per 64-bit word would seem to be the fairer comparison.
Using 8 parity bits per word amounts to catching most two-bit errors
as well - but catching more than single-bit errors is not what parity
is tailored for.
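
   (The bit-count arithmetic drives the point home: byte parity on a
64-bit word costs 8 bits, and a SECDED code over the same word also
costs 8 check bits - the same overhead buys far stronger protection.)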

 -David Hinds
  dhinds@popserver.stanford.edu

davec@nucleus.amd.com (Dave Christie) (09/07/90)

In article <1990Sep7.003451.13193@portia.Stanford.EDU> dhinds@portia.Stanford.EDU (David Hinds) writes:

>   But don't you really only need one parity bit per word, if you only
>want to be able to detect single-bit errors?  Using one parity bit
>per byte is wasteful - which is why the EDAC looks good.  Having one
>parity bit per 64-bit word would seem to be the fairer comparison.

Yes, this would work, but

	1) it would take almost twice as long to check the parity
	   (7 levels of XOR vs. 4), and parity checking tends to be a
	   time-critical path (although sometimes you can delay it),
	   so it's a classic real-estate/speed tradeoff

	2) byte parity allows byte writes without the control complexity
	   of doing read/modify/write.  (EDAC of course requires r/m/w
	   for partial-word writes - see the sketch below.)
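
A software rendering of point 2 (my sketch, nothing to do with any real
controller; check() is a stand-in for the actual check-bit generator):
even a one-byte store must re-derive the check bits from the whole word.

        /* Stand-in for a real SECDED check-bit generator. */
        static unsigned char check(unsigned long long w)
        {
                unsigned char c = 0;

                while (w) {
                        c ^= (unsigned char)(w & 0xFF);
                        w >>= 8;
                }
                return c;
        }

        /* A one-byte store under EDAC: read the word, merge the byte,
           recompute the check bits over all 64 bits, write back. */
        void byte_store(unsigned long long *word, unsigned char *ecc,
                        int byte_no, unsigned char val)
        {
                unsigned long long w = *word;                  /* read   */

                w &= ~(0xFFULL << (8 * byte_no));              /* modify */
                w |= (unsigned long long)val << (8 * byte_no);
                *ecc = check(w);      /* depends on the whole word */
                *word = w;                                     /* write  */
        }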


------------
Dave Christie

douglas%cirrusl@oliveb.ATC.olivetti.com (Douglas Lee) (09/07/90)

In <1990Sep7.003451.13193@portia.Stanford.EDU> dhinds@portia.Stanford.EDU (David Hinds) writes:

>   But don't you really only need one parity bit per word, if you only
>want to be able to detect single-bit errors?  Using one parity bit
>per byte is wasteful - which is why the EDAC looks good.  Having one
>parity bit per 64-bit word would seem to be the fairer comparison.
>Using 8 parity bits per word amounts to catching most two-bit errors
>as well - but catching more than single-bit errors is not what parity
>is tailored for.

> -David Hinds
>  dhinds@popserver.stanford.edu

But using byte parity allows you to do things like byte writes. If you
use word parity, you must do a read-modify-write for every byte store in
order to update the word's parity. This is very inefficient.

Douglas Lee

dswartz@bigbootay.sw.stratus.com (Dan Swartzendruber) (09/07/90)

In article <1990Sep7.144514.19015@mozart.amd.com> davec@nucleus.amd.com (Dave Christie) writes:
>In article <1990Sep7.003451.13193@portia.Stanford.EDU> dhinds@portia.Stanford.EDU (David Hinds) writes:
:
::   But don't you really only need one parity bit per word, if you only
::want to be able to detect single-bit errors?  Using one parity bit
::per byte is wasteful - which is why the EDAC looks good.  Having one
::parity bit per 64-bit word would seem to be the fairer comparison.
:
:Yes, this would work, but 
:
:	1) it would take almost twice as long to check the parity
:	   (7 levels of XOR vs. 4), and parity checking tends to be a
:	   time-critical path (although sometimes you can delay it),
:	   so it's a classic real-estate/speed tradeoff

Beg pardon?  You're telling me that all of these parity checks are done
in serial???  If not, what difference does it make how many check bits
there are?

:
:	2) byte parity allows byte writes without the control complexity
:	   of doing read/modify/write.  (EDAC of course requires r/m/w
:	   for partial-word writes.)
:

This argument at least doesn't always hold when dealing with a write-back
cache.

:
:------------
:Dave Christie


--

Dan S.

davec@neutron.amd.com (Dave Christie) (09/08/90)

In article <2253@lectroid.sw.stratus.com> dswartz@bigbootay.sw.stratus.com (Dan Swartzendruber) writes:
>In article <1990Sep7.144514.19015@mozart.amd.com> I write:
>:
>:	1) it would take almost twice as long to check the parity
>:	   (7 levels of XOR vs. 4), and parity checking tends to be a
>:	   time-critical path (although sometimes you can delay it),
>:	   so it's a classic real-estate/speed tradeoff
>
>Beg pardon?  You're telling me that all of these parity checks are done
>in serial???  If not, what difference does it make how many check bits
>there are?

Parity is typically generated and checked with a tree of 2-bit
exclusive-ORs.  Generating (or regenerating) 8-bit parity takes
3 levels of XOR.  For an 8-byte word, you need three more levels to
combine these into a single bit.  Comparing the regenerated bit(s)
with the stored bit(s) on a read requires one more XOR for each bit.
As for generating an error signal, the 7 levels of XOR do it directly
for 64-bit parity.  With 8-bit parity on a 64-bit word, the eight
error signals you have after 4 levels of XOR must be combined with
(logically) three levels of 2-bit OR, which in any technology will be
somewhat faster than 3 levels of XOR, and can often be much faster
(e.g. ECL wired logic, CMOS dynamic logic).
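
In software terms the tree is just a log-depth fold (my illustration,
not actual gate-level hardware):

        /* Word parity as a log-depth XOR fold: six levels for 64 bits
           versus three for a byte; one more XOR to compare against the
           stored parity bit gives the 7-vs-4 figures above. */
        unsigned parity64(unsigned long long w)
        {
                w ^= w >> 32;           /* level 1 */
                w ^= w >> 16;           /* level 2 */
                w ^= w >> 8;            /* level 3 */
                w ^= w >> 4;            /* level 4 */
                w ^= w >> 2;            /* level 5 */
                w ^= w >> 1;            /* level 6 */
                return (unsigned)(w & 1);
        }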

>:
>:	2) byte parity allows byte writes without the control complexity
>:	   of doing read/modify/write.  (EDAC of course requires r/m/w
>:	   for partial-word writes.)
>
>This argument at least doesn't always hold when dealing with a write-back
>cache.

Quite true.  (Provided your I/O system, or whatever else you have writing
memory, also does block or word writes, as Bill D. more or less pointed
out earlier.)

Just how significant either of these two points is depends heavily on
many parameters of your memory system design, such as cycle time, RAM
speed, degree of pipelining, desired latency, write setup time, etc.
------------
Dave Christie

davec@neutron.amd.com (Dave Christie) (09/08/90)

In article <2496@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>In article <1990Sep6.141040.3244@mozart.amd.com> I write:
>
>| It's been many years since I designed EDAC for a 64-bit machine, but what
>| I seem to remember is that using 7 bits would only allow you to correct
>| the 64-bit data portion, not the 7 check bits themselves.  To cover those
>| you need one more bit (and you really do want those covered as well). 
>
>  I just looked at some C code for Hamming code I wrote years ago, and
>it appears to need log2(N)+1 bits, including the EDAC bits themselves.

I stand corrected (thank you, Robert Herndon) - the eighth bit gives
you double-error detection.  (I knew it was necessary for some reason
- like I said, it's been a looong time...)  The system I did was SECDED,
which is the only EDAC scheme I've ever seen implemented, but then, I've
only worked on mainframes.  For a more cost-sensitive workstation you
may well skip the double-error stuff, but if it's only one more bit
on top of 71, what the hell - it's good fodder for the sales brochures
if nothing else.
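
The decode rule the extra bit buys is simple enough to sketch (my
rendering of the classic scheme, not any particular machine's):

        enum ecc_result { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE };

        /* s = the 7-bit Hamming syndrome, p = the overall parity check
           across all 72 bits (the eighth bit).  An odd number of flips
           shows up in p; exactly two flips leave p clean but s != 0. */
        enum ecc_result secded_classify(unsigned s, unsigned p)
        {
                if (s == 0 && p == 0)
                        return ECC_OK;          /* no error */
                if (p == 1)
                        return ECC_CORRECTED;   /* single-bit error (or
                                                   the overall bit itself) */
                return ECC_UNCORRECTABLE;       /* s != 0, p == 0: double */
        }
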
----------
Dave Christie		My opinions only.

daveg@near.cs.caltech.edu (Dave Gillespie) (09/08/90)

>>>>> On 8 Sep 90 01:46:08 GMT, davec@neutron.amd.com (Dave Christie) said:

> The eighth bit gives you double error detection...
> ... if it's only one more bit on top of 71, what the hell...

I wonder: I can see single-bit errors occurring in isolation, but
how likely is an error of exactly two bits?  Most catastrophes
I can think of will nuke one bit or many.  And if the only danger is
two statistically independent errors occurring at once in the same
word, I think a more pressing danger is that your machine might be
the Ravenous Bugblatter Beast in a clever disguise.

								-- Dave
--
Dave Gillespie
  256-80 Caltech Pasadena CA USA 91125
  daveg@csvax.cs.caltech.edu, ...!cit-vax!daveg

ching@brahms.amd.com (Mike Ching) (09/09/90)

In article <DAVEG.90Sep7233206@near.cs.caltech.edu> daveg@near.cs.caltech.edu (Dave Gillespie) writes:
>>>>>> On 8 Sep 90 01:46:08 GMT, davec@neutron.amd.com (Dave Christie) said:
>
>> The eighth bit gives you double error detection...
>> ... if it's only one more bit on top of 71, what the hell...
>
>I wonder: I can see single-bit errors occurring in isolation, but
>how likely is an error of exactly two bits?  Most catastrophes
>I can think of will nuke one bit or many.

The problem is that the two errors don't have to occur simultaneously. If
a soft error is not corrected (by accessing the word and writing a corrected
word back), a second bit can be corrupted at a later time and result in a
double-bit error when the word is accessed. This is why scrubbing was
incorporated in DRAM controllers. Scrubbing is a term coined for doing
an RMW cycle with correction during a refresh cycle. All words in memory
get accessed (and corrected if necessary) every few minutes instead of
only when accessed by a program. An added benefit is that errors are
corrected in the background instead of imposing a correction cycle on an
access while the processor is waiting for the data.
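
A scrub pass amounts to this in software terms (an illustration of the
idea, not any particular controller):

        /* Read every word -- the ECC logic corrects a single-bit error
           on the way out -- and write the clean value back, so a lone
           soft error never waits around for company. */
        void scrub(volatile unsigned long *mem, unsigned long nwords)
        {
                unsigned long i;

                for (i = 0; i < nwords; i++)
                        mem[i] = mem[i];    /* read, correct, write back */
        }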

Mike Ching

friedl@mtndew.Tustin.CA.US (Steve Friedl) (09/09/90)

[ discussions on ECC ]

In article <1990Sep8.172848.4600@amd.com>, ching@brahms.amd.com (Mike Ching) writes:
> The problem is that the two errors don't have to occur simultaneously. If
> a soft error is not corrected (by accessing the word and writing a corrected
> word back), a second bit can be corrupted at a later time and result in a
> double-bit error when the word is accessed. This is why scrubbing was
> incorporated in DRAM controllers.

On some machines, the scrubbing is done in software.  The newer 3B2s
all have a job running out of cron at the top of the hour that does:

	dd if=/dev/mem of=/dev/null

This seems to serve the same purpose: it provokes the single-bit errors
in the background so they get fixed right away.
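
The crontab entry would look something like this (my reconstruction of
the idea, not lifted from an actual 3B2):

	0 * * * * dd if=/dev/mem of=/dev/null 2>/dev/null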

     Steve

-- 
Stephen J. Friedl, KA8CMY / I speak for me only / Tustin, CA / 3B2-kind-of-guy
+1 714 544 6561  / friedl@mtndew.Tustin.CA.US  / {uunet,attmail}!mtndew!friedl

Steve's bright idea #44: COBOL interface library for X Windows

ddb@ns.network.com (David Dyer-Bennet) (09/25/90)

In article <1990Aug10.171744.9639@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
:But, but, but... virtually all MSDOS software *explicitly ignores*
:parity errors.

The original article is dated 10-Aug, but no later ones in the thread
have reached this site.

This is a new story to me; what I KNOW is that I have had bad memory
chips in my IBM PC detected and reported by the parity logic.  It gave
me enough information to identify the chip, and replacing that chip
cured the problem.

The lack of any way to test the parity system is extremely
unfortunate.

(Parity handling wouldn't come to the attention of the individual
program unless it went to special effort; it would be handled by
MS-DOS itself.)


-- 
David Dyer-Bennet, ddb@terrabit.fidonet.org
or ddb@network.com
or ddb@Lynx.MN.Org, ...{amdahl,hpda}!bungia!viper!ddb
or Fidonet 1:282/341.0, (612) 721-8967 9600hst/2400/1200/300