[comp.sys.next] NeXT Memory - No Error Checking or Parity !

jensen@gt-eedsp.UUCP (P. Allen Jensen) (10/26/88)

OK, straight from a NeXT sales rep in response to the question:
Q: Does the memory have a parity check bit ?
A: "No"

The reason was that "memory is reliable enough that the added cost
was not justified."  If you have ever worked on some older equipment
without parity, your opinion may differ.  Could an expert on RAM
chips respond ?   Is memory really "reliable enough" ?

I was surprised to learn that the cold-start diagnostics do not
check memory for errors - they just look to see if there is any
memory there.  If I had a NeXT, I think I would have a crontab
entry to check memory every day/night !
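
Something along these lines would do -- only a sketch, and a user program
can only test whatever pages the kernel hands it, so it is no substitute
for a real standalone diagnostic:

    /* memtest.c -- a minimal user-space memory test, the sort of thing
     * cron could run nightly, e.g.:  0 3 * * * /usr/local/bin/memtest
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define MEGS  4L                               /* Mbytes tested per run */
    #define WORDS (MEGS * 1024L * 1024L / (long) sizeof(long))

    int main(void)
    {
        static long patterns[] = { 0x55555555L, ~0x55555555L, 0L, ~0L };
        long *buf = (long *) malloc((size_t) WORDS * sizeof(long));
        long i, errors = 0;
        int p;

        if (buf == NULL) {
            fprintf(stderr, "memtest: cannot allocate %ld Mbytes\n", MEGS);
            return 1;
        }
        for (p = 0; p < 4; p++) {
            for (i = 0; i < WORDS; i++)  /* write an address-varying pattern */
                buf[i] = patterns[p] ^ i;
            for (i = 0; i < WORDS; i++)  /* read it back and compare         */
                if (buf[i] != (patterns[p] ^ i))
                    errors++;
        }
        if (errors > 0)
            fprintf(stderr, "memtest: %ld mismatches!\n", errors);
        return errors != 0;
    }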

P. Allen Jensen
-- 
P. Allen Jensen
Georgia Tech, School of Electrical Engineering, Atlanta, GA  30332-0250
USENET: ...!{allegra,hplabs,ulysses}!gatech!gt-eedsp!jensen
INTERNET: jensen@gt-eedsp.gatech.edu

tim@hoptoad.uucp (Tim Maroney) (10/26/88)

In article <549@gt-eedsp.UUCP> jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
>I was surprised to learn that the cold-start diagnostics do not
>check memory for errors - they just look to see if there is any
>memory there.

How fast can *you* check 8 Megabytes of RAM???
-- 
Tim Maroney, Consultant, Eclectic Software, sun!hoptoad!tim
"Because there is something in you that I respect, and that makes me desire
 to have you for my enemy."
"Thats well said.  On those terms, sir, I will accept your enmity or any
 man's."
    - Shaw, "The Devil's Disciple"

roy@phri.UUCP (Roy Smith) (10/26/88)

jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
> The reason was that "memory is reliable enough that the added cost was not
> justified." [...]  Could an expert on RAM chips respond ?   Is memory
> really "reliable enough" ?

	I'm hardly an expert on ram, but here goes anyway.  We've got 19
Sun-3's of various flavors around here with a total of 84 Mbytes of ram.
We get a parity error panic on one machine or another a couple of times a
year.  Make that, oh maybe, 1 error per 400 Mbyte-months.  In perhaps 2000
Mbyte-months of operation, we've had one hard memory error.

	That's my data.  Draw your own conclusions.
-- 
Roy Smith, System Administrator
Public Health Research Institute
{allegra,philabs,cmcl2,rutgers}!phri!roy -or- phri!roy@uunet.uu.net
"The connector is the network"

debra@alice.UUCP (Paul De Bra) (10/26/88)

In article <549@gt-eedsp.UUCP> jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
>OK, straight from a NeXT sales rep in response to the question:
>Q: Does the memory have a parity check bit ?
>A: "No"
>
>The reason was that "memory is reliable enough that the added cost
>was not justified."  If you have ever worked on some older equipment
>without parity, your opinion may differ.  Could an expert on RAM
>chips respond ?   Is memory really "reliable enough" ?

NO. (memory is NOT reliable enough)

I have seen memory go bad on ATs, Microvaxen, big Vaxen, ...
My impression is that memory chips are not being tested well enough to
be able to put them in a machine and expect them to still work (within
specifications) in a year or so. 99.9% of the NeXT boxes may never have a
problem, but I don't want to be among the 0.1% that spends weeks on the
phone discussing unidentified problems, which cannot be reproduced, and
tracking them down to a bad memory chip, when the few extra $ for parity
could have pointed out the problem right away.  It need not even be 9 chips
for each row of 8; 33 chips instead of 32 (one parity bit per 32-bit word)
would be adequate (though harder to get).
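
For the record, the parity computation itself is trivial -- in hardware a
tree of XOR gates, in C a shift-and-fold.  A sketch, nothing more:

    #include <stdio.h>

    /* Even parity of a 32-bit word: 1 if it contains an odd number of
     * 1 bits. */
    unsigned parity32(unsigned long w)
    {
        w ^= w >> 16;
        w ^= w >> 8;
        w ^= w >> 4;
        w ^= w >> 2;
        w ^= w >> 1;
        return (unsigned) (w & 1);
    }

    int main(void)
    {
        unsigned long word = 0x12345678L;

        /* Store this bit in the 33rd chip on every write; recompute and
         * compare on every read.  A mismatch means some bit in the word
         * flipped -- detected, though not located. */
        printf("parity of %08lx = %u\n", word, parity32(word));
        return 0;
    }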

Paul.


-- 
-------------------------------------------------------------------------
|debra@research.att.com   | uunet!research!debra     | att!grumpy!debra |
-------------------------------------------------------------------------

pardo@june.cs.washington.edu (David Keppel) (10/27/88)

roy@phri.UUCP (Roy Smith) writes:
>[ 84Mb and only 2 parity error panics each year ]

The Eagle PCs of a while back had no parity bits -- it was their
claim that the parity bits had errors often enough that "fake" parity
errors *could* be a problem and that *real* parity errors were few
enough *not* to be a problem.

I still like the idea of getting warned when something goes wrong...

    ;-D on  ( My memory is perfac... purrfe... forgot how to spill )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

wtm@neoucom.UUCP (Bill Mayhew) (10/27/88)

I'm looking at the photo of the motherboard for the NeXT computer
on page 164 of the Nov. 1988 issue of Byte.  The SIMMs would appear
to be 8 * 1 megabit chips.  The accompanying text says that it is
100 ns memory.  There is also a set of four 8K * 8 bit (possibly
HM6164?) cache chips.  The accompanying text says that they are
rated at 45 ns.  There are also four chips identified as "custom
memory buffers" adjacent to the SIMM array.  Last of all, there is
256K of video RAM.

For the 32K static RAM, 24K is given to the DSP chip, and 8K goes
to the disk controller.  Apparently the LSI DMA chip must have some
internal smarts.  They do mention that there is a DMA burst mode
that allows 4 long words to be fetched in 9 cycles.

The SCSI controller is a 5390.  The chip is marked NCR in the
photo, but I can't find it in my SMS/OMTI catalog.  I'm willing to
believe the quoted 4 mb/s transfer rate.

I don't know about the lack of parity or error-correction hardware.
Relative to some of the other goodies included, it doesn't seem
like it would have been that hard to include ... even if it was just
parity.  I'd like to know when my box is making a mistake.  When I
looked inside the Mac II, there didn't seem to be any sort of
parity or error correction there either.  I hope that the kernel
does its own sanity checking in lieu of hardware.

--Bill

jewett@hpl-opus.HP.COM (Bob Jewett) (10/28/88)

>> jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
>> Could an expert on RAM chips respond ?  Is memory really "reliable enough" ?
>> 
To which Roy Smith replies:
> 	I'm hardly an expert on ram, but here goes anyway.  ...
> Make that, oh maybe, 1 error per 400 Mbyte-months.

I too am not a RAM expert, but here is another data point.

We have had about 5000 megabyte-months on 16 HP9000/350 workstations.
(68020, 16 Meg RAM each)  We have seen roughly the same rate of parity
errors.  Whether that error rate is a problem depends a lot on the
application you're running.  If there's a one-bit error when writing out
your term report, it's probably OK.  If it's the final version of an IC
design, it may cost big bucks.

Our file server 350 is equipped with ECC RAM (39 one-Meg chips for each
4 megabytes of RAM).  There is a nightly daemon that "scrubs" the RAM --
finds and fixes all one-bit soft errors.  The log shows two errors fixed
in the last five months.  That kind of RAM is slightly slower, but a
parity error panic on the file server is painful enough that the extra
safety was considered worthwhile.
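
The scrubbing idea is simple enough to sketch in C.  This is an outline of
the idea, not our actual daemon -- a real scrubber runs in the kernel
against physical memory, locks out other writers, and logs what it fixed:

    /* Reading a word makes the ECC logic hand back the corrected value;
     * writing it back flushes the soft error out of the RAM before a
     * second hit in the same word can turn it into an uncorrectable
     * double error. */
    typedef unsigned long word_t;

    void scrub(volatile word_t *base, unsigned long nwords)
    {
        unsigned long i;

        for (i = 0; i < nwords; i++) {
            word_t w = base[i];   /* read: hardware corrects 1-bit errors */
            base[i] = w;          /* write back: RAM now holds clean bits */
        }
    }

    int main(void)
    {
        static word_t bank[1024];   /* stand-in for one bank of memory */

        scrub(bank, 1024);
        return 0;
    }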

A subtle point in the statistics is that many (maybe most) soft errors in RAM
are never noticed.  Often RAM is written and then overwritten without ever
being read.

Bob Jewett jewett@hplabs

This is not an official statement of the Hewlett-Packard Company.

david@ms.uky.edu (David Herron -- One of the vertebrae) (10/28/88)

Lack of a parity bit is a definite minus... even with "modern reliable memory".

On my 3b1, somewhere in the 3.5 megs of memory in the machine, there
is one or more bad chips.  When the machine boots it finds those bad
chips during the memory check and maps them out.  I would prefer that
it tell me where the bad chips are so that I could replace them,
but I like the fact that they're being mapped out.  And as near as I can
figure it's only costing me 35K of memory ...

I *like* the parity in my unix pc..
-- 
<-- David Herron; an MMDF guy                              <david@ms.uky.edu>
<-- ska: David le casse\*'      {rutgers,uunet}!ukma!david, david@UKMA.BITNET
<--
<-- Controlled anarchy -- the essence of the net.

jtn@potomac.ads.com (John T. Nelson) (10/28/88)

> In article <3569@phri.UUCP>, roy@phri.UUCP (Roy Smith) writes:

> 	I'm hardly an expert on ram, but here goes anyway.  We've got 19
> Sun-3's of various flavors around here with a total of 84 Mbytes of ram.
> We get a parity error panic on one machine or another a couple of times a
> year.  Make that, oh maybe, 1 error per 400 Mbyte-months.  In perhaps 2000
> Mbyte-months of operation, we've had one hard memory error.


It only takes once to crash a machine... and it will probably occur at
the least convenient time.  This might not sound so bad in a
university environment, but it could be disastrous elsewhere.  NeXT
really should have provided the extra bit per byte for parity
checking (and they probably will once we've all bought these initial
machines... grumble).



-- 

John T. Nelson			UUCP: sun!sundc!potomac!jtn
Advanced Decision Systems	Internet:  jtn@potomac.ads.com
1500 Wilson Blvd #512; Arlington, VA 22209-2401		(703) 243-1611

Shar and Enjoy!

geoff@desint.UUCP (Geoff Kuenning) (10/28/88)

CDC made that mistake, too, on the old 6000 series machines.  The way
I heard it, somebody "discovered" that most of the parity errors on the
3000 series were in the parity bits themselves.  So dropping the parity
bits would not only save money, but would cut pointless downtime.

Needless to say, the result was hard-to-trace problems.  I think
(though I'm not sure) that they installed parity again on the 7600.  I'm
pretty sure that the Cray machines have ECC memory.  One thing about
Seymour -- he learns from his mistakes.
-- 
	Geoff Kuenning   geoff@ITcorp.com   uunet!desint!geoff

ejf@well.UUCP (Erik James Freed) (10/28/88)

In article <549@gt-eedsp.UUCP> jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
>OK, straight from a NeXT sales rep in response to the question:
>Q: Does the memory have a parity check bit ?
>A: "No"
>The reason was that "memory is reliable enough that the added cost
>was not justified."  If you have ever worked on some older equipment
>without parity, your opinion may differ.  Could an expert on RAM
>chips respond ?   Is memory really "reliable enough" ?
>I was surprised to learn that the cold-start diagnostics do not
>check memory for errors - they just look to see if there is any
>memory there.  If I had a NeXT, I think I would have a crontab
>entry to check memory every day/night !

I think it was Seymour Cray who was quoted as saying
	"Parity is for farmers"
I would tend to support NeXT's decision.  Parity is supposed to let
you pinpoint where errors reside, but the software is rarely written
so that the information is easily available.  In general, if your system
memory is flaky, you will soon realize that something is up, and then you
can run memory tests to isolate the particular SIMM module.  (I assume that
a good memory-checking diagnostic will be available at a standalone level
for the NeXT.)  A usable (thorough) memory test takes a lot of time.  It is
not something that you want to run on every boot.  And parity memory, in
my experience, just is not useful enough (at least to justify the PC-board
real estate).  I now submit myself to the flames.
Erik

whh@pbhya.PacBell.COM (Wilson Heydt) (10/29/88)

In article <1807@desint.UUCP>, geoff@desint.UUCP (Geoff Kuenning) writes:
> Needless to say, the result was hard-to-trace problems.  I think
> (though I'm not sure) that they installed parity again on the 7600.  I'm
> pretty sure that the Cray machines have ECC memory.  One thing about
> Seymour -- he learns from his mistakes.

The story I heard was that Cray was told that US Government purchasing
*required* parity on computers, and that--therefore--he *had* to have
it--or his primary customers wouldn't be allowed to buy the machines.
Cray is reputed to have grumbled and complained about how parity slowed
things down and added six inches to the height of the box for the parity
bits & circuits.

    --Hal

=========================================================================
  Hal Heydt                             |    "Hafnium plus Holmium is
  Analyst, Pacific*Bell                 |     one-point-five, I think."
  415-645-7708                          |       --Dr. Jane Robinson
  {att,bellcore,sun,ames,pyramid}!pacbell!pbhya!whh   

jbn@glacier.STANFORD.EDU (John B. Nagle) (10/29/88)

      It's not that "the machine might crash".  It's that one might get
bad data and not know it.  Particularly in applications with long-lived
databases updated over time, any source of undetected error is intolerable.
Corrupted program objects might be generated.

      Some early MS-DOS machines, such as the Texas Instruments TI PRO,
lacked memory parity.  I at one time had one of these machines.  The
usual symptom of memory trouble was not a system crash, but junk in
newly compiled and linked executables.  It's really bad when you have
to compile the same program twice and compare the executables to ensure
that the compile and link were successful.

      That TI PRO became a doorstop in late 1984.

      The Mac II doesn't have memory parity either.  A bad move by Apple.

      I consider a machine without memory parity unacceptable for serious
work.  But then, NeXT is targeting the educational environment.

					John Nagle

hal@gvax.cs.cornell.edu (Hal Perkins) (10/29/88)

In article <1807@desint.UUCP> geoff@desint.UUCP (Geoff Kuenning) writes:
>CDC made that mistake, too, on the old 6000 series machines.  The way
>I heard it, somebody "discovered" that most of the parity errors on the
>3000 series were in the parity bits themselves.  So dropping the parity
>bits would not only save money, but would cut pointless downtime.


The way I heard it, the parity bit was omitted on the 6000 series to
save time.  The clock would have had to be slower to generate and check
parity.  Apparently they assumed that if a memory module went bad, it
would be obvious that there was a problem and the operator or field
engineer could run diagnostics.

It didn't work like that though.  I was operating a 6400 a couple of
times when a memory module failed.  The machine would start acting
weird, like it was having a nervous breakdown.  Jobs would abort for no
apparent reason and then work just fine when they were rerun, other
jobs would appear to run correctly, but when rerun would produce
different answers, parts of the operating system would abort or
deadlock, etc.  We learned that these symptoms probably meant a
hardware problem, but then we'd have to tell the engineers to rerun
their last couple of days' work to be safe, since there could have been
errors in their numbers before things got bad enough to be noticeable.

Later CDC machines as well as Crays have error-correcting memory,
which is essential in huge memories if you want to have acceptable
MTBF.

Personally, it's fine with me if a workstation-class machine doesn't
have ECC, but I would like to have parity so I know when something is
wrong.  I wouldn't want to be riding on an airplane designed on
machines without any form of error detection.

Hal Perkins               hal@cs.cornell.edu
Cornell CS

henry@utzoo.uucp (Henry Spencer) (10/29/88)

In article <549@gt-eedsp.UUCP> jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
>The reason was that "memory is reliable enough that the added cost
>was not justified."  If you have ever worked on some older equipment
>without parity, your opinion may differ.  Could an expert on RAM
>chips respond ?   Is memory really "reliable enough" ?

I'm not really an expert on RAM chips, but I do know that the reliability
of modern RAMs is *spectacularly* better than the ones that were routinely
in use 5-10 years ago.  Parity and error correction were fully justified
on the 4Kb and 16Kb chips; the 64Kbs were vastly better, the 256Kbs better
yet, and I imagine the 1Mbs are probably a further improvement.  We're
talking orders-of-magnitude improvement here.  My feeling is that parity
is nice but no longer a necessity.
-- 
The dream *IS* alive...         |    Henry Spencer at U of Toronto Zoology
but not at NASA.                |uunet!attcan!utzoo!henry henry@zoo.toronto.edu

james@bigtex.cactus.org (James Van Artsdalen) (10/30/88)

In <549@gt-eedsp.UUCP>, jensen@gt-eedsp.UUCP (P. Allen Jensen) wrote:

> The reason was that "memory is reliable enough that the added cost
> was not justified."  If you have ever worked on some older equipment
> without parity, your opinion may differ.  Could an expert on RAM
> chips respond ?   Is memory really "reliable enough" ?

I personally haven't found parity checking to be worthwhile.  I have
had three memory system errors on machines that had parity checking,
and only one of those errors was a chip.  None of those systems
reported parity errors until well after I had discovered or deduced
the problem myself, and the Apple Lisa never reported an error.

A large number of machines in the PC market effectively don't have
parity checking.  Many clones use Phoenix's BIOS, which has this habit
of disabling NMI and hence parity error reporting.  Microsoft's symdeb
debugger also leaves NMI disabled.  Many video cards do bizarre things
to NMI too.

For those not aware: the Intel 80x88 family has a design flaw that
requires external hardware to disable NMI.  Without such hardware it
is not possible to prevent the system from randomly crashing when NMIs
are used.
-- 
James R. Van Artsdalen      james@bigtex.cactus.org      "Live Free or Die"
Home: 512-346-2444 Work: 338-8789       9505 Arboretum Blvd Austin TX 78759

mvs@meccsd.MECC.MN.ORG (Michael V. Stein) (10/30/88)

In article <1807@desint.UUCP> geoff@desint.UUCP (Geoff Kuenning) writes:
>CDC made that mistake, too, on the old 6000 series machines.  The way
>I heard it, somebody "discovered" that most of the parity errors on the
>3000 series were in the parity bits themselves.  So dropping the parity
>bits would not only save money, but would cut pointless downtime.

I'm almost positive that old CDC machines had no form of parity bits.
I am positive that our old Cyber 73 had no parity.  

>I think
>(though I'm not sure) that they installed parity again on the 7600.

All of the later CDC machines had full SECDED (Single Error Correction,
Double Error Detection) support.  This meant that each of the 60-bit
words had an extra 11 bits of SECDED data associated with it.

-- 
Michael V. Stein - Minnesota Educational Computing Corp. - Technical Services
{bungia,uiucdcs,umn-cs}!meccts!mvs  or  mvs@mecc.MN.ORG

jim@belltec.UUCP (Mr. Jim's Own Logon) (10/31/88)

In article <1988Oct28.210152.29417@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
> In article <549@gt-eedsp.UUCP> jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
> >The reason was that "memory is reliable enough that the added cost
> >was not justified."  If you have ever worked on some older equipment
> >without parity, your opinion may differ.  Could an expert on RAM
> >chips respond ?   Is memory really "reliable enough" ?
> 
> I'm not really an expert on RAM chips, but I do know that the reliability
> of modern RAMs is *spectacularly* better than the ones that were routinely
> in use 5-10 years ago.

   According to the last court case I was a witness at, I'm an expert in 
memory system design. Credentials upon request. 

   While it is true that RAMs' resistance to corruption from spurious
alpha-particle hits has greatly improved, this has actually left them
more susceptible to other failures (which were more prevalent anyway).
The size of the memory cell and its rate of charge leakage were what made
earlier RAMs prone to bit errors from random alpha particles.  The newer
(higher-density) RAMs have smaller cells and less leakage, making them
more reliable.

   But the main cause of memory corruption has always been noise:
electrical noise from external sources (generators, power surges, etc.)
and from internal sources (other chips, peripherals, the power supply).
The less the memory-cell charge, the more susceptible it is to noise.

   It is also very true that some vendors' products are much more resilient
to signal noise than others (no, I won't name names).  In this era of RAM
shortage you can bet that a company will be scatter-buying its RAM to get
as much as possible as cheaply as possible.  It follows that some of the
NeXT machines will be better than others: some will never fail, some will
fall out in burn-in, and those in the middle....

   What should they have done? Make it a build option. If you want the extra
safety, you shell out the extra bucks. If you like going to Las Vegas, playing
the lottery, and Russian Roulette, you can have the base unit.


						-Jim Wall
						Bell Technologies, Inc.

The above opinions are mine.  However, in this case, the company would
probably go along with them.

henry@utzoo.uucp (Henry Spencer) (11/01/88)

In article <8348@alice.UUCP> debra@alice.UUCP () writes:
>NO. (memory is NOT reliable enough)
>
>I have seen memory go bad on ATs, Microvaxen, big Vaxen, ...

You're at Bell Labs CS research, right?  Do you use a Blit/5620/etc.?
If so, unless they've changed the hardware, you're using a machine with
no parity on its memory every day.  Is it a problem?

To expand on some of my earlier comments:

There is no such thing as perfectly reliable memory.  It's all a matter
of how much you want to pay for lower error rate.  If your memory chips
are good enough, parity may be past the point of diminishing returns.
I personally prefer it, but I don't insist on it.

Those who are smug about their PCs having parity might want to consider
three small complications:

1. There is no way for PC software to test the parity machinery, so it
	can go bad without notice.

2. At least one widely-used BIOS implementation has bugs in its handling
	of parity errors.

3. A lot of PC software essentially disables parity-error reporting.

(This is not first-hand information, but it's from a source I consider
quite reliable.)
-- 
The dream *IS* alive...         |    Henry Spencer at U of Toronto Zoology
but not at NASA.                |uunet!attcan!utzoo!henry henry@zoo.toronto.edu

cramer@optilink.UUCP (Clayton Cramer) (11/02/88)

In article <7493@well.UUCP>, ejf@well.UUCP (Erik James Freed) writes:
> I think it was Seymour Cray who was quoted as saying
> 	"Parity is for farmers"
> I would tend to support NeXT's decision.  Parity is supposed to let
> you pinpoint where errors reside, but the software is rarely written
> so that the information is easily available.  In general, if your system
> memory is flaky, you will soon realize that something is up, and then you
> can run memory tests to isolate the particular SIMM module.  (I assume that
> a good memory-checking diagnostic will be available at a standalone level
> for the NeXT.)  A usable (thorough) memory test takes a lot of time.  It is
> not something that you want to run on every boot.  And parity memory, in
> my experience, just is not useful enough (at least to justify the PC-board
> real estate).  I now submit myself to the flames.
> Erik

ARGGGH!  I had one terribly unpleasant experience with a lack of parity,
and it makes me firm in my belief that memory needs parity.

I was writing a gas station accounting system in BASIC on a Radio Shack
Model 3 (not my choice of hardware or language, obviously), and I had
to use someone else's system for an emergency bug fix.  I loaded in
my program, edited in my changes, then saved the program back to disk.
Then I tried to run it.  And it didn't work.  Lots of variables were
undefined, and I couldn't figure out why.

After a bit of study, it turned out that bit 13 had gone bad in 16K
of RAM.  As a consequence, Q turned into A, R turned into B, S turned
into C -- and the BASIC interpreter still accepted the lines, but
all the variables were thoroughly garbled.  (The keywords survived,
perhaps because they are stored as tokens rather than as text.)
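
For the curious, here is the pattern in miniature.  In ASCII terms the
stuck bit is bit 4 (value 0x10) of each byte, whatever number the memory
board gives it; with that bit stuck low, Q, R, and S collapse into A, B,
and C.  A quick sanity check of the anecdote:

    #include <stdio.h>

    int main(void)
    {
        char *vars = "QRS";
        int i;

        for (i = 0; vars[i] != '\0'; i++)
            printf("'%c' (0x%02X) with the bit stuck low becomes '%c' (0x%02X)\n",
                   vars[i], vars[i], vars[i] & ~0x10, vars[i] & ~0x10);
        return 0;
    }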

Fortunately, a friend drove up from Los Angeles with our master
disks, or I would have been in deep trouble.  Parity would have caught
this error, and prevented a huge loss of time and effort.

I can't take seriously a machine without memory parity.  Even the PC
has parity, and there are times it has caught memory errors.  Waiting
for those errors to become obvious before searching for them is a recipe
for frustrated users and corrupted data.

-- 
Clayton E. Cramer
..!ames!pyramid!kontron!optilink!cramer

crum@lipari.usc.edu (Gary L. Crum) (11/02/88)

Hmm...  Hacking the Mach memory manager to check pages using a bit of its
spare time might be fun; I don't know about the kernel's own space, though...
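
Something like the following, at its core.  The Mach hooks are the hard
part and aren't shown; the stable-page bookkeeping here is a hypothetical
stand-in, and it only works for pages that ought not to change (clean
text pages, say):

    #include <stdio.h>

    #define PAGESIZE 4096

    /* s = s*31 + byte is just a cheap illustrative checksum. */
    unsigned long checksum(unsigned char *p)
    {
        unsigned long s = 0;
        int i;

        for (i = 0; i < PAGESIZE; i++)
            s = s * 31 + p[i];
        return s;
    }

    unsigned char page[PAGESIZE];   /* stand-in for a page that should
                                       stay clean (e.g. program text)  */
    int main(void)
    {
        unsigned long saved = checksum(page);  /* taken when page went clean */

        /* ... later, in spare time: a "stable" page whose checksum has
         * changed underneath us is evidence of a flipped bit.          */
        if (checksum(page) != saved)
            printf("clean page changed -- possible memory error\n");
        return 0;
    }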

mitchell@cadovax.UUCP (Mitchell Lerner) (11/03/88)

In article <549@gt-eedsp.UUCP> jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
>OK, straight from a NeXT sales rep in response to the question:
>Q: Does the memory have a parity check bit ?
>A: "No"
>
>The reason was that "memory is reliable enough that the added cost
>was not justified."  If you have ever worked on some older equipment
>without parity, your opinion may differ.  Could an expert on RAM
>chips respond ?   Is memory really "reliable enough" ?

Well, where I work, we used to make machines that had parity checking
and then we made machines that didn't and now we make them that do again.

When I asked about this, I remember the answers that I got were something 
like this:

Older memories would/could fail one cell at a time and in this case it is
crucial to have parity checking logic in the hardware.  So this is why
we had parity checking in our older machines.

RAMs don't fail one cell at a time.  If a failure occurs in a RAM, the
entire bank will fail, and other hardware detects that; so having parity
checking in a RAM-type machine is a waste of time, money, and space.
The customers, though, don't know that parity is antiquated with newer
memories, but they still ask for it.  And this is why we put it in our
newest machines: not because it could do anything useful, but as
a checklist item.  I believe that this is why IBM put parity in their
PCs.  And since IBM put it in their PCs, everybody thinks that it is
necessary for all machines.  In their PC, it certainly doesn't do anything!

This is what I heard, not necessarily what was said.  I may
be wrong.  And it is certainly not a technical statement of Contel
Business Systems.
-- 
Mitchell Lerner -- UUCP:  {ucbvax, decvax}!trwrb!cadovax!mitchell
	Don't speak to him of your heartache, for he is speaking.
	He feels the touch of an ants foot.  If a stone moves under the 
	water he knows it.

mitchell@cadovax.UUCP (Mitchell Lerner) (11/03/88)

I apologize for the half truths (which are often more harmful than lies) 
in my last posting.

What I now understand is this:

In the old days, memory wasn't that reliable, and parity checking was
implemented to bring the system down quickly so that the damage from
corrupt data was minimal.

Parity used to be implemented on buses between processors and memory,
but the logic and the technology got so refined that hardware people
eventually found that they never had an error across these channels, so
they removed parity in that area of the system.

Today's memory is VERY reliable and (he said) it virtually never fails one
cell at a time; usually the entire bank or group of banks fails.  I suppose
that memory errors like this make failures much more obvious these days, and
the system will come down pretty quickly in the case of a memory failure.

The logic used for parity checking can introduce more errors into the
system if it should fail.

Implementing parity on a system slows the system down.  With 100 ns
memories and 200 ns to compute parity, one cannot run a system as fast
as without parity.

When I told him that the NeXT computer was implemented without parity,
he said: "Well, I guess that guy is smarter than I give him credit for" :-)

We build multi-user business systems that are used for on-line accounting, 
order-entry, billing and such.  People's businesses depend on our systems and 
from what I understand, our systems are very reliable (software and hardware).

I talked to some of our field support people and they said that memories
just don't fail that often these days.  "We just don't see data disasters 
caused by memory failing these days".

Just one man's thoughts, not the opinion of Contel Business Systems.
-- 
Mitchell Lerner -- UUCP:  {ucbvax, decvax}!trwrb!cadovax!mitchell
	Don't speak to him of your heartache, for he is speaking.
	He feels the touch of an ants foot.  If a stone moves under the 
	water he knows it.

joel@peora.ccur.com (Joel Upchurch) (11/05/88)

Speaking of error checking: I wonder how many of the manufacturers of
computers using memory caching use parity checking on the cache memory
as well as on the main memory?  A lot of 386 machines have larger cache
memories than the original PC had as main memory.  Not only the data in
the cache but also the translation addresses need checking.  And if the
processor has loadable microcode, how about the microcode control store?

Personally if my life depended on it I'd prefer having redundant computers
to having a lot of error checking in a single computer. And if the
consequences are really disastrous, I'd have two or more different kinds
of computers running different programs written by different teams.
-- 
Joel Upchurch/Concurrent Computer Corp/2486 Sand Lake Rd/Orlando, FL 32809
joel@peora.ccur.com {uiucuxc,hoptoad,petsd,ucf-cs}!peora!joel (407)850-1040

ralphw@ius3.ius.cs.cmu.edu (Ralph Hyre) (11/07/88)

In article <8348@alice.UUCP> debra@alice.UUCP () writes:
>In article <549@gt-eedsp.UUCP> jensen@gt-eedsp.UUCP (P. Allen Jensen) writes:
>>OK, straight from a NeXT sales rep in response to the question:
>>Q: Does the memory have a parity check bit ?
>>A: "No"
>>... "memory is reliable enough that the added cost was not justified."
>NO. (memory is NOT reliable enough)
>
And this is such a religious issue that I believe it should be left up to
the end user/systems integrator.  Add an 'extra' SIMM socket or two for a
bank of chips to be used for parity/ECC, and make sure it is jumper- or
even software-selectable, depending on the user's taste and wealth.  For
some applications (like digitized speech), I might rather have 10M of
99.9999% reliable memory than 9M+parity (all Unix can do is panic) or
8M+ECC.

Anyone want to design a memory controller/MMU for this?
-- 
					- Ralph W. Hyre, Jr.
Internet: ralphw@ius3.cs.cmu.edu    Phone:(412) CMU-BUGS
Amateur Packet Radio: N3FGW@W2XO, or c/o W3VC, CMU Radio Club, Pittsburgh, PA
"You can do what you want with my computer, but leave me alone!8-)"

alien@cpoint.UUCP (Alien Wells) (12/15/88)

Disclaimer:  I work for a company whose main business is producing aftermarket
	memories.  As such, I am exposed to the memory business - but I cannot
	claim to be a memory expert.

Memory reliability is extremely important in a computer.  With decreasing cell
sizes, it is becoming easier to have spurious bit errors, and the larger 
memory sizes lead to increased probabilities of failures.  Even before joining
Clearpoint, I considered the lack of parity to be a major problem with the 
Macintosh.  I am extremely surprised to see it repeated by NeXT.

Some figures about memory reliability.  Prof. McEliece (Caltech), in a paper
called "The Reliability of Computer Memories" (Jan 1985 - Scientific
American), estimated the soft failure rate of a single memory cell at 1
every 1,000,000 years.  In a 1MB board with parity - this is a MBTF of 43
days.  TI estimates MBTF more optimistically (no surprise).  For their 64K
DRAMs they estimate a MBTF of 33.4 days for an 8MB system.  AMD estimated a
16MB system would have an MBTF of 13 days.

These error rates and MBTFs are for 64K DRAMs.  Since 1MB DRAMs are
considered to have twice as many errors per device, but 16 times the bits,
multiply the above times by a factor of 8 to get MBTF estimates for 1MB
chips.  Thus, the optimistic TI estimate would lead to an extrapolation of
an 8-month MBTF for soft errors for an 8MB system using 1MB memory chips.
Prof. McEliece's figures would extrapolate to 43 days for an 8MB system.
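
The arithmetic behind those extrapolations, for anyone who wants to check
it (failures combine in parallel, so a board's MBTF is the cell MBTF
divided by the number of cells):

    #include <stdio.h>

    int main(void)
    {
        double cell_years = 1.0e6;           /* McEliece: MBTF of one cell */
        double cells_1mb  = 8.0 * 1024 * 1024;        /* bits in 1 Mbyte   */
        double board_days = cell_years / cells_1mb * 365.25;

        printf("1MB board, 64K-chip error rate: %.1f days\n", board_days);
        printf("8MB system, same rate:          %.1f days\n", board_days / 8);
        /* Multiply the 8MB figure by 8 for 1MB chips (twice the errors
         * per device, 1/16 as many devices): back at ~43 days.          */
        return 0;
    }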

TI estimates hard errors to be roughly 1/5 to 1/3 as likely as soft errors.

Any 'reasonable' memory or computer manufacturer will use a 72-hour burn-in
to ensure that infant-mortality problems are found before shipment, but I
think that the above figures are a compelling argument for a system-level
approach to handling errors in the field.  The simplest thing to do is
parity checking.  However, more and more vendors are using VLSI to
incorporate Error Detection and Correction (EDC) circuitry on their
memory boards.

Standard EDC will detect 2 errors and correct 1 in the word size it
deals with.  The number of check bits required grows with the log (base 2)
of the word size.  Thus, the following chart shows the memory overhead
required:

Word Size		EDC Check Bits		8-bit Parity Bits
---------		--------------		-----------------
     8			      5				1
    16			      6				2
    32			      7				4
    64			      8				8

As you can see, by the time you get to 64 bit memory - there really isn't a
reasonable excuse to not use EDC.  (Of course, you could start using 16 bit
parity ... but the protection is significantly diluted)  Even 32 bit memories 
are seeing EDC used more and more often.
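
Where the chart comes from: the smallest r with 2^r >= m + r + 1 gives
single-error correction over m data bits (the Hamming bound), and one more
overall-parity bit adds double-error detection.  A few lines of C reproduce
the column:

    #include <stdio.h>

    int secded_bits(int m)
    {
        int r = 1;

        while ((1L << r) < m + r + 1)   /* Hamming bound for SEC        */
            r++;
        return r + 1;                   /* +1 overall parity bit for DED */
    }

    int main(void)
    {
        int m;

        for (m = 8; m <= 64; m *= 2)
            printf("%2d data bits -> %d check bits\n", m, secded_bits(m));
        return 0;
    }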

In conclusion - I think that NeXT is bucking the trend in moving to no
protection at all instead of moving to EDC protection for their memory.  If
the NeXT machine takes off, I expect that there will be a demand for 0MB
NeXT boxes which get populated with a 3rd-party memory board - just for the
reliability concerns.  (-: Unless the claim is that the University
Environment doesn't care about reliable operation any more than it cares
about packaged software. :-)

For anyone who is interested in designing, evaluating, or purchasing computer
memories, Clearpoint publishes a 70+ page "bible" entitled "The Designer's
Guide to Add-In Memory".  This is chock full of good information, and very
light on the propaganda.  It is available at no charge by calling:
	1-800-CLEARPT


Apologies:  I thought I had sent this quite a while back, and recently found
	that I had not.  I apologize if this seems dated.

johns@calvin.EE.CORNELL.EDU (John Sahr) (12/17/88)

In an article, Alien Wells gives interesting information about MTBF
for soft and hard errors of memory.  Also, the following information
about performance of single error correct/double error detect is 
given:

In article <1429@cpoint.UUCP> alien@cpoint.UUCP (Alien Wells) writes:
>Standard EDC will detect 2 errors and correct 1 in the word size it
>deals with.  The number of check bits required grows with the log (base 2)
>of the word size.  Thus, the following chart shows the memory overhead
>required:
>
>Word Size		EDC Check Bits		8-bit Parity Bits
>---------		--------------		-----------------
>     8			      5				1
>    16			      6				2
>    32			      7				4
>    64			      8				8
>
>As you can see, by the time you get to 64 bit memory - there really isn't a
>reasonable excuse to not use EDC.  (Of course, you could start using 16 bit
>parity ... but the protection is significantly diluted)  Even 32 bit memories 
>are seeing EDC used more and more often.

The parity-check versus EDC comparison is not quite fair, because they are
really doing two different things.  For a 64-bit word, although EDC can
detect 2 errors and correct 1, the parity check can detect up to 8 errors
(while correcting none).  So the tradeoff is not quite so clear.  Although
single error correction with 2-error detection is straightforward, parity
checking is necessarily faster: in fact, the 2-error detect is just a global
parity check built on top of the Hamming code (single error correct).
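
In outline, the decode step classifies each word from the Hamming syndrome
plus the overall parity bit.  The syndrome computation itself is omitted --
this is just the decision logic:

    #include <stdio.h>

    enum edc_result { EDC_OK, EDC_CORRECTED, EDC_DOUBLE_ERROR };

    enum edc_result classify(unsigned syndrome, int parity_bad)
    {
        if (syndrome == 0 && !parity_bad)
            return EDC_OK;           /* clean word                       */
        if (parity_bad)
            return EDC_CORRECTED;    /* odd number of flips: assume one;
                                        the syndrome locates the bad bit
                                        (syndrome 0 = the parity bit)    */
        return EDC_DOUBLE_ERROR;     /* even flips, nonzero syndrome:
                                        detected but not correctable     */
    }

    int main(void)
    {
        printf("%d %d %d\n",
               classify(0, 0),    /* no error     -> EDC_OK           (0) */
               classify(5, 1),    /* single error -> EDC_CORRECTED    (1) */
               classify(5, 0));   /* double error -> EDC_DOUBLE_ERROR (2) */
        return 0;
    }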

As for the absence of any checking on the Mac or NeXT, I think it is
defensible; single-error detection per word would be nice, however.

(ps. error detection is a little hobby of mine; I've taken a few classes,
that's all)
-- 
John Sahr,                          School of Elect. Eng.,  Upson Hall   
                                    Cornell University, Ithaca, NY 14853

ARPA: johns@calvin.ee.cornell.edu; UUCP: {rochester,cmcl2}!cornell!calvin!johns

edwardm@hpcuhc.HP.COM (Edward McClanahan) (12/17/88)

/ hpcuhc:comp.sys.next / alien@cpoint.UUCP (Alien Wells) /  6:53 pm  Dec 14, 1988 /

> ...this is a MBTF of 43 days...

You used this acronym so consistently, I wasn't sure...  But you must mean
MTBF.

I don't know of a single PC-class computer that uses ECC memory.  The IBM PC
uses PARITY to DETECT errors.  I believe that the Atari, Apple, Commodore,
Compaq, Dell, etc. "affordable" computers don't even have parity!  One could
argue that cost is the determinant.  The obvious rebuttal to this tack is the
fact that several of these manufacturers sell machines in the $10,000 range.

In the PC days of the past, memory failures may have been acceptable.  No
cached/paged data needed to be flushed/posted for consistency.  In fact,
crashes are no big deal on a vintage PC (unless your editor doesn't do
frequent saves).  No LAN would be left in an inconsistent state.

All that is changing quickly.  OS/2 and UNIX both have virtual memory.  Many
of the high-performance PCs contain cache (albeit presently usually of the
write-through variety).  RAM disks are quite common.  And finally, a large
percentage of PCs are being integrated into LANs.  We all witnessed how
quickly "corruption" can infect other machines on these LANs (refer to the
reports on the recent Internet worm).

If all these concerns are valid, where are all the ECC memory add-on boards?
Also, which NeXT competitors use ECC memory?

ed "I still wanna NeXT" mcclanahan


baum@Apple.COM (Allen J. Baum) (12/17/88)

[]
>In article <1429@cpoint.UUCP> alien@cpoint.UUCP (Alien Wells) writes:
>>As you can see, by the time you get to 64 bit memory - there really isn't a
>>reasonable excuse to not use EDC.

Um, there's a slight gotcha in doing ECC on a 64-bit chunk.  It's not
possible to write a byte anymore.  You must read all 64 bits, substitute
the byte, and write the 64 bits + new ECC.  This is generally a sufficient
reason to avoid ECC in low-cost systems.  Note that if you have a cache that
reads and writes 64-bit chunks to main memory anyway, it may not be a big
deal, until you have to worry about handling uncached writes, memory-mapped
I/O, ......
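
In code, the byte store becomes read / merge / re-encode / write.  (The
encoder below is a stand-in -- any function of the whole word illustrates
the data flow; a real one generates the 8 SEC-DED check bits.)

    #include <stdio.h>

    typedef unsigned long long word64;

    struct ecc_word { word64 data; unsigned char check; };

    /* Stand-in encoder: XOR of the 8 bytes. */
    unsigned char ecc_encode(word64 w)
    {
        unsigned char c = 0;
        int i;

        for (i = 0; i < 8; i++)
            c ^= (unsigned char) (w >> (8 * i));
        return c;
    }

    void write_byte(struct ecc_word *m, int byte_no, unsigned char b)
    {
        int shift = byte_no * 8;
        word64 w = m->data;                     /* 1. read the whole word */

        w = (w & ~((word64) 0xFF << shift))     /* 2. merge in the byte   */
          | ((word64) b << shift);
        m->check = ecc_encode(w);               /* 3. recompute the ECC   */
        m->data  = w;                           /* 4. write it all back   */
    }

    int main(void)
    {
        struct ecc_word m;

        m.data  = 0x1122334455667788ULL;
        m.check = ecc_encode(m.data);
        write_byte(&m, 2, 0xAB);                /* four accesses, not one */
        printf("data %016llx  check %02x\n", m.data, m.check);
        return 0;
    }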
--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

bkliewer@iuvax.cs.indiana.edu (Bradley Dyck Kliewer) (12/17/88)

In article <680002@hpcuhc.HP.COM> edwardm@hpcuhc.HP.COM (Edward McClanahan) writes:
>If all these concerns are valid, where are all the ECC memory add-on boards?
>Also, which NeXT competitors use ECC memory?
Well, there's always the Orchid ECCell board (which I have in my IBM AT).
But it's not the hottest seller on the market, and as far as I know, they
don't have a Micro Channel or 32-bit version of the card.  It would appear
that end users don't think error correction is important (whether this
is simply over-confidence in the technology is hard to say).  If I remember
correctly, there is little price difference between the Orchid and similar
(non-ECC) cards, so I don't think price is the motivating factor here,
assuming RAM prices become reasonable again (which they surely will).

Bradley Dyck Kliewer                Hacking...
bkliewer@iuvax.cs.indiana.edu       It's not just an adventure
                                    It's my job!

jbn@glacier.STANFORD.EDU (John B. Nagle) (12/19/88)

In article <15877@iuvax.cs.indiana.edu> bkliewer@iuvax.UUCP (Bradley Dyck Kliewer) writes:
>assuming RAM prices become reasonable again, which they surely will.

       I don't expect this to happen.  Now that the Japanese manufacturers
have achieved total market dominance, prices will be coordinated by the
makers (which is legal in Japan) and will fall slowly, if at all.  The
era of "forward pricing" is over in RAM.  Observation of the price trend
in cars, color TVs, and VCRs will indicate the strategy.  

       Yes, there will be 4Mb and 16Mb RAMs.  But, just as we have seen with
the 1Mb RAMs, they will not be priced so as to kill the market in smaller
RAMs until sufficient time has elapsed, say five years, that the investment
in the older technology has been repaid.

       Seen in this light, the rumor that 4Mb will be skipped and the
RAM industry will go directly to 16Mb makes more sense.  Having achieved
coordination, it makes sense to wait until the 1Mb technology is fully
amortized while working out the 16Mb production process, then introduce
the new model in a controlled way.  This is how a cartelized industry
operates.

       One implication of this is that we cannot rely on advances in 
semiconductor technology to save us from the tendency of software to grow
in size without bound.

					John Nagle

prem@andante.UUCP (Swami Devanbu) (12/28/88)

How can the Japanese zaibatsu engage in price fixing while selling
in countries (like the US) which do not allow anti-free-market
practices?  If such price fixing is indeed being conducted, can
American manufacturers not bring legal action against the Japanese
manufacturers and put an end to it?  It shouldn't really matter
that they are Japanese companies, as long as they are selling
in US markets.

Prem Devanbu
AT&T Bell Laboratories,
(201) 582 - 2062
{...}!allegra!prem
prem%allegra@research.att.com

izumi@violet.berkeley.edu (Izumi Ohzawa) (12/29/88)

In article <14723@andante.UUCP> prem@andante.UUCP (Swami Devanbu) writes:
>
>How can the Japanese Zaibatsu engage in price fixing while selling
>in countries (like the US) which do not allow anti-freemarket
>practices ? If such price fixing is indeed being conducted, can
>Prem Devanbu
>AT&T Bell Laboratories,

Are you talking about the pricing of DRAMs and EPROMs??
Well, SURPRISE!!  The price fixing is imposed by the US Government
in the first place.  Yeah, I wonder why the government of
a country where cartels are prohibited is doing exactly that itself.

Izumi Ohzawa
izumi@violet.berkeley.edu