[comp.arch] Error rates

schow@bcarh185.bnr.ca (Stanley T.H. Chow) (12/15/90)

In article <11393@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>In article <PCG.90Dec12174320@odin.cs.aber.ac.uk> 
>	pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>>There are even drives, magnetic or not, whose mean undetected error rate
>>is of the same order as their capacity, so virtually guaranteeing that
>>you get an undetected error every time you make a copy of them.  After
>>all even a fairly respectable undetected error rate of 1 in 10^12 is
>>usually expressed in bits.
>
>Good point! Creo's 1 TB optical tape holds 10^12 bytes and has "fewer
>than 1 in 10^12" bit errors. The pessimistic reading is "fewer
>than 8 mistakes per reel". It doesn't wash to say that one is storing
>(say) images, where errors will be unnoticeable. Images are usually
>stored in some compressed form, and decompression should be a pretty
>good error magnifier.
>
>Rather than expecting perfection, we should probably expect systems
>to have selectable, adjustable amounts of protection.

Question 1:
 
 What do the specified error rates mean?

 Are we talking about 1 (undetected) error per 10^12 bits read, on
 average? (Even if we just read the same bit 10^12 times?)

 Or does it mean 1 error for every 10^12 bits stored, so that a bit that
 was read correctly once is expected to read correctly forever (or for
 >> 10^12 further reads)?
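
 To make the difference concrete, here is a rough back-of-the-envelope
 sketch. The reel size and the "1 in 10^12" figure are Don's; treating
 them as probabilities the way the comments do is exactly the
 assumption I am asking about.

    /* Sketch only: what each reading of "1 in 10^12" implies. */
    #include <stdio.h>

    int main(void)
    {
        double bits_on_reel = 8e12;   /* 10^12 bytes               */
        double rate         = 1e-12;  /* "fewer than 1 in 10^12"   */

        /* Reading 1: every *read* of a bit fails independently with
         * probability 1e-12, so a reread of a failed bit almost
         * certainly succeeds.                                      */
        printf("reading 1: expected errors per full pass = %g\n",
               bits_on_reel * rate);
        printf("reading 1: expected failures rereading one bit "
               "10^12 times = %g\n", 1e12 * rate);

        /* Reading 2: 1e-12 of the *stored* bits are simply bad, and
         * rereading them will not help.                            */
        printf("reading 2: expected bad bits on the reel = %g\n",
               bits_on_reel * rate);
        return 0;
    }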


Question 2:

 Assuming we want systems to have "selectable, adjustable amounts of
 protection", which component of the system should handle the error
 correction code, etc.?

 In the case under discussion (Creo's 1 TB optical tape), it seems clear
 that neither the drive nor the controller is the right place. Storage
 that is both fast and highly reliable is expensive enough that we don't
 want to pay for it in every application. We are then left with the
 choice of either the O/S or the application.

Question 3:

 What facilities should the O/S provide? An error-corrected file system?
 Almost no O/S today does this - they all rely on the underlying H/W to
 do the error detection/correction. (Actually, desktop systems like
 MS-DOS and the Amiga do keep checksums for every block on disk, but
 there is still no real attempt to handle the errors.)

 What are the costs of such a "reliable" file system? Does this mean we
 have to give up DMA directly into the user's buffer?
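
 As a strawman, the per-block checksum layer mentioned above might look
 roughly like this. The block layout, the (toy) checksum, and what to
 do on a mismatch are all my own assumptions, not how MS-DOS or the
 Amiga actually do it.

    /* Hypothetical per-block checksum check.                        */
    #include <stddef.h>

    #define BLOCK_BYTES 512

    struct disk_block {
        unsigned long checksum;           /* stored with the block   */
        unsigned char data[BLOCK_BYTES];  /* payload for the user    */
    };

    static unsigned long sum32(const unsigned char *p, size_t n)
    {
        unsigned long s = 0;
        while (n--)
            s += *p++;                    /* toy additive checksum   */
        return s;
    }

    /* Returns 0 if the block verifies, -1 if not.  The open question
     * is what the caller should then do: reread, remap, or pass the
     * error up to the application.                                  */
    int verify_block(const struct disk_block *b)
    {
        return sum32(b->data, BLOCK_BYTES) == b->checksum ? 0 : -1;
    }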

Question 4:

 If the applications have to handle the errors, how can applications be
 device independent and yet have an optimal error correction strategy?

Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!uunet!bnrgate!bcarh185!schow
(613) 763-2831               ..!psuvax1!BNR.CA.bitnet!schow
Me? Represent other people? Don't make them laugh so hard.

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (12/16/90)

In article <3838@bnr-rsc.UUCP> bcarh185!schow@bnr-rsc.UUCP 
	(Stanley T.H. Chow) writes:
>In article <11393@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>>Creo's 1 TB optical tape holds 10^12 bytes and has "fewer
>>than 1 in 10^12" bit errors. The pessimistic reading is "fewer
>>than 8 mistakes per reel".

> What do the specified error rates mean?

In general, it means that any bit has one chance in 10^12 of being
wrong. However, that's merely the standard abstraction, found in
marketing literature.  When you get down to actually building
specific devices, you deal in various error sources, and characterize
each. For example, the Creo stores 64 KB of data with 16 KB of ECC,
making an 80 KB physical record. So, you would want to know the
chance of a given one-record read having an uncorrectable error.
Since optical systems are susceptible to dust, you would also want to
know the chance that the error was soft, i.e. that a reread would
succeed.
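
The shape of that calculation is roughly the following. The raw bit
error rate and the number of correctable bits per record are invented
for illustration; a real code interleaves and corrects symbols, not
isolated bits.

    /* Chance that one 80 KB record read is uncorrectable, using a
     * Poisson approximation to the binomial.  Numbers are made up.  */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double record_bits = 80.0 * 1024 * 8; /* 80 KB phys. record */
        double p_raw       = 1e-5;   /* assumed raw bit error rate  */
        double lambda      = record_bits * p_raw;
        int    t           = 16;     /* assumed correctable bits    */
        double p_le_t      = 0.0;    /* P(raw errors <= t)          */
        double term        = exp(-lambda);
        int    k;

        for (k = 0; k <= t; k++) {   /* sum the Poisson pmf up to t */
            p_le_t += term;
            term   *= lambda / (k + 1);
        }
        printf("expected raw errors per record: %g\n", lambda);
        printf("P(uncorrectable record read)  : %g\n", 1.0 - p_le_t);
        return 0;
    }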

> Assuming we want systems to have "selectable, adjustable amounts of
> protection", which component of the system should handling the error
> correction code, etc.?

This stuff is best done in hardware: if not in the drive, then in
the controller. I don't see any reason why that hardware can't
allow the software to select from some limited menu.
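
For instance, the "menu" could be as small as this, driven by one
control call from the software. The names and levels below are purely
hypothetical; no existing drive or controller is being described.

    /* Hypothetical protection menu offered by a drive/controller.  */
    enum ecc_level {
        ECC_NONE,      /* raw medium; application does its own ECC  */
        ECC_DETECT,    /* checksum only: report errors, don't fix   */
        ECC_CORRECT,   /* drive/controller corrects what it can     */
        ECC_VERIFY     /* correct, then read back and compare       */
    };

    /* e.g. (hypothetical): ioctl(tape_fd, TAPE_SET_ECC, ECC_CORRECT); */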

> What facilities should the O/S provide? An error corrected file system?

Well, it would be nice if media error rates (particularly corrected-
error rates) could be logged in some coherent fashion. This is more a
management issue than an applications issue: perhaps a disk is
planning to fail, or a tape drive needs cleaning.
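
Concretely, I have in mind something like a per-device counter block
(the field names are my own invention, not any existing O/S):

    /* Per-device media error counters the O/S could keep and log.  */
    struct media_error_log {
        unsigned long reads;            /* total read operations     */
        unsigned long soft_errors;      /* corrected by ECC or reread */
        unsigned long hard_errors;      /* uncorrectable, reported up */
        unsigned long last_error_block; /* where the latest one hit  */
    };
    /* A rising soft_errors/reads ratio is the early warning; the
     * hard errors are the ones someone above must already handle.  */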

Arbitrarily high reliability is achieved by replication - for
instance, multiple backup tapes, kept in different buildings. How
much protection is enough? Well, it's usually figured that the chance
of a subsystem mangling data should be [..hand wave..] less than the
chance of some other subsystem mangling it. More reliability than
that is a waste of money; less reliability than that is asking to
be the goat. Does anyone have a current figure on the bit error rate
going _to_ the storage system?
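
For the replication arithmetic, under the usual (and optimistic)
assumption that the copies fail independently: with per-copy loss
probability p and n copies, everything is lost with probability p^n.
The per-tape figure below is invented.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double p = 1e-3;  /* assumed chance one backup tape is lost */
        int n;

        for (n = 1; n <= 3; n++)
            printf("%d cop%s: P(all copies lost) = %g\n",
                   n, n == 1 ? "y" : "ies", pow(p, (double) n));
        return 0;
    }
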
-- 
Don		D.C.Lindsay .. temporarily at Carnegie Mellon