[comp.unix.sysv386] Automatic bad sector mapping

john@jwt.UUCP (John Temples) (06/07/91)

In article <767@dumbcat.sf.ca.us> marc@dumbcat.sf.ca.us (Marco S Hyman) writes:
>I haven't found this in TFM yet -- perhaps the net can help.  Given an error
>message that says something like "SCSI absolute sector 1234 on drive 1 is bad"

Which of the 386 UNIXes do automatic bad sector mapping?  I know that
ESIX does, and ISC 2.0.2 does not.  What about the newer ISCs, AT&T,
SCO, and the various SVR4 implementations?  This is a *really* nice
feature, in my opinion.
-- 
John W. Temples -- john@jwt.UUCP (uunet!jwt!john)

cpcahil@virtech.uucp (Conor P. Cahill) (06/09/91)

john@jwt.UUCP (John Temples) writes:

>Which of the 386 UNIXes do automatic bad sector mapping?  I know that
>ESIX does, and ISC 2.0.2 does not.  What about the newer ISCs, AT&T,
>SCO, and the various SVR4 implementations?  This is a *really* nice
>feature, in my opinion.

It is not a nice feature at all. At one time we were running Bell
Technologies 3.1 or so and it had automatic bad sector mapping.  
This caused great headaches when a directory magically appeared with 
bogus data in the middle, or an executable all of a sudden had a 
block of zeros in the middle of it.  

Our hardware at the time (also Bell Tech) had a problem with static
electricity that would cause a series of sectors to appear bad whenever
the machine was touched.  This torched our system several times and
the only way to fix it was a full restore (trying to patch around
it is almost impossible when /usr/bin or /bin get blown away).

Yes, it should be easier to map blocks, but IMHO automatic mapping is 
not the answer.
-- 
Conor P. Cahill            (703)430-9247        Virtual Technologies, Inc.
uunet!virtech!cpcahil                           46030 Manekin Plaza, Suite 160
                                                Sterling, VA 22170 

john@jwt.UUCP (John Temples) (06/10/91)

In article <1991Jun09.133624.2806@virtech.uucp> cpcahil@virtech.uucp (Conor P. Cahill) writes:
>>Which of the 386 UNIXes do automatic bad sector mapping?

>It is not a nice feature at all. At one time we were running Bell
>Technologies 3.1 or so and it had automatic bad sector mapping.  
>This caused great headaches when a directory magically appeared with 
>bogus data in the middle, or an executable all of a sudden had a 
>block of zeros in the middle of it.  

The ESIX implementation catches errors while they're still "soft,"
i.e., the error is recoverable.  So remapping occurs with no data loss,
as long as the first time a sector has an error it isn't a hard error.
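
A minimal sketch of the behaviour described above, not the actual ESIX
driver: a read that succeeds only with ECC help (a "soft" error) copies the
recovered data to a spare sector and records the remap, while a "hard"
error has nothing left to copy.  The class name ToyDisk, its fields, and
the sector numbers are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDisk:
    """Toy model of a disk with spare sectors (all names are hypothetical)."""
    data: dict                                   # physical sector -> bytes
    spares: list                                 # unused physical spare sectors
    soft: set = field(default_factory=set)       # sectors readable only via ECC
    hard: set = field(default_factory=set)       # sectors that cannot be read
    remap: dict = field(default_factory=dict)    # logical sector -> spare sector
    log: list = field(default_factory=list)      # messages shown to the operator

    def read(self, sector):
        # Follow an existing remap entry, if any.
        phys = self.remap.get(sector, sector)
        if phys in self.hard:
            # Hard error: the data is gone, so remapping cannot save it.
            self.log.append(f"sector {sector}: hard error, data lost")
            raise OSError(f"unrecoverable read error on sector {sector}")
        payload = self.data[phys]
        if phys in self.soft:
            # Soft error: the data was recovered, so move it while it is good.
            if self.spares:
                spare = self.spares.pop(0)
                self.data[spare] = payload
                self.remap[sector] = spare
                self.log.append(f"sector {sector}: soft error, remapped to {spare}")
            else:
                self.log.append(f"sector {sector}: soft error, no spares left")
        return payload
```

Reading a soft sector once returns the data and leaves one log entry;
later reads of the same logical sector go to the spare and log nothing.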

>Our hardware at the time (also Bell Tech) had a problem with static
>electricity that would cause a series of sectors to appear bad whenever
>the machine was touched.

You're saying that since the feature didn't handle an unusual hardware
problem, it's bad?  I think I'm more concerned with it handling the
more likely hardware failures well.  My experience with ESIX is that
all errors have been caught while still soft -- my system kept right on
running without a hitch.  With ISC -- boom, I'm told I've got bad
sectors; then the backup/mkpart/fsck/restore headache begins.  I've
never spent one second of my time handling bad blocks under ESIX; under
ISC, hours have been wasted.

>Yes, it should be easier to map blocks, but IMHO automatic mapping is 
>not the answer.

What if you had the option of having the driver report problems, and
you had to give your OK for it to proceed with remapping?  Or better
yet, you could select between that mode and fully automatic mode.
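
Roughly what I have in mind, as a sketch only.  No vendor ships this as far
as I know; RemapMode, handle_soft_error, and the two callbacks are names
invented here for illustration.

```python
from enum import Enum


class RemapMode(Enum):
    """Hypothetical driver setting selecting how soft errors are handled."""
    CONFIRM = "confirm"   # report, then wait for the operator's OK
    AUTO = "auto"         # report, then remap without asking


def handle_soft_error(sector, mode, ask_operator, notify):
    """Return True if the sector should be remapped.

    notify(message) is always called, so the problem is reported in
    both modes.  ask_operator(sector) is consulted only in CONFIRM mode.
    """
    notify(f"soft error on sector {sector}")
    if mode is RemapMode.AUTO:
        return True
    return bool(ask_operator(sector))
```

Either way the error is reported; the only difference is whether the
remap waits for a human.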
-- 
John W. Temples -- john@jwt.UUCP (uunet!jwt!john)

chip@chinacat.unicom.com (Chip Rosenthal) (06/11/91)

In article <1991Jun10.025527.10161@jwt.UUCP>
	john@jwt.UUCP (John Temples) writes:
>The ESIX implementation catches errors while they're still "soft,"
>i.e., the error is recoverable.  So remapping occurs with no data loss,
>as long as the first time a sector has an error it isn't a hard error.

Sorry...I still think Conor is right.  On-the-fly bad sector mapping
is the machine being too damn smart for its own good.  If you've set up
the system from the beginning with the disk manufacturer's flaw map,
you should have no call for this `feature'.  If you are looking for
this `feature' as a way of eking out another meg or two of storage
space (i.e. don't map out marginal areas, just wait for them to fail),
I think you are being pound foolish.  (I wouldn't run non-RLL certified
hard disks on RLL controllers either when that was in vogue.  My disk
data is too valuable to screw around with.)

If you set up your disk correctly right from the start, you shouldn't
be seeing bad sectors, and this automatic mapping becomes a rarely
used feature.  And when a sector goes bad, the *last* thing I want is
for the machine to automagically patch around it.  I want bells and
lights and klaxons screaming - because any time I've had a sector go
bad which wasn't on the flaw map it's meant big, big trouble is on
the way.  I'd just as soon do my reformat/reload now, rather than
waiting for a couple of weeks for the entire disk to crap out.

-- 
Chip Rosenthal     <chip@chinacat.Unicom.COM>  |  Don't play that
Unicom Systems Development      512-482-8260   |    loud, Mr. Collins.

rcd@ico.isc.com (Dick Dunn) (06/11/91)

john@jwt.UUCP (John Temples) writes about automatic remapping:
> The ESIX implementation catches errors while they're still "soft,"
> i.e., the error is recoverable.  So remapping occurs with no data loss,
> as long as the first time a sector has an error it isn't a hard error.

I don't believe this is a common failure characteristic.  Assuming that
(1) you're running the drive in-spec, and (2) you've mapped out all the bad
sectors determined by the drive manufacturer [N.B.: This is *NOT* the same
as bad sectors found by running a r/w test], you shouldn't expect soft
failures because you're not using any marginal sectors.  A drive which is
about to Bite the Big One may show a few soft errors before the disaster
happens, but that's an omen that Something Bad Is About to Happen, so you
want to know about it right away.

For example, a tiny particle can get loose somehow.  If it's just the right
size to get under the head, it'll take a tiny ding out of the coating on a
platter...and there's a good chance it'll be small enough to leave you with
a soft error.  However, you now have at least *two* tiny particles cruising
around, possibly many more (the original and whatever got dug up).  You can
see how that one degenerates.  It's only one hypothetical situation; the
point is that if you start out using only the good sectors of a good disk
and run in-spec, the sorts of things that can go wrong to produce soft
errors are almost always (by that I mean something > 90%) precursors to
a disastrous failure.

If you run out-of-spec (e.g., non-RLL drives on an RLL controller), you're
much more likely to see soft errors that stay soft.
-- 
Dick Dunn     rcd@ico.isc.com -or- ico!rcd       Boulder, CO   (303)449-2870
   ...Simpler is better.

cpcahil@virtech.uucp (Conor P. Cahill) (06/13/91)

john@jwt.UUCP (John Temples) writes:

>What if you had the option of having the driver report problems, and
>you had to give your OK for it to proceed with remapping?  Or better
>yet, you could select between that mode and fully automatic mode.

I would say that the vendor is spending too much time working on a 
feature that is of little use instead of spending that time working
on bug fixes or performance enhancements.

-- 
Conor P. Cahill            (703)430-9247        Virtual Technologies, Inc.
uunet!virtech!cpcahil                           46030 Manekin Plaza, Suite 160
                                                Sterling, VA 22170 

bill@bilver.uucp (Bill Vermillion) (06/16/91)

In article <1991Jun10.230223.10316@ico.isc.com> rcd@ico.isc.com (Dick Dunn) writes:
>john@jwt.UUCP (John Temples) writes about automatic remapping:
>> The ESIX implementation catches errors while they're still "soft,"
>> i.e., the error is recoverable.  So remapping occurs with no data loss,
>> as long as the first time a sector has an error it isn't a hard error.
 
>I don't believe this is a common failure characteristic.  Assuming that
>(1) you're running the drive in-spec, and (2) you've mapped out all the bad
>sectors determined by the drive manufacturer [N.B.: This is *NOT* the same
>as bad sectors found by running a r/w test], you shouldn't expect soft
>failures because you're not using any marginal sectors.

In the ESDI world (John's running ESDI on his ESIX as I am) the mapping of
the sectors from the manufacturers list is done automatically - it is read
from the defect list of the supplied drive.   And on big drives - no one is
going to type in a couple of hundred defects willingly, or perhaps accurately.

 
>For example, a tiny particle can get loose somehow.  If it's just the right
>size to get under the head, it'll take a tiny ding out of the coating on a
>platter...and there's a good chance it'll be small enough to leave you with
>a soft error.  However, you now have at least *two* tiny particles cruising
>around, possibly many more (the original and whatever got dug up).  You can
>see how that one degenerates.  It's only one hypothetical situation; the
>point is that if you start out using only the good sectors of a good disk
>and run in-spec, the sorts of things that can go wrong to produce soft
>errors are almost always (by that I mean something > 90%) precursors to
>a disastrous failure.

Your scenario would point to a drive that has not long to live.   Anytime
you have "particles" inside the drive you are going to lose that drive in
a short time.   At 3600 rpm it won't take long to trash that drive.

The ESIX system tells you when it has recovered the sector and what sector
it was.   It uses ECC to recover from the hard error.   That's why ECC is
used in the first place - whether it is on hard drives or tape drives.
Anytime you get an error that has to be corrected with ECC and you DON'T
block out the problem area you are asking for trouble.

I have a 660 meg ESDI that had about 300 bad sectors (I got it for about
$1000 off because it was just over the limit for that drive).   I have had
about 3 instances of ESIX remapping a bad sector in the 10 months I have
had this current drive running.   They occurred in the first 3 months of use
and I have not had any since.   Remember, these are only remapped when a
hard error occurs and ECC is used for recovery.

The system has been running 24 hours per day and usually runs from 20 to 40
Megs a day through the system as a news node.  If there were problems with
their system I feel that I should have found it by now.

>If you run out-of-spec (e.g., non-RLL drives on an RLL controller), you're
>much more likely to see soft errors that stay soft.

Anyone who does that gets exactly what they deserve, IMO.

-- 
Bill Vermillion - UUCP: ...!tarpit!bilver!bill
                      : bill@bilver.UUCP