[comp.sys.mips] bad block tables lost?

wrp@biochsn.acc.Virginia.EDU (William R. Pearson) (05/15/91)

	The 300 Mbyte system drive on my MIPS M/120 died today - there
seemed to be a bad sector in the swap partition (fortunately I have a
service contract and should get a new one tomorrow).  What surprised
me was my inability to recover from the problem by reformatting the
disk and scanning for bad blocks.  When I reformatted with the format.std
program from the 4.52 distribution tape, no problems were encountered.
Scanning likewise did not uncover any bad blocks.  When I listed the bad
blocks, none were found.

	I was surprised when my customer support person told me that I
should not have tried to list the bad block table, because, according to
him, a bug in the software causes the table to be erased after it is
listed, and it must be reentered manually.  Is this true?  Is this fact
mentioned somewhere in the volumes of release notes that I am constantly
referred to?  Is there any version of RISC/os that allows one to format a
disk and examine/edit the bad-block table without destroying it?

	This seems like a very serious bug to me. It is also insidious,
since it tends to require that I keep my machine under maintenance
with MIPS, since it is now very difficult to reformat a disk with a
few bad blocks; the disk almost must be returned to the factory and
exchanged for a known good one.  On other computers, I have found that
disks may lose a sector now and then, often because of bad power
fluctuations, but they can then be reformatted, scanned, and put back
in use in an hour or so. By not allowing the bad-block table to be
examined and modified, a simple reformat becomes a disk exchange.

	Earlier, I installed a third-party disk after formatting without
any problems under RISC/os 4.01(2?).  Did its format/scan/list-bad-blocks
function better?

Bill Pearson

sun@ME.UTORONTO.CA (Andy Sun) (05/16/91)

Newsgroups: comp.sys.mips
Subject: Re: bad block tables lost?
References: <1991May15.010630.24949@murdoch.acc.Virginia.EDU>
Organization: U of Toronto, Dept. of Mechanical Engineering

wrp@biochsn.acc.Virginia.EDU (William R. Pearson) writes:

>	The 300 Mbyte system drive on my MIPS M/120 died today - there
>seemed to be a bad sector in the swap partition (fortunately I have a
>service contract and should get a new one tomorrow).  What surprised
>me was my inability to recover from the problem by reformatting the
>disk and scanning for bad blocks.  When I reformatted with the format.std
>program from the 4.52 distribution tape, no problems were encountered.
>Scanning likewise did not uncover any bad blocks.  When I listed the bad
>blocks, none were found.

I ran into this on 4.51 last week. I didn't know it still hadn't been
fixed in 4.52. I worked on an M/1000 that had lost its original defect
list (i.e. bad sector table). The documentation said the scanning phase
of format.std would pick out the bad sectors, but apparently it didn't,
not all of them anyway (it found 15 out of 212). Every time I reinstalled
the OS, I got a "sector not found" error and hadn't a clue what was
going on. Finally, after I entered the manufacturer's defect list in
addition to the disk scan, the problem disappeared. I don't see any point
in this so-called "scanning" phase, especially since it is a time-consuming
process that actually goes through each cylinder.

>	I was surprised when my customer support person told me that I
>should not have tried to list the bad block table, because, according to
>him, a bug in the software causes the table to be erased after it is
>listed, and it must be reentered manually.  Is this true?  Is this fact
>mentioned somewhere in the volumes of release notes that I am constantly
>referred to?  Is there any version of RISC/os that allows one to format a
>disk and examine/edit the bad-block table without destroying it?

It seems to work under RISC/os 4.51. That is, you can add/delete/list
without destroying the bad sector table. I performed a scan first, then
added the manufacturer's defect list and listed the contents before writing
to the volume header, and the table didn't get destroyed. The disk that I
worked with is a Fujitsu 2372K and the controller is an Interphase SMD
controller.
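
A minimal Python sketch of the "scan first, then add the manufacturer's
list" step, for illustration only. It is not format.std's code; the
(cylinder, head, sector) tuple format and the sample entries are
hypothetical, and a real SMD defect list may be recorded differently.

```python
def merge_defects(scanned, manufacturer):
    """Union of both lists, duplicates removed, sorted by (cyl, head, sector)."""
    return sorted(set(scanned) | set(manufacturer))


# Hypothetical entries: what a scan found, and what the factory list says.
scanned = [(12, 3, 7), (401, 0, 22)]
manufacturer = [(12, 3, 7), (88, 5, 1), (640, 9, 30)]

table = merge_defects(scanned, manufacturer)

# List the merged table before it would be written to the volume header.
for cyl, head, sec in table:
    print(f"cyl {cyl:4d}  head {head:2d}  sector {sec:2d}")
```

The point of the union is that neither source is complete on its own:
the scan only sees blocks that fail right now, and the factory list
knows nothing about defects that developed later.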

>	This seems like a very serious bug to me. It is also insidious,
>since it tends to require that I keep my machine under maintenance
>with MIPS, since it is now very difficult to reformat a disk with a
>few bad blocks; the disk almost must be returned to the factory and
>exchanged for a known good one.  On other computers, I have found that
>disks may lose a sector now and then, often because of bad power
>fluctuations, but they can then be reformatted, scanned, and put back
>in use in an hour or so. By not allowing the bad-block table to be
>examined and modified, a simple reformat becomes a disk exchange.

>	Earlier, I installed a third-party disk after formatting without
>any problems under RISC/os 4.01(2?).  Did its format/scan/list-bad-blocks
>function better?

>Bill Pearson

Andy

diamond@jit533.swstokyo.dec.com (Norman Diamond) (05/16/91)

In article <91May15.232155edt.20146@me.utoronto.ca> sun@ME.UTORONTO.CA (Andy Sun) writes:

>I worked on an M/1000 that had lost its original defect
>list (i.e. bad sector table). The documentation said the scanning phase
>of format.std would pick out the bad sectors, but apparently it didn't,
>not all of them anyway (it found 15 out of 212). Every time I reinstalled
>the OS, I got a "sector not found" error and hadn't a clue what was
>going on. Finally, after I entered the manufacturer's defect list in
>addition to the disk scan, the problem disappeared. I don't see any point
>in this so-called "scanning" phase, especially since it is a time-consuming
>process that actually goes through each cylinder.

Disks deteriorate over time.  The purpose of the scanning process is to
find sectors that have become bad since the last time they were used.

Manufacturers have equipment to perform harsher tests than a normally
operating disk drive can do, so they detect borderline blocks that would
pass scanning tests.  This is why they provide an initial list in the
first place.  It is generally considered better to avoid using a block
that seems relatively likely to go bad within a few years, rather than
use it until it goes bad (with loss of data).  So these are generally
disabled even when scanning and usage would (temporarily) work.

Nonetheless, if the scanning process passed a sector that produced
"sector not found" a short time later, I'd say the scanning process is
far from adequate.
(This does not represent the opinion of my employer or any other organization.)
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.
Permission is granted to feel this signature, but not to look at it.

sun@ME.UTORONTO.CA (Andy Sun) (05/16/91)

Newsgroups: comp.sys.mips
Subject: Re: bad block tables lost?
References: <91May15.232155edt.20146@me.utoronto.ca> <1991May16.060214.22997@tkou02.enet.dec.com>
Organization: U of Toronto, Dept. of Mechanical Engineering

diamond@jit533.swstokyo.dec.com (Norman Diamond) writes:

>In article <91May15.232155edt.20146@me.utoronto.ca> sun@ME.UTORONTO.CA (Andy Sun) writes:

>Disks deteriorate over time.  The purpose of the scanning process is to
>find sectors that have become bad since the last time they were used.

Agreed. But if the software is smart enough to find "new" defects, it
should recognise "old" defects as well because, after all, they are all
defects.

>Manufacturers have equipment to perform harsher tests than a normally
>operating disk drive can do, so they detect borderline blocks that would
>pass scanning tests.  This is why they provide an initial list in the
>first place.  It is generally considered better to avoid using a block
>that seems relatively likely to go bad within a few years, rather than
>use it until it goes bad (with loss of data).  So these are generally
>disabled even when scanning and usage would (temporarily) work.

The story I heard was nothing as magical as "detecting borderline blocks"
that "seems relatively likely to go bad within a few years". Defects
(bad sectors) on virgin disks are REAL defects. However, it is impractical
from a manufacturer's point of view to provide zero defect disks because
the cost will be much too high (to install better quality equipment as
well as better quality control). So they set a tolerance level (say, less
than 1% of the total disk capacity) for the product and supply a defect
list instead. Marking them as bad and avoiding them is cheaper than actually
getting rid of them. Disk manufacturers out there might want to confirm this.

>Nonetheless, if the scanning process passed a sector that produced
>"sector not found" a short time later, I'd say the scanning process is
>far from adequate.

Agreed. At first I thought the scan just did reads/writes at random
locations, but I realized later that it actually goes through
each individual cylinder. It sounds pretty dumb to me that it cannot
detect major errors (errors that fsck.ffs recognises as errors and is
incapable of correcting).

>(This does not represent the opinion of my employer or any other organization.)
>--
>Norman Diamond       diamond@tkov50.enet.dec.com
>If this were the company's opinion, I wouldn't be allowed to post it.
>Permission is granted to feel this signature, but not to look at it.

Andy

_______________________________________________________________________________
Andy Sun                            | Internet: sun@me.utoronto.ca
University of Toronto, Canada       | UUCP    : ...!utai!me!sun
Dept. of Mechanical Engineering     | BITNET  : sun@me.toronto.BITNET

cprice@mips.com (Charlie Price) (05/17/91)

In article <91May15.232155edt.20146@me.utoronto.ca> sun@ME.UTORONTO.CA (Andy Sun) writes:
>Newsgroups: comp.sys.mips
>Subject: Re: bad block tables lost?
>
>I ran into this on 4.51 last week. I didn't know it still hadn't been
>fixed in 4.52. I worked on an M/1000 that had lost its original defect
>list (i.e. bad sector table). The documentation said the scanning phase
>of format.std would pick out the bad sectors, but apparently it didn't,
>not all of them anyway (it found 15 out of 212). Every time I reinstalled
>the OS, I got a "sector not found" error and hadn't a clue what was
>going on. Finally, after I entered the manufacturer's defect list in
>addition to the disk scan, the problem disappeared. I don't see any point
>in this so-called "scanning" phase, especially since it is a time-consuming
>process that actually goes through each cylinder.

The scan pass doesn't discover all the blocks on the manufacturer's
defect list.
It CAN'T.
This is definitely a pain in the nether regions,
but there is not much that we can do about it.
If any other vendor has a superior scheme for finding defects in
the field, I'm sure that MIPS would like to know about it.

I used to work for an IBM-compatible disk manufacturer 
developing test equipment (running under UNIX!).
Discovering "defects" on a HDA (Head-Disk Assembly --
the part with disks and heads, but little electronics and no motor...)
was an involved process.
We built elaborate equipment with special electronics that did a
great deal more than the normal read/write operation and servo control.
One thing that we spent a lot of effort to detect was "marginal" areas
where the drive would read/write reliably if the head were exactly on
track, but where it could fail if it were very slightly off the center
of the track (but within servo parameters).
Another "marginal" manifestation is where the coating is "thin"
(not very many oxide particles) and sometimes write/read worked
and sometimes you lost a bit or two.

There is NO WAY a standard drive in the field equipped with
standard read/write and servo electronics controlled by a
standard (SMD or SCSI in this case) controller can duplicate
what a manufacturer can do.
The scan pass has no magic available.
For each sector on the drive, you write a data pattern that should
be a worst-case pattern for the media and data encoding scheme
(the maximum number of flux transitions) and then you read it back
and see if you get the same data or if the drive/controller gives you an error.
If the drive/controller gives you an error then you have a bad block,
otherwise it must be good, right?
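
A minimal Python sketch of that write/read-back loop, run against a
simulated drive rather than real hardware. Everything here is
hypothetical: the SimulatedDisk class, the 0xAA fill (the true worst-case
pattern depends on the data encoding) and the 512-byte sector size are
assumptions, and this is not the format.std implementation. It shows why
a "marginal" sector that only fails when the head is slightly off-center
passes the scan.

```python
SECTOR_SIZE = 512                         # assumed sector size
WORST_CASE = bytes([0xAA]) * SECTOR_SIZE  # assumed stand-in for the worst-case pattern


class SimulatedDisk:
    """Toy drive: hard-bad sectors always fail, marginal ones fail only off-track."""

    def __init__(self, n_sectors, hard_bad=(), marginal=()):
        self.n_sectors = n_sectors
        self.hard_bad = set(hard_bad)
        self.marginal = set(marginal)
        self.data = {}

    def write(self, sector, payload):
        self.data[sector] = payload

    def read(self, sector, on_track=True):
        if sector in self.hard_bad:
            raise OSError("sector not found")
        if sector in self.marginal and not on_track:
            raise OSError("read error")
        return self.data[sector]


def scan(disk):
    """Write the pattern to every sector in order, read it back, collect failures."""
    bad = []
    for sector in range(disk.n_sectors):
        disk.write(sector, WORST_CASE)
        try:
            ok = disk.read(sector) == WORST_CASE
        except OSError:
            ok = False
        if not ok:
            bad.append(sector)
    return bad


disk = SimulatedDisk(10, hard_bad={3}, marginal={7})
found = scan(disk)
print(found)
```

In this model the sequential scan is always on-track, so it reports
sector 3 but never sector 7, even though sector 7 can still fail in use.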

You might observe that
there is something very time consuming that could be done with
standard electronics and controllers that the formatter scan
pass doesn't do today.
Right now it visits the blocks in order and ends up being reliably
on-track for most of them.  It could do a reasonably large seek
between each block (visiting them in some nonsequential order)
to introduce some servo noise and track-settling activity into the process.
This *might* increase the likelihood of positioning the head-arm
in-spec from the servo track but slightly off-center and
thereby show up more marginal blocks.
This may be grasping at straws.
There is no way that you can reliably use the standard controller to
produce all the situations that the drive will see in use --
at least in a reasonable amount of time.
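
One way to picture the nonsequential visiting order, as a hedged Python
sketch: interleave the lower and upper halves of the position range so
that every step forces a seek of roughly half the range. The function
name and the interleaving scheme are assumptions for illustration, not
anything the formatter does.

```python
def long_seek_order(n_positions):
    """Visit every position exactly once, alternating between the lower and
    upper halves so that consecutive visits are about n/2 apart."""
    half = (n_positions + 1) // 2
    order = []
    for i in range(half):
        order.append(i)
        if i + half < n_positions:
            order.append(i + half)
    return order


order = long_seek_order(8)
jumps = [abs(b - a) for a, b in zip(order, order[1:])]
print(order)
print(min(jumps))
```

Every position is still scanned once, so coverage is unchanged; only the
amount of head movement between accesses grows, and with it the scan time.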
-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave.  MS 1-03 / Sunnyvale, CA   94088-3650

clp@mips.com (Carol Preston) (05/20/91)

SCSI and SMD keep track of defects differently.  SCSI keeps track of the 
defect list itself, and SMD basically requires the OS to keep track of 
the list.

For a SCSI disk, the defects that the manufacturer mapped out are kept in
the "primary" defect list.  This can't change.  Those defects that are mapped
out subsequently are put in a "secondary" list.  There is a specific SCSI
command to access the part of the disk where the lists are kept, and the
format command doesn't supply this functionality, for better or worse.
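
As a rough mental model of the two SCSI lists, here is a small Python
sketch. The class and method names are hypothetical and this is not a
SCSI command interface; it only mirrors the rule that the primary list
is fixed while later defects accumulate in the secondary list.

```python
class ScsiDefectLists:
    """Toy model: primary list is fixed at manufacture, secondary list grows."""

    def __init__(self, primary):
        self._primary = frozenset(primary)  # cannot change after construction
        self._secondary = set()

    @property
    def primary(self):
        return self._primary

    @property
    def secondary(self):
        return frozenset(self._secondary)

    def map_out(self, block):
        """Record a newly mapped-out block; blocks already in the primary list stay there."""
        if block not in self._primary:
            self._secondary.add(block)

    def all_defects(self):
        return sorted(self._primary | self._secondary)


lists = ScsiDefectLists(primary=[5, 9])
lists.map_out(12)   # a defect found in the field
lists.map_out(5)    # already on the primary list, so nothing changes
print(lists.all_defects())
```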

For SMD disks, RISC/os keeps the list of defects and the manner in which
they were mapped, in the volume header (partition 8).  If you re-format this
partition, and don't write out the volume header before exiting the format
program, the list is forever lost and must be reentered.  There are also other
ways in which this list can be lost.  As previously mentioned, one way is
if the block to which they are written goes bad (and can't be recovered).
An on-line format program has been released since RISC/os 4.52, and with it 
comes the functionality for saving the defect list to a Unix file.  I would 
suggest that if you are worried about losing this list, you should run 
/etc/format, and do nothing more than list the defects to a file.  (Of 
course, I would not recommend keeping this file on the same disk.)

Additionally '/etc/badspots -l' displays the defect list for a given disk.  
For SCSI, it lists both the primary and secondary lists, and for SMD, it 
lists the defects stored in the volume header.

Both format and badspots have a man page with further information.  Neither
of these commands can be run on pre-4.52 kernels as they use newly defined
ioctl calls.
-- 
Carol Preston
UUCP: 	{ames,decwrl,prls,pyramid}!mips!clp  	clp@mips.com
DDD:  	(408)720-1700 x8108 or (408)524-8108
USPS:   Mips Computer Systems 950 DeGuigne Ave. Sunnyvale, CA 94088