wrp@biochsn.acc.Virginia.EDU (William R. Pearson) (05/15/91)
The 300 Mbyte system drive on my MIPS M/120 died today - there seemed to be a bad sector in the swap partition (fortunately I have a service contract and should get a new one tomorrow). What surprised me was my inability to recover from the problem by reformatting the disk and scanning for bad blocks. When I reformatted with the format.std program from the 4.52 distribution tape, no problems were encountered. Scanning likewise did not uncover any bad blocks. When I listed the bad blocks, none were found.

I was surprised when my customer support person told me that I should not have tried to list the bad block table, because, according to him, a bug in the software causes the table to be erased after it is listed, and it must be reentered manually. Is this true? Is this fact mentioned somewhere in the volumes of release notes that I am constantly referred to? Is there any version of RISC/os that allows one to format a disk and examine/edit the bad-block table without destroying it?

This seems like a very serious bug to me. It is also insidious, since it tends to require that I keep my machine under maintenance with MIPS: it is now very difficult to reformat a disk with a few bad blocks, and the disk must almost always be returned to the factory and exchanged for a known good one. On other computers, I have found that disks may lose a sector now and then, often because of power fluctuations, but they can then be reformatted, scanned, and put back in use in an hour or so. By not allowing the bad-block table to be examined and modified, a simple reformat becomes a disk exchange.

Earlier, I installed a third-party disk after formatting without any problems under RISC/os 4.01(2?). Did its format/scan/list-bad-blocks function better?

Bill Pearson
sun@ME.UTORONTO.CA (Andy Sun) (05/16/91)
Newsgroups: comp.sys.mips
Subject: Re: bad block tables lost?
References: <1991May15.010630.24949@murdoch.acc.Virginia.EDU>
Organization: U of Toronto, Dept. of Mechanical Engineering

wrp@biochsn.acc.Virginia.EDU (William R. Pearson) writes:
> The 300 Mbyte system drive on my MIPS M/120 died today - there
>seemed to be a bad sector in the swap partition (fortunately I have a
>service contract and should get a new one tomorrow). What surprised
>me was my inability to recover from the problem by reformatting the
>disk and scanning for bad blocks. When I reformatted with the format.std
>program from 4.52 distribution tape, no problems were encountered.
>Scanning likewise did not uncover any bad blocks. When I listed the bad
>blocks, none were found.

I found this on 4.51 last week. I don't know whether it has been fixed in 4.52. I worked on an M/1000 that had lost its original defect list (i.e. bad sector table). The documentation said the scanning phase of format.std would pick out the bad sectors, but apparently it didn't, or at least not all of them (it got 15 out of 212). Every time I reinstalled the OS, I got a "sector not found" error and hadn't a clue what was going on. Only after I entered the manufacturer's defect list in addition to running the disk scan did the problem finally disappear. I don't see any point in this so-called "scanning" phase, especially since it is a time-consuming process that actually goes through each cylinder.

> I was surprised when my customer support person told me that I
>should not have tried to list the bad block table, because, according to
>him, a bug in the software causes the table to be erased after it is
>listed, and it must be reentered manually. Is this true? Is this fact
>mentioned somewhere in the volumes of release notes that I am constantly
>referred to? Is there any version of RISC/os that allows one to format a
>disk and examine/edit the bad-block table without destroying it?

It seems to work under RISC/os 4.51; that is, I could add, delete, and list entries without destroying the bad sector table. I performed a scan first, then added the manufacturer's defect list and listed the contents before writing to the volume header, and the table was not destroyed. The disk that I worked with is a Fujitsu 2372K and the controller is an Interphase SMD controller.

> This seems like a very serious bug to me. It is also insidious,
>since it tends to require that I keep my machine under maintenance
>with MIPS, since it is now very difficult to reformat a disk with a
>few bad blocks; the disk almost must be returned to the factory and
>exchanged for a known good one. On other computers, I have found that
>disks may lose a sector now and then, often because of bad power
>fluctuations, but they can then be reformatted, scanned, and put back
>in use in an hour or so. By not allowing the bad-block table to be
>examined and modified, a simple reformat becomes a disk exchange.
> Earlier, I installed a third-party disk after formatting without
>any problems under RISC/os 4.01(2?). Did its format/scan/list-bad-blocks
>function better?
>Bill Pearson

Andy
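[Editorial sketch] The recovery procedure Andy describes (scan first, then add the manufacturer's defect list before writing the volume header) amounts to taking the union of two defect lists. The Python below is only an illustration of that bookkeeping, not anything format.std actually runs. The (cylinder, head, sector) tuple layout, the function names, and the sample values are all assumptions made for the example; real SMD defect lists may use a different addressing scheme.

```python
from typing import Iterable, List, Tuple

# Hypothetical defect address: (cylinder, head, sector).
Defect = Tuple[int, int, int]


def merge_defect_lists(manufacturer: Iterable[Defect],
                       scanned: Iterable[Defect]) -> List[Defect]:
    """Return the sorted union of both lists with duplicates removed."""
    return sorted(set(manufacturer) | set(scanned))


def missed_by_scan(manufacturer: Iterable[Defect],
                   scanned: Iterable[Defect]) -> List[Defect]:
    """Return manufacturer-listed defects that the field scan did not find."""
    return sorted(set(manufacturer) - set(scanned))


if __name__ == "__main__":
    # Made-up values, for illustration only.
    factory = [(12, 3, 40), (12, 3, 41), (700, 1, 5)]
    found_by_scan = [(12, 3, 40), (815, 0, 9)]

    print(merge_defect_lists(factory, found_by_scan))
    print(missed_by_scan(factory, found_by_scan))
```

The second function is the interesting one for this thread: in Andy's case it would have reported most of the factory list, since the scan found only 15 of 212 entries.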
diamond@jit533.swstokyo.dec.com (Norman Diamond) (05/16/91)
In article <91May15.232155edt.20146@me.utoronto.ca> sun@ME.UTORONTO.CA (Andy Sun) writes:
>I worked on an M/1000 that has lost its original defect
>list (i.e. bad sector table). The documentation said the scanning phase
>of format.std will pick out the bad sectors, but apparently it didn't,
>not all of them anyways (it got 15 out of 212). Every time I installed the
>OS back, I got this "sector not found" error and I haven't a clue what's
>going on. Finally, after entering the manufacturer's defect list + the
>disk scanning, this problem finally disappeared. I don't see any point
>in this so-called "scanning" phase, especially it is a time-consuming
>process and it's actually going through each cylinder.

Disks deteriorate over time. The purpose of the scanning process is to find sectors that have become bad since the last time they were used.

Manufacturers have equipment to perform harsher tests than a normally operating disk drive can do, so they detect borderline blocks that would pass scanning tests. This is why they provide an initial list in the first place. It is generally considered better to avoid using a block that seems relatively likely to go bad within a few years, rather than use it until it goes bad (with loss of data). So these are generally disabled even when scanning and usage would (temporarily) work.

Nonetheless, if the scanning process passed a sector that produced "sector not found" a short time later, I'd say the scanning process is far from adequate.

(This does not represent the opinion of my employer or any other organization.)
--
Norman Diamond diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.
Permission is granted to feel this signature, but not to look at it.
sun@ME.UTORONTO.CA (Andy Sun) (05/16/91)
diamond@jit533.swstokyo.dec.com (Norman Diamond) writes:
>Disks deteriorate over time. The purpose of the scanning process is to
>find sectors that have become bad since the last time they were used.

Agreed. But if the software is smart enough to find "new" defects, it should recognise "old" defects as well because, after all, they are all defects.

>Manufacturers have equipment to perform harsher tests than a normally
>operating disk drive can do, so they detect borderline blocks that would
>pass scanning tests. This is why they provide an initial list in the
>first place. It is generally considered better to avoid using a block
>that seems relatively likely to go bad within a few years, rather than
>use it until it goes bad (with loss of data). So these are generally
>disabled even when scanning and usage would (temporarily) work.

The story I heard was nothing so magical as "detecting borderline blocks" that seem "relatively likely to go bad within a few years". Defects (bad sectors) on virgin disks are REAL defects. However, it is impractical from a manufacturer's point of view to provide zero-defect disks because the cost would be much too high (for better-quality equipment as well as better quality control). So they set a tolerance level for the product (say, less than 1% of the total disk capacity) and supply a defect list instead. Marking the defects as bad and avoiding them is cheaper than actually getting rid of them. Disk manufacturers out there might want to confirm this.

>Nonetheless, if the scanning process passed a sector that produced
>"sector not found" a short time later, I'd say the scanning process is
>far from adequate.

Agreed. At first I thought the scanning just did reads and writes at random locations, but I realized later that it actually goes through each individual cylinder. It sounds pretty dumb to me that it cannot detect major errors (errors that fsck.ffs recognises as errors and is incapable of correcting).

Andy
_______________________________________________________________________________
Andy Sun                        | Internet: sun@me.utoronto.ca
University of Toronto, Canada   | UUCP    : ...!utai!me!sun
Dept. of Mechanical Engineering | BITNET  : sun@me.toronto.BITNET
cprice@mips.com (Charlie Price) (05/17/91)
In article <91May15.232155edt.20146@me.utoronto.ca> sun@ME.UTORONTO.CA (Andy Sun) writes:
>I found this on 4.51 last week. Don't know that this still doesn't get
>fixed in 4.52. I worked on an M/1000 that has lost its original defect
>list (i.e. bad sector table). The documentation said the scanning phase
>of format.std will pick out the bad sectors, but apparently it didn't,
>not all of them anyways (it got 15 out of 212). Every time I installed the
>OS back, I got this "sector not found" error and I haven't a clue what's
>going on. Finally, after entering the manufacturer's defect list + the
>disk scanning, this problem finally disappeared. I don't see any point
>in this so-called "scanning" phase, especially it is a time-consuming
>process and it's actually going through each cylinder.

The scan pass doesn't discover all the blocks on the manufacturer's defect list. It CAN'T. This is definitely a pain in the nether regions, but there is not much that we can do about it. If any other vendor has a superior scheme for finding defects in the field, I'm sure that MIPS would like to know about it.

I used to work for an IBM-compatible disk manufacturer developing test equipment (running under UNIX!). Discovering "defects" on an HDA (Head-Disk Assembly -- the part with disks and heads, but little electronics and no motor...) was an involved process. We built elaborate equipment with special electronics that did a great deal more than the normal read/write operation and servo control. One thing that we spent a lot of effort to detect was "marginal" areas where the drive would read/write reliably if the head were exactly on track, but where it could fail if it were very slightly off the center of the track (but within servo parameters). Another "marginal" manifestation is where the coating is "thin" (not very many oxide particles), so that sometimes write/read worked and sometimes you lost a bit or two.

There is NO WAY a standard drive in the field, equipped with standard read/write and servo electronics and controlled by a standard controller (SMD or SCSI in this case), can duplicate what a manufacturer can do.

The scan pass has no magic available. For each sector on the drive, you write a data pattern that should be a worst-case pattern for the media and data encoding scheme (the maximum number of flux transitions), and then you read it back and see whether you get the same data or whether the drive/controller gives you an error. If the drive/controller gives you an error then you have a bad block; otherwise it must be good, right?

You might observe that there is something very time-consuming that could be done with standard electronics and controllers that the formatter scan pass doesn't do today. Right now it visits the blocks in order and ends up being reliably on-track for most of them. It could do a reasonably large seek between each block (visiting them in some nonsequential order) to introduce some servo noise and track-settling activity into the process. This *might* increase the likelihood of positioning the head-arm in-spec from the servo track but slightly off-center, and thereby show up more marginal blocks.

This may be grasping at straws. There is no way that you can reliably use the standard controller to produce all the situations that the drive will see in use -- at least not in a reasonable amount of time.
--
Charlie Price    cprice@mips.mips.com    (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. MS 1-03 / Sunnyvale, CA 94088-3650
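[Editorial sketch] The write-then-read-back scan pass that Charlie Price describes, and his suggestion of visiting blocks in a nonsequential order, can be illustrated with a toy simulation. Everything below is hypothetical: `SimulatedDisk`, `strided_order`, and `scan` are invented names, the alternating 0xAA/0x55 pattern is only a placeholder (the true worst-case pattern depends on the drive's encoding scheme), and the simulation models only hard defects. It is not how format.std is implemented.

```python
import math
from typing import Dict, Iterable, Iterator, List, Set

SECTOR_SIZE = 512
# Placeholder test pattern; NOT a real worst-case flux-transition pattern.
PATTERN = bytes([0xAA, 0x55]) * (SECTOR_SIZE // 2)


class SimulatedDisk:
    """Toy drive: sectors listed in `hard_bad` always read back corrupted."""

    def __init__(self, n_sectors: int, hard_bad: Iterable[int]) -> None:
        self.n_sectors = n_sectors
        self.hard_bad: Set[int] = set(hard_bad)
        self.data: Dict[int, bytes] = {}

    def write(self, sector: int, payload: bytes) -> None:
        self.data[sector] = payload

    def read(self, sector: int) -> bytes:
        payload = self.data.get(sector, bytes(SECTOR_SIZE))
        if sector in self.hard_bad:
            # Flip the first byte to mimic a failed read.
            return bytes([payload[0] ^ 0xFF]) + payload[1:]
        return payload


def strided_order(n_sectors: int, stride: int) -> Iterator[int]:
    """Visit every sector exactly once, jumping `stride` sectors per step."""
    if math.gcd(n_sectors, stride) != 1:
        raise ValueError("stride must be coprime with n_sectors")
    return ((i * stride) % n_sectors for i in range(n_sectors))


def scan(disk: SimulatedDisk, order: Iterable[int]) -> List[int]:
    """Write the pattern to each sector, read it back, report mismatches."""
    bad = []
    for sector in order:
        disk.write(sector, PATTERN)
        if disk.read(sector) != PATTERN:
            bad.append(sector)
    return sorted(bad)


if __name__ == "__main__":
    disk = SimulatedDisk(100, hard_bad={3, 42})
    print(scan(disk, range(100)))              # sequential visit
    print(scan(disk, strided_order(100, 37)))  # large seek between blocks
```

In this toy model both visiting orders find the same sectors, because a hard defect fails regardless of head position. A marginal sector that reads correctly when the head is on track would pass either scan here, which is the limitation the post describes; whether long seeks would expose such sectors on real hardware is left open in the post itself.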
clp@mips.com (Carol Preston) (05/20/91)
SCSI and SMD keep track of defects differently. SCSI keeps track of the defect list itself, and SMD basically requires the OS to keep track of the list.

For a SCSI disk, the defects that the manufacturer mapped out are kept in the "primary" defect list. This can't change. Those defects that are mapped out subsequently are put in a "secondary" list. There is a specific SCSI command to access the part of the disk where the lists are kept, and the format command doesn't supply this functionality, for better or worse.

For SMD disks, RISC/os keeps the list of defects, and the manner in which they were mapped, in the volume header (partition 8). If you re-format this partition, and don't write out the volume header before exiting the format program, the list is forever lost and must be reentered. There are also other ways in which this list can be lost. As previously mentioned, one way is if the block to which they are written goes bad (and can't be recovered).

An on-line format program has been released since RISC/os 4.52, and with it comes the functionality for saving the defect list to a Unix file. I would suggest that if you are worried about losing this list, you should run /etc/format, and do nothing more than list the defects to a file. (Of course, I would not recommend keeping this file on the same disk.)

Additionally, '/etc/badspots -l' displays the defect list for a given disk. For SCSI, it lists both the primary and secondary lists, and for SMD, it lists the defects stored in the volume header.

Both format and badspots have a man page with further information. Neither of these commands can be run on pre-4.52 kernels as they use newly defined ioctl calls.
--
Carol Preston
UUCP: {ames,decwrl,prls,pyramid}!mips!clp
clp@mips.com
DDD: (408)720-1700 x8108 or (408)524-8108
USPS: Mips Computer Systems 950 DeGuigne Ave. Sunnyvale, CA 94088
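[Editorial sketch] Carol Preston's advice to keep a copy of the defect list in a file on a different disk can be illustrated with a small save/load round trip. The file layout below (one "cylinder head sector" triple per line, with '#' comments) is invented for this example and is not the output format of /etc/format or /etc/badspots; the function names are likewise hypothetical.

```python
import os
import tempfile
from typing import Iterable, List, Tuple

# Hypothetical defect address: (cylinder, head, sector).
Defect = Tuple[int, int, int]


def save_defects(path: str, defects: Iterable[Defect]) -> None:
    """Write defects, sorted and de-duplicated, one triple per line."""
    with open(path, "w") as f:
        f.write("# cylinder head sector\n")
        for cyl, head, sector in sorted(set(defects)):
            f.write(f"{cyl} {head} {sector}\n")


def load_defects(path: str) -> List[Defect]:
    """Read defects back, skipping blank lines and '#' comments."""
    defects: List[Defect] = []
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()
            if not line:
                continue
            cyl, head, sector = (int(field) for field in line.split())
            defects.append((cyl, head, sector))
    return defects


if __name__ == "__main__":
    # A real backup should live on a different disk than the one described.
    fd, backup_path = tempfile.mkstemp(suffix=".defects")
    os.close(fd)
    try:
        save_defects(backup_path, [(700, 1, 5), (12, 3, 40), (12, 3, 40)])
        print(load_defects(backup_path))
    finally:
        os.remove(backup_path)
```

A plain-text file like this is easy to re-enter by hand if the volume header is ever lost, which is the situation the earlier posts in this thread ran into.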