[comp.sys.att] 3B1 Hard Disk Woes

jbm@uncle.UUCP (John B. Milton) (07/31/89)

In article <850@flatline.UUCP> erict@flatline.UUCP (J. Eric Townsend) writes:
>In article <1989Jul26.174524.21833@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
>>Somebody else wrote:
[ error ]
>
>According to an AT&T tech who came out and replaced the HD in
>my 3b1 (while it was under warranty), this is something that could
>be fixed from floppy-unix, if AT&T had bothered to ship a program
>that could do the super-low level format needed to test the hard drive.
Hmmm. Far too general a statement to be entirely correct.

>This is where I start to lose understanding of the subject, so
>I only *think* I'm correct.
Well, ok

>There are two levels of formatting:  The normal level, what "we"
>use, merely erases the disk, and sets up the base for the file system.
I can tell you've been too close to DOS land.

>There is a lower level format that actually writes the 0 block (or
>wherever the "what am I" information for the drive is stored).  This
>"what am I" information is what the 3b1 uses to format the hard drive.
>Currently, there is no way to do a "you are a X" format on a drive.
>(I've done this on IBM PClones, however. :-(

Well, there are several levels to formatting the hard disk drive. What the
diag disk does is a "low level format", that is, it sends a format track
command to the WD1010 hard disk controller chip. Oh, yeah, then how does it
remember the old bad block table when you format a drive twice? Easy: before
formatting it checks to see if the disk about to be formatted is a UNIXpc
disk with a good VHB. If so, it reads the existing BBT, formats the disk,
then re-writes the old BBT when the format is complete. The reason is obvious:
once a bad spot, always a bad spot. I, like most people, would not like to give
a bad spot a second chance. If you want to dump the old BBT, you have to trash
the VHB to make the diag disk think it's a new, raw disk. The low level format
re-writes the ENTIRE track from index to index, gaps, headers, data, everything.
There is a "sort of" lower level format which involves warping the format by
changing the gap sizes. This is done to AVOID bad spots, is very time consuming,
and not very reliable. Some of the PC "low level" format programs can do this.
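
As a rough illustration of the sequence described above, here is a minimal
sketch in Python. It is not the diagnostic disk's real code: the dictionary
layout and the names format_disk and trash_vhb are made up for this example.
It only shows the order of operations: check for a good VHB, save the BBT,
format every track, then write the BBT back.

```python
# Hypothetical model of the format flow; names and data layout are assumptions.

def format_disk(disk, tracks):
    """Low-level format `disk`, keeping the old bad block table (BBT)
    when a good volume header block (VHB) is already present."""
    old_bbt = []
    vhb = disk.get("vhb")
    if vhb is not None and vhb.get("valid"):
        # Looks like an already-formatted disk: save its BBT first.
        old_bbt = list(vhb.get("bbt", []))

    # Rewrite every track from index to index (gaps, headers, data).
    disk["tracks"] = ["formatted"] * tracks

    # Write a fresh VHB carrying the preserved BBT.
    disk["vhb"] = {"valid": True, "bbt": old_bbt}
    return disk


def trash_vhb(disk):
    """Make the disk look new and raw, so the next format starts
    with an empty BBT."""
    disk["vhb"] = None
    return disk
```

Formatting twice in a row keeps the table; calling trash_vhb between the two
formats is the only way, in this model, to end up with an empty one.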

What the DOS format command does is something completely different. It just
fills the disk with a pattern (FD I think). This is all it can do because DOS
has no way to tell just what kind of hard disk controller (chip) is down there.

John
-- 
John Bly Milton IV, jbm@uncle.UUCP, n8emr!uncle!jbm@osu-cis.cis.ohio-state.edu
(614) h:294-4823, w:785-1110; N8KSN, AMPR: 44.70.0.52; Don't FLAME, inform!

jcm@mtunb.ATT.COM (was-John McMillan) (08/03/89)

In article <580@uncle.UUCP> jbm@uncle.UUCP (John B. Milton) writes:
:
> ...		 If so, it reads the existing BBT, formats the disk,
>then re-writes the old BBT when the format is complete. The reason is obvious:
>once a bad spot, always a bad spot. I, like most people, would not like to give
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>a bad spot a second chance. If you want to dump the old BBT, you have to trash
>the VHB to make the diag disk think it's a new, raw disk. The low level format
>re-writes the ENTIRE track from index to index, gaps, headers, data, everything.

Be more charitable, John:
	Redemption IS possible, for some... but it takes love and attention.

Brother McMillan has spent long nights with several disks, and twice
the WORD has driven satan outta that disk!

We've been here before... o yes, children, this HAS BEEN SAID BEFORE!

Formatting a disk puts META information on the disk: it's like
	spraying the lines in the parking lot.  Without these lines,
	the disk cannot identify where the data is to be placed/grabbed.
	Each disk sector has a leading-edge gap, mark, identifier,
	gap/mark, user data, and final gap (sometimes w/ CRC).
	(OK -- I'm faking it, I haven't looked at the code for years!-)

Reading/Writing requires finding the sector identifier and precisely
	dribbling/sucking-up the data after hitting the internal mark.
	(Nice technical terms, eh?)
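
To make the parking-lot picture a little more concrete, here is a small
sketch in Python. It is a toy model, not the real on-disk layout: the Sector
fields, the function names, and the use of zlib.crc32 as the check value are
all assumptions chosen for illustration (the real controller uses its own
check code, and the gaps and marks are left out entirely).

```python
from dataclasses import dataclass
import zlib


@dataclass
class Sector:
    ident: int    # sector identifier, written at format time (META)
    data: bytes   # user data field
    crc: int      # check value over the data field (stand-in: crc32)


def make_sector(ident, data):
    """Build a sector whose check value matches its data."""
    return Sector(ident, data, zlib.crc32(data))


def read_sector(track, ident):
    """Scan the track for the identifier, then take the data after it."""
    for sector in track:
        if sector.ident == ident:
            if zlib.crc32(sector.data) != sector.crc:
                raise IOError("data field failed its check")
            return sector.data
    # No identifier means no painted lines: the data cannot be located.
    raise IOError("sector identifier not found")
```

The two failure paths correspond to the two kinds of damage: a data field
that no longer matches its check value, and META information that cannot be
found at all.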

A bad block, then, is one which has:
		Proven itself to be unreadable, or unreliably readable.

Types of BAD BLOCKS:
     +	SOME blocks are simply victims of poor penmanship:
		If a block is badly written, it may be unreadable.
	And bad vibrations -- not to mention bad karma -- or power
	glitches might cause bad writes.  (Likewise, some RETRIES may
	indicate NOT a bad block, but an Anomaly occurring during
	the READ cycle.)

     ++	If the write error is in the USER DATA field, the block may
	be recoverable by performing a FULL-BLOCK write: this will
	overlay the badly written bits without trying a read first.
	(If you write only 100 bytes, the other bytes have to be READ first.)

     ++	However, if the META information is corrupted -- by overwriting
	or by a marginal write during formatting -- only a re-format
	of that sector (track) can reclaim the block.

     +	OK: there are also MEDIA defects.  And THESE are the BAD BLOCKS
	which John was referring to, the ones we presume to be beyond
	salvation.
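
The difference between a full-block write and a partial write can be
sketched as follows. This is a toy model in Python, not driver code: the
Block class, its method names, and the 512-byte BLOCK_SIZE are assumptions
for illustration, and it covers only the "badly written data field" case,
not corrupted META information or real media defects.

```python
BLOCK_SIZE = 512  # assumed block size, for illustration only


class Block:
    def __init__(self):
        self.data = bytes(BLOCK_SIZE)
        self.readable = True   # False models a badly written data field

    def read(self):
        if not self.readable:
            raise IOError("unreadable block")
        return self.data

    def write_full(self, data):
        # A full-block write overlays every byte; no read is needed first.
        if len(data) != BLOCK_SIZE:
            raise ValueError("full-block write needs exactly one block")
        self.data = bytes(data)
        self.readable = True   # assumes the media itself is sound

    def write_partial(self, offset, data):
        # A partial write must read the old contents, patch them,
        # and write the whole block back.
        old = self.read()
        new = old[:offset] + bytes(data) + old[offset + len(data):]
        self.write_full(new[:BLOCK_SIZE])
```

With readable set to False, write_partial fails at the read step, while
write_full succeeds and makes the block readable again.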

When my 67 MB drive began having MAJOR problems with bad blocks in the
SWAP -- I developed 120+ BAD BLOCKS  over two weeks, and some odd messages
from the diagnostics/surface check code -- I backed up everything.
	Feature -- this took only 3 days because CPIO was
	failing to verify dump after dump.

Then I ZERO'd the Bad block list, and reformatted.

And now, I have NO BAD BLOCKS.
And there's been not a single read error in the subsequent 4 weeks.

So I say, John: Bad Block lists aren't holy.  There are varying reasons
why Blocks are entered.  And good reasons for considering a reformat
-- 'though the manufacturer's list of defects *MAY* be worth copying in.

john mcmillan -- att!mtunb!jcm -- speaking for hizzelf, only

PS: In a recent power hit in Lincroft, several 3B1 disks expired.
	Oddly, none of us with Spike/Noise suppression units were hurt.
	Then there's the fellah who hadn't put his suppression unit in
	service yet... sad fellah.  Don't you be sad: use line conditioning.

thad@cup.portal.com (Thad P Floryan) (08/05/89)

John McMillan concludes one of his recent postings with:

"	PS: In a recent power hit in Lincroft, several 3B1 disks expired.
	Oddly, none of us with Spike/Noise suppression units were hurt.
	Then there's the fellah who hadn't put his suppression unit in
	service yet... sad fellah.  Don't you be sad: use line conditioning.
"

Sage advice.  Prior to installing a UPS _AND_ a line-conditioner on every
system here, I could expect several failures a week (on any of Amiga, UNIXPC,
several homebrews, etc.).  Even turning on a fluorescent room light or turning
off a modem while writing to a floppy would trash the disk; now I can operate
drill motors, etc. on the same line circuit with impunity ... ZERO errors for
over 3-1/2 years now, and ALL my systems are operated 24 hours/day, 7 days/wk.

In my quest for 100% system reliability, I rented a line monitor for 30 days
and let it record everything on the AC power.  What it saw almost made me poop
my pants ... literally.  2000V spikes, hash, RF, etc etc etc  even lossage of
a cycle (of the 60 Hz) now and then (and this was NOT during the "normal" power
outages for this area).

The types of crap one finds on the AC power line are caused by any/all of:
	air conditioners,
	refrigerators,
	fluorescent lamps,
	any other inductive or capacitive loads (modems, printers, fans, etc.),
	thunderstorm activity ANYWHERE near your power grid,
	hospitals and medical equipment,
	construction activity (esp. the ol' backhoe digging up power lines),
	air pollution and acid rain,
	animals "playing" around power lines/transformers (and this includes
		your neighbors' kids with their kites), and
	anything else that plugs into the AC power line.

If your livelihood depends on reliable system operation, you're living on
borrowed time (or walking the edge) if you don't have at least a good spike
and surge/transient suppressor between the wall outlet and your system(s).  I
even have special modem protectors (designed/built by GTE) to work in
conjunction with the "Primary Phone Line Protector" (the "box" where the phone
service enters your site) installed on every line by one's local TelCo.

Yeah, I sometimes joke about operating my computers under candlelight during
a power failure (when the UPS is powering everything), but the peace of mind is
definitely worth it.  The only thing I haven't got working with the 1200 Watt
units yet is getting the signals from the UPS' DB-9 connector into the UNIXPC
to carry on the dialogue between the UPS and the computer as has been done with
the Convergent Tech Miniframes (these UPS systems were DESIGNED for use with
the Miniframe under contract to SAFE in Arizona).  If you want to contact them
for the address of a dealer near you:

	SAFE Power Systems, Inc.
	528 West 21st Street
	Tempe, AZ  85282        602/894-6864

"PC" magazine had a good article several years ago about surge protectors; some
they tested even AMPLIFIED the spikes to yet higher voltages!  At least you know
that nothing over approx. 4,000 volts will come into your system ... 4000V is
the flashover point between the two prongs (hot and neutral) on your standard
USA AC power plug.

Thad Floryan [ thad@cup.portal.com (OR) ..!sun!portal!cup.portal.com!thad ]

jbm@uncle.UUCP (John B. Milton) (08/07/89)

In article <1583@mtunb.ATT.COM> jcm@mtunb.UUCP (was-John McMillan) writes:
>In article <580@uncle.UUCP> jbm@uncle.UUCP (John B. Milton) writes:
>:
>> ...		 If so, it reads the existing BBT, formats the disk,
>>then re-writes the old BBT when the format is complete. The reason is obvious:
>>once a bad spot, always a bad spot. I, like most people, would not like to give
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>a bad spot a second chance. If you want to dump the old BBT, you have to trash
>>the VHB to make the diag disk think it's a new, raw disk. The low level format
>>re-writes the ENTIRE track from index to index, gaps, headers, data, everything.
...
>     +	OK: there are also MEDIA defects.  And THESE are the BAD BLOCKS
>	which John was referring to, the ones we presume to be beyond
>	salvation.

I really did mean what I said. Perhaps I should have been more specific. What
I was referring to was places on the disk that are physically not responding
correctly. Many, many other things can go wrong that do not mesh with the "bad
data read, this must be a bad block" idea. The format routine on the diagnostics
disk was written for the user. It was written to find pre-existing bad spots
that are expected to be there. The diagnostics assume that the hardware is
functioning correctly. Remember, if it acts weird, you're supposed to call AT&T
service, right? My original comment was also made assuming your system is
functioning properly (or is now).

So is it time for a HwNote on what's REALLY on the disk and how disks work?

John
-- 
John Bly Milton IV, jbm@uncle.UUCP, n8emr!uncle!jbm@osu-cis.cis.ohio-state.edu
(614) h:294-4823, w:785-1110; N8KSN, AMPR: 44.70.0.52; Don't FLAME, inform!