[unix-pc.general] 3B1 Hard Disk Woes

kww@cbnews.ATT.COM (Kevin W. Wall) (07/25/89)

I am having trouble with my hard disk on my 3B1 (a 67Mb Miniscribe).

A while back, the power supply on my 3B1 failed.  I since have replaced
the power supply, but its failure apparently caused the hard disk to fail
as well.

At the time, I thought that was no big deal; I figured I could just reformat
the hard drive and everything would be okay.  Well, when THAT didn't work,
I thought I might have had a head crash, so I borrowed a colleague's
"UNIX PC Reference Manual" and turned to the diagnostics section (Chapter 3)
to see if I could confirm whether or not it was indeed a head crash.  As it
turns out, even the so-called "Expert Mode Diagnostics Program" is virtually
useless, but THAT'S another story.  However, I did learn that 1) all hard
disk diagnostics apparently first try to "recalibrate" the drive, and 2) this
"recalibration" always would fail.  For example, if I run the diagnostic
to park the heads on the hard disk, I will get the following output:

		WINCHESTER DISK TEST

		Hard disk restore failed
			**ERROR**
		Test:		Hard disk test (drive 0)
		Subtest:	Park Disk Heads
		Error:		WINCHESTER: Can't Recal; Response = 4
			Enter y[Y] to Abort, Return to continue

Anything that I tried to run that concerned the hard disk gave the same result:

		Error:		WINCHESTER: Can't Recal; Response = 4

however, all diagnostics on other devices (e.g., floppy, CPU, memory, etc.) 
passed with no errors.

Now my questions are these:

	1) What exactly does "Can't Recal; Response = 4" mean?  In particular,
	   what does this most likely indicate?  (E.g., a problem with the disk
	   controller, with the hard disk media, etc.)
	2) Can I safely assume that the problem is in the hard disk UNIT
	   (all the pieces inside the smaller cage containing the disk itself),
	   as opposed to being some kind of problem on the mother board?

	   [The reason that I want to know this is that I have a 72Mb Miniscribe
	   (Model # 6085) that I could (would like to) install to replace the
	   (bad?) drive, but don't want to do this if there is a problem
	   somewhere on the mother board which will cause this one to fail too.]

	3) Is there any way, short of opening the sealed drive itself, to
	   tell if the problem is a head crash (vs. say, the failure of the
	   hard disk controller)?  [Note that no portion of the 3B1 is
	   still under warranty, so I am not averse to opening the disk
	   unit and peering inside.  However, I would like to send the disk
	   drive to be refurbished, so I don't want to break the seal if
	   this will keep companies from trying to fix it or if I have a
	   good chance of making things worse than they already are.]

	4) Assuming the disk is repairable (how can I tell, especially
	   without opening it?) does anyone know of a reliable and reputable
	   company that you would trust fixing it?  [Recovery of the data is
	   only of secondary interest.  I have (about) everything either on
	   floppy, or on a mainframe at work, or am willing to do without.
	   I estimate that recovering everything will take approximately 72 hrs.
	   of connect time at 2400 baud so if recovery is CHEAP enough, I
	   may be interested; it depends if I could have them selectively
	   recover certain directories (e.g., /usr/local and my $HOME
	   directory).]
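
The 72-hour estimate can be sanity-checked with a little arithmetic. The sketch below assumes that "2400 baud" means 2400 bits per second, that each byte costs 10 bits on the wire (8 data bits plus start and stop bits), and that there is no protocol overhead. None of these assumptions come from the post.

```python
# Rough check of "72 hrs. of connect time at 2400 baud".
# Assumptions (not from the post): 2400 bits/s, 10 bits per byte on the
# wire, and no protocol overhead or retransmissions.
BITS_PER_SECOND = 2400
BITS_PER_BYTE_ON_WIRE = 10

def bytes_transferred(hours: float) -> int:
    """Bytes moved over the line in the given number of hours."""
    seconds = hours * 3600
    return int(seconds * BITS_PER_SECOND / BITS_PER_BYTE_ON_WIRE)

total = bytes_transferred(72)
print(total)                    # 62208000
print(round(total / 2**20, 1))  # 59.3 (MiB)
```

Under those assumptions, 72 hours moves roughly 59 MiB, which is about the contents of a nearly full 67Mb drive.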

Some final info that might be pertinent.  I was running version 3.51 of the
operating system when the system "crashed", and ran version 3.51 of the
diagnostics disk to analyze the problem.  The specifics on the disk itself
follow.  (Fortunately, I wrote this info down a long time ago, as the
diagnostics to get this information now fail!)

	Volume Name: mi67-4
	1024 Cylinders. 8 heads per Cylinder.
	Configured as:
		Total space:	65,536 blocks
		Partition 0 =	64 blocks
		Partition 1 =	5000 blocks
		User space =	60,472 blocks
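
The figures above are self-consistent, which a few lines of arithmetic can confirm. The sketch below assumes one block is 1024 bytes; the post does not state the block size, so treat that constant as a guess.

```python
# Figures copied from the post.
CYLINDERS = 1024
HEADS = 8
TOTAL_BLOCKS = 65_536
PARTITIONS = {"partition 0": 64, "partition 1": 5000, "user space": 60_472}

# Assumption (not stated in the post): one block is 1024 bytes.
BLOCK_BYTES = 1024

tracks = CYLINDERS * HEADS                 # one track per head per cylinder
blocks_per_track = TOTAL_BLOCKS // tracks
total_bytes = TOTAL_BLOCKS * BLOCK_BYTES

# The three areas account for every block on the disk.
assert sum(PARTITIONS.values()) == TOTAL_BLOCKS

print(tracks, blocks_per_track, total_bytes)  # 8192 8 67108864
```

With the assumed block size the total comes to about 67 million bytes, which agrees with the "67Mb" description of the drive.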

One final note: please E-mail reply directly to me as opposed to congesting
this newsgroup with a bunch of follow-up discussions.  If enough interest is
expressed, I'll summarize to the net.

Please try to explain in layman's terms; I'm a UNIX hacker, not a hardware
jock. :-)

Thanks in advance for your help!
-- 
In person: Kevin W. Wall			AT&T Bell Laboratories
Usenet/UUCP: {att!}cblpf!kww			6200 E. Broad St.
Internet: kww@cblpf.att.com			Columbus, Oh. 43213
		"Death is life's way of firing you!" -- Hack rumor

clewis@eci386.uucp (Chris Lewis) (07/27/89)

Talk about timely!

I'm posting instead of mailing because of the co-incidence (and we *know*
what's wrong with ours...  Graphically.)

In article <8569@cbnews.ATT.COM> kww@cbnews.ATT.COM (Kevin W. Wall) writes:

>I am having trouble with my hard disk on my 3B1 (a 67Mb Miniscribe).

So am I.

> For example, if I run the diagnostic
>to park the heads on the hard disk, I will get the following output:
>
>		WINCHESTER DISK TEST
>
>		Hard disk restore failed
>			**ERROR**
>		Test:		Hard disk test (drive 0)
>		Subtest:	Park Disk Heads
>		Error:		WINCHESTER: Can't Recal; Response = 4
>			Enter y[Y] to Abort, Return to continue

That's what we get too.

>	1) What exactly does "Can't Recal; Response = 4" mean?  In particular,
>	   what does this most likely indicate?  (E.g., a problem with the disk
>	   controller, with the hard disk media, etc.)

That the drive cannot find the first track on the disk.  I'm not sure how
good the diagnostics are, but I do know that you really cannot do anything
with a disk if you don't recalibrate first.  Unless your diagnostics are
capable of actually testing the controller (which I doubt in this case),
it's hard to tell whether it's the controller or disk.  My system originally
didn't boot HD or floppy, but we eventually got the floppy running, ruling
out the rest of the logic board except possibly the controller itself.  Could
have been a bad diagnostic floppy (was an off-the-net copy of s4diag that
I had booted successfully once before) that prevented the floppy boot at
home.

>	2) Can I safely assume that the problem is in the hard disk UNIT
>	   (all the pieces inside the smaller cage containing the disk itself),
>	   as opposed to being some kind of problem on the mother board?

Not necessarily.

>	   [The reason that I want to know this is that I have a 72Mb Miniscribe
>	   (Model # 6085) that I could (would like to) install to replace the
>	   (bad?) drive, but don't want to do this if there is a problem
>	   somewhere on the mother board which will cause this one to fail too.]

*Very* unlikely - on ST506 drives you can do almost anything to the connections
without harm (eg: getting either cable backwards...)  Unless something's
wrong with the powersupply - a VOM would come in handy.

>	3) Is there any way, short of opening the sealed drive itself, to
>	   tell if the problem is a head crash (vs. say, the failure of the
>	   hard disk controller)?

No.  But there can be additional evidence.  Eg: loud scraping noises.
Which is what I had been getting, louder and louder, over the previous week or
two.  Originally thought it was the fan dying, but once I had the cover
off, it became obvious where it was really coming from.

Another thing that might help is opening the 3b1, disconnecting the ribbon
cable from the power supply to the logic board, and powering the thing
up.  If the drive spins up reasonably quietly with no activity on the
HD drive LED (seen through the perforations on the HD cage), you probably
still have a good drive.  Mine made noises and the LED gave me a repeating

flash ... flash-flash-flash ... flash-flash-flash ... flash-flash-flash

code.  Which might mean something if you have the right manuals.

If there is a true head crash, chances are that the drive isn't worth
repairing....  Generally speaking, repair houses charge a fixed rate
(on the order of $500-$1000) to repair a drive.  Even then, you generally
don't get data recovery (especially if some of the oxide is missing...)
And you can usually buy a new drive for less than the repair cost.

If the drive is truly zorched, I don't think that a repair centre
would care whether you had peeked inside.  Once you find one, you could
always ask.

In our case, our resident expert on 3b1 noises took the chance and opened
the drive this morning.  He has managed to take one apart, fix it,
and have it work after he's closed it up, but it should really be done 
in a "clean room".  

Oh my!

Heads 3 and 5 fell off, and there's this neat 1/2" wide stripe of melted 
aluminum where the head supports touched down on two of the surfaces.  
Lost a few square inches of oxide.  Starting at cylinder 0.  There go
my comp.sources.unix and comp.sources.misc archives - they're just
reddish dust on the workbench now.

Sigh.  I must be sick - I'm actually giggling about it...

I understand Jim Joyce (of UNIX bookstore fame) makes a living recovering
data from mangled drives, but he charges quite a bit (quite a bit for
a hobbyist, not that much for a company who's got lots of money riding
on their disk contents).

None of the people we go to for repairs would be of use to you "down there".

Now I just have to see if I can get another drive...
-- 
Chris Lewis, R.H. Lathwell & Associates: Elegant Communications Inc.
UUCP: {uunet!mnetor, utcsri!utzoo}!lsuc!eci386!clewis
Phone: (416)-595-5425

erict@flatline.UUCP (J. Eric Townsend) (07/28/89)

In article <1989Jul26.174524.21833@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
>Somebody else wrote:
>>		WINCHESTER DISK TEST
>>
>>		Hard disk restore failed
>>			**ERROR**
>>		Test:		Hard disk test (drive 0)
>>		Subtest:	Park Disk Heads
>>		Error:		WINCHESTER: Can't Recal; Response = 4
>>			Enter y[Y] to Abort, Return to continue

According to an AT&T tech who came out and replaced the HD in
my 3b1 (while it was under warranty), this is something that could
be fixed from floppy-unix, if AT&T had bothered to ship a program
that could do the super-low level format needed to test the hard drive.
This is where I start to lose understanding of the subject, so
I only *think* I'm correct.

There are two levels of formatting:  The normal level, what "we"
use, merely erases the disk, and sets up the base for the file system.
There is a lower level format that actually writes the 0 block (or
wherever the "what am I" information for the drive is stored).  This
"what am I" information is what the 3b1 uses to format the hard drive.
Currently, there is no way to do a "you are a X" format on a drive.
(I've done this on IBM PClones, however. :-(

Anyway, maybe somebody with a tad more HD knowledge can xlate the
above into technical-talk, and correct it where necessary.

-- 
J. Eric Townsend       "[Leslie Stahl was] a pussy compared to [Dan] Rather."
uunet!sugar!flatline!erict     -- George Herbert Walker Bush
com6@uhnix1.uh.edu   511 Parker #2, Houston, Tx 77007
EastEnders Mailing list: eastender@flatline.UUCP

clewis@eci386.uucp (Chris Lewis) (07/28/89)

In article <850@flatline.UUCP> erict@flatline.UUCP (J. Eric Townsend) writes:

>According to an AT&T tech who came out and replaced the HD in
>my 3b1 (while it was under warranty), this is something that could
>be fixed from floppy-unix, if AT&T had bothered to ship a program
>that could do the super-low level format needed to test the hard drive.
>This is where I start to lose understanding of the subject, so
>I only *think* I'm correct.

You're not.  He may have been simply mistaken, or trying to make sure that
you only buy drives from AT&T because "only we can format 'em".
[Only a possibility, I see no evidence of this with the AT&T people I
deal with]

Proof?  Simple: almost every single ST506 controller uses a slightly
different pattern of bits for the physical representation of sector
headers and trailers.  We do hardware maintenance on a host of machines,
and I can assure you that when you take a disk from another type of
machine and insert it in a 3b1, the formats are different, and the 
diagnostic floppy formatter *does* do low level formats.  (Otherwise,
I'd never get a new drive for my machine  ;-)

The "disk erase and file system preparation" program he was referring to
is UNIX "mkfs" and is the second stage of preparing a HD for UNIX.
(analogous to low level formatters and FDISK on DOS)

However, there are at least a few companies that do not provide low
level formatters for HD's, or other similar things like requiring
tape drivers to only accept a tape with a label that only the
machine's vendor can write.  So they can have a captive media market.

You can take some comfort that at least one of these companies (quite 
large at one point I may add) has gone belly up.  So did the company
that took 'em over.
-- 
Chris Lewis, R.H. Lathwell & Associates: Elegant Communications Inc.
UUCP: {uunet!mnetor, utcsri!utzoo}!lsuc!eci386!clewis
Phone: (416)-595-5425

jbm@uncle.UUCP (John B. Milton) (07/31/89)

In article <850@flatline.UUCP> erict@flatline.UUCP (J. Eric Townsend) writes:
>In article <1989Jul26.174524.21833@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
>>Somebody else wrote:
[ error ]
>
>According to an AT&T tech who came out and replaced the HD in
>my 3b1 (while it was under warranty), this is something that could
>be fixed from floppy-unix, if AT&T had bothered to ship a program
>that could do the super-low level format needed to test the hard drive.
Hmmm. Far too general a statement to be entirely correct.

>This is where I start to lose understanding of the subject, so
>I only *think* I'm correct.
Well, ok

>There are two levels of formatting:  The normal level, what "we"
>use, merely erases the disk, and sets up the base for the file system.
I can tell you've been too close to DOS land.

>There is a lower level format that actually writes the 0 block (or
>wherever the "what am I" information for the drive is stored).  This
>"what am I" information is what the 3b1 uses to format the hard drive.
>Currently, there is no way to do a "you are a X" format on a drive.
>(I've done this on IBM PClones, however. :-(

Well, there are several levels to formatting the hard disk drive. What the
diag disk does is a "low level format", that is, it sends a format track
command to the WD1010 hard disk controller chip. Oh, yeah, then how does it
remember the old bad block table when you format a drive twice? Easy: before
formatting it checks to see if the disk about to be formatted is a UNIXpc
disk with a good VHB. If so, it reads the existing BBT, formats the disk,
then re-writes the old BBT when the format is complete. The reason is obvious:
once a bad spot, always a bad spot. I, like most people, would not like to give
a bad spot a second chance. If you want to dump the old BBT, you have to trash
the VHB to make the diag disk think it's a new, raw disk. The low level format
re-writes the ENTIRE track from index to index, gaps, headers, data, everything.
There is a "sort of" lower level format which involves warping the format by
changing the gap sizes. This is done to AVOID bad spots, is very time consuming,
and not very reliable. Some of the PC "low level" format programs can do this.
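
The save-and-restore behaviour described above can be modelled in a few lines. This is only a toy sketch: `Disk`, `low_level_format`, and `diag_format` are hypothetical names invented for illustration, not the diagnostic disk's actual code, and the bad-block entries are arbitrary tuples.

```python
# Toy model of "read the BBT, format, re-write the BBT".
# All names here are hypothetical; this is not the real diagnostic code.
from dataclasses import dataclass, field

@dataclass
class Disk:
    vhb_valid: bool = False                         # good volume header block?
    bad_blocks: list = field(default_factory=list)  # the bad block table (BBT)
    formatted: bool = False

def low_level_format(disk: Disk) -> None:
    """Stand-in for format-track commands issued across the whole disk."""
    disk.formatted = True
    disk.vhb_valid = False   # every track is rewritten, so the VHB is gone
    disk.bad_blocks = []     # ...and so is the BBT

def diag_format(disk: Disk) -> None:
    """Format, carrying the old BBT across only if the VHB looked good."""
    saved = list(disk.bad_blocks) if disk.vhb_valid else []
    low_level_format(disk)
    disk.bad_blocks = saved  # empty for a disk treated as new and raw
    disk.vhb_valid = True

disk = Disk(vhb_valid=True, bad_blocks=[(3, 5, 2)])
diag_format(disk)
print(disk.bad_blocks)       # old table survives the format

disk.vhb_valid = False       # "trash the VHB"
diag_format(disk)
print(disk.bad_blocks)       # table is dropped
```

In this model, trashing the VHB is the only way to make `diag_format` forget the old table, which mirrors the workaround described in the post.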

What the DOS format command does is something completely different. It just
fills the disk with a pattern (FD I think). This is all it can do because DOS
has no way to tell just what kind of hard disk controller (chip) is down there.

John
-- 
John Bly Milton IV, jbm@uncle.UUCP, n8emr!uncle!jbm@osu-cis.cis.ohio-state.edu
(614) h:294-4823, w:785-1110; N8KSN, AMPR: 44.70.0.52; Don't FLAME, inform!

jcm@mtunb.ATT.COM (was-John McMillan) (08/03/89)

It is my impression, from 'wasting' several days in the bowels of
the S4 diagnostics code, that NOTHING can be done from software
that will recover a disk which fails re-calibration.  (And those
bowels were NOT a pretty sight!-)

Those days were spent trying to redeem the lost soul of an MX-2190
-- this was not a mere intellectual exercise, it was a personal crisis!-)
-- as only 140 MB of lost sources can be....

If anyone can correct my impression that a failed re-calibration
prevents ALL useful WDx020 operations, please advise.  Until then
I will presume that the controller CANNOT write to a disk for which
the base-reference -- the recalibration point -- cannot be found:

	Ya can't FORMAT track 0 if ya can't FIND track 0.
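
That one-liner can be illustrated with a simplified model of recalibration: step the heads outward until the drive reports track 0, and give up after a fixed number of steps. The `Drive` class, the `MAX_STEPS` limit, and the function names are assumptions made for the sketch, not values or routines taken from the S4 diagnostics or from a controller datasheet.

```python
# Simplified model of recalibration. MAX_STEPS is an illustrative limit,
# and Drive is a hypothetical stand-in for the real hardware.
MAX_STEPS = 2048

class Drive:
    def __init__(self, cylinder, track0_sensor_works=True):
        self.cylinder = cylinder
        self.track0_sensor_works = track0_sensor_works

    def at_track0(self):
        # A damaged drive may never report track 0, wherever the heads are.
        return self.track0_sensor_works and self.cylinder == 0

    def step_out(self):
        if self.cylinder > 0:
            self.cylinder -= 1

def recalibrate(drive):
    """Step outward until track 0 is reported; False if it never is."""
    for _ in range(MAX_STEPS):
        if drive.at_track0():
            return True
        drive.step_out()
    return drive.at_track0()

def format_track0(drive):
    """Formatting needs a known reference point, so recalibrate first."""
    if not recalibrate(drive):
        raise RuntimeError("can't recal: track 0 not found")
    return "track 0 formatted"

print(format_track0(Drive(cylinder=500)))   # healthy drive
try:
    format_track0(Drive(cylinder=500, track0_sensor_works=False))
except RuntimeError as err:
    print(err)                              # failed recalibration blocks the format
```

In this model every later operation depends on `recalibrate` succeeding, which is the point being made: software has nothing to work with if the reference position cannot be found.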

john mcmillan -- att!mtunb!jcm -- ...speaking for self, only...

jcm@mtunb.ATT.COM (was-John McMillan) (08/03/89)

In article <580@uncle.UUCP> jbm@uncle.UUCP (John B. Milton) writes:
:
> ...		 If so, it reads the existing BBT, formats the disk,
>then re-writes the old BBT when the format is complete. The reason is obvious:
>once a bad spot, always a bad spot. I, like most people would not like to give
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>a bad spot a second chance. If you want to dump the old BBT, you have to trash
>the VHB to make the diag disk think it's a new, raw disk. The low level format
>re-writes the ENTIRE track from index to index, gaps, headers, data, everything.

Be more charitable, John:
	Redemption IS possible, for some... but it takes love and attention.

Brother McMillan has spent long nights with several disks, and twice
the WORD has driven satan outta that disk!

We've been here before... o yes, children, this HAS BEEN SAID BEFORE!

Formatting a disk puts META information on the disk: it's like
	spraying the lines in the parking lot.  Without these lines,
	the disk cannot identify where the data is to be placed/grabbed.
	Each disk sector has a leading-edge gap, mark, identifier,
	gap/mark, user data, and final gap (sometimes w/ CRC).
	(OK -- I'm faking it, I haven't looked at the code for years!-)

Reading/Writing requires finding the sector identifier and precisely
	dribbling/sucking-up the data after hitting the internal mark.
	(Nice technical terms, eh?)
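
The parking-lot picture can be sketched as a toy data structure: formatting writes an identifier for every sector, and every read or write starts by finding that identifier. Gaps and marks are left out, and the 512-byte sector size and 16 sectors per track are assumed values for illustration only.

```python
# Toy model: a formatted track is a list of sectors, each carrying the
# identifier written at format time plus a user data field.
# Gaps and marks are omitted; SECTOR_BYTES is an assumed size.
from dataclasses import dataclass

SECTOR_BYTES = 512

@dataclass
class Sector:
    ident: tuple   # (cylinder, head, sector): the "painted line"
    data: bytes    # user data field

def format_track(cylinder, head, sectors_per_track):
    """Write fresh identifiers and empty data fields for one track."""
    return [Sector((cylinder, head, s), bytes(SECTOR_BYTES))
            for s in range(sectors_per_track)]

def find_sector(track, ident):
    """Reads and writes both begin by locating the sector identifier."""
    for sector in track:
        if sector.ident == ident:
            return sector
    raise LookupError(f"sector {ident} not found: META information missing")

track = format_track(cylinder=0, head=0, sectors_per_track=16)
find_sector(track, (0, 0, 3)).data = b"x" * SECTOR_BYTES   # write
print(find_sector(track, (0, 0, 3)).data[:4])              # b'xxxx'
```

If the identifier is missing or unreadable, `find_sector` fails before any user data is touched, which is why damaged META information can only be repaired by formatting again.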

A bad block, then, is one which has:
		Proven itself to be unreadable, or unreliably readable.

Types of BAD BLOCKS:
     +	SOME blocks are simply a victim of poor penmanship:
		If a block is badly written, it may be unreadable.
	And bad vibrations -- not to mention bad karma -- or power
	glitches might cause bad writes.  (Likewise, some RETRIES may
	indicate NOT a bad block, but an Anomaly occurring during
	the READ cycle.)

     ++	If the write error is in the USER DATA field, the block may
	be recoverable by performing a FULL-BLOCK write: this will
	overlay the badly written bits without trying a read first.
	(If you write only 100 bytes, the other bytes have to be READ first.)

     ++	However, if the META information is corrupted -- by overwriting
	or by a marginal write during formatting -- only a re-format
	of that sector (track) can reclaim the block.

     +	OK: there are also MEDIA defects.  And THESE are the BAD BLOCKS
	which John was referring to, the ones we presume to be beyond
	salvation.
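
The point about FULL-BLOCK writes can be shown with a toy block that stores a checksum next to its data. A partial write has to read the old contents first and so trips over the bad data, while a full write simply replaces everything. The `Block` class is hypothetical, and using `zlib.crc32` as the check is an assumption standing in for whatever the controller really computes.

```python
# Toy block with a checksum. zlib.crc32 stands in for the real on-disk
# check, and BLOCK is an assumed block size.
import zlib

BLOCK = 512

class Block:
    def __init__(self):
        self.data = bytes(BLOCK)
        self.crc = zlib.crc32(self.data)

    def read(self):
        if zlib.crc32(self.data) != self.crc:
            raise IOError("read error: CRC mismatch")
        return self.data

    def write_full(self, data):
        """Overlay the whole block without reading it first."""
        if len(data) != BLOCK:
            raise ValueError("full write needs exactly one block of data")
        self.data = data
        self.crc = zlib.crc32(data)

    def write_partial(self, offset, chunk):
        """Changing part of a block means reading the rest first."""
        old = self.read()
        self.write_full(old[:offset] + chunk + old[offset + len(chunk):])

blk = Block()
blk.data = b"\xff" + blk.data[1:]    # simulate a badly written data field

try:
    blk.write_partial(0, b"hello")   # fails: the read comes first
except IOError as err:
    print(err)

blk.write_full(b"a" * BLOCK)         # succeeds: no read needed
print(blk.read()[:4])                # b'aaaa'
```

This only covers damage in the user data field. Damage to the sector identifier itself is outside the model, and as the text says it needs a re-format.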

When my 67 MB drive began having MAJOR problems with bad blocks in the
SWAP -- I developed 120+ BAD BLOCKS  over two weeks, and some odd messages
from the diagnostics/surface check code -- I backed up everything.
	Feature -- this took only 3 days because CPIO was
	failing to verify dump after dump.

Then I ZERO'd the Bad block list, and reformatted.

And now, I have NO BAD BLOCKS.
And there's been not a single read error in the subsequent 4 weeks.

So I say, John: Bad Block lists aren't holy.  There are varying reasons
why Blocks are entered.  And good reasons for considering a reformat
-- 'though the manufacturer's list of defects *MAY* be worth copying in.

john mcmillan -- att!mtunb!jcm -- speaking for hizzelf, only

PS: In a recent power hit in Lincroft, several 3B1 disks expired.
	Oddly, none of us with Spike/Noise suppression units were hurt.
	Then there's the fellah who hadn't put his suppression unit in
	service yet... sad fellah.  Don't you be sad: use line conditioning.

thad@cup.portal.com (Thad P Floryan) (08/05/89)

John McMillan concludes one of his recent postings with:

"	PS: In a recent power hit in Lincroft, several 3B1 disks expired.
	Oddly, none of us with Spike/Noise suppression units were hurt.
	Then there's the fellah who hadn't put his suppression unit in
	service yet... sad fellah.  Don't you be sad: use line conditioning.
"

Sage advice.  Prior to installing a UPS _AND_ a line-conditioner on every
system here, I could expect several failures a week (on any of Amiga, UNIXPC,
several homebrews, etc.).  Even turning on a fluorescent room light or turning
off a modem while writing to a floppy would trash the disk; now I can operate
drill motors, etc. on the same line circuit with impunity ... ZERO errors for
over 3-1/2 years now, and ALL my systems are operated 24 hours/day, 7 days/wk.

In my quest for 100% system reliability, I rented a line monitor for 30 days
and let it record everything on the AC power.  What it saw almost made me poop
my pants ... literally.  2000V spikes, hash, RF, etc etc etc  even lossage of
a cycle (of the 60 Hz) now and then (and this was NOT during the "normal" power
outages for this area).

The types of crap one finds on the AC power line are caused by any/all of:
	air conditioners,
	refrigerators,
	fluorescent lamps,
	any other inductive or capacitive loads (modems, printers, fans, etc.),
	thunderstorm activity ANYWHERE near your power grid,
	hospitals and medical equipment,
	construction activity (esp. the ol' backhoe digging up power lines),
	air pollution and acid rain,
	animals "playing" around power lines/transformers (and this includes
		your neighbors' kids with their kites), and
	anything else that plugs into the AC power line.

If your livelihood depends on reliable system operation, you're living on
borrowed time (or walking the edge) if you don't have at least a good spike
and surge/transient suppressor between the wall outlet and your system(s).  I
even have special modem protectors (designed/built by GTE) to work in
conjunction with the "Primary Phone Line Protector" (the "box" where the phone
service enters your site) installed on every line by one's local TelCo.

Yeah, I sometimes joke about operating my computers under candlelight during
a power failure (when the UPS is powering everything), but the peace of mind is
definitely worth it.  The only thing I haven't got working with the 1200 Watt
units yet is getting the signals from the UPS' DB-9 connector into the UNIXPC
to carry on the dialogue between the UPS and the computer as has been done with
the Convergent Tech Miniframes (these UPS systems were DESIGNED for use with
the Miniframe under contract to SAFE in Arizona).  If you want to contact them
for the address of a dealer near you:

	SAFE Power Systems, Inc.
	528 West 21st Street
	Tempe, AZ  85282        602/894-6864

"PC" magazine had a good article several years ago about surge protectors; some
they tested even AMPLIFIED the spikes to yet higher voltages!  At least you know
that nothing over approx. 4,000 volts will come into your system ... 4000V is
the flashover point between the two prongs (hot and neutral) on your standard
USA AC power plug.

Thad Floryan [ thad@cup.portal.com (OR) ..!sun!portal!cup.portal.com!thad ]

jbm@uncle.UUCP (John B. Milton) (08/07/89)

In article <1583@mtunb.ATT.COM> jcm@mtunb.UUCP (was-John McMillan) writes:
>In article <580@uncle.UUCP> jbm@uncle.UUCP (John B. Milton) writes:
>:
>> ...		 If so, it reads the existing BBT, formats the disk,
>>then re-writes the old BBT when the format is complete. The reason is obvious:
>>once a bad spot, always a bad spot. I, like most people would not like to give
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>a bad spot a second chance. If you want to dump the old BBT, you have to trash
>>the VHB to make the diag disk think it's a new, raw disk. The low level format
>>re-writes the ENTIRE track from index to index, gaps, headers, data, everything.
...
>     +	OK: there are also MEDIA defects.  And THESE are the BAD BLOCKS
>	which John was referring to, the ones we presume to be beyond
>	salvation.

I really did mean what I said. Perhaps I should have been more specific. What
I was referring to was places on the disk that are physically not responding
correctly. Many, many other things can go wrong that do not mesh with the "bad
data read, this must be a bad block" idea. The format routine on the diagnostics
disk was written for the user. It was written to find pre-existing bad spots
that are expected to be there. The diagnostics assume that the hardware is
functioning correctly. Remember, if it acts weird, you're supposed to call AT&T
service, right? My original comment was also made assuming your system is
functioning properly (or is now).

So is it time for a HwNote on what's REALLY on the disk and how disks work?

John
-- 
John Bly Milton IV, jbm@uncle.UUCP, n8emr!uncle!jbm@osu-cis.cis.ohio-state.edu
(614) h:294-4823, w:785-1110; N8KSN, AMPR: 44.70.0.52; Don't FLAME, inform!