kww@cbnews.ATT.COM (Kevin W. Wall) (07/25/89)
I am having trouble with the hard disk on my 3B1 (a 67Mb Miniscribe). A while back, the power supply on my 3B1 failed. I have since replaced the power supply, but its failure apparently caused the hard disk to fail as well. At the time, I thought that was no big deal; I figured I could just reformat the hard drive and everything would be okay.

Well, when THAT didn't work, I thought I might have had a head crash, so I borrowed a colleague's "UNIX PC Reference Manual" and turned to the diagnostics section (Chapter 3) so I could confirm whether or not it was indeed a head crash. As it turns out, even the so-called "Expert Mode Diagnostics Program" is virtually useless, but THAT'S another story. However, I did learn that 1) all hard disk diagnostics apparently first try to "recalibrate" the drive, and 2) this "recalibration" would always fail. For example, if I run the diagnostic to park the heads on the hard disk, I get the following output:

    WINCHESTER DISK TEST

    Hard disk restore failed
    **ERROR**
    Test: Hard disk test (drive 0)
    Subtest: Park Disk Heads
    Error: WINCHESTER: Can't Recal; Response = 4
    Enter y[Y] to Abort, Return to continue

Anything that I tried to run that concerned the hard disk resulted in the same:

    Error: WINCHESTER: Can't Recal; Response = 4

However, all diagnostics on other devices (e.g., floppy, CPU, memory, etc.) passed with no errors.

Now my questions are these:

1) What exactly does "Can't Recal; Response = 4" mean? In particular, what does this most likely indicate? (E.g., a problem with the disk controller, with the hard disk media, etc.)

2) Can I safely assume that the problem is in the hard disk UNIT (all the pieces inside the smaller cage containing the disk itself), as opposed to being some kind of problem on the mother board? [The reason that I want to know this is that I have a 72Mb Miniscribe (Model # 6085) that I could (would like to) install to replace the (bad?) drive, but I don't want to do this if there is a problem somewhere on the mother board which will cause this one to fail too.]

3) Is there any way, short of opening the sealed drive itself, to tell if the problem is a head crash (vs. say, the failure of the hard disk controller)? [Note that no portion of the 3B1 is still under warranty, so I am not averse to opening the disk unit and peering inside. However, I would like to send the disk drive to be refurbished, so I don't want to break the seal if this will keep companies from trying to fix it or if I have a good chance of making things worse than they already are.]

4) Assuming the disk is repairable (how can I tell, especially without opening it?), does anyone know of a reliable and reputable company that you would trust to fix it? [Recovery of the data is only of secondary interest. I have (about) everything either on floppy, or on a mainframe at work, or am willing to do without. I estimate that recovering everything will take approximately 72 hrs. of connect time at 2400 baud, so if recovery is CHEAP enough, I may be interested; it depends on whether I could have them selectively recover certain directories (e.g., /usr/local and my $HOME directory).]

Some final info that might be pertinent. I was running version 3.51 of the operating system when the system "crashed", and ran version 3.51 of the diagnostics disk to analyze the problem. The specifics on the disk itself follow. (Fortunately, I wrote this info down a long time ago, as the diagnostics to get this information now fail!)

    Volume Name: mi67-4
    1024 Cylinders.
    8 heads per Cylinder.
    Configured as:
        Total space: 65,536 blocks
        Partition 0 = 64 blocks
        Partition 1 = 5000 blocks
        User space  = 60,472 blocks

One final note: please E-mail replies directly to me as opposed to congesting this newsgroup with a bunch of follow-up discussions. If enough interest is expressed, I'll summarize to the net.

Please try to explain in layman's terms; I'm a UNIX hacker, not a hardware jock. :-)

Thanks in advance for your help!
--
In person: Kevin W. Wall          AT&T Bell Laboratories
Usenet/UUCP: {att!}cblpf!kww      6200 E. Broad St.
Internet: kww@cblpf.att.com       Columbus, Oh. 43213
"Death is life's way of firing you!" -- Hack rumor
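
The figures quoted in the post above can be sanity-checked with a little arithmetic. The sketch below confirms that the three partition sizes add up to the stated total, and estimates how much data 72 hours at 2400 baud would move. The 10-bits-per-byte figure (8 data bits plus start and stop bits) and the fully used link are assumptions that do not come from the post; the function and constant names are only illustrative.

```python
# Figures taken from the post: partition sizes and total, in blocks.
PARTITIONS = {"Partition 0": 64, "Partition 1": 5000, "User space": 60472}
TOTAL_BLOCKS = 65536

# Assumptions (NOT from the post): each byte costs 10 bits on the wire
# (8 data bits plus start and stop bits) and the link is never idle.
BITS_PER_BYTE_ON_WIRE = 10
BAUD = 2400
CONNECT_HOURS = 72


def partitions_sum_to_total(partitions, total):
    """Return True if the partition sizes add up to the stated total."""
    return sum(partitions.values()) == total


def bytes_transferred(baud, hours, bits_per_byte=BITS_PER_BYTE_ON_WIRE):
    """Bytes moved over a serial link in the given number of hours."""
    seconds = hours * 3600
    return baud * seconds // bits_per_byte


if __name__ == "__main__":
    print("partitions consistent:",
          partitions_sum_to_total(PARTITIONS, TOTAL_BLOCKS))
    print("bytes in 72 h at 2400 baud:",
          bytes_transferred(BAUD, CONNECT_HOURS))
```

Under those assumptions the transfer comes to about 62 million bytes, which is the same order of magnitude as the drive's capacity, so the 72-hour estimate looks plausible for re-fetching nearly everything.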
clewis@eci386.uucp (Chris Lewis) (07/27/89)
Talk about timely! I'm posting instead of mailing because of the coincidence (and we *know* what's wrong with ours... Graphically.)

In article <8569@cbnews.ATT.COM> kww@cbnews.ATT.COM (Kevin W. Wall) writes:
>I am having trouble with my hard disk on my 3B1 (a 67Mb Miniscribe).

So am I.

> For example, if I run the diagnostic
>to park the heads on the hard disk, I will get the following output:
>
> WINCHESTER DISK TEST
>
> Hard disk restore failed
> **ERROR**
> Test: Hard disk test (drive 0)
> Subtest: Park Disk Heads
> Error: WINCHESTER: Can't Recal; Response = 4
> Enter y[Y] to Abort, Return to continue

That's what we get too.

> 1) What exactly does "Can't Recal; Response = 4" mean? In particular,
> what does this most likely indicate? (E.g., a problem with the disk
> controller, with the hard disk media, etc.)

That the drive cannot find the first track on the disk. I'm not sure how good the diagnostics are, but I do know that you really cannot do anything with a disk if you don't recalibrate first. Unless your diagnostics are capable of actually testing the controller (which I doubt in this case), it's hard to tell whether it's the controller or the disk.

My system originally didn't boot from HD or floppy, but we eventually got the floppy running, ruling out the rest of the logic board except possibly the controller itself. It could have been a bad diagnostic floppy (an off-the-net copy of s4diag that I had booted successfully once before) that prevented the floppy boot at home.

> 2) Can I safely assume that the problem is in the hard disk UNIT
> (all the pieces inside the smaller cage containing the disk itself),
> as opposed to being some kind of problem on the mother board?

Not necessarily.

> [The reason that I want to know this is that I have a 72Mb Miniscribe
> (Model # 6085) that I could (would like to) install to replace the
> (bad?) drive, but don't want to do this if there is a problem
> somewhere on the mother board which will cause this one to fail too.]

*Very* unlikely - on ST506 drives you can do almost anything to the connections without harm (eg: getting either cable backwards...). Unless something's wrong with the power supply - a VOM would come in handy.

> 3) Is there any way, short of opening the sealed drive itself, to
> tell if the problem is a head crash (vs. say, the failure of the
> hard disk controller)?

No. But there can be additional evidence. Eg: loud scraping noises. Which is what I had been getting, louder and louder, over the previous week or two. I originally thought it was the fan dying, but once I had the cover off, it became obvious where it was really coming from.

Another thing that might help is opening the 3b1, disconnecting the ribbon cable from the power supply to the logic board, and powering the thing up. If the drive spins up reasonably quietly with no activity on the HD drive LED (seen through the perforations on the HD cage), you probably still have a good drive. Mine made noises and the LED gave me a repeating flash ... flash-flash-flash ... flash-flash-flash ... flash-flash-flash code. Which might mean something if you have the right manuals.

If there is a true head crash, chances are that the drive isn't worth repairing.... Generally speaking, repair houses charge a fixed rate (on the order of $500-$1000) to repair a drive. Even then, you generally don't get data recovery (especially if some of the oxide is missing...). And you can usually buy a new drive for less than the repair cost.

If the drive is truly zorched, I don't think that a repair centre would care whether you had peeked inside. Once you find one, you could always ask.

In our case, our resident expert on 3b1 noises took the chance and opened the drive this morning. He has managed to take one apart, fix it, and have it work after he's closed it up, but it should really be done in a "clean room".

Oh my! Heads 3 and 5 fell off, and there's this neat 1/2" wide stripe of melted aluminum where the head supports touched down on two of the surfaces. Lost a few square inches of oxide. Starting at cylinder 0. There go my comp.sources.unix and comp.sources.misc archives - they're just reddish dust on the workbench now. Sigh. I must be sick - I'm actually giggling about it...

I understand Jim Joyce (of UNIX bookstore fame) makes a living recovering data from mangled drives, but he charges quite a bit (quite a bit for a hobbyist, not that much for a company that's got lots of money riding on its disk contents). None of the people we go to for repairs would be of use to you "down there".

Now I just have to see if I can get another drive...
--
Chris Lewis, R.H. Lathwell & Associates: Elegant Communications Inc.
UUCP: {uunet!mnetor, utcsri!utzoo}!lsuc!eci386!clewis
Phone: (416)-595-5425
erict@flatline.UUCP (J. Eric Townsend) (07/28/89)
In article <1989Jul26.174524.21833@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
>Somebody else wrote:
>> WINCHESTER DISK TEST
>>
>> Hard disk restore failed
>> **ERROR**
>> Test: Hard disk test (drive 0)
>> Subtest: Park Disk Heads
>> Error: WINCHESTER: Can't Recal; Response = 4
>> Enter y[Y] to Abort, Return to continue

According to an AT&T tech who came out and replaced the HD in my 3b1 (while it was under warranty), this is something that could be fixed from floppy-unix, if AT&T had bothered to ship a program that could do the super-low level format needed to test the hard drive.

This is where I start to lose understanding of the subject, so I only *think* I'm correct.

There are two levels of formatting: The normal level, what "we" use, merely erases the disk, and sets up the base for the file system.

There is a lower level format that actually writes the 0 block (or wherever the "what am I" information for the drive is stored). This "what am I" information is what the 3b1 uses to format the hard drive. Currently, there is no way to do a "you are a X" format on a drive. (I've done this on IBM PClones, however. :-( )

Anyway, maybe somebody with a tad more HD knowledge can xlate the above into technical-talk, and correct it where necessary.
--
J. Eric Townsend              "[Leslie Stahl was] a pussy compared to [Dan] Rather."
uunet!sugar!flatline!erict        -- George Herbert Walker Bush
com6@uhnix1.uh.edu            511 Parker #2, Houston, Tx 77007
EastEnders Mailing list: eastender@flatline.UUCP
clewis@eci386.uucp (Chris Lewis) (07/28/89)
In article <850@flatline.UUCP> erict@flatline.UUCP (J. Eric Townsend) writes:
>According to an AT&T tech who came out and replaced the HD in
>my 3b1 (while it was under warranty), this is something that could
>be fixed from floppy-unix, if AT&T had bothered to ship a program
>that could do the super-low level format needed to test the hard drive.
>This is where I start to lose understanding of the subject, so
>I only *think* I'm correct.

You're not. He may have been simply mistaken, or trying to make sure that you only buy drives from AT&T because "only we can format 'em". [Only a possibility; I see no evidence of this with the AT&T people I deal with.]

Proof? Simple: almost every single ST506 controller uses a slightly different pattern of bits for the physical representation of sector headers and trailers. We do hardware maintenance on a host of machines, and I can assure you that when you take a disk from another type of machine and insert it in a 3b1, the formats are different, and the diagnostic floppy formatter *does* do low level formats. (Otherwise, I'd never get a new drive for my machine ;-)

The "disk erase and file system preparation" program he was referring to is UNIX "mkfs", which is the second stage of preparing a HD for UNIX (the two stages being analogous to low level formatters and FDISK on DOS).

However, there are at least a few companies that do not provide low level formatters for HD's, or that do other similar things like requiring tape drivers to only accept a tape with a label that only the machine's vendor can write. So they can have a captive media market. You can take some comfort that at least one of these companies (quite large at one point, I may add) has gone belly up. So did the company that took 'em over.
--
Chris Lewis, R.H. Lathwell & Associates: Elegant Communications Inc.
UUCP: {uunet!mnetor, utcsri!utzoo}!lsuc!eci386!clewis
Phone: (416)-595-5425
jbm@uncle.UUCP (John B. Milton) (07/31/89)
In article <850@flatline.UUCP> erict@flatline.UUCP (J. Eric Townsend) writes:
>In article <1989Jul26.174524.21833@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
>>Somebody else wrote:
[ error ]
>
>According to an AT&T tech who came out and replaced the HD in
>my 3b1 (while it was under warranty), this is something that could
>be fixed from floppy-unix, if AT&T had bothered to ship a program
>that could do the super-low level format needed to test the hard drive.

Hmmm. Far too general a statement to be entirely correct.

>This is where I start to lose understanding of the subject, so
>I only *think* I'm correct.

Well, ok

>There are two levels of formatting: The normal level, what "we"
>use, merely erases the disk, and sets up the base for the file system.

I can tell you've been too close to DOS land.

>There is a lower level format that actually writes the 0 block (or
>wherever the "what am I" information for the drive is stored). This
>"what am I" information is what the 3b1 uses to format the hard drive.
>Currently, there is no way to do a "you are a X" format on a drive.
>(I've done this on IBM PClones, however. :-(

Well, there are several levels to formatting the hard disk drive. What the diag disk does is a "low level format", that is, it sends a format track command to the WD1010 hard disk controller chip.

Oh, yeah, then how does it remember the old bad block table when you format a drive twice? Easy: before formatting, it checks to see if the disk about to be formatted is a UNIXpc disk with a good VHB. If so, it reads the existing BBT, formats the disk, then re-writes the old BBT when the format is complete. The reason is obvious: once a bad spot, always a bad spot. I, like most people, would not like to give a bad spot a second chance. If you want to dump the old BBT, you have to trash the VHB to make the diag disk think it's a new, raw disk. The low level format re-writes the ENTIRE track from index to index: gaps, headers, data, everything.

There is a "sort of" lower level format which involves warping the format by changing the gap sizes. This is done to AVOID bad spots, is very time consuming, and is not very reliable. Some of the PC "low level" format programs can do this.

What the DOS format command does is something completely different. It just fills the disk with a pattern (FD I think). This is all it can do because DOS has no way to tell just what kind of hard disk controller (chip) is down there.

John
--
John Bly Milton IV, jbm@uncle.UUCP, n8emr!uncle!jbm@osu-cis.cis.ohio-state.edu
(614) h:294-4823, w:785-1110; N8KSN, AMPR: 44.70.0.52; Don't FLAME, inform!
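
The sequence described in the post above (check for a good VHB, save the BBT, format, re-write the BBT) can be sketched roughly as follows. This is only a toy in-memory model, not the diagnostic disk's actual code: the `Disk` class, its fields, and both function names are hypothetical, and the assumption that the formatter writes a fresh VHB afterwards is mine rather than something stated in the thread.

```python
from dataclasses import dataclass, field


@dataclass
class Disk:
    """Toy model of a drive; NOT the real 3B1 on-disk layout."""
    vhb_valid: bool = False                 # is there a good VHB?
    bad_blocks: list = field(default_factory=list)   # the BBT
    formatted_tracks: int = 0


def low_level_format(disk, tracks):
    """Stand-in for sending a format-track command for every track.

    Formatting rewrites each track from index to index, so nothing
    previously on the disk (VHB and BBT included) survives.
    """
    disk.vhb_valid = False
    disk.bad_blocks = []
    disk.formatted_tracks = tracks


def format_preserving_bbt(disk, tracks):
    """Format the disk, carrying the old BBT across if the VHB was good."""
    # Only trust the old table if the disk looks like a valid UNIXpc disk.
    saved = list(disk.bad_blocks) if disk.vhb_valid else []
    low_level_format(disk, tracks)
    # Assumption: the formatter writes a new VHB and the saved table back.
    disk.bad_blocks = saved
    disk.vhb_valid = True
    return saved


if __name__ == "__main__":
    d = Disk(vhb_valid=True, bad_blocks=[12, 99])
    print(format_preserving_bbt(d, tracks=8192))
```

In this model, "trashing the VHB" corresponds to setting `vhb_valid` to `False` before the call, which makes the formatter start with an empty table.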
jcm@mtunb.ATT.COM (was-John McMillan) (08/03/89)
It is my impression, from 'wasting' several days in the bowels of the S4 diagnostics code, that NOTHING can be done from software that will recover a disk which fails re-calibration. (And those bowels were NOT a pretty sight!-)

Those days were spent trying to redeem the lost soul of an MX-2190 -- this was not a mere intellectual exercise, it was a personal crisis!-) -- as only 140 MB of lost sources can be....

If anyone can correct my impression that a failed re-calibration prevents ALL useful WDx020 operations, please advise. Until then I will presume that the controller CANNOT write to a disk for which the base-reference -- the recalibration point -- cannot be found:

    Ya can't FORMAT track 0 if ya can't FIND track 0.

john mcmillan -- att!mtunb!jcm -- ...speaking for self, only...
jcm@mtunb.ATT.COM (was-John McMillan) (08/03/89)
In article <580@uncle.UUCP> jbm@uncle.UUCP (John B. Milton) writes:
:
> ... If so, it reads the existing BBT, formats the disk,
>then re-writes the old BBT when the format is complete. The reason is obvious:
>once a bad spot, always a bad spot. I, like most people, would not like to give
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>a bad spot a second chance. If you want to dump the old BBT, you have to trash
>the VHB to make the diag disk think it's a new, raw disk. The low level format
>re-writes the ENTIRE track from index to index, gaps, headers, data, everything.

Be more charitable, John: Redemption IS possible, for some... but it takes love and attention. Brother McMillan has spent long nights with several disks, and twice the WORD has driven satan outta that disk!

We've been here before... o yes, children, this HAS BEEN SAID BEFORE!

Formatting a disk puts META information on the disk: it's like spraying the lines in the parking lot. Without these lines, the disk cannot identify where the data is to be placed/grabbed.

Each disk sector has a leading-edge gap, mark, identifier, gap/mark, user data, and final gap (sometimes w/ CRC). (OK -- I'm faking it, I haven't looked at the code for years!-) Reading/Writing requires finding the sector identifier and precisely dribbling/sucking-up the data after hitting the internal mark. (Nice technical terms, eh?)

A bad block, then, is one which has:

    Proven itself to be unreadable, or unreliably readable.

Types of BAD BLOCKS:

+ SOME blocks are simply victims of poor penmanship: If a block is badly written, it may be unreadable. And bad vibrations -- not to mention bad karma -- or power glitches might cause bad writes. (Likewise, some RETRIES may indicate NOT a bad block, but an Anomaly occurring during the READ cycle.)

  ++ If the write error is in the USER DATA field, the block may be recoverable by performing a FULL-BLOCK write: this will overlay the badly written bits without trying a read first. (If you write only 100 bytes, the other bytes have to be READ first.)

  ++ However, if the META information is corrupted -- by overwriting or by a marginal write during formatting -- only a re-format of that sector (track) can reclaim the block.

+ OK: there are also MEDIA defects. And THESE are the BAD BLOCKS which John was referring to, the ones we presume to be beyond salvation.

When my 67 MB drive began having MAJOR problems with bad blocks in the SWAP -- I developed 120+ BAD BLOCKS over two weeks, and some odd messages from the diagnostics/surface check code -- I backed up everything. Feature -- this took only 3 days because CPIO was failing to verify dump after dump. Then I ZERO'd the Bad block list, and reformatted.

And now, I have NO BAD BLOCKS. And there's been not a single read error in the subsequent 4 weeks.

So I say, John: Bad Block lists aren't holy. There are varying reasons why Blocks are entered. And good reasons for considering a reformat -- 'though the manufacturer's list of defects *MAY* be worth copying in.

john mcmillan -- att!mtunb!jcm -- speaking for hizzelf, only

PS: In a recent power hit in Lincroft, several 3B1 disks expired. Oddly, none of us with Spike/Noise suppression units were hurt. Then there's the fellah who hadn't put his suppression unit in service yet... sad fellah. Don't you be sad: use line conditioning.
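
The distinction drawn above between a full-block write and a partial write can be illustrated with a small model. This is a sketch under stated assumptions, not driver code: `ToyDisk`, `write_partial`, and the 512-byte block size are all hypothetical, and the model covers only the "badly written USER DATA field" case, not corrupted META information or true media defects.

```python
BLOCK_SIZE = 512  # assumed block size, for illustration only


class ToyDisk:
    """In-memory stand-in for a disk whose data fields can go bad."""

    def __init__(self, nblocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(nblocks)]
        self.unreadable = set()   # blocks whose data field was badly written

    def read_block(self, n):
        if n in self.unreadable:
            raise OSError(f"block {n}: read error")
        return self.blocks[n]

    def write_block(self, n, data):
        """Full-block write: no read needed, overlays the bad bits."""
        if len(data) != BLOCK_SIZE:
            raise ValueError("full-block write needs exactly one block")
        self.blocks[n] = bytes(data)
        self.unreadable.discard(n)


def write_partial(disk, n, offset, data):
    """Partial write: the rest of the block has to be READ first."""
    old = disk.read_block(n)          # fails while the block is unreadable
    new = old[:offset] + data + old[offset + len(data):]
    disk.write_block(n, new)


if __name__ == "__main__":
    disk = ToyDisk(8)
    disk.unreadable.add(3)
    try:
        write_partial(disk, 3, 0, b"x" * 100)
    except OSError as err:
        print("partial write failed:", err)
    disk.write_block(3, bytes(BLOCK_SIZE))   # full-block write reclaims it
    write_partial(disk, 3, 0, b"x" * 100)    # now succeeds
    print(disk.read_block(3)[:4])
```

The partial write fails on the damaged block because it must read before it writes, while the full-block write succeeds and leaves the block readable again.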
thad@cup.portal.com (Thad P Floryan) (08/05/89)
John McMillan concludes one of his recent postings with:

" PS: In a recent power hit in Lincroft, several 3B1 disks expired.
  Oddly, none of us with Spike/Noise suppression units were hurt.
  Then there's the fellah who hadn't put his suppression unit in
  service yet... sad fellah. Don't you be sad: use line conditioning.
"

Sage advice. Prior to installing a UPS _AND_ a line-conditioner on every system here, I could expect several failures a week (on any of Amiga, UNIXPC, several homebrews, etc.). Even turning on a fluorescent room light or turning off a modem while writing to a floppy would trash the disk; now I can operate drill motors, etc. on the same line circuit with impunity ... ZERO errors for over 3-1/2 years now, and ALL my systems are operated 24 hours/day, 7 days/wk.

In my quest for 100% system reliability, I rented a line monitor for 30 days and let it record everything on the AC power. What it saw almost made me poop my pants ... literally. 2000V spikes, hash, RF, etc etc etc, even lossage of a cycle (of the 60 Hz) now and then (and this was NOT during the "normal" power outages for this area).

The types of crap one finds on the AC power line are caused by any/all of: air conditioners, refrigerators, fluorescent lamps, any other inductive or capacitive loads (modems, printers, fans, etc.), thunderstorm activity ANYWHERE near your power grid, hospitals and medical equipment, construction activity (esp. the ol' backhoe digging up power lines), air pollution and acid rain, animals "playing" around power lines/transformers (and this includes your neighbors' kids with their kites), and anything else that plugs into the AC power line.

If your livelihood depends on reliable system operation, you're living on borrowed time (or walking the edge) if you don't have at least a good spike and surge/transient suppressor between the wall outlet and your system(s). I even have special modem protectors (designed/built by GTE) to work in conjunction with the "Primary Phone Line Protector" (the "box" where the phone service enters your site) installed on every line by one's local TelCo.

Yeah, I sometimes joke about operating my computers under candlelight during a power failure (when the UPS is powering everything), but the peace of mind is definitely worth it. The only thing I haven't got working with the 1200 Watt units yet is getting the signals from the UPS' DB-9 connector into the UNIXPC to carry on the dialogue between the UPS and the computer, as has been done with the Convergent Tech Miniframes (these UPS systems were DESIGNED for use with the Miniframe under contract to SAFE in Arizona). If you want to contact them for the address of a dealer near you:

    SAFE Power Systems, Inc.
    528 West 21st Street
    Tempe, AZ 85282
    602/894-6864

"PC" magazine had a good article several years ago about surge protectors; some they tested even AMPLIFIED the spikes to yet higher voltages! At least you know that nothing over approx. 4,000 volts will come into your system ... 4000V is the flashover point between the two prongs (hot and neutral) on your standard USA AC power plug.

Thad Floryan [ thad@cup.portal.com (OR) ..!sun!portal!cup.portal.com!thad ]
jbm@uncle.UUCP (John B. Milton) (08/07/89)
In article <1583@mtunb.ATT.COM> jcm@mtunb.UUCP (was-John McMillan) writes:
>In article <580@uncle.UUCP> jbm@uncle.UUCP (John B. Milton) writes:
>:
>> ... If so, it reads the existing BBT, formats the disk,
>>then re-writes the old BBT when the format is complete. The reason is obvious:
>>once a bad spot, always a bad spot. I, like most people, would not like to give
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>a bad spot a second chance. If you want to dump the old BBT, you have to trash
>>the VHB to make the diag disk think it's a new, raw disk. The low level format
>>re-writes the ENTIRE track from index to index, gaps, headers, data, everything.
...
> + OK: there are also MEDIA defects. And THESE are the BAD BLOCKS
> which John was referring to, the ones we presume to be beyond
> salvation.

I really did mean what I said. Perhaps I should have been more specific. What I was referring to was places on the disk that are physically not responding correctly. Many, many other things can go wrong that do not mesh with the "bad data read, this must be a bad block" idea. The format routine on the diagnostics disk was written for the user. It was written to find pre-existing bad spots that are expected to be there. The diagnostics assume that the hardware is functioning correctly. Remember, if it acts weird, you're supposed to call AT&T service, right? My original comment was also made assuming your system is functioning properly (or is now).

So is it time for a HwNote on what's REALLY on the disk and how disks work?

John
--
John Bly Milton IV, jbm@uncle.UUCP, n8emr!uncle!jbm@osu-cis.cis.ohio-state.edu
(614) h:294-4823, w:785-1110; N8KSN, AMPR: 44.70.0.52; Don't FLAME, inform!