canoaf@ntvax.UUCP (Augustine Cano) (02/11/89)
Hello netland:

I got 2 responses to my posting about intermittent HD errors with crash on
my 3b1 with a Seagate 4096. This is the second such drive; the first one,
still under warranty, overflowed the bad block table. Days can pass without
any problem and then suddenly I'll get half a dozen errors in a row. The
actual error (with crash) is always as follows:

    Drive 0, cmd 0
    #HDERR ST:51 ...
    (repeated usually 3 times)
    panic: Hard disk timeout
    Please record panic message.
    Press hardware reset to reboot.

where `#HDERR ST:51 ...' has been at different times:

#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:4208 SC:4204 SDH:4224 DMACNT:FFFF DCRREG:94 MCRREG:8900
#HDERR ST:51 EF:10 CL:4280 CH:4202 SN:420E SC:4202 SDH:4222 DMACNT:FFFF DCRREG:92 MCRREG:8B00
#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:420A SC:4202 SDH:4226 DMACNT:FFFF DCRREG:96 MCRREG:8100
#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:4200 SC:4204 SDH:4225 DMACNT:FFFF DCRREG:B5 MCRREG:8B00
#HDERR ST:51 EF:10 CL:4283 CH:4203 SN:4204 SC:4202 SDH:4224 DMACNT:FFFF DCRREG:94 MCRREG:8100

From my original posting:

> Is it possible that the disk really does not have a bad spot but that a
> combination of factors triggers a software bug in the kernel or driver?

Christopher J. Calabrese from AT&T Bell Laboratories, Murray Hill, NJ said:

> I've never run across such problems with the disk drivers;
> however, it could be a bad disk controller chip, or a bad ribbon cable.
> I've seen that around here before.

Many, many thanks go also to Brant Cheikes, who sent a long and detailed
account of exactly the same problem I'm having. On his advice, I called
Ben Wollberg (415)678-1353 (8a-5p PST), who fixed his machine.

Ben told me that the first thing he would do would be to back up the whole
disk and reformat it. He said that this might make the problem disappear.
If it didn't, I would probably have to have the disk repaired. Apparently
the test they run to find and map the bad sectors takes 7 hours. I wonder
if I can do something similar with the test disk.
Can anybody out there tell me what I need to tell the test program to do an
exhaustive format-write-read check that would detect all intermittent errors?

In any case, a summary of Brant's response follows:

> The problems all showed up as HDERR's logged to /usr/adm/unix.log.
> The errors would come in groups of three or four, and would always be
> accompanied by a mechanical whine from the disk. I believe that noise
> indicates that the drive is "recalibrating," retracting the heads and
> resetting itself in some way. The errors were highly intermittent; I
> could go several days without an error, then suddenly get several in
> one day. Weather did not seem to be a factor, nor did temperature. I
> checked the power output from my power supply, and found no variation
> even while the drive was running the random seek diagnostic test.
>
> Occasionally, the errors would cause recoverable disk errors. Things
> like missing blocks in the free list, things that fsck could fix. No
> data was ever lost, to my knowledge, but it really sucked having to
> fsck the disk every few days.
>
> Then, the machine started crashing.
> The accompanying whine in these cases lasted several seconds, and the
> system was hung while it was going on. Then boom, the panic and a
> reset was necessary.

Well, I hope this helps someone out there...

Augustine F. Cano
<canoaf@dept.csci.unt.edu>
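
For anyone who wants to compare these messages mechanically, here is a minimal sketch that parses the five #HDERR lines quoted in the post above and reports which fields never change and which vary from crash to crash. It assumes every value is hexadecimal (FFFF and 8B00 suggest so, but the post does not say), and the names parse_hderr and split_constant_and_varying are made up for this illustration. The same parser could be pointed at lines pulled from /usr/adm/unix.log.

```python
import re

# The five HDERR lines from the post, copied verbatim.
HDERR_LINES = """\
#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:4208 SC:4204 SDH:4224 DMACNT:FFFF DCRREG:94 MCRREG:8900
#HDERR ST:51 EF:10 CL:4280 CH:4202 SN:420E SC:4202 SDH:4222 DMACNT:FFFF DCRREG:92 MCRREG:8B00
#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:420A SC:4202 SDH:4226 DMACNT:FFFF DCRREG:96 MCRREG:8100
#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:4200 SC:4204 SDH:4225 DMACNT:FFFF DCRREG:B5 MCRREG:8B00
#HDERR ST:51 EF:10 CL:4283 CH:4203 SN:4204 SC:4202 SDH:4224 DMACNT:FFFF DCRREG:94 MCRREG:8100
""".splitlines()

# NAME:VALUE pairs; ASSUMPTION: every value is a hexadecimal number.
FIELD_RE = re.compile(r"([A-Z]+):([0-9A-Fa-f]+)")


def parse_hderr(line):
    """Return {field_name: integer_value} for one '#HDERR ...' line."""
    if not line.startswith("#HDERR"):
        raise ValueError("not an HDERR line: %r" % line)
    return {name: int(value, 16) for name, value in FIELD_RE.findall(line)}


def split_constant_and_varying(records):
    """Separate fields that are identical in every record from those that differ."""
    constant, varying = {}, {}
    for name in records[0]:
        values = [record[name] for record in records]
        if len(set(values)) == 1:
            constant[name] = values[0]
        else:
            varying[name] = values
    return constant, varying


records = [parse_hderr(line) for line in HDERR_LINES]
constant, varying = split_constant_and_varying(records)

print("constant:", {name: "%X" % value for name, value in constant.items()})
print("varying: ", sorted(varying))
```

On the five lines above, only ST, EF and DMACNT come out constant; every address-like field (CL, CH, SN, SC, SDH) and both DCRREG and MCRREG differ between crashes, which is at least consistent with the failures not being tied to one spot on the disk.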
brant@manta.pha.pa.us (Brant Cheikes) (02/14/89)
In article <388@ntvax.UUCP> canoaf@ntvax.UUCP (Augustine Cano) posts a
summary of two (!) responses he received to his query about HD errors.

>[...] Brant Cheikes [...] sent a long and detailed
>account of exactly the same problem I'm having. On his advice, I called
>Ben Wollberg (415)678-1353 (8a-5p PST), who fixed his machine.

The bad news is that he didn't fix my machine. The problem remains,
though so far the machine has stopped crashing. But the HDERRs are
still being generated at a pretty good clip.

>Apparently the
>test they run to find and map the bad sectors takes 7 hours. [...]

Despite running this program on my disk, the errors came back
immediately. I'm increasingly dubious that the problem is bad sectors.
It looks to me like transient read or write errors, and thus some
electronic problem in the drive (though it seems odd that we're having
identical difficulties). In any event, I will know more as soon as I
swap my (repaired) 40 Mb (Hitachi) drive back in.

OUTSTANDING QUESTIONS I'D KILL FOR ANSWERS TO:

1. What are the meanings of the following fields in the HDERR message:
   ST, EF, SC, DCRREG, MCRREG?

2. Does anyone know the specs for the WD1010 controller chip? In
   particular, I have been told by a tech at Seagate that the ST-4096
   will recal if step pulses are spaced more than 7ns apart. Does the
   controller chip meet this requirement?

3. Is there anyone who has an ST-4096 in their UNIXpc and is having no
   difficulty?
-- 
Brant Cheikes
University of Pennsylvania, Department of Computer and Information Science
brant@manta.pha.pa.us, brant@linc.cis.upenn.edu, bpa!manta!brant
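
A partial, unconfirmed stab at question 1 above: the field names CL, CH, SN, SC and SDH match the task-file registers of the WD1010, so one plausible reading is that ST and EF are that chip's status and error registers. That is a guess, not something anyone in this thread has confirmed. The sketch below decodes the two values that appear in every quoted message (ST:51, EF:10) under that guess. The bit tables are written from memory of the WD1010 register layout and should be checked against the data sheet; the names STATUS_BITS, ERROR_BITS and describe are made up for this illustration.

```python
# ASSUMPTION: ST is the WD1010 status register and EF is its error register.
# ASSUMPTION: the bit meanings below follow the WD1010 layout as remembered;
# verify them against the data sheet before relying on them.
STATUS_BITS = {
    7: "busy",
    6: "ready",
    5: "write fault",
    4: "seek complete",
    3: "data request",
    1: "command in progress",
    0: "error",
}

ERROR_BITS = {
    7: "bad block detected",
    6: "CRC error in data field",
    4: "ID not found",
    2: "aborted command",
    1: "track 0 not found",
    0: "data address mark not found",
}


def describe(value, bit_names):
    """List the names of the bits set in an 8-bit register value, high bit first."""
    return [
        bit_names.get(bit, "bit %d (unassigned here)" % bit)
        for bit in range(7, -1, -1)
        if value & (1 << bit)
    ]


print("ST:51 ->", describe(0x51, STATUS_BITS))
print("EF:10 ->", describe(0x10, ERROR_BITS))
```

If the guess holds, ST:51 would read as "ready, seek complete, error" and EF:10 as "ID not found", i.e. the controller could not find the header of the sector it was asked for. That says nothing by itself about whether the drive, the controller or the cable is to blame.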
dpb@tellab5.tellabs.CHI.IL.US (Darryl Baker) (02/15/89)
I'd like to put my $.02 in. One thing I found when I switched from a
WD1010 to a WD2010 was that the sporadic disk errors I was getting
disappeared. Funny thing: do you think there was a bug in the WD1010? :-)
-- 
Darryl Baker
dpb@tellabs.chi.il.us / dpb@liltyke.chi.il.us
ignatz@chinet.chi.il.us (Dave Ihnat) (02/16/89)
Well, good news and a modifier. I've been happily running an ST-4096 with
no problems for months. The modifier is that I did the large-disk field
mod, and installed a WD2010, so if the problem is with the WD1010, then I
wouldn't see it...

Dave Ihnat
Analysts International Corp.
ignatz@homebru.chi.il.us