jbm@uncle.UUCP (John B. Milton) (02/17/89)
In article <465@manta.pha.pa.us> brant@manta.pha.pa.us (Brant Cheikes) writes:
...
>OUTSTANDING QUESTIONS I'D KILL FOR ANSWERS TO:

Well, I don't think you'll have to kill...

>1. What are the meanings of the following fields in the HDERR message:
>   ST, EF, SC, DCRREG, MCRREG.

A typical error message in /usr/adm/unix.log:

HDERR ST:51 EF:10 CL:FF80 CH:FF01 SN:FF00 SC:FF02 SDH:FF25 DMACNT:FFFF DCRREG:95 MCRREG:9100 Tue Aug 2 02:00:38 1988

All disk requests go through a generic part of the disk driver, gd. This
part does not know the difference between a floppy and a hard disk; it just
knows blocks. It calls various low level, disk specific routines to get its
work done. The error above comes from the low level hard disk routine, gdhd.
A floppy error (FDERR) would come from the floppy low level routine, gdfp.
You get one of these after each error from the WD1010 hard disk controller.
If it retries 8 times, you get 8 messages.

After the low level routine returns to gd, it is SUPPOSED to print one of
these:

drv:0 part:2 blk:20294 rpts:2 Sat Jun 11 14:43:06 1988

The "rpts" is how many retries (?DERRs) the low level driver took to
complete the request. I have no idea why SOMETIMES you get this message and
most of the time you don't.

If the low level driver retries <sys/gdisk.h>:GDRETRIES (15) times, you will
get one of these little babies in the [!!] icon:

Unrecoverable hard disk error

These are the real baddies. Another thing you might notice when this happens
is that the hard disk will do a slow seek to track zero. This will happen
when there are 5 retries left. Some drives are VERY noisy doing slow seeks,
others you won't notice at all. You WILL notice a big delay just before the
[!!] icon pops up if the head was WAY out on the disk.

Now to the question about the meaning of all the fields in the HDERR
message. All of the values are the state of things when an interrupt occurs,
usually at the end of a disk operation.
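For anyone who wants to pick these messages apart by program, here is a minimal sketch in C that pulls the ten hex fields out of one HDERR line. The struct and function names (`struct hderr`, `parse_hderr`) are made up for illustration and are not part of the UNIXpc driver or headers. The field order is taken from the sample line above, and the values are stored raw, exactly as logged.

```c
#include <assert.h>
#include <stdio.h>

/* Raw register values from one HDERR line, stored exactly as logged. */
struct hderr {
    unsigned st, ef, cl, ch, sn, sc, sdh, dmacnt, dcrreg, mcrreg;
};

/*
 * Parse one line of /usr/adm/unix.log.
 * Returns 1 if it is a complete HDERR line, 0 otherwise.
 * The trailing timestamp is ignored.
 */
int parse_hderr(const char *line, struct hderr *e)
{
    int n;

    assert(line != NULL && e != NULL);
    n = sscanf(line,
               "HDERR ST:%x EF:%x CL:%x CH:%x SN:%x SC:%x SDH:%x"
               " DMACNT:%x DCRREG:%x MCRREG:%x",
               &e->st, &e->ef, &e->cl, &e->ch, &e->sn, &e->sc, &e->sdh,
               &e->dmacnt, &e->dcrreg, &e->mcrreg);
    return n == 10;   /* all ten fields must be present */
}
```

Feeding it the sample line should fill in st=0x51, ef=0x10 and so on; any other kind of log line (a drv: line, an FDERR) simply returns 0.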
All of the information here comes from pages 3-8 to 3-10 of the "Storage
Management Products Handbook, 1986", from Western Digital.

HDERR ST:51 EF:10 CL:FF80 CH:FF01 SN:FF00 SC:FF02 SDH:FF25 DMACNT:FFFF DCRREG:95 MCRREG:9100 Tue Aug 2 02:00:38 1988

ST: The "status register" from the WD1010

Bit 7 Busy. Set when the controller is accessing the disk.
Bit 6 Ready. Reflects DRDY, pin 28; DREADY*, pin 22 from the HD
Bit 5 Write Fault. Reflects WF, pin 30; DFAULT*, pin 12 from the HD
Bit 4 Seek Complete. Reflects SC, pin 32; DSCOMPL*, pin 8 from the HD
Bit 3 Data Request. Reflects DBRQ, pin 36. Driven by the WD1010, NC.
Bit 2 Reserved. Always 0
Bit 1 Command in Progress.
Bit 0 Unrecoverable error. Indicates the error register should be checked.

EF: The "error register" from the WD1010

Bit 7 Bad Block Detect. From what I can tell about how things are done on
      our systems, this feature is not used. We use a direct mapping method
      where the position of bad blocks is determined by the bad block table.
      If this gets turned on, it is some kind of glitch on the disk.
Bit 6 CRC Data Field. This one deserves a direct quote:

      "This bit is set when a CRC error occurs in the data field. With
      Retry enabled, ten more attempts are made to read the sector
      correctly. If none of these attempts are successful, the Error
      Status is set also (bit 0 in the Status Register). If one of the
      attempts is successful, this bit remains set to inform the Host that
      a marginal condition exists. However, the Error Status bit is not
      set. Even if errors exist, the data can be read."

On our machines, if bits 7, 5, 1 or 0 are set, or if the error register is
not zero (!), or if there was DMA trouble, an HDERR message will be printed.
This is extremely good. It means every time there is the slightest flicker
in the data, you will get an error message. If you get only one, the error
is probably transient and does not mean anything. You should NOT try to lock
out the block!
If you get a bunch of CRC errors, but a good read, this is probably a weak
spot and should be locked out.

Bit 5 Reserved. Always zero.
Bit 4 ID not found. Like CRC, this bit is set when the ID field for the
      requested sector can not be found, or has a bad CRC.
Bit 3 Reserved. Always zero.
Bit 2 Aborted Command. Should never happen on our system. If you get it, it
      probably means BAD power line trouble.
Bit 1 Track Zero Error. This is very bad, and usually indicates a very bad
      hardware failure in the drive, so you'll never see it until you get a
      second hard drive on your system :)
Bit 0 Data Address Mark Not Found. Yet another thing not found.

Our driver DOES NOT use the built-in retry feature of the WD1010. This means
that when the driver retries 15 times on a CRC error, you only get 15, not
150 retries. Retries can apparently be DISABLED for both the hard disk and
the floppy disk through an ioctl(f,GDRETRY,1). I could see doing this for
floppies, where you would want to discard any marginal disks, but I don't
see much practical use for the hard drive.

The values CL, CH, SN, SC, SDH are also registers in the WD1010. They are
printed as simple %x, but are only significant in the low order 8 bits. The
top 8 bits are just bus garbage, and will vary with the machine, weather,
time of day, or phase of the moon.

CL:  Cylinder low. This register holds the least significant 8 bits of the
     cylinder number.
CH:  You guessed it, Cylinder High. It contains the high order two (2) bits
     of the cylinder. Also as you might have guessed, it is the high order
     three (3) bits if you have a WD2010.
SN:  Sector number. 'nough said.
SC:  Sector count. Our driver DOES use this feature to transfer multiple
     sectors.
SDH: This is a catch all. It contains ESSDDHHH: Extension bit, Sector size,
     Drive select, and Head select.

These values compare with the sector header on the disk. The sector size and
head are written on the disk, but the drive number is NOT.
This is why you can format any drive at any select code, and then move it to
another select code without problems. The extension bit is not used on our
machine. When turned on, it extends the place for the CRC from two bytes to
seven bytes for externally generated ECC codes.

The DMACNT is a register in our custom DMA circuitry. From <sys/iohw.h>:

#define DMA_CNT_MASK    0x3fff  /* Bits 13...0 holds dma count */
#define DMA_ERROR       0x8000  /* dma error bit mask, 0 = error */

The DCRREG is a miscellaneous register for disk stuff (top of page 11 in the
schematics). This register can not be read, so what is shown is the value of
dcr_save. From <sys/iohw.h>:

#define NOT_FDRST       0x80    /* 0 = reset, 1 = not reset */
#define FDR0            0x40    /* 1 = floppy selected */
#define FDMTR           0x20    /* 1 = floppy motor on */
#define NOT_HDRST       0x10    /* 0 = hdc reset, 1 = hdc not reset */
#define HDR0            0x08    /* 1 = hard disk 0 selected */
#define HDSEL           0x07    /* Head select mask */

The MCRREG is another Miscellaneous Control Register for some of everything.
It is on the bottom right of page 15 of the schematics. The only bit of any
importance here is DMA_READ. As with DCRREG, this register can not be read.
The value printed comes from mcr_save. From <sys/hardware.h>:

#define CLRSINT         0x8000  /* CLRSINT- toggle from 1 to 0 and back to 1
                                   to dismiss level 6, 60 hertz interrupt */
#define DMA_READ        0x4000  /* DMAR/W- 0 = disk DMA write
                                           1 = disk DMA read */
#define LPSTB           0x2000  /* LPSTB+ toggle from 0 to 1 and back to 0
                                   to strobe data to line printer */
#define MCKSEL          0x1000  /* MCKSEL- 0 = modem RX & TX selected
                                   1 = programmable Baud Rate generator
                                   is selected */
#define LED3            0x800   /* LED3- 0 = on, 1 = off */
#define LED2            0x400   /* LED2- 0 = on, 1 = off */
#define LED1            0x200   /* LED1- 0 = on, 1 = off */
#define LED0            0x100   /* LED0- 0 = on, 1 = off */

As you can see from the bit definitions above, only the low order 8 bits of
DCRREG are used, and only the high order 8 bits of MCRREG are used.
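The masks above can be applied mechanically to the DMACNT, DCRREG and MCRREG values in a log line. What follows is only a sketch: the mask values are copied from the header excerpts quoted above, but `struct misc_regs`, `decode_misc` and `LED_SHIFT` are hypothetical names invented for the example. It reports the raw bit values and leaves the polarity (which of 0 or 1 means "asserted") to the header comments.

```c
#include <assert.h>
#include <stddef.h>

/* Masks as quoted in the article from <sys/iohw.h> and <sys/hardware.h>. */
#define DMA_CNT_MASK 0x3fff
#define DMA_ERROR    0x8000
#define NOT_FDRST    0x80
#define FDR0         0x40
#define FDMTR        0x20
#define NOT_HDRST    0x10
#define HDR0         0x08
#define HDSEL        0x07
#define DMA_READ     0x4000
#define LED_SHIFT    8      /* hypothetical helper: LED3..LED0 are bits 11..8 */

/* Raw bit values only; see the header comments for what 0 and 1 mean. */
struct misc_regs {
    unsigned dma_count;     /* DMACNT bits 13..0 */
    int dma_error_bit;      /* raw DMA_ERROR bit (header says 0 = error) */
    int not_fdrst, fdr0, fdmtr, not_hdrst, hdr0;
    unsigned hdsel;         /* head select, DCRREG bits 2..0 */
    int dma_read;           /* raw DMA_READ bit */
    unsigned leds;          /* LED3..LED0 as a 4-bit value */
};

void decode_misc(unsigned dmacnt, unsigned dcrreg, unsigned mcrreg,
                 struct misc_regs *r)
{
    assert(r != NULL);
    r->dma_count     = dmacnt & DMA_CNT_MASK;
    r->dma_error_bit = (dmacnt & DMA_ERROR) != 0;
    r->not_fdrst     = (dcrreg & NOT_FDRST) != 0;
    r->fdr0          = (dcrreg & FDR0) != 0;
    r->fdmtr         = (dcrreg & FDMTR) != 0;
    r->not_hdrst     = (dcrreg & NOT_HDRST) != 0;
    r->hdr0          = (dcrreg & HDR0) != 0;
    r->hdsel         = dcrreg & HDSEL;
    r->dma_read      = (mcrreg & DMA_READ) != 0;
    r->leds          = (mcrreg >> LED_SHIFT) & 0x0f;
}
```

With DCRREG:92 and MCRREG:9F00, for instance, this gives hdsel=2 and leds=0xF (all four LED bits set, which the header says means off).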
You can print out the current values of dcr_save and mcr_save with:

$ adb /unix /dev/kmem
dcr_save/x
mcr_save/x
^D

Well, that should do that for a while (until the flames roll in :)

>
>2. Does anyone know the specs for the WD1010 controller chip?

Ahhhhhhhhhhhhhh yup.

>   In
>   particular, I have been told by a tech at Seagate that the ST-4096
>   will recal if step pulses are spaced more than 7ns apart. Does the
>   controller chip meet this requirement?

If your drive can not handle slow seeks, it's junk. In this case I think you
have bad information. It is possible that the ST-4096 has a bad resonance
problem around 7ms. Our driver tells the WD1010 to seek the drives as fast
as it can all the time.

From <sys/space.h>:------------+
struct gdsw gdsw[] = {         V
{ 0,{"WIN", 1023, 4, 17, 68,0, 0, 512}, -1,0,16, 1023*16,40,
  ghdintr, ghdstart, HDMAXCYL, HDMAXBADBLK, gdhdbbq, gdhdbb,0},

The zero here means a 35us seek time. This leaves it up to the drive to seek
however fast it can. The WD1010 then waits for the drive to signal when it
has completed the seek. This is fantastic for voice coil drives. When a
recal is done because of an error, the driver sends a RESTORE command. This
is also done at power up. With a restore command, the WD1010 waits for the
drive to signal seek complete BETWEEN EVERY STEP! This is why some drives
are so very noisy when they recal. The head has just stopped, the drive said
it's done, then the WD1010 tells it to start moving again!
GO STOP GO STOP GO STOP...

>3. Is there anyone who has a ST-4096 in their UNIXpc and is having no
>   difficulty?

Try jan@bagend. I have heard of others.
Lastly, if you want to convert that nasty HDERR to a block number so you can
use the diagnostics to lock it out:

----
HDERR ST:51 EF:40 CL:4260 CH:4201 SN:420C SC:4202 SDH:4222 DMACNT:FFFF DCRREG:92 MCRREG:9F00 Sat Jun 11 14:43:05 1988
HDERR ST:51 EF:40 CL:4260 CH:4201 SN:420C SC:4202 SDH:4222 DMACNT:FFFF DCRREG:92 MCRREG:9F00 Sat Jun 11 14:43:05 1988
drv:0 part:2 blk:20294 rpts:2 Sat Jun 11 14:43:06 1988
----

This is a real example from my unix.log. Let's run down everything here.

ST=51:01010001: Ready, Seek complete, Error.
EF=40:01000000: CRC data field
CH=4201, CL=4260: Cylinder is $0160=256+96=352
SN=420C: Sector number is 12
SC=4202: Sector count for this transfer is 2. We don't know what the
  original count was, and we don't care. During a multiple sector transfer,
  the SN is incremented and the SC is decremented as each sector is
  transferred. This makes it easy to pause and retry in the middle of a
  multi-sector transfer. The bottom line is that the SN is accurate.
SDH=4222:[0 01 00 010]: Extension off (good), Sector size=01 (512, good),
  Drive=0, no surprise for this machine, Head=010=2 (the third head), no
  surprise as this is a 9 head drive.
DMACNT:FFFF, Hmm. The DMA_ERROR bit is on. I don't know how to interpret the
  rest of the count register when there is an error. I don't think it
  matters.
DCRREG:92:10010010: FDRST* not asserted, FDDRIVE0* is asserted, FDMOTOR* is
  asserted, HDRST* is not asserted, DDRIVE0* is asserted, HDSELx=010=2=3rd.
  Well, this is interesting. It looks like the floppy drive was on when this
  hard drive error happened! The HDSELx lines DO correspond with the SDH
  reg, which is good.
MCRREG:9F00:10011111: CLRSINT, DMA_READ=0=disk DMA write, LPSTB, MCKSEL, and
  now for the important one: ooh! aah! all four LEDs were off!
  (so what)
drv:0      It was the first hard drive
part:2     It was the file system partition (I only have one)
blk:20294  See below
rpts:2     This matches, there were two HDERR lines
Sat Jun 11 14:43:06 1988: Ok, so I was goofing with the floppy drive then.

I picked this case out of my unix.log because it DID have this gd line. Most
of my bunches of HDERR lines DO NOT have them. As you can also see, the high
order 8 bits of the WD1010 registers were 42 in this example, and they were
FF in the first example; like I said, phase of the moon.

Now for the fun part. Let's map this cryptic shit back to the real world.
I've got two flavors, depending on how you like to think:

sector = ((((CH%256)*256+(CL%256))*HEADS)+(SDH%8))*SECTORS_PER_TRACK+(SN%256)

or

sector = ((((((CH&0xff)<<8)+(CL&0xff))*HEADS)+(SDH&0x07))<<4)+(SN&0xff)

Our "blocks" are 1024 bytes, or 2 sectors, so the block=sector/2. There are
16 data sectors, or 8 blocks per track.

((((((4201&0xff)<<8)+(4260&0xff))*9)+(4222&0x07))<<4)+(420C&0xff)
((256+96)*9+2)*16+12
(352*9+2)*16+12
(3168+2)*16+12
3170*16+12
50720+12
  50732  This is the sector number from the beginning of /dev/rfp000
  25366  This is the block offset
-    72  Size of my bad block table
-  5000  Size of my swap partition
=======
  20294  What do you know! it matches!

Now for even more fun! If this had been a recent HDERR message, I would now
run that neato-wiz-bang "bf" program Brant Cheikes just posted:

$ bf /dev/fp002 20294
block 20294 inode 8853
$ ncheck -i 8853
/dev/fp002:
8853    /usr/spool/news/comp/os/minix/1594

Ahh! Wouldn't you know it! I've got news stomping on my soft blocks!

I did not obtain permission from anyone for use of the information contained
in this article, so there. Western Digital does have good terse
documentation, just the way I like it: guts and no fluff.

Update on the second hard drive board:

The board layout looks like it'll fly, except for some cosmetic stuff.
I got pricing from Saturn Electronics in MI:

2.5" x 3.0", 312 holes, no solder mask, no silk screen, double sided,
plated through holes, all holes one bit size.

$200.00  Prototype quantity (about 10), including setup
$100.00  Setup fee
  $3.75  Per board,   50 qty. ( $187.50)
  $3.00  Per board,  100 qty. ( $300.00)
  $2.76  Per board,  250 qty. ( $690.00)
  $2.09  Per board, 1000 qty. ($2090.00)
  $1.76  Per board, 5000 qty. ($8800.00)

If someone knows of a better price, call me. If you like these prices, call
me, and I'll give you the number of the local rep. I am in no way associated
with Saturn, I just heard of them by word of mouth.

John
--
John Bly Milton IV, jbm@uncle.UUCP, n8emr!uncle!jbm@osu-cis.cis.ohio-state.edu
(614) h:294-4823, w:764-2933; Got any good 74LS503 circuits?
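For readers who want to script the HDERR-to-block arithmetic from John's walk-through above, here is a rough sketch in C. The names `struct hd_layout`, `hderr_sector` and `hderr_block` are invented for illustration. The geometry (9 heads, 16 data sectors per track, 2 sectors per block) and the offsets (72-block bad block table, 5000-block swap partition) are the figures from John's own disk, so treat them as assumptions to be replaced with your own.

```c
#include <assert.h>
#include <stddef.h>

/* Disk layout; every value here is specific to one particular disk. */
struct hd_layout {
    unsigned heads;               /* 9 in the worked example */
    unsigned sectors_per_track;   /* 16 data sectors */
    unsigned sectors_per_block;   /* 2: 1024-byte blocks, 512-byte sectors */
    unsigned long part_start_blk; /* blocks before the partition: 72 + 5000 */
};

/* Absolute sector number from the raw (unmasked) HDERR register values. */
unsigned long hderr_sector(unsigned ch, unsigned cl, unsigned sdh,
                           unsigned sn, const struct hd_layout *g)
{
    unsigned long cyl;
    unsigned head;

    assert(g != NULL);
    /* Only the low 8 bits of each register are meaningful. */
    cyl  = ((unsigned long)(ch & 0xff) << 8) | (cl & 0xff);
    head = sdh & 0x07;
    return (cyl * g->heads + head) * g->sectors_per_track + (sn & 0xff);
}

/* Block number relative to the start of the partition. */
unsigned long hderr_block(unsigned long sector, const struct hd_layout *g)
{
    assert(g != NULL && g->sectors_per_block != 0);
    return sector / g->sectors_per_block - g->part_start_blk;
}
```

With the layout {9, 16, 2, 5072} and the registers from the example (CH=4201, CL=4260, SDH=4222, SN=420C), this reproduces sector 50732 and block 20294.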
dsueme@chinet.chi.il.us (dave sueme) (02/19/89)
Very informative article. Whatever we are paying you, it ain't enough.

dave sueme
brant@manta.pha.pa.us (Brant Cheikes) (02/22/89)
Let me begin with a sincere and appreciative public THANK YOU! to John
Milton for his lengthy, informative, and authoritative discussion of
UNIXpc hard disk error messages and their interpretation. Thanks are
owed as well to the irrepressible Lenny Tropiano for a private
communique covering mostly the same territory. As David Sueme so
eloquently put it, whatever we're paying these folks ain't enough.
Nonetheless, I was slightly puzzled by John's rather inconclusive
conclusion:
>Ahh! Wouldn't you know it! I've got news stomping on my soft blocks!
After walking thru a particular error message, tracking it down to a
particular file and so forth, he makes the above offhand comment about
"soft blocks," then proceeds to change the subject.
So, if I may be so bold as to inquire further, what pray tell are
"soft blocks?" John's treatment suggests that they need inspire only
the mildest concern. Yet in my case, having a new "soft block" error
logged almost on a daily basis, that would seem inappropriate.
Comments, John?
--
Brant Cheikes
University of Pennsylvania, Department of Computer and Information Science
brant@manta.pha.pa.us, brant@linc.cis.upenn.edu, bpa!manta!brant
jbm@uncle.UUCP (John B. Milton) (02/23/89)
In article <468@manta.pha.pa.us> brant@manta.pha.pa.us (Brant Cheikes) writes:
>Let me begin with a sincere and appreciative public THANK YOU! to John

You're welcome.

...
>>Ahh! Wouldn't you know it! I've got news stomping on my soft blocks!

Excuse my attempt at levity. Yes, there is some concern here. In my case I
have not gotten any more hits on these spots. A good way to check whether a
certain HDERR is hard (always bad), soft (sometimes bad), or transient
(usually not related to the disk at all), is to:

cp /dev/rfp000 /dev/null

Ignore the "bad copy to /dev/null", and check /usr/adm/unix.log to see if
you have any new messages. Track down the file, and:

ln file /usr/adm/bad+junk

Just to make sure the bad spot doesn't get loose. Repeat this at different
times of the day. Try to pick times when your machine is under extreme
conditions: 5-6p.m. for lowest line voltage, 3a.m. for highest. Afternoon,
or whenever for highest temperature, etc. You might even set up a temporary
cron line to do this. You could also kick a second one off 5 minutes after
the first to see if your errors are seek related. Don't do too much of this,
as it can put a lot of wear on the moving parts of your head assembly!

REMEMBER! When smgr finds that /usr/adm/unix.log has exceeded 10k in size,
it quietly deletes it! Shame on you if you don't have something like this
run out of cron every night:

cd /usr/adm
if [ -f unix.log ]; then
	cat unix.log >>UNIX.log
	rm unix.log
fi

About once a month, I go through this file and delete all the FDERR lines
from floppy formatting. After you have collected enough HDERR lines, you can
get all the suspect files in one place and flog them to get a feel for how
"hard" a given bad spot is. If you get a continuous stream of transient (one
hit) spots when scanning the whole disk, it is probably electronics, and not
the hard disk surface.
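One way to get a feel for how "hard" a spot is would be to tally the collected HDERR lines by disk location. The C sketch below does that; `struct spot`, `tally_hderr` and `MAX_SPOTS` are hypothetical names, and it assumes a single hard drive, so the drive select bits in SDH are ignored. A spot that shows up once across many scans looks transient, while one that keeps coming back is a better candidate for locking out. The counts alone do not prove either.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_SPOTS 256   /* arbitrary cap for this sketch */

/* One disk location and how many HDERR lines pointed at it. */
struct spot {
    unsigned cyl, head, sector;
    unsigned hits;
};

/*
 * Account for one log line. Lines that are not HDERR lines are ignored.
 * Returns the (possibly increased) number of distinct spots in the table.
 */
int tally_hderr(const char *line, struct spot *spots, int nspots)
{
    unsigned cl, ch, sn, sdh, cyl, head, sector;
    int i;

    assert(line != NULL && spots != NULL);
    if (sscanf(line, "HDERR ST:%*x EF:%*x CL:%x CH:%x SN:%x SC:%*x SDH:%x",
               &cl, &ch, &sn, &sdh) != 4)
        return nspots;

    /* Only the low 8 bits of each register are meaningful. */
    cyl    = ((ch & 0xff) << 8) | (cl & 0xff);
    head   = sdh & 0x07;
    sector = sn & 0xff;

    for (i = 0; i < nspots; i++) {
        if (spots[i].cyl == cyl && spots[i].head == head &&
            spots[i].sector == sector) {
            spots[i].hits++;
            return nspots;
        }
    }
    if (nspots < MAX_SPOTS) {       /* new location; drop it if table is full */
        spots[nspots].cyl = cyl;
        spots[nspots].head = head;
        spots[nspots].sector = sector;
        spots[nspots].hits = 1;
        nspots++;
    }
    return nspots;
}
```

Run each line of the saved UNIX.log through it, then print the table sorted by hits to see which locations repeat.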
John
--
John Bly Milton IV, jbm@uncle.UUCP, n8emr!uncle!jbm@osu-cis.cis.ohio-state.edu
(614) h:294-4823, w:764-2933; Got any good 74LS503 circuits?