[comp.sys.att] Summary: Hard disk errors on a 3b1; HwNote13

jbm@uncle.UUCP (John B. Milton) (02/17/89)

In article <465@manta.pha.pa.us> brant@manta.pha.pa.us (Brant Cheikes) writes:
...
>OUTSTANDING QUESTIONS I'D KILL FOR ANSWERS TO:

Well, I don't think you'll have to kill...

>1. What are the meanings of the following fields in the HDERR message:
>   ST, EF, SC, DCRREG, MCRREG.

A typical error message in /usr/adm/unix.log:

HDERR ST:51 EF:10 CL:FF80 CH:FF01 SN:FF00 SC:FF02 SDH:FF25 DMACNT:FFFF DCRREG:95 MCRREG:9100 Tue Aug  2 02:00:38 1988

All disk requests go through a generic part of the disk driver, gd. This part
does not know the difference between a floppy and a hard disk, it just knows
blocks. It calls various low level, disk specific routines to get its work
done. The error above comes from the low level hard disk routine, gdhd. If it
were a floppy error (FDERR), it would come from the floppy low level routine, gdfp.
You get one of these after each error from the WD1010 hard disk controller. If
it retries 8 times, you get 8 messages. After the low level routine returns to
gd, it is SUPPOSED to print one of these:

drv:0 part:2 blk:20294 rpts:2 Sat Jun 11 14:43:06 1988

The "rpts" is how many retries (?DERRs) the low level driver took to complete
the request. I have no idea why SOMETIMES you get this message and most of
the time you don't. If the low level driver retries <sys/gdisk.h>:GDRETRIES (15)
times, you will get one of these little babies in the [!!] icon:

	Unrecoverable hard disk error

These are the real baddies. Another thing you might notice when this happens,
is that the hard disk will do a slow seek to track zero. This will happen when
there are 5 retries left. Some drives are VERY noisy doing slow seeks, others
you won't notice at all. You WILL notice a big delay just before the [!!] icon
pops up if the head was WAY out on the disk.
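
The retry behaviour described above can be sketched roughly as follows. This is
NOT the driver source, just a small simulation of what the text says: the
struct, the function name and RESTORE_AT are made up for illustration, and the
assumption is that one HDERR line is logged per failed attempt. Only the limit
of 15 and the slow seek with 5 retries left come from the description.

```c
#define GDRETRIES  15   /* from <sys/gdisk.h>, per the text above */
#define RESTORE_AT 5    /* slow seek to track zero with 5 retries left */

/* Hypothetical record of what one request would leave behind. */
struct retry_result {
    int hderr_lines;    /* HDERR lines that would be logged */
    int restores;       /* slow seeks to track zero performed */
    int unrecoverable;  /* 1 if the [!!] icon would pop up */
};

/*
 * Simulate one request whose first `failures` attempts fail.
 * Assumption: every failed attempt logs exactly one HDERR line.
 */
struct retry_result simulate_request(int failures)
{
    struct retry_result r = {0, 0, 0};
    int left = GDRETRIES;

    while (failures > 0) {
        r.hderr_lines++;
        failures--;
        left--;
        if (left == 0) {            /* retries exhausted */
            r.unrecoverable = 1;
            break;
        }
        if (left == RESTORE_AT)     /* recal before the last few tries */
            r.restores++;
    }
    return r;
}
```

With two failures this gives two HDERR lines and no [!!] icon; with fifteen or
more it gives fifteen lines, one slow seek, and the unrecoverable error.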

Now to the question about the meaning of all the fields in the HDERR message.
All of the values are the state of things when an interrupt occurs, usually
at the end of a disk operation. All of the information here comes from pages
3-8 to 3-10 of the "Storage Management Products Handbook, 1986", from Western
Digital.

HDERR ST:51 EF:10 CL:FF80 CH:FF01 SN:FF00 SC:FF02 SDH:FF25 DMACNT:FFFF DCRREG:95 MCRREG:9100 Tue Aug  2 02:00:38 1988

ST: The "status register" from the WD1010
  Bit 7 Busy. Set when the controller is accessing the disk.
  Bit 6 Ready. Reflects DRDY, pin 28; DREADY*, pin 22 from the HD
  Bit 5 Write Fault. Reflects WF, pin 30; DFAULT*, pin 12 from the HD
  Bit 4 Seek Complete. Reflects SC, pin 32; DSCOMPL*, pin 8 from the HD
  Bit 3 Data Request. Reflects DBRQ, pin 36. Driven by the WD1010, NC.
  Bit 2 Reserved. Always 0
  Bit 1 Command in Progress.
  Bit 0 Unrecoverable error. Indicates the error register should be checked.
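
A minimal sketch of turning an ST value into names, following the bit list
above. The table, the short names and st_describe() are invented for this
example; nothing like it is claimed to exist in the system headers.

```c
#include <stdio.h>

struct stbit { unsigned mask; const char *name; };

/* Bit positions follow the ST list above; bit 2 is reserved. */
static const struct stbit st_bits[] = {
    {0x80, "Busy"}, {0x40, "Ready"}, {0x20, "WriteFault"},
    {0x10, "SeekComplete"}, {0x08, "DataRequest"},
    {0x02, "CommandInProgress"}, {0x01, "Error"},
};

/* Build a space-separated list of the set status bits into buf. */
void st_describe(unsigned st, char *buf, size_t len)
{
    size_t i, used = 0;

    if (len == 0)
        return;
    buf[0] = '\0';
    st &= 0xff;                     /* the register is 8 bits wide */
    for (i = 0; i < sizeof st_bits / sizeof st_bits[0]; i++) {
        int n;

        if (!(st & st_bits[i].mask))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     used ? " " : "", st_bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;                  /* out of room, stop here */
        used += (size_t)n;
    }
}
```

For the ST:51 in the example line this produces "Ready SeekComplete Error".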

EF: The "error register" from the WD1010
  Bit 7 Bad Block Detect. From what I can tell about how things are done on
    our systems, this feature is not used. We use a direct mapping method where
    the position of bad blocks is determined by the bad block table. If this
    gets turned on, it is some kind of glitch on the disk.
  Bit 6 CRC Data Field. This one deserves a direct quote:
      "This bit is set when a CRC error occurs in the data
       field. With Retry enabled, ten more attempts are made
       to read the sector correctly. If none of these attempts
       are successful, the Error Status is set also (bit 0 in
       the Status Register). If one of the attempts is
       successful, this bit remains set to inform the Host that
       a marginal condition exists. However, the Error Status
       bit is not set. Even if errors exist, the data can be read."
    On our machines, if bits 7, 5, 1 or 0 are set or if the error register is
    not zero!, or if there was DMA trouble, an HDERR message will be printed.
    This is extremely good. It means every time there is the slightest flicker
    in the data, you will get an error message. If you get only one, the error
    is probably transient and does not mean anything. You should NOT try to
    lock out the block! If you get a bunch of CRC errors, but a good read,
    this is probably a weak spot and should be locked out.
  Bit 5 Reserved. Always zero.
  Bit 4 ID not found. Like CRC, this bit is set when the ID field for the
    requested sector can not be found, or has a bad CRC.
  Bit 3 Reserved. Always zero.
  Bit 2 Aborted Command. Should never happen on our system. If you get it, it
    probably means BAD power line trouble.
  Bit 1 Track Zero Error. This is very bad, and usually indicates a very bad
    hardware failure in the drive, so you'll never see it until you get a
    second hard drive on your system :)
  Bit 0 Data Address Mark Not Found. Yet another thing not found.
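
The EF list above can be looked up the same way. This sketch returns the name
of the highest-numbered error bit that is set. The table and ef_name() are
hypothetical helpers written for this note, not anything from the driver.

```c
#include <stddef.h>

struct efbit { unsigned mask; const char *name; };

/* Names follow the EF list above; bits 5 and 3 are reserved. */
static const struct efbit ef_bits[] = {
    {0x80, "Bad Block Detect"},
    {0x40, "CRC Data Field"},
    {0x10, "ID Not Found"},
    {0x04, "Aborted Command"},
    {0x02, "Track Zero Error"},
    {0x01, "Data Address Mark Not Found"},
};

/* Return the name of the highest-numbered error bit set, or NULL if none. */
const char *ef_name(unsigned ef)
{
    size_t i;

    for (i = 0; i < sizeof ef_bits / sizeof ef_bits[0]; i++)
        if (ef & ef_bits[i].mask)
            return ef_bits[i].name;
    return NULL;
}
```

EF:10 from the first example comes back as "ID Not Found", and EF:40 as
"CRC Data Field".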

Our driver DOES NOT use the built-in retry feature of the WD1010. This means
that when the driver retries 15 times on a CRC error, you only get 15, not
150 retries. Retries can apparently be DISABLED for both the hard disk and
the floppy disk through an ioctl(f,GDRETRY,1). I could see doing this for
floppies, where you would want to discard any marginal disks, but I don't see
much practical use for the hard drive.

The values CL, CH, SN, SC, SDH are also registers in the WD1010. They are
printed as simple %x, but are only significant in the low order 8 bits. The
top 8 bits are just bus garbage, and will vary with the machine, weather, time
of day, or phase of the moon.
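
If you want to pick a logged line apart in a program, something like the
following should do. It is only a sketch: struct hderr and hderr_parse() are
names invented here, and the format string simply mirrors the example line
shown earlier. The WD1010 registers are masked to their low 8 bits to throw
away the bus garbage.

```c
#include <stdio.h>

/* One parsed HDERR line; the struct is made up for this example. */
struct hderr {
    unsigned st, ef, cl, ch, sn, sc, sdh, dmacnt, dcr, mcr;
};

/* Parse one HDERR line. Returns 1 on success, 0 if the line does not match. */
int hderr_parse(const char *line, struct hderr *h)
{
    int n = sscanf(line,
        "HDERR ST:%x EF:%x CL:%x CH:%x SN:%x SC:%x SDH:%x "
        "DMACNT:%x DCRREG:%x MCRREG:%x",
        &h->st, &h->ef, &h->cl, &h->ch, &h->sn, &h->sc, &h->sdh,
        &h->dmacnt, &h->dcr, &h->mcr);

    if (n != 10)
        return 0;
    /* Only the low byte of these WD1010 registers is significant. */
    h->cl  &= 0xff;
    h->ch  &= 0xff;
    h->sn  &= 0xff;
    h->sc  &= 0xff;
    h->sdh &= 0xff;
    return 1;
}
```

The trailing date on the line is simply ignored by sscanf().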

CL: Cylinder low. This register holds the least significant 8 bits of the
  cylinder number.
CH: You guessed it, Cylinder High. It contains the high order two (2) bits of
  the cylinder. Also as you might have guessed, it is the high order three (3)
  bits if you have a WD2010.
SN: Sector number. 'nough said.
SC: Sector count. Our driver DOES use this feature to transfer multiple sectors.
SDH: This is a catch all. It contains ESSDDHHH: Extension bit, Sector size,
  Drive select, and Head select. These values compare with the sector header
  on the disk. The sector size and head are written on the disk, but the drive
  number is NOT. This is why you can format any drive at any select code, and
  then move it to another select code without problems. The extension bit is
  not used on our machine. When turned on, it extends the place for the CRC
  from two bytes to seven bytes for externally generated ECC codes.
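
As a sketch of the register layout just described, here is one way to pull the
cylinder, head, drive and sector size code out of CL, CH and SDH. The struct
and chs_decode() are illustrative names only. The sector size is left as the
raw 2-bit code rather than translated to bytes.

```c
/* Decoded position fields; the struct is made up for this example. */
struct chs {
    unsigned cyl;        /* cylinder number from CH and CL */
    unsigned head;       /* HHH bits of SDH */
    unsigned drive;      /* DD bits of SDH */
    unsigned size_code;  /* SS bits of SDH, raw code */
    unsigned ext;        /* E bit of SDH */
};

/* cyl_high_bits: 2 for a WD1010, 3 for a WD2010, per the CH note above. */
struct chs chs_decode(unsigned cl, unsigned ch, unsigned sdh,
                      unsigned cyl_high_bits)
{
    struct chs c;
    unsigned ch_mask = (1u << cyl_high_bits) - 1;

    c.cyl       = ((ch & ch_mask) << 8) | (cl & 0xff);
    c.ext       = (sdh >> 7) & 0x1;   /* E   */
    c.size_code = (sdh >> 5) & 0x3;   /* SS  */
    c.drive     = (sdh >> 3) & 0x3;   /* DD  */
    c.head      = sdh & 0x7;          /* HHH */
    return c;
}
```

The masks also strip the garbage in the top 8 bits, so the values can be passed
in exactly as they appear in the log.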

The DMACNT is a register in our custom DMA circuitry. From <sys/iohw.h>:
#define	DMA_CNT_MASK		0x3fff	/* Bits 13...0 holds dma count */
#define	DMA_ERROR		0x8000	/* dma error bit mask, 0 = error */
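
Taking the two header comments at face value, the DMACNT value splits up like
this. The two function names are invented for illustration, and note the
polarity: according to the comment, the error bit reads 0 when there WAS an
error.

```c
#define DMA_CNT_MASK 0x3fff  /* Bits 13...0 holds dma count */
#define DMA_ERROR    0x8000  /* dma error bit mask, 0 = error */

/* 1 if the DMA hardware flagged an error, i.e. bit 15 is clear. */
int dma_had_error(unsigned dmacnt)
{
    return (dmacnt & DMA_ERROR) == 0;
}

/* Raw count field, bits 13..0. How to read it is not covered here. */
unsigned dma_count(unsigned dmacnt)
{
    return dmacnt & DMA_CNT_MASK;
}
```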

The DCRREG is a miscellaneous register for disk stuff (top of page 11 in the
schematics). This register can not be read, so what is shown is the value
of dcr_save. From <sys/iohw.h>:
#define	NOT_FDRST		0x80	/* 0 = reset, 1 = not reset */
#define	FDR0			0x40	/* 1 = floppy selected */
#define	FDMTR			0x20	/* 1 = floppy motor on */
#define	NOT_HDRST		0x10	/* 0 = hdc reset, 1 = hdc not reset */
#define	HDR0			0x08	/* 1 = hard disk 0 selected */
#define	HDSEL			0x07	/* Head select mask */
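
One cheap sanity check on a logged line is that the head select bits saved in
DCRREG agree with the head bits of SDH. A sketch, with made-up function names;
it deliberately looks only at the HDSEL field and does not try to interpret
the select, motor or reset bits.

```c
#define HDSEL 0x07  /* Head select mask, from <sys/iohw.h> */

/* Head number held in the low three bits of the saved DCR value. */
unsigned dcr_head(unsigned dcr)
{
    return dcr & HDSEL;
}

/* 1 if the head in dcr_save agrees with the HHH bits of SDH. */
int dcr_matches_sdh(unsigned dcr, unsigned sdh)
{
    return dcr_head(dcr) == (sdh & 0x07);
}
```

For the line at the top, DCRREG:95 and SDH:FF25 both say head 5.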

The MCRREG is another Miscellaneous Control Register for some of everything.
It is on the bottom right of page 15 of the schematics. The only bit of any
importance here is DMA_READ. As with DCRREG, this register can not be read.
The value printed comes from mcr_save. From <sys/hardware.h>:
#define	CLRSINT			0x8000	/*  CLRSINT- toggle from 1 to 0 and
					back to 1 to dismiss level 6, 60
					hertz interrupt */
#define DMA_READ		0x4000	/* DMAR/W- 0 = disk DMA write
						   1 = disk DMA read */
#define LPSTB			0x2000	/* LPSTB+  toggle from 0 to 1 and back
					to 0 to strobe data to line printer */
#define MCKSEL			0x1000	/* MCKSEL- 0 = modem RX & TX selected
						   1 = programmable Baud Rate
							generator is selected */
#define LED3			0x800	/* LED3-  0 = on, 1 = off */
#define LED2			0x400	/* LED2-  0 = on, 1 = off */
#define LED1			0x200	/* LED1-  0 = on, 1 = off */	
#define LED0			0x100	/* LED0-  0 = on, 1 = off */

As you can see from the bit definitions above, only the low order 8 bits of
DCRREG are used, and only the high order 8 bits of MCRREG are used. You can
print out the current values of dcr_save and mcr_save with:
$ adb /unix /dev/kmem
dcr_save/x
mcr_save/x
^D
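
For the MCRREG value in an HDERR line, a sketch of pulling out the DMA
direction and the LED state using the masks quoted above. The two function
names are hypothetical. The LED bits are active low, so they are inverted
before being packed into a 4-bit mask.

```c
#define DMA_READ 0x4000  /* 0 = disk DMA write, 1 = disk DMA read */
#define LED3     0x800   /* LEDn: 0 = on, 1 = off */
#define LED2     0x400
#define LED1     0x200
#define LED0     0x100

/* 1 if the saved MCR says the disk DMA direction was "read". */
int mcr_dma_is_read(unsigned mcr)
{
    return (mcr & DMA_READ) != 0;
}

/* 4-bit mask of lit LEDs (bit n set = LEDn on). */
unsigned mcr_leds_on(unsigned mcr)
{
    return ((~mcr) & (LED3 | LED2 | LED1 | LED0)) >> 8;
}
```

MCRREG:9100 from the first example gives a DMA write with LED1, LED2 and LED3
lit.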

Well, that should do that for a while (until the flames roll in :)

>
>2. Does anyone know the specs for the WD1010 controller chip?
Ahhhhhhhhhhhhhh yup.

>                                                               In
>   particular, I have been told by a tech at Seagate that the ST-4096
>   will recal if step pulses are spaced more than 7ns apart.  Does the
>   controller chip meet this requirement?
If your drive can not handle slow seeks, it's junk. In this case I think you have
bad information. It is possible that the ST-4096 has a bad resonance problem
around 7ms. Our driver tells the WD1010 to seek the drives as fast as it can
all the time. From <sys/space.h>:------------+
struct gdsw gdsw[] = {                       V
	{	0,{"WIN", 1023, 4, 17, 68,0, 0, 512},
		-1,0,16, 1023*16,40, ghdintr, ghdstart,
		HDMAXCYL, HDMAXBADBLK, gdhdbbq, gdhdbb,0},	

The zero here means a 35us step rate. This leaves it up to the drive to seek
however fast it can. The WD1010 then waits for the drive to signal when it has
completed the seek. This is fantastic for voice coil drives. When a recal is
done when there's an error, the driver sends a RESTORE command. This is also
done at power up. With a restore command, the WD1010 waits for the drive to
signal seek complete BETWEEN EVERY STEP! This is why some drives are so very
noisy when they recal. The head has just stopped, the drive said it's done, then
the WD1010 tells it to start moving again! GO STOP GO STOP GO STOP...

>3. Is there anyone who has a ST-4096 in their UNIXpc and is having no
>   difficulty?
Try jan@bagend. I have heard of others.

Lastly, if you want to convert that nasty HDERR to a block number so you
can use the diagnostics to lock it out:

----
HDERR ST:51 EF:40 CL:4260 CH:4201 SN:420C SC:4202 SDH:4222 DMACNT:FFFF DCRREG:92 MCRREG:9F00 Sat Jun 11 14:43:05 1988

HDERR ST:51 EF:40 CL:4260 CH:4201 SN:420C SC:4202 SDH:4222 DMACNT:FFFF DCRREG:92 MCRREG:9F00 Sat Jun 11 14:43:05 1988

drv:0 part:2 blk:20294 rpts:2 Sat Jun 11 14:43:06 1988
----

This is a real example from my unix.log. Let's run down everything here.

ST=51:01010001: Ready, Seek complete, Error.
EF=40:01000000: CRC data field
CH=4201, CL=4260: Cylinder is $0160=256+96=352
SN=420C: Sector number is 12
SC=4202: Sector count for this transfer is 2. We don't know what the original
  count was, and we don't care. During a multiple sector transfer, the SN is
  incremented and the SC is decremented as each sector is transfered. This
  makes it easy to pause and retry in the middle of a multi-sector transfer.
  The bottom line is that the SN is accurate.
SDH=4222:[0 01 00 010]: Extension off (good), Sector size=01 (512, good),
  Drive=0, no surprise for this machine, Head=010=2 (the third head), no
  surprise as this is a 9 head drive.
DMACNT:FFFF, Hmm. The DMA_ERROR bit is on, which going by the <sys/iohw.h>
  comment (0 = error) means the DMA hardware did NOT flag an error. I don't
  know how to interpret the rest of the count register here. I don't think it
  matters.
DCRREG:92:10010010: FDRST* not asserted, FDDRIVE0* is asserted, FDMOTOR* is
  asserted, HDRST* is not asserted, DDRIVE0* is asserted, HDSELx=010=2=3rd.
  Well, this is interesting. It looks like the floppy drive was on when this
  hard drive error happened! The HDSELx lines DO correspond with the SDH reg,
  which is good.
MCRREG:9F00:10011111: CLRSINT, DMA_READ=0=disk DMA write, LPSTB, MCKSEL, and
  now for the important one: ooh! aah! all four LEDs were off! (so what)

drv:0       It was the first hard drive
part:2      It was the file system partition (I only have one)
blk:20294   See below
rpts:2      This matches, there were two HDERR lines
Sat Jun 11 14:43:06 1988: Ok, so I was goofing with the floppy drive then.

I picked this case out of my unix.log because it DID have this gd line. Most of
my bunches of HDERR lines DO NOT have them. As you can also see the high order
8 bits of the WD1010 registers were 42 in this example, and they were FF in the
first example, like I said phase of the moon.

Now for the fun part. Let's map this cryptic shit back to the real world.

I've got two flavors, depending on how you like to think:

sector = ((((CH%256)*256+(CL%256))*HEADS)+(SDH%8))*SECTORS_PER_TRACK+(SN%256)

or

sector = ((((((CH&0xff)<<8)+(CL&0xff))*HEADS)+(SDH&0x07))<<4)+(SN&0xff)

Our "blocks" are 1024 bytes, or 2 sectors, so the block=sector/2. There are 16 data
sectors, or 8 blocks per track.
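
The same mapping written out as C, as a sketch. The function names are
invented here. The head count and the number of data sectors per track are
passed in, since they depend on the drive (9 and 16 in my case).

```c
/* Absolute sector from the low bytes of CH, CL, SDH and SN. */
unsigned long hderr_sector(unsigned ch, unsigned cl, unsigned sdh, unsigned sn,
                           unsigned heads, unsigned data_sectors)
{
    unsigned long cyl  = ((unsigned long)(ch & 0xff) << 8) | (cl & 0xff);
    unsigned long head = sdh & 0x07;

    return (cyl * heads + head) * data_sectors + (sn & 0xff);
}

/* Blocks are 1024 bytes, i.e. two 512-byte sectors. */
unsigned long sector_to_block(unsigned long sector)
{
    return sector / 2;
}
```

The register values can be passed in straight from the log, garbage and all,
because the masks throw the top bits away.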

  ((((((0x4201&0xff)<<8)+(0x4260&0xff))*9)+(0x4222&0x07))<<4)+(0x420C&0xff)
  ((256+96)*9+2)*16+12
  (352*9+2)*16+12
  (3168+2)*16+12
  3170*16+12
  50720+12
  50732  This is the sector number from the beginning of /dev/rfp000
  25366  This is the block offset
-    72  Size of my bad block table
-  5000  Size of my swap partition
=======
  20294  What do you know! it matches!
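
The subtraction at the end generalizes to walking down the list of partition
sizes. A sketch, under the assumption that the bad block table and the swap
area sit in front of the file system in that order, which is what the numbers
above suggest for my disk. locate_block() and its arguments are made up for
illustration; your sizes will differ.

```c
/*
 * Given an absolute block number and the sizes (in blocks) of the leading
 * partitions, return the index of the partition the block falls in and
 * store the block offset within it. A block past all listed sizes is
 * reported in partition `nsizes`, i.e. the one that follows them.
 */
int locate_block(long abs_block, const long *sizes, int nsizes, long *offset)
{
    int i;

    for (i = 0; i < nsizes; i++) {
        if (abs_block < sizes[i])
            break;
        abs_block -= sizes[i];
    }
    *offset = abs_block;
    return i;
}
```

With sizes of 72 and 5000 and absolute block 25366, this comes out as
partition 2, block 20294, the same numbers as the "drv:0 part:2 blk:20294"
line.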

Now for even more fun! If this had been a recent HDERR message, I would now
run that neato-wiz-bang "bf" program Brant Cheikes just posted:

$ bf /dev/fp002 20294
block 20294 inode 8853

$ ncheck -i 8853
/dev/fp002:
8853	/usr/spool/news/comp/os/minix/1594

Ahh! Wouldn't you know it! I've got news stomping on my soft blocks!

I did not obtain permission from anyone for use of the information contained
in this article, so there. Western Digital does have good terse documentation,
just the way I like it: guts and no fluff.

Update on the second hard drive board: The board layout looks like it'll fly,
except for some cosmetic stuff. I got pricing from Saturn Electronics in MI:

2.5" x 3.0", 312 holes, no solder mask, no silk screen, double sided, plated
through holes, all holes one bit size.

 $200.00   Prototype quantity (about 10), including setup
 $100.00   Setup fee
   $3.75   Per board,   50 qty. ( $187.50)
   $3.00   Per board,  100 qty. ( $300.00)
   $2.76   Per board,  250 qty. ( $690.00)
   $2.09   Per board, 1000 qty. ($2090.00)
   $1.76   Per board, 5000 qty. ($8800.00)

If someone knows of a better price, call me. If you like these prices, call
me, and I'll give you the number of the local rep. I am in no way associated
with Saturn, I just heard of them by word of mouth. 

John
-- 
John Bly Milton IV, jbm@uncle.UUCP, n8emr!uncle!jbm@osu-cis.cis.ohio-state.edu
(614) h:294-4823, w:764-2933;  Got any good 74LS503 circuits?

dsueme@chinet.chi.il.us (dave sueme) (02/19/89)

Very informative article.  Whatever we are paying you, it ain't enough.

dave sueme

brant@manta.pha.pa.us (Brant Cheikes) (02/22/89)

Let me begin with a sincere and appreciative public THANK YOU! to John
Milton for his lengthy, informative, and authoritative discussion of
UNIXpc hard disk error messages and their interpretation.  Thanks are
owed as well to the irrepressible Lenny Tropiano for a private
communique covering mostly the same territory.  As David Sueme so
eloquently put it, whatever we're paying these folks ain't enough.

Nonetheless, I was slightly puzzled by John's rather inconclusive
conclusion:

>Ahh! Wouldn't you know it! I've got news stomping on my soft blocks!

After walking thru a particular error message, tracking it down to a
particular file and so forth, he makes the above offhand comment about
"soft blocks," then proceeds to change the subject.

So, if I may be so bold as to inquire further, what pray tell are
"soft blocks?"  John's treatment suggests that they need inspire only
the mildest concern.  Yet in my case, having a new "soft block" error
logged almost on a daily basis, that would seem inappropriate.
Comments, John?
-- 
Brant Cheikes
University of Pennsylvania, Department of Computer and Information Science
brant@manta.pha.pa.us, brant@linc.cis.upenn.edu, bpa!manta!brant

jbm@uncle.UUCP (John B. Milton) (02/23/89)

In article <468@manta.pha.pa.us> brant@manta.pha.pa.us (Brant Cheikes) writes:
>Let me begin with a sincere and appreciative public THANK YOU! to John
You're welcome.

...
>>Ahh! Wouldn't you know it! I've got news stomping on my soft blocks!
Excuse my attempt at levity. Yes, there is some concern here. In my case I have
not gotten any more hits on these spots. A good way to check whether a certain
HDERR is hard (always bad), soft (sometimes bad), or transient (usually not
related to the disk at all), is to:

cp /dev/rfp000 /dev/null

Ignore the "bad copy to /dev/null", and check /usr/adm/unix.log to see if you
have any new messages. Track down the file, and:

ln file /usr/adm/bad+junk

Just to make sure the bad spot doesn't get loose. Repeat this at different
times of the day. Try to pick times when your machine is under extreme
conditions: 5-6p.m. for lowest line voltage, 3a.m. for highest. Afternoon, or
whenever for highest temperature, etc. You might even set up a temporary cron
line to do this. You could also kick a second one off 5 minutes after the
first to see if your errors are seek related. Don't do too much of this, as it
can put a lot of wear on the moving parts of your head assembly!
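
Once you have run a few scans, sorting a spot into hard, soft or transient can
be sketched like this. The cut-offs are an assumption made for illustration
(one hit counts as transient, a hit on every scan counts as hard), not a rule
from any documentation, and the names are invented.

```c
enum spot_kind { SPOT_CLEAN, SPOT_TRANSIENT, SPOT_SOFT, SPOT_HARD };

/*
 * hits:  number of full-disk scans in which this block logged an HDERR
 * scans: total number of full-disk scans run
 * The thresholds below are illustrative assumptions.
 */
enum spot_kind classify_spot(int hits, int scans)
{
    if (hits <= 0)
        return SPOT_CLEAN;
    if (hits == 1)
        return SPOT_TRANSIENT;  /* a single hit is probably transient */
    if (hits >= scans)
        return SPOT_HARD;       /* bad every time it was read */
    return SPOT_SOFT;           /* sometimes bad */
}
```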

REMEMBER! when smgr finds that /usr/adm/unix.log has exceeded 10k in size,
it quietly deletes it! Shame on you if you don't have something like this run
out of cron every night:

cd /usr/adm
if [ -f unix.log ]; then
	cat unix.log >>UNIX.log
	rm unix.log
fi

About once a month, I go through this file and delete all the FDERR lines from
floppy formatting. After you have collected enough HDERR lines, you can get
all the suspect files in one place and flog them to get a feel for how "hard"
a given bad spot is. If you get a continuous stream of transient (one hit)
spots when scanning the whole disk, it is probably electronics, and not the
hard disk surface.

John
-- 
John Bly Milton IV, jbm@uncle.UUCP, n8emr!uncle!jbm@osu-cis.cis.ohio-state.edu
(614) h:294-4823, w:764-2933;  Got any good 74LS503 circuits?