[comp.unix.xenix] fixed disk error

romwa@gpu.utcs.toronto.edu (Mark Dornfeld) (04/29/88)

Can anyone help with this Xenix error message?

error on fixed disk (minor 40), block=16544
Error Type 0, Code 3, Unit 0
Write/Drive Fault

The message started appearing with different blocks identified
about a month after installation. About a dozen are now
listed.

I have run 'badtrk' already suspecting a flaw on the disk, but
no new bad tracks appeared.

Is there a way to find out what Cylinder/Head contains the
suspect tracks and put them in the bad track table?

Mark T. Dornfeld
Royal Ontario Museum
100 Queens Park
Toronto, Ontario, CANADA
M5S 2C6

mark@utgpu!rom      - or -     romwa@utgpu

stu@jpusa1.UUCP (Stu Heiss) (05/02/88)

In article <1988Apr29.151753.3956@gpu.utcs.toronto.edu> romwa@gpu.utcs.toronto.edu (Mark Dornfeld) writes:
-
-Can anyone help with this Xenix error message?
-
-error on fixed disk (minor 40), block=16544
-Error Type 0, Code 3, Unit 0
-Write/Drive Fault
-
-The message started appearing with different blocks identified
-about a month after installation. About a dozen are now
-listed.
-
-I have run 'badtrk' already suspecting a flaw on the disk, but
-no new bad tracks appeared.
-
You can use badtrk to map out a block once you know the cylinder/head/sector.
The non-destructive test is not good enough to catch much.  The destructive
one is pretty good but can miss too.
-
-Is there a way to find out what Cylinder/Head contains the
-suspect tracks and put them in the bad track table?
-
Look at /usr/adm/messages unless you have the misfortune of a bad block
associated with that file - it happened here once.  If this does
happen, do the following:
$ mv /usr/adm/messages /usr/adm/messages.bad
$ touch /usr/adm/messages
When you start having disk troubles, *CHECK THE CABLES!!!*  This is
*so* obvious that I never do it first and it has been the problem on
three different machines I'm responsible for.  I'm really going to look
there first next time it happens :-).  In particular, look for
connector pins that have lost their springiness or are bent, corrosion on
the edge connector (remove with a pencil eraser), and if the cable is
bent right at the connector, a possible wire break.  If this doesn't
turn up anything, get ready for some hair pulling.  I usually do a low
level format, mkfs, and start trying to dd the raw device a number of
times to see if I can isolate a bad block or get some confidence that
the problem was cured with the format.  You may want to try swapping
cables and disk controller if you have access to some spares.
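
A minimal sketch of the repeated read of the raw device described above,
assuming a POSIX sh and dd. The function name scan_passes and the device
name in the example are hypothetical; substitute the raw device node for
your own disk. It only reads, so it will not catch write-only faults.

```shell
#!/bin/sh
# Read a device (or file) end to end several times and count failed passes.
# dd exits non-zero when a read fails, so a bad block shows up as a failed pass.
scan_passes() {
    dev=$1       # device to read (the name used below is an assumption)
    passes=$2    # number of full read passes to make
    failed=0
    i=1
    while [ "$i" -le "$passes" ]; do
        if dd if="$dev" of=/dev/null bs=8k 2>/dev/null; then
            echo "pass $i: ok"
        else
            echo "pass $i: READ ERROR"
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done
    echo "failed passes: $failed of $passes"
    [ "$failed" -eq 0 ]   # return status: success only if every pass was clean
}

# Example (hypothetical device name; run as root with the filesystem unmounted):
# scan_passes /dev/rhd0a 5
```

If a pass fails, drop the 2>/dev/null so dd's own record counts are shown;
they give a rough idea of how far the read got before the error.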

stephm@sco.COM (Stephen P. Marr) (05/02/88)

romwa@gpu.utcs.toronto.edu (Mark Dornfeld) writes:
>
<...>Can anyone help with this Xenix error message?
>
>error on fixed disk (minor 40), block=16544
>Error Type 0, Code 3, Unit 0
>Write/Drive Fault
>
><...>
>
>Mark T. Dornfeld
>
><...>
>
>mark@utgpu!rom      - or -     romwa@utgpu

A Write/Drive Fault means that the controller went bye-bye.
I'm responsible for running some 35+ machines here at SCO,
and I've seen this error on two machines in the last 2.5 years.

On the first occasion, I went through the same grief as you
trying to figure out what the bejeezus was wrong; I tried badtrk'ing
just about anything that seemed anywhere near the error location
(I figured out the location by calculating the start of the filesystem and,
knowing my drive parameters, working out where an offset of
XXX blocks was; I then badtrk'd the track before it, the track after,
and the offending track).

I got a similar error in an unrelated region within two days.
"GARFLE" says I, as I proceeded to do it all over again, all the
time thinking, "If this keeps up, there won't be much of a disk left."

Within two days it happened again; so I replaced the drive.
That still didn't fix the problem.  So I replaced the controller.
I've since had the controller tested by the manufacturer, and it indeed
turned up faulty, and the original drive has worked perfectly in another
machine ever since.

So, my advice to you is to replace the controller.

Best of luck to you,
-- 
Steph Marr,  The Santa Cruz Operation Inc.,  ...!{uunet,ihnp4,ucscc}!sco!stephm
Internet: (MX Handlers) stephm@sco.COM  (Others) @ucscc.ucsc.edu:stephm@sco.COM
"There was coffee.  Life would go on."     --William Gibson

jack@turnkey.TCC.COM (Jack F. Vogel) (05/02/88)

In article <1988Apr29.151753.3956@gpu.utcs.toronto.edu> romwa@gpu.utcs.toronto.edu (Mark Dornfeld) writes:
>
>Can anyone help with this Xenix error message?
>
>error on fixed disk (minor 40), block=16544
>Error Type 0, Code 3, Unit 0
>Write/Drive Fault
>
>[...]
>I have run 'badtrk' already suspecting a flaw on the disk, but
>no new bad tracks appeared.
 
What type of drive is this? The reason I ask is that I had similar
behavior using an Atasi drive here. I would get intermittent errors
but when doing even a destructive drive scan the bad sectors would not
be found. I hate to tell you this, but eventually the drive really gave
up the ghost. It sounds like you may be experiencing a similar problem. I
would suggest you get ready to purchase a new drive for the system or, if it
is a new drive, to get it replaced. The clue is that you have drive errors
without any definite bad sectors being found; this indicates failing drive
mechanics rather than bad media. One final possibility is a failing controller.
In our case the bad drive was the second one, so a controller fault was
extremely unlikely.

>Is there a way to find out what Cylinder/Head contains the
>suspect tracks and put them in the bad track table?
>

Yes. Remember there are 17 blocks (sectors) per track, times X heads per
cylinder. I believe (somebody correct me if I'm wrong) that the blocks are
numbered cylinder by cylinder: heads 0 through X-1, at 17 sectors each, use up
consecutive block numbers before the numbering moves on to the next cylinder.
So on the first cylinder, head 0 will be blocks 0-16, head 1 will be blocks
17-33, etc. However, as I indicated above, I suspect you have a creeping drive
death here, and that marking the indicated blocks bad will not solve your
problem.
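
The arithmetic above can be sketched as follows. This assumes the numbering
scheme just described, that one block is one sector, and that the block
number is counted from the start of the disk; if it is relative to the start
of a filesystem, pass that starting offset as the fourth argument. The
function name and the head count of 5 in the example are hypothetical, and
the sector it prints is counted from 0 within the track.

```shell
#!/bin/sh
# Convert a block number to cylinder/head/sector.
# usage: block_to_chs BLOCK HEADS [SECTORS_PER_TRACK] [START_OFFSET]
block_to_chs() {
    block=$1
    heads=$2          # heads per cylinder (drive-specific)
    spt=${3:-17}      # sectors per track, 17 as stated in the text
    start=${4:-0}     # offset of the filesystem from the start of the disk
    abs=$((block + start))
    cyl=$((abs / (spt * heads)))       # whole cylinders before this block
    head=$(( (abs / spt) % heads ))    # track within the cylinder
    sec=$((abs % spt))                 # sector within the track, from 0
    echo "cyl=$cyl head=$head sector=$sec"
}

# Example with a hypothetical 5-head drive:
# block_to_chs 16544 5      # prints: cyl=194 head=3 sector=3
```

If the reported block is really a 1K filesystem block, double it before
converting, since it then covers two 512-byte sectors.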

					Hate to be the bearer of bad news.
						Best regards,


-- 
Jack F. Vogel
Turnkey Computer Consultants, Costa Mesa, CA
UUCP: ...{nosc|uunet}!turnkey!jack 
Internet: jack@turnkey.TCC.COM

romwa@gpu.utcs.toronto.edu (Mark Dornfeld) (05/08/88)

In article <497@scovert> stephm@sco.COM (Stephen P. Marr) writes:
>romwa@gpu.utcs.toronto.edu (Mark Dornfeld) writes:
>>
><...>Can anyone help with this Xenix error message?
>>
>>error on fixed disk (minor 40), block=16544
>>Error Type 0, Code 3, Unit 0
>>Write/Drive Fault
>>
>><...>
>>
>>Mark T. Dornfeld
>>
>><...>
>>
>>mark@utgpu!rom      - or -     romwa@utgpu
>
>A Write/Drive Fault means that the controller went bye-bye.
>I'm responsible for running some 35+ machines here at SCO,
>and I've seen this error on two machines in the last 2.5 years.
>
I've gotten some good advice on this problem and in general
everybody's experience is with a bad controller or disk.

But here's an update:  I noticed a pattern in the times of the
bad writes: all except two of them occur between 9 and 10
AM or between 5 and 7 PM.  This machine is in a high security
collection room in the Museum and there is some type of
security device with motion scanners and whatnot very near the
computer.  (I learned this after the trouble started
happening.)  When the doors are either unlocked in the AM or
locked in the PM, the security company sends some signals into
the security system for verification.  The clustering of these
bad writes leads me to suspect some high frequency
interference triggering an error message or, in fact, a bad
write to disk.

This is only the current theory and I will suspect everything
until I find the problem, but the plot of the times sure isn't
anywhere near random.
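
For anyone wanting to check their own logs for the same kind of clustering,
here is a rough sketch that counts errors per hour of the day. It assumes
the error times have already been pulled out by hand into one HH:MM entry
per line; the function name and the sample times in the example are
hypothetical and are not taken from this machine's log.

```shell
#!/bin/sh
# Count entries per hour of day.
# stdin:  one HH:MM time per line
# stdout: "HH count" for each hour that appears, in ascending hour order
hour_histogram() {
    cut -d: -f1 | sort | uniq -c | awk '{ print $2, $1 }'
}

# Example with made-up times:
# printf '09:12\n09:40\n17:05\n18:30\n17:59\n' | hour_histogram
# prints:
#   09 2
#   17 2
#   18 1
```

A few busy hours and many empty ones would point toward an outside trigger
rather than a randomly failing drive.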

There are no cron processes scheduled during these times so
nothing should be writing to disk anyway. 

Any more help/ideas are welcome.

Mark T. Dornfeld
Royal Ontario Museum
100 Queens Park
Toronto, Ontario, CANADA
M5S 2C6

mark@utgpu!rom      - or -     romwa@utgpu