[net.unix-wizards] RA60 hard error on 4.2BSD

pes@mitre-bedford.ARPA (04/19/85)

	I understand that hard disk errors with DEC UDA-50 controllers
and RA60/RA80 drives have been a problem for many VAX UNIX installations.
There was an article posted recently to UNIX-INFO by Ed Merrill and Andy 
Linton which addressed bad block replacement and DEC standard 144 and 166
disk media.  However, it did not provide me with a solution to my immediate
problem.

	I'm running 4.2BSD on a VAX-11/780 with RA60 disk drives.
Occasionally while executing tasks which have a lot of disk activity, I
get the following errors at the console:

     uda0: hard error, disk transfer error, unit 0, grp 0x0, hdr 0x26aec
     uda0: hard error, SDI error, unit 0, event 0353, hdr 0x0

The second error repeats about 6 times.
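For readers trying to decode lines like the two above, here is a minimal Python sketch. It makes several assumptions. It handles only these two message formats. It assumes the SDI "event" field is an MSCP status code printed in octal, with the major status code in the low 5 bits and a subcode above them. It also assumes the major-code names listed below, which come from our reading of the 4.2BSD MSCP header (vaxuba/mscp.h), so check them against your own copy. The parse and decode_event helpers and their field names are hypothetical, not part of any DEC or BSD tool.

```python
import re

# Hypothetical parser for the two uda console formats quoted in this thread.
# Other formats (controller error, small disk error, ...) are not handled.
LINE = re.compile(
    r"uda(?P<ctlr>\d+): (?P<severity>hard|soft) error, "
    r"(?P<kind>disk transfer|SDI) error, unit (?P<unit>\d+), "
    r"(?:grp 0x(?P<grp>[0-9a-f]+)|event (?P<event>0[0-7]*)), "
    r"hdr 0x(?P<hdr>[0-9a-f]+)"
)

# ASSUMED MSCP major status codes (low 5 bits of a status/event code),
# as we read them in the 4.2BSD MSCP header; verify against your sources.
MAJOR = {
    0: "success", 1: "invalid command", 2: "command aborted",
    3: "unit offline", 4: "unit available", 5: "media format error",
    6: "write protected", 7: "compare error", 8: "data error",
    9: "host buffer access error", 10: "controller error",
    11: "drive error", 31: "message from an internal diagnostic",
}


def decode_event(code: int):
    """Split an MSCP event code into (major status name, subcode)."""
    major = code & 0o37   # low 5 bits: major status code
    sub = code >> 5       # remaining bits: subcode
    return MAJOR.get(major, f"unknown ({major})"), sub


def parse(line: str) -> dict:
    """Turn one uda console line into a dict of numeric fields."""
    m = LINE.search(line)
    if m is None:
        raise ValueError(f"unrecognized uda message: {line!r}")
    d = m.groupdict()
    out = {
        "ctlr": int(d["ctlr"]),
        "hard": d["severity"] == "hard",
        "kind": d["kind"],
        "unit": int(d["unit"]),
        "hdr": int(d["hdr"], 16),   # converted, but not interpreted
    }
    if d["grp"] is not None:
        out["grp"] = int(d["grp"], 16)
    if d["event"] is not None:
        # the event appears to be printed in octal (leading 0)
        out["event"] = decode_event(int(d["event"], 8))
    return out
```

On the lines quoted above, this reports hdr 0x26aec as 158444 decimal and event 0353 as major code 11 ("drive error") with subcode 7. If the assumed table is right, that points toward the drive side rather than the host driver. That is a hint, not a diagnosis.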

	What do these errors mean?  I'm not an expert on device drivers,
but some people here think the driver is to blame.  Has anyone had
similar problems, or does anyone know of a cause?  Or better yet, a fix?
Thanks.

Paul Silvey
pes@mitre-bedford.arpa

mmason@psu-cs.UUCP (Mark C. Mason) (04/22/85)

> 
>      uda0: hard error, disk transfer error, unit 0, grp 0x0, hdr 0x26aec
>      uda0: hard error, SDI error, unit 0, event 0353, hdr 0x0
> 
> The second error repeats about 6 times.

> 	What do these errors mean?  I'm not an expert on device drivers,
> anyone had similar problems or know of a cause?  Or better yet a fix?


	Ever since our DEC service rep started replacing the spindle
brushes on our three RA81s on a regular basis, this problem has all
but disappeared.  When the error messages start cropping up again, usually
after about three months, we check the brushes and usually find one or two
that need replacement.  You might also persuade your rep to check the
hardware revisions on your disks; new ones seem to come out about twice a year.


					Mark

zemon@fritz.UUCP (Art Zemon) (04/24/85)

In article <> pes@mitre-bedford.ARPA writes:
>	I'm running 4.2BSD on a VAX11/780 with RA60 disk drives.
>Occasionally while executing tasks which have a lot of disk activity, I
>get the following errors at the console:
>
>     uda0: hard error, disk transfer error, unit 0, grp 0x0, hdr 0x26aec
>     uda0: hard error, SDI error, unit 0, event 0353, hdr 0x0

I am having the same problem with an RA81 on an 11/750.  Any
advice that would spare me from reformatting the disk would be
greatly appreciated.  Phone me if necessary.  I'm currently
planning to reformat the entire disk on May 5.

Thanks,
-- 
	-- Art Zemon
	   FileNet Corp.
	   ...! {decvax, ihnp4, ucbvax} !trwrb!felix!zemon

shimell@stc.UUCP (Dave Shimell) (04/26/85)

In article <10072@brl-tgr.ARPA> pes@mitre-bedford.ARPA writes:
>
>	I understand that hard disk errors with DEC UDA-50 controllers
>and RA60/RA80 drives have been a problem for many VAX UNIX installations.
>.............................................................  Has
>anyone had similar problems or know of a cause?  Or better yet a fix?
>Thanks.
>
>Paul Silvey
>pes@mitre-bedford.arpa

	We run binary Ultrix on our 785 and 750s.  Since the beginning
	of the year we have had bad block problems on one of our
	RA81s.  DEC replaced the HDA three times, and each replacement
	seemed to cure the problem until a month or so later, when we
	would get a crash with a bad block in the inode area.  This is
	particularly painful, as fsck can't fix this problem.

	In the end DEC did two things:

	1.	Modified the strapping on each of our HDAs.  (With the
		straps set in a certain way there are fewer uncorrectable
		ECC errors; contact DEC Field Service.)

	2.	Delivered /rabads - this is a standalone program from
		Ultrix release 1.1.  Rabads can be used to inspect and
		patch the bad block table.  Once patched, the UDA50
		ensures that the disk appears contiguous to the
		operating system.
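	The remapping described in item 2 can be pictured with a toy
	model.  This is a minimal Python sketch under stated
	assumptions: the host sees blocks 0..n-1, spare blocks sit just
	past that range, and a dictionary stands in for the replacement
	table.  The RemappedDisk class and its methods are hypothetical.
	This is not the on-disk table format the UDA50 uses, and it is not
	how rabads works internally.

```python
class RemappedDisk:
    """Hypothetical model of controller-side bad block replacement.

    The host addresses logical blocks 0..nblocks-1.  Any block recorded
    in the replacement table is silently redirected to a spare block,
    so the host never sees a hole in its address range.
    """

    def __init__(self, nblocks: int, nspares: int):
        self.nblocks = nblocks        # blocks visible to the host
        self.spare_base = nblocks     # ASSUMPTION: spares follow the host area
        self.nspares = nspares
        self.table = {}               # bad host block -> physical spare block

    def _check(self, lbn: int) -> None:
        if not 0 <= lbn < self.nblocks:
            raise ValueError("block outside host-visible range")

    def replace(self, lbn: int) -> int:
        """Mark lbn bad and assign it the next free spare (idempotent)."""
        self._check(lbn)
        if lbn in self.table:
            return self.table[lbn]
        if len(self.table) >= self.nspares:
            raise RuntimeError("no spare blocks left")
        spare = self.spare_base + len(self.table)
        self.table[lbn] = spare
        return spare

    def physical(self, lbn: int) -> int:
        """Translate a host block number to the block actually used."""
        self._check(lbn)
        return self.table.get(lbn, lbn)
```

	For example, after d = RemappedDisk(1000, 10) and d.replace(42),
	d.physical(42) gives 1000 while d.physical(43) is still 43, so
	the host keeps addressing a gap-free range 0..999.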

	Clearly, rabads could be used on any O/S, provided it can be
	loaded into memory (it's a standalone program).  Now the bad
	news - I'm not sure whether DEC would supply you with rabads
	unless you run Ultrix.  However, if you think you have a bad
	block and your Field Service Engineer can't get the hardware to
	work reliably, it might be in both your interests for DEC to
	supply rabads.  Unfortunately, since rabads is DEC proprietary
	software, I am unable to send it to anyone.

	These two mods seem (fingers crossed) to have solved our
	problems.  Yes, we have had crashes, but they do not follow the
	pattern we experienced previously.

Regards,
Dave Shimell.
shimell@stc.UUCP
{root44, ukc, idec, stl, creed, stc-[bcdf]}!stc-a!shimell

jsdy@hadron.UUCP (Joseph S. D. Yao) (05/07/85)

>      uda0: hard error, disk transfer error, unit 0, grp 0x0, hdr 0x26aec
>      uda0: hard error, SDI error, unit 0, event 0353, hdr 0x0
> The second error repeats about 6 times.
> 	What do these errors mean?  ...

While working on a System V driver, this happened a lot.  The field
service person had just declared the hardware to be perfect, so I also
assumed it was a problem in a not-yet-quite-bugless driver.  I worked
for weeks (off and on) to make the driver more and more perfect.

The problem was hardware.  One of the boards in the drive itself (the
interface to the outside world, I think) had to be replaced.  For a day
we ran on the other port of the dual-access drive.  We found this all out
when it got so bad we went back to DEC.  Our regular field service
person then came out to run diagnostics.  She found the problem
immediately.  (*sigh*)

Try swapping the drive cables in back from "A" to "B", and spinning
up the drive with the "B" button pushed in, rather than the "A".
See if that makes a difference.  Then get your field service to run
lots and lots of diagnasties (on A) to show your advisors.

	Joe Yao		hadron!jsdy@seismo.{ARPA,UUCP}

ron@ron1.UUCP (Ron Saad) (06/26/85)

We are rather new on the net, so I missed the original article. I tried
to contact the author of the 'solution' several times via mail, but
never got a response, so I assume my message never got there.

We have been having the same problem with our UDA50-RA81-RA60 system.
We are running a VAX-11/780 with 4.2BSD.  Since the problem also occurred
on a System V machine, I assume it's not the drivers.

Our service people have replaced every board on the system (all the
RA60 boards, the uda50, the personality module on the RA81), all with
no success - the problem keeps occurring.  It wouldn't be so bad if it
just crashed the system, but sometimes UNIX does not recover and just
hangs until I come in and force a reboot.

If anyone out there in wizard-land can give us more information, it
would be greatly appreciated (the service people are now going to
replace the power supply boards ...  :-)

If the person who originally posted the problem has succeeded in
solving it, PLEASE tell us how.
-- 
------------the opinions expressed above etc. etc.  --------------

					Ron Saad  (4Z4UY)
					Sys Adm -  Center for Advanced
					Technology in Telecommunications
					Polytechnic Institute of New York

UUCP:	...{ihnp4,seismo}!{philabs,cmcl2}!ron1!ron
MAIL:	333 Jay St. Brooklyn, N.Y. 11201
PHONE:	(718) 643-7303