[comp.unix.ultrix] RA81 disk error/dump problem

jjb@ares.cs.wayne.edu (Jon J. Brewster) (02/22/90)

This concerns a uVAX3600 running Ultrix V3.1.  We've recently started
having problems with dump -- we get lots of bread messages with block
numbers and a final one that says "More than 32 block read errors from
64884".  Uerf shows many bad block replacement events for that disk, all
for the same LBN, 177837.  However, it also lists "PREVIOUS RBN 0." and
"NEW RBN 0.".  I assume that RBN means Replacement Block Number.  It
also says BAD BLOCK REPL CAUSE x0048.

A couple of puzzlements: when I do the icheck/ncheck two-step on the
block numbers listed in the "(This should not happen) bread..." messages,
I find that the blocks are located in a very few files (i.e., many
blocks pointed at by far fewer inodes).  Further, the inode numbers are
almost consecutive.  Seems somehow reasonable.  However, icheck claims
that block 64884 is a free block.  So, what does that last message
indicate?

Second, I would hope that the problems uerf reports would somehow
relate to this, but I can't see any relationship between the LBN
and any of the bread errors.  Does anyone know the significance
of the RBN 0 lines?  The drive is an old one, left over from an 11/780
of yesteryear.  I wonder if the lines mean that it's incapable of
doing bad block replacement?  (I've run radisk on the disk, and
it seems to have no effect.  The -s option always gets caught in a
loop saying that LBN 177837 is replaced, and the -r option returns
with no message.)

Please e-mail me and I'll summarize if there's any interest.

TIA,

--
 Jon J. Brewster
 jjb@cs.wayne.edu
 ...!umich!wsu-cs!jjb

alan@shodha.dec.com ( Alan's Home for Wayward Notes File.) (02/22/90)

In article <1091@wsu-cs>, jjb@ares.cs.wayne.edu (Jon J. Brewster) writes:
> This concerns a uVAX3600 running Ultrix V3.1.  We've recently started
> having problems with dump -- we get lots of bread messages with block
> no.s and a final one that says "More than 32 block read errors from
> 64884"  Uerf shows many bad block replacement events for that disk, all
> for the same LBN, 177837.  However, it also lists "PREVIOUS RBN 0." and
> "NEW RBN 0.".  I assume that RBN means Replacement Block Number.  It
> also says BAD BLOCK REPL CAUSE x0048.

	I haven't studied the BBR algorithm closely enough to
	see what is really going on, but it sounds like something
	is causing BBR to break.  I saw a problem similar to this
	on a V3.x field test, but it was fixed.  My guess is that
	the RCT is corrupted and is causing BBR to fail in
	mysterious ways.  Generally when BBR doesn't work you'll
	get a message like:

		Media is write-protected.
		Back up media and reformat.
> 
>  Jon J. Brewster
>  jjb@cs.wayne.edu
>  ...!umich!wsu-cs!jjb

	For reference:

		RBN - Replacement Block Number.
		RCT - Replacement Control Table.

	LBNs from radisk are reported relative to the beginning
	of the disk, whereas most other ULTRIX utilities report block
	numbers relative to the file system (partition).
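
	To make the two numbering schemes comparable, subtract the
	partition's starting sector from the disk-relative LBN.  A
	minimal sketch in POSIX sh, assuming both numbers are in
	512-byte sectors; the function names and the partition start
	used in the usage note are made up for illustration, so take
	the real starting sector from your own partition table:

```shell
# Convert between disk-relative LBNs (as radisk reports them) and
# partition-relative block numbers. Assumes both use 512-byte sectors.

# disk_to_part PART_START LBN: block number relative to the partition start.
disk_to_part() {
    echo $(( $2 - $1 ))
}

# part_to_disk PART_START BLOCK: LBN relative to the start of the disk.
part_to_disk() {
    echo $(( $1 + $2 ))
}
```

	With a hypothetical partition starting at sector 100000,
	"disk_to_part 100000 177837" prints 77837.  A negative result
	would mean the LBN lies before the start of that partition.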

	Solution?  You'll want to back up as much of the disk as
	you can.  Since dump doesn't work, try tar, but you'll need
	to avoid the files with the messed-up blocks.  Since ncheck
	and icheck aren't helping, try find and cp, which reads every
	file and lets cp complain about the ones it can't read:

		find file-system -print -exec cp {} /dev/null \;
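
	A variation on the same idea collects only the names of the
	files that fail to read, so they can be kept out of the tar
	backup.  This is a sketch, not something from the original
	thread: the function name list_unreadable is made up, and it
	assumes a failed read shows up as a non-zero exit status
	from cat:

```shell
# list_unreadable DIR: print the path of every regular file under DIR
# that cannot be read from start to end.
# Limitation: file names containing newlines are not handled.
list_unreadable() {
    find "$1" -type f -print | while IFS= read -r f; do
        # cat reads the whole file; a non-zero status means a read failed.
        cat "$f" > /dev/null 2>&1 || printf '%s\n' "$f"
    done
}
```

	Redirecting the output, for example
	"list_unreadable /usr > /tmp/badfiles" (both paths are only
	examples), leaves a list of files to skip during the backup.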

	After getting a backup, have Field Service format the disk.

-- 
Alan Rollow				alan@nabeth.enet.dec.com