[comp.unix.ultrix] uq0 being reset

russ@mays.ucr.edu (02/07/90)

We have a VAXstation II running Ultrix 2.0 with
a TK50, and two RD53 drives (one recently added).
From time to time console messages appear saying
	Force Error Modifier set LBN ......
	ra1g: hard error sn .....
Lately these have been increasing, so I did a dump
of ra1g, newfs and restore.  The restore died
with
	uq0 being reset
and the system hung. Now I can't do anything
to that disk partition without the system hanging
and the uq0 message appearing.
Any suggestions as to how I can get my disk working
again? Any help appreciated.

Thanks,
Russ Harvey
russ@ucrmath.ucr.edu
{ucsd,uci}!ucrmath!russ

grr@cbmvax.commodore.com (George Robbins) (02/13/90)

In article <3888@ucrmath.UCR.EDU> russ@mays.ucr.edu () writes:
> We have a VAXstation II running Ultrix 2.0 with
> a TK50, and two RD53 drives (one recently added).
> From time to time console messages appear saying
> 	Force Error Modifier set LBN ......
> 	ra1g: hard error sn .....

It is important to understand that these messages are basically *fatal* -
meaning that you need to take action as soon as you see them.  It is probably
an indication that either your drive wasn't adequately formatted/tested
initially or that it is picking up new errors.

Typically, after getting a hard error, you want to do a backup, address
the error condition and then restore the filesystem.  You can use something
like "tar cvf /dev/null /mount_point" to try to figure out which file(s)
the bad spot(s) are in, if you care.

If the bad block(s) are in inodes or other unpleasant spots, your system may
crash when accessing the mounted filesystem or the filesystem may become
more corrupt.  Your uq0 hang sounds more like a drive/controller problem,
though I don't have any experience with the RD type drives.

> Lately these have been increasing, so I did a dump
> of ra1g, newfs and restore.  The restore died
> with
> 	uq0 being reset
> and the system hung. Now I can't do anything
> to that disk partition without the system hanging
> and the uq0 message appearing.

> Any suggestions as to how I can get my disk working
> again? Any help appreciated.

Sigh...  Run whatever formatter / initialization diagnostics you have at
hand.  The drive or drive/controller combination may be flaky.  If you
don't have anything, then it might be worth having DEC come in - a flaky
drive can cost a lot of your time and pulled out hair...
-- 
George Robbins - now working for,     uucp:   {uunet|pyramid|rutgers}!cbmvax!grr
but no way officially representing:   domain: grr@cbmvax.commodore.com
Commodore, Engineering Department     phone:  215-431-9349 (only by moonlite)

alan@shodha.dec.com ( Alan's Home for Wayward Notes File.) (02/13/90)

In article <3888@ucrmath.UCR.EDU>, russ@mays.ucr.edu writes:
> We have a VAXstation II running Ultrix 2.0 with
> a TK50, and two RD53 drives (one recently added).
> From time to time console messages appear saying
> 	Force Error Modifier set LBN ......
> 	ra1g: hard error sn .....

> Lately these have been increasing, so I did a dump
> of ra1g, newfs and restore.  The restore died
> with
> 	uq0 being reset

	The "uq" mentioned is the disk controller.  For some
	reason ULTRIX didn't get any response to a command
	it gave to the controller, so it reset it.

	The two most likely causes are:

	1.  The disk is dead (*).  You might try reformatting it,
	    but that will probably only hang the diagnostics.

	2.  The disk controller is sick or dead.  If you can still
	    access the other disk then it may only be sick.  If
	    you have a Field Service contract give them a call and
	    suggest that the disk controller is broken.

> Thanks,
> Russ Harvey
> russ@ucrmath.ucr.edu
> {ucsd,uci}!ucrmath!russ

	* Refer to the Monty Python "Dead Parrot" sketch for more 
	  information about this type of "dead".

-- 
Alan Rollow				alan@nabeth.enet.dec.com

alan@shodha.dec.com ( Alan's Home for Wayward Notes File.) (02/14/90)

In article <9648@cbmvax.commodore.com>, grr@cbmvax.commodore.com (George Robbins) writes:
> In article <3888@ucrmath.UCR.EDU> russ@mays.ucr.edu () writes:
> > We have a VAXstation II running Ultrix 2.0 with
> > a TK50, and two RD53 drives (one recently added).
> > From time to time console messages appear saying
> > 	Force Error Modifier set LBN ......
> > 	ra1g: hard error sn .....
> 
> It is important to understand that these messages are basically *fatal* -
> meaning that you need to take action as soon as you see them.  It is probably
> an indication that either your drive wasn't adequately formatted/tested
> initially or that it is picking up new errors.

	One of the features of the Digital Storage Architecture
	(DSA) is that it tries to provide applications a view
	of disks that makes them appear to be error free.  It does
	this by mapping bad sectors to good ones.  Any initially bad
	sectors are mapped when the disk is formatted.  For errors
	that occur after formatting there are parts of the architecture
	that describe what is to be done.

	For this commentary I'll call the process Bad Block Replacement
	(or BBR).  There are two kinds of BBR, static and dynamic.  Pre-V2.0
	versions of ULTRIX and BSD 4.2 (and probably 4.3) do static BBR.
	If a bad block appeared and had to be fixed you booted a stand-
	alone program (rabads I think) that would let you scan the disk
	and would do the BBR for you.  Dynamic BBR has been supported by
	every version of ULTRIX since V2.0, and by some disk controllers.

	The UDA50, KDA50 and KDB50 disk controllers will report a bad
	block to the host and expect the host to perform BBR.  The
	RQDX3 and HSC family will do the BBR themselves.  Part of the
	BBR process is to attempt to read the block many times in order
	to get a good copy of the data.  If the attempts fail then the
	original copy of the data is written to a replacement block and
	a bit is set in the block header.  This is the "Forced Error"
	referred to in the error message.  The block is good, but the
	data is not what it should have been.  Rather than
	gloss over it, the drivers force an Input error when the block
	is accessed.  The bit gets cleared when the block is written to.

	In V2.0 and later there is a program called radisk(8) that has
	options to scan for bad blocks, clear forced errors and start
	the BBR algorithm for a specific block or set of blocks (more
	on this one later).  The command to clear a forced error is:

		radisk -c LBN length special

		LBN is the logical block number where the forced
		error is.  The length is generally 1, but if you have a
		set of sequential forced errors you can get them all
		at once.  The last argument is the special device file
		for the disk.

	NOTE:  Radisk should only be run with the system single user.
	This is a documented restriction of the program.

	The scan operation tells the controller to scan the disk and
	doesn't transfer any of the data back to the host.  This makes it
	faster than doing something like a dd(1) to read every block.
	The command is:

		radisk -s LBN length special

	If you want to scan the entire disk you can use:

		radisk -s 0 -1 special

	and radisk(8) will figure out the length.  The command to
	force BBR is:

		radisk -r LBN special

	The algorithm doesn't automatically replace a block, but
	exercises it to make sure that it is bad.  If the block isn't
	bad then it won't replace it.
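
	For comparison, the dd(1) way of reading every block would look
	like the sketch below.  A small scratch file stands in for the
	raw device, which on the real system would be something like
	/dev/rra1g; the names and sizes here are invented.

```shell
# Make a small scratch "disk", then read every block of it the way
# dd would read the raw device on the real system.
dd if=/dev/zero of=/tmp/fakedisk bs=512 count=16 2>/dev/null
dd if=/tmp/fakedisk of=/dev/null bs=512
```

	Any block dd can't read shows up as a read error with the
	record count telling you roughly where it was.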

> 
> Typically, after getting a hard error, you want to do a backup, address
> the error condition and then restore the filesystem.  You can use something
> like "tar cvf /dev/null /mount_point" to try to figure out which file(s)
> the bad spot(s) are in, if you care.

	Once you've cleared a Forced Error on a replaced block you
	need to determine if the block was important.  George's
	suggestion is ok, but if you know the block numbers and can
	translate them into block numbers within the partition
	there are simpler ways of finding the file.
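
	The translation itself is just a subtraction of the partition's
	starting block.  A sketch with made-up numbers (get the real
	starting block of your partition from /etc/disktab):

```shell
# All numbers here are hypothetical, for illustration only.
PART_START=33440	# starting block of the partition (from /etc/disktab)
LBN=51234		# LBN reported in the console error message
REL=`expr $LBN - $PART_START`
echo "block $REL within the partition"
```

	which prints "block 17794 within the partition".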

	First identify where the block is:

		icheck -b block-number special

	Icheck(8) can take a list of block numbers and identify
	where the blocks are.  It will say whether the block is
	part of the inode list (and which inodes), a data block
	of a file, a free block, a superblock (or backup superblock),
	etc.  If the block belongs to a file you can track down the
	file name by the inode number with:

		ncheck -i inode-number special

	This can be slow, so if you can mount the file system another
	method is to use ls and grep:

		ls -Rli | grep inode-number
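
	A toy illustration of the ls/grep method (scratch directory and
	file names invented; on a real problem the inode number would
	come from icheck, not ls -i):

```shell
# Create a scratch file, note its inode number, then find the name
# again by grepping a recursive long listing for that number.
mkdir -p /tmp/inodedemo
echo data > /tmp/inodedemo/victim
ino=`ls -i /tmp/inodedemo/victim | awk '{print $1}'`
ls -Rli /tmp/inodedemo | grep "$ino"
```

	The line that comes back names the file owning that inode.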

	Once you know the file you can replace it with a good copy of
	it from a backup or the distribution (or other system).  Some-
	times it will be a file that is easily recreated (object file
	for example).

	If a block of inodes is bad you'll have to determine if any
	of them are used.  Generally for this I use fsck so I can
	repair any damage that there is.  Sometimes the damage will be
	bad enough that it's simpler to restore from a backup.
> 
> If the bad block(s) are in inodes or other unpleasant spots, your system may
> crash when accessing the mounted filesystem or the filesystem may become
> more corrupt.

	For this reason it's a good idea to avoid mounting the file
	system until you know where the problem is.

-- 
Alan Rollow				alan@nabeth.enet.dec.com

grr@cbmvax.commodore.com (George Robbins) (02/14/90)

In article <709@shodha.dec.com> alan@shodha.dec.com ( Alan's Home for Wayward Notes File.) writes:
> In article <9648@cbmvax.commodore.com>, grr@cbmvax.commodore.com (George Robbins) writes:
> > 
> > It is important to understand that these messages are basically *fatal* -
> > meaning that you need to take action as soon as you see them...
> 
> 	One of the features of the Digital Storage Architecture
> 	(DSA) is that it tries to provide applications a view
> 	of disks that makes them appear to be error free...
> 	                                ...  The block is good, but the
> 	data is corrupted from what it should have been.  Rather than
> 	gloss over it, the drivers force an Input error when the block
> 	is accessed.  The bit gets cleared when it is written to.

Thanks to Alan for posting all the additional info.  Since we've escaped from
the original "yer drive/controller is dead" subject, I'll expand a bit also...

1) The "forced error" is essentially a "tombstone".  The damage was done some-
   time before, the bad block was replaced, but the "tombstone" marks that
   place where the "corpse", the un-recovered data, lies.

   Obviously, whether the corrupted data is important to recover or not is
   installation/case dependent.  HOWEVER, even though only a few bits or
   bytes may have gotten zapped, it is important to understand that the
   forced error condition does propagate up to the user level software.

   This means that if a program bothers to do any error checking, it's
   likely to toss its cookies or at least stop processing that file then
   and there.  For example dump will print "shouldn't happen" errors and
   tar truncates the file and prints a "size changed" message.  What your
   pet application does is another question.

   If you're getting these messages, it is still important to act on them
   with due haste, though not necessarily panic.

2) Generally, you will tend to get the forced error message with the same
   block number repetitively - there was only one original error/replacement
   but each time you read the file you'll get the forced error again, for
   example as part of your daily backup run...

   If you start getting a variety of block numbers, then it's a pretty good
   indication that your drive is starting to go south or maybe (especially
   if it's a third-party drive) you didn't run enough surface analysis to
   initially pick up all the bad spots.

3) You can use "/etc/uerf -o full -D" to pick up the gory history of the
   problem, however interpretation of the error log is non-trivial.

   I can't remember off-hand whether a "BAD BLOCK REPLACEMENT FAILED"
   message actually gets logged to the console - I've always ended up seeing
   the "forced error" messages rather than the original error.

   The value of /etc/uerf is somewhat compromised by the amount of pro-forma
   crap that gets stuck in there, especially if you have something like 12K
   "fixed up unaligned access" messages (lps20 "lpscomm" thanks 8-) gumming
   up the works.

   I've attached a little shell script that I run nightly that mails me a
   summary of accumulated errors.  Periodically, I clear the logfile and
   let it start over.

   One of the other guys posted a program to do some log selection/analysis
   which might be a better starting point - I haven't messed with it yet...

Here's the error log analyzer...
--------------------------------------------------------------------------------
#! /bin/sh
# This is a shell archive, meaning:
# 1. Remove everything above the #! /bin/sh line.
# 2. Save the resulting text in a file.
# 3. Execute the file with /bin/sh (not csh) to create the files:
#	daily
# This archive created: Wed Feb 14 07:32:11 1990
export PATH; PATH=/bin:$PATH
echo shar: extracting "'daily'" '(360 characters)'
if test -f 'daily'
then
	echo shar: will not over-write existing file "'daily'"
else
sed 's/^	X//' << \SHAR_EOF > 'daily'
	X#! /bin/sh -
	X(
	X# extract summary/counts from uerf garbage
	X
	Xecho ""
	Xecho "Error Log Messages:"
	X/etc/uerf | \
	X	egrep '^MESSAGE|^ERROR' | \
	X	sed -e 's/.*MESSAGE *//' -e 's/.*ERROR SYNDROME *//' | \
	X	sort | \
	X	uniq -c
	X
	X# should do some kind of (crude) rotation...
	X
	X# LOG=/usr/adm/syserr/`hostname`
	X# cat $LOG >> $LOG.old
	X# cat /dev/null > $LOG
	X
	X) 2>&1 | mail root
SHAR_EOF
if test 360 -ne "`wc -c < 'daily'`"
then
	echo shar: error transmitting "'daily'" '(should have been 360 characters)'
fi
fi # end of overwriting check
#	End of shell archive
exit 0
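
For the curious, the guts of the script can be exercised on canned input -
the MESSAGE/ERROR lines below are invented, not real uerf output:

```shell
# Run fake uerf-style lines through the same egrep/sed/sort/uniq chain
# the script uses; the last line shows that non-matches get dropped.
cat << 'EOF' > /tmp/fakeuerf
MESSAGE    hard error
MESSAGE    hard error
ERROR SYNDROME    forced error
SOME OTHER LINE that gets dropped
EOF
egrep '^MESSAGE|^ERROR' /tmp/fakeuerf | \
	sed -e 's/.*MESSAGE *//' -e 's/.*ERROR SYNDROME *//' | \
	sort | uniq -c
```

which should print a count of 1 for "forced error" and 2 for "hard error".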
-- 
George Robbins - now working for,     uucp:   {uunet|pyramid|rutgers}!cbmvax!grr
but no way officially representing:   domain: grr@cbmvax.commodore.com
Commodore, Engineering Department     phone:  215-431-9349 (only by moonlite)