[comp.unix.ultrix] Problems with a uVAX II/gpx

jeff@ms.uky.edu (Jeff Anderson) (10/12/89)

Hi.  We have a MicroVax II/GPX running Ultrix V3.0.  This system has two
disk drives (drive 0 = rd53, drive 1 = rd54).  The problem is that about
once every two weeks, the system hangs.  Here is a description from the
person having the problem:

"                              ....  I get the "Force Error Modifier Set:
LBN xxxxx" message occasionally, where xxxxx is some number which changes.
I also get message "pid xxx was killed on swap error".  Apparently these two
messages come as a pair.   After getting several of these messages, some of
the window programs stop working: xterm does not work or does not create a
window at all, or the system does not respond to the keyboard, etc.
Usually the last recourse is to shut down the system and reboot.   The system
then won't boot, complaining of root file system corruption, etc., so
that you are forced to rebuild the root file system from a dump tape.
This has been happening about once every other week."

These messages are appearing on the console.  I believe that he uses both
DECwindows and a few X windows programs at the same time.  Could these
problems be related to that?  

So, has anybody seen something like this before?  Is the problem software
or hardware?  I would appreciate any help in finding out just what the
problem is.  Thanks,


-- 
Jeff Anderson				Internet: jeff@engr.uky.edu  
Dept. of Electrical Engineering		          jeff@ms.uky.edu
University of Kentucky   		UUCP: {rutgers | uunet}!ukma!ukecc!jeff
Lexington, KY 40506			BITNET:  jeff@UKMA.BITNET

schedler@qadgop.DEC.COM (Richard Schedler) (10/13/89)

Sounds like you have bad blocks on your swap partition.  There's a 
section in the Ultrix System Manager's Guide that explains bad block 
replacement and how to recover from Forced Error Modifier Set flags.
Also check out the radisk(8) manpage.

	--Richard

--------
Richard Schedler                      Internet: schedler@src.dec.com 
Systems Research Center		      UUCP:     decwrl!schedler
Digital Equipment Corporation

alan@shodha.dec.com ( Alan's Home for Wayward Notes File.) (10/16/89)

In article <12923@s.ms.uky.edu>, jeff@ms.uky.edu (Jeff Anderson) writes:
> 
> "                              ....  I get the "Force Error Modifier Set:
> LBN xxxxx" message occasionally, where xxxxx is some number which changes.
> I also get message "pid xxx was killed on swap error".  

	I forget the exact name, but the system management
	doc set has a guide or a section on disk errors.  A
	Forced Error happens when a bad block is replaced
	and the data copied to the replacement isn't correct.
	This should only happen on reads, since a write should
	always be able to find a good block to write to.
	
> 
> These messages are appearing on the console.  I believe that he uses both
> DECwindows and a few X windows programs at the same time.  Could these
> problems be related to that?  

	Could be.  Since disk errors are a hardware problem, get
	the disk fixed and see if the other problems persist.

> 
> So, has anybody seen something like this before?  Is the problem software
> or hardware?  
>
	Hardware. 

	If you can't find your system management docs, here is
	basically what is happening.

	Per a request by a host, a DSA disk controller tries to
	read a block and gets a data error.  What happens next
	depends on the disk controller.

	RQDX*, HSC* and RF disks - The controller invokes its
	bad block replacement procedure.  Generally it will try
	many times to get correct data.  Eventually it will get
	a good copy of the data or give up.  Then it will pick
	a replacement block and write the data there.  If the
	data was good it will return the data to the host and
	(probably independently) tell the host there was an error
	it fixed.  If the data was bad, it returns the incorrect
	data and includes a bit that tells the host the data is
	wrong.  In the header for this block is a copy of that bit
	so that future accesses will know the data is wrong even
	though the new block is ok.

	UDA50, KDA50 and KDB50 - The controller tells the host that
	there was an I/O error and lets the host deal with it.  The
	host will probably want to follow a procedure similar to the
	one the previous set of controllers uses.  In V2.0 and later this
	procedure was added to the driver.  Before V2.0 there was
	a standalone utility to do the correct replacement procedure.
	I don't think BSD had the utility.

	The ways to fix a "forced error" are to rewrite the block,
	preferably with correct data if it's available, or to clear
	the bit with the radisk(8) program.
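
	A rough way to picture the forced-error life cycle described
	above is the toy Python model below.  It is only a sketch: the
	names (Disk, Block, fails_remaining, MAX_RETRIES) are made up
	for illustration and are not part of Ultrix, the driver, or
	radisk(8), and the retry count is an arbitrary assumption.

```python
# Toy model of controller-side bad block replacement with a forced-error bit.
# Every name here is hypothetical; this is not an Ultrix or controller API.
from dataclasses import dataclass

MAX_RETRIES = 8  # assumed value; a real controller picks its own retry policy


@dataclass
class Block:
    data: bytes = b""
    forced_error: bool = False  # header bit: "the data in this block is wrong"
    fails_remaining: int = 0    # how many upcoming raw reads will fail


class Disk:
    def __init__(self, nblocks, nspares):
        self.blocks = {lbn: Block() for lbn in range(nblocks)}
        self.spares = [Block() for _ in range(nspares)]

    def _raw_read(self, blk):
        """One read attempt against the medium; True means good data."""
        if blk.fails_remaining > 0:
            blk.fails_remaining -= 1
            return False
        return True

    def read(self, lbn):
        """Return (data, forced_error) as the host would see it."""
        blk = self.blocks[lbn]
        if self._raw_read(blk):
            # Medium is fine, but the block may still carry a forced-error
            # bit left over from an earlier replacement.
            return blk.data, blk.forced_error
        # Data error: retry until good data is obtained or we give up.
        recovered = any(self._raw_read(blk) for _ in range(MAX_RETRIES))
        # Pick a replacement block and copy the data there. If the data was
        # never read correctly, mark the replacement with the forced-error bit.
        replacement = self.spares.pop()  # raises IndexError if no spares left
        replacement.data = blk.data
        replacement.forced_error = not recovered
        self.blocks[lbn] = replacement
        return replacement.data, replacement.forced_error

    def write(self, lbn, data):
        """Rewriting the block supplies fresh data and clears the bit."""
        blk = self.blocks[lbn]
        blk.data = data
        blk.forced_error = False


if __name__ == "__main__":
    disk = Disk(nblocks=4, nspares=2)
    disk.write(2, b"swap page")
    disk.blocks[2].fails_remaining = 100   # simulate an unreadable block
    print(disk.read(2))                    # replaced, forced error set
    print(disk.read(2))                    # new block is fine, bit still set
    disk.write(2, b"fresh data")
    print(disk.read(2))                    # bit cleared by the rewrite
```

	In this model a process whose swap page sits on the flagged
	block would keep seeing the error on every read until something
	rewrites that block, which matches the "rewrite the block"
	fix above.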
> 
> -- 
> Jeff Anderson				Internet: jeff@engr.uky.edu  


-- 
Alan Rollow				alan@nabeth.enet.dec.com

hurf@batcomputer.tn.cornell.edu (Hurf Sheldon) (10/19/89)

	My experience has been that rd53 drives on rqdx3's get
	'bitrot' every so often and start having errors as you
	mention. Going through whatever correction procedure the
	current version supports has never been successful.
	Once a disk starts, the errors happen more and more frequently.
	The only sure cure we have found is reformatting. If you have
	a vs2000 you can reformat for an rqdx3 without DEC's
	proprietary formatter or, if you have DEC support on your
	system, they can format your drive for you or get you their
	'customer' diag set which has the formatter on it...
	DEC knows this as they refuse to replace an rd53 under
	service without first reformatting the drive and testing
	it. (Hours of system down time when they could just plug
	in a reformatted one and go home and play with it... You are
	stuck with a full restore in any case.) BTW, the service guys
	don't have the freedom to make this decision so it is their
	time that gets hosed, too.
	You can probably sell your controller and drives for more
	than half the cost of a new 320MB Maxtor and Dilog controller...
	
	hurf
-- 
     Hurf Sheldon			 Network: hurf@ionvax.tn.cornell.edu
     Lab of Plasma Studies		  Bitnet: hurf@CRNLION
     369 Upson Hall, Cornell University, Ithaca, N.Y. 14853  ph:607 255 7267
  I got a job in science; I bought a Porsche; Now, everyone takes me seriously.