[comp.os.vms] Uncorrectable ECC errors on an RA60 - What are they? Worry?

cameron@runx.ips.oz (James Cameron) (04/06/88)

I'm wondering what to do about errors reported by VAX/VMS on my RA60 disk.

The rate of errors is increasing. Two weeks ago I would see one or two errors
in one week. Now I see about three per day. I am concerned.

According to ANALYZE/ERROR_LOG and VAXsim, the errors are ECC errors, which
trigger an attempted bad-block replacement operation. The three types that
I have seen are:

      Uncorrectable ECC error,
      Four symbol ECC error, and
      Five symbol ECC error.

The bad block replacement operation reports 'BLOCK VERIFIED GOOD' on each
error. I guess this means that the block that caused the error is still
considered OK for use. Can anyone confirm this? (Presumably BLOCK VERIFIED
BAD would then mean that the block is confirmed bad and has been re-vectored.)

As for the cause and correction of the problem - I shall leave that up to
DEC Field Service - it's their job. What I'd like to ask all of you is this:

	Is an 'UNCORRECTABLE ECC ERROR' an error that causes the operation to
	fail? (i.e. some user somewhere gets a stack dump).

If so, how can I determine:

	a) What user (VMS process) encountered the problem? (The error log
	   retains the full process ID - how do I translate this into the
	   process ID used by ACCOUNTING and SHOW SYSTEM?)

	b) In what file, and also what virtual block number, did the
	   error occur? The only information in the error log is the logical
	   block number on the disk - how can I translate this LBN into a
	   file-name (or id) and a VBN?

	   I'd rather not dump the header of every file I'm worried about;
	   there are rather a lot of them. Has anyone needed to do this
	   before - did they write a program to scan INDEXF.SYS?

	c) On what surface of the disk (head number?) did the error occur?
	   If many errors are occurring on one head could this indicate that
	   the head is damaged? (maybe leave this one to Field Service too,
	   but the point is that the error log doesn't tell me where the
	   logical block is - maybe this is just impossible in DSA)
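
For question (b), the translation itself is simple once a file's mapping is
known. What follows is a minimal sketch in Python, not a tested tool. It
assumes the retrieval pointers (starting LBN and block count for each extent)
have already been pulled out of the file headers by some other means, for
example from DUMP/HEADER output. The file names and extent values are made up
for illustration. The only fact it relies on is that VBNs start at 1 and run
in order through a file's extents.

```python
from typing import Dict, List, Optional, Tuple

# One extent = (starting LBN, number of blocks). Assumed to have been
# extracted from the file header's retrieval pointers beforehand.
Extent = Tuple[int, int]


def lbn_to_vbn(extents: List[Extent], lbn: int) -> Optional[int]:
    """Return the 1-based VBN that maps to lbn, or None if not in this file."""
    vbn = 1  # VBN of the first block of the current extent
    for start_lbn, count in extents:
        if start_lbn <= lbn < start_lbn + count:
            return vbn + (lbn - start_lbn)
        vbn += count
    return None


def find_file(files: Dict[str, List[Extent]], lbn: int) -> Optional[Tuple[str, int]]:
    """Search every file's extents; return (file name, VBN) or None."""
    for name, extents in files.items():
        vbn = lbn_to_vbn(extents, lbn)
        if vbn is not None:
            return name, vbn
    return None


# Hypothetical mapping data, for illustration only.
EXAMPLE_FILES: Dict[str, List[Extent]] = {
    "[USER]REPORT.DAT;1": [(1000, 10), (5000, 4)],
    "[USER]NOTES.TXT;3": [(2000, 3)],
}

if __name__ == "__main__":
    print(find_file(EXAMPLE_FILES, 5002))
```

An LBN that belongs to no file (free space, or a block not covered by the
extracted headers) gives None, so an error in an unallocated block would show
up as "no match" rather than as a wrong file.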

If anyone can offer any suggestions, I will listen. Please reply by mail -
if there is sufficient interest I will summarise later.

Some other questions...

	Does anyone know the password that Field Service should use to
	change the VAXsim_Monitor process error rate margins? F.S. down-
	under doesn't seem to know - I've asked.

	Is there any way to determine how many replacement blocks are 
	available on a DSA disk? (RA60/RA81...) Or, how many blocks have
	been re-vectored?

	What is the difference between a four and five symbol ECC error?

Thank you for taking the time to read this.

James Cameron, Kilpatrick Green Pty. Ltd., P.O. Box N366, Sydney 2000 Australia
Internet: cameron@runx.ips.oz.au  UUCP: uunet!runx.ips.oz.au!cameron