[comp.sys.apollo] Bad blocks on DN10000 disk: crash: which files affected?

szabo_p@maths.su.oz.au (Paul Szabo) (07/18/90)

We have a DN10000 with two 697MB disks, striped (on the one controller).
Lately, the disk(s) have developed bad spots. Attempting to access files
stored at these spots occasionally causes the 10000 to crash.

My question is: Is there a way of finding out what file is stored at a
specific place on the disk?

In more detail:

The /systest/ssr_util/lsyserr utility gives messages like:
  12:32:44 am (AEST)  disk error
    Ctrl_# = 0  Unit_# = 0    Phys daddr = 21176 \
      disk operation completed successfully after crc correction \
        (OS/disk manager)
    Above disk chains a multiple-disk group - actual error is on:
    Ctrl_# = 0  Unit_# = 0    Phys daddr RELATIVE to this drive = 108BB

The question is: Is there a way of finding out what file is stored at
that place on the disk? I tried the Apollo Response Center, but have not
had a positive answer yet. I would be very grateful for any insight.

I would like to go to INVOL and add this block to the bad spot list, but
first need to know which file is going to be affected. I do not wish to
re-install the OS (and user files) from tape.

I have tried SALVOL (options -a -s), but the problem is intermittent and
I only got one problem file this way, with the message
The following disk blocks had driver level I/O errors:
    /z/x/root/new_users/template_pm.pmthree/user_data
     vtocx = 211703,  uid = 494B011D.5001A581
Error: status code = 0,  read  error at 10087F (logical), 21174 (physical)

Note that there seems to be a discrepancy between the address 21174
reported by SALVOL, and the address 21176 reported by lsyserr (this is
the closest I could find).

At the suggestion of the Apollo engineer (he hoped that DEX would report
UID's, and then I could use /systest/ssr_util/upath) I also ran EX DEX,
RUN WIN -ENTIRE. The (hex) address 21176 [ = 8 + 14 * (6 + 15 * 645),
since the drive has 1630 Cylinders, 15 Heads, and 14 Sectors] is found
in the report 
Error: (WIN.DEX/Test 170) Read Disk Test, Rev 1.2 Pass 1
Uncorrectable ECC error
Error Code = $23          Controller # = 0          Unit # = 0   
Cylinder # =  645,$285    Head # =  6,$6            Sector # = 8,$8
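
The address arithmetic above can be checked with a short sketch. It is written in Python purely for illustration; the function names are made up, the geometry is the 15 heads and 14 sectors quoted above, and the sector number is assumed to be used exactly as DEX prints it.

```python
# Geometry as quoted in the post: 15 heads, 14 sectors per track.
HEADS = 15
SECTORS_PER_TRACK = 14

def chs_to_daddr(cylinder, head, sector):
    """Linear block address, using the formula given in the post."""
    return sector + SECTORS_PER_TRACK * (head + HEADS * cylinder)

def daddr_to_chs(daddr):
    """Inverse of chs_to_daddr under the same assumed geometry."""
    track, sector = divmod(daddr, SECTORS_PER_TRACK)
    cylinder, head = divmod(track, HEADS)
    return cylinder, head, sector

daddr = chs_to_daddr(645, 6, 8)
print(format(daddr, "X"))      # 21176
print(daddr_to_chs(0x21176))   # (645, 6, 8)
```

The inverse function is the handier direction when starting from an lsyserr address and looking for the matching DEX report.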


Note that the above is just one example of bad blocks; both lsyserr
and DEX complain about a dozen of them.

My last gripe is: when I tried to access the problem directory
.../template_pm.pmthree/user_data, sometimes there were no problems
whatsoever (I suppose these correspond to the lsyserr entry 'completed
after crc correction') but at other times the 10000 simply crashed...

Paul Szabo       szabo_p@maths.su.oz

thompson@PAN.SSEC.HONEYWELL.COM (John Thompson) (07/19/90)

> We have a DN10000 with two 697MB disks, striped (on the one controller).
> Lately, the disk(s) have developed bad spots. Attempting to access files
> stored at these spots occasionally causes the 10000 to crash.
> 
> My question is: Is there a way of finding out what file is stored at a
> specific place on the disk?
> 
> In more detail:
> 
> The /systest/ssr_util/lsyserr utility gives messages like:
>   12:32:44 am (AEST)  disk error
>     Ctrl_# = 0  Unit_# = 0    Phys daddr = 21176 \
>       disk operation completed successfully after crc correction \
>         (OS/disk manager)
>     Above disk chains a multiple-disk group - actual error is on:
>     Ctrl_# = 0  Unit_# = 0    Phys daddr RELATIVE to this drive = 108BB
> 

First -- it's nice to know (sorry, but misery loves company) that other
people have the errors with "completed successfully after crc...."
Our problems no longer crash the node, thank God.

Next -- what's the power-supply configuration you have?  We added additional
controllers and disks, and found the errors cropping up.  Our DN10K was 
purchased early enough that they weren't shipping them with full power
supplies (there are about 10 plug-in modules potentially on the board).
As a general insurance policy, and at their engineer's suggestion, we
added the final (12V) supply block (these are _NOT_ cheap).  We also had
to invol the boot volume, because a disk controller had gone bad and taken
the disk with it (related?).  Since then, we've had very few errors in
the LSYSERR log, and they seem to be the same set (marginal spots on the
disk?).  Because we can't easily take down our DN10K, we live with them, as
long as they're not crashing the node.

> The question is: Is there a way of finding out what file is stored at
> that place on the disk? I tried the Apollo Response Center, but did not
> get a positive answer yet. I would be very grateful for any insight.
> 
> I would like to go to INVOL and add this block to the bad spot list, but
> first need to know which file is going to be affected. I do not wish to
> re-install the OS (and user files) from tape.
> 
> I have tried SALVOL (options -a -s), but the problem is intermittent and
> I only got one problem file this way, with the message
> The following disk blocks had driver level I/O errors:
>     /z/x/root/new_users/template_pm.pmthree/user_data
>      vtocx = 211703,  uid = 494B011D.5001A581
> Error: status code = 0,  read  error at 10087F (logical), 21174 (physical)
> 
> Note that there seems to be a discrepancy between the address 21174
> reported by SALVOL, and the address 21176 reported by lsyserr (this is
> the closest I could find).
> 
> At the suggestion of the Apollo engineer (he hoped that DEX would report
> UID's, and then I could use /systest/ssr_util/upath) I also ran EX DEX,
> RUN WIN -ENTIRE. The (hex) address 21176 [ = 8 + 14 * (6 + 15 * 645),
> since the drive has 1630 Cylinders, 15 Heads, and 14 Sectors] is found
> in the report 
> Error: (WIN.DEX/Test 170) Read Disk Test, Rev 1.2 Pass 1
> Uncorrectable ECC error
> Error Code = $23          Controller # = 0          Unit # = 0   
> Cylinder # =  645,$285    Head # =  6,$6            Sector # = 8,$8
> 
> 
> Note that the above is just one example of bad blocks, both lsyserr
> and DEX complain about a dozen of them.

Finally, on your REAL question.  Yes, there's a way.  In the /systest/ssr_util
directory there's a program called "rwvol".  BE VERY CAREFUL WITH IT!  I have
only attempted (W)riting the disk when it's sufficiently dead as to cause no
additional grief.  (R)eading seems to be ok -- it returns no DATA while the
volume is mounted, but it does return other interesting information:
     $ rwvol
     
     <NASTY WARNING MESSAGE HERE -- PAY ATTENTION TO IT>
     
     Select disk: [w=Winch|s=Storage mod|f=Floppy|q=Quit][ctrl#:][unit#] w
     R or W: r
     Daddr: 1034
     Start: 
     (header at 3190000, buffer at 3190400)
     End: 
     
     uid:    49C57CA8 0000BCF0
     page:   47
     time:   49C57D9A
     type:   0
     chksum: 0
     daddr:  1034
     
     (Note: volume in use; no data returned.)
     
     Done!
     $ upath 49C57CA8.0000BCF0
     /usr/apollo/include/tml.h

Note the nice uid that it returns!  I didn't know about "upath", so you
supplied the half I was missing.
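
With a dozen bad blocks to look up, the uid line can be turned into a upath argument mechanically. The following is only a sketch in Python: the function name is hypothetical, and the layout of the "uid:" line is assumed to match the transcript above (two 8-digit hex words separated by spaces).

```python
import re

def uid_from_rwvol(output):
    """Return the UID as 'HHHHHHHH.LLLLLLLL', or None if no uid line is found.

    Assumes the 'uid:' line looks like the one in the transcript above.
    """
    match = re.search(
        r"^\s*uid:\s*([0-9A-Fa-f]{8})\s+([0-9A-Fa-f]{8})\s*$",
        output,
        re.MULTILINE,
    )
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}".upper()

sample = """\
uid:    49C57CA8 0000BCF0
page:   47
daddr:  1034
"""
print(uid_from_rwvol(sample))  # 49C57CA8.0000BCF0
```

The result is in the dotted form that upath was given in the transcript.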



> My last gripe is: when I tried to access the problem directory
> .../template_pm.pmthree/user_data, sometimes there were no problems
> whatsoever (I suppose these correspond to the lsyserr entry 'completed
> after crc correction') but at other times the 10000 simply crashed...

You're probably right.  I guess I'd hesitate to call it a "gripe" when the
system is able to recover and continue working.    :-)

John Thompson (jt)
Honeywell, SSEC
Plymouth, MN  55441
thompson@pan.ssec.honeywell.com

As ever, my opinions do not necessarily agree with Honeywell's or reality's.
(Honeywell's do not necessarily agree with mine or reality's, either)

rees@dabo.ifs.umich.edu (Jim Rees) (07/19/90)

In article <1990Jul18.043746.712@metro.ucc.su.OZ.AU>,
szabo_p@maths.su.oz.au (Paul Szabo) writes:
  The question is: Is there a way of finding out what file is stored at
  that place on the disk?

Someone else already mentioned rwvol.  Another useful little program is
fixvol.  Use both with extreme caution.

  Note that there seems to be a discrepancy between the address 21174
  reported by SALVOL, and the address 21176 reported by lsyserr (this is
  the closest I could find).

Salvol is reporting an lvol daddr, and lsyserr is reporting a physical
daddr.  There is some cruft on the volume before the start of the lvol,
accounting for the difference here (although I'm not sure why it's 2 and not
1).
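
Taking the two quoted addresses at face value, the offset works out as below. This is a minimal sketch in Python; the names are made up for illustration, and the offset is simply the observed difference between the two reports, not a documented value.

```python
SALVOL_DADDR = 0x21174    # address reported by SALVOL (lvol-relative, per the above)
LSYSERR_DADDR = 0x21176   # address reported by lsyserr (physical)

# Observed gap between the two reports; an inference from one pair of values.
LVOL_OFFSET = LSYSERR_DADDR - SALVOL_DADDR

def lvol_to_physical(lvol_daddr, offset=LVOL_OFFSET):
    """Convert an lvol daddr to a physical daddr using the observed offset."""
    return lvol_daddr + offset

print(LVOL_OFFSET)                              # 2
print(format(lvol_to_physical(0x21174), "X"))   # 21176
```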