szabo_p@maths.su.oz.au (Paul Szabo) (07/18/90)
We have a DN10000 with two 697MB disks, striped (on the one controller). Lately, the disk(s) have developed bad spots. Attempting to access files stored at these spots occasionally causes the 10000 to crash.

My question is: Is there a way of finding out what file is stored at a specific place on the disk?

In more detail:

The /systest/ssr_util/lsyserr utility gives messages like:

    12:32:44 am (AEST) disk error
    Ctrl_# = 0 Unit_# = 0 Phys daddr = 21176 \
    disk operation completed successfully after crc correction \
    (OS/disk manager)
    Above disk chains a multiple-disk group - actual error is on:
    Ctrl_# = 0 Unit_# = 0 Phys daddr RELATIVE to this drive = 108BB

The question is: Is there a way of finding out what file is stored at that place on the disk? I tried the Apollo Response Center, but did not get a positive answer yet. I would be very grateful for any insight.

I would like to go to INVOL and add this block to the bad spot list, but first need to know which file is going to be affected. I do not wish to re-install the OS (and user files) from tape.

I have tried SALVOL (options -a -s), but the problem is intermittent and I only got one problem file this way, with the message

    The following disk blocks had driver level I/O errors:
    /z/x/root/new_users/template_pm.pmthree/user_data
    vtocx = 211703, uid = 494B011D.5001A581
    Error: status code = 0, read error at 10087F (logical), 21174 (physical)

Note that there seems to be a discrepancy between the address 21174 reported by SALVOL, and the address 21176 reported by lsyserr (this is the closest I could find).

At the suggestion of the Apollo engineer (he hoped that DEX would report UID's, and then I could use /systest/ssr_util/upath) I also ran EX DEX, RUN WIN -ENTIRE.
The (hex) address 21176 [ = 8 + 14 * (6 + 15 * 645), since the drive has 1630 Cylinders, 15 Heads, and 14 Sectors] is found in the report

    Error: (WIN.DEX/Test 170) Read Disk Test, Rev 1.2 Pass 1
    Uncorrectable ECC error
    Error Code = $23 Controller # = 0 Unit # = 0
    Cylinder # = 645,$285 Head # = 6,$6 Sector # = 8,$8

Note that the above is just one example of bad blocks; both lsyserr and DEX complain about a dozen of them.

My last gripe is: when I tried to access the problem directory .../template_pm.pmthree/user_data, sometimes there were no problems whatsoever (I suppose these correspond to the lsyserr entry 'completed after crc correction') but at other times the 10000 simply crashed...

Paul Szabo
szabo_p@maths.su.oz
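The address arithmetic in the post above can be checked with a short sketch. It assumes only the geometry quoted in the post (15 heads, 14 sectors per track) and the formula given there; the function names `chs_to_daddr` and `daddr_to_chs` are made up for illustration and are not part of any Apollo utility. The last line merely observes that the drive-relative address 108BB is exactly half of 21176; whether that reflects the striping layout is a guess, not something the thread confirms.

```python
# Geometry quoted in the post: 15 heads per cylinder, 14 sectors per track.
HEADS = 15
SECTORS = 14


def chs_to_daddr(cylinder, head, sector):
    """Linear block address, using the formula from the post."""
    return sector + SECTORS * (head + HEADS * cylinder)


def daddr_to_chs(daddr):
    """Inverse of chs_to_daddr: split an address into (cylinder, head, sector)."""
    track, sector = divmod(daddr, SECTORS)
    cylinder, head = divmod(track, HEADS)
    return cylinder, head, sector


# Cylinder 645, head 6, sector 8 as reported by DEX.
print(hex(chs_to_daddr(645, 6, 8)))   # 0x21176
print(daddr_to_chs(0x21176))          # (645, 6, 8)

# Observation only: the drive-relative address is half the group address.
print(2 * 0x108BB == 0x21176)         # True
```

The inverse function is handy for turning each of the dozen or so addresses in the lsyserr log into a cylinder/head/sector triple that can be compared against the DEX report.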
thompson@PAN.SSEC.HONEYWELL.COM (John Thompson) (07/19/90)
> We have a DN10000 with two 697MB disks, striped (on the one controller).
> Lately, the disk(s) have developed bad spots. Attempting to access files
> stored at these spots occasionally causes the 10000 to crash.
>
> My question is: Is there a way of finding out what file is stored at a
> specific place on the disk?
>
> In more detail:
>
> The /systest/ssr_util/lsyserr utility gives messages like:
> 12:32:44 am (AEST) disk error
> Ctrl_# = 0 Unit_# = 0 Phys daddr = 21176 \
> disk operation completed successfully after crc correction \
> (OS/disk manager)
> Above disk chains a multiple-disk group - actual error is on:
> Ctrl_# = 0 Unit_# = 0 Phys daddr RELATIVE to this drive = 108BB
>

First -- it's nice to know (sorry, but misery loves company) that other people have the errors with "completed successfully after crc...." Our problems no longer crash the node, thank God.

Next -- what's the power-supply configuration you have? We added additional controllers and disks, and found the errors cropping up. Our DN10K was purchased early enough that they weren't shipping them with full power supplies (there are about 10 plug-in modules potentially on the board). As a general insurance policy, and at their engineer's suggestion, we added the final (12V) supply block (these are _NOT_ cheap). We also had to invol the boot volume, because a disk controller had gone bad, and taken the disk with it (related?). Since then, we've had very few errors in the LSYSERR log, and they seem to be the same set (marginal spots on the disk?). Because we can't easily take down our DN10K, we live with them, as long as they're not crashing the node.

> The question is: Is there a way of finding out what file is stored at
> that place on the disk? I tried the Apollo Response Center, but did not
> get a positive answer yet. I would be very grateful for any insight.
>
> I would like to go to INVOL and add this block to the bad spot list, but
> first need to know which file is going to be affected.
> I do not wish to
> re-install the OS (and user files) from tape.
>
> I have tried SALVOL (options -a -s), but the problem is intermittent and
> I only got one problem file this way, with the message
> The following disk blocks had driver level I/O errors:
> /z/x/root/new_users/template_pm.pmthree/user_data
> vtocx = 211703, uid = 494B011D.5001A581
> Error: status code = 0, read error at 10087F (logical), 21174 (physical)
>
> Note that there seems to be a discrepancy between the address 21174
> reported by SALVOL, and the address 21176 reported by lsyserr (this is
> the closest I could find).
>
> At the suggestion of the Apollo engineer (he hoped that DEX would report
> UID's, and then I could use /systest/ssr_util/upath) I also ran EX DEX,
> RUN WIN -ENTIRE. The (hex) address 21176 [ = 8 + 14 * (6 + 15 * 645),
> since the drive has 1630 Cylinders, 15 Heads, and 14 Sectors] is found
> in the report
> Error: (WIN.DEX/Test 170) Read Disk Test, Rev 1.2 Pass 1
> Uncorrectable ECC error
> Error Code = $23 Controller # = 0 Unit # = 0
> Cylinder # = 645,$285 Head # = 6,$6 Sector # = 8,$8
>
>
> Note that the above is just one example of bad blocks, both lsyserr
> and DEX complain about a dozen of them.

Finally, on your REAL question. Yes, there's a way. In the /systest/ssr_util directory there's a program called "rwvol". BE VERY CAREFUL WITH IT! I have only attempted (W)riting the disk when it's sufficiently dead as to cause no additional grief. (R)eading it seems to be ok -- it doesn't get DATA unless the volume isn't mounted, but it does get other interesting information --

    $ rwvol
    <NASTY WARNING MESSAGE HERE -- PAY ATTENTION TO IT>
    Select disk: [w=Winch|s=Storage mod|f=Floppy|q=Quit][ctrl#:][unit#] w
    R or W: r
    Daddr: 1034
    Start: (header at 3190000, buffer at 3190400) End:
    uid: 49C57CA8 0000BCF0 page: 47 time: 49C57D9A type: 0 chksum: 0 daddr: 1034
    (Note: volume in use; no data returned.)
    Done!
    $ upath 49C57CA8.0000BCF0
    /usr/apollo/include/tml.h

Note the nice uid that rwvol returns! I didn't know about "upath", so you gave me the half that I was missing.

> My last gripe is: when I tried to access the problem directory
> .../template_pm.pmthree/user_data, sometimes there were no problems
> whatsoever (I suppose these correspond to the lsyserr entry 'completed
> after crc correction') but at other times the 10000 simply crashed...

You're probably right. I guess I'd hesitate to call it a "gripe" when the system is able to recover and continue working. :-)

John Thompson (jt)
Honeywell, SSEC
Plymouth, MN 55441
thompson@pan.ssec.honeywell.com

As ever, my opinions do not necessarily agree with Honeywell's or reality's.
(Honeywell's do not necessarily agree with mine or reality's, either)
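The hand-off from rwvol to upath in the post above comes down to rewriting the UID: rwvol prints it as two hex words separated by a space, while the upath example takes them joined by a dot. A minimal sketch of that step follows, assuming the header line looks like the one in the transcript; the helper name `uid_for_upath` is hypothetical, and the snippet only handles text that has already been captured, it does not run rwvol or upath.

```python
import re


def uid_for_upath(rwvol_output):
    """Return the UID from an rwvol header dump as 'HIGH.LOW', or None."""
    # Assumed layout, taken from the transcript: "uid: 49C57CA8 0000BCF0 ..."
    match = re.search(r"uid:\s*([0-9A-Fa-f]{8})\s+([0-9A-Fa-f]{8})", rwvol_output)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


sample = "uid: 49C57CA8 0000BCF0 page: 47 time: 49C57D9A type: 0 chksum: 0 daddr: 1034"
print(uid_for_upath(sample))  # 49C57CA8.0000BCF0
```

With a dozen bad addresses to look up, collecting the converted UIDs in one list before calling upath keeps the number of rwvol sessions, and the chance of a slip at the R or W prompt, to a minimum.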
rees@dabo.ifs.umich.edu (Jim Rees) (07/19/90)
In article <1990Jul18.043746.712@metro.ucc.su.OZ.AU>, szabo_p@maths.su.oz.au (Paul Szabo) writes:

    The question is: Is there a way of finding out what file is stored at that place on the disk?

Someone else already mentioned rwvol. Another useful little program is fixvol. Use both with extreme caution.

    Note that there seems to be a discrepancy between the address 21174 reported by SALVOL, and the address 21176 reported by lsyserr (this is the closest I could find).

Salvol is reporting an lvol daddr, and lsyserr is reporting a physical daddr. There is some cruft on the volume before the start of the lvol, accounting for the difference here (although I'm not sure why it's 2 and not 1).
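The size of the discrepancy discussed above can be worked out directly from the two hex addresses. The sketch below also decodes both with the cylinder/head/sector formula from the original post (15 heads, 14 sectors). Applying that formula to the SALVOL address is an assumption: if SALVOL is reporting an lvol daddr, as suggested above, the decoded position only shows where the block would sit if the two address spaces lined up. The function name `daddr_to_chs` is illustrative only.

```python
# Geometry from the original post; assumed to apply to both addresses.
HEADS = 15
SECTORS = 14


def daddr_to_chs(daddr):
    """Split a linear block address into (cylinder, head, sector)."""
    track, sector = divmod(daddr, SECTORS)
    cylinder, head = divmod(track, HEADS)
    return cylinder, head, sector


lsyserr_daddr = 0x21176  # reported by lsyserr
salvol_daddr = 0x21174   # reported by SALVOL as "physical"

print(lsyserr_daddr - salvol_daddr)  # 2
print(daddr_to_chs(lsyserr_daddr))   # (645, 6, 8)
print(daddr_to_chs(salvol_daddr))    # (645, 6, 6)
```

Under that assumption the two addresses fall on the same cylinder and head, two sectors apart, which is consistent with both tools pointing at the same damaged region of the disk.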