[comp.sys.apollo] disk data check on DN10K

paul@eye.com (Paul B. Booth) (01/26/91)

Hi all.  We've recently been trying to run some LARGE ray-traces on our DN10k.  
Most of these are batch jobs that write out large numbers of images in
sequence, and we tend to run 4 such jobs at a time (1 job/cpu).  What's been
happening though, is that these jobs will bomb out after some random number of
frames are generated with the error: "disk data check."  Re-running the job
may work, or it may bomb out after a different number of frames.  Any clue as
to what this means?  Someone here thought that it might be due to bad spots
on the disk.  Any ideas how I might determine this?  My (limited) understanding
of invol tells me that invol will print out the factory-written badspot list
of a disk and allow me to add to that list, but I don't know how to find
bad-spots that might have developed since the disk was shipped from the
factory.  I've run salvol to no avail, and am basically w/o clue at this point.

Thanks in advance for any help!
--
Paul B. Booth  (paul@eye.com) (...!hplabs!hpfcla!eye!paul)
-------------------------------------------------------------------------------
3D/EYE, Inc., 2359 N. Triphammer Rd., Ithaca, NY  14850    voice: (607)257-1381
                                                             fax: (607)257-7335

krowitz@RICHTER.MIT.EDU (David Krowitz) (01/28/91)

"disk data check" is an error indicating either a bad spot on the disk, or
some other low-level problem which is causing the system to mimic a disk
I/O hardware problem (bad power supplies or bad disk controllers can cause
these errors to crop up). Check your system error log -- use the program
/systest/ssr_util/lsyserr (list system errors) to print out the error log
in human readable form -- and look for recently dated disk errors. If the
disk errors seem to be repeating themselves at a small number of disk 
addresses (DADDR's they are called in the listing) then you probably have
a new bad spot developing on your disk. In this case, you can shut down
your machine and EX INVOL to add these disk addresses to your bad spot
list. Then EX SALVOL to remove the blocks from the system and reboot your
machine. Note that "disk block header errors" at "daddr=1"
tend to be spurious errors under SR9.7 and possibly under earlier versions
of SR10.x. If the disk errors are scattered about the disk in many locations
and are *not* repeating errors, then your problem is probably not with the
disk itself, but with either your power supply or your disk controller --
which requires a field service technician to diagnose unless you are up
to running diagnostics via the EX DEX command. If your errors appear to be
bad spots (i.e. the addresses show up repeatedly), and if you remove the
blocks from the system and *more* blocks appear to go bad, then you may be
facing an upcoming HDA (head/disk assembly) failure. HDA failures usually
develop over a bit of time, giving you some warning; but they are total
and unrecoverable -- the entire disk is frequently chewed up (physically
chewed up, with little bits of iron filings floating about inside where
your file system used to be). A system which is experiencing more than
a few (say less than 6) new bad spots a year should be carefully kept with
up-to-date backup tapes on hand.
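
The triage rule above (repeating addresses suggest a bad spot; scattered,
non-repeating ones point at the power supply or controller) could be sketched
roughly as follows. This is only an illustration in Python, not an Apollo
tool. It assumes the lsyserr listing has been saved as text and that each disk
error carries a field of the form "daddr=<hex number>"; the exact listing
format and the number base are assumptions, and the names
classify_disk_errors, repeat_threshold and ignore are hypothetical.

```python
import re
from collections import Counter

# Assumed field format: "daddr=<hex digits>". The real lsyserr layout may
# differ, so adjust this pattern to match the actual listing.
DADDR_RE = re.compile(r"daddr\s*=\s*([0-9a-f]+)", re.IGNORECASE)


def classify_disk_errors(log_text, repeat_threshold=2, ignore=frozenset({1})):
    """Split logged disk addresses into repeating and non-repeating ones.

    Addresses in `ignore` are skipped; daddr=1 is ignored by default because
    errors there are described above as tending to be spurious.
    """
    counts = Counter()
    for match in DADDR_RE.finditer(log_text):
        daddr = int(match.group(1), 16)  # assumption: addresses are in hex
        if daddr in ignore:
            continue
        counts[daddr] += 1

    # Addresses seen at least `repeat_threshold` times: bad-spot candidates.
    repeating = {d: n for d, n in counts.items() if n >= repeat_threshold}
    # Addresses seen fewer times: scattered, one-off errors.
    scattered = sorted(d for d, n in counts.items() if n < repeat_threshold)
    return repeating, scattered


# Made-up sample text, for illustration only (not real lsyserr output).
sample = (
    "disk data check daddr=1A2B\n"
    "disk block header error daddr=1\n"
    "disk data check daddr=1a2b\n"
    "disk data check daddr=00FF\n"
)
repeating, scattered = classify_disk_errors(sample)
print("repeating:", {hex(d): n for d, n in repeating.items()})
print("scattered:", [hex(d) for d in scattered])
```

Addresses reported as repeating would be the candidates to add to the bad spot
list with INVOL; a long scattered list with few repeats would instead argue
for having the power supply and controller checked.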


 -- David Krowitz

krowitz@richter.mit.edu   (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)

system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (01/29/91)

In article <9101281506.AA05903@richter.mit.edu> krowitz@RICHTER.MIT.EDU (David Krowitz) writes:
>                              A system which is experiencing more than
>a few (say less than 6) new bad spots a year should be carefully kept with
>up-to-date backup tapes on hand.

Since Aug. 6/90, we have had 122 disk errors spread over the four 760MB
disks on our DN10000 (mainly on the striped system disk pair). There
aren't that many repeats in the list. The invol/salvol procedure given
by DK works just fine (we just did it), and will even attempt to copy
deallocated blocks that belong to files/directories to good blocks.
We used to replace disks that were giving lots of errors, but that has
never helped in the past. Now the local office feels that we are better
off with the disks we know, rather than new/refurbished ones which have been
given a good shaking during shipping (we have had a lot of DOA disks).
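
For a rough sense of scale, the figures above can be annualized. The short
Python sketch below assumes the counting window runs from Aug. 6/90 to the
date of this posting (01/29/91), which is an assumption, and the helper name
annualized_rate is hypothetical. Note that these are logged disk errors, not
confirmed bad spots, so the result is not directly comparable to a
bad-spots-per-year figure.

```python
from datetime import date


def annualized_rate(count, start, end):
    """Scale an event count observed between two dates to a 365-day year."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end date must be after start date")
    return count * 365.0 / days


# 122 errors since Aug. 6/90; the end date is assumed to be the posting date.
rate = annualized_rate(122, date(1990, 8, 6), date(1991, 1, 29))
print(f"about {rate:.0f} logged errors per year across the four disks")
print(f"about {rate / 4:.0f} per disk per year, if spread evenly")
```

The even split across disks is only a simplification, since most of the errors
were on the striped system disk pair.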

As DK says, we do daily backups of all modified files just in case.
(BTW, the Workstation Solutions TapeAT version 3.1.3 now works on our
DN10000.)
-- 
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094                  Fax: (416) 978-8775