paul@eye.com (Paul B. Booth) (01/26/91)
Hi all. We've recently been trying to run some LARGE ray-traces on our DN10k. Most of these are batch jobs that write out large numbers of images in sequence, and we tend to run 4 such jobs at a time (1 job/cpu). What's been happening, though, is that these jobs bomb out after some random number of frames have been generated, with the error "disk data check." Re-running the job may work, or it may bomb out after a different number of frames.

Any clue as to what this means? Someone here thought that it might be due to bad spots on the disk. Any ideas how I might determine this? My (limited) understanding of invol is that it will print out the factory-written badspot list of a disk and allow me to add to that list, but I don't know how to find bad spots that might have developed since the disk was shipped from the factory. I've run salvol to no avail, and am basically w/o clue at this point.

Thanks in advance for any help!

--
Paul B. Booth (paul@eye.com) (...!hplabs!hpfcla!eye!paul)
-------------------------------------------------------------------------------
3D/EYE, Inc., 2359 N. Triphammer Rd., Ithaca, NY 14850
voice: (607)257-1381 fax: (607)257-7335
krowitz@RICHTER.MIT.EDU (David Krowitz) (01/28/91)
"disk data check" is an error indicating either a bad spot on the disk, or some other low-level problem which is causing the system to mimic a disk I/O hardware problem (bad power supplies or bad disk controllers can cause these errors to crop up).

Check your system error log -- use the program /systest/ssr_util/lsyserr (list system errors) to print out the error log in human-readable form -- and look for recently dated disk errors. If the disk errors seem to be repeating themselves at a small number of disk addresses (DADDRs, they are called in the listing), then you probably have a new bad spot developing on your disk. In this case, you can shut down your machine and EX INVOL to add these disk addresses to your bad spot list. Then EX SALVOL to remove the blocks from the system, and reboot your machine. Note that "disk block header errors" at "daddr=1" tend to be spurious errors under SR9.7 and possibly under earlier versions of SR10.x.

If the disk errors are scattered about the disk in many locations and are *not* repeating errors, then your problem is probably not with the disk itself, but with either your power supply or your disk controller -- which requires a field service technician to diagnose unless you are up to running diagnostics via the EX DEX command.

If your errors appear to be bad spots (i.e. the addresses show up repeatedly), and if you remove the blocks from the system and *more* blocks appear to go bad, then you may be facing an upcoming HDA (head/disk assembly) failure. HDA failures usually develop over a bit of time, giving you some warning; but they are total and unrecoverable -- the entire disk is frequently chewed up (physically chewed up, with little bits of iron filings floating about inside where your file system used to be). A system which is experiencing more than a few (say less than 6) new bad spots a year should be carefully kept with up-to-date backup tapes on hand.
-- David Krowitz
krowitz@richter.mit.edu (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)
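
The rule of thumb in the post above (repeating addresses suggest a bad spot, scattered one-off addresses suggest a controller or power-supply problem) can be sketched in a few lines of Python. This is only an illustration: the function name `classify_disk_errors`, the `repeat_threshold` of 3, and the `ignore` parameter are assumptions, not part of any Apollo tool, and the sketch expects the DADDR values to have been copied by hand out of the lsyserr listing, since the exact layout of that listing is not given in the thread.

```python
from collections import Counter


def classify_disk_errors(daddrs, repeat_threshold=3, ignore=()):
    """Classify logged disk addresses by whether they repeat.

    daddrs           -- iterable of address strings copied from the error log
    repeat_threshold -- hypothetical cutoff: an address seen at least this
                        many times counts as "repeating"
    ignore           -- addresses to skip, e.g. ones believed to be spurious
    Returns (verdict, sorted list of repeating addresses).
    """
    counts = Counter(d for d in daddrs if d not in ignore)
    repeating = sorted(a for a, n in counts.items() if n >= repeat_threshold)
    if repeating:
        # Same few addresses failing again and again: candidate bad spots.
        return "possible bad spots", repeating
    if counts:
        # Errors exist but no address recurs: look beyond the disk surface.
        return "scattered errors", []
    return "no disk errors", []


if __name__ == "__main__":
    # Made-up addresses, purely for demonstration.
    sample = ["1A2F", "0042", "1A2F", "1A2F", "0777"]
    print(classify_disk_errors(sample))
```

The verdict is only a hint about where to look next; the thread's advice about INVOL, SALVOL, and field service still decides what to actually do.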
system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (01/29/91)
In article <9101281506.AA05903@richter.mit.edu> krowitz@RICHTER.MIT.EDU (David Krowitz) writes:
> A system which is experiencing more than
> a few (say less than 6) new bad spots a year should be carefully kept with
> up-to-date backup tapes on hand.

Since Aug. 6/90, we have had 122 disk errors spread over the 4 760MB disks on our DN10000 (mainly on the striped system disk pair). There aren't that many repeats in the list. The invol/salvol procedure given by DK works just fine (we just did it), and will even attempt to copy deallocated blocks that belong to files/directories to good blocks.

We used to replace disks that were giving lots of errors, but that never helped. Now the local office feels that we are better off with the disks we know, rather than new/refurbished ones which have been given a good shaking during shipping (we have had a lot of DOA disks). As DK says, we do daily backups of all modified files just in case. (BTW, the Workstation Solutions TapeAT version 3.1.3 now works on our DN10000.)

--
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094 Fax: (416) 978-8775
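
To put the figures in this post next to the "new bad spots a year" rule of thumb quoted at its top, the error count can be converted to a yearly rate. The sketch below is rough arithmetic only and rests on several assumptions: the observation window is taken to end on the posting date (01/29/91), the errors are spread evenly over the four disks (the post says they fall mainly on the striped pair), and a logged error is not necessarily a new bad spot. The helper name `errors_per_disk_year` is hypothetical.

```python
from datetime import date


def errors_per_disk_year(n_errors, start, end, n_disks=1):
    """Average logged errors per disk per 365-day year over [start, end]."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end must be after start")
    return n_errors / days * 365 / n_disks


if __name__ == "__main__":
    # 122 errors since Aug. 6/90; window end assumed to be the posting date.
    rate = errors_per_disk_year(122, date(1990, 8, 6), date(1991, 1, 29), n_disks=4)
    print(f"about {rate:.0f} logged errors per disk per year")
```

Under those assumptions the rate is far above a handful per year, which fits the poster's decision to back up all modified files daily.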