[comp.sys.apollo] DSP90/MSD-500 Disk Errors

adam@hyper.lap.upenn.edu (Adam Feigin) (02/14/88)

I've got a DSP90 with an MSD-500 that is exhibiting the following
problem. I'm getting lots of "disk block header errors" at the first
address on the disk. Has anybody out there seen this behavior ?? Is it
a known bug/problem ??? We've had this problem for over a year, and
everything has been replaced, including the MSD-500 itself, but the
problem does not go away. Our local SSE thought that it could be
a grounding problem, but everything checked out okay. The DSP doesn't
crash or anything, but I've got a bad feeling that eventually nasty things
are going to start happening.  An excerpt from the system error log is
included below.

Wednesday, January 13, 1988
   2:12:54 pm (EST)  disk error
      storage module, volx=1, daddr=1: disk block header error (OS/disk manager)
Sunday, January 24, 1988
   12:50:42 pm (EST)  disk error
      storage module, volx=1, daddr=1: disk block header error (OS/disk manager)
Wednesday, February 3, 1988
   2:52:13 pm (EST)  disk error
      storage module, volx=1, daddr=1: disk block header error (OS/disk manager)
   2:54:44 pm (EST)  disk error
      storage module, volx=1, daddr=1: disk block header error (OS/disk manager)
Friday, February 5, 1988
   9:31:36 am (EST)  disk error
      storage module, volx=1, daddr=1: disk block header error (OS/disk manager)
Monday, February 8, 1988
   10:16:45 am (EST)  disk error
      storage module, volx=1, daddr=1: disk block header error (OS/disk manager)

If anybody has a solution to this problem, I'd sure appreciate hearing about 
it !!!!

							Adam

------------------------------------------------------------------------------
ARPAnet: {root,adam}@{hyper,apollo}.lap.upenn.edu          
UUCP: {harvard,decwrl,rutgers,ihnp4}!super.upenn.edu!hyper.lap.upenn.edu!adam

                                                      Adam Feigin
						   Network Administrator
					         Language Analysis Project
					         University of Pennsylvania
 -----------------------------------------------------------------------------

krowitz@mit-richter.UUCP (David Krowitz) (02/16/88)

Hmm ... most of our DN460/600 disk drives, our DN560 MSD-190, and
our DSP80 with two MSD-500's get these errors from time to time.
We have seen no long term problem with this, but if you wanted to
you could back up your disk, shut down the node, run FBS (i.e. EX FBS)
to find the bad spots on the disk (this will take a *long* time),
run INVOL and make certain the bad blocks are marked in the
system list, reformat the disk with INVOL, and restore your file
system. You will spend a lot of time doing this, and I'm not certain
it's worth it, but it might make you feel more secure.


 -- David Krowitz

krowitz@richter.mit.edu   (18.83.0.109)
krowitz%richter@eddie.mit.edu
mit-erl!mit-richter!krowitz@eddie.mit.edu
mit-erl!mit-richter!krowitz@mit-eddie.arpa
krowitz@mit-mc.arpa
(in order of decreasing preference)

johnm@CAEN.ENGIN.UMICH.EDU (John Muckler) (02/16/88)

We're experiencing the same problems at the University of Michigan,
especially on DN3000's & 4000's with 348 meg disks.  A UCR has been
submitted but no response has been received yet.  It is not very
reassuring...  Some assistance would be greatly appreciated!


--------------------------
John E. Muckler
Mgr. Computer Oper.
University of Michigan
College of Engineering/CAEN

rye@CAEN.ENGIN.UMICH.EDU (Ryland S. Marshall) (02/16/88)

    We were having the same problem.  It is a bogus error.  If data 
address 1 was bad you would not be able to read the disk.
Apollo sent us a patch tape that contains a new version of invol.
It seems to have taken care of the problem (so far).  Ask your
sales rep for the version of invol that was last modified 11-30-87.
                                                                
Best of luck,
rye.

adam@hyper.lap.upenn.edu (Adam Feigin) (02/16/88)

Well, I'm starting to suspect that there's something wrong with the
disk controller (either the multibus SMD board or the drive electronics
itself) as the MSD-500 was replaced in June with a BRAND NEW CDC drive,
and yet these errors keep occurring (I'm thankful that Apollo ISN'T
vanilla U**X, as errors on the first disk address would be disastrous !!)
When you get these errors, are they at the first address on the disk, or
are they at various places on the media (We also get errors at assorted
disk addresses, but the errors at the first disk address happen at least
once or twice a week) ??

						Adam

------------------------------------------------------------------------------
ARPAnet: {root,adam}@{hyper,apollo}.lap.upenn.edu          
UUCP: {harvard,decwrl,rutgers,ihnp4}!super.upenn.edu!hyper.lap.upenn.edu!adam

                                                      Adam Feigin
						   Network Administrator
					         Language Analysis Project
					         University of Pennsylvania
 -----------------------------------------------------------------------------

kwongj@caldwr.caldwr.gov (James Kwong) (02/18/88)

In article <3392@super.upenn.edu>, adam@hyper.lap.upenn.edu (Adam Feigin) writes:
> Well, I'm starting to suspect that there's something wrong with the
> disk controller (either the multibus SMD board or the drive electronics
> itself) as the MSD-500 was replaced in June with a BRAND NEW CDC drive,
> and yet these errors keep occurring (I'm thankful that Apollo ISN'T
> vanilla U**X, as errors on the first disk address would be disastrous !!)
> When you get these errors, are they at the first address on the disk, or
> are they at various places on the media (We also get errors at assorted
> disk addresses, but the errors at the first disk address happen at least
> once or twice a week) ??
> 
> 						Adam
> 
> ------------------------------------------------------------------------------


We're having the same problem here with our DSP90 and its Control Data Corp.
500 MB storage module, except that the header errors always occur on
vol. 1 addr 1.

A service rep. came out, moved the label to another spot on the
storage module (ran some kind of fix_vol or something like that), and
ran salvol on the disk, to no avail.

We still get the disk block header error now and then. I noticed that
if I partnered the DSP 90 to a diskless node the frequency of the 
error messages increased from once every several months to once every
few days.

I was told basically the same thing: that the problem was probably caused
by a grounding problem somewhere, that it was common among the DSP 90s,
and that the errors probably occur while the disk is reading rather than
writing, so I should not have to worry too much about the integrity of
the data.

Still, I would feel less paranoid if these errors didn't crop up
now and then. In your case, with the errors occurring at different spots,
your assessment of the cause sounds reasonable. I take it that these
header messages (other than vol. 1 addr 1) also refer to the storage
module, and are not the error messages caused by specifying non-existent
devices on the DSP 90 such as a cartridge drive or floppy drive.

Excerpt from our 'lsyserr':
Friday, December 18, 1987
   3:45:19 pm (PST)  disk error
      storage module, volx=1, daddr=1: disk block header error (OS/disk manager)
Monday, December 21, 1987
   3:14:54 pm (PST)  disk error
      storage module, volx=1, daddr=1: disk block header error (OS/disk manager)
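
To see whether errors cluster at one address or are scattered across the
media, entries like these can be tallied by volume and address. What follows
is only a minimal sketch in Python, assuming the log text has exactly the
layout shown in the excerpts in this thread; the pattern, the function name
tally_disk_errors, and the SAMPLE string are illustrative and not part of
any Apollo tool. Addresses are kept as text because the excerpts do not say
which number base daddr is printed in.

```python
import re
from collections import Counter

# Matches the detail line of an entry, e.g.
#   storage module, volx=1, daddr=1: disk block header error (OS/disk manager)
# The layout is assumed from the excerpts in this thread only.
ENTRY = re.compile(r"volx=(\w+), daddr=(\w+): (.+?) \(")


def tally_disk_errors(log_text):
    """Count log entries per (volx, daddr, message) triple."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ENTRY.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


# Two entries copied from the excerpt above.
SAMPLE = """\
Friday, December 18, 1987
   3:45:19 pm (PST)  disk error
      storage module, volx=1, daddr=1: disk block header error (OS/disk manager)
Monday, December 21, 1987
   3:14:54 pm (PST)  disk error
      storage module, volx=1, daddr=1: disk block header error (OS/disk manager)
"""

if __name__ == "__main__":
    for (volx, daddr, message), n in tally_disk_errors(SAMPLE).most_common():
        print(f"volx={volx} daddr={daddr}: {n} x {message}")
```

Run over a full log, a single dominant (volx, daddr) pair would match the
"always address 1" pattern, while many distinct pairs would match the
"assorted addresses" pattern.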


James Kwong  Calif. Depart. of H2O Resources, Sacramento, CA 95802
ucdavis.edu!caldwr!kwongj (Internet) ...!ucbvax!ucdavis!caldwr!kwongj (UUCP)

"Our program who art in memory, HELLO be thy name.. "
The opinions expressed above are mine, not those of the State of California or the California Department of Water Resources.

rees@apollo.uucp (Jim Rees) (02/19/88)

Note that "disk block header error" usually means that the block was read
OK, but had funky stuff in the block header.  This does not imply a disk
read error, and any attempt to put the block in the badspot list will only
make things worse.

The block header is an extra thing stuck in front of the data that
contains useful stuff like the uid of the object that the data came
from.  In theory this makes it easier to salvage the disk if something
goes wrong.  I think this idea came from Parc, where it was used in
the Alto file systems.
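
The distinction between a read error and a header error, and the salvage
idea, can be illustrated with a small sketch in Python. Everything in it is
hypothetical: the field names, the classify_read and salvage helpers, and
the layout are invented for illustration and are not the actual Apollo
on-disk format. The only detail taken from the description above is that
the header carries the uid of the object the data belongs to.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockHeader:
    object_uid: int   # uid of the owning object (from the description above)
    page_index: int   # hypothetical: position of this block within the object


@dataclass(frozen=True)
class ReadResult:
    media_ok: bool        # hypothetical: True if the controller read the block cleanly
    header: BlockHeader   # header found in front of the data


def classify_read(result, expected):
    """Separate a media failure from a clean read with an unexpected header."""
    if not result.media_ok:
        return "read error"      # the case a badspot list is meant for
    if result.header != expected:
        return "header error"    # block read fine, header contents are wrong
    return "ok"


def salvage(headers_by_daddr):
    """Rebuild an object -> [(page_index, daddr)] map from headers alone."""
    objects = defaultdict(list)
    for daddr, header in headers_by_daddr.items():
        objects[header.object_uid].append((header.page_index, daddr))
    return {uid: sorted(pages) for uid, pages in objects.items()}
```

In this toy model a "header error" says nothing about the media, which is
why marking the block as a bad spot would not address it.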

There are various programs in /systest/ssr_util (rwvol, fixvol) that
can read and display block headers.  But take the warnings seriously --
you can screw up your disk with these if you don't know what you're
doing.