[comp.sys.apollo] CRC Errors on DN10K disks

system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (08/10/90)

In article <9008101419.AA02633@pan.ssec.honeywell.com> thompson@PAN.SSEC.HONEYWELL.COM (John Thompson) writes:
> ...
>We have also had some difficulties with disk errors that are reported as :
>    8:59:21 pm (CDT)  disk error
>      Winchester  Ctrl_# = 0  Unit_# = 1    Phys daddr = 45C3: disk operation \
>                  completed successfully after crc correction (OS/disk manager)
>      Above disk chains a multiple-disk group - actual error is on:
>      Winchester  Ctrl_# = 1  Unit_# = 1    Phys daddr RELATIVE to this drive...
>This is the same as someone reported earlier.  We have been told that this
>problem MAY be because "HP tests their disks more rigorously, and marginal
>ones aren't installed."  Unfortunately, we have gotten a (smaller) number of
>the same error from the original Apollo disk that was there.  Our guess is
>that the power supply, which has some bricks that were manufactured during
>a dubious time period (failures have been attributed to bricks made during
>this time) is not giving out enough juice to lay down a reliable format on
>the INVOL.  After we get a new supply, we're going to re-format the disks
>and monitor the problem.

Our DN10020 has four 760 MB disks, and we get one of these messages every 2
days on average (though they tend to come in clumps); they are concentrated on
2 of the 4 disks. They have appeared mainly since SR10.2.p, but that may
be a coincidence, since 3 of our 4 disks were replaced not long before
SR10.2.p was installed (we had 3 head crashes in 2 weeks).
I would be very interested to hear if replacing the power supply does
fix this, as our power supply board was replaced recently (it kept
turning off the power, so it may have been acting up when those 3 disks
were invol'ed).
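
A minimal sketch for checking how such messages are spread across drives,
assuming the error log can be dumped as text laid out like the excerpt quoted
above (the real log layout may differ). The names tally_crc_errors and
DRIVE_RE are hypothetical, not part of any Apollo tool. When an entry names
two drives (the head of a multiple-disk group, then the drive with the actual
error), the last drive named is the one counted.

```python
import re
from collections import Counter

# Drive identification as it appears in the quoted excerpt, e.g.
# "Winchester  Ctrl_# = 0  Unit_# = 1". This layout is an assumption.
DRIVE_RE = re.compile(r"Winchester\s+Ctrl_#\s*=\s*(\d+)\s+Unit_#\s*=\s*(\d+)")
# An entry is assumed to start with a timestamp line ending in "disk error".
ENTRY_RE = re.compile(r"\bdisk error\s*$")


def tally_crc_errors(log_text):
    """Count 'disk error' entries per (controller, unit) pair."""
    counts = Counter()
    current = None      # last drive named in the entry being read
    in_entry = False
    for line in log_text.splitlines():
        if ENTRY_RE.search(line):
            # New entry: record the previous one before starting over.
            if in_entry and current is not None:
                counts[current] += 1
            in_entry, current = True, None
            continue
        match = DRIVE_RE.search(line)
        if in_entry and match:
            current = (int(match.group(1)), int(match.group(2)))
    # Record the final entry in the log.
    if in_entry and current is not None:
        counts[current] += 1
    return counts
```

Calling tally_crc_errors(open("errlog.txt").read()) on a text dump (the file
name is only an example) gives a Counter keyed by (Ctrl_#, Unit_#), which
makes it easy to see whether the errors cluster on particular drives.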
-- 
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094                  Fax: (416) 978-8775

thompson@PAN.SSEC.HONEYWELL.COM (John Thompson) (08/28/90)

About a month (?) ago, I posted the first part of this to the net --
> In article <9008101419.AA02633@pan.ssec.honeywell.com> thompson@PAN.SSEC.HONEYWELL.COM (John Thompson) writes:
> > ...
> >We have also had some difficulties with disk errors that are reported as :
> >    8:59:21 pm (CDT)  disk error
> >      Winchester  Ctrl_# = 0  Unit_# = 1    Phys daddr = 45C3: disk operation \
> >                  completed successfully after crc correction (OS/disk manager)
> >      Above disk chains a multiple-disk group - actual error is on:
> >      Winchester  Ctrl_# = 1  Unit_# = 1    Phys daddr RELATIVE to this drive...
> >This is the same as someone reported earlier.  We have been told that this
> >problem MAY be because "HP tests their disks more rigorously, and marginal
> >ones aren't installed."  Unfortunately, we have gotten a (smaller) number of
> >the same error from the original Apollo disk that was there.  Our guess is
> >that the power supply, which has some bricks that were manufactured during
> >a dubious time period (failures have been attributed to bricks made during
> >this time) is not giving out enough juice to lay down a reliable format on
> >the INVOL.  After we get a new supply, we're going to re-format the disks
> >and monitor the problem.
> 
> Our DN10020 has four 760 MB disks, and we get one of these messages every 2
> days on average (though they tend to come in clumps); they are concentrated on
> 2 of the 4 disks. They have appeared mainly since SR10.2.p, but that may
> be a coincidence, since 3 of our 4 disks were replaced not long before
> SR10.2.p was installed (we had 3 head crashes in 2 weeks).
> I would be very interested to hear if replacing the power supply does
> fix this, as our power supply board was replaced recently (it kept
> turning off the power, so it may have been acting up when those 3 disks
> were invol'ed).

Some updated information:  
    Our dn10000's power supply was replaced, and the 2nd volume (2 sector-striped
third-party 760MB disks) was re-INVOLed on August 14.  To fully abuse the disk,
I did a full INVOL (format, read, write) on each disk separately, and then did
another full INVOL of the striped volume.  During this time, it accumulated the
expected assortment of disk errors (data checks, etc.), which were recorded in
the error log.  All daddrs were then entered via option 9 into the badspot list
(and all came up as duplicate entries).  SALVOL was then run, although it did
nothing apparent except reset the salvol-req'd flag.
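
A rough sketch of gathering the daddrs from a text dump of the error log
before entering them in the badspot list. It assumes each entry carries a
"Phys daddr = <hex>" field as in the quoted CRC message; entries of other
kinds (data checks, etc.) may be laid out differently. The names
unique_daddrs and DADDR_RE are hypothetical. Addresses given only as
RELATIVE to a drive are deliberately skipped.

```python
import re

# Hex disk address as shown in the quoted excerpt ("Phys daddr = 45C3").
# The field layout is assumed from that excerpt only.
DADDR_RE = re.compile(r"Phys daddr\s*=\s*([0-9A-Fa-f]+)")


def unique_daddrs(log_text):
    """Return each distinct daddr once, in order of first appearance."""
    seen = set()
    ordered = []
    for match in DADDR_RE.finditer(log_text):
        daddr = match.group(1).upper()   # treat 45c3 and 45C3 as the same
        if daddr not in seen:
            seen.add(daddr)
            ordered.append(daddr)
    return ordered
```

Removing repeats up front only shortens the list to type in; it says nothing
about whether the badspot list already holds a given address.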

    Since that time, we have loaded about 650MB of info on the disk.  It has not
been abused, but the disk has been reasonably heavily used for the last 2 weeks.

    No disk errors have occurred on the mounted volume.  A small number of
disk errors (about the expected number) have been recorded on the BOOT volume.
That disk pair has not been re-INVOLed yet.  It will be as soon as we can afford
to take it down and reload the O/S (i.e. 10.3.p if I'm lucky).

Conclusions?  Well, it's too soon to say for sure, but it _appears_ that our
disk errors were caused by a marginal power supply not being able to correctly /
completely / reliably lay down the disk format.  Earlier we had power-supply
bricks added, and a controller replaced, without this improvement.  Should 
people with dn10000s contact service personnel?  I'd say yes.  Even if it
wasn't causing the disk problems, HP/Apollo stated that the power supply 
boards have potential problems if manufactured in a certain period.  Unfortunately,
I don't know when that period was, and I wouldn't suggest ripping apart your 10K anyway.


Happy hacking!

John Thompson (jt)
Honeywell, SSEC
Plymouth, MN  55441
thompson@pan.ssec.honeywell.com

As ever, my opinions do not necessarily agree with Honeywell's or reality's.
(Honeywell's do not necessarily agree with mine or reality's, either)