[comp.unix.wizards] SI 9900 hangs

eichelbe@nadc.arpa (J. Eichelberger) (07/13/87)

We are running 4.3 BSD UNIX on a VAX 11/780.

We have a problem with 9766-disk related hangs on our VAX.  We have a
Systems Industries (SI) 9900 controller connected to three 9766 drives.
We have the controller set in the SI-mode instead of the "act like a true
RM05 from DEC" mode.  But we still use the hp driver from 4.3 BSD.

Every so often we get a system hang.  All activity on the SI controller
stops.  Hitting the reset on the toggle switch restarts everything.  Most
of the time we don't see any error messages.  If we do see one, it's
hp0: not ready

Has anyone seen this type of problem?  Our field service is trying to
blame the problem on the hp driver since dumps of various memory
locations (we halt the machine when we hang and do EXAMINEs of various
memory locations per the field service) indicate, they say, that the
driver is not resetting an error state.  I want to know why I get an
error in the first place!!!  They can't answer that.

Well, anyway, the last register dump we gave them indicated that the
drive status register for drive #1 (counting from 0) had an error.
The register contained 51C0, and from some manual they decoded the
value as follows:
VV set=Volume Valid
DRY set=Drive Ready
DPR set=Drive Present
MOL set=Medium Online
ERR set=Error

and

Error Register #1 contained 4000, supposedly meaning

Bit 14 set=Unsafe (Huh???)
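
For reference, the decode above can be reproduced mechanically.  What
follows is a minimal sketch in C, not code from hp.c.  The bit
positions are an assumption based on the usual RM-style drive status
layout (ERR bit 14, MOL bit 12, DPR bit 8, DRY bit 7, VV bit 6) and
should be checked against hpreg.h and the drive manual; decode_ds and
er1_unsafe are hypothetical helper names.

```c
#include <stdio.h>

/* One named bit in a drive register. */
struct bitname {
    unsigned mask;
    const char *name;
};

/* Drive status bits named in the post.  Positions are assumed from
 * the RM-style layout; verify against hpreg.h before relying on them. */
static const struct bitname ds_bits[] = {
    { 0x4000, "ERR" },  /* bit 14: error */
    { 0x1000, "MOL" },  /* bit 12: medium online */
    { 0x0100, "DPR" },  /* bit 8:  drive present */
    { 0x0080, "DRY" },  /* bit 7:  drive ready */
    { 0x0040, "VV"  },  /* bit 6:  volume valid */
};

#define ER1_UNS 0x4000  /* error register 1, bit 14: unsafe */

/* Write the names of the set bits into buf, space separated.
 * Returns whatever bits of value have no name in the table. */
unsigned decode_ds(unsigned value, char *buf, size_t buflen)
{
    size_t i, used = 0;

    if (buflen > 0)
        buf[0] = '\0';
    for (i = 0; i < sizeof(ds_bits) / sizeof(ds_bits[0]); i++) {
        if ((value & ds_bits[i].mask) == 0)
            continue;
        if (used < buflen) {
            int n = snprintf(buf + used, buflen - used, "%s%s",
                             used ? " " : "", ds_bits[i].name);
            if (n > 0)
                used += (size_t)n;
        }
        value &= ~ds_bits[i].mask;
    }
    return value;
}

/* Nonzero if error register 1 reports the unsafe condition. */
int er1_unsafe(unsigned er1)
{
    return (er1 & ER1_UNS) != 0;
}
```

Calling decode_ds(0x51C0, buf, sizeof buf) names exactly the five bits
listed above and leaves no bits unaccounted for, which at least
confirms that the field service decode is self-consistent.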

There are all kinds of other things available from the dump, but I won't
bother you with it here.

If anyone can help, please drop me a line.  Maybe a driver fix can be
made.  Maybe field service should analyze the 9766 for problems.  I
think the latter is in order.  We are using the standard 4.3 BSD hp.c
for the driver.

Thank you.

            Jon Eichelberger
            eichelbe@NADC.ARPA

jim@cs.strath.ac.uk (Jim Reid) (07/21/87)

In article <8269@brl-adm.ARPA> eichelbe@nadc.arpa (J. Eichelberger) writes:
>We are running 4.3 BSD UNIX on a VAX 11/780.
>
>We have a problem with 9766-disk related hangs on our VAX.  We have a
>Systems Industries (SI) 9900 controller connected to three 9766 drives.
>
>Every so often we get a system hang.  All activity on the SI controller
>stops.  Hitting the reset on the toggle switch restarts everything.  Most
>of the time we don't see any error messages........

We had a similar experience two years ago when we installed a 9900 and a
couple of Eagles on our 750. Our service engineer had seen this many
times (at other sites) before, fixed us up and we've been OK since. The
problem was in the controller power supply: as I recall the supply had
to deliver exactly -5.05V.

Please get in touch if you want precise details. I don't have them to
hand right now.

		Jim

mangler@cit-vax.Caltech.Edu (System Mangler) (07/24/87)

In article <8269@brl-adm.ARPA>, eichelbe@nadc.arpa (J. Eichelberger) writes:
> Every so often we get a system hang.  All activity on the SI controller
> stops.  Hitting the reset on the toggle switch restarts everything.  Most
> of the time we don't see any error messages.  If we do see one, it's
> hp0: not ready

We get these too.  The last time it happened, I checked the registers,
and all 5 drives had just completed a seek (a sign that the massbus
datapath is marked "busy") and they all had the UNS (unsafe) bit set.

cithex.caltech.edu has the same problem.  They run VMS.

The SI local office says this is a common problem, typically caused
by marginal power supply output.  I've not had the chance to verify
this (there isn't enough downtime on this machine for my purposes).

>   We are using the standard 4.3 BSD hp.c for the driver.

I sure hope that you're using error-free packs.  The error position
determination algorithm in that driver depends on separate counters
for the number of bytes DMA'd and the number of bytes read/written;
but the SI 9900 uses the first counter for both, so on an error, the
driver thinks that more sectors were written than actually were.

(Before you start flaming about SI:  that problem is easy to work
around compared to some of the bugs I've seen in the Emulex SC7000).

Seismo has a version of hp.c that works around this by looking at
the track/sector register (HPDA) instead.  It can still lose data
silently, but if retries are a rare event, it can be lived with.
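
To illustrate the idea only (this is not the Seismo code), here is a
minimal sketch in C of working out how far a transfer got from the
disk address rather than from a byte counter.  It assumes RM05/9766
geometry of 32 sectors per track and 19 tracks per cylinder, an
HPDA-style value with the track in the high byte and the sector in
the low byte, and that the registers address the sector at which the
transfer stopped.  All three are assumptions to check against the
hardware; da_to_sector and sectors_done are hypothetical names.

```c
#define NSECT 32   /* sectors per track (assumed RM05/9766 geometry) */
#define NTRAK 19   /* tracks per cylinder (same assumption) */

/* Absolute sector number for a cylinder plus an HPDA-style value,
 * taking the track from the high byte and the sector from the low byte. */
long da_to_sector(unsigned cyl, unsigned da)
{
    unsigned trk = (da >> 8) & 0xff;
    unsigned sec = da & 0xff;

    return ((long)cyl * NTRAK + trk) * NSECT + sec;
}

/* Sectors completed before the transfer stopped, assuming cyl and da
 * address the sector at which it stopped.  Never returns a negative
 * count, so a stale register cannot make the transfer run backwards. */
long sectors_done(long start_sector, unsigned cyl, unsigned da)
{
    long stop = da_to_sector(cyl, da);

    return stop > start_sector ? stop - start_sector : 0;
}
```

A transfer that started at sector 600 and stopped with cylinder 1,
track 0, sector 2 in the registers would be credited with 10 sectors,
whatever the byte counter claims.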

Don Speck   speck@vlsi.caltech.edu  {ll-xn,rutgers,amdahl}!cit-vax!speck

jim@strath-cs.UUCP (07/31/87)

In article <3321@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (System Mangler) writes:
>In article <8269@brl-adm.ARPA>, eichelbe@nadc.arpa (J. Eichelberger) writes:
>> Every so often we get a system hang.  All activity on the SI controller
>> stops.  Hitting the reset on the toggle switch restarts everything.  Most
>> of the time we don't see any error messages.  If we do see one, it's
>> hp0: not ready
>
>The SI local office says this is a common problem, typically caused
>by marginal power supply output.  I've not had the chance to verify
>this (there isn't enough downtime on this machine for my purposes).

We had this problem quite often just after we installed a 9900. The
SI engineer tweaked the voltage on the 9900 power supply and the hangs
stopped happening. Eighteen months later, we've had no repetitions.

The engineer told me that the 9900 is susceptible to mains glitches
if the DC voltage isn't *exactly* right. The scenario he explained was:
"A mains glitch causes the 5.05 V supply to drop for a moment and come
back again. The momentary loss of power stops the controller (it thinks
the mains power is lost) but is not long enough for the 9900 to
re-initialise itself properly. Flicking the reset switch on the controller
does this and everything picks up from where it left off."

[I suppose this may be too optimistic: the transfer in progress at the
time of the hang could be totally screwed if the controller's buffer gets
mangled by the loss of volts and the controller firmware still considers
the buffer contents valid. We didn't see file or filesystem corruption
when we reset the hangs, though.]

I've no idea why having the DC supply exactly right cures this, but then
I know next to nothing about hardware or electronics.

		Jim

rbj@icst-cmr.arpa (Root Boy Jim) (07/31/87)

   eichelbe@nadc.arpa (J. Eichelberger) writes:
   > Every so often we get a system hang. All activity on the SI controller
   > stops.  Hitting the reset on the toggle switch restarts everything.  Most
   > of the time we don't see any error messages. If we do see one, it's
   > hp0: not ready

   We get these too.  The last time it happened, I checked the registers,
   and all 5 drives had just completed a seek (a sign that the massbus
   datapath is marked "busy") and they all had the UNS (unsafe) bit set.

We get these also, but only have two disks: a CDC 9766 (RM05) and an Eagle.
I haven't checked the error bits, but flipping the switch works for us too.
This isn't usually a problem unless I am at home.

   cithex.caltech.edu has the same problem.  They run VMS.

Too bad.

   The SI local office says this is a common problem, typically caused
   by marginal power supply output.  I've not had the chance to verify
   this (there isn't enough downtime on this machine for my purposes).

Well, the FEs tweaked our voltage, and it seemed to help a bit, but
we still get the error. Perhaps too much current is being drawn and the
voltage drops anyway.

   >   We are using the standard 4.3 BSD hp.c for the driver.

   I sure hope that you're using error-free packs.  The error position
   determination algorithm in that driver depends on separate counters
   for the number of bytes DMA'd and the number of bytes read/written;
   but the SI 9900 uses the first counter for both, so on an error, the
   driver thinks that more sectors were written than actually were.

I hope so too. Either that, or lay out your partition table to avoid
any cylinders (or tracks) with bad sectors on them.

Another solution I might propose is to hack the driver so that it
compares the desired sector with the bad block table *before* it
takes the BSE hit instead of after. Yes, I know, it does take time, and
since most transfers are multi-sector, you'd have to break up a `block'
that contained a bad sector into as many as three transfers. Of course,
you could also mark the adjacent sectors bad, in groups of eight (assuming
4k blocks), but I think the bad144 scheme would map the sectors backwards
within the block. Oy!
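
The "as many as three transfers" split can be sketched in C as below.
This is only an illustration of the arithmetic, not a driver patch:
struct xfer and split_around_bad are hypothetical names, sectors are
plain absolute sector numbers, and the lookup in the bad block table
and the redirection of the bad sector to its replacement are left out.

```c
/* One contiguous piece of a transfer. */
struct xfer {
    long start;   /* first sector of the piece */
    long count;   /* number of sectors in the piece */
};

/* Split the transfer [start, start + count) around sector bad.
 * Fills out[] with up to three pieces and returns how many:
 * the sectors before the bad one, the bad sector by itself (to be
 * redirected to its replacement), and the sectors after it.
 * If bad lies outside the transfer, the transfer comes back whole. */
int split_around_bad(long start, long count, long bad, struct xfer out[3])
{
    long end = start + count;   /* one past the last sector */
    int n = 0;

    if (count <= 0)
        return 0;
    if (bad < start || bad >= end) {
        out[n].start = start;
        out[n].count = count;
        return 1;
    }
    if (bad > start) {          /* good sectors before the bad one */
        out[n].start = start;
        out[n].count = bad - start;
        n++;
    }
    out[n].start = bad;         /* the bad sector on its own */
    out[n].count = 1;
    n++;
    if (bad + 1 < end) {        /* good sectors after the bad one */
        out[n].start = bad + 1;
        out[n].count = end - (bad + 1);
        n++;
    }
    return n;
}
```

An 8-sector block starting at sector 100 with sector 103 bad becomes
three pieces: 100-102, 103 alone, and 104-107.  A bad sector at either
end of the block needs only two.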

   (Before you start flaming about SI:  that problem is easy to work
   around compared to some of the bugs I've seen in the Emulex SC7000).

Mangler: please mail me the scoop on Emulex. I also work on a system
that has them.

   Seismo has a version of hp.c that works around this by looking at
   the track/sector register (HPDA) instead.  It can still lose data
   silently, but if retries are a rare event, it can be lived with.

For $300 (I think), SI will sell you their version. The 4.2 one seemed
a bit better than the 4.3 one tho. It also has hacks for online formatting,
header verification, and you don't have to reboot when you add bad blocks.

   Don Speck   speck@vlsi.caltech.edu  {ll-xn,rutgers,amdahl}!cit-vax!speck

	(Root Boy) Jim Cottrell	<rbj@icst-cmr.arpa>
	National Bureau of Standards
	Flamer's Hotline: (301) 975-5688

ron@topaz.rutgers.edu.UUCP (08/18/87)

Don't forget to hook up those ground cables that SI so nicely provides
you as well.

-Ron

alen@cogen.UUCP (Alen Shapiro) (08/18/87)

In article <654@stracs.cs.strath.ac.uk> jim@cs.strath.ac.uk writes:
>In article <3321@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (System Mangler) writes:
>>In article <8269@brl-adm.ARPA>, eichelbe@nadc.arpa (J. Eichelberger) writes:
>>> Every so often we get a system hang.  All activity on the SI controller
>>> stops.  Hitting the reset on the toggle switch restarts everything.  Most
>>> of the time we don't see any error messages.  If we do see one, it's
>>> hp0: not ready
>>
>>The SI local office says this is a common problem, typically caused
>>by marginal power supply output.  I've not had the chance to verify
>>this (there isn't enough downtime on this machine for my purposes).
>
>We had this problem quite often just after we installed a 9900. The
>SI engineer tweaked the voltage on the 9900 power supply and the hangs
>stopped happening. Eighteen months later, we've had no repetitions.
> .......etc.....

At Turing Institute Glasgow we had 2 Vax 750s and 2 SI9900s. We frequently
(couple of times a month) had the aforementioned problem. Without fail, hitting
the reset button worked. We had our mains monitored and the hangs corresponded
with voltage spikes/dropouts not severe enough to take down our
750s but bad enough to cause the 9900s to act like Rip-Van-Winkle
(our pdp11/24 frequently didn't survive though).

I seem to remember SI installed a driver fix that could detect when the problem
occurred and did an off-spindle* seek to reset the drive.

Is anyone at TI or SI listening? colin@tivax do you remember this?

*off-spindle seek = seek till you hit the end-stop

ps Hi Jim

--alen the lisa slayer (it's a long story)
	...!seismo!esosun!cogen!alen

jim@cs.strath.ac.uk (Jim Reid) (08/20/87)

In article <331@cogen.UUCP> alen@cogen.UUCP (Alen Shapiro) writes:
>At Turing Institute Glasgow we had 2 Vax 750s and 2 SI9900s. We frequently
>(couple of times a month) had the aforementioned problem. Without fail, hitting
>the reset button worked. We had our mains monitored and the hangs corresponded
>with voltage spikes/dropouts not severe enough to take down our
>750s but bad enough to cause the 9900s to act like Rip-Van-Winkle

This is the crux of the problem. The 9900 power supply is susceptible to
spikes/dropouts in the mains supply. The controller detects the loss of
its DC power and forces the drives to do an "off-spindle" seek. [The
reasoning being that if the power has gone (or is just about to go), it
is better to retract the heads than to risk them parking in the middle
of the disk.] Then when the DC power comes back a few moments later, the
controller can't recover, so someone has to flick the reset switch
manually.

I'll talk to our SI repairpersons and post a detailed explanation soon.

>I seem to remember SI installed a driver fix that could detect when the problem
>occurred and did an off-spindle* seek to reset the drive.

Strange. I've never heard of this "driver fix". It does sound a bit of a
kludge, though I suppose getting the kernel to poke the controller would
probably work if SI couldn't make the power supply work correctly.

>ps Hi Jim

Hi Alen - We'd better stop meeting like this!

		Jim