[net.unix-wizards] disk system hangs on a VAX-750 with SI controller.

dem@uwslh.UUCP (David E. Miran) (02/13/86)

We have a problem with our disk system hanging and need help.
We are running a VAX-11/750 with 4.2BSD UNIX.
The disk system is a Systems Industries 9900 controller with a Fujitsu
Eagle disk drive and two CDC 9730-80 disk drives (sealed 67 MB drives).
The controller emulates a massbus device and we use the hp device driver
that came with 4.2BSD.
Occasionally (usually about once a month, but last week as many as five times
per day) the computer will hang as though a task completion interrupt from
the disk had been lost.  Pressing the reset switch on the controller will
cause the system to pick up running as though nothing had happened.
Another site near us also has seen this problem.  We just installed
a new power supply and the latest firmware in the controller, but it did
not help.  SI suggested using their version of the hp driver, but
(a) they did not think it would help and (b) I don't see anything in their
changes that appears to address this problem.
Does anyone know what is going on and what we can do about the problem?
1.  Is it hardware (and if so has anyone identified it well enough for
    SI to fix it)?
2.  Is it software, and if so can someone point me to a fix?
3.  I have considered going into the hp driver and modifying it so that
    there is a timeout on pending disk activity (or maybe use the system
    lightning bolt) so that if we wait over one second for a disk interrupt
    we will consider the disk drive to be hung and issue a reset.
    Has anyone tried this, and if so, would you send me the code?
    Or when it hangs, does the controller become so stuck that a
    software-issued reset is ignored and we have to hit the switch?
Any help on this will be appreciated.
-- 
David E. Miran         ...!{seismo,harvard,topaz,ihnp4}!uwvax!uwslh!dem
Wisconsin State Hygiene Lab          (608) 262-0019
University of Wisconsin
465 Henry Mall
Madison, WI  53706

chris@umcp-cs.UUCP (Chris Torek) (02/14/86)

>I have considered going into the hp driver and modifying it so that
>there is a timeout on pending disk activity [...].  Has anyone tried
>this and will you send me the code if you have.

Look at vaxuba/rk.c (the RK07 driver); it has a watchdog timer.
Sometimes I think uba.c should have a generic watchdog timer.
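
A watchdog of the kind discussed above can be sketched in user space.
This is only an illustration of the general technique, not the code from
rk.c or the hp driver.  The names `struct ctlr`, `ctlr_watch`, `simulate`
and the limit `WATCHDOG_LIMIT` are hypothetical, and the periodic clock
callout is replaced by a plain loop so the logic can be tried outside a
kernel.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical: ticks (seconds) a transfer may stay pending before we
 * declare the interrupt lost.  2 means "pending for over one second". */
#define WATCHDOG_LIMIT 2

/* Hypothetical per-controller state; not the real driver structures. */
struct ctlr {
    int active;   /* nonzero while a transfer awaits its interrupt */
    int wticks;   /* consecutive ticks seen with the transfer pending */
    int resets;   /* times the watchdog has forced a reset */
};

/* Begin a transfer: mark the controller busy and restart the count. */
void ctlr_start(struct ctlr *c)
{
    assert(c != NULL);
    c->active = 1;
    c->wticks = 0;
}

/* Completion interrupt arrived: the transfer finished normally. */
void ctlr_intr(struct ctlr *c)
{
    assert(c != NULL);
    c->active = 0;
    c->wticks = 0;
}

/* Stand-in for a software reset followed by a retry of the transfer. */
void ctlr_reset(struct ctlr *c)
{
    assert(c != NULL);
    c->resets++;
    printf("watchdog: lost interrupt, resetting controller\n");
    ctlr_start(c);
}

/* Called once per tick.  Returns 1 if it had to reset the controller. */
int ctlr_watch(struct ctlr *c)
{
    assert(c != NULL);
    if (!c->active) {           /* idle: nothing to time out */
        c->wticks = 0;
        return 0;
    }
    if (++c->wticks < WATCHDOG_LIMIT)
        return 0;               /* still within the allowed wait */
    ctlr_reset(c);
    return 1;
}

/* Start one transfer and run `ticks` ticks.  The completion interrupt
 * is delivered at tick `intr_at`, or never if intr_at is negative.
 * Returns the number of watchdog resets. */
int simulate(int ticks, int intr_at)
{
    struct ctlr c = {0, 0, 0};
    int t;

    ctlr_start(&c);
    for (t = 1; t <= ticks; t++) {
        if (t == intr_at)
            ctlr_intr(&c);
        ctlr_watch(&c);
    }
    return c.resets;
}
```

Calling `simulate(5, 1)` models a transfer whose interrupt arrives on
time, and `simulate(5, -1)` models a lost interrupt.  In a real driver
the tick would come from the kernel clock, and whether a software reset
is enough to unstick the SI 9900 remains the open question raised in the
original post.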
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1415)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu

leon@mmm.UUCP (Leon Schilmoeller) (02/17/86)

> We have a problem with our disk system hanging and need help.
> We are running a VAX-11/750 with 4.2BSD unix.
> The disk system is a Systems Industries 9900 controller with a Fujitsu
> eagle disk drive and two CDC 9730-80 disk drives (a sealed 67 M drive).
> The controller emulates a massbus device and we use the hp device driver
> that came with 4.2BSD.
> Occasionally (usually about once a month but last week as much as 5 times
> per day) the computer will hang as though a task completion interrupt from
> the disk had been lost.  Pressing the reset switch on the controller will
> cause the system to pick up running as though nothing had happened.

We have experienced the same problem with the SI 9900 on a VAX 11/780.  We
have three Eagles and a 9766 on the one controller.  Using the hp driver from
SI does not make any difference, at least not for us.  It has been
some time since a reset has been needed on my system, but as you have
indicated, there appears to be no real pattern.  Sorry I do not have a
fix, but just wanted to let you know it is not only a 750 problem.

Leon Schilmoeller			ihnp4!mmm!leon
3M					612-736-9653
St. Paul, MN 55144

gdmr@cstvax.UUCP (George D M Ross) (02/19/86)

In article <136@uwslh.UUCP> dem@uwslh.UUCP writes:
>We are running a VAX-11/750 with 4.2BSD unix.
>The disk system is a Systems Industries 9900 controller with a Fujitsu
>eagle disk drive and two CDC 9730-80 disk drives (a sealed 67 M drive).
>Occasionally (usually about once a month but last week as much as 5 times
>per day) the computer will hang as though a task completion interrupt from
>the disk had been lost.  Pressing the reset switch on the controller will
>cause the system to pick up running as though nothing had happened.
>-- 
>David E. Miran         ...!{seismo,harvard,topaz,ihnp4}!uwvax!uwslh!dem

We too have experienced this problem.  Our configuration is a 750 running
4.2BSD with an SI 9900 controller and a couple of Eagles.  We are running the
SI hp driver (you need it if your drives have any bad blocks).  The symptoms
were exactly the same -- occasionally the system would hang until the reset
switch on the 9900 was pressed, at which time the driver would wake up and
things would take off again.

Our controller was replaced (to fix a different problem) but the effect
persisted, so it looks like it's not a controller problem per se.  However,
several months ago we had a big hefty earthing cable connected between the
controller and the 750, since when we haven't seen the effect.  Coincidence,
maybe....?

-- 
George D M Ross, Dept. of Computer Science, Univ. of Edinburgh, Scotland
Phone: +44 31-667 1081 x2730
JANET: gdmr@UK.AC.ed.cstvax  --> ARPA: gdmr@cstvax.ed.AC.UK
UUCP:  <UK>!ukc!cstvax!gdmr

lee@unmvax.UUCP (Lee Ward) (02/23/86)

> The disk system is a Systems Industries 9900 controller with a Fujitsu
> eagle disk drive and two CDC 9730-80 disk drives (a sealed 67 M drive).
> The controller emulates a massbus device and we use the hp device driver
> that came with 4.2BSD.

We had the same problem a couple of years ago, and then the HDA on the
Eagle took a hike somewhere.  After replacement everything went fine.
The symptoms were exactly as you describe.

The same thing happens at unm-la, on a 750 with an Emulex controller and
Eagles.  Their cure: don't ever, ever power the line printer on or off.
It works like magic.

Currently, we are having the problems you describe once again, with one
exception: the reset switch no longer lets the OS continue.  Unix is up
and going through the scheduler; as long as a process doesn't go to the
disk, everything is fine.  We know it is the SI 9900/Eagles.  We think a bad
block of some sort on the Eagle aggravates a problem in the controller and
everything hangs up.  We can reproduce the effect at any time simply by
backing up /usr!  We have had our controller worked on, to no avail.  We have
used two versions of the SI drivers and the Berkeley one (modified to print
out the real address on error) and we get the same behavior.  We are EXTREMELY
sick of the problem and would also appreciate any insights.  Many people
have suggested that we just reformat the /usr partition/disk.  We want the
problem to go away, not to work around it only to find it again later.

-- 
			--Lee (Ward)
			{ucbvax,convex,gatech,pur-ee}!unmvax!lee