dem@uwslh.UUCP (David E. Miran) (02/13/86)
We have a problem with our disk system hanging and need help.
We are running a VAX-11/750 with 4.2BSD unix.
The disk system is a Systems Industries 9900 controller with a Fujitsu
eagle disk drive and two CDC 9730-80 disk drives (a sealed 67 M drive).
The controller emulates a massbus device and we use the hp device driver
that came with 4.2BSD.

Occasionally (usually about once a month but last week as much as 5 times
per day) the computer will hang as though a task completion interrupt from
the disk had been lost. Pressing the reset switch on the controller will
cause the system to pick up running as though nothing had happened.
Another site near us also has seen this problem. We just installed
a new power supply and the latest firmware in the controller, but it did
not help. SI suggested using their version of the hp driver, but
(a) did not think it would help and (b) I don't see anything in their changes
that appears to address this problem.

Does anyone know what is going on and what we can do about the problem?
1. Is it hardware (and if so has anyone identified it well enough for
   SI to fix it)?
2. Is it software and if so can someone point me to a fix.
3. I have considered going into the hp driver and modifying it so that
   there is a timeout on pending disk activity (or maybe use the system
   lightning bolt) so that if we wait over one second for a disk interrupt
   we will consider the disk drive to be hung and issue a reset.
   Has anyone tried this and will you send me the code if you have.
   Or when it hangs does the controller become so stuck that a software
   issued reset is ignored and we have to hit the switch.

Any help on this will be appreciated.
-- 
David E. Miran ...!{seismo,harvard,topaz,ihnp4}!uwvax!uwslh!dem
Wisconsin State Hygiene Lab (608) 262-0019
University of Wisconsin
465 Henry Mall
Madison, WI 53706
chris@umcp-cs.UUCP (Chris Torek) (02/14/86)
>I have considered going into the hp driver and modifying it so that
>there is a timeout on pending disk activity [...]. Has anyone tried
>this and will you send me the code if you have.

Look at vaxuba/rk.c (the RK07 driver); it has a watchdog timer.
Sometimes I think uba.c should have a generic watchdog timer.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1415)
UUCP: seismo!umcp-cs!chris
CSNet: chris@umcp-cs ARPA: chris@mimsy.umd.edu
leon@mmm.UUCP (Leon Schilmoeller) (02/17/86)
> We have a problem with our disk system hanging and need help.
> We are running a VAX-11/750 with 4.2BSD unix.
> The disk system is a Systems Industries 9900 controller with a Fujitsu
> eagle disk drive and two CDC 9730-80 disk drives (a sealed 67 M drive).
> The controller emulates a massbus device and we use the hp device driver
> that came with 4.2BSD.
> Occasionally (usually about once a month but last week as much as 5 times
> per day) the computer will hang as though a task completion interrupt from
> the disk had been lost. Pressing the reset switch on the controller will
> cause the system to pick up running as though nothing had happened.
> Another site near us also has seen this problem. We just installed
> a new power supply and the latest firmware in the controller, but it did
> not help. SI suggested using their version of the hp driver, but
> (a) did not think it would help and (b) I don't see anything in their changes
> that appears to address this problem.
> Does anyone know what is going on and what we can do about the problem?
> 1. Is it hardware (and if so has anyone identified it well enough for
> SI to fix it)?
> 2. Is it software and if so can someone point me to a fix.
> 3. I have considered going into the hp driver and modifying it so that
> there is a timeout on pending disk activity (or maybe use the system
> lightning bolt) so that if we wait over one second for a disk interrupt
> we will consider the disk drive to be hung and issue a reset.
> Has anyone tried this and will you send me the code if you have.
> Or when it hangs does the controller become so stuck that a software
> issued reset is ignored and we have to hit the switch.
> Any help on this will be appreciated.
> --
> David E. Miran ...!{seismo,harvard,topaz,ihnp4}!uwvax!uwslh!dem
> Wisconsin State Hygiene Lab (608) 262-0019
> University of Wisconsin
> 465 Henry Mall
> Madison, WI 53706

We have experienced the same problem with the si9900 on a VAX 11/780.
We have 3 eagles and a 9766 on the one controller. Using the hp driver
from SI does not make any difference, at least not for us. It has been
some time since a reset has been needed on my system, but as you have
indicated, there appears to be no real pattern.

Sorry I do not have a fix, but just wanted to let you know it is not
only a 750 problem.

Leon Schilmoeller ihnp4!mmm!leon
3M 612-736-9653
St. Paul, MN 55144
gdmr@cstvax.UUCP (George D M Ross) (02/19/86)
In article <136@uwslh.UUCP> dem@uwslh.UUCP writes:
>We are running a VAX-11/750 with 4.2BSD unix.
>The disk system is a Systems Industries 9900 controller with a Fujitsu
>eagle disk drive and two CDC 9730-80 disk drives (a sealed 67 M drive).
>Occasionally (usually about once a month but last week as much as 5 times
>per day) the computer will hang as though a task completion interrupt from
>the disk had been lost. Pressing the reset switch on the controller will
>cause the system to pick up running as though nothing had happened.
>--
>David E. Miran ...!{seismo,harvard,topaz,ihnp4}!uwvax!uwslh!dem

We too have experienced this problem. Our configuration is a 750 running
4.2BSD with a SI 9900 controller and a couple of Eagles. We are running the
SI hp driver (you need it if your drives have any bad blocks). The symptoms
were exactly the same -- occasionally the system would hang until the reset
switch on the 9900 was pressed, at which time the driver would wake up and
things would take off again.

Our controller was replaced (to fix a different problem) but the effect
persisted, so it looks like it's not a controller problem per se. However,
several months ago we had a big hefty earthing cable connected between the
controller and the 750, since when we haven't seen the effect. Coincidence,
maybe....?
-- 
George D M Ross, Dept. of Computer Science, Univ. of Edinburgh, Scotland
Phone: +44 31-667 1081 x2730
JANET: gdmr@UK.AC.ed.cstvax --> ARPA: gdmr@cstvax.ed.AC.UK
UUCP: <UK>!ukc!cstvax!gdmr
lee@unmvax.UUCP (Lee Ward) (02/23/86)
> The disk system is a Systems Industries 9900 controller with a Fujitsu
> eagle disk drive and two CDC 9730-80 disk drives (a sealed 67 M drive).
> The controller emulates a massbus device and we use the hp device driver
> that came with 4.2BSD.

We had the same problem a couple of years ago and then the HDA on the eagle
took a hike somewhere. After replacement everything went fine. The symptoms
were exactly as you describe.

At unm-la, on a 750 with an EMULEX controller and eagles, it happens to them
too. Their cure: Don't ever, ever power the line printer on/off. It works
like magic.

Currently, we are having the problems you describe once again, with one
exception: the reset switch no longer lets the OS continue. Unix is up and
going through the scheduler. As long as a process doesn't go to the disk
everything is fine. We know it is the SI9900/eagles. We think a bad block of
some sort on the eagle aggravates a problem in the controller and everything
hangs up. We can reproduce the effect at any time, by simply backing up /usr!

We have had our controller worked on to no avail. We have used two versions
of the SI drivers and the Berkeley one (modified to print out the real
address on error) and we get the same behavior. We are EXTREMELY sick of the
problem and would also appreciate any insights.

Many people have suggested that we just reformat the /usr partition/disk.
We want the problem to go away, not just work around it only to find it
again later.
-- 
--Lee (Ward)
{ucbvax,convex,gatech,pur-ee}!unmvax!lee