eichelbe@nadc.arpa (J. Eichelberger) (07/13/87)
We are running 4.3 BSD UNIX on a VAX 11/780.

We have a problem with 9766-disk related hangs on our VAX. We have a
Systems Industries (SI) 9900 controller connected to three 9766 drives.
We have the controller set in the SI-mode instead of the "act like a
true RM05 from DEC" mode. But we still use the hp driver from the
4.3 BSD.

Every so often we get a system hang. All activity on the SI controller
stops. Hitting the reset on the toggle switch restarts everything. Most
of the time we don't see any error messages. If we do see one, it's

	hp0: not ready

Has anyone seen this type of problem?

Our field service is trying to blame the problem on the hp driver since
dumps of various memory locations (we halt the machine when we hang and
do EXAMINEs of various memory locations per the field service) indicate,
they say, that the driver is not resetting an error state. I want to
know why I get an error in the first place!!! They can't answer that.

Well, anyway, the last register dump we gave them indicated that the
drive status register for drive #1 (counting from 0) had an error. The
register contained 51C0, and from some manual they decoded the value as
follows:

	VV set  = Volume Valid
	DRY set = Drive Ready
	DPR set = Drive Present
	MOL     = Medium Online
	ERR set = Error

and Error Register #1 contained 4000, supposedly meaning

	Bit 14 set = Unsafe (Huh???)

There is all kinds of other stuff available from the dump, but I won't
bother you with it here.

If anyone can help, please drop me a line. Maybe a driver fix can be
made. Maybe field service should analyze the 9766 for problems. I think
the latter is in order. We are using the standard 4.3 BSD hp.c for the
driver.

Thank you.

Jon Eichelberger
eichelbe@NADC.ARPA
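The register decode quoted in the post above can be checked with a few lines of Python. This is only a minimal sketch: the bit positions in the two tables are an assumption, inferred from the five names and the hex values given in the post (51C0 and 4000), and should be confirmed against the drive or controller manual. The names `decode`, `DRIVE_STATUS_BITS` and `ERROR_REG1_BITS` are hypothetical.

```python
# Assumed bit positions for the drive status register; only the five
# flags named in the post are listed, anything else prints as "bitN".
DRIVE_STATUS_BITS = {14: "ERR", 12: "MOL", 8: "DPR", 7: "DRY", 6: "VV"}

# Assumed bit position for error register #1; the post names only bit 14.
ERROR_REG1_BITS = {14: "UNS"}


def decode(value, names):
    """Return the names of the bits set in a 16-bit register, high bit first."""
    return [
        names.get(bit, f"bit{bit}")
        for bit in range(15, -1, -1)
        if (value >> bit) & 1
    ]


print(decode(0x51C0, DRIVE_STATUS_BITS))  # ['ERR', 'MOL', 'DPR', 'DRY', 'VV']
print(decode(0x4000, ERROR_REG1_BITS))    # ['UNS']
```

Under these assumptions 51C0 has exactly the five quoted flags set and nothing else, which agrees with the field service decode.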
jim@cs.strath.ac.uk (Jim Reid) (07/21/87)
In article <8269@brl-adm.ARPA> eichelbe@nadc.arpa (J. Eichelberger) writes:
>We are running 4.3 BSD UNIX on a VAX 11/780.
>
>We have a problem with 9766-disk related hangs on our VAX. We have a
>Systems Industries (SI) 9900 controller connected to three 9766 drives.
>
>Every so often we get a system hang. All activity on the SI controller
>stops. Hitting the reset on the toggle switch restarts everything. Most
>of the time we don't see any error messages........

We had a similar experience two years ago when we installed a 9900 and
a couple of Eagles on our 750. Our service engineer had seen this many
times (at other sites) before, fixed us up and we've been OK since. The
problem was in the controller power supply: as I recall the supply had
to deliver exactly -5.05V.

Please get in touch if you want precise details. I don't have them to
hand right now.

Jim
mangler@cit-vax.Caltech.Edu (System Mangler) (07/24/87)
In article <8269@brl-adm.ARPA>, eichelbe@nadc.arpa (J. Eichelberger) writes:
> Every so often we get a system hang. All activity on the SI controller
> stops. Hitting the reset on the toggle switch restarts everything. Most
> of the time we don't see any error messages. If we do see one, it's
>	hp0: not ready

We get these too. The last time it happened, I checked the registers,
and all 5 drives had just completed a seek (a sign that the massbus
datapath is marked "busy") and they all had the UNS (unsafe) bit set.

cithex.caltech.edu has the same problem. They run VMS.

The SI local office says this is a common problem, typically caused
by marginal power supply output. I've not had the chance to verify
this (there isn't enough downtime on this machine for my purposes).

> We are using the standard 4.3 BSD hp.c for the driver.

I sure hope that you're using error-free packs. The error position
determination algorithm in that driver depends on separate counters
for the number of bytes DMA'd and the number of bytes read/written;
but the SI 9900 uses the first counter for both, so on an error, the
driver thinks that more sectors were written than actually were.

(Before you start flaming about SI: that problem is easy to work
around compared to some of the bugs I've seen in the Emulex SC7000).

Seismo has a version of hp.c that works around this by looking at the
track/sector register (HPDA) instead. It can still lose data silently,
but if retries are a rare event, it can be lived with.

Don Speck   speck@vlsi.caltech.edu   {ll-xn,rutgers,amdahl}!cit-vax!speck
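The workaround mentioned above, judging progress from the track/sector register rather than from the byte counter, comes down to simple address arithmetic. The following is a rough sketch, not the Seismo code: the geometry (32 sectors per track, 19 tracks per cylinder), the register layout (track in the high byte, sector in the low byte) and the idea that the register names the sector in error are all assumptions, and `sector_index` and `sectors_done` are hypothetical names.

```python
NSECT = 32  # sectors per track (assumed RM05/9766 geometry)
NTRAK = 19  # tracks per cylinder (assumed RM05/9766 geometry)


def sector_index(cyl, track, sector):
    """Absolute sector number on the pack for a cylinder/track/sector triple."""
    return (cyl * NTRAK + track) * NSECT + sector


def sectors_done(start_sector, cyl, hpda):
    """Sectors transferred before the error, judged from the disk address.

    Assumes hpda holds the track in its high byte and the sector in its
    low byte, and that it names the sector in error. If the hardware has
    already stepped past that sector, subtract one from the result.
    """
    track = (hpda >> 8) & 0xFF
    sector = hpda & 0xFF
    return sector_index(cyl, track, sector) - start_sector


start = sector_index(10, 3, 0)          # transfer began at cyl 10, track 3
print(sectors_done(start, 10, 0x0305))  # error reported at track 3, sector 5
```

The point of the sketch is only that the result no longer depends on a byte counter the controller may have advanced too far.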
jim@strath-cs.UUCP (07/31/87)
In article <3321@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (System Mangler) writes:
>In article <8269@brl-adm.ARPA>, eichelbe@nadc.arpa (J. Eichelberger) writes:
>> Every so often we get a system hang. All activity on the SI controller
>> stops. Hitting the reset on the toggle switch restarts everything. Most
>> of the time we don't see any error messages. If we do see one, it's
>>	hp0: not ready
>
>The SI local office says this is a common problem, typically caused
>by marginal power supply output. I've not had the chance to verify
>this (there isn't enough downtime on this machine for my purposes).

We had this problem quite often just after we installed a 9900. The
SI engineer tweaked the voltage on the 9900 power supply and the hangs
stopped happening. Eighteen months later, we've had no repetitions.

The engineer told me that the 9900 is susceptible to mains glitches if
the DC voltage isn't *exactly* right. The scenario he explained was:

	"A mains glitch causes the 5.05 V supply to drop for a moment
	and come back again. The momentary loss of power stops the
	controller (it thinks the mains power is lost) but is not long
	enough for the 9900 to re-initialise itself properly. Flicking
	the reset switch on the controller does this and everything
	picks up from where it left off."

[I suppose this may be too optimistic: the transfer in progress at the
time of the hang could be totally screwed if the controller's buffer
gets mangled by the loss of volts and the controller firmware still
considers the buffer contents valid. We didn't see file or filesystem
corruption when we reset the hangs, though.]

I've no idea why having the DC supply exactly right cures this, but
then I know next to nothing about hardware or electronics.

Jim
rbj@icst-cmr.arpa (Root Boy Jim) (07/31/87)
eichelbe@nadc.arpa (J. Eichelberger) writes:
> Every so often we get a system hang. All activity on the SI controller
> stops. Hitting the reset on the toggle switch restarts everything. Most
> of the time we don't see any error messages. If we do see one, it's
>	hp0: not ready

    We get these too. The last time it happened, I checked the
    registers, and all 5 drives had just completed a seek (a sign that
    the massbus datapath is marked "busy") and they all had the UNS
    (unsafe) bit set.

We get these also, but only have two disks: a CDC 9766 (RM05) and an
Eagle. I haven't checked the error bits, but flipping the switch works
for us too. This isn't usually a problem unless I am at home.

    cithex.caltech.edu has the same problem. They run VMS.

Too bad.

    The SI local office says this is a common problem, typically caused
    by marginal power supply output. I've not had the chance to verify
    this (there isn't enough downtime on this machine for my purposes).

Well, the FE's tweaked our voltage, and it seemed to help a bit, but we
still get the error. Perhaps too much current is being drawn and the
voltage drops anyway.

> We are using the standard 4.3 BSD hp.c for the driver.

    I sure hope that you're using error-free packs. The error position
    determination algorithm in that driver depends on separate counters
    for the number of bytes DMA'd and the number of bytes read/written;
    but the SI 9900 uses the first counter for both, so on an error,
    the driver thinks that more sectors were written than actually were.

I hope so too. Either that, or lay out your partition table to avoid
any cylinders (or tracks) with bad sectors on them. Another solution I
might propose is to hack the driver so that it compares the desired
sector with the bad block table *before* it takes the BSE hit instead
of after. Yes, I know, it does take time, and since most transfers are
multi sector, you'd have to break up a `block' that contained a bad
sector into as many as three transfers.

Of course, you could also mark the adjacent sectors bad, in groups of
eight (assuming 4k blocks), but I think the bad144 scheme would map the
sectors backwards within the block. Oy!

    (Before you start flaming about SI: that problem is easy to work
    around compared to some of the bugs I've seen in the Emulex SC7000).

Mangler: please mail me the scoop on Emulex. I also work on a system
that has them.

    Seismo has a version of hp.c that works around this by looking at
    the track/sector register (HPDA) instead. It can still lose data
    silently, but if retries are a rare event, it can be lived with.

For $300 (I think), SI will sell you their version. The 4.2 one seemed
a bit better than the 4.3 one tho. It also has hacks for online
formatting, header verification, and you don't have to reboot when you
add bad blocks.

    Don Speck   speck@vlsi.caltech.edu   {ll-xn,rutgers,amdahl}!cit-vax!speck

	(Root Boy) Jim Cottrell	<rbj@icst-cmr.arpa>
	National Bureau of Standards
	Flamer's Hotline: (301) 975-5688
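The suggestion in the post above, checking the bad block table before the transfer and breaking the request around a bad sector, can be illustrated as follows. This is a sketch of the idea only, not driver code: `split_transfer` is a hypothetical name, the bad block table is assumed to be a plain set of sector numbers, and how a bad sector is redirected to its replacement is left out.

```python
def split_transfer(start, count, bad_sectors):
    """Split the request [start, start + count) around known bad sectors.

    Returns a list of (first_sector, length, is_bad) runs in disk order.
    Runs flagged is_bad are single sectors that would have to be
    redirected to a replacement sector instead of being read in place.
    """
    end = start + count
    runs = []
    run_start = start
    for bad in sorted({s for s in bad_sectors if start <= s < end}):
        if bad > run_start:
            runs.append((run_start, bad - run_start, False))
        runs.append((bad, 1, True))
        run_start = bad + 1
    if run_start < end:
        runs.append((run_start, end - run_start, False))
    return runs


# An eight-sector (4k) block with one bad sector in the middle.
print(split_transfer(100, 8, {103}))
# [(100, 3, False), (103, 1, True), (104, 4, False)]
```

One bad sector in the middle of a block gives the three transfers mentioned in the post; a bad sector at either end of the block gives two.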
ron@topaz.rutgers.edu.UUCP (08/18/87)
Don't forget to hook up those ground cables that SI so nicely provides
you as well.

-Ron
alen@cogen.UUCP (Alen Shapiro) (08/18/87)
In article <654@stracs.cs.strath.ac.uk> jim@cs.strath.ac.uk writes:
>In article <3321@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (System Mangler) writes:
>>In article <8269@brl-adm.ARPA>, eichelbe@nadc.arpa (J. Eichelberger) writes:
>>> Every so often we get a system hang. All activity on the SI controller
>>> stops. Hitting the reset on the toggle switch restarts everything. Most
>>> of the time we don't see any error messages. If we do see one, it's
>>>	hp0: not ready
>>
>>The SI local office says this is a common problem, typically caused
>>by marginal power supply output. I've not had the chance to verify
>>this (there isn't enough downtime on this machine for my purposes).
>
>We had this problem quite often just after we installed a 9900. The
>SI engineer tweaked the voltage on the 9900 power supply and the hangs
>stopped happening. Eighteen months later, we've had no repetitions.
>
.......etc.....

At Turing Institute Glasgow we had 2 Vax 750s and 2 SI9900s. We
frequently (couple of times a month) had the aforementioned problem.
Without fail, hitting the reset button worked. We had our mains
monitored and the hangs corresponded with voltage spikes/dropouts not
severe enough to take down our 750s but bad enough to cause the 9900s
to act like Rip-Van-Winkle (our pdp11/24 frequently didn't survive
though).

I seem to remember SI installed a driver fix that could detect when the
problem occurred and did an off-spindle* seek to reset the drive.

Is anyone at TI or SI listening? colin@tivax do you remember this?

*off-spindle seek = seek till you hit the end-stop

ps Hi Jim

--alen the lisa slayer (it's a long story)
...!seismo!esosun!cogen!alen
jim@cs.strath.ac.uk (Jim Reid) (08/20/87)
In article <331@cogen.UUCP> alen@cogen.UUCP (Alen Shapiro) writes:
>At Turing Institute Glasgow we had 2 Vax 750s and 2 SI9900s. We frequently
>(couple of times a month) had the aforementioned problem. Without fail, hitting
>the reset button worked. We had our mains monitored and the hangs corresponded
>with voltage spikes/dropouts not severe enough to take down our
>750s but bad enough to cause the 9900s to act like Rip-Van-Winkle

This is the crux of the problem. The 9900 power supply is susceptible
to spikes/dropouts in the mains supply. The controller detects the loss
of its DC power and forces the drives to do an "off-spindle" seek. [The
reasoning being that if the power has gone (or is just about to go), it
is better to retract the heads than to risk them parking in the middle
of the disk.] Then when the DC power comes back a few moments later,
the controller can't recover, so someone has to manually flick the
reset switch.

I'll talk to our SI repairpersons and post a detailed explanation soon.

>I seem to remember SI installed a driver fix that could detect when the problem
>occurred and did an off-spindle* seek to reset the drive.

Strange. I've never heard of this "driver fix". It does sound a bit of
a kludge, though I suppose getting the kernel to poke the controller
would probably work if SI couldn't make the power supply work correctly.

>ps Hi Jim

Hi Alen - We'd better stop meeting like this!

Jim
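Nobody in the thread has seen the "driver fix" itself, but the general idea of the kernel noticing a stalled transfer and poking the controller can be sketched as a simple watchdog. Everything here is hypothetical: the class `ControllerWatchdog`, its method names, the timeout value and the `reset` callback are illustrations of the idea, not a description of what SI shipped or of anything in hp.c.

```python
class ControllerWatchdog:
    """Reset the controller when a transfer has been outstanding too long."""

    def __init__(self, timeout, reset):
        self.timeout = timeout  # seconds a transfer may stay outstanding
        self.reset = reset      # callback that pokes or resets the controller
        self.started = None     # start time of the outstanding transfer, if any

    def transfer_started(self, now):
        self.started = now

    def transfer_done(self):
        self.started = None

    def tick(self, now):
        """Call periodically; returns True if a reset was issued."""
        if self.started is not None and now - self.started > self.timeout:
            self.reset()
            self.started = None
            return True
        return False


resets = []
wd = ControllerWatchdog(timeout=5, reset=lambda: resets.append("reset"))
wd.transfer_started(now=0)
wd.tick(now=3)      # still within the timeout, nothing happens
wd.tick(now=6)      # overdue, the reset callback runs
print(resets)       # ['reset']
```

A real driver would also have to decide whether to retry the interrupted transfer after the reset, which the sketch leaves out.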