mdm@cocktrice.uucp (Mike Mitchell) (01/24/91)
I am having unexplained system crashes where the machine hangs while
unattended. I am running ESIX 3.2 Rev D and leave the machine on 100% of
the time. I will return after leaving the machine for a few hours, and it
will be hung. The drive light is on, indicating that some sort of disk
activity was happening at or about the time of the crash.

During system installation, I selected the Fast File System for all
partitions created. Could this be the culprit, or do I need to chase a
different problem? Upon opening the case, there does not seem to be a heat
problem, so I am not sure what may be causing my difficulties.

Suggestions? Thanks for your time and effort.

--
Mike Mitchell                  Email: mdm@cocktrice.uucp
2020 Calle Lorca #43           Phone: (505) 471-7639 H
Santa Fe, New Mexico 87505            (505) 473-4482 W
larry@nstar.rn.com (Larry Snyder) (01/24/91)
mdm@cocktrice.uucp (Mike Mitchell) writes:

>I am having unexplained system crashes where the machine will hang
>while unattended. I am running ESIX 3.2 Rev D and leave the machine
>on 100% of the time. I will return after leaving the machine
>for a few hours, and it will be hung. The drive light is on indicating
>that some sort of disk activity was happening on or about the time
>of the system crash.

What type of controller? I did have this problem on one of our machines
last year (running 386/ix) with a WD 1006-SRV2 (1:1 16 bit RLL controller),
and replacing the controller (actually returning it as DOA) solved the
problem.

--
Larry Snyder, NSTAR Public Access Unix 219-289-0282 (HST/PEP/V.32/v.42bis)
regional UUCP mapping coordinator
{larry@nstar.rn.com, ..!uunet!nstar!larry, larry%nstar@iuvax.cs.indiana.edu}
bill@bilver.uucp (Bill Vermillion) (01/25/91)
In article <1991Jan24.143542.19808@nstar.rn.com> larry@nstar.rn.com (Larry Snyder) writes:
>mdm@cocktrice.uucp (Mike Mitchell) writes:

>>I am having unexplained system crashes where the machine will hang
>>while unattended. I am running ESIX 3.2 Rev D and leave the machine
>>on 100% of the time. I will return after leaving the machine
>>for a few hours, and it will be hung. The drive light is on indicating
>>that some sort of disk activity was happening on or about the time
>>of the system crash.

>what type of controller? I did have this problem on
>one of our machines last year (running 386/ix) with
>a WD 1006-SRV2 (1:1 16 bit RLL controller) and replacing
>the controller (actually returning it as DOA) solved
>the problem..

Larry's suggestion may be correct. There is NO problem with ESIX.

I am running a node with a full feed incoming and outgoing. It runs just
fine. This past week, after a news blockage, it moved about 12,000 messages
in a short time.

I am running the fast file system on the / and on the news partition and a
51k file system on a third partition.

The ONLY problem I have had occurred this AM. A power failure did something
so that on a reboot it panicked and rebooted, over and over.

Put in the distribution disk 1, then 2 - it asked if I wanted a quick
recovery. Said y, it then said reboot. It saved the old inittab, the old
passwd and the old shadow in addition to the old kernel, which was probably
corrupt.

At the boot I copied over a previously saved copy of a good kernel.

Total time - less than 10 minutes. I have had some other systems that
weren't that easy to recover.

I suspect you have a controller problem. Running a WD1007 w Maxtor ESDI
here.

--
Bill Vermillion - UUCP: uunet!tarpit!bilver!bill
                      : bill@bilver.UUCP
fangchin@elaine44.stanford.edu (Chin Fang) (01/27/91)
In article <1991Jan25.155639.388@bilver.uucp> bill@bilver.uucp (Bill Vermillion) writes:
> [..some stuff deleted]
>I am running the fast file system on the / and on the news partition and a
>51k file system on a third partition.
>

I use the BSD FFS on all my partitions (btw, long file names are relatively
safe for /usr and /use2. NOT for /)

>The ONLY problem I have had occurred this AM. Power failure did something
>so that on a reboot it panicked and rebooted, over and over.
>

Humm... I have encountered this too. When the ESIX incarnation of BSD FFS
is used, this tends to happen. Back when I was using rev. B, rebooting
without a shutdown never caused any problem like panicking. But rev. B does
not have FFS. Any relation here?

>Put in the distribution disk 1, then 2 - it asked if I wanted a quick
>recovery. Said y, it then said reboot. It saved the old inittab, the old
>passwd and the old shadow in addition to the old kernel, that was probably
>corrupt.
>
>At the boot I copied over a previously saved copy of a good kernel.
>

Yes, I have done *almost* the same. In all the cases that I have
encountered, the kernel was NOT corrupted! I could always reuse the old
kernel. What I did was just mv *.SAV to * (well, global renaming implied
here) and unix.SAV to unix, and then reboot. It has worked so far, but I
have never understood why the OS got into trouble in the first place. Any
illumination would be appreciated.

>Total time - less than 10 minutes. I have had some other systems that
>weren't that easy to recover.
>

I believe that. It is always relatively painless. (Except the first time!)

>I suspect you have a controller problem. Running a WD1007 w Maxtor ESDI
>here.
>

I use WD1007SVH w Miniscribe 3130E

Chin Fang
Mechanical Engineering Department
Stanford University
fangchin@portia.stanford.edu
davidsen@sixhub.UUCP (Wm E. Davidsen Jr) (01/30/91)
In article <1991Jan24.143542.19808@nstar.rn.com> larry@nstar.rn.com (Larry Snyder) writes:

| what type of controller? I did have this problem on
| one of our machines last year (running 386/ix) with
| a WD 1006-SRV2 (1:1 16 bit RLL controller) and replacing
| the controller (actually returning it as DOA) solved
| the problem..

  This sometimes happens on a WD 1006 or 1007 with multiple drives. My
understanding is that an i/o is started on one drive while a seek is
started on the other. If they both finish at the same time, a single
interrupt is issued and the driver has to check the controller status to
get both conditions.

  I was told that ISC said it was a hardware problem and SCO just
issued the fix for the driver (xnx133).

  I have never seen the problem on a machine with a single drive, nor a
case where the SCO driver change failed to fix the problem (five 1007s,
two 1006s).

--
bill davidsen - davidsen@sixhub.uucp (uunet!crdgw1!sixhub!davidsen)
    sysop *IX BBS and Public Access UNIX
    moderator of comp.binaries.ibm.pc and 80386 mailing list
"Stupidity, like virtue, is its own reward" -me
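The race described above can be illustrated with a small simulation. This is only a sketch of the general idea (one interrupt, two pending conditions), not the ISC or SCO driver code: the status bit names and values, the `struct ctlr` layout, and all three functions are hypothetical and do not reflect the real WD1006/1007 register set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status bits; the real controller registers are not
   described in the thread. */
#define ST_IO_DONE   0x01u  /* data transfer finished on one drive */
#define ST_SEEK_DONE 0x02u  /* seek finished on the other drive    */

struct ctlr {
    unsigned status;   /* conditions latched by the simulated controller */
    int io_handled;    /* number of I/O completions serviced             */
    int seek_handled;  /* number of seek completions serviced            */
};

/* Naive handler: assumes each interrupt reports exactly one condition. */
static void naive_intr(struct ctlr *c)
{
    assert(c != NULL);
    if (c->status & ST_IO_DONE) {
        c->status &= ~ST_IO_DONE;
        c->io_handled++;
    } else if (c->status & ST_SEEK_DONE) {
        c->status &= ~ST_SEEK_DONE;
        c->seek_handled++;
    }
}

/* Careful handler: services every condition present in the status. */
static void careful_intr(struct ctlr *c)
{
    assert(c != NULL);
    if (c->status & ST_IO_DONE) {
        c->status &= ~ST_IO_DONE;
        c->io_handled++;
    }
    if (c->status & ST_SEEK_DONE) {
        c->status &= ~ST_SEEK_DONE;
        c->seek_handled++;
    }
}

/* True if a completion is still waiting after the handler returned,
   i.e. a request that would never be finished. */
static bool stuck(const struct ctlr *c)
{
    return c->status != 0;
}
```

If both bits are set when the single interrupt arrives, `naive_intr` leaves the seek completion unserviced and no further interrupt comes to pick it up, which would look like a hang with the drive light on. `careful_intr` clears both.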
kdenning@pcserver2.naitc.com (Karl Denninger) (03/05/91)
In article <3021@sixhub.UUCP> davidsen@sixhub.UUCP (bill davidsen) writes:
>In article <1991Jan24.143542.19808@nstar.rn.com> larry@nstar.rn.com (Larry Snyder) writes:
>
>| what type of controller? I did have this problem on
>| one of our machines last year (running 386/ix) with
>| a WD 1006-SRV2 (1:1 16 bit RLL controller) and replacing
>| the controller (actually returning it as DOA) solved
>| the problem..
>
>  This sometimes happens with two drives on a WD 1006 or 1007 with
>multiple drives. My understanding is that an i/o is started on one
>drive while a seek is started on the other. If they both finish at the
>same time a single interrupt is issued and the driver has to check the
>controller status to get both conditions.
>
>  I was told that ISC said it was a hardware problem and SCO just
>issued the fix for the driver (xnx133).

Turning on the flag "CCAP_NOSEEK" in the HPDD driver config file will fix
this problem with ISC machines.

I have seen it too. It is caused by a little-used (for DOS) capability in
the WD series of controllers: the ability to do some operations on two
drives at once. Specifically, you can seek one drive and, while it is
moving the heads, do an I/O operation on the other.

The CCAP patch to the space.c file has fixed the problem every time I have
encountered it on ISC 2.x.

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com

"The most dangerous command on any computer is the carriage return."
Disclaimer: The opinions here are solely mine and may or may not reflect
those of the company.
rcbarn@rw7.urc.tue.nl (Raymond Nijssen) (03/06/91)
kdenning@pcserver2.naitc.com (Karl Denninger) writes:

>In article <3021@sixhub.UUCP> davidsen@sixhub.UUCP (bill davidsen) writes:
>>In article <1991Jan24.143542.19808@nstar.rn.com> larry@nstar.rn.com (Larry Snyder) writes:
>>
>>| what type of controller? I did have this problem on
>>| one of our machines last year (running 386/ix) with
>>| a WD 1006-SRV2 (1:1 16 bit RLL controller) and replacing
>>| the controller (actually returning it as DOA) solved
>>| the problem..
>>
>>  This sometimes happens with two drives on a WD 1006 or 1007 with
>>multiple drives. My understanding is that an i/o is started on one
>>drive while a seek is started on the other. If they both finish at the
>>same time a single interrupt is issued and the driver has to check the
>>controller status to get both conditions.
>
>Turning on the flag "CCAP_NOSEEK" in the HPDD driver config file will fix
>this problem with ISC machines.

More precisely: to turn this feature off, you have to

1) edit the file /etc/conf/pack.d/dsk/space.c and change the
   'disk_config_tbl' entry for either a primary or secondary AT hard
   disk, which looks like:

       (CCAP_RETRY | CCAP_ERRCOR), /* capabilities */

   to

       (CCAP_RETRY | CCAP_ERRCOR | CCAP_NOSEEK), /* capabilities */

2) rebuild the kernel

3) reboot

(Thanks to Doug Pintar at ISC)

A better alternative is to replace the controller board with one without
this bug. There is no _documented_ way for you to tell whether a WD1006
adapter is buggy or not. However, when I encountered this problem some time
ago, I posted an inquiry in this newsgroup in which I asked people to tell
me (a) if they had seen it too, and (b) the text on top of the WD1006 chip.

*Everybody* who had seen this problem told me it said 'PROTO' on the chip;
everybody who did not have a chip with 'PROTO' on it told me they had never
had any problem.

So have a close look at your adapter, and tell me if the magic word is
there.

-Raymond
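The space.c change above amounts to OR-ing one more bit into a capabilities mask. The sketch below shows that pattern in isolation. The flag names come from the thread, but the numeric values are placeholders (the real definitions live in the ISC driver headers and may differ), and `overlapped_seek_allowed` is a hypothetical helper that assumes CCAP_NOSEEK means "do not overlap a seek on one drive with I/O on the other", as the posts describe.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder values, NOT the ones from the real ISC headers. */
#define CCAP_RETRY  0x1u  /* retry failed operations                 */
#define CCAP_ERRCOR 0x2u  /* use error correction                    */
#define CCAP_NOSEEK 0x4u  /* assumed: suppress overlapped seeks      */

/* The capabilities field as shipped, and after the edit in step 1. */
static const unsigned caps_before = CCAP_RETRY | CCAP_ERRCOR;
static const unsigned caps_after  = CCAP_RETRY | CCAP_ERRCOR | CCAP_NOSEEK;

/* Hypothetical check a driver might make before starting a seek on one
   drive while the other drive is busy with a transfer. */
static bool overlapped_seek_allowed(unsigned caps)
{
    return (caps & CCAP_NOSEEK) == 0;
}
```

With `caps_before` the check allows overlapped seeks; with `caps_after` it does not, while the retry and error-correction bits stay set.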