[comp.unix.sysv386] ESIX File System Selection

mdm@cocktrice.uucp (Mike Mitchell) (01/24/91)

I am having unexplained system crashes where the machine will hang
while unattended. I am running ESIX 3.2 Rev D and leave the machine
on 100% of the time. I will return after leaving the machine
for a few hours, and it will be hung. The drive light is on, indicating
that some sort of disk activity was happening at or about the time
of the system crash.

During system installation, I selected the Fast File System for all
partitions created. Could this be the culprit, or do I need to
chase a different problem? Upon opening the case, there does not seem
to be a heat problem, so I am not sure what may be causing my
difficulties.

Suggestions? Thanks for your time and effort.

-- 
Mike Mitchell		                          Email: mdm@cocktrice.uucp
2020 Calle Lorca #43	                          Phone: (505) 471-7639 H
Santa Fe, New Mexico 87505	                         (505) 473-4482 W

larry@nstar.rn.com (Larry Snyder) (01/24/91)

mdm@cocktrice.uucp (Mike Mitchell) writes:

>I am having unexplained system crashes where the machine will hang
>while unattended. I am running ESIX 3.2 Rev D and leave the machine
>on 100% of the time. I will return after leaving the machine
>for a few hours, and it will be hung. The drive light is on, indicating
>that some sort of disk activity was happening at or about the time
>of the system crash.

what type of controller?  I did have this problem on
one of our machines last year (running 386/ix) with
a WD 1006-SRV2 (1:1 16 bit RLL controller) and replacing
the controller (actually returning it as DOA) solved
the problem..

-- 
   Larry Snyder, NSTAR Public Access Unix 219-289-0282 (HST/PEP/V.32/v.42bis)
                        regional UUCP mapping coordinator 
  {larry@nstar.rn.com, ..!uunet!nstar!larry, larry%nstar@iuvax.cs.indiana.edu}

bill@bilver.uucp (Bill Vermillion) (01/25/91)

In article <1991Jan24.143542.19808@nstar.rn.com> larry@nstar.rn.com (Larry Snyder) writes:
>mdm@cocktrice.uucp (Mike Mitchell) writes:

>>I am having unexplained system crashes where the machine will hang
>>while unattended. I am running ESIX 3.2 Rev D and leave the machine
>>on 100% of the time. I will return after leaving the machine
>>for a few hours, and it will be hung. The drive light is on, indicating
>>that some sort of disk activity was happening at or about the time
>>of the system crash.

>what type of controller?  I did have this problem on
>one of our machines last year (running 386/ix) with
>a WD 1006-SRV2 (1:1 16 bit RLL controller) and replacing
>the controller (actually returning it as DOA) solved
>the problem..

Larry's suggestion may be correct.  There is NO problem with ESIX.  I am
running a node with a full feed incoming and outgoing.  It runs just fine.
This past week, in a news blockage, it moved about 12,000 messages in a
short time.

I am running the fast file system on the /  and on the news partition and a
51k file system on a third partition.

The ONLY problem I have had occurred this AM.  A power failure did something
so that on a reboot it panicked and rebooted, over and over.

Put in the distribution disk 1, then 2 - it asked if I wanted a quick
recovery.  Said y, it then said reboot.  It saved the old inittab, the old
passwd and the old shadow in addition to the old kernel, which was probably
corrupt.

At the boot I copied over a previously saved copy of a good kernel.

Total time - less than 10 minutes.   I have had some other systems that
weren't that easy to recover.

I suspect you have a controller problem.  Running a WD1007 w Maxtor ESDI
here.

-- 
Bill Vermillion - UUCP: uunet!tarpit!bilver!bill
                      : bill@bilver.UUCP

fangchin@elaine44.stanford.edu (Chin Fang) (01/27/91)

In article <1991Jan25.155639.388@bilver.uucp> bill@bilver.uucp (Bill Vermillion) writes:
>
[..some stuff deleted]

>I am running the fast file system on the /  and on the news partition and a
>51k file system on a third partition.
>

I use the BSD FFS on all my partitions (btw, long file names are relatively
safe for /usr and /usr2, NOT for /).

>The ONLY problem I have had occurred this AM.  A power failure did something
>so that on a reboot it panicked and rebooted, over and over.
>

Hmm... I have encountered this too.  When the ESIX incarnation of BSD
FFS is used, this tends to happen.  Back when I was using rev. B, rebooting
without shutdown never caused any problem like panicking.  But rev. B does
not have FFS.  Any relation here?

>Put in the distribution disk 1, then 2 - it asked if I wanted a quick
>recovery.  Said y, it then said reboot.  It saved the old inittab, the old
>passwd and the old shadow in addition to the old kernel, which was probably
>corrupt.
>
>At the boot I copied over a previously saved copy of a good kernel.
>

Yes, I have done *almost* the same.  In all cases that I have encountered, the
kernel was NOT corrupted!  I could always reuse the old kernel.  What I
did was just mv *.SAV to * (well, global renaming implied here) and
unix.SAV to unix and then reboot.  It has worked so far, but I never understood
why the OS got in trouble earlier.  Any illumination would be appreciated.

>Total time - less than 10 minutes.   I have had some other systems that
>weren't that easy to recover.
>

I believe that.  It is always relatively painless.  (Except the first time!)

>I suspect you have a controller problem.  Running a WD1007 w Maxtor ESDI
>here.
>

I use a WD1007SVH w Miniscribe 3130E.

Chin Fang
Mechanical Engineering Department
Stanford University
fangchin@portia.stanford.edu

davidsen@sixhub.UUCP (Wm E. Davidsen Jr) (01/30/91)

In article <1991Jan24.143542.19808@nstar.rn.com> larry@nstar.rn.com (Larry Snyder) writes:

| what type of controller?  I did have this problem on
| one of our machines last year (running 386/ix) with
| a WD 1006-SRV2 (1:1 16 bit RLL controller) and replacing
| the controller (actually returning it as DOA) solved
| the problem..

  This sometimes happens on a WD 1006 or 1007 with two drives attached.
My understanding is that an I/O is started on one drive while a seek is
started on the other. If they both finish at the same time, a single
interrupt is issued and the driver has to check the controller status
to get both conditions.
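
  To make the race concrete, here is a minimal C sketch of the two ways
a driver might handle that single interrupt. It is only an illustration
under stated assumptions: the status bits (ST_IO_DONE, ST_SEEK_DONE),
the struct, and both handler functions are hypothetical names made up
for this example and are not taken from the WD, ISC, or SCO sources.

```c
#include <assert.h>

/* Hypothetical status bits; the real WD1006/1007 registers differ. */
#define ST_IO_DONE   0x01u  /* data transfer on one drive has finished */
#define ST_SEEK_DONE 0x02u  /* seek on the other drive has finished */

struct events {
    int io_done;
    int seek_done;
};

/* Fragile pattern: assumes one interrupt reports exactly one event,
 * so a simultaneous seek completion is silently dropped. */
struct events handle_irq_naive(unsigned status)
{
    struct events ev = {0, 0};
    if (status & ST_IO_DONE)
        ev.io_done = 1;
    else if (status & ST_SEEK_DONE)
        ev.seek_done = 1;
    return ev;
}

/* Safer pattern: test every condition on each interrupt. */
struct events handle_irq_checked(unsigned status)
{
    struct events ev = {0, 0};
    if (status & ST_IO_DONE)
        ev.io_done = 1;
    if (status & ST_SEEK_DONE)
        ev.seek_done = 1;
    return ev;
}

/* Usage: both events arrive with a single interrupt. */
void demo(void)
{
    unsigned both = ST_IO_DONE | ST_SEEK_DONE;
    assert(handle_irq_naive(both).seek_done == 0);   /* seek is lost */
    assert(handle_irq_checked(both).seek_done == 1); /* seek is seen */
}
```

  In the naive version the lost seek completion would leave a request
waiting forever, which would look like a hang with the drive light on.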

  I was told that ISC said it was a hardware problem and SCO just
issued the fix for the driver (xnx133). I have never seen the problem on
a machine with a single drive, nor a case where the SCO driver change
failed to fix the problem (five 1007s, two 1006s).

-- 
bill davidsen - davidsen@sixhub.uucp (uunet!crdgw1!sixhub!davidsen)
    sysop *IX BBS and Public Access UNIX
    moderator of comp.binaries.ibm.pc and 80386 mailing list
"Stupidity, like virtue, is its own reward" -me

kdenning@pcserver2.naitc.com (Karl Denninger) (03/05/91)

In article <3021@sixhub.UUCP> davidsen@sixhub.UUCP (bill davidsen) writes:
>In article <1991Jan24.143542.19808@nstar.rn.com> larry@nstar.rn.com (Larry Snyder) writes:
>
>| what type of controller?  I did have this problem on
>| one of our machines last year (running 386/ix) with
>| a WD 1006-SRV2 (1:1 16 bit RLL controller) and replacing
>| the controller (actually returning it as DOA) solved
>| the problem..
>
>  This sometimes happens on a WD 1006 or 1007 with two drives attached.
>My understanding is that an I/O is started on one drive while a seek is
>started on the other. If they both finish at the same time, a single
>interrupt is issued and the driver has to check the controller status
>to get both conditions.
>
>  I was told that ISC said it was a hardware problem and SCO just
>issued the fix for the driver (xnx133). 

Turning on the flag "CCAP_NOSEEK" in the HPDD driver config file will fix
this problem with ISC machines.

I have seen it too.  It is caused by a little-used (for DOS) capability in
the WD series of controllers; that is the ability to do some operations on
two drives at once.  Specifically, you can seek one drive and while it's
moving the heads, do an I/O operation on the other.

The CCAP patch to the space.c file has fixed the problem every time I have
encountered it on ISC 2.x.

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com

"The most dangerous command on any computer is the carriage return."
Disclaimer:  The opinions here are solely mine and may or may not reflect
  	     those of the company.

rcbarn@rw7.urc.tue.nl (Raymond Nijssen) (03/06/91)

kdenning@pcserver2.naitc.com (Karl Denninger) writes:
>In article <3021@sixhub.UUCP> davidsen@sixhub.UUCP (bill davidsen) writes:
>>In article <1991Jan24.143542.19808@nstar.rn.com> larry@nstar.rn.com (Larry Snyder) writes:
>>
>>| what type of controller?  I did have this problem on
>>| one of our machines last year (running 386/ix) with
>>| a WD 1006-SRV2 (1:1 16 bit RLL controller) and replacing
>>| the controller (actually returning it as DOA) solved
>>| the problem..
>>
>>  This sometimes happens on a WD 1006 or 1007 with two drives attached.
>>My understanding is that an I/O is started on one drive while a seek is
>>started on the other. If they both finish at the same time, a single
>>interrupt is issued and the driver has to check the controller status
>>to get both conditions.
>
>Turning on the flag "CCAP_NOSEEK" in the HPDD driver config file will fix
>this problem with ISC machines.

More precisely, to turn this feature off you have to:

        1) Edit the file /etc/conf/pack.d/dsk/space.c and change the
           'disk_config_tbl' entry for either a primary or secondary
           AT hard disk, which looks like:
                (CCAP_RETRY | CCAP_ERRCOR), /* capabilities */
           to
                (CCAP_RETRY | CCAP_ERRCOR | CCAP_NOSEEK), /* capabilities */

        2) Rebuild the kernel.
        3) Reboot.

(Thanks to Doug Pintar at ISC)
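
For anyone wondering what the extra flag buys you, here is a rough C
sketch of how a capability mask like this is typically tested. The bit
values and the may_overlap_seek() helper are assumptions made up for
this example; the real definitions live in the ISC driver headers, and
I have not seen how the HPDD driver actually consumes the flag.

```c
#include <assert.h>

/* Hypothetical bit values, chosen only so the example compiles. */
#define CCAP_RETRY  0x01u  /* driver may retry failed operations */
#define CCAP_ERRCOR 0x02u  /* controller does error correction */
#define CCAP_NOSEEK 0x04u  /* do not overlap a seek with other I/O */

/* Hypothetical helper: nonzero if the driver may start a seek on one
 * drive while an I/O is still in flight on the other drive. */
int may_overlap_seek(unsigned caps)
{
    return (caps & CCAP_NOSEEK) == 0;
}

/* Usage: the two 'capabilities' entries from the space.c edit. */
void demo(void)
{
    unsigned before = CCAP_RETRY | CCAP_ERRCOR;
    unsigned after  = CCAP_RETRY | CCAP_ERRCOR | CCAP_NOSEEK;
    assert(may_overlap_seek(before));  /* overlapped seeks allowed */
    assert(!may_overlap_seek(after));  /* overlapped seeks disabled */
}
```

With overlapped seeks disabled, the two completions can no longer
arrive together, at the cost of some throughput on two-drive systems.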

A better alternative is to replace the controller board with one without
this bug. There is no _documented_ way for you to tell whether a WD1006 
adapter is buggy or not. However, when I encountered this problem some
time ago, I posted an inquiry in this newsgroup in which I asked people
to tell me (a) if they had seen it too, and (b) the text on top of the WD1006
chip. *Everybody* who had seen this problem told me it said 'PROTO' on
the chip; everybody who did not have a chip with 'PROTO' on it told
me they had never had any problem.

So have a close look at your adapter, and tell me if the magic word is there.

-Raymond