[comp.unix.sysv386] ISC 2.2 Hangs on disk I/O

doug@pravda (Doug MacKenzie) (03/12/91)

I have a Modular Circuit Technology 25MHz cache 386 system.
I have a Western Digital WD1006V-SR2 RLL controller with a Seagate
ST-277-1 65MB RLL hard disk with DOS in the first ~8Meg partition and
ISC 2.2 UNIX in the rest.  /etc/partitions contains an accurate map
of the bad sectors in the UNIX partition.

The problem:
  Sometimes during disk I/O, the disk light will just stay on solid
  and the system is locked up.  I can not get any response other than
  turning the stupid thing off.

  It happens sometimes during boot up when it is scanning the disk.
  It happens sometimes during large GREP's.
  It just seems to happen!!!
  This is not a panic.  It is just hung.

Does anyone have any ideas?

		Doug MacKenzie
		doug@cc.gatech.edu

drector@orion.oac.uci.edu (David Rector) (03/13/91)

In <24113@hydra.gatech.EDU> doug@pravda (Doug MacKenzie) writes:

>I have a Modular Circuit Technology 25MHz cache 386 system.
>I have a Western Digital WD1006V-SR2 RLL controller with a Seagate
>ST-277-1 65MB RLL hard disk with DOS in the first ~8Meg partition and
>ISC 2.2 UNIX in the rest.  /etc/partitions contains an accurate map
>of the bad sectors in the UNIX partition.

>The problem:
>  Sometimes during disk I/O, the disk light will just stay on solid
>  and the system is locked up.  I can not get any response other than
>  turning the stupid thing off.

>  It happens sometimes during boot up when it is scanning the disk.
>  It happens sometimes during large GREP's.
>  It just seems to happen!!!
>  This is not a panic.  It is just hung.

>Does anyone have any ideas?

>		Doug MacKenzie
>		doug@cc.gatech.edu

Ah, another sufferer.  The disease seems to be incurable, but you won't
notice it after a while.

No one has firmly diagnosed this problem with the otherwise admirable
WD1006, but the behavior suggests that an error condition is not being
correctly handled by the ISC driver.  Conjecture: controller is asked
to read a flakey sector; CRC comes up bad too many times and controller
gives up; driver doesn't have a time-out and never sees error message
or resets the controller; system hangs.
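
The conjecture above amounts to a missing time-out in the wait for the controller. A minimal sketch of the idea follows, as a user-space simulation rather than driver code. The names wait_for_completion, read_with_recovery, and ControllerTimeout, and the callbacks start_read, is_done, and reset, are all hypothetical and do not come from the ISC driver or any WD documentation.

```python
import time


class ControllerTimeout(Exception):
    """Raised when the (hypothetical) controller never signals completion."""


def wait_for_completion(is_done, timeout_s=2.0, poll_s=0.01):
    """Poll is_done() until it returns True or timeout_s seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_done():
            return
        time.sleep(poll_s)
    raise ControllerTimeout("no completion within %.2f s" % timeout_s)


def read_with_recovery(start_read, is_done, reset, retries=3, timeout_s=2.0):
    """Issue a read; on time-out, reset the controller and try again.

    Returns True if a read completed, False if every attempt timed out.
    """
    for _ in range(retries):
        start_read()
        try:
            wait_for_completion(is_done, timeout_s=timeout_s)
            return True
        except ControllerTimeout:
            # Do not wait forever: put the controller back in a known state.
            reset()
    return False
```

With a time-out in place, a controller that gives up on a bad sector leads to a reported failure after a bounded delay instead of a hung system. Whether the real driver lacks such a time-out is, as stated above, only a conjecture.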

The cure is time.  ISC relocates sectors which show frequent bad
CRCs.  After a few weeks the problem will occur only rarely.
Experiments suggest that the process might be sped up by writing and
running a short program that reads all of the disk sectors
(whether in a file or not).
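
A short program of the kind described might look like the sketch below. It assumes 512-byte sectors, a Python interpreter, and a device (or file) whose size can be found by seeking to its end. The function name scan_sectors is hypothetical, and the path of the raw disk device depends on the system and is not given here.

```python
import os

SECTOR_SIZE = 512  # assumed sector size


def scan_sectors(path, sector_size=SECTOR_SIZE):
    """Read every sector of `path` in order.

    Returns (sectors_read, bad_offsets), where bad_offsets lists the byte
    offsets at which the operating system reported a read error.
    """
    bad_offsets = []
    sectors_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # total bytes available
        for offset in range(0, size, sector_size):
            os.lseek(fd, offset, os.SEEK_SET)
            try:
                os.read(fd, sector_size)
                sectors_read += 1
            except OSError:
                # Read error reported by the driver; note it and move on.
                bad_offsets.append(offset)
    finally:
        os.close(fd)
    return sectors_read, bad_offsets


# Example use (run as root against the raw disk device for your system):
#   good, bad = scan_sectors("<path to raw disk device>")
```

Note that if the driver hangs instead of returning an error, this program will hang along with everything else; it only helps by touching flaky sectors often enough for them to be relocated.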

Boot sectors, which cannot be relocated, might cause more permanent
problems.  Change the boot cylinder.

The superb performance of the WD1006 makes up for this minor inconvenience.

-- 
David L. Rector				drector@orion.oac.uci.edu
Dept. of Math.				U. C. Irvine, Irvine CA 92717

drector@orion.oac.uci.edu (David Rector) (03/13/91)

This is posted for a friend: Ira Baxter - baxter@slcs.slb.com

------------------------

   I have a Modular Circuit Technology 25MHz cache 386 system.
   I have a Western Digital WD1006V-SR2 RLL controller with a Seagate
   ST-277-1 65MB RLL hard disk with DOS in the first ~8Meg partition and
   ISC 2.2 UNIX in the rest.  /etc/partitions contains an accurate map
   of the bad sectors in the UNIX partition.

   The problem:
     Sometimes during disk I/O, the disk light will just stay on solid
     and the system is locked up.  I can not get any response other than
     turning the stupid thing off.

     It happens sometimes during boot up when it is scanning the disk.
     It happens sometimes during large GREP's.
     It just seems to happen!!!
     This is not a panic.  It is just hung.

   Does anyone have any ideas?


I have been living with exactly these symptoms and my WD1006V-SR2 on
my Micronics 386-20 cache motherboard system, with ISC 1.0.6 and ISC
2.0.2.  I actually went (I used to live near them) and beat Western
Digital over the head about it.

All I ever got was "Well, if we can't make it fail here then it
isn't broken."   I never had the guts to take them my system and
let them play with it for two weeks.

I also broadcast (like you) to the net in desperation and got no results.

I can clear the hang with a simple hard reset, which is the logical
equivalent of a power off/on cycle, but not with CTRL-ALT-DEL.
Fortunately, it only seems to happen to me when my machine is
cold, so I simply don't turn it off anymore.  Consequently I
almost never see the problem (the last time I saw it was 6 months
ago, after I ripped the machine apart to play with its configuration).
This form is pretty livable, but you may not be so lucky.

How old is your controller?  Why don't you send me the numbers written
by WD on it, and I'll compare them to mine; perhaps there is simply a
bad run of controllers.  I personally think it is a bug in the track
buffering logic; something about reading ahead and getting a fault.
But I doubt it will ever get fixed.

If you get any other net feedback, I'd like to hear it.
I'm currently not reading the 386 unix newsgroup; your message
got forwarded to me by another owner of a WD1006, who has had
zero trouble with his on his 486.

baxter@slcs.slb.com

-----------------------
-- 
David L. Rector				drector@orion.oac.uci.edu
Dept. of Math.				U. C. Irvine, Irvine CA 92717

todd@toolz.uucp (Todd Merriman) (03/13/91)

doug@pravda (Doug MacKenzie) writes:
>  Sometimes during disk I/O, the disk light will just stay on solid
>  and the system is locked up.  I can not get any response other than
>  turning the stupid thing off.

I have had the same symptoms on my system with Interactive 2.2,
SCSI disk, and SCSI tape.

After *much* fretting, I found that something was modifying
/unix.  Rebuilding the kernel (without reconfiguring) fixes the problem.
I can always count on it to happen shortly after rebuilding the system (disk
format, partitioning, install core and kconfig, then restoring 
everything from tape).  Even if I rebuild the kernel after restoring
everything, the symptoms will appear sometime later.  After again
rebuilding the kernel, everything works fine.
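
One way to confirm that something is modifying /unix is to record a checksum right after a rebuild and compare it later. A minimal sketch, assuming a Python interpreter is available; the helper names file_digest and has_changed are hypothetical, and SHA-256 is an arbitrary choice for this sketch, not anything supplied by ISC.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so a large kernel image is never held in memory whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def has_changed(path, recorded_digest):
    """True if the file's current digest differs from the recorded one."""
    return file_digest(path) != recorded_digest


# Example use:
#   recorded = file_digest("/unix")      # right after rebuilding the kernel
#   ...later...
#   if has_changed("/unix", recorded): print("/unix was modified")
```

Running the comparison from cron would narrow down when the change happens, which may help identify what is doing the modifying.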

Note:  rebuilding the kernel many times will quickly fill your disk
containing the root file system.

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Todd Merriman - Software Toolz, Inc.                   * Maintainer of the  *
* 8030 Pooles Mill Dr., Ball Ground, GA 30107-9610       * Software           *
* ...emory.edu!toolz.uucp!todd                           * Entrepreneur's     *
* V-mail (800) 869-3878, (404) 889-8264                  * mailing list       *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

edhall@rand.org (Ed Hall) (03/14/91)

In article <27DD4757.22595@orion.oac.uci.edu> drector@orion.oac.uci.edu (David Rector) writes:
>In <24113@hydra.gatech.EDU> doug@pravda (Doug MacKenzie) writes:
>>The problem:
>>  Sometimes during disk I/O, the disk light will just stay on solid
>>  and the system is locked up.  I can not get any response other than
>>  turning the stupid thing off.
>
>>  It happens sometimes during boot up when it is scanning the disk.
>>  It happens sometimes during large GREP's.
>>  It just seems to happen!!!
>>  This is not a panic.  It is just hung.
>
>Ah, another sufferer.  The disease seems to be incurable, but you won't
>notice it after a while.
>
>No one has firmly diagnosed this problem with the otherwise admirable
>WD1006, but the behavior suggests that an error condition is not being
>correctly handled by the ISC driver.  Conjecture: controller is asked
>to read a flakey sector; CRC comes up bad too many times and controller
>gives up; driver doesn't have a time-out and never sees error message
>or resets the controller; system hangs.

This could be, but I've also seen people on the net complaining about
this behavior with ESIX.  Both drivers may be broken--or the WD1006.

Years ago (well, maybe two) someone posted to comp.sys.ibm.pc a letter
from someone at Western Digital asking for people experiencing this
problem to contact her.  The implication was that they were going to
give up on solving the problem unless they found evidence that more
than a few individuals were affected.  I had just got my WD1006SV2 and
it hadn't frozen up, so like a dummy I didn't save the posting.  Of
course, within the next week or so my system froze.  But it's only
happened every few weeks, so I'm not too upset; if it were more than
just my home system I'd certainly have pursued it...

BTW, this is NOT the problem a batch of WD1006's had with overlapped
seeks.  That particular problem only existed on ISC 2.* systems with
two disks, and could be fixed by a small change to the driver's
configuration.

		-Ed Hall
		edhall@rand.org

martin@saturn.uucp (Martin J. Schedlbauer) (03/16/91)

In article <27DD916A.14624@orion.oac.uci.edu> drector@orion.oac.uci.edu (David Rector) writes:
>This is posted for a friend: Ira Baxter - baxter@slcs.slb.com
>
>------------------------
>
>   I have a Modular Circuit Technology 25MHz cache 386 system.
>   I have a Western Digital WD1006V-SR2 RLL controller with a Seagate
>   ST-277-1 65MB RLL hard disk with DOS in the first ~8Meg partition and
>   ISC 2.2 UNIX in the rest.  /etc/partitions contains an accurate map
>   of the bad sectors in the UNIX partition.
>
>   The problem:
>     Sometimes during disk I/O, the disk light will just stay on solid
>     and the system is locked up.  I can not get any response other than
>     turning the stupid thing off.
>

It happened to me twice too under a different combination:

	Esix 3.2D
	AMI 386-25
	Maxtor 8760E (650 MB ESDI)
	UltraStor 12F ESDI controller

During VERY heavy file I/O, sometimes the drive light just stays on.  The
system is NOT dead, but you can't get anywhere until you do a hard reset.

Somebody said that this is a known problem with the WD controllers if you
have two drives and two seeks are started at the same time. I took out my
second drive and I am running single now and it happened again about a week
ago. So that doesn't seem to be the problem.

	...Martin



-- 
==============================================================================
Martin J. Schedlbauer	| martin@saturn.UUCP	| ...!ulowell!saturn!martin
8 Gilman Road		| mschedlb@ulowell.edu	| ...!uunet!wang!saturn!martin
Billerica, MA 01862 USA	| CIS: 76675, 3364	| /\/\/\/\/\/\/\/\/\/\/\/\/\/\

grant@bluemoon.uucp (Grant DeLorean) (03/17/91)

todd@toolz.uucp (Todd Merriman) writes:

>Note:  rebuilding the kernel many times will quickly fill your disk
>containing the root file system.

 You know, rm -r unix.* in the right directory cures that problem in
a hurry...
-- 
\  Grant DeLorean  (grant@bluemoon)    {n8emr|nstar}!bluemoon!grant  /
"You need only reflect that one of the best ways to get yourself a reputation 
as a dangerous citizen these days is to go about repeating the very phrases
which our founding fathers used in the struggle for independence."-C.A. Beard

davidsen@sixhub.UUCP (Wm E. Davidsen Jr) (03/18/91)

In article <24113@hydra.gatech.EDU> doug@pravda (Doug MacKenzie) writes:
| I have a Modular Circuit Technology 25MHz cache 386 system.
| I have a Western Digital WD1006V-SR2 RLL controller with a Seagate

|   Sometimes during disk I/O, the disk light will just stay on solid
|   and the system is locked up.  I can not get any response other than
|   turning the stupid thing off.

| Does anyone have any ideas?

  I think I know what the problem is; you see this with 1006 and 1007
disk controllers. SCO put a fix in their driver for it, which cures the
problem. ISC hasn't figured it out, as far as I know, and at one time
was blaming it on the controller.

  This family of controllers has a possibility of returning interrupts
in an unusual way. Since it's documented to be possible I can't call it
a bug. Replacing the controller will probably fix it, switching to SCO
will definitely fix it.

  ISC may have a fix by now, I haven't been tracking it.
-- 
bill davidsen - davidsen@sixhub.uucp (uunet!crdgw1!sixhub!davidsen)
    sysop *IX BBS and Public Access UNIX
    moderator of comp.binaries.ibm.pc and 80386 mailing list
"Stupidity, like virtue, is its own reward" -me

davidsen@sixhub.UUCP (Wm E. Davidsen Jr) (03/18/91)

In article <27DD916A.14624@orion.oac.uci.edu> drector@orion.oac.uci.edu (David Rector) writes:

| I have been living with exactly these symptoms and my WD1006V-SR2 on
| my Micronics 386-20 cache motherboard system, with ISC 1.0.6 and ISC
| 2.0.2.  I actually went (I used to live near them) and beat Western
| Digital over the head about it.

  Here's a chance for the revitalized ISC support to shine! SCO has had a
fix out for this problem for over a year. I bet that if ISC asked WD
they could get the info that SCO used to solve the problem and fix it for
ISC, too.

  How about it, Marty? You've seen people say they have the problem,
too, can you continue to build on your short but impressive record for
getting fixes out of ISC quickly?

  Since this *seems* to be driver rather than kernel, it could probably
be posted, avoiding the cost associated with a crash fix to people who
don't need it. Note the doubt in my last statement... Still, the problem
has been around for several years, and SCO has demonstrated that it can
be fixed without breaking support for other controllers.
-- 
bill davidsen - davidsen@sixhub.uucp (uunet!crdgw1!sixhub!davidsen)
    sysop *IX BBS and Public Access UNIX
    moderator of comp.binaries.ibm.pc and 80386 mailing list
"Stupidity, like virtue, is its own reward" -me

wgb@balkan.TNT.COM (William G. Bunton) (03/18/91)

In article <3446@sixhub.UUCP> davidsen@sixhub.UUCP (bill davidsen)
writes (about ISC and WD disk controllers):
>Since it's documented to be possible I can't call it
>a bug.

Right.  And since it's documented that the u area is writeable,
allowing you to change your uid, that's not a bug either?  (Not to
pick on ISC, it just seemed a rather widely known counterexample to
the above attitude).

>"Stupidity, like virtue, is its own reward" -me

I'm sorry, but this surely seems to fit your above statement.
-- 
William G. Bunton              | Since it's documented to be possible, I
wgb@balkan.tnt.com             | can't call it a bug.
Tools & Techniques, Austin, TX |                        -- Bill Davidsen

chris@alderan.uucp (Christoph Splittgerber) (03/19/91)

In article <1991Mar17.030620.29587@bluemoon.uucp> grant@bluemoon.uucp (Grant DeLorean) writes:
>todd@toolz.uucp (Todd Merriman) writes:
>
> You know, rm -r unix.* in the right directory cures that problem in
>a hurry...

In the wrong directory it might cure almost everything :-)

-- 
************************ Brain fault (core dumped) *************************
Replies-To:  chris@alderan.uucp        UUCP: uunet!mcsun!unido!alderan!chris 
Phone:       +49 711 344375            Fax:  +49 711 3460684

savage@tigger.Colorado.EDU (Metallica Rules) (03/30/91)

In article <1991Mar16.133513.428@saturn.uucp> martin@saturn.UUCP (Martin J. Schedlbauer) writes:
>In article <27DD916A.14624@orion.oac.uci.edu> drector@orion.oac.uci.edu (David Rector) writes:
>>
>>   The problem:
>>     Sometimes during disk I/O, the disk light will just stay on solid
>>     and the system is locked up.  I can not get any response other than
>>     turning the stupid thing off.
>>

Yes, I had this problem as well, but it only happened when I was in VP/ix.
Since then, I have repartitioned my hard drive to be just UNIX, which
freed up about an extra 50MB for me, and now I don't have the problem any more.
During the repartitioning I increased my swap space by about 5-10MB as well.
Either having more swap space or having more than 5MB free on my HD (I
now have 47MB free) cleared up the problem for me.

Chuck Savage
-- 
savage%tigger@boulder.colorado.edu