[comp.unix.i386] ISC disk driver bug?

baxter@ics.uci.edu (Ira Baxter) (12/18/89)

This thread indicated that ISC 2.0.2 sometimes hangs under heavy disk
access when using SCSI controllers.  There has been some discussion about
Newbury drives being at fault.

Assuming that the disk drives, and not the controllers, are at fault,
there is *no* reasonable excuse for the ISC drivers to fail to
diagnose this problem (there really ought to be an I/O transaction timeout
of some kind!!).  So even if the Newbury drives are busted, so is the
ISC driver.
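
A minimal sketch, in C, of the kind of I/O transaction timeout meant
here: poll the controller's status a bounded number of times and report
a timeout instead of spinning forever.  Everything in it is an
assumption for illustration (the STATUS_BUSY bit value, the callback
standing in for a port read, the function names); it is not the ISC
driver's code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed "controller busy" status bit; a real driver would take
 * this from the controller's documentation. */
#define STATUS_BUSY 0x80

enum wait_result { WAIT_OK = 0, WAIT_TIMEOUT = -1 };

/* Stand-in for reading the status port (hypothetical interface). */
typedef unsigned char (*read_status_fn)(void *ctx);

/* Poll until BUSY clears, giving up after max_polls status reads. */
static enum wait_result wait_not_busy(read_status_fn read_status,
                                      void *ctx, unsigned long max_polls)
{
    unsigned long i;

    assert(read_status != NULL);
    for (i = 0; i < max_polls; i++) {
        if ((read_status(ctx) & STATUS_BUSY) == 0)
            return WAIT_OK;
    }
    return WAIT_TIMEOUT;    /* caller can log an error, reset, retry */
}

/* Simulated hung controller: BUSY never clears. */
static unsigned char stuck_busy(void *ctx)
{
    (void)ctx;
    return STATUS_BUSY;
}

/* Simulated healthy controller: BUSY clears after *ctx more polls. */
static unsigned char clears_after(void *ctx)
{
    unsigned int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return STATUS_BUSY;
    }
    return 0;
}
```

With a bound like this, a drive that never answers produces a diagnosable
WAIT_TIMEOUT rather than a silent hang.  An interrupt-driven driver would
more likely arm a timer than count polls; the counting loop is only the
simplest form of the idea.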

I use a Western Digital WD1006SRV2 (RLL 1-1 track-buffered)
controller.  I (and others with 1006s) have seen identical symptoms;
until I saw this thread, I assumed there was something funny about the
WD1006.  I have been chasing this problem unsuccessfully even with
WD's aid.  I conclude the problem is more due to the ISC drivers than
the hardware.

Would somebody at ISC look into this problem and discuss solutions,
or tell us why it can't be the ISC drivers?

Thanks,

--
Ira Baxter

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (12/19/89)

In article <258C899B.28434@paris.ics.uci.edu> baxter@ics.uci.edu (Ira Baxter) writes:

| I use a Western Digital WD1006SRV2 (RLL 1-1 track-buffered)
| controller.  I (and others with 1006s) have seen identical symptoms;
| until I saw this thread, I assumed there was something funny about the
| WD1006.  I have been chasing this problem unsuccessfully even with
| WD's aid.  I conclude the problem is more due to the ISC drivers than
| the hardware.

  Karl Denninger posted a note on jumper settings with the 1006 which
could cure your problem.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon

baxter@ics.uci.edu (Ira Baxter) (12/19/89)

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:

>In article <258C899B.28434@paris.ics.uci.edu> baxter@ics.uci.edu (Ira Baxter) writes:

>| I use a Western Digital WD1006SRV2 (RLL 1-1 track-buffered)
>| controller.  I (and others with 1006s) have seen identical symptoms;
>| until I saw this thread, I assumed there was something funny about the
>| WD1006.  I have been chasing this problem unsuccessfully even with
>| WD's aid.  I conclude the problem is more due to the ISC drivers than
>| the hardware.

>  Karl Denninger posted a note on jumper settings with the 1006 which
>could cure your problem.

There are only two interesting jumpers on a WD1006 according to my WD
documentation: W1-1,2, handling "latched" mode (unfortunately, the
docs *don't* say what this does), and W1-5-6, which disables cache
control.  If the cure is disabling the cache, then the point of buying
the controller was wasted.  I'm waiting for WD to tell me what
"latched" mode does.  In any case, *the ISC drivers* should diagnose a
problem, rather than the system merely hanging, unless "latched" mode
causes the controller to put data in a location not requested by the
driver... which seems impossible, since this controller only does PIO
transfers, and therefore the target addresses in the machine are
controlled by the IN instructions, not the controller.
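
To illustrate the PIO argument, here is a hedged C sketch of a
programmed-I/O sector read.  The 256-word sector size, the callback
standing in for an IN from the data port, and all names are assumptions
for illustration, not taken from any real driver.  The point it shows
is that the CPU-side loop chooses every destination address; the
controller only supplies the next word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sector size: 512 bytes moved as 16-bit words. */
#define SECTOR_WORDS 256

/* Stand-in for an IN instruction on the data port (hypothetical). */
typedef uint16_t (*read_data_fn)(void *ctx);

/* Copy one sector from the data port into a driver-chosen buffer. */
static void pio_read_sector(read_data_fn read_data, void *ctx,
                            uint16_t *dest)
{
    size_t i;

    assert(read_data != NULL && dest != NULL);
    for (i = 0; i < SECTOR_WORDS; i++)
        dest[i] = read_data(ctx);   /* the driver picks dest[i] */
}

/* Simulated data port that returns 0, 1, 2, ... on successive reads. */
static uint16_t counting_port(void *ctx)
{
    uint16_t *next = ctx;

    return (*next)++;
}
```

Under these assumptions the controller can hand over wrong data, or stop
handing over data at all, but it cannot make the words land anywhere
other than where the loop stores them.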

--
Ira Baxter

larry@nstar.UUCP (Larry Snyder) (12/19/89)

> problem, rather than the system merely hanging, unless "latched" mode
> causes the controller to put data in a location not requested by the
> driver... which seems impossible, since this controller only does PIO
> transfers, and therefore the target addresses in the machine are
> controlled by the IN instructions, not the controller.
> 

Well, I was running SCO Xenix and had the same problems.  I tried to return
the controller for credit - and was told that the controller has not been
approved by SCO so my return would be subject to a 25% restocking fee (this
was from the distributor - Tech Data). 

I'm no longer running the controller, like SCO - and am not having any 
problems.  The 2372B might not be as fast - but it works.


-- 
Larry Snyder, Northern Star Communications, Notre Dame, IN
uucp: root@nstar -or- ...!iuvax!ndmath!nstar!root

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (12/19/89)

In article <258D393B.1670@paris.ics.uci.edu> baxter@ics.uci.edu (Ira Baxter) writes:

| >  Karl Denninger posted a note on jumper settings with the 1006 which
| >could cure your problem.
| 
| There are only two interesting jumpers on a WD1006 according to my WD
| documentation: W1-1,2, handling "latched" mode (unfortunately, the
| docs *don't* say what this does), and W1-5-6, which disables cache
| control.  If the cure is disabling the cache, then the point of buying
| the controller was wasted.  

  Sorry I mentioned it. I don't have time to reinvent solutions.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (12/19/89)

In article <258D393B.1670@paris.ics.uci.edu> baxter@ics.uci.edu (Ira Baxter) writes:
| davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:
| 
| >  Karl Denninger posted a note on jumper settings with the 1006 which
| >could cure your problem.
| 
| There are only two interesting jumpers on a WD1006 according to my WD
| documentation: W1-1,2, handling "latched" mode (unfortunately, the
| docs *don't* say what this does), and W1-5-6, which disables cache
| control.  If the cure is disabling the cache, then the point of buying
| the controller was wasted.  I'm waiting for WD to tell me what
| "latched" mode does.  

  Sorry I mentioned it. I was just happy to have the board working
right, didn't feel the need to reinvent the fix myself.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon

jackv@turnkey.gryphon.COM (Jack F. Vogel) (12/20/89)

In article <258D393B.1670@paris.ics.uci.edu> baxter@ics.uci.edu (Ira Baxter) writes:
 
[ ..... discussion of controller problem deleted....]

>..... W1-1,2, handling "latched" mode (unfortunately, the
>docs *don't* say what this does), and W1-5-6, which disables cache
>control.  If the cure is disabling the cache, then the point of buying
>the controller was wasted.  I'm waiting for WD to tell me what
>"latched" mode does.
 
No need to wait for WD, latched mode doesn't do much of anything. It controls
whether the drive select signal is "latched" or not. For instance, on a 1003 I
think it is hardwired (not sure but any of them I've seen were set up that
way). The visible evidence of this mode is that the drive select light on
the drive last selected will stay on, whereas when it's off and no drive is
selected, no drive light will be on.

Exciting, right :-}?

-- 
Jack F. Vogel			jackv@seas.ucla.edu
AIX Technical Support	              - or -
Locus Computing Corp.		jackv@ifs.umich.edu

pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) (12/20/89)

In article <511092@nstar.UUCP> larry@nstar.UUCP (Larry Snyder) writes:

   I'm no longer running the controller, like SCO - and am not having any 
   problems.  The 2372B might not be as fast - but it works.

With ESIX, I get bandwidth out of the 2372B that matches that
published for the WD1006, and that is with caching disabled. BTW, the
way to enable caching with the ACB is jumper J2-5, if I remember
correctly. I run it disabled because it really only helps in the
following two conditions:

	sequential reads from the block device

	reads from a filesystem that has been made with too small a gap

neither of which is my kind of workload, and it slows down writes
to the filesystem more than a little. Somebody observed that my
tests have been using too large a blocking factor (on purpose; I
wanted to see the maximum performance obtainable), and that read
ahead may be more useful with reads using smaller block sizes. I
seem to remember that I tried this and it is not really true. Will
do some more experiments...

--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

karl@ddsw1.MCS.COM (Karl Denninger) (12/21/89)

In article <258D393B.1670@paris.ics.uci.edu> baxter@ics.uci.edu (Ira Baxter) writes:
>davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:
>
>>In article <258C899B.28434@paris.ics.uci.edu> baxter@ics.uci.edu (Ira Baxter) writes:
>
>>| I use a Western Digital WD1006SRV2 (RLL 1-1 track-buffered)
>>| controller.  I (and others with 1006s) have seen identical symptoms;
>>| until I saw this thread, I assumed there was something funny about the
>>| WD1006.  I have been chasing this problem unsuccessfully even with
>>| WD's aid.  I conclude the problem is more due to the ISC drivers than
>>| the hardware.

>>  Karl Denninger posted a note on jumper settings with the 1006 which
>>could cure your problem.

>There are only two interesting jumpers on a WD1006 according to my WD
>documentation: W1-1,2, handling "latched" mode (unfortunately, the
>docs *don't* say what this does), and W1-5-6, which disables cache
>control.  

Turn off latched mode (install W1-1,2).

If the problem goes away, then your DRIVES can't handle it.  

If the problem does NOT go away then the problem is either (1) in the
drivers, or (2) you have a bad WD1006.  There are a lot of bad ones around;
we have had to return many of them recently.  Symptoms are exactly as
described -- random locks under heavy load.   The disk activity light is ON
solidly in these cases when it hangs.

The boards we are getting now are manufactured in a different location and
have a completely different format for the serial number.  None of these have
been bad (so far).

If the system is hanging with the activity lights OFF this is not the
problem; look to ISC in that case.

--
Karl Denninger (karl@ddsw1.MCS.COM, <well-connected>!ddsw1!karl)
Public Access Data Line: [+1 708 566-8911], Voice: [+1 708 566-8910]
Macro Computer Solutions, Inc.		"Quality Solutions at a Fair Price"

karl@ddsw1.MCS.COM (Karl Denninger) (12/22/89)

In article <6404@turnkey.gryphon.COM> jackv@turnkey.gryphon.COM writes:
>In article <258D393B.1670@paris.ics.uci.edu> baxter@ics.uci.edu (Ira Baxter) writes:
> 
>[ ..... discussion of controller problem deleted....]
>
>>..... W1-1,2, handling "latched" mode (unfortunately, the
>>docs *don't* say what this does), and W1-5-6, which disables cache
>>control.  If the cure is disabling the cache, then the point of buying
>>the controller was wasted.  I'm waiting for WD to tell me what
>>"latched" mode does.
> 
>No need to wait for WD, latched mode doesn't do much of anything. It controls
>whether the drive select signal is "latched" or not. For instance, on a 1003 I
>think it is hardwired (not sure but any of them I've seen were set up that
>way). The visible evidence of this mode is that the drive select light on
>the drive last selected will stay on, whereas when it's off and no drive is
>selected, no drive light will be on.
>
>Exciting, right :-}?

I talked to one of the Interactive gurus this AM.  He gave me a fix --
which WORKED.  Try this one on for size if you are having problems with 2
drives on a WD1006 series board (specific symptoms - random errors on BOTH
drives during the verify of the second disk):

	In /etc/conf/pack.d/dsk/space.c there is a line which reads:
	
	(CCAP_RETRY | CCAP_ERRCOR), /* capabilities */

Modify this to read:
	
	(CCAP_RETRY | CCAP_ERRCOR | CCAP_NOSEEK), /* capabilities */

Relink your kernel, and reboot.  Presto -- problem solved.
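
For anyone unsure what the edit does mechanically, here is a small C
sketch of OR-ing one more flag into a capabilities bitmask.  The EX_
names and bit values are made up for illustration; the real CCAP_
definitions live in the ISC headers and may differ, and nothing here
says what the driver does once it sees the flag.

```c
#include <assert.h>

/* Made-up bit values, for illustration only; NOT the ISC headers. */
#define EX_CCAP_RETRY   0x01u
#define EX_CCAP_ERRCOR  0x02u
#define EX_CCAP_NOSEEK  0x04u

/* Return nonzero if every bit of 'flag' is set in 'caps'. */
static int has_cap(unsigned int caps, unsigned int flag)
{
    assert(flag != 0);
    return (caps & flag) == flag;
}
```

OR-ing the extra flag in leaves the existing capability bits set and adds
one more bit for the driver to test, so the retry and error-correction
settings are unchanged by the edit.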

Thanks go to dougp@ico (Doug Pintar); he was the one that put me onto this.

If we can keep getting this kind of good, hard information from ISC we just 
might have to admit that we're impressed with the support.  They certainly
solved this problem for me.  :-)

--
Karl Denninger (karl@ddsw1.MCS.COM, <well-connected>!ddsw1!karl)
Public Access Data Line: [+1 708 566-8911], Voice: [+1 708 566-8910]
Macro Computer Solutions, Inc.		"Quality Solutions at a Fair Price"