[comp.unix.i386] ISC 2.0.2: "PANIC athd_recvdata: LOGIC ERROR missing MEMBREAK"

baxter@zola.ICS.UCI.EDU (Ira Baxter) (08/18/90)

I have a Micronics 20MHz 32KB cache 386 with WD1006V-SR2 controller,
running ISC 2.0.2, HPDD configured with the standard file system.
Under normal operation, I get an occasional (1/month) latchup of the WD1006
which I have unsuccessfully been trying to track down for months.  The
latchup leaves the disk access light on solid, but there is no disk
activity.  ALT-SYSRQ still switches virtual consoles, so UNIX is still
alive.

One absolutely repeatable experiment is this:
    # Boot up system fresh.  Only this user logged in.
    # place interlace-1 formatted diskette in drive
    cp /dev/dsk/f0q15dt /dev/null
    # wait for above cp to complete
    # You may have to use chmod to make this 0p1 readable
    cp /dev/0p1 /dev/null

After hitting <CR> on the second cp command, I instantly get:
    PANIC athd_recvdata: LOGIC ERROR
    missing MEMBREAK, trying to dump 2464 pages
... and then my WD1006 latches up.

The PANIC looks like a bug in the OS.  It doesn't seem to fail with
/dev/dsk/0p0.  I *do* have a DOS partition mounted, which 0p1 is
supposed to represent, but I can't see why that makes a difference.
Can anybody else reproduce this?

It appears that the latchup is caused by UNIX trying to do the dump.
Why should the dump logic also be buggy?  My suspicion is that the
UNIX drivers attempt another disk transfer on the drive without
waiting for the previous one to complete, and thus the latchup.
Perhaps another path through the conventional disk drivers makes the
same mistake sometimes... this would explain the symptoms I see under
normal operation.
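
To make that suspicion concrete, here is a minimal sketch of the check a
driver would need before issuing a second command: poll the controller
status until the busy bit clears, and give up after a bounded number of
polls.  This is not ISC's code.  The 0x80 busy bit and the 0x50 idle value
follow the usual AT task-file status layout; wait_not_busy, fake_ctrl and
fake_status are hypothetical names, and the "controller" is a simulation.

```c
#include <stddef.h>

#define ATHD_BSY 0x80u  /* busy bit in the AT task-file status register */

/* Callback that returns the current status byte (hypothetical interface). */
typedef unsigned char (*status_fn)(void *ctx);

/* Poll until BSY clears.  Returns 0 when ready, -1 if max_polls reads
 * all showed the controller still busy. */
static int wait_not_busy(status_fn read_status, void *ctx, unsigned max_polls)
{
    while (max_polls-- > 0) {
        if ((read_status(ctx) & ATHD_BSY) == 0)
            return 0;
    }
    return -1;
}

/* Simulated controller: reports busy for a fixed number of status reads. */
struct fake_ctrl {
    unsigned busy_reads_left;
};

static unsigned char fake_status(void *ctx)
{
    struct fake_ctrl *c = ctx;

    if (c->busy_reads_left > 0) {
        c->busy_reads_left--;
        return ATHD_BSY;
    }
    return 0x50;  /* drive ready + seek complete, not busy */
}
```

A driver that skips a wait like this and issues the next transfer at once
would be doing exactly what I suspect.  A wait that never times out would
hang instead, which is why the sketch bounds the number of polls.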

ISC, any comment?

IDB
(714) 856-6693  ICS Dept/ UC Irvine, Irvine CA 92717

tim@comcon.UUCP (Tim Brown) (08/21/90)

In article <9008171007.aa06635@PARIS.ICS.UCI.EDU>, baxter@zola.ICS.UCI.EDU (Ira Baxter) writes:
> 
> I have a Micronics 20MHz 32KB cache 386 with WD1006V-SR2 controller,
> running ISC 2.0.2, HPDD configured with the standard file system.
> Under normal operation, I get an occasional (1/month) latchup of the WD1006
> which I have unsuccessfully been trying to track down for months.  The
[ complete description deleted ]

The WD1006V-SR2 has a known history of locking up with ISC.  Something
to do with the card being unable to recover from failed seek-aheads.
As I understand it, the 1006 does multiple seeks, and if one fails it
*should* fall back to single seeks, which it doesn't do correctly,
so it locks up.  BTW, ISC *said* they would have a workaround in
the 2.2 kernel.  They don't.  I have seen the same thing with 2.2.
The only fix I know of is to get another controller.  I went with
Adaptec.  I have an associate who is going to switch a DTC7287 for a
WD1006.

Anyone have any experience with the DTC7287?

-- 
Tim Brown            |
Computer Connection  |
towers!comcon!tim    |

dougp@ico.isc.com (Doug Pintar) (08/27/90)

In article <483@comcon.UUCP> tim@comcon.UUCP (Tim Brown) writes:
>The WD1006V-SR2 has a known history of locking up with ISC.  Something
>to do with the card being unable to recover from failed seek-aheads.
>As I understand it, the 1006 does multiple seeks, and if one fails it
>*should* fall back to single seeks, which it doesn't do correctly,
>so it locks up.  BTW, ISC *said* they would have a workaround in
>the 2.2 kernel.  They don't.  I have seen the same thing with 2.2.
>The only fix I know of is to get another controller.

Well, this is kinda sorta true.  The HPDD, by default, WILL do overlapped
seeks on multi-drive AT-controller systems.  To pull this stunt off, it counts
on the controller doing retries if a data transfer request gets a 'drive busy'
error when it's still performing a previously-requested seek operation.  The
WD series of controllers had been able to do this since the 1001, 'way back
when.  Somehow it got broken in some revs of the 1006.  The fix, which has
been around as long as the HPDD and *requires* no change in the product, is
to change the file /etc/conf/pack.d/dsk/space.c AFTER configuring the HPDD
for your system.  In the 'disk_config_tbl' entry for either a primary or
secondary AT hard disk is a line that looks like:
	(CCAP_RETRY | CCAP_ERRCOR), /* capabilities */
change this to be:
	(CCAP_RETRY | CCAP_ERRCOR | CCAP_NOSEEK), /* capabilities */
to cripple the overlapped seek stuff.  This should fix the problem if it's
really a multi-drive seek condition that's doing you in.
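
As a rough illustration of what that one-line change does, the sketch
below treats the capabilities field as a plain bitmask and tests it the
way a driver might before overlapping seeks.  Only the CCAP_* names come
from space.c as quoted above.  The bit values and the helper
overlapped_seeks_allowed are made up for the example; the real
definitions live in the HPDD headers and may differ.

```c
/* HYPOTHETICAL bit values, chosen only so the example compiles.
 * The real CCAP_* definitions come from the HPDD headers. */
#define CCAP_RETRY   0x01u  /* controller retries on 'drive busy' */
#define CCAP_ERRCOR  0x02u  /* controller does error correction */
#define CCAP_NOSEEK  0x04u  /* do not use overlapped seeks */

/* Hypothetical helper: nonzero if overlapped seeks may be issued
 * for a controller with the given capability mask. */
static int overlapped_seeks_allowed(unsigned caps)
{
    return (caps & CCAP_NOSEEK) == 0;
}
```

With the default mask (CCAP_RETRY | CCAP_ERRCOR) the helper returns 1,
and with CCAP_NOSEEK OR'ed in it returns 0, which is the "crippled"
behaviour described above.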

Good luck,
DLP

raymond@ele.tue.nl (Raymond Nijssen) (08/30/90)

In article <1990Aug27.161420.11723@ico.isc.com> dougp@ico.ISC.COM (Doug Pintar) writes:
>In article <483@comcon.UUCP> tim@comcon.UUCP (Tim Brown) writes:
>>The WD1006V-SR2 has a known history of locking up with ISC.  Something
>>to do with the card
  [..... details deleted .....]
>>The only fix I know of is to get another controller.
>The WD series of controllers had been able to do this since the 1001, 'way back
>when.  Somehow it got broken in some revs of the 1006.
                              ^^^^^^^^^^^^
Ha! Now I'm getting really curious! Tell more! Which revs for instance, and
how can they be identified? And do you mean that other revs are ok?

For instance, I have a WD1006V-SR2 controller here which locks up with ISC.
On top of the largest chip on the card it says in caps: PROTO
Should I interpret this as it being just a prototype chip, which would 
presumably be quite unstable and full of bugs? My vendor told me that he had
never seen such controllers without this 'remark', but now he says he is 
trying to get one for me without it. Is it likely that he'll succeed in this?
To be complete, I'll copy the whole text on the chip:
          (c) WDC'87
          WD42C22A-JU
          10-2 8836                    <- it's made in the 36th week of 1988
          668721105
          PROTO
and on a small label, it says: WD1006V-SR2 F002 X8   (The X8 might be another
revision/generation mark)
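
For anyone comparing chips, this is how I am reading the four-digit date
code, written out as a small C sketch.  It assumes the code is YYWW (two
digits of year, then the week number) and that the century is 19xx;
both are my assumptions, not anything WD has confirmed, and
decode_date_code is just an illustrative name.

```c
/* Split a four-digit YYWW date code into year and week.
 * Assumes a 19xx year.  Returns 0 on success, -1 if the code is out of
 * range or the week is not between 1 and 53. */
static int decode_date_code(int code, int *year, int *week)
{
    int yy = code / 100;
    int ww = code % 100;

    if (code < 0 || code > 9999 || ww < 1 || ww > 53)
        return -1;

    *year = 1900 + yy;
    *week = ww;
    return 0;
}
```

Called with 8836 it yields year 1988 and week 36, matching the note next
to the chip text above.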

What I would like to know is whether anyone is running ISC with a 1006 RLL
controller that does not lock up, and what it says on their controller.
Maybe someone from WD could comment on this.  In any case, I'll post a
summary of the responses I hope to get.
______________________________________________________________________________
Raymond X.T. Nijssen  / Don't speak if you  / Oh VMS, please forgive me all
raymond@ele.tue.nl   / speak for yourself  / unfriendly things I said about you