[comp.protocols.tcp-ip.ibmpc] Ethernet cards, PC/NFS, DS8390

clements@bbn.com (Bob Clements) (09/27/89)

I've just wasted a fair amount of time chasing a problem.  I'll briefly
summarize it, to try to save others the time and pain, and then ask
a question.

On an IBM-PC using PC-NFS, I have been getting a steady low level
of obscure errors from the Microsoft Linker, and occasionally an
error from the C compiler, caused by non-repeatable data errors.
They recently became more frequent and I decided to track them
down to see whether I had a sick PC or network or a conflicting
TSR or what.

I switched PC mainframes from a 386 to a 286.  Problem still
there.  Reinstalled PC-NFS.  Still bad.  Used COPY and DIFF to
capture some bad data and then examined it.  Observed a pattern
of 15 or 16 bytes being copied over another group of 15 or 16
bytes at a location 64 bytes later in the file. (If you've been
there before, this probably tells you the answer.  It didn't tell
me, yet.) I swapped ethernet cards.  The problem went away.  (I
hadn't swapped them before because I didn't want to bother
updating the ethers files.)
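For anyone repeating this diagnosis, the COPY-and-DIFF comparison can be automated. The following is a minimal Python sketch, not a tool from PC-NFS or the Clarkson drivers. It assumes you have a known-good copy and a corrupted copy of the same file as bytes, and the names `find_runs` and `classify` are hypothetical. It reports each run of differing bytes and whether that run equals the good data 64 bytes earlier, which is the signature described above.

```python
def find_runs(good: bytes, bad: bytes):
    """Yield (start, length) for each maximal run of differing bytes."""
    n = min(len(good), len(bad))
    i = 0
    while i < n:
        if good[i] == bad[i]:
            i += 1
            continue
        start = i
        while i < n and good[i] != bad[i]:
            i += 1
        yield start, i - start


def classify(good: bytes, bad: bytes, shift: int = 64):
    """For each differing run, note whether it equals the good data `shift`
    bytes earlier, i.e. whether earlier data was copied over this spot."""
    report = []
    for start, length in find_runs(good, bad):
        src = start - shift
        echoed = src >= 0 and bad[start:start + length] == good[src:src + length]
        report.append((start, length, echoed))
    return report


# Hypothetical test data: 15 bytes from offset 236 reappear at offset 300.
good_copy = bytes(range(256)) * 2
bad_copy = bytearray(good_copy)
bad_copy[300:315] = good_copy[236:251]
print(classify(good_copy, bytes(bad_copy)))  # [(300, 15, True)]
```

A copied byte can happen to equal the byte it overwrites, which shortens or splits a run; each piece still matches the data 64 bytes earlier, so the check still flags it.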

Analysis:

    PC-NFS (like all Sun NFSs) is implemented over UDP, but with
    UDP checksumming turned off.  (^%$&@^#!$% !!!)

    If an ethernet packet gets clobbered, the error may therefore
    not be detected by NFS, since it isn't checksumming.
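To make the consequence concrete, here is a minimal Python sketch of the RFC 1071 Internet checksum that UDP uses, plus the RFC 768 rule that a checksum field of zero means the sender did not compute one. The function names are hypothetical, and the real UDP checksum also covers a pseudo-header (source and destination addresses, protocol, length), which is omitted here for brevity. The demo applies the 64-byte-offset corruption described earlier to a small test buffer.

```python
def inet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: ones' complement of the ones' complement
    sum of the data taken as 16-bit big-endian words."""
    if len(data) % 2:
        data = data + b"\x00"  # pad an odd trailing byte with zero
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def udp_field_for(payload: bytes) -> int:
    """Checksum field a sender would transmit (pseudo-header omitted).
    A computed 0 is sent as 0xFFFF, since 0 on the wire means "not computed"."""
    value = inet_checksum(payload)
    return value if value != 0 else 0xFFFF


def receiver_accepts(payload: bytes, field: int) -> bool:
    """Receiver check: with a zero field there is nothing to verify."""
    if field == 0:
        return True  # checksumming off: any corruption is passed up to NFS
    return udp_field_for(payload) == field


# Simulate the failure: 15 bytes reappear 64 bytes later, overwriting data.
good = bytes(range(128))
bad = bytearray(good)
bad[80:95] = good[16:31]
bad = bytes(bad)

print(receiver_accepts(bad, udp_field_for(good)))  # False: checksum catches it
print(receiver_accepts(bad, 0))                    # True: checksum off, accepted
```

With a nonzero checksum field the receiver can drop the damaged datagram and the client's normal retransmission recovers the data; with the field set to zero, the bad bytes go straight into the file.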

    I can't find my reference, but I believe I've heard that this
    particular failure mode is one present in early revs of the
    DS8390 ethernet chip. (I sure hope my memory is right;  if
    not, I'm unfairly maligning National. They did have a number
    of glitches and I THINK this is one of them.)

    We have been having some broadcast storms lately, increasing
    the odds of this failure and causing the recent increase in
    symptoms.

    The new card has a newer rev DS8390 (8824 versus 8742C4 date code).

I should have recognized the failure mode earlier.  I wrote the
WD8003E and 3C503 drivers in the Clarkson collection (though they're
not what I was using with PC-NFS, of course) so I should have known.

My questions (which I should get purchasing to check, but as long
as I have the floor): 1) Anyone know where I can get a new
date code DS8390 in quantity one?   2) Any idea if Western Digital
will upgrade this WD8003E after it's out of warranty? (It's not
their fault, of course.)

Bob Clements, K1BC, clements@bbn.com

clements@bbn.com (Bob Clements) (10/06/89)

In article <46100@bbn.COM> clements@bbn.com (Bob Clements) (me) writes:
>I've just wasted a fair amount of time chasing a problem.  I'll briefly
>summarize it, to try to save others the time and pain, and then ask
>a question. [...]

This is a followup to my message about problems with PC-NFS,
a WD8003E Ethernet card and an apparently out-of-date 8390
Ethernet chip.  It's a progress report, a bit long, and not
yet conclusive.  Those not interested in all this, skip this
message now...


In <46100@bbn.com> I described a specific failure I was having
with errors in files read by PC-NFS over the ethernet.  The
failure was quite specific: 15 bytes of data were duplicated 64
bytes after their correct appearance, replacing 15 other bytes of
data.  The problem stayed the same in a couple of different PCs
and after re-installing the relevant software.  It went away when
I replaced the ethernet card with another supposedly identical
one.  One difference was that the working card had a much newer
ethernet controller chip, the National 8390.  Knowing that early
8390s had problems, I concluded that the different chip was the
cause of my failures, and asked the net for advice on getting
replacements.

The net is wonderful.  Some comments were posted and much more
info came in by private email. (I won't use the names of those
who sent private email.)

A number of people reported similar problems.  Some reported that
they replaced the card with others that had later rev 8390s and
that this solved their problems, just as in my case.  Two people
(one at National Semiconductor, though not in the group that
designed the 8390) reported that, yes, I had described a problem
which existed in early 8390 chips.

One very generous person at another company offered to send me
a new-rev 8390 to replace the one I had trouble with.  (By the
way, it's a "DP8390", not "DS8390" as I initially called it.)

So, I figured, I can wrap this up.  Put the new 8390 on the
failing card, test it, and report to the net.

I received the donated chip and installed it.  (Not easy -- I had
to remove and replace an un-socketed 48-pin DIP.)  Fired it up.
It communicated!  So I hadn't destroyed the card or chip.  Ran
the heavy load test.  It FAILED, exactly like the original chip!
Same exact symptom.

So there's something else wrong with the card.  I don't know
what, yet.  But I figured I had better send a progress report,
to correct my initial analysis which blamed the DP8390.

Here are some further facts I've learned:  The early DP8390 chips
did have failures under load.  I found documentation of other
failure patterns and some hangs, but NOT this specific pattern.
But see the above comments from private email confirming this
problem.  (One correspondent asked for a specific written
citation of the problem.  I couldn't find one for this particular
failure, but others are described in a 3Com tech manual for the
3C503, which also uses this chip, and in WD's driver software
sources, which WD releases through dealers.  Another email
correspondent is getting a bug list from his National
Semiconductor contacts and will report.)

The latest design rev of the 8390 is "C" and that is supposed to
be OK.  The original one on the card that fails was a "B" chip,
and the one on the card that works is also a "B", but with a
later date code.  The new donated one is a "C", now on the
still-failing card.  (The rev letter is in the part number,
"DP8390BN" or "DP8390CN", the "N" meaning plastic DIP.)
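The markings can be read mechanically. A minimal Python sketch follows; the helper names are hypothetical, and reading the first four digits of the date code as year and week (YYWW) is an assumption, consistent with calling 8742 a 1987 date code but not confirmed by this post. The sample markings are the ones listed in the card details later in this post.

```python
import re

# Hypothetical helpers for reading the chip markings listed in this post.
PART_RE = re.compile(r"DP8390([A-Z])([A-Z])")  # design rev, then package
DATE_RE = re.compile(r"(\d{2})(\d{2})")         # assumed YYWW date code


def decode_part(marking: str):
    """Split e.g. "DP8390BN" into (rev, package); "N" means plastic DIP."""
    m = PART_RE.fullmatch(marking)
    if m is None:
        raise ValueError(f"not a DP8390 marking: {marking!r}")
    return m.group(1), m.group(2)


def decode_date(code: str):
    """Return (year, week) from the first four digits, assuming YYWW."""
    m = DATE_RE.search(code)
    if m is None:
        raise ValueError(f"no date code in {code!r}")
    return 1900 + int(m.group(1)), int(m.group(2))


chips = [
    ("failing card, original chip", "+B8742C4", "DP8390BN"),
    ("failing card, donated chip", "+B8924F", "DP8390CN"),
    ("working card", "+B8824", "DP8390BN"),
]
for label, date, part in chips:
    rev, pkg = decode_part(part)
    year, week = decode_date(date)
    print(f"{label}: rev {rev}, package {pkg}, {year} week {week}")
```

Laid out this way, the result of the experiment is plain: the failing card fails with both a rev B and a rev C chip, so the 8390 revision alone cannot account for the failure.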

I spoke to Western Digital's support group.  They said that they
repair out-of-warranty cards for a flat $75 fee, but they would
not replace the DP8390 just because it had a 1987 date code.
They had to see it fail.  I was, at that time, convinced the 8390
was bad, so this bothered me.  Now it looks like something else
is wrong, so replacing the 8390 would not have helped (and it did
not, as I've now proved).  WD claimed that there are software
workarounds for all the 8390 errors and therefore my software
(PC-NFS 3.0) must be bad.

A correspondent from Sun Microsystems' PC-NFS group commented
that he didn't go along with that analysis.  (He didn't, however,
agree with my feeling that running NFS over non-checksummed UDP
was a big loser.)

Unless I get more inspiration, I think I'll just use the failing
card on my lightly loaded three-node ethernet at home, which runs
only TCP, so the TCP checksums will save me from occasional
failures.  If anyone has inspiration to offer, I'll do more
experiments.

Just for excruciating completeness, here are the details on
the two cards I've been working with. [I have two more from
the early manufacturing date.  They are in the at-home net and
I haven't taken them in to the office to see whether they fail
the same way.]

    Failing card before chip replacement:
	Hardware address        Chip date and rev
	00:00:c0:c5:64:10       +B8742C4   DP8390BN  NS32490BN
    Failing card after chip replacement:
	Hardware address        Chip date and rev
	00:00:c0:c5:64:10       +B8924F    DP8390CN  NS32490CN
    Working card:
	Hardware address        Chip date and rev
	00:00:c0:37:04:10       +B8824     DP8390BN  NS32490BN

So: My apologies to National.  It looks as though this failure
is elsewhere on the WD8003E.

My thanks to the net for advice and reports, even though some of
the reports seemed to absolutely confirm my first analysis.

My thanks to the gentleman who sent me the new DP8390CN!

And if anyone knows what is really broken, let us know, because
some other netters have reported similar failures and they would
like to know, too.

[Sorry to go on at such length.  I felt it was only fair to give
a thorough followup.]

Bob Clements, K1BC, clements@bbn.com