clements@bbn.com (Bob Clements) (09/27/89)
I've just wasted a fair amount of time chasing a problem. I'll briefly summarize it, to try to save others the time and pain, and then ask a question.

On an IBM-PC using PC-NFS, I have been getting a steady low level of obscure errors from the Microsoft Linker, and occasionally an error from the C compiler, caused by non-repeatable data errors. They recently became more frequent and I decided to track them down to see whether I had a sick PC or network or a conflicting TSR or what.

I switched PC mainframes from a 386 to a 286. Problem still there. Reinstalled PC-NFS. Still bad. Used COPY and DIFF to capture some bad data and then examined it. Observed a pattern of 15 or 16 bytes being copied over another group of 15 or 16 bytes at a location 64 bytes later in the file. (If you've been there before, this probably tells you the answer. It didn't tell me, yet.)

I swapped ethernet cards. The problem went away. (I hadn't swapped them before because I didn't want to bother updating the ethers files.)

Analysis: PC-NFS (like all Sun NFSs) is implemented over UDP, but with UDP checksumming turned off. (^%$&@^#!$% !!!) If an ethernet packet gets clobbered, the error may therefore not be detected by NFS, since it isn't checksumming.

I can't find my reference, but I believe I've heard that this particular failure mode is one present in early revs of the DS8390 ethernet chip. (I sure hope my memory is right; if not, I'm unfairly maligning National. They did have a number of glitches and I THINK this is one of them.)

We have been having some broadcast storms lately, increasing the odds of this failure and causing the recent increase in symptoms. The new card has a newer rev DS8390 (8824 versus 8742C4 date code).

I should have recognized the failure mode earlier. I wrote the WD8003E and 3C503 drivers in the Clarkson collection (though they're not what I was using with PC-NFS, of course) so I should have known.

My questions (which I should get purchasing to check, but as long as I have the floor):

1) Anyone know where I can get a new date code DS8390 in quantity one?
2) Any idea if Western Digital will upgrade this WD8003E after it's out of warranty? (It's not their fault, of course.)

Bob Clements, K1BC, clements@bbn.com
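
The COPY-and-DIFF comparison described in the post above can be automated. What follows is only a minimal sketch in Python, not anything the poster used: the function name `find_shifted_copies` and the `min_run` threshold are hypothetical, and the 64-byte offset is taken from the reported symptom. It assumes a known-good copy and a suspect copy of the same file are both available, and it lists places where the suspect copy differs from the good one by repeating bytes that appeared 64 bytes earlier.

```python
def find_shifted_copies(good, bad, offset=64, min_run=8):
    """Return (position, length) pairs where `bad` differs from `good`
    and instead repeats the bytes found `offset` bytes earlier in `good`.

    `offset` and `min_run` are illustrative defaults based on the
    reported symptom (15 or 16 bytes reappearing 64 bytes later).
    """
    hits = []
    n = min(len(good), len(bad))
    i = offset
    while i < n:
        # Only start a run at a byte that is actually wrong.
        if bad[i] == good[i]:
            i += 1
            continue
        # Extend the run while the bad data matches the earlier good data.
        run = 0
        while i + run < n and bad[i + run] == good[i + run - offset]:
            run += 1
        if run >= min_run:
            hits.append((i, run))
        i += max(run, 1)
    return hits
```

Reading both copies with `open(path, "rb").read()` and passing them in would give a list of suspect offsets. A run can come out slightly longer or shorter than the true damage when neighbouring bytes happen to match by coincidence, so the output is a pointer for manual inspection rather than proof.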
freiss@nixpbe.UUCP (freiss) (09/28/89)
clements@bbn.com (Bob Clements) writes:

>I've just wasted a fair amount of time chasing a problem. I'll briefly
>summarize it, to try to save others the time and pain, and then ask
>a question.

Thanks - it did indeed save me a lot of pain. The problem you described is very similar to the one that had me stumped some days ago.

> I can't find my reference, but I believe I've heard that this
> particular failure mode is one present in early revs of the
> DS8390 ethernet chip. (I sure hope my memory is right; if
> not, I'm unfairly maligning National. They did have a number
> of glitches and I THINK this is one of them.)
[...]
> The new card has a newer rev DS8390 (8824 versus 8742C4 date code).

Miracle! It solves all my problems too. Looks like I was on the wrong path altogether - I was busy reverse engineering PC-NFS, and the problem is purely hardware.

If it's not too much to ask, could you post where you read about that particular failure mode in early 8390s? I face the problem of telling 200+ people in our R & D department that they can't use PC-NFS because we've bought "bad" ethernet cards, and something in print could save my life :-) Judging from the quantity of mail I got after I posted my problem, this affects quite a lot of people.

>My questions (which I should get purchasing to check, but as long
>as I have the floor): 1) Anyone know where I can get a new
>date code DS8390 in quantity one? 2) Any idea if Western Digital
>will upgrade this WD8003E after it's out of warranty? (It's not
>their fault, of course.)

Don't know where you can get quantity one chips, but we sure are going to argue with our friendly neighbourhood WD representative about replacements / upgrades.

NFS not using UDP checksumming is one of my pet gripes too - usually the lower protocol layers are safe enough on an ethernet, but if you NFS thru bridges and routers, oh brother... Still, so far it has worked ok.

>Bob Clements, K1BC, clements@bbn.com

-Martin
--
Martin Freiss          UUCP: USA: ..!uunet!philabs!linus!nixbur!freiss.pad
Nixdorf Computer AG    !USA: ..!mcvax!unido!nixpbe!freiss.pad
Dept. DS-CC 22         NERV: freiss.pad  AMPR: dg5kx@db0bq
Pontanusstr. 55        Voice: +49 5251 14 6153  FAX: +49 5251 14 6108
D-4792 Paderborn       "Drink wet cement, get stoned."
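
Both posters point at disabled UDP checksumming as the reason the damage went unnoticed. As a rough illustration of what an end-to-end check would have caught, here is a sketch of the 16-bit one's-complement Internet checksum that UDP uses. It is simplified on purpose: the function `internet_checksum` is an illustrative name, and it sums only a payload, whereas a real UDP checksum also covers the UDP header and an IP pseudo-header.

```python
def internet_checksum(data):
    """16-bit one's-complement checksum of `data` (bytes).

    Simplified: a real UDP checksum also covers the UDP header and
    an IP pseudo-header, which are left out of this sketch.
    """
    if len(data) % 2:
        data += b"\x00"          # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]   # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Synthetic payload, then the same payload with 15 bytes overwritten
# by the bytes from 64 positions earlier (the symptom reported above).
payload = bytes(range(200))
damaged = bytearray(payload)
damaged[100:115] = payload[36:51]

print(hex(internet_checksum(payload)))
print(hex(internet_checksum(bytes(damaged))))
```

For this synthetic case the two checksums differ, so a receiver verifying the checksum would have discarded the damaged datagram. The one's-complement sum is not a strong check (it cannot see reordered 16-bit words, for instance), but it is far better than no check at all.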
clements@bbn.com (Bob Clements) (10/06/89)
In article <46100@bbn.COM> clements@bbn.com (Bob Clements) (me) writes:

>I've just wasted a fair amount of time chasing a problem. I'll briefly
>summarize it, to try to save others the time and pain, and then ask
>a question.
[...]

This is a followup to my message about problems with PC-NFS, a WD8003E Ethernet card and an apparently out-of-date 8390 Ethernet chip. It's a progress report, a bit long, and not yet conclusive. Those not interested in all this, skip this message now...

In <46100@bbn.com> I described a specific failure I was having with errors in files read by PC-NFS over the ethernet. The failure was quite specific: 15 bytes of data were duplicated 64 bytes after their correct appearance, replacing 15 other bytes of data.

The problem stayed the same in a couple of different PCs and after re-installing the relevant software. It went away when I replaced the ethernet card with another supposedly identical one. One difference was that the working card had a much newer ethernet controller chip, the National 8390. Knowing that early 8390s had problems, I concluded that the different chip was the cause of my failures, and asked the net for advice on getting replacements.

The net is wonderful. Some comments were posted and much more info came in by private email. (I won't use the names of those who sent private email.)

A number of people reported similar problems. Some reported that they replaced the card with others that had later rev 8390s and that this solved their problems, just as in my case. Two persons (one at National Semiconductor, though not in the group that designed the 8390) reported that, yes, I had described a problem which existed in early 8390 chips.

One very generous person at another company offered to send me a new-rev 8390 to replace the one I had trouble with. (By the way, it's a "DP8390", not "DS8390" as I initially called it.)

So, I figured, I can wrap this up. Put the new 8390 on the failing card, test it, and report to the net.

I received the donated chip and installed it. (Not easy -- I had to remove and replace an un-socketed 48-pin DIP.) Fired it up. It communicated! So I hadn't destroyed the card or chip. Ran the heavy load test. It FAILED, exactly like the original chip! Same exact symptom.

So there's something else wrong with the card. I don't know what, yet. But I figured I had better send a progress report, to correct my initial analysis which blamed the DP8390.

Here are some further facts I've learned:

The early DP8390 chips did have failures under load. I found documentation of other failure patterns and some hangs, but NOT this specific pattern. But see the above comments from private email confirming this problem. (One correspondent asked for a specific written citation of the problem. I couldn't find this particular one, but others are described in a 3-Com tech manual for the 3C503, which also uses this chip, and in WD's driver software sources which they release through dealers. Another email correspondent is getting a bug list from his National Semiconductor contacts and will report.)

The latest design rev of the 8390 is "C" and that is supposed to be OK. The original one on the card that fails was a "B" chip, and the one on the card that works is also a "B", but with a later date code. The new donated one is a "C", now on the still-failing card. (The rev letter is in the part number, "DP8390BN" or "DP8390CN", the "N" meaning plastic DIP.)

I spoke to Western Digital's support group. They said that they repair out-of-warranty cards for a flat $75 fee, but they would not replace the DP8390 just because it had a 1987 date code. They had to see it fail. I was, at that time, convinced the 8390 was bad, so this bothered me. Now it looks like something else is wrong, so replacing the 8390 would not have helped (and it did not, as I've now proved).

WD claimed that there are software workarounds for all the 8390 errors and therefore my software (PC-NFS 3.0) must be bad.
A correspondent from Sun Microsystems' PC-NFS group commented that he didn't go along with that analysis. (But didn't agree with my feeling that non-checksummed UDP for NFS was a big loser.)

Unless I get more inspiration, I think I'll just use the failing card on my three-node ethernet at home which is lightly loaded and only runs TCP where the checksums will save me from occasional failures. If anyone has inspiration to offer, I'll do more experiments.

Just for excruciating completeness, here are the details on the two cards I've been working with. [I have two more from the early manufacturing date. They are in the at-home net and I haven't taken them in to the office to see whether they fail the same way.]

Failing card before chip replacement:
    Hardware address     Chip date and rev
    00:00:c0:c5:64:10    +B8742C4 DP8390BN NS32490BN

Failing card after chip replacement:
    Hardware address     Chip date and rev
    00:00:c0:c5:64:10    +B8924F DP8390CN NS32490CN

Working card:
    Hardware address     Chip date and rev
    00:00:c0:37:04:10    +B8824 DP8390BN NS32490BN

So: My apologies to National. It looks as though this failure is elsewhere on the WD8003E.

My thanks to the net for advice and reports, even though some of the reports seemed to absolutely confirm my first analysis.

My thanks to the gentleman who sent me the new DP8390CN!

And if anyone knows what is really broken, let us know, because some other netters have reported similar failures and they would like to know, too.

[Sorry to go on at such length. I felt it was only fair to give a thorough followup.]

Bob Clements, K1BC, clements@bbn.com