achille@cernvax.UUCP (achille petrilli) (10/20/89)
Hi there,

We are experiencing a lot of problems on our Ethernet-based Apollos. The problem has been seen on both 3500s and 3000s with Ethernet as the primary (and, in most cases, only) network.

The node will lose contact with the network, at both the DDS and TCP/IP level. rtstat -dev shows enormous numbers for 'no resources', some 20000 per second (yes, twenty thousand!), but the node is not receiving even 20 per second (we checked that with an Ethernet analyzer).

We are running sr10.1 on the node we have investigated most closely. There are some 30 machines on the Ethernet, mostly running 9.7, plus 2 dn10000s (sr10.1), 1 3500 (sr10.1) and 1 3000 (sr10.1, secondary network).

We traced the problem down to dn3xxx to dn10k interactions. A way to reproduce the problem, 100% of the time, is to run this from the dn3xxx:

    ls -l //dn10k

This will slowly start reporting that it cannot find some directories (that are there), and the number of 'no resources' will skyrocket. At that point the dn3xxx is gone. The dn10k, by contrast, is perfectly happy.

Has anybody seen this? Is there any patch available? Our SSR cannot reproduce the problem at their site (they don't have a dn10k and they run on ring), and we are a little bit uncomfortable letting them try these things on our net :-)

The workaround we've found is NOT to access any dn10k from dn3xxx Ethernet-based nodes (which of course defeats the whole purpose of networking). For the time being that's OK, given the small number of dn10ks and sr10 machines, but what can we do in a few months' time when everybody goes to sr10?

Help !!!

Thanks in advance,
Achille Petrilli
Cray & PWS Operations
dbfunk@ICAEN.UIOWA.EDU (David B Funk) (10/21/89)
WRT posting <1127@cernvax.UUCP>

> Hi there, we are experiencing lot of problems on ours, ethernet based,
> Apollos.
> The problem has been seen both on 3500 and 3000 with ethernet as primary
> (and only in most cases) network.
> The node will loose contact with the network, both at DDS and tcp/ip level,
> rtstat -dev shows enormous numbers for 'no resources', some 20000 per second

I can think of 2 possible causes of this problem:

1) The "ethernet8_microcode" that was shipped with sr10.1 is seriously flawed. The sr9.7 version was not perfect, but not nearly as bad as the 10.1 version. The problem is worst in a combined DDS & IP Ethernet environment; if you are only running IP, the sr9.7 version was OK and the sr10.1 version was marginal. There are various sr10.1 patches out for this, but most of them aren't worth messing with. The best solution is to get a copy of "/sys/ethernet8_microcode" from a sr10.2 system, even from the Beta1 sr10.2 release. This "ethernet8_microcode" works VERY well and can be safely installed on sr9.7 & sr10.x systems. We've been using it for 2 months now and are quite pleased with it. Talk to your local Apollo office; they may be able to get it for you. Just copy the file into /sys and reboot.

Here's a "rtstat" off one of our ring/E-net gateways; note the low E-net error rates:

    $ rtstat -dev -net
    ----------------------------------------------------------------
    80FF1500.12E88   pkts routed: 526964   queue oflo: 0

    Ring   pkts sent: 2559538   pkts rcvd: 2743208
        NACKs              987    WACKs            27872
        Xmit bus err         0    Xmit timeouts      303
        Token inserted      58    Rcv DMA EOR          0
        Rcv CRC error        0    Rcv timeouts         1
        Rcv bus error        0    Rcv xmtr error    1033
        towards net: 80FF1500   ref cnt: 2389382
        towards net: 80FF1300   ref cnt: 229121

    ETH802.3_AT   pkts sent: 309484   pkts rcvd: 266299
        Hdwr xmits     1700895    Hdwr rcvs      3304708
        CRC errors           0    Misalignments        5
        No resource          0    Over-run             2
        Adapter err          0    Full socket        694
        towards net: 80FF4000   ref cnt: 313524

2) There is a bug in the sr10.0 & sr10.1 implementation of the "rgyd". This can cause various strange problems that often look like the rgyd dying. When this happens, system operations that deal with user IDs or protections (like "getpwuid") may cause network retries. You say that "ls -l" will generate the problem; the "-l" option causes "rgyd" operations because of the need to extract the owner name. If you do a dn3k to dn10k network operation that doesn't involve "rgyd" operations, does the problem still happen? Try a utility like "/com/lst" (sr10; under sr9.7 it's "/systest/lst").

Dave Funk
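The Ethernet counters in the rtstat listing above are cumulative, so a figure like "20000 per second" comes from comparing two snapshots taken a known time apart. What follows is only a minimal sketch of that arithmetic, assuming the rtstat output has been captured as plain text and that the counter labels look like the ones in the listing. The function names (parse_counters, rates_per_second) are made up for illustration and are not part of any Apollo tool.

```python
import re

# Labels copied from the ETH802.3_AT section of the rtstat listing above.
ETHER_COUNTERS = (
    "CRC errors", "Misalignments", "No resource",
    "Over-run", "Adapter err", "Full socket",
)


def parse_counters(text):
    """Return {label: value} for each Ethernet error counter found in text."""
    counters = {}
    for label in ETHER_COUNTERS:
        # Assumes each label is followed by whitespace and an integer.
        match = re.search(re.escape(label) + r"\s+(\d+)", text)
        if match:
            counters[label] = int(match.group(1))
    return counters


def rates_per_second(before, after, seconds):
    """Per-second growth of each counter between two snapshots."""
    return {
        label: (after[label] - before[label]) / seconds
        for label in before
        if label in after
    }


# Counter values from the gateway listing above, flattened to one line.
SAMPLE = ("CRC errors 0 Misalignments 5 No resource 0 "
          "Over-run 2 Adapter err 0 Full socket 694")

if __name__ == "__main__":
    print(parse_counters(SAMPLE))
```

To use it, capture rtstat -dev twice on the affected node, parse both captures, and pass the two dictionaries plus the elapsed seconds to rates_per_second; a healthy node should show a "No resource" rate near zero, as on the gateway above.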
wescott@LNIC1.HPRC.UH.EDU (Andrew M. Wescott) (10/21/89)
Delete /sys/ethernet8_microcode on your SR 10.1 DN3xxx machines, and you'll probably be pleasantly surprised. Don't forget to reboot. There is a patch available, but things will work o.k. without it.
dbfunk@ICAEN.UIOWA.EDU (David B Funk) (10/24/89)
I want to clarify my posting <8910210447.AA02085@icaen.uiowa.edu> on the DDS on Ethernet problem:

> the sr9.7 was OK the sr10.1 was marginal. There are various sr10.1 patches
> out for this but most of them aren't worth messing with. The best solution
> is to get a copy of "/sys/ethernet8_microcode" from a sr10.2 system, even

The sr10.1 patch #m0017 is the patch that "isn't worth messing with". Patch #m0038 (from tape M68K_8907 or newer) is the same as the sr10.2 release microcode; this one DOES work and should be used.

My Apollogies for any confusion that this caused.

Dave Funk
marmen@is2.bnr.ca (Rob Marmen 1532773) (10/24/89)
In article <1127@cernvax.UUCP>, achille@cernvax.UUCP (achille petrilli) writes:
> The node will loose contact with the network, both at DDS and tcp/ip level,
> rtstat -dev shows enormous numbers for 'no resources', some 20000 per second
> (yes, twenty thousand !), but the node is not receiving even 20 per second
> (we checked that with an ethernet analyzer).
>
> We traced down the problem to be related to dn3xxx to dn10k interactions.
> A way to reproduce the problem, 100 %, is to do from the dn3xxx:
>
> ls -l //dn10k
>
> This will slowly start telling you that does not find some directories (that
> are there) and the number of 'no resources' will skyrocket.
> Now the dn3xxx is gone. The dn10k is instead perfectly happy.
>
> Achille Petrilli
> Cray & PWS Operations

I would check the following:

1) Ethernet microcode revision level. You should be running a version no earlier than March of this year. The previous microcode was very buggy. The code is stored in /sys/ethernet8_microcode.

2) Does the number of CRC and misalignment errors skyrocket as well? If so, then it may be the microcode, or you may have a bad connection between the two machines. I have seen drops (using utp) which technically check out o.k. but, because of a loose wire or connection, generate lots of bad packets.

rob...
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
| Robert Marmen            marmen@bnr.ca        OR         |
| Bell Northern Research   marmen%bnr.ca@cunyvm.cuny.edu   |
| (613) 763-8244           My opinions are my own, not BNRs |
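Check 2) above amounts to comparing two sets of counter readings and asking whether the CRC and misalignment counts rose along with everything else. A rough sketch of that comparison follows, assuming the readings have already been copied into dictionaries keyed by the labels rtstat prints. The function names and the numbers in the example are invented for illustration, and the sketch only reports what rose; the thread does not say what to conclude when "No resource" rises on its own.

```python
# Counters that check 2) above asks about.
LINK_ERROR_COUNTERS = ("CRC errors", "Misalignments")


def rising_counters(before, after):
    """Labels whose value increased between the two snapshots."""
    return sorted(
        label for label in before
        if after.get(label, before[label]) > before[label]
    )


def link_errors_rising(before, after):
    """True if CRC or misalignment errors rose between the snapshots."""
    rising = rising_counters(before, after)
    return any(label in rising for label in LINK_ERROR_COUNTERS)


# Made-up readings, purely to show the call; not taken from any real node.
EXAMPLE_BEFORE = {"CRC errors": 0, "Misalignments": 5, "No resource": 0}
EXAMPLE_AFTER = {"CRC errors": 0, "Misalignments": 5, "No resource": 200000}

if __name__ == "__main__":
    print(rising_counters(EXAMPLE_BEFORE, EXAMPLE_AFTER))
    print(link_errors_rising(EXAMPLE_BEFORE, EXAMPLE_AFTER))
```

If link_errors_rising comes back true, the two suspects named above (microcode or a bad connection) are the place to start.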