[comp.sys.apollo] Domain on Ethernet problem

achille@cernvax.UUCP (achille petrilli) (10/20/89)

Hi there, we are experiencing a lot of problems on our Ethernet-based
Apollos.

The problem has been seen both on 3500 and 3000 with ethernet as primary
(and only in most cases) network.
The node will lose contact with the network, both at DDS and tcp/ip level,
rtstat -dev shows enormous numbers for 'no resources', some 20000 per second
(yes, twenty thousand !), but the node is not receiving even 20 per second
(we checked that with an ethernet analyzer).
We are running sr10.1 on one of the nodes we've been investigating more.
There are some 30 machines on ethernet, mostly running 9.7, plus 2 dn10000 (sr10.1)
and 1 3500 (sr10.1) and 1 3000 (sr10.1, secondary network).
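
The mismatch described above (a 'no resources' count climbing far faster
than packets arrive) can be checked by reading a counter twice and dividing
the difference by the interval. A minimal Python sketch, assuming the readings
are copied by hand from two successive rtstat -dev runs; the function name
counter_rate and all sample numbers are hypothetical, not measured values:

```python
def counter_rate(first, second, seconds):
    """Return the average increase per second between two counter readings."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return (second - first) / seconds


# Hypothetical readings taken 10 seconds apart (not real measurements).
no_resource_rate = counter_rate(1_000_000, 1_200_000, 10)
received_rate = counter_rate(5_000, 5_150, 10)

# A 'no resources' rate far above the receive rate cannot be explained
# by incoming traffic alone.
print(no_resource_rate, received_rate, no_resource_rate > received_rate)
```

If the first rate exceeds the second by orders of magnitude, as reported
here, the counter is not tracking real incoming packets.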

We traced down the problem to be related to dn3xxx to dn10k interactions.
A way to reproduce the problem, 100 %, is to do from the dn3xxx:

	ls -l //dn10k

This will slowly start telling you that it cannot find some directories (that
are there) and the number of 'no resources' will skyrocket.
Now the dn3xxx is gone. The dn10k is instead perfectly happy.

Has anybody seen that ? Is there any patch available ?
Our SSR cannot reproduce the problem at their site (they don't have a dn10k and
they run on ring) and we are a little bit uncomfortable letting them
try these things on our net :-)

The work-around we've found is to NOT access any dn10k from dn3xxx ethernet
based nodes (which of course defeats the whole purpose of networking).
For the time being it's OK given the small number of dn10ks and of sr10
machines, but what can we do in a few months' time when everybody goes to
sr10 ?

Help !!!
Thanks in advance,
	Achille Petrilli
	Cray & PWS Operations

dbfunk@ICAEN.UIOWA.EDU (David B Funk) (10/21/89)

WRT posting <1127@cernvax.UUCP>

> Hi there, we are experiencing a lot of problems on our Ethernet-based
> Apollos.
> The problem has been seen both on 3500 and 3000 with ethernet as primary
> (and only in most cases) network.
> The node will lose contact with the network, both at DDS and tcp/ip level,
> rtstat -dev shows enormous numbers for 'no resources', some 20000 per second

I can think of 2 possible causes of this problem:

1)  The "ethernet8_microcode" that was shipped with sr10.1 is seriously flawed.
    The sr9.7 version was not perfect but not nearly as bad as the 10.1. It
    is worst in a DDS & IP Ethernet environment; if you are only running IP,
    the sr9.7 was OK and the sr10.1 was marginal. There are various sr10.1 patches
    out for this but most of them aren't worth messing with. The best solution
    is to get a copy of "/sys/ethernet8_microcode" from a sr10.2 system, even
    from the Beta1 sr10.2 release. This "ethernet8_microcode" works VERY well
    and can be safely installed on sr9.7 & sr10.x systems. We've been using
    it for 2 months now and are quite pleased with it. Talk to your local
    Apollo office, they may be able to get it for you. Just copy the file
    into /sys and reboot. Here's a "rtstat" off one of our ring/E-net
    gateways, note the low E-net error rates:

  $ rtstat -dev -net

  ----------------------------------------------------------------
  80FF1500.12E88   pkts routed:    526964   queue oflo:        0

   Ring            pkts sent:     2559538   pkts rcvd:   2743208
                   NACKs              987   WACKs          27872
                   Xmit bus err         0   Xmit timeouts    303
                   Token inserted      58   Rcv DMA EOR        0
                   Rcv CRC error        0   Rcv timeouts       1
                   Rcv bus error        0   Rcv xmtr error  1033
                   towards net:  80FF1500   ref cnt:     2389382
                   towards net:  80FF1300   ref cnt:      229121

   ETH802.3_AT     pkts sent:      309484   pkts rcvd:    266299
                   Hdwr xmits      1700895  Hdwr rcvs     3304708
                   CRC errors           0   Misalignments      5
                   No resource          0   Over-run           2
                   Adapter err          0   Full socket      694
                   towards net:  80FF4000   ref cnt:      313524


2)  There is a bug in the sr10.0 & sr10.1 implementation of the "rgyd".
    This can cause various strange problems that often look like the
    rgyd dying. When this happens, system operations that deal with
    user IDs or protections (like "getpwuid") may cause network retries.
    You say that "ls -l" will generate the problem; the "-l" option causes
    "rgyd" operations because of the need to extract the owner name.
    If you do a dn3k to dn10k network operation that doesn't involve
    "rgyd" operations, does the problem still happen? Try a utility
    like "/com/lst" (sr10; under sr9.7 it's "/systest/lst").
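
The difference between a long listing and a plain one, as described in
point 2, can be illustrated in Python on a generic Unix system. This is only
a sketch of the general idea and not Domain/OS code: pwd.getpwuid stands in
for the owner-name lookup, and the helper names long_listing and
short_listing are hypothetical.

```python
import os
import pwd


def long_listing(directory):
    """Return (name, owner) pairs, resolving each owner as 'ls -l' must."""
    rows = []
    for name in sorted(os.listdir(directory)):
        uid = os.lstat(os.path.join(directory, name)).st_uid
        try:
            owner = pwd.getpwuid(uid).pw_name  # the user-database lookup
        except KeyError:
            owner = str(uid)  # no entry for this uid: fall back to the number
        rows.append((name, owner))
    return rows


def short_listing(directory):
    """Return names only; no owner lookup is performed."""
    return sorted(os.listdir(directory))


for name, owner in long_listing("."):
    print(owner, name)
```

If only the long form triggers the failure, that points at the name lookup
rather than at the directory read itself.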

Dave Funk

wescott@LNIC1.HPRC.UH.EDU (Andrew M. Wescott) (10/21/89)

Delete /sys/ethernet8_microcode on your SR 10.1 DN3xxx machines,
and you'll probably be pleasantly surprised.  Don't forget to
reboot. There is a patch available, but things will work o.k.
without it.

dbfunk@ICAEN.UIOWA.EDU (David B Funk) (10/24/89)

I want to clarify my posting <8910210447.AA02085@icaen.uiowa.edu> on
the DDS on Ethernet problem:

> the sr9.7 was OK and the sr10.1 was marginal. There are various sr10.1 patches
> out for this but most of them aren't worth messing with. The best solution
> is to get a copy of "/sys/ethernet8_microcode" from a sr10.2 system, even

The sr10.1 patch #m0017 is the patch that "isn't worth messing with".
Patch #m0038 (from tape M68K_8907 or newer) is the same as the sr10.2
release microcode and this one DOES work and should be used.
My Apollogies for any confusion that this caused.

Dave Funk

marmen@is2.bnr.ca (Rob Marmen 1532773) (10/24/89)

In article <1127@cernvax.UUCP>, achille@cernvax.UUCP (achille petrilli) writes:
> The node will lose contact with the network, both at DDS and tcp/ip level,
> rtstat -dev shows enormous numbers for 'no resources', some 20000 per second
> (yes, twenty thousand !), but the node is not receiving even 20 per second
> (we checked that with an ethernet analyzer).
> 
> We traced down the problem to be related to dn3xxx to dn10k interactions.
> A way to reproduce the problem, 100 %, is to do from the dn3xxx:
> 
> 	ls -l //dn10k
> 
> This will slowly start telling you that it cannot find some directories (that
> are there) and the number of 'no resources' will skyrocket.
> Now the dn3xxx is gone. The dn10k is instead perfectly happy.
> 
> 	Achille Petrilli
> 	Cray & PWS Operations

I would check the following:

	1) Ethernet microcode revision level. You should be running a version
	   no earlier than March of this year. The previous microcode was very
	   buggy. The code is stored in /sys/ethernet8_microcode.

	2) Does the number of crc and misalignment errors skyrocket as well?
	   If so, then it may be microcode, or you have a bad connection
	   between the two machines. I have seen drops (using utp) which technically
	   check out o.k., but because of a loose wire or connection, will generate
	   lots of bad packets.
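
The check in point 2 can be done by pulling the counters out of a saved
rtstat listing and comparing the damaged-packet count with the hardware
receive count. A minimal Python sketch, assuming the layout of the Ethernet
block shown in the rtstat listing earlier in this thread and assuming fields
are separated by two or more spaces; parse_counters is a hypothetical helper,
not part of any Apollo tool:

```python
import re

# Ethernet block copied from the rtstat listing earlier in this thread.
SAMPLE = """\
   ETH802.3_AT     pkts sent:      309484   pkts rcvd:    266299
                   Hdwr xmits      1700895  Hdwr rcvs     3304708
                   CRC errors           0   Misalignments      5
                   No resource          0   Over-run           2
                   Adapter err          0   Full socket      694
"""


def parse_counters(text):
    """Collect 'label value' pairs, assuming fields are split by 2+ spaces."""
    counters = {}
    for line in text.splitlines():
        fields = re.split(r"\s{2,}", line.strip())
        # Pair each field with its right-hand neighbour; keep label/number pairs.
        for label, value in zip(fields, fields[1:]):
            if value.isdigit() and not label.isdigit():
                counters[label.rstrip(":")] = int(value)
    return counters


counters = parse_counters(SAMPLE)
bad = counters["CRC errors"] + counters["Misalignments"]
# Fraction of hardware receives that arrived damaged.
print(bad, counters["Hdwr rcvs"], bad / counters["Hdwr rcvs"])
```

A fraction that stays tiny while 'No resource' climbs would argue against a
cabling fault and for the microcode.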

rob...  



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
| Robert Marmen             marmen@bnr.ca  OR             |
| Bell Northern Research    marmen%bnr.ca@cunyvm.cuny.edu |
| (613) 763-8244         My opinions are my own, not BNRs |