[comp.dcom.lans] ether errors -- vaxen&4.3&suns&sequent&etc

david@ms.uky.edu (David Herron -- Resident E-mail Hack) (02/09/88)

hi ..

we've got a strange little problem.  we're noticing a fairly high
error rate on *some* of our machines and not others.

Our equipment is a mixed bag of:

	10 or so uVax2000's
	7 or 8 uVaxII's (some as servers, 2 as workstations) DEQNA
	an 11/750 DEUNA
	4 sun 3/60's and a 3/280
	a 26 processor sequent (ns32000 variety)
	5 AT&T 3b2's (SysVr3.1)
	2 or 3 AT&T 7300/3b1's
	2 IBM PC/RT's

All on an ethernet, gatewayed out through an Ungermann-Bass NIU
box which goes to a channel on an Ungermann-Bass broadband system.

The Vaxen (except for the 2000's) are running 4.3+NFS from
Mt Xinu -- we're one MR behind.  The 2000's are running Ultrix,
again one version behind.  I'm not sure which version the
Suns are running, but again I think we're one version "behind".

Anyway, the vaxen are seeing the error rates and the Suns
aren't.  We happened to be at a DEC presentation today, were
talking to a tech guy, and described the situation to him.
It tickled a memory in his brain about the Sun networking code
being "not right" in some way (and he immediately said he wasn't
bashing another vendor :-), like it was missing some piece of
code that Ultrix of course fully implemented.  (Oh, an aside,
this guy wasn't your typical DEC salesman who almost knows
what s/he's talking about ...)

I'm curious if this tickles anybody's memory and if they know
what he was talking about.
-- 
<---- David Herron -- The E-Mail guy            <david@ms.uky.edu>
<---- or:                {rutgers,uunet,cbosgd}!ukma!david, david@UKMA.BITNET
<----
<---- It takes more than a good memory to have good memories.

afb3@hou2d.UUCP (A.BALDWIN) (02/11/88)

In article <8277@e.ms.uky.edu>, david@ms.uky.edu (David Herron -- Resident E-mail Hack) writes:
> hi ..
> 
> we've got a strange little problem.  we're noticing a fairly high
> error rate on *some* of our machines and not others.
> Sun's are running but again I think we're one version "behind".
>
>.......
> 
> Anyway, the vaxen are seeing the error rates and the Sun's
> aren't.  ......
>
>....
>

I remember (when I was able to have a DEC hardware maintenance 
contract:-) that there was a problem with the DEQNA and ULTRIX.
I've never seen this problem on our ETHERNET (3 AT&T 3b2's, a MASScomp
5500, a uVAX-II, 2 Silicon Graphics IRIS workstations, 4+ Sun 
workstations, and two Integrated Solutions workstations).  However,
our network is not as large as the one described in the above article.
Also, we were/are running Ultrix 1.1.  Your release should already
have the fix for the problem this guy was talking about.

Maybe new problems were introduced.  I'm really interested, as we
are about to get a VAXstation 3500 and our ETHERNET is split
between two locations (we will put in a bridge soon).


Al Baldwin
AT&T-Bell Labs
...!ihnp4!hou2d!afb3


[These opinions are my own....Who else would want them!!!]

david@ms.uky.edu (David Herron -- Resident E-mail Hack) (02/16/88)

In article <8277@e.ms.uky.edu> david@ms.uky.edu (David Herron -- Resident E-mail Hack) writes:
>hi ..
>
>we've got a strange little problem.  we're noticing a fairly high
>error rate on *some* of our machines and not others.

Well, it seems I wasn't quite clear enough before ...

When I said "errors", I meant the number returned by the netstat program.
Yes, I know that it lumps a whole bunch of errors into one statistic.
But it's what we got.  (We are working on getting some other gadgets
set up -- but lack of money precludes getting a proper ethernet analyzer).

Here's a fairly typical minute or so of our ethernet:

| 21 - e:david --> netstat -i 5
    input   (qe0)     output          input  (Total)    output
packets errs  packets errs  colls packets errs  packets errs  colls
7363687 220240 3949440 2829  42243 7489031 220240 4074784 2829  42243
36      2     19      0     0     38      2     21      0     0
33      1     9       0     0     33      1     9       0     0
33      1     9       0     0     33      1     9       0     0
55      4     4       0     0     55      4     4       0     0
16      0     7       0     0     18      0     9       0     0
22      1     1       0     0     22      1     1       0     0
23      1     1       0     0     23      1     1       0     0
37      3     7       0     0     37      3     7       0     0
25      0     4       0     0     25      0     4       0     0
54      5     3       0     0     54      5     3       0     0
28      0     7       0     0     28      0     7       0     0
22      1     0       0     0     22      1     0       0     0
23      0     2       0     0     23      0     2       0     0
36      0     12      0     0     36      0     12      0     0
27      2     3       0     0     27      2     3       0     0
51      4     1       0     0     51      4     1       0     0
3       0     1       0     0     3       0     1       0     0
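
For what it's worth, the arithmetic behind calling that "a fairly high
error rate" can be sketched out.  This is only an illustration -- the
per-interval counters below are invented, not copied from the table
above:

```python
# Sketch: total up netstat-style per-interval counters and compute the
# input error rate.  Columns follow "netstat -i 5" (input packets/errs,
# output packets/errs, colls); the numbers here are made up.
samples = [
    # (in_pkts, in_errs, out_pkts, out_errs, colls)
    (36, 2, 19, 0, 0),
    (33, 1,  9, 0, 0),
    (55, 4,  4, 0, 0),
    (22, 1,  1, 0, 0),
]

in_pkts = sum(s[0] for s in samples)
in_errs = sum(s[1] for s in samples)
rate = 100.0 * in_errs / in_pkts
print("input error rate: %.1f%%" % rate)   # 5.5% for these made-up counters
```

Anything in the percent range, sustained, is a lot worse than the
small fractions of a percent a healthy interface should show.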


The machine in question is e.ms.uky.edu, a uVaxII which serves partly
as a file server, partly as our news machine, our primary domain
nameserver, and partly as the work machine for some of the staff.  The
active connections are mainly rlogin's -- I have a couple going at the
moment which are quiet -- an nntp (to harvard) and a couple of nfs
connections.  The board being used is a DEQNA, I'm not sure if it's
a "new" or "classic" DEQNA.  We do have one machine with both; we're
running it with the "new" DEQNA right now and it's showing the same
sort of error rates.

The Suns and the sequent are different both in the error rates and in
the type of error.  The sun has "error" rates a couple orders of
magnitude less than this, and the sequent has "error" rates a couple
orders of magnitude less than the sun.  Further, they both have a strong
tendency for collisions in preference to "errors".

Now, the sun (I'm sampling from the server machine) has 4 workstations
using both nfs and nd from it and seeing quite a bit of traffic:
frequent 30 second bursts of 100-300 packets a sec on input and, in
those same time slices, an output packet rate at about 2/3 the input.
In my watching right at this minute, the errors are predominantly
collisions with occasional "errors".  The collisions "seem" to be
periodic -- at times there are regular (every 30 seconds) bursts of
collision activity, with a high rate of input packets in those same
bursts.  I'm inclined to point a finger at rwho over that one.  On
the other hand, the pattern isn't there all the time.  Whoever told
me to get a lan analyzer so that I'm not guessing -- I see your point.
But we don't got the bucks right now.
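
A cheap substitute for the analyzer we can't afford: log the
per-interval collision counts and look at the spacing between bursts.
A rough sketch (the counts below are invented; a steady 30-second
spacing would finger something periodic like rwho):

```python
# Rough sketch of the eyeballing above: given per-5-second collision
# counts (as from repeated netstat samples), find the spacing between
# bursts to see whether something fires on a fixed period.
colls = [9, 0, 0, 0, 0, 0, 12, 0, 1, 0, 0, 0, 14]   # invented counts
INTERVAL = 5          # seconds per netstat sample

spikes = [i for i, c in enumerate(colls) if c >= 5]
gaps = [(b - a) * INTERVAL for a, b in zip(spikes, spikes[1:])]
print("burst spacing (seconds):", gaps)   # [30, 30] for this made-up data
```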

The sequent is showing the same sorts of activity as the Sun server,
except that it doesn't do nfs so therefore doesn't ever get those
bursts of 100-300 packets per second (or are these numbers from netstat
over the whole 5 second period?).

Yes we do have both DELNI's and DEMPR's.  We used to have a slightly
illegal configuration that had paths of >2 DELNI's, but now all of
our paths have a max of 2 DELNI's.

Picking out a uVax2000 at random, the "error" rate is in the .01%
range and ZERO collisions.

The last machine is the 11/750 with a DEUNA.  To begin with, it's 
a much quieter machine.  It serves out very little NFS and its
users don't go out to other machines very often.  But at any rate,
it does show the same sorts of error rates as the Suns and Sequent.
That is, very few "errors" and more collisions.  We are running
MtXinu's 4.3 on it -- very nearly the same system as is on the
uVaxIIen.

At the moment I'm inclined to believe that we have a couple of problems.

1. Even the "new" DEQNA's can't keep up very well.  Someone mentioned
   to us that Suns, when responding to NFS requests, generate a block
   of data of whatever size the physical block size of the filesystem
   is -- which can be as much as 8K.  This of course has to be broken
   up by the ethernet driver.  The ether boards in Suns are apparently
   good enough that they can generate packets as fast as the ethernet
   spec allows.  This, coupled with the design shortcuts made in the
   DEQNA, results in the DEQNA being overrun.

   Mark Hittinger <sysmsh@ulkyvx.bitnet> mentioned that in vms there
   is an option to turn off hardware checksums in local area vaxclusters
   because of this sort of problem.  There is some sort of bug in the
   checksumming hardware which shows up in heavily loaded ethernets.
   He suggested a newer board called a DELQA or some such.
2. Our rwho's need to be staggered in some way.  Any ideas?
3. It doesn't look like there are any bad boards -- but I can't really
   tell until I can put up an ethernet monitor of some sort.  I'll be
   doing some more sleuthing as soon as I can get a pc/ip running
   in an AT -- but we gotta put an ether board in the pc first.
4. Someone else mentioned a UB NIU with "xcvr heartbeat" enabled
   as being a problem.  We do have a UB NIU Buffered Repeater
   on our net, so I'll have to check that out.
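
The overrun theory in point 1 is just arithmetic: an 8K NFS reply goes
out as one UDP datagram, and IP has to fragment it to fit the ethernet
MTU, so the frames arrive back-to-back.  A back-of-the-envelope sketch
(the MTU, header size, and RPC overhead figures are assumptions):

```python
# Rough sketch of point 1: one 8K NFS reply fragments into several
# back-to-back ethernet frames.  Assumes a 1500-byte MTU and a 20-byte
# IP header; non-final fragment payloads must be multiples of 8 bytes.
# The RPC/UDP overhead figure is a guess.
MTU = 1500
IP_HDR = 20
DATAGRAM = 8192 + 128                  # 8K of data plus RPC/UDP headers

per_frag = (MTU - IP_HDR) // 8 * 8     # usable payload per fragment: 1480
frags = -(-DATAGRAM // per_frag)       # ceiling division
print("fragments per 8K block:", frags)   # 6 under these assumptions
```

Half a dozen minimum-spaced frames in a row is exactly the kind of
burst a slow board has to swallow without dropping anything.
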
-- 
<---- David Herron -- The E-Mail guy            <david@ms.uky.edu>
<---- or:                {rutgers,uunet,cbosgd}!ukma!david, david@UKMA.BITNET
<----
<---- It takes more than a good memory to have good memories.

rwhite@nusdhub.UUCP (Robert C. White Jr.) (02/18/88)

Hi all,
	In response to the query about some machines having very high
numbers in their error logs, I have this 802.3 anecdote.  This
behavior was noticed on a test installation of the Starlan (r) product,
but I would bet cash money that it is a protocol oddity in 802.3.

	Simply put, the observed product [which runs at 1 Mb/s instead
of 10 Mb/s] would post errors to the status log of a machine attempting
to assert a packet on a lan segment whenever another machine would
attempt to claim a new address.  The interaction goes something like
this:

	1) before a name can be claimed, the system desiring to
	implement the name must be sure the name is unique.

	2) to assure itself that the name is unique the requesting
	machine sends a short data packet with a _very_ high priority.

	3) the network access unit formats this packet as a datagram
	addressed to the, hopefully mythical, name.

	4) the adapter, sensing a priority requirement [and to give
	the user a fair chance on a busy net] listens for a time
	slot on the common carrier.  If this time slot is not
	available in _very_ short order, the adapter will usurp
	bandwidth by asserting carrier on the line anyway.  When
	the other machines sense carrier they signal a collision
	and back-off to re-time and contend.

	5) When the receive lines go quiet, with a wait time approaching
	0, the E_DATA packet is sent and must propagate through the
	network.

	6) steps 4 through 5 are retried cyclically for iterations X
	at a delay of Y, where X is about 6 attempts and Y is about
	5 seconds [your mileage may vary ;-)]

	7) Steps 2 through 6 are repeated for each name requested.
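
Steps 4 through 6 can be modeled as a toy retry loop.  This is only a
sketch of the behavior described above -- the busy-probability and the
X and Y figures are invented for illustration:

```python
import random

# Toy model of steps 4-6: retry a name-claim probe X times, usurping
# the carrier whenever no quiet slot turns up in time.  Each usurpation
# forces whoever was talking to log a collision/error and back off.
X_ATTEMPTS = 6
Y_DELAY = 5          # seconds between attempts (not actually slept here)

def claim_name(busy_prob, rng):
    """Return the number of collisions forced while claiming one name."""
    collisions = 0
    for attempt in range(X_ATTEMPTS):
        if rng.random() < busy_prob:
            # no free slot in time: assert carrier anyway
            collisions += 1
        # probe datagram goes out either way; repeat after Y_DELAY
    return collisions

rng = random.Random(1)
print("collisions on a busy net: ", claim_name(0.8, rng))
print("collisions on a quiet net:", claim_name(0.1, rng))
```

The point of the model: the busier the segment, the more errors a
single name-claim scatters across everyone else's logs.
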

	This is an observed phenomenon based on an educated guess,
but it produces the symptoms you described.  Whenever a machine
adds a name to the network for any reason, this rudeness is observed.
It will hit servers the hardest, as they are most likely to be sending
a packet at any given instant.  Bridge hardware will propagate the
generated messages, but in a polite fashion, and so effectively insulate
against the problem.
	Also, as servers have a longer [on average] time between
resets [or stat clearings, anyway] their numbers tend to be
artificially high.  This behavior seems to be within the accepted
standard however, so c'est la guerre.

	When high envelope events [like 20 students starting their
net sessions at the same time] occur, performance can get choppy.


	This may not be as common, or even true at all, on the
ethernet systems because of the generally faster environment.
A good experimental check is to isolate a net section and start
a stat loop [a stats b stats c .....], then ask for a few names
from a machine outside the loop [but on the same segment].


<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<<  All the STREAM is but a page,<<|>>	Robert C. White Jr.		   <<
<<  and we are merely layers,	 <<|>>	nusdhub!rwhite  nusdhub!usenet	   <<
<<  port owners and port payers, <<|>>>>>>>>"The Avitar of Chaos"<<<<<<<<<<<<
<<  each an others audit fence,	 <<|>>	Network tech,  Gamer, Anti-christ, <<
<<  approaching the sum reel.	 <<|>>	Voter, and General bad influence.  <<
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
##  Disclaimer:  You thought I was serious???......  Really????		   ##
##  Interogative:  So... what _is_ your point?			    ;-)	   ##
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

zemon@felix.UUCP (Art Zemon) (02/25/88)

I occasionally see large numbers of errors on our 11/750
running 4.2 bsd and an Interlan NI1010A board.  I have never
been able to figure out what causes the errors, or why they
disappear as suddenly (and mysteriously) as they appear.
If you have any clues, I would appreciate hearing about
them.
--
	-- Art Zemon
	   By Computer:	    ...!hplabs!felix!zemon
	   By Air:	    Archer N33565
	   By Golly:	    moderator of comp.unix.ultrix