david@ms.uky.edu (David Herron -- Resident E-mail Hack) (02/09/88)
hi ..

we've got a strange little problem. We're noticing a fairly high
error rate on *some* of our machines and not others.

Our equipment is a mixed bag of:

	10 or so uVax2000's
	7 or 8 uVaxII's (some as servers, 2 as workstations)	DEQNA
	an 11/750						DEUNA
	4 Sun 3/60's and a 3/280
	a 26 processor Sequent (ns32000 variety)
	5 AT&T 3b2's (SysVr3.1)
	2 or 3 AT&T 7300/3b1's
	2 IBM PC/RT's

All on an ethernet, gatewayed out through an Ungerman Bass NIU box
which goes to a channel on an Ungerman Bass broadband system.

The Vaxen (except for the 2000's) are running 4.3+NFS from Mt Xinu --
we're one MR behind. The 2000's are running Ultrix, again one version
behind. I'm not sure which version the Suns are running but again
I think we're one version "behind".

Anyway, the vaxen are seeing the error rates and the Suns aren't.

We happened to be at a DEC presentation today and were talking to a
tech guy and described the situation to him. It tickled a memory in
his brain about the Sun networking code being "not right" in some way
(and he immediately said he wasn't bashing another vendor :-), like it
was missing some piece of code that Ultrix of course fully implemented.
(Oh, an aside, this guy wasn't your typical DEC salesman that almost
knows what s/he's talking about ...)

I'm curious if this tickles anybody's memory and if they know what
he was talking about.
-- 
<---- David Herron -- The E-Mail guy <david@ms.uky.edu>
<---- or: {rutgers,uunet,cbosgd}!ukma!david, david@UKMA.BITNET
<----
<---- It takes more than a good memory to have good memories.
afb3@hou2d.UUCP (A.BALDWIN) (02/11/88)
In article <8277@e.ms.uky.edu>, david@ms.uky.edu (David Herron -- Resident E-mail Hack) writes:
> hi ..
>
> we've got a strange little problem. we're noticing a fairly high
> error rate on *some* of our machines and not others.
> Sun's are running but again I think we're one version "behind".
>
>.......
>
> Anyway, the vaxen are seeing the error rates and the Sun's
> aren't. ......
>
>....
>
> --
> <---- David Herron -- The E-Mail guy <david@ms.uky.edu>
> <---- or: {rutgers,uunet,cbosgd}!ukma!david, david@UKMA.BITNET
> <----
> <---- It takes more than a good memory to have good memories.

I remember (when I was able to have a DEC hardware maintenance
contract :-) that there was a problem with the DEQNA and ULTRIX.
I've never seen this problem on our ETHERNET (3 AT&T 3b2's, a MASScomp
5500, a uVAX-II, 2 Silicon Graphics IRIS workstations, 4+ Sun
workstations, and two Integrated Solutions workstations). However,
our network is not as large as the one described in the above article.

Also, we were/are running Ultrix 1.1. Your release should have had
the problem this guy was talking about fixed. Maybe new problems
were introduced.

I'm really interested, as we are about to get a VAXstation 3500 and
our ETHERNET is split between two locations (we will put in a bridge
soon).

Al Baldwin
AT&T-Bell Labs
...!ihnp4!hou2d!afb3

[These opinions are my own....Who else would want them!!!]
david@ms.uky.edu (David Herron -- Resident E-mail Hack) (02/16/88)
In article <8277@e.ms.uky.edu> david@ms.uky.edu (David Herron -- Resident E-mail Hack) writes:
>hi ..
>
>we've got a strange little problem. we're noticing a fairly high
>error rate on *some* of our machines and not others.

Well, it seems I wasn't quite clear enough before ...

When I said "errors", I meant the number returned by the netstat
program. Yes, I know that it lumps a whole bunch of errors into one
statistic. But it's what we got. (We are working on getting some
other gadgets set up -- but lack of money precludes getting a proper
ethernet analyzer).

Here's a fairly typical minute or so of our ethernet:

| 21 - e:david --> netstat -i 5
    input    (qe0)     output          input   (Total)    output
 packets   errs packets errs colls  packets   errs packets errs colls
 7363687 220240 3949440 2829 42243  7489031 220240 4074784 2829 42243
      36      2      19    0     0       38      2      21    0     0
      33      1       9    0     0       33      1       9    0     0
      33      1       9    0     0       33      1       9    0     0
      55      4       4    0     0       55      4       4    0     0
      16      0       7    0     0       18      0       9    0     0
      22      1       1    0     0       22      1       1    0     0
      23      1       1    0     0       23      1       1    0     0
      37      3       7    0     0       37      3       7    0     0
      25      0       4    0     0       25      0       4    0     0
      54      5       3    0     0       54      5       3    0     0
      28      0       7    0     0       28      0       7    0     0
      22      1       0    0     0       22      1       0    0     0
      23      0       2    0     0       23      0       2    0     0
      36      0      12    0     0       36      0      12    0     0
      27      2       3    0     0       27      2       3    0     0
      51      4       1    0     0       51      4       1    0     0
       3      0       1    0     0        3      0       1    0     0

The machine in question is e.ms.uky.edu, a uVaxII which serves partly
as a file server, partly as our news machine, our primary domain
nameserver, and partly as the work machine for some of the staff.
The active connections are mainly rlogin's -- I have a couple going
at the moment which are quiet -- an nntp (to harvard) and a couple
of nfs connections.

The board being used is a DEQNA; I'm not sure if it's a "new" or
"classic" DEQNA. We do have one machine with both; we're running it
with the "new" DEQNA right now and it's showing the same sort of
error rates.

The Suns and the Sequent are different both in the error rates and
in the type of error.
The Sun has "error" rates a couple orders of magnitude less than this,
and the Sequent has "error" rates a couple orders of magnitude less
than the Sun. Further, they both have a strong tendency for collisions
in preference to "error"'s.

Now, the Sun (I'm sampling from the server machine) has 4 workstations
using both nfs and nd from it and seeing quite a bit of traffic.
Frequent 30 second bursts of 100-300 packets a sec on input and, in
those same time slices, the output packet rate at about 2/3 the input.
In my watching right at this minute, the errors are predominately
collisions with occasional "errors". The collisions "seem" to be
periodic -- at times there are regular (every 30 seconds) bursts of
collision activity, with a high rate of input packets at those same
bursts. I'm inclined to point a finger at rwho over that one. On the
other hand, the pattern isn't there all the time.

Whoever told me to get a lan analyzer so that I'm not guessing -- I
see your point. But we don't got the bucks right now.

The Sequent is showing the same sorts of activity as the Sun server,
except that it doesn't do nfs so therefore doesn't ever get those
bursts of 100-300 packets per second (or are these numbers from
netstat over the whole 5 second period?).

Yes we do have both DELNI's and DEMPR's. We used to have a slightly
illegal configuration that had paths of >2 DELNI's, but now all of
our paths have a max of 2 DELNI's.

Picking out a uVax2000 at random, the "error" rate is in the .01%
range and ZERO collisions.

The last machine is the 11/750 with a DEUNA. To begin with, it's a
much quieter machine. It serves out very little NFS and its users
don't go out to other machines very often. But at any rate, it does
show the same sorts of error rates as the Suns and Sequent. That is,
very few "errors" and more collisions. We are running MtXinu's 4.3
on it -- very nearly the same system as is on the uVaxIIen.

At the moment I'm inclined to believe that we have a couple of
problems.

1. Even the "new" DEQNA's can't keep up very well. Someone mentioned
   to us that Suns, when responding to NFS requests, generate a block
   of data of whatever size the physical block size of the filesystem
   is -- which can be as much as 8K. This of course has to be broken
   up by the ethernet driver. The ether boards in Suns are apparently
   good enough that they can generate packets as fast as the ethernet
   spec allows. This, coupled with the design shortcuts made with the
   DEQNA, results in the DEQNA being overrun.

   Mark Hittinger <sysmsh@ulkyvx.bitnet> mentioned that in vms there
   is an option to turn off hardware checksums in local area
   vaxclusters because of this sort of problem. There is some sort of
   bug in the checksumming hardware which shows up in heavily loaded
   ethernets. He suggested a newer board called a DELQUA or some such.

2. Our rwho's need to be staggered in some way. Any ideas?

3. It doesn't look like there are any bad boards -- and I can't really
   tell until I can put up an ethernet monitor of some sort. I'll be
   doing some more sleuthing as soon as I can get a pc/ip running in
   an AT -- but we gotta put an ether board in the pc first.

4. Someone else mentioned a UB NIU with "xcvr heartbeat" enabled as
   being a problem. We do have a UB NIU Buffered Repeater on our net,
   so I'll have to check that out.
-- 
<---- David Herron -- The E-Mail guy <david@ms.uky.edu>
<---- or: {rutgers,uunet,cbosgd}!ukma!david, david@UKMA.BITNET
<----
<---- It takes more than a good memory to have good memories.
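
As a rough check on the netstat sample in the post above, the counters
can be turned into rates. The sketch below is Python and is not from
the thread; `rate`, `per_second` and the variable names are made up
for illustration, and the numbers are copied from the qe0 columns of
the sample. It assumes the usual BSD netstat behavior, where the first
row is the total since boot and each later row is the count
accumulated over the 5-second interval (so dividing by 5 gives a
per-second figure).

```python
def rate(events, packets):
    """Return events as a percentage of packets (0.0 if no packets)."""
    return 100.0 * events / packets if packets else 0.0


def per_second(count, interval_s=5):
    """Convert a per-interval count into a per-second figure."""
    return count / interval_s


# Cumulative qe0 counters (first row of the sample, totals since boot).
ipkts, ierrs, opkts, oerrs, colls = 7363687, 220240, 3949440, 2829, 42243

# Per-interval qe0 rows: (ipkts, ierrs, opkts, oerrs, colls), 5 s each.
intervals = [
    (36, 2, 19, 0, 0), (33, 1, 9, 0, 0), (33, 1, 9, 0, 0), (55, 4, 4, 0, 0),
    (16, 0, 7, 0, 0), (22, 1, 1, 0, 0), (23, 1, 1, 0, 0), (37, 3, 7, 0, 0),
    (25, 0, 4, 0, 0), (54, 5, 3, 0, 0), (28, 0, 7, 0, 0), (22, 1, 0, 0, 0),
    (23, 0, 2, 0, 0), (36, 0, 12, 0, 0), (27, 2, 3, 0, 0), (51, 4, 1, 0, 0),
    (3, 0, 1, 0, 0),
]

# Sum the interval rows so the recent rate can be compared with the
# since-boot rate.
recent_ipkts = sum(row[0] for row in intervals)
recent_ierrs = sum(row[1] for row in intervals)

print("since boot: input errs  %.2f%% of input packets" % rate(ierrs, ipkts))
print("since boot: output errs %.2f%% of output packets" % rate(oerrs, opkts))
print("since boot: collisions  %.2f%% of output packets" % rate(colls, opkts))
print("this sample: input errs %.2f%% of input packets"
      % rate(recent_ierrs, recent_ipkts))
print("busiest interval: %.1f input packets/sec"
      % per_second(max(row[0] for row in intervals)))
```

On these figures the cumulative input error rate on qe0 works out to
roughly 3% of input packets, while the output side shows far fewer
errors than collisions.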
rwhite@nusdhub.UUCP (Robert C. White Jr.) (02/18/88)
Hi all,

In response to the query about some machines having very high numbers
in their error logs, I have this 802.3 anecdote. This behavior was
noticed on a test installation of the Starlan (r) product, but I would
bet cash money that it is a protocol oddity in 802.3.

Simply put, the observed product [which runs at 1 Mbps instead of
10 Mbps] would post errors to the status log of a machine attempting
to assert a packet on a lan segment whenever another machine would
attempt to claim a new address.

The interaction goes something like this:

1) Before a name can be claimed, the system desiring to implement
   the name must be sure the name is unique.
2) To assure itself that the name is unique, the requesting machine
   sends a short data packet with a _very_ high priority.
3) The network access unit formats this packet as a datagram
   addressed to the, hopefully mythical, name.
4) The adapter, sensing a priority requirement [and to give the
   user a fair chance on a busy net], listens for a time slot on the
   common carrier. If this time slot is not available in _very_
   short order, the adapter will usurp bandwidth by asserting carrier
   on the line anyway. When the other machines sense carrier they
   signal a collision and back off to re-time and contend.
5) When the receive lines go quiet, with a wait time approaching 0,
   the E_DATA packet is sent and must propagate through the network.
6) Steps 4 through 5 are retried cyclically for X iterations at a
   delay of Y, where X is about 6 attempts and Y is about 5 seconds
   [your mileage may vary ;-)]
7) Steps 2 through 6 are repeated for each name requested.

This is an observed phenomenon explained by an educated guess, but it
produces the symptoms you described. Whenever a machine adds a name
to the network for any reason, this rudeness is observed. It will hit
servers the hardest, as they are most likely to be sending a packet
at any given instant.

Bridge hardware will propagate the generated messages, but in a polite
fashion, and so effectively insulate against the problem. Also, as
servers have a longer [on average] time between resets [or stat
clearings anyway], their numbers tend to be artificially high.

This behavior seems to be within the accepted standard however, so
c'est la guerre. When high envelope events [like 20 students starting
their net sessions at the same time] occur, performance can get choppy.

This may not be as common, or even true at all, on the Ethernet systems
because of the generally faster environment.

A good experimental check is to isolate a net section and start a
stat loop [a stats b stats c .....]. Then ask for a few names from
a machine outside the loop [but on the same segment].

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<< All the STREAM is but a page,<<|>> Robert C. White Jr.                  <<
<< and we are merely layers,    <<|>> nusdhub!rwhite  nusdhub!usenet       <<
<< port owners and port payers, <<|>>>>>>>>"The Avitar of Chaos"<<<<<<<<<<<<
<< each an others audit fence,  <<|>> Network tech,  Gamer, Anti-christ,   <<
<< approaching the sum reel.    <<|>> Voter, and General bad influence.    <<
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
## Disclaimer: You thought I was serious???......  Really????              ##
## Interogative: So... what _is_ your point?  ;-)                          ##
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
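
Taking the description above at face value, the retry pattern in
steps 2 through 7 can be laid out as a timeline. This is only a sketch
of the poster's educated guess, not of anything in the 802.3 standard
or the Starlan product; `claim_schedule` and its parameters are
hypothetical names, with X = 6 attempts and Y = 5 seconds taken from
the post as rough defaults.

```python
def claim_schedule(names, attempts=6, delay=5.0):
    """Return (time_s, name, attempt) for each forced carrier assertion.

    Models one machine claiming its names one after another: every
    name is tried `attempts` times, `delay` seconds apart (steps 4-6),
    and the whole cycle repeats for the next name (step 7).
    """
    events = []
    t = 0.0
    for name in names:
        for attempt in range(1, attempts + 1):
            events.append((t, name, attempt))
            t += delay
    return events


# Hypothetical example: one workstation claiming two names at login.
schedule = claim_schedule(["host-a", "user-a"])
for time_s, name, attempt in schedule:
    print("t=%5.1fs  claim %-7s attempt %d" % (time_s, name, attempt))
```

Under these assumptions each name costs a station about half a minute
of periodic interruptions, which would add up quickly when many
machines start sessions at once.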
zemon@felix.UUCP (Art Zemon) (02/25/88)
I occasionally see large numbers of errors on our 11/750 running
4.2 bsd and an Interlan NI1010A board. I have never been able to
figure out what causes the errors, or why they disappear as suddenly
(and mysteriously) as they appear.

If you have any clues, I would appreciate hearing about them.
-- 
	-- Art Zemon
	   By Computer:	...!hplabs!felix!zemon
	   By Air:	Archer N33565
	   By Golly:	moderator of comp.unix.ultrix