[comp.unix.aux] problems with ethernet interface

dyer@arktouros.MIT.EDU (Steve Dyer) (09/29/88)

I am running A/UX 1.0 and find that I have to reboot the machine
regularly because the network dies with the messages

   ae0: overflow NIC reset failed.
   ae6_intr: Receive overflow warning.

Following this, the network is effectively dead.  Some utilities whose
names escape me actually say so: "network down", but this is not reflected
in a call to "netstat -i", nor does a call to ifconfig reset things.

Usually I see many copies of this message on my screen when I come in
the first time each day, but I can also make it happen on command by doing
an rcp with files going to the A/UX box.  Very occasionally, a message of
the form 

   ae0 transmitter frozen, resetting

appears within a few minutes, and the system is again usable, but more often,
I never see this message and rebooting is the only way to clear things.

What's the matter here?  Is the ethernet interface bad?  A mVAX-II, an
Apollo, a Bell Tech V.3 box and a RT/PC are all on the same DELNI and
exhibit no problems whatsoever.  Any and all clues would be appreciated.
---
Steve Dyer
dyer@arktouros.MIT.EDU
dyer@spdcc.COM aka {harvard,husc6,linus,ima,bbn,m2c,mipseast}!spdcc!dyer

dixon@control.steinmetz (walt dixon) (09/30/88)

We too have experienced this problem.  I've talked to several people at
Apple including A/UX support who could not believe that things like this
really happen. (Welcome to the real world).  This problem has cost Apple
sales at our site.  Loss of sales has finally got their attention (at
least on a local level).  I've promised to send a real time trace of
network activity off to the developers so they can try to duplicate the
problem.  Will it be fixed in 1.1?  I don't know.

I suspect that the problem originates with either bursts of broadcast traffic
or bad packets.  Hopefully the trace will isolate the problem.  A/UX should
definitely recover from this condition;  no other devices on this ethernet
segment have shown similar behavior.

You can recover from this condition using ifconfig. "ifconfig ae0 up" will
bring the network back up; one can also write a program to turn the network
back on.  The problem gets more interesting when you have an NFS hard mount.
I tried to write a program which would run in the background,  catch a signal
that the network was down,  and turn the network on again.  This approach
seems reasonable,  but time constraints and a lack of Unix knowledge have
prevented completion.  I'm willing to give out the code I've got to anyone
who wants to get it working.  The only condition would be that, if you get
it to work,  you post it so others can use it.

Walt Dixon			{arpa:		dixon@ge-crd.arpa	}
				{us mail:	ge corp r&d		}
				{		po box 8		}
				{		schenectady,  ny 12345	}
				{phone:		518-387-5798		}

dyer@arktouros.MIT.EDU (Steve Dyer) (09/30/88)

In article <12275@steinmetz.ge.com> dixon@control.steinmetz.ge.com (walt dixon) writes:
>You can recover from this condition using ifconfig. "ifconfig ae0 up" will
>bring the network back up; one can also write a program to turn the network
>back on.

In my experience, "ifconfig ae0 up" was a no-op.  Programs complained
"network down" anyway.  Someone else placed the line
ifconfig ae0 down; ifconfig ae0 up
for cron to execute every 5 minutes or so.  I haven't tried this yet; perhaps
explicitly turning off the software state of the interface toggles some
bit which allows the interface to be reset.

If this is as widespread as the comments I see here and the letters
I've received in the past few days, this is a major lose.  How about
a comment from Apple folk who are otherwise so diligent in fending off
meta-rumors?  Prevalence, workarounds, ideas of when this will be fixed?
---
"The network IS the computer..."

Steve Dyer

dyer@arktouros.MIT.EDU (Steve Dyer) (10/02/88)

In article <7247@bloom-beacon.MIT.EDU> dyer@arktouros.MIT.EDU (Steve Dyer) writes:
>In my experience, "ifconfig ae0 up" was a no-op.  Programs complained
>"network down" anyway.  Someone else placed the line
>ifconfig ae0 down; ifconfig ae0 up
>for cron to execute every 5 minutes or so.  I haven't tried this yet; perhaps
>explicitly turning off the software state of the interface toggles some
>bit which allows the interface to be reset.

Having discovered the network wedged again this morning, I can say that
typing:

ifconfig ae0 down; ifconfig ae0 up

does work.  
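
A minimal sketch of the down-then-up sequence reported above, wrapped so it
can be tried safely.  The names bounce_interface, run and DRY_RUN are
hypothetical and do not come from the thread; only the two ifconfig commands
do.  By default the script just prints what it would run.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): bounce an interface by taking it
# down and then bringing it up again, the order reported to clear the wedge.

# DRY_RUN=1 (the default here) only prints the commands.
# Set DRY_RUN=0 in the environment to actually execute them (needs root).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

bounce_interface() {
    # $1 is the interface name, e.g. ae0.
    # "up" is only attempted if "down" succeeded.
    run ifconfig "$1" down && run ifconfig "$1" up
}

bounce_interface ae0
```

Run unconditionally from cron, as described earlier in the thread, this
resets the interface even when it is healthy, so any transfer in progress at
that moment may be interrupted.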

---
Steve Dyer
dyer@arktouros.MIT.EDU
dyer@spdcc.COM aka {harvard,husc6,ima,bbn,m2c,mipseast}!spdcc!dyer

pane@cat.cmu.edu (John Pane) (10/03/88)

This is a duplication of a post I made in July, for the benefit of those who
are experiencing this problem with the ethernet...

---begin forwarded message---
I started having this problem when our network was re-arranged here, and it
was so bad that I couldn't do any networking.  The new configuration had
placed me on a very busy portion of the network at CMU.

Some of the problem was tracked down to broadcasts that my A/UX machine was
making, that were being responded to by hundreds of machines on campus.
Although this doesn't completely solve the problem, here are the steps I
took which resulted in a big improvement.

1) In /etc/inittab I turned off nfs0 (the release notes tell you to turn
this on even if you're not running nfs).  I haven't noticed any loss of
functionality after turning it off.

2) I created a file /etc/resolv.conf, listing three domain name servers, so
my machine doesn't broadcast domain name resolution requests.  See the
manual entry for resolver(4).

3) Changed my broadcast address from 128.2.0.0 to 128.2.255.255 (most of the
machines in the CS department here are still using 128.2.0.0, but the plan
is to move to 128.2.255.255).  This is a temporary fix, relying on the fact
that fewer machines are currently responding to broadcasts on the new
address.

So now, my machine does less broadcasting, and because of the
change of broadcast address, receives fewer replies when it does broadcast.
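
The two broadcast conventions in step 3 differ only in how the host part of
the address is filled: all zeros (the older convention) or all ones.  The
sketch below computes both from a network address and a netmask.  The
function name broadcast_addrs is hypothetical, and the netmask 255.255.0.0 is
an assumption inferred from the addresses 128.2.0.0 and 128.2.255.255; it is
not stated in the post.

```shell
#!/bin/sh
# Hypothetical helper: print both broadcast conventions for a network.
# $1 = address in dotted-quad form, $2 = netmask in dotted-quad form.
# The body runs in a subshell so the IFS change does not leak out.
broadcast_addrs() (
    IFS=.
    # Split both arguments on dots: $1..$4 = address bytes, $5..$8 = mask bytes.
    set -- $1 $2
    # Host part all zeros: address AND mask.
    echo "zeros $(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
    # Host part all ones: address OR inverted mask.
    echo "ones $(( $1 | (255 ^ $5) )).$(( $2 | (255 ^ $6) )).$(( $3 | (255 ^ $7) )).$(( $4 | (255 ^ $8) ))"
)

# Assumed class B netmask for the 128.2 network.
broadcast_addrs 128.2.0.0 255.255.0.0
```

On BSD-derived systems the broadcast address is typically set with the
"broadcast" parameter of ifconfig; whether A/UX 1.0 accepts it in that form
is not stated here.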

The only remaining problem, which happens much less frequently, is that
100+ other machines on the network don't know about the 255.255
broadcast address, and when they receive such a broadcast (from my machine
or others) they respond by arp'ing.  This flood of arp's still causes my
networking to go down.

The fact remains that the hardware/low-level software should be able to
handle this level of traffic.  Does anybody know if the acknowledged
"defect" in the ethertalk boards could manifest itself in this way?

---end forwarded message---

John Pane
Department of Computer Science
Carnegie Mellon University
(412)268-5884

pane@cs.cmu.edu

news@steinmetz.ge.com (news) (10/03/88)

From: dixon@control.steinmetz (walt dixon)
Path: control!dixon

[...] an ifconfig ae0 down.  I thought that this was common knowledge;  apparently
not.  This combination does indeed bring the network back up.


Walt Dixon			{ARPA:		Dixon@ge-crd.arpa	}
				{US Mail:	GE Corp R&D		}
				{		PO Box 8		}
				{		Schenectady, NY 12345	}
				{Phone:		518-387-5798		}

ragge@nada.kth.se (Ragnar Sundblad) (10/05/88)

In article <7247@bloom-beacon.MIT.EDU> dyer@arktouros.MIT.EDU (Steve Dyer) writes:
>In my experience, "ifconfig ae0 up" was a no-op.  Programs complained
>"network down" anyway.  Someone else placed the line
>ifconfig ae0 down; ifconfig ae0 up
....
>Steve Dyer

That's probably one of the bugs in the National DP8390 (described in
DP8390 Tech Update, problem #3).

"
	Problem 3
	Suspended Operation After Transmission: If Collision (COL) is
	asserted during the transmission of the last byte, the NIC will
	suspend all operations. This problem is manifested when the
	Command Register continually reads 26H.
	The NIC must be hardware reset to resume operation.

	NOTE: In a properly operating IEEE 802.3 network, a collision will
	never occur during the last byte of transmission.
"

You can find the command register at the byte at address
	0xFSSE003C, where S = slot number (in a MacII 9 <= S <= E)
if you would like to check it out. (and if you somehow manage to
look at this address).
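
To make the address pattern concrete, here is a small sketch that fills in
the slot digit S of 0xFSSE003C.  The function name cmd_reg_addr is
hypothetical.  It only computes the address; it does not read the register,
since the post gives no method for that.

```shell
#!/bin/sh
# Hypothetical helper: expand the 0xFSSE003C pattern for a given slot.
# $1 = slot number, decimal (9..14) or hex (0x9..0xE).
cmd_reg_addr() {
    slot=$(( $1 ))
    if [ "$slot" -lt 9 ] || [ "$slot" -gt 14 ]; then
        echo "slot must be between 0x9 and 0xE" >&2
        return 1
    fi
    # S occupies the second and third hex digits: bits 24-27 and bits 20-23.
    printf '0x%08X\n' $(( 0xF00E003C | (slot << 24) | (slot << 20) ))
}

cmd_reg_addr 0x9
cmd_reg_addr 0xE
```

For slot 9 this gives 0xF99E003C, and for slot E it gives 0xFEEE003C.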

If this is the problem, you'd better check your ethernet.

Note: I don't THINK that the EtherTalk card exchange some months ago
solved this problem.

nghiem@ut-emx.UUCP (Alex Nghiem) (10/06/88)

In article <12275@steinmetz.ge.com>, dixon@control.steinmetz (walt dixon) writes:
> We too have experienced this problem.  I've talked to several people at
> Apple including A/UX support who could not believe that things like this
> really happen. (Welcome to the real world).  This problem has cost Apple

I believe I read a Computer World article that mentioned that
Apple's Ethernet board is manufactured by 3Com and was temporarily
withdrawn from the market because of bugs. I don't know if the
board in question is related to this problem, but it might be
worth investigation. I read the article sometime this summer.
A corrected board should have been introduced by now.

magorian@umd5.umd.edu (Dan Magorian) (10/08/88)

In article <586@draken.nada.kth.se> ragge@nada.kth.se (Ragnar Sundblad) writes:
>In article <7247@bloom-beacon.MIT.EDU> dyer@arktouros.MIT.EDU (Steve Dyer) writes:
>>In my experience, "ifconfig ae0 up" was a no-op.  Programs complained
>>"network down" anyway.  Someone else placed the line
>>ifconfig ae0 down; ifconfig ae0 up
>....
>>Steve Dyer
>
>That's probably one of the bugs in the National DP8390 (described in
>DP8390 Tech Update, problem #3).
>
>"
>	Problem 3
>	Suspended Operation After Transmission: If Collision (COL) is
>	asserted during the transmission of the last byte, the NIC will
>	suspend all operations. This problem is manifested when the
>	Command Register continually reads 26H.
>	The NIC must be hardware reset to resume operation.
>
>	NOTE: In a properly operating IEEE 802.3 network, a collision will
>	never occur during the last byte of transmission.
>"
>Note: I don't THINK that the EtherTalk card exchange some months ago
>solved this problem.

Does anyone have some details on what the swapped Rev I or J cards patched?  We
had the earlier Rev E cards, and were experiencing the problems people are
describing (there was even one Rev C card with the earlier version of the NIC).
Swapping them out reduced lockups considerably.  Basically, it's the same card 
reworked with 4 additional jumpers (there were already 2).  On the MacOS side,
the reworks were shipped with the same 1.1 driver, but a 2.0 version later
appeared.  Comments?

Dan Magorian
Computer Science Center
University of Maryland