[comp.sys.sun] ie0: lost interrupt: resetting

rudolf@oce.orst.edu (Jim Rudolf) (02/16/89)

I vaguely remember seeing this discussion somewhere before.  My apologies
if it has already been run into the ground.

We have two 3/280 servers running SunOS 3.5.  Almost on a weekly basis,
usually during a period of moderate net activity, one of the servers will
start spewing forth with:

	ie0: lost interrupt: resetting

If this starts happening, we'll generally start getting a few of these too:

	NFS getattr failed for server neptune: RPC: Timed out

Quoting from the man page:
     ie%d: lost interrupt: resetting
          The driver and 82586  chip  have  lost  synchronization
          with  each  other.   The  driver  recovers by resetting
          itself and the chip.

Our ethernet boards must not have read the man pages, because the affected
board does not recover by itself.  When this happens, the server is pretty
much hung up, and the only effective solution we've come up with is the
dreaded L1-A.  Who else has experienced this?  What did you do to stop it?

Thanks for your help,

Jim Rudolf
rudolf@oce.orst.edu
College of Oceanography
Oregon State University

[[ I can find no evidence of a previous discussion about this in either
volume 6 or 7. --wnl ]]

arisco@cadillac.cad.mcc.com (John Arisco) (03/01/89)

We also had a problem with "ie0: lost interrupt: resetting".  It is one of
the bugs fixed in the 3.5.1 patch tape from Sun Software Support.
__________

Reference Number:  1006375

        Synopsis:  ie0: lost interrupt: resetting

        Description:

        Heavy nfs activity on a Sun-3/280 nfs file server can
        result in the following:

                ie0: lost interrupt: resetting

        Files Changed:  /usr/sys/OBJ/if_ie.o

        Special Installation Instructions:

        You must rebuild your kernel.  Please refer to KERNEL REBUILD
        at the end of this document.

 John Arisco, MCC CAD Program | ARPA: arisco@mcc.com | Phone: [512] 338-3576
 Box 200195, Austin, TX 78720 | UUCP: ...!cs.utexas.edu!milano!cadillac!arisco

wwtz@uunet.uu.net (Wolfgang Wetz) (03/02/89)

rudolf@oce.orst.edu (Jim Rudolf) writes:
>We have two 3/280 servers running SunOS 3.5.  Almost on a weekly basis,
>usually during a period of moderate net activity, one of the servers will
>start spewing forth with:
>	ie0: lost interrupt: resetting

This problem occurs under heavy load on the ethernet.  We experienced it
here too.  The bad thing about this "lost interrupt" is that there is no
way to recover from it except a processor interrupt and reboot.  We were
told by Sun Switzerland to upgrade to SunOS 3.5.1.  Having done this, the
problem went away.

best regards
Wolfgang Wetz, Systems Administrator, Scientific Computing Centre
   c/o CIBA-GEIGY AG, R-1045.330, CH-4002 Basel, Switzerland
 Internet: wwtz%cgch.uucp@uunet.uu.net		      Amateur Radio: HB9PCX
 UUCP:     ...!mcvax!cernvax!cgch!wwtz                Phone: (+41) 61 697 54 25
 BITNET:   wwtz%cgch.uucp@cernvax.bitnet              Fax:   (+41) 61 697 32 88

meier@rutgers.edu (Christopher M. Meier) (03/02/89)

rudolf@oce.orst.edu (Jim Rudolf) writes:
>...[getting "ie0: lost interrupt: resetting" messages]
>If this starts happening, we'll generally start getting a few of these too:
>	NFS getattr failed for server neptune: RPC: Timed out

Until I read this line, I thought this was written by someone here.
We have no 'neptune' 280.

We have seen this, but mostly at times when someone is adding/removing a
number of nodes from the ethernet.  This seems to cause lots of 'noise' on
the cable(s).  I can't qualify the 'noise', as I haven't had a sniffer to
use during one of those times.  It has also been seen when the network
traffic is heavy, but someone somewhere may have been fooling with the
cable.  We don't have a solution.

@ Christopher  M.  Meier   ms:  MN65-2300   Honeywell Systems & Research Center
@ Research Scientist/SIP   (612) 782-7191   3660 Technology Drive
@ meier@SRC.Honeywell.COM  !SRCSIP!meier    Mpls, MN  55418

dinah@shell.UUCP (Dinah Anderson) (03/02/89)

Jim Rudolf writes about a problem with ie0: lost interrupt: resetting
errors. (v7n157) We have seen this a couple of times and I believe a new
CPU resolved the problem.

We are having a problem with 

ie0: no carrier

errors. They are often accompanied by:

ie0: Ethernet jammed
or
ie0: WARNING: if_snd full

messages. We see 5-20 of the no carrier errors per hour on most of our
file servers. We are not monitoring the workstations as closely, but they
are receiving them also.
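
For anyone who wants to put numbers on rates like these, here is a minimal
sketch in Python that tallies ie driver messages per hour. It assumes the
messages land in a syslog-style file with lines such as
"Mar  2 14:05:11 server1 vmunix: ie0: no carrier"; that line layout, the
host name "server1", and the names LINE_RE and count_per_hour are
illustrative assumptions, not something taken from the posts above.

```python
import re
from collections import Counter

# Assumed (hypothetical) log layout:
#   "Mar  2 14:05:11 server1 vmunix: ie0: no carrier"
LINE_RE = re.compile(
    r"^(?P<mon>\w{3})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+"
    r"\S+\s+\S+:\s+(?P<dev>ie\d+):\s+(?P<msg>.+?)\s*$"
)

def count_per_hour(lines):
    """Count ie driver messages, keyed by (month, day, hour, message text)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # not an ie driver message in the assumed layout
        key = (m.group("mon"), int(m.group("day")),
               int(m.group("hour")), m.group("msg"))
        counts[key] += 1
    return counts

# Made-up sample lines, for illustration only.
SAMPLE = [
    "Mar  2 14:05:11 server1 vmunix: ie0: no carrier",
    "Mar  2 14:17:42 server1 vmunix: ie0: no carrier",
    "Mar  2 14:17:43 server1 vmunix: ie0: Ethernet jammed",
    "Mar  2 15:01:09 server1 vmunix: ie0: WARNING: if_snd full",
    "Mar  2 15:02:00 server1 vmunix: some other message",
]

if __name__ == "__main__":
    for key, n in sorted(count_per_hour(SAMPLE).items()):
        print(key, n)
```

Keying on the hour keeps the output small enough to compare servers side
by side; the regular expression would need adjusting to whatever format
the local log actually uses.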

Our network topology currently consists of DEC LAN bridges connecting
local segments and BridgeComm bridges (56kb and T-1) connecting the
individual sites.  Physical connections consist of both twisted pair
(synoptics) and regular "thick" connections. The problem appears on
systems at different sites with both twisted and "thick" connections. (We
are currently migrating to routers and are aware of the problems with
bridges everywhere.)

We are working with Sun on a resolution, but are curious to know if anyone
else has seen this problem.

Dinah Anderson 
Shell Oil Company, Information Center (713) 795-3287
...!{sun,psuvax,soma,rice,ut-sally,ihnp4}!shell!dinah

cander@ucbvax.berkeley.edu (Charles Anderson) (03/07/89)

rudolf@oce.orst.edu (Jim Rudolf):
> We have two 3/280 servers running SunOS 3.5.  Almost on a weekly basis,
> usually during a period of moderate net activity, one of the servers will
> start spewing forth with: 
> 	ie0:  lost interrupt: resetting
> 
> Our ethernet boards must not have read the man pages, because the affected
> board does not recover by itself.  When this happens, the server is pretty
> much hung up, and the only effective solution we've come up with is the
> dreaded L1-A.  Who else has experienced this?  What did you do to stop it?

I saw this on some 3/160's running SunOS 3.4 (I think).  It was happening
multiple times per day on each file server (of course it started on
Thanksgiving, and I had to come in all weekend to reboot machines). We
eventually tracked the problem down to a faulty, pre-802.3 transceiver on
the net that was writing packets that were all 1's (0xFFFFFFF...).

We were fortunate in a number of ways: we had a network analyzer, the
problem was happening frequently (up to 20% of the packets on the net were
errors), and we could divide and conquer our net without stepping on too
many users' toes.  Please pardon the following plug... I highly recommend
Excelan's network analyzer, LANalyzer EX 5000.  It's extremely valuable for
these kinds of problems.
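
Without an analyzer, the same check can be roughed out in software if raw
frames can be captured some other way. The following is a minimal Python
sketch, assuming each captured frame is available as a bytes object; the
function names and the sample frames are hypothetical, and the capture
step itself is not shown.

```python
def is_all_ones(frame: bytes) -> bool:
    """True if the frame is non-empty and every byte is 0xFF."""
    return len(frame) > 0 and all(b == 0xFF for b in frame)

def all_ones_fraction(frames) -> float:
    """Fraction of captured frames that consist entirely of 1 bits."""
    frames = list(frames)
    if not frames:
        return 0.0
    bad = sum(1 for f in frames if is_all_ones(f))
    return bad / len(frames)

# Made-up capture: one all-1's frame among five, for illustration only.
SAMPLE_FRAMES = [
    b"\xff" * 64,
    b"\x08\x00" + b"\x00" * 62,
    b"\xff" * 63 + b"\xfe",
    b"\x00" * 64,
    b"\x01\x02\x03\x04" * 16,
]

if __name__ == "__main__":
    print(all_ones_fraction(SAMPLE_FRAMES))
```

Running the same count on one segment at a time is one way to do the
divide-and-conquer: the segment where the fraction stays high is the one
to look at first.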

Charles.
{sun, amdahl, ucbvax, pyramid, uunet}!unisoft!cander

paula@june.cs.washington.edu (Paul Allen) (03/09/89)

In article <8902080633.AA07926@oce.orst.edu> rudolf@oce.orst.edu (Jim Rudolf) writes:
>We have two 3/280 servers running SunOS 3.5.  Almost on a weekly basis,
>usually during a period of moderate net activity, one of the servers will
>start spewing forth with:
>
>	ie0: lost interrupt: resetting
> [...]

I posted something about this last year.  Not sure now which issue it
appeared in.  I got mail from three different sites between Aug 13 and
Sept 19.  The apparent fix was from leonid%TAURUS.BITNET@CUNYVM.CUNY.EDU.
He suggested replacing the transceiver that connects the affected
machine(s) to the Ethernet coax.  In our case, we had 5 3/280's connected
through a fan-out unit to a single transceiver.  We were seeing several
crashes per day spread randomly over the 5 machines.  The transceiver got
replaced (possibly as part of some unrelated work) and we haven't seen the
lost interrupt message since.

Paul Allen

Paul L. Allen                       | pallen@atc.boeing.com
Boeing Advanced Technology Center   | ...!uw-beaver!ssc-vax!bcsaic!pallen

todds@uunet.uu.net (Todd Sandor) (03/23/89)

>rudolf@oce.orst.edu (Jim Rudolf) writes:
>>...[getting "ie0: lost interrupt: resetting" messages]
>>If this starts happening, we'll generally start getting a few of these too:
>>	NFS getattr failed for server neptune: RPC: Timed out

You don't specify which SunOS version, but we were experiencing the same
problem under SunOS 3.5 and it was fixed with the 3.5.1 fix tape, bug fix
reference # 1006375.  Hope this helps.

Todd Sandor                                      P.O. Box 9707
Cognos Incorporated                              3755 Riverside Dr.
VOICE:  (613) 738-1440   FAX: (613) 738-0002     Ottawa, Ontario
UUCP: uunet!mitel!sce!cognos!todds     CANADA  K1G 3Z4