rudolf@oce.orst.edu (Jim Rudolf) (02/16/89)
I vaguely remember seeing this discussion somewhere before. My apologies if it has already been run into the ground.

We have two 3/280 servers running SunOS 3.5. Almost on a weekly basis, usually during a period of moderate net activity, one of the servers will start spewing forth with:

    ie0: lost interrupt: resetting

If this starts happening, we'll generally start getting a few of these too:

    NFS getattr failed for server neptune: RPC: Timed out

Quoting from the man page:

    ie%d: lost interrupt: resetting
        The driver and 82586 chip have lost synchronization with each
        other. The driver recovers by resetting itself and the chip.

Our ethernet boards must not have read the man pages, because the affected board does not recover by itself. When this happens, the server is pretty much hung up, and the only effective solution we've come up with is the dreaded L1-A. Who else has experienced this? What did you do to stop it?

Thanks for your help,
Jim Rudolf
rudolf@oce.orst.edu
College of Oceanography
Oregon State University

[[ I can find no evidence of a previous discussion about this in either volume 6 or 7. --wnl ]]
arisco@cadillac.cad.mcc.com (John Arisco) (03/01/89)
We also had a problem with "ie0: lost interrupt: resetting". It is one of the bugs fixed in the 3.5.1 patch tape from Sun Software Support.

__________

Reference Number: 1006375

Synopsis: ie0: lost interrupt: resetting

Description: Heavy nfs activity on a Sun-3/280 nfs file server can result in the following:

    ie0: lost interrupt: resetting

Files Changed: /usr/sys/OBJ/if_ie.o

Special Installation Instructions: You must rebuild your kernel. Please refer to KERNEL REBUILD at the end of this document.

John Arisco, MCC CAD Program | ARPA: arisco@mcc.com | Phone: [512] 338-3576
Box 200195, Austin, TX 78720 | UUCP: ...!cs.utexas.edu!milano!cadillac!arisco
wwtz@uunet.uu.net (Wolfgang Wetz) (03/02/89)
rudolf@oce.orst.edu (Jim Rudolf) writes:
>We have two 3/280 servers running SunOS 3.5. Almost on a weekly basis,
>usually during a period of moderate net activity, one of the servers will
>start spewing forth with:
>    ie0: lost interrupt: resetting

This is a problem which occurs during heavy load on the ethernet. We experienced this problem here too. The bad thing about this "interrupt lost" is that there is no way to recover from it except a processor interrupt/reboot.

We were told by Sun Switzerland to upgrade to SunOS 3.5.1. Having done this, the problem went away.

best regards

Wolfgang Wetz, Systems Administrator, Scientific Computing Centre
c/o CIBA-GEIGY AG, R-1045.330, CH-4002 Basel, Switzerland
Internet: wwtz%cgch.uucp@uunet.uu.net    Amateur Radio: HB9PCX
UUCP: ...!mcvax!cernvax!cgch!wwtz        Phone: (+41) 61 697 54 25
BITNET: wwtz%cgch.uucp@cernvax.bitnet    Fax: (+41) 61 697 32 88
meier@rutgers.edu (Christopher M. Meier) (03/02/89)
rudolf@oce.orst.edu (Jim Rudolf) writes:
>...[getting "ie0: lost interrupt: resetting" messages]
>If this starts happening, we'll generally start getting a few of these too:
>    NFS getattr failed for server neptune: RPC: Timed out

Until I read this line, I thought this was written by someone here. We have no 'neptune' 280.

We have seen this, but mostly at times when someone is adding/removing a number of nodes from the ethernet. This seems to cause lots of 'noise' on the cable(s). I can't qualify the 'noise', as I haven't had a sniffer to use during one of those times. It has also been seen when the network traffic is heavy, but someone somewhere may have been fooling with the cable.

We don't have a solution.

@ Christopher M. Meier      ms: MN65-2300    Honeywell Systems & Research Center
@ Research Scientist/SIP    (612) 782-7191   3660 Technology Drive
@ meier@SRC.Honeywell.COM   !SRCSIP!meier    Mpls, MN 55418
dinah@shell.UUCP (Dinah Anderson) (03/02/89)
Jim Rudolf writes about a problem with ie0: lost interrupt: resetting errors. (v7n157)

We have seen this a couple of times and I believe a new CPU resolved the problem.

We are having a problem with ie0: no carrier errors. They are often accompanied by:

    ie0: Ethernet jammed

or

    ie0: WARNING: if_snd full

messages. We see 5-20 of the no carrier errors per hour on most of our file servers. We are not monitoring the workstations as closely, but they are receiving them also.

Our network topology currently consists of DEC LAN bridges connecting local segments and BridgeComm bridges (56kb and T-1) connecting the individual sites. Physical connections consist of both twisted pair (synoptics) and regular "thick" connections. The problem appears on systems at different sites with both twisted and "thick" connections. (We are currently migrating to routers and are aware of the problems with bridges everywhere.)

We are working with Sun on a resolution, but I was curious to know if anyone has seen this problem.

Dinah Anderson
Shell Oil Company, Information Center
(713) 795-3287
...!{sun,psuvax,soma,rice,ut-sally,ihnp4}!shell!dinah
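For anyone wanting to track a rate like the "5-20 per hour" figure above, here is a minimal sketch that tallies ie driver messages per host, per hour, per message text. It assumes the console messages have been captured in a syslog-style file with lines such as "Mar  2 14:05:11 neptune vmunix: ie0: no carrier"; that line format, the regular expression, and the function name count_ie_messages are illustrative assumptions, not anything taken from the posts or from Sun documentation.

```python
import re
from collections import Counter

# ASSUMED line format (hypothetical, adjust to your own logs):
#   "Mar  2 14:05:11 neptune vmunix: ie0: no carrier"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}):\d{2}:\d{2}\s+(\S+)\s+\S+\s+(ie\d+): (.+)$"
)


def count_ie_messages(lines):
    """Count ie driver messages keyed by (host, hour, message text)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not an ie driver message in the assumed format
        # Collapse the padded day-of-month so "Mar  2 14" becomes "Mar 2 14".
        hour = " ".join(m.group(1).split())
        host = m.group(2)
        message = m.group(4)
        counts[(host, hour, message)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Mar  2 14:05:11 neptune vmunix: ie0: no carrier",
        "Mar  2 14:40:02 neptune vmunix: ie0: no carrier",
        "Mar  2 15:01:00 neptune vmunix: ie0: Ethernet jammed",
    ]
    for key, n in sorted(count_ie_messages(sample).items()):
        print(key, n)
```

Keying on the message text keeps "no carrier", "Ethernet jammed" and "lost interrupt: resetting" in separate buckets, which makes it easier to see whether they rise together.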
cander@ucbvax.berkeley.edu (Charles Anderson) (03/07/89)
rudolf@oce.orst.edu (Jim Rudolf):
> We have two 3/280 servers running SunOS 3.5. Almost on a weekly basis,
> usually during a period of moderate net activity, one of the servers will
> start spewing forth with:
>     ie0: lost interrupt: resetting
>
> Our ethernet boards must not have read the man pages, because the affected
> board does not recover by itself. When this happens, the server is pretty
> much hung up, and the only effective solution we've come up with is the
> dreaded L1-A. Who else has experienced this? What did you do to stop it?

I saw this on some 3/160's running SunOS 3.4 (I think). It was happening multiple times per day on each file server (of course it started on Thanksgiving, and I had to come in all weekend to reboot machines). We eventually tracked the problem down to a faulty, pre-802.3 transceiver on the net that was writing packets that were all 1's (0xFFFFFFF...). We were fortunate in a number of ways: we had a network analyzer, the problem was happening frequently (up to 20% of the packets on the net were errors), and we could divide and conquer our net without stepping on too many users' toes.

Please pardon the following plug... I highly recommend Excelan's network analyzer, LANalyzer EX 5000. It's extremely valuable for these kinds of problems.

Charles.
{sun, amdahl, ucbvax, pyramid, uunet}!unisoft!cander
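The all-1's symptom described above is easy to test for once frames have been captured. The following is only a sketch under stated assumptions: it assumes each captured frame is already available as a Python bytes object (how you get them out of an analyzer is not covered), and the names is_all_ones and bad_fraction are hypothetical. It does not reflect how the LANalyzer itself works.

```python
def is_all_ones(frame: bytes) -> bool:
    """True if the frame is non-empty and every byte is 0xFF."""
    return len(frame) > 0 and all(b == 0xFF for b in frame)


def bad_fraction(frames) -> float:
    """Fraction of captured frames that consist entirely of 1 bits."""
    frames = list(frames)
    if not frames:
        return 0.0
    bad = sum(1 for f in frames if is_all_ones(f))
    return bad / len(frames)


if __name__ == "__main__":
    # Made-up capture: one all-1's frame among five (illustration only).
    capture = [b"\xff" * 64] + [b"\x08\x00" + b"\x00" * 62] * 4
    print(f"{bad_fraction(capture):.0%} of frames are all 1's")
```

Running the same check on captures from each half of a split network is one way to do the divide-and-conquer step: the half that still shows a high fraction contains the faulty device.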
paula@june.cs.washington.edu (Paul Allen) (03/09/89)
In article <8902080633.AA07926@oce.orst.edu> rudolf@oce.orst.edu (Jim Rudolf) writes:
>We have two 3/280 servers running SunOS 3.5. Almost on a weekly basis,
>usually during a period of moderate net activity, one of the servers will
>start spewing forth with:
>
>    ie0: lost interrupt: resetting
>
[...]

I posted something about this last year. Not sure now which issue it appeared in. I got mail from three different sites between Aug 13 and Sept 19. The apparent fix was from leonid%TAURUS.BITNET@CUNYVM.CUNY.EDU. He suggested replacing the transceiver that connects the affected machine(s) to the Ethernet coax.

In our case, we had 5 3/280's connected through a fan-out unit to a single transceiver. We were seeing several crashes per day spread randomly over the 5 machines. The transceiver got replaced (possibly as part of some unrelated work) and we haven't seen the lost interrupt message since.

Paul Allen

Paul L. Allen                     | pallen@atc.boeing.com
Boeing Advanced Technology Center | ...!uw-beaver!ssc-vax!bcsaic!pallen
todds@uunet.uu.net (Todd Sandor) (03/23/89)
>rudolf@oce.orst.edu (Jim Rudolf) writes:
>>...[getting "ie0: lost interrupt: resetting" messages]
>>If this starts happening, we'll generally start getting a few of these too:
>>    NFS getattr failed for server neptune: RPC: Timed out

You don't specify which SunOS version, but we were experiencing the same problem under SunOS 3.5 and it was fixed with the 3.5.1 fix tape, bug fix reference # 1006375.

Hope this helps.

Todd Sandor                                  P.O. Box 9707
Cognos Incorporated                          3755 Riverside Dr.
VOICE: (613) 738-1440  FAX: (613) 738-0002   Ottawa, Ontario
UUCP: uunet!mitel!sce!cognos!todds           CANADA K1G 3Z4