bonnetf@apo.esiee.fr (bonnet-franck) (02/11/91)
Hi,

We are in trouble with our TCP/IP gateway machine.
The problem is the following:
Sometimes (generally during the week-end ...) this machine
seems to hang TCP/IP traffic without any logical reason.

1 - It cannot accept any TCP/IP connection.

2 - The 2 disks are still visible and we can access them
    using DOMAIN_OS with a "cd //dsp3" command.

3 - Listing the processes on this machine gives the following:

----------------------
[surf:23143] ps -ax //dsp3
  PID TTY STAT     TIME COMMAND
    1 ?   S <      0:59 [ init ]
    2 ?   R    36165:10 null
    3 ?   S       26:28 purifier
    4 ?   S        0:47 purifier
    5 ?   S       52:16 unwired_dxm
    6 ?   S        0:00 pinger
    7 ?   S      117:31 netreceive
    8 ?   S      755:26 netpaging
    9 ?   S       11:38 wired_dxm
   10 ?   S     1755:25 netrequest
   21 ?   S       31:08 netpaging
   22 ?   S       30:58 netpaging
   25 ?   S       39:23 netrequest
   27 ?   S       39:21 netrequest
24664 ?   S        3:28 [ named ]
24668 ?   S       10:35 [ rwhod ]
-----------------------

4 - There is no spm, so we cannot crp to it; the only processes
    which seem to run are init, named, rwhod.

5 - This machine is connected to a HP835 which is our external
    gateway to the internet.

6 - The ethernet controller (802.3) seems OK.

This machine is a DN3500 with 16 Mb of memory and two 700 Mb disks.
It is running DOMAIN_OS 10.2 with BSD4.3 and AEGIS. Normally the
following processes are running on it:

- tcpd
- sendmail
- named
- rwhod
- writed
- spm

I should point out that this machine is also a big file server for all
our students, so it is not easy to stop it at any time ... I know it is
not very smart to do this but we have no choice at this time.

- Is it a known bug ?
- Does the 10.3 solve this trouble ?
- WHAT CAN I DO ???

-------------------------------------------------------------------------------|
bonnetf@apo.esiee.fr                     |                                     |
Frank Bonnet                             | Surfing ...                         |
E.S.I.E.E                                |                                     |
BP99 93162 Noisy le Grand cedex.FRANCE.  | the rest is details !               |
Fax : 33 1 45 92 66 99                   |                                     |
-------------------------------------------------------------------------------|
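The symptom in point 4 can be checked mechanically by comparing a process listing against the daemons the post says should be running. What follows is only a minimal Python sketch, assuming `ps -ax` output in the column layout shown above (numeric PID first, a TIME field containing a colon, then the command). The names EXPECTED, running_commands and missing_daemons are illustrative and not part of Domain/OS; the sample listing is an excerpt of the one in the post.

```python
# Daemons the original post says normally run on the gateway.
EXPECTED = ("tcpd", "sendmail", "named", "rwhod", "writed", "spm")

# Excerpt of the listing from the post, used here as sample input.
SAMPLE_PS = """\
PID TTY STAT TIME COMMAND
1 ? S < 0:59 [ init ]
2 ? R 36165:10 null
7 ? S 117:31 netreceive
24664 ? S 3:28 [ named ]
24668 ? S 10:35 [ rwhod ]
"""


def running_commands(ps_output):
    """Return the set of command names found in ps-style output."""
    names = set()
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row and shell prompts:
        # data rows start with a numeric PID.
        if not fields or not fields[0].isdigit():
            continue
        # The TIME column is the first field containing a colon;
        # everything after it is the command.
        time_idx = next((i for i, f in enumerate(fields) if ":" in f), None)
        if time_idx is None:
            continue
        # Drop the "[" and "]" that surround some command names.
        command = [f for f in fields[time_idx + 1:] if f not in ("[", "]")]
        if command:
            # Keep only the last path component, e.g. /etc/tcpd -> tcpd.
            names.add(command[0].rsplit("/", 1)[-1])
    return names


def missing_daemons(ps_output, expected=EXPECTED):
    """List expected daemons that do not appear in the listing."""
    running = running_commands(ps_output)
    return [name for name in expected if name not in running]


print(missing_daemons(SAMPLE_PS))  # ['tcpd', 'sendmail', 'writed', 'spm']
```

On the listing from the post this reports tcpd, sendmail, writed and spm as missing, which matches the description that only init, named and rwhod are left among the usual servers.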
krowitz@RICHTER.MIT.EDU (David Krowitz) (02/11/91)
First of all, the Apollo file services (i.e. cd //my_apollo) do not
require TCP/IP. The Apollo file system uses its own network protocols,
which are independent of the TCP/IP protocols. Your problem is that
the /etc/tcpd TCP/IP daemon is dying for some unknown reason. When it
goes away, many of the other servers which rely upon TCP/IP will also
die (i.e. sendmail, nfs, telnet, rlogin, ftp, etc.). Apollo-specific
services like CRP/SPM and the MBX helper are not affected, since they
use the same protocol suite as the Apollo file system.

You can generally fix the TCP/IP problem by simply logging in as "root"
and running the /etc/tcpd program by hand. It will run for a few
seconds, and then fork off a copy to run in the background as a server
process.

You can also use the DM's "ex" command to exit from the DM and kill all
of the servers running on the machine *except* for the Apollo
file-system servers, and then use the level-2 shell's "go" command to
reload all of the normal servers that are started by the /etc/rc,
/etc/rc.local, and /etc/rc.user shell scripts at boot time. Using "ex"
and then "go" will usually kill all jobs running on the local machine
without affecting other Apollo users whose files are on the local disk,
and will usually get the machine's servers up and running cleanly
(unless something in the Domain/OS kernel has gone belly-up :-{ ).

-- David Krowitz
krowitz@richter.mit.edu (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)
hanche@imf.unit.no (Harald Hanche-Olsen) (02/12/91)
In article <9102111537.AA10181@richter.mit.edu> krowitz@RICHTER.MIT.EDU (David Krowitz) writes:
> You can generally fix the TCP/IP problem by
> simply logging in as "root" and running the /etc/tcpd program by
> hand.
If you log in via DM, say `/etc/server -p /etc/tcpd', or DM will kill
tcpd when you log out. Also, if tcpd has died you need to remove
`node_data/systmp/tcp_data first, or tcpd won't start (at least not on
our machines).
- Harald Hanche-Olsen <hanche@imf.unit.no>
Division of Mathematical Sciences
The Norwegian Institute of Technology
N-7034 Trondheim, NORWAY
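The advice in the two replies above combines into a short decision procedure: remove the stale tcp_data file if tcpd has died, then start tcpd either directly or through /etc/server -p when logged in via the DM. The Python sketch below only builds the command lists and runs nothing. The use of `rm` is an assumption (the post says to remove the file but does not name a command), the path is copied verbatim from the post, and the function and constant names are hypothetical.

```python
TCPD = "/etc/tcpd"
SERVER_WRAPPER = "/etc/server"
# Path copied verbatim from the post, including its leading backquote.
STALE_STATE = "`node_data/systmp/tcp_data"


def tcpd_restart_plan(logged_in_via_dm, stale_state_present):
    """Return the commands to run, in order, as argument lists."""
    steps = []
    if stale_state_present:
        # Assumption: plain rm is enough to remove the leftover file.
        steps.append(["rm", STALE_STATE])
    if logged_in_via_dm:
        # Per the post, DM kills tcpd at logout unless it is started this way.
        steps.append([SERVER_WRAPPER, "-p", TCPD])
    else:
        steps.append([TCPD])
    return steps


for step in tcpd_restart_plan(logged_in_via_dm=True, stale_state_present=True):
    print(" ".join(step))
```

Keeping the commands as argument lists avoids any shell quoting of the backquote in the path; whether the removal step is needed on a given node is something to check by hand first.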
system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (02/14/91)
In article <9102111305.AA09778@apo.esiee.fr> bonnetf@apo.esiee.fr (bonnet-franck) writes:
>We are in trouble with our TCP/IP gateway machine.
>The problem is the following :
> Sometimes ( generally during the week-end ...) this machine
>seems to hang TCP/IP traffic without any logical reason.
> ... lots of stuff deleted ...
>I should point out that this machine is also a big file server for all our students so
>it is not easy to stop it at any time ... I know it is not very smart to do
>this but we have no choice at this time.
>
>- Is it a known bug ?

Yes, but it is partially patched by one of the SR10.2 Domain/OS patches.
Sorry, I don't remember which one.
Make sure you have the proper Ethernet microcode (the SR10.2 or later
version) on ALL your Ethernet nodes.

>- Does the 10.3 solve this trouble ?

No. We still see this at SR10.3 on a DN4500 running X and NFS
(both heavy (ab)users of TCP/IP). It used to hang once a month or so,
but since SR10.3 and NFS, it is once a week.
On our DN10000, it used to be once a week, but as of SR10.3.p + NFS +
USENET news, our MTBH (mean time between hangs) is about 2 days.
Our system is also a file server for all our users (150), and when it dies,
it is a big pain, but rebooting is the only solution I know of.

>- WHAT CAN I DO ???

Complain to Apollo if you have a software support contract.
--
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094 Fax: (416) 978-8775
krowitz@RICHTER.MIT.EDU (David Krowitz) (02/15/91)
Hmmm ... I've got nearly *exactly* the same configuration, i.e. a
DN3500-8MB, with 697 MB disk, Apollo token ring, and ethernet cards.
It acts as the sole gateway between our Apollo token ring and the
MIT campus-wide ethernet. We are running SR10.2 with (I believe) the
standard (i.e. non-patched) TCP/IP daemon -- and we have *no* problems
with the machine. Our software versions are as follows:

$ bldt
**** Node 1C3B2 **** "//rayleigh"
Domain/OS kernel(7), revision 10.2, October 13, 1989 12:51:22 pm

$ ts /etc/tcpd
Ver Name Time Stamp File Name
--------------------------------------------------------------
c 1 net_main 1989/09/19 18:17:00 EST (Tue) /etc/tcpd

Our "ifconfig" commands in the /etc/rc.local file are as follows:

/etc/ifconfig dr0 18.138.0.117 netmask defaultmask
/etc/ifconfig eth0 18.83.0.110 netmask defaultmask

We do *not* run /etc/routed on any of our nodes other than the gateway.
The MIT campus network office doesn't even recommend that, since there
is only 1 route from the Apollo token ring network to our building
ethernet (i.e. our DN3500 gateway), and only 1 route from the building
ethernet to the campus FDDI optical backbone. The networking office
claims that the BSD implementation of the routing protocols is
inherently inclined to jam your network, and therefore prefers all
machines to use static routing (other than the machines which connect
the building ethernets to the backbone, which are special machines
that only do routing). The jamming apparently results from a cascading
effect that can occur when all the "routed"s in an internet try to
update each other.

I can't get a time stamp for our ethernet microcode file (/com/ts won't
report such things for files of type "nil"), but you can try comparing
the file size:

$ ld -a /sys/ethernet8_microcode

 sys    type   blocks   current
 type   uid     used     length   attr   rights   name
 file   nil        7       7168    P     prwx-    /sys/ethernet8_microcode

1 entry listed, 7 blocks used.
-- David Krowitz
krowitz@richter.mit.edu (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)
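The file-size comparison suggested above is simple arithmetic, sketched here in Python. The reference length of 7168 bytes comes from the ld -a listing in the post. The 1024-byte block size is only inferred from that listing (7 blocks used for 7168 bytes), rounding partial blocks up is an assumption, and the function names are hypothetical.

```python
import os

REFERENCE_LENGTH = 7168  # bytes, the "current length" in the listing
BLOCK_SIZE = 1024        # inferred: 7 blocks used for 7168 bytes


def size_report(length, reference=REFERENCE_LENGTH):
    """Compare a file length against the reference from the post."""
    # Ceiling division; assumes a partial block counts as a whole one.
    blocks = -(-length // BLOCK_SIZE)
    return {"length": length, "blocks": blocks, "matches": length == reference}


def check_microcode(path="/sys/ethernet8_microcode"):
    """Report on the microcode file at the path used in the post."""
    return size_report(os.path.getsize(path))


print(size_report(7168))  # {'length': 7168, 'blocks': 7, 'matches': True}
```

A matching length does not prove the microcode is the same revision; it is only the cheap check the post proposes when no time stamp is available.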