[comp.sys.apollo] TCP/IP hangup

bonnetf@apo.esiee.fr (bonnet-franck) (02/11/91)

Hi,

We are having trouble with our TCP/IP gateway machine.

The problem is as follows:

 Sometimes (generally over the weekend ...) this machine
seems to stop handling TCP/IP traffic for no apparent reason.

 1 -  It cannot accept any TCP/IP connections.

 2 -  The 2 disks are still visible and we can access them
      through DOMAIN_OS with a "cd //dsp3" command.

 3 -  Listing the processes on this machine gives the following:

----------------------

[surf:23143] ps -ax //dsp3
  PID TTY     STAT  TIME COMMAND
    1 ?       S <   0:59 [ init ]
    2 ?       R   36165:10 null
    3 ?       S    26:28 purifier
    4 ?       S     0:47 purifier
    5 ?       S    52:16 unwired_dxm
    6 ?       S     0:00 pinger
    7 ?       S   117:31 netreceive
    8 ?       S   755:26 netpaging
    9 ?       S    11:38 wired_dxm
   10 ?       S   1755:25 netrequest
   21 ?       S    31:08 netpaging
   22 ?       S    30:58 netpaging
   25 ?       S    39:23 netrequest
   27 ?       S    39:21 netrequest
24664 ?       S     3:28 [ named ]
24668 ?       S    10:35 [ rwhod ]

-----------------------

  4 -  There is no spm, so we cannot crp to it; the only processes that
       seem to be running are init, named and rwhod.

  5 -  This machine is connected to a HP835 which is our external gateway to 
       the internet. 

  6 -  The ethernet controller ( 802.3 ) seems OK.

This machine is a DN3500 with 16 MB of memory and two 700 MB disks. It is running
DOMAIN_OS 10.2 with BSD4.3 and AEGIS. Normally the following processes are
running on it:

  - tcpd
  - sendmail
  - named
  - rwhod
  - writed
  - spm
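
As an illustration only, comparing the list above with the ps listing
can be sketched as a small shell function. The names missing_daemons and
EXPECTED_DAEMONS are made up for this example. It assumes a shell with
functions, a grep that supports -q and -w, and a listing laid out like
the one shown earlier in this post.

```shell
# Daemon names taken from the list in the post.
EXPECTED_DAEMONS="tcpd sendmail named rwhod writed spm"

# missing_daemons: read a "ps -ax" style listing on stdin and print, on
# one line, the expected daemons that do not appear anywhere in it.
missing_daemons() {
    listing=$(cat)
    missing=""
    for d in $EXPECTED_DAEMONS; do
        # -w matches whole words only, so "spm" does not match "unwired_dxm"
        if ! printf '%s\n' "$listing" | grep -qw "$d"; then
            missing="$missing $d"
        fi
    done
    # Drop the leading space left by the loop.
    printf '%s\n' "${missing# }"
}
```

Fed the listing from this post (for example with
"ps -ax //dsp3 | missing_daemons"), it reports tcpd, sendmail, writed
and spm as missing.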

I should point out that this machine is also a big file server for all our
students, so it is not easy to take it down at any time ... I know this setup
is not very smart, but we have no choice at the moment.

- Is this a known bug?

- Does SR10.3 fix this problem?

- WHAT CAN I DO ???                      

-------------------------------------------------------------------------------|
bonnetf@apo.esiee.fr                     |                                     |
Frank Bonnet                             | Surfing ...                         |
E.S.I.E.E                                |                                     |
BP99 93162 Noisy le Grand cedex.FRANCE.  | the rest is details !               |
Fax   : 33 1 45 92 66 99                 |                                     |
-------------------------------------------------------------------------------|

krowitz@RICHTER.MIT.EDU (David Krowitz) (02/11/91)

First of all, the Apollo file services (e.g. cd //my_apollo) do
not require TCP/IP. The Apollo file system uses its own network
protocols, which are independent of the TCP/IP protocols.

Your problem is that the /etc/tcpd TCP/IP daemon is dying for some
unknown reason. When it goes away, many of the other servers which
rely upon TCP/IP will also die (e.g. sendmail, nfs, telnet, rlogin,
ftp, etc.). Apollo-specific services like CRP/SPM and the MBX helper
are not affected, since they use the same protocol suite as the
Apollo file system. You can generally fix the TCP/IP problem by
simply logging in as "root" and running the /etc/tcpd program by
hand. It will run for a few seconds, and then fork off a copy to
run in the background as a server process.

You can also use the DM's "ex" command to exit from the DM and kill
all of the servers running on the machine *except* for the Apollo
file-system servers, and then use the level-2 shell's "go" command to
reload all of the normal servers that are started by the /etc/rc,
/etc/rc.local, and /etc/rc.user shell scripts at boot time. Using "ex"
and then "go" will usually kill all jobs running on the local machine
without affecting other Apollo users whose files are on the local disk,
and will usually get the machine's servers up and running cleanly
(unless something in the Domain/OS kernel has gone belly-up :-{ ).


 -- David Krowitz

krowitz@richter.mit.edu   (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)

hanche@imf.unit.no (Harald Hanche-Olsen) (02/12/91)

In article <9102111537.AA10181@richter.mit.edu> krowitz@RICHTER.MIT.EDU (David Krowitz) writes:

   You can generally fix the TCP/IP problem by
   simply logging in as "root" and running the /etc/tcpd program by
   hand.

If you log in via DM, say `/etc/server -p /etc/tcpd', or DM will kill
tcpd when you log out.  Also, if tcpd has died you need to remove
`node_data/systmp/tcp_data first, or tcpd won't start (at least not on
our machines).
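
The two steps above can be sketched as a shell function. This is a
dry-run sketch under stated assumptions, not a tested Domain/OS
procedure: the helper names run and restart_tcpd and the DRY_RUN
variable are invented for this example, and using rm is an assumption
(the text only says to remove the file). The leading backquote in
`node_data is taken to be part of the pathname, so it is single-quoted
to keep the shell from treating it as command substitution.

```shell
# run: print the command when DRY_RUN is 1 (the default), otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# restart_tcpd: remove the stale tcp_data file, then start tcpd through
# /etc/server -p so that a DM logout does not kill it.
restart_tcpd() {
    # rm is an assumption; the post does not name the command to use.
    run rm -f '`node_data/systmp/tcp_data' || return 1
    run /etc/server -p /etc/tcpd
}
```

Calling restart_tcpd with DRY_RUN unset only prints the two commands;
setting DRY_RUN=0 as root on the gateway node would actually run them.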

- Harald Hanche-Olsen <hanche@imf.unit.no>
  Division of Mathematical Sciences
  The Norwegian Institute of Technology
  N-7034 Trondheim, NORWAY

system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (02/14/91)

In article <9102111305.AA09778@apo.esiee.fr> bonnetf@apo.esiee.fr (bonnet-franck) writes:
>We are having trouble with our TCP/IP gateway machine.
>The problem is as follows:
> Sometimes (generally over the weekend ...) this machine
>seems to stop handling TCP/IP traffic for no apparent reason.
>   ... lots of stuff deleted ...
>I should point out that this machine is also a big file server for all our
>students, so it is not easy to take it down at any time ... I know this setup
>is not very smart, but we have no choice at the moment.
>
>- Is this a known bug?

Yes, but it is partially fixed by one of the SR10.2 Domain/OS patches.
Sorry, I don't remember which one. Make sure you have the proper Ethernet
microcode (the SR10.2 or later version) on ALL your Ethernet nodes.

>- Does SR10.3 fix this problem?

No. We still see this at SR10.3 on a DN4500 running X and NFS (both heavy
(ab)users of TCP/IP). It used to hang once a month or so, but since we moved
to SR10.3 and NFS, it hangs once a week. On our DN10000, it used to be once a
week, but as of SR10.3.p + NFS + USENET news, our MTBH (mean time between hangs)
is about 2 days. Our system is also a file server for all our users (150),
and when it dies, it is a big pain, but rebooting is the only solution I
know of.

>- WHAT CAN I DO ???                      

Complain to Apollo if you have a software support contract.
-- 
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094                  Fax: (416) 978-8775

krowitz@RICHTER.MIT.EDU (David Krowitz) (02/15/91)

Hmmm ... I've got nearly *exactly* the same configuration, i.e. a DN3500-8MB with
a 697 MB disk, Apollo token ring, and ethernet cards. It acts as the sole gateway
between our Apollo token ring and the MIT campus-wide ethernet. We are running
SR10.2 with (I believe) the standard (i.e. non-patched) TCP/IP daemon, and we have
*no* problems with the machine. Our software versions are as follows:

$ bldt

     **** Node 1C3B2 ****   "//rayleigh"
Domain/OS kernel(7), revision 10.2, October 13, 1989  12:51:22 pm

$ ts /etc/tcpd
Ver Name              Time Stamp                     File Name
--------------------------------------------------------------
c 1 net_main          1989/09/19 18:17:00 EST (Tue)  /etc/tcpd

Our "ifconfig" commands in the /etc/rc.local file are as follows:

	/etc/ifconfig dr0  18.138.0.117 netmask defaultmask
	/etc/ifconfig eth0 18.83.0.110  netmask defaultmask


We do *not* run /etc/routed on any of our nodes other than the gateway.
The MIT campus network office doesn't even recommend running it there, since
there is only 1 route from the Apollo token ring network to our building
ethernet (i.e. our DN3500 gateway), and only 1 route from the building
ethernet to the campus FDDI optical backbone. The networking office claims
that the BSD implementation of the routing protocols is inherently inclined
to jam your network, and therefore prefers all machines to use static routing
(other than the machines which connect the building ethernets to the
backbone, which are special machines that only do routing). The jamming
apparently results from a cascading effect that can occur when all the
"routed"s in an internet try to update each other.
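
To make the static-routing idea concrete, here is a hedged sketch of
the kind of line a non-gateway node might carry in its /etc/rc.local
instead of running routed. It assumes the 4.3BSD form of the route
command ("route add default <gateway> <metric>") and that the binary
lives at /etc/route; neither has been checked against Domain/OS. The
function name default_route_cmd is invented, and it only prints the
command rather than changing any routing table.

```shell
# default_route_cmd GATEWAY: print a static default-route command for a
# node's rc.local. Syntax and path are assumptions (4.3BSD route, metric 1).
default_route_cmd() {
    printf '/etc/route add default %s 1\n' "$1"
}
```

For a node on the token ring, the gateway would presumably be the dr0
address given above, i.e. "default_route_cmd 18.138.0.117".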

I can't get a time stamp for our ethernet microcode file (/com/ts won't report
such things for files of type "nil"), but you can try comparing the file size:

$ ld -a /sys/ethernet8_microcode

sys   type      blocks  current
type  uid         used   length   attr rights       name

file  nil            7      7168  P    prwx-        /sys/ethernet8_microcode

1 entry listed, 7 blocks used.
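
The size comparison suggested above can be sketched in shell as
follows. The function name same_size is invented for this example, and
it relies only on "wc -c", which is assumed to be available in the BSD
environment on the node being checked.

```shell
# same_size FILE BYTES: succeed if FILE is exactly BYTES bytes long,
# fail with status 1 on a mismatch and status 2 if FILE cannot be read.
same_size() {
    actual=$(wc -c < "$1") || return 2
    # The arithmetic expansion strips any padding wc may add to its output.
    [ "$((actual))" -eq "$2" ]
}
```

For example, "same_size /sys/ethernet8_microcode 7168 && echo match"
checks a node's file against the length shown in the listing.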





 -- David Krowitz

krowitz@richter.mit.edu   (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)