[comp.sys.apollo] TCP problem

lid@cernvax.UUCP (lid) (09/11/87)

We are experiencing a lot of problems with TCP/IP here and I would like to
know if someone else out there has seen/solved similar problems.

We have an Apollo ring with 2 TCP/IP gateways (Interlan NM10 + dsp[89]0)
that connect it to our Ethernet. All the machines on the Ethernet know about
both gateways, just as all the non-gateway Apollos do.
From time to time, for no apparent reason, some Ethernet hosts become
unreachable from the non-gateway Apollos (but they still work from the
gateways). The message we get on the Apollos is "destination timed out".
The routing information on the Apollos is OK, and some other Ethernet hosts
are still reachable.
We almost always have problems with Wollongong TCP/IP on VAX/VMS and with
Wiscnet TCP/IP on IBM/VM systems, and almost never with Ultrix.
It seems to me that the Apollo side is OK. I monitored an "ftp vaxvms" and
saw:
	a) the request went thru the Apollo gateway,
	b) reached the vax (found out with Wollongong's netstat),
	c) the vax knew about both Apollo gateways,
	d) but the ftp'ing Apollo never got an answer.

TCP/IP on the Vax had been working just a couple of minutes before, and none
of the system people were working on it during that time.
Note that the Vax was still able to reach other Ethernet hosts,
including the Apollo gateway.

Does any of this resemble something you've already seen?
Any suggestions about what the problem could be?

It seems to me that when we only had 1 gateway things were working
better, but I wouldn't bet on this because TCP/IP usage is steadily
growing at our site, so maybe it was working better just because of the
smaller amount of traffic...

As far as Apollo hosts are concerned the thing works as you would expect:
if the first gateway in the routing table of the Apollo does not answer,
the second one is used, providing automatic fall back for Apollo users.
The whole exercise was started to provide more reliable and uninterrupted
operation of TCP/IP !!!
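
A minimal sketch of the fall-back behaviour described above, in Python. It
is only an illustration of "try the gateways in table order", not the Apollo
implementation; the gateway names and the is_alive check are hypothetical.

```python
def pick_gateway(gateways, is_alive):
    """Return the first gateway (in routing-table order) that answers, else None."""
    for gw in gateways:
        if is_alive(gw):
            return gw
    return None


# Hypothetical names standing in for the two Apollo gateways.
gateways = ["apollo-gw1", "apollo-gw2"]
# Pretend the first gateway is down and the second one is up.
status = {"apollo-gw1": False, "apollo-gw2": True}

chosen = pick_gateway(gateways, lambda gw: status[gw])
print(chosen)  # apollo-gw2
```

Note that this only covers the outbound choice made on the Apollo; it says
nothing about which path the Ethernet host uses for its reply.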

Thanks to all who can give me some hints.

	Achille Petrilli

krowitz@mit-richter.UUCP (David Krowitz) (09/11/87)

I have seen the error message "destination timed out" before
here at MIT. In the past the cause seemed to be the routing
tables on the ethernet host getting flushed periodically
(probably because it hadn't seen any info [or any
correct info] from the rip servers on the Apollos). What would
happen is that the non-gateway Apollos would know to route
the packets for the ethernet host via the Apollo gateway, but
the ethernet host would lose the routing table that told it
to reply to the non-gateway Apollos via the Apollo gateway.
The ethernet host can always talk to the Apollo gateway 
because it is on the same network -- no routing info is
needed. We solved this problem on our BSD 4.2 machines by
forcing a static entry for the Apollo network into the
routing tables of the ethernet hosts, using the command
"route add apollo-net apollo-gw 1" in each machine's startup
file (/etc/rc.local). Unlike the info provided by the
rip server/routed programs, this entry does not get flushed
if the ethernet host doesn't hear routing info from the
Apollo gateway at regular intervals.
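
The difference between a learned route and a static one can be sketched
as a toy model in Python. This is not how routed is written; the RouteTable
class, its method names, and the 180-second timeout are assumptions made
for illustration only.

```python
EXPIRE_SECONDS = 180  # assumed timeout; the real value depends on the routed in use


class RouteTable:
    """Toy routing table: learned routes age out, static routes do not."""

    def __init__(self):
        # destination network -> (gateway, time last heard, is_static)
        self.routes = {}

    def add_static(self, net, gateway):
        # Models "route add apollo-net apollo-gw 1": never aged out.
        self.routes[net] = (gateway, None, True)

    def learn(self, net, gateway, now):
        # Models a routing update heard from a gateway at time `now`.
        entry = self.routes.get(net)
        if entry is not None and entry[2]:
            return  # in this sketch a static entry is left untouched
        self.routes[net] = (gateway, now, False)

    def expire(self, now):
        # Drop learned routes that have not been refreshed recently.
        for net, (_gw, last_heard, is_static) in list(self.routes.items()):
            if not is_static and now - last_heard > EXPIRE_SECONDS:
                del self.routes[net]

    def lookup(self, net):
        entry = self.routes.get(net)
        return entry[0] if entry else None


learned = RouteTable()
learned.learn("apollo-net", "apollo-gw", now=0)
learned.expire(now=200)              # no update heard for 200 seconds
print(learned.lookup("apollo-net"))  # None: replies have nowhere to go

static = RouteTable()
static.add_static("apollo-net", "apollo-gw")
static.expire(now=200)
print(static.lookup("apollo-net"))   # apollo-gw
```

In the first case the ethernet host still receives the request but has
lost its route back to the Apollo network, which matches the symptom of
a request arriving and no answer ever returning.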


 -- David Krowitz

mit-erl!mit-kermit!krowitz@eddie.mit.edu
mit-erl!mit-kermit!krowitz@mit-eddie.arpa
krowitz@mit-mc.arpa
(in order of decreasing preference)

giebelhaus@hi-csc.UUCP (Timothy R. Giebelhaus) (09/13/87)

I have two ucr's which cover this:
0d885985 and 0d885997

The answers to both of them say that this problem will be fixed
in release 3.1 of TCP.  3.1 seems to be overdue (ready to be
released any day).  You may want to bother your salesman about
it.

On my network, it is only the UNIX machines which get lost.  I
assume that the routed's are somehow not communicating correctly.
I'd bet it is a bug in UNIX.