[comp.protocols.tcp-ip] routing problems? Core gateways?

brian@sdcsvax.UCSD.EDU (Brian Kantor) (01/13/88)

I can't ping SEISMO.CSS.GOV from our host 'SDCSVAX.UCSD.EDU' at
26.5.0.3.  I can ping SEISMO quite nicely from our neighbor on the same
IMP (NOSC.MIL at 26.0.0.3).  We are both using the same default gateway
at 26.2.0.73, which seems to be the way the packets are going.

I guess SEISMO could have a different route back to SDCSVAX than it
does back to NOSC, but I don't know WHY it would, since we're both
on the same IMP.

SEISMO isn't an isolated case - I can't ping any of the following hosts
from SDCSVAX, but I can from NOSC:

	uxe.cso.uiuc.edu	128.174.5.54
	mcc.com			10.3.0.62
	ius2.cs.cmu.edu		128.2.254.176
	cs.utah.edu		10.0.0.4

There are others.  And there are hundreds of mail messages waiting in
our queue because we can't get to their nameservers to get delivery 
addresses.  (There ARE hosts on network 10 that I can ping, btw, such 
as UCBVAX.)
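
A minimal sketch of the kind of reachability sweep involved here, in
Python (assuming a Unix ping that takes a "-c 1" count flag; the host
list is just the ones above):

#!/usr/bin/env python3
# Quick reachability sweep: one ping to each host, report who answers.
# Assumes a Unix ping(1) that accepts "-c 1".
import subprocess

HOSTS = {
    "uxe.cso.uiuc.edu": "128.174.5.54",
    "mcc.com":          "10.3.0.62",
    "ius2.cs.cmu.edu":  "128.2.254.176",
    "cs.utah.edu":      "10.0.0.4",
}

for name, addr in HOSTS.items():
    # A nonzero exit status means no echo reply came back.
    rc = subprocess.run(["ping", "-c", "1", addr],
                        stdout=subprocess.DEVNULL,
                        stderr=subprocess.DEVNULL).returncode
    print("%-20s %-15s %s" % (name, addr, "ok" if rc == 0 else "NO ANSWER"))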

I called the Milnet NOC, and they seemed to think that it's a core
gateway problem and told me to call BBN.  BBN assured me that they were
already working on the problem - it seems they see a problem with one
of the gateways, but I don't know what that problem might be.

I'd like to understand what's going on, as well as get it fixed.  The
only thing I can think of is that NOSC is connected to the IMP with
1822 and we're connected with X.25, but I don't quite see how that
would make a difference either.  NOSC is running EGP for our 128.54
network, so I don't think that's the problem - if indeed that is even
relevant, since we're dealing directly with network 26 source addresses
in the pings.

Oh yeah, we've been having this sort of problem since Friday.  I'm not
sure it's always the same hosts.

I'm stumped.  Anybody got any ideas?  

	Brian Kantor
		UCSD Office of Academic Computing
		Academic Network Operations Group  
		UCSD B-028, La Jolla, CA 92093 USA
		brian@sdcsvax.ucsd.edu	(619) 534-6865

dag@hub.ucsb.edu (Darren Daggit) (01/14/88)

In article <4481@sdcsvax.UCSD.EDU> brian@sdcsvax.UCSD.EDU (Brian Kantor) writes:
>I can't ping SEISMO.CSS.GOV from our host 'SDCSVAX.UCSD.EDU' at
>26.5.0.3.  I can ping SEISMO quite nicely from our neighbor on the same
>IMP (NOSC.MIL at 26.0.0.3).  We are both using the same default gateway
>at 26.2.0.73, which seems to be the way the packets are going.
>
>SEISMO isn't an isolated case - I can't ping any of the following hosts
>from SDCSVAX, but I can from NOSC:
>
>	uxe.cso.uiuc.edu	128.174.5.54
>	mcc.com			10.3.0.62
>	ius2.cs.cmu.edu		128.2.254.176
>	cs.utah.edu		10.0.0.4

Brian,

I can get to all of these hosts, and haven't seen any problems routing to them.
We are sending stuff through SDSC to the NSFNET backbone and on from there,
so things probably aren't touching the same piece of ethernet cable that you
are on, but the systems are reachable.

There have been reported problems affecting at least one of these
systems (cs.utah.edu).  The gateway psc.psc.edu has evidently been in
trouble over the past few days, and traffic hasn't been shuttled around
the way it should be.

>There are others.  And there are hundreds of mail messages waiting in
>our queue because we can't get to their nameservers to get delivery 
>addresses.  (There ARE hosts on network 10 that I can ping, btw, such 
>as UCBVAX.)

I don't have my map of how the UCSD campus is connected to SDSC at hand, but
UCBVAX may not be a good example.  If you are routing things over the fewest
hops, then traffic to UCBVAX probably isn't going to your default gateway;
the fastest route would be through the SDSC Proteon and out to the
BARRNET (there are two R's in BARRNET, aren't there?) backbone.  Is this right?
I have noticed that when the SDSC link into the NSFNET goes down we can still
get to the BARRNET, so the SDSC Proteon is routing things correctly in that
respect.
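
A rough way to check which gateway a given destination actually uses -
a sketch that just shells out to BSD-style "netstat -rn" and a
traceroute binary (the target name is only a placeholder):

#!/usr/bin/env python3
# Dump the kernel routing table, then trace the path to one destination,
# to see whether traffic really leaves via the default gateway or via
# the campus Proteon.  Assumes "netstat -rn" and traceroute(8) exist.
import subprocess

TARGET = "ucbvax"   # placeholder destination; substitute a real host

print("--- kernel routing table ---")
subprocess.run(["netstat", "-rn"])

print("--- path to %s ---" % TARGET)
subprocess.run(["traceroute", TARGET])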

>I called the Milnet NOC, and they seemed to think that it's a core
>gateway problem and told me to call BBN.  BBN assures me that they were
>already working on the problem - seems that they see a problem with one
>of the gateways, but I don't know what that problem would be.

That is probably the PSC.PSC.EDU gateway; maybe it is affecting more
than one site.
 
  I hope this helps,
     --darren


-----------------------------------------------------------------------------
Darren Griffiths                      BITNET   -   DAG@SBITP
Systems Manager                       ARPA     -   DAG@NOBBS.UCSB.EDU
Physics Computer Services                          (128.111.8.50)
University of California              HEPNET   -   NOBBS::DAG  
Santa Barbara, CA  93106                           13326::DAG
(805)961-2602

brian@sdcsvax.UCSD.EDU (Brian Kantor) (01/15/88)

Well, we're a little closer to the solution - at least we know what the
real problem is, although not the cause.

As I said, we're connected to our Milnet IMP via X.25.  That requires that
we have an X.25 virtual circuit for each separate host or gateway on the
Milnet that we want to talk to.  If we are the first to send a
packet to that address, we open the VC in the range from 1 to 32.  If
the distant host is the first to send a packet to us, the IMP must open
the VC in the range 64 down to 33.  If a VC is already open (no matter
whether the IMP or our interface opened it), it will be used.  If a VC
is idle for a configurable length of time (currently 60 seconds), the
circuit will be closed.
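
A toy model of that VC bookkeeping, just to make the ranges and the
idle timer concrete (a sketch in Python; the 1-32 / 64-33 split and the
60-second timeout come from the description above, everything else,
addresses included, is made up):

#!/usr/bin/env python3
# Toy model of the X.25 VC table described above: VCs we originate come
# from 1..32, VCs the IMP originates for inbound traffic come from 64
# down to 33, an already-open VC is reused in either direction, and a
# VC idle for 60 seconds is torn down.  Purely illustrative.
IDLE_TIMEOUT = 60.0   # seconds

class VCTable:
    def __init__(self):
        self.vcs = {}          # remote address -> (vc number, last use)

    def _allocate(self, outgoing):
        used = {vc for vc, _ in self.vcs.values()}
        pool = range(1, 33) if outgoing else range(64, 32, -1)
        for vc in pool:
            if vc not in used:
                return vc
        return None            # all circuits in use

    def packet(self, remote, now, outgoing):
        """Account for one packet to or from 'remote' at time 'now'."""
        self.expire(now)
        if remote in self.vcs:             # reuse an open VC either way
            vc, _ = self.vcs[remote]
        else:
            vc = self._allocate(outgoing)
            if vc is None:
                return None
        self.vcs[remote] = (vc, now)
        return vc

    def expire(self, now):
        """Close any VC that has been idle longer than the timeout."""
        self.vcs = {r: (vc, t) for r, (vc, t) in self.vcs.items()
                    if now - t < IDLE_TIMEOUT}

t = VCTable()
print(t.packet("26.2.0.73", now=0.0, outgoing=True))     # we open VC 1
print(t.packet("26.2.0.73", now=30.0, outgoing=False))   # reused: still 1
print(t.packet("gateway-b", now=40.0, outgoing=False))   # IMP opens VC 64
print(t.packet("26.2.0.73", now=200.0, outgoing=True))   # idle, reopened as 1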

It seems that our problem is caused by the IMP not being able to open
VCs to us.  We CAN open outgoing VCs, but if the first packet from a
host or gateway on the Milnet is originated by them instead of us, then
the VC connection won't happen and things don't work.

Our problem with Seismo and the other hosts is that they are on the far
side of gateways to the Arpanet, so packets from them to us must pass
through a minimum of TWO gateways - one from their network onto the
Arpanet, then one of the Arpa-Milnet mailbridges - in order to get to
us.  Since the ICMP echo response (like all other traffic from them to
us) may well come back through a different mailbridge than the one
through which we send our packets to them, they might indeed see our
packets while we never see theirs, because the X.25 VC from the IMP for
that return mailbridge couldn't be opened.

Clear as mud, eh?

We think it started when the IMP was reloaded after a power failure
last Saturday.  The NOC has somebody looking into it, they tell me.  In
the meantime, we're working around the problem by sending a single ping
packet to each of the seven mailbridges once every 45 seconds.  That's
sufficient to keep the X.25 virtual circuits active so people can get to
us.  It doesn't cure the problem, but it's a whole lot more livable this
way.
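
For the record, the workaround amounts to something like this (a sketch
in Python, assuming a ping that takes "-c 1"; the addresses below are
placeholders, not the actual seven mailbridges):

#!/usr/bin/env python3
# Keepalive workaround: one echo request to each mailbridge every 45
# seconds - just enough traffic to keep the X.25 VC toward each one
# open, since the idle timer is 60 seconds.  Assumes a Unix ping(1)
# that accepts "-c 1"; the address list is a placeholder.
import subprocess
import time

MAILBRIDGES = ["mailbridge-1", "mailbridge-2", "mailbridge-3"]   # placeholders
INTERVAL = 45   # seconds, comfortably under the 60-second idle timer

while True:
    for addr in MAILBRIDGES:
        subprocess.run(["ping", "-c", "1", addr],
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
    time.sleep(INTERVAL)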

Deep gratitude to Mike Brescia of BBN who helped figure out what's going
on - or not going on, in this case.

	Brian Kantor	UCSD Office of Academic Computing
			Academic Network Operations Group  
			UCSD B-028, La Jolla, CA 92093 USA

oconnor@SCCGATE.SCC.COM (Michael J. O'Connor) (01/15/88)

Brian,
	Your problem is similar to the one we experience with PSN 7.0
software.  In our case, the incoming VC would be created but only one
packet would be received.  If we pinged the origin of the VC, traffic
would begin flowing.  The problem was attributed to the new PSN code
expecting an X.25 RR after the first packet, which our X.25 didn't provide.
This was a change from PSN 6 behavior.  We also set up a periodic pinging
of the mailbridges in order to keep mail flowing (the suggestion came from
BBN).
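
To make that concrete, here is a toy model of the behavior as described:
the PSN delivers the first data packet on an incoming VC and then waits
for an X.25 RR before delivering anything further, so a host X.25 that
never sends that RR sees exactly one packet and then silence (a sketch;
nothing here is real PSN or X.25 code):

#!/usr/bin/env python3
# Toy model of the PSN 7.0 "stuck VC" symptom described above: after the
# first data packet on an incoming VC, the PSN waits for an RR (Receive
# Ready) before delivering more.  A host X.25 that never sends the RR
# therefore receives exactly one packet.  Purely illustrative.

def deliver(packets, host_sends_rr_after_first):
    delivered = []
    for i, pkt in enumerate(packets):
        delivered.append(pkt)
        if i == 0 and not host_sends_rr_after_first:
            break              # PSN stalls waiting for an RR that never comes
    return delivered

incoming = ["pkt1", "pkt2", "pkt3"]
print(deliver(incoming, False))   # ['pkt1']  - the stuck VC
print(deliver(incoming, True))    # all three packets get through
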
	What is dissimilar is that we are an Arpanet node while you are
on the Milnet.  I didn't think PSN 7.0 was to be installed on the Milnet
just yet, so unless someone slipped up reloading your PSN I don't know
how you could have the same problem we do.
	We are using Sun's X.25 in our gateway and were told that only Sun
users experienced the "hung VC" problem with PSN 7.0.
	I'd be interested to know whether or not you have the resources to
determine whether you get no incoming packets on the bad VCs or just one.

			Mike

pogran@ccq.bbn.COM (Ken Pogran) (01/16/88)

Mike,

I can confirm that PSN 7.0 has NOT been installed in the MILNET
yet, and there's no chance that that PSN was accidentally reloaded
with PSN 7.0 software (hosts on that node wouldn't be able to
communicate AT ALL with the rest of the MILNET if that had
happened!).

I'm not plugged in to the trouble-shooting process on the MILNET,
so I can't speculate on what Brian's problem might actually be.
I agree with you that the external symptoms do resemble the
ARPANET PSN 7 "stuck VC" problem, but I think the explanation has
to turn out to be quite different.

Regards,
 Ken

P.S.: With regard to the point that "only Sun users experienced
the 'hung VC' problem with PSN 7.0", it's fair to say that Sun's
was the only X.25 implementation that we FOUND on the ARPANET
that encountered that problem; it's possible that other
implementors could have made the same design decisions that Sun
made and could have had the problem, too.  But, as I said above,
that's a PSN 7 problem and certainly isn't what Brian's seeing on
the MILNET right now.