[comp.protocols.tcp-ip] Troubleshooting routing problems

brian@ucsd.EDU (Brian Kantor) (07/08/89)

Occasionally we here at UCSD seem to suffer from connectivity problems
that I think are a result of lost routing information.  The symptoms are
that we stop being able to reach some networks or they us.

To be more specific about it, our campus Ethernet is connected via a
Proteon router to the San Diego Supercomputer Center's Ethernet and to
several other networks around California - "CERFnet".  We rarely have
trouble reaching those networks.  However, from time to time, some
networks don't seem to be reachable from our campus network, but can be
reached from machines on the SDSC Ether or from other CERFNet members.

For example, right at this moment I can't ping any machines on the
192.31.103 network where RELAY.CS.NET and its nameservers live, nor can
we ping anything on the Purdue campus.  Yet both are quite reachable
from SDSC.  The NIC was unreachable for more than a day, and we haven't
been able to get info from the UK nameservers for more than a week. 
I don't get network unreachable ICMP messages.

Our routing table consists of a few subnet entries and a default route
to the SDSC Proteon.

SDSC has recently lost their network guru, and whilst they are trying
quite hard to help, they're not quite up to speed just yet.

What I think is happening is that the reachability information for the
UCSD network isn't getting propagated as well as it might be.  I suspect
that my outgoing pings are probably reaching their destinations, but
that the return ping response can't find a route back to our network.

How can I test this from here (or elsewhere)?

	Brian Kantor	UCSD Postmaster
		UCSD Office of Academic Computing	(619) 534-6865
		UCSD C-010, La Jolla, CA 92093 USA	fax: 619 534 7018
		brian@ucsd.edu	BRIAN@UCSD ucsd!brian

Gene.Hastings@BOOLE.ECE.CMU.EDU (07/09/89)

Brian, forgive me if some of what I say is old hat to you, but I feel it is
preferable to give too much information than too little.

My first caveat is to distinguish between the statements "I can't reach the
Internet." and "I can't reach this group of interesting machines." The
reason for this is that the world beyond UCSD and SDSC is not homogeneous,
and that it is possible that certain groups of networks may have a specific
point of failure (such as SRI-NIC and SIMTEL-20, neither of which are
directly connected to NSFNET, but rely on inter-backbone gateways between
NSFNET, ARPANET and MILNET). The value of this distinction is that it may
provide some hint as to the nature of the failure. 

The fact that you get no error messages back indicates that routing
announcements of your networks are not reaching the far end, and thus the
return traffic is dropped (that is, your traffic fails on the return path,
not on the outgoing). This kind of thing is enormously hard to troubleshoot
without the aid of someone at another site, preferably the other end of the
path you're trying to troubleshoot.
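
The reasoning above can be illustrated with a toy model. This is only a
sketch: the router names, the 10.x placeholder prefixes, and the `lookup`
and `deliver` helpers are invented for illustration and do not describe the
real UCSD, SDSC, or NSFNET configuration. It shows an echo request that
reaches the far end on default routes while the reply is dropped at a router
that never learned the campus network.

```python
import ipaddress

def lookup(table, dst):
    """Return the next hop for dst by longest-prefix match, or None."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best[1] if best else None

def deliver(tables, start, dst, max_hops=16):
    """Forward hop by hop from router `start`; True if dst is delivered."""
    node = start
    for _ in range(max_hops):
        hop = lookup(tables.get(node, {}), dst)
        if hop is None:
            return False      # no route here: the packet is dropped
        if hop == "deliver":
            return True       # destination network is directly attached
        node = hop
    return False              # routing loop or path too long

# Hypothetical topology: campus -> regional -> backbone -> remote.
CAMPUS_NET = "10.1.0.0/16"    # placeholder prefixes, not the real networks
REMOTE_NET = "10.2.0.0/16"
tables = {
    "campus":   {CAMPUS_NET: "deliver", "0.0.0.0/0": "regional"},
    "regional": {CAMPUS_NET: "campus", "0.0.0.0/0": "backbone"},
    "backbone": {REMOTE_NET: "remote"},   # campus net never announced here
    "remote":   {REMOTE_NET: "deliver", "0.0.0.0/0": "backbone"},
}

request_ok = deliver(tables, "campus", "10.2.0.5")   # outgoing echo request
reply_ok = deliver(tables, "remote", "10.1.0.7")     # returning echo reply
print("echo request delivered:", request_ok)
print("echo reply delivered:  ", reply_ok)
```

In this model any unreachable notice produced at the drop point would go to
the sender of the reply (the remote host), not to the campus host, which is
consistent with seeing no error messages at the originating end.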

What things can you do? A very powerful tool is traceroute, which has been
described here before (which is my way of admitting I can't recall all of
the pointers), and differs from the other tools in that it does not require
special authentication to use, or running a particular protocol on the
intervening routers.
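
A rough sketch of the idea behind traceroute follows: probes go out with a
time-to-live of 1, 2, 3, and so on, and each gateway that decrements the TTL
to zero reports back, so the list of responders outlines the path. The code
below does not send real packets. `send_probe`, `make_simulated_probe`, and
the gateway names are hypothetical stand-ins, and the "silent" hops simply
model gateways whose answers cannot find their way back to the prober.

```python
def trace(send_probe, max_ttl=30):
    """Collect one responder per TTL until the destination answers.

    send_probe(ttl) is a caller-supplied function returning
    (responder, reached_destination), or None when the probe times out.
    """
    hops = []
    for ttl in range(1, max_ttl + 1):
        result = send_probe(ttl)
        if result is None:
            hops.append((ttl, "*"))   # nothing came back for this TTL
            continue
        responder, reached = result
        hops.append((ttl, responder))
        if reached:
            break
    return hops

def make_simulated_probe(path, silent_from=None):
    """Simulate a path; hops at index >= silent_from cannot answer us."""
    def send_probe(ttl):
        index = min(ttl, len(path)) - 1
        if silent_from is not None and index >= silent_from:
            return None
        return path[index], index == len(path) - 1
    return send_probe

# Hypothetical gateway names, for illustration only.
path = ["campus-gw", "regional-gw", "backbone-gw", "remote-gw", "remote-host"]
healthy = trace(make_simulated_probe(path))
broken = trace(make_simulated_probe(path, silent_from=2), max_ttl=6)

for ttl, who in broken:
    print(ttl, who)
```

In the `broken` run the last gateway that answers marks roughly where to
start asking questions: either the forward path stops there, or gateways
beyond it have no route back to the prober.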

Other useful tools are in the SGMP/SNMP family (you can query a routing
agent as to individual routes, or its entire routing table), which you may
be able to use depending upon the nature of your agreements with your
regional as to possession of the proper session/community names. Another
tool which provides useful information (in the absence of any other, at
least) is RIP query. Even if you do not have personal access to the tools,
your regional NOCs should, and may be able to talk you through the tests. 

Gene

kwe@bu-cs.BU.EDU (kwe@bu-it.bu.edu (Kent W. England)) (07/10/89)

In article <1823@ucsd.EDU> brian@ucsd.EDU (Brian Kantor) writes:
>Occasionally we here at UCSD seem to suffer from connectivity problems
>that I think are a result of lost routing information.  The symptoms are
>that we stop being able to reach some networks or they us.
>[...]
>What I think is happening is that the reachability information for the
>UCSD network isn't getting propagated as well as it might be.  I suspect
>that my outgoing pings are probably reaching their destinations, but
>that the return ping response can't find a route back to our network.
>
>How can I test this from here (or elsewhere)?

	I think you are right.  It is hard for you to troubleshoot
this yourself.  You need help.  The SDSCnet people should be able to
deal with these things in response to mail from you as the campus
representative, exactly like you posted to tcp-ip.  Let SDSCnet or
CERFnet have another shot at solving your problem for you.

	As a local user, you should be able to ask your campus network
manager (perhaps that is you) who can call on the regional network
operations people who can call on MERIT, the backbone network people.
MERIT has the tools and techniques to solve these problems, but they
need to limit their interaction to the regional technical people.
There are too many people on the net for them to work with everyone
directly.

	In the case of Purdue and CSnet, they were once well served by
arpanet, and since the arpanet has evaporated very rapidly, many
organizations are scrambling to migrate to new network services, and
that means the NSF-Internet.

	Right now, connectivity to many organizations and for many
internetwork connections still takes place using default routes.
Default routes tend to break when widespread connectivity changes are
made, like taking down the arpanet.  The most common default is still
the good ol' arpanet, and many a slip twixt Hither and Yon on that old
caravan route.

	(I don't find any purdue nets or the cs.net in routing
information from the backbone via jvncnet.  It could be temporary, but
I think not.  My default routing still works.  Lucky for me, they can
find me in their defaults.)

	--Kent England

brian@ucsd.EDU (Brian Kantor) (07/11/89)

Well, we found the problem - seems one of the intermediate routers had a
default route pointing to a machine which no longer exists and which
used to be that site's Arpanet gateway.  Once that was fixed things
started to flow again.
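
A stale default route of this kind can be hunted for mechanically if the
routing tables of the routers along the path can be collected. The sketch
below is only an illustration: the router names, the table layout, and the
`stale_routes` helper are hypothetical, and the set of "live" routers would
in practice have to come from some independent check.

```python
def stale_routes(tables, live_nodes):
    """List (router, prefix, next_hop) entries whose next hop is not alive."""
    stale = []
    for router, routes in sorted(tables.items()):
        for prefix, next_hop in sorted(routes.items()):
            if next_hop not in live_nodes:
                stale.append((router, prefix, next_hop))
    return stale

# Hypothetical tables: each router holds only a default route.
tables = {
    "campus-gw":   {"0.0.0.0/0": "regional-gw"},
    "regional-gw": {"0.0.0.0/0": "midpoint-gw"},
    "midpoint-gw": {"0.0.0.0/0": "old-arpanet-gw"},  # gateway since removed
}
live = {"campus-gw", "regional-gw", "midpoint-gw"}

for router, prefix, next_hop in stale_routes(tables, live):
    print(router, "routes", prefix, "via missing gateway", next_hop)
```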

Thanks all for your suggestions; they did help us!

	Brian Kantor	UCSD Office of Academic Computing
			Academic Network Operations Group  
			UCSD C-010, La Jolla, CA 92093 USA
			brian@ucsd.edu ucsd!brian BRIAN@UCSD

heker@JVNCA.CSC.ORG (Sergio Heker) (07/13/89)

I tend to agree with Kent that troubleshooting routing problems requires
interaction with other Networks.  But a more general statement can
be made that includes not only routing but End to End service.  This means
connectivity as well as performance.  In this, more general case we need
to remember that the Internet is a "network" that has distributed management, or
in other words, each of the Internet components is managed and operated
by different (autonomous) entities.  These "entities" have different
levels of service (hours of operation and type of support, e.g. tools).
One of the greatest efforts to put some light into this problem, in my
opinion, is the NSFnet backbone.  MERIT has been developing the 
infrastructure to be able to look into problems that affect users across
country that use the NSFnet network to pass Inter-regional traffic, and
is doing a very good job assisting the Regional Networks to get problems
resolved.  The Regional Networks have a role in dealing with the regional
users and helping them to get the problems outside their campuses resolved.
This requires among other things, that Regional Networks be prepared (have
the facilities and infrastructure) to help their users.  This raises the
point of who the users of the Regional Network are.  One answer is the
institutions connected to it, the other answer is the people that pass
traffic.  If the Regional Network users are the "Campus" Network Organization
(for the Campuses that have one), then they are responsible for assisting
their users (the people that send the traffic).

The JvNCnet network, like other networks has been dealing with all these 
issues for the last three years, and is working closely with the Institution
members (Campuses), with MERIT and with other Regional Networks to assist
users.  Consistent with this spirit of cooperation we have met a number of
times with the principals of the Regional and State Networks in the North
East of the Country (PREPnet, NYSERnet, NEARnet and JvNCnet) to discuss
technical issues of a Regional-to-Regional nature.  These
meetings have been very productive, and will continue in the future, in
order to provide for the necessary coordination among the peer networks
to free the end-user of unnecessary complications.
In doing this we have developed a group within the Network Department,
called the Network Information Services Group, with the function of providing
information to the JvNCnet members (among other things).  A Network Operations
Group deals with the daily operations of the network.  Two other groups 
sometimes not visible but nevertheless very important in supporting our
network are the Network Engineering Group and the Network Installation and
Maintenance Group.  This organization and the facilities available constitute
our infrastructure to be able to support our community of users.

A problem that we have encountered is that some of the end users (or the
Campus' users) don't know who to contact when there is a problem on the
network.  Occasionally, they call the wrong person, or the person that cannot
help them to resolve the problem, or get forwarded a number of times.  This
only causes frustration for the end users.  We are in the process, through
the Network Information Services Group, of initiating some training for the
JvNCnet Member sites so they can assist the end users.  This effort will
be discussed in the next JvNCnet Regional Network Meeting in September.

If anyone is interested in getting more information about JvNCnet please
contact our Network Coordinator or myself at "nisc@nisc.jvnc.net" or by
phone at (609) 520-2000.

						-- Sergio




-----------------------------------------------------------------------------
|		John von Neumann National Supercomputer Center		    |
|  Sergio Heker				tel:	(609) 520-2000		    |
|  Director for Networking		fax:	(609) 520-1089		    |
|  Internet: "heker@jvnca.csc.org"	Bitnet:	"heker@jvnc"		    |
-----------------------------------------------------------------------------

schoff@SOLBOURNE.NYSER.NET ("Marty Schoffstall") (07/15/89)

I wish the problems were only routing; the reality of many situations
is that they are caused by a myriad of problems:

	1) the diameter of the Internet continues to grow.  Ultrix systems,
	which out of the box are configured with a "low" TTL, are having
	lots of problems right now since there are tens of gateway hops
	between many facilities.  This is especially
	true during a failure where the redundant multiple path capability
	kicks in, but over a much "longer" path.  This week within NYSERNet
	a T1 failed in NYC and for two days RockefellerUniv communicated
	with CUNY (both in NYC) through upstate NY.

	2) networks break for periods of time and the word doesn't really
	get out.  For instance both the NYSERNet and Merit/NSFNet NOCs saw
	truly horrible reachability problems into the MILNET this week.
	Why?  We don't know.

	3) networks run out of bandwidth, almost nothing gets through to
	some very important hosts like SRI-NIC.ARPA with its ARPANET and
	MILNET only connections during much of the day.

	4) our backup connections are mere straws in comparison to the
	fire hoses we normally use.  A T1 connection to NEARNET (through
	which CSNET has connectivity) has been very flakey of late;
	when it doesn't work, traffic backs off onto 56kbps ARPANET.

	5) and then there is routing:  string, chewing gum, glue and people
	pushing ISO "solutions"..
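
The TTL problem in item 1 comes down to simple arithmetic: each gateway
decrements the time-to-live by at least one and discards the datagram when
it reaches zero, so a datagram only arrives if its initial TTL exceeds the
number of gateways crossed. A small sketch, in which the TTL value and both
hop counts are assumed numbers chosen for illustration rather than
measurements or any vendor's actual default:

```python
def survives(initial_ttl, gateway_hops):
    """True if a datagram with this initial TTL can cross that many gateways.

    Each gateway decrements the TTL by one and discards the datagram
    when the value reaches zero.
    """
    return initial_ttl > gateway_hops

LOW_TTL = 15        # assumed "low" initial TTL, for illustration only
normal_path = 12    # assumed gateway count on the usual path
backup_path = 19    # assumed gateway count after a failover

for name, hops in [("normal", normal_path), ("backup", backup_path)]:
    outcome = "delivered" if survives(LOW_TTL, hops) else "TTL expired"
    print(name, "path:", outcome)
```

With these assumed numbers the host works on the usual path and goes dark
only when traffic is rerouted over the longer backup path.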

Good Luck, just don't lay the blame on one cause or one group.  We're all
at fault.


Marty