[mod.protocols.tcp-ip] Poor performance related to egp?

rick@SEISMO.CSS.GOV (Rick Adams) (10/31/86)

I can't help but wonder if the poor internet performance is related to the
HORRIBLE routes that egp says to use.

I have seen improvements in round trip icmp echo times of 1000% by
ignoring the route egp says to use and manually forcing a route into
the system. In many cases, it is the difference between connecting
at all and timing out. Today's horrible case has been routing to
128.96 (bellcore.com) through lbl-milnet-gw instead of the rational
relay.cs.net. Other horrible routes have included rutgers through
purdue instead of the direct rutgers arpanet host.

Are the egp routes supposed to be reasonable? I'm not that familiar with
the theory behind them, but in practice, they suck badly. When I can get
a 10 to 1 performance improvement by hard coding specific routes to
override egp, I wonder if this is part of the internet congestion problem.

It seems like a major performance gain for everyone could be realized
by having the egp core systems advertise rational routes.

---rick
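
A minimal sketch of the override described above, assuming a routing
table in which a hand-entered route takes precedence over an EGP-learned
route to the same network. The `Route` type, the `PREFERENCE` ordering
and `select_route` are hypothetical names used for illustration; they are
not taken from any real host or gateway implementation. Only the network
and gateway names come from the message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    network: str  # destination network, e.g. "128.96"
    gateway: str  # first-hop gateway for that network
    source: str   # "static" (entered by hand) or "egp" (learned)


# Assumed preference order: a lower number wins.
PREFERENCE = {"static": 0, "egp": 1}


def select_route(routes, network):
    """Return the preferred route to `network`, or None if there is none."""
    candidates = [r for r in routes if r.network == network]
    if not candidates:
        return None
    return min(candidates, key=lambda r: PREFERENCE[r.source])


routes = [
    Route("128.96", "lbl-milnet-gw", "egp"),    # what EGP says to use
    Route("128.96", "relay.cs.net", "static"),  # the manual override
]

print(select_route(routes, "128.96").gateway)
```

With both entries present the static route is chosen; remove it and the
EGP-learned route is used again.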

braden@ISI.EDU (Bob Braden) (10/31/86)

The examples you cite of "horrible" EGP routing are probably due to the
extra-hop problem in the core.  Apparently we have not done an adequate
job of information-spreading, if you are not aware of this problem.  I
seem to recall a blaze of messages on this very subject within the past 6
months, probably on the tcp-ip list.  It began with a complaint almost
identical to yours, and ended with a scholarly explanation of the
extra-hop problem by Dave Mills.

The extra-hop problem can at worst double the core traffic, and it is
scheduled to go away when the Butterflies take over the core.  I forget
the exact predicted date from BBN, but rescue is in sight.

As for performance, in some funny sense EGP is (deliberately) designed for
poor performance, in the sense that it is intended to serve as a firewall
against misbehaviour by routing domains outside the core.  It is true, as
Mike StJohns says, that EGP is not a routing protocol; it is also true that
this fact has led to serious restrictions in topology and therefore a
crash effort is being mounted to replace EGP with a routing protocol, under
the direction of the INENG and INARCH task forces. 
 
However, maybe we are asking too much of EGP.  Perhaps we are trying to
make it a technical fix for administrative problems.  To avoid bad things
like oscillations and routing loops in the face of the "diversity" (to
use a nice word) of the Internet as a whole, EGP or whatever replaces it
will always have to use long time constants and provide some sub-optimal
routes.  At the present time, the Internet is growing largely by
accretion of new Autonomous Systems, and this must lead to some 
degradation as you cross boundaries.  If we want better overall
performance, we need to persuade these systems to aggregate into bigger
systems, each run by centralized and professional Internet management,
and each using a carefully-optimized IGP.

I go into all this polemic, because lately I have been exposed to an 
awful lot of technological optimism (ask NASA about that!) about 
Internetting.  I wish we could convince some of the new players in the
Internet game that it takes great technical sophistication and wisdom
to make this stuff work well.  The Anarchy Model of Internetting,
while theoretically feasible due to EGP, is not really a very wise way
to go.

Bob Braden

JNC@XX.LCS.MIT.EDU ("J. Noel Chiappa") (11/01/86)

	I seem to explain this every 2 months.

	The problem is not caused by EGP, which is telling you exactly
what the gateway you are neighbours with is doing itself with packets
to given destinations, but the routing protocol (GGP) which is used by
the core gateways among themselves. It predates EGP, was not designed
with the pattern of information flows that you see in EGP in mind, and
is the cause of the problem. When GGP is replaced (which will probably
be when the PDP11's are) the problem will magically disappear without
any changes to EGP.

	For a more detailed explanation of the problem, look in the
TCP-IP archive for a message I sent out at Thu 6 Mar 86 18:16:01-EST
which goes into great detail. Just out of interest, were you on
TCP-IP then?

	Noel
-------
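
A rough sketch of why an extra hop can "at worst double the core
traffic", as stated earlier in the thread. It assumes, purely for
illustration, that an affected packet is relayed through one additional
core gateway and therefore crosses the backbone twice instead of once.
The gateway names and both functions are hypothetical; the March 1986
message cited above remains the place to look for the actual mechanism.

```python
def backbone_crossings(path):
    """Count gateway-to-gateway hops; each is assumed to cross the backbone once."""
    return len(path) - 1


# Hypothetical paths between the same two endpoints.
direct = ["source-gw", "exterior-gw"]
extra_hop = ["source-gw", "core-gw", "exterior-gw"]


def traffic_ratio(fraction_affected):
    """Backbone traffic relative to the all-direct case.

    Affected packets take the extra-hop path, the rest go direct.
    """
    affected = fraction_affected * backbone_crossings(extra_hop)
    unaffected = (1 - fraction_affected) * backbone_crossings(direct)
    return (affected + unaffected) / backbone_crossings(direct)


for fraction in (0.0, 0.25, 1.0):
    print(fraction, traffic_ratio(fraction))
```

Under these assumptions the ratio reaches 2 only when every packet takes
the extra hop, which matches the "at worst" wording.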

rick@SEISMO.CSS.GOV (Rick Adams) (11/03/86)

Let me see if I have this correct. Based on the letters I have received:

	There is a major problem with GGP.
	This has been known for a long time.
	There is no plan to fix it in the foreseeable future.
	This problem "at most" doubles the load on the arpanet.

Can anyone explain why this doesn't warrant immediate attention?

If someone told me there was a kernel bug that "at most" wasted 50% of
my CPU, I'd be quite concerned about it. I wouldn't wait for the next
hardware release and hope it was fixed then.

Observation indicates a 10 to 1 degradation in performance, which is
not what I would expect from doubling the load.

There seems to be some belief that the BBN Butterfly will be the salvation
of the world. I hope the Butterfly being considered is a lot different from
the Butterfly sitting about 25 feet from me (css-gateway 10.2.0.25). This
particular Butterfly is one of the most unreliable things I have
ever seen. It often needs to be MANUALLY (i.e. they call me up) rebooted
several times per day. 

Waiting for a solution based on the Butterfly seems quite foolish.
Especially when people are forced to install their own leased lines because
the ARPANET performance is unacceptable. (We already have 2. I'm sure there
are many others. I find it especially ironic that our DARPA project manager can
not use the ARPANET to access our machine (unacceptable performance), but
has to use a leased line.)

---rick
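
A worked illustration of the relationship between load and delay raised
above. It uses the textbook M/M/1 queue, where mean delay is
1 / (service rate - arrival rate), as one possible model; nothing in the
thread says the ARPANET behaves this way, and the rates below are made-up
numbers. The function names are hypothetical. The point is only that, in
such a model, doubling the load need not simply double the delay.

```python
def mm1_delay(service_rate, arrival_rate):
    """Mean time in system for an M/M/1 queue: 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


def slowdown_from_doubling(service_rate, arrival_rate):
    """Ratio of mean delay after doubling the arrival rate to delay before."""
    before = mm1_delay(service_rate, arrival_rate)
    after = mm1_delay(service_rate, 2 * arrival_rate)
    return after / before


# Assumed service rate of 100 packets per unit time, various starting loads.
for load in (10, 30, 45, 47.5):
    print(load, round(slowdown_from_doubling(100, load), 3))
```

In this model the slowdown from doubling is small on a lightly loaded
link and grows sharply as the doubled load approaches the service rate.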