[mod.protocols.tcp-ip] Why is the ARPANet in such bad shape these days?

BILLW@SU-SCORE.ARPA (William "Chops" Westfield) (09/27/86)

Both response and throughput between Stanford and SRI are pretty
awful, and they are only one IMP apart.  Trying to FTP a file
from a host that is farther away seems nearly impossible.

Is this just a local problem, say with the Stanford IMP, or
are other people having similar problems?

Note that this is NOT a gateway-related problem, since for many
of the paths I've tried, no gateways should be involved.

BillW
-------

SRA@XX.LCS.MIT.EDU (Rob Austein) (09/28/86)

Bill,

No, the ARPANET problem is definitely not just at Stanford.  MIT has
been moderately crippled by this for weeks now (since the start of the
fall semester, which is probably -not- a coincidence).  MC and XX have
a hard time talking to each other and they are on the same IMP.  The
NOC claims that this is true for pretty much the entire ARPAnet.
Apparently MILNET is somewhat better off.

The NOC is referring to this mess as a "congestion problem" at the IMP
level.  The current theory the last few times I talked to the NOC was
that we have managed to reach the bandwidth limit of the existing
hardware.  A somewhat scary thought.  If this is in fact the case (and
there is circumstantial evidence that it is, such as the fact that the
net becomes usable again during off hours), we are in for a long
siege, since it is guaranteed to take the DCA and BBN a fair length of
time to deploy any new hardware or bring up new trunks.

Current thoughts and efforts at MIT are (1) we need more data on the
traffic going through the IMPs, and (2) we need to cut down on the
amount of traffic going through the IMPs.  The two go along with each
other to some extent (preliminary results show that roughly 25% of the
traffic through the MIT gateway is to or from XX).  Some interesting
ideas have come up for minimizing load due to email, if that turns out
to be a prime offender (surprisingly, the preliminary statistics don't
seem to indicate that).  If there is anybody else out there doing
analysis of network traffic, please share it.
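
In case it helps anyone do a similar breakdown, here is a rough sketch
of one way to tally per-host traffic shares, written in Python against a
hypothetical per-packet log of source/destination addresses (the file
name and format are made up; substitute whatever your gateway or monitor
can actually produce):

    # Tally what share of gateway traffic involves each host, given a
    # hypothetical log with one "src dst" address pair per line.
    from collections import Counter

    counts = Counter()
    total = 0
    with open("gateway-packets.log") as log:   # hypothetical capture file
        for line in log:
            fields = line.split()
            if len(fields) < 2:
                continue
            src, dst = fields[0], fields[1]
            counts[src] += 1
            counts[dst] += 1
            total += 1

    # Each packet is counted once per endpoint, so n/total is the
    # fraction of packets to or from that host.
    for host, n in counts.most_common(10):
        print("%-20s %5.1f%% of packets" % (host, 100.0 * n / total))

That is the sort of tally behind the "roughly 25% to or from XX" figure
above, nothing fancier.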

Also, if there is anybody from BBN who knows more about the problem
and is willing to share it, -please- do.  It's hard to make any kind
of contingency plans in a vacuum.

--Rob

dave@RSCH.WISC.EDU (Dave Cohrs) (09/28/86)

I don't know why it's so bad, but no, it is *not* a localized problem.

Hosts at UW-Madison are also having problems reaching hosts farther
away than our local PSN.  The worst problems (of course) are reaching
hosts on the east coast, especially Rutgers and CSS.GOV sites.

The problem seems to be time/day-of-the-week related, so I assume it's
a congestion problem (response time seems pretty good right now), but
I'm not a net-watcher, so don't take that as gospel.  There also seem
to be some severe routing problems.  On one occasion this past week, the
packet turnaround time from our gateway to the CSS gateway (10.0.0.25)
was about 1 second, while one hop farther, from a host on our Pronet to
10.0.0.25, it was about 8 seconds, with peaks of over 20 seconds and many packets lost.
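
For anyone who wants to repeat that kind of measurement, here is a rough
sketch of one way to do it, assuming the far end runs the UDP echo
service (port 7).  The target address and probe count below are just
placeholders; a gateway itself may well not answer echo, in which case
aim at a host behind it:

    # Rough round-trip-time probe using the UDP echo service (port 7).
    # Target and probe count are hypothetical placeholders.
    import socket, time

    TARGET = ("10.0.0.25", 7)   # substitute a host that actually runs echo
    PROBES = 20

    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(30.0)          # treat anything slower than 30 sec as lost

    rtts, lost = [], 0
    for i in range(PROBES):
        start = time.time()
        s.sendto(("probe %d" % i).encode(), TARGET)
        try:
            s.recvfrom(512)
            rtts.append(time.time() - start)
        except socket.timeout:
            lost += 1
        time.sleep(1)           # one probe per second

    if rtts:
        print("min/avg/max = %.2f/%.2f/%.2f sec, %d/%d lost"
              % (min(rtts), sum(rtts) / len(rtts), max(rtts), lost, PROBES))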

Actually, one site in the Bay Area has started setting up new UUCP
links (using good ol' dialup connections) to make sure that their mail
will get through.

dave

hedrick@TOPAZ.RUTGERS.EDU (Charles Hedrick) (09/28/86)

We apologize for the problems we have caused other sites.  I am well
aware that Rutgers is among the hardest places to reach.  This is a
combination of our 9600 baud line into the IMP and continual crashes
of our gateway.  We have now replaced that gateway with one from
Cisco, based on a 68000.  It appears to be more reliable than
the old 11/23 code we were using before, and has much better tools to
monitor what is going on and adjust things.  We think that the
reliability problems will largely go away, except for TCP protocol
problems with individual hosts on our network.  Early results suggest
that the 9600 baud line has enough bandwidth to keep mail and news
flowing.  We have long since given up on telnet, though at some times
of the day even that may now be practical.  We are also exploring an
upgrade of the line speed.
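
For what it's worth, a back-of-envelope calculation of what 9600 baud
buys you; the message size and overhead figures below are guesses, not
measurements:

    # Rough capacity estimate for a 9600 baud access line.
    # Every number here is an assumption for illustration.
    line_bps = 9600
    goodput = line_bps / 8.0 * 0.75      # assume ~25% framing/protocol overhead
    message_bytes = 2000                 # assumed typical mail message
    full_datagram = 576                  # a common maximum datagram size

    print("usable rate        ~%.0f bytes/sec" % goodput)
    print("2 KB mail message  ~%.1f sec to send" % (message_bytes / goodput))
    print("messages per hour  ~%.0f at saturation" % (3600 * goodput / message_bytes))
    print("one %d-byte packet ahead of a keystroke adds ~%.2f sec of echo delay"
          % (full_datagram, full_datagram / goodput))

Roughly 900 usable bytes per second is plenty for queued mail and news,
but one full-size datagram ahead of every echoed keystroke is why
interactive telnet over the same line is so painful.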

karn@MOUTON.BELLCORE.COM (Phil R. Karn) (09/28/86)

I wonder how much of the existing congestion problem would go away if
DARPA banned all 4.2BSD sites from the net until they convert to 4.3?

Phil

ron@BRL.ARPA (Ron Natalie) (09/28/86)

It may not use gateways, but the ping wars between the BBN gateways
impact all net performance, as their random behaviour wreaks havoc
with the IMPs' virtual circuit setup time.

-Ron

Lixia@XX.LCS.MIT.EDU (Lixia Zhang) (09/29/86)

The following replies to two internet-congestion-related messages at once.

    Date: Sat, 27 Sep 1986  21:35 EDT
    From: Rob Austein <SRA@XX.LCS.MIT.EDU>
    Subject: Why is the ARPANet in such bad shape these days?
    ......
    The NOC is referring to this mess as a "congestion problem" at the IMP
    level.  The current theory the last few times I talked to the NOC was
    that we have managed to reach the bandwidth limit of the existing
    hardware.  A somewhat scary thought...

Could someone from BBN provide measured network throughput numbers to
convince us that we indeed have hit the HARDWARE bandwidth limit?

					...If this is in fact the case (and
    there is circumstantial evidence that it is, such as the fact that the
    net becomes usable again during off hours), we are in for a long
    siege, since it is guaranteed to take the DCA and BBN a fair length of
    time to deploy any new hardware or bring up new trunks.

Better performance during off hours surely indicates that the problem is
network load-related, but does not necessarily mean that the DATA traffic
has hit the hardware limit -- there is a large percentage of non-data
traffic flowing in the net.  According to measurements on a number of
gateways in the week of 9/15-9/21 (the figures are much the same in other weeks),
       43% of all received packets are addressed to a gateway
       48% of all sent packets originate at a gateway
Presumably these gateway-gateway packets are routing updates, ICMP redirects,
etc.  But why should they make up such a high percentage of the total traffic?
Can someone explain this to us?
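
To make the question concrete, here is a crude sanity check of whether
routing chatter alone could plausibly reach numbers like that.  Every
figure in it (neighbor count, update interval, user packet rate) is a
made-up assumption, not a measurement:

    # Back-of-envelope: what fraction of a gateway's outbound packets
    # could be routing/control traffic?  All numbers are assumptions.
    neighbors = 4              # assumed routing neighbors per gateway
    update_interval = 15.0     # assumed seconds between updates to each neighbor
    user_pkts_per_sec = 2.0    # assumed user packets forwarded per second

    control_out = neighbors / update_interval
    frac = control_out / (control_out + user_pkts_per_sec)
    print("control traffic would be ~%.0f%% of packets sent" % (100 * frac))

With those made-up numbers control traffic comes to only about 12% of
what the gateway sends, nowhere near 48%.  Either the control chatter is
far heavier than this simple model allows, or the user traffic through
those gateways is far lighter than assumed -- which is exactly the
question for BBN.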

Even for data packets, I wonder if anyone has an idea about how much extra
traffic is generated by the known extra-hop routing problem.  More on this
later.

    ALSO, IF THERE IS ANYBODY FROM BBN WHO KNOWS MORE ABOUT THE PROBLEM
    AND IS WILLING TO SHARE IT, -PLEASE- DO.  IT'S HARD TO MAKE ANY KIND
    OF CONTINGENCY PLANS IN A VACUUM.

    --Rob

I capitalized the sentence, hoping no one will pretend not to have seen it.


    Date: Sun, 28 Sep 86 04:48:39 edt
    From: hedrick@topaz.rutgers.edu (Charles Hedrick)
    Subject: odd routings

    I have been looking at our EGP routings.  I checked a few sites that I
    know we talk to a lot.  Our current EGP peers are yale-gw and
    css-ring-gw.  (We keep a list of possible peers, and the gateway picks
    2.  It will change if one of them becomes inaccessible.  This particular
    pair seems to be fairly stable.)  Here is what I found:
    ......
    MIT:  They seem to have 4 different networks.  The ones with direct
	  Arpanet gateways are 18 (using 10.0.0.77) and 128.52 (using
	  10.3.0.6).  EGP was telling us to use 10.3.0.27 (isi) and
	  10.2.0.37 (purdue) respectively...

This is probably caused by the EGP extra-hop problem: if MIT gateways are
EGP-neighboring with the isi and purdue gateways, all other core gateways will
tell you to go through the isi/purdue gateways to get to MIT, even though everyone
is on the ARPANET.  This should be a contributor to the congestion too.

One question is: Can anyone tell us WHEN this extra-hop problem will be
completely eliminated?
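
As a crude illustration of the cost: every packet that takes the extra
hop crosses the ARPANET twice instead of once, so the inflation in total
load is just the affected fraction itself.  The 20% below is an
arbitrary assumption, not a measurement:

    # If some fraction of ARPANET traffic takes one extra ARPANET hop
    # (host -> wrong core gateway -> destination), that fraction crosses
    # the net twice.  The 20% figure is purely an assumption.
    affected_fraction = 0.20
    load_multiplier = (1 - affected_fraction) + affected_fraction * 2
    print("net-wide load multiplier: %.2f" % load_multiplier)   # prints 1.20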

Another question is how the stubs select core EGP neighbors; if they all
concentrate on a small number of core gateways, bottlenecks will be created,
because the extra-hop problem says that if a stub gw EGP-neighbors with a
core gw, most traffic to the stub is likely to travel through that core gw
as well.  Hedrick listed Rutgers' coded-in core EGP gateway candidates in his
message.  Is the same list used by all non-core gateways?  Does someone know
how many stub gateways EGP-neighbor with one core gateway?  Will some
stub-core rebinding help relieve the congestion?
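
Here is a toy model of the concentration effect, with made-up numbers
for the stub count and candidate-list sizes:

    # Toy model: each stub gateway picks 2 EGP neighbors at random from a
    # candidate list.  Compare a short shared list with longer ones.
    # All numbers are invented for illustration.
    import random
    from collections import Counter

    def busiest_core(num_stubs, list_size, picks_per_stub=2):
        load = Counter()
        for _ in range(num_stubs):
            for core in random.sample(range(list_size), picks_per_stub):
                load[core] += 1
        return load.most_common(1)[0][1]

    random.seed(1)
    for list_size in (3, 10, 30):
        print("candidate list of %2d cores: busiest core serves %d of 100 stubs"
              % (list_size, busiest_core(100, list_size)))

With a shared list of only three candidates, the busiest core gateway
ends up as EGP neighbor (and, via the extra-hop problem, transit point)
for most of the stubs; a longer or better-spread list cuts that sharply.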

In short, reducing network overhead and fixing some long-standing protocol
problems may be a way to relieve the current poor net performance.

Lixia
-------

swb@DEVVAX.TN.CORNELL.EDU (Scott Brim) (09/29/86)

Lixia: I've always wondered about figures like that.  Aren't the
overwhelming majority of the gateways on Arpanet also decent-sized
hosts in their own right -- so that much of the traffic in your
figures might be legitimate user traffic?
							Scott

p.s. talk about degenerative congestion -- when the network gets slow
we all start sending gobs of mail back and forth in order to improve
it!

Lixia@XX.LCS.MIT.EDU (Lixia Zhang) (09/29/86)

Scott,

As far as I know, the numbers in my message were from measurements (by BBN)
on the pure forwarding gateways, NOT including hosts.

Lixia

P.S. Also talking about degenerative congestion -- if no one used the net,
surely no congestion would exist, and probably neither would the net itself.
With no congestion, people would still send mail daily, though probably on
different subjects.
-------

mike@BRL.ARPA (Mike Muuss) (10/02/86)

Many sites with really large sets of LANs (including MIT and BRL)
run dedicated IP gateways as their attachment to the IMPs.
In these cases, all traffic on those IMP ports is either user
traffic or EGP.

BRL-GATEWAY and BRL-GATEWAY2 are pretty high up among the largest
sources of packets on the MILNET today.  When our Cray-XMP48
comes online on 2-November, I expect our MILNET trunks to melt.
	-Mike