[mod.protocols.tcp-ip] Congestion

nowicki@SUN.COM.UUCP (02/12/87)

I am not sure which is the right group for this discussion, but the
recent congestion problems have brought up two important points.

First, the MX record support from Berkeley for sendmail does not do any
caching.  Perhaps they thought the local name server would cache, but
that does not help when the desired name server is down, since there is
no answer to cache.  For example, last week
Decwrl.DEC.COM was essentially unreachable from the Arpanet.  The
DEC.COM name servers are either on the other side of Decwrl (128.45),
or behind other unreliable gateways (net 36).  Thus mail started to
pile up, and we quickly had hundreds of messages sitting in the queue.
Each run through the queue did hundreds of MX lookups, each of which
had to time out.  I extended our simple cache (which already remembered
whether hosts are up or down) to cache the result of the MX request
(especially when the request timed out).  This got the queue flowing again.
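
The kind of cache described above could look roughly like the following
Python sketch.  It is not the actual sendmail change; the class name
MXCache, the resolver callback, and both lifetime values are assumptions
made up for illustration.  The point is that a lookup that times out is
remembered for a short while, so a queue run does not wait on the same
dead name server hundreds of times.

```python
import time


class MXCache:
    """Cache MX lookup results, including timeouts (negative caching)."""

    def __init__(self, resolver, ttl=600.0, failure_ttl=120.0,
                 clock=time.monotonic):
        # resolver: hypothetical callable taking a domain and returning a
        # list of mail hosts, or raising TimeoutError if nothing answers.
        self._resolver = resolver
        self._ttl = ttl                  # lifetime of a successful answer
        self._failure_ttl = failure_ttl  # lifetime of a remembered timeout
        self._clock = clock
        self._entries = {}               # domain -> (expires_at, hosts or None)

    def lookup(self, domain):
        """Return the list of MX hosts, or None if the lookup timed out."""
        key = domain.lower()
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]              # still fresh, success or failure
        try:
            hosts = list(self._resolver(key))
            expires_at = now + self._ttl
        except TimeoutError:
            hosts = None                 # remember the failure too
            expires_at = now + self._failure_ttl
        self._entries[key] = (expires_at, hosts)
        return hosts
```

With something like this in place, a queue run over several hundred
messages for the same unreachable domain costs one timeout per failure
lifetime rather than one per message.  The failure lifetime is kept
shorter than the success lifetime so that mail starts moving again soon
after the name server comes back.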

Second, there seems to be a bug in the HDH code of the PSNs (aka
IMPs).  During periods of congestion, the HDLC layer blocks us from
sending back the "Host Up" messages that are required in HDH.  The PSN
then declares us to be down, clears its buffers, then immediately hears
the Host Up message and declares us to be back up.  This happens every
few minutes during the day.  Not only does throwing the buffered data
away increase congestion in the short term by causing more
retransmissions, but there are also higher-level instabilities.  If a host
tries to send us a TCP segment or ACK during the time that the IMP
thinks we are down, it gets a "Host Dead" message and resets the TCP
connection, which means the entire mail message has to be
retransmitted.  This just makes matters worse.
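
The flapping described above can be illustrated with a toy model in
Python.  This is only a sketch of the behavior as reported here, not the
PSN's actual logic; the function name simulate_psn, the integer time
ticks, and the timeout value are all hypothetical.

```python
def simulate_psn(host_up_times, timeout, end_time, arrivals):
    """Toy model of a PSN that expects periodic "Host Up" messages.

    host_up_times: ticks at which a Host Up message gets through.
    timeout:       ticks of silence tolerated before declaring the host down.
    arrivals:      ticks at which a remote host sends us a packet.
    Returns (events, dropped), where dropped counts flushed buffered packets.
    """
    up_ticks = set(host_up_times)
    arrival_ticks = set(arrivals)
    events = []
    host_up = True
    last_heard = 0
    buffered = 0   # delivery is not modeled; packets just wait in the PSN
    dropped = 0

    for t in range(1, end_time + 1):
        if t in up_ticks:
            last_heard = t
            if not host_up:
                host_up = True
                events.append((t, "up"))
        elif host_up and t - last_heard > timeout:
            # Silence for too long: declare the host down, clear buffers.
            host_up = False
            dropped += buffered
            buffered = 0
            events.append((t, "down"))

        if t in arrival_ticks:
            if host_up:
                buffered += 1
            else:
                # The sender is told the host is dead and resets its
                # TCP connection.
                events.append((t, "host dead"))

    return events, dropped
```

Feeding it a gap in the Host Up messages (as when the HDLC layer blocks
them during congestion) shows the pattern: the host is declared down,
buffered data is thrown away, any packet arriving in that window draws a
"host dead" reply, and the very next Host Up brings the host back.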

I have tried to contact BBN about the second problem, since it is a bug
in their software, but I keep getting the run-around.  The NOC people
just say "must be congestion".  I KNOW it is congestion, but it still
is a bug!  Does anyone at BBN read these lists?

	-- Bill Nowicki
	   Sun Microsystems

brescia@CCV.BBN.COM.UUCP (02/13/87)

>     Second, there seems to be a bug in the HDH code of the PSNs 

>     I have tried to contact BBN about the second problem, since it is a bug
>     in their software, but I keep getting the run-around.

Bill,

To answer the specific question, you should call the NOC and refer to
SR-86-03583 (eighty-six dash zero-three-five-eight-three).  This is a previous
report of the same problem, and they should be tracked together.  For some
reason, what you said and what they heard were not sufficient to make the
connection.

In general, if the people you talk to do not understand what you are
saying, you need to talk to someone who does.  I don't think that sort of
escalation path exists yet in the call-in procedures.  It probably should.

I would prefer to see problems like this solved before you have to get up in
the widest possible forum and shout.  This shout should lead to the fix for
your problem, however, so keep in mind:

"If you don't get grease, squeak louder."  - paraphrase from A. Wheel

regards,
Mike

malis@CC5.BBN.COM.UUCP (02/15/87)

Mike,

Just so you (and the rest of the list) know, the patch that fixes the
HDH problem was installed on the ARPANET yesterday (Friday)
afternoon.  It had previously been installed on the MILNET
(because two hosts had encountered this problem), and was
awaiting DDN approval for the ARPANET installation.

Regards,
Andy