nowicki@SUN.COM.UUCP (02/12/87)
I am not sure which is the right group for this discussion, but the recent congestion problems have brought up two important points.

First, the MX record support from Berkeley for sendmail does not do any caching. Perhaps they thought the local name server would cache, but it cannot when the desired name server is down. For example, last week Decwrl.DEC.COM was essentially unreachable from the Arpanet. The DEC.COM name servers are either on the other side of Decwrl (128.45), or behind other unreliable gateways (net 36). Thus mail started to pile up, and we quickly had hundreds of messages sitting in the queue. Each run through the queue did hundreds of MX lookups, each of which had to time out. I extended our simple cache (which already remembered whether hosts are up or down) to cache the result of the MX request (especially if the request timed out). This got the queue flowing again.

Second, there seems to be a bug in the HDH code of the PSNs (aka IMPs). During periods of congestion, the HDLC layer blocks us from sending back the "Host Up" messages that are required in HDH. The PSN then declares us to be down, clears its buffers, then immediately hears the Host Up message and declares us to be back up. This happens every few minutes during the day. Not only does throwing the buffered data away increase congestion in the short term by causing more retransmissions, but there are also higher-level instabilities. If a host tries to send us a TCP segment or ACK during the time that the IMP thinks we are down, it gets a "Host Dead" message and resets the TCP connection, which means the entire mail message has to be retransmitted. This just makes matters worse.

I have tried to contact BBN about the second problem, since it is a bug in their software, but I keep getting the run-around. The NOC people just say "must be congestion". I KNOW it is congestion, but it still is a bug! Does anyone at BBN read these lists?

-- Bill Nowicki
   Sun Microsystems
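[Editorial illustration] The MX caching described in the message above can be sketched roughly as follows. This is not the code used at Sun, which the message does not show. The names `MXCache`, `LookupTimeout`, and `resolver`, and the two lifetime values, are all hypothetical. The point it illustrates is that a timed-out lookup is remembered too, so one queue run pays for the timeout once per domain instead of once per message.

```python
import time


class LookupTimeout(Exception):
    """Hypothetical error: no name server answered the MX query in time."""


class MXCache:
    """Caches MX lookup results, including lookups that timed out."""

    def __init__(self, resolver, ttl=3600.0, negative_ttl=600.0,
                 clock=time.monotonic):
        # resolver: callable taking a domain and returning an iterable of
        # mail exchanger host names, or raising LookupTimeout (assumed API).
        self._resolver = resolver
        self._ttl = ttl                    # lifetime of a successful answer
        self._negative_ttl = negative_ttl  # lifetime of a remembered timeout
        self._clock = clock
        self._entries = {}                 # domain -> (expires_at, hosts or None)

    def lookup(self, domain):
        """Return the list of MX hosts, or None if the lookup timed out."""
        key = domain.lower()               # domain names are case-insensitive
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                # still fresh: no network traffic
        try:
            hosts = list(self._resolver(key))
            expires = now + self._ttl
        except LookupTimeout:
            hosts = None                   # remember the failure as well
            expires = now + self._negative_ttl
        self._entries[key] = (expires, hosts)
        return hosts
```

With a cache like this, a queue holding hundreds of messages for one unreachable domain waits for a single timeout per `negative_ttl` interval; the remaining messages are skipped immediately and retried on a later run.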
brescia@CCV.BBN.COM.UUCP (02/13/87)
> Second, there seems to be a bug in the HDH code of the PSNs

> I have tried to contact BBN about the second problem, since it is a bug
> in their software, but I keep getting the run-around.

Bill,

To answer the specific question, you should call the NOC and refer to SR-86-03583 (eighty-six dash zero-three-five-eight-three). This is a previous report of the same problem, and they should be tracked together. For some reason, what you said and what they heard were not sufficient to make the connection.

In general, if the people you talk to do not understand what you are saying, you need to talk to someone who does. I don't think that sort of escalation path exists yet in the call-in procedures. It probably should.

I would prefer to see problems like this solved before you have to get up in the widest possible forum and shout. This shout should lead to the fix for your problem, however, so keep in mind "If you don't get grease, squeak louder." - paraphrase from A. Wheel

regards,
Mike
malis@CC5.BBN.COM.UUCP (02/15/87)
Mike,

Just so you (and the rest of the list) know, the patch that fixes the HDH problem was installed on the ARPANET yesterday (Friday) afternoon. It had previously been installed on the MILNET (because two hosts had encountered this problem), and was awaiting DDN approval for the ARPANET installation.

Regards,
Andy