tcp-ip@ucbvax.ARPA (06/15/85)
From: Mike Muuss <mike@BRL.ARPA>

This message reports on the details of the recent 2-day partial mail outage at BRL.

SYMPTOMS.

The triggering event was an unannounced set of NIC Host Table updates which our systems picked up and installed Wednesday night. These updates changed the format of the host name entries in the table. Previously, host table entries had the form:

    HOST : address(es) : formalname.ARPA,formalname,nicname1,nicname2 : stuff :

As part of adding alternate names in other domains, this arrangement got changed to:

    HOST : address(es) : formalname.ARPA,nicname1,nicname2,formalname : stuff :

A secondary problem was that the host-table converter for our MMDF mail system made some assumptions about the layout of the table, such that it expected both forms of the formal host name to be listed BEFORE any nicnames. The new layout broke that assumption, with the result that all mail from BRL was having the outbound mail addresses rewritten to nicname1.ARPA, a string most hosts don't have in the table.

Of the many systems on the network, the only ones that seemed overly put out about getting mail of this form were the TOPS-20 folks. They rejected a TO address of nicname1.ARPA as unknown, causing us to return lots of mail as undeliverable. A fair amount of this was official BRL correspondence being sent to other Army elements, and caused much concern in our front office.

In addition to the symptoms described above, one of our older machines running an older version of our mail software (BRL-VLD, aka VLD70) somehow "forgot" what its name was due to the new mail tables, and refused to receive ANY inbound mail AT ALL. This caused several hundred pieces of mail, much of it official business, to be returned as undeliverable.

CORRECTIVE ACTION.

As always happens, our mail system maintainer (Doug Kingston) was away on TDY to the USENIX conference.
The first signs of trouble came early Thursday morning from BRL-VLD, and George Hartwig began experimenting to try to alleviate the problem. The full extent of the problem was not known until Dave Towson brought to my attention what he felt to be an abnormally large rate of mail rejections on messages he had sent, nearly all going to TOPS-20 sites. By Thursday evening, I had convinced the VLD70 that it knew its name again, and coerced it to once again receive mail. Alas, in the process it seems to have lost the ability to SEND mail, but for the time being that struck me as somewhat of an improvement; users could access lots of other machines nearby for mail sending.

Owing to a small stroke of luck, I was able to contact Doug Kingston, and at 2000 he was hard at work tracking down the table-builder problem behind the trouble with outbound mail. He installed a new version of /usr/mmdf/table/nictable on BRL-VGR, but didn't build any new databases, and didn't leave any mail on what he had done.

I returned Friday at noontime to discover that our problems were not any better, and went investigating. First thing I tried was to generate a new mail database using the new version of the program Doug installed. When that database was installed, BRL-VGR lost the ability to send or receive mail. This unfortunate turn of events forced me to delve into the source code for the nictable.c program, and I was unable to determine how the previous night's code changes were supposed to help, so I implemented my own changes. After some fiddling and testing, my new version produced tables that seemed visually correct, so I installed it and generated a new database. After half an hour of waiting for it to finish, and trying to battle the load average down below 7, I was finally able to test it. Fortunately, it worked, and the first of many test messages wafted off to Wancho@simtel20, who was the hapless TOPS-20 recipient of all my test mail.

Then, all (!)
that remained was to distribute the new program to the 8 other participating BRLNET machines, and rebuild THEIR databases. By 1700 on Friday, all mail databases on the VAXen and Goulds had been rebuilt, and each system had been personally hand-checked to make certain mail could now flow.

There is no telling how much mail got returned during this fiasco. Users should be encouraged to re-send their failed mail now, and it *should* go through.

STATUS.

The BRL VAXen (plus HEL-ACE) are all seemingly back to working order. The Gould (BRL-SEM) too.

The 2 remaining PDP-11 systems are still in various states of difficulty. Both BRL-BMD and BRL-VLD are still unable to send to TOPS-20 sites, until George Hartwig can determine how to install the new nictable program there. In addition, BRL-VLD is still unable to send mail ([PARM] No valid author specification present), but at least mail does seem to be flowing in and getting delivered properly. These problems I leave to George; hopefully when Doug returns on Monday, the PDP-11s can be rapidly returned to full working order.

COMMENTARY.

The NIC Host Table is currently the single most important file on the MILNET; if this file is capriciously changed, the potential for network-wide harm is substantial. The good folks at the NIC have for many years tended this file with care and attention to detail; my complaint in this instance is that there was no warning given to Host Administrators or Technical Liaisons that this change was impending. With some advance notice, we could have arranged to consider the implications beforehand, and either adjusted our software before anything went wrong, or shut off our automatic host-table update mechanism until we were prepared to attack the problem in an orderly manner.

Soon, the NIC Host Table is going to be replaced by the magic domain-server system, and I'm sure that there will be some growing pains associated with that.
In the interim, let's finish testing our domain code, and stop having to put out fires with the existing, nearly obsolete, mechanism.

I just hate Mondays, especially when they happen on Thursday.

-Mike
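The failure mode described in the message above (a converter that relies on the position of names within the entry) can be illustrated with a short sketch. This is not the MMDF nictable.c logic, which is not shown in the thread; the function names, the placeholder address 192.0.2.1, and the idea that the converter takes the second name as the formal name are all assumptions chosen only because they reproduce the reported symptom (outbound addresses rewritten to nicname1.ARPA).

```python
# Illustrative sketch only: NOT the actual nictable.c code.
# Entry layout follows the simplified form quoted in the message:
#   HOST : address(es) : names : stuff :
# Host names and the address are placeholders.

OLD_LAYOUT = "HOST : 192.0.2.1 : FORMALNAME.ARPA,FORMALNAME,NICNAME1,NICNAME2 : stuff :"
NEW_LAYOUT = "HOST : 192.0.2.1 : FORMALNAME.ARPA,NICNAME1,NICNAME2,FORMALNAME : stuff :"


def split_names(entry):
    """Return the list of names from the third colon-separated field."""
    fields = [field.strip() for field in entry.split(":")]
    return [name.strip() for name in fields[2].split(",")]


def positional_official(names):
    """Hypothetical order-dependent converter.

    Assumes names[0] is formalname.ARPA and names[1] is formalname,
    then appends .ARPA to whatever sits in the second slot.
    """
    return names[1] + ".ARPA"


def order_independent_official(names):
    """Pick the official name by its .ARPA suffix, wherever it appears."""
    for name in names:
        if name.upper().endswith(".ARPA"):
            return name
    return names[0]  # fall back to the first name if no .ARPA form exists


if __name__ == "__main__":
    for label, entry in (("old", OLD_LAYOUT), ("new", NEW_LAYOUT)):
        names = split_names(entry)
        print(label, positional_official(names), order_independent_official(names))
```

Under these assumptions the positional version yields FORMALNAME.ARPA for the old layout and NICNAME1.ARPA for the new one, while the suffix-based version gives the same answer for both.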
tcp-ip@ucbvax.ARPA (06/15/85)
From: HOSTMASTER@SRI-NIC

Mike,

A minor change in the ordering of the nickname field in version #457 of the host table proved to be an inconvenience to BRL hosts. Therefore the following version (#458) reverted the name field to the following format:

    ... :domain name (if any), official name.arpa, official name, nickname: ...

We are sorry for any inconveniences this has caused.

In the future please send all correspondence pertaining to host tables directly to HOSTMASTER@SRI-NIC.ARPA. The fact that Hostmaster did not receive your messages directly caused a delay in correcting the problem.

-------
tcp-ip@ucbvax.ARPA (06/15/85)
From: Mark Crispin <MRC@SIMTEL20.ARPA>

It is interesting that only the TOPS-20 systems were put off by getting mail referring to "nickname.ARPA". That means that most of the non-TOPS-20 sites still have heuristics which recognize ".ARPA" as a special case. When the NIC started distributing a host table with the .ARPA names, I removed these heuristics, so that the TOPS-20 software would correspond with the obvious intent of the NIC table.

BRL's experience should rather dramatically show how bad such heuristics can be. Remember, the RFCs are quite clear in stating that:

 . only official names may appear in machine-generated fields (message headers, SMTP transactions)
 . only official names may have .ARPA applied

My guess is that BRL ignored the official entry in the host table entirely and used the remaining entries as the name list (a la the old NIC host table). It then applied .ARPA to traffic going out and stripped it coming in (the standard heuristic).

Folks, this is a kludge! I believe the TOPS-20s were doing the right thing. Comments?

-- Mark --

-------
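The difference between the two receiving behaviors discussed in this message can be sketched as follows. This is a minimal illustration, not the TOPS-20 mailer or any other site's code; the table contents are placeholder names and both function names are hypothetical. The "heuristic" version models the strip-.ARPA-then-look-up behavior described above, and the "strict" version accepts only names that literally appear in the table.

```python
# Illustrative sketch only: placeholder names, hypothetical functions.
# The table holds every name listed for one host entry.
TABLE_NAMES = {"FORMALNAME.ARPA", "FORMALNAME", "NICNAME1", "NICNAME2"}

SUFFIX = ".ARPA"


def strict_known(host):
    """Accept a host name only if it appears verbatim in the table."""
    return host.upper() in TABLE_NAMES


def heuristic_known(host):
    """Strip a trailing .ARPA before lookup (the special-case heuristic).

    This makes nickname.ARPA match even though the table never lists it.
    """
    name = host.upper()
    if name.endswith(SUFFIX):
        name = name[: -len(SUFFIX)]
    return name in TABLE_NAMES


if __name__ == "__main__":
    for host in ("FORMALNAME.ARPA", "NICNAME1.ARPA"):
        print(host, strict_known(host), heuristic_known(host))
```

With this placeholder table, both checks accept FORMALNAME.ARPA, but only the heuristic check accepts NICNAME1.ARPA, which matches the report that strict sites bounced the mail while others delivered it.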