[fa.tcp-ip] Partial BRL Mail Outage: Explained

tcp-ip@ucbvax.ARPA (06/15/85)

From: Mike Muuss <mike@BRL.ARPA>

This message reports on the details of the recent 2-day partial mail outage
at BRL.

SYMPTOMS.

The triggering event was an unannounced set of NIC Host Table updates which
our systems picked up and installed Wednesday night.  These updates changed
the format of the host name entries in the table.  Previously, host table
entries had the form:

HOST : address(es) : formalname.ARPA,formalname,nicname1,nicname2 : stuff :

As part of adding alternate names in other domains, this arrangement got
changed to:

HOST : address(es) : formalname.ARPA,nicname1,nicname2,formalname : stuff :

A secondary problem was that the host-table converter for our MMDF mail
system made some assumptions about the layout of the table, such that it
expected both forms of the formal host name to be listed BEFORE any
nicnames.  The new layout broke that assumption, with the result that all
mail from BRL was having the outbound mail addresses rewritten to
nicname1.ARPA, a string most hosts don't have in the table.
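
[Editor's illustration, not part of the original message.]  The failure
described above can be sketched as follows.  This is a minimal sketch and
not the actual nictable.c logic, which the message does not show: the
function names, the placeholder address, and the FORMAL/NICK1/NICK2 names
are all hypothetical, and the positional lookup is only one plausible way a
converter could have depended on the old ordering.

```python
def split_names(entry):
    """Return the name list from a 'HOST : addrs : names : stuff :' entry."""
    fields = [f.strip() for f in entry.split(":")]
    return [n.strip() for n in fields[2].split(",") if n.strip()]


def official_by_position(names):
    # Fragile (assumed failure mode): treats the SECOND name as the bare
    # formal name and appends .ARPA to it.
    return names[1] + ".ARPA"


def official_by_suffix(names):
    # Order-independent: take the first name that already ends in .ARPA,
    # falling back to the first name if none does.
    for n in names:
        if n.upper().endswith(".ARPA"):
            return n
    return names[0]


# Placeholder entries in the old and new layouts (address is made up).
old = "HOST : 10.0.0.1 : FORMAL.ARPA,FORMAL,NICK1,NICK2 : stuff :"
new = "HOST : 10.0.0.1 : FORMAL.ARPA,NICK1,NICK2,FORMAL : stuff :"

for label, entry in (("old", old), ("new", new)):
    names = split_names(entry)
    print(label, official_by_position(names), official_by_suffix(names))
```

Under these assumptions the positional lookup yields NICK1.ARPA for the new
layout, matching the symptom reported above, while the suffix-based lookup
gives the same answer for either ordering.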

Of the many systems on the network, the only ones that seemed overly put out
about getting mail of this form were the TOPS-20 folks.  They rejected a
TO address of nicname1.ARPA as unknown, causing us to return lots of mail
as undeliverable.  A fair amount of this was official BRL correspondence
being sent to other Army elements, and caused much concern in our front
office.

In addition to the symptoms described above, one of our older machines
running an older version of our mail software (BRL-VLD, aka VLD70) somehow
"forgot" what its name was due to the new mail tables, and refused to
receive ANY inbound mail AT ALL.  This caused several hundred pieces of
mail, much of it official business, to be returned as undeliverable.

CORRECTIVE ACTION.

As always happens, our mail system maintainer (Doug Kingston) was away
on TDY to the USENIX conference.  The first signs of trouble came early
Thursday morning from BRL-VLD, and George Hartwig began experimenting
to try to alleviate the problem.  The full extent of the problem was
not known until Dave Towson brought to my attention what he felt to be an
abnormally large rate of mail rejections on messages he had sent;  nearly
all going to TOPS-20 sites.

By Thursday evening, I had convinced the VLD70 that it knew its name again,
and coerced it to once again receive mail.  Alas, in the process it seems
to have lost the ability to SEND mail, but for the time being that struck
me as somewhat of an improvement;  users could access lots of other machines
nearby for mail sending.

Owing to a small stroke of luck, I was able to contact Doug Kingston, and at
2000 he was hard at work tracking down the table-builder bug behind the
outbound mail problems.  He installed a new version of
/usr/mmdf/table/nictable on BRL-VGR, but didn't build any new databases, and
didn't leave any mail on what he had done.  I returned Friday at noontime to
discover that our problems were not any better, and went investigating.
First thing I tried was to generate a new mail database using the new
version of the program Doug installed.  When that database was installed,
BRL-VGR lost the ability to send or receive mail.

This unfortunate turn of events forced me to delve into the source code for
the nictable.c program, and I was unable to determine how the previous
night's code changes were supposed to help, so I implemented my own changes.
After some fiddling and testing, my new version produced tables that seemed
visually correct, so I installed it and generated a new database.  After
half an hour of waiting for it to finish, and trying to battle the load
average down below 7, I was finally able to test it.  Fortunately, it
worked, and the first of many test messages wafted off to Wancho@simtel20,
who was the hapless TOPS-20 recipient of all my test mail.

Then, all (!)  that remained was to distribute the new program to the 8
other participating BRLNET machines, and rebuild THEIR databases.  By 1700
on Friday, all mail databases on the VAXen and Goulds had been rebuilt, and
each system had been personally hand-checked to make certain mail could now
flow.

There is no telling how much mail got returned during this fiasco.  Users
should be encouraged to re-send their failed mail now, and it *should* go
through.

STATUS.

The BRL VAXen (plus HEL-ACE) are all seemingly back to working order.
The Gould (BRL-SEM) too.

The 2 remaining PDP-11 systems are still in various states of difficulty.
Both BRL-BMD and BRL-VLD are still unable to send to TOPS-20 sites, until
George Hartwig can determine how to install the new nictable program there.
In addition, BRL-VLD is still unable to send mail ([PARM] No valid author
specification present), but at least mail does seem to be flowing in and
getting delivered properly.

These problems I leave to George;  hopefully when Doug returns on Monday,
the PDP-11s can be rapidly returned to full working order.

COMMENTARY.

The NIC Host Table is currently the single most important file on the
MILNET; if this file is capriciously changed, the potential for network-wide
harm is substantial.  The good folks at the NIC have for many years tended
this file with care and attention to detail; my complaint in this instance
is that there was no warning given to Host Administrators or Technical
Liaisons that this change was impending.  With some advance notice, we could
have arranged to consider the implications beforehand, and either adjusted
our software before anything went wrong, or shut off our automatic
host-table update mechanism until we were prepared to attack the problem in
an orderly manner.

Soon, the NIC Host Table is going to be replaced by the magic domain-server
system, and I'm sure that there will be some growing pains associated with
that.  In the interim, let's finish testing our domain code, and stop having
to put out fires with the existing, nearly obsolete, mechanism.

I just hate Mondays, especially when they happen on Thursday.
	-Mike

tcp-ip@ucbvax.ARPA (06/15/85)

From: HOSTMASTER@SRI-NIC

Mike,

A minor change in the ordering of the nickname field in version #457 of
the host table proved to be an inconvenience to BRL hosts.  Therefore
the following version (#458) reverted the name field to the following
format:

... :domain name (if any), official name.arpa, official name, nickname: ...

We are sorry for any inconveniences this has caused.

In the future please send all correspondence pertaining to host tables
directly to HOSTMASTER@SRI-NIC.ARPA.  The fact that Hostmaster did not
receive your messages directly caused a delay in correcting the problem.

-------

tcp-ip@ucbvax.ARPA (06/15/85)

From: Mark Crispin <MRC@SIMTEL20.ARPA>

     It is interesting that only the TOPS-20 systems were put off by
getting mail referring to "nickname.ARPA".  That means that most of
the non-TOPS-20 sites still have heuristics which recognize ".ARPA"
as a special case.

     When the NIC started distributing a host table with the .ARPA
names, I removed these heuristics, so that the TOPS-20 software would
correspond with the obvious intent of the NIC table.

     BRL's experience should rather dramatically show how bad such
heuristics can be.  Remember, the RFC's are quite clear in stating that:
 . only official names may appear in machine-generated fields (message
   headers, SMTP transactions)
 . only official names may have .ARPA applied

     My guess is that BRL ignored the official entry in the host table
entirely and used the remaining entries as the name list (a la the old
NIC host table).  It then applied .ARPA to traffic going out and
stripped it coming in (the standard heuristic).  Folks, this is a kludge!
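
[Editor's illustration, not part of the original message.]  The difference
between the ".ARPA" heuristic and a strict table lookup can be sketched as
below.  This is an assumption-laden sketch of the behaviour described in
this message, not code from TOPS-20, MMDF, or any other mailer; the table
contents and function names are hypothetical.

```python
# Hypothetical one-entry host table: official name -> other listed names.
HOST_TABLE = {
    "FORMAL.ARPA": ["FORMAL", "NICK1"],
}

SUFFIX = ".ARPA"


def strict_lookup(name):
    """Accept only names that appear verbatim in the table."""
    name = name.upper()
    for official, others in HOST_TABLE.items():
        if name == official or name in others:
            return official
    return None


def heuristic_lookup(name):
    """Strip a trailing .ARPA first, then match any listed name."""
    name = name.upper()
    if name.endswith(SUFFIX):
        name = name[:-len(SUFFIX)]
    for official, others in HOST_TABLE.items():
        if name == official[:-len(SUFFIX)] or name in others:
            return official
    return None


for candidate in ("FORMAL.ARPA", "NICK1", "NICK1.ARPA"):
    print(candidate, strict_lookup(candidate), heuristic_lookup(candidate))
```

In this sketch only the strict lookup rejects NICK1.ARPA, which is the
behaviour attributed to the TOPS-20 systems; the heuristic quietly accepts
it and so hides the malformed address.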

     I believe the TOPS-20's were doing the right thing.  Comments?

-- Mark --
-------