[comp.protocols.tcp-ip] Case of the Replicated Errors: An Internet Postmaster's Horror Story

fair@APPLE.COM ("Erik E. Fair", Your Friendly Postmaster) (05/10/91)

This Is The Network: The Apple Engineering Network.

The Apple Engineering Network has about 100 IP subnets, 224 AppleTalk
zones, and over 600 AppleTalk networks. It stretches from Tokyo, Japan,
to Paris, France, with half a dozen locations in the U.S., and 40
buildings in the Silicon Valley. It is interconnected with the Internet
in three places: two in the Silicon Valley, and one in Boston. It
supports almost 10,000 users every day.

When things go wrong with E-mail on this network, it's my problem.
My name is Fair. I carry a badge.

[insert theme from "Dragnet"]

The story you are about to read is true. The names have not been
changed so as to finger the guilty.

It was early evening, on a Monday. I was working the swing shift out of
Engineering Computer Operations under the command of Richard Herndon.
I don't have a partner.

While I was reading my E-mail that evening, I noticed that the load
average on apple.com, our VAX-8650, had climbed way out of its normal
range to just over 72.

Upon investigation, I found that thousands of Internet hosts were trying
to send us an error message. I also found 2,000+ copies of this error
message already in our queue.

I immediately shut down the sendmail daemon which was offering SMTP
service on our VAX.

I examined the error message, and reconstructed the following sequence
of events:

We have a large community of users who use QuickMail, a popular
Macintosh-based E-mail system from CE Software. In order to make it
possible for these users to communicate with other users who have
chosen to use other E-mail systems, ECO supports a QuickMail to
Internet E-mail gateway. We use RFC822 Internet mail format, and RFC821
SMTP as our common intermediate E-mail standard, and we gateway
everything that we can to that standard, to promote interoperability.

The gateway that we installed for this purpose is MAIL*LINK SMTP from
Starnine Systems. This product is also known as GatorMail-Q from
Cayman Systems. It does gateway duty for all of the 3,500 QuickMail
users on the Apple Engineering Network.

Many of our users subscribe, from QuickMail, to Internet mailing lists
which are delivered to them through this gateway. One such user, Mark
E. Davis, is on the unicode@sun.com mailing list, to discuss some
alternatives to ASCII with the other members of that list.

Sometime on Monday, he replied to a message that he received from the
mailing list. He composed a one paragraph comment on the original
message, and hit the "send" button.

Somewhere in the process of that reply, either QuickMail or 
MAIL*LINK SMTP mangled the "To:" field of the message.

The important part is that the "To:" field contained exactly one "<"
character, without a matching ">" character. This minor point caused
the massive devastation, because it interacted with a bug in sendmail.

Note that this syntax error in the "To:" field has nothing whatsoever
to do with the actual recipient list, which is handled separately, and
which, in this case, was perfectly correct.
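
For illustration only, here is a minimal Python sketch of the kind of check
that would flag such a header. The function name is hypothetical, the sample
header values are invented (the original mangled header is not reproduced in
this post), and the check ignores RFC 822 parenthesized comments and
backslash escapes, so treat it as a rough approximation rather than a full
address parser.

```python
def angle_brackets_balanced(header_value: str) -> bool:
    """Return True if every '<' outside a quoted string has a matching '>'.

    Simplified: does not handle RFC 822 comments or backslash escapes.
    """
    open_bracket = False
    in_quotes = False
    for ch in header_value:
        if ch == '"':
            in_quotes = not in_quotes
        elif in_quotes:
            continue  # brackets inside a quoted display name do not count
        elif ch == "<":
            if open_bracket:
                return False  # a second '<' before the first one closed
            open_bracket = True
        elif ch == ">":
            if not open_bracket:
                return False  # a '>' with nothing to close
            open_bracket = False
    return not open_bracket


# Invented example values; the envelope recipient list is separate data
# and is not consulted by this check at all.
good = "Unicode List <unicode@sun.com>"
bad = "Unicode List <unicode@sun.com"
print(angle_brackets_balanced(good), angle_brackets_balanced(bad))
```

The point of keeping the check on the header text alone is the one made
above: the envelope can be perfectly correct while the "To:" line is not.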

The message made it out of the Apple Engineering Network, and over to
Sun Microsystems, where it was exploded out to all the recipients of
the unicode@sun.com mailing list.

Sendmail, arguably the standard SMTP daemon and mailer for UNIX,
doesn't like "To:" fields which are constructed as described. What it
does about this is the real problem: it sends an error message back to
the sender of the message, AND delivers the original message onward to
whatever specified destinations are listed in the recipient list.
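
A toy model may make the difference clearer. The Python sketch below is an
assumption-laden illustration, not a description of sendmail internals: the
function name is hypothetical, it assumes the list host is itself a
complaining sendmail, and the hop counts are invented. It makes no attempt
to reproduce the actual message counts reported in this post.

```python
def count_errors(hops_per_recipient, forwards_after_error):
    """Count error messages mailed back to the original sender.

    hops_per_recipient: for each list recipient, how many complaining
        relays the exploded copy crosses (toy numbers).
    forwards_after_error: True models "send an error AND deliver onward";
        False models "send an error and stop".
    """
    errors = 1  # the list host sees the bad header first and complains
    if not forwards_after_error:
        return errors  # the message stops here; the list never expands
    for hops in hops_per_recipient:
        errors += hops  # every hop on every recipient's path complains
    return errors


# Toy input: 200 recipients, 2 complaining hops each (both numbers assumed).
toy = [2] * 200
print(count_errors(toy, forwards_after_error=True))   # 401
print(count_errors(toy, forwards_after_error=False))  # 1
```

Under these assumptions the sender gets one error if the bad message is
stopped, and one per hop per recipient if it is also forwarded.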

This is deadly.

The effect was that every sendmail daemon on every host which touched
the bad message sent an error message back to us about it. I have
often dreaded the possibility that one day, every host on the Internet
(all 400,000 of them) would try to send us a message, all at once.

On Monday, we got a taste of what that must be like.

I don't know how many people are on the unicode@sun.com mailing list,
but I've heard from Postmasters in Sweden, Japan, Korea, Australia,
Britain, France, and all over the U.S. I speculate that the list has
at least 200 recipients, and about 25% of them are actually UUCP sites
that are MX'd on the Internet.

I destroyed about 4,000 copies of the error message in our queues here
at Apple Computer.

After I turned off our SMTP daemon, our secondary MX sites got whacked.
We have a secondary MX site so that when we're down, someone else will
collect our mail in one place, and deliver it to us in an orderly
fashion, rather than have every host which has a message for us jump on
us the very second that we come back up.
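
The fallback logic a sending host applies can be sketched in a few lines of
Python. This is a simplified illustration: the function name is
hypothetical, the preference numbers are made up (only the hostnames come
from this post), and a real mailer also handles equal preferences, retries,
and queueing.

```python
def pick_mx(records, is_up):
    """Return the reachable host with the lowest MX preference, or None.

    records: iterable of (preference, hostname) pairs.
    is_up: callable taking a hostname and returning True if it accepts mail.
    """
    for _preference, host in sorted(records):
        if is_up(host):
            return host
    return None  # nothing reachable; the sender queues and retries later


# Preference values below are assumptions for the example.
records = [(20, "relay.cs.net"), (10, "apple.com"), (30, "relay2.cs.net")]

# With the primary refusing connections, mail lands on the secondary.
print(pick_mx(records, lambda host: host != "apple.com"))  # relay.cs.net
```

This is why shutting down the primary does not make the flood go away; it
moves it to whoever is next in the list.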

Our secondary MX is the CSNET Relay (relay.cs.net and relay2.cs.net).
They eventually destroyed over 11,000 copies of the error message in
the queues on the two relay machines. Their postmistress was at wit's
end when I spoke to her. She wanted to know what had hit her machines.

It seems that for every one machine that had successfully contacted
apple.com and delivered a copy of that error message, there were three
hosts which couldn't get ahold of apple.com because we were overloaded
from all the mail, and so they contacted the CSNET Relay instead.

I also heard from CSNET that UUNET, a major MX site for many other
hosts, had destroyed 2,000 copies of the error message. I presume that
their modems were very busy delivering copies of the error message
from outlying UUCP sites back to us at Apple Computer.


This instantiation of this problem has abated for the moment, but I'm
still spending a lot of time answering E-mail queries from postmasters
all over the world.

The next day, I replaced the current release of MAIL*LINK SMTP with a
beta test version of their next release. It has not shown the header
mangling bug, yet.


The final chapter of this horror story has yet to be written.

The versions of sendmail with this behavior are still out there on
hundreds of thousands of computers, waiting for another chance to bury
some unlucky site in error messages.

Are you next?

[insert theme from "The Twilight Zone"]

	just the vax, ma'am,

	Erik E. Fair	apple!fair	fair@apple.com

steve@UMIACS.UMD.EDU (Steve D. Miller) (05/10/91)

   I had this same sort of error happen to me in the early days (only 500 or
so people on the list, thank goodness) of the Sun-Nets mailing list.  The
resulting errors trashed a VAX 8600 here for twelve hours or so.  In self-
defense, I added a hack to the software I use to run the Sun-Nets list:  it
checks several important header lines to be sure that they aren't too badly
botched, and if it detects an error it bounces the mail to the list
maintainer with a note that says, "the header is messed up, you'd better
take a look at this."  From what Erik said, it sounds like my software would
have kept this problem from happening.  (If someone can give me a copy of
the headers off the original mail, I can check this out.)
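
   For readers who want the idea without fetching the tarball, a rough
Python sketch of this kind of pre-distribution check follows. It is not the
distribute code: the function name, the set of headers checked, and the
sample addresses are all assumptions, and the test applied (equal counts of
"<" and ">") is deliberately crude.

```python
from email import message_from_string

# Which headers count as "important" is an assumption for this sketch.
CHECKED_HEADERS = ("From", "To", "Cc")

NOTE = "the header is messed up, you'd better take a look at this."


def route(raw_message, list_addr, maintainer_addr):
    """Decide where a submission goes before the list is expanded.

    Returns (destination, note). The note is None when the message is
    passed on to the list, and NOTE when it is diverted to the maintainer.
    """
    msg = message_from_string(raw_message)
    for name in CHECKED_HEADERS:
        for value in msg.get_all(name, []):
            if value.count("<") != value.count(">"):
                return maintainer_addr, NOTE
    return list_addr, None


# Invented sample message with an unclosed '<' in the To: line.
raw = (
    "From: someone@example.com\n"
    "To: Some List <list@example.com\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
print(route(raw, "list@example.com", "list-owner@example.com")[0])
```

The design point is that the check runs once, at the list host, before the
message is multiplied by the size of the readership.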

   The software also:

	- sets the Sender:  line and the from address in the envelope to say
	something reasonable

	- optionally trims Received:  lines (good for times when a message
	takes 9 hops to come in and another 9 to make it back out, and thus
	would otherwise trigger sendmail's fake-o loop detection)

	- allows an optional header and/or footer to be added to the body of
	the message

	- has some limited smarts (shamelessly borrowed from the mail2news
	program) that attempts to filter out ``please add/delete me'' mail
	mistakenly sent to the list readership rather than to the
	administrivia address.
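
   The first two items can be approximated with Python's standard email
package, as sketched below. Again, this is not the distribute code; the
function name and sample addresses are hypothetical, and setting the
envelope from address is left out because that happens when the message is
handed to the mailer, not in the header text.

```python
from email import message_from_string


def prepare_for_list(raw_message, sender_addr, trim_received=True):
    """Rewrite headers on a submission before redistributing it.

    Sets Sender: to sender_addr and, optionally, drops every Received:
    line so the outbound copy starts with a fresh hop count.
    """
    msg = message_from_string(raw_message)
    if trim_received:
        del msg["Received"]  # removes all Received: headers, if any
    del msg["Sender"]        # drop any existing Sender: before setting ours
    msg["Sender"] = sender_addr
    return msg.as_string()


# Invented sample message with two inbound hops recorded.
raw = (
    "Received: from a.example.com by b.example.com; Mon, 1 Jan 1990\n"
    "Received: from c.example.com by a.example.com; Mon, 1 Jan 1990\n"
    "From: someone@example.com\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
print(prepare_for_list(raw, "list-request@example.com"))
```

Trimming is kept optional, as in the description above, since the Received:
lines are also the main tool for tracing where a message has been.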

   I'm sure it's not perfect (the administrivia filter is too trusting, and
in particular doesn't catch LISTSERV stuff), but if you want it, anonymous
FTP out to ftp.umiacs.umd.edu and grab pub/distribute.tar.Z.  The man entry
should be enough to get you started.  Distribute should be fairly portable
(but it will require hacking to be used with something other than sendmail).
If you make changes to this software, I'd be interested in seeing them.

	-Steve

Spoken: Steve Miller    Domain: steve@umiacs.umd.edu    UUCP: uunet!mimsy!steve
Phone: +1-301-405-6736  USPS: UMIACS, Univ. of Maryland, College Park, MD 20742