[comp.archives] Re: Bye Bye BART

treese@crl.dec.com (02/19/91)

Archive-name: mail/archive-server/bart/1991-02-18
Original-posting-by: treese@crl.dec.com
Original-subject: Re: Bye Bye BART (on comp.archives)
Reposted-by: emv@ox.com (Edward Vielmetti)

Could you post this on comp.archives as a followup to the message
with the subject "[atari-st...] Bye Bye BART", with message ID
<1991Feb11.234221.6017@ox.com>?  We would like to clear up the
explanation of the situation.

Thanks for your cooperation.

Win Treese						Cambridge Research Lab
treese@crl.dec.com					Digital Equipment Corp.

[This is kind of long but illustrates some of the things that
mail-based archive servers need to deal with.  In short, not all
mail systems use the Date: header, and various systems have different
indications of lost mail to the user.  Put enough problems in the
same spot, add a few days of gateway downtime, and presto: disaster.

Mail-based archive servers entail some amount of risk to the service
provider, the service user, and any number of unintended relays and
gateways along the way.  If you run one, be prepared to hear from
your neighbors.  --Ed]

------- Forwarded Message
 Newsgroups: comp.sys.atari.st,comp.sys.atari.8bit
 From: reid@decwrl.dec.com (Brian Reid)
 Subject: Re: Bye Bye BART
 Summary: Not Kaiser's fault
 
 I am the manager of the USENET and electronic mail gateway between Digital
 Equipment Corporation and the rest of USENET. The unfortunate incident for
 which Mr. Kaiser has been so cruelly blamed was completely an accident, and
 is the result of a "culture clash" rather than any malice. It is perhaps
 best not to use harsh words until you have finished understanding an incident.
 
 Hans Kaiser works in Digital's software support office in Stuttgart, Germany. 
 Like most Digital field offices, it is equipped with VMS computers and
 connected to Digital's DECNET network. The conversion between internal
 DECNET and external protocols is performed by the DECWRL computer for which I
 am responsible.
 
 VMS and DECNET do not have the concept of queueing mail. When you send a
 message, either it is delivered instantly or it bounces. The idea is that you
 want the sender to know instantly if his message did not get through.
 As a result, VMS mail users have, through the years, grown accustomed to
 believing that if they do not get a "message sent" message, then their
 message did not get sent.
 
 Whenever mail is relayed from one network to another, rather than just queued, 
 the concept of "immediate delivery" is somewhat meaningless, because you
 haven't really delivered the mail, but rather have just handed it off to some
 intermediate postman. But user expectations are still very strong: if a
 user sends an internetwork message, and doesn't get back a "message sent"
 reply, his experience leads him to believe that the message was lost.
 
 Last week we had a head crash on the primary disk on our DECWRL relay
 computer, and for various reasons it took almost 3 days to get the machine
 back up. We announced this failure on the appropriate internal Digital
 newsgroups (dec.mail.config), but did not send individual notification to the
 tens of thousands of users of the gateway, as we sometimes do when we are
 certain that it will be down for a long time.
 
 During this interval Hans Kaiser was trying to retrieve files from the Atari
 archive server. He is not a reader of dec.mail.config and probably did not
 know that the gateway was down. He sent some retrieval requests, and got no
 reply.
 
 Here comes the "culture clash" that I mentioned in the first paragraph.
 
 When a VMS user sends a mail message that does not get delivered, he is
 conditioned to believe that it has been lost or deleted, because that is what
 happens in the normal case. However, these messages that Kaiser sent were
 neither lost, nor deleted. They were carefully queued, waiting for the DECWRL
 gateway to come back up again, so that they could be sent.
 
 When he got no response, Kaiser sent more requests. This is the natural thing
 to do in the VMS world. If it didn't work, and if you are following
 instructions, then try again. Maybe something will have been fixed.
 I don't know exactly how many times Kaiser repeated the request over the
 3-day interval, but I am sure that if he had known that his messages were all
 being queued, instead of vanishing as he thought, he would not have
 repeated them.
 
 Eventually (I think it was on Wednesday night, California time) the DECWRL
 gateway was brought back to life, and all of the queued messages were sent to
 the Atari archive server in one lump. Archive servers are in general
 programmed to have per-user quotas, so that if something like this happens,
 it won't bring the archive server to its knees trying to handle so many
 requests at once.
 
 Alas, here the "culture clash" strikes again.
 
 The DECNET mail protocol does not support a "time and date" mechanism. The
 only information that it records about a message, besides the message body,
 is what we Unix/IP people know as the "To" and "Cc" and "Subject" and "From"
 fields.
 
 In DECNET protocol it is up to the receiver of a message to timestamp it
 with the time that it was received. The reason for this is that since
 there is no queueing, the time that a message was received is guaranteed to
 be equal to the time that it was sent. As a result, the network mail
 protocol has no mechanism to record the time that a message was sent.
 
 The documentation for the DECWRL mail gateway, which we distribute to all
 employees who ask for it, instructs them to use the gateway by sending mail
 with a certain mail program that is not part of the software that Digital
 ships to its customers. This program, called "nmail", is helpful in smoothing
 the peak load on the gateway by queueing at certain times. However, since the
 mail-sending software knows that the mail might be queued, it records the
 time that the message was actually originated. This is because the "Date"
 field in the message will contain the time that it was delivered and not
 the time that it was actually sent. "nmail" does this by adding the date and
 time to the "From" field of the message. It really doesn't have much choice,
 because the DECNET mail protocol supports only "To", "Subj", "From",
 and "Cc" fields, and there is a fixed limit to the size of the "Subj" field.
 
 Why does this matter? It matters because the Atari archive server at the
 University of Michigan looks at the "From" field of an incoming message to
 avoid processing too many simultaneous requests from the same person.
 There is a "per-user" quota for each day. The problem is that when you send
 the mail using a mail program that encodes the date and time of the message
 in the "From" field, then every message looks like it came from a different
 user. 
 
 As a result of this, when the DECWRL mail relay came back to life last
 Wednesday, it sent many dozens of retrieval requests to Michigan all at once,
 and Michigan's software failed to understand that they were all from the same
 person because the "From" field on each of them had a different date and
 time. As a result, the Michigan archive server tried to process all of them
 at once, and, evidently, melted into a pile of slag.
 
 Since I work for a company that sells computers, I suppose the loyal thing
 for me to do at this point is to try to sell Michigan a bigger computer to
 use as the archive server, but I don't work in a sales office, I work in
 Corporate Research, and what I want is for everybody to be happy. I am very
 sorry that a combination of accidents inside Digital, in Germany and
 California, caused this unfortunate incident on a university computer at
 Michigan, and I will happily offer the services of the excellent network
 programmers at DEC Western Research to help ensure that the Michigan archive
 server does not meet this fate again. Mostly I want people to know that this
 was in no way the fault of Hans Kaiser. If it was anybody's fault, it was my
 fault, for accidentally failing to copy the serial number of a certain disk
 drive onto a service-contract renewal form for 1991, thereby leaving the disk
 unprotected by maintenance contract. Disks often fail on purpose when they
 learn that they are not covered by maintenance contract.
 
 If you have sent Mr. Kaiser (or Herr Kaiser, as he probably prefers to be
 called) a nasty message, it might be civil to send him another one letting
 him know that, now that the facts are known, you aren't so angry any more.
 If you find the need to be angry at somebody, please be angry at me. As the
 manager of an electronic mail gateway, I'm used to it.
 
 Brian Reid
 DEC Western Research Laboratory

------- End of Forwarded Message
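
[The quota failure described in the forwarded message can be sketched
roughly as follows.  This is only an illustration under stated
assumptions, not the Michigan server's code or the real "nmail" format:
the address HOST::USER, the date layout, the limit of 3, and the
function names are all made up for the example.  It counts requests
per "From" field, first as received and then with a trailing timestamp
stripped.]

```python
import re
from collections import Counter

# Hypothetical daily limit; the real server's quota is not stated above.
DAILY_LIMIT = 3

# ASSUMPTION: an invented layout in which the sending program appends the
# origination date and time, in quotes, after the address. The actual
# "nmail" layout is not given in the message.
STAMP = re.compile(r'\s*"\d{1,2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}"\s*$')

def sender_key(from_field, strip_stamp):
    """Return the key under which a request is counted against the quota."""
    key = from_field.strip()
    if strip_stamp:
        key = STAMP.sub("", key)
    return key.lower()

def accepted(from_fields, strip_stamp):
    """Count how many requests pass a per-sender daily quota."""
    seen = Counter()
    passed = 0
    for field in from_fields:
        key = sender_key(field, strip_stamp)
        seen[key] += 1
        if seen[key] <= DAILY_LIMIT:
            passed += 1
    return passed

# Forty queued requests from one person, each stamped a minute apart.
burst = ['HOST::USER "01-JAN-1991 10:%02d"' % m for m in range(40)]
naive = accepted(burst, strip_stamp=False)      # every From looks unique
normalized = accepted(burst, strip_stamp=True)  # all collapse to one sender
print(naive, normalized)
```

[With the timestamp left in place, every request appears to come from
a new sender and all 40 get through; with it stripped, only the first
3 do.  A real server would need to know which gateways decorate the
"From" field, which is exactly the kind of detail that is hard to
anticipate.]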