Jacob_Palme_QZ%QZCOM.MAILNET@MIT-MULTICS.ARPA (03/08/85)
Checksum as a replacement for missing Message-ID.
------------------------------------------------

The Message-ID is a very useful field for many purposes:

(a) To preserve In-reply-to references between transferred messages.
(b) To stop loops by not accepting the same message to the same recipient more than once.
(c) To be able to identify that several copies of the same message are the same, which will save disk space and provide better user functionality in some systems.

The problem is that many messages do not have any Message-ID-s. I am planning to modify the COM network mail interface to generate Message-ID-s for messages which lack such ID-s. These generated ID-s will be used internally in COM and will be affixed to a message if it is sent out to the networks again, e.g. by a conference/mailing list residing on a COM system.

The Message-ID should uniquely identify one message, so that all copies of the same message will get the same Message-ID. Thus, if two systems independently generate a Message-ID for a message, they should produce the same ID. To achieve this goal, I suggest generating the Message-ID as a checksum of the message.

For the same reason, the ID should *not* refer to the host name of the message system generating the ID, if this is not the system where the message originated. Thus, I propose to generate Message-ID-s of the format <CHECKSUM-VALUE@CHECKSUM>, where the host name in RFC822 is replaced by the word "CHECKSUM". This will tell receiving systems that this is a CHECKSUM-ed ID, so that they can identify it with other CHECKSUM-ed ID-s.

The alternative would be to produce ID-s in the format <CHECKSUM-VALUE@ORIGINAL-HOST>. However, it does not seem nice to generate ID-s purporting to come from a host which did not in reality generate this ID.

Selection of CHECKSUM algorithm:
-------------------------------

The algorithm should uniquely identify a message with very low probability of different messages getting the same ID. On the other hand, the checksum should not change for common modifications to a message, like additions of new recipients in the RFC822 header, different line foldings or conversions of TAB-s to SPACE-s.

The following algorithm is proposed: The CHECKSUM contains 15 characters, in three groups of five characters. The first group is computed from the name in the FROM field, the second group from the value in the DATE field, the third group is computed from the textual contents of the message. Each group should have a checksum algorithm suitable for that group.

For the FROM field, I suggest the following:

(a) Use only the value of the "addr-spec" part of the FROM field (delete the "phrase" part and the <>-s, if any).
(b) Upcase the characters A-Z before checksum computation.
(c) Only include characters A-Z and digits 0-9 in checksum computation.
(d) Compute the checksum by summation of the characters, with the weight 1 to the first character, 2 for the second, 4 for the third etc. up to 2^16 for the sixteenth character, then 1 for the seventeenth etc.
(e) Take the remainder of the checksum modulo 2^24. Translate this remainder to five characters in a 32-based number system with the digits 0..9, A..V.

Rationale: This checksum should be easy to compute on any computer with 32-bit word integer arithmetic.

For the DATE field, I suggest as checksum the number

    (((YEAR+SECOND+MONTH)*31+DAY)*24+HOUR)*60+MINUTE

This number is again translated to a five character string as described in (e) above.

For the contents of the message, all non-printable characters, including tab and space, should be disregarded when computing the checksum. The checksum is computed using the algorithm in stage (d) and (e) described above (but not stages a-b-c).

Rationale: Disregarding all non-printable characters, including tab and space, is necessary to ensure that line folding will not change the checksum.
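
Steps (a) through (e) for the FROM group, and the reuse of (d) and (e) for the message body, might look roughly like the sketch below. It is an illustration under several assumptions, not the COM implementation: the function names are made up, "summation of the characters" is read as summing ASCII codes, "printable" is taken to mean codes 33 through 126, and the weights are taken as 1, 2, 4, ... doubling for sixteen positions before wrapping (doubling from 1 reaches 2^15 at the sixteenth character, not the 2^16 the text states, so one of the two readings has to be chosen). The addr-spec is assumed to be extracted already.

```python
# Sketch of the FROM-group and body-group checksums (names are hypothetical).
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"   # base-32 digits 0..9, A..V
ALNUM = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")


def weighted_sum(chars):
    """Step (d) and first half of (e): weights 1, 2, 4, ... wrapping every 16 chars."""
    total = 0
    for i, ch in enumerate(chars):
        total += ord(ch) << (i % 16)          # multiply by a power of two via shift
    return total % (1 << 24)                  # remainder modulo 2^24


def to_base32(n, width=5):
    """Second half of (e): fixed-width base-32 string, most significant digit first."""
    out = []
    for _ in range(width):
        n, r = divmod(n, 32)
        out.append(DIGITS[r])
    return "".join(reversed(out))


def from_group(addr_spec):
    """Steps (b) and (c): upcase, keep only A-Z and 0-9, then checksum."""
    kept = [c for c in addr_spec.upper() if c in ALNUM]
    return to_base32(weighted_sum(kept))


def body_group(text):
    """Body checksum: drop everything outside ASCII 33..126 (assumed range)."""
    kept = [c for c in text if 33 <= ord(c) <= 126]
    return to_base32(weighted_sum(kept))


if __name__ == "__main__":
    print(from_group("BillW@SU-SCORE.ARPA"))
    print(body_group("Yes!"))
```

With this reading, refolding lines or turning tabs into spaces leaves `body_group` unchanged, and letter case in the address does not affect `from_group`.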
Torgny_Tholerus_QZ%QZCOM.MAILNET@MIT-MULTICS.ARPA (03/08/85)
One problem I can see is the DATE field. This field might look like:

    13 Apr 83 17:43 BST

or:

    Wed Feb 20 16:07:15 1985

How do I find the year field or the month field? Perhaps the best thing is to treat the DATE field just like a string, in the same way as the FROM field is treated.
Craig.Everhart@CMU-CS-A.ARPA (03/08/85)
Why use the From: or Date: fields at all? The From: field is a popular candidate for editing by automatic agents; I'm not convinced that Mr. Palme's algorithm will remove all traces of that editing. The Date:-field algorithm was underspecified (year in century, as SMTP would have? What is the origin for months? Any use of time zone information?). For that matter, I'm not sure that all agents would agree on the concept of ``printing character'' in the body of the message.

Why not use an algorithm based solely on the body of the message? It can ignore characters outside the range [33,126] (decimal, inclusive); obviously it would only count characters in that range when incrementing the checksum counter.

It may be less expensive to use something other than multiplication as a basis for the checksum on many small machines. Are there suitable algorithms based on bit rotations or shifts?

And perhaps the whole discussion should be moved into an RFC.
Jacob_Palme_QZ%QZCOM.MAILNET@MIT-MULTICS.ARPA (03/09/85)
The reason I chose not to treat the date field as a string is that the checksum should not change if the date field is translated from one format to another, e.g. from the format in RFC822 to the format in X.400. This of course requires that the date field is parseable. But every single message should, when you get it, have a header in one given standard, and thus date fields not belonging to this standard should not occur.

One problem is the time zone. Since the format of this seems to be often wrong, I did not include it in the checksum. This means that if the time is translated to another time zone, e.g. if the date "25 Feb 85 15:01 EST" is translated by some agent handling the message into "25 Feb 85 12:01 PST", then the algorithm will create a different checksum.

Question: Would such translations of dates by message handling agents be expected to occur?
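
To make the time zone concern concrete, the DATE formula from the original proposal can be evaluated for the two renderings of the same instant. This is only a sketch: the formula is copied as it appears in the proposal, and the two-digit year, months numbered from 1, and a seconds value of 0 when none is given are all assumptions (the proposal leaves them open). The function name is hypothetical.

```python
# Sketch: the proposed DATE number, taken literally from the proposal.
def date_number(year, month, day, hour, minute, second=0):
    """(((YEAR+SECOND+MONTH)*31+DAY)*24+HOUR)*60+MINUTE, with the zone ignored."""
    return (((year + second + month) * 31 + day) * 24 + hour) * 60 + minute


# "25 Feb 85 15:01 EST" and "25 Feb 85 12:01 PST" name the same instant,
# but the zone is not part of the formula, so the numbers differ.
est = date_number(85, 2, 25, 15, 1)
pst = date_number(85, 2, 25, 12, 1)

if __name__ == "__main__":
    print(est, pst, est - pst)
```

The two values differ by the three-hour shift expressed in minutes, so an agent that rewrites the zone would change the DATE group of the ID.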
Jacob_Palme_QZ%QZCOM.MAILNET@MIT-MULTICS.ARPA (03/09/85)
FROM: William "Chops" Westfield <BillW@SU-SCORE.ARPA>

> Why is it necessary for two hosts trying to create a MESSAGE-ID
> to come up with the same result? I don't understand why anyone
> but the original host would try to create a message id...

If the original host always created Message-ID-s, this would be a better solution. We will however have to accept the fact that some hosts do not create globally unique Message-ID-s. Unfortunately, neither RFC822 nor X.400 makes globally unique Message-ID-s mandatory (IPMessageID-s in X.400 terminology).

Suppose one and the same message gets forwarded, directly or indirectly, to two different mailing lists, and that a certain user is a member of both lists. If the intermediate hosts handling the mailing list created a checksum, this could be used by the host for the recipient user to stop displaying the same message twice, or, to tell him that it is the same message which he gets twice.

Why should the intermediate host add a Message-ID to a message lacking such an ID? Because the ID is very useful for loop control. If two mailing lists are members of each other (which has advantages, but cannot be done with present practices on Arpanet because of the risk for loops) then if the list maintaining program kept a list of the ID-s of messages sent via the list (COM does this) then it could stop re-sending the message when it comes around the second time.

FROM: Craig.Everhart@CMU-CS-A.ARPA

> Why use the From: or Date: fields at all? The From: field is a
> popular candidate for editing by automatic agents; I'm not convinced
> that Mr. Palme's algorithm will remove all traces of that editing.
> The Date:-field algorithm was underspecified (year in century, as
> SMTP would have? What is the origin for months? Any use of time
> zone information?).

The goal of course is to find an algorithm with a very very low probability of getting the same ID for two different messages, but also with low probability of giving different ID-s for the same message because of some transformation on the message.

Only using TEXT CONTENT is NOT acceptable. Suppose in a voting application that two people wrote messages with the only content being the word "Yes!". Only using TEXT CONTENT would hide the very important fact of the names of the people who voted "Yes!". Not using Date/time is also not acceptable: suppose the same person voted "Yes!" on two different issues, this fact would then be hidden.

FROM: Craig.Everhart@CMU-CS-A.ARPA

> It may be less expensive to use something other than multiplication
> as a basis for the checksum on many small machines. Are there
> suitable algorithms based on bit rotations or shifts?

Most of the multiplications in my algorithm (all of those in processing the body of the message) were by powers of 2 and thus can be implemented by shifts.
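
The loop-control idea described earlier in this message (a list program that remembers the ID-s of messages it has already sent on, and declines to re-send one that comes around again) could be sketched as follows. The class and method names are hypothetical and the in-memory set is an assumption; nothing here describes how COM actually stores its list of ID-s.

```python
# Sketch of loop control for a mailing list, keyed on Message-ID.
class ListForwarder:
    """Hypothetical list exploder that forwards each Message-ID at most once."""

    def __init__(self):
        self.seen_ids = set()                 # ID-s of messages already sent via the list

    def should_forward(self, message_id):
        """Return True the first time an ID is seen, False on any repeat."""
        if message_id in self.seen_ids:
            return False                      # came around a second time: stop the loop
        self.seen_ids.add(message_id)
        return True


if __name__ == "__main__":
    lst = ListForwarder()
    print(lst.should_forward("<EXAMPLE-VALUE@CHECKSUM>"))   # first pass
    print(lst.should_forward("<EXAMPLE-VALUE@CHECKSUM>"))   # second pass
```

For two lists that are members of each other, each would hold its own such record, so a message bounces between them at most once in each direction.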
Jacob_Palme_QZ%QZCOM.MAILNET@MIT-MULTICS.ARPA (03/12/85)
A further reason why it is necessary to create a Message-ID for a message which does not have any: Checksums are necessary in order to preserve in-reply-to relations between pairs of messages which may be transmitted between hosts via different routes.

Example: Host A sends out a message M1 to hosts B and C. At host B, a reply M2 is written and sent to host C. In order for host C to be able to recognize that M2 is a reply to M1, B and C must independently generate the same Message-ID for M1.

Question: Why did I not include the value of the in-reply-to field in the creation of the checksum?

Answer: Because some systems may allow the addition of an in-reply-to field after the creation of a message. (At least we do. We sometimes get messages which clearly are replies to other messages, but which do not have in-reply-to fields. I sometimes then add an in-reply-to reference to the incoming message, since sorting the message data base through in-reply-to relations makes it easier to use.)
gnu@sun.uucp (John Gilmore) (03/13/85)
> Suppose one and the same message gets forwarded, directly or
> indirectly, to two different mailing lists, and that a certain
> user is a member of both lists. If the intermediate hosts handling
> the mailing list created a checksum, this could be used by the
> host for the recipient user to stop displaying the same message
> twice, or, to tell him that it is the same message which he gets
> twice.

The user's mail reader can do the comparison itself without using checksums and without trying to make some kind of network standard out of it.

> If two mailing lists are members of each other (which has advantages,
> but cannot be done with present practices on Arpanet because of the
> risk for loops) then if the list maintaining program kept a list
> of the ID-s of messages sent via the list (COM does this) then
> it could stop re-sending the message when it comes around the
> second time.

This is exactly the situation with Arpanet mailing lists gatewayed to&from Usenet newsgroups. I don't know the algorithm, but it clearly works, without requiring message-ids -- and even works on digests, where this method would fall down.

> Only using TEXT CONTENT is NOT acceptable. Suppose in a voting
> application that two people wrote messages with the only content
> being the word "Yes!". Only using TEXT CONTENT would hide the very
> important fact of the names of the people who voted "Yes!". Not
> using Date/time is also not acceptable, suppose the same person
> voted "Yes!" on two different issues, this fact would then be
> hidden.

Anybody who casts a vote on an issue by sending a message consisting solely of the word "Yes" deserves whatever random thing they vote for. The subject line will distinguish, for real-world messages.

If we ever come up with a naming scheme that everybody can implement, and routing to match, then there won't be a problem with munged From: addresses. For now, and for the foreseeable future, there is.

----

Actually, the whole idea of intermediate nodes throwing messages away because they happen to checksum like a message that came thru the preceding day strikes me as a really bad idea. You'd better compare the "envelope" as well as the message, since a message can go thru a host several times as aliases are resolved on its way to a final destination. I'd argue for comparing the entire message content, which of course defeats the whole purpose, rather than throw away a message. We lose enough already.
laura@utzoo.UUCP (Laura Creighton) (03/19/85)
Throwing away messages because they might have been seen before is a poor idea. I can think of lots of cases where it is not what I want done with my mail. For instance, if I send mail to someone and for some reason they don't get it, or their disk crashes, and they ask me to resend, I don't want any smart router to not send it because it has been seen before. Likewise, if I always have a meeting on Wednesday, and I always send myself mail on all the machines I work on each Tuesday night, saying ``meeting at 4, stop hacking at 3:30'' I want to keep getting those messages!

The actual proposed use of this ``feature'' (being on 2 lists where the same article is posted twice) is another instance where I *especially* want to see both articles. There are days I don't have time for large volume newsgroups or mailing lists, and will axe all messages in that group without reflection. However, something that was posted to both a large volume list and a smaller one is likely to still be of interest to me.

Are you sure that you really have a problem that you are trying to fix?

Laura Creighton
utzoo!laura
ihnp4!utzoo!laura@seismo
jsdy@hadron.UUCP (Joseph S. D. Yao) (03/25/85)
> This is exactly the situation with Arpanet mailing lists gatewayed
> to&from Usenet newsgroups. I don't know the algorithm, but it clearly
> works, without requiring message-ids -- and even works on digests,
> where this method would fall down.

I believe that Usenet newsgroups do use message ID's. Otherwise, what are
    <9102@brl-tgr.ARPA>
    <2043@sun.uucp>
that are References for this message?

Joe Yao hadron!jsdy@seismo.{ARPA,UUCP}
lepreau@utah-cs.UUCP (Jay Lepreau) (04/09/85)
The real problem is that current standards don't require Message-ID's. (How come? I would think the need was obvious way before 822.) Seems a hell of a lot easier just to make it a requirement than to substitute another, much more vague, expensive, and complicated one.