[net.mail.headers] Checksum as a replacement for missing Message-ID.

Jacob_Palme_QZ%QZCOM.MAILNET@MIT-MULTICS.ARPA (03/08/85)

Checksum as a replacement for missing Message-ID.
------------------------------------------------

The Message-ID is a very useful field for many purposes:

(a) To preserve In-reply-to references between transferred messages.

(b) To stop loops by not accepting the same message to the same
recipient more than once.

(c) To be able to identify that several copies of the same message
are the same, which will save disk space and provide better user
functionality in some systems.

The problem is that many messages do not have any Message-ID-s.

I am planning to modify the COM network mail interface to generate
Message-ID-s for messages which lack such ID-s. These generated ID-s
will be used internally in COM and will be affixed to a message if it
is sent out to the networks again, e.g. by a conference/mailing list
residing on a COM system.

The Message-ID should uniquely identify one message, so that all copies
of the same message will get the same Message-ID. Thus, if two systems
independently generate a Message-ID for a message, they should produce
the same ID. To achieve this goal, I suggest generating the
Message-ID as a checksum of the message.

If two systems independently generate a Message-ID for a message, they
should preferably produce the same ID. Thus, the ID should *not* refer
to the host name of the message system generating the ID, if this is not
the system where the message originated. Thus, I propose to generate
Message-ID-s of the format <CHECKSUM-VALUE@CHECKSUM> where the host
name in RFC822 is replaced by the word "CHECKSUM". This will tell
receiving systems that this is a CHECKSUM-ed ID, so that they can
identify it with other CHECKSUM-ed ID-s.

The alternative would be to produce ID-s in the format
<CHECKSUM-VALUE@ORIGINAL-HOST>. However, it does not seem nice to generate ID-s
purporting to come from a host which did not in reality generate this
ID.
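
A minimal Python sketch of the proposed ID format follows. The function
names and the exact-match test on the host part are assumptions made for
illustration; the post itself only specifies the <CHECKSUM-VALUE@CHECKSUM>
shape.

```python
CHECKSUM_HOST = "CHECKSUM"  # the fixed word that replaces the host name


def make_checksum_id(checksum_value):
    """Wrap a checksum value in the proposed Message-ID format."""
    return "<" + checksum_value + "@" + CHECKSUM_HOST + ">"


def is_checksum_id(message_id):
    """Return True if the ID looks like a checksum-generated one."""
    text = message_id.strip()
    if not (text.startswith("<") and text.endswith(">")):
        return False
    local, sep, host = text[1:-1].rpartition("@")
    # Assumption: the host part must match "CHECKSUM" exactly.
    return sep == "@" and local != "" and host == CHECKSUM_HOST


print(make_checksum_id("0123456789ABCDE"))
print(is_checksum_id("<2043@sun.uucp>"))
```

A receiving system could use a test like is_checksum_id to decide whether
two ID-s may be compared as checksums.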

Selection of CHECKSUM algorithm:
-------------------------------

The algorithm should uniquely identify a message with very low
probability of different messages getting the same ID. On the other
hand, the checksum should not change for common modifications to a
message, like additions of new recipients in the RFC822 header,
different line foldings or conversions of TAB-s to SPACE-s.

The following algorithm is proposed:

The CHECKSUM contains 15 characters, in three groups of five
characters. The first group is computed from the name in the FROM
field, the second group from the value in the DATE field, the third
group is computed from the textual contents of the message.

Each group should have a checksum algorithm suitable for that group.

For the FROM field, I suggest the following:

(a) Use only the value of the "addr-spec" part of the FROM field
(delete the "phrase" part and the <>-s, if any).

(b) Upcase the characters A-Z before checksum computation.

(c) Only include characters A-Z and digits 0-9 in checksum computation.

(d) Compute the checksum by summation of the characters, with the
weight 1 for the first character, 2 for the second, 4 for the third
etc. up to 2^15 for the sixteenth character, then 1 for the seventeenth
etc.

(e) Take the remainder of the checksum modulo 2^24. Translate this
remainder to five characters in a 32-based number system with the
digits 0...9, A..V.

Rationale: This checksum should be easy to compute on any computer
with 32-bit word integer arithmetic.
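
Steps (a) to (e) can be sketched in Python as below. This is an
illustration under stated assumptions, not a reference implementation:
characters are summed by their ASCII codes, the weights are assumed to
cycle through 2^0 ... 2^15 with a period of sixteen characters, and the
addr-spec is extracted by a simple search for <...> rather than by a full
RFC822 parser. All function names are hypothetical.

```python
import re

DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # base-32 digits 0..9, A..V


def addr_spec(from_value):
    # (a) Keep only the part inside <...>, if present (simplified parsing).
    match = re.search(r"<([^<>]*)>", from_value)
    return match.group(1) if match else from_value.strip()


def weighted_sum(chars):
    # (d) Weights 1, 2, 4, ... assumed to repeat every sixteen characters.
    total = 0
    for index, ch in enumerate(chars):
        total += ord(ch) << (index % 16)
    return total


def to_group(value):
    # (e) Reduce modulo 2^24, then write five base-32 digits.
    value %= 2 ** 24
    digits = []
    for _ in range(5):
        digits.append(DIGITS[value % 32])
        value //= 32
    return "".join(reversed(digits))


def from_group(from_value):
    kept = []
    for ch in addr_spec(from_value):
        if "a" <= ch <= "z":
            ch = ch.upper()          # (b) upcase letters
        if "A" <= ch <= "Z" or "0" <= ch <= "9":
            kept.append(ch)          # (c) keep only A-Z and 0-9
    return to_group(weighted_sum(kept))


print(from_group('William "Chops" Westfield <BillW@SU-SCORE.ARPA>'))
print(from_group("billw@su-score.arpa"))
```

Both calls print the same group, since the phrase part, letter case and
punctuation are all discarded before the sum is taken.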

For the DATE field, I suggest as checksum the number
(((YEAR+SECOND+MONTH)*31+DAY)*24+HOUR)*60+MINUTE
This number is again translated to a five character string as
described in (e) above.
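
The date formula can be tried out with the short Python sketch below. The
post does not say whether YEAR is two or four digits or where MONTH
starts, so this sketch assumes a two-digit year and months numbered from
1; it also assumes the fields have already been parsed out of the header.
The function names are hypothetical.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # base-32 digits 0..9, A..V


def date_number(year, month, day, hour, minute, second=0):
    # Formula as given in the post; year assumed two-digit, month 1..12.
    return (((year + second + month) * 31 + day) * 24 + hour) * 60 + minute


def to_group(value):
    # Step (e): modulo 2^24, then five base-32 digits.
    value %= 2 ** 24
    digits = []
    for _ in range(5):
        digits.append(DIGITS[value % 32])
        value //= 32
    return "".join(reversed(digits))


def date_group(year, month, day, hour, minute, second=0):
    return to_group(date_number(year, month, day, hour, minute, second))


# 25 Feb 85 15:01, no seconds given
print(date_number(85, 2, 25, 15, 1))
print(date_group(85, 2, 25, 15, 1))
```

With a four-digit year the number can exceed 2^24, in which case the
modulo in step (e) takes effect.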

For the contents of the message, all non-printable characters, including
tab and space, should be disregarded when computing the checksum. The
checksum is computed using the algorithm in stage (d) and (e) described
above (but not stages a-b-c). Rationale: Disregarding all non-printable
characters, including tab and space, is necessary to ensure that line
folding will not change the checksum.
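
A sketch of the body checksum in Python follows. It assumes that
"printable" means ASCII codes 33 to 126, that characters are summed by
their ASCII codes, and that the weights cycle with a period of sixteen
characters; the post does not pin these points down. The function names
are hypothetical.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # base-32 digits 0..9, A..V


def weighted_sum(chars):
    # Stage (d): weights 1, 2, 4, ... assumed to repeat every 16 characters.
    total = 0
    for index, ch in enumerate(chars):
        total += ord(ch) << (index % 16)
    return total


def to_group(value):
    # Stage (e): modulo 2^24, then five base-32 digits.
    value %= 2 ** 24
    digits = []
    for _ in range(5):
        digits.append(DIGITS[value % 32])
        value //= 32
    return "".join(reversed(digits))


def body_group(text):
    # Drop space, tab, line breaks and anything else outside 33..126.
    kept = [ch for ch in text if 33 <= ord(ch) <= 126]
    return to_group(weighted_sum(kept))


original = "The meeting is moved to Friday.\n"
refolded = "The meeting is\n\tmoved to Friday.\n"
print(body_group(original) == body_group(refolded))
```

Because white space never enters the sum, refolding a line or turning a
TAB into SPACE-s leaves the group unchanged.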

Torgny_Tholerus_QZ%QZCOM.MAILNET@MIT-MULTICS.ARPA (03/08/85)

One problem I can see, is the DATE field.  This field might look
like:
                13 Apr 83 17:43 BST.
or:
                Wed Feb 20 16:07:15 1985

How do I find the year field or the month field?  Perhaps
the best thing is to treat the DATE field just like a string,
in the same way as the FROM field is treated.
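
To illustrate the difficulty, here is a Python sketch that extracts the
fields from exactly the two layouts shown above by trying one format
string per layout. The format list, the rule for dropping a trailing zone
name, and the function name are assumptions for illustration; any layout
not listed is rejected.

```python
from datetime import datetime

# One strptime format per known layout (month and day names assume the
# default "C" locale).
FORMATS = ("%d %b %y %H:%M", "%a %b %d %H:%M:%S %Y")


def parse_date_fields(value):
    text = value.strip().rstrip(".")
    candidates = [text]
    head, _, tail = text.rpartition(" ")
    if head and tail.isalpha():
        candidates.append(head)  # same text without a trailing zone name
    for candidate in candidates:
        for fmt in FORMATS:
            try:
                parsed = datetime.strptime(candidate, fmt)
            except ValueError:
                continue
            return (parsed.year, parsed.month, parsed.day,
                    parsed.hour, parsed.minute, parsed.second)
    raise ValueError("unrecognized date layout: " + value)


print(parse_date_fields("13 Apr 83 17:43 BST."))
print(parse_date_fields("Wed Feb 20 16:07:15 1985"))
```

Every further layout found in the wild would need its own entry in the
list.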

Craig.Everhart@CMU-CS-A.ARPA (03/08/85)

Why use the From: or Date: fields at all?  The From: field is a popular
candidate for editing by automatic agents; I'm not convinced that Mr. Palme's
algorithm will remove all traces of that editing.  The Date:-field
algorithm was underspecified (year in century, as SMTP would have?  What
is the origin for months?  Any use of time zone information?).

For that matter, I'm not sure that all agents would agree on the concept
of ``printing character'' in the body of the message.

Why not use an algorithm based solely on the body of the message?  It can
ignore characters outside the range [33,126] (decimal, inclusive); obviously
it would only count characters in that range when incrementing the checksum
counter.

It may be less expensive to use something other than multiplication as a basis
for the checksum on many small machines.  Are there suitable algorithms based
on bit rotations or shifts?

And perhaps the whole discussion should be moved into an RFC.

Jacob_Palme_QZ%QZCOM.MAILNET@MIT-MULTICS.ARPA (03/09/85)

The reason I chose not to treat the date field as a string,
is that the checksum should not change if the date field is
translated from one format to another, e.g. from the format
in RFC822 to the format in X.400.

This of course requires that the date field is parseable. But
every single message should, when you get it, have a header
in one given standard and thus date fields not belonging to this
standard should not occur.

One problem is the time zone. Since the format of this seems
to be often wrong, I did not include it in the checksum. This
means that if the time is translated to another time zone,
e.g. if the date "25 Feb 85 15:01 EST" is translated by some
agent handling the message into "25 Feb 85 12:01 PST" then the
algorithm will create a different checksum.
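
The effect can be checked with a few lines of Python. The sketch assumes
a two-digit year, months numbered from 1 and no seconds in the header;
the function name is hypothetical.

```python
def date_number(year, month, day, hour, minute, second=0):
    # Date formula from the original proposal; the time zone is ignored.
    return (((year + second + month) * 31 + day) * 24 + hour) * 60 + minute


est = date_number(85, 2, 25, 15, 1)  # 25 Feb 85 15:01 EST
pst = date_number(85, 2, 25, 12, 1)  # 25 Feb 85 12:01 PST, the same instant

print(est, pst, est - pst)
```

The two numbers differ by 180, the three hours expressed in minutes, so
the date group of the ID would change.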

Question: Would such translations of dates by message handling
agents be expected to occur?

Jacob_Palme_QZ%QZCOM.MAILNET@MIT-MULTICS.ARPA (03/09/85)

     FROM: William "Chops" Westfield <BillW@SU-SCORE.ARPA>

     Why is it necessary for two hosts trying to create a
     MESSAGE-ID to come up with the same result? I don't understand
     why anyone but the original host would try to create a message
     id...

If the original host always created Message-Id-s, this would be a
better solution. We will however have to accept the fact that some
hosts do not create globally unique Message-ID-s. Unfortunately,
neither RFC822 nor X.400 makes globally unique Message-ID-s
(IPMessageID-s in X.400 terminology) mandatory.

Suppose one and the same message gets forwarded, directly or
indirectly, to two different mailing lists, and that a certain
user is a member of both lists. If the intermediate hosts handling
the mailing list created a checksum, this could be used by the
host for the recipient user to stop displaying the same message
twice, or, to tell him that it is the same message which he gets
twice.

Why should the intermediate host add a Message-ID to a message
lacking such an ID? Because the ID is very useful for loop control.
If two mailing lists are members of each other (which has advantages,
but cannot be done with present practices on Arpanet because of the
risk for loops) then if the list maintaining program kept a list
of the ID-s of messages sent via the list (COM does this) then
it could stop re-sending the message when it comes around the
second time.
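
The loop control described here can be sketched in Python as follows.
The class, the in-memory set of ID-s and the little simulation are
assumptions for illustration; they do not describe how COM actually
stores its list.

```python
class MailingList:
    """A list that remembers the ID-s of messages it has already sent."""

    def __init__(self, name):
        self.name = name
        self.seen_ids = set()

    def should_forward(self, message_id):
        if message_id in self.seen_ids:
            return False          # second arrival: stop the loop here
        self.seen_ids.add(message_id)
        return True


def circulate(lists, message_id, max_hops=10):
    """Pass one message round a ring of lists; count the forwards."""
    forwards = 0
    for hop in range(max_hops):
        current = lists[hop % len(lists)]
        if not current.should_forward(message_id):
            break
        forwards += 1
    return forwards


list_a = MailingList("list-a")
list_b = MailingList("list-b")
print(circulate([list_a, list_b], "<0123456789ABCDE@CHECKSUM>"))
```

Each list forwards the message once; when it returns to the first list
its ID is already recorded and the loop ends.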

     FROM: Craig.Everhart@CMU-CS-A.ARPA
     Why use the From: or Date: fields at all? The From: field is
     a popular candidate for editing by automatic agents; I'm not
     convinced that Mr. Palme's algorithm will remove all traces
     of that editing. The Date:-field algorithm was underspecified
     (year in century, as SMTP would have? What is the origin for
     months? Any use of time zone information?).

The goal of course is to find an algorithm with a very very low
probability of getting the same ID for two different messages, but
also with low probability of giving different ID-s for the same
message because of some transformation on the message.

Only using TEXT CONTENT is NOT acceptable. Suppose in a voting
application that two people wrote messages with the only content
being the word "Yes!". Only using TEXT CONTENT would hide the very
important fact of the names of the people who voted "Yes!". Not
using Date/time is also not acceptable, suppose the same person
voted "Yes!" on two different issues, this fact would then be
hidden.
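
The voting case can be illustrated with a short Python sketch. It assumes
ASCII codes for the summation, a weight cycle of sixteen characters and
printable meaning codes 33 to 126; the two voter addresses and all
function names are hypothetical.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # base-32 digits 0..9, A..V
ALNUM = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")


def to_group(value):
    value %= 2 ** 24
    digits = []
    for _ in range(5):
        digits.append(DIGITS[value % 32])
        value //= 32
    return "".join(reversed(digits))


def weighted_sum(chars):
    return sum(ord(ch) << (index % 16) for index, ch in enumerate(chars))


def body_group(text):
    return to_group(weighted_sum([c for c in text if 33 <= ord(c) <= 126]))


def from_group(addr):
    # Assumes a plain ASCII addr-spec with no phrase part.
    return to_group(weighted_sum([c for c in addr.upper() if c in ALNUM]))


votes = [("alice@a", "Yes!"), ("bob@b", "Yes!")]
for sender, body in votes:
    print(sender, body_group(body), from_group(sender) + body_group(body))
```

The body group is identical for both votes; only the FROM group keeps the
two messages apart.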

     FROM: Craig.Everhart@CMU-CS-A.ARPA
     It may be less expensive to use something other than
     multiplication as a basis for the checksum on many small
     machines. Are there suitable algorithms based on bit
     rotations or shifts?

Most of the multiplications in my algorithm (all of those in
processing the body of the message) were by powers of 2
and thus can be implemented by shifts.
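
As a sketch of that point, the two Python functions below compute the
weighted sum once with multiplication and once with shifts and a mask.
The sixteen-character weight cycle and the function names are
assumptions; the input is taken as a byte string so that each element is
already a character code.

```python
def weighted_sum_mul(codes):
    # Weight 2^(i mod 16) applied by multiplication, reduced modulo 2^24.
    return sum(code * 2 ** (i % 16) for i, code in enumerate(codes)) % 2 ** 24


def weighted_sum_shift(codes):
    # The same sum using only shift, add and mask. Masking after every
    # step keeps the running total below 2^24.
    total = 0
    for i, code in enumerate(codes):
        total = (total + (code << (i & 15))) & 0xFFFFFF
    return total


sample = b"Checksumasareplacementformissing" * 40
print(weighted_sum_mul(sample) == weighted_sum_shift(sample))
```

Since no term exceeds 255 * 2^15, the masked version never needs more
than 32 bits.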

Jacob_Palme_QZ%QZCOM.MAILNET@MIT-MULTICS.ARPA (03/12/85)

A further reason why it is necessary to create a Message-ID for a
message which does not have any: Checksums are necessary in order
to preserve in-reply-to-relations between pairs of messages which
may be transmitted between hosts via different routes.

Example: Host A sends out a message M1 to hosts B and C. At host
B, a reply M2 is written and sent to host C. In order for host C
to be able to recognize that M2 is a reply to M1, B and C must
independently generate the same Message-ID on M1.
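
The Python sketch below puts the three groups together and shows two
hosts deriving the same ID from copies of one message that were folded
differently on the way. It assumes ASCII codes, a sixteen-character
weight cycle, a two-digit year, months from 1, and that the addr-spec and
date fields have already been extracted; the address, the message text
and all names are hypothetical.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # base-32 digits 0..9, A..V
ALNUM = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")


def group(value):
    value %= 2 ** 24
    out = ""
    for _ in range(5):
        out = DIGITS[value % 32] + out
        value //= 32
    return out


def weighted_sum(chars):
    return sum(ord(ch) << (i % 16) for i, ch in enumerate(chars))


def checksum_id(addr, date_fields, body):
    year, month, day, hour, minute, second = date_fields
    sender = [c for c in addr.upper() if c in ALNUM]      # FROM group input
    text = [c for c in body if 33 <= ord(c) <= 126]       # body group input
    date_number = (((year + second + month) * 31 + day) * 24 + hour) * 60 + minute
    value = group(weighted_sum(sender)) + group(date_number) + group(weighted_sum(text))
    return "<" + value + "@CHECKSUM>"


when = (85, 3, 12, 9, 30, 0)
# Two copies of M1 as seen at two different hosts.
first_copy = checksum_id("jane@hosta", when, "Shall we meet on Friday?\n")
second_copy = checksum_id("Jane@HostA", when, "Shall we meet\n\ton Friday?\n")

# The reply M2 carries the ID computed where it was written.
in_reply_to = first_copy
print(in_reply_to == second_copy)
```

The host receiving M2 can match its In-reply-to value against the ID it
computed for its own copy of M1.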

Question: Why did I not include the value of the in-reply-to field
in the creation of the checksum? Answer: Because some systems may
allow the addition of an in-reply-to field after the creation of
a message. (At least we do. We sometimes get messages which clearly
are replies to other messages, but which do not have in-reply-to
fields. I sometimes then add an in-reply-to reference to the
incoming message, since the sorting of the message data base through
in-reply-to-relations makes it easier to use.)

gnu@sun.uucp (John Gilmore) (03/13/85)

> Suppose one and the same message gets forwarded, directly or
> indirectly, to two different mailing lists, and that a certain
> user is a member of both lists. If the intermediate hosts handling
> the mailing list created a checksum, this could be used by the
> host for the recipient user to stop displaying the same message
> twice, or, to tell him that it is the same message which he gets
> twice.

The user's mail reader can do the comparison itself without using
checksums and without trying to make some kind of network standard out of it.

> If two mailing lists are members of each other (which has advantages,
> but cannot be done with present practices on Arpanet because of the
> risk for loops) then if the list maintaining program kept a list
> of the ID-s of messages sent via the list (COM does this) then
> it could stop re-sending the message when it comes around the
> second time.

This is exactly the situation with Arpanet mailing lists gatewayed
to&from Usenet newsgroups.  I don't know the algorithm, but it clearly
works, without requiring message-ids -- and even works on digests,
where this method would fall down.
 
> Only using TEXT CONTENT is NOT acceptable. Suppose in a voting
> application that two people wrote messages with the only content
> being the word "Yes!". Only using TEXT CONTENT would hide the very
> important fact of the names of the people who voted "Yes!". Not
> using Date/time is also not acceptable, suppose the same person
> voted "Yes!" on two different issues, this fact would then be
> hidden.

Anybody who casts a vote on an issue by sending a message consisting
solely of the word "Yes" deserves whatever random thing they vote for.
The subject line will distinguish, for real-world messages.

If we ever come up with a naming scheme that everybody can implement,
and routing to match, then there won't be a problem with munged From:
addresses.  For now, and for the foreseeable future, there is.

----

Actually, the whole idea of intermediate nodes throwing messages away
because they happen to checksum like a message that came thru the
preceding day strikes me as a really bad idea.  You'd better compare
the "envelope" as well as the message, since a message can go thru a
host several times as aliases are resolved on its way to a final
destination.  I'd argue for comparing the entire message content,
which of course defeats the whole purpose, rather than throw away
a message.  We lose enough already.

laura@utzoo.UUCP (Laura Creighton) (03/19/85)

Throwing away messages because they might have been seen before is a
poor idea. I can think of lots of cases where it is not what I want
done with my mail. For instance, if I send mail to someone and for
some reason they don't get it, or their disk crashes, and they ask me
to resend, I don't want any smart router to not send it because it
has been seen before. Likewise, if I always have a meeting on
Wednesday, and I always send myself mail on all the machines I
work on Tuesday night, saying ``meeting at 4, stop hacking at 3:30''
I want to keep getting those messages! The actual proposed use of
this ``feature'' (being on 2 lists where the same article is posted
twice) is another instance where I *especially* want to see both
articles. There are days I don't have time for large volume
newsgroups or mailing lists, and will axe all messages in that
group without reflection. However, something that was posted to both
a large volume list and a smaller one is likely to still be of
interest to me.

Are you sure that you really have a problem that you are trying to fix?

Laura Creighton
utzoo!laura
ihnp4!utzoo!laura@seismo

jsdy@hadron.UUCP (Joseph S. D. Yao) (03/25/85)

> This is exactly the situation with Arpanet mailing lists gatewayed
> to&from Usenet newsgroups.  I don't know the algorithm, but it clearly
> works, without requiring message-ids -- and even works on digests,
> where this method would fall down.

I believe that Usenet newsgroups  d o  use message ID's.  Otherwise,
what are <9102@brl-tgr.ARPA> <2043@sun.uucp> that are References for
this message?

	Joe Yao		hadron!jsdy@seismo.{ARPA,UUCP}

lepreau@utah-cs.UUCP (Jay Lepreau) (04/09/85)

The real problem is that current standards don't require Message-ID's.
(How come?  I would think the need was obvious way before 822.)
Seems a hell of a lot easier just to make it a requirement than to
substitute another, much more vague, expensive, and complicated one.