[news.software.b] Ideas for Message-ID's

rsalz@bbn.com (Rich Salz) (03/22/91)

I know there are several Message-ID formats out there:
	B news sequence-number style.
	C News verbose date style.
	Various radix-64 compressions of the above.
How about this one
	<yydddss.pppp@host>
where
	yy	Last two digits of the year
	ddd	The day of the year, 000-365.
	ss	The current number of seconds, 00-59.
	pppp	The Process ID (not fixed format).
	host	The hostname (not fixed format).
For example
    9203212.1856@papaya.bbn.com
This is 27 bytes long.  The host-part is invariant, and the unique-part is
only 12 bytes, but it will vary by a couple depending on the pid.

Obviously, the only time this will have a problem is if the same process
submits two articles within the same second.  I don't think that's likely
to happen.
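
A minimal sketch of how the layout above could be assembled in C. The
name format_id and its parameters are hypothetical, not from any news
software; the time, PID, and host are passed in so the result is
reproducible.

```c
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper: build <yydddss.pppp@host> from a broken-down time,
 * a process ID, and a host name.  Returns whatever snprintf() returns.
 */
int
format_id(char *buf, size_t size, const struct tm *tm, long pid,
          const char *host)
{
    return snprintf(buf, size, "<%02d%03d%02d.%ld@%s>",
                    tm->tm_year % 100,  /* yy:  tm_year counts from 1900 */
                    tm->tm_yday,        /* ddd: day of the year, 000-365 */
                    tm->tm_sec,         /* ss:  seconds within the minute */
                    pid, host);
}
```

A caller would typically pass gmtime(&now), (long)getpid(), and the
local host name.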

Comments?
	/r$
-- 
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.
Use a domain-based address or give alternate paths, or you may lose out.

francis@wolfman.cis.ohio-state.edu (RD Francis) (03/22/91)

In article <3427@litchi.bbn.com> rsalz@bbn.com (Rich Salz) writes:
   How about this one
	   <yydddss.pppp@host>
   where
	   yy	Last two digits of the year
	   ddd	The day of the year, 000-365.
	   ss	The current number of seconds, 00-59.
	   pppp	The Process ID (not fixed format).
	   host	The hostname (not fixed format).
   For example
OK, let's risk making a real fool of myself; as far as I have been
able to tell in the past, when a machine is rebooted, it starts
counting processes from 1 (or whatever) again.  Imagine a machine
that goes down and comes back up twice in one day, and it'd be easy to
imagine a situation where someone could *possibly* hit a conflict,
unlikely as it is.

Am I exposing my relative ignorance of the guts of Unix here?
--
R David Francis   francis@cis.ohio-state.edu

wisner@ims.alaska.edu (Bill Wisner) (03/22/91)

>	   <yydddss.pppp@host>
>   where
>	   yy	Last two digits of the year
>	   ddd	The day of the year, 000-365.
>	   ss	The current number of seconds, 00-59.
>	   pppp	The Process ID (not fixed format).
>	   host	The hostname (not fixed format).

I'd make the year four digits, to guarantee uniqueness (at least for
the next eight thousand years).  Also, this scheme has a big hole you
could pilot a B-2 through.  I've used machines that were used heavily
enough to cycle through all 30,000 PIDs several times in a day.  On such
a machine, it's almost inevitable that sooner or later a process will
manage to repeat a PID/seconds combination.

Bill Wisner <wisner@ims.alaska.edu> Gryphon Gang Fairbanks AK 99775
"If you have a problem with one of my users, take it to me, and if
I need to kill them, I will." -- Eliot Lear <lear@turbo.bio.net>

tar@math.ksu.edu (Tim Ramsey) (03/22/91)

rsalz@bbn.com (Rich Salz) writes:

>How about this one
>	<yydddss.pppp@host>

How about:

   <tttttttt.pppp@host>

   where:    tttttttt is the return value of time(2) in base-16
     and:    pppp is the process id

At the time I posted this, t == 27e988be.  That's only 8 characters.
If you went to a larger radix this would be smaller.
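
A minimal sketch of this variant, assuming the timestamp fits in an
unsigned long. The name format_hex_id is hypothetical, and the PID used
in any call is up to the caller.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: <tttttttt.pppp@host>, with time(2) printed in hex. */
int
format_hex_id(char *buf, size_t size, time_t t, long pid, const char *host)
{
    return snprintf(buf, size, "<%lx.%ld@%s>", (unsigned long)t, pid, host);
}
```

With t == 0x27e988be the time part comes out as the eight characters
quoted above.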

--
Tim Ramsey (tar@math.ksu.edu)  (913) 532-6750 (voice)  (913) 532-7004 (FAX)
Department of Mathematics, Kansas State University, Manhattan KS 66506-2602

brad@looking.on.ca (Brad Templeton) (03/22/91)

Is there any reason for the message-id to be readable?   The date is
elsewhere in the message, and indeed elsewhere in the history file to
some extent.

I say make it as small as you can, either the sequence number, which
is smallest, or a radix 85 (or however many safe characters there are in
message-ids) encoding of the minute and process-id, with epoch when
you started your site up.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

billd@fps.com (Bill Davidson) (03/22/91)

In article <3427@litchi.bbn.com> rsalz@bbn.com (Rich Salz) writes:
>How about this one
>	<yydddss.pppp@host>
>where
>	yy	Last two digits of the year
>	ddd	The day of the year, 000-365.
>	ss	The current number of seconds, 00-59.
>	pppp	The Process ID (not fixed format).
>	host	The hostname (not fixed format).
>For example
>    9203212.1856@papaya.bbn.com
>This is 27 bytes long.  The host-part is invariant, and the unique-part is
>only 12 bytes, but it will vary by a couple depending on the pid.
>
>Obviously, the only time this will have a problem is if the same process
>submits two articles within the same second.  I don't think that's likely
>to happen.

The same PID can occur with two different processes during the same day
due to turn-over.  This is quite common on fast machines that run with
dozens of users.  Two posts could easily occur during the same second
of their given minute (obviously they are probably hours apart).  Sure
it's unlikely but with as many machines as are on the net do you really
want to take the gamble?  If enough machines run with this scheme for
enough hours, it will break.

Maybe something more along the lines of hhmmss.  It adds four more
chars but still guarantees uniqueness.  It also makes the Message-ID
hard to predict for sendme message abusers (one of the goals of the
Cnews style).

Alternatively, the number of seconds that have occurred in that day
since midnight could be used.  This would only add three more chars
since there are only 86400 seconds in a day.  You could put it in
hex and do the number of seconds since 12am Jan 1 and only add 2
chars.
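
The two counts described here can be sketched in C as follows. The
function names are made up for illustration, and leap seconds are
ignored.

```c
#include <time.h>

/* Seconds elapsed since midnight: 0..86399, at most five decimal digits. */
long
sec_of_day(const struct tm *tm)
{
    return tm->tm_hour * 3600L + tm->tm_min * 60L + tm->tm_sec;
}

/*
 * Seconds elapsed since 00:00 on January 1.  Always below
 * 366 * 86400 = 31622400 = 0x1e28500, so at most seven hex digits.
 */
long
sec_of_year(const struct tm *tm)
{
    return tm->tm_yday * 86400L + sec_of_day(tm);
}
```

Seven hex digits replace the five characters of ddd plus ss, which is
where the "only add 2 chars" figure comes from.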

I guess I'm starting to run amuck.  Sorry.

--Bill
-- 
*ANOTHER* dumb move! -- Dick Spanner, Private Investigator

rsalz@bbn.com (Rich Salz) (03/22/91)

I got email pointing out that if the PID wraps around in less than 24
hours, and the new process posts within the same second, there will be a
conflict.  Hmm...  I don't think this is likely unless the machine crashes
a lot.  (Maybe it *IS* that likely. :-)

At any rate, here's what I'm going to do now:
	<ddd.sss.ppp@fqdn>
where
	ddd	The day of the year, in radix 64 (not fixed width)
	sss	The second within the day, in radix 64 (not fixed width)
	ppp	The process ID, in radix 64 (not fixed width)
	fqdn	The fully-qualified domain name

Here's some sample code:

/*
**  Test program to generate Message-ID's.
**  The ID includes the day number, the second within the day, and the current
**  process ID.  To conserve space, they are encoded into radix-64 strings,
**  using [0-9a-zA-Z.+] to represent 0..63.  Assumes 32-bit longs.
**
**  Rich $alz <rsalz@bbn.com>, 22-March-1991.
*/
#include <stdio.h>
#include <sys/types.h>
#include <time.h>


/* ALPHABET[i] is the character for digit value i, so '0' really means 0. */
static char	ALPHABET[] =
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.+";

extern char	*strchr();


/*
**  Turn a number into a Radix-64 string.
*/
void
Radix64(l, buff)
    register unsigned long	l;
    register char		*buff;
{
    register char		*p;
    register int		i;
    char			temp[20];

    /* Simple sanity checks. */
    l &= 0xFFFFFFFF;
    if (l == 0) {
	*buff++ = '0';
	*buff = '\0';
	return;
    }

    /* Format the string, in reverse. */
    for (p = temp; l; l >>= 6)
	*p++ = ALPHABET[(int)(l & 077)];

    /* Reverse it. */
    for (i = p - temp; --i >= 0; )
	*buff++ = *--p;
    *buff = '\0';
}


/*
**  Decode and print a radix-64 string as a number.
*/
Decode64(what, l, p)
    char	*what;
    long	l;
    char	*p;
{
    long	l2;
    char	*cp;

    printf("%s:  %ld = %s = ", what, l, p);
    for (l2 = 0; *p; p++) {
	if ((cp = strchr(ALPHABET, *p)) == NULL) {
	    printf("-->Invalid char %c\n", *p);
	    return;
	}
	l2 = (l2 << 6) + (cp - ALPHABET);
    }
    printf("%ld\n", l2);
}


/*
**  Stub routine to get the fully-qualified domain name of this host.
*/
char *
GetFQDN()
{
    static char buff[256];

    gethostname(buff, sizeof buff);
    return buff;
}


main()
{
    struct tm		*gmt;
    time_t		now;
    char		day64[20];
    char		pid64[20];
    char		sec64[20];
    unsigned long	day;
    unsigned long	sec;
    unsigned long	pid;

    (void)time(&now);
    gmt = gmtime(&now);
    day = gmt->tm_year * 1000L + gmt->tm_yday;
    sec = gmt->tm_hour * 3600L + gmt->tm_min * 60L + gmt->tm_sec;
    pid = getpid();
    Radix64(day, day64);
    Radix64(sec, sec64);
    Radix64(pid, pid64);
    printf("<%s.%s.%s@%s>\n", day64, sec64, pid64, GetFQDN());

    Decode64("day", day, day64);
    Decode64("sec", sec, sec64);
    Decode64("pid", pid, pid64);
}
-- 
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.
Use a domain-based address or give alternate paths, or you may lose out.

rees@pisa.citi.umich.edu (Jim Rees) (03/22/91)

There are lots of times when you want a unique identifier.  NFS file
handles, user/group identifiers, IPC port ids, and so on.  Some operating
systems provide a way to get an opaque bag-of-bits that is unique for all
time, for any application that needs it.

The OS that I use has such a feature, so I use it to generate message ids.
They contain a time stamp and a cpu serial number.

I'm not sure why Mach doesn't have unique ids (uids), as many older CMU OSs
had them, as did Eden, which was partly CMU inspired (Guy Almes was from
CMU).  Uids may be just another Multics-era idea that got lost in the quest
for "simplicity" (is Unix still simpler than Multics?)

henry@zoo.toronto.edu (Henry Spencer) (03/23/91)

In article <1991Mar22.044131.3764@maverick.ksu.ksu.edu> tar@math.ksu.edu (Tim Ramsey) writes:
>   <tttttttt.pppp@host>
>
>   where:    tttttttt is the return value of time(2) in base-16
>     and:    pppp is the process id
>...If you went to a larger radix this would be smaller.

This is what's planned for C News, in fact, with a carefully-chosen alphabet.
(You can't be too ambitious with the alphabet if you want to get it past all
the broken systems, e.g. B2.11 which does completely case-insensitive message-
ID matching, but you can do better than hex.)
-- 
"[Some people] positively *wish* to     | Henry Spencer @ U of Toronto Zoology
believe ill of the modern world."-R.Peto|  henry@zoo.toronto.edu  utzoo!henry

tale@rpi.edu (David C Lawrence) (03/23/91)

As I am discovering now, articles are silently failing to be delivered
to some (lots?) of sites because of some unspecified quality of my
message-ids.  I even trimmed the set by several characters, though
they were always all valid RFC-822 id characters.  This is a problem
for me because of news.announce.newgroups, but I want to know just
what it is that is causing the problem before I hack into it and
change it to just use base 37 ([a-z0-9.]).  As Henry has already
pointed out, your base 64 is going to have a problem because of older
B News sites, of which there are enough to be annoying --- they do
case-insensitive id handling.
--
    (setq mail '("tale@rpi.edu" "uupsi!rpi!tale" "tale@rpitsmts.bitnet"))

palkovic@linac.fnal.gov (John Palkovic) (03/23/91)

>>>>> On 22 Mar 91 05:07:49 GMT, brad@looking.on.ca (Brad Templeton) said:
> Is there any reason for the message-id to be readable?  ...

Why should the headers be readable? The articles usually aren't. :-)

How about this little subroutine? Look in the header of this article
for an example. As long as pid's don't repeat in 60 sec it is fine.

 /*
  * The following was inspired by a program apparently written by Jon Zeeff
  * (zeeff@b-tech.ann-arbor.mi.us). Palkovic@linac.fnal.gov, 3/16/91.
  */

  /*
   * A string of some valid message id characters
   */

char string[] = "!#$%^&*_+|-=~`{}'?ABCDFGHJKLMNPQRSTVWXYZ1234567890";

#define size (sizeof(string) - 1)

void rand_id(s)
char *s;
{
    int getpid();
    long num;

    num = (time((long *) 0) - 658216800)/60;
    do {
	*s++ = string[num % size];
	num /= size;
    } while (num);

    num = (long) getpid();
    do {
	*s++ = string[num % size];
	num /= size;
    } while (num);
    *s = '\0';	/* terminate, so the caller can find the end of the ID */
}

wb8foz@mthvax.cs.miami.edu (David Lesher) (03/23/91)

(Brad Templeton) writes:

>Is there any reason for the message-id to be readable? 

I guess I'm the outcast. I *like* the cnews message-id's.

Machines that hoard old news, then suddenly dump them back on the net
with new dates seem, alas, to be a regular "feature" in recent years.
The message-id's are a good sanity check on the

	"Is this pointless argument STILL going on?"
feeling when this happens.
-- 
A host is a host from coast to coast.....wb8foz@mthvax.cs.miami.edu 
& no one will talk to a host that's close............(305) 255-RTFM
Unless the host (that isn't close)......................pob 570-335
is busy, hung or dead....................................33257-0335

peter@taronga.hackercorp.com (Peter da Silva) (03/23/91)

brad@looking.on.ca (Brad Templeton) writes:
> I say make it as small as you can, either the sequence number, which
> is smallest, or a radix 85 (or however many safe characters there are in
> message-ids) encoding of the minute and process-id, with epoch when
> you started your site up.

I use radix-36. The number of safe characters including punctuation really
doesn't save that much space... I'm not going to worry about an extra
byte or two.
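
A minimal sketch of a radix-36 encoder using only digits and lowercase
letters, which stays safe under case-insensitive matching. The name
encode36 and its interface are assumptions for illustration, not code
from any posting in this thread.

```c
#include <stddef.h>

/* Digits and lowercase letters only, so case-folding sites cannot
 * confuse two different IDs. */
static const char DIGITS36[] = "0123456789abcdefghijklmnopqrstuvwxyz";

/*
 * Hypothetical helper: write n in radix 36 into buf (size bytes).
 * Returns the number of digits written, or 0 if buf is too small.
 */
size_t
encode36(unsigned long n, char *buf, size_t size)
{
    char tmp[16];       /* 13 digits are enough for a 64-bit value */
    size_t len = 0, i;

    do {                /* least-significant digit first */
        tmp[len++] = DIGITS36[n % 36];
        n /= 36;
    } while (n != 0);

    if (len + 1 > size)
        return 0;
    for (i = 0; i < len; i++)
        buf[i] = tmp[len - 1 - i];  /* reverse into place */
    buf[len] = '\0';
    return len;
}
```

The output can be read back with strtoul(buf, NULL, 36).
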
-- 
               (peter@taronga.uucp.ferranti.com)
   `-_-'
    'U`

igb@fulcrum.bt.co.uk (Ian G Batten) (03/26/91)

In article <5084abb1.1bc5b@pisa.citi.umich.edu> rees@citi.umich.edu (Jim Rees) writes:
> CMU).  Uids may be just another Multics-era idea that got lost in the quest
> for "simplicity" (is Unix still simpler than Multics?)

I recall that Multics unique identifiers are only unique
per-installation and provided the clock is never reset.  They provided a
base-foo coding of the clock, which was ticking microseconds from 1900.  I
was told that there were some hairy interlocks on the clock so that in
multi-processor set-ups only one cpu could get a given clock value.

ian

fitz@wang.com (Tom Fitzgerald) (03/27/91)

brad@looking.on.ca (Brad Templeton) writes:
> I say make it as small as you can, either the sequence number, which
> is smallest, or a radix 85 (or however many safe characters there are in
> message-ids) encoding of the minute and process-id, with epoch when
> you started your site up.

The number of safe characters is way below 85 unfortunately.  But above
36 it really doesn't do you a lot of good.  If you want to crunch 31 bits
of timestamp and 15 bits of process ID into a string, it's easy to get a
10-character result (like you'll see in the message ID of this article)
and a pain to get any shorter than that.  Some points on the curve are:

for 31-bit date:
	6 characters, alphabet size must be  36  or greater
	5 characters, alphabet size must be  74  or greater
for 15-bit process ID:
	3 characters, alphabet size must be  32  or greater
	2 characters, alphabet size must be 182  or greater

So using lowercase letters and digits gives you a 10 character identifier
(with the separating dot).  It's impossible to get the alphabet size to 74
characters since some systems (early C news systems?  VMS systems running
ANU news?  Somebody...) require case-insensitive message IDs.

You could easily get rid of the dot separator by ALWAYS using 6+3
characters.  By treating the timestamp and process ID as a single 46-bit
number, things can get even smaller:

for 46-bit combined timestamp and process ID:
	9 characters, alphabet size must be  35  or greater
	8 characters, alphabet size must be  54  or greater

So for an 8-character identifier, the alphabet can be letters, digits and
18 random punctuation marks, which isn't too hard.
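
The points on the curve can be reproduced with a short search for the
smallest base b such that b^n >= 2^bits. This is a sketch with
hypothetical function names, using integer arithmetic only.

```c
/*
 * Does base^nchars reach at least target?  The division test stops
 * before any multiplication could overflow.
 */
static int
reaches(unsigned long long base, int nchars, unsigned long long target)
{
    unsigned long long p = 1;
    int i;

    for (i = 0; i < nchars; i++) {
        if (p > target / base)
            return 1;       /* p * base would exceed target */
        p *= base;
    }
    return p >= target;
}

/*
 * Smallest alphabet that packs `bits` bits into `nchars` characters.
 * Assumes 0 < bits < 64 and nchars > 0.
 */
unsigned
min_alphabet(int bits, int nchars)
{
    unsigned long long target = 1ULL << bits;
    unsigned base = 2;

    while (!reaches(base, nchars, target))
        base++;
    return base;
}
```

For example, min_alphabet(31, 6) gives 36 and min_alphabet(46, 8) gives
54, matching the figures above.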

All this assumes that the world will come to an end in January of 2038,
but we all understand that.

Do any systems use 16-bit process IDs?

---
Tom Fitzgerald   Wang Labs        fitz@wang.com
1-508-967-5278   Lowell MA, USA   ...!uunet!wang!fitz

irwin@uvmark.uucp (Frank Irwin) (03/28/91)

In article <b2x77c.9uo@wang.com> fitz@wang.com (Tom Fitzgerald) writes:
>brad@looking.on.ca (Brad Templeton) writes:
>
>The number of safe characters is way below 85 unfortunately.  But above
>36 it really doesn't do you a lot of good.  If you want to crunch 31 bits
>of timestamp and 15 bits of process ID into a string, it's easy to get a
                  ^^^^^^^
>
>Do any systems use 16-bit process IDs?

The IBM RS/6000 uses 31-bit (yup, thirty-one) process IDs.  You can always
use the process slot in the kernel, which is encoded into the PID, but that
still uses 17 bits.

-- 
====================================================================
Frank Irwin                   |  "I'll bet $50 on that flush."
Vmark Software, Inc.          |  Whooooosh!
 ..uunet!merk!uvmark!irwin    |  "Aaaaiiiieeee!  Not *that* flush!"

rees@pisa.citi.umich.edu (Jim Rees) (03/29/91)

In article <b2x77c.9uo@wang.com>, fitz@wang.com (Tom Fitzgerald) writes:

  The number of safe characters is way below 85 unfortunately.  But above
  36 it really doesn't do you a lot of good.  If you want to crunch 31 bits
  of timestamp and 15 bits of process ID into a string, it's easy to get a
  10-character result (like you'll see in the message ID of this article)
  and a pain to get any shorter than that.

Wait, I've got an idea.  How about numbering each article, starting with '1'
and going up.  You could keep a counter, say in a file in /usr/lib/news, and
increment it each time.  You could use decimal and not reach that 10
character limit until you posted 10 billion articles.

henry@zoo.toronto.edu (Henry Spencer) (03/29/91)

In article <50a41197.1bc5b@pisa.citi.umich.edu> rees@citi.umich.edu (Jim Rees) writes:
>Wait, I've got an idea.  How about numbering each article, starting with '1'
>and going up.  You could keep a counter, say in a file in /usr/lib/news, and
>increment it each time...

How do you coordinate simultaneous access to that file by multiple posters?
Across a network filesystem?  Across NFS (the thing that shambles like a
filesystem)?  What happens if it gets scrambled?  Shared databases are a
lot trickier than they look.  C News abandoned that approach deliberately.
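
For reference, a sketch of what the counter-file approach looks like on
a single machine with POSIX advisory locking. The name next_sequence is
hypothetical, and this is not how B News or C News does it. It assumes
a local filesystem where fcntl() locks work, so the objections about
network filesystems and scrambled files still stand.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: bump a decimal counter kept in `path` under an
 * exclusive lock.  Returns the new value, or -1 on error.
 */
long
next_sequence(const char *path)
{
    struct flock lk;
    char buf[32];
    ssize_t n;
    long seq = -1;
    int len;
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0)
        return -1;

    memset(&lk, 0, sizeof lk);
    lk.l_type = F_WRLCK;        /* exclusive lock ... */
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;               /* ... on the whole file */
    if (fcntl(fd, F_SETLKW, &lk) == 0) {    /* wait for other posters */
        n = read(fd, buf, sizeof buf - 1);
        if (n >= 0) {
            buf[n] = '\0';
            seq = strtol(buf, NULL, 10) + 1;    /* empty file counts as 0 */
            len = snprintf(buf, sizeof buf, "%ld\n", seq);
            if (lseek(fd, 0, SEEK_SET) < 0
                || write(fd, buf, (size_t)len) != len
                || ftruncate(fd, len) < 0)
                seq = -1;
        }
    }
    close(fd);                  /* closing the descriptor drops the lock */
    return seq;
}
```
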
-- 
"The stories one hears about putting up | Henry Spencer @ U of Toronto Zoology
SunOS 4.1.1 are all true."  -D. Harrison|  henry@zoo.toronto.edu  utzoo!henry

kherron@ms.uky.edu (Kenneth Herron) (03/29/91)

rees@pisa.citi.umich.edu (Jim Rees) writes:

>In article <b2x77c.9uo@wang.com>, fitz@wang.com (Tom Fitzgerald) writes:
>Wait, I've got an idea.  How about numbering each article, starting with '1'
>and going up.  You could keep a counter, say in a file in /usr/lib/news, and
>increment it each time.

I thought of this once, but the locking could get painful and this has
uses beyond news anyway.  How about a "unique number server" that does
nothing but provide numbers on demand.  Give it a period of a million
or a billion and it'll take months or years to repeat...

Obviously there are problems with this; consider it a Partially Baked
Idea.
-- 
Kenneth Herron                                            kherron@ms.uky.edu
University of Kentucky                                        (606) 257-2975
Department of Mathematics 
                                "Never trust gimmicky gadgets" -- the Doctor

louie@sayshell.umd.edu (Louis A. Mamakos) (03/29/91)

In article <b2x77c.9uo@wang.com> fitz@wang.com (Tom Fitzgerald) writes:
>The number of safe characters is way below 85 unfortunately.  But above
>36 it really doesn't do you a lot of good.  If you want to crunch 31 bits
>of timestamp and 15 bits of process ID into a string,

I had an interesting thought; on BSD flavored systems with
gettimeofday(), you will always get a unique time returned (provided
the clock is not reset, but only slewed using adjtime()).  If
gettimeofday() is called more than once between clock interrupts, such
that the same time would have been returned, the low order bits of
tv_usec are farbled to ensure a unique value.

You might just dispense with the process id completely, and use a
unique time value composed of the time (though in that case, you've
got 64 bits 'o time, 32 each in tv_sec and tv_usec in struct timeval).
See, another reason to run NTP to synchronize your clocks and to beat
on your vendors that can't get a UNIX kernel to keep correct time and
not drop clock interrupts..
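
A sketch of an ID built from a struct timeval alone. The name
format_tv_id is hypothetical. Whether successive gettimeofday() calls
really return distinct values depends on the kernel, so this code only
formats the value and guarantees nothing about uniqueness.

```c
#include <stdio.h>
#include <sys/time.h>

/*
 * Hypothetical helper: <ssssssss.uuuuu@host> from a struct timeval,
 * both fields in hex.  tv_usec stays below 1000000, so it needs at
 * most five hex digits.
 */
int
format_tv_id(char *buf, size_t size, const struct timeval *tv,
             const char *host)
{
    return snprintf(buf, size, "<%lx.%lx@%s>",
                    (unsigned long)tv->tv_sec,
                    (unsigned long)tv->tv_usec, host);
}
```

A caller would fill the struct with gettimeofday(&tv, NULL) first.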

Just a random thought, 
louie

brad@looking.on.ca (Brad Templeton) (03/29/91)

Down the road, operating system designers probably should consider
getunique() as an operating system service.   A very simple function,
it would simply guarantee that it never, ever, returns the same string.
Would be handy.   It might have a few modes, providing strings that are
unique for the process, day, system-forever and universe-forever.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

jerry@olivey.ATC.Olivetti.Com (Jerry Aguirre) (03/30/91)

In article <1991Mar29.032900.548@ni.umd.edu> louie@sayshell.umd.edu (Louis A. Mamakos) writes:
>You might just dispense with the process id completely, and use a
>unique time value composed of the time (though in that case, you've
>got 64 bits 'o time, 32 each in tv_sec and tv_usec in struct timeval).

What!  Trade 15 bits of PID for 32 bits of tv_usec?  :-)

Actually the tv_usec doesn't use all 32 bits, only 20.  It only counts
to 999,999.  Even if it was bigger the correct action would be to scale
the excess into the seconds.

If one encoded it as <ssss.uuu@domain>, with the usec as variable width
field, then one could get smaller message IDs by just looping on
gettimeofday until the usec. returned is a small value.  Delaying some
postings a fraction of a second to save the world from having to handle
bigger message-IDs.  :-)

amanda@visix.com (Amanda Walker) (04/02/91)

rees@pisa.citi.umich.edu (Jim Rees) writes:

   You could keep a counter, say in a file in
   /usr/lib/news, and increment it each time.

Henry brought up locking, but there's also the issue of file system
access/writeability (NFS-mounted news spool/NNTP).
--
Amanda Walker						      amanda@visix.com
Visix Software Inc.					...!uunet!visix!amanda
-- 
Q.: What do you get if you cross a godfather with a lawyer?
A.: Someone who makes an offer you can't understand.

hks@nic.funet.fi (Harri Salminen) (04/17/91)

Would it be possible to have, after the time, a checksum calculated over
most of the message? The checksum calculation could include at
least the newsgroup name and subject if not everything.  It's unlikely
that even an automatic program sends within one second two messages
with same subject to same newsgroup. If you're worried that it still
might be the same in some very rare circumstances it should be
relatively easy to have the rejection routine to make a diff (or just
compare wordcount) with the original message and send it to news
manager for perusal.
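
A sketch of one way such a checksum could be computed. The posting does
not name an algorithm; Fletcher-16 is an assumption chosen here because
it is short, and the chaining interface is made up for illustration.

```c
/*
 * Fletcher-16 over one string, continuing from a previous state so that
 * several header fields can be chained.  Start with state 0.
 */
unsigned
fletcher16(unsigned state, const char *s)
{
    unsigned sum1 = state & 0xff;
    unsigned sum2 = (state >> 8) & 0xff;

    for (; *s != '\0'; s++) {
        sum1 = (sum1 + (unsigned char)*s) % 255;
        sum2 = (sum2 + sum1) % 255;
    }
    return (sum2 << 8) | sum1;
}
```

The value for a message would be fletcher16(fletcher16(0, newsgroups),
subject), printed in whatever radix the rest of the ID uses.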

Including the newsgroup name would make it possible to munge the
message-ids to become consistently different when gatewayed to two
different newsgroups from mail. In theory you shouldn't tamper with
message-ids if they are already present but in practise you might have
to or the message might get lost. The other way around the problem
would be to change all history database implementations to include the
newsgroup name or number in some form which would have to be changed
in all nodes wanting to benefit from this feature... Third and best
alternative which I hope could someday be achieved is to standardize
mailing lists at least as clearly as news messages so that
crossposting, followups, references etc. would work nicely giving us a
truly global distributed group communication service (some would call
it computer conferencing I presume)

The other advantage is that this style of message id (marked with some special
delimiter?) could be used to detect problems in message transport. Since
most of us haven't yet migrated to use some single ISO standard character
set, only US ASCII representations of 0-9, a-z and A-Z, which are common
to almost all systems, should be used.

Harri



-- 
Harri K Salminen - Finnish University & Research Network project
hks@funet.fi, LK-HS at FINHUTC, tut!hks, OPMVAX::hks, OH2LGE@OH2RBI
FUNET c/o VTKK/TLP, PL 40,  02101 Espoo, Finland - +358-0-4572288
"Virtually, I don't work, I just netWORK :-)"

henry@zoo.toronto.edu (Henry Spencer) (04/18/91)

In article <1991Apr16.174706.4963@nic.funet.fi> hks@funet.fi writes:
>Would it be possible to have, after the time, a checksum calculated over
>most of the message? The checksum calculation could include at
>least the newsgroup name and subject if not everything.  It's unlikely
>that even an automatic program sends within one second two messages
>with same subject to same newsgroup...

I'm not sure what your objective is here.  What this is essentially
doing is adding a random number to the message ID.  Using the process ID
accomplishes the same thing, with random numbers that are *guaranteed
unique* over the whole system, making collisions essentially impossible.

>Including the newsgroup name would make it possible to munge the
>message-ids to become consistently different when gatewayed to two
>different newsgroups from mail. In theory you shouldn't tamper with
>message-ids if they are already present but in practise you might have
>to or the message might get lost.

Can you explain this in more detail?  I don't see why you ever have to
tamper with a legal message ID, and you most certainly should never have
to assign more than one to the same message.  Gatewaying to multiple
newsgroups should be done with a cross-posting, not by posting the same
article to each newsgroup in turn!

>The other advantage is that this style of message id (marked with some special
>delimiter?) could be used to detect problems in message transport...

Geoff and I thought about this long and hard during C News development.
Some early versions generated a Checksum header for this purpose.  We
eventually deleted it.  The problem is that articles which go via broken
networks like Bitnet are often changed slightly in harmless ways, like
having tabs expanded to spaces or empty lines changed to contain a single
space.  So you get a lot of spurious checksum mismatches.  Given this,
we couldn't see a use for the checksums.  You can't just discard articles
with bad checksums.  Messages complaining about it will be frequent
enough that people will ignore them.  The software problems that cause
them are mostly already known, so alerting people won't do any good.
Checking the checksum on every article is costly, especially if the
algorithm is trying to be clever and ignore harmless kinds of damage.
There are perhaps rare circumstances where it would be useful to know
whether an article was damaged or not, but they didn't seem common enough
to justify hauling the checksum along in every message.
-- 
And the bean-counter replied,           | Henry Spencer @ U of Toronto Zoology
"beans are more important".             |  henry@zoo.toronto.edu  utzoo!henry

wayne@dsndata.uucp (Wayne Schlitt) (04/18/91)

In article <1991Apr17.212354.12236@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
> In article <1991Apr16.174706.4963@nic.funet.fi> hks@funet.fi writes:
> >Would it be possible to have, after the time, a checksum calculated over
> >most of the message? The checksum calculation could include at
> >least the newsgroup name and subject if not everything.  It's unlikely
> >that even an automatic program sends within one second two messages
> >with same subject to same newsgroup...
> 
> I'm not sure what your objective is here.  What this is essentially
> doing is adding a random number to the message ID.  Using the process ID
> accomplishes the same thing, with random numbers that are *guaranteed
> unique* over the whole system, making collisions essentially impossible.

while looking through a long list of message id's, i came across one
message id format that i kind of like.  they used
<time-date-stamp.username@site>.   i hadnt really thought about it,
but having the login name instead of the process id has a real
advantage in that you will often have a valid email address to the
person who posted the article.

granted, using the login name has the same problem as a process id, in
that the same user can generate more than one article per second, but
it isnt any worse than the process id either.  the login name may also
be a little bit longer on average than the process id, but probably
not by that much.  you would also end up with lots of articles coming
from user names of "gateway", or "news", but even that is no _worse_
than the process id.  getting the users name in a portable way may be
a problem, i am not sure...  

anyway, just think of the fun hacks you could add to expire to kill
things from known net idiots quicker, and leave articles from doug
gwyn, chris torek and (of course) henry spencer around longer.

just a thought...

-wayne

wisner@ims.alaska.edu (Bill Wisner) (04/19/91)

In article <WAYNE.91Apr17195303@dsndata.uucp> wayne@dsndata.uucp (Wayne Schlitt) writes:
>granted, using the login name has the same problem as a process id, in
>that the same user can generate more than one article per second, but
>it isnt any worse than the process id either.

Wrong.  The message ID is generated by inews or anne.jones, which gets
invoked each time an article is posted.  Thus, the process ID is different
for every article.  The username is constant.  If the username replaces
the PID, two articles posted by the same user in one second will have the
same message ID.  Using the PID, the IDs will be different since the PID
will have changed.

I think it's safe to say that it's very unlikely for any system to cycle
through all 30,000 process IDs in one second.

Bill Wisner <wisner@ims.alaska.edu> Gryphon Gang Fairbanks AK 99775
bnug, dude
yeah
.

richard@locus.com (Richard M. Mathews) (04/20/91)

wisner@ims.alaska.edu (Bill Wisner) writes:
>Wrong.  The message ID is generated by inews or anne.jones, which gets
>invoked each time an article is posted.  Thus, the process ID is different
>for every article.

Actually, this isn't safe on all systems.  On "secure" systems which
generate pseudo-random PIDs, you are certain that there will not be two
processes at the same time with the same PID; but successive processes
could have the same PID.  The probability of two with the same PID being
created during different parts of the same second is small, but it doesn't
require cycling through 30000 processes.

Richard M. Mathews			Lietuva laisva = Free Lithuania
richard@locus.com			Brivu Latviju  = Free Latvia
lcc!richard@seas.ucla.edu		Eesti vabaks   = Free Estonia
...!{uunet|ucla-se|turnkey}!lcc!richard

res@colnet.uucp (Rob Stampfli) (04/23/91)

While we are on the subject of Message-IDs, how about this:

Choose any workable standard for message-IDs you like, provided it has some
randomness to it (already a good idea for other reasons).  Then pass this
format thru a one-way authenticating function (one that is hard to invert)
which produces a one-to-one mapping of inputs to outputs.  Use the output
of this function, suitably reformatted to match the specification for the
Message-ID field, in the Message-ID field of the message being posted.
Finally, save the unencrypted Message-ID at the site in a file accessible
only to the News software.

Because of the one-to-one mapping, it can be guaranteed that encrypted
Message-IDs will be unique if they are generated from a subset of unique
unencrypted Message-IDs.

Now, once this is in place, change the News software to demand that the
unencrypted Message-ID be sent along with any cancel message (define a new
"Authentication: " field) if a cancel request is for a Message-ID which
matches the format of an encrypted Message-ID.

The result is that cancel messages can no longer be forged.  Only the
poster or the News administrator at the site the message originates
at can cancel it.  Of course, a given site can kill the message locally
(and for downstream sites), but it cannot purge the message globally.

I think such a scheme would have significant advantages in the anarchy
we call Usenet.
-- 
Rob Stampfli, 614-864-9377, res@kd8wk.uucp (osu-cis!kd8wk!res), kd8wk@n8jyv.oh

hks@nic.funet.fi (Harri Salminen) (04/26/91)

henry@zoo.toronto.edu (Henry Spencer) writes:

>In article <1991Apr16.174706.4963@nic.funet.fi> hks@funet.fi writes:
>>Would it be possible to have after the time a checksum calculated over
>>most of the message? The checksum calculation could include at
>>least the newsgroup name and subject if not everything.  It's unlikely
>>that even an automatic program sends within one second two messages
>>with same subject to same newsgroup...

>I'm not sure what your objective is here.  What this is essentially
>doing is adding a random number to the message ID.  Using the process ID
>accomplishes the same thing, with random numbers that are *guaranteed
>unique* over the whole system, making collisions essentially impossible.

We need a unique identifier, but defining the PID to be part of
it might not make it unique, since some systems don't rotate PIDs, others
might not have them at all, and a third group of systems might not even
use separate processes for each message. Of course implementors choose
some way (even a counter) to make them unique, but I thought the idea
was to put some meaning into the randomness and utilize it.
The idea of one way unique encryption sounds fine if one can find
a suitable algorithm which may be hard... You could use RSA or some other
public key system but the message-ID's might get several lines long :-)
Wasn't there an RFC on authenticated mail that could be utilized?

>>Including the newsgroup name would make it possible to munge the
>>message-ids to become consistently different when gatewayed to two
>>different newsgroups from mail. In theory you shouldn't tamper with
>>message-ids if they are already present but in practise you might have
>>to or the message might get lost.

>Can you explain this in more detail?  I don't see why you ever have to
>tamper with a legal message ID, and you most certainly should never have
>to assign more than one to the same message.  

I believe that gateways should be liberal in what they accept but
output only "legal" format. Although it's rare nowadays, you'll
sometimes get messages with illegal characters in message-ID's
(Zmailer is the only RFC-822 mailer I know that really cares what the
message-ID looks like...). Some popular gateways just map them to
legal ones and pass them through. The same applies to other headers. I
don't like systems that discard or return a message that has just
some minor error (like a "non-existing" timezone) and has already
reached all list subscribers without problems. Zmailer strikes quite a
nice compromise by letting the message through and reporting very clearly
what was wrong according to which RFC. It even checks the Message-ID and
References fields...

>Gatewaying to multiple
>newsgroups should be done with a cross-posting, not by posting the same
>article to each newsgroup in turn!

It's almost impossible to do crossposting when you have multiple
gateways in different places. To do so, each gateway would have to
work out from the mail headers and a global gateway database which
groups the message is going to (sometimes this is impossible, since the
list name might not even appear in an X-Resent-Cc: field in the message
body, not to mention the millions of ways a list's address can be
represented).  The only reliable method (unless you gateway a listserv
list with the right newsgroup header insertion) is to have an alias for
each incoming list in the gateway host. Sometimes the same list is
gatewayed to two different distributions until one of the groups is
removed. We need to define mailing lists better and coordinate gateways
to improve the situation.

When a message ends up in two different newsgroups via two different
gateways, it will be shown only in the first one it arrives in.
Fortunately it's common that the person reads both groups, so you can
live with it because the message is at least somewhere... To solve
this small problem you need either different message-IDs or history
checking that includes the newsgroup. Maybe the latter is the cleaner
way after all?
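
A toy sketch of history checking that includes the newsgroup, assuming
an in-memory set in place of a real history file (which is normally
keyed on the Message-ID alone). The History class and its accept()
method are hypothetical names.

```python
class History:
    """Duplicate check keyed on (Message-ID, newsgroup), not the ID alone."""

    def __init__(self):
        self._seen = set()

    def accept(self, message_id, newsgroup):
        """Return True the first time an ID arrives for a given group."""
        key = (message_id, newsgroup)
        if key in self._seen:
            return False  # already filed in this group: a true duplicate
        self._seen.add(key)
        return True
```

With this keying, the same Message-ID arriving through a second gateway
for a different group is filed there too, while a repeat for the same
group is still rejected.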

>>The other advantage of this style of message id (marked with some special
>>delimiter?) could be used to detect problems in message transport...
>-- 
>And the bean-counter replied,           | Henry Spencer @ U of Toronto Zoology
>"beans are more important".             |  henry@zoo.toronto.edu  utzoo!henry
-- 
Harri K Salminen - Finnish University & Research Network project
hks@funet.fi, LK-HS at FINHUTC, tut!hks, OPMVAX::hks, OH2LGE@OH2RBI
FUNET c/o VTKK/TLP, PL 40,  02101 Espoo, Finland - +358-0-4572288
"Virtually, I don't work, I just netWORK :-)"

hks@nic.funet.fi (Harri Salminen) (04/26/91)

I forgot to note that if you just ADD a string
to the already unique message-ID just before the @ you'll get an even more
unique ID. The string could be a CRC of the newsgroup name & subject or
some other hash of the newsgroup name. That way you don't have to change
anything but the gateway, although it isn't as clean as checking the
newsgroup name in the history check.
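
A rough sketch of that munging, assuming CRC-32 over the newsgroup name
only and a "." as the delimiter before the added string; the function
name tag_message_id is hypothetical.

```python
import zlib


def tag_message_id(message_id, newsgroup):
    """Insert a CRC of the newsgroup name just before the @."""
    local, sep, host = message_id.strip("<>").rpartition("@")
    if not sep:
        raise ValueError("no @ in Message-ID: %r" % message_id)
    # Mask to 32 bits so the value is an unsigned integer everywhere.
    crc = zlib.crc32(newsgroup.encode("ascii")) & 0xFFFFFFFF
    return "<%s.%08x@%s>" % (local, crc, host)
```

The same incoming ID then maps to a consistent, different ID per
newsgroup, so each gateway can apply it independently.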

Harri
-- 
Harri K Salminen - Finnish University & Research Network project
hks@funet.fi, LK-HS at FINHUTC, tut!hks, OPMVAX::hks, OH2LGE@OH2RBI
FUNET c/o VTKK/TLP, PL 40,  02101 Espoo, Finland - +358-0-4572288
"Virtually, I don't work, I just netWORK :-)"