mcb@ncis.tis.llnl.gov (Michael C. Berch) (04/28/89)
We were sitting around jawing about conversion to the Domain Name System and how many sites are still using ".UUCP" or other name schemes, and how many sites don't put out legal Message-IDs at all. So I decided to count, using the Message-IDs in the history file. I don't remember how long we are keeping things in the history file after expiration, but there were about 64K articles in the file, which looked like a good round number for a sample.

The "(No domain)" entries were Message-IDs without a dot in them, and were counted automatically, so I don't know offhand who isn't doing an addr-spec on the right-hand side of the Message-ID. The "(Badly formed)" count was done by hand and represents 202 articles from about 70 different sites. Of those 202 articles, about half were literally badly-formed garbage and half were nonexistent top-level domains.

I was loose about the legitimacy of top-level domains; in addition to all the legal (i.e., registered and known to the root servers) domains, I also included a few well-known names like "uninett", "junet", "bitnet", and "cdn" in the main list instead of putting them in the "nonexistent" domain list. People who used company names, etc., as top level domains, however, were considered losers and put in the "nonexistent" list.

Michael C. Berch
mcb@ncis.llnl.gov / uunet!ncis.llnl.gov!mcb

-----
Usenet Top-Level Domain Census, 27 April 1989

edu                 24851
com                 16019
uucp                15891
gov                   923
org                   859
uk                    758
us                    613
mil                   531
net                   526
ca                    511
arpa                  315
se                    215
nl                    195
oz                    176
bitnet                128
fi                     85
ie                     78
dk                     63
de                     45
fr                     44
au                     39
ch                     21
cdn                    21
nz                     18
junet                  18
il                     15
no                     13
is                     12
uninett                11
es                      7
pt                      6
pb                      4
ia                      4
hk                      3
it                      1
cs                      1
(No domain spec)     1699
(Badly formed)        202
---------------------
Total               64937
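The counting described above can be sketched roughly as follows. This is not the awk script mentioned in the thread, only an illustration in Python. It assumes each history line starts with the Message-ID followed by a tab, applies the "no dot" rule to the part after the last "@", and treats any ID missing its angle brackets as badly formed (the original census did that part by hand). The function names and the sample lines are hypothetical.

```python
from collections import Counter


def top_level_domain(message_id):
    """Bucket one Message-ID by the last label of its right-hand side."""
    mid = message_id.strip()
    # Stand-in for the hand-checked "(Badly formed)" bucket: no <...> wrapper.
    if len(mid) < 3 or not (mid.startswith("<") and mid.endswith(">")):
        return "(Badly formed)"
    # Everything after the last "@" (or the whole ID if there is no "@").
    host = mid[1:-1].rsplit("@", 1)[-1]
    if "." not in host:
        return "(No domain spec)"
    tld = host.rsplit(".", 1)[-1].lower()
    return tld if tld else "(Badly formed)"


def census(history_lines):
    """Count top-level domains over an iterable of history-file lines."""
    counts = Counter()
    for line in history_lines:
        # Assumption: the Message-ID is the first tab-separated field.
        first_field = line.split("\t")[0]
        if first_field.strip():
            counts[top_level_domain(first_field)] += 1
    return counts


# Hypothetical sample lines; the trailing fields are placeholders.
sample = [
    "<164@ncis.tis.llnl.gov>\tplaceholder",
    "<1660@vicom.COM>\tplaceholder",
    "<53391@uunet.UU.NET>\tplaceholder",
    "<297@moegate.UUCP>\tplaceholder",
    "<xAqd-froo%baz>\tplaceholder",
    "1234@example.edu\tplaceholder",
]

for name, n in census(sample).most_common():
    print(f"{name:<20}{n:>6}")
```

Reading the history file rather than the articles themselves keeps this to a single sequential pass over one file.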
fair@Apple.COM (Erik E. Fair) (04/28/89)
More fodder for this one: my mkglue pathalias postprocessor has a list of legit top level domains (i.e. those recognized by the DNS), and produces a report containing counts of real and bogus domains in the UUCP maps every time it's run (just because the data is there). Enclosed is the report from last night's run on apple.com.

Erik E. Fair apple!fair fair@apple.com

top level domains
com        663
edu        473
uk         259
jp         187
ca         124
org        108
dk         105
kr          98
us          59
fr          43
gov         35
es          29
nl          28
cl          23
fi          17
is          16
arpa        15
mil         15
net         12
be           7
ar           5
de           4
nz           3
sg           2
at           1
ch           1
no           1
se           1

unrecognized summary:
junet      176
jpn        174
bitnet      50
ac          35
cdn         26
it          25
co          15
kaist       14
kit         12
re          11
etri        10
dacom        8
hanyang      3
gsc          2
kta          2
netnorth     2
odin         2
or           2
ampr         1
csnet        1
dkuug        1
dunet        1
gr           1
gs           1
konkuk       1
kwu          1
sca          1
seri         1
snu          1
sst          1
th           1
ukorea       1
watstar      1
wisc         1
yonsei       1
lmb@vicom.COM (Larry Blair) (04/28/89)
In article <164@ncis.tis.llnl.gov> mcb@ncis.tis.llnl.gov (Michael C. Berch) writes:
=We were sitting around jawing about conversion to the Domain Name
=System and how many sites are still using ".UUCP" or other name schemes,
=and how many sites don't put out legal Message-IDs at all.
Wait a sec. If you're talking about the Message-ID in news postings, I
have to disagree with your concept of illegal. A legal Message-ID is one
that uniquely identifies a particular news posting, does not conflict with
any other site's Message-IDs, and doesn't cause the news software to barf.
It is only used for article identification and has nothing to do with addresses
or domains. You'll notice that my From: says lmb@vicom.COM and the Message-ID
is <nnnn@vicom.COM>. This is only because I went in and hacked news source to
produce this. Unmodified, it generated <nnnn@vsi1.COM>, using our uucp node
name and appending our domain. That was a perfectly legal ID (I changed it
for esthetic reasons). If a site wants to generate IDs like <xAqd-froo%baz>
that's perfectly ok, provided they don't reuse that string. The software
doesn't care.
--
Larry Blair ames!vsi1!lmb lmb@vicom.com
mcb@ncis.tis.llnl.gov (Michael C. Berch) (04/29/89)
In article <1660@vicom.COM> lmb@vicom.COM (Larry Blair) writes:
> In <164@ncis.tis.llnl.gov> mcb@ncis.tis.llnl.gov (Michael C. Berch) writes:
> > We were sitting around jawing about conversion to the Domain Name
> > System and how many sites are still using ".UUCP" or other name schemes,
> > and how many sites don't put out legal Message-IDs at all.
>
> Wait a sec. If you're talking about the Message-ID in news postings, I
> have to disagree with your concept of illegal. A legal Message-ID is one
> that uniquely identifies a particular news posting, does not conflict with
> any other site's Message-IDs, and doesn't cause the news software to barf.
> It is only used for article identification and has nothing to do with addresses
> or domains. You'll notice that my From: says lmb@vicom.COM and the Message-ID
> is <nnnn@vicom.COM>. This is only because I went in and hacked news source to
> produce this. Unmodified, it generated <nnnn@vsi1.COM>, using our uucp node
> name and appending our domain. That was a perfectly legal ID (I changed it
> for esthetic reasons). If a site wants to generate IDs like <xAqd-froo%baz>
> that's perfectly ok, provided they don't reuse that string. The software
> doesn't care.

Well, yes and no. Here's what the standard says (M. Horton & R. Adams, Standard for the Interchange of USENET Messages, RFC-1036, December 1987):

"2.1.5. Message-ID

The "Message-ID" line gives the message a unique identifier. The Message-ID may not be reused during the lifetime of any previous message with the same Message-ID. (It is recommended that no Message-ID be reused for at least two years.)
Message-ID's have the syntax:

    <string not containing blank or ">">

In order to conform to RFC-822, the Message-ID must have the format:

    <unique@full_domain_name>

where full_domain_name is the full name of the host at which the message entered the network, including a domain that host is in, and unique is any string of printing ASCII characters, not including "<" (left angle bracket), ">" (right angle bracket), or "@" (at sign). [...]"

My interpretation of this, and I believe it to be the general sense of the community, is that (1) the software as it presently exists will accept the weaker first form as a legal Message-ID, but (2) as a matter of policy all Message-IDs should conform to RFC-822, which requires a domain specification. This latter requirement is especially valuable in preserving Message-IDs across news/mail links (particularly important to those of us who do mailing-list/Usenet gatewaying), and for the transport of news other than by UUCP.

Furthermore, as a matter of common sense, it is apparent that the only guarantee of uniqueness of a Message-ID is for each host to insert a unique host identifier. The only such unique identifier is a full domain name (including hosts in the UUCP pseudo-domain who have registered names in the UUCP map), since they are assigned by an external source. Otherwise, what is to prevent two hosts from both using "<xAqd-froo%baz>" as a Message-ID?

Section 2.1.5 of RFC-1036 goes on to admonish programmers not to make unwarranted assumptions about the content or syntax of Message-IDs; if I were writing news software I would certainly not have it bounce anything that didn't have a domain spec, but I felt perfectly free in doing so for the purpose of the domain census. The fact that less than 2% of the IDs turned up as "illegal" under this metric proves the point, I think. (A number of the illegal IDs were in fact badly-formed under the weak standard in that they did not begin and end with "<" and ">".)
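The two-level reading of RFC-1036 section 2.1.5 given above (a weak form the software accepts, and a stricter RFC-822-conformant form) can be sketched as a small classifier. This is only an illustration under stated assumptions: "blank" is taken to mean space or tab, and "including a domain" is taken to mean that the host part contains at least one dot. The function name and the three labels are hypothetical.

```python
BLANKS = " \t"  # assumption: "blank" means space or tab


def classify(message_id):
    """Return "malformed", "weak", or "strict" for one Message-ID string."""
    # Weak form: <string not containing blank or ">">
    if len(message_id) < 3 or message_id[0] != "<" or message_id[-1] != ">":
        return "malformed"
    body = message_id[1:-1]
    if ">" in body or any(ch in BLANKS for ch in body):
        return "malformed"

    # Strict form: <unique@full_domain_name>
    if body.count("@") != 1 or "<" in body:
        return "weak"
    unique, host = body.split("@")
    # unique: one or more printing ASCII characters (0x21-0x7E).
    if not unique or not all(0x21 <= ord(ch) <= 0x7E for ch in unique):
        return "weak"
    # Assumption: a "full domain name" has at least two non-empty labels.
    labels = host.split(".")
    if len(labels) < 2 or not all(labels):
        return "weak"
    return "strict"


for mid in ["<1660@vicom.COM>", "<xAqd-froo%baz>", "<123@randompc>", "1660@vicom.COM"]:
    print(mid, "->", classify(mid))
```

Note that this checks syntax only; it says nothing about whether the host name is registered anywhere, which is the part that actually provides uniqueness.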
A couple people wrote and asked me why I used the Message-IDs instead of the From: line. The intent of the exercise was a quick first-order approximation of the state of domain conversion in Usenet. My awk script took about 3 minutes to run through the 64K history file. Using the From: line would have either meant bashing 64,000 inodes or writing something to collect From: lines as articles arrived (probably the *right* way to do it); in either case I was unwilling to wait so long for the results, which I doubt would have been significantly different.

Michael C. Berch
mcb@ncis.llnl.gov / uunet!ncis.llnl.gov!mcb
rayan@ai.toronto.edu (Rayan Zachariassen) (04/30/89)
In article <53391@uunet.UU.NET> rick@uunet.UU.NET (Rick Adams) writes:
# A valid message ID is 3 components:
# "anything but whitespace or @" "@" "valid hostname"
#
# See RFC822, which the USENET specification defers to in many cases.
#
# You can have any kind of garbage that you want to the left of the
# at sign, but you MUST have an @ and you must have a valid hostname
RFC822:

    msg-id        = "<" addr-spec ">"
    addr-spec     = local-part "@" domain
    local-part    = word *("." word)
    word          = atom / quoted-string
    atom          = 1*<any CHAR except specials, SPACE, and CTL>

    specials are: ()<>@,;:\".[]
    they must be within a quoted-string to occur in a word.

    quoted-string = <"> *(qtext/quoted-pair) <">
    qtext         = <any CHAR excepting <">,      ; => may be folded
                    "\" & CR, and including
                    linear-white-space>
    quoted-pair   = "\" CHAR                      ; may quote any char

The news rfc seems rather different than '822 in its description of the
requirement. Which takes precedence?
rayan
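The grammar excerpt above can be turned into a rough matcher to see which IDs it admits. This is a sketch, not a full RFC 822 parser: the excerpt does not expand "domain", so it is assumed here to be dot-separated atoms (domain-literals are left out), and header folding inside quoted strings is not handled. The specials set is the one from RFC 822's grammar; all names are hypothetical.

```python
import re

# specials from RFC 822: ( ) < > @ , ; : \ " . [ ]
SPECIALS = '()<>@,;:\\".[]'

# atom = one or more ASCII chars that are not specials, SPACE, or CTLs
ATOM_CHARS = "".join(chr(c) for c in range(33, 127) if chr(c) not in SPECIALS)
ATOM = "[" + re.escape(ATOM_CHARS) + "]+"

# quoted-string = <"> *(qtext / quoted-pair) <">
QUOTED = r'"(?:[^"\\\r]|\\[\x00-\x7f])*"'

WORD = "(?:" + ATOM + "|" + QUOTED + ")"
LOCAL_PART = WORD + r"(?:\." + WORD + ")*"
# Assumption: domain simplified to dot-separated atoms.
DOMAIN = ATOM + r"(?:\." + ATOM + ")*"

MSG_ID = re.compile("<" + LOCAL_PART + "@" + DOMAIN + ">")


def is_rfc822_msg_id(text):
    """True if text matches the simplified msg-id grammar above."""
    return text.isascii() and MSG_ID.fullmatch(text) is not None


for mid in ["<1660@vicom.COM>", '<"a@b"@c.com>', "<a@b@c.com>", "<xAqd-froo%baz>"]:
    print(mid, "->", is_rfc822_msg_id(mid))
```

Under this grammar the left-hand side is more constrained than "anything but whitespace or @": specials such as a bare "(" or "," need to be inside a quoted-string.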
rick@uunet.UU.NET (Rick Adams) (04/30/89)
A valid message ID is 3 components:

    "anything but whitespace or @" "@" "valid hostname"

See RFC822, which the USENET specification defers to in many cases.

You can have any kind of garbage that you want to the left of the at sign, but you MUST have an @ and you must have a valid hostname

--rick
rick@uunet.UU.NET (Rick Adams) (04/30/89)
The news rfc specifies that

1) All article headers must conform to rfc-822
2) the news rfc may be MORE restrictive than rfc-822
3) if there is a conflict, rfc-822 wins.
brad@looking.UUCP (Brad Templeton) (04/30/89)
And while we're on the subject, I would like to put in a plea to people to keep those message-ids short, if possible. In the old days, it was <sequence@site> where the sequence was usually no more than 5 digits. Nowadays I see things like:

    <Apr.28.00.38.03.1989.23256@NET.BIO.NET>

Not to complain about bio.net in particular -- a fair number of sites are doing this sort of thing.

You may feel that if your site uses such IDs, it's not a problem, but when lots of people do it, it adds up. Not only does a typical net message contain a References line with several IDs, but each site likes to keep a database with these IDs as keys that goes back far longer than news is kept.

In the case of the References line, that may be only 60 extra bytes per message, but at 2500 messages/day, that's 150K extra per day, or about $1500 in transmission cost daily. Not because of just you, but everybody.

So, if you are writing message-id code that doesn't use a sequence, try to keep it small. Encode things like dates & process-ids using the 92 safe printable characters, for example (no > or @)
--
Brad Templeton, Looking Glass Software Ltd. -- Waterloo, Ontario 519/884-7473
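One way to read the suggestion above is to pack the timestamp and process ID into a single integer and write it in a large base. The sketch below is an illustration, not anyone's actual news code. The 92 characters mentioned are the 94 printing ASCII characters minus ">" and "@"; this sketch also drops "<", which RFC-1036 excludes from the unique part, leaving 91. The function names, the 32-bit field reserved for the process ID, and the sample timestamp are assumptions.

```python
# 94 printing ASCII characters (0x21-0x7E) minus "<", ">" and "@" = 91.
ALPHABET = "".join(chr(c) for c in range(0x21, 0x7F) if chr(c) not in "<>@")


def encode(n, alphabet=ALPHABET):
    """Write a non-negative integer in base len(alphabet)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    base = len(alphabet)
    digits = []
    while True:
        n, r = divmod(n, base)
        digits.append(alphabet[r])
        if n == 0:
            break
    return "".join(reversed(digits))


def decode(text, alphabet=ALPHABET):
    """Inverse of encode()."""
    base = len(alphabet)
    n = 0
    for ch in text:
        n = n * base + alphabet.index(ch)
    return n


def short_message_id(host, timestamp, pid, pid_bits=32):
    """Build <unique@host> from a Unix timestamp and a process ID."""
    if not 0 <= pid < (1 << pid_bits):
        raise ValueError("pid does not fit in pid_bits")
    # Packing both values into one integer keeps the mapping unambiguous.
    return "<%s@%s>" % (encode((timestamp << pid_bits) | pid), host)


# Hypothetical values: a timestamp in late April 1989 and the pid from the
# long example above.
print(short_message_id("NET.BIO.NET", 609726000, 23256))
```

With these inputs the unique part comes to 10 characters, against 26 for "Apr.28.00.38.03.1989.23256". A plain per-site sequence number is shorter still where one is available.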
wcs) (05/01/89)
In article <53391@uunet.UU.NET> rick@uunet.UU.NET (Rick Adams) writes:
] "anything but whitespace or @" "@" "valid hostname"
]See RFC822, which the USENET specification defers to in many cases.
]You can have any kind of garbage that you want to the left of the
]at sign, but you MUST have an @ and you must have a valid hostname
What about sites that don't have "valid hostname"s, in the sense of
being either a registered Internet node or a .UUCP uunet subscriber?
Sure, you can say "well go register your node with somebody", but
meanwhile, if your machine is
randompc
and you get your newsfeed from your nextdoor neighbor, what should
you use?
The implication of the RFC1036 excerpt was that it was nice to be
RFC822 compliant but not strictly required as long as you kept the
delimiters straight.
--
# Bill Stewart, AT&T Bell Labs 2G218 Holmdel NJ 201-949-0705 ho95c.att.com!wcs
# also found at 201-271-4712 tarpon.att.com!wcs
# welcome, to mars, eh? You hosers have a brew and some donuts!
andys@ulysses.homer.nj.att.com (Andy Sherman) (05/10/89)
In article <89Apr29.175638edt.39756@neat.ai.toronto.edu> rayan@ai.toronto.edu (Rayan Zachariassen) writes:
>
> The news rfc seems rather different than '822 in its description of the
> requirement. Which takes precedence?

822 sets the standard for *mail*. The news rfc sets the standard for *news*.
--
Andy Sherman/AT&T Bell Laboratories/Murray Hill, NJ *NEW ADDRESS*
AUDIBLE: (201) 582-5928 *NEW PHONE*
READABLE: andys@ulysses.ATT.COM or att!ulysses!andys *NEW EMAIL*
The views and opinions are my own. Who else would want them? *OLD DISCLAIMER*
soley@moegate.UUCP (Norman S. Soley) (05/10/89)
In article <11505@ulysses.homer.nj.att.com> andys@ulysses.homer.nj.att.com (Andy Sherman) writes:
>In article <89Apr29.175638edt.39756@neat.ai.toronto.edu> rayan@ai.toronto.edu (Rayan Zachariassen) writes:
>>
>> The news rfc seems rather different than '822 in its description of the
>> requirement. Which takes precedence?
>
>822 sets the standard for *mail*. The news rfc sets the standard for *news*.

Since mailing articles around is one of the possible transport mechanisms for news 822 should take precedence.
--
Norman Soley - The Communications Guy - Ontario Ministry of the Environment
Until the next maps go out: moegate!soley@ontenv.UUCP
if you roll your own: uunet!{attcan!ncrcan|mnetor!ontmoh}!ontenv!moegate!soley
I'd like to try golf but I just can't bring myself to buy a pair of plaid pants
henry@utzoo.uucp (Henry Spencer) (05/11/89)
In article <297@moegate.UUCP> soley@moegate.UUCP (Norman S. Soley) writes:
>Since mailing articles around is one of the possible transport mechanisms
>for news 822 should take precedence.

I fear this is a non sequitur. Using mail as a transport mechanism tends to require wrapping the news articles up sufficiently so that mailers will not mess with them. The result is that it really doesn't matter what their headers look like, for this purpose.
--
Mars in 1980s: USSR, 2 tries, | Henry Spencer at U of Toronto Zoology
2 failures; USA, 0 tries.     | uunet!attcan!utzoo!henry henry@zoo.toronto.edu