[news.admin] Usenet Top-Level Domain Census

mcb@ncis.tis.llnl.gov (Michael C. Berch) (04/28/89)

We were sitting around jawing about conversion to the Domain Name
System and how many sites are still using ".UUCP" or other name schemes, 
and how many sites don't put out legal Message-IDs at all.
So I decided to count, using the Message-IDs in the history file. 
I don't remember how long we are keeping things in the history file
after expiration, but there were about 64K articles in the
file, which looked like a good round number for a sample.

The "(No domain spec)" count is of Message-IDs without a dot in them; these
were counted automatically, so I don't know offhand who isn't doing an
addr-spec on the right-hand side of the Message-ID.  The "(Badly
formed)" count was done by hand and represents 202 articles from about 70 
different sites.  Of those 202 articles, about half were literally 
badly-formed garbage and half were nonexistent top-level domains.

I was loose about the legitimacy of top-level domains; in addition to
all the legal (i.e., registered and known to the root servers) domains, 
I also included a few well-known names like "uninett", "junet", "bitnet", 
and "cdn" in the main list instead of putting them in the "nonexistent" 
domain list.  People who used company names, etc., as top level
domains, however, were considered losers and put in the "nonexistent" list.
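
The counting described above can be sketched roughly as follows.  This is a
minimal Python illustration, not the awk script actually used (which is not
shown in this thread).  The names KNOWN_TLDS, classify and census are
hypothetical, the known-domain set is only a small subset of the table below,
and unknown top-level domains are folded into "(Badly formed)" automatically
here, whereas the real count did that part by hand.

```python
import re
from collections import Counter

# Hypothetical, partial list of accepted top-level names (subset of the table).
KNOWN_TLDS = {"edu", "com", "uucp", "gov", "org", "uk", "us", "mil", "net",
              "ca", "arpa", "bitnet", "junet", "cdn", "uninett"}

# Assumed minimal shape: "<", one or more non-blank non-">" characters, ">".
MSGID_RE = re.compile(r"<[^\s>]+>")

def classify(msgid):
    """Return the census bucket for a single Message-ID string."""
    if not MSGID_RE.fullmatch(msgid):
        return "(Badly formed)"
    body = msgid[1:-1]
    # As in the post: an ID with no dot at all counts as having no domain spec.
    if "." not in body:
        return "(No domain spec)"
    tld = body.rsplit(".", 1)[1].lower()
    # Unknown top-level names go in the same bucket as garbage.
    return tld if tld in KNOWN_TLDS else "(Badly formed)"

def census(msgids):
    """Count Message-IDs per bucket."""
    return Counter(classify(m) for m in msgids)
```

Feeding it the Message-ID column of a history file (however that file is laid
out locally) would give per-domain counts in the same form as the table.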

Michael C. Berch  
mcb@ncis.llnl.gov / uunet!ncis.llnl.gov!mcb

-----
Usenet Top-Level Domain Census, 27 April 1989

edu		24851
com		16019
uucp		15891
gov		  923
org		  859
uk		  758
us		  613
mil		  531
net		  526
ca		  511
arpa		  315
se		  215
nl		  195
oz		  176
bitnet		  128
fi		   85
ie		   78
dk		   63
de		   45
fr		   44
au		   39
ch		   21
cdn		   21
nz		   18
junet		   18
il		   15
no		   13
is		   12
uninett		   11
es		    7
pt		    6
pb		    4
ia		    4
hk		    3
it		    1
cs		    1
(No domain spec) 1699
(Badly formed)	  202
---------------------
Total		64937

fair@Apple.COM (Erik E. Fair) (04/28/89)

More fodder for this one: my mkglue pathalias postprocessor has a list
of legit top level domains (i.e. those recognized by the DNS), and
produces a report containing counts of real and bogus domains in the
UUCP maps every time it's run (just because the data is there).
Enclosed is the report from last night's run on apple.com.

	Erik E. Fair	apple!fair	fair@apple.com

top level domains
com	663
edu	473
uk	259
jp	187
ca	124
org	108
dk	105
kr	98
us	59
fr	43
gov	35
es	29
nl	28
cl	23
fi	17
is	16
arpa	15
mil	15
net	12
be	7
ar	5
de	4
nz	3
sg	2
at	1
ch	1
no	1
se	1

unrecognized summary:
junet		176
jpn		174
bitnet		50
ac		35
cdn		26
it		25
co		15
kaist		14
kit		12
re		11
etri		10
dacom		8
hanyang		3
gsc		2
kta		2
netnorth	2
odin		2
or		2
ampr		1
csnet		1
dkuug		1
dunet		1
gr		1
gs		1
konkuk		1
kwu		1
sca		1
seri		1
snu		1
sst		1
th		1
ukorea		1
watstar		1
wisc		1
yonsei		1

lmb@vicom.COM (Larry Blair) (04/28/89)

In article <164@ncis.tis.llnl.gov> mcb@ncis.tis.llnl.gov (Michael C. Berch) writes:
=We were sitting around jawing about conversion to the Domain Name
=System and how many sites are still using ".UUCP" or other name schemes, 
=and how many sites don't put out legal Message-IDs at all.

Wait a sec.  If you're talking about the Message-ID in news postings, I
have to disagree with your concept of illegal.  A legal Message-ID is one
that uniquely identifies a particular news posting, does not conflict with
any other site's Message-IDs, and doesn't cause the news software to barf.
It is only used for article identification and has nothing to do with addresses
or domains.  You'll notice that my From: says lmb@vicom.COM and the Message-ID
is <nnnn@vicom.COM>.  This is only because I went in and hacked news source to
produce this.  Unmodified, it generated <nnnn@vsi1.COM>, using our uucp node
name and appending our domain.  That was a perfectly legal ID (I changed it
for esthetic reasons).  If a site wants to generate IDs like <xAqd-froo%baz>
that's perfectly ok, provided they don't reuse that string.  The software
doesn't care.
-- 
Larry Blair   ames!vsi1!lmb   lmb@vicom.com

mcb@ncis.tis.llnl.gov (Michael C. Berch) (04/29/89)

In article <1660@vicom.COM> lmb@vicom.COM (Larry Blair) writes:
> In <164@ncis.tis.llnl.gov> mcb@ncis.tis.llnl.gov (Michael C. Berch) writes:
> > We were sitting around jawing about conversion to the Domain Name
> > System and how many sites are still using ".UUCP" or other name schemes, 
> > and how many sites don't put out legal Message-IDs at all.
> 
> Wait a sec.  If you're talking about the Message-ID in news postings, I
> have to disagree with your concept of illegal.  A legal Message-ID is one
> that uniquely identifies a particular news posting, does not conflict with
> any other site's Message-IDs, and doesn't cause the news software to barf.
> It is only used for article identification and has nothing to do with addresses
> or domains.  You'll notice that my From: says lmb@vicom.COM and the Message-ID
> is <nnnn@vicom.COM>.  This is only because I went in and hacked news source to
> produce this.  Unmodified, it generated <nnnn@vsi1.COM>, using our uucp node
> name and appending our domain.  That was a perfectly legal ID (I changed it
> for esthetic reasons).  If a site wants to generate IDs like <xAqd-froo%baz>
> that's perfectly ok, provided they don't reuse that string.  The software
> doesn't care.

Well, yes and no.  Here's what the standard says (M. Horton & R. Adams, 
Standard for the Interchange of USENET Messages, RFC-1036, December 1987):

   "2.1.5.  Message-ID

    The "Message-ID" line gives the message a unique identifier.  The
    Message-ID may not be reused during the lifetime of any previous
    message with the same Message-ID.  (It is recommended that no
    Message-ID be reused for at least two years.)  Message-ID's have the
    syntax:

                     <string not containing blank or ">">

    In order to conform to RFC-822, the Message-ID must have the format:

                          <unique@full_domain_name>

    where full_domain_name is the full name of the host at which the
    message entered the network, including a domain that host is in, and
    unique is any string of printing ASCII characters, not including "<"
    (left angle bracket), ">" (right angle bracket), or "@" (at sign).
    [...]"

My interpretation of this, and I believe it to be the general sense of
the community, is that (1) the software as it presently exists will 
accept the weaker first form as a legal Message-ID, but (2) as a
matter of policy all Message-IDs should conform to RFC-822, which
requires a domain specification.  This latter requirement is
especially valuable in preserving Message-IDs across news/mail links
(particularly important to those of us who do mailing-list/Usenet
gatewaying), and for the transport of news other than by UUCP. 
Furthermore, as a matter of common sense, it is apparent that the only
guarantee of uniqueness of a Message-ID is for each host to insert a
unique host identifier.  The only such unique identifier is a full
domain name (including hosts in the UUCP pseudo-domain who have
registered names in the UUCP map), since they are assigned by an
external source.  Otherwise, what is to prevent two hosts from both
using "<xAqd-froo%baz>" as a Message-ID?
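
The two forms quoted from RFC-1036 above can be told apart mechanically.  The
following is a minimal Python sketch, not anything taken from the news
software.  The names WEAK_RE, DOMAIN_RE and classify_msgid are hypothetical,
"blank" is read as any whitespace, and the domain pattern (dot-separated
labels of letters, digits and hyphens, at least one dot) is an assumption of
the sketch rather than a rule stated in the RFC text.

```python
import re

# Weaker form: <string not containing blank or ">">
WEAK_RE = re.compile(r"<[^\s>]+>")

# Stricter form: <unique@full_domain_name>, where unique is printing ASCII
# without "<", ">" or "@".  The character class spells that out as the
# ranges "!".."; ", then "=", "?", and "A".."~".
DOMAIN_RE = re.compile(r"<[!-;=?A-~]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+>")

def classify_msgid(msgid):
    """Label a Message-ID as 'domain-form', 'weak-form' or 'malformed'."""
    if DOMAIN_RE.fullmatch(msgid):
        return "domain-form"
    if WEAK_RE.fullmatch(msgid):
        return "weak-form"
    return "malformed"
```

Under this sketch <nnnn@vsi1.COM> is domain-form, <xAqd-froo%baz> is only
weak-form, and an ID missing its angle brackets is malformed.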

Section 2.1.5 of RFC-1036 goes on to admonish programmers not to make
unwarranted assumptions about the content or syntax of Message-IDs; if
I were writing news software I would certainly not have it bounce anything
that didn't have a domain spec, but I felt perfectly free in doing so
for the purpose of the domain census.  The fact that under 3% of the IDs
(1,901 of 64,937) turned up as "illegal" under this metric proves the point,
I think. (A number of the illegal IDs were in fact badly-formed under
the weak standard in that they did not begin and end with "<" and ">".)

A couple people wrote and asked me why I used the Message-IDs instead
of the From: line.  The intent of the exercise was a quick first-order
approximation of the state of domain conversion in Usenet. My awk
script took about 3 minutes to run through the 64K history file.  
Using the From: line would have either meant bashing 64,000 inodes or
writing something to collect From: lines as articles arrived (probably
the *right* way to do it); in either case I was unwilling to wait so
long for the results, which I doubt would have been significantly
different.

Michael C. Berch  
mcb@ncis.llnl.gov / uunet!ncis.llnl.gov!mcb

rayan@ai.toronto.edu (Rayan Zachariassen) (04/30/89)

In article <53391@uunet.UU.NET> rick@uunet.UU.NET (Rick Adams) writes:
# A valid message ID is 3 components:
# 	"anything but whitespace or @" "@" "valid hostname"
# 
# See RFC822, which the USENET specification defers to in many cases.
# 
# You can have any kind of garbage that you want to the left of the
# at sign, but you MUST have an @ and you must have a valid hostname

RFC822:

	msg-id = "<" addr-spec ">"

	addr-spec = local-part "@" domain

	local-part = word *("." word)

	word = atom / quoted-string

	atom = 1*<any CHAR except specials, SPACE, and CTL>

	specials are: ()<>@,;:\".[]
	they must be within a quoted-string to occur in a word.

	quoted-string = <"> *(qtext/quoted-pair) <">

	qtext = <any CHAR excepting <">,     ; => may be folded
                     "\" & CR, and including
                     linear-white-space>

	quoted-pair = "\" CHAR                     ; may quote any char

The news rfc seems rather different than '822 in its description of the
requirement.  Which takes precedence?
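
One way to see the difference concretely is to turn the grammar above into a
checker.  This is a minimal Python sketch under stated simplifications: the
domain is taken as dot-separated atoms (domain-literals in square brackets
are left out), folding and comments are ignored, and non-ASCII input inside
quoted strings is not rejected.  The names SPECIALS, ATOM, QUOTED and
is_rfc822_msgid are hypothetical.

```python
import re

# specials from RFC822; they may appear in a word only inside a quoted-string
SPECIALS = '()<>@,;:\\".[]'

# atom = 1*<any CHAR except specials, SPACE, and CTLs>
ATOM_CHARS = "".join(chr(c) for c in range(0x21, 0x7F)
                     if chr(c) not in SPECIALS)
ATOM = "[" + re.escape(ATOM_CHARS) + "]+"

# quoted-string = <"> *(qtext / quoted-pair) <">   (simplified, no folding)
QUOTED = r'"(?:[^"\\\r]|\\.)*"'

WORD = "(?:" + ATOM + "|" + QUOTED + ")"
LOCAL_PART = WORD + r"(?:\." + WORD + ")*"
# Simplification: domain as atoms joined by dots, no domain-literals.
DOMAIN = ATOM + r"(?:\." + ATOM + ")*"

MSG_ID = re.compile("<" + LOCAL_PART + "@" + DOMAIN + ">")

def is_rfc822_msgid(msgid):
    """True if msgid fits the (simplified) RFC822 msg-id grammar."""
    return MSG_ID.fullmatch(msgid) is not None
```

For example, <a(b@host.com> satisfies the news rfc's description of "unique"
(no "<", ">" or "@") but fails here because "(" is a special outside a
quoted-string, while <"a(b"@host.com> passes.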

rayan

rick@uunet.UU.NET (Rick Adams) (04/30/89)

A valid message ID is 3 components:
	"anything but whitespace or @" "@" "valid hostname"

See RFC822, which the USENET specification defers to in many cases.

You can have any kind of garbage that you want to the left of the
at sign, but you MUST have an @ and you must have a valid hostname

--rick

rick@uunet.UU.NET (Rick Adams) (04/30/89)

The news rfc specifies that:

1) All article headers must conform to rfc-822

2) the news rfc may be MORE restrictive than rfc-822

3) if there is a conflict, rfc-822 wins.

brad@looking.UUCP (Brad Templeton) (04/30/89)

And while we're on the subject, I would like to put in a plea to people
to keep those message-ids short, if possible.   In the old days, it
was <sequence@site> where the sequence was usually no more than 5 digits.

Nowadays I see things like: <Apr.28.00.38.03.1989.23256@NET.BIO.NET>
Not to complain about bio.net in particular -- a fair number of sites are
doing this sort of thing.

You may feel that if your site uses such IDs, it's not a problem, but when
lots of people do it, it adds up.

Not only does a typical net message contain a Reference line with several
IDs, but each site likes to keep a database with these IDs as keys that
goes back far longer than news is kept.

In the case of the References line, that may only be 60 extra bytes per
message, but at 2500 messages/day, that's 150K extra per day, or about
$1500 in transmission cost daily.  Not because of just you, but everybody.

So, if you are writing message-id code that doesn't use a sequence, try
to keep it small.  Encode things like dates & process-ids using the 92 safe
printable characters, for example (no > or @)
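
As a rough illustration of that suggestion, here is a minimal Python sketch
of packing a timestamp and a process-id into a short string.  The layout
(five base-92 digits of seconds, then the process-id) and the names ALPHABET,
encode, decode and make_unique are hypothetical.  The alphabet follows the
count of 92 given above (printable ASCII without space, ">" or "@"); a
stricter reading of the RFC-1036 text quoted earlier in the thread would
also drop "<".

```python
# Printable ASCII without space is 0x21..0x7E, 94 characters; dropping
# ">" and "@" leaves 92.
ALPHABET = "".join(chr(c) for c in range(0x21, 0x7F) if chr(c) not in ">@")
BASE = len(ALPHABET)

# The arithmetic from the post: 60 extra bytes on each of 2500 messages.
EXTRA_BYTES_PER_DAY = 60 * 2500   # 150000 bytes, i.e. "150K"

def encode(n):
    """Encode a non-negative integer in base 92, most significant digit first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while True:
        n, r = divmod(n, BASE)
        digits.append(ALPHABET[r])
        if n == 0:
            break
    return "".join(reversed(digits))

def decode(s):
    """Inverse of encode()."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n

def make_unique(timestamp, pid):
    """Hypothetical layout: timestamp padded to 5 digits, then the pid."""
    if not 0 <= timestamp < BASE ** 5:
        raise ValueError("timestamp does not fit in 5 base-92 digits")
    # Fixed width for the timestamp keeps the two fields unambiguous.
    return encode(timestamp).rjust(5, ALPHABET[0]) + encode(pid)
```

With a made-up timestamp of 609741483 and process-id 23256 the result is 8
characters long, against 26 for "Apr.28.00.38.03.1989.23256".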
-- 
Brad Templeton, Looking Glass Software Ltd.  --  Waterloo, Ontario 519/884-7473

wcs) (05/01/89)

In article <53391@uunet.UU.NET> rick@uunet.UU.NET (Rick Adams) writes:
]	"anything but whitespace or @" "@" "valid hostname"
]See RFC822, which the USENET specification defers to in many cases.
]You can have any kind of garbage that you want to the left of the
]at sign, but you MUST have an @ and you must have a valid hostname

What about sites that don't have "valid hostname"s, in the sense of
being either a registered Internet node or a .UUCP uunet subscriber?
Sure, you can say "well go register your node with somebody", but
meanwhile, if your machine is 
	randompc
and you get your newsfeed from your nextdoor neighbor, what should
you use?

The implication of the RFC1036 excerpt was that it was nice to be
RFC822 compliant but not strictly required as long as you kept the
delimiters straight.
-- 
# Bill Stewart, AT&T Bell Labs 2G218 Holmdel NJ 201-949-0705 ho95c.att.com!wcs
# also found at 201-271-4712 tarpon.att.com!wcs 

# welcome, to mars, eh?  You hosers have a brew and some donuts!

andys@ulysses.homer.nj.att.com (Andy Sherman) (05/10/89)

In article <89Apr29.175638edt.39756@neat.ai.toronto.edu> rayan@ai.toronto.edu (Rayan Zachariassen) writes:
>
>   The news rfc seems rather different than '822 in its description of the
>   requirement.  Which takes precedence?

822 sets the standard for *mail*.  The news rfc sets the standard for *news*.

-- 
Andy Sherman/AT&T Bell Laboratories/Murray Hill, NJ           *NEW ADDRESS*
AUDIBLE:  (201) 582-5928                                      *NEW PHONE*
READABLE: andys@ulysses.ATT.COM  or att!ulysses!andys         *NEW EMAIL*
The views and opinions are my own.  Who else would want them? *OLD DISCLAIMER*

soley@moegate.UUCP (Norman S. Soley) (05/10/89)

In article <11505@ulysses.homer.nj.att.com> andys@ulysses.homer.nj.att.com (Andy Sherman) writes:
>In article <89Apr29.175638edt.39756@neat.ai.toronto.edu> rayan@ai.toronto.edu (Rayan Zachariassen) writes:
>>
>>   The news rfc seems rather different than '822 in its description of the
>>   requirement.  Which takes precedence?
>
>822 sets the standard for *mail*.  The news rfc sets the standard for *news*.

Since mailing articles around is one of the possible transport mechanisms 
for news 822 should take precedence. 


-- 
Norman Soley - The Communications Guy - Ontario Ministry of the Environment
Until the next maps go out:	moegate!soley@ontenv.UUCP 
if you roll your own: 	uunet!{attcan!ncrcan|mnetor!ontmoh}!ontenv!moegate!soley
I'd like to try golf but I just can't bring myself to buy a pair of plaid pants

henry@utzoo.uucp (Henry Spencer) (05/11/89)

In article <297@moegate.UUCP> soley@moegate.UUCP (Norman S. Soley) writes:
>Since mailing articles around is one of the possible transport mechanisms 
>for news 822 should take precedence. 

I fear this is a non sequitur.  Using mail as a transport mechanism tends
to require wrapping the news articles up sufficiently so that mailers will
not mess with them.  The result is that it really doesn't matter what their
headers look like, for this purpose.
-- 
Mars in 1980s:  USSR, 2 tries, |     Henry Spencer at U of Toronto Zoology
2 failures; USA, 0 tries.      | uunet!attcan!utzoo!henry henry@zoo.toronto.edu