adams@plx.UUCP (Robert Adams) (05/30/85)
I remember a long discussion on the size of signatures and
the wasted resources on the net because of all this extra "junk".
I always thought that most of the space was being taken up by the
gigantic headers on all messages. To see if this was true, I wrote
a program that searched out everything in the news directories, parsed
the messages, and counted characters and lines.
At our site, we keep 7 days of everything on-line and archive
what we wish to keep. So the 8 megabytes of Usenet news tabulated
below cover 7 days. "header lines" are those that are at the
beginning of a message and have the form of "<symbol>: <text>"
(the <symbol>s are dynamically added to the header table so what is
below is all of the header types found). "inserted lines" are those
that are after the header and before the signature and begin with a ">"
(that is, all of the message over again stuff). "signature lines" are
those after the body of the message and after a line that is just "-- ".
This last test unfortunately doesn't catch enough of the signatures.
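The classification described above can be sketched roughly as follows. This is a minimal Python sketch, not the original program: the function name `classify`, the header regular expression, and the choice to count one newline per line are all assumptions made for illustration. Continuation lines in headers are not handled.

```python
import re

# "<symbol>: <text>" at the start of a message. The exact pattern is an
# assumption; the post does not say how <symbol> was recognized.
HEADER_RE = re.compile(r"^([A-Za-z][A-Za-z0-9.-]*): ?(.*)$")


def classify(message):
    """Count [lines, characters] per category for one article.

    Returns (counts, header_table). header_table grows dynamically,
    one entry per header name seen, as described in the post.
    """
    counts = {name: [0, 0] for name in ("header", "inserted", "signature", "body")}
    header_table = {}
    state = "header"
    for line in message.splitlines():
        size = len(line) + 1  # assumption: the newline counts as a character
        if state == "header":
            match = HEADER_RE.match(line)
            if match is None:
                state = "body"  # first non-header line ends the header
        if state == "header":
            kind = "header"
            entry = header_table.setdefault(match.group(1), [0, 0])
            entry[0] += 1
            entry[1] += size
        elif state == "signature":
            kind = "signature"
        elif line == "-- ":
            kind = "body"  # assumption: the delimiter itself is not signature
            state = "signature"
        elif line.startswith(">"):
            kind = "inserted"
        else:
            kind = "body"
        counts[kind][0] += 1
        counts[kind][1] += size
    return counts, header_table
```

Running this over every file in the news spool and summing the results would give totals of the kind shown below. A signature that is not preceded by a "-- " line is counted as body text, which is the weakness noted above.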
Anyway, here's the data.
files = 4063, lines = 197433, characters = 8169343
header lines = 59510, characters = 2380588, percent = 29%
signature lines = 9682, characters = 322370, percent = 4%
inserted lines = 18819, characters = 946019, percent = 12%
Header          occurrences   total chars   % of headers   % of total
Relay-Version          4059        219190           9.2%         2.7%
Posting-Version        4055        244616          10.3%         3.0%
Path                   4059        306508          12.9%         3.8%
From                   4060        148807           6.3%         1.8%
Newsgroups             4059        117076           4.9%         1.4%
Subject                4059        151294           6.4%         1.9%
Message-ID             4059        119605           5.0%         1.5%
Date                   4059        133901           5.6%         1.6%
Article-I.D.           4059         98697           4.1%         1.2%
Posted                 4059        129888           5.5%         1.6%
Date-Received          4059        170427           7.2%         2.1%
References             2217        113188           4.8%         1.4%
Distribution           1288         22074           0.9%         0.3%
Organization           3902        174832           7.3%         2.1%
Lines                  4059         35957           1.5%         0.4%
Xref                   1261         64983           2.7%         0.8%
Control                  57          1905           0.1%         0.0%
Reply-To                934         39895           1.7%         0.5%
Sender                  401         10175           0.4%         0.1%
Summary                 399          5309           0.2%         0.1%
Expires                  13           462           0.0%         0.0%
Followup-To              29           759           0.0%         0.0%
Nf-ID                    84          3695           0.2%         0.0%
Nf-From                  84          3690           0.2%         0.0%
Keywords                 87          2708           0.1%         0.0%
Xpath                    14           250           0.0%         0.0%
Apparently-To             5           160           0.0%         0.0%
Approved                 24           627           0.0%         0.0%
Really-From               6           390           0.0%         0.0%
other                     0             0           0.0%         0.0%
The observation is that nearly one third (29%) of the data we have
received is message header. Almost 4% of what we receive is that
relative address in the Path header line, followed closely by all
of that useless information in the Posting-Version and Relay-Version
headers. It also seems that far too much space is taken up by the
inserted lines. I was also surprised to see that the average message
length is about 2000 characters. I never noticed that much useful
information in each and every message.
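The percentages and the average message length can be rechecked from the totals reported above. The short Python sketch below does only that arithmetic; the variable names are illustrative, and the numbers are copied from the post.

```python
# Totals as reported in the post.
total_chars = 8169343
files = 4063
categories = {"header": 2380588, "signature": 322370, "inserted": 946019}

# Share of all characters taken by each category, rounded to whole percent.
shares = {name: round(100 * chars / total_chars)
          for name, chars in categories.items()}

# Characters left over after headers, signatures and inserted lines.
remaining = total_chars - sum(categories.values())

# Average characters per article, headers included.
average_length = total_chars / files

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print(f"remaining characters: {remaining}")
print(f"average message length: {average_length:.0f} characters")
```

The rounded shares match the 29%, 4% and 12% figures, and the average comes out a little over 2000 characters per article, with the header counted as part of each article.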
Suggestions for improvements to news software:
1) Find a way to compress header information. Cut characters out
of dates, compact the version info in the Posting-Version and
Relay-Version headers. Anything would help.
2) Add software assists to make referencing
past articles easier: either commands to quickly go back to the
previous article for review, or possibly an "insert that other article
here for the reader" type of meta line.
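As a rough illustration of suggestion 1, the sketch below runs a header block through general-purpose compression with Python's `zlib` module. This is a swapped-in technique, not the field-by-field compaction proposed above, and it says nothing about what news software actually does. The sample header values and the function names `compress_header` and `expand_header` are invented for the example.

```python
import zlib

# Hypothetical header block in the style of the headers tabulated above.
# Every value here is invented for illustration.
SAMPLE_HEADER = (
    "Relay-Version: version X; site example.UUCP\n"
    "Posting-Version: version X; site example.UUCP\n"
    "Path: example!relay1!relay2!origin!user\n"
    "From: user@origin.UUCP (Some User)\n"
    "Newsgroups: net.news\n"
    "Subject: size of headers\n"
    "Message-ID: <1@origin.UUCP>\n"
)


def compress_header(header_text):
    """Compress a header block; returns bytes."""
    return zlib.compress(header_text.encode("ascii"), 9)


def expand_header(blob):
    """Inverse of compress_header; returns the original text."""
    return zlib.decompress(blob).decode("ascii")


if __name__ == "__main__":
    batch = SAMPLE_HEADER * 20  # many similar headers, as in a news batch
    packed = compress_header(batch)
    print(len(batch), "characters before,", len(packed), "bytes after")
```

Headers repeat the same field names and site names from article to article, so compressing a batch of articles together should do better than compressing each header alone. How much would be saved on real traffic is not something this sketch measures.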
--
...!{ucbvax,decwrl}!sun!plx!adams -- Robert Adams