[net.news] Usenet message content

adams@plx.UUCP (Robert Adams) (05/30/85)

     I remember a long discussion on the size of signatures and
the wasted resources on the net because of all this extra "junk".
I always thought that most of the space was being taken up by the 
gigantic headers on all messages.  To see if this was true, I wrote
a program that searched out everything in the news directories, parsed
the messages, and counted characters and lines.
     At our site, we keep 7 days of everything on-line and archive
what we wish to keep.  So, the 8 megabytes of usenet news tabulated
below covers 7 days.  "header lines" are those that are at the 
beginning of a message and have the form of "<symbol>: <text>" 
(the <symbol>s are dynamically added to the header table so what is
below is all of the header types found).  "inserted lines" are those 
that are after the header and before the signature and begin with a ">" 
(that is, all of the message over again stuff).  "signature lines" are 
those after the body of the message and after a line that is just "-- ".
This last test unfortunately doesn't catch enough of the signatures.
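
For anyone who wants to repeat the count at their own site, the three
tests described above can be sketched roughly as follows.  This is not
the original program, only a minimal reconstruction of the rules as
stated.  The name classify_message and the sample message are
hypothetical, and several details are assumptions: each line is charged
one extra character for its newline, the "-- " line itself is counted
as body, and continuation lines in headers are not handled.

```python
import re
from collections import Counter

# "<symbol>: <text>" at the start of a message; the exact pattern is an assumption.
HEADER_RE = re.compile(r"^([A-Za-z][A-Za-z0-9.-]*):(?: |$)")


def classify_message(text):
    """Count lines and characters per category for a single message."""
    lines = Counter()
    chars = Counter()
    headers = Counter()  # header name -> characters, built up as names are seen
    in_header = True
    in_signature = False
    for line in text.splitlines():
        size = len(line) + 1  # assumption: the newline is counted too
        if in_header:
            match = HEADER_RE.match(line)
            if match:
                kind = "header"
                headers[match.group(1)] += size
            else:
                in_header = False  # first non-header line ends the header
        if not in_header:
            if in_signature:
                kind = "signature"
            elif line == "-- ":
                in_signature = True
                kind = "body"  # assumption: the delimiter itself is body
            elif line.startswith(">"):
                kind = "inserted"
            else:
                kind = "body"
        lines[kind] += 1
        chars[kind] += size
    return lines, chars, headers


# Hypothetical message used only to exercise the function.
sample = (
    "From: adams@plx.UUCP (Robert Adams)\n"
    "Subject: Usenet message content\n"
    "\n"
    "> quoted line\n"
    "My reply.\n"
    "-- \n"
    "Robert Adams\n"
)
lines, chars, headers = classify_message(sample)
```

Summing the counters over every file in the news directories would give
totals of the kind shown below.  A signature that does not follow a
"-- " line falls into the body count, which is the weakness mentioned
above.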
	Anyway, here's the data.

files = 4063, lines = 197433, characters = 8169343
header lines = 59510, characters = 2380588, percent = 29%
signature lines = 9682, characters = 322370, percent =  4%
inserted lines = 18819, characters = 946019, percent = 12%
                                               percent    percent
            Header  occurrences  total chars  of headers  of total
       Relay-Version     4059      219190        9.2%       2.7%
     Posting-Version     4055      244616       10.3%       3.0%
                Path     4059      306508       12.9%       3.8%
                From     4060      148807        6.3%       1.8%
          Newsgroups     4059      117076        4.9%       1.4%
             Subject     4059      151294        6.4%       1.9%
          Message-ID     4059      119605        5.0%       1.5%
                Date     4059      133901        5.6%       1.6%
        Article-I.D.     4059       98697        4.1%       1.2%
              Posted     4059      129888        5.5%       1.6%
       Date-Received     4059      170427        7.2%       2.1%
          References     2217      113188        4.8%       1.4%
        Distribution     1288       22074        0.9%       0.3%
        Organization     3902      174832        7.3%       2.1%
               Lines     4059       35957        1.5%       0.4%
                Xref     1261       64983        2.7%       0.8%
             Control       57        1905        0.1%       0.0%
            Reply-To      934       39895        1.7%       0.5%
              Sender      401       10175        0.4%       0.1%
             Summary      399        5309        0.2%       0.1%
             Expires       13         462        0.0%       0.0%
         Followup-To       29         759        0.0%       0.0%
               Nf-ID       84        3695        0.2%       0.0%
             Nf-From       84        3690        0.2%       0.0%
            Keywords       87        2708        0.1%       0.0%
               Xpath       14         250        0.0%       0.0%
       Apparently-To        5         160        0.0%       0.0%
            Approved       24         627        0.0%       0.0%
         Really-From        6         390        0.0%       0.0%
               other        0           0        0.0%       0.0%
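
The two percent columns are each header's character count divided by
the header total and by the grand total.  A small sketch of that
arithmetic, using three rows copied from the table (the helper name
percent is hypothetical, and rounding to one decimal place is an
assumption that matches the figures shown):

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


TOTAL_CHARS = 8169343   # all characters in the 7-day sample
HEADER_CHARS = 2380588  # characters in header lines

# Rows copied from the table: header name -> total chars
rows = {"Relay-Version": 219190, "Posting-Version": 244616, "Path": 306508}

for name, count in rows.items():
    of_headers = percent(count, HEADER_CHARS)
    of_total = percent(count, TOTAL_CHARS)
    print(f"{name:>20} {count:10d} {of_headers:10.1f}% {of_total:9.1f}%")
```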

	The observation is that nearly one third (29%) of the data we
have received is message header.  Almost 4% of what we receive is that
relative address in the Path header line, followed closely by all
of that useless information in the Posting-Version and Relay-Version
headers.  It also seems that far too much space is taken up by the
inserted lines.  I was also surprised to see that the average message
length is about 2000 characters.  I never noticed that much useful
information in each and every message.
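
The per-message averages follow directly from the totals given with the
data.  A short worked check, using only those totals (the variable
names are mine):

```python
FILES = 4063            # messages found
LINES = 197433          # total lines
CHARS = 8169343         # total characters
HEADER_CHARS = 2380588  # characters in header lines

avg_chars = CHARS / FILES           # characters per message
avg_lines = LINES / FILES           # lines per message
avg_header = HEADER_CHARS / FILES   # header characters per message

print(f"average message: {avg_chars:.0f} characters, {avg_lines:.1f} lines")
print(f"average header:  {avg_header:.0f} characters")
```

So of the roughly 2000 characters in an average message, a bit under
600 are header.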
	  Suggestions for improvements to news software:
1) Find a way to compress header information.  Cut characters out 
of dates, compact the version info in the Posting-Version and
Relay-Version headers.  Anything would help.
2) Add software assists to make referencing past articles easier.
Either commands to quickly go back to the previous article for
review or possibly an "insert that other article here for the
reader" type meta line.

-- 
...!{ucbvax,decwrl}!sun!plx!adams	-- Robert Adams