adams@plx.UUCP (Robert Adams) (05/30/85)
I remember a long discussion on the size of signatures and the resources wasted on the net because of all this extra "junk". I always thought that most of the space was being taken up by the gigantic headers on all messages. To see if this was true, I wrote a program that searched out everything in the news directories, parsed the messages, and counted characters and lines.

At our site, we keep 7 days of everything on-line and archive what we wish to keep. So the 8 megabytes of usenet news tabulated below is for 7 days.

"header lines" are those at the beginning of a message that have the form "<symbol>: <text>" (the <symbol>s are added to the header table dynamically, so what is below is all of the header types found). "inserted lines" are those after the header and before the signature that begin with a ">" (that is, all of the message-over-again stuff). "signature lines" are those after the body of the message and after a line that is just "-- ". This last test unfortunately doesn't catch enough of the signatures.

Anyway, here's the data.

files = 4063, lines = 197433, characters = 8169343
header lines = 59510, characters = 2380588, percent = 29%
signature lines = 9682, characters = 322370, percent = 4%
inserted lines = 18819, characters = 946019, percent = 12%

                                              percent   percent
Header           occurrences  total chars  of headers  of total
Relay-Version           4059       219190        9.2%      2.7%
Posting-Version         4055       244616       10.3%      3.0%
Path                    4059       306508       12.9%      3.8%
From                    4060       148807        6.3%      1.8%
Newsgroups              4059       117076        4.9%      1.4%
Subject                 4059       151294        6.4%      1.9%
Message-ID              4059       119605        5.0%      1.5%
Date                    4059       133901        5.6%      1.6%
Article-I.D.            4059        98697        4.1%      1.2%
Posted                  4059       129888        5.5%      1.6%
Date-Received           4059       170427        7.2%      2.1%
References              2217       113188        4.8%      1.4%
Distribution            1288        22074        0.9%      0.3%
Organization            3902       174832        7.3%      2.1%
Lines                   4059        35957        1.5%      0.4%
Xref                    1261        64983        2.7%      0.8%
Control                   57         1905        0.1%      0.0%
Reply-To                 934        39895        1.7%      0.5%
Sender                   401        10175        0.4%      0.1%
Summary                  399         5309        0.2%      0.1%
Expires                   13          462        0.0%      0.0%
Followup-To               29          759        0.0%      0.0%
Nf-ID                     84         3695        0.2%      0.0%
Nf-From                   84         3690        0.2%      0.0%
Keywords                  87         2708        0.1%      0.0%
Xpath                     14          250        0.0%      0.0%
Apparently-To              5          160        0.0%      0.0%
Approved                  24          627        0.0%      0.0%
Really-From                6          390        0.0%      0.0%
other                      0            0        0.0%      0.0%

The observation is that nearly one third (29%) of the data we have received is message header. Almost 4% of what we receive is the relative address in the Path header line, followed closely by all of that useless information in the Posting-Version and Relay-Version headers. It also seems that far too much space is taken up by the inserted lines.

I was also surprised to see that the average message length is about 2000 characters. I never noticed that much useful information in each and every message.

Suggestions for improvements to news software:

1) Find a way to compress header information. Cut characters out of dates, and compact the version info in the Posting-Version and Relay-Version headers. Anything would help.

2) Add software assists to make referencing past articles easier: either commands to quickly go back to the previous article for review, or possibly an "insert that other article here for the reader" type of meta line.

-- 
...!{ucbvax,decwrl}!sun!plx!adams
-- Robert Adams
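
For anyone who wants to repeat the count at their own site, the classification rules described in the post can be sketched roughly as follows. This is not the original program, which is not shown; it is a minimal Python sketch, and the names classify_article, percent, and HEADER_RE are hypothetical. It assumes that character counts include the trailing newline, that the "-- " delimiter line itself counts as a signature line, and that the first line not shaped like "<symbol>: <text>" ends the header. Continuation header lines are not handled.

```python
import re

# Assumed header shape: "<symbol>: <text>" (e.g. "Article-I.D.: plx.123").
HEADER_RE = re.compile(r"^([A-Za-z][A-Za-z0-9.-]*): ?(.*)$")


def classify_article(text):
    """Count lines and characters per category for one article.

    Returns (totals, headers):
      totals  maps category -> [lines, chars]
      headers maps header name -> [occurrences, chars]
    """
    totals = {name: [0, 0] for name in ("header", "inserted", "signature", "body")}
    headers = {}
    state = "header"
    for line in text.splitlines(keepends=True):
        bare = line.rstrip("\n")
        match = None
        if state == "header":
            match = HEADER_RE.match(bare)
            if match is None:
                state = "body"  # first non-header line ends the header
        if state == "body" and bare == "-- ":
            state = "signature"  # everything from here on is signature

        if state == "header":
            kind = "header"
            entry = headers.setdefault(match.group(1), [0, 0])
            entry[0] += 1
            entry[1] += len(line)
        elif state == "signature":
            kind = "signature"
        elif bare.startswith(">"):
            kind = "inserted"  # quoted text from an earlier article
        else:
            kind = "body"
        totals[kind][0] += 1
        totals[kind][1] += len(line)
    return totals, headers


def percent(part, whole):
    """Whole-number percentage, as in the summary lines of the post."""
    return round(100.0 * part / whole) if whole else 0
```

Summing the per-article results over every file in the news directories would give totals comparable to the ones above; for example, percent(2380588, 8169343) reproduces the 29% header figure.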