[news.admin] Compressing the news spool

nelson@sun.soe.clarkson.edu (Russ Nelson) (01/01/91)

Here's a suggestion for someone with time on their hands:

Some sites only let users read news through nntpd, so the actual
storage of the news articles is hidden from the user.  That being the
case, there is no reason why the news spool cannot be compressed.
There are at least three levels of compression:

  0) None.  Each article is stored in its own file, as is currently
     the case.

  1) The mere catenation of N articles into one file.  This saves
     space because most news articles occupy only a few disk blocks,
     so the unused part of the last block allocated to the article
     file is often a significant fraction of the article's size.
     (A packing sketch follows this list.)

  2) A restartable data compression algorithm is used to compress a
     level one file.  I posted my restartable Huffman decoder to
     alt.sources some while back.

  3) A different compression algorithm that runs slower but compresses
     more?
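
Just to make level 1 concrete, here's a rough sketch in C of a packer
that catenates article files into one pack file and prints the seek
offsets that would go into history.  The pack format is my own
invention for this sketch -- I've borrowed the "#! rnews <byte-count>"
framing from news batch files so that each article's start and length
can be recovered from a recorded offset.

    /*
     * pack.c -- sketch of a "level 1" packer: catenate article files
     * into one pack file, framing each article with a "#! rnews <bytes>"
     * line (as in news batches), and print the offset of each article.
     * The pack format is illustrative, not part of the proposal proper.
     *
     * Usage: pack packfile article ...
     */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        FILE *pack, *art;
        struct stat st;
        long offset;
        int i, c;

        if (argc < 3) {
            fprintf(stderr, "usage: %s packfile article ...\n", argv[0]);
            return 1;
        }
        if ((pack = fopen(argv[1], "w")) == NULL) {
            perror(argv[1]);
            return 1;
        }
        for (i = 2; i < argc; i++) {
            if (stat(argv[i], &st) != 0 || (art = fopen(argv[i], "r")) == NULL) {
                perror(argv[i]);
                continue;
            }
            offset = ftell(pack);              /* where this article starts */
            fprintf(pack, "#! rnews %ld\n", (long)st.st_size);
            while ((c = getc(art)) != EOF)     /* copy the article verbatim */
                putc(c, pack);
            fclose(art);
            /* this is the number that would follow the "/" in history */
            printf("%s -> %s/%ld\n", argv[i], argv[1], offset);
        }
        fclose(pack);
        return 0;
    }

Level 2 would run the same pack through a restartable compressor and
record new offsets, resetting the coder's state at each article
boundary so that decoding can begin at any recorded offset.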

Obviously an index is needed for levels one and two.  But the good
design of news's history file saves us: instead of creating a
separate index file, the group/number field stored in
${NEWSLIB}/history is converted into the index.  For example, after a
level-zero article is repacked into level one, its history entry
might read:

<1@foo.com>	888888888~77777777	compress#1.alt.foo/566567567

Here, compress#1 is the name of the file holding the compressed
articles, and the number following the / is the seek offset into that
file.

This particular representation might not work -- I don't know what
constraints there might be on the history file format.  But certainly
something can be worked out.
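
To show how such an entry would be used, here's a matching sketch of
the lookup side -- say, the bit of nntpd that fetches an article.  It
splits the history token at the "/", seeks to the offset, and reads
the article back out.  It assumes the "#! rnews <bytes>" framing from
the packer sketch above, which is my assumption and not part of the
history format itself.

    /*
     * unpack.c -- sketch of fetching one article from a pack file given
     * a history-style token such as "compress#1.alt.foo/566567567".
     * Assumes each article in the pack is preceded by a
     * "#! rnews <bytes>" line, as written by the packer sketch.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        char *slash, header[64];
        long offset, length, i;
        FILE *pack;
        int c;

        if (argc != 2 || (slash = strrchr(argv[1], '/')) == NULL) {
            fprintf(stderr, "usage: %s packfile/offset\n", argv[0]);
            return 1;
        }
        *slash = '\0';                       /* split "packfile/offset" */
        offset = atol(slash + 1);

        if ((pack = fopen(argv[1], "r")) == NULL) {
            perror(argv[1]);
            return 1;
        }
        if (fseek(pack, offset, SEEK_SET) != 0 ||
            fgets(header, sizeof header, pack) == NULL ||
            sscanf(header, "#! rnews %ld", &length) != 1) {
            fprintf(stderr, "bad offset or framing at %ld\n", offset);
            return 1;
        }
        for (i = 0; i < length && (c = getc(pack)) != EOF; i++)
            putchar(c);                      /* article goes to stdout */
        fclose(pack);
        return 0;
    }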

It would be reasonable to keep the most recent day's articles in
level 0 (since you're adding to them), yesterday's in level 1, and
the far, ancient past (two or more days ago :-) in level 2.
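
The compression scheduler could pick the target level straight from
an article's age.  A trivial sketch of that policy, with the one-day
and two-day thresholds from the paragraph above (the function name is
made up):

    #include <time.h>

    #define DAY (24L * 60L * 60L)

    /*
     * Map an article's last-modification time to a target level:
     * today's articles stay as plain files (level 0), yesterday's get
     * catenated (level 1), anything older gets compressed (level 2).
     */
    int target_level(time_t mtime, time_t now)
    {
        if (now - mtime < DAY)
            return 0;
        if (now - mtime < 2 * DAY)
            return 1;
        return 2;
    }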

Expire could be a problem.  However, the program that schedules
increasing levels of compression could take care to pack together
only articles with identical expire dates, so that a whole pack can
be thrown away at once.
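
One way to arrange that: have the scheduler sort its candidate
articles by expire date and only pack runs that share a date, so that
expire can later remove a whole pack with a single unlink.  A sketch,
with a made-up record type:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* hypothetical record for an article awaiting packing */
    struct cand {
        char   name[256];   /* path of the level-0 article file   */
        time_t expire;      /* its Expires: date (or the default) */
    };

    static int by_expire(const void *a, const void *b)
    {
        time_t ea = ((const struct cand *)a)->expire;
        time_t eb = ((const struct cand *)b)->expire;
        return (ea > eb) - (ea < eb);
    }

    /*
     * Sort candidates by expire date, then emit one pack per distinct
     * date; expire can remove the whole pack when that date passes.
     */
    void schedule_packs(struct cand *c, int n)
    {
        int i, start;

        qsort(c, n, sizeof *c, by_expire);
        for (start = 0; start < n; start = i) {
            for (i = start; i < n && c[i].expire == c[start].expire; i++)
                ;
            printf("pack %d articles expiring at %ld together\n",
                   i - start, (long)c[start].expire);
        }
    }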

Why did I think of this weirdo scheme?  Well, 1) I've got a fever, and
my brain's running amok to little purpose, 2) the daily volume is ever
marching onward, and 3) I've got this "little" Xenix system with four
processors, 32 MB of memory, and 600 MB of disk, but Xenix only gives
me 65535 inodes per partition.  So I have lots of disk space, but not
much to do with it past 65535 article references (remember,
cross-posting uses up an inode).

--
--russ (nelson@clutx [.bitnet | .clarkson.edu])  FAX 315-268-7600
It's better to get mugged than to live a life of fear -- Freeman Dyson
I joined the League for Programming Freedom, and I hope you'll join too.