[news.software.nntp] Compressing the news spool

nelson@sun.soe.clarkson.edu (Russ Nelson) (01/01/91)

Here's a suggestion for someone with time on their hands:

Some sites only let users read news through nntpd.  Therefore, the
actual storage of the news articles is hidden from the user.  So,
there is no reason why the news spool cannot be compressed.  There are
at least three levels of compression:

  0) None.  Each article is stored in its own file, as is currently
     the case.

  1) The mere catenation of N articles into one file.  This saves
     space because many news articles are only a small multiple of
     the disk block size, so the unused tail of the last block
     allocated to the article file is often a significant fraction of
     the file.  With 1 KB blocks, for instance, a 1.2 KB article
     occupies two blocks and wastes 800 bytes, 40% of its allocation.

  2) A restartable data compression algorithm is used to compress a
     level one file.  I posted my restartable Huffman decoder to
     alt.sources a while back.  (A sketch of levels one and two
     together follows this list.)

  3) A different compression algorithm that runs slower but compresses
     more?
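
To make levels one and two concrete, here is a rough sketch of a
packer.  It deflates each article into one pack file and does a full
flush at every article boundary, so decompression can restart cold at
any recorded offset.  zlib's deflate is used here purely as a
stand-in for the restartable Huffman coder, and the pack file name
and printed index lines are made up for illustration; link with -lz.

    /*
     * packz.c -- sketch of a level one/two pack: deflate articles
     * into a single file, full-flushing at each article boundary so
     * a reader can restart decompression at any recorded offset.
     */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    static unsigned char ibuf[8192], obuf[8192];

    int main(int argc, char **argv)
    {
        FILE *pack;
        z_stream zs;
        long off = 0;
        int i;

        if (argc < 3) {
            fprintf(stderr, "usage: %s packfile article ...\n", argv[0]);
            return 1;
        }
        if ((pack = fopen(argv[1], "wb")) == NULL) {
            perror(argv[1]);
            return 1;
        }
        memset(&zs, 0, sizeof zs);
        /* Raw deflate (negative windowBits): no stream header, so
         * inflate can be started cold at a full-flush boundary. */
        if (deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                         -15, 8, Z_DEFAULT_STRATEGY) != Z_OK)
            return 1;

        for (i = 2; i < argc; i++) {
            FILE *art = fopen(argv[i], "rb");
            size_t n;

            if (art == NULL) {
                perror(argv[i]);
                continue;
            }
            /* This offset is what would follow the '/' in the
             * history entry for this article. */
            printf("%s\t%s/%ld\n", argv[i], argv[1], off);
            while ((n = fread(ibuf, 1, sizeof ibuf, art)) > 0) {
                zs.next_in = ibuf;
                zs.avail_in = n;
                do {
                    zs.next_out = obuf;
                    zs.avail_out = sizeof obuf;
                    deflate(&zs, Z_NO_FLUSH);
                    off += fwrite(obuf, 1,
                                  sizeof obuf - zs.avail_out, pack);
                } while (zs.avail_out == 0);
            }
            fclose(art);
            /* Byte-align and reset the coder so the next article
             * is independently decompressible. */
            zs.avail_in = 0;
            do {
                zs.next_out = obuf;
                zs.avail_out = sizeof obuf;
                deflate(&zs, Z_FULL_FLUSH);
                off += fwrite(obuf, 1,
                              sizeof obuf - zs.avail_out, pack);
            } while (zs.avail_out == 0);
        }
        deflateEnd(&zs);
        fclose(pack);
        return 0;
    }

A real pack would also record each article's length (or a delimiter),
since a raw deflate stream with full flushes does not by itself mark
where one article's data ends.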

Obviously an index is needed for levels one and two.  But the good
design of news's history file saves us: instead of creating a
separate index file, the group/number field as stored in
${NEWSLIB}/history is converted into the index.  For example, after a
level zero article is changed into level one, its history entry might
read:

<1@foo.com>	888888888~77777777	compress#1.alt.foo/566567567

where compress#1 is the name of the file holding the packed articles,
and the number following the / is the seek offset into that file.

This particular representation might not work -- I don't know what
constraints there might be on the history file format.  But certainly
something can be worked out.
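
For illustration, the lookup side of that hypothetical field might
look like the sketch below: split at the final /, open the named pack
file, and seek to the offset.  Translating the dotted group name into
spool directories is glossed over, and the raw copy shown only suits
a level one pack; a level two reader would restart its decompressor
at the offset instead.

    /*
     * fetch.c -- sketch of resolving a history field such as
     * "compress#1.alt.foo/566567567" back into article text.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        char path[1024];
        char *slash;
        long off;
        FILE *fp;
        int c;

        if (argc != 2) {
            fprintf(stderr, "usage: %s packfile/offset\n", argv[0]);
            return 1;
        }
        strncpy(path, argv[1], sizeof(path) - 1);
        path[sizeof(path) - 1] = '\0';
        if ((slash = strrchr(path, '/')) == NULL) {
            fprintf(stderr, "%s: no /offset\n", argv[1]);
            return 1;
        }
        *slash = '\0';
        off = atol(slash + 1);

        if ((fp = fopen(path, "r")) == NULL) {
            perror(path);
            return 1;
        }
        if (fseek(fp, off, SEEK_SET) != 0) {
            perror(path);
            return 1;
        }
        /* Level one pack: copy raw bytes.  A real reader would stop
         * at a stored length or delimiter, not at end of file. */
        while ((c = getc(fp)) != EOF)
            putc(c, stdout);
        fclose(fp);
        return 0;
    }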

It would be reasonable to keep the most recent day's articles in level
0 (since you're still adding to them), yesterday's in level 1, and the
far, ancient past (two or more days ago :-) in level 2.
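
A minimal sketch of that scheduling pass, assuming a flat spool
directory and a one-day cutoff (both made up for illustration): list
the article files old enough to move to the next level, and pipe the
list into the packer above.

    /*
     * age.c -- sketch of the compression scheduler: list the plain
     * files in one spool directory not modified in the last day.
     */
    #include <stdio.h>
    #include <time.h>
    #include <dirent.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        DIR *dp;
        struct dirent *de;
        struct stat st;
        char path[1024];
        time_t cutoff = time(NULL) - 24L * 60 * 60;

        if (argc != 2 || (dp = opendir(argv[1])) == NULL) {
            fprintf(stderr, "usage: %s spooldir\n", argv[0]);
            return 1;
        }
        while ((de = readdir(dp)) != NULL) {
            snprintf(path, sizeof path, "%s/%s", argv[1], de->d_name);
            /* A file untouched for a day is a candidate for the
             * next compression level. */
            if (stat(path, &st) == 0 && S_ISREG(st.st_mode)
                && st.st_mtime < cutoff)
                printf("%s\n", path);
        }
        closedir(dp);
        return 0;
    }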

Expire could be a problem.  However, the program that schedules
increasing levels of compression could also group together only
articles with identical expire dates, so that a whole pack file can
be unlinked at once.

Why did I think of this weirdo scheme?  Well, 1) I've got a fever, and
my brain's running amok to little purpose, 2) the daily volume is ever
marching onward, and 3) I've got this "little" Xenix system with four
processors, 32 MB of memory, and 600 MB of disk, but Xenix only gives
me 65535 inodes per partition, so I have lots of disk space but not
much to do with it past 65535 article references (remember,
cross-posting counts up an inode).

--
--russ (nelson@clutx [.bitnet | .clarkson.edu])  FAX 315-268-7600
It's better to get mugged than to live a life of fear -- Freeman Dyson
I joined the League for Programming Freedom, and I hope you'll join too.

henry@zoo.toronto.edu (Henry Spencer) (01/02/91)

In article <NELSON.90Dec31223115@sun.clarkson.edu> nelson@clutx.clarkson.edu (aka NELSON@CLUTX.BITNET) writes:
>Expire could be a problem.  However, the program that schedules
>increasing levels of compression could also group together only
>articles with identical expire dates.

Unfortunately, this denies the sysadmin the freedom to change expiry policy
to meet space crises, which is important.
-- 
"The average pointer, statistically,    |Henry Spencer at U of Toronto Zoology
points somewhere in X." -Hugh Redelmeier| henry@zoo.toronto.edu   utzoo!henry

richard@locus.com (Richard M. Mathews) (01/03/91)

nelson@sun.soe.clarkson.edu (Russ Nelson) writes:
>Here's a suggestion for someone with time on their hands:

Here's another way to compress a news file system.  It doesn't require
NNTP (in fact, you should still be able to 'cd' to the news spool
directory and look around without noticing anything out of the
ordinary), but it does require a kernel with a file system switch.

Create a new file system type.  The file system code would automatically
compress files as they are written and uncompress as they are read.  You
can design a compression algorithm which is specially tailored to the
typical contents of news articles.  The block size of the file system
would be suitably small for typical compressed news articles, or you
could invent an allocation scheme that does not depend on blocking.
Since most files are opened, written, and closed in quick succession,
you shouldn't need to worry about the efficiency of arbitrary seeks,
holes in files, etc.  (I'm assuming that the $LIB directory is separate.)
You could even compress the directories by taking advantage of the fact
that most file names are small, consecutive numbers.  This file system
could work with B News or C News.
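
The kernel half is real work, but the read side of the idea can be
sketched at user level: hand back plain text from a compressed file,
which is what the proposed file system type would do on every read().
zlib's gzio layer stands in here for the news-tailored algorithm;
link with -lz.

    /*
     * zcatart.c -- user-level sketch of transparent decompression:
     * print an article whether or not it is stored compressed.
     */
    #include <stdio.h>
    #include <zlib.h>

    int main(int argc, char **argv)
    {
        gzFile zf;
        char buf[8192];
        int n;

        if (argc != 2) {
            fprintf(stderr, "usage: %s article\n", argv[0]);
            return 1;
        }
        /* gzopen passes uncompressed data through unchanged, so a
         * spool mixing packed and plain articles still works. */
        if ((zf = gzopen(argv[1], "rb")) == NULL) {
            perror(argv[1]);
            return 1;
        }
        while ((n = gzread(zf, buf, sizeof buf)) > 0)
            fwrite(buf, 1, n, stdout);
        gzclose(zf);
        return 0;
    }

The write half is symmetric: gzwrite() in place of write().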

1/2 :-)

Richard M. Mathews			 Freedom for Lithuania
richard@locus.com				Laisve!
lcc!richard@seas.ucla.edu
...!{uunet|ucla-se|turnkey}!lcc!richard