[news.admin] Proposal: compressing news in the spool directories

paul@vixie.UUCP (04/15/87)

I'm willing to do some of the work for this, but I want to see what else
has been done or thought about it before I start.  So..........

What about compressing the news data in the spool directories?  If compress
can save half the transmission time, it ought to be able to save almost that
much in the storage costs as well.  It isn't quite the same win, since there
is the file system's frag size to consider -- but it's a win, just the same.

The hard part is the headers -- they should not be compressed because they
are examined independent of the (much longer) data quite often -- in expire,
in subject searches, etc.  In my view, the headers would be better left
uncompressed.  So we can either put compressed and uncompressed data in the
same file -- not simple, but possible -- or we can separate the data into
other files.  This aggravates the waste from the file system frag size, but I
think we would gain more in the compression than we would waste in the half-
empty blocks we would add in this scheme.
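
To make the frag-size tradeoff concrete, here is a rough sketch in Python.
The 1024-byte frag size, the 50% compression ratio, and the article sizes
are all assumptions chosen for illustration, not measurements from any
real spool.

```python
import math

def frags_used(size_bytes, frag_size=1024):
    """Number of file system fragments needed to hold size_bytes."""
    return math.ceil(size_bytes / frag_size) if size_bytes > 0 else 0

def compare_storage(header_bytes, body_bytes, ratio=0.5, frag_size=1024):
    """Return (frags for one uncompressed file, frags for the split layout).

    The split layout keeps the headers uncompressed in one file and the
    body, compressed to `ratio` of its size (an assumed figure), in another.
    """
    single = frags_used(header_bytes + body_bytes, frag_size)
    compressed_body = math.ceil(body_bytes * ratio)
    split = frags_used(header_bytes, frag_size) + frags_used(compressed_body, frag_size)
    return single, split

print(compare_storage(600, 1200))  # short article: (2, 2), no gain
print(compare_storage(600, 8000))  # long article: (9, 5)
```

Under these assumptions the split costs nothing on short articles and
saves several frags on long ones, which is the shape of the argument above.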

Perhaps an additional header line is in order -- starting with 'X-' in the
tradition of lines which should not be passed out of the current system.
How about 'X-Data-File: xxx [-c]' where 'xxx' is the name of the file where
the data is stored, and '-c' would indicate whether the data is compressed?
Alternatively, we could use the 'magic number' of the data file to determine
its compression -- but uncompressed data can begin with arbitrary bytes, so
this may be problematic.
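
A minimal sketch of both approaches, in Python.  The header syntax is the
one proposed above; the function names are hypothetical.  The two magic
bytes (0x1f 0x9d) are the ones compress writes at the start of its output.

```python
COMPRESS_MAGIC = b"\x1f\x9d"  # first two bytes of compress(1) output

def parse_data_file_header(line):
    """Parse 'X-Data-File: xxx [-c]' into (filename, is_compressed).

    Returns None if the line is not an X-Data-File header or names no file.
    """
    key, sep, value = line.partition(":")
    if not sep or key.strip().lower() != "x-data-file":
        return None
    parts = value.split()
    if not parts:
        return None
    return parts[0], "-c" in parts[1:]

def looks_compressed(data):
    """Guess from the leading bytes whether data came out of compress.

    An uncompressed body that happens to start with the same two bytes
    would be misidentified, which is the weakness noted above.
    """
    return data[:2] == COMPRESS_MAGIC

print(parse_data_file_header("X-Data-File: 1234 -c"))
print(looks_compressed(b"Hello, world\n"))
```

The explicit '-c' flag costs a few bytes per article but never guesses;
the magic-number test needs no header support but can be fooled.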

The last architectural problem is in somehow tying the data files back to the
header files -- since the header files will inevitably become corrupted at
some point, 'orphaning' any associated data files.  They could be given
similar names -- perhaps the same 'article number'-style name, but in a
'.data/' directory?  This would remove some of the need for the 'X-Data-File:'
header, but there would still have to be something in the headers to tell
the various news reading and transmission programs that the data is elsewhere
-- otherwise they would think the article had no text.
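
The naming idea can be sketched as follows, in Python.  The '.data'
directory name comes from the paragraph above; the function names and the
orphan check are hypothetical illustrations, not part of any existing
news software.

```python
import os

def data_path(article_path, data_dir=".data"):
    """Map a header file such as 'news/admin/123' to 'news/admin/.data/123'."""
    head, name = os.path.split(article_path)
    return os.path.join(head, data_dir, name)

def find_orphans(group_dir, data_dir=".data"):
    """List data files in group_dir/.data with no header file of the same name."""
    ddir = os.path.join(group_dir, data_dir)
    if not os.path.isdir(ddir):
        return []
    return sorted(
        name for name in os.listdir(ddir)
        if not os.path.isfile(os.path.join(group_dir, name))
    )
```

Because the data file's name is derived from the article number, an
expire-style sweep could call something like find_orphans() on each
newsgroup directory and remove whatever it returns.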

If every stored article had its data stored separately, this would not be a
problem; however, as a news administrator, I would prefer to have some things
compressed and others not -- during the conversion, mainly, but there are
other possibilities.  A new flag in the LIBDIR/active file could tell whether
to compress new articles added to a particular newsgroup.
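
As a sketch of the per-newsgroup flag, in Python: this assumes only that
each active-file line starts with the newsgroup name and that the new flag
is appended as a final field 'c'.  Both the flag letter and its position
are hypothetical; the other fields in the sample lines are placeholders.

```python
def compress_groups(active_lines, flag="c"):
    """Return the set of newsgroups whose active-file line ends in `flag`.

    Assumes whitespace-separated fields with the group name first and the
    (hypothetical) compression flag as an extra trailing field.
    """
    groups = set()
    for line in active_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[-1] == flag:
            groups.add(fields[0])
    return groups

sample = [
    "news.admin 00123 00100 y c",  # compress new articles here
    "net.jokes 00456 00400 y",     # leave uncompressed
]
print(compress_groups(sample))
```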

It occurs to me that there would be a terrible waste of CPU time if the
articles were batched and compressed in the current format, only to be
uncompressed, sorted out, then recompressed for storage.  A new batching
format is called for -- perhaps the headers and data could travel under
different covers, or at a minimum, the '#! cunbatch nnn' lines could have
a second argument added -- the length of the headers.  Leaving the first
argument as it is would make the format portable to older software on the
receiving end -- but newer software could send the first 'nnn' bytes through
compress, yielding the headers, then send the remainder directly to a file
somewhere, to be uncompressed by the news readers.
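
Here is one way the receiving side might split such a batch, sketched in
Python.  The two-argument '#! cunbatch' line is the format proposed above,
not an existing one, and this reading takes the second argument to be the
number of leading bytes that carry the headers.  The sketch only splits
the bytes; it does not run compress.

```python
import io

def parse_batch_line(line):
    """Parse '#! cunbatch nnn [hhh]' into (total_len, header_len or None)."""
    parts = line.split()
    if len(parts) < 3 or parts[0] != "#!" or parts[1] != "cunbatch":
        raise ValueError("not a cunbatch line")
    total = int(parts[2])
    header_len = int(parts[3]) if len(parts) > 3 else None
    return total, header_len

def split_article(stream):
    """Read one article from a binary stream; return (header_part, body_part).

    With no second argument (old-style line) the whole payload is returned
    as the first element, so an older sender still works.
    """
    total, header_len = parse_batch_line(stream.readline().decode("ascii"))
    payload = stream.read(total)
    if header_len is None:
        return payload, b""
    return payload[:header_len], payload[header_len:]

batch = io.BytesIO(b"#! cunbatch 10 4\nHEADbody!!")
print(split_article(batch))  # (b'HEAD', b'body!!')
```

The header part would then go through uncompression and into the header
file, while the body part could be written to its data file untouched.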

This may all be moot, given the onset of C news.  I don't know if C news
does this or not.

It's also possible that I'm overlooking something -- there could be very
good reasons why this scheme won't work, but if so, they elude me.

All the news readers would have to change -- I don't know if their various
implementors and maintainers would go for it.  I'm aware of 'readnews',
'vnews', 'rn', and 'vn'.  Then there's 'notesfiles' and all their myriad
gateway software.  Changing 'inews' and 'batch' starts to look easy in
comparison.  Perhaps this is why it hasn't been done yet?

If you reply via mail, let me know whether I can quote you in the summary
that I will post if I get lots of responses.  Somehow I think it's more
likely that this one will be either (a) shot down quickly by one of those
'good reasons' that are eluding me, or (b) hashed out severely in this and
other news groups.  Send mail if you want to, though -- I started it, so
I'll take on the task of summarizing your comments.
-- 
Paul A. Vixie        {ptsfa, crash, winfree}!vixie!paul
329 Noe Street       dual!ptsfa!vixie!paul@ucbvax.Berkeley.EDU
San Francisco        
CA  94116            paul@vixie.UUCP     (415) 864-7013