[news.software.b] Proposal: compressing news in the spool directories

paul@vixie.UUCP (04/15/87)

I'm willing to do some of the work for this, but I want to see what else
has been done or thought about it before I start.  So..........

What about compressing the news data in the spool directories?  If compress
can save half the transmission time, it ought to be able to save almost that
much in the storage costs as well.  It isn't quite the same win, since there
is the file system's frag size to consider -- but it's a win, just the same.
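
Just to put a number on the frag-size caveat (the sizes here are invented
for illustration, and I'm assuming 1K fragments, as on a typical 4.2BSD
file system):

	body as posted:        1800 bytes  ->  2 fragments (2048 bytes on disk)
	body after compress:   1100 bytes  ->  2 fragments (2048 bytes on disk)

The byte count drops about 40% but no fragment is freed; the saving only
shows up when compression carries an article across a fragment boundary,
which happens more dependably on longer articles.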

The hard part is the headers -- they should not be compressed because they
are examined independent of the (much longer) data quite often -- in expire,
in subject searches, etc.  In my view, the headers would be better left
uncompressed.  So we can either put compressed and uncompressed data in the
same file -- not simple, but possible -- or we can separate the data into
other files.  The latter aggravates the waste from the file system's frag
size, but I think we would gain more from the compression than we would
lose in the half-empty blocks this scheme would add.

Perhaps an additional header line is in order -- starting with 'X-' in the
tradition of lines which should not be passed out of the current system.
How about 'X-Data-File: xxx [-c]' where 'xxx' is the name of the file where
the data is stored, and '-c' would indicate whether the data is compressed?
Alternatively, we could use the 'magic number' of the data file to determine
whether it is compressed -- but uncompressed data starts with whatever bytes
the poster happened to type, so there is no reliable magic number to check,
and this could be problematic.
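
To make the 'X-Data-File:' idea concrete, a stored header file might end up
looking something like this (the data file name and exact syntax are only a
strawman):

	From: paul@vixie.UUCP (Paul Vixie Esq)
	Newsgroups: news.software.b
	Subject: Proposal: compressing news in the spool directories
	Message-ID: <536@vixie.UUCP>
	X-Data-File: 536.d -c

Software that knows about the header would open '536.d' and run it through
'compress -d' when '-c' is present; software that doesn't would see an
article with no text, which is the compatibility problem I come back to
below.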

The last architectural problem is in somehow tying the data files back to the
header files -- since the header files will inevitably become corrupted at
some point, 'orphaning' any associated data files.  They could be given
similar names -- perhaps the same 'article number'-style name, but in a
'.data/' directory?  This would remove some of the need for the 'X-Data-File:'
header, but there would still have to be something in the headers to tell
the various news reading and transmission programs that the data is elsewhere
-- otherwise they would think the article had no text.
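
Concretely, the spool area for a group might end up looking roughly like
this (paths and article numbers invented for illustration):

	/usr/spool/news/news/software/b/534          <- headers, uncompressed
	/usr/spool/news/news/software/b/535
	/usr/spool/news/news/software/b/536
	/usr/spool/news/news/software/b/.data/535    <- bodies, compressed
	/usr/spool/news/news/software/b/.data/536

expire could then remove '.data/535' in the same breath as '535', and a
cleanup pass could easily spot any '.data' file whose header has vanished.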

If every stored article had its data stored separately, this would not be a
problem; however, as a news administrator, I would prefer to have some things
compressed and others not -- mainly during the conversion, but there are
other possibilities.  A new flag in the LIBDIR/active file could say whether
to compress new articles added to a particular newsgroup.
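
For instance (the exact spelling of the flag is up for grabs), the active
file might grow one more field on the lines for groups whose new articles
should be compressed:

	comp.sources.unix 01234 01100 y c    <- compress new bodies here
	news.software.b   00536 00500 y      <- store these as we do now

inews could consult the field when it files an article; the readers would
still go by the per-article indication discussed above.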

It occurs to me that it would be a terrible waste of CPU time if the
articles were batched and compressed in the current format, only to be
uncompressed, sorted out, and then recompressed for storage.  A new batching
format is called for -- perhaps the headers and data could travel under
different covers, or, at a minimum, the '#! cunbatch nnn' lines could have
a second argument added: the length of the headers.  Leaving the first
argument as it is would keep the format usable by older software on the
receiving end, while newer software could run just the header portion
through compress and send the remainder directly to a file somewhere, to
be uncompressed later by the news readers.
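
A batch prefix line along these lines, say (the numbers are made up, and
old receiving software would simply ignore the extra argument):

	#! cunbatch 51724 1237

The first argument keeps whatever meaning it has today; the new second
argument says that the headers, compressed under their own cover, occupy
the first 1237 bytes.  A new-style receiver expands just that prefix to
recover the headers and copies the remaining bytes, still compressed,
straight into the data file.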

This may all be moot, given the onset of C news.  I don't know if C news
does this or not.

It's also possible that I'm overlooking something -- there could be very
good reasons why this scheme won't work, but if so, they elude me.

All the news readers would have to change -- I don't know if their various
implementors and maintainers would go for it.  I'm aware of 'readnews',
'vnews', 'rn', and 'vn'.  Then there's 'notesfiles' and all their myriad
gateway software.  Changing 'inews' and 'batch' starts to look easy in
comparison.  Perhaps this is why it hasn't been done yet?
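
For what it's worth, the change inside each reader might be fairly small.
Here is a sketch (only a sketch -- the 'X-Data-File:' header, the '-c'
flag, and running compress through popen() are the assumptions from my
proposal above, not anybody's real code):

/*
 * Open the body of an article whose headers are open on 'artfp',
 * honoring a hypothetical 'X-Data-File: <name> [-c]' header.
 */
#include <stdio.h>

FILE *
open_body(artfp, artdir)
FILE *artfp;		/* open article (header) file */
char *artdir;		/* directory the article lives in */
{
	char line[BUFSIZ], file[BUFSIZ], flag[BUFSIZ];
	char path[BUFSIZ], cmd[BUFSIZ];
	int n;

	/* scan the headers; they end at the first blank line */
	while (fgets(line, sizeof line, artfp) != NULL) {
		if (line[0] == '\n')
			break;
		n = sscanf(line, "X-Data-File: %s %s", file, flag);
		if (n >= 1) {
			(void) sprintf(path, "%s/%s", artdir, file);
			if (n == 2 && flag[0] == '-' && flag[1] == 'c') {
				/* compressed body: let compress expand it */
				(void) sprintf(cmd, "compress -d < %s", path);
				return popen(cmd, "r");
			}
			return fopen(path, "r");	/* plain separate body */
		}
	}
	/* no X-Data-File: header -- the body follows the blank line as now */
	return artfp;
}

Real code would have to remember to pclose() the popen()ed stream and cope
with over-long header lines, but it gives the flavor.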

If you reply via mail, let me know whether I can quote you in the summary
that I will post if I get lots of responses.  Somehow I think it's more
likely that this one will be either (a) shot down quickly by one of those
'good reasons' that are eluding me, or (b) hashed out severely in this and
other newsgroups.  Send mail if you want to, though -- I started it, so
I'll take on the task of summarizing your comments.
-- 
Paul A. Vixie        {ptsfa, crash, winfree}!vixie!paul
329 Noe Street       dual!ptsfa!vixie!paul@ucbvax.Berkeley.EDU
San Francisco        
CA  94116            paul@vixie.UUCP     (415) 864-7013

rob@briar.UUCP (Rob Robertson) (04/16/87)

There are several drawbacks to keeping news compressed inside the
spool directories.  The two big ones are CPU cycles and the fact that
compress only gives you real gains when what you are compressing is
reasonably sized.

On big files, such as source archives, compress will give you about
a 50-60 percent saving in space.  On an average-length article of
30-40 lines you'll get about a 20-30 percent saving (on just the
article body, not the header).  On a one-line article compression
actually loses you space, because of the overhead of the compression
tables.  The point is that you're not going to gain as much space
as you might be expecting.

Now for CPU cycles.  Compress, especially on big articles, is
computationally expensive.  Each time an article is read, you're
going to have to invoke compress to expand it.  For small articles
this is not that expensive, but for *.sources.* you're going to
notice it in both user wait time and CPU time.  Also, when batching
news you're going to have to uncompress the articles you're sending
to another site and recompress them (you want BIG batches to make
better use of compress, and you also want compatibility with sites
that do not use your method of compression).

This posting is not meant to dissuade you, just to point out some
design considerations.  There are situations where you might want
to do this: if you had severe disk limitations, a lot of unused CPU
cycles, and/or only a few news readers.

Correct me if I'm wrong :-).

rob
-- 
       william robertson			philabs!rob@seismo.css.gov
	 (914)  945-6300			philabs!rob.uucp

		"indecision is the key to flexibility"

joe@auspyr.UUCP (Joe Angelo) (04/17/87)

in article <536@vixie.UUCP>, paul@vixie.UUCP (Paul Vixie Esq) says:
> 
> What about compressing the news data in the spool directories?  If compress
> can save half the transmission time, it ought to be able to save almost that
> much in the storage costs as well.  It isn't quite the same win, since there
> is the file system's frag size to consider -- but it's a win, just the same.
> 


It's a great idea ... but on the other hand, is the decompression-to-read
time worth it?  Sure, compressing, batching, and then uncompressing would
waste a lot of CPU time *if* your proposal were implemented, but that
wasted time would be nothing measured against the CPU time of decompressing
the same message for 50 net-news readers (people) -- not to mention that
THEY have to wait for the decompression as well!  It seems to me that
decompressing 500 messages x 50 users uses more CPU time than decompressing
each of them once.

As things are going ... communication is certainly more expensive
(money-wise) than disk storage, so what's the problem?

-- 
"No matter      Joe Angelo, Sr. Sys. Engineer @ Austec, Inc., San Jose, CA.
where you go,   ARPA: aussjo!joe@lll-tis-b.arpa       PHONE: [408] 279-5533
there you       UUCP: {sdencore,cbosgd,amdahl,ptsfa,dana}!aussjo!joe
are ..."        UUCP: {styx,imagen,dlb,jmr,sci,altnet}!auspyr!joe

sundquis@umn-cs.UUCP (Tom Sundquist) (04/17/87)

In article <536@vixie.UUCP> paul@vixie.UUCP (Paul Vixie Esq) writes:
>
>The hard part is the headers -- they should not be compressed because they
>are examined independent of the (much longer) data quite often -- in expire,
>in subject searches, etc.  In my view, the headers would be better left
>uncompressed.  So we can either put compressed and uncompressed data in the

I computed a few statistics about news articles to help determine
how efficient such a compression scheme might be.
The average article size on our system was about 2600 bytes.
(This includes large archives in various ``source'' newsgroups.)
The average header size was about 550 bytes.
The average compression ratio of article bodies was roughly 50%,
i.e. bodies compressed to about half their original size.
(This did not include the large ``source'' articles.)
Hence the net compressed size would be about

	((2600 - 550) * 50% + 550) / 2600 = 61%

of the original.  I.e. keeping the headers uncompressed (which is
necessary) costs roughly 10 percentage points of compression compared
with the 50% you would get on bodies alone.  I don't know whether this
is fatal to the argument, but it needs to be considered ...

Tom Sundquist     sundquis@umn-cs.arpa     rutgers!meccts!umn-cs!sundquis

henry@utzoo.UUCP (Henry Spencer) (04/18/87)

> What about compressing the news data in the spool directories? ...

Apart from the issues of cpu time consumed, the slowdown in reading news,
etc., remember that compress does not do nearly as well on short files as
it does on long ones.  Most news articles are quite short.

> ... If compress
> can save half the transmission time, it ought to be able to save almost that
> much in the storage costs as well...

A quick experiment suggests that you'll be lucky to get 25% consistently.
Perhaps a smarter data-compression algorithm could do better, but *I* don't
want the job of inventing one.

> This may all be moot, given the onset of C news.  I don't know if C news
> does this or not.

Nope.  Geoff and I talked about this and other issues for a while.  After
considerable discussion and thought, we concluded that we did not know any
way of storing news articles that would be *decisively* superior to the
current one.  Most of the ideas we kicked around had disadvantages as well
as advantages, and we couldn't see anything that was enough of a net win to
be worth the trouble of changing over.

The specific problems of storing compressed articles (apart from the issue
of header searching, which isn't needed by C expire but would be significant
for the news readers) are the very modest saving in space for small files
when using current compression algorithms, the awkwardness of applying
general Unix tools to non-ASCII files, and above all the large overheads
it would add to news reception and reading.
-- 
"We must choose: the stars or	Henry Spencer @ U of Toronto Zoology
the dust.  Which shall it be?"	{allegra,ihnp4,decvax,pyramid}!utzoo!henry

sewilco@meccts.UUCP (04/19/87)

Most posters are assuming that "compression" requires the `compress`
program or its particular algorithm.  The design shouldn't build in that
assumption (except, perhaps, for someone who has already implemented the
idea using the L-Z algorithm :-).
-- 
Scot E. Wilcoxon   (guest account)  {ihnp4,amdahl,dayton}!meccts!sewilco
(612)825-2607           sewilco@MECC.COM            ihnp4!meccts!sewilco
	It may be the event of the century, but
	"Supernova 1987A" isn't a good catchword.