[news.admin] summary of responses to my question on compressing /usr/spool/news

paul@vixie.UUCP (Paul Vixie Esq) (05/25/87)

Well, okay, this is a very late summary of the responses to my question,
"should we compress news in the spool directories?"  The answer, it seems,
is "no".  Although I feel like I must have been pretty ambitious that day
to propose such a thing, I remain convinced that it would be an idea worth
trying.  If I try it, I'll let everybody know how it works out.

Since the commonest response was "that would take all kinds of CPU time for
the news readers", let me deal with that.  It takes MUCH LESS TIME to
uncompress than to compress something.  Try 'zcat' on a very long file, 
and you'll see what I mean.
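
For instance (any big file will do; the names here are only examples):

	# compression is the slow direction; decompression is cheap
	time compress < somebigfile > /tmp/big.Z
	time zcat /tmp/big.Z > /dev/null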

The second-most-common response (subjectively) was "it would only be of
use on a small-disked, single-user system."  Well, that's me!  I have about
130Mb of disk, and 22Mb is taken up just keeping 7 days of news on line.
I'd like to have that other 10-12Mb for source code and other things ...

As for the complaint that compression doesn't help much with small files,
which most news articles are (small, that is), I don't know.  When you have
hundreds of files taking up tens of megabytes, a little bit would go a long
way.
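
One way to find out on your own spool -- the newsgroup path below is only
an example -- is to compress copies of a directory and compare the block
counts, which folds in the frag-size rounding:

	# blocks the articles use now, vs. blocks their compressed copies use
	mkdir /tmp/zspool
	cd /usr/spool/news/comp/sources/unix
	du -s .
	for f in [0-9]*
	do
		compress < $f > /tmp/zspool/$f.Z
	done
	du -s /tmp/zspool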

Here are the responses, with headers and signatures pruned:
---------------------------------------------------------------------------
Organization: MECC Technical Services, St. Paul, MN
Date: 16 Apr 87 10:22:46 CST (Thu)
From: sewilco@meccts.MECC.COM (Scot E. Wilcoxon)

I've been thinking about the same thing for a while.

I think the articles should be compressed by replacing words and
phrases with short codes.  The `compress` program is a good
general-purpose tool, but one oriented specifically toward English can
do a good (better?) job.  Also, the news-reading programs need a
format which can be decoded quickly and cheaply.
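
As a crude illustration of the idea -- the codes and file names below are
invented, not a proposal -- even sed can do this kind of substitution:

	# toy English-oriented coder: frequent words become two-byte codes
	# (a real coder would also have to escape any literal ~ in the text)
	sed -e 's/ the / ~t /g' \
	    -e 's/ and / ~a /g' \
	    -e 's/ that / ~h /g' < article > article.coded
	# decoding is the same set of substitutions run in reverse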

If the codes happen to be similar to the escape sequences for macro
expansion of a terminal (say, the NAPLPS macros), expansion of the
text could even be done by a terminal or PC program.

For batching of such articles, the "sys" file entry could specify if
the other site can handle articles with this new compression.  If this
is the case, a new program can handle it instead of 'cunbatch'.
-------------------------------------------------------------------------
Date: Thu, 16 Apr 87 08:48:21 EDT
From: John Owens <sun!xanth.cs.odu.edu!john>

The biggest problem I see with your solution is the incredible use of
CPU time this would take whenever anyone read news.  I know that
sometimes, when I'm on a fast terminal, I'll skim through some
newsgroups with the space bar when I'm in a hurry, seeing if anything
catches my eye.  I wouldn't be able to do this with your scheme - it
would take at least a second or two to uncompress an article on a
moderately-loaded system, and would take much too long, and add much
too much to the load, on a heavily loaded system.

Of course, if you're reading news on a private system, like a PC
running Xenix, this wouldn't be as important, and since these systems
are the ones most likely to have a severe shortage of disk space, this
could work out after all.  I would suggest having the compressed data
in the article, and piping that part of the file through
compress/uncompress as needed - two files don't buy you anything,
since this is as easy as piping the entire file.

An X-Compressed: header would be sufficient to mark compressed
articles, but please don't compress any article less than, say, 2 to 4
blocks!  Actually, you might want to have a configurable size beyond
which you compress the article - that would be more useful than a
per-newsgroup compression.  Each system could then decide what the
tradeoff point is - maybe only compress really large things, like
digests, source, and encoded binaries....
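
A rough sketch of that configurable threshold -- the names, path, and
number below are invented, not existing news code:

	# compress the stored copy only if the article is big enough to matter
	THRESHOLD=4096					# bytes; pick your own tradeoff point
	ART=/usr/spool/news/comp/unix/wizards/1234	# example article
	if [ `wc -c < $ART` -gt $THRESHOLD ]
	then
		compress $ART				# leaves 1234.Z in place of 1234
	fi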

I presume you're volunteering to do the work.... :-)
---------------------------------------------------------------------------
From: <ames!rutgers!cbmvax!snark!eric>
Date: 16 Apr 87 10:09:11 EST (Thu)

The news rewrite I've been doing (TMN netnews) allows you to flag a group
'compressed', in which case all its messages are automatically stored
compressed, and uncompressed when a reader wants to look at them.

Unfortunately, this turns out not to be a big win. Most articles aren't
large enough so that compression makes a big difference.

If you'd like more complete info or to join the beta site list, email me.
--------------------------------------------------------------------------
Date: Sat, 18 Apr 87 16:05:46 PST
From: ihnp4!ho95e!wcs
Organization: AT&T Bell Labs 46133, Holmdel, NJ

I've thought about that one a few times too, but haven't had the time
to do much with it.  I've seen 3 main problems with it; you've hit the
ones about headers and fragmentation.  The other problem is that,
purely aside from fragmentation issues, compress is much more
efficient on large files than on small ones, because it's seen a lot
more strings that it can compress into one word.  Here are some of my
suggestions:

	- headers make up a significant fraction of the average
	article, especially if it's been through a couple of network
	gateways.  It might be worth developing an abbreviated header
	format - put a magic word at the beginning (e.g. D-news), then
	use one-or-two-character abbreviations: Start the line with
	S instead of "Subject: ", etc.  (yes, you've got to teach all
	the newsreaders about it.)

	- Compress normally starts with a dictionary of 1-character
	strings; you could modify it to start with a large number of
	known strings, taken from typical news articles.  In
	particular, you should include all the standard header strings
	(Subject: ), the newsgroup names (comp.sys.ibm.pc), and a lot
	of the important machine names (!ihnp4, !ucbvax, .BITNET, .COM).
	You'd need a different magic word than compress normally uses.
	However, this does mean you don't have to change the header
	format any more, just know the words.  You also gain some
	efficiency, because you've learned information from longer
	files that you're able to use in shorter files as well.  (A
	crude way to estimate what this buys is sketched just after
	this list.)
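
A crude way to estimate what such a primed dictionary would buy, without
touching the compress source (the file names are placeholders): compress a
fixed "training" file of typical header and sitename strings by itself,
then with an article appended, and compare:

	compress < article  | wc -c			# 1: the article alone
	compress < training | wc -c			# 2: the training text alone
	cat training article | compress | wc -c		# 3: the two together
	# (3 minus 2) approximates what the article would cost if compress
	# started out already knowing the training strings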
----------------------------------------------------------------------------
From: joe@auspyr.UUCP
Date: 16 Apr 87 23:37:02 GMT
Organization: Austec, Inc., San Jose, CA. USA

In article <536@vixie.UUCP>, paul@vixie.UUCP (Paul Vixie Esq) says:
> What about compressing the news data in the spool directories?  If compress
> can save half the transmission time, it ought to be able to save almost that
> much in the storage costs as well.  It isn't quite the same win, since there
> is the file system's frag size to consider -- but it's a win, just the same.

It's a great idea ... but on the other hand, is the decompression-to-read
time worth it?  I mean, sure, compressing, batching, and then uncompressing
would waste a lot of CPU time *if* your proposal were implemented, but that
wasted time would be nothing when measured against the CPU time of
decompressing the same message for 50 net-news readers (people); not to
mention that THEY have to wait for the decompression as well!  It seems to
me that decompressing 500 messages X 50 users tends to use more CPU time
than decompressing all of them once.

As things are going ... communication costs are certainly more expensive
(money wise) than disk storage, so what's the problem? 
-----------------------------------------------------------------------------
From mit-eddie!husc6!linus!philabs!briar!rob Sun Apr 19 22:12:07 PST 1987

There are several drawbacks to keeping news compressed inside the
spool directories.  The two big ones are CPU cycles and that compress
only gives you real gains if what you are trying to compress is
reasonable sized.

On big files, such as source archives, compress will give you about
a 50-60 percent savings in space.  On an average-length article of
30-40 lines you'll get about a 20-30 percent savings (on just the
article, not the header).  On a one-line article you can actually
lose space, because of the overhead of the compression tables.  The
point being that you're not going to gain the amount of space you
might be expecting.
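
This is easy enough to check against your own spool (the newsgroup path is
only an example):

	cd /usr/spool/news/comp/sys/ibm/pc
	for f in [0-9]*
	do
		echo "$f: `wc -c < $f` bytes plain, `compress < $f | wc -c` compressed"
	done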

Now for CPU cycles.  Compress, especially with big articles, is
very expensive computationally.  Each time an article is read,
you're going to have to invoke compress.  For small articles, this
is not that expensive, but for *.sources.*, you're going to notice
it in both user wait time and cpu time.  Also, in batching news
you're going to have to uncompress the articles you're sending to
another site, and recompress them (you want BIG batches to make
better use of compress, and you also want compatibility with people
who do not use your method of compression).
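
Roughly what the batching step turns into if the spool copies are kept
compressed ('batchlist' is a stand-in for however the batcher lists its
outgoing articles, and the '#! rnews' separators a real batch needs are
left out):

	# every article has to come back to plain text before it can go into
	# the big, better-compressing batch
	for art in `cat batchlist`
	do
		zcat $art
	done | compress > batch.Z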

This posting is not meant to dissuade you, just to point out some
design considerations.  There are several situations where you
might want to do this: if you had severe disk limitations, a lot
of unused CPU cycles, and/or only a few net readers.

-- 
       william robertson			philabs!rob@seismo.css.gov
-----------------------------------------------------------------------------
From ucbvax!umnd-cs!umn-cs!sundquis Sun Apr 19 22:12:49 PST 1987

In article <536@vixie.UUCP> paul@vixie.UUCP (Paul Vixie Esq) writes:
>
>The hard part is the headers -- they should not be compressed because they
>are examined independent of the (much longer) data quite often -- in expire,
>in subject searches, etc.  In my view, the headers would be better left
>uncompressed.  So we can either put compressed and uncompressed data in the

I computed a few statistics about news articles to help determine how
efficient such a compression scheme might be.  The average article size on our
system was about 2600 b.  (This includes large archives in various ``source''
newsgroups).  The average header size was about 550 b.  The average
compression rate of article bodies was roughly 50%.  (This did not include
large ``source'' articles.)  Hence the net compression rate would be about

	((2600 - 550) * 50% + 550) / 2600 = 1575 / 2600 = 61%

I.e. keeping the headers uncompressed (which is necessary) costs about
10 percentage points of compression.  I don't know whether that
undermines the argument, but it needs to be considered...
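
The same figure, spelled out with bc:

	echo '((2600 - 550) * 0.5 + 550) / 2600' | bc -l
	# prints .6057..., i.e. the compressed spool would be about 61%
	# the size of the original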

Tom Sundquist     sundquis@umn-cs.arpa     rutgers!meccts!umn-cs!sundquis
---------------------------------------------------------------------------
From ptsfa!ames!rutgers!dayton!meccts!sewilco Sun Apr 19 22:13:12 PST 1987

Most posters are assuming that "compression" requires the `compress`
program or algorithm.  It should not be implied (except by someone who
has implemented the idea using the L-Z algorithm :-).
-- 
Scot E. Wilcoxon   (guest account)  {ihnp4,amdahl,dayton}!meccts!sewilco
----------------------------------------------------------------------------
Date: Mon, 20 Apr 87 13:33:23 est
From: sun!seismo!genat!methods!peter (Peter Blake    Mr Sys Adm)

I personally would not like to see the individual news articles stored
in compressed format for the following reasons:

	1)  It would be far too expensive, CPU-wise, to uncompress
	   an article for someone to read.  Consider a site that has
	   a large number of users, a decent percentage of whom read
	   the news.  Can you justify uncompressing the same article
	   at least 50 times in one day (especially if it's a biggie)
	   when that same site has oodles of free disk space?

	2)  Compressed files are not convenient to 'grep' through (this
	   is something I've done many times when I can remember the
	   group but not the article that something interesting was
	   mentioned in; see the sketch at the end of this message).

	3)  Increasing the number of files required to represent an
	   article (e.g. the article's header file and its compressed
	   data file) means that you'll need roughly twice as many
	   inodes on that file system.  Sometimes it is not possible
	   to increase the number of inodes on a file system
	   sufficiently; what does the SA do then?
	     I think that breaking up an article into two files will
	   dramatically increase the number of disk fragments as well.

Summary:  News articles should not be separated into a header file and a
	compressed data file because:
		- the increase in CPU expense is not justifiable.
		- eliminates easy and convenient text manipulation on articles.
		- increases the number of inodes required.
		- increases disk wastage due to fragmentation.

These were just a few points that come to mind.  The last two points are
very fresh in my mind, because I've just wasted a glorious day (which I
could have spent outside) rearranging my file systems in order to get
more inodes on my news file system (I had the free blocks but not the
free inodes).  If you have any questions regarding what I've said, please
feel free to ask.
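
(On point 2 above: with compressed bodies, that casual grep has to become
something like the loop below -- slower, and it misses the plain-text
headers unless they are searched separately.  The newsgroup path is only
an example.)

	cd /usr/spool/news/comp/unix/wizards
	for f in *.Z
	do
		if zcat $f | grep 'whatever I half-remember' > /dev/null
		then
			echo $f
		fi
	done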
----------------------------------------------------------------------------
Date: Tue, 21 Apr 87 09:48:28 EDT
From: ames!seismo!rochester!srs!matt (Matt Goheen)

Unless you have a lot of computer cycles to spare, I wouldn't recommend
this strategy.  Every time someone reads an article you would have to
uncompress the body of it (assuming you have figured out how to do the
header stuff correctly).  It's a matter of disk space vs. CPU cycles
and I think in this case, disk space wins hands down...
----------------------------------------------------------------------------
-- 
Paul A Vixie Esq
329 Noe Street       {ptsfa, crash, hoptoad, ucat}!vixie!paul
San Francisco        ptsfa!vixie!paul@ames.ames.arc.nasa.gov
CA  94116            paul@vixie.UUCP     (415) 864-7013