[news.software.b] Why aren't new articles compressed?

rhg@cpsolv.UUCP (Richard H. Gumpertz) (10/27/89)

This is so obvious that it must have been asked before, but nobody I asked
seems to have an answer.  Please bear with me and answer it one more time.

My totally unscientific eye tells me that typical news articles are about
1000-2000 bytes long (with many smaller and many larger).  Anyway, this
would seem to indicate that compression might reduce the typical article
from 2 disk blocks (at 1K) to 1.  Bigger articles might do even better;
small articles would stay at one block.

Why aren't news articles compressed (and decompressed when read or forwarded)?
Does C news maybe compress them?  It seems like an effective halving of disk
usage would easily pay for the cycles needed to compress/uncompress.
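
A back-of-the-envelope sketch of the arithmetic, for what it's worth (the
2:1 ratio for compress(1) on plain text is an assumption, and the sizes
are made up):

/* Rough sketch, not part of any news software: how many 1K blocks an
 * article occupies before and after an assumed 2:1 compression ratio.
 */
#include <stdio.h>

#define BLOCK 1024L

static long blocks(long bytes)
{
    return (bytes + BLOCK - 1) / BLOCK;     /* round up to whole blocks */
}

int main(void)
{
    long sizes[] = { 600L, 1800L, 3500L, 20000L };  /* illustrative only */
    int i;

    for (i = 0; i < 4; i++) {
        long raw = sizes[i];
        long compressed = raw / 2;          /* assumed 2:1 ratio on text */
        printf("%6ld bytes: %ld blocks raw, %ld blocks compressed\n",
               raw, blocks(raw), blocks(compressed));
    }
    return 0;
}

The 1800-byte case is the interesting one: it drops from 2 blocks to 1,
while anything already under 1K saves nothing at all.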
-- 
==========================================================================
| Richard H. Gumpertz rhg@cpsolv.uu.NET -or- ...!uunet!amgraf!cpsolv!rhg |
| Computer Problem Solving, 8905 Mohawk Lane, Leawood, Kansas 66206-1749 |
==========================================================================

henry@utzoo.uucp (Henry Spencer) (10/27/89)

In article <431@cpsolv.UUCP> rhg@cpsolv.uucp (Richard H. Gumpertz) writes:
>Why aren't news articles compressed (and decompressed when read or forwarded)?
>Does C news maybe compress them?  It seems like an effective halving of disk
>usage would easily pay for the cycles needed to compress/uncompress.

No, we don't compress them.  In general, we didn't change the way news is
stored; we thought about a whole bunch of possible schemes and concluded
that none of them had enough advantages to be worthwhile.

Compressing lots and lots of small files is very expensive and the degree
of compression is not that impressive.  Admittedly, the quantized allocation
of disk space tends to magnify the effect for such small files, but it's
still a lot of work for limited gain.  It means that an article has to be
decompressed every time it is read, batched for transmission to another
site, or processed in any other way.  The performance impact, on a busy
machine, would be horrendous.  Our perception was that shortening expiry
times is generally a more cost-effective way of economizing on disk.

There is also a pragmatic issue in that it means modifying *all* the news
readers.  There are lots of those, many more than you'd think.
-- 
A bit of tolerance is worth a  |     Henry Spencer at U of Toronto Zoology
megabyte of flaming.           | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

rhg@cpsolv.UUCP (Richard H. Gumpertz) (10/29/89)

In article <1989Oct27.161920.5169@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>  Our perception was that shortening expiry
>times is generally a more cost-effective way of economizing on disk.

Why not make it an option that each site could choose to enable or disable
depending on the relative cost of disk sectors and CPU cycles at that site?

>There is also a pragmatic issue in that it means modifying *all* the news
>readers.  There are lots of those, many more than you'd think.

Readers would just be modified to look for either nnn or nnn.Z.  Not all that
major.  A given site would switch on compression only after the local readers
had all been fixed.  For small sites, where compression is most likely to
be valuable, there are probably few readers to fix.
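
To make that concrete, here is a minimal sketch of the fallback in mind;
it is not anything from C News or an existing reader, and the function
name is invented.  zcat(1) comes with compress(1).

/* Hypothetical: open article "nnn" if present, else "nnn.Z" via zcat. */
#include <stdio.h>

FILE *open_article(const char *path, int *piped)
{
    char cmd[1024];                 /* assumes a reasonably short path */
    FILE *fp = fopen(path, "r");

    *piped = 0;
    if (fp != NULL)
        return fp;                  /* plain article found */

    /* fall back to the compressed copy, decompressing on the fly */
    *piped = 1;
    sprintf(cmd, "zcat %s.Z", path);
    return popen(cmd, "r");         /* must be closed with pclose() */
}

int main(int argc, char **argv)
{
    FILE *fp;
    int piped, c;

    if (argc != 2) {
        fprintf(stderr, "usage: %s article-path\n", argv[0]);
        return 1;
    }
    if ((fp = open_article(argv[1], &piped)) == NULL) {
        perror(argv[1]);
        return 1;
    }
    while ((c = getc(fp)) != EOF)
        putchar(c);
    if (piped)
        pclose(fp);
    else
        fclose(fp);
    return 0;
}

The piped-versus-plain distinction is the sort of small wart every reader
would have to carry, which admittedly is part of the complexity Henry
describes.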
-- 
==========================================================================
| Richard H. Gumpertz rhg@cpsolv.uu.NET -or- ...!uunet!amgraf!cpsolv!rhg |
| Computer Problem Solving, 8905 Mohawk Lane, Leawood, Kansas 66206-1749 |
==========================================================================

brad@looking.on.ca (Brad Templeton) (10/30/89)

While compression could be good, it might not be as good as you think.

First of all, the average usenet article is 2K.  On a site with 2K blocks,
compression might do nothing for many of the files.  Of course, on a more
typical 1K block site, it could do very well.

But the biggest gain would come from big files.  For source postings, that
would be great.  For binaries (about 12% of the volume of the net) it might
not do as well, as many are already compressed, although expanded out a bit
with uuencoding.

More would be gained by keeping the articles in indexable archives of some
sort, possibly compressed as well.   Depending on block size, 20-30% of the
disk space in your spool is wasted due to block granularity.
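
One way to check that 20-30% figure on a particular system is just to add
up the unused tails of each article's last block.  A rough sketch (the 1K
block size is an assumption, and it ignores inodes and indirect blocks):

/* Estimate space lost to block granularity for the files in one
 * directory, e.g. a single group under /usr/spool/news.
 */
#include <stdio.h>
#include <dirent.h>
#include <sys/types.h>
#include <sys/stat.h>

#define BLOCK 1024L

int main(int argc, char **argv)
{
    const char *dir = (argc > 1) ? argv[1] : ".";
    char path[2048];
    struct dirent *d;
    struct stat st;
    long used = 0, allocated = 0;
    DIR *dp = opendir(dir);

    if (dp == NULL) {
        perror(dir);
        return 1;
    }
    while ((d = readdir(dp)) != NULL) {
        sprintf(path, "%s/%s", dir, d->d_name);
        if (stat(path, &st) == 0 && S_ISREG(st.st_mode)) {
            used += (long)st.st_size;
            allocated += ((st.st_size + BLOCK - 1) / BLOCK) * BLOCK;
        }
    }
    closedir(dp);
    if (allocated > 0)
        printf("%ld bytes of articles, %ld bytes allocated, %.1f%% waste\n",
               used, allocated, 100.0 * (allocated - used) / allocated);
    return 0;
}

Run against a big group's spool directory, it gives a direct measurement
of the granularity loss estimated above.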
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

rick@uunet.UU.NET (Rick Adams) (10/30/89)

I just ran this article size breakdown today (includes headers):

Kbytes  Count     %     Kbytes   Count    %
    1  14915  29.4%	   11     62   0.1%
    2  22971  45.3%	   12     57   0.1%
    3   6779  13.4%	   13     41   0.1%
    4   2636   5.2%	   14     29   0.1%
    5   1216   2.4%	   15     47   0.1%
    6    620   1.2%	   16     21   0.0%
    7    355   0.7%	   17     23   0.0%
    8    220   0.4%	   18     25   0.0%
    9    138   0.3%	   19     22   0.0%
   10    100   0.2%	>= 20    438   0.8%
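
Feeding those counts into a quick calculation gives a feel for how much
block size and compression would actually matter.  This is a sketch only:
it treats every article in bucket k as exactly k Kbytes, counts the
">= 20" bucket as 20K, and assumes a 2:1 compression ratio.

/* Rough totals derived from the bucket counts above. */
#include <stdio.h>

int main(void)
{
    static const long count[] = { 14915, 22971, 6779, 2636, 1216, 620,
                                  355, 220, 138, 100, 62, 57, 41, 29,
                                  47, 21, 23, 25, 22, 438 };
    long n = 0, raw = 0, twok = 0, packed = 0;
    int k;

    for (k = 1; k <= 20; k++) {
        long c = count[k - 1];
        n += c;
        raw    += c * k;                    /* 1K blocks, uncompressed   */
        twok   += c * ((k + 1) / 2) * 2;    /* 2K blocks, uncompressed   */
        packed += c * ((k + 1) / 2);        /* 1K blocks, 2:1 compressed */
    }
    printf("%ld articles\n", n);
    printf("1K blocks, uncompressed: %ld Kbytes\n", raw);
    printf("2K blocks, uncompressed: %ld Kbytes\n", twok);
    printf("1K blocks, compressed:   %ld Kbytes\n", packed);
    return 0;
}

Note that the first two buckets alone cover about three quarters of the
articles (29.4% + 45.3%), which is where the 1K-vs-2K block size question
above comes from.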

henry@utzoo.uucp (Henry Spencer) (10/30/89)

In article <432@cpsolv.UUCP> rhg@cpsolv.uucp (Richard H. Gumpertz) writes:
>>  Our perception was that shortening expiry
>>times is generally a more cost-effective way of economizing on disk.
>
>Why not make it an option that each site could choose to enable or disable
>depending on the relative cost of disk sectors and CPU cycles at that site?

Basically because we didn't have time to do everything, and we perceived
this one as having insufficient payoff to make up for the impact on
performance, compatibility, and complexity.  Almost any feature has some
chance of being useful to *someone*, but when one is not trying to solve
all the world's problems (which C News is not -- we'll settle for 90%),
including "just one more feature" is always a judgement call.
-- 
A bit of tolerance is worth a  |     Henry Spencer at U of Toronto Zoology
megabyte of flaming.           | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

amanda@intercon.com (Amanda Walker) (11/10/89)

I got it backwards...  Sigh.  It happens to all of us every now and
then, I guess :-).  All I can plead as an excuse is that I haven't sat
in front of a UNIX box to do anything more programming-related than fix
sendmail.cf for about a year now...  I've gotten used to putting smarts
into routine libraries instead of executables that you can pipe together
(just try piping stuff around the Macintosh Programmer's Workshop :-P).

Mea culpa :-).

--
Amanda Walker <amanda@intercon.com>