[comp.sources.d] compressing compressed stuff - alright for binaries

thorinn@diku.dk (Lars Henrik Mathiesen) (05/10/88)

Summary: Maybe it's not so bad to send compressed archives if the stuff would
have to be uuencoded anyway.

In article <21263@oliveb.olivetti.com> jerry@oliveb.olivetti.com (Jerry Aguirre) writes:
>Now compress is perfectly capable of telling you whether it's output was
>compressed or not.  But in this case by the time it tells you the input
>is already gone!  Doing this would require writing batch and compress
>output to files.  Then you could decide what to send based on whether
>compress helped or not.

This does not really help either: if the batch of news contains a mix of
compressed and uncompressed files, the overall compression factor just goes
down, and the result is still larger than the original, uncompressed files
would have been when batched and compressed together. The only way to avoid
this would be to make the sendbatch process able to recognize compressed files
and send them in a separate, uncompressed batch.
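
A rough sketch of that recognition step, in Python. The only hard fact used is
that compress(1) output begins with the two magic bytes 0x1f 0x9d; the names
looks_compressed() and split_batch(), and the idea of handing the batcher
(name, data) pairs in memory, are assumptions made for illustration and not
how sendbatch actually works.

```python
# Sketch only: partition articles into "plain" and "already compressed".
# 0x1f 0x9d is the magic number written by compress(1); everything else
# here (function names, in-memory input) is hypothetical.
COMPRESS_MAGIC = b"\x1f\x9d"


def looks_compressed(data: bytes) -> bool:
    """Return True if data begins with the compress(1) magic number."""
    return data.startswith(COMPRESS_MAGIC)


def split_batch(articles):
    """Split (name, data) pairs into (plain, compressed) lists.

    The plain list would go through the usual batch compression; the
    compressed list would be sent as a separate, uncompressed batch.
    """
    plain, compressed = [], []
    for name, data in articles:
        target = compressed if looks_compressed(data) else plain
        target.append((name, data))
    return plain, compressed
```

Note that this only catches raw compress output; a compressed file that has
been uuencoded for posting no longer starts with the magic bytes.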

By the way - my terminal goes weird when I read binary data in news.  So even
if the news software can handle it (which I don't know), I don't want people
to post raw binary files. The stuff that goes in a posting will have to be
encoded in some way. I just did some experiments with uuencode and btoa to see
if encoding ameliorates the effect of compressing twice. Every case ends with a
final compress to simulate the one in sendbatch; the saving from that final
compress is shown in parentheses.

Our kernel, compressed once and twice:
 408 -rwxr-xr-x  1 root       406528 Mar 13 19:52 /vmunix
 232 -rw-r--r--  1 thorinn    224983 May  9 17:03 vu.Z		( 44.7%)
 288 -rw-r--r--  1 thorinn    283993 May  9 17:33 vu.Z.Z	(-26.2%)
With btoa; the second line was compressed before encoding (saves 14.7%):
 328 -rw-r--r--  1 thorinn    323394 May  9 17:26 vu.e.Z	( 21.4%)
 280 -rw-r--r--  1 thorinn    275895 May  9 17:11 vu.Z.e.Z	(  3.2%)
With uuencode; the second line was compressed before encoding (saves 2.5%):
 280 -rw-r--r--  1 thorinn    278341 May 10 11:13 vu.u.Z	( 50.3%)
 280 -rw-r--r--  1 thorinn    271352 May 10 11:12 vu.Z.u.Z	( 12.5%)
(More sizes for comparison after the article)
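
For anyone who wants to check the arithmetic, here is a small Python sketch
that recomputes the percentages from the byte counts in the listings of this
article (the raw encoded sizes are the ones given after the signature). The
SIZES table and the saving() helper are names made up for this illustration;
the formula is simply 100 * (1 - after/before).

```python
# Byte counts copied from the ls listings in this article.
SIZES = {
    "vmunix": 406528, "vu.Z": 224983, "vu.Z.Z": 283993,
    "vu.e": 411500, "vu.Z.e": 284899,
    "vu.e.Z": 323394, "vu.Z.e.Z": 275895,
    "vu.u": 560129, "vu.Z.u": 310003,
    "vu.u.Z": 278341, "vu.Z.u.Z": 271352,
}


def saving(before: str, after: str) -> float:
    """Percentage saved going from file `before` to file `after`.

    A negative value means the file grew.
    """
    return round(100.0 * (1 - SIZES[after] / SIZES[before]), 1)
```

For example, saving("vu.Z", "vu.Z.Z") gives the growth from compressing twice,
and saving("vu.u.Z", "vu.Z.u.Z") gives what compressing before uuencode buys.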

I think that there are two interesting things to note about this:
  - The process of encoding `de-randomizes' the compress output enough that
the size does not grow when compressed again; it actually saves a little on
transmission costs to compress before encoding.
  - The btoa program (4 bytes into 5) gives relatively un-compressible output.
It does some modulo arithmetic, and compresses aligned 32-bit zeroes to one
byte. The output from uuencode, which just moves bits around to encode 3 bytes
in 4, is more compressible than the original - and uuencode uses less cpu!
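
To make the "3 bytes in 4" point concrete, here is a minimal Python sketch of
the core uuencode step. It only covers one 3-byte group; the per-line length
character and the begin/end lines of the real format are left out, and
uu_group() is a name invented for this example.

```python
def uu_group(three: bytes) -> str:
    """Encode exactly 3 bytes as 4 printable characters, uuencode style.

    The 24 input bits are cut into four 6-bit values and each value is
    offset by 32 (ASCII space). No arithmetic beyond shifting and masking
    is involved, so patterns in the input survive into the output.
    """
    if len(three) != 3:
        raise ValueError("uu_group expects exactly 3 bytes")
    n = int.from_bytes(three, "big")  # 24 bits, first byte most significant
    sextets = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    # Classic uuencode maps the value 0 to a space; some implementations
    # emit a backquote instead. This sketch keeps the space.
    return "".join(chr(32 + s) for s in sextets)
```

Each output character carries only 6 bits and the bytes keep their alignment,
which fits the observation that a later compress can win back the expansion.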

So, if you know that your binary is going to be transmitted in a compressed news
batch, just uuencode it - it's within 2.5% of the best you can do as regards
transmission costs, and it saves computing resources all over.
  I expect that much the same goes for the various compressed archives from
PCs: If they're uuencoded they won't cost more to transmit than the original
files uuencoded - but there are fewer bytes to download from the receiver's
news system to his PC.
  However, ASCII files should just be shar'ed - putting them in a compressed
archive and uuencoding that is very likely to cost more to transmit.
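
If you want to repeat the comparison on your own files, something like the
following Python sketch should do. Big caveat: Python has no LZW compress in
its standard library, so zlib (deflate) stands in for both compress steps
here, and the numbers will not match the ones above. uuencode_body() and
pipeline_sizes() are names made up for this sketch, and btoa is left out.

```python
import binascii
import zlib


def uuencode_body(data: bytes) -> bytes:
    """uuencode data in 45-byte lines, without the begin/end lines."""
    return b"".join(
        binascii.b2a_uu(data[i:i + 45]) for i in range(0, len(data), 45)
    )


def pipeline_sizes(data: bytes) -> dict:
    """Return sizes for the variants compared in this article.

    zlib is an assumed stand-in for compress(1); the last zlib.compress in
    each variant plays the role of the compress run by sendbatch.
    """
    once = zlib.compress(data)
    return {
        "raw": len(data),
        "Z": len(once),
        "u.Z": len(zlib.compress(uuencode_body(data))),
        "Z.u.Z": len(zlib.compress(uuencode_body(once))),
    }
```

Calling pipeline_sizes(open("/vmunix", "rb").read()) (or any other binary)
gives the four sizes to compare; "u.Z" against "Z.u.Z" is the interesting pair.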
--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark      [uunet!]mcvax!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.

Raw btoa sizes - compressing first saves 30.8% (fewer 32-bit zeroes).
 416 -rw-r--r--  1 thorinn    411500 May  9 17:13 vu.e
 288 -rw-r--r--  1 thorinn    284899 May  9 17:07 vu.Z.e
Raw uuencode sizes - compressing first saves 44.7%, just like the base files.
 560 -rw-r--r--  1 thorinn    560129 May 10 11:07 vu.u
 312 -rw-r--r--  1 thorinn    310003 May 10 11:07 vu.Z.u