[news.misc] Compress binaries? Some real numbers.

farren@gethen.UUCP (Michael J. Farren) (11/15/87)

Once again, I see many articles flying about debating the relative merits
of compressing or ARCing binaries before posting them vs. just uuencoding
them and letting them fly.  This time (as I did once before in
comp.sys.ibm.pc), I decided to test the theories and get some real numbers
to show which would be more efficient.  The results I got were fairly
surprising, and not at all intuitive.  The data will follow this note, for
those of you with compulsive needs to check up :-)

My test was fairly simple.  I picked 12 files more-or-less at random,
trying to get a mix of text and binary, long and short files.  The
smallest file was 1807 bytes, the largest 144784.  For the first test, I
uuencoded the files and then compressed the resulting text file.  For text
files, I simply compressed them.  This corresponds, roughly, to what a
'normal' posting goes through in transit across the net.  (Assumptions:
most sites do compressed batched transmission,
and most sites use the standard 16-bit compression scheme).  For the
second test, I compressed the files first, and then uuencoded the
resulting binary file.  I then compressed the uuencoded file again, as
this would likely happen during transmission anyway.  For the third test,
I ran all of the files through PKARC and ARC 5.21, and uuencoded the ARC
files produced.  These were then compressed again.  For the final test, I
created a shar file from the unchanged text files and compressed/uuencoded
versions of the binary files, which was then compressed.  As a control, I
created a shar file from text files and uncompressed uuencoded binaries,
and compressed that.
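
For reference, the first two test pipelines look roughly like this (a
sketch with a made-up file name 'prog'; these are not Farren's exact
commands):

	# test 1: uuencode the binary, then compress the result
	uuencode prog prog > prog.uue
	compress -b16 prog.uue			# leaves prog.uue.Z

	# test 2: compress first, uuencode, then compress once more
	compress -b16 < prog > prog.Z
	uuencode prog.Z prog.Z > prog.Z.uue
	compress -b16 prog.Z.uue		# leaves prog.Z.uue.Z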

Pure text files, of course, did much better if they were not compressed
beforehand.  The overhead of uuencoding a compressed text file clearly
outweighs any supposed benefit of compressing it first.  Binary files,
however, showed a distinct size reduction if compressed first, and then
uuencoded, even if they were compressed again in the uuencoded form.  This
surprised me some, but there it is.  The difference, especially for larger
files, is significant, and leads me to believe that large binaries would
fare better if compressed first, then uuencoded, then posted.
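
On the receiving end, the compress-first scheme just reverses (a sketch,
using the same made-up names as above):

	uudecode prog.Z.uue		# recreates prog.Z
	uncompress prog.Z		# recreates prog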

The situation with the ARC and PKARC programs is much clearer.  My
data show that 'shar'ing the text files and compressed/uuencoded binaries
produces some savings, but the test setup is artificial.  ARC is generally
used in two cases: when a large binary accompanied by some text is being
posted, and when a large number of text files are being posted.  In
neither case is ARC or one of its variants a clear winner, since the
16-bit compression commonly found on Unix systems (and available for
MS-DOS as well) is clearly more efficient than the 12- or 13-bit
compression used by ARC and PKARC.  In short, don't use ARC - it doesn't
gain that much, and could lose a lot.  And never, never, never use ARC to
post text files - that is a clear loss.
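
If you want to see the 16-bit vs. 12-bit difference on your own data, a
quick sketch (file names are placeholders; -b sets the maximum code
width):

	compress -b16 < file > file.16		# standard Unix 16-bit LZW
	compress -b12 < file > file.12		# roughly ARC's 'crunch' width
	ls -l file.16 file.12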


Test Results

file  original  uuencoded,  compressed,  compressed
      size      compressed  uuencoded,   only (text)
                            compressed

a        7143      6227        5362*        3875
b       17060     13646       12396         9404
c        3040      1846        1770*         n/a
d        1807      1963        1818*        1304
e       45750     33812       29368        23353
f        4217      4022        3732*        2692
g        8832      5696        5380          n/a
h       72036     46275       43104          n/a
i       83924     57939       52592          n/a
j        6840      5080        4774          n/a
k      120728     93875       86997          n/a
l      144784    106359       97227          n/a

* - these files would not compress successfully the second time, as they
    would have expanded.  The expansion, though, was only 3% for the
    small binary file (and all of the text files were so clearly better
    served with no processing whatsoever), so I have disregarded it.
    The larger files all show compression on the second pass.

ARC test results

(All 12 files were archived)
Method    ARCed (shar'ed)  uuencoded/compressed

ARC 5.21      296254          352549
PKARC 3.5     281234          342931
shar (1)      709793          364469
shar (2)      420973          337847

(1) - Text files untouched, binaries uuencoded
(2) - Text files untouched, binaries compressed then uuencoded


-- 
----------------
Michael J. Farren      "... if the church put in half the time on covetousness
unisoft!gethen!farren   that it does on lust, this would be a better world ..."
gethen!farren@lll-winken.arpa             Garrison Keillor, "Lake Wobegon Days"

jaw@aurora.UUCP (James A. Woods) (11/18/87)

# Oh, the shark has pretty teeth, dear--
  And he shows them pearly white
  Just a jackknife has Macheath, dear--
  And he keeps it out of sight.	    [Bertolt Brecht, Threepenny Opera] 

Now that Michael Farren has shown the diminished role for ARC within
USENET, let's once again return to standard Unix tools.

[Aside:  However, the analysis is further complicated by the fact
that Telebit modems now do "compress -b11" in firmware.
I presume that big-CPU sites feeding similarly capable machines
would favor the more efficient software compress code, and turn
the modem feature off.  But smaller nodes could use the setting,
and not worry too much when the modem is fed random bits, since
expansion would never happen if the modem double buffers.]
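
To get a feel for what the firmware setting buys, you can approximate it
in software (a sketch; 'batch' stands in for a news batch file):

	compress -b11 < batch | wc -c		# about the modem's firmware limit
	compress -b16 < batch | wc -c		# full software compression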

Though I'm not a binary-posting freak (Unix is supposed to free us from
these shackles, right?), maybe it's time to revive the idea of 'shark',
now circulating on Compuserve (and discussed in comp.sources long ago).

It does decompression, de-archiving, and binary restoration without
unpleasant argument over the format, because it's self-decoding.
The receiver only needs a shell, a C compiler, and 'tar' (or Gilmore's
PD 'tar').  Last I heard, Macs, Amigas, Ataris and MS-DOS boxes all
qualified.  Warning: it does the compression *after* archiving (much
easier), so Farren's statistics might not be the same.  The decoder is
short enough that its fixed cost is amortized once a package reaches 10KB
or so of transmission.
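
Sending and receiving look roughly like this (a sketch; I'm inferring the
calling convention from the mailer below):

	shark file1 file2 > package	# self-decoding archive on stdout
	sh package			# receiver needs only sh, cc, and tar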

A simple Unix-only application of 'shark' is a mailer which splits up
large compressed packages for sending, where the recipient just
accumulates pieces with the mail 'w' command, then runs 'sh': 

	#!/bin/sh
	# sharkmail: mail a shark package in mailer-sized pieces
	m=$1; shift			# first argument is the recipient
	shark $* | split -800 - /tmp/shark$$
	n=`ls /tmp/shark$$* | wc -l | sed 's/  *//'`
	p=0
	for f in `ls /tmp/shark$$*`
	do
		p=`expr $p + 1`
		mail -s "bundle from '`whoami`' ($p of $n)" $m < $f
	done
	rm /tmp/shark$$*

(The 'split' guarantees pieces < 64K.)   Usage: "sharkmail address files..."
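
On the receiving end, the steps are roughly (a sketch; piece names are
whatever you save them as with mail's 'w' command):

	mail			# inside mail: w 1 p1 ... then w N pN
	cat p? | sh		# reassemble (up to nine pieces) and run
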
Again, because of the statistics controversy, don't use it for sending
files to USENET, unless you're sure of the total system effect.  In case
you've forgotten, here's the rest:

------------
echo '#!/bin/sh
#"Copyright (C) 1988 by James A. Woods.  All rights reserved."
#"Cleverly he dialed from within." -- D. Van Vliet, "Trout Mask Replica"
PATH=$PATH:.
trap "rm -f zcat atob az$$*;exit" 0 1 2 3 15
echo decoding...
cat>az$$.c<<\_'
cat<<'TROUSER PRESS'
#include<stdio.h>
#define d define
#d I if
#d E else
#d W while
#d C char
#d L long
#d U unsigned
#d B(n) (1<<(n))
#d K bcount
#d G getchar()
#d P putchar
#d X return
#d H(n) ((C*)h)[n]
#d V(a,b) (B(b)-1&a)
#d Q 256
#d Y w=w*85+c-33,
#d Z w=Q;W(w--)t[w]=0
#d x break;
#d F {X(write(2,"shar botch: resend\n",19));}
L K,o,v,w,c,m,M=B(16),f;int i,q,b,n,k=16,e=128,j,O,S;U short t[69001];FILE*T;U
L h[69001];C D[255];d(c)int c;{I(c=='z'){I(K)F E{c=4;W(c--)y(0);}}E I(c>32&&c<
118){I(!K)w=c-33,++K;E I(K<4)Y++K;E Y y(w>>24),y(w>>16),y(w>>8),y(w),w=K=0;}E
F;}y(l)L l;{c=(int)(l&255);o^=c;m+=c;m++;I(1<<31&f)f*=2,f++;E f*=2;f+=c;putc(c
,T);}main(g,v)C**v;{int c;L i,A,h,y,s,r;I(**v!=97)X(a());sprintf(D,mktemp(
"azXXXXXX"));I(T=fopen(D,"w+"))unlink(D);E F;W(1){I(!fgets(D,255,stdin))F;I(!
strcmp(D,"xbtoa Begin\n"))x}W((c=G)>=0){I(c==10)continue;E I(c=='x')x E d(c);}
I(scanf("btoa End N %ld %lx E %lx S %lx R %lx\n",&A,&h,&y,&s,&r)!=5)F;I(A!=h||
y!=o||s!=m||r!=f)F E{fseek(T,0L,0);i=A;W(i--)P(getc(T));}}L g(){C*p=D;I(j>0||O
>=S||f>m){I(f>m)m= ++n==k?M:B(n)-1;I(j>0)m=B(n=9)-1,j=0;S=fread(D,1,n,stdin);I
(S<1)X-1;O=0;S=S*8-n+1;}q=O;b=n;p+=q>>3;q&=7;c=V(*p++>>q,8-q);q=8-q;b-=q;I(b>7
)c|=(*p++&255)<<q,q+=8,b-=8;c|=V(*p,b)<<q;O+=n;X c;}a(){I(G!=31||G!=157)F;k=G;
e=k&128;k&=31;M=B(k);I(k>16)F;z();}z(){L w;C*s;m=B(n=9)-1;Z,H(w)=(C)w;f=e?257:
Q;i=o=g();I(o==-1)X;P((C)i);s=(C*)&H(B(16));W((w=g())>=0){I(w==Q&&e){Z;j=1;f=Q
;I((w=g())==-1)x}v=w;I(w>=f)*s++=i,w=o;W((U L)w>=(U L)Q)*s++=H(w),w=t[w];*s++=
i=H(w);do P(*--s);W(s>&H(B(16)));I((w=f)<M)t[w]=(U short)o,H(w)=i,f=w+1;o=v;}}
TROUSER PRESS
echo '_
(set|compress|btoa|atob|zcat)2>/dev/null 1>&2||
(cc -o zcat az$$.c;ln zcat atob)
(atob|zcat|tar xBvf -)<<\TROUSERS'
tar cBf - $*|compress|btoa
echo TROUSERS