farren@gethen.UUCP (Michael J. Farren) (11/15/87)
Once again, I see many articles flying about debating the relative merits of compressing or ARCing binaries before posting them vs. just uuencoding them and letting them fly. This time (as I did one time before in comp.sys.ibm.pc), I decided to test the theories and get some real numbers to show which would be more efficient. The results I got were fairly surprising, and not at all intuitive. The data will follow this note, for those of you with compulsive needs to check up :-) My test was fairly simple. I picked 12 files more-or-less at random, trying to get a mix of text and binary, long and short files. The smallest file was 1807 bytes, the largest 144784. For the first test, I uuencoded the files and then compressed the resulting text file. For text files, I simply compressed them. This corresponds, roughly, to the process a 'normal' posting would go through in its transmission throughout the net. (Assumptions: most sites do compressed batched transmission, and most sites use the standard 16-bit compression scheme). For the second test, I compressed the files first, and then uuencoded the resulting binary file. I then compressed the uuencoded file again, as this would likely happen during transmission anyway. For the third test, I ran all of the files through PKARC and ARC 5.21, and uuencoded the arc files produced. These were then compressed again. For the final test, I created a shar file from the unchanged text files and compressed/uuencoded versions of the binary files, which was then compressed. As a control, I created a shar file from text files and uncompressed uuencoded binaries, and compressed that. Pure text files, of course, did much better if they were not compressed beforehand. The overhead of uuencoding a compressed text file clearly outweighs any supposed benefit of compressing them first. Binary files, however, showed a distinct size reduction if compressed first, and then uuencoded, even if they were compressed again in the uuencoded form. This surprised me some, but there it is. The difference, especially for larger files, is significant, and leads me to believe that large binaries would fare better if compressed first, then uuencoded, then posted. The situation with the ARC and PKARC programs is much more clear. My data show that 'shar'ing the text files and compressed/uuencoded binaries produces some savings, but the test setup is artificial. Since the two cases where ARC is generally used are, first, when a large binary accom- panied by some text is being posted, and, second, when a large number of text files are being posted, I would think that using ARC or its variants is never a clear winning situation, as the 16-bit compression commonly found on Unix systems (and available for MS-DOS, as well) clearly is more efficient than the 12 or 13-bit compression available in ARC or PKARC. In short, don't use ARC - it doesn't gain that much, and could lose a lot. And never, never, never use ARC to post text files - that is a clear loss. Test Results file original uuencoded, compressed, compressed size compressed uuencoded, only (text) compressed a 7143 6227 5362* 3875 b 17060 13646 12396 9404 c 3040 1846 1770* n/a d 1807 1963 1818* 1304 e 45750 33812 29368 23353 f 4217 4022 3732* 2692 g 8832 5696 5380 n/a h 72036 46275 43104 n/a i 83924 57939 52592 n/a j 6840 5080 4774 n/a k 120728 93875 86997 n/a l 144784 106359 97227 n/a * - these files would not compress successfully the second time, as they would have expanded. The expansion amount, though, was only 3% for the small binary file (all of the text files were so clearly better served with no processing whatsoever), and so I have dis- regarded it. Larger files all show compression. ARC test results (All 12 files were archived) Method ARCed (shared) uuencoded/compressed ARC 5.21 296254 352549 PKARC 3.5 281234 342931 shar (1) 709793 364469 shar (2) 420973 337847 (1) - Text files untouched, binaries uuencoded (2) - Text files untouched, binaries compressed then uuencoded -- ---------------- Michael J. Farren "... if the church put in half the time on covetousness unisoft!gethen!farren that it does on lust, this would be a better world ..." gethen!farren@lll-winken.arpa Garrison Keillor, "Lake Wobegon Days"
jaw@aurora.UUCP (James A. Woods) (11/18/87)
# Oh, the shark has pretty teeth, dear--
And he shows them pearly white
Just a jackknife has Macheath, dear--
And he keeps it out of sight. [Bertolt Brecht, Threepenny Opera]
Now that Michael Farren has shown the diminished role for ARC within
USENET, let's once again return to standard Unix tools.
[Aside: However, the analysis is further complicated by realizing
that Telebit modems now do "compress -b11" in firmware.
I presume that big-CPU sites feeding similarly capable machines
would favor the more efficient software compress code, and turn
the modem feature off. But smaller nodes could use the setting,
and not worry too much when the modem is fed random bits, since
expansion would never happen if the modem double buffers.]
Though I'm not a binary-posting freak (Unix is supposed to free us from
these shackles, right?), maybe it's time to revive the idea of 'shark',
now circulating on Compuserve (and discussed in comp.sources long ago).
It does decompression, de-archiving, and binary restoration without
unpleasant argument over the format, because it's self-decoding.
The receiver only needs a shell, a C compiler, and 'tar' (or Gilmore's
PD 'tar'). Last I heard, Macs, Amigas, Ataris and MS-DOS boxes all
qualified. Warning: it does the compression *after* archiving (much
easier), so Farren's statistics might not be the same. The decoder
length is short enough to amortize the xmission benefit at 10KB or so.
A simple Unix-only application of 'shark' is a mailer which splits up
large compressed packages for sending, where the recipient just
accumulates pieces with the mail 'w' command, then runs 'sh':
#!/bin/sh
m=$1; shift
shark $* | split -800 - /tmp/shark$$
n=`ls /tmp/shark$$* | wc -l | sed 's/ *//'`
p=0
for f in `ls /tmp/shark$$*`
do
p=`expr $p + 1`
mail -s "bundle from '`whoami`' ($p of $n)" $m < $f
done
rm /tmp/shark$$*
(The 'split' guarantees pieces < 64K.) Usage: "sharkmail address files..."
Again, because of the statistics controversy, don't use it for sending
files to USENET, unless you're sure of the total system effect. In case
you've forgotten, here's the rest:
------------
echo '#!/bin/sh
#"Copyright (C) 1988 by James A. Woods. All rights reserved."
#"Cleverly he dialed from within." -- D. Van Vliet, "Trout Mask Replica"
PATH=$PATH:.
trap "rm -f zcat atob az$$*;exit" 0 1 2 3 15
echo decoding...
cat>az$$.c<<\_'
cat<<'TROUSER PRESS'
#include<stdio.h>
#define d define
#d I if
#d E else
#d W while
#d C char
#d L long
#d U unsigned
#d B(n) (1<<(n))
#d K bcount
#d G getchar()
#d P putchar
#d X return
#d H(n) ((C*)h)[n]
#d V(a,b) (B(b)-1&a)
#d Q 256
#d Y w=w*85+c-33,
#d Z w=Q;W(w--)t[w]=0
#d x break;
#d F {X(write(2,"shar botch: resend\n",19));}
L K,o,v,w,c,m,M=B(16),f;int i,q,b,n,k=16,e=128,j,O,S;U short t[69001];FILE*T;U
L h[69001];C D[255];d(c)int c;{I(c=='z'){I(K)F E{c=4;W(c--)y(0);}}E I(c>32&&c<
118){I(!K)w=c-33,++K;E I(K<4)Y++K;E Y y(w>>24),y(w>>16),y(w>>8),y(w),w=K=0;}E
F;}y(l)L l;{c=(int)(l&255);o^=c;m+=c;m++;I(1<<31&f)f*=2,f++;E f*=2;f+=c;putc(c
,T);}main(g,v)C**v;{int c;L i,A,h,y,s,r;I(**v!=97)X(a());sprintf(D,mktemp(
"azXXXXXX"));I(T=fopen(D,"w+"))unlink(D);E F;W(1){I(!fgets(D,255,stdin))F;I(!
strcmp(D,"xbtoa Begin\n"))x}W((c=G)>=0){I(c==10)continue;E I(c=='x')x E d(c);}
I(scanf("btoa End N %ld %lx E %lx S %lx R %lx\n",&A,&h,&y,&s,&r)!=5)F;I(A!=h||
y!=o||s!=m||r!=f)F E{fseek(T,0L,0);i=A;W(i--)P(getc(T));}}L g(){C*p=D;I(j>0||O
>=S||f>m){I(f>m)m= ++n==k?M:B(n)-1;I(j>0)m=B(n=9)-1,j=0;S=fread(D,1,n,stdin);I
(S<1)X-1;O=0;S=S*8-n+1;}q=O;b=n;p+=q>>3;q&=7;c=V(*p++>>q,8-q);q=8-q;b-=q;I(b>7
)c|=(*p++&255)<<q,q+=8,b-=8;c|=V(*p,b)<<q;O+=n;X c;}a(){I(G!=31||G!=157)F;k=G;
e=k&128;k&=31;M=B(k);I(k>16)F;z();}z(){L w;C*s;m=B(n=9)-1;Z,H(w)=(C)w;f=e?257:
Q;i=o=g();I(o==-1)X;P((C)i);s=(C*)&H(B(16));W((w=g())>=0){I(w==Q&&e){Z;j=1;f=Q
;I((w=g())==-1)x}v=w;I(w>=f)*s++=i,w=o;W((U L)w>=(U L)Q)*s++=H(w),w=t[w];*s++=
i=H(w);do P(*--s);W(s>&H(B(16)));I((w=f)<M)t[w]=(U short)o,H(w)=i,f=w+1;o=v;}}
TROUSER PRESS
echo '_
(set|compress|btoa|atob|zcat)2>/dev/null 1>&2||
(cc -o zcat az$$.c;ln zcat atob)
(atob|zcat|tar xBvf -)<<\TROUSERS'
tar cBf - $*|compress|btoa
echo TROUSERS