[comp.sys.amiga] quantitative comparison of arc versus compress

kent@xanth.UUCP (Kent Paul Dolan) (04/21/87)

Amigans,

I ran an extensive series of tests comparing two ways of storing all the
files in a directory: every file run through compress (whether compress did
anything with it or not), or the same set of files arced, in single file
arcs for large files and in merged file arcs for small files.  Arc did very
well, and it looks like using arc where we have been using compress will
save us a lot of space in the public directories.  Here are the results:

               Files                                Storage blocks
               -----------------------------------  -------------------------  Percent
Directory      Pictures  Non-      Wouldn't  After  Original  After  After     saved
                         pictures  compress  arc    total     arc    compress  by arc

1) art            43         2         7        45    960       764    765       0.1
2) digitized      22         1        12         8    985       835    956      12.7
3) images         19         0         3        19    559       413    471      12.3
4) abasic          0        96         1        90    398       243    232      -4.7
5) abasic          0        96         1         3    398       201    232      13.3
6) amigabasic      0        13         0        13     58        39     39       0.0
7) amigabasic      0        13         0         1     58        32     39      17.9
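
The last column appears to be the blocks arc saved relative to compress, taken as a percentage of the compress total. That formula is an inference from the numbers, not something the post states. A minimal Python sketch under that assumption, using the block counts from the table:

```python
# Block counts copied from the table: (label, after arc, after compress, percent listed).
ROWS = [
    ("1) art",        764, 765,  0.1),
    ("2) digitized",  835, 956, 12.7),
    ("3) images",     413, 471, 12.3),
    ("4) abasic",     243, 232, -4.7),
    ("5) abasic",     201, 232, 13.3),
    ("6) amigabasic",  39,  39,  0.0),
    ("7) amigabasic",  32,  39, 17.9),
]


def percent_saved_by_arc(arc_blocks, compress_blocks):
    """Blocks saved by arc as a percentage of the compress total (assumed formula)."""
    return 100.0 * (compress_blocks - arc_blocks) / compress_blocks


if __name__ == "__main__":
    for label, arc, comp, listed in ROWS:
        computed = percent_saved_by_arc(arc, comp)
        print(f"{label:<14} computed {computed:6.2f}   listed {listed:5.1f}")
```

Under this formula every row agrees with the listed figure to within a tenth of a percent.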

Notes:

1) ~public/amiga/pictures/art files - even though a sixth of the files did
   not compress, this was about a tie.  Arc lost on the indices, but gained
   by squeezing the files compress couldn't make smaller.  This test was
   done before I added the latest pictures.

2) ~public/amiga/pictures/digitized files - these are mostly HAM files, and
   compress lost big by not finding a good technique.  Every file that was
   compressed was smaller than the corresponding arced file, but the losers,
   all of which arc squeezed, more than tipped the balance.

3) ~public/amiga/UNTESTED/taug/pd-12/images files - these are now in #1, above.
   Like the digitized files, the relatively large portion of files (again
   about 1/6) that compress couldn't improve tipped the overall balance.

4) ~public/amiga/abasic files - a group of 93 small sources plus 3 data files.
   This and the next are the same set of files.  Notice that the files are
   roughly 4K in size.  Arc packed the only file compress couldn't handle;
   on all the rest, arc and compress were using the same algorithm, so
   the results favored the scheme without the table of contents.  Note that
   there were two groups of files, one of three files and one of two, that
   were actually related, and had to be arced into single arc files to
   maintain the relationship.  There were also three other cases in which
   file names were identical in the first eight characters, so arc packed
   two files into one arc file.  That accounts for the "missing" six arc
   files.  The trouble with either scheme here is that the files are short
   relative to the size of the unused fragment in the last block of each
   file, so the next result rewards having fewer files.

5) Using the same files, the two natural groups were arced together, and the
   rest of the files were arced into a single arc file.  The result of
   cutting the fragmentation loss is an 18% swing in favor of arc over
   compress, which always compresses each file separately.

6) & 7)  ~public/amiga/amigabasic - a set of 13 short source files.
   The same trick works for a smaller set of amigabasic files; by putting
   them into one file, even at the cost of the table of contents and CRCs,
   arc saved enough on fragmentation to beat compress.  Even with separate
   files, arc ties compress, although compress compressed all 13 files here.
   I think the equality in 6 was a fluke; there is enough slop in the end
   block fragmentation to occasionally disguise a small effect like the arc
   table of contents; I think arc just lucked out on trial 6.
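
The fragmentation effect in notes 4 through 7 can be sketched numerically. The block size, the file sizes, and the per-member archive overhead below are all hypothetical (the post gives none of them); the point is only that each separate file rounds up to a whole block, while a merged archive rounds up once.

```python
BLOCK_SIZE = 1024  # bytes per storage block; an assumed value, not from the post


def blocks_needed(size_bytes, block_size=BLOCK_SIZE):
    """Whole blocks occupied by one file; the last block may be mostly empty."""
    return -(-size_bytes // block_size)  # integer ceiling division


def blocks_separate(sizes, block_size=BLOCK_SIZE):
    """Total blocks when every packed file is stored on its own."""
    return sum(blocks_needed(s, block_size) for s in sizes)


def blocks_merged(sizes, overhead_per_member=0, block_size=BLOCK_SIZE):
    """Total blocks for one archive holding all members plus per-member overhead."""
    total = sum(s + overhead_per_member for s in sizes)
    return blocks_needed(total, block_size)


if __name__ == "__main__":
    sizes = [1300] * 10  # ten hypothetical packed files of 1300 bytes each
    print(blocks_separate(sizes))                        # 20
    print(blocks_merged(sizes, overhead_per_member=30))  # 13
```

With these made-up sizes, ten separate files waste most of a block each, so the merged archive wins even after paying a header for every member.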
 
Discussion:  The advantages of arc are that it keeps related files together,
maintains a CRC for each file, saves on directory entries, especially on the
Amiga, where every file seems to need its own directory block, and makes
directories neater looking and easier to maintain.  Arc usually finds
compression methods for files on which compress fails to save space (always,
in the cases seen here).  Arc saves space overall.  A mix of arc and compress in
a single directory of large files might even save more space, but would make
file maintenance more of a hand chore.
 
The disadvantages of arc are that it comes from an MS-DOS environment, and so
cannot handle the longer file and directory names of the Unix (tm) and
AmigaDOS(tm) environments.  This is sometimes a problem; arc will not
correctly recognize an arc file with a root portion longer than 8 letters,
and will not store a file name internally longer than 16 characters.
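
One way to spot the eight-character problem ahead of time is to group file names by the first eight characters of the root and look for groups with more than one member. This is a sketch of that check only, not of anything arc itself does; the file names in the example are hypothetical.

```python
import os
from collections import defaultdict

ROOT_LIMIT = 8  # root-name length limit described in the post


def group_by_truncated_root(filenames, limit=ROOT_LIMIT):
    """Map each truncated root to the list of file names that share it."""
    groups = defaultdict(list)
    for name in filenames:
        root, _ext = os.path.splitext(name)
        groups[root[:limit]].append(name)
    return dict(groups)


def collisions(filenames, limit=ROOT_LIMIT):
    """Return only the truncated roots claimed by more than one file."""
    grouped = group_by_truncated_root(filenames, limit)
    return {root: names for root, names in grouped.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical names; the first two differ only after the eighth character.
    names = ["spacewar1.bas", "spacewar2.bas", "maze.bas"]
    print(collisions(names))
```

Files that collide this way would have to share one arc file, as happened in three cases in note 4.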
 
Arc is also slower to pack files, because it must try six or eight different
compression techniques before deciding on the one to use for a file, and
slower in unpacking a file, because it has a distributed table
of contents, and must walk the table and the files between the table entries,
to find a file.  (This latter is a supposition on my part because of the
way the disk drives behave when arc is running.)
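
The "try everything, keep the smallest" approach described above can be sketched as follows. The methods here are Python standard-library compressors used as stand-ins; they are not the techniques arc actually implements, and the helper names are made up for illustration.

```python
import bz2
import lzma
import random
import zlib

# Stand-in methods; "store" keeps the data unchanged.
METHODS = {
    "store": lambda data: data,
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def pick_best(data):
    """Run every method and return (name, packed size) of the smallest result.

    Ties go to the method listed first, so "store" wins when nothing helps.
    """
    results = {name: pack(data) for name, pack in METHODS.items()}
    best = min(results, key=lambda name: len(results[name]))
    return best, len(results[best])


def sample_noise(n_bytes, seed=0):
    """Reproducible pseudo-random bytes, used as an incompressible test input."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n_bytes))


if __name__ == "__main__":
    print(pick_best(b'PRINT "HELLO"\n' * 200))
    print(pick_best(sample_noise(2000)))
```

Every method runs over the whole file before one is chosen, which is the cost in packing time that the paragraph above describes.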

The advantages of compress are that it is faster and keeps full length file
names.  The latter advantage is moot for the environment in which we use
these files, because kermit is also limited in file name length, and it
becomes a real problem if the .Z is lost from a file name during transmission
from the VAX to an Amiga(tm), for compress then won't recognize the file.
Compress also keeps all the files easily accessible and visible, at the cost
of storage and directory entries.  Compress is also a more mature, less buggy
piece of software, although arc has improved greatly over the last year.
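
A lost .Z suffix can usually be recovered by looking at the file contents, since files written by compress normally begin with the two bytes 0x1F 0x9D. The sketch below only decides what the name should be; the function names are hypothetical, and it assumes the transfer left the file contents intact.

```python
COMPRESS_MAGIC = b"\x1f\x9d"  # leading bytes of a file written by compress


def looks_compressed(header):
    """True if the first bytes of a file carry the compress magic number."""
    return header[:2] == COMPRESS_MAGIC


def name_with_suffix(filename, header):
    """Return the name with '.Z' appended if the contents look compressed
    and the suffix is missing; otherwise return the name unchanged."""
    if looks_compressed(header) and not filename.endswith(".Z"):
        return filename + ".Z"
    return filename


if __name__ == "__main__":
    # Hypothetical file name and header bytes.
    print(name_with_suffix("picture.iff", b"\x1f\x9d\x90rest-of-file"))
```

In practice the header would be the first few bytes read from the transferred file, and the returned name would be used for a rename before running compress.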
 
Since our primary concern right now is space rather than cpu cycles (most
of the arc and kermit activity takes place in the evening), I will continue
to change directories of compressed files to directories of arced files in
our public domain Amiga software and data files, based on the above
results, unless directed otherwise.
 
Since this is useful, and almost research, I am going to almost publish it
by posting it to comp.sys.amiga, to share with the folks who shared this
software and data with us in the first place.
 
Kent.

--
The Contradictor	Member HUP (Happily Unemployed Programmers)  // Yet
								    // Another
Back at ODU to learn how to program better (after 25 years!)    \\ // Happy
								 \// Amigan!
UUCP  :  kent@xanth.UUCP   or    ...{sun,cbosgd,harvard}!xanth!kent
CSNET :  kent@odu.csnet    ARPA  :  kent@xanth.cs.odu.edu
Voice :  (804) 587-7760    USnail:  P.O. Box 1559, Norfolk, Va 23501-1559

Copyright 1987 Kent Paul Dolan.			How about if we keep the human
All Rights Reserved.  Author grants free	race around long enough to see
retransmission rights, recursively only.	a bit more of the universe?

dillon@CORY.BERKELEY.EDU (Matt Dillon) (04/22/87)

	Interesting comparison.  One thing I like about compress is that it
can take its input from stdin, and output to stdout.  I can then use PIPE:
to buffer data going into the compress (since disk IO is much faster than
compress can handle) and thus get 100% utilization of the CPU decompressing
my VD0: from floppy.  (how's that for a mouthful!)
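
The filter style described here (read a stream, write a stream, never hold the whole file) can be sketched in Python. zlib is used as a stand-in for compress, and the function names and chunk size are assumptions for illustration only.

```python
import io
import zlib


def stream_compress(src, dst, chunk_size=4096):
    """Read src in chunks and write compressed bytes to dst as they are produced."""
    packer = zlib.compressobj()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(packer.compress(chunk))
    dst.write(packer.flush())  # emit whatever the compressor still holds


def stream_decompress(src, dst, chunk_size=4096):
    """Read compressed bytes from src in chunks and write the original data to dst."""
    unpacker = zlib.decompressobj()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(unpacker.decompress(chunk))
    dst.write(unpacker.flush())


def roundtrip(data):
    """Compress then decompress through in-memory streams; returns the result."""
    packed, unpacked = io.BytesIO(), io.BytesIO()
    stream_compress(io.BytesIO(data), packed)
    packed.seek(0)
    stream_decompress(packed, unpacked)
    return unpacked.getvalue()


if __name__ == "__main__":
    import sys
    stream_compress(sys.stdin.buffer, sys.stdout.buffer)
```

Because each side only sees a stream, something like a pipe can sit between the disk and the decompressor and keep it fed.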

				-Matt

rs@mirror.UUCP (04/22/87)

Most sites on Usenet use the COMPRESS program to cut down their
transmission costs.  Compressing an ARC'd file results in something
bigger than the original.
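
The effect is easy to reproduce with any general-purpose compressor. The sketch below uses zlib as a stand-in for both ARC and COMPRESS (neither of which uses zlib's method), on made-up sample text: the first pass shrinks the data, and a second pass over the already-packed bytes has nothing left to find and only adds its own overhead.

```python
import random
import zlib

# Made-up vocabulary for generating repetitive but unordered sample text.
WORDS = ["arc", "compress", "amiga", "kermit", "picture", "basic", "file",
         "block", "directory", "table", "archive", "public", "source",
         "squeeze", "crunch", "floppy"]


def sample_text(n_words=5000, seed=1):
    """Reproducible sample text built from WORDS in pseudo-random order."""
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(n_words)).encode("ascii")


def pass_sizes(data, passes=2):
    """Sizes of the data before compression and after each successive pass."""
    sizes = [len(data)]
    for _ in range(passes):
        data = zlib.compress(data, 9)
        sizes.append(len(data))
    return sizes


if __name__ == "__main__":
    original, once, twice = pass_sizes(sample_text())
    print("original:", original, "after one pass:", once, "after two:", twice)
```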

As a courtesy, then, to those sites who pay big money to ship articles
around, the nicest way to post binary files is to use UUENCODE.
The C sources to UUENCODE have been posted to mod.sources; a portable
shar/unshar package was posted in net.sources.  Contact your nearest
mod.sources archive site for the former, or me for either.
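
For readers who have not looked inside a uuencoded posting: each body line carries up to 45 bytes of binary as printable characters. The sketch below uses Python's `binascii` module to produce and decode body lines only; the `begin`/`end` framing lines are left out, and the helper names are made up for illustration.

```python
import binascii


def uuencode_lines(data, chunk=45):
    """Yield uuencoded body lines; each covers at most 45 bytes of input."""
    for start in range(0, len(data), chunk):
        yield binascii.b2a_uu(data[start:start + chunk])


def uudecode_lines(lines):
    """Rebuild the original bytes from uuencoded body lines."""
    return b"".join(binascii.a2b_uu(line) for line in lines)


if __name__ == "__main__":
    data = bytes(range(90))  # 90 arbitrary bytes, i.e. two full lines
    lines = list(uuencode_lines(data))
    for line in lines:
        print(len(line), line)
    print(uudecode_lines(lines) == data)
```

A full line is a length character, 60 encoded characters, and a newline, so 45 bytes of binary become 62 bytes of text before the transport gets a chance to compress them.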

Also, people have written that "well, it's easiest for ME to do things
this way."  In an internetworked electronic conferencing system of
several thousand sites, the ruling theme should be the greatest good
for the greatest number.  Rise above selfishness, and go out of your
way a bit to help the net that helps you:  conventional wisdom and
empiricism have shown that posting UUENCODE'd binaries works best,
given that COMPRESS is usually in the transport mechanism.

As lesser arguments, assuming appeals to your "better side" don't
work:  ARC is shareware, which goes against the grain of the so-called
free Usenet, don't you think?  The maintainers and developers of ARC
refuse to support Unix ports.  Think about the implications of those
two sentences...

Comments to me, and I'll summarize.
	/Rich $alz,
	Moderator of mod.sources (soon to be comp.sources.unix)
-- 
Rich $alz					"Drug tests p**s me off"
Mirror Systems, Cambridge Massachusetts		rs@mirror.TMC.COM
{cbosgd, cca.cca.com, harvard!wjh12, ihnp4, mit-eddie, seismo}!mirror!rs