[comp.compression] Data + Algorithms = Compression

xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (04/10/91)

When trying to prove the efficiency of your favorite compression
algorithm, it pays off a lot to carefully select the data you use to
show it off to best effect.  Said another way, nothing succeeds like a
rigged demo.

I was developing a program for rec.games.programmer, and needed a debug
file for a little logic problem I knew was there, but couldn't quite
visualize.

     [The problem was to create a maze to represent a town of the
     sort used in, e.g., the commercial Bard's Tale game series.
     (As a sidelight, the program eventually succeeded to some
     nice degree, and examples are posted to r.g.p). The debug
     output was the (21x43 character) maze image, repeated once
     for each step of the drawing algorithm.]

In the current example, that is 123 blocks of text, 947 bytes long each
(counting linefeeds and one blank line), differing in about seven
character positions from one block to the next.
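That kind of regularity is easy to simulate, and worth simulating to see
just how much a dictionary coder can wring out of it.  A small sketch
(using Python's zlib as a stand-in for the 1991 tools; the block sizes
and mutation count mirror the figures above, the data itself is
invented):

```python
import random
import zlib

random.seed(0)

# Hypothetical stand-in for the debug file described above: 123 maze
# snapshots of 947 bytes each, with consecutive snapshots differing
# in roughly seven character positions.
BLOCK_BYTES = 947
NUM_BLOCKS = 123

block = bytearray(random.choice(b".#") for _ in range(BLOCK_BYTES))
blocks = []
for _ in range(NUM_BLOCKS):
    blocks.append(bytes(block))
    # Flip seven randomly chosen cells to produce the next snapshot.
    for pos in random.sample(range(BLOCK_BYTES), 7):
        block[pos] = ord("#") if block[pos] == ord(".") else ord(".")

data = b"".join(blocks)
packed = zlib.compress(data, 9)
print(len(data), len(packed))
```

Because each block is a near-copy of the one before it, almost every
block after the first collapses into a handful of back-references, and
the compressed size lands at a small fraction of the input.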

Watching the file crawl by at 1200 baud was aging me too fast, so I
decided to pack it up, download it home, and view it at 38,000 baud
on my local hardware.  I got a very pleasant surprise when I chose
to do this with lharc.

This is severely regular data, and various compression algorithms take
advantage of that regularity with varying degrees of success:

  bytes  file name               algorithm
114106 townmaze.out          original data
 32422 townmaze.out.aed      CACM 6/87 Arithmetic Data Compression code
 24582 townmaze.out.S        Recently mentioned splay tree code
 15143 townmaze.out.zoo      zoo 2.01
 14363 townmaze.out.Z        standard BSD compress -b16
  4801 townmaze.out.lzh      lharc 1.02A as posted to alt.sources late last
                             year by me.

That last is less than 5% of the original data volume, and yes, it is
real; I unpacked the original file from it twice, to restore the
copies consumed by splay and compress, which do not normally leave an
original lying about.
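That unpack-and-compare sanity check is worth automating whenever a
compression ratio looks too good to be true.  A minimal sketch,
assuming Python's zlib rather than lharc, and an invented stand-in for
the maze file:

```python
import zlib

# Stand-in for townmaze.out: any byte string will do for the check.
original = b"#.#.#.#.#.#.#.#.#.#.#\n" * 5000

packed = zlib.compress(original, 9)
restored = zlib.decompress(packed)

# The round trip must be byte-identical, or the ratio is meaningless.
assert restored == original
print(f"{len(original)} -> {len(packed)} bytes, round trip OK")
```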

It is thus not all that awful an idea to consider reverting to the ARC
program's idea of trying eight or more compression methods, each with
its own favorite data types, and saving only the best result in the
archive, if time and cpu cycles are cheap and storage/transmission
bandwidth is dear.
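The try-everything-keep-the-winner scheme is simple to sketch.  Here
the Python standard-library codecs zlib, bz2, and lzma stand in for
ARC's eight methods; the method names and the sample data are
illustrative, not ARC's actual format:

```python
import bz2
import lzma
import zlib

def best_compression(data: bytes):
    """Compress with several methods, keep only the smallest result,
    and record which method won so the archive can be unpacked."""
    candidates = {
        "stored": data,                     # fallback: no compression
        "deflate": zlib.compress(data, 9),
        "bzip2": bz2.compress(data, 9),
        "lzma": lzma.compress(data),
    }
    method = min(candidates, key=lambda name: len(candidates[name]))
    return method, candidates[method]

sample = b"#.#.#.#.#.#.#.#.#.#.#\n" * 4000
method, packed = best_compression(sample)
print(method, len(sample), "->", len(packed), "bytes")
```

The "stored" entry guarantees the winner is never larger than the
input, which is exactly the guard a real archiver needs for
incompressible data.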

Kent, the man from xanth.
<xanthian@Zorch.SF-Bay.ORG> <xanthian@well.sf.ca.us>