[comp.sys.amiga.tech] Compression Algorithms...

xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (05/27/90)

In article <1103@tardis.Tymnet.COM> jms@tardis.Tymnet.COM (Joe Smith) writes:
>Another comment about compression algorithms: After reading enough of the
>input file, LZW starts assigning codes to entire words and phrases.  For
>instance, "(char *)malloc(" can get replaced by a single 13-bit code.  Or a
>single code can replace the phrase "the party of the first part" wherever
>it appears.  Files that contain the same words, over and over, become much
>smaller when compressed.  As you can imagine, COBOL programs and legalese
>compress well.
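
That string-table trick is the whole game in LZW: the table grows as the
input is read, so after a while a repeated phrase collapses to a single code.
For anyone who hasn't stared at it before, here is a toy sketch of just the
table-building loop (a flat linear-search dictionary and no bit packing, so
nothing like the hashed table in the real "compress"; the sample string is
just an example):

/* Toy LZW string-table builder, to make the mechanism concrete.
 * Dictionary is a flat linear-search table; a real implementation
 * hashes and packs the codes into bits. */
#include <stdio.h>
#include <string.h>

#define MAXCODES 4096           /* 12-bit codes for this toy */

static struct { int prefix; int ch; } dict[MAXCODES];
static int ncodes = 256;        /* codes 0..255 are the single bytes */

/* Find the code for (prefix,ch), or -1 if it isn't in the table yet. */
static int lookup(int prefix, int ch)
{
    int i;
    for (i = 256; i < ncodes; i++)
        if (dict[i].prefix == prefix && dict[i].ch == ch)
            return i;
    return -1;
}

int main(void)
{
    const char *text = "the party of the first part, the party of the first part";
    const char *p;
    int prefix = -1, ch, code, emitted = 0;

    for (p = text; *p != '\0'; p++) {
        ch = (unsigned char)*p;
        if (prefix == -1) { prefix = ch; continue; }
        code = lookup(prefix, ch);
        if (code != -1) {
            prefix = code;              /* keep extending the phrase */
        } else {
            printf("emit code %d\n", prefix);
            emitted++;
            if (ncodes < MAXCODES) {    /* new phrase = old phrase + ch */
                dict[ncodes].prefix = prefix;
                dict[ncodes].ch = ch;
                ncodes++;
            }
            prefix = ch;
        }
    }
    printf("emit code %d\n", prefix);
    emitted++;
    printf("%d codes for %d input bytes\n", emitted, (int)strlen(text));
    return 0;
}

Run against the sample string, it emits noticeably fewer codes than there
are input bytes, and the saving grows as the same phrases keep recurring.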

I'm playing around with the arithmetic encoding/decoding algorithm from CACM
of a couple years back, replacing the "most frequent to front" linear array
lookup of the "how often this character was seen so far" information with a
heap-based maintenance of the same information, in hopes of a small speedup.
It works so far, but the speedup is a bit iffy: random input runs faster with
my heap-based code, while strongly biased input runs faster with the original
frequent-to-front linear search (proving once again that an order(log N)
algorithm only beats an order(N) algorithm if the constant multipliers are
close and N is "sufficiently large").

Anyway ... in testing my algorithm against the original, I had occasion to
create and compress a 100,000-byte file made up of lines of 49 "0"s and one
newline.  Because arithmetic compression is capable of using fractions of a
bit of output to represent a byte of input (yes, that really means what it
says and is possible!), it managed to compress the 100,000-byte input into
2338 bytes of encoded data, beating out the LZW-based "compress" by some
400 bytes.
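
For anyone who wants to reproduce the test, here is a throwaway program (my
own, not part of the CACM code; the file name is invented) that writes the
same kind of file and prints the order-0 entropy bound for it, which is the
floor an adaptive single-character model creeps toward but can't quite reach
because of adaptation and end-of-stream overhead:

/* Write 2000 lines of 49 "0"s plus a newline (100,000 bytes) and
 * print the order-0 entropy bound for that symbol mix. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    FILE *fp = fopen("zeros.dat", "w");
    double p0 = 49.0 / 50.0, pn = 1.0 / 50.0, bits;
    long line;

    if (fp == NULL)
        return 1;
    for (line = 0; line < 2000; line++)     /* 2000 * 50 = 100,000 bytes */
        fprintf(fp, "%049d\n", 0);          /* 49 zeros and a newline */
    fclose(fp);

    /* order-0 entropy: -(p0*log2 p0 + pn*log2 pn) bits per input byte */
    bits = -(p0 * log(p0) + pn * log(pn)) / log(2.0);
    printf("entropy bound: %.3f bits/byte, about %.0f bytes total\n",
           bits, bits * 100000.0 / 8.0);
    return 0;
}

The bound it prints comes out a fair bit under my 2338 bytes, which seems
reasonable given that (if I remember the CACM model right) the adaptive model
starts with all 256 byte values plus EOF equally likely and never lets any
count drop to zero.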

It is interesting to find that 1) arithmetic compression is capable of better
than 40-to-1 compression on data like this, 2) it can, for certain data, beat
LZ compression, and 3) it runs reasonably fast for this case, even compared
to "compress".  Arithmetic compression is worth considering for any extremely
sparse data, such as line-drawn images.

This is, incidentally, my first excursion into C on the Amiga, using Lattice
C release 5.04.  I am underwhelmed: a too-long "#define" definition not only
made the compiler fail to compile my program, it also crashed my system and
trashed Rad:.  A little bullet-proofing would seem to be a given by the time
a commercial product has gone through this many revisions.

On the bright side, for a guy with no hard disk but lots and lots of ram,
compiling with all the files and executables memory resident is sufficiently
fast to be a pleasant surprise.  It approaches the speed (though with nowhere
near the convenience or productivity) of Benchmark Modula-2 in the same
system.  I recommend spending bucks on memory until the box is full before
looking for a hard disk; the memory is lots more useful.  In 9.5 meg, I keep
the whole Workbench, lots of other goodies (emacs, zoo, AmigaBASIC, etc.),
the entire Lattice C development system, and have about 1.5 meg of Rad: to
work in, and 2.0 meg of available memory for executing code, screens, windows,
and other active data.  Of course, I'm still watching the optical rewritables
like a hawk.  Anybody put an ORWD on a SCSI interface on the Amiga for a price
the impoverished could enjoy?

Kent, the man from xanth, now zooming in from Zorch.
xanthian@zorch.sf-bay.org