[net.research] Lempel-Ziv Compression Made Swift, part 2

jaw@ames.UUCP (James A. Woods) (07/26/84)

#  "They squeeze the orange and throw away the skin."

	-- Voltaire, 1751 (letter referring to quarrel with Frederick the Great)

     Last episode, I remarked on an alternative implementation of LZ coding.
I proposed a rather memory-intensive structure inherently faster than hashing
or linked lists.  The tests, by the way, were done on a medium-loaded (10 users)
4 megabyte 11/750 running 4.2 BSD.  For all of the program runs, there were
only a few (<7) page faults across all compressor types.  Lesson:  don't let
the high I/O usage stats reported by the "time" command scare you -- the
bottom line is real time.

     What is the use of FAST compression, you say?  Phone lines only run at
9600 baud or so.  We've got a > 50% method in LZW, so why mess with it?
Good questions.

     But, there are applications beyond squeezing blood out of net.flame
over modems.  Even here, Thomas's code won't work on small address space
machines for N > 12.  This brings up the point that maybe Lempel-Ziv shouldn't
be wired down into a utility until it works for the PDP 11 (or IBM PC/XT!) uucp
sites at reasonable rates.  Maybe someone should look into hashing with open
addressing instead of memory consuming chaining -- there's lots of interesting
time/space tradeoffs to investigate.  I rejected hashing initially because
of a slow malloc(), and the fact that computing good hash functions in software
generally requires more time than scanning characters or links (c.f.
discussion of hashing vs. tries [for that nifty "on-line" spelling checker]
in Aho/Ullman's Data Structures and Algorithms.)  But, "tries" have their
own problems ...

     Anyway, I think there's a niche for fast code, and, at a minimum,
I'm intending to #ifdef my speedups into Spencer's code along with
mcvax!jim's portability mods (oops, I hope he used masks rather than
assuming the existence of a barrel shifter), so a VAX (or 68K/32032/RISC)
can communicate with the dinosaurs at BITS = 12.  While we're at it,
why not use the Berkeley "compact" command line interface (filename extensions
for friendly in-place file compression in addition to pipe capability),
and it's name (we all agree that the algorithm is a turkey).  That way,
we don't need more manual pages for compress/squeeze/squish/squash --
just do surgery on compact's guts and give it back to UCB.
Also, if we're going to let users play with compression rates, that
magic number better have the BITS parameter, for idiot-proof decompression.

     As for benchmark files, try the 200K /usr/dict/words, or System V
with the extra "carburetor" but no "tofu".

     Some other applications:

(1) Macintosh screen compression.  Run length coding accessed from the
    Toolbox gets pictures down to 10K or so, but maybe some form of LZ can
    postprocess this stuff even more for those tiny diskettes.

(2) Backup tapes.  Welch reports 2-3 X savings system wide.  Let's see,
    "tar/dump" goes about 3 meg / minute = 50K bytes / second at 6250 density.
    Hmm, we don't have a time payoff unless we're running a mainframe
    mismatched with slow tapes.  But it'd still be fun to wire an
    assembly-coded version (no memory contention running single-user, afterall)
    into dump/restore.  One tape is still better than two, albeit sluggish.
    I predict the LZ method will be put (invisibly) into tape controllers
    by the Hardy boys.

(3) Scientific data.  Lempel-Ziv buys 3 : 1 for ANOVA columnar statistics
    and certain random-access data bases (containing mostly zeroes) run at
    8 : 1!  We never did want those removable drives the division is pushing,
    anyway.

(4) Slow networks.  What does 4.2 TCP/IP over Ethernet do?  50K bytes / second?
    Or have that Cray/Denelcor UNIX ship compressed data to its frontend,
    or to its slow disks.

	-- James A. Woods  {dual,hplabs,hao}!ames!jaw  (jaw@riacs.ARPA)

P.S  Data compression has obsessed me since days as a kid in grad school,
     where I did music coding (adaptive transform coding on an 11/40
     with no floating point, running UNIX Version (not System) 5.
     If you're going for optimality, try Chaitin/Kolmogorov universal
     machine complexity theory (except it's worse than NP-complete, being
     undecidable).  Some variants of the linear LZ are indeed NP-complete
     (Storer/Szymanski [author of pack?!?] at Bell Labs, are you there?)
     But for practicality, hats off to Lempel and Ziv.  For the first time
     in 20-odd years, we have something to beat the pants off of Huffman
     codes (perhaps the most famous master's thesis of 1952).

     In a way, it's like besting Bob Beamon's long jump.  Tres recherche!