[comp.compression] Lempel-Ziv vs. Huffman encoding

kenw@skyler.arc.ab.ca (Ken Wallewein) (03/29/91)

  For a long time, everybody used Huffman compression.  Then Lempel-Ziv
(which I presume is a recent development) became popular, and everybody 
started using that instead.  Now I see a return to Huffman, and some 
utilities which claim to use some sort of combination.

  Could somebody please explain all this to me?  Which is better for what?
Can they really be combined?  Why are people returning to Huffman
compression? 

--
/kenw

Ken Wallewein                                                     A L B E R T A
kenw@noah.arc.ab.ca  <-- replies (if mailed) here, please       R E S E A R C H
(403)297-2660                                                     C O U N C I L

readdm@ccwf.cc.utexas.edu (David M. Read) (03/31/91)

In article <> kenw@skyler.arc.ab.ca (Ken Wallewein) writes:
>  For a long time, everybody used Huffman compression.  Then Lempel-Ziv
>(which I presume is a recent development) became popular, and everybody 
>started using that instead.  Now I see a return to Huffman, and some 
>utilities which claim to use some sort of combination.

>  Could somebody please explain all this to me?  Which is better for what?
>Can they really be combined?  Why are people returning to Huffman
>compression? 
>

Would someone do me the favor of explaining Huffman compression?  I
think I have a grip on LZ compression, but I don't even have a reference
for Huffman compression.

Another one I'd like to see explained: what exactly is the scheme
behind arithmetic coding?

Just wondering...


-- 
Dave Read                 | readdm@ccwf.cc.utexas.edu  | ...in large, friendly 
UT-Austin Nuclear Physics |                            | letters, were the words
 Graduate Student (Slave) |                            |      DON'T PANIC !
"What for you bury me in the cold, cold ground ?" - The Tasmanian Devil

brad@looking.on.ca (Brad Templeton) (03/31/91)

The original L-Z algorithm was Huffman encoded.

Welch, working at Sperry, modified the algorithm and *removed* the Huffman
encoding.  The result was poorer compression, but also fixed-size tokens,
which meant you could implement the LZW algorithm in hardware, and quickly.
That made it suitable for compressing as you wrote out to disk.
Decompression was similarly quick.

Zip, for example, uses the original L-Z algorithm (sometimes called LZ1)
with variable-length prefix codes.  (Huffman codes are the best possible
prefix codes; Zip uses Shannon-Fano codes, which are almost as good but
slightly easier to calculate.)

LZW is based on the 2nd L-Z algorithm.  It was designed for speed, not
the very best compression.  It is still a lot better than any first-order
probabilistic algorithm like basic Huffman coding.   But as you will note,
ZIP does a fair bit better than compress.
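
To make the "fixed-size token" idea concrete, here is a toy LZW encoder in
C.  This is a sketch only, not code from compress or any real
implementation: the string table is a naive linear-search array, the input
is hard-wired, and the 12-bit codes are printed as integers instead of
being packed into bits.

/* Toy LZW encoder: longest-match phrase parsing with a growing string
 * table and fixed-size (12-bit) output codes. */

#include <stdio.h>
#include <string.h>

#define MAXCODES 4096                  /* 12-bit codes: 0..4095 */
#define MAXLEN   64                    /* longest phrase in this toy */

static char table[MAXCODES][MAXLEN];
static int  ncodes;

static int lookup(const char *s)       /* linear search; real code hashes */
{
    int i;
    for (i = 0; i < ncodes; i++)
        if (strcmp(table[i], s) == 0)
            return i;
    return -1;
}

int main(void)
{
    const char *input = "TOBEORNOTTOBEORTOBEORNOT";
    char phrase[MAXLEN] = "";
    char trial[MAXLEN];
    int i;

    for (ncodes = 0; ncodes < 256; ncodes++) {  /* codes 0..255 = raw bytes */
        table[ncodes][0] = (char)ncodes;        /* (byte 0 glossed over)   */
        table[ncodes][1] = '\0';
    }

    for (i = 0; input[i] != '\0'; i++) {
        sprintf(trial, "%s%c", phrase, input[i]);
        if (lookup(trial) != -1) {
            strcpy(phrase, trial);              /* keep extending the match */
        } else {
            printf("%d ", lookup(phrase));      /* emit one fixed-size code */
            if (ncodes < MAXCODES)
                strcpy(table[ncodes++], trial); /* learn the new phrase */
            phrase[0] = input[i];
            phrase[1] = '\0';
        }
    }
    printf("%d\n", lookup(phrase));             /* flush the last phrase */
    return 0;
}

Every output code is the same width, which is exactly why the decoder (and
a hardware implementation) can stay so simple; the cost is that common
phrases don't get shorter codes than rare ones.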

ZIP is far from perfect as well.  I have a compressor nearing beta test
that does quite a bit better than ZIP.

Note that while Huffman makes the best prefix codes, prefix codes are
not the best encoding of uneven probability data streams.   If something
occurs 49% of the time, you can end up giving it a 2 bit prefix code
(with symbol probabilities of, say, 50%, 49% and 1%, Huffman puts the 49%
symbol two deep in the tree), when it really only needs just slightly
more than one bit.  BIG waste.  Arithmetic coding (described in this
month's Dr. Dobb's Journal, and many other sources) is able to use
fractional bits, sort of, and it truly is optimal.  But it takes more CPU
to do, and a lot of storage if you want to store static tables in the
file.  It's almost always done in adaptive form.

The above example is about the worst possible case.  The typical case is
that arithmetic coding gives you about a 1/2 bit per token improvement.
If your tokens are 10 bits long on average, as they might be in LZ, that's
a 5% reduction in the size of the file.
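
If you want to see the numbers, a couple of lines of C will do it.  The
50/49/1 split is the made-up distribution from the example above, not
anything measured:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double p[]   = { 0.50, 0.49, 0.01 };   /* made-up symbol probabilities */
    int    len[] = { 1, 2, 2 };            /* code lengths Huffman assigns */
    double avg = 0.0, entropy = 0.0;
    int i;

    for (i = 0; i < 3; i++) {
        double bits = -log(p[i]) / log(2.0);   /* information content */
        printf("p = %.2f  needs %5.3f bits, Huffman spends %d\n",
               p[i], bits, len[i]);
        avg     += p[i] * len[i];
        entropy += p[i] * bits;
    }
    printf("Huffman: %.3f bits/symbol   entropy: %.3f bits/symbol\n",
           avg, entropy);
    return 0;
}

It prints roughly 1.50 bits per symbol for the Huffman code against an
entropy of about 1.07 -- the half-bit-ish gap an arithmetic coder can
recover.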
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (03/31/91)

readdm@ccwf.cc.utexas.edu (David M. Read) writes:

> Would someone do me the favor of explaining huffman compression? I
> think I have a grip on LZ compression, but I don't even have reference
> for Huffman compression.

> Another one I'd like to see explained: what exactly is the scheme
> behind arithmetic coding?

Neither of these is the kind of thing that fits into a news article
nicely, so reading the references is best.

For Huffman coding and compression, a book that should be in every
software practitioner's library, Robert Sedgewick's _Algorithms_, 2nd
Edition, ISBN 0-201-06673-4 from Addison-Wesley, has an excellent
discussion running from page 322 to page 330. Unfortunately, it only
covers the fixed Huffman code case, while the adaptive case is more
important for general compression.

In a sketch, the technique is to count the frequency of appearance of
each symbol (byte bit pattern) into an array with an entry per symbol.
Then a tree is built from the bottom up by taking the two lowest-frequency
bins, subordinating them to a new node whose datum is the combined
frequency of the two child bins, and replacing them with this subtree.

This is continued until the whole array has become a binary tree of the
kind known as a trie. Now going from the root to the symbol position,
coding zero for a left branch and one for a right branch, each symbol
has a unique encoding, with the higher frequency symbols being blessed
with the shorter encodings.

The trie disambiguates the boundary between symbols because the symbols
live only at the leaves: no symbol is encoded with a code which is also
the prefix of another symbol's code, since no code path passes through
one leaf on its way to another. By using fewer bits to encode more
popular characters, compression is achieved in cases where the
frequencies are skewed sufficiently.
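
Here is a bare-bones C rendition of the tree building and code extraction
just described.  The five symbols and their counts are made up for
illustration; a real compressor would count all 256 byte values from its
input.

#include <stdio.h>

#define NSYM  5                       /* leaves (symbols) */
#define NNODE (2 * NSYM - 1)          /* leaves plus internal nodes */

struct node {
    long freq;
    int  left, right;                 /* -1 marks a leaf */
    int  alive;                       /* still available for merging? */
};

static struct node tree[NNODE];
static char symbol[NSYM] = { 'a', 'b', 'c', 'd', 'e' };
static long count[NSYM]  = { 45, 13, 12, 16, 9 };

static int smallest(void)             /* lowest-frequency live node */
{
    int i, best = -1;
    for (i = 0; i < NNODE; i++)
        if (tree[i].alive && (best == -1 || tree[i].freq < tree[best].freq))
            best = i;
    return best;
}

static void print_codes(int n, char *code, int depth)
{
    if (tree[n].left == -1) {         /* leaf: the path so far is the code */
        code[depth] = '\0';
        printf("%c  %s\n", symbol[n], code);
        return;
    }
    code[depth] = '0'; print_codes(tree[n].left,  code, depth + 1);
    code[depth] = '1'; print_codes(tree[n].right, code, depth + 1);
}

int main(void)
{
    char code[NNODE];
    int i, a, b, next;

    for (i = 0; i < NSYM; i++) {      /* one leaf bin per symbol */
        tree[i].freq  = count[i];
        tree[i].left  = tree[i].right = -1;
        tree[i].alive = 1;
    }
    for (next = NSYM; next < NNODE; next++) {   /* merge two smallest bins */
        a = smallest(); tree[a].alive = 0;
        b = smallest(); tree[b].alive = 0;
        tree[next].freq  = tree[a].freq + tree[b].freq;
        tree[next].left  = a;
        tree[next].right = b;
        tree[next].alive = 1;
    }
    print_codes(NNODE - 1, code, 0);  /* the last node built is the root */
    return 0;
}

Running it on these counts gives 'a' a 1-bit code and the four rarer
symbols 3-bit codes; an adaptive Huffman coder does the same thing but
adjusts the tree as the counts change instead of building it once.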

Overhead in passing the tree structure at the start of the transmitted
compressed data means that fixed Huffman coding is only effective for
large files, or datasets where an _assumed_ frequency table can be used
by agreement between the compressing and uncompressing functionalities.
An example is English text.

For Arithmetic Compression, the standard tutorial reference is the June
1987 Communications of the ACM, pages 520-540, Ian H. Witten, Radford M.
Neal and John G. Cleary, "Arithmetic Coding for Data Compression" (my
apologies to all the correspondents I told "the 7/87 issue" -- purely
braindead).

In a sketch (and here I'll describe the adaptive case, because the fixed
table case never interested me): a set of bins with frequencies for each
symbol is initialized with each bin set to one (and never allowed to go
to zero), and the array of bins is conceptually stretched across the
real interval [0,1) (open above), with the width of each bin
proportional to the count it holds.

The goal is to use a string of bits to define a subset of the real
interval [0,1) whose endpoints lie strictly inside that part of the same
real interval assigned to the bin for the symbol being encoded. This is
done by a clever scheme omitted here which either halves the current
subinterval or slides it to cover different quarters of the current
subinterval until the goal is attained. These moves are encoded in the
output bit stream and recoverable from the same bitstream read in.
Symbols with wide bins can "capture" the subinterval with fewer moves, and
thus fewer encoding bits, than symbols with narrow bins.

After the subinterval is isolated within a symbol's bin, detectable from
the known current bin frequency values, that symbol's bin is incremented
(moving the endpoints of _all_ the symbol bins in the real interval [0,1)
for the next symbol's encoding), the current subinterval is logically
widened to [0,1) to hold the whole set of bins again, and the process is
iterated.

The eventual output is a binary fraction that encodes the first symbol,
"within" that symbol the second symbol, "within" that symbol the third
symbol, and so on.  There needs to be a separate, extra EOF symbol to
tell the decoder when to stop.

There is a lot of bookkeeping to keep bins from overflowing, and to make
the above infinite precision real number process actually work in a
limited precision, integer arithmetic computational environment.
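
Since a working coder is exactly that finite-precision bookkeeping
exercise, here is only a deliberately naive sketch in C of the interval
narrowing and the adaptive counts, done in double precision with the
bit-by-bit output skipped entirely.  The four-symbol alphabet and the
message are made up; the CACM paper has the real thing.

#include <stdio.h>

#define NSYM 4                          /* symbols 0, 1, 2 plus 3 as EOF */

int main(void)
{
    long   freq[NSYM] = { 1, 1, 1, 1 }; /* every bin starts at one */
    int    message[]  = { 0, 1, 0, 0, 2, 3 };   /* ends with the EOF symbol */
    int    nmsg = sizeof message / sizeof message[0];
    double low = 0.0, high = 1.0;
    int    i, s;

    for (i = 0; i < nmsg; i++) {
        int    sym = message[i];
        long   total = 0, below = 0;
        double width = high - low;

        for (s = 0; s < NSYM; s++) total += freq[s];
        for (s = 0; s < sym;  s++) below += freq[s];

        /* narrow [low, high) to the slice owned by this symbol's bin */
        high = low + width * (double)(below + freq[sym]) / (double)total;
        low  = low + width * (double)below / (double)total;

        freq[sym]++;                    /* adapt: widen this symbol's bin */
        printf("after symbol %d: [%.10f, %.10f)\n", sym, low, high);
    }
    /* any number inside the final interval identifies the whole message */
    printf("transmit, say, %.10f\n", (low + high) / 2.0);
    return 0;
}

Watch how little the interval shrinks each time the already-popular
symbol 0 comes around again; the amount of shrinkage is what eventually
costs output bits, and that is where the fractional-bit savings show up.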

It isn't obvious from the above description, but one of the effects of
the arithmetic encoding algorithm is that symbols can be encoded in
fewer than one bit! This occurs only where a single symbol value
overwhelms the other frequencies. This gives arithmetic compression the
opportunity to be more efficient than Huffman compression, which cannot
take advantage of fractional bits, but must use a whole number of bits
for each encoded symbol.

Really, the references explain this _much_ better than I care to.

Kent, the man from xanth.
<xanthian@Zorch.SF-Bay.ORG> <xanthian@well.sf.ca.us>

rikitake@jrd.dec.com (Kenji Rikitake) (04/01/91)

In article <46466@ut-emx.uucp>, readdm@ccwf.cc.utexas.edu (David M. Read) writes...
>Would someone do me the favor of explaining huffman compression? I
>think I have a grip on LZ compression, but I don't even have reference
>for Huffman compression.  

Check out the BSD "compact" utility, a straightforward adaptive Huffman
coding compressor. This utility is rather old, so you may not find it in
your UNIX kit.

--
Kenji Rikitake  /  rikitake@jrd.dec.com  /  home: kenji@ybbs.c.u-tokyo.ac.jp
"Mistrust Authority - Promote Decentralization."   -- Steven Levy, "hackers"
Disclaimer:  my employer is not responsible for my messages on the Internet.

py@meadow.uucp (Peter Yeung) (04/02/91)

In article <46466@ut-emx.uucp> readdm@ccwf.cc.utexas.edu (David M. Read) writes:
>In article <> kenw@skyler.arc.ab.ca (Ken Wallewein) writes:
>>  For a long time, everybody used Huffman compression.  Then Lempel-Ziv
>>(which I presume is a recent development) became popular, and everybody 
>>started using that instead.  Now I see a return to Huffman, and some 
>>utilities which claim to use some sort of combination.
>
>>  Could somebody please explain all this to me?  Which is better for what?
>>Can they really be combined?  Why are people returning to Huffman
>>compression? 
>>
>
>Would someone do me the favor of explaining huffman compression? I
>think I have a grip on LZ compression, but I don't even have reference
>for Huffman compression.  
>

There is an article on LZW and Dynamic Huffman Encoding in the March issue
of Byte magazine.



-- 
Peter Yeung     Amdahl Canada Ltd., Software Development Center
                2000 Argentia Road, Plaza 2, Suite 300
                Mississauga, Ont.   L5N 1V8
                Phone: (416) 542-6300    Fax: (416) 858-2233