kenw@skyler.arc.ab.ca (Ken Wallewein) (03/29/91)
For a long time, everybody used Huffman compression. Then Lempel-Ziv (which I presume is a recent development) became popular, and everybody started using that instead. Now I see a return to Huffman, and some utilities which claim to use some sort of combination.

Could somebody please explain all this to me? Which is better for what? Can they really be combined? Why are people returning to Huffman compression?
--
/kenw
 Ken Wallewein                                          A L B E R T A
 kenw@noah.arc.ab.ca  <-- replies (if mailed) here, please  R E S E A R C H
 (403)297-2660                                          C O U N C I L
readdm@ccwf.cc.utexas.edu (David M. Read) (03/31/91)
In article <> kenw@skyler.arc.ab.ca (Ken Wallewein) writes:
> For a long time, everybody used Huffman compression. Then Lempel-Ziv
> (which I presume is a recent development) became popular, and everybody
> started using that instead. Now I see a return to Huffman, and some
> utilities which claim to use some sort of combination.
>
> Could somebody please explain all this to me? Which is better for what?
> Can they really be combined? Why are people returning to Huffman
> compression?

Would someone do me the favor of explaining Huffman compression? I think I have a grip on LZ compression, but I don't even have a reference for Huffman compression.

Another one I'd like to see explained: what exactly is the scheme behind arithmetic coding?

Just wondering...
--
Dave Read                  | readdm@ccwf.cc.utexas.edu | ...in large, friendly
UT-Austin Nuclear Physics  |                           | letters, were the words
 Graduate Student (Slave)  |                           |     DON'T PANIC !
"What for you bury me in the cold, cold ground ?" - The Tasmanian Devil
brad@looking.on.ca (Brad Templeton) (03/31/91)
The original L-Z algorithm was Huffman encoded. Welch, working at Sperry, modified the algorithm and *removed* the Huffman encoding. The result was poorer compression. But the result was also fixed-sized tokens, and that meant you could implement the LZW algorithm in hardware, and quickly. This made it suitable for compressing as you wrote out to disk. Decompression was similarly quick.

Zip, for example, uses the original L-Z algorithm (sometimes called LZ1) with variable-length prefix codes. (Huffman codes are the best possible prefix codes; Zip uses Shannon-Fano codes, which are almost as good but slightly easier to calculate.)

LZW is based on the 2nd L-Z algorithm. It was designed for speed, not the very best compression. It is still a lot better than any first-order probabilistic algorithm like basic Huffman coding. But as you will note, ZIP does a fair bit better than compress. ZIP is far from perfect as well; I have a compressor nearing beta test that does quite a bit better than ZIP.

Note that while Huffman makes the best prefix codes, prefix codes are not the best encoding of uneven-probability data streams. If something occurs 49% of the time, you may give it a 2-bit prefix code, when it really only needs just slightly more than one bit. BIG waste. Arithmetic coding (described in this month's Dr. Dobb's Journal, and many other sources) is able to use fractional bits, sort of, and it truly is optimal. But it takes more CPU to do, and a lot of storage if you want to store static tables in the file; it's almost always done in adaptive form.

The above example is the worst possible case. The typical case is that arithmetic coding gives you a 1/2-bit-per-token improvement. If your tokens are 10 bits long on average, as they might be in LZ, that's a 5% reduction in the size of the file.
--
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473
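To put rough numbers on that fractional-bit point, here is a minimal, throwaway C sketch (the probabilities are made up for the example, not taken from any compressor) that just prints the information content -log2(p) of a few very common symbols; any prefix code still has to spend at least one whole bit on each occurrence:

/* Illustration only: the information content -log2(p) of a very common
 * symbol, versus the whole bit a prefix code must spend on it.
 * The probabilities are made up for the example. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double p[] = { 0.49, 0.70, 0.90 };
    int i;

    for (i = 0; i < 3; i++)
        printf("p = %.2f   needs %.3f bits   prefix code spends >= 1 whole bit\n",
               p[i], -log(p[i]) / log(2.0));
    return 0;
}

(Compile with -lm.) At p = 0.90 the ideal cost is about 0.15 bits per occurrence, which is exactly where arithmetic coding's fractional bits pay off.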
xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (03/31/91)
readdm@ccwf.cc.utexas.edu (David M. Read) writes:
> Would someone do me the favor of explaining Huffman compression? I
> think I have a grip on LZ compression, but I don't even have a reference
> for Huffman compression.
> Another one I'd like to see explained: what exactly is the scheme
> behind arithmetic coding?

Neither of these is the kind of thing that fits into a news article nicely, so reading the references is best.

For Huffman coding and compression, a book that should be in every software practitioner's library, Robert Sedgewick's _Algorithms_, 2nd Edition, ISBN 0-201-06673-4 from Addison-Wesley, has an excellent discussion running from page 322 to page 330. Unfortunately, it only covers the fixed Huffman code case, while the adaptive case is more important for general compression.

In a sketch, the technique is to count the frequency of appearance of each symbol (byte bit pattern) into an array with an entry per symbol. Then a tree is built bottom to top by taking the two lowest-frequency bins, subordinating them to a new node whose datum is the combined frequency of the two child bins, and replacing them with this subtree. This is continued until the whole array has become a binary tree of the kind known as a trie. Now, going from the root to the symbol position, coding zero for a left branch and one for a right branch, each symbol has a unique encoding, with the higher-frequency symbols being blessed with the shorter encodings. For reasons I don't claim to understand, the trie disambiguates the boundary between symbols: no symbol is encoded with a code which is also the prefix of another symbol's code.

By using fewer bits to encode more popular characters, compression is achieved in cases where the frequencies are skewed sufficiently. Overhead in passing the tree structure at the start of the transmitted compressed data means that fixed Huffman coding is only effective for large files, or for datasets where an _assumed_ frequency table can be used by agreement between the compressing and uncompressing functionalities. An example is English text.

For arithmetic compression, the standard tutorial reference is the June 1987 Communications of the ACM, pages 520-540: Ian H. Witten, Radford M. Neal and John G. Cleary, "Arithmetic Coding for Data Compression" (my apologies to all the correspondents I told "the 7/87 issue" -- purely braindead).

In a sketch, and here I'll describe the adaptive case because the fixed table case never interested me, a set of bins with frequencies for each symbol is initialized with each bin set to one (and never allowed to go to zero), and the array of bins is conceptually stretched across the real interval [0,1) (open above), with the width of each bin proportional to the count it holds. The goal is to use a string of bits to define a subset of the real interval [0,1) whose endpoints lie strictly inside that part of the same real interval assigned to the bin for the symbol being encoded. This is done by a clever scheme, omitted here, which either halves the current subinterval or slides it to cover different quarters of the current subinterval until the goal is attained. These moves are encoded in the output bit stream and recoverable from the same bit stream read in. Symbols with wide bins can "capture" the subinterval with fewer moves, and thus fewer encoding bits, than symbols with narrow bins.
After the subinterval is isolated within a symbol's bin, detectable from the known current bin frequency values, that symbol's bin is incremented (moving the endpoints of _all_ the symbol bins in the real interval [0,1) for the next symbol's encoding), the current subinterval is logically widened to [0,1) to hold the whole set of bins again, and the process is iterated. The eventual output is a binary fraction that encodes the first symbol, "within" that symbol the second symbol, "within" that symbol the third symbol, and so on. There needs to be a separate, extra EOF symbol to tell when to stop. There is a lot of bookkeeping to keep bins from overflowing, and to make the above infinite-precision real number process actually work in a limited-precision, integer arithmetic computational environment.

It isn't obvious from the above description, but one of the effects of the arithmetic encoding algorithm is that symbols can be encoded in fewer than one bit! This occurs only where a single symbol value overwhelms the other frequencies. This gives arithmetic compression the opportunity to be more efficient than Huffman compression, which cannot take advantage of fractional bits, but must use a whole number of bits for each encoded symbol.

Really, the references explain this _much_ better than I care to.

Kent, the man from xanth.
<xanthian@Zorch.SF-Bay.ORG> <xanthian@well.sf.ca.us>
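To make the fixed-Huffman construction described above a little more concrete, here is a minimal C sketch of it; the sample text, the linear-scan selection of the two lowest bins, and all the names are purely illustrative, not anyone's production code:

/* Minimal fixed-Huffman sketch: count byte frequencies, repeatedly
 * merge the two lowest-count nodes, then read each symbol's code off
 * the 0/1 path from the root.  The "priority queue" is a linear scan,
 * chosen for clarity rather than speed. */
#include <stdio.h>
#include <stdlib.h>

struct node {
    long count;
    int symbol;                 /* -1 marks an internal node */
    struct node *left, *right;
};

static struct node *new_node(long count, int symbol,
                             struct node *l, struct node *r)
{
    struct node *n = malloc(sizeof *n);
    n->count = count; n->symbol = symbol; n->left = l; n->right = r;
    return n;
}

/* Print the 0/1 path to every leaf, i.e. every symbol's code. */
static void print_codes(struct node *n, char *path, int depth)
{
    if (n->symbol >= 0) {
        path[depth] = '\0';
        printf("'%c' (count %ld): %s\n", n->symbol, n->count, path);
        return;
    }
    path[depth] = '0'; print_codes(n->left,  path, depth + 1);
    path[depth] = '1'; print_codes(n->right, path, depth + 1);
}

int main(void)
{
    const char *text = "this is just some sample text to compress";
    long freq[256] = {0};
    struct node *live[256];     /* current forest of subtree roots */
    char path[257];
    int i, nlive = 0;

    for (i = 0; text[i]; i++)
        freq[(unsigned char)text[i]]++;

    for (i = 0; i < 256; i++)
        if (freq[i] > 0)
            live[nlive++] = new_node(freq[i], i, NULL, NULL);

    /* Merge the two lowest-frequency roots until one tree remains. */
    while (nlive > 1) {
        int a = 0, b, j;
        for (j = 1; j < nlive; j++)
            if (live[j]->count < live[a]->count) a = j;
        b = (a == 0) ? 1 : 0;
        for (j = 0; j < nlive; j++)
            if (j != a && live[j]->count < live[b]->count) b = j;

        live[a] = new_node(live[a]->count + live[b]->count, -1,
                           live[a], live[b]);
        live[b] = live[--nlive];
    }

    print_codes(live[0], path, 0);
    return 0;
}

And here is a deliberately naive sketch of the adaptive arithmetic-coding idea, using double-precision floats instead of the integer bookkeeping a real coder needs (so it only survives short messages before precision runs out); the three-symbol alphabet and the message are made up:

/* Naive adaptive arithmetic-coding sketch in floating point; a real
 * coder uses integer arithmetic and emits bits as the interval narrows.
 * The 3-symbol alphabet (a, b, EOF) and the message are made up. */
#include <stdio.h>

#define NSYM 3

int main(void)
{
    long count[NSYM] = {1, 1, 1};      /* adaptive counts, never zero */
    int message[] = {0, 0, 1, 0, 2};   /* a a b a <EOF> */
    double low = 0.0, high = 1.0;
    int i, j, s;

    for (i = 0; i < 5; i++) {
        long total = count[0] + count[1] + count[2];
        long below = 0;
        double range = high - low;

        s = message[i];
        for (j = 0; j < s; j++)
            below += count[j];

        /* Narrow [low, high) to the slice owned by symbol s. */
        high = low + range * (double)(below + count[s]) / (double)total;
        low  = low + range * (double)below / (double)total;

        count[s]++;   /* adapt: this symbol's bin widens for next time */
        printf("symbol %d -> interval [%.10f, %.10f)\n", s, low, high);
    }

    /* Any number in the final interval identifies the whole message. */
    printf("one possible encoding: %.10f\n", (low + high) / 2.0);
    return 0;
}

A real implementation, such as the one in the Witten, Neal and Cleary paper cited above, emits bits incrementally as the interval narrows instead of printing one final fraction at the end.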
rikitake@jrd.dec.com (Kenji Rikitake) (04/01/91)
In article <46466@ut-emx.uucp>, readdm@ccwf.cc.utexas.edu (David M. Read) writes...
> Would someone do me the favor of explaining Huffman compression? I
> think I have a grip on LZ compression, but I don't even have a reference
> for Huffman compression.

Check out the BSD "compact" utility, a straightforward adaptive Huffman coding compressor. This utility is rather old, so you may not find it in your UNIX kit.
--
Kenji Rikitake / rikitake@jrd.dec.com / home: kenji@ybbs.c.u-tokyo.ac.jp
"Mistrust Authority - Promote Decentralization." -- Steven Levy, "Hackers"
Disclaimer: my employer is not responsible for my messages on the Internet.
py@meadow.uucp (Peter Yeung) (04/02/91)
In article <46466@ut-emx.uucp> readdm@ccwf.cc.utexas.edu (David M. Read) writes:
> In article <> kenw@skyler.arc.ab.ca (Ken Wallewein) writes:
>> For a long time, everybody used Huffman compression. Then Lempel-Ziv
>> (which I presume is a recent development) became popular, and everybody
>> started using that instead. Now I see a return to Huffman, and some
>> utilities which claim to use some sort of combination.
>>
>> Could somebody please explain all this to me? Which is better for what?
>> Can they really be combined? Why are people returning to Huffman
>> compression?
>
> Would someone do me the favor of explaining Huffman compression? I
> think I have a grip on LZ compression, but I don't even have a reference
> for Huffman compression.

There is an article on LZW and Dynamic Huffman Encoding in the March issue of Byte Magazine.
--
Peter Yeung     Amdahl Canada Ltd., Software Development Center
                2000 Argentia Road, Plaza 2, Suite 300
                Mississauga, Ont. L5N 1V8
                Phone: (416) 542-6300   Fax: (416) 858-2233