[net.math] Huffman decompression without the tree

leichterj@rani.DEC (09/09/84)
A recent article in net.math posed the problem:  Given a Huffman-encoded version
of some file, and a rough idea of what the file looks like, but not the actual
Huffman tree, can you decode the file?  The answer depends on what the semantics
of the file you compressed, and how it relates to the unit of Huffman encoding.
(By "unit" I mean, for a byte-oriented file, do you use bytes, pairs of bytes,
or what as the "characters" to be represented.)  Obviously, any simple substi-
tution of the units has no effect at all on the statistics, and produces the
same Huffman tree with a different labeling.  If a simple substitution can turn
a file that "looks right" into another such file, then you can't distinguish
between the two.  If what was compressed was English text, then this won't
happen; the "unicity distance" for English text under simple substitution is
the small - somewhere around 35 characters.  (That is:  If you are given more
than about 35 characters that arose from a simple substitution applied to En-
glish text, then there is usually only one text/substitution pair that could
have produced it - though finding it can be hard.)  A Huffman code at the
byte level is, of course, a simple substitution anyway - it just happens to use
varying-length encodings.  There is, of course, the question of whether you
can tell where the codes start and end.  Given the tree, Huffman codes are
inherently self-synchronizing - if you get out of step, you get back in in a
couple of characters.  It's not obvious to me what happens if you DON'T know
the tree - I think you have to make guesses about it.

Note that a permutation of the units in the input produces the same tree, too.

The semantics of the encoded text is essential.  A simple substitution of black
and white for each other will still look like a Picasso, so the Huffman code
won't tell you if the original was black-on-white or white-on-black; but you
may not care.  As you go to more complex kinds of input files, this ambiguity
becomes more and more of a problem - you can play around with the gray levels
of a natural image in many ways and still have a natural-looking picture.

							-- Jerry