leichterj@rani.DEC (09/09/84)
A recent article in net.math posed the problem: Given a Huffman-encoded version of some file, and a rough idea of what the file looks like, but not the actual Huffman tree, can you decode the file? The answer depends on what the semantics of the file you compressed, and how it relates to the unit of Huffman encoding. (By "unit" I mean, for a byte-oriented file, do you use bytes, pairs of bytes, or what as the "characters" to be represented.) Obviously, any simple substi- tution of the units has no effect at all on the statistics, and produces the same Huffman tree with a different labeling. If a simple substitution can turn a file that "looks right" into another such file, then you can't distinguish between the two. If what was compressed was English text, then this won't happen; the "unicity distance" for English text under simple substitution is the small - somewhere around 35 characters. (That is: If you are given more than about 35 characters that arose from a simple substitution applied to En- glish text, then there is usually only one text/substitution pair that could have produced it - though finding it can be hard.) A Huffman code at the byte level is, of course, a simple substitution anyway - it just happens to use varying-length encodings. There is, of course, the question of whether you can tell where the codes start and end. Given the tree, Huffman codes are inherently self-synchronizing - if you get out of step, you get back in in a couple of characters. It's not obvious to me what happens if you DON'T know the tree - I think you have to make guesses about it. Note that a permutation of the units in the input produces the same tree, too. The semantics of the encoded text is essential. A simple substitution of black and white for each other will still look like a Picasso, so the Huffman code won't tell you if the original was black-on-white or white-on-black; but you may not care. As you go to more complex kinds of input files, this ambiguity becomes more and more of a problem - you can play around with the gray levels of a natural image in many ways and still have a natural-looking picture. -- Jerry