[comp.compression] Modeling vs encoding

tgl@g.gp.cs.cmu.edu (Tom Lane) (04/01/91)

In article <1991Mar31.000653.4598@zorch.SF-Bay.ORG>, xanthian@zorch.SF-Bay.ORG
(Kent Paul Dolan) writes:
> [a short summary of Huffman and arithmetic encoding]

I second Kent's recommended references for these encoding methods.
Another good introductory article is "Data Compression" in ACM Computing
Surveys v19n3 (Sept 87).  This covers both of these encoding methods
and many more.

There's an important common feature of Huffman and arithmetic coding that
Kent didn't explain very clearly, so I will take a shot at it.  Both of
these methods (and many others) are based on the same fundamental idea:
	1. Obtain some statistical information about the relative
	   frequencies of the different symbols that may appear in the
	   source message.
	2. Encode the symbols in such a way that the common symbols are
	   represented in fewer bits than average, while uncommon symbols
	   need more bits than average.
(Compare this with standard encodings like ASCII, wherein every symbol is
represented with the same number of bits.)  Note that a "symbol" may be a
single 0/1 bit, a character, a word, or whatever unit is convenient.  Since
you can't get something for nothing, you must use longer-than-average
encodings for some symbols in order to get shorter-than-average encodings
for others.  *If* your statistics are accurate, you can still end up with
fewer total bits over a whole file.  If they aren't, you come out behind.
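
To make step 1 concrete, here is a toy sketch in C (the file name and the
layout are just for illustration): it counts how often each byte value
occurs, then prints the "ideal" code length, -log2(p), that step 2 would
try to approach with real bit strings.

/* Toy illustration of step 1: count how often each byte value occurs,
 * then print the "ideal" code length -log2(p) for each symbol, which
 * is what step 2 (the coder) tries to approach. */
#include <stdio.h>
#include <math.h>

int main(int argc, char **argv)
{
    long count[256] = {0}, total = 0;
    int c;
    FILE *f = fopen(argc > 1 ? argv[1] : "sample.txt", "rb");

    if (f == NULL) { perror("fopen"); return 1; }
    while ((c = getc(f)) != EOF) { count[c]++; total++; }
    fclose(f);

    for (c = 0; c < 256; c++)
        if (count[c] > 0) {
            double p = (double) count[c] / total;
            printf("symbol %3d: p = %.4f, ideal length %.2f bits\n",
                   c, p, -log(p) / log(2.0));
        }
    return 0;
}

Common symbols come out with short ideal lengths and rare symbols with long
ones; the coder's whole job is to get close to those lengths.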

(This approach is completely different from Lempel-Ziv and related methods,
which rely on encoding whole strings of symbols rather than individual
symbols.  Most of those methods don't gather statistics per se; they just
assume that a string used before may be used again.)

Strictly speaking, the terms "Huffman" and "arithmetic" only tell you how
step 2 is done; to fully describe a compression method, you must also say
how step 1 is done.  In hi-faluting academic discussions, step 1 is called
modeling and step 2 is called coding; so be careful to say "Huffman coding",
not "Huffman compression", when you are talking to an expert.

There are a bunch of different ways of doing the modeling.  The simplest
distinction is between fixed, semiadaptive, and fully adaptive models.

With a fixed model, you gather your statistics but once, then encode
whatever is thrown at you.  Typically you gather statistics on a big sample
of English text (or whatever you can find that seems representative) and
then preset both compressor and decompressor with a corresponding Huffman
(or other coding method) encoding table.  Compression is fast and simple,
and it works well as long as the data to be encoded matches the fixed
statistics reasonably closely.  If you are given a file in, say, Swahili,
you can get unboundedly bad compression.

A semiadaptive modeler prescans the file to be compressed, computes
statistics *for that file*, and then encodes the file accordingly.  This
avoids the Swahili problem.  With this method you must store/transmit the
encoding table along with the data, so it loses for small files (the
overhead of the table is too great).  Also, in many applications having to
scan the file twice is bad news; you can't compress a stream of data on the
fly.  This option is pretty much out of favor these days.
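
To put a rough number on that table overhead (a back-of-the-envelope figure,
not measured from any particular program): a byte-alphabet Huffman table sent
as one code length per symbol runs on the order of a hundred or two bytes.
On a megabyte file that's noise; on a 500-byte file it can easily cancel
everything the coding step saved.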

A fully adaptive model starts from innocuous assumptions (say, all symbols
equally probable).  It then gathers statistics *as it scans the file*.  The
encoder updates its model after encoding each symbol; to stay in step, the
decoder makes an identical update to its model after decoding each symbol.
This arrangement can adapt to any file and has no overhead for transmitting
the model.  Characteristically, the first part of a file is encoded only
moderately well, but compression efficiency improves as you get further
along (because the statistics get more accurate).  A really sophisticated
adaptive model uses statistics from only the recently seen part of the file,
so that it can adapt to local changes in probabilities in different parts of
the file; neither fixed nor semiadaptive models can do this at all.  The
main drawback of the adaptive approach is that you need an encoding method
that can work directly from the statistical counts; you can't afford an
expensive statistics-to-encoding transformation.  Even with well-tuned
methods, these types of models tend to need more computation per symbol than
fixed or semiadaptive models.
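
In code, the bookkeeping that both sides share looks something like the
sketch below (a minimal version that tracks only per-symbol counts; the
names, and the choice of starting every count at 1, are just illustration):

/* Sketch of the shared adaptive bookkeeping: start from "all symbols
 * equally probable" (count 1 each) and bump a count after every symbol,
 * so encoder and decoder stay in step without transmitting a table. */
#include <stdio.h>

#define NSYM 256

static long count[NSYM];
static long total;

static void model_init(void)
{
    int i;
    for (i = 0; i < NSYM; i++) count[i] = 1;   /* innocuous assumption */
    total = NSYM;
}

static double model_prob(int sym)     /* what the coder would use */
{
    return (double) count[sym] / total;
}

static void model_update(int sym)     /* after coding/decoding sym */
{
    count[sym]++;
    total++;
}

int main(void)
{
    const char *text = "abracadabra abracadabra abracadabra";
    const char *p;

    model_init();
    for (p = text; *p; p++) {
        printf("'%c' coded with p = %.3f\n", *p,
               model_prob((unsigned char) *p));
        model_update((unsigned char) *p);
    }
    return 0;
}

Run it and you can watch the probability assigned to "a" climb as the model
adapts; a real compressor would hand these counts to an arithmetic or
adaptive Huffman coder instead of printing them.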

Still with me?  The other important classification of models is "order",
which is how much history or state is used.

A zeroth-order model just remembers the (independent) probability of each
symbol, and codes accordingly.  For N symbols the model can be described
by N probabilities or counts.

A first-order model keeps track of the probability of each symbol *given the
previous symbol*.  For example, while a zeroth-order model would always code "u"
the same, a first-order model would discover that "u" is extremely probable
after "q"; therefore, after coding a "q" it would use a very short encoding
for "u", shorter than it would use for a "u" after some other letter.  This
sort of model can be thought of as having N states (one for each symbol); in
each state a different set of symbol probabilities is maintained, and a
different set of symbol encodings is used.  You need to store N*N
statistical counts to maintain a first-order model.
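
Here is what that N*N bookkeeping looks like in a toy sketch (the sample
text is contrived so the "q"/"u" effect shows up; a real model would also
smooth the counts so that nothing ever gets probability zero):

/* First-order counts: count[prev][cur], indexed by the previous symbol
 * and the current one.  The coder uses the row selected by the previous
 * symbol.  No smoothing here, so unseen pairs show up as 0.000. */
#include <stdio.h>

#define NSYM 256

static long count[NSYM][NSYM];    /* N*N statistical counts */
static long rowtotal[NSYM];

static double cond_prob(int prev, int cur)
{
    if (rowtotal[prev] == 0) return 0.0;      /* context never seen */
    return (double) count[prev][cur] / rowtotal[prev];
}

int main(void)
{
    const char *text = "quick quiet quote quilt quart";
    const char *p;
    int prev = ' ', cur;

    for (p = text; *p; p++) {
        cur = (unsigned char) *p;
        count[prev][cur]++;
        rowtotal[prev]++;
        prev = cur;
    }
    printf("p(u after q) = %.3f    p(u after a space) = %.3f\n",
           cond_prob('q', 'u'), cond_prob(' ', 'u'));
    return 0;
}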

Similarly, second-order models code each symbol using knowledge of the
two preceding symbols, third-order models three preceding symbols, etc.
With each order step, the amount of state needed to represent the model's
statistics grows by a factor of N.  So high-order models are unusable for
semiadaptive modeling (you would never want to transmit that big a table),
but they can be a win for adaptive modeling (where the model size only
affects the amount of memory needed during compression/decompression).
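
To make that growth concrete for a byte alphabet (N = 256), the full count
tables come to:

	order 0:  256 counts
	order 1:  256^2 =        65,536 counts
	order 2:  256^3 =    16,777,216 counts
	order 3:  256^4 = 4,294,967,296 counts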

It is also possible to develop multi-state models in which the state is
determined by something other than the immediately preceding symbol(s).
For instance, in the JPEG image compression standard, the statistical model
uses separate counts and encodings for DC (zero-frequency) and AC
(non-zero-frequency) values, and it further separates sign bits from
magnitude bits in the values.  So the JPEG model separates states according
to the "class" of symbol rather than according to the immediately preceding
symbol.


The important feature of arithmetic coding is that the predicted frequencies
can be reflected exactly in the total length of the output string, whereas
Huffman coding must round off the frequencies because it encodes each symbol
with an integral number of bits.  For example, if a symbol has predicted
frequency 0.4, Huffman coding must spend a whole number of bits on it (one
or two, depending on the rest of the alphabet), while arithmetic coding can
actually achieve the 1.32 bits that are justified in an information-theoretic
sense.  This leads to the rough rule of thumb that arithmetic coding is, on
the average, something like half a bit per symbol better than Huffman (given
identical frequency statistics).  In order to
exploit this advantage, you need a model that can give adequately precise
statistical predictions; so more complex models are justified with
arithmetic coding than with Huffman coding.  For example, after a "q"
Huffman coding still needs at least one bit to represent "u", so there's no
point in knowing that "u" has better than 50% probability in this case.
But arithmetic coding can easily put out just a few hundredths of a bit if
you tell it the true probability of that "u" is 99%.
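
If you want to check those numbers, the information content of a symbol with
probability p is -log2(p) bits; a few lines of C will do it (purely an
illustration):

/* -log2(p) is the number of bits information theory says a symbol of
 * probability p deserves; arithmetic coding can spend almost exactly
 * that, Huffman has to round to a whole number of bits per symbol. */
#include <stdio.h>
#include <math.h>

static double bits(double p)
{
    return -log(p) / log(2.0);
}

int main(void)
{
    printf("p = 0.40  ->  %.3f bits  (Huffman: 1 or 2)\n", bits(0.40));
    printf("p = 0.99  ->  %.4f bits  (Huffman: at least 1)\n", bits(0.99));
    return 0;
}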


For further discussion of the notion of modeling for compression, see
Bell, Witten & Cleary in ACM Computing Surveys v21n4 (Dec 89).

-- 
			tom lane
Internet: tgl@cs.cmu.edu	BITNET: tgl%cs.cmu.edu@cmuccvma

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (04/01/91)

In article <12546@pt.cs.cmu.edu> tgl@g.gp.cs.cmu.edu (Tom Lane) writes:
  [ in an otherwise accurate summary ]
> Similarly, second-order models code each symbol using knowledge of the
> two preceding symbols, third-order models three preceding symbols, etc.
> With each order step, the amount of state needed to represent the model's
> statistics grows by a factor of N.

False. As my travesty program shows, the amount of state needed is at
most the size of the text times the order; I discovered later that the
necessary state was simply linear in the size of the input. That doesn't
mean the same scheme is practical for compression, only that high-order
models are actually rather cheap in memory.
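
A toy illustration of the point (this is not the travesty program, just a
hash of whatever contexts actually occur): each position in the input can
introduce at most one new context, so the number of stored contexts is
bounded by the length of the text, and each entry costs roughly "order"
bytes to store, which is where the text-times-order bound comes from.

/* Toy order-3 context counter: store only the 3-byte contexts that
 * actually appear, in a small hash table.  The table layout is just
 * for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ORDER   3
#define TABSIZE 65536

struct entry {
    unsigned char ctx[ORDER];
    long seen;
    struct entry *next;
};

static struct entry *table[TABSIZE];
static long entries;

static void note_context(const unsigned char *ctx)
{
    unsigned h = 0;
    int i;
    struct entry *e;

    for (i = 0; i < ORDER; i++) h = h * 31 + ctx[i];
    h %= TABSIZE;
    for (e = table[h]; e != NULL; e = e->next)
        if (memcmp(e->ctx, ctx, ORDER) == 0) { e->seen++; return; }
    e = malloc(sizeof *e);
    if (e == NULL) { perror("malloc"); exit(1); }
    memcpy(e->ctx, ctx, ORDER);
    e->seen = 1;
    e->next = table[h];
    table[h] = e;
    entries++;
}

int main(void)
{
    const char *text = "the theory of the thing is that the thing theorizes";
    size_t i, n = strlen(text);

    for (i = 0; i + ORDER <= n; i++)
        note_context((const unsigned char *) text + i);
    printf("%lu positions, only %ld distinct order-%d contexts stored\n",
           (unsigned long) (n - ORDER + 1), entries, ORDER);
    return 0;
}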

---Dan

brad@looking.on.ca (Brad Templeton) (04/01/91)

In article <12546@pt.cs.cmu.edu> tgl@g.gp.cs.cmu.edu (Tom Lane) writes:
>A semiadaptive modeler prescans the file to be compressed, computes
>statistics *for that file*, and then encodes the file accordingly.  This
>avoids the Swahili problem.  With this method you must store/transmit the
>encoding table along with the data, so it loses for small files (the
>overhead of the table is too great).  Also, in many applications having to
>scan the file twice is bad news; you can't compress a stream of data on the
>fly.  This option is pretty much out of favor these days.

Depends what you mean by "out of favour."  One of the most commonly
used compressors in the world, PKZIP, uses exactly this method, and does
well by it.  It is, in fact, the superior method for medium-sized files.

Too many compression theorists run around assuming files of arbitrary
length, which does not address real-world problems.

Two-pass systems such as ZIP suffer the speed penalty of the second pass,
but quite often the cost of maintaining the adaptive trees or doing
arithmetic coding is high too.  There are linear-time adaptive Huffman
algorithms, but that is no help if the time per bit is still high.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) (04/01/91)

From article <12546@pt.cs.cmu.edu>, by tgl@g.gp.cs.cmu.edu (Tom Lane):
> 
> A fully adaptive model starts from innocuous assumptions (say, all symbols
> equally probable).

Since the compressor and the expander must be primed with identical
assumptions about the source alphabet, the initial assumptions may
be used as a key for encryption (I first made this comment in my CACM
article on splay-tree-based encryption, cited in the recent bibliographic
posting on this newsgroup).

> 
> A zeroth-order model just remembers the (independent) probability of each
> symbol, and codes accordingly.
> 
> A first-order model keeps track of the probability of each symbol *given the
> previous symbol*.

And as I commented in the same CACM article, fractional orders are quite
practical.  A less-than-first-order model that is better than zeroth-order
doesn't remember the previous symbol, but it does remember something about
it: for example, whether the previous symbol was a letter, space, or
punctuation, or just the least-significant few bits of the character code.
As the data in my CACM article shows, retaining even a small amount of
context information can give a compressor a big boost in performance.
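
As a concrete sketch (the four classes below are just a choice I'm making
here for the example, not the breakdown from the article): condition on a
coarse class of the previous symbol rather than on the symbol itself, and
the table shrinks from N*N counts to 4*N.

/* Less-than-first-order model: the context is only the *class* of the
 * previous symbol (letter/digit/space/other), so the statistics fit in
 * 4*N counts instead of N*N.  No smoothing, purely for illustration. */
#include <stdio.h>
#include <ctype.h>

#define NSYM   256
#define NCLASS 4          /* letter, digit, space, everything else */

static long count[NCLASS][NSYM];
static long total[NCLASS];

static int class_of(int c)
{
    if (isalpha(c)) return 0;
    if (isdigit(c)) return 1;
    if (isspace(c)) return 2;
    return 3;
}

static double p_digit(int cls)    /* chance the next symbol is a digit */
{
    long hits = 0;
    int c;

    if (total[cls] == 0) return 0.0;
    for (c = '0'; c <= '9'; c++) hits += count[cls][c];
    return (double) hits / total[cls];
}

int main(void)
{
    const char *text = "In 1991, 64K of table space was a lot of memory.";
    const char *p;
    int prev_class = 2, c;

    for (p = text; *p; p++) {
        c = (unsigned char) *p;
        count[prev_class][c]++;
        total[prev_class]++;
        prev_class = class_of(c);
    }
    printf("p(digit | after digit)  = %.2f\n", p_digit(1));
    printf("p(digit | after letter) = %.2f\n", p_digit(0));
    return 0;
}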

				Doug Jones
				jones@cs.uiowa.edu

madler@nntp-server.caltech.edu (Mark Adler) (04/02/91)

In article <1991Apr01.053616.3665@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
>In article <12546@pt.cs.cmu.edu> tgl@g.gp.cs.cmu.edu (Tom Lane) writes:
>>A semiadaptive modeler prescans the file to be compressed, computes
>>statistics *for that file*, and then encodes the file accordingly.  This
...
>>fly.  This option is pretty much out of favor these days.
>
>Depends what you mean by "out of favour."   One of the most commonly
>used compressors in the world, PKZIP, uses exactly this method, and does
>well by it.   It is, in fact, the superior method for medium sized files.

Actually PKZIP doesn't really do that (by the way, we're talking about
the "implode" method, not "shrink", which is an LZW variant).  What
implode does is look at the first 3K of the file (or all of it if < 3K)
and count how many bytes fall in an odd-looking set of printable characters
(CR, LF, space, some digits, some upper case letters, some lower case
letters, a few punctuation characters--the letters look almost like the
Scrabble letters worth four points or fewer).  If the count is less than
5/9 of the total, the file is called "binary"; otherwise it is "text".
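
In rough C the test looks something like this; the 3K window and the 5/9
threshold are as I described, but the exact character set in the code is
only my stand-in, since I don't have PKZIP's real table:

/* Rough sketch of implode's binary/text guess: sample the first 3K,
 * count "texty" bytes, call it text if they make up at least 5/9 of
 * the sample.  The texty[] set below is a guess, not PKZIP's. */
#include <stdio.h>
#include <string.h>

static int looks_like_text(FILE *f)
{
    static const char texty[] =
        "\r\n aeilnorstuAEILNORSTU0123.,";   /* stand-in character set */
    unsigned char buf[3072];                 /* first 3K of the file */
    size_t n, i;
    long hits = 0;

    n = fread(buf, 1, sizeof buf, f);
    if (n == 0) return 0;
    for (i = 0; i < n; i++)
        if (memchr(texty, buf[i], sizeof texty - 1) != NULL)
            hits++;
    return hits * 9 >= (long) n * 5;         /* at least 5/9 "texty" */
}

int main(int argc, char **argv)
{
    FILE *f = fopen(argc > 1 ? argv[1] : "sample.dat", "rb");

    if (f == NULL) { perror("fopen"); return 1; }
    printf("%s\n", looks_like_text(f) ? "text" : "binary");
    fclose(f);
    return 0;
}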

Then, implode uses a prebuilt set of trees depending on whether the file
is binary or text.  These trees were made from a statistical analysis of a
very large set of sample files of each type.  Actually, the text set of
trees has two possibilities--if the file is less than 5632 bytes, there is
no literal tree and a 4K dictionary is used; otherwise there is a literal
tree and an 8K dictionary is used.

Also, if the file is 320 bytes or less, it skips all of this and uses
shrink (LZW) instead.

(Some of the above is from discussions with PKWare (PK and others), and
some is from experimentation with PKZIP.  I have not disassembled PKZIP,
nor would I wish to.)

So, the "statistical analysis" of the file really only outputs less than
two bits about the file, upon which the tree selection is made.  I believe
that the implication about "methods out of favor" refers to making a
complete first pass over the file to build a set of trees customized to
that file.  PKZIP doesn't do that.

The funny thing is, PKZIP *sends* the trees with each compressed file,
and PKUNZIP uses them (PKUNZIP does not have those trees pre-stored).
Perhaps it was PK's intention to make customized trees for each file,
but he found that it was too slow.

By the way, the free zip being worked on *does* build trees for each
file, and right now it's too slow.  But a new string matching algorithm
is being implemented, and we expect it to speed up considerably.  Stay
tuned ...

Mark Adler
madler@pooh.caltech.edu

tgl@g.gp.cs.cmu.edu (Tom Lane) (04/02/91)

In article <5173@ns-mx.uiowa.edu>, jones@pyrite.cs.uiowa.edu (Douglas W.
Jones,201H MLH,3193350740,3193382879) writes:
>
> > A fully adaptive model starts from innocuous assumptions (say, all symbols
> > equally probable).
> 
> Since the compressor and the expander must be primed with identical
> assumptions about the source alphabet, the initial assumptions may
> be used as a key for encryption.

Check.  Another advantage of fully adaptive models for encryption purposes
is that any wrong guess as to a decoded symbol will trash the remaining
output (since the decoder's model will now be out of sync).  This makes the
would-be decryptor's job much harder.

The downside of this, of course, is that adaptive decompressors have
essentially no recovery ability after errors in the compressed data stream.
Fixed-model Huffman decompressors can often resynchronize within a few
symbols after an error.

-- 
			tom lane
Internet: tgl@cs.cmu.edu	BITNET: tgl%cs.cmu.edu@cmuccvma

brad@looking.on.ca (Brad Templeton) (04/02/91)

If PKZIP really does that, I am at a loss to explain why it sends the full
tree in the file (unless it was really forward looking, which by and large
it was not).  But more to the point, why does it build a complete temp file
and do a full two-pass compress?

If you heard it from Katz himself, I believe it, but it seems odd that
it does the full two passes.

My two-pass compressor has a mode where it will scan the first N bytes of
the file, build tables on those, and do a one-pass compression from then on.
(You need such a mode to compress arbitrary-sized stdin -- or you can use
the "retrain huffer" mode, which outputs new tables every N bytes, for this.)

This semi-one-pass mode is only about one or two percent worse on a typical
large file, which is not bad.  On large files that vary in data composition
-- general tar files, for example -- the retraining method can give the best
compression.
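
In outline the semi-one-pass mode is just the following (a sketch only:
here I tally ideal bits, -log2 p, instead of driving a real coder, and the
8K training window is an arbitrary stand-in for N):

/* Semi-one-pass sketch: build the statistics from the first TRAIN bytes,
 * freeze them, then make a single coding pass over the whole stream
 * (buffered prefix first, then the rest) against the frozen table.
 * Here the "coding" just adds up ideal bit counts. */
#include <stdio.h>
#include <math.h>

#define TRAIN 8192     /* arbitrary stand-in for "the first N bytes" */

int main(int argc, char **argv)
{
    unsigned char buf[TRAIN];
    long count[256], total = 256, nbytes = 0;
    double bits = 0.0;
    size_t n, i;
    int c;
    FILE *f = fopen(argc > 1 ? argv[1] : "sample.txt", "rb");

    if (f == NULL) { perror("fopen"); return 1; }
    for (i = 0; i < 256; i++) count[i] = 1;    /* no zero probabilities */

    /* build the table from the first TRAIN bytes only */
    n = fread(buf, 1, sizeof buf, f);
    for (i = 0; i < n; i++) { count[buf[i]]++; total++; }

    /* single coding pass: buffered prefix, then the rest of the stream */
    for (i = 0; i < n; i++) {
        bits += -log((double) count[buf[i]] / total) / log(2.0);
        nbytes++;
    }
    while ((c = getc(f)) != EOF) {
        bits += -log((double) count[c] / total) / log(2.0);
        nbytes++;
    }
    printf("frozen table: %ld bytes, %.0f ideal bits (%.2f bits/byte)\n",
           nbytes, bits, nbytes ? bits / nbytes : 0.0);
    fclose(f);
    return 0;
}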


BTW, I am looking for a good name for my compressor.  Current name is
"ISP" for the "Incredible Shrinking Program" but I think it sounds a bit
wimpy, and too much like LISP.

Other choices:
	ZAP		- but it may be too close to ZIP
	Scrunch/Scrinch
	YACS		- (Yet another compression synonym)
	Pressor		- the name of its inner subroutine


What is your pick?

-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

madler@nntp-server.caltech.edu (Mark Adler) (04/02/91)

Brad Templeton wonders about the use of fixed trees in PKZIP:

>> If PKZIP really does that, I am at a loss to explain why it sends the full
>> tree in the file

I think the intention was to do two passes, but it was too slow.  It was
probably left in to allow for a two pass compressor in a later version.

>> But more to the point, why does it build a complete temp file
>> and do a full two-pass compress?

Does it?

Anyway, I just checked it myself and verified that there are only three
sets of trees in a large number (~100) of imploded entries produced by
PKZIP 1.10.  I can even send you the trees, if you're curious.

Mark Adler
madler@pooh.caltech.edu

brad@looking.on.ca (Brad Templeton) (04/03/91)

No, I believe you.   Well, no wonder it's easy to make a compressor that
does better than ZIP then.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473