[comp.compression] Is arithmetic compression a crock?

csu@alembic.acs.com (Dave Mack) (05/30/91)

About a month ago, Kent Dolan posted a sample arithmetic coding
compression package here. I finally got around to messing with 
it. For reference:


    From: xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan)
    Newsgroups: comp.compression
    Subject: Re: looking for arith. codeing refs.
    Message-ID: <1991Apr22.130611.9492@zorch.SF-Bay.ORG>
    Date: 22 Apr 91 13:06:11 GMT
    References: <11020001@hppad.waterloo.hp.com> <19319@cs.utexas.edu>
    Organization: SF-Bay Public-Access Unix
    Lines: 598

    > "Arithmetic Coding for Data Compression", Witten, Neal, and Cleary,
    > Communications of the ACM, Volume 30, #6, June 1987

    > This paper has a good description of arithmetic coding as well as
    > working C code. I'm sure that the code is online somewhere, but I
    > couldn't tell you where.

    I'll probably get harassed for this, but here it is. I added ANSI C
    prototypes, and for complicated reasons this exact Makefile is untested,
    and it needs unpacking on a system with long file name capability, or
    else hand editing to shorten the names before unpacking (and in the
    files, too.)

So I modified the makefile to use gcc with -g and -O, then ran it
on a large text file (jargon2.8.2, actually.) The results were
pretty horrible:

    # time encode <test >test.A
    80.7 real        79.8 user         0.3 sys  
    # time decode <test.A >untest
    101.1 real        98.7 user         0.5 sys  
    # cmp test untest
    # ls -l test test.A
    -rw-r--r--  1 csu        890267 May 29 23:19 test
    -rw-r--r--  1 csu        522053 May 29 23:28 test.A

This was done on a *very* lightly loaded Sun-4/110. For comparison,
I ran an LZW compress over the same input:

    # time compress test
    17.1 real        16.3 user         0.4 sys  
    # ls -l test.Z
    -rw-r--r--  1 csu        392609 May 29 23:19 test.Z
    # time uncompress test.Z
    10.8 real         9.2 user         0.5 sys  

Is this a really bad implementation, a really bad algorithm, or
is arithmetic compression basically useless for text compression?

-- 
Dave Mack

ricks@nrl.navy.mil (Richard Schumeyer) (05/31/91)

In article <1991May30.035552.14435@alembic.acs.com> csu@alembic.acs.com (Dave Mack) writes:
[stuff deleted]
>   Is this a really bad implementation, a really bad algorithm, or
>   is arithmetic compression basically useless for text compression?
>
>   -- 
>   Dave Mack

Arithmetic compression gives the best compression for a given model.
If your model stinks, so will your compression.  Did you use the
static model given in the paper?  Much better text models exist, like
the string-based model in compress.

--
------------------------------------------------------------
Rick Schumeyer           ricks@luke.nrl.navy.mil 
Naval Research Lab       (202)404-7966
Space Science Division
------------------------------------------------------------

jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) (05/31/91)

From article <1991May30.035552.14435@alembic.acs.com>,
by csu@alembic.acs.com (Dave Mack):
> 
> Is this a really bad implementation, a really bad algorithm, or
> is arithmetic compression basically useless for text compression?

Arithmetic coding has never been advertised as a fast method, and the
speed you see is typical of the results others have obtained.

As for compression performance, the code in the Witten, Neal, and Cleary
paper uses only a single-state source model.  To compress as well as
LZW compresses, you need multiple states in your source model.

Witten, Neal, and Cleary used a move-to-front cumulative frequency table
occupying 257 16-bit words.  If you want a 64-state source model, you
will need 64 of these tables (for a total cost of 16448 words or
32896 bytes).  In my CACM paper on splay-tree-based compression (vol. 31,
no. 8), I showed that a 64-state source model (where the next state is
determined by the previous character mod 64) works pretty well for
splay trees, and it should work even better for arithmetic codes.

The 64-state source model I suggested uses much less memory than the
hash tables used by LZW, and it should compress significantly better.
My splay-tree-based code was quite good with this 64-state source model,
frequently beating LZW, even though it is a simple prefix code in which
each source character codes to at least one bit.
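
To make the state bookkeeping concrete, here is a minimal sketch of
that kind of state selection (the names and layout are illustrative,
not the actual code from the paper): each state owns its own frequency
table, and the next state is simply the previous character mod 64.

    /* Sketch of a 64-state source model; the names and layout are
     * illustrative, not the code from the paper. */
    #define NSTATES   64
    #define NSYMBOLS  257             /* 256 byte values plus an EOF symbol */

    static unsigned short freq[NSTATES][NSYMBOLS];
    static int state = 0;             /* start in an arbitrary fixed state */

    void update_model(int symbol)     /* symbol in 0..NSYMBOLS-1 */
    {
        freq[state][symbol]++;        /* adapt only this state's table; a
                                         real coder would also rescale the
                                         cumulative counts here */
        state = symbol % NSTATES;     /* next state = previous character
                                         mod 64 */
    }

The per-character overhead of state selection is just the extra index
into freq[state], which is the single array index operation mentioned
below.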

In a multiple-state source model, many characters will be so common in
some states that they will code to less than a bit when arithmetic
codes are used.

Finally, note that a multiple state source model does not slow things
down significantly.  You pay about one array index operation per character
compressed to find the base of the sub-array to be used in compressing
that character.
					Doug Jones
					jones@cs.uiowa.edu

campbell@sol.cs.wmich.edu (Paul Campbell) (05/31/91)

To be fair, it would be a bad idea to compare this to a string compressor,
because that's not what it is. It would be fair to compare it to a Huffman
encoding scheme. It does better than Huffman, and it is in the same class of
data compression as a Huffman compressor. Neither was intended to beat out
string compressors except in a few cases. Both rely on the frequency
of occurrence of individual characters. You could, for instance, run the
output of the compress program through the input of the arithmetic encoder
to get even better compression.

Also, as mentioned elsewhere, PLEASE don't use a fixed model unless you scan
the data first to generate a table for your specific case or do an adaptive
implementation.
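
If you go the scan-first route, the first pass is nothing more than a
byte-frequency count whose output is then loaded as the coder's fixed
table.  A rough sketch (the program name, file handling, and output
format here are illustrative, not from the posted package):

    /* Rough first-pass scan to build a static frequency table.  The
     * file handling and output format are illustrative only, not
     * taken from the posted arithmetic coding package. */
    #include <stdio.h>

    #define NCHARS 256

    long counts[NCHARS];

    int main(int argc, char **argv)
    {
        FILE *fp;
        int c;

        if (argc < 2 || (fp = fopen(argv[1], "r")) == NULL) {
            fprintf(stderr, "usage: freqscan file\n");
            return 1;
        }
        while ((c = getc(fp)) != EOF)
            counts[c]++;              /* tally every byte value */
        fclose(fp);

        /* One count per symbol; the encoding pass would read this back,
         * scale it to the coder's 16-bit limits, and use it as the
         * fixed model for the whole file. */
        for (c = 0; c < NCHARS; c++)
            printf("%d %ld\n", c, counts[c]);
        return 0;
    }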

xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (06/03/91)

ricks@nrl.navy.mil (Richard Schumeyer) writes:

> Arithmetic compression gives the best compression
> for a given model. If your model stinks, so will
> your compression. Did you use the static model
> given in the paper? Much better text models exist,
> like the string-based model in compress.

Um, the code I posted was for the adaptive model of
the June 1987 CACM article.  There may be "string-based
models" in arithmetic compression, but I don't know of
any.  Then again, I'm no expert in the field.

The advances over this primitive arithmetic compression
that I know of all use multi-character predictor methods,
but those are not closely related to the strings compress builds.

Kent, the man from xanth.
<xanthian@Zorch.SF-Bay.ORG> <xanthian@well.sf.ca.us>

xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (06/03/91)

csu@alembic.acs.com (Dave Mack) writes:

> About a month ago, Kent Dolan posted a sample arithmetic coding
> compression package here. I finally got around to messing with 
> it. For reference:

[Omitted]

> So I modified the makefile to use gcc with -g and -O, then ran it
> on a large text file (jargon2.8.2, actually.) The results were
> pretty horrible:

>   # time encode <test >test.A
>   80.7 real        79.8 user         0.3 sys  
>   # time decode <test.A >untest
>   101.1 real        98.7 user         0.5 sys  
>   # cmp test untest
>   # ls -l test test.A
>   -rw-r--r--  1 csu        890267 May 29 23:19 test
>   -rw-r--r--  1 csu        522053 May 29 23:28 test.A

> This was done on a *very* lightly loaded Sun-4/110. For comparison,
> I ran an LZW compress over the same input:

>   # time compress test
>   17.1 real        16.3 user         0.4 sys  
>   # ls -l test.Z
>   -rw-r--r--  1 csu        392609 May 29 23:19 test.Z
>   # time uncompress test.Z
>   10.8 real         9.2 user         0.5 sys  

> Is this a really bad implementation, a really bad algorithm, or
> is arithmetic compression basically useless for text compression?

Your results are fairly consistent with mine, but
don't despair. The algorithm _is_ about six times
slower than compress, and not that hot for normal
text. However, knowing several things may make you
more willing to investigate further.

1) The code as written uses an adaptive model, like
adaptive Huffman coding. For text, a fixed model (or,
better yet, a two-pass approach) might well do better.

2) The code was written explicitly for clarity of
tutorial use in a CACM article, not for speed. In
particular, it does such horrid things as make a
subroutine call per _bit_ of compressed file to
write or read the bits. A little cleanup of that
kind of bogosity (a sketch follows below) would do
a _lot_ to close its six-to-one lossage to
"compress".
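
One cleanup of that sort, sketched below, is to
collect bits in a small buffer and call putc() once
per byte instead of making a function call per bit.
This is illustrative only; the names and bit order
do not match the CACM code's output format.

    /* Buffered bit output: accumulate bits and call
     * putc() once per eight bits.  Illustrative only;
     * the bit order and names are not the CACM code's. */
    #include <stdio.h>

    static int bit_buffer = 0;   /* bits accumulated so far    */
    static int bits_in_buf = 0;  /* how many of them are valid */

    #define OUTPUT_BIT(bit)                               \
        do {                                              \
            bit_buffer = (bit_buffer << 1) | ((bit) & 1); \
            if (++bits_in_buf == 8) {                     \
                putc(bit_buffer, stdout);                 \
                bit_buffer = bits_in_buf = 0;             \
            }                                             \
        } while (0)

    /* At end of stream, pad the last partial byte
     * with zero bits. */
    void flush_bits(void)
    {
        while (bits_in_buf != 0)
            OUTPUT_BIT(0);
    }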

3) Arithmetic compression is an improvement over
Huffman coding: it does not insist on encoding
characters on whole-bit boundaries, does not
require explicit stop bits, and can encode multiple
characters per bit in extreme cases, so it is a
possible win as the second-phase coder in
LHARC-like two-phase compression schemes.

4) Arithmetic coding shines not for text, where
about a third of the possible symbols are fairly
well represented, but on _severely_ skewed data
sets, like monochrome line drawings, where almost
all of the data is a single symbol from the symbol
set. To see arithmetic coding shine, make up a test
data set consisting of 99% nulls, with a randomly
chosen 1% of the positions replaced by any of the
other 255 byte values (a throwaway generator is
sketched below). Compress that data set with
compress and with arithmetic compression, compare
the results, and be prepared to be impressed for
megabyte-sized data sets.

The obvious application is noisy image data, where
the noise breaks up the strings that make compress
effective but still leaves the bulk of the pixels
at a background value.
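
If you want to try that experiment, a throwaway
generator along these lines will do.  The size,
seed, and use of rand() here are arbitrary
illustrative choices, not anything from the
original package.

    /* Generate the skewed test set described above:
     * roughly 99% null bytes, 1% random nonzero bytes.
     * Size, seed, and RNG are arbitrary choices. */
    #include <stdio.h>
    #include <stdlib.h>

    #define FILESIZE (1024L * 1024L)  /* one megabyte */

    int main(void)
    {
        long i;

        srand(1);             /* fixed seed: repeatable */
        for (i = 0; i < FILESIZE; i++) {
            if (rand() % 100 == 0)
                /* about 1% get a random nonzero byte */
                putchar(1 + rand() % 255);
            else
                putchar(0);   /* the rest are nulls */
        }
        return 0;
    }

Run compress and the arithmetic encoder over its
output and compare the resulting file sizes.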

5) If you absolutely _must_ get the smallest
possible archive file, use lharc. As a second
choice, feed the output of compress through the
arithmetic encoder. Though what compress produces
looks like pure noise, I've found encode will
compress it about 4% further. It costs lots of time,
of course, but if that is not your main criterion,
then it is quite handy.

6) Much better, state-sensitive arithmetic coding
algorithms exist than this simple, one-character-deep
tutorial code.  See Doug Jones' response for more
on this.

Kent, the man from xanth.
<xanthian@Zorch.SF-Bay.ORG> <xanthian@well.sf.ca.us>

Tapani.Lindgren@HUT.Fi (Tapani Lindgren) (06/10/91)

Douglas W. Jones <jones@pyrite.cs.uiowa.edu> writes:
>In a multiple-state source model, many characters will be so common in
                                   ^^^^
>some states that they will code to less than a bit when arithmetic
>codes are used.

May I rephrase this statement so no one gets the wrong idea:
there may be some states where _one_ character (or any
single symbol) is so common (probability > 50%) that it
will code to less than a bit.
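
As a quick sanity check (the numbers are just for illustration):
an ideal coder spends about -log2(p) bits on a symbol of
probability p, so

    p = 0.90  ->  -log2(0.90) ~= 0.15 bits
    p = 0.40  ->  -log2(0.40) ~= 1.32 bits

Only a symbol whose probability exceeds 1/2 codes to less than
one bit.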

Tapani Lindgren