csu@alembic.acs.com (Dave Mack) (05/30/91)
About a month ago, Kent Dolan posted a sample arithmetic coding
compression package here. I finally got around to messing with it.
For reference:

From: xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan)
Newsgroups: comp.compression
Subject: Re: looking for arith. codeing refs.
Message-ID: <1991Apr22.130611.9492@zorch.SF-Bay.ORG>
Date: 22 Apr 91 13:06:11 GMT
References: <11020001@hppad.waterloo.hp.com> <19319@cs.utexas.edu>
Organization: SF-Bay Public-Access Unix
Lines: 598

> "Arithmetic Coding for Data Compression", Witten, Neal and Cleary,
> Communications of the ACM, Volume 30, #6, June 1987
> This paper has a good description of arithmetic coding as well as
> working C code. I'm sure that the code is online somewhere, but I
> couldn't tell you where.

I'll probably get harassed for this, but here it is. I added ANSI C
prototypes, and for complicated reasons this exact Makefile is
untested, and it needs unpacking on a system with long file name
capability, or else hand editing to shorten the names before
unpacking (and in the files, too).

So I modified the makefile to use gcc with -g and -O, then ran it on a
large text file (jargon2.8.2, actually). The results were pretty
horrible:

# time encode <test >test.A
   80.7 real    79.8 user     0.3 sys
# time decode <test.A >untest
  101.1 real    98.7 user     0.5 sys
# cmp test untest
# ls -l test test.A
-rw-r--r--  1 csu   890267 May 29 23:19 test
-rw-r--r--  1 csu   522053 May 29 23:28 test.A

This was done on a *very* lightly loaded Sun-4/110. For comparison, I
ran an LZW compress over the same input:

# time compress test
   17.1 real    16.3 user     0.4 sys
# ls -l test.Z
-rw-r--r--  1 csu   392609 May 29 23:19 test.Z
# time uncompress test.Z
   10.8 real     9.2 user     0.5 sys

Is this a really bad implementation, a really bad algorithm, or is
arithmetic compression basically useless for text compression?

--
Dave Mack
ricks@nrl.navy.mil (Richard Schumeyer) (05/31/91)
In article <1991May30.035552.14435@alembic.acs.com>
csu@alembic.acs.com (Dave Mack) writes:

[stuff deleted]

> Is this a really bad implementation, a really bad algorithm, or
> is arithmetic compression basically useless for text compression?
>
> --
> Dave Mack

Arithmetic compression gives the best compression for a given model.
If your model stinks, so will your compression. Did you use the static
model given in the paper? Much better text models exist, like the
string-based model in compress.
--
------------------------------------------------------------
Rick Schumeyer                     ricks@luke.nrl.navy.mil
Naval Research Lab                 (202)404-7966
Space Science Division
------------------------------------------------------------
jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) (05/31/91)
From article <1991May30.035552.14435@alembic.acs.com>, by
csu@alembic.acs.com (Dave Mack):

> Is this a really bad implementation, a really bad algorithm, or
> is arithmetic compression basically useless for text compression?

Arithmetic coding has never been advertised as a fast method, and the
speed you see is typical of the results others have obtained.

As for compression performance, the code in the Witten, Neal, and
Cleary paper uses only a single-state source model. To compress as
well as LZW compresses, you need multiple states in your source model.

Witten, Neal, and Cleary used a move-to-front cumulative frequency
table occupying 257 16-bit words. If you want a 64-state source model,
you will need 64 of these tables (for a total cost of 16448 words, or
32896 bytes).

In my CACM paper on splay-tree based compression (vol. 31, no. 8), I
showed that a 64-state source model (where the next state is
determined by the previous character mod 64) works pretty well for
splay trees, and it should work even better for arithmetic codes. The
64-state source model I suggested uses much less memory than the hash
tables used by LZW, and it should compress significantly better.

My splay-tree based code was quite good with this 64-state source
model, frequently beating LZW, and it is a simple prefix code where
each source character codes to at least one bit. In a multiple-state
source model, many characters will be so common in some states that
they will code to less than a bit when arithmetic codes are used.

Finally, note that a multiple-state source model does not slow things
down significantly. You pay about one array index operation per
character compressed to find the base of the sub-array to be used in
compressing that character.

Doug Jones
jones@cs.uiowa.edu
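[For concreteness, here is a minimal, self-contained C sketch of the
bookkeeping a 64-state model of this kind implies. The names and the
simple counting loop are illustrative assumptions, not Jones's code;
a real coder would hand the selected table to its encode and update
routines rather than just counting.]

    #include <stdio.h>

    #define STATES   64
    #define SYMBOLS 257   /* 256 byte values plus an end-of-file symbol */

    /* 64 adaptive frequency tables, one per state:
       64 * 257 = 16448 sixteen-bit words, or 32896 bytes. */
    static unsigned short freq[STATES][SYMBOLS];

    int main(void)
    {
        int ch, state = 0;

        while ((ch = getchar()) != EOF) {
            /* A real coder would pass freq[state] to its encode_symbol()
               and update_model() routines here; this skeleton only
               maintains the per-state counts. */
            if (++freq[state][ch] == 0xFFFF) {
                /* keep counts within 16 bits by halving the table */
                int i;
                for (i = 0; i < SYMBOLS; i++)
                    freq[state][i] = (unsigned short)((freq[state][i] + 1) / 2);
            }
            state = ch % STATES;   /* next state = previous character mod 64 */
        }
        return 0;
    }

The cost per character beyond the single-table version is exactly the
one extra array index Jones describes: freq[state] instead of freq.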
campbell@sol.cs.wmich.edu (Paul Campbell) (05/31/91)
To be fair, it would be a bad idea to compare this to a string
compressor, because that's not what it is. It would be fair to compare
it to a Huffman encoding scheme. It does better than Huffman, and it
is in the same class of data compression as a Huffman compressor.
Neither was intended to beat out string compressors except in a few
cases. They both rely on the frequency of occurrence of individual
characters.

You could, for instance, run the output of the compress program
through the input of the arithmetic encoder to get even better
compression.

Also, as mentioned elsewhere, PLEASE don't use a fixed model unless
you scan the data first to generate a table for your specific case, or
do an adaptive implementation. (A sketch of such a scanning pass
follows.)
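[As a rough illustration of the "scan the data first" option, the
fragment below is one possible first pass: it counts byte frequencies
and scales them into a small fixed table that a second encoding pass
could use. The 16383 cap and the +1 floor are common conventions, but
they are assumptions here, not taken from the posted package.]

    #include <stdio.h>

    #define SYMBOLS   256
    #define MAX_TOTAL 16383   /* keep scaled counts within 14 bits */

    int main(void)
    {
        static long count[SYMBOLS];
        long total = 0;
        int ch;

        /* Pass 1: count how often each byte value occurs. */
        while ((ch = getchar()) != EOF) {
            count[ch]++;
            total++;
        }

        /* Scale the counts into a fixed table; every symbol keeps a
           frequency of at least 1 so it remains encodable.  A second
           pass would feed this table to the arithmetic encoder. */
        for (ch = 0; ch < SYMBOLS; ch++) {
            long f = total ? (count[ch] * MAX_TOTAL) / total : 0;
            printf("%3d %ld\n", ch, f + 1);
        }
        return 0;
    }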
xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (06/03/91)
ricks@nrl.navy.mil (Richard Schumeyer) writes:

> Arithmetic compression gives the best compression
> for a given model. If your model stinks, so will
> your compression. Did you use the static model
> given in the paper? Much better text models exist,
> like the string-based model in compress.

Um, the code I posted was for the adaptive model of the June 1987 CACM
article. There may be "string based models" in arithmetic compression,
but I don't know of them. Then again, I'm no expert in the field. The
advances over this primitive arithmetic compression that I know of all
use multicharacter predictor methods, but that is not terribly well
related to compress' strings.

Kent, the man from xanth.
<xanthian@Zorch.SF-Bay.ORG> <xanthian@well.sf.ca.us>
xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (06/03/91)
csu@alembic.acs.com (Dave Mack) writes:

> About a month ago, Kent Dolan posted a sample arithmetic coding
> compression package here. I finally got around to messing with
> it. For reference:

[Omitted]

> So I modified the makefile to use gcc with -g and -O, then ran it
> on a large text file (jargon2.8.2, actually). The results were
> pretty horrible:
>
> # time encode <test >test.A
>    80.7 real    79.8 user     0.3 sys
> # time decode <test.A >untest
>   101.1 real    98.7 user     0.5 sys
> # cmp test untest
> # ls -l test test.A
> -rw-r--r--  1 csu   890267 May 29 23:19 test
> -rw-r--r--  1 csu   522053 May 29 23:28 test.A
>
> This was done on a *very* lightly loaded Sun-4/110. For comparison,
> I ran an LZW compress over the same input:
>
> # time compress test
>    17.1 real    16.3 user     0.4 sys
> # ls -l test.Z
> -rw-r--r--  1 csu   392609 May 29 23:19 test.Z
> # time uncompress test.Z
>    10.8 real     9.2 user     0.5 sys
>
> Is this a really bad implementation, a really bad algorithm, or
> is arithmetic compression basically useless for text compression?

Your results are fairly consistent with mine, but don't despair. The
algorithm _is_ about six times as slow as compress, and not that hot
for normal text. However, a few things are worth knowing that may make
you willing to investigate further.

1) The code as written is an adaptive model, like adaptive Huffman
coding. For text, a fixed model (or, better yet, a two-pass approach)
would possibly do better.

2) The code as written was explicitly written for clarity of tutorial
use in a CACM article, not for speed. In particular, it does such
horrid things as make a subroutine call per _bit_ of compressed file
to write or read the bits. A little cleanup of this kind of bogosity
would do a _lot_ to speed up its six-to-one lossage to "compress".

3) Arithmetic compression is an improvement over Huffman coding, by
not insisting on encoding characters on whole bit boundaries, not
requiring explicit stop bits, and being capable of encoding multiple
characters per bit in extreme cases, so it is a possible win as the
second phase coding in LHARC-like two phase compression schemes.

4) Arithmetic coding shines, not for text, where about 1/3rd of the
possible symbols are fairly well represented, but in _severely_ skewed
data sets, like monochrome line drawings, where almost all of the data
is one symbol from the symbol set. To see arithmetic coding shine,
make up a test data set consisting of 99% nulls, with a randomly
chosen 1% of the positions replaced by any of the other 255 byte
values (a small generator for such a file is sketched after this
post). Compress that data set with compress and with arithmetic
compression, compare the results for the two, and be prepared to be
impressed for megabyte sized data sets. The obvious applications are
noisy image data, where the noise breaks up the strings that make
compress effective but still leaves the bulk of the pixels as a
background value.

5) If you absolutely _must_ get the smallest possible archive file,
use lharc. As a second choice, feed the output of compress through the
arithmetic encoder. Though what compress produces looks like pure
noise, I've found encode will compress it about 4% further. It costs
lots of time, of course, but if that is not your main criterion, then
it is quite handy.

6) Much better, state-sensitive arithmetic coding algorithms exist
than this simple one-character-deep tutorial code. See Doug Jones'
response for more on this.

Kent, the man from xanth.
<xanthian@Zorch.SF-Bay.ORG> <xanthian@well.sf.ca.us>
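[A quick way to build the skewed test file described in point 4. The
one-megabyte size, the fixed seed, and the use of rand() are arbitrary
choices for illustration, and the 1% replacement rate is only
approximate.]

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        long i, n = 1024L * 1024L;          /* one megabyte of test data */

        srand(1);
        for (i = 0; i < n; i++) {
            if (rand() % 100 == 0)
                putchar(1 + rand() % 255);  /* ~1%: a random nonzero byte */
            else
                putchar(0);                 /* ~99%: nulls */
        }
        return 0;
    }

Redirect its output to a file, then run both compress and the posted
encode over that file and compare the two sizes.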
Tapani.Lindgren@HUT.Fi (Tapani Lindgren) (06/10/91)
Douglas W. Jones <jones@pyrite.cs.uiowa.edu> writes:

>In a multiple-state source model, many characters will be so common in
                                   ^^^^
>some states that they will code to less than a bit when arithmetic
>codes are used.

May I rephrase this statement so no one will get the wrong idea:

There may be some states where _one_ character (or any symbol) is so
common (probability > 50%) that it will code to less than a bit.

Tapani Lindgren
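[A back-of-the-envelope check of why this matters: an ideal coder
spends about -log2(p) bits on a symbol of probability p. For p = 0.9,
-log2(0.9) is roughly 0.152 bits, so a run of 1000 such symbols costs
on the order of 150 bits, whereas any prefix code, Huffman included,
must spend at least 1000 bits, one whole bit per symbol.]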