[comp.compression] Is there better lossless compression than LZW?

hal@eecs.cs.pdx.edu (Aaron Harsh) (05/24/91)

  The only lossless compression algorithms I've seen have either been LZW
or its derivatives (AP, MW and Y-coding) or Huffman encoding.  These seem
fast, but their compression doesn't seem that remarkable.  Are there any
other general purpose algorithms that achieve better results, even at the
expense of speed?

Aaron Harsh
hal@eecs.cs.pdx.edu

jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) (05/24/91)

From article <2722@pdxgate.UUCP>, by hal@eecs.cs.pdx.edu (Aaron Harsh):
> 
>   The only lossless compression algorithms I've seen have either been LZW
> or its derivatives (AP, MW and Y-coding) or Huffman encoding.  These seem
> fast, but their compression doesn't seem that remarkable.  Are there any
> other general purpose algorithms that achieve better results, even at the
> expense of speed?

Well, there's always my splay-tree based prefix code (software available
by e-mail request).  See:

   Application of Splay Trees to Data Compression, by Douglas Jones
     Communications of the ACM, Vol 31 No 8 (Aug 1988) pages 996-1006

But by far the best lossless compression is achieved by arithmetic codes,
especially when used with multiple state source models.  Arithmetic codes
can code individual source characters in fractional bits (something that
a Huffman code can't do), but they are generally slow, and when used with
multiple state source models, they get quite big as well.  See:

   Arithmetic Coding for Data Compression, by Witten et al
     Communications of the ACM, Vol 30 No 6 (June 1987) pages 520-540
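
To make the fractional-bit point concrete, here is a toy arithmetic coder
(sketched in Python with exact rationals, so there is no renormalization and
it is far too slow for real use; see the CACM paper above for a practical
version).  The whole message is squeezed into one interval whose width is the
product of the symbol probabilities, so the ideal code length, -log2(width),
is usually not a whole number of bits:

    from fractions import Fraction
    from math import log2

    def arith_encode(symbols, freqs):
        # Narrow [low, low+width) once per symbol.
        total = sum(freqs.values())
        cum, c = {}, 0
        for s in sorted(freqs):
            cum[s] = (c, c + freqs[s])
            c += freqs[s]
        low, width = Fraction(0), Fraction(1)
        for s in symbols:
            lo, hi = cum[s]
            low += width * Fraction(lo, total)
            width *= Fraction(hi - lo, total)
        return low, width      # any value in [low, low+width) identifies the message

    def arith_decode(value, count, freqs):
        total = sum(freqs.values())
        cum, c = [], 0
        for s in sorted(freqs):
            cum.append((s, c, c + freqs[s]))
            c += freqs[s]
        low, width, out = Fraction(0), Fraction(1), []
        for _ in range(count):
            target = (value - low) / width
            for s, lo, hi in cum:
                if Fraction(lo, total) <= target < Fraction(hi, total):
                    out.append(s)
                    low += width * Fraction(lo, total)
                    width *= Fraction(hi - lo, total)
                    break
        return out

    freqs = {'a': 3, 'b': 1}
    msg = list("aababaa")
    low, width = arith_encode(msg, freqs)
    assert arith_decode(low, len(msg), freqs) == msg
    print(-log2(width))    # about 6.08 bits for 7 symbols, vs. 7 bits at 1 bit/symbol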

I hope this helps.
				Doug Jones
				jones@cs.uiowa.edu

brad@looking.on.ca (Brad Templeton) (05/25/91)

In article <6200@ns-mx.uiowa.edu> jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) writes:
>But by far the best lossless compression is achieved by arithmetic codes,
>especially when used with multiple state source models.  Arithmetic codes
>can code individual source characters in fractional bits (something that
>a Huffman code can't do), but they are generally slow, and when used with
>multiple state source models, they get quite big as well.  See:

I don't know about "but by far the best."   Arithmetic codes give you
roughly half a bit per token better compression, which on 0 order byte models
is around 10% better, and on 2nd or 3rd order models such as the ones you
describe, this drops to just a few percent.   Not that it isn't good to
get every drop out, but sometimes the speed penalty isn't worth it.

So on to the related question, what are the best (non-patented) algorithms
for fast implementation of arithmetic coding, with alphabet sizes in the
300 token range?
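
For scale, the bookkeeping that has to be fast is just maintaining cumulative
counts.  One way to sketch it (not claimed to be what any existing
implementation does) is a binary-indexed tree, which makes both the
cumulative-frequency lookup and the count update O(log n) instead of O(n)
for a few-hundred-token alphabet:

    class CumulativeCounts:
        """Adaptive symbol counts with O(log n) update and cumulative query,
        for use as the model behind an arithmetic coder."""
        def __init__(self, nsymbols):
            self.n = nsymbols
            self.tree = [0] * (nsymbols + 1)    # 1-based binary-indexed tree
            for s in range(nsymbols):           # start every symbol at count 1
                self.update(s, 1)

        def update(self, symbol, delta):
            i = symbol + 1
            while i <= self.n:
                self.tree[i] += delta
                i += i & (-i)

        def cum(self, symbol):
            # total count of all symbols strictly below `symbol`
            i, total = symbol, 0
            while i > 0:
                total += self.tree[i]
                i -= i & (-i)
            return total

        def range_for(self, symbol):
            # (low, high, total) counts the arithmetic coder would use
            return self.cum(symbol), self.cum(symbol + 1), self.cum(self.n)

    model = CumulativeCounts(300)       # ~300-token alphabet
    print(model.range_for(42))          # (42, 43, 300) before any adaptation
    model.update(42, 1)                 # bump the count after coding token 42
    print(model.range_for(42))          # (42, 44, 301)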


Note that a good implementation of LZ1 algorithms (the ones used in PKZIP, ARJ
and LHARC are of this type) can give, in reasonable time, compression levels
superior to LZW or AP, and not far off C-W and other high order multiple state
or Markov models.
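
For readers following along, the back-pointer idea is roughly the following
(a deliberately naive Python sketch; the programs named above use much smarter
match finding and then entropy-code the output):

    def lz77_compress(data, window=4096, max_len=255):
        # Greedy, brute-force match search; real implementations hash the input.
        i, out = 0, []
        while i < len(data):
            best_len, best_off = 0, 0
            for j in range(max(0, i - window), i):
                length = 0
                while (length < max_len and i + length < len(data)
                       and data[j + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_len, best_off = length, i - j
            if best_len >= 3:                     # back-pointer: (offset, length)
                out.append((best_off, best_len, None))
                i += best_len
            else:                                 # literal byte
                out.append((0, 0, data[i]))
                i += 1
        return out

    def lz77_decompress(tokens):
        out = bytearray()
        for off, length, literal in tokens:
            if length == 0:
                out.append(literal)
            else:
                for _ in range(length):           # copies may overlap themselves
                    out.append(out[-off])
        return bytes(out)

    text = b"the boy who fell saw the boy who ran"
    assert lz77_decompress(lz77_compress(text)) == text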

And of course, compressors designed for a particular class of data can often
do better than such general-purpose compressors.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

ttobler@unislc.uucp (Trent Tobler) (05/29/91)

From article <2722@pdxgate.UUCP>, by hal@eecs.cs.pdx.edu (Aaron Harsh):
> 
>   The only lossless compression algorithms I've seen have either been LZW
> or its derivatives (AP, MW and Y-coding) or Huffman encoding.  These seem
> fast, but their compression doesn't seem that remarkable.  Are there any
> other general purpose algorithms that achieve better results, even at the
> expense of speed?
> 
What do you mean by "their compression doesn't seem that remarkable"?  For
data where the redundant information is in its symbol frequencies (Huffman
encoding) or symbol string frequencies (LZ type algorithms), the compression
approaches the maximum possible for large files.  English text has both of these
properties, and therefore good compression ratios are achieved with either
algorithm (although LZ works somewhat better because repeated symbol strings are
common).  If redundant information exists in other forms, or the amount
of data to compress is small, then the general algorithms are somewhat less
than maximal.  There is a limit on the compression ratio that is possible for
a specific class of data.  The 'BEST' compression algorithm is the one that
can approach this limit for that data.

There are algorithms that work better than LZ and Huffman for the
class of data they operate on.

     Full arithmetic encoding achieves better compression than Huffman,
     with the drawback of being significantly slower.

     Some kind of string look-ahead, combined with a heuristic algorithm
     to decide whether adding a string to the dictionary is useful (as well as
     intelligently removing entries from the dictionary), combined with full
     arithmetic encoding, will beat the LZ type algorithms, of course at
     the cost of being significantly slower and requiring more memory.

It is a hypothesis of mine that as the compression ratio approaches
the limit of compression for a class of data, the amount of memory and
amount of time required increase.  Something along the lines of
   m*t = 1/(d-L)
where m is memory, t is time, d is the desired compression, and L is the
maximal compression.

Anyone have any comments, or have seen something like this?

--
     Trent Tobler - ttobler@csulx.weber.edu

brad@looking.on.ca (Brad Templeton) (05/29/91)

You suggest that a string look ahead with clever heuristics to add
and delete from the dictionary would give good compression.

How?  If it's a look ahead method, the decompressor can't look ahead and
duplicate the compressor's model.

This means you must insert tokens into the stream, saying "here's
a good dictionary entry."    And then you need the reference to the
entry.  The two together take as much or more room than the backpointer
of LZ compression.

Or do you have some idea that I am missing?

-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

campbell@sol.cs.wmich.edu (Paul Campbell) (05/30/91)

The only other improvements you can make beyond LZ derivatives and arithmetic
encoding are either hybrids or heuristics. For instance, you could do some sort
of weird algorithm where you create a half-adaptive, half-fixed LZ
encoding scheme, which is a heuristic approach. If someone can find a neat,
intelligent way to do this, that would be great. It is obvious that there is
something to be had here: do an unbounded adaptive LZ compression, but
keep the table, and then run the data through again using that fixed table.
You can also remove the unused entries first. On something like English text,
this method just might result in even higher compression ratios, but this is
speculation. The coders on here might want to try it; I have never had the
software, the time, or the machine to do so.
	The other possibility is hybrids. Lots of MS-DOG and other compressors
are based on the idea of combining two or more compressors. For instance, a lot
of them will use plain run-length encoding to remove the repetition from a file.
That does very well when combined with arithmetic or Huffman encoding, and helps
a little with LZ. Almost universally, though, they take an LZ
compression with a low maximum table size (10-13 bits on average) and
feed the resulting codes to an arithmetic compressor for further compression,
which actually does achieve good results.
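The run-length front end is about as simple as compression gets; a minimal
Python sketch (real programs usually escape only runs past some threshold so
that random data does not expand), with the (count, byte) pairs then handed
to the Huffman or arithmetic stage:

    def rle_encode(data):
        # Collapse each run into a (count, byte) pair, capping counts at 255
        # so a pair always fits in two bytes.
        out, i = [], 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i] and run < 255:
                run += 1
            out.append((run, data[i]))
            i += run
        return out

    def rle_decode(pairs):
        return bytes(b for count, b in pairs for _ in range(count))

    pairs = rle_encode(b"aaaaaabcccccc")
    print(pairs)                                # [(6, 97), (1, 98), (6, 99)]
    assert rle_decode(pairs) == b"aaaaaabcccccc"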
	The only alternative is to know something about the data you are
compressing. With sound and graphics, you can go to a lossy compression. With
text, you can start with the system dictionary or a phoneme table to do an
initial compression (even n-grams help here) and then hit it with LZ. The real
advantage to dictionary compression without any other compression is that you
can still do keyword and other types of searches on the text.
	Under Minix, assembly language output is compressed by turning the
mnemonics into single tokens in the range of $80+ right in the text
stream.
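I have not checked the Minix sources for the exact table, but the idea is just
a reversible substitution of one-byte codes at $80 and above for common
mnemonics, something like this hypothetical Python sketch:

    # Hypothetical mnemonic table for illustration; the real Minix table is
    # not reproduced here.
    MNEMONICS = ["mov", "push", "pop", "call", "ret", "jmp", "add", "cmp"]
    TO_TOKEN = {m: 0x80 + i for i, m in enumerate(MNEMONICS)}
    FROM_TOKEN = {0x80 + i: m for i, m in enumerate(MNEMONICS)}

    def tokenize_line(line):
        # Replace a leading mnemonic with its one-byte token; keep the
        # operands verbatim.
        word = line.split(" ", 1)[0]
        if word in TO_TOKEN:
            return bytes([TO_TOKEN[word]]) + line[len(word):].encode("ascii")
        return line.encode("ascii")

    def detokenize_line(raw):
        if raw and raw[0] in FROM_TOKEN:
            return FROM_TOKEN[raw[0]] + raw[1:].decode("ascii")
        return raw.decode("ascii")

    packed = tokenize_line("mov ax, 5")
    print(len(packed), detokenize_line(packed))    # 7 mov ax, 5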
	I forgot the guy's name, but he's applying the same thing he did with
fractals to graphics and getting phenomenal compression ratios (10,000:1).
He has a simple formula where a major class of fractals can be represented with
about 8 parameters. Then he does pattern matching to try to find a set of
fractals that match the image.

	Don't forget the other major advantage of data compression: until we
can use memory chips for everything, you've still got to deal with slow storage
devices. If you can compress something to half its original size, then with the
amount of time the CPU spends idle waiting for loads (you could also use
a dedicated sub-processor for this on a Unix box some day), you can double
your throughput by compressing/decompressing the data stream. BBS users are
already reaping the benefits of this because almost all BBS's store files in
some sort of compressed format and transmit them over very slow lines, 2400 bps
in most cases. When I'm waiting for a file transfer, I have plenty of time
to think of ways to get it to my machine faster, and the protocol twiddling
has pretty much reached its zenith.

campbell@sol.cs.wmich.edu (Paul Campbell) (05/30/91)

Please note the reply line; it mismarks my postings here. I'm suggesting that
you extend the default dictionary (which ordinarily contains just the
initial character set) to contain other things, and that you find some way to
choose entries which would work 'best' in this dictionary. Since you're now
going to have to store at least part of that dictionary as a preamble to the
LZ compressed file, there should still be a way to gain at least some savings.
For instance, store any string which is repeated over x% of the document (yep,
you will have to pre-scan the document first to find these). Another example
was when a friend and I were looking at adaptive and fixed arithmetic encoding.
We tried making a better method by first doing a frequency scan of the incoming
text, which generated the necessary data for a good fixed model. Then we
went ahead and did an adaptive compression, for a small savings, since it had
a very good initial weighting to start with.
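
In code, the two-pass idea is nothing more than the following sketch (the seed
counts have to reach the decompressor too, so they are sent as a small table
up front, which is where part of the savings goes):

    from collections import Counter

    def seed_model(text, floor=1):
        # First pass: count byte frequencies so the adaptive coder starts from
        # a realistic distribution instead of a flat one.  The floor keeps
        # symbols that never occurred codable.
        counts = Counter(text)
        return [counts.get(b, 0) + floor for b in range(256)]

    def second_pass(text, model, bump=1):
        # Second pass: an arithmetic coder (omitted here) would code each byte
        # against the current model, which keeps adapting on top of the seed.
        for b in text:
            # ... emit b against `model` here ...
            model[b] += bump
        return model

    model = seed_model(b"abracadabra")
    model = second_pass(b"abracadabra", model)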

ttobler@unislc.uucp (Trent Tobler) (06/02/91)

From article <1991May29.155853.13716@looking.on.ca>, by brad@looking.on.ca (Brad Templeton):
> You suggest that a string look ahead with clever heuristics to add
> and delete from the dictionary would give good compression.
> 
> How?  If it's a look ahead method, the decompressor can't look ahead and
> duplicate the compressor's model.
> 
> This means you must insert tokens into the stream, saying "here's
> a good dictionary entry."    And then you need the reference to the
> entry.  The two together take as much or more room than the backpointer
> of LZ compression.
> 
> Or do you have some idea that I am missing?

This type of algorithm would adapt faster than any LZ algorithm.  It would
also only have those entries in its dictionary that it needed, and because
it uses arithmetic coding, any LZ algorithm that contained more than
x times the number of entries at any time compared to this one will lose when
it comes to compression ratios (the x comes from the number of bits for the
prefix necessary to determine whether an entry should be removed/added/etc - this
is what I think you had a problem with).  Since I have not actually coded this,
I cannot give exact figures, but here is a description (albeit cryptic) of one.

The algorithm that I have imagined and examined is sort of a language.  It
has an initial vocabulary of BD (begin-definition), ED (end-definition),
ITK (insert-token-keep), ITR (insert-token-remove), BQ (begin-quote),
EQ (end-quote), and QC (quote-character).  This is a state machine (actually
closer to a PDA), with each state having some probability of occurring (based
on past selections).  Below, I have outlined the possible actions that each
state can take.

  Command:
     BD  -> [Definition] appended to stream.
     ITK -> [Token] appended to stream.
     ITR -> [Token] appended to stream, then removed from dictionary.
     BQ -> [Quote] appended to stream.
     QC -> single character appended to stream.

  Definition:
     BD -> [Definition] appended to new token.
     ED -> return new token to calling state.
     ITK -> [Token] appended to new token.
     ITR -> [Token] appended to new token, then removed from dictionary.
     BQ -> [Quote] appended to new token.
     QC -> single character appended to new token.

  Token:
     <ident> -> return contents of token <ident>

  Quote:
     <ascii> -> add ascii character to quote buffer.
     EQ -> return quote buffer.


It might help if I gave an example. 


  Uncompressed text:
     "mississippi"

  Compressed text:                                      dictionary:
     [QC](2.32)  'm'(8)                                   -
     [BD](2.32)                                           -
          [BD](2.58) [QC](2.58) 'i'(8) [ED](2.58)        'i'
          [BD](2.58) [QC](2.58) 's'(8) [ED](2.58)        'i','s'
          [ITR](2.58) 's'(1)                             'i'
     [ED](2.58)                                          'i','iss'
     [ITR](2.32) 'iss'(1)                                'i'
     [ITK](2.32) 'i'(0)                                  'i'
     [BD](2.32) [QC](2.58) 'p'(8) [ED](2.58)             'i','p'
     [ITR](2.32) 'p'(1)                                  'i'
     [ITR](2.32) 'i'(0)                                   -

*The number in parentheses is the number of bits needed to encode each symbol.

So, if you are using an 8-bit machine, "mississippi" requires 8*11 = 88 bits
uncompressed; encoded, it takes 77.04 bits.
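
Purely to illustrate the control flow, here is one way a decoder for the
scheme above might be organized in Python.  The arithmetic-coding layer that
produces the fractional-bit costs is omitted, and the (command, argument)
tuples are only an assumed in-memory form, not a bit-level format:

    def run(cmds, dictionary):
        """Interpret an already-entropy-decoded command stream.  `cmds` must
        be an iterator so that nested BD ... ED definitions consume from it."""
        out = []
        for cmd, arg in cmds:
            if cmd == 'ED':               # close the current definition
                break
            if cmd == 'QC':               # single literal character
                out.append(arg)
            elif cmd == 'BQ':             # quoted string (EQ folded into arg here)
                out.append(arg)
            elif cmd == 'ITK':            # insert a dictionary token, keep it
                out.append(dictionary[arg])
            elif cmd == 'ITR':            # insert a dictionary token, then remove it
                out.append(dictionary.pop(arg))
            elif cmd == 'BD':             # define a new token; its text is also
                text = run(cmds, dictionary)    # emitted at the current level
                dictionary.append(text)
                out.append(text)
        return ''.join(out)

    cmds = iter([
        ('QC', 'm'),
        ('BD', None),
            ('BD', None), ('QC', 'i'), ('ED', None),
            ('BD', None), ('QC', 's'), ('ED', None),
            ('ITR', 1),
        ('ED', None),
        ('ITR', 1), ('ITK', 0),
        ('BD', None), ('QC', 'p'), ('ED', None),
        ('ITR', 1), ('ITR', 0),
    ])
    print(run(cmds, []))                  # mississippi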

> -- 
> Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

--
   Trent Tobler - ttobler@csulx.weber.edu

brad@looking.on.ca (Brad Templeton) (06/03/91)

I investigated such ideas and I think there are a few problems.

First of all, to avoid confusion of terms, LZ compression refers to two
algorithms.   One is a general algorithm where you have backpointers into
the previous text.  The whole of the previous text (or the last N bytes)
is your dictionary, so there is not a question of "adapting faster" in
such an algorithm.   The later LZ algorithm, and its descendants, such
as LZW (a variant designed for speed), M-W and All-Prefix coding, do have
"how fast they adapt" as an important factor.

Your algorithm is similar to LZ1, but instead of backpointers you use
tags.   Tag a string when it goes out, and then refer to it when you need to.
The problem is that tag+referral can be larger than just a backpointer referral.
How long are the end-token and start-token codes?   Just as important, since
they exist in the same token-space as regular entries in the alphabet, how
much longer do they make all the other codes?
(i.e. if you use 1 bit for a tag code, you add 1 bit to all the regular
tokens in the alphabet)
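
It is easy to put a number on that last point: a flat one-bit flag costs every
ordinary token a full bit, while a tag given probability p under an arithmetic
coder costs the ordinary tokens only -log2(1-p) bits each.  A back-of-the-
envelope check in Python:

    from math import log2

    def escape_overhead(p_tag):
        # Extra bits every ordinary token pays when a tag/escape symbol with
        # probability p_tag shares the same code space.
        return -log2(1.0 - p_tag)

    print(escape_overhead(0.5))      # flat 1-bit flag: 1.0 extra bit per token
    print(escape_overhead(1 / 64))   # rare tag: about 0.023 extra bits per token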

It's tricky.  I'm not saying it's impossible, but I would like to see more
evidence that it can be done.

Remember as well that you have to deal with the fact that a new entry can
often consist of the tail of one reference and the head of another reference.

i.e. if you have two dictionary entries "the boy " and "who fell", how do you
tag the string "boy who" if it first appears in the transmission of the
two entries together?   How do you do this without using space?
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

ttobler@unislc.uucp (Trent Tobler) (06/05/91)

From article <1991Jun02.213116.3069@looking.on.ca>, by brad@looking.on.ca (Brad Templeton):
> I investigated such ideas and I think there are a few problems.
> 
> First of all, to avoid confusion of terms, LZ compression refers to two
> algorithms.   One is a general algorithm where you have backpointers into
> the previous text.  The whole of the previous text (or the last N bytes)
> is your dictionary, so there is not a question of "adapting faster" in
> such an algorithm.   The later LZ algorithm, and its descendants, such

Such an algorithm indeed does adapt, but, similar to mine, it requires
additional bits in order to determine whether the next token is a symbol or
a back-pointer.  Depending on how fast you want it to adapt, the number of
additional bits for a back-pointer can be 0 (i.e. no adaptation, but high
compression), or you can take 0 to be the number of bits to say 'this is a
symbol' and get infinite adaptation, but no compression.
You also have to decide how you are going to represent the
'back-pointers'.  Are they relative to the current position, to the beginning
of the file, or somewhere in between?  Are they fixed length, variable
length, etc.?  And how do you say "take 50 characters from a position 100
bytes back"?
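
One concrete answer to the fixed-versus-variable-length question is a
universal integer code such as Elias gamma (used here purely as an
illustration, not as what any of the programs mentioned actually use): short
distances and lengths get short codes, and "take 50 characters from 100 bytes
back" becomes two self-delimiting fields.

    def gamma_encode(n):
        # Elias gamma code for n >= 1: (number of significant bits - 1) zeros,
        # then n itself in binary.  Small numbers get short codes.
        bits = bin(n)[2:]
        return "0" * (len(bits) - 1) + bits

    def gamma_decode(code, pos=0):
        zeros = 0
        while code[pos] == "0":
            zeros += 1
            pos += 1
        n = int(code[pos:pos + zeros + 1], 2)
        return n, pos + zeros + 1

    # "take 50 characters from a position 100 bytes back" as two fields:
    code = gamma_encode(100) + gamma_encode(50)
    distance, pos = gamma_decode(code)
    length, _ = gamma_decode(code, pos)
    print(distance, length, len(code))      # 100 50 24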

> as LZW (a variant designed for speed) M-W and All-Prefix coding do have
> "how fast they adapt" as an important factor.

So does the 'back-pointer'.  I guess the nice thing about it is that you
can decide how fast it will adapt.

> Your algorithm is similar to LZ1, but instead of backpointers you use
> tags.   Tag a string when it goes out, and then refer to it when you need to.

No, I only tag a string that the process KNOWS will be used later.

> Problem is that tag+referral can be larger than just a backpointer referral.
> How long are the end-token and start-token codes?   Just as important, since
> they exist in the same token-space as regular entries in the alphabet, how
> much longer do they make all the other codes.

All other codes will be 2.32 bits longer (i.e. a single character will be
10.32 bits long); a string of x characters will be 2.32 + (x+1)*8.01 bits
long (i.e. a string of 10 characters will be 82.42 bits long); a token
(dictionary entry substitution) will be 2.32 + ln(T/N)/ln(2) bits, where N is
the number of times the token has been used and T is the sum of the number of
times all current tokens have been used (so if there are 10 tokens, N(4) = 10,
and T = 50, the number of bits for token 4 would be 4.64 bits).  Defining a
new entry requires 4.90 additional bits beyond what it takes to encode what
that entry is (using the previous encodings).  All of this assumes that I do
frequency counting only on the dictionary entries, and none on any of the
other tokens.
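
For what it's worth, those figures check out if the five commands that are
legal outside a definition start out equally likely, which is what the 2.32
figure suggests:

    from math import log2

    command_bits = log2(5)                      # BD, ITK, ITR, BQ, QC equally likely
    token_bits = command_bits + log2(50 / 10)   # T = 50 total uses, N = 10 for token 4
    print(round(command_bits, 2), round(token_bits, 2))    # 2.32 4.64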

> (ie. if you use 1 bit for a tag code, you add 1 bit to all the regular
> tokens in the alphabet)
> 
> It's tricky.  I'm not saying it's impossible, but I would like to see more
> evidence that it can be done.
> 
> Remember as well that you have to deal with the fact that a new entry can
> often consists of the tail of one reference and the head of another reference.

But I take in to consideration this fact.

> ie. if you have two dictionary entries "the boy " and "who fell" how do you
> tag the string "boy who" if it first appears in the transmission of the
> two entries together?   How do you do this without using space?

Well, if "the boy who fell" appeared in the text, "the boy" and "who fell"
would not have been added to the dictionary unless they appeared later.  And
if "boy who" also appeared later, "boy " and "who" would also have been
added.  I did not give any indication of how I was storing the strings, only
that the storage must have the property that I can add any entry and delete
any entry.  The strings could be stored as reference pointers, actual text,
etc., and so the space used would depend on that.  (There is one method
that guarantees no more than N+ln(N)*N memory units for data of size N.)

Also, because I did not even attempt any heuristic algorithms in
my description of an example, there will be problems like this.  That is
why I stated that this combined with heuristics would be much better.

> -- 
> Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

--
   Trent Tobler - ttobler@csulx.weber.edu