hal@eecs.cs.pdx.edu (Aaron Harsh) (05/24/91)
The only lossless compression algorithms I've seen have been either LZW and its derivatives (AP, MW and Y-coding) or Huffman encoding.  These seem fast, but their compression doesn't seem that remarkable.  Are there any other general-purpose algorithms that achieve better results, even at the expense of speed?

Aaron Harsh
hal@eecs.cs.pdx.edu
jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) (05/24/91)
From article <2722@pdxgate.UUCP>, by hal@eecs.cs.pdx.edu (Aaron Harsh):
> The only lossless compression algorithms I've seen have either been LZW
> or its derivatives (AP, MW and Y-coding) or Huffman encoding.  These seem
> fast, but their compression doesn't seem that remarkable.  Are there any
> other general purpose algorithms that achieve better results, even at the
> expense of speed?

Well, there's always my splay-tree based prefix code (software available by e-mail request).  See:

  Application of Splay Trees to Data Compression, by Douglas Jones,
  Communications of the ACM, Vol 31, No 8 (Aug 1988), pages 996-1006.

But by far the best lossless compression is achieved by arithmetic codes, especially when used with multiple-state source models.  Arithmetic codes can code individual source characters in fractional bits (something that a Huffman code can't do), but they are generally slow, and when used with multiple-state source models, they get quite big as well.  See:

  Arithmetic Coding for Data Compression, by Witten et al,
  Communications of the ACM, Vol 30, No 6 (June 1987), pages 520-540.

I hope this helps.

Doug Jones
jones@cs.uiowa.edu
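[Editor's sketch, not from the post: Jones's point that a Huffman code cannot assign fractional bits can be made concrete.  The skewed three-symbol alphabet below is an illustrative assumption of mine; an arithmetic coder approaches the entropy, while Huffman is stuck at one whole bit per symbol minimum.]

```python
import math

# Hypothetical skewed source: 'a' appears 90% of the time.
probs = {'a': 0.9, 'b': 0.05, 'c': 0.05}

# Shannon entropy: the bits/symbol an arithmetic coder can approach.
entropy = -sum(p * math.log2(p) for p in probs.values())

# The best Huffman code here is a:0, b:10, c:11, so the average length is:
huffman_avg = 0.9 * 1 + 0.05 * 2 + 0.05 * 2

print(f"entropy     = {entropy:.3f} bits/symbol")     # about 0.569
print(f"Huffman avg = {huffman_avg:.3f} bits/symbol")  # 1.100
```

The gap (roughly a factor of two here) is exactly the fractional-bit loss Jones describes; on less skewed alphabets the gap shrinks.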
brad@looking.on.ca (Brad Templeton) (05/25/91)
In article <6200@ns-mx.uiowa.edu> jones@pyrite.cs.uiowa.edu (Douglas W. Jones) writes:
>But by far the best lossless compression is achieved by arithmetic codes,
>especially when used with multiple-state source models.  Arithmetic codes
>can code individual source characters in fractional bits (something that
>a Huffman code can't do), but they are generally slow, and when used with
>multiple-state source models, they get quite big as well.  See:

I don't know about "by far the best."  Arithmetic codes give you about half a bit per token better compression, which on 0-order byte models is around 10% better, and on 2nd- or 3rd-order models such as the ones you describe, this drops to just a few percent.  Not that it isn't good to get every drop out, but sometimes the speed penalty isn't worth it.

So on to the related question: what are the best (non-patented) algorithms for fast implementation of arithmetic coding, with alphabet sizes in the 300-token range?

Note that good implementations of LZ1 algorithms (the ones used in PKZIP, ARJ and LHARC are such) can give, in reasonable time, compression levels superior to LZW or AP and not far off C-W and high-order multiple-state or Markov models.  And of course, compressors designed for a particular class of data can often do better than such general-purpose compressors.
--
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473
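[Editor's sketch of the order-0 vs. higher-order distinction being argued here; the sample text and figures are mine, not Brad's.  An order-0 model codes each character by its overall frequency; an order-1 model conditions on the previous character, and on repetitive text the conditional entropy is markedly lower.]

```python
import math
from collections import Counter

text = "the theory of the thing is that the theme repeats" * 20

def order0_entropy(s):
    # bits/char when each character is coded by its overall frequency
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def order1_entropy(s):
    # bits/char when each character is coded given the previous character
    pair = Counter(zip(s, s[1:]))
    ctx = Counter(s[:-1])
    n = len(s) - 1
    return -sum(c / n * math.log2(c / ctx[a]) for (a, b), c in pair.items())

print(f"order-0: {order0_entropy(text):.2f} bits/char")
print(f"order-1: {order1_entropy(text):.2f} bits/char")  # noticeably lower
```

Since the higher-order model already removes most of the redundancy, the extra half-bit an arithmetic coder saves over Huffman is a smaller fraction of the total, which is Brad's point.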
ttobler@unislc.uucp (Trent Tobler) (05/29/91)
From article <2722@pdxgate.UUCP>, by hal@eecs.cs.pdx.edu (Aaron Harsh):
> The only lossless compression algorithms I've seen have either been LZW
> or its derivatives (AP, MW and Y-coding) or Huffman encoding.  These seem
> fast, but their compression doesn't seem that remarkable.  Are there any
> other general purpose algorithms that achieve better results, even at the
> expense of speed?

What do you mean by their compression not seeming remarkable?  For data whose redundancy lies in its symbol frequencies (Huffman encoding) or symbol-string frequencies (LZ-type algorithms), the compression approaches the maximum possible for large files.  English text has both of these properties, and therefore good compression ratios are achieved with either algorithm (although LZ works somewhat better because repeated symbol strings are common).  If the redundant information exists in other forms, or the amount of data to compress is small, then the general algorithms fall somewhat short of the maximum.

There is a limit to the compression ratio that is possible for a specific class of data.  The 'BEST' compression algorithm is the one that can approach this limit for that data, and there are algorithms that work better than LZ and Huffman on the class of data they operate on.  Full arithmetic encoding achieves better compression than Huffman, with the drawback of being significantly slower.  Some kind of string look-ahead, combined with a heuristic algorithm to decide whether adding a string to the dictionary is useful (and to intelligently remove entries from the dictionary), combined in turn with full arithmetic encoding, will beat the LZ-type algorithms -- at the cost, of course, of being significantly slower and requiring more memory.

It is a hypothesis of mine that as the compression ratio approaches the limit for a class of data, the amount of memory and the amount of time required both increase.
Something along the lines of

	m*t = 1/(d-L)

where m = memory, t = time, d = desired compression, L = maximal compression.  Anyone have any comments, or seen something like this?
--
Trent Tobler - ttobler@csulx.weber.edu
brad@looking.on.ca (Brad Templeton) (05/29/91)
You suggest that a string look-ahead with clever heuristics to add to and delete from the dictionary would give good compression.

How?  If it's a look-ahead method, the decompressor can't look ahead and duplicate the compressor's model.

This means you must insert tokens into the stream, saying "here's a good dictionary entry."  And then you need the reference to the entry.  The two together take as much or more room than the backpointer of LZ compression.

Or do you have some idea that I am missing?
--
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473
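[Editor's sketch of the backpointer scheme Brad is comparing against: a naive toy LZ1/LZ77 pass of my own, not PKZIP's code.  The previously seen text is the dictionary, and each repeated string is emitted as a single (offset, length) pair, so no explicit "add to dictionary" token is ever needed.]

```python
def lz77_tokens(data, window=4096, min_match=3):
    """Greedy LZ77-style pass: emit ('lit', ch) or ('ref', offset, length)."""
    tokens, i = [], 0
    while i < len(data):
        best_len = best_off = 0
        # naive longest-match search over the sliding window of prior text
        for j in range(max(0, i - window), i):
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_match:
            tokens.append(('ref', best_off, best_len))
            i += best_len
        else:
            tokens.append(('lit', data[i]))
            i += 1
    return tokens

def lz77_decode(tokens):
    out = []
    for t in tokens:
        if t[0] == 'lit':
            out.append(t[1])
        else:                       # copy 'length' chars from 'offset' back
            _, off, length = t
            for _ in range(length):
                out.append(out[-off])
    return ''.join(out)

toks = lz77_tokens("mississippi")
print(toks)   # "issi" comes out as a single backpointer: ('ref', 3, 4)
assert lz77_decode(toks) == "mississippi"
```

This is the baseline Trent's tagged-dictionary scheme has to beat: the backpointer costs only the bits of the (offset, length) pair, with no separate dictionary-maintenance tokens in the stream.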
campbell@sol.cs.wmich.edu (Paul Campbell) (05/30/91)
The only other improvements you can make beyond LZ derivatives and arithmetic encoding are hybrids or heuristics.

For instance, you could do some sort of weird algorithm where you create a half-adaptive, half-fixed LZ encoding scheme, which is a heuristic approach.  If someone can find a neat, intelligent way to do this, that would be great.  It is obvious that there is something to be had here: do an unbounded adaptive LZ compression, but keep the table and then run the input again using the fixed table, at which point you can also remove the unused entries.  On something like English text, this method just might yield even higher compression ratios, but that is speculation.  The coders on here might want to try it; I have never had the software, the time, or the machine to do so.

The other possibility is hybrids.  Lots of MS-DOG and other compressors are based on the idea of combining two or more compressors.  For instance, many of them use just run-length encoding to remove the repetition from a file first.  It works very well when combined with arithmetic or Huffman encoding, and helps a little for LZ schemes.  Almost universally, though, they take an LZ compressor with a low maximum table size (10-13 bits on average) and feed the resulting codes to an arithmetic compressor for further compression, which actually does achieve good results.

The only alternative is to know something about the data you are compressing.  With sound and graphics, you can go to a lossy compression.  With text, you can start with the system dictionary or a phoneme table to do an initial compression (even n-grams help here) and then hit it with LZ.  The real advantage of dictionary compression without any other compression is that you can still do keyword and other types of searches on the text.  Under Minix, assembly language output is compressed by turning the mnemonics into single tokens in the range of $80 and up, right in the text stream.
I forgot the guy's name, but he's applying the same thing he did with fractals to graphics and getting phenomenal compression ratios (10,000:1).  He has a simple formula by which a major class of fractals can be represented with about 8 parameters, and he does pattern matching to find a set of fractals that match the image.

Don't forget the other major advantage of data compression: until we can use memory chips for everything, you've still got to deal with slow storage devices.  If you can compress something to half its original size, then with the time the CPU spends idle waiting for loads (you could also use a dedicated sub-processor for this on a Unix box some day), you can double your throughput by compressing/decompressing the data stream.  BBS users are already reaping the benefits, because almost all BBSs store files in some compressed format and transmit them over very slow lines -- 2400 bps in most cases.  When I'm waiting for a file transfer, I have plenty of time to think of ways to get it to my machine faster, and protocol twiddling has pretty much reached its zenith.
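[Editor's sketch of the run-length front end Paul describes above; a minimal version of my own.  A back-end Huffman or arithmetic coder would then see compact (symbol, count) pairs instead of raw repetition.]

```python
def rle_encode(s):
    """Collapse each run of identical characters to a (symbol, count) pair."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))
        i = j
    return out

def rle_decode(runs):
    return ''.join(ch * n for ch, n in runs)

runs = rle_encode("aaaabbbcca")
print(runs)   # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
assert rle_decode(runs) == "aaaabbbcca"
```

On data with long runs this pass alone shrinks the input; on typical text it mostly just normalizes the stream for the statistical coder behind it, which matches Paul's observation that it helps arithmetic/Huffman back ends more than LZ ones.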
campbell@sol.cs.wmich.edu (Paul Campbell) (05/30/91)
Please note the reply line; it mismarks my postings here.

I'm suggesting this: extend the default dictionary (which ordinarily contains just the initial character set) to contain other things, and find some way to choose the entries that would work 'best' in this dictionary.  Since you now have to store at least part of that dictionary as a preamble to the LZ-compressed file, there should still be a way to gain at least some savings.  For instance, store any string which is repeated in more than x% of the document (yes, you will have to pre-scan the document first to find these).

Another example: a friend and I were looking at adaptive and fixed arithmetic encoding.  We tried making a better method by first doing a frequency scan of the incoming text, which generated the data needed for a good fixed model.  Then we went ahead and did an adaptive compression for a small additional savings, since it started with a very good initial weighting.
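[Editor's sketch of the two-pass idea above; my own reconstruction, not their code.  Pass one scans the input for character frequencies; the second number is the ideal cost of coding the same input under that fixed model.  A real implementation would also have to pay to transmit the frequency table as a preamble, which this sketch ignores.]

```python
import math
from collections import Counter

def semi_static_cost_bits(text):
    """Ideal coding cost under a fixed model built from a frequency pre-scan."""
    counts = Counter(text)
    n = len(text)
    # an arithmetic coder spends about -log2 p(c) bits per occurrence of c
    return sum(c * -math.log2(c / n) for c in counts.values())

text = "the rain in spain stays mainly in the plain" * 10
flat = 8 * len(text)                      # baseline: 8 bits per character
model = semi_static_cost_bits(text)

print(f"flat:  {flat} bits")
print(f"model: {model:.0f} bits")         # well under the flat cost
```

Starting an adaptive coder from these pre-scanned weights, as Paul describes, avoids the early phase where an adaptive model's statistics are still poor.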
ttobler@unislc.uucp (Trent Tobler) (06/02/91)
From article <1991May29.155853.13716@looking.on.ca>, by brad@looking.on.ca (Brad Templeton):
> You suggest that a string look-ahead with clever heuristics to add
> and delete from the dictionary would give good compression.
>
> How?  If it's a look-ahead method, the decompressor can't look ahead and
> duplicate the compressor's model.
>
> This means you must insert tokens into the stream, saying "here's
> a good dictionary entry."  And then you need the reference to the
> entry.  The two together take as much or more room than the backpointer
> of LZ compression.
>
> Or do you have some idea that I am missing?

This type of algorithm would adapt faster than any LZ algorithm.  It would also keep only those entries in its dictionary that it needed, and because it uses arithmetic coding, any LZ algorithm whose dictionary holds more than x times as many entries at any time will lose to it on compression ratio (the x comes from the number of bits for the prefix needed to signal that an entry should be removed, added, etc. -- this, I think, is what you had a problem with).

Since I have not actually coded this, I cannot give exact figures, but here is a description (albeit cryptic) of one.  The algorithm I have imagined and examined is sort of a language.  It has a beginning dictionary of BD (begin-definition), ED (end-definition), ITK (insert-token-keep), ITR (insert-token-remove), BQ (begin-quote), EQ (end-quote), and QC (quote-character).  This is a state machine (actually closer to a PDA), with each state having some probability of occurring (based on past selections).  Below, I have outlined the possible actions that each state can take.

Command:
  BD  -> [Definition] appended to stream.
  ITK -> [Token] appended to stream.
  ITR -> [Token] appended to stream, then removed from dictionary.
  BQ  -> [Quote] appended to stream.
  QC  -> single character appended to stream.

Definition:
  BD  -> [Definition] appended to new token.
  ED  -> return new token to calling state.
  ITK -> [Token] appended to new token.
  ITR -> [Token] appended to new token, then removed from dictionary.
  BQ  -> [Quote] appended to new token.
  QC  -> single character appended to new token.

Token:
  <ident> -> return contents of token <ident>.

Quote:
  <ascii> -> add ascii character to quote buffer.
  EQ      -> return quote buffer.

It might help if I gave an example.

Uncompressed text: "mississippi"

Compressed text:                              dictionary:
  [QC](2.32) 'm'(8)                           -
  [BD](2.32)                                  -
  [BD](2.58) [QC](2.58) 'i'(8) [ED](2.58)     'i'
  [BD](2.58) [QC](2.58) 's'(8) [ED](2.58)     'i','s'
  [ITR](2.58) 's'(1)                          'i'
  [ED](2.58)                                  'i','iss'
  [ITR](2.32) 'iss'(1)                        'i'
  [ITK](2.32) 'i'(0)                          'i'
  [BD](2.32) [QC](2.58) 'p'(8) [ED](2.58)     'i','p'
  [ITR](2.32) 'p'(1)                          'i'
  [ITR](2.32) 'i'(0)                          -

* The number in parentheses is the number of bits to encode the symbol.

So, if you are using an 8-bit machine, "mississippi" requires 8*11 = 88 bits; encoded, 77.04 bits.
--
Trent Tobler - ttobler@csulx.weber.edu
brad@looking.on.ca (Brad Templeton) (06/03/91)
I investigated such ideas, and I think there are a few problems.

First of all, to avoid confusion of terms, "LZ compression" refers to two algorithms.  One is a general algorithm where you have backpointers into the previous text.  The whole of the previous text (or the last N bytes) is your dictionary, so there is no question of "adapting faster" in such an algorithm.  The later LZ algorithm and its descendants, such as LZW (a variant designed for speed), M-W and All-Prefix coding, do have "how fast they adapt" as an important factor.

Your algorithm is similar to LZ1, but instead of backpointers you use tags.  Tag a string when it goes out, and then refer to it when you need to.  The problem is that tag+referral can be larger than just a backpointer referral.  How long are the end-token and start-token codes?  Just as important, since they exist in the same token-space as regular entries in the alphabet, how much longer do they make all the other codes?  (I.e., if you use 1 bit for a tag code, you add 1 bit to all the regular tokens in the alphabet.)

It's tricky.  I'm not saying it's impossible, but I would like to see more evidence that it can be done.

Remember as well that you have to deal with the fact that a new entry can often consist of the tail of one reference and the head of another.  I.e., if you have two dictionary entries "the boy " and "who fell", how do you tag the string "boy who" if it first appears in the transmission of the two entries together?  How do you do this without using space?
--
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473
ttobler@unislc.uucp (Trent Tobler) (06/05/91)
From article <1991Jun02.213116.3069@looking.on.ca>, by brad@looking.on.ca (Brad Templeton):
> I investigated such ideas and I think there are a few problems.
>
> First of all, to avoid confusion of terms, LZ compression refers to two
> algorithms.  One is a general algorithm where you have backpointers into
> the previous text.  The whole of the previous text (or the last N bytes)
> is your dictionary, so there is not a question of "adapting faster" in
> such an algorithm.

Such an algorithm does indeed adapt, but like mine it requires additional bits to determine whether the next token is a symbol or a back-pointer.  Depending on how fast you want it to adapt, the number of additional bits for a back-pointer can be 0 (i.e. no adaptation, but high compression), or you can take 0 to be the number of bits to say 'this is a symbol' and get infinite adaptation, but no compression.  You also have to decide how you are going to represent the back-pointers.  Are they relative to the current position, the beginning of the file, or somewhere in between?  Are they fixed length, variable length, etc.?  And how do you say "take 50 characters from a position 100 bytes back"?

> as LZW (a variant designed for speed) M-W and All-Prefix coding do have
> "how fast they adapt" as an important factor.

So does the back-pointer method.  I guess the nice thing about it is that you can decide how fast it will adapt.

> Your algorithm is similar to LZ1, but instead of backpointers you use
> tags.  Tag a string when it goes out, and then refer to it when you need to.

No, I only tag a string that the process KNOWS will be used later.

> Problem is that tag+referral can be larger than just a backpointer referral.
> How long are the end-token and start-token codes?  Just as important, since
> they exist in the same token-space as regular entries in the alphabet, how
> much longer do they make all the other codes.
All other codes will be 2.32 bits longer (i.e. a single character will be 10.32 bits long); a string of characters will be 2.32 + (x+1)*8.01 bits long (i.e. a string of 10 characters will be 82.42 bits long); tokens (dictionary-entry substitutions) will cost 2.32 + ln(T/N)/ln(2) bits, where N is the number of times the token has been used and T is the sum of the number of times all current tokens have been used (so if there are 10 tokens, N(4) = 10, and T = 50, the number of bits for token 4 would be 4.64).  Defining a new entry requires 4.90 additional bits beyond what it takes to encode the entry itself (using the previous encodings).  All of this assumes that I do frequency counting only on the dictionary entries, and none on any of the other tokens.

> (ie. if you use 1 bit for a tag code, you add 1 bit to all the regular
> tokens in the alphabet)
>
> It's tricky.  I'm not saying it's impossible, but I would like to see more
> evidence that it can be done.
>
> Remember as well that you have to deal with the fact that a new entry can
> often consist of the tail of one reference and the head of another reference.

But I do take this fact into consideration.

> ie. if you have two dictionary entries "the boy " and "who fell" how do you
> tag the string "boy who" if it first appears in the transmission of the
> two entries together?  How do you do this without using space?

Well, if "the boy who fell" appeared in the text, "the boy" and "who fell" would not have been added to the dictionary unless they appeared later.  And if "boy who" also appeared later, "boy " and "who" would have been added as well.  I did not give any indication of how I am storing the strings, only that they must have the property that I can add any entry and delete any entry.  The string storage could be reference pointers, actual text, etc., and so the space used would depend on that.
(There is one method that guarantees no more than N + ln(N)*N memory units for data of size N.)

Also, because I did not even attempt any heuristic algorithms in my example, there will be problems like this.  That is why I stated that this, combined with heuristics, would be much better.
--
Trent Tobler - ttobler@csulx.weber.edu
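[Editor's sketch checking Trent's token-cost arithmetic above.  The reading that the ~2.32-bit prefix is the cost of choosing among the five equiprobable Command-state options (BD, ITK, ITR, BQ, QC) is my own inference; log2(5) = 2.32 matches the figures in his posts.]

```python
import math

# Assumed: the 2.32-bit prefix = log2 of the five Command-state choices.
prefix = math.log2(5)

# Per the post: token 4 used N = 10 times out of T = 50 total token uses.
N, T = 10, 50

# Trent's formula: prefix + ln(T/N)/ln(2), i.e. prefix + log2(T/N)
token_cost = prefix + math.log2(T / N)
print(f"token 4 costs {token_cost:.2f} bits")  # 4.64, as in the post
```

The 2.58-bit figures in the "mississippi" example fit the same reading: log2(6) = 2.58 for the six choices available inside a Definition state.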