Peter_Gutmann@kcbbs.gen.nz (Peter Gutmann) (04/16/91)
I threw together the following introduction to data compression about a
year ago. I thought someone out there might find it useful (it explains some
of the theory behind data compression, as well as some of the jargon, and
gives a brief intro. to some of the more common techniques). Some of the
material is now a bit dated, and the notation used to indicate things like
use of italics and special fonts is...err...original (I eventually redid this
with proper fonts, diagrams, etc, but can't post the result since it's not
ASCII text any more). There's quite a bit of information theory stuff at the
beginning (you might call it a load of IT:-) which may be a bit tedious too...
anyway, here it is: hope someone finds it useful.
An Introduction to Data Compression
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Peter Gutmann, 1990
Basics of Information Theory
----------------------------
<Quantity of information>
A message can be described in terms of how much information it contains.
For example, the message "The sun rose this morning" contains very little
information, whereas the message "Your house just burned down" contains a
much larger quantity of information. The amount of information conveyed in
an event (in this case a message) depends on the probability of that event.
If someone is told something he already knows (so that the probability before
being told it was already unity), then the probability afterwards remains at
unity. On the other hand, if someone is told something that is relatively
improbable, its probability changes from a small value beforehand to unity
afterwards. A satisfactory definition of quantity of information must thus
involve the probability of an event.
Let's consider a specific example, in which someone is communicating to
someone else the position of an object in a square divided into four parts:
1 2
+---+---+
1 | A | B |
+---+---+
2 | C | D |
+---+---+
The person giving the position could communicate the information in two
ways. They could either give the letter corresponding to the square, or they
could give the position in a row,column format. In the first case the
probability before the event is 1/4, and unity afterwards. In the latter
case, the probability is 1/2 before the event and unity afterwards for the
row, and similarly for the column. Obviously both methods impart the same
information: The sum of the information regarding the row and column must
equal that regarding the whole square. Thus we can see that information is
summed while the probabilities are multiplied (1/2 * 1/2 = 1/4), which leads
us to the definition of quantity of information involving a logarithm of the
probability:
Quantity of information I = -log( p )
(the minus sign is used to make I positive, since p < 1). Now in a binary
system we take logs to base 2, so if we have a source producing 0's and 1's
with equal probability (ie p( 0 ) = p( 1 ) = 1/2) then the information per
digit is log2( 2 ) = 1 bit. A few further examples of probabilities are:
- Letters of the alphabet. If these are assumed to be equiprobable, then
the probability of a letter is 1/26, giving I = log( 26 ) =
4.7 bits/letter.
- Numerals. Again assuming all digits are equiprobable, we get I =
log( 10 ) = 3.32 bits/digit.
- An already-known event. I = log( 1 ) = 0.
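These figures are easily checked numerically. The following minimal C sketch
(log2() here is the C99 maths library function; the program is purely
illustrative) simply evaluates I = -log2( p ) for a binary digit, a letter,
and a numeral:

#include <stdio.h>
#include <math.h>

/* Quantity of information I = -log2( p ) for an event of probability p */
static double info( double p )
    {
    return -log2( p );
    }

int main( void )
    {
    printf( "Equiprobable binary digit: %.2f bits\n", info( 1.0 / 2.0 ) );
    printf( "Equiprobable letter      : %.2f bits\n", info( 1.0 / 26.0 ) );
    printf( "Equiprobable numeral     : %.2f bits\n", info( 1.0 / 10.0 ) );
    return 0;
    }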
<Entropy>
In practice we will be more interested in the average information conveyed
in an entire message than in the specific information in just one event.
This information is called the *entropy*, and for a set of symbols S1, S2,
S3, ... Sn with probabilities p( S1 ), p( S2 ), p( S3 ), ... p( Sn ) is
given by:
n
Entropy H = -sigma p( S sub i) * log( p( S sub i ) )
i=1
In the case of letters of the alphabet we can now make a more accurate
estimate of the average information per letter as:
H = -( p( A ) * log( p( A ) ) + p( B ) * log( p( B ) ) + ...
    ... + p( Z ) * log( p( Z ) ) )
  = 4.05 bits (approximately) using standard values for p( A ) ... p( Z ).
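The entropy formula translates directly into code. The C sketch below computes
H for an arbitrary table of symbol probabilities; the example tables are the
equiprobable-letter case and a two-symbol source used later in this article (a
table of real English letter frequencies would be needed to reproduce the 4.05
bits/letter figure):

#include <stdio.h>
#include <math.h>

/* Entropy H = -sum( p[ i ] * log2( p[ i ] ) ) over all symbols */
static double entropy( const double *p, int n )
    {
    double h = 0.0;
    int i;

    for( i = 0; i < n; i++ )
        if( p[ i ] > 0.0 )              /* 0 * log( 0 ) is taken as 0 */
            h -= p[ i ] * log2( p[ i ] );
    return h;
    }

int main( void )
    {
    double letters[ 26 ], twoSymbols[ 2 ] = { 0.25, 0.75 };
    int i;

    for( i = 0; i < 26; i++ )           /* Equiprobable letters */
        letters[ i ] = 1.0 / 26.0;
    printf( "Equiprobable letters: H = %.2f bits/letter\n",
            entropy( letters, 26 ) );
    printf( "Two-symbol source   : H = %.3f bits/symbol\n",
            entropy( twoSymbols, 2 ) );
    return 0;
    }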
<Redundancy>
A concept related to entropy is that of *redundancy*. Redundancy is the
presence of more symbols in a message than is necessary. For example in a
system with two symbols X and Y, we could code X as [000], and Y as [111]
(instead of the more obvious X = [0], Y = [1]). This gives us some
protection against errors, since an error of one bit in three could be
tolerated (assuming that a majority is used for decoding at the receiver).
This leads to a definition of redundancy as:
maximum entropy - actual entropy
Redundancy = --------------------------------
maximum entropy
So in the above example the redundancy would be ( 3 - 1 ) / 3 = 0.66, since
three digits could carry three bits, but actually carry only one. Spoken
languages usually have a quite high level of redundancy, allowing them to be
understood in the presence of noise or errors. For example, most people will
be able to understand the phrase "d*ta c*mpr*ssio*" even though some of the
letters are missing. The redundancy of English will be explored more fully a
little further on.
<Probability Theory>
When we speak of the probability of an event (such as the probability of
the letter 'A' in English text), we actually mean the relative number of
occurrences of 'A' in a long sample of text, ie:
number of occurrences of A
p( A ) = -----------------------------------
total number of letters in the text
Strictly speaking, p( A ) is the limit of this ratio as the number of letters
in the sample tends to infinity. Thus we can see that the probability of
occurrence of any letter S sub i will be given by:
no of occurrences of Si
p( Si ) = lim -----------------------
N->infinity N
Unfortunately, many events in the real world are not completely independent,
but depend on other events. For example, in English the chance of an H
following a T is quite high, whereas that of a T following an H is rather
low. When we have sequences of letters such as in English, the probabilities
of letters are strongly governed by those of the preceding letter or letters.
This is known as *intersymbol influence*; in English it may stretch over many
letters, words, or even entire sentences. For example we may have a source
in which intersymbol influence extends over pairs of adjacent letters. Such
a source is often referred to as a *first-order Markov source*, since we are
using a memory of one symbol (the symbol immediately preceding the current
one). The simple model of discrete probabilities is often referred to as a
*memoryless* or *zero-order Markov source*. When we remember two symbols
preceding the current one we say we have a *second-order Markov source*, and
so on. In our example with a first-order Markov source, suppose we have two
symbols x and y. Now the information obtained when y is received is
-log( p( y | x ) ), or minus the logarithm of the probability of receiving y
given that x has already been received. In order to find the average
information over all letters we simply have to average over all possible
pairs of letters x, y. The result is known as the *conditional entropy
H( y | x )* of the sequence, and represents the information per letter in
such a sequence:
H( y | x ) = -sigma sigma p( x, y ) * log( p( y | x ) )
                x     y
Extending this idea to a second-order model, so that we have conditional
probabilities of the form p( z | xy ), the formula becomes:
H( z | xy ) = -sigma sigma sigma p( x, y, z ) * log( p( z | xy ) )
                 x     y     z
and so on. From this we can evaluate the following entropies for English
(26 letters plus space):
For a zero-order Markov source with equiprobable symbols, we have an
entropy of 4.75 bits/letter.
For a zero-order Markov source with standard probabilities for letters in
English, we have an entropy of 4.05 bits/letter.
For a first-order Markov source, or digraphs, we have an entropy of 3.32
bits/letter.
For a second-order Markov source, or trigraphs, we have an entropy of
3.10 bits/letter.
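These figures can be estimated from any reasonably large sample of text. The
following C sketch (illustrative only - it maps everything onto 26 letters
plus space and skips all other characters) reads text from stdin and estimates
both the zero-order entropy and the first-order conditional entropy H( y | x )
from the observed frequencies:

#include <stdio.h>
#include <ctype.h>
#include <math.h>

#define SYMBOLS 27              /* 26 letters plus space */

int main( void )
    {
    static long single[ SYMBOLS ], pair[ SYMBOLS ][ SYMBOLS ];
    long total = 0, pairTotal = 0;
    double h0 = 0.0, h1 = 0.0;
    int ch, x, y, prev = -1;

    /* Count single-letter and adjacent-pair frequencies */
    while( ( ch = getchar() ) != EOF )
        {
        if( isalpha( ch ) )
            y = tolower( ch ) - 'a';
        else if( ch == ' ' )
            y = 26;
        else
            { prev = -1; continue; }    /* Skip anything else */
        single[ y ]++;
        total++;
        if( prev >= 0 )
            { pair[ prev ][ y ]++; pairTotal++; }
        prev = y;
        }
    if( total == 0 || pairTotal == 0 )
        return 1;

    for( x = 0; x < SYMBOLS; x++ )
        {
        double px = ( double ) single[ x ] / total;
        long rowTotal = 0;

        if( single[ x ] > 0 )
            h0 -= px * log2( px );      /* Zero-order term */
        for( y = 0; y < SYMBOLS; y++ )
            rowTotal += pair[ x ][ y ];
        for( y = 0; y < SYMBOLS; y++ )
            if( pair[ x ][ y ] > 0 )
                {
                /* -p( x, y ) * log2( p( y | x ) ) */
                double pxy = ( double ) pair[ x ][ y ] / pairTotal;
                double pyGivenX = ( double ) pair[ x ][ y ] / rowTotal;

                h1 -= pxy * log2( pyGivenX );
                }
        }
    printf( "Zero-order entropy : %.2f bits/letter\n", h0 );
    printf( "First-order entropy: %.2f bits/letter\n", h1 );
    return 0;
    }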
<Redundancy in English>
The ideas presented above can be extended to arbitrarily complex Markov
models. For example, assume that we have received the letters "INFORMATIO".
Then the probability of the next letter being an 'N' is almost unity (we
could always have an error somewhere which changed it to some other letter).
In 1949, Shannon and Weaver studied the redundancy in printed English by
constructing closer and closer approximations to real text. They started out
with the models already presented above, arriving at the figure of 3.10
bits/letter when they considered conditional probabilities over three symbols
(in other words a second-order Markov source). Further approximations
involved using the correct probabilities for complete words (giving an
entropy of around 2.1 bits/letter) and allowing for conditional probabilities
between words. Finally, in order to estimate the information in complete
English text, "volunteers" (probably grad students) were shown samples of
text and asked to predict the next letters. In this manner it was shown that
information varies somewhere between 0.6 and 1.3 bits/letter depending on the
type of text.
From this we can see that the storage of text in computers is very
inefficient. Text is usually stored in ASCII or EBCDIC form, in which each
letter is stored in an 8-bit byte, resulting in a storage requirement of
nearly 8 times what is really necessary.
Statistical Compression
-----------------------
Although typesetters for centuries have taken advantage of the fact that
some letters are used more frequently than others in stocking type, it was
Samuel Morse who first applied the properties of frequency analysis to
communications. When Morse decided to use an alphabetical system of signals
for his newly invented telegraph, he consulted a Philadelphia newspaper's
typecase. Morse desired to assign the shorter dot-and-dash symbols used for
telegraph transmission to the more commonly used letters of the alphabet,
while longer dot-and-dash symbols would be used for less frequently used
characters. In counting type in each letter bin, he found 12,000 e's, 9000
t's, 8000 each of a, o, m, i, and s, 6400 h's, and so on. With few
exceptions, Morse followed this list in developing his telegraph code,
assigning the shortest symbol, a dot, to the letter e, which was the most
common letter encountered, another short symbol, a dash, to the letter t,
and so on, assigning gradually longer codes to less frequently used letters.
In this way, he came up with a code in which transmission of an English
message consisting of 100 letters requires the transmission time of around
940 units, where the transmission of a dot equals one unit, a dash equals
three units, and a space equals one or three units depending on whether it is
between letters or words. If the symbols had been assigned at random, the
same message would require the transmission time of approximately 1160 units,
an increase of about 23%. This encoding technique represents one of the
earliest applications of statistical compression for data transmission.
We can apply Morse's ideas to the design of a code for transmitting
information in the following manner. Say we have the following codes:
Code No. Symbol 1 Symbol 2 Symbol 3 Symbol 4
1 00 01 10 11
2 0 10 110 1110
3 0 10 110 111
Code 1 is the simplest of the three, with each symbol being allotted an
equal length code. Code 2 is a variable-length code which is known as a
*comma code*, since the digit '0' is being used as a "comma" or separator
between code words. Code 3 is another variable-length code with the
important property of being an *instantaneous* code, that is, as soon as the
code for a complete code word has been received we can immediately decode
that codeword. Consider a code in which two symbols X and Y are encoded as
[0] and [01]. Assume we receive the message "001". When we receive the
first bit, 0, we can't be sure if this represents X or Y until we look ahead
to the next bit, which is 0. Again, we can't be sure whether this is the
message "XX" or "X" followed by the first half of Y. Only when we receive
the next bit, 1, can we be sure that we are in fact looking at the message
"XY". Code 3 does not suffer from this problem, since no codeword can be a
prefix of another codeword.
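The instantaneous property is easy to demonstrate in code. A minimal C decoder
for Code 3 might look as follows; note that it never has to look ahead, since
a '0' (or a third consecutive '1') always marks the end of a codeword:

#include <stdio.h>

/* Decode a string of '0'/'1' characters encoded with Code 3
   ( Symbol 1 = 0, Symbol 2 = 10, Symbol 3 = 110, Symbol 4 = 111 ).
   Because no codeword is a prefix of another, each symbol is known the
   moment its last bit arrives - no lookahead is ever needed */
static void decode( const char *bits )
    {
    int ones = 0;                       /* Number of leading 1's seen so far */

    for( ; *bits; bits++ )
        {
        if( *bits == '1' )
            ones++;
        if( *bits == '0' || ones == 3 )
            {
            printf( "Symbol %d\n", ones + 1 );  /* 0 -> S1, 10 -> S2, etc */
            ones = 0;
            }
        }
    }

int main( void )
    {
    decode( "0101101110" );             /* S1 S2 S3 S4 S1 */
    return 0;
    }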
Now consider a source with symbol probabilities of 0.5, 0.25, 0.125, and
0.125 for symbols 1-4 respectively. With the simple fixed-length code, Code
1, the transmission of 10 letters requires 20 bits. Using Code 3, the length
of the message would be:
10 * ( ( 0.5 * 1 ) + ( 0.25 * 2 ) + ( 0.125 * 3 ) + ( 0.125 * 3 ) )
= 17.5 bits, an improvement of 12.5% over the simpler fixed-length code.
The average length L of a code is an important parameter in determining the
efficiency of a system. If we encode a source of n symbols, where symbol i has
probability p sub i and is given a codeword of length curly l sub i, we have
the average length of the code as:
L = sigma p sub i * curly l sub i
i
The efficiency of a code E is defined as:
H
E = -
L
and is always <= 1 (corresponding to H( S ) <= L ). In general, H( S ) < L,
since each codeword length curly l sub i must be an integral number of bits
(ie 1, 2, 3..., not 2.82348). Using the
above code, we can calculate its efficiency as:
L = sigma p sub i * curly l sub i
i
= ( ( 1 * 0.5 ) + ( 2 * 0.25 ) + ( 3 * 0.125 ) + ( 3 * 0.125 ) )
= 1.75 bits/symbol
Now if we calculate the entropy, we get:
H = -sigma p( S sub i ) * log( p( S sub i ) )
i
= -( ( 0.5 * log( 0.5 ) ) + ( 0.25 * log( 0.25 ) ) +
( 0.125 * log( 0.125 ) ) + ( 0.125 * log( 0.125 ) ) )
= 1.75 bits/symbol
In this case we have H( S ) = L. This is in fact a rather special case
where all symbol probabilities are integral powers of 1/2 (ie p( S sub i ) =
( 1/2 ) ^ n). This result does not occur if symbol probabilities are not
powers of 1/2, a fact which becomes important in the discussion of Huffman and
Shannon-Fano codes below.
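As a check, the calculation of L, H, and E for this example takes only a few
lines of C (a sketch using the symbol probabilities and Code 3 lengths given
above; it reproduces the 1.75 bits/symbol figures):

#include <stdio.h>
#include <math.h>

/* Average code length L, entropy H, and efficiency E for the four-symbol
   source above, using the codeword lengths of Code 3 */
int main( void )
    {
    const double p[ 4 ] = { 0.5, 0.25, 0.125, 0.125 };
    const int len[ 4 ] = { 1, 2, 3, 3 };
    double L = 0.0, H = 0.0;
    int i;

    for( i = 0; i < 4; i++ )
        {
        L += p[ i ] * len[ i ];         /* L = sum( p( i ) * l( i ) ) */
        H -= p[ i ] * log2( p[ i ] );   /* H = -sum( p( i ) * log2( p( i ) ) ) */
        }
    printf( "L = %.2f bits/symbol, H = %.2f bits/symbol, E = %.2f\n",
            L, H, H / L );
    return 0;
    }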
Huffman and Related Compression Techniques
------------------------------------------
*Huffman compression* is a statistical data compression technique which
gives a reduction in the average code length used to represent the symbols of
an alphabet. The Huffman code is an example of a code which is optimal in the
case where all symbol probabilities are integral powers of 1/2. A Huffman
code can be built in the following manner:
[1] Rank all symbols in order of probability of occurrence.
[2] Successively combine the two symbols of the lowest probability to form
    a new composite symbol; eventually we will build a tree in which the
    weight of each node is the sum of the probabilities of all the nodes
    beneath it.
[3] Trace a path to each leaf, noticing the direction at each node.
For example, using the symbols and probabilities given above, we would
proceed as follows:
Step 1:
S sub 1 [0.5]
S sub 2 [0.25]
S sub 3 [0.125]
S sub 4 [0.125]
Step 2a:
S sub 1 [0.5]
S sub 2 [0.25]
S sub 3 [0.125] -+- [0.25]
|
S sub 4 [0.125] -+
Step 2b:
S sub 1 [0.5]
S sub 2 [0.25] -------------+- [0.5]
|
S sub 3 [0.125] -+- [0.25] -+
|
S sub 4 [0.125] -+
Step 2c:
S sub 1 [0.5] ------------------------+- [1.0]
|
S sub 2 [0.25] -------------+- [0.5] -+
|
S sub 3 [0.125] -+- [0.25] -+
|
S sub 4 [0.125] -+
Step 3:
(0)
S sub 1 [0.5] ------------------------+- [1.0]
(0) (1) |
S sub 2 [0.25] -------------+- [0.5] -+
(0) (1) |
S sub 3 [0.125] -+- [0.25] -+
(1) |
S sub 4 [0.125] -+
Thus we get the following codes:
S sub 1 = 0
S sub 2 = 10
S sub 3 = 110
S sub 4 = 111
which have already been discussed above.
Huffman codes can be easily constructed with the aid of a computer. The
algorithm assumes that initially we have a forest of trees where each tree
corresponds to an individual source symbol, and then proceeds as follows:
while( there is more than one tree in the forest )
{
/* Find the two symbols with the lowest probability */
i = the tree with the smallest weight;
j = the tree with the next smallest weight;
/* Combine them into a new composite symbol */
create a new node with left child i and right child j;
replace tree i with this new tree, with the tree's weight being
the weight of i plus j;
delete j from the forest;
}
As we merge the two lowest-weight trees into a single new tree we slowly
decrease the size of the forest until finally we have reduced the entire
forest to one tree, which represents the Huffman tree for that set of symbols.
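One possible concrete version of this algorithm is sketched below in C. It is
illustrative rather than efficient - the two lightest trees are found by
linear search instead of with a priority queue - and it builds the code for
the four-symbol example used earlier, printing S1 = 0, S2 = 10, S3 = 110,
S4 = 111:

#include <stdio.h>
#include <stdlib.h>

#define NSYMBOLS 4

typedef struct node {
    double weight;
    int symbol;                 /* >= 0 for a leaf, -1 for an internal node */
    struct node *left, *right;
    } NODE;

static NODE *newNode( double weight, int symbol, NODE *left, NODE *right )
    {
    NODE *n = malloc( sizeof( NODE ) );

    n->weight = weight; n->symbol = symbol; n->left = left; n->right = right;
    return n;
    }

/* Walk the finished tree, printing the code built up on the way down */
static void printCodes( const NODE *n, char *prefix, int depth )
    {
    if( n->symbol >= 0 )
        {
        prefix[ depth ] = '\0';
        printf( "S%d = %s\n", n->symbol + 1, prefix );
        return;
        }
    prefix[ depth ] = '0'; printCodes( n->left, prefix, depth + 1 );
    prefix[ depth ] = '1'; printCodes( n->right, prefix, depth + 1 );
    }

int main( void )
    {
    double p[ NSYMBOLS ] = { 0.5, 0.25, 0.125, 0.125 };
    NODE *forest[ NSYMBOLS ];
    char prefix[ NSYMBOLS ];
    int i, count = NSYMBOLS;

    for( i = 0; i < NSYMBOLS; i++ )
        forest[ i ] = newNode( p[ i ], i, NULL, NULL );

    while( count > 1 )
        {
        int min1 = 0, min2 = 1, j;

        /* Find the two trees with the smallest weights */
        if( forest[ min2 ]->weight < forest[ min1 ]->weight )
            { min1 = 1; min2 = 0; }
        for( j = 2; j < count; j++ )
            if( forest[ j ]->weight < forest[ min1 ]->weight )
                { min2 = min1; min1 = j; }
            else if( forest[ j ]->weight < forest[ min2 ]->weight )
                min2 = j;

        /* Combine them into a new composite tree, then shrink the forest */
        forest[ min1 ] = newNode( forest[ min1 ]->weight + forest[ min2 ]->weight,
                                  -1, forest[ min1 ], forest[ min2 ] );
        forest[ min2 ] = forest[ --count ];
        }
    printCodes( forest[ 0 ], prefix, 0 );
    return 0;
    }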
--
A technique related to Huffman coding is *Shannon-Fano coding*, which was
suggested by Shannon and Weaver in 1949 and modified by Fano in 1961. It
works as follows:
[1] Rank all symbols in order of probability of occurrence.
[2] Successively divide the set of symbols into two subsets of equal or
    almost equal total probability. All symbols in the first subset are
    assigned a binary zero as the next digit of their code, and all symbols
    in the second subset a binary one. Repeat within each subset until each
    subset contains only one symbol.
For example, again using the symbols and probabilities given above, we
would proceed as follows:
Step 1:
S sub 1 [0.5]
S sub 2 [0.25]
S sub 3 [0.125]
S sub 4 [0.125]
Step 2a:
(0)
S sub 1 [0.5] --- [0.50] -+- [1.00]
(1) |
S sub 2 [0.25] -+- [0.50] -+
|
S sub 3 [0.125] -+
|
S sub 4 [0.125] -+
Step 2b:
(0)
S sub 1 [0.5] -------------- [0.50] -+- [1.00]
(0) (1) |
S sub 2 [0.25] --- [0.25] -+- [0.50] -+
(1) |
S sub 3 [0.125] -+- [0.25] -+
|
S sub 4 [0.125] -+
Step 2c:
(0)
S sub 1 [0.5] -------------- [0.50] -+- [1.00]
(0) (1) |
S sub 2 [0.25] --- [0.25] -+- [0.50] -+
(0) (1) |
S sub 3 [0.125] -+- [0.25] -+
(1) |
S sub 4 [0.125] -+
Thus we get the following codes:
S sub 1 = 0
S sub 2 = 10
S sub 3 = 110
S sub 4 = 111
These codes are in this case identical to the Huffman codes - the only
difference is that the algorithm used to create the Huffman codes is
bottom-up, and the one for the Shannon-Fano codes is top-down. In practice
Shannon-Fano codes are usually of similar lengths to Huffman codes; a Huffman
code, however, can never be longer on average than the corresponding
Shannon-Fano code, since the Huffman construction is provably optimal among
codes which assign a whole number of bits to each symbol. One problem
that Shannon-Fano codes have is that they are not uniquely defined - when the
set of symbols can be split equally well at more than one point, either split
may be chosen, giving different codes. However Shannon-Fano coding does have
the advantage that
it is a bit more tractable mathematically than Huffman coding, making
mathematical analysis easier. Both Huffman and Shannon-Fano codes are
instantaneous codes.
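The top-down nature of the algorithm shows up in code as a simple recursive
split. The following minimal C sketch (for the same four symbols, assumed to
be already sorted by decreasing probability; the function and variable names
are just for illustration) produces the codes given above:

#include <stdio.h>
#include <string.h>

#define NSYMBOLS 4
#define MAXCODE  16

/* A minimal Shannon-Fano sketch for the four-symbol example above.  Each
   call splits the range at the first point where the running total reaches
   half the total probability, appends '0' to the codes of the first half
   and '1' to those of the second half, and then recurses into each half */

static double prob[ NSYMBOLS ] = { 0.5, 0.25, 0.125, 0.125 };
static char code[ NSYMBOLS ][ MAXCODE ];

static void shannonFano( int first, int last )
    {
    double total = 0.0, running = 0.0;
    int i, split;

    if( first >= last )
        return;
    for( i = first; i <= last; i++ )
        total += prob[ i ];
    for( split = first; split < last; split++ )
        {
        running += prob[ split ];
        if( running >= total / 2 )
            break;
        }
    for( i = first; i <= last; i++ )
        {
        size_t len = strlen( code[ i ] );

        code[ i ][ len ] = ( i <= split ) ? '0' : '1';
        code[ i ][ len + 1 ] = '\0';
        }
    shannonFano( first, split );
    shannonFano( split + 1, last );
    }

int main( void )
    {
    int i;

    shannonFano( 0, NSYMBOLS - 1 );
    for( i = 0; i < NSYMBOLS; i++ )
        printf( "S%d = %s\n", i + 1, code[ i ] );
    return 0;
    }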
Optimal Coding:
---------------
It would appear from the above results that Huffman or Shannon-Fano coding
is the perfect means of compressing data. However, this is *not* the case.
As mentioned above, these coding methods are optimal when and only when the
symbol probabilities are integral powers of 1/2, which is usually not the
case. Consider the following simple example: Suppose we have a source
producing symbols X and Y of probability 0.66 and 0.33 respectively. Then
the Huffman/Shannon-Fano code for these symbols would be:
(0)
X [0.66] -+- [1.0]
(1) |
Y [0.33] -+
resulting in a code of X = [0], Y = [1], with average length:
L = ( 1 * 0.33 ) + ( 1 * 0.66 )
= 1.0
and entropy:
H = -( ( 0.33 * log( 0.33 ) ) + ( 0.66 * log( 0.66 ) ) )
= 0.92
giving an efficiency of:
0.92
E = ----
1.00
or 92%.
There is, however, a way to improve the coding efficiency of Huffman and
Shannon-Fano encoding. This can be done by means of *alphabet extensions*,
that is, by extending the source alphabet to include groups of symbols
instead of just individual symbols. Say we have an alphabet of two symbols,
X and Y. Now if we extend this to a first-order model to get { XX, XY, YX,
YY } and build the Huffman tree for this alphabet we find that there is an
increase in coding efficiency. This effect continues as we keep extending
the alphabet in this manner (there are exceptions to this rule - the
efficiency can actually decrease in going from the order-n to the order-n+1
extension, but the lower bound will still be the entropy of the data). For
example, for the above
alphabet with symbol probabilities of .25 and .75 we get the following results:
H( S sub 1 ) = .811, L sub 1 = 1.000, n sub 1 = .811
H( S sub 2 ) = 1.623, L sub 2 = 1.688, n sub 2 = .962
H( S sub 3 ) = 2.434, L sub 3 = 2.469, n sub 3 = .986
H( S sub 4 ) = 3.245, L sub 4 = 3.273, n sub 4 = .991
H( S sub 5 ) = 4.056, L sub 5 = 4.089, n sub 5 = .992
As the extension tends towards infinity, the efficiency approaches 1.000.
However for practical reasons this approach is impossible - most computers
would run out of memory long before they got to the third or fourth
extensions. And yet there *is* a way in which this type of encoding can be
achieved: through a technique called *arithmetic coding*.
Arithmetic Coding:
------------------
As already discussed above, Huffman coding is not optimal. This is due
to the fact that it is not used to encode single, complete messages, but only
individual bytes or n-byte blocks, and while each byte (or block of bytes) is
encoded with minimum redundancy into an integral number of bits, the sequence
of bytes is not. However, the technique of *arithmetic coding* does not have
this restriction: It achieves the same effect as treating the message as one
single unit (a technique which would, for Huffman coding, require enumeration
of every single possible message), and thus attains the theoretical entropy
bound to compression efficiency for any source.
Arithmetic coding works by representing a message by an interval of real
numbers between 0 and 1. As the message becomes longer, the interval needed
to represent it becomes smaller and smaller, and the number of bits needed to
specify that interval increases. Successive symbols in the message reduce
this interval in accordance with the probability of that symbol. The more
likely symbols reduce the range by less, and thus add fewer bits to the
message.
1 Codewords
+-----------+-----------+-----------+ /-----\
| |8/9 XX | Detail |<- 31/32 11111
| +-----------+-----------+<- 15/16 1111
| X | | too small |<- 14/16 1110
|2/3 | XY | for text |<- 6/8 110
+-----------+-----------+-----------+
| | |16/27 YXX |<- 10/16 1010
| | +-----------+
| | YX | |
| | | YXY |<- 4/8 100
| |4/9 | |
| +-----------+-----------+
| | | |
| Y | | YYX |<- 3/8 011
| | |8/27 |
| | +-----------+
| | YY | |
| | | |<- 1/4 01
| | | YYY |
| | | |
|0 | | |
+-----------+-----------+-----------+
As an example of arithmetic coding, let's consider the previous example of
two symbols X and Y, of probabilities 0.66 and 0.33, for which the
Huffman/Shannon-Fano code efficiency was 92%. To encode this message, we
examine the first symbol: If it is an X, we choose the lower partition; if it
is a Y, we choose the upper partition. Continuing in this manner for three
symbols, we get the codewords shown to the right of the diagram above - they can be
found by simply taking an appropriate location in the interval for that
particular set of symbols and turning it into a binary fraction. From these
codewords we can calculate the average codeword length by weighting the
length of each codeword by the probability of the three-symbol sequence it
encodes and summing over all eight sequences, which gives:
L = 2.88 bits for 3 symbols
= 0.96 bits/symbol
giving an efficiency of:
0.92
E = ----
0.96
= 96%
In this case the arithmetic code is not completely efficient, which is due
to the shortness of the message - with longer messages the coding efficiency
does indeed approach 100%.
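To make the interval-narrowing idea concrete, here is a toy C sketch. It is
purely illustrative: it works in floating point and encodes only a short
message, whereas a real arithmetic coder uses integer arithmetic and emits
bits incrementally so that it never runs out of precision. The choice of
which symbol gets the lower part of the interval is arbitrary, so long as the
decoder makes the same choice:

#include <stdio.h>
#include <string.h>

/* A toy floating-point arithmetic coder for a two-symbol source.  Here 'X'
   ( probability 2/3 ) is given the lower part of the current interval and
   'Y' ( probability 1/3 ) the upper part */

int main( void )
    {
    const char *message = "XXY";
    const double pX = 2.0 / 3.0;
    double low = 0.0, high = 1.0, target, value = 0.0, width = 1.0;
    size_t i;

    /* Narrow the interval once per symbol */
    for( i = 0; i < strlen( message ); i++ )
        {
        double range = high - low;

        if( message[ i ] == 'X' )
            high = low + range * pX;    /* Lower partition */
        else
            low += range * pX;          /* Upper partition */
        printf( "After '%c': [ %.4f, %.4f )\n", message[ i ], low, high );
        }

    /* Emit the bits of a binary fraction lying wholly inside the final
       interval; any such fraction identifies the message uniquely */
    target = ( low + high ) / 2;
    printf( "Code bits: " );
    while( value < low || value + width > high )
        {
        width /= 2;
        if( value + width <= target )
            {
            value += width;
            putchar( '1' );
            }
        else
            putchar( '0' );
        }
    putchar( '\n' );
    return 0;
    }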
Now that we have an efficient encoding technique, what can we do with it?
What we need is a technique for building a model of the data which we can
then use with the encoder. The simplest model is a fixed one, for example a
table of standard letter frequencies for English text which we can then use
to get letter probabilities. An improvement on this technique is to use an
*adaptive model*, in other words a model which adjusts itself to the data
which is being compressed as the data is compressed. We can convert the
fixed model into an adaptive one by adjusting the symbol frequencies after
each new symbol is encoded, allowing the model to track the data being
transmitted. However, we can do much better than that. Recall from the
section on information theory that using the symbol probabilities by
themselves is not a particularly good estimate of the true entropy of the
data: We can take into account intersymbol probabilities as well. The best
compressors available today take this approach: DMC (Dynamic Markov Coding)
starts with a zero-order Markov model and gradually extends this initial
model as compression progresses; PPM (Prediction by Partial Matching) looks
for a match of the text to be compressed in an order-n context. If no match
is found, it drops to an order n-1 context, until it reaches order 0. Both
these techniques thus obtain a much better model of the data to be
compressed, which, combined with the use of arithmetic coding, results in
superior compression performance.
So if arithmetic coding-based compressors are so powerful, why are they not
used universally? Apart from the fact that they are relatively new and
haven't come into general use too much yet, there is also one major concern:
The fact that they consume rather large amounts of computing resources, both
in terms of CPU power and memory. The building of sophisticated models for
the compression can chew through a fair amount of memory (especially in the
case of DMC, where the model can grow without bounds); and the arithmetic
coding itself involves a fair amount of number crunching. Most
"bread-and-butter" PC's simply don't have the memory (640K - 1MB) or CPU
power (8-10 MHz 68000's or 80286's) to run a powerful arithmetic compressor.
There is however an alternative approach, a class of compressors generally
referred to as *substitutional* or *dictionary-based compressors*.
Substitutional Compressors
--------------------------
The basic idea behind a substitutional compressor is to replace an
occurrence of a particular phrase or group of bytes in a piece of data with a
reference to a previous occurrence of that phrase. There are two main
classes of schemes, named after Jakob Ziv and Abraham Lempel, who first
proposed them in 1977 and 1978.
<The LZ78 family of compressors>
LZ78-based schemes work by entering phrases into a *dictionary* and then,
when a repeat occurrence of that particular phrase is found, outputting the
dictionary index instead of the phrase. There exist several compression
algorithms based on this principle, differing mainly in the manner in which
they manage the dictionary. The most well-known scheme (in fact the most
well-known of all the Lempel-Ziv compressors, the one which is generally (and
mistakenly) referred to as "Lempel-Ziv Compression"), is Terry Welch's LZW
scheme, which he designed in 1984 for implementation in hardware for high-
performance disk controllers.
Input string: /WED/WE/WEE/WEB
Character input: Code output: New code value and associated string:
/W / 256 = /W
E W 257 = WE
D E 258 = ED
/ D 259 = D/
WE 256 260 = /WE
/ E 261 = E/
WEE 260 262 = /WEE
/W 261 263 = E/W
EB 257 264 = WEB
<END> B
LZW starts with a 4K dictionary, of which entries 0-255 refer to individual
bytes, and entries 256-4095 refer to substrings. Each time a new code is
generated it means a new string has been parsed. New strings are generated
by appending the current character K to the end of an existing string w. The
algorithm for LZW compression is as follows:
set w = NIL
loop
read a character K
if wK is in the dictionary
w = wK
else
output the code for w
add wK to the string table
w = K
endloop
A sample run of LZW over a (highly redundant) input string can be seen in
the diagram above. The strings are built up character-by-character starting
with a code value of 256. LZW decompression takes the stream of codes and
uses it to exactly recreate the original input data. Just like the
compression algorithm, the decompressor adds a new string to the dictionary
each time it reads in a new code. All it needs to do in addition is to
translate each incoming code into a string and send it to the output. A
sample run of the LZW decompressor is shown below.
Input code: /WED<256>E<260><261><257>B
Input code: Output string: New code value and associated string:
/ /
W W 256 = /W
E E 257 = WE
D D 258 = ED
256 /W 259 = D/
E E 260 = /WE
260 /WE 261 = E/
261 E/ 262 = /WEE
257 WE 263 = E/W
B B 264 = WEB
The most remarkable feature of this type of compression is that the entire
dictionary has been transmitted to the decoder without actually explicitly
transmitting the dictionary. At the end of the run, the decoder will have a
dictionary identical to the one the encoder has, built up entirely as part of
the decoding process.
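As an illustration, the following minimal C version of the compression side
reproduces the sample run above when given the string "/WED/WE/WEE/WEB". It
is a sketch only: the dictionary is a flat array of strings searched linearly,
where a real implementation would use hashing, and codes are simply printed as
text (with multi-character strings shown as <n>) rather than being packed into
12-bit codes:

#include <stdio.h>
#include <string.h>

#define DICTSIZE 4096
#define MAXSTR   256

/* Codes 0-255 are single bytes, codes from 256 up are added during
   compression */
static char dict[ DICTSIZE ][ MAXSTR ];
static int dictEntries = 256;

/* Return the code for a string, or -1 if it isn't in the dictionary */
static int lookup( const char *s )
    {
    int i;

    if( strlen( s ) == 1 )
        return ( unsigned char ) s[ 0 ];
    for( i = 256; i < dictEntries; i++ )
        if( !strcmp( dict[ i ], s ) )
            return i;
    return -1;
    }

static void outputCode( int code )
    {
    if( code < 256 )
        printf( "%c ", code );
    else
        printf( "<%d> ", code );
    }

int main( void )
    {
    const char *input = "/WED/WE/WEE/WEB";
    char w[ MAXSTR ] = "", wK[ MAXSTR ];
    size_t i;

    for( i = 0; i < strlen( input ); i++ )
        {
        /* Form wK by appending the next character K to the current string w */
        sprintf( wK, "%s%c", w, input[ i ] );
        if( lookup( wK ) != -1 )
            strcpy( w, wK );                /* wK is known; keep extending */
        else
            {
            outputCode( lookup( w ) );      /* Output the code for w ... */
            if( dictEntries < DICTSIZE )    /* ... add wK to the dictionary */
                strcpy( dict[ dictEntries++ ], wK );
            sprintf( w, "%c", input[ i ] ); /* ... and start again from K */
            }
        }
    outputCode( lookup( w ) );              /* Flush the final string */
    printf( "\n" );
    return 0;
    }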
LZW is more commonly encountered today in a variant known as LZC, after
its use in the UNIX "compress" program. In this variant, pointers do not
have a fixed length. Rather, they start with a length of 9 bits, and then
slowly grow to their maximum possible length once all the pointers of a
particular size have been used up. Furthermore, the dictionary is not frozen
once it is full as for LZW - the program continually monitors compression
performance, and once this starts decreasing the entire dictionary is
discarded and rebuilt from scratch. More recent schemes use some sort of
least-recently-used algorithm to discard little-used phrases once the
dictionary becomes full rather than throwing away the entire dictionary.
Finally, not all schemes build up the dictionary by adding a single new
character to the end of the current phrase. An alternative technique is to
concatenate the previous two phrases (LZMW), which results in a faster
buildup of longer phrases than the character-by-character buildup of the
other methods. The disadvantage of this method is that a more sophisticated
data structure is needed to handle the dictionary.
<The LZ77 family of compressors>
LZ77-based schemes keep track of the last n bytes of data seen, and when a
phrase is encountered that has already been seen, they output a pair of
values corresponding to the position of the phrase in the previously-seen
buffer of data, and the length of the phrase. In effect the compressor moves
a fixed-size *window* over the data (generally referred to as a *sliding
window*), with the position part of the (position, length) pair referring to
the position of the phrase within the window. The most commonly used
algorithms are derived from the LZSS scheme described by James Storer and
Thomas Szymanski in 1982. In this the compressor maintains a window of size
N bytes and a *lookahead buffer*, the contents of which it tries to match
against the window:
while( lookAheadBuffer not empty )
{
get a pointer ( position, length ) to the longest match in the window
for the lookahead buffer;
if( length > MINIMUM_MATCH_LENGTH )
{
output a ( position, length ) pair;
shift the window length characters along;
}
else
{
output the first character in the lookahead buffer;
shift the window 1 character along;
}
}
Decompression is simple and fast: Whenever a ( position, length ) pair is
encountered, go to that ( position ) in the window and copy ( length ) bytes
to the output.
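The following minimal C sketch shows both halves of this process. It is an
illustration rather than a real implementation: the "window" is simply all of
the text seen so far, matches are found by brute-force search, and the
( position, length ) pairs and literal characters are kept in an in-memory
token list instead of being packed into a bit stream:

#include <stdio.h>
#include <string.h>

#define MIN_MATCH 3
#define MAXTOKENS 256

typedef struct {
    int isMatch;
    int position, length;       /* Valid if isMatch */
    char literal;               /* Valid if !isMatch */
    } TOKEN;

static TOKEN tokens[ MAXTOKENS ];
static int tokenCount;

static void encode( const char *text )
    {
    int i = 0, len = ( int ) strlen( text );

    while( i < len )
        {
        int bestPos = 0, bestLen = 0, j;

        /* Find the longest match for text[ i... ] in the window text[ 0..i ) */
        for( j = 0; j < i; j++ )
            {
            int k = 0;

            while( i + k < len && j + k < i && text[ j + k ] == text[ i + k ] )
                k++;
            if( k > bestLen )
                { bestPos = j; bestLen = k; }
            }
        if( bestLen >= MIN_MATCH )
            {
            tokens[ tokenCount ].isMatch = 1;
            tokens[ tokenCount ].position = bestPos;
            tokens[ tokenCount ].length = bestLen;
            printf( "( %d, %d ) ", bestPos, bestLen );
            i += bestLen;
            }
        else
            {
            tokens[ tokenCount ].isMatch = 0;
            tokens[ tokenCount ].literal = text[ i ];
            printf( "'%c' ", text[ i ] );
            i++;
            }
        tokenCount++;
        }
    printf( "\n" );
    }

static void decode( char *out )
    {
    int i, outLen = 0;

    for( i = 0; i < tokenCount; i++ )
        if( tokens[ i ].isMatch )
            {
            int k;

            /* Go to ( position ) in the output so far and copy ( length ) bytes */
            for( k = 0; k < tokens[ i ].length; k++ )
                out[ outLen + k ] = out[ tokens[ i ].position + k ];
            outLen += tokens[ i ].length;
            }
        else
            out[ outLen++ ] = tokens[ i ].literal;
    out[ outLen ] = '\0';
    }

int main( void )
    {
    char rebuilt[ 256 ];

    encode( "THE CAT SAT ON THE MAT" );
    decode( rebuilt );
    printf( "%s\n", rebuilt );
    return 0;
    }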
Sliding-window-based schemes can be simplified by numbering the input text
characters mod N, in effect creating a circular buffer. The sliding window
approach automatically creates the LRU effect which must be done explicitly
in LZ78 schemes. Variants of this method apply additional compression to the
output of the LZSS compressor, which include a simple variable-length code
(LZB), dynamic Huffman coding (LZH), and Shannon-Fano coding (ZIP), all of
which result in a certain degree of improvement over the basic scheme,
especially when the data are rather random and the LZSS compressor has little
effect.
Recently an algorithm was developed which combines the ideas behind LZ77
and LZ78 to produce a hybrid called LZFG. LZFG uses the standard sliding
window, but stores the data in a modified trie data structure and produces as
output the position of the text in the trie. Since LZFG only inserts
complete *phrases* into the dictionary, it runs considerably faster than
other LZ77-based compressors. This technique achieves the best compression
of any Ziv-Lempel type compressor, and has one of the best runtimes.
Compression Performance
-----------------------
The following representative sample of compressors was tested on a suite
of test data which ranged from straight ASCII text, through program source
code, graphics data, to object code and a file of seismic data. The
compressors evaluated were:
The following schemes can be classed as "generic" schemes used as
representatives of a particular type of compression (these values are taken
from "Text Compression").
Huffman coding [HUFF]: Vitter's adaptive Huffman coding scheme, as used
in the UNIX "compact" program.
LZW compression [LZW]: The LZ78 family representative.
LZSS compression [LZSS]: The LZ77 family representative.
LZFG compression [LZFG]: A LZ77/LZ78 hybrid.
PPMC compression [PPMC]: The current state of the art in compression
performance, outlined in the section on
arithmetic coding.
The following schemes are examples of practical implementations in everyday
use:
Modified LZW [LZC]: As first used in the UNIX "compress" program, also used
in the ARC, ZOO, and Stuffit archivers (results are for
Compress 4.0).
Imploding [PKZIP]: The "Imploding" scheme used by the PKZIP archiver
(Version 1.10).
Freezing [LHARC]: The LZSS/adaptive Huffman coding combination used by the
LHARC archiver (Version 1.13).
The compression performances were as follows (all results are in
bits/byte):
Size HUFF LZW LZSS LZFG PPMC LZC PKZIP LHARC
---------------+------+------+------+------+------+------+------+------+
Bib : 111,261 | 5.24 | 3.84 | 3.35 | 2.90 | 2.11 | 3.89 | 2.97 | 3.34 |
Book1: 768,771 | 4.56 | 4.03 | 4.08 | 3.62 | 2.48 | 4.06 | 3.65 | 3.85 |
Book2: 610,856 | 4.83 | 4.52 | 3.41 | 3.05 | 2.26 | 4.25 | 3.04 | 3.31 |
Geo : 102,400 | 5.70 | 6.15 | 6.43 | 5.70 | 4.78 | 6.10 | 5.95 | 5.53 |
News : 377,109 | 5.23 | 4.92 | 3.79 | 3.44 | 2.65 | 4.90 | 3.34 | 3.52 |
Obj1 : 21,504 | 6.06 | 6.30 | 4.57 | 4.03 | 3.76 | 6.15 | 3.92 | 4.00 |
Obj2 : 246,814 | 6.30 | 9.81 | 3.30 | 2.96 | 2.69 | 5.19 | 2.92 | 2.94 |
Paper1: 53,161 | 5.04 | 4.58 | 3.38 | 3.03 | 2.48 | 4.43 | 3.01 | 3.27 |
Paper2: 82,199 | 4.65 | 4.02 | 3.58 | 3.16 | 2.45 | 3.98 | 3.20 | 3.43 |
Pic : 513,216 | 1.66 | 1.09 | 1.67 | 0.87 | 1.09 | 0.99 | 0.99 | 0.96 |
Progc: 39,611 | 5.26 | 4.88 | 3.24 | 2.89 | 2.49 | 4.41 | 2.86 | 3.11 |
Progl: 71,646 | 4.81 | 3.89 | 2.37 | 1.97 | 1.90 | 3.57 | 1.93 | 2.09 |
Progp: 49,379 | 4.92 | 3.73 | 2.36 | 1.90 | 1.84 | 3.72 | 1.92 | 2.07 |
Trans: 93,695 | 5.98 | 4.24 | 2.44 | 1.76 | 1.77 | 3.94 | 1.98 | 2.40 |
---------------+------+------+------+------+------+------+------+------+
14 Files 4.99 4.71 3.43 2.95 2.48 4.26 2.98 3.13
Huffman coding is easily the worst of the lot, particularly for the highly
compressible program and trans files, where most other compressors attain
around 2 bits/byte. The disadvantage of non-adaptive compression is shown by
the results for the LZW compression of the file obj2, which at the start
consists almost purely of text, but then switches to code and graphics data.
Since the LZW dictionary is frozen after 4K (during which time it has adapted
itself to compressing text) it ends up expanding the data once it gets past
the text part of the program.
So which compressor is best overall? That depends on several things.
PPMC achieves the best compression by quite a margin, but requires a fair
amount of computing resources, both in terms of memory and CPU power (though
of course these considerations will become relatively minor as faster systems
with larger amounts of memory become available). One nice feature of PPMC is
that it can be scaled down to run in less memory, but compression performance
suffers in doing this. LZC is popular since it is relatively easy to
implement, very fast, and achieves acceptable levels of compression in little
memory. PKZIP is especially popular on micros since it is also quite fast,
requires little memory, and usually achieves better performance than any
other archiver.
----------------
--
Peter_Gutmann@kcbbs.gen.nz || peter@nacjack.gen.nz || pgut1@cs.aukuni.ac.nz
(In order of decreasing reliability)
Warning!
Something large, scaly, and with fangs a foot long lives between <yoursite>
and <mysite>. Every now and then it kills and eats messages. If you don't
receive a reply within a week, try resending...