[comp.compression] A

Peter_Gutmann@kcbbs.gen.nz (Peter Gutmann) (04/16/91)

  I threw together the following introduction to data compression about a 
year ago.  I thought someone out there might find it useful (it explains some 
of the theory behind data compression, as well as some of the jargon, and 
gives a brief intro to some of the more common techniques).  Some of the 
material is now a bit dated, and the notation used to indicate things like 
use of italics and special fonts is...err...original (I eventually redid this 
with proper fonts, diagrams, etc, but can't post the result since it's not 
ASCII text any more).  There's quite a bit of information theory stuff at the
beginning (you might call it a load of IT:-) which may be a bit tedious too...
anyway, here it is: hope someone finds it useful.

                     An Introduction to Data Compression
                     -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
                            Peter Gutmann, 1990

Basics of Information Theory 
----------------------------

<Quantity of information>

  A message can be described in terms of how much information it contains. 
For example, the message "The sun rose this morning" contains very little 
information, whereas the message "Your house just burned down" contains a 
much larger quantity of information.  The amount of information conveyed in 
an event (in this case a message) depends on the probability of that event.  
If someone is told something he already knows (so that the probability before 
being told it was already unity), then the probability afterwards remains at 
unity.  On the other hand, if someone is told something that is relatively 
improbable, its probability changes from a small value beforehand to unity 
afterwards.  A satisfactory definition of quantity of information must thus 
involve the probability of an event.
  Let's consider a specific example, in which someone is communicating to 
someone else the position of an object in a square divided into four parts:

                          1   2
                        +---+---+
                      1 | A | B |
                        +---+---+
                      2 | C | D |
                        +---+---+

  The person giving the position could communicate the information in two 
ways.  They could either give the letter corresponding to the square, or they
could give the position in a row, column format.  In the first case the 
probability before the event is 1/4, and unity afterwards.  In the latter 
case, the probability is 1/2 before the event and unity afterwards for the 
row, and similarly for the column.  Obviously both methods impart the same 
information: The sum of the information regarding the row and column must 
equal that regarding the whole square.  Thus we can see that information is 
summed while the probabilities are multiplied (1/2 * 1/2 = 1/4), which leads 
us to the definition of quantity of information involving a logarithm of the 
probability:

  Quantity of information I = -log( p )
    
(the minus sign is used to make I positive, since p < 1).  Now in a binary 
system we take logs to base 2, so if we have a source producing 0's and 1's 
with equal probability (ie p( 0 ) = p( 1 ) = 1/2) then the information per 
digit is log2( 2 ) = 1 bit.  A few further examples of probabilities are:

  - Letters of the alphabet.  If these are assumed to be equiprobable, then
    the probability of a letter is 1/26, giving I = log( 26 ) = 
    4.7 bits/letter.
      
  - Numerals.  Again assuming all digits are equiprobable, we get I = 
    log( 10 ) = 3.32 bits/digit.
  
  - An already-known event.  I = log( 1 ) = 0.
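
  These figures follow directly from the definition above; a few lines of 
Python (a minimal sketch, with an arbitrary function name) check them:

  from math import log2

  def information(p):
      """Quantity of information I = -log2( p ), in bits."""
      return -log2(p)

  print(information(1/2))     # 1 bit: a binary digit with equiprobable 0 and 1
  print(information(1/26))    # about 4.7 bits: one of 26 equiprobable letters
  print(information(1/10))    # about 3.32 bits: one of 10 equiprobable digits
  print(information(1))       # 0 bits: an already-known event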

<Entropy>

  In practice we will be more interested in the average information conveyed 
in an entire message than in the specific information in just one event.  
This information is called the *entropy*, and for a set of symbols S1, S2, 
S3, ... Sn with probabilities p( S1 ), p( S2 ), p( S3 ), ... p( Sn ) is 
given by:

                 n
  Entropy H = -sigma p( S sub i) * log( p( S sub i ) )
                i=1
          
  In the case of letters of the alphabet we can now make a more accurate 
estimate of the average information per letter as:

  H = -( p( A ) * log( p( A ) ) + p( B ) * log( p( B ) ) + ...
                                             ... + p( Z ) * log( p( Z ) ) )

    = 4.05 bits (approximately) using standard values for p( A ) ... p( Z ).
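
  The same kind of estimate can be made from any sample of text by using the 
relative letter frequencies in place of the standard probabilities.  A minimal 
sketch in Python (the sample string is arbitrary; a large body of English 
text should come out at roughly 4 bits/letter):

  from collections import Counter
  from math import log2

  def entropy(text):
      """H = -sum of p( S ) * log2( p( S ) ), with p( S ) estimated from
         the relative frequency of each symbol in the sample."""
      counts = Counter(text)
      total = len(text)
      return -sum((n / total) * log2(n / total) for n in counts.values())

  print(entropy("this is only a small illustrative sample of english text"))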
   
<Redundancy>

    A concept related to entropy is that of *redundancy*.  Redundancy is the 
presence of more symbols in a message than are necessary.  For example in a 
system with two symbols X and Y, we could code X as [000], and Y as [111] 
(instead of the more obvious X = [0], Y = [1]).  This gives us some 
protection against errors, since an error of one bit in three could be 
tolerated (assuming that a majority is used for decoding at the receiver).  
This leads to a definition of redundancy as:

               maximum entropy - actual entropy
  Redundancy = --------------------------------
                      maximum entropy
                        
So in the above example the redundancy would be ( 3 - 1 ) / 3 = 0.67, since 
three digits could carry three bits, but actually carry only one.  Spoken 
languages usually have a quite high level of redundancy, allowing them to be 
understood in the presence of noise or errors.  For example, most people will 
be able to understand the phrase "d*ta c*mpr*ssio*" even though some of the 
letters are missing.  The redundancy of English will be explored more fully a 
little further on.
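
  For the [000]/[111] code above, three binary digits could carry three bits 
of information but actually carry only one, so (as a trivial check in Python):

  def redundancy(max_entropy, actual_entropy):
      """Redundancy = ( maximum entropy - actual entropy ) / maximum entropy."""
      return (max_entropy - actual_entropy) / max_entropy

  print(redundancy(3, 1))                # about 0.67 for the repetition code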

<Probability Theory>

  When we speak of the probability of an event (such as the probability of 
the letter 'A' in English text), we actually mean the relative number of 
occurrences of 'A' in a long sample of text, ie:

                number of occurrences of A
  p( A ) = -----------------------------------
           total number of letters in the text

Strictly speaking, p( A ) is the limit of this ratio as the number of letters 
in the sample tends to infinity.  Thus we can see that the probability of 
occurrence of any letter Si will be given by:

                        no of occurrences of Si   
  p( Si ) =    lim      -----------------------
           N->infinity             N
    
Unfortunately, many events in the real world are not completely independent, 
but depend on other events.  For example, in English the chance of an H 
following a T is quite high, whereas that of a T following an H is rather 
low. When we have sequences of letters such as in English, the probabilities 
of letters are strongly governed by those of the preceding letter or letters. 
This is known as *intersymbol influence*; in English it may stretch over many 
letters, words, or even entire sentences.  For example we may have a source 
in which intersymbol influence extends over pairs of adjacent letters.  Such 
a source is often referred to as a *first-order Markov source*, since we are 
using a memory of one symbol (the symbol immediately preceding the current 
one).  The simple model of discrete probabilities is often referred to as a 
*memoryless* or *zero-order Markov source*.  When we remember two symbols 
preceding the current one we say we have a *second-order Markov source*, and 
so on.  In our example with a first-order Markov source, suppose we have two 
symbols x and y.  Now the information obtained when y is received is 
-log( p( y | x ) ), or minus the logarithm of the probability of receiving y 
given that x has already been received.  In order to find the average 
information over all letters we simply have to average over all possible 
pairs of letters x, y.  The result is known as the *conditional entropy 
H( y | x )* of the sequence, and represents the information per letter in 
such a sequence:

  H( y | x ) = -sigma sigma p( x, y ) * log( p( y | x ) )
                  i     j
                    
  Extending this idea to a second-order model, so that we have conditional 
probabilities of the form p( z | xy ), the formula becomes:

  H( z | xy ) = -sigma sigma sigma p( x, y, z ) * log( p( z | xy ) )
    
and so on.  From this we can evaluate the following entropies for English 
(26 letters plus space):

  For a zero-order Markov source with equiprobable symbols, we have an
  entropy of 4.75 bits/letter.

  For a zero-order Markov source with standard probabilities for letters in
  English, we have an entropy of 4.05 bits/letter.
    
  For a first-order Markov source, or digraphs, we have an entropy of 3.32
  bits/letter.
    
  For a second-order Markov source, or trigraphs, we have an entropy of 
  3.10 bits/letter.
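
  The conditional entropy can be estimated for any sample of text by counting 
adjacent pairs of letters.  A minimal sketch in Python (the sample string is 
arbitrary, and a short, repetitive sample will give a figure well below the 
3.32 bits/letter quoted above for English as a whole):

  from collections import Counter
  from math import log2

  def conditional_entropy(text):
      """First-order estimate H( y | x ) = -sum p( x, y ) * log2( p( y | x ) ),
         with the probabilities taken from digraph counts in the sample."""
      pairs = Counter(zip(text, text[1:]))    # counts of adjacent pairs (x, y)
      firsts = Counter(text[:-1])             # counts of the leading symbol x
      total = sum(pairs.values())
      h = 0.0
      for (x, y), n in pairs.items():
          p_xy = n / total                    # p( x, y )
          p_y_given_x = n / firsts[x]         # p( y | x )
          h -= p_xy * log2(p_y_given_x)
      return h

  print(conditional_entropy("the quick brown fox jumps over the lazy dog " * 50))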

<Redundancy in English>

  The ideas presented above can be extended to arbitrarily complex Markov 
models.  For example, assume that we have received the letters "INFORMATIO". 
Then the probability of the next letter being an 'N' is almost unity (we 
could always have an error somewhere which changed it to some other letter).  
In 1949, Shannon and Weaver studied the redundancy in printed English by 
constructing closer and closer approximations to real text.  They started out 
with the models already presented above, arriving at the figure of 3.10 
bits/letter when they considered conditional probabilities over three symbols 
(in other words a second-order Markov source).  Further approximations 
involved using the correct probabilities for complete words (giving an 
entropy of around 2.1 bits/letter) and allowing for conditional probabilities 
between words. Finally, in order to estimate the information in complete 
English text, "volunteers" (probably grad students) were shown samples of 
text and asked to predict the next letters.  In this manner it was shown that 
information varies somewhere between 0.6 and 1.3 bits/letter depending on the 
type of text.
  From this we can see that the storage of text in computers is very 
inefficient.  Text is usually stored in ASCII or EBCDIC form, in which each 
letter is stored in an 8-bit byte, resulting in a storage requirement of 
nearly 8 times what is really necessary.

Statistical Compression
-----------------------

  Although typesetters for centuries have taken advantage of the fact that 
some letters are used more frequently than others in stocking type, it was 
Samuel Morse who first applied the properties of frequency analysis to 
communications.  When Morse decided to use an alphabetical system of signals 
for his newly invented telegraph, he consulted a Philadelphia newspaper's 
typecase.  Morse desired to assign the shorter dot-and-dash symbols used for 
telegraph transmission to the more commonly used letters of the alphabet, 
while longer dot-and-dash symbols would be used for less frequently used 
characters.  In counting type in each letter bin, he found 12,000 e's, 9000 
t's, 8000 each of a, o, m, i, and s, 6400 h's, and so on.  With few 
exceptions, Morse followed this list in developing his telegraph code, 
assigning the shortest symbol, a dot, to the letter e, which was the most 
common letter encountered, another short symbol, a dash, to the letter t, 
and so on, assigning gradually longer codes to less frequently used letters.  
In this way, he came up with a code in which transmission of an English 
message consisting of 100 letters requires the transmission time of around 
940 units, where the transmission of a dot equals one unit, a dash equals 
three units, and a space equals one or three units depending on whether it is 
between letters or words. If the symbols had been assigned at random, the 
same message would require the transmission time of approximately 1160 units, 
an increase of about 23%.  This encoding technique represents the earliest 
applications of statistical compression for data transmission.
    We can apply Morse's ideas to the design of a code for transmitting 
information in the following manner.  Say we have the following codes:
    
       Code No.   Symbol 1    Symbol 2    Symbol 3    Symbol 4
          1          00          01          10          11
          2           0          10         110        1110
          3           0          10         110         111
          
  Code 1 is the simplest of the four, with each symbol being allotted an 
equal length code.  Code 2 is a variable-length code which is known as a 
*comma code*, since the digit '0' is being used as a "comma" or separator 
between code words.  Code 3 is another variable-length code with the 
important property of being an *instantaneous* code, that is, as soon as the 
code for a complete code word has been received we can immediately decode 
that codeword.  Consider a code in which two symbols X and Y are encoded as 
[0] and [01].  Assume we receive the message "001".  When we receive the 
first bit, 0, we can't be sure if this represents X or Y until we look ahead 
to the next bit, which is 0. Again, we can't be sure whether this is the 
message "XX" or "X" followed by the first half of Y.  Only when we receive 
the next bit, 1, can we be sure that we are in fact looking at the message 
"XY".  Code 3 does not suffer from this problem, since no codeword can be a 
prefix of another codeword.
  Now consider a source with symbol probabilities of 0.5, 0.25, 0.125, and 
0.125 for symbols 1-4 respectively.  With the simple fixed-length code, Code 
1, the transmission of 10 letters requires 20 bits.  Using Code 3, the length 
of the message would be:

    10 * ( ( 0.5 * 1 ) + ( 0.25 * 2 ) + ( 0.125 * 3 ) + ( 0.125 * 3 ) )
    
    = 17.5 bits, an improvement of 12.5% over the simpler fixed-length code.

  The average length L of a code is an important parameter in determining the 
efficiency of a system.  If we encode a source whose symbols S sub i occur 
with probabilities p sub i into codewords of length curly l sub i, we have 
the average length of the code as:

  L = sigma p sub i * curly l sub i
        i

  The efficiency of a code E is defined as:
    
      H
  E = -
      L
        
and is always <= 1 (corresponding to H( S ) <= L ).  In general, H( S ) < L, 
since each codeword length curly l sub i must be an integral number of bits 
(ie 1, 2, 3..., not 2.82348).  Using the above code, we can calculate its 
efficiency as:

  L = sigma p sub i * curly l sub i
        i
          
    = ( ( 1 * 0.5 ) + ( 2 * 0.25 ) + ( 3 * 0.125 ) + ( 3 * 0.125 ) )
     
    = 1.75 bits/symbol
      
  Now if we calculate the entropy, we get:
    
  H = -sigma p( S sub i ) * log( p( S sub i ) )
         i
          
    = -( ( 0.5 * log( 0.5 ) ) + ( 0.25 * log( 0.25 ) ) + 
         ( 0.125 * log( 0.125 ) ) + ( 0.125 * log( 0.125 ) ) )
           
    = 1.75 bits/symbol
      
  In this case we have H( S ) = L.  This is in fact a rather special case 
where all symbol probabilities are integral powers of 1/2 (ie p( S sub i ) = 
( 1/2 ) ^ n).  This result does not occur if the symbol probabilities are not 
powers of 1/2, a fact which becomes important in the discussion of Huffman and 
Shannon-Fano codes below.
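
  The calculation above is easy to mechanise; a minimal sketch in Python using 
the probabilities and the Code 3 lengths from the table:

  from math import log2

  probs = [0.5, 0.25, 0.125, 0.125]      # symbol probabilities from above
  lengths = [1, 2, 3, 3]                 # codeword lengths of Code 3

  L = sum(p * l for p, l in zip(probs, lengths))   # average code length
  H = -sum(p * log2(p) for p in probs)             # entropy of the source

  print(L, H, H / L)                     # 1.75, 1.75, efficiency E = 1.0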

Huffman and Related Compression Techniques
------------------------------------------

  *Huffman compression* is a statistical data compression technique which 
gives a reduction in the average code length used to represent the symbols of 
an alphabet.  The Huffman code is an example of a code which is optimal in the 
case where all symbol probabilities are integral powers of 1/2.  A Huffman 
code can be built in the following manner:

  [1] Rank all symbols in order of probability of occurrence.
    
  [2] Successively combine the two symbols of the lowest probability to form
      a new composite symbol; eventually we will build a tree in which the
      weight of each node is the sum of the weights of all nodes beneath it.
        
  [3] Trace a path to each leaf, noting the branch taken at each node.
    
  For example, using the symbols and probabilities given above,  we would 
proceed as follows:

Step 1:

  S sub 1 [0.5]
    
  S sub 2 [0.25]
    
  S sub 3 [0.125]
    
  S sub 4 [0.125]
    
Step 2a:

  S sub 1 [0.5]
    
  S sub 2 [0.25]
    
  S sub 3 [0.125] -+- [0.25]
                   |
  S sub 4 [0.125] -+

Step 2b:

  S sub 1 [0.5]
    
  S sub 2 [0.25] -------------+- [0.5]
                              |
  S sub 3 [0.125] -+- [0.25] -+
                   |
  S sub 4 [0.125] -+

Step 2c:

  S sub 1 [0.5] ------------------------+- [1.0]
                                        |  
  S sub 2 [0.25] -------------+- [0.5] -+
                              |
  S sub 3 [0.125] -+- [0.25] -+
                   |
  S sub 4 [0.125] -+

Step 3:
                                  (0)
  S sub 1 [0.5] ------------------------+- [1.0]
                        (0)       (1)   |  
  S sub 2 [0.25] -------------+- [0.5] -+
             (0)        (1)   |
  S sub 3 [0.125] -+- [0.25] -+
             (1)   |
  S sub 4 [0.125] -+

  Thus we get the following codes:
    
  S sub 1 = 0
  S sub 2 = 10
  S sub 3 = 110
  S sub 4 = 111
    
  which have already been discussed above.
    
  Huffman codes can be easily constructed with the aid of a computer.  The 
algorithm assumes that initially we have a forest of trees where each tree 
corresponds to an individual source symbol, and then proceeds as follows:

  while( there is more than one tree in the forest )
      {
      /* Find the two symbols with the lowest probability */
      i = the tree with the smallest weight;
      j = the tree with the next smallest weight;
        
      /* Combine them into a new composite symbol */
      create a new node with left child i and right child j;
      replace tree i with this new tree, with the tree's weight being
          the weight of i plus j;
      delete j from the forest;
      }
        
  As we merge the two lowest-weight trees into a single new tree we slowly 
decrease the size of the forest until finally we have reduced the entire 
forest to one tree, which represents the Huffman tree for that set of symbols.
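
  A minimal sketch of this algorithm in Python (using a heap to find the two 
lowest-weight trees at each step; the tiebreak counter is only there to stop 
the heap from trying to compare trees):

  import heapq

  def huffman_codes(probabilities):
      """Build a Huffman code from a { symbol: probability } mapping."""
      heap = [(p, i, sym) for i, (sym, p) in enumerate(probabilities.items())]
      heapq.heapify(heap)
      count = len(heap)
      while len(heap) > 1:
          # combine the two lowest-weight trees into a new composite tree
          w1, _, t1 = heapq.heappop(heap)
          w2, _, t2 = heapq.heappop(heap)
          heapq.heappush(heap, (w1 + w2, count, (t1, t2)))
          count += 1
      codes = {}
      def walk(tree, prefix):            # trace a path to each leaf
          if isinstance(tree, tuple):
              walk(tree[0], prefix + "0")
              walk(tree[1], prefix + "1")
          else:
              codes[tree] = prefix or "0"
      walk(heap[0][2], "")
      return codes

  print(huffman_codes({"S1": 0.5, "S2": 0.25, "S3": 0.125, "S4": 0.125}))
  # {'S1': '0', 'S2': '10', 'S3': '110', 'S4': '111'}

The 0/1 labelling at each node is arbitrary, so the exact bit patterns may 
differ from those in the diagram above, but the codeword lengths (and hence 
the average code length) come out the same.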

                                    --
                                    
  A technique related to Huffman coding is *Shannon-Fano coding*, which was 
suggested by Shannon and Weaver in 1949 and modified by Fano in 1961.  It 
works as follows:

  [1] Rank all symbols in order of probability of occurrence.
    
  [2] Successively divide the set of symbols into two subsets whose total
      probabilities are as nearly equal as possible.  All symbols in the
      first subset are assigned a binary zero, and all those in the second
      a binary one.
    
  For example, again using the symbols and probabilities given above,  we 
would proceed as follows:

Step 1:

  S sub 1 [0.5]
   
  S sub 2 [0.25]
  
  S sub 3 [0.125]
   
  S sub 4 [0.125]
    
Step 2a:
                       (0)
  S sub 1 [0.5]   --- [0.50] -+- [1.00]
                       (1)    |     
  S sub 2 [0.25]  -+- [0.50] -+        
                   |             
  S sub 3 [0.125] -+  
                   |
  S sub 4 [0.125] -+

Step 2b:
                                  (0)
  S sub 1 [0.5]   -------------- [0.50] -+- [1.00] 
                       (0)        (1)    |
  S sub 2 [0.25]  --- [0.25] -+- [0.50] -+        
                       (1)    |      
  S sub 3 [0.125] -+- [0.25] -+ 
                   |
  S sub 4 [0.125] -+

Step 2c:
                                  (0)
  S sub 1 [0.5]   -------------- [0.50] -+- [1.00] 
                       (0)        (1)    |
  S sub 2 [0.25]  --- [0.25] -+- [0.50] -+        
            (0)        (1)    |      
  S sub 3 [0.125] -+- [0.25] -+ 
            (1)    |
  S sub 4 [0.125] -+

  Thus we get the following codes:
    
  S sub 1 = 0
  S sub 2 = 10
  S sub 3 = 110
  S sub 4 = 111
    
  These codes are in this case identical to the Huffman codes - the main 
difference is that the algorithm used to create the Huffman codes is 
bottom-up, and the one for the Shannon-Fano codes is top-down.  In practice 
Shannon-Fano codes are usually of similar lengths to Huffman codes; however, 
the Huffman code is never longer on average, since the Huffman construction 
is guaranteed to minimise the average code length.  One problem 
that Shannon-Fano codes have is that they are not uniquely defined - if the 
cut point for the set of symbols is on an interval, you can cut above or 
below that point.  However Shannon-Fano coding does have the advantage that 
it is a bit more tractable mathematically than Huffman coding, making 
mathematical analysis easier.  Both Huffman and Shannon-Fano codes are 
instantaneous codes.
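
  A minimal sketch of the top-down procedure in Python (the cut point is 
chosen to divide the probability mass as evenly as possible):

  def shannon_fano(symbols):
      """Build a Shannon-Fano code from a list of ( symbol, probability ) pairs."""
      symbols = sorted(symbols, key=lambda s: s[1], reverse=True)
      codes = {sym: "" for sym, _ in symbols}

      def split(group):
          if len(group) < 2:
              return
          total = sum(p for _, p in group)
          # find the cut which divides the probability mass most evenly
          running, cut, best = 0.0, 1, float("inf")
          for i in range(1, len(group)):
              running += group[i - 1][1]
              if abs(total - 2 * running) < best:
                  best, cut = abs(total - 2 * running), i
          for sym, _ in group[:cut]:     # first subset gets a 0 ...
              codes[sym] += "0"
          for sym, _ in group[cut:]:     # ... second subset gets a 1
              codes[sym] += "1"
          split(group[:cut])
          split(group[cut:])

      split(symbols)
      return codes

  print(shannon_fano([("S1", 0.5), ("S2", 0.25), ("S3", 0.125), ("S4", 0.125)]))
  # {'S1': '0', 'S2': '10', 'S3': '110', 'S4': '111'}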
    
Optimal Coding:
---------------

  It would appear from the above results that Huffman or Shannon-Fano coding 
is the perfect means of compressing data.  However, this is *not* the case.  
As mentioned above, these coding methods are optimal when and only when the 
symbol probabilities are integral powers of 1/2, which is usually not the 
case. Consider the following simple example:  Suppose we have a source 
producing symbols X and Y of probability 0.66 and 0.33 respectively.  Then 
the Huffman/Shannon-Fano code for these symbols would be:

      (0)
  X [0.66] -+- [1.0]
      (1)   |  
  Y [0.33] -+
    
  resulting in a code of X = [0], Y = [1], with average length:
    
  L = ( 1 * 0.33 ) + ( 1 * 0.66 )
    
    = 1.0
      
  and entropy:
    
  H = -( ( 0.33 * log( 0.33 ) ) + ( 0.66 * log( 0.66 ) ) )
    
    = 0.92
      
  giving an efficiency of:
    
      0.92
  E = ----
      1.00
        
  or 92%.

  There is, however, a way to improve the coding efficiency of Huffman and 
Shannon-Fano encoding.  This can be done by means of *alphabet extensions*, 
that is, by extending the source alphabet to include groups of symbols 
instead of just individual symbols.  Say we have an alphabet of two symbols, 
X and Y.  Now if we extend this to the alphabet of symbol pairs { XX, XY, YX, 
YY } and build the Huffman tree for this alphabet we find that there is an 
increase in coding efficiency.  This effect continues as we keep extending 
the alphabet in this manner (there are exceptions to this rule - we can 
actually get a negative increase during an extension from n to n + 1, but the 
lower bound will still be the entropy of the data), for example for the above 
alphabet with symbol probabilities of .25 and .75 we get the following results:

    H( S sub 1 ) =  .811, L sub 1 = 1.000, n sub 1 = .811
    H( S sub 2 ) = 1.623, L sub 2 = 1.688, n sub 2 = .962
    H( S sub 3 ) = 2.434, L sub 3 = 2.469, n sub 3 = .986
    H( S sub 4 ) = 3.245, L sub 4 = 3.273, n sub 4 = .991
    H( S sub 5 ) = 4.056, L sub 5 = 4.089, n sub 5 = .992
    
As the extension tends towards infinity, the efficiency approaches 1.000. 
However for practical reasons this approach is impossible - most computers 
would run out of memory long before they got to the third or fourth 
extensions. And yet there *is* a way in which this type of encoding can be 
achieved: through a technique called *arithmetic coding*.
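
  The figures in the table above can be reproduced with a little code.  A 
minimal sketch in Python, which builds the n-th extension of the { X, Y } 
alphabet, Huffman-codes it, and prints the entropy, average length, and 
efficiency of each extension (only the codeword lengths are needed, so the 
Huffman routine just tracks the depth of each leaf):

  import heapq
  from itertools import product
  from math import log2

  def huffman_lengths(probs):
      """Codeword lengths of a Huffman code for a { symbol: probability } map."""
      heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
      heapq.heapify(heap)
      count = len(heap)
      while len(heap) > 1:
          w1, _, a = heapq.heappop(heap)
          w2, _, b = heapq.heappop(heap)
          merged = {s: depth + 1 for s, depth in {**a, **b}.items()}
          heapq.heappush(heap, (w1 + w2, count, merged))
          count += 1
      return heap[0][2]

  base = {"X": 0.25, "Y": 0.75}
  for n in range(1, 6):
      ext = {}                           # the n-th extension of the alphabet
      for group in product(base, repeat=n):
          p = 1.0
          for s in group:
              p *= base[s]
          ext["".join(group)] = p
      lengths = huffman_lengths(ext)
      L = sum(ext[s] * lengths[s] for s in ext)
      H = -sum(p * log2(p) for p in ext.values())
      print(n, round(H, 3), round(L, 3), round(H / L, 3))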

Arithmetic Coding:
------------------

    As already discussed above, Huffman coding is not optimal.  This is due 
to the fact that it is not used to encode single, complete messages, but only 
individual bytes or n-byte blocks, and while each byte (or block of bytes) is 
encoded with minimum redundancy into an integral number of bits, the sequence 
of bytes is not.  However, the technique of *arithmetic coding* does not have 
this restriction: It achieves the same effect as treating the message as one 
single unit (a technique which would, for Huffman coding, require enumeration 
of every single possible message), and thus attains the theoretical entropy 
bound to compression efficiency for any source.
  Arithmetic coding works by representing a message as an interval of real 
numbers between 0 and 1.  As the message becomes longer, the interval needed 
to represent it becomes smaller and smaller, and the number of bits needed to 
specify that interval increases.  Successive symbols in the message reduce 
this interval in accordance with the probability of that symbol.  The more 
likely symbols reduce the range by less, and thus add fewer bits to the 
message.

     1                                             Codewords
    +-----------+-----------+-----------+           /-----\
    |           |8/9 XX     |  Detail   |<- 31/32    11111
    |           +-----------+-----------+<- 15/16     1111
    |    X      |           | too small |<- 14/16     1110
    |2/3        |    XY     | for text  |<- 6/8        110
    +-----------+-----------+-----------+
    |           |           |16/27 YXX  |<- 10/16     1010
    |           |           +-----------+
    |           |    YX     |           |
    |           |           |   YXY     |<- 4/8        100
    |           |4/9        |           |
    |           +-----------+-----------+
    |           |           |           |
    |    Y      |           |   YYX     |<- 3/8        011
    |           |           |8/27       |
    |           |           +-----------+
    |           |    YY     |           |
    |           |           |           |<- 1/4         01
    |           |           |   YYY     |
    |           |           |           |
    |0          |           |           |
    +-----------+-----------+-----------+
    
  As an example of arithmetic coding, let's consider the previous example of
two symbols X and Y, of probabilities 0.66 and 0.33, for which the
Huffman/Shannon-Fano code efficiency was 92%.  To encode a message, we
examine the first symbol: if it is an X, we choose the X partition of the
interval; if it is a Y, we choose the Y partition.  Continuing in this manner
for three symbols, we get the codewords shown to the right of the diagram
above - they can be
found by simply taking an appropriate location in the interval for that
particular set of symbols and turning it into a binary fraction.  From these
codewords we can calculate the average codeword length as:

  L = ( 0.062 * 5 ) + ( 0.296 * 4 ) + ( 0.296 * 4 ) + ( 0.296 * 4 ) +
      ( 0.444 * 3 ) + ( 0.296 * 4 ) + ( 0.444 * 3 ) + ( 0.444 * 3 ) +
      ( 0.593 * 2 )

    = 2.88 bits for 3 symbols

    = 0.96 bits/symbol

giving an efficiency of:

      0.92
  E = ----
      0.96

    = 96%
        
  In this case the arithmetic code is not completely efficient, which is due 
to the shortness of the message - with longer messages the coding efficiency 
does indeed approach 100%.
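
  A minimal sketch of the interval-narrowing process in Python (floating-point 
arithmetic is used purely for illustration - a practical coder works with 
scaled integers and renormalisation - and the codewords produced will not 
necessarily match those in the diagram above, since they depend on which 
point in the final interval is chosen):

  from math import ceil

  def arithmetic_encode(message, probs):
      """Narrow [0, 1) according to the probability of each successive symbol."""
      ranges, c = {}, 0.0
      for sym, p in probs.items():       # assign each symbol a sub-range
          ranges[sym] = (c, c + p)
          c += p
      low, high = 0.0, 1.0
      for sym in message:
          r_low, r_high = ranges[sym]
          span = high - low
          low, high = low + span * r_low, low + span * r_high
      return low, high

  def shortest_codeword(low, high):
      """The shortest binary fraction 0.b1..bk lying inside [ low, high )."""
      k = 1
      while True:
          n = ceil(low * 2 ** k)         # smallest k-bit fraction >= low
          if n / 2 ** k < high:
              return format(n, "0%db" % k)
          k += 1

  # the decoder would also need the message length (or an end symbol) to stop
  low, high = arithmetic_encode("XXY", {"X": 2/3, "Y": 1/3})
  print(low, high, shortest_codeword(low, high))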

  Now that we have an efficient encoding technique, what can we do with it? 
What we need is a technique for building a model of the data which we can 
then use with the encoder.  The simplest model is a fixed one, for example a 
table of standard letter frequencies for English text which we can then use 
to get letter probabilities.  An improvement on this technique is to use an 
*adaptive model*, in other words a model which adjusts itself to the data 
which is being compressed as the data is compressed.  We can convert the 
fixed model into an adaptive one by adjusting the symbol frequencies after 
each new symbol is encoded, allowing the model to track the data being 
transmitted.  However, we can do much better than that.  Recall from the 
section on information theory that using the symbol probabilities by 
themselves is not a particularly good estimate of the true entropy of the 
data:  We can take into account intersymbol probabilities as well.  The best 
compressors available today take this approach:  DMC (Dynamic Markov Coding) 
starts with a zero-order Markov model and gradually extends this initial 
model as compression progresses; PPM (Prediction by Partial Matching) looks 
for a match of the text to be compressed in an order-n context.  If no match 
is found, it drops to an order n-1 context, until it reaches order 0.  Both 
these techniques thus obtain a much better model of the data to be 
compressed, which, combined with the use of arithmetic coding, results in 
superior compression performance.
    
  So if arithmetic coding-based compressors are so powerful, why are they not 
used universally?  Apart from the fact that they are relatively new and 
haven't come into general use too much yet, there is also one major concern:  
The fact that they consume rather large amounts of computing resources, both 
in terms of CPU power and memory.  The building of sophisticated models for 
the compression can chew through a fair amount of memory (especially in the 
case of DMC, where the model can grow without bounds); and the arithmetic 
coding itself involves a fair amount of number crunching.  Most 
"bread-and-butter" PC's simply don't have the memory (640K - 1MB) or CPU 
power (8-10 MHz 68000's or 80286's) to run a powerful arithmetic compressor.  
There is however an alternative approach, a class of compressors generally 
referred to as *substitutional* or *dictionary-based compressors*.

Substitutional Compressors
--------------------------

  The basic idea behind a substitutional compressor is to replace an 
occurrence of a particular phrase or group of bytes in a piece of data with a 
reference to a previous occurrence of that phrase.  There are two main 
classes of schemes, named after Jacob Ziv and Abraham Lempel, who first 
proposed them in 1977 and 1978.

<The LZ78 family of compressors>

  LZ78-based schemes work by entering phrases into a *dictionary* and then, 
when a repeat occurrence of that particular phrase is found, outputting the 
dictionary index instead of the phrase.  There exist several compression 
algorithms based on this principle, differing mainly in the manner in which 
they manage the dictionary.  The most well-known scheme (in fact the most 
well-known of all the Lempel-Ziv compressors, the one which is generally (and 
mistakenly) referred to as "Lempel-Ziv Compression"), is Terry Welch's LZW 
scheme, which he designed in 1984 for implementation in hardware for high- 
performance disk controllers.

Input string: /WED/WE/WEE/WEB

Character input:    Code output:    New code value and associated string:
    /W                  /                   256 = /W
    E                   W                   257 = WE
    D                   E                   258 = ED
    /                   D                   259 = D/
    WE                  256                 260 = /WE
    /                   E                   261 = E/
    WEE                 260                 262 = /WEE
    /W                  261                 263 = E/W
    EB                  257                 264 = WEB
    <END>               B
    
  LZW starts with a 4K dictionary, of which entries 0-255 refer to individual 
bytes, and entries 256-4095 refer to substrings.  Each time a new code is 
generated it means a new string has been parsed.  New strings are generated 
by appending the current character K to the end of an existing string w.  The 
algorithm for LZW compression is as follows:

  set w = NIL
  loop
      read a character K
      if wK is in the dictionary
          w = wK
      else
          output the code for w
          add wK to the string table
          w = K
  endloop

  A sample run of LZW over a (highly redundant) input string can be seen in 
the diagram above.  The strings are built up character-by-character starting 
with a code value of 256.  LZW decompression takes the stream of codes and 
uses it to exactly recreate the original input data.  Just like the 
compression algorithm, the decompressor adds a new string to the dictionary 
each time it reads in a new code.  All it needs to do in addition is to 
translate each incoming code into a string and send it to the output.  A 
sample run of the LZW decompressor is shown below.

Input code: /WED<256>E<260><261><257>B

Input code:        Output string:     New code value and associated string:
    /                  /            
    W                  W                      256 = /W
    E                  E                      257 = WE
    D                  D                      258 = ED
    256                /W                     259 = D/
    E                  E                      260 = /WE
    260                /WE                    261 = E/
    261                E/                     262 = /WEE
    257                WE                     263 = E/W
    B                  B                      264 = WEB
           
  The most remarkable feature of this type of compression is that the entire 
dictionary has been transmitted to the decoder without actually explicitly 
transmitting the dictionary.  At the end of the run, the decoder will have a 
dictionary identical to the one the encoder has, built up entirely as part of 
the decoding process.
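
  A minimal byte-oriented sketch of LZW in Python (the dictionary here is 
allowed to grow without limit, where a real implementation such as the one 
described above would stop at 4096 entries; the code sequence produced for 
"/WED/WE/WEE/WEB" matches the table above):

  def lzw_compress(data):
      """Compress a byte string to a list of integer codes."""
      dictionary = {bytes([i]): i for i in range(256)}
      next_code, w, out = 256, b"", []
      for byte in data:
          wk = w + bytes([byte])
          if wk in dictionary:
              w = wk
          else:
              out.append(dictionary[w])  # output the code for w
              dictionary[wk] = next_code # add wK to the dictionary
              next_code += 1
              w = bytes([byte])
      if w:
          out.append(dictionary[w])
      return out

  def lzw_decompress(codes):
      """Rebuild the byte string (and the dictionary) from the code stream."""
      dictionary = {i: bytes([i]) for i in range(256)}
      next_code = 256
      prev = dictionary[codes[0]]
      out = bytearray(prev)
      for code in codes[1:]:
          entry = dictionary[code] if code in dictionary else prev + prev[:1]
          out += entry                   # the else case is the KwKwK string
          dictionary[next_code] = prev + entry[:1]
          next_code += 1
          prev = entry
      return bytes(out)

  codes = lzw_compress(b"/WED/WE/WEE/WEB")
  print(codes)                   # [47, 87, 69, 68, 256, 69, 260, 261, 257, 66]
  print(lzw_decompress(codes))   # b'/WED/WE/WEE/WEB'
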
    LZW is more commonly encountered today in a variant known as LZC, after 
its use in the UNIX "compress" program.  In this variant, pointers do not 
have a fixed length.  Rather, they start with a length of 9 bits, and then 
slowly grow to their maximum possible length once all the pointers of a 
particular size have been used up.  Furthermore, the dictionary is not frozen 
once it is full as for LZW - the program continually monitors compression 
performance, and once this starts decreasing the entire dictionary is 
discarded and rebuilt from scratch.  More recent schemes use some sort of 
least-recently-used algorithm to discard little-used phrases once the 
dictionary becomes full rather than throwing away the entire dictionary.  
Finally, not all schemes build up the dictionary by adding a single new 
character to the end of the current phrase. An alternative technique is to 
concatenate the previous two phrases (LZMW), which results in a faster 
buildup of longer phrases than the character-by-character buildup of the 
other methods.  The disadvantage of this method is that a more sophisticated 
data structure is needed to handle the dictionary.

<The LZ77 family of compressors>

  LZ77-based schemes keep track of the last n bytes of data seen, and when a 
phrase is encountered that has already been seen, they output a pair of 
values corresponding to the position of the phrase in the previously-seen 
buffer of data, and the length of the phrase.  In effect the compressor moves 
a fixed-size *window* over the data (generally referred to as a *sliding 
window*), with the position part of the (position, length) pair referring to 
the position of the phrase within the window.  The most commonly used 
algorithms are derived from the LZSS scheme described by James Storer and 
Thomas Szymanski in 1982.  In this the compressor maintains a window of size 
N bytes and a *lookahead buffer* the contents of which it tries to find a 
match for in the window:

  while( lookAheadBuffer not empty )
      {
      get a pointer ( position, length ) to the longest match in the window
          for the lookahead buffer;

      if( length > MINIMUM_MATCH_LENGTH )
          {
          output a ( position, length ) pair;
          shift the window length characters along;
          }
      else
          {
          output the first character in the lookahead buffer;
          shift the window 1 character along;
          }
      }
        
  Decompression is simple and fast:  Whenever a ( position, length ) pair is 
encountered, go to that ( position ) in the window and copy ( length ) bytes 
to the output.
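
  A minimal sketch of LZSS in Python, using a brute-force search of the window 
(a real implementation would use a hash table, binary tree, or similar 
structure to find matches quickly; the parameter values are arbitrary, and 
the pair is stored here as a distance back into the window rather than an 
absolute position):

  def lzss_compress(data, window=4096, min_match=3, max_match=18):
      """Produce a list of tokens: literal bytes or ( distance, length ) pairs."""
      i, out = 0, []
      while i < len(data):
          best_len, best_dist = 0, 0
          for j in range(max(0, i - window), i):   # search the sliding window
              length = 0
              while (length < max_match and i + length < len(data)
                     and data[j + length] == data[i + length]):
                  length += 1
              if length > best_len:
                  best_len, best_dist = length, i - j
          if best_len >= min_match:
              out.append((best_dist, best_len))    # a ( distance, length ) pair
              i += best_len
          else:
              out.append(data[i])                  # an uncompressed literal
              i += 1
      return out

  def lzss_decompress(tokens):
      """Copy ( length ) bytes from ( distance ) bytes back, or emit literals."""
      out = bytearray()
      for t in tokens:
          if isinstance(t, tuple):
              dist, length = t
              for _ in range(length):              # byte-by-byte copy allows
                  out.append(out[-dist])           # overlapping matches
          else:
              out.append(t)
      return bytes(out)

  tokens = lzss_compress(b"this is just a test, just a test")
  print(tokens)
  print(lzss_decompress(tokens))   # b'this is just a test, just a test'
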
  Sliding-window-based schemes can be simplified by numbering the input text 
characters mod N, in effect creating a circular buffer.  The sliding window 
approach automatically creates the LRU effect which must be done explicitly 
in LZ78 schemes.  Variants of this method apply additional compression to the 
output of the LZSS compressor, which include a simple variable-length code 
(LZB), dynamic Huffman coding (LZH), and Shannon-Fano coding (ZIP), all of 
which result in a certain degree of improvement over the basic scheme, 
especially when the data are rather random and the LZSS compressor has little 
effect.
  Recently an algorithm was developed which combines the ideas behind LZ77 
and LZ78 to produce a hybrid called LZFG.  LZFG uses the standard sliding 
window, but stores the data in a modified trie data structure and produces as 
output the position of the text in the trie.  Since LZFG only inserts 
complete *phrases* into the dictionary, it runs considerably faster than any 
other LZ77-based compressor.  This technique achieves the best compression 
of any Ziv-Lempel type compressor, and has one of the best runtimes.

Compression Performance
-----------------------

  The following representative sample of compressors was tested on a suite 
of test data which ranged from straight ASCII text, through program source 
code, graphics data, to object code and a file of seismic data.  The 
compressors evaluated were:

  The following schemes can be classed as "generic" schemes used as 
representatives of a particular type of compression (these values are taken
from "Text Compression").

  Huffman coding [HUFF]:  Vitter's adaptive Huffman coding scheme, as used
                          in the UNIX "compact" program.

  LZW compression [LZW]: The LZ78 family representative.
    
  LZSS compression [LZSS]: The LZ77 family representative.
    
  LZFG compression [LZFG]: An LZ77/LZ78 hybrid.
    
  PPMC compression [PPMC]: The current state of the art in compression 
                           performance, outlined in the section on
                           arithmetic coding.

  The following schemes are examples of practical implementations in everyday 
use:

  Modified LZW [LZC]: As first used in the UNIX "compress" program, also used
                      in the ARC, ZOO, and Stuffit archivers (results are for
                      Compress 4.0).
                        
  Imploding [PKZIP]: The "Imploding" scheme used by the PKZIP archiver 
                     (Version 1.10).
    
  Freezing [LHARC]: The LZSS/adaptive Huffman coding combination used by the
                    LHARC archiver (Version 1.13).
    
  The compression performances were as follows (all results are in 
bits/byte):

       Size      HUFF    LZW   LZSS   LZFG   PPMC    LZC   PKZIP  LHARC 
---------------+------+------+------+------+------+------+------+------+
Bib  : 111,261 | 5.24 | 3.84 | 3.35 | 2.90 | 2.11 | 3.89 | 2.97 | 3.34 |
Book1: 768,771 | 4.56 | 4.03 | 4.08 | 3.62 | 2.48 | 4.06 | 3.65 | 3.85 |
Book2: 610,856 | 4.83 | 4.52 | 3.41 | 3.05 | 2.26 | 4.25 | 3.04 | 3.31 |
Geo  : 102,400 | 5.70 | 6.15 | 6.43 | 5.70 | 4.78 | 6.10 | 5.95 | 5.53 |
News : 377,109 | 5.23 | 4.92 | 3.79 | 3.44 | 2.65 | 4.90 | 3.34 | 3.52 |
Obj1 :  21,504 | 6.06 | 6.30 | 4.57 | 4.03 | 3.76 | 6.15 | 3.92 | 4.00 |
Obj2 : 246,814 | 6.30 | 9.81 | 3.30 | 2.96 | 2.69 | 5.19 | 2.92 | 2.94 |
Paper1: 53,161 | 5.04 | 4.58 | 3.38 | 3.03 | 2.48 | 4.43 | 3.01 | 3.27 |
Paper2: 82,199 | 4.65 | 4.02 | 3.58 | 3.16 | 2.45 | 3.98 | 3.20 | 3.43 |
Pic  : 513,216 | 1.66 | 1.09 | 1.67 | 0.87 | 1.09 | 0.99 | 0.99 | 0.96 |
Progc:  39,611 | 5.26 | 4.88 | 3.24 | 2.89 | 2.49 | 4.41 | 2.86 | 3.11 |
Progl:  71,646 | 4.81 | 3.89 | 2.37 | 1.97 | 1.90 | 3.57 | 1.93 | 2.09 |
Progp:  49,379 | 4.92 | 3.73 | 2.36 | 1.90 | 1.84 | 3.72 | 1.92 | 2.07 |
Trans:  93,695 | 5.98 | 4.24 | 2.44 | 1.76 | 1.77 | 3.94 | 1.98 | 2.40 |
---------------+------+------+------+------+------+------+------+------+
 14 Files        4.99   4.71   3.43   2.95   2.48   4.26   2.98   3.13  

  Huffman coding is easily the worst of the lot, particularly for the highly 
compressible program and trans files, where most other compressors attain 
around 2 bits/byte.  The disadvantage of compression which stops adapting to 
the data is shown by the results for the LZW compression of the file obj2, 
which at the start 
consists almost purely of text, but then switches to code and graphics data. 
Since the LZW dictionary is frozen after 4K (during which time it has adapted 
itself to compressing text) it ends up expanding the data once it gets past 
the text part of the program.

    So which compressor is best overall?  That depends on several things.  
PPMC achieves the best compression by quite a margin, but requires a fair 
amount of computing resources, both in terms of memory and CPU power (though 
of course these considerations will become relatively minor as faster systems 
with larger amounts of memory become available).  One nice feature of PPMC is 
that it can be scaled down to run in less memory, but compression performance 
suffers in doing this.  LZC is popular since it is relatively easy to 
implement, very fast, and achieves acceptable levels of compression in little 
memory.  PKZIP is especially popular on micros since it is also quite fast, 
requires little memory, and usually achieves better performance than any 
other archiver.
   
                                ----------------
--
 Peter_Gutmann@kcbbs.gen.nz || peter@nacjack.gen.nz || pgut1@cs.aukuni.ac.nz
                     (In order of decreasing reliability)
Warning!
  Something large, scaly, and with fangs a foot long lives between <yoursite>
and <mysite>.  Every now and then it kills and eats messages.  If you don't
receive a reply within a week, try resending...