[alt.sources.d] Compression algorithm

gtoal@tharr.UUCP (Graham Toal) (09/16/90)

These are my thoughts on a new compression algorithm to replace Lempel-
Ziv-Welch.*  They are very preliminary, and I haven't coded any of it.  I throw
it out for comment, and if anyone wants to modify it and/or code it up,
please feel free to do so.  It's too big a hack for me to start on just now;
I've got quite a big in-tray!

(* if it becomes necessary because of the software patent issue)


Take the input data laid out horizontally, and build a tree out of it with
pairwise bottom-up assembly, eg:


FIG 1:
                      /\
                     /  \
                    /    \
                   /      \
                  /        \
                 /          \
                /            \
               /              \
              /                \
             /                  \
            /                    \
           /                      \
          /\                      /\
         /  \                    /  \
        /    \                  /    \
       /      \                /      \
      /        \              /        \
     /          \            /          \
    /\          /\          /\          /\
   /  \        /  \        /  \        /  \
  /    \      /    \      /    \      /    \
 /\    /\    /\    /\    /\    /\    /\    /\
a  b  c  d  a  b  c  g  h  i  b  c  c  g  c  g
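
As a rough sketch (names and representation are my own, not from the article),
the pairwise bottom-up assembly of FIG 1 could look like this in Python:

```python
# Sketch of the pairwise bottom-up build of FIG 1.  Each pass pairs
# adjacent nodes until a single root remains; the input length is
# assumed to be a power of two (padding is discussed further down).

def build_tree(data):
    level = [("leaf", ch) for ch in data]          # bottom row: one leaf per char
    while len(level) > 1:
        level = [("node", level[i], level[i + 1])  # pair up neighbours
                 for i in range(0, len(level), 2)]
    return level[0]                                # the single root

root = build_tree("abcdabcghibccgcg")              # the 16 chars of FIG 1
```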



Now we convert this into a DAG (Directed Acyclic Graph). In fact we actually
do the node merging/dag creation on the fly as we build the table bottom up.
If you use a hash table to detect identical nodes, this is, perhaps
surprisingly, a linear-time algorithm, at the expense of space proportional
to the tree width; I used something similar in a spelling-checker project to
compact word dictionaries, so I know it works.
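
A sketch of that on-the-fly merging, using a Python dict as the hash table
(all names here are my own invention):

```python
# Build the DAG of FIG 2 directly: a dict "interns" each distinct subtree,
# so identical subtrees share a single node id.  One dict lookup per node
# gives the linear-time behaviour described above.

def build_dag(data):
    seen = {}    # subtree description -> shared node id
    nodes = []   # node id -> description

    def intern(desc):
        if desc not in seen:
            seen[desc] = len(nodes)
            nodes.append(desc)
        return seen[desc]

    level = [intern(("leaf", ch)) for ch in data]
    while len(level) > 1:
        level = [intern(("pair", level[i], level[i + 1]))
                 for i in range(0, len(level), 2)]
    return level[0], nodes

root, nodes = build_dag("abcdabcghibccgcg")
# the full tree of FIG 1 would need 31 nodes; sharing brings this down to 19
```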

The Directed Graph looks like this:


FIG 2:
                       0
                      /\
                     /  \
                    /    \
                   /      \
                  /        \
                 /          \
                /            \
               /              \
              /                \
             /                  \
            /                    \
           1                      2
          /\                      /\
         /  \                    /  \
        /    \                  /    \
       /      \                /      \
      /        \              /        \
     3          4            5          6
    /\          /\          /\          /\
   /  \        /  \        /  \        /  \
  7    8     =7    10     11   12    =10  =10
 /\    /\          /\    /\    /\          
a  b  c  d  a  b  c  g  h  i  b  c  c  g  c  g


Now we output the tree from left to right, walking the leaves.

a  b  c  d  [7]  c  g  h  i  b  c  [10] [10]

Every time a node number is output instead of a leaf character, enough of the
tree will already have been built that the node number points to a part of
the tree the decoder has seen.
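
Here is my own toy sketch of the whole encoding walk.  Node numbers are the
positions a node would occupy in the full tree, counted top-down and
left-to-right as in FIG 2, with a repeated subtree keeping the number of its
first occurrence:

```python
# Emit the leaves left to right, replacing an already-seen subtree by a
# back-reference to the node number of its first occurrence (FIG 2).
# Subtrees are identified by the substring they cover, which for this
# fixed layout is equivalent to structural equality.

def compress(data):                      # len(data) must be a power of two
    first = {}                           # substring -> number of first occurrence
    out = []

    def walk(lo, hi, node):
        key = data[lo:hi]
        if hi - lo == 1:
            out.append(key)              # single characters are always literal
        elif key in first:
            out.append("[%d]" % first[key])
        else:
            first[key] = node
            mid = (lo + hi) // 2
            walk(lo, mid, 2 * node + 1)  # children sit at heap positions
            walk(mid, hi, 2 * node + 2)

    walk(0, len(data), 0)
    return out

print(compress("abcdabcghibccgcg"))
# -> ['a', 'b', 'c', 'd', '[7]', 'c', 'g', 'h', 'i', 'b', 'c', '[10]', '[10]']
```

This reproduces the token stream above, and on a run of sixteen a's it gives
`a a [7] [3] [1]`, matching FIG 6 at the end of the article.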

(Node numbers could be output using escape codes, but it would probably be
better to use something like Huffman coding on an alphabet large enough to
encompass the tree.  Note that the most frequent codes, near the bottom of
the tree, have large numbers; a neat encoding scheme should undo this
effect.)

In real life, data will seldom come in powers of two; however, this is an
easy problem to solve: just assume an infinite sequence of some impossible
code following the last character.  For example, if our last example were a
little shorter:


 { FIG 3:
 {
 {                        0
 {                       /\
 {                      /  \
 {                     /    \
 {                    /      \
 {                   /        \
 {                  /          \
 {                 /            \
 {                /              \
 {               /                \
 {              /                  \
 {             /                    \
 {            1                      2
 {           /\                      /\
 {          /  \                    /  \
 {         /    \                  /    \
 {        /      \                /      \
 {       /        \              /        \
 {      3          4            5          6
 {     /\          /\          /\          /\
 {    /  \        /  \        /  \        /  \
 {   7    8     =7    10     11   12     /    \
 {  /\    /\          /\    /\    /\    /\    /\ 
 { a  b  c  d  a  b  c  g  h  i  b  c  c -1  -1 -1


Of course, you wouldn't output all the -1's -- the first one would be enough
to imply termination of the compressed input file when reading it back.
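
The padding trick is a one-liner; here -1 stands in for the impossible code
(my own sketch, assuming the data is a string or list of ordinary characters):

```python
# Pad the input up to the next power of two with an impossible sentinel
# code (-1 here); only the first sentinel needs to reach the output.

def pad_to_power_of_two(data, sentinel=-1):
    n = 1
    while n < len(data):
        n *= 2
    return list(data) + [sentinel] * (n - len(data))

padded = pad_to_power_of_two("abcdabcghibcc")   # the 13 chars of FIG 3
```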

The next problem is that a tree like this is not sensible for large files;
clearly we want some sort of windowing mechanism.  I suggest we pick some
maximum power of two (here we're using 16 for the sake of demonstration,
although a real program might use 16384).

Now we start building up our tree as before, but when we've filled it (to
the stage shown in FIG 2), we emit the codes for the left half and shift the
right tree branches to the left:


FIG 4:

Output: a b c d [7] c g

                         0
                        / \
                       /   .
                      /     .
                     /       .
                    /
                   /
                  /
                 /
                /
               /
              /
             2
            /\
           /  \
          /    \
         /      \
        /        \
       5          6
      /\          /\
     /  \        /  \
    11   12    =10  =10
   /\    /\              
  h  i  b  c  c  g  c  g


The next N/2 chars are read in and added to the DAG, sharing nodes as before.
Nodes in the surviving half which pointed to a node that has now been lost
must be renumbered (or, more precisely, a new master node should be created
for each such shared node):


FIG 5:

                         0
                        / \
                       /   .
                      /     .
                     /       .
                    /
                   /
                  /
                 /
                /
               /
              /
             1
            /\
           /  \
          /    \
         /      \
        /        \
       3          4
      /\          /\
     /  \        /  \
    7    8      9    =9
   /\    /\    /\                
  h  i  b  c  c  g  c  g


This process of double-buffering now repeats until the first -1 node is output.
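
The buffer handling alone can be sketched like this (my own simplification:
it shows only how each new window keeps the old right half as its new left
half, not the renumbering of surviving nodes):

```python
# Double-buffering sketch: yield each full window; after the left half's
# codes are emitted, the right half slides over to become the next
# window's left half (FIG 4 -> FIG 5).

def sliding_halves(stream, window=16):
    buf = []
    for ch in stream:
        buf.append(ch)
        if len(buf) == window:
            yield "".join(buf)
            buf = buf[window // 2:]      # keep the right half as context

windows = list(sliding_halves("abcdefghijklmnopqrstuvwx", window=8))
```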


Note that a run of N repeated characters is encoded in O(log N) tokens.
Whether this is enough, or whether we should pipe the input through a simple
run-length compacter before feeding it to this algorithm, can be determined
by experiment.


FIG 6:
                       0
                      /\
                     /  \
                    /    \
                   /      \
                  /        \
                 /          \
                /            \
               /              \
              /                \
             /                  \
            /                    \
           1                      =1
          /\
         /  \
        /    \
       /      \
      /        \
     3          =3
    /\           
   /  \
  7    =7
 /\  
a  a  a  a  a  a  a  a  a  a  a  a  a  a  a  a


Outputs: a a [7] [3] [1]
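
For completeness, a sketch of a matching decoder for token streams like the
one above (my own code; node numbers are full-tree positions as in FIG 2, and
"[" is assumed never to occur as a literal character).  On a back-reference
the decoder descends until the current subtree size equals the length of the
referenced span, then copies it:

```python
# Decode a token stream such as  a a [7] [3] [1]  back into the data.
# The decoder walks the same implied full tree; on a back-reference it
# descends until the current subtree size equals the referenced span's
# length, then copies that span.

def decompress(tokens, n):               # n = padded length, a power of two
    pos = 0
    spans = {}                           # node number -> decoded substring

    def walk(lo, hi, node):
        nonlocal pos
        tok = tokens[pos]
        if hi - lo == 1:
            s = tok                      # literal character
            pos += 1
        elif tok.startswith("[") and len(spans[int(tok[1:-1])]) == hi - lo:
            s = spans[int(tok[1:-1])]    # back-reference covers this subtree
            pos += 1
        else:
            mid = (lo + hi) // 2
            s = walk(lo, mid, 2 * node + 1) + walk(mid, hi, 2 * node + 2)
        spans.setdefault(node, s)        # first occurrence defines the span
        return s

    return walk(0, n, 0)

print(decompress(["a", "a", "[7]", "[3]", "[1]"], 16))
# -> aaaaaaaaaaaaaaaa
```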


Happy hacking.

Graham Toal <gtoal@ed.ac.uk>

(Don't mail me comments; post instead please (to alt.sources.d))


pgut1@iti.org (Peter Claus Gutmann) (09/28/90)

In <986@tharr.UUCP> gtoal@tharr.UUCP (Graham Toal) writes:

>These are my thoughts on a new compression algorithm to replace Lempel-
>Ziv-Welch.*  They are very preliminary, and I haven't coded any of it. I throw
>it out for comment, and if anyone wants to modify it and/or code it up,
>please feel free to do so.
>(* if it becomes necessary because of the software patent issue)

[Long description of compression algorithm]

.................

>(Node numbers could be output using escape codes, but it would probably be
>better to use something like huffman on an alphabet size which is large
>enough to encompass the tree.  Note that the most frequent codes, near the
>bottom of the tree, have large numbers.  A neat encoding scheme should
>undo this effect.)


>The next problem is that a tree like this is not sensible for large files;
>clearly we want some sort of windowing mechanism. I suggest we pick some
>maximum power of two (here we're using 16 for the sake of demonstration,
>although a real program might use 16384)

.................

>Happy hacking.

>Graham Toal <gtoal@ed.ac.uk>


Algorithms to do this type of compression already exist.  Lempel and Ziv
published two compression schemes, generally referred to as LZ77 and LZ78
(after the years in which they were published in the IEEE Transactions on
Information Theory).  LZW is based on the LZ78 scheme; the algorithm you
propose is more like their LZ77 scheme.

The LZ77 algorithm is used by PKZIP and LHARC as the "front-end" for their
compression.  LHARC then uses adaptive Huffman coding to compress the output
of the LZ compressor; PKZIP uses Shannon-Fano coding.  The LZ77-based
compressors generally work by using an 8K window with various data structures
to handle the maximum-length string matching.  I've written a compressor
which seems to achieve much better results with a 16K window, at the expense
of being a bit slower.

There is also a very fast compression algorithm invented by Ed Fiala and Dan
Greene (published in Communications of the ACM, Vol.32, No.4 (April 1989),
p.490) which is a sort of hybrid between LZ77 and LZ78, and which uses a very
fancy data structure to build a DAG (much like you are proposing).  The
problem with this algorithm is that, like LZW, it is also patented.  The
LZ77 compressors used in PKZIP and LHARC are based on the LZSS compression
scheme, which is described fairly well by Tim Bell in the IEEE Trans. on
Communications, Vol.34, No.12 (Dec.1986), p.1176.....
waffle burble waffle (this is one of my pet topics :-)
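
For readers who haven't met it, a naive LZ77-style matcher looks something
like this (my own toy code; real implementations use the large windows and
clever match-finding structures mentioned above):

```python
# Toy LZ77: at each position, search the window for the longest earlier
# match; emit (offset, length) for matches, literals otherwise.
# Overlapping matches (offset < length) are allowed, as in real LZ77.

def lz77(data, window=16):
    i, out = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= 2:                # a match shorter than 2 isn't worth it
            out.append((best_off, best_len))
            i += best_len
        else:
            out.append(data[i])
            i += 1
    return out

print(lz77("abcabcabc"))
# -> ['a', 'b', 'c', (3, 6)]
```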

Peter.

--
[In order of reliability] Peter_Gutmann@kcbbs.gen.nz or peter@nacjack.gen.nz
			or pgut1@compsc.aukuni.ac.nz
Disclaimer:  The contents of this message are irrelevant.  It serves merely
	     as a covert channel for a data leak to The Commies.