[comp.compression] Preserving lexicographic order under compression.

ross@spam.ua.oz.au (Ross Williams) (06/03/91)

INITIAL REQUEST:
>I am looking for an algorithm that can compress keys without destroying
>lexicographic ordering.  A reply to the effect of "this is impossible" is
>...
>be useful.  To wit, I have a very large number of variable length keys
>from 1 to 64 bytes in length, and I have got to get the size down for
>sorting....

JONES'S REPLY:
>I'm at home and can't check my references, but the compression technique
>that you might want is called the Garcia Wachs algorithm (or something like that).
>In sum, the idea is to use a prefix code, as with Huffman Codes, but
>to balance the encoding tree without reordering the leaves.  This can
>be done with compression efficiency that is no worse than one bit per
>character worse than the efficiency of a Huffman code (which is no worse
>than one bit per character worse than the source entropy under the source
>model used for compression).

Yes, this is a good solution (and the one I first thought of), but I
think there is an even better one. Garsia-Wachs (or whatever it is
called), like any prefix code, must spend at least one whole bit on
every symbol, so its inefficiency relative to the entropy is unbounded
in nasty worst cases such as p(A)=0.01, p(B)=0.98, p(C)=0.01 (although
for English text it will probably be OK).
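
To put numbers on that worst case, here is a quick sketch in Python
(the code lengths below are simply the ones forced by the two possible
three-leaf tree shapes, not the output of any particular algorithm):

    from math import log2

    # The skewed three-symbol source from the example above.
    p = {'A': 0.01, 'B': 0.98, 'C': 0.01}

    # Entropy: the best any coder can average, in bits per symbol.
    entropy = sum(-q * log2(q) for q in p.values())

    # An order-preserving (alphabetic) code must keep the leaves in the
    # order A, B, C, so B sits at depth 2 in either tree shape; the best
    # such code has lengths A=1, B=2, C=2.
    alphabetic = 0.01 * 1 + 0.98 * 2 + 0.01 * 2

    # A Huffman code may reorder the leaves and gives B a 1-bit code.
    huffman = 0.01 * 2 + 0.98 * 1 + 0.01 * 2

    print(entropy)     # about 0.16 bits/symbol
    print(huffman)     # 1.02 bits/symbol
    print(alphabetic)  # 1.99 bits/symbol -- more than ten times the entropy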

Isn't this a PERFECT application for arithmetic coding?

Just count the number of times that each character occurs in the
entire database and then use the frequencies to construct a fixed,
zero-order model. Then use the model to code each record into a
variable-length string by coding each character in succession using
arithmetic coding. So long as the symbols are allocated at each coding
step to the available range IN ORDER, lexicographic ordering will be
preserved in the compressed data.
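
Roughly, the encoder looks like this (a sketch in Python with a
made-up alphabet and counts; a real coder would use fixed-precision
integer arithmetic with renormalisation rather than exact fractions,
and would append a low-sorting end-of-record symbol so that a record
sorts before its extensions):

    from fractions import Fraction

    # Made-up character counts standing in for the counts taken over the
    # whole database (a fixed, zero-order model).
    FREQS = {'a': 5, 'b': 3, 'c': 2}
    SYMBOLS = sorted(FREQS)             # sub-ranges are handed out IN ORDER
    TOTAL = sum(FREQS.values())

    CUM = {}                            # total count of all smaller symbols
    running = 0
    for s in SYMBOLS:
        CUM[s] = running
        running += FREQS[s]

    def encode(record):
        """Return the coding interval [low, low + width) for one record.

        Each character narrows the interval to the sub-interval that
        belongs to it, and because the sub-intervals are laid out in
        alphabetical order, a lexicographically smaller record always
        gets an interval that starts no later than that of a larger one.
        """
        low, width = Fraction(0), Fraction(1)
        for ch in record:
            low += width * Fraction(CUM[ch], TOTAL)
            width *= Fraction(FREQS[ch], TOTAL)
        return low, width

    # The interval lower bounds come out in the same order as the records.
    for rec in sorted(['ab', 'abc', 'ba', 'ca']):
        print(rec, float(encode(rec)[0]))

About -log2(width) bits of any number inside a record's interval are
enough to identify it, and that bit string is the compressed key.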

In fact the model could even be a higher-order ADAPTIVE model
(although in this case the strings are probably not long enough to
make it worthwhile). All that matters is that the arithmetic code
range is divided up among the symbols IN ORDER.
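
The only change for an adaptive model is that the counts table becomes
a function of the context and is updated as coding proceeds, e.g. a
hypothetical order-1 variant of the sketch above (counts invented for
illustration):

    from fractions import Fraction
    from collections import defaultdict

    ALPHABET = 'abc'                   # toy alphabet, as before

    def adaptive_encode(record, order=1):
        """Interval narrowing as before, but the counts depend on the
        previous `order` characters and are bumped after each step.
        Order is still preserved because every step hands out its
        sub-ranges to the alphabet IN ORDER."""
        counts = defaultdict(lambda: {c: 1 for c in ALPHABET})  # start flat
        low, width = Fraction(0), Fraction(1)
        context = ''
        for ch in record:
            table = counts[context]
            total = sum(table.values())
            below = sum(table[c] for c in ALPHABET if c < ch)
            low += width * Fraction(below, total)
            width *= Fraction(table[ch], total)
            table[ch] += 1                                      # adapt
            context = (context + ch)[-order:]
        return low, width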

The result is that the requirement to preserve lexicographic order
does not reduce the achievable compression at all. I suggest that you
generate a static third-order Markov model from your data and then
compress each record using arithmetic coding as described above. If
your records are English text, this should get each record down to
about one third of its previous size.
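
Building the static model is just one counting pass over the database;
something along these lines (again only a sketch, with a hypothetical
count_contexts helper):

    from collections import defaultdict

    def count_contexts(records, order=3):
        """For each string of up to `order` preceding characters, count
        which character follows it.  The resulting tables drive the same
        in-order range allocation shown earlier."""
        counts = defaultdict(lambda: defaultdict(int))
        for rec in records:
            context = ''
            for ch in rec:
                counts[context][ch] += 1
                context = (context + ch)[-order:]
        return counts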

Arithmetic coding is sometimes criticised for being slow (which it can
be). In this case, I suspect that record size will be far more
important than speed of compression, as you said it was a big database
and the whole lot has to be sorted.

A good description of arithmetic coding can be found in:
Witten, I.H., Neal, R.M. and Cleary, J.G., "Arithmetic Coding for Data
Compression", Communications of the ACM, Vol. 30, No. 6 (June 1987),
pp. 520-540.

Ross Williams
ross@spam.ua.oz.au