ross@spam.ua.oz.au (Ross Williams) (06/03/91)
INITIAL REQUEST:

>I am looking for an algorithm that can compress keys without destroying
>lexigraphic ordering. A reply to the effect of "this is impossible" is
>...
>be useful. To wit, I have a very large number of variable length keys
>from 1 to 64 bytes in length, and I have got to get the size down for
>sorting....

JONES'S REPLY:

>I'm at home and can't check my references, but the compression technique
>that you might want is called the Garcia Wachs (or something like that)
>In sum, the idea is to use a prefix code, as with Huffman Codes, but
>to balance the encoding tree without reordering the leaves. This can
>be done with compression efficiency that is no worse than one bit per
>character worse than the efficiency of a Huffman code (which is no worse
>than one bit per character worse than the source entropy under the source
>model used for compression).

Yes, this is a good solution (and the one I first thought of), but I think
there is an even better one, as Garsia-Wachs (or whatever it is called) has
unbounded relative inefficiency in nasty worst cases such as p(A)=0.01,
p(B)=0.98, p(C)=0.01: the alphabetic constraint forces the middle symbol B
onto a two-bit code even though it carries almost no information (although
for English text it will probably be OK).

Isn't this a PERFECT application for arithmetic coding? Just count the
number of times that each character occurs in the entire database and then
use the frequencies to construct a fixed, zero-order model. Then use the
model to code each record into a variable-length string by coding each
character in succession using arithmetic coding. So long as the symbols are
allocated at each coding step to the available range IN ORDER, lexicographic
ordering will be preserved in the compressed data (a small sketch of this
coding step appears at the end of this post).

In fact the model could even be a higher-order ADAPTIVE model (although in
this case the strings are probably not long enough to make it worthwhile).
All that matters is that the arithmetic code range is divided up among the
symbols IN ORDER. The result is that the requirement to preserve
lexicographic order does not affect the compression possible at all.

I suggest that you generate a static third-order Markov model from your data
and then compress each record using arithmetic coding as described above. If
your records are English text, this should get each record down to about one
third of its previous size.

Arithmetic coding is sometimes criticised for being slow (which it can be).
In this case, I suspect that record size will be far more important than
speed of compression, as you said it is a big database and the whole lot has
to be sorted.

A good description of arithmetic coding can be found in:

Witten, I.H., Neal, R.M., and Cleary, J.G., "Arithmetic Coding for Data
Compression", Communications of the ACM, Vol. 30, No. 6 (June 1987),
pp. 520-540.

Ross Williams
ross@spam.ua.oz.au
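
For concreteness, here is a minimal sketch of the coding step described
above: a static zero-order model built from byte counts over the whole
database, with each symbol's sub-range of the coding interval allocated in
symbol order so that the compressed strings sort exactly as the originals
do. It is written in Python; the names build_model, encode and the EOS
sentinel are illustrative choices only, and exact Fraction arithmetic stands
in for the fixed-precision incremental coder described by Witten, Neal and
Cleary.

import math
from collections import Counter
from fractions import Fraction

# End-of-record symbol.  It must sort BELOW every real byte value so that a
# key which is a prefix of another key still compresses to a smaller code.
EOS = -1

def build_model(records):
    """Count byte frequencies over the whole database and assign each symbol
    a sub-range [lo, hi) of [0, 1), allocated IN symbol ORDER."""
    counts = Counter()
    for rec in records:
        counts.update(rec)          # iterating bytes yields ints 0..255
        counts[EOS] += 1
    total = sum(counts.values())
    ranges, cum = {}, 0
    for sym in sorted(counts):      # EOS first, then byte values in order
        ranges[sym] = (Fraction(cum, total), Fraction(cum + counts[sym], total))
        cum += counts[sym]
    return ranges

def encode(rec, ranges):
    """Arithmetic-code one record into a '0'/'1' string whose lexicographic
    order agrees with that of the uncompressed record."""
    low, high = Fraction(0), Fraction(1)
    for sym in list(rec) + [EOS]:
        r_lo, r_hi = ranges[sym]
        width = high - low
        low, high = low + width * r_lo, low + width * r_hi
    # Emit the shortest binary fraction lying inside [low, high); in practice
    # the bits would be packed into bytes.
    k = 1
    while True:
        num = math.ceil(low * 2**k)
        if Fraction(num, 2**k) < high:
            return format(num, '0%db' % k)
        k += 1

if __name__ == '__main__':
    records = [b'banana', b'band', b'ban', b'apple', b'bandana']
    ranges = build_model(records)
    codes = {rec: encode(rec, ranges) for rec in records}
    # Sorting by compressed code gives the same order as sorting the records.
    assert sorted(records) == sorted(records, key=lambda r: codes[r])

The one design point peculiar to order preservation is that the end-of-record
symbol is placed first in the symbol ordering, so that a key which is a
prefix of a longer key still codes to a smaller string. Everything else is
ordinary arithmetic coding, and the same in-order range allocation carries
over unchanged to the static third-order model suggested above.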