[comp.databases] In Search Of a key compression algorithm!

jerryw@tove.cs.umd.edu (Jerry Wieber) (05/30/91)

I am looking for an algorithm that can compress keys without destroying
lexigraphic ordering.  A reply to the effect of "this is impossible" is
equally helpful, of course.  Any compression, no matter how small, may
be useful.  To wit, I have a very large number of variable length keys
from 1 to 64 bytes in length, and I have got to get the size down for
sorting....


All replies gratefully appreciated!
-Jerry
--
                                                            __________ 
 UUCP:     uunet!cs.umd.edu!jerryw   SPOKEN: Jerry Wieber   |/ `-.    |  U of Md
 INTERNET: jerryw@cs.umd.edu              "Disclaimer"            \_|.|-,  
                                                                     ` -

buckland@cheddar.ucs.ubc.ca (Tony Buckland) (05/30/91)

In article <35014@mimsy.umd.edu> jerryw@tove.cs.umd.edu (Jerry Wieber) writes:
>I am looking for an algorithm that can compress keys without destroying
>lexigraphic ordering.  A reply to the effect of "this is impossible" is
>equally helpful, of course.  Any compression, no matter how small, may
>be useful. ...

 If you can specify that only a fixed, small character set is
 used in the keys, usually the case since keys tend to be
 alphabetic plus a little punctuation, translate each key byte to
 a 1-byte integer representing the ordering of the characters you
 use (e.g. 00 = blank, 01 = a, ...), figure out the minimum number
 of bits you need to express those integers (5 might do it if you
 want to disregard case, so that a=A, b=B, ...), and compress
 n bytes of key to n*number of bits.  This will get you 3/8 or
 possibly 2/8 shrinkage in the key length, with no loss of
 ordering.

steve@Pkg.Mcc.COM (Steve Madere) (06/01/91)

In article <35014@mimsy.umd.edu>, jerryw@tove.cs.umd.edu (Jerry Wieber) writes:
| I am looking for an algorithm that can compress keys without destroying
| lexigraphic ordering.  A reply to the effect of "this is impossible" is
| equally helpful, of course.  Any compression, no matter how small, may
| be useful.  To wit, I have a very large number of variable length keys
| from 1 to 64 bytes in length, and I have got to get the size down for
| sorting....
| 
| 
| All replies gratefully appreciated!
| -Jerry
| --
|                                                             __________ 
|  UUCP:     uunet!cs.umd.edu!jerryw   SPOKEN: Jerry Wieber   |/ `-.   
|  U of Md
|  INTERNET: jerryw@cs.umd.edu              "Disclaimer"           
\_|.|-,  
|                                                                      `
- 

If you have a truly large set, you will probably have lots of sets
of adjacent keys in which they share alot of common leading symbols.

American Airlines
American Antiques
American Antlers
Amerigo Vespucci

etc.

What you can do is to devote the first byte of each key to count
the number of leading characters in common with the previous key
so that the above list would become.

American Airlines
'\0x0a'ntiques
'\0x0c'lers
'\0x05'go Vespucci

This saves 24 chars out of 40.

This can only be used to store them in the final list however.
During sorting they will obviously have to be represented in their
full sized form.

I can't think of a good search technique through such a sorted
list for the moment but I'm sure you can if you spend a few moments
to think about it.

I realize this is a pretty braindead way to compact a key list
but it would in certain cases afford significant storage savings.

jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) (06/03/91)

> From: jerryw@tove.cs.umd.edu (Jerry Wieber)
> Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
> 
> I am looking for an algorithm that can compress keys without destroying
> lexigraphic ordering.

I'm at home and can't check my references, but the compression technique
that you might want is called the Garcia Wachs (or something like that)
algorithm.  I have a reprint of a paper by Jeff Kingston that discusses
this algorithm.

In sum, the idea is to use a prefix code, as with Huffman Codes, but
to balance the encoding tree without reordering the leaves.  This can
be done with compression efficiency that is no worse than one bit per
character worse than the efficiency of a Huffman code (which is no worse
than one bit per character worse than the source entropy under the source
model used for compression).
					Doug Jones
					jones@cs.uiowa.edu

radford@ai.toronto.edu (Radford Neal) (06/03/91)

>From: jerryw@tove.cs.umd.edu (Jerry Wieber)
>Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
>
>I am looking for an algorithm that can compress keys without destroying
>lexigraphic ordering.  A reply to the effect of "this is impossible" is
>equally helpful, of course.  Any compression, no matter how small, may
>be useful.  To wit, I have a very large number of variable length keys
>from 1 to 64 bytes in length, and I have got to get the size down for
>sorting....


Arithmetic coding should work fine for this. Presumably, one would 
use a fixed model, but one might want to pay attention to context. 

All that is required to preserve lexigraphic order is that the mapping
from letters to numeric regions preserve the alphabetic ordering. This
is easily arrainged, perhaps at a slight computational cost.

Assuming that these keys are something like peoples' names, compression
down to about 4 bits per letter should be readily achievable. I would
expect that compression to less than 3 bits per letter might well be
possible.

    Radford Neal

pgh@cs.brown.edu (Paul Howard) (06/03/91)

Knuth Volume III (Sorting and Searching) describes the Hu-Tucker
algorithm, a prefix code that preserves lex order.  This may be
similar to the Garcia Wachs algorithm mentioned by Doug Jones.  I
would just use arithmetic coding.

-- Paul

--

=============================================================

Paul G. Howard                        email: pgh@cs.brown.edu
Department of Computer Science 	      phone: (401) 863-7672
Brown University               	      FAX:   (401) 863-7657
Providence RI 02912-1910       	      ham:   KA1YNE