[comp.compression] fast string compr.?

laca@vipunen.hut.fi (Laszlo C Balint) (05/20/91)

I am looking for an algorithm to compress and decompress strings of
variable length. The compressed sequence can be fixed length too, but
the process should be effective and reversible. I need it for a spell
checker, where the number of words might be huge, but they all have
to fit in memory because of frequent access (~20 lookups per checked
word).

Any help appreciated, especially if a C(++) code :-)

Thanks in advance,

laca
--
Laszlo C. Balint      ( Hungarian on the go )      laca@vipunen.hut.fi
 ------------------------------------------------------------------
| ....That's all folks. Flames, if you must, by mail please. They're |
| easier to ignore that way.     (Tony Lezard, with(out) permission) |

brad@looking.on.ca (Brad Templeton) (05/24/91)

Data compression is not normally used for spell checkers.

Instead, most spell checkers use hashing.   Create an array of around
B bits (around 600,000 is good and easy) and for each word in your dictionary,
apply N hashing functions that map from the word to a number from 1 to B.
Set the bit of that number. 

To see if a word is in the dictionary, apply the N functions to the word
and check that all of those bits are set.  If they are, the word is
accepted as valid; otherwise it is definitely not in the dictionary.
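
In rough C the whole thing is just a bit table and a couple of loops;
something like this (the hash here is only a placeholder -- use any decent
string hashes -- and the names are mine):

#include <limits.h>

#define B 600000L                     /* size of the bit table in bits     */
#define N 5                           /* number of hash functions          */

static unsigned char table[(B + CHAR_BIT - 1) / CHAR_BIT];

/* one cheap string hash, seeded differently for each of the N uses */
static unsigned long hash(const char *w, unsigned long seed)
{
    unsigned long h = seed;
    while (*w)
        h = h * 31 + (unsigned char)*w++;
    return h % B;
}

void add_word(const char *w)          /* dictionary build phase            */
{
    int i;
    for (i = 0; i < N; i++) {
        unsigned long bit = hash(w, i * 2654435761UL + 1);
        table[bit / CHAR_BIT] |= 1 << (bit % CHAR_BIT);
    }
}

int maybe_a_word(const char *w)       /* lookup phase                      */
{
    int i;
    for (i = 0; i < N; i++) {
        unsigned long bit = hash(w, i * 2654435761UL + 1);
        if (!(table[bit / CHAR_BIT] & (1 << (bit % CHAR_BIT))))
            return 0;                 /* some bit clear: definitely not in */
    }
    return 1;                         /* all bits set: almost surely a word */
}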

Note that this will say that some bad words are correct, but it never says
that a valid word is not found.  The probability of accepting a bad word
can be reduced until it is negligible.

If you have a 30,000 word dictionary and 5 functions, then ~1/4 of
the bits are set.  The probability of a bad word getting a yes is .25 ^ 5,
or about 0.1%.  There is a formula that works out the optimal value of
N for a given B and dictionary size.
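
The formula is the standard one for this kind of bit table: the fraction of
set bits is about 1 - e^(-ND/B), a bad word passes with probability that
raised to the Nth power, and the best N is (B/D)*ln 2.  A throwaway program
(constants matching the example above) makes the arithmetic concrete; the
exact numbers come out a touch better than the rough quarter figure:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double B = 600000.0;    /* bits in the table                        */
    double D = 30000.0;     /* number of words in the dictionary        */
    double N = 5.0;         /* hash functions actually used             */

    /* expected fraction of bits set after inserting D words */
    double p = 1.0 - exp(-N * D / B);        /* about 0.22 here         */

    /* chance that a random non-word passes all N tests */
    double false_ok = pow(p, N);             /* about 0.05% here        */

    /* the optimal number of hash functions for this B and D */
    double best_n = (B / D) * log(2.0);      /* about 13.9 here         */

    printf("bits set %.2f, false accept %.4f%%, optimal N %.1f\n",
           p, false_ok * 100.0, best_n);
    return 0;
}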

If you are really worried, take common misspellings and special case them.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

bill@franklin.com (bill) (05/25/91)

In article <1991May23.235530.11443@looking.on.ca>
	brad@looking.on.ca (Brad Templeton) writes:
: Data compression is not normally used for spell checkers.

Hokum. The data that goes into, for example, one of our Italian
spelling checkers is around 13M of text. It is compressed into
around 64K, reversibly. I think that can be called data
compression.

: Instead, most spell checkers use hashing.

No, Brad, they do not. As one of the developers of the most
widely used OEM spelling checker, and having had to take a good
look at our competitors at various times, I can say this with
authority.

Furthermore, hash techniques were generally discarded over nine
years ago because they accept way too many spelling errors. And,
no, you cannot make them acceptable by hard coding the common
spelling mistakes they fail on. No matter how low the probability
of false acceptance is made, it is still high enough to make the
customers wince.

*Please*, people, do not use this algorithm for a spelling
checker. It has its uses, but this is not one of them.

There was an article in a recent DDJ, if one is interested in
implementation details of the hash algorithm that Brad described.
I don't recall the exact title, but it had the word "existence" in
it. (That's an example of brain damage if ever I saw one; this
technique is useful for testing for the *nonexistence* of
something, or for screening for the *possible* existence of
something.)

BTW, the *correct* answer to the question of the "normal" way of
storing the lexicon is this: store the lexicon as a trie. This is
easily the best method for heavily inflected languages like
Finnish, and is used by most of the big players (a bare-bones sketch
follows at the end of this post).  The other method is to store the
dictionary on disk in sorted order, possibly with an in-memory index
and/or a small cache of frequently used words to avoid most disk
lookups.  This, too, typically involves a
rudimentary kind of data compression (dictionary style), in that
groups of inflected words are often stored as a root and
inflection information.
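
For the original poster, the shape of a trie lookup in C is roughly this --
a deliberately naive sketch with made-up names; real lexicons pack the
nodes far more tightly and usually share common suffixes as well:

#include <stdlib.h>

#define ALPHABET 26                 /* 'a'..'z' only, to keep the sketch small */

struct node {
    struct node *child[ALPHABET];
    int          is_word;           /* nonzero if a dictionary word ends here  */
};

static struct node *new_node(void)
{
    return calloc(1, sizeof(struct node));    /* error handling omitted */
}

void insert(struct node *root, const char *w)
{
    while (*w) {
        int c = *w++ - 'a';
        if (root->child[c] == NULL)
            root->child[c] = new_node();
        root = root->child[c];
    }
    root->is_word = 1;
}

int lookup(const struct node *root, const char *w)
{
    while (*w) {
        int c = *w++ - 'a';
        if (root->child[c] == NULL)
            return 0;               /* no branch for this letter: not a word */
        root = root->child[c];
    }
    return root->is_word;
}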

brad@looking.on.ca (Brad Templeton) (05/25/91)

I apologize if I was a bit behind the times.  The spelling checkers I use
are both of the hash variety, and I have little trouble with them.  Hashing
variants run into danger when suggesting corrections, because you look up a
couple of hundred permutations in the dictionary, which makes it much more
likely you will hit a false acceptance; but I still think the error rate can
be kept below a reasonable threshold.

Of course you are right that it only takes one user who doesn't understand
this to cause nasty complaints...
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

cjkuo@locus.com (Chengi Jimmy Kuo) (05/26/91)

laca@vipunen.hut.fi (Laszlo C Balint) writes:

>I am looking for an algorithm to compress and decompress strings of
>variable length. The compressed sequence can be fixed length too, but
>the process should be effective and reversible. I need it for a spell
>checker, where the number of words might be huge, but they all have
>to fit in memory because of frequent access (~20 lookups per checked
>word).

>Any help appreciated, especially if a C(++) code :-)

Sorry, no C++, but...

Compress the dictionary by having a table for the first 2 or 3 letters
(26x26, or 26x26x26, but actually much less since there are no qbX, etc.).
For the words, once you're past the first 3 letters, start compressing by
using "repeat deltas".  A word is composed of N letters that repeat from the
previous word, followed by the delta.  You can compress the delta portion
using whatever method you want since it's straight alphabet letters (but I
choose not to).  Compression may not be the best, but data access is pretty
fast since decompression is trivial (also, strcmp is partially completed
due to this encoding).
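
A rough sketch of the delta part in C, off the top of my head (the names
and the exact byte format are invented here, and I'm leaving out the
first-letters table):

#include <stdio.h>
#include <string.h>

/* Write a sorted word list as "repeat delta" records: one byte for how
   many leading letters repeat from the previous word, then the differing
   tail, then a newline. */
void encode(char **words, int nwords, FILE *out)
{
    const char *prev = "";
    int i, r;

    for (i = 0; i < nwords; i++) {
        for (r = 0; prev[r] && prev[r] == words[i][r]; r++)
            ;
        putc(r, out);                    /* repeat count               */
        fputs(words[i] + r, out);        /* the delta                  */
        putc('\n', out);
        prev = words[i];
    }
}

/* Trivial decompression: splice each tail onto the shared prefix of the
   previous word and compare.  (rewind(in) before each search.)        */
int lookup(FILE *in, const char *target)
{
    char word[64] = "", tail[64];
    int  repeat, cmp;

    while ((repeat = getc(in)) != EOF) {
        if (!fgets(tail, sizeof tail, in))
            break;
        tail[strcspn(tail, "\n")] = '\0';
        strcpy(word + repeat, tail);     /* rebuild the full word      */
        cmp = strcmp(word, target);
        if (cmp == 0)
            return 1;                    /* found it                   */
        if (cmp > 0)
            return 0;                    /* sorted list: we passed it  */
    }
    return 0;
}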

Jimmy Kuo
-- 
cjkuo@locus.com
"The correct answer to an either/or question is both!"

gtoal@castle.ed.ac.uk (G Toal) (05/26/91)

In article <1991May23.235530.11443@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
>Data compression is not normally used for spell checkers.
Actually it is.

>Instead, most spell checkers use hashing.   Create an array of around
>B bits (around 600,000 is good and easy) and for each word in your dictionary,
>apply N hashing functions that map from the word to a number from 1 to B.
>Set the bit of that number. 
>
>To see if a word is in the dictionary, apply the N functions to the word
>and check that all of those bits are set.  If they are, the word is
>accepted as valid; otherwise it is definitely not in the dictionary.
'fraid Bloom filters went out of fashion years ago -- they're particularly
bad at typo correction.  Although the data compression seems good (you get
a reasonably good probability of detecting errors using about 20 bits
per word), it is lossy, whereas a DAWG (directed acyclic word graph) comes
in at about the same size -- and is just as fast, if not faster -- and is
lossless.  It is also very good for error correction.
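
To give a flavour of why DAWG lookups are cheap, here is roughly what a
lookup over one common packed-node layout looks like -- a generic
illustration only, not the exact node format my posted code uses:

/* One node of a packed DAWG: a letter, an end-of-word flag, an
   end-of-sibling-list flag, and the index of its first child.
   (Layout invented for the illustration.) */
struct dawg_node {
    char     letter;
    unsigned is_word : 1;           /* a word may end on this node          */
    unsigned is_last : 1;           /* last entry in this sibling list      */
    long     child;                 /* index of first child, 0 if none      */
};

/* Walk the sibling lists, one list per letter position. */
int dawg_lookup(const struct dawg_node *dawg, long node, const char *w)
{
    while (*w) {
        while (dawg[node].letter != *w) {   /* scan siblings for the letter */
            if (dawg[node].is_last)
                return 0;                   /* letter absent: not a word    */
            node++;
        }
        w++;
        if (*w == '\0')
            return dawg[node].is_word;
        node = dawg[node].child;
        if (node == 0)
            return 0;                       /* word is longer than any path */
    }
    return 0;                               /* empty string                 */
}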

If the original poster wants some old C code, I posted a set of spelling
utilities to alt.sources about two years back.  Look for spellutils/* or
dawgutils/* - or mail me for a copy.

Regards
   Graham

bill@franklin.com (bill) (05/27/91)

In article <1991May25.235703.159943@locus.com> cjkuo@locus.com
	(Chengi Jimmy Kuo) writes:

(Are you the Jimmy Kuo who used to work at Proximity?)

: Compress the dictionary by having a table for the first 2 or 3 letters
: (26x26, or 26x26x26, but actually much less since there are no qbX, etc.).
: For the words, once you're past the first 3 letters, start compressing by
: using "repeat deltas".  A word is composed of N letters that repeat from the
: previous word, followed by the delta.  You can compress the delta portion
: using whatever method you want since it's straight alphabet letters (but I
: choose not to).  Compression may not be the best, but data access is pretty
: fast since decompression is trivial (also, strcmp is partially completed
: due to this encoding).

The problem is that you have to decompress, on average, half a
bin, and if you have a lot of words, or lots of inflections, this
will be awfully slow. The thing that made this kind of compression
work for us (at Proximity) was that not only did we do the above,
but we assigned special codes that meant "include a table of
endings here". That and Huffman compression, with multiple models.

This was all very ad hoc, but got some very good compression. If
you can find it, the old Word Challenge game (which the Jimmy Kuo
of Proximity did the first version of and I did the version you
are likely to find), had a database compressed this way. We got
80K+ words and the program itself onto one old Apple II disk, the
140K one. If my memory serves me, we got just slightly over one
byte per word.

cjkuo@locus.com (Chengi Jimmy Kuo) (05/27/91)

bill@franklin.com (bill) writes:

>In article <1991May25.235703.159943@locus.com> cjkuo@locus.com
>	(Chengi Jimmy Kuo) writes:

>(Are you the Jimmy Kuo who used to work at Proximity?)

Yes.  Hi Bill.  Wow, it's been 7 1/2 years!  (Sorry everybody for the
reminiscing.)  Say hi to everyone for me, will ya?

>: Compress the dictionary by having a table for the first 2 or 3 letters
>: (26x26, or 26x26x26, but actually much less since there are no qbX, etc.).
>: For the words, once you're past the first 3 letters, start compressing by
>: using "repeat deltas".  A word is composed of N letters that repeat from the
>: previous word, followed by the delta.  You can compress the delta portion
>: using whatever method you want since it's straight alphabet letters (but I
>: choose not to).  Compression may not be the best, but data access is pretty
>: fast since decompression is trivial (also, strcmp is partially completed
>: due to this encoding).

>The problem is that you have to decompress, on average, half a
>bin, and if you have a lot of words, or lots of inflections, this
>will be awfully slow. The thing that made this kind of compression
>work for us (at Proximity) was that not only did we do the above,
>but we assigned special codes that meant "include a table of
>endings here". That and Huffman compression, with multiple models.

You're talking about the full compression scheme for the full dictionary.
The algorithm I described above is just a subset of that.  It's
not the same one that was used for Word Challenge (below).  The one I
described is the one I used for a simple spell-check demo.  I used it for
the in-core cache of words.  Decompression was real fast because you could
skip all the words with a repeat count larger than the prefix you've matched
so far, and other neat things happen when you start playing with this.
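
In case the skipping trick isn't obvious, here is the search in rough C
(the record struct is just for the sketch, not the packed form I actually
used, and it assumes a sorted list whose first record has a repeat of 0):

#include <string.h>

/* One decoded record of a sorted, front-coded list: 'repeat' leading
   characters carried over from the previous word, then the new tail. */
struct rec {
    int         repeat;
    const char *tail;
};

int search(const struct rec *recs, int nrecs, const char *target)
{
    char word[64] = "";
    int  matched = 0;   /* how many leading chars of 'target' match the
                           previous word in the list                      */
    int  i;

    for (i = 0; i < nrecs; i++) {
        if (i > 0 && recs[i].repeat > matched)
            continue;   /* this word agrees with its predecessor past the
                           point where 'target' already differs, so it
                           cannot be 'target': skip it without rebuilding */

        strcpy(word + recs[i].repeat, recs[i].tail);  /* rebuild the word */

        matched = 0;    /* compare, remembering how far the prefixes agree */
        while (word[matched] && word[matched] == target[matched])
            matched++;

        if (word[matched] == target[matched])
            return 1;   /* both strings ended together: found it          */
        if ((unsigned char)word[matched] > (unsigned char)target[matched])
            return 0;   /* sorted list and we have gone past the target   */
    }
    return 0;
}

A skipped record costs one integer compare and no strcpy or strcmp at all,
which is where the speed comes from.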

And all this, I understand, is way out of date as far as Franklin/Proximity
is concerned, now that you guys have encoded phonetics and a thesaurus and
all that good stuff as well.

>This was all very ad hoc, but got some very good compression. If
>you can find it, the old Word Challenge game (which the Jimmy Kuo
>of Proximity did the first version of and I did the version you
>are likely to find), had a database compressed this way. We got
>80K+ words and the program itself onto one old Apple II disk, the
>140K one. If my memory serves me, we got just slightly over one
>byte per word.

Ah, Word Challenge, the game that never caught on.  The thing that
was so wonderful about that was Clean Disk.  Punch a hole in the 
diskette and it still works!  Technology before its time.

Jimmy
-- 
cjkuo@locus.com
"The correct answer to an either/or question is both!"

bill@franklin.com (bill) (05/28/91)

In article <1991May27.085944.172509@locus.com> cjkuo@locus.com
	(Chengi Jimmy Kuo) writes:
: bill@franklin.com (bill) writes:
: >(Are you the Jimmy Kuo who used to work at Proximity?)
:
: Yes.  Hi Bill.  Wow, it's been 7 1/2 years!  (Sorry everybody for the
: reminiscing.)  Say hi to everyone for me, will ya?

I've passed them your e-mail address. There's still a few around
whom you might know.

: >: Compress the dictionary by having a table for the first 2 or 3 letters
: >: (26x26, or 26x26x26, but actually much less since there are no qbX, etc.).
: >: For the words, once you're past the first 3 letters, start compressing by
: >: using "repeat deltas".  A word is composed of N letters that repeat from the
: >: previous word, followed by the delta.  You can compress the delta portion
: >: using whatever method you want since it's straight alphabet letters (but I
: >: choose not to).  Compression may not be the best, but data access is pretty
: >: fast since decompression is trivial (also, strcmp is partially completed
: >: due to this encoding).
:
: >The problem is that you have to decompress, on average, half a
: >bin, and if you have a lot of words, or lots of inflections, this
: >will be awfully slow. The thing that made this kind of compression
: >work for us (at Proximity) was that not only did we do the above,
: >but we assigned special codes that meant "include a table of
: >endings here". That and Huffman compression, with multiple models.
:
: You're talking about the full compression scheme for the full dictionary.
: The algorithm I described above is just a subset of that.  It's
: not the same one that was used for Word Challenge (below).  The one I
: described is the one I used for a simple spell-check demo.  I used it for
: the in-core cache of words.  Decompression was real fast because you could
: skip all the words with a repeat count larger than the prefix you've matched
: so far, and other neat things happen when you start playing with this.

Then we're saying the same thing, actually. If the list is small
enough, this will be very fast. If you have a lot of words, it
can get slow. And if space is of real concern and you do
additional compression, you'll pay for it in time, both for the
decompression itself and because a lot of the speed-up tricks you
allude to stop working once you've compressed further.

I guess the person who asked the original question is going to
have to weigh the pros and cons for himself; I wonder if any of
this is even relevant to his intent?