laca@vipunen.hut.fi (Laszlo C Balint) (05/20/91)
I am looking for an algorithm to compress and decompress strings of
variable length. The compressed sequence can be fixed length too, but
the process should be effective and reversible. I need it for a spell
checker, where the number of words might be huge, but they all have
to fit in memory because of frequent access (~20 lookups/checked word).

Any help appreciated, especially if there is C(++) code :-)

Thanks in advance,
laca
--
Laszlo C. Balint  ( Hungarian on the go )            laca@vipunen.hut.fi
------------------------------------------------------------------------
| ....That's all folks. Flames, if you must, by mail please. They're   |
| easier to ignore that way.       (Tony Lezard, with(out) permission) |
brad@looking.on.ca (Brad Templeton) (05/24/91)
Data compression is not normally used for spell checkers. Instead, most
spell checkers use hashing. Create an array of around B bits (around
600,000 is good and easy) and for each word in your dictionary, apply N
hashing functions that map from the word to a number from 1 to B. Set
the bit of that number.

To see if a word is in the dictionary, apply the N functions to the word
and see if all the bits are set. If so, the word is valid; otherwise it
is definitely not in. Note that this will say that some bad words are
correct, but it never says that a valid word is not found. The
probability of saying a bad word is correct can be reduced until it is
meaningless.

If you have a 30,000 word dictionary and 5 functions, then ~1/4 of the
bits are set. The probability of a bad word giving yes is .25 ^ 5, or
about 0.1%. There is a formula that works out the optimal value of N for
a given B and dictionary size. If you are really worried, take common
misspellings and special case them.
--
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario
519/884-7473
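[For anyone who wants to experiment with the scheme Brad describes, here
is a minimal sketch in C. The bit-array size, the number of hash
functions, and the hash mixing constants are illustrative choices, not
values from any particular product. The formula Brad alludes to gives,
for an n-word dictionary, an optimal N of roughly (B/n) * ln 2, at which
point about half the bits end up set.]

#include <stdio.h>

#define B        600000UL   /* number of bits in the filter      */
#define N_HASHES 5          /* number of hash functions per word */

static unsigned char filter[(B + 7) / 8];

/* A family of simple string hashes, parameterised by a seed. */
static unsigned long hash(const char *word, unsigned long seed)
{
    unsigned long h = seed;
    while (*word)
        h = h * 33 + (unsigned char)*word++;
    return h % B;
}

static void set_bit(unsigned long i) { filter[i >> 3] |= 1 << (i & 7); }
static int  get_bit(unsigned long i) { return filter[i >> 3] & (1 << (i & 7)); }

/* Call once per dictionary word when building the filter. */
void add_word(const char *word)
{
    unsigned long k;
    for (k = 0; k < N_HASHES; k++)
        set_bit(hash(word, k * 131 + 7));
}

/* Returns 1 if the word *might* be in the dictionary, 0 if it definitely
   is not; the false positives Brad mentions all come from the "might". */
int maybe_in_dictionary(const char *word)
{
    unsigned long k;
    for (k = 0; k < N_HASHES; k++)
        if (!get_bit(hash(word, k * 131 + 7)))
            return 0;
    return 1;
}

/* Tiny demo: a two-word "dictionary". */
int main(void)
{
    add_word("hello");
    add_word("world");
    printf("hello: %d  helo: %d\n",
           maybe_in_dictionary("hello"), maybe_in_dictionary("helo"));
    return 0;
}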
bill@franklin.com (bill) (05/25/91)
In article <1991May23.235530.11443@looking.on.ca> brad@looking.on.ca
(Brad Templeton) writes:
: Data compression is not normally used for spell checkers.

Hokum. The data that goes into, for example, one of our Italian spelling
checkers is around 13M of text. It is compressed into around 64K,
reversibly. I think that can be called data compression.

: Instead, most spell checkers use hashing.

No, Brad, they do not. As one of the developers of the most widely used
OEM spelling checker, and having had to take a good look at our
competitors at various times, I can say this with authority.

Furthermore, hash techniques were generally discarded over nine years
ago because they accept way too many spelling errors. And, no, you
cannot make them acceptable by hard coding the common spelling mistakes
they fail on. No matter how low the probability of false acceptance is
made, it is still high enough to make the customers wince. *Please*,
people, do not use this algorithm for a spelling checker. It has its
uses, but this is not one of them.

There was an article in a recent DDJ, if one is interested in
implementation details of the hash algorithm that Brad described. I
don't recall the exact title, but it had the word "existence" in it.
(That's an example of brain damage if ever I saw one; this technique is
useful for testing for the *nonexistence* of something, or for screening
for the *possible* existence of something.)

BTW, the *correct* answer to the question of the "normal" way of storing
the lexicon is this: store the lexicon as a trie. This is easily the
best method for heavily inflected languages like Finnish, and is used by
most of the big players.

The other method of doing it is to store the dictionary in sorted order,
possibly with an in-memory index and/or a small cache of frequently used
words to avoid most disk lookups. This, too, typically involves a
rudimentary kind of data compression (dictionary style), in that groups
of inflected words are often stored as a root plus inflection
information.
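[A minimal illustration of the trie lookup bill recommends, sketched in
C. The array-of-26-pointers node is the naive textbook layout, not the
packed representation a commercial checker would use, and the code
assumes lowercase a-z input.]

#include <stdlib.h>

struct trie_node {
    struct trie_node *child[26];   /* one slot per letter a-z             */
    int is_word;                   /* nonzero if a word ends at this node */
};

static struct trie_node *new_node(void)
{
    /* calloc zeroes both the child pointers and the is_word flag. */
    return calloc(1, sizeof(struct trie_node));
}

/* Insert a lowercase word into the trie rooted at `root`. */
void trie_insert(struct trie_node *root, const char *word)
{
    while (*word) {
        int c = *word++ - 'a';
        if (root->child[c] == NULL)
            root->child[c] = new_node();
        root = root->child[c];
    }
    root->is_word = 1;
}

/* Return 1 if `word` was inserted, 0 otherwise. */
int trie_lookup(const struct trie_node *root, const char *word)
{
    while (*word) {
        int c = *word++ - 'a';
        if (root->child[c] == NULL)
            return 0;
        root = root->child[c];
    }
    return root->is_word;
}

[Plain pointers waste space; real implementations pack the nodes and
share common suffixes -- which is what turns a trie into the DAWG
mentioned later in this thread -- but the lookup logic stays the same.]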
brad@looking.on.ca (Brad Templeton) (05/25/91)
I apologize if I was a bit behind the times. The spelling checkers I use
are both of the hash variety, and I have little trouble with them.

Hashing variants do run into danger when suggesting corrections, because
you probe the dictionary with a couple of hundred permutations of the
misspelled word, which makes it likely you will hit a false positive --
but I still think it can be kept below a threshold. Of course you are
right that it takes only one user who doesn't understand this to cause
nasty complaints...
--
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario
519/884-7473
cjkuo@locus.com (Chengi Jimmy Kuo) (05/26/91)
laca@vipunen.hut.fi (Laszlo C Balint) writes:

>I am looking for an algorithm to compress and decompress strings of
>variable length. The compressed sequence can be fixed length too, but
>the process should be effective and reversible. I need it for a spell
>checker, where the number of words might be huge, but they all have
>to fit in memory because of frequent access (~20 lookups/checked word).
>
>Any help appreciated, especially if there is C(++) code :-)

Sorry, no C++, but...

Compress the dictionary by having a table for the first 2 or 3 letters
(26x26, or 26x26x26, but actually much less since there are no qbX,
etc.). For the words, once you're past the first 3 letters, start
compressing by using "repeat deltas". A word is composed of N letters
that repeat from the previous word, followed by the delta. You can
compress the delta portion using whatever method you want since it's
straight alphabet letters (but I choose not to). Compression may not be
the best, but data access is pretty fast since decompression is trivial
(also, strcmp is partially completed due to this encoding).

Jimmy Kuo
--
cjkuo@locus.com
"The correct answer to an either/or question is both!"
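[To make the "repeat delta" idea concrete, here is a rough C sketch. The
entry layout -- one count byte giving the number of letters shared with
the previous word, then the NUL-terminated tail, with a 0xFF byte ending
the bin -- is an assumption made for illustration; Jimmy's actual
encoding may well differ.]

#include <string.h>

/* Look up `word` in one bin of a front-coded, sorted word list.  The
   first-letters table described above selects which bin to scan.
   Assumes words shorter than 64 letters. */
int bin_lookup(const unsigned char *bin, const char *word)
{
    char current[64];              /* the word reconstructed so far */
    int cmp;

    current[0] = '\0';
    while (*bin != 0xFF) {
        unsigned int shared = *bin++;                 /* letters reused   */
        strcpy(current + shared, (const char *)bin);  /* splice the delta */
        bin += strlen((const char *)bin) + 1;

        cmp = strcmp(current, word);
        if (cmp == 0)
            return 1;              /* found it                                */
        if (cmp > 0)
            return 0;              /* past it in sorted order: not in the bin */
    }
    return 0;
}

[The shared count is exactly the part of the strcmp the encoding has
already done for you; Jimmy's later post in this thread describes the
skipping trick this enables.]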
gtoal@castle.ed.ac.uk (G Toal) (05/26/91)
In article <1991May23.235530.11443@looking.on.ca> brad@looking.on.ca
(Brad Templeton) writes:
>Data compression is not normally used for spell checkers.

Actually it is.

>Instead, most spell checkers use hashing. Create an array of around
>B bits (around 600,000 is good and easy) and for each word in your
>dictionary, apply N hashing functions that map from the word to a
>number from 1 to B. Set the bit of that number.
>
>To see if a word is in the dictionary, apply the N functions to the
>word and see if all the bits are set. If so, the word is valid;
>otherwise it is definitely not in.

'fraid Bloom filters went out of fashion years ago -- they're
particularly bad at typo correction. Although the data compression seems
good (you get a reasonably good probability of detecting errors by using
about 20 bits per word), it is lossy. A DAWG data structure comes in at
about the same size -- and is just as fast, if not faster -- and is
lossless. It is also very good for error correction.

If the original poster wants some old C code, I posted a set of spelling
utilities to alt.sources about two years back. Look for spellutils/* or
dawgutils/* -- or mail me for a copy.

Regards
Graham
bill@franklin.com (bill) (05/27/91)
In article <1991May25.235703.159943@locus.com> cjkuo@locus.com (Chengi
Jimmy Kuo) writes:

(Are you the Jimmy Kuo who used to work at Proximity?)

: Compress the dictionary by having a table for the first 2 or 3 letters
: (26x26, or 26x26x26, but actually much less since there are no qbX, etc.).
: For the words, once you're past the first 3 letters, start compressing by
: using "repeat deltas". A word is composed of N letters that repeat from the
: previous word, followed by the delta. You can compress the delta portion
: using whatever method you want since it's straight alphabet letters (but I
: choose not to). Compression may not be the best, but data access is pretty
: fast since decompression is trivial (also, strcmp is partially completed
: due to this encoding).

The problem is that you have to decompress, on average, half a bin, and
if you have a lot of words, or lots of inflections, this will be awfully
slow. The thing that made this kind of compression work for us (at
Proximity) was that not only did we do the above, but we assigned
special codes that meant "include a table of endings here". That, and
Huffman compression with multiple models.

This was all very ad hoc, but got some very good compression. If you can
find it, the old Word Challenge game (the Jimmy Kuo of Proximity did the
first version, and I did the version you are likely to find) had a
database compressed this way. We got 80K+ words and the program itself
onto one old Apple II disk, the 140K one. If my memory serves me, we got
just slightly over one byte per word.
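[The "table of endings" trick might look something like the sketch
below. The code values, table contents, and function names are invented
purely to illustrate the idea; they are not taken from the Proximity
format.]

#include <string.h>

/* Hypothetical suffix tables: a special code stored after a root in the
   word list means "this root also takes every ending in table k". */
static const char *ending_table[2][5] = {
    { "", "s", "ed", "ing", NULL },    /* table 0 */
    { "", "s", "ly", "ness", NULL },   /* table 1 */
};

/* Return 1 if `word` is `root` plus some ending from table k. */
int matches_with_endings(const char *root, int k, const char *word)
{
    size_t rlen = strlen(root);
    int i;

    if (strncmp(word, root, rlen) != 0)
        return 0;
    for (i = 0; ending_table[k][i] != NULL; i++)
        if (strcmp(word + rlen, ending_table[k][i]) == 0)
            return 1;
    return 0;
}

[The win is that a whole inflection family costs one code byte in the
word list instead of one front-coded entry per form.]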
cjkuo@locus.com (Chengi Jimmy Kuo) (05/27/91)
bill@franklin.com (bill) writes:

>In article <1991May25.235703.159943@locus.com> cjkuo@locus.com
> (Chengi Jimmy Kuo) writes:

>(Are you the Jimmy Kuo who used to work at Proximity?)

Yes. Hi Bill. Wow, it's been 7 1/2 years! (Sorry everybody for the
reminiscing.) Say hi to everyone for me, will ya?

>: Compress the dictionary by having a table for the first 2 or 3 letters
>: (26x26, or 26x26x26, but actually much less since there are no qbX, etc.).
>: For the words, once you're past the first 3 letters, start compressing by
>: using "repeat deltas". A word is composed of N letters that repeat from the
>: previous word, followed by the delta. You can compress the delta portion
>: using whatever method you want since it's straight alphabet letters (but I
>: choose not to). Compression may not be the best, but data access is pretty
>: fast since decompression is trivial (also, strcmp is partially completed
>: due to this encoding).

>The problem is that you have to decompress, on average, half a
>bin, and if you have a lot of words, or lots of inflections, this
>will be awfully slow. The thing that made this kind of compression
>work for us (at Proximity) was that not only did we do the above,
>but we assigned special codes that meant "include a table of
>endings here". That, and Huffman compression with multiple models.

You're talking about the full compression scheme. The algorithm I
described above is just a subset of that. It's not the same one that was
used for Word Challenge (below). The one I described is the one I used
for a simple spell-check demo; I used it for the in-core cache of words.
Decompression was real fast because you could skip all the words with a
repeat count larger than the prefix you're looking for, and other neat
things happen when you start playing with this.

And all this, I understand, is way out of date as far as
Franklin/Proximity is concerned, now that you guys have encoded
phonetics and a thesaurus and all that good stuff as well.

>This was all very ad hoc, but got some very good compression. If
>you can find it, the old Word Challenge game (the Jimmy Kuo of
>Proximity did the first version, and I did the version you are
>likely to find) had a database compressed this way. We got 80K+
>words and the program itself onto one old Apple II disk, the 140K
>one. If my memory serves me, we got just slightly over one byte
>per word.

Ah, Word Challenge, the game that never caught on. The thing that was so
wonderful about that was Clean Disk. Punch a hole in the diskette and it
still works! Technology before its time.

Jimmy
--
cjkuo@locus.com
"The correct answer to an either/or question is both!"
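[The skip Jimmy mentions can be layered onto the bin format assumed in
the earlier sketch: track how many leading letters of the query matched
the last word actually reconstructed, and step over any entry whose
shared count exceeds that, since such an entry can only differ from the
query at the same position the previous word did and so still sorts
before it. Layout and names are the same illustrative assumptions as
before.]

#include <string.h>

/* Assumes words shorter than 64 letters. */
int bin_lookup_skipping(const unsigned char *bin, const char *word)
{
    char current[64];              /* last entry actually reconstructed */
    size_t matched = 0;            /* query prefix known to match it    */

    current[0] = '\0';
    while (*bin != 0xFF) {
        unsigned int shared = *bin++;
        const char *tail = (const char *)bin;
        bin += strlen(tail) + 1;

        if (shared > matched)
            continue;                       /* still below the query: skip */

        strcpy(current + shared, tail);     /* reconstruct this entry      */
        matched = shared;
        while (word[matched] != '\0' && word[matched] == current[matched])
            matched++;                      /* extend the matched prefix   */

        if (current[matched] == word[matched])
            return 1;                       /* both ended: exact match     */
        if ((unsigned char)current[matched] > (unsigned char)word[matched])
            return 0;                       /* entry sorts after the query */
    }
    return 0;
}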
bill@franklin.com (bill) (05/28/91)
In article <1991May27.085944.172509@locus.com> cjkuo@locus.com (Chengi
Jimmy Kuo) writes:
: bill@franklin.com (bill) writes:
: >(Are you the Jimmy Kuo who used to work at Proximity?)
:
: Yes. Hi Bill. Wow, it's been 7 1/2 years! (Sorry everybody for the
: reminiscing.) Say hi to everyone for me, will ya?

I've passed them your e-mail address. There are still a few around whom
you might know.

: >: Compress the dictionary by having a table for the first 2 or 3 letters
: >: (26x26, or 26x26x26, but actually much less since there are no qbX, etc.).
: >: For the words, once you're past the first 3 letters, start compressing by
: >: using "repeat deltas". A word is composed of N letters that repeat from the
: >: previous word, followed by the delta. You can compress the delta portion
: >: using whatever method you want since it's straight alphabet letters (but I
: >: choose not to). Compression may not be the best, but data access is pretty
: >: fast since decompression is trivial (also, strcmp is partially completed
: >: due to this encoding).
:
: >The problem is that you have to decompress, on average, half a
: >bin, and if you have a lot of words, or lots of inflections, this
: >will be awfully slow. The thing that made this kind of compression
: >work for us (at Proximity) was that not only did we do the above,
: >but we assigned special codes that meant "include a table of
: >endings here". That, and Huffman compression with multiple models.
:
: You're talking about the full compression scheme. The algorithm I
: described above is just a subset of that. It's not the same one that was
: used for Word Challenge (below). The one I described is the one I used
: for a simple spell-check demo; I used it for the in-core cache of words.
: Decompression was real fast because you could skip all the words with a
: repeat count larger than the prefix you're looking for, and other neat
: things happen when you start playing with this.

Then we're saying the same thing, actually. If the list is small enough,
this will be very fast. If you have a lot of words, it can get slow. And
if space is of real concern and you do additional compression, you'll
pay for it in time, both for the decompression itself and because a lot
of the speed-up tricks you allude to won't work once you've done more
compression.

I guess the person who asked the original question is going to have to
weigh the pros and cons for himself; I wonder if any of this is even
relevant to his intent?