dhesi@bsu-cs.UUCP (Rahul Dhesi) (03/28/89)
In article <22142@agate.BERKELEY.EDU> wainscot@wheatena.berkeley.edu (Brian Wainscott) writes: >In need of a good algorithm for my spell checking routine. Specifically, >once I know the word they used is not in my dictionary, how do I go about >making rational suggestions for replacement? For each entry in the word list, keep its Soundex code too. When you find a misspelled word, find its Soundex code, and show all other words with the same Soundex code as possible corrections. Instead of Soundex (described in one of Knuth's books, I think) you could use some other algorithm that works better for non-proper nouns. This discussion is being diverted to comp.lang.misc, where the language gurus hide, who will probably scream that it doesn't belong there either, but will at least be able to suggest a better place. -- Rahul Dhesi UUCP: <backbones>!{iuvax,pur-ee}!bsu-cs!dhesi ARPA: dhesi@bsu-cs.bsu.edu
leichter@cs.yale.edu (Jerry Leichter) (03/29/89)
In article <6346@bsu-cs.UUCP>, dhesi@bsu-cs.UUCP (Rahul Dhesi) writes... >In article <22142@agate.BERKELEY.EDU> wainscot@wheatena.berkeley.edu (Brian >Wainscott) writes: >>In need of a good algorithm for my spell checking routine. Specifically, >>once I know the word they used is not in my dictionary, how do I go about >>making rational suggestions for replacement? > >For each entry in the word list, keep its Soundex code too. When you >find a misspelled word, find its Soundex code, and show all other words >with the same Soundex code as possible corrections. Instead of Soundex >(described in one of Knuth's books, I think) you could use some other >algorithm that works better for non-proper nouns. > Soundex and Soundex-like algorithms are not a complete solution. The problem is that spelling errors arise from two sources: User didn't know the spelling but accurately thought that what he typed WAS the correct spelling; or user knew the spelling but included a typo. The actual "error syndrome" that these two sources of error produce are very different. Soundex-like algorithms are often quite good at dealing with the first class, the "spelling" errors, since people generally pick a (mis)spelling which "sounds" right to them, not some- thing completely unrelated. (However, algorithms of this form can be badly fooled since they are at best only a weak approximation to the way people think of pronounciation. There are much better algorithms around for deter- mining pronounciation, developed for speech synthesis. I've never heard of anyone using one in a spelling checker, though.) Soundex-like algorithms are generally useless for the second class of errors, the typos. Studies of typos have shown that almost all of them fall into one of four classes: Missed letter; inserted letter; substitution of one letter for another; exchange of two adjacent letters. (There are also exchanges of multiple letters or of letters more than two apart that occur when what really gets exchanges is "finger macros" - good typists type "th" as a unit, not as two individual letters.) It's not hard, though it is somewhat time-consuming, to generate all the possible "words" that could have been turned into the "word" seen in the text by one of these four mechanisms. Trying to deal with the rarer cases, like multi-letter exchanges, takes a LONG time and gets prob- lematical - after a while, you are generating very different words. Still, you could provide a set of successively-slower correction algorithms, and try them one after another until you find some match. If the user then rejects this match, go back and try with still slower algorithms, until either the user is happy or you run out of ideas. -- Jerry