[comp.lang.misc] Spelling Checker Algorithm

dhesi@bsu-cs.UUCP (Rahul Dhesi) (03/28/89)

In article <22142@agate.BERKELEY.EDU> wainscot@wheatena.berkeley.edu (Brian
Wainscott) writes:
>In need of a good algorithm for my spell checking routine.  Specifically,
>once I know the word they used is not in my dictionary, how do I go about
>making rational suggestions for replacement?

For each entry in the word list, keep its Soundex code too.  When you
find a misspelled word, find its Soundex code, and show all other words
with the same Soundex code as possible corrections.  Instead of Soundex
(described in one of Knuth's books, I think) you could use some other
algorithm that works better for non-proper nouns.

This discussion is being diverted to comp.lang.misc, where the language
gurus hide, who will probably scream that it doesn't belong there
either, but will at least be able to suggest a better place.
-- 
Rahul Dhesi         UUCP:  <backbones>!{iuvax,pur-ee}!bsu-cs!dhesi
                    ARPA:  dhesi@bsu-cs.bsu.edu

leichter@cs.yale.edu (Jerry Leichter) (03/29/89)

In article <6346@bsu-cs.UUCP>, dhesi@bsu-cs.UUCP (Rahul Dhesi) writes...
 
>In article <22142@agate.BERKELEY.EDU> wainscot@wheatena.berkeley.edu (Brian
>Wainscott) writes:
>>In need of a good algorithm for my spell checking routine.  Specifically,
>>once I know the word they used is not in my dictionary, how do I go about
>>making rational suggestions for replacement?
> 
>For each entry in the word list, keep its Soundex code too.  When you
>find a misspelled word, find its Soundex code, and show all other words
>with the same Soundex code as possible corrections.  Instead of Soundex
>(described in one of Knuth's books, I think) you could use some other
>algorithm that works better for non-proper nouns.
> 

Soundex and Soundex-like algorithms are not a complete solution.  The problem
is that spelling errors arise from two sources:  User didn't know the spelling
but accurately thought that what he typed WAS the correct spelling; or user
knew the spelling but included a typo.  The actual "error syndrome" that these
two sources of error produce are very different.  Soundex-like algorithms are
often quite good at dealing with the first class, the "spelling" errors, since
people generally pick a (mis)spelling which "sounds" right to them, not some-
thing completely unrelated.  (However, algorithms of this form can be badly
fooled since they are at best only a weak approximation to the way people
think of pronounciation.  There are much better algorithms around for deter-
mining pronounciation, developed for speech synthesis.  I've never heard of
anyone using one in a spelling checker, though.)

Soundex-like algorithms are generally useless for the second class of errors,
the typos.  Studies of typos have shown that almost all of them fall into one
of four classes:  Missed letter; inserted letter; substitution of one letter
for another; exchange of two adjacent letters.  (There are also exchanges of
multiple letters or of letters more than two apart that occur when what really
gets exchanges is "finger macros" - good typists type "th" as a unit, not as
two individual letters.)  It's not hard, though it is somewhat time-consuming,
to generate all the possible "words" that could have been turned into the
"word" seen in the text by one of these four mechanisms.  Trying to deal with
the rarer cases, like multi-letter exchanges, takes a LONG time and gets prob-
lematical - after a while, you are generating very different words.  Still,
you could provide a set of successively-slower correction algorithms, and try
them one after another until you find some match.  If the user then rejects
this match, go back and try with still slower algorithms, until either the
user is happy or you run out of ideas.
							-- Jerry