doug@alice.UUCP (Doug McIlroy) (05/17/89)
>I've been wondering what the program "spell" does...
>All these words [utomsrr ...] are caught as misspelled by the HP-UX version

For the secrets of spell, see chapter 13 in J. Bentley, Programming Pearls,
Addison-Wesley, 1986.

The curious results of feeding random strings to spell arise from hashing,
which is the key to spell's speed and compactness (fewer than 15 bits per
dictionary entry!).  The expected rate of hashing errors is 1 in 2000
misspellings.  Differing outcomes in different implementations are accounted
for by evolution in dictionaries, hashing functions and affix-stripping
rules.

Doug McIlroy
AT&T Bell Labs
doug@research.att.com
research!doug
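[A toy sketch of the hashed-dictionary idea Doug describes may help make the
numbers concrete.  This is NOT spell's actual code: the hash function and the
27-bit code width below are illustrative guesses.  The idea is to hash every
dictionary word down to a k-bit code and store only the set of codes, so a
random non-word slips through whenever its code happens to collide.]

```python
import hashlib

K = 27  # bits per hash code; illustrative, not spell's real parameter

def code(word, k=K):
    """Hash a word down to a k-bit integer code."""
    digest = hashlib.md5(word.encode()).digest()
    return int.from_bytes(digest[:8], "big") % (1 << k)

def build(dictionary):
    """The stored 'dictionary' is nothing but the set of codes."""
    return {code(w) for w in dictionary}

def looks_correct(word, codes):
    """A word passes if its code matches any dictionary word's code."""
    return code(word) in codes

words = ["route", "reroute", "routing", "need", "mist"]
codes = build(words)

assert all(looks_correct(w, codes) for w in words)  # no false negatives
# A random string passes with probability about len(words) / 2**K,
# which is how hashing errors of the "1 in 2000" sort arise.
```

[Stored sorted and difference-coded, n such k-bit codes cost roughly
k - log2(n) + 1.44 bits apiece, which is plausibly how a whole dictionary
fits in fewer than 15 bits per entry even though the codes themselves are
wider than that.]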
lauther@janus.Berkeley.EDU (Ulrich Lauther) (06/06/90)
In article <91@grumbly.UUCP> root@grumbly.com writes:
>
> Does anyone know the file format of the unhashed, raw word-list used
>to make the spelling database (/usr/lib/spell/hlista in SCO Unix).
>Specifically, what notations are used to indicate variations on a base word?
>i.e. s, es, ed, ing, tion ... .  Or is it done by the spell program and no
>special notation is needed in the word-list?
> [stuff deleted]

I too would like to know, since TFM is rather terse, not to say cryptic.
For example, there seems to be a +local_words option that I have seen being
used, and that seems to work (to some extent), but that is not mentioned in
any man page I have been able to access.

Specifically I would like to know:

1. Why does spell refuse to learn some words (either using +local_words or
   using spellin)?  E.g.: "rerouting".

2. Why does spell accept some words as correct, though I see no way they
   could (mistakenly) be derived from correct words?  E.g.: "neeeds",
   "miist".  (This holds for an SCO Xenix distribution where I am absolutely
   sure that nobody has tampered with /usr/lib/spell/hlist[ab] -- *after*
   installation, that is.)

3. Is there a way to see all the words currently in /usr/lib/spell/hlist[ab]?
   Is that file in any way related to /usr/lib/spell/words?

-----------------------------------------------------------------
Ulrich Lauther          Internet: lauther@janus.berkeley.edu
Siemens / UCB           ph: +1 415 642 3338  fax: 642 2739
                        +1 415 658 8529 home
sysnmc@magic706.chron.com (Matt Cohen) (06/09/90)
> Specifically I would like to know:
>
> 1. Why does spell refuse to learn some words (either using +local_words or
>    using spellin)?  E.g.: "rerouting".
>
> 2. Why does spell accept some words as correct, though I see no way they
>    could (mistakenly) be derived from correct words?  E.g.: "neeeds",
>    "miist".
>
> 3. Is there a way to see all the words currently in
>    /usr/lib/spell/hlist[ab]?  Is that file in any way related to
>    /usr/lib/spell/words?

Ulrich,

To the best of my knowledge, spell works by hashing the dictionary into a
very large hash table.  Unfortunately, the table would be ridiculously large
if it were stored in the usual fashion (each hash value pointing to a linked
list of the words with that hash value).

Instead, spell uses a technique called "optimistic hashing": drop the linked
list at each hash value, and use a single bit to say "at least one word in
the dictionary hashes to this value".  This decreases space usage
dramatically.  However, even though the hash table is huge, there is a
chance that a random non-word will hash to the same value as a word in the
dictionary.  When one of these non-words is identified, it is entered into a
"stop list", which is hashed with a different hash function.  The stop list
is typically small, since it is difficult to identify these words except by
chance.

If a word hashes to a set bit in the hashed dictionary and to a clear bit in
the stop list, it is declared correct.  The word is also examined to see
whether it could be the result of applying one of a collection of
prefixes/suffixes to a root word.  All of these potential root words are
checked as well, and if any of them passes, the word is declared correct.

Now, this explains your problems.

> 1. Why does spell refuse to learn some words (either using +local_words or
>    using spellin)?  E.g.: "rerouting".

It looks like some quirk of spell keeps it from recognizing that
rerouting = re + route - e + ing.  It seems to figure out "reroute" and
"routing" OK.  My guess is that "rerouting" hashes to the same value as a
word in the stop list.  Therefore, even when you create a dictionary with
"rerouting" in it, it can't pass.

> 2. Why does spell accept some words as correct, though I see no way they
>    could (mistakenly) be derived from correct words?  E.g.: "neeeds",
>    "miist".

These words *are* "derived" from correct words:

	neeeds = nee + ed + s
	miist  = mi + ist

The problem is that the dictionary contains no information about which
suffixes/prefixes are appropriate for each word.  You can therefore come up
with many ridiculous things that spell says are good words.

> 3. Is there a way to see all the words currently in
>    /usr/lib/spell/hlist[ab]?

No.  The hashed files contain none of the input words; see above.

> Is that file in any way related to /usr/lib/spell/words?

It is likely that a file similar to /usr/lib/spell/words (/usr/dict/words on
many Unixes) was hashed to produce /usr/lib/spell/hlista.  American
spellings were probably naively changed to British and hashed to produce
hlistb.  The source of hlista is probably not exactly /usr/dict/words, at
least on my version of Unix (SunOS 4.0.3).  You can test this easily with
"spell /usr/lib/spell/words": by its very definition, no word in the
dictionary should be deemed incorrect.

Whew!  That was a mouthful.  Hope it helped.

You can gather that modern spell checkers don't use this kind of optimistic
hashing.  You may want to check out GNU's ispell program, which solves many
of these problems.  You can ftp it from uunet.uu.net; the files are
~/gnu/{ispell.shar,ispell.el,dict.shar}.  I can also send it to you if you
want.

You may want to play with "spell -v" and "spell -x", if your system has
these options.  Please tell me what else you discover.

Good luck!

-- Matt

Matt Cohen                           INET: sysnmc@chron.com
Department of Technology Resources   UUCP: ...!uunet!chron!sysnmc
The Houston Chronicle                AT&T: +1 713 220 7023
801 Texas Avenue, Houston, TX 77002
"Houston's Leading Information Source"
"The opinions above are most likely those of my employer."
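[The whole lookup Matt walks through -- hashed dictionary, stop list hashed
with a different function, and affix stripping -- can be caricatured in a
few lines.  Everything here is illustrative: the real spell's hash
functions, table formats, and affix rules are different and far more
elaborate.  It does show why "neeeds" and "miist" pass, and why a stop-list
collision can make a word impossible to "learn" no matter what you add to
the dictionary.]

```python
import hashlib

K = 27  # illustrative code width, not spell's real parameter
SUFFIXES = ["s", "ed", "ing", "ist"]  # a tiny illustrative suffix table

def code(word, k=K):
    """Hash a word into the dictionary table."""
    digest = hashlib.md5(word.encode()).digest()
    return int.from_bytes(digest[:8], "big") % (1 << k)

def stop_code(word, k=K):
    """The stop list is hashed with a *different* function."""
    digest = hashlib.md5(b"stop:" + word.encode()).digest()
    return int.from_bytes(digest[:8], "big") % (1 << k)

def passes(word, dict_codes, stop_codes):
    """Literal lookup: set bit in the dictionary, clear bit in the stop list."""
    return code(word) in dict_codes and stop_code(word) not in stop_codes

def accepted(word, dict_codes, stop_codes, depth=3):
    """Accept the word itself, or any suffix-stripped root of it."""
    if passes(word, dict_codes, stop_codes):
        return True
    if depth == 0:
        return False
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf):
            if accepted(word[:-len(suf)], dict_codes, stop_codes, depth - 1):
                return True
    return False

dict_codes = {code(w) for w in ["nee", "mi", "need", "mist", "route"]}
stop_codes = set()

# No per-word affix information is kept, so nonsense derivations pass:
assert accepted("neeeds", dict_codes, stop_codes)  # nee + ed + s
assert accepted("miist", dict_codes, stop_codes)   # mi + ist

# And a word whose stop-list code is set is vetoed unconditionally,
# which matches the "can't learn 'rerouting'" symptom:
vetoed = {stop_code("rerouting")}
assert not passes("rerouting", dict_codes, vetoed)
```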