[comp.unix.questions] spell

doug@alice.UUCP (Doug McIlroy) (05/17/89)

>I've been wondering what the program "spell" does...

>All these words [utomsrr ...] are caught as misspelled by the HP-UX version

For the secrets of spell, see chapter 13 in J. Bentley, Programming
Pearls, Addison-Wesley, 1986.  The curious results of feeding
random strings to spell arise from hashing, which is the
key to spell's speed and compactness (fewer than 15 bits per
dictionary entry!).  The expected rate of hashing errors
is 1 in 2000 misspellings.  Differing outcomes in different 
implementations are accounted for by evolution in dictionaries, 
hashing functions and affix-stripping rules.
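The arithmetic behind those figures is easy to reproduce. A back-of-the-envelope sketch in Python, assuming roughly 75,000 dictionary words hashed into a 27-bit space (figures from Bentley's account, not from this posting); the second number is the information-theoretic lower bound on storage, which is consistent with the "fewer than 15 bits" claim:

```python
# Back-of-the-envelope check of spell's hashing error rate.
# Assumed figures (roughly Bentley's -- not stated in this posting):
# about 75,000 dictionary words hashed into a 27-bit space.
import math

n_words = 75_000
table_size = 2 ** 27            # 134,217,728 possible hash values

# A random non-word is wrongly accepted when its hash collides with
# some dictionary word's hash value.
false_positive_rate = n_words / table_size
print(f"about 1 in {round(1 / false_positive_rate)} misspellings slip through")

# Shannon lower bound on storage for such a set, per word:
bits_per_entry = math.log2(table_size * math.e / n_words)
print(f"at least {bits_per_entry:.1f} bits per dictionary entry")
```

With these assumed numbers the error rate comes out near 1 in 1800, the same order as the 1-in-2000 figure quoted above.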

Doug McIlroy
AT&T Bell Labs
doug@research.att.com
research!doug

lauther@janus.Berkeley.EDU (Ulrich Lauther) (06/06/90)

In article <91@grumbly.UUCP> root@grumbly.com writes:
>
>	Does anyone know the file format of the unhashed, raw word-list used
>to make the spelling database (/usr/lib/spell/hlista in SCO Unix).  
>Specifically, what notations are used to indicate variations on a base word?
>i.e. s, es, ed, ing, tion ... .  Or is it done by the spell program and no
>special notation is needed in the word-list?
>
[stuff deleted]

I too would like to know, since TFM is rather terse, not to say cryptic.
For example, there seems to be a +local_words option that I have seen
used, and that seems to work (to some extent), but that is not mentioned
in any man page I have been able to access.

Specifically I would like to know:

1. Why does spell refuse to learn some words (either using +local_words
or using spellin)?  E.g. "rerouting".

2. Why does spell accept some words as correct, even though I see no way
they could (mistakenly) be derived from correct words?  E.g. "neeeds",
"miist".  (This holds for a SCO Xenix distribution where I am absolutely
sure that nobody has tampered with /usr/lib/spell/hlist[ab] *after*
installation.)

3. Is there a way to see all the words currently in /usr/lib/spell/hlist[ab]?
Is that file in any way related to /usr/lib/spell/words?

-----------------------------------------------------------------
Ulrich Lauther               Internet: lauther@janus.berkeley.edu
Siemens / UCB                ph: +1 415 642 3338 fax: 642 2739
				 +1 415 658 8529 home

sysnmc@magic706.chron.com (Matt Cohen) (06/09/90)

 > Specifically I would like to know:
 > 
 > 1. Why does spell refuse to learn some words (either using +local_words
 > or using spellin)?  E.g. "rerouting".
 > 
 > 2. Why does spell accept some words as correct, even though I see no way
 > they could (mistakenly) be derived from correct words?  E.g. "neeeds",
 > "miist".  (This holds for a SCO Xenix distribution where I am absolutely
 > sure that nobody has tampered with /usr/lib/spell/hlist[ab] *after*
 > installation.)
 > 
 > 3. Is there a way to see all the words currently in /usr/lib/spell/hlist[ab]?
 > Is that file in any way related to /usr/lib/spell/words?

Ulrich,

	To the best of my knowledge, the way spell works is that it
hashes the dictionary into a very large hash table.  Unfortunately, the
table would be ridiculously large if it were stored in the usual fashion
(each hash value pointing to a linked list of words with that hash
value).  Instead, spell uses a technique called "optimistic hashing".

	What is done is to drop the linked list at each hash value: a
single bit then says "at least one word in the dictionary hashes to
this value".  This decreases space usage dramatically.  However, even
though the hash table is huge, there is still a chance that a random
non-word will hash to the same value as a word in the dictionary.

	When one of these non-words is identified, it is entered into
the "stop list", which is then hashed with a different hash function.
The stop list is typically small since it's difficult to identify
these words except by chance.

	If a word hashes to a set bit in the hashed dictionary and
to a clear bit in the stop list, it is declared correct.

	Also, the word is examined to see if it could be the result
of applying one of a bunch of prefixes/suffixes to a root word. All
of these potential root words are also checked, and if any of them passes,
the word is declared correct.
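The affix handling can be sketched like this.  The rule tables below are made up for illustration (spell's real tables are far larger, and spell strips affixes repeatedly, whereas this sketch does one level):

```python
# Toy affix stripping: generate candidate roots for a word by
# peeling off known prefixes and suffixes.
SUFFIX_RULES = [("ing", ""), ("ing", "e"), ("ed", ""), ("ed", "e"),
                ("es", ""), ("s", "")]
PREFIXES = ["re", "un"]

def candidate_roots(word):
    """Yield the word itself plus every single-affix-stripped form."""
    stems = [word] + [word[len(p):] for p in PREFIXES
                      if word.startswith(p)]
    for stem in stems:
        yield stem
        for suffix, replacement in SUFFIX_RULES:
            if stem.endswith(suffix):
                yield stem[: len(stem) - len(suffix)] + replacement
```

For "rerouting" this yields, among others, "routing", "reroute" and "route"; if any candidate passes the hash test, the word is declared correct.  Repeated stripping is how "neeeds" can sneak through as nee + ed + s.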

	--- Now, this explains your problems.

 > 1. Why does spell refuse to learn some words (either using +local_words
 > or using spellin)?  E.g. "rerouting".

	It looks like some quirk of spell keeps it from recognizing that

	rerouting = re + route - e + ing .

	It seems to figure out "reroute" and "routing" ok.

	My guess is that "rerouting" hashes to the same value as a word in the
stop list. Therefore, when you create a dictionary with "rerouting" in it,
it can't pass.

 > 2. Why does spell accept some words as correct, even though I see no way
 > they could (mistakenly) be derived from correct words?  E.g. "neeeds",
 > "miist".  (This holds for a SCO Xenix distribution where I am absolutely
 > sure that nobody has tampered with /usr/lib/spell/hlist[ab] *after*
 > installation.)

	These words are "derived" from correct words.

	neeeds = nee + ed + s
	miist  = mi + ist

	The problem is that the dictionary contains no information
about which suffixes/prefixes are appropriate for each word. You
can therefore come up with many ridiculous things that spell says
are good words.

 > 3. Is there a way to see all the words currently in /usr/lib/spell/hlist[ab]?

	No. The hashed files contain none of the input words; see above.

 > Is that file in any way related to /usr/lib/spell/words?

	It is likely that a file similar to /usr/lib/spell/words
(/usr/dict/words on many Unixes) was hashed to produce /usr/lib/spell/hlista.
American spellings were probably naively changed to British and hashed
to produce hlistb.
	
	The source of hlista is probably not exactly /usr/dict/words, at
least on my version of Unix (SunOS 4.0.3). You can test this easily
with "spell /usr/lib/spell/words". By its very definition, no
word in the dictionary should be deemed incorrect.

	Whew! That was a mouthful. Hope it helped. You can gather
that modern spell checkers don't use this kind of optimistic hashing.
You may want to check out GNU's ispell program which solves many
of these problems. You can ftp it from uunet.uu.net. The files
are ~/gnu/{ispell.shar,ispell.el,dict.shar}. I can also send it to
you if you want.

	You may want to play with "spell -v" and "spell -x", if your 
system has these options.

	Please tell me what else you discover. Good luck!

			-- Matt

Matt Cohen                              INET: sysnmc@chron.com
Department of Technology Resources      UUCP: ...!uunet!chron!sysnmc
The Houston Chronicle                   AT&T: +1 713 220 7023
801 Texas Avenue, Houston, TX 77002     "The opinions above are most likely
"Houston's Leading Information Source"        those of my employer."