[comp.unix.questions] Spell and /usr/dict/web2

larry@utecfb.Toronto.Edu (Larry Philps) (03/13/87)

It seems to me, from looking at the makefile, that the old /usr/dict/words
is used when creating the hash file for spell.  Now 4.3 comes with the
word list of Webster's 2nd edition on line in /usr/dict/web2.  Is there
any reason why this is not used for making the hash file?

Of course I could go and read the code myself and look for a limit,
or I could just try it, but with all you geniuses out there - why bother. :-)

Has anyone done this?

Larry Philps	Engineering Computing Facility	University of Toronto
NEW PATH:   larry@ecf.toronto.edu
USENET:     {linus, ihnp4, allegra, decvax, floyd}!utcsri!ecfb!larry
CSNET:      larry@Toronto
ARPA:       larry%Toronto@CSNet-Relay
BITNET:     larry@ecf.utoronto.BITNET

mike@hcr.UUCP (Mike Tilson) (03/16/87)

The question was whether one would get a better "spell" hash table
by building it from the Webster's word list rather than from the
much shorter dictionary used by the "spell" makefile.

I believe the answer is "no".

First, if you do greatly expand the dictionary, remember to increase
the size of the hash table.  The algorithm works best when the final
bit table has about 50% ones and 50% zeros.  With a larger dictionary
and no increase in table size, you will turn on more bits.  In the limit,
all bits will be on, and all input to spell will be accepted as correct
spelling.
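
As a back-of-the-envelope sketch of that saturation (mine, not from the
spell source), the program below uses the 400000-bit table and 11 hashes
per word that come up later in this thread, and the Poisson approximation
for the fraction of bits set:

#include <stdio.h>
#include <math.h>

/*
 * Sketch, not spell itself: how the bit table saturates as the
 * dictionary grows, assuming k independent hash functions into an
 * n-bit table.  By the Poisson approximation, the fraction of bits
 * set is 1 - exp(-d*k/n) for a d-word dictionary.
 */
int main(void)
{
	double n = 400000.0;	/* bits in the table */
	double k = 11.0;	/* hash functions per word */
	double d;

	for (d = 25000.0; d <= 200000.0; d += 25000.0) {
		double ones = 1.0 - exp(-d * k / n);	/* fraction of ones */
		double accept = pow(ones, k);	/* P(random string passes) */
		printf("%6.0f words: %5.1f%% ones, false accept %.2g\n",
		    d, 100.0 * ones, accept);
	}
	return 0;
}

At 25000 words the table is about half ones and a random string slips
through less than once in 2000 tries; at 200000 words it is over 99%
ones and almost everything is accepted.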

The biggest problem is that a large dictionary contains many "strange"
words.  Many of these words differ from common words by only one
letter (or one letter transposition).  This means you increase the
chance of a typo turning into another valid word in the dictionary.
It turns out that you are better off using a smaller dictionary.  If
you use a very rare word, spell will report it, but you can simply
ignore the report; in exchange, more of your actual typing and spelling
errors will be caught.  I believe there was a paper that discussed this
subject in CACM a few months back (I don't have it at hand, so it might
have been in some other journal).
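
To make "differ by only one letter (or one letter transposition)"
concrete, here is an illustrative test (my sketch, not code from spell;
near_miss is a made-up name) for that kind of near collision.  A pair
like "action"/"cation" passes it:

#include <string.h>

/*
 * Illustration only: nonzero if a and b have the same length and
 * differ by exactly one letter, or by one adjacent transposition.
 * These are the near misses that let a typo land on a real word.
 */
int near_miss(const char *a, const char *b)
{
	size_t len = strlen(a);
	size_t i, diffs = 0, first = 0;

	if (strlen(b) != len)
		return 0;
	for (i = 0; i < len; i++)
		if (a[i] != b[i]) {
			if (diffs++ == 0)
				first = i;
		}
	if (diffs == 1)
		return 1;	/* one substitution: "cat"/"cot" */
	return diffs == 2	/* adjacent swap: "action"/"cation" */
	    && a[first] == b[first + 1] && a[first + 1] == b[first];
}

The bigger the dictionary, the more word pairs satisfy a test like
this, and the more typos come out looking like correct words.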

/Michael Tilson, HCR Corporation {utzoo,utcsri,ihnp4,...}!hcr!mike

ken@rochester.ARPA (SKY) (03/18/87)

|The biggest problem is that a large dictionary contains many "strange"
|words.  Many of these words differ from common words by only one
|letter (or one letter transposition).  This means you increase the
|chance of a typo turning into another valid word in the dictionary.

Indeed. I know of a case where "action" was mistyped as "cation", but
the dictionary knew too much about chemistry. There is some kind of
trade-off here. Anybody know of studies of this sort?

	Ken

ccplumb@watnot.UUCP (03/18/87)

In article <2563@hcr.UUCP> mike@hcr.UUCP (Mike Tilson) writes:
>The question was whether one would get a better "spell" hash table
>by building it from the Webster's word list rather than from the
>much shorter dictionary used by the "spell" makefile.
>
>I believe the answer is "no".
>
>First, if you do greatly expand the dictionary, remember to increase
>the size of the hash table.  The algorithm works best when the final
>bit table has about 50% ones and 50% zeros.  With a larger dictionary
>and no increase in table size, you will turn on more bits.  In the limit,
>all bits will be on, and all input to spell will be accepted as correct
>spelling.

Um... not quite.  If the table is 50% ones, then 50% of any collection of
random character strings would be accepted as correct.
As I recall, from Jon Bentley's discussion of spell in Programming Pearls,
the hash table size was chosen so that, assuming 20 reported problems per
run, a misspelling would slip through about once every hundred runs; that
is one escape per 20 * 100 = 2000 flagged words, so the ones should fill
up only 1/2000 of the hash table.
--
	-Colin Plumb (watmath!watnot!ccplumb)

Zippy says:
My CODE of ETHICS is vacationing at famed SCHROON LAKE in upstate New York!!

mouse@mcgill-vision.UUCP (04/02/87)

In article <12630@watnot.UUCP>, ccplumb@watnot.UUCP writes:
> In article <2563@hcr.UUCP> mike@hcr.UUCP (Mike Tilson) writes:
>> The algorithm works best when the final bit table has about 50% ones
>> and 50% zeros.

> Um... not quite.  If the table is 50% ones, then 50% of any
> collection of random character strings would be accepted as correct.
> As I recall, from Jon Bentley's discussion of spell in Programming
> Pearls, the hash table size was chosen so that, assuming 20 reported
> problems per run, a misspelling would slip through about once every
> hundred runs, so the ones should fill up 1/2000 of the hash table.

Which spell are you talking about?  Ours uses multiple bits per word;
here's how it works.  (Don't ask me about the math; it's been too long
since I took probability.)

/*
 * Hash table for spelling checker has n bits.
 * Each word w is hashed by k different (modular) hash functions, hi.
 * The bits hi(w), i=1..k, are set for words in the dictionary.
 * Assuming independence, the probability that no word of a d-word
 * dictionary sets a particular bit is given by the Poisson formula
 * P = exp(-y)*y**0/0!, where y=d*k/n.
 * The probability that a random string is recognized as a word is then
 * (1-P)**k.  For given n and d this is minimum when y=log(2), P=1/2,
 * whence one finds, for example, that a 25000-word dictionary in a
 * 400000-bit table works best with k=11.
 */

50% ones and 50% zeros maximizes the information content of the table.
Assuming whoever wrote that comment knows their probability (and that
their assumptions are correct), the chance of a random string being
accepted is thus 1/2048.
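
As a quick check of those numbers (a sketch under the comment's
independence assumption; the program is mine, not part of spell):

#include <stdio.h>
#include <math.h>

/*
 * Check the figures in the comment quoted above: with independent
 * hashes, P(a given bit stays clear) = exp(-d*k/n), and a random
 * string is accepted with probability (1-P)**k.
 */
int main(void)
{
	double d = 25000.0, n = 400000.0;
	int k = 11;
	double kopt = (n / d) * log(2.0);	/* optimum k = (n/d) ln 2 */
	double p = exp(-d * k / n);		/* P(a given bit is 0) */

	printf("optimum k = %.2f, spell uses %d\n", kopt, k);
	printf("fraction of ones = %.3f\n", 1.0 - p);
	printf("P(accept random string) = 1/%.0f (idealized 1/2048)\n",
	    1.0 / pow(1.0 - p, k));
	return 0;
}

This prints an optimum k of about 11.09, a table just under half ones,
and an acceptance rate close to the idealized (1/2)**11 = 1/2048.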

					der Mouse

Smart mailers: mouse@mcgill-vision.uucp
USA: {ihnp4,decvax,akgua,utzoo,etc}!utcsri!musocs!mcgill-vision!mouse
     think!mosart!mcgill-vision!mouse
ARPAnet: think!mosart!mcgill-vision!mouse@harvard.harvard.edu