larry@utecfb.Toronto.Edu (Larry Philps) (03/13/87)
It seems to me, from looking at the makefile, that the old /usr/dict/words
is used when creating the hash file for spell.  Now 4.3 comes with the
entire Webster's 2nd edition on line in /usr/dict/web2.  Is there any
reason why this is not used for making the hash file?

Of course I could go and read the code myself and look for a limit, or I
could just try it, but with all you geniuses out there - why bother. :-)
Has anyone done this?

Larry Philps
Engineering Computing Facility
University of Toronto

NEW PATH: larry@ecf.toronto.edu
USENET:   {linus, ihnp4, allegra, decvax, floyd}!utcsri!ecfb!larry
CSNET:    larry@Toronto
ARPA:     larry%Toronto@CSNet-Relay
BITNET:   larry@ecf.utoronto.BITNET
mike@hcr.UUCP (Mike Tilson) (03/16/87)
The question was whether one would get a better "spell" hash table by
building it from the Webster's word list rather than from the much
shorter dictionary used by the "spell" makefile.  I believe the answer
is "no".

First, if you do greatly expand the dictionary, remember to increase the
size of the hash table.  The algorithm works best when the final bit
table has about 50% ones and 50% zeros.  With a larger dictionary and no
increase in table size, you will turn on more bits.  In the limit, all
bits will be on, and all input to spell will be accepted as correct
spelling.

The biggest problem is that a large dictionary contains many "strange"
words.  Many of these words differ from common words by only one letter
(or one letter transposition).  This means you increase the chance of a
typo turning into another valid word in the dictionary.  It turns out
that you are better off using a smaller dictionary.  If you use a very
rare word, spell will report it, but you simply ignore the report.  On
the other hand, more of your typing and spelling errors will also be
reported.

I believe there was a paper that discussed this subject in CACM a few
months back (I don't have it at hand, so it might have been in some
other journal).

/Michael Tilson, HCR Corporation
{utzoo,utcsri,ihnp4,...}!hcr!mike
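A minimal sketch of the arithmetic behind the "all bits will be on"
point, assuming an n-bit table probed k times per word; the n, k, and
dictionary sizes below are illustrative (borrowed from the numbers
quoted later in this thread), not spell's actual build parameters:

/* Illustrative only: expected fill and false-accept rate of an n-bit
 * hash table probed k times per word, as the dictionary grows with
 * the table size held fixed.  All the numbers are assumptions. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double n = 400000.0;                       /* bits in the table */
    double k = 11.0;                           /* hash probes per word */
    double d;

    for (d = 25000.0; d <= 200000.0; d *= 2.0) {
        double fill = 1.0 - exp(-d * k / n);   /* expected fraction of ones */
        double fp = pow(fill, k);              /* P(random string accepted) */
        printf("words=%6.0f  fill=%.3f  P(false accept)=%.6f\n",
               d, fill, fp);
    }
    return 0;
}

At 25000 words the table is about half full and a random string slips
through roughly once in 2000 tries; at 200000 words the table is over
99% ones and more than 95% of random strings are accepted.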
ken@rochester.ARPA (SKY) (03/18/87)
|The biggest problem is that a large dictionary contains many "strange"
|words.  Many of these words differ from common words by only one
|letter (or one letter transposition).  This means you increase the
|chance of a typo turning into another valid word in the dictionary.

Indeed.  I know of a case where "action" was mistyped as "cation" but
the dictionary knew too much about chemistry.  There is some kind of
trade-off here.  Anybody know of studies of this sort?

	Ken
ccplumb@watnot.UUCP (03/18/87)
In article <2563@hcr.UUCP> mike@hcr.UUCP (Mike Tilson) writes:
>The question was whether one would get a better "spell" hash table
>by building it from the Webster's word list rather than from the
>much shorter dictionary used by the "spell" makefile.
>
>I believe the answer is "no".
>
>First, if you do greatly expand the dictionary, remember to increase
>the size of the hash table.  The algorithm works best when the final
>bit table has about 50% ones and 50% zeros.  With a larger dictionary
>and no increase in table size, you will turn on more bits.  In the
>limit, all bits will be on, and all input to spell will be accepted
>as correct spelling.

Um... not quite.  If the table is 50% ones, then 50% of any collection
of random character strings would be accepted as correct.  As I recall
from Jon Bentley's discussion of spell in Programming Pearls, the hash
table size was chosen so that, assuming 20 reported problems per run, a
misspelling would slip through about once every hundred runs, so the
ones should fill up 1/2000 of the hash table.
--
-Colin Plumb (watmath!watnot!ccplumb)

Zippy says:
My CODE of ETHICS is vacationing at famed SCHROON LAKE in upstate New
York!!
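Spelling out that arithmetic (a sketch; the 20-per-run and
once-per-hundred-runs figures are Colin's recollection, not verified
against the book):

/* If a run flags about 20 words and we tolerate one missed
 * misspelling per 100 runs, each wrong word may be falsely accepted
 * with probability 1/(20*100) = 1/2000.  With one hash bit per word,
 * that is also the fraction of ones in the table. */
#include <stdio.h>

int main(void)
{
    double flagged_per_run = 20.0;   /* misspellings reported per run */
    double runs_per_slip = 100.0;    /* tolerate one miss per 100 runs */

    printf("target P(false accept) = %g\n",
           1.0 / (flagged_per_run * runs_per_slip));   /* 0.0005 */
    return 0;
}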
mouse@mcgill-vision.UUCP (04/02/87)
In article <12630@watnot.UUCP>, ccplumb@watnot.UUCP writes:
> In article <2563@hcr.UUCP> mike@hcr.UUCP (Mike Tilson) writes:
>> The algorithm works best when the final bit table has about 50% ones
>> and 50% zeros.
> Um... not quite.  If the table is 50% ones, then 50% of any
> collection of random character strings would be accepted as correct.
> As I recall from Jon Bentley's discussion of spell in Programming
> Pearls, the hash table size was chosen so that, assuming 20 reported
> problems per run, a misspelling would slip through about once every
> hundred runs, so the ones should fill up 1/2000 of the hash table.

Which spell are you talking about?  Ours uses multiple bits per word;
here's how it works....  (Don't ask me about the math; it's been too
long since I took probability.)

/*
 * Hash table for spelling checker has n bits.
 * Each word w is hashed by k different (modular) hash functions, hi.
 * The bits hi(w), i=1..k, are set for words in the dictionary.
 * Assuming independence, the probability that no word of a d-word
 * dictionary sets a particular bit is given by the Poisson formula
 *	P = exp(-y)*y**0/0!, where y = d*k/n.
 * The probability that a random string is recognized as a word is then
 * (1-P)**k.  For given n and d this is minimum when y = log(2), P = 1/2,
 * whence one finds, for example, that a 25000-word dictionary in a
 * 400000-bit table works best with k = 11.
 */

50% ones and 50% zeros maximizes the information content of the table.
Assuming whoever wrote that comment knows their probability (and that
their assumptions are correct), the chance of a random string being
accepted is thus (1/2)**11 = 1/2048.

					der Mouse

Smart mailers: mouse@mcgill-vision.uucp
USA: {ihnp4,decvax,akgua,utzoo,etc}!utcsri!musocs!mcgill-vision!mouse
     think!mosart!mcgill-vision!mouse
ARPAnet: think!mosart!mcgill-vision!mouse@harvard.harvard.edu
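A minimal runnable sketch of the scheme that comment describes: k
independent hashes setting and testing bits in an n-bit table.  The
hash function below is an illustrative stand-in, not the modular
hashes the real spell uses:

/* Sketch of the k-hash bit-table scheme described in the spell
 * comment above.  Hash functions and test words are assumptions. */
#include <stdio.h>

#define NBITS 400000L           /* n: bits in table (from the comment) */
#define K     11                /* k: hash functions per word */

static unsigned char table[(NBITS + 7) / 8];

/* i-th hash of word w: same string hash, different seed per i */
static unsigned long hash(const char *w, int i)
{
    unsigned long h = 5381UL + 7919UL * i;
    while (*w)
        h = h * 33UL + (unsigned char)*w++;
    return h % NBITS;
}

static void enter(const char *w)        /* set bits h1(w)..hk(w) */
{
    int i;
    for (i = 0; i < K; i++) {
        unsigned long b = hash(w, i);
        table[b >> 3] |= 1 << (b & 7);
    }
}

static int lookup(const char *w)        /* 1 iff all k bits are set */
{
    int i;
    for (i = 0; i < K; i++) {
        unsigned long b = hash(w, i);
        if (!(table[b >> 3] & (1 << (b & 7))))
            return 0;                   /* definitely not in dictionary */
    }
    return 1;                           /* probably in dictionary */
}

int main(void)
{
    enter("action");
    printf("action: %d  cation: %d\n", lookup("action"), lookup("cation"));
    return 0;
}

Note the asymmetry: lookup() can say "no" with certainty but only
"probably" for "yes".  With the table tuned so that P = 1/2, each of
the k probes passes for a random string with probability 1/2, which is
where the (1/2)**11 = 1/2048 figure comes from.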