kaplan@cmcl2.UUCP (03/04/87)
Here's a good one. It really makes you lose confidence in UNIX text processing tools, especially 'spell'. Try the following: prompt > spell -v aniversery ^D prompt > (Don't type 'prompt >', that's your prompt.) spell does not flag the word as being misspelled. I tried this on various vanilla (direct from the factory ) systems and none thought that the word was misspelled. the -x option says that the root is the whole word and spellout says that the word IS in the hlist. However, when I made a new hlist directly from a words file that did not have the word, it still seemed to appear in the hlist! This would lend me to believe that the bug is in spellin. Anyone who can show me why this is not wrong or explain it better, please send me mail (I do not subscribe to this newsgroup) and I will summarize to the net if necessary. Laurence S. Kaplan NYU Ultracomputer Research Project 715 Broadway Rm. 1005 New York, NY 10003 (212) 460-7327 arpa: kaplan@cmcl2 uucp: {ihnp4,seismo}!cmcl2!kaplan
thomas%spline.uucp@utah-gr.UUCP (Spencer W. Thomas) (03/06/87)
Spell uses a hash table to determine correctly spelled words. The chance of a misspelled word hitting a `1' in the hash table is small, but non-zero. Spell will pass many totally garbage sequences of characters as being spelled "correctly". It is unfortunate that your misspelling of "anniversary" gets a hit. We should note, however, that spell would run *much* slower if it attempted to look each word up in a dictionary. =Spencer ({ihnp4,decvax}!utah-cs!thomas, thomas@utah-cs.ARPA)
jerry@oliveb.UUCP (Jerry F Aguirre) (03/10/87)
In article <1943@utah-gr.UUCP> thomas%spline.UUCP@utah-gr.UUCP (Spencer W. Thomas) writes: >that spell would run *much* slower if it attempted to look each word >up in a dictionary. Not true! I have a version of spell that I wrote myself. It uses a long word list (>90,000 words). An index based on the first two letters provides a seek pointer into the list and the program "guestimates" the final position based on letter frequency. I like it because it displays the entire input file with unfound words in reverse video. This allows me to see the exceptions in context and verify whether I really wanted that funny character string. Also it is not subject to random acceptance of funny strings. I have run timing test comparing my version to the standard spell and it runs about 10% slower. Acceptable to me given the greater functionality. I also tried out a version using "dbm" to access the word list and the speed was comparable. This could add some improved functionality as the entry could contain additional information about afixes and such. The speed was comparable. Jerry Aguirre
kaplan@cmcl2.UUCP (Laurence S. Kaplan) (03/18/87)
I finally got around to reading through the responses to my posting about the bug in spell. To refresh your memory I had the word "aniversery" accepted by spell with an hlist that I knew did not contain the word. The best (and only) description of what must have happened follows: ***** start of response ***** From: mcnc.org!ecsvax!dukeac!bet@seismo.UUCP Perhaps (I'm not sure) you've found an instance of a potential failure mode that spell has always had. I had never heard of an example before. Here's the situation: spell(1) wants to have a reasonably huge number of possible words in its dictionary, wants to be able to run in a reasonably small amount of memory, and wants to be FAST. So, they set up a table in memory (50K bytes) and treat it as a bit array (400K bits). Then they compute, for each word, N independant hash functions in the range [0-400K]. For each word, they turn on the N bits at the locations identified by the N hash functions. N is chosen depending on the size of the hash table and the number of words, so that approximately 1/2 of the bits are turned on in the final table (which obviously maximizes the information content of the table). This comes to 11 in the standard UNIX spell, if I recall correctly. This means that the odds of any random string being in the dictionary are 1 in 2**11 == 1/2048 -- right small. Seems like you might have found such an example, however. If I recall the implementation of the full spell(1) command, it is a shell script that calls the spelling check program twice, first with a stop file which is a hash table of words to insist are misspelled, regardless of whether they are in the dictionary, then runs the remaining words through the regular dictionary. The stop list mechanism was put in place, as I recall, as a method of trapping words that would be (mistakenly) accepted due to misbehavior in the prefix/suffix stripper (the example I recall is "thier == thy - y + ier"); however, it seems to me you should be able to add your example word to the hstop word list and rebuild the hstop table, and fix the problem that way. -Bennett -- Bennett Todd, Duke User Services, Durham, NC 27706-7756; +1 919 684 3695 UUCP: ...{philabs,akgua,decvax,ihnp4}!mcnc!ecsvax!dukeac!bet BITNET: DBTODD@TUCC ***** end of response ***** While I did not get around to trying this fix, it does sound good. Good luck to everyone not having this happen to them with other mispelled words! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Laurence S. Kaplan | NYU Ultracomputer Research Project ||| 715 Broadway Rm. 1005 ||||| New York, NY 10003 ||||| (212) 460-7327 --- //\ --- arpa: kaplan@cmcl2 ----/ \ --- uucp: {ihnp4,seismo}!cmcl2!kaplan ---- ----