[comp.bugs.4bsd] bug in spell

kaplan@cmcl2.UUCP (03/04/87)

Here's a good one.  It really makes you lose confidence in UNIX text processing
tools, especially 'spell'.  Try the following:

prompt > spell -v
aniversery
^D
prompt >

(Don't type 'prompt >', that's your prompt.)
spell does not flag the word as being misspelled.  I tried this on various
vanilla (direct from the factory ) systems and none thought that the word was
misspelled.  the -x option says that the root is the whole word and spellout
says that the word IS in the hlist.  However, when I made a new hlist directly
from a words file that did not have the word,  it still seemed to appear in the
hlist!  This would lend me to believe that the bug is in spellin.  Anyone
who can show me why this is not wrong or explain it better, please send me
mail (I do not subscribe to this newsgroup) and I will summarize to the net
if necessary.

Laurence S. Kaplan
NYU Ultracomputer Research Project
715 Broadway  Rm. 1005
New York, NY  10003
(212) 460-7327
arpa:	kaplan@cmcl2
uucp:	{ihnp4,seismo}!cmcl2!kaplan

thomas%spline.uucp@utah-gr.UUCP (Spencer W. Thomas) (03/06/87)

Spell uses a hash table to determine correctly spelled words.  The
chance of a misspelled word hitting a `1' in the hash table is small,
but non-zero.  Spell will pass many totally garbage sequences of
characters as being spelled "correctly".  It is unfortunate that your
misspelling of "anniversary" gets a hit.  We should note, however,
that spell would run *much* slower if it attempted to look each word
up in a dictionary.

=Spencer   ({ihnp4,decvax}!utah-cs!thomas, thomas@utah-cs.ARPA)

jerry@oliveb.UUCP (Jerry F Aguirre) (03/10/87)

In article <1943@utah-gr.UUCP> thomas%spline.UUCP@utah-gr.UUCP (Spencer W. Thomas) writes:
>that spell would run *much* slower if it attempted to look each word
>up in a dictionary.

Not true!  I have a version of spell that I wrote myself.  It uses a
long word list (>90,000 words).  An index based on the first two letters
provides a seek pointer into the list and the program "guestimates" the
final position based on letter frequency.

I like it because it displays the entire input file with unfound words
in reverse video.  This allows me to see the exceptions in context and
verify whether I really wanted that funny character string.  Also it is
not subject to random acceptance of funny strings.

I have run timing test comparing my version to the standard spell and it
runs about 10% slower.  Acceptable to me given the greater
functionality.

I also tried out a version using "dbm" to access the word list and the
speed was comparable.  This could add some improved functionality as the
entry could contain additional information about afixes and such.  The
speed was comparable.

					Jerry Aguirre

kaplan@cmcl2.UUCP (Laurence S. Kaplan) (03/18/87)

I finally got around to reading through the responses to my posting about the 
bug in spell.  To refresh your memory I had the word "aniversery" accepted by 
spell with an hlist that I knew did not contain the word.  The best (and only)
description of what must have happened follows:

***** start of response *****

From: mcnc.org!ecsvax!dukeac!bet@seismo.UUCP

Perhaps (I'm not sure) you've found an instance of a potential failure mode
that spell has always had. I had never heard of an example before. Here's the
situation: spell(1) wants to have a reasonably huge number of possible words
in its dictionary, wants to be able to run in a reasonably small amount of
memory, and wants to be FAST. So, they set up a table in memory (50K bytes)
and treat it as a bit array (400K bits). Then they compute, for each word, N
independant hash functions in the range [0-400K]. For each word, they turn on
the N bits at the locations identified by the N hash functions. N is chosen
depending on the size of the hash table and the number of words, so that
approximately 1/2 of the bits are turned on in the final table (which
obviously maximizes the information content of the table). This comes to 11 in
the standard UNIX spell, if I recall correctly. This means that the odds of
any random string being in the dictionary are 1 in 2**11 == 1/2048 -- right
small. Seems like you might have found such an example, however. If I recall
the implementation of the full spell(1) command, it is a shell script that
calls the spelling check program twice, first with a stop file which is a hash
table of words to insist are misspelled, regardless of whether they are in the
dictionary, then runs the remaining words through the regular dictionary. The
stop list mechanism was put in place, as I recall, as a method of trapping
words that would be (mistakenly) accepted due to misbehavior in the
prefix/suffix stripper (the example I recall is "thier == thy - y + ier");
however, it seems to me you should be able to add your example word to the
hstop word list and rebuild the hstop table, and fix the problem that way. 

-Bennett
-- 
Bennett Todd, Duke User Services, Durham, NC 27706-7756; +1 919 684 3695
UUCP: ...{philabs,akgua,decvax,ihnp4}!mcnc!ecsvax!dukeac!bet
BITNET: DBTODD@TUCC

***** end of response *****

While I did not get around to trying this fix, it does sound good.  Good luck
to everyone not having this happen to them with other mispelled words!

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Laurence S. Kaplan						     |
NYU Ultracomputer Research Project				    |||
715 Broadway  Rm. 1005						   |||||
New York, NY  10003						   |||||
(212) 460-7327							--- //\ ---
arpa:	kaplan@cmcl2						----/ \ ---
uucp:	{ihnp4,seismo}!cmcl2!kaplan				----   ----