[comp.unix.questions] spell and /usr/dict/words.

mike@cs.vu.nl (Mike Marcel Jonkmans) (08/27/90)

All right, I had nothing to do, so for fun I typed the following:

(csh)% spell < /usr/dict/words > error
(csh)% cat error
belying
revisable
(csh)% 

What's so special about belying and revisable?


--

			Mike Jonkmans.  (mike@cs.vu.nl)
			       ..!uunet!mcsun!botter!mike

mayne@VSSERV.SCRI.FSU.EDU (William (Bill) Mayne) (08/27/90)

In article <7385@star.cs.vu.nl> mike@cs.vu.nl (Mike Marcel Jonkmans) writes:
>All right, I had nothing to do, so for fun I typed the following:
>
>(csh)% spell < /usr/dict/words > error
>(csh)% cat error
>belying
>revisable
>(csh)% 
>
>What's so special about belying and revisable?
Interesting. I tried the same thing on our VS system and found the
following words from /usr/dict/words which spell did not accept:

acclimatize
belying (same as your list)
implementer
Remus
revisable (same as your list)
vis

My guess would be that an out-of-date hash file would account for
such errors, but finding the same two words on different systems seems
too much of a coincidence to support that theory. Maybe others on
the net can try this on their systems and see what anomalies they get.

I have also found another interesting deficiency in spell. It appears
to test only whether the hashed value of a string matches the hashed
value of some word from the dictionary. For relatively long
strings there are many hash collisions which cause nonwords to be
accepted. I have run some tests which confirm that this is the reason
for the false acceptances.
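
A minimal sketch of that failure mode, assuming a table that stores only
hash codes. The 8-bit width, the use of CRC-32, and the names hash_code,
build_table, accepted and colliding_pair are illustrative assumptions,
not spell's actual algorithm:

```python
import zlib
from itertools import permutations

HASH_BITS = 8  # assumption: tiny on purpose so collisions show up quickly


def hash_code(word):
    """Stand-in hash (CRC-32 truncated to HASH_BITS); not spell's real one."""
    return zlib.crc32(word.encode("utf-8")) & ((1 << HASH_BITS) - 1)


def build_table(words):
    """Keep only the hash codes; the words themselves are thrown away."""
    return {hash_code(w) for w in words}


def accepted(table, candidate):
    """Accept any string whose code is in the table, word or not."""
    return hash_code(candidate) in table


def colliding_pair(letters):
    """Return two different rearrangements of `letters` with the same code.

    With more rearrangements than possible codes, such a pair must exist.
    """
    first_seen = {}
    for p in permutations(letters):
        s = "".join(p)
        code = hash_code(s)
        if code in first_seen and first_seen[code] != s:
            return first_seen[code], s
        first_seen.setdefault(code, s)
    return None


if __name__ == "__main__":
    real, fake = colliding_pair("abcdefgh")
    table = build_table([real])               # pretend `real` is the only word
    print(real, fake, accepted(table, fake))  # the last field prints True
```

Because the table never sees the original strings, it cannot tell the
"word" from the string that collides with it.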

For example, passing all 8! permutations of "abcdefgh" through 
/usr/lib/spell (the executable which does the actual spell check for 
the spell script) finds 10 strings which are accepted, none of which
are really words. Trying the 9! permutations of "abcdefghi" gives
159 accepted strings. Note that this is more than 9 times as many,
which indicates to me that the hashing algorithm is better at generating
unique codes for short strings, a desirable feature, I think. When
I experimented with this some time ago I found that the word
"receptionist" had many thousands of accepted anagrams. Recently I tried
to duplicate this and ran out of disk space. The partial result showed
2380 anagrams beginning with 'c' alone. (For efficiency my program
generates them in collating sequence, hence I found all the 'c' words
and some of the 'e' words before running out of space.) These results
have been about the same using VS, AIX (on an RT), SCO Xenix, and SunOS,
often with exact matches across different systems, indicating that
the same dictionary and hashing algorithm are being used.
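
The experiment can be sketched roughly as follows. This is not the
original program; count_accepted and acceptance_rate are hypothetical
names, and is_accepted stands for whatever checker is being tested
(for instance a wrapper around /usr/lib/spell). Counting hits instead of
writing them out sidesteps the disk-space problem:

```python
from itertools import permutations
from math import factorial


def count_accepted(letters, is_accepted):
    """Count rearrangements of `letters` that the checker accepts.

    Sorting first makes itertools.permutations yield the strings in
    collating order. Assumes the letters are distinct; repeated letters
    would make the same string come up more than once.
    """
    seen = 0
    hits = 0
    for p in permutations(sorted(letters)):
        seen += 1
        if is_accepted("".join(p)):
            hits += 1
    return hits, seen


def acceptance_rate(hits, length):
    """Fraction of the length! permutations of distinct letters accepted."""
    return hits / factorial(length)


# Figures reported above: 10 hits out of 8!, 159 hits out of 9!.
rate8 = acceptance_rate(10, 8)
rate9 = acceptance_rate(159, 9)
print(f"8 letters: {rate8:.6f}  9 letters: {rate9:.6f}  "
      f"ratio: {rate9 / rate8:.2f}")
```

On the reported figures the per-string acceptance rate rises by a factor
of about 1.77 going from 8 to 9 letters (159 is 15.9 times 10, while 9!
is only 9 times 8!).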

P.S. to Brad Appleton of Harris Computer Systems, who sent me email
asking for a copy of the program I used to find this: I got your
request. I would be glad to share the code. Unfortunately my system
doesn't seem to know how to send email to yours. Can you suggest
a routing which might work? I'm not very good at figuring out the
mysteries of the Internet.

gwyn@smoke.BRL.MIL (Doug Gwyn) (08/28/90)

In article <517@sun13.scri.fsu.edu> mayne@VSSERV.SCRI.FSU.EDU (William (Bill) Mayne) writes:
>... For relatively long strings there are many hash collisions which
>cause nonwords to be accepted.

Yes, and if there are too many arcane-but-correct spellings in the word
list that "spell" uses, many misspellings will not be detected because
the misspelling matches some arcane, unintended word.  When it comes to
"spell", bigger is definitely not better.

vlr@litwin.com (Vic Rice) (08/29/90)

gwyn@smoke.BRL.MIL (Doug Gwyn) writes:

>In article <517@sun13.scri.fsu.edu> mayne@VSSERV.SCRI.FSU.EDU (William (Bill) Mayne) writes:
>>... For relatively long strings there are many hash collisions which
>>cause nonwords to be accepted.

>Yes, and if there are too many arcane-but-correct spellings in the word
>list that "spell" uses, many misspellings will not be detected because
>the misspelling matches some arcane, unintended word.  When it comes to
>"spell", bigger is definitely not better.

I just tried this on my system: SCO Open Desktop (SYSV R3.2.1). No
misspellings were flagged.
-- 
Dr. Victor L. Rice
Litwin Process Automation

avg@hq.demos.su (Vadim G. Antonov) (08/29/90)

In article <517@sun13.scri.fsu.edu> mayne@VSSERV.SCRI.FSU.EDU (William (Bill) Mayne) writes:
>... For relatively long strings there are many hash collisions which
>cause nonwords to be accepted.

	You could defeat such effects by splitting the dictionary
	into several different pieces (with different hash tables)
	and patching spell's shell script to call the spellers like this:

		spell {hash1} | spell {hash2} | spell {hash3}

	Only the words rejected by every speller will appear at the end.
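
One way to read this suggestion, sketched in Python rather than as a
patch to the spell script. The names make_stage and chain, the salt
argument and the 16-bit CRC-32 stand-in hash are all assumptions made
for illustration; each stage behaves like one "spell {hashN}" in the
pipeline and passes along only the words it does not recognise:

```python
import zlib


def make_stage(words, salt, bits=16):
    """Build one filter from a slice of the dictionary with its own hash."""
    mask = (1 << bits) - 1

    def code(w):
        # The salt gives each stage a different hash function (assumption).
        return zlib.crc32((salt + w).encode("utf-8")) & mask

    table = {code(w) for w in words}

    def stage(candidates):
        # Like spell, emit only the words this stage does not recognise.
        return (w for w in candidates if code(w) not in table)

    return stage


def chain(stages, candidates):
    """Run the candidates through every stage, like a shell pipeline."""
    for stage in stages:
        candidates = stage(candidates)
    return list(candidates)


if __name__ == "__main__":
    pieces = [["apple", "banana"], ["cherry"], ["date"]]
    stages = [make_stage(p, str(i)) for i, p in enumerate(pieces)]
    print(chain(stages, ["banana", "cherry", "date", "apple"]))  # prints []
```

A word from any piece of the dictionary is removed by the stage that
holds it, so only strings unknown to every stage reach the end.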

gwyn@smoke.BRL.MIL (Doug Gwyn) (08/30/90)

In article <1990Aug29.002759.4012@litwin.com> vlr@litwin.com (Vic Rice) writes:
>I just tried this on my system: SCO Open Desktop (SYSV R3.2.1). No
>misspellings were flagged.

If the examples I cited were not flagged by your version of "spell",
then you might as well not use "spell" since it fails to report probable
misspellings.

"spell" is not intended to be like "lint", where any output at all is
considered to indicate a problem.  Rather, it is meant to present you
with a list of "questionable" words that you then manually verify.
The "local word list" (+) option of "spell" allows you to create a
selective list of words that are correct but would otherwise be
reported as "hits" by "spell"; this suffices to keep the volume of
"spell" output manageable.

"manger" should be reported by an ideal implementation of "spell",
even though it's a well-known word, due to the expectation that this is
much more likely to be a misspelling of an intended "manager" than a
correctly-spelled "manger".  Unfortunately, many "spell" word lists are
overly inclusive, thereby reducing the utility of "spell".
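
A small illustration of the point, with made-up word lists and a
hypothetical helper named flagged (plain set lookup, no hashing, and not
how "spell" itself is implemented). The local_words argument plays the
role of the local word list:

```python
def flagged(text_words, word_list, local_words=()):
    """Return the words to verify by hand: those in neither list."""
    known = set(word_list) | set(local_words)
    return [w for w in text_words if w not in known]


lean = ["the", "manager", "signed", "report"]
inclusive = lean + ["manger"]          # bigger list, knows the rarer word
typed = ["the", "manger", "signed", "the", "uunet", "report"]

print(flagged(typed, lean))                         # ['manger', 'uunet']
print(flagged(typed, lean, local_words=["uunet"]))  # ['manger']
print(flagged(typed, inclusive))                    # ['uunet']
```

With the lean list the probable typo is reported and the local list
trims the noise; with the inclusive list the typo passes silently.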

dhesi%cirrusl@oliveb.ATC.olivetti.com (Rahul Dhesi) (08/31/90)

In <13699@smoke.BRL.MIL> gwyn@smoke.BRL.MIL (Doug Gwyn) writes:

     Unfortunately, many "spell" word lists are overly inclusive,
     thereby reducing the utility of "spell".

The number of different words in the spelling list is unfortunately
rather a selling point for word processors.  "Over 50000 words..."
etc. are claimed in the ads.  The public is truly gullible.
--
Rahul Dhesi <dhesi%cirrusl@oliveb.ATC.olivetti.com>
UUCP:  oliveb!cirrusl!dhesi