mike@cs.vu.nl (Mike Marcel Jonkmans) (08/27/90)
All right, I had nothing to do, so for fun I typed the following:

(csh)% spell < /usr/dict/words > error
(csh)% cat error
belying
revisable
(csh)%

What's so special about belying and revisable?

--
  Mike Jonkmans. (mike@cs.vu.nl)
  ..!uunet!mcsun!botter!mike
mayne@VSSERV.SCRI.FSU.EDU (William (Bill) Mayne) (08/27/90)
In article <7385@star.cs.vu.nl> mike@cs.vu.nl (Mike Marcel Jonkmans) writes:
>All right, I had nothing to do, so for fun I typed the following:
>
>(csh)% spell < /usr/dict/words > error
>(csh)% cat error
>belying
>revisable
>(csh)%
>
>What's so special about belying and revisable?
>
>--
>  Mike Jonkmans. (mike@cs.vu.nl)
>  ..!uunet!mcsun!botter!mike

Interesting. I tried the same thing on our VS system and found the
following words from /usr/dict/words which spell did not accept:

	acclimatize
	belying		(same as your list)
	implementer
	Remus
	revisable	(same as your list)
	vis

My guess would be that an out-of-date hash file would account for such
errors, but finding the same two words on different systems seems too
much of a coincidence to support that theory. Maybe others on the net
can try this on their systems and see what anomalies they get.

I have also found another interesting deficiency in spell. It appears
to test only whether the hashed value of a string matches the hashed
value of some word from the dictionary. For relatively long strings
there are many hash collisions which cause nonwords to be accepted.

I have run some tests which confirm that this is the reason for the
false acceptance. For example, passing all 8! permutations of
"abcdefgh" through /usr/lib/spell (the executable which does the actual
spell check for the spell script) finds 10 strings which are accepted,
none of which are really words. Trying the 9! permutations of
"abcdefghi" gives 159 accepted strings. Note that this is more than 9
times as many, which indicates to me that the hashing algorithm is
better at generating unique codes for short strings, a desirable
feature, I think.

When I experimented with this some time ago I found that the word
"receptionist" had many thousands of accepted anagrams. Recently I
tried to duplicate this and ran out of disk space. The partial result
showed 2380 anagrams beginning with 'c' alone. (For efficiency my
program generates them in collating sequence, hence I found all the 'c'
words and some of the 'e' words before running out of space.)

These results have been about the same using VS, AIX (on an RT), SCO
Xenix, and SunOS, often getting exact matches across different systems,
indicating that the same dictionary and hashing algorithm are being
used.

P.S. to Brad Appleton of Harris Computer Systems, who sent me email
asking for a copy of the program I used to find this: I got your
request. I would be glad to share the code. Unfortunately my system
doesn't seem to know how to send email to yours. Can you suggest a
routing which might work? I'm not very good at figuring out the
mysteries of Internet.
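The false-acceptance effect described in this post can be reproduced with a toy model. The following is only a sketch under stated assumptions: it is not the hashing scheme used by /usr/lib/spell, SHA-256 is a stand-in hash, the word list is synthetic, and every name in it (hash_code, build_table, accepted, false_acceptances) is hypothetical. It keeps nothing but the hash codes of the dictionary words and then counts how many permutations of a letter string pass the check without being words.

```python
import hashlib
from itertools import permutations


def hash_code(word, bits):
    """Map a string to a `bits`-bit integer (SHA-256 is a stand-in hash)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << bits)


def build_table(words, bits):
    """Keep only the hash codes; the words themselves are thrown away."""
    return {hash_code(w, bits) for w in words}


def accepted(word, table, bits):
    """A string passes if its hash code matches that of any dictionary word."""
    return hash_code(word, bits) in table


def false_acceptances(letters, table, bits, real_words):
    """Permutations of `letters` that pass the check but are not real words.

    The letters are sorted first, so permutations() emits candidates in
    lexicographic order. Letters are assumed distinct; repeated letters
    would produce duplicate candidates.
    """
    hits = []
    for perm in permutations(sorted(letters)):
        candidate = "".join(perm)
        if candidate not in real_words and accepted(candidate, table, bits):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    # Hypothetical stand-in word list; a real run would read /usr/dict/words.
    real_words = {"word%d" % i for i in range(2000)}
    for bits in (12, 16, 20):
        table = build_table(real_words, bits)
        hits = false_acceptances("abcdefg", table, bits, real_words)
        print(bits, "bits:", len(hits), "of 5040 permutations accepted")
```

In this model a real word is never rejected, but a nonword is accepted whenever its code happens to land on an occupied slot, so the number of false acceptances grows as the table gets more crowded relative to the number of hash bits.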
gwyn@smoke.BRL.MIL (Doug Gwyn) (08/28/90)
In article <517@sun13.scri.fsu.edu> mayne@VSSERV.SCRI.FSU.EDU (William (Bill) Mayne) writes:
>... For relatively long strings there are many hash collisions which
>cause nonwords to be accepted.

Yes, and if there are too many arcane-but-correct spellings in the word
list that "spell" uses, many misspellings will not be detected because
the misspelling matches some arcane, unintended word. When it comes to
"spell", bigger is definitely not better.
vlr@litwin.com (Vic Rice) (08/29/90)
gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
>In article <517@sun13.scri.fsu.edu> mayne@VSSERV.SCRI.FSU.EDU (William (Bill) Mayne) writes:
>>... For relatively long strings there are many hash collisions which
>>cause nonwords to be accepted.
>Yes, and if there are too many arcane-but-correct spellings in the word
>list that "spell" uses, many misspellings will not be detected because
>the misspelling matches some arcane, unintended word. When it comes to
>"spell", bigger is definitely not better.

I just tried this on my system: SCO Opendesktop (SYSV R3.2.1). No
misspellings were flagged.

--
Dr. Victor L. Rice
Litwin Process Automation
avg@hq.demos.su (Vadim G. Antonov) (08/29/90)
In article <517@sun13.scri.fsu.edu> mayne@VSSERV.SCRI.FSU.EDU (William (Bill) Mayne) writes:
>... For relatively long strings there are many hash collisions which
>cause nonwords to be accepted.

You could defeat such effects by splitting the dictionary into several
pieces (with different hash tables) and patching spell's shell script
to call the spellers like this:

	spell {hash1} | spell {hash2} | spell {hash3}

The words that none of the spellers filters out will appear at the end.
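The chained arrangement suggested in this post can be modelled as a sequence of filters, each of which removes the words its own table accepts. This is a minimal sketch, not a patch to the real spell script: the names split_dictionary, make_speller and chained_spell are hypothetical, the salts "hash1" to "hash3" merely echo the placeholders in the post, and SHA-256 with a per-piece salt stands in for "different hash tables".

```python
import hashlib


def hash_code(word, salt, bits):
    """Salted hash, so each dictionary piece effectively gets its own hash."""
    data = (salt + ":" + word).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % (1 << bits)


def split_dictionary(words, pieces):
    """Deal the sorted word list round-robin into `pieces` sublists."""
    ordered = sorted(words)
    return [ordered[i::pieces] for i in range(pieces)]


def make_speller(words, salt, bits=20):
    """Return a filter that passes on only the words its table rejects."""
    table = {hash_code(w, salt, bits) for w in words}

    def speller(candidates):
        return [c for c in candidates if hash_code(c, salt, bits) not in table]

    return speller


def chained_spell(candidates, spellers):
    """Model of `spell {hash1} | spell {hash2} | spell {hash3}`."""
    for speller in spellers:
        candidates = speller(candidates)
    return candidates


if __name__ == "__main__":
    # Hypothetical six-word dictionary, split into three pieces.
    dictionary = ["apple", "banana", "cherry", "grape", "lemon", "melon"]
    pieces = split_dictionary(dictionary, 3)
    spellers = [make_speller(p, "hash%d" % i) for i, p in enumerate(pieces, 1)]
    print(chained_spell(["apple", "aple", "melon", "mellon"], spellers))
```

A string still slips through if any one of the tables accepts it, so how much the split helps depends on how sparsely each table is filled; the post does not say how the tables would be sized.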
gwyn@smoke.BRL.MIL (Doug Gwyn) (08/30/90)
In article <1990Aug29.002759.4012@litwin.com> vlr@litwin.com (Vic Rice) writes:
>I just tried this on my system : SCO Opendesktop (SYSV R3.2.1). No
>misspellings were flagged.

If the examples I cited were not flagged by your version of "spell",
then you might as well not use "spell" since it fails to report
probable misspellings.

"spell" is not intended to be like "lint", where any output at all is
considered to indicate a problem. Rather, it is meant to present you
with a list of "questionable" words that you then manually verify.
The "local word list" (+) option of "spell" allows you to create a
selective list of words that are correct but would otherwise be
reported as "hits" by "spell"; this suffices to keep the volume of
"spell" output manageable.

"manger" should be reported by an ideal implementation of "spell",
even though it's a well-known word, due to the expectation that this
is much more likely to be a misspelling of an intended "manager" than
a correctly-spelled "manger". Unfortunately, many "spell" word lists
are overly inclusive, thereby reducing the utility of "spell".
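The policy argued for in this post, report rare-but-valid words as questionable and let a local word list silence the ones the writer really means, might be sketched as follows. This is an illustration of the idea only, not how "spell" is implemented: questionable_words, stop_list and local_list are hypothetical names, and the word sets are made up for the example.

```python
def questionable_words(text_words, dictionary, stop_list=(), local_list=()):
    """Return the words a writer should check by hand, sorted.

    A word is reported if it is missing from `dictionary`, or if it is in
    `stop_list` (valid, but more likely a typo for a commoner word).
    Words in `local_list` are never reported. Matching ignores case.
    """
    known = {w.lower() for w in dictionary}
    stop = {w.lower() for w in stop_list}
    local = {w.lower() for w in local_list}

    seen = set()
    report = []
    for word in text_words:
        key = word.lower()
        if key in seen:
            continue  # report each distinct word only once
        seen.add(key)
        if key in local:
            continue  # the writer has vouched for this word
        if key not in known or key in stop:
            report.append(word)
    return sorted(report)


if __name__ == "__main__":
    # Hypothetical word sets for illustration.
    dictionary = {"the", "manger", "manager", "approved", "budget"}
    text = ["The", "manger", "approved", "teh", "budget"]
    print(questionable_words(text, dictionary, stop_list={"manger"}))
    print(questionable_words(text, dictionary, stop_list={"manger"},
                             local_list={"manger"}))
```

With the stop list in place "manger" is reported even though it is in the dictionary; adding it to the local list, as someone writing about a nativity scene might, suppresses the report again.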
dhesi%cirrusl@oliveb.ATC.olivetti.com (Rahul Dhesi) (08/31/90)
In <13699@smoke.BRL.MIL> gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
>Unfortunately, many "spell" word lists are overly inclusive,
>thereby reducing the utility of "spell".

The number of different words in the spelling list is unfortunately
rather a selling point for word processors. "Over 50000 words..."
etc. are claimed in the ads. The public is truly gullible.
--
Rahul Dhesi <dhesi%cirrusl@oliveb.ATC.olivetti.com>
UUCP: oliveb!cirrusl!dhesi