dndobrin@athena.MIT.EDU (01/20/87)
Does anyone out there know where the 4.x or System V dictionaries come from? Did someone just copy in a standard dictionary? Were the words chosen by the Bell Labs (?) programmers? How exactly do the suffix and prefix (?) algorithms work? And on and on. Are there many different versions of the dictionary? How many words does it check? Since it's hashed, is there any way of correcting the original? Is somebody out there an authority?

Mostly I'm just curious, but I could also use the answers for an article I'm writing. I will summarize answers and post them to the list.

David N. Dobrin
Lexicom
(617) 492-8035
1612 Cambridge St.
Cambridge MA 02138
dndobrin@athena.mit.edu
dave@sq.UUCP (01/22/87)
In article <2828@brl-adm.ARPA> dndobrin@athena.MIT.EDU writes:
>Does anyone out there know where the 4.x or System V dictionaries come from?
|
>How exactly do the suffix and prefix (?) algorithms work?
>

It makes me wonder too, when words like the following pass spell!

Csvmre Enova Gurbqber Mnzovn Oeraare Qehzzbaq enat enax
enva fbccvat fgnynpgvgr reenapl sbefbbx thys traer ubne
vaqvfpreavoyr zvyy

Running Sun 3/180 Sun Unix 3.0

David R. Seaman ------ {utai,utzoo}!sq!dave
dave@sq.utoronto.bitnet
SoftQuad Inc. (home of sqtroff)
720 Spadina Ave
Toronto, Ontario, Canada
(416) 963-8337
devine@vianet.UUCP (01/27/87)
> In article <2828@brl-adm.ARPA> dndobrin@athena.MIT.EDU writes:
> >Does anyone out there know where the 4.x or System V dictionaries come from?

In article <1987Jan22.110150.29415@sq.uucp>, dave@sq.UUCP writes:
> It makes me wonder too, when words like the following pass spell!
> [list of misspelled words deleted]

Back around 1980 I wrote a spell check program for VAX/VMS. To create
the dictionary I first gathered 100's of files to get the words. Later
I copied the UNIX spelling list from a Version 7 machine. That dictionary
was *loaded* with many misspelled words! I spent hours editing it to cull
the obviously bad words (no fun staring at a screen). There are probably
still some words I missed.

Moral: if it isn't your dictionary beware of who may have added words.

Bob Devine
[ I still have my 95,000 word dictionary if someone wants a copy.]
gwyn@brl-smoke.UUCP (01/28/87)
In article <1987Jan22.110150.29415@sq.uucp> dave@sq.UUCP (David R. Seaman) writes:
>It makes me wonder too, when words like the following pass spell!

With so many different implementations of "spell" loose in the world, it's hard to say much generically (although I seem to remember a good summary in CACM a couple of years back). For example, my version of "spell" only passed four of the more plausible words in the list as "not possibly misspelled".

Note that some (most?) "spells" have a "stop list" which can be expanded to trounce words that would otherwise be accepted as correctly spelled. If a site has a problem with a frequent word like that, that's the simplest "solution".
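
The interaction between derived words and a stop list, as described in the post above, can be sketched roughly as follows. This is only an illustration under stated assumptions: the suffix rules, the `accepted` function, and the sample word sets are all hypothetical and are not taken from any real "spell" implementation.

```python
# Hypothetical suffix rules; real "spell" programs use their own, larger rule sets.
SUFFIXES = ["ing", "ed", "s"]

def accepted(word, dictionary, stop_list):
    """Return True if `word` would pass this toy checker."""
    # The stop list is consulted first, so it overrides any derivation.
    if word in stop_list:
        return False
    if word in dictionary:
        return True
    # Otherwise try stripping one suffix and looking up the remaining stem.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            if word[: -len(suffix)] in dictionary:
                return True
    return False

# Made-up sample data for the illustration.
dictionary = {"walk", "go", "mill"}
stop_list = {"goed"}

print(accepted("walked", dictionary, stop_list))  # derived from "walk"
print(accepted("goed", dictionary, set()))        # wrongly derived from "go"
print(accepted("goed", dictionary, stop_list))    # rejected by the stop list
```

Adding a word to the stop list is a local fix: it rejects that one derived form without touching the suffix rules or the dictionary itself.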
dave@sq.UUCP (01/29/87)
In article <135@vianet.UUCP> devine@vianet.UUCP writes:
>> In article <2828@brl-adm.ARPA> dndobrin@athena.MIT.EDU writes:
>> >Does anyone out there know where the 4.x or System V dictionaries come from?
>
>In article <1987Jan22.110150.29415@sq.uucp>, dave@sq.UUCP writes:
>> It makes me wonder too, when words like the following pass spell!
>> [list of misspelled words deleted]
>
> Back around 1980 I wrote a spell check program for VAX/VMS. To create
>the dictionary I first gathered 100's of files to get the words. Later
>I copied the UNIX spelling list from a Version 7 machine. That dictionary
>was *loaded* with many misspelled words!

The words that I posted above [deleted] are not contained in /usr/dict/words; they are probably generated by the suffix/prefix algorithms that spell(1) and kindred use. Note that the words are found misspelled by SYS V spell(1). Sun Unix 3.0 spell is the command under fire.

Another attribute of the words is that they all exist in the dictionary if rot13'ed.

This line:

I sbefbbx my zvyy before I dropped my fgnynpgvgr in the fbccvat stream.

rot13'd is:

V forsook zl mill orsber V qebccrq zl stalactite va gur sopping fgernz.

The first line passes spell. The second catches fgernz gur orsber qebccrq.

David R. Seaman ------ {utai,utzoo}!sq!dave
dave@sq.utoronto.bitnet
SoftQuad Inc. (home of sqtroff)
720 Spadina Ave
Toronto, Ontario, Canada
(416) 963-8337
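
The rot13 relationship in the post above is easy to check. The sketch below uses Python's standard `codecs` module; the helper name `rot13` is introduced here purely for illustration.

```python
import codecs

def rot13(text: str) -> str:
    """Rotate each ASCII letter 13 places; other characters are unchanged."""
    return codecs.encode(text, "rot_13")

line = "I sbefbbx my zvyy before I dropped my fgnynpgvgr in the fbccvat stream."

print(rot13(line))
# Applying rot13 twice gives back the original text.
print(rot13(rot13(line)) == line)

# Decode a few words from the posted list.
for word in ["Csvmre", "sbefbbx", "zvyy", "fgnynpgvgr"]:
    print(word, "->", rot13(word))
```

Because rot13 is its own inverse, the same function both encodes and decodes.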
jaw@ames.UUCP (01/31/87)
The standard reference for this is "Development of a spelling list," M. D. McIlroy, IEEE Trans. on Communications, Jan. 1982.
dlc@zog.cs.cmu.edu.UUCP (02/03/87)
> Csvmre Enova Gurbqber Mnzovn Oeraare Qehzzbaq enat enax
> enva fbccvat fgnynpgvgr reenapl sbefbbx thys traer ubne
> vaqvfpreavoyr zvyy

What? Are you surprised that spell (4.2bsd) also accepts Rot13? Don't you love undocumented features?
beth@brillig (Beth Katz) (02/04/87)
I am not a Unix expert, but I have looked at 'spell' and how it accepts garbage. I haven't read the papers mentioned previously.

One reason why 'spell' accepts so much garbage is that it uses a hashed list of acceptable words. On many systems I have seen, this list is 50000 bytes. Given all the garbage that can be generated by random combinations of letters, you run out of space in that table very quickly. 'spell' was designed to catch misspelled words rather than to filter out absolute garbage.

The stop lists catch words that could be created through transformations but that are misspelled nonetheless. You can do some extra transformations to clean up the lists if you've fed 'spell' real garbage, but for most situations, it doesn't matter all that much.

Beth Katz
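
Why a hashed word list lets some garbage through can be shown with a toy table. This is a minimal sketch under loud assumptions: the MD5-based `slot` function, the deliberately tiny 64-slot table, and the sample words are all made up for the illustration and do not reproduce the actual 'spell' hash format.

```python
import hashlib

TABLE_SLOTS = 64  # deliberately tiny so that collisions are easy to see

def slot(word, table_slots=TABLE_SLOTS):
    """Map a word to a table slot with a deterministic hash (assumed, not spell's)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % table_slots

def build_table(words, table_slots=TABLE_SLOTS):
    """Mark the slot of every dictionary word; the words themselves are not stored."""
    table = bytearray(table_slots)  # one byte per slot, for clarity
    for w in words:
        table[slot(w, table_slots)] = 1
    return table

def accepts(table, word):
    """A word passes if its slot is marked, whether or not it was ever added."""
    return table[slot(word, len(table))] == 1

words = ["forsook", "mill", "stalactite", "sopping", "stream"]
table = build_table(words)

# Every real word passes, but so does any garbage that lands on a marked slot.
garbage = [f"zz{i}" for i in range(2000)]
false_hits = [g for g in garbage if accepts(table, g)]
print(len(false_hits), "of", len(garbage), "garbage strings were accepted")
```

Since only slot marks are kept, the table cannot tell a real word from a colliding string; a larger table lowers the collision rate but does not remove it.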
rbj@icst-cmr.arpa (02/05/87)
>Does anyone know where the 4.x or System V dictionaries come from?
No, but one way is to start with a null dictionary, point spell at
some text files, and add words that are spelt korectly.
(Root Boy) Jim "Just Say Yes" Cottrell <rbj@icst-cmr.arpa>
BELA LUGOSI is my co-pilot... and he's 900 feet tall!
PAAAAAR@calstate.bitnet (02/06/87)
In-Reply-To: Bob Devine's message of 27 Jan 87 20:12:03 GMT

This may or may not help - but Jon Bentley's "Programming Pearls" book (Chapter 13) has a 3 page description of how one Doug McIlroy developed a spell program in 1978 for an unidentified UNIX system.

Other possible sources -
McIlroy "Development of a spelling list" IEEE Transactions on Communications, vol COM-30, No 1, Jan 1982, pp. 91-99.
Peterson "Comp prog. for detecting and corecting spelling errors" CACM Dec 1980

Dick Botting, Dept Comp Sci., Cal State U, San Bernardino, CA 92407
PAAAAAR@CCS.CSUSCC.CALSTATE.EDU
voice:714-887-7368 modem:714-887-7365
(Silicon Mountain -- where the LA smog stops)
devine@vianet.UUCP (02/09/87)
In article <257@ames.UUCP>, jaw@ames.UUCP (James A. Woods) writes:
> The standard reference for this is "Development of a spelling list,"
> M. D. McIlroy, IEEE Trans. on Communications, Jan. 1982.

Another good source to look at if you are developing a spelling checker is "Computer Programs for Spelling Correction" by James Peterson, 1980. It's #96 in the Goos & Hartmanis series "Lecture Notes in Computer Science".

Bob
dennisg@fritz.UUCP (02/11/87)
In article <4284@brl-adm.ARPA> PAAAAAR@calstate.bitnet writes:
>Peterson "Comp prog. for detecting and corecting spelling errors" CACM Dec 1980

It does WHAT to spelling errors?