[comp.unix.questions] Unix Dictionaries

dndobrin@athena.MIT.EDU (01/20/87)

Does anyone out there know where the 4.x or System V dictionaries come from?  Did someone just copy in a standard dictionary?  Were the words chosen by the Bell Labs (?) programmers?  How exactly do the suffix and prefix (?) algorithms work?  

And on and on.  Are there many different versions of the dictionary?  How many words does it check?  Since it's hashed, is there any way of correcting the original?  Is somebody out there an authority?

Mostly I'm just curious, but I could also use the answers for an article I'm writing.  I will summarize answers and post them to the list.

David N. Dobrin  
Lexicom
(617) 492-8035
1612 Cambridge St.
Cambridge  MA  02138

dndobrin@athena.mit.edu

dave@sq.UUCP (01/22/87)

In article <2828@brl-adm.ARPA> dndobrin@athena.MIT.EDU writes:
>Does anyone out there know where the 4.x or System V dictionaries come from?
  |
>How exactly do the suffix and prefix (?) algorithms work?  
>

It makes me wonder too, when words like the following pass spell!

Csvmre Enova Gurbqber Mnzovn Oeraare Qehzzbaq enat enax
enva fbccvat fgnynpgvgr reenapl sbefbbx thys traer ubne
vaqvfpreavoyr zvyy

Running Sun 3/180 Sun Unix 3.0

				David R. Seaman
------
{utai,utzoo}!sq!dave
dave@sq.utoronto.bitnet

SoftQuad Inc. (home of sqtroff)
720 Spadina Ave
Toronto, Ontario, Canada
(416) 963-8337

devine@vianet.UUCP (01/27/87)

> In article <2828@brl-adm.ARPA> dndobrin@athena.MIT.EDU writes:
> >Does anyone out there know where the 4.x or System V dictionaries come from?

In article <1987Jan22.110150.29415@sq.uucp>, dave@sq.UUCP writes:
> It makes me wonder too, when words like the following pass spell!
> [list of misspelled words deleted]

  Back around 1980 I wrote a spell check program for VAX/VMS.  To create
the dictionary I first gathered 100's of files to get the words.  Later
I copied the UNIX spelling list from a Version 7 machine.  That dictionary
was *loaded* with many misspelled words!  I spent hours editing it to
cull the obviously bad words (no fun staring at a screen).  There are 
probably still some words I missed.

  Moral: if it isn't your dictionary, beware of who may have added words.

Bob Devine
[ I still have my 95,000 word dictionary if someone wants a copy.]

gwyn@brl-smoke.UUCP (01/28/87)

In article <1987Jan22.110150.29415@sq.uucp> dave@sq.UUCP (David R. Seaman) writes:
>It makes me wonder too, when words like the following pass spell!

With so many different implementations of "spell" loose in the world,
it's hard to say much generically (although I seem to remember a good
summary in CACM a couple of years back).  For example, my version of
"spell" only passed four of the more plausible words in the list as
"not possibly misspelled".  Note that some (most?) "spells" have a
"stop list" which can be expanded to trounce words that would
otherwise be accepted as correctly spelled.  If a site has a problem
with a frequent word like that, that's the simplest "solution".
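
The interplay between suffix stripping and a stop list can be sketched
roughly as follows.  This is only an illustration, not the code of any
real "spell": the word list, the stop list, the suffix set and the
function name accepted() are all invented for the example.

```python
# Toy data, invented for illustration; real word lists are far larger.
WORDS = {"walk", "drop", "mill", "go"}
STOP = {"goed"}                  # derivable as "go" + "ed", but still wrong
SUFFIXES = ("ing", "ed", "s")    # a tiny, hypothetical suffix set


def accepted(word):
    """Return True if this toy checker would pass `word` as spelled correctly."""
    w = word.lower()
    if w in STOP:                # the stop list overrides any derivation
        return False
    if w in WORDS:               # exact match in the word list
        return True
    for suf in SUFFIXES:         # otherwise try stripping one suffix
        if w.endswith(suf) and w[: -len(suf)] in WORDS:
            return True
    return False
```

With this toy data, accepted("walked") is true because "walk" is listed,
while accepted("goed") is false only because the stop list vetoes it.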

dave@sq.UUCP (01/29/87)

In article <135@vianet.UUCP> devine@vianet.UUCP writes:
>> In article <2828@brl-adm.ARPA> dndobrin@athena.MIT.EDU writes:
>> >Does anyone out there know where the 4.x or System V dictionaries come from?
>
>In article <1987Jan22.110150.29415@sq.uucp>, dave@sq.UUCP writes:
>> It makes me wonder too, when words like the following pass spell!
>> [list of misspelled words deleted]
>
>  Back around 1980 I wrote a spell check program for VAX/VMS.  To create
>the dictionary I first gathered 100's of files to get the words.  Later
>I copied the UNIX spelling list from a Version 7 machine.  That dictionary
>was *loaded* with many misspelled words!

The words that I posted above [deleted] are not contained in
/usr/dict/words; they are probably generated by the suffix/prefix
algorithms that spell(1) and its kindred use. Note that System V
spell(1) does flag these words as misspelled. Sun Unix 3.0 spell is
the command under fire.

Another property of the words is that every one of them is a word
in the dictionary when rot13'ed.

This line:
I sbefbbx my zvyy before I dropped my fgnynpgvgr in the fbccvat stream.
rot13'd is:
V forsook zl mill orsber V qebccrq zl stalactite va gur sopping fgernz.
The first line passes spell.
The second catches fgernz gur orsber qebccrq.
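
For anyone who wants to reproduce the transformation, here is a minimal
rot13 sketch.  The function name rot13() is my own; the only library
call used is Python's str.maketrans.

```python
import string

_LOWER = string.ascii_lowercase
_UPPER = string.ascii_uppercase

# Map each letter to the one 13 places further on, wrapping around.
# Non-letters are absent from the table and so pass through unchanged.
ROT13 = str.maketrans(
    _LOWER + _UPPER,
    _LOWER[13:] + _LOWER[:13] + _UPPER[13:] + _UPPER[:13],
)


def rot13(text):
    """Rotate every ASCII letter in `text` by 13 places."""
    return text.translate(ROT13)
```

Because 13 is half of 26, applying rot13() twice returns the original
text, so the same function both encodes and decodes.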

					David R. Seaman
------
{utai,utzoo}!sq!dave
dave@sq.utoronto.bitnet

SoftQuad Inc. (home of sqtroff)
720 Spadina Ave
Toronto, Ontario, Canada
(416) 963-8337

jaw@ames.UUCP (01/31/87)

The standard reference for this is "Development of a spelling list,"
M. D. McIlroy, IEEE Trans. on Communications, Jan. 1982.

dlc@zog.cs.cmu.edu.UUCP (02/03/87)

> Csvmre Enova Gurbqber Mnzovn Oeraare Qehzzbaq enat enax
> enva fbccvat fgnynpgvgr reenapl sbefbbx thys traer ubne
> vaqvfpreavoyr zvyy

    What?  Are you surprised that spell (4.2bsd) also accepts Rot13?
    Don't you love undocumented features.

beth@brillig (Beth Katz) (02/04/87)

I am not a Unix expert, but I have looked at 'spell' and how it
accepts garbage.  I haven't read the papers mentioned previously.

One reason why 'spell' accepts so much garbage is that it uses
a hashed list of acceptable words.  On many systems I have seen,
this list is 50000 bytes.  Given all the garbage that can be
generated by random combinations of letters, you run out of space
in that table very quickly.  'spell' was designed to catch misspelled
words rather than to filter out absolute garbage.  The stop lists
catch words that could be created through transformations but that
are misspelled nonetheless.
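
To show how a hashed list can accept words that were never put into it,
here is a simplified sketch.  It is not the actual format spell uses:
the one-bit-per-word table, the choice of zlib.crc32 as the hash, and
the names slot(), build_table() and maybe_listed() are all assumptions
made for the example.  Only the 50000-byte default comes from the
figure above.

```python
import zlib


def slot(word, nbits):
    """Hash `word` to a bit position in a table of `nbits` bits."""
    return zlib.crc32(word.encode("utf-8")) % nbits


def build_table(words, nbytes=50000):
    """Set one bit per word; the words themselves are not stored."""
    table = bytearray(nbytes)
    for w in words:
        i = slot(w, nbytes * 8)
        table[i // 8] |= 1 << (i % 8)
    return table


def maybe_listed(table, word):
    """True if the word's bit is set, which also happens on a collision."""
    i = slot(word, len(table) * 8)
    return bool(table[i // 8] & (1 << (i % 8)))
```

Every listed word is always accepted, but so is any garbage string that
happens to hash to a bit already set.  The fuller the table, the more
often that happens.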

You can do some extra transformations to clean up the lists if
you've fed 'spell' real garbage, but for most situations, it 
doesn't matter all that much.

				Beth Katz

rbj@icst-cmr.arpa (02/05/87)

   >Does anyone know where the 4.x or System V dictionaries come from?

No, but one way is to start with a null dictionary, point spell at
some text files, and add words that are spelt korectly.

	(Root Boy) Jim "Just Say Yes" Cottrell	<rbj@icst-cmr.arpa>
	BELA LUGOSI is my co-pilot... and he's 900 feet tall!

PAAAAAR@calstate.bitnet (02/06/87)

In-Reply-To: Bob Devine's message of 27 Jan 87 20:12:03 GMT
 
This may or may not help - but Jon Bentley's "Programming Pearls" book
(Chapter 13) has a 3-page description of how one Doug McIlroy developed
a spell program in 1978 for an unidentified UNIX system.
 
Other possible sources -
McIlroy "Development of a spelling list" IEEE Transactions on Communications,
vol COM-30, No 1, Jan 1982, pp. 91-99.
 
Peterson "Comp prog. for detecting and corecting spelling errors" CACM Dec 1980
 
 
 
   Dick Botting, Dept Comp Sci., Cal State U, San Bernardino, CA 92407
   PAAAAAR@CCS.CSUSCC.CALSTATE.EDU                  voice:714-887-7368
   modem:714-887-7365  (Silicon Mountain  --  where the LA smog stops)
 
 

devine@vianet.UUCP (02/09/87)

In article <257@ames.UUCP>, jaw@ames.UUCP (James A. Woods) writes:
> The standard reference for this is "Development of a spelling list,"
> M. D. McIlroy, IEEE Trans. on Communications, Jan. 1982.

  Another good source to look at if you are developing a spelling
checker is "Computer Programs for Spelling Correction" by James
Peterson, 1980.  It's #96 in the Goos & Hartmanis series "Lecture
Notes in Computer Science".

Bob

dennisg@fritz.UUCP (02/11/87)

In article <4284@brl-adm.ARPA> PAAAAAR@calstate.bitnet writes:
>Peterson "Comp prog. for detecting and corecting spelling errors" CACM Dec 1980

It does WHAT to spelling errors?