[comp.text] Problem with spell

michaelm@bcsaic.UUCP (03/03/87)

Can anyone explain why I can't get Unix spell to accept "millions"?  I am
using my own spelling list (i.e. I have explicitly added "millions" to the
standard spelling list w/ "spellin"), and I have verified that "millions" is
in the hash table using "spellout."  Likewise, it is not in the stoplist, as
verified by "spellout."  But "spell" rejects it.  Explicitly:

	54 % spellout grammar.spelling.hash #my hashed spelling list
	millions
(Note that "spellout" does not echo "millions", indicating that it is in
the spelling hashlist)
	55 % spellout grammar.stop.hash
	millions
	millions
(This time "spellout" echoes "millions", indicating that it is not in the stop
hashlist)
	56 % spell -d grammar.spelling.hash -s grammar.stop.hash
	millions
(^D here)
	millions

How can "spell" reject it when "spellout" accepts it?
-- 
Mike Maxwell
Boeing Advanced Technology Center
	arpa: michaelm@boeing.com
	uucp: uw-beaver!uw-june!bcsaic!michaelm

mohamed@hscfvax.UUCP (750025@Mohamed_el_Lozy) (03/06/87)

millions was stopped by the stop list.  Why?  ons is a non-word which
might be construed by spell as the plural of the valid word on.  Hence
ons is in the stop list.  The stop list is used like the main list, with
prefix and suffix stripping.  Hence millions is seen as a derivative
(milli-ons, like milli-meters) of a word on the stop list and is stopped.
Another one of my favorite stopped words is dishes (dis-hes, hes on stop
list as spurious plural of he).  Also microbes, micro-bes.  There are three
or four others that I cannot remember at this late hour.
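
To make the interaction concrete, here is a toy model of the behavior
described above.  It is only a sketch: the prefix and suffix tables, the
tiny word lists, and the function names are all made up for illustration,
and the real spell works on hashed lists with far more affix rules.

```python
# Toy model only: these tables are assumptions, not spell's real rules.
PREFIXES = ("milli", "micro", "dis")
SUFFIXES = ("s",)


def derivable(word, wordlist):
    """True if word, or word with a known affix stripped, is in wordlist."""
    candidates = {word}
    # Strip a suffix (here just the plural "s").
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf):
            candidates.add(word[:-len(suf)])
    # Strip one prefix from every candidate found so far.
    for cand in list(candidates):
        for pre in PREFIXES:
            if cand.startswith(pre) and len(cand) > len(pre):
                candidates.add(cand[len(pre):])
    return any(c in wordlist for c in candidates)


def spell_accepts(word, words, stop):
    """The stop list is consulted with the same stripping as the main list."""
    if derivable(word, stop):
        return False  # stopped, even if the word itself is in the main list
    return derivable(word, words)


# Hypothetical lists: "millions" was added by hand, "ons" is a stop entry.
WORDS = {"millions", "on", "he", "be", "dish", "microbe", "meter"}
STOP = {"ons", "hes", "bes"}

for w in ("millions", "dishes", "microbes", "millimeters", "on"):
    print(w, "accepted" if spell_accepts(w, WORDS, STOP) else "REJECTED")
```

In this model "millions" is rejected although it is literally in the main
list, because stripping "milli" leaves "ons", which is on the stop list.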

There is really no solution, short of a total (and perhaps needed) rewrite
of spell, a program that originated in the dark ages on a PDP without
separate I & D space.  For an excellent review of the theory and implementation
of spell, see McIlroy, M. D. "Development of a Spelling List", IEEE Trans.
Communications, Jan 1982, 91-99.  Also an article in the Programming Pearls
column in Communications of the ACM about a year ago.

guy%gorodish@Sun.COM (Guy Harris) (03/07/87)

> There is really no solution, short of a total (and perhaps needed) rewrite
> of spell, a program that originated in the dark ages on a PDP without
> separate I & D space.

It's worth noting that other approaches to spelling checkers have
been successful, even on small machines.  A company called Proximity
Technology (formerly Proximity Devices; I presume they're still
around) built a spelling checker that just looked words up in a
dictionary.  The first version made one pass over the document to
gather a sorted list of words with duplicates eliminated.  The second pass
went through that list and eliminated words not found in the
dictionary; the dictionary was compressed using several techniques
(common prefix elimination, Huffman-coding, etc.) so an ~80,000 word
list took only (if I remember correctly) about 170KB.  The third pass
went through the document and printed the lines containing words in
the remaining list of misspelled words.
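
The three passes can be sketched roughly as follows.  This is only an
illustration of the structure described above, not Proximity's code: the
function name, the word-splitting regular expression, and the use of a
plain in-memory set for the dictionary are all assumptions.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+")  # assumed definition of a "word"


def three_pass_check(lines, dictionary):
    """Return (line number, line) pairs that contain unknown words."""
    # Pass 1: gather a sorted list of words with duplicates eliminated.
    words = sorted({w.lower() for line in lines for w in WORD_RE.findall(line)})

    # Pass 2: drop every word found in the dictionary; what is left
    # is the list of potential misspellings.
    suspects = {w for w in words if w not in dictionary}

    # Pass 3: go through the document again and report the lines
    # containing any of the remaining words.
    report = []
    for number, line in enumerate(lines, 1):
        if any(w.lower() in suspects for w in WORD_RE.findall(line)):
            report.append((number, line))
    return report


document = ["the cat sat", "the catt sat", "on the mat"]
dictionary = {"the", "cat", "sat", "on", "mat"}
print(three_pass_check(document, dictionary))
```

Note that nothing can be reported until passes 1 and 2 have finished.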

The bulk of the time was spent in the decompression; the word list
was indexed by the first three characters of the word, so lookup was
reasonably quick.  At CCI we incorporated this code into an
interactive spelling checker that was part of our word processor; you
could tell it to check from the current cursor position to the end of
the document, or mark a block of text and tell it to check that
block, and the third pass would stop at each potentially-misspelled
word and offer you the chance to correct it.  The word list could
also be purchased with hyphenation points included; our word
processor could be told to insert discretionary hyphens at these
points.  Other people, including vendors of word processors based on
micros (definitely 16-bit micros, perhaps 8-bit ones), also included
this code in their products.
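
For readers unfamiliar with common prefix elimination and the
three-character index mentioned above, the idea might look roughly like
this.  It is a sketch under my own assumptions (the function names, the
tuple encoding, one bucket per three-character key); the real format also
used Huffman coding and other techniques, which are omitted here.

```python
def front_encode(sorted_words):
    """Store each word as (length of prefix shared with previous, rest)."""
    encoded, prev = [], ""
    for w in sorted_words:
        n = 0
        while n < min(len(w), len(prev)) and w[n] == prev[n]:
            n += 1
        encoded.append((n, w[n:]))
        prev = w
    return encoded


def front_decode(encoded):
    """Rebuild the word list from (shared length, rest) pairs."""
    words, prev = [], ""
    for n, rest in encoded:
        prev = prev[:n] + rest
        words.append(prev)
    return words


def build_index(sorted_words):
    """Group words by their first three characters, compress each group."""
    buckets = {}
    for w in sorted_words:
        buckets.setdefault(w[:3], []).append(w)
    return {key: front_encode(ws) for key, ws in buckets.items()}


def lookup(index, word):
    """Decompress only the one bucket that could contain the word."""
    return word in front_decode(index.get(word[:3], []))


words = ["million", "millions", "mills", "on"]
index = build_index(words)
print(front_encode(words))
print(lookup(index, "millions"), lookup(index, "milion"))
```

The index keeps lookups quick because only one small bucket has to be
decompressed per word, but every lookup still pays for that decompression.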

Proximity then rewrote it completely to use one pass; it just took
every word in the document and looked it up in the word list.  They
did this in order to make it work better when used interactively; the
old version took a while to complete passes 1 and 2, so you had to
sit there for a while before it started showing potential
misspellings.  They also used different compression techniques that
weren't so CPU-intensive; when incorporated into CCI's word
processor, the new one started showing the errors fairly quickly.  I
believe this version traded space for time, by entering a match/nomatch
indication into an in-memory cache, so I don't know whether it would
have worked well on a small-address-space or small-physical-memory
machine.

jewett@hpl-opus.HP.COM (Bob Jewett) (03/09/87)

> There is really no solution, short of a total (and perhaps needed) rewrite
> of spell

    I rather like ispell, which was recently posted to the net.  It takes 2
    to 5 seconds to start up, and the searches for suggested spellings of
    misspelled words can take several seconds on long words, but I find it
    much easier to use than Bell's current offering.

    Its hashed list is about 400K.  It came with a (short) ASCII dictionary.

    Bob Jewett

jerry@oliveb.UUCP (03/20/87)

In article <14626@sun.uucp> guy%gorodish@Sun.COM (Guy Harris) writes:
>It's worth noting that other approaches to spelling checkers have
>been successful, even on small machines.  A company called Proximity
>Technology (formerly Proximity Devices; I presume they're still
>around) build a spelling checker that just looked words up in a
>dictionary.  The first version made one pass over the document to
>gather a sorted list of words with duplicates eliminated.  The second pass
>went through that list and eliminated words not found in the
>dictionary; the dictionary was compressed using several techniques

It is not necessary to prescan the document and eliminate duplicates.  I
ran into the same problems with a spelling checker I wrote.  What made
the most improvement was to add a hashed LRU table to the lookup.

My program kept the last 512 "words" in memory.  Each string was stored
along with a flag indicating whether it was a word.  This gave the
greatest performance improvement of any of my changes, including the
addition of an index based on the first few letters.
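
A minimal sketch of such a cache is below.  It is not my original code:
the class and method names are invented for illustration, and it leans on
Python's OrderedDict instead of a hand-built hash table.  The slow_lookup
argument stands in for whatever expensive dictionary search sits behind
the cache.

```python
from collections import OrderedDict


class CachedChecker:
    """Remember the last `capacity` strings and whether each was a word."""

    def __init__(self, slow_lookup, capacity=512):
        self.slow_lookup = slow_lookup  # the expensive dictionary search
        self.capacity = capacity
        self.cache = OrderedDict()      # string -> True/False flag
        self.hits = 0
        self.misses = 0

    def is_word(self, s):
        if s in self.cache:
            # Seen recently: refresh its position and skip the slow search.
            self.cache.move_to_end(s)
            self.hits += 1
            return self.cache[s]
        self.misses += 1
        flag = self.slow_lookup(s)
        self.cache[s] = flag            # misspellings are cached too
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return flag


dictionary = {"the", "cat", "sat", "on", "mat"}
checker = CachedChecker(lambda w: w in dictionary)
for w in "the cat sat on the mat the catt sat".split():
    checker.is_word(w)
print(checker.hits, checker.misses)
```

Storing the flag for non-words as well matters, since a misspelling that
recurs throughout a document would otherwise hit the slow path every time.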

I did some analysis of "typical" documents and found that the hash was
more than 90% effective in eliminating duplicate lookups.  Since it also
eliminates the first read of the file and the setup delay while it
preprocesses, this is definitely a better solution.

					Jerry Aguirre
					Systems Administration
					Olivetti ATC