michaelm@bcsaic.UUCP (03/03/87)
Can anyone explain why I can't get Unix spell to accept "millions"?  I am
using my own spelling list (i.e. I have explicitly added "millions" to the
standard spelling list with "spellin"), and I have verified that "millions"
is in the hash table using "spellout".  Likewise, it is not in the stop
list, as verified by "spellout".  But "spell" rejects it.  Explicitly:

    54 % spellout grammar.spelling.hash        #my hashed spelling list
    millions
    (Note that "spellout" does not echo "millions", indicating that it is
    in the spelling hash list.)

    55 % spellout grammar.stop.hash
    millions
    millions
    (This time "spellout" echoes "millions", indicating that it is not in
    the stop hash list.)

    56 % spell -d grammar.spelling.hash -s grammar.stop.hash
    millions
    (^D here)
    millions

How can "spell" reject it when "spellout" accepts it?
-- 
Mike Maxwell
Boeing Advanced Technology Center
arpa:  michaelm@boeing.com
uucp:  uw-beaver!uw-june!bcsaic!michaelm
mohamed@hscfvax.UUCP (750025@Mohamed_el_Lozy) (03/06/87)
"millions" was stopped by the stop list.  Why?  "ons" is a non-word that
might be construed by spell as the plural of the valid word "on"; hence
"ons" is in the stop list.  The stop list is used like the main list, with
prefix and suffix stripping.  Hence "millions" is seen as a derivative
(milli-ons, like milli-meters) of a word on the stop list and is stopped.

Another of my favorite stopped words is "dishes" (dis-hes, "hes" being on
the stop list as a spurious plural of "he").  Also "microbes" (micro-bes).
There are three or four others that I cannot remember at this late hour.

There is really no solution, short of a total (and perhaps needed) rewrite
of spell, a program that originated in the dark ages on a PDP without
separate I & D space.

For an excellent review of the theory and implementation of spell, see
McIlroy, M. D., "Development of a Spelling List", IEEE Trans.
Communications, Jan. 1982, pp. 91-99.  There was also an article in the
Programming Pearls column of Comm. ACM about a year ago.
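To make the failure mode concrete, here is a toy C sketch (not spell's
actual code; the prefix and stop-list entries are made up for the example)
of how stripping a recognized prefix can leave a residue that happens to
be on the stop list:

/* Toy illustration of affix stripping tripping over the stop list:
 * "millions" is analyzed as "milli" + "ons", and "ons" happens to be
 * a stopped non-word.  Not spell's actual code or data. */
#include <stdio.h>
#include <string.h>

static const char *prefixes[] = { "milli", "micro", "dis", NULL };
static const char *stoplist[] = { "ons", "hes", "bes", NULL };

static int on_stoplist(const char *w)
{
    int i;
    for (i = 0; stoplist[i] != NULL; i++)
        if (strcmp(w, stoplist[i]) == 0)
            return 1;
    return 0;
}

/* Stopped if the word itself, or the word with a recognized prefix
 * stripped off, matches an entry on the stop list. */
static int stopped(const char *word)
{
    int i;
    if (on_stoplist(word))
        return 1;
    for (i = 0; prefixes[i] != NULL; i++) {
        size_t n = strlen(prefixes[i]);
        if (strncmp(word, prefixes[i], n) == 0 && on_stoplist(word + n))
            return 1;
    }
    return 0;
}

int main(void)
{
    const char *words[] = { "millions", "dishes", "microbes", "meters" };
    int i;
    for (i = 0; i < 4; i++)
        printf("%-10s %s\n", words[i], stopped(words[i]) ? "stopped" : "passed");
    return 0;
}

Compiled and run, it reports "millions", "dishes", and "microbes" as
stopped and "meters" as passed, which is exactly the accident described
above.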
guy%gorodish@Sun.COM (Guy Harris) (03/07/87)
> There is really no solution, short of a total (and perhaps needed) rewrite
> of spell, a program that originated in the dark ages on a PDP without
> separate I & D space.

It's worth noting that other approaches to spelling checkers have been
successful, even on small machines.  A company called Proximity Technology
(formerly Proximity Devices; I presume they're still around) built a
spelling checker that just looked words up in a dictionary.

The first version made one pass over the document to gather a sorted list
of words with duplicates eliminated.  The second pass went through that
list and eliminated the words found in the dictionary, leaving a list of
potential misspellings; the dictionary was compressed using several
techniques (common prefix elimination, Huffman coding, etc.), so an
~80,000-word list took only (if I remember correctly) about 170KB.  The
third pass went through the document and printed the lines containing
words from the remaining list of misspelled words.  The bulk of the time
was spent in the decompression; the word list was indexed by the first
three characters of each word, so lookup was reasonably quick.

At CCI we incorporated this code into an interactive spelling checker that
was part of our word processor; you could tell it to check from the
current cursor position to the end of the document, or mark a block of
text and tell it to check that block, and the third pass would stop at
each potentially misspelled word and offer you the chance to correct it.
The word list could also be purchased with hyphenation points included;
our word processor could be told to insert discretionary hyphens at those
points.  Other people, including vendors of word processors based on
micros (definitely 16-bit micros, perhaps 8-bit ones), also included this
code in their products.

Proximity then rewrote it completely to use one pass; it just took every
word in the document and looked it up in the word list.  They did this to
make it work better when used interactively; the old version took a while
to complete passes 1 and 2, so you had to sit there for a while before it
started showing potential misspellings.  They also used different
compression techniques that weren't so CPU-intensive; when incorporated
into CCI's word processor, the new one started showing the errors fairly
quickly.  I believe this version traded space for time by entering a
match/no-match indication into an in-memory cache, so I don't know whether
it would have worked well on a machine with a small address space or
little physical memory.
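For what it's worth, the three-pass structure is easy to sketch.  Below is
a rough C illustration under my own assumptions (a tiny in-memory word
list stands in for the real compressed dictionary, "words" are just
whitespace-separated tokens, and pass 3 does a naive substring search);
this is not Proximity's code:

/* Sketch of the three-pass checker described above.  The real program
 * used a compressed on-disk dictionary indexed by the first three
 * letters of each word; in_dictionary() here is only a stand-in. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXWORDS 5000
#define MAXLEN   64

static const char *dictionary[] = { "the", "quick", "brown", "fox", NULL };

static int in_dictionary(const char *w)
{
    int i;
    for (i = 0; dictionary[i] != NULL; i++)
        if (strcmp(w, dictionary[i]) == 0)
            return 1;
    return 0;
}

static int cmpstr(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

int main(int argc, char **argv)
{
    static char words[MAXWORDS][MAXLEN];
    char *suspects[MAXWORDS], line[BUFSIZ], w[MAXLEN];
    int nwords = 0, nsusp = 0, i, j;
    FILE *fp;

    if (argc != 2 || (fp = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    /* Pass 1: collect the distinct words in the document. */
    while (fscanf(fp, "%63s", w) == 1 && nwords < MAXWORDS) {
        for (i = 0; i < nwords; i++)
            if (strcmp(words[i], w) == 0)
                break;
        if (i == nwords)
            strcpy(words[nwords++], w);
    }

    /* Pass 2: keep only the words not found in the dictionary. */
    for (i = 0; i < nwords; i++)
        if (!in_dictionary(words[i]))
            suspects[nsusp++] = words[i];
    qsort(suspects, nsusp, sizeof(suspects[0]), cmpstr);

    /* Pass 3: print each line of the document that contains a suspect. */
    rewind(fp);
    while (fgets(line, sizeof line, fp) != NULL)
        for (j = 0; j < nsusp; j++)
            if (strstr(line, suspects[j]) != NULL) {
                fputs(line, stdout);
                break;
            }
    fclose(fp);
    return 0;
}

In the real checker all the interesting time went into decompressing the
dictionary entries that in_dictionary() glosses over here.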
jewett@hpl-opus.HP.COM (Bob Jewett) (03/09/87)
> There is really no solution, short of a total (and perhaps needed) rewrite
> of spell

I rather like ispell, which was recently posted to the net.  It takes 2 to
5 seconds to start up, and the searches for suggested spellings of
misspelled words can take several seconds on long words, but I find it
much easier to use than Bell's current offering.  Its hashed list is about
400K.  It came with a (short) ASCII dictionary.

Bob Jewett
jerry@oliveb.UUCP (03/20/87)
In article <14626@sun.uucp> guy%gorodish@Sun.COM (Guy Harris) writes:
>It's worth noting that other approaches to spelling checkers have
>been successful, even on small machines.  A company called Proximity
>Technology (formerly Proximity Devices; I presume they're still
>around) built a spelling checker that just looked words up in a
>dictionary.  The first version made one pass over the document to
>gather a sorted list of words with duplicates eliminated.  The second
>pass went through that list and eliminated the words found in the
>dictionary; the dictionary was compressed using several techniques

It is not necessary to prescan the document and eliminate duplicates.  I
ran into the same problems with a spelling checker I wrote, and the single
biggest improvement came from adding a hashed LRU table to the lookup.  My
program kept the last 512 "words" in memory; each string was stored along
with a flag indicating whether or not it was a word.  This helped more
than any of my other changes, including the addition of an index based on
the first few letters of each word.  Some analysis of "typical" documents
showed that the cache was >90% effective in eliminating duplicate lookups.
Since this approach also avoids the extra read of the file and the setup
delay while the document is preprocessed, I consider it definitely the
better solution.

Jerry Aguirre
Systems Administration
Olivetti ATC
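Roughly what I mean, as a simplified sketch (a direct-mapped variant of
the cache rather than my actual code; dictionary_lookup() is a trivial
stand-in for the slow dictionary search):

/* Small hashed cache of recently seen strings, each stored with a flag
 * saying whether the slow dictionary lookup found it.  Repeated words
 * in a document then cost only one real lookup. */
#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 512
#define MAXWORD     64

struct slot {
    char word[MAXWORD];
    int  is_word;               /* result of the last real lookup */
    int  valid;
};

static struct slot cache[CACHE_SLOTS];

/* Stand-in for the real (slow) dictionary search. */
static int dictionary_lookup(const char *w)
{
    static const char *tiny_dict[] = { "the", "last", "words", NULL };
    int i;
    for (i = 0; tiny_dict[i] != NULL; i++)
        if (strcmp(w, tiny_dict[i]) == 0)
            return 1;
    return 0;
}

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % CACHE_SLOTS;
}

/* Check the cache first; fall back to the dictionary and remember
 * the answer, whether the string turned out to be a word or not. */
static int is_word(const char *w)
{
    struct slot *sp = &cache[hash(w)];

    if (sp->valid && strcmp(sp->word, w) == 0)
        return sp->is_word;

    sp->is_word = dictionary_lookup(w);
    sp->valid = 1;
    strncpy(sp->word, w, MAXWORD - 1);
    sp->word[MAXWORD - 1] = '\0';
    return sp->is_word;
}

int main(void)
{
    const char *test[] = { "the", "teh", "the", "words", "teh" };
    int i;
    for (i = 0; i < 5; i++)
        printf("%-8s %s\n", test[i], is_word(test[i]) ? "ok" : "not a word");
    return 0;
}

A collision simply overwrites whatever was in the slot, so this is only
approximately LRU; a true LRU list buys a slightly better hit rate at the
cost of extra bookkeeping.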