[fa.info-cpm] Fast spelling checker

C70:info-cpm (08/08/82)

From MADLER@Mit-Ml Sun Aug  8 15:18:31 1982
How do companies like MicroPro manage to produce such fast spelling programs?
Do they simply have small dictionaries?

I have been working on a program that is running 10-20 times slower (at least)
than some performance specs I have seen (3 double-spaced pages take 3
minutes!!!)

So far, I have alphabetized my dictionary and have compressed it by putting
a byte at the beginning of appropriate words indicating the number of
characters to copy from the previous word (this is done only when it saves space).
Each word is followed by 2 bytes of suffix flags.  The dictionary covers
approximately 48,000 words as a result of the flags and resides in 92K.
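
A rough sketch in C of how one such front-coded entry might be expanded.
The exact record layout isn't given above, so the copy-count byte on every
entry, the zero-terminated word text, and the two trailing suffix-flag
bytes are all assumptions:

    /* Sketch only: expand one front-coded dictionary entry.  Layout
       assumed here: copy-count byte, zero-terminated word text, then
       the two suffix-flag bytes described above. */
    #include <string.h>

    struct entry {
        char word[32];              /* expanded word */
        unsigned char sfx[2];       /* suffix flag bytes */
    };

    /* p points at one packed entry; prev holds the previously expanded
       word.  Returns a pointer just past this entry. */
    const unsigned char *expand(const unsigned char *p, char *prev,
                                struct entry *e)
    {
        int n = *p++;                    /* chars shared with previous word */
        memcpy(e->word, prev, n);
        while (*p != '\0')               /* copy the new tail */
            e->word[n++] = *p++;
        e->word[n] = '\0';
        p++;                             /* skip the terminator */
        e->sfx[0] = *p++;                /* the two suffix-flag bytes */
        e->sfx[1] = *p++;
        strcpy(prev, e->word);           /* becomes "previous" next time */
        return p;
    }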

The program keeps an index in memory so that only one 256-byte dictionary record
need be read for each word.  The directory is also buffered (eliminating
directory seeks across extents).  I have gone so far as to fill up RAM with as
much of the dictionary as I can.
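
For what it's worth, the index lookup described above amounts to a binary
search over the first word of each 256-byte record.  A sketch in C, with
the sizes and names invented rather than taken from the real program:

    /* Sketch of the in-memory index: binary-search the first word of
       each 256-byte dictionary record to pick the single record to
       read.  NRECS and first_word are assumptions. */
    #include <string.h>

    #define NRECS 368                    /* ~92K of dictionary / 256 bytes */

    extern char first_word[NRECS][16];   /* first word in each record */

    /* Return the number of the record whose range contains `word`. */
    int find_record(const char *word)
    {
        int lo = 0, hi = NRECS - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (strcmp(word, first_word[mid]) >= 0)
                lo = mid;                /* word sorts at or after record mid */
            else
                hi = mid - 1;
        }
        return lo;                       /* read that record, then scan it
                                            sequentially */
    }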

Anybody have any suggestions to improve efficiency?  I am sure that the
slowest part is disk access (5.25" floppies, 20ms track to track).
I guess I should minimize head travel.

Does a reasonable algorithm exist?

Thanks,
-Michael

C70:info-cpm (08/09/82)

From FONER@Mit-Mc Sun Aug  8 23:38:01 1982
You failed to specify what \search/ algorithm you use!  A simple
sequential search could indeed take up gobs and gobs of time...

The subject of searching, sorting, optimal disk organization schemes,
and so forth is quite old.  I very highly recommend D. E. Knuth, _The
Art of Computer Programming, Volume 3: Sorting and Searching_,
Addison-Wesley Publishing Company, Reading, MA, 1973 (ISBN
0-201-03803-X, Library of Congress Catalog Number 67-26020).

This work is, in my opinion, the classic in the field.  The twenty or
thirty dollars spent will repay themselves many times if you use this
book well.

For anyone else, I recommend all three of Knuth's books (Fundamental
Algorithms, Seminumerical Algorithms, and Sorting and Searching) as
invaluable reference works.  The mathematics can get very heavy at
times, but the algorithms presented and the methods for evaluating
them are clear and can almost always be applied without a deep
appreciation of their mathematical elegance.  As long as you're not
trying to use the books as you would a textbook, you're all set.

(No, I am not a spokespiece for Knuth, but merely a grateful user of
his works such as this series and his TeX and METAFONT programs.)

Have fun folks.

						<LNF>

C70:info-cpm (08/10/82)

From MADLER@Mit-Mc Tue Aug 10 00:17:46 1982
The search is binary (through the in-memory index to the correct sector);
after that it is sequential because of the way the dictionary is packed.
I think the best suggestion I have received so far is to alphabetize the
input file's words in memory (there can't be THAT many unique words), compare
those words against the dictionary, and then use the results in memory to
write out the marked file in a second pass over the input file.  This should
eliminate the largest problem: head travel.
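
A sketch of that two-pass scheme in C, with next_word() and in_dictionary()
left as assumed helpers and the "*word" marking convention invented: pass 1
collects and sorts the document's unique words so the dictionary can be
swept in one forward pass of the head, and pass 2 re-reads the input and
marks the misses.

    /* Sketch only: two-pass spelling check.  next_word() and
       in_dictionary() are assumed helpers, not shown. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAXWORDS 4000
    #define MAXLEN   32

    static char words[MAXWORDS][MAXLEN];
    static char bad[MAXWORDS];
    static int  nwords;

    static int cmp(const void *a, const void *b)
    {
        return strcmp((const char *)a, (const char *)b);
    }

    extern int next_word(FILE *f, char *buf);    /* read next input word  */
    extern int in_dictionary(const char *w);     /* one dictionary lookup */

    void check(FILE *in, FILE *out)
    {
        char w[MAXLEN];
        int i, n = 0;

        /* pass 1: collect every word, sort, drop duplicates (no overflow
           handling -- a real program would spill the excess to disk) */
        while (n < MAXWORDS && next_word(in, w))
            strcpy(words[n++], w);
        qsort(words, n, MAXLEN, cmp);
        for (i = 0; i < n; i++)
            if (nwords == 0 || strcmp(words[i], words[nwords - 1]) != 0) {
                if (i != nwords)
                    memcpy(words[nwords], words[i], MAXLEN);
                nwords++;
            }

        /* one forward sweep over the sorted list -- and therefore over
           the dictionary -- so the head never travels backwards */
        for (i = 0; i < nwords; i++)
            bad[i] = !in_dictionary(words[i]);

        /* pass 2: re-read the input and flag the misspelled words */
        rewind(in);
        while (next_word(in, w)) {
            char *hit = bsearch(w, words, nwords, MAXLEN, cmp);
            i = hit ? (int)((hit - (char *)words) / MAXLEN) : -1;
            fprintf(out, (i >= 0 && bad[i]) ? "*%s " : "%s ", w);
        }
    }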

More thoughts are welcome!
-Michael

C70:info-cpm (08/10/82)

From FONER@Mit-Mc Tue Aug 10 00:30:27 1982
Yes, that sounds fair.  The worst case for such an algorithm occurs
when every word in the input file is unique, which would necessitate
essentially copying the input file to disk if the file itself is
larger than your physical memory (losing all the time gained, because of
the need to keep the output sorted alphabetically...  even a
binary tree would necessitate a lot of head motion, I fear).

However, this is a worst case, and you could always store on disk only
what won't fit in memory.  You will probably find that you can usually
hold the whole alphabetized list in memory, and thus win.
Short files (smaller than physical memory) will always win.  Large
ones might profitably be split up if the chance of unique words
filling memory is high, but this is probably rare...  almost any given
document will not use very many \unique/ words unless you are
checking a dictionary or Brown's Corpus or something equally
inappropriate.

Good luck!

						<LNF>