[comp.sys.mac.apps] How do so many words fit in a dictionary/thesaurus?

pac@stl.stc.co.uk (09/12/90)

Can anyone give me a reference to the compression techniques used by the
dictionary/thesaurus applications and DAs.  Taking Microlytics claim about
Word Finder (220,000 words) it works out at about 1.6 bytes per word.
Presumably the portable calculator-like dictionaries that are available
use similar compression algorithms?

Any leads (by email) would help me as I am trying to find a way of
squeezing a medical reference database into a small handheld computer.
A discussion on the net about the algorithms would also be interesting,
although I've probably kicked this off in the wrong newsgroup.

Thanks,
        Paul Cooper 
Regards,
	Paul Cooper (pac@stl.stc.co.uk) - ICL Europe

jeffe@eniac.seas.upenn.edu (George Jefferson ) (09/13/90)

In article <3374@stl.stc.co.uk> pac@stl.stc.co.uk () writes:
>Can anyone give me a reference to the compression techniques used by the
>dictionary/thesaurus applications and DAs.  Taking Microlytics claim about
>Word Finder (220,000 words) it works out at about 1.6 bytes per word.

I suspect that the '220,000' words counts plurals, and other word
variants as seperate 'words'.  A tremendous 'compression' is
achieved by recognising the (reasonably) simple rules
for adding aprporiate suffixes and prefixes.  Eace base word
need only cary around a couple of bytes to indicate which modifiers
are allowed.

-- just a hunch

george

SAS102@psuvm.psu.edu (Steven A. Schrader) (09/15/90)

Actually alot of the dictionaries do not store "words" they use arrays of point
ers.
In case you do not know a pointer is a variable that points to the location
of the data. Now you can point to an array, so what they do is have a pointer
pointing to an array of the 26 characters (a-z) ( I am only using lower case
for simplicity). This array in turn has pointers to each of the letters.
This keeps going and going. Basically the word Apple will be stored as
pointer-A-Pointer-P-pointer-p-pointer-l-pointer-e. This takes up about 5 bytes
(I think). Adding one more pointer (i.e. Pointer-s) adds Apples to the diction-
ary and this will take 6 bytes for two words. If you look for Appler you might
find that the last E array does not have a pointer-r attachment and therefore
the word is spelled wrong. The computer will suggest the closet, Apples (no
doubt) as well as Apple.

Does this makes sense? It is hard to explain quickly
----------------------------------------------------------------------
         Steven A.  Schrader (SAS102 @ Psuvm.Bitnet)
         Student Consultant Coordinater
         Student Support Initiative, The Center for Academic Computing
         The Pennsylvania State University

glennc@arrester.caltech.edu (Glenn C. Smith) (09/15/90)

Email bounced and its of general interest so...

>  How do so many words fit into a dictionary/thesaurus?  I'd like to
>  use the compression technique for a medical reference system I
>  am developing...
       [the above quote is completely paraphrased from memory]

I've always had the impression that they use a big binary tree.
The letters in a word then reduce to the bits defining 
a path down to a marker signifying the existence of a word.
Each bit corresponds to left or right as you descend the binary tree.

This is in line with the use of such dictionaries.  Given a word
in a spelling dictionary, you follow the path it defines down the
tree and see if you end at nothing (or not).  If you end at nothing
then the word is misspelled, if you end on something, then the
word is correctly spelled.  This is also a very fast search.

The words are not exactly compressed, they are encoded in the
structure of the tree.  To list out all the words in such a tree
you would use a recursive descent process.

This method saves tons of space.  All words that share the same
pattern of letters (read left to right) share the same storage space
until they become unique.  The nature of the data (the words in the 
English language) make the space savings vast.

I hope I was clear here.  I don't think this method of "compression"
lends itself to larger ordered blocks of text like references
for a medical database.  Notice that using a binary tree preserves
no order of each individual word accept alphabetical.

Good luck and I hope this helps!

Glenn C. Smith

--
_________________________________________________________________________
Glenn C. Smith                      |  "It is a weak mind that can think 
California Institute of Technology  |   of only one way to spell a word."
glennc@ccoroadd.caltech.edu         | --- "Build high for happiness." ---