pac@stl.stc.co.uk (09/12/90)
Can anyone give me a reference to the compression techniques used by the dictionary/thesaurus applications and DAs. Taking Microlytics claim about Word Finder (220,000 words) it works out at about 1.6 bytes per word. Presumably the portable calculator-like dictionaries that are available use similar compression algorithms? Any leads (by email) would help me as I am trying to find a way of squeezing a medical reference database into a small handheld computer. A discussion on the net about the algorithms would also be interesting, although I've probably kicked this off in the wrong newsgroup. Thanks, Paul Cooper Regards, Paul Cooper (pac@stl.stc.co.uk) - ICL Europe
jeffe@eniac.seas.upenn.edu (George Jefferson ) (09/13/90)
In article <3374@stl.stc.co.uk> pac@stl.stc.co.uk () writes: >Can anyone give me a reference to the compression techniques used by the >dictionary/thesaurus applications and DAs. Taking Microlytics claim about >Word Finder (220,000 words) it works out at about 1.6 bytes per word. I suspect that the '220,000' words counts plurals, and other word variants as seperate 'words'. A tremendous 'compression' is achieved by recognising the (reasonably) simple rules for adding aprporiate suffixes and prefixes. Eace base word need only cary around a couple of bytes to indicate which modifiers are allowed. -- just a hunch george
SAS102@psuvm.psu.edu (Steven A. Schrader) (09/15/90)
Actually alot of the dictionaries do not store "words" they use arrays of point ers. In case you do not know a pointer is a variable that points to the location of the data. Now you can point to an array, so what they do is have a pointer pointing to an array of the 26 characters (a-z) ( I am only using lower case for simplicity). This array in turn has pointers to each of the letters. This keeps going and going. Basically the word Apple will be stored as pointer-A-Pointer-P-pointer-p-pointer-l-pointer-e. This takes up about 5 bytes (I think). Adding one more pointer (i.e. Pointer-s) adds Apples to the diction- ary and this will take 6 bytes for two words. If you look for Appler you might find that the last E array does not have a pointer-r attachment and therefore the word is spelled wrong. The computer will suggest the closet, Apples (no doubt) as well as Apple. Does this makes sense? It is hard to explain quickly ---------------------------------------------------------------------- Steven A. Schrader (SAS102 @ Psuvm.Bitnet) Student Consultant Coordinater Student Support Initiative, The Center for Academic Computing The Pennsylvania State University
glennc@arrester.caltech.edu (Glenn C. Smith) (09/15/90)
Email bounced and its of general interest so... > How do so many words fit into a dictionary/thesaurus? I'd like to > use the compression technique for a medical reference system I > am developing... [the above quote is completely paraphrased from memory] I've always had the impression that they use a big binary tree. The letters in a word then reduce to the bits defining a path down to a marker signifying the existence of a word. Each bit corresponds to left or right as you descend the binary tree. This is in line with the use of such dictionaries. Given a word in a spelling dictionary, you follow the path it defines down the tree and see if you end at nothing (or not). If you end at nothing then the word is misspelled, if you end on something, then the word is correctly spelled. This is also a very fast search. The words are not exactly compressed, they are encoded in the structure of the tree. To list out all the words in such a tree you would use a recursive descent process. This method saves tons of space. All words that share the same pattern of letters (read left to right) share the same storage space until they become unique. The nature of the data (the words in the English language) make the space savings vast. I hope I was clear here. I don't think this method of "compression" lends itself to larger ordered blocks of text like references for a medical database. Notice that using a binary tree preserves no order of each individual word accept alphabetical. Good luck and I hope this helps! Glenn C. Smith -- _________________________________________________________________________ Glenn C. Smith | "It is a weak mind that can think California Institute of Technology | of only one way to spell a word." glennc@ccoroadd.caltech.edu | --- "Build high for happiness." ---