burgin@ecsvax.uncecs.edu (Robert Burgin) (06/24/91)
I am writing a program that includes a dictionary look-up of 24,500 English words. Each entry in the dictionary will contain the word itself and a string representing valid syntactic tags for that word (noun, verb, etc.).

Would a hash table be best for the implementation? Or a trie? Why?

The trie would seem to be ideal for short words but bad for longer words. Since the most frequent words in English are rather short, perhaps that should be a consideration in favor of the trie. On the other hand, tries are notoriously memory-intensive.

I worry about collisions in the hash table and wonder what the ideal number of buckets would be for this case. Not to mention having to scare up a proper hash function for strings of variable length.

Any thoughts would be appreciated.

--rb
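For what it's worth, a minimal trie node for this kind of dictionary might look something like the sketch below in C, assuming one child pointer per lower-case letter (the layout and names are only illustrative); the 26 pointers per node are where the memory goes:

    /* Minimal trie node sketch: one child pointer per lower-case letter.
       At four bytes a pointer that is over 100 bytes per node, which is
       why tries for a large word list get memory-hungry. */
    struct trie_node {
        char *tags;                   /* syntactic tags if a word ends here, else NULL */
        struct trie_node *child[26];  /* 'a' through 'z' */
    };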
boyd@prl.dec.com (Boyd Roberts) (06/25/91)
Why not use a `hash table of binary trees'? Each hash table slot has a binary tree hanging off it. A fast hash and an extensible tree. Ideal.

Boyd Roberts			boyd@prl.dec.com
``When the going gets weird, the weird turn pro...''
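As an untested sketch of that arrangement (the table size, hash, and names below are made up for illustration):

    #include <stdlib.h>
    #include <string.h>

    #define NSLOTS 1021                 /* assumed table size; pick to taste */

    struct node {                       /* one word plus its tag string */
        char *word;
        char *tags;
        struct node *left, *right;
    };

    static struct node *slot[NSLOTS];   /* each slot is the root of a binary tree */

    static unsigned hash(const char *s) /* any decent string hash will do */
    {
        unsigned h = 0;
        while (*s)
            h = h * 31 + (unsigned char)*s++;
        return h % NSLOTS;
    }

    /* Return the node for word, inserting it (with the given tags) if absent. */
    struct node *lookup(const char *word, const char *tags)
    {
        struct node **np = &slot[hash(word)];

        while (*np) {
            int cmp = strcmp(word, (*np)->word);
            if (cmp == 0)
                return *np;
            np = (cmp < 0) ? &(*np)->left : &(*np)->right;
        }
        if ((*np = malloc(sizeof **np)) != NULL) {
            (*np)->word = strdup(word);
            (*np)->tags = strdup(tags);
            (*np)->left = (*np)->right = NULL;
        }
        return *np;
    }

One caveat: if the words go in in alphabetical order, each tree degenerates into a list, so either shuffle the input or rely on the hash to break up the order.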
worley@compass.com (Dale Worley) (06/25/91)
In article <1991Jun24.110300.3740@ecsvax.uncecs.edu> burgin@ecsvax.uncecs.edu (Robert Burgin) writes:
> I am writing a program that includes a dictionary look-up
> of 24,500 English words.
> Would a hash table be best for the implementation? Or
> a trie? Why?
What are you going to *do* with the dictionary? As in all
data-structure questions, you have to first figure out what you are
going to do with it before you know which data structure to use.
Dale Worley Compass, Inc. worley@compass.com
--
toy dog [toi-dawg] n. Any of several breeds of small dogs, often
high-strung. Syn: yip-yip, yip-dog, microwave delight.
mouse@thunder.mcrcim.mcgill.edu (der Mouse) (06/28/91)
In article <1991Jun24.110300.3740@ecsvax.uncecs.edu>, burgin@ecsvax.uncecs.edu (Robert Burgin) writes:

> I am writing a program that includes a dictionary look-up of 24,500
> English words. Each entry in the dictionary will contain the word
> itself and a string representing valid syntactic tags for that word
> (noun, verb, etc.).

> Would a hash table be best for the implementation? Or a trie? Why?

Yes. :-)

Seriously though, there is no single data structure that can definitely be said to be best, given no more information than you have given. For example, how dear is memory? If you can afford to dedicate several megabytes to the dictionary data structures, want a reasonably fast lookup, no inserts or deletes at run-time, and can afford to spend some CPU time preprocessing the dictionary, there's one reasonably simple answer. If you're tight on memory, need to insert and delete, can afford moderately slow lookups, and can't afford to do much preprocessing, something else would be better.

> I worry about collisions in the hash table and wonder what the ideal
> number of buckets would be for this case.

That depends on the hash function. For example, if you can afford the preprocessing time to compute a perfect hash function, more than 24500 buckets would be a waste, and collisions can be ignored.

> Not to mention having to scare up a proper hash function for strings
> of variable length.

Experiment with various functions, like the (h*33)+c function posted here recently, or a simple sum-of-characters, or a CRC...find out what works well for your case.

der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu
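For reference, the (h*33)+c function mentioned above comes out to something like this in C (the bucket count is whatever you size your table at, and the constant 33 is worth comparing against 31 or 37 on the real word list rather than taken on faith):

    unsigned hash33(const char *s, unsigned nbuckets)
    {
        unsigned h = 0;

        while (*s)
            h = h * 33 + (unsigned char)*s++;   /* the (h*33)+c step, once per character */
        return h % nbuckets;
    }

With 24,500 words and a few thousand buckets, the average chain stays short as long as the function spreads the words out evenly, which is exactly the kind of thing to measure by experiment as suggested above.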