[comp.lang.c] What Data Structure?

burgin@ecsvax.uncecs.edu (Robert Burgin) (06/24/91)

I am writing a program that includes a dictionary look-up
of 24,500 English words.  Each entry in the dictionary will
contain the word itself and a string representing valid
syntactic tags for that word (noun, verb, etc.).

Would a hash table be best for the implementation?  Or
a trie?  Why?

The trie would seem to be ideal for short words but bad
for longer words.  Since the most frequent words in English
are rather short, perhaps that should be a consideration
in favor of the trie.  On the other hand, tries are
notoriously memory-intensive.

I worry about collisions in the hash table and wonder
what the ideal number of buckets would be for this case.
Not to mention having to scare up a proper hash function
for strings of variable length.

Any thoughts would be appreciated.

--rb

boyd@prl.dec.com (Boyd Roberts) (06/25/91)

Why not use a `hash table of binary trees'?  Each hash table slot has a binary tree hanging off it.  A fast hash and an extensible tree.  Ideal.



Boyd Roberts			boyd@prl.dec.com

``When the going gets wierd, the weird turn pro...''

worley@compass.com (Dale Worley) (06/25/91)

In article <1991Jun24.110300.3740@ecsvax.uncecs.edu> burgin@ecsvax.uncecs.edu (Robert Burgin) writes:
   I am writing a program that includes a dictionary look-up
   of 24,500 English words.  

   Would a hash table be best for the implementation?  Or
   a trie?  Why?

What are you going to *do* with the dictionary?  As in all
data-structure questions, you have to first figure out what you are
going to do with it before you know which data structure to use.

Dale Worley		Compass, Inc.			worley@compass.com
--
toy dog [toi-dawg] n. Any of several breeds of small dogs, often
high-strung.  Syn: yip-yip, yip-dog, microwave delight.

mouse@thunder.mcrcim.mcgill.edu (der Mouse) (06/28/91)

In article <1991Jun24.110300.3740@ecsvax.uncecs.edu>, burgin@ecsvax.uncecs.edu (Robert Burgin) writes:

> I am writing a program that includes a dictionary look-up of 24,500
> English words.  Each entry in the dictionary will contain the word
> itself and a string representing valid syntactic tags for that word
> (noun, verb, etc.).

> Would a hash table be best for the implementation?  Or a trie?  Why?

Yes.  :-)

Seriously though, there is no single data structure that can definitely
be said to be best, given no more information than you have given.

For example, how dear is memory?  If you can afford to dedicate several
megabytes to the dictionary data structures, want a reasonably fast
lookup, no inserts or deletes at run-time, and can afford to spend some
CPU time preprocessing the dictionary, there's one reasonably simple
answer.  If you're tight on memory, need to insert and delete, can
afford moderately slow lookups, and can't afford to do much
preprocessing, something else would be better.

> I worry about collisions in the hash table and wonder what the ideal
> number of buckets would be for this case.

That depends on the hash function.  For example, if you can afford the
preprocessing time to compute a perfect hash function, more than 24500
buckets would be a waste, and collisions can be ignored.

> Not to mention having to scare up a proper hash function for strings
> of variable length.

Experiment with various functions, like the (h*33)+c function posted
here recently, or a simple sum-of-characters, or a CRC...find out what
works well for your case.

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu