[comp.compilers] Data structures for symbol tables

webber@athos.rutgers.edu (03/28/88)

In article <929@ima.ISC.COM>, beres@cadnetix.COM (Tim Beres) writes
some stuff to which the moderator appends:

> [It's hard to believe that storing a symbol table in a B-tree is worth the
> effort; B-trees make sense for disk files since looking at a page is
> relatively expensive, but for an in-core symbol table the O(1) lookup time
> of hashing should be better.  It's also a lot easier to program.  -John]

Depends on how large your symbol table is. It is easy to imagine
B-trees out performing hash tables if the b-tree nodes are near page
size.  Remember, virtual memory is on a disk too.  Given the variety
of computer architectures, hash functions, ways of handling hash
collisions, identifier usage patterns among languages, and programming
goals (i.e., how architecture specific it is possible to make the
implementation), I doubt if there is one right structure.  Many of the
studies have been done on systems where page fetch delays aren't
charged to the process making them, making them suspect.  Considering
the number of times a compiler gets used in comparison to the number
of times it gets written, the real question is: is the symbol table a
bottleneck worth speeding up?  Probably the time would be better spent
creating structure editors that hand the compiler a pre-scanned pre-parsed
object to fiddle with as appropriate.

------ BOB (webber@athos.rutgers.edu ; rutgers!athos.rutgers.edu!webber)
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

daveb@ucbvax.Berkeley.EDU (Crack? No thanks, I've got a new CD player) (03/31/88)

In article <930@ima.ISC.COM> webber@athos.rutgers.edu writes:
>In article <929@ima.ISC.COM>, beres@cadnetix.COM (Tim Beres) writes
>some stuff to which the moderator appends:
>
>> [It's hard to believe that storing a symbol table in a B-tree is worth the
>> effort; B-trees make sense for disk files since looking at a page is
>> relatively expensive, but for an in-core symbol table the O(1) lookup time
>> of hashing should be better.  It's also a lot easier to program.  -John]
>
>Depends on how large your symbol table is. It is easy to imagine
>B-trees out performing hash tables if the b-tree nodes are near page
>size.  Remember, virtual memory is on a disk too.  Given the variety
>of computer architectures, hash functions, ways of handling hash
>collisions, identifier usage patterns among languages, and programming
>goals...

I'd also like to question the presumed superiority of hashing for symbol
table functions.  Even when in-memory the constant factors can outweigh
the O(1) vs O(log N) theoretical advantage.

[ Anecdotal ]

An application I was working on was using a hash table to hold around
30k entries.  With instrumentation and profiling, I found long hash
chains *on average*, caused by a not very randomizing hash algorithm. 
The profile showed most of the time being spent in strcmps down the
chains, and in the hash function proper.  Changing the hash function for
better distribution slowed things down, because the time spent doing a
better hash was greater than the cost of doing the strcmps.

I replaced the hash tables with splay trees, and found the performance
was better, mostly because I was not doing the expensive hash function,
and effectively went strcmping right down the chain, whose average
length turned out to be about the same as the collision chain hanging
off the hash buckets before.  As a side effect, it gives you a nicely
ordered scan should that be needed.

The moral is that good performance in a hash table depends a lot on:

	* The distribution of the hashing function.  
	* The cost of the hashing function.
	* The cost of searching a collision chain

Few people bother to instrument and/or tune the hash for the best
distribution, and this can easily wipe out whatever theoretical
advantage may be present.

Those interested in the code for a splay tree library suitable for
symbol table use will find it in comp.sources.unix in a few weeks, wind
and weather permitting.  I'll mail it to anyone who is in a rush.

-dB
{amdahl, cpsc6a, mtxinu, ptsfa, sun, hoptoad}!rtech!daveb daveb@rtech.uucp
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request