[comp.compilers] Symbol Table

firth@sei.cmu.edu (Robert Firth) (03/11/89)

This is a brief response to the question of building a symbol table when
compiling a C-like language. I assume the language has at least these features
of C: no factored names (such as Text_IO.Put_Line) and no overloading (such as
Text_IO.Put). However, we do have global names, local names, and block
structure.

The symbol table seems to serve two purposes: mapping lexical identifiers into
names, and finding defining occurrences of names from applied occurrences.

For the first purpose, I believe a simple hash scheme is much the best. Hash
the identifier into a list head, and chain on collisions. Keep it very simple.
The hashing function I usually use takes the identifier length and a few of
its characters, does some shifts and adds, and then takes the final modulus.
There is very little point in being clever provided your string comparison
function is slick: the cost of a couple more probes is less than the cost of
some elaborate all-character hash, quadratic rehash, or whatever.

The number of list heads depends on how big you expect a program to be; for
medium-size compilations (eg of compiler modules with maybe 1500 lines and 600
names) something around 500 to 1000 is about right. The average number of
probes is then only 2 or 3.

For the second purpose, I have experimented with several structures, and I
believe this is the best:

(a) Each lexical identifier is stored once only, in the string table,
    accessed as described above.

(b) Each lexical identifier is associated with exactly one 'global
    declaration cell', which records a global declaration of that
    identifier.  Note that this relies on the fact that all global
    (or outermost scope) names must be unique.

(c) The global declaration cell has two special flags

    . Not globally declared

    . Also locally declared

    If an identifier is ever redeclared locally, the second flag is
    set.  If the identifier has been declared, but not globally, the
    first flag is set.

(d) A linear list of local redeclarations is kept, with the usual
    simple LIFO algorithm: add declarations to the end; search in
    reverse order; mark on block entry and reset on block exit.

The strategy for using this structure is simple: from the lexical identifier
one can go via the hash to the entry in the string table, and from there
directly to the global declaration cell. If the identifier has never been
redeclared, you are done. If the identifier is marked as redeclared, you have
to search the linear list.

The reasoning behind it is as follows: typically, programs have a large set of
global names, used sparsely, and a small set of local names, used densely.
Rarely is a global name redeclared locally. So the great majority of uses of
globally-declared identifiers are indeed applied occurrences of the global
declaration, and can be found in O(1+). The great majority of uses of
locally-declared identifiers are uses of the innermost current declaration,
and again will be found rapidly by linear search, in O(average number of local
declarations per block).

Since the same local names (x,i,p,r,ch,...) are used over and over again, I
don't even bother to reset the 'Also locally declared' flag on block exit: if
you run off the end of the local list, the global declaration applies; if the
'Not globally declared' flag is set, there is no global declaration and you
have an error.

The strategy also deals moderately well with those wretched 'include' files:
the declared names do get stuffed into the string table and so slow the hash
lookup, but their entries in the global declaration list merely take up space
- they don't slow down the cost of looking up declarations. Since small
programs may never even use 90% of the names they import via 'include' files,
this is a desirable property.

I've used this scheme in two compiler front ends, and it was very fast indeed.

Hope that helps

Robert Firth
[From firth@sei.cmu.edu (Robert Firth)]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

henry@zoo.toronto.edu (03/14/89)

>... I assume the language has at least these features
>of C: no factored names (such as Text_IO.Put_Line)...

I haven't run into this particular terminology before, but if it's what
I think, this is badly out of date:  struct member names, in particular,
have been local to their structs in C for a long time.  (Although there
are still some outdated compilers around...)

                                     Henry Spencer at U of Toronto Zoology
                                 uunet!attcan!utzoo!henry henry@zoo.toronto.edu
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request