firth@sei.cmu.edu (Robert Firth) (03/11/89)
This is a brief response to the question of building a symbol table when compiling a C-like language. I assume the language has at least these features of C: no factored names (such as Text_IO.Put_Line) and no overloading (such as Text_IO.Put). However, we do have global names, local names, and block structure. The symbol table seems to serve two purposes: mapping lexical identifiers into names, and finding defining occurrences of names from applied occurrences. For the first purpose, I believe a simple hash scheme is much the best. Hash the identifier into a list head, and chain on collisions. Keep it very simple. The hashing function I usually use takes the identifier length and a few of its characters, does some shifts and adds, and then takes the final modulus. There is very little point in being clever provided your string comparison function is slick: the cost of a couple more probes is less than the cost of some elaborate all-character hash, quadratic rehash, or whatever. The number of list heads depends on how big you expect a program to be; for medium-size compilations (eg of compiler modules with maybe 1500 lines and 600 names) something around 500 to 1000 is about right. The average number of probes is then only 2 or 3. For the second purpose, I have experimented with several structures, and I believe this is the best: (a) Each lexical identifier is stored once only, in the string table, accessed as described above. (b) Each lexical identifier is associated with exactly one 'global declaration cell', which records a global declaration of that identifier. Note that this relies on the fact that all global (or outermost scope) names must be unique. (c) The global declaration cell has two special flags . Not globally declared . Also locally declared If an identifier is ever redeclared locally, the second flag is set. If the identifier has been declared, but not globally, the first flag is set. (d) A linear list of local redeclarations is kept, with the usual simple LIFO algorithm: add declarations to the end; search in reverse order; mark on block entry and reset on block exit. The strategy for using this structure is simple: from the lexical identifier one can go via the hash to the entry in the string table, and from there directly to the global declaration cell. If the identifier has never been redeclared, you are done. If the identifier is marked as redeclared, you have to search the linear list. The reasoning behind it is as follows: typically, programs have a large set of global names, used sparsely, and a small set of local names, used densely. Rarely is a global name redeclared locally. So the great majority of uses of globally-declared identifiers are indeed applied occurrences of the global declaration, and can be found in O(1+). The great majority of uses of locally-declared identifiers are uses of the innermost current declaration, and again will be found rapidly by linear search, in O(average number of local declarations per block). Since the same local names (x,i,p,r,ch,...) are used over and over again, I don't even bother to reset the 'Also locally declared' flag on block exit: if you run off the end of the local list, the global declaration applies; if the 'Not globally declared' flag is set, there is no global declaration and you have an error. The strategy also deals moderately well with those wretched 'include' files: the declared names do get stuffed into the string table and so slow the hash lookup, but their entries in the global declaration list merely take up space - they don't slow down the cost of looking up declarations. Since small programs may never even use 90% of the names they import via 'include' files, this is a desirable property. I've used this scheme in two compiler front ends, and it was very fast indeed. Hope that helps Robert Firth [From firth@sei.cmu.edu (Robert Firth)] -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { decvax | harvard | yale | bbn}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
henry@zoo.toronto.edu (03/14/89)
>... I assume the language has at least these features >of C: no factored names (such as Text_IO.Put_Line)... I haven't run into this particular terminology before, but if it's what I think, this is badly out of date: struct member names, in particular, have been local to their structs in C for a long time. (Although there are still some outdated compilers around...) Henry Spencer at U of Toronto Zoology uunet!attcan!utzoo!henry henry@zoo.toronto.edu -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { decvax | harvard | yale | bbn}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request