johnl@ima.UUCP (03/23/88)
In article <917@ima.ISC.COM>, sargas.usc.edu!tli@oberon.usc.edu (Tony Li) writes:
> In fact, another cute trick is to toss in a simple hashing function.
> Unless you've got lots of keywords, you usually can get away with
> doing only one strcmp.

I'm very pleased to see many people confirming that what I've done and
told my students to do is reasonably widely accepted (despite not
appearing in any compiler textbook I know of)... recognizing keywords
and identifiers by a single DFA rule and then using symbol table lookup
techniques to determine the type of the lexeme.

My question is simply: what is this technique officially called, and
does anyone know of a formal reference for it?  Since the Compilers
Course Notes I wrote back at Polytechnic University in 1983, I've been
referring to it as "atomic lexical analysis" because it closely
resembles the way in which Lisp recognizes atoms and then looks 'em up
to determine their type... but that's just my name for it.

[I've never seen it called anything, most likely because it's only
recently that automatic DFA generators have made it possible to do
tokenizing any other way. -John]
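For concreteness, the trick might be coded roughly as below.  This is a
sketch, not code from any of the posters: the names (screen, IDENT, the
toy keyword table, the multiplier 31) are all made up for illustration.
The lexer has a single DFA rule for all words, and screen() classifies
each lexeme, usually at the cost of one strcmp:

/* In the lex spec, one rule covers keywords and identifiers alike:
 *
 *   [a-zA-Z_][a-zA-Z0-9_]*   { return screen(yytext); }
 */

#include <string.h>

#define IDENT    300   /* token code for a plain identifier */
#define HASHSIZE  64

struct kw {
    char *name;
    int   token;
    int   next;        /* next keyword in the same hash chain */
};

static struct kw keywords[] = {   /* toy keyword list */
    { "begin", 257 }, { "end", 258 }, { "if", 259 },
    { "then", 260 },  { "while", 261 },
};
#define NKEYS (sizeof keywords / sizeof keywords[0])

static int bucket[HASHSIZE];      /* head of each hash chain, -1 if empty */

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % HASHSIZE;
}

void init_screener(void)          /* call once before lexing */
{
    unsigned i, h;

    for (h = 0; h < HASHSIZE; h++)
        bucket[h] = -1;
    for (i = 0; i < NKEYS; i++) {
        h = hash(keywords[i].name);
        keywords[i].next = bucket[h];   /* chain any collisions */
        bucket[h] = (int)i;
    }
}

int screen(const char *lexeme)
{
    int i;

    /* chains are short, so this is usually at most one strcmp */
    for (i = bucket[hash(lexeme)]; i >= 0; i = keywords[i].next)
        if (strcmp(keywords[i].name, lexeme) == 0)
            return keywords[i].token;
    return IDENT;                 /* not a keyword */
}

Since the keywords never appear in the lex spec, the generated DFA only
has to recognize the one identifier pattern.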
beres@cadnetix.COM (Tim Beres) (03/25/88)
In article <919@ima.ISC.COM> haddock!uunet!uiucdcs!pur-ee!hankd (Hank Dietz) writes:
>In article <917@ima.ISC.COM>, sargas.usc.edu!tli@oberon.usc.edu (Tony Li) writes:
>> In fact, another cute trick is to toss in a simple hashing function.
>> Unless you've got lots of keywords, you usually can get away with
>> doing only one strcmp.
>
>I'm very pleased to see many people confirming that what I've
>done and told my students to do is reasonably widely accepted
>(despite not appearing in any compiler textbook I know of)...

We use the symbol lookup "trick" for our lex as well, but instead of
hashing we use a set of balanced B-trees to store our symbols, keeping
one tree for each length of string up to some maximum length.  This
works reasonably well if your symbols are somewhat spread over the
length spectrum... in any case, symbol lookup is not our bottleneck.

One more note: I've found that I can exceed the lex table defaults when
using a rule per keyword (lots of rules and symbols); a nice side
benefit of the symbol lookup method is a reduction in lex table usage.
[It is possible to increase the lex table sizes with the %x nnn
control, but I prefer to keep my lex small: with fewer rules it is
simpler and easier to debug.  Also, the symbol lookup method is a nice
model when used in conjunction with the parsers we need.]
--
Tim Beres  Cadnetix  303/444-8075 x221   5775 Flatirons Pkwy
{uunet,boulder,nbires}!cadnetix!beres    Boulder, CO 80301
beres@cadnetix.com
<disclaimers: my thoughts != Cadnetix's>

[It's hard to believe that storing a symbol table in a B-tree is worth
the effort; B-trees make sense for disk files, where looking at a page
is relatively expensive, but for an in-core symbol table the O(1)
lookup time of hashing should be better.  It's also a lot easier to
program. -John]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request
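For what it's worth, Tim's per-length lookup might look roughly like
the sketch below.  It is not his code: plain, unbalanced binary search
trees stand in for his balanced B-trees, and MAXLEN, lookup(), and
insert() are names invented here.  The payoff of segregating symbols by
length is that every string within one tree has the same length, so a
memcmp over exactly len bytes is a complete comparison:

#include <stdlib.h>
#include <string.h>

#define MAXLEN 32          /* assumed maximum symbol length */

struct node {
    char *name;
    int   token;
    struct node *left, *right;
};

static struct node *root[MAXLEN + 1];   /* one tree per string length */

struct node *lookup(const char *name)
{
    size_t len = strlen(name);
    struct node *p = (len <= MAXLEN) ? root[len] : NULL;

    while (p != NULL) {
        int c = memcmp(name, p->name, len);  /* equal lengths: no NUL check */
        if (c == 0)
            return p;
        p = (c < 0) ? p->left : p->right;
    }
    return NULL;           /* not in the table */
}

struct node *insert(const char *name, int token)
{
    size_t len = strlen(name);
    struct node **pp;

    if (len > MAXLEN)
        return NULL;       /* sketch only: a real table would cope */
    for (pp = &root[len]; *pp != NULL; ) {
        int c = memcmp(name, (*pp)->name, len);
        if (c == 0)
            return *pp;    /* already present */
        pp = (c < 0) ? &(*pp)->left : &(*pp)->right;
    }
    *pp = (struct node *)malloc(sizeof **pp);
    if (*pp == NULL)
        return NULL;
    (*pp)->name  = strdup(name);
    (*pp)->token = token;
    (*pp)->left  = (*pp)->right = NULL;
    return *pp;
}

As John notes, an in-core hash table is simpler and faster on average;
the per-length split mainly buys the cheap memcmp and keeps each
individual tree small.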