phs@lifia.imag.fr (Philippe Schnoebelen) (03/04/88)
I'm having some problems with LEX. When my number of keywords/regexps grows, the lexical analyzer begins to give strange, unexpected (let's face it, wrong) results. This is very annoying because I did not get any warning that my lexical specification was too large. Now, maybe LEX is fine and I'm just blaming it for my own weird errors, but you know how easy it is to find a suspect when you're no C wizard :-)

Does anybody know something about the behaviour of LEX in such situations, and could you explain how to interpret, avoid, or solve the problem? (A first step would be to get some warning message...)

Many thanks in advance.
--
Philippe SCHNOEBELEN                        Best: phs@lifia.imag.fr
LIFIA - INPG, 46, Avenue Felix VIALLET      2nd:  phs@lifia.UUCP
38000 Grenoble, FRANCE                      last: ..!mcvax!inria!lifia!phs

[Lex has never been noted for its robustness, nor for the quality of its implementation, having been basically a summer student intern project. It could stand a serious rewriting which, to the best of my knowledge, it has never received. -John]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request
rsalz@BBN.COM (Rich Salz) (03/19/88)
In comp.compilers (<911@ima.ISC.COM>), phs@lifia.imag.fr (Philippe Schnoebelen) writes:
> I'm having some problems with LEX. When my number of keywords/regexps is
> growing, the lexical analyzer begins to give strange, unexpected, (let's
> face it, wrong) results.

Lex ain't robust. As a work-around, you can get real big savings in all of
  1. creation time
  2. running time
  3. executable size
by going from one pattern per keyword to a general pattern for identifiers, and doing a table lookup. That is, don't do this:

    for     return(FOR);
    if      return(IF);
    foo     return(FOO);
    [a-z]+  return(munchonit(yytext));

Do this:

    table[] = {
        { "for", FOR },
        { "if",  IF },
        { NULL }
    };

    [a-z]+  {
                for (p = table; p->name; p++)
                    if (strcmp(p->name, yytext) == 0)
                        return(p->value);
                return(munchonit(yytext));
            }

(I left out all sorts of declarations and optimizations on the search loop.) This is a real fun thing to do: how often do you get to win on both sides of the time-space tradeoff?
	/r$
[Similar suggestion from Eddie Wyatt edw@ius1.cs.cmu.edu]
[From Rich Salz <rsalz@BBN.COM>]
chris@mimsy.UUCP (Chris Torek) (03/20/88)
Several people suggested replacing lex descriptions such as

    if      return (IF);
    then    return (THEN);
    else    return (ELSE);
    {id}    { symenter(yytext); return (ID); }

with

    {id}    {
                /* first try it as a keyword */
                k = keyword_index(yytext);
                if (k >= 0)
                    return (k);
                /* not a keyword, must be an identifier */
                symenter(yytext);
                return (ID);
            }

This may (probably does) help in lex, but it is clearly the wrong way around in almost all cases. The lexer has to traverse a DFA to decide that something is an ID. While it is at it, it could easily tell whether it is a keyword instead. Obviously this makes the DFA table larger, but the result should run faster.

It turns out (as Van Jacobson discovered) that lex uses a long subroutine---on the order of 200 lines---to do what is essentially described by the loop

    /* follow the DFA */
    for (state = begin; state[c].v == c; c = *in++)
        state += state[c].n;

(best case) or

    /*
     * Pick the right start state depending on whether we are at
     * the beginning of a line.
     */
    state = newline ? hat_begin : begin;
    while (state[c].v == c) {
        /* follow the DFA */
        state += state[c].n;
        c = *in++;
        /* remember action state (for right context) */
        if (state->v) {
            a_s = state;
            a_c = in;
        }
    }
    in = a_c;
    if (a_s->n) {
        in -= a_s->n;           /* back up over right context */
        c = in[-1];
    }
    newline = in[-2] == '\n';   /* remember whether we are now at ^ */
    in[-1] = '\0';              /* kill residual */

(worst case). This pretty much handles everything except YYREJECT. I do not know whether the LBL `flex' (Fast LEX) handles YYREJECT now.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu     Path: uunet!mimsy!chris
tli@oberon.usc.edu (Tony Li) (03/20/88)
In fact, another cute trick is to toss in a simple hashing function. Unless you've got lots of keywords, you can usually get away with doing only one strcmp.

Tony

Tony Li - USC University Computing Services     "Fene mele kiki bobo"
Uucp: oberon!tli                                        -- Joe Isuzu
Bitnet: tli@uscvaxq, tli@ramoth
Internet: tli@sargas.usc.edu
chris@mimsy.UUCP (Chris Torek) (03/27/88)
-In some article whose referent was deleted, I wrote:
->[rewriting keywords to use table lookup on generic identifiers]
->may (probably does) help in lex, but it is clearly the wrong way
->around in almost all cases. The lexer has to traverse a DFA to decide
->that something is an ID. While it is at it it could easily tell
->whether it is a keyword instead. Obviously this makes the DFA table
->larger, but the result should run faster.
In article <10700014@snail> carroll@snail.CS.UIUC.EDU writes:
- I disagree. For a compiler class, we had to write a lex file, and
-the official way was to use the
-
-IF {return T_IF }
-
-etc. thing. It was big and slow. I changed mine to use the single rule for
-identifier, and look it up in a table, and I got a speed up of roughly 4
-in lex/compile time, and 40% reduction in compiled object size. I think that
-the point is that the DFA that decides if something is a token or not is
-so much simpler than one that has to check for all of the keywords, that
-the gain is positive from removing the keyword cases.
Remember, I said `may (probably does) help in lex' and `in almost all
cases'. The fact that the DFA table is larger may slow you down, or
may not, all depending upon the machine on which the program is run
(does it have caches? How big are they? etc). The fact that Lex
generates horrid DFA code is far more important in this case.
-In addition, I would think that for most cases, you have to find
-non-builtin "keywords" (e.g, variable names, etc.) as you go, and
-so have to check for something that is not a keyword but is still
-an identifier anyway.
Possibly. This depends on the ratio of keywords to identifiers in the
`average' input. Languages with many keywords (e.g., Pascal: begin,
end, do, then, var) may have a high ratio, making faster keyword-id to
token translation important; languages with few keywords (C) may have
a low ratio, making it unimportant. Conversely, many keywords may
make the DFA table explode. `Almost all cases' may have been an
overstatement; I have no hard data. Try rewriting your lexer using
the LBL Fast Lex and see what happens: that will provide one data
point.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu Path: uunet!mimsy!chris