[comp.lang.c] LEX behaviour when given "large" automata

phs@lifia.imag.fr (Philippe Schnoebelen) (03/04/88)

   I'm having some problems with LEX. As my number of keywords/regexps
grows, the lexical analyzer begins to give strange, unexpected (let's
face it, wrong) results. This is very annoying because I got no warning
that my lexical specification was too large. Now, maybe LEX is fine and
I'm just blaming it for my own errors, but you know how easy it is to
find a suspect when you're no C wizard :-)

  Does anybody know about the behaviour of LEX in such situations and
can explain how to interpret, avoid, or solve the problem? (A first
step would be a warning message...)

   Many thanks in advance.
--
Philippe SCHNOEBELEN,				Best:        phs@lifia.imag.fr
LIFIA - INPG,
46, Avenue Felix VIALLET			2nd:		phs@lifia.UUCP
38000 Grenoble, FRANCE				last: ..!mcvax!inria!lifia!phs
[Lex has never been noted for its robustness, nor for the quality of its
implementation, having been basically a summer's student intern project.  It
could stand serious rewriting which, to the best of my knowledge, it has never
received.  -John]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

rsalz@BBN.COM (Rich Salz) (03/19/88)

In comp.compilers (<911@ima.ISC.COM>), phs@lifia.imag.fr (Philippe Schnoebelen) writes:
>   I'm having some problems with LEX. As my number of keywords/regexps
>grows, the lexical analyzer begins to give strange, unexpected (let's
>face it, wrong) results.

Lex ain't robust.  As a work-around, you can get real big savings in
all of
	1.  creation time
	2.  running time
	3.  executable size
by going from one pattern per keyword to a general pattern for identifiers
and doing a table lookup.  That is, don't do this:
	for	return(FOR);
	if	return(IF);
	foo	return(FOO);
	[a-z]+	return(munchonit(yytext));

Do this:
	table[] = { { "for", FOR }, { "if", IF }, { NULL } };

	[a-z]+ 	{
		for (p = table; p->name; p++)
		    if (strcmp(p->name, yytext) == 0)
			return(p->value);
		return(munchonit(yytext));
	    }

(I left out all sort of declarations and optimizations on the search
loop.)
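Filled out with the omitted declarations, the lookup might look like the
sketch below as standalone C. The token codes and the function name
keyword_or_id are invented for illustration, and munchonit is replaced by
simply returning a generic ID token:

```c
#include <string.h>

/* Token codes are made up here; real ones would come from the
 * parser's generated definitions. */
enum { FOR = 257, IF = 258, FOO = 259, ID = 260 };

struct keyword {
    const char *name;
    int value;
};

/* Keyword table; a NULL name terminates it. */
static const struct keyword table[] = {
    { "for", FOR },
    { "if",  IF  },
    { "foo", FOO },
    { NULL,  0   }
};

/* Body of the single [a-z]+ action: linear search of the keyword
 * table, falling back to a generic identifier token. */
int keyword_or_id(const char *yytext)
{
    const struct keyword *p;

    for (p = table; p->name != NULL; p++)
        if (strcmp(p->name, yytext) == 0)
            return p->value;
    return ID;          /* stands in for munchonit(yytext) */
}
```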

This is a real fun thing to do:  how often do you get to win on both sides of
the time-space tradeoff?
	/r$
[Similar suggestion from Eddie Wyatt edw@ius1.cs.cmu.edu]
[From Rich Salz <rsalz@BBN.COM>]

chris@mimsy.UUCP (Chris Torek) (03/20/88)

Several people suggested replacing a lex description such as

	if	return (IF);
	then	return (THEN);
	else	return (ELSE);
	{id}	{ symenter(yytext); return (ID); }

with

	{id}	{ /* first try it as a keyword */
		  k = keyword_index(yytext); if (k >= 0) return (k);
		  /* not a keyword, must be an identifier */
		  symenter(yytext); return (ID); }

This may (probably does) help in lex, but it is clearly the wrong way
around in almost all cases.  The lexer has to traverse a DFA to decide
that something is an ID.  While it is at it, it could easily tell
whether it is a keyword instead.  Obviously this makes the DFA table
larger, but the result should run faster.

It turns out (as Van Jacobson discovered) that lex uses a long
subroutine---on the order of 200 lines---to do what is essentially
described by the loop

	/* follow the DFA */
	for (state = begin; state[c].v == c; c = *in++)
		state += state[c].n;

(best case) or

	/*
	 * Pick the right start state depending on whether we are at
	 * the beginning of a line.
	 */
	state = newline ? hat_begin : begin;
	while (state[c].v == c) {
		/* follow the DFA */
		state += state[c].n;
		c = *in++;
		/* remember action state (for right context) */
		if (state->v) {
			a_s = state;
			a_c = in;
		}
	}
	in = a_c;
	if (a_s->n) {
		in -= a_s->n;	/* back up over right context */
		c = in[-1];
	}
	newline = in[-2]=='\n';	/* remember whether we are now at ^ */
	in[-1] = '\0';		/* kill residual */

(worst case).  This pretty much handles everything except YYREJECT.
I do not know whether the LBL `flex' (Fast LEX) handles YYREJECT now.
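For comparison, here is a toy, self-contained rendering of that inner idea
in C. It does not use lex's compressed state[c].v/state[c].n layout; only
the tight table-following loop and the remember-the-last-accepting-position
bookkeeping mirror the sketch above, and the DFA (for the regexp [a-z]+)
is invented for the example:

```c
#include <ctype.h>

/* Toy DFA for [a-z]+ as a dense transition function.
 * States: 0 = start, 1 = accepting; -1 means the DFA has died. */
static int delta(int state, int c)
{
    return (state >= 0 && islower(c)) ? 1 : -1;
}

/* Longest-match scan: follow the DFA until it dies, remembering
 * the last accepting position (the a_s/a_c idea above).
 * Returns the number of characters consumed by the match. */
int scan(const char *in)
{
    int state = 0, i = 0, last_accept = 0;

    while ((state = delta(state, (unsigned char)in[i])) >= 0) {
        i++;
        if (state == 1)          /* remember accepting state */
            last_accept = i;
    }
    return last_accept;
}
```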
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

tli@oberon.usc.edu (Tony Li) (03/20/88)

In fact, another cute trick is to toss in a simple hashing function.
Unless you've got lots of keywords, you usually can get away with
doing only one strcmp.
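A minimal sketch of that idea in C: the hash function (first letter plus
length) and the token names are invented, and the hash happens to be
collision-free for this tiny keyword set only, so each lookup costs at
most one strcmp. A real scanner would use a generated perfect hash (the
kind of table a tool like GNU gperf produces):

```c
#include <string.h>

/* Invented token codes for illustration. */
enum { T_IF = 300, T_ELSE = 301, T_WHILE = 302, T_ID = 303 };

struct kw {
    const char *name;
    int value;
};

#define NBUCKETS 16

/* Toy hash: first character plus length, modulo the table size.
 * It is collision-free for this particular keyword set only. */
static unsigned kwhash(const char *s)
{
    return ((unsigned char)s[0] + (unsigned)strlen(s)) % NBUCKETS;
}

/* Each keyword sits in the bucket its hash selects:
 * "else" -> 9, "if" -> 11, "while" -> 12. */
static const struct kw buckets[NBUCKETS] = {
    [9]  = { "else",  T_ELSE  },
    [11] = { "if",    T_IF    },
    [12] = { "while", T_WHILE },
};

/* One hash, then at most one strcmp per identifier. */
int kwlookup(const char *yytext)
{
    const struct kw *p = &buckets[kwhash(yytext)];

    if (p->name != NULL && strcmp(p->name, yytext) == 0)
        return p->value;
    return T_ID;
}
```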

Tony
Tony Li - USC University Computing Services	"Fene mele kiki bobo"
Uucp: oberon!tli						-- Joe Isuzu
Bitnet: tli@uscvaxq, tli@ramoth Internet: tli@sargas.usc.edu

chris@mimsy.UUCP (Chris Torek) (03/27/88)

-In some article whose referent was deleted, I wrote:
->[rewriting keywords to use table lookup on generic identifiers]
->may (probably does) help in lex, but it is clearly the wrong way
->around in almost all cases.  The lexer has to traverse a DFA to decide
->that something is an ID.  While it is at it, it could easily tell
->whether it is a keyword instead.  Obviously this makes the DFA table
->larger, but the result should run faster.

In article <10700014@snail> carroll@snail.CS.UIUC.EDU writes:
-	I disagree. For a compiler class, we had to write a lex file, and
-the official way was to use the
-
-IF	 {return T_IF }
-
-etc. thing. It was big and slow. I changed mine to use the single rule for
-identifier, and look it up in a table, and I got a speed up of roughly 4
-in lex/compile time, and 40% reduction in compiled object size. I think that
-the point is that the DFA that decides if something is a token or not is
-so much simpler than one that has to check for all of the keywords, that
-the gain is positive from removing the keyword cases.

Remember, I said `may (probably does) help in lex' and `in almost all
cases'.  The fact that the DFA table is larger may slow you down, or
may not, all depending upon the machine on which the program is run
(does it have caches?  How big are they?  etc).  The fact that Lex
generates horrid DFA code is far more important in this case.

-In addition, I would think that for most cases, you have to find
-non-builtin "keywords" (e.g., variable names, etc.) as you go, and
-so have to check for something that is not a keyword but is still
-an identifier anyway.

Possibly.  This depends on the ratio of keywords to identifiers in the
`average' input.  Languages with many keywords (e.g., Pascal: begin,
end, do, then, var) may have a high ratio, making faster keyword-id to
token translation important; languages with few keywords (C) may have
a low ratio, making it unimportant.  Conversely, many keywords may
make the DFA table explode.  `Almost all cases' may have been an
overstatement; I have no hard data.  Try rewriting your lexer using
the LBL Fast Lex and see what happens: that will provide one data
point.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris