[comp.unix.wizards] Do start conditions work in lex?

jimv@radix (Jim Valerio) (09/19/87)

I've been debugging a problem parsing dates in MH mail, and have narrowed it
down to the point where it looks like lex is not doing the right thing with
start conditions.  I'm having this problem on a System Vr3 80386 system, and
have reproduced the problem on a 4.3bsd Vax system.

Here's a trivial lex file that illustrates the problem I am having:

	%{
	#undef input
	char *inpstr = "01";
	input() { return *inpstr++; }
	yywrap() { return 1; }
	main() { yylex(); }
	%}
	%start Z
	%%
	0 		{ printf("0\n"); BEGIN Z; }
	1		{ printf("shouldn't get here!\n"); }
	<Z>1		{ printf("1\n"); }
	.		{ printf("Error! (%c)\n", yytext[0]); }

When I compile and run this, I get the "shouldn't get here!" message
from processing the "1" digit in the fixed input string "01".
My understanding of lex leads me to believe that I should instead get
the output lines "0" and then "1".

Am I missing something obvious here, or is there a bug in lex?
If it is a bug, can someone suggest a reasonable workaround?
--
Jim Valerio	{verdix,intelca!mipos3,intel-iwarp.arpa}!omepd!radix!jimv

stan@dnlunx.UUCP (09/21/87)

In article <6@radix>, jimv@radix (Jim Valerio) writes:
> [...] it looks like lex is not doing the right thing with
> start conditions [...]
> [...]
> Am I missing something obvious here, or is there a bug in lex?
> If it is a bug, can someone suggest a reasonable workaround?

	First of all, when lex tries to match some part of the input string
against the specified rules, it will take the rule with the longest match.
When more than one rule matches, lex decides in favor of the rule
which is first encountered in the lex file.
	After the `BEGIN Z;' statement all rules starting with `<Z>' are
active, together with all rules without a start condition.
        -------- ---- --- ----- ------- - ----- ---------
	Your example should work if you swap the rules in which the `1'
is matched, resulting in:

 	%%
 	0 		{ printf("0\n"); BEGIN Z; }
 	<Z>1		{ printf("1\n"); }
 	1		{ printf("shouldn't get here!\n"); }
 	.		{ printf("Error! (%c)\n", yytext[0]); }
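
The selection rule described above can be modelled in a few lines of C.
This is only a sketch and not how lex itself is implemented: the names
select_rule, match_len, struct rule and enum cond are made up for
illustration, and patterns are limited to literal strings plus a lone `.'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Start conditions in this model; NONE marks a rule with no <...> prefix. */
enum cond { NONE, INITIAL, Z };

struct rule {
    enum cond cond;      /* start-condition prefix, or NONE */
    const char *text;    /* literal pattern; a lone "." matches any one character */
};

/* Length of the match of rule r at the front of input, or 0 for no match. */
static size_t match_len(const struct rule *r, const char *input)
{
    size_t n = strlen(r->text);
    if (strcmp(r->text, ".") == 0)
        return (input[0] != '\0' && input[0] != '\n') ? 1 : 0;
    return strncmp(r->text, input, n) == 0 ? n : 0;
}

/* Pick a rule the way the thread describes: only active rules compete,
   the longest match wins, and ties go to the rule listed first.
   Returns the rule index, or -1 if nothing matches. */
static int select_rule(const struct rule *rules, int nrules,
                       enum cond state, const char *input)
{
    int best = -1;
    size_t best_len = 0;

    assert(rules != NULL && input != NULL);
    for (int i = 0; i < nrules; i++) {
        /* Rules without a prefix are active in every state. */
        if (rules[i].cond != NONE && rules[i].cond != state)
            continue;
        size_t len = match_len(&rules[i], input);
        if (len > best_len) {   /* strict '>' keeps the earlier rule on a tie */
            best = i;
            best_len = len;
        }
    }
    return best;
}

/* The two rule orderings from the thread. */
static const struct rule original_order[] = {
    { NONE, "0" }, { NONE, "1" }, { Z, "1" }, { NONE, "." },
};
static const struct rule swapped_order[] = {
    { NONE, "0" }, { Z, "1" }, { NONE, "1" }, { NONE, "." },
};
```

In this model, select_rule(original_order, 4, Z, "1") lands on the
unprefixed `1' rule, while the same call on swapped_order lands on `<Z>1',
because both rules match one character and the earlier one wins the tie.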


				Stan
				----

ejp@ausmelb.oz.au (Esmond Pitt) (09/22/87)

In article <6@radix> jimv@radix.UUCP (Jim Valerio) writes:
>I've been debugging a problem parsing dates in MH mail, and have narrowed it
>down to the point where it looks like lex is not doing the right thing with
>start conditions.

1. Rules with start-conditions are in effect only within that start-condition.
2. Rules without start-conditions are always in effect.
3. In the event of two rules matching the same text, the first occurring rule
is chosen.
4. Therefore, rules with start-conditions must precede rules without them.

If you put your rules in this order:

	<Z>1	{blah}
	1	{blah}

you will get the desired result. There ARE bugs in lex:

1. The metacharacters ^ and $ only work at the literal beginning and end,
respectively, of a rule; i.e. they do not work within () brackets, nor can
they be put within a named rule. For example, all occurrences of ^ in the
example below represent the character '^', not the beginning of the line.

FRED	^FRED
%%
{FRED}	printf("FRED = %s\n",yytext);
(^JOE)	printf("JOE = %s\n",yytext);

2. Trailing-context constructions like x+/xy: if you input xxxy to this it
will return xxx, not xx as it should.
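
Working the trailing-context example through by hand: if x+/xy is read as
"one or more x, followed by xy, with only the x+ part returned", then the
only way to split xxxy is xx followed by xy. The C sketch below hard-codes
that one pattern to show the arithmetic; the function name xplus_before_xy
is made up for illustration and nothing here is taken from lex itself.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the text that the pattern x+/xy should return at the front
   of s, or 0 if the pattern does not match there. */
static size_t xplus_before_xy(const char *s)
{
    size_t k = 0;

    assert(s != NULL);
    while (s[k] == 'x')      /* count the leading run of x's */
        k++;
    /* The context xy must be the last x of the run plus the y after it,
       and x+ needs at least one x in front of that. */
    if (k < 2 || s[k] != 'y')
        return 0;
    return k - 1;            /* every x except the last belongs to x+ */
}
```

For the input xxxy the run has three x's, so the function returns 2: the
text xx, with xy left as context.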

-- 
Esmond Pitt, Austec International Ltd
...!seismo!munnari!ausmelb!ejp,ejp@ausmelb.oz.au

jbuck@epimass.EPI.COM (Joe Buck) (09/22/87)

In article <6@radix> jimv@radix.UUCP (Jim Valerio) writes:
>I've been debugging a problem parsing dates in MH mail, and have narrowed it
>down to the point where it looks like lex is not doing the right thing with
>start conditions.  I'm having this problem on a System Vr3 80386 system, and
>have reproduced the problem on a 4.3bsd Vax system.

Your problem is that a pattern without a start condition matches
regardless of start condition.  So with the following lex code:

>	%start Z
>	%%
>	0 		{ printf("0\n"); BEGIN Z; }
>	1		{ printf("shouldn't get here!\n"); }
>	<Z>1		{ printf("1\n"); }

and the input "01", the first "1" rule matches even though you are in
the Z state, because if no state is given in the rule it always
matches.  Solution: reverse the order, to

	0 		{ printf("0\n"); BEGIN Z; }
	<Z>1		{ printf("1\n"); }
	1		{ printf("shouldn't get here!\n"); }

since the first rule found is the one that is used.
-- 
- Joe Buck  {uunet,ucbvax,sun,decwrl,<smart-site>}!epimass.epi.com!jbuck
	    Old internet mailers: jbuck%epimass.epi.com@uunet.uu.net

john@frog.UUCP (John Woods, Software) (09/23/87)

In article <6@radix>, jimv@radix (Jim Valerio) writes:
> 	%start Z
> 	%%
> 	0 		{ printf("0\n"); BEGIN Z; }
> 	1		{ printf("shouldn't get here!\n"); }
> 	<Z>1		{ printf("1\n"); }
> 	.		{ printf("Error! (%c)\n", yytext[0]); }
> 
Two keys from the LEX documentation:
(1) Rules with no start condition are always active.
(2) When two rules match the same input string, the first is preferred.

Try changing your rules to

	%start Z
	%%
	0 		{ printf("0\n"); BEGIN Z; }
	<Z>1		{ printf("1\n"); }
	1		{ printf("shouldn't get here!\n"); }
	.		{ printf("Error! (%c)\n", yytext[0]); }


--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw@eddie.mit.edu

Maybe it's the sound of a WET RAG hitting a smooth WEASEL!

chris@mimsy.UUCP (Chris Torek) (09/23/87)

>In article <6@radix> jimv@radix.UUCP (Jim Valerio) writes:
>>... it looks like lex is not doing the right thing with start conditions.

In article <1516@epimass.EPI.COM> jbuck@epimass.EPI.COM (Joe Buck) writes:
>Your problem is that a pattern without a start condition matches
>regardless of start condition.  So with the following lex code:

>>	0 		{ printf("0\n"); BEGIN Z; }
>>	1		{ printf("shouldn't get here!\n"); }
>>	<Z>1		{ printf("1\n"); }

>... Solution: reverse the order, to
>	0 		{ printf("0\n"); BEGIN Z; }
>	<Z>1		{ printf("1\n"); }
>	1		{ printf("shouldn't get here!\n"); }

Or use

	<INITIAL>1	{ printf("shouldn't get here!\n"); }
	<Z>1		{ printf("1\n"); }

Lex begins in state INITIAL; if there are no `%state's or BEGIN
directives, it stays that way forever.
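
Why the <INITIAL> prefix removes the dependence on rule order can be seen
by counting how many of the competing rules are active in each state. The
C sketch below is a toy model under that reading; the names rule_active,
active_count, enum tag and the two arrays are hypothetical and not part of
lex.

```c
#include <assert.h>
#include <stdbool.h>

enum cond { INITIAL, Z };                      /* lex starts in INITIAL */
enum tag  { UNTAGGED, ONLY_INITIAL, ONLY_Z };  /* prefix written on a rule */

/* Is a rule with prefix t active while the scanner is in the given state? */
static bool rule_active(enum tag t, enum cond state)
{
    switch (t) {
    case UNTAGGED:     return true;              /* no prefix: always active */
    case ONLY_INITIAL: return state == INITIAL;  /* <INITIAL> prefix */
    case ONLY_Z:       return state == Z;        /* <Z> prefix */
    }
    assert(!"unknown tag");
    return false;
}

/* How many of the n rules are active in the given state? */
static int active_count(const enum tag *tags, int n, enum cond state)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (rule_active(tags[i], state))
            count++;
    return count;
}

/* The two `1' rules as written in the original post and with <INITIAL>. */
static const enum tag plain_and_z[]   = { UNTAGGED, ONLY_Z };
static const enum tag initial_and_z[] = { ONLY_INITIAL, ONLY_Z };
```

In state Z, active_count(plain_and_z, 2, Z) is 2, so the tie has to be
broken by rule order; active_count(initial_and_z, 2, Z) is 1, so only the
<Z> rule can fire whatever the order.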
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

ejp@ausmelb.oz.au (Esmond Pitt) (09/29/87)

In article <6@radix>, jimv@radix (Jim Valerio) writes:
> [...] it looks like lex is not doing the right thing with
> start conditions [...]
> [...]
> Am I missing something obvious here, or is there a bug in lex?
> If it is a bug, can someone suggest a reasonable workaround?

[1000's of people pointed out lex's rule-selection rules]

Another thing to mention is that rules can be made to be active
*only* in the initial condition,
by using the start-condition name INITIAL, i.e.:

<Z>blah		grumble;
<INITIAL>blah	mumble;


-- 
Esmond Pitt, Austec International Ltd
...!seismo!munnari!ausmelb!ejp,ejp@ausmelb.oz.au