[comp.unix.wizards] Lex and initial start conditions

pommerel@sp14.csrd.uiuc.edu (Claude Pommerell) (05/31/90)

Lex rules that do not begin with any starting condition <cond>... are valid
in ALL possible starting conditions. BEGIN 0 resets the Lex interpreter to
its initial state, where it has no explicit starting condition, so that only
the untagged rules are valid.

There is a way to solve your problem, Jack. You know from the Lex manual
that any text enclosed between lines starting with "%{" and "%}" is inserted
literally in the C program generated by Lex. In fact, if you put this
insertion text before the first line starting with "%%" (that is, in the
definitions section of your Lex source), it gets inserted at the global
scope of the C program, so this is perfect for declaring externals and such.

However, if you put such an insertion text after "%%" (in the rules section
of your Lex source, before the first rule), it gets inserted at the start of
the body of the function that performs the lexical analysis, so you can use
it to specify an initial condition.

This is my Lex source to skip nested C-like comments:
-------------------------------------------------------------------
%{
/* context in recursive C-like comments */
static int commentLevel;
%}
  /* Starting conditions to support recursive C-like comments */
%START  Text NewCCom InCCom EndCCom
%%
%{
/* Set the initial condition */
BEGIN Text;
commentLevel = 0;
%}
<Text>\/\*                      { commentLevel = 1; BEGIN InCCom; }
<InCCom>\/                      { BEGIN NewCCom; }
<InCCom>\*                      { BEGIN EndCCom; }
<NewCCom>\*                     { ++commentLevel; BEGIN InCCom; }
<EndCCom>\/                     { if (--commentLevel)
                                    BEGIN InCCom;
                                  else
                                    BEGIN Text;
                                }
<NewCCom,EndCCom>[^\*\/]        { BEGIN InCCom; }
<InCCom>[^\/\*]                 |
<NewCCom>\/                     |
<EndCCom>\*                     ;
-------------------------------------------------------------------
All the other rules (the truly regular, context-free ones) are tagged with
the starting condition <Text>.

This solution seems to be portable. I have used it on Alliant, Convex, and
Cray computers without ever having trouble with it. I will report the fix
if I run into problems porting it further.

							Claude Pommerell
							(pommy@iis.ethz.ch)

ejp@bohra.cpg.oz (Esmond Pitt) (05/31/90)

In article <1990May30.174745.1161@csrd.uiuc.edu> pommu@iis.ethz.ch (Claude Pommerell) writes:
>
>There is a way to solve your problem, Jack.

There are two even simpler ways.

Instead of effectively changing the initial condition to <Text>, either:

1. Ensure each start-state is equipped with enough rules to handle any
possible input, and, as the documentation does state, place all the
unlabelled rules after all the labelled rules, and/or

2. Label all the rules you only want applied in the INITIAL state with
<INITIAL>, so they won't be applied as defaults in other states.

Placing non-labelled rules before labelled rules is probably the single
most common error in writing LEX scripts, even after 15 years.

I don't know why.
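
A minimal sketch of option 2, assuming a lex that accepts the name INITIAL
in a rule prefix and in BEGIN. The state name STR and the printf actions are
hypothetical, chosen only for illustration; link with the lex library
(cc ... -ll) to get main() and yywrap().

```lex
%START STR
%%
<INITIAL>[a-z]+   { printf("word: %s\n", yytext); }
<INITIAL>\"       { BEGIN STR; }
<STR>[^\"\n]+     { printf("string body: %s\n", yytext); }
<STR>\"           { BEGIN INITIAL; }
.|\n              { /* unlabelled fallback, placed after the labelled rules */ }
```

Because the word rule carries <INITIAL>, it cannot fire inside STR, and the
only unlabelled rule sits last, so it never wins a tie against a labelled
rule of the same length.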

-- 
Esmond Pitt, Computer Power Group
ejp@bohra.cpg.oz

merlyn@iwarp.intel.com (Randal Schwartz) (05/31/90)

In article <116@bohra.cpg.oz>, ejp@bohra (Esmond Pitt) writes:
| There are two even simpler ways.
| 
| Instead of effectively changing the initial condition to <Text>, either:
| 
| 1. Ensure each start-state is equipped with enough rules to handle any
| possible input, and, as the documentation does state, place all the
| unlabelled rules after all the labelled rules, and/or
| 
| 2. Label all the rules you only want applied in the INITIAL state with
| <INITIAL>, so they won't be applied as defaults in other states.
| 
| Placing non-labelled rules before labelled rules is probably the single
| most common error in writing LEX scripts, even after 15 years.
| 
| I don't know why.

Because it is insufficient.  It's not "first match" but "longest
match" that determines which rule fires.  ("First match" only breaks
ties between rules that match the same length.)

...
<FOO>a	{ something; }
...
ab	{ something_else; }
...

will match the "something_else" clause if "ab" is seen, even in state
"FOO".  I know... this bit me once.
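
One way around it, sketched with the same state name FOO and assuming a lex
that accepts <INITIAL> as a rule prefix: tag the longer rule so that it is
simply not a candidate while FOO is active. The '#' trigger and the printf
actions are hypothetical, there only to make the fragment complete.

```lex
%START FOO
%%
<INITIAL>#     { BEGIN FOO; /* hypothetical trigger that enters FOO */ }
<FOO>a         { printf("[a in FOO]"); }
<INITIAL>ab    { printf("[ab outside FOO]"); }
```

In state FOO the input "ab" now matches only <FOO>a; the "b" that follows is
handled by the default echo rule. The alternative is to give FOO its own rule
for "ab", so that the longest match is still a FOO rule.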

Just another lex hacker,
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Welcome to Portland, Oregon, home of the California Raisins!"=/

jeff@samna.UUCP (Jeff Barber) (06/01/90)

In article <1990May30.174745.1161@csrd.uiuc.edu> pommu@iis.ethz.ch (Claude Pommerell) writes:
>However, if you put such an insertion text after "%%" (in the rules section
>of your Lex source), it gets inserted at the start of the body of the
>function that performs the lexical analysis, so you can use it to specify
>an initial condition.

That's okay for this particular situation.  But it won't work if
your lex program is a lexical analyzer in a larger program.  Your
placement of the "BEGIN start-symbol;" after the first %% causes
it to be included at the beginning of the yylex() function.

This means that every time you call the lexical analyzer for a
new token, its state gets reset.

That matters as soon as your actions return a token to a parser (a
yacc program, for example), with statements like:
	return TOK_IDENTIFIER;
Each such return ends the call, so the next call to yylex() executes
the BEGIN again.

So, a better general purpose solution is to define some function 
after the *second* %% which contains the BEGIN statement and is
called to initialize the analyzer.
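
A sketch of such an initializer for the parser case. The function name
scanner_init and the value given to TOK_IDENTIFIER are assumptions made for
illustration (the token value would normally come from yacc's y.tab.h); the
parser's driver is expected to call scanner_init() once before the first
yylex().

```lex
%{
#define TOK_IDENTIFIER 257	/* hypothetical; normally from y.tab.h */
%}
%START Text
%%
<Text>[A-Za-z_][A-Za-z0-9_]*	{ return TOK_IDENTIFIER; }
<Text>[ \t\n]+			{ /* skip white space */ }
%%
/* Call once before the first yylex().  BEGIN is not re-executed on
   later calls, so the state survives each "return" from an action. */
void scanner_init(void)
{
	BEGIN Text;
}
```

BEGIN is usable here because the code after the second %% is copied into the
same generated C file as yylex().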

In your case, we can just create a main() function with the
BEGIN in it (You've also got some unnecessary states in
here, so I've simplified a bit):

--------------------Cut Here----------------------------
%{
#include <stdio.h>
#include <stdlib.h>	/* exit() */

/* context in recursive C-like comments */
static int commentLevel = 0;
%}
  /* Starting conditions to support recursive C-like comments */
%START  Text InCCom
%%
\/\*		{ ++commentLevel; BEGIN InCCom; }
<InCCom>\*\/	{ if (--commentLevel == 0) BEGIN Text; }
<Text>\*\/	{ printf("Syntax error\n"); exit(1); }
<InCCom>.	|
<InCCom>\n	{ /* Ignore stuff inside of comments;
			everything else is echoed by default. */ }
%%
int main()
{
	/* Set the initial condition */
	BEGIN Text;
	return yylex();
}
--------------------Cut Here----------------------------

One last thing: the initial state can be referred to by name
("INITIAL"), so if INITIAL were substituted for Text, no state
initialization would be necessary (our main() function wouldn't be
either; it would be supplied by the lex library [ cc ... -ll ]).
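
With that substitution the scanner might shrink to the sketch below. It
assumes a lex that accepts INITIAL both as a rule prefix and as a BEGIN
argument, and it relies on the lex library for main() and yywrap().

```lex
%{
#include <stdio.h>
#include <stdlib.h>	/* exit() */

static int commentLevel = 0;
%}
%START  InCCom
%%
\/\*		{ ++commentLevel; BEGIN InCCom; }
<InCCom>\*\/	{ if (--commentLevel == 0) BEGIN INITIAL; }
<INITIAL>\*\/	{ printf("Syntax error\n"); exit(1); }
<InCCom>.	|
<InCCom>\n	{ /* ignore text inside comments */ }
```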

(BTW, anybody know whether this is portable - I don't recall reading
about this INITIAL state in the documentation; I just noticed
it in the lex.yy.c output and discovered by experimentation
that lex recognizes it in a <INITIAL> rule).

I've directed followups out of comp.lang.c.

Jeff

lbr@holos0.uucp (Len Reed) (06/03/90)

In article <254@samna.UUCP> jeff@samna.UUCP (Jeff Barber) writes:

For real control of start conditions, use flex, the Berkeley fast lexical
analyzer.  It runs *far* faster than lex, both at lex time and run time, doesn't
require a bunch of table scaling (% this and % that), has enhancements
to the start conditions, allows you more control over the lexer program,
is available in source form, and runs on DOS as well as Unix.  (That last
item is unfortunately important to me.)  Flex has been posted to the net.

I'd never go back to lex....
-- 
Len Reed
Holos Software, Inc.
Voice: (404) 496-1358
UUCP: ...!gatech!holos0!lbr

norbert@rwthinf.UUCP (Norbert Kiesel) (06/05/90)

Or even better, instead of using lex, use flex! It's
- GNU copyleft (I think),
- fully documented,
- has exclusive and inclusive start conditions (normal lex only has inclusive
  start conditions),
- and is *MUCH* faster.
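
For readers who have not met exclusive start conditions, here is a sketch in
flex syntax, where %x declares an exclusive condition (unlabelled rules are
inactive while it is in effect) and %s an inclusive one. The condition name
COMMENT and the counter are hypothetical, and the exact syntax may depend on
the flex version. Link with the flex library for main() and yywrap().

```lex
%{
static int commentLevel = 0;
%}
%x COMMENT
%%
"/*"              { commentLevel = 1; BEGIN(COMMENT); }
<COMMENT>"/*"     { ++commentLevel; }
<COMMENT>"*/"     { if (--commentLevel == 0) BEGIN(INITIAL); }
<COMMENT>.|\n     { /* discard comment text */ }
```

Since COMMENT is exclusive, the ordinary rules of the scanner need no
<INITIAL> tag to keep them from firing inside a comment.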

Just check your nearest ftp server. Normally it's stored under /pub/gnu.
The latest version is 2.2.


		so long

			Norbert


*******************************************************************************
* Norbert Kiesel	Institut f. Informatik III     NN       NN    KK   KK *
* RWTH Aachen           Ahornstr. 55  		       NNN      NN    KK  KK  *
* West Germany		D-5100 Aachen 		       NN N     NN    KK KK   *
*                       +49 241 80-7266		       NN  N    NN    KKKK    *
* 		                 		       NN   N   NN    KKKK    *
* EUNET:  norbert@rwthi3.uucp          		       NN    N  NN    KK KK   *
* USENET: ...!mcvax!unido!rwthi3!norbert	       NN     N NN    KK  KK  *
* X.400:  norbert@rwthi3.informatik.rwth-aachen.de     NN      NNN    KK   KK *
*******************************************************************************

bryant@flash.UUCP (Mike Bryant) (06/06/90)

In article <1990Jun3.024659.4122@holos0.uucp> lbr@holos0.uucp (Len Reed) writes:
>In article <254@samna.UUCP> jeff@samna.UUCP (Jeff Barber) writes:
>
>For real control of start conditions, use flex, the Berkeley fast lexical
>analyzer. It runs *far* faster than lex, both at lex time and run time, doesn't
>require a bunch of table scaling (% this and % that), has enhancements
>to the start conditions, allows you more control over the lexer program,
>is available in source form, and runs on DOS as well as Unix.  (That last
>item is unfortunately important to me.)  Flex has been posted to the net.
>
>I'd never go back to lex....
>-- 
>Len Reed
>Holos Software, Inc.
>Voice: (404) 496-1358
>UUCP: ...!gatech!holos0!lbr

Does anyone know if the scanner generator Gamma-GLA has ever been
posted to the net?  A paper on this tool was presented in the
Proceedings of the Summer 1988 USENIX Conference by Robert W.
Gray from the University of Colorado.  The paper showed a
benchmark of 10,000 lines of Pascal code run through several
lexical analyzers on a Sun 3/260, with the following results:

				Time
	Program		user	sys	total
	------------	-----	-----	-----
	gla		1.4	0.2	1.6
	flex (-cf)	3.18	0.24	3.42
	flex (-cem)	7.48	0.36	7.84
	lex		11.92	0.18	12.1

Also, when and where was Flex last posted to the net?  Thanks!
-- 
Mike Bryant      bryant@Summation.WA.COM
Summation Inc.   11335 NE 122nd Way  Kirkland, WA 98034