[comp.lang.c] Lex and initial start conditions

nag@hpmtlx.HP.COM ($Diwakar_Nag) (05/30/90)

/ hpmtlx:comp.lang.c / eifrig@crabcake.cs.jhu.edu (Jonathan Eifrig) /  6:42 pm  May 24, 1990 /

> 	I have a not-so-basic question about Lex and its input specifications.
>I want to write a simple parser, and in particular want to support
>nested comments.  The basic idea is to use Lex's start conditions:
>		....
>		....
>					Jack Eifrig
>					(eifrig@cs.jhu.edu)

	Just use BEGIN <state_name>, e.g. BEGIN COMMENT. Since BEGIN is a macro
that refers to global variables defined by lex, it is a good idea to define,
in the lex specification file itself, a function such as InitScanner() that
simply invokes BEGIN. Other source files can then call InitScanner() without
having to know about the globals that BEGIN uses.


-diwakar

pommerel@sp14.csrd.uiuc.edu (Claude Pommerell) (05/31/90)

Lex rules that do not begin with a starting condition <cond> are valid in
ALL possible starting conditions. BEGIN 0 resets the Lex interpreter to its
initial state, where it has no explicit starting condition, so that only the
untagged rules are valid.

There is a way to solve your problem, Jack. You know from the Lex manual
that any text enclosed between lines starting with "%{" and "%}" is inserted
literally into the C program generated by Lex. In fact, if you put this
insertion text before the first line starting with "%%" (that is, in the
definitions section of your Lex source), it gets inserted at the global scope
of the C program, so this is the perfect place to declare externals and such.

However, if you put such an insertion text after "%%" (at the top of the
rules section of your Lex source, before the first rule), it gets inserted
at the start of the body of the function that performs the lexical
analysis, so you can use it to specify an initial condition.

This is my Lex source to skip nested C-like comments:
-------------------------------------------------------------------
%{
/* context in recursive C-like comments */
static int commentLevel;
%}
  /* Starting conditions to support recursive C-like comments */
%START  Text NewCCom InCCom EndCCom
%%
%{
/* Set the initial condition */
BEGIN Text;
commentLevel = 0;
%}
<Text>\/\*                      { commentLevel = 1; BEGIN InCCom; }
<InCCom>\/                      { BEGIN NewCCom; }
<InCCom>\*                      { BEGIN EndCCom; }
<NewCCom>\*                     { ++commentLevel; BEGIN InCCom; }
<EndCCom>\/                     { if (--commentLevel)
                                    BEGIN InCCom;
                                  else
                                    BEGIN Text;
                                }
<NewCCom,EndCCom>[^\*\/]        { BEGIN InCCom; }
<InCCom>[^\/\*]                 |
<NewCCom>\/                     |
<EndCCom>\*                     ;
-------------------------------------------------------------------
All the other (truly regular, context-free) rules are tagged with the
starting condition <Text>.
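
For readers who want to trace the four starting conditions by hand, the
rules above can be read as an ordinary state machine. The C sketch below is
an illustration only, not what Lex generates: the function name
strip_nested_comments and the enum names are made up here, and it assumes
that text outside comments is simply echoed, as the Lex default rule would do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names mirroring the starting conditions in the Lex source. */
enum state { TEXT, IN_CCOM, NEW_CCOM, END_CCOM };

/*
 * Copy src to dst, dropping nested C-like comments.
 * dst must have room for strlen(src) + 1 bytes.
 * Returns the nesting level at end of input (0 if every comment was closed).
 */
int strip_nested_comments(const char *src, char *dst)
{
    enum state st = TEXT;
    int level = 0;
    size_t out = 0;

    assert(src != NULL && dst != NULL);

    for (size_t i = 0; src[i] != '\0'; i++) {
        char c = src[i];

        switch (st) {
        case TEXT:
            if (c == '/' && src[i + 1] == '*') {
                level = 1;              /* comment opener seen in Text */
                st = IN_CCOM;
                i++;                    /* consume the '*' as well */
            } else {
                dst[out++] = c;         /* assumed default: echo */
            }
            break;
        case IN_CCOM:
            if (c == '/')
                st = NEW_CCOM;          /* possible nested opener */
            else if (c == '*')
                st = END_CCOM;          /* possible closer */
            break;
        case NEW_CCOM:
            if (c == '*') {
                level++;                /* nested comment opened */
                st = IN_CCOM;
            } else if (c != '/') {
                st = IN_CCOM;           /* false alarm */
            }                           /* another '/' keeps us in NEW_CCOM */
            break;
        case END_CCOM:
            if (c == '/')
                st = (--level > 0) ? IN_CCOM : TEXT;
            else if (c != '*')
                st = IN_CCOM;           /* false alarm */
            break;                      /* another '*' keeps us in END_CCOM */
        }
    }
    dst[out] = '\0';
    return level;
}
```

Called on "a /* b /* c */ d */ e", this leaves "a  e" (two spaces) in dst and
returns 0. A nonzero return value means the input ended inside a comment.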

This solution seems to be portable. I have used it on Alliant, Convex, and
Cray computers without ever having trouble with it. I will report the fix in
case I have problems porting it further.

							Claude Pommerell
							(pommy@iis.ethz.ch)

ejp@bohra.cpg.oz (Esmond Pitt) (05/31/90)

In article <1990May30.174745.1161@csrd.uiuc.edu> pommu@iis.ethz.ch (Claude Pommerell) writes:
>
>There is a way to solve your problem, Jack.

There are two even simpler ways.

Instead of effectively changing the initial condition to <Text>, either:

1. Ensure each start-state is equipped with enough rules to handle any
possible input, and, as the documentation does state, place all the
unlabelled rules after all the labelled rules, and/or

2. Label all the rules you only want applied in the INITIAL state with
<INITIAL>, so they won't be applied as defaults in other states.

Placing non-labelled rules before labelled rules is probably the single
most common error in writing LEX scripts, even after 15 years.

I don't know why.


-- 
Esmond Pitt, Computer Power Group
ejp@bohra.cpg.oz

merlyn@iwarp.intel.com (Randal Schwartz) (05/31/90)

In article <116@bohra.cpg.oz>, ejp@bohra (Esmond Pitt) writes:
| There are two even simpler ways.
| 
| Instead of effectively changing the initial condition to <Text>, either:
| 
| 1. Ensure each start-state is equipped with enough rules to handle any
| possible input, and, as the documentation does state, place all the
| unlabelled rules after all the labelled rules, and/or
| 
| 2. Label all the rules you only want applied in the INITIAL state with
| <INITIAL>, so they won't be applied as defaults in other states.
| 
| Placing non-labelled rules before labelled rules is probably the single
| most common error in writing LEX scripts, even after 15 years.
| 
| I don't know why.

Because it is insufficient.  It's not "first match", but "longest
match" that determines rule triggering.  ("First match" applies only as
a tie-break, when several rules match text of the same length.)

...
<FOO>a	{ something; }
...
ab	{ something_else; }
...

will match the "something_else" clause if "ab" is seen, even in state
"FOO".  I know... this bit me once.

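The selection rule described above can be modelled in a few lines of C. This
is a toy under stated assumptions, not Lex's actual matching code: patterns
are restricted to literal strings, and struct rule and pick_rule are
hypothetical names. It picks the longest match among the rules active in the
current start condition and keeps the earlier rule on a tie.

```c
#include <assert.h>
#include <string.h>

struct rule {
    const char *start;    /* start condition tag, or NULL for an untagged rule */
    const char *literal;  /* pattern; only literal strings in this toy model */
};

/* Return the index of the rule that would fire, or -1 if none matches. */
int pick_rule(const struct rule *rules, int nrules,
              const char *state, const char *input)
{
    int best = -1;
    size_t best_len = 0;

    assert(rules != NULL && state != NULL && input != NULL);

    for (int i = 0; i < nrules; i++) {
        size_t len = strlen(rules[i].literal);

        /* Untagged rules are active in every (inclusive) start condition. */
        if (rules[i].start != NULL && strcmp(rules[i].start, state) != 0)
            continue;
        if (strncmp(input, rules[i].literal, len) != 0)
            continue;
        /* Only a strictly longer match replaces the current choice,
           so the earlier rule wins when the lengths are equal. */
        if (len > best_len) {
            best = i;
            best_len = len;
        }
    }
    return best;
}
```

With a rule "a" tagged FOO at index 0 and an untagged rule "ab" at index 1,
pick_rule returns 1 for the input "ab" in state FOO, which is the surprise
described above. Rule order only matters when two rules match the same
length: an untagged "ab" listed before a FOO-tagged "ab" shadows it.
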
Just another lex hacker,
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Welcome to Portland, Oregon, home of the California Raisins!"=/

ejp@bohra.cpg.oz (Esmond Pitt) (06/01/90)

In article <1990May31.161800.11133@iwarp.intel.com> merlyn@iwarp.intel.com (Randal Schwartz) writes:
>In article <116@bohra.cpg.oz>, ejp@bohra (Esmond Pitt) [me] writes:
>| 
>| Placing non-labelled rules before labelled rules is probably the single
>| most common error in writing LEX scripts, even after 15 years.
>| 
>| I don't know why.
>
>Because it is insufficient.  It's not "first match", but "longest
>match" that determines rule triggering.  ("First match" applies only as
>a tie-break, when several rules match text of the same length.)

How does this explain why people put labelled rules after non-labelled
rules, when the manual says not to?


-- 
Esmond Pitt, Computer Power Group
ejp@bohra.cpg.oz

jeff@samna.UUCP (Jeff Barber) (06/01/90)

In article <1990May30.174745.1161@csrd.uiuc.edu> pommu@iis.ethz.ch (Claude Pommerell) writes:
>However, if you put such an insertion text after "%%" (in the rules
>section of your
>Lex source), it gets inserted at the start of the body of the function
>that performs
>the lexical analysis, so you can use it to specify an initial condition.

That's okay for this particular situation.  But it won't work if
your lex program is a lexical analyzer in a larger program.  Your
placement of the "BEGIN start-symbol;" after the first %% causes
it to be included at the beginning of the yylex() function.

If your actions are designed to return a token to a parser (a yacc
program, for example), they'll contain statements like:
	return TOK_IDENTIFIER;

Each such return leaves yylex(), and the next call re-enters it from the
top.  This means that every time you call the lexical analyzer for a
new token, the BEGIN runs again and the state gets reset.

So, a better general purpose solution is to define some function 
after the *second* %% which contains the BEGIN statement and is
called to initialize the analyzer.
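
The effect is easy to reproduce without Lex. The following C sketch is a toy
model, not generated code: struct scanner, init_scanner, next_token and the
token names are invented for illustration. The reset_each_call flag models a
BEGIN placed right after the first %%, which runs every time the scanner
function is entered; with the flag off, the start condition is set once by an
explicit initialization call.

```c
#include <assert.h>
#include <stddef.h>

enum state { TEXT, IN_COMMENT };    /* hypothetical start conditions */
enum token { TOK_EOF, TOK_OPEN, TOK_CHAR, TOK_COMMENT_CHAR };

struct scanner {
    const char *p;          /* next unread input character */
    enum state st;          /* current start condition */
    int reset_each_call;    /* 1 models BEGIN right after the first %% */
};

/* One-time setup, the analogue of calling an InitScanner()-style function. */
void init_scanner(struct scanner *s, const char *input, int reset_each_call)
{
    assert(s != NULL && input != NULL);
    s->p = input;
    s->st = TEXT;
    s->reset_each_call = reset_each_call;
}

/* Return one token per call, the way yylex() does for a parser. */
enum token next_token(struct scanner *s)
{
    if (s->reset_each_call)
        s->st = TEXT;               /* prologue code runs on every entry */

    if (*s->p == '\0')
        return TOK_EOF;

    if (s->st == TEXT && s->p[0] == '/' && s->p[1] == '*') {
        s->p += 2;
        s->st = IN_COMMENT;
        return TOK_OPEN;            /* returning leaves the function */
    }

    s->p++;
    return s->st == IN_COMMENT ? TOK_COMMENT_CHAR : TOK_CHAR;
}
```

On the input "/*x", the first call returns TOK_OPEN either way. With
reset_each_call set, the second call has forgotten that it is inside a
comment and returns TOK_CHAR; with it clear, the second call returns
TOK_COMMENT_CHAR.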

In your case, we can just create a main() function with the
BEGIN in it (You've also got some unnecessary states in
here, so I've simplified a bit):

--------------------Cut Here----------------------------
%{
#include <stdio.h>	/* printf() */
#include <stdlib.h>	/* exit() */

/* context in recursive C-like comments */
static int commentLevel = 0;
%}
  /* Starting conditions to support recursive C-like comments */
%START  Text InCCom
%%
\/\*		{ ++commentLevel; BEGIN InCCom; }
<InCCom>\*\/	{ if (--commentLevel == 0) BEGIN Text; }
<Text>\*\/	{ printf("Syntax error\n"); exit(1); }
<InCCom>.	|
<InCCom>\n	{ /* Ignore stuff inside of comments;
			everything else is echoed by default. */ }
%%
int main(void)
{
	/* Set the initial condition */
	BEGIN Text;
	return yylex();
}
--------------------Cut Here----------------------------

One last thing: the initial state can be referred to by name
("INITIAL"), so if INITIAL were substituted for Text, no state
initialization would be necessary (and neither would our main()
function; one would be supplied by the lex library [ cc ... -ll ]).

(BTW, anybody know whether this is portable - I don't recall reading
about this INITIAL state in the documentation; I just noticed
it in the lex.yy.c output and discovered by experimentation
that lex recognizes it in a <INITIAL> rule).

I've directed followups out of comp.lang.c.

Jeff

norbert@rwthinf.UUCP (Norbert Kiesel) (06/05/90)

Or even better, instead of using lex, use flex! It's
- GNU copyleft (I think),
- fully documented
- has exclusive and inclusive start conditions (normal lex only has inclusive
  start conditions)
- and is *MUCH* faster 

Just check your nearest ftp server. Normally it's stored under /pub/gnu.
The latest version is 2.2.


		so long

			Norbert


*******************************************************************************
* Norbert Kiesel	Institut f. Informatik III     NN       NN    KK   KK *
* RWTH Aachen           Ahornstr. 55  		       NNN      NN    KK  KK  *
* West Germany		D-5100 Aachen 		       NN N     NN    KK KK   *
*                       +49 241 80-7266		       NN  N    NN    KKKK    *
* 		                 		       NN   N   NN    KKKK    *
* EUNET:  norbert@rwthi3.uucp          		       NN    N  NN    KK KK   *
* USENET: ...!mcvax!unido!rwthi3!norbert	       NN     N NN    KK  KK  *
* X.400:  norbert@rwthi3.informatik.rwth-aachen.de     NN      NNN    KK   KK *
*******************************************************************************