[comp.lang.c] LEX rule, anyone???

d5kwedb@dtek.chalmers.se (Kristian Wedberg) (12/05/89)

A question from a friend of mine, Pär Eriksson:

	Does anyone know how to write a LEX rule for C comments,
	ie for everything between /* and */, nesting not allowed?

If YOU know the answer, don't hesitate to answer (preferably by email) to:

	
	kitte				d5kwedb@dtek.chalmers.se

rsalz@bbn.com (Rich Salz) (12/05/89)

In <601@vice2utc.chalmers.se> d5kwedb@dtek.chalmers.se (Kristian Wedberg) writes:
>A question from a friend of mine, Pär Eriksson:
>	Does anyone know how to write a LEX rule for C comments,
>	ie for everything between /* and */, nesting not allowed?

We go through this in comp.lang.c about once a year.  Almost everyone
gets it wrong.  The best thing to do is to define a lex rule that
catches "/*" and in the actions for that rule look for */ on your own.

The following code fragment comes from the Cronus type definition compiler:

/* State of our two automata. */
typedef enum _STATE {
    S_EAT, S_STAR, S_NORMAL, S_END
} STATE;

"/*"	        {
		    /* Comment. */
		    register STATE	 S;

		    for (S = S_NORMAL; S != S_END; )
			switch (input()) {
			case '/':
			    if (S == S_STAR) {
				S = S_END;
				break;
			    }
			    /* FALLTHROUGH */
			default:
			    S = S_NORMAL;
			    break;
			case '\0':
			    /* Warn about EOF inside comment? */
			    S = S_END;
			    break;
			case '*':
			    S = S_STAR;
			    break;
			}
		    /* S == S_END here: the comment is consumed, or EOF was hit. */
		}
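
The same three-state scan can be tried outside lex. What follows is a
minimal sketch in plain C, not part of the Cronus source: the names
skip_comment_body and scan_state are made up for illustration, and the
input is a string rather than lex's input() stream. It assumes the caller
has already consumed the opening "/*".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; the states mirror the ones used in the fragment above. */
enum scan_state { SC_NORMAL, SC_STAR, SC_END };

/*
 * Scan the body of a comment. `s` points just past the opening slash-star.
 * Returns the number of characters consumed, including the closing
 * star-slash, or -1 if the string ends first (the EOF case).
 */
long skip_comment_body(const char *s)
{
    enum scan_state st = SC_NORMAL;
    size_t i = 0;

    assert(s != NULL);
    while (st != SC_END) {
        char c = s[i];

        if (c == '\0')
            return -1;          /* unterminated comment */
        i++;
        if (c == '/' && st == SC_STAR)
            st = SC_END;        /* star followed by slash: done */
        else if (c == '*')
            st = SC_STAR;       /* a star may begin the terminator */
        else
            st = SC_NORMAL;     /* anything else resets the state */
    }
    return (long)i;
}
```

Because a star always leaves the machine in the "just saw a star" state,
runs such as "**/" close the comment correctly, and a slash directly
after the opener (as in "/*/") is treated as ordinary text.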
-- 
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.
Use a domain-based address or give alternate paths, or you may lose out.

cml@tove.umd.edu (Christopher Lott) (12/05/89)

In article <601@vice2utc.chalmers.se> d5kwedb@dtek.chalmers.se (Kristian Wedberg) writes:
>	Does anyone know how to write a LEX rule for C comments,
>	ie for everything between /* and */, nesting not allowed?

[ because I would love to see other solutions to this, I posted. ]

This sounds suspiciously like someone's homework assignment.  In fact,
I had exactly such a homework assignment, and this resulted :-)

The following lines are a lex program to recognize C comments, not nested.
This does NOT take into account any quote marks within a comment; i.e., 
quote marks don't 'escape' the close comment marker.

I compile this via "lex ccom.l ; cc lex.yy.c -o ccom -ll"

----snip----
   /* this is a regular expression to match a c comment	*/
   /* written by cml 890922   (probably not minimal)	*/
%%
"/*"([^*]|[*]*[^*/])*[*]+"/"	{printf("saw a c comment.\n");}
.				{putchar(*yytext);}
----snip----

chris...
--
cml@tove.umd.edu    Computer Science Dept, U. Maryland at College Park
		    4122 A.V.W.  301-454-8711	<standard disclaimers>

mike@wheaties.ai.mit.edu (Mike Haertel) (12/06/89)

In article <2191@prune.bbn.com> rsalz@bbn.com (Rich Salz) writes:
>>A question from a friend of mine, Pär Eriksson:
>>	Does anyone know how to write a LEX rule for C comments,
>>	ie for everything between /* and */, nesting not allowed?
>We go through this in comp.lang.c about once a year.  Almost everyone
>gets it wrong.  The best thing to do is to define a lex rule that
>catches "/*" and in the actions for that rule look for */ on your own.

I've done this before.  I think the regexp I used was (ick!):

"/*"([^*]*("*"[^/])?)*"*/"

Despite the fact that you *can* do it, you don't want to.  The
reason is that lex has a fixed length buffer into which the text
of the entire token must fit.  This is arguably brain damaged
(in fact there is an option to set the buffer length, but one
might argue that the buffer should dynamically resize).

If you think about it, it's kind of silly to carefully read in all
of something you're going to throw away anyway.  Which is yet
another reason why the method rsalz suggests is recommended.
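
Single-rule expressions like these are easy to get subtly wrong, so it
helps to try them on comments that end in a run of stars. The sketch
below is not from the thread. It assumes a POSIX system with <regex.h>
and uses hand-made POSIX ERE transcriptions of two expressions posted in
this thread (the single-star form quoted above and the star-run form
posted earlier), anchored at both ends. The helper name whole_match and
the macro names are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* ERE transcriptions (an assumption of this sketch), anchored at both ends. */
#define RE_SINGLE_STAR "^/[*]([^*]*([*][^/])?)*[*]/$"
#define RE_STAR_RUNS   "^/[*]([^*]|[*]*[^*/])*[*]+/$"

/* Return 1 if `text` as a whole matches `pattern`, 0 if not, -1 on a bad pattern. */
int whole_match(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Under these assumptions both patterns accept "/* plain */", but only the
star-run form accepts "/* a **/". In the single-star form the second
star is swallowed by [^/], leaving no star for the closing "*/". In a
longer file such a rule would presumably run on to a later "*/".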
-- 
Mike Haertel <mike@ai.mit.edu>
"Everything there is to know about playing the piano can be taught
 in half an hour, I'm convinced of it." -- Glenn Gould

ejp@bohra.cpg.oz (Esmond Pitt) (12/06/89)

In article <601@vice2utc.chalmers.se> d5kwedb@dtek.chalmers.se (Kristian Wedberg) writes:
>A question from a friend of mine, Pär Eriksson:
>
>	Does anyone know how to write a LEX rule for C comments,
>	ie for everything between /* and */, nesting not allowed?

You don't want to do this in one rule, because a sufficiently long C
comment will overflow lex's token buffer, with dire results. Three
rules do it; the order is important.

%start COMMENT
%%
<COMMENT>"*/"	BEGIN(INITIAL);
<COMMENT>.|\n	;
<INITIAL>"/*"			BEGIN(COMMENT);

If you are using lex you will find this slow, so you will probably replace
the above with something like this:

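%%
"/*"	{
	int c;

	for(;;)
	{
		c = input();
		if (c == 0 || c == EOF)
			break;	/* end of input inside the comment */
		if (c == '*')
		{
			c = input();
			/* stop at the closing slash, or at end of input */
			if (c == '/' || c == 0 || c == EOF)
				break;
			unput(c);	/* may be another '*': look at it again */
		}
	}
}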
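
The push-back step is what makes a run like "**/" work: the character
after a star is looked at again in case it is itself a star. A minimal
sketch of that idea in plain C follows, with an explicit end-of-input
check. It is not lex code: struct source, src_input and src_unput are
hypothetical string-backed stand-ins for input() and unput(), and
end of input is reported as 0, as classic lex does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical string-backed stand-ins for lex's input() and unput(). */
struct source { const char *text; size_t pos; };

static int src_input(struct source *s)
{
    unsigned char c = (unsigned char)s->text[s->pos];

    if (c == '\0')
        return 0;               /* classic lex reports end of input as 0 */
    s->pos++;
    return c;
}

static void src_unput(struct source *s)
{
    assert(s->pos > 0);
    s->pos--;                   /* push the last character back */
}

/*
 * `text` starts just past the opening slash-star. Returns the offset of the
 * first character after the closing star-slash, or -1 if input ends first.
 */
long skip_with_pushback(const char *text)
{
    struct source s = { text, 0 };
    int c;

    assert(text != NULL);
    for (;;) {
        c = src_input(&s);
        if (c == 0)
            return -1;
        if (c != '*')
            continue;
        c = src_input(&s);
        if (c == '/')
            return (long)s.pos;
        if (c == 0)
            return -1;
        src_unput(&s);          /* it may be another star: rescan it */
    }
}
```

For example, skip_with_pushback(" a **/x") stops at offset 6, just
before the 'x'.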

(E&OE)

-- 
Esmond Pitt, Computer Power Group
ejp@bohra.cpg.oz

bill@twwells.com (T. William Wells) (12/06/89)

In article <21108@mimsy.umd.edu> cml@tove.umd.edu (Christopher Lott) writes:
:    /* this is a regular expression to match a c comment       */
:    /* written by cml 890922   (probably not minimal)  */
: %%
: "/*"([^*]|[*]*[^*/])*[*]+"/"  {printf("saw a c comment.\n");}
: .                             {putchar(*yytext);}

This breaks if the comment is longer than lex's internal buffer.
Moreover, at least some versions of lex do *not* check for buffer overflow.

Boom.

---
Bill                    { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com

bill@twwells.com (T. William Wells) (12/07/89)

In article <224@bohra.cpg.oz> ejp@bohra.cpg.oz (Esmond Pitt) writes:
: In article <601@vice2utc.chalmers.se> d5kwedb@dtek.chalmers.se (Kristian Wedberg) writes:
: >A question from a friend of mine, Pär Eriksson:
: >
: >     Does anyone know how to write a LEX rule for C comments,
: >     ie for everything between /* and */, nesting not allowed?
:
: You don't want to do this in one rule, because a sufficiently long C
: comment will overflow lex's token buffer, with dire results. Three
: rules do it; the order is important.

(It should not be: "*/" matches two characters and `.' only one, and
lex always takes the longest match. Or is INITIAL as a start state
treated specially?)

: %start COMMENT
: %%
: <COMMENT>"*/" BEGIN(INITIAL);
: <COMMENT>.|\n ;
: <INITIAL>"/*"                 BEGIN(COMMENT);
:
: If you are using lex you will find this slow

Not only that, but start states are not reliable with lex. After
being bitten by them for the n'th time, I switched to flex.

---
Bill                    { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com

tps@chem.ucsd.edu (Tom Stockfisch) (12/07/89)

In article <1989Dec6.180833.2985@twwells.com> bill@twwells.com (T. William Wells) writes:

>... start states are not reliable with lex. After
>being bitten by them for the n'th time, I switched to flex.

I know of a bug that involves variable length trailing contexts (this
happens in flex, too), but I've never run into a bug with start states.
Can you give specifics?
-- 

|| Tom Stockfisch, UCSD Chemistry	tps@chem.ucsd.edu

bill@twwells.com (T. William Wells) (12/08/89)

In article <616@chem.ucsd.EDU> tps@chem.ucsd.edu (Tom Stockfisch) writes:
: In article <1989Dec6.180833.2985@twwells.com> bill@twwells.com (T. William Wells) writes:
:
: >... start states are not reliable with lex. After
: >being bitten by them for the n'th time, I switched to flex.
:
: I know of a bug that involves variable length trailing contexts (this
: happens in flex, too), but I've never run into a bug with start states.
: Can you give specifics?

I don't know what triggers it, but, in certain cases, an RE that
is in some start state gets recognized, even when the program is
not in that start state. I've been bitten by this one three
times, on three different machines (a VAX, a Sun, and a '386), so
it appears to be a generic problem with lex.

---
Bill                    { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com

ejp@bohra.cpg.oz (Esmond Pitt) (12/08/89)

In article <1989Dec6.180833.2985@twwells.com> bill@twwells.com (T. William Wells) writes:
>
> ... start states are not reliable with lex.

I keep hearing this, and I've never had any trouble with 'lex' whatever
(in 8 years) except that the scanners are SO SLOW. This is a good reason
to switch to 'flex' all right. But where's the unreliability? Symptoms?

-- 
Esmond Pitt, Computer Power Group
ejp@bohra.cpg.oz

chris@mimsy.umd.edu (Chris Torek) (12/09/89)

In article <1989Dec7.193738.9829@twwells.com> bill@twwells.com
(T. William Wells) writes:
>I don't know what triggers it, but, in certain cases, an RE that
>is in some start state gets recognized, even when the program is
>not in that start state. I've been bitten by this one three
>times, on three different machines (a VAX, a Sun, and a '386), so
>it appears to be a generic problem with lex.

Lex's start states are not terribly well documented.  Besides %start
and BEGIN(state) and <state>text, there is the default initial state
(called `INITIAL') and the fact that a lex rule, if it has no state,
acts in *all* states.  For instance:

	%state FOO BAR
	%%
	f	{ BEGIN(FOO); }
	b	{ BEGIN(BAR); }
	i	{ BEGIN(INITIAL); }
	<INITIAL>t { return (1); }
	<FOO>t	{ return (2); }
	ack	{ return (3); }
	<BAR>gasp	{ return (4); }
	.|\n	;
	%%
	yywrap() { return (1); }
	main() { int c; while ((c = yylex()) != 0) printf("lex => %d\n", c); }

(yes, not conformant, main should return an int, but then this belongs
in some other group anyway and is here only because that is where it
started) shows how this works:

	% a.out
	ack
	lex => 3
	gasp
	t
	lex => 1
	f
	ack
	lex => 3
	gasp
	t
	lex => 2
	i
	b
	ack
	lex => 3
	gasp
	lex => 4
	t
	i
	^D %

`ack' is unadorned, hence recognised in all states, while `t' produces
a token only in states INITIAL and FOO, and `gasp' produces something only
in state BAR.  (Everything else is eaten.)
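
The rule-selection logic behind that transcript can be modelled in a few
lines of C. This is only a sketch under simplifying assumptions, not how
lex is implemented: it matches whole words instead of doing longest-match
scanning, it leaves out the f/b/i rules that switch states, and the names
token_for, struct rule and ANY_STATE are invented for illustration.

```c
#include <assert.h>
#include <string.h>

enum start_state { ST_INITIAL, ST_FOO, ST_BAR };

#define ANY_STATE (-1)   /* marks a rule written without a <state> prefix */

struct rule {
    int state;            /* ANY_STATE or one enum start_state value */
    const char *text;     /* literal text the rule matches */
    int token;            /* value the rule's action returns */
};

/* The token-returning rules from the example above, in the same order. */
static const struct rule rules[] = {
    { ST_INITIAL, "t",    1 },
    { ST_FOO,     "t",    2 },
    { ANY_STATE,  "ack",  3 },
    { ST_BAR,     "gasp", 4 },
};

/* Return the token for `word` in state `current`, or 0 if it is eaten. */
int token_for(enum start_state current, const char *word)
{
    size_t i;

    assert(word != NULL);
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        if (rules[i].state != ANY_STATE && rules[i].state != (int)current)
            continue;     /* rule is switched off in this state */
        if (strcmp(rules[i].text, word) == 0)
            return rules[i].token;
    }
    return 0;
}
```

With this table, token_for(ST_BAR, "ack") is 3 because the unadorned
rule is active everywhere, while token_for(ST_BAR, "t") is 0 because
both `t' rules are tied to other states.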

At any rate, perhaps what Bill Wells is remembering is lex taking
something that it apparently should not have because it was in some
state other than INITIAL.  Or it could be yet another lex bug....
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

bill@twwells.com (T. William Wells) (12/09/89)

In article <21182@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
: At any rate, perhaps what Bill Wells is remembering is lex taking
: something that it apparently should not have because it was in some
: state other than INITIAL.  Or it could be yet another lex bug....

I'll argue for the bug. Simply switching to flex made the problem
go away.

---
Bill                    { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com

mark@linqdev.UUCP (mark) (12/12/89)

>   /* this is a regular expression to match a c comment    */
>   /* written by cml 890922   (probably not minimal)    */
>%%
>"/*"([^*]|[*]*[^*/])*[*]+"/"    {printf("saw a c comment.\n");}
>.                {putchar(*yytext);}
>----snip----
>
>chris...
>--

As a friend at the office here pointed out, this particular
way of handling C comments tends to 'grow' the parser.

Here is his solution which I thought was a neat way to do it
and save some space:


"/*"        {
			do  {
				yytext[0] = yytext[1];
				if( !( yytext[1] = input() ) )
					yyerror( 
					"EOF in comment.  Missing \"*/\"." );
				} while( yytext[0] != '*' || yytext[1] != '/' );
			}


01010000010000010100001101001011010001010101010000100000010000100100001001010011
* Mark R. Holbrook-WS7M  uw-beaver!                    Interlinq Software Corp *
* Issaquah,  WA   98027       sumax!ole!   /mark       10230 NE Points Dr #200 *
* HM:206-392-9672 Voice            linqdev!            Kirkland, WA 98033-???? *
* HM:206-392-9673 Data                     \ws7m!ws7m  WK:206-827-1112 Ext:152 *
* Opinions..? Yes, they are mine and mine alone!  I'll sell 'em cheap however! *
00100000010101110101001100110111010011010100000001010111010100110011011101001101

tps@chem.ucsd.edu (Tom Stockfisch) (12/12/89)

In article <364@linqdev.UUCP> mark@linqdev.UUCP () writes:
>...way of handling C comments...
>
>"/*"        {
>			do  {
>				yytext[0] = yytext[1];
>				if( !( yytext[1] = input() ) )
>					yyerror( 
>					"EOF in comment.  Missing \"*/\"." );
>				} while( yytext[0] != '*' || yytext[1] != '/' );
>			}

Too bad it doesn't work.  Try "/*/*/" as input.  I think it will work if you
clear yytext[1] before the do-while loop.
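
The failure is easy to reproduce outside lex. Below is a minimal sketch
in plain C, not the poster's code: window_scan is a hypothetical
function in which `prev' and `cur' stand in for yytext[0] and yytext[1],
and the clear_first flag selects whether the second slot is cleared
before the loop, as suggested above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * `text` holds input starting with slash-star. The two-character window
 * `prev`/`cur` plays the role of yytext[0]/yytext[1]. When `clear_first`
 * is nonzero, `cur` is reset before the loop so the star of the opener
 * cannot pair with an immediately following slash.
 * Returns the number of characters consumed, or -1 if input ends first.
 */
long window_scan(const char *text, int clear_first)
{
    size_t i = 2;                   /* lex has already matched the opener */
    char prev;
    char cur;

    assert(text != NULL && text[0] == '/' && text[1] == '*');
    cur = clear_first ? '\0' : text[1];
    do {
        prev = cur;
        cur = text[i];
        if (cur == '\0')
            return -1;              /* ran out of input inside the comment */
        i++;
    } while (prev != '*' || cur != '/');
    return (long)i;
}
```

On "/*/*/" the uncleared window stops after three characters, taking
"/*/" for a complete comment, while the cleared one consumes all five.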
-- 

|| Tom Stockfisch, UCSD Chemistry	tps@chem.ucsd.edu