[net.unix-wizards] Comment recognition in Lex, again

anderson@uwvax.ARPA (05/04/84)

<>
I have received several replies to my request for a lex expression
to recognize /* ... */ comments.  The only one that works (sent in
by Jim Hogue) is

"/*"([^*]*"*"*"*"[^/*])*[^*]*"*"*"*/"

which I can't claim to fully understand.  Nor do I understand why my
original,  "/*"([^*]|("*"/[^/]))*"*/", doesn't work.  The idea is that
each character in the string between /* and */ can either be something
other than *, or * followed by something other than /.

Can anyone come up with an expression simpler than Hogue's that works?
By "works", I mean put it in a real "lex" program, as in:

(your expr)	printf("recognized (%s)\n",yytext);

and try it on inputs such as /***/, /*/*/, etc.

-- David Anderson (uwvax!anderson)

chris@umcp-cs.UUCP (05/04/84)

A Lex expression that matches C comments:

	comment			"/*"(\*[^/]|[^*]*)*"*/"

(The * after the first ] shouldn't be necessary, I think, but you
did specify that the expression come from working code...)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci (301) 454-7690
UUCP:	{seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland

hogue@hsi.UUCP (05/04/84)

Ok I will attempt to explain the lex script.  First a little history.
The script was derived while attempting to write a C complier for a
class at the University of Arizona.  Dr. P. J. Downey and I derived
the regular expression from the finite automata and the knowledge of
lexs' attempt to match the longest possible match.

 > I have received several replies to my request for a lex expression
 > to recognize /* ... */ comments.  The only one that works (sent in
 > by Jim Hogue) is

 > "/*"([^*]*"*"*"*"[^/*])*[^*]*"*"*"*/"

1. "/*" 
	matches the initial /* of a comment.

2. ([^*]*"*"*"*"[^/*])*
	matches the stuff in the comment including any *'s or /'s but does 
	not allow the * just prior to the / to be matched.  As such it
	does not match */

3. [^*]*"*"*"*/"
	matches more stuff and this time forces the match of the final */

Problems:
	First the /**/ /***/ ... or /*foo**/ style comments.  These are 
	taken care of by 3.
	Next the /**foo*/ and /*foo*baz*/ style comments.  These are taken
	care of by 2.
	The final trick is that the first */ is matched and thus comments
	of the form /*foo*/a = b;/*baz*/ are matched in the same style as
	PCC.  ie. /*foo*/ and /*baz*/ are matched and a = b; is not.

Now for the real way to do it.  Simply recognize the /* and then call a three
line c program that "eats" up the comment!  (The c preprocessor does
the comment removal for the c complier).

The real reason this looks so complicated is because it was derived not
thought out!
-- 
		Jim Hogue
		{kpno, ihnp4}!hsi!hogue
		Health Systems International
		New Haven, CT  06511

merlyn@sequent.UUCP (05/04/84)

> From: anderson@uwvax.UUCP
> Subject: Comment recognition in Lex, again
> Message-ID: <245@uwvax.ARPA>
> 
> I have received several replies to my request for a lex expression
> to recognize /* ... */ comments.  The only one that works (sent in
> by Jim Hogue) is
> 
> "/*"([^*]*"*"*"*"[^/*])*[^*]*"*"*"*/"
> 
> which I can't claim to fully understand.  Nor do I understand why my
> original,  "/*"([^*]|("*"/[^/]))*"*/", doesn't work.  The idea is that
> each character in the string between /* and */ can either be something
> other than *, or * followed by something other than /.

I looked at this expression for a while (translated it into railroad tracks
so I could study it as an FSM).  It's sound, but utterly complex.
That is to say, it will match everything that is considered a C-comment,
and nothing else.

My previous suggestion (deleting the "/" before the "[^/]") fails for
cases of /***/, because the second * and the third * are matched in the
middle of the parthensized expression, leaving no * to use with the trailing /.
Actually, all you have to do is document this :-).

If you want simplicity, here's the way I do it (with start states!)

####	%s INCOMMENT
####	
####	<INITIAL>"/*" 	{
####		BEGIN INCOMMENT;
####	}
####	<INCOMMENT>(.|\n)	{
####		/* ignore */;
####	}
####	<INCOMMENT>"*/"	{
####		BEGIN INITIAL;
####	}

Works just fine, and is VERY clear.  "INITIAL" is an undocumented state that
represents the start state that you are in to begin with.  If you have any
other single char matchers in your lex script, make sure they are AFTER the
middle pattern above, or are prefixed with another start state (even INITIAL,
if you don't need any other start states).

Randal L. ("(null)") Schwartz, esq. (merlyn@sequent.UUCP)
	(Official legendary sorcerer of the 1984 Summer Olympics)
Sequent Computer Systems, Inc. (503)626-5700 (sequent = 1/quosine)
UUCP: ...!XXX!sequent!merlyn where XXX is one of:
	decwrl nsc ogcvax pur-ee rocks34 shell teneron unisoft vax135 verdix

rjhurdal@watmath.UUCP (David Canzi) (05/05/84)

The puzzles in net.unix-wizards are better than the ones in net.puzzle.
I blew two hours composing this:
	"/*"[^*]*"*"+([^*/][^*]*"*"+)*"/" printf("<<<%s>>>", yytext) ;
and tested it on this input:
	asdqwa/*/asaas*/werwer
	sdf/**/sdfsdf/***/erwerwer
	cvb/*tdfg*/*xcvcv*/werwe
	ty/bcvb/*******/***/fssdf
and it appears to work.  Please let me know if you come up with cases
where it doesn't work...

ian@utcsstat.UUCP (Ian F. Darwin) (05/05/84)

	From: anderson@uwvax.ARPA
	
	I have received several replies to my request for a lex expression
	to recognize /* ... */ comments.  The only one that works (sent in
	by Jim Hogue) is
	
	"/*"([^*]*"*"*"*"[^/*])*[^*]*"*"*"*/"
	
	which I can't claim to fully understand.  Nor do I understand why my
	original,  "/*"([^*]|("*"/[^/]))*"*/", doesn't work.  The idea is that
	each character in the string between /* and */ can either be something
	other than *, or * followed by something other than /.
	
	Can anyone come up with an expression simpler than Hogue's that works?
	By "works", I mean put it in a real "lex" program, as in:
	
	(your expr)	printf("recognized (%s)\n",yytext);
	
	and try it on inputs such as /***/, /*/*/, etc.
	
	-- David Anderson (uwvax!anderson)

It's not clear what the goal of this exercise actually is.
On the assumption that you are trying to build part of a compiler,
here is a simple, *readable* code fragment which inputs a C program
and eats all the comments, which is what you do in a real
compiler. I have no wish to strain my eyes on Jim Hogue's excellent
APL program (nor the original, for that matter), so I wrote this.
It's actually more complex than it needs to be; this is more a
comment on the simple-mindedness of lex than on my coding style
(and no, I have nothing better to offer than lex; flames to /dev/null
given that I do acknowledge lex's authors' accomplishments).

Our approach is to use the lex ``start condition''. There can
be many start conditions defined, but only 001 of them can
be active at any time, and you can neither turn off a start
condition (you can only say BEGIN 0, which returns to ``the
normal state''), nor use rule 0 in the <...> prefix.

With that in mind, here is lexcom.l:
	%START INCOM  NOTIN
	%%
	<INCOM>"*/"	{BEGIN NOTIN;}
	<INCOM>"/*"	unput('*');
	<INCOM>.	;
	<NOTIN>"/*"	{BEGIN INCOM;}
	"/*"		BEGIN INCOM;
	<NOTIN>.	ECHO;
	%%

In order to understand this, read carefully section 10 of the Lex manual.
If you think you have a better (shorter) way, **try it before you post it**
because the most obvious optimisations do not work! Geoff Collyer
and I spent some time interpreting the manual and making the
minimum test case that would work *correctly*.

While our version is not as short as Jim Hogue's submission,
	%%
	"/*"([^*]*"*"*"*"[^/*])*[^*]*"*"*"*/"	;
	.	ECHO;
	%%
(I've made it into a full program) it has the following advantages:
	1) it handles multi-line comments correctly.
	2) it does not overflow lex's input buffer (see the
	admonition in section 5 of the Lex manual about
	``Don't try to defeat this with expressions like ...
	... or equivalents; the Lex-generated program will try to read
	the entire input file, causing internal buffer overflows.''
	This is what the APL version does.

	Nor does it suffice to simply ``expand Lex's buffers''.
	This will not work on some machines and certainly
	not on binary-only systems (they do exist!).

Here are the test cases used. Our program passes both.
'To pass' means to do the same thing that CPP does.
(CPP is the zeroeth pass of the C compiler, the preprocessor,
which normally eats the comments from live C programs
when they are being compiled. I diffed the output as follows:
	cc -P t.c	# produces t.i
	lexcom <t.c >t.i2
	diff t.i t.i2
(where t.c contains the first test case) and  got no differences.
The second test case produces nothing but null lines
(cpp and our program) and a core dump (the one line program).

1) A simple C fragment:
	/************************************/
	/* this is a test program */
	/**/int i=2; /* initialise an int */
	/*/*/int j=3; /* init an int */
	/*
	bletch
	end of file in the middle of a comment - a good test.

2) a longer fragment, but still valid C:
	(echo '/*'; cat /etc/passwd; echo '*/') | a.out
(assuming that */ does not appear in /etc/passwd at any
given moment; ours doesn't). Our program produced many
newlines, as did CPP. The one-liner dumped core!

Is there a moral in all this? I think it's that a
program that is a few lines longer, but is readable
and conforms to the manual, is in fact a better program.

Ian Darwin, Toronto Canada {ihnp4|decvax}!utcsstat!ian
-- 
Ian Darwin, Toronto   uucp: utcsstat!ian   Arpa: decvax!utcsstat!ian@Berkeley

mp@whuxle.UUCP (Mark Plotnick) (05/05/84)

The problem with
	"/*"([^*]|("*"/[^/]))*"*/"
is that the right context handling in lex in nested regular expressions
is a little nonintuitive.  After lex recognizes the complete
expression, it backs up one character because of the ``/[^/]''
expression.

In case you still don't see the problem, run this lex program:

	a(b/c)c { printf("I saw this: %s\n", yytext); }
	.	{ printf("char: '%c'\n", yytext[0]);

The first rule will NOT match ``abc'', but it will match ``abcc'',
sort of.  It prints out ``I saw this: ab''.

To be safe, only use right context at the very end of your regular
expression.

Yet Another Way To Recognize Comments:

I really don't enjoy beating my head against a wall playing
with regular expressions and starting conditions.  When we
had to write a compiler a couple of years ago (any other
AM295 survivors out there?), we did something like:
"/*" {
#define LEXEOF 0
	int c, last_c='\0';
	while ((c=input()) != LEXEOF) {
		if (last_c == '*' && c=='/')
			break;
		else
			last_c=c;
	}
	printf("comment seen\n");
	if (c == LEXEOF)
		printf("EOF within comment\n");
    }

Moving some of the effort into the action routine allows you to easily
add more context-dependent features, such as printing a warning message
if there's a ';' within the comment, supporting nested comments, etc.
	Mark Plotnick

merlyn@sequent.UUCP (05/07/84)

> From: chris@umcp-cs.UUCP
> Message-ID: <6879@umcp-cs.UUCP>
> Date: Thu, 3-May-84 20:59:25 PDT
> 
> A Lex expression that matches C comments:
> 
> 	comment			"/*"(\*[^/]|[^*]*)*"*/"
> 
> (The * after the first ] shouldn't be necessary, I think, but you
> did specify that the expression come from working code...)

Funny, your "working code" doesn't match /***/ .. try it!

Randal L. ("no comment") Schwartz, esq. (merlyn@sequent.UUCP)
	(Official legendary sorcerer of the 1984 Summer Olympics)
Sequent Computer Systems, Inc. (503)626-5700 (sequent = 1/quosine)
UUCP: ...!XXX!sequent!merlyn where XXX is one of:
	decwrl nsc ogcvax pur-ee rocks34 shell teneron unisoft vax135 verdix

merlyn@sequent.UUCP (05/07/84)

> From: rjhurdal@watmath.UUCP
> Message-ID: <7666@watmath.UUCP>
> Date: Fri, 4-May-84 14:58:05 PDT
> 
> The puzzles in net.unix-wizards are better than the ones in net.puzzle.
> I blew two hours composing this:
> 	"/*"[^*]*"*"+([^*/][^*]*"*"+)*"/" printf("<<<%s>>>", yytext) ;
> and tested it on this input:
> 	asdqwa/*/asaas*/werwer
> 	sdf/**/sdfsdf/***/erwerwer
> 	cvb/*tdfg*/*xcvcv*/werwe
> 	ty/bcvb/*******/***/fssdf
> and it appears to work.  Please let me know if you come up with cases
> where it doesn't work...

I can't seem to break this one (even spent 20 minutes with a little
"railroad-track" finite-state-machine model).

However, it doesn't appear as elegant as a solution sent in a private
message to me from Andrew Klossner @ tektronix:

	"/*"([^*]*"*"+[^/])*"*"*"/"

This one I can't break either.  (No claims for lack of human error,
however.)

The solution I submitted with start states (mail me if you didn't get
that one) was preferred in another private communication for a reason
that I had failed to notice at the time... all lex regular expressions
which attempt to scarf up a comment in one fell swoop can overflow the
fixed-size "yytext[]" array EASILY.  Take, for example, a typical
start-of-file log produced by the RCS $Log:$ stuff, or an in-source
manpage (yes, I've seen them).  Ick.  Start states avoid that hassle.

It was pointed out to me in that same private communication that any
lex rules that are not qualified by a start state are STILL active
inside the comments.  Boo.  I forgot about that.  I've learned to
prefix ALL my rules by start states (even if it is just <INITIAL>).  I
had forgotten that I do that regularly.

Randal L. ("no comment") Schwartz, esq. (merlyn@sequent.UUCP)
	(Official legendary sorcerer of the 1984 Summer Olympics)
Sequent Computer Systems, Inc. (503)626-5700 (sequent = 1/quosine)
UUCP: ...!XXX!sequent!merlyn where XXX is one of:
	decwrl nsc ogcvax pur-ee rocks34 shell teneron unisoft vax135 verdix

merlyn@sequent.UUCP (05/07/84)

Yucko... in that last message from me, <487@sequent.UUCP>:

	"/*"([^*]*"*"+[^/])*"*"*"/"

should have been:

	"/*"([^*]*"*"+[^/])*"*"+"/"

(Talk about human error!)

My apologies, Andrew.

Randal L. ("typo-matic") Schwartz, esq. (merlyn@sequent.UUCP)
	(Official legendary sorcerer of the 1984 Summer Olympics)
Sequent Computer Systems, Inc. (503)626-5700 (sequent = 1/quosine)
UUCP: ...!XXX!sequent!merlyn where XXX is one of:
	decwrl nsc ogcvax pur-ee rocks34 shell teneron unisoft vax135 verdix

chris@umcp-cs.UUCP (05/10/84)

All right, all right, I goofed!  (There, he admitted it.  Happy? :->)

Yes, my short expression doesn't match /***/.  My code happens to
work because of what it does by default with stuff it doesn't recognize.
Oh well.

Anyone got a foot-from-mouth removal device?
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci (301) 454-7690
UUCP:	{seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland

loverso@sunybcs.UUCP (05/13/84)

Might I just add one tidbit on comment recognition:
Having just finished a compilers course, where we talked somewhat
about lex/yacc, I, too, came across the bug in lex where the
trailing "/" would not be matched.  Also, as far as I can remember,
not one article had a true comment recognizer, as they would all
produce incorrect output on:
	printf(" just a /* test */ ");
(Unless you're all assuming a rule already handling strings...)

-- 
John Robert LoVerso @ SUNY Buffalo
UUCP	allegra!{watmath|rocksvax}!sunybcs!loverso (watmath preferred)
CSnet	loverso.buffalo-cs@csnet-relay
--
Repeal the Income Tax / Support the Liberty Amendment!

chris@hwcs.UUCP (Chris Miller) (05/18/84)

The following is a fully general comment recogniser for /* ... */
comments in 'lex' - I have used definitions to make it a little more
readable (I just can't cope with things like ("*"[^*]*)!).

It should be pointed out that I don't believe that this is the RIGHT
way to handle comments unless it is essential to retain their text;
comments can be very long, and trying to match them with 'lex' can
easily overflow buffers.  I prefer solutions which match the opening
/* and then throw away the rest of the comment in the action routine,
using a bit of ad hoccery.
____________________________________________________________________
	STAR	"*"
	SLASH	"/"
	CSTART	({SLASH}{STAR})
	CUNIT	([^*]|{STAR}+[^/*])
	CBODY	({CUNIT}*)
	CEND	({STAR}+{SLASH})
	COMMENT	({CSTART}{CBODY}{CEND})
	%%
	{COMMENT}	printf("COMMENT '%s'\n", yytext);
	%%
	yywrap()
	{
		exit(0);
	}

	main()
	{
		for (;;)
			yylex();
	}
____________________________________________________________________
One problem with the original non-working version is that it fails for
comments terminated by an EVEN number of asterisks and a /.  This seems
to be a common bug in distributed compilers, etc, even when they don't use
'lex' for token generation.  I have encountered this bug in several C
compilers and their corresponding lints (of course, since lint usually uses
cpp), and also in the original distribution of CProlog - you may find it
entertaining to try out
	/** This is a legal comment **/
on any language systems which OUGHT to accept it.  The fix is almost always
trivial - the problem comes from reading the character following an asterisk
without subsequently putting it back in the input if it happens to be
another asterisk.

						Chris Miller
						Heriot-Watt Computer Science
						...ukc!edcaad!hwcs

mike@sdcrdcf.UUCP (Michael Williams) (05/19/84)

printf("If you're preparing to handle /* all */ comments,");
/* Be prepared for comments like: /* "these" 		*/
/* "or variants such as: /* "these", "/*", "*/" */" 	*/
/* and mismatched" quotes as "well" 			*/
/* If you're in a comment-istic state, all "quotes" are censored.
 * If your (valid state is) "quoted", all comments are ignored.
 */

Mike

howard@metheus.UUCP (Howard A. Landman) (05/19/84)

The state machine pictured in the referenced article is wrong.  It should
have an extra loop on top of the second state box.  The loop should be
labelled "/".

	Howard A. Landman
	ogcvax!metheus!howard