[comp.lang.c] Match C Comments... the right answer

merlyn@iwarp.intel.com (Randal Schwartz) (12/18/89)

In article <2191@prune.bbn.com>, rsalz@bbn (Rich Salz) writes:
| In <601@vice2utc.chalmers.se> d5kwedb@dtek.chalmers.se (Kristian Wedberg) writes:
| >A question from a friend of mine, Pär Eriksson:
| >	Does anyone know how to write a LEX rule for C comments,
| >	ie for everything between /* and */, nesting not allowed?
| 
| We go through this in comp.lang.c about once a year.  Almost everyone
| gets it wrong.  The best thing to do is to define a lex rule that
| catches "/*" and in the actions for that rule look for */ on your own.

OK, almost everyone got it wrong (and I took great delight in pointing
out what was wrong sometimes :-), but there was ONE correct answer --
mine ( :-)...

		"/*"(\**[^*/]|\/)*\*+\/

Don't accept any cheap imitations.  They're probably wrong.  (This one
is wrong in that it will match C comments inside of text strings, but
that's just plain pathological. :-)
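
For anyone who wants to poke at the pattern outside lex, here is a minimal
sketch in C using the POSIX regcomp()/regexec() calls. The names
COMMENT_ERE and comment_length are made up for this illustration, and the
pattern is respelled as a POSIX extended regex with a leading ^ so that it
only matches at the start of the text. It is a quick check of the idea,
not a replacement for the lex rule.

```c
#include <regex.h>
#include <stddef.h>

/* POSIX ERE respelling of the lex pattern (an assumption of this sketch):
 * slash, star, then any run of "stars followed by a char that is neither
 * star nor slash" or lone slashes, then one or more stars and a slash. */
static const char COMMENT_ERE[] = "^/\\*(\\**[^*/]|/)*\\*+/";

/* Returns the length of the comment that starts at text[0],
 * or -1 if text does not begin with a complete comment. */
long comment_length(const char *text)
{
    regex_t re;
    regmatch_t m;
    long len = -1;

    if (regcomp(&re, COMMENT_ERE, REG_EXTENDED) != 0)
        return -1;                      /* pattern failed to compile */
    if (regexec(&re, text, 1, &m, 0) == 0)
        len = (long)(m.rm_eo - m.rm_so);
    regfree(&re);
    return len;
}
```

The point to look for is that the match stops at the first closing
delimiter: the group can never step over a star that is followed by a
slash, so a second comment later on the line is not swallowed.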

(No, I don't have one in Perl... :-)

Just another regex hacker,
-- 
/== Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ====\
| on contract to Intel's iWarp project, Hillsboro, Oregon, USA, Sol III  |
| merlyn@iwarp.intel.com ...!uunet!iwarp.intel.com!merlyn	         |
\== Cute Quote: "Welcome to Oregon... Home of the California Raisins!" ==/

martin@mwtech.UUCP (Martin Weitzel) (12/19/89)

In article <5365@omepd.UUCP> merlyn@iwarp.intel.com (Randal Schwartz) writes:
>In article <2191@prune.bbn.com>, rsalz@bbn (Rich Salz) writes:
>| In <601@vice2utc.chalmers.se> d5kwedb@dtek.chalmers.se (Kristian Wedberg) writes:
>| >A question from a friend of mine, Pär Eriksson:
>| >	Does anyone know how to write a LEX rule for C comments,
>| >	ie for everything between /* and */, nesting not allowed?
>| 
>| We go through this in comp.lang.c about once a year.  Almost everyone
>| gets it wrong.  The best thing to do is to define a lex rule that
>| catches "/*" and in the actions for that rule look for */ on your own.

True, but only partially.

>
>OK, almost everyone got it wrong (and I took great delight in pointing
>out what was wrong sometimes :-), but there was ONE correct answer --
>mine ( :-)...
>
>		"/*"(\**[^*/]|\/)*\*+\/

All solutions to this problem that use only regexes suffer from one major
flaw: the text buffer (yytext) of lex is, in most cases, only about
200 bytes. If the buffer overflows, you're out of luck! (You may enlarge
the buffer, I know, but it's not uncommon for a comment to run well
over 2 KByte, and nothing in C limits the length of a comment,
so how big should the buffer be?)
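
To make the "look for */ on your own" idea concrete, here is a minimal
sketch in C. Inside a lex action the characters would typically come from
input(); for the sake of a self-contained example this version assumes the
text is a NUL-terminated string, and the name skip_comment is made up. The
point is that only one flag of state is kept, so the length of the comment
does not matter.

```c
#include <stddef.h>

/* Given p pointing just past the opening slash-star, return a pointer
 * just past the closing star-slash, or NULL if the comment never closes.
 * Nothing is buffered, so the comment may be arbitrarily long. */
const char *skip_comment(const char *p)
{
    int prev_star = 0;          /* was the previous character a '*'? */

    for (; *p != '\0'; p++) {
        if (prev_star && *p == '/')
            return p + 1;       /* first position after the comment */
        prev_star = (*p == '*');
    }
    return NULL;                /* ran off the end: unterminated comment */
}
```

Note that the star of the opening delimiter is deliberately not counted,
so a slash directly after the opener does not close the comment.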

>
>Don't accept any cheap imitations.  They're probably wrong.  (This one
>is wrong in that it will match C comments inside of text strings, but
>that's just plain pathological. :-)

How long will I have to hear this? It's so easy with "lex", because
it has these wonderful "Start Conditions" that nobody seems to know about :-(

The following skeleton strips all comments from a C source (and
*correctly* handles strings and char constants).

----------------------------------- cut here ------------------------
%start CTEXT COMMENT STRING
%%
%{
		BEGIN CTEXT;
%}
<CTEXT>"/*"	{ BEGIN COMMENT; }
<CTEXT>'\\''	{ ECHO; }
<CTEXT>'[^']+'	{ ECHO; }
<CTEXT>\"	{ ECHO; BEGIN STRING; }
<CTEXT>.|\n	{ ECHO; }

<COMMENT>"*/"	{ BEGIN CTEXT; }
<COMMENT>.|\n	;

<STRING>\\(.|\n)	{ ECHO; /* escaped character: keeps \" and \\ in the string */ }
<STRING>\"	{ ECHO; BEGIN CTEXT; }
<STRING>.	{ ECHO; }
<STRING>\n	{ ECHO; /* syntax error! */ }
----------------------------------- cut here ------------------------

Note that character constants and strings are handled by separate
rules, so this piece of code can easily be extended to build
other tools that require this distinction.
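
For readers without lex at hand, the start conditions can be mirrored in
plain C as explicit states. The following is only a sketch under a few
assumptions: the input is a NUL-terminated string, the names
strip_comments and the enum values are made up, character constants get a
state of their own (CHARCONST) instead of being matched by a pattern, and
a backslash inside a string or character constant simply protects the
next character.

```c
#include <stddef.h>

enum state { CTEXT, COMMENT, STRING, CHARCONST };

/* Copy src to dst with comments removed. dst must have room for
 * strlen(src) + 1 bytes. Returns the number of bytes written,
 * not counting the terminating NUL. */
size_t strip_comments(const char *src, char *dst)
{
    enum state st = CTEXT;
    size_t n = 0;

    for (; *src != '\0'; src++) {
        char c = *src;

        switch (st) {
        case CTEXT:
            if (c == '/' && src[1] == '*') {
                st = COMMENT;
                src++;              /* consume the '*' as well */
                continue;
            }
            if (c == '"')
                st = STRING;
            else if (c == '\'')
                st = CHARCONST;
            dst[n++] = c;
            break;
        case COMMENT:
            if (c == '*' && src[1] == '/') {
                st = CTEXT;
                src++;              /* consume the '/' as well */
            }
            break;
        case STRING:
        case CHARCONST:
            dst[n++] = c;
            if (c == '\\' && src[1] != '\0')
                dst[n++] = *++src;  /* copy the escaped character verbatim */
            else if (c == (st == STRING ? '"' : '\''))
                st = CTEXT;         /* closing quote of the current literal */
            break;
        }
    }
    dst[n] = '\0';
    return n;
}
```

As in the lex version, comment delimiters inside a string or character
constant are copied through untouched, and only text seen in the CTEXT
state can open a comment.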

>(No, I don't have one in Perl... :-)

A general remark: Sometimes people discover a tool and learn to handle
it. Sometimes they learn that it is a really powerful tool, far more
powerful than they thought in the beginning. And very often then comes
the big mistake: they want to do *everything* with this tool.

IMHO sometimes you should lean back and accept that not everything
can be done easily in the way you love most. (I have to tell myself
this sometimes :-).)

[rest deleted]
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83