[comp.lang.c] Match C Comments... the r

martin@mwtech.uucp@canremote.uucp (martin@mwtech.UUCP) (12/21/89)
From: martin@mwtech.UUCP (Martin Weitzel)
Subj: Match C Comments... the right answer (was Re: LEX rule, anyone???)
Orga: MIKROS Systemware, Darmstadt/W-Germany

In article <5365@omepd.UUCP> merlyn@iwarp.intel.com (Randal Schwartz) writes:
>In article <2191@prune.bbn.com>, rsalz@bbn (Rich Salz) writes:
>| In <601@vice2utc.chalmers.se> d5kwedb@dtek.chalmers.se (Kristian Wedberg) writes:
>| >A question from a friend of mine, Pär Eriksson:
>| >    Does anyone know how to write a LEX rule for C comments,
>| >    ie for everything between /* and */, nesting not allowed?
>|
>| We go through this in comp.lang.c about once a year.  Almost everyone
>| gets it wrong.  The best thing to do is to define a lex rule that
>| catches "/*" and in the actions for that rule look for */ on your own.

True, but only partially.

>
>OK, almost everyone got it wrong (and I took great delight at pointing
>out what was wrong sometimes :-), but there was ONE correct answer --
>mine ( :-)...
>
>               "/*"(\**[^*/]|\/)*\*+\/

All solutions to this problem that use only regexes suffer from one
major weakness: lex's text buffer (yytext) is, in most cases, only about
200 bytes. If the buffer overflows, you're out of luck! (You may
enlarge the buffer, I know, but it is not uncommon for a comment to run
well over 2 KByte, and nothing in C limits the length of a comment. So
how big should the buffer be?)
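
The approach Rich Salz describes in the quoted text (match only "/*" and
look for the closing delimiter by hand) avoids the buffer limit, because
nothing has to be collected in yytext. Below is a minimal sketch of such
a scanning loop in plain C. The name skip_comment is hypothetical, and
the sketch reads from a FILE * with getc(); inside a real lex action the
same loop would presumably call lex's input() instead.

```c
#include <stdio.h>

/* Hypothetical helper: call it right after the opening slash-star has
 * been consumed. It reads one character at a time, so memory use does
 * not grow with the length of the comment.
 * Returns 0 once the closing star-slash has been consumed, -1 on EOF. */
int skip_comment(FILE *in)
{
    int c;
    int prev_star = 0;          /* was the previous character a '*'? */

    while ((c = getc(in)) != EOF) {
        if (c == '/' && prev_star)
            return 0;           /* found the closing delimiter */
        prev_star = (c == '*');
    }
    return -1;                  /* unterminated comment */
}
```

The only state kept between characters is the single flag prev_star, so
a comment of 2 KByte costs no more memory than one of 2 bytes.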

>
>Don't accept any cheap imitations.  They're probably wrong.  (This one
>is wrong in that it will match C comments inside of text strings, but
>that's just plain pathological. :-)

How long will I have to hear this! It's so easy with "lex", because
it has these wonderful "Start Conditions" that nobody seems to know
about :-(

The following skeleton strips all comments from a C source (and
*correctly* handles strings and character constants).

----------------------------------- cut here ------------------------
%start CTEXT COMMENT STRING
%%
%{
                BEGIN CTEXT;
%}
<CTEXT>"/*"     { BEGIN COMMENT; }
<CTEXT>'\\''    { ECHO; }
<CTEXT>'[^']+'  { ECHO; }
<CTEXT>\"       { ECHO; BEGIN STRING; }
<CTEXT>.|\n     { ECHO; }

<COMMENT>"*/"   { BEGIN CTEXT; }
<COMMENT>.|\n   ;

<STRING>\\(.|\n) { ECHO; /* backslash escape, never ends the string */ }
<STRING>\"      { ECHO; BEGIN CTEXT; }
<STRING>.       { ECHO; }
<STRING>\n      { ECHO; /* syntax error! */ }
----------------------------------- cut here ------------------------

Note that character constants and strings are handled by separate
rules, so this piece of code can easily be extended to build other
tools that need this distinction.
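
For readers without lex at hand, the idea behind the start conditions
can be sketched as a small hand-written state machine in C. This is an
illustration only, not a translation of the lex rules: the names
strip_comments and CHARCONST are hypothetical, and the sketch works on
an in-memory string rather than on a stream.

```c
#include <stddef.h>

/* One state per start condition, plus one for character constants. */
enum state { CTEXT, COMMENT, STRING, CHARCONST };

/* Copies src to dst with all comments removed. dst must have room for
 * strlen(src) + 1 bytes. Returns the number of bytes written, not
 * counting the terminating NUL. */
size_t strip_comments(const char *src, char *dst)
{
    enum state st = CTEXT;
    size_t n = 0;

    for (size_t i = 0; src[i] != '\0'; i++) {
        char c = src[i];

        switch (st) {
        case CTEXT:
            if (c == '/' && src[i + 1] == '*') {
                st = COMMENT;
                i++;                    /* skip the '*' as well */
                continue;
            }
            if (c == '"')
                st = STRING;
            else if (c == '\'')
                st = CHARCONST;
            dst[n++] = c;
            break;
        case COMMENT:
            if (c == '*' && src[i + 1] == '/') {
                st = CTEXT;
                i++;                    /* skip the '/' as well */
            }
            break;                      /* comment text is dropped */
        case STRING:
        case CHARCONST:
            dst[n++] = c;
            if (c == '\\' && src[i + 1] != '\0')
                dst[n++] = src[++i];    /* copy the escaped character */
            else if (c == (st == STRING ? '"' : '\''))
                st = CTEXT;             /* closing quote */
            break;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Because strings and character constants have their own states here, a
tool that needs to treat them differently only has to change the code
in the corresponding case.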

>(No, I don't have one in Perl... :-)

A general remark: Sometimes people discover a tool and learn to handle
it. Sometimes they learn that it is a really powerful tool, far more
powerful than they thought in the beginning. And very often the big
mistake follows: They want to do *everything* with this tool.

IMHO sometimes you should lean back and accept that not everything
can be done easily in the way you love most. (I have to tell myself
this sometimes :-).)

[rest deleted]
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83

---
 * Via MaSNet/HST96/HST144/V32 - UN C Language
 * Via Usenet Newsgroup comp.lang.c