[comp.lang.c] not the way ...

merlyn@intelob.intel.com (Randal L. Schwartz @ Stonehenge) (03/17/89)

In article <9900010@bradley>, brian@bradley writes:
| >                          Does anyone have a sed or awk script which we
| > can use to preprocess the C source and get rid of all the comments before
| > sending it to the compiler?
| 
|   The following works in vi: :%s/\/\*.*\*\///g
| 
|   I don't know if it will work in sed, but it should...

Nope.  Just try it on the line:

  foo; bar;  /* comment1 */  bletch; /* comment2 */

'bletch;' disappears with the comments.
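
The effect is easy to reproduce outside vi. Here is a minimal sketch in C, assuming a POSIX system with <regex.h>; delete_first_match is a hypothetical helper written for this illustration, not a library call. POSIX matching is leftmost-longest, so the `.*` runs from the first "/*" to the last "*/" on the line.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Delete the first match of the POSIX basic regexp `pattern` from `line`,
 * writing the result to `out` (at least strlen(line) + 1 bytes, and not
 * the same buffer as `line`). Returns 1 if text was deleted, 0 if nothing
 * matched or the pattern did not compile.
 * Hypothetical helper for illustration only. */
int delete_first_match(const char *pattern, const char *line, char *out)
{
    regex_t re;
    regmatch_t m;
    int matched;

    assert(pattern != NULL && line != NULL && out != NULL);
    strcpy(out, line);                      /* default: line unchanged */
    if (regcomp(&re, pattern, 0) != 0)      /* flags 0 = basic regexp */
        return 0;
    matched = (regexec(&re, line, 1, &m, 0) == 0);
    regfree(&re);
    if (!matched)
        return 0;
    /* The prefix is already in `out`; append what follows the match. */
    strcpy(out + m.rm_so, line + m.rm_eo);
    return 1;
}
```

Called with the pattern "/\\*.*\\*/" (the same regexp as the vi command, written as a C string) on the line above, the single match swallows everything from comment1 through comment2, and only "foo; bar;  " is left.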

The regexp that matches comments looks like (in egrep/lex notation):

  [/][*]([*]*[^*/])*[*]+[/]

(I use [X] here instead of \X because I hate backslashes...).

Sed and vi are not powerful enough to eat things like this in one
regexp.

Didn't we just go through this about nine months ago? :-)
(And didn't I give the wrong answer at least twice? :-) :-)
-- 
Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095
on contract to BiiN (for now :-), Hillsboro, Oregon, USA.
ARPA: <@intel-iwarp.arpa:merlyn@intelob> (fastest!)
MX-Internet: <merlyn@intelob.intel.com> UUCP: ...[!uunet]!tektronix!biin!merlyn
Standard disclaimer: I *am* my employer!
Cute quote: "Welcome to Oregon... home of the California Raisins!"

leo@philmds.UUCP (Leo de Wit) (03/18/89)

In article <4221@omepd.UUCP> merlyn@intelob.intel.com (Randal L. Schwartz @ Stonehenge) writes:
   []
|Nope.  Just try it on the line:
|
|  foo; bar;  /* comment1 */  bletch; /* comment2 */
|
|'bletch;' disappears with the comments.
|
|The regexp that matches comments looks like (in egrep/lex notation):
|
|  [/][*]([*]*[^*/])*[*]+[/]
|
|(I use [X] here instead of \X because I hate backslashes...).
|
|Sed and vi are not powerful enough to eat things like this in one
|regexp.

Sed is often underestimated; it IS powerful enough:

s/\/\*\([^*/]*\*\)\1*\///g

will eat the comments away just nicely (I'll leave the HOW as an
exercise for the reader).

    Leo.

leo@philmds.UUCP (Leo de Wit) (03/18/89)

In article <981@philmds.UUCP> leo@philmds.UUCP (Leo de Wit) writes:
|Sed is often underestimated; it IS powerful enough:
|
|s/\/\*\([^*/]*\*\)\1*\///g
|
|will eat the comments away just nicely (I'll leave the HOW as an
|exercise for the reader).

Shame on me. That it works is merely a coincidence (put a * in a
comment and see it fail). \1 matches the exact string that the
\( \) group matched, not the group's pattern applied afresh.  And since
sed doesn't like \( \)* type expressions, this would be hard to do in
one regexp.  Can it be proven impossible (that is, deleting the
comments with one sed command, multi-line comments not considered)?

   Leo.

tps@chem.ucsd.edu (Tom Stockfisch) (03/23/89)

In article <4221@omepd.UUCP> merlyn@intelob.intel.com (Randal L. Schwartz @ Stonehenge) writes:
>| >Does anyone have a sed or awk script which we
>| > can use to preprocess the C source and get rid of all the comments
>|   The following works in vi: :%s/\/\*.*\*\///g
>Nope.  Just try it on the line:
>  foo; bar;  /* comment1 */  bletch; /* comment2 */
>'bletch;' disappears with the comments.
>The regexp that matches comments looks like (in egrep/lex notation):
>  [/][*]([*]*[^*/])*[*]+[/]
>Didn't we just go through this about nine months ago? :-)
>(And didn't I give the wrong answer at least twice? :-) :-)

You still don't have it right, I'm afraid.

This pattern won't work on

	/ /* / */

It is unbelievable how hard this task is in regular expressions, when it is
trivial to code by hand.

To convince yourself that a pattern is correct, I think you have to show
two things
	1.  That the body between the "/*" and "*/" cannot possibly contain
	    a "*/",
	2.  That the body can contain any other sequence of characters.
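
As an illustration of those two conditions, here is a sketch in C, assuming POSIX <regex.h> with extended regexps. The pattern in COMMENT_ERE is a commonly cited one and is not taken from any post in this thread; strip_with_regex is a hypothetical helper. Its body alternates between "any non-star" and "a run of stars followed by something that is neither star nor slash", so a "*/" cannot occur inside it (condition 1), while a lone '/' or a "/*" can (condition 2). The pattern quoted above fails condition 2 because its body excludes every '/'. Strings and character constants are not handled here.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Opener, then a body made of non-stars or star-runs followed by a
 * character that is neither star nor slash, then the closing star-run
 * and slash. (Assumed pattern, not from the thread.) */
#define COMMENT_ERE "/\\*([^*]|\\*+[^*/])*\\*+/"

/* Copy `src` to `dst` with every match of COMMENT_ERE deleted.
 * `dst` needs strlen(src) + 1 bytes. Returns 0 on success, -1 if the
 * pattern fails to compile. Strings and character constants are NOT
 * treated specially. Hypothetical helper for illustration only. */
int strip_with_regex(const char *src, char *dst)
{
    regex_t re;
    regmatch_t m;
    size_t o = 0;

    assert(src != NULL && dst != NULL);
    if (regcomp(&re, COMMENT_ERE, REG_EXTENDED) != 0)
        return -1;
    /* Every match is at least four characters, so src always advances. */
    while (regexec(&re, src, 1, &m, 0) == 0) {
        memcpy(dst + o, src, (size_t)m.rm_so);  /* text before the comment */
        o += (size_t)m.rm_so;
        src += m.rm_eo;                         /* resume after the comment */
    }
    strcpy(dst + o, src);                       /* trailing text */
    regfree(&re);
    return 0;
}
```

Run against "/ /* / */" this leaves "/ ", and against "/* */ hello */" it leaves " hello */".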

Various other patterns which have been posted (including ones by famous
net gurus) have failed to handle the following correctly:

1.
	/*****//hello world */

2.
	/* hello /* /* world */

3.
	/* */ hello /* */

4.
	/**// /* this input should produce "/ \n" for output */

5.
	/* */ hello */


So what works?  I haven't been able to break the lex program below, which
also correctly leaves comment delimiters inside strings and character
constants alone.

If you want a practical program, use start states and don't match an entire
comment with one pattern -- you won't be in danger of overflowing yytext[].
If you want to see how it's done with regular expressions, study the
following.


	/* lex program that strips comments */

okslash	([^*/]"/"+)

%%
"/*""/"*([^/]|{okslash})*"*/"	;

\"((\\(.|\n))|[^\\"])*\"	ECHO;

\'((\\(.|\n))|[^\\'])*\'	ECHO;

.|\n	ECHO;
-- 

|| Tom Stockfisch, UCSD Chemistry	tps@chem.ucsd.edu

lfoard@wpi.wpi.edu (Lawrence C Foard) (03/24/89)

I tried the comment stripper I posted earlier today on these pathological
cases and it seems to get the right answer.

Script started on Fri Mar 24 01:56:10 1989
% cat tmp3.tmp
        Commented                  should be
	/ /* / */                #   /
	/*****//hello world */   #  /hello world */
	/* hello /* /* world */  # 
	/* */ hello /* */        #   hello
	/**// /* this input should produce "/ \n" for output */      #     /
	/* */ hello */           #   hello */


% ../tmp/a.out <tmp3.tmp
        Commented                  should be
	/                 #   /
	/hello world */   #  /hello world */
	  # 
	 hello         #   hello
	/       #     /
	 hello */           #   hello */


% ^D
script done on Fri Mar 24 01:56:36 1989

Now the only question is did I parse it right?

-- 
Disclaimer: My school does not share my views about FORTRAN.
            FORTRAN does not share my views about my school.

rupley@arizona.edu (John Rupley) (03/25/89)

In article <1492@wpi.wpi.edu>, lfoard@wpi.wpi.edu (Lawrence C Foard) writes:
> I tried the comment stripper I posted earlier today on these pathological
> cases and it seems to get the right answer.

Close, but no cigar.  

We're talking real pathology, here.... try:
	(echo '/*';yes '*//*';echo 'cosmetic */') | stripper_name
Recursion blows the stack for your program.  Previously posted strippers
handle the above.
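
A hand-coded stripper does not need recursion at all. Here is a minimal sketch in C; it is not any of the programs posted in this thread, and strip_comments is a hypothetical name. It walks the input in a single loop with three states, so a flood of "*//*" lines costs no stack. Assumptions: the whole input sits in one NUL-terminated buffer, comments are deleted outright rather than replaced by a space, and trigraphs and backslash-newline splices are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `src` to `dst` without its C comments. `dst` needs
 * strlen(src) + 1 bytes. Comment delimiters inside string literals and
 * character constants are copied through untouched.
 * Hypothetical helper for illustration only. */
void strip_comments(const char *src, char *dst)
{
    enum { CODE, COMMENT, QUOTED } state = CODE;
    char quote = 0;          /* which quote character opened QUOTED */
    size_t i = 0, o = 0;

    assert(src != NULL && dst != NULL);
    while (src[i] != '\0') {
        char c = src[i];
        switch (state) {
        case CODE:
            if (c == '/' && src[i + 1] == '*') {
                state = COMMENT;         /* opener: skip both characters */
                i += 2;
                continue;
            }
            if (c == '"' || c == '\'') {
                state = QUOTED;
                quote = c;
            }
            dst[o++] = c;
            i++;
            break;
        case COMMENT:
            if (c == '*' && src[i + 1] == '/') {
                state = CODE;            /* closer: skip both characters */
                i += 2;
            } else {
                i++;                     /* drop comment text */
            }
            break;
        case QUOTED:
            dst[o++] = c;
            i++;
            if (c == '\\' && src[i] != '\0')
                dst[o++] = src[i++];     /* escaped character, copied as is */
            else if (c == quote)
                state = CODE;
            break;
        }
    }
    dst[o] = '\0';
}
```

Fed a short version of the stress input ("/*", a few lines of "*//*", then "cosmetic */"), it leaves just the final newline. An unterminated comment simply swallows the rest of the input.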

If you insist on a compilable file, use a script to produce:
        /*
        [stack-blowing number of lines of *//*]
        */
        compilable program text

Why strip comments? (1) the original poster had a broken compiler that choked
on comments; (2) the start of a cheap way to get a list or inverted index of
identifiers (cpp does too much).

I suspect all useful points (and more? :-) have been made about
comment stripping -- perhaps this thread should die now.

John Rupley
rupley!local@megaron.arizona.edu