[comp.lang.c] Want a way to strip comments from a C file

jrv@siemens.UUCP (James R Vallino) (03/09/89)

We're working with this lousy compiler which is choking on files which
have too many comments.  Does anyone have a sed or awk script which we
can use to preprocess the C source and get rid of all the comments before
sending it to the compiler?

Thanks!


-- 
Jim Vallino	Siemens Corporate Research, Princeton, NJ
jrv@siemens.com
princeton!siemens!jrv
(609) 734-3331

mnc@m10ux.UUCP (Michael Condict) (03/13/89)

I recently posted to this group a shell script that calls three sed scripts
to extract function prototypes from (practically) any C source.  One of the
three sed scripts consisted of little more than comment removal -- exactly
what you are looking for.  Here is the relevant portion:

--------------------------------------------------------------------------
# Delete comments:
: delcom
/\/\*/{
	# Change first comment delim to @ (after eliminating existing @'s):
	s/@/<Used#to%be+an-At>/g
	s:/\*:@:

	# Read until we have the end comment:
	: morecm
	/\*\//!{
		# Just to cut down on max buffer length:
		s/@.*/@/
		N
		b morecm
	}

	# Get rid of any $'s:
	s/\$/<Used#to%be+a-Dollar>/g

	# First occurrence of */ is guaranteed to be the corresponding end
	# comment, because it is otherwise not legal C, so:
	s:\*/:$:
	s/@[^$]*\$/ /

	# Restore $'s and @'s:
	s/<Used#to%be+a-Dollar>/$/g
	s/<Used#to%be+an-At>/@/g

	b delcom
}
------------------------------------------------------------------------------

The disclaimers are that (1) it only works with BSD-derived sed, unless you
get rid of all the comments; and (2) it will fail for programs that contain
the extremely unlikely "Used#to%be..." strings used as markers in the script.

This has been tested on thousands of lines of source code from various sources,
but no guarantees.  You get what you pay for.

Mike Condict
-- 
Michael Condict		{att|allegra}!m10ux!mnc
AT&T Bell Labs		(201)582-5911    MH 3B-416
Murray Hill, NJ

hollombe@ttidca.TTI.COM (The Polymath) (03/16/89)

In article <880@m10ux.UUCP> mnc@m10ux.UUCP (Michael Condict) writes:
}I recently posted to this group a shell script that calls three sed scripts
}to extract function prototypes from (practically) any C source.  One of the
}three sed scripts consisted of little more than comment removal -- exactly
}what you are looking for.  Here is the relevant portion:
}
}[...]

}The disclaimers are that (1) it only works with BSD-derived sed, unless you
}get rid of all the comments; and (2) it will fail for programs that contain
}the extremely unlikely "Used#to%be..." strings used as markers in the script.

If I understood the original posting correctly, it will also fail if it
encounters a /* or */ within a quoted string constant.  E.g.:

     char *msg1 = "The symbol \"/*\" begins a comment in C. \n";
     char *msg2 = "The symbol \"*/\" ends a comment in C. \n";

I deliberately added the escaped double-quotes to show that true, safe
comment detection and removal isn't a trivial problem.  There are probably
a number of other "special" cases that can cause a simple, scan-for-/*,
scan-for-*/ algorithm to fail.
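
One way to cope with the cases above is a small state machine that tracks
whether the scanner is in code, in a string constant, in a character constant,
or in a comment.  What follows is only a minimal sketch in C, not anything
posted in this thread: the name strip_comments and its buffer convention are
assumptions made for illustration.  Each comment is replaced by one space,
which is how the C translation phases treat comments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy src to dst, replacing each C comment with a
   single space.  dst must have room for strlen(src) + 1 bytes.
   Returns the number of characters written, not counting the final NUL. */
size_t strip_comments(const char *src, char *dst)
{
    enum { CODE, STRING, CHARLIT, COMMENT } state = CODE;
    char *start = dst;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        char c = *src++;
        switch (state) {
        case CODE:
            if (c == '/' && *src == '*') {
                src++;              /* consume the '*' of the opener */
                *dst++ = ' ';       /* a comment becomes one space */
                state = COMMENT;
            } else {
                if (c == '"')
                    state = STRING;
                else if (c == '\'')
                    state = CHARLIT;
                *dst++ = c;
            }
            break;
        case STRING:
        case CHARLIT:
            *dst++ = c;
            if (c == '\\' && *src != '\0')
                *dst++ = *src++;    /* copy the escaped character as-is */
            else if ((state == STRING && c == '"') ||
                     (state == CHARLIT && c == '\''))
                state = CODE;       /* closing quote reached */
            break;
        case COMMENT:
            if (c == '*' && *src == '/') {
                src++;              /* consume the '/' of the closer */
                state = CODE;
            }
            break;
        }
    }
    *dst = '\0';
    return (size_t)(dst - start);
}
```

Given a buffer at least as large as the input, strip_comments(text, out)
leaves both msg1 and msg2 above unchanged, because the escaped quotes never
end the STRING state.  The sketch ignores trigraphs and backslash-newline
splices that fall between the two characters of a comment delimiter.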

}This has been tested on thousands of lines of source code from various sources,
}but no guarantees.  You get what you pay for.

Sound advice.

-- 
The Polymath (aka: Jerry Hollombe, hollombe@ttidca.tti.com)  Illegitimati Nil
Citicorp(+)TTI                                                 Carborundum
3100 Ocean Park Blvd.   (213) 452-9191, x2483
Santa Monica, CA  90405 {csun|philabs|psivax}!ttidca!hollombe

cdold@starfish.Convergent.COM (Clarence Dold) (03/18/89)

From article <4060@ttidca.TTI.COM>, by hollombe@ttidca.TTI.COM (The Polymath):
> In article <880@m10ux.UUCP> mnc@m10ux.UUCP (Michael Condict) writes:
> }I recently posted to this group a shell script that calls three sed scripts
> }to extract function prototypes from (practically) any C source.  One of the
> }three sed scripts consisted of little more than comment removal -- exactly
> }what you are looking for.  Here is the relevant portion:

I managed to miss the original question, but none of the replies I've seen
use the compiler to strip comments.
On UNIX, running cpp on target.c will strip off the comments.
With Microsoft QuickC, QCL -E target.c will do the same.

Since both are part of a normal compile, they have to be proof against
double escapes and comment delimiters inside comments.

-- 
Clarence A Dold - cdold@starfish.Convergent.COM         (408) 435-5293
                ...pyramid!ctnews!starfish!cdold         
                P.O.Box 6685, San Jose, CA 95150-6685

mnc@m10ux.UUCP (Michael Condict) (03/24/89)

In article <4060@ttidca.TTI.COM>, hollombe@ttidca.TTI.COM (The Polymath) writes:
> In article <880@m10ux.UUCP> mnc@m10ux.UUCP (Michael Condict) writes:
> }I recently posted to this group a shell script that
> }    [ deletes comments from C source, among other things ]
>     . . .
> If I understood the original posting correctly, it will also fail if it
> encounters a /* or */ within a quoted string constant.  E.g.:
>     . . .

Oops, you are absolutely right.  After some analysis of this limitation in
my sed script, it is obvious that the regular expressions of sed (or awk or
vi/ex/ed) are too limited to handle the job in any reasonable fashion.
Besides, the lex script that does the job is trivial.  Someone pointed out that
they were posting a six-line lex script to comp.sources.unix.  This doesn't
seem like the best way to display the solution, since the article announcing
the posting was itself longer than six lines.  I'll throw out the following
3-line lex script, which has been tested on all the devious ways of forming
comments and quotes that I can think of.  In particular, it handles comment
delimiters within quotes and quotes within comment delimiters:

----------- Lex script to delete comments from C source code ----------------
%%
\"([^\\"]*\\(.|\n))*[^\\"]*\"	ECHO;
"/*"([^*]*"*"[^/])*[^*]*"*/"	;
.				ECHO;
-----------------------------------------------------------------------------

Can anyone find anything wrong with this one (he asks stupidly)?  Can anyone
find a shorter solution?  Boy, this is almost as much fun as computing factorial
in the minimum-sized C program.
-- 
Michael Condict		{att|allegra}!m10ux!mnc
AT&T Bell Labs		(201)582-5911    MH 3B-416
Murray Hill, NJ

tps@chem.ucsd.edu (Tom Stockfisch) (03/25/89)

In article <891@m10ux.UUCP> mnc@m10ux.UUCP (Michael Condict) writes:
#----------- Lex script to delete comments from C source code ----------------
#%%
#\"([^\\"]*\\(.|\n))*[^\\"]*\"	ECHO;
#"/*"([^*]*"*"[^/])*[^*]*"*/"	;
#.				ECHO;
#Can anyone find anything wrong with this one (he asks stupidly)?

Fails on
	/***/
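
The failure seems to come from the way the comment rule pairs each interior *
with the character after it: in /***/ the two stars after the opener are
consumed together, and only a bare / is left where */ is required.  A pattern
often cited for lex is "/*"([^*]|"*"+[^*/])*"*"+"/", which allows a run of
stars before the closing slash.  The same idea can be sketched as a hand-coded
scanner in C.  The name comment_length is hypothetical, and this illustrates
the matching logic only; it is not a drop-in replacement for the lex script.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the length of the comment that starts at s,
   or 0 if s does not begin a complete comment. */
size_t comment_length(const char *s)
{
    const char *p;
    int after_star = 0;     /* nonzero when the previous character was '*' */

    assert(s != NULL);
    if (s[0] != '/' || s[1] != '*')
        return 0;
    for (p = s + 2; *p != '\0'; p++) {
        if (after_star && *p == '/')
            return (size_t)(p - s) + 1;     /* include the closing '/' */
        after_star = (*p == '*');           /* a run of stars stays "armed" */
    }
    return 0;               /* unterminated comment */
}
```

Because after_star stays set across a run of stars, comment_length("/***/")
returns 5.  Scanning starts after the opener, so the star of the opener is
never reused as part of a closer and "/*/" is reported as unterminated.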
-- 

|| Tom Stockfisch, UCSD Chemistry	tps@chem.ucsd.edu