[comp.lang.c] request for C comment stripper

ian@ux.cs.man.ac.uk (Ian Cottam) (03/20/89)

I usually use a lex script for such things.  I didn't have one for
C, but the following might do the trick.  N.B.  Not tested, not proven,
no warranty!
________________________________________________________________________
%{
  /***** Lex script to strip comments from C texts ******/
%}
%s COMMENT STRING CHAR
%%
<INITIAL>\'	        {BEGIN CHAR;   ECHO;}
<INITIAL>\"	        {BEGIN STRING; ECHO;}
<INITIAL>"/*"		BEGIN COMMENT;
<INITIAL>.		ECHO;
<INITIAL>\n		ECHO;
<CHAR>\\(.|\n)          ECHO;   /* any escaped character */
<CHAR>\'                {ECHO; BEGIN INITIAL;}
<STRING>\\(.|\n)        ECHO;   /* any escaped character */
<STRING>\"              {ECHO; BEGIN INITIAL;}
<COMMENT>"*/"		BEGIN INITIAL;
<COMMENT>.		;
<COMMENT>\n		;
%%
-----------------------------------------------------------------
Ian Cottam, Room IT101, Department of Computer Science,
University of Manchester, Oxford Road, Manchester, M13 9PL, U.K.
Tel: (+44) 61-275 6157         FAX: (+44) 61-275-6280
ARPA: ian%ux.cs.man.ac.uk@nss.cs.ucl.ac.uk   
JANET: ian@uk.ac.man.cs.ux    UUCP: ..!mcvax!ukc!mur7!ian
-----------------------------------------------------------------
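
For comparison, the four start conditions in the lex script map directly
onto a hand-written state machine.  The following is a minimal sketch in C,
not a tested replacement: the names strip_comments and strips_to are made up
for illustration, and like the script it deletes comments outright rather
than replacing them with a space.  Inside a literal it treats a backslash
plus the next character as one unit, so that a literal ending in an escaped
backslash still closes where it should.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* States mirror the start conditions of the lex script. */
enum state { INITIAL, COMMENT, STRING, CHARLIT };

/*
 * Copy src to dst with comments removed.  dst must have room for
 * strlen(src) + 1 bytes.  Returns the number of bytes written,
 * not counting the terminating NUL.
 */
size_t strip_comments(const char *src, char *dst)
{
    enum state st = INITIAL;
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; src[i] != '\0'; i++) {
        char c = src[i];

        switch (st) {
        case INITIAL:
            if (c == '/' && src[i + 1] == '*') {
                st = COMMENT;
                i++;                 /* skip the '*' as well */
                break;
            }
            if (c == '"')
                st = STRING;
            else if (c == '\'')
                st = CHARLIT;
            dst[n++] = c;
            break;

        case COMMENT:
            if (c == '*' && src[i + 1] == '/') {
                st = INITIAL;
                i++;                 /* skip the '/' as well */
            }
            break;

        case STRING:
        case CHARLIT:
            dst[n++] = c;
            if (c == '\\' && src[i + 1] != '\0') {
                dst[n++] = src[++i]; /* copy the escaped character as-is */
            } else if ((st == STRING && c == '"') ||
                       (st == CHARLIT && c == '\'')) {
                st = INITIAL;
            }
            break;
        }
    }
    dst[n] = '\0';
    return n;
}

/* Convenience for experiments: does src strip to exactly expected? */
int strips_to(const char *src, const char *expected)
{
    char buf[256];

    assert(strlen(src) < sizeof buf);
    strip_comments(src, buf);
    return strcmp(buf, expected) == 0;
}
```

With this in place, strips_to("a /* x */ b", "a  b") should hold, while a
comment opener inside a string literal is left alone.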

leo@philmds.UUCP (Leo de Wit) (03/21/89)

In article <5693@ux.cs.man.ac.uk> ian@ux.cs.man.ac.uk (Ian Cottam) writes:
|
|I usually use a lex script for such things.  I didn't have one for
|C, but the following might do the trick.  N.B.  Not tested, not proven,
|no warranty!

	 [lex script omitted]

This will cover most ordinary cases; but not this one:

(startcom.h is either empty or contains a /* ).

main()
{
    puts("Testing 1");
#include "startcom.h"
    puts("Testing 2");
/*
    puts("Testing 3");
*/
}

Whether the second puts gets commented out depends on the contents
of the header file.
OK, I'll admit it is a bit far-fetched 8-); it does, however, prove once
again that getting the general case right isn't exactly trivial.

	 Leo.
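
A stripper that looks at one file at a time cannot know what the include
will expand to, but it can at least detect a file that ends inside an open
comment, which is the situation startcom.h creates when it contains a lone
/* .  A minimal sketch in C, under the assumption that only old-style
comments and ordinary string and character literals matter; the name
ends_inside_comment is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if text ends inside an unterminated comment, 0 otherwise.
 * Comment openers inside string or character literals are ignored.
 */
int ends_inside_comment(const char *text)
{
    enum { CODE, COMMENT, STRING, CHARLIT } st = CODE;

    assert(text != NULL);
    for (size_t i = 0; text[i] != '\0'; i++) {
        char c = text[i];

        switch (st) {
        case CODE:
            if (c == '/' && text[i + 1] == '*') {
                st = COMMENT;
                i++;                 /* consume the '*' */
            } else if (c == '"') {
                st = STRING;
            } else if (c == '\'') {
                st = CHARLIT;
            }
            break;

        case COMMENT:
            if (c == '*' && text[i + 1] == '/') {
                st = CODE;
                i++;                 /* consume the '/' */
            }
            break;

        case STRING:
        case CHARLIT:
            if (c == '\\' && text[i + 1] != '\0') {
                i++;                 /* skip the escaped character */
            } else if ((st == STRING && c == '"') ||
                       (st == CHARLIT && c == '\'')) {
                st = CODE;
            }
            break;
        }
    }
    return st == COMMENT;
}
```

Run over the two variants of startcom.h, this returns 0 for the empty file
and 1 for the one holding only a comment opener, so a tool could warn
instead of silently producing different output from the compiler.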

ian@ux.cs.man.ac.uk (Ian Cottam) (03/21/89)

In article <985@philmds.UUCP> leo@philmds.UUCP (Leo de Wit) writes:
>In article <5693@ux.cs.man.ac.uk> ian@ux.cs.man.ac.uk (Ian Cottam) writes:
>|
>|I usually use a lex script for such things....

>This will cover most ordinary cases; but not this one:
>
>[an include file...] (startcom.h is either empty or contains a /* ).

According to my understanding of the pANS C preprocessor and the
notion of a "translation unit", your example is erroneous as the
comment must be terminated within startcom.h.  (As a practical
man I also confirmed my suspicion with the help of gcc :-) )
However, I suspect someone can come up with a case that will
throw my lex script as comment-strippers always require more
thought than people can believe possible.


leo@philmds.UUCP (Leo de Wit) (03/23/89)

In article <5695@ux.cs.man.ac.uk> ian@mucs.UUCP (Ian Cottam) writes:
|According to my understanding of the pANS C preprocessor and the
|notion of a "translation unit", your example is erroneous as the
|comment must be terminated within startcom.h.  (As a practical
|man I also confirmed my suspicion with the help of gcc :-) )

OK, if your stripper is pANS conforming (whatever that means) it
should handle trigraphs too (it doesn't). If it isn't, then, being
a practical man, try and test my sample program on an Ultrix 2.x C
compiler (4.3 BSD will probably do too). It'll compile just fine.
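
To see why trigraphs matter to a stripper, consider the text "a??/"/*b*/".
Since ??/ stands for a backslash, the second quote is escaped and the whole
thing is one string literal; a stripper that ignores trigraphs sees a short
string followed by a comment.  One way round this is to replace trigraphs
in a pass before stripping.  The sketch below assumes the nine trigraphs of
the final ANSI standard (a given draft may differ), and the name
replace_trigraphs is made up for illustration.  The source avoids writing
two adjacent question marks so that it compiles the same way whether or not
the compiler itself honours trigraphs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Replace trigraphs in src, writing the result to dst.
 * dst needs room for strlen(src) + 1 bytes (the output never grows).
 * Returns the number of bytes written, excluding the NUL.
 */
size_t replace_trigraphs(const char *src, char *dst)
{
    /* Third character of each trigraph and its replacement, pairwise. */
    static const char third[] = "=(/)'<!>-";
    static const char repl[]  = "#[\\]^{|}~";
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; src[i] != '\0'; i++) {
        if (src[i] == '?' && src[i + 1] == '?' && src[i + 2] != '\0') {
            const char *p = strchr(third, src[i + 2]);

            if (p != NULL) {
                dst[n++] = repl[p - third];
                i += 2;              /* skip the rest of the trigraph */
                continue;
            }
        }
        dst[n++] = src[i];           /* ordinary character */
    }
    dst[n] = '\0';
    return n;
}
```

Feeding the output of this pass to a comment stripper, rather than the raw
text, lets the stripper's existing backslash handling deal with the escaped
quote in the example above.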

|However, I suspect someone can come up with a case that will
|throw my lex script as comment-strippers always require more
|thought than people can believe possible.

That's exactly the point I keep trying to make: it shouldn't be
difficult, but it isn't entirely trivial either (it may even depend on
the version of pANS you're reading; look, for instance, at /* in include
file names, and compare the Feb '87 and May '88 drafts).

   Leo.