[comp.lang.c] Want a way to strip comments from a

brian@bradley.UUCP (03/14/89)

> /* Written  9:58 am  Mar  9, 1989 by jrv@siemens.UUCP */
> /* ---------- "Want a way to strip comments from a" ---------- */
>                          Does anyone have a sed or awk script which we
> can use to preprocess the C source and get rid of all the comments before
> sending it to the compiler?

  The following works in vi: :%s/\/\*.*\*\///g

  I don't know if it will work in sed, but it should...

...............................................................................

  "Don't drop acid, take it pass-fail!"

  Brian Michael Wendt       UUCP: {cepu,uiucdcs,noao}!bradley!brian
  Bradley University        ARPA: cepu!bradley!brian@seas.ucla.edu
  (309) 677-2335            ICBM: 40 40' N  89 34' W

smk@cbnews.ATT.COM (Stephen M. Kennedy) (03/17/89)

In article <9900010@bradley> brian@bradley.UUCP writes:
>  The following works in vi: :%s/\/\*.*\*\///g

/*
 * Unfortunately, multi-line comments aren't deleted.
 */

---
Steve Kennedy
cbatt!cbosgd!smk

smk@cbnews.ATT.COM (Stephen M. Kennedy) (03/17/89)

In article <9900010@bradley> brian@bradley.UUCP writes:
>  The following works in vi: :%s/\/\*.*\*\///g

/* And this */ important_variable = 42 /* doesn't work either! */

---
Steve Kennedy
cbatt!cbosgd!smk
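
Both of the failures above come from `.*` matching as much as it can. As a
rough illustration (nothing like this was posted in the thread), the C sketch
below runs the same basic regular expression through POSIX <regex.h>. The
name naive_strip and its buffer interface are made up for this example, and
it removes only the first match on one line.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/*
 * Remove the first match of the vi/sed-style pattern from `line`,
 * writing the result to `out`.  Returns 1 if something was removed,
 * 0 if nothing matched, -1 if the pattern failed to compile.
 * (Hypothetical helper, for illustration only.)
 */
int naive_strip(const char *line, char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m;
    int rc;

    /* Same pattern as the vi command, written as a POSIX basic RE. */
    if (regcomp(&re, "/\\*.*\\*/", 0) != 0)
        return -1;
    rc = regexec(&re, line, 1, &m, 0);
    regfree(&re);

    if (rc != 0) {                      /* no comment-like text found */
        snprintf(out, outsz, "%s", line);
        return 0;
    }
    /* Keep what precedes the match and what follows it. */
    snprintf(out, outsz, "%.*s%s", (int)m.rm_so, line, line + m.rm_eo);
    return 1;
}
```

Fed the line with two comments, the match runs from the first opening
delimiter to the last closing one, so the assignment between them is
deleted as well. A line-oriented match also never sees a comment that
closes on a later line.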

rkl1@hound.UUCP (K.LAUX) (03/17/89)

In article <9900010@bradley>, brian@bradley.UUCP writes:
| 
| > /* Written  9:58 am  Mar  9, 1989 by jrv@siemens.UUCP */
| > /* ---------- "Want a way to strip comments from a" ---------- */
| >                          Does anyone have a sed or awk script which we
| > can use to preprocess the C source and get rid of all the comments before
| > sending it to the compiler?
| 
|   The following works in vi: :%s/\/\*.*\*\///g
| 
|   I don't know if it will work in sed, but it should...
| 

	Yes, it will.  The only problem is that it won't strip out comments
that span more than one line...'Aye, There's the Rub.

--rkl

leo@philmds.UUCP (Leo de Wit) (03/17/89)

In article <4896@cbnews.ATT.COM> smk@cbnews.ATT.COM (Stephen M. Kennedy) writes:
|In article <9900010@bradley> brian@bradley.UUCP writes:
|>  The following works in vi: :%s/\/\*.*\*\///g
|
|/* And this */ important_variable = 42 /* doesn't work either! */

And how about:

    puts(" A comment /* in here */");

And you can give more examples showing it isn't that trivial; a challenge
for the sed adept, perhaps ...

   Leo.

loo@mister-curious.sw.mcc.com (Joel Loo) (03/18/89)

In article <978@philmds.UUCP>, leo@philmds.UUCP (Leo de Wit) writes:
> In article <4896@cbnews.ATT.COM> smk@cbnews.ATT.COM (Stephen M. Kennedy) writes:
> |In article <9900010@bradley> brian@bradley.UUCP writes:
> |>  The following works in vi: :%s/\/\*.*\*\///g
> |
> |/* And this */ important_variable = 42 /* doesn't work either! */
> 
> And how about:
> 
>     puts(" A comment /* in here */");
> 
> And you can give more examples showing it isn't that trivial; a challenge
> for the sed adept, perhaps ...
> 
>    Leo.

[And a lot of previous articles on the same topic]

The problem is: sed and vi do not understand C syntax.

Solution: write a lex program to strip comments. The program must
understand C syntax enough to know what is a comment and what is not.

Encouragement: it should not be too difficult.

--------------------------------------------------------------------
Joel Loo Peing Ling composed on Fri Mar 17 10:44:52 CST 1989
---------------------------- Now: ----------------------------------
MCC                            |   Email:  loo@sw.mcc.com
3500 West Balcones Centre Dr.  |   Voice:  (512)338-3680 (O)
Austin, TX 78759               |           (512)343-1780 (H)

rupley@arizona.edu (John Rupley) (03/18/89)

In article <2131@mister-curious.sw.mcc.com>, loo@mister-curious.sw.mcc.com
(Joel Loo) writes:
> In article <978@philmds.UUCP>, leo@philmds.UUCP (Leo de Wit) writes:
> > And how about:
> >     puts(" A comment /* in here */");
> > And you can give more examples showing it isn't that trivial; a challenge
> > for the sed adept, perhaps ...
> >    Leo.
> [And a lot of previous articles on the same topic]
> 
> The problem is: sed and vi do not understand C syntax.
> 
> Solution: write a lex program to strip comments. The program must
> understand C syntax enough to know what is a comment and what is not.
> 
> Encouragement: it should not be too difficult.
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
It isn't.  Six lines of Lex source (not counting initialization) are 
enough. A Lex source for ``uncomment'' has been posted in comp.sources.unix,
as part of:
	Subject: Volume 16 (Ends January 17, 1989)
	identlist	List identifiers and declarations for C sources

Attached is a minimum test for an uncommenting algorithm, including
tests for quotes inside and outside comments.

John Rupley
 uucp: ..{uunet | ucbvax | cmcl2 | hao!ncar!noao}!arizona!rupley!local
 internet: rupley!local@megaron.arizona.edu
 (O) Dept. Biochemistry, Univ. Arizona, Tucson AZ 85721 - (602) 621-3929
----------------------------------------------------------------------------
/*
 * tests for ``uncomment''
 * assume C-code conventions:
 * 	strings start and end on one line
 * 	comments can be multi-line
 * no tests for varieties of:	'"'   \'"\'   etc
 * no tests for strings with newline escaped
 */
string4		"hi /*\"hi there*/there\""
comment1		/*one"*/"*/
comment2		/*\"hi there"*/"*/"
comment3		/*\"hi there*/
comment4		/* hello/*hello/*hello/*hello*/
comment5		/*******/
comment6		/*/*/ a /**/ b /***/ c /****/ d /*////*/
comment7		/*/*// a /**// b /***// c /****// d /*////*//

1. /*****//"hello world */"				ok   /"hello world */"
2. /* hello /* /* world */				ok	
3. /* */ hello /* */					ok	hello
4. /**// /* this should produce "/ \n" for output */	ok	/ 
5. /* */ hello */					ok	hello */
6. /*/*/ hello						ok	hello
7. /*////*/						ok	
8. /*//*/						ok	
9. abc = "/* fake comment"; /* got who ? */		ok	abc = "STRING";
10. /*   "start quote
	"then next line end quote, after more characters than on line 1"
	more more more */  "				ok	"
----------------------------------------------------------------------------

jeenglis@nunki.usc.edu (Joe English) (03/19/89)

leo@philmds.UUCP (Leo de Wit) writes:
>In article <4896@cbnews.ATT.COM> smk@cbnews.ATT.COM (Stephen M. Kennedy) writes:
>|In article <9900010@bradley> brian@bradley.UUCP writes:
>|>  The following works in vi: :%s/\/\*.*\*\///g
>|
>|/* And this */ important_variable = 42 /* doesn't work either! */
>
>And how about:
>
>    puts(" A comment /* in here */");
>
>And you can give more examples showing it isn't that trivial; a challenge
>for the sed adept, perhaps ...

Does it *have* to be done in sed/awk/other text processor?
This problem is fairly difficult to solve using regexp/editor
commands, but it's a piece of cake to do in C:

#include <stdio.h>

void eatcomment(void);

main() 
{
    int ch;
    int instring = 0;

    ch = getchar();
    while (ch != EOF) {
      switch (ch) {
	case '"' :
	  instring = !instring;
	  break;
	case '/' :
	  if (!instring) 
	    if ((ch = getchar()) == '*') { eatcomment(); ch=getchar(); }
	    else putchar('/');
	  break;
	case '\\' :         /* in case this is a \" in a string, */
          putchar('\\');    /*  pass it through now and don't let */
          ch = getchar();  /*  the switch() eat it */
      }
      putchar(ch);
      ch = getchar();
    }
    exit(0);
}

void eatcomment(void)
{
    int ch;

    for (;;) {
      ch = getchar();
      while (ch == '*')
        if ((ch = getchar()) == '/') return;
      if (ch == EOF) exit(1); /* oops */
    }
}
------------

This hasn't been tested thoroughly; it's mostly 
from memory.  

Joe English

jeenglis@nunki.usc.edu

jeenglis@nunki.usc.edu (Joe English) (03/20/89)

I made a mistake in the comment-eating program I
posted yesterday -- it won't handle

/* something like *//* this. */

Change the line in the '/' case from:

    if ((ch = getchar()) == '*') { eatcomment(); ch=getchar(); }

to:

    if ((ch = getchar()) == '*') { eatcomment(); ch=getchar(); continue; }

and it will work.  If anyone's interested.


--Joe English

  jeenglis@nunki.usc.edu

rupley@arizona.edu (John Rupley) (03/20/89)

In article <3145@nunki.usc.edu>, jeenglis@nunki.usc.edu (Joe English) writes:
> I made a mistake in the comment-eating program I
> posted yesterday -- it won't handle
> 	/* something like *//* this. */
> Change the line in the '/' case from:
>     if ((ch = getchar()) == '*') { eatcomment(); ch=getchar(); }
> to:
>     if ((ch = getchar()) == '*') { eatcomment(); ch=getchar(); continue; }
> and it will work.  If anyone's interested.

It still doesn't work.  It won't uncomment itself.  Or the following line:

	'"' /* hi there */ '"'

Or distinguish a correct string, with escaped newlines,

	"hi\
	/*\*/ /**/\
	there"

from an incorrect string without the escapes.

The point is not _whether_ one can write an ``uncomment'' in C, but how,
and in what language, one can do it most simply.  It is certainly right
to use C if uncommenting is part of a larger design, as in cpp or ctags.
But if the whole aim is to uncomment, then a pattern-handling language,
such as Lex, is more appropriate.  A few lines of Lex source do the job,
and assuming familiarity with regular expression syntax, it is easy to
write and understand, and hard to get the logic wrong.  It should be
doable with sed or awk, but probably not as easily, because they see a
file as a stream of lines rather than characters.  In C, the proper
setting up of the switch and flags is not trivial, as the previous
posting witnesses.

A Lex source for uncommenting is attached (which I hope does not belie
the remark above about hard to get the logic wrong :-).

John Rupley
 uucp: ..{uunet | ucbvax | cmcl2 | hao!ncar!noao}!arizona!rupley!local
 internet: rupley!local@megaron.arizona.edu
--------------------------------------------------------------------
%{
/* UNCOMMENT- */
/*	regexp for comment recognition based on usenet posting by: */
/*	Chris Thewalt; thewalt@ritz.cive.cmu.edu */
%}
STRING		\"(\\\n|\\\"|[^"\n])*\"
COMMENTBODY	([^*\n]|"*"+[^*/\n])*
COMMENTEND	([^*\n]|"*"+[^*/\n])*"*"*"*/"
QUOTECHAR	\'[^\\]\'|\'\\.\'|\'\\[x0-9][0-9]*\'
ESCAPEDCHAR	\\.
%START	COMMENT
%%
<COMMENT>{COMMENTBODY}		;
<COMMENT>{COMMENTEND}		BEGIN 0;
<COMMENT>.|\n			;
"/*"				BEGIN COMMENT;
{STRING}			ECHO;
{QUOTECHAR}			ECHO;
{ESCAPEDCHAR}			ECHO;
.|\n				ECHO;
---------------------------------------------------------------------------

leo@philmds.UUCP (Leo de Wit) (03/20/89)

In article <3114@nunki.usc.edu> jeenglis@nunki.usc.edu (Joe English) writes:
|
|leo@philmds.UUCP (Leo de Wit) writes:
|>
|>    puts(" A comment /* in here */");
|>
|>And you can give more examples showing it isn't that trivial; a challenge
|>for the sed adept, perhaps ...
|
|Does it *have* to be done in sed/awk/other text processor?
|This problem is fairly difficult to solve using regexp/editor
|commands, but it's a piece of cake to do in C:

Piece of cake? Your program can't even strip its own comments (try it)!
Reason:

|	case '"' :
|	  instring = !instring;
|	  break;

This is both a defect in your program and the reason that subsequent
comments aren't detected when its own source is used as input: after the
sequence '"', instring is left at 1. Besides, it doesn't handle
multiple-character char constants (e.g. '/*', though one could perhaps
argue whether it should).

|This hasn't been tested thoroughly; it's mostly 
|from memory.  

If your memory was ok, the program wasn't tested thoroughly 8-).
Though the problem isn't difficult, it isn't so trivial as you thought
it was.

	 Leo.
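
One way around the instring pitfall described above is to give string
literals and character constants their own states, so that the double quote
inside '"' never toggles anything. The following is only a sketch:
strip_comments, the state names, and the in-memory interface are invented
for the example, comments are deleted outright rather than replaced by a
space, and trigraphs and backslash-newline splices are ignored.

```c
#include <stddef.h>

enum state { CODE, STRING, CHARCONST, COMMENT };

/*
 * Copy `src` to `dst` without its C comments.  `dst` must have room for
 * strlen(src) + 1 bytes.  Returns 0 on success, -1 if the input ends
 * inside a comment.  (Hypothetical helper, for illustration only.)
 */
int strip_comments(const char *src, char *dst)
{
    enum state st = CODE;
    size_t i = 0;

    while (src[i] != '\0') {
        char c = src[i];

        switch (st) {
        case CODE:
            if (c == '/' && src[i + 1] == '*') {
                st = COMMENT;           /* drop the opening delimiter */
                i += 2;
                continue;
            }
            if (c == '"')
                st = STRING;
            else if (c == '\'')
                st = CHARCONST;
            *dst++ = c;
            i++;
            break;
        case STRING:
        case CHARCONST:
            if (c == '\\' && src[i + 1] != '\0') {
                *dst++ = c;             /* copy the escape as a pair so */
                *dst++ = src[i + 1];    /* an escaped quote cannot close */
                i += 2;
                continue;
            }
            if ((st == STRING && c == '"') ||
                (st == CHARCONST && c == '\''))
                st = CODE;
            *dst++ = c;
            i++;
            break;
        case COMMENT:
            if (c == '*' && src[i + 1] == '/') {
                st = CODE;              /* drop the closing delimiter */
                i += 2;
                continue;
            }
            i++;                        /* comment text is discarded */
            break;
        }
    }
    *dst = '\0';
    return st == COMMENT ? -1 : 0;
}
```

Wrapping this in a filter only needs a loop that reads the whole input into
a buffer first; a streaming variant would keep the same four states and one
character of lookahead.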

dave@motto.UUCP (dave brown) (03/21/89)

I missed the original posting, so I didn't catch the exact question,
but when I have needed to remove comments, I have simply passed the
source through the preprocessor stage of the compiler only.  Granted,
this does a lot of other things, which may or may not be undesirable
for your application.  Some compilers, however, have options on
the preprocessor which can limit the scope of the damage.

If the original poster still hasn't solved his problem, he can contact
me.  I think we also have a quick and dirty C program which someone wrote
which does the job.

 -----------------------------------------------------------------------------
|  David C. Brown		|  uunet!mnetor!motto!dave		      |
|  Motorola Canada, Ltd.	|  416-499-1441 ext 3708		      |
|  Communications Division	|  Disclaimer: Motorola is a very big company |
 -----------------------------------------------------------------------------

jeenglis@nunki.usc.edu (Joe English) (03/21/89)

In article <9797@megaron.arizona.edu> you write:
>It still doesn't work.  It won't uncomment itself.  Or the following line:
>
>	'"' /* hi there */ '"'
>

Thanks -- I had a feeling I was forgetting something.
I wrote an uncomment program a couple years ago
(and I swear, it *did* work and it wasn't too hard
to write :-)  and I was trying to recall it from 
memory.  Characters in single-quotes were the other
case I forgot about -- and if I had tested the program
on its own source I would have caught that oversight.
(I feel really stupid now... I think I'm going to 
stop posting to this newsgroup, as I have failed to
say anything correct or intelligent for about a 
month now.)

The Lex solution posted is much more elegant and 
simple; but since lex isn't universally available
a C version is also useful...  (I'm not going to
try a third time, though.)

--Joe English

  jeenglis@nunki.usc.edu

pem@zyx.SE (Per-Erik Martin) (03/22/89)

In article <983@philmds.UUCP> leo@philmds.UUCP (Leo de Wit) writes:
>In article <3114@nunki.usc.edu> jeenglis@nunki.usc.edu (Joe English) writes:
>|
>|Does it *have* to be done in sed/awk/other text processor?
>|This problem is fairly difficult to solve using regexp/editor
>|commands, but it's a piece of cake to do in C:
>
>Piece of cake? Your program can't even strip its own comments (try it)!

Here's another example in C. It *is* a piece of cake (15 minutes work).
The problem can be described with a simple automaton which is easily coded
in C (with goto's, >yech<). I've tested it on most of the pathological
examples given in this group and it seems to work.

----------------------------------------------------------------------------
/* cstrip.c
   pem@zyx.SE, 1989 */

#include <stdio.h>

main()
{
  char c, c1;

  goto into_code;

 in_code:
  putchar(c);
 into_code:
  switch (c = (char)getchar()) {
  case EOF:
    exit(0);
  case '\'':
    goto in_char;
  case '"':
    goto in_string;
  case '/':
    c1 = c;
    if ((c = (char)getchar()) == '*')
      goto in_comment;
    putchar(c1);
  default:
    goto in_code;
  }

 in_char:
  putchar(c);
  switch (c = (char)getchar()) {
  case EOF:
    exit(1);
  case '\\':
    putchar(c);
    c = (char)getchar();
  default:
    putchar(c);
    while ((c = (char)getchar()) != '\'')
      putchar(c);
    goto in_code;
  }

 in_string:
  putchar(c);
  switch (c = (char)getchar()) {
  case EOF:
    exit(1);
  case '"':
    goto in_code;
  case '\\':
    putchar(c);
    c = (char)getchar();
  default:
    goto in_string;
  }

 in_comment:
  switch (c = (char)getchar()) {
  case EOF:
    exit(1);
  case '*':
    if ((c = (char)getchar()) == '/')
      goto into_code;
  default:
    goto in_comment;
  }
}
----------------------------------------------------------------------------
-- 
-------------------------------------------------------------------------------
- Per-Erik Martin, ZYX Sweden AB, Bangardsgatan 13, S-753 20  Uppsala, Sweden -
- Email: pem@zyx.SE                                                           -
-------------------------------------------------------------------------------

Tim_CDC_Roberts@cup.portal.com (03/22/89)

You know, this discussion has brought up something that has bothered me
(although not a great deal).

When scanning the result of preprocessing a nontrivial C program with 
many include files, one finds dozens (in some cases hundreds) of blank
lines.  Obviously, they are the result of eliminating preprocessor
directives and multiline comments.  What I have always wondered is why,
given the #line directive which can re-sync the preprocessor and the
compiler, does the preprocessor insist on keeping all those blank lines?
Why not eliminate them and issue a #line instead?

Just curious.

Tim_CDC_Roberts@cup.portal.com                | Control Data...
...!sun!portal!cup.portal.com!tim_cdc_roberts |   ...or it will control you.

rupley@arizona.edu (John Rupley) (03/22/89)

In article <852@lynx.zyx.SE>, pem@spunk.zyx.SE (Per-Erik Martin) writes:
> Here's another example in C. It *is* a piece of cake (15 minutes work).
> The problem can be described with a simple automaton which is easily coded
> in C (with goto's, >yech<). I've tested it on most of the pathological
> examples given in this group and it seems to work.

This one fails, too.  Try:

	/***/ hi there /**/

Goes to show, for a quick and clean coding of a pattern-matching
automaton, think Lex.  The Lex source that was posted is so simple it
would be hard to get the logic wrong.  Two out of two C postings suggest
that it may be easier to err in coding the same automaton in C.

Not to imply that C has no advantages -- following comparison is for
size of source and for time of uncommenting main.c of an emacs distribution:

	timex/real   wc -l
	13.95	     10    eatLex.l	Lex  
	2.53	     37    eatC.c	C code that works
	1:27.13	     78    eat.sed	Maarten L's recently posted sed script
					(more lines than the C code :-) :-)

John Rupley
rupley!local@megaron.arizona.edu

leo@philmds.UUCP (Leo de Wit) (03/22/89)

In article <852@lynx.zyx.SE> pem@spunk.zyx.SE (Per-Erik Martin) writes:
|Here's another example in C. It *is* a piece of cake (15 minutes work).
|The problem can be described with a simple automaton which is easily coded
|in C (with goto's, >yech<). I've tested it on most of the pathological
|examples given in this group and it seems to work.
    []

Appearances are deceptive, it won't handle trigraphs. For instance, try:
??' (trigraph for ^) and your code thinks it is in_char.

What's worse, on systems where char isn't signed and EOF == -1, it will
fail to see EOF (suggestion: don't use a char to compare against EOF).

Another cake that is hard to digest (let alone the goto's, it was baked
in only 15 minutes) 8-).

	 Leo.

P.S. What's the benefit of having a separate program strip off comments anyway?
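
The EOF point can be made concrete with a small sketch. The three function
names below are invented for the example; each reports whether a comparison
against EOF succeeds once the value returned by getchar() has been stored in
the given type. The signed char case relies on the usual implementation
behaviour (two's complement, EOF == -1), which the C standard does not
promise.

```c
#include <stdio.h>

/* Value stored in an unsigned char: it promotes to 0..255, and EOF is
 * negative, so the comparison can never succeed. */
int eof_seen_unsigned(int got)
{
    unsigned char c = (unsigned char)got;
    int promoted = c;

    return promoted == EOF;
}

/* Value stored in a signed char: EOF is usually detected, but on common
 * systems the ordinary byte 0xFF is mistaken for it as well. */
int eof_seen_signed(int got)
{
    signed char c = (signed char)got;
    int promoted = c;

    return promoted == EOF;
}

/* Value kept in an int, as getchar() returns it: EOF and the byte 0xFF
 * stay distinct. */
int eof_seen_int(int got)
{
    return got == EOF;
}
```

Hence the usual idiom of declaring the variable as int and writing
`while ((c = getchar()) != EOF)`.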

chris@mimsy.UUCP (Chris Torek) (03/22/89)

In article <16078@cup.portal.com> Tim_CDC_Roberts@cup.portal.com writes:
>When scanning the result of preprocessing a nontrivial C program with 
>many include files, one finds dozens (in some cases hundreds) of blank
>lines. ... Why not eliminate them and issue a #line instead?

Why bother?  Typically there are at most a few tens in a row.  It is
probably faster to count 20 blank lines than to process one
`#line 1234' directive.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

mnc@m10ux.UUCP (Michael Condict) (03/23/89)

In <9900010@bradley>, brian@bradley.UUCP writes:

>> /* Written  9:58 am  Mar  9, 1989 by jrv@siemens.UUCP */
>>                          Does anyone have a sed or awk script which we
>> can use to preprocess the C source and get rid of all the comments before
>> sending it to the compiler?
>
>  The following works in vi: :%s/\/\*.*\*\///g
>
>  I don't know if it will work in sed, but it should...

Lest anyone actually be tempted to use such a naive method, you should be
aware that it DOESN'T WORK, except for the simplest case of one comment per
line and no multi-line comments.  A correct sed command, which I may have
posted before (forgive me) is shown below.  To use it on SystemV-derived
seds, you have to first delete all the comments from the sed script
itself (ironically enough!).

To see all of the reasons why the simple method doesn't work, try this:
Take the test C file appended after the sed script below and run it through
the sed script into a file.  Now run diff on the original C file and the one
with comments removed.  What you are looking at is all of the various ways
that comments and things looking almost like comments can be intertwined in C
source files.

Michael Condict		{att|allegra}!m10ux!mnc
AT&T Bell Labs		(201)582-5911    MH 3B-416
Murray Hill, NJ

-------------------- Sed script to delete C comments -------------------------
# Delete comments from C source files:
: delcom
/\/\*/{
	# Change first comment delim to @ (after eliminating existing @'s):
	s/@/<Used#to%be+an-At>/g
	s:/\*:@:

	# Read until we have the end comment:
	: morecm
	/\*\//!{
		# Just to cut down on max buffer length:
		s/@.*/@/
		N
		b morecm
	}

	# Get rid of any $'s:
	s/\$/<Used#to%be+a-Dollar>/g

	# First occurrence of */ is guaranteed to be the corresponding end
	# comment, because it is otherwise not legal C, so:
	s:\*/:$:
	s/@[^$]*\$/ /

	# Restore $'s and @'s:
	s/<Used#to%be+a-Dollar>/$/g
	s/<Used#to%be+an-At>/@/g

	b delcom
}
------------------------ The test C program ----------------------------------
#define APAP\
		37
# /*hi*/ define GOO(x) y

char *abc = "hi \"Joe\"";
/* this is
 * a comment
 */
struct A_S {
	int wopper /**** a *** b *** c *//*again*/ ;
}; int
f
(x, /* a * in a comment */
	yoohoo)  /**/    /* a /* b */ char *yoohoo;
{
	int a, b, c = '\'';
	char * quote="h#w \
#bo{ut @hat?";
	a = b /*oops*/*c;	/****************/
} enum goober {a,b};
	struct A_S *george(x) struct {int x;
				      float y;} x; { return 0; }

typedef int bar;
struct A_S * * george2(moo, x, glop, foo) struct {
					     int q[13]; float y;} x[];
	bar moo ,	*foo[];
	struct A_S *glop;
/*a*/{
		return 0;
}

/* Try various combinations of register arg decls:*/
flop(a_1, b) register a_1; { return 0; }
struct BB {int f,g;} floop(a_1, b_1) register char *a_1; float register*b_1;
{ struct BB j; return j;}

/* Test arg names that are substrings of one another: */
char sub1(abc, abcdef) int* abcdef; float abc; { return 0; }
-----------------------------------------------------------------------------
-- 
Michael Condict		{att|allegra}!m10ux!mnc
AT&T Bell Labs		(201)582-5911    MH 3B-416
Murray Hill, NJ
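
The sed script turns each delimiter into a single character because a sed
regular expression cannot directly say "anything that does not contain the
closing delimiter". In C the same pairing (each opening delimiter with the
first closing delimiter after it) can be sketched with strstr. The name
delete_comments is made up here, and, like the sed script, the sketch knows
nothing about string literals or character constants.

```c
#include <string.h>

/*
 * Replace every comment in `s` with a single space, in place, pairing
 * each opening delimiter with the first closing delimiter that follows
 * it.  Returns the number of comments removed; an unterminated comment
 * is left alone.  (Hypothetical helper, for illustration only.)
 */
int delete_comments(char *s)
{
    int removed = 0;
    char *open, *close;

    while ((open = strstr(s, "/*")) != NULL &&
           /* Search from open + 2 so the star of the opener cannot
            * double as the star of the closer. */
           (close = strstr(open + 2, "*/")) != NULL) {
        *open = ' ';
        /* Slide the rest of the text, terminator included, left. */
        memmove(open + 1, close + 2, strlen(close + 2) + 1);
        s = open + 1;
        removed++;
    }
    return removed;
}
```

Unlike the greedy `.*` version, this keeps the code that sits between two
comments on the same line.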

bph@buengc.BU.EDU (Blair P. Houghton) (03/23/89)

In article <16492@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <16078@cup.portal.com> Tim_CDC_Roberts@cup.portal.com writes:
>>When scanning the result of preprocessing a nontrivial C program with 
>>many include files, one finds dozens (in some cases hundreds) of blank
>>lines. ... Why not eliminate them and issue a #line instead?
>
>Why bother?  Typically there are at most a few tens in a row.  It is
>probably faster to count 20 blank lines than to process one
>`#line 1234' directive.

Howsabout 'cat -s file.c | whatever'  or just 'more -s file.c' ?

				--Blair
				  "What is the sound of one
				   Usener posting...many times?"

ftw@masscomp.UUCP (Farrell Woods) (03/23/89)

In article <9833@megaron.arizona.edu> rupley@arizona.edu (John Rupley) writes:

>This one fails, too.  Try:
>
>	/***/ hi there /**/

Shouldn't it be a requirement that the program to be stripped at least compile?
This example will generate a syntax error.

-- 
Farrell T. Woods				Voice:  (508) 392-2471
Concurrent Computer Corporation			Domain: ftw@masscomp.com
1 Technology Way				uucp:   {backbones}!masscomp!ftw
Westford, MA 01886				OS/2:   Half an operating system

pem@zyx.SE (Per-Erik Martin) (03/24/89)

In article <9833@megaron.arizona.edu> rupley@arizona.edu (John Rupley) writes:
>
>This one fails, too.  Try:
>
>	/***/ hi there /**/
>
Oops! Well, if you change the '*'-case in 'in_comment:' to this:

   do {
     if ((c = (char)getchar()) == '/')
       goto into_code;
   } while (c == '*');

it should work better. (Funny no one found the other bug yet... What do
you expect after 15 minutes? ;-)

>Goes to show, for a quick and clean coding of a pattern-matching
>automaton, think Lex.  The Lex source that was posted is so simple it
>would be hard to get the logic wrong.  Two out of two C postings suggest
>that it may be easier to err in coding the same automaton in C.
>
>Not to imply that C has no advantages -- following comparison is for
>size of source and for time of uncommenting main.c of an emacs distribution:
>
>[...timings...]

Another advantage with C is that it's portable outside the Unix universe...

-- 
-------------------------------------------------------------------------------
- Per-Erik Martin, ZYX Sweden AB, Bangardsgatan 13, S-753 20  Uppsala, Sweden -
- Email: pem@zyx.SE                                                           -
-------------------------------------------------------------------------------
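
Why the original in_comment state missed the test case: after one star it
read ahead a single character, and if that was another star it was thrown
away, so the slash that followed was no longer seen as coming after a star.
A sketch of the same scan with an explicit "previous character was a star"
flag follows; skip_comment_body is a name invented for the example, and it
works on a string rather than on stdin.

```c
#include <stddef.h>

/*
 * `p` points just past an opening comment delimiter.  Returns a pointer
 * just past the matching closing delimiter, or NULL if the string ends
 * first.  (Hypothetical helper, for illustration only.)
 */
const char *skip_comment_body(const char *p)
{
    int prev_star = 0;

    for (; *p != '\0'; p++) {
        if (prev_star && *p == '/')
            return p + 1;
        /* Any run of stars leaves the flag set, so the slash after the
         * last star still closes the comment. */
        prev_star = (*p == '*');
    }
    return NULL;
}
```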

pem@zyx.SE (Per-Erik Martin) (03/24/89)

In article <987@philmds.UUCP> leo@philmds.UUCP (Leo de Wit) writes:
>
>Appearances are deceptive, it won't handle trigraphs. For instance, try:
>??' (trigraph for ^) and your code thinks it is in_char.
>
>What's worse, on systems where char isn't signed and EOF == -1, it will
>fail to see EOF (suggestion: don't use a char to compare against EOF).
>
I simply didn't include trigraphs in the automaton and I'm well aware of
the problem with EOF. The point I tried to make was that it's possible
to solve a problem like that in, for example, C in a reasonable time,
instead of using sed-scripts or lex (which is of no use outside the
unix-world anyway).
If you really want a comment stripper you can easily add trigraphs, handle
EOF, etc.

>
>P.S. What's the benefit of having a separate program strip off comments anyway?

Good question. None, as far as I know...

-- 
-------------------------------------------------------------------------------
- Per-Erik Martin, ZYX Sweden AB, Bangardsgatan 13, S-753 20  Uppsala, Sweden -
- Email: pem@zyx.SE                                                           -
-------------------------------------------------------------------------------

Tim_CDC_Roberts@cup.portal.com (03/24/89)

I hereby revoke my suggestion that the preprocessor should suppress blank
lines and use #line instead.  In a typically homocentric fashion, I 
neglected to realize that even though it is more difficult for *ME* to
read a preprocessor output with many blank lines, it is trivially easy
for the compiler lexical analyzer to ignore them, since a "blank line"
is only one byte long.  Thanks to those who pointed this out.

Tim_CDC_Roberts@cup.portal.com                | Control Data...
...!sun!portal!cup.portal.com!tim_cdc_roberts |   ...or it will control you.

rupley@arizona.edu (John Rupley) (03/24/89)

In article <1179@masscomp.UUCP>, ftw@masscomp.UUCP (Farrell Woods) writes:
>In article <9833@megaron.arizona.edu> rupley@arizona.edu (John Rupley) writes:
>>This one fails, too.  Try:
>>	/***/ hi there /**/
>
>Shouldn't it be a requirement that the program to be stripped at least compile?
>This example will generate a syntax error.

Aw, c'mon... be imaginative... replace "hi there" by a proper statement or
whatever:

	/***/ main() {printf("hi there\n");} /**/

Cpp strips the comments (properly) and passes the program text.  The buggy
C code, which was being discussed in the previous posting, strips everything.
Both of the earlier Lex postings do it right, which would seem to be the
take-home lesson.

John Rupley
rupley!local@megaron.arizona.edu

daveb@gonzo.UUCP (Dave Brower) (03/24/89)

In article <16492@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <16078@cup.portal.com> Tim_CDC_Roberts@cup.portal.com writes:
>>When scanning the result of preprocessing a nontrivial C program with 
>>many include files, one finds dozens (in some cases hundreds) of blank
>>lines. ... Why not eliminate them and issue a #line instead?
>
>Why bother?  Typically there are at most a few tens in a row.  It is
>probably faster to count 20 blank lines than to process one
>`#line 1234' directive.

Yup, true enough for compilation.  It is sort of annoying though when you
need to look at the intermediate file to figure something out.

So, I offer this week's challenge:  Smallest program that will take
"blank line" style cpp output on stdin and send to stdout a scrunched
version with appropriate #line directives.  [f]lex, Yacc, [na]awk, sed,
perl, c, c++ are all acceptable.  This will be an amusing exercise in
typical text massaging that can be enlightening for many people.

Is this branching out of comp.lang.c?  Where should it go?

-dB
-- 
"I came here for an argument." "Oh.  This is getting hit on the head"
{sun,mtxinu,amdahl,hoptoad}!rtech!gonzo!daveb	daveb@gonzo.uucp

mnc@m10ux.UUCP (Michael Condict) (03/24/89)

Oops, the previous lex script I posted for deleting comments from
C source code is incorrect -- it doesn't recognize:  /***...**/
Here is a better one (simpler, too):

	%%
	\"([^\\"]*\\(.|\n))*[^\\"]*\"	ECHO;
	"/*"([^*]|"*"+[^/*])*"*"*"*/"	;
	.				ECHO;

Okay, I promise to stop now.  (Unless there is a bug in this one.)
-- 
Michael Condict		{att|allegra}!m10ux!mnc
AT&T Bell Labs		(201)582-5911    MH 3B-416
Murray Hill, NJ

bill@twwells.uucp (T. William Wells) (03/26/89)

In article <9797@megaron.arizona.edu> rupley@arizona.edu (John Rupley) writes:
: A Lex source for uncommenting is attached (which I hope does not belie
: the remark above about hard to get the logic wrong :-).

Try it on a very long comment. You might discover an overflowed lex
buffer. On the other hand, this shouldn't be too hard to fix. Just do
for the comment what you did for the noncommented text.

---
Bill                            { uunet | novavax } !twwells!bill
(BTW, I may be looking for a new job sometime in the next few
months.  If you know of a good one where I can be based in South
Florida do send me e-mail.)

rupley@arizona.edu (John Rupley) (03/26/89)

> In article <620@gonzo.UUCP>, daveb@gonzo.UUCP (Dave Brower) writes:
> So, I offer this week's challenge:  Smallest program that will take
> "blank line" style cpp output on stdin and send to stdout a scrunched
> version with appropriate #line directives.  [f]lex, Yacc, [na]awk, sed,
> perl, c, c++ are all acceptable.  This will be an amusing exercise in
> typical text massaging that can be enlightening for many people.

"Scrunching" is probably a matter of taste, with regard to the format
of the output.  So I am not sure what you, yourself, want.  But below
is a guess.  Lex, of course.  May not be portable, but it should work
with minor mods on other Unices.  Should be easy to modify for different
output format.

John Rupley
rupley!local@megaron.arizona.edu


%{ /*---------------------------start of text---------------------------*/
/*-
 * SCRUNCH.l
 *
 * Scrunch cpp output.
 * 	In-Reply-To: daveb@gonzo.UUCP (Dave Brower)
 * 	Message-ID: <620@gonzo.UUCP>			#comp.lang.c
 * 
 * Compress runs of "#" lines and blank lines, or runs of two or more
 * blank lines:
 * 	(\n*# lineno "file"\n+)*  or  \n\n\n+
 * into a single line:
 *	#line lineno "file"\n
 * which is output before the next line of program text 
 * (corresponding to line "lineno" of the source "file").
 * The values of "lineno" and "file" are adjusted for changes in
 * source resulting from #include statements.
 * Lines with whitespace are not considered blank and are passed.
 *
 * Compilation:
 *	lex scrunch.l
 *	cc -O lex.yy.c -ll -o scrunch
 *
 * Minimally tested with UNIX sys5r2 cpp only, as follows:
 * (a)	/lib/cpp -Dprocessor=1 lex.yy.c >scrunch.cpp	#specify your processor
 *	scrunch <scrunch.cpp >scrunch.cpp.c
 *	cc -O scrunch.cpp.c -ll
 *	cmp -l a.out scrunch		#should give date/name diffs only
 * (b)	compare line numbers in scrunch.cpp.c with lex.yy.c and scrunch.cpp
 *		(no differences stood out)
 *
 * Possible bugs:
 *	escaped newlines in macros.
 *	????
 *
 * John Rupley
 * rupley!local@megaron.arizona.edu
 */
%}
	char		file[BUFSIZ];

POUND	#[ ]+[0-9]+[ ]+\".*$
TEXT	[^#\n].*$
%START	POUND TEXT
%%
<INITIAL>.	{unput(yytext[0]); BEGIN TEXT;}
<POUND>{POUND}	sscanf(yytext, "# %d %s", &yylineno, &file[0]);
<POUND>{TEXT}	{printf("#line %d %s\n", yylineno-1, file); ECHO; BEGIN TEXT;}
<POUND>\n	;
<TEXT>{POUND}	{sscanf(yytext, "# %d %s", &yylineno, &file[0]); BEGIN POUND;}
<TEXT>\n{3,}	{printf("\n"); BEGIN POUND;}
<TEXT>{TEXT}|\n	ECHO;
.		printf("\nERROR: file %s, line %d, char 0x%x=%c\n",
			file, yylineno, (unsigned int) yytext[0], yytext[0]);
%%
/*----------------------------end of text-------------------------------*/
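
The compression rule described in the header comment can also be
sketched in plain C.  This is an illustration under assumptions, not a
translation of the Lex source: scrunch() and append() are hypothetical
names, the input is an in-memory string rather than stdin, file names
are expected in double quotes as in `# 12 "file.c"`, and the line-number
bookkeeping is independent of the yylineno handling above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append n bytes of s to out; -1 if they do not fit. */
static int append(char *out, size_t outsz, size_t *used,
                  const char *s, size_t n)
{
    if (*used + n + 1 > outsz)
        return -1;
    memcpy(out + *used, s, n);
    *used += n;
    out[*used] = '\0';
    return 0;
}

/* Hypothetical sketch: scrunch cpp output held in `in` into `out`.
 * Runs of "# n file" lines and blank lines, or two or more blank lines,
 * become one #line directive before the next text line.
 * Returns 0 on success, -1 if `out` is too small. */
int scrunch(const char *in, char *out, size_t outsz)
{
    char file[256] = "";        /* current file name, without quotes */
    char name[256], head[300], directive[300];
    int lineno = 1;             /* source line number of the next input line */
    int pending = 0;            /* "# n file" seen since the last text line */
    int blanks = 0;             /* blank lines seen since the last text line */
    int n, m;
    size_t used = 0;

    if (outsz == 0)
        return -1;
    out[0] = '\0';
    while (*in != '\0') {
        size_t len = strcspn(in, "\n");     /* length of this line */
        size_t h = len < sizeof head - 1 ? len : sizeof head - 1;

        memcpy(head, in, h);                /* bounded copy for parsing only */
        head[h] = '\0';
        if (len == 0) {                     /* truly empty line */
            blanks++;
            lineno++;
        } else if (sscanf(head, "# %d \"%255[^\"]\"", &n, name) == 2) {
            lineno = n;                     /* next line is line n of name */
            strcpy(file, name);
            pending = 1;
        } else {                            /* program text, passed through */
            if (pending || blanks >= 2) {
                if (file[0] != '\0')
                    m = snprintf(directive, sizeof directive,
                                 "#line %d \"%s\"\n", lineno, file);
                else
                    m = snprintf(directive, sizeof directive,
                                 "#line %d\n", lineno);
                if (append(out, outsz, &used, directive, (size_t)m) < 0)
                    return -1;
            } else if (blanks == 1) {       /* a single blank line is kept */
                if (append(out, outsz, &used, "\n", 1) < 0)
                    return -1;
            }
            if (append(out, outsz, &used, in, len) < 0 ||
                append(out, outsz, &used, "\n", 1) < 0)
                return -1;
            lineno++;
            pending = blanks = 0;
        }
        in += len;
        if (*in == '\n')
            in++;
    }
    return 0;
}
```

As in the Lex version, a line that contains only whitespace is not
treated as blank and is passed through unchanged.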

rupley@arizona.edu (John Rupley) (03/26/89)

In article <893@m10ux.UUCP>, mnc@m10ux.UUCP (Michael Condict) writes:
> Oops, the previous lex script I posted for deleting comments from
> C source code is incorrect -- it doesn't recognize:  /***...**/
> Here is a better one (simpler, too):
> 
> 	%%
> 	\"([^\\"]*\\(.|\n))*[^\\"]*\"	ECHO;
> 	"/*"([^*]|"*"+[^/*])*"*"*"*/"	;
> 	.				ECHO;

You indeed fixed the /***/ error, but two errors remain.

First, no handling of single-quoted double quotes:

        main() {printf("%c\n", '"');/*gotcha*/printf("%c\n", '"');}

Second, your program crashes when uncommenting a real source file, with
a sizeable change history or whatever inside a comment.  You need at
least one state change, so a comment can be matched line-by-line, and
so not overflow a Lex buffer.  Both previous Lex postings did it
right.  A third state, to handle quoted strings line-by-line, is perhaps
optional, and the previous postings differ here. Apparently you missed
the previous Lex postings, which I will be happy to email you on request.
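
For readers without Lex at hand, the two cases above can be shown in a
minimal plain-C sketch.  It is not any of the posted programs; the name
strip_comments() and the state names are made up for illustration.  It
carries only a state variable from one character to the next, so neither
a long comment nor a long string needs a match buffer, and it treats
'"' as a character constant rather than the start of a string.  As ANSI
C specifies, each comment is replaced by one space.  Trigraphs and
backslash-newline splicing are not handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy src to dst with C comments replaced by a
 * single space.  dst must have room for strlen(src) + 1 bytes. */
void strip_comments(const char *src, char *dst)
{
    enum { CODE, SLASH, COMMENT, STAR, QUOTED, ESCAPED } st = CODE;
    char quote = '\0';          /* which quote opened the current literal */

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        char c = *src;

        switch (st) {
        case CODE:
            if (c == '/') {             /* hold the slash back for now */
                st = SLASH;
                break;
            }
            if (c == '"' || c == '\'') {
                quote = c;
                st = QUOTED;
            }
            *dst++ = c;
            break;
        case SLASH:
            if (c == '*') {             /* the held slash opened a comment */
                st = COMMENT;
                break;
            }
            *dst++ = '/';               /* the held slash was ordinary code */
            if (c == '/')
                break;                  /* hold this new slash instead */
            if (c == '"' || c == '\'') {
                quote = c;
                st = QUOTED;
            } else {
                st = CODE;
            }
            *dst++ = c;
            break;
        case COMMENT:
            if (c == '*')
                st = STAR;
            break;
        case STAR:                      /* one or more '*' inside a comment */
            if (c == '/') {
                *dst++ = ' ';           /* comment ends: emit one space */
                st = CODE;
            } else if (c != '*') {
                st = COMMENT;
            }
            break;
        case QUOTED:                    /* inside "..." or '...' */
            *dst++ = c;
            if (c == '\\')
                st = ESCAPED;
            else if (c == quote)
                st = CODE;
            break;
        case ESCAPED:                   /* character after a backslash */
            *dst++ = c;
            st = QUOTED;
            break;
        }
    }
    if (st == SLASH)
        *dst++ = '/';                   /* input ended on a lone slash */
    *dst = '\0';
}
```

Run on the one-line main() above, the /*gotcha*/ comment becomes a
single space and both '"' constants come through untouched.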

My argument, that it's difficult to make a logical error in coding this
problem in Lex, has now been demonstrated wrong (sob :-). But at least 
Lex is still outscoring straight C (faint praise :-?).

John Rupley
rupley!local@megaron.arizona.edu

danw@tekchips.LABS.TEK.COM (Daniel E. Wilson) (03/27/89)

   Why is everyone obsessed with stripping comments from their C programs?
Is this some new programming trend? :)

Dan Wilson

rupley@arizona.edu (John Rupley) (03/28/89)

In article <795@twwells.uucp>, bill@twwells.uucp (T. William Wells) writes:
> In article <9797@megaron.arizona.edu> rupley@arizona.edu (John Rupley) writes:
> : A Lex source for uncommenting is attached (which I hope does not belie
> : the remark above about hard to get the logic wrong :-).
> 
> Try it on a very long comment. You might discover an overflowed lex
> buffer. On the other hand, this shouldn't be too hard to fix. Just do
> for the comment what you did for the noncommented text.

Nope.... no problem.... comments are thrown away line-by-line, by design,
so that very long comments indeed do not blow the buffer.

A very long string, however, will overflow the buffer, but clearly this
is understood, and it can be viewed as a feature, although
idiosyncratic, as noted in <9888@megaron.arizona.edu>.  If you want to
handle strings differently, add another start condition (state) begun
by '"' and make explicit start condition 0 = <INITIAL>, or change the
size of the match buffer (yytext[]) by including in the definitions:

	%{
	#define YYLMAX 5000	/* or whatever */
	%}

John Rupley
rupley!local@megaron.arizona.edu

stil@nikhefh.hep.nl (Gertjan Stil) (03/29/89)

In article <4895@cbnews.ATT.COM> smk@cbnews.ATT.COM (Stephen M. Kennedy) writes:
>In article <9900010@bradley> brian@bradley.UUCP writes:
>>  The following works in vi: :%s/\/\*.*\*\///g
>
>/*
> * Unfortunately, multi-line comments aren't deleted.
> */

What about the following command in vi:  :%s/\/\*[.|\n]*\*\///g

This will work for multi-line comments.

Gertjan Stil

<no signature yet>

maw@auc.UUCP (Michael A. Walker) (03/30/89)

In article <9887@megaron.arizona.edu>, rupley@arizona.edu (John Rupley) writes:
> 
> > In article <620@gonzo.UUCP>, daveb@gonzo.UUCP (Dave Brower) writes:
> > So, I offer this week's challenge:  Smallest program that will take
> > "blank line" style cpp output on stdin and send to stdout a scrunched
> > version with appropriate #line directives.  [f]lex, Yacc, [na]awk, sed,
> > perl, c, c++ are all acceptable.  This will be an amusing exercise in
> > typical text massaging that can be enlightening for many people.
> 
> "Scrunching" is probably a matter of taste, with regard to the format
> of the output.

I don't know what is meant by the term scrunching, but here is my entry to
the problem of removing comments in a C program.  YACCR (Yet Another C
Comment Remover :-) is a crazy looking lex specification that removes C
comments from a source file.  It also does not put out the many extra blank
lines that cpp does.  I have tested it on most styles of C comment that I
have seen and it seems to work,  but PLEASE no flames if it doesn't!!!!

In an earlier message, someone addressed the problem of a yytext overflow.
YACCR redefines the YYLMAX constant as 500, but you can test it with other
values.

To use:
	1.  Save message in file called yaccr.l and edit this file to
	    remove unwanted text.

	2.  Type: lex yaccr.l

	3.  Type: cc lex.yy.c -ll -o yaccr

It should then be ready to go.

Good luck.

---mike
EMAIL: ...!gatech!auc!rambro!maw

--------------------------------cut here--------------------------
%{
/*
**	Specification:	YACCR
**	Description  :	YACCR removes comments from C programs.
*/

#define CR		0x0d

#ifdef YYLMAX
#undef YYLMAX
#endif
#define YYLMAX		500
%}

%%

"/*""*"*("/*"*|[^*/]|[^*]"/"|"*"[^/])*"*"*"*/"	putchar(CR);
.						printf("%s",yytext);
--------------------------------cut here--------------------------