[net.unix] comments in lex

ray@emacs.uucp (Ray Reeves) (06/25/85)

I'm new to lex, and the first thing I tried wouldn't fly!
How do you recognise a PL/1 style comment?
This is what I did:

startcom	\/\*
endcom		\*\/
%%
{startcom}[^{endcom}]{endcom}	printf("%s","comment")

But negation doesn't take conjunctions, even when packaged like this.
What should I do?
-- 
Ray Reeves,
CCA-UNIWORKS,20 William St,Wellesley, Ma. 02181. (617)235-2600
emacs!ray@CCA-UNIX

sambo@ukma.UUCP (Inventor of micro-S) (06/29/85)

In article <114@emacs.uucp> ray@emacs.uucp (Ray Reeves) writes:
>This is what I did:
>
>startcom	\/\*
>endcom		\*\/
>%%
>{startcom}[^{endcom}]{endcom}	printf("%s","comment")

Here is what I do.  If there is a better way, let me know.

%%
"/*"[^*\n]*	{
		  int c, i;
		  if ((c = input ()) == '*')
		    if ((c = input ()) == '/') {
			/* Have found a comment. */
		    } /* if ((c = input ()) == '/') */
		    else {
		      unput (c); unput ('*'); unput ('/');
			/* This makes lex think that the very next thing on
			   the input is also a comment. */
		    } /* else - if ((c = input ()) == '/') */
		  else {
		      /* found '\n' */
		    unput ('*'); unput ('/');
			/* Processing may be different upon reaching the end of
			   line - in my compiler I keep track of which line of
			   text I am looking at.  Also, by doing things this way
			   I am hoping to avoid overflowing the input buffer. */
		  } /* else - first if ((c */
		}
-----------------------------------------
Samuel A. Figueroa, Dept. of CS, Univ. of KY, Lexington, KY  40506-0027
ARPA: ukma!sambo<@ANL-MCS>, or sambo%ukma.uucp@anl-mcs.arpa,
      or even anlams!ukma!sambo@ucbvax.arpa
UUCP: {ucbvax,unmvax,boulder,oddjob}!anlams!ukma!sambo,
      or cbosgd!ukma!sambo

	"Micro-S is great, if only people would start using it."

haahr@jendeh.UUCP (Paul Haahr) (07/01/85)

> How do you recognise a PL/1 style comment?

"/*"([^*]|"*"[^/])*"*/"

				Paul Haahr
				..!allegra!princeton!jendeh!haahr

dudek@harvard.ARPA (Glen Dudek) (07/04/85)

What I consider the "best" (read "most efficient") way to eat C or
PL-1 style comments in lex is as in the ANSI-C standard yacc/lex
grammar recently posted to the net:

%%
"/*"			{ comment(); printf("comment"); }
%%
comment()
{
	char c, c1;
loop:
	while ((c = input()) != '*' && c != 0)
		putchar(c);
	if ((c1 = input()) != '/' && c != 0) {
		unput(c1);
		goto loop;
	}
	if (c != 0)
		putchar(c1);
}

My favorite way is to use the following lex expression:

%%
"/*"("/"|("*"*[^*/]))*"*"+"/"	{ printf("comment"); }
%%

Although it may make your head hurt, it's interesting to figure out.

	Glen Dudek

ray@emacs.uucp (Ray Reeves) (07/11/85)

Thanks for all the contributions on this subject.
Many people recommended embedding C in Lex to solve
my problem, pointing out some precedents for this.
This, of course, is tantamount to saying that Lex
can't hack it, and indeed amdahl!drivax!alan says
I shouldn't expect a finite state machine to do so.

Paul Haahr of Princeton made a snappy answer which was:

"/*"([^*]|"*"[^/])*"*/"

but Glen Dudek of Harvard pointed out that this fails for

/***/, and it should have been:

"/*"("/"|("*"*[^*/]))*"*"+"/"

Several people pointed out the hazard of enormous block 
comments, and McQueer steered me to the use of START 
transitions, which I decided was sound advice when I 
discovered the yymore function.

I append the Lex code that I have arrived at, a program to
which you all contributed.  My special problem is a lexical
processor for a PL/1 pretty-printer, and in this environment
people typically have enormous comment blocks with some sort
of pattern or table in them.  Thus, although code can be torn
to shreds and reformatted, block comments must be left undisturbed.
The problem of large blocks is solved by entering a "comment" mode
and tokenising each line separately.  The residual problem is that
the first line of such a block has to respect leading white space
even before the comment starts. This is solved by tokenising the
whole line, not just the comment part. Elsewhere, white space is
discarded.

To my astonishment, a fault in Lex showed up which almost 
crippled me.  It is impossible to recognise just one \n
character under a START mode, although you can in normal mode.
Thus, my last rule looks for [\n]+ followed by any character and
then unputs that character back.  Is this a known wart?


startcom \/\*
endcom \*\/
%START com maybecom

%%
{startcom} 	      {yymore();BEGIN com;}
[\n]+		      {printf("%s%u%c","nl(",yyleng,')');BEGIN maybecom;}
[ \t]+		      ;

<maybecom>[\ \t]*{startcom}  {yymore();BEGIN com;}

<com>[^\*\n]*{endcom} {printf("%s%u%s%s%s","cm(",yyleng,",\"",yytext,"\")");BEGIN 0;}
<com>[^\*\n]*	    printf("%s%u%s%s%s","cm(",yyleng,",\"",yytext,"\")");
<com>[^\*\n]*\*	    yymore();
<com>[\n]+.	    {unput(yytext[yyleng-1]);printf("%s%u%c","nl(",yyleng-1,')');}

%%
main() {yylex(); return 0;}
-- 
Ray Reeves,
CCA-UNIWORKS,20 William St,Wellesley, Ma. 02181. (617)235-2600
emacs!ray@CCA-UNIX

andrew@orca.UUCP (Andrew Klossner) (07/14/85)

>> How do you recognise a PL/1 style comment?
>
> "/*"([^*]|"*"[^/])*"*/"

Two problems:

1) This pattern will incorrectly recognize "/***/ */" as a comment.

2) This approach to comment skipping is a bad idea in lex, because the
   generated lexer will try to accumulate the entire comment in the
   "yytext" buffer, which has a fixed size.  (On our system, the size
   is 1024 bytes.)  If ever a comment with more bytes than the buffer
   size is found, the lex driver will merrily overwrite the memory
   following the buffer and blow away your compile.

  -=- Andrew Klossner   (decvax!tektronix!orca!andrew)       [UUCP]
                        (orca!andrew.tektronix@csnet-relay)  [ARPA]
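
Problem 1 can be checked outside lex. The sketch below is not from the
original posts. It assumes a POSIX system with <regex.h> and rewrites
the two lex patterns from this thread as extended regular expressions,
replacing lex's quoting with backslash escapes. That translation and
the names match_length, HAAHR_ERE and DUDEK_ERE are assumptions made
for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* The lex patterns from the thread, rewritten as POSIX extended
   regular expressions anchored at the start of the text. */
const char HAAHR_ERE[] = "^/\\*([^*]|\\*[^/])*\\*/";
const char DUDEK_ERE[] = "^/\\*(/|(\\**[^*/]))*\\*+/";

/* Hypothetical helper: length of the match of pattern at the start of
   text, or -1 if there is no match or the pattern does not compile. */
long match_length(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    long len = -1;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, &m, 0) == 0 && m.rm_so == 0)
        len = (long)m.rm_eo;
    regfree(&re);
    return len;
}
```

With these definitions, match_length(HAAHR_ERE, "/***/ */") covers all
8 characters and match_length(HAAHR_ERE, "/***/") finds no match, while
DUDEK_ERE matches just the 5-character "/***/" in both inputs.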

hammond@steinmetz.UUCP (Steve Hammond) (07/23/85)

> >> How do you recognise a PL/1 style comment?
> >
> > "/*"([^*]|"*"[^/])*"*/"
> 
> Two problems:
> 
> 1) This pattern will incorrectly recognize "/***/ */" as a comment.
> 
> 2) This approach to comment skipping is a bad idea in lex, because the
>    generated lexer will try to accumulate the entire comment in the
>    "yytext" buffer, which has a fixed size.  (On our system, the size
>    is 1024 bytes.)  If ever a comment with more bytes than the buffer
>    size is found, the lex driver will merrily overwrite the memory
>    following the buffer and blow away your compile.
> 

I get around #1 by switching modes when I encounter a /*
and switching back when I encounter a */.  To date I
have not overflowed the yytext buffer (I hope).
-- 
  Steve Hammond   arpa: hammond@GE   uucp: {...edison!}steinmetz!hammond
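
The mode-switching idea can be sketched in plain C without lex. This
is an illustration only, not code from any of the posters: the scanner
keeps a single mode flag, looks one character ahead, and never stores
the comment text, so the length of a comment does not matter. The
names strip_comments and enum mode are hypothetical, and dst is
assumed to have room for at least strlen(src) + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>

enum mode { CODE, COMMENT };

/* Hypothetical helper: copies src to dst with comments removed and
   returns the number of complete comments seen. */
size_t strip_comments(const char *src, char *dst)
{
    enum mode mode = CODE;
    size_t comments = 0;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (mode == CODE && src[0] == '/' && src[1] == '*') {
            mode = COMMENT;             /* plays the role of BEGIN com */
            src += 2;
        } else if (mode == COMMENT && src[0] == '*' && src[1] == '/') {
            mode = CODE;                /* plays the role of BEGIN 0 */
            comments++;
            src += 2;
        } else {
            if (mode == CODE)
                *dst++ = *src;          /* only text outside comments is kept */
            src++;
        }
    }
    *dst = '\0';
    return comments;
}
```

Given "a /***/ */ b", the comment ends at the first "*/", so the
trailing " */ b" is copied through as ordinary text.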