[comp.lang.c] LEX

michael@nyit.UUCP (Michael Gwilliam) (02/02/88)

I'm writting a C like language to discribe data structures.  When I
was writting the tokenizer using LEX and I got intrigued by a little
problem.  Is it possible to write a regular expression that will
transform a /* comment */ into nothing?

I tried something like  

\/\*[^(\/\*)]*\*\/ ;  

which works pretty good, but won't handle cases like

/*  comment /* more comment */

which is legal in C.  Then somebody came up with something like...

\/\*(.\*)\*\/ ;  

which works pretty good, but it doesn't reject

/* /* nested comment */ */ 

which are illlegal.  So my question is, to all you experienced lex
users and compiler writers, can this be done?  Or do I need to
use input() and other lex functions.  I imagine that the lex code
would be similar in pascal.

Thanks for the help or at least reading this far.

michael

P.S. If reponses are mailed instead of posted, I will be happy to
summerize the results and post them myself, giving proper credit
where it is due.

              nyit!michael
      philabs!nyit!michael
ihnp4!philabs!nyit!michael

mike@arizona.edu (Mike Coffin) (02/04/88)

In article <260@nyit.UUCP>, michael@nyit.UUCP (Michael Gwilliam) writes:

> I'm writting a C like language to discribe data structures.  When I
> was writting the tokenizer using LEX and I got intrigued by a little
> problem.  Is it possible to write a regular expression that will
> transform a /* comment */ into nothing?

I tried to mail this, but the mailer couldn't find you:

You can probably write a single regular expression to recognize C
comments, but it would be a bad idea.  In general, comments can be
long.  Lex, being a tokenizer, is not designed to recognize things
bigger than its internal buffer size, which is only a few thousand
characters.  When presented with long tokens, lex drops core.  Two
possible solutions:

1) Upon recognizing "/*", call a C routine to eat the rest of the
comment.  Inside the routine, use the Lex macro "input()" to get
characters.

2) Use Lex "start conditions".  These allow you to specify several
different tokenizers and switch between them explicitly.  Untested
code:

<N>"/*"			{BEGIN BC;}
<BC>[^*\n]*		;
<BC>"*"			;
<BC>"\n"		{lineno++;}
<BC>"*/"		{BEGIN N;}

Start condition <N> is the "normal" start condition, while <BC> is
"block-comment" condition.  This is much safer than recognizing entire
comments as one token; to overflow the buffer a single line would have
to be longer than the buffer.

-- 

Mike Coffin				mike@arizona.edu
Univ. of Ariz. Dept. of Comp. Sci.	{allegra,cmcl2,ihnp4}!arizona!mike
Tucson, AZ  85721			(602)621-4252

ix426@sdcc6.ucsd.EDU (Tom Stockfisch) (02/04/88)

In article <260@nyit.UUCP> michael@nyit.UUCP (Michael Gwilliam) writes:
>
>....  When I
>was writting the tokenizer using LEX and I got intrigued by a little
>problem.  Is it possible to write a regular expression that will
>transform a /* comment */ into nothing? ....
>So my question is, to all you experienced lex
>users and compiler writers, can this be done?  Or do I need to
>use input() and other lex functions.

[sorry for not emailing, I can't seem to get mail to Michael]

I can't believe how hard this task is in regular expressions, when it is
trivial to code by hand.  I have found a solution which I think is correct,
but it took several tries (see end of this posting).

To convince yourself that a pattern is correct, I think you have to show
two things
	1.  That the body between the "/*" and "*/" cannot possibly contain
	    a "*/",
	2.  That the body can contain any other sequence of characters.

If you come up with your own solution, be sure it works properly on the
following input.

1.	/*****//hello world */

2.	/* hello /* /* world */

3.	/* */ hello /* */

4.	/**// /* this input should produce "/ \n" for output */

5.	/* */ hello */


The following lex source should "elide" all legal comments, and pass all
the rest thru to stdout.  As requested, it does not use input().

--cut----

okslash	([^*/]"/"+)

%%
"/*""/"*([^/]|{okslash})*"*/"	;

--cut----

Compile using

	lex comment.l; cc lex.yy.c -ll
-- 

||  Tom Stockfisch, UCSD Chemistry   tps@chem.ucsd.edu

djones@megatest.UUCP (Dave Jones) (02/05/88)

 
 In article <260@nyit.UUCP>, michael@nyit.UUCP (Michael Gwilliam) writes:
 
> I'm writting a C like language to discribe data structures.  When I
> was writting the tokenizer using LEX and I got intrigued by a little
> problem.  Is it possible to write a regular expression that will
> transform a /* comment */ into nothing?
 
It is indeed intriguing.   I don't think you can write any
LR(k) context-free grammar to "transform it" into anything.

		- djones

peter@edsews.EDS.COM (Peter Zadrozny) (02/06/88)

nyit.uucp is unknown...

In article <260@nyit.UUCP>, michael@nyit.UUCP (Michael Gwilliam) writes:
> Is it possible to write a regular expression that will
> transform a /* comment */ into nothing?

A while ago I had to do something similar for a PL/1 preprocessor
(PL/1 also uses /* ... */ as comment delimiter).

After toying a while trying to get the perfect lex syntax, I remembered
that I had to do more stuff than just skip the comment such as line counting,
error trapping and more, therefore my final solution was to recognize
"/*" and send it to a comment eater routine similar to:

#define EOF_MESSAGE	"Unexpected EOF in comment"

void skip_comm ()
{
    register int    c;

    for (;;) {
	if ((c = input ()) == '\n')
	     line_counter++;
	else {
	    if (c == 0)
		lex_error (EOF_MESSAGE, UNRECOVERABLE);
	    if (c == '*') {
		if ((c = input ()) == '/')
		    break;
		else
		    if (c == 0)
			lex_error (EOF_MESSAGE, UNRECOVERABLE);
		    else
			unput (c);
	}
    }
}

input () and unput () are lex routines, and lex_error is my error
handler which uses line_counter to output a decent error message.

--------------------------------------------------------------------
Peter Zadrozny					peter@edsews.eds.com
Oh my god, he can speak!!!			uunet!edsews!peter
Yeah, but don't tell anyone, he'll get fired	(313) 645-4725
--------------------------------------------------------------------

ok@quintus.UUCP (Richard A. O'Keefe) (02/08/88)

In article <248@goofy.megatest.UUCP>, djones@megatest.UUCP (Dave Jones) writes:
>  In article <260@nyit.UUCP>, michael@nyit.UUCP (Michael Gwilliam) writes:
> > I'm writing a C like language to discribe data structures.  When I
> > was writing the tokenizer using LEX and I got intrigued by a little
> > problem.  Is it possible to write a regular expression that will
> > transform a /* comment */ into nothing?
> It is indeed intriguing.   I don't think you can write any
> LR(k) context-free grammar to "transform it" into anything.

It is quite straightforward to write a Yacc grammar which matches
comments, so LR(1) is not just possible, it's easy.
Here it is.  (OTHER is any character but / or *.)

%token '/' '*' OTHER
%start comment
%%

comment : '/' '*' rest
	;
rest : OTHER rest
     | '/' rest
     | '*' hope
     ;
hope : '/'
     | '*' hope
     | OTHER rest
     ;

It is possible to express this directly as a single RE:
                            *   +                      *   + *
        {/}{*} ({OTHER}|{/}) {*} ({OTHER} ({OTHER}|{/}) {*} ) {/}

but as I don't use LEX, I don't know how to say this in LEXese.
Note that *nested* comments cannot be expressed as a RE, but can
be handled by a simple Yacc grammar.

But if all you want to do is to skip C-like comments, why not RTM?
In "Chapter 5: LEX" of the "UNIX System V Programmer's Guide",
we find EXACTLY this example on page 207 of the 1987 edition.

	...
	"/*"		skipcmnts();
	...
	%%
	skipcmnts()
	    {
		for (;;) {
		    while (input() != '*') ;
		    if (input() == '/') return;
		    unput(yytext[yyleng-1]);
		}
	    }

{I've hacked it around a bit.}  Note that this stores the comment in
the yytext[] buffer, which is a Good Thing if you want to write the
comment out somewhere, but it can overflow.  yyless() can be used to
keep that buffer empty.

Frankly, I've always found it easier to transcribe the state machine
directly into C.

edw@IUS1.CS.CMU.EDU (Eddie Wyatt) (02/08/88)

> > I'm writting a C like language to discribe data structures.  When I
> > was writting the tokenizer using LEX and I got intrigued by a little
> > problem.  Is it possible to write a regular expression that will
> > transform a /* comment */ into nothing?
>  
> It is indeed intriguing.   I don't think you can write any
> LR(k) context-free grammar to "transform it" into anything.
> 

  I don't think you should be concerned with trying to put comment handling
in the grammar.   It is usually the job of the lexer to strip comments
out of the source code which is a very easy task.

  A comment in regards to the question about handling an ambiguity in
the C grammar.  The problems was that there are cases where a
variable declaration could be confused with a function call.
Example:

	typedef int xx;
	main()
	    {
	    xx (x);
	    .....
	    }

 The statement xx (x) could be parsed as either a variable declaration
of x or a function call of xx and is depended upon the context
(whether xx is a function or a type).

   If you change our grammar to distinguish between the to you should
have no problems. 


	funcbody	:	varlist statements
			;

	varlist		:	varlist vardec
			|	/* epsilon */
			;

	vardec		:	TYPENAME varexp;
			;

	statements	:	FUNCNAME arglist
			|	......

    In the lexer, one would have to distinguish ids of function calls
from ids of typedefs and return the tokens FUNCNAME or TYPENAME
respectively.  This can be done simple by doing a lookup in the 
type name table , followed by a lookup  in the func name table if
the lookup in the type name table fails.  I think the semantics
of C will require that order execution.  If the lookup in the
type name table succeeds then the id has to be a type name,
a function could not have a name the same as the typedef id,
or should I say my C compiler flags this one.  If the
lookup in the type name table fails the statement has to be a function
call.   If the lookup in the func name table fails, insert it
into the table as a forward reference to the function.
-- 

Eddie Wyatt 				e-mail: edw@ius1.cs.cmu.edu