michael@nyit.UUCP (Michael Gwilliam) (02/02/88)
I'm writting a C like language to discribe data structures. When I was writting the tokenizer using LEX and I got intrigued by a little problem. Is it possible to write a regular expression that will transform a /* comment */ into nothing? I tried something like \/\*[^(\/\*)]*\*\/ ; which works pretty good, but won't handle cases like /* comment /* more comment */ which is legal in C. Then somebody came up with something like... \/\*(.\*)\*\/ ; which works pretty good, but it doesn't reject /* /* nested comment */ */ which are illlegal. So my question is, to all you experienced lex users and compiler writers, can this be done? Or do I need to use input() and other lex functions. I imagine that the lex code would be similar in pascal. Thanks for the help or at least reading this far. michael P.S. If reponses are mailed instead of posted, I will be happy to summerize the results and post them myself, giving proper credit where it is due. nyit!michael philabs!nyit!michael ihnp4!philabs!nyit!michael
mike@arizona.edu (Mike Coffin) (02/04/88)
In article <260@nyit.UUCP>, michael@nyit.UUCP (Michael Gwilliam) writes: > I'm writting a C like language to discribe data structures. When I > was writting the tokenizer using LEX and I got intrigued by a little > problem. Is it possible to write a regular expression that will > transform a /* comment */ into nothing? I tried to mail this, but the mailer couldn't find you: You can probably write a single regular expression to recognize C comments, but it would be a bad idea. In general, comments can be long. Lex, being a tokenizer, is not designed to recognize things bigger than its internal buffer size, which is only a few thousand characters. When presented with long tokens, lex drops core. Two possible solutions: 1) Upon recognizing "/*", call a C routine to eat the rest of the comment. Inside the routine, use the Lex macro "input()" to get characters. 2) Use Lex "start conditions". These allow you to specify several different tokenizers and switch between them explicitly. Untested code: <N>"/*" {BEGIN BC;} <BC>[^*\n]* ; <BC>"*" ; <BC>"\n" {lineno++;} <BC>"*/" {BEGIN N;} Start condition <N> is the "normal" start condition, while <BC> is "block-comment" condition. This is much safer than recognizing entire comments as one token; to overflow the buffer a single line would have to be longer than the buffer. -- Mike Coffin mike@arizona.edu Univ. of Ariz. Dept. of Comp. Sci. {allegra,cmcl2,ihnp4}!arizona!mike Tucson, AZ 85721 (602)621-4252
ix426@sdcc6.ucsd.EDU (Tom Stockfisch) (02/04/88)
In article <260@nyit.UUCP> michael@nyit.UUCP (Michael Gwilliam) writes: > >.... When I >was writting the tokenizer using LEX and I got intrigued by a little >problem. Is it possible to write a regular expression that will >transform a /* comment */ into nothing? .... >So my question is, to all you experienced lex >users and compiler writers, can this be done? Or do I need to >use input() and other lex functions. [sorry for not emailing, I can't seem to get mail to Michael] I can't believe how hard this task is in regular expressions, when it is trivial to code by hand. I have found a solution which I think is correct, but it took several tries (see end of this posting). To convince yourself that a pattern is correct, I think you have to show two things 1. That the body between the "/*" and "*/" cannot possibly contain a "*/", 2. That the body can contain any other sequence of characters. If you come up with your own solution, be sure it works properly on the following input. 1. /*****//hello world */ 2. /* hello /* /* world */ 3. /* */ hello /* */ 4. /**// /* this input should produce "/ \n" for output */ 5. /* */ hello */ The following lex source should "elide" all legal comments, and pass all the rest thru to stdout. As requested, it does not use input(). --cut---- okslash ([^*/]"/"+) %% "/*""/"*([^/]|{okslash})*"*/" ; --cut---- Compile using lex comment.l; cc lex.yy.c -ll -- || Tom Stockfisch, UCSD Chemistry tps@chem.ucsd.edu
djones@megatest.UUCP (Dave Jones) (02/05/88)
In article <260@nyit.UUCP>, michael@nyit.UUCP (Michael Gwilliam) writes: > I'm writting a C like language to discribe data structures. When I > was writting the tokenizer using LEX and I got intrigued by a little > problem. Is it possible to write a regular expression that will > transform a /* comment */ into nothing? It is indeed intriguing. I don't think you can write any LR(k) context-free grammar to "transform it" into anything. - djones
peter@edsews.EDS.COM (Peter Zadrozny) (02/06/88)
nyit.uucp is unknown... In article <260@nyit.UUCP>, michael@nyit.UUCP (Michael Gwilliam) writes: > Is it possible to write a regular expression that will > transform a /* comment */ into nothing? A while ago I had to do something similar for a PL/1 preprocessor (PL/1 also uses /* ... */ as comment delimiter). After toying a while trying to get the perfect lex syntax, I remembered that I had to do more stuff than just skip the comment such as line counting, error trapping and more, therefore my final solution was to recognize "/*" and send it to a comment eater routine similar to: #define EOF_MESSAGE "Unexpected EOF in comment" void skip_comm () { register int c; for (;;) { if ((c = input ()) == '\n') line_counter++; else { if (c == 0) lex_error (EOF_MESSAGE, UNRECOVERABLE); if (c == '*') { if ((c = input ()) == '/') break; else if (c == 0) lex_error (EOF_MESSAGE, UNRECOVERABLE); else unput (c); } } } input () and unput () are lex routines, and lex_error is my error handler which uses line_counter to output a decent error message. -------------------------------------------------------------------- Peter Zadrozny peter@edsews.eds.com Oh my god, he can speak!!! uunet!edsews!peter Yeah, but don't tell anyone, he'll get fired (313) 645-4725 --------------------------------------------------------------------
ok@quintus.UUCP (Richard A. O'Keefe) (02/08/88)
In article <248@goofy.megatest.UUCP>, djones@megatest.UUCP (Dave Jones) writes: > In article <260@nyit.UUCP>, michael@nyit.UUCP (Michael Gwilliam) writes: > > I'm writing a C like language to discribe data structures. When I > > was writing the tokenizer using LEX and I got intrigued by a little > > problem. Is it possible to write a regular expression that will > > transform a /* comment */ into nothing? > It is indeed intriguing. I don't think you can write any > LR(k) context-free grammar to "transform it" into anything. It is quite straightforward to write a Yacc grammar which matches comments, so LR(1) is not just possible, it's easy. Here it is. (OTHER is any character but / or *.) %token '/' '*' OTHER %start comment %% comment : '/' '*' rest ; rest : OTHER rest | '/' rest | '*' hope ; hope : '/' | '*' hope | OTHER rest ; It is possible to express this directly as a single RE: * + * + * {/}{*} ({OTHER}|{/}) {*} ({OTHER} ({OTHER}|{/}) {*} ) {/} but as I don't use LEX, I don't know how to say this in LEXese. Note that *nested* comments cannot be expressed as a RE, but can be handled by a simple Yacc grammar. But if all you want to do is to skip C-like comments, why not RTM? In "Chapter 5: LEX" of the "UNIX System V Programmer's Guide", we find EXACTLY this example on page 207 of the 1987 edition. ... "/*" skipcmnts(); ... %% skipcmnts() { for (;;) { while (input() != '*') ; if (input() == '/') return; unput(yytext[yyleng-1]); } } {I've hacked it around a bit.} Note that this stores the comment in the yytext[] buffer, which is a Good Thing if you want to write the comment out somewhere, but it can overflow. yyless() can be used to keep that buffer empty. Frankly, I've always found it easier to transcribe the state machine directly into C.
edw@IUS1.CS.CMU.EDU (Eddie Wyatt) (02/08/88)
> > I'm writting a C like language to discribe data structures. When I > > was writting the tokenizer using LEX and I got intrigued by a little > > problem. Is it possible to write a regular expression that will > > transform a /* comment */ into nothing? > > It is indeed intriguing. I don't think you can write any > LR(k) context-free grammar to "transform it" into anything. > I don't think you should be concerned with trying to put comment handling in the grammar. It is usually the job of the lexer to strip comments out of the source code which is a very easy task. A comment in regards to the question about handling an ambiguity in the C grammar. The problems was that there are cases where a variable declaration could be confused with a function call. Example: typedef int xx; main() { xx (x); ..... } The statement xx (x) could be parsed as either a variable declaration of x or a function call of xx and is depended upon the context (whether xx is a function or a type). If you change our grammar to distinguish between the to you should have no problems. funcbody : varlist statements ; varlist : varlist vardec | /* epsilon */ ; vardec : TYPENAME varexp; ; statements : FUNCNAME arglist | ...... In the lexer, one would have to distinguish ids of function calls from ids of typedefs and return the tokens FUNCNAME or TYPENAME respectively. This can be done simple by doing a lookup in the type name table , followed by a lookup in the func name table if the lookup in the type name table fails. I think the semantics of C will require that order execution. If the lookup in the type name table succeeds then the id has to be a type name, a function could not have a name the same as the typedef id, or should I say my C compiler flags this one. If the lookup in the type name table fails the statement has to be a function call. If the lookup in the func name table fails, insert it into the table as a forward reference to the function. -- Eddie Wyatt e-mail: edw@ius1.cs.cmu.edu