d5kwedb@dtek.chalmers.se (Kristian Wedberg) (12/05/89)
A question from a friend of mine, Pär Eriksson:

	Does anyone know how to write a LEX rule for C comments,
	ie for everything between /* and */, nesting not allowed?

If YOU know the answer, don't hesitate to answer (preferably by email) to:

	kitte	d5kwedb@dtek.chalmers.se
rsalz@bbn.com (Rich Salz) (12/05/89)
In <601@vice2utc.chalmers.se> d5kwedb@dtek.chalmers.se (Kristian Wedberg) writes:
>A question from a friend of mine, Pär Eriksson:
>	Does anyone know how to write a LEX rule for C comments,
>	ie for everything between /* and */, nesting not allowed?

We go through this in comp.lang.c about once a year. Almost everyone
gets it wrong. The best thing to do is to define a lex rule that
catches "/*" and in the actions for that rule look for */ on your own.

The following code fragment comes from the Cronus type definition compiler:

/* State of our two automata. */
typedef enum _STATE {
    S_EAT, S_STAR, S_NORMAL, S_END
} STATE;

"/*"	{
	    /* Comment. */
	    register STATE S;

	    for (S = S_NORMAL; S != S_END; )
		switch (input()) {
		case '/':
		    if (S == S_STAR) {
			S = S_END;
			break;
		    }
		    /* FALLTHROUGH */
		default:
		    S = S_NORMAL;
		    break;
		case '\0':
		    /* Warn about EOF inside comment? */
		    S = S_END;
		    break;
		case '*':
		    S = S_STAR;
		    break;
		}
	    /* NOTREACHED */
	}
-- 
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.
Use a domain-based address or give alternate paths, or you may lose out.
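The state machine in the fragment above can be tried outside lex. What
follows is a minimal sketch, assuming a NUL-terminated string stands in
for lex's input(); the function name skip_comment_body and the -1 return
for an unterminated comment are illustrative choices, not part of the
Cronus code.

```c
#include <assert.h>
#include <stddef.h>

/* Scanner states, mirroring S_NORMAL / S_STAR / S_END in the fragment. */
enum state { NORMAL, STAR, END };

/*
 * Hypothetical helper: `body` points just past an opening slash-star.
 * Returns the number of characters consumed up to and including the
 * closing star-slash, or -1 if the string ends first.
 */
long skip_comment_body(const char *body)
{
    enum state s = NORMAL;
    long i = 0;

    assert(body != NULL);
    while (s != END) {
        char c = body[i];

        if (c == '\0')
            return -1;              /* EOF inside comment */
        i++;
        if (c == '*')
            s = STAR;               /* a run of stars stays in STAR */
        else if (c == '/' && s == STAR)
            s = END;                /* star followed by slash: done */
        else
            s = NORMAL;             /* anything else resets */
    }
    return i;
}
```

Called on the text following "/*", it reports how far the comment
extends; because a '*' always leaves the machine in the STAR state, a
run such as "**/" still terminates the comment.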
cml@tove.umd.edu (Christopher Lott) (12/05/89)
In article <601@vice2utc.chalmers.se> d5kwedb@dtek.chalmers.se (Kristian Wedberg) writes:
>	Does anyone know how to write a LEX rule for C comments,
>	ie for everything between /* and */, nesting not allowed?

[ because I would love to see other solutions to this, I posted. ]

This sounds suspiciously like someone's homework assignment.
In fact, I had exactly such a homework assignment, and this resulted :-)

The following lines are a lex program to recognize c comments, not nested.
This does NOT take into account any quote marks within a comment; i.e.,
quote marks don't 'escape' the close comment marker.

I compile this via "lex ccom.l ; cc lex.yy.c -o ccom -ll"

----snip----
	/* this is a regular expression to match a c comment */
	/* written by cml 890922 (probably not minimal) */
%%
"/*"([^*]|[*]*[^*/])*[*]+"/"	{printf("saw a c comment.\n");}
.				{putchar(*yytext);}
----snip----

chris...
-- 
cml@tove.umd.edu    Computer Science Dept, U. Maryland at College Park
		    4122 A.V.W.  301-454-8711	<standard disclaimers>
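One way to gain some confidence in an expression like the one above is
to try it on awkward inputs. The sketch below assumes the lex pattern
can be transcribed into an anchored POSIX extended regular expression
(the transcription in LOTT_ERE is an assumption, as is the helper name
matches_whole) and uses the POSIX regcomp()/regexec() interface rather
than lex itself.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* The lex pattern rewritten as an anchored POSIX ERE (a transcription, not lex syntax). */
const char LOTT_ERE[] = "^/[*]([^*]|[*]*[^*/])*[*]+/$";

/*
 * Hypothetical helper: returns 1 if `text` as a whole matches `pattern`,
 * 0 if it does not, and -1 if the pattern fails to compile.
 */
int matches_whole(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Inputs worth trying include "/**/", "/***/", "/*/" (not a complete
comment) and "/* a */ b */" (which should not match as a single
comment, since the first close marker ends it).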
mike@wheaties.ai.mit.edu (Mike Haertel) (12/06/89)
In article <2191@prune.bbn.com> rsalz@bbn.com (Rich Salz) writes:
>>A question from a friend of mine, Pär Eriksson:
>>	Does anyone know how to write a LEX rule for C comments,
>>	ie for everything between /* and */, nesting not allowed?
>We go through this in comp.lang.c about once a year. Almost everyone
>gets it wrong. The best thing to do is to define a lex rule that
>catches "/*" and in the actions for that rule look for */ on your own.

I've done this before. I think the regexp I used was (ick!):

	"/*"([^*]*("*"[^/])?)*"*/"

Despite the fact that you *can* do it, you don't want to. The reason
is that lex has a fixed length buffer into which the text of the entire
token must fit. This is arguably brain damaged (in fact there is an
option to set the buffer length, but one might argue that the buffer
should dynamically resize).

If you think about it, it's kind of silly to carefully read in all of
something you're going to throw away anyway. Which is yet another
reason why the method rsalz suggests is recommended.
-- 
Mike Haertel <mike@ai.mit.edu>
"Everything there is to know about playing the piano can be taught
in half an hour, I'm convinced of it." -- Glenn Gould
ejp@bohra.cpg.oz (Esmond Pitt) (12/06/89)
In article <601@vice2utc.chalmers.se> d5kwedb@dtek.chalmers.se (Kristian Wedberg) writes:
>A question from a friend of mine, Pär Eriksson:
>
>	Does anyone know how to write a LEX rule for C comments,
>	ie for everything between /* and */, nesting not allowed?

You don't want to do this in one rule, because a sufficiently long C
comment will overflow lex's token buffer, with dire results. Three
rules do it; the order is important.

%start COMMENT
%%
<COMMENT>"*/"	BEGIN(INITIAL);
<COMMENT>.|\n	;
<INITIAL>"/*"	BEGIN(COMMENT);

If you are using lex you will find this slow, so you will probably
replace the above with something like this:

%%
"/*"	{
		int c;

		for(;;)
		{
			c = input();
			if (c == '*')
			{
				c = input();
				if (c == '/')
					break;
				unput(c);
			}
		}
	}

(E&OE)
-- 
Esmond Pitt, Computer Power Group
ejp@bohra.cpg.oz
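The hand-written loop above has no end-of-file check, so with a lex
whose input() returns 0 at end of input, a comment left open at EOF
would apparently never leave the for(;;). The following is a minimal
sketch of the same read-and-push-back idea with that case handled,
assuming stdio's getc()/ungetc() as stand-ins for input()/unput(); the
function name skip_comment and its return convention are illustrative.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the action body: `in` is positioned just
 * past the opening slash-star. Returns 1 once the closing star-slash
 * has been consumed, or 0 if end of file is reached first.
 */
int skip_comment(FILE *in)
{
    int c;

    assert(in != NULL);
    for (;;) {
        c = getc(in);
        if (c == EOF)
            return 0;           /* unterminated comment: stop, don't spin */
        if (c == '*') {
            c = getc(in);
            if (c == '/')
                return 1;       /* found the close marker */
            if (c == EOF)
                return 0;
            ungetc(c, in);      /* may be another '*': look at it again */
        }
    }
}
```

Pushing the character back matters for input such as "**/": the second
'*' has to be re-examined as a possible start of the close marker.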
bill@twwells.com (T. William Wells) (12/06/89)
In article <21108@mimsy.umd.edu> cml@tove.umd.edu (Christopher Lott) writes:
: /* this is a regular expression to match a c comment */
: /* written by cml 890922 (probably not minimal) */
: %%
: "/*"([^*]|[*]*[^*/])*[*]+"/" {printf("saw a c comment.\n");}
: . {putchar(*yytext);}
This breaks if the comment is longer than lex's internal buffer.
Moreover, at least some lex's do *not* check for buffer overflow.
Boom.
---
Bill { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com
bill@twwells.com (T. William Wells) (12/07/89)
In article <224@bohra.cpg.oz> ejp@bohra.cpg.oz (Esmond Pitt) writes:
: In article <601@vice2utc.chalmers.se> d5kwedb@dtek.chalmers.se (Kristian Wedberg) writes:
: >A question from a friend of mine, Pär Eriksson:
: >
: >	Does anyone know how to write a LEX rule for C comments,
: >	ie for everything between /* and */, nesting not allowed?
:
: You don't want to do this in one rule, because a sufficiently long C
: comment will overflow lex's token buffer, with dire results. Three
: rules do it; the order is important.

(It should not be: the */ is longer than the `.' rule. Or is
INITIAL as a start state treated specially?)

: %start COMMENT
: %%
: <COMMENT>"*/"	BEGIN(INITIAL);
: <COMMENT>.|\n	;
: <INITIAL>"/*"	BEGIN(COMMENT);
:
: If you are using lex you will find this slow

Not only that, but start states are not reliable with lex. After
being bitten by them for the n'th time, I switched to flex.
---
Bill { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com
tps@chem.ucsd.edu (Tom Stockfisch) (12/07/89)
In article <1989Dec6.180833.2985@twwells.com> bill@twwells.com (T. William Wells) writes:
>... start states are not reliable with lex. After
>being bitten by them for the n'th time, I switched to flex.

I know of a bug that involves variable length trailing contexts (this
happens in flex, too), but I've never run into a bug with start states.
Can you give specifics?
-- 
|| Tom Stockfisch, UCSD Chemistry	tps@chem.ucsd.edu
bill@twwells.com (T. William Wells) (12/08/89)
In article <616@chem.ucsd.EDU> tps@chem.ucsd.edu (Tom Stockfisch) writes:
: In article <1989Dec6.180833.2985@twwells.com> bill@twwells.com (T. William Wells) writes:
:
: >... start states are not reliable with lex. After
: >being bitten by them for the n'th time, I switched to flex.
:
: I know of a bug that involves variable length trailing contexts (this
: happens in flex, too), but I've never run into a bug with start states.
: Can you give specifics?

I don't know what triggers it, but, in certain cases, an RE that
is in some start state gets recognized, even when the program is
not in that start state. I've been bitten by this one three
times, on three different machines (a VAX, a Sun, and a '386), so
it appears to be a generic problem with lex.
---
Bill { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com
ejp@bohra.cpg.oz (Esmond Pitt) (12/08/89)
In article <1989Dec6.180833.2985@twwells.com> bill@twwells.com (T. William Wells) writes:
>
> ... start states are not reliable with lex.

I keep hearing this, and I've never had any trouble with 'lex' whatever
(in 8 years) except that the scanners are SO SLOW. This is a good reason
to switch to 'flex' all right.

But where's the unreliability? Symptoms?
-- 
Esmond Pitt, Computer Power Group
ejp@bohra.cpg.oz
chris@mimsy.umd.edu (Chris Torek) (12/09/89)
In article <1989Dec7.193738.9829@twwells.com> bill@twwells.com (T. William Wells) writes:
>I don't know what triggers it, but, in certain cases, an RE that
>is in some start state gets recognized, even when the program is
>not in that start state. I've been bitten by this one three
>times, on three different machines (a VAX, a Sun, and a '386), so
>it appears to be a generic problem with lex.

Lex's start states are not terribly well documented. Besides %start
and BEGIN(state) and <state>text, there is the default initial state
(called `INITIAL') and the fact that a lex rule, if it has no state,
acts in *all* states. For instance:

%state FOO BAR
%%
f		{ BEGIN(FOO); }
b		{ BEGIN(BAR); }
i		{ BEGIN(INITIAL); }
<INITIAL>t	{ return (1); }
<FOO>t		{ return (2); }
ack		{ return (3); }
<BAR>gasp	{ return (4); }
.|\n		;
%%
yywrap() { return (1); }
main()
{
	int c;

	while ((c = yylex()) != 0)
		printf("lex => %d\n", c);
}

(yes, not conformant, main should return an int, but then this belongs
in some other group anyway and is here only because that is where it
started) shows how this works:

	% a.out
	ack
	lex => 3
	gasp
	t
	lex => 1
	f
	ack
	lex => 3
	gasp
	t
	lex => 2
	i
	b
	ack
	lex => 3
	gasp
	lex => 4
	t
	i
	^D
	%

`ack' is unadorned, hence recognised in all states, while `t' produces
a token only in states INITIAL and FOO, and `gasp' produces something
only in state BAR. (Everything else is eaten.)

At any rate, perhaps what Bill Wells is remembering is lex taking
something that it apparently should not have because it was in some
state other than INITIAL. Or it could be yet another lex bug....
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris
bill@twwells.com (T. William Wells) (12/09/89)
In article <21182@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
: At any rate, perhaps what Bill Wells is remembering is lex taking
: something that it apparently should not have because it was in some
: state other than INITIAL. Or it could be yet another lex bug....
I'll argue for the bug. Simply switching to flex made the problem
go away.
---
Bill { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com
mark@linqdev.UUCP (mark) (12/12/89)
>	/* this is a regular expression to match a c comment */
>	/* written by cml 890922 (probably not minimal) */
>%%
>"/*"([^*]|[*]*[^*/])*[*]+"/"	{printf("saw a c comment.\n");}
>.				{putchar(*yytext);}
>----snip----
>
>chris...
>--

As a friend in the office here pointed out, this particular way of
handling C comments tends to 'grow' the parser. Here is his solution,
which I thought was a neat way to do it and save some space:

"/*"	{
	do {
	    yytext[0] = yytext[1];
	    if( !( yytext[1] = input() ) )
		yyerror( "EOF in comment. Missing \"*/\"." );
	} while( yytext[0] != '*' || yytext[1] != '/' );
	}

01010000010000010100001101001011010001010101010000100000010000100100001001010011
* Mark R. Holbrook-WS7M uw-beaver! Interlinq Software Corp *
* Issaquah, WA 98027 sumax!ole! /mark 10230 NE Points Dr #200 *
* HM:206-392-9672 Voice linqdev! Kirkland, WA 98033-???? *
* HM:206-392-9673 Data \ws7m!ws7m WK:206-827-1112 Ext:152 *
* Opinions..? Yes, they are mine and mine alone! I'll sell 'em cheap however! *
00100000010101110101001100110111010011010100000001010111010100110011011101001101
tps@chem.ucsd.edu (Tom Stockfisch) (12/12/89)
In article <364@linqdev.UUCP> mark@linqdev.UUCP () writes:
>...way of handling C comments...
>
>"/*"	{
>	do {
>	    yytext[0] = yytext[1];
>	    if( !( yytext[1] = input() ) )
>		yyerror(
>		    "EOF in comment. Missing \"*/\"." );
>	} while( yytext[0] != '*' || yytext[1] != '/' );
>	}

Too bad it doesn't work. Try "/*/*/" as input. I think it will work if
you clear yytext[1] before the do-while loop.
-- 
|| Tom Stockfisch, UCSD Chemistry	tps@chem.ucsd.edu
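The failure on "/*/*/" comes from the window's starting contents: when
the action begins, yytext still holds the matched "/*", so yytext[1] is
'*' and the very next '/' looks like a close marker. The sketch below
models the two-character window on a plain string so that both
behaviours can be compared; the function window_scan and its `seed`
parameter are illustrative assumptions, not lex facilities.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the two-character window. `body` is the text
 * after the opening slash-star; `seed` is the initial value of the
 * newer slot: '*' models yytext[1] as left by the match, '\0' models
 * clearing it before the loop. Returns the number of characters
 * consumed, or -1 if the string ends before a close marker is seen.
 */
long window_scan(const char *body, char seed)
{
    char older, newer = seed;
    long i = 0;

    assert(body != NULL);
    do {
        older = newer;              /* yytext[0] = yytext[1]; */
        newer = body[i];            /* yytext[1] = input();   */
        if (newer == '\0')
            return -1;              /* end of input inside comment */
        i++;
    } while (older != '*' || newer != '/');
    return i;
}
```

With the body "/*/" (what remains of "/*/*/" after the opening marker),
a seed of '*' stops after a single character, while a seed of '\0'
consumes all three, which is the behaviour the suggested fix aims for.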