anderson@uwvax.ARPA (05/04/84)
<> I have received several replies to my request for a lex expression to recognize /* ... */ comments. The only one that works (sent in by Jim Hogue) is "/*"([^*]*"*"*"*"[^/*])*[^*]*"*"*"*/" which I can't claim to fully understand. Nor do I understand why my original, "/*"([^*]|("*"/[^/]))*"*/", doesn't work. The idea is that each character in the string between /* and */ can either be something other than *, or * followed by something other than /. Can anyone come up with an expression simpler than Hogue's that works? By "works", I mean put it in a real "lex" program, as in: (your expr) printf("recognized (%s)\n",yytext); and try it on inputs such as /***/, /*/*/, etc. -- David Anderson (uwvax!anderson)
chris@umcp-cs.UUCP (05/04/84)
A Lex expression that matches C comments: comment "/*"(\*[^/]|[^*]*)*"*/" (The * after the first ] shouldn't be necessary, I think, but you did specify that the expression come from working code...) -- In-Real-Life: Chris Torek, Univ of MD Comp Sci (301) 454-7690 UUCP: {seismo,allegra,brl-bmd}!umcp-cs!chris CSNet: chris@umcp-cs ARPA: chris@maryland
hogue@hsi.UUCP (05/04/84)
Ok I will attempt to explain the lex script. First a little history. The script was derived while attempting to write a C complier for a class at the University of Arizona. Dr. P. J. Downey and I derived the regular expression from the finite automata and the knowledge of lexs' attempt to match the longest possible match. > I have received several replies to my request for a lex expression > to recognize /* ... */ comments. The only one that works (sent in > by Jim Hogue) is > "/*"([^*]*"*"*"*"[^/*])*[^*]*"*"*"*/" 1. "/*" matches the initial /* of a comment. 2. ([^*]*"*"*"*"[^/*])* matches the stuff in the comment including any *'s or /'s but does not allow the * just prior to the / to be matched. As such it does not match */ 3. [^*]*"*"*"*/" matches more stuff and this time forces the match of the final */ Problems: First the /**/ /***/ ... or /*foo**/ style comments. These are taken care of by 3. Next the /**foo*/ and /*foo*baz*/ style comments. These are taken care of by 2. The final trick is that the first */ is matched and thus comments of the form /*foo*/a = b;/*baz*/ are matched in the same style as PCC. ie. /*foo*/ and /*baz*/ are matched and a = b; is not. Now for the real way to do it. Simply recognize the /* and then call a three line c program that "eats" up the comment! (The c preprocessor does the comment removal for the c complier). The real reason this looks so complicated is because it was derived not thought out! -- Jim Hogue {kpno, ihnp4}!hsi!hogue Health Systems International New Haven, CT 06511
merlyn@sequent.UUCP (05/04/84)
> From: anderson@uwvax.UUCP > Subject: Comment recognition in Lex, again > Message-ID: <245@uwvax.ARPA> > > I have received several replies to my request for a lex expression > to recognize /* ... */ comments. The only one that works (sent in > by Jim Hogue) is > > "/*"([^*]*"*"*"*"[^/*])*[^*]*"*"*"*/" > > which I can't claim to fully understand. Nor do I understand why my > original, "/*"([^*]|("*"/[^/]))*"*/", doesn't work. The idea is that > each character in the string between /* and */ can either be something > other than *, or * followed by something other than /. I looked at this expression for a while (translated it into railroad tracks so I could study it as an FSM). It's sound, but utterly complex. That is to say, it will match everything that is considered a C-comment, and nothing else. My previous suggestion (deleting the "/" before the "[^/]") fails for cases of /***/, because the second * and the third * are matched in the middle of the parthensized expression, leaving no * to use with the trailing /. Actually, all you have to do is document this :-). If you want simplicity, here's the way I do it (with start states!) #### %s INCOMMENT #### #### <INITIAL>"/*" { #### BEGIN INCOMMENT; #### } #### <INCOMMENT>(.|\n) { #### /* ignore */; #### } #### <INCOMMENT>"*/" { #### BEGIN INITIAL; #### } Works just fine, and is VERY clear. "INITIAL" is an undocumented state that represents the start state that you are in to begin with. If you have any other single char matchers in your lex script, make sure they are AFTER the middle pattern above, or are prefixed with another start state (even INITIAL, if you don't need any other start states). Randal L. ("(null)") Schwartz, esq. (merlyn@sequent.UUCP) (Official legendary sorcerer of the 1984 Summer Olympics) Sequent Computer Systems, Inc. (503)626-5700 (sequent = 1/quosine) UUCP: ...!XXX!sequent!merlyn where XXX is one of: decwrl nsc ogcvax pur-ee rocks34 shell teneron unisoft vax135 verdix
rjhurdal@watmath.UUCP (David Canzi) (05/05/84)
The puzzles in net.unix-wizards are better than the ones in net.puzzle. I blew two hours composing this: "/*"[^*]*"*"+([^*/][^*]*"*"+)*"/" printf("<<<%s>>>", yytext) ; and tested it on this input: asdqwa/*/asaas*/werwer sdf/**/sdfsdf/***/erwerwer cvb/*tdfg*/*xcvcv*/werwe ty/bcvb/*******/***/fssdf and it appears to work. Please let me know if you come up with cases where it doesn't work...
ian@utcsstat.UUCP (Ian F. Darwin) (05/05/84)
From: anderson@uwvax.ARPA I have received several replies to my request for a lex expression to recognize /* ... */ comments. The only one that works (sent in by Jim Hogue) is "/*"([^*]*"*"*"*"[^/*])*[^*]*"*"*"*/" which I can't claim to fully understand. Nor do I understand why my original, "/*"([^*]|("*"/[^/]))*"*/", doesn't work. The idea is that each character in the string between /* and */ can either be something other than *, or * followed by something other than /. Can anyone come up with an expression simpler than Hogue's that works? By "works", I mean put it in a real "lex" program, as in: (your expr) printf("recognized (%s)\n",yytext); and try it on inputs such as /***/, /*/*/, etc. -- David Anderson (uwvax!anderson) It's not clear what the goal of this exercise actually is. On the assumption that you are trying to build part of a compiler, here is a simple, *readable* code fragment which inputs a C program and eats all the comments, which is what you do in a real compiler. I have no wish to strain my eyes on Jim Hogue's excellent APL program (nor the original, for that matter), so I wrote this. It's actually more complex than it needs to be; this is more a comment on the simple-mindedness of lex than on my coding style (and no, I have nothing better to offer than lex; flames to /dev/null given that I do acknowledge lex's authors' accomplishments). Our approach is to use the lex ``start condition''. There can be many start conditions defined, but only 001 of them can be active at any time, and you can neither turn off a start condition (you can only say BEGIN 0, which returns to ``the normal state''), nor use rule 0 in the <...> prefix. With that in mind, here is lexcom.l: %START INCOM NOTIN %% <INCOM>"*/" {BEGIN NOTIN;} <INCOM>"/*" unput('*'); <INCOM>. ; <NOTIN>"/*" {BEGIN INCOM;} "/*" BEGIN INCOM; <NOTIN>. ECHO; %% In order to understand this, read carefully section 10 of the Lex manual. If you think you have a better (shorter) way, **try it before you post it** because the most obvious optimisations do not work! Geoff Collyer and I spent some time interpreting the manual and making the minimum test case that would work *correctly*. While our version is not as short as Jim Hogue's submission, %% "/*"([^*]*"*"*"*"[^/*])*[^*]*"*"*"*/" ; . ECHO; %% (I've made it into a full program) it has the following advantages: 1) it handles multi-line comments correctly. 2) it does not overflow lex's input buffer (see the admonition in section 5 of the Lex manual about ``Don't try to defeat this with expressions like ... ... or equivalents; the Lex-generated program will try to read the entire input file, causing internal buffer overflows.'' This is what the APL version does. Nor does it suffice to simply ``expand Lex's buffers''. This will not work on some machines and certainly not on binary-only systems (they do exist!). Here are the test cases used. Our program passes both. 'To pass' means to do the same thing that CPP does. (CPP is the zeroeth pass of the C compiler, the preprocessor, which normally eats the comments from live C programs when they are being compiled. I diffed the output as follows: cc -P t.c # produces t.i lexcom <t.c >t.i2 diff t.i t.i2 (where t.c contains the first test case) and got no differences. The second test case produces nothing but null lines (cpp and our program) and a core dump (the one line program). 1) A simple C fragment: /************************************/ /* this is a test program */ /**/int i=2; /* initialise an int */ /*/*/int j=3; /* init an int */ /* bletch end of file in the middle of a comment - a good test. 2) a longer fragment, but still valid C: (echo '/*'; cat /etc/passwd; echo '*/') | a.out (assuming that */ does not appear in /etc/passwd at any given moment; ours doesn't). Our program produced many newlines, as did CPP. The one-liner dumped core! Is there a moral in all this? I think it's that a program that is a few lines longer, but is readable and conforms to the manual, is in fact a better program. Ian Darwin, Toronto Canada {ihnp4|decvax}!utcsstat!ian -- Ian Darwin, Toronto uucp: utcsstat!ian Arpa: decvax!utcsstat!ian@Berkeley
mp@whuxle.UUCP (Mark Plotnick) (05/05/84)
The problem with "/*"([^*]|("*"/[^/]))*"*/" is that the right context handling in lex in nested regular expressions is a little nonintuitive. After lex recognizes the complete expression, it backs up one character because of the ``/[^/]'' expression. In case you still don't see the problem, run this lex program: a(b/c)c { printf("I saw this: %s\n", yytext); } . { printf("char: '%c'\n", yytext[0]); The first rule will NOT match ``abc'', but it will match ``abcc'', sort of. It prints out ``I saw this: ab''. To be safe, only use right context at the very end of your regular expression. Yet Another Way To Recognize Comments: I really don't enjoy beating my head against a wall playing with regular expressions and starting conditions. When we had to write a compiler a couple of years ago (any other AM295 survivors out there?), we did something like: "/*" { #define LEXEOF 0 int c, last_c='\0'; while ((c=input()) != LEXEOF) { if (last_c == '*' && c=='/') break; else last_c=c; } printf("comment seen\n"); if (c == LEXEOF) printf("EOF within comment\n"); } Moving some of the effort into the action routine allows you to easily add more context-dependent features, such as printing a warning message if there's a ';' within the comment, supporting nested comments, etc. Mark Plotnick
merlyn@sequent.UUCP (05/07/84)
> From: chris@umcp-cs.UUCP > Message-ID: <6879@umcp-cs.UUCP> > Date: Thu, 3-May-84 20:59:25 PDT > > A Lex expression that matches C comments: > > comment "/*"(\*[^/]|[^*]*)*"*/" > > (The * after the first ] shouldn't be necessary, I think, but you > did specify that the expression come from working code...) Funny, your "working code" doesn't match /***/ .. try it! Randal L. ("no comment") Schwartz, esq. (merlyn@sequent.UUCP) (Official legendary sorcerer of the 1984 Summer Olympics) Sequent Computer Systems, Inc. (503)626-5700 (sequent = 1/quosine) UUCP: ...!XXX!sequent!merlyn where XXX is one of: decwrl nsc ogcvax pur-ee rocks34 shell teneron unisoft vax135 verdix
merlyn@sequent.UUCP (05/07/84)
> From: rjhurdal@watmath.UUCP > Message-ID: <7666@watmath.UUCP> > Date: Fri, 4-May-84 14:58:05 PDT > > The puzzles in net.unix-wizards are better than the ones in net.puzzle. > I blew two hours composing this: > "/*"[^*]*"*"+([^*/][^*]*"*"+)*"/" printf("<<<%s>>>", yytext) ; > and tested it on this input: > asdqwa/*/asaas*/werwer > sdf/**/sdfsdf/***/erwerwer > cvb/*tdfg*/*xcvcv*/werwe > ty/bcvb/*******/***/fssdf > and it appears to work. Please let me know if you come up with cases > where it doesn't work... I can't seem to break this one (even spent 20 minutes with a little "railroad-track" finite-state-machine model). However, it doesn't appear as elegant as a solution sent in a private message to me from Andrew Klossner @ tektronix: "/*"([^*]*"*"+[^/])*"*"*"/" This one I can't break either. (No claims for lack of human error, however.) The solution I submitted with start states (mail me if you didn't get that one) was preferred in another private communication for a reason that I had failed to notice at the time... all lex regular expressions which attempt to scarf up a comment in one fell swoop can overflow the fixed-size "yytext[]" array EASILY. Take, for example, a typical start-of-file log produced by the RCS $Log:$ stuff, or an in-source manpage (yes, I've seen them). Ick. Start states avoid that hassle. It was pointed out to me in that same private communication that any lex rules that are not qualified by a start state are STILL active inside the comments. Boo. I forgot about that. I've learned to prefix ALL my rules by start states (even if it is just <INITIAL>). I had forgotten that I do that regularly. Randal L. ("no comment") Schwartz, esq. (merlyn@sequent.UUCP) (Official legendary sorcerer of the 1984 Summer Olympics) Sequent Computer Systems, Inc. (503)626-5700 (sequent = 1/quosine) UUCP: ...!XXX!sequent!merlyn where XXX is one of: decwrl nsc ogcvax pur-ee rocks34 shell teneron unisoft vax135 verdix
merlyn@sequent.UUCP (05/07/84)
Yucko... in that last message from me, <487@sequent.UUCP>: "/*"([^*]*"*"+[^/])*"*"*"/" should have been: "/*"([^*]*"*"+[^/])*"*"+"/" (Talk about human error!) My apologies, Andrew. Randal L. ("typo-matic") Schwartz, esq. (merlyn@sequent.UUCP) (Official legendary sorcerer of the 1984 Summer Olympics) Sequent Computer Systems, Inc. (503)626-5700 (sequent = 1/quosine) UUCP: ...!XXX!sequent!merlyn where XXX is one of: decwrl nsc ogcvax pur-ee rocks34 shell teneron unisoft vax135 verdix
chris@umcp-cs.UUCP (05/10/84)
All right, all right, I goofed! (There, he admitted it. Happy? :->) Yes, my short expression doesn't match /***/. My code happens to work because of what it does by default with stuff it doesn't recognize. Oh well. Anyone got a foot-from-mouth removal device? -- In-Real-Life: Chris Torek, Univ of MD Comp Sci (301) 454-7690 UUCP: {seismo,allegra,brl-bmd}!umcp-cs!chris CSNet: chris@umcp-cs ARPA: chris@maryland
loverso@sunybcs.UUCP (05/13/84)
Might I just add one tidbit on comment recognition: Having just finished a compilers course, where we talked somewhat about lex/yacc, I, too, came across the bug in lex where the trailing "/" would not be matched. Also, as far as I can remember, not one article had a true comment recognizer, as they would all produce incorrect output on: printf(" just a /* test */ "); (Unless you're all assuming a rule already handling strings...) -- John Robert LoVerso @ SUNY Buffalo UUCP allegra!{watmath|rocksvax}!sunybcs!loverso (watmath preferred) CSnet loverso.buffalo-cs@csnet-relay -- Repeal the Income Tax / Support the Liberty Amendment!
chris@hwcs.UUCP (Chris Miller) (05/18/84)
The following is a fully general comment recogniser for /* ... */ comments in 'lex' - I have used definitions to make it a little more readable (I just can't cope with things like ("*"[^*]*)!). It should be pointed out that I don't believe that this is the RIGHT way to handle comments unless it is essential to retain their text; comments can be very long, and trying to match them with 'lex' can easily overflow buffers. I prefer solutions which match the opening /* and then throw away the rest of the comment in the action routine, using a bit of ad hoccery. ____________________________________________________________________ STAR "*" SLASH "/" CSTART ({SLASH}{STAR}) CUNIT ([^*]|{STAR}+[^/*]) CBODY ({CUNIT}*) CEND ({STAR}+{SLASH}) COMMENT ({CSTART}{CBODY}{CEND}) %% {COMMENT} printf("COMMENT '%s'\n", yytext); %% yywrap() { exit(0); } main() { for (;;) yylex(); } ____________________________________________________________________ One problem with the original non-working version is that it fails for comments terminated by an EVEN number of asterisks and a /. This seems to be a common bug in distributed compilers, etc, even when they don't use 'lex' for token generation. I have encountered this bug in several C compilers and their corresponding lints (of course, since lint usually uses cpp), and also in the original distribution of CProlog - you may find it entertaining to try out /** This is a legal comment **/ on any language systems which OUGHT to accept it. The fix is almost always trivial - the problem comes from reading the character following an asterisk without subsequently putting it back in the input if it happens to be another asterisk. Chris Miller Heriot-Watt Computer Science ...ukc!edcaad!hwcs
mike@sdcrdcf.UUCP (Michael Williams) (05/19/84)
printf("If you're preparing to handle /* all */ comments,"); /* Be prepared for comments like: /* "these" */ /* "or variants such as: /* "these", "/*", "*/" */" */ /* and mismatched" quotes as "well" */ /* If you're in a comment-istic state, all "quotes" are censored. * If your (valid state is) "quoted", all comments are ignored. */ Mike
howard@metheus.UUCP (Howard A. Landman) (05/19/84)
The state machine pictured in the referenced article is wrong. It should have an extra loop on top of the second state box. The loop should be labelled "/". Howard A. Landman ogcvax!metheus!howard