gls@corona.ATT.COM (Col. G. L. Sicherman) (04/02/91)
As I promised, here are the bugs in the prize-winning comment-stripping
programs.  First the C program:

#include <stdio.h>

char *sccsID="@(#) cstrip.c 1.1 Bart J. Besseling, 8/90";

int m[9][8] = { /* finite-state machine */
/* events:    /    *    "    '    \    \n   sp   ch       states:      */
          { 0x01,0x80,0x85,0x87,0x80,0x80,0x80,0x80 }, /* 0: hunt      */
          { 0x02,0x33,0xc0,0xc0,0xc0,0xc0,0xc0,0xc0 }, /* 1: maybe     */
          { 0x02,0x02,0x02,0x02,0x02,0x80,0x02,0x02 }, /* 2: c++       */
          { 0x13,0x14,0x13,0x13,0x13,0x83,0x83,0x13 }, /* 3: c         */
          { 0x10,0x13,0x13,0x13,0x13,0x83,0x83,0x13 }, /* 4: end c     */
          { 0x85,0x85,0x80,0x85,0x86,0x80,0x85,0x85 }, /* 5: string    */
          { 0x85,0x85,0x85,0x85,0x85,0x85,0x85,0x85 }, /* 6: \ in str  */
          { 0x87,0x87,0x87,0x80,0x88,0x80,0x87,0x87 }, /* 7: char      */
          { 0x87,0x87,0x87,0x87,0x87,0x87,0x87,0x87 }, /* 8: \ in char */
};

int main() /* Input parser and output generator */
{
    register int ch, event, state;

    for (state = 0; (ch = getchar()) != EOF;) {
        /* translate character into event */
        switch (ch) {
        case '/':  event = 0; break;
        case '*':  event = 1; break;
        case '"':  event = 2; break;
        case '\'': event = 3; break;
        case '\\': event = 4; break;
        case '\n': event = 5; break;
        case '\t':
        case ' ':  event = 6; break;
        default:   event = 7; break;
        }
        /* obtain next state and operation from machine */
        state = m[state & 0x0f][event];
        /* perform operation */
        if (state & 0x10) putchar(' ');
        if (state & 0x20) putchar(' ');
        if (state & 0x40) putchar('/');
        if (state & 0x80) putchar(ch);
    }
    return 0;
}

The transition matrix has an erroneous entry that resets the automaton
after two asterisks.  The program will fail to terminate any comment
that ends in "**/", such as

/* This compiles, though it shouldn't. **/
IDENTIFICATION DIVISION.
/* What's a COBOL statement doing here? */
main() {printf("hello, world\n");}

Ian Collier found the bug and told me so.  If you found the bug and
didn't tell me, that's all right too.
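
The "**/" failure can be reproduced in isolation. The sketch below copies the posted table and character classes into a string-to-string helper; run_machine and its star_in_end_c parameter are hypothetical names introduced only for this illustration. It assumes the erroneous entry is the one for '*' in state 4 ("end c"), which the post gives as 0x13 (back to state 3), and that 0x14 (stay in state 4) is a plausible correction. The post itself does not name the entry or the fix.

```c
/* Transition table as posted: low nibble = next state, high bits = output. */
static const int posted_m[9][8] = {
    { 0x01,0x80,0x85,0x87,0x80,0x80,0x80,0x80 }, /* 0: hunt      */
    { 0x02,0x33,0xc0,0xc0,0xc0,0xc0,0xc0,0xc0 }, /* 1: maybe     */
    { 0x02,0x02,0x02,0x02,0x02,0x80,0x02,0x02 }, /* 2: c++       */
    { 0x13,0x14,0x13,0x13,0x13,0x83,0x83,0x13 }, /* 3: c         */
    { 0x10,0x13,0x13,0x13,0x13,0x83,0x83,0x13 }, /* 4: end c     */
    { 0x85,0x85,0x80,0x85,0x86,0x80,0x85,0x85 }, /* 5: string    */
    { 0x85,0x85,0x85,0x85,0x85,0x85,0x85,0x85 }, /* 6: \ in str  */
    { 0x87,0x87,0x87,0x80,0x88,0x80,0x87,0x87 }, /* 7: char      */
    { 0x87,0x87,0x87,0x87,0x87,0x87,0x87,0x87 }, /* 8: \ in char */
};

/* Same character classes as the switch in the posted main(). */
static int event_of(int ch)
{
    switch (ch) {
    case '/':  return 0;
    case '*':  return 1;
    case '"':  return 2;
    case '\'': return 3;
    case '\\': return 4;
    case '\n': return 5;
    case '\t':
    case ' ':  return 6;
    default:   return 7;
    }
}

/*
 * Hypothetical helper, not part of the posted program: run the machine
 * over the string `in` and store the result in `out`, which must hold at
 * least 2 * strlen(in) + 1 bytes for the two entry values tried here.
 * `star_in_end_c` overrides the table entry for '*' in state 4, so the
 * posted value (0x13) and the assumed correction (0x14) can be compared.
 */
void run_machine(const char *in, char *out, int star_in_end_c)
{
    int state = 0;

    for (; *in != '\0'; in++) {
        int row = state & 0x0f;
        int ev = event_of((unsigned char)*in);

        state = (row == 4 && ev == 1) ? star_in_end_c : posted_m[row][ev];
        if (state & 0x10) *out++ = ' ';
        if (state & 0x20) *out++ = ' ';
        if (state & 0x40) *out++ = '/';
        if (state & 0x80) *out++ = *in;
    }
    *out = '\0';
}
```

Feeding it "/* x **/ y" with 0x13 blanks out the trailing y along with the comment, because the second asterisk drops the machine back into state 3; with 0x14 the y survives.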
Now the lex program:

%Start CODE CCOM STRING CHAR CPLUS
%%
%{
char *sccsID = "@(#) sc 1.0 Andre van Dalen, 6/90";
BEGIN CODE;
%}
<STRING>([^\\]\")|(\\\\\")      |
<CHAR>([^.\\]\')|(\\\\\')       |
<CPLUS>\n                       { ECHO; BEGIN CODE; }
<CCOM>"*/"                      { two_space(); BEGIN CODE; }
<CCOM,CPLUS>.                   { output(*yytext=='\t'?'\t':' ');}
<CODE>"/*"                      { two_space(); BEGIN CCOM ; }
<CODE>"//"                      { two_space(); BEGIN CPLUS ;}
<CODE>\"                        { ECHO; BEGIN STRING; }
<CODE>\'                        { ECHO; BEGIN CHAR; }
<STRING,CODE>.                  { ECHO; }
%%
two_space()
{ output(' '); output(' '); }

main(argc, argv)
int argc; char **argv;
{
    if (argc==1) yylex();
    else while (*++argv) {
        fclose(yyin);
        if (!(yyin=fopen(*argv,"r"))) { perror(*argv); exit(1); }
        yylex();
    }
    exit(0);
}

This one doesn't handle multiple backslashes, though lex has the power
to do so easily.  A program like this will break it:

main()
{
    char *str = "This string has everything \\\" /* and more!\n";

    printf(str);
}

Finally, the shell program:

# @(#) sc Strip comments from a C/C++ source file
# Author: Carl Bergerson, August 1990
# set -x # Uncomment for debugging
# Define correct usage message:
USAGE="Usage: $0 [sourcefile]"
case $# in
0) sed -e 's/^#/a#/' | /lib/cpp | sed -e '/^#/d' -e 's/^a#/#/';;
1) sed -e 's/^#/a#/' $1 | /lib/cpp | sed -e '/^#/d' -e 's/^a#/#/';;
*) echo $USAGE >&2
   exit 1 ;;
esac

Even assuming that /lib/cpp is a C++ preprocessor that strips //
comments, this can be broken with a little ingenuity:

/*
 * play music on your home computer
 */
main()
{
    printf("press the return key to hear Mozart's sonata in \
a# ");
    getchar();
    play();
}

The script uses "a#" as a flag, but it is not a safe flag.
-- 
G. L. Sicherman
gls@corona.att.COM
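
To see why "a#" is not a safe flag, the script's two sed substitutions can be modeled on single lines in C. This is only a sketch: mark_line and unmark_line are hypothetical names, and the /lib/cpp pass and the '/^#/d' deletion between the two substitutions are left out, on the assumption that a line beginning with "a#" reaches the final substitution unchanged.

```c
#include <stdio.h>

/* Models the first pass, s/^#/a#/ : hide '#' lines from the preprocessor. */
void mark_line(const char *in, char *out, size_t outsz)
{
    if (in[0] == '#')
        snprintf(out, outsz, "a%s", in);
    else
        snprintf(out, outsz, "%s", in);
}

/* Models the last pass, s/^a#/#/ : restore the lines hidden by mark_line. */
void unmark_line(const char *in, char *out, size_t outsz)
{
    if (in[0] == 'a' && in[1] == '#')
        snprintf(out, outsz, "%s", in + 1);
    else
        snprintf(out, outsz, "%s", in);
}
```

A line such as #include <stdio.h> makes the round trip intact. The continuation line a# "); from the example is left alone by mark_line, since it does not begin with '#', but unmark_line cannot tell it from a marked line and returns # "); with the "a" removed.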
andre@targon.UUCP (andre) (04/17/91)
In article <1991Apr2.012639.25454@cbnewsh.att.com> gls@corona.ATT.COM (Col. G. L. Sicherman) writes:
>As I promised, here are the bugs in the prize-winning comment-
[ removed other two programs ]
Now the lex program:
>%}
><STRING>([^\\]\")|(\\\\\")      |
><CHAR>([^.\\]\')|(\\\\\')       |
><CPLUS>\n                       { ECHO; BEGIN CODE; }
><CCOM>"*/"                      { two_space(); BEGIN CODE; }
><CCOM,CPLUS>.                   { output(*yytext=='\t'?'\t':' ');}
><CODE>"/*"                      { two_space(); BEGIN CCOM ; }
><CODE>"//"                      { two_space(); BEGIN CPLUS ;}
><CODE>\"                        { ECHO; BEGIN STRING; }
><CODE>\'                        { ECHO; BEGIN CHAR; }
><STRING,CODE>.                  { ECHO; }
>%%
>
>This one doesn't handle multiple backslashes, though lex has the power
>to do so easily.  A program like this will break it:
>
>    main()
>    {
>        char *str = "This string has everything \\\" /* and more!\n";
>
>        printf(str);
>    }

OK, OK, you win. I made a mistake (blush). How about the following?
This should fix it, no?

Just change the first three lines of the lex part to:

<STRING,TEXT,CHAR>\\.   { ECHO; }
<STRING>\"              |
<CHAR>\'                |
<CPLUS>\n               { ECHO; BEGIN CODE; }

happy lexing...
-- 
The mail|    AAA         DDDD  It's not the kill, but the thrill of the chase.
demon...|   AA AAvv   vvDD  DD        Ketchup is a vegetable.
hits!.@&|  AAAAAAAvv vvDD  DD            {nixbur|nixtor}!adalen.via
--more--| AAA   AAAvvvDDDDDD    Andre van Dalen, uunet!hp4nl!targon!andre
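
The difference between the posted string rule and the proposed one can be checked outside lex. The C sketch below models each as a scan for the closing quote of a string literal; end_by_posted_rule and end_by_escape_pairs are hypothetical names, and the second is only an approximation of the "\\." rule, which consumes a backslash and the following character as one unit. Note that TEXT is not among the start conditions declared in the posted program (CODE CCOM STRING CHAR CPLUS); CODE may have been intended.

```c
/*
 * Both helpers expect s[0] to be the opening quote and return the index
 * of the character they take for the closing quote, or -1 if none.
 */

/* Models <STRING>([^\\]\")|(\\\\\") with <STRING,CODE>. as the fallback. */
int end_by_posted_rule(const char *s)
{
    int i = 1;

    while (s[i] != '\0') {
        /* a non-backslash character followed by a quote ends the string */
        if (s[i] != '\\' && s[i + 1] == '"')
            return i + 1;
        /* exactly two backslashes followed by a quote end it as well */
        if (s[i] == '\\' && s[i + 1] == '\\' && s[i + 2] == '"')
            return i + 2;
        i++;                    /* fallback: echo one character */
    }
    return -1;
}

/*
 * Models the proposed <STRING>\\. plus <STRING>\" rules: a backslash and
 * the character after it are skipped together, so only an unescaped
 * quote closes the string.  Unlike lex's '.', this sketch also skips an
 * escaped newline.
 */
int end_by_escape_pairs(const char *s)
{
    int i = 1;

    while (s[i] != '\0') {
        if (s[i] == '\\' && s[i + 1] != '\0')
            i += 2;             /* escape sequence: skip both characters */
        else if (s[i] == '"')
            return i;           /* unescaped quote: end of the literal */
        else
            i++;
    }
    return -1;
}
```

For the source text "a\\\" /* b" (three backslashes before the inner quote), the posted rule stops at index 5, the escaped quote, which leaves the scanner in CODE state facing "/*". The escape-pair scan continues to index 11, the real end of the literal.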