[comp.lang.c] Unixworld competition - SOLUTIONS

gls@corona.ATT.COM (Col. G. L. Sicherman) (04/02/91)

As I promised, here are the bugs in the prize-winning comment-
stripping programs.  First the C program:

#include <stdio.h>
char *sccsID="@(#) cstrip.c 1.1 Bart J. Besseling, 8/90";

int m[9][8]  = { /* finite-state machine */
/* events:
        /    *    "    '    \   \n   sp    ch    states: */
    { 0x01,0x80,0x85,0x87,0x80,0x80,0x80,0x80 }, /* 0: hunt */
    { 0x02,0x33,0xc0,0xc0,0xc0,0xc0,0xc0,0xc0 }, /* 1: maybe */
    { 0x02,0x02,0x02,0x02,0x02,0x80,0x02,0x02 }, /* 2: c++ */
    { 0x13,0x14,0x13,0x13,0x13,0x83,0x83,0x13 }, /* 3: c */
    { 0x10,0x13,0x13,0x13,0x13,0x83,0x83,0x13 }, /* 4: end c */
    { 0x85,0x85,0x80,0x85,0x86,0x80,0x85,0x85 }, /* 5: string */
    { 0x85,0x85,0x85,0x85,0x85,0x85,0x85,0x85 }, /* 6: \ in str */
    { 0x87,0x87,0x87,0x80,0x88,0x80,0x87,0x87 }, /* 7: char */
    { 0x87,0x87,0x87,0x87,0x87,0x87,0x87,0x87 }, /* 8: \ in char */
};

int
main() /* Input parser and output generator */
{
    register int    ch, event, state;

    for (state = 0; (ch = getchar()) != EOF;) {
        /* translate character into event */
        switch (ch) {
            case '/':   event = 0; break;
            case '*':   event = 1; break;
            case '"':   event = 2; break;
            case '\'':  event = 3; break;
            case '\\':  event = 4; break;
            case '\n':  event = 5; break;
            case '\t':
            case ' ':   event = 6; break;
            default:    event = 7; break;
        }
        /* obtain next state and operation from machine */
        state = m[state & 0x0f][event];
        /* perform operation */
        if (state & 0x10) putchar(' ');
        if (state & 0x20) putchar(' ');
        if (state & 0x40) putchar('/');
        if (state & 0x80) putchar(ch);
    }
    return 0;
}

The transition matrix has an erroneous entry: in state 4 ("end c"), a
second asterisk sends the automaton back to state 3 (entry 0x13)
instead of leaving it in state 4 (0x14).  The program will therefore
fail to terminate any comment that ends in "**/", such as

	/* This compiles, though it shouldn't. **/
	IDENTIFICATION DIVISION.
	/* What's a COBOL statement doing here? */
	main() {printf("hello, world\n");}

Ian Collier found the bug and told me so.  If you found the bug and
didn't tell me, that's all right too.
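
For anyone who wants to see the intended behavior in isolation, here
is a minimal sketch of a comment-end scanner that copes with runs of
asterisks.  It is not taken from cstrip.c; the function name and the
two-state enum are assumptions made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper, not part of cstrip.c: s points just past an
 * opening slash-star.  Return the number of characters up to and
 * including the closing star-slash, or 0 if the comment never closes. */
size_t comment_body_length(const char *s)
{
    enum { IN_COMMENT, SAW_STAR } state = IN_COMMENT;
    size_t i;

    for (i = 0; s[i] != '\0'; i++) {
        if (state == SAW_STAR && s[i] == '/')
            return i + 1;       /* closing delimiter found */
        /* Key point: an asterisk always leaves us in SAW_STAR, even
         * if we were already there, so a run of asterisks followed by
         * a slash still closes the comment. */
        state = (s[i] == '*') ? SAW_STAR : IN_COMMENT;
    }
    return 0;                   /* unterminated comment */
}
```

Called on the text after the opening delimiter of the first comment
in the example above, this should stop at the "**/" rather than
running on into the COBOL line.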

Now the lex program:

%Start CODE CCOM STRING CHAR CPLUS
%%
%{
	char *sccsID = "@(#) sc 1.0 Andre van Dalen, 6/90";
	BEGIN CODE;
%}
<STRING>([^\\]\")|(\\\\\")	|
<CHAR>([^.\\]\')|(\\\\\')	|
<CPLUS>\n	{ ECHO; BEGIN CODE; }
<CCOM>"*/"	{ two_space(); BEGIN CODE; }
<CCOM,CPLUS>.	{ output(*yytext=='\t'?'\t':' ');} 
<CODE>"/*"	{ two_space(); BEGIN CCOM ; }
<CODE>"//"	{ two_space(); BEGIN CPLUS ;}
<CODE>\"	{ ECHO; BEGIN STRING; }
<CODE>\'	{ ECHO; BEGIN CHAR; }
<STRING,CODE>.	{ ECHO; }
%%
two_space()
{
	output(' '); output(' ');
}
main(argc, argv)
int argc; char **argv;
{
	if (argc==1) yylex();
	else while (*++argv) {
		fclose(yyin);
		if (!(yyin=fopen(*argv,"r"))) {
			perror(*argv);
			exit(1);
		}
		yylex();
	}
	exit(0);
}

This one doesn't handle multiple backslashes, though lex has the power
to do so easily.  A program like this will break it:

	main()
	{
	    char *str = "This string has everything \\\" /* and more!\n";

	    printf(str);
	}
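
The STRING rule looks at no more than two characters before the quote,
so in \\\" it takes the last two backslashes plus the quote as the end
of the string.  What matters is the parity of the whole run of
backslashes.  A rough sketch of that test in C follows; the function
name is hypothetical and nothing here comes from the lex program.

```c
#include <stddef.h>

/* Hypothetical helper: is the character at s[pos] escaped?  It is
 * exactly when an odd number of backslashes immediately precede it. */
int is_escaped(const char *s, size_t pos)
{
    size_t run = 0;

    /* Count the unbroken run of backslashes ending just before pos. */
    while (pos > run && s[pos - run - 1] == '\\')
        run++;
    return run % 2 == 1;
}
```

With three backslashes in front of the quote the run is odd, so the
quote is escaped and the string goes on; with two it is even, and the
quote really does close the string.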

Finally, the shell program:

# @(#) sc  Strip comments from a C/C++ source file
# Author: Carl Bergerson, August 1990
# set -x    # Uncomment for debugging
# Define correct usage message:
USAGE="Usage: $0 [sourcefile]"
case $# in
    0)  sed -e 's/^#/a#/' | /lib/cpp |
        sed -e '/^#/d' -e 's/^a#/#/';;
    1)  sed -e 's/^#/a#/' $1 | /lib/cpp |
        sed -e '/^#/d' -e 's/^a#/#/';;
    *)  echo $USAGE >&2
        exit 1 ;;
esac

Even assuming that /lib/cpp is a C++ preprocessor that strips // comments,
this can be broken with a little ingenuity:

	/*
	 * play music on your home computer
	 */
	main() {
		printf("press the return key to hear Mozart's sonata in \
	a# ");
		getchar();
		play();
	}

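The script uses "a#" at the start of a line as a flag for a protected
"#", but it is not a safe flag: a source line that already begins with
"a#" loses its "a" in the final sed pass.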
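
To make the collision concrete, here is a small C sketch that imitates
the script's first and last sed substitutions on a single line.  The
function names are invented for the illustration, and the sketch
leaves out /lib/cpp and the /^#/d step entirely.

```c
#include <string.h>

/* Hypothetical stand-ins for the script's two sed substitutions,
 * applied to one line held in buf (which must have room for one
 * extra character). */

/* First pass, s/^#/a#/ : hide a leading '#' from the preprocessor. */
void protect(char *buf)
{
    if (buf[0] == '#') {
        memmove(buf + 1, buf, strlen(buf) + 1);
        buf[0] = 'a';
    }
}

/* Last pass, s/^a#/#/ : strip the flag again.  It cannot tell a flag
 * that protect() added from an "a#" that was in the source all along. */
void restore(char *buf)
{
    if (buf[0] == 'a' && buf[1] == '#')
        memmove(buf, buf + 1, strlen(buf + 1) + 1);
}
```

A line such as "#include <stdio.h>" survives the round trip, but the
continuation line from the example goes through protect() untouched
and comes out of restore() without its leading "a".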
-- 
G. L. Sicherman
gls@corona.att.COM

andre@targon.UUCP (andre) (04/17/91)

In article <1991Apr2.012639.25454@cbnewsh.att.com> gls@corona.ATT.COM (Col. G. L. Sicherman) writes:
>As I promised, here are the bugs in the prize-winning comment-
 [ removed other two programs ]

Now the lex program:

>%}
><STRING>([^\\]\")|(\\\\\")	|
><CHAR>([^.\\]\')|(\\\\\')	|
><CPLUS>\n	{ ECHO; BEGIN CODE; }
><CCOM>"*/"	{ two_space(); BEGIN CODE; }
><CCOM,CPLUS>.	{ output(*yytext=='\t'?'\t':' ');} 
><CODE>"/*"	{ two_space(); BEGIN CCOM ; }
><CODE>"//"	{ two_space(); BEGIN CPLUS ;}
><CODE>\"	{ ECHO; BEGIN STRING; }
><CODE>\'	{ ECHO; BEGIN CHAR; }
><STRING,CODE>.	{ ECHO; }
>%%
>
>This one doesn't handle multiple backslashes, though lex has the power
>to do so easily.  A program like this will break it:
>
>	main()
>	{
>	    char *str = "This string has everything \\\" /* and more!\n";
>
>	    printf(str);
>	}

OK, OK, you win. I made a mistake (blush). How about the following?
This should fix it, no?

Just change the first three lines of the lex part to:

<STRING,CODE,CHAR>\\.	{ ECHO; }
<STRING>\"	|
<CHAR>\'	|
<CPLUS>\n      { ECHO; BEGIN CODE; }
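
The idea behind the \\. rule is to swallow a backslash and whatever
follows it as one unit, so that a quote is only ever seen when it is
not escaped.  The same left-to-right scan might look like this in C;
this is only a sketch, and the function name is an assumption, not
something from the lex source.

```c
#include <stddef.h>

/* Hypothetical helper: s points just past an opening double quote.
 * Return the index of the closing quote, or -1 if there is none. */
long string_end(const char *s)
{
    size_t i = 0;

    while (s[i] != '\0') {
        if (s[i] == '\\' && s[i + 1] != '\0')
            i += 2;             /* backslash plus next char: one unit */
        else if (s[i] == '"')
            return (long)i;     /* unescaped quote closes the string */
        else
            i++;
    }
    return -1;
}
```

On the body of the string from the test program, the three
backslashes and the quote are consumed as two pairs, so the scan
carries on to the real closing quote.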

	happy lexing...

-- 
The mail|    AAA         DDDD  It's not the kill, but the thrill of the chase.
demon...|   AA AAvv   vvDD  DD        Ketchup is a vegetable.
hits!.@&|  AAAAAAAvv vvDD  DD                    {nixbur|nixtor}!adalen.via
--more--| AAA   AAAvvvDDDDDD    Andre van Dalen, uunet!hp4nl!targon!andre