[comp.lang.c] What's wrong with this lex grammar?

820785gm@aucs.UUCP (09/21/87)

       Well, I wasn't sure where to post this, but I'm sure someone on
this group will know the answer. My question concerns the regular
expressions of lex. For recognizing comments (i.e. (*...*) ) I have a
horribly long pattern that works. However, I also have a much nicer one,
and I don't understand why it doesn't work. (Nobody here seems able to tell
me why, either.) Here is the short lex program:


%%
"(*"([^*]|("*"/[^)]))*"*)"     printf("&%s&\n", yytext);


  but it doesn't work.  Would somebody mind explaining why?

The part that really confuses me is shown in this run:


(*andrew*) macleod (*was*) here
&(*andrew*&
) macleod &(*was*&
) here


(**)kk
&(**&             It recognizes the comment here, but it doesn't
)kk               accept the last ')'.


(****)             Anyway, can anybody help me?  I'm sure it must be simple.
&(****&
)

(***)
*)                 Oh yeah, it's under 4.3BSD.
&(***)
*&
)

                          Thanks a lot,       ANDROID.

rsalz@bbn.com (Richard Salz) (09/22/87)

In comp.lang.c (<443@aucs.UUCP>), 820785gm@aucs.UUCP (ANDREW MACLEOD) writes:
] how do I write a lex regexp that recognizes comments (* .... *) ?

You're probably better off doing what I did.  This is from "real-life"
code:

	/* State of our comment automaton. */
	typedef enum {
	    S_STAR, S_NORMAL, S_END
	} STATE;

	"/*"	        {
			    /* Comment. */
			    register STATE	 S;

			    for (S = S_NORMAL; S != S_END; )
				switch (input()) {
				    case '\0':	/* lex's input() returns 0 at end of file */
					S = S_END;
					break;
				    case '*':
					S = S_STAR;
					break;
				    case '/':
					if (S == S_STAR) {
					    S = S_END;
					    break;
					}
					/* FALLTHROUGH */
				    default:
					S = S_NORMAL;
					break;
				}
			}
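
For the (* ... *) comments in the original question, the same automaton
can be written as a plain C function.  The following is a minimal sketch,
not code from the posting: the name skip_pascal_comment, and the choice to
scan a NUL-terminated string instead of calling lex's input(), are
assumptions made so that it can be compiled and tried on its own.

```c
#include <assert.h>
#include <stddef.h>

/* States of the comment automaton, mirroring the STATE enum above. */
typedef enum { S_NORMAL, S_STAR, S_END } State;

/*
 * Hypothetical helper: `s` is the text that follows an opening "(*".
 * Returns the number of characters consumed up to and including the
 * first closing "*)", or -1 if the text ends before the comment closes.
 */
int skip_pascal_comment(const char *s)
{
    State st = S_NORMAL;
    int i = 0;

    assert(s != NULL);
    while (st != S_END) {
        char c = s[i];

        if (c == '\0')
            return -1;              /* unterminated comment */
        i++;
        switch (c) {
        case '*':
            st = S_STAR;            /* a ')' right after this closes it */
            break;
        case ')':
            st = (st == S_STAR) ? S_END : S_NORMAL;
            break;
        default:
            st = S_NORMAL;
            break;
        }
    }
    return i;
}
```

Called on the text after the opener, it returns 8 for "andrew*) macleod",
2 for "*)kk", and -1 when no "*)" is found.  Nothing is copied anywhere, so
the length of the comment is not limited by any buffer.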

-- 
For comp.sources.unix stuff, mail to sources@uunet.uu.net.

vern%lbl-helios@LBL-RTSG.ARPA (Vern Paxson) (09/22/87)

Regarding the lex pattern:

	"(*"([^*]|("*"/[^)]))*"*)"     printf("&%s&\n", yytext);
		      ^
		      |

Lex does not handle trailing context properly inside of ()'s.
Essentially you can only have one instance of trailing context in a
pattern, and it splits the pattern into two halves, the
stuff-to-be-matched and the stuff-that-must-follow.  This limitation is
well rooted in the overall DFA approach lex uses for pattern matching,
so don't expect it to go away.  The limitations mentioned by Esmond
Pitt in comp.unix.wizards (that ^ and $ are not matched inside ()'s)
have a similar origin.
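
As a rough illustration of the two halves, here is what a top-level rule
such as "*"/[^)] amounts to when written out by hand in C.  This is only a
sketch: the function name is made up, and lex itself does this work inside
its generated DFA, not with code like this.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical illustration of the rule  "*"/[^)]  used on its own.
 * Returns the number of characters that end up in the token (1, the
 * star) when `s` starts with '*' followed by some character other
 * than ')'.  Returns 0 when the rule does not apply.  The following
 * character is inspected but never consumed.
 */
size_t match_star_with_context(const char *s)
{
    assert(s != NULL);
    if (s[0] != '*')
        return 0;               /* the stuff-to-be-matched is absent */
    if (s[1] == '\0' || s[1] == ')')
        return 0;               /* the stuff-that-must-follow is absent */
    return 1;                   /* only the '*' belongs to the token */
}
```

The point of the sketch is that the check on s[1] happens once, at the end
of the token.  Nesting that check inside a repeated group, as the pattern
above tries to do, asks for one such split per iteration.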

		Vern

	Vern Paxson				vern@lbl-csam.arpa
	Real Time Systems			ucbvax!lbl-csam.arpa!vern
	Lawrence Berkeley Laboratory		(415) 486-6411

wsmith@uiucdcsb.cs.uiuc.edu (09/23/87)

This works well:

nostarp		([^*]|\*[^)])
%%
\(\*{nostarp}*\*\)	return(PCOMMENT);   /* normal comment */

\(\*{nostarp}*		return(BADCOMMENT); /* unterminated comment */


Bill Smith
wsmith@a.cs.uiuc.edu
ihnp4!uiucdcs!wsmith

chris@mimsy.UUCP (Chris Torek) (09/25/87)

In article <443@aucs.UUCP> 820785gm@aucs.UUCP (ANDREW MACLEOD) writes:
>... For recognizing comments (ie (*...*) ) I have this horrible long
>statement that works.  However, I have a much nicer one that I don't
>understand why it doesn't work. ...  Here is  the short lex program:
> 
>"(*"([^*]|("*"/[^)]))*"*)"     printf("&%s&\n", yytext);

>(*andrew*) macleod (*was*) here
>&(*andrew*&
>) macleod &(*was*&
>) here

Your `/' (right context) causes lex to back up one character after
matching a string that contains one `*)'.  In general, right context
does not work; it should be used sparingly, if at all.

The expression

	\(\*(\*[^)]|[^*])*\*+\)

should work.  You should be aware that this stores the entire
comment in the array `yytext'---whether it fits or not.  Comments
can easily overrun lex's token buffer.
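
The patterns in this thread can be tried outside lex with POSIX
regcomp()/regexec(), since the constructs they use mean the same thing in
POSIX extended regular expressions.  The sketch below is a test harness
built on assumptions, not anything from the original posts: whole_match,
SMITH, TOREK and STRICT are made-up names, the ^...$ anchors are added so
that each call asks whether a pattern can match an entire string, and
STRICT is a variant whose inner loop never consumes a `*' that is followed
by `)'.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if the POSIX extended regular
 * expression `pattern` matches `text`, 0 otherwise.  The caller
 * supplies ^ and $ when a whole-string match is wanted.
 */
int whole_match(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;               /* treat a bad pattern as "no match" */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Bill Smith's pattern with {nostarp} expanded, anchored. */
const char *SMITH  = "^\\(\\*([^*]|\\*[^)])*\\*\\)$";
/* The pattern given just above, anchored. */
const char *TOREK  = "^\\(\\*(\\*[^)]|[^*])*\\*+\\)$";
/* Assumed stricter variant: stars are only consumed before non-* non-). */
const char *STRICT = "^\\(\\*([^*]|\\*+[^*)])*\\*+\\)$";
```

With these definitions, SMITH does not match (***), and TOREK matches all
of (*a**) b *).  The second case matters because lex always takes the
longest match, so the comment would swallow the text up to a later `*)'.
STRICT matches (***) and refuses to run past the first `*)'.  In lex syntax
the stricter pattern would be written "(*"([^*]|"*"+[^*)])*"*"+")".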
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris