[comp.unix.questions] Any LEX gurus?

jwabik@cs.D.UMN.EDU (Jeff Wabik) (08/08/87)

I am using (trying to use) LEX (& YACC, of course) to write a parser
for a language, and I'm running into problems defining some regular
expressions that need to be fed to LEX.  In short, everything was 
going fine and I had my parser working 'til I remembered that I 
needed to completely ignore all preprocessor directives.  Well.. 
I thought that I could tell LEX to toss out all the preprocessor
directives, much the same way I was discarding comments:

	\{[^\n\}]*\}?		{ ; /* Toss out comments */ }

The problem here is that (as you'll note) this expression cannot cope with
comments that contain newlines (which is fine, since neither can the
compiler 8^), and for the LIFE of me, I CANNOT write an expression
that WILL accept similar preprocessor directives that NEED to contain
newlines.  In short, I need to nuke lines that have "??" as the first
non-white  character of a line, and ending somewhere down the code 
a bit with another matching "??".  I've tried things similar to:


	^{WS}*"??"(.|{WS})*"??"

{  Where {WS} is defined as whitespace (i.e. \n, \t, ' ')	}

Which, of course, doesn't work since the (.|{WS})* eats everything to the
end of the file and LEX segment faults..  I've also tried things like:

	^{WS}*"??"[^"??"]"??"

Which in my mind should work, but doesn't..  Arg. My problem is lack of
documentation, since all I've got to work from is the "Compiler Construction
under UNIX" book by Schreiner and Friedman..  Good book, but .. 

Any help would be greatly appreciated and would be responsible for 
retaining a fellow netter's sanity.

Thanks!


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=
Jeff A. Wabik
University of Minnesota, Duluth  55812

ARPA:	jwabik@cs-gw.d.umn.edu		UUCP:  {umn-cs!umnd-cs-gw!jwabik}

Live long and program.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

edw@ius1.cs.cmu.edu (Eddie Wyatt) (08/09/87)

In article <716@umnd-cs.D.UMN.EDU>, jwabik@cs.D.UMN.EDU (Jeff Wabik) writes:
> I am using (trying to use) LEX (& YACC, of course) to write a parser
> for a language, and I'm running into problems defining some regular
> expressions that need to be fed to LEX.  In short, everything was 
> going fine and I had my parser working 'til I remembered that I 
> needed to completely ignore all preprocessor directives.  Well.. 
> I thought that I could tell LEX to toss out all the preprocessor
> directives, much the same way I was discarding comments:
> 
> 	\{[^\n\}]*\}?		{ ; /* Toss out comments */ }
> 
> The problem here is that (as you'll note) this expression cannot cope with
> comments that contain newlines (which is fine, since neither can the
> compiler 8^), and for the LIFE of me, I CANNOT write an expression
> that WILL accept similar preprocessor directives that NEED to contain
> newlines.  In short, I need to nuke lines that have "??" as the first
> non-white  character of a line, and ending somewhere down the code 
> a bit with another matching "??".  I've tried things similar to:
> 
> 

   What I've done in similar situations is to detect the first "??", then
call a routine to scan to the next "??".  Here's an example:


comment		[\/][\*]
 ....

%%

{comment}		read_comment();
"\n"			linecount++;
{separator}		;
.....


	/* Scans until the closing comment delimiter is found in the
	   parse file.  (The delimiter is not spelled out here because
	   doing so would end this comment early.) */

static void read_comment()
    {
    register int ch;

    while (1)
        {
        switch (getc(yyin))
            {
            case '*' :
                do
                    {
                        /* end of comment */
                    if ((ch = getc(yyin)) == '/') return;

                        /* should actually be an error - hit EOF before
                           end of comment */
                    if (ch == EOF) return;

                        /* found newline - increment line count for
                           error reporting if needed */
                    if (ch == '\n') { linecount++; break; }
                    }
                while (ch == '*');
                break;

            case '\n' :
                        /* found newline - increment line count for
                           error reporting if needed */
                linecount++;
                break;

            case EOF :
                        /* should actually be an error - hit EOF before
                           end of comment */
                return;
            }
        }
    }
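
Adapted to the "??" delimiter from the original question, the same idea
might look roughly like the sketch below.  It is not code from the
original post: skip_directive and its parameters are made-up names, and
it scans an in-memory string rather than yyin so that it stands alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: p points just past the opening marker.
   Returns a pointer just past the closing pair of question marks,
   or NULL if the text ends first.  Newlines skipped along the way
   are added to *linecount for error reporting. */
const char *skip_directive(const char *p, int *linecount)
{
    assert(p != NULL && linecount != NULL);

    for (; *p != '\0'; p++) {
        if (*p == '\n')
            ++*linecount;
        else if (p[0] == '?' && p[1] == '?')
            return p + 2;
    }
    return NULL;  /* unterminated directive */
}
```

A NULL return gives the caller a place to report the "hit EOF before
end of directive" case instead of silently returning.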


-- 

					Eddie Wyatt

e-mail: edw@ius1.cs.cmu.edu

kenny@uiucdcsb.cs.uiuc.edu (08/09/87)

/* Written  4:09 pm  Aug  7, 1987 by jwabik@cs.D.UMN.EDU in uiucdcsb:comp.unix.questions */
/* ---------- "Any LEX gurus?" ---------- */
[Story of the woes trying to code a LEX expression that will accept
newlines, wanting to match preprocessor directives that look like:
	?? --stuff-possibly-containing-newlines-- ?? ]

The problem here is that (as you'll note) this expression cannot cope with
comments that contain newlines (which is fine, since neither can the
compiler 8^), and for the LIFE of me, I CANNOT write an expression
that WILL accept similar preprocessor directives that NEED to contain
newlines.

/* End of text from uiucdcsb:comp.unix.questions */

You COULD code a single regular expression to match all this stuff,
but it would be (1) really awkward, and (2) likely to overrun LEX's
token buffer trying to absorb your preprocessor directives.  A better
plan is to use lex's syntax for left context sensitivity.

Your file will begin with something like:

	%Start	DIRECTIVE
	%%
	<INITIAL>"??"		{ BEGIN DIRECTIVE; }
	<DIRECTIVE>"??"		{ BEGIN INITIAL; }
	<DIRECTIVE>.|"\n"	;

which tells lex the following:

	There is a new scanner state called DIRECTIVE state.  When you
	see the string "??" in the initial state, go to DIRECTIVE
	state.  If you see the string "??" in the DIRECTIVE state, go
	back to the initial state.  Any other characters in the
	DIRECTIVE state, including newlines, are ignored.

All the rest of your rules will have to have <INITIAL> placed in front
of them, to keep them from firing in DIRECTIVE state.  For example, a
complete LEX program to strip preprocessor directives in your format
and place the remainder of the program on stdout would consist of the
above, followed by the single line

	<INITIAL>.|"\n"		ECHO;

which says, ``echo any single character, including a newline, seen in
the INITIAL state to stdout.''
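
For readers without lex at hand, the two-state behaviour described above
can be approximated in plain C.  This is only a sketch of the idea, not
lex output: strip_directives is a made-up name, and the enum merely
borrows the INITIAL and DIRECTIVE state names from the rules above.

```c
#include <assert.h>
#include <stddef.h>

enum state { INITIAL, DIRECTIVE };

/* Copies `in` to `out`, dropping everything between a pair of
   double-question-mark markers, and the markers themselves.
   `out` must have room for strlen(in) + 1 bytes.  Returns the state
   at end of input, so the caller can detect an unterminated
   directive. */
enum state strip_directives(const char *in, char *out)
{
    enum state st = INITIAL;

    assert(in != NULL && out != NULL);

    while (*in != '\0') {
        if (in[0] == '?' && in[1] == '?') {
            /* marker seen: toggle state, like the two BEGIN rules */
            st = (st == INITIAL) ? DIRECTIVE : INITIAL;
            in += 2;
        } else {
            /* echo only in INITIAL; ignore in DIRECTIVE */
            if (st == INITIAL)
                *out++ = *in;
            in++;
        }
    }
    *out = '\0';
    return st;
}
```

Like the lex rules, this treats "??" anywhere as a marker; it does not
check that the opening marker is the first non-white text on its line.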

Kevin Kenny			UUCP: {ihnp4,pur-ee,convex}!uiucdcs!kenny
Department of Computer Science	ARPA: kenny@B.CS.UIUC.EDU (kenny@UIUC.ARPA)
University of Illinois		CSNET: kenny@UIUC.CSNET
1304 W. Springfield Ave.
Urbana, Illinois, 61801		Voice: (217) 333-8740

garrity@garrity.applicon.UUCP (08/13/87)

>> I am using (trying to use) LEX (& YACC, of course) to write a parser
>> for a language, and I'm running into problems defining some regular
>> expressions that need to be fed to LEX.  In short, everything was 
>> going fine and I had my parser working 'til I remembered that I 
>> needed to completely ignore all preprocessor directives.  Well.. 
>> I thought that I could tell LEX to toss out all the preprocessor
>> directives, much the same way I was discarding comments:
>> 
>> 	\{[^\n\}]*\}?		{ ; /* Toss out comments */ }
>> 
>> The problem here is that (as you'll note) this expression cannot cope with
>> comments that contain newlines (which is fine, since neither can the
>> compiler 8^), and for the LIFE of me, I CANNOT write an expression
>> that WILL accept similar preprocessor directives that NEED to contain
>> newlines.  In short, I need to nuke lines that have "??" as the first
>> non-white  character of a line, and ending somewhere down the code 
>> a bit with another matching "??".  I've tried things similar to:
>> 
>> 
>
>   What I've done in similar situations is to detect the first "??", then
>call a routine to scan to the next "??".  Here's an example:

  Why don't you just use a start condition?  This is exactly the sort of thing
they are great at.  Just do this.

%START COMMENT
%%
<COMMENT>"??"		BEGIN 0;
<COMMENT>.|\n		;
"??"			BEGIN COMMENT;
.			ECHO;

-- Mike Garrity
--
-- snail: Applicon, a division of Schlumberger Systems, Inc.
--        829 Middlesex Tpk.
--        P.O. box 7004
--        Billerica MA, 01821
--
-- uucp: {allegra|decvax|mit-eddie|utzoo}!linus!raybed2!applicon!garrity
--       {amd|bbncca|cbosgd|wjh12|ihnp4|yale}!ima!applicon!garrity

allbery@ncoast.UUCP (08/13/87)

As quoted from <716@umnd-cs.D.UMN.EDU> by jwabik@cs.D.UMN.EDU (Jeff Wabik):
+---------------
| 	^{WS}*"??"(.|{WS})*"??"
| 
| {  Where {WS} is defined as whitespace (i.e. \n, \t, ' ')	}
| 
| Which, of course, doesn't work since the (.|{WS})* eats everything to the
| end of the file and LEX segment faults..  I've also tried things like:
| 
| 	^{WS}*"??"[^"??"]"??"
+---------------

Don't try to do it -- I have done it, but lex-generated scanners which try to
match the whole directive in one pattern will overflow their buffers and dump core.

The best way to do it is something like:

^[ \t]*"??" {
	int ch, lastch;
	
	lastch = '\0';
	while ((ch = input()) != '?' || lastch != '?') {
		if (ch == 0) {
			yyerror("unexpected end of file");
			return 0;
		}
		lastch = ch;
	}
}
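
The anchored pattern above only fires when "??" is the first non-blank
text on a line.  Outside lex, that check could be sketched as below;
starts_directive is a made-up name used purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the first non-blank characters of `line` are two
   question marks, mirroring the anchored pattern; returns 0
   otherwise.  Only spaces and tabs count as blanks here. */
int starts_directive(const char *line)
{
    assert(line != NULL);

    while (*line == ' ' || *line == '\t')
        line++;
    return line[0] == '?' && line[1] == '?';
}
```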
-- 
 Brandon S. Allbery, moderator of comp.sources.misc and comp.binaries.ibm.pc
  {{harvard,mit-eddie}!necntc,well!hoptoad,sun!mandrill!hal}!ncoast!allbery
ARPA: necntc!ncoast!allbery@harvard.harvard.edu  Fido: 157/502  MCI: BALLBERY
   <<ncoast Public Access UNIX: +1 216 781 6201 24hrs. 300/1200/2400 baud>>
** Site "cwruecmp" is changing its name to "mandrill".  Please re-address **
*** all mail to ncoast to pass through "mandrill" instead of "cwruecmp". ***