jwabik@cs.D.UMN.EDU (Jeff Wabik) (08/08/87)
I am using (trying to use) LEX (& YACC, of course) to write a parser
for a language, and I'm running into problems defining some regular
expressions that need to be fed to LEX.  In short, everything was
going fine and I had my parser working 'til I remembered that I
needed to completely ignore all preprocessor directives.  Well..
I thought that I could tell LEX to toss out all the preprocessor
directives, much the same way I was discarding comments:

    \{[^\n\}]*\}?       { ; /* Toss out comments */ }

The problem here is that (as you'll note) this expression cannot cope with
comments that contain newlines (which is fine, since neither can the
compiler 8^), and for the LIFE of me, I CANNOT write an expression
that WILL accept similar preprocessor directives that NEED to contain
newlines.  In short, I need to nuke lines that have "??" as the first
non-white character of a line, and ending somewhere down the code
a bit with another matching "??".  I've tried things similar to:

    ^{WS}*"??"(.|{WS})*"??"

    { Where {WS} is defined as whitespace (i.e. \n, \t, ' ') }

Which, of course, doesn't work since the (.|{WS})* eats everything to the
end of the file and LEX segment faults..  I've also tried things like:

    ^{WS}*"??"[^"??"]"??"

Which in my mind should work, but doesn't..  Arg.

My problem is lack of documentation, since all I've got to work from is
the "Compiler Construction under UNIX" book by Schreiner and Friedman..
Good book, but ..

Any help would be greatly appreciated and would be responsible for
retaining a fellow netter's sanity.

Thanks!

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=
Jeff A. Wabik
University of Minnesota, Duluth 55812

ARPA: jwabik@cs-gw.d.umn.edu     UUCP: {umn-cs!umnd-cs-gw!jwabik}

Live long and program.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
edw@ius1.cs.cmu.edu (Eddie Wyatt) (08/09/87)
In article <716@umnd-cs.D.UMN.EDU>, jwabik@cs.D.UMN.EDU (Jeff Wabik) writes:
> I am using (trying to use) LEX (& YACC, of course) to write a parser
> for a language, and I'm running into problems defining some regular
> expressions that need to be fed to LEX.  In short, everything was
> going fine and I had my parser working 'til I remembered that I
> needed to completely ignore all preprocessor directives.  Well..
> I thought that I could tell LEX to toss out all the preprocessor
> directives, much the same way I was discarding comments:
>
>     \{[^\n\}]*\}?       { ; /* Toss out comments */ }
>
> The problem here is that (as you'll note) this expression cannot cope with
> comments that contain newlines (which is fine, since neither can the
> compiler 8^), and for the LIFE of me, I CANNOT write an expression
> that WILL accept similar preprocessor directives that NEED to contain
> newlines.  In short, I need to nuke lines that have "??" as the first
> non-white character of a line, and ending somewhere down the code
> a bit with another matching "??".  I've tried things similar to:
>
>

  What I've done in similar situations is to detect the first "??" then
call a routine to scan to the next "??".  Here's an example :

comment         [\/][\*]
    ....
%%
{comment}       read_comment();
"\n"            linecount++;
{separator}     ;
    .....

/* scans until the end-of-comment marker is found in the parse file */
static void read_comment()
{
    register int ch;

    while (TRUE) {
        switch (getc(yyin)) {
        case '*':
            do {
                /* end of comment */
                if ((ch = getc(yyin)) == '/')
                    return;
                /* should actually be an error - hit EOF before end of comment */
                if (ch == EOF)
                    return;
                /* found newline - increment line count for error reporting if needed */
                if (ch == '\n') {
                    linecount++;
                    break;
                }
            } while (ch == '*');
            break;

        case '\n':
            /* found newline - increment line count for error reporting if needed */
            linecount++;
            break;

        case EOF:
            /* should actually be an error - hit EOF before end of comment */
            return;
        }
    }
}

>
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=
> Jeff A. Wabik
> University of Minnesota, Duluth 55812
>
> ARPA: jwabik@cs-gw.d.umn.edu     UUCP: {umn-cs!umnd-cs-gw!jwabik}
>
> Live long and program.
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

--
Eddie Wyatt

e-mail: edw@ius1.cs.cmu.edu
kenny@uiucdcsb.cs.uiuc.edu (08/09/87)
/* Written 4:09 pm Aug 7, 1987 by jwabik@cs.D.UMN.EDU in uiucdcsb:comp.unix.questions */
/* ---------- "Any LEX gurus?" ---------- */
[Story of the woes trying to code a LEX expression that will accept
newlines, wanting to match preprocessor directives that look like:

    ?? --stuff-possibly-containing-newlines-- ??
]

The problem here is that (as you'll note) this expression cannot cope with
comments that contain newlines (which is fine, since neither can the
compiler 8^), and for the LIFE of me, I CANNOT write an expression
that WILL accept similar preprocessor directives that NEED to contain
newlines.
/* End of text from uiucdcsb:comp.unix.questions */

You COULD code a single regular expression to match all this stuff,
but it would be (1) really awkward, and (2) likely to overrun LEX's
token buffer trying to absorb your preprocessor directives.

A better plan is to use lex's syntax for left context sensitivity.
Your file will begin with something like:

%Start DIRECTIVE
%%
<INITIAL>"??"           { BEGIN DIRECTIVE; }
<DIRECTIVE>"??"         { BEGIN INITIAL; }
<DIRECTIVE>.|"\n"       ;

which tells lex the following:

    There is a new scanner state called DIRECTIVE state.
    When you see the string "??" in the initial state, go to
        DIRECTIVE state.
    If you see the string "??" in the DIRECTIVE state, go back to the
        initial state.
    Any other characters in the DIRECTIVE state, including newlines,
        are ignored.

All the rest of your rules will have to have <INITIAL> placed in front
of them, to keep them from firing in DIRECTIVE state.
For example, a complete LEX program to strip preprocessor directives
in your format and place the remainder of the program on stdout would
consist of the above, followed by the single line

<INITIAL>.|"\n"         ECHO;

which says, ``echo any single character, including a newline, seen in
the INITIAL state to stdout.''

Kevin Kenny                     UUCP: {ihnp4,pur-ee,convex}!uiucdcs!kenny
Department of Computer Science  ARPA: kenny@B.CS.UIUC.EDU (kenny@UIUC.ARPA)
University of Illinois          CSNET: kenny@UIUC.CSNET
1304 W. Springfield Ave.
Urbana, Illinois, 61801         Voice: (217) 333-8740
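
For anyone who wants to watch the state switching without lex in the
way, the rules above behave roughly like the two-state loop below.
This is only a sketch in plain C, not lex output: strip_directives(),
the enum names, and the in-memory buffers are hypothetical stand-ins
for the scanner, and the caller is assumed to supply an output buffer
at least as large as the input plus one byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names mirroring the INITIAL and DIRECTIVE start conditions. */
enum state { INITIAL_STATE, DIRECTIVE_STATE };

/*
 * Copy `in` to `out`, dropping everything between a "??" and the next
 * "??" (markers included).  Returns the number of characters written,
 * not counting the terminating NUL.
 * Assumption: `out` has room for strlen(in) + 1 bytes.
 */
size_t strip_directives(const char *in, char *out)
{
    enum state st = INITIAL_STATE;
    size_t n = 0;

    assert(in != NULL && out != NULL);

    while (*in != '\0') {
        if (in[0] == '?' && in[1] == '?') {
            /* "??" toggles the state, like the two BEGIN rules. */
            st = (st == INITIAL_STATE) ? DIRECTIVE_STATE : INITIAL_STATE;
            in += 2;
        } else {
            /* Echo in the initial state, discard inside a directive. */
            if (st == INITIAL_STATE)
                out[n++] = *in;
            in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

Checking for the two-character marker before falling back to a single
character plays the part of lex's longest-match rule, which prefers
"??" over the one-character pattern.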
garrity@garrity.applicon.UUCP (08/13/87)
>> I am using (trying to use) LEX (& YACC, of course) to write a parser
>> for a language, and I'm running into problems defining some regular
>> expressions that need to be fed to LEX.  In short, everything was
>> going fine and I had my parser working 'til I remembered that I
>> needed to completely ignore all preprocessor directives.  Well..
>> I thought that I could tell LEX to toss out all the preprocessor
>> directives, much the same way I was discarding comments:
>>
>>     \{[^\n\}]*\}?       { ; /* Toss out comments */ }
>>
>> The problem here is that (as you'll note) this expression cannot cope with
>> comments that contain newlines (which is fine, since neither can the
>> compiler 8^), and for the LIFE of me, I CANNOT write an expression
>> that WILL accept similar preprocessor directives that NEED to contain
>> newlines.  In short, I need to nuke lines that have "??" as the first
>> non-white character of a line, and ending somewhere down the code
>> a bit with another matching "??".  I've tried things similar to:
>>
>>
>
>  What I've done in similar situations is to detect the first "??" then
>call a routine to scan to the next "??".  Here's an example :

Why don't you just use a start condition?  This is exactly the sort of
thing they are great at.  Just do this.

%START COMMENT
%%
<COMMENT>"??"   BEGIN 0;
<COMMENT>.|\n   ;
"??"            BEGIN COMMENT;
.               ECHO;

-- Mike Garrity
--
-- snail: Applicon, a division of Schlumberger Systems, Inc.
--        829 Middlesex Tpk.
--        P.O. box 7004
--        Billerica MA, 01821
--
-- uucp: {allegra|decvax|mit-eddie|utzoo}!linus!raybed2!applicon!garrity
--       {amd|bbncca|cbosgd|wjh12|ihnp4|yale}!ima!applicon!garrity
allbery@ncoast.UUCP (08/13/87)
As quoted from <716@umnd-cs.D.UMN.EDU> by jwabik@cs.D.UMN.EDU (Jeff Wabik):
+---------------
| ^{WS}*"??"(.|{WS})*"??"
|
| { Where {WS} is defined as whitespace (i.e. \n, \t, ' ') }
|
| Which, of course, doesn't work since the (.|{WS})* eats everything to the
| end of the file and LEX segment faults..  I've also tried things like:
|
| ^{WS}*"??"[^"??"]"??"
+---------------

Don't try to do it -- I have done it, but lex-generated parsers which try
to do this will overflow buffers and dump core.  The best way to do it is
something like:

^[ \t]*"??"     {
                    int ch, lastch;

                    lastch = '\0';
                    /* keep reading until two '?' in a row have been seen */
                    while ((ch = input()) != '?' || lastch != '?') {
                        if (ch == 0) {
                            yyerror("unexpected end of file");
                            return 0;
                        }
                        lastch = ch;
                    }
                }

--
Brandon S. Allbery, moderator of comp.sources.misc and comp.binaries.ibm.pc
{{harvard,mit-eddie}!necntc,well!hoptoad,sun!mandrill!hal}!ncoast!allbery
ARPA: necntc!ncoast!allbery@harvard.harvard.edu  Fido: 157/502  MCI: BALLBERY
<<ncoast Public Access UNIX: +1 216 781 6201 24hrs. 300/1200/2400 baud>>
** Site "cwruecmp" is changing its name to "mandrill".  Please re-address **
*** all mail to ncoast to pass through "mandrill" instead of "cwruecmp". ***
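
The skip loop in the action above can also be tried outside lex.  What
follows is a minimal sketch, assuming a NUL-terminated string in place
of the scanner's input stream (the action above likewise treats a 0
from input() as end of file).  skip_directive() is a hypothetical
name, not part of lex, and the pointer is assumed to start just past
the opening "??".

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the lex action: `p` points just past the
 * opening "??".  Returns a pointer to the first character after the
 * closing "??", or NULL if the text ends before one is found.
 */
const char *skip_directive(const char *p)
{
    int ch, lastch = '\0';

    assert(p != NULL);

    for (;;) {
        ch = (unsigned char)*p;
        if (ch == '\0')
            return NULL;        /* unterminated directive */
        p++;
        if (ch == '?' && lastch == '?')
            return p;           /* closing "??" has been consumed */
        lastch = ch;
    }
}
```

A lone '?' inside the directive does not end the skip; only two in a
row do, which is why the previous character has to be remembered.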