jwp@larry.sal.wisc.edu (Jeffrey W Percival) (08/22/89)
I am trying to use lex and yacc to help me read a dense, long,
machine-produced listing of some crappy "special purpose" computer
language.  I have a listing of the "rules" (grammar?) governing the
format of each line in the listing.  I believe lex and yacc are the
right tools, because the set of rules I have seems to match the spirit
of the examples I read in the lex and yacc papers by Lesk and Schmidt
(lex) and Johnson (yacc).  For example:

	digit:    [0-9]
	integer:  {DIGIT}+

and so on, up to the more complicated command definition:

	{command introducer} {statement}+ {command terminator}

My first question is how one trades off work between lex and yacc.
Should lex do more than just return characters?  There are all sorts of
keywords in my language that a lexical analyzer could recognize and
just return tokens for.

Along these lines, a problem I am having is getting the message "too
many definitions" from lex, when all I have are a few keywords and
ancillary definitions (lex file included below for illustration).  Is
lex truly this limited in the number of definitions?  Can I increase
this limit?  Or am I using lex for too much, and not using yacc
enough?

SMSHDR		"SMSHDR"
ENDSMS		"ENDSMS"
CP224		"CP224"
GROUP		"GROUP"
PRT		"PRT"
RTS		"RTS"
SAFING		"SAFING"
BEGINDATA	"BEGINDATA"
ENDDATA		"ENDDATA"
_IF		"_IF"
_ELSE		"_ELSE"
_ENDIF		"_ENDIF"
_MESSAGE	"_MESSAGE"
_SET		"_SET"
_DELETE		"_DELETE"
INCLUDE		"INCLUDE"
LETTER		[A-Za-z]
DIGIT		[0-9]
HEX_DIGIT	[0-9A-F]
OCT_DIGIT	[0-7]
BIN_DIGIT	[0-1]
SPECIAL		[_%#@]
STRING		({DIGIT}|{LETTER}|{SPECIAL})+
WORD		{LETTER}({DIGIT}|{LETTER}|{SPECIAL})*
OCT_MNEMONIC	("_"{STRING})|({WORD})
LABEL		{STRING}":"
LABEL_REF	"'"{STRING}"'"
TEXT_STRING	"'"[ -~]"'"
HEX_INT		'{HEX_DIGIT}+'X
OCT_INT		'{OCT_DIGIT}+'O
BIN_INT		'{BIN_DIGIT}+'B
U_INT		{DIGIT}+
S_INT		[+-]?{U_INT}
U_REAL		{U_INT}"."{U_INT}
S_REAL		[+-]?{U_REAL}
FLOAT		({S_REAL}|{S_INT})([ED]{S_INT})?
YY		{U_INT}"Y"
DD		{U_INT}"D"
HH		{U_INT}"H"
MM		{U_INT}"M"
SS		({U_INT}|{U_REAL})"S"
REL_TIME	[+-]?(({HH})?({MM})?({SS}))|(({HH})?({MM})({SS})?)|(({HH})({MM})?({SS})?)
UTC_TIME	{YY}?{DD}{REL_TIME}
DEL_TIME	({U_INT}C)|({REL_TIME})
ORB_REL_TIME	"ORB,"{U_INT}","{WORD}(","[+-]?{REL_TIME})?
ORB_TIME	"("{ORB_REL_TIME}")"
MFS_TIME	"("({UTC_TIME}|{ORB_REL_TIME})",MFSYNC"(","[+-]?{REL_TIME})?")"
SOI_OFFSET	[+-](({HEX_DIGIT}+"%X")|({U_INT})|({OCT_DIGIT}+"%O"))
SOI		"'"{WORD}({SOI_OFFSET})?"'"[ND]
EOL		"\n"
%%
-- 
Jeff Percival (jwp@larry.sal.wisc.edu)
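On the "too many definitions" question: classic lex provides size directives that enlarge its internal tables, written at the top of the definitions section. A hedged sketch follows; the numbers are arbitrary starting points, and the comments paraphrase what each table holds. Note that some lex implementations also hard-code the limit on name definitions themselves, in which case the real fix is fewer definitions (for instance, recognizing keywords in C code rather than as individual lex names).

```
%p 5000		/* positions */
%n 1000		/* states */
%e 2000		/* parse-tree nodes */
%a 4000		/* transitions */
%k 1000		/* packed character classes */
%o 5000		/* output array size */
```

Lex's overflow messages name the table that ran out, so usually only one of these figures needs raising.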
paton@latcs1.oz (Jeff Paton) (08/24/89)
>From: jwp@larry.sal.wisc.edu (Jeffrey W Percival)
>Newsgroups: comp.unix.questions
>Subject: lex/yacc questions from a novice...
>Message-ID: <711@larry.sal.wisc.edu>
>Date: 22 Aug 89 16:41:14 GMT
>Organization: Space Astronomy Lab, Madison WI
>Lines: 81
>
>I am trying to use lex and yacc to help me read a dense, long, (...)
>(deleted stuff here)
>
>My first question is how one trades off work between lex and yacc.
>
>Along these lines, a problem I am having is getting the message "too
>many definitions" from lex, when all I have are a few keywords and
>ancillary definitions: (lex file included below for illustration). Is
>lex truly this limited in the number of definitions? Can I increase
>this limit? Or am I using lex for too much, and not using yacc for
>enough?

These are probably the right sort of tools for you - I am working on a
similar sort of problem.  Some rough rules of thumb that I have found
to work:

1.  lex doesn't like having too many things to recognise, but you can
    get around a lot of trouble by defining start states for your
    rules, so that some rules are ignored except in particular states.
    Be careful with states, however; I recall some problem with the
    number that can be active at any one time.

2.  Store your keywords in a table of some description, and just get
    lex to look for "words"; then do a lookup and return the token id
    (as per the y.tab.h from yacc -d).  This means you may need to
    define more rules in yacc, but at least your statement analysis
    *may* be clearer.

3.  Having lex tell me what sort of token it has seen works well.

4.  lex has a number of parameters you can alter to make it recognise
    more things - working out what to adjust is a "suck and see"
    exercise.

(If any of the whizards can give me some better rules, please do!)

Jeff Paton	while ( --braincell ) drink(...);
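Paton's point 2 can be sketched in a few lines of C. This is an illustration, not his code: the token values below stand in for the ones yacc -d would generate in y.tab.h, and the names TOK_* and screen() are invented for the example.

```c
/* One lex rule matches any "word"; this C helper decides whether the
 * word is a keyword and returns the matching token id. */
#include <stddef.h>
#include <string.h>

/* Stand-ins for the #defines yacc -d would emit in y.tab.h. */
enum { TOK_IDENTIFIER = 257, TOK_SMSHDR, TOK_ENDSMS, TOK_IF, TOK_ELSE };

static const struct keyword {
    const char *name;
    int token;
} keywords[] = {
    { "SMSHDR", TOK_SMSHDR },
    { "ENDSMS", TOK_ENDSMS },
    { "_IF",    TOK_IF     },
    { "_ELSE",  TOK_ELSE   },
};

/* Return the keyword's token id, or TOK_IDENTIFIER if the word is not
 * a keyword.  Linear scan is fine for a handful of keywords; a hashed
 * or bsearch() lookup pays off as the table grows. */
int screen(const char *text)
{
    size_t i;
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(text, keywords[i].name) == 0)
            return keywords[i].token;
    return TOK_IDENTIFIER;
}
```

With this in place the lex rules section shrinks to a single word rule along the lines of `{WORD} { return screen(yytext); }`, which sidesteps the per-keyword definitions entirely.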
clewis@eci386.uucp (Chris Lewis) (08/26/89)
In article <711@larry.sal.wisc.edu> jwp@larry.sal.wisc.edu (Jeffrey W Percival) writes:
>I am trying to use lex and yacc to help me read a dense, long,
>machine-produced listing of some crappy "special purpose" computer
>language.  I have a listing of the "rules" (grammar?) governing
>the format of each line in the listing.
>Along these lines, a problem I am having is getting the message "too
>many definitions" from lex, when all I have are a few keywords and
>ancillary definitions: (lex file included below for illustration).

Generally speaking, identifying keywords directly in lex isn't worth
the bother.  Normally (when you're writing a compiler, for example),
once you've made this decision, the tokenizing rules are pretty easy:

	[A-Z][a-z0-9_]*:	word
	[0-9][0-9]*:		number
	+:			PLUS
	-:			MINUS
	+=:			PLUSEQ

In this case, you generally have lex search for "words", and once
you've caught one, you do some sort of hashed lookup in a keyword
table to see whether it's a keyword: return the yacc define for it if
it is, or return "IDENTIFIER" if you couldn't find it.

Actually, I usually skip lex altogether - once you've eliminated
explicit keyword recognition, it's usually simpler (and a hell of a
lot smaller and faster) to code the analyzer in C.  I.e., 100 lines
will often do a reasonable job for C (excepting possibly the floating
point stuff).
-- 
Chris Lewis, R.H. Lathwell & Associates: Elegant Communications Inc.
UUCP: {uunet!mnetor, utcsri!utzoo}!lsuc!eci386!clewis
Phone: (416)-595-5425
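Lewis's hand-coded alternative really can be small. The sketch below shows the shape of such an analyzer for the word/number/operator rules he lists; the token codes and the next_token() interface (scanning from a string rather than a file) are invented for the illustration, not taken from his code.

```c
/* A minimal hand-written tokenizer in plain C: skip whitespace, then
 * classify the next run of characters as a word, number, or operator. */
#include <ctype.h>
#include <string.h>

enum { T_EOF, T_WORD, T_NUMBER, T_PLUS, T_MINUS, T_PLUSEQ };

/* Scan one token starting at *pp, copy its text into buf, advance *pp
 * past it, and return the token code. */
int next_token(const char **pp, char *buf, size_t bufsz)
{
    const char *p = *pp;
    size_t n = 0;

    while (isspace((unsigned char)*p))
        p++;
    if (*p == '\0') { *pp = p; buf[0] = '\0'; return T_EOF; }

    if (isalpha((unsigned char)*p)) {           /* word: letter, then letters/digits/_ */
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (n + 1 < bufsz) buf[n++] = *p;
            p++;
        }
        buf[n] = '\0'; *pp = p; return T_WORD;
    }
    if (isdigit((unsigned char)*p)) {           /* number: one or more digits */
        while (isdigit((unsigned char)*p)) {
            if (n + 1 < bufsz) buf[n++] = *p;
            p++;
        }
        buf[n] = '\0'; *pp = p; return T_NUMBER;
    }
    if (*p == '+' && p[1] == '=') { strcpy(buf, "+="); *pp = p + 2; return T_PLUSEQ; }
    if (*p == '+') { strcpy(buf, "+"); *pp = p + 1; return T_PLUS; }
    if (*p == '-') { strcpy(buf, "-"); *pp = p + 1; return T_MINUS; }

    buf[0] = *p; buf[1] = '\0'; *pp = p + 1;    /* unknown: pass through one char */
    return T_WORD;
}
```

Scanning `"count += 42"` yields T_WORD ("count"), T_PLUSEQ, T_NUMBER ("42"), then T_EOF. Each token class is a dozen lines, which is why a full tokenizer in this style stays near the 100-line figure Lewis quotes.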