jwp@larry.sal.wisc.edu (Jeffrey W Percival) (08/22/89)
I am trying to use lex and yacc to help me read a dense, long,
machine-produced listing of some crappy "special purpose" computer
language. I have a listing of the "rules" (grammar?) governing
the format of each line in the listing.
I believe lex and yacc are the right tools, because the set of rules
I have seems to match the spirit of the examples I read in the lex and
yacc papers by Lesk and Schmidt (lex) and Johnson (yacc).  For example:
digit: [0-9]
integer: {DIGIT}+
and so on to the more complicated
command definition: {command introducer} {statement}+ {command terminator}
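In yacc's notation that last rule might be rendered roughly as follows
(the nonterminal names here are invented, and the "+" repetition
becomes left recursion):

command_definition
        : command_introducer statement_list command_terminator
        ;

statement_list                  /* one or more statements */
        : statement
        | statement_list statement
        ;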
My first question is how one trades off work between lex and yacc.
Should lex do more than just return characters?  There are all sorts
of keywords in my language that a lexical analyzer could recognize
and just return tokens for.
Along these lines, a problem I am having is getting the message "too
many definitions" from lex, when all I have are a few keywords and
ancillary definitions (lex file included below for illustration). Is
lex truly this limited in the number of definitions? Can I increase
this limit? Or am I using lex for too much, and not using yacc for
enough?
SMSHDR "SMSHDR"
ENDSMS "ENDSMS"
CP224 "CP224"
GROUP "GROUP"
PRT "PRT"
RTS "RTS"
SAFING "SAFING"
BEGINDATA "BEGINDATA"
ENDDATA "ENDDATA"
_IF "_IF"
_ELSE "_ELSE"
_ENDIF "_ENDIF"
_MESSAGE "_MESSAGE"
_SET "_SET"
_DELETE "_DELETE"
INCLUDE "INCLUDE"
LETTER [A-Za-z]
DIGIT [0-9]
HEX_DIGIT [0-9A-F]
OCT_DIGIT [0-7]
BIN_DIGIT [0-1]
SPECIAL [_%#@]
STRING ({DIGIT}|{LETTER}|{SPECIAL})+
WORD {LETTER}({DIGIT}|{LETTER}|{SPECIAL})*
OCT_MNEMONIC ("_"{STRING})|({WORD})
LABEL {STRING}":"
LABEL_REF "'"{STRING}"'"
TEXT_STRING "'"[ -~]"'"
HEX_INT '{HEX_DIGIT}+'X
OCT_INT '{OCT_DIGIT}+'O
BIN_INT '{BIN_DIGIT}+'B
U_INT {DIGIT}+
S_INT [+-]?{U_INT}
U_REAL {U_INT}"."{U_INT}
S_REAL [+-]?{U_REAL}
FLOAT ({S_REAL}|{S_INT})([ED]{S_INT})?
YY {U_INT}"Y"
DD {U_INT}"D"
HH {U_INT}"H"
MM {U_INT}"M"
SS ({U_INT}|{U_REAL})"S"
REL_TIME [+-]?((({HH})?({MM})?({SS}))|(({HH})?({MM})({SS})?)|(({HH})({MM})?({SS})?))
UTC_TIME {YY}?{DD}{REL_TIME}
DEL_TIME ({U_INT}C)|({REL_TIME})
ORB_REL_TIME "ORB,"{U_INT}","{WORD}(","[+-]?{REL_TIME})?
ORB_TIME "("{ORB_REL_TIME}")"
MFS_TIME "("({UTC_TIME}|{ORB_REL_TIME})",MFSYNC"(","[+-]?{REL_TIME})?")"
SOI_OFFSET [+-](({HEX_DIGIT}+"%X")|({U_INT})|({OCT_DIGIT}+"%O"))
SOI "'"{WORD}({SOI_OFFSET})?"'"[ND]
EOL "\n"
%%
--
Jeff Percival (jwp@larry.sal.wisc.edu)

paton@latcs1.oz (Jeff Paton) (08/24/89)
> From: jwp@larry.sal.wisc.edu (Jeffrey W Percival)
> Newsgroups: comp.unix.questions
> Subject: lex/yacc questions from a novice...
> Message-ID: <711@larry.sal.wisc.edu>
> Date: 22 Aug 89 16:41:14 GMT
> Organization: Space Astronomy Lab, Madison WI
> Lines: 81
>
> I am trying to use lex and yacc to help me read a dense, long, (...)
> (deleted stuff here)
>
> My first question is how one trades off work between lex and yacc.
>
> Along these lines, a problem I am having is getting the message "too
> many definitions" from lex, when all I have are a few keywords and
> ancillary definitions (lex file included below for illustration). Is
> lex truly this limited in the number of definitions?  Can I increase
> this limit?  Or am I using lex for too much, and not using yacc for
> enough?

These are probably the right sort of tools for you - I am working on a
similar sort of problem.  Some rough rules of thumb that I have found
to work:

1.  lex doesn't like having too many things to recognise, but you can
    get around a lot of trouble by defining start states for your
    rules, so that some rules are ignored at any given time.  Be
    careful with states, however; I recall some limit on the number
    you can have active at any one time.

2.  Store your keywords in a table of some description, and just get
    lex to look for "words", then do a lookup and return the token id
    (as per the y.tab.h from yacc -d).  This may mean defining more
    rules in yacc, but at least your statement analysis *may* be
    clearer.  (A sketch of this scheme follows below.)

3.  Having lex tell me what sort of token it has seen works well.

4.  lex has a number of parameters you can alter to make it recognise
    more things - working out what to adjust is a "suck and see"
    exercise.

(If any of the whizards can give me some better rules please do!)

Jeff Paton

while ( --braincell ) drink(...);
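A minimal sketch of the scheme in points 2 and 4, assuming yacc -d has
written token values such as SMSHDR, WORD, U_INT, and EOL into
y.tab.h; lookup_keyword is an illustrative name, and the %e/%p/%n/%a
table-size values are guesses you would tune:

%e 2000
%p 6000
%n 1000
%a 4000
%{
#include <string.h>
#include <stdlib.h>
#include "y.tab.h"      /* token values written by yacc -d */

static int lookup_keyword(char *s);
%}
%%
[A-Za-z][A-Za-z0-9_%#@]*        { int t = lookup_keyword(yytext);
                                  return t ? t : WORD; }
[0-9]+                          return U_INT;
[ \t]+                          { /* skip blanks */ }
\n                              return EOL;
%%
/* Keyword table; must stay sorted by name for bsearch(). */
static struct kw { char *name; int token; } kwtab[] = {
        { "BEGINDATA",  BEGINDATA },
        { "CP224",      CP224 },
        { "ENDSMS",     ENDSMS },
        { "SMSHDR",     SMSHDR },
        /* ... one entry per keyword, in sorted order ... */
};

static int kwcmp(const void *key, const void *elem)
{
        return strcmp((char *) key, ((struct kw *) elem)->name);
}

static int lookup_keyword(char *s)
{
        struct kw *p = bsearch(s, kwtab,
            sizeof kwtab / sizeof kwtab[0], sizeof kwtab[0], kwcmp);
        return p ? p->token : 0;
}

Which of the table sizes (if any) cures the "too many definitions"
message is, as point 4 says, a suck-and-see exercise; the bigger win
is that the keywords no longer need lex definitions at all.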
clewis@eci386.uucp (Chris Lewis) (08/26/89)
In article <711@larry.sal.wisc.edu> jwp@larry.sal.wisc.edu (Jeffrey W Percival) writes:
>I am trying to use lex and yacc to help me read a dense, long,
>machine-produced listing of some crappy "special purpose" computer
>language. I have a listing of the "rules" (grammar?) governing
>the format of each line in the listing.
>Along these lines, a problem I am having is getting the message "too
>many definitions" from lex, when all I have are a few keywords and
>ancillary definitions (lex file included below for illustration). Is

Generally speaking, identifying keywords directly in lex isn't worth
the bother.  Normally (when you're writing a compiler, for example),
once you've made this decision, the tokenizing rules are pretty easy:

        [A-Z][a-z0-9_]*:        word
        [0-9][0-9]*:            number
        +:                      PLUS
        -:                      MINUS
        +=:                     PLUSEQ

In this case, you generally have lex search for "words", and once
you've caught one, you do some sort of hashed lookup in a keyword
table to see whether it's a keyword; you return the YACC define for
it, or return "IDENTIFIER" if you couldn't find it.

Actually, I usually skip lex altogether - once you've eliminated
explicit keyword recognition, it's usually simpler (and a hell of a
lot smaller and faster) to code the analyzer in C.  Ie: 100 lines
will often do a reasonable job for C (excepting possibly floating
point stuff).  (A sketch of such an analyzer follows below.)
--
Chris Lewis, R.H. Lathwell & Associates: Elegant Communications Inc.
UUCP: {uunet!mnetor, utcsri!utzoo}!lsuc!eci386!clewis
Phone: (416)-595-5425
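A rough sketch of such a hand-coded analyzer, under the assumption
that the token codes would really come from y.tab.h (faked here with
an enum) and that gettoken and tokbuf are invented names; on T_WORD
the caller would do the keyword-table lookup:

#include <stdio.h>
#include <ctype.h>

/* Token codes; a real yacc front end would take these from y.tab.h. */
enum { T_EOF, T_WORD, T_NUMBER, T_PLUS, T_PLUSEQ, T_MINUS };

char tokbuf[256];       /* last word/number (no overflow check - it's a sketch) */

int gettoken(FILE *fp)
{
        int c, i = 0;

        while ((c = getc(fp)) == ' ' || c == '\t')
                ;                       /* skip blanks */
        if (c == EOF)
                return T_EOF;
        if (isalpha(c)) {               /* word: letter, then letters/digits/_ */
                do
                        tokbuf[i++] = c;
                while (isalnum(c = getc(fp)) || c == '_');
                ungetc(c, fp);
                tokbuf[i] = '\0';
                return T_WORD;          /* caller checks the keyword table */
        }
        if (isdigit(c)) {               /* number: a run of digits */
                do
                        tokbuf[i++] = c;
                while (isdigit(c = getc(fp)));
                ungetc(c, fp);
                tokbuf[i] = '\0';
                return T_NUMBER;
        }
        if (c == '+') {                 /* distinguish '+' from '+=' */
                if ((c = getc(fp)) == '=')
                        return T_PLUSEQ;
                ungetc(c, fp);
                return T_PLUS;
        }
        if (c == '-')
                return T_MINUS;
        return c;                       /* anything else: the character itself */
}

yylex() for yacc then reduces to a wrapper that calls gettoken() and
maps T_WORD through the keyword table.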