[comp.unix.questions] lex/yacc questions from a novice...

jwp@larry.sal.wisc.edu (Jeffrey W Percival) (08/22/89)

I am trying to use lex and yacc to help me read a dense, long,
machine-produced listing of some crappy "special purpose" computer
language.  I have a listing of the "rules" (grammar?) governing
the format of each line in the listing.

I believe lex and yacc are the right tools, because the rules I have
seem to match the spirit of the examples I read in the lex and yacc
papers by Lesk and Schmidt (lex) and Johnson (yacc).  For example:

DIGIT:		[0-9]
INTEGER:	{DIGIT}+

and so on to the more complicated

command definition: {command introducer} {statement}+ {command terminator}
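
In yacc terms, I picture that last rule coming out something like this
(a sketch only - the token and nonterminal names are my own invention,
and the rule for statement itself is omitted):

	command_definition
		: COMMAND_INTRODUCER statement_list COMMAND_TERMINATOR
		;

	statement_list
		: statement			/* one or more statements */
		| statement_list statement
		;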

My first question is how one trades off work between lex and yacc.
Should lex do more than just return characters?  There are all sorts of
keywords in my language that a lexical analyzer could recognize, and
just return tokens for them.
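That is, I picture rules like these in the lex source (a sketch; the
SMSHDR and ENDSMS being returned would be token numbers #define'd in
the y.tab.h that yacc -d produces):

	"SMSHDR"	{ return(SMSHDR); /* keyword -> yacc token */ }
	"ENDSMS"	{ return(ENDSMS); }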

Along these lines, a problem I am having is getting the message "too
many definitions" from lex, when all I have are a few keywords and
ancillary definitions (lex file included below for illustration).  Is
lex truly this limited in the number of definitions?  Can I increase
this limit?  Or am I using lex for too much, and not using yacc for
enough?

SMSHDR		"SMSHDR"
ENDSMS		"ENDSMS"
CP224		"CP224"
GROUP		"GROUP"
PRT		"PRT"
RTS		"RTS"
SAFING		"SAFING"
BEGINDATA	"BEGINDATA"
ENDDATA		"ENDDATA"
_IF		"_IF"
_ELSE		"_ELSE"
_ENDIF		"_ENDIF"
_MESSAGE	"_MESSAGE"
_SET		"_SET"
_DELETE		"_DELETE"
INCLUDE		"INCLUDE"
LETTER		[A-Za-z]
DIGIT		[0-9]
HEX_DIGIT	[0-9A-F]
OCT_DIGIT	[0-7]
BIN_DIGIT	[0-1]
SPECIAL		[_%#@]
STRING		({DIGIT}|{LETTER}|{SPECIAL})+
WORD		{LETTER}({DIGIT}|{LETTER}|{SPECIAL})*
OCT_MNEMONIC	("_"{STRING})|({WORD})
LABEL		{STRING}":"
LABEL_REF	"'"{STRING}"'"
TEXT_STRING	"'"[ -~]"'"
HEX_INT		'{HEX_DIGIT}+'X
OCT_INT		'{OCT_DIGIT}+'O
BIN_INT		'{BIN_DIGIT}+'B
U_INT		{DIGIT}+
S_INT		[+-]?{U_INT}
U_REAL		{U_INT}"."{U_INT}
S_REAL		[+-]?{U_REAL}
FLOAT		({S_REAL}|{S_INT})([ED]{S_INT})?
YY		{U_INT}"Y"
DD		{U_INT}"D"
HH		{U_INT}"H"
MM		{U_INT}"M"
SS		({U_INT}|{U_REAL})"S"
REL_TIME	[+-]?(({HH})?({MM})?({SS}))|(({HH})?({MM})({SS})?)|(({HH})({MM})?({SS})?)
UTC_TIME	{YY}?{DD}{REL_TIME}
DEL_TIME	({U_INT}C)|({REL_TIME})
ORB_REL_TIME	"ORB,"{U_INT}","{WORD}(","[+-]?{REL_TIME})?
ORB_TIME	"("{ORB_REL_TIME}")"
MFS_TIME	"("({UTC_TIME}|{ORB_REL_TIME})",MFSYNC"(","[+-]?{REL_TIME})?")"
SOI_OFFSET	[+-](({HEX_DIGIT}+"%X")|({U_INT})|({OCT_DIGIT}+"%O"))
SOI		"'"{WORD}({SOI_OFFSET})?"'"[ND]
EOL		"\n"
%%
-- 
Jeff Percival (jwp@larry.sal.wisc.edu)

paton@latcs1.oz (Jeff Paton) (08/24/89)

In article <711@larry.sal.wisc.edu> jwp@larry.sal.wisc.edu
(Jeffrey W Percival) writes:
> 
> I am trying to use lex and yacc to help me read a dense, long, (...)
> (deleted stuff here)
> 
> My first question is how one trades off work between lex and yacc.
> 
> Along these lines, a problem I am having is getting the message "too
> many definitions" from lex, when all I have are a few keywords and
> ancillary definitions (lex file included below for illustration).  Is
> lex truly this limited in the number of definitions?  Can I increase
> this limit?  Or am I using lex for too much, and not using yacc for
> enough?

These are probably the right tools for you - I am working on a similar
sort of problem.  Some rough rules of thumb that I have found to work:

1.	lex doesn't like too many things to recognise, but you can get around
	a lot of trouble by defining start states for your rules, so that rules
	outside the current state are ignored (see the sketch after this list).
	Be careful with states, however - I recall some problem with the number
	that you can have active at any one time.

2.	Store your keywords in a table of some description, and just get lex
	to look for "words"; then do a lookup and return the token id (as per
	the y.tab.h from yacc -d).  This means you may need to define more
	rules in yacc, but at least your statement analysis *may* be clearer.

3.	Having lex tell me what sort of token it has seen works well.

4.	lex has a number of table-size parameters you can alter to make it
	recognise more things - they show up in the sketch below, but working
	out what to adjust is a "suck it and see" exercise.
	(If any of the whizards can give me some better rules please do!)
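
To illustrate 1 and 4, here is the flavour of what I mean - a sketch
only, where the DATA state, the DATA_ITEM token and the table sizes are
all invented, and the %a/%e/%n/%p lines are the table-size parameters
that lex lets you raise:

	%a 4000
	%e 2000
	%n 500
	%p 6000
	%Start DATA
	%%
	"BEGINDATA"	{ BEGIN DATA; return(BEGINDATA); /* switch states */ }
	<DATA>"ENDDATA"	{ BEGIN 0; return(ENDDATA); /* back to normal */ }
	<DATA>[^ \t\n]+	{ return(DATA_ITEM); /* rule only active in DATA */ }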

Jeff Paton		while ( --braincell ) drink(...);

clewis@eci386.uucp (Chris Lewis) (08/26/89)

In article <711@larry.sal.wisc.edu> jwp@larry.sal.wisc.edu (Jeffrey W Percival) writes:
>I am trying to use lex and yacc to help me read a dense, long,
>machine-produced listing of some crappy "special purpose" computer
>language.  I have a listing of the "rules" (grammar?) governing
>the format of each line in the listing.

>Along these lines, a problem I am having is getting the message "too
>many definitions" from lex, when all I have are a few keywords and
>ancillary definitions (lex file included below for illustration).  Is

Generally speaking, identifying keywords directly in lex isn't worth the
bother.  Normally (when you're writing a compiler, for example), once
you've made this decision, the tokenizing rules are pretty easy:

[A-Za-z][A-Za-z0-9_]*:	word
[0-9][0-9]*:	number
"+":	PLUS
"-":	MINUS
"+=":	PLUSEQ

In this case, you generally have lex search for "words", and once you've
caught one, you do some sort of hashed lookup in a keyword table to see
whether it's a keyword: return the yacc define for it if so, or return
"IDENTIFIER" if you couldn't find it.

Actually, I usually skip lex altogether - once you've eliminated
explicit keyword recognition, it's usually simpler (and a hell of a lot
smaller and faster) to code the analyzer in C.  Ie: 100 lines will often
do a reasonable job for C (excepting possibly floating point stuff).
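
The skeleton of such a hand-coded yylex() is just getchar() and a
couple of loops - roughly like this (a sketch with no bounds checking;
NUMBER comes from y.tab.h and screen() is the keyword lookup above):

	#include <stdio.h>
	#include <ctype.h>
	#include "y.tab.h"

	char	yytext[256];		/* current token text, as lex keeps it */

	int
	yylex()
	{
		int	c, i;

		while ((c = getchar()) == ' ' || c == '\t')
			;			/* skip white space */
		if (c == EOF)
			return(0);		/* 0 tells yacc end-of-input */
		if (isalpha(c)) {		/* a word: collect, then screen it */
			for (i = 0; isalnum(c) || c == '_'; c = getchar())
				yytext[i++] = c;
			yytext[i] = '\0';
			ungetc(c, stdin);	/* push back the lookahead */
			return(screen());
		}
		if (isdigit(c)) {		/* a number */
			for (i = 0; isdigit(c); c = getchar())
				yytext[i++] = c;
			yytext[i] = '\0';
			ungetc(c, stdin);
			return(NUMBER);
		}
		return(c);			/* single-character token */
	}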
-- 
Chris Lewis, R.H. Lathwell & Associates: Elegant Communications Inc.
UUCP: {uunet!mnetor, utcsri!utzoo}!lsuc!eci386!clewis
Phone: (416)-595-5425