[comp.lang.c] FAQ for lex + yacc?

martin@mwtech.UUCP (Martin Weitzel) (05/09/91)

In article <5384@lectroid.sw.stratus.com> leavitt@mordor.hw.stratus.com (Will Leavitt) writes:
>
>Hi-
>   I'd like to use lex & yacc to parse strings within an application, but they
>seem to be hardwired to take their input from stdin.  What is the cannonical 
>way to get them to work on a string?  Thanks!
>
>     -will

In the last few weeks there have been some more questions concerning
lex + yacc. Certain questions seem to come up again and again. Not quite
a year ago I posted two rather lengthy articles in which I tried to
explain the following topics:

   1.)	How does one redefine the I/O in a yacc/lex piece of code?
   2.)	How to make lex and yacc work smoothly together with make,
	when separate compilation of individual modules is desired.
   3.)	My grammar contains the following rule
		line3	: 	A B C {  yacc action sequence }
	which indicates that the construct line3 is composed of the 3 tokens
	A B and C, in that order. How can i now assign the string values which
	originally constituted the tokens of A, B, and C into local variables
	of my choice?

Since I originally posted my answers here, I composed the articles into one
and polished this one up a little (whow, I even cared to use "spell" :-)).
If similar questions appeared here (and I had time to read this group),
I usually offered to the poster to mail him my answers, though the latter
is quite expensive for me (it's 16 KB mail from Europe to USA, which costs
me about $7 per copy - not too much for a single copy but it sums up).

Now I dare to use some bandwidth here again with *my* questions:

a) Should I turn my answers to above named questions into a regular FAQ
   posting for lex + yacc?

b) If you think I SHOULD post the stuff on a regular basis, which group
   would be appropriate for such a posting and in which intervals should
   I post it.

c) If you DON'T think I should post, where would be a good place to
   deposit my answers, so that the ones who are interested could obtain
   it from there without the need for me to mail 16 KB individually.
   Note that I can NOT ftp but I would well like to incorporate
   updates from time to time.

d) What other topics (besides the above ones) would you like to see
   in a FAQ for lex + yacc. Note that I don't want to write another
   book about lex + yacc right now (though, maybe, I'll do that some
   day :-)), but only want to shed a little light in some of the darker
   corners.

Thanks for listening so far.
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83

martin@mwtech.UUCP (Martin Weitzel) (05/13/91)

Here I follow up to my own article. Thanks to all who responded by mail,
especially to Steve Summit who wrote a very long answer and made valuable
proposuals.

For now I've decided to post the stuff once more since there seems to
be sufficient request for it. I even included one more new Q+A on the
request of one person who mailed me. (BTW: I'm still open to include
more Q+As in future releases.)

There seems to be some consensus NOT to make a REGULAR posting of it.
Instead I request all who archive FAQ-like stuff to do so with this
article. If you do and your archive has an easy way to be accessed from
the public (ftp, mail-server), let me know the procedure how to do so,
so that I can point future requests for the answers to this archive
and must neither mail all this to individuals, nor use up bandwitdh in
the news with it.

BTW: Sometime ago someone came up with the idea of a - probably moderated -
news group to cross-post FAQ-like stuff so that it could be archived
automatically or at least given a long expiration period. I'm not sure how
this proposual advanced (and I'm even not sure if such a group is a good
idea in general) but for the stuff I have it sounds all right.

============ Now, for all who are waiting, here we go again ==============


1.)  How does one redefine the I/O in a yacc/lex piece of code?  I.e.
     the code which is generated defaults to stdin and stdout for input
     and output, respectively.

The calling tree in a lex+yacc application, when it comes to read
input and you do not change anything, normally is:

   (main or whatever)
	---> yyparse
		---> yylex
			---> input[Macro]
				---> getc(yyin) [yyin defaults to stdin]

If you want to read from some other source instead of stdin, you have
several points where you can change the behavior.  (In the very simplest
case you could even change nothing and use the input redirection of UNIX.)

If your program generated by lex has to read from another source as stdin,
just fopen() the file and assign the returned FILE-pointer to yyin.  The
latter is a global symbol in the object file that results from compiling
what lex generated.  If you prefer separate compilation you must define
it as
	extern FILE *yyin;

in the module where you assign to it.  The right place for this will
probably be the function which calls yyparse in the above example.  Note
that the main program from the yacc-library is NOT linked if you supply
your own.  So you can play any games you like before (and also after)
calling yyparse.

Next step of complexity is to change the input-macro of yylex.  This
is useful sometimes, but you should be reluctant to do so until you have
gathered a bit experience with lex and understand the implications.  Here
are some details how the individual routines and functions call each other
(note that the situation described is reverse engineered from several
implementations under XENIX and SysV UNIX - it *may* differ with other
implementations, most notably "flex"):

	      main (from lex-library or own)
	       |
	       V
	      yylex --------------------------+-+
	+---+  | |  +-----+                   | |
	|   V  V V  V     |           ......  V V .........
	|  input unput    |           :  yyless yyreject  : in the
	|                 |           :.... | ... | . | ..: lex-library
	|                 |                 V     V   V
	|                 |                 yyunput   yyinput
	|                 |                    |         |
	|                 +--------------------+         |
	+------------------------------------------------+
	
What you should note here first is:

      -	When the next character is needed in yylex, input (NOT yyinput!!)
	is called.  Normally, input is #defined as macro but you can
	re-#define it, or #undef it and make a function with this name
	visible when you compile and link lex.yy.c.

      - There is another macro, unput, that must properly undo the actions
	of input, though unput is only called if your regular expressions
	require some look-ahead.  (If you are not *very* experienced with
	regular expressions, assume that there will *always* will be
	look-ahead.)

So if you want to change things at this level, you must find the right
place for your redefinitions, that is, you must write them somewhere
into the ".l"-file (with the lex-source), so that they appear *after*
the #define-s that are automatically generated by lex, but *before* the
first use of input/unput.  As the order in which the parts of your ".l"-file
appear in lex.yy.c seem to have changed with the evolution of lex, you
should check for the right place if you try this the first time!

(It appears as if the "safest" (i.e.  most portable) place for this is
right at the beginning of the second part of the ".l"-file, immediately
before the first regular expression.)

	file.l
	------------------------------------------------------------
		first part
	%%
	%{
	#undef	input /* ANSI-C requires that, though */
	#undef	unput /* older compilers may do without */
	#define	input ....  whatever ....
	#define unput ...  as you like ..
	%}
	first-regex {
		...  action ...
	}
	second-regex {
		....  action ...
	}
		...  etc.  ......
	%%
		third part
	------------------------------------------------------------

Now for the tricky part: As you see from the picture above which shows how
the functions call each other, there are some routines in the lex-library
which need sometimes to input or unput characters.  These routines *must*
use exactly the redefined versions of input and unput.  But: How can these
library routines link to something that is defined as macro?

The solution can again be seen if you carefully study lex.yy.c, the
source generated by lex.  At the end of lex.yy.c you find the functions
yyinput and yyunput (note the yy-prefix now!), which do no more and no
less than calling input and unput.  As the two functions are COMPILED
in the context of lex.yy.c, where the (possibly changed) macro-definitions
are visible, they build the "stubs" through which the functions in the
lex-library access the macros.  (Quite easy after you understand the
trick but not at all obvious for the newcomer.)

** AGAIN: Look at the above scheme showing the calling hierarchy and
** try to understand the dependencies.  Eventually study lex.yy.c a bit
** further.  THEN you might consider writing your own input/unput macros!
  
Finally, you can consider avoiding lex at all and roll your own yylex.
If you have only the "ancient" lex which is supplied with most unix systems
(contrary to the rewrite "flex"), it could eventually be an advantage to
do so, since lex-generated programs are known to be not so much efficient
compared to hand-written scanners.  (It's not easy to decide whether the
performance loss in lex generated programs is a legend or reality. Most
comparisons are based on trivial scanners which are easily written by hand.
In any case it seems recommendable to use lex during the early development
steps as prototyping tool and eventually change to hand-written scanners
when things are becoming more stable.)

For the redirection of output there is no problem at all, since this is
fully under control of the C-program fragments you write in your actions
of the lex+yacc source.
2.)  How can I get the automagically-defined #defines, which can normally
     be created from yacc with the -d flag, to come out when you use a
     Makefile?  I.e.  suppose I have scanner.l and grammar.y lex and yacc
     source files, respectively, and I have object files defined in my
     Makefile called scanner.o and grammar.o such that "make" follows
     default rules to create these from the aforementioned source files.   

If you use lex+yacc with the UNIX tool make, you can add your own explicit
dependencies or change the default suffix rules and add your own commands
there.  There is no "catch all" method for this - several variations exist,
all with their specific advantages and drawbacks - but if you know make or
are willing to learn about make, you can determine the dependencies between
your files generated by lex+yacc in the same way as with any other source
files.  You'll also find a framework for using lex+yacc together with
make in the book "The Unix Programming Environment" by Kernighan + Pike
[K&P].  Though only one chapter (but not the smallest) is dedicated to
lex+yacc, it is recommendable to get this book anyway -  it pays, not only
for your work with lex+yacc.)

Now let's look a little closer to the problem.  If you have a file grammar.y
and tell make that a file grammar.o is needed, make "knows" by its builtin
rules how to call yacc and rename and compile the resulting y.tab.c.  If you
need the definitions of the %tokens and the value stack %union in a separate
file - say grammar.h - a first attempt may be to add the following explicit
rule to the Makefile:

	grammar.h: grammar.y
		yacc -d grammar.y
		mv y.tab.h grammar.h
		rm -f y.tab.c

Obviously, here is some work duplicated, as yacc will run twice now: Once
for producing grammar.h in the explicit rule, once to produce grammar.o
with the suffix rule.  This is no real problem because (in general) yacc
runs very fast compared to the compilation step.

If you are really concerned about avoiding unnecessary CPU time, you
should consider some other fine point: Generally, the situation is that
changes in grammar.y will be much more reflected in y.tab.c than in
y.tab.h (respectively grammar.h in the above example).  The latter will
only occur when new tokens are introduced or the type of the value stack
is changed.  Typically, the latter two changes to grammar.y are much less
frequently done compared to changing some statements in the C-program
fragments that specify the actions for the rules.  So it is recommendable
to extend the above further to:

	grammar.h: grammar.y
		yacc -d grammar.y
		test -f grammar.h && cmp -s y.tab.h grammar.h ||\
			 mv y.tab.h grammar.h
		rm -f y.tab.c

Here y.tab.h is only renamed to grammar.h if the latter doesn't exist
OR is different from the former.  Leaving an existing unchanged grammar.h
alone will avoid many unnecessary recompilations of other modules which
depend on grammar.h.  (If you think you have understood the last sentence,
think about it and make sure that you really did! The problem only shows
up if many *other* modules *depend* on grammar.h!)

As a slight modification of the above, you may also write a "cc"-like
shell wrapper for yacc, which generates - controlled by options - either
".c", ".o", or ".h" files from given ".y" files.  Of course, this wrapper
should only conditionally rename y.tab.h, as shown above in the excerpt
from a Makefile.

{short digression on}
Some people may recommend to simply change the definition of YFLAGS in
the Makefile to `YFLAGS = -d', so that y.tab.h is produced as "side
effect" when the suffix rule gets executed.  Note that this has several
drawbacks: The default name for the definitions is now y.tab.h, which is
more a work file than the name of something useful and you cannot manage
different yacc projects in the same directory without the danger of
overwriting this file by accident.  Of course you can circumvent this by
changing the actions in the suffix rule and rename y.tab.h to grammar.h
there - but in this case you have already changed more than the definition
of YFLAGS.

But there's still one more problem: Make doesn't know through this how
grammar.h (or y.tab.h, if you do not rename it) can be made! All it knows
through the suffix rules is how to make grammar.o from grammar.y.  So if
you have an explicit dependency elsewhere, which tells that some other target
depends on grammar.h, make will complain when it had not accidentally
decided to make grammar.o before that.  (If you know make well, you may
write the dependencies in a way that y.tab.h is generated early enough,
but why worry about such details when the alternative solution shown above
also is not too complex.)
{short digression off}
3.) My grammar contains the following rule

	line3	: 	A B C
			{  yacc action sequence }

 
    which indicates that the construct line3 is composed of the 3 tokens
    A B and C, in that order.

    How can I now assign the values of A, B, and C into local variables of
    my choice? The problem lies in the fact that each of A B and C represent
    three calls to lex, and if I pass back a pointer to yytext from lex, I
    only retain the value of the last token in the sequence, in this case C, 
    when I get to the action sequence in my yacc code.  What if I want to 
    be able to select the EXACT ascii tokens for each of A B and C above in 
    my yacc code.  how do I do that?

Transferring strings from yylex to yyparse (respectively the action which
has the relevant tokens on the RHS of its grammar rule) must be done with
care: Simply using pointers to yytext is no solution here - you must copy
the contents of yytext to a safe place.  For that purpose you could malloc
some space in the action of yylex (not yyparse!!) which recognizes the token,
as shown in the code excerpt following below.  (Your C standard library may
also contain strdup which does malloc and strcpy all in one, but it's not
difficult to do without.)  Of course you must be careful here:

      -	malloc may return a NULL-pointer because of memory limits;
	ALWAYS CHECK FOR THIS CONDITION. (And even if you think this
        will never occur on your machine since you have lots of MB-s
        available - one day someone else may take your code and port
	it to some other environment and the program will fail!)

      -	you must not forget to allocate space for the terminating
	NUL-byte, hence malloc(yylen + 1) is the right thing;

      -	you must carefully plan for deallocation if your program
	shall not run out of memory when it analyzes large input.

If you transfer pointers to the malloc-ed space via the value stack, the
last chance for free-ing the space is immediately before the value stack
is cleared.  So, if you don't copy the pointers which correspond to A, B,
and C in the above example, your last chance is in the grammar action.
A short code excerpt should help to understand what is required:

	strsave-source
	------------------------------------------------------------------
	char *strsave(s, n) char *s;
	{
		char *t = malloc(n + 1);
		if (t == (char *)0) {
			scream and shout and die horrible death
		}
		strcpy(t, s);
		return t;
	}
	------------------------------------------------------------------

	lex-source
	------------------------------------------------------------------
	........
	%%
	regex-for-token-A	{
			yylval.str = strsave(yytext, yylen);
			return(A);
		}
	regex-for-token-B	{
			yylval.str = strsave(yytext, yylen);
			return(B);
		}
	regex-for-token-C	{
			yylval.str = strsave(yytext, yylen);
			return(C);
		}
	------------------------------------------------------------------

	yacc-source
	------------------------------------------------------------------
	......
	%union {
		.....
		char *str;
		.....
	}
	......
	%token <str> A B C
	......
	%%
	......
	line3 : A B C {
			$1, $2, and $3 are now pointers to "safe" copies of
			the original token strings, but if you don't copy
			these pointers to variables which SURVIVE THIS BLOCK,
			you must cleanup before this action ends:
				free($1);
				free($2);
				free($3);
			Be especially careful if you create multiple
			references to the malloc-ed space or if you transfer
			one of these further out, say: $$ = $1.  In this
			case you must of course NOT free $1 here! Instead
			the action(s) of the rule(s) where the nonterminal
			"line3" appears on the RHS must take over that
			responsibility.
		}
	......
	-------------------------------------------------------------------

		***************************************************
		********* WARNING: Dangerous Code ahead ***********
		***************************************************

Some people successfully solve the problem in principle as follows:

	yacc-source
	-------------------------------------------------------------------
	.......
	%%
	........
	line3: 
		A { atext = strsave(yytext, yylen); }
		B { btext = strsave(yytext, yylen); }
		C { ctext = strsave(yytext, yylen); }
	........
	-------------------------------------------------------------------

This introduces an obvious minor problem and a by far less obvious major
problem:

First, if your grammar is not trivial, writing actions WITHIN the RHS of
the rules effectively changes the grammar and introduces new rules with
an empty RHS.  Because of that yacc may complain about about conflicts now
and you must have a good understanding of the algorithm of yyparse, to decide
that the grammar does what you want.  This is the minor problem and if you
are lucky, yacc sees no conflicts and you may think the above works.

But be warned if you don't have a thorough understanding under which
circumstances yyparse needs to have a look-ahead symbol to continue:
Never, again NEVER, again ***NEVER*** depend on an unchanged contents
of yytext in the actions of yyparse: In yyparse the calls to yylex which
in turn change the contents of yytext are slightly "asynchronous", i.e.
there might be a read ahead of one token and yytext doesn't contain what
you think! But there's not ALWAYS a read-ahead, it depends whether yyparse
needs one more token to decide what to do further.  So it may occur that
your grammar works fine with the above and after a minor change you make
during further developing your project, there will be the need for look
ahead and you run in big trouble!

Concerning access to yytext (and some other yy-variables), a good rule for
professional coding is the following:

** If you combine lex+yacc generated programs, the only place where yytext
** is valid is in the action block following the regular expression in the
** ".l"-file.  Never access yytext (or any other variables that belong to
** the parts of the program generated by lex) within the actions of the
** grammar rules.
4.) How do I use two yacc grammars in the same program?  The problem is,
    when I try to link several modules create by lex and/or yacc, the
    linker complains about "name clashes".

With respect to this problem, there is some bad and some good news.


The bad news is that obviously lex+yacc were not designed with this in
mind, i.e. there are plenty of globally visible symbols which would cause
the aforementioned complaints from the linker, when you try to compile and
link several lex and/or yacc generated sources together into one program.
This is a bit of a pity, since every possible work-around is bound to be
more complicated and less comfortable as when provisions for that were
made in the design of lex+yacc. (Note that there may be modified versions
of lex+yacc locally available, which help you with the steps described
in the rest of the answer.)

The good news, however, is that lex+yacc both generate C-source, which
you may massage in any desired way before you continue to compile and
link the modules. Further, all the globally visible names (that make it
impossible to link any two unchanged modules together), start with the
unique prefix "yy". So changing this prefix from "yy" to something else
is the key to the solution you need.

Of course, you don't want to go through the lex.yy.c or yy.tab.c file and
make all these changes "per hand" each time you generated one of the files
anew, so the challenge is to find a way to apply the changes automatically.
Two possibilities immediately spring into mind: Using the UNIX tool "sed"
or using some #define-s and let the C-preprocessor hide the original names.

The first solution is straight forward if you first identify all global
yy-symbols in the program, then build your sed-script which changes them
on all occurrences in each line. Of course you can (and should) include the
call to this sed-program into the Makefile that coordinates the compilation
steps of your project.

The other possibility is to #define all the "yy"-symbols to something
unique. Again you must first identify what globally symbols are affected
by this. For a program generated by yacc you may finally come up with a
file that reads more or less as follows:

	zzgrammar.c
	-------------------------------------------------------------
	#define yyparse zzparse
	#define yystate zzstate
	#define yynerr	zznerr
	..... etc. etc. .....
	#include y.tab.c
	-------------------------------------------------------------

The compiler proper now doesn't see any names beginning with "yy" but
only those beginning with "zz". (Of course you should consider choosing
a more meaningful prefix - but be careful: Not all linkers use all
characters to distinguish external names - ANSI-C for example requires
global symbols only to differ within the first six characters with no
distinction between upper and lower case, and many older UNIX linkers
have a limit of seven characters here.)

No matter which way you go to change the global names (sed or #define),
there is one further important point: If you change the names of any
functions which come from the library, you must also supply a version
of this function yourself. (Be sure to understand the calling hierarchy
of the functions - the topic is treated in another of these lex+yacc FAQs.)

Fortunately, the library functions are absolutely trivial in case of yacc.
If you supply your own main program (which is the only meaningful thing
when you link several yacc generated modules together), the only routine
which may still come from the yacc-library is yyerror.  It is usually
called with the string "syntax error" and does nothing but print this
string to stderr.

The lex-library contains a little more - besides the main program (which
you surely don't need when calling yylex from yyparse), the absolutely
trivial routine yywrap, and finally two functions which are called when
you use yyless or REJECT in the actions of your rules. If you are not sure
what these functions really do, consider changing the regular expressions
and actions so that these two are never used. (Sorry, you have been told
that the work-arounds are not so comfortable as they might be if the case
of more than one yylex linked into a single program were considered in the
early design of lex+yacc.)

					Martin Weitzel
					EDV-Beratung
					Heidelberger Str. 197
					D-W-6100 Darmstadt
					Germany
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83