martin@mwtech.UUCP (Martin Weitzel) (05/13/91)
Here I follow up to my own article. Thanks to all who responded by mail, especially to Steve Summit who wrote a very long answer and made valuable proposuals. For now I've decided to post the stuff once more since there seems to be sufficient request for it. I even included one more new Q+A on the request of one person who mailed me. (BTW: I'm still open to include more Q+As in future releases.) There seems to be some consensus NOT to make a REGULAR posting of it. Instead I request all who archive FAQ-like stuff to do so with this article. If you do and your archive has an easy way to be accessed from the public (ftp, mail-server), let me know the procedure how to do so, so that I can point future requests for the answers to this archive and must neither mail all this to individuals, nor use up bandwitdh in the news with it. BTW: Sometime ago someone came up with the idea of a - probably moderated - news group to cross-post FAQ-like stuff so that it could be archived automatically or at least given a long expiration period. I'm not sure how this proposual advanced (and I'm even not sure if such a group is a good idea in general) but for the stuff I have it sounds all right. ============ Now, for all who are waiting, here we go again ============== 1.) How does one redefine the I/O in a yacc/lex piece of code? I.e. the code which is generated defaults to stdin and stdout for input and output, respectively. The calling tree in a lex+yacc application, when it comes to read input and you do not change anything, normally is: (main or whatever) ---> yyparse ---> yylex ---> input[Macro] ---> getc(yyin) [yyin defaults to stdin] If you want to read from some other source instead of stdin, you have several points where you can change the behavior. (In the very simplest case you could even change nothing and use the input redirection of UNIX.) If your program generated by lex has to read from another source as stdin, just fopen() the file and assign the returned FILE-pointer to yyin. The latter is a global symbol in the object file that results from compiling what lex generated. If you prefer separate compilation you must define it as extern FILE *yyin; in the module where you assign to it. The right place for this will probably be the function which calls yyparse in the above example. Note that the main program from the yacc-library is NOT linked if you supply your own. So you can play any games you like before (and also after) calling yyparse. Next step of complexity is to change the input-macro of yylex. This is useful sometimes, but you should be reluctant to do so until you have gathered a bit experience with lex and understand the implications. Here are some details how the individual routines and functions call each other (note that the situation described is reverse engineered from several implementations under XENIX and SysV UNIX - it *may* differ with other implementations, most notably "flex"): main (from lex-library or own) | V yylex --------------------------+-+ +---+ | | +-----+ | | | V V V V | ...... V V ......... | input unput | : yyless yyreject : in the | | :.... | ... | . | ..: lex-library | | V V V | | yyunput yyinput | | | | | +--------------------+ | +------------------------------------------------+ What you should note here first is: - When the next character is needed in yylex, input (NOT yyinput!!) is called. Normally, input is #defined as macro but you can re-#define it, or #undef it and make a function with this name visible when you compile and link lex.yy.c. - There is another macro, unput, that must properly undo the actions of input, though unput is only called if your regular expressions require some look-ahead. (If you are not *very* experienced with regular expressions, assume that there will *always* will be look-ahead.) So if you want to change things at this level, you must find the right place for your redefinitions, that is, you must write them somewhere into the ".l"-file (with the lex-source), so that they appear *after* the #define-s that are automatically generated by lex, but *before* the first use of input/unput. As the order in which the parts of your ".l"-file appear in lex.yy.c seem to have changed with the evolution of lex, you should check for the right place if you try this the first time! (It appears as if the "safest" (i.e. most portable) place for this is right at the beginning of the second part of the ".l"-file, immediately before the first regular expression.) file.l ------------------------------------------------------------ first part %% %{ #undef input /* ANSI-C requires that, though */ #undef unput /* older compilers may do without */ #define input .... whatever .... #define unput ... as you like .. %} first-regex { ... action ... } second-regex { .... action ... } ... etc. ...... %% third part ------------------------------------------------------------ Now for the tricky part: As you see from the picture above which shows how the functions call each other, there are some routines in the lex-library which need sometimes to input or unput characters. These routines *must* use exactly the redefined versions of input and unput. But: How can these library routines link to something that is defined as macro? The solution can again be seen if you carefully study lex.yy.c, the source generated by lex. At the end of lex.yy.c you find the functions yyinput and yyunput (note the yy-prefix now!), which do no more and no less than calling input and unput. As the two functions are COMPILED in the context of lex.yy.c, where the (possibly changed) macro-definitions are visible, they build the "stubs" through which the functions in the lex-library access the macros. (Quite easy after you understand the trick but not at all obvious for the newcomer.) ** AGAIN: Look at the above scheme showing the calling hierarchy and ** try to understand the dependencies. Eventually study lex.yy.c a bit ** further. THEN you might consider writing your own input/unput macros! Finally, you can consider avoiding lex at all and roll your own yylex. If you have only the "ancient" lex which is supplied with most unix systems (contrary to the rewrite "flex"), it could eventually be an advantage to do so, since lex-generated programs are known to be not so much efficient compared to hand-written scanners. (It's not easy to decide whether the performance loss in lex generated programs is a legend or reality. Most comparisons are based on trivial scanners which are easily written by hand. In any case it seems recommendable to use lex during the early development steps as prototyping tool and eventually change to hand-written scanners when things are becoming more stable.) For the redirection of output there is no problem at all, since this is fully under control of the C-program fragments you write in your actions of the lex+yacc source. 2.) How can I get the automagically-defined #defines, which can normally be created from yacc with the -d flag, to come out when you use a Makefile? I.e. suppose I have scanner.l and grammar.y lex and yacc source files, respectively, and I have object files defined in my Makefile called scanner.o and grammar.o such that "make" follows default rules to create these from the aforementioned source files. If you use lex+yacc with the UNIX tool make, you can add your own explicit dependencies or change the default suffix rules and add your own commands there. There is no "catch all" method for this - several variations exist, all with their specific advantages and drawbacks - but if you know make or are willing to learn about make, you can determine the dependencies between your files generated by lex+yacc in the same way as with any other source files. You'll also find a framework for using lex+yacc together with make in the book "The Unix Programming Environment" by Kernighan + Pike [K&P]. Though only one chapter (but not the smallest) is dedicated to lex+yacc, it is recommendable to get this book anyway - it pays, not only for your work with lex+yacc.) Now let's look a little closer to the problem. If you have a file grammar.y and tell make that a file grammar.o is needed, make "knows" by its builtin rules how to call yacc and rename and compile the resulting y.tab.c. If you need the definitions of the %tokens and the value stack %union in a separate file - say grammar.h - a first attempt may be to add the following explicit rule to the Makefile: grammar.h: grammar.y yacc -d grammar.y mv y.tab.h grammar.h rm -f y.tab.c Obviously, here is some work duplicated, as yacc will run twice now: Once for producing grammar.h in the explicit rule, once to produce grammar.o with the suffix rule. This is no real problem because (in general) yacc runs very fast compared to the compilation step. If you are really concerned about avoiding unnecessary CPU time, you should consider some other fine point: Generally, the situation is that changes in grammar.y will be much more reflected in y.tab.c than in y.tab.h (respectively grammar.h in the above example). The latter will only occur when new tokens are introduced or the type of the value stack is changed. Typically, the latter two changes to grammar.y are much less frequently done compared to changing some statements in the C-program fragments that specify the actions for the rules. So it is recommendable to extend the above further to: grammar.h: grammar.y yacc -d grammar.y test -f grammar.h && cmp -s y.tab.h grammar.h ||\ mv y.tab.h grammar.h rm -f y.tab.c Here y.tab.h is only renamed to grammar.h if the latter doesn't exist OR is different from the former. Leaving an existing unchanged grammar.h alone will avoid many unnecessary recompilations of other modules which depend on grammar.h. (If you think you have understood the last sentence, think about it and make sure that you really did! The problem only shows up if many *other* modules *depend* on grammar.h!) As a slight modification of the above, you may also write a "cc"-like shell wrapper for yacc, which generates - controlled by options - either ".c", ".o", or ".h" files from given ".y" files. Of course, this wrapper should only conditionally rename y.tab.h, as shown above in the excerpt from a Makefile. {short digression on} Some people may recommend to simply change the definition of YFLAGS in the Makefile to `YFLAGS = -d', so that y.tab.h is produced as "side effect" when the suffix rule gets executed. Note that this has several drawbacks: The default name for the definitions is now y.tab.h, which is more a work file than the name of something useful and you cannot manage different yacc projects in the same directory without the danger of overwriting this file by accident. Of course you can circumvent this by changing the actions in the suffix rule and rename y.tab.h to grammar.h there - but in this case you have already changed more than the definition of YFLAGS. But there's still one more problem: Make doesn't know through this how grammar.h (or y.tab.h, if you do not rename it) can be made! All it knows through the suffix rules is how to make grammar.o from grammar.y. So if you have an explicit dependency elsewhere, which tells that some other target depends on grammar.h, make will complain when it had not accidentally decided to make grammar.o before that. (If you know make well, you may write the dependencies in a way that y.tab.h is generated early enough, but why worry about such details when the alternative solution shown above also is not too complex.) {short digression off} 3.) My grammar contains the following rule line3 : A B C { yacc action sequence } which indicates that the construct line3 is composed of the 3 tokens A B and C, in that order. How can I now assign the values of A, B, and C into local variables of my choice? The problem lies in the fact that each of A B and C represent three calls to lex, and if I pass back a pointer to yytext from lex, I only retain the value of the last token in the sequence, in this case C, when I get to the action sequence in my yacc code. What if I want to be able to select the EXACT ascii tokens for each of A B and C above in my yacc code. how do I do that? Transferring strings from yylex to yyparse (respectively the action which has the relevant tokens on the RHS of its grammar rule) must be done with care: Simply using pointers to yytext is no solution here - you must copy the contents of yytext to a safe place. For that purpose you could malloc some space in the action of yylex (not yyparse!!) which recognizes the token, as shown in the code excerpt following below. (Your C standard library may also contain strdup which does malloc and strcpy all in one, but it's not difficult to do without.) Of course you must be careful here: - malloc may return a NULL-pointer because of memory limits; ALWAYS CHECK FOR THIS CONDITION. (And even if you think this will never occur on your machine since you have lots of MB-s available - one day someone else may take your code and port it to some other environment and the program will fail!) - you must not forget to allocate space for the terminating NUL-byte, hence malloc(yylen + 1) is the right thing; - you must carefully plan for deallocation if your program shall not run out of memory when it analyzes large input. If you transfer pointers to the malloc-ed space via the value stack, the last chance for free-ing the space is immediately before the value stack is cleared. So, if you don't copy the pointers which correspond to A, B, and C in the above example, your last chance is in the grammar action. A short code excerpt should help to understand what is required: strsave-source ------------------------------------------------------------------ char *strsave(s, n) char *s; { char *t = malloc(n + 1); if (t == (char *)0) { scream and shout and die horrible death } strcpy(t, s); return t; } ------------------------------------------------------------------ lex-source ------------------------------------------------------------------ ........ %% regex-for-token-A { yylval.str = strsave(yytext, yylen); return(A); } regex-for-token-B { yylval.str = strsave(yytext, yylen); return(B); } regex-for-token-C { yylval.str = strsave(yytext, yylen); return(C); } ------------------------------------------------------------------ yacc-source ------------------------------------------------------------------ ...... %union { ..... char *str; ..... } ...... %token <str> A B C ...... %% ...... line3 : A B C { $1, $2, and $3 are now pointers to "safe" copies of the original token strings, but if you don't copy these pointers to variables which SURVIVE THIS BLOCK, you must cleanup before this action ends: free($1); free($2); free($3); Be especially careful if you create multiple references to the malloc-ed space or if you transfer one of these further out, say: $$ = $1. In this case you must of course NOT free $1 here! Instead the action(s) of the rule(s) where the nonterminal "line3" appears on the RHS must take over that responsibility. } ...... ------------------------------------------------------------------- *************************************************** ********* WARNING: Dangerous Code ahead *********** *************************************************** Some people successfully solve the problem in principle as follows: yacc-source ------------------------------------------------------------------- ....... %% ........ line3: A { atext = strsave(yytext, yylen); } B { btext = strsave(yytext, yylen); } C { ctext = strsave(yytext, yylen); } ........ ------------------------------------------------------------------- This introduces an obvious minor problem and a by far less obvious major problem: First, if your grammar is not trivial, writing actions WITHIN the RHS of the rules effectively changes the grammar and introduces new rules with an empty RHS. Because of that yacc may complain about about conflicts now and you must have a good understanding of the algorithm of yyparse, to decide that the grammar does what you want. This is the minor problem and if you are lucky, yacc sees no conflicts and you may think the above works. But be warned if you don't have a thorough understanding under which circumstances yyparse needs to have a look-ahead symbol to continue: Never, again NEVER, again ***NEVER*** depend on an unchanged contents of yytext in the actions of yyparse: In yyparse the calls to yylex which in turn change the contents of yytext are slightly "asynchronous", i.e. there might be a read ahead of one token and yytext doesn't contain what you think! But there's not ALWAYS a read-ahead, it depends whether yyparse needs one more token to decide what to do further. So it may occur that your grammar works fine with the above and after a minor change you make during further developing your project, there will be the need for look ahead and you run in big trouble! Concerning access to yytext (and some other yy-variables), a good rule for professional coding is the following: ** If you combine lex+yacc generated programs, the only place where yytext ** is valid is in the action block following the regular expression in the ** ".l"-file. Never access yytext (or any other variables that belong to ** the parts of the program generated by lex) within the actions of the ** grammar rules. 4.) How do I use two yacc grammars in the same program? The problem is, when I try to link several modules create by lex and/or yacc, the linker complains about "name clashes". With respect to this problem, there is some bad and some good news. The bad news is that obviously lex+yacc were not designed with this in mind, i.e. there are plenty of globally visible symbols which would cause the aforementioned complaints from the linker, when you try to compile and link several lex and/or yacc generated sources together into one program. This is a bit of a pity, since every possible work-around is bound to be more complicated and less comfortable as when provisions for that were made in the design of lex+yacc. (Note that there may be modified versions of lex+yacc locally available, which help you with the steps described in the rest of the answer.) The good news, however, is that lex+yacc both generate C-source, which you may massage in any desired way before you continue to compile and link the modules. Further, all the globally visible names (that make it impossible to link any two unchanged modules together), start with the unique prefix "yy". So changing this prefix from "yy" to something else is the key to the solution you need. Of course, you don't want to go through the lex.yy.c or yy.tab.c file and make all these changes "per hand" each time you generated one of the files anew, so the challenge is to find a way to apply the changes automatically. Two possibilities immediately spring into mind: Using the UNIX tool "sed" or using some #define-s and let the C-preprocessor hide the original names. The first solution is straight forward if you first identify all global yy-symbols in the program, then build your sed-script which changes them on all occurrences in each line. Of course you can (and should) include the call to this sed-program into the Makefile that coordinates the compilation steps of your project. The other possibility is to #define all the "yy"-symbols to something unique. Again you must first identify what globally symbols are affected by this. For a program generated by yacc you may finally come up with a file that reads more or less as follows: zzgrammar.c ------------------------------------------------------------- #define yyparse zzparse #define yystate zzstate #define yynerr zznerr ..... etc. etc. ..... #include y.tab.c ------------------------------------------------------------- The compiler proper now doesn't see any names beginning with "yy" but only those beginning with "zz". (Of course you should consider choosing a more meaningful prefix - but be careful: Not all linkers use all characters to distinguish external names - ANSI-C for example requires global symbols only to differ within the first six characters with no distinction between upper and lower case, and many older UNIX linkers have a limit of seven characters here.) No matter which way you go to change the global names (sed or #define), there is one further important point: If you change the names of any functions which come from the library, you must also supply a version of this function yourself. (Be sure to understand the calling hierarchy of the functions - the topic is treated in another of these lex+yacc FAQs.) Fortunately, the library functions are absolutely trivial in case of yacc. If you supply your own main program (which is the only meaningful thing when you link several yacc generated modules together), the only routine which may still come from the yacc-library is yyerror. It is usually called with the string "syntax error" and does nothing but print this string to stderr. The lex-library contains a little more - besides the main program (which you surely don't need when calling yylex from yyparse), the absolutely trivial routine yywrap, and finally two functions which are called when you use yyless or REJECT in the actions of your rules. If you are not sure what these functions really do, consider changing the regular expressions and actions so that these two are never used. (Sorry, you have been told that the work-arounds are not so comfortable as they might be if the case of more than one yylex linked into a single program were considered in the early design of lex+yacc.) Martin Weitzel EDV-Beratung Heidelberger Str. 197 D-W-6100 Darmstadt Germany -- Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83