[comp.lang.c] yacc sorrows

evil@arcturus.UUCP (Wade Guthrie) (02/07/90)

HELP!  I, an acclaimed yacc novice, am having severe difficulties
with a problem which should, by all rights, be way over my head.  
Given a yacc grammar for the C programming language (which I got 
off the net), I am trying to write an autoprototyper for ANSI C.  
Now I know what you're thinking (did he shoot six bugs or only 
five. . .): I could either buy cheaply something that does this
for me or I could be a SLIME and ask for a profiler too -- Instead,
I have opted for the most frustrating approach. . .writing my own
(I did think that it would be a good way to learn some more things
about yacc and lex).

My problem is this: I am trying to get access to the strings that
got matched by lex to make the tokens which are passed to yacc.
Given this, I can do the job (I think).  This is on a sun 3/60 under 
the 3.4 version of the operating system.  After RTFMing (and
gratuitous consultation of my local guru), I got to the part that 
says "the programmer includes in the declaration section [of the 
yacc grammar] %union { body }  This declares the yacc value stack 
[...] the value is referenced through a $$ or $n construction, yacc 
automatically inserts the appropriate union name", or some such.  
I tried this approach (and another that I will get to soon).

At this point, I would like to give an example of what I think the
pertinent pieces of code are.  The lex source looks something like:

	%{
	#include "y.tab.h"
	[...]
	%}
	[...]
	%%
	auto        { return(AUTO); }
	register    { return(REGISTER); }
	[...]
	"->"        { return(ARROW); }
	";"         { return(SEMICOLON); }
	.           { return(yytext[0]); }

And the yacc grammar that looks like . . .

	%{
	#include <stdio.h>
	[...]
	%}
	%union VALTYPE {
	    int type;
	    char *string;
	};
	%token AUTO REGISTER STATIC EXTERN TYPEDEF ENUM
	[...]
	%token COMMA SEMICOLON
	%left   COMMA
	[...]
	%left   ARROW '.'
	%%  
	translation_unit
	    : external_declaration
	    | translation_unit external_declaration
	    ;
	function_definition
	    : decln_spec declarator decln_list compound_statement
		{ printf("Found function %s\n",$2);}
	    | decln_spec declarator compound_statement
		{ printf("Found function %s\n",$2);}
	    | declarator decln_list compound_statement
		{ printf("Found function %s\n",$1);}
	    | declarator compound_statement
		{ printf("Found function %s\n",$1);}
	    ;
	[. . .]

For those that care, my y.tab.h looks something like:

	typedef union  VALTYPE {
	    int type;
	    char *string;
	} YYSTYPE;

	# define AUTO 257
	# define REGISTER 258
	[...]
	# define COMMA 318
	# define SEMICOLON 319

Which should be okay, since I compile my grammar with:

	yacc -vd grammar.y

Assuming that my interpretation of the manual is correct, this
(may I call your attention to the function_definition rule of the
yacc grammar) should give me the proper info.  Instead, the $n 
values turn out to be NULL pointers.

On another tack, I thought that I would have to (shudder) build the 
string myself, so I tried:

	type_specifier
	    : VOID
		{printf("found type %s\n",yytext);}
	[...]

in the grammar.  Now, THIS got a lot more reaction (I got core dumps).

I run the thing by having gcc remove the comments
and preprocessor directives (and piping that through a filter that
removes the '#' lines inserted by gcc) before running proto (a
simple routine that, at this point, only calls yyparse and has
a simple yyerror set up).

Anyone got any ideas?  Can normal yacc and lex do this sort of
thing?  How?  In lieu of this, can you name a good single malt
whiskey in which to drown my programming sorrows?

I appreciate any help.  Thanks.


Wade Guthrie
evil@arcturus.UUCP
Rockwell International; Anaheim, CA

(How could Rockwell stand by what I'm saying when *I* don't even know 
what I'm talking about???)

utility@quiche.cs.mcgill.ca (Ronald BODKIN) (02/08/90)

In article <7179@arcturus> evil@arcturus.UUCP (Wade Guthrie) writes:
>off the net), I am trying to write an autoprototyper for ANSI C.  
	Does anyone have such a beast already developed for PD use?
I've toyed with writing such a program, but it seems like a lot of effort.
I gather GNU C has protoize, but I can't use GNU C on my PC.  Any
pointers would be welcome.
		Ron

dkim@wam.umd.edu (Daeshik Kim) (02/08/90)

In article <7179@arcturus> evil@arcturus.UUCP (Wade Guthrie) writes:
>
>	%{
>	#include "y.tab.h"
>	[...]
>	%}
>	[...]
>	%%
>	auto        { return(AUTO); }
>	register    { return(REGISTER); }
>	[...]
>	"->"        { return(ARROW); }
>	";"         { return(SEMICOLON); }
>	.           { return(yytext[0]); }
>
	If you need to pass a value to 'yacc', then do the following:

	somestring	{
			yylval.sval=(char*)malloc(yyleng+1);
			strcpy(yylval.sval, yytext);
			return(TOKENNAME);
			}

	someinteger	{
			yylval.ival=atoi(yytext);
			return(TOKENNAME);
			}
	...

	Note that "yylval" has the type defined by "%union ....".
	In the above example, in 'yacc' def,
		%union
			{
			int ival;
			char *sval;
			}

	Look over the C code that lex generates.
--
	Daeshik Kim	H: (301) 445-0475/2147 O: (703) 689-5878
	SCHOOL:	dkim@wam.umd.edu, dskim@eng.umd.edu, mz518@umd5.umd.edu
	WORK:	dkim@daffy.uu.net (uunet!daffy!dkim)

prs@tcsc3b2.tcsc.com (Paul Stath) (02/10/90)

evil@arcturus.UUCP (Wade Guthrie) writes:

[... Introduction text deleted ...]

>My problem is this: I am trying to get access to the strings that
>got matched by lex to make the tokens which are passed to yacc.
>Given this, I can do the job (I think).  This is on a sun 3/60 under 
>the 3.4 version of the operating system.  After RTFMing (and
>gratuitous consultation of my local guru), I got to the part that 
>says "the programmer includes in the declaration section [of the 
>yacc grammar] %union { body }  This declares the yacc value stack 
>[...] the value is referenced through a $$ or $n construction, yacc 
>automatically inserts the appropriate union name", or some such.  
>I tried this approach (and another that I will get to soon).

The string that gets matched in LEX is stored in a character pointer called
`yytext'.

>At this point, I would like to give an example of what I think the
>pertinent pieces of code are.  The lex source looks something like:

>	%{
>	#include "y.tab.h"
>	[...]
>	%}
>	[...]
>	%%
>	auto        { return(AUTO); }
>	register    { return(REGISTER); }
>	[...]
>	"->"        { return(ARROW); }
>	";"         { return(SEMICOLON); }
>	.           { return(yytext[0]); }

>And the yacc grammar that looks like . . .

>	%{
>	#include <stdio.h>
>	[...]
>	%}
>	%union VALTYPE {
>	    int type;
>	    char *string;
>	};
>	%token AUTO REGISTER STATIC EXTERN TYPEDEF ENUM
>	[...]
>	%token COMMA SEMICOLON
>	%left   COMMA
>	[...]
>	%left   ARROW '.'
>	%%  
>	translation_unit
>	    : external_declaration
>	    | translation_unit external_declaration
>	    ;
>	function_definition
>	    : decln_spec declarator decln_list compound_statement
>		{ printf("Found function %s\n",$2);}
>	    | decln_spec declarator compound_statement
>		{ printf("Found function %s\n",$2);}
>	    | declarator decln_list compound_statement
>		{ printf("Found function %s\n",$1);}
>	    | declarator compound_statement
>		{ printf("Found function %s\n",$1);}
>	    ;
>	[. . .]

>For those that care, my y.tab.h looks something like:

>	typedef union  VALTYPE {
>	    int type;
>	    char *string;
>	} YYSTYPE;

>	# define AUTO 257
>	# define REGISTER 258
>	[...]
>	# define COMMA 318
>	# define SEMICOLON 319

>Which should be okay, since I compile my grammar with:

>	yacc -vd grammar.y

>Assuming that my interpretation of the manual is correct, this
>(may I call your attention to the function_definition rule of the
>yacc grammar) should give me the proper info.  Instead, the $n 
>values turn out to be NULL pointers.

Your interpretation of the manual is ALMOST correct, but not quite.  YACC allows
the $n construct as a shorthand to access the stack of token values that are
being shift'ed during the parse sequence.  By doing the:

%union {
	/* union declaration */
}

declaration in YACC, you redefine the stack type to be something other
than int.  Since YACC and LEX are -mostly- independent programs, you have
to do a little bit more work when you change the default stack.

In your LEX actions which find strings, you need to allocate space for the
string to be stored, point the stack pointer at that allocated space,
and then copy the string into that space.

Here is an example from the LEX code for that parser:

%{
#include "y.tab.h"

char	*str_ptr;
%}
%%
[....]
${alpha}{alphanum}*	{
				yylval.str=malloc(strlen(yytext)+1);
				strcpy(yylval.str, yytext);
				return (Identifier);
			}
[....]
.					return (yytext[0]);
%%


Here is the relevant YACC code for that parser:
%{

extern	char	yytext[];
%}

%union {
	char	*str;
}

[......]
%token <str> Identifier
[......]
%%
[......]
file_declaration:
		FILE Identifier sysname ';'
		{
			token_in(token_init(Identifier, $2));
			file_rec_insert (&token_anchor);
		}
	;
[......]
%%
[......]
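
The allocate-then-copy step described above can also be tried outside lex.
What follows is only a sketch in plain C, not part of the posted parser: the
union, `yylval` and `yytext` are stand-ins defined locally so the file
compiles on its own, the buffer size of 200 is an assumption, and
`save_token_text` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the union that yacc writes to y.tab.h from %union. */
typedef union {
    char *str;
} YYSTYPE;

/* Stand-ins for the globals that the generated parser and scanner
   normally provide; defined here only so the sketch compiles alone. */
YYSTYPE yylval;
char yytext[200];

/* What the Identifier action does: copy the matched text, including
   its terminating NUL, into fresh storage and hand it to the parser
   through yylval.  Returns 0 on success, -1 if malloc fails. */
int save_token_text(void)
{
    size_t len = strlen(yytext);

    assert(len < sizeof yytext);
    yylval.str = malloc(len + 1);   /* + 1 for the terminating NUL */
    if (yylval.str == NULL)
        return -1;
    memcpy(yylval.str, yytext, len + 1);
    return 0;
}
```

After a successful call, `yylval.str` points at storage that later matches
cannot overwrite, so the parser may keep it for as long as it likes and
free() it when done.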

>Anyone got any ideas?  Can normal yacc and lex do this sort of
>thing?  How?  In lieu of this, can you name a good single malt
>whiskey in which to drown my programming sorrows?

LEX and YACC are powerful tools which IMHO are poorly documented.  Just
RTFM'ing will NOT help.  You have to read both between the lines, and
sometimes THROUGH the page to find out what you want.  Making LEX and
YACC do the work is MUCH easier than writing a parser yourself,
but the learning curve involved is VERY high!  I spent almost 3 months up
to my elbows in YACC, LEX and C code to produce a parser for a database
report language.  (Not terribly complex grammar, but hard enough.)  Most of
my knowledge of LEX and YACC came from poking it until it broke.  I STILL
refer back to this code whenever I need to do something tricky, because I
probably had to do it to write the report language parser.

I would like to see this thing posted to the net if it is not something
you are doing on company time.  I would be happy to help if you have any
other YACC or LEX questions.  Just E-Mail.

Just another application hacker drowning in a world of suits and stuffed shirts!
-- 
===============================================================================
Paul R. Stath       The Computer Solution Co., Inc.       Voice: 804-794-3491
------------------------------------------------+------------------------------
INTERNET:	prs@tcsc3b2.tcsc.com		| "There was no diety involved,

andre@targon.UUCP (andre) (02/14/90)

In article <7179@arcturus> evil@arcturus.UUCP (Wade Guthrie) writes:

  >My problem is this: I am trying to get access to the strings that
  >got matched by lex to make the tokens which are passed to yacc.
  >Given this, I can do the job (I think).  This is on a sun 3/60 under
  >the 3.4 version of the operating system.  After RTFMing (and
  >gratuitous consultation of my local guru), I got to the part that
  >says "the programmer includes in the declaration section [of the
  >yacc grammar] %union { body }  This declares the yacc value stack
  >[...] the value is referenced through a $$ or $n construction, yacc
  >automatically inserts the appropriate union name", or some such.
  >I tried this approach (and another that I will get to soon).

If you declare a union for the yacc stack, you must do two
things to use the union.

1       tell yacc which (non)terminals have which type
2       let the lex code fill the members of the
	global union yylval.
3       (easier) put an include lex.yy.c in the last part
	of the yacc file then you need not fiddle with the include
	file.

Example (not tested)

lex:
%%
[a-zA-Z][a-zA-Z0-9_]*   { yylval.str = strncpy(malloc(yyleng+1), yytext, yyleng+1);
			  return NAME; }
[0-9]+                  { yylval.nr  = atoi(yytext);
			  return INT; }
%%

yacc:

%union VALTYPE {
    int  nr;
    char *str;
};


%token <str> NAME
%token <nr>  INT
%type  <str> string
%type  <nr>  number

%%

file    : string { printf("NAME %s\n", $1); }
	| number { printf("INT  %d\n", $1); }
	;

string  : NAME
	;

number  : INT
	;

%%

#include "lex.yy.c"

This should get you back on the road :-).
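
What the `<str>` and `<nr>` tags buy you can be pictured in plain C. This is
a rough sketch only, not the code yacc really generates: `value_stack`,
`shift_token`, `top_as_str` and `top_as_nr` are hypothetical names, and the
stack depth of 16 is arbitrary. The idea is that every slot on the value
stack is a whole union, the parser copies yylval into a slot on each shift,
and the tag decides which member a `$n` reads.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the union generated from %union { int nr; char *str; } */
typedef union {
    int nr;
    char *str;
} YYSTYPE;

#define STACK_DEPTH 16            /* arbitrary size for this sketch */

static YYSTYPE value_stack[STACK_DEPTH];
static size_t depth;

YYSTYPE yylval;                   /* filled in by the scanner before each shift */

/* On a shift the parser copies whatever is in yylval onto its value stack. */
void shift_token(void)
{
    assert(depth < STACK_DEPTH);
    value_stack[depth++] = yylval;
}

/* With %token <str> NAME, a $1 in an action becomes roughly this:
   the top stack slot, read through the .str member. */
char *top_as_str(void)
{
    assert(depth > 0);
    return value_stack[depth - 1].str;
}

/* With %token <nr> INT, $1 reads the same slot through .nr instead. */
int top_as_nr(void)
{
    assert(depth > 0);
    return value_stack[depth - 1].nr;
}
```

If the scanner never assigns to yylval, the slot simply holds whatever
yylval contained before, which would be consistent with the NULL pointers
reported in the original article.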


-- 
The mail|    AAA         DDDD  It's not the kill, but the thrill of the chase.
demon...|   AA AAvv   vvDD  DD        Ketchup is a vegetable.
hits!.@&|  AAAAAAAvv vvDD  DD                    {nixbur|nixtor}!adalen.via
--more--| AAA   AAAvvvDDDDDD    Andre van Dalen, uunet!hp4nl!targon!andre

chris@mimsy.umd.edu (Chris Torek) (02/15/90)

(Incidentally, this is another thing that does not really belong in
comp.lang.c, but in this case there *is* no appropriate group, so I
have not attempted to redirect followups....)

A few minor points:

In article <1990Feb9.171557.18465@tcsc3b2.tcsc.com> prs@tcsc3b2.tcsc.com
(Paul Stath) writes:
>The string that gets matched in LEX is stored in a character pointer called
>`yytext'.

Actually, this is an array (of size YYLMAX, typically 200) of characters,
not a pointer.
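
The distinction matters once another file declares yytext. A small sketch in
plain C (the names `text_array` and `text_pointer` are made up for
illustration, and 200 is just the typical YYLMAX quoted above) shows that
the two are different kinds of object:

```c
#include <assert.h>
#include <stddef.h>

#define YYLMAX 200                /* the typical value mentioned above */

char text_array[YYLMAX];          /* array style: the object is the storage itself */
char *text_pointer = text_array;  /* pointer style: the object holds only an address */

/* An array object occupies YYLMAX bytes. */
size_t array_object_size(void)
{
    assert(sizeof text_array == YYLMAX);
    return sizeof text_array;
}

/* A pointer object occupies only as much as one address. */
size_t pointer_object_size(void)
{
    return sizeof text_pointer;
}
```

Declaring `extern char *yytext;` in one file when the scanner defines an
array makes the compiler treat the first bytes of the matched text as an
address. That is undefined behaviour, and is one plausible way to end up
with a core dump when printing yytext from the grammar file.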

[example lex code]
>${alpha}{alphanum}*	{
>				yylval.str=malloc(strlen(yytext)+1);
>				strcpy(yylval.str, yytext);
>				return (Identifier);
>			}


It is not actually necessary to call malloc() here, as the characters
in yytext[] will be left undisturbed until the next call to yylex().
The string saving, if necessary, can be deferred to the parser.  One
useful trick is to have a parse rule like save_id:

	%type <str> save_id
	%token <str> ID
	%%
	save_id: ID { $$ = savestr($1); };

Then, whenever you need an ID that must be saved from destruction by
the next call to yylex(), you can use save_id instead of ID.

Another different trick (which I have used in some hand-coded lexers) is
to save all strings in hash tables, possibly reference counted (depending
on whether many should be freed later).  In any case, a routine that
calls malloc() should check for no-space: instead of

			yylval.str=malloc(strlen(yytext)+1);
			strcpy(yylval.str, yytext);

you need something like

			yylval.str = malloc(strlen(yytext) + 1);
			if (yylval.str == NULL)
				die_horribly_due_to_running_out_of_space();
			strcpy(yylval.str, yytext);

or more simply

			yylval.str = estrdup(yytext);

where estrdup is like strdup, but errors out if out of space.  (strdup
is a common library function that acts like malloc+strcpy, returning
NULL if out of space.)
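
One possible estrdup, as a sketch only: the error message and the choice to
exit() are assumptions, and a real program may prefer to report the failure
through its own error routine.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Copy a string into freshly allocated storage; print a message and
   exit instead of returning NULL when memory runs out. */
char *estrdup(const char *s)
{
    size_t n;
    char *copy;

    assert(s != NULL);
    n = strlen(s) + 1;            /* + 1 for the terminating NUL */
    copy = malloc(n);
    if (copy == NULL) {
        fprintf(stderr, "estrdup: out of memory\n");
        exit(EXIT_FAILURE);
    }
    memcpy(copy, s, n);
    return copy;
}
```

The caller owns the returned block and should free() it once the saved
string is no longer needed.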

>LEX and YACC are powerful tools which IMHO are poorly documented.

The real documentation for both of these tools is found in compiler
courses and in compiler textbooks, not in the supplementary Unix
documents.  The latter assume you know what LALR parsing and regular
expressions are all about, and merely tell you how to tell yacc and
lex what syntax rules and regular expressions to recognise, and what
actions to take on recognition.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

amull@Morgan.COM (Andrew P. Mullhaupt) (02/16/90)

In article <22529@mimsy.umd.edu>, chris@mimsy.umd.edu (Chris Torek) writes:
> In article <1990Feb9.171557.18465@tcsc3b2.tcsc.com> prs@tcsc3b2.tcsc.com
> (Paul Stath) writes:
> >The string that gets matched in LEX is stored in a character pointer called
> >`yytext'.
> 
> Actually, this is an array (of size YYLMAX, typically 200) of characters,
> not a pointer.

Yes, if you're referring to real live 'lex'. If you have 'flex' instead,
it gets declared as a pointer to character. 

This discrepancy, typical of those between yacc and bison, is small
but annoying. Does anyone know what the reason is for this one?

Discrepancies aside, the documentation for bison is a pretty good
surrogate for yacc. Two books _The UNIX Programming Environment_
by Kernighan and Pike, and _UNIX Utilities_ by R.S. Tare provide
tutorials on yacc. 

Although you should be able to get yacc to generate the parser you
want independent of what OS you're running, there are some differences
of note. System V UNIX has text messages for debugging a yacc parser
based on the production in the grammar specification, and BSD UNIX
(and Bison) provide references to the production by number, but also
give the stack state. 

Later,
Andrew Mullhaupt

kan@dg-rtp.dg.com (Victor Kan) (02/21/90)

In article <1044@targon.UUCP> andre@targon.UUCP (andre) writes:
>In article <7179@arcturus> evil@arcturus.UUCP (Wade Guthrie) writes:
>  >My problem is this: I am trying to get access to the strings that
>  >got matched by lex to make the tokens which are passed to yacc.
>  >Given this, I can do the job (I think).  This is on a sun 3/60 under

Try looking at the yytext array declared in lex.yy.c.
You'll have to declare it with extern char yytext[] in y.tab.c, unless you
do #3 as suggested below by andre@targon.UUCP.  But I've had
nothing but problems with that trick.

>If you declare a union for the yacc stack, you must do two
>things to use the union.
>
>1       tell yacc with (non) terminals have which type
>2       let the lex code fill the members of the
>	global union yylval.
>3       (easier) put an include lex.yy.c in the last part
>	of the yacc file then you need not fiddle with the include
>	file.

You may wish to avoid #3 if you can.  It's not a good idea to 
#include C files because it can make debugging a living nightmare.  
Your compiler will give you bogus line numbers for syntax errors and 
warnings.  When your program gets to the debugging stage,
breakpoints in your debugger may be bogus too.  GCC 1.35 and GDB 3.2 
don't handle the line numbering and symbol table correctly in this 
respect.  I doubt other compilers/debuggers can do any better.

Besides, it takes very little effort to support a separate y.tab.c
and lex.yy.c with y.tab.h declarations.  A little laziness here can
cause you unnecessary Excedrin headaches.

>lex:
>%%
>[a-zA-Z][a-zA-Z0-9_]*   { yylval.str = strncpy(malloc(yyleng+1), yytext, yyleng);

yylval.str = strdup (yytext); works fine.
Just remember to free the block when you're done with it in Yacc.

>yacc:
>
>%%
>#include "lex.yy.c"

Avoid this if you can!!!!!

| Victor Kan               | I speak only for myself.               |  ***
| Data General Corporation | Edito cum Emacs, ergo sum.             | ****
| 62 T.W. Alexander Drive  | Columbia Lions Win, 9 October 1988 for | **** %%%%
| RTP, NC  27709           | a record of 1-44.  Way to go, Lions!   |  *** %%%

martin@mwtech.UUCP (Martin Weitzel) (02/21/90)

In article <22529@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>(Incidentally, this is another thing that does not really belong in
>comp.lang.c, but in this case there *is* no appropriate group, so I
>have not attempted to redirect followups....)

Do we need comp.lang.yacc?

>
>A few minor points:
>
>In article <1990Feb9.171557.18465@tcsc3b2.tcsc.com> prs@tcsc3b2.tcsc.com
>(Paul Stath) writes:
>>The string that gets matched in LEX is stored in a character pointer called
>>`yytext'.
>
>Actually, this is an array (of size YYLMAX, typically 200) of characters,
>not a pointer.
>
>[example lex code]
>>${alpha}{alphanum}*	{
>>				yylval.str=malloc(strlen(yytext)+1);
>>				strcpy(yylval.str, yytext);
>>				return (Identifier);
>>			}
>
>
>It is not actually necessary to call malloc() here, as the characters
>in yytext[] will be left undisturbed until the next call to yylex().

Though I have no strong motivation to argue against Chris, because he
gives allways good and correct advice, I have to warn here, that yylex()
*may* be called to read one token ahead, so that the Parser can decide
wether to shift or reduce. This may not be of importance in the example
Chris had in mind, but consider the following:

%token ID
%%
list	: id1 ';'
	| id2 ',' list
	;
id1	: ID	{ /*1*/ }
	;
id2	: ID	{ /*2*/ }
	;
%%

Before /*1*/ or /*2*/ can be executed, yylex() will have been called
to see if the next token is ',' or ';', because otherwise it could
not decide if 'id1' or 'id2' should be reduced.

Even if you can deduce that this will not be the case in a certain
grammar, it introduces a possible bug for someone who later builds
upon your work, modifies the grammar and introduces the need for
look-ahead ...
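
The effect of look-ahead on an unsaved pointer can be reproduced without
yacc at all. This is a toy model under stated assumptions: `toy_yylex` and
the canned `input` array are invented for the sketch, and the only property
borrowed from a real scanner is that every call overwrites the one shared
yytext buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define YYLMAX 200

char yytext[YYLMAX];              /* stand-in for the scanner's shared buffer */

static const char *const input[] = { "alpha", ";" };
static size_t next_token;

/* Toy stand-in for yylex(): copies the next canned token into yytext,
   overwriting the previous one.  Returns 1 for a token, 0 at end. */
int toy_yylex(void)
{
    if (next_token >= sizeof input / sizeof input[0])
        return 0;
    assert(strlen(input[next_token]) < YYLMAX);
    strcpy(yytext, input[next_token++]);
    return 1;
}
```

Call toy_yylex() once, keep a bare pointer to yytext and also a private
copy, then call it again the way a parser fetching its look-ahead token
would. The bare pointer now sees ";" while the copy still holds "alpha",
which is exactly what an action at /*1*/ or /*2*/ would run into.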

[rest deleted]
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83

amull@Morgan.COM (Andrew P. Mullhaupt) (02/25/90)

In article <647@mwtech.UUCP>, martin@mwtech.UUCP (Martin Weitzel) writes:
> In article <22529@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
> >(Incidentally, this is another thing that does not really belong in
> >comp.lang.c, but in this case there *is* no appropriate group, so I
> >have not attempted to redirect followups....)
> 
> Do we need comp.lang.yacc?
> 
Yes, or something like it. The group should not be devoted only to C
(yacc-lex-bison-flex-...) but to language construction techniques in
general. 

I think a pool of experience in this area would be an admirable
supplement to the anecdotal texts which are presently available. What
say someone calls for discussion.

Also: There will be a 'name the baby' crisis here. comp.lang.yacc
seems to have no claim to be preferred over comp.lang.bison, or
comp.lang.lex, etc.. 

Later,
Andrew Mullhaupt

zed@mdbs.UUCP (Zed Smith) (02/28/90)

In article <751@s5.Morgan.COM>, amull@Morgan.COM (Andrew P. Mullhaupt) writes:
) In article <647@mwtech.UUCP>, martin@mwtech.UUCP (Martin Weitzel) writes:
) > In article <22529@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
) > >(Incidentally, this is another thing that does not really belong in
) > >comp.lang.c, but in this case there *is* no appropriate group, so I
) > >have not attempted to redirect followups....)
) > 
) > Do we need comp.lang.yacc?
) > 
) Yes, or something like it. The group should not be devoted only to C
) (yacc-lex-bison-flex-...) but to language construction techniques in
) general. 
) 
) Also: There will be a 'name the baby' crisis here. comp.lang.yacc
) seems to have no claim to be preferred over comp.lang.bison, or
) comp.lang.lex, etc.. 


) Andrew Mullhaupt

I think comp.lang.antlr would be an unbiased and fair name since antlr is 
non-commercial and fresh off the press (or soon to be).  (Ask 
parrt@ee.ecn.purdue.edu)

Zed Smith
pur-ee!mdbs!zed
zed@mdbs.UUCP (?)

(Yes this is a cheap advertisement for antlr, but I'm pretty sure antlr will 
be free so I don't feel guilty.  (By the way, MDBS doesn't have any idea
what antlr is either, so they aren't benefiting from this either.))

"antlr" is pronounced "antler"

evil@arcturus.UUCP (Wade Guthrie) (03/01/90)

Someone asked:
> Do we need comp.lang.yacc?

amull@Morgan.COM (Andrew P. Mullhaupt) writes:
>Yes, or something like it. . .

GROAN! More fragmentation!  Let's not start that...couldn't we
just leave the discussions here?
-- 
Wade Guthrie (evil@arcturus.UUCP)
Rockwell International; Anaheim, CA;  My opinions, not my employer's.

"All right, so I'm panicking, what else is there to do?"