evil@arcturus.UUCP (Wade Guthrie) (02/07/90)
HELP!

I, an acclaimed yacc novice, am having severe difficulties with a problem which should, by all rights, be way over my head. Given a yacc grammar for the C programming language (which I got off the net), I am trying to write an autoprototyper for ANSI C. Now I know what you're thinking (did he shoot six bugs or only five. . .): I could either buy cheaply something that does this for me or I could be a SLIME and ask for a profiler too -- Instead, I have opted for the most frustrating approach. . .writing my own (I did think that it would be a good way to learn some more things about yacc and lex).

My problem is this: I am trying to get access to the strings that got matched by lex to make the tokens which are passed to yacc. Given this, I can do the job (I think). This is on a sun 3/60 under the 3.4 version of the operating system. After RTFMing (and gratuitous consultation of my local guru), I got to the part that says "the programmer includes in the declaration section [of the yacc grammar] %union { body } This declares the yacc value stack [...] the value is referenced through a $$ or $n construction, yacc automatically inserts the appropriate union name", or some such. I tried this approach (and another that I will get to soon).

At this point, I would like to give an example of what I think the pertinent pieces of code are. The lex source looks something like:

    %{
    #include "y.tab.h"
    [...]
    %}
    [...]
    %%
    auto            { return(AUTO); }
    register        { return(REGISTER); }
    [...]
    "->"            { return(ARROW); }
    ";"             { return(SEMICOLON); }
    .               { return(yytext[0]); }

And the yacc grammar that looks like . . .

    %{
    #include <stdio.h>
    [...]
    %}
    %union VALTYPE {
        int type;
        char *string;
    };
    %token AUTO REGISTER STATIC EXTERN TYPEDEF ENUM
    [...]
    %token COMMA SEMICOLON
    %left COMMA
    [...]
    %left ARROW '.'
    %%
    translation_unit
        : external_declaration
        | translation_unit external_declaration
        ;
    function_definition
        : decln_spec declarator decln_list compound_statement
            { printf("Found function %s\n",$2);}
        | decln_spec declarator compound_statement
            { printf("Found function %s\n",$2);}
        | declarator decln_list compound_statement
            { printf("Found function %s\n",$1);}
        | declarator compound_statement
            { printf("Found function %s\n",$1);}
        ;
    [. . .]

For those that care, my y.tab.h looks something like:

    typedef union VALTYPE {
        int type;
        char *string;
    } YYSTYPE;
    # define AUTO 257
    # define REGISTER 258
    [...]
    # define COMMA 318
    # define SEMICOLON 319

Which should be okay, since I compile my grammar with:

    yacc -vd grammar.y

Assuming that my interpretation of the manual is correct, this (may I call your attention to the function_definition rule of the yacc grammar) should give me the proper info. Instead, the $n values turn out to be NULL pointers.

On another tack, I thought that I would have to (shudder) build the string myself, so I tried:

    type_specifier
        : VOID {printf("found type %s\n",yytext);}
    [...]

in the grammar. Now, THIS got a lot more reaction (I got core dumps). I run the thing by having gcc remove the comments and preprocessor directives (and piping that through a filter that removes the '#' lines inserted by gcc) before running proto (a simple routine that, at this point, only calls yyparse and has a simple yyerror set up).

Anyone got any ideas? Can normal yacc and lex do this sort of thing? How? In lieu of this, can you name a good single malt whiskey in which to drown my programming sorrows?

I appreciate any help. Thanks.

Wade Guthrie
evil@arcturus.UUCP
Rockwell International; Anaheim, CA
(How could Rockwell stand by what I'm saying when *I* don't even know what I'm talking about???)
utility@quiche.cs.mcgill.ca (Ronald BODKIN) (02/08/90)
In article <7179@arcturus> evil@arcturus.UUCP (Wade Guthrie) writes: >off the net), I am trying to write an autoprototyper for ANSI C. Does anyone have such a beast already developed for PD use? I've toyed with writing such a program, but it seems like a lot of effort. I gather GNU C has protoize, but I can't use GNU C on my PC. Any pointers would be welcome. Ron
dkim@wam.umd.edu (Daeshik Kim) (02/08/90)
In article <7179@arcturus> evil@arcturus.UUCP (Wade Guthrie) writes:
>
>    %{
>    #include "y.tab.h"
>    [...]
>    %}
>    [...]
>    %%
>    auto            { return(AUTO); }
>    register        { return(REGISTER); }
>    [...]
>    "->"            { return(ARROW); }
>    ";"             { return(SEMICOLON); }
>    .               { return(yytext[0]); }
>

If you need to pass a value to 'yacc', then do the following:

    somestring      {
                      /* yyleng does not count the terminating NUL */
                      yylval.sval=(char*)malloc(yyleng+1);
                      strcpy(yylval.sval, yytext);
                      return(TOKENNAME);
                    }
    someinteger     { yylval.ival=atoi(yytext); return(TOKENNAME); }
    ...

Note that "yylval" has the type defined by "%union ....". In the above example, the 'yacc' definition would be:

    %union {
        int ival;
        char *sval;
    }

Scan over the C code generated by lex.

--
Daeshik Kim    H: (301) 445-0475/2147    O: (703) 689-5878
SCHOOL: dkim@wam.umd.edu, dskim@eng.umd.edu, mz518@umd5.umd.edu
WORK: dkim@daffy.uu.net (uunet!daffy!dkim)
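The copy in the lex action above can be sketched in plain C, outside of lex. This is only an illustration: `copy_token` is a hypothetical helper name, and its two arguments stand in for `yytext` and `yyleng`. The point it shows is that the buffer needs one byte more than the match length, for the terminating NUL.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of lex or yacc): copy the first `len`
 * bytes of `text` into freshly allocated storage and NUL-terminate it.
 * In a lex action the arguments would be yytext and yyleng.
 * Returns NULL if malloc fails; the caller owns the result.
 */
char *copy_token(const char *text, size_t len)
{
    char *copy;

    assert(text != NULL);
    copy = malloc(len + 1);      /* +1 for the terminating NUL */
    if (copy == NULL)
        return NULL;
    memcpy(copy, text, len);
    copy[len] = '\0';
    return copy;
}
```

Under these assumptions a lex action would read something like `yylval.sval = copy_token(yytext, yyleng);`, followed by a check for NULL.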
prs@tcsc3b2.tcsc.com (Paul Stath) (02/10/90)
evil@arcturus.UUCP (Wade Guthrie) writes:

[... Introduction text deleted ...]

>My problem is this: I am trying to get access to the strings that
>got matched by lex to make the tokens which are passed to yacc.
>Given this, I can do the job (I think). This is on a sun 3/60 under
>the 3.4 version of the operating system. After RTFMing (and
>gratuitous consultation of my local guru), I got to the part that
>says "the programmer includes in the declaration section [of the
>yacc grammar] %union { body } This declares the yacc value stack
>[...] the value is referenced through a $$ or $n construction, yacc
>automatically inserts the appropriate union name", or some such.
>I tried this approach (and another that I will get to soon).

The string that gets matched in LEX is stored in a character pointer called `yytext'.

>At this point, I would like to give an example of what I think the
>pertinent pieces of code are. The lex source looks something like:
>    %{
>    #include "y.tab.h"
>    [...]
>    %}
>    [...]
>    %%
>    auto            { return(AUTO); }
>    register        { return(REGISTER); }
>    [...]
>    "->"            { return(ARROW); }
>    ";"             { return(SEMICOLON); }
>    .               { return(yytext[0]); }
>And the yacc grammar that looks like . . .
>    %{
>    #include <stdio.h>
>    [...]
>    %}
>    %union VALTYPE {
>        int type;
>        char *string;
>    };
>    %token AUTO REGISTER STATIC EXTERN TYPEDEF ENUM
>    [...]
>    %token COMMA SEMICOLON
>    %left COMMA
>    [...]
>    %left ARROW '.'
>    %%
>    translation_unit
>        : external_declaration
>        | translation_unit external_declaration
>        ;
>    function_definition
>        : decln_spec declarator decln_list compound_statement
>            { printf("Found function %s\n",$2);}
>        | decln_spec declarator compound_statement
>            { printf("Found function %s\n",$2);}
>        | declarator decln_list compound_statement
>            { printf("Found function %s\n",$1);}
>        | declarator compound_statement
>            { printf("Found function %s\n",$1);}
>        ;
>    [. . .]
>For those that care, my y.tab.h looks something like:
>    typedef union VALTYPE {
>        int type;
>        char *string;
>    } YYSTYPE;
>    # define AUTO 257
>    # define REGISTER 258
>    [...]
>    # define COMMA 318
>    # define SEMICOLON 319
>Which should be okay, since I compile my grammar with:
>    yacc -vd grammar.y
>Assuming that my interpretation of the manual is correct, this
>(may I call your attention to the function_definition rule of the
>yacc grammar) should give me the proper info. Instead, the $n
>values turn out to be NULL pointers.

Your interpretation of the manual is ALMOST correct, but not quite. YACC allows the $n construct as a shorthand to access the stack of token values that are being shifted during the parse sequence. By doing the:

    %union {
        /* union declaration */
    }

declaration in YACC, you redefine the stack type to be something other than int. Since YACC and LEX are -mostly- independent programs, you have to do a little bit more work when you change the default stack.

In your LEX actions which find strings, you need to allocate space for the string to be stored, point the stack pointer at that allocated space, and then copy the string into that space. Here is an example from the LEX code for that parser:

    %{
    #include "y.tab.h"
    char *str_ptr;
    %}
    %%
    [....]
    ${alpha}{alphanum}*     {
                              yylval.str=malloc(strlen(yytext)+1);
                              strcpy(yylval.str, yytext);
                              return (Identifier);
                            }
    [....]
    .                       return (yytext[0]);
    %%

Here is the relevant YACC code for that parser:

    %{
    extern char yytext[];
    %}
    %union {
        char *str;
    }
    [......]
    %token <str> Identifier
    [......]
    %%
    [......]
    file_declaration: FILE Identifier sysname ';'
            {
              token_in(token_init(Identifier, $2));
              file_rec_insert (&token_anchor);
            }
        ;
    [......]
    %%
    [......]

>Anyone got any ideas? Can normal yacc and lex do this sort of
>thing? How? In lieu of this, can you name a good single malt
>whiskey in which to drown my programming sorrows?

LEX and YACC are powerful tools which IMHO are poorly documented. Just RTFM'ing will NOT help. You have to read both between the lines, and sometimes THROUGH the page to find out what you want. Making LEX and YACC do the work is MUCH easier than writing a parser yourself, but the learning curve involved is VERY high!

I spent almost 3 months up to my elbows in YACC, LEX and C code to produce a parser for a database report language. (Not terribly complex grammar, but hard enough.) Most of my knowledge of LEX and YACC came from poking it until it broke. I STILL refer back to this code whenever I need to do something tricky, because I probably had to do it to write the report language parser.

I would like to see this thing posted to the net if it is not something you are doing on company time. I would be happy to help if you have any other YACC or LEX questions. Just E-Mail.

Just another application hacker drowning in a world of suits and stuffed shirts!

--
===============================================================================
Paul R. Stath    The Computer Solution Co., Inc.    Voice: 804-794-3491
------------------------------------------------+------------------------------
INTERNET: prs@tcsc3b2.tcsc.com | "There was no deity involved,
andre@targon.UUCP (andre) (02/14/90)
In article <7179@arcturus> evil@arcturus.UUCP (Wade Guthrie) writes:
>My problem is this: I am trying to get access to the strings that
>got matched by lex to make the tokens which are passed to yacc.
>Given this, I can do the job (I think). This is on a sun 3/60 under
>the 3.4 version of the operating system. After RTFMing (and
>gratuitous consultation of my local guru), I got to the part that
>says "the programmer includes in the declaration section [of the
>yacc grammar] %union { body } This declares the yacc value stack
>[...] the value is referenced through a $$ or $n construction, yacc
>automatically inserts the appropriate union name", or some such.
>I tried this approach (and another that I will get to soon).

If you declare a union for the yacc stack, you must do two things to use the union (the third is optional):

1 tell yacc which (non)terminals have which type
2 let the lex code fill the members of the
  global union yylval.
3 (easier) put an include lex.yy.c in the last part
  of the yacc file then you need not fiddle with the include
  file.

Example (not tested)

lex:

    %%
    [a-zA-Z][a-zA-Z0-9_]*  { yylval.str = strncpy(malloc(yyleng+1), yytext, yyleng);
                             yylval.str[yyleng] = '\0';  /* strncpy does not terminate here */
                             return NAME; }
    [0-9]+                 { yylval.nr = atoi(yytext); return INT; }
    %%

yacc:

    %union VALTYPE {
        int nr;
        char *str;
    };

    %token <str> NAME
    %token <nr> INT
    %type <str> string
    %type <nr> number

    %%
    file    : string {printf("NAME %s\n", $1);}
            | number {printf("INT %d\n", $1);}
            ;
    string  : NAME ;
    number  : INT ;
    %%
    #include "lex.yy.c"

This should get you back on the road :-).

--
The mail| AAA DDDD It's not the kill, but the thrill of the chase.
demon...| AA AAvv vvDD DD Ketchup is a vegetable.
hits!.@&| AAAAAAAvv vvDD DD {nixbur|nixtor}!adalen.via
--more--| AAA AAAvvvDDDDDD Andre van Dalen, uunet!hp4nl!targon!andre
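What the two lex actions do with the union can be mimicked in plain C, which may make the division of labour clearer. This is a rough sketch under stated assumptions: `union value` mirrors the `%union` above, while `classify_token` and the `TOK_*` constants are made-up names for illustration, not anything yacc or lex generates. Unlike the lex patterns, the sketch looks only at the first character of the text.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Mirrors the %union in the post: one member per kind of token value. */
union value {
    int nr;
    char *str;
};

/* Hypothetical token codes; a real y.tab.h assigns its own numbers. */
enum { TOK_ERROR = -1, TOK_NAME = 257, TOK_INT = 258 };

/*
 * Stand-in for the two lex actions: decide the token kind from the
 * first character only, fill the matching union member, and return
 * the token code. A NAME gets a private, NUL-terminated copy that the
 * caller must free.
 */
int classify_token(const char *text, union value *out)
{
    assert(text != NULL && out != NULL);

    if (isdigit((unsigned char)text[0])) {
        out->nr = atoi(text);
        return TOK_INT;
    }
    if (isalpha((unsigned char)text[0])) {
        size_t n = strlen(text) + 1;     /* include the NUL */
        char *copy = malloc(n);
        if (copy == NULL)
            return TOK_ERROR;
        memcpy(copy, text, n);
        out->str = copy;
        return TOK_NAME;
    }
    return TOK_ERROR;
}
```

The token code tells the parser which union member is valid, which is the job the `<str>` and `<nr>` tags do in the yacc declarations.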
chris@mimsy.umd.edu (Chris Torek) (02/15/90)
(Incidentally, this is another thing that does not really belong in comp.lang.c, but in this case there *is* no appropriate group, so I have not attempted to redirect followups....)

A few minor points:

In article <1990Feb9.171557.18465@tcsc3b2.tcsc.com> prs@tcsc3b2.tcsc.com (Paul Stath) writes:
>The string that gets matched in LEX is stored in a character pointer called
>`yytext'.

Actually, this is an array (of size YYLMAX, typically 200) of characters, not a pointer.

[example lex code]
>${alpha}{alphanum}*    {
>                         yylval.str=malloc(strlen(yytext)+1);
>                         strcpy(yylval.str, yytext);
>                         return (Identifier);
>                       }

It is not actually necessary to call malloc() here, as the characters in yytext[] will be left undisturbed until the next call to yylex(). The string saving, if necessary, can be deferred to the parser. One useful trick is to have a parse rule like save_id:

    %type <str> save_id
    %token <str> ID
    %%
    save_id: ID { $$ = savestr($1); };

Then, whenever you need an ID that must be saved from destruction by the next call to yylex(), you can use save_id instead of ID.

Another different trick (which I have used in some hand-coded lexers) is to save all strings in hash tables, possibly reference counted (depending on whether many should be freed later).

In any case, a routine that calls malloc() should check for no-space: instead of

    yylval.str=malloc(strlen(yytext)+1);
    strcpy(yylval.str, yytext);

you need something like

    yylval.str = malloc(strlen(yytext) + 1);
    if (yylval.str == NULL)
        die_horribly_due_to_running_out_of_space();
    strcpy(yylval.str, yytext);

or more simply

    yylval.str = estrdup(yytext);

where estrdup is like strdup, but errors out if out of space. (strdup is a common library function that acts like malloc+strcpy, returning NULL if out of space.)

>LEX and YACC are powerful tools which IMHO are poorly documented.

The real documentation for both of these tools is found in compiler courses and in compiler textbooks, not in the supplementary Unix documents. The latter assume you know what LALR parsing and regular expressions are all about, and merely tell you how to tell yacc and lex what syntax rules and regular expressions to recognise, and what actions to take on recognition.

--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@cs.umd.edu    Path: uunet!mimsy!chris
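Neither estrdup nor savestr is a standard library routine, so here is one way they might be written. This is a minimal sketch, not the versions the post had in mind: the error message and the choice to exit are assumptions, and savestr is simply defined in terms of estrdup.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Like strdup, but never returns NULL: on allocation failure it
 * prints a message and exits. (Assumed behaviour; the post only says
 * it "errors out if out of space".)
 */
char *estrdup(const char *s)
{
    size_t n;
    char *copy;

    assert(s != NULL);
    n = strlen(s) + 1;               /* include the terminating NUL */
    copy = malloc(n);
    if (copy == NULL) {
        fprintf(stderr, "estrdup: out of memory\n");
        exit(EXIT_FAILURE);
    }
    memcpy(copy, s, n);
    return copy;
}

/*
 * Hypothetical savestr for the save_id rule: here just another name
 * for estrdup. The caller owns the copy and frees it when done.
 */
char *savestr(const char *s)
{
    return estrdup(s);
}
```

With these in place, the action `{ $$ = savestr($1); }` hands the parser a copy that survives later calls to yylex().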
amull@Morgan.COM (Andrew P. Mullhaupt) (02/16/90)
In article <22529@mimsy.umd.edu>, chris@mimsy.umd.edu (Chris Torek) writes: > In article <1990Feb9.171557.18465@tcsc3b2.tcsc.com> prs@tcsc3b2.tcsc.com > (Paul Stath) writes: > >The string that gets matched in LEX is stored in a character pointer called > >`yytext'. > > Actually, this is an array (of size YYLMAX, typically 200) of characters, > not a pointer. Yes, if you're referring to real live 'lex'. If you have 'flex' instead, it gets declared as a pointer to character. This discrepancy, typical of those between yacc and bison, is small but annoying. Does anyone know what the reason is for this one? Discrepancies aside, the documentation for bison is a pretty good surrogate for yacc. Two books _The UNIX Programming Environment_ by Kernighan and Pike, and _UNIX Utilities_ by R.S. Tare provide tutorials on yacc. Although you should be able to get yacc to generate the parser you want independent of what OS you're running, there are some differences of note. System V UNIX has text messages for debugging a yacc parser based on the production in the grammar specification, and BSD UNIX (and Bison) provide references to the production by number, but also give the stack state. Later, Andrew Mullhaupt
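The array-versus-pointer difference can be seen in a few lines of ordinary C. The sketch below does not use lex or flex at all: `array_text` and `pointer_text` are stand-ins for the two styles of yytext declaration described above, and every function name is made up for the illustration. One way to keep a parser independent of the difference is to go through an accessor defined next to the scanner, rather than writing an extern declaration that has to match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TEXT_MAX 200

/* Stand-in for the lex style: the token text is an array object. */
static char array_text[TEXT_MAX];

/* Stand-in for the flex style: the token text is a pointer object. */
static char pointer_backing[TEXT_MAX];
static char *pointer_text = pointer_backing;

/* Store the same token text in both stand-ins. */
void set_both(const char *s)
{
    assert(s != NULL && strlen(s) < TEXT_MAX);
    strcpy(array_text, s);
    strcpy(pointer_text, s);
}

/* Accessors: callers get a const char * either way. */
const char *array_token_text(void)   { return array_text; }
const char *pointer_token_text(void) { return pointer_text; }

/* The objects themselves differ, so an extern declaration must match. */
size_t array_object_size(void)   { return sizeof array_text; }   /* TEXT_MAX */
size_t pointer_object_size(void) { return sizeof pointer_text; } /* sizeof(char *) */
```

Both accessors yield the same string, but the two objects have different sizes and types, which is why declaring the wrong one in another file goes wrong.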
kan@dg-rtp.dg.com (Victor Kan) (02/21/90)
In article <1044@targon.UUCP> andre@targon.UUCP (andre) writes:
>In article <7179@arcturus> evil@arcturus.UUCP (Wade Guthrie) writes:
> >My problem is this: I am trying to get access to the strings that
> >got matched by lex to make the tokens which are passed to yacc.
> >Given this, I can do the job (I think). This is on a sun 3/60 under

Try looking at the yytext array declared in lex.yy.c. You'll have to

    extern char yytext[];

it in y.tab.c, unless you do #3 as suggested below by andre@targon.UUCP. But I've had nothing but problems with that trick.

>If you declare a union for the yacc stack, you must do two
>things to use the union.
>
>1 tell yacc with (non) terminals have which type
>2 let the lex code fill the members of the
>  global union yylval.
>3 (easier) put an include lex.yy.c in the last part
>  of the yacc file then you need not fiddle with the include
>  file.

You may wish to avoid #3 if you can. It's not a good idea to #include C files because it can make debugging a living nightmare. Your compiler will give you bogus line numbers for syntax errors and warnings. When your program gets to the debugging stage, breakpoints in your debugger may be bogus too. GCC 1.35 and GDB 3.2 don't handle the line numbering and symbol table correctly in this respect. I doubt other compilers/debuggers can do any better.

Besides, it takes very little effort to support a separate y.tab.c and lex.yy.c with y.tab.h declarations. A little laziness here can cause you unnecessary Excedrin headaches.

>lex:
>%%
>[a-zA-Z][a-zA-Z0-9_]* { yylval.str = strncpy(malloc(yyleng+1), yytext, yyleng);

    yylval.str = strdup (yytext);

works fine. Just remember to free the block when you're done with it in Yacc.

>yacc:
>
>%%
>#include "lex.yy.c"

Avoid this if you can!!!!!

| Victor Kan               | I speak only for myself.               |  ***
| Data General Corporation | Edito cum Emacs, ergo sum.             | ****
| 62 T.W. Alexander Drive  | Columbia Lions Win, 9 October 1988 for | **** %%%%
| RTP, NC 27709            | a record of 1-44. Way to go, Lions!    |  *** %%%
martin@mwtech.UUCP (Martin Weitzel) (02/21/90)
In article <22529@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>(Incidentally, this is another thing that does not really belong in
>comp.lang.c, but in this case there *is* no appropriate group, so I
>have not attempted to redirect followups....)

Do we need comp.lang.yacc?

>
>A few minor points:
>
>In article <1990Feb9.171557.18465@tcsc3b2.tcsc.com> prs@tcsc3b2.tcsc.com
>(Paul Stath) writes:
>>The string that gets matched in LEX is stored in a character pointer called
>>`yytext'.
>
>Actually, this is an array (of size YYLMAX, typically 200) of characters,
>not a pointer.
>
>[example lex code]
>>${alpha}{alphanum}*    {
>>                         yylval.str=malloc(strlen(yytext)+1);
>>                         strcpy(yylval.str, yytext);
>>                         return (Identifier);
>>                       }
>
>
>It is not actually necessary to call malloc() here, as the characters
>in yytext[] will be left undisturbed until the next call to yylex().

Though I have no strong motivation to argue against Chris, because he always gives good and correct advice, I have to warn here that yylex() *may* be called to read one token ahead, so that the parser can decide whether to shift or reduce. This may not be of importance in the example Chris had in mind, but consider the following:

    %token ID
    %%
    list : id1 ';'
         | id2 ',' list
         ;
    id1  : ID { /*1*/ } ;
    id2  : ID { /*2*/ } ;
    %%

Before /*1*/ or /*2*/ can be executed, yylex() will have been called to see if the next token is ',' or ';', because otherwise the parser could not decide if 'id1' or 'id2' should be reduced. By then yytext[] holds the text of that look-ahead token, not of the ID.

Even if you can deduce that this will not be the case in a certain grammar, it introduces a possible bug for someone who later builds upon your work, modifies the grammar and introduces the need for look-ahead ...

[rest deleted]

--
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83
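The look-ahead hazard can be imitated without yacc. In this sketch `next_token` is a made-up stand-in for yylex() and `token_buf` for yytext[]; neither is the real thing. A pointer kept from one call sees whatever the next call wrote, while a private copy does not.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOKEN_MAX 200

/* Stand-in for lex's yytext[]: one buffer shared by every token. */
static char token_buf[TOKEN_MAX];

/*
 * Stand-in for yylex(): overwrite the shared buffer with the text of
 * the next token and return a pointer to that buffer.
 */
const char *next_token(const char *text)
{
    assert(text != NULL && strlen(text) < TOKEN_MAX);
    strcpy(token_buf, text);
    return token_buf;
}

/*
 * Make a private copy that later next_token() calls cannot disturb.
 * Returns NULL if malloc fails; the caller frees the result.
 */
char *save_token(const char *text)
{
    size_t n;
    char *copy;

    assert(text != NULL);
    n = strlen(text) + 1;            /* include the terminating NUL */
    copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, text, n);
    return copy;
}
```

If a caller reads "foo", keeps the returned pointer, and then the look-ahead reads ";", the kept pointer now refers to ";". Only a copy taken before the second read still holds "foo", which is the case for copying in the lexer action rather than in the parser.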
amull@Morgan.COM (Andrew P. Mullhaupt) (02/25/90)
In article <647@mwtech.UUCP>, martin@mwtech.UUCP (Martin Weitzel) writes: > In article <22529@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes: > >(Incidentally, this is another thing that does not really belong in > >comp.lang.c, but in this case there *is* no appropriate group, so I > >have not attempted to redirect followups....) > > Do we need comp.lang.yacc? > Yes, or something like it. The group should not be devoted only to C (yacc-lex-bison-flex-...) but to language construction techniques in general. I think a pool of experience in this area would be an admirable supplement to the anecdotal texts which are presently available. What say someone calls for discussion. Also: There will be a 'name the baby' crisis here. comp.lang.yacc seems to have no claim to be preferred over comp.lang.bison, or comp.lang.lex, etc.. Later, Andrew Mullhaupt
zed@mdbs.UUCP (Zed Smith) (02/28/90)
In article <751@s5.Morgan.COM>, amull@Morgan.COM (Andrew P. Mullhaupt) writes: ) In article <647@mwtech.UUCP>, martin@mwtech.UUCP (Martin Weitzel) writes: ) > In article <22529@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes: ) > >(Incidentally, this is another thing that does not really belong in ) > >comp.lang.c, but in this case there *is* no appropriate group, so I ) > >have not attempted to redirect followups....) ) > ) > Do we need comp.lang.yacc? ) > ) Yes, or something like it. The group should not be devoted only to C ) (yacc-lex-bison-flex-...) but to language construction techniques in ) general. ) ) Also: There will be a 'name the baby' crisis here. comp.lang.yacc ) seems to have no claim to be preferred over comp.lang.bison, or ) comp.lang.lex, etc.. ) Andrew Mullhaupt I think comp.lang.antlr would be an unbiased and fair name since antlr is non-commercial and is fresh off the press (or soon to be). (Ask parrt@ee.ecn.purdue.edu) Zed Smith pur-ee!mdbs!zed zed@mdbs.UUCP (?) (Yes this is a cheap advertisement for antlr, but I'm pretty sure antlr will be free so I don't feel guilty. (By the way, MDBS doesn't have any idea what antlr is either, so they aren't benefiting from this either.)) "antlr" is pronounced "antler"
evil@arcturus.UUCP (Wade Guthrie) (03/01/90)
Someone asked: > Do we need comp.lang.yacc? amull@Morgan.COM (Andrew P. Mullhaupt) writes: >Yes, or something like it. . . GROAN! More fragmentation! Let's not start that...couldn't we just leave the discussions here? -- Wade Guthrie (evil@arcturus.UUCP) Rockwell International; Anaheim, CA; My opinions, not my employer's. "All right, so I'm panicking, what else is there to do?"