cluther@ponder.csci.unt.edu (Clay Luther) (05/23/91)
Hello All: I recently found a need to learn and use lex/yacc. I have purchased the nutshell book, but I find it somewhat lacking. I don't know if it's me or what. Anyway, I though I would turn to here for some advice. I need to create a sentence parser. Right now, it only needs to be very simple. Here is a sketchy grammer for the language. sentences :== sentences sentence sentence :== verbphrase prepphrase verbphrase :== verb | verb object prepphrase :== prep object | EMPTY verb :== word prep :== "with" | "in" | "on" | "about" object :== word | words | string string :== "\"" words "\"" words :== words word | word word :== [a-zA-Z]+ It should parse as follows: throw ball - verb = throw, verb object = ball throw ball with "blue stripes" - verb throw, vo ball, prep with, po "blue stripes" throw blue ball with white stripes - v throw, vo blue ball, p with, po white stripes I cannot seem to create a yacc/lex definition for this. Any help is *greatly* appreciated. Thanks. -- Clay Luther Lonely C++ Programmer cluther@ponder.csci.unt.edu cluther@solo.csci.unt.edu I work here now and they still cluther@supernet.haus.com don't let me have any opinions.
pg@bsg.com (Peter Garst) (05/24/91)
In article <cluther.674954332@ponder> cluther@ponder.csci.unt.edu (Clay Luther) writes: >I need to create a sentence parser. Right now, it only needs to be very >simple. Here is a sketchy grammer for the language. > >sentences :== sentences sentence >sentence :== verbphrase prepphrase >verbphrase :== verb | verb object >prepphrase :== prep object | EMPTY >verb :== word >prep :== "with" | "in" | "on" | "about" >object :== word | words | string >string :== "\"" words "\"" >words :== words word | word The main problem I see with your grammar is that it is VERY ambiguous. Consider the following input - word word word word word This could be interpreted as five sentences; it could be a verb with four words in the object; or it could be many other things. You might try specifying which words are nouns, which are verbs and so on - that is, verb :== "throw" | "eat" | ... noun :== "ball" | "burger" | ... This is very simple minded, but parsing English in any realistic sense is an extremely complex task. In particular, if you look only at syntax English really is highly ambiguous. Peter Garst pg@bsg.com P.S. - Your job will be far easier if you use a good yacc grammar development tool like ydb (my company's product); in a case like this it could have pointed out the ambiguities and the alternate interpretations right away.
martin@mwtech.UUCP (Martin Weitzel) (05/25/91)
In article <cluther.674954332@ponder> cluther@ponder.csci.unt.edu (Clay Luther) writes: >Hello All: > >I recently found a need to learn and use lex/yacc. I have purchased the >nutshell book, but I find it somewhat lacking. I don't know if it's me or >what. Anyway, I though I would turn to here for some advice. >I need to create a sentence parser. Right now, it only needs to be very >simple. Here is a sketchy grammer for the language. Let's see. (This questions seems to lend itself well to present a slightly complex but complete lex + yacc application. When asking if I should post a lex + yacc FAQ here some weeks ago, there were requests to see such an example program.) >sentences :== sentences sentence >sentence :== verbphrase prepphrase >verbphrase :== verb | verb object >prepphrase :== prep object | EMPTY >verb :== word >prep :== "with" | "in" | "on" | "about" >object :== word | words | string >string :== "\"" words "\"" >words :== words word | word >word :== [a-zA-Z]+ I just tried to translate this into the syntax with which yacc wants to see its specifications. I tried to stick as close as possible to your grammar, but there were a few things I changed before I started: - You must have an alternative for the recursive first rule - otherwise, if |sentences| always mean that already some |sentences| must have been seen, you'll never manage to start parsing. - I don't think that you really want to write your alternatives for |prep| with double quotes. Supposing these are meant to be tokens which can not appear in any other context (esp. they can never be a |word|), I turned them into tokens. - I also turned word into a token instead of building it by a grammar rule. (This allows much easier white space elimination within the scanner.) With this changes I came up with the following: ------------------------------------------------------------------ %token WITH IN ON ABOUT WORD %% /* * sentences :== sentences sentence */ sentences : /*EMPTY*/ | sentences sentence ; /* * sentence :== verbphrase prepphrase */ sentence : verbphrase prepphrase ; /* * verbphrase :== verb | verb object */ verbphrase : verb | verb object ; /* * prepphrase :== prep object | EMPTY */ prepphrase : prep object | /*EMPTY*/ ; /* * verb :== word */ verb : WORD ; /* * prep :== "with" | "in" | "on" | "about" */ prep : WITH | IN | ON | ABOUT ; /* * object :== word | words | string */ object : WORD | words | string ; /* * string :== "\"" words "\"" */ string : '"' words '"' ; /* * words :== words word | word */ words : words WORD | WORD ; ------------------------------------------------------------------ I've included your original specification in the above grammar so that you can clearly see the differences between your way to write the grammar and the way yacc wants it. (Further it helped me to verify that I changed nothing inadvertently.) Trying to translate this with yacc, I got the message that there are a number of conflicts in the grammar. Generally this means that - there may be sentences in your language which may be structured in more than one way according to the given grammar; - there may be sentences in your language which the parsing algorithm of the program generated by yacc will reject though these sentences could be parsed according to your grammar. If you want to see more informations with respect to the conflicts, you should simply save the above grammar in some file and translate it with "yacc -v grammar-file"; after doing so you'll find more information in y.output (this file is a bit too large to include here.) With some experience one will be able to tell from looking at y.output that there are two basic problems in your grammar: 1) There are two ways to parse an |object|, if it should consist of a single |WORD|: Either you go over |wordlist| or not. I suppose this was a small oversight of you and one can simply delete the rule which allows an object to be a single |WORD| and hence parse this always over |wordlist|. 2) The other problem I identified (using yacc's y.output) was that |objects| are in principle word lists of arbitrary length. Since the |prepphrase| may be empty, the algorithm of yyparse will not be able to detect when the current sentence ends and the next one starts, since all words could simply be seen as a (very long) |words| part of the |object| of the first sentence. Adding a unique punctuator to separate sentences will solve this. After the proposed two changes your grammar looks as follows (this time I've deleted your original description for compactness) : ------------------------------------------------------------------ %token WITH IN ON ABOUT WORD %% sentences : /*EMPTY*/ | sentences sentence ; sentence : verbphrase prepphrase '.' ; verbphrase : verb | verb object ; prepphrase : prep object | /*EMPTY*/ ; verb : WORD ; prep : WITH | IN | ON | ABOUT ; object : words | string ; string : '"' words '"' ; words : words WORD | WORD ; ------------------------------------------------------------------ If you translate the above with yacc, you will see that all the conflicts have vanished now and we can decide from that that - any sentence of the language can be structured in only one unique way according to this grammar; - the algorithm of yyparse will be able to recognize any sentence that follows the above grammar and it will reject any other sentence. >It should parse as follows: > >throw ball > - verb = throw, verb object = ball >throw ball with "blue stripes" > - verb throw, vo ball, prep with, po "blue stripes" >throw blue ball with white stripes > - v throw, vo blue ball, p with, po white stripes To demonstrate this, I chose to add a scanner and some actions to the grammar. First comes the scanner, written with lex: ------------------------------------------------------------------ %% "with" { yylval.word = save_word(yytext); return WITH; } "in" { yylval.word = save_word(yytext); return IN; } "on" { yylval.word = save_word(yytext); return ON; } "about" { yylval.word = save_word(yytext); return ABOUT;} [a-zA-Z]+ { yylval.word = save_word(yytext); return WORD; } [ \t\n] ; . return yytext[0]; ------------------------------------------------------------------ One thing is a bit unusual here (compared to other lex applications): As you see we have some fixed tokens here as "with", "in", etc. Normally we would ONLY return the token type (WITH, IN, etc.) here, but as we also need the word itself in our demonstration program, I chose to transfer the string itself via yylval. Now comes the biggest part of the program - our parser already stuffed with actions that show the structuring. Probably these actions will have to be modified in large parts. (They also do not QUITE what you want, i.e. |strings| do no more differ from |words| once they have been turned into an |object|. Thus the output will not show the double quotes in your second example.) Changes to the program can be made to adapt it to any other purpose than showing the structuring. This usually will extend to change the grammar, so that it contains better "hooks" for adding appropriate actions. (This was the reason I didn't conserve the double quotes around the strings; I had the impression that it would be easier with some changes to the grammar, which I didn't want to make for the purpose of this article.) It may even be advantageous to change the scanner sometimes. (E.g. I could imagine that it would be help to pull the recognition of strings completely out of the parser and have the scanner return a unique token for them.) But this all depends on the plans you have, i.e. what you want to do in your program. Now, be prepared for a long source listing with few comments. (Well, I thought I should leave something to figure out for you and all the other readers :-).) ------------------------------------------------------------------ %{ #include <stdio.h> extern char *malloc(); extern char *strcpy(); static void no_malloc_space(func) char *func; { fprintf(stderr, "*** NO MORE SPACE from malloc(): in %s\n", func); fflush(stderr); abort(); } /* * Operations on words */ char *save_word(s) char *s; { char *p = malloc(strlen(s) + 1); if (!p) no_malloc_space("save_word"); return strcpy(p, s); } void free_word(word) char *word; { free(word); } /* * Operations on wordlists */ struct word_list { char *word; struct word_list *next; }; struct word_list *add_word_list(list, word) struct word_list *list; char *word; { struct word_list *p = (struct word_list *) malloc(sizeof(struct word_list)); if (!p) no_malloc_space("add_word_list"); p->word = word; p->next = (struct word_list *)0; if (list) { list->next = p; return list; } else { return p; } } void free_word_list(list) struct word_list *list; { while (list) { struct word_list *p =list; list = p->next; free(p); } } void print_word_list(s1, list, s2) char *s1; struct word_list *list; char *s2; { printf("%s", s1); while (list) { printf(" %s", list->word); list = list->next; } printf("%s", s2); } %} %union { char *word; struct word_list *words; } %token <word> WITH IN ON ABOUT %token <word> WORD %type <word> verb prep %type <words> object words string %% sentences : /*EMPTY*/ | sentences sentence '.' ; sentence : verbphrase prepphrase { printf("\n"); } ; verbphrase : verb { printf("- verb = %s,", $1); free_word($1); } | verb object { printf("- verb = %s,", $1); free_word($1); print_word_list(" verb object =", $2, ","); free_word_list($2); } ; prepphrase : prep object { printf(" prep = %s,", $1); free_word($1); print_word_list(" prep object =", $2, ","); free_word_list($2); } | /*EMPTY*/ ; verb : WORD /* uses default: $$ = $1; */ ; prep : WITH | IN | ON | ABOUT ; object : words /* uses default: $$ = $1; */ | string /* uses default: $$ = $1; */ ; string : '"' words '"' { $$ = $2; } ; words : words WORD { $$ = add_word_list($1, $2); } | WORD { $$ = add_word_list((struct word_list *)0, $1); } ; %% #include "lex.yy.c" ------------------------------------------------------------------ >I cannot seem to create a yacc/lex definition for this. Any help is *greatly* >appreciated. I hope this was help enough - but finally I must confess one thing: I have not tested this program very thoroughly and it might well contain some bugs*. So, if it happens that this programming task was your homework, you should not only thoroughly analyze it and be prepared for questions, you should also test it and correct any bugs you find - please understand that I cannot take any responsibility for anything except that the DESIGN of this program accords to my EXPERIENCE how your task COULD be solved ... *: I've just decided to be a little more honest: I've deliberately left one bug in the program. It will hit you whenever a word list has three words or more --- enough hints so far :-) >Thanks. You're welcome. -- Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83