[comp.unix.wizards] Writing a simple sentence parser in lex/yacc

cluther@ponder.csci.unt.edu (Clay Luther) (05/23/91)

Hello All:

I recently found a need to learn and use lex/yacc.  I have purchased the
nutshell book, but I find it somewhat lacking.  I don't know if it's me or
what.  Anyway, I though I would turn to here for some advice.

I need to create a sentence parser.  Right now, it only needs to be very
simple.  Here is a sketchy grammer for the language.

sentences :== sentences sentence
sentence :== verbphrase prepphrase
verbphrase :== verb | verb object
prepphrase :== prep object | EMPTY
verb :== word
prep :== "with" | "in" | "on" | "about"
object :== word | words | string 
string :== "\"" words "\""
words :== words word | word
word :== [a-zA-Z]+

It should parse as follows:

throw ball
 - verb = throw, verb object = ball
throw ball with "blue stripes"
 - verb throw, vo ball, prep with, po "blue stripes"
throw blue ball with white stripes
 - v throw, vo blue ball, p with, po white stripes

I cannot seem to create a yacc/lex definition for this.  Any help is *greatly*
appreciated.

Thanks.
-- 
Clay Luther                         Lonely C++ Programmer        
cluther@ponder.csci.unt.edu
cluther@solo.csci.unt.edu           I work here now and they still
cluther@supernet.haus.com           don't let me have any opinions.

pg@bsg.com (Peter Garst) (05/24/91)

In article <cluther.674954332@ponder> cluther@ponder.csci.unt.edu (Clay Luther) writes:
>I need to create a sentence parser.  Right now, it only needs to be very
>simple.  Here is a sketchy grammer for the language.
>
>sentences :== sentences sentence
>sentence :== verbphrase prepphrase
>verbphrase :== verb | verb object
>prepphrase :== prep object | EMPTY
>verb :== word
>prep :== "with" | "in" | "on" | "about"
>object :== word | words | string 
>string :== "\"" words "\""
>words :== words word | word

The main problem I see with your grammar is that it is VERY
ambiguous. Consider the following input -

	word word word word word

This could be interpreted as five sentences; it could be a verb with four
words in the object; or it could be many other things.
	
You might try specifying which words are nouns, which are verbs and so on -
that is,

	verb :== "throw" | "eat" | ...
	noun :== "ball" | "burger" | ...

This is very simple minded, but parsing English in any realistic
sense is an extremely complex task. In particular, if you look only at syntax
English really is highly ambiguous.

Peter Garst pg@bsg.com


P.S. - Your job will be far easier if you use a good yacc grammar development
tool like ydb (my company's product); in a case like this it could have
pointed out the ambiguities and the alternate interpretations right away.

martin@mwtech.UUCP (Martin Weitzel) (05/25/91)

In article <cluther.674954332@ponder> cluther@ponder.csci.unt.edu (Clay Luther) writes:
>Hello All:
>
>I recently found a need to learn and use lex/yacc.  I have purchased the
>nutshell book, but I find it somewhat lacking.  I don't know if it's me or
>what.  Anyway, I though I would turn to here for some advice.

>I need to create a sentence parser.  Right now, it only needs to be very
>simple.  Here is a sketchy grammer for the language.

Let's see. (This questions seems to lend itself well to present a slightly
complex but complete lex + yacc application. When asking if I should post
a lex + yacc FAQ here some weeks ago, there were requests to see such an
example program.)

>sentences :== sentences sentence
>sentence :== verbphrase prepphrase
>verbphrase :== verb | verb object
>prepphrase :== prep object | EMPTY
>verb :== word
>prep :== "with" | "in" | "on" | "about"
>object :== word | words | string 
>string :== "\"" words "\""
>words :== words word | word
>word :== [a-zA-Z]+

I just tried to translate this into the syntax with which yacc wants to
see its specifications. I tried to stick as close as possible to your
grammar, but there were a few things I changed before I started:

	- You must have an alternative for the recursive first
	  rule - otherwise, if |sentences| always mean that already
	  some |sentences| must have been seen, you'll never manage
	  to start parsing.

	- I don't think that you really want to write your alternatives
	  for |prep| with double quotes. Supposing these are meant to
	  be tokens which can not appear in any other context (esp.
	  they can never be a |word|), I turned them into tokens.

	- I also turned word into a token instead of building it by
	  a grammar rule. (This allows much easier white space
	  elimination within the scanner.)

With this changes I came up with the following:

	------------------------------------------------------------------
	%token WITH IN ON ABOUT WORD
	%%
	/*
	 * sentences :== sentences sentence
	*/
	sentences	:	/*EMPTY*/
			|	sentences sentence
			;
	/*
	 * sentence :== verbphrase prepphrase
	*/
	sentence	:	verbphrase prepphrase
			;
	/*
	 * verbphrase :== verb | verb object
	*/
	verbphrase	:	verb
			|	verb object
			;
	/*
	 * prepphrase :== prep object | EMPTY
	*/
	prepphrase	:	prep object
			|	/*EMPTY*/
			;
	/*
	 * verb :== word
	*/
	verb		:	WORD
			;
	/*
	 * prep :== "with" | "in" | "on" | "about"
	*/
	prep		:	WITH | IN | ON | ABOUT
			;
	/*
	 * object :== word | words | string 
	*/
	object		:	WORD
			|	words
			|	string 
			;
	/*
	 * string :== "\"" words "\""
	*/
	string		:	'"' words '"'
			;
	/*
	 * words :== words word | word
	*/
	words		:	words WORD
			|	WORD
			;
	------------------------------------------------------------------

I've included your original specification in the above grammar so
that you can clearly see the differences between your way to write
the grammar and the way yacc wants it. (Further it helped me to
verify that I changed nothing inadvertently.)

Trying to translate this with yacc, I got the message that there are
a number of conflicts in the grammar. Generally this means that

	- there may be sentences in your language which may be
	  structured in more than one way according to the given
	  grammar;
	- there may be sentences in your language which the parsing
	  algorithm of the program generated by yacc will reject
	  though these sentences could be parsed according to your
	  grammar.

If you want to see more informations with respect to the conflicts,
you should simply save the above grammar in some file and translate
it with "yacc -v grammar-file"; after doing so you'll find more
information in y.output (this file is a bit too large to include
here.)

With some experience one will be able to tell from looking at y.output
that there are two basic problems in your grammar:

1) There are two ways to parse an |object|, if it should consist of
   a single |WORD|: Either you go over |wordlist| or not. I suppose
   this was a small oversight of you and one can simply delete the
   rule which allows an object to be a single |WORD| and hence parse
   this always over |wordlist|.

2) The other problem I identified (using yacc's y.output) was that
   |objects| are in principle word lists of arbitrary length. Since the
   |prepphrase| may be empty, the algorithm of yyparse will not be able
   to detect when the current sentence ends and the next one starts,
   since all words could simply be seen as a (very long) |words| part
   of the |object| of the first sentence. Adding a unique punctuator
   to separate sentences will solve this.

After the proposed two changes your grammar looks as follows (this time
I've deleted your original description for compactness) :

	------------------------------------------------------------------
	%token WITH IN ON ABOUT WORD
	%%
	sentences	:	/*EMPTY*/
			|	sentences sentence
			;
	sentence	:	verbphrase prepphrase '.'
			;
	verbphrase	:	verb
			|	verb object
			;
	prepphrase	:	prep object
			|	/*EMPTY*/
			;
	verb		:	WORD
			;
	prep		:	WITH | IN | ON | ABOUT
			;
	object		:	words
			|	string 
			;
	string		:	'"' words '"'
			;
	words		:	words WORD
			|	WORD
			;
	------------------------------------------------------------------

If you translate the above with yacc, you will see that all the conflicts
have vanished now and we can decide from that that

	- any sentence of the language can be structured in only one
	  unique way according to this grammar;
	- the algorithm of yyparse will be able to recognize any
	  sentence that follows the above grammar and it will reject
	  any other sentence.

>It should parse as follows:
>
>throw ball
> - verb = throw, verb object = ball
>throw ball with "blue stripes"
> - verb throw, vo ball, prep with, po "blue stripes"
>throw blue ball with white stripes
> - v throw, vo blue ball, p with, po white stripes

To demonstrate this, I chose to add a scanner and some actions to the grammar.
First comes the scanner, written with lex:

	------------------------------------------------------------------
	%%
	"with"		{ yylval.word = save_word(yytext); return WITH; }
	"in"		{ yylval.word = save_word(yytext); return IN;   }
	"on"		{ yylval.word = save_word(yytext); return ON;	}
	"about"		{ yylval.word = save_word(yytext); return ABOUT;}
	[a-zA-Z]+	{ yylval.word = save_word(yytext); return WORD; }
	[ \t\n]		;
	.		return yytext[0];
	------------------------------------------------------------------

One thing is a bit unusual here (compared to other lex applications):
As you see we have some fixed tokens here as "with", "in", etc. Normally
we would ONLY return the token type (WITH, IN, etc.) here, but as we
also need the word itself in our demonstration program, I chose to
transfer the string itself via yylval.

Now comes the biggest part of the program - our parser already stuffed
with actions that show the structuring. Probably these actions will have
to be modified in large parts. (They also do not QUITE what you want,
i.e. |strings| do no more differ from |words| once they have been turned
into an |object|. Thus the output will not show the double quotes in your
second example.)

Changes to the program can be made to adapt it to any other purpose than
showing the structuring. This usually will extend to change the grammar,
so that it contains better "hooks" for adding appropriate actions. (This
was the reason I didn't conserve the double quotes around the strings;
I had the impression that it would be easier with some changes to the
grammar, which I didn't want to make for the purpose of this article.)

It may even be advantageous to change the scanner sometimes. (E.g. I could
imagine that it would be help to pull the recognition of strings completely
out of the parser and have the scanner return a unique token for them.)
But this all depends on the plans you have, i.e. what you want to do in
your program.

Now, be prepared for a long source listing with few comments. (Well, I
thought I should leave something to figure out for you and all the other
readers :-).)

	------------------------------------------------------------------
	%{
	#include <stdio.h>
	extern char *malloc();
	extern char *strcpy();

	static void no_malloc_space(func)
		char *func;
	{
		fprintf(stderr, "*** NO MORE SPACE from malloc(): in %s\n", func);
		fflush(stderr);
		abort();
	}

	/*
	 * Operations on words
	*/
	char *save_word(s)
		char *s;
	{
		char *p = malloc(strlen(s) + 1);
		if (!p) no_malloc_space("save_word");
		return strcpy(p, s);
	}
	void free_word(word)
		char *word;
	{
		free(word);
	}

	/*
	 * Operations on wordlists
	*/
	struct word_list {
		char *word;
		struct word_list *next;
	};
	struct word_list *add_word_list(list, word)
		struct word_list *list;
		char *word;
	{
		struct word_list *p = (struct word_list *)
					malloc(sizeof(struct word_list));
		if (!p) no_malloc_space("add_word_list");
		p->word = word;
		p->next = (struct word_list *)0;
		if (list) {
			list->next = p;
			return list;
		}
		else {
			return p;
		}
	}
	void free_word_list(list)
		struct word_list *list;
	{
		while (list) {
			struct word_list *p =list;
			list = p->next;
			free(p);
		}
	}
	void print_word_list(s1, list, s2)
		char *s1;
		struct word_list *list;
		char *s2;
	{
		printf("%s", s1);
		while (list) {
			printf(" %s", list->word);
			list = list->next;
		}
		printf("%s", s2);
	}

	%}
	%union {
		char *word;
		struct word_list *words;
	}
	%token <word> WITH IN ON ABOUT
	%token <word> WORD
	%type <word> verb prep
	%type <words> object words string
	%%
	sentences	:	/*EMPTY*/
			|	sentences sentence '.'
			;
	sentence	:	verbphrase prepphrase {
					printf("\n");
				}
			;
	verbphrase	:	verb {
					printf("- verb = %s,", $1);
					free_word($1);
				}
			|	verb object {
					printf("- verb = %s,", $1);
					free_word($1);
					print_word_list(" verb object =", $2, ",");
					free_word_list($2);
				}
			;
	prepphrase	:	prep object {
					printf(" prep = %s,", $1);
					free_word($1);
					print_word_list(" prep object =", $2, ",");
					free_word_list($2);
				}
			|	/*EMPTY*/
			;
	verb		:	WORD	/* uses default: $$ = $1; */
			;
	prep		:	WITH | IN | ON | ABOUT
			;
	object		:	words	/* uses default: $$ = $1; */
			|	string 	/* uses default: $$ = $1; */
			;
	string		:	'"' words '"' {
					$$ = $2;
				}
			;
	words		:	words WORD {
					$$ = add_word_list($1, $2);
				}
			|	WORD {
					$$ = add_word_list((struct word_list *)0, $1);
				}
			;
	%%
	#include "lex.yy.c"
	------------------------------------------------------------------

>I cannot seem to create a yacc/lex definition for this.  Any help is *greatly*
>appreciated.

I hope this was help enough - but finally I must confess one thing: I
have not tested this program very thoroughly and it might well contain
some bugs*. So, if it happens that this programming task was your homework,
you should not only thoroughly analyze it and be prepared for questions, you
should also test it and correct any bugs you find - please understand that
I cannot take any responsibility for anything except that the DESIGN of
this program accords to my EXPERIENCE how your task COULD be solved ...

*: I've just decided to be a little more honest: I've deliberately left
one bug in the program. It will hit you whenever a word list has three
words or more --- enough hints so far :-)

>Thanks.

You're welcome.
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83