[comp.compilers] Grammar analysis tools

steven@cwi.nl (Steven Pemberton) (01/24/91)
In article <14868@goofy.megatest.UUCP> megatest!djones@decwrl.dec.com (Dave Jones) writes:
> Is there any way i can configure yacc or some other tool to tell me,
> at each step in a parse, the set of tokens that may follow the current
> token?  I am currently doing this by hand, and it is, as they say,
> a tedious and error-prone process.

Here is a presentation of some tools I use to analyse grammars, using
the language ABC, with as example an answer to the above question. If
there is enough interest, I willing to post the sources as a packaged
workspace to comp.sources.misc. I also have other tools, like an LL(1)
checker. Let me know.

Best wishes,

Steven Pemberton, CWI, Amsterdam; steven@cwi.nl

GRAMMAR TOOLS IN ABC

When I have to work with grammars, I always use ABC to do it. Among
the advantages are that you can do the work interactively, that you
can very quickly build additional tools, and that you have the already
powerful programming environment at your disposal.

What follows is a brief description of some of the tools I use, with
an example. There is no description of ABC: you can find a quick
description of the language by ftp-ing the file "abc.intro" from
uunet.uu.net, mcsun.eu.net, or hp4nl.nluug.nl, in directory
{pub}/{programming}/languages/abc, or by sending the message

	request: programming/languages/abc
	topic: abc.intro

to info-server@hp4nl.nluug.nl. The file tells you where to get more
information about ABC (there's a book), and how to get the
implementations (they're free).

Some of what follows is also presented in the book, though at a more
relaxed pace :-).

GRAMMARS

The representation that I use is more or less a direct transcription
of what a grammar is. I use a table whose keys are texts (i.e.
strings) representing the nonterminals of the language, and whose
items are sets of alternatives. Each alternative is a sequence of
texts, representing terminals and nonterminals. So here is a how-to
that displays a grammar in this form:

	HOW TO DISPLAY grammar:
	   FOR name IN keys grammar:
	      WRITE "`name`: " /
	      FOR alt IN grammar[name]:
		 WRITE "    "
		 FOR symbol IN alt:
		    WRITE symbol, " "
		 WRITE /

and as example:

	>>> DISPLAY sentence
	ADJ:
	    EMPTY
	    clever
	    shy
	BOY:
	    John
	    Kevin
	EMPTY:

	GIRL:
	    Mary
	    Susan
	OBJ:
	    SUBJ
	SENT:
	    SUBJ loves OBJ
	SUBJ:
	    ADJ BOY
	    ADJ GIRL

You can generate a random phrase from a grammar with the following:

	HOW TO GENERATE sym FROM grammar:
	   SELECT:
	      sym in keys grammar: \ Nonterminal
		 FOR new IN choice grammar[sym]:
		    GENERATE new FROM grammar
	      ELSE: \ Terminal symbol
		 WRITE sym, " "

	>>> GENERATE "SENT" FROM sentence
	Susan loves clever John

SETS

Here are some necessary functions on sets. Set union:

	HOW TO RETURN set1 with set2: \ Union
	   FOR x IN set2:
	      IF x not.in set1:
		 INSERT x IN set1
	   RETURN set1

Set difference:

	HOW TO RETURN set1 less set2: \ Difference
	   FOR x IN set2:
	      IF x in set1:
		 REMOVE x FROM set1
	   RETURN set1

Here is a function that collects all symbols used in the rules of a grammar:

	HOW TO RETURN used grammar: \Collect all used symbols
	   PUT {} IN all
	   FOR rule IN grammar:
	      FOR alt IN rule:
		 FOR sym IN alt:
		    IF sym not.in all:
		       INSERT sym IN all
	   RETURN all

	>>> WRITE used sentence
	{"ADJ"; "BOY"; "EMPTY"; "GIRL"; "John"; "Kevin"; "Mary";
	 "OBJ"; "SUBJ"; "Susan"; "clever"; "loves"; "shy"}

The terminals of the grammar are all the symbols less the nonterminals:

	>>> WRITE (used sentence) less keys sentence
	{"John"; "Kevin"; "Mary"; "Susan"; "clever"; "loves"; "shy"}

and the unused nonterminals (such as the root symbol) are the
nonterminals less the used symbols:

	>>> WRITE (keys sentence) less used sentence
	{"SENT"}

For neater output, the function "listed" converts a set to a text:

	HOW TO RETURN listed set:
	   PUT "" IN line
	   FOR element IN set:
	      PUT line ^ "`element` " IN line
	   RETURN line

	>>> WRITE listed ((used sentence) less keys sentence)
	John Kevin Mary Susan clever loves shy

A useful set is the set of nonterminals that can generate empty. This
is generated by repeatedly doing a pass over the rules that we don't
know yet can generate empty, until we find no more:

	HOW TO RETURN empties grammar:
	   PUT keys grammar, {} IN to.do, empties
	   WHILE SOME name IN to.do HAS empty.rule:
	      INSERT name IN empties
	      REMOVE name FROM to.do
	   RETURN empties
	empty.rule:
	   REPORT SOME alt IN grammar[name] HAS empty.alt
	empty.alt:
	   REPORT EACH sym IN alt HAS sym in empties

	>>> WRITE listed empties sentence
	ADJ EMPTY

RELATIONS

Relations between symbols of the grammar are the essential element of
the grammar tools. A relation is represented as a table whose keys are
symbols, and whose items are sets of symbols.

For instance, if symbol b follows symbol a in some rule, "b" will be
in the set for follows["a"], so you can say, for instance:

	IF "b" in follows["a"]: ....

Relations are sparse (i.e. a symbol is not in the keys of the relation
if the set of elements is empty), so we use the following to access a
relation:

	HOW TO RETURN relation for k: \relation[k] for sparse relations
	   IF k in keys relation:
	      RETURN relation[k]
	   RETURN {}

To add an element to a relation, we use this:

	HOW TO ADD element TO relation FOR thing:
	   IF thing not.in keys relation: \First time
	      PUT {} IN relation[thing]
	   IF element not.in relation[thing]:
	      INSERT element IN relation[thing]

for instance:

	>>> ADD "b" TO follows FOR "a"

We'll display a relation with:

	HOW TO SHOW relation:
	   FOR k IN keys relation:
	      WRITE "`k`: ", listed relation[k], /

Some general functions on relations. The inverse:

	HOW TO RETURN inverse relation:
	   PUT {} IN inv
	   FOR k IN keys relation:
	      FOR x IN relation[k]:
		 ADD k TO inv FOR x
	   RETURN inv

The product of two relations (a P c iff a R1 b and b R2 c):

	HOW TO RETURN r1 prod r2: \product of relations
	   PUT {} IN prod
	   FOR c IN keys r2:
	      FOR b IN r2[c]:
		 IF b in keys r1:
		    FOR a IN r1[b]:
		       ADD a TO prod FOR c
	   RETURN prod

The closure:

	HOW TO RETURN closure r:
	   FOR i IN keys r:
	      FOR j IN keys r:
		 IF i in r[j]:
		    PUT r[i] with r[j] IN r[j]
	   RETURN r

To make a relation reflexive, we use the following. Since relations
are sparse, we also have to pass the set of symbols that it must be
reflexive over:

	HOW TO RETURN symbols reflexive relation: \make the relation reflexive
	   FOR sym IN symbols:
	      ADD sym TO relation FOR sym
	   RETURN relation

SOME EXAMPLES OF RELATIONS

To collect the *direct* followers for each symbol, we walk along each
alternative, collecting adjacent symbols. There is one catch: in a
rule like:

	SENT: the ADJ PERSON

"the" and "ADJ" are adjacent, but if "ADJ" can generate empty, then so
are "the" and "PERSON":

	HOW TO RETURN followers grammar:
	   PUT {}, empties grammar IN foll, empty
	   FOR rule IN grammar:
	      FOR alt IN rule:
		 TREAT ALT
	   RETURN foll
	TREAT ALT:
	   FOR i IN {1..#alt-1}:
	      PUT alt item i IN this
	      TREAT PART
	TREAT PART:
	   FOR j IN {i+1..#alt}:
	      PUT alt item j IN next
	      ADD next TO foll FOR this
	      IF next not.in empty: QUIT

	>>> SHOW followers sentence
	ADJ: BOY GIRL
	SUBJ: loves
	loves: OBJ

To collect the direct starter symbols of each rule, you also have to
deal with symbols that produce empty:

	HOW TO RETURN heads grammar:
	   PUT {}, empties grammar IN heads, empty
	   FOR name IN keys grammar:
	      FOR alt IN grammar[name]:
		 TREAT ALT
	   RETURN heads
	TREAT ALT:
	   FOR i IN {1..#alt}:
	      PUT alt item i IN head
	      ADD head TO heads FOR name
	      IF head not.in empty: QUIT

	>>> SHOW heads sentence
	ADJ: EMPTY clever shy
	BOY: John Kevin
	GIRL: Mary Susan
	OBJ: SUBJ
	SENT: SUBJ
	SUBJ: ADJ BOY GIRL

Similarly for the direct enders:

	HOW TO RETURN tails grammar:
	   PUT {}, empties grammar IN tails, empty
	   FOR name IN keys grammar:
	      FOR alt IN grammar[name]:
		 TREAT ALT
	   RETURN tails
	TREAT ALT:
	   FOR i' IN {-#alt..-1}:
	      PUT -i' IN i
	      PUT alt item i IN tail
	      ADD tail TO tails FOR name
	      IF tail not.in empty: QUIT

The closure of the head relation represents all symbols that can start
a rule, either directly or indirectly:

	>>> SHOW closure heads sentence
	ADJ: EMPTY clever shy
	BOY: John Kevin
	GIRL: Mary Susan
	OBJ: ADJ BOY EMPTY GIRL John Kevin Mary SUBJ Susan clever shy
	SENT: ADJ BOY EMPTY GIRL John Kevin Mary SUBJ Susan clever shy
	SUBJ: ADJ BOY EMPTY GIRL John Kevin Mary Susan clever shy

Symbol b may follow symbol a in a phrase if b follows a in an
alternative, or if B follows A in an alternative and b is in heads*(B)
and a is in tails*(A). This is expressed as the product
	head* . follow . inverse(tail*).

Now we have enough to define a command that prints for each symbol in
an alternative what may follow that symbol at that point:

	HOW TO SHOW LOCAL FOLLOWERS grammar:
	   PUT (used grammar) with keys grammar IN symbols
	   PUT symbols reflexive (closure heads grammar) IN head.star
	   PUT symbols reflexive (closure tails grammar) IN tail.star
	   PUT followers grammar IN follow
	   PUT (head.star prod follow) prod (inverse tail.star) IN deep.follow
	   FOR parent IN keys grammar:
	      FOR alt IN grammar[parent]:
		 TREAT ALT
	TREAT ALT:
	   ANNOUNCE ALT
	   FOR i IN {1..#alt}:
	      TREAT SYM
	TREAT SYM:
	   PUT alt item i IN sym
	   WRITE "    `sym`: ", listed local.follow /
	local.follow:
	   PUT {} IN foll
	   FOR j IN {i+1..#alt}:
	      PUT alt item j IN next
	      PUT foll with (head.star for next) IN foll
	      IF next not.in empty:
		 RETURN foll
	   RETURN foll with (deep.follow for parent)
	ANNOUNCE ALT:
	   WRITE "`parent`: ", listed alt /

This prints each alternative separately, followed by each symbol of
the alternative indented one to a line followed by the symbols that
can follow it at that point.

For example:

	>>> SHOW LOCAL FOLLOWERS sentence
	ADJ: EMPTY
	    EMPTY: BOY GIRL John Kevin Mary Susan
	ADJ: clever
	    clever: BOY GIRL John Kevin Mary Susan
	ADJ: shy
	    shy: BOY GIRL John Kevin Mary Susan
	BOY: John
	    John: loves
	BOY: Kevin
	    Kevin: loves
	EMPTY:
	GIRL: Mary
	    Mary: loves
	GIRL: Susan
	    Susan: loves
	OBJ: SUBJ
	    SUBJ:
	SENT: SUBJ loves OBJ
	    SUBJ: loves
	    loves: ADJ BOY EMPTY GIRL John Kevin Mary OBJ SUBJ Susan clever shy
	    OBJ:
	SUBJ: ADJ BOY
	    ADJ: BOY John Kevin
	    BOY: loves
	SUBJ: ADJ GIRL
	    ADJ: GIRL Mary Susan
	    GIRL: loves
[From steven@cwi.nl (Steven Pemberton)]
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.