[comp.compilers] MetaTool compiler generator

ttw@mvuxi.att.com (Thomas T Wetmore, Iv +1 508 960 6015) (10/10/90)
There have been many messages on the net recently concerning problems, etc,
with respect to using YACC and LEX, and with respect to "compiler
generators" and similar meta-tools.  I am a long time user of YACC and LEX
(over 12 years), but in the last few years have shifted to using a
meta-translator that, among other things, provides a high-level, front-end
to YACC and LEX.

In this article I will briefly describe how this UNIX(r)-based software
tool, which can be used to generate arbitrary text-to-text translators,
builds the parsing phases of its generated translators.  The tool is
MetaTool(tm) Specification-Driven Tool Builder (MetaTool in the sequel), and
is marketed by the Advanced Software Products Group of AT&T Bell
Laboratories (many AT&T'ers will recognize this tool as the popular STAGE
program).  NOTE: this article is not intended as a product endorsement, but
rather as a description of one successful way to simplify the automatic
generation of parsing phases through a front-end approach to YACC and LEX.
Authors of other meta-tools and program generator generators should be able
to take advantage of the same approach.

MetaTool generates translators; it is a meta-translator, somewhat akin to
compiler generators, and is often called an application generator generator.
To generate a translator, MetaTool requires two kinds of information: a
description of the inputs (source or "specification") that the translator
will read, and a description of the products that the translator will
produce from the inputs.  From this information, which is provided in files
called description files, MetaTool automatically generates the translator.
That is:

            source description              specification
                   file                          |
                    |                            |
                    |                            |
                    V                            V
              MetaTool SDTB ---------------> Translator
                    A                            |
                    |                            |
                    |                            V
            product description               product/s
                  file/s

The information that describes the source inputs to the ultimate translator
is provided in a "grammar section" of the source description file.  From the
contents of the grammar section MetaTool automatically generates both a YACC
file and a LEX file that make up the front-end (parsing phase) of the
translator.

Before describing the grammar section let me first describe the two main
software "problems" that are solved by the information in the grammar
section.  First, foremost, and obviously, the grammar section must describe
a language in such a way that it is possible to generate the front-end
parsing code from the description.  This is handled by MetaTool by
translating the contents of the grammar section into complete YACC and LEX
files.  But secondly, and just as importantly, the information in the
grammar section is used to provide all the parse tree management, navigation
and information retrieval operations.  After all, the parsing phase of a
translator lays bare the syntactic structure of its source, but a great deal
of additional processing is required in any translator, to check the
semantics of the syntactic structure, and to navigate, process and retrieve
the information inherent in the structure.  In this note I only describe how
MetaTool solves the first problem.

The MetaTool grammar section consists of a sequence of definitions.
Definitions all have the form "name: body".  The two most common types of
definitions are grammar rule definitions and token definitions.  Grammar
rule definitions are written in an extended BNF notation that uses
well-established conventions for expressing lists and sequences of items
(with and without separators) and optional items.  Attributes (in the sense
of formal attribute grammars) can also be defined in grammar rules.  Token
definitions are written using the conventional LEX regular expression
language.  Here, for example, is the MetaTool grammar for a small
programming language:

    %grammar
    prgm:   ( id  "{"  state+  "}" )
    state:  ( asgn:   id "=" expr  ";"
            | rdstm:  "read" id  ";"
            | wrtstm: "write" expr  ";"
            | ifstm:  "if" expr "then" ts:state ["else" es:state] "fi"
            | wloop:  "while" expr state
            | block:  "{" state+ "}"
            )
    expr:   ( num:<[0-9]+>
            | id:<[a-z]+>
            | add:expr "+" expr
            | mul:expr "*" expr
            | eql:expr eq:"==" expr
            | neq:expr ne:"!=" expr
            | "(" expr ")"
              val: INT
            )
    sym:    list-structure ( sname:STRING svalue:INT )

This example shows the MetaTool syntax for most grammar section constructs,
so I won't bother to detail the rules.  This grammar contains three grammar
rules, quite a number of anonymous tokens (all the undefined quoted string
things), and a few token definitions (e.g., the definitions of "rdstm",
"wrtstm", "num", "id" and others that are nested within the grammar rule
definitions of "state" and "expr").  In a grammar section, definitions may
be either nested or given sequentially, depending upon programmer preference
-- and obviously, names can be used before they are defined.

In the definition of "prgm" and in the definition of the "block"
alternative, note that the item "state+" is used to indicate a sequence of
one or more statements.  The item "state*" would mean zero or more
statements, and additional syntactic sugar is available to define list item
separators.  In the definition of the if statement alternative there is an
example of the optional feature -- any items between square brackets are
optional.

The definitions that look like "name1:name2" are a third type of definition,
called renaming, and are used to prevent ambiguity during parse tree
traversals and navigation (solving part of the "second" problem).  The
definition "val: INT" is an attribute definition, defining a synthesized
integer value attribute that will be associated with every "expr" node in
the ultimate parse tree.  The definition of "sym" is not a grammar
definition at all, but rather the definition of a "list-structure", which,
in this example, is used to hold a symbol table (the symbol table
manipulating operations are automatically provided).

Another useful definition type, not shown in this example, is the balanced
token.  Here is an example:

    cstates: balance "{" <[^{}]> "}"

This defines a "token" that begins with a "{", ends with a "}", and may
contain a balanced number of "{"s and "}"s.  This definition could be used
to define arbitrary escapes to the C language, within a different language,
as a single token.

MetaTool translates the grammar section into YACC and LEX files.  The YACC
file will obviously contain the extra rules required for handling lists and
optional parts.  This is straightforward.  Semantic actions are
automatically provided in the YACC file (note that the MetaTool user does
not provide semantic actions in the grammar section).  These semantic
actions build the source input parse tree.  The structure of the parse tree
is part of the "second" problem, but once you decide upon the structure,
this is also straightforward.

Before MetaTool generates the LEX file it performs a complete follow graph
analysis of the grammar (the graph that defines which tokens can follow
which other tokens in valid inputs).  This analysis is used to compute LEX
start conditions that are automatically used in the generated LEX file.
(The algorithms to do this are only mildly tricky, and shouldn't present
most of you with much problem.)  This is a very powerful feature that serves
to remove all or most of the LEX ambiguities that would normally be
introduced by overlapping token definitions.  Use of this feature allows
MetaTool to successfully generate parsers for languages that contain
completely differently styled sub-languages.  The lexical actions that are
automatically placed in the generated LEX file serve to build the leaves of
the input parse tree.  Again, straightforward.

The part of MetaTool that generates the YACC and LEX files is but one part
of a larger whole.  However, the techniques used in that part are not
particularly complex, and I hope I have given a detailed enough glimpse of
the process that anyone reasonably expert with context free language design,
YACC and LEX could build their own "friendly" (oh, how I hate that term) LEX
and YACC file generator.  From our own experiences, we know that such tools
make it much easier to build translators and other tools that have parsers
in their front ends.

        Tom Wetmore
        AT&T Bell Laboratories
        ttw@mvuxi.att.com
        (508)960-6015
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.