ttw@mvuxi.att.com (Thomas T Wetmore, Iv +1 508 960 6015) (10/10/90)
There have been many messages on the net recently concerning problems, etc, with respect to using YACC and LEX, and with respect to "compiler generators" and similar meta-tools. I am a long time user of YACC and LEX (over 12 years), but in the last few years have shifted to using a meta-translator that, among other things, provides a high-level, front-end to YACC and LEX. In this article I will briefly describe how this UNIX(r)-based software tool, which can be used to generate arbitrary text-to-text translators, builds the parsing phases of its generated translators. The tool is MetaTool(tm) Specification-Driven Tool Builder (MetaTool in the sequel), and is marketed by the Advanced Software Products Group of AT&T Bell Laboratories (many AT&T'ers will recognize this tool as the popular STAGE program). NOTE: this article is not intended as a product endorsement, but rather as a description of one successful way to simplify the automatic generation of parsing phases through a front-end approach to YACC and LEX. Authors of other meta-tools and program generator generators should be able to take advantage of the same approach. MetaTool generates translators; it is a meta-translator, somewhat akin to compiler generators, and is often called an application generator generator. To generate a translator, MetaTool requires two kinds of information: a description of the inputs (source or "specification") that the translator will read, and a description of the products that the translator will produce from the inputs. From this information, which is provided in files called description files, MetaTool automatically generates the translator. That is: source description specification file | | | | | V V MetaTool SDTB ---------------> Translator A | | | | V product description product/s file/s The information that describes the source inputs to the ultimate translator is provided in a "grammar section" of the source description file. From the contents of the grammar section MetaTool automatically generates both a YACC file and a LEX file that make up the front-end (parsing phase) of the translator. Before describing the grammar section let me first describe the two main software "problems" that are solved by the information in the grammar section. First, foremost, and obviously, the grammar section must describe a language in such a way that it is possible to generate the front-end parsing code from the description. This is handled by MetaTool by translating the contents of the grammar section into complete YACC and LEX files. But secondly, and just as importantly, the information in the grammar section is used to provide all the parse tree management, navigation and information retrieval operations. After all, the parsing phase of a translator lays bare the syntactic structure of its source, but a great deal of additional processing is required in any translator, to check the semantics of the syntactic structure, and to navigate, process and retrieve the information inherent in the structure. In this note I only describe how MetaTool solves the first problem. The MetaTool grammar section consists of a sequence of definitions. Definitions all have the form "name: body". The two most common types of definitions are grammar rule definitions and token definitions. Grammar rule definitions are written in an extended BNF notation that uses well-established conventions for expressing lists and sequences of items (with and without separators) and optional items. Attributes (in the sense of formal attribute grammars) can also be defined in grammar rules. Token definitions are written using the conventional LEX regular expression language. Here, for example, is the MetaTool grammar for a small programming language: %grammar prgm: ( id "{" state+ "}" ) state: ( asgn: id "=" expr ";" | rdstm: "read" id ";" | wrtstm: "write" expr ";" | ifstm: "if" expr "then" ts:state ["else" es:state] "fi" | wloop: "while" expr state | block: "{" state+ "}" ) expr: ( num:<[0-9]+> | id:<[a-z]+> | add:expr "+" expr | mul:expr "*" expr | eql:expr eq:"==" expr | neq:expr ne:"!=" expr | "(" expr ")" val: INT ) sym: list-structure ( sname:STRING svalue:INT ) This example shows the MetaTool syntax for most grammar section constructs, so I won't bother to detail the rules. This grammar contains three grammar rules, quite a number of anonymous tokens (all the undefined quoted string things), and a few token definitions (e.g., the definitions of "rdstm", "wrtstm", "num", "id" and others that are nested within the grammar rule definitions of "state" and "expr"). In a grammar section, definitions may be either nested or given sequentially, depending upon programmer preference -- and obviously, names can be used before they are defined. In the definition of "prgm" and in the definition of the "block" alternative, note that the item "state+" is used to indicate a sequence of one or more statements. The item "state*" would mean zero or more statements, and additional syntactic sugar is available to define list item separators. In the definition of the if statement alternative there is an example of the optional feature -- any items between square brackets are optional. The definitions that look like "name1:name2" are a third type of definition, called renaming, and are used to prevent ambiguity during parse tree traversals and navigation (solving part of the "second" problem). The definition "val: INT" is an attribute definition, defining a synthesized integer value attribute that will be associated with every "expr" node in the ultimate parse tree. The definition of "sym" is not a grammar definition at all, but rather the definition of a "list-structure", which, in this example, is used to hold a symbol table (the symbol table manipulating operations are automatically provided). Another useful definition type, not shown in this example, is the balanced token. Here is an example: cstates: balance "{" <[^{}]> "}" This defines a "token" that begins with a "{", ends with a "}", and may contain a balanced number of "{"s and "}"s. This definition could be used to define arbitrary escapes to the C language, within a different language, as a single token. MetaTool translates the grammar section into YACC and LEX files. The YACC file will obviously contain the extra rules required for handling lists and optional parts. This is straightforward. Semantic actions are automatically provided in the YACC file (note that the MetaTool user does not provide semantic actions in the grammar section). These semantic actions build the source input parse tree. The structure of the parse tree is part of the "second" problem, but once you decide upon the structure, this is also straightforward. Before MetaTool generates the LEX file it performs a complete follow graph analysis of the grammar (the graph that defines which tokens can follow which other tokens in valid inputs). This analysis is used to compute LEX start conditions that are automatically used in the generated LEX file. (The algorithms to do this are only mildly tricky, and shouldn't present most of you with much problem.) This is a very powerful feature that serves to remove all or most of the LEX ambiguities that would normally be introduced by overlapping token definitions. Use of this feature allows MetaTool to successfully generate parsers for languages that contain completely differently styled sub-languages. The lexical actions that are automatically placed in the generated LEX file serve to build the leaves of the input parse tree. Again, straightforward. The part of MetaTool that generates the YACC and LEX files is but one part of a larger whole. However, the techniques used in that part are not particularly complex, and I hope I have given a detailed enough glimpse of the process that anyone reasonably expert with context free language design, YACC and LEX could build their own "friendly" (oh, how I hate that term) LEX and YACC file generator. From our own experiences, we know that such tools make it much easier to build translators and other tools that have parsers in their front ends. Tom Wetmore AT&T Bell Laboratories ttw@mvuxi.att.com (508)960-6015 -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.