[comp.compilers] Integrating C preprocessor with the parser

peterson@Compass.COM (Richard Peterson) (03/21/90)
Michael Platoff's inquiry regarding work to incorporate the semantics of C
preprocessor directives within a program representation used for source
code transformation yielded a response from Jean-Marie Larcheveque citing
the line number and file name directives generated by text-to-text
preprocessors.  I believe Michael's point, though, was that having a
line-for-line correspondence between the source input to the preprocessor
and the source input to the parser doesn't really help much in analyzing the
program structure in terms of the original identifiers (sometimes macros)
and expressions (sometimes #if expressions) by which the program is
understood and maintained.

We have implemented a C (ANSI plus some extensions) front end which
includes an integrated preprocessor and generates an AST-like program
representation.  The primary motives for the integrated preprocessor were
to eliminate the overhead of performing lexical analysis twice and to allow
for well-integrated error reporting.  Informational, warning, and error
messages from preprocessing, parsing, and semantics all show the original
source line with a flag on the offending token, whether that token is a
preprocessing directive, a macro invocation, or part of excluded
text.  Although not implemented currently, the design allows for flagging
error-causing token(s) within a macro expansion in addition to flagging
the macro invocation in the original source file.
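
To make the message format concrete, here is a rough sketch of a routine
that prints the original source line with a flag under the offending
token.  It is only an illustration, not the code in our front end: the
names src_pos and format_diagnostic are invented for this example, and it
assumes the source line contains no tabs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical source position record (an assumption of this sketch). */
struct src_pos {
    const char *file;   /* original file name */
    int line;           /* 1-based line number in that file */
    int col;            /* 0-based column of the token's first character */
    int len;            /* token length in characters */
};

/* Write "file:line: severity: msg", the original source line, and a
 * marker line with '^' under each character of the offending token.
 * Returns what snprintf returns. */
int format_diagnostic(char *out, size_t outsz, struct src_pos pos,
                      const char *src_line, const char *severity,
                      const char *msg)
{
    char marker[256];
    int i, n = 0;

    assert(pos.col >= 0 && pos.len > 0);
    assert((size_t)(pos.col + pos.len) <= strlen(src_line));
    assert((size_t)(pos.col + pos.len) < sizeof marker);

    for (i = 0; i < pos.col; i++)
        marker[n++] = ' ';
    for (i = 0; i < pos.len; i++)
        marker[n++] = '^';
    marker[n] = '\0';

    return snprintf(out, outsz, "%s:%d: %s: %s\n%s\n%s\n",
                    pos.file, pos.line, severity, msg, src_line, marker);
}
```

The same routine serves a directive, a macro invocation, or a token in
excluded text, because all it needs is the true source position.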

We have not attempted to represent preprocessing information other than
true source position in the AST, and the phrase structure grammar input to
our LALR(1) frontend generator has no knowledge of preprocessing
constructs.  Knowledge of preprocessing is confined to the lexical grammar
input to the frontend generator, and to hand-coded routines in the
screener.  The parser calls the screener for each token, unaware of
preprocessing actions because they are not useful in the normal process of
compilation.
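
The division of labor can be sketched roughly as follows.  This is a toy,
not our screener: the raw token kinds, the screener struct, and
screener_next are all invented for the example, only #ifdef and #endif are
handled, and the lexical layer is assumed to have already classified its
output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Output of the lexical layer: ordinary tokens and directives. */
enum raw_kind { RAW_TOKEN, RAW_IFDEF, RAW_ENDIF, RAW_EOF };

struct raw {
    enum raw_kind kind;
    const char *text;   /* token spelling, or macro name for RAW_IFDEF */
};

struct screener {
    const struct raw *in;        /* next raw item to examine */
    const char *const *defined;  /* NULL-terminated list of defined macros */
    int depth;                   /* number of open conditionals */
    int skip_depth;              /* depth where skipping began, 0 if none */
};

static int is_defined(const struct screener *s, const char *name)
{
    const char *const *d;

    for (d = s->defined; *d != NULL; d++)
        if (strcmp(*d, name) == 0)
            return 1;
    return 0;
}

/* Called by the parser for each token.  Directives and excluded text are
 * consumed here, so the parser sees only ordinary tokens, or NULL at the
 * end of input. */
const char *screener_next(struct screener *s)
{
    assert(s != NULL && s->in != NULL);
    for (;;) {
        const struct raw *r = s->in;

        if (r->kind == RAW_EOF)
            return NULL;            /* stay on EOF; do not advance */
        s->in++;
        switch (r->kind) {
        case RAW_IFDEF:
            s->depth++;
            if (s->skip_depth == 0 && !is_defined(s, r->text))
                s->skip_depth = s->depth;
            break;
        case RAW_ENDIF:
            if (s->skip_depth == s->depth)
                s->skip_depth = 0;
            if (s->depth > 0)
                s->depth--;
            break;
        case RAW_TOKEN:
            if (s->skip_depth == 0)
                return r->text;
            break;
        case RAW_EOF:
            break;                  /* handled above */
        }
    }
}
```

Given the raw sequence for "x = a", "#ifdef FAST", "+ 1", "#endif", ";",
the parser receives either "x = a ;" or "x = a + 1 ;" and never sees the
directives themselves.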

However, at the time that the parser calls the screener there is a
significant amount of information available about preprocessing, because
the preprocessor really produces one token at a time.  An #ifdef that
appears within an expression is processed right after the preceding token
is handed to the parser, and each token that reaches the parser does so in
the presence of the conditional inclusion stack maintained by the
preprocessor.  Even within macro expansions, at the time the parser gets a
token generated by a macro, there exists a data structure representing the
derivation of that token through nested macro calls, parameter expansions,
and token pasting or stringization operations.
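
For illustration, the information available at that moment might be
pictured as a token record with two chains hanging off it.  The layout
below is a guess made up for this example (the struct names, fields, and
describe_origin are all hypothetical), not the data structure our front
end actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* How a token arose within one macro expansion step. */
enum origin { FROM_BODY, FROM_ARGUMENT, FROM_PASTE, FROM_STRINGIZE };

/* One step in the derivation of a token through nested macro calls. */
struct expansion {
    const char *macro;              /* macro being expanded */
    int call_line;                  /* line of the invocation */
    enum origin how;                /* body text, argument, ## or # */
    const struct expansion *parent; /* enclosing expansion, or NULL */
};

/* One entry of the conditional inclusion stack. */
struct cond_frame {
    const char *directive;          /* text of the controlling directive */
    int line;                       /* line where it appeared */
    const struct cond_frame *outer; /* enclosing conditional, or NULL */
};

/* A token as it could be seen when it is handed to the parser. */
struct token {
    const char *spelling;
    const struct expansion *derivation; /* innermost expansion first */
    const struct cond_frame *conds;     /* innermost conditional first */
};

/* Append s to the string in out without overflowing outsz bytes. */
static void append(char *out, size_t outsz, const char *s)
{
    size_t used = strlen(out);

    if (used + 1 < outsz)
        strncat(out, s, outsz - used - 1);
}

/* Render "spelling <- MACRO <- OUTER_MACRO [directive]..." into out. */
void describe_origin(const struct token *t, char *out, size_t outsz)
{
    const struct expansion *e;
    const struct cond_frame *c;

    assert(t != NULL && outsz > 0);
    out[0] = '\0';
    append(out, outsz, t->spelling);
    for (e = t->derivation; e != NULL; e = e->parent) {
        append(out, outsz, " <- ");
        append(out, outsz, e->macro);
    }
    for (c = t->conds; c != NULL; c = c->outer) {
        append(out, outsz, " [");
        append(out, outsz, c->directive);
        append(out, outsz, "]");
    }
}
```

A "*" token produced by a hypothetical SQUARE macro used inside AREA,
under "#ifdef USE_METRIC", would be described as
"* <- SQUARE <- AREA [#ifdef USE_METRIC]".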

So it's certainly possible for a C language processor to be aware of the
effects of preprocessing in some sense.  Storing the information in a
program representation that would be useful to the kinds of tools
described sounds pretty intractable, at least in the general case, for the
reasons Michael listed.  Perhaps with a suitable set of (checkable)
constraints imposed on the use of the preprocessor by programs to be
represented, some useful capabilities could be developed.
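
As one example of what a checkable constraint might look like (this
particular rule is my own illustration, not something from Michael's
posting): require that each branch of a conditional contain balanced
brackets, so that a branch never opens a block that is closed outside it.
The sketch below assumes one directive per line with no backslash
continuations, and it ignores brackets inside strings and comments; the
function name is invented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NEST 32

/* If line is a directive, return a pointer to its name; else NULL. */
static const char *directive_name(const char *line)
{
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line != '#')
        return NULL;
    line++;
    while (*line == ' ' || *line == '\t')
        line++;
    return line;
}

/* Return 1 if every conditional branch has balanced (), [] and {},
 * and every conditional is closed; otherwise return 0. */
int conditionals_are_balanced(const char *const *lines, size_t n)
{
    int depth[MAX_NEST];  /* net open brackets in the current branch */
    int level = 0;        /* conditional nesting; 0 means outside */
    size_t i;
    const char *p;

    assert(lines != NULL || n == 0);
    depth[0] = 0;
    for (i = 0; i < n; i++) {
        const char *d = directive_name(lines[i]);

        if (d != NULL) {
            if (strncmp(d, "if", 2) == 0) {         /* #if #ifdef #ifndef */
                if (++level >= MAX_NEST)
                    return 0;
                depth[level] = 0;
            } else if (strncmp(d, "el", 2) == 0) {  /* #else #elif */
                if (level == 0 || depth[level] != 0)
                    return 0;
            } else if (strncmp(d, "endif", 5) == 0) {
                if (level == 0 || depth[level] != 0)
                    return 0;
                level--;
            }
            continue;     /* other directives are not examined */
        }
        for (p = lines[i]; *p != '\0'; p++) {
            if (strchr("([{", *p) != NULL) {
                depth[level]++;
            } else if (strchr(")]}", *p) != NULL) {
                depth[level]--;
                if (level > 0 && depth[level] < 0)
                    return 0;   /* closes a bracket opened outside */
            }
        }
    }
    return level == 0;
}
```

A function whose opening line is chosen by #ifdef/#else but whose closing
brace is shared fails this check, while a conditional that wraps a whole
statement passes.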

Rich Peterson              Internet: peterson@compass.com
Compass, Inc.              UUCP:     {think,encore,cvbnet}!compass!peterson
550 Edgewater Drive        Phone:    (617) 245-9540
Wakefield, MA 01880, USA.
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue.  Meta-mail to compilers-request@esegue.
Please send responses to the author of the message, not the poster.