[comp.std.c] Many ANSI C questions

dfp@cbnewsl.ATT.COM (david.f.prosser) (10/13/89)

In the referenced articles diamond@ws.sony.junet (Norman Diamond)
asks a number of questions:

>The grammar given in the standard does not permit an empty source
>file.  Strangely enough, a preprocessor source file may be empty,
>but since it cannot generate a non-empty real source file, error
>detection is only delayed.

>However, in 2.1.1.2, phase 2, "A source file that is not empty shall
>end in a new-line character, ...."  Why the condition "that is not
>empty"?  Did the committee intend to allow an empty source file, but
>forgot that somewhere along the way?

A translation unit must contain at least one external declaration,
but can be made up of a number of source files, all but the original
via #include directives.  Any of these #included files could be empty.

--------------------------------------------------------
>Consider the following fragment of code:

>  x = y + z;  /* comment
>  more comment */  #define a b

>Is this a legal #define?  I think no, because the comment is replaced
>by a single space, so the "#" is preceded by more than whitespace --
>it is preceded by "x = y + z;" on the same line.

>If my processor encounters and obeys such an illegal #define, is it
>required to produce a warning message, or may it obey silently?

The token sequence does not include a valid preprocessing directive.
Since it includes a # token, and since there are no valid uses of #
as a separate token in ANSI C other than within directive lines, a
grammar-based constraint must have been violated and a diagnostic
issued.  Once the diagnostic has been issued, an implementation is
free to give its own interpretation to the invalid token sequence.

--------------------------------------------------------
>Consider the following sample program:

>  #define junk #include <garbage.h>
>  junk

>Does my preprocessor have to include <garbage.h>?


>Note that this is different from the situation mentioned in the
>standard.  For the following sample program:

>  #define junk2(arg) arg
>  junk2(#include <garbage.h>)

>the standard says that the behavior is undefined.

No.  88/12/07, section 3.8.3.4, page 92, lines 11-12:

	The resulting completely macro-replaced preprocessing token
	sequence is not processed as a preprocessing directive even
	if it resembles one.

--------------------------------------------------------
>In evaluating the constant-expression of a #if directive, any
>identifiers that are left over after macro substitution are changed
>into the constant 0.  What about preprocessor numbers that are left
>over after numeric conversion?  For example, 1f0x3 does not convert
>into a long or double.  Should it be replaced by the constant 0?

88/12/07, section 3.8.1, page 88, lines 1-4:

	After all replacements due to macro expansion and the
	defined unary operator have been performed, all remaining
	identifiers are replaced with the pp-number 0, and then
	each preprocessing token is converted into a token.

This is exactly analogous to the translation phases that are being
emulated in the evaluation of the #if expression.  Likewise, if there
are pp-tokens remaining that are not also valid (regular) tokens,
the behavior is undefined.

--------------------------------------------------------
>In article <11163@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:

>>... as a general design principle the preprocessor is not required to
>>back up, apart from the quite limited case of rescanning the
>>replacement buffer.

>Thank you for answering my question, but I think that your answer might
>be a bit too broad.  Section 3.8.3.4 (Dec. 7 '88) says that after
>replacement, the replacement buffer must be rescanned WITH THE REST OF
>THE SOURCE FILE'S PREPROCESSING TOKENS.

>For example, consider:
>
>  #include <stdio.h>
>  #define mf1(x) printf("Hello, world.\n")
>  #define m2     (z);
>  main(){
>    mf1 m2
>  }

>After "m2" changes into "(z);", it must be rescanned with the rest of
>the source file's preprocessing tokens.  At this time mf1 must be
>recognized as a function-like macro call.  The preprocessor MUST BACK
>UP EARLIER THAN THE REPLACEMENT BUFFER.

>I will bet that every pre-ANSI compiler/preprocessor would have
>rejected the above program.  But ANSI requires it to print
>"Hello, world.\n".

Fortunately, Doug is correct.  Unfortunately, English is not as
precise a language as we would like.  I can see that you've taken
"rest of the source file's preprocessing tokens" to mean all preceding
and subsequent tokens.  The intent was merely to clarify that there
was no artificial boundary imposed between the end of the replacement
sequence and the following--the "rest" of the--preprocessing tokens,
despite the need to remember where the replacement sequence ended for
purposes of recursion detection.

Doug's generalization was one of the guidelines that the Committee
used in its macro replacement algorithm choices.

>In fact I think I can construct a few perverse macros, which will make
>the preprocessor take EXPONENTIAL TIME in proportion to the length of
>the program.

Not that it matters, but macro replacement in its simplest form
is an exponential process, as long as replacements are rescanned
for further macros.

--------------------------------------------------------
>Consider the following code:

>  #define a(x) b
>  #define b(x) x

>  a(a)(a)(a)

>The macro replacement for a(a) results in b.
>  First replacement buffer:  b       Remaining tokens:  (a)(a)
>Inside the first replacement buffer, no further nested replacements
>will recognize the macro name "a".  The name "a" is painted blue.

>The first replacement buffer is rescanned not by itself, but along with
>the rest of the source program's tokens.  "b(a)" also causes macro
>replacement and becomes "a".
>  Second replacement buffer:  a      Remaining tokens:  (a)

>The second replacement buffer is rescanned not by itself, but along with
>the rest of the source program's tokens.

>The "a" in the second replacement buffer did not come from the first
>replacement buffer.  It came from three of the remaining tokens which
>were in the source file following the first replacement buffer.
>Is this "a" part of a nested replacement?  Is it still painted blue?

>Note that there are many "paths" that can be taken for a possible
>macro name to travel from a preprocessor token (outside the replacement
>buffer) to one that is inside the replacement buffer.  When do they
>stop getting painted blue?  If either too early or too late, they cause
>very surprising results.

>The standard is very unclear on when the blue painting stops.  Could
>someone please clarify?

There is, indeed, an ambiguity in the replacement algorithm as you
point out.  In summary, given a preprocessing token sequence that is
recognized to be a function-like macro invocation

	m ( ... )

the set of macro names to "paint blue" for "m" is not necessarily the
same as the set for the rest of the invocation.  The pANS gives the
implementation complete freedom regarding those preprocessing tokens
*not* in common in these sets.  (It can be shown that only the macro
name's and the close paren's sets matter.)  For those preprocessing
tokens in common, they (and "m") are in the "paint blue" set applied
to the preprocessing tokens of the replacement sequence.  The obvious
two choices are either the union of the sets or the intersection.  In
your example, a "union" or "restrictive" implementation would produce

	a(a)(a)

while the "intersection" or "nonrestrictive" implementation produces

	a

What the programmer's intent may have been is not at all clear to me,
so any argument from intent is moot for me.

In my document presented to X3J11 on this algorithm, I noted this
problem and said that I'd prefer the "intersection" choice, mostly
because this allowed for more replacements without getting into a
recursion loop.  However, I think I'm safe in stating that most of
the Committee believes that this issue, while something to wrestle
with within an implementation, is not one that matters in "real code",
and can live with preprocessing implementations that do not behave
identically.  (This should be clear from other parts of the
preprocessing description; take "empty" invocation arguments, for
example.)


Hope this compendium of answers helps.  These all appeared within a
very short period of time on my netnews machine.

Dave Prosser	...not an official X3J11 answer...

ndjc@capone.UUCP (Nick Crossley) (10/14/89)

In the referenced articles diamond@ws.sony.junet (Norman Diamond) writes:
>The grammar given in the standard does not permit an empty source
>file.  Strangely enough, a preprocessor source file may be empty,
>but since it cannot generate a non-empty real source file, error
>detection is only delayed.

In article <2259@cbnewsl.ATT.COM> dfp@cbnewsl.ATT.COM (david.f.prosser) replies:
>A translation unit must contain at least one external declaration,
>but can be made up of a number of source files, all but the original
>via #include directives.  Any of these #included files could be empty.

But empty translation units are useful.  I have several products
where tracing code or machine/OS/version/... code is all put into
separate source files; references to this code are controlled by
appropriate macros and/or #ifs.  These separate sources would look
like:

	#if  DEBUG
	function declarations
	#endif

Now on an ANSI C compiler this will fail, or more likely give a
warning (as it does on the AT&T V.4 system).  I have to revise this
apparently logical arrangement, or add dummy declarations, to remove
these warnings.
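A sketch of such a file and the dummy-declaration workaround (file and identifier names are hypothetical, not from Nick's post):

```c
/* trace.c -- when DEBUG is 0 or undefined, the #if block
   preprocesses away entirely. */
#if DEBUG
#include <stdio.h>
void trace(const char *msg) { fprintf(stderr, "trace: %s\n", msg); }
#endif

/* Dummy declaration: keeps the translation unit from being empty
   when DEBUG is off, satisfying the ANSI grammar. */
extern int trace_c_not_empty;
```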

The same is true of a source file which contains only #ident (or
#pragma ident) lines.

I suppose there was a good reason for this decision, such as a
known system incapable of handling empty translation units?
-- 

<<< standard disclaimers >>>
Nick Crossley, ICL NA, 9801 Muirlands, Irvine, CA 92718-2521, USA 714-458-7282
uunet!ccicpg!ndjc  /  ndjc@ccicpg.UUCP