[comp.std.c] The Preprocessor and tokens

cmills@wyse.wyse.com (03/25/91)

Whilst composing this year's entry for the IOCCC, I stumbled on some
ANSI weirdness and I was wondering, what does the Standard say about
constructions like these:

#define _	+ 42

int j = 6_;

#define N	42

int k = 0x7e+N;

My compiler refuses to expand the macros in these cases.  Is there a good
reason for this (other than that the preprocessor is a little easier to
write if you define [0-9][0-9a-zA-Z_]* as a token)?  I was under the
impression that the preprocessor and the compiler agreed on what constituted
a token...  (I find the second example particularly obfuscating :)

Do K&R compilers do it any differently?

BTW: The compiler in question is GCC.


					Chris Mills

bhoughto@pima.intel.com (Blair P. Houghton) (03/25/91)

In article <3137@wyse.wyse.com> cmills@wyse.wyse.com () writes:
>Whilst composing this year's entry for the IOCCC, I stumbled on some
>ANSI weirdness and I was wondering, what does the Standard say about
>constructions like these:
>
>#define _	+ 42
>
>int j = 6_;
>
>#define N	42
>
>int k = 0x7e+N;

The standard calls these things (`6_' and `0x7e+N')
"preprocessing numbers" and says they don't have to
be tokenized any further than that.
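
One way to watch this happen without tripping a compile error is to
stringify the text after macro expansion.  The sketch below is only an
illustration: the names STR, XSTR, glued, spaced and hex_f are made up
for the example and appear nowhere in the Standard.

```c
#define N 42

/* Two-level stringification: XSTR macro-expands its argument first,
 * then STR turns whatever tokens are left into a string literal. */
#define STR(x)  #x
#define XSTR(x) STR(x)

/* `0x7e+N' is a single preprocessing number (the `e' lets the `+' in),
 * so there is no separate identifier N to expand.  Result: "0x7e+N". */
const char *glued(void)  { return XSTR(0x7e+N); }

/* With whitespace there are three tokens and N is replaced by 42. */
const char *spaced(void) { return XSTR(0x7e + N); }

/* `f' is not `e' or `E', so the pp-number stops at `0x7f' and N is
 * again a separate token that gets replaced. */
const char *hex_f(void)  { return XSTR(0x7f+N); }
```

Printing the three strings with puts() shows the first one still
containing the letter N, while the other two contain 42 instead.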

>My compiler refuses to expand the macros in these cases.

As it should; the macros aren't invoked.

>Is there a good
>reason for this (other than the preprocessor is a little easier to write
>if you define [0-9][0-9a-zA-Z_]* as a token)?

You forgot [.], which can appear anywhere after the leading
digit, and [+-], which can appear directly after an `e' or `E'.

Nope.  The rationale says the committee punted, claiming
that they didn't want to "burden" the preprocessor (uh, yeah...).

The last sentence of the paragraph in the rationale is,
"[in this case,] exercise of reasonable precaution in coding
style avoids surprises."

You can't BUY irony like that.

				--Blair
				  "If you think the IOCCC is a
				   mess, try reading X3.159-1989,
				   sometime..."

henry@zoo.toronto.edu (Henry Spencer) (03/26/91)

In article <3137@wyse.wyse.com> cmills@wyse.wyse.com () writes:
>...I was under the impression
>that the preprocessor and the compiler agreed on what constituted a
>token... 

Unfortunately, no.  There is a separate notion of a "preprocessing token",
which is like a real token in a lot of ways but has a much more generous
syntax for numbers.  This is what you're running afoul of.

The problem is that there are demented people who want to use vile and
disgusting perversions like token concatenation to actually put numbers
together out of pieces at preprocessing time.  The lexical syntax of C
numbers is really ugly, and trying to produce a lexical definition of a
valid *piece* of a number is agonizing.  X3J11 therefore came up with
the notion of a "preprocessing number" which covers anything vaguely
number-like, and stipulated full lexical validation of numbers only
after preprocessing.
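
For anyone who has not met the practice being described, here is a
small sketch of building numbers out of pieces with `##'.  The macro
names HEX and SCI are invented for the illustration.

```c
/* `0x' on its own is not a valid C number, but it is a valid
 * preprocessing number, so it can sit in a replacement list and be
 * pasted onto the digits that follow. */
#define HEX(digits)  0x##digits

/* The intermediate pieces here (`1e' or `e3', depending on the order
 * in which the two pastes are done) are not numbers either; only the
 * final result `1e3' has to be a valid constant. */
#define SCI(m, x)    m##e##x

enum { LOW_BYTE_MASK = HEX(ff) };       /* 0xff  */
static const double thousand = SCI(1, 3);  /* 1e3 */
```

Full validation of the spelling happens only when the finished
preprocessing token is converted to a real token, which is why the
pieces are allowed to look odd along the way.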
-- 
"[Some people] positively *wish* to     | Henry Spencer @ U of Toronto Zoology
believe ill of the modern world."-R.Peto|  henry@zoo.toronto.edu  utzoo!henry

diamond@jit345.swstokyo.dec.com (Norman Diamond) (03/27/91)

In article <3137@wyse.wyse.com> cmills@wyse.wyse.com () writes:

>what does the Standard say about constructions like these:
>#define _	+ 42
>int j = 6_;
>#define N	42
>int k = 0x7e+N;

The standard forbids those macros from being expanded.
However, since your program violates the ANSI syntax, the processor may do
whatever it wishes, as long as it produces at least one diagnostic.  It
could, if it wishes, give a warning and then do what you asked for.

>Is there a good reason for this (other than the preprocessor is a little
>easier to write if you define [0-9][0-9a-zA-Z_]* as a token)?

No, there is no good reason for it.  The possible reason, which you suggest,
also falls pretty far short of being a good one.  There is no good reason.

>I was under the impression that the preprocessor and the compiler agreed
>on what constituted a token...

The standard specifically distinguishes preprocessing tokens from tokens.
RTFS.

>Do K&R compilers do it any differently?

Usually.  The standard broke some working, valid, code this time.
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.

gwyn@smoke.brl.mil (Doug Gwyn) (03/28/91)

In article <1991Mar27.033525.21697@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>>what does the Standard say about constructions like these:
>>#define _	+ 42
>>int j = 6_;
>>#define N	42
>>int k = 0x7e+N;
>>Do K&R compilers do it any differently?
>Usually.  The standard broke some working, valid, code this time.

I've commented before about how sick I get of hearing people pontificate
over how "ANSI C breaks valid code".  In fact, the standard UNIX (Reiser)
preprocessor produces

	int j = 6_;

	int k = 0x7e+42;

for the example in question, which half does "what was wanted" and half
does NOT do "what was wanted".  It is this sort of inconsistent existing
practice that gave X3J11 the latitude to decide on the behavior specified
in the C standard.  Many alternative proposals were offered, including
several from members of the public during the public review process, but
ALL of them were technically flawed.  Given the difficulty of "getting it
right" during preprocessing, X3J11 opted to defer "getting it right" to a
later phase of translation.  Anyone using a decent coding style should
encounter no problems.  The only serious trap to watch out for is the
second example, which is avoidable by using whitespace around operators.
The first example is stupid anyway.

karl@ima.isc.com (Karl Heuer) (03/28/91)

In article <15606@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>Many alternative proposals were offered, including several from members of
>the public during the public review process, but ALL of them were
>technically flawed.

It may be that D. Hugh Redelmeier's proposal (X3J11/88-083, letter P24,
item 33/36) had some technical flaw, but if so, it wasn't the flaw that the
Committee asserted in their response.  (Hugh's proposal did not misparse
"1e4+6", nor any other valid construct as far as I can see.)

In case anyone's interested, here are the relevant patterns in egrep form.
X3J11:	[.]?[0-9]([0-9a-zA-Z_.]|[eE][-+])*
Karl:	[.]?[0-9][0-9.]*([eE][-+])?[0-9a-zA-Z_.]*
Hugh:	[.]?[0-9]([0-9.]|[eE][-+])*[0-9a-zA-Z_.]*

There's not much point in continuing to argue this, since the current
Standard is frozen.  On the bright side, we have:

[0] At least it's not a Quiet Change; any code that's broken will cause a
diagnostic rather than getting the wrong answer.

[1] A helpful compiler can always issue the diagnostic, and then guess what
you meant and do the right thing anyway.

[2] Those of us who normally put whitespace around binary operators won't be
hit by it anyway.

[3] If the C-2001 Committee decides it was a bad idea, they can still fix it,
since no working code can depend on this glitch.

[4] If, at the C-2001 Standardization Committee meeting, somebody puts LSD
into the drinking water and they decide to allow hexadecimal floating-point
constants, it'll be relatively easy to add.

[5] The experience may be helpful in designing a new language from scratch.
Using "1.0E2" and "1.0E_2" is perhaps a better idea than allowing "+" and "-"
to appear in the middle of a token.

Karl W. Z. Heuer (karl@ima.isc.com or uunet!ima!karl), The Walking Lint

cmills@wyse.wyse.com (Chris Mills x2427 dept203) (03/29/91)

In article <1991Mar27.033525.21697@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>In article <3137@wyse.wyse.com> cmills@wyse.wyse.com () writes:
>
>>what does the Standard say about constructions like these:
>>#define _	+ 42
>>int j = 6_;
>>#define N	42
>>int k = 0x7e+N;
>
>The standard forbids those macros from being expanded.
>However, since your program violates the ANSI syntax, the processor may do
>whatever it wishes, as long as it produces at least one diagnostic.  It
>could, if it wishes, give a warning and then do what you asked for.

What do you mean the preprocessor can do whatever it wants?  If the standard
forbids the expansions of the macros, then it can't do what I want, right?
Actually, since I'm deliberately trying to be obfuscated, I'd like

#include <stdio.h>
int N;
#define N 3
void main(argc, argv) char **argv; { printf("%d\n", 0x7e+N); }

to print 126 instead of 129 consistently.  What, exactly, is being violated
in this example that allows the preprocessor to be unpredictable?  As my
newfound understanding goes, the preproc. will see "0x7e+N" as one token
and will not expand "N", and the compiler will see it as three, right?
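
For reference, the two candidate values can be pinned down with an
unambiguous spelling.  This is only a sketch: the function names are
invented, and whitespace is put around the `+' on purpose so that
`0x7e' and `N' are separate tokens under any reading.

```c
int N;   /* file-scope object, zero-initialised */

/* Before the #define, the identifier N names the variable,
 * so this is 0x7e + 0. */
int sum_with_variable(void) { return 0x7e + N; }

#define N 3

/* After the #define, the identifier N is replaced by 3,
 * so this is 0x7e + 3. */
int sum_with_macro(void) { return 0x7e + N; }
```

Since 0x7e is 126, the first function returns 126 and the second
returns 129.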

>RTFS.

Love to.  Can't justify the cost of TFS for the occasional oddity like this.
Did manage to borrow a copy of K&RII, but could not find any reference to
the preprocessor being given license to think that numbers can contain
W's... (I suppose allowing all alphabetics was slightly less confusing than
allowing only [ulfeULFE]).

By the way, does L"string" work the same way?  Does the preproc. see this as
one token, or two, and is Q"string" also a preproc token?  Does anybody
actually use these things?

>--
>Norman Diamond       diamond@tkov50.enet.dec.com
>If this were the company's opinion, I wouldn't be allowed to post it.

				Chris Mills

bhoughto@pima.intel.com (Blair P. Houghton) (03/29/91)

In article <3140@wyse.wyse.com> cmills@wyse.UUCP (Chris Mills x2427 dept203) writes:
>in this example that allows the preprocessor to be unpredictable?  As my
>newfound understanding goes, the preproc. will see "0x7e+N" as one token
>and will not expand "N", and the compiler will see it as three, right?

I like it.  Way-cool obfuscation.

>W's... (I suppose allowing all alphabetics was slightly less confusing than
>allowing only [ulfeULFE]).

The problem is that by the time the tokenizer sees the 'N'
it has already stuffed the `0x7e+' into a PP_NUMBER
token-bin, and it's too late to ungetc() that much stuff
(only one char at a time, please).  This is a really lame
excuse for this non-parsing, if it's the real excuse.
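
As a rough sketch of what such a tokenizer does, here is a hand-written
scanner for the preprocessing-number rule.  The function name
pp_number_len is invented, and this is one possible way to code the
rule, not a description of any particular compiler.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the length of the preprocessing number at the start of `s',
 * or 0 if `s' does not begin with one.  The rule coded here: an
 * optional `.', a digit, then any run of letters, digits, `_', `.',
 * where an `e' or `E' may additionally drag a following sign along. */
size_t pp_number_len(const char *s)
{
    size_t i = 0;

    if (s[i] == '.')
        i++;
    if (!isdigit((unsigned char)s[i]))
        return 0;
    i++;

    for (;;) {
        unsigned char c = (unsigned char)s[i];

        if ((c == 'e' || c == 'E') && (s[i + 1] == '+' || s[i + 1] == '-'))
            i += 2;                 /* exponent marker plus its sign */
        else if (isalnum(c) || c == '_' || c == '.')
            i++;                    /* anything else vaguely number-like */
        else
            break;
    }
    return i;
}
```

pp_number_len("0x7e+N") is 6, so the whole thing including the N is
one token, whereas pp_number_len("0x7f+N") is 4.  Note that the loop
never looks more than one character ahead of the current one.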

>By the way, does L"string" work the same way?  Does the preproc. see this as
>one token, or two, and is Q"string" also a preproc token?  Does anybody
>actually use these things?

Two tokens, and yes, if Q isn't L.  Two strings
concatenated lexically will be concatenated as data; i.e.
"hello,""world" becomes "hello,world".

It's only `preprocessing numbers' that are broken.
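
A minimal sketch of the string concatenation mentioned above (the
names greeting, WHO and greeting2 are made up for the example):

```c
/* Adjacent string literals are joined into one after preprocessing. */
static const char greeting[] = "hello," "world";

/* The same happens when one of the literals comes out of a macro. */
#define WHO "world"
static const char greeting2[] = "hello," WHO;
```

Both arrays hold "hello,world", which is 11 characters plus the
terminating NUL, so sizeof greeting is 12.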

				--Blair
				  "Otherwise, I'd have bought it."

diamond@jit345.swstokyo.dec.com (Norman Diamond) (03/29/91)

In article <3140@wyse.wyse.com> cmills@wyse.UUCP (Chris Mills x2427 dept203) writes:
>In article <1991Mar27.033525.21697@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>>In article <3137@wyse.wyse.com> cmills@wyse.wyse.com () writes:
>>>what does the Standard say about constructions like these:
>>>#define _	+ 42
>>>int j = 6_;
>>>#define N	42
>>>int k = 0x7e+N;
>>The standard forbids those macros from being expanded.
>>However, since your program violates the ANSI syntax, the processor may do
>>whatever it wishes, as long as it produces at least one diagnostic.  It
>>could, if it wishes, give a warning and then do what you asked for.
>What do you mean the preprocessor can do whatever it wants?  If the standard
>forbids the expansions of the macros, then it can't do what I want, right?

The PREPROCESSOR is not allowed to do this expansion.  AFTER preprocessing,
the PROCESSOR must give at least one diagnostic.  The PROCESSOR is then
allowed to proceed as it likes, including, uh, repreprocessing.

Of course, the processor does not really have to break out a separate
preprocessor step; it only has to behave as though the phases of translation
occurred in the standard order.

>Actually, since I'm deliberately trying to be obfuscated, I'd like
>#include <stdio.h>
>int N;
>#define N 3
>void main(argc, argv) char **argv; { printf("%d\n", 0x7e+N); }
>to print 126 instead of 129 consistently.

Not from ANSI 1989 C.

>As my newfound understanding goes, the preproc. will see "0x7e+N" as one
>token and will not expand "N",

Yes.

>and the compiler will see it as three, right?

No.  The compiler will see it as one, issue at least one diagnostic, and
then do whatever it wishes (including possibly converting it to three).

>>RTFS.
>Love to.  Can't justify the cost of TFS for the occasional oddity like this.

If you can't justify $50.00, then you can't afford to use net bandwidth
on it either.

>By the way, does L"string" work the same way?  Does the preproc. see this as
>one token, or two,

One.

>and is Q"string" also a preproc token?

No.  For consistency it should have been (sarcasm here).

>Does anybody actually use these things?

People have actually used strings with multibyte characters in them for
over 10 years.  The L"..." syntax was (to the best of my knowledge) an ANSI
invention and not previously used.  At least this one did not break working
programs.
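
The difference between L"string" and Q"string" can be seen with a
deliberately perverse pair of macros.  This is a sketch only; the
macro bodies and the function names q_string and l_string are invented
for the illustration.

```c
#include <wchar.h>

#define Q "q:"
#define L "l:"   /* perverse on purpose: a macro named L */

/* Q"string" is an identifier followed by a string literal, so Q is
 * expanded and the two adjacent literals are then concatenated. */
const char *q_string(void) { return Q"string"; }

/* L"string" is one preprocessing token, a wide string literal; there
 * is no separate identifier L for the macro to replace. */
const wchar_t *l_string(void) { return L"string"; }

/* Keep the odd macros from leaking into later code. */
#undef L
#undef Q
```

q_string() yields the narrow string "q:string", while l_string()
yields a wide string of six characters with no "l:" in front.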
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.

karl@ima.isc.com (Karl Heuer) (04/03/91)

In article <1991Mar29.073102.1136@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>The L"..." syntax was (to the best of my knowledge) an ANSI invention and not
>previously used.  At least this one did not break working programs.

Actually, it did (potentially, if not in practice).  Classic C and ANSI C
assign different meanings to the program fragment
	#define L -
	int x = L'\1';
I pointed this out during one of the public review periods, which is why
the feature is now noted as a QUIET CHANGE in the Rationale document.
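
A sketch of the ANSI side of that fragment, with a spaced variant added
for contrast (the function names are made up for the example):

```c
#define L -

/* ANSI C: L'\1' is a single token, a wide character constant with
 * value 1, so the macro L is never looked at. */
int glued_value(void)  { return L'\1'; }

/* With whitespace, L is an ordinary identifier again; it expands to
 * `-' and the expression is -'\1'. */
int spaced_value(void) { return L '\1'; }

#undef L
```

Under ANSI C glued_value() returns 1 and spaced_value() returns -1.
The Classic C reading of the unspaced form corresponds to the second
function, which is what makes the change a quiet one.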

Karl W. Z. Heuer (karl@ima.isc.com or uunet!ima!karl), The Walking Lint