[comp.std.c] Macro names imbedded in pp-numbers repost

diamond@csl.sony.co.jp (Norman Diamond) (11/16/89)

Sorry for the repost, but the original posting has not drawn any
replies.  Perhaps it was buried in kddlabs again.

Both the standard and the rationale say that in the pp-number
  0x7e-getchar()
it is illegal for my preprocessor to expand the getchar() macro.
If there is a real getchar() function, it is guaranteed that the
real function must be invoked by this expression.  This appears
to match the committee's intention, is not optional, and is not
implementation-defined.  Why?

I will have to add code to my scanner, and slow it down, so that
it will not call the preprocessor if it finds a macro in the middle
of a pp-number.

We have recently had discussions of what-is-reasonable vs. what-is-
written.  Does anyone think we can appeal to reason in this case,
so that implementations might be allowed to expand macros that are
found as independent real-tokens even though they're not separate
preprocessor-tokens?

-- 
Norman Diamond, Sony Corp. (diamond%ws.sony.junet@uunet.uu.net seems to work)
  Should the preceding opinions be caught or     |  James Bond asked his
  killed, the sender will disavow all knowledge  |  ATT rep for a source
  of their activities or whereabouts.            |  licence to "kill".

datanguay@watmath.waterloo.edu (David Adrien Tanguay) (11/17/89)

In article <11134@riks.csl.sony.co.jp> diamond@ws.sony.junet (Norman Diamond) writes:
>Sorry for the repost, but the original posting has not drawn any
>  0x7e-getchar()
>it is illegal for my preprocessor to expand the getchar() macro.
>If there is a real getchar() function, it is guaranteed that the
>real function must be invoked by this expression.  This appears
>to match the committee's intention, is not optional, and is not
>implementation-defined.  Why?

The "0x7e-getchar" is picked up as a pre-processor number and later converted
into a token. In section 3.1, under constraints, it says

	"Each preprocessing token that is converted into a token shall
	 have the lexical form of a keyword, an identifier, a constant,
	 a string literal, an operator, or a punctuator."

"0x7e-getchar" is none of these, so I think a diagnostic message must be
issued at that point. However, there might be a statement elsewhere
that says that a pre-processor token can be converted into a sequence
of tokens.

>We have recently had discussions of what-is-reasonable vs. what-is-
>written.  Does anyone think we can appeal to reason in this case,
>so that implementations might be allowed to expand macros that are
>found as independent real-tokens even though they're not separate
>preprocessor-tokens?

This problem was brought to the committee's attention, but it took them
a while to understand the problem (they thought everybody was complaining
about the concept of a pre-processor number, rather than the specific
definition). By the time they did figure it out, they had already declared
that the botched definition would stand. (Hopefully a committee member will
inject some reality into the previous sentence.) Oh well, you should be
using white space anyway.

David Tanguay

henry@utzoo.uucp (Henry Spencer) (11/18/89)

In article <11134@riks.csl.sony.co.jp> diamond@ws.sony.junet (Norman Diamond) writes:
>... Does anyone think we can appeal to reason in this case,
>so that implementations might be allowed to expand macros that are
>found as independent real-tokens even though they're not separate
>preprocessor-tokens?

I don't think the situation can arise, actually.  A careful reading of
2.1.1.2 item 7 yields:  "Each preprocessing token is converted into a
token."  Note the singular pronoun; it's in there because I pointed out
that there was no requirement elsewhere that the conversion be one-to-one.
A preprocessing token which cannot be converted into a single real token
is illegal.
-- 
A bit of tolerance is worth a  |     Henry Spencer at U of Toronto Zoology
megabyte of flaming.           | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

walter@hpclwjm.HP.COM (Walter Murray) (11/18/89)

Norman Diamond writes:

> Both the standard and the rationale say that in the pp-number
>   0x7e-getchar()
> it is illegal for my preprocessor to expand the getchar() macro.
> If there is a real getchar() function, it is guaranteed that the
> real function must be invoked by this expression.

Isn't this overlooking the constraint in 3.1?  As I read it,
<0x7e-getchar> is a pp-number.  In translation phase 7, the
translator attempts to convert each preprocessing token into
a token.  At that point, each preprocessing token must have
the form of a keyword, an identifier, a constant, a string
literal, an operator, or a punctuator.  Because <0x7e-getchar>
doesn't match any of these, the constraint is violated, the program
is illegal, and a diagnostic must be produced.
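
The phase 7 check described above can be sketched in C as a test of whether a pp-number spelling has the lexical form of a constant. This is only an illustration: it assumes the C89 grammar for integer and floating constants (suffixes u, l and f only, no hex floats), and the function names are invented here rather than taken from any compiler.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Optional integer suffix: empty, u, l, ul or lu, in either case. */
static int is_int_suffix(const char *s)
{
    size_t n = strlen(s);
    if (n == 0)
        return 1;
    if (n == 1)
        return strchr("uUlL", s[0]) != NULL;
    if (n == 2) {
        int u0 = (s[0] == 'u' || s[0] == 'U');
        int l0 = (s[0] == 'l' || s[0] == 'L');
        int u1 = (s[1] == 'u' || s[1] == 'U');
        int l1 = (s[1] == 'l' || s[1] == 'L');
        return (u0 && l1) || (l0 && u1);
    }
    return 0;
}

/* Hex, octal or decimal digits followed by an optional suffix. */
static int is_integer_constant(const char *s)
{
    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
        s += 2;
        if (!isxdigit((unsigned char)*s))
            return 0;
        while (isxdigit((unsigned char)*s))
            s++;
    } else if (s[0] == '0') {
        s++;
        while (*s >= '0' && *s <= '7')
            s++;
    } else if (isdigit((unsigned char)*s)) {
        while (isdigit((unsigned char)*s))
            s++;
    } else {
        return 0;
    }
    return is_int_suffix(s);
}

/* Digits with a period and/or an exponent, then an optional f or l. */
static int is_floating_constant(const char *s)
{
    int digits = 0, dot = 0, expo = 0;
    while (isdigit((unsigned char)*s)) { s++; digits++; }
    if (*s == '.') {
        dot = 1;
        s++;
        while (isdigit((unsigned char)*s)) { s++; digits++; }
    }
    if (digits == 0)
        return 0;
    if (*s == 'e' || *s == 'E') {
        s++;
        if (*s == '+' || *s == '-')
            s++;
        if (!isdigit((unsigned char)*s))
            return 0;
        while (isdigit((unsigned char)*s))
            s++;
        expo = 1;
    }
    if (!dot && !expo)
        return 0;
    if (*s != '\0' && strchr("fFlL", *s) != NULL)
        s++;
    return *s == '\0';
}

/* 1 if the pp-number spelling converts to a constant, else 0. */
int pp_number_is_constant(const char *s)
{
    assert(s != NULL);
    return is_integer_constant(s) || is_floating_constant(s);
}
```

Under these assumptions "0x7e" and "1.5e+10" pass, while "0x7e-getchar" fails, which is the point at which a diagnostic would be due.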

Walter Murray
---

karl@haddock.ima.isc.com (Karl Heuer) (11/18/89)

In article <11134@riks.csl.sony.co.jp> diamond@ws.sony.junet (Norman Diamond) writes:
>Both the standard and the rationale say that in the pp-number
>  0x7e-getchar()
>it is illegal for my preprocessor to expand the getchar() macro.
>If there is a real getchar() function, it is guaranteed that the
>real function must be invoked by this expression.

"0x7e-getchar" scans as a single pp-number.  It fails to convert into a token
when it hits translation phase 7, and hence the program containing it is not
strictly conforming.

We are now in the realm of undefined behavior.  An ANSI-conforming compiler is
required to issue a diagnostic, but is then permitted to guess what the user
meant and continue processing the program.  In particular, it should be legal
for it to consider "getchar" for macro replacement.

>I will have to add code to my scanner, and slow it down, so that
>it will not call the preprocessor if it finds a macro in the middle
>of a pp-number.

Why should it "find" a macro in the middle of a pp-number, any more than it
should try to expand the substring "getc" in the token "fgetc"?

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
(For those readers who don't know why "0x7e-getchar" is a single pp-token: The
Committee defined a single pattern to cover all variants of numeric constants,
including floating-point with exponents as well as hex integers.  They chose
to accept the resulting wart (that hex constants ending with "e" must not be
immediately followed by a sign) rather than rewrite the pattern to fix it.
Yes, I think this was a mistake.  No, it can't be changed.)
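
The single pattern can be sketched as a short scanner in C. This is a hedged illustration of the pp-number grammar as described above (a digit, or a period followed by a digit, then any run of digits, letters, underscores, periods, and signs that immediately follow e or E); the name scan_pp_number is made up for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of the pp-number that starts at s, or 0 if none starts there. */
size_t scan_pp_number(const char *s)
{
    size_t n;

    assert(s != NULL);
    if (isdigit((unsigned char)s[0]))
        n = 1;                          /* digit */
    else if (s[0] == '.' && isdigit((unsigned char)s[1]))
        n = 2;                          /* . digit */
    else
        return 0;

    for (;;) {
        unsigned char c = (unsigned char)s[n];
        if ((c == '+' || c == '-') && (s[n - 1] == 'e' || s[n - 1] == 'E'))
            n++;                        /* sign right after e or E */
        else if (isalnum(c) || c == '_' || c == '.')
            n++;                        /* digit, letter, underscore, period */
        else
            break;
    }
    return n;
}
```

Given "0x7e-getchar()" this returns 12, so the whole of "0x7e-getchar" is one preprocessing token; given "0x7f-getchar()" it returns 4, because the sign does not follow an e.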

gwyn@smoke.BRL.MIL (Doug Gwyn) (11/18/89)

In article <11134@riks.csl.sony.co.jp>, diamond@csl.sony.co.jp (Norman Diamond) writes:
- Both the standard and the rationale say that in the pp-number
-   0x7e-getchar()
- it is illegal for my preprocessor to expand the getchar() macro.
- If there is a real getchar() function, it is guaranteed that the
- real function must be invoked by this expression.  This appears
- to match the committee's intention, is not optional, and is not
- implementation-defined.  Why?

The Rationale explains that.

- I will have to add code to my scanner, and slow it down, so that
- it will not call the preprocessor if it finds a macro in the middle
- of a pp-number.

No, since so far as I can tell you really have to implement a
tokenizing preprocessor anyway, the pp-number is a single preprocessing
token and thus will not match "getchar" naturally; no additional
kludgery is required to ensure this.
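
A small sketch of why nothing extra is needed, assuming the preprocessor compares each complete preprocessing token against macro names (token_names_macro is a hypothetical helper, not part of any real preprocessor):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Nonzero if the preprocessing token tok[0..len) is exactly macro_name.
   A token that merely contains the name as a substring never matches. */
int token_names_macro(const char *tok, size_t len, const char *macro_name)
{
    assert(tok != NULL && macro_name != NULL);
    return strlen(macro_name) == len && memcmp(tok, macro_name, len) == 0;
}
```

With whole-token comparison, the pp-number "0x7e-getchar" is rejected for the same reason the identifier "mygetchar" is.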

- We have recently had discussions of what-is-reasonable vs. what-is-
- written.  Does anyone think we can appeal to reason in this case,
- so that implementations might be allowed to expand macros that are
- found as independent real-tokens even though they're not separate
- preprocessor-tokens?

There ARE no "real-tokens" until translation phase 7.  Preprocessing
is REQUIRED to deal solely with preprocessing tokens.  I think that
general framework is fairly easy to defend on "reasonable" grounds.

You could of course complain that you don't want to have to implement
a tokenizing preprocessor, or that you don't want to have to wait
until translation phase 7 to convert pp-numbers to C numbers, or that
you just don't like the whole notion of pp-numbers.  Believe me,
X3J11 has heard all the arguments before.

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (11/18/89)

In article <15217@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
| (For those readers who don't know why "0x7e-getchar" is a single pp-token: The
| Committee defined a single pattern to cover all variants of numeric constants,
| including floating-point with exponents as well as hex integers.  They chose
| to accept the resulting wart (that hex constants ending with "e" must not be
| immediately followed by a sign) rather than rewrite the pattern to fix it.
| Yes, I think this was a mistake.  No, it can't be changed.)

  Correct on all three. That is the way the standard works, it is a
mistake, and it can't be changed.

  I *don't* believe that there is a body of existing programs using hex
constants with exponential notation, and I *do* believe that it breaks
existing programs. I think the committee got tired of the job and
decided that it was good enough. I admit I only found one program it
actually broke, although I did find about 30 instances of #defined hex
constants ending in e which *could* break if used with +/-.

Example, for those not following this:

	#define F_LIMIT	0x14e

	/* and later in the program */
	long error_count[F_LIMIT+5];	/* room for quadrant totals */

	/* and also */
	top3 = triad(bset, F_LIMIT+3);

  The last is interesting, because if F_LIMIT+3 is taken as a float
value, and if there is a prototype, the number gets converted back to an
int. This gives the arg the correct type but a vastly wrong value.

  I'm happy to say that for the moment I haven't seen any compilers
implement this, even those which have many other ANSI features. I'm
really hoping that this could be treated as a wording change, not
requiring a vote, but I suspect it is too big for that.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon

quiroz@cs.rochester.edu (Cesar Quiroz) (11/18/89)

In <1643@crdos1.crd.ge.COM>, davidsen@crdos1.UUCP (bill davidsen) wrote:
| 	#define F_LIMIT	0x14e
| 
| 	/* and later in the program */
| 	long error_count[F_LIMIT+5];	/* room for quadrant totals */

Aside:  Over-parenthesizing your defines for paranoid reasons would
have saved this program.  Of course, the criticized behavior remains
buggy in the general case.



-- 
                                      Cesar Augusto Quiroz Gonzalez
                                      Department of Computer Science
                                      University of Rochester
                                      Rochester,  NY 14627

gwyn@smoke.BRL.MIL (Doug Gwyn) (11/18/89)

In article <15217@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
-We are now in the realm of undefined behavior.  An ANSI-conforming compiler is
-required to issue a diagnostic, but is then permitted to guess what the user
-meant and continue processing the program.  In particular, it should be legal
-for it to consider "getchar" for macro replacement.

That's a useful point that has application in other areas as well.

gwyn@smoke.BRL.MIL (Doug Gwyn) (11/18/89)

In article <1643@crdos1.crd.ge.COM> davidsen@crdos1.UUCP (bill davidsen) writes:
>Example, for those not following this:
>	#define F_LIMIT	0x14e
>	/* and later in the program */
>	long error_count[F_LIMIT+5];	/* room for quadrant totals */
>	/* and also */
>	top3 = triad(bset, F_LIMIT+3);

There is no problem here, because the result of macro substitution is
not retokenized.
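
To illustrate with the example under discussion, here is a sketch that should compile under a conforming implementation. The helper error_count_len is added here only so the array size can be inspected; it is not part of the quoted program.

```c
#include <assert.h>
#include <stddef.h>

#define F_LIMIT 0x14e   /* hex constant whose last digit happens to be e */

/* F_LIMIT, + and 5 are three preprocessing tokens before replacement and
   remain three afterwards (0x14e, +, 5), so this is an ordinary sum. */
long error_count[F_LIMIT+5];

/* The problem case is the same text typed directly with no white space:
       long broken[0x14e+5];
   Here the characters 0x14e+5 form one pp-number, not a valid constant. */

/* Number of elements in error_count. */
size_t error_count_len(void)
{
    assert(F_LIMIT == 334);     /* 0x14e in decimal */
    return sizeof error_count / sizeof error_count[0];
}
```

The array ends up with 339 elements, that is, 334 plus 5.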

I don't feel like recounting the entire history of pp-numbers, even if
I could remember it all, but it was a reasonable solution to a very
difficult technical problem.  Earlier drafts tried to do it the way
you and Norman seem to think is "right", and it got more and more
snarled as we tried to get it untangled.  pp-numbers work well for the
intended purpose and cause problems only in very rare circumstances
and only for programmers with an obsessive aversion to white space.

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (11/18/89)

In article <1989Nov17.205004.19236@cs.rochester.edu> quiroz@cs.rochester.edu (Cesar Quiroz) writes:

| Aside:  Over-parenthesizing your defines for paranoid reasons would
| have saved this program.  Of course, the criticized behavior remains
| buggy in the general case.

  That's what I had to go thru and do, but if that doesn't constitute
"egregiously breaking existing programs" I don't know what does. If I
ever get the time I'll grep thru the net sources and see how many have
defined hex constants ending in e. Note that of the programs which did,
only one actually failed, the rest were time-bombs, waiting until
someone used them in an expression.

  I don't think this will bring a huge number of programs crashing
down, but it does look like a case of a committee whose majority is
vendors (or was during the two years I was there) choosing a behavior
which has no benefit other than to simplify the writing of the parser.
If Global sends me the rationale with this order I'll look to see if
the thought process is described.


karl@haddock.ima.isc.com (Karl Heuer) (11/18/89)

In article <1643@crdos1.crd.ge.COM> davidsen@crdos1 (bill davidsen) writes:
>	#define F_LIMIT	0x14e
>	top3 = triad(bset, F_LIMIT+3);
>The last is interesting, because if F_LIMIT+3 is taken as a float
>value, and if there is a prototype...

As Doug has already pointed out, this is not a problem because there is a
token delimiter following F_LIMIT.  Besides, there are no hex-floats in ANSI C
(nor in any extension that I'm aware of); "0x14e+3" is a pp-token that cannot
be resolved into a real token.  (As is also, for example, "018" or "4s".)  If
the Committee had tried to make this a legal token, it would have been a Quiet
Change--and *that* would have been a lot more controversial!

>I'm really hoping that this could be treated as a wording change, not
>requiring a vote, but I suspect it is too big for that.

The guiding principle is that a change that reflects the Committee's original
intent is editorial, one that does not is substantive.  Unfortunately, this
behavior appeared as an explicit example in the Rationale, so it's hard to
argue that the Committee didn't intend it.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

pkturner@cup.portal.com (Prescott K Turner) (11/19/89)

> Both the standard and the rationale say that in the pp-number
>   0x7e-getchar()
> it is illegal for my preprocessor to expand the getchar() macro.
> If there is a real getchar() function, it is guaranteed that the
> real function must be invoked by this expression.
The draft standard says, "Each preprocessing token is converted into a
token".   Since 0x7e-getchar cannot be converted into a token, there
is an error, and the real getchar() function need not be invoked.
> I will have to add code to my scanner, and slow it down, so that
> it will not call the preprocessor if it finds a macro in the middle
> of a pp-number.
"getchar" in the middle of a pp-number should give a scanner no
more difficulty than "getchar" in the middle of an identifier, e.g.
    mygetchar
--
Prescott K. Turner, Jr.
13 Burning Tree Rd., Natick, MA 01760 USA    (508) 653-0357
UUCP: ...sun!cup.portal.com!pkturner    Internet: pkturner@cup.portal.com

gwyn@smoke.BRL.MIL (Doug Gwyn) (11/19/89)

In article <1653@crdos1.crd.ge.COM> davidsen@crdos1.UUCP (bill davidsen) writes:
>I don't think this will bring a huge number of programs crashing down,

It won't.  Several committee members "grepped" for such usage in existing
code to see if it would be a significant factor.  For example, the whole
source code for UNIX was scanned.  We determined that it would not be a
significant problem for existing source code.

>but it does look like a case of a committee whose majority is
>vendors (or was during the two years I was there) choosing a behavior
>which has no benefit other than to simplify the writing of the parser.

That's a significant reason for this particular feature of the
specification.  However, you seem to be implying that selfish
considerations by vendors are acting to the detriment of users.
There were a significant number of user-oriented X3J11 committee
members (including myself), and of course most vendor representatives
also function as C users themselves.  We bought into the notion of
simplicity in this case as being of value to programmers as well as
implementors.  Only ignorant programmers might have a problem, but
that is true of very many aspects of C; C is not a language for those
who refuse to learn before doing.  We expect that this particular
quirk will be taught in C textbooks much as the need for () around
parameters in macro definitions already is taught.

gwyn@smoke.BRL.MIL (Doug Gwyn) (11/19/89)

In article <31615@watmath.waterloo.edu> datanguay@watmath.waterloo.edu (David Adrien Tanguay) writes:
>However, there might be a statement elsewhere that says that a
>pre-processor token can be converted into a sequence of tokens.

No; the conversion in translation phase 7 is one-to-one.

>This problem was brought to the committee's attention, but it took them
>a while to understand the problem (they thought everybody was complaining
>about the concept of a pre-processor number, rather than the specific
>definition). By the time they did figure it out, they had already declared
>that the botched definition would stand. (Hopefully a committee member will
>inject some reality into the previous sentence.) Oh well, you should be
>using white space anyway.

This is misleading, because whenever really solid arguments were made,
X3J11 was always willing to fix a demonstrated error in the draft
specification; there were numerous occasions when this did occur.

As I recall the committee sentiment, it wasn't felt that this slightly
over-generous glomming onto source characters for pp-numbers posed a
serious practical problem, and it did drastically simplify that part
of the preprocessor.  The trade-off seemed worthwhile.

gwyn@smoke.BRL.MIL (Doug Gwyn) (11/19/89)

In article <12570031@hpclwjm.HP.COM> walter@hpclwjm.HP.COM (Walter Murray) writes:
>Because <0x7e-getchar> doesn't match any of these, the constraint
>is violated, the program is illegal, and a diagnostic must be produced.

Right, and thus this is not a "quiet change".
Fixing the source code might be a nuisance, but fortunately the
cases where it would be necessary are exceedingly rare.

jay@splut.conmicro.com (Jay "you ignorant splut!" Maynard) (11/19/89)

In article <11641@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>pp-numbers work well for the
>intended purpose and cause problems only in very rare circumstances
>and only for programmers with an obsessive aversion to white space.

Uhm, Doug...does this mean that the behavior of a program differs with
the use or non-use of white space? Isn't this different from the rest of
C?

-- 
Jay Maynard, EMT-P, K5ZC, PP-ASEL   | Never ascribe to malice that which can
jay@splut.conmicro.com       (eieio)| adequately be explained by stupidity.
{attctc,bellcore}!texbell!splut!jay +----------------------------------------
Shall we try for comp.protocols.tcp-ip.eniac next, Richard? - Brandon Allbery

gwyn@smoke.BRL.MIL (Doug Gwyn) (11/19/89)

In article <3060@splut.conmicro.com> jay@splut.conmicro.com (Jay "you ignorant splut!" Maynard) writes:
>Uhm, Doug...does this mean that the behavior of a program differs with
>the use or non-use of white space? Isn't this different from the rest of C?

No, white space has always been significant for preprocessing.  Consider:
	#define	foo	(void)
	foo bar();
	foobar();

White space is also sometimes significant outside preprocessing:
	a = b / *c;	/* comment */	;
	a = b/*c;	/* comment */	;
This was even worse when we had =op assignment operators.

The latter example strikes me as quite analogous to the pp-number situation.
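
The division example can be turned into a pair of functions that differ only in white space. This is a sketch for illustration; the function names are invented here.

```c
#include <assert.h>
#include <stddef.h>

/* With the space: divide b by the int that c points to. */
int divide_spaced(int b, const int *c)
{
    assert(c != NULL && *c != 0);
    return b / *c;
}

/* Without the space, the slash and the star pair up to open a comment,
   so the divisor disappears and the expression is just b. */
int divide_unspaced(int b, const int *c)
{
    (void)c;    /* never used: the comment below swallows it */
    return b/*c;   this text is still inside the comment */;
}
```

Called with b equal to 12 and c pointing at 4, the first returns 3 and the second returns 12.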

scjones@sdrc.UUCP (Larry Jones) (11/20/89)

In article <11645@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
> In article <1653@crdos1.crd.ge.COM> davidsen@crdos1.UUCP (bill davidsen) writes:
> >but it does look like a case of a committee whose majority is
> >vendors (or was during the two years I was there) choosing a behavior
> >which has no benefit other than to simplify the writing of the parser.
> 
> There were a significant number of user-oriented X3J11 committee
> members (including myself), and of course most vendor representatives
> also function as C users themselves.  We bought into the notion of
> simplicity in this case as being of value to programmers as well as
> implementors.  Only ignorant programmers might have a problem, but
> that is true of very many aspects of C; C is not a language for those
> who refuse to learn before doing.  We expect that this particular
> quirk will be taught in C textbooks much as the need for () around
> parameters in macro definitions already is taught.

As another user-oriented committee member, I agree with Doug's
assessment -- the decision was made to keep the specification
simple, not to make implementers' jobs easier.  In fact, a fair
number of committee members found "greedy" pp numbers to be
aesthetically repugnant and a few tried to rewrite the spec to
make them less so.  All of these were found to be defective in
one way or another, although one was so subtle that it nearly got
adopted!  In the end, most everyone agreed that the existing
spec does not cause any serious problems, is simple, and, most
importantly, does cover all the desirable cases.
----
Larry Jones                         UUCP: uunet!sdrc!scjones
SDRC                                      scjones@SDRC.UU.NET
2000 Eastman Dr.                    BIX:  ltl
Milford, OH  45150-2789             AT&T: (513) 576-2070
"You know how Einstein got bad grades as a kid?  Well MINE are even WORSE!"
-Calvin

scjones@sdrc.UUCP (Larry Jones) (11/20/89)

In article <3060@splut.conmicro.com>, jay@splut.conmicro.com (Jay "you ignorant splut!" Maynard) writes:
> In article <11641@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
> >pp-numbers work well for the
> >intended purpose and cause problems only in very rare circumstances
> >and only for programmers with an obsessive aversion to white space.
> 
> Uhm, Doug...does this mean that the behavior of a program differs with
> the use or non-use of white space? Isn't this different from the rest of
> C?

Well, not to put words in Doug's mouth ;-), but consider the
following:

	i--1	vs	i - -1
	i/*p	vs	i / *p
	i=-1	vs	i = -1	(obsolete)

Although whitespace (or lack thereof) USUALLY doesn't make a
difference, it's not unprecedented.
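
The first case comes from the scanner always taking the longest operator it can. A rough sketch of that rule in C, assuming a deliberately tiny operator table (scan_operator and the table are made up for this example, and comments are not handled):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A small subset of C operators, two-character spellings first so that
   the first match found is also the longest one. */
static const char *const ops[] = {
    "--", "-=", "->", "++", "+=", "==", "-", "+", "="
};

/* Length of the longest operator from the table that starts at s,
   or 0 if none does. */
size_t scan_operator(const char *s)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < sizeof ops / sizeof ops[0]; i++) {
        size_t n = strlen(ops[i]);
        if (strncmp(s, ops[i], n) == 0)
            return n;
    }
    return 0;
}
```

On "--1" this takes two characters, so i--1 is read as i, --, 1; on "- -1" it takes one, so the spaced form is a subtraction of a negated constant.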