[comp.std.c] ANSI C token set

leendert@cs.vu.nl (Leendert van Doorn) (12/22/88)

In article <11262@haddock.ima.isc.com> karl@haddock.ima.isc.com writes:

>>anything using @ or $ (since they're not part of the C character set)
>
>That's what I thought, too, until I noticed that 3.1.7 in the May88 dpANS
>contains an example wherein @ and $ are scanned as separate preprocessing
>tokens.  The accompanying text does not mention whether or not this behavior
>is required of a conforming implementation.

This example is valid. Hence the behaviour is required of a conforming
implementation. This has to do with section 2.1.1.2 (phases of translation).
In phase 3 the source file is decomposed into preprocessing tokens, and
in phase 7 the preprocessing tokens are converted into (normal) tokens.
This allows the @ and $ characters to be part of the preprocessing token set
without being part of the (normal) token set.
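
The phase 3 / phase 7 distinction can be sketched in a few lines. This is
only an illustration, assuming a compiler that scans a lone @ as a
preprocessing token; the names STR, at_sign and at_sign_survived are
hypothetical and do not come from the draft.

```c
#include <string.h>

/* STR is a hypothetical helper: the # operator turns the preprocessing
   tokens of its argument into a string literal during macro expansion. */
#define STR(x) #x

/* The @ below exists only as a preprocessing token. By the time
   pp-tokens are converted to tokens it already sits inside a string
   literal, so no ordinary token is ever formed from it. */
static const char at_sign[] = STR(@);

/* Returns 1 if the stringized text is exactly "@". */
int at_sign_survived(void)
{
    return strcmp(at_sign, "@") == 0;
}
```

The @ never has to be a member of the (normal) token set for this to work;
it only has to be scannable as a preprocessing token.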

However, nowhere in the standard is the conversion of preprocessor tokens
to (normal) tokens described. This is an issue that should be clarified.

In the compiler I wrote, the lexical analyser simply breaks up the input into
preprocessing tokens, and these go (without any conversion) into the compilation
process. The latter filters out illegal things like $ and @ (the
parser chokes and starts up an error recovery routine).
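
The arrangement described above might look roughly like the following. It
is a simplified sketch, not the compiler mentioned in the post: the type
and function names are invented for illustration, string literals and
multi-character operators are ignored, and the "C" locale is assumed for
the <ctype.h> calls.

```c
#include <ctype.h>
#include <stddef.h>

enum pp_kind { PP_IDENT, PP_NUMBER, PP_OTHER, PP_END };

struct pp_token {
    enum pp_kind kind;
    const char *start;   /* first character of the token */
    size_t len;          /* number of characters in the token */
};

/* Scan one preprocessing token starting at *p and advance *p past it. */
struct pp_token next_pp_token(const char **p)
{
    struct pp_token t;
    const char *s = *p;

    while (*s == ' ' || *s == '\t' || *s == '\n')
        s++;
    t.start = s;
    if (*s == '\0') {
        t.kind = PP_END;
    } else if (isalpha((unsigned char)*s) || *s == '_') {
        t.kind = PP_IDENT;
        while (isalnum((unsigned char)*s) || *s == '_')
            s++;
    } else if (isdigit((unsigned char)*s)) {
        t.kind = PP_NUMBER;
        while (isalnum((unsigned char)*s) || *s == '.')
            s++;
    } else {
        t.kind = PP_OTHER;   /* any other single character, $ and @ included */
        s++;
    }
    t.len = (size_t)(s - t.start);
    *p = s;
    return t;
}

/* Later stage: count the pp-tokens that have no counterpart among the
   ordinary tokens, which is where a parser would report an error. */
int count_stray_characters(const char *src)
{
    int strays = 0;
    struct pp_token t;

    while ((t = next_pp_token(&src)).kind != PP_END)
        if (t.kind == PP_OTHER && (*t.start == '$' || *t.start == '@'))
            strays++;
    return strays;
}
```

Given "foo$bar", the scanner yields {foo}, {$} and {bar}, and only the
later stage objects to the middle one.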

-- 
Leendert P. van Doorn 			   		 <leendert@cs.vu.nl>
Vrije Universiteit / Dept. of Maths. & Comp. Sc.
De Boelelaan 1081
1081 HV Amsterdam / The Netherlands			tel. +31 20 548 5302

gwyn@smoke.BRL.MIL (Doug Gwyn ) (12/27/88)

In article <1844@zell.cs.vu.nl> leendert@cs.vu.nl (Leendert van Doorn) writes:
>However, nowhere in the standard is the conversion of preprocessor tokens
>to (normal) tokens described. This is an issue that should be clarified.

That too is covered under Phases of Translation, in recent drafts.

karl@haddock.ima.isc.com (Karl Heuer) (01/05/89)

Let's see if I've got this straight yet.

o  `$' is required to scan as a separate pp-token, despite existing practice
   making it an optional identifier-character.

o  When converting pp-tokens to tokens, an implementation is free to merge
   {foo}{$}{bar} into a single token {foo$bar}.  (I'm guessing on this one.)

o  But, since macro expansion happens first, it is {foo}, and not {foo$bar},
   that is subject to macro replacement, even if the above is true.

o  Hence, certain features of DEC and APOLLO implementations cannot be
   conforming.

o  DEC and APOLLO, through their representatives on X3J11, are aware of the
   above and accept it.  Their ANSI C implementations, if any, will not use
   `$' in identifiers.

o  Non-English letters, which are clearly not usable in a strictly conforming
   program, are in fact not usable in *any* conforming program, for the same
   reasons that apply to `$'.

o  The international community is aware of this and accepts it.

How much of the above is correct?

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

blarson@skat.usc.edu (Bob Larson) (01/05/89)

In article <11343@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
[discussion of using $ in identifiers]

>o  DEC and APOLLO, 

Prime should be added to the list of compiler vendors who use (and
require in their non-portable libraries) $ in identifiers.
-- 
Bob Larson	Arpa: Blarson@Ecla.Usc.Edu	blarson@skat.usc.edu
Uucp: {sdcrdcf,cit-vax}!oberon!skat!blarson
Prime mailing list:	info-prime-request%ais1@ecla.usc.edu
			oberon!ais1!info-prime-request

leendert@cs.vu.nl (Leendert van Doorn) (01/05/89)

The following comments are based on the X3J11/88-090 (may/88) version of
the dpANS report. In a couple of days I'll get the latest version, but for
now it will do.

In article <11343@haddock.ima.isc.com> karl@haddock.ima.isc.com writes:

> Let's see if I've got this straight yet.
>
>o `$' is required to scan as a separate pp-token, despite existing practice
>   making it an optional identifier-character.

Yes. The syntax of an identifier is (par. 3.1.2):

	identifier:	nondigit | identifier nondigit | identifier digit ;
	nondigit:	"_[a-z][A-Z]"
	digit:		"0-9"
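
The grammar above translates directly into a small checking function. This
is a sketch only; the name is_identifier is made up, and the "C" locale is
assumed so that isalpha() and isalnum() accept nothing beyond a-z, A-Z and
0-9.

```c
#include <ctype.h>

/* Returns 1 if s is a nondigit followed by any mix of nondigits and
   digits, and 0 otherwise (including for the empty string). */
int is_identifier(const char *s)
{
    if (!(isalpha((unsigned char)*s) || *s == '_'))
        return 0;
    for (s++; *s != '\0'; s++)
        if (!(isalnum((unsigned char)*s) || *s == '_'))
            return 0;
    return 1;
}
```

Under this grammar foo$bar is not an identifier, because $ is neither a
nondigit nor a digit.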

Whether the '$' should be scanned as a separate pp-token depends on the source
character set.

>o  When converting pp-tokens to tokens, an implementation is free to merge
>   {foo}{$}{bar} into a single token {foo$bar}.  (I'm guessing on this one.)

No, in this conversion the '$' is a garbage character. So what you get is
{foo} <ERROR> {bar} (the $ character is not part of the non-terminal
identifier; see above).

>o  But, since macro expansion happens first, it is {foo}, and not {foo$bar},
>   that is subject to macro replacement, even if the above is true.

{foo$bar} can never be subject to any macro replacement, since it's not an
identifier (see 3.8.3).

>o  Hence, certain features of DEC and APOLLO implementations cannot be
>   conforming.

I don't know about DEC or APOLLO, but if they allow things like those described
above, their implementations are not strictly conforming (perhaps there is
a flag -pedantic as with the GNU C compiler?).

>o  DEC and APOLLO, through their representatives on X3J11, are aware of the
>   above and accept it.  Their ANSI C implementations, if any, will not use
>   `$' in identifiers.

Depends on their policy. They are free to add features. Perhaps they will
make a flag (if $ is the only nonconforming aspect).

>o  Non-English letters, which are clearly not usable in a strictly conforming
>   program, are in fact not usable in *any* conforming program, for the same
>   reasons that apply to `$'.  

The basic source set, the set in which source files are written, does not
contain $, umlaut, accent grave, etc. The strings, however, may contain these
characters (depending on the size of the character representation you could
use single or multibyte character strings).

>o  The international community is aware of this and accepts it.

Yep, why not ?

BTW: The best wishes for 1989. "Hope it's a good one"

scjones@sdrc.UUCP (Larry Jones) (01/06/89)

In article <11343@haddock.ima.isc.com>, karl@haddock.ima.isc.com (Karl Heuer) writes:
> Let's see if I've got this straight yet.
> 
> o  `$' is required to scan as a separate pp-token, despite existing practice
>    making it an optional identifier-character.

I don't believe that '$' is required to scan as anything.  Since
it is not in the C source character set, a conforming compiler is
under no obligation to do anything in particular with it and so
is at liberty to do anything at all with it.  If an implementation
chooses to allow it in identifiers, that's fine (although it
should diagnose the syntax violation - perhaps by congratulating
you for seeing the value of using names containing dollar
signs).

----
Larry Jones                         UUCP: uunet!sdrc!scjones
SDRC                                      scjones@sdrc.uucp
2000 Eastman Dr.                    BIX:  ltl
Milford, OH  45150                  AT&T: (513) 576-2070
"Save the Quayles" - Mark Russell

karl@haddock.ima.isc.com (Karl Heuer) (01/11/89)

In article <1858@zell.cs.vu.nl> leendert@cs.vu.nl () writes:
>In article <11343@haddock.ima.isc.com> karl@haddock.ima.isc.com writes:
>> Let's see if I've got this straight yet.
>>
>>o `$' is required to scan as a separate pp-token, despite existing practice
>>   making it an optional identifier-character.
>
>Yes. The syntax of an identifier is [the pattern /[_a-zA-Z][_a-zA-Z0-9]*/].
>
>Whether the '$' should be scanned as a separate pp-token depends on the source
>character set.

In the environment I'm thinking of, `$' should be legal in strings (where it
represents the same symbol in the execution character set), hence it must be a
member of the source character set, and by 3.1 it scans as a pp-token.
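
For concreteness, this is the kind of use that paragraph has in mind. It
is a sketch that assumes an implementation whose source and execution
character sets both contain `$'; the names price and dollar_in_string are
illustrative only.

```c
#include <string.h>

/* Inside a string literal, $ is just another character that is mapped
   to the execution character set; identifier rules do not apply. */
static const char price[] = "$5";

/* Returns 1 if the literal holds exactly the two characters '$' and '5'. */
int dollar_in_string(void)
{
    return strlen(price) == 2 && price[0] == '$';
}
```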

>>o  Hence, certain features of DEC and APOLLO implementations cannot be
>>   conforming.
>
>I don't know about DEC or APOLLO, but if they allow things like those described
>above, their implementations are not strictly conforming (perhaps there is
>a flag -pedantic as with the GNU C compiler?).

`Strictly conforming' is an attribute of programs, not implementations.  An
implementation is either ANSI C, or it isn't.  According to the rules,
accepting `$' in an identifier seems to yield a non-ANSI implementation.

>>o  DEC and APOLLO, through their representatives on X3J11, are aware of the
>>   above and accept it.  Their ANSI C implementations, if any, will not use
>>   `$' in identifiers.
>
>Depends on their policy. They are free to add features. Perhaps they will
>make a flag (if $ is the only nonconforming aspect).

Hmm, assuming they do, I wonder if they'll follow Doug's suggestion of turning
off __STDC__ whenever `$' is enabled.
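
A program would see such a switch only through the macro itself. The
sketch below shows the test a program could make; whether any vendor
actually ties __STDC__ to a `$' option is not established anywhere in this
thread, and the function name is hypothetical.

```c
/* Returns 1 when the implementation claims conformance by defining
   __STDC__ as 1, and 0 otherwise. */
int claims_standard_c(void)
{
#if defined(__STDC__) && __STDC__ == 1
    return 1;
#else
    return 0;
#endif
}
```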

>>o  Non-English letters, which are clearly not usable in a strictly conforming
>>   program, are in fact not usable in *any* conforming program, for the same
>>   reasons that apply to `$'.  
>
>The basic source set, the set in which source files are written, does not
>contain $, umlaut, accent grave, etc. The strings, however, may contain these
>characters (depending on the size of the character representation you could
>use single or multibyte character strings).

The source character set is used both inside and outside of string literals;
those within string literals (or character constants) are mapped to the
execution character set as they are tokenized.  For the purposes of this
discussion, I'm assuming that the source and execution character sets are
identical, and that they contain `$' and/or non-English letters in addition to
the minimal character set of 2.2.1.

>>o  The international community is aware of this and accepts it.
>
>Yep, why not ?

Because the users can't use their native languages to name their variables.
Doesn't it bother you that you can't have a variable named `IJspret' with a
proper ligature instead of separate letters?  It bothers me, and I don't even
have any plans to use such a feature.

(Actually, the problem occurs even in English; I once had a set of constants
named DONT_xxx to selectively suppress individual features of a large system.
I didn't worry about the lack of an apostrophe, because (a) there's nothing to
be done about it, since the symbol is already in use, and (b) the meaning was
clear without it.  The correct use of the apostrophe seems to be declining in
American English anyway.  But that's a topic for a different group.)

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

henry@utzoo.uucp (Henry Spencer) (01/17/89)

In article <11383@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
>`Strictly conforming' is an attribute of programs, not implementations.  An
>implementation is either ANSI C, or it isn't.  According to the rules,
>accepting `$' in an identifier seems to yield a non-ANSI implementation.

Only if it is not diagnosed (e.g. by a warning message).  I'm getting a
bit tired of repeating this:  accepting extensions does not make a compiler
non-conforming.  The requirements for a conforming implementation are that
it handle all strictly conforming programs correctly, and that it diagnose
(not necessarily reject, just diagnose) any construct which is illegal
according to the standard.

Actually, the character-set issue may be even less severe than this,
depending on how the wording goes, but my copy of the October draft went
out on a few days' loan a month ago (sigh) and isn't back yet, so I can't
check the fine print just now.
-- 
"God willing, we will return." |     Henry Spencer at U of Toronto Zoology
-Eugene Cernan, the Moon, 1972 | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

scjones@sdrc.UUCP (Larry Jones) (01/19/89)

In article <1989Jan16.204214.15979@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
> In article <11383@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
> >`Strictly conforming' is an attribute of programs, not implementations.  An
> >implementation is either ANSI C, or it isn't.  According to the rules,
> >accepting `$' in an identifier seems to yield a non-ANSI implementation.
> 
> Only if it is not diagnosed (e.g. by a warning message).  I'm getting a
> bit tired of repeating this:  accepting extensions does not make a compiler
> non-conforming.  The requirements for a conforming implementation are that
> it handle all strictly conforming programs correctly, and that it diagnose
> (not necessarily reject, just diagnose) any construct which is illegal
> according to the standard.

That's what I thought, too.  But Karl pointed out to me that it
is possible to write a strictly conforming program that will NOT
be interpreted correctly by an implementation that allows '$' in
identifiers.  All you need do is something like:

	#define foo$bar
	#ifdef foo
	.
	.
	.
	#endif

The standard requires the #ifdef to be true, but any
implementation that allows '$' in an identifier will evaluate it
as false.
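
The same test can be packaged as a probe that reports which way a given
compiler scans foo$bar. This is a sketch; DOLLAR_SPLITS and
dollar_splits_identifier are invented names, and since later revisions of
the C standard require white space between a macro name and its
replacement list, a compiler that splits the tokens may also warn about
the #define line.

```c
#define foo$bar

#ifdef foo
/* foo, $ and bar were scanned as three pp-tokens, so the directive
   defined foo with the replacement list: $ bar */
#define DOLLAR_SPLITS 1
#else
/* foo$bar was scanned as one identifier, so foo itself is undefined */
#define DOLLAR_SPLITS 0
#endif

/* Returns 1 if `$' ended the identifier foo, 0 if it was absorbed. */
int dollar_splits_identifier(void)
{
    return DOLLAR_SPLITS;
}
```

A result of 1 corresponds to the behaviour the post says the standard
requires; a result of 0 means the compiler accepted `$' as part of the
identifier.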

----
Larry Jones                         UUCP: uunet!sdrc!scjones
SDRC                                      scjones@sdrc.UU.NET
2000 Eastman Dr.                    BIX:  ltl
Milford, OH  45150                  AT&T: (513) 576-2070
"When all else fails, read the directions."

gwyn@smoke.BRL.MIL (Doug Gwyn ) (01/19/89)

In article <504@sdrc.UUCP> scjones@sdrc.UUCP (Larry Jones) writes:
>That's what I thought, too.  But Karl pointed out to me that it
>is possible to write a strictly conforming program that will NOT
>be interpreted correctly by an implementation that allows '$' in
>identifiers.

No, it isn't.  Use of the $ character in an identifier produces
"undefined behavior".  The implementation is free to treat $ like
_ in identifiers, because that cannot affect translation of any
strictly conforming program.

henry@utzoo.uucp (Henry Spencer) (01/21/89)

In article <9438@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>>That's what I thought, too.  But Karl pointed out to me that it
>>is possible to write a strictly conforming program that will NOT
>>be interpreted correctly by an implementation that allows '$' in
>>identifiers.
>
>No, it isn't.  Use of the $ character in an identifier produces
>"undefined behavior"...

Doug, can you cite chapter and verse for this?  Remember that preprocessor
tokens which are never converted to tokens are one of the exemptions from
the rule in 2.2.1.  After some study of the matter, I'm afraid my tentative
conclusion is that when a funny character disappears before pptoken->token
conversion time, the Oct draft is not entirely clear about whether its use
is undefined, implementation-defined, or neither.  In reality it must be
considered to be at least implementation-defined, since it may not even 
exist in the source character set on some weird system, but I cannot find
explicit words to that effect.  One would actually prefer that it be
undefined, but I doubt that you can do that without making it difficult
to have funny characters in sections of code that are #ifdefed out -- and
it is highly desirable that *that* be legitimate.
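
The case in question looks something like this. It is a sketch only:
compilers in common use accept it, but whether the draft guarantees that
is exactly the open question, and the function name is hypothetical.

```c
#if 0
Notes kept out of the build: entry costs $5 @ the door.
#endif

/* Exists only so that this translation unit has something to call;
   reaching it means the skipped group above did not stop translation. */
int skipped_group_compiled(void)
{
    return 1;
}
```

The $ and @ here are scanned as preprocessing tokens and then discarded
with the rest of the skipped group, so they never reach the conversion to
tokens.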

A.6.3.4 thinks that 2.2.1 says that any extra members of either character
set are implementation-defined, but those words are not found in 2.2.1.

I think the right approach would be to tighten the "preprocessor token"
exemption in 2.2.1 so that it refers only to the none-of-the-above
single-character preprocessor tokens, but it's too late.

In practice one can argue that the behavior is undefined under the
"anything not mentioned is undefined" rule in 1.6, but this is not really
entirely satisfactory.
-- 
Allegedly heard aboard Mir: "A |     Henry Spencer at U of Toronto Zoology
toast to comrade Van Allen!!"  | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

scjones@sdrc.UUCP (Larry Jones) (01/21/89)

In article <9438@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn ) writes:
> In article <504@sdrc.UUCP> scjones@sdrc.UUCP (Larry Jones) writes:
> >That's what I thought, too.  But Karl pointed out to me that it
> >is possible to write a strictly conforming program that will NOT
> >be interpreted correctly by an implementation that allows '$' in
> >identifiers.
> 
> No, it isn't.  Use of the $ character in an identifier produces
> "undefined behavior".  The implementation is free to treat $ like
> _ in identifiers, because that cannot affect translation of any
> strictly conforming program.

But the critical point is that the $ character ISN'T in an
identifier if the implementation is conforming: foo$bar gets
parsed as three tokens just like foo+bar would.  As long as the $
doesn't make it past the preprocessor phases of translation, I
don't see anything in the standard that makes the program non-
conforming, and that makes any implementation that allows $ in
identifiers non-conforming since they do not parse the program
correctly and thus do not translate it correctly.
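
The foo+bar half of that comparison is easy to check on any compiler. In
this sketch the macro values and the function name are chosen purely for
illustration.

```c
#define foo 1
#define bar 2

/* foo+bar is three preprocessing tokens, so foo and bar are each
   macro-replaced on their own and the expression becomes 1+2. */
int three_tokens(void)
{
    return foo+bar;
}
```

The argument in the post is that foo$bar must be split the same way, with
foo and bar each open to macro replacement, as long as the $ is gone
before preprocessing ends.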

Please take another look at my (well, actually Karl's) example.


gwyn@smoke.BRL.MIL (Doug Gwyn ) (01/21/89)

In article <511@sdrc.UUCP> scjones@sdrc.UUCP (Larry Jones) writes:
>But the critical point is that the $ character ISN'T in an
>identifier if the implementation is conforming: foo$bar gets
>parsed as three tokens just like foo+bar would.

It's still the case that $ is not going to appear in foo$bar
context in a strictly conforming application.  I think the problem
you have in mind is that foo$bar leads to surprises if foo or
bar is a macro, just as use of EGAD when <errno.h> is included
can lead to surprises.

Perhaps the best way to implement extended identifier character
sets would be with a non-conforming mode flag to the compiler
to enable such an extension.  I can see serious problems with
use of non-Roman characters in foreign-language contexts.
What did we respond to the Japanese comment about this?

gwyn@smoke.BRL.MIL (Doug Gwyn ) (01/21/89)

In article <1989Jan20.175532.7447@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>>No, it isn't.  Use of the $ character in an identifier produces
>>"undefined behavior"...
>Doug, can you cite chapter and verse for this?

I was concerned only with identifiers that made it through the
preprocessing phase, since that is the situation I'm familiar with
where $ in identifiers really is wanted in some implementations.

Obviously, I don't recommend using $ in identifiers unless you
HAVE to.

henry@utzoo.uucp (Henry Spencer) (01/24/89)

In article <9470@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>It's still the case that $ is not going to appear in foo$bar
>context in a strictly conforming application...

I repeat a previous comment:  as far as I can tell, there is no rule saying
that a strictly-conforming program can't use $ in this context, provided
that it disappears (or hides inside a string or whatever) before the end
of preprocessing.  I do believe that appearance of such a character in a
context where it isn't being ignored *ought* to make a program non-strictly-
conforming, but I cannot find anything in the Oct draft that *says* this.