[comp.std.c] C source character set

wittig@gmdzi.UUCP (Georg Wittig) (10/02/89)

Maybe the following are RTFM questions, but I don't have the ANSI C papers;
Harbison & Steele II don't seem to cover it ...

My questions are about the legal characters in a C source programme:

[1] There exist editors that allow you to enter any ASCII character. Consider
    the following program fragment:

		/* in the following lines let @ be the character '\0' */
		int x;
		x = 1 +	/* foo @ bar */
		    2	/* */
		    ;

    Is this program fragment equivalent to

	[a] ``int x; x = 1 + 2;''
	    In this case C compilers cannot use ``fgets'' to read the source
	    lines.

    or  [b] ``int x; x = 1 +   ;''
	    This will result in a syntax error message in later compiler
	    phases.

    What about a '\0' outside a C comment? Does it terminate the current line
    or must it be kept so that a syntax error message will be the result?

    What about a '\0' in a string constant?

[2] Furthermore, there are (non-UNIX) operating systems that encode the end of
    a source line by the number of bytes of that line instead of inserting a
    newline character (\x0a or \x0d in ASCII, \x15 in EBCDIC) at the end of
    that line.
    As an example, the line ``abc'' could be encoded as ``\3abc'', and not as
    ``abc\x0d''. In those environments ``[f]getc'' must generate an artificial
    '\n' character at the end of the line. Or am I mistaken?

    What if exactly this artificial '\n' is also a character of the line?
    What is a ``line'' in this context?

    Consider a (perverse looking) macro like the following:

			/* in the following line let @ be the character '\n' */
		#define X(a,b)	foo@#define X(a,b) ((a)+(b))
		i = X(27,38);

    Is this required to pass the preprocessor phase without an error message,
    and if so what is the output of that phase? I can think of at least 5
    different ways to process such a crazy macro.

[3] Line continuation by `\': Does it only apply to #define contexts and string
    constant contexts, or is it a general rule? Example:

		int terrible_long_identifier;
		terrible_lon\
		g_identifier = 1;

    Does the assignment statement alter the value of that terrible long
    variable, or is it a syntax error (``terrible_lon'' and ``g_identifier''
    undeclared)?

Thanks in advance,
-- 
Georg Wittig   GMD-Z1.BI   P.O. Box 1240   D-5205 St. Augustin 1 (West Germany)
email: wittig@gmdzi.uucp   phone: (+49 2241) 14-2294
-------------------------------------------------------------------------------
"Freedom's just another word for nothing left to lose" (Kris Kristofferson)

minow@mountn.dec.com (Martin Minow) (10/03/89)

In article <1302@gmdzi.UUCP> wittig@gmdzi.UUCP (Georg Wittig) writes:
>[1] There exist editors that allow you to enter any ASCII character. Consider
>    the following program fragment:
>
>		/* in the following lines let @ be the character '\0' */
>		int x;
>		x = 1 +	/* foo @ bar */
>		    2	/* */
>		    ;
This is probably a "quality of implementation" issue (because of NUL's
specific use in C to terminate strings).  A good implementation ought to
sweep out such characters (my opinion). More interesting is whether the
'@' can stand for one of the national letters in the ISO Latin-1 alphabet
(these have values from 0xA0 to 0xFF).   Again, "good" implementations will
allow characters in comments, 'char' and "string" constants that aren't
in the C source alphabet.
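
The fgets() concern from option [a] can be shown with a small sketch.  The
function name visible_after_fgets is made up for this illustration, and the
sketch assumes tmpfile() is usable on the host.  fgets() does read the whole
line, NUL included, but any later code that treats the buffer as a C string
stops at the NUL.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the bytes f o o NUL b a r newline to a
   temporary file, reads them back with fgets(), and returns how many
   characters strlen() can see.  Returns -1 if the temporary file
   cannot be created or used. */
int visible_after_fgets(void)
{
    static const char line[] = { 'f', 'o', 'o', '\0', 'b', 'a', 'r', '\n' };
    char buf[64];
    int seen = -1;
    FILE *fp = tmpfile();

    if (fp == NULL)
        return -1;
    if (fwrite(line, 1, sizeof line, fp) == sizeof line) {
        rewind(fp);
        /* fgets() stores all eight bytes plus its own terminator,
           but gives no way to learn how many bytes it stored. */
        if (fgets(buf, sizeof buf, fp) != NULL)
            seen = (int)strlen(buf);   /* counts only up to the first NUL */
    }
    fclose(fp);
    return seen;
}
```

Where the temporary file works, the function returns 3: " bar" and the
newline are in the buffer but invisible to string functions.  A translator
that wants to accept NUL would need a reader that tracks the byte count
itself, for example a getc() loop.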

>
>[2] Furthermore, there are (non-UNIX) operating systems that encode the end of
>    a source line by the number of bytes of that line instead of inserting a
>    newline character

fgets() should encode these lines as "string\n" -- how it would treat an
embedded \n is a quality of implementation issue.  I would suggest that
there should be no difference between an explicit \n and one generated
to signal an end-of-record.
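
As a rough sketch of that mapping, assume the record format from the
question: one length byte followed by that many data bytes.  The format and
the function name records_to_text are assumptions for illustration only, not
any real system's file layout.

```c
#include <stddef.h>

/* Hypothetical decoder: turns length-prefixed records into
   newline-terminated text.  Stops early on a truncated record or when
   the output buffer is full.  Returns the number of bytes written. */
size_t records_to_text(const unsigned char *rec, size_t n,
                       char *out, size_t cap)
{
    size_t r = 0, w = 0, i, len;

    while (r < n) {
        len = rec[r++];
        if (len > n - r || len + 1 > cap - w)
            break;                      /* truncated input or no room left */
        for (i = 0; i < len; i++)
            out[w++] = (char)rec[r++];
        out[w++] = '\n';                /* the artificial end-of-record newline */
    }
    return w;
}
```

With this decoder the single record { 3, 'a', '\n', 'b' } and the two records
{ 1, 'a', 1, 'b' } produce identical output, so the reader of the text stream
cannot tell an embedded newline from a generated one.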

> I can think of at least 5
>    different ways to process such a crazy macro.

>[3] Line continuation by `\'

May occur anywhere (ignoring trigraphs).  Thus "terrible_lon\
g_identifier" is legal anywhere.
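
A minimal sketch of the spliced identifier follows; the function name
splice_demo is invented for the example.  Splicing removes only the
backslash and the newline, so the continuation has to start in column one.
If it were indented, as in the posted fragment, the leading white space
would remain and separate the two halves into two tokens.

```c
/* The backslash below is the last character on its line and the
   continuation starts in column one, so the two halves are joined
   into the single identifier terrible_long_identifier. */
int splice_demo(void)
{
    int terrible_long_identifier = 0;

    terrible_lon\
g_identifier = 1;
    return terrible_long_identifier;
}
```

Calling splice_demo() returns 1, that is, the assignment reaches the
declared variable.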

Martin Minow
minow@thundr.dec.com

gwyn@smoke.BRL.MIL (Doug Gwyn) (10/03/89)

In article <1302@gmdzi.UUCP> wittig@gmdzi.UUCP (Georg Wittig) writes:
>		/* in the following lines let @ be the character '\0' */
>		int x;
>		x = 1 +	/* foo @ bar */
>		    2	/* */
>		    ;

The character you're representing by "@" is not in the standard C source
character set, so such a program is not strictly conforming.  Some
implementations may be able to deal with that source code but others
will not.  If an implementation does deal with it, it is up to that
implementation how to interpret this non-standard extension.
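
For reference, membership in the basic source character set can be checked
with a short table lookup.  This is only a sketch: the function name
in_basic_source_set is hypothetical, and '\n' stands in here for whatever
end-of-line indicator the implementation uses.

```c
#include <limits.h>
#include <string.h>

/* Hypothetical check: nonzero if c is one of the characters the
   Standard requires in the basic source character set. */
int in_basic_source_set(int c)
{
    static const char set[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~"   /* the 29 graphic characters */
        " \t\v\f\n";                         /* space, tabs, form feed, end of line */

    /* strchr() would match the terminating NUL of the table,
       so zero and out-of-range values are rejected first. */
    return c > 0 && c <= UCHAR_MAX && strchr(set, c) != NULL;
}
```

By this test '\0' is outside the set, and so are '@' and the ISO Latin-1
letters, which is why their treatment is left to the implementation.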

>[2] Furthermore, there are (non-UNIX) operating systems that encode the end of
>    a source line by the number of bytes of that line ...

There is a misunderstanding here.  The specifications for C source
character set do not constrain how C source code files are represented
in a particular implementation, nor how text editors present C source
code visually, nor myriad other similar issues.  C source code
characters must be seen as distinct units by the conforming C translator;
what mapping is done from physical source character encoding before that
point lies beyond the scope of the C standard.  Presumably it will be
similar to that done for "text" files in the hosted C library text-stream
support, but it need not be.

>[3] Line continuation by `\': Does it only apply to #define contexts and string
>    constant contexts, or is it a general rule?

It's a general rule.  Translation starts with the mapping of physical
source characters to the C source character set, followed by trigraph
replacement and then backslash-newline splicing.  Tokenization and
preprocessing occur only after that.
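
The ordering matters because a trigraph can produce the backslash that is
then spliced.  The following is a simplified sketch of those early steps
working on a buffer in memory; the function names are invented here, and a
real translator does considerably more (diagnostics, line counting, and so
on).

```c
#include <stddef.h>
#include <string.h>

/* Replace the nine trigraph sequences in place; returns the new length. */
size_t replace_trigraphs(char *s, size_t n)
{
    static const char from[] = "=/'()!<>-";   /* third character of each trigraph */
    static const char to[]   = "#\\^[]|{}~";  /* corresponding replacement */
    size_t r = 0, w = 0;

    while (r < n) {
        if (r + 2 < n && s[r] == '?' && s[r + 1] == '?') {
            const char *p = memchr(from, s[r + 2], sizeof from - 1);
            if (p != NULL) {
                s[w++] = to[p - from];
                r += 3;
                continue;
            }
        }
        s[w++] = s[r++];
    }
    return w;
}

/* Delete every backslash that is immediately followed by a newline,
   together with that newline; returns the new length. */
size_t splice_lines(char *s, size_t n)
{
    size_t r = 0, w = 0;

    while (r < n) {
        if (s[r] == '\\' && r + 1 < n && s[r + 1] == '\n') {
            r += 2;
            continue;
        }
        s[w++] = s[r++];
    }
    return w;
}

/* Trigraph replacement first, then splicing, in the order described above. */
size_t early_phases(char *s, size_t n)
{
    return splice_lines(s, replace_trigraphs(s, n));
}
```

Given a buffer holding "terrible_lon", the trigraph ??/, a newline and
"g_identifier = 1;", early_phases() leaves "terrible_long_identifier = 1;".
Running the two steps in the opposite order would leave a stray backslash
and newline in the middle of the identifier.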