[net.lang.c] Problem with the C pre-processor

mikeb@inset.UUCP (Mike Banahan) (10/03/85)

Subject: netnews comments

The ANSI draft on C is clear on the subject of what characters
are allowed in strings. It draws a distinction between the source
character set (the one you compile in) and the destination character
set (the one you execute in).

It says explicitly that every character in the source character set
except newline may be present in a string literal (though some need
escapes, e.g. \").
Interestingly, it doesn't explain what happens to them.
For example, if you use ASCII as your source character set, but execute
in EBCDIC, and use a non-EBCDIC character in a literal: is that an error?
If it is an error, then the ``legality'' of a program
depends on the target environment. I am not sure that the committee
has given a lot of thought to this problem. I remember some conversations
on the character set subject and drew the conclusion that, although
some committee members thought they understood it properly,
there were a lot who didn't. I found the whole thing pretty confusing
at the time - but have since had time to think hard about it.

It does seem clear that the compiler is EXPECTED to transform the
source character set into the target character set, not just
pass byte streams through.
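
As a rough illustration of that translation step, here is a minimal
sketch of mapping a string literal from an ASCII source character set
to an EBCDIC execution character set. The function names
ascii_to_ebcdic() and translate_literal() are hypothetical, and the
table is deliberately partial (space, digits and letters only), so any
other character is treated as having no target equivalent. Whether a
real compiler should report that as an error is exactly the open
question raised above; this sketch simply picks one answer.

```c
#include <stddef.h>

/* Hypothetical helper: map one ASCII code to its EBCDIC code.
 * Only space, digits and letters are covered; returns -1 otherwise.
 * Numeric codes are used so the result does not depend on the
 * character set this sketch itself is compiled in. */
static int ascii_to_ebcdic(unsigned char c)
{
    if (c == 0x20) return 0x40;                            /* space */
    if (c >= 0x30 && c <= 0x39) return 0xF0 + (c - 0x30);  /* 0-9 */
    if (c >= 0x41 && c <= 0x49) return 0xC1 + (c - 0x41);  /* A-I */
    if (c >= 0x4A && c <= 0x52) return 0xD1 + (c - 0x4A);  /* J-R */
    if (c >= 0x53 && c <= 0x5A) return 0xE2 + (c - 0x53);  /* S-Z */
    if (c >= 0x61 && c <= 0x69) return 0x81 + (c - 0x61);  /* a-i */
    if (c >= 0x6A && c <= 0x72) return 0x91 + (c - 0x6A);  /* j-r */
    if (c >= 0x73 && c <= 0x7A) return 0xA2 + (c - 0x73);  /* s-z */
    return -1;                                             /* not in this table */
}

/* Hypothetical helper: translate the n bytes of a literal from src to dst.
 * Returns 0 on success, or -1 at the first character with no mapping
 * (dst may be partly written in that case). */
static int translate_literal(const unsigned char *src, size_t n,
                             unsigned char *dst)
{
    size_t i;

    for (i = 0; i < n; i++) {
        int e = ascii_to_ebcdic(src[i]);
        if (e < 0)
            return -1;
        dst[i] = (unsigned char)e;
    }
    return 0;
}
```

Passing the bytes straight through instead would leave ASCII codes in
a program that runs in EBCDIC, which is the behaviour the draft seems
to rule out.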

The problems with the BSD compiler presumably arise because someone
thought they were entitled to use the top bit for some internal purpose.
They assumed that the source character set was strictly seven-bit.
If you think the C compiler has problems here, there are a hell of a lot
of other things that are worse! Of course, they weren't looking at the
X3J11 proposals when they wrote it.
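
To make the failure mode concrete, here is a small sketch of what
goes wrong when a tool borrows the top bit of each character as an
internal flag. This is not the BSD source; MARK, mark(), unmark() and
survives_round_trip() are hypothetical names, and the only assumption
carried over is that characters were treated as strictly seven-bit.

```c
#define MARK 0x80   /* hypothetical internal flag kept in the top bit */

/* Tag a character by setting the top bit. */
static unsigned char mark(unsigned char c)
{
    return (unsigned char)(c | MARK);
}

/* Remove the tag, assuming the top bit was never part of the data. */
static unsigned char unmark(unsigned char c)
{
    return (unsigned char)(c & ~MARK);
}

/* Nonzero if the character comes back unchanged after tag and untag. */
static int survives_round_trip(unsigned char c)
{
    return unmark(mark(c)) == c;
}
```

Seven-bit codes such as 0x41 survive the round trip. An eight-bit
code such as 0xE9 comes back as 0x69, because the bit that was part
of the character is indistinguishable from the flag.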

It's really an educational issue; too many people think that ASCII
was written on the back of the ten commandments and that its word is law.
I found it took a conscious effort to realise that the repertoire
and the encoding are unrelated; but then I'm probably just dim.

Mike Banahan.
-- 
Mike Banahan, Technical Director, The Instruction Set Ltd.
mcvax!ukc!inset!mikeb

gwyn@BRL.ARPA (VLD/VMB) (10/06/85)

If the target environment cannot support the full C character set
then it obviously cannot support an X3J11 implementation.

I was told that several committee members wanted to specify ASCII
for the character set, and that the current tip-toeing around the
issue was a compromise consensus.  The thing is, you see, ASCII
is an official standard and EBCDIC was just IBM doing its own
thing, as usual.  The properties of EBCDIC are pretty disgusting,
from a programmer's viewpoint.

I once had to cope with this issue, and decided the best thing was
to insist on all characters available inside a running program
being ASCII, with all characters outside the program being
expressed in some "host character set".  I used the toascii()
function to map into internal (ASCII) form and a new function
tohostc() to map to external (host) form.  This made it possible
to use the nice properties of the ASCII character set in my code.
On an ASCII host, these are trivial macros, and on other hosts
they are just indexed translation tables.
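
The scheme described above might be sketched as follows. This is an
illustration under stated assumptions, not the original code: the
names to_internal() and to_host() stand in for the toascii() and
tohostc() mentioned in the post (many C libraries already provide a
toascii() that merely masks to seven bits, so that name is avoided
here), HOST_IS_ASCII is a hypothetical configuration macro, and the
EBCDIC tables cover only space, digits and letters. Unmapped entries
are left as 0. By default the sketch simulates an EBCDIC host so that
the tables are exercised.

```c
#include <string.h>

static unsigned char host2int[256];   /* host (EBCDIC) -> internal (ASCII) */
static unsigned char int2host[256];   /* internal (ASCII) -> host (EBCDIC) */

/* Record a run of count consecutive codes in both directions. */
static void pair(unsigned char ascii, unsigned char ebcdic, int count)
{
    int i;

    for (i = 0; i < count; i++) {
        host2int[ebcdic + i] = (unsigned char)(ascii + i);
        int2host[ascii + i] = (unsigned char)(ebcdic + i);
    }
}

/* Fill both tables; must be called once before the macros are used. */
static void init_tables(void)
{
    memset(host2int, 0, sizeof host2int);
    memset(int2host, 0, sizeof int2host);
    pair(0x20, 0x40, 1);                    /* space */
    pair(0x30, 0xF0, 10);                   /* 0-9 */
    pair(0x41, 0xC1, 9);                    /* A-I */
    pair(0x4A, 0xD1, 9);                    /* J-R */
    pair(0x53, 0xE2, 8);                    /* S-Z */
    pair(0x61, 0x81, 9);                    /* a-i */
    pair(0x6A, 0x91, 9);                    /* j-r */
    pair(0x73, 0xA2, 8);                    /* s-z */
}

#ifdef HOST_IS_ASCII
#define to_internal(c) (c)                  /* trivial on an ASCII host */
#define to_host(c)     (c)
#else
#define to_internal(c) (host2int[(unsigned char)(c)])
#define to_host(c)     (int2host[(unsigned char)(c)])
#endif

/* Inside the program everything is ASCII, so lower-case letters are
 * contiguous and a letter's position is plain arithmetic.
 * Returns 0..25 for a-z, or -1 for anything else. */
static int letter_index(unsigned char internal)
{
    if (internal >= 0x61 && internal <= 0x7A)
        return internal - 0x61;
    return -1;
}
```

The last function shows the kind of "nice property" at stake: in
EBCDIC the letters i and j are 0x89 and 0x91, so a simple range test
on host codes would also accept the non-letters in between, whereas
after to_internal() the range 0x61 to 0x7A contains letters only.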