mikeb@inset.UUCP (Mike Banahan) (10/03/85)
Subject: netnews comments

The ANSI draft on C is clear on the subject of what characters are allowed in strings. It draws a distinction between the source character set (the one you compile in) and the destination character set (the one you execute in). It says explicitly that every character in the source character set except newline may be present in a string literal (though some need escapes, e.g. \").

Interestingly, it doesn't explain what happens to them. For example, if you use ASCII as your source character set, but execute in EBCDIC, and use a non-EBCDIC character in a literal: is that an error? If it is an error, then the ``legality'' of a program depends on the target environment.

I am not sure that the committee has given a lot of thought to this problem. I remember some conversations on the character set subject and drew the conclusion that, although some committee members thought they understood it properly, there were a lot who didn't. I found the whole thing pretty confusing at the time - but have since had time to think hard about it. It does seem clear that the compiler is EXPECTED to transform the source character set into the target character set, not just pass byte streams through.

The problems with the BSD compiler are presumably that someone thought they were entitled to use the top bit for some internal purpose. They made the assumption that the source character set was strictly seven-bit. If you think the C compiler has problems here, there are a hell of a lot of other things that are worse! Of course, they weren't looking at the X3J11 proposals when they wrote it.

It's really an educational issue; too many people think that ASCII was written on the back of the ten commandments and that its word is law. I found it took a conscious effort to realise that the repertoire and the encoding are unrelated; but then I'm probably just dim.

Mike Banahan.
--
Mike Banahan, Technical Director, The Instruction Set Ltd.
mcvax!ukc!inset!mikeb
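
As a rough illustration of the seven-bit assumption described above, the sketch below shows how reserving the top bit of each stored character for an internal flag loses information as soon as the source character set uses all eight bits. This is not the BSD compiler's actual code; the names INTERNAL_FLAG, pack_char, unpack_char and survives_round_trip are hypothetical and exist only for this example.

```c
/* Hypothetical: the top bit of each stored character is reserved
 * for some internal compiler purpose. */
#define INTERNAL_FLAG 0x80

/* Store a source character with the data in the low seven bits
 * and an internal flag in the top bit. */
unsigned char pack_char(unsigned char c, int flag)
{
    return (unsigned char)((c & 0x7F) | (flag ? INTERNAL_FLAG : 0));
}

/* Recover the character; the top bit is assumed to be the flag,
 * so it is discarded. */
unsigned char unpack_char(unsigned char packed)
{
    return (unsigned char)(packed & 0x7F);
}

/* Returns 1 if the character comes back unchanged, 0 otherwise. */
int survives_round_trip(unsigned char c)
{
    return unpack_char(pack_char(c, 0)) == c;
}
```

Under these assumptions a seven-bit value such as 0x41 survives the round trip, while an eight-bit value such as 0xE9 comes back as 0x69, which is a different character altogether.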
gwyn@BRL.ARPA (VLD/VMB) (10/06/85)
If the target environment cannot support the full C character set then it obviously cannot support an X3J11 implementation.

I was told that several committee members wanted to specify ASCII for the character set, and that the current tip-toeing around the issue was a compromise consensus. The thing is, you see, ASCII is an official standard and EBCDIC was just IBM doing its own thing, as usual. The properties of EBCDIC are pretty disgusting, from a programmer's viewpoint.

I once had to cope with this issue, and decided the best thing was to insist on all characters available inside a running program being ASCII, with all characters outside the program being expressed in some "host character set". I used the toascii() function to map into internal (ASCII) form and a new function tohostc() to map to external (host) form. This made it possible to use the nice properties of the ASCII character set in my code. On an ASCII host, these are trivial macros, and on other hosts they are just indexed translation tables.
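
The indexed-translation-table approach described above might look roughly like the following sketch for an EBCDIC host. It is not the original code: the names host_to_ascii and ascii_to_host are hypothetical stand-ins for toascii() and tohostc() (renamed to avoid clashing with the toascii() found in many <ctype.h> headers), only letters, digits, space and '?' are mapped, and every other code is replaced by '?' as an arbitrary choice made for this example.

```c
/* Substitute characters for codes this partial sketch does not map. */
#define SUBST_ASCII 0x3F  /* '?' in ASCII  */
#define SUBST_HOST  0x6F  /* '?' in EBCDIC */

static unsigned char to_ascii_tab[256];  /* indexed by host (EBCDIC) code */
static unsigned char to_host_tab[256];   /* indexed by ASCII code         */
static int tables_ready = 0;

/* Map a run of `count` consecutive codes in both directions. */
static void map_run(int ascii_first, int host_first, int count)
{
    int i;
    for (i = 0; i < count; i++) {
        to_host_tab[ascii_first + i] = (unsigned char)(host_first + i);
        to_ascii_tab[host_first + i] = (unsigned char)(ascii_first + i);
    }
}

static void init_tables(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        to_ascii_tab[i] = SUBST_ASCII;
        to_host_tab[i] = SUBST_HOST;
    }
    map_run(0x20, 0x40, 1);   /* space     */
    map_run(0x3F, 0x6F, 1);   /* '?'       */
    map_run(0x30, 0xF0, 10);  /* '0'..'9'  */
    /* EBCDIC letters are not contiguous, so each alphabet
     * needs three separate runs. */
    map_run(0x41, 0xC1, 9);   /* 'A'..'I'  */
    map_run(0x4A, 0xD1, 9);   /* 'J'..'R'  */
    map_run(0x53, 0xE2, 8);   /* 'S'..'Z'  */
    map_run(0x61, 0x81, 9);   /* 'a'..'i'  */
    map_run(0x6A, 0x91, 9);   /* 'j'..'r'  */
    map_run(0x73, 0xA2, 8);   /* 's'..'z'  */
    tables_ready = 1;
}

/* Host (external) form to internal ASCII form. */
int host_to_ascii(int c)
{
    if (!tables_ready)
        init_tables();
    return to_ascii_tab[(unsigned char)c];
}

/* Internal ASCII form to host (external) form. */
int ascii_to_host(int c)
{
    if (!tables_ready)
        init_tables();
    return to_host_tab[(unsigned char)c];
}
```

With the program's internals kept in ASCII, tests such as c >= 0x41 && c <= 0x5A for an upper-case letter remain valid regardless of the host, and conversion happens only at the input and output boundaries. On an ASCII host both functions would reduce to the identity.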