gnu@hoptoad.uucp (John Gilmore) (12/02/86)
[This is posted to comp.lang.c because mod.std.c seems to be dead. Love those mod groups!]

The committee did not want to tie C to ASCII. Fair enough. What they did was require that all the relevant characters be in the character set (section 2.2.1), but not say anything about their character encoding. In fact, you could compile source code in ASCII to run on a machine that uses EBCDIC in the runtime environment. This is great.

The problem is that they went ahead to try to define a way to represent all the relevant characters in all the ISO code sets used in Europe. Since various countries reuse #, [, {, }, ], \, |, ~, and ^ as letters and such, they have defined three-character sequences that can be used to represent these characters. Now, these are major characters in the language. The preprocessor prefix #. The block structuring construct { }. The array subscripter [ ]. And the ultimate escape character \, as well as a bunch of logical ops.

My question is this. Is a C program that is written in plain old ASCII, using the above characters, portable? Is it a "strictly conforming program"? Is every ANSI standard C compiler in the world required to read in such a program and translate it properly?

Next question. Is a C program that uses local letters outside character strings (e.g. as letters in French or Swedish identifiers) portable? Is it a "strictly conforming program"? Are there ANY C compilers anywhere in the world which will read in such a program and translate it properly?

My preliminary answers are: C programs that use ASCII characters had damn well better be strictly conforming, or every C program in the world is broken. C compilers on European machines could support the national letters in identifiers and such, but any program that used this feature would not be portable.

Since a European C compiler which supported using the local characters AS LETTERS would encourage unportable code, it would be better to make European C compilers which did not support using the local characters as letters. This is tough, but are we trying to be nice or are we trying to encourage portability?

Since the specific intent of the standard is to promote portability, features in the standard which encourage the generation of nonportable code should be questioned. Newly introduced features discouraging portability should be removed.

Now. If European C compilers do not support using the local characters as letters, and don't support using them as ASCII punctuation, everyone in Europe will be forced to write their code using trigraphs. Of course, any code written in North America or the UK will use ASCII characters, so the Europeans will have to write a program to translate the imported {, }, etc into trigraphs.

I think that a better solution is for the European compilers to support these character codes to mean what they mean in ASCII. Now imported sources can be compiled directly. Also, Europeans would have the choice of editing the ASCII sources rather than using trigraphs. The programs will look funny on local terminals, but I don't see how it can be harder to read a program filled with local letters as punctuation, than it can be to read a program that looks like:

    ??=include <stdio.h>
    main(argc, argv)
        int argc;
        char **argv;
    ??<
        char buf??(??) = "Hello, world!??/r??/n";

        if (feof(stdin) ??!??! argc != 0) ??<
            printf(buf);
        ??>
    ??>

Since the trigraphs are even uglier than the alternative, and since European compilers will not be able to use those character codes for anything else, there is no need for introducing the trigraphs. "The X3J11 charter clearly mandates the committee to *codify common existing practice*" (emphasis theirs -- Rationale, pg. 1). The committee's justification for ignoring common practice here is too weak. The trigraphs should be removed.

--
John Gilmore  {sun,ptsfa,lll-crg,ihnp4}!hoptoad!gnu   jgilmore@lll-crg.arpa
"I can't think of a better way for the War Dept to spend money than to subsidize the education of teenage system hackers by creating the Arpanet."
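
[Editorial sketch, not part of the original post.] The translation program the post says Europeans would have to write for imported sources might look roughly like the following. The nine character pairs are the trigraphs the standard defines; the function names `trigraph_tail` and `to_trigraphs` are made up for this illustration. The two question marks are emitted as separate characters so that the sketch itself contains no trigraph sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Third character of the trigraph that stands for c, or 0 if c has none.
   (trigraph_tail is a hypothetical helper name.) */
static char trigraph_tail(char c)
{
    switch (c) {
    case '#':  return '=';
    case '[':  return '(';
    case '\\': return '/';
    case ']':  return ')';
    case '^':  return '\'';
    case '{':  return '<';
    case '|':  return '!';
    case '}':  return '>';
    case '~':  return '-';
    default:   return 0;
    }
}

/* Copy src into dst, replacing each of the nine characters above with
   its three-character trigraph. dst must have room for the worst case,
   3 * strlen(src) + 1 bytes. Returns the number of characters written,
   not counting the terminating null. */
size_t to_trigraphs(const char *src, char *dst, size_t dstsize)
{
    size_t n = 0;

    assert(dstsize >= 3 * strlen(src) + 1);
    for (; *src != '\0'; src++) {
        char tail = trigraph_tail(*src);
        if (tail != 0) {
            dst[n++] = '?';
            dst[n++] = '?';
            dst[n++] = tail;
        } else {
            dst[n++] = *src;
        }
    }
    dst[n] = '\0';
    return n;
}
```

This sketch translates every occurrence, including those inside string literals and comments, and it does not guard against a pair of question marks already present in the input that would combine with the next character into an unintended trigraph. A real converter would need to handle that case.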
jmlang@water.UUCP (Jerome M Lang) (12/02/86)
In article <1381@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>
>Of course, any code written in North America or the UK will use ASCII
>characters, so the Europeans will have to write a program to translate
>the imported {, }, etc into trigraphs.

Not quite true. There is a French portion of North America. (Quebec is well known). I remember at the Universite' de Moncton (In New Brunswick, Canada) we had quite a few terminals that used a "French" ascii. Makes your code real funny.

Besides, doesn't the UK have some differences in what they use as character set (the pound sign instead of the dollar sign, at least)?

This situation is very serious when the code is in C.

--
Je'ro^me M. Lang       ||  jmlang@water.bitnet  jmlang@water.uucp
Dept of Applied Math   ||  jmlang%water@waterloo.csnet
U of Waterloo          ||  jmlang%water%waterloo.csnet@csnet-relay.arpa
gwyn@brl-smoke.ARPA (Doug Gwyn ) (12/02/86)
In article <1381@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>Since various countries reuse #, [, {, }, ], \, |, ~, and ^ as letters
>and such, they have defined three-character sequences that can be used
>to represent these characters.
>...
>Since the trigraphs are even uglier than the alternative, and since
>European compilers will not be able to use those character codes for
>anything else, there is no need for introducing the trigraphs. "The
>X3J11 charter clearly mandates the committee to *codify common existing
>practice*" (emphasis theirs -- Rationale, pg. 1). The committee's
>justification for ignoring common practice here is too weak. The
>trigraphs should be removed.

The ??* trigraphs were introduced early in the C standard drafting process (before my time, actually). I agree that the issue should be re-examined now that X3J11 is paying more attention to international issues.

Notice that X3J11 recently came down firmly on the side of the ideas that "C source is English" and that the default start-up run-time "locale" is "C standard". (AT&T, and to some extent I, would have preferred that the start-up locale be left up to the implementation, to permit setting it via UNIX's environment, rather than requiring nearly every international application to explicitly invoke setlocale(), but the majority preferred to have a well-defined initial state that would permit systems code to completely ignore locale without peril.)

It does seem rather peculiar to buy into the ISO invariant code set characters without buying into ISO/ASCII encoding standards. Send your comment in formally to be sure that it receives consideration.
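
[Editorial sketch, not part of the original post.] The start-up behaviour described above can be illustrated with the `setlocale()` interface as it appears in the standard C library: a program begins in the "C" locale and has to ask explicitly for the environment's locale. The three function names below are invented for the illustration; only `setlocale` and `LC_ALL` come from `<locale.h>`.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Returns 1 if every locale category is currently the "C" locale.
   Passing NULL as the second argument queries without changing anything. */
int in_c_locale(void)
{
    const char *name = setlocale(LC_ALL, NULL);
    return name != NULL && strcmp(name, "C") == 0;
}

/* Systems code that ignores locale can rely on this holding at start-up. */
void check_startup_locale(void)
{
    assert(in_c_locale());
}

/* What an international application has to do explicitly: the empty
   string selects the implementation's native environment.
   Returns 1 on success, 0 if the request could not be honoured. */
int adopt_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}
```

Under the alternative that the post says AT&T preferred, the call in `adopt_environment_locale` would in effect have happened before `main` was entered, and `check_startup_locale` could fail.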
aeb@mcvax.cwi.nl (Andries Brouwer) (12/03/86)
In article <1381@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>...
>My preliminary answers are: C programs that use ASCII characters had
>damn well better be strictly conforming, or every C program in the world
>is broken. C compilers on European machines could support the national
>letters in identifiers and such, but any program that used this feature
>would not be portable.
>
>Since a European C compiler which supported using the local characters
>AS LETTERS would encourage unportable code, it would be better to make
>European C compilers which did not support using the local characters
>as letters. This is tough, but are we trying to be nice or are we
>trying to encourage portability?

What you ask, in fact, is that everybody in the world use English when programming. In Danish, for example, {,|, and } are three major vowels, like a, o and u in English. What would you say if I were to suggest that you'd better avoid a's, o's and u's in your identifiers?

Your view about non-portability is a bit too pessimistic. It would be very easy to modify existing compilers to accept a pragmat (compiler directive) like

    #letter "{|}"

so that each source program could define its set of letters used in identifiers. Of course, one has to use trigraphs for the symbols lost this way, but I can assure you that that is much to be preferred above seeing letters in places where one expects punctuation.
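
[Editorial sketch, not part of the original post.] One way a compiler's lexer could honour such a directive is to carry the declared letters along as an extra set consulted when scanning identifiers. Everything here is hypothetical: `#letter` is only a proposal in this thread, and the names `is_ident_char` and `ident_length` are invented. The sketch assumes the directive's string has already been parsed and is passed in as `extra_letters`.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns 1 if c may appear in an identifier: the usual letters, digits
   and underscore, plus any character listed in extra_letters (the string
   a hypothetical  #letter  directive would have supplied). */
int is_ident_char(int c, const char *extra_letters)
{
    if (c == '_' || isalnum((unsigned char)c))
        return 1;
    /* Exclude the terminator, which strchr would otherwise match. */
    return c != '\0' && strchr(extra_letters, c) != NULL;
}

/* Length of the identifier at the start of s, or 0 if s does not begin
   with one. A leading digit never starts an identifier. */
size_t ident_length(const char *s, const char *extra_letters)
{
    size_t n = 0;

    assert(s != NULL && extra_letters != NULL);
    if (isdigit((unsigned char)s[0]))
        return 0;
    while (s[n] != '\0' && is_ident_char((unsigned char)s[n], extra_letters))
        n++;
    return n;
}
```

With `extra_letters` set to the three Danish vowels, `r|d` scans as one three-character identifier; with an empty set the scan stops after `r` and `|` is left to be read as punctuation. That is the trade described above: once a character is declared a letter, its punctuation role has to be written as a trigraph.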
henry@utzoo.UUCP (Henry Spencer) (12/04/86)
> Not quite true. There is a French portion of North America. (Quebec ...

Although it's quite irrelevant to C and such, I would also point out that France is technically a North American country: the islands of St. Pierre and Miquelon (sp?) in the Gulf of St. Lawrence are part of France. Not just territories or possessions, mind you, they are provinces (districts? whatever...) of France. For example, they are an electoral district in French elections, electing one representative to the legislature.

> ... Besides, doesn't the UK have some
> differences in what they use as character set (the pound sign instead
> of the dollar sign, at least)...

I believe they normally have pound sign where we have number sign (#). This is actually fairly harmless.

--
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,decvax,pyramid}!utzoo!henry
bzs@bu-cs.BU.EDU (Barry Shein) (12/05/86)
I still don't understand why the omission of the graphic characters open/close curly brace was somehow necessary to provide a European character set. Are those character sets defined by the characters they exclude? Or is the pot calling the kettle black here?

It seems like we are now all in a mess, we don't have umlauts, they don't have curly braces etc. Obviously a superset is the only remotely acceptable solution.

	-Barry Shein, Boston University