[comp.std.internat] draft ANSI standard: trigraphs rear their ugly heads again

gnu@hoptoad.uucp (John Gilmore) (12/02/86)

[This is posted to comp.lang.c because mod.std.c seems to be dead.  Love
those mod groups!]

The committee did not want to tie C to ASCII.  Fair enough.  What they
did was require that all the relevant characters be in the character
set (section 2.2.1), but not say anything about their character encoding.
In fact, you could compile source code in ASCII to run on a machine
that uses EBCDIC in the runtime environment.  This is great.

The problem is that they went ahead to try to define a way to represent
all the relevant characters in all the ISO code sets used in Europe.
Since various countries reuse #, [, {, }, ], \, |, ~, and ^ as letters
and such, they have defined three-character sequences that can be used
to represent these characters.

Now, these are major characters in the language.  The preprocessor
prefix #.  The block structuring construct { }.  The array subscripter
[ ].  And the ultimate escape character \, as well as a bunch of
logical ops.

My question is this.  Is a C program that is written in plain old ASCII,
using the above characters, portable?  Is it a "strictly conforming program"?
Is every ANSI standard C compiler in the world required to read in such
a program and translate it properly?

Next question.  Is a C program that uses local letters outside character
strings (e.g. as letters in French or Swedish identifiers) portable?
Is it a "strictly conforming program"?  Are there ANY C compilers anywhere
in the world which will read in such a program and translate it properly?

My preliminary answers are:  C programs that use ASCII characters had
damn well better be strictly conforming, or every C program in the world
is broken.  C compilers on European machines could support the national
letters in identifiers and such, but any program that used this feature
would not be portable.

Since a European C compiler which supported using the local characters
AS LETTERS would encourage unportable code, it would be better to make
European C compilers which did not support using the local characters
as letters.  This is tough, but are we trying to be nice or are we
trying to encourage portability?  Since the specific intent of the
standard is to promote portability, features in the standard which
encourage the generation of nonportable code should be questioned.
Newly introduced features discouraging portability should be removed.

Now.  If European C compilers do not support using the local characters
as letters, and don't support using them as ASCII punctuation, everyone
in Europe will be forced to write their code using trigraphs.

Of course, any code written in North America or the UK will use ASCII
characters, so the Europeans will have to write a program to translate
the imported {, }, etc into trigraphs.
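
A rough sketch of what such a translation program might look like, using
the nine replacements from the draft.  The names trigraph_tail and
to_trigraphs are made up for illustration, and the sketch assumes the
input is plain ASCII text held in memory.  It deliberately does nothing
about question-mark pairs already present in the input, which a real
tool would have to worry about.

```c
#include <assert.h>
#include <stddef.h>

/* Third character of the trigraph for each affected ASCII character,
   or 0 when c needs no translation. */
static char trigraph_tail(char c)
{
    switch (c) {
    case '#':  return '=';
    case '[':  return '(';
    case '\\': return '/';
    case ']':  return ')';
    case '^':  return '\'';
    case '{':  return '<';
    case '|':  return '!';
    case '}':  return '>';
    case '~':  return '-';
    default:   return 0;
    }
}

/* Copy `in` to `out`, replacing each affected character by its
   three-character sequence.  Returns the output length, or -1 if
   `out` (of `outsize` bytes) is too small to hold the result. */
long to_trigraphs(const char *in, char *out, size_t outsize)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        char tail = trigraph_tail(*in);
        size_t need = tail ? 3 : 1;

        if (n + need >= outsize)    /* keep room for the terminator */
            return -1;
        if (tail) {
            out[n++] = '?';
            out[n++] = '?';
            out[n++] = tail;
        } else {
            out[n++] = *in;
        }
    }
    if (n >= outsize)
        return -1;
    out[n] = '\0';
    return (long)n;
}
```

With a large enough buffer, to_trigraphs("a[i]", buf, sizeof buf) returns
8 and leaves a??(i??) in buf.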

I think that a better solution is for the European compilers to support
these character codes to mean what they mean in ASCII.  Now imported
sources can be compiled directly.  Also, Europeans would have the choice of
editing the ASCII sources rather than using trigraphs.  The programs
will look funny on local terminals, but I don't see how it can be
harder to read a program filled with local letters as punctuation, than
it can be to read a program that looks like:

??=include <stdio.h>

main(argc, argv)
	int argc; char **argv;
??<
	char buf??(??) = "Hello, world!??/r??/n";

	if (feof(stdin) ??!??! argc != 0) ??<
		printf(buf);
	??>
??>

Since the trigraphs are even uglier than the alternative, and since
European compilers will not be able to use those character codes for
anything else, there is no need for introducing the trigraphs.  "The
X3J11 charter clearly mandates the committee to *codify common existing
practice*" (emphasis theirs -- Rationale, pg. 1).  The committee's 
justification for ignoring common practice here is too weak.  The
trigraphs should be removed.
-- 
John Gilmore  {sun,ptsfa,lll-crg,ihnp4}!hoptoad!gnu   jgilmore@lll-crg.arpa
    "I can't think of a better way for the War Dept to spend money than to
  subsidize the education of teenage system hackers by creating the Arpanet."

jmlang@water.UUCP (Jerome M Lang) (12/02/86)

In article <1381@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>
>Of course, any code written in North America or the UK will use ASCII
>characters, so the Europeans will have to write a program to translate
>the imported {, }, etc into trigraphs.

Not quite true.  There is a French portion of North America. (Quebec
is well known). I remember at the Universite' de Moncton (In New
Brunswick, Canada) we had quite a few terminals that used a "French"
ASCII.  Makes your code real funny.  Besides, doesn't the UK have some
differences in what they use as character set (the pound sign instead
of the dollar sign, at least)?  This situation is very serious when
the code is in C.
-- 
Je'ro^me M. Lang	   ||    jmlang@water.bitnet        jmlang@water.uucp
Dept of Applied Math       ||			  jmlang%water@waterloo.csnet
U of Waterloo		   ||  	 jmlang%water%waterloo.csnet@csnet-relay.arpa

gwyn@brl-smoke.ARPA (Doug Gwyn) (12/02/86)

In article <1381@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>Since various countries reuse #, [, {, }, ], \, |, ~, and ^ as letters
>and such, they have defined three-character sequences that can be used
>to represent these characters.
>...
>Since the trigraphs are even uglier than the alternative, and since
>European compilers will not be able to use those character codes for
>anything else, there is no need for introducing the trigraphs.  "The
>X3J11 charter clearly mandates the committee to *codify common existing
>practice*" (emphasis theirs -- Rationale, pg. 1).  The committee's 
>justification for ignoring common practice here is too weak.  The
>trigraphs should be removed.

The ??* trigraphs were introduced early in the C standard drafting
process (before my time, actually).  I agree that the issue should
be re-examined now that X3J11 is paying more attention to international
issues.  Notice that X3J11 recently came down firmly on the side of the
ideas that "C source is English" and that the default start-up run-time
"locale" is "C standard".  (AT&T, and to some extent I, would have
preferred that the start-up locale be left up to the implementation, to
permit setting it via UNIX's environment, rather than requiring nearly
every international application to explicitly invoke setlocale(), but
the majority preferred to have a well-defined initial state that would
permit systems code to completely ignore locale without peril.)

It does seem rather peculiar to buy into the ISO invariant code set
characters without buying into ISO/ASCII encoding standards.

Send your comment in formally to be sure that it receives consideration.

aeb@mcvax.cwi.nl (Andries Brouwer) (12/03/86)

In article <1381@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>...
>My preliminary answers are:  C programs that use ASCII characters had
>damn well better be strictly conforming, or every C program in the world
>is broken.  C compilers on European machines could support the national
>letters in identifiers and such, but any program that used this feature
>would not be portable.
>
>Since a European C compiler which supported using the local characters
>AS LETTERS would encourage unportable code, it would be better to make
>European C compilers which did not support using the local characters
>as letters.  This is tough, but are we trying to be nice or are we
>trying to encourage portability?

What you ask, in fact, is that everybody in the world use English
when programming. In Danish, for example, {,|, and } are three major
vowels, like a, o and u in English. What would you say if I were to
suggest that you'd better avoid a's, o's and u's in your identifiers?

Your view about non-portability is a bit too pessimistic.
It would be very easy to modify existing compilers to accept
a pragmat (compiler directive) like
#letter "{|}"
so that each source program could define its set of letters used
in identifiers. Of course, one has to use trigraphs for the symbols
lost this way, but I can assure you that that is much to be preferred
above seeing letters in places where one expects punctuation.
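
To make the idea concrete, here is a minimal sketch of how a lexer might
honour such a directive.  No existing compiler is being described: the
#letter directive is only the proposal above, and set_extra_letters,
is_ident_char and ident_length are hypothetical names.  The sketch
assumes the "C" locale and a single-byte character set.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_EXTRA_LETTERS 32

/* Characters promoted to letters by the (hypothetical) directive. */
static char extra_letters[MAX_EXTRA_LETTERS + 1] = "";

/* Record the characters named by a #letter directive.
   Returns 0 on success, -1 if the set is too long. */
int set_extra_letters(const char *letters)
{
    if (strlen(letters) > MAX_EXTRA_LETTERS)
        return -1;
    strcpy(extra_letters, letters);
    return 0;
}

/* Nonzero if c may appear in an identifier; `first` is nonzero when
   c would be the identifier's first character. */
int is_ident_char(int c, int first)
{
    if (c == '\0')
        return 0;
    if (c == '_' || isalpha((unsigned char)c))
        return 1;
    if (!first && isdigit((unsigned char)c))
        return 1;
    return strchr(extra_letters, c) != NULL;
}

/* Length of the identifier at the start of s, or 0 if there is none. */
size_t ident_length(const char *s)
{
    size_t n = 0;

    while (is_ident_char((unsigned char)s[n], n == 0))
        n++;
    return n;
}
```

Before any directive is seen, ident_length("r|d = 1") is 1, because the
bar ends the identifier.  After set_extra_letters("{|}") it is 3, and the
bar could then only be written as an operator by way of its trigraph.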

henry@utzoo.UUCP (Henry Spencer) (12/04/86)

> Not quite true.  There is a French portion of North America. (Quebec ...

Although it's quite irrelevant to C and such, I would also point out that
France is technically a North American country:  the islands of St. Pierre
and Miquelon (sp?) in the Gulf of St. Lawrence are part of France.  Not
just territories or possessions, mind you, they are provinces (districts?
whatever...) of France.  For example, they are an electoral district in
French elections, electing one representative to the legislature.

> ... Besides, doesn't the UK have some
> differences in what they use as character set (the pound sign instead
> of the dollar sign, at least)...

I believe they normally have pound sign where we have number sign (#).
This is actually fairly harmless.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry

bzs@bu-cs.BU.EDU (Barry Shein) (12/05/86)

I still don't understand why the omission of the graphic characters
open/close curly brace was somehow necessary to provide a European
character set. Are those character sets defined by the characters
they exclude? Or is the pot calling the kettle black here?

It seems like we are now all in a mess: we don't have umlauts, they
don't have curly braces, etc.  Obviously a superset is the only remotely
acceptable solution.

	-Barry Shein, Boston University