[comp.lang.c] ANSI C -- trigraphs and character sets

minow@decvax.UUCP (Martin Minow) (12/14/86)

This is one of a collection of comments on the Draft Standard, posted to
comp.lang.c for discussion before I mail a final draft to the ANSI C
committee.  Each message discusses one problem I have found with the Draft
Standard that I feel warrants a "no" vote.  Note that this message is my
personal opinion, and does not reflect on the opinions of my employer.

---- Problem:

Page 10, line 1ff. The Standard should recognize the primacy of the ISO
Latin 1 character set.

Page 10, line 34ff. Trigraphs should be deleted from the standard.

---- Motivation:

Page 10, line 1ff.  The character set should be defined in terms of ISO
Latin 1 (ISO 8859/1, ANSI X3.134.1, ECMA-94).  While other character sets
may be used, they should be defined with reference to this standard.
Latin 1 contains representations for the accented characters needed for
many European languages.  These representations do not conflict with the
characters, such as backslash, that are needed for C syntax.  The standard
should permit the use of accented characters (positions 12/0 through 15/15)
in variable names (noting, however, that this may be non-portable and not
requiring it in a conforming compiler).  It should also require acceptance
of all 255 characters in strings.  (Some existing compilers use the 0x80
bit to mark variable substitution in the preprocessor.)  A reasonable
extension, but not one that I would mandate, would be to accept the Latin 1
multiply and divide signs as equivalents to '*' and '/' and the raised
dot as equivalent to period in numeric quantities.
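
For readers unfamiliar with the column/row notation used above (12/0, 15/15,
and later 5/12), the following is a minimal sketch of the arithmetic,
assuming an 8-bit code with 16 rows per column.  The function names
code_position and in_proposed_letter_range are hypothetical and chosen only
for illustration; they come from neither the Draft Standard nor any library.

```c
/* Convert a "column/row" code position, such as 5/12, to a byte value.
 * Assumes 16 rows per column, as in 7-bit ASCII and 8-bit Latin 1. */
static int code_position(int column, int row)
{
    return column * 16 + row;
}

/* Nonzero if c lies in positions 12/0 through 15/15, the range the
 * post proposes to allow in variable names.  (Hypothetical helper.) */
static int in_proposed_letter_range(int c)
{
    return c >= code_position(12, 0) && c <= code_position(15, 15);
}
```

Under these assumptions code_position(5, 12) is 92, and the proposed range
runs from 192 (hexadecimal C0) through 255 (hexadecimal FF).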

Page 10, line 34ff.  Trigraphs were added to the standard in order to
accommodate European users who currently use the character set positions
occupied by # [ \ ] ^ { | } ~.  A better solution is offered by the Latin 1
alphabet, which consists of the USASCII 7-bit alphabet augmented by a 128
byte character set containing the ``special'' letters used by most European
countries.  This standard was prepared jointly by ANSI, ISO, and CBEMA
(the European business equipment manufacturers).
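
To make the mechanism concrete, here is a minimal sketch of trigraph
replacement over a string.  It assumes the nine trigraphs as they appear in
the published standard (??= ??( ??/ ??) ??' ??< ??! ??> ??-); the draft under
discussion may differ in detail.  The names trigraph_char and
replace_trigraphs are hypothetical, and a real compiler performs this step on
the source file before tokenizing rather than on a string in memory.

```c
#include <stddef.h>

/* Return the character that "??" followed by `third` stands for,
 * or 0 if the three characters do not form a trigraph. */
static char trigraph_char(char third)
{
    switch (third) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Replace every trigraph in the NUL-terminated string s, in place.
 * The result is never longer than the input.  Returns the new length. */
static size_t replace_trigraphs(char *s)
{
    size_t in = 0, out = 0;

    while (s[in] != '\0') {
        char c;
        /* s[in + 2] is read only when s[in + 1] is '?', so the
         * lookahead never passes the terminating NUL. */
        if (s[in] == '?' && s[in + 1] == '?' &&
            (c = trigraph_char(s[in + 2])) != 0) {
            s[out++] = c;
            in += 3;
        } else {
            s[out++] = s[in++];
        }
    }
    s[out] = '\0';
    return out;
}
```

When writing test input for such a routine, spell the question marks as
"?\?" in C string literals so that the compiler does not translate the
trigraph itself before the routine ever sees it.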

During the transitional period, users of existing equipment that supports
national letters are better served by implementation-specific conversion
routines that are external to the C language. These would compose multi-byte
sequences into Latin 1 and display Latin 1 characters (using either the
representations available on the terminal or fallback composition sequences).

The composition process would be external to, and independent of, the C
language.  The implementation may provide it through a #pragma.

Note that the standard does not offer the implementor guidance in handling
programs that mix trigraph sequences and national letters.  As stated, it
is clear that the sequence `??/' functions as a backslash.  However, it is
not clear how the compiler is to treat an input character (assuming 7-bit
ASCII) in position 5/12 (having decimal value 92).  Is this also a backslash,
or is it a national letter (such as the Swedish capital 'O' with two dots)?
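
The ambiguity shows up clearly in a sketch of the kind of external conversion
routine described earlier.  This one assumes a Swedish 7-bit national
character set in which positions 5/11 through 5/13 and 7/11 through 7/13
carry national letters; the name swedish7_to_latin1 is hypothetical, and the
table covers only those six positions.

```c
/* Map one byte from an assumed Swedish 7-bit national character set
 * to Latin 1.  Only the six national-letter positions are remapped;
 * every other byte passes through unchanged.  (Illustrative only.) */
static unsigned char swedish7_to_latin1(unsigned char c)
{
    switch (c) {
    case 0x5B: return 0xC4;  /* 5/11: capital A with diaeresis */
    case 0x5C: return 0xD6;  /* 5/12: capital O with diaeresis */
    case 0x5D: return 0xC5;  /* 5/13: capital A with ring      */
    case 0x7B: return 0xE4;  /* 7/11: small a with diaeresis   */
    case 0x7C: return 0xF6;  /* 7/12: small o with diaeresis   */
    case 0x7D: return 0xE5;  /* 7/13: small a with ring        */
    default:   return c;
    }
}
```

A filter built on this mapping turns byte 92 into a Latin 1 letter, so a
compiler reading its output never sees a backslash at that position.  A
compiler reading the unconverted file sees byte 92 and must decide for
itself, which is exactly the case the Draft leaves open.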

----

Martin Minow
decvax!minow

gwyn@brl-smoke.ARPA (Doug Gwyn) (12/15/86)

In article <106@decvax.UUCP> minow@decvax.UUCP (Martin Minow) writes:
>Page 10, line 1ff. The Standard should recognize the primacy of the ISO
>Latin 1 character set.

We (they, actually; that was before my time) tried to do
essentially that once, but immediately ran up against the
problem that some important vendors much prefer EBCDIC.
There doesn't seem to be any strong logical argument for
forbidding reasonable non-ISO-conforming character sets.

>Page 10, line 34ff. Trigraphs should be deleted from the standard.

I could easily back such a proposal; trigraphs appear to be
a remnant of a premature attempt to cope with
internationalization and are probably neither necessary nor
desirable.

The above remarks are by no means official X3J11 positions;
please go ahead and send in your suggestions to ANSI.

ccplumb@watnot.UUCP (Colin Plumb) (12/17/86)

>>Page 10, line 1ff. The Standard should recognize the primacy of the ISO
>>Latin 1 character set.
>
>We (they, actually; that was before my time) tried to do
>essentially that once, but immediately ran up against the
>problem that some important vendors much prefer EBCDIC.

Are you sure that should be the plural "important vendors"?   :-)

Seriously, if "before your time" means any significant number of years, they
should try again.  With apologies to any EBCDIC (or other) users out there,
ASCII (and supersets, like the JIS standard) reigns pretty much unchallenged
these days.

	-Colin Plumb (ccplumb@watnot.UUCP)

Zippy says:
Hey, wait a minute!!  I want a divorce!!  ..you're not Clint Eastwood!!

barmar@mit-eddie.MIT.EDU (Barry Margolin) (12/18/86)

In article <106@decvax.UUCP> minow@decvax.UUCP (Martin Minow) writes:
>This standard was prepared jointly by ANSI, ISO, and CBEMA
>(the European business equipment manufacturers).

You must have meant ECMA, not CBEMA.  The latter has its offices in
Washington, DC, and is involved heavily with ANSI.  I think the "E" in
ECMA stands for "European".
-- 
    Barry Margolin
    ARPA: barmar@MIT-Multics
    UUCP: ..!genrad!mit-eddie!barmar

barmar@mit-eddie.MIT.EDU (Barry Margolin) (12/18/86)

In article <12304@watnot.UUCP> ccplumb@watnot.UUCP (Colin Plumb) writes:
>>We (they, actually; that was before my time) tried to do
>>essentially that once...
>Seriously, if "before your time" means any significant number of years, they
>should try again.

X3J11 has only been in existence for a couple of years, which I think is
less than "any significant number of years."  If they ran up against
significant pressures then, they most likely will again.  IBM hasn't
dropped EBCDIC, but there are lots of people who want to use C on IBM
equipment.  The C standard takes many pains to be character set
independent; I remember lots of flaming on mod.std.c about how to word
the descriptions of \r, \n, \g, etc., so that they would not obviously
discriminate against non-ASCII implementations.
-- 
    Barry Margolin
    ARPA: barmar@MIT-Multics
    UUCP: ..!genrad!mit-eddie!barmar

brunner@sri-spam.istc.sri.com (Thomas Eric Brunner) (12/19/86)

In article <4327@mit-eddie.MIT.EDU> barmar@eddie.MIT.EDU (Barry Margolin) writes:
>In article <106@decvax.UUCP> minow@decvax.UUCP (Martin Minow) writes:
>>This standard was prepared jointly by ANSI, ISO, and CBEMA
>>(the European business equipment manufacturers).
>
>You must have meant ECMA, not CBEMA.  The latter has its offices in
>Washington, DC, and is involved heavily with ANSI.  I think the "E" in
>ECMA stands for "European".

ECMA = European Computer Manufacturers Association.
There are 12 member companies; when I worked on/drafted the X/OPEN
text (the green book), my clients comprised 5/12ths of ECMA.

Other ECMA-ish entities are SPAG and ROSE, though I believe both of
these are ESPRIT entities.

Shall we start an informative discussion of the sundry ECMA/ESPRIT-like
entities? I'm sure that there are many of which I am ignorant.

-- 
Cheers!  o/
/teb    _0_