[comp.std.c] wchar_t values

erik@sra.co.jp (Erik M. van der Poel) (03/29/91)

ANSI C says the following about wchar_t:

    ... an integral type whose range of values can represent distinct
    codes for all members of the largest extended character set specified
    among the supported locales; the null character shall have the code
    value zero and each member of the basic character set defined in 2.2.1
    shall have a code value equal to its value when used as the lone
    character in an integer character constant.

Now, if this question has been asked before, I apologize. But here
goes:

Which of the following two conditions is the correct interpretation of
the ANSI C standard:

	('c' == L'c')

	('c' == ((char) L'c'))
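
For concreteness, a minimal test program; whatever it prints on a given
implementation shows only what that implementation does, not what the
standard requires:

    #include <stdio.h>

    int main(void)
    {
        /* 'c' has type int; L'c' has type wchar_t, an integral type. */
        printf("'c' == L'c'        : %d\n", 'c' == L'c');
        printf("'c' == (char) L'c' : %d\n", 'c' == (char) L'c');
        return 0;
    }
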
-
-- 
Erik M. van der Poel                                      erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan     TEL +81-3-3234-2692

diamond@jit345.swstokyo.dec.com (Norman Diamond) (03/29/91)

In article <990@sranha.sra.co.jp> erik@sra.co.jp (Erik M. van der Poel) writes:

>Which of the following two conditions is the correct interpretation of
>the ANSI C standard:
>	('c' == L'c')
>	('c' == ((char) L'c'))

Both must be true.  However, if you try it with @ instead of c, or with
any other character which is not in the basic character set defined in
section 2.2.1, then all bets are off.

(Recall that '@' does not even have to compare equal to ((char) '@'), with
no use of wide characters at all.)
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.

meissner@osf.org (Michael Meissner) (03/30/91)

In article <1991Mar29.073917.1217@tkou02.enet.dec.com> diamond@jit345.swstokyo.dec.com (Norman Diamond) writes:

| In article <990@sranha.sra.co.jp> erik@sra.co.jp (Erik M. van der Poel) writes:
| 
| >Which of the following two conditions is the correct interpretation of
| >the ANSI C standard:
| >	('c' == L'c')
| >	('c' == ((char) L'c'))
| 
| Both must be true.  However, if you try it with @ instead of c, or with
| any other character which is not in the basic character set defined in
| section 2.2.1, then all bets are off.
| 
| (Recall that '@' does not even have to compare equal to ((char) '@'), with
| no use of wide characters at all.)

Actually this is not true.  Nowhere in the standard does it say that
the bits for the multibyte character constant 'c' must equal the bits
for the wide character constant L'c'.  As long as mbtowc and wctomb do
the appropriate translations, and the null byte gives a wchar_t value
of all zeros, everything is standard conforming.
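
A minimal sketch of that view, using only functions the standard already
provides; the values it prints are implementation-defined:

    #include <stdio.h>
    #include <stdlib.h>
    #include <limits.h>
    #include <locale.h>

    int main(void)
    {
        wchar_t wc;
        char mb[MB_LEN_MAX + 1];
        int n;

        setlocale(LC_CTYPE, "");        /* use the native locale */

        /* multibyte -> wide: the implementation's mapping, not a bit copy */
        n = mbtowc(&wc, "c", MB_CUR_MAX);
        printf("mbtowc used %d byte(s), wide value %ld\n", n, (long) wc);

        /* wide -> multibyte: the inverse mapping */
        n = wctomb(mb, wc);
        if (n > 0) {
            mb[n] = '\0';
            printf("wctomb produced %d byte(s): \"%s\"\n", n, mb);
        }
        return 0;
    }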

In fact, if you use Unicode (as opposed to ISO 10646) to hold the
wchar_t's, 'c' will NOT equal L'c'.  Whether this is a reasonable
thing for programmers to expect is immaterial.  My personal bias is
that 'c' should equal L'c'.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?

gwyn@smoke.brl.mil (Doug Gwyn) (03/30/91)

In article <990@sranha.sra.co.jp> erik@sra.co.jp (Erik M. van der Poel) writes:
>Which of the following two conditions is the correct interpretation of
>the ANSI C standard:
>	('c' == L'c')
>	('c' == ((char) L'c'))

Neither one, although the first one is close.  The numerical values
of these two (possibly distinct) integer types shall be the same.
Note, by the way, that 'c' has type int.
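
A quick way to see that last point on a hosted implementation:

    #include <stdio.h>

    int main(void)
    {
        /* A character constant such as 'c' has type int, not char. */
        printf("sizeof 'c'   = %lu\n", (unsigned long) sizeof 'c');
        printf("sizeof(char) = %lu\n", (unsigned long) sizeof(char));
        return 0;
    }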

keld@login.dkuug.dk (Keld J|rn Simonsen) (03/31/91)

erik@sra.co.jp (Erik M. van der Poel) writes:

>ANSI C says the following about wchar_t:

>    ... an integral type whose range of values can represent distinct
>    codes for all members of the largest extended character set specified
>    among the supported locales; the null character shall have the code
>    value zero and each member of the basic character set defined in 2.2.1
>    shall have a code value equal to its value when used as the lone
>    character in an integer character constant.

Note that there are problems with this; for all the multibyte
standard character sets that I know of, L'c' is not equal to 'c' .
ISO 10646, JIS X 0208, GB 2312 and KSC 5601 all have a value of L'c'
different from ASCII 'c'. 

WG14 has got a comment with respect to this from SC2, and we are
working to see if we can solve this problem.

Another problem present in the above is that "the null character"
shall have the value zero. The NUL character in ISO DIS 10646
does not have this value, but then ANSI/ISO C uses the term
"null character" differently: it means the string terminator
and has no direct relation to the SC2 "NUL character".

Keld Simonsen
member of ISO/IEC JTC1/SC22/WG14

gwyn@smoke.brl.mil (Doug Gwyn) (03/31/91)

In article <keld.670360834@dkuugin> keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>Note that there are problems with this; for all the multibyte
>standard character sets that I know of, L'c' is not equal to 'c' .
>ISO 10646, JIS X 0208, GB 2312 and KSC 5601 all have a value of L'c'
>different from ASCII 'c'. 

In principle this could be handled by the C implementation, which is
NOT obliged to map these characters into any particular code set.

Perhaps a better solution, however, is to all agree not to test for
conformance with the L'c'=='c' criterion, since there seems to be
no good reason for this requirement.  I sure don't remember this
being discussed in X3J11, although it might have been.

>Another problem present in the above is that "the null character"
>shall have the value zero. The NUL character in ISO DIS 10646
>does not have this value, but then ANSI/ISO C uses the term
>"null character" differently: it means the string terminator
>and has no direct relation to the SC2 "NUL character".

Correct -- "null character" is a character (byte) with value zero.
There is no requirement in the C standard that anybody's idea of
"NUL" be represented in the implementation's character set.

erik@srava.sra.co.jp (Erik M. van der Poel) (03/31/91)

As several people have guessed, the real reason for bringing up the
wchar_t issue is because I am wondering how ISO 10646 can be used in
the C language. Personally, I think that we should use it as follows:

	C	ISO DIS 10646/4		wchar_t

	L'c'	032/032/032/099		000/000/000/099
	L'\t'	009/128/128/128		000/000/000/009

I think that this is the most reasonable way to do it since it seems
to conform to ANSI C. Of course, this wchar_t encoding does not
conform to 10646's processing code, but I do not think that this is
important. It is more important for the external code (files, network,
etc) to conform to 10646.

However, I don't really care what encoding we use for wchar_t, as long
as implementors who wish to use 10646 for wchar_t all agree on one
encoding. So we should create an international standard that specifies
how to use 10646 as a processing code in C. If this spec appears some
time after 10646 becomes an IS, implementors might do things
differently. So the spec should appear together with 10646. Perhaps in
a normative annex in 10646?
-
-- 
Erik M. van der Poel                                      erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan     TEL +81-3-3234-2692

keld@login.dkuug.dk (Keld J|rn Simonsen) (04/01/91)

erik@srava.sra.co.jp (Erik M. van der Poel) writes:

>As several people have guessed, the real reason for bringing up the
>wchar_t issue is because I am wondering how ISO 10646 can be used in
>the C language. Personally, I think that we should use it as follows:

>	C	ISO DIS 10646/4		wchar_t

>	L'c'	032/032/032/099		000/000/000/099
>	L'\t'	009/128/128/128		000/000/000/009

>I think that this is the most reasonable way to do it since it seems
>to conform to ANSI C.

Erik writes: ANSI C does not handle 10646 properly -> let's change 10646!
I do not think this is the right way of reasoning.

ANSI C does not handle DIS 10646, JIS X 0208, GB 2312 and KSC 5601 
correctly. So ANSI C multibyte specifications *cannot* be used on any
multibyte de jure character set. This seems to me to be a fault with
ANSI C. Also, the character standards should be the base standards, and
programming language standards should build on these and provide appropriate
functionality to cover the standard character sets.

If another programming language, or maybe some communication
standard, has other requirements for a universal character standard,
should the character standard then also be changed to accommodate that use?
And what if the different requirements are contradictory; should that
lead to different character set standards? Well, that is what happened
in the past, with the ISO 646 and 8859 standards in programming languages
and 6937/T.61 in the communications world. I hope that this problem
will be a historical one with the appearance of 10646.

>However, I don't really care what encoding we use for wchar_t, as long
>as implementors who wish to use 10646 for wchar_t all agree on one
>encoding. So we should create an international standard that specifies
>how to use 10646 as a processing code in C. If this spec appears some
>time after 10646 becomes an IS, implementors might do things
>differently. So the spec should appear together with 10646. Perhaps in
>a normative annex in 10646?

It could also appear in the ISO C addendum that is being worked on
by WG14. I think that is the most natural place; 10646, as a base
standard for other JTC1 work, should not reference the ISO C standard.

I have some ideas on how to solve it in C:

1. Include a table in the runtime library for mapping ASCII characters
into the current execution character set. This table is changed by a
new call to setlocale(). L'c' then refers to the table entry for ASCII
'c', which holds the current wchar_t value of 'c'. (A rough sketch of
this idea follows below.)
Efficiency: quite good, just a value fetched through the table instead
of an immediate value. For widechar characters this may even come
without any loss, as the widechar value may have to be stored in a
2- or 4-byte location anyway.

2. Have a function which returns a character given a charmap name
(POSIX term). This has the generality that not only ASCII characters
can be handled in this way; a character such as <c,> (c-cedilla) can
also be tested for in this way.
Efficiency: less good, as it needs a function call and a table lookup
on a name (hashed or the like).

Maybe we should have both ways of  handling the identity of widechars.
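
Here is a rough sketch of the first idea. The table name and the refresh
hook are purely hypothetical; nothing like them exists in the standard
library:

    #include <stdlib.h>
    #include <limits.h>

    /* Hypothetical: one wide-character value per single-byte code,
       refreshed whenever the LC_CTYPE locale changes. */
    static wchar_t exec_wc[UCHAR_MAX + 1];

    /* Imagined to be called from a wrapper around setlocale(). */
    void refresh_exec_wc(void)
    {
        int i;
        char mb[2];

        mb[1] = '\0';
        mbtowc(NULL, NULL, 0);          /* reset any shift state */
        for (i = 0; i <= UCHAR_MAX; i++) {
            wchar_t wc = 0;
            mb[0] = (char) i;
            if (i != 0 && mbtowc(&wc, mb, 1) < 1)
                wc = 0;                 /* not a complete character by itself */
            exec_wc[i] = wc;
        }
    }

    /* Usage would then look like:
           if (wc == exec_wc[(unsigned char) 'c']) ...                   */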

Keld Simonsen

gwyn@smoke.brl.mil (Doug Gwyn) (04/01/91)

In article <keld.670436534@dkuugin> keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>erik@srava.sra.co.jp (Erik M. van der Poel) writes:
>>	C	ISO DIS 10646/4		wchar_t
>>	L'c'	032/032/032/099		000/000/000/099
>>	L'\t'	009/128/128/128		000/000/000/009
>Erik writes: ANSI C does not handle 10646 properly -> let's change 10646!

No, he didn't say that, and his suggestion seemed reasonable to me.

>ANSI C does not handle DIS 10646, JIS X 0208, GB 2312 and KSC 5601 
>correctly. So ANSI C multibyte specifications *cannot* be used on any
>multibyte de jure character set.

I think you have mixed multibyte character sequences with wchar_t.
They are NOT the same thing!  That is why there are interconversion
functions specified in the C standard.

The advice X3J11 received during development of this aspect of the
standard, from such organizations as NTSCJ who have a major stake in
so-called multibyte character encodings, was that the mechanisms in
the C standard were adequate for this purpose.  Unless you can explain
WHAT it is that you think is wrong, I suggest that your comments be
ignored.  I haven't seen a significant technical argument against
the "wide character" mechanisms in the C standard; what I have seen
are misunderstandings.  Perhaps you should refer to P.J. Plauger's
model standard C library implementation in his new book, to see what
is actually involved in exploiting and implementing these facilities.

>Also, the character standards should be the base standards, and
>programming language standards should build on these and provide appropriate
>functionality to cover the standard character sets.

WHICH "character standard"?  There are so many to choose from, all
of them botched in one way or another.  That is why programming
language standards should be INDEPENDENT of any particular choice of
character code set, rather than based on one choice that may not be
appropriate for many of the potential users of the language.  In the
case of C, the only requirements on the basic source and execution
character sets are that there be at least 96 distinct values, that
the values assigned by the C implementation to represent digit glyphs
be a contiguous ascending sequence, that there be three additional
distinct values in the execution set, and that all the previously
mentioned internal values be distinct from zero.  THE MAPPING BETWEEN
INTERNAL CODE SET VALUES AND THE ENVIRONMENT CAN, AND SHOULD, BE
DEFINED BY THE C IMPLEMENTATION.  Thus, a straight 6-bit external
code set could not have a one-to-one correspondence between external
"characters" and C source OR execution "character" values, and in
such a system environment there would have to be at least one added
convention for representing the full set of internal C characters,
with support tools to facilitate working with such special text files.
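
For example, the digit-contiguity clause is about the only ordering a
strictly portable program may rely on:

    #include <assert.h>

    int main(void)
    {
        /* '0'..'9' are guaranteed contiguous and ascending, so this
           digit-to-value arithmetic is portable; nothing similar is
           guaranteed for the letters. */
        char d = '7';
        assert(d >= '0' && d <= '9');
        assert(d - '0' == 7);
        return 0;
    }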

Mapping is an extremely important mathematical concept, with particular
relevance to applications involving multiple alphabets.  (As a former
cryptanalyst I am especially sensitive to this.)  That is why the VERY
FIRST STEP in translating a C program, as spelled out in the C standard
("Translation Phases", section 2.1.1.2 in X3.159-1989), is the
application of a MAPPING from physical (i.e. external) source file
characters to (internal) C source characters.  Systems with record-
oriented text files can exploit this mapping subphase to introduce
line delimiter internal characters ("new-line" characters in the C
source character set), and systems that lack standard representations
for some of the required C source characters can take advantage of
this mapping subphase to interpret, for example, digraph
representations for the characters not normally considered to be
represented in the "native" code set.  This is a simple and clean
approach to satisfying the C source character set requirements.
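
As a toy illustration of the shape of that mapping subphase, a filter
from "physical" characters to C source characters; the two substitutions
here are invented for the example, not anyone's actual convention:

    #include <stdio.h>

    int main(void)
    {
        int c;

        while ((c = getchar()) != EOF) {
            switch (c) {
            case '$': c = '#';  break;  /* assumed local convention */
            case '@': c = '\\'; break;  /* assumed local convention */
            default:            break;  /* everything else maps to itself */
            }
            putchar(c);
        }
        return 0;
    }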

Indeed, X3J11 explained this to you in the third public review response
document.  Judging from your continued pursuit of more obtrusive
solutions for your own particular limited character set problem,
it would appear that you either did not understand the X3J11 response
or that, for reasons of your own, you wish to ignore it.  For purposes
of documentation for those who have not seen the response X3J11 gave
long ago, here it is:

	In response to Letter #177, Doc. No. X3J11/88-134:

	Summary of issue:
		Proposal for more readable supplement to trigraphs.

	X3J11 response:
		The Committee discussed this proposal but decided
	against it.

		We cannot support this proposal for a number of reasons.
	Trigraphs were intended to provide a universally portable format
	for the *transmission* of C programs; they were never intended
	to be used for day-to-day reading and writing of programs.
	Should it be necessary to do so, however, the preprocessor can
	already be used to improve their readability (exact macro names
	and definitions are not provided as the Committee prefers to
	avoid stylistic issues).  As larger character sets become more
	and more popular, the chances of having to deal with a
	"deficient" character set become smaller and smaller.

		Conversion between the current trigraph representations
	and "normal" representations can be done simply in a context-
	free manner, but this is not possible with the proposed notation.
	Also, there are a number of difficulties with the infix subscript
	operator where empty brackets would have been used.  Either the
	operator must be allowed as a postfix unary operator as well as
	a binary operator, or the grammar must be extended to allow empty
	parentheses to appear in those contexts where empty brackets can.
	Although these problems are by no means insurmountable, we feel
	that the current trigraphs are adequate for their intended use
	and that no further enhancements are necessary.

		Translation phase 1 actually consists of two parts, first
	the mapping (about which we say very little) from the external
	host character set to the C source character set, then the
	replacement of C source trigraph sequences with single C source
	characters.  (Note that the C source characters represented in
	our documents in Courier font need not appear graphically the
	same in the host environment, although a reasonable
	implementation will make them as nearly so as possible.)
	The kind of mapping you propose can in fact be done in the first
	part of translation phase 1, and several such "convenience"
	mappings are already common practice.  However, attempting to
	standardize this mapping is outside the scope of the C Standard,
	since what is appropriate may depend on the capabilities of the
	specific hardware, availability of fonts, and so forth.

		Although the Committee regrets any "no" votes on either
	the national or international proposed standards, we feel we
	must represent our best judgement on technical issues.  We hope
	you will reconsider your objection to the current specification.

Note that your "trigraph alternative" proposals had been discussed many
times in the standards committees, and still were resoundingly defeated
during a joint X3J11/WG14 meeting.  The only reason this issue is still
"on the table" for WG14 is that there was some political maneuvering at
the SC22 level in the absence of anybody who could represent the actual
issues and history, and SC22 mistakenly thought, on the basis of your
argumentation, that there was a problem that needed to be solved, and
thus directed that work toward a normative addendum to the ISO C
standard begin to address this "problem".  Later, the Japanese in
particular thought that it would be appropriate to add more support for
multibyte character sequences to the ISO C standard as part of this
normative addendum.  Your original hobby horse had nothing to do with
multibyte character sequences, and so far as I can determine, the
Japanese have not found any problem with them other than the desire for
more standard library functions to make their use more convenient.

It is also worth noting that there is continuing discussion of this
issue on the X3J11 and WG14 electronic mailing lists.

>I hope that this problem will be a historical one with the appearance
>of 10646.

Surely you should be able to see the possible problems with 10646?
The very idea of using 32 bits to represent a character is bound to
meet stiff opposition, particularly from users of small systems,
who already have more efficient solutions to the "problem" of a
diversity of alphabets.  It seems to me that 10646 is one of the
technically worst character-set standards yet to be adopted.  No
wonder there has been renewed interest in other standards such as
"Unicode" (about which I know little at present other than that it
has a broad base of industry support).

One does not solve "people problems" by simply adopting a technical
standard.  History provides much evidence of that.

DISCLAIMER:  None of the above should be construed as an official X3J11
position, not even the attempt to cite from an X3J11 document.  However,
I believe that I have correctly represented the situation as I
understand it.

diamond@jit345.swstokyo.dec.com (Norman Diamond) (04/01/91)

In article <MEISSNER.91Mar29120040@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
>In article <1991Mar29.073917.1217@tkou02.enet.dec.com> diamond@jit345.swstokyo.dec.com (Norman Diamond) writes:
>> In article <990@sranha.sra.co.jp> erik@sra.co.jp (Erik M. van der Poel) writes:
emvdp>Which of the following two conditions is the correct interpretation of
emvdp>the ANSI C standard:
emvdp>	('c' == L'c')
emvdp>	('c' == ((char) L'c'))
ndd>Both must be true.  However, if you try it with @ instead of c, or with
ndd>any other character which is not in the basic character set defined in
ndd>section 2.2.1, then all bets are off.
ndd>(Recall that '@' does not even have to compare equal to ((char) '@'), with
ndd>no use of wide characters at all.)
mm>Actually this is not true.  Nowhere in the standard does it say that
mm>the bits for the multibyte character constant 'c' must equal the bits
mm>for the wide character constant L'c'.  As long as mbtowc and wctomb do
mm>the appropriate translations, and the null byte gives a wchar_t value
mm>of all zeros, everything is standard conforming.

Section 4.1.5, page 99 lines 19-20:
ansi>each member of the basic character set defined in section 2.2.1 shall
ansi>have a code value equal to its value when used as the lone character
ansi>in an integer character constant.

c is in the basic character set defined in section 2.2.1.
L'c' has type wchar_t and must be equal to 'c'.
((char) L'c') must be equal to ((char) 'c'), which in turn must be equal
to 'c' again because c is in the basic character set.
Therefore, both of Mr. van der Poel's propositions are true.
I believe that Mr. Meissner is wrong.

In order to answer the more general case, I gave an example with @ instead
of c.  It is not necessary for '@' to equal L'@' and it is not necessary
for '@' to equal ((char) L'@').  The same non-necessities, of course, apply
to multibyte characters.
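
Restated as something a compiler can check, assuming the reading above
is right:

    #include <assert.h>

    int main(void)
    {
        /* 'c' is in the basic character set of section 2.2.1, so both of
           these must hold on any conforming implementation ... */
        assert('c' == L'c');
        assert('c' == (char) L'c');

        /* ... whereas nothing comparable is guaranteed for '@', which is
           not in the basic set. */
        return 0;
    }
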
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.

diamond@jit345.swstokyo.dec.com (Norman Diamond) (04/01/91)

In article <15640@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>In article <990@sranha.sra.co.jp> erik@sra.co.jp (Erik M. van der Poel) writes:
>>Which of the following two conditions is the correct interpretation of
>>the ANSI C standard:
>>	('c' == L'c')
>>	('c' == ((char) L'c'))
>Neither one, although the first one is close.  The numerical values
>of these two (possibly distinct) integer types shall be the same.

Huh?  When they have the same numerical value and both are integer types,
how is it possible for them to not compare equal?
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.

gwyn@smoke.brl.mil (Doug Gwyn) (04/01/91)

In article <1991Apr1.065249.25920@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
-In article <15640@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
->In article <990@sranha.sra.co.jp> erik@sra.co.jp (Erik M. van der Poel) writes:
->>Which of the following two conditions is the correct interpretation of
->>the ANSI C standard:
->>	('c' == L'c')
->>	('c' == ((char) L'c'))
->Neither one, although the first one is close.  The numerical values
->of these two (possibly distinct) integer types shall be the same.
-Huh?  When they have the same numerical value and both are integer types,
-how is it possible for them to not compare equal?

My point was that the correct interpretation of the standard encompasses
more than the specific example.  The example is implied by the standard's
specification, but not vice-versa.  Thus they are not equivalent.

keld@login.dkuug.dk (Keld J|rn Simonsen) (04/04/91)

gwyn@smoke.brl.mil (Doug Gwyn) writes as a reply to an article of mine:

>>ANSI C does not handle DIS 10646, JIS X 0208, GB 2312 and KSC 5601 
>>correctly. So ANSI C multibyte specifications *cannot* be used on any
>>multibyte de jure character set.

>I think you have mixed multibyte character sequences with wchar_t.
>They are NOT the same thing!  That is why there are interconversion
>functions specified in the C standard.

I wrote "multibyte" to cover multibyte character sets,  and the
support in ISO (ANSI) C for these character sets; this support
consists of the multibyte support functions and the widechar
support functions.

>The advice X3J11 received during development of this aspect of the
>standard, from such organizations as NTSCJ who have a major stake in
>so-called multibyte character encodings, was that the mechanisms in
>the C standard were adequate for this purpose.  Unless you can explain
>WHAT it is that you think is wrong, I suggest that your comments be
>ignored.  I haven't seen a significant technical argument against
>the "wide character" mechanisms in the C standard; what I have seen
>are misunderstandings.  Perhaps you should refer to P.J. Plauger's
>model standard C library implementation in his new book, to see what
>is actually involved in exploiting and implementing these facilities.

OK, some hard facts:

The character 'c' has the following encodings in these basic 16-bit
East-asian de jure character standards:

GB 2312-80 (basic Chinese 16-bit  standard)  /035/099
JIS X 0208 (basic Japanese 16-bit standard)  /035/099
KS C 5601  (basic Korean 16-bit standard)    /035/099

ISO DIS 10646 has the following value for 'c':  /032/032/032/099

None of these values have the nice property of having ASCII 'c'
extend into these values when loading as a 16-bit or 32-bit int.

Either this is a problem for all of the above character sets, or
it is not a problem at all. I hope the latter is true; then there
is no problem to fix. Unfortunately, quite a few knowledgeable people,
like Erik v.d. Poel and the SC2/WG2 people plus WG14, think there is a
problem, and they have not yet been able to solve it.

>>Also, the character standards should be the base standards, and
>>programming language standards should build on these and provide appropriate
>>functionality to cover the standard character sets.

>WHICH "character standard"?  There are so many to choose from, all
>of them botched in one way or another.  That is why programming
>language standards should be INDEPENDENT of any particular choice of
>character code set, rather than based on one choice that may not be
>appropriate for many of the potential users of the language.

Something we agree on, Doug!

> In the
>case of C, the only requirements on the basic source and execution
>character sets are that there be at least 96 distinct values, that
>the values assigned by the C implementation to represent digit glyphs
>be a contiguous ascending sequence, that there be three additional
>distinct values in the execution set, and that all the previously
>mentioned internal values be distinct from zero.  THE MAPPING BETWEEN
>INTERNAL CODE SET VALUES AND THE ENVIRONMENT CAN, AND SHOULD, BE
>DEFINED BY THE C IMPLEMENTATION.  Thus, a straight 6-bit external
>code set could not have a one-to-one correspondence between external
>"characters" and C source OR execution "character" values, and in
>such a system environment there would have to be at least one added
>convention for representing the full set of internal C characters,
>with support tools to facilitate working with such special text files.

I read this as having the consequence that conversions SHOULD be
applied to all widechar strings, for all characters in the C source
program and in every input and output file. This could be done in
mbstowcs() and the other mb/wc functions.

Thus the internal widechar representation of 'c' and the external
multibyte representation SHOULD not be the same for character sets
like ISO 10646, JIS X 0208, KS C 5601 and GB 2312.
At least this should hold for characters in the C character set.

This interpretation of the C standard sounds OK to me,
and solves the problem mentioned by Erik v.d. Poel.
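
In code, that interpretation amounts to something like the following;
the wide values that come out are whatever the current locale's mapping
says, not a bit copy of the external bytes:

    #include <stdio.h>
    #include <stdlib.h>
    #include <locale.h>

    int main(void)
    {
        const char *mb = "abc";         /* external, multibyte form */
        wchar_t wcs[8];
        size_t n;

        setlocale(LC_CTYPE, "");        /* select the native locale */

        /* external multibyte -> internal widechar: the locale's mapping */
        n = mbstowcs(wcs, mb, sizeof wcs / sizeof wcs[0]);
        if (n != (size_t) -1)
            printf("converted %lu wide character(s)\n", (unsigned long) n);
        return 0;
    }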

Doug continues to make ad hominem attacks on me (well, that is not the
first time he has done so). Some comments:

ANSI and DS (Danish Standards, the Danish equivalent to ANSI in ISO)
have disagreed on the necessity of a *readable* and *writeable* alternative
to a representation of C source in invariant ISO 646. Invariant ISO 646
is the same as ASCII with 12 positions left undefined - to be decided by
national ISO bodies. ANSI decided on the ASCII character set, a lot of
other ISO member bodies - especially in Europe - decided to use these
positions for national letters and the like. Then invariant ISO 646 is
the greatest common denominator for these character sets, all derived
from the international standard ISO 646. 

WG14 has supported DS in this opinion at several meetings, and also
SC22 has passed resolutions to require this support in the whole
area of programming languages. WG14 has passed resolutions of support
several times, most recently at the WG14 meeting in Copenhagen in Nov 1990.

>Indeed, X3J11 explained this to you in the third public review response
>document.  Judging from your continued pursuit of more obtrusive
>solutions for your own particular limited character set problem,
>it would appear that you either did not understand the X3J11 response
>or that, for reasons of your own, you wish to ignore it.  For purposes
>of documentation for those who have not seen the response X3J11 gave
>long ago, here it is:

>	In response to Letter #177, Doc. No. X3J11/88-134:

>	Summary of issue:
>		Proposal for more readable supplement to trigraphs.

>	X3J11 response:
>		The Committee discussed this proposal but decided
>	against it.

>		We cannot support this proposal for a number of reasons.
>	Trigraphs were intended to provide a universally portable format
>	for the *transmission* of C programs; they were never intended
>	to be used for day-to-day reading and writing of programs.

Here X3J11 is in disagreement with SC22 and SC22/WG14, which have both
passed resolutions on the desirability of an alternate representation
of C source for reading and writing.

>	Should it be necessary to do so, however, the preprocessor can
>	already be used to improve their readability (exact macro names
>	and definitions are not provided as the Committee prefers to
>	avoid stylistic issues).  As larger character sets become more
>	and more popular, the chances of having to deal with a
>	"deficient" character set become smaller and smaller.

True to a great extent. But this is not a technical argument for
why an alternate representation is impossible; rather, it is a means
of implementing one.

>		Conversion between the current trigraph representations
>	and "normal" representations can be done simply in a context-
>	free manner, but this is not possible with the proposed notation.

Well, it was not meant to be a context-free substitution...

>	Also, there are a number of difficulties with the infix subscript
>	operator where empty brackets would have been used.  Either the
>	operator must be allowed as a postfix unary operator as well as
>	a binary operator, or the grammar must be extended to allow empty
>	parentheses to appear in those contexts where empty brackets can.
>	Although these problems are by no means insurmountable, we feel
>	that the current trigraphs are adequate for their intended use
>	and that no further enhancements are necessary.

Here X3J11 admits that the technical problems are solvable ("by no
means insurmountable").

Instead the proposal is turned down for political reasons.

>		Translation phase 1 actually consists of two parts, first
>	the mapping (about which we say very little) from the external
>	host character set to the C source character set, then the
>	replacement of C source trigraph sequences with single C source
>	characters.  (Note that the C source characters represented in
>	our documents in Courier font need not appear graphically the
>	same in the host environment, although a reasonable
>	implementation will make them as nearly so as possible.)
>	The kind of mapping you propose can in fact be done in the first
>	part of translation phase 1, and several such "convenience"
>	mappings are already common practice.  However, attempting to
>	standardize this mapping is outside the scope of the C Standard,
>	since what is appropriate may depend on the capabilities of the
>	specific hardware, availability of fonts, and so forth.

This I read as just a story with irrelevant facts about fonts
(Courier etc.). What was proposed was support for a very central 
character set, namely invariant ISO 646.

>		Although the Committee regrets any "no" votes on either
>	the national or international proposed standards, we feel we
>	must represent our best judgement on technical issues.  We hope
>	you will reconsider your objection to the current specification.

This X3J11 judgement is not at all "on technical issues" - all technical
issues are admitted to be solvable! The decision is a political one.

>Note that your "trigraph alternative" proposals had been discussed many
>times in the standards committees, and still were resoundingly defeated
>during a joint X3J11/WG14 meeting. 

And other WG14 meetings have supported it, and asked ANSI X3J11 to 
implement it. X3J11 just ignored these ISO WG14 resolutions.

> The only reason this issue is still
>"on the table" for WG14 is that there was some political maneuvering at
>the SC22 level in the absence of anybody who could represent the actual
>issues and history, and SC22 mistakenly thought, on the basis of your
>argumentation, that there was a problem that needed to be solved, and
>thus directed that work toward a normative addendum to the ISO C
>standard begin to address this "problem".

I did not personally argue this in SC22, so others besides me must
have been convinced. Actually, X3J11 has been the only body not to
be convinced. And the same story holds for X3J11 on quite a few other
issues (such as the British comments).

>Later, the Japanese in
>particular thought that it would be appropriate to add more support for
>multibyte character sequences to the ISO C standard as part of this
>normative addendum.  Your original hobby horse had nothing to do with
>multibyte character sequences, and so far as I can determine, the
>Japanese have not found any problem with them other than the desire for
>more standard library functions to make their use more convenient.

OK, so my hobby horse has nothing to do with multibyte support.
So I cannot address issues other than my "hobby horse"?
I do happen to have a fair amount of knowledge of character sets
and their usage in programming languages and communications.

The reason why the Japanese have not seen the problem before with
JIS X 0208, but first with 10646, is beyond my understanding.
Maybe some Japanese could enlighten us (me!) on this?

>It is also worth noting that there is continuing discussion of this
>issue on the X3J11 and WG14 electronic mailing lists.

Both you and I participate in these discussions. I actually think
it would have been more appropriate for you to share your valuable
technical insight at an earlier stage in this discussion,
e.g. in November when SC2/WG2 made their first letter about 10646.
Oh well, better late than never. I am, as always, grateful for your
contributions (although I would prefer your tone to be more gentle:-)

>>I hope that this problem will be a historical one with the appearance
>>of 10646.

>Surely you should be able to see the possible problems with 10646?
>The very idea of using 32 bits to represent a character is bound to
>meet stiff opposition, particularly from users of small systems,
>who already have more efficient solutions to the "problem" of a
>diversity of alphabets.  It seems to me that 10646 is one of the
>technically worst character-set standards yet to be adopted.  No
>wonder there has been renewed interest in other standards such as
>"Unicode" (about which I know little at present other than that it
>has a broad base of industry support).

Now you are talking about things that you know very little of, Doug!

>One does not solve "people problems" by simply adopting a technical
>standard.  History provides much evidence of that.

>DISCLAIMER:  None of the above should be construed as an official X3J11
>position, not even the attempt to cite from an X3J11 document.  However,
>I believe that I have correctly represented the situation as I
>understand it.

Keld Simonsen

rja@altair.cho.ge.com (Randall Atkinson) (04/05/91)

[ The quoting here is getting confusing, so I'm putting Keld's remarks with
 the greater than symbol and Doug's remarks with the percent symbol. :-) rja ]

In article <keld.670719584@dkuugin> keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
gwyn@smoke.brl.mil (Doug Gwyn) writes as a reply to an article of Keld's:

>>>ANSI C does not handle DIS 10646, JIS X 0208, GB 2312 and KSC 5601 
>>>correctly. So ANSI C multibyte specifications *cannot* be used on any
>>>multibyte de jure character set.

% I think you have mixed multibyte character sequences with wchar_t.
% They are NOT the same thing!  That is why there are interconversion
% functions specified in the C standard.

>I wrote "multibyte" to cover multibyte character sets,  and the
>support in ISO (ANSI) C for these character sets; this support
>consists of the multibyte support functions and the widechar
>support functions.

I see no reason that the support provided in ANSI X3.159 is not
adequate.  Keld has presented no technical details to the contrary,
and it seems to me to be incumbent on Keld to explain in purely
technical terms why he feels it won't work.

I've worked on multi-lingual applications for several years and
during the mid 1980s I wrote a thesis on the whole area of 
Chinese/Japanese language support in computer systems, so this
is not an area that I am unfamiliar with.

% The advice X3J11 received during development of this aspect of the
% standard, from such organizations as NTSCJ who have a major stake in
% so-called multibyte character encodings, was that the mechanisms in
% the C standard were adequate for this purpose.  Unless you can explain
% WHAT it is that you think is wrong, I suggest that your comments be
% ignored.  I haven't seen a significant technical argument against
% the "wide character" mechanisms in the C standard; what I have seen
% are misunderstandings. 

There are a lot of misunderstandings and I suspect that they are at the
heart of the problem.

>OK, some hard facts:

>The character 'c' has the following encodings in these basic 16-bit
>East-asian de jure character standards:

>GB 2312-80 (basic Chinese 16-bit  standard)  /035/099
>JIS X 0208 (basic Japanese 16-bit standard)  /035/099
>KS C 5601  (basic Korean 16-bit standard)    /035/099

>ISO DIS 10646 has the following value for 'c':  /032/032/032/099

>None of these values have the nice property of having ASCII 'c'
>extend into these values when loading as a 16-bit or 32-bit int.

>Either this is a problem for all of the above character sets, or
>it is not a problem at all. I hope the latter is true; then there
>is no problem to fix. Unfortunately, quite a few knowledgeable people,
>like Erik v.d. Poel and the SC2/WG2 people plus WG14, think there is a
>problem, and they have not yet been able to solve it.

Alleging that someone else said they thought there was a problem isn't
nearly the same as stating an argument on technical grounds in "hard
facts".  I don't think it is a problem for any of the above (or in
support of the JIS C6220 and C6226 standards for Kanji and Kana for
that matter).

[ text deleted here for brevity ]

>This interpretation of the C standard sounds OK to me,
>and solves the problem mentioned by Erik v.d. Poel.

Here Keld appears to acknowledge that there is NO technical problem
with the standard.  But watch how he ignores this down below:

>ANSI and DS (Danish Standards, the Danish equivalent to ANSI in ISO)
>have disagreed on the necessity of a *readable* and *writeable* alternative
>to a representation of C source in invariant ISO 646. Invariant ISO 646
>is the same as ASCII with 12 positions left undefined - to be decided by
>national ISO bodies. ANSI decided on the ASCII character set, a lot of
>other ISO member bodies - especially in Europe - decided to use these
>positions for national letters and the like. Then invariant ISO 646 is
>the greatest common denominator for these character sets, all derived
>from the international standard ISO 646. 

However, all of western Europe is moving rapidly to the ISO 8859/1 
standard which has none of the ISO 646 problems at the source level
and moreover the trigraphs address the ISO 646 technical problem 
(which I emphasise is a temporary problem already starting to fade).

The POLITICAL issue is coming from the Danes, who feel that their
local character set standard based on ISO 646 should be the focus
of the whole world and that long-standing C practice should be
broken to make them feel better.

WG14 passed resolutions both ways, depending on who had spoken more
recently to the group, whether the history of the question at the
X3J11 level and the history of the C language was presented,
and also on some political grounds.  The Danes ignored the WG14 decisions
against them and then turned around and accused X3J11 of indifference
because of its insistence on sticking to technical merit rather than
political issues.

% Indeed, X3J11 explained this to you in the third public review response
% document.  Judging from your continued pursuit of more obtrusive
% solutions for your own particular limited character set problem,
% it would appear that you either did not understand the X3J11 response
% or that, for reasons of your own, you wish to ignore it.  

The continuing lack of a technical statement of the problem from the
Danes is amazing.  X3J11 addressed the technical issues adequately.

>This I read as just a story with irrelevant facts about fonts
>(Courier etc.). What was proposed was support for a very central 
>character set, namely invariant ISO 646.

ISO 646 is a character set whose use is rapidly diminishing and which
is supported adequately by the trigraph feature present in the
standard.  Keld keeps reiterating that there is a problem (except
above, which see) without specifying the problem technically.

% Note that your "trigraph alternative" proposals had been discussed many
% times in the standards committees, and still were resoundingly defeated
% during a joint X3J11/WG14 meeting. 

>And other WG14 meetings have supported it, and asked ANSI X3J11 to 
>implement it. X3J11 just ignored these ISO WG14 resolutions.

The Danes just ignored the X3J11 and ISO WG14 resolutions that
favor the existing solution.  There has NOT been a clear,
definitive position taken at the ISO level.  There has been a lot of
waffling for the reasons noted above.

% The only reason this issue is still "on the table" for WG14 is that
% there was some political maneuvering at the SC22 level in the absence
% of anybody who could represent the actual issues and history, and SC22
% mistakenly thought, on the basis of your argumentation, that there was
% a problem that needed to be solved, and thus directed that work toward
% a normative addendum to the ISO C standard begin to address this
% "problem".

Doug's comments here are essentially correct; there wasn't adequate
explanation at the ISO level of the issues and history of the
alleged ( but actually unstated in Keld's message ) problem.

>I did not personally argue this in SC22, so others besides me must
>have been convinced. Actually, X3J11 has been the only body not to
>be convinced. And the same story holds for X3J11 on quite a few other
>issues (such as the British comments).

No.  See the WG14 decision against the Danish proposal above.

>OK, so my hobby horse has nothing to do with multibyte support.

[ Note that his original posting really alleged a multibyte character 
  support problem that in fact he conceded isn't a problem at all ]

% It is also worth noting that there is continuing discussion of this
% issue on the X3J11 and WG14 electronic mailing lists.

>Both you and I participate in these discussions. I actually think
>it would have been more appropriate for you to share your valuable
>technical insight at an earlier stage in this discussion,
>e.g. in November when SC2/WG2 made their first letter about 10646.

>>>I hope that this problem will be a historical one with the appearance
>>>of 10646.

% Surely you should be able to see the possible problems with 10646?
% The very idea of using 32 bits to represent a character is bound to
% meet stiff opposition, particularly from users of small systems,
% who already have more efficient solutions to the "problem" of a
% diversity of alphabets.  It seems to me that 10646 is one of the
% technically worst character-set standards yet to be adopted.  No
% wonder there has been renewed interest in other standards such as
% "Unicode" (about which I know little at present other than that it
% has a broad base of industry support).

As it happens, there is some discussion of UNICODE and ISO DIS 10646
merging together, though I personally doubt it will come to pass.
Neither is UNICODE the solution yet because it doesn't even fully 
support languages with Romanised character sets (let alone all of
the non-Roman languages).  I think that the ISO 8859 family of 
compatible 8-bit standards will come into wide use long before any
of the 16-bit or 32-bit standards do for a number of technical
reasons and some political ones (but this isn't really germane
to the pseudo-problems raised about X3.159 :-).

I am not on any of the groups working on the C standard (neither
ISO nor ANSI nor any other group).  I have followed the multibyte
support in some detail because of my work in multilingual applications
development.  I don't see any problems and Doug is free to quote me
as one developer who thinks that the multilingual support is adequate
and doesn't want to see the Danish proposal accepted because it will
harm the standard.  It is annoying that Keld keeps posting these
vague allegations without clearly stating a technical (not political)
problem that is unresolved by the C standard.  If he would post
such a technical problem statement, it could then be discussed on
this list on its technical merits, if any.

Randall Atkinson
rja@edison.cho.ge.com

Comments are the author's and are not necessarily the opinions of
GE, Fanuc, or GE-Fanuc.

harkcom@spinach.pa.yokogawa.co.jp (04/05/91)

In article <keld.670719584@dkuugin> keld@login.dkuug.dk
   (Keld J|rn Simonsen) writes:

 =}JIS X 0208 (basic Japanese 16-bit standard)  /035/099

   JIS X 0208 doesn't cover the ASCII characters. It has a double
sized (zenkaku) English character set though. 'c' in all three of
the popular multibyte encodings (EUC, JIS, SJIS) is 0x63 (same as
ASCII). The most common wide character format (UJIS) has 'c' as
0x0063 (ASCII in 2 bytes).

   I don't know the encodings for the Chinese & Korean well, but the
standards don't seem to cover 'c'...

 =}None of these values have the nice property of having ASCII 'c'
 =}extend into these values when loading as a 16-bit or 32-bit int.

   See above...

 =}think there is a problem
 =}and they have not yet been able to solve it.

   A problem with ISO 10646? A problem with the 'East-asian de jure'
character sets in reference to wchar_t? 

 =}Thus the internal widechar representation of 'c' and the external
 =}multibyte representation SHOULD not be the same for character sets
 =}like ISO 10646, JIS X 0208, KS C 5601 and GB 2312.
 =}At least this should hold for characters in the C character set.

   Huh? This doesn't follow... It doesn't even sound correct. A single
byte wide character set using values above 0x80 in addition to the
ASCII characters would become difficult...

 =}The reason why the Japanese have not seen the problem before with
 =}JIS X 0208, but first with 10646, is beyond my understanding.
 =}Maybe some Japanese could enlighten us (me!) on this?

   What 'problem' do the 'Japanese' see with ISO 10646?

 =}>No wonder there has been renewed interest in other standards such as
 =}>"Unicode" (about which I know little at present other than that it
 =}>has a broad base of industry support).
 =}
 =}Now you are talking about things that you know very little of, Doug!

   Speaking of 'harsh tones'...

   Your apparent knowledge of the JIS standard shows you have little
room to point...

Al

keld@login.dkuug.dk (Keld J|rn Simonsen) (04/06/91)

harkcom@spinach.pa.yokogawa.co.jp writes:

>In article <keld.670719584@dkuugin> keld@login.dkuug.dk
>   (Keld J|rn Simonsen) writes:

> =}JIS X 0208 (basic Japanese 16-bit standard)  /035/099

>   JIS X 0208 doesn't cover the ASCII characters. It has a double
>sized (zenkaku) English character set though. 'c' in all three of
>the popular multibyte encodings (EUC, JIS, SJIS) is 0x63 (same as
>ASCII). The most common wide character format (UJIS) has 'c' as
>0x0063 (ASCII in 2 bytes).

I understand what Al is saying: that row 2 in the Japanese, Chinese
and Korean basic 16-bit character sets, which all contain what to
me looks like complete ASCII, is in fact not ASCII, but double-sized
English characters. When coding, at least in Japan, the programmer
usually combines the 16-bit character set with ASCII in an encoding
which is 8/16 bits (or 7/14 bits).

(Now I do not have great luck in saying what I think other people mean:-(

>   I don't know the encodings for the Chinese & Korean well, but the
>standards don't seem to cover 'c'...

I have my information from the ECMA registry of character sets,
and I really doubt that this information is incorrect or
that I have misread it.

> =}None of these values have the nice property of having ASCII 'c'
> =}extend into these values when loading as a 16-bit or 32-bit int.

>   See above...

My points still hold. You could have trouble handling widechar
characters in clean 16-bit de jure standards. Apparently people
out there don't program widechars in these character sets (true 16-bit),
but always combine them with other character sets.

> =}think there is a problem
> =}and they have not yet been able to solve it.

>   A problem with ISO 10646? A problem with the 'East-asian de jure'
>character sets in reference to wchar_t? 

WG14 has received a letter from SC2 pointing out an apparent problem with
10646: that the characters in the C repertoire in 10646 canonical form
are different from the sign-extended single-byte characters. I have been
given an action by WG14 to respond to SC2.

>   Your apparent knowledge of the JIS standard shows you have little
>room to point...
 
Well, my knowledge can always be improved. Still, the facts I have
presented on 16-bit character sets are true. They may be
irrelevant, as the usage is done in combination with other character sets
in an encoding. And the whole problem with using 10646 (and other multibyte
character sets) in widechar strings may be non-existent.
I really hope there is no problem; then we do not need to make
changes anywhere. But we should write some explanations of how this
is supposed to function, as quite a few people have had problems
with this. I think the best place to write such interpretations is in
the forthcoming ISO C addendum.

Keld Simonsen

keld@login.dkuug.dk (Keld J|rn Simonsen) (04/06/91)

rja@altair.cho.ge.com (Randall Atkinson) writes:

>[ The quoting here is getting confusing, so I'm putting Keld's remarks with
> the greater than symbol and Doug's remarks with the percent symbol. :-) rja ]

>In article <keld.670719584@dkuugin> keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>gwyn@smoke.brl.mil (Doug Gwyn) writes as a reply to an article of Keld's:

>I see no reason that the support provided in ANSI X3.159 is not
>adequate.  Keld has presented no technical details to the contrary,
>and it seems to me to be incumbent on Keld to explain in purely
>technical terms why he feels it won't work.

This is part of a longer discussion, originating with an article
by Erik v.d. Poel. Please refer to the series of articles if you need
more information. In short, Doug has said that the problem that Erik
raised was not a problem, and I tend to believe Doug.

>Alleging that someone else said they thought there was a problem isn't
>nearly the same as stating an argument on technical grounds in "hard
>facts".  I don't think it is a problem for any of the above (or in
>support of the JIS C6220 and C6226 standards for Kanji and Kana for
>that matter).

Is your opinion based on arguments similar to Doug's?
Or what is it based on? I think that solving Erik's (and SC2's)
problem with some interpretation wordings would be a good result
of this news discussion.

Keld Simonsen

peter@ficc.ferranti.com (Peter da Silva) (04/06/91)

Why can't the Danes define a character set, and preprocess it into trigraphs
for compilation? There's no reason the editing character set needs to match
the character set the compiler sees.
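
A minimal sketch of such a pre-pass, rewriting the nine non-invariant
ISO 646 characters as their trigraphs; a real tool would also have to
leave string contents and existing ?? sequences alone:

    #include <stdio.h>

    /* Map one of the nine non-invariant ISO 646 characters to its
       trigraph, or return NULL if it needs no rewriting. */
    static const char *trigraph(int c)
    {
        switch (c) {
        case '#':  return "??=";
        case '[':  return "??(";
        case '\\': return "??/";
        case ']':  return "??)";
        case '^':  return "??'";
        case '{':  return "??<";
        case '|':  return "??!";
        case '}':  return "??>";
        case '~':  return "??-";
        default:   return NULL;
        }
    }

    int main(void)
    {
        int c;
        const char *t;

        while ((c = getchar()) != EOF) {
            if ((t = trigraph(c)) != NULL)
                fputs(t, stdout);
            else
                putchar(c);
        }
        return 0;
    }
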
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

gwyn@smoke.brl.mil (Doug Gwyn) (04/06/91)

In article <RFIAQY1@xds13.ferranti.com> peter@ficc.ferranti.com (Peter da Silva) writes:
>Why can't the Danes define a character set, and preprocess it into trigraphs
>for compilation? There's no reason the editing character set needs to match
>the character set the compiler sees.

Exactly right.

steve@taumet.com (Stephen Clamage) (04/07/91)

peter@ficc.ferranti.com (Peter da Silva) writes:

>Why can't the Danes define a character set, and preprocess it into trigraphs
>for compilation? There's no reason the editing character set needs to match
>the character set the compiler sees.

I can't tell if Peter's tongue is in his cheek.  Apologies, if so.

This is the kind of attitude which annoys those in the world whose native
language is not English.  I am not in that category, but in working on
the ANSI C++ committee I have been made aware of the sensibilities of
the European members.

One member's name contains an umlaut (two horizontal dots above a vowel).
He asked us to imagine how it feels NEVER to be able to see your name
spelled correctly in any computer correspondence.  (I can't even provide
the example here.)

Another member asked how we would feel if, for example, the letters
'l' and 'r' would always be considered equivalent, and the letter 'f'
was forbidden.

[Anothel membel asked how we wourd eer i, ol exampre, the rettels
'r' and 'l' wourd arways be consideled equivarent, and the rettel ''
was olbidden.]

So let's turn Peter's question around:  Why can't the Americans use
a preprocessor to convert ASCII source into some international
character set before compiling?  (I don't advocate this, but it seems
like an equally fair question.)
-- 

Steve Clamage, TauMetric Corp, steve@taumet.com

randall@Virginia.EDU (Randall Atkinson) (04/08/91)

In article <661@taumet.com> steve@taumet.com (Stephen Clamage) writes:

>This is the kind of attitude which annoys those in the world whose native
>language is not English.  I am not in that category, but in working on
>the ANSI C++ committee I have been made aware of the sensibilities of
>the European members.
>
>One member's name contains an umlaut (two horizontal dots above a vowel).
>He asked us to imagine how it feels NEVER to be able to see your name
>spelled correctly in any computer correspondence.  (I can't even provide
>the example here.)
>
>Another member asked how we would feel if, for example, the letters
>'l' and 'r' would always be considered equivalent, and the letter 'f'
>was forbidden.
>
>[Anothel membel asked how we wourd eer i, ol exampre, the rettels
>'r' and 'l' wourd arways be consideled equivarent, and the rettel ''
>was olbidden.]

This is a problem with character sets and has been FULLY RESOLVED
by the ISO with the creation of the ISO 8859/* family of 8-bit
character set standards.  ISO 8859/1 supports all national languages
used in Western Europe including umlauts and accents and tildes
and everything.  The problem is with the terminal and OS 
implementations not with any programming language.  There is NO
REASON to force the language to try to resolve a terminal/OS issue.

>So let's turn Peter's question around:  Why can't the Americans use
>a preprocessor to convert ASCII source into some international
>character set before compiling?  (I don't advocate this, but it seems
>like an equally fair question.)

ASCII is a proper subset of ISO 8859/1, so forcing US ASCII source to be
preprocessed as suggested above before compiling would mean that ALL
EUROPEAN users who are migrating to the ISO 8859 character set
standards would also have to preprocess FOREVER.  By contrast,
the ISO 646 7-bit national-variant code standards are DEPRECATED by the
arrival of the ISO 8859 family, so those folks using ISO 646
would be using preprocessing temporarily and eventually would
be able to stop using it as their equipment came to support ISO 8859;
the preprocessing requirement would be TEMPORARY and TRANSITIONAL
rather than PERMANENT.

Note that even Keld acknowledges that ISO 646 is "fading" in favor
of ISO 8859/*, and hence the problem is also going away.

Randall Atkinson

gwyn@smoke.brl.mil (Doug Gwyn) (04/08/91)

In article <661@taumet.com>, steve@taumet.com (Stephen Clamage) writes:
> This is the kind of attitude which annoys those in the world whose native
> language is not English.

Pardon me for pointing this out, but the issue has nothing to do with
English vs. other languages, even if some people choose to take it
that way.  The issue is, how does one guarantee portability of source
code for C programs.  One important aspect of that is the requirement
for a certain SUFFICIENT set of source code "characters", which need
not be taken from ANY human language (witness APL).  However, due to
the universal support for a certain set of characters on all modern
computer systems, that was selected as the minimum requirement.  The
conversion of trigraphs (expressed in this universal character set) to
internal code values for a handful of special characters that are not
universally supported is simply the minimum mapping necessary to
ensure source code portability among all standard-conforming sites.
The C standard definitely allows for use of additional native
characters in strings, comments, etc. (use in an identifier would
have to generate at least one diagnostic, but otherwise is permitted).
However, programs that make use of such extensions are clearly not
going to be portable.
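For illustration (this is just a sketch, not text from the standard), a
small program can be written using only characters that survive the ISO
646 national variants, letting that phase-1 trigraph mapping supply #,
[, ], {, }, |, and backslash:

    ??=include <stdio.h>

    /* In translation phase 1 the trigraphs ??= ??( ??) ??< ??> ??! and
       ??/ become #, [, ], {, }, |, and backslash, so this compiles on
       any conforming implementation even if the physical source file
       cannot hold those characters directly. */
    int main(void)
    ??<
        char msg??(??) = "hello, world??/n";
        int ok = 1 ??!??! 0;              /* ??!??! is || */

        if (ok)
            printf("%s", msg);
        return 0;
    ??>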

If we had known how confused people would get over character set
issues, perhaps X3J11 would have used totally weird glyphs for the
members of the source character set represented by Courier font in
the standard.  Then EVERYone would be forced to understand the
external physical -to- internal source set mapping that is ALWAYS
applied as the first part of translation phase 1 by ALL conforming
implementations.  It happens that in most environments (apparently
not including some in Denmark), an implementation can perform that
mapping quite trivially, which is perhaps why many people don't
seem to appreciate that a mapping is nevertheless being applied.

diamond@jit345.swstokyo.dec.com (Norman Diamond) (04/08/91)

In article <15737@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>In article <RFIAQY1@xds13.ferranti.com> peter@ficc.ferranti.com (Peter da Silva) writes:
>>Why can't the Danes define a character set, and preprocess it into trigraphs
>>for compilation? There's no reason the editing character set needs to match
>>the character set the compiler sees.
>
>Exactly right.

For exactly the same reason that Americans weren't told to write a
preprocessor to convert C into Algol.  Some people believe that an
international C standard might be as useful as an international Algol
standard, so it would be nice if C could be WRITTEN and READ in a
standard way, independently of transmission and other issues.

(Some of these people originally misinterpreted the purpose of trigraphs,
but have figured out their error.  Some of their opponents, who believe
that the C language should differ from country to country, misinterpreted
the purpose of the Danish proposal and have yet to understand their error.)
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.

diamond@jit345.swstokyo.dec.com (Norman Diamond) (04/08/91)

In article <661@taumet.com> steve@taumet.com (Stephen Clamage) writes:

>One member's name contains an umlaut (two horizontal dots above a vowel).
>He asked us to imagine how it feels NEVER to be able to see your name
>spelled correctly in any computer correspondence.  (I can't even provide
>the example here.)

This might be a bit excessive.  Surely the member's name is spelled
correctly in correspondence from his home country.  Japanese people's
names are usually spelled correctly in correspondence from Japan
(except when I'm writing them) but not in correspondence from Europe.

One issue is to make a programming language readable and writable
in a standard manner internationally, which means using some common
denominator in the character set and human readability factors.
This might be a wise idea.

The other issue is to force every computer in every country to support
all characters that are used in all the world's languages.  I think
this would be excessive.  Anyway, it is different from the other issue.
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.

gnu@hoptoad.uucp (John Gilmore) (04/10/91)

peter@ficc.ferranti.com (Peter da Silva) writes:
>Why can't the Danes define a character set, and preprocess it into trigraphs
>for compilation? There's no reason the editing character set needs to match
>the character set the compiler sees.

They don't even need to preprocess it, they can just use a compiler that
handles that input character encoding.  Nothing in the standard prohibits this!
If they use characters outside the standard C character set, their code
won't be portable to other character encodings, but what else is new?

steve@taumet.com (Stephen Clamage) wrote:
> This is the kind of attitude which annoys those in the world whose native
> language is not English...

Americans are not boors or uncivilized because they invent programming
languages that use the entire ASCII character set.  Not guilty!

> One member's name contains an umlaut (two horizontal dots above a vowel).
> He asked us to imagine how it feels NEVER to be able to see your name
> spelled correctly in any computer correspondence.

Why doesn't he get a better computer?  Surely some local company sells
machines that include umlauted letters.  I correspond with a Swede
whose name is Torbj|rn; it looks funny in the States (a vertical bar in
the middle of the name), but should look fine on his computer in
Sweden.

And my windows all use ISO Latin 1.  If Torbj|rn sent the
umlauted letter in that standardized character set, it would look right
both in the States and in Sweden.

> Another member asked how we would feel if, for example, the letters
> 'l' and 'r' would always be considered equivalent, and the letter 'f'
> was forbidden.

Why don't we stick to discussions of the C language rather than generic
character set guilt trips?  So far, NOBODY has proposed a way to change
ANSI/ISO C so that the full local character set could be used in
identifier names in portable programs.  So any language that adds some
alphabetics beyond ASCII's is going to have some words that just can't
appear in portable C programs.  Note I'm saying PORTABLE; on your
local compiler you can do what you want.

> So let's turn Peter's question around:  Why can't the Americans use
> a preprocessor to convert ASCII source into some international
> character set before compiling?

Americans *can* use a preprocessor to convert ASCII source into some
international character set before compiling.  What's the point?

Compiler vendors are free to choose whatever input character set and
encoding they want to implement -- including ASCII, Danish local
characters, JIS, ISO Latin 1, or others, as long as it contains the
required character set specified in the C standard.  All of these do.
What is your complaint?
-- 
John Gilmore   {sun,uunet,pyramid}!hoptoad!gnu   gnu@toad.com   gnu@cygnus.com
*  Truth :  the most deadly weapon ever discovered by humanity. Capable of   *
*  destroying entire perceptual sets, cultures, and realities. Outlawed by   *
*  all governments everywhere. Possession is normally punishable by death.   *
*      ..{amdahl|decwrl|octopus|pyramid|ucbvax}!avsd!childers@tycho          *

erik@srava.sra.co.jp (Erik M. van der Poel) (04/10/91)

Sorry, I'm a bit late with this reply. Just a few minor nits:


Al Harkcom writes:
> 'c' in all three of
> the popular multibyte encodings (EUC, JIS, SJIS) is 0x63 (same as
> ASCII). The most common wide character format (UJIS) has 'c' as
> 0x0063 (ASCII in 2 bytes).

EUC is the name of the scheme, while UJIS is the name of the Japanese
EUC. UJIS is not a wchar_t encoding.


>  Keld Simonsen writes:
>  =}Thus the internal widechar representation of 'c' and the external
>  =}multibyte representation SHOULD not be the same for character sets
>  =}like ISO 10646, JIS X 0208, KS C 5601 and GB 2312.
>  =}At least this should hold for characters in the C character set.
> 
>    Huh? This doesn't follow... It doesn't even sound correct. A single
> byte wide character set using values above 0x80 in addition to the
> ASCII characters would become difficult...

You're probably referring to the European characters with the 8th bit
up. These are not relevant in this discussion since the ANSI C wchar_t
spec explicitly refers to the basic character set, which does not
include these European characters.


>  =}The reason why the Japanese have not seen the problem before with
>  =}JIS X 0208, but first with 10646, is beyond my understanding.
>  =}Maybe some Japanese could enlighten us (me!) on this?
> 
>    What 'problem' do the 'Japanese' see with ISO 10646?

Keld is referring to the problem that I brought up in the first
article in this thread. I.e. 10646 'c' does not have the same numeric
value as ASCII 'c'.
-
-- 
Erik M. van der Poel                                      erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan     TEL +81-3-3234-2692

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (04/10/91)

In article <1107@sranha.sra.co.jp>
	erik@srava.sra.co.jp (Erik M. van der Poel) writes:

>Keld is referring to the problem that I brought up in the first
>article in this thread. I.e. 10646 'c' does not have the same numeric
>value as ASCII 'c'.

It is very strange that an international character code standard should be
affected by the C standard.

If the C standard wants (wchar_t)'c' == 'c', it can do so simply by ignoring
10646.  Currently, the C standard has nothing to do with 10646.

If the C standard wants to incorporate 10646, it may:

	1) define standard way to convert 10646 to wchar_t
or
	2) loosen the requirement of wchar_t  and provide conversion
	   functions or macros (such as isascii())
or
	3) introduce a new character type (say, is10646char_t :-) )
	   whose semantics strictly follows 10646 with appropriate
	   conversion functions or macros
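
Options (1) and (2) both amount to supplying a pair of conversions
between wchar_t and a 10646 code value.  A sketch of that shape (the
names ucs4_t, wctoucs, and ucstowc and the identity stubs below are
placeholders, not anything actually proposed):

    #include <stddef.h>
    #include <stdio.h>

    typedef unsigned long ucs4_t;   /* placeholder type for a 10646 value */

    /* Identity stubs, purely for illustration; a real implementation
       would apply whatever wchar_t <-> 10646 translation gets
       standardized. */
    static wchar_t ucstowc(ucs4_t u)  { return (wchar_t)u; }
    static ucs4_t  wctoucs(wchar_t w) { return (ucs4_t)w; }

    int main(void)
    {
        wchar_t w = L'c';

        printf("round trip preserved: %d\n", ucstowc(wctoucs(w)) == w);
        return 0;
    }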

							Masataka Ohta

erik@srava.sra.co.jp (Erik M. van der Poel) (04/11/91)

Masataka Ohta writes:
> Erik M. van der Poel writes:
> >Keld is referring to the problem that I brought up in the first
> >article in this thread. I.e. 10646 'c' does not have the same numeric
> >value as ASCII 'c'.
> 
> It is very strange that an international character code standard should be
> affected by the C standard.

I never said that we should change 10646 (for wchar_t).


> If the C standard wants (wchar_t)'c' == 'c'

Wrong. L'c' must be numerically equivalent to 'c'.


> If the C standard wants [L'c' equals 'c'], it can do so simply by ignoring
> 10646.  Currently, the C standard has nothing to do with 10646.

Yes, this is what I've been saying all along. Have you read any of the
other articles in this thread?


> If the C standard wants to incorporate 10646, it may:
> 
> 	1) define standard way to convert 10646 to wchar_t

Yes, this is exactly what I want. Either in an ISO C addendum, or in a
10646 normative annex, or in a separate International Standard, as
long as it is published at around the same time as IS 10646.


> or
> 	2) loosen the requirement of wchar_t  and provide conversion
> 	   functions or macros (such as isascii())

The point is that I don't want to change ANSI/ISO C. Unnecessary
changes at this late stage may confuse implementors and users.


> or
> 	3) introduce a new character type (say, is10646char_t :-) )
> 	   whose semantics strictly follows 10646 with appropriate
> 	   conversion functions or macros

Aren't we trying to achieve codeset independence?
-
-- 
Erik M. van der Poel                                      erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan     TEL +81-3-3234-2692

harkcom@spinach.pa.yokogawa.co.jp (04/11/91)

In article <1107@sranha.sra.co.jp> erik@srava.sra.co.jp
   (Erik M. van der Poel) writes:

 =}EUC is the name of the scheme, while UJIS is the name of the Japanese
 =}EUC. UJIS is not a wchar_t encoding.

   Though the term EUC is used as the name of an encoding scheme, it is
also the name used for the multibyte encoding of the JIS standard using
SS2 and SS3 single shifts. UJIS is the name used to refer to the 2-byte
encoding of the JIS standard under the EUC scheme. The 2-byte (4-byte on
HP) wide character encodings for Japanese are usually UJIS...

 =}You're probably referring to the European characters with the 8th bit
 =}up. These are not relevant in this discussion since the ANSI C wchar_t
 =}spec explicitly refers to the basic character set, which does not
 =}include these European characters.

   But my point was that if you have a single-byte wide character set using
all 255 characters, it would be a disability to require that the multibyte
encoding and the wide character encoding be unequal. This does seem to be
relevant to this discussion...

 =}Keld is referring to the problem that I brought up in the first
 =}article in this thread. I.e. 10646 'c' does not have the same numeric
 =}value as ASCII 'c'.

   OK, let me get this straight. The numeric value of multibyte 'c' does
not have to equal the numeric value of wide character 'c' under ISO 10646.
You feel that this is a problem because you then become unable to use
such things as:  ('c' == L'c') or ('c' == ((char )L'c'))... Or, in
other words, comparisons of ASCII characters in the mb format with the
equivalent in the wc format can not be done so simply. My question is,
is it so important to be able to do such comparisons that we should limit
the encodings allowed for wide characters? The comparisons of mb to mb and
wc to wc are legit either way...
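
   For what it's worth, such a comparison can still be written without
relying on 'c' == L'c', by letting mbtowc() apply the implementation's
mb-to-wc mapping and comparing in the wide domain (a minimal sketch,
with error handling kept to a minimum):

    #include <stdlib.h>
    #include <stdio.h>

    int main(void)
    {
        wchar_t wc;

        /* mbtowc() performs the implementation-defined multibyte-to-wide
           conversion, so this works whether or not L'c' happens to have
           the same bit pattern as 'c'. */
        if (mbtowc(&wc, "c", 1) > 0 && wc == L'c')
            printf("mb 'c' converts to L'c' here\n");
        else
            printf("the codes differ (or the conversion failed)\n");
        return 0;
    }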

Al

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (04/11/91)

In article <1117@sranha.sra.co.jp>
	erik@srava.sra.co.jp (Erik M. van der Poel) writes:

>> If the C standard wants [L'c' equals 'c'], it can do so simply by ignoring
>> 10646.  Currently, the C standard has nothing to do with 10646.

>Yes, this is what I've been saying all along. Have you read any of the
>other articles in this thread?

I have been reading the thread, and I felt that the point had become fuzzy.

So, I made it clear.

>> 	1) define standard way to convert 10646 to wchar_t

>Yes, this is exactly what I want. Either in an ISO C addendum, or in a
>10646 normative annex, or in a separate International Standard, as
>long as it is published at around the same time as IS 10646.

>Aren't we trying to achieve codeset independence?

How can you be codeset independent by having an ISO C addendum about
10646?

>The point is that I don't want to change ANSI/ISO C. Unnecessary
>changes at this late stage may confuse implementors and users.

Why not, if it is necessary?

						Masataka Ohta

diamond@jit345.swstokyo.dec.com (Norman Diamond) (04/11/91)

In article <78@titccy.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:

>How can you be codeset independent by having an ISO C addendum about 10646?

The same way you can be word-size independent and still specify minimum
values.  Well, almost the same way.  It is still worth standardizing a
translation scheme to be used WHEN the character set is 10646.  (It would
also be worth standardizing a translation scheme to be used when the
character set is the invariant subset of 646.)
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.

mleisher@nmsu.edu (Mark Leisher) (04/11/91)

While we're on the subject of wchar_t, has anyone experimented with
implementing a libc sort of setup for wchar_t I/O?  Any suggestions on
approaches?

Also while we're at it, I've written a library that supports ctype-like
operations on the Big5 (Chinese:Taiwan/Hong Kong), GB 2312-1980,
JIS X0208-1983, and KS C5601-1987 codesets.  It's messy, but it seems
to work fine.  Also supported are string functions specifically for
the wchar_t type.  I wrote the library because I cannot for the life
of me get the LC_CTYPE stuff on our Suns to cooperate for mbcs tables
:-)
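
As a rough idea of the kind of test involved (the name and ranges below
are illustrative only, not the library's actual interface), a
classification function for a wchar_t layout that simply holds the two
EUC bytes can be little more than a range check:

    #include <stddef.h>
    #include <stdio.h>

    /* Illustrative only: report whether a wide character lies in the
       JIS X 0208 area of a raw-EUC wchar_t layout, where both bytes of
       such a character fall in the range 0xa1..0xfe. */
    static int iswjis0208(wchar_t wc)
    {
        unsigned int hi = ((unsigned int)wc >> 8) & 0xff;
        unsigned int lo = (unsigned int)wc & 0xff;

        return hi >= 0xa1 && hi <= 0xfe && lo >= 0xa1 && lo <= 0xfe;
    }

    int main(void)
    {
        printf("%d %d\n",
               iswjis0208((wchar_t)0xa4a2),   /* hiragana 'a' in raw EUC */
               iswjis0208(L'c'));             /* plain ASCII 'c'         */
        return 0;
    }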

Over and above the standard ctype sorts of things, I've added a number
of other personally useful macros and functions.  Please feel free to
do whatever you want to the library (site:file [IP] below).

crl.nmsu.edu:pub/misc/wcslib.tar.Z  [128.123.1.14]

Almost forgot: I don't really have any documentation except for
code-level comments (quite a few) and a README file.

-----------------------------------------------------------------------------
mleisher@nmsu.edu                      "I laughed.
Mark Leisher                                I cried.
Computing Research Lab                          I fell down.
New Mexico State University                        It changed my life."
Las Cruces, NM                     - Rich [Cowboy Feng's Space Bar and Grille]

harkcom@spinach.pa.yokogawa.co.jp (04/12/91)

In article <MLEISHER.91Apr11060555@thrinakia.nmsu.edu> mleisher@nmsu.edu
   (Mark Leisher) writes:

 =}While we're on the subject of wchar_t, has anyone experimented with
 =}implementing a libc sort of setup for wchar_t I/O?  Any suggestions on
 =}approaches?

   SVR4 contains routines for I/O using wchar_t in both 16-bit and 32-bit
sizes. Details can be found in the "Multi-National Language Supplement".

 =}Also supported are string functions specifically for
 =}the wchar_t type.  I wrote the library because I cannot for the life
 =}of me get the LC_CTYPE stuff on our Suns to cooperate for mbcs tables

   The wstrings.c file in kinput also contains routines for handling
wide characters. But it has a lot of error-checking stuff that can be
chucked...

   The problem with using such libraries is that they can only be used
when the encoding of the input matches the one the library uses. The
object of using locales is to remove such dependencies...

Al

enag@ifi.uio.no (Erik Naggum) (04/12/91)

In article <1991Apr8.011657.1780@tkou02.enet.dec.com> diamond@jit345.swstokyo.dec.com (Norman Diamond) writes:

   (Some of these people originally misinterpreted the purpose of
   trigraphs, but have figured out their error.  Some of their
   opponents, who believe that the C language should differ from
   country to country, misinterpreted the purpose of the Danish
   proposal and have yet to understand their error.)

You mean, instead of [, \, ], {, |, and } looking funny in Denmark on
old terminals, all the C code in the world is going to lack the
visually appealing brackets and braces?  Very clever.

Please note my country of origin: This is Norway, in Europe.  I'm a
programmer, and I'm OK, no, that wasn't what I intended to say.  I'm a
programmer, and I think my nation's characters in C code are about the
grossest thing that ever happened.  I'm sure Keld is feeling the same
way.  However, the difference in approach between Keld and yours truly
is tremendous:

	Keld favors changing the world to accommodate his old terminal.

	I favor changing my terminal to accommodate ISO 8859-1 (Latin 1).

(Then I'm working to get ISO Latin 1 allowed all over the place,
instead, but at least I can allow myself to think it's a step
forward.)

I maintain that the whole problem is one of (bad) display equipment.

--
[Erik Naggum]					     <enag@ifi.uio.no>
Naggum Software, Oslo, Norway			   <erik@naggum.uu.no>

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (04/12/91)

In article <1991Apr11.093836.10553@tkou02.enet.dec.com>
	diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:

>>How can you be codeset independent by having an ISO C addendum about 10646?

>The same way you can be word-size independent and still specify minimum
>values.

Then, in the same way, we can introduce a new type, say, is10646char_t, without
losing codeset independence. That is, if standard translation functions
between 10646 and other code systems are provided, the entire system
is codeset independent in the same manner.

>Well, almost the same way.  It is still worth standardizing a
>translation scheme to be used WHEN the character set is 10646.

There is a problem when a system with a 16-bit wchar_t wants to use 10646.

Is it required for a strictly conforming system to follow the scheme
WHENEVER the character set is 10646?

If so, it is somewhat codeset independent. If not, there will be
systems using 10646 in a different way from the standardized scheme,
which means that the standard is not a standard.

						Masataka Ohta

erik@srava.sra.co.jp (Erik M. van der Poel) (04/12/91)

I'm directing followups to comp.std.internat. I apologize to
comp.std.c readers for the current noise level, which I seem to have
started.


Al Harkcom writes:
>    Though the term EUC is used as the name of an encoding scheme, it is
> also the name used for the multibyte encoding of the JIS standard using
> SS2 and SS3 single shifts.

Yes, people often say "EUC" when they mean "Japanese EUC". That
doesn't mean that they are right. Think of it this way: EUC is the
generic international `class', while UJIS is a name for the particular
Japanese `instance'.

Also, you refer to "the JIS standard". This is rather misleading,
since several implementations use *two* JIS standards, namely JIS X
0208 (Kanji, etc) and the right-hand part of JIS X 0201 (`half-sized'
Katakana, etc).


> UJIS is the name used to refer to the 2-byte
> encoding of the JIS standard under the EUC scheme. The 2-byte (4-byte on
> HP) wide character encodings for Japanese are usually UJIS...

Perhaps we're getting confused because we are looking at different
documents. I got my information from a paper by Yasushi Nakahara,
"Nihongo Koodo No Genjo To Mondaiten" (roughly, "The Current State and
Problems of Japanese Character Codes"), Jan. 1988. In this paper, he
says that UJIS was the name that the Sigma project gave to a Japanese
usage of EUC. He refers to codesets 1, 2 and 3 (i.e. not only 0208
Kanji, etc).

According to this paper, UJIS is not a 2 byte code. It is an encoding
in which characters require 1, 2 or 3 bytes each. I.e. it is an mb
code, definitely not a wc code.
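
To make that 1-, 2-, or 3-byte structure concrete, here is a sketch
(assuming the usual codeset layout: ASCII, JIS X 0208, SS2-prefixed JIS
X 0201 kana, and an SS3-prefixed codeset 3) that reads a character's
length off its first byte:

    #include <stdio.h>

    /* Sketch: number of bytes occupied by the Japanese EUC (UJIS)
       character whose first byte is given.  Codeset 0 is ASCII,
       codeset 1 is JIS X 0208 (two bytes, each 0xa1..0xfe), codeset 2
       is SS2 (0x8e) plus one byte, codeset 3 is SS3 (0x8f) plus two
       bytes. */
    static int ujis_char_len(unsigned char first)
    {
        if (first == 0x8f)
            return 3;                        /* SS3: codeset 3           */
        if (first == 0x8e)
            return 2;                        /* SS2: codeset 2 (kana)    */
        if (first >= 0xa1 && first <= 0xfe)
            return 2;                        /* codeset 1: JIS X 0208    */
        return 1;                            /* codeset 0: ASCII         */
    }

    int main(void)
    {
        printf("%d %d %d %d\n",
               ujis_char_len(0x63),          /* 'c'                -> 1 */
               ujis_char_len(0xa4),          /* JIS X 0208 lead    -> 2 */
               ujis_char_len(0x8e),          /* half-width kana    -> 2 */
               ujis_char_len(0x8f));         /* codeset 3          -> 3 */
        return 0;
    }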
-
-- 
Erik M. van der Poel                                      erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan     TEL +81-3-3234-2692

erik@srava.sra.co.jp (Erik M. van der Poel) (04/12/91)

Al Harkcom writes:
>    But my point was that if you have a single-byte wide character set using
> all 255 characters, it would be a disability to require that the multibyte
> encoding and the wide character encoding be unequal. This does seem to be
> relevant to this discussion...

You're absolutely right, it is *relevant*. What I meant to say was
that the C standard does not *require* characters other than those in
the basic character set to follow the rule L'c' equals 'c'.


>    The numeric value of multibyte 'c' does
> not have to equal the numeric value of wide character 'c' under ISO 10646.
> You feel that this is a problem because you then become unable to use
> such things as:  ('c' == L'c') or ('c' == ((char )L'c'))...

In a previous article, I already stated that I don't really care how
we use 10646 in wchar_t, as long as everybody does it the same way.
-
-- 
Erik M. van der Poel                                      erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan     TEL +81-3-3234-2692

diamond@jit345.swstokyo.dec.com (Norman Diamond) (04/15/91)

In article <ENAG.91Apr12042249@maud.ifi.uio.no> enag@ifi.uio.no (Erik Naggum) writes:
>In article <1991Apr8.011657.1780@tkou02.enet.dec.com> diamond@jit345.swstokyo.dec.com (Norman Diamond) writes:
>   (Some of these people originally misinterpreted the purpose of
>   trigraphs, but have figured out their error.  Some of their
>   opponents, who believe that the C language should differ from
>   country to country, misinterpreted the purpose of the Danish
>   proposal and have yet to understand their error.)
>You mean, instead of [, \, ], {, |, and } looking funny in Denmark on
>old terminals, all the C code in the world is going to lack the
>visually appealing brackets and braces?  Very clever.

No.  There should be two visually moderately appealing methods, one using
the existing tokens and one using readable, writable combinations.  IBM
once made .. equivalent to : and ., equivalent to ; in one of their languages,
but users did not have to use .. and ., unless their keypunches or printers
made it convenient.  But they could carry their card decks to machines that
had ; and : on their printers, still have .. and ., printed that way, but
still understandable to the compiler.  I once used an editor where the
combination '( was equivalent to { and ') was equivalent to }.  The use
of ' was a poor choice, and that vendor chose \ to escape things in their
newer software (such as a new operating system and programming language,
guess which ones).
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.

gwyn@smoke.brl.mil (Doug Gwyn) (04/16/91)

In article <1107@sranha.sra.co.jp> erik@srava.sra.co.jp (Erik M. van der Poel) writes:
>I.e. 10646 'c' does not have the same numeric value as ASCII 'c'.

But it doesn't need to, so long as the compiler translates both L'c'
and 'c' from the multibyte (10646) characters into the same numeric
code values for the resulting char-piece-of-an-int and wchar_t, and
the stdio functions when dealing with text streams provide a similar
mapping (to make the external forms of text streams maximally useful).
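
As a sketch of the output half of that mapping (not a claim about what
any particular implementation does), wctomb() converts a wchar_t back
to the external multibyte form before the bytes are written:

    #include <limits.h>
    #include <stdlib.h>
    #include <stdio.h>

    int main(void)
    {
        char buf[MB_LEN_MAX + 1];
        int n;

        /* wctomb() applies the implementation's wide-to-multibyte
           mapping, the mirror image of what the compiler does when it
           translates L'c' and 'c' from the source character set. */
        n = wctomb(buf, L'c');
        if (n > 0) {
            buf[n] = '\0';
            printf("L'c' is written out as the multibyte string \"%s\"\n",
                   buf);
        }
        return 0;
    }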

There is certainly no requirement in the C standard that 'c' be
represented using the ASCII code set.

harkcom@spinach.pa.yokogawa.co.jp (Alton Harkcom) (04/23/91)

   I've been asked to clarify this, and I'm tired of writing notes
saying "please see comp.std.internat", so I'm reposting my reply to
Mr. Van der Poel's post here. Sorry for wasting the time of those
of you to whom this is old news...

--------------------Repost Alert-----------------------------------------------
In article <1130@sranha.sra.co.jp> erik@srava.sra.co.jp
   (Erik M. van der Poel) writes:

 =}Also, you refer to "the JIS standard". This is rather misleading,
 =}since several implementations use *two* JIS standards, namely JIS X
 =}0208 (Kanji, etc) and the right-hand part of JIS X 0201 (`half-sized'
 =}Katakana, etc).

   Actually, the 3 popular codesets are the JIS standards 0201, 0208, and
0212. JIS X 0212 is a set of additional kanzi.

 =}Perhaps we're getting confused because we are looking at different
 =}documents.
 =} [...]
 =}He refers to codesets 1, 2 and 3 (i.e. not only 0208
 =}Kanji, etc).

   Yes, I'm looking at the documentation from various software packages
which use the UJIS encoding. They refer to four code sets:
   G0:	ASCII
   G1:	KANZI	(JIS X 0208)
   G2:	HANKAKU	(JIS X 0201)
   G3:	GAIZI
All four code sets are 16 bits wide.

 =}According to this paper, UJIS is not a 2 byte code. It is an encoding
 =}in which characters require 1, 2 or 3 bytes each. I.e. it is an mb
 =}code, definitely not a wc code.

   I hate to disagree, but all of the implementations I have seen which
use an mb encoding refer to the Japanese EUC as EUC, while the wc
encodings refer to it as UJIS (except of course HP, which refers to both
as UJIS).

Al
-------------------------------------------------------------------------------