[comp.std.internat] Hex escape for quoted multibyte character

kuro%shochu@Sun.COM (Teruhiko Kurosaka - Sun Intercon) (04/26/89)

I have a question about the relationship among three new concepts and
notations introduced by the ANSI C draft: multibyte characters,
wide characters, and hexadecimal escape notation.

For the following discussion, let's assume that a character X is a
multibyte character represented on some system by the three-byte
sequence 0x8e 0xab 0xcd.

The first question I have is how to represent this three-byte character
with hexadecimal escape sequences within a double-quoted string.
The draft (12/7/88 p.30 line 14) says:
	The hexadecimal digits that follow the backslash and the letter x in
	a hexadecimal escape sequence are taken to be part of the construction of a
	single character for an integer character constant or of a single wide
	character for a wide character constant.  The numeric value of the
	hexadecimal integer so formed specifies the value of the desired
	character or wide character.

If I take this literally, it would be:
	char *the_multibyte_char="\x8eabcd";		/* I-1 */

However, I noticed that the draft sometimes uses the words "character" and
"byte" interchangeably.  If "character" actually means a byte, then
	char *the_multibyte_char="\x8e\xab\xcd";	/* I-2 */
must be the right notation.
What I mean to express here is:
	char the_multibyte_char_array[]={0x8e, 0xab, 0xcd, 0};	
	char *the_multibyte_char=the_multibyte_char_array;	
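
The intent behind I-2 can be sketched in C as follows. This is only an
illustration: the names x_as_escapes, x_as_array and same_bytes are
hypothetical, and the sketch assumes an implementation with 8-bit char.

```c
#include <string.h>

/* X spelled as three separate hexadecimal escapes (notation I-2). */
static const char x_as_escapes[] = "\x8e\xab\xcd";

/* The same bytes written out one by one, plus the null terminator. */
static const char x_as_array[] = { (char)0x8e, (char)0xab, (char)0xcd, 0 };

/* Returns 1 when both objects have the same size and the same bytes. */
static int same_bytes(void)
{
    return sizeof x_as_escapes == sizeof x_as_array
        && memcmp(x_as_escapes, x_as_array, sizeof x_as_array) == 0;
}
```

If same_bytes() returns 1, the literal occupies four bytes: the three
bytes of X followed by the terminator.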


A related question is how to use a hexadecimal escape in
a wide character string ( L"..." ).  Let's say the wide character value
for this character X is 0xbcde.  Then should a wide character string
that contains only the one character X be written as:
	wchar_t *the_wide_char_str=L"\xbcde";		/* II-1 */
or should it be:
	whcar_t *the_wide_char_str=L"\xbc\xde";		/* II-2 */
to mean:
	whcar_t the_wide_char_array[]={0xbcde, 0};
	whcar_t *the_wide_char_str=the_wide_char_array;
?

And finally, which is right?
	whcar_t the_wide_char=L'\xbcde';		/* III-1 */
	whcar_t the_wide_char=L'\xbc\xde';		/* III-2 */

My personal choices are I-2, II-1 and III-1.  This is based on my belief that
a hexadecimal escape sequence should describe the value of the 'atom' element
of a notation.  Because a double-quoted string is of type (char *), its atom's
data type is char, which actually means a byte, for historical reasons all of
you know.  Therefore an escape sequence should describe a byte.  For the same
reason, a hexadecimal escape sequence within a wide character constant or
string literal should describe a wide character.

I would like to know what other people think about this.
In your response, please distinguish what you think ANSI C should have been from
how the ANSI C spec (draft) should be interpreted.
Thank you in advance.

-T.Kurosaka, Sun Microsystems

gwyn@smoke.BRL.MIL (Doug Gwyn) (04/26/89)

In article <101058@sun.Eng.Sun.COM> kuro%shochu@Sun.COM (Teruhiko Kurosaka - Sun Intercon) writes:
>	char *the_multibyte_char="\x8eabcd";		/* I-1 */

No, other than the null-byte terminator there is just one char in that string.
Its value is implementation-dependent but is very likely either 0x8E or 0xCD.
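
Because a hexadecimal escape absorbs every hexadecimal digit that follows
it, one way to end the escape early is to split the literal: adjacent
string literals are concatenated only after their escape sequences have
been converted.  A minimal sketch, using the hypothetical name
split_literal and assuming 8-bit char:

```c
/* The first literal ends the escape after "8e"; the letters in the
   second literal are ordinary characters, not more hex digits. */
static const char split_literal[] = "\x8e" "abcd";

/* Number of bytes in the array, null terminator included. */
static unsigned long split_literal_size(void)
{
    return (unsigned long)sizeof split_literal;
}
```

Here the array holds the byte 0x8E, then the four letters a, b, c, d,
then the terminator.  That is a different string from the three-byte
character X, which is why I-1 does not do what was intended.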

>However, I noticed that the draft sometimes uses the words "character" and
>"byte" interchangeably.

It always uses these terms interchangeably; the difference is merely one of
emphasis.  See their definitions in section 1.6.  Note also that "multibyte
character" is defined as a separate concept, and that the occurrence of the
word "character" in the phrase "multibyte character" is not covered by the
definition given for just "character".  This is an unfortunate property of
technical English, and perhaps we should have invented some other name for
"multibyte character", but nobody could think of an acceptable alternative.

>	char *the_multibyte_char="\x8e\xab\xcd";	/* I-2 */

Correct.  You could also simply place the Kanji or whatever character
directly between the " marks, although that would make your source code
less portable, since different implementations would interpret the bytes
in your multibyte source character in different ways, some of them perhaps
invalid syntactically.  (For example, one of the bytes might represent the
" mark in some other implementation.)

>	wchar_t *the_wide_char_str=L"\xbcde";		/* II-1 */

Correct.

>	whcar_t *the_wide_char_str=L"\xbc\xde";		/* II-2 */
[       wchar_t]

No, this string contains three distinct values: 0x00BC, 0x00DE, and 0x0000.
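
The difference between II-1 and II-2 can be checked by counting elements.
The following is a sketch only: the names one_wide, two_wide and ELEMS are
illustrative, and it assumes wchar_t is at least 16 bits wide so that
0xbcde is representable.

```c
#include <stddef.h>
#include <wchar.h>

/* II-1: one hex escape, so one wide character plus the terminator. */
static const wchar_t one_wide[] = L"\xbcde";

/* II-2: two hex escapes, so two wide characters plus the terminator. */
static const wchar_t two_wide[] = L"\xbc\xde";

/* Number of array elements, terminator included. */
#define ELEMS(a) (sizeof (a) / sizeof (a)[0])

/* Length in wide characters, terminator excluded. */
static size_t wide_len(const wchar_t *s)
{
    return wcslen(s);
}
```

Under that assumption, ELEMS(one_wide) is 2 and ELEMS(two_wide) is 3,
matching the three values listed above for II-2.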

>	whcar_t the_wide_char=L'\xbcde';		/* III-1 */
[       wchar_t]

Correct, assuming you fix the typographic error as indicated.

>My personal choices are I-2, II-1 and III-1.

The Standard agrees with you (or vice versa).