kuro%shochu@Sun.COM (Teruhiko Kurosaka - Sun Intercon) (04/26/89)
I have a question about the relationship among three new concepts and notations introduced by the ANSI-C draft: multibyte characters, wide characters, and hexadecimal escape notation.

For the following discussion, let's assume that a character X is a multibyte character represented by the three-byte sequence 0x8e 0xab 0xcd on some system.

The first question is how to represent this three-byte character with hexadecimal escape sequences within a double-quoted string. The draft (12/7/88, p.30, line 14) says:

    The hexadecimal digits that follow the backslash and the letter x in a hexadecimal escape sequence are taken to be part of the construction of a single character for an integer character constant or of a single wide character for a wide character constant. The numeric value of the hexadecimal integer so formed specifies the value of the desired character or wide character.

If I take this literally, it would be:

    char *the_multibyte_char="\x8eabcd"; /* I-1 */

However, I noticed that the draft sometimes uses the words "character" and "byte" interchangeably. If "character" actually means a byte, then

    char *the_multibyte_char="\x8e\xab\xcd"; /* I-2 */

must be the right notation. What I mean here is:

    char the_multibyte_char_array[]={0x8e, 0xab, 0xcd, 0};
    char *the_multibyte_char=the_multibyte_char_array;

Another related question is how to use the hexadecimal escape in a wide character string ( L"..." ). Let's say the wide character value for this character X is 0xbcde. Then should a wide character string that contains only the one character X be written as:

    wchar_t *the_wide_char_str=L"\xbcde"; /* II-1 */

or should it be:

    whcar_t *the_wide_char_str=L"\xbc\xde"; /* II-2 */

to mean:

    whcar_t the_wide_char_array[]={0xbcde, 0};
    whcar_t *the_wide_char_str=the_wide_char_array;

?

And finally, which is right?

    whcar_t the_wide_char=L'\xbcde'; /* III-1 */
    whcar_t the_wide_char=L'\xbc\xde'; /* III-2 */

My personal choices are I-2, II-1 and III-1.

This is based on my personal belief that a hexadecimal escape sequence should describe the value of the 'atom' element of a notation. Because a double-quoted string is of type (char *), its atom's data type is char, which actually means a byte, for historical reasons that all of you know. Therefore an escape sequence there should describe a byte. For the same reason, a hexadecimal escape sequence within a wide character constant or string literal should describe a wide character.

I would like to know what other people think about this. In your response, please distinguish between what you think ANSI-C should have been and how the ANSI-C spec (draft) should be interpreted.

Thank you in advance.

-T.Kurosaka, Sun Microsystems
gwyn@smoke.BRL.MIL (Doug Gwyn) (04/26/89)
In article <101058@sun.Eng.Sun.COM> kuro%shochu@Sun.COM (Teruhiko Kurosaka - Sun Intercon) writes:
> char *the_multibyte_char="\x8eabcd"; /* I-1 */

No, other than the null-byte terminator there is just one char in that string. Its value is implementation-dependent but is very likely either 0x8E or 0xCD.

>However, I noticed that the draft sometimes uses the words "character" and
>"byte" interchangeably.

It always uses these terms interchangeably; the difference is merely one of emphasis. See their definitions in section 1.6. Note also that "multibyte character" is defined as a separate concept, and that the occurrence of the word "character" in the phrase "multibyte character" is not covered by the definition given for just "character". This is an unfortunate property of technical English, and perhaps we should have invented some other name for "multibyte character", but nobody could think of an acceptable alternative.

> char *the_multibyte_char="\x8e\xab\xcd"; /* I-2 */

Correct. You could also simply place the Kanji or whatever character directly between the " marks, although that would make your source code less portable, since different implementations would interpret the bytes in your multibyte source character in different ways, some of them perhaps invalid syntactically. (For example, one of the bytes might represent the " mark in some other implementation.)

> wchar_t *the_wide_char_str=L"\xbcde"; /* II-1 */

Correct.

> whcar_t *the_wide_char_str=L"\xbc\xde"; /* II-2 */
  [ wchar_t]

No, this string contains three distinct values: 0x00BC, 0x00DE, and 0x0000.

> whcar_t the_wide_char=L'\xbcde'; /* III-1 */
  [ wchar_t]

Correct, assuming you fix the typographic error as indicated.

>My personal choices are I-2, II-1 and III-1.

The Standard agrees with you (or vice versa).
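
The answers above can be checked with a short C sketch. It is only an illustration under stated assumptions: the identifiers (mb_i2, mb_split, wide_ii1, wide_ii2, wide_iii1, byte_at, wide_at) are hypothetical names chosen here, the implementation is assumed to provide <wchar.h>, and wchar_t is assumed to be wide enough to hold 0xbcde as a non-negative value. Form I-1 is left out on purpose, because its single escape does not fit in one char and many compilers warn about it or reject it. The mb_split line is an extra, not from the thread: it shows that an escape absorbs every hexadecimal digit that follows it unless the literal is split.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* I-2: one escape per byte of the three-byte multibyte character X. */
const char mb_i2[] = "\x8e\xab\xcd";

/* Splitting the literal ends the escape early: bytes 0x8e, 'a', 'b'. */
const char mb_split[] = "\x8e" "ab";

/* II-1: a single escape gives a single wide character, 0xbcde. */
const wchar_t wide_ii1[] = L"\xbcde";

/* II-2: two escapes give two wide characters, 0x00bc and 0x00de. */
const wchar_t wide_ii2[] = L"\xbc\xde";

/* III-1: the same escape in a wide character constant. */
const wchar_t wide_iii1 = L'\xbcde';

/* Read byte i of a narrow string as an unsigned value, so that 0x8e
   compares equal to 0x8e even where plain char is signed. */
unsigned int byte_at(const char *s, size_t i)
{
    return (unsigned char)s[i];
}

/* Read element i of a wide string as an unsigned long.
   Assumption: the stored value is non-negative in wchar_t. */
unsigned long wide_at(const wchar_t *s, size_t i)
{
    return (unsigned long)s[i];
}
```

With these definitions, strlen(mb_i2) is 3, wcslen(wide_ii1) is 1, and wcslen(wide_ii2) is 2, which matches the reading that each hexadecimal escape describes exactly one element of the string: a char in a narrow string, a wchar_t in a wide one.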