[comp.std.internat] wchar_t values

erik@srava.sra.co.jp (Erik M. van der Poel) (04/12/91)

I'm directing followups to comp.std.internat. I apologize to
comp.std.c readers for the current noise level, which I seem to have
started.


Al Harkcom writes:
>    Though the term EUC is used as the name of an encoding scheme, it is
> also the name used for the multibyte encoding of the JIS standard using
> SS2 and SS3 single shifts.

Yes, people often say "EUC" when they mean "Japanese EUC". That
doesn't mean that they are right. Think of it this way: EUC is the
generic international `class', while UJIS is a name for the particular
Japanese `instance'.

Also, you refer to "the JIS standard". This is rather misleading,
since several implementations use *two* JIS standards, namely JIS X
0208 (Kanji, etc) and the right-hand part of JIS X 0201 (`half-sized'
Katakana, etc).


> UJIS is the name used to refer to the 2 byte
> encoding of the EUC scheme JIS standard. The 2 byte (4 byte on HP) wide
> character encodings for Japanese are usually UJIS...

Perhaps we're getting confused because we are looking at different
documents. I got my information from a paper by Yasushi Nakahara,
"Nihongo Koodo No Genjo To Mondaiten", Jan. 1988. In this paper, he
says that UJIS was the name that the Sigma project gave to a Japanese
usage of EUC. He refers to codesets 1, 2 and 3 (i.e. not only 0208
Kanji, etc).

According to this paper, UJIS is not a 2 byte code. It is an encoding
in which characters require 1, 2 or 3 bytes each. I.e. it is an mb
code, definitely not a wc code.
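
As a rough illustration of the 1, 2 or 3 byte point, here is a minimal sketch that reads the length of each character from its lead byte. The byte values (SS2 = 0x8E, SS3 = 0x8F, code set 1 lead bytes 0xA1 to 0xFE) are those commonly documented for Japanese EUC, not taken from the paper cited above, and the function names are made up for this example.

```python
SS2 = 0x8E  # single shift 2: introduces a code set 2 character
SS3 = 0x8F  # single shift 3: introduces a code set 3 character


def euc_jp_char_len(lead):
    """Return how many bytes the character starting with `lead` occupies."""
    if lead < 0x80:
        return 1  # code set 0: one ASCII byte
    if lead == SS2:
        return 2  # code set 2: SS2 followed by one byte
    if lead == SS3:
        return 3  # code set 3: SS3 followed by two bytes
    if 0xA1 <= lead <= 0xFE:
        return 2  # code set 1: two bytes, both with the high bit set
    raise ValueError("not a valid lead byte: 0x%02X" % lead)


def split_chars(data):
    """Split a bytes object into a list of per-character byte strings."""
    chars = []
    i = 0
    while i < len(data):
        n = euc_jp_char_len(data[i])
        if i + n > len(data):
            raise ValueError("truncated character at offset %d" % i)
        chars.append(data[i:i + n])
        i += n
    return chars
```

Because the character length varies with the lead byte, a string in this form has to be scanned from the start to find character boundaries, which is what makes it a multibyte rather than a wide-character representation.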
-- 
Erik M. van der Poel                                      erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan     TEL +81-3-3234-2692

harkcom@spinach.pa.yokogawa.co.jp (04/15/91)

In article <1130@sranha.sra.co.jp> erik@srava.sra.co.jp
   (Erik M. van der Poel) writes:

 =}Also, you refer to "the JIS standard". This is rather misleading,
 =}since several implementations use *two* JIS standards, namely JIS X
 =}0208 (Kanji, etc) and the right-hand part of JIS X 0201 (`half-sized'
 =}Katakana, etc).

   Actually, three popular code sets are JIS X 0201, JIS X 0208, and
JIS X 0212. JIS X 0212 is a set of additional kanzi.

 =}Perhaps we're getting confused because we are looking at different
 =}documents.
 =} [...]
 =}He refers to codesets 1, 2 and 3 (i.e. not only 0208
 =}Kanji, etc).

   Yes, I'm looking at the documentation from various software packages
which use the UJIS encoding. They refer to four code sets:
   G0:	ASCII
   G1:	KANZI	(JIS X 0208)
   G2:	HANKAKU	(JIS X 0201)
   G3:	GAIZI
All four code sets are 16 bits wide.
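
One way four code sets could each fit in 16 bits is to let the high bits of the two bytes identify the code set. The sketch below is only an assumption about how such a packing might look; it is not taken from the documentation referred to above, and the names `pack16` and `code_set` are hypothetical.

```python
def pack16(ch):
    """Pack one multibyte character (a bytes object) into a 16-bit integer."""
    if len(ch) == 1 and ch[0] < 0x80:
        return ch[0]                          # G0: both high bits clear
    if len(ch) == 2 and ch[0] == 0x8E and ch[1] >= 0x80:
        return ch[1]                          # G2: only the low byte's high bit set
    if len(ch) == 3 and ch[0] == 0x8F and ch[1] >= 0x80 and ch[2] >= 0x80:
        return (ch[1] << 8) | (ch[2] & 0x7F)  # G3: only the high byte's high bit set
    if len(ch) == 2 and ch[0] >= 0xA1 and ch[1] >= 0xA1:
        return (ch[0] << 8) | ch[1]           # G1: both high bits set
    raise ValueError("not a recognised character: %r" % (ch,))


def code_set(wc):
    """Recover the code set number (0 to 3) from a packed 16-bit value."""
    hi = bool(wc & 0x8000)
    lo = bool(wc & 0x0080)
    if hi and lo:
        return 1
    if lo:
        return 2
    if hi:
        return 3
    return 0
```

With a packing of this kind every character occupies the same fixed width, so indexing into an array of packed values finds the n-th character directly.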

 =}According to this paper, UJIS is not a 2 byte code. It is an encoding
 =}in which characters require 1, 2 or 3 bytes each. I.e. it is an mb
 =}code, definitely not a wc code.

   I hate to disagree, but all of the implementations I have seen that
use an mb encoding call the Japanese encoding EUC, while those that use
a wc encoding call it UJIS (except, of course, HP, which calls both UJIS).

Al