[comp.std.c] Multibyte characters

mikeb@inset.UUCP (Mike Banahan) (07/03/90)

On the interesting subject of wide characters, multibyte characters and
so on, I haven't noticed a discussion in this group which touches on
the following.

Let's say that I do have a multibyte execution character set which supports,
for the sake of argument, English and Greek, with Greek using a shift-in
shift-out mechanism.
A string of the form "abc@d" is valid C (using @ to represent the Greek
character `alpha').
It will contain 8 bytes, counting the shift-in, shift-out and the null
at the end.
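
The byte count above can be checked with a small sketch. The shift bytes
and the byte standing for alpha are hypothetical: the post names no actual
encoding, so the ASCII SO/SI control codes are used here purely as
stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding, not taken from the post: SO enters Greek,
   SI returns to English, and 'a' means alpha while shifted. */
#define SHIFT_TO_GREEK   '\x0e'
#define SHIFT_TO_ENGLISH '\x0f'
#define GREEK_ALPHA      'a'

/* The bytes of "abc@d" in this made-up encoding, where @ is alpha. */
const char example[] = {
    'a', 'b', 'c',
    SHIFT_TO_GREEK, GREEK_ALPHA, SHIFT_TO_ENGLISH,
    'd', '\0'
};

/* Bytes occupied by the string, terminating null included. */
size_t example_size(void)
{
    return sizeof example;
}

/* Bytes before the terminating null, which is what strlen would report. */
size_t bytes_before_null(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != '\0')
        n++;
    return n;
}
```

Note that the byte count (8) and the count of visible characters plus the
null (6) differ, which is why byte-oriented functions such as strlen do not
count characters in such an encoding.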

Presumably the integral constant '@' is a three-byte constant, no matter
what it may look like? An alternative interpretation is that it violates
the constraint in 2.2.1.2 `a .. character constant .. shall begin
and end in the initial shift state', but presumably I can expect my
implementation to do the necessary good deeds and put a shift-out
in there too.


Since it is a three-byte constant (assuming I'm right), then can I be
sure that I do not get overflow when I assign it to a char variable?
3.1.3.4 says that the value of a multi-character character constant
will be implementation-defined, and 3.2.1.2 says that (paraphrase)
demoting an int to a char gives an implementation-defined result.
So to call it `overflow' is perhaps overstating the case, but I clearly
end up in implementation-defined territory twice over.

Sorry if this has been discussed before. If not, could someone enlighten
me as to the actual situation?

Thanks in advance,
Mike Banahan
-- 
Mike Banahan, Technical Director, The Instruction Set Ltd.
mcvax!ukc!inset!mikeb

marking@drivax.UUCP (M.Marking) (07/05/90)

mikeb@inset.UUCP (Mike Banahan) writes:

) Let's say that I do have a multibyte execution character set which supports,
) for the sake of argument, English and Greek, with Greek using a shift-in
) shift-out mechanism.
) A string of the form "abc@d" is valid C (using @ to represent the Greek
) character `alpha').
) It will contain 8 bytes, counting the shift-in, shift-out and the null
) at the end.

) Presumably the integral constant '@' is a three-byte constant, no matter
) what it may look like?

I don't know about Greek, but I have seen situations where the mbchar itself
is three bytes, so with the shift in/out you have five bytes. Not all schemes
use shift-in/shift-out: some don't know about shifts at all and some have
an implicit shift after each character, so it's *always* in the initial
state. For others, the shift is implied by the initial character of the
multibyte sequence being in certain ranges. Furthermore, some schemes use
characters of mixed lengths, so that a string might consist of a mixture
of 1, 2, and 3-byte characters.
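
Since character lengths can vary within one string, code that wants to
count or step over characters cannot assume a fixed width. A minimal
sketch using the standard mblen function follows; the function name
count_mbchars is made up for illustration, and the result depends on
whatever locale is in effect when it is called.

```c
#include <assert.h>
#include <stdlib.h>   /* mblen, MB_CUR_MAX */

/* Count the multibyte characters in s under the current locale.
   Returns -1 if s contains a sequence that mblen rejects. */
long count_mbchars(const char *s)
{
    long count = 0;
    int len;

    assert(s != NULL);
    (void)mblen(NULL, 0);   /* put mblen back in the initial shift state */

    /* mblen returns the byte length of the next character,
       0 at the terminating null, or -1 for an invalid sequence. */
    while ((len = mblen(s, MB_CUR_MAX)) > 0) {
        s += len;
        count++;
    }
    return (len < 0) ? -1 : count;
}
```

In the default "C" locale this behaves like strlen for ordinary text; the
interesting cases only appear after setlocale selects a multibyte locale.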

(My apologies if you want to know about Greek specifically, but my
presumption is that we want to write code that will work in a variety of
locales.)

) An alternative interpretation is that it violates
) the constraint in 2.2.1.2 `a .. character constant .. shall begin
) and end in the initial shift state', but presumably I can expect my
) implementation to do the necessary good deeds and put a shift-out
) in there too.

Good question. In Japanese, there are no separate shift characters, so
I don't know what compilers do when there are. Anyone?

) Since it is a three-byte constant (assuming I'm right), then can I be
) sure that I do not get overflow when I assign it to a char variable?

A char is not a multibyte char, so truncation or overflow or whatever
is the likely result. The type char is still a single scalar value, so
an array of them is needed for multibyte data.

) 3.1.3.4 says that the value of a multi-character character constant
) will be implementation-defined, and 3.2.1.2 says that (paraphrase)
) demoting an int to a char gives an implementation-defined result.
) So to call it `overflow' is perhaps overstating the case, but I clearly
) end up in implementation-defined territory twice over.

You can test MB_LEN_MAX (for the compiler's worst case) or MB_CUR_MAX
(for the current locale's worst case) to check how many bytes you
might need to hold the value.
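
As a rough illustration of that sizing rule, the sketch below converts one
wide character into a buffer of MB_LEN_MAX bytes with the standard wctomb
function. The name encoded_length is invented for this example, and the
sketch makes no claim about how a given implementation accounts for shift
sequences in either limit.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX: compile-time worst case */
#include <stdlib.h>   /* MB_CUR_MAX, wctomb, wchar_t */

/* Return the number of bytes the current locale uses to encode wc,
   or -1 if wc has no multibyte representation in this locale. */
int encoded_length(wchar_t wc)
{
    char buf[MB_LEN_MAX];   /* large enough for any supported locale */

    /* The current locale's worst case never exceeds the compiler's. */
    assert(MB_CUR_MAX <= (size_t)MB_LEN_MAX);

    (void)wctomb(NULL, 0);  /* reset to the initial shift state */
    return wctomb(buf, wc);
}
```

MB_LEN_MAX is a constant expression and so can size an array; MB_CUR_MAX
need not be constant, because it can change when the locale changes.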

My question: do MB_LEN_MAX and MB_CUR_MAX include shift characters in
locales that use them? If not, my recollection is that the old ANSI
spec on extended characters allows multibyte shift sequences, so how
do we know the maximum length of a shift sequence (in or out)?

My experiences with shift characters antedate the introduction of
multibyte and wide characters into C. Any information on current use
of shift characters here would be appreciated.

gwyn@smoke.BRL.MIL (Doug Gwyn) (07/05/90)

In article <1467@inset.UUCP> mikeb@inset.co.uk (Mike Banahan) writes:
>Presumably the integral constant '@' is a three-byte constant, no matter
>what it may look like?

No, the value of such a multibyte character constant is implementation-
defined.  The type of the constant is int.

>An alternative interpretation is that it violates the constraint in
>2.2.1.2 `a .. character constant .. shall begin and end in the initial
>shift state', but presumably I can expect my implementation to do the
>necessary good deeds and put a shift-out in there too.

No, you had better put the shift-out in there too or the final ' may not
be recognized by the compiler.

>Since it is a three-byte constant (assuming I'm right), then can I be
>sure that I do not get overflow when I assign it to a char variable?

It is most unlikely that the implementation definition will assign a
value less than 256 to '@'.  Therefore (assuming that chars are
represented in 8 bits, as is usually the case these days), information
will be lost if you assign that character constant to a char variable.
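
The loss can be sketched as follows. The value 0x0341 is a made-up
stand-in for whatever an implementation might assign to '@', and the
sketch assumes 8-bit chars, as above. It narrows through unsigned char so
that the conversion is well defined (reduction modulo UCHAR_MAX + 1);
conversion to a signed char that cannot hold the value is
implementation-defined instead.

```c
#include <assert.h>
#include <limits.h>   /* UCHAR_MAX */

/* Hypothetical value for the constant; real values are
   implementation-defined. Assumes CHAR_BIT == 8. */
#define HYPOTHETICAL_ALPHA 0x0341

/* Narrow a non-negative int to unsigned char, keeping the low bits. */
unsigned char narrowed(int value)
{
    assert(value >= 0);   /* sketch handles non-negative values only */
    return (unsigned char)value;
}

/* Return 1 if value survives a round trip through unsigned char. */
int survives_narrowing(int value)
{
    return (int)narrowed(value) == value;
}
```

With 8-bit chars, narrowed(HYPOTHETICAL_ALPHA) yields 0x41, so the
resulting char is indistinguishable from an ordinary single-byte
character.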

Situations like this are best dealt with by explicit use of the wchar_t
type, which should be large enough to contain any source character.
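
A minimal sketch of that approach, using the standard mbtowc function to
turn the first multibyte character of a string into a wchar_t. The helper
name first_wide is made up for illustration, and returning L'\0' on
failure is a simplification chosen for this sketch.

```c
#include <assert.h>
#include <stdlib.h>   /* mbtowc, MB_CUR_MAX, wchar_t */

/* Decode the first multibyte character of s under the current locale.
   Returns L'\0' for an empty string or an invalid sequence. */
wchar_t first_wide(const char *s)
{
    wchar_t wc = L'\0';

    assert(s != NULL);
    (void)mbtowc(NULL, NULL, 0);   /* reset to the initial shift state */

    /* mbtowc returns the bytes consumed, 0 at a null, or -1 on error. */
    if (mbtowc(&wc, s, MB_CUR_MAX) <= 0)
        return L'\0';
    return wc;
}
```

Wide constants and strings written as L'x' and L"..." have type wchar_t
and array of wchar_t respectively, so comparisons such as
first_wide(s) == L'a' avoid the narrowing problem entirely.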