[comp.windows.x] Unicode; Internationalizing char sets

erc@pai.UUCP (Eric F. Johnson) (02/26/91)

InfoWorld (25 Feb 91) has an interesting article on internationalizing
character sets. According to the article, a joint venture, Unicode, Inc.,
will replace ASCII with a 16-bit character set (with 27,000 characters
used). IBM, Apple, Sun, Microsoft, Next, Go (a pen-based company) and
Novell are members of Unicode. The article was rather scant on technical
details, which makes it hard to judge the merits of their approach.

Apparently, the Unicode character set will include traditional ASCII for
characters 0-127 and other national alphabets in the higher characters.
It looks like one can use at least Arabic, French and English in the
same document using Unicode. This seems to indicate that Unicode 
supports multi-lingual applications (which the X internationalization effort
has apparently chosen to put off, according to statements at the 5th
X Technical Conference).

I wonder what impact, if any, this will have on the internationalization (I18N)
efforts for the X Window System, with IBM, Apple and Sun as members of
Unicode.

From the purely selfish attitude of an application developer, it would be
nice if the X folks involved with internationalization would at least
_talk_ to the Unicode folks. 

Yes, I know that a character set is not all internationalization entails
(especially if it has to justify I18N and other jargon terms :-). But, it
looks like the industry is moving to standardize at least _part_ of
this (a character set).

Have fun,
-Eric


-- 
Eric F. Johnson               phone: +1 612 894 0313    BTI: Industrial
Boulware Technologies, Inc.   fax:   +1 612 894 0316    automation systems
415 W. Travelers Trail        email: erc@pai.mn.org     and services
Burnsville, MN 55337 USA

harald.alvestrand@elab-runit.sintef.no (02/26/91)

UNICODE is (IMHO) another US loser.
It is (as far as I know) heartily disliked by the Japanese, Chinese, Korean
and others whom IBM et al are trying to say that they make it for.
The reason is that they try to squeeze characters that look *almost* the same
and mean *almost* the same thing into a single character position.
Kind of like writing French without the accents: Readable, but UGLY.

The ISO guys are gathering around ISO 10646, a *32-bit* (gasp) character set
with compaction methods that make it compatible with ISO 8859-1 (Latin-1).

                   Harald Tveit Alvestrand
Harald.Alvestrand@elab-runit.sintef.no
C=no;PRMD=uninett;O=sintef;OU=elab-runit;S=alvestrand;G=harald
+47 7 59 70 94

mleisher@nmsu.edu (Mark Leisher) (02/27/91)

In article <1991Feb26.094326.3341@ugle.unit.no> harald.alvestrand@elab-runit.sintef.no writes:

>UNICODE is (IMHO) another US loser.
>It is (as far as I know) heartily disliked by the Japanese, Chinese, Korean
>and others whom IBM et al are trying to say that they make it for.
>The reason is that they try to squeeze characters that look *almost* the same
>and mean *almost* the same thing into a single character position.
>Kind of like writing French without the accents: Readable, but UGLY.
>

Yep.  The argument, wrt both Unicode and ISO 10646, is over the Han
unification.

>The ISO guys are gathering around ISO 10646, a *32-bit* (gasp) character set
>with compaction methods that make it compatible with ISO 8859-1 (Latin-1).

IMHO, if a Han unification is not agreed upon, it looks like it would
be easier to fit potentially different Han sets in ISO 10646.

Besides, 32 bits gives a lot of working room for future additions, of
whatever sort, and the compaction methods available in ISO 10646 still
allow some modicum of efficiency.

Another thing to keep in mind is that Japan, Korea, and PRC are now
working on new standards with internationalization in mind.  Once
these standards are out, maybe the unification questions will be
easier to resolve.

Perhaps delaying the internationalization of X is a good idea.  Who
wants potentially major modifications staring them in the face after
choosing one international character set over another?

-----------------------------------------------------------------------------
mleisher@nmsu.edu                      "I laughed.
Mark Leisher                                I cried.
Computing Research Lab                          I fell down.
New Mexico State University                        It changed my life."
Las Cruces, NM                     - Rich [Cowboy Feng's Space Bar and Grille]

mleisher@nmsu.edu (Mark Leisher) (02/27/91)

>Perhaps delaying the internationalization of X is a good idea.  Who
>wants potentially major modifications staring them in the face after
>choosing one international character set over another?

Thanks to Bob Scheifler for alerting me to this (wrong) little gem
from my last posting.

As he pointed out to me, internationalization efforts are primarily
oriented towards programming interfaces that provide codeset
independence as opposed to being dependent on a particular codeset.

Humblest apologies for my clumsy mis-info.

harkcom@spinach.pa.yokogawa.co.jp (03/16/91)

In article <MLEISHER.91Feb27070453@thrinakia.nmsu.edu> mleisher@nmsu.edu
   (Mark Leisher) writes:

 =}>Perhaps delaying the internationalization of X is a good idea.  Who
 =}>wants potentially major modifications staring them in the face after
 =}>choosing one international character set over another?
 =}
 =}Thanks to Bob Scheifler for alerting me to this (wrong) little gem
 =}from my last posting.
 =}
 =}As he pointed out to me, internationalization efforts are primarily
 =}oriented towards programming interfaces that provide codeset
 =}independence as opposed to being dependent on a particular codeset.

   But your question was useful in that it prompted me to ask myself
some others. And one of those questions seems to point to a potential
pitfall.

   The type wchar_t will be supported in X, and X can be used over a
network. wchar_t can have different sizes on two different machines,
since the only requirement is that it be large enough to represent every
character in all locales supported on that one machine (on one machine
the largest may be 2 bytes, while on another it is 3 or 4, most likely 4).

   Now suppose we have a machine, A,  running the server and it has
wchar_t defined as unsigned short. Now I remotely run a client on
another machine, B, which has wchar_t defined as unsigned int. I use
the display on A for the client. The client sends a text string which
has only two bytes per character in the four byte format to the server.
The server will draw a mess. And the reverse (server on B client on A),
will make an even prettier mess.

   Has there been an attempt to avoid this situation? If so, how?
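
   To make the scenario concrete, here is a minimal C sketch. It is
only an illustration under stated assumptions: the fixed-width types
uint32_t and uint16_t stand in for wchar_t on machines B and A, the
helper names pack_wide32 and unpack_wide16 are hypothetical, and the
byte order is fixed at big-endian for determinism. It does not model
what Xlib or the X protocol actually put on the wire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: lay out n characters as 4-byte big-endian units,
 * the way a client with a 32-bit wchar_t might if it sent raw wide
 * characters. Returns the number of bytes written to out. */
size_t pack_wide32(const uint32_t *chars, size_t n, unsigned char *out)
{
    for (size_t i = 0; i < n; i++) {
        out[4 * i + 0] = (unsigned char)(chars[i] >> 24);
        out[4 * i + 1] = (unsigned char)(chars[i] >> 16);
        out[4 * i + 2] = (unsigned char)(chars[i] >> 8);
        out[4 * i + 3] = (unsigned char)(chars[i] & 0xFF);
    }
    return 4 * n;
}

/* Hypothetical helper: read the same bytes as 2-byte big-endian units,
 * the way a receiver with a 16-bit wchar_t would. Returns the number
 * of 16-bit units stored in out. */
size_t unpack_wide16(const unsigned char *in, size_t nbytes, uint16_t *out)
{
    assert(nbytes % 2 == 0);
    size_t n = nbytes / 2;
    for (size_t i = 0; i < n; i++) {
        out[i] = (uint16_t)((in[2 * i] << 8) | in[2 * i + 1]);
    }
    return n;
}
```

Packing the two characters 0x3042 and 0x3044 and then unpacking them
yields four 16-bit units (0x0000, 0x3042, 0x0000, 0x3044): every other
"character" the receiver sees is a spurious zero.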

dan@ibm.COM (Walt Daniels) (03/17/91)

>From: harkcom@spinach.pa.yokogawa.co.jp
>   Now suppose we have a machine, A,  running the server and it has
>wchar_t defined as unsigned short. Now I remotely run a client on
>another machine, B, which has wchar_t defined as unsigned int. I use
>the display on A for the client. The client sends a text string which
>has only two bytes per character in the four byte format to the server.
>The server will draw a mess. And the reverse (server on B client on A),
>will make an even prettier mess.
>
>   Has there been an attempt to avoid this situation? If so, how?

There are no problems - read your Xlib manual about the draw string
functions. They come in two flavors, 8-bit and 16-bit. The codepoints
used in the text strings of the applications get converted to
glyph indexes into fonts by the draw operations for transmission over
the wire protocol to the server.
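
As a rough sketch of what the 16-bit flavor expects: Xlib's
XDrawString16 takes an array of XChar2b, a struct of two unsigned
chars (byte1, byte2) indexing a glyph in a 2-byte font. The code below
uses a local stand-in struct, Char2b, so it compiles without the X11
headers, and a hypothetical helper, to_char2b. It assumes the font's
glyph index is simply the 16-bit code value, which depends on the
font's encoding and will not hold in general.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in with the same layout as Xlib's XChar2b, declared here only
 * so the sketch is self-contained. */
typedef struct {
    unsigned char byte1; /* high-order byte of the glyph index */
    unsigned char byte2; /* low-order byte of the glyph index */
} Char2b;

/* Hypothetical helper: split each code value into two bytes,
 * independent of how wide the caller's wide-character type is.
 * Assumes glyph index == code value. Returns 0 on success, or -1 if
 * a code does not fit in 16 bits. */
int to_char2b(const uint32_t *codes, size_t n, Char2b *out)
{
    assert(codes != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (codes[i] > 0xFFFF) {
            return -1;
        }
        out[i].byte1 = (unsigned char)(codes[i] >> 8);
        out[i].byte2 = (unsigned char)(codes[i] & 0xFF);
    }
    return 0;
}
```

Because the result is a sequence of explicit byte pairs, its layout
does not depend on the size of wchar_t on either machine.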

There are problems with cutting and pasting arbitrary strings between
clients.  Both sides must agree on a codeset.  The usual thing is to
use compound text, but consenting clients can use other codesets.