[net.unix] International character set

rcd@opus.UUCP (Dick Dunn) (11/04/85)

> >>Unfortunately, you CAN'T build a good international character set.
> >>Some of those silly European countries have the same character in
> >>several languages, but sort the character in different places in each
> >>language.  They also have interesting constructs like characters that
> >>sort as two characters, and pairs of characters that sort as single
> >>characters...

This view mixes several different matters:
	- getting an international character set
	- dealing with multiple-letter groupings and ligatures
	- sorting and searching
	- collating

To get a useful international character set, all you need is a mapping of
enough codes to enough different symbols that you can prepare text for all
the languages of interest.  This doesn't dictate how many bits you need,
nor whether some symbols may require more than one datum (byte) for
representation.  (Example:  Letters with diacritical marks might be one
byte for a special code, two to combine a non-escaping diacritical with a
letter, or three to combine a letter, a backspace, and a diacritical.  The
only real constraint is that in the third case you should decide whether
the letter or the mark comes first in the sequence.)
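
As a rough illustration of those three options, here is a small Python
sketch.  It borrows Unicode as the example character set, which is an
assumption made for illustration and not something the discussion above
depends on; the variable names are invented.  Unicode happens to settle the
ordering question by putting the letter first and the mark second.

```python
import unicodedata

# Three ways to write "n with tilde", mirroring the one-, two-, and
# three-datum options described above.
precomposed = "\u00f1"   # one code: LATIN SMALL LETTER N WITH TILDE
combining = "n\u0303"    # letter, then a non-spacing COMBINING TILDE
overstrike = "n\b~"      # letter, backspace, tilde (printer-style overstrike)

for label, text in [("precomposed", precomposed),
                    ("combining", combining),
                    ("overstrike", overstrike)]:
    # Show how many data each form needs and which codes they are.
    print(label, len(text), [hex(ord(ch)) for ch in text])

# Unicode normalization treats the first two forms as equivalent.
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining

# It knows nothing about the overstrike convention, so software that
# accepts that form has to handle it on its own.
assert unicodedata.normalize("NFC", overstrike) == overstrike
```

Nothing here favors one form over another; the sketch only shows that the
same symbol can take one, two, or three data, and that a program has to
know which convention is in use.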

Multiple-letter groupings are probably represented as multiple data, one
for each letter in the actual sequence.  Ligatures are probably represented
with single codes, since they are distinct from the letters they comprise.
(I am referring to
ligatures in the linguistic sense, such as AE, rather than in the
typographic sense, such as ffl.)
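
The linguistic/typographic distinction can be seen in how one existing
character set treats the two cases.  The sketch below again uses Unicode
purely as an example; it is not the only way to draw the line.

```python
import unicodedata

ae = "\u00e6"    # LATIN SMALL LETTER AE: a letter in its own right
ffl = "\ufb04"   # LATIN SMALL LIGATURE FFL: a typesetting convenience

print(unicodedata.name(ae))
print(unicodedata.name(ffl))

# Compatibility normalization dissolves the typographic ligature into
# the three letters it stands for ...
assert unicodedata.normalize("NFKC", ffl) == "ffl"

# ... but leaves the linguistic one alone: it is not "a" followed by "e".
assert unicodedata.normalize("NFKC", ae) == ae
assert ae != "ae"
```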

Sorting, used in the sense of putting things in -some- order, for the
purpose of being able to search or compare, is easy--note that you DON'T
need to have any "sensible" order for this; it must only be consistent.
You could process American text and order the alphabet A X C B D E Y...
as long as you do so consistently.  This makes more sense if I compare it
with:

Collating, used in the sense of putting things in order so that humans will
agree that they are in order and can use the ordering, is quite a difficult
problem.  It cannot be done from the character codes alone.  Look at the
ordering that ASCII would produce.  If you think it's reasonable, consider
that MacDonald will sort between MacAfee and Macafee.  ASCII only works
somewhat reasonably for simplistic use with American or English text; even
then, it cannot deal reasonably with words of foreign origin.  (Suppose
that instead of canyon you have can~on, with the tilde sitting over the n
as it should.  What sequence of bytes would represent it, and where would
that sequence sort, assuming the sort program didn't go weird on you?)
What you have to do for humans is to sort according to a set of rules
designed for the natural language you are processing.  You have to know
this at the outset, and the rules for most languages are not going to be
trivial.  It doesn't have to influence the character set design all that
much.
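
The difference between sorting and collating can be sketched in a few lines
of Python.  Everything named here (ODD_ALPHABET, odd_key, find, SPANISH,
spanish_key) is invented for illustration.  The Spanish rules are the
traditional ones, in which "ch" and "ll" count as single letters and the n
with a tilde follows n; the key function is a toy that ignores accented
vowels and everything else a real collation routine must handle.

```python
import bisect

# 1. Plain code order: every uppercase letter precedes every lowercase one.
names = ["Macafee", "MacDonald", "MacAfee"]
by_code = sorted(names)
print(by_code)   # ['MacAfee', 'MacDonald', 'Macafee']

# 2. Sorting for search: a scrambled but consistent alphabet is enough.
ODD_ALPHABET = "AXCBDEYFGHIJKLMNOPQRSTUVWZ"   # hypothetical ordering
ODD_RANK = {ch: i for i, ch in enumerate(ODD_ALPHABET)}

def odd_key(word):
    return [ODD_RANK[ch] for ch in word.upper()]

words = sorted(["BAKER", "CABLE", "ABLE", "XRAY"], key=odd_key)
keys = [odd_key(w) for w in words]

def find(word):
    """Binary search; works because the same order is used throughout."""
    k = odd_key(word)
    i = bisect.bisect_left(keys, k)
    return i if i < len(keys) and keys[i] == k else -1

print(words, find("CABLE"))   # ['ABLE', 'XRAY', 'CABLE', 'BAKER'] 2

# 3. Collating: the rules come from the language, not from the codes.
SPANISH = ("a b c ch d e f g h i j k l ll m n \u00f1 o p q r s t u v "
           "w x y z").split()
SPANISH_RANK = {letter: i for i, letter in enumerate(SPANISH)}

def spanish_key(word):
    word = word.lower()
    key, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in SPANISH_RANK:   # "ch", "ll"
            key.append(SPANISH_RANK[pair])
            i += 2
        else:
            key.append(SPANISH_RANK[word[i]])
            i += 1
    return key

spanish_words = ["caoba", "ca\u00f1on", "canyon", "chico", "cuna"]
print(sorted(spanish_words))                   # code order
print(sorted(spanish_words, key=spanish_key))  # traditional Spanish order
```

In code order the tilde-n word lands after "caoba" and "chico" lands before
"cuna"; with the language-specific key both move to where a reader of
traditional Spanish would look for them.  The character codes did not
change between the two sorts, only the rules applied to them.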
-- 
Dick Dunn	{hao,ucbvax,allegra}!nbires!rcd		(303)444-5710 x3086
   ...Never attribute to malice what can be adequately explained by stupidity.