[net.internat] Int'l character sets

guy@sun.uucp (Guy Harris) (02/21/86)

> I don't think the "umlauts" should be moved elsewhere, at least not in
> ASCII.  Move the braces and brackets somewhere else.

ASCII isn't likely to change; it's stable.  Now ISO Latin Alphabet No. 1,
that's another story.  They've already moved the "umlauts", so it's too
late.  Dan Sahlins at The Royal Institute in Stockholm (ttds!dan) posted a
part of the ISO Draft standard.  The lower 128 code points (to use IBMese
instead of English) are the same as they are in ASCII; the upper 128 code
points include the various alphabetic characters used in non-English
languages and some special characters (cent sign, pound sign, accents).
"Capital letter A with ring above", as in "\oAngstrom", is in position
12/05, which in ANSI and ISO's somewhat opaque notation indicates the
character with code 12*16 + 5, i.e. hexadecimal C5.
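
The column/row arithmetic above can be sketched in a few lines of Python.
The function name is made up for illustration, and the decoding step assumes
that Python's "latin-1" codec (ISO 8859-1 as eventually published) agrees
with the draft at this position:

```python
import unicodedata

def code_value(column, row):
    """Convert column/row notation such as 12/05 into a single code value."""
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must each be in the range 0..15")
    return column * 16 + row

value = code_value(12, 5)
print(hex(value))  # 0xc5

# Assumption: the "latin-1" codec matches the draft for this code point.
char = bytes([value]).decode("latin-1")
print(unicodedata.name(char))  # LATIN CAPITAL LETTER A WITH RING ABOVE
```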

Someone in an earlier article made reference to "the Danish Standard ISO 646
character set", and to other Scandinavian countries using their national
versions of ISO 646.  Is ISO 646 the current 7-bit alphabets with {, |, },
etc. replaced with additional letters?

> It just so happens that the ASCII I am used to represents my alphabet.
> (Almost: W isn't really part of it and oA is placed two steps wrong.)
> What would you think of a collating sequence like: 
>    A B C $ / # D E F ) = #  and so forth.

The second sentence voids the first.  What would an English speaker think of
a collating sequence like: "A B C F D E G ..." (I presume that's what you
mean by "oA is placed two steps wrong.")  "Almost" isn't good enough.

> I think one of basic ideas behind this conference is to make it possible
> not only for English-speaking people to not to have write special sort 
> programs, but be able rely on standard program (like grep) or standard 
> functions in programming languages. (like < >, etc for string comparisons
> in Pascal.)

GOOD LUCK.  I don't think you stand a snowball's chance in hell of having
standard string comparison functions in programming languages doing the
right thing.  From a posting by Lambert Meertens at the Centre for
Mathematics and Computer Science in Amsterdam (mcvax!lambert):

> 3.  It is not really clear whether ij should be considered one letter or
> two letters representing one vowel (just like no-one would dream of calling
> aa a ligature).  At school, Dutch kids are taught an alphabet ending in ...
> x, ij, y, z.  Also, if a word starting with ij is capitalized, the result
> is always IJ (so ijspret, the joy of ice skating, becomes IJspret).  Some
> Dutch typewriters have a separate ij key.  If I use such a typewriter, I
> won't touch that key because the result is esthetically less satisfying
> than that of i+j.
>
> 5.  Really conclusive would be the sorting convention.  ...
> This, however, is anarchy.  Most dictionaries sort ij like the two letters
> i+j, so ignorant < ijspret < illusoir.  Most encyclopedias use the school
> alphabet, so Xenophobia < IJspret < Yggdrasil.  The PTT sort on ij = y, so
> Wijchen < Wymbritseradeel < Wijngaarden.  They have a very good reason for
> this: before standardization settled on ij, many Dutch family names had
> already fixed themselves on y; only different branches could have different
> spellings.  So we have families De Bruyn next to families De Bruijn.
> Usually, you don't know which of the two is used officially; it is not even
> unheard of that a bearer of such a name doesn't know it themself unless
> they look it up in their passport or driver's licence.
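
The three conventions Lambert describes can each be written as a sort key.
This is only a sketch: the function names are invented here, and it handles
nothing beyond the "ij" question (no accents, no prefixes such as "De").

```python
# School alphabet from the quoted posting: ... x, ij, y, z.
SCHOOL_ALPHABET = list("abcdefghijklmnopqrstuvwx") + ["ij", "y", "z"]
RANK = {letter: i for i, letter in enumerate(SCHOOL_ALPHABET)}

def letters(word):
    """Split a word into letters, treating "ij" as a single letter."""
    word = word.lower()
    out, i = [], 0
    while i < len(word):
        if word.startswith("ij", i):
            out.append("ij")
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

def dictionary_key(word):
    # Dictionaries: "ij" is simply i followed by j.
    return word.lower()

def encyclopedia_key(word):
    # Encyclopedias: "ij" is one letter between x and y.
    # Anything outside the alphabet sorts first (rank -1).
    return [RANK.get(letter, -1) for letter in letters(word)]

def ptt_key(word):
    # PTT (phone book): "ij" is treated as equal to "y".
    return word.lower().replace("ij", "y")

print(sorted(["illusoir", "ijspret", "ignorant"], key=dictionary_key))
print(sorted(["Yggdrasil", "IJspret", "Xenophobia"], key=encyclopedia_key))
print(sorted(["Wijngaarden", "Wymbritseradeel", "Wijchen"], key=ptt_key))
```

Same words, three different "correct" orders, and none of them falls out of
the code values alone.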

And a subsequent posting:

> As a kind reader points out to me:
>
> + I think you are mistaken when you say that "rr" is sorted as a single
> + letter in Spanish.  Although "ch" and "ll" do sort as single letters,
> + "rr" does not (even though it is considered to be a separate letter).
> + Perhaps this is because no Spanish words start with it.

From "The International Utilities Package" in "Inside Macintosh":

	Note: ... String comparison in Pascal yields very different
	results (from the "international string comparison" routines
	in Macintosh - gh), since it simply follows the ordering of
	the characters' ASCII codes.

These routines, from a quick reading of that section of "Inside Macintosh",
change their behavior depending on the setting of a global flag indicating
which language, etc. is in use.

So comparison of character strings depending on the national sorting rules
is a lot more complicated than comparison of character strings on a
byte-by-byte basis.  As such, I think the position of characters within the
character set isn't really all that relevant.  Sorting English-language text
may run faster, since ASCII happens to be set up with the letters in the
right order, but remember that "dictionary order" treats upper-case and
lower-case letters the same, so even there a straight byte-by-byte
comparison isn't always what you want.
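
Even for plain English the difference is easy to show.  A rough sketch in
Python (the word list is made up, and the "dictionary" key is just one
plausible definition, folding case first and using the original spelling
only to break ties):

```python
words = ["banana", "Cherry", "apple", "Banana"]

# Byte-by-byte: every upper-case letter (0x41-0x5A) sorts before every
# lower-case letter (0x61-0x7A).
bytewise = sorted(words, key=lambda w: w.encode("ascii"))

# Dictionary-style: compare case-folded text, then break ties on the
# original spelling so the result is deterministic.
dictionary = sorted(words, key=lambda w: (w.lower(), w))

print(bytewise)    # ['Banana', 'Cherry', 'apple', 'banana']
print(dictionary)  # ['apple', 'Banana', 'banana', 'Cherry']
```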

> This of course also includes how things are represented on the screen and
> the keyboard.

Yes, screens will have to display national characters, and keyboards will
have to have keys for them.  I don't mind that, although you'll probably
have to stuff {, |, }, etc. onto keyboards which currently don't have them.

> So you're right, compilers will need to be rewritten. Not only to fit the
> different keyboards, but also the HUMAN BEEINGS behind them.

If the compiler accepts ISO Latin Alphabet No. 1, it won't have a problem.
{, |, } are all in that alphabet.  The only reason to rewrite a compiler
would be to support the 7-bit character sets, and the only reason to do
that would be if ISO Latin Alphabet No. 1, and keyboards which let you type
all the characters of that set you need (i.e., all of the lower 128 ASCII
code points and whichever of the upper 128 code points the languages you
use require), didn't become common.  If we end
up stuck with 7-bit character sets and keyboards which have oA, etc. instead
of {, |, }, etc. rather than keyboards which have them in addition, we'll be
stuck with modifying compilers.  Unfortunately, if that happens, BNF will
have to be rewritten as well, since it uses "|"....
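
To make the 7-bit problem concrete, here is a sketch of what C source and a
BNF rule look like when the same bytes are shown on a terminal using a
national 7-bit set.  It assumes the Swedish/Finnish variant and lists only
the six bracket and brace positions; the table and function names are
invented for the example.

```python
# Partial mapping, Swedish/Finnish 7-bit variant (assumed here): the code
# values that ASCII uses for brackets and braces carry letters instead.
SWEDISH_7BIT = {
    0x5B: "\u00c4",  # [  shows as A with diaeresis
    0x5C: "\u00d6",  # \  shows as O with diaeresis
    0x5D: "\u00c5",  # ]  shows as A with ring above
    0x7B: "\u00e4",  # {  shows as a with diaeresis
    0x7C: "\u00f6",  # |  shows as o with diaeresis
    0x7D: "\u00e5",  # }  shows as a with ring above
}

def as_seen_on_national_terminal(ascii_text):
    """Return the text as the national 7-bit terminal would display it."""
    return ascii_text.translate(SWEDISH_7BIT)

c_source = "if (a[i] || b) { x; }"
bnf_rule = "<digit> ::= 0 | 1"
shown_c = as_seen_on_national_terminal(c_source)
shown_bnf = as_seen_on_national_terminal(bnf_rule)
```

The bytes in the file are unchanged; only the glyphs differ, which is why
the compiler is happy and the programmer is not.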

> And of course it's a very big tail wagging a small dog. The tail is the
> vast majority of the people in the world who don't have English as their
> native language and the dog is those who do.

No, there are many dogs, and the Chinese one is not only bigger than the
Swedish one, it's bigger than the English-speaking one.  Chinese won't even
fit into an 8-bit character set, and lord only knows *how* you sort Chinese
strings!  If you warn people against Anglophone ethnocentrism, beware of
Western ethnocentrism....

On the subject of non-Western language support:

Note that AT&T is offering a version of System V which has been "turned
Japanese".  It supports several two-byte and three-byte character sets; it
mentions JIS C6226 Kanji and JIS C6220 Kana.  (According to Issue 2 of the
System V Interface Definition, Volume 2's section on Future Directions, all
the international character sets used by UNIX will be in conformance with
ISO standard 2022-1982.  It also indicates that ISO Latin Alphabet No. 1 is
DIS 8859/1; I presume DIS is Draft International Standard.)  The brochure
AT&T handed out at UniForum indicates:

	addition of Japanese terminal and input attributes to "terminfo"

	addition of methods for entering Japanese characters, including
	a kana-to-kanji translation mechanism; they indicate two methods
	for entering Japanese characters, an "in-line kana to kanji module",
	whatever that means, and "jvi", which presumably stands for
	"Japanese vi"

	"Utility programs for preparation and maintenance of ESC and
	dictionary.

		o Extended characters font creation program

		o Extended character font load program

		o Dictionary maintenance program"
	(with no indication of what this all means, unfortunately).

	C language changes to support the use of Japanese characters
	in literals and comments - presumably, this just means the
	scanner has been changed to handle 8-bit characters and not
	get tripped up by character sequences, so this compiler presumably
	will be the standard C compiler in future UNIX releases and
	will work in any national environment.

	Changes to some commands to permit the processing of data written
	in Japanese (this, like the C compiler change, is listed as
	"International" rather than "Japanese", so presumably most of
	it will be part of future UNIX releases and will apply to all
	national environments).  The changes include support of 8-bit
	character sets.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.arpa	(yes, really)

craig@dcl-cs.UUCP (Craig Wylie) (02/26/86)

>> And of course it's a very big tail wagging a small dog. The tail is the
>> vast majority of the people in the world who don't have English as their
>> native language and the dog is those who do.
>
>No, there are many dogs, and the Chinese one is not only bigger than the
>Swedish one, it's bigger than the English-speaking one.  Chinese won't even
>fit into an 8-bit character set, and lord only knows *how* you sort Chinese
>strings!  ...

I have seen some English Chinese dictionaries that use the number of strokes
in the character to sort.

Currently trying to find a Chinese Researcher to ask her personally.

Craig.

-- 
UUCP:	 ...!seismo!mcvax!ukc!dcl-cs!craig| Post: University of Lancaster,
DARPA:	 craig%lancs.comp@ucl-cs 	  |	  Department of Computing,
JANET:	 craig@uk.ac.lancs.comp		  |	  Bailrigg, Lancaster, UK.
Phone:	 +44 524 65201 Ext. 4146   	  |	  LA1 4YR
Project: Cosmos Distributed Operating Systems Research Group

ken@rochester.UUCP (Ipse dixit) (02/28/86)

In article <1026@dcl-cs.UUCP> craig@comp.lancs.ac.uk (Craig Wylie) writes:
>I have seen some English Chinese dictionaries that use the number of strokes
>in the character to sort.

There are two common lexicographic orderings for Chinese dictionaries:

1. Sorted by major radical, then by number of strokes. The radicals are
also sorted by number of strokes (I don't remember what to do for
ties).  Thus words to do with "wood" come before words to do with
"metal" because "metal" (really "gold", the king of metals) has more
strokes.

2. The Four Corner Digit method. There are 10 classes of strokes and
the four corners are assigned digits corresponding to the nearest
stroke.  Then one uses the 4 digit number to index into the dictionary.
A fifth digit is sometimes used to disambiguate. A lot like hashing.

I prefer the second method because it is fast (for me, but I was
regarded as a radical [pun intended] when I used a FCD dictionary in
school ages ago).  The disadvantage is that unrelated words will sort
together. For this reason the traditional radical sort is used for
phone books, etc.

	Ken
-- 
UUCP: ..!{allegra,decvax,seismo}!rochester!ken ARPA: ken@rochester.arpa
Snail: Comp. of Disp. Sci., U. of Roch., NY 14627. Voice: Ken!

schwrtze@acf8.UUCP (E. Schwartz group) (03/02/86)

A Chinese-English dictionary in my possession, one that is used by many people
from Taiwan as well as by myself, has four indices to the characters:

	1.  by the traditional radical ordering.  Each radical's characters
	    are sorted in increasing stroke order.
	2.  by strict stroke order.  The characters within each number of
	    strokes are sorted by radical.
	3.  by the well-known ordering of the Chinese phonetic symbols
	    (zhuyin fuhao, or 'bo-po-mo-fo'), subdivided by tone.
	4.  by alphabetic order in the Wade-Giles romanisation, subdivided
	    by tone.

I find that each of the indices has its moments, depending on what properties
of the character I am searching for come most easily to mind.  I would
further submit that the four-corner system is not so commonly used as any
of the above.
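
The first two indices differ only in which field is compared first, which a
short Python sketch makes plain.  The six characters, their radical numbers
(traditional 214-radical numbering: wood is 75, metal is 167) and their
stroke counts are my own choice of sample data, not taken from the
dictionary described above.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "char gloss radical_no total_strokes")

# Sample data (assumed for illustration): traditional forms, with the
# radical number and the total stroke count of each character.
ENTRIES = [
    Entry("鐵", "iron", 167, 21),
    Entry("林", "forest", 75, 8),
    Entry("金", "gold", 167, 8),
    Entry("樹", "tree", 75, 16),
    Entry("本", "root", 75, 5),
    Entry("銀", "silver", 167, 14),
]

# Index 1: by radical, then by stroke count within each radical.
by_radical = sorted(ENTRIES, key=lambda e: (e.radical_no, e.total_strokes))

# Index 2: by stroke count, then by radical within each stroke count.
by_strokes = sorted(ENTRIES, key=lambda e: (e.total_strokes, e.radical_no))

print([e.gloss for e in by_radical])
# ['root', 'forest', 'tree', 'gold', 'silver', 'iron']
print([e.gloss for e in by_strokes])
# ['root', 'forest', 'gold', 'silver', 'tree', 'iron']
```

Either way the sort needs a table lookup per character; the code values
themselves contribute nothing to the order.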

				Alan Shaw
				alan@nyu-alaya.arpa
				cmcl2!alaya!alan

ljdickey@water.UUCP (Lee Dickey) (03/21/86)

> Someone in an earlier article made reference to "the Danish Standard ISO 646
> character set", and to other Scandinavian countries using their national
> versions of ISO 646.  Is ISO 646 the current 7-bit alphabets with {, |, },
> etc. replaced with additional letters?

I think that the writer must mean some Danish character set similar to ISO 646. 

ISO 646 is a well defined object, and is not Danish.
The 94 graphic characters of ISO 646 are not the same as the UK set,
the ASCII set, the NATS set for Danish and Norwegian, and so on.

A few registrations:
	2	ISO 646 (international reference version)
	4	UK Graphics
	6	US Graphics (ASCII)
	8-1	NATS main graphic set (Fin - Swed)
	8-2	NATS additional set (Fin - Swed)
	9-1	NATS main graphic set (Den - Nor)
	9-2	NATS additional set (Den - Nor)
	13	JIS Katakana (JIS 6220-1969)
	68	APL character set for Workspace interchange
	87	JIS Kanji (JIS 6226-1983)
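
As far as I know, these registrations are what ISO 2022 escape sequences
select between.  A small sketch using Python's "iso2022_jp" codec shows two
of them in use; pairing the escape sequences with registrations 6 and 87 is
my reading of the registry, not something stated in the list above.

```python
ESC = b"\x1b"

# Escape sequences as emitted by the iso2022_jp codec.  The registration
# numbers in the comments are an assumption on my part.
DESIGNATE_ASCII = ESC + b"(B"     # registration 6, US graphics (ASCII)
DESIGNATE_JIS_1983 = ESC + b"$B"  # registration 87, JIS 6226-1983

text = "\u6f22\u5b57"  # the two-character Japanese word "kanji"
encoded = text.encode("iso2022_jp")

# Switch to the two-byte set, send two bytes per character, switch back.
print(encoded)  # b'\x1b$B4A;z\x1b(B'
```

Between the escape sequences every byte is an ordinary 7-bit graphic
character, which is how a two-byte set gets through a 7-bit channel.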