[net.internat] character sets

mikeb@inset.UUCP (Mike Banahan) (10/08/85)

Pete Delaney suggests that character sets would be a good place to start.

He's right - it's a horrible area.

The first problem that strikes typical C programmers is how they should
represent characters outside the normal ASCII set. They then start thinking
about using the `top' bit to extend the range of usable characters up to 255.
Somebody throws in a suggestion that the Japanese will want around 7000
(seven thousand) characters, so the next idea is to start using shift
sequences.

In fact, there are a whole bunch of industry `standards' for this sort
of thing. For those of us in Western Europe (including Iceland) who can
get by with 256 characters, this is not a bad solution. Draft International
Standard (DIS) 8859-1 gives us what we think we need. ISO 2022 gives a
suggested set of shifting mechanisms which allow the top 128 characters
of 8859-1 to be switched on the fly, so that in 8 bits I can produce documents
in English, French, German, Icelandic and so on. If I want to throw in some
Greek (whose characters *aren't* in the top half of 8859-1), then
I can use either a locking or a non-locking shift sequence which says `the
top 128 characters are now some other set' and in this way get some Greek
in there.
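
To make the shifting concrete, here's a minimal sketch in C, assuming a
simplified two-set scheme driven by the ISO 2022 locking shifts SO (0x0E)
and SI (0x0F) - the real standard has a good deal more machinery (escape
sequences to designate sets, single shifts, and so on):

	#include <stdio.h>

	#define SO	0x0e	/* locking shift: switch to the alternate set */
	#define SI	0x0f	/* locking shift: back to the default set */

	/*
	 * Walk a byte string, reporting which set each data byte belongs
	 * to.  A real decoder would map the byte through the currently
	 * designated table; here we only track which table is in force.
	 */
	void show_sets(const unsigned char *s, int len)
	{
		int alt = 0;	/* 0 = default set, 1 = alternate set */
		int i;

		for (i = 0; i < len; i++) {
			if (s[i] == SO)
				alt = 1;
			else if (s[i] == SI)
				alt = 0;
			else
				printf("byte 0x%02x from set %d\n", s[i], alt);
		}
	}

	int main(void)
	{
		/* the same byte means two different characters here */
		const unsigned char msg[] = { 'e', SO, 'e', SI, 'e' };

		show_sets(msg, (int)sizeof msg);
		return 0;
	}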

And so it goes on, up to ways of getting 16-bit characters.
In fact, one of the problems here is that there are so many standards
that there aren't any, if you see what I mean.

But there are problems. First, characters aren't fixed-length any more.
You should see what *that* does to C code. A fixed-length array no longer
holds a fixed number of characters, and you can't index into it to find the
nth character, because a byte preceded by a shift code means something else.
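
A sketch of the damage (same SO/SI convention as above): to find the nth
character you must now scan from the start, because shift bytes occupy
storage without occupying character positions - and even the byte you find
can't be interpreted without knowing the shift state in force at that point.

	#define SO	0x0e	/* the locking shifts again */
	#define SI	0x0f

	/*
	 * Return a pointer to the nth character (not the nth byte) of a
	 * shift-encoded buffer, or NULL if there aren't that many.  Plain
	 * s[n] would hand back shift codes and wrongly-shifted bytes alike.
	 */
	const unsigned char *char_at(const unsigned char *s, int len, int n)
	{
		int i;

		for (i = 0; i < len; i++) {
			if (s[i] == SO || s[i] == SI)
				continue;	/* take space, aren't characters */
			if (n-- == 0)
				return &s[i];
		}
		return NULL;	/* not that many characters */
	}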

Toupper() and tolower() have to be warned what the current top half
of the codeset is.
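
For instance, a table-driven version - a sketch, with an invented structure -
has to carry one mapping per loadable top half, since a code like 0xE4 is
a-umlaut in 8859-1 and may be something else entirely in another set:

	#include <limits.h>

	/*
	 * One case-mapping table per loadable `top half'; entries are the
	 * identity wherever a character has no upper-case partner.
	 */
	struct codeset {
		unsigned char toupper_map[UCHAR_MAX + 1];
	};

	/* toupper() with the current codeset passed in explicitly. */
	int toupper_in(int c, const struct codeset *cs)
	{
		if (c < 0 || c > UCHAR_MAX)
			return c;	/* leave EOF and friends alone */
		return cs->toupper_map[c];
	}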

And much, much more.

Moving on from character sets to interpreting their meaning, we tread
on a particularly obnoxious little serpent: Regular Expressions.
This is a famous little problem in its own right, and it is caused
by ranges in REs. If the current codeset doesn't use a consecutive
encoding for the characters in its repertoire, what does
	[a-z]
mean?????
It's more obvious with a concrete example: let's use German and the
convention that <u"> means u with an umlaut. What does the last
regular expression mean? Does it include <u"> or not? Does it really
mean "all alphabetic characters" (in which case does it embrace
Greek alpha through omega?) and if it does, does it include vowels
with an umlaut? If not, do they have to be put in explicitly?
How, if I want to, do I write a regular expression explicitly to match
all alphabetic characters with or without umlauts?
How, with grep, do I find only those lines with at least one umlaut?
This problem rolls on and on and on. It's even better with the kanji
ideographic languages :-).
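
Here's a small demonstration of the trap, assuming DIS 8859-1 codes (where
<u"> is 0xFC): a range is arithmetic on code points, so it quietly excludes
the umlauted vowels, and `alphabetic' has to come from a per-codeset table
rather than from a range:

	#include <stdio.h>

	int main(void)
	{
		unsigned char u_umlaut = 0xfc;	/* <u"> in DIS 8859-1 */

		/* [a-z] done the obvious way is code-point arithmetic: */
		if (u_umlaut >= 'a' && u_umlaut <= 'z')
			printf("[a-z] matches <u\">\n");
		else
			printf("[a-z] does not match <u\">\n");	/* prints */

		/*
		 * So `all alphabetic characters' needs a classification
		 * table, e.g. is_alpha[256] filled in per codeset - and
		 * something else again once shift sequences appear.
		 */
		return 0;
	}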

Collating sequences become very interesting round about now - but that's
a whole article to itself!

Back to character encoding methods. The current AT&T proposals are based
on ISO 2022, in a draft document released to the /usr/group/uk working
party, dated June 24th, 1985. Copies of it, and other relevant literature
received so far, can be obtained by writing to
	Mrs. J. H. Burley,
	Secretary,
	/usr/group/uk,
	8, Chequer Street,
	St. Albans, Herts,
	AL1 3YJ,
	England.
and saying that you wish to be put on the Internationalisation mailing list.

For my own part, I believe that the discussion of how to encode stuff is
premature. I think that it is more important to find out what `characters'
the users want first. If a solution that cannot easily handle such features
as `all european and asian characters in different fonts and point sizes'
is proposed, yet the users want exactly those features, then we have let them
down. If they say that they have got used to working in English and don't
want anything different, then there is no point in changing.

We know for a fact, though, that the latter, English only, is already not
an option. The time may have come for a much more radical solution, with
an abstract object-oriented view of character handling. I am personally
convinced that it has, and have prepared a paper on the topic for those who
wish to see it. It is a little large to post to the news net, but I will
mail it to those who want to see it. (It uses pic for the diagrams; sorry.)

Please, let's see some real debate on these topics. THEY MATTER.
Internationalisation may be the next big hurdle for computers to overcome.
Users want to use their own language and characters; the technical
problems are fascinating and the market opportunities immense!
-- 
Mike Banahan, Technical Director, The Instruction Set Ltd.
mcvax!ukc!inset!mikeb

meissner@rtp47.UUCP (Michael Meissner) (10/11/85)

In article <719@inset.UUCP> mikeb@inset.UUCP (Mike Banahan) writes:
>
>The first problem that strikes typical C programmers is how they should
>represent characters outside the normal ASCII set. They then start thinking
>about using the `top' bit to extend the range of usable characters up to 255.
>Somebody throws in a suggestion that the Japanese will want around 7000
>(seven thousand) characters, so the next idea is to start using shift
>sequences.
>
>	...
>
>But there are problems. First, characters aren't fixed-length any more.
>You should see what *that* does to C code. A fixed-length array no longer
>holds a fixed number of characters, and you can't index into it to find the
>nth character, because a byte preceded by a shift code means something else.
>

I don't know much about all the ramifications, but I think not having
fixed-length characters would be horribly expensive.  I think that the best
solution would be a new character type, which can hold all of the glyphs
that anybody (not just Western Europe & the USA) needs to use.  I would
think that something on the order of 4 octets (32 bits) should be able to
hold all of the information, complete with font/size.  I would think that
the current ISO eight-bit encoding for Europe/USA would be used if the
upper 3 octets were zero, and that it would be easy to isolate font info
via masking.
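
Something along these lines, say - a sketch only, with the field widths a
guess at a plausible split:

	#include <stdint.h>

	typedef uint32_t bigchar;	/* 32-bit character */

	#define CODE_MASK	0x0000ffffUL	/* which character  */
	#define FONT_MASK	0x00ff0000UL	/* which font       */
	#define SIZE_MASK	0xff000000UL	/* which point size */

	/* Upper three octets zero => a plain ISO eight-bit character. */
	int is_plain_iso(bigchar c)
	{
		return (c & ~(bigchar)0xff) == 0;
	}

	/* Font information isolated by masking, as suggested. */
	unsigned font_of(bigchar c)
	{
		return (unsigned)((c & FONT_MASK) >> 16);
	}

	bigchar make_bigchar(unsigned code, unsigned font, unsigned size)
	{
		return (bigchar)(code & 0xffffU)
		    | ((bigchar)(font & 0xffU) << 16)
		    | ((bigchar)(size & 0xffU) << 24);
	}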

	Michael Meissner
	Data General

tmb@talcott.UUCP (Thomas M. Breuel) (10/13/85)

In article <214@rtp47.UUCP>, meissner@rtp47.UUCP (Michael Meissner) writes:
> In article <719@inset.UUCP> mikeb@inset.UUCP (Mike Banahan) writes:
> >Somebody throws in a suggestion that the Japanese will want around 7000
> >(seven thousand) characters, so the next idea is to start using shift
> >sequences.
> I don't know much about all the ramifications, but I think not having
> fixed-length characters would be horribly expensive.  I think that the best
> solution would be a new character type, which can hold all of the glyphs
> that anybody (not just Western Europe & the USA) needs to use.  I would
> think that something on the order of 4 octets (32 bits) should be able to
> hold all of the information, complete with font/size.  I would think that
> the current ISO eight-bit encoding for Europe/USA would be used if the
> upper 3 octets were zero, and that it would be easy to isolate font info
> via masking.

This would blow up all your text and many of your data files by a factor
of *four*. Mass storage may be cheap, but it isn't that cheap. Huffman
encoding would, of course, help, but Huffman-encoded files can only be
read sequentially, and writing data into the middle of one requires a lot
of work.

Honestly, I don't see why any European or American customer should pay
for facilities that he isn't using, and I don't think that any computer
company can afford to sell computers that, even though they have roughly
the same specifications as their competition's, only have one fourth of
the performance.

						Thomas.

peter@graffiti.UUCP (Peter da Silva) (10/16/85)

7000 Japanese characters... hmmm...

How about using the 8th bit set to indicate that this byte and the following
one encode one of 32767 extended characters? (1xxxxxxx 00000000 is illegal
here.)

In ASCII text file or stream:

	Normal ASCII character: 0xxxxxxx
	Foreign character:	1xxxxxxx xxxxxxxx

In memory or foreign file:

	Normal ASCII character:	00000000 xxxxxxxx
	Foreign character:	1xxxxxxx xxxxxxxx
	Null:			00000000 00000000
	Two ASCII characters:	0xxxxxxx 0xxxxxxx

So an ASCII text file is a compressed form of the foreign file. The ASCII
character pair should probably be used with caution; maybe it should just
be undefined.
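
For what it's worth, here is a sketch of the stream-side rule as I read it
(names invented; note that with a zero second byte illegal there are
128 x 255 = 32640 distinct extended codes):

	#include <stdio.h>

	/*
	 * Read one character from a byte stream:
	 *	0xxxxxxx		one normal ASCII character
	 *	1xxxxxxx xxxxxxxx	one extended character (15 payload
	 *				bits; a zero second byte is illegal)
	 * Returns the character value, or -1 on end of file or error.
	 */
	long get_xchar(FILE *fp)
	{
		int b1, b2;

		if ((b1 = getc(fp)) == EOF)
			return -1;
		if ((b1 & 0x80) == 0)
			return b1;	/* plain ASCII */
		if ((b2 = getc(fp)) == EOF || b2 == 0)
			return -1;	/* truncated, or illegal zero byte */
		return ((long)b1 << 8) | b2;	/* the in-memory form */
	}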

If this suggestion has already been made (and it probably has; it seems a
pretty obvious way of doing things to me...), just pretend I'm not here.