[comp.lang.c] Programming and international character sets.

kjartan@rhi.hi.is (Kjartan R. Gudmundsson) (10/28/88)

How difficult is it to convert American/English programs so that they can
be used to handle foreign text? The answer of course depends on the language
one has in mind. In Europe most nations use the Latin alphabet, and English
is one of them. Unfortunately English uses very few characters compared to
other European languages; the code set that is widely used by Americans
and the English, the ASCII character set, therefore only defines 128 characters.
It is a 7-bit character set. In European countries other than England
the ASCII character set is also widely used, but with extensions.
The character set is 8 bits, thus allowing 256 characters.
The problem, however, is that the extensions are not standard.
We have one possibility in the IBM PC character set, another from HP called
Roman-8, DEC gives us the DEC Multinational Character Set, and the Macintosh
has yet another. So if we have a program that, for example, converts lowercase
letters to uppercase, it has to be coded differently for each character
set.

Let's look at some code from MicroEMACS:

input.org:        if (c>=0x00 && c<=0x1F)
input.org:        if (c>=0x00 && c<=0x1F)                 /* C0 control -> C-     */
main.org:				case 'a':	/* process error file */
main.org:        if ((c>=0x20 && c<=0xFF)) {	/* Self inserting.      */
random.org:		if (*scan >= 'a' && *scan <= 'z')
random.org:                else if (c<0x20 || c==0x7F)
random.org:                else if (c<0x20 || c==0x7F)
region.org:                                lputc(linep, loffs, c+'a'-'A');
region.org:                                lputc(linep, loffs, c-'a'+'A');
region.org:                        if (c>='a' && c<='z')
search.org:		else if (c < 0x20 || c == 0x7f)	/* control character */
word.org:					c += 'a'-'A';
word.org:				c += 'a'-'A';
word.org:				c -= 'a'-'A';
word.org:				c -= 'a'-'A';
word.org:			if (c>='a' && c<='z') {
word.org:			if (c>='a' && c<='z') {
word.org:		wordflag = ((ch >= 'a' && ch <= 'z') ||
word.org:	if (c>='a' && c<='z')
 

Ugly isn't it?

Another way of doing this is to use the "is..." functions that are
defined in ctype.h, an include file that comes with (almost) all C compilers.
Some of the above lines would look like this:

basic.c:                else if (iscntrl(c))
display.c:			if (iscntrl(c))
display.c:	} else if (iscntrl(c)) {
eval.c:			*sp = tolower(*sp);
eval.c:			*sp = toupper(*sp);
eval.c:		if (islower(*sp) )
fileio.c:	  if (iscntrl( fn[tel++] ) )
input.c:				if (iscntrl(buf[--cpos]) ) {
input.c:				if (iscntrl(buf[--cpos])) {
input.c:			c = toupper(c);
input.c:			c = toupper(c);			/*Force to upper */
input.c:		if ( islower(c) && ( SPEC != (SPEC & c) ))
input.c:	        if (iscntrl(c) )		/* control key */
input.c:	        if (iscntrl(c) )		/* control key */
input.c:	        if (iscntrl(c) )		/* control key? */

This code is better (most of the is... things are macros that mask
the argument and return a binary mask that is either zero or positive),
has more style to it, and is easier to port to a different character set.

Another bad habit of American programmers is this:
character_value = (character_value & 0x7F)
Don't do this!! If you must, you can use 0xFF instead:
character_value = (character_value & 0xFF)
(Unless of course your machine breaks to pieces if it gets
an 8-bit character in its I/O channels.)
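
As a concrete illustration of the difference, here is a minimal sketch (not
taken from MicroEMACS or any other real program; the helper name
upcase_buffer is made up) of uppercasing a buffer without assuming 7-bit
ASCII:

#include <ctype.h>

/*
 * Masking with 0x7F would silently fold 8-bit characters (e.g. ISO 8859
 * accented letters) into the ASCII/control range.  Casting to unsigned
 * char and letting the <ctype.h> macros decide keeps the code valid for
 * whatever character set the C library was built for.
 */
void upcase_buffer(char *buf, int len)
{
    int i;

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char) buf[i];   /* avoid sign extension */

        if (islower(c))
            buf[i] = (char) toupper(c);
    }
}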

###############################################################################
#                                     #
#	Kjartan R. Gudmundsson        #     
#	Raudalaek 12                  #     
#	105 Reykjavik                 #     Internet:  kjartan@rhi.hi.is      #
#                                     #     uucp:  ...mcvax!hafro!rhi!kjartan #
#                                     #                                       #
###############################################################################

gwyn@smoke.BRL.MIL (Doug Gwyn ) (10/31/88)

In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>How difficult is it to convert American/English programs so that they can 
>be used to handle foreign text? [etc.]

Where have you been the last few years?  This subject area is known as
"internationalization" and has been the featured topic of special issues
of several journals, including UNIX Review and UNIX/World.  The draft
proposed ANSI/ISO C standard specifically addresses this issue (it is
one of the reasons production of the final standard was delayed).

ok@quintus.uucp (Richard A. O'Keefe) (10/31/88)

In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>The problem, however, is that the extensions are not standard.
There is an international standard for 8-bit character sets: ISO 8859.
There are several versions of 8859, just as there were several national
versions of ISO 646 (of which ASCII was only one).  All versions include
ASCII as the bottom half.  ISO Latin 1 (8859/1) is pretty close to DEC's
Multinational Character Set, and is supposed to cover most West European
languages (including Icelandic).  There is a Cyrillic version (I think it
is 8859/2) and others are under way.

>Another bad habit of American programmers is this:
>character_value = (character_value & 0x7F)
>Don't do this!!  If you must, you can use 0xFF instead:
>character_value = (character_value & 0xFF)

The only time when I've wanted to do this is when stripping off a parity
bit, and using 0xFF would be totally wrong.  The toascii() macro *might*
be appropriate.  When you're dealing with a 7 data + 1 parity bit device,
there is no point in pretending that you're prepared to accept anything
other than 7 data bits.

The real problem is trying to write portable code that uses character
classes which _aren't_ in <ctype.h>.  Consider isvowel()...
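
To spell out the isvowel() example (this is an illustration written for this
discussion, not a routine from any existing library), the obvious version is
easy for ASCII English and hopeless for anything else:

#include <string.h>

/*
 * Fine for plain ASCII text, but it has no idea that an accented 'e' (or
 * any other 8-bit vowel from ISO 8859 or a vendor set) is also a vowel,
 * and <ctype.h> offers no locale-dependent way to express such a class.
 */
int isvowel(int c)
{
    return c != 0 && strchr("aeiouAEIOU", c) != NULL;
}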

nwd@j.cc.purdue.edu (Daniel Lawrence) (10/31/88)

In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>
>How difficult is it to convert American/English programs so that they can 
>be used to handle foreign text? The answer of course depends on the language
	[a description of some of the problems using 8 bit chars]
>
>Let's look at some code from MicroEMACS:
>
	[a code excerpt from MicroEMACS 3.9]
>Ugly isn't it?
>

	Ok, I am feeling a little picked on here... a lot of people like
using uEMACS to point out things like this.  When I first started
working with it, it was just for me.  But that is really no excuse... 

>Another way of doing this is to use the "is..." functions that are
	[an alternative which is better]
>This code is better (most of the is... things are macros that mask
	[More descriptions of 8 bit problems...]

	And someone finally proposes some solutions rather than just
blindly stabbing out and complaining.  During the last round of complaints I
sent out a request for information on this problem, and the best I got
back was... go to the library and do some research.  Well, for a project I
am doing in my spare time, considering the poor library system around
here, I really wasn't happy to hear all the griping and then get no help
from anyone to fix the problems.  So I applaud Mr. Gudmundsson for his
mail.

>#	Kjartan R. Gudmundsson        #     
>#	Raudalaek 12                  #     
>#	105 Reykjavik                 #     Internet:  kjartan@rhi.hi.is      #


	However, after the last round, I thought carefully about the 8-bit
problems, and resolved that the issue was too complex on a language-by-
language basis for me to ever attempt to get all the case mappings
correct.  So when you see the next version of MicroEMACS, it will have
a user-changeable upper/lowercase mapping function (which is working
right now).  Note: this slows down the regular pattern matching code
considerably, so uEMACS can be compiled with the diacritical handling
(un-American in this case) turned off, but both options now exist.
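
A rough sketch of the general idea (this is not the code that will ship in
MicroEMACS; the names initmaps, setupmap and upperc are made up here):

/* start with plain ASCII mappings; a user command can later overwrite
   individual entries to teach the editor about 8-bit national letters */
static unsigned char upmap[256];

void initmaps(void)
{
    int i;

    for (i = 0; i < 256; i++)
        upmap[i] = (unsigned char) i;
    for (i = 'a'; i <= 'z'; i++)            /* default: ASCII a-z only */
        upmap[i] = (unsigned char) (i - 'a' + 'A');
}

void setupmap(int lower, int upper)         /* bound to a user command */
{
    upmap[lower & 0xFF] = (unsigned char) upper;
}

#define upperc(c)  (upmap[(unsigned char) (c)])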

			Daniel Lawrence		(317) 742-5153
			UUCP:	{pur-ee!}j.cc.purdue.edu!nwd
			ARPA:	nwd@j.cc.purdue.edu
			FIDO:	1:201/10 The Programmer's Room (317) 742-5533

nelson@sun.soe.clarkson.edu (Russ Nelson) (11/01/88)

In article <8045@j.cc.purdue.edu> nwd@j.cc.purdue.edu (Daniel Lawrence) writes:

   In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
   >
   >How difficult is it to convert American/English programs so that they
   >can be used to handle foreign text? The answer of course depends
   >on the language

   So when you see the next version of MicroEMACS, it will have
   a user-changeable upper/lowercase mapping function (which is working
   right now).

Same for Freemacs.  I also used to take over the keyboard interrupt (INT 9),
but some of the international users complained that it broke their keyboard
mapper (not to mention the fact that it lost with TSRs), so I took it out.
--
--russ (nelson@clutx [.bitnet | .clarkson.edu])
To surrender is to remain in the hands of barbarians for the rest of my life.
To fight is to leave my bones exposed in the desert waste.

guy@auspex.UUCP (Guy Harris) (11/01/88)

>Another bad habit of American programmers is this:

And another one is using the 8th bit internally for quoting; don't do that.
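
A hypothetical illustration of the habit being warned against (not code from
any particular program): marking "quoted" characters internally by setting
the top bit works only as long as no real 8-bit text ever arrives.

#define QUOTED     0x80
#define markq(c)   ((c) | QUOTED)        /* loses information for 8-bit input  */
#define isq(c)     (((c) & QUOTED) != 0) /* true for every accented letter too */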

guy@auspex.UUCP (Guy Harris) (11/01/88)

>There is a Cyrillic version (I think it is 8859/2)

No, 8859/2 is another Latin set; there are four Latin alphabets
(8859/[1234], I think), and there seem to be at least drafts for Greek
and Cyrillic.

>The only time when I've wanted to do this is when stripping off a parity
>bit, and using 0xFF would be totally wrong.  The toascii() macro *might*
>be appropriate.  When you're dealing with a 7 data + 1 parity bit device,
>there is no point in pretending that you're prepared to accept anything
>other than 7 data bits.

Except that most devices can be *told* to handle 8 bits; never assume
that when you're dealing with a terminal that you're dealing with a 7
data + 1 parity bit device (unless your software deals *only* with one
specific terminal that *can't* generate 8 bits).

>The real problem is trying to write portable code that uses character
>classes which _aren't_ in <ctype.h>.  Consider isvowel()...

Or, for that matter, consider "toupper()"; what's "toupper()" of a
German "ss" (or is it "sz") character?

dik@cwi.nl (Dik T. Winter) (11/01/88)

In article <362@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
 > >There is a Cyrillic version (I think it is 8859/2)
 > 
 > No, 8859/2 is another Latin set; there are four Latin alphabets
 > (8859/[1234], I think), and there seem to be at least drafts for Greek
 > and Cyrillic.
Cyrillic is 8859/5 and approved. /6 is Arabic, /7 Greek and /8 Hebrew;
these three are still drafts (as are /3 and /4).
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax

robert@pvab.UUCP (Robert Claeson) (11/01/88)

In article <8804@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn ) writes:

> In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:

> >How difficult is it to convert American/English programs so that they can 
> >be used to handle foreign text? [etc.]

> Where have you been the last few years?  This subject area is known as
> "internationalization" and has been the featured topic of special issues
> of several journals, including UNIX Review and UNIX/World.

The fact is, internationalization hasn't shown up in any product I know
of (except for HP's NLS, but few use it). People are still just *talking*
about internationalization, but not *doing* it.
-- 
Robert Claeson, ERBE DATA AB, P.O. Box 77, S-175 22 Jarfalla, Sweden
Tel: +46 758-202 50   Fax: +46 758-197 20
EUnet: rclaeson@erbe.se   ARPAnet: rclaeson%erbe.se@uunet.uu.net

mark@jhereg.Jhereg.MN.ORG (Mark H. Colburn) (11/02/88)

In article <8804@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>>How difficult is it to convert American/English programs so that they can 
>>be used to handle foreign text? [etc.]
>
>Where have you been the last few years?  This subject area is known as
>"internationalization" and has been the featured topic of special issues
>of several journals, including UNIX Review and UNIX/World.  The draft
>proposed ANSI/ISO C standard specifically addresses this issue (it is
>one of the reasons production of the final standard was delayed).

Unfortunately, the C standard is still lacking in this area.  It is true
that the attempt was made; however, X3J11 will have to go through another
round if it is to be truly internationalized.

One problem is that, although the standard supports multi-byte characters,
which are required for a number of languages around the world, especially
those in Asia, no support is provided to pass those characters to any of
the is...() or to...() functions.  Since all the is...() and to...()
functions take an integer parameter, it would be impossible to evaluate
a multi-byte character.

Another problem is that an application has no way of portably
determining where the current character in a string ends and the next
begins; you can't just use ch++ to advance to the next character anymore.
And it is even harder to move backwards through a string.

There are some other problems with collation as well: some languages may
have several lowercase characters corresponding to a single uppercase
character, or vice versa.  This presents some problems when using toupper()
and tolower() to convert a character to its opposite case.  In addition, in
some languages and/or collation sequences there are some characters which
do not have a corresponding opposite case (i.e. there is only an uppercase
character with no corresponding lowercase character in the code set).

To be fair, we did not uncover these deficiencies until just recently (just
after we sent our ballot in for the third public review), so these may not
have been issues specifically addressed by the committee.

There are some solutions to these problems, which would allow for 
internationalization without breaking any existing programs.  Here are some
suggestions:

  1.  Develop some functions which provide the same functionality as the
      is...() functions but which take a character pointer as an argument.
      For example:

	int	wcislower(char *string)


  2.  Develop some functions which provide the same functionality as the
      to...() functions but which return a character pointer.  Unfortunately,
      these functions may need to allocate space in order for the
      transformation to work, or they may need to pass back a pointer to a
      static string which would then need to be copied.  The latter is
      probably the way most implementations would do it since it is
      essentially a table lookup.  For example:

	char   *wctolower(char *string)


  3.  Provide some functions to allow traversing a character string.  These
      functions would return a pointer to the next character in the string
      as determined by the current locale.  For example:

	char   *nextchar(char *string)
	char   *prevchar(char *string, char *backup)

      These last two functions were presented at the latest IEEE POSIX meeting
      by one of the committee members to cope with this problem.  The backup
      string in prevchar() provides a pointer to a known character boundary
      that the function can use to scan forward in the string in order to
      determine where the actual character boundary of the previous character
      is.  (A sketch of suggestions 3 and 4 follows this list.)


  4.  Some of the string functions would need to be revised as well,
      specifically strlen().

	int	wcstrlen(char *string)

      This function would return the string length of the current string
      according to the current locale setting.  Therefore the string "abss"
      would give a length of 4 in the C locale, but may return 3 in a
      German locale.  The functionality of this could be put in the current
      strlen; however, there are still requirements to get the number of
      bytes in a string, as well as the number of characters, so the old
      strlen should not be replaced.
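
Here is a hedged sketch of how suggestions 3 and 4 might be built on the
draft ANSI C mblen() function; these are illustrations only, not the code
actually presented to the committees, and wcstrlen() as sketched counts
multi-byte characters (the German "ss" example above involves collating
elements, which would need locale tables beyond what mblen() provides):

#include <stdlib.h>     /* mblen(), MB_CUR_MAX */

char *nextchar(char *string)
{
    int len = mblen(string, MB_CUR_MAX);

    return (len > 0) ? string + len : string + 1;
}

/* scan forward from the known boundary "backup", since a multi-byte
   encoding cannot be stepped through backwards */
char *prevchar(char *string, char *backup)
{
    char *prev = backup;

    while (backup < string) {
        prev = backup;
        backup = nextchar(backup);
    }
    return prev;
}

int wcstrlen(char *string)
{
    int n = 0;

    while (*string != '\0') {
        string = nextchar(string);
        n++;
    }
    return n;
}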

Internationalization is a tricky and involved problem.  Unfortunately it is
not possible for an existing program to simply recompile under an ANSI
compiler and become internationalized.  A number of changes to the application are
required in order to provide for maximally portable code.  However, it is 
possible to provide the internationalization without breaking any existing 
code.

What has been discussed so far is character level internationalization, 
which is only one side of the fence.  The other side is language translation 
of strings.  This is known as "messaging" in the circles which talk about 
internationalization (let's overload yet another computer science term...).  
However, messaging can be accomplished by developing messaging libraries which 
contain the strings required by the application, translated into every 
language which your application needs to support.  When you wish to display a 
string, such as "press spacebar to continue" you call the messaging library 
with a unique identifier which is associated with your string, and the 
messaging library returns a string, based on the current local, which depicts 
the same idea as "press space bar to continue".

This also requires some fancy footwork on the part of applications, since
displaying these messages is bound to be very difficult: some
languages read left-to-right, some read right-to-left, and some, such as
Mongolian, do both and even go diagonally.  Add string attributes such as
centering and justification, and character attributes such as inverse, normal
and blinking, and messaging becomes very interesting indeed.

Internationalization is a relatively new field, and a number of things
still need to be ironed out, but I think that we are making progress, and
that progress should continue.  
-- 
Mark H. Colburn                  "They didn't understand a different kind of 
NAPS International                smack was needed, than the back of a hand, 
mark@jhereg.mn.org                something else was always needed."

gauss@homxc.UUCP (E.GAUSS) (11/02/88)

In article <8804@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn ) asks:

> >How difficult is it to convert American/English programs so that they can 
> >be used to handle foreign text? [etc.]
 
> In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson)
replies:

> Where have you been the last few years?  This subject area is known as
> "internationalization" and has been the featured topic of special issues ...

An author friend that I work with, Eb Colville, has been trying for a
number of years to find a VI editor that will handle the German characters
available in the extended ASCII characters on his MS-DOS PC.  He used those
in his novel, THE LAST ZEPPELIN, which is trying to find a publisher.  Whatever
the talk, it does not seem to be possible to do this.  Extended ASCII
requires the full eight bits to be available, and all VI's that we have
seen simply toss away the lead bit, folding umlauted characters into
control characters.  We ended up writing a filter so that Eb types u/e
when he wants an umlauted e, and just before printing we run his text
through the filter, which replaces it by the appropriate extended ASCII
character.  (It also unfolds the folded characters, but that is risky, as
you cannot have any control characters hidden in your text.)  If your
word processor does not balk at eight-bit characters, this is a workable
way of putting the characters in in the first place.  Eb has been
asking about Cyrillic (Russian) characters for his next novel, BEYOND THE
YUKON, and I have refused even to discuss this with him.

There are methods for doing Japanese where the keyboardist types in
"Romaji" and the computer makes a guess at the kanji.  I told Eb
that if he has any plans to try Japanese word processing he will have
to go to Japan.

Ed Gauss

bill@twwells.uucp (T. William Wells) (11/02/88)

In article <362@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
: >The real problem is trying to write portable code that uses character
: >classes which _aren't_ in <ctype.h>.  Consider isvowel()...
:
: Or, for that matter, consider "toupper()"; what's "toupper()" of a
: German "ss" (or is it "sz") character?

The character is the "Scharfes S" and looks something like a beta.  It
doesn't have an upper case form.  When written where an upper case
letter should go, it is written unchanged.  Alternately, it might
be written "SS".

---
Bill
{uunet|novavax}!proxftl!twwells!bill

ok@quintus.uucp (Richard A. O'Keefe) (11/02/88)

In article <7690@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
[about variants of ISO 8859]

I know about ISO 8859/1 because someone gave me a copy, and about
8859/5 because someone posted a draft on comp.std.internat (though
I managed to lose that in a disc scavenge).

How would someone in the USA go about getting a copy of each of these
standards and draft standards?

guy@auspex.UUCP (Guy Harris) (11/02/88)

In article <4002@homxc.UUCP> gauss@homxc.UUCP (E.GAUSS) writes:

(And misattributes both quotes - that's why I don't like the "In article
..., ... writes:" lines)

>An author friend that I work with, Eb Colville, has been trying for a
>number of years to find a VI editor that will handle the German characters
>available in the extended ASCII characters on his MS-DOS PC.  He used those
>in his novel, THE LAST ZEPPELIN, which is trying to find a publisher.  Whatever
>the talk, it does not seem to be possible to do this.  Extended ASCII
>requires the full eight bits to be available, and all VI's that we have
>seen simply toss away the lead bit, folding umlauted characters into
>control characters.

The "vi" in System V Release 3.1 handles 8-bit characters. 
Unfortunately, I don't know if anybody's ported it to MS-DOS....

Also, some version of Unipress EMACS can be configured to support 8-bit
characters as well (I don't know if that version has been released yet
or not).

>There are methods for doing Japanese where the keyboardist types in
>"Romaji" and the computer makes a guess at the kanji.

The ones I've seen convert Romaji to Kana as you type (this is, as I
understand it, a straightforward translation) and then permit you to
request that the computer translate the Kana you typed since the last
checkpoint (switching mode into Kanji mode, or asking for a
Kana-to-Kanji translation) into Kanji.  It gives you a list of the
possible translations, and lets you choose which one you want.

Of course, now you'd need an editor that handles *16*-bit characters; I
think AT&T has a "vi" that will handle them, and I don't know about
EMACS (although I remember an #ifdef in the aforementioned Unipress
version for Kanji).

guy@auspex.UUCP (Guy Harris) (11/02/88)

>This also requires some fancy footwork on the part of applications, since
>displaying these messages is bound to be very difficult: some
>languages read left-to-right, some read right-to-left, and some, such as
>Mongolian, do both and even go diagonally.  Add string attributes such as
>centering and justification, and character attributes such as inverse, normal
>and blinking, and messaging becomes very interesting indeed.

And then add, on top of that, the fact that computer displays are often
graphical, not just text.  For instance, you might have an array of
"buttons" on the screen; the layout of the panel might differ depending
on whether the buttons are filled up with short English words or
typicalverylongGermanwords.

weemba@garnet.berkeley.edu (Matthew P Wiener) (11/02/88)

Hey folks, I think you're all wonderful, but could you take this
discussion out of comp.emacs unless it's Emacs related?  There's
even a comp.std.internat newsgroup.

ucbvax!garnet!weemba	Matthew P Wiener/Brahms Gang/Berkeley CA 94720
"Nil sounds like a lot of kopins! I never got paid nil before!" --Groo

ok@quintus.uucp (Richard A. O'Keefe) (11/02/88)

In article <207@jhereg.Jhereg.MN.ORG> mark@jhereg.MN.ORG (Mark H. Colburn) writes:
>In article <8804@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>>In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>>>How difficult is it to convert American/English programs so that they can 
>>>be used to handle foreign text? [etc.]

Xerox have supported a 16-bit character set (XNS) for years.
Some of the surprises mentioned by Mark Colburn have been no news
to Interlisp-D programmers for a long time.

The kludges being proposed for C & UNIX just so that a sequence of
"international" characters can be accessed as bytes rather than pay
the penalty of switching over to 16 bits are unbelievable.

george@mnetor.UUCP (George Hart) (11/03/88)

In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>
>How difficult is it to convert American/English programs so that they can 
>be used to handle foreign text?

If you just need to handle full 8 bit characters, it is merely painful.
If you need to handle multibyte characters (e.g. Kanji) or a mix of
character sets, it is excruciating.

>In European countries other than England
>the ASCII character set is also widely used, but with extensions.
>The character set is 8 bits, thus allowing 256 characters.
>The problem, however, is that the extensions are not standard.

There is, of course, the ISO 8859 family of 8 bit character sets which
contain ASCII as a perfect subset.

> 	< excerpts of MicroEmacs code >
>
>Ugly isn't it?

Yes.  vi and the Bourne shell were (are) other offenders.  I believe recent
releases of SysV have cleaned up the naughty uses of the 8th bit.

> < sample ctype.h invocations >
>
>This code is better (most of the is... things are macros that mask
>the argument and return a binary mask that is either zero or positive),
>has more style to it, and is easier to port to a different character set.

Unfortunately, the results of the macros are undefined unless isascii(c)
is positive, which sort of defeats the spirit of what you intend.  Of course,
you could develop an 8-bit ctype.h compatible with a particular 8-bit
character set.
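
As an illustration of that caveat (the guard macro here is hypothetical, not
part of any standard header): on a traditional ctype implementation the
macros may index a table with the raw value, so 8-bit or sign-extended
characters have to be screened first.

#include <ctype.h>

#define isok(c)         isascii((unsigned char) (c))
#define safe_islower(c) (isok(c) && islower((unsigned char) (c)))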

>Another bad habit of American programmers is this:
>character_value = (character_value & 0x7F)

This has more to do with assumptions about character sets supported
by the system than nationality.

Historically, assuming an ASCII environment was not unreasonable.  While
this is no longer true, until vendors and standards bodies get off their
collective pots and develop practical character sets and conventions for
multilingual environments (including multibyte characters), things will
remain confused, fragmented, and incompatible.
-- 
Regards.....George Hart, Computer X Canada Ltd.

UUCP: {utzoo,uunet}!mnetor!george
BELL: (416)475-8980

gwyn@smoke.BRL.MIL (Doug Gwyn ) (11/03/88)

In article <207@jhereg.Jhereg.MN.ORG> mark@jhereg.MN.ORG (Mark H. Colburn) writes:
>Unfortunately, the C standard is still lacking in this area.  It is true
>that the attempt was made, however, X3J11 will have to go through another
>round if it is to be truly internationalized.

Not really; all the issues you raised in your posting were already addressed
in arriving at the current draft proposed C Standard, which had the assistance
and approval of several specialists in internationalization.  I don't want to
try to discuss the details here; however, I will remark that the wchar_t type
is NOT intended to be used in all the contexts where a character would
normally be used in an 8-bit environment.  ITSCJ indicated that the functions
provided were sufficient.  Others can of course be provided as extensions but
were not felt to be sufficiently important to standardize.

P.S. This is not to be taken as an official X3J11 statement!

henry@utzoo.uucp (Henry Spencer) (11/03/88)

In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>The problem, however, is that the extensions are not standard.
>We have one possibility in the IBM PC character set, another from HP called
>Roman-8, DEC gives us the DEC Multinational Character Set, and the Macintosh
>has yet another...

ISO Latin 1 will probably supersede all of these, pretty much solving that
problem.  (If you haven't heard of it, and want to know what it's like,
it's pretty similar to DEC's multinational set.)
-- 
The Earth is our mother.        |    Henry Spencer at U of Toronto Zoology
Our nine months are up.         |uunet!attcan!utzoo!henry henry@zoo.toronto.edu

gwyn@smoke.BRL.MIL (Doug Gwyn ) (11/03/88)

In article <621@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>The kludges being proposed for C & UNIX just so that a sequence of
>"international" characters can be accessed as bytes rather than pay
>the penalty of switching over to 16 bits are unbelievable.

From time to time I remind people that "byte" does not imply 8 bits.
There is nothing in the proposed C standard that precludes an
implementation choosing to use 16 bits for its character types,
and/or providing "stub" functions for the locale and wide-character
stuff.  The main reason all the extra specification for multibyte
character sequences is present is that a majority of vendors already
had decided to take such an approach as opposed to the much cleaner
method of allocating sufficiently wide data to handle all relevant
code sets.  To accommodate existing approaches, it was necessary to
come up with adequate specifications, which has been done.

The main problem we face with 16-bit chars is that a majority of
X3J11 insisted that sizeof(char)==1, so the smallest C-addressable
unit (i.e. "byte") is necessarily the same size as char.  Thus, in
an implementation based on an 8-bit byte-addressable architecture,
if individual byte accessibility is desired in C, the implementation
must necessarily make chars 8 bits, and if large code sets are
necessary, then it HAS to use multibyte sequences for them.

karl@haddock.ima.isc.com (Karl Heuer) (11/03/88)

In article <207@jhereg.Jhereg.MN.ORG> mark@jhereg (Mark Colburn) writes:
>One problem is that, although the [C] standard supports multi-byte
>characters ..., no support is provided [for using them with <ctype.h>].
>Here are some suggestions:
>	int     wcislower(char *string)
>	char   *wctolower(char *string)

It's not necessary to pass/return pointers; there is an arithmetic type
"wchar_t" in ANSI C.  Thus it would be simpler to define these as follows:
	int     wcislower(wchar_t wc);
	wchar_t wctolower(wchar_t wc);

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
Followups to comp.std.{internat,c}.

scjones@sdrc.UUCP (Larry Jones) (11/03/88)

In article <207@jhereg.Jhereg.MN.ORG>, mark@jhereg.Jhereg.MN.ORG (Mark H. Colburn) writes:
> [a number of misconceptions about draft ANSI C multi-byte chars:
> you can't pass them to is*() or to*(), can't tell how long they
> are, can't walk through arrays of them conveniently, etc. and
> proposes cluttering up the library with a bunch of new functions
> to handle them "correctly"]

You seem to have missed a key point in the internationalization
stuff - you don't use multi-byte characters directly, you convert
them into wchar_t's using the functions in sections 4.10.7 and
4.10.8.  wchar_t is an integral type (probably short or int) that
is large enough to hold ANY character value.

For example, the char 'A' might convert to a wchar_t value of 65
and a multi-byte sequence representing a Japanese character would
convert to a wchar_t value of 12345.  Since wchar_t's are all the
same size, you can have an array of them that you walk through with
pointers just like you're used to doing with char arrays.
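
A rough sketch of that approach, using the draft ANSI C conversion function
mbstowcs() (the fixed buffer size and the helper name count_wc are made up
for illustration):

#include <stdlib.h>

/* count how many times the wide character wc occurs in the multi-byte
   string mbs -- something that is painful to do on the raw byte sequence */
int count_wc(const char *mbs, wchar_t wc)
{
    wchar_t wcs[256];                    /* assume the string fits */
    size_t n = mbstowcs(wcs, mbs, 256);  /* number of wide characters stored */
    size_t i;
    int count = 0;

    if (n == (size_t) -1)
        return -1;                       /* invalid multi-byte sequence */
    for (i = 0; i < n; i++)
        if (wcs[i] == wc)                /* one array element per character */
            count++;
    return count;
}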

You can also pass them to the is*() and to*() functions provided
you've setlocale() to a locale that supports additional
characters.  If you look at sections 4.3 and 4.4, you will see
that they are all locale dependent.

----
Larry Jones                         UUCP: uunet!sdrc!scjones
SDRC                                      scjones@sdrc.uucp
2000 Eastman Dr.                    BIX:  ltl
Milford, OH  45150                  AT&T: (513) 576-2070
"Save the Quayles" - Mark Russell

gordan@maccs.McMaster.CA (gordan) (11/03/88)

In article <362@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
|>There is a Cyrillic version (I think it is 8859/2)
|
|No, 8859/2 is another Latin set; there are four Latin alphabets
|(8859/[1234], I think), and there seem to be at least drafts for Greek
|and Cyrillic.

8859/2 is the Eastern European character set
8859/5 is Cyrillic

Tim Lasko posted listings for both of these to comp.std.internat a
couple of months ago.  If anyone wants a copy, send me e-mail.

--
                 Gordan Palameta
            uunet!ai.toronto.edu!utgpu!maccs!gordan

mark@jhereg.Jhereg.MN.ORG (Mark H. Colburn) (11/03/88)

In article <621@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>The kludges being proposed for C & UNIX just so that a sequence of
>"international" characters can be accessed as bytes rather than pay
>the penalty of switching over to 16 bits are unbelievable.

There is more to it than just moving to 16 bit characters.  There are a
number of places where a character sequence needs to be recognized.  Often
that character sequence is in 8-bit or 7-bit ASCII.

The ANSI and POSIX drafts both have the notion of collation sequences;
that is, some idea of how to sort characters in different locales.  The
collation sequence can vary from locale to locale.  I would encourage
you to look in the draft C standard for more details.  Collation sequences
can be used for more than just internationalization, however.

Consider the phone book which all of us have sitting around.  In the US and
most other English-speaking countries, the phone book has some rather odd
collation sequences in it.  Most notably, any names beginning with "Mc" or
"Mac" come before "M" in the phone book.  It would be useful for some
applications to define a collation sequence which would provide that
particular behaviour.
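
In draft ANSI C terms, locale-dependent comparison looks roughly like this
(a sketch only; the "phonebook" locale name is invented here, and real
locale names are implementation-defined):

#include <string.h>
#include <locale.h>

int name_compare(const char *a, const char *b)
{
    return strcoll(a, b);   /* ordering follows the current LC_COLLATE locale */
}

/* somewhere at startup, assuming such a locale exists:
       setlocale(LC_COLLATE, "phonebook");
*/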

Now then, "Mc" and "Mac" are not (and should not) be represented as 16 bit 
characters.  Other examples include the German ss character, which could be
represented as a unique character, but most Germans would still type 'ss'
rather than hunting for a new key.

16-bit characters are good for some things, such as Kanji or other Asian
code sets, but may be less useful in a number of other areas.  Requiring
16-bit characters puts a large burden of unused memory on those applications
which only use 8-bit characters.  For that reason alone, ANSI would be
justified in not requiring 16-bit characters.  However, I don't believe that
there is anything in the standard which would preclude a conforming ANSI C
implementation from having 16-bit characters.

-- 
Mark H. Colburn                  "They didn't understand a different kind of 
NAPS International                smack was needed, than the back of a hand, 
mark@jhereg.mn.org                something else was always needed."

henry@utzoo.uucp (Henry Spencer) (11/05/88)

In article <621@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>...The kludges being proposed for C & UNIX just so that a sequence of
>"international" characters can be accessed as bytes rather than pay
>the penalty of switching over to 16 bits are unbelievable.

Some of us don't like the price tag of switching to 16 bits, *especially*
since the vast majority of our jobs and our customers only need 8.  The
Japanese (Chinese, etc.) are going to have to dominate the world economy
much more thoroughly than they do now to convince everyone to make
sacrifices of this magnitude for their sake.  I'm not sure about POSIX,
but the X3J11 folks had expert advice, from the Japanese among others,
on how to provide for internationalization requirements without forcing
non-internationalized users to pay a heavy penalty.
-- 
The Earth is our mother.        |    Henry Spencer at U of Toronto Zoology
Our nine months are up.         |uunet!attcan!utzoo!henry henry@zoo.toronto.edu

colburn@src.honeywell.COM (Mark H. Colburn) (11/08/88)

In article <427@sdrc.UUCP> scjones@sdrc.UUCP (Larry Jones) writes:
>You seem to have missed a key point in the internationalization
>stuff - you don't use multi-byte characters directly, you convert
>them into wchar_t's using the functions in sections 4.10.7 and
>4.10.8.  wchar_t is an integral type (probably short or int) that
>is large enough to hold ANY character value.

This is not always true, although it would make things much easier if it
were.  You see, there is no way to take a converted string given back to
you by strxfrm() back to its native form.

What that means is that there is no way to make modifications to multi-byte
strings.  This would be a serious deficiency (and the one which I was
attempting to address in my last article).  strxfrm() is only good for
reading strings, not writing them.  For example, how would you do a
regular expression replacement if you do not know where the next character
is?  What if you need to parse a string and need to know what the data in
the string is?  strxfrm() translates characters into an implementation-
defined format.  That means that there is no way to portably do anything
with the generated string, other than compare it to another string...
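
For reference, the one portable use of strxfrm() is the comparison case the
paragraph above allows (a hedged sketch; the fixed buffer sizes are an
assumption made for brevity):

#include <string.h>

/* transform both strings so that plain strcmp() on the results gives the
   same ordering that strcoll() gives on the originals; the transformed
   bytes cannot portably be turned back into text */
int collate_compare(const char *a, const char *b)
{
    char ta[128], tb[128];              /* assume the results fit */

    strxfrm(ta, a, sizeof ta);
    strxfrm(tb, b, sizeof tb);
    return strcmp(ta, tb);
}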

[ description of wchar_t types...]

>You can also pass them to the is*() and to*() functions provided
>you've setlocale() to a locale that supports additional
>characters.  If you look at sections 4.3 and 4.4, you will see
>that they are all locale dependent.

You can NOT pass a wchar_t type to is*() functions, at least not portably. 
The is*() functions and to*() functions are defined as:

	int	toupper(int c);

There is no guarantee that the width of a wchar_t is less than or equal to
that of an int, or that it is able to be represented as an integer.  As a
matter of fact, in the (draft) C standard and the POSIX standards and drafts,
there are hints that it may be at least 4 characters wide.

One of the bugs which I pointed out was that the draft C standard does
indeed say that the is*() and to*() functions are locale-dependent, but I
see no way that they can be truly locale-dependent when they are defined as
they are.

andrea@hp-sdd.HP.COM (Andrea K. Frankel) (11/08/88)

In article <615@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>In article <7690@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>[about variants of ISO 8859]
>
>I know about ISO 8859/1 because someone gave me a copy, and about
>8859/5 because someone posted a draft on comp.std.internat (though
>I managed to lose that in a disc scavenge).
>
>How would someone in the USA go about getting a copy of each of these
>standards and draft standards?

Official approved access route:

	ANSI
	1430 Broadway
	NY, NY 10018

	sales:  212-642-4900

ANSI sells both American National Standards and the ISO standards in
this country.  They maintain a document listing ISO standards which have
made it to DIS stage as well.  You can also subscribe to their newsletter,
"Standards Action", which lets you know when various standards have hit
public review milestones (and where you can get a copy).

Another way to stay informed is to become an observer on the ANSI committee
which is either working on the US version of the standard in question
and/or formulating the US position contributing to the ISO standard.  ANSI
can also give you the master list of ANSI committees and their contact persons.

Good luck!


Andrea Frankel, Hewlett-Packard (San Diego Division) (619) 592-4664
                "...I brought you a paddle for your favorite canoe."
______________________________________________________________________________
UUCP     : {hplabs|nosc|hpfcla|ucsd}!hp-sdd!andrea 
Internet : andrea%hp-sdd@hp-sde.sde.hp.com (or @nosc.mil, @ucsd.edu)
CSNET    : andrea%hp-sdd@hplabs.csnet
USnail   : 16399 W. Bernardo Drive, San Diego CA 92127-1899 USA

terry@wsccs.UUCP (Every system needs one) (11/10/88)

In article <621@quintus.UUCP>, ok@quintus.uucp (Richard A. O'Keefe) writes:
> In article <207@jhereg.Jhereg.MN.ORG> mark@jhereg.MN.ORG (Mark H. Colburn) writes:
> >In article <8804@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
> >>In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
> >>>How difficult is it convert american/english programs so that they can 
> >>>be used to handle foreign text? [etc.]
> 
> Xerox have supported a 16-bit character set (XNS) for years.
> Some of the surprises mentioned by Mark Colburn have been no news
> to Interlisp-D programmers for a long time.
> 
> The kludges being proposed for C & UNIX just so that a sequence of
> "international" characters can be accessed as bytes rather than pay
> the penalty of switching over to 16 bits are unbelievable.

First of all, there are too many 8-bit character models available:

	All of the ISO models, DEC Multinational, 7-bit replacement sets,
	Wang-PC international sets, and IBM-PC International sets.

There is no way to consolidate it without mapping, and that's so device
dependent it isn't funny.  Consider your termcap growing by at least 128
characters times the number of entries... assuming that there is no need for
multiple GS/GE strings, as it may require more than one additional character
set on some terminals.

Second, vi in the US strips the 8th bit out, and is therefore not
usable for programming international (8-bit) characters using either model.


Problems with 16 bit characters:

O	The Xerox model is 16-bit and only valid for bitmapped displays,
	like Mac, and we all know how slowly that scrolls.

O	All of the current software would break without extensive rewrite

O	The internal overhead in a non-message passing operating system
	(most of them) is so high that it's ridiculous.

O	Think of pipes and all file I/O going half as fast.

O	Think of your hard disks shrinking to half their size... source
	files, after all, are text.

			terry@wsccs

gordan@maccs.McMaster.CA (gordan) (11/11/88)

|Some of us don't like the price tag of switching to 16 bits, *especially*
|since the vast majority of our jobs and our customers only need 8.

Yes, but a 16-bit or even 32-bit architecture has many advantages.
Programming is a great deal easier with a flat address space.

Whoops, we're talking about character sets, not chips?  Gosh, how
embarrassing...

On that bright future day when we've all got 1G of memory sitting on our
desktops and optical disk storage coming out of our ears, will 8-bit
character sets be the "segmented architecture" of the 21st century?

--
                 Gordan Palameta
            uunet!ai.toronto.edu!utgpu!maccs!gordan

ok@quintus.uucp (Richard A. O'Keefe) (11/13/88)

In article <774@wsccs.UUCP> terry@wsccs.UUCP (Every system needs one) writes:
>Second, vi in the US strips the 8th bit out, and is therefore not
>usable for programming international (8-bit) characters using either model.

AT&T announced clearly in the SVID that they were going to stop doing
that kind of thing, _and_they_have_.

>Problems with 16 bit characters:
>
>O	The Xerox model is 16-bit and only valid for bitmapped displays,
>	like Mac, and we all know how slowly that scrolls.
>
The Xerox model (XSIS 058404) has nothing to do with bitmapped displays.

>O	All of the current software would break without extensive rewrite

It's going to break _anyway_.  If you do one-character-equals-one-byte
operations on Kanji, the results just aren't going to make sense.  With
a 16-bit model (actually, the Xerox model already has provision for
24-bit characters, though the implementation I was familiar with didn't
provide them yet).  In fact, when XNS support was added to InterLisp,
most programs didn't even need to be recompiled, and those that needed
other changes mostly _could_ have been written to be independent of
character set using facilities already in the language.

>O	The internal overhead in a non-message passing operating system
>	(most of them) is so high that it's ridiculous.
>O	Think of pipes and all file I/O going half as fast.
>O	Think of your hard disks shrinking to half their size... source
>	files, after all, are text.

These are essentially the same point, and are equally mistaken.  There is
no reason why a _single_ character and a _sequence_ of characters need
to use the same coding.  There are three representations used for
character sequences in Interlisp-D:  thin strings (vectors of 8-bit
characters from "character set 0"), fat strings (vectors of 16-bit
characters), and files (sequences of characters drawn from the same
256-character block are stored as sequences of 8-bit codes, with
"font change" codes inserted as needed).  Since a file is presumed to
start in character set 0, files of 8-bit characters DIDN'T CHANGE AT ALL.
If you want to position randomly in a sequence, then you have to know
what the "font" is there, or a font change code could be inserted
at the start of every block.  It is only when a program picks up a single
character and looks at it on its own that it materialises as 16 bits.
[This coding wins if you tend to mix languages with small character sets,
e.g. if you have whole sentences in English, Russian, Hebrew, Greek, &c,
because then you can stay in the same "font" for at least a word at a time.
It does not pay off for Kanji, but with a certain amount of cunning you
can make it no worse than the ISO 2022 method.]
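
A hedged illustration of the file representation just described (this is not
the actual Interlisp-D format; the escape value and the helper decode_run
are invented for the example):

#define FONT_CHANGE 0xFF        /* assumed escape value, for illustration */

typedef unsigned short wchar16;

/* decode at most 'max' characters from the 'len' bytes at 'in' into 'out';
   bytes are taken from the current 256-character block until a font-change
   code switches blocks, and a 16-bit value only materialises here */
int decode_run(const unsigned char *in, int len, wchar16 *out, int max)
{
    int block = 0;              /* files are presumed to start in set 0 */
    int n = 0;
    int i;

    for (i = 0; i < len && n < max; i++) {
        if (in[i] == FONT_CHANGE && i + 1 < len)
            block = in[++i];                      /* switch block */
        else
            out[n++] = (wchar16) ((block << 8) | in[i]);
    }
    return n;
}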

Now you can only achieve code-set independence as easily as that in a
high-level language, and font-compressed files really require all the
utilities in the system to be internationalised at once, so the ANSI
committee didn't really have the option of adopting a solution like this.