kjartan@rhi.hi.is (Kjartan R. Gudmundsson) (10/28/88)
How difficult is it to convert American/English programs so that they can
be used to handle foreign text? The answer of course depends on the
language one has in mind. In Europe most nations use the Latin alphabet,
and English is one of them. Unfortunately English uses very few characters
compared to other European languages; the code set widely used by the
Americans and the English, the ASCII character set, therefore defines only
128 characters. It is a 7-bit character set. In European countries other
than England the ASCII character set is also widely used, but with an
extension: the character set is 8 bits, thus allowing 256 characters. The
problem, however, is that the extension is not standard. We have one
possibility in the IBM-PC character set, another from HP called Roman-8,
DEC gives us the DEC Multinational Character Set, and the Macintosh has
yet another. So if we have a program that, for example, converts lower
case letters to upper case, it has to be coded differently for each
character set. Let's look at some code from MicroEMACS:

input.org:	if (c>=0x00 && c<=0x1F)
input.org:	if (c>=0x00 && c<=0x1F)	/* C0 control -> C- */
main.org:	case 'a':	/* process error file */
main.org:	if ((c>=0x20 && c<=0xFF)) {	/* Self inserting. */
random.org:	if (*scan >= 'a' && *scan <= 'z')
random.org:	else if (c<0x20 || c==0x7F)
random.org:	else if (c<0x20 || c==0x7F)
region.org:	lputc(linep, loffs, c+'a'-'A');
region.org:	lputc(linep, loffs, c-'a'+'A');
region.org:	if (c>='a' && c<='z')
search.org:	else if (c < 0x20 || c == 0x7f)	/* control character */
word.org:	c += 'a'-'A';
word.org:	c += 'a'-'A';
word.org:	c -= 'a'-'A';
word.org:	c -= 'a'-'A';
word.org:	if (c>='a' && c<='z') {
word.org:	if (c>='a' && c<='z') {
word.org:	wordflag = ((ch >= 'a' && ch <= 'z') ||
word.org:	if (c>='a' && c<='z')

Ugly, isn't it? Another way of doing this is to use the "is..." functions
that are defined in ctype.h, an include file that comes with (almost) all
C compilers. Some of the above lines would then look like this:

basic.c:	else if (iscntrl(c))
display.c:	if (iscntrl(c))
display.c:	} else if (iscntrl(c)) {
eval.c:	*sp = tolower(*sp);
eval.c:	*sp = toupper(*sp);
eval.c:	if (islower(*sp) )
fileio.c:	if (iscntrl( fn[tel++] ) )
input.c:	if (iscntrl(buf[--cpos]) ) {
input.c:	if (iscntrl(buf[--cpos])) {
input.c:	c = toupper(c);
input.c:	c = toupper(c);	/* Force to upper */
input.c:	if ( islower(c) && ( SPEC != (SPEC & c) ))
input.c:	if (iscntrl(c) )	/* control key */
input.c:	if (iscntrl(c) )	/* control key */
input.c:	if (iscntrl(c) )	/* control key? */

This code is better (most of the "is..." things are macros that mask the
argument and return a binary mask that is either zero or positive), has
more style to it, and is easier to port to a different character set.

Another bad habit of American programmers is this:

	character_value = (character_value & 0x7F)

Don't do this!! If you must, you can use 0xFF instead:

	character_value = (character_value & 0xFF)

(Unless of course your machine breaks to pieces if it gets an 8-bit
character in its I/O channels.)

###############################################################################
#                                                                             #
#  Kjartan R. Gudmundsson                                                     #
#  Raudalaek 12                                                               #
#  105 Reykjavik             Internet: kjartan@rhi.hi.is                      #
#                            uucp:     ...mcvax!hafro!rhi!kjartan             #
#                                                                             #
###############################################################################
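[A minimal sketch of the two approaches side by side; this is illustrative code, not MicroEMACS source, and the function names are invented for the example.]

```c
#include <ctype.h>

/* Hard-coded ASCII ranges: correct only for the 26 unaccented
 * letters, so any 8-bit national character falls through untouched. */
static int upcase_ascii(int c)
{
    if (c >= 'a' && c <= 'z')
        return c - 'a' + 'A';
    return c;
}

/* ctype.h version: classification and mapping are delegated to the
 * C library, which can be built for (or switched to) another
 * character set without touching this code. */
static int upcase_ctype(int c)
{
    return islower(c) ? toupper(c) : c;
}
```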
gwyn@smoke.BRL.MIL (Doug Gwyn ) (10/31/88)
In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>How difficult is it convert american/english programs so that they can
>be used to handle foreign text? [etc.]

Where have you been the last few years? This subject area is known as
"internationalization" and has been the featured topic of special issues
of several journals, including UNIX Review and UNIX/World. The draft
proposed ANSI/ISO C standard specifically addresses this issue (it is
one of the reasons production of the final standard was delayed).
ok@quintus.uucp (Richard A. O'Keefe) (10/31/88)
In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>The problem is however that the extension is not standard.

There is an international standard for 8-bit character sets: ISO 8859.
There are several versions of 8859, just as there were several national
versions of ISO 646 (of which ASCII was only one). All versions include
ASCII as the bottom half. ISO Latin 1 (8859/1) is pretty close to DEC's
Multinational Character Set, and is supposed to cover most West European
languages (including Icelandic). There is a Cyrillic version (I think it
is 8859/2) and others are under way.

>An other bad habit of american programmers is this:
>character_value = (character_value & 0x7F )
>don't do this!! If you must, you can use 0xFF insted:
>character_value = (character_value & 0xFF )

The only time when I've wanted to do this is when stripping off a parity
bit, and using 0xFF would be totally wrong. The toascii() macro *might*
be appropriate. When you're dealing with a 7 data + 1 parity bit device,
there is no point in pretending that you're prepared to accept anything
other than 7 data bits.

The real problem is trying to write portable code that uses character
classes which _aren't_ in <ctype.h>. Consider isvowel()...
nwd@j.cc.purdue.edu (Daniel Lawrence) (10/31/88)
In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>How difficult is it convert american/english programs so that they can
>be used to handle foreign text? The answer of course depends on the language
 [a description of some of the problems using 8 bit chars]
>Let's look at some code from MicroEMACS:
 [a code excerpt from MicroEMACS 3.9]
>Ugly isn't it?

Ok, I am feeling a little picked on here... a lot of people like using
uEMACS for pointing things like this out. When I first started working
with it, it was just for me. But that is really no excuse...

>An other way of doing this is using "is.." functions that are
 [an alternative which is better]
>This code is better (most of the is.. things are macros that mask
 [More descriptions of 8 bit problems...]

And someone finally proposes some solutions rather than just blindly
stabbing out and complaining. The last round of complaints, I sent out a
request for information on this problem, and the best I got back was...
go to the library and do some research. Well, for a project I am doing
in my spare time, and considering the poor library system round here, I
really wasn't happy to hear all the griping and then get no help from
anyone to fix the problems. So I applaud Mr. Gudmundsson for his mail.

However, after the last round I thought carefully about the 8 bit
problems, and resolved that the issue was too complex on a
language-by-language basis for me to ever attempt to get all the case
mappings correct. So when you see the next version of MicroEMACS, it
will have a user-changeable upper/lowercase mapping function (which is
working right now). Note: this slows down the regular pattern matching
code considerably, so uEMACS can be compiled with the diacritical
support (un-American in this case) turned off, but both options now
exist.
Daniel Lawrence	(317) 742-5153
UUCP: {pur-ee!}j.cc.purdue.edu!nwd	ARPA: nwd@j.cc.purdue.edu
FIDO: 1:201/10	The Programmer's Room (317) 742-5533
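[The user-changeable upper/lowercase mapping described above can be sketched as a 256-entry translation table. All names here are hypothetical illustrations, not actual uEMACS code.]

```c
/* A 256-entry, user-changeable upper-case map.  The table starts as
 * plain ASCII folding; a user command or locale file could patch
 * individual entries for national characters. */
static unsigned char upmap[256];
static int upmap_ready = 0;

static void init_upmap(void)
{
    int i;
    for (i = 0; i < 256; i++)
        upmap[i] = (unsigned char) i;        /* identity mapping */
    for (i = 'a'; i <= 'z'; i++)
        upmap[i] = (unsigned char) (i - 'a' + 'A');
    /* Example of a user patch: e-acute -> E-acute in ISO 8859/1. */
    upmap[0xE9] = 0xC9;
}

static int upcase_mapped(int c)
{
    if (!upmap_ready) {
        init_upmap();
        upmap_ready = 1;
    }
    return upmap[c & 0xFF];
}
```

The table lookup costs one array index per character, which is why pattern matching slows down when every comparison has to go through it.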
nelson@sun.soe.clarkson.edu (Russ Nelson) (11/01/88)
In article <8045@j.cc.purdue.edu> nwd@j.cc.purdue.edu (Daniel Lawrence) writes:
>In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>>How difficult is it convert american/english programs so that they
>>can be used to handle foreign text? The answer of course depends
>>on the language
>So when you see the next version of MicroEMACS, it will have a user
>changable upper/lowercase mapping function (which is working right now).

Same for Freemacs. I also used to take over the keyboard interrupt
(INT 9), but some of the international users complained that it broke
their keyboard mapper (not to mention the fact that it lost with TSRs),
so I took it out.
--
--russ (nelson@clutx [.bitnet | .clarkson.edu])
To surrender is to remain in the hands of barbarians for the rest of my life.
To fight is to leave my bones exposed in the desert waste.
guy@auspex.UUCP (Guy Harris) (11/01/88)
>An other bad habit of american programmers is this:
And another one is using the 8th bit internally for quoting; don't do that.
guy@auspex.UUCP (Guy Harris) (11/01/88)
>There is a Cyrillic version (I think it is 8859/2)

No, 8859/2 is another Latin set; there are four Latin alphabets
(8859/[1234], I think), and there seem to be at least drafts for Greek
and Cyrillic.

>The only time when I've wanted to do this is when stripping off a parity
>bit, and using 0xFF would be totally wrong. The toascii() macro *might*
>be appropriate. When you're dealing with a 7 data + 1 parity bit device,
>there is no point in pretending that you're prepared to accept anything
>other than 7 data bits.

Except that most devices can be *told* to handle 8 bits; never assume
that when you're dealing with a terminal that you're dealing with a
7 data + 1 parity bit device (unless your software deals *only* with one
specific terminal that *can't* generate 8 bits).

>The real problem is trying to write portable code that uses character
>classes which _aren't_ in <ctype.h>. Consider isvowel()...

Or, for that matter, consider "toupper()"; what's "toupper()" of a
German "ss" (or is it "sz") character?
dik@cwi.nl (Dik T. Winter) (11/01/88)
In article <362@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
 > >There is a Cyrillic version (I think it is 8859/2)
 >
 > No, 8859/2 is another Latin set; there are four Latin alphabets
 > (8859/[1234], I think), and there seem to be at least drafts for Greek
 > and Cyrillic.

Cyrillic is 8859/5 and approved. /6 is Arabic, /7 Greek and /8 Hebrew;
these three are still draft (as are /3 and /4).
--
dik t. winter, cwi, amsterdam, nederland
INTERNET : dik@cwi.nl
BITNET/EARN: dik@mcvax
robert@pvab.UUCP (Robert Claeson) (11/01/88)
In article <8804@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn ) writes:
> In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
> >How difficult is it convert american/english programs so that they can
> >be used to handle foreign text? [etc.]
> Where have you been the last few years? This subject area is known as
> "internationalization" and has been the featured topic of special issues
> of several journals, including UNIX Review and UNIX/World.

The fact is, internationalization hasn't shown up in any product I know
of (except for HP's NLS, but few use it). People are still just
*talking* about internationalization, not *doing* it.
--
Robert Claeson, ERBE DATA AB, P.O. Box 77, S-175 22 Jarfalla, Sweden
Tel: +46 758-202 50	Fax: +46 758-197 20
EUnet: rclaeson@erbe.se	ARPAnet: rclaeson%erbe.se@uunet.uu.net
mark@jhereg.Jhereg.MN.ORG (Mark H. Colburn) (11/02/88)
In article <8804@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>>How difficult is it convert american/english programs so that they can
>>be used to handle foreign text? [etc.]
>
>Where have you been the last few years? This subject area is known as
>"internationalization" and has been the featured topic of special issues
>of several journals, including UNIX Review and UNIX/World. The draft
>proposed ANSI/ISO C standard specifically addresses this issue (it is
>one of the reasons production of the final standard was delayed).

Unfortunately, the C standard is still lacking in this area. It is true
that the attempt was made; however, X3J11 will have to go through
another round if it is to be truly internationalized.

One problem is that, although the standard supports multi-byte
characters, which are required for a number of languages around the
world, especially those in Asia, no support is provided to pass those
characters to any of the is...() or to...() functions. Since all the
is...() and to...() functions take an integer parameter, it would be
impossible to evaluate a multi-byte character.

Another problem is that an application has no way of portably
determining where the current character in a string ends and the next
begins; you can't just use ch++ to advance to the next character
anymore. And it is even harder to move backwards through a string.

There are some other problems with collation as well: some languages
may have several lowercase characters corresponding to a single
uppercase character, or vice versa. This presents some problems when
using toupper() and tolower() to convert a character to its opposite
case. In addition, in some languages and/or collation sequences there
are some characters which do not have a corresponding opposite case
(i.e. there is only an uppercase character, with no corresponding
lowercase character in the code set).

To be fair, we did not uncover these deficiencies until just recently
(just after we sent our ballot in for the third public review), so
these may not have been issues specifically addressed by the committee.

There are some solutions to these problems which would allow for
internationalization without breaking any existing programs. Here are
some suggestions:

1. Develop some functions which provide the same functionality as the
   is...() functions but which take a character pointer as an argument.
   For example:

	int wcislower(char *string)

2. Develop some functions which provide the same functionality as the
   to...() functions but which return a character pointer.
   Unfortunately, these functions may need to allocate space in order
   for the transformation to work, or they may need to pass back a
   pointer to a static string which would then need to be copied. The
   latter is probably the way most implementations would do it, since
   it is essentially a table lookup. For example:

	char *wctolower(char *string)

3. Provide some functions to allow traversing a character string.
   These functions would return a pointer to the next character in the
   string as determined by the current locale. For example:

	char *nextchar(char *string)
	char *prevchar(char *string, char *backup)

   These last two functions were presented at the latest IEEE POSIX
   meeting by one of the committee members to cope with this problem.
   The backup string in prevchar() provides a pointer to a known
   character boundary that the function can use to scan forward in the
   string in order to determine where the actual character boundary of
   the previous character is.

4. Some of the string functions would need to be revised as well,
   specifically strlen():

	int wcstrlen(char *string)

   This function would return the string length of the current string
   according to the current locale setting. Therefore the string "abss"
   would give a length of 4 in the C locale, but might return 3 in a
   German locale. The functionality of this could be put in the current
   strlen(); however, there are still requirements to get the number of
   bytes in a string as well as the number of characters, so the old
   strlen() should not be replaced.

Internationalization is a tricky and involved problem. Unfortunately it
is not possible for an existing program to recompile under an ANSI
compiler and become internationalized. A number of changes to the
application are required in order to provide for maximally portable
code. However, it is possible to provide the internationalization
without breaking any existing code.

What has been discussed so far is character-level internationalization,
which is only one side of the fence. The other side is language
translation of strings. This is known as "messaging" in the circles
which talk about internationalization (let's overload yet another
computer science term...). Messaging can be accomplished by developing
messaging libraries which contain the strings required by the
application, translated into every language which your application
needs to support. When you wish to display a string, such as "press
space bar to continue", you call the messaging library with a unique
identifier which is associated with your string, and the messaging
library returns a string, based on the current locale, which depicts
the same idea as "press space bar to continue".

This also requires some fancy footwork on the part of applications,
since displaying these messages is bound to be very difficult: some
languages read left-to-right, some read right-to-left, and some, such
as Mongolian, do both and even go diagonally. Add string attributes
such as centering and justification and character attributes such as
inverse, normal and blinking, and messaging becomes very interesting
indeed.
Internationalization is a relatively new field, and a number of things
still need to be ironed out, but I think that we are making progress,
and that progress should continue.
--
Mark H. Colburn	"They didn't understand a different kind of
NAPS International	smack was needed, than the back of a hand,
mark@jhereg.mn.org	something else was always needed."
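[The proposed nextchar()/wcstrlen() functions above can be sketched for one illustrative stateless encoding, in which any byte >= 0x80 opens a two-byte character. This encoding is invented for the example and is not any standard code set.]

```c
/* Advance to the next character boundary.  Assumes a stateless
 * encoding in which a byte >= 0x80 starts a two-byte character. */
static char *nextchar(char *string)
{
    if (*string == '\0')
        return string;                       /* already at end */
    if ((unsigned char) *string >= 0x80 && string[1] != '\0')
        return string + 2;                   /* two-byte character */
    return string + 1;                       /* one-byte character */
}

/* Character (not byte) count, in the spirit of the proposed
 * wcstrlen(): step boundary to boundary instead of byte to byte. */
static int wcstrlen(char *string)
{
    int n = 0;
    while (*string != '\0') {
        string = nextchar(string);
        n++;
    }
    return n;
}
```

Note that a prevchar() cannot be written this simply: without a known boundary behind you, a trailing byte is indistinguishable from a lead byte, which is exactly why the proposal gives prevchar() a backup pointer.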
gauss@homxc.UUCP (E.GAUSS) (11/02/88)
In article <8804@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn ) asks:
> >How difficult is it convert american/english programs so that they can
> >be used to handle foreign text? [etc.]
In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) replies:
> Where have you been the last few years? This subject area is known as
> "internationalization" and has been the featured topic of special issues ...

An author friend that I work with, Eb Colville, has been trying for a
number of years to find a VI editor that will handle the German
characters available in the extended ASCII characters on his MS-DOS PC.
He used those in his novel, THE LAST ZEPPELIN, which is trying to find a
publisher. Whatever the talk, it does not seem to be possible to do
this. Extended ASCII requires the full eight bits to be available, and
all VI's that we have seen simply toss away the lead bit, folding
umlauted characters into control characters.

We ended up writing a filter so that Eb types u/e when he wants an
umlauted e, and just before printing we run his text through the filter,
which replaces it by the appropriate extended ASCII character. (It also
unfolds the folded characters, but that is risky, as you cannot have any
control characters hidden in your text.) If your word processor does not
balk at eight-bit characters, this is a workable way of putting the
characters in in the first place.

Eb has been asking about Cyrillic (Russian) characters for his next
novel, BEYOND THE YUKON, and I have refused even to discuss this with
him. There are methods for doing Japanese where the keyboardist types in
"Romanji" and the computer makes a guess at the Kanji. I told Eb that if
he has any plans to try Japanese word processing he will have to go to
Japan.

Ed Gauss
bill@twwells.uucp (T. William Wells) (11/02/88)
In article <362@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
: >The real problem is trying to write portable code that uses character
: >classes which _aren't_ in <ctype.h>. Consider isvowel()...
:
: Or, for that matter, consider "toupper()"; what's "toupper()" of a
: German "ss" (or is it "sz") character?
The character is the "Scharfes S" and looks something like a beta. It
doesn't have an upper case form. When written where an upper case
letter should go, it is written unchanged. Alternatively, it might
be written "SS".
---
Bill
{uunet|novavax}!proxftl!twwells!bill
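[The Scharfes S case shows why a character-to-character toupper() cannot be complete: upper-casing may lengthen the string. A sketch of a string-level upcase for ISO 8859/1 that expands sharp s to "SS"; the function names are hypothetical.]

```c
#include <ctype.h>
#include <string.h>

#define SHARP_S 0xDF    /* ISO 8859/1 code for the Scharfes S */

/* Upper-case src into dst, expanding sharp s to "SS".  The result
 * can be up to twice the source length, so dst must be sized
 * accordingly.  A sketch only; a real version must also know which
 * character set is actually in use. */
static void upcase_str(char *dst, const char *src)
{
    while (*src != '\0') {
        unsigned char c = (unsigned char) *src++;
        if (c == SHARP_S) {
            *dst++ = 'S';
            *dst++ = 'S';
        } else {
            *dst++ = (char) toupper(c);   /* non-letters pass through */
        }
    }
    *dst = '\0';
}

/* Small helper so the behaviour is easy to check. */
static int upcased_equals(const char *src, const char *expect)
{
    char buf[64];
    upcase_str(buf, src);
    return strcmp(buf, expect) == 0;
}
```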
ok@quintus.uucp (Richard A. O'Keefe) (11/02/88)
In article <7690@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
[about variants of ISO 8859]
I know about ISO 8859/1 because someone gave me a copy, and about
8859/5 because someone posted a draft on comp.std.internat (though
I managed to lose that in a disc scavenge).
How would someone in the USA go about getting a copy of each of these
standards and draft standards?
guy@auspex.UUCP (Guy Harris) (11/02/88)
In article <4002@homxc.UUCP> gauss@homxc.UUCP (E.GAUSS) writes:

(And misattributes both quotes - that's why I don't like the
"In article ..., ... writes:" lines)

>An author friend that I work with, Eb Colville, has been trying for a
>number of years to find a VI editor that will handle the German characters
>available in the extended ASCII characters on his MS-DOS PC. He used those
>in his novel, THE LAST ZEPPELIN, which is trying to find a publisher.
>Whatever the talk, it does not seem to be possible to do this. Extended
>ASCII requires the full eight bits to be available, and all VI's that we
>have seen simply toss away the lead bit folding umlauted characters into
>control characters.

The "vi" in System V Release 3.1 handles 8-bit characters.
Unfortunately, I don't know if anybody's ported it to MS-DOS.... Also,
some version of Unipress EMACS can be configured to support 8-bit
characters as well (I don't know if that version has been released yet
or not).

>There are methods for doing Japanese where the keyboardist types in
>"Romanji" and the computer makes a guess at the Kanji.

The ones I've seen convert Romaji to Kana as you type (this is, as I
understand it, a straightforward translation) and then permit you to
request that the computer translate the Kana you typed since the last
checkpoint (switching into Kanji mode, or asking for a Kana-to-Kanji
translation) into Kanji. It gives you a list of the possible
translations, and lets you choose which one you want.

Of course, now you'd need an editor that handles *16*-bit characters; I
think AT&T has a "vi" that will handle them, and I don't know about
EMACS (although I remember an #ifdef in the aforementioned Unipress
version for Kanji).
guy@auspex.UUCP (Guy Harris) (11/02/88)
>This also requires some fancy footwork on the part of applications, since
>displaying these messages is bound to be very difficult since some
>languages read left-to-right, some read right-to-left, and some, such as
>Mongolian, do both and even go diagonally. Add string attributes such as
>centering and justification and character attributes such as inverse,
>normal and blinking, and messaging becomes very interesting indeed.

And then add, on top of that, the fact that computer displays are often
graphical, not just text. For instance, you might have an array of
"buttons" on the screen; the layout of the panel might differ depending
on whether the buttons are filled up with short English words or
typicalverylongGermanwords.
weemba@garnet.berkeley.edu (Matthew P Wiener) (11/02/88)
Hey folks, I think you're all wonderful, but could you take this
discussion out of comp.emacs unless it's Emacs related? There's even a
comp.std.internat newsgroup.

ucbvax!garnet!weemba	Matthew P Wiener/Brahms Gang/Berkeley CA 94720
"Nil sounds like a lot of kopins! I never got paid nil before!" --Groo
ok@quintus.uucp (Richard A. O'Keefe) (11/02/88)
In article <207@jhereg.Jhereg.MN.ORG> mark@jhereg.MN.ORG (Mark H. Colburn) writes:
>In article <8804@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>>In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>>>How difficult is it convert american/english programs so that they can
>>>be used to handle foreign text? [etc.]

Xerox have supported a 16-bit character set (XNS) for years. Some of
the surprises mentioned by Mark Colburn have been no news to Interlisp-D
programmers for a long time. The kludges being proposed for C & UNIX
just so that a sequence of "international" characters can be accessed as
bytes, rather than pay the penalty of switching over to 16 bits, are
unbelievable.
george@mnetor.UUCP (George Hart) (11/03/88)
In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>How difficult is it convert american/english programs so that they can
>be used to handle foreign text?

If you just need to handle full 8-bit characters, it is merely painful.
If you need to handle multibyte characters (e.g. Kanji) or a mix of
character sets, it is excruciating.

>In other european countries than England
>the ASCII character set is also widely used but with extension.
>The character set is 8 bit thus allowing 256 characters.
>The problem is however that the extension is not standard.

There is, of course, the ISO 8859 family of 8-bit character sets, which
contain ASCII as a perfect subset.

> < excerpts of MicroEmacs code >
>
>Ugly isn't it?

Yes. vi and the Bourne shell were (are) other offenders. I believe
recent releases of SysV have cleaned up the naughty uses of the 8th bit.

> < sample ctype.h invocations >
>
>This code is better (most of the is.. things are macros that mask
>the argument and return the binary mask that is either zero or positive)
>has more style to it and is easier to port to a different character set.

Unfortunately, the results of the macros are undefined unless isascii(c)
is positive, which sort of defeats the spirit of what you intend. Of
course, you could develop an 8-bit ctype.h compatible with a particular
8-bit character set.

>An other bad habit of american programmers is this:
>character_value = (character_value & 0x7F )

This has more to do with assumptions about character sets supported by
the system than nationality. Historically, assuming an ASCII environment
was not unreasonable. While this is no longer true, until vendors and
standards bodies get off their collective pots and develop practical
character sets and conventions for multilingual environments (including
multibyte characters), things will remain confused, fragmented, and
incompatible.
--
Regards.....George Hart, Computer X Canada Ltd.
UUCP: {utzoo,uunet}!mnetor!george	BELL: (416)475-8980
gwyn@smoke.BRL.MIL (Doug Gwyn ) (11/03/88)
In article <207@jhereg.Jhereg.MN.ORG> mark@jhereg.MN.ORG (Mark H. Colburn) writes:
>Unfortunately, the C standard is still lacking in this area. It is true
>that the attempt was made, however, X3J11 will have to go through another
>round if it is to be truly internationalized.

Not really; all the issues you raised in your posting were already
addressed in arriving at the current draft proposed C Standard, which
had the assistance and approval of several specialists in
internationalization. I don't want to try to discuss the details here;
however, I will remark that the wchar_t type is NOT intended to be used
in all the contexts where a character would normally be used in an
8-bit environment. ITSCJ indicated that the functions provided were
sufficient. Others can of course be provided as extensions but were not
felt to be sufficiently important to standardize.

P.S. This is not to be taken as an official X3J11 statement!
henry@utzoo.uucp (Henry Spencer) (11/03/88)
In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>The problem is however that the extension is not standard.
>We have one possability in the IBM-PC character set, other one from HP called
>Roman-8, DEC gives us DEC-multinational character set and the Macintosh
>has yet another...

ISO Latin 1 will probably supersede all of these, pretty much solving
that problem. (If you haven't heard of it, and want to know what it's
like, it's pretty similar to DEC's multinational set.)
--
The Earth is our mother.	| Henry Spencer at U of Toronto Zoology
Our nine months are up.	|uunet!attcan!utzoo!henry henry@zoo.toronto.edu
gwyn@smoke.BRL.MIL (Doug Gwyn ) (11/03/88)
In article <621@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>The kludges being proposed for C & UNIX just so that a sequence of
>"international" characters can be accessed as bytes rather than pay
>the penalty of switching over to 16 bits are unbelievable.

From time to time I remind people that "byte" does not imply 8 bits.
There is nothing in the proposed C standard that precludes an
implementation choosing to use 16 bits for its character types, and/or
providing "stub" functions for the locale and wide-character stuff. The
main reason all the extra specification for multibyte character
sequences is present is that a majority of vendors already had decided
to take such an approach, as opposed to the much cleaner method of
allocating sufficiently wide data to handle all relevant code sets. To
accommodate existing approaches, it was necessary to come up with
adequate specifications, which has been done.

The main problem we face with 16-bit chars is that a majority of X3J11
insisted that sizeof(char)==1, so the smallest C-addressable unit (i.e.
"byte") is necessarily the same size as char. Thus, in an
implementation based on an 8-bit byte-addressable architecture, if
individual byte accessibility is desired in C, the implementation must
necessarily make chars 8 bits, and if large code sets are necessary,
then it HAS to use multibyte sequences for them.
karl@haddock.ima.isc.com (Karl Heuer) (11/03/88)
In article <207@jhereg.Jhereg.MN.ORG> mark@jhereg (Mark Colburn) writes:
>One problem is that, althougth the [C] standard supports multi-byte
>characters ..., no support is provided [for using them with <ctype.h>].
>Here are some suggestions:
>	int wcislower(char *string)
>	char *wctolower(char *string)

It's not necessary to pass/return pointers; there is an arithmetic type
"wchar_t" in ANSI C. Thus it would be simpler to define these as
follows:

	int wcislower(wchar_t wc);
	wchar_t wctolower(wchar_t wc);

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
Followups to comp.std.{internat,c}.
scjones@sdrc.UUCP (Larry Jones) (11/03/88)
In article <207@jhereg.Jhereg.MN.ORG>, mark@jhereg.Jhereg.MN.ORG (Mark H. Colburn) writes:
> [a number of misconceptions about draft ANSI C multi-byte chars:
> you can't pass them to is*() or to*(), can't tell how long they
> are, can't walk through arrays of them conveniently, etc. and
> proposes cluttering up the library with a bunch of new functions
> to handle them "correctly"]

You seem to have missed a key point in the internationalization stuff -
you don't use multi-byte characters directly, you convert them into
wchar_t's using the functions in sections 4.10.7 and 4.10.8. wchar_t is
an integral type (probably short or int) that is large enough to hold
ANY character value. For example, the char 'A' might convert to a
wchar_t value of 65, and a multi-byte sequence representing a Japanese
character would convert to a wchar_t value of 12345.

Since wchar_t's are all the same size, you can have an array of them
that you walk through with pointers, just like you're used to doing
with char arrays. You can also pass them to the is*() and to*()
functions, provided you've setlocale()'d to a locale that supports
additional characters. If you look at sections 4.3 and 4.4, you will
see that they are all locale dependent.
----
Larry Jones	UUCP: uunet!sdrc!scjones
SDRC	scjones@sdrc.uucp
2000 Eastman Dr.	BIX: ltl
Milford, OH 45150	AT&T: (513) 576-2070
"Save the Quayles" - Mark Russell
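[A minimal sketch of the conversion described here, using the draft standard's mbstowcs(). In the default "C" locale each byte converts to one wide character, which keeps the example predictable; the function name and fixed buffer size are illustrative only.]

```c
#include <stdlib.h>
#include <stddef.h>

/* Convert a multibyte string into an array of fixed-width wchar_t
 * values once, then walk the array with ordinary index arithmetic:
 * every element is a whole character, whatever its byte length was. */
static size_t count_uppercase(const char *mbs)
{
    wchar_t wcs[128];
    size_t n = mbstowcs(wcs, mbs, 128);   /* characters converted */
    size_t i, count = 0;

    for (i = 0; i < n; i++)
        if (wcs[i] >= L'A' && wcs[i] <= L'Z')
            count++;                      /* upper-case ASCII only */
    return count;
}
```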
gordan@maccs.McMaster.CA (gordan) (11/03/88)
In article <362@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
|>There is a Cyrillic version (I think it is 8859/2)
|
|No, 8859/2 is another Latin set; there are four Latin alphabets
|(8859/[1234], I think), and there seem to be at least drafts for Greek
|and Cyrillic.

8859/2 is the Eastern European character set
8859/5 is Cyrillic

Tim Lasko posted listings for both of these to comp.std.internat a
couple of months ago. If anyone wants a copy, send me e-mail.
--
Gordan Palameta
uunet!ai.toronto.edu!utgpu!maccs!gordan
mark@jhereg.Jhereg.MN.ORG (Mark H. Colburn) (11/03/88)
In article <621@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>The kludges being proposed for C & UNIX just so that a sequence of
>"international" characters can be accessed as bytes rather than pay
>the penalty of switching over to 16 bits are unbelievable.

There is more to it than just moving to 16-bit characters. There are a
number of places where a character sequence needs to be recognized.
Often that character sequence is in 8-bit or 7-bit ASCII. The drafts of
ANSI C and POSIX both have the notion of collation sequences; that is,
some idea of how to sort characters in different locales. The collation
sequence can vary from locale to locale. I would encourage you to look
in the draft C standard for more details.

Collation sequences can be used for more than just
internationalization, however. Consider the phone book which all of us
have sitting around. In the US and most other English speaking
countries, the phone book has some rather odd collation sequences in
it. Most notably, any names beginning with "Mc" or "Mac" come before
"M" in the phone book. It would be useful for some applications to
define a collation sequence which would provide that particular
behaviour. Now then, "Mc" and "Mac" are not (and should not be)
represented as 16-bit characters. Other examples include the German ss
character, which could be represented as a unique character, but most
Germans would still type 'ss' rather than hunting for a new key.

16-bit characters are good for some things, such as Kanji or other
Asian code sets, but may be less useful in a number of other areas.
Requiring 16-bit characters puts a large burden of unused memory on
those applications which only use 8-bit characters. For that reason
alone, ANSI would be justified in not requiring 16-bit characters.
However, I don't believe that there is anything in the standard which
would preclude a conforming ANSI C implementation from having 16-bit
characters.
--
Mark H. Colburn	"They didn't understand a different kind of
NAPS International	smack was needed, than the back of a hand,
mark@jhereg.mn.org	something else was always needed."
henry@utzoo.uucp (Henry Spencer) (11/05/88)
In article <621@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>...The kludges being proposed for C & UNIX just so that a sequence of
>"international" characters can be accessed as bytes rather than pay
>the penalty of switching over to 16 bits are unbelievable.

Some of us don't like the price tag of switching to 16 bits, *especially*
since the vast majority of our jobs and our customers only need 8. The
Japanese (Chinese, etc.) are going to have to dominate the world economy
much more thoroughly than they do now to convince everyone to make
sacrifices of this magnitude for their sake.

I'm not sure about POSIX, but the X3J11 folks had expert advice, from the
Japanese among others, on how to provide for internationalization
requirements without forcing non-internationalized users to pay a heavy
penalty.

-- 
The Earth is our mother.  |    Henry Spencer at U of Toronto Zoology
Our nine months are up.   |  uunet!attcan!utzoo!henry henry@zoo.toronto.edu
colburn@src.honeywell.COM (Mark H. Colburn) (11/08/88)
In article <427@sdrc.UUCP> scjones@sdrc.UUCP (Larry Jones) writes:
>You seem to have missed a key point in the internationalization
>stuff - you don't use multi-byte characters directly, you convert
>them into wchar_t's using the functions in sections 4.10.7 and
>4.10.8. wchar_t is an integral type (probably short or int) that
>is large enough to hold ANY character value.

This is not always true, although it would make things much easier if it
were. You see, there is no way to take a converted string given back to
you by strxfrm() back to its native form. What that means is that there
is no way to make modifications to multi-byte strings. This would be a
serious deficiency (and the one which I was attempting to address in my
last article). Strxfrm() is only good for reading strings, not writing
them. For example, how would you do a regular-expression replacement if
you do not know where the next character is? What if you need to parse a
string and need to know what the data in the string is? Strxfrm()
translates characters into an implementation-defined format. That means
that there is no way to portably do anything with the generated string,
other than compare it to another string...

[ description of wchar_t types...]

>You can also pass them to the is*() and to*() functions provided
>you've setlocale() to a locale that supports additional
>characters. If you look at sections 4.3 and 4.4, you will see
>that they are all locale dependent.

You can NOT pass a wchar_t type to the is*() functions, at least not
portably. The is*() and to*() functions are defined as, for example:

	int toupper(int c);

There is no guarantee that a wchar_t is no wider than an int, or that its
value can be represented as an int. As a matter of fact, in the (draft) C
standard and the POSIX standards and drafts, there are hints that it may
be at least 4 bytes wide.
One of the bugs which I pointed out was that the draft C standard does
indeed say that the is*() and to*() functions are locale-dependent, but I
see no way that they can be truly locale-dependent when they are defined
as they are.
andrea@hp-sdd.HP.COM (Andrea K. Frankel) (11/08/88)
In article <615@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>In article <7690@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>[about variants of ISO 8859]
>
>I know about ISO 8859/1 because someone gave me a copy, and about
>8859/5 because someone posted a draft on comp.std.internat (though
>I managed to lose that in a disc scavenge).
>
>How would someone in the USA go about getting a copy of each of these
>standards and draft standards?

Official approved access route:

	ANSI
	1430 Broadway
	NY, NY 10018
	sales: 212-642-4900

ANSI sells both American National Standards and the ISO standards in this
country. They maintain a document listing ISO standards which have made
it to DIS stage as well. You can also subscribe to their newsletter,
"Standards Action", which lets you know when various standards have hit
public-review milestones (and where you can get a copy).

Another way to stay informed is to become an observer on the ANSI
committee which is working on the US version of the standard in question
and/or formulating the US position contributing to the ISO standard.
ANSI can also give you the master list of ANSI committees and their
contact persons.

Good luck!

Andrea Frankel, Hewlett-Packard (San Diego Division)    (619) 592-4664
                "...I brought you a paddle for your favorite canoe."
______________________________________________________________________________
UUCP     : {hplabs|nosc|hpfcla|ucsd}!hp-sdd!andrea
Internet : andrea%hp-sdd@hp-sde.sde.hp.com (or @nosc.mil, @ucsd.edu)
CSNET    : andrea%hp-sdd@hplabs.csnet
USnail   : 16399 W. Bernardo Drive, San Diego CA 92127-1899 USA
terry@wsccs.UUCP (Every system needs one) (11/10/88)
In article <621@quintus.UUCP>, ok@quintus.uucp (Richard A. O'Keefe) writes:
> In article <207@jhereg.Jhereg.MN.ORG> mark@jhereg.MN.ORG (Mark H. Colburn) writes:
> >In article <8804@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
> >>In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
> >>>How difficult is it convert american/english programs so that they can
> >>>be used to handle foreign text? [etc.]
>
> Xerox have supported a 16-bit character set (XNS) for years.
> Some of the surprises mentioned by Mark Colburn have been no news
> to Interlisp-D programmers for a long time.
>
> The kludges being proposed for C & UNIX just so that a sequence of
> "international" characters can be accessed as bytes rather than pay
> the penalty of switching over to 16 bits are unbelievable.

First of all, there are too many 8-bit character models available: all of
the ISO models, DEC Multinational, 7-bit replacement sets, Wang-PC
international sets, and IBM-PC international sets. There is no way to
consolidate them without mapping, and that's so device-dependent it isn't
funny. Consider your termcap growing by at least 128 characters times the
number of entries... and that assumes there is no need for multiple GS/GE
strings, as some terminals may require more than one additional character
set.

Second, vi in the US strips the 8th bit out, and is therefore not usable
for programming international (8-bit) characters using either model.

Problems with 16-bit characters:

O  The Xerox model is 16-bit and only valid for bitmapped displays,
   like the Mac, and we all know how slowly that scrolls.
O  All of the current software would break without extensive rewrite.
O  The internal overhead in a non-message-passing operating system
   (most of them) is so high that it's ridiculous.
O  Think of pipes and all file I/O going half as fast.
O  Think of your hard disks shrinking to half their size... source
   files, after all, are text.

terry@wsccs
gordan@maccs.McMaster.CA (gordan) (11/11/88)
|Some of us don't like the price tag of switching to 16 bits, *especially*
|since the vast majority of our jobs and our customers only need 8.

Yes, but a 16-bit or even 32-bit architecture has many advantages.
Programming is a great deal easier with a flat address space.

Whoops, we're talking about character sets, not chips? Gosh, how
embarrassing...

On that bright future day when we've all got 1G of memory sitting on our
desktops and optical disk storage coming out of our ears, will 8-bit
character sets be the "segmented architecture" of the 21st century?

-- 
Gordan Palameta
uunet!ai.toronto.edu!utgpu!maccs!gordan
ok@quintus.uucp (Richard A. O'Keefe) (11/13/88)
In article <774@wsccs.UUCP> terry@wsccs.UUCP (Every system needs one) writes:
>Second, vi in the US strips the 8th bit out, and is therefore not
>usable for programming international (8-bit) characters using either model.

AT&T announced clearly in the SVID that they were going to stop doing
that kind of thing, _and_they_have_.

>Problems with 16 bit characters:
>
>O The Xerox model is 16-bit and only valid for bitmapped displays,
>  like Mac, and we all know how slowly that scrolls.

The Xerox model (XSIS 058404) has nothing to do with bitmapped displays.

>O All of the current software would break without extensive rewrite

It's going to break _anyway_. If you do one-character-equals-one-byte
operations on Kanji, the results just aren't going to make sense. A
16-bit model avoids that (actually, the Xerox model already has provision
for 24-bit characters, though the implementation I was familiar with
didn't provide them yet). In fact, when XNS support was added to
Interlisp, most programs didn't even need to be recompiled, and those
that needed other changes mostly _could_ have been written to be
independent of character set using facilities already in the language.

>O The internal overhead in a non-message passing operating system
>  (most of them) is so high that it's ridiculous.
>O Think of pipes and all file I/O going half as fast.
>O Think of your hard disks shrinking to half their size... source
>  files, after all, are text.

These are essentially the same point, and are equally mistaken. There is
no reason why a _single_ character and a _sequence_ of characters need to
use the same coding. There are three representations used for character
sequences in Interlisp-D: thin strings (vectors of 8-bit characters from
"character set 0"), fat strings (vectors of 16-bit characters), and files
(sequences of characters drawn from the same 256-character block are
stored as sequences of 8-bit codes, with "font change" codes inserted as
needed).
Since a file is presumed to start in character set 0, files of 8-bit
characters DIDN'T CHANGE AT ALL. If you want to position randomly in a
sequence, then you have to know what the "font" is there, or a font
change code could be inserted at the start of every block. It is only
when a program picks up a single character and looks at it on its own
that it materialises as 16 bits.

[This coding wins if you tend to mix languages with small character sets,
e.g. if you have whole sentences in English, Russian, Hebrew, Greek, &c,
because then you can stay in the same "font" for at least a word at a
time. It does not pay off for Kanji, but with a certain amount of cunning
you can make it no worse than the ISO 2022 method.]

Now, you can only achieve code-set independence as easily as that in a
high-level language, and font-compressed files really require all the
utilities in the system to be internationalised at once, so the ANSI
committee didn't really have the option of adopting a solution like this.