kjartan@rhi.hi.is (Kjartan R. Gudmundsson) (10/28/88)
How difficult is it to convert American/English programs so that they can
be used to handle foreign text? The answer of course depends on the
language one has in mind. In Europe most nations use the Latin alphabet,
and English is one of them. Unfortunately English uses very few characters
compared to other European languages, so the code set most widely used by
Americans and the English, the ASCII character set, defines only 128
characters. It is a 7-bit character set.

In European countries other than England, ASCII is also widely used, but
with extensions. The extended character set is 8 bits wide, thus allowing
256 characters. The problem, however, is that the extension is not
standard. We have one possibility in the IBM-PC character set, another
from HP called Roman-8, DEC gives us the DEC Multinational character set,
and the Macintosh has yet another. So if we have a program that, for
example, converts lowercase letters to uppercase, it has to be coded
differently for each character set.

Let's look at some code from MicroEMACS:

    input.org:  if (c>=0x00 && c<=0x1F)
    input.org:  if (c>=0x00 && c<=0x1F) /* C0 control -> C- */
    main.org:   case 'a': /* process error file */
    main.org:   if ((c>=0x20 && c<=0xFF)) { /* Self inserting. */
    random.org: if (*scan >= 'a' && *scan <= 'z')
    random.org: else if (c<0x20 || c==0x7F)
    random.org: else if (c<0x20 || c==0x7F)
    region.org: lputc(linep, loffs, c+'a'-'A');
    region.org: lputc(linep, loffs, c-'a'+'A');
    region.org: if (c>='a' && c<='z')
    search.org: else if (c < 0x20 || c == 0x7f) /* control character */
    word.org:   c += 'a'-'A';
    word.org:   c += 'a'-'A';
    word.org:   c -= 'a'-'A';
    word.org:   c -= 'a'-'A';
    word.org:   if (c>='a' && c<='z') {
    word.org:   if (c>='a' && c<='z') {
    word.org:   wordflag = ((ch >= 'a' && ch <= 'z') ||
    word.org:   if (c>='a' && c<='z')

Ugly, isn't it?

Another way of doing this is to use the "is.." functions that are defined
in ctype.h, an include file that comes with (almost) all C compilers. Some
of the above lines would look like this:

    basic.c:   else if (iscntrl(c))
    display.c: if (iscntrl(c))
    display.c: } else if (iscntrl(c)) {
    eval.c:    *sp = tolower(*sp);
    eval.c:    *sp = toupper(*sp);
    eval.c:    if (islower(*sp) )
    fileio.c:  if (iscntrl( fn[tel++] ) )
    input.c:   if (iscntrl(buf[--cpos]) ) {
    input.c:   if (iscntrl(buf[--cpos])) {
    input.c:   c = toupper(c);
    input.c:   c = toupper(c); /*Force to upper */
    input.c:   if ( islower(c) && ( SPEC != (SPEC & c) ))
    input.c:   if (iscntrl(c) ) /* control key */
    input.c:   if (iscntrl(c) ) /* control key */
    input.c:   if (iscntrl(c) ) /* control key? */

This code is better (most of the is.. things are macros that mask a table
entry for the argument and return a value that is either zero or nonzero),
has more style to it, and is easier to port to a different character set.

Another bad habit of American programmers is this:

    character_value = (character_value & 0x7F )

Don't do this!! If you must, you can use 0xFF instead:

    character_value = (character_value & 0xFF )

(Unless of course your machine breaks to pieces if it gets an 8-bit
character in its I/O channels.)

###############################################################################
# Kjartan R. Gudmundsson                                                      #
# Raudalaek 12                                                                #
# 105 Reykjavik              Internet: kjartan@rhi.hi.is                      #
#                            uucp: ...mcvax!hafro!rhi!kjartan                 #
###############################################################################
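The ctype.h style recommended in the post above can be sketched roughly as follows. This is only an illustration, not MicroEMACS code: upcase_region() and is_control() are hypothetical names. The cast to unsigned char is an addition that matters for 8-bit text, since a byte above 0x7F stored in a signed char would otherwise reach the is.../to... functions as a negative value. Whether such bytes are then treated as letters depends on the C library and on the locale the program has selected, so the sketch assumes nothing beyond the default "C" locale.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: upper-case n bytes in place using ctype.h
 * instead of hand-written 'a'..'z' range tests. */
void upcase_region(char *buf, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        /* Go through unsigned char so that bytes >= 0x80 are not
         * sign-extended into negative arguments. */
        unsigned char c = (unsigned char)buf[i];

        if (islower(c))
            buf[i] = (char)toupper(c);
    }
}

/* Hypothetical helper: stands in for tests such as
 * (c < 0x20 || c == 0x7F). */
int is_control(int c)
{
    return iscntrl((unsigned char)c) != 0;
}
```

With this shape, porting to another character set means changing what the library (or a replacement table) says about each byte, not hunting down every range comparison in the editor.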
gwyn@smoke.BRL.MIL (Doug Gwyn ) (10/31/88)
In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes: >How difficult is it convert american/english programs so that they can >be used to handle foreign text? [etc.] Where have you been the last few years? This subject area is known as "internationalization" and has been the featured topic of special issues of several journals, including UNIX Review and UNIX/World. The draft proposed ANSI/ISO C standard specifically addresses this issue (it is one of the reasons production of the final standard was delayed).
ok@quintus.uucp (Richard A. O'Keefe) (10/31/88)
In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>The problem is however that the extension is not standard.

There is an international standard for 8-bit character sets: ISO 8859.
There are several versions of 8859, just as there were several national
versions of ISO 646 (of which ASCII was only one). All versions include
ASCII as the bottom half. ISO Latin 1 (8859/1) is pretty close to DEC's
Multinational Character Set, and is supposed to cover most West European
languages (including Icelandic). There is a Cyrillic version (I think it
is 8859/2) and others are under way.

>An other bad habit of american programmers is this:
>character_value = (character_value & 0x7F )
>don't do this!! If you must, you can use 0xFF insted:
>character_value = (character_value & 0xFF )

The only time when I've wanted to do this is when stripping off a parity
bit, and using 0xFF would be totally wrong. The toascii() macro *might*
be appropriate. When you're dealing with a 7 data + 1 parity bit device,
there is no point in pretending that you're prepared to accept anything
other than 7 data bits.

The real problem is trying to write portable code that uses character
classes which _aren't_ in <ctype.h>. Consider isvowel()...
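The two positions on masking (0x7F to strip parity, 0xFF to stay 8-bit clean) can be reconciled by making the mask depend on how the line is configured instead of hard-coding it. The following is a minimal sketch under that assumption; the enum and the function name are hypothetical, and a real program would take the setting from its terminal configuration.

```c
/* Hypothetical description of how the input line is configured. */
enum line_width {
    LINE_7BIT_PARITY,   /* 7 data bits + 1 parity bit */
    LINE_8BIT_CLEAN     /* 8 data bits, no parity     */
};

/* Strip the top bit only when it really is a parity bit. */
int normalize_input_byte(int c, enum line_width width)
{
    if (width == LINE_7BIT_PARITY)
        return c & 0x7F;    /* discard the parity bit */
    return c & 0xFF;        /* keep all eight data bits */
}
```

On a 7-bit line the byte 0xE9 becomes 0x69; on an 8-bit line it passes through unchanged.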
nwd@j.cc.purdue.edu (Daniel Lawrence) (10/31/88)
In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>
>How difficult is it convert american/english programs so that they can
>be used to handle foreign text? The answer of course depends on the language
[a description of some of the problems using 8 bit chars]
>
>Let's look at some code from MicroEMACS:
>
[a code excerpt from MicroEMACS 3.9]
>Ugly isn't it?
>

OK, I am feeling a little picked on here... a lot of people like using
uEMACS for pointing things like this out. When I first started working
with it, it was just for me. But that is really no excuse...

>An other way of doing this is using "is.." functions that are
[an alternative which is better]
>This code is better (most of the is.. things are macros that mask
[More descriptions of 8 bit problems...]

And someone finally proposes some solutions rather than just blindly
stabbing out and complaining. During the last round of complaints I sent
out a request for information on this problem, and the best I got back
was: go to the library and do some research. Well, for a project I am
doing in my spare time, and considering the poor library system round
here, I really wasn't happy to hear all the griping and then get no help
from anyone to fix the problems. So I applaud Mr. Gudmundsson for his mail.

># Kjartan R. Gudmundsson #
># Raudalaek 12 #
># 105 Reykjavik # Internet: kjartan@rhi.hi.is #

However, after the last round, I thought carefully about the 8-bit
problems, and resolved that the issue was too complex on a
language-by-language basis for me to ever attempt to get all the case
mappings correct. So when you see the next version of MicroEMACS, it will
have a user-changeable upper/lowercase mapping function (which is working
right now).

Note: This slows down the regular pattern matching code considerably, so
uEMACS can be compiled with the diacritical (un-American in this case)
support turned off, but both options now exist.

Daniel Lawrence (317) 742-5153 UUCP: {pur-ee!}j.cc.purdue.edu!nwd ARPA: nwd@j.cc.purdue.edu FIDO: 1:201/10 The Programmer's Room (317) 742-5533
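A user-changeable upper/lowercase mapping of the kind described above could be built on a pair of 256-entry tables. The sketch below is not the MicroEMACS implementation, which the post does not show; all names are hypothetical. The tables start out mapping only the 26 unaccented letters, and the user (or a startup file) adds pairs for whatever character set the terminal uses.

```c
/* Hypothetical case-mapping tables, indexed by byte value. */
static unsigned char upcase_table[256];
static unsigned char lowcase_table[256];

/* Register one upper/lower pair, e.g. from a user's startup file. */
void set_case_pair(unsigned char upper, unsigned char lower)
{
    upcase_table[lower] = upper;
    lowcase_table[upper] = lower;
}

/* Start with every byte mapping to itself, then add the 26
 * unaccented letters (listed explicitly, so no assumption is made
 * that they are contiguous in the character set). */
void init_case_tables(void)
{
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    int i;

    for (i = 0; i < 256; i++) {
        upcase_table[i] = (unsigned char)i;
        lowcase_table[i] = (unsigned char)i;
    }
    for (i = 0; lower[i] != '\0'; i++)
        set_case_pair((unsigned char)upper[i], (unsigned char)lower[i]);
}

int map_upper(int c) { return upcase_table[(unsigned char)c]; }
int map_lower(int c) { return lowcase_table[(unsigned char)c]; }
```

A user on an ISO Latin 1 terminal could then call set_case_pair(0xC9, 0xE9) to pair the upper- and lowercase e-acute; a user of another 8-bit set would register different byte values.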
nelson@sun.soe.clarkson.edu (Russ Nelson) (11/01/88)
In article <8045@j.cc.purdue.edu> nwd@j.cc.purdue.edu (Daniel Lawrence) writes: In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes: > >How difficult is it convert american/english programs so that they >can be used to handle foreign text? The answer of course depends >on the language So when you see the next version of MicroEMACS, it will have a user changable upper/lowercase mapping function (which is working right now). Same for Freemacs. I also used to take over the keyboard interrupt (INT 9), but some of the international users complained that it broke their keyboard mapper (not to mention the fact that it lost with TSRs), so I took it out. -- --russ (nelson@clutx [.bitnet | .clarkson.edu]) To surrender is to remain in the hands of barbarians for the rest of my life. To fight is to leave my bones exposed in the desert waste.
guy@auspex.UUCP (Guy Harris) (11/01/88)
>An other bad habit of american programmers is this:
And another one is using the 8th bit internally for quoting; don't do that.
guy@auspex.UUCP (Guy Harris) (11/01/88)
>There is a Cyrillic version (I think it is 8859/2) No, 8859/2 is another Latin set; there are four Latin alphabets (8859/[1234], I think), and there seem to be at least drafts for Greek and Cyrillic. >The only time when I've wanted to do this is when stripping off a parity >bit, and using 0xFF would be totally wrong. The toascii() macro *might* >be appropriate. When you're dealing with a 7 data + 1 parity bit device, >there is no point in pretending that you're prepared to accept anything >other than 7 data bits. Except that most devices can be *told* to handle 8 bits; never assume that when you're dealing with a terminal that you're dealing with a 7 data + 1 parity bit device (unless your software deals *only* with one specific terminal that *can't* generate 8 bits). >The real problem is trying to write portable code that uses character >classes which _aren't_ in <ctype.h>. Consider isvowel()... Or, for that matter, consider "toupper()"; what's "toupper()" of a German "ss" (or is it "sz") character?
dik@cwi.nl (Dik T. Winter) (11/01/88)
In article <362@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes: > >There is a Cyrillic version (I think it is 8859/2) > > No, 8859/2 is another Latin set; there are four Latin alphabets > (8859/[1234], I think), and there seem to be at least drafts for Greek > and Cyrillic. Cyrillic is 8859/5 and approved. /6 is arabic, /7 greek and /8 hebrew; these three are still draft (as are /3 and /4). -- dik t. winter, cwi, amsterdam, nederland INTERNET : dik@cwi.nl BITNET/EARN: dik@mcvax
robert@pvab.UUCP (Robert Claeson) (11/01/88)
In article <8804@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn ) writes:
> In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
> >How difficult is it convert american/english programs so that they can
> >be used to handle foreign text? [etc.]

> Where have you been the last few years? This subject area is known as
> "internationalization" and has been the featured topic of special issues
> of several journals, including UNIX Review and UNIX/World.

The fact is, internationalization hasn't shown up in any product I know
of (except for HP's NLS, but few use it). People are still just *talking*
about internationalization, but not *doing* it.
--
Robert Claeson, ERBE DATA AB, P.O. Box 77, S-175 22 Jarfalla, Sweden
Tel: +46 758-202 50 Fax: +46 758-197 20
EUnet: rclaeson@erbe.se ARPAnet: rclaeson%erbe.se@uunet.uu.net
mark@jhereg.Jhereg.MN.ORG (Mark H. Colburn) (11/02/88)
In article <8804@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>>How difficult is it convert american/english programs so that they can
>>be used to handle foreign text? [etc.]
>
>Where have you been the last few years? This subject area is known as
>"internationalization" and has been the featured topic of special issues
>of several journals, including UNIX Review and UNIX/World. The draft
>proposed ANSI/ISO C standard specifically addresses this issue (it is
>one of the reasons production of the final standard was delayed).

Unfortunately, the C standard is still lacking in this area. It is true
that the attempt was made; however, X3J11 will have to go through another
round if it is to be truly internationalized.

One problem is that, although the standard supports multi-byte characters,
which are required for a number of languages around the world, especially
those in Asia, no support is provided to pass those characters to any of
the is...() or to...() functions. Since all the is...() and to...()
functions take an integer parameter, it would be impossible to evaluate a
multi-byte character.

Another problem is that an application has no way of portably determining
where the current character in a string ends and the next begins; you
can't just use ch++ to advance to the next character anymore. And it is
even harder to move backwards through a string.

There are some other problems with collation as well: some languages may
have several lowercase characters corresponding to a single uppercase
character, or vice versa. This presents some problems when using toupper()
and tolower() to convert a character to its opposite case. In addition, in
some languages and/or collation sequences there are some characters which
do not have a corresponding opposite case (i.e. there is only an uppercase
character with no corresponding lowercase character in a code set).

To be fair, we did not uncover these deficiencies until just recently
(just after we sent our ballot in for the third public review), so these
may not have been issues specifically addressed by the committee.

There are some solutions to these problems, which would allow for
internationalization without breaking any existing programs. Here are some
suggestions:

1. Develop some functions which provide the same functionality as the
   is...() functions but which take a character pointer as an argument.
   For example:

       int wcislower(char *string)

2. Develop some functions which provide the same functionality as the
   to...() functions but which return a character pointer. Unfortunately,
   these functions may need to allocate space in order for the
   transformation to work, or they may need to pass back a pointer to a
   static string which would then need to be copied. The latter is
   probably the way most implementations would do it since it is
   essentially a table lookup. For example:

       char *wctolower(char *string)

3. Provide some functions to allow traversing a character string. These
   functions would return a pointer to the next character in the string
   as determined by the current locale. For example:

       char *nextchar(char *string)
       char *prevchar(char *string, char *backup)

   These last two functions were presented at the latest IEEE POSIX
   meeting by one of the committee members to cope with this problem. The
   backup string in prevchar() provides a pointer to a known character
   boundary that the function can use to scan forward in the string in
   order to determine where the actual character boundary of the previous
   character is.

4. Some of the string functions would need to be revised as well,
   specifically strlen().

       int wcstrlen(char *string)

   This function would return the string length of the current string
   according to the current locale setting. Therefore the string "abss"
   would give a length of 4 in the C locale, but may return 3 in a German
   locale. The functionality of this could be put in the current strlen;
   however, there are still requirements to get the number of bytes in a
   string, as well as the number of characters, so the old strlen should
   not be replaced.

Internationalization is a tricky and involved problem. Unfortunately it is
not possible for an existing program to recompile under an ANSI compiler
and become internationalized. A number of changes to the application are
required in order to provide for maximally portable code. However, it is
possible to provide the internationalization without breaking any existing
code.

What has been discussed so far is character-level internationalization,
which is only one side of the fence. The other side is language
translation of strings. This is known as "messaging" in the circles which
talk about internationalization (let's overload yet another computer
science term...).

However, messaging can be accomplished by developing messaging libraries
which contain the strings required by the application, translated into
every language which your application needs to support. When you wish to
display a string, such as "press spacebar to continue", you call the
messaging library with a unique identifier which is associated with your
string, and the messaging library returns a string, based on the current
locale, which depicts the same idea as "press spacebar to continue".

This also requires some fancy footwork on the part of applications, since
displaying these messages is bound to be very difficult: some languages
read left-to-right, some read right-to-left, and some, such as Mongolian,
do both and even go diagonally. Add string attributes such as centering
and justification and character attributes such as inverse, normal and
blinking, and messaging becomes very interesting indeed.

Internationalization is a relatively new field, and a number of things
still need to be ironed out, but I think that we are making progress, and
that progress should continue.
--
Mark H. Colburn          "They didn't understand a different kind of
NAPS International        smack was needed, than the back of a hand,
mark@jhereg.mn.org        something else was always needed."
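Suggestions 3 and 4 above (stepping through a string character by character, and counting characters as opposed to bytes) can be approximated with the standard library's mblen(), which reports how many bytes the next multibyte character occupies in the current locale. This is a minimal sketch, not the interface proposed at the POSIX meeting: next_char(), prev_char() and char_count() are hypothetical names, invalid sequences are simply stepped over one byte at a time, and the argument order of prev_char() is this sketch's own choice.

```c
#include <stdlib.h>
#include <stddef.h>

/* Return a pointer to the character after the one at s, as
 * determined by the current locale. Stays put at the terminator. */
const char *next_char(const char *s)
{
    int n;

    if (*s == '\0')
        return s;
    n = mblen(s, MB_CUR_MAX);
    if (n <= 0)
        n = 1;              /* invalid sequence: skip a single byte */
    return s + n;
}

/* Return a pointer to the character before s. 'backup' must point
 * at a known character boundary at or before s; the function scans
 * forward from there, as described for prevchar() above. */
const char *prev_char(const char *backup, const char *s)
{
    const char *prev = backup;
    const char *p = backup;

    while (p < s && *p != '\0') {
        prev = p;
        p = next_char(p);
    }
    return prev;
}

/* Count characters (not bytes) in s. */
size_t char_count(const char *s)
{
    size_t count = 0;

    (void)mblen(NULL, 0);   /* reset any shift state */
    while (*s != '\0') {
        s = next_char(s);
        count++;
    }
    return count;
}
```

In the default "C" locale every byte is one character, so char_count("abss") is 4, the same as strlen(). A different count only appears once the program has called setlocale() with a locale whose encoding is multibyte, and mblen() counts encoded characters only; it knows nothing about treating a letter pair such as "ss" as a single unit.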
gauss@homxc.UUCP (E.GAUSS) (11/02/88)
In article <8804@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn ) asks:
> >How difficult is it convert american/english programs so that they can
> >be used to handle foreign text? [etc.]
>
In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) replies:
> Where have you been the last few years? This subject area is known as
> "internationalization" and has been the featured topic of special issues ...

An author friend that I work with, Eb Colville, has been trying for a
number of years to find a VI editor that will handle the German characters
available in the extended ASCII characters on his MS-DOS PC. He used those
in his novel, THE LAST ZEPPELIN, which is trying to find a publisher.
Whatever the talk, it does not seem to be possible to do this. Extended
ASCII requires the full eight bits to be available, and all VI's that we
have seen simply toss away the lead bit, folding umlauted characters into
control characters.

We ended up writing a filter so that Eb types u/e when he wants an
umlauted e, and just before printing we run his text through the filter,
which replaces it by the appropriate extended ASCII character. (It also
unfolds the folded characters, but that is risky as you cannot have any
control characters hidden in your text.) If your word processor does not
balk at eight-bit characters, this is a workable way of putting the
characters in in the first place.

Eb has been asking about Cyrillic (Russian) characters for his next novel,
BEYOND THE YUKON, and I have refused even to discuss this with him. There
are methods for doing Japanese where the keyboardist types in "Romaji" and
the computer makes a guess at the kanji. I told Eb that if he has any
plans to try Japanese word processing he will have to go to Japan.

Ed Gauss
bill@twwells.uucp (T. William Wells) (11/02/88)
In article <362@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
: >The real problem is trying to write portable code that uses character
: >classes which _aren't_ in <ctype.h>. Consider isvowel()...
:
: Or, for that matter, consider "toupper()"; what's "toupper()" of a
: German "ss" (or is it "sz") character?
The character is the "Scharfes S" and looks something like a beta. It
doesn't have an upper case form. When written where an upper case
letter should go, it is written unchanged. Alternatively, it might be
written "SS".
---
Bill
{uunet|novavax}!proxftl!twwells!bill
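Because the sharp s may expand to "SS", upper-casing has to work on whole strings and not on single characters, which is exactly what a char-in, char-out toupper() cannot express. The sketch below assumes the text is encoded in ISO 8859-1 (Latin 1), where the sharp s is byte 0xDF and the accented lowercase letters occupy 0xE0 to 0xFE except 0xF7 (the division sign); latin1_upcase() is a hypothetical name, not a library function.

```c
#include <stddef.h>

/* Upper-case a Latin-1 string from src into dst, expanding the
 * sharp s (0xDF) to "SS". Returns the number of bytes written, not
 * counting the terminator, or (size_t)-1 if dst is too small. */
size_t latin1_upcase(const char *src, char *dst, size_t dstsize)
{
    size_t out = 0;

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;

        if (c == 0xDF) {                    /* sharp s: one byte in, two out */
            if (out + 2 >= dstsize)
                return (size_t)-1;
            dst[out++] = 'S';
            dst[out++] = 'S';
            continue;
        }
        /* ASCII a-z and Latin-1 accented lowercase letters sit 0x20
         * above their uppercase forms; 0xF7 is the division sign and
         * 0xFF (y with diaeresis) has no uppercase form in Latin-1. */
        if ((c >= 0x61 && c <= 0x7A) ||
            (c >= 0xE0 && c <= 0xFE && c != 0xF7))
            c = (unsigned char)(c - 0x20);

        if (out + 1 >= dstsize)
            return (size_t)-1;
        dst[out++] = (char)c;
    }
    if (out >= dstsize)
        return (size_t)-1;
    dst[out] = '\0';
    return out;
}
```

The output can be longer than the input, so the caller has to supply a separate destination buffer; an in-place conversion like the ones in the MicroEMACS excerpts cannot handle this case.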
guy@auspex.UUCP (Guy Harris) (11/02/88)
In article <4002@homxc.UUCP> gauss@homxc.UUCP (E.GAUSS) writes: (And misattributes both quotes - that's why I don't like the "In article ..., ... writes:" lines) >An author friend that I work with, Eb Colville, has been trying for a >number of years to find a VI editor that will handle the German characters >available in the extended ASCI characters on his MS-DOS PC. He used those >in his novel, THE LAST ZEPPELIN, which is trying to find a publisher. Whatever >the talk, it does not seem to be possible to do this. Extended ASCII >requires the full eight bits to be available, and all VI's that we have >seen simply toss away the lead bit folding umlauted characters into >control characters. The "vi" in System V Release 3.1 handles 8-bit characters. Unfortunately, I don't know if anybody's ported it to MS-DOS.... Also, some version of Unipress EMACS can be configured to support 8-bit characters as well (I don't know if that version has been released yet or not). >There are methods for doing Japannese where the keyboardist types in >"Romanji" and the computer makes a guess at the konji. The ones I've seen convert Romaji to Kana as you type (this is, as I understand it, a straightforward translation) and then permit you to request that the computer translate the Kana you typed since the last checkpoint (switching mode into Kanji mode, or asking for a Kana-to-Kanji translation) into Kanji. It gives you a list of the possible translations, and lets you choose which one you want. Of course, now you'd need an editor that handles *16*-bit characters; I think AT&T has a "vi" that will handle them, and I don't know about EMACS (although I remember an #ifdef in the aforementioned Unipress version for Kanji).
guy@auspex.UUCP (Guy Harris) (11/02/88)
>This also requires some fancy footwork on the part of applications, since >displaying these messages is bound to be very difficult since some >languages read left-to-right, some read right-to-left, and some sucn as >Mongolian, do both and even go diagonally. Add string attributs such as >centering and justification and character attributes such as inverse, normal >and blinking and messaging becomes very interesting indeed. And then add, on top of that, the fact that computer displays are often graphical, not just text. For instance, you might have an array of "buttons" on the screen; the layout of the panel might differ depending on whether the buttons are filled up with short English words or typicalverylongGermanwords.
weemba@garnet.berkeley.edu (Matthew P Wiener) (11/02/88)
Hey folks, I think you're all wonderful, but could you take this discussion out of comp.emacs unless it's Emacs related? There's even a comp.std.internat newsgroup. ucbvax!garnet!weemba Matthew P Wiener/Brahms Gang/Berkeley CA 94720 "Nil sounds like a lot of kopins! I never got paid nil before!" --Groo
ok@quintus.uucp (Richard A. O'Keefe) (11/02/88)
In article <207@jhereg.Jhereg.MN.ORG> mark@jhereg.MN.ORG (Mark H. Colburn) writes: >In article <8804@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes: >>In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes: >>>How difficult is it convert american/english programs so that they can >>>be used to handle foreign text? [etc.] Xerox have supported a 16-bit character set (XNS) for years. Some of the surprises mentioned by Mark Colburn have been no news to Interlisp-D programmers for a long time. The kludges being proposed for C & UNIX just so that a sequence of "international" characters can be accessed as bytes rather than pay the penalty of switching over to 16 bits are unbelievable.
george@mnetor.UUCP (George Hart) (11/03/88)
In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes: > >How difficult is it convert american/english programs so that they can >be used to handle foreign text? If you just need to handle full 8 bit characters, it is merely painful. If you need to handle multibyte characters (e.g. Kanji) or a mix of character sets, it is excruciating. >In other european countries than England >the ASCII character set is also widely used but with extension. >The character set is 8 bit thus allowing 256 characters. >The problem is however that the extension is not standard. There is, of course, the ISO 8859 family of 8 bit character sets which contain ASCII as a perfect subset. > < excerpts of MicroEmacs code > > >Ugly isn't it? Yes. vi and the Bourne shell were(are) other offenders. I believe recent releases of SysV have cleaned up the naughty uses of the 8th bit. > < sample ctype.h invocations > > >This code is better (most of the is.. things are macros that mask >the argument and return the binary mask that is either zero or positve) >has more style to it and is easiear to port to a diffrent character set. Unfortunately, the results of the macros are undefined unless isascii(c) is positive which sort of defeats the spirit of what you intend. Of course, you could develop an 8 bit ctype.h compatible with a particular 8 bit character set. >An other bad habit of american programmers is this: >character_value = (character_value & 0x7F ) This has more to do with assumptions about character sets supported by the system than nationality. Historically, assuming an ASCII environment was not unreasonable. While this is no longer true, until vendors and standards bodies get off their collective pots and develop practical character sets and conventions for multilingual environments (including multibyte characters), things will remain confused, fragmented, and incompatible. -- Regards.....George Hart, Computer X Canada Ltd. 
UUCP: {utzoo,uunet}!mnetor!george BELL: (416)475-8980
gwyn@smoke.BRL.MIL (Doug Gwyn ) (11/03/88)
In article <207@jhereg.Jhereg.MN.ORG> mark@jhereg.MN.ORG (Mark H. Colburn) writes: >Unfortunately, the C standard is still lacking in this area. It is true >that the attempt was made, however, X3J11 will have to go through another >round if it is to be truly internationalized. Not really; all the issues you raised in your posting were already addressed in arriving at the current draft proposed C Standard, which had the assistance and approval of several specialists in internationalization. I don't want to try to discuss the details here; however, I will remark that the wchar_t type is NOT intended to be used in all the contexts where a character would normally be used in an 8-bit environment. ITSCJ indicated that the functions provided were sufficient. Others can of course be provided as extensions but were not felt to be sufficiently important to standardize. P.S. This is not to be taken as an official X3J11 statement!
henry@utzoo.uucp (Henry Spencer) (11/03/88)
In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes: >The problem is however that the extension is not standard. >We have one possability in the IBM-PC character set, other one from HP called >Roman-8, DEC gives us DEC-multinational character set and the Macintosh >has yet another... ISO Latin 1 will probably supersede all of these, pretty much solving that problem. (If you haven't heard of it, and want to know what it's like, it's pretty similar to DEC's multinational set.) -- The Earth is our mother. | Henry Spencer at U of Toronto Zoology Our nine months are up. |uunet!attcan!utzoo!henry henry@zoo.toronto.edu
karl@haddock.ima.isc.com (Karl Heuer) (11/03/88)
In article <207@jhereg.Jhereg.MN.ORG> mark@jhereg (Mark Colburn) writes: >One problem is that, althougth the [C] standard supports multi-byte >characters ..., no support is provided [for using them with <ctype.h>]. >Here are some suggestions: > int wcislower(char *string) > char *wctolower(char *string) It's not necessary to pass/return pointers; there is an arithmetic type "wchar_t" in ANSI C. Thus it would be simpler to define these as follows: int wcislower(wchar_t wc); wchar_t wctolower(wchar_t wc); Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint Followups to comp.std.{internat,c}.
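The wchar_t-based signatures suggested here are close to what was later standardized: a 1995 amendment to the C standard added <wctype.h>, with iswlower() and towlower() playing the roles of the hypothetical wcislower() and wctolower(). Nothing of the kind was available to the posters in 1988, so the following is a sketch in terms of that later interface; wide_lowercase() is a made-up helper name.

```c
#include <wchar.h>
#include <wctype.h>

/* Lower-case a wide string in place, one wchar_t at a time. No
 * pointers to multibyte sequences are needed, because each
 * character is a single arithmetic value. */
void wide_lowercase(wchar_t *ws)
{
    for (; *ws != L'\0'; ws++) {
        if (iswupper((wint_t)*ws))
            *ws = (wchar_t)towlower((wint_t)*ws);
    }
}
```

Text that arrives as multibyte char strings would first be converted with mbstowcs(), which is part of the draft standard under discussion in this thread.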
gdh@raider.MFEE.TN.US (Gordon Hull) (11/03/88)
In article <7690@boring.cwi.nl>, dik@cwi.nl (Dik T. Winter) writes: > Cyrillic is 8859/5 and approved. /6 is arabic, /7 greek and /8 hebrew; > these three are still draft (as are /3 and /4). Can someone tell me where I could get hold of said standards? I've got a font generation program, and it'd be nice if it conformed with the official standards. Thanx... -- Gordon Hull ------------ via RaiderNet (node 2, UNIX) (615) 896-8716 ---------- Internet: gdh@raider.mfee.tn.us | FROM CommStuff IMPORT StdDisclaimer; FIDO: Gordon Hull at 1:116/9 | "What a time! What a civilization!" | - Cicero
gordan@maccs.McMaster.CA (gordan) (11/03/88)
In article <362@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
|>There is a Cyrillic version (I think it is 8859/2)
|
|No, 8859/2 is another Latin set; there are four Latin alphabets
|(8859/[1234], I think), and there seem to be at least drafts for Greek
|and Cyrillic.

8859/2 is the Eastern European character set
8859/5 is Cyrillic

Tim Lasko posted listings for both of these to comp.std.internat a couple
of months ago. If anyone wants a copy, send me e-mail.
--
Gordan Palameta
uunet!ai.toronto.edu!utgpu!maccs!gordan
mark@jhereg.Jhereg.MN.ORG (Mark H. Colburn) (11/03/88)
In article <621@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>The kludges being proposed for C & UNIX just so that a sequence of
>"international" characters can be accessed as bytes rather than pay
>the penalty of switching over to 16 bits are unbelievable.

There is more to it than just moving to 16-bit characters. There are a
number of places where a character sequence needs to be recognized. Often
that character sequence is in 8-bit or 7-bit ASCII.

The drafts of ANSI C and POSIX both have the notion of collation
sequences; that is, some idea of how to sort characters in different
locales. The collation sequence can vary from locale to locale. I would
encourage you to look in the draft C standard for more details.

Collation sequences can be used for more than just internationalization,
however. Consider the phone book which all of us have sitting around. In
the US and most other English-speaking countries, the phone book has some
rather odd collation sequences in it. Most notably, any names beginning
with "Mc" or "Mac" come before "M" in the phone book. It would be useful
for some applications to define a collation sequence which would provide
that particular behaviour.

Now then, "Mc" and "Mac" are not (and should not be) represented as 16-bit
characters. Other examples include the German ss character, which could be
represented as a unique character, but most Germans would still type 'ss'
rather than hunting for a new key.

16-bit characters are good for some things, such as Kanji or other Asian
code sets, but may be less useful in a number of other areas. Requiring
16-bit characters puts a large burden of unused memory on those
applications which only use 8-bit characters. For that reason alone, ANSI
would be justified in not requiring 16-bit characters. However, I don't
believe that there is anything in the standard which would preclude a
conforming ANSI C implementation from having 16-bit characters.
--
Mark H. Colburn          "They didn't understand a different kind of
NAPS International        smack was needed, than the back of a hand,
mark@jhereg.mn.org        something else was always needed."
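The collation facility mentioned above is exposed in the draft C standard as strcoll(), which compares two strings according to the current locale's collating sequence. A minimal sketch of sorting names with it follows; sort_names() and compare_by_locale() are hypothetical names. Which order actually results depends on the locale the program has selected with setlocale(), and a phone-book ordering that puts "Mc" and "Mac" first would need a locale definition written for that purpose, which this sketch does not supply.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparison function: defer to the locale's collation
 * sequence instead of comparing raw byte values. */
static int compare_by_locale(const void *a, const void *b)
{
    const char *const *sa = (const char *const *)a;
    const char *const *sb = (const char *const *)b;

    return strcoll(*sa, *sb);
}

/* Sort an array of string pointers in the current locale's order. */
void sort_names(const char **names, size_t count)
{
    qsort(names, count, sizeof names[0], compare_by_locale);
}
```

In the default "C" locale strcoll() orders strings the same way strcmp() does, so the program's behaviour only changes once a locale with its own collating sequence is in effect.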
alex@mks.UUCP (Alex White) (11/03/88)
In article <380@auspex.UUCP>, guy@auspex.UUCP (Guy Harris) writes: < In article <4002@homxc.UUCP> gauss@homxc.UUCP (E.GAUSS) writes: < >An author friend that I work with, Eb Colville, has been trying for a < >number of years to find a VI editor that will handle the German characters < >available in the extended ASCI characters on his MS-DOS PC. He used those < >in his novel, THE LAST ZEPPELIN, which is trying to find a publisher. Whatever < >the talk, it does not seem to be possible to do this. Extended ASCII < >requires the full eight bits to be available, and all VI's that we have < >seen simply toss away the lead bit folding umlauted characters into < >control characters. < < The "vi" in System V Release 3.1 handles 8-bit characters. < Unfortunately, I don't know if anybody's ported it to MS-DOS.... MKS vi also handles 8-bit characters if you set an option.
terry@wsccs.UUCP (Every system needs one) (11/10/88)
In article <621@quintus.UUCP>, ok@quintus.uucp (Richard A. O'Keefe) writes:
> In article <207@jhereg.Jhereg.MN.ORG> mark@jhereg.MN.ORG (Mark H. Colburn) writes:
> >In article <8804@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
> >>In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
> >>>How difficult is it convert american/english programs so that they can
> >>>be used to handle foreign text? [etc.]
>
> Xerox have supported a 16-bit character set (XNS) for years.
> Some of the surprises mentioned by Mark Colburn have been no news
> to Interlisp-D programmers for a long time.
>
> The kludges being proposed for C & UNIX just so that a sequence of
> "international" characters can be accessed as bytes rather than pay
> the penalty of switching over to 16 bits are unbelievable.

First of all, there are too many 8-bit character models available: all of the ISO models, DEC Multinational, 7-bit replacement sets, Wang-PC international sets, and IBM-PC International sets. There is no way to consolidate them without mapping, and that's so device dependent it isn't funny. Consider your termcap growing by at least 128 times the number of entries characters... assuming that there is no need for multiple GS/GE strings, as it may require more than one additional character set on some terminals.

Second, vi in the US strips the 8th bit out, and is therefore not usable for programming international (8-bit) characters using either model.

Problems with 16 bit characters:

O  The Xerox model is 16-bit and only valid for bitmapped displays,
   like Mac, and we all know how slowly that scrolls.
O  All of the current software would break without extensive rewrite
O  The internal overhead in a non-message passing operating system
   (most of them) is so high that it's ridiculous.
O  Think of pipes and all file I/O going half as fast.
O  Think of your hard disks shrinking to half their size... source
   files, after all, are text.

terry@wsccs
ok@quintus.uucp (Richard A. O'Keefe) (11/13/88)
In article <774@wsccs.UUCP> terry@wsccs.UUCP (Every system needs one) writes:
>Second, vi in the US strips the 8th bit out, and is therefore not
>usable for programming international (8-bit) characters using either model.

AT&T announced clearly in the SVID that they were going to stop doing that kind of thing, _and_they_have_.

>Problems with 16 bit characters:
>
>O The Xerox model is 16-bit and only valid for bitmapped displays,
> like Mac, and we all know how slowly that scrolls.
>

The Xerox model (XSIS 058404) has nothing to do with bitmapped displays.

>O All of the current software would break without extensive rewrite

It's going to break _anyway_. If you do one-character-equals-one-byte operations on Kanji, the results just aren't going to make sense. With a 16-bit model (actually, the Xerox model already has provision for 24-bit characters, though the implementation I was familiar with didn't provide them yet). In fact, when XNS support was added to Interlisp, most programs didn't even need to be recompiled, and those that needed other changes mostly _could_ have been written to be independent of character set using facilities already in the language.

>O The internal overhead in a non-message passing operating system
> (most of them) is so high that it's ridiculous.
>O Think of pipes and all file I/O going half as fast.
>O Think of your hard disks shrinking to half their size... source
> files, after all, are text.

These are essentially the same point, and are equally mistaken. There is no reason why a _single_ character and a _sequence_ of characters need to use the same coding. There are three representations used for character sequences in Interlisp-D: thin strings (vectors of 8-bit characters from "character set 0"), fat strings (vectors of 16-bit characters), and files (sequences of characters drawn from the same 256-character block are stored as sequences of 8-bit codes, with "font change" codes inserted as needed).

Since a file is presumed to start in character set 0, files of 8-bit characters DIDN'T CHANGE AT ALL. If you want to position randomly in a sequence, then you have to know what the "font" is there, or a font change code could be inserted at the start of every block. It is only when a program picks up a single character and looks at it on its own that it materialises as 16 bits.

[This coding wins if you tend to mix languages with small character sets, e.g. if you have whole sentences in English, Russian, Hebrew, Greek, &c, because then you can stay in the same "font" for at least a word at a time. It does not pay off for Kanji, but with a certain amount of cunning you can make it no worse than the ISO 2022 method.]

Now you can only achieve code-set independence as easily as that in a high-level language, and font-compressed files really require all the utilities in the system to be internationalised at once, so the ANSI committee didn't really have the option of adopting a solution like this.
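The file representation described above (8-bit codes with "font change" codes inserted when the 256-character block changes) can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the marker value SHIFT_CODE, the two-byte form of the font change, and the names fc_encode and fc_decode are hypothetical and are not claimed to match the actual Xerox or Interlisp-D file format.

```c
#include <stddef.h>
#include <stdint.h>

#define SHIFT_CODE 0xFFu   /* assumed "font change" marker, illustration only */

/* Encode n 16-bit characters as 8-bit codes, inserting SHIFT_CODE plus the
   new character-set number whenever the high byte changes. Returns the
   number of bytes written, or SIZE_MAX if out is too small or a low byte
   collides with the marker. */
static size_t fc_encode(const uint16_t *in, size_t n, uint8_t *out, size_t cap)
{
    uint8_t set = 0;   /* a file is presumed to start in character set 0 */
    size_t w = 0;

    for (size_t i = 0; i < n; i++) {
        uint8_t hi = (uint8_t)(in[i] >> 8);
        uint8_t lo = (uint8_t)(in[i] & 0xFF);

        if (lo == SHIFT_CODE) return SIZE_MAX;
        if (hi != set) {
            if (cap - w < 2) return SIZE_MAX;
            out[w++] = (uint8_t)SHIFT_CODE;
            out[w++] = hi;
            set = hi;
        }
        if (cap - w < 1) return SIZE_MAX;
        out[w++] = lo;
    }
    return w;
}

/* Decode back to 16-bit characters; a single character only "materialises
   as 16 bits" at this point. Returns the count, or SIZE_MAX on error. */
static size_t fc_decode(const uint8_t *in, size_t n, uint16_t *out, size_t cap)
{
    uint8_t set = 0;
    size_t w = 0;

    for (size_t i = 0; i < n; i++) {
        if (in[i] == SHIFT_CODE) {
            if (i + 1 >= n) return SIZE_MAX;   /* truncated font change */
            set = in[++i];
            continue;
        }
        if (w >= cap) return SIZE_MAX;
        out[w++] = (uint16_t)((set << 8) | in[i]);
    }
    return w;
}
```

Under these assumptions, text that stays in character set 0 encodes to exactly the same bytes it started with, which is the point made above about existing 8-bit files. The sketch also shows the cost of random positioning: fc_decode has to scan from the start (or from a known font change) to learn which set is current.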