S.Kille@CS.UCL.AC.UK (Steve Kille) (11/05/90)
The new X.400(1988) body parts refer to characters according to ISO 6937 (General Text). The X-Windows system makes extensive use of the ISO 8859 character set. Can anyone explain why different choices have been made by these groups, and of the resepctive merits of the two standards (I have seen 6937, but not 8859). Steve
eskovgaa@UVCW.UVIC.CA (Erik Skovgaard) (11/06/90)
ISO 6937 was originally in the first draft MOTIS document. CEN/CENELEC and NIST chose to include this character set together with a "default rendition method". The latter maps non-ASCII characters to ASCII codes so they can be displayed on an ASCII terminal, albeit with loss of information. ISO 6937 has several options, but basically extends IA5 by prefixing some characters with an escape code. ISO 8859 extends IA5 by using the eighth bit, and thus fewer characters are possible than in ISO 6937. In fact, we had a lot of arguments in NIST over which codes we should include as body parts. One of the arguments used was that FTAM supports ISO 8859, but as I recall, we settled on ISO 6937 since this was decided by CEN/CENELEC in their X.400 profile. ....Erik.
K.P.Donnelly@edinburgh.ac.uk (11/06/90)
ISO 6937 and ISO 8859 are both extensions of ASCII (or ISO 646) to 8 bits. Both of them avoid not only the 32 control characters of ISO 646 (columns 0 and 1) but also their 8-bit equivalents (columns 8 and 9), so as to avoid possible transmission or other difficulties.

The most important difference is that ISO 6937 has "floating diacritics" - characters of "zero width" representing accents, so that accented characters are represented by two bytes, one for the unaccented character and one for the accent. This means that it can accommodate many more accented characters within 8 bits than ISO 8859 can. In fact it copes with almost all languages using a Latin-based alphabet. However, it also means that most existing software, such as text editors, will not cope with ISO 6937, whereas most software needs little or no modification to work with ISO 8859. This is probably why ISO 6937, although it came earlier than ISO 8859, has never really been adopted, whereas ISO 8859 is becoming very widely used.

ISO 8859, because of the limited number of characters it gets into 8 bits, has to be split into several parts. ISO 8859-1 covers nearly all Western European languages, which includes a lot of languages with economic clout. ISO 8859-2 covers Eastern European languages with a Latin-based alphabet, such as Czech and Polish. ISO 8859-3 and 8859-4 mop up some of the gaps. ISO 8859-5 is for languages like Russian with a Cyrillic alphabet, 8859-6 is for Arabic, 8859-7 is for Greek and 8859-8 is for Hebrew. ISO 8859-9 is a late addition; it adds to ISO 8859-1 the characters needed for Turkish, at the expense of Icelandic, which has far fewer speakers than Turkish but which got included in ISO 8859-1 because the Icelanders got into 8-bit computing at an early stage, and also because some of its characters are used in Old English. I don't know whether ISO 6937 has any additional parts for languages such as Russian or Arabic with a non-Latin alphabet.
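[A sketch of the floating-diacritic scheme described above, in Python. The accent code points used here (0xC1 grave, 0xC2 acute, 0xC3 circumflex, 0xC8 diaeresis) follow the usual ISO 6937 layout, but treat them as illustrative rather than authoritative; the decoder itself is a minimal model, not a full 6937 implementation.]

```python
import unicodedata

# ISO 6937-style "floating diacritics": a zero-width accent byte
# precedes the base letter.  These code points follow the common
# ISO 6937 tables (illustrative, not exhaustive).
ACCENT_NAMES = {0xC1: "grave", 0xC2: "acute", 0xC3: "circumflex", 0xC8: "diaeresis"}

# Unicode combining marks let normalisation build the precomposed form.
COMBINING = {"grave": "\u0300", "acute": "\u0301",
             "circumflex": "\u0302", "diaeresis": "\u0308"}

def decode_6937(data: bytes) -> str:
    """Decode a byte string in which accents precede their base letter."""
    out, pending = [], None
    for b in data:
        if b in ACCENT_NAMES:            # zero-width accent: hold it
            pending = COMBINING[ACCENT_NAMES[b]]
        else:
            ch = chr(b)                  # ASCII base character
            if pending:
                ch = unicodedata.normalize("NFC", ch + pending)
                pending = None
            out.append(ch)
    return "".join(out)

# "caf<acute>e" takes two bytes for the accented letter, whereas
# ISO 8859-1 stores the same letter in a single byte (0xE9).
assert decode_6937(b"caf\xc2e") == "café"
assert "café".encode("latin-1")[-1] == 0xE9
```

This illustrates the trade-off in the post: the two-byte pairs let 6937 cover far more accented letters within an 8-bit code, at the cost of breaking the one-byte-per-character assumption most software makes.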
ISO 6937 is a development from Teletex. ISO 8859-1 is a development of the DEC multinational character set. Various manufacturers extended ASCII to 8 bits in various ways (e.g. the IBM-PC character set; the HP Roman 8 character set used on LaserJet II laser printers), but the DEC multinational character set has a far more logical layout of characters than the others. ISO 8859-1 is used on DEC VT320 terminals, and in terminal emulations such as MS-Kermit 3.0.

The reason that X.400(1988) refers to ISO 6937 whereas X-Windows makes use of ISO 8859 may be the association between CCITT and Teletex and the association between DEC and the development of X-Windows, or it may just be that X.400(1988) was developed earlier on.

It is now regarded as wasteful to have as many as 64 character positions reserved for control characters, and proposals have been made to extend ISO 8859-1 to cover more languages. Alternatively, it is possible that ISO 6937 might make something of a comeback within the context of structured documents. Or both ideas might be leapfrogged by two-byte or multi-byte character sets, with file compression for storage.

I am no expert and some of the above information may be wrong. If so, I would be glad of corrections. Kevin Donnelly
Stef@ICS.UCI.EDU (Einar Stefferud) (11/06/90)
Since no one has hit Steve's question on the head yet, I will take a shot at it. 8859 was designed to facilitate Data Processing, and thus it is limited to 8-bit codes only, so as to avoid the pain of data processing on mixed-length character codes. 6937 is more "transmission" oriented, with escape codes to signal semantic shifts for subsequent characters. 6937 is favored by various communication-oriented processor vendors. I believe that XEROX uses 6937 quite effectively to support many languages and character sets for their very much internationally oriented document publishing systems.

FTAM supported 8859 because of the data processing orientation of the people involved in making the implementors agreement profiles. X.400 has an obvious tilt toward documents rather than business records. Hope this helps at the meta understanding level. The conflict between 8859 and 6937 is thus deep and unresolvable, though there are some incomplete ways to map some very useful parts of 6937 onto 8859. I expect that all 8859 characters have 6937 equivalents, but this is only a guess on my part. I have no knowledge of 10646, though I expect that it is intended to somehow resolve the problems between 8859 and 6937.

The character set question is a very big mess, and getting worse as the effort to close on something common for the world takes root. There are 3 main camps. North America, where we have little problem with just using ASCII, and we wish the rest of the world would settle the question without making life too complicated for our users who only have ASCII keyboards. Europe, where there are many alphabets and lots of accents and umlauts; EWOS and others in the EU are becoming deeply involved in this mess. Asia, where there are 3 main KANJI alphabets which are very difficult to meld into some kind of single "alphabet". Japanese KANJI characters are strictly limited, and Katakana characters are used as modifiers to extend the limited set.
Chinese KANJI is not so limited, with new characters being invented over time, and with no "Katakana analog" to use for extension. I expect that Korean is more like Chinese, but I am not even slightly expert in this. Are there any other ideogram alphabets?

Anyway, the overall problems will have to be resolved among those countries that have real problems with any loss of the right to use any of their normally used characters as we move to electronic media. Although we ASCII folk may be tempted to look askance at all this character set confusion, I think that we should at least offer our sympathies to those with the real problems, while we try to keep things from getting too complicated.

I sort of shudder at being required to enter Katakana or KANJI or umlauts and accents into ORAddresses, now that X.400 allows T.61 characters in ORAddresses. I wonder how it can be done with my present systems and keyboards? I have seen how the Japanese have modified EMACS to input and display Katakana and KANJI. Rather ingenious it is, and a real testimonial to the power of EMACS. Best...\Stef
Harald.alvestrand@elab-runit.sintef.no (11/06/90)
ISO 6937 defines floating accents; that is, an A with an accent is represented as "accent-sign A", 2 octets. ISO 8859 defines a single sign "accented A". ISO 6937 also lists the "supported combinations" of accents and characters, and has a non-spacing underline, which means that you can underline anything. That in turn means that an eight-character name can take 24 bytes of storage if it is all underlined, accented characters. Makes things a bit problematic for programmers of FORTRAN. In total, ISO 6937 requires about 316 characters or character-accent combinations to be supported. That covers the needs of the Europeans that use Latin alphabets.

The question of switching character sets belongs to ISO 2022, which defines escape sequences for the purpose. That in turn refers to the "international registry of character sets", which is maintained by somebody; I THINK it is ECMA, but I do not remember this clearly.

BTW, ISO 8859 is really a collection of character sets, numbered from ISO 8859-1 (the one the US people are pushing) to ISO 8859-9 (as of now). In all the sets, the lower 128 positions are defined in the same way, but the higher positions may differ. I believe 8859-2 is suitable for the East European languages (characters like C with an inverted circumflex accent are very important in writing the languages of Czechoslovakia, for instance).

So, switching BETWEEN character sets is a requirement, at least until ISO 10646 is finalized (if ever). That one attempts to land every character in the world inside one big 32-bit character set, with ISO 8859-1 as the first 256 code positions, leading to easy compression of 8859-1 text :-)

Any clarity added? Harald Tveit Alvestrand harald.alvestrand@elab-runit.sintef.no
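[The ISO 2022 escape-sequence switching mentioned above can be seen concretely in a modern codec. This sketch uses Python's `iso2022_jp` codec, a real ISO 2022 profile: it designates JIS X 0208 with ESC $ B and returns to ASCII with ESC ( B.]

```python
# ISO 2022 in practice: the iso2022_jp codec switches character sets
# in-band with escape sequences, exactly as ISO 2022 prescribes.
text = "ISO 漢字"            # ASCII followed by two kanji

encoded = text.encode("iso2022_jp")

# ESC $ B designates JIS X 0208 (kanji); ESC ( B switches back to ASCII.
assert b"\x1b$B" in encoded
assert encoded.endswith(b"\x1b(B")

# The escape sequences round-trip cleanly back to the original text.
assert encoded.decode("iso2022_jp") == text
```

Each kanji costs two bytes plus the escape overhead around the run, which is why stateful ISO 2022 switching is workable for transmission but awkward for random-access data processing.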
philip@beeblebrox.dle.dg.com (Philip Gladstone) (11/06/90)
Standard character sets are a true minefield. The scoop is (I think) as follows:

  8859.1      8-bit. The X-Windows character set.
  8859.n      8-bit. A family of character sets, including Greek, Cyrillic, etc.
  6937.n      8-bit (with *some* two-byte characters). Overall it includes the same set of characters as the entire 8859.n family. 6937.1 & 6937.2 are (roughly) equivalent to T.61 (1984) excluding Kanji.
  T.61 (84)   8-bit (some two-byte characters), plus 16-bit Kanji. This contains all Western European characters and the Japanese Kanji set. Note that the Kanji set contains Cyrillic and *some* Greek characters (but no terminal sigma, for instance).
  T.61 (88)   8-bit (some two-byte characters), 16-bit Kanji, 8-bit Greek. This is an enhancement of the 84 version by the addition of an 8-bit Greek set. I think that Chinese also got added.
  JIS C 6226  16-bit. This is known as Kanji.
  JIS X 0208  The current name for JIS C 6226.

T61String (TeletexString) is subtly incompatible between 1984 and 1988 X.400 (but I have a defect report in about that). Note that in any event, T61String (88) is a SUPERSET of (84) in that Greek characters are allowed. Also please note that the curly bracket characters {} *are* in T61String; it is just that they are rather difficult to find, being located in the Kanji portion.

In answer to your question 'What are the respective merits?': As far as I am concerned, each character set for conveying data is sufficiently different, with different escape sequences etc., that you need a comprehensive solution to the character set problem. Once you have that, it doesn't matter much which set you use, provided you don't start losing characters. -- Philip Gladstone Development Lab Europe Data General, Cambridge England. +44 223-67600
hitoaki@kyo-sr.ntt.junet (SAKAMOTO Hitoaki) (11/07/90)
In article <PHILIP.90Nov6103250@beeblebrox.dle.dg.com> philip@beeblebrox.dle.dg.com (Philip Gladstone) writes:
> JIS C 6226 16-bit
> This is known as Kanji.
> JIS X 0208 This is the current name for JIS C 6226.

  JIS X0201-1976  Code for Information Interchange.
  JIS X0208-1990  Code of the Japanese Graphic Character Set for Information Interchange.
  JIS X0212-1990  Code of the Supplementary Japanese Graphic Character Set for Information Interchange.

"JIS X0201" is a 7-bit or 8-bit encoding of roman and "Kana" characters. "JIS X0208" and "JIS X0212" are 16-bit encodings of "Kanji" characters. ------ Hitoaki Sakamoto ( hitoaki@nttlab.ntt.jp ) Nippon Telephone and Telegraph Corporation. Tokyo Technical and Development Center.
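[The byte-cost difference between the one-byte JIS X 0201 repertoire and the two-byte JIS X 0208 repertoire can be checked through Python's `shift_jis` codec, which layers the two standards into one encoding; this is a sketch using that codec, not a claim about the 1990 documents themselves.]

```python
# JIS X 0201 vs JIS X 0208 byte costs, seen through the shift_jis codec.
halfwidth_kana = "ｱ"    # U+FF71, from the JIS X 0201 halfwidth katakana range
kanji = "漢"            # a character from the JIS X 0208 set

assert len(halfwidth_kana.encode("shift_jis")) == 1   # single byte
assert len(kanji.encode("shift_jis")) == 2            # two bytes
```

Shift_JIS distinguishes the two repertoires by byte range alone, with no ISO 2022 escape sequences, which is one reason it became the common choice on personal computers.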
hitoaki@kyo-sr.ntt.junet (SAKAMOTO Hitoaki) (11/07/90)
I am interested in X.400 with Japanese. My English is very poor, sorry. ( I would rather write in Japanese. )

> Asia, where there are 3 main KANJI alphabets which are very difficult to
> meld into some kind of single "alphabet". Japanese KANJI characters are
> strictly limited, and Katakana characters are used as modifiers to
> extend the limited set.

Why do you think that about Kanji characters?

> I sort of shudder at being required to enter Katakana or KANJI or
> umlauts and accents into ORAddresses, now that X.400 allows T.61
> characters on ORAddresses.

I think so.....

> I wonder how it can be done with my present
> systems and keyboards?

For example, Sun has JLE ( Japanese Language Environment? ). But we use the normal export version of SunOS and keyboard, and I do not shudder. (^_^) Because we are using the X-Window System.

> I have seen how the Japanese have modified EMACS to input and display
> Katakana and KANJI. Rather ingenious it is, and a real testimonial to
> the power of EMACS.

I use Japanese Emacs (Nemacs) with EGG now. It can input and display Katakana and Kanji characters. -------------------- Hitoaki Sakamoto ( hitoaki@nttlab.ntt.jp ) Nippon Telephone and Telegraph Corporation. Tokyo Technical and Development Center.
ronald@robobar.co.uk (Ronald S H Khoo) (11/08/90)
[ n.b. followups redirected for this question ] philip@beeblebrox.dle.dg.com (Philip Gladstone) writes: > Standard character sets are a true minefield. The scoop is (I think as > follows): [ useful informative table deleted ] Can anyone tell me if the line drawing characters are standardised in any ISO standard? I mean the kinds of characters you use for drawing boxes on terminals, e.g. like the ones that really annoy you when your VT100 has received some line noise :-) Email would be appreciated. If the info is forthcoming, I will summarise (to std.internat). Thank you. -- ronald@robobar.co.uk +44 81 991 1142 (O) +44 71 229 7741 (H)
prc@erbe.se (Robert Claeson) (11/08/90)
In a recent article K.P.Donnelly@edinburgh.ac.uk writes: >ISO 8859-1 is a development of the DEC multinational character set.

Actually, the way it was explained to me was that when DEC needed an 8-bit character set, they took an early draft of ISO 8859/1. Later drafts of ISO 8859/1 changed, and thus there are now about 10 characters in the right half that differ between DEC Multinational and ISO 8859/1. The DEC VT200 series has DEC Multinational as its only 8-bit character set. The VT300 and VT400 series add the true ISO 8859/1 character set. Does anyone know the *true* story behind this? -- Robert Claeson |Reasonable mailers: rclaeson@erbe.se ERBE DATA AB | Dumb mailers: rclaeson%erbe.se@sunet.se | Perverse mailers: rclaeson%erbe.se@encore.com These opinions reflect my personal views and not those of my employer.
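[A few of the differing right-half positions can be spot-checked in code. The DEC MCS assignments below are taken from commonly published DEC terminal tables and should be treated as illustrative, not an exhaustive or authoritative diff of the two sets.]

```python
# Sample right-half positions where DEC Multinational (MCS) and
# ISO 8859/1 disagree.  MCS values are from common DEC terminal
# documentation (assumed, not verified against the standard).
MCS_DIFFERS = {
    0xD7: ("Œ", "×"),   # MCS: OE ligature / Latin-1: multiplication sign
    0xDD: ("Ÿ", "Ý"),   # MCS: Y diaeresis / Latin-1: Y acute
    0xF7: ("œ", "÷"),   # MCS: oe ligature / Latin-1: division sign
}

for code, (mcs_char, latin1_char) in MCS_DIFFERS.items():
    # Latin-1 maps bytes to U+0000..U+00FF one-to-one, so this check holds.
    assert bytes([code]).decode("latin-1") == latin1_char
    assert mcs_char != latin1_char
```

A table like this is essentially what a VT200-to-Latin-1 conversion filter has to carry around, which is why text moved between the two "almost identical" sets can silently corrupt a handful of characters.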
lance@motcsd.csd.mot.com (lance.norskog) (11/08/90)
Stef@ICS.UCI.EDU (Einar Stefferud) writes: >Are there any other ideogram alphabets?

It gets worse. One of the Indian subcontinent language families, I believe Hindi and its relatives, uses modified characters. Under this system, the word "snake" becomes "snakes" by adding a squiggle to the bottom leg of the "s". Different squiggles mean "red snake" or "angry snake". So, to render a word, you have to treat it as a grammatical parse tree, with a word and possible modifiers: render the first letter of the base word, then the rest of the word, and then apply the modifiers to the first letter. This was explained to me a long time ago, and I'm sure it's bollixed, but you get the drift.