cowan@marob.MASA.COM (John Cowan) (04/23/89)
In article <1468@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>Well, a number of companies are starting to pick up some level of
>support for the ISO 8859 character sets - "8-bit-clean" software, ISO
>8859/n fonts (for n == 1, at least, and maybe for other values of n),
>support for the pANSI C internationalization stuff and/or the X/Open
>internationalization stuff, etc.
>
>Unfortunately, one form of inertia is represented by ISO 646 character
>set terminals; I think there are enough of them around now so that
>they'll have to be dealt with in the short term (e.g., with code to
>translate various national 646 character sets to 8859 character sets on
>input and to do the reverse translation as best as can be done on
>output).
>

Xerox has devised a very slick 16-bit character standard for use with their various Interlisp and Office Automation workstations, and with Interpress printers. Unfortunately (typical Xerox) it hasn't migrated to the rest of the world yet.

The 16-bit space is divided into 255 "character sets" containing 255 "character codes" each. Not every set is in use, and not every code is in use in every set, but 65025 characters is pretty generous. Set #0 is ISO Latin #1, so 7-bit ASCII and ISO are upward compatible just by adding 8 bits of zeros at the high order end. Other ISO character sets are also used; however, redundancies are stripped -- thus "A" is only character code 65 in character set 0, and does not appear in any other character set.

There are character sets for Cyrillic, Greek, Hebrew, Arabic, Korean hangul, Japanese katakana, Japanese hiragana, Chinese bopomofo, etc. etc. There is also a large block of character sets reserved to represent Japanese kanji -- this part is bit-for-bit compatible with the 16-character kanji standard of JIS.
There are several character sets for oddball symbols, one for line drawing graphics, and the character sets 224-254 are reserved for "rendering" characters, like the fi and fl ligatures and the special initial, medial, and final forms of Arabic letters, as well as "Old Style" digits. Note that the character set code does >not< represent font-and-face information.

To prevent the tremendous wastage of space which would occur when representing running text in "full" 16-bit form (which is defined by the standard to be a big-endian form, with character set preceding character code) a special compression format is defined. Compressed strings represent only character codes, and the character set defaults to zero. The sequence of 255 followed by a byte means "change to the character set numbered by the byte". Therefore, regular ASCII strings are automatically in compressed Xerox format, since their characters are already in set 0!

A double 255 serves as an escape from the character set altogether, possibly to a whole different 16-bit character universe (!); the current universe is numbered 1, for future expansion.
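
The compression format described above can be sketched in a few lines of Python. This is only an illustration built from the description in the post, not from the Xerox standard itself: the names `expand` and `to_full_form` are made up here, the set number used in the example is arbitrary, and the double-255 escape is deliberately left unhandled because the post does not say what follows it.

```python
SET_SWITCH = 0xFF  # per the post: 255 followed by a byte selects a character set


def expand(compressed: bytes) -> list:
    """Expand a compressed string into (character set, character code) pairs."""
    pairs = []
    current_set = 0  # the post says the character set defaults to zero
    i = 0
    while i < len(compressed):
        byte = compressed[i]
        if byte != SET_SWITCH:
            # Ordinary byte: a character code in the current set.
            pairs.append((current_set, byte))
            i += 1
            continue
        if i + 1 >= len(compressed):
            raise ValueError("255 at end of string with no set number after it")
        following = compressed[i + 1]
        if following == SET_SWITCH:
            # Double 255 escapes out of the character set altogether; what
            # comes next is not described in the post, so stop here.
            raise NotImplementedError("double-255 escape not handled in this sketch")
        current_set = following
        i += 2
    return pairs


def to_full_form(pairs) -> bytes:
    """Write pairs in the "full" big-endian form: set byte, then code byte."""
    out = bytearray()
    for char_set, code in pairs:
        out.append(char_set)
        out.append(code)
    return bytes(out)


if __name__ == "__main__":
    # Plain ASCII needs no set switch at all.
    print(expand(b"AB"))
    # "A", switch to (arbitrary) set 38, two codes, switch back to set 0, "B".
    print(expand(bytes([65, 255, 38, 1, 2, 255, 0, 66])))
    print(to_full_form(expand(b"A")))
```

Note that a pure ASCII string expands to pairs whose set is always 0, which is the "adding 8 bits of zeros at the high order end" mentioned earlier.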
henry@utzoo.uucp (Henry Spencer) (04/23/89)
In article <622@marob.MASA.COM> cowan@marob.masa.com (John Cowan) writes:
>...The sequence of 255 followed
>by a byte means "change to the character set numbered by the byte".
>Therefore, regular ASCII strings are automatically in compressed Xerox format,
>since their characters are already in set 0! ...

However, ISO Latin 1 strings aren't necessarily in compressed Xerox format, because 255 is a printable character in ISO Latin, as I recall.
-- 
Mars in 1980s: USSR, 2 tries, | Henry Spencer at U of Toronto Zoology
2 failures; USA, 0 tries. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
guy@auspex.auspex.com (Guy Harris) (04/23/89)
>However, ISO Latin 1 strings aren't necessarily in compressed Xerox format,
>because 255 is a printable character in ISO Latin, as I recall.

Yes, it is. In ISO Latin #1, it's "y with a diaeresis" (what language uses that?).
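
The collision is easy to check mechanically. The following Python sketch assumes only what the thread states: byte 255 is reserved as the set-switch marker in the compressed format, and code 255 in ISO Latin 1 is "y with a diaeresis". The function name `passes_through_unchanged` is invented for this illustration.

```python
SET_SWITCH = 0xFF  # byte reserved by the compressed format, per the thread


def passes_through_unchanged(text: str) -> bool:
    """True if the ISO Latin 1 bytes of `text` contain no 255 byte.

    Only such strings can be treated as compressed strings as-is; a 255
    byte would be read as the start of a character-set switch.
    """
    raw = text.encode("latin-1")  # raises UnicodeEncodeError outside Latin 1
    return SET_SWITCH not in raw


if __name__ == "__main__":
    print(passes_through_unchanged("plain ASCII"))  # no byte above 127
    print(passes_through_unchanged("caf\u00e9"))    # e-acute is byte 233
    print(passes_through_unchanged("\u00ff"))       # y with diaeresis is byte 255
```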
huitema@mirsa.inria.fr (Christian Huitema) (04/24/89)
From article <1491@auspex.auspex.com>, by guy@auspex.auspex.com (Guy Harris):
> Yes, it is. In ISO Latin #1, it's "y with a diaeresis" (what language
> uses that?).

French. English also, to a degree: the diaeresis is the only ``accentuation'' that you find in some English dictionaries. Actually, there was a debate between the French and Dutch requirements: the Dutch language needs the same graphic for the very frequent compound ``ij'', but Dutch people would certainly not call it "y with a diaeresis".

Christian Huitema.