phipps@fortune.UUCP (Clay Phipps) (08/31/84)
[warning: somewhat long: 59 lines]

Simply choosing alternate multicharacter representations was workable for Pascal, but might not be tractable for C or the C Shell. Alternate character representations are important for Pascal not so much because of European character set variations, but because of the absence of some of its characters on common IBM equipment. IBM conventions are to reserve "@" "#" "$" for national-use characters; they are treated as alphabetic even in products for the USA market (including Pascal, I think). Standard Pascal doesn't use those characters at all, most likely because they weren't in the character set of the CDC machine on which Pascal was first implemented.

If the notion of characters as 7-bit quantities weren't so ingrained into various nooks and crannies of UN*X, the solution for our European colleagues would be simple: use the "Supplementary Graphics Characters" of the North American Presentation Level Protocol Syntax (NAPLPS); in combination with ISO 646 or ASCII, it makes an 8-bit character set. This seems much better than throwing away useful punctuation because some particular Latin alphabets run short of letter codes in a 7-bit character set (I call "[" "\" "]" "^" "_" "{" "|" "}" "~" and other 'special characters' 'punctuation', for lack of a better term).

NAPLPS contains 30 supplementary letter characters and approximately 16 diacritical marks, thus providing for all of the letters used by European Latin alphabets (note: English uses a *Latin* alphabet). This approach means that a character is always a letter or always punctuation, not the "punctuation except in Continental Europe" situation that we have now with 7-bit ASCII or ISO 646.

By the way, Europeans are not alone in needing more letter characters than (the English variant of) the Latin alphabet used for ASCII provides. Most African languages (Arabic aside) use the Latin alphabet, as do Spanish- and Portuguese-speaking people in Central and South America.
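The "always a letter or always punctuation" point can be sketched in a few lines of Python. The German ISO 646 variant (DIN 66003) is used as the 7-bit example because its letter assignments are well known. The 8-bit half is only an assumption: the set SUPPLEMENTARY_LETTERS and its code positions are hypothetical placeholders standing in for the 30 supplementary letters, not the actual NAPLPS assignments.

```python
# German national variant of ISO 646 (DIN 66003): these code positions
# hold punctuation in US ASCII but letters in the German variant.
GERMAN_646_LETTERS = {0x5B: "Ä", 0x5C: "Ö", 0x5D: "Ü",
                      0x7B: "ä", 0x7C: "ö", 0x7D: "ü", 0x7E: "ß"}

# HYPOTHETICAL: 30 placeholder positions for supplementary letters in an
# 8-bit set. The real NAPLPS code positions are not given here.
SUPPLEMENTARY_LETTERS = frozenset(range(0xE0, 0xFE))


def is_ascii_letter(code):
    """True for A-Z and a-z only."""
    return 0x41 <= code <= 0x5A or 0x61 <= code <= 0x7A


def is_letter_7bit(code, national_letters=None):
    """7-bit scheme: the answer depends on which national variant is in force."""
    if national_letters and code in national_letters:
        return True
    return is_ascii_letter(code)


def is_letter_8bit(code):
    """8-bit scheme (assumed layout): one answer per code, everywhere."""
    return is_ascii_letter(code) or code in SUPPLEMENTARY_LETTERS


# 0x7B is "{" in the USA but "ä" in Germany, so a 7-bit scanner needs to
# know the locale; the 8-bit scanner does not.
print(is_letter_7bit(0x7B), is_letter_7bit(0x7B, GERMAN_646_LETTERS))
print(is_letter_8bit(0x7B))
```

Under the 8-bit assumption the scanner's letter test takes no locale argument at all, which is the simplification being argued for.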
NAPLPS does add one significant complication to compilers (and text processing programs in general): it invalidates the underlying assumption of much existing code that the representation for one letter is one byte. This raises the question of whether the "length" of an identifier is the number of language letters it contains, or the number of bytes used to represent it.

I suppose that if requiring an extra byte for accented letters is objectionable, many countries could replace supplementary letters unused by their language with accented letters. The 30 new letter codes available would be sufficient for French, and just right for Czech; all the other Latin alphabets are easier (i.e., have fewer non-English letters).

The big advantage of this approach is that letter codes always represent some letter, never punctuation, and vice versa. This keeps compiler and interpreter scanners simple.

An ANSI committee (X3L1, I think) is looking into issues raised by multibyte letters in character sets.

[The above is a personal opinion, and may not reflect that of Fortune Systems Corp., despite our terminals' support of NAPLPS]

References:

_The Chicago Manual Of Style_, 13th Ed., chap. 9: "Foreign Languages In Type", pp. 249-265.

My NAPLPS wall chart created from an early AT&T NAPLPS specification (see also the _BYTE_ article on NAPLPS from about a year ago).

-- Clay Phipps

--
   { amd hplabs!hpda sri-unix ucbvax!amd }
                                !fortune!phipps
   { ihnp4 cbosgd decvax!decwrl!amd harpo allegra}
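The identifier-length question raised above (letters versus bytes) can be illustrated with a minimal Python sketch. It assumes, hypothetically, that the roughly 16 diacritical marks occupy bytes 0xC0 through 0xCF and that each mark precedes the base letter it modifies; the set DIACRITIC_MARKS and the choice of 0xC2 for the acute accent are placeholders, and the real NAPLPS assignments may differ.

```python
# HYPOTHETICAL layout: 16 non-spacing diacritical marks at 0xC0..0xCF,
# each one combining with the base letter in the byte that follows it.
DIACRITIC_MARKS = frozenset(range(0xC0, 0xD0))


def byte_length(identifier: bytes) -> int:
    """Length as storage sees it: one count per byte."""
    return len(identifier)


def letter_length(identifier: bytes) -> int:
    """Length as the reader sees it: a mark plus its base letter counts once."""
    return sum(1 for b in identifier if b not in DIACRITIC_MARKS)


# "cafe" with an accented final letter, encoded as mark + base letter.
# 0xC2 as the acute accent is an assumed position, used only for illustration.
cafe = b"caf" + bytes([0xC2]) + b"e"

print(byte_length(cafe))    # 5
print(letter_length(cafe))  # 4
```

A compiler that limits or compares identifiers by length has to pick one of these two counts and document which; the two agree only for identifiers with no accented letters.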