[net.lang.c] NAPLPS And National Char Sets / Re: C and ANSI Standard

phipps@fortune.UUCP (Clay Phipps) (08/31/84)
[warning: somewhat long: 59 lines]

Simply choosing alternate multicharacter representations
was workable for Pascal, but might not be tractable for C or the C Shell.
Alternate character representations are important for Pascal 
not so much because of European character set variations, but  
because of the absence of some of its characters on common IBM equipment. 
IBM conventions are to reserve "@" "#" "$" 
for national-use characters; they are treated as alphabetic
even in products for the USA market (including Pascal, I think).  
Standard Pascal doesn't use those characters at all,
most likely because they weren't in the character set
of the CDC machine on which Pascal was first implemented.

If the notion of characters as 7-bit quantities weren't so ingrained
into various nooks and crannies of UN*X, 
the solution for our European colleagues would be simple:
use the "Supplementary Graphics Characters"
of the North American Presentation Level Protocol Syntax (NAPLPS).
Combined with ISO 646 or ASCII, they form an 8-bit character set.
This seems much better than throwing away useful punctuation
because some particular Latin alphabets
run short of letter codes in a 7-bit character set
(I call "[" "\" "]" "^" "_" "{" "|" "}" "~" and other 'special characters'
'punctuation', for lack of a better term). 
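
To make the 7-bit problem concrete, here is a minimal C sketch that
counts the characters in a line of source text that sit at ISO 646
national-use positions, i.e. the ones a national variant may show as
letters.  The function names are invented for illustration, and the
code assumes the host character set is ASCII.

```c
#include <assert.h>
#include <string.h>

/* ASCII characters at the ISO 646 national-use positions.
   ASSUMPTION: the host character set is ASCII. */
static const char national_use[] = "#$@[\\]^`{|}~";

/* Nonzero if c may be replaced by a letter in a national variant. */
int is_national_use(int c)
{
    /* strchr would match the terminating NUL, so exclude it. */
    return c != '\0' && strchr(national_use, c) != NULL;
}

/* Count the characters in a line that are at risk on a terminal
   using a national variant of ISO 646. */
size_t count_at_risk(const char *line)
{
    size_t n = 0;

    assert(line != NULL);
    for (; *line != '\0'; line++)
        if (is_national_use((unsigned char)*line))
            n++;
    return n;
}
```

A line such as "a[i] = b | c;" has three such characters, which is
why C suffers more from this than Pascal does.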

NAPLPS contains 30 supplementary letter characters 
and approximately 16 diacritical marks,
thus providing for all of the letters used by European 
Latin alphabets (note: English uses a *Latin* alphabet).
This approach means that a character is always a letter or always
punctuation, not the "punctuation except in Continental Europe" 
situation that we have now with 7-bit ASCII or ISO 646.
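
A rough C sketch of what "always a letter or always punctuation"
could look like in a scanner: one 256-entry table, filled in once.
The code ranges for the supplementary letters and diacritical marks
are placeholders chosen only to have the right sizes (30 and 16);
they are NOT taken from the NAPLPS specification.  All names here
are hypothetical.

```c
#include <assert.h>
#include <ctype.h>

enum char_class { CC_OTHER, CC_LETTER, CC_DIGIT, CC_PUNCT, CC_MARK };

/* HYPOTHETICAL code ranges, for illustration only. */
#define SUPP_MARK_FIRST   0xC0  /* 16 diacritical marks */
#define SUPP_MARK_LAST    0xCF
#define SUPP_LETTER_FIRST 0xE0  /* 30 supplementary letters */
#define SUPP_LETTER_LAST  0xFD

static enum char_class class_table[256];

/* Build the table once; every code gets exactly one class. */
void init_class_table(void)
{
    int c;

    assert(SUPP_LETTER_LAST - SUPP_LETTER_FIRST + 1 == 30);

    for (c = 0; c < 256; c++)
        class_table[c] = CC_OTHER;

    /* Printable 7-bit codes; assumes ASCII and the "C" locale. */
    for (c = 0x21; c <= 0x7E; c++) {
        if (isalpha(c))
            class_table[c] = CC_LETTER;
        else if (isdigit(c))
            class_table[c] = CC_DIGIT;
        else
            class_table[c] = CC_PUNCT;
    }

    for (c = SUPP_MARK_FIRST; c <= SUPP_MARK_LAST; c++)
        class_table[c] = CC_MARK;
    for (c = SUPP_LETTER_FIRST; c <= SUPP_LETTER_LAST; c++)
        class_table[c] = CC_LETTER;
}

/* Classify one 8-bit character code. */
enum char_class classify(unsigned char c)
{
    return class_table[c];
}
```

With such a table "{" is punctuation in every country, and no
per-country switch is needed in the scanner.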

By the way, Europeans are not alone in needing more letter characters
than (the English variant of) the Latin alphabet used for ASCII provides.
Many African languages (though not Arabic) are written in the Latin
alphabet, as are the languages of Spanish- and Portuguese-speaking
people in Central and South America.

NAPLPS does add one significant complication to compilers
(and text processing programs in general):
it invalidates the underlying assumption of much existing code
that the representation for one letter is one byte.
This raises the question of whether the "length" of an identifier
is the number of language letters it contains, 
or the number of bytes used to represent it.
I suppose that if requiring an extra byte for accented letters
is objectionable, many countries could replace supplementary letters
unused by their language with accented letters.
The 30 new letter codes available would be sufficient for French,
and just right for Czech; all the other Latin alphabets are easier
(i.e., have fewer non-English letters).  The big advantage of this is
that letter codes always represent some letter, never punctuation,
and vice versa.  This keeps compiler and interpreter scanners simple.
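
The identifier-length question raised above can be sketched in C as
two different counts over the same string.  This assumes an accented
letter is stored as one diacritical-mark byte plus one base-letter
byte, and that marks occupy the codes 0xC0..0xCF; that range is a
placeholder, not taken from the NAPLPS specification.  The function
names are likewise invented.

```c
#include <assert.h>
#include <string.h>

/* ASSUMPTION: codes 0xC0..0xCF are diacritical marks that combine
   with a neighboring base letter.  Placeholder range only. */
static int is_diacritic(unsigned char c)
{
    return c >= 0xC0 && c <= 0xCF;
}

/* Length as the number of bytes used to represent the identifier. */
size_t ident_bytes(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Length as the number of language letters: mark bytes are skipped,
   so a mark plus its base letter counts once. */
size_t ident_letters(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (!is_diacritic((unsigned char)*s))
            n++;
    return n;
}
```

For a four-letter identifier with one accented letter, ident_bytes
gives 5 and ident_letters gives 4; a compiler with a limit on
significant characters has to pick one of the two.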

An ANSI committee (X3L1, I think) is looking into issues
raised by multibyte letters in character sets.

[The above is a personal opinion, and may not reflect that of
Fortune Systems Corp., despite our terminals' support of NAPLPS]

References:

_The Chicago Manual Of Style_, 13th Ed.,
chap. 9: "Foreign Languages In Type", pp. 249-265.

My NAPLPS wall chart created from an early AT&T NAPLPS specification
(see also the _BYTE_ article on NAPLPS from about a year ago).

-- Clay Phipps

-- 
            { amd  hplabs!hpda  sri-unix  ucbvax!amd }          
                                                      !fortune!phipps
   { ihnp4  cbosgd  decvax!decwrl!amd  harpo  allegra}